Project 2 Specification
22 May 2001

Goals

The goal of this project is to develop—through experience—your ability to:

The General Message Sorting Problem

We all have too much to read: We are overwhelmed by material, both relevant and irrelevant. One version of the problem is the difficulty of even sorting articles into appropriate categories. We want you to develop software that will assist with this problem.

In its simplest form, you can imagine a training set of news articles, each of which is identified by its category. You are to devise a program that uses the training set to learn how to identify the category of subsequent articles.

The Specific Problem

International Conglomerates, Inc., is a huge multinational corporation that has virtual monopoly on a wide range of industries.

Presently, International Conglomerates has a central news center that receives news wires from all over the world. A human editor reviews each incoming article and classifies it, so that it can be routed to the appropriate division. Naturally, this is a very labor-intensive and error-prone task. CEO Ima $avage wonders if it is it possible to build a computer system to do the classification job automatically.

You have been hired as a high-powered consultant to look into the problem. You soon determine that there are four divisions within International Conglomerates, each of which is interested in a particular category of news:
Division Category of interest Typical subjects
Conglomo-Organic Division organic livestock, sugar, wheat, grain, tea, vegetable oil
Conglomo-Petrochemical Division petrochemicals crude oil, oil, natural gas, plastics, etc.
Conglomo-Metals Division metals gold, silver, copper, tin, steel, etc.
Conglomo-Finance Division financial bond/interest rates, stock market, money supply, etc.

In order to help with your task, the Ms. $avage has provided you with a corpus of old news articles and the divisions to which they were routed.

The corpus contains articles in the form suggested by the following samples. Evidently, the raw text has been prepared for digestion by a Scheme program:

(0000 (financial) 
     (BANK OF JAPAN INTERVENES IN EARLY TOKYO AFTERNOON) 
     (TOKYO *comma* April 1) 
(The Bank of Japan intervened in the market in the early afternoon *comma* 
 buying dollars around 147 *period* 30 yen and continuing to buy them as 
 high as 147 *period* 50 yen *comma* dealers said *period* The Bank 
 intervened just after the dollar started rising on buying by securities 
 houses at around 147 *period* 05 yen *comma* and hoped to accelerate the 
 dollar *apostrophe* s advance *comma* they said *period* The dollar rose as 
 high as 147 *period* 50 yen *period* REUTER)) 
(0001 (organic) 
     (MALAYSIA CUTS GAZETTED RUBBER PRICE) 
     (KUALA LUMPUR *comma* April 1) 
(Malaysian government said it cut the gazetted price of rubber to 202 *dash* 7/8 cents 
 per kg from 213 *dash* 1/2 cents in March *comma* effective immediately 
 *period* No export duty is applicable at this level *comma* against 3/8 
 cent per kg last month *comma* because the government raised the export 
 duty threshold price to 210 cents per kg in early 1985 *period* The cess 
 for rubber research and replanting remained unchanged at 3 *period* 85 and 
 9 *period* 92 cents per kg respectively *period* REUTER)) 

Your job

After discussion with Ms. $avage, you agree to write a program that learns to classify such articles on the basis of their content. That is, you are to write a program that uses the given corpus of human-categorized articles to learn to recognize articles that are not categorized.

In particular, given a test file containing articles with empty category expressions, your program will write a file in which each articles is represented as a two-element list containing an identification number and one of four categories, ORGANIC, PETRO, METAL, or FINANCIAL. Thus, if the articles in the sample set were provided as a test set, with empty category expressions, your program is to write a file containing the following:

(0000 FINANCIAL) 
(0001 ORGANIC) 

More precisely, you agree to:

Your approach and implementation should handle categories, training data, and test sets in general, not just those of interest to International Conglomerates. If we were to change the categories, training data, and test sets, your approach and implementation should still work. Thus, you are not to manually tune your program to the particular categories provided.

What you start with

We provide you with a set of useful utility functions.

Key Files