CIS 798 (Topics in Computer Science)

Topics in Intelligent Systems and Machine Learning

Fall, 1999

Homework Assignment 2

Part 2 (30 points): Decision Tree Learning – The Badge Game

You should finish this part by Friday, October 8, 1999.

We will use a data set that is based on the Badge Game (http://l2r.cs.uiuc.edu/~danr/Teaching/CS346-99/game.html).

The data is given as a set of examples, each of the form (label, string1, string2), where the two strings form the example, and the label can be either + or -. The data already comes in two sets, Train and Test, consisting of 80% (235 examples) and 20% (59 examples) of the data, respectively. There are 158 positive examples in total (126 in Train) and 136 negative examples in total (109 in Train).

In this part of your assignment, you must pre-process the data and extract features (attributes) from it. Use 20 attributes, representing the characters in various positions in the two strings. For example, the attribute X(i,j) will stand for the ith character in the jth string (i = 1,2,...10; j = 1,2). The data has been cleaned so that the strings contain only letters, and there are only two strings in each example. Do not distinguish between lower-case and upper case letters. In this way, every variable may have one of 26 different values.

Note: You are required to begin with this set of attributes as a baseline. However, if you wish to use different (constructed) attributes, please do. If you choose to do so, describe the set of attributes you use, and compare the results you get with the basic set with those you get with the new set(s) of attribute.

When running your code, try (at least) the following two options:

  1. Limit the depth of the tree. For example, do not allow the tree to be deeper than 3, 5, or 7 nodes.
  2. Construct a learning curve. Construct a series of subsets of part-1-train, containing 40, 80, 160, and 235 examples. For each of these subsamples, construct a decision tree, and test the accuracy of the tree on part-1-test. You may plot this curve using Microsoft Excel, GNUplot, or any other plotting software. No hand-drawn plots, please.

Report the results from these experiments along with your result on the full training and test data.