CIS 798 (Topics in Computer Science)

Topics in Intelligent Systems and Machine Learning

Fall, 1999

Homework Assignment 2

Part 1 (30 points): Decision Tree Learning – Implementing ID3

You should finish this part by Friday, October 1, 1999.

In Parts 1 and 2, you will write programs to do the following:

  1. Pre-process the data.
  2. Grow a decision tree using the information gain splitting heuristic.
  3. Display a decision tree.
  4. Evaluate a decision tree on a data set.

Each node in the decision tree corresponds to a test on an attribute. You may choose the nodes to be binary tests of the form attribute = value (for example, we might check whether attribute 3, say attribute X(3,1) from above, has the value 'd'), or you may allow a node to branch on more than two values of an attribute. Beyond that, there are a few other design decisions you will need to make in your implementation. Make sure you explain in your short report what your algorithm does.
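If you go with binary tests, the information-gain computation for a candidate test attribute = value is short. The sketch below is one minimal way to write it in Python; the function names, the (examples, labels) list layout, and the best_test helper are illustrative assumptions, not a required design.

    import math
    from collections import Counter

    def entropy(labels):
        # Entropy of a list of class labels, e.g. ['+', '+', '-'].
        total = len(labels)
        return -sum((n / total) * math.log2(n / total)
                    for n in Counter(labels).values())

    def information_gain(examples, labels, attribute, value):
        # Gain of the binary test examples[i][attribute] == value.
        yes = [lab for ex, lab in zip(examples, labels) if ex[attribute] == value]
        no  = [lab for ex, lab in zip(examples, labels) if ex[attribute] != value]
        remainder = (len(yes) * entropy(yes) + len(no) * entropy(no)) / len(labels)
        return entropy(labels) - remainder

    def best_test(examples, labels):
        # Try every (attribute, value) pair that occurs in the data and
        # return the one with the highest information gain.
        candidates = {(a, ex[a]) for ex in examples for a in range(len(ex))}
        return max(candidates,
                   key=lambda av: information_gain(examples, labels, av[0], av[1]))

A multi-way split would instead partition the examples on every value of one attribute and weight each subset's entropy in the same way.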

Before you run your decision tree learner on the data, test it on a small set of examples for which you can construct the tree yourself (e.g., the data from Exercise 3.2, Mitchell) and make sure it produces the tree you expect. Once your code is working, run it on Train and test it on Test.
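For concreteness, here is one way the recursive growing step and such a sanity check might look. Everything in this sketch is an assumption made for illustration: the Node class, the id3 entry point, and the toy dataset (which is deliberately trivial and is not the Exercise 3.2 data). The entropy and gain helpers are repeated so the sketch runs on its own.

    import math
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def gain(examples, labels, attr, val):
        yes = [l for e, l in zip(examples, labels) if e[attr] == val]
        no  = [l for e, l in zip(examples, labels) if e[attr] != val]
        return entropy(labels) - (len(yes) * entropy(yes)
                                  + len(no) * entropy(no)) / len(labels)

    class Node:
        # label is set at leaves; attr/val/yes/no are set at internal nodes.
        def __init__(self, label=None, attr=None, val=None, yes=None, no=None):
            self.label, self.attr, self.val, self.yes, self.no = label, attr, val, yes, no

    def id3(examples, labels):
        # Stop when the node is pure or when no test is informative.
        if len(set(labels)) == 1:
            return Node(label=labels[0])
        candidates = {(a, e[a]) for e in examples for a in range(len(e))}
        attr, val = max(candidates, key=lambda av: gain(examples, labels, *av))
        if gain(examples, labels, attr, val) <= 0:
            return Node(label=Counter(labels).most_common(1)[0][0])
        yes_idx = [i for i, e in enumerate(examples) if e[attr] == val]
        no_idx  = [i for i, e in enumerate(examples) if e[attr] != val]
        return Node(attr=attr, val=val,
                    yes=id3([examples[i] for i in yes_idx], [labels[i] for i in yes_idx]),
                    no=id3([examples[i] for i in no_idx], [labels[i] for i in no_idx]))

    if __name__ == "__main__":
        # Toy data: the class is '+' exactly when attribute 0 has the value 'x',
        # so the hand-built tree is a single test on attribute 0.
        examples = [('x', 'a'), ('x', 'b'), ('y', 'a'), ('z', 'b')]
        labels   = ['+', '+', '-', '-']
        tree = id3(examples, labels)
        print(tree.attr, tree.val, tree.yes.label, tree.no.label)   # expect: 0 x + -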

You may print decision trees in ASCII text format. For example:

attribute 0 == x
    attribute 1 == y
        attribute 2 == z
            class = +
        attribute 2 == k
            class = -
    attribute 1 != y
        class = +
attribute 0 != x
    attribute 1 == r
        class = +
    attribute 1 != r
        class = -
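A short recursive routine is the natural way to produce this kind of indented listing. The sketch below assumes the same illustrative Node layout as in the earlier sketch (label at leaves; attr, val, yes, no at internal nodes); the optional names argument lets you substitute explicit attribute names for the indices.

    def print_tree(node, names=None, indent=0):
        # Print a binary-test tree in the indented ASCII format shown above.
        pad = "    " * indent
        if node.label is not None:                    # leaf
            print(pad + "class = " + node.label)
            return
        name = names[node.attr] if names else "attribute %d" % node.attr
        print(pad + name + " == " + node.val)
        print_tree(node.yes, names, indent + 1)
        print(pad + name + " != " + node.val)
        print_tree(node.no, names, indent + 1)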

Whichever output format you use, print the explicit attribute names rather than indices so that the output is comprehensible. Your routine for testing the accuracy of a decision tree should print its results in the following form.

Test   Cases   True   False
  +     100      70       5
  -      50      45      30

This says that each row corresponds to one actual class: the Cases column gives the number of test examples with that class, the True column gives how many of them the tree classified correctly, and the False column gives how many it classified incorrectly.

Finally, report the error rate. The error rate is the sum of the errors (30 + 5) divided by the total number of test examples (150); that is, about 23%.
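A classification and scoring routine that produces this table might look like the sketch below; again, the Node layout (from the earlier sketches), the function names, and the output formatting are illustrative assumptions rather than a required design.

    def classify(node, example):
        # Follow the binary tests from the root down to a leaf.
        while node.label is None:
            node = node.yes if example[node.attr] == node.val else node.no
        return node.label

    def evaluate(tree, examples, labels, classes=("+", "-")):
        # Count correct and incorrect predictions per actual class,
        # then report the overall error rate.
        correct = {c: 0 for c in classes}
        wrong   = {c: 0 for c in classes}
        for ex, actual in zip(examples, labels):
            if classify(tree, ex) == actual:
                correct[actual] += 1
            else:
                wrong[actual] += 1
        print("Test  Cases  True  False")
        for c in classes:
            cases = correct[c] + wrong[c]
            print("%-5s %-6d %-5d %d" % (c, cases, correct[c], wrong[c]))
        errors = sum(wrong.values())
        print("error rate = %.1f%%" % (100.0 * errors / len(labels)))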