CIS 732, Fall 2001: Machine Learning and Pattern Recognition

Homework Assignment 5 (Machine Problem)

Tuesday, 27 November 2001

Due: Thursday, 06 December 2001

(before midnight Friday 07 December 2001)

Refer to the course intro handout for guidelines on working with other students.

Note: Remember to submit your solutions in electronic form using hwsubmit. Produce them only from your personal source code, scripts, and documents produced with the machine learning applications used in this MP (not from common work or from sources other than the textbook or properly cited references).

Problems

First, log into your course accounts on the KDD Core (Ringil, Fingolfin, Yavanna, Nienna, Frodo, Samwise, Merry, Pippin) and make sure your home directory is in order. Notify admin@www.kddresearch.org (and cc: cis732ta@www.kddresearch.org) if you have any problems at this stage. You should have MLC++-2.01.tar.gz installed from MP2. On KDD Core systems, however, MLC++ is already installed in /usr, so you only need to set your path environment variable in your .tcshrc or .cshrc and MLCDIR in your .login, and then run Inducer.

1. (60 points total) Learning time series data with NeuroSolutions.

Your solution to this problem must be in MS Excel, PostScript, or PDF format, and you must use a spreadsheet (I recommend Gnumeric or Excel 2000/XP) to record your results.

For all parts, turn in training (80%) and cross validation (20%) error values.

a) (30 points) Train a Jordan-Elman network for the same task and report the results. Use the default settings and the input recurrent network (the upper left entry among the 4 choices). Take a screenshot of your artificial neural network after training (in Windows, press Print-Screen and paste the Clipboard into your word processor).

b) (15 points) Train a time-delay neural network for the same task and report the results.

c) (15 points) Train a Gamma memory for the same task and report the results.

2. (40 points) Comparing Inducer Performance. Run Discrete-Naïve-Bayes on the Pima and Monk3-Full data sets and compare its performance (training and test set accuracy) to that of C4.5. Write a program (shell script, Perl script, or C or Java program) to do 5-fold cross-validation on these two data sets. Turn in this program along with a file containing a table of training and test set accuracy values for each inducer and data set.
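As a rough illustration of the fold bookkeeping only, the sketch below partitions example indices into 5 folds and pairs each held-out fold with the remaining 4 as training data. It is written in Python for brevity, so it is not one of the accepted submission languages, and it does not read the data files or call Inducer. The function names, the random seed, and the example count of 10 are assumptions made for this sketch.

```python
import random


def k_fold_indices(n_examples, k=5, seed=0):
    """Shuffle indices 0..n_examples-1 and deal them into k near-equal folds."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, starting at position i.
    return [indices[i::k] for i in range(k)]


def train_test_splits(n_examples, k=5, seed=0):
    """Yield (train, test) index lists; each fold is the test set exactly once."""
    folds = k_fold_indices(n_examples, k, seed)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield sorted(train), sorted(test)


if __name__ == "__main__":
    # Hypothetical example count; replace with the real number of examples.
    for fold_no, (train, test) in enumerate(train_test_splits(10), start=1):
        print(f"fold {fold_no}: train={train} test={test}")
```

Each index lands in exactly one test fold, so every example is tested once and used for training in the other 4 runs. Writing the selected examples out in the file format the inducers expect, and invoking each inducer on every split, is left to your program.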

Extra credit (20 points) Evaluating significance. Run a paired t-test between Discrete-Naïve-Bayes and C4.5 on the Mushroom data set, divided into 5 segments (train on 4 segments and test on the fifth). Report the test set accuracy and the significance level.
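For reference, the paired t statistic is computed from the per-segment accuracy differences d_i = a_i - b_i as t = mean(d) / (s_d / sqrt(n)), with n - 1 degrees of freedom, where s_d is the sample standard deviation of the differences. The Python sketch below shows that arithmetic using only the standard library. The accuracy values in it are made up purely to exercise the function and are not results for any inducer or data set, and the function and constant names are assumptions of this sketch.

```python
import math
import statistics

# Two-sided critical value of Student's t at the 0.05 level for 4 degrees
# of freedom (i.e. 5 paired measurements), taken from a standard t table.
T_CRIT_DF4_TWO_SIDED_05 = 2.776


def paired_t_statistic(acc_a, acc_b):
    """Return (t, degrees_of_freedom) for paired accuracy measurements."""
    if len(acc_a) != len(acc_b) or len(acc_a) < 2:
        raise ValueError("need two equal-length lists with at least 2 pairs")
    diffs = [a - b for a, b in zip(acc_a, acc_b)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_d == 0:
        raise ValueError("all differences are equal; t is undefined")
    return mean_d / (sd_d / math.sqrt(n)), n - 1


if __name__ == "__main__":
    # HYPOTHETICAL per-segment test accuracies, for illustration only.
    inducer_a = [0.90, 0.92, 0.91, 0.93, 0.89]
    inducer_b = [0.88, 0.91, 0.90, 0.90, 0.88]
    t, df = paired_t_statistic(inducer_a, inducer_b)
    significant = abs(t) > T_CRIT_DF4_TWO_SIDED_05
    print(f"t = {t:.3f}, df = {df}, significant at 0.05: {significant}")
```

With 5 segments there are 4 degrees of freedom, so a difference is significant at the two-sided 0.05 level when |t| exceeds 2.776. For other significance levels, look up the corresponding critical value in a t table.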