CIS 732
Sunday, 21 September 2003
Due: Friday, 03 October 2003 (before midnight Saturday, 04 October 2003)
Refer to the course intro handout for guidelines on working with other students.
Note: Remember to submit your solutions in electronic form by uploading them to the Yahoo! Group ksu-cis732-fall2003, and to produce them only from your own source code, scripts, and documents generated with the machine learning applications used in this MP (not from shared work or from sources other than the textbook and properly cited references).
You have nearly 3 weeks to complete the 3 parts of this machine problem (MP), so please start early and finish about one part per week. The point value of each part is an approximate indicator of difficulty (your personal assessment can and should vary). Problem 3 is considerably harder because you are being asked to write your own code.
Problems

First, log into your course accounts on the KDD Core (Ringil, Fingolfin, Anaire, Finarfin, Azaghal, Narvi, Telchar, Gimli) and make sure your home directory is in order. Notify admin@www.kddresearch.org (and cc: cis732ta@www.kddresearch.org) if you have any problems at this stage.
1. (42 points total) Running decision tree and simple Bayes inducers in WEKA and MLC++.
In your web browser, open the URL
http://www.cs.waikato.ac.nz/~ml/weka/
Download the Waikato Environment for Knowledge Analysis (WEKA) v3.2.3 (GUI version © 2002 I. H. Witten, E. Frank, et al.) to your local system (this can be a Windows, Unix, Mac, or other system, but the binaries are precompiled for ix86 Linux). Follow the instructions in the WEKA3 manual for installing it into your home directory.
a) (12 points) Your solution to this problem must be in MS Excel, PostScript, or PDF format, and you must use a spreadsheet (I recommend Gnumeric or Excel 2000/XP) to record your solution. Follow the instructions in the WEKA3 User Guide (also in your first notes packet) to run the ID3 inducer on the Credit (CRX) and Vote data sets from the UCI Machine Learning Database Repository:
http://www.ics.uci.edu/~mlearn/MLRepository.html
Use the .test files for testing. Turn in the ASCII file containing the decision tree and another file (.xls, .ps, or .pdf) containing a table of test set accuracy values for each data set; a programmatic sketch of such a run appears below. (For the next machine problem, you will compare the ID3 results – accuracy, overfitting, example learning curves – with Simple Bayes and C4.5.)
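If you prefer to script the run rather than use the GUI, the following Java sketch shows one way it might look against the WEKA 3.2 API. The class paths (weka.classifiers.Id3, weka.classifiers.Evaluation) reflect the 3.2 release (later releases moved Id3 to weka.classifiers.trees.Id3), and the ARFF file names are placeholders, not files shipped with the assignment. Note also that ID3 requires nominal attributes with no missing values, so CRX may need preprocessing first.

// Sketch only: programmatic equivalent of the GUI run described above.
// File names are placeholders; substitute your own ARFF files.
import java.io.BufferedReader;
import java.io.FileReader;
import weka.classifiers.Evaluation;
import weka.classifiers.Id3;
import weka.core.Instances;

public class RunId3 {
    public static void main(String[] args) throws Exception {
        // Load training and test sets (ARFF versions of CRX or Vote).
        Instances train = new Instances(new BufferedReader(new FileReader("crx-train.arff")));
        Instances test = new Instances(new BufferedReader(new FileReader("crx-test.arff")));
        train.setClassIndex(train.numAttributes() - 1);  // class = last attribute
        test.setClassIndex(test.numAttributes() - 1);

        Id3 id3 = new Id3();
        id3.buildClassifier(train);  // induce the decision tree
        System.out.println(id3);     // ASCII rendering of the tree

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(id3, test);               // test set evaluation
        System.out.println(eval.toSummaryString());  // includes accuracy
    }
}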
b) (20 points) Develop a classification-based solution, using the ID3 and Simple (Naïve) Bayesian inducers in WEKA, to the prediction task in the COIL 2000 (Dutch Insurance Company) challenge problem from the UCI KDD Repository:
http://kdd.ics.uci.edu/databases/tic/tic.html
Read and think about the domain carefully first – do not simply throw potential solutions (inducers) at the problem. For instance, you must decide what kind of discretization you want (see the sketch after this paragraph). Your solution should include a paragraph on the bias towards comprehensibility (Michalski, 1993) and how your inducer(s) apply or do not apply this bias.
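Both ID3 and Simple Bayes are most naturally applied to discrete attributes, and equal-width binning is one of the simplest discretization choices. The method below is only a generic illustration of that idea (WEKA also ships a discretization filter you may prefer); the class and method names here are mine, not part of any required interface.

// Generic equal-width discretization sketch (illustration only).
public final class EqualWidth {
    /** Map each value to a bin index in [0, bins) over [min, max]. */
    public static int[] discretize(double[] values, int bins) {
        double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY;
        for (double v : values) {
            if (v < min) min = v;
            if (v > max) max = v;
        }
        double width = (max - min) / bins;
        int[] out = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            int b = (width == 0.0) ? 0 : (int) ((values[i] - min) / width);
            out[i] = Math.min(b, bins - 1);  // clamp value == max into top bin
        }
        return out;
    }
}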
c) (10 points) Repeat the process from part (a) with the Feature Subset Selection (FSS) inducer, which you can read about in the MLC++ user guide (http://www.sgi.com/tech/mlc). You may use the MLC++ database version of CRX and Vote:
http://www.kddresearch.org/Resources/ (db.tgz)
The wrapped inducer should be ID3. Report both test and training accuracy, and think carefully about how to generate training set accuracy (a brief note on this follows below). Note: MLC++ is installed on the KDD Core Red Hat Linux systems ({Ringil | Frodo | Samwise | Merry | Pippin}.cis.ksu.edu) as Inducer (use which and whereis to locate it locally).
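On generating training set accuracy: whatever the tool, it is conceptually just the finished model re-evaluated on the data it was trained on. In WEKA terms, continuing the placeholder setup from the part (a) sketch, it would look something like the fragment below (appended to that sketch's main method).

// Training set accuracy: evaluate the already-trained model on its own
// training data. Expect it to exceed test set accuracy (overfitting).
Evaluation trainEval = new Evaluation(train);
trainEval.evaluateModel(id3, train);
System.out.println("Training accuracy: " + (1.0 - trainEval.errorRate()));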
2. (10 points) Running Feedforward ANNs in NeuroSolutions.
Download the NeuroSolutions 4 demo from http://www.nd.com and install it on a Windows 98/Me/XP Home or NT4/2000/XP Pro machine. NS4 is installed on the “hobbits” (4 Pentium Pro workstations dual-booting Windows 2000 Professional and Red Hat Linux 6.2, located in 227 Nichols Hall), and you may log in with your CIS account to use them.
Use the NeuralBuilder wizard (which is fully documented in the online help for NeuroSolutions 4) to build a multilayer perceptron for learning the sleep stage data provided in the example data directory. Your training data file should be Sleep1.asc and your desired response file should be Sleep1t.asc. Use a 15% holdout data set for cross validation. Report both training and cross validation performance (mean-squared error) by selecting the appropriate probes in the wizard or stamping them from the tool palettes, and recording the final value after training (for 2000 epochs, twice the default). Replace the sigmoidal activation units with linear approximators to the sigmoid transfer function (one common approximator is shown below). Finally, double the number of hidden layer units. Turn in a screenshot showing the revised network, the progress bar, and the MSE values after training.
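For reference, the quantities involved are, in the usual notation (with d_i the desired response and y_i the network output over N exemplars): the mean-squared error being reported, the logistic sigmoid being replaced, and one common piecewise-linear approximator that matches the sigmoid's value and slope at 0. The specific approximator NeuroSolutions uses may differ; consult its online help.

\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} (d_i - y_i)^2, \qquad
\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad
\mathrm{linsig}(x) = \min\!\left(1, \max\!\left(0, \frac{x}{4} + \frac{1}{2}\right)\right)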
3. (48 points) Implementing Simple Bayes.
The specification for this problem is posted at http://snurl.com/2eth; your implementation must match it exactly. There will be a follow-up using your code in later MPs, so it is a good idea not to skip this one.
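As background only (the linked specification governs all actual interface, input, and output requirements): simple Bayes picks the class maximizing P(c) * prod_a P(x_a | c) under the attribute-independence assumption. Below is a minimal sketch for nominal attributes using Laplace smoothing, which is a common choice but an assumption on my part; all names and the data layout are illustrative. Scores are accumulated in log space so that products over many attributes do not underflow.

// Minimal simple (naive) Bayes sketch for nominal attributes with Laplace
// smoothing. Data layout and names are illustrative; the linked spec governs.
public class SimpleBayes {
    private final int numClasses;
    private final int[] attrArity;    // number of values per attribute
    private final int[] classCount;   // count[c]
    private final int[][][] count;    // count[attr][value][c]
    private int total;

    public SimpleBayes(int numClasses, int[] attrArity) {
        this.numClasses = numClasses;
        this.attrArity = attrArity;
        this.classCount = new int[numClasses];
        this.count = new int[attrArity.length][][];
        for (int a = 0; a < attrArity.length; a++)
            count[a] = new int[attrArity[a]][numClasses];
    }

    /** Tally one training example: x[a] is the value index of attribute a. */
    public void train(int[] x, int label) {
        classCount[label]++;
        total++;
        for (int a = 0; a < x.length; a++)
            count[a][x[a]][label]++;
    }

    /** Return argmax_c P(c) * prod_a P(x_a | c), computed in log space. */
    public int classify(int[] x) {
        int best = 0;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int c = 0; c < numClasses; c++) {
            // Laplace-smoothed prior and per-attribute likelihoods.
            double score = Math.log((classCount[c] + 1.0) / (total + numClasses));
            for (int a = 0; a < x.length; a++)
                score += Math.log((count[a][x[a]][c] + 1.0)
                                  / (classCount[c] + attrArity[a]));
            if (score > bestScore) { bestScore = score; best = c; }
        }
        return best;
    }
}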
Extra credit (10 points) Try the MATLAB Neural Network Toolbox on Sleep1 and report the same results for a feedforward ANN (specifically, a multilayer perceptron) trained with backprop. This package can be found on the KDD Core systems, including a Windows version installed on the Linux Dwarves.