CIS 732
Sunday, 21 September 2003
Due: Friday, 03 October 2003 (before midnight Saturday, 04 October 2003)
Refer to the course intro handout for guidelines on working with other students.
Note: Remember to submit your solutions in electronic form by uploading them to the Yahoo! Group ksu-cis732-fall2003, and to produce them only from your own source code, scripts, and documents generated with the machine learning applications used in this MP (not from shared work or from sources other than the textbook and properly cited references).
You have nearly 3 weeks to complete the 3 parts of this machine problem (MP), so please start early and finish about one part per week. The point value of each part is an approximate indicator of difficulty (your personal assessment can and should vary). Problem 3 is considerably harder because you are being asked to write your own code.
Problems

First, log into your course accounts on the KDD Core (Ringil, Fingolfin, Anaire, Finarfin, Azaghal, Narvi, Telchar, Gimli) and make sure your home directory is in order. Notify admin@www.kddresearch.org (and cc: cis732ta@www.kddresearch.org) if you have any problems at this stage.
1. (42 points total) Running decision tree and simple Bayes inducers in WEKA and MLC++.
In your web browser, open the URL
http://www.cs.waikato.ac.nz/~ml/weka/
Download the Waikato Environment for Knowledge Analysis (WEKA) v3.2.3 (GUI version © 2002 I. H. Witten, E. Frank, et al.) to your local system (this can be a Windows, Unix, Mac, or other system, but the binaries are precompiled for ix86 Linux). Follow the instructions in the WEKA3 manual for installing it into your home directory.
a) (12 points) Your solution to this problem must be in MS Excel, PostScript, or PDF format, and you must use a spreadsheet (I recommend Gnumeric or Excel 2000/XP) to record your solution. Follow the instructions in the WEKA3 User Guide (also in your first notes packet) to run the ID3 inducer on the Credit (CRX) and Vote data sets from the UCI Machine Learning Database Repository:
http://www.ics.uci.edu/~mlearn/MLRepository.html
Use the .test files for testing. Turn in the ASCII file containing the decision tree and another file (.xls, .ps, or .pdf) containing a table of test set accuracy values for each data set; a programmatic sketch of such a run appears below. (For the next machine problem, you will compare the ID3 results – accuracy, overfitting, example learning curves – with Simple Bayes and C4.5.)
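If you prefer to script the run rather than use the GUI, the following Java sketch shows one way it might look against the WEKA 3.2 API. The class paths (weka.classifiers.Id3, weka.classifiers.Evaluation) reflect the 3.2 release (later releases moved Id3 to weka.classifiers.trees.Id3), and the ARFF file names are placeholders, not files shipped with the assignment. Note also that ID3 requires nominal attributes with no missing values, so CRX may need preprocessing first.

// Sketch only: programmatic equivalent of the GUI run described above.
// File names are placeholders; substitute your own ARFF files.
import java.io.BufferedReader;
import java.io.FileReader;
import weka.classifiers.Evaluation;
import weka.classifiers.Id3;
import weka.core.Instances;

public class RunId3 {
    public static void main(String[] args) throws Exception {
        // Load training and test sets (ARFF versions of CRX or Vote).
        Instances train = new Instances(new BufferedReader(new FileReader("crx-train.arff")));
        Instances test = new Instances(new BufferedReader(new FileReader("crx-test.arff")));
        train.setClassIndex(train.numAttributes() - 1);  // class = last attribute
        test.setClassIndex(test.numAttributes() - 1);

        Id3 id3 = new Id3();
        id3.buildClassifier(train);  // induce the decision tree
        System.out.println(id3);     // ASCII rendering of the tree

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(id3, test);               // test set evaluation
        System.out.println(eval.toSummaryString());  // includes accuracy
    }
}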
b) (20 points) Develop a classification-based solution, using the ID3 and Simple (Naïve) Bayesian inducers in WEKA, to the prediction task in the COIL 2000 (Dutch Insurance Company) challenge problem from the UCI KDD Repository:
http://kdd.ics.uci.edu/databases/tic/tic.html
Read and think about the domain carefully first – do not simply throw potential solutions (inducers) at the problem. For instance, you must decide what kind of discretization you want (see the sketch after this paragraph). Your solution should include a paragraph on the bias towards comprehensibility (Michalski, 1993) and how your inducer(s) apply or do not apply this bias.
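Both ID3 and Simple Bayes are most naturally applied to discrete attributes, and equal-width binning is one of the simplest discretization choices. The method below is only a generic illustration of that idea (WEKA also ships a discretization filter you may prefer); the class and method names here are mine, not part of any required interface.

// Generic equal-width discretization sketch (illustration only).
public final class EqualWidth {
    /** Map each value to a bin index in [0, bins) over [min, max]. */
    public static int[] discretize(double[] values, int bins) {
        double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY;
        for (double v : values) {
            if (v < min) min = v;
            if (v > max) max = v;
        }
        double width = (max - min) / bins;
        int[] out = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            int b = (width == 0.0) ? 0 : (int) ((values[i] - min) / width);
            out[i] = Math.min(b, bins - 1);  // clamp value == max into top bin
        }
        return out;
    }
}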
c) (10 points) Repeat the process from part (a) with the Feature Subset Selection (FSS) inducer, which you can read about in the MLC++ user guide (http://www.sgi.com/tech/mlc). You may use the MLC++ database version of CRX and Vote:
http://www.kddresearch.org/Resources/ (db.tgz)
The wrapped inducer should be ID3. Report both test and training accuracy, and think carefully about how to generate training set accuracy (a brief note on this follows below). Note: MLC++ is installed on the KDD Core Red Hat Linux systems ({Ringil | Frodo | Samwise | Merry | Pippin}.cis.ksu.edu) as Inducer (use which and whereis to locate it locally).
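On generating training set accuracy: whatever the tool, it is conceptually just the finished model re-evaluated on the data it was trained on. In WEKA terms, continuing the placeholder setup from the part (a) sketch, it would look something like the fragment below (appended to that sketch's main method).

// Training set accuracy: evaluate the already-trained model on its own
// training data. Expect it to exceed test set accuracy (overfitting).
Evaluation trainEval = new Evaluation(train);
trainEval.evaluateModel(id3, train);
System.out.println("Training accuracy: " + (1.0 - trainEval.errorRate()));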
2. (10 points) Running Feedforward ANNs in NeuroSolutions.
Download the NeuroSolutions 4 demo from http://www.nd.com and install it on a Windows 98/Me/XP Home or NT4/2000/XP Pro machine. NS4 is installed on the “hobbits” (4 Pentium Pro workstations dual-booting Windows 2000 Professional and Red Hat Linux 6.2, located in 227 Nichols Hall), and you may log in with your CIS account to use them.
Use the NeuralBuilder wizard (which is fully documented in the online help for NeuroSolutions 4) to build a multilayer perceptron for learning the sleep stage data provided in the example data directory. Your training data file should be Sleep1.asc and your desired response file should be Sleep1t.asc. Use a 15% holdout data set for cross validation. Report both training and cross validation performance (mean-squared error) by selecting the appropriate probes in the wizard or stamping them from the tool palettes, and recording the final value after training (for 2000 epochs, twice the default). Replace the sigmoidal activation units with linear approximators to the sigmoid transfer function (one common approximator is shown below). Finally, double the number of hidden layer units. Turn in a screenshot showing the revised network, the progress bar, and the MSE values after training.
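For reference, the quantities involved are, in the usual notation (with d_i the desired response and y_i the network output over N exemplars): the mean-squared error being reported, the logistic sigmoid being replaced, and one common piecewise-linear approximator that matches the sigmoid's value and slope at 0. The specific approximator NeuroSolutions uses may differ; consult its online help.

\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} (d_i - y_i)^2, \qquad
\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad
\mathrm{linsig}(x) = \min\!\left(1, \max\!\left(0, \frac{x}{4} + \frac{1}{2}\right)\right)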
3. (48 points) Implementing Simple Bayes.
The specification for this problem is posted at http://snurl.com/2eth; your implementation must match it exactly. There will be a follow-up using your code in later MPs, so it is a good idea not to skip this one.
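As background only (the linked specification governs all actual interface, input, and output requirements): simple Bayes picks the class maximizing P(c) * prod_a P(x_a | c) under the attribute-independence assumption. Below is a minimal sketch for nominal attributes using Laplace smoothing, which is a common choice but an assumption on my part; all names and the data layout are illustrative. Scores are accumulated in log space so that products over many attributes do not underflow.

// Minimal simple (naive) Bayes sketch for nominal attributes with Laplace
// smoothing. Data layout and names are illustrative; the linked spec governs.
public class SimpleBayes {
    private final int numClasses;
    private final int[] attrArity;    // number of values per attribute
    private final int[] classCount;   // count[c]
    private final int[][][] count;    // count[attr][value][c]
    private int total;

    public SimpleBayes(int numClasses, int[] attrArity) {
        this.numClasses = numClasses;
        this.attrArity = attrArity;
        this.classCount = new int[numClasses];
        this.count = new int[attrArity.length][][];
        for (int a = 0; a < attrArity.length; a++)
            count[a] = new int[attrArity[a]][numClasses];
    }

    /** Tally one training example: x[a] is the value index of attribute a. */
    public void train(int[] x, int label) {
        classCount[label]++;
        total++;
        for (int a = 0; a < x.length; a++)
            count[a][x[a]][label]++;
    }

    /** Return argmax_c P(c) * prod_a P(x_a | c), computed in log space. */
    public int classify(int[] x) {
        int best = 0;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int c = 0; c < numClasses; c++) {
            // Laplace-smoothed prior and per-attribute likelihoods.
            double score = Math.log((classCount[c] + 1.0) / (total + numClasses));
            for (int a = 0; a < x.length; a++)
                score += Math.log((count[a][x[a]][c] + 1.0)
                                  / (classCount[c] + attrArity[a]));
            if (score > bestScore) { bestScore = score; best = c; }
        }
        return best;
    }
}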
Extra credit (10 points) Try the MATLAB Neural Network Toolbox on Sleep1 and report the same results for a feedforward ANN (specifically, a multilayer perceptron) trained with backprop. This package can be found on the KDD Core systems, including a Windows version installed on the Linux Dwarves.