CIS 690
(Implementation of High-Performance Data Mining Systems)
Monday, June 4, 2001
Due: Friday, June 8, 2001 (by midnight)
This assignment is designed to give you some practice in using existing machine learning (ML) and inference packages to apply ML to knowledge discovery in databases (KDD).
Refer to the course intro handout for guidelines on working with other students. Remember to submit your solutions as a compressed e-mail attachment to cis690ta@www.kddresearch.org and produce them only from your personal data, source code, and notes (not common work or sources other than the codes specified in this machine problem). If you intend to use other references (e.g., codes downloaded from the CMU archive, NRL archive, or other software repositories such as those referenced from KD Nuggets or the instructor's "related links" page), get the instructor's permission and cite your reference properly.
1. (20 points) Running MineSet. For this machine problem, you will use your course accounts on the KSU CIS department KDD cluster. The SGI MineSet 3.1 server is running on Goodland and on 5 machines in N16. Note: your accounts should be operational by the time this assignment is distributed; check the course web page for the latest information.
a) (10 points) Download the files:
http://www.kddresearch.org/Courses/Summer-2001/CIS690/Resources/MLC++-2.01.tar.gz
http://www.kddresearch.org/Courses/Summer-2001/CIS690/Resources/db.tar.gz
These are the pre-compiled binaries for RedHat Linux 6.x / 7.x and the MLC++ version of selected data sets from the UC Irvine Machine Learning Database Repository. Follow the instructions in the MLC++ manual (Utilities 2.0, available from http://www.sgi.com/tech/mlc) for installing it in your scratch directory on Topeka and Salina:
/cis/topeka/scratch/CIS690/yourlogin
Follow the instructions in the MLC++ tutorial (also available at http://www.sgi.com/tech/mlc) to run the ID3 inducer on the Mushroom data set from the UCI Machine Learning Database Repository. Use the .test files for testing. Turn in:
- (2 points) a shell script called RUNMLC.script that takes as its arguments the data set name, the test set name, and the inducer name, and invokes MLC++ (e.g., RUNMLC.script vote.data vote.test ID3).
- (3 points) the answer to the following question: how many rules are produced by the decision tree inducer?
- (5 points) a screen shot of the decision tree Overview and of the TreeViz visualization zoomed into a leaf of the decision tree (paste this into an MS Word 2000, PostScript, or PDF file). NB: you can also produce .dt files using MLC++ that can be read using TreeViz.
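The wrapper script asked for above might be sketched as follows. This is a dry-run sketch only: it prints the command it would run rather than executing MLC++, and the environment-variable names (DATAFILE, TESTFILE, INDUCER) are assumptions about the MLC++ Utilities conventions — verify them against the Utilities 2.0 manual before turning anything in.

```shell
#!/bin/sh
# RUNMLC.script -- hypothetical wrapper around the MLC++ 'Inducer' utility.
# Usage: RUNMLC.script <data-file> <test-file> <inducer-name>

run_mlc() {
    if [ $# -ne 3 ]; then
        echo "usage: RUNMLC.script datafile testfile inducer" >&2
        return 1
    fi
    # MLC++ utilities are configured through environment variables; the
    # names below are assumptions -- check the Utilities 2.0 manual.
    DATAFILE=$1 TESTFILE=$2 INDUCER=$3
    export DATAFILE TESTFILE INDUCER
    # Dry run: print the invocation instead of executing it.  Replace
    # 'echo' with the real call once MLC++ is on your PATH.
    echo "Inducer (DATAFILE=$DATAFILE TESTFILE=$TESTFILE INDUCER=$INDUCER)"
}
```

For example, `run_mlc vote.data vote.test ID3` prints the invocation it would make against the vote data set.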
b) (10 points) Using the MineSet 3.1 client for Windows, run the Churn problem through Naïve Bayes. Use the Evidence Visualizer to explore the results; turn in a screen shot of the pie chart or color bar visualization in EviViz as well as the .eviviz file.
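For orientation, the quantity behind the Evidence Visualizer's per-attribute "evidence" bars is the naïve Bayes posterior, P(c) · ∏ᵢ P(xᵢ | c), where each attribute contributes one multiplicative factor. A minimal sketch of that computation (not MineSet's implementation; the Laplace smoothing here is one common choice):

```python
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Count class priors and per-attribute conditional counts."""
    prior = Counter(labels)
    cond = defaultdict(Counter)          # (attr_index, label) -> value counts
    for row, y in zip(rows, labels):
        for i, v in enumerate(row):
            cond[(i, y)][v] += 1
    return prior, cond

def posterior(model, row):
    """Unnormalized class scores: P(c) * prod_i P(x_i | c)."""
    prior, cond = model
    n = sum(prior.values())
    scores = {}
    for c, pc in prior.items():
        p = pc / n
        for i, v in enumerate(row):
            counts = cond[(i, c)]
            # Laplace smoothing over the observed per-attribute vocabulary
            p *= (counts[v] + 1) / (pc + len(counts) + 1)
        scores[c] = p
    return scores
```

Each factor `(counts[v] + 1) / (pc + len(counts) + 1)` is one slice of evidence for class `c`, which is roughly what EviViz draws per attribute.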
2. (10 points) Using NeuroSolutions.
a) Download the NeuroSolutions 4 demo from http://www.nd.com and install it on a Windows 95, 98, NT 4.0, or NT 5 (Windows 2000) machine.
b) (10 points) Use the NeuralBuilder wizard (which is fully documented in the online help for NeuroSolutions 4) to build a multilayer perceptron for learning the Anneal data set from the UCI archive. You will need to process anneal.data and anneal.names to make a training data file (with the attribute names in the first row) and anneal.test and anneal.names to make a desired response file. Report both training and cross-validation performance (mean-squared error) by stamping a MatrixViewer probe on top of the (octagonal) costTransmitter module and recording the final value after training (for the default of 1000 epochs). Submit your screen shot file with a caption explaining which probes are which.
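The preprocessing step (pulling the attribute names out of the .names file and prepending them as a header row) could be sketched like this. It assumes the C4.5-style .names layout used by the UCI repository — class values on the first line, one "attribute: values." declaration per later line, "|" starting a comment — so check anneal.names for any quirks before relying on it:

```python
def attribute_names(names_text):
    """Extract attribute names from a C4.5/UCI-style .names file."""
    lines = []
    for raw in names_text.splitlines():
        line = raw.split('|')[0].strip()   # drop '|' comments
        if line:
            lines.append(line)
    # The first surviving line lists the class values; the rest
    # declare attributes as 'name: value, value, ... .'
    return [l.split(':', 1)[0].strip() for l in lines[1:] if ':' in l]

def add_header(names_text, data_text, class_name='class'):
    """Prepend a comma-separated header row to the raw .data rows."""
    header = ','.join(attribute_names(names_text) + [class_name])
    return header + '\n' + data_text
```

Run it over the contents of anneal.names and anneal.data (and again with anneal.test) to produce the headered files NeuralBuilder expects.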
3. (10 points) Using BayesWare Discoverer. Download and install the Discoverer 1.0 package (formerly Bayesian Knowledge Discoverer or BKD) from http://bayesware.com, or run the copy installed on Goodland or in N16. Use it to learn Asia (aka Lung-Cancer), a very small BBN, from data. You can obtain this data from http://www.kddresearch.org/Courses/Spring-2001/CIS690/Homework/Problems/HW2/AsiaDat.zip and will need to do a little preprocessing to get it into Discoverer. Submit a screen shot of your Asia BBN and also attach the saved network, titled Asia.
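One plausible form the "little preprocessing" might take is converting the raw case file to a delimited file with a header row. The sketch below assumes the unzipped data is plain whitespace-delimited text with one case per line — inspect the actual file first and adjust if its format differs:

```python
import csv
import io

def whitespace_to_csv(raw_text, header):
    """Turn whitespace-delimited rows into CSV with a header row.

    Assumption: one case per line, fields separated by whitespace.
    The caller supplies the variable names in the data's column order.
    """
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(header)
    for line in raw_text.splitlines():
        fields = line.split()
        if fields:                 # skip blank lines
            writer.writerow(fields)
    return out.getvalue()
```

Call it with the eight Asia variable names, in whatever column order the data file actually uses, as the header list.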
4. (10 points) Using Hugin. Download the Hugin Lite demo from http://www.hugin.dk and install it.
a) (5 points) Use Hugin to build a full Bayesian network for the Sprinkler-Rain example from lecture, using your own subjective estimates of CPTs. Make sure that all your probability values are legitimate (specifically, that they have the proper range and marginalize properly). Submit a screen shot of your BBN and attach it as a Hugin file titled Sprinkler-Rain.hkb.
b) (5 points) Perform inference using this network and 5 randomly generated examples with Season and Sprinkler as your evidence. Submit a screen shot of your BBN after inference on one example and the Most Probable Explanation (in a small table) for all 5 cases.
Extra Credit (10 points): Produce dotty (PostScript) and MineSet TreeViz files for the Diabetes data set using ID3 and view them. Submit the .ps file of the dotty output and a screen shot of the decision tree in MineSet 3.1.