CIS 690

(Implementation of High-Performance Data Mining Systems)

Summer, 2001

 

Homework Assignment 2 (Machine Problem)

 

Monday, June 4, 2001

Due: Friday, June 8, 2001 (by midnight)

 

This assignment is designed to give you some practice in using existing machine learning (ML) and inference packages to apply ML to knowledge discovery in databases (KDD).

 

Refer to the course intro handout for guidelines on working with other students.  Remember to submit your solutions by compressed e-mail attachment to cis690ta@www.kddresearch.org and produce them only from your personal data, source code, and notes (not common work or sources other than the codes specified in this machine problem).  If you intend to use other references (e.g., codes downloaded from the CMU archive, NRL archive, or other software repositories such as those referenced from KD Nuggets or the instructor’s “related links” page), get the instructor’s permission, and cite your references properly.

 

1.        (20 points) Running MineSet.  For this machine problem, you will use your course accounts on the KSU CIS department KDD cluster.  The SGI MineSet 3.1 server is running on Goodland and on 5 machines in N16.  Note: your accounts should be operational by the time this assignment is distributed – check the course web page for the latest information.

a)       (10 points) Download the files:

http://www.kddresearch.org/Courses/Summer-2001/CIS690/Resources/MLC++-2.01.tar.gz

http://www.kddresearch.org/Courses/Summer-2001/CIS690/Resources/db.tar.gz

These are the pre-compiled binaries for Red Hat Linux 6.x / 7.x and the MLC++ versions of selected data sets from the UC Irvine Machine Learning Database Repository.  Follow the instructions in the MLC++ manual (Utilities 2.0, available from http://www.sgi.com/tech/mlc) for installing them in your scratch directory on Topeka and Salina:

/cis/topeka/scratch/CIS690/yourlogin
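A minimal unpacking sketch, assuming you have already downloaded both archives into that directory (standard gzip/tar invocations; substitute your own login):

    cd /cis/topeka/scratch/CIS690/yourlogin
    gzip -dc MLC++-2.01.tar.gz | tar xf -     # unpack the MLC++ binaries
    gzip -dc db.tar.gz | tar xf -             # unpack the UCI data sets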

Follow the instructions in the MLC++ tutorial (also available at http://www.sgi.com/tech/mlc) to run the ID3 inducer on the Mushroom data set from the UCI Machine Learning Database Repository.  Use the .test files for testing. Turn in:

-          (2 points) a shell script called RUNMLC.script that takes as its arguments the data set name, the test set name, and the inducer name and invokes MLC++ accordingly (e.g., RUNMLC.script vote.data vote.test ID3); a skeleton sketch appears after this list.

-          (3 points) the answer to the following question: how many rules are produced by the decision tree inducer?

-          (5 points) screen shots of the decision tree Overview and of the TreeViz visualization zoomed into a leaf of the decision tree (paste these into an MS Word 2000, PostScript, or PDF file)

NB: MLC++ can also produce .dt files that can be read by TreeViz.
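Here is one possible skeleton for RUNMLC.script.  The environment-variable interface (DATAFILE, TESTFILE, INDUCER, and the Inducer utility) is my reading of the MLC++ Utilities 2.0 manual; verify the exact option names against your copy:

    #!/bin/sh
    # RUNMLC.script -- wrapper sketch; usage: RUNMLC.script vote.data vote.test ID3
    # Assumption: the MLC++ Inducer utility reads its options from the
    # environment; check the Utilities 2.0 manual for the exact names.
    DATAFILE=$1 TESTFILE=$2 INDUCER=$3 Inducer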

b)       (10 points) Using the MineSet 3.1 client for Windows, run the Churn problem through Naïve Bayes.  Use the Evidence Visualizer to explore the results; turn in a screen shot of the pie chart or color bar visualization in EviViz as well as the .eviviz file.

 

 

2.        (10 points) Using NeuroSolutions.

a)       Download the NeuroSolutions 4 demo from http://www.nd.com and install it on a Windows 95, 98, NT 4.0, or NT 5 (Windows 2000) machine.

b)       (10 points) Use the NeuralBuilder wizard (fully documented in the NeuroSolutions 4 online help) to build a multilayer perceptron for learning the Anneal data set from the UCI archive.  You will need to process anneal.data and anneal.names to make a training data file (with the attribute names in the first row), and anneal.test and anneal.names to make a desired response file; a preprocessing sketch follows.  Report both training and cross-validation performance (mean squared error) by stamping a MatrixViewer probe on top of the (octagonal) costTransmitter module and recording the final value after training (for the default of 1000 epochs).
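A minimal sketch for building the header row, assuming a C4.5-style anneal.names file (attribute lines of the form name: values, comment lines starting with |); the output file names are placeholders:

    # Extract the attribute names (the text before each colon) and join
    # them into one comma-separated header row.
    grep -v '^|' anneal.names | grep ':' | cut -d: -f1 | paste -s -d, - > header.txt
    # Prepend the header row to the training and test data.
    cat header.txt anneal.data > anneal-train.txt
    cat header.txt anneal.test > anneal-test.txt

NeuralBuilder may expect a particular column delimiter; check the online help and adjust the data rows to match.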

 

Submit your screen shot file with a caption explaining which probes are which.

 

 

3.        (10 points) Using BayesWare Discoverer.

Download and install the Discoverer 1.0 package (formerly Bayesian Knowledge Discoverer, or BKD) from http://bayesware.com, or run the copy installed on Goodland or in N16.  Use it to learn Asia (aka Lung-Cancer), a very small BBN, from data.  You can obtain this data from
http://www.kddresearch.org/Courses/Spring-2001/CIS690/Homework/Problems/HW2/AsiaDat.zip
and will need to do a little preprocessing to get it into Discoverer (a conversion sketch follows).
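For example, if the unzipped file turns out to be whitespace-delimited, a one-line conversion to comma-separated values might look like the following (the file names, and the assumption that Discoverer wants comma-delimited input, are mine; check its import dialog):

    # Hypothetical: squeeze runs of spaces/tabs into single commas.
    tr -s ' \t' ',' < asia.dat > asia.csv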

 

Submit a screen shot of your Asia BBN and also attach the saved network, titled Asia.

 

 

4.        (10 points) Using Hugin.

Download the Hugin Lite demo from http://www.hugin.dk and install it.

a)       (5 points) Use Hugin to build a full Bayesian network for the Sprinkler-Rain example from lecture, using your own subjective estimates of the CPTs.  Make sure that all your probability values are legitimate (specifically, that each lies between 0 and 1 and that the entries for each parent configuration sum to 1).  Submit a screen shot of your BBN and attach the saved network as a Hugin file titled Sprinkler-Rain.hkb.
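For example, if you set P(Sprinkler = on | Season = dry) = 0.6 (an illustrative value, not a prescribed one), then P(Sprinkler = off | Season = dry) must be 0.4 so that the two entries sum to 1; repeat this check for every parent configuration in every CPT.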

b)       (5 points) Perform inference using this network on 5 randomly generated examples, with Season and Sprinkler as your evidence.  Submit a screen shot of your BBN after inference on one example and the Most Probable Explanation (in a small table) for all 5 cases.
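If you want reproducible evidence rather than hand-picked values, a small bash sketch such as the following can draw the 5 cases; the state lists are placeholders for whatever values your network actually uses:

    #!/bin/bash
    # Hypothetical helper: draw 5 random (Season, Sprinkler) evidence pairs.
    SEASONS="spring summer fall winter"   # placeholder state names
    SPRINKLER="on off"
    for i in 1 2 3 4 5; do
      s=$(echo $SEASONS | cut -d' ' -f$((RANDOM % 4 + 1)))
      p=$(echo $SPRINKLER | cut -d' ' -f$((RANDOM % 2 + 1)))
      echo "case $i: Season=$s Sprinkler=$p"
    done

Enter each pair as evidence in Hugin by hand.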

 

Extra Credit (10 points): Produce dotty (PostScript) and MineSet TreeViz files for the Diabetes data set using ID3 and view them.  Submit the .ps file of the dotty output and a screen shot of the decision tree in MineSet 3.1.
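A minimal rendering sketch for the dotty output, assuming the decision-tree graph ends up in a file named Diabetes.dot (a placeholder name) and that the graphviz dot tool is installed:

    # Render the decision-tree graph to PostScript for submission.
    dot -Tps Diabetes.dot -o Diabetes.ps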