CIS 690: Implementation of High-Performance Data Mining Systems

Summer, 2003

 

 

Credit: 3 hours
Prerequisite: CIS 300 (Algorithms and Data Structures) and instructor permission, or CIS 500 (Analysis of Algorithms and Data Structures); basic courses in probability and statistics and in databases are recommended
Textbook: none (course notes)
Venue: Monday-Friday 2:30-4:40pm, 236 Nichols Hall (lecture) and 126/128 Nichols Hall (lab)
Instructor: William H. Hsu, Department of Computing and Information Sciences
     Office: 213 Nichols Hall
     URL: http://www.cis.ksu.edu/~bhsu
     E-mail: bhsu@cis.ksu.edu
     Office phone: 532-6350
     Home phone: 539-7180
Office hours: after class; 1-3pm Monday; by appointment
Class web page: http://www.kddresearch.org/Courses/Summer-2003/CIS690/

 

Course Description

This is an implementation practicum and basic tutorial on knowledge discovery in databases (KDD) for students interested in applications of pattern recognition and machine learning such as data mining, classification, expert systems, and planning and design automation.  No prior background in artificial intelligence, machine learning, or knowledge-based systems is assumed or required, but preliminary coursework in probability and database systems is recommended.  The course will introduce the following basic algorithms and models: decision trees, simple (naïve) Bayes, feedforward artificial neural networks (specifically, multilayer perceptrons), and the simple genetic algorithm.  It will focus on implementing some of these basic algorithms and on configuring, modifying, and augmenting existing codes for machine learning and KDD.

Half of the course will be spent in lecture and discussion (60 minutes per day, 5 days per week); the other half, in the laboratory.

 

Course Requirements

Homework: 2 (out of 3) programming and written assignments (15%)
Paper reviews: 2 (out of 3) written reviews (1-2 pages) of research papers (10%)
Examinations: 1 in-class midterm (15%)
Computer language(s): C/C++ and Java (either permitted for the term programming project)
Project: programming practicum using a Linux (Beowulf) supercluster (60% total)

 

Selected Reading (on Reserve in K-State CIS Library)

 

Additional Bibliography (Excerpted in Course Notes and Handouts)

 

Class Calendar

Lecture  Date     Topic                                         Source
0        May 19   Administrivia; overview of KDD                TMM Chapter 1
                  Lab environment, MLC++
1        May 20   Decision trees                                TMM 3; Quinlan; RN 18
                  ID3 in MLC++; C4.5
2        May 21   Decision trees, overfitting                   TMM 3; Quinlan; RN 18
                  MineSet Tree Visualizer
3        May 22   Wrappers                                      MLC++ manual
                  Wrappers in MLC++
4        May 23   Bagging                                       MLC++ manual
                  Using wrapper inducers in MLC++
5        May 26   Boosting                                      MLC++ manual
                  Implementing wrapper inducers
6        May 27   Simple Bayes (naïve Bayes)                    TMM 6; MLC++ manual
                  Naïve Bayes inducer in MLC++
7        May 28   Improving simple Bayes (naïve Bayes)          TMM 6; paper
                  Improvements to simple Bayes
8        May 29   Using simple Bayes for text mining            TMM 6; paper; D2K manual
                  NCSA Data to Knowledge (D2K)
9        May 30   Introduction to Bayesian networks             TMM 6
                  Hugin
10       June 2   Bioinformatics Topics / Lab work              TBA
11       June 3   Bioinformatics Topics / Lab work              TBA
12       June 4   In-class midterm                              TBA
13       June 5   Learning / building Bayesian networks         TMM 6
                  Bayesian Network Interchange Format (BNIF)
14       June 6   Learning Bayesian network structure           TMM 6; XBN docs
                  MSBN, XML; ODBC and Bayesian networks
15       June 9   Perceptrons and winnow                        TMM 4; RN 19; MLC++ manual
                  Perceptrons in MLC++; SNOW
16       June 10  Intro to artificial neural networks (ANNs)    TMM 4; RN 19; SNNS manual
                  SNNS
17       June 11  Bioinformatics Topics                         TBA
18       June 12  Bioinformatics Topics                         TBA
19       June 13  Bioinformatics Topics                         TBA
20       June 16  Bioinformatics Topics                         TBA
21       June 17  Conclusions and wrap-up                       TMM 1, 3, 4, 6, 9; RN 18, 19; DEG 1, 6
                  KDD developer resources
                  NO FINAL EXAM

TMM: Machine Learning, T. M. Mitchell
RN: Artificial Intelligence: A Modern Approach, S. J. Russell and P. Norvig
DEG: Genetic Algorithms in Search, Optimization, and Machine Learning, D. E. Goldberg

Lightly-shaded entries denote the (tentative) due dates of paper reviews.
Heavily-shaded entries denote the (tentative) due dates of written or programming assignments.