Homework 5
Problem Set
Assigned Sat 31 Mar 2007
Due Fri 06 Apr 2007

This write assignment is designed to give you more practice with 
statistical evaluation of hypotheses, COLT, and inductive rule learning 
in more depth.

Create a file, README.txt, that will document your code manifest (you 
may organize your sources in directories called MP4-i, for 1 \leq i \leq 
4), and record individual results in files titled mp4-i.txt.   NOTE: You 
MUST use this exact file name for your solution to be graded.  This will 
be your experimental logbook for this machine problem.

1. (25%) Statistical evaluation of hyptheses.
    a) Problem 5.1, p. 152 Mitchell.  Suppose you test a hypothesis h 
and find that it commits r = 300 errors on a sample S of n = 1000 
randomly drawn test examples.  What is the standard deviation in 
error_S(h)?  How does tihs compare to the standard deviation in the 
example at the end of Section 5.3.4?
    b) Problem 5.3, p. 152 Mitchell.  Suppose hypothesis h commits r = 
10 errors over a sample of n = 65 independently drawn examples.  What is 
the 90% confidence interval (two-sided) for the true error rate?  What 
is the 95% one-sided interval (i.e., what is the upper bound U such that 
error_D(h) \leq U with 95% confidence)?  What is the 90% one-sided interval?

2. (25%) Computational learning theory (COLT).  Problem 7.4, p. 227 
Mitchell.  Consider a learning problem in which X = R is the set of real 
numbers, and C = H is the set of intervals over the reals, H = {(a < x < 
b) | a, b \in R}.  What is the probability that a hypothesis consistent 
with m examples of this target concept will have error at least 
\epsilon?  Solve this using the VC dimension.  Can you find a second way 
to solve this, based on first principles and ignoring the VC dimension?

For Problem 3, consult the documentation pages on machine learning at 
Prof. Steve Muggleton's site at Imperial College, UK:
http://www.doc.ic.ac.uk/~shm/progol.html

3. (25%) Inductive Logic Programming (ILP).  Use Muggleton's example 
generator (http://wwwhomes.doc.ic.ac.uk/~shm/Software/GenerateTrains/) 
to produce a data set containing 200 "Michalski AQ-style" train examples 
WITHIN YOUR PROJECT DOMAIN.  Hold out 100 examples for validation (i.e., 
measuring validation set accuracy as an estimate of test error and 
generalization quality).  Turn in all examples as two files 
(PS5_3-train.txt and PS5_3-test.txt).  Use INVERTED RESOLUTION (\S 10.7, 
p. 293-301) to produce an example hypothesis from two of your GENERATED 
training examples.

4. (25%)  Term project selection.  Finalize your term project topic and 
produce a formal statement of the learning problem, giving
    a) the input: a DATA DICTIONARY consisting of the list of 
attributes, their provenance (where they come from or how they are computed)
    b)  the target output: the type, name, and basic semantics of each 
output attribute
    c) the training method: supervised, semi-supervised, unsupervised, 
reinforcement
    d) the presentation of examples or reinforcements: how sampling is 
done to choose examples
    e) the evaluation mechanism
    f) variants of experiments to be conducted: combiners (committee 
machines to be used), etc.

Extra credit (25%).  Convert the data in Problem 3 to Attribute 
Relational File Format and train J48 using 10 to 100 examples, 
incrementing by 10 each time.  Using GNUplot or Excel, plot a curve of 
test set accuracy measured on the SAME validation data.

In Problem Set 5, you will look at statistical evaluation of hypotheses, 
COLT, and inductive rule learning in more depth, and run through some 
examples and a proof by hand.   Turn in PS5_3-train.arff and 
PS5_3-test.arff along with the output of WEKA (PS5_3-results.txt).