Homework 5 Problem Set Assigned Sat 31 Mar 2007 Due Fri 06 Apr 2007 This write assignment is designed to give you more practice with statistical evaluation of hypotheses, COLT, and inductive rule learning in more depth. Create a file, README.txt, that will document your code manifest (you may organize your sources in directories called MP4-i, for 1 \leq i \leq 4), and record individual results in files titled mp4-i.txt. NOTE: You MUST use this exact file name for your solution to be graded. This will be your experimental logbook for this machine problem. 1. (25%) Statistical evaluation of hyptheses. a) Problem 5.1, p. 152 Mitchell. Suppose you test a hypothesis h and find that it commits r = 300 errors on a sample S of n = 1000 randomly drawn test examples. What is the standard deviation in error_S(h)? How does tihs compare to the standard deviation in the example at the end of Section 5.3.4? b) Problem 5.3, p. 152 Mitchell. Suppose hypothesis h commits r = 10 errors over a sample of n = 65 independently drawn examples. What is the 90% confidence interval (two-sided) for the true error rate? What is the 95% one-sided interval (i.e., what is the upper bound U such that error_D(h) \leq U with 95% confidence)? What is the 90% one-sided interval? 2. (25%) Computational learning theory (COLT). Problem 7.4, p. 227 Mitchell. Consider a learning problem in which X = R is the set of real numbers, and C = H is the set of intervals over the reals, H = {(a < x < b) | a, b \in R}. What is the probability that a hypothesis consistent with m examples of this target concept will have error at least \epsilon? Solve this using the VC dimension. Can you find a second way to solve this, based on first principles and ignoring the VC dimension? For Problem 3, consult the documentation pages on machine learning at Prof. Steve Muggleton's site at Imperial College, UK: http://www.doc.ic.ac.uk/~shm/progol.html 3. (25%) Inductive Logic Programming (ILP). Use Muggleton's example generator (http://wwwhomes.doc.ic.ac.uk/~shm/Software/GenerateTrains/) to produce a data set containing 200 "Michalski AQ-style" train examples WITHIN YOUR PROJECT DOMAIN. Hold out 100 examples for validation (i.e., measuring validation set accuracy as an estimate of test error and generalization quality). Turn in all examples as two files (PS5_3-train.txt and PS5_3-test.txt). Use INVERTED RESOLUTION (\S 10.7, p. 293-301) to produce an example hypothesis from two of your GENERATED training examples. 4. (25%) Term project selection. Finalize your term project topic and produce a formal statement of the learning problem, giving a) the input: a DATA DICTIONARY consisting of the list of attributes, their provenance (where they come from or how they are computed) b) the target output: the type, name, and basic semantics of each output attribute c) the training method: supervised, semi-supervised, unsupervised, reinforcement d) the presentation of examples or reinforcements: how sampling is done to choose examples e) the evaluation mechanism f) variants of experiments to be conducted: combiners (committee machines to be used), etc. Extra credit (25%). Convert the data in Problem 3 to Attribute Relational File Format and train J48 using 10 to 100 examples, incrementing by 10 each time. Using GNUplot or Excel, plot a curve of test set accuracy measured on the SAME validation data. In Problem Set 5, you will look at statistical evaluation of hypotheses, COLT, and inductive rule learning in more depth, and run through some examples and a proof by hand. Turn in PS5_3-train.arff and PS5_3-test.arff along with the output of WEKA (PS5_3-results.txt).