Homework 4: Machine Problem
Assigned: Wed 07 Mar 2007
Due: Fri 16 Mar 2007

This programming assignment is designed to get you started with implementations for your term project and to give you some practice with the basics of genetic and evolutionary computation (GEC) and inductive logic programming (ILP).

For Problems 1-2, refer to the Evolutionary Computation in Java (ECJ) package by Sean Luke et al., available from George Mason University:
http://cs.gmu.edu/~eclab/projects/ecj/
Look at the documentation for ECJ and the four tutorials provided with the package:
http://cs.gmu.edu/~eclab/projects/ecj/docs/

Create a file, README.txt, that documents your code manifest (you may organize your sources in directories called MP4-i, for 1 <= i <= 4), and record individual results in files titled mp4-i.txt. NOTE: You MUST use these exact file names for your solutions to be graded. These files will serve as your experimental logbook for this machine problem.

1. (25%) Getting started with genetic programming.

a) Download version 15 of the ECJ package from GMU and install it on your personal computer. Run through Tutorial 1 (a GA for MaxOnes) and Tutorial 4 (a GP-based symbolic regressor).

b) For Tutorial 4, plot the fitness curve. Consult the KozaStatistics section of the JavaDoc documentation to see which columns to slice out, and use Unix 'cut' to extract the data (a data-extraction sketch appears at the end of this handout). Plot the data to get a curve like this one:
http://www.primordion.com/Xholon/samples/ecjtutorial4_GP.html

c) Do the same for Tutorial 1. Turn in your modified Java sources and indicate in README.txt which files you changed. In mp4-1.txt, record the fitness values you obtained.

If you are interested in using a GA or GP system in your term project, you are strongly advised to complete Tutorial 2 (GA-based concept learning with integer-valued attributes) and Tutorial 3 (floating-point evolution strategies with real-valued attributes) as well.

2. (25%) GP-based time series prediction. Now use ECJ to predict a continuation of the Laser-Generated Data (Santa Fe Time Series A) from PS3. Treat this as an application and extension of Tutorial 4 in which it is up to you to design the representation and select the operators (a representation sketch appears at the end of this handout). In MP6, you will use an ECJ/WEKA interface developed at K-State to compare ECJ and WEKA results. If you are doing a time series prediction-related project, you should repeat this exercise for the other Santa Fe data sets over the next month.

For Problem 3, consult the documentation pages on machine learning at Prof. Steve Muggleton's site at Imperial College, UK:
http://www.doc.ic.ac.uk/~shm/progol.html

3. (25%) Inductive Logic Programming (ILP). First, download the Progol package by Muggleton et al. Read the Progol and Cigol section in Chapter 10 of Mitchell, then look at the "multiplication" and "animals" examples. (Chapter 8 of Russell and Norvig, second edition, on First-Order Logic, may also be a useful reference.) Next, use Muggleton's example generator
(http://wwwhomes.doc.ic.ac.uk/~shm/Software/GenerateTrains/)
to produce a data set containing 200 "Michalski AQ-style" train examples. Hold out 100 examples for validation (i.e., for measuring validation-set accuracy as an estimate of test error and generalization quality). Then train with 10 to 100 examples, incrementing by 10 each time. Using GNUplot or Excel, plot a learning curve of accuracy measured on the SAME 100-example validation set at each training-set size (a bookkeeping sketch appears at the end of this handout).

4. (25%) Term project implementation. Get started producing the training data you specified for PS3.
Examples will be posted over the next three days (before Fri 09 Mar 2007) for each of the six projects you have to choose from. If you proposed a different project topic and it is approved, make sure you consult with the instructor about how to do this part of MP4. In Problem Set 5, you will look at statistical evaluation of hypotheses, COLT, and inductive rule learning in more depth, and run through some examples and a proof by hand.
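Illustrative sketches (optional). The three short Java sketches below are NOT required parts of the solution; they are minimal examples under stated assumptions, and every file name and constant in them is a placeholder you should replace with your own choices.

For Problem 1(b)-(c): once you have identified the relevant statistics columns, something like the following can turn an ECJ statistics file into tab-separated (generation, fitness) pairs for GNUplot or Excel. The file name out.stat and the "Generation:" / "Fitness:" line prefixes are assumptions; check them against the actual output of the statistics class you use (SimpleStatistics or KozaStatistics) before relying on this, or simply use a Unix grep/cut pipeline instead.

    // FitnessCurve.java -- minimal sketch for Problem 1(b)-(c), not a required solution.
    // ASSUMPTIONS: the statistics file name (out.stat) and the "Generation:" / "Fitness:"
    // line prefixes are guesses; verify them against your statistics class's output
    // and adjust the tokens or column positions accordingly.
    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    public class FitnessCurve {
        public static void main(String[] args) throws IOException {
            String file = (args.length > 0) ? args[0] : "out.stat";
            BufferedReader in = new BufferedReader(new FileReader(file));
            String line;
            int generation = -1;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.startsWith("Generation:")) {
                    // Remember the generation number for the block we are in.
                    String tok = line.substring("Generation:".length()).trim().split("\\s+")[0];
                    generation = Integer.parseInt(tok);
                } else if (line.startsWith("Fitness:") && generation >= 0) {
                    // Emit one tab-separated (generation, fitness) pair per generation,
                    // ready to be piped into GNUplot or pasted into Excel.
                    String fitness = line.substring("Fitness:".length()).trim().split("\\s+")[0];
                    System.out.println(generation + "\t" + fitness);
                    generation = -1;  // take only the first fitness line per block
                }
            }
            in.close();
        }
    }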
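For Problem 2: one possible (not prescribed) representation is to convert the scalar series into fixed-width sliding-window training cases, with one GP terminal per lagged value, in the spirit of Tutorial 4's regression cases. The sketch below assumes a hypothetical data file santafe_a.dat with one value per line and a window width of W = 8; both are placeholders for your own design decisions.

    // TimeSeriesWindows.java -- illustrative sketch for Problem 2 (one possible
    // representation, not a prescribed one).
    // ASSUMPTIONS: the data file name (santafe_a.dat, one value per line) and the
    // window width W are placeholders; choose them to match your PS3 data and design.
    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    public class TimeSeriesWindows {
        public static void main(String[] args) throws IOException {
            final int W = 8;  // number of lagged values x(t-W)..x(t-1) used as GP terminals
            List<Double> series = new ArrayList<Double>();
            BufferedReader in = new BufferedReader(new FileReader("santafe_a.dat"));
            for (String line; (line = in.readLine()) != null; ) {
                line = line.trim();
                if (line.length() > 0) series.add(Double.valueOf(line));
            }
            in.close();

            // Each training case is (x(t-W), ..., x(t-1)) -> x(t): the evolved GP tree
            // reads the lagged values through its terminal set, and its output is
            // scored (e.g., by squared error) against the target x(t).
            for (int t = W; t < series.size(); t++) {
                StringBuilder row = new StringBuilder();
                for (int lag = W; lag >= 1; lag--) {
                    row.append(series.get(t - lag)).append('\t');
                }
                row.append(series.get(t));  // prediction target
                System.out.println(row);
            }
        }
    }

A wider window gives the GP more context at the cost of a larger terminal set; whatever width you settle on, record it and the resulting predictions in mp4-2.txt.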
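For Problem 3: the learning-curve bookkeeping can be as simple as the sketch below, which computes validation-set accuracy from predicted vs. actual class labels (e.g., eastbound vs. westbound) and appends one (training size, accuracy) row to a GNUplot-readable data file. How you obtain the predictions from the theory Progol induces is up to you and is not shown; the output file name mp4-3-curve.dat is a placeholder, and your final accuracy figures should still be recorded in mp4-3.txt.

    // LearningCurve.java -- bookkeeping sketch for Problem 3's learning curve
    // (illustrative only; obtaining predictions from the induced Progol theory
    // is up to you and is not shown here).
    import java.io.FileWriter;
    import java.io.IOException;
    import java.io.PrintWriter;

    public class LearningCurve {

        // Fraction of held-out validation examples whose predicted class
        // (e.g., eastbound vs. westbound) matches the true class.
        public static double accuracy(boolean[] predicted, boolean[] actual) {
            int correct = 0;
            for (int i = 0; i < actual.length; i++) {
                if (predicted[i] == actual[i]) correct++;
            }
            return (double) correct / actual.length;
        }

        // Append one "trainingSize <tab> accuracy" row to a data file (placeholder name),
        // e.g. plot it in GNUplot with:  plot "mp4-3-curve.dat" using 1:2 with linespoints
        public static void record(int trainingSize, double acc) throws IOException {
            PrintWriter out = new PrintWriter(new FileWriter("mp4-3-curve.dat", true));
            out.println(trainingSize + "\t" + acc);
            out.close();
        }
    }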