id3
Class TDDTInducer

java.lang.Object
  |
  +--shared.BaseInducer
        |
        +--shared.Inducer
              |
              +--id3.TDDTInducer
Direct Known Subclasses:
ID3Inducer

public abstract class TDDTInducer
extends Inducer

A top-down decision-tree (TDDT) inducer builds decision trees top-down, by forming smaller training sets and inducing trees for them recursively. The tree built has a categorizer at each node, which determines how to branch, i.e., to which child to branch or whether to classify. The common cases are AttrCategorizers, which simply return the value of a given attribute, and ThresholdCategorizers, which return 0 or 1 depending on whether an attribute is below or above a given threshold (valid only for real attributes). The leaves are usually constant categorizers, i.e., they return a constant value independent of the instance.

The induction algorithm calls best_split(), an abstract method, to determine the best root split. Once the split has been chosen, the data in the node is partitioned according to the categorizer best_split() returns. A node is formed, and the algorithm is called recursively on each of the children. As each child returns with a subtree, it is connected to the root that was split. ID3Inducer, for example, implements best_split() using information gain, but other methods are possible. Because best_split() can return any categorizer, oblique trees with perceptrons at the nodes, recursive trees, etc., are possible. The leaves may likewise hold any classifier, so perceptron trees (Utgoff) or a nearest-neighbor categorizer within a leaf can be created.
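For illustration, a concrete subclass might plug into this recursion roughly as follows (a minimal sketch; MyInducer and its internals are hypothetical, not part of the library):

    public class MyInducer extends TDDTInducer {
        public MyInducer(java.lang.String dscr) { super(dscr); }
        public MyInducer(java.lang.String dscr, CGraph aCgraph) { super(dscr, aCgraph); }

        // Chooses the split for the current node; TDDTInducer partitions the
        // data on the returned categorizer and recurses on each child.
        public NodeCategorizer best_split(java.util.LinkedList catNames) {
            return null; // split selection (e.g., by information gain) omitted
        }

        // Supplies the inducer used for each child's recursive call.
        public TDDTInducer create_subinducer(java.lang.String dscr, CGraph aCgraph) {
            MyInducer sub = new MyInducer(dscr, aCgraph);
            sub.copy_options(this); // children inherit this inducer's options
            return sub;
        }
    }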

Complexity :

The complexity of train() is proportional to the number of nodes in the resulting tree times the time needed to decide on the split categorizer (done by the derived classes via best_split()). predict() takes time proportional to the sum of the categorizers' times along the path from the root to a leaf.

Enhancements :

We may speed things up by adding an option to test only splits where the class label changes, as sketched below. For some measures (e.g., entropy), it can be shown that a split will never be made between two instances with the same class label (Fayyad, IJCAI-93, page 1022; Machine Learning journal, Vol. 8, No. 1, page 87, 1992). We may also wish to discretize the real values first. By making them linearly ordered discrete values, we can use the regular counters and things will be faster (note, however, that the split will usually remain special, since it is a binary threshold split, not a multi-way split).
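A minimal sketch of the first idea, assuming instances sorted by attribute value with class labels aligned (names here are illustrative, not library code):

    import java.util.ArrayList;
    import java.util.List;

    // Emit candidate thresholds only between consecutive instances whose
    // class labels differ; cuts between same-label neighbors are never
    // optimal for measures such as entropy.
    static List<Double> boundaryThresholds(double[] values, int[] labels) {
        List<Double> cuts = new ArrayList<Double>();
        for (int i = 1; i < values.length; i++)
            if (labels[i] != labels[i - 1] && values[i] != values[i - 1])
                cuts.add(Double.valueOf((values[i] + values[i - 1]) / 2));
        return cuts;
    }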

Another problem is with attributes that have many values, for example social-security-number. Computing all cut points can be very expensive. We may want to skip such attributes by requiring that each value cover at least some number of instances; a minimal guard of this kind is sketched below. Utgoff in ML94 (page 322) mentions that ID attributes slow his system down considerably. The problem, of course, is that if you threshold, it sometimes makes sense to split on such attributes. Taken to the extreme, if we had a truly real-valued attribute, all values would differ with probability 1, and hence we would skip the attribute.
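A minimal guard of this kind (hypothetical helper, not library code):

    // Skip an attribute unless, on average, each distinct value covers at
    // least minInstancesPerValue training instances; an ID-like attribute
    // with nearly all-distinct values fails this test.
    static boolean worthSplitting(int numInstances, int numDistinctValues,
                                  int minInstancesPerValue) {
        return (long) numDistinctValues * minInstancesPerValue <= numInstances;
    }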

To speed things up, we may want an Inducer that accepts a decision tree and builds directly into it (vs. getting a graph). Another option is to do the recursion by calling a function instead of creating an actual inducer object. The advantage of the current method is that it allows a subclass to keep track of the number of levels (useful for lookahead, for example). Yet another option is to "recycle" inducers by reusing this instance and just changing the training set.

We currently split the instances but keep the original structure; that is, we do not actually delete the attribute tested on. It may be faster in some cases to create a new list without the attribute. The disadvantage is that for multi-valued attributes we may wish to branch again, so we cannot always delete. The same goes for tests that are not attributes (e.g., conjunctions).


Field Summary
static byte allOrNothing
          LeafDistType value.
static byte confidence
          Pruning method value.
static byte error
          Evaluation metric value.
static byte evidenceProjection
          LeafDistType value.
static byte frequencyCounts
          LeafDistType value.
static byte KLdistance
          Pruning method value.
static byte laplaceCorrection
          LeafDistType value.
static byte linear
          Pruning method value.
static byte logLoss
          Evaluation metric value.
static byte lossConfidence
          Pruning method value.
static byte lossLaplace
          Pruning method value.
static byte MSE
          Evaluation metric value.
static byte none
          Pruning method value.
static byte penalty
          Pruning method value.
 
Fields inherited from class shared.BaseInducer
AHA_IB_INDUCER, AM_INDUCER, BAGGING_INDUCER, BOOSTER_INDUCER, C45_INDUCER, C45AP_INDUCER, C45R_INDUCER, C50_INDUCER, CART_INDUCER, CatDT_INDUCER, CF_INDUCER, CLUSTER_INDUCER, CN2_INDUCER, CONST_INDUCER, COODG_INDUCER, DDT_INDUCER, DF_INDUCER, DISC_NB_INDUCER, DISC_SEARCH_INDUCER, DISC_TAB_INDUCER, ENTROPY_ODG_INDUCER, FCF_INDUCER, FSS_INDUCER, getEnv, HOODG_INDUCER, IB_INDUCER, ID3_INDUCER, LAZY_DT_INDUCER, LIST_HOODG_INDUCER, LIST_ODG_INDUCER, logOptions, NAIVE_BAYES_INDUCER, NULL_INDUCER, OC1_INDUCER, ODT_INDUCER, ONER_INDUCER, OODG_INDUCER, ORDER_FSS_INDUCER, PEBLS_INDUCER, PERCEPTRON_INDUCER, PERF_EST_INDUCER, PROJECT_INDUCER, RIPPER_INDUCER, SGI_DT_INDUCER, STACKING_INDUCER, T2_INDUCER, TABLE_CAS_INDUCER, TABLE_INDUCER, TDDT_INDUCER, TS, WEIGHT_SEARCH_INDUCER, WINNOW_INDUCER
 
Constructor Summary
TDDTInducer(java.lang.String dscr)
          Constructor.
TDDTInducer(java.lang.String descr, CGraph aCgraph)
          Constructor.
TDDTInducer(TDDTInducer source)
          Copy constructor.
 
Method Summary
protected  void accumulate_tree_stats()
          Sets statistical information about the tree.
 double addErrs(double N, double e)
          Estimates the additional errors for confidence-based (C4.5-style) pruning, given N instances and e observed errors.
abstract  NodeCategorizer best_split(java.util.LinkedList catNames)
          Best_split finds the best split in the node and returns a categorizer implementing it.
protected  void build_categorizer(DecisionTree dTree)
          Builds a decision tree categorizer for the given DecisionTree.
 double calc_KL_distance(DecisionTree decisionTree)
          Determines the Kullback-Leibler distance between the root of the decision tree and all of its children.
protected  void connect(CatGraph catGraph, Node from, Node to, int edgeVal, java.lang.String edgeName)
          Connects two nodes in the specified CatGraph.
 void copy_options(TDDTInducer inducer)
          Copies the option settings from the given TDDTInducer.
 LeafCategorizer create_leaf_categorizer(double totalWeight, int[] tieBreakingOrder, DoubleRef numErrors, DoubleRef pessimisticErrors)
          Creates a leaf categorizer (has no children).
 LeafCategorizer create_leaf_categorizer(double totalWeight, int[] tieBreakingOrder, DoubleRef numErrors, DoubleRef pessimisticErrors, double[] distrArray)
          Creates a leaf categorizer (has no children).
abstract  TDDTInducer create_subinducer(java.lang.String dscr, CGraph aCgraph)
          Creates the Inducer used for the recursive calls.
 boolean get_adjust_thresholds()
          Returns whether thresholds should be adjusted to equal instance values.
 Categorizer get_categorizer()
          Returns the categorizer that the Inducer has generated.
 boolean get_cont_mdl_adjust()
          Returns whether the Minimum Description Length Adjustment for continuous attributes should be applied to mutual info.
 boolean get_debug()
          Accesses the debug variable.
 boolean get_empty_node_parent_dist()
          Returns whether an empty node should have the parent's distribution.
 byte get_evaluation_metric()
          Accesses the evaluation metric used.
 double get_evidence_factor()
          Returns the evidence correction factor.
 boolean get_have_continuous_attributes()
          Returns whether there are continuous attributes present in the data.
 byte get_leaf_dist_type()
          Returns the type of distribution to build at leaves.
 int get_level()
          Returns the level set for this TDDTInducer.
 double get_lower_bound_min_split_weight()
          Returns the lower bound for minimum split weight.
 double get_m_estimate_factor()
          Returns the m-estimate factor for laplace.
 int get_max_level()
          Returns the maximum level which may be set for a TDDTInducer.
 double get_min_split_weight_percent()
          Returns the percentage value for minimum split weight.
 boolean get_nominal_lbound_only()
          Returns TRUE if the minimum-split-weight bounds are not used for nominal attributes, FALSE otherwise.
 boolean get_parent_tie_breaking()
          Returns whether the parent's tie-breaking order is used for distribution ties.
 boolean get_pruning_branch_replacement()
          Returns whether pruning should allow replacing a node with its largest subtree.
 double get_pruning_factor()
          Returns the factor of how much pruning should be done.
 byte get_pruning_method()
          Returns the Pruning method to be used.
 double get_smooth_factor()
          Returns the exponential factor for smoothing.
 int get_smooth_inst()
          Returns the number of thresholds on either side to use for smoothing.
 byte get_split_score_criterion()
          Returns the criterion used for split scoring.
protected  double get_total_inst_weight()
          Returns the total weight of instances in the data set this inducer is using.
 boolean get_unknown_edges()
          Returns whether unknown edges are allowed.
 double get_upper_bound_min_split_weight()
          Returns the upper bound for minimum split weight.
 Node getNewNode(Node node, int category)
          Creates a new categorizer Node.
 Node induce_decision_tree(CGraph aCgraph, int[] tieBreakingOrder, DoubleRef numSubtreeErrors, DoubleRef pessimisticSubtreeErrors, IntRef numLeaves, int remainingSiblings)
          Induce a decision tree in the given graph.
protected  void induce_tree_from_split(DecisionTree decisionTree, NodeCategorizer splitCat, java.util.LinkedList catNames, int[] tieBreakingOrder, DoubleRef numSubtreeErrors, DoubleRef pessimisticSubtreeErrors, IntRef numLeaves, int remainingSiblings)
          Induce decision tree from a given split.
static double kullback_leibler_distance(double[] p, double[] q)
          Computes a Kullback-Leibler distance metric given arrays of p(x) and q(x) for all x.
protected  double laplace_subtree_loss(LogOptions logOptions, DecisionTree dt, Node subtree, InstanceList il, int[] tieBreakingOrder)
          Computes the subtree loss, skewing the leaf weight distributions using the laplace correction.
 java.lang.String name_sub_inducer(java.lang.String catDescr, int catNum)
          Create a string to name the subinducer.
protected static double num_cat_errors(Categorizer cat, int predictClass, double totalWeight)
          Computes the number of errors this node would make as a leaf.
 int num_nodes()
          Returns the number of nodes (categorizers) in the decision tree, including leaves.
 int num_nontrivial_leaves()
          Returns the number of leaves, excluding those that contain no Instances and have only unknown edges leading to them.
 int num_nontrivial_nodes()
          Returns the number of Nodes, excluding those that contain no Instances and have only unknown edges leading to them.
protected  double pessimistic_subtree_errors(LogOptions logOptions, DecisionTree dt, Node subtree, InstanceList il, double zValue, int[] tieBreakingOrder)
          Computes the pessimistic errors for a subtree given a training set.
protected  double pessimistic_subtree_loss(LogOptions logOptions, DecisionTree dt, Node subtree, InstanceList il, double zValue, int[] tieBreakingOrder)
          Computes the pessimistic loss for a subtree given a training set.
 double predictErrors(int total, int misclass)
          Predicts the number of errors given the total instance count and the number misclassified.
 void prune_subtree(DecisionTree decisionTree, int[] tieBreakingOrder, Node largestChild, DoubleRef numSubtreeErrors, DoubleRef pessimisticSubtreeErrors, IntRef numLeaves)
          When the subtree rooted at the current node does not improve the error, the subtree may be replaced by a leaf or by its largest child.
 void prune(boolean prune)
          Sets the errorprune value to the given value.
 double prune(DecisionTree dTree, Node node, InstanceList il, boolean prune)
          Prunes decision tree.
 Categorizer release_categorizer()
          Gives ownership of the generated categorizer to the caller, reverting the Inducer to an untrained state.
 void set_adjust_thresholds(boolean b)
          Sets whether thresholds should be adjusted to equal instance values.
 void set_cont_mdl_adjust(boolean val)
          Sets whether the Minimum Description Length Adjustment for continuous attributes should be applied to mutual info.
 void set_debug(boolean val)
          Sets the debugging option.
 void set_empty_node_parent_dist(boolean b)
          Sets whether an empty node should have the parent's distribution.
 void set_evaluation_metric(byte metric)
          Sets the evaluation metric used.
 void set_evidence_factor(double factor)
          Sets the evidence correction factor.
 void set_have_continuous_attributes(boolean val)
          Sets whether there are continuous attributes present in the data.
 void set_leaf_dist_type(byte type)
          Sets the type of distribution to build at leaves.
 void set_level(int lvl)
          Sets the level of this TDDTInducer.
 void set_lower_bound_min_split_weight(double val)
          Sets the lower bound for minimum split weight.
 void set_m_estimate_factor(double factor)
          Sets the m-estimate factor for laplace.
 void set_max_level(int level)
          Sets the maximum level for a TDDTInducer.
 void set_min_split_weight_percent(double val)
          Sets a new percentage value for minimum split weight.
 void set_nominal_lbound_only(boolean val)
          Sets whether the minimum-split-weight bounds are used for nominal attributes.
 void set_parent_tie_breaking(boolean b)
          Sets whether the parent's tie-breaking order is used for distribution ties.
 void set_pruning_branch_replacement(boolean b)
          Sets whether pruning should allow replacing a node with its largest subtree.
 void set_pruning_factor(double val)
          Sets the factor of how much pruning should be done.
 void set_pruning_method(byte pM)
          Sets the Pruning method to be used.
 void set_smooth_factor(double factor)
          Sets the exponential factor for smoothing.
 void set_smooth_inst(int inst)
          Sets the number of thresholds on either side to use for smoothing.
 void set_split_score_criterion(byte val)
          Sets the criterion used for split scoring.
protected  void set_total_inst_weight(double wt)
          Sets the total weight of instances in the data set this inducer is currently using.
 void set_unknown_edges(boolean val)
          Sets whether unknown categories are allowable for edges of the decision tree.
 void set_upper_bound_min_split_weight(double val)
          Sets the upper bound for minimum split weight.
 void set_user_options(java.lang.String prefix)
          Sets the user options according to the option file.
 int train_no_stats()
          Trains the inducer on a stored dataset.
 void train()
          Trains the inducer on a stored dataset and collects statistics.
 boolean was_trained(boolean fatalOnFalse)
          Checks if this inducer has a valid decision tree.
 
Methods inherited from class shared.Inducer
can_cast_to_inducer, cast_to_inducer, display_struct, display_struct, project_train_and_perf_files, project_train_and_perf_files, project_train_and_perf_files, project_train_and_perf_files, project_train_and_perf, project_train_and_test_files, project_train_and_test_files, project_train_and_test_files, project_train_and_test_files, project_train_and_test, supports_full_testing, train_and_perf, train_and_test
 
Methods inherited from class shared.BaseInducer
assign_data, can_cast_to_incr_inducer, cast_to_incr_inducer, class_id, description, get_log_level, get_log_options, get_log_stream, has_data, has_data, instance_list, normalize_weights, read_data, read_data, read_data, release_data, set_log_level, set_log_options, set_log_prefixes, set_log_stream, train_and_test_files, train_and_test_files, train_and_test_files, train_and_test_files
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

none

public static final byte none
Pruning method value.

confidence

public static final byte confidence
Pruning method value.

penalty

public static final byte penalty
Pruning method value.

linear

public static final byte linear
Pruning method value.

KLdistance

public static final byte KLdistance
Pruning method value.

lossConfidence

public static final byte lossConfidence
Pruning method value.

lossLaplace

public static final byte lossLaplace
Pruning method value.

allOrNothing

public static final byte allOrNothing
LeafDistType value.

frequencyCounts

public static final byte frequencyCounts
LeafDistType value.

laplaceCorrection

public static final byte laplaceCorrection
LeafDistType value.

evidenceProjection

public static final byte evidenceProjection
LeafDistType value.

error

public static final byte error
Evaluation metric value.

MSE

public static final byte MSE
Evaluation metric value.

logLoss

public static final byte logLoss
Evaluation metric value.
Constructor Detail

TDDTInducer

public TDDTInducer(java.lang.String dscr)
Constructor.
Parameters:
dscr - The description of the inducer.

TDDTInducer

public TDDTInducer(java.lang.String descr,
                   CGraph aCgraph)
Constructor.
Parameters:
descr - The description of this inducer.
aCgraph - The CGraph that will hold the decision tree.

TDDTInducer

public TDDTInducer(TDDTInducer source)
Copy constructor.
Parameters:
source - The TDDTInducer that is being copied.
Method Detail

set_level

public void set_level(int lvl)
Sets the level of this TDDTInducer.
Parameters:
lvl - The level to be set.

get_level

public int get_level()
Returns the level set for this TDDTInducer.
Returns:
This TDDTInducer's level setting.

set_total_inst_weight

protected void set_total_inst_weight(double wt)
Sets the total weight of instances in the data set this inducer is currently using.
Parameters:
wt - The weight that should be set.

get_total_inst_weight

protected double get_total_inst_weight()
Returns the total weight of instances in the data set this inducer is using.
Returns:
The weight of the instances.

get_max_level

public int get_max_level()
Returns the maximum level which may be set for a TDDTInducer.
Returns:
The maximum level.

set_max_level

public void set_max_level(int level)
Sets the maximum level for a TDDTInducer.
Parameters:
level - The new maximum level.

set_lower_bound_min_split_weight

public void set_lower_bound_min_split_weight(double val)
Sets the lower bound for minimum split weight.
Parameters:
val - The new lower bound.

get_lower_bound_min_split_weight

public double get_lower_bound_min_split_weight()
Returns the lower bound for minimum split weight.
Returns:
The lower bound for minimum split weight.

set_upper_bound_min_split_weight

public void set_upper_bound_min_split_weight(double val)
Sets the upper bound for minimum split weight.
Parameters:
val - The new upper bound.

get_upper_bound_min_split_weight

public double get_upper_bound_min_split_weight()
Returns the upper bound for minimum split weight.
Returns:
The upper bound for minimum split weight.

set_min_split_weight_percent

public void set_min_split_weight_percent(double val)
Sets a new percentage value for minimum split weight.
Parameters:
val - The new percentage.

get_min_split_weight_percent

public double get_min_split_weight_percent()
Returns the percentage value for minimum split weight.
Returns:
The percentage value for minimum split weight.

set_nominal_lbound_only

public void set_nominal_lbound_only(boolean val)
Sets whether the minimum-split-weight bounds are used for nominal attributes. TRUE indicates lowerBoundMinSplitWeight, upperBoundMinSplitWeight, and minSplitWeightPercent are not used for setting the minimum number of instances in a node for nominal attributes; FALSE indicates they will be used.
Parameters:
val - The value for the boolean option.

get_nominal_lbound_only

public boolean get_nominal_lbound_only()
Returns TRUE if the minimum-split-weight bounds are not used for nominal attributes, FALSE otherwise.
Returns:
TRUE indicates lowerBoundMinSplitWeight, upperBoundMinSplitWeight, and minSplitWeightPercent are not used for setting the minimum number of instances in a node for nominal attributes; FALSE indicates they will be used.

set_unknown_edges

public void set_unknown_edges(boolean val)
Sets whether unknown categories are allowable for edges of the decision tree.
Parameters:
val - TRUE if unknown edges are allowable, FALSE otherwise.

get_unknown_edges

public boolean get_unknown_edges()
Returns whether unknown edges are allowed.
Returns:
TRUE if unknown edges are allowable, FALSE otherwise.

get_split_score_criterion

public byte get_split_score_criterion()
Returns the criterion used for split scoring.
Returns:
The split score criterion.

set_split_score_criterion

public void set_split_score_criterion(byte val)
Sets the criterion used for split scoring.
Parameters:
val - The new split score criterion.

set_empty_node_parent_dist

public void set_empty_node_parent_dist(boolean b)
Sets whether an empty node should have the parent's distribution.
Parameters:
b - TRUE indicates an empty node should have the parent's distribution, FALSE otherwise.

get_empty_node_parent_dist

public boolean get_empty_node_parent_dist()
Returns whether an empty node should have the parent's distribution.
Returns:
TRUE indicates an empty node should have the parent's distribution, FALSE otherwise.

set_parent_tie_breaking

public void set_parent_tie_breaking(boolean b)
Sets whether the parent's tie-breaking order is used for distribution ties.
Parameters:
b - TRUE to use the parent's tie-breaking order, FALSE otherwise.

get_parent_tie_breaking

public boolean get_parent_tie_breaking()
Returns whether the parent's tie-breaking order is used for distribution ties.
Returns:
TRUE if the parent's tie-breaking order is used, FALSE otherwise.

set_pruning_method

public void set_pruning_method(byte pM)
Sets the Pruning method to be used.
Parameters:
pM - The Pruning method to be used. If the value is not NONE and pruning_factor is 0, then a node will be made a leaf when its (potential) children do not improve the error count.

get_pruning_method

public byte get_pruning_method()
Returns the Pruning method to be used.
Returns:
The Pruning method used.

set_pruning_branch_replacement

public void set_pruning_branch_replacement(boolean b)
Sets whether pruning should allow replacing a node with its largest subtree.
Parameters:
b - TRUE indicates pruning should allow replacing a node with its largest subtree, FALSE otherwise.

get_pruning_branch_replacement

public boolean get_pruning_branch_replacement()
Returns whether pruning should allow replacing a node with its largest subtree.
Returns:
TRUE indicates pruning should allow replacing a node with its largest subtree, FALSE otherwise.

set_adjust_thresholds

public void set_adjust_thresholds(boolean b)
Sets whether thresholds should be adjusted to equal instance values.
Parameters:
b - TRUE indicates thresholds should be adjusted to equal instance values, FALSE otherwise.

get_adjust_thresholds

public boolean get_adjust_thresholds()
Returns whether thresholds should be adjusted to equal instance values.
Returns:
TRUE indicates thresholds should be adjusted to equal instance values, FALSE otherwise.

set_pruning_factor

public void set_pruning_factor(double val)
Sets the factor of how much pruning should be done.
Parameters:
val - Factor of how much pruning should be done. High values indicate more pruning.

get_pruning_factor

public double get_pruning_factor()
Returns the factor of how much pruning should be done.
Returns:
Factor of how much pruning should be done. High values indicate more pruning.

get_smooth_inst

public int get_smooth_inst()
Returns the number of thresholds on either side to use for smoothing.
Returns:
Number of thresholds on either side to use for smoothing; 0 for no smoothing.

set_smooth_inst

public void set_smooth_inst(int inst)
Sets the number of thresholds on either side to use for smoothing.
Parameters:
inst - Number of thresholds on either side to use for smoothing; 0 for no smoothing.

get_smooth_factor

public double get_smooth_factor()
Returns the exponential factor for smoothing.
Returns:
The exponential factor for smoothing.

set_smooth_factor

public void set_smooth_factor(double factor)
Sets the exponential factor for smoothing.
Parameters:
factor - The new exponential factor for smoothing.

set_cont_mdl_adjust

public void set_cont_mdl_adjust(boolean val)
Sets whether the Minimum Description Length Adjustment for continuous attributes should be applied to mutual info.
Parameters:
val - TRUE if the Minimum Description Length Adjustment for continuous attributes should be applied to mutual info, FALSE otherwise.

get_cont_mdl_adjust

public boolean get_cont_mdl_adjust()
Returns whether the Minimum Description Length Adjustment for continuous attributes should be applied to mutual info.
Returns:
TRUE if the Minimum Description Length Adjustment for continuous attributes should be applied to mutual info, FALSE otherwise.

set_leaf_dist_type

public void set_leaf_dist_type(byte type)
Sets the type of distribution to build at leaves.
Parameters:
type - The type of distribution to build at leaves.

get_leaf_dist_type

public byte get_leaf_dist_type()
Returns the type of distribution to build at leaves.
Returns:
The type of distribution to build at leaves.

set_m_estimate_factor

public void set_m_estimate_factor(double factor)
Sets the m-estimate factor for laplace.
Parameters:
factor - The new m-estimate factor for laplace.

get_m_estimate_factor

public double get_m_estimate_factor()
Returns the m-estimate factor for laplace.
Returns:
The m-estimate factor for laplace.

set_evidence_factor

public void set_evidence_factor(double factor)
Sets the evidence correction factor.
Parameters:
factor - The new evidence correction factor.

get_evidence_factor

public double get_evidence_factor()
Returns the evidence correction factor.
Returns:
The evidence correction factor.

get_have_continuous_attributes

public boolean get_have_continuous_attributes()
Returns whether there are continuous attributes present in the data.
Returns:
TRUE indicates there are continuous attributes in the data, FALSE otherwise.

set_have_continuous_attributes

public void set_have_continuous_attributes(boolean val)
Sets whether there are continuous attributes present in the data.
Parameters:
val - TRUE indicates there are continuous attributes in the data, FALSE otherwise.

get_debug

public boolean get_debug()
Accesses the debug variable.
Returns:
TRUE if debugging statements are active, FALSE otherwise.

set_debug

public void set_debug(boolean val)
Sets the debugging option.
Parameters:
val - TRUE if debugging statements are active, FALSE otherwise.

set_evaluation_metric

public void set_evaluation_metric(byte metric)
Sets the evaluation metric used.
Parameters:
metric - The evaluation metric to be used.

get_evaluation_metric

public byte get_evaluation_metric()
Accesses the evaluation metric used.
Returns:
The evaluation metric to be used.

accumulate_tree_stats

protected void accumulate_tree_stats()
Sets statistical information about the tree. This information consists of the total number of nontrivial nodes and the total number of attributes.

copy_options

public void copy_options(TDDTInducer inducer)
Copies the option settings from the given TDDTInducer.
Parameters:
inducer - The TDDTInducer with the options to be copied.

set_user_options

public void set_user_options(java.lang.String prefix)
Sets the user options according to the option file.
Parameters:
prefix - The prefix for the option names.

train

public void train()
Trains the inducer on a stored dataset and collects statistics.
Overrides:
train in class Inducer

train_no_stats

public int train_no_stats()
Trains the inducer on a stored dataset. No statistics are collected for the resulting categorizer.
Returns:
The number of attributes.

num_nontrivial_nodes

public int num_nontrivial_nodes()
Returns the number of Nodes, excluding those that contain no Instances and have only unknown edges leading to them.
Returns:
The number of such nontrivial Nodes.

num_nontrivial_leaves

public int num_nontrivial_leaves()
Returns the number of leaves, excluding those that contain no Instances and have only unknown edges leading to them.
Returns:
The number of such nontrivial leaves.

was_trained

public boolean was_trained(boolean fatalOnFalse)
Checks if this inducer has a valid decision tree.
Parameters:
fatalOnFalse - TRUE if an error message should be displayed if the inducer is not trained, FALSE otherwise.
Returns:
TRUE iff this inducer has a valid DecisionTree categorizer.

induce_decision_tree

public Node induce_decision_tree(CGraph aCgraph,
                                 int[] tieBreakingOrder,
                                 DoubleRef numSubtreeErrors,
                                 DoubleRef pessimisticSubtreeErrors,
                                 IntRef numLeaves,
                                 int remainingSiblings)
Induce a decision tree in the given graph.
Parameters:
aCgraph - The graph that will contain the decision tree.
tieBreakingOrder - The order for breaking distribution ties.
numSubtreeErrors - Number of errors to this point.
pessimisticSubtreeErrors - Error estimate if this was a leaf node.
numLeaves - The number of leaves in the decision tree.
remainingSiblings - Siblings that have not been induced yet.
Returns:
The root node of the decision tree.

build_categorizer

protected void build_categorizer(DecisionTree dTree)
Builds a decision tree categorizer for the given DecisionTree.
Parameters:
dTree - The DecisionTree to use for creating the categorizer.

best_split

public abstract NodeCategorizer best_split(java.util.LinkedList catNames)
Best_split finds the best split at the node and returns a categorizer implementing it. It fills catNames with the names of the resulting categories.
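As a concrete example, the entropy computation underlying ID3Inducer's information-gain score might look as follows (a sketch, not the actual implementation):

    // Entropy, in bits, of a class distribution given by (weighted) counts.
    static double entropy(double[] classCounts) {
        double total = 0.0;
        for (int i = 0; i < classCounts.length; i++)
            total += classCounts[i];
        double h = 0.0;
        for (int i = 0; i < classCounts.length; i++) {
            double p = classCounts[i] / total;
            if (p > 0)
                h -= p * Math.log(p) / Math.log(2.0);
        }
        return h;
    }

A best_split() implementation would score each candidate split by the reduction in entropy it yields and return a categorizer for the best one.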
Parameters:
catNames - The list to be filled with the names of the resulting categories.
Returns:
The NodeCategorizer implementing the chosen split.

num_cat_errors

protected static double num_cat_errors(Categorizer cat,
                                       int predictClass,
                                       double totalWeight)
Computes the number of errors this node would make as a leaf. If totalWeight is zero, the distribution is ignored, else totalWeight must be the sum of the distribution counts.
Parameters:
cat - The Categorizer for the node being checked.
predictClass - The category for which this node is being checked.
totalWeight - The weight of all instances in a data set.
Returns:
The number of errors this node would make if it were a leaf on the decision tree.

create_leaf_categorizer

public LeafCategorizer create_leaf_categorizer(double totalWeight,
                                               int[] tieBreakingOrder,
                                               DoubleRef numErrors,
                                               DoubleRef pessimisticErrors)
Creates a leaf categorizer (has no children). We currently create a ConstCategorizer with a description and the majority category. Note that the augCategory will contain the correct string, but the description will contain more information which may help when displaying the graph. The augCategory string must be the same for CatTestResult to work properly (it compares the actual string for debugging purposes).
Parameters:
totalWeight - The total weight of the training data set.
tieBreakingOrder - Order for breaking distribution ties.
numErrors - The number of errors this LeafCategorizer will produce.
pessimisticErrors - Error estimate if this was a leaf node.
Returns:
The LeafCategorizer created.

create_leaf_categorizer

public LeafCategorizer create_leaf_categorizer(double totalWeight,
                                               int[] tieBreakingOrder,
                                               DoubleRef numErrors,
                                               DoubleRef pessimisticErrors,
                                               double[] distrArray)
Creates a leaf categorizer (has no children). We currently create a ConstCategorizer with a description and the majority category. If the distrArray is given, we don't reference the training set, except for its schema (used in pruning).

Note that the augCategory will contain the correct string, but the description will contain more information which may help when displaying the graph. The augCategory string must be the same for CatTestResult to work properly (it compares the actual string for debugging purposes).

Parameters:
totalWeight - The total weight of the training data set.
tieBreakingOrder - Order for breaking distribution ties.
numErrors - The number of errors this LeafCategorizer will produce.
pessimisticErrors - Error estimate if this was a leaf node.
distrArray - Distribution of weight over labels.
Returns:
The LeafCategorizer created.

induce_tree_from_split

protected void induce_tree_from_split(DecisionTree decisionTree,
                                      NodeCategorizer splitCat,
                                      java.util.LinkedList catNames,
                                      int[] tieBreakingOrder,
                                      DoubleRef numSubtreeErrors,
                                      DoubleRef pessimisticSubtreeErrors,
                                      IntRef numLeaves,
                                      int remainingSiblings)
Induce decision tree from a given split. The split is provided in the form of a categorizer, which picks which subtree a given instance will follow.
Parameters:
decisionTree - Decision tree induced.
splitCat - The categorizer for this split.
catNames - List of category names.
tieBreakingOrder - Order for breaking distribution ties.
numSubtreeErrors - Number of errors this subtree produces in categorization of Instances.
pessimisticSubtreeErrors - Error estimate if this was a leaf node.
numLeaves - Number of leaves on a subtree.
remainingSiblings - Siblings that have not been induced yet.

connect

protected void connect(CatGraph catGraph,
                       Node from,
                       Node to,
                       int edgeVal,
                       java.lang.String edgeName)
Connects two nodes in the specified CatGraph.
Parameters:
catGraph - The CatGraph containing these nodes.
from - The node from which the edge originates.
to - The node to which the edge connects.
edgeVal - The value of the AugCategory associated with that edge.
edgeName - The name of the edge.

name_sub_inducer

public java.lang.String name_sub_inducer(java.lang.String catDescr,
                                         int catNum)
Create a string to name the subinducer. We just append some basic info.
Parameters:
catDescr - The description of this subinducer.
catNum - The category number for which this subinducer is inducing.
Returns:
The name of the subinducer.

create_subinducer

public abstract TDDTInducer create_subinducer(java.lang.String dscr,
                                              CGraph aCgraph)
Creates the Inducer used for the recursive calls. Note that since this is an abstract class, it cannot create a copy of itself.
Parameters:
dscr - The description for the sub inducer.
aCgraph - The categorizer graph to use for the subinducer.
Returns:
The new subinducer.

prune_subtree

public void prune_subtree(DecisionTree decisionTree,
                          int[] tieBreakingOrder,
                          Node largestChild,
                          DoubleRef numSubtreeErrors,
                          DoubleRef pessimisticSubtreeErrors,
                          IntRef numLeaves)
When the subtree rooted at the current node does not improve the error, the subtree may be replaced by a leaf or by its largest child. This serves as a collapsing mechanism when the pruning factor is 0, i.e., we collapse the subtree if it makes the same number of errors as its children combined.

"Confidence" pruning is based on C4.5's pruning method. "Penalty" pruning is based on "Pessimistic Decision tree pruning based on tree size" by Yishay Mansour, ICML-97. "Linear" pruning is used to implement cost-complexity pruning as described in CART. Its use is not recommended otherwise. "KLdistance" pruning uses the Kullback-Leibler distance metric to determine whether to prune.

This function is divided into three main parts. First, initial checks are performed and values are set. Second, the test specific to each pruning method is performed. Finally, if pruning is warranted, it is carried out. The core collapsing test is sketched below.
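The collapsing test at the heart of the second part can be pictured as follows (an illustrative sketch with hypothetical names, not the actual code):

    // Prune when the subtree's estimated errors do not improve on what a
    // single leaf would make; a pruning factor above 0 makes the test
    // easier to satisfy, i.e., prunes more aggressively.
    static boolean shouldPrune(double subtreeErrors, double leafErrors,
                               double pruningFactor) {
        return subtreeErrors + pruningFactor >= leafErrors;
    }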

Parameters:
decisionTree - Tree to be pruned.
tieBreakingOrder - Order for breaking distribution ties.
largestChild - The largest child node.
numSubtreeErrors - Number of errors this subtree produces in categorization of Instances.
pessimisticSubtreeErrors - Error estimate if this was a leaf node.
numLeaves - Number of leaves on a subtree.

laplace_subtree_loss

protected double laplace_subtree_loss(LogOptions logOptions,
                                      DecisionTree dt,
                                      Node subtree,
                                      InstanceList il,
                                      int[] tieBreakingOrder)
Computes the subtree loss, skewing the leaf weight distributions using the laplace correction.
Parameters:
logOptions - The options for logging messages.
dt - DecisionTree over which the metric is to be calculated.
subtree - The current node used for calculation.
il - The InstanceList containing Instances that reach the current subtree Node.
tieBreakingOrder - Order for breaking distribution ties.
Returns:
The loss produced by this subtree.

pessimistic_subtree_errors

protected double pessimistic_subtree_errors(LogOptions logOptions,
                                            DecisionTree dt,
                                            Node subtree,
                                            InstanceList il,
                                            double zValue,
                                            int[] tieBreakingOrder)
Computes the pessimistic errors for a subtree given a training set. We only prune subtrees that have const categorizers at the leaves. Finding ways to prune nodes containing other categorizers (naive-bayes, for example) may be an interesting research topic.
Parameters:
logOptions - The options for logging information.
dt - The decision tree for which a pessimistic errors value is calculated.
subtree - The current node in the decision tree.
il - The list of instances for determining errors.
zValue - Z value for pessimistic error correction.
tieBreakingOrder - Order for breaking distribution ties.
Returns:
The number of pessimistic errors.

pessimistic_subtree_loss

protected double pessimistic_subtree_loss(LogOptions logOptions,
                                          DecisionTree dt,
                                          Node subtree,
                                          InstanceList il,
                                          double zValue,
                                          int[] tieBreakingOrder)
Computes the pessimistic loss for a subtree given a training set.
Parameters:
logOptions - The options for logging information.
dt - The decision tree for which a pessimistic loss value is calculated.
subtree - The current node in the decision tree.
il - The list of instances for determining errors.
zValue - Z value for pessimistic error correction.
tieBreakingOrder - Order for breaking distribution ties.
Returns:
The pessimistic loss.

calc_KL_distance

public double calc_KL_distance(DecisionTree decisionTree)
Determines the Kullback-Leibler distance between the root of the decision tree and all of its children.
Parameters:
decisionTree - The decision tree over which the metric is to be used.
Returns:
The distance value.

kullback_leibler_distance

public static double kullback_leibler_distance(double[] p,
                                               double[] q)
Computes a Kullback-Leibler distance metric given arrays of p(x) and q(x) for all x. The Kullback-Leibler distance is defined as:
KL(p, q) = sum over all x of p(x) * log(p(x) / q(x))

This is not identical to the one in NaiveBayesCat, as this one has been changed to handle 0's in the arrays as follows:
p(x)   q(x)   kl(p,q)
  0      0    ignore
  0     >0    0
 >0      0    error
 >0     >0    OK
Note that "ignore" and 0 are not the same, as the former does not update the count of the number of non-empty children.
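A sketch of this computation, following the table above (illustrative, not the actual source):

    // KL(p, q) with the zero-handling described above. Terms with
    // p(x) == 0 contribute nothing to the sum; the distinction between
    // "ignore" and 0 only affects a separate count of non-empty children,
    // which this sketch omits.
    static double klDistance(double[] p, double[] q) {
        double sum = 0.0;
        for (int i = 0; i < p.length; i++) {
            if (p[i] == 0) continue;
            if (q[i] == 0)
                throw new IllegalArgumentException("p(x) > 0 but q(x) == 0");
            sum += p[i] * Math.log(p[i] / q[i]);
        }
        return sum;
    }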

Parameters:
p - The child node distribution.
q - The parent node distribution.
Returns:
The distance value.

get_categorizer

public Categorizer get_categorizer()
Returns the categorizer that the Inducer has generated.
Overrides:
get_categorizer in class Inducer
Returns:
The generated Categorizer.

num_nodes

public int num_nodes()
Returns the number of nodes (categorizers) in the decision tree, including leaves.
Returns:
An integer representing the number of categorizers (Node and Leaf) in the decision tree.

release_categorizer

public Categorizer release_categorizer()
Gives ownership of the generated categorizer to the caller, reverting the Inducer to an untrained state.
Overrides:
release_categorizer in class Inducer
Returns:
The Categorizer created by training this Inducer.

prune

public double prune(DecisionTree dTree,
                    Node node,
                    InstanceList il,
                    boolean prune)
Prunes decision tree.
Parameters:
dTree - The decision tree to be pruned.
node - The current node in the decision tree.
il - Instance list for determining errors.
prune - TRUE if the decision tree is to be pruned, FALSE otherwise.
Returns:
The error of the pruned tree.

predictErrors

public double predictErrors(int total,
                            int misclass)
Predicts the number of errors given the total instance count and the number misclassified.
Parameters:
total - The total number of instances.
misclass - The number of misclassified instances.
Returns:
The predicted number of errors.

addErrs

public double addErrs(double N,
                      double e)
Estimates the additional errors for confidence-based (C4.5-style) pruning, given N instances and e observed errors.
Parameters:
N - The number of instances.
e - The number of observed errors.
Returns:
The estimated number of additional errors.

getNewNode

public Node getNewNode(Node node,
                       int category)
Creates a new categorizer Node.
Parameters:
node - The Node for which a categorizer Node is to be created.
category - The category to be stored in this Node.
Returns:
The new Node if not a leaf or the old Node if the result is a leaf.

prune

public void prune(boolean prune)
Sets the error-pruning flag to the given value.
Parameters:
prune - The new value for the error-pruning flag.