java.lang.Object
  +--shared.BaseInducer
        +--shared.Inducer
              +--id3.TDDTInducer
Top-down decision-tree (TDDT) inducers induce decision trees top-down by building smaller training sets and inducing trees for them recursively. The decision tree built has a categorizer at each node, and these determine how to branch, i.e., to which child to branch, or whether to classify. The common cases are AttrCategorizers, which simply return the value of a given attribute of the instance, and ThresholdCategorizers, which return 0 or 1 depending on whether an attribute is less than or greater than a given threshold (valid only for real attributes). The leaves are usually constant categorizers, i.e., they return a constant value independent of the instance.
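As an illustration (hypothetical code, not the MLJ classes), the two node-categorizer behaviors reduce to simple functions of an instance:

    // Illustrative sketches only; not the actual AttrCategorizer /
    // ThresholdCategorizer implementations.
    static int attrBranch(int[] nominalInstance, int attrNum) {
        return nominalInstance[attrNum];                  // branch index = attribute value
    }
    static int thresholdBranch(double[] realInstance, int attrNum, double threshold) {
        return realInstance[attrNum] < threshold ? 0 : 1; // 0 below, 1 at or above
    }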
The induction algorithm calls best_split, an abstract method, to determine the best root split. Once the split has been chosen, the data in the node is partitioned according to the categorizer best_split returns. A node is formed, and the algorithm is called recursively on each of the children. Once each child returns with a subtree, we connect it under the root whose split we chose. ID3Inducer, for example, implements best_split using information gain, but other methods are possible. best_split() can return any categorizer, thus opening the possibility of oblique trees with perceptrons at nodes, recursive trees, etc. The leaves can also hold any categorizer; thus perceptron trees (Utgoff) can be created, or a nearest-neighbor classifier within a leaf, etc.
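To make the recursion concrete, here is a self-contained toy version for a single real attribute and integer class labels. The names (induce, bestThreshold, Node) are illustrative stand-ins for induce_decision_tree(), best_split() and the categorizer nodes; this is a sketch of the idea, not the MLJ implementation:

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Toy top-down induction: pick a split, partition, recurse, connect.
    public class TinyTddt {
        static class Node {
            Double threshold;   // non-null: internal node testing x < threshold
            int label;          // used when threshold == null (constant leaf)
            Node left, right;   // children for x < threshold and x >= threshold
        }

        // Expects a non-empty training set.
        static Node induce(List<Double> x, List<Integer> y) {
            double t = isPure(y) ? Double.NaN : bestThreshold(x, y);
            if (Double.isNaN(t))
                return leaf(y);                  // pure node or no usable cut point
            Node root = new Node();
            root.threshold = t;
            List<Double> xl = new ArrayList<>(), xr = new ArrayList<>();
            List<Integer> yl = new ArrayList<>(), yr = new ArrayList<>();
            for (int i = 0; i < x.size(); i++) { // split data by the chosen test
                if (x.get(i) < t) { xl.add(x.get(i)); yl.add(y.get(i)); }
                else              { xr.add(x.get(i)); yr.add(y.get(i)); }
            }
            root.left = induce(xl, yl);          // recurse on each child's data,
            root.right = induce(xr, yr);         // then connect subtrees to the root
            return root;
        }

        // Entropy-minimizing midpoint between adjacent distinct values
        // (the role best_split plays; ID3Inducer uses information gain).
        static double bestThreshold(List<Double> x, List<Integer> y) {
            List<Integer> idx = new ArrayList<>();
            for (int i = 0; i < x.size(); i++) idx.add(i);
            idx.sort(Comparator.comparingDouble(x::get));
            double bestT = Double.NaN, bestH = Double.POSITIVE_INFINITY, n = x.size();
            for (int k = 1; k < idx.size(); k++) {
                double lo = x.get(idx.get(k - 1)), hi = x.get(idx.get(k));
                double t = (lo + hi) / 2;
                if (lo == hi || t <= lo) continue;   // no cut between equal values
                double h = (k / n) * entropy(y, idx, 0, k)
                         + ((n - k) / n) * entropy(y, idx, k, idx.size());
                if (h < bestH) { bestH = h; bestT = t; }
            }
            return bestT;
        }

        static double entropy(List<Integer> y, List<Integer> idx, int from, int to) {
            Map<Integer, Integer> counts = new HashMap<>();
            for (int i = from; i < to; i++) counts.merge(y.get(idx.get(i)), 1, Integer::sum);
            double h = 0, n = to - from;
            for (int c : counts.values()) h -= (c / n) * Math.log(c / n);
            return h;
        }

        static boolean isPure(List<Integer> y) {
            for (int v : y) if (v != y.get(0).intValue()) return false;
            return true;
        }

        static Node leaf(List<Integer> y) {      // constant (majority-label) categorizer
            Map<Integer, Integer> counts = new HashMap<>();
            for (int v : y) counts.merge(v, 1, Integer::sum);
            Node n = new Node();
            n.label = java.util.Collections.max(counts.entrySet(), Map.Entry.comparingByValue()).getKey();
            return n;
        }
    }

Internal nodes here play the role of ThresholdCategorizers and the leaves that of constant categorizers; swapping bestThreshold for a different scoring method mirrors how subclasses override best_split().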
Complexity:
The complexity of train() is proportional to the number of nodes in the resulting tree times the time for deciding on the split categorizer (done by the derived classes). predict() takes time proportional to the sum of the categorizers' times along the path from the root to a leaf node.
Enhancements:
We may speed things up by having an option to test only splits where the class label changes. For some measures (e.g., entropy), it can be shown that a split will never be made between two instances with the same class label (Fayyad, IJCAI-93, page 1022; Machine Learning journal, Vol. 8, No. 1, page 87, 1992). We may also wish to discretize the real values first. By making them linearly discrete, we can use the regular counters and things will be faster (note, however, that the split will usually remain special, since it is a binary threshold split, not a multi-way split).
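A sketch of that enhancement (hypothetical helper, not part of MLJ): with instances sorted by attribute value, only midpoints between adjacent instances whose class labels differ need to be scored:

    import java.util.ArrayList;
    import java.util.List;

    // Candidate thresholds only at class-label boundaries, per Fayyad's
    // result cited above. A simplified sketch: assumes values are sorted
    // ascending with labels aligned to them.
    static List<Double> candidateThresholds(double[] sortedValues, int[] labels) {
        List<Double> cuts = new ArrayList<>();
        for (int i = 1; i < sortedValues.length; i++)
            if (labels[i] != labels[i - 1] && sortedValues[i] != sortedValues[i - 1])
                cuts.add((sortedValues[i] + sortedValues[i - 1]) / 2);  // midpoint cut
        return cuts;
    }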
Another problem is with attributes that have many values, for example social-security-number. Computing all cut points can be very expensive. We may want to skip such attributes by requiring that each value have at least some number of instances. Utgoff in ML-94 (page 322) mentions that ID attributes slow his system down considerably. The problem, of course, is that if you threshold, it sometimes makes sense to split on such attributes. Taken to an extreme, if we had a truly real-valued attribute, all values would be different with probability 1, and hence we would skip such an attribute.
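That per-value minimum could be enforced with a guard like the following (hypothetical, illustrative only):

    import java.util.Map;

    // Skip a many-valued nominal attribute unless every observed value is
    // supported by at least minInst instances.
    static boolean worthSplitting(Map<String, Integer> valueCounts, int minInst) {
        for (int count : valueCounts.values())
            if (count < minInst) return false;  // e.g., social-security-number fails
        return true;
    }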
To speed things up, we may want to have an Inducer that accepts a decision tree and builds stuff in it (vs. getting a graph). Other options allow for doing the recursion by calling a function instead of creating the actual class. The advantage of the current method is that it allows a subclass to keep track of the number of levels (useful for lookahead or something). Yet another option is to "recycle" inducers by using our "this" and just changing the training set.
We currently split instances but keep the original structure, that is, we don't actually delete the attribute tested on. It may be faster in some cases to actually create a new List without the attribute. The disadvantage is that for multi-valued attributes we may wish to branch again, so we can't always delete. The same goes for tests which are not attributes (e.g., conjunctions).
Field Summary

static byte allOrNothing
          LeafDistType value.
static byte confidence
          Pruning method value.
static byte error
          Evaluation metric value.
static byte evidenceProjection
          LeafDistType value.
static byte frequencyCounts
          LeafDistType value.
static byte KLdistance
          Pruning method value.
static byte laplaceCorrection
          LeafDistType value.
static byte linear
          Pruning method value.
static byte logLoss
          Evaluation metric value.
static byte lossConfidence
          Pruning method value.
static byte lossLaplace
          Pruning method value.
static byte MSE
          Evaluation metric value.
static byte none
          Pruning method value.
static byte penalty
          Pruning method value.
Constructor Summary

TDDTInducer(java.lang.String dscr)
          Constructor.
TDDTInducer(java.lang.String descr, CGraph aCgraph)
          Constructor.
TDDTInducer(TDDTInducer source)
          Copy constructor.
Method Summary

protected void accumulate_tree_stats()
          Sets statistical information about the tree.
double addErrs(double N, double e)
abstract NodeCategorizer best_split(java.util.LinkedList catNames)
          Finds the best split in the node and returns a categorizer implementing it.
protected void build_categorizer(DecisionTree dTree)
          Builds a decision tree categorizer for the given DecisionTree.
double calc_KL_distance(DecisionTree decisionTree)
          Determines the Kullback-Leibler distance between the root of the decision tree and all of its children.
protected void connect(CatGraph catGraph, Node from, Node to, int edgeVal, java.lang.String edgeName)
          Connects two nodes in the specified CatGraph.
void copy_options(TDDTInducer inducer)
          Copies the option settings from the given TDDTInducer.
LeafCategorizer create_leaf_categorizer(double totalWeight, int[] tieBreakingOrder, DoubleRef numErrors, DoubleRef pessimisticErrors)
          Creates a leaf categorizer (has no children).
LeafCategorizer create_leaf_categorizer(double totalWeight, int[] tieBreakingOrder, DoubleRef numErrors, DoubleRef pessimisticErrors, double[] distrArray)
          Creates a leaf categorizer (has no children).
abstract TDDTInducer create_subinducer(java.lang.String dscr, CGraph aCgraph)
          Creates the Inducer for calling recursively.
boolean get_adjust_thresholds()
          Returns whether thresholds should be adjusted to equal instance values.
Categorizer get_categorizer()
          Returns the categorizer that the Inducer has generated.
boolean get_cont_mdl_adjust()
          Returns whether the Minimum Description Length adjustment for continuous attributes should be applied to mutual information.
boolean get_debug()
          Accesses the debug variable.
boolean get_empty_node_parent_dist()
          Returns whether an empty node should have the parent's distribution.
byte get_evaluation_metric()
          Accesses the evaluation metric used.
double get_evidence_factor()
          Returns the evidence correction factor.
boolean get_have_continuous_attributes()
          Returns whether there are continuous attributes present in the data.
byte get_leaf_dist_type()
          Returns the type of distribution to build at leaves.
int get_level()
          Returns the level set for this TDDTInducer.
double get_lower_bound_min_split_weight()
          Returns the lower bound for minimum split weight.
double get_m_estimate_factor()
          Returns the m-estimate factor for the laplace correction.
int get_max_level()
          Returns the maximum level which may be set for a TDDTInducer.
double get_min_split_weight_percent()
          Returns the percentage value for minimum split weight.
boolean get_nominal_lbound_only()
          Returns TRUE if lower bounds are to be used for nominal values, FALSE otherwise.
boolean get_parent_tie_breaking()
          Gets the order for breaking distribution ties.
boolean get_pruning_branch_replacement()
          Returns whether pruning should allow replacing a node with its largest subtree.
double get_pruning_factor()
          Returns the factor of how much pruning should be done.
byte get_pruning_method()
          Returns the pruning method to be used.
double get_smooth_factor()
          Returns the exponential factor for smoothing.
int get_smooth_inst()
          Returns the number of thresholds on either side to use for smoothing.
byte get_split_score_criterion()
          Returns the criterion used for split scoring.
protected double get_total_inst_weight()
          Returns the total weight of instances in the data set this inducer is using.
boolean get_unknown_edges()
          Returns whether unknown edges are allowed.
double get_upper_bound_min_split_weight()
          Returns the upper bound for minimum split weight.
Node getNewNode(Node node, int category)
          Creates a new categorizer Node.
Node induce_decision_tree(CGraph aCgraph, int[] tieBreakingOrder, DoubleRef numSubtreeErrors, DoubleRef pessimisticSubtreeErrors, IntRef numLeaves, int remainingSiblings)
          Induces a decision tree in the given graph.
protected void induce_tree_from_split(DecisionTree decisionTree, NodeCategorizer splitCat, java.util.LinkedList catNames, int[] tieBreakingOrder, DoubleRef numSubtreeErrors, DoubleRef pessimisticSubtreeErrors, IntRef numLeaves, int remainingSiblings)
          Induces a decision tree from a given split.
static double kullback_leibler_distance(double[] p, double[] q)
          Computes a Kullback-Leibler distance metric given arrays of p(x) and q(x) for all x.
protected double laplace_subtree_loss(LogOptions logOptions, DecisionTree dt, Node subtree, InstanceList il, int[] tieBreakingOrder)
          Computes the subtree loss, skewing the leaf weight distributions using the laplace correction.
java.lang.String name_sub_inducer(java.lang.String catDescr, int catNum)
          Creates a string to name the subinducer.
protected static double num_cat_errors(Categorizer cat, int predictClass, double totalWeight)
          Computes the number of errors this node would make as a leaf.
int num_nodes()
          Returns the number of nodes (categorizers), including leaves, in the decision tree.
int num_nontrivial_leaves()
          Returns the number of leaves, excluding those that contain no Instances and have only unknown edges leading to them.
int num_nontrivial_nodes()
          Returns the number of Nodes, excluding those that contain no Instances and have only unknown edges leading to them.
protected double pessimistic_subtree_errors(LogOptions logOptions, DecisionTree dt, Node subtree, InstanceList il, double zValue, int[] tieBreakingOrder)
          Computes the pessimistic errors for a subtree given a training set.
protected double pessimistic_subtree_loss(LogOptions logOptions, DecisionTree dt, Node subtree, InstanceList il, double zValue, int[] tieBreakingOrder)
          Computes the pessimistic loss for a subtree given a training set.
double predictErrors(int total, int misclass)
void prune_subtree(DecisionTree decisionTree, int[] tieBreakingOrder, Node largestChild, DoubleRef numSubtreeErrors, DoubleRef pessimisticSubtreeErrors, IntRef numLeaves)
          When the subtree rooted at the current node does not improve the error, the subtree may be replaced by a leaf or by its largest child.
void prune(boolean prune)
          Sets the errorprune value to the given value.
double prune(DecisionTree dTree, Node node, InstanceList il, boolean prune)
          Prunes the decision tree.
Categorizer release_categorizer()
          Gives ownership of the generated categorizer to the caller, reverting the Inducer to the untrained state.
void set_adjust_thresholds(boolean b)
          Sets whether thresholds should be adjusted to equal instance values.
void set_cont_mdl_adjust(boolean val)
          Sets whether the Minimum Description Length adjustment for continuous attributes should be applied to mutual information.
void set_debug(boolean val)
          Sets the debugging option.
void set_empty_node_parent_dist(boolean b)
          Sets whether an empty node should have the parent's distribution.
void set_evaluation_metric(byte metric)
          Sets the evaluation metric used.
void set_evidence_factor(double factor)
          Sets the evidence correction factor.
void set_have_continuous_attributes(boolean val)
          Sets whether there are continuous attributes present in the data.
void set_leaf_dist_type(byte type)
          Sets the type of distribution to build at leaves.
void set_level(int lvl)
          Sets the level of this TDDTInducer.
void set_lower_bound_min_split_weight(double val)
          Sets the lower bound for minimum split weight.
void set_m_estimate_factor(double factor)
          Sets the m-estimate factor for the laplace correction.
void set_max_level(int level)
          Sets the maximum level for a TDDTInducer.
void set_min_split_weight_percent(double val)
          Sets a new percentage value for minimum split weight.
void set_nominal_lbound_only(boolean val)
          Sets which lower bounds are used for nominal attributes.
void set_parent_tie_breaking(boolean b)
          Sets the tie-breaking order for distribution ties.
void set_pruning_branch_replacement(boolean b)
          Sets whether pruning should allow replacing a node with its largest subtree.
void set_pruning_factor(double val)
          Sets the factor of how much pruning should be done.
void set_pruning_method(byte pM)
          Sets the pruning method to be used.
void set_smooth_factor(double factor)
          Sets the exponential factor for smoothing.
void set_smooth_inst(int inst)
          Sets the number of thresholds on either side to use for smoothing.
void set_split_score_criterion(byte val)
          Sets the criterion used for split scoring.
protected void set_total_inst_weight(double wt)
          Sets the total weight of instances in the data set this inducer is currently using.
void set_unknown_edges(boolean val)
          Sets whether unknown categories are allowable for edges of the decision tree.
void set_upper_bound_min_split_weight(double val)
          Sets the upper bound for minimum split weight.
void set_user_options(java.lang.String prefix)
          Sets the user options according to the option file.
int train_no_stats()
          Trains the inducer on a stored dataset.
void train()
          Trains the inducer on a stored dataset and collects statistics.
boolean was_trained(boolean fatalOnFalse)
          Checks if this inducer has a valid decision tree.
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Field Detail
public static final byte none
public static final byte confidence
public static final byte penalty
public static final byte linear
public static final byte KLdistance
public static final byte lossConfidence
public static final byte lossLaplace
public static final byte allOrNothing
public static final byte frequencyCounts
public static final byte laplaceCorrection
public static final byte evidenceProjection
public static final byte error
public static final byte MSE
public static final byte logLoss
Constructor Detail

public TDDTInducer(java.lang.String dscr)
          dscr - The description of the inducer.

public TDDTInducer(java.lang.String descr, CGraph aCgraph)
          descr - The description of this inducer.
          aCgraph - The CGraph that will hold the decision tree.

public TDDTInducer(TDDTInducer source)
          source - The TDDTInducer that is being copied.

Method Detail
public void set_level(int lvl)
          lvl - The level to be set.

public int get_level()

protected void set_total_inst_weight(double wt)
          wt - The weight that should be set.

protected double get_total_inst_weight()

public int get_max_level()

public void set_max_level(int level)
          level - The new maximum level.

public void set_lower_bound_min_split_weight(double val)
          val - The new lower bound.

public double get_lower_bound_min_split_weight()

public void set_upper_bound_min_split_weight(double val)
          val - The new upper bound.

public double get_upper_bound_min_split_weight()

public void set_min_split_weight_percent(double val)
          val - The new percentage.

public double get_min_split_weight_percent()

public void set_nominal_lbound_only(boolean val)
          val - The value for the boolean option.

public boolean get_nominal_lbound_only()

public void set_unknown_edges(boolean val)
          val - TRUE if unknown edges are allowable, FALSE otherwise.

public boolean get_unknown_edges()

public byte get_split_score_criterion()

public void set_split_score_criterion(byte val)
          val - The new split score criterion.

public void set_empty_node_parent_dist(boolean b)
          b - TRUE indicates an empty node should have the parent's distribution, FALSE otherwise.

public boolean get_empty_node_parent_dist()

public void set_parent_tie_breaking(boolean b)
          b - The new order for breaking distribution ties.

public boolean get_parent_tie_breaking()

public void set_pruning_method(byte pM)
          pM - The pruning method to be used. If the value is not NONE and the pruning factor is 0, then a node will be made a leaf when its (potential) children do not improve the error count.

public byte get_pruning_method()

public void set_pruning_branch_replacement(boolean b)
          b - TRUE indicates pruning should allow replacing a node with its largest subtree, FALSE otherwise.

public boolean get_pruning_branch_replacement()

public void set_adjust_thresholds(boolean b)
          b - TRUE indicates thresholds should be adjusted to equal instance values, FALSE otherwise.

public boolean get_adjust_thresholds()

public void set_pruning_factor(double val)
          val - Factor of how much pruning should be done. High values indicate more pruning.

public double get_pruning_factor()

public int get_smooth_inst()

public void set_smooth_inst(int inst)
          inst - Number of thresholds on either side to use for smoothing; 0 for no smoothing.

public double get_smooth_factor()

public void set_smooth_factor(double factor)
          factor - The new exponential factor for smoothing.

public void set_cont_mdl_adjust(boolean val)
          val - TRUE if the Minimum Description Length adjustment for continuous attributes should be applied to mutual information, FALSE otherwise.

public boolean get_cont_mdl_adjust()

public void set_leaf_dist_type(byte type)
          type - The type of distribution to build at leaves.

public byte get_leaf_dist_type()

public void set_m_estimate_factor(double factor)
          factor - The new m-estimate factor for the laplace correction.

public double get_m_estimate_factor()

public void set_evidence_factor(double factor)
          factor - The new evidence correction factor.

public double get_evidence_factor()

public boolean get_have_continuous_attributes()

public void set_have_continuous_attributes(boolean val)
          val - TRUE indicates there are continuous attributes in the data, FALSE otherwise.

public boolean get_debug()

public void set_debug(boolean val)
          val - TRUE if debugging statements are active, FALSE otherwise.

public void set_evaluation_metric(byte metric)
          metric - The evaluation metric to be used.

public byte get_evaluation_metric()
protected void accumulate_tree_stats()

public void copy_options(TDDTInducer inducer)
          inducer - The TDDTInducer with the options to be copied.

public void set_user_options(java.lang.String prefix)
          prefix - The prefix for the option names.

public void train()

public int train_no_stats()

public int num_nontrivial_nodes()

public int num_nontrivial_leaves()

public boolean was_trained(boolean fatalOnFalse)
          fatalOnFalse - TRUE if an error message should be displayed if the inducer is not trained, FALSE otherwise.

public Node induce_decision_tree(CGraph aCgraph, int[] tieBreakingOrder, DoubleRef numSubtreeErrors, DoubleRef pessimisticSubtreeErrors, IntRef numLeaves, int remainingSiblings)
          aCgraph - The graph that will contain the decision tree.
          tieBreakingOrder - The tie-breaking order for breaking distribution ties.
          numSubtreeErrors - Number of errors to this point.
          pessimisticSubtreeErrors - Error estimate if this was a leaf node.
          numLeaves - The number of leaves in the decision tree.
          remainingSiblings - Siblings that have not been induced yet.

protected void build_categorizer(DecisionTree dTree)
          dTree - The DecisionTree to use for creating the categorizer.

public abstract NodeCategorizer best_split(java.util.LinkedList catNames)
          catNames - The list of categories found in the Node.

protected static double num_cat_errors(Categorizer cat, int predictClass, double totalWeight)
          cat - The Categorizer for the node being checked.
          predictClass - The category for which this node is being checked.
          totalWeight - The weight of all instances in a data set.

public LeafCategorizer create_leaf_categorizer(double totalWeight, int[] tieBreakingOrder, DoubleRef numErrors, DoubleRef pessimisticErrors)
          tieBreakingOrder - Order for breaking distribution ties.
          totalWeight - The total weight of the training data set.
          numErrors - The number of errors this LeafCategorizer will produce.
          pessimisticErrors - Error estimate if this was a leaf node.

public LeafCategorizer create_leaf_categorizer(double totalWeight, int[] tieBreakingOrder, DoubleRef numErrors, DoubleRef pessimisticErrors, double[] distrArray)
          Note that the augCategory will contain the correct string, but the description will contain more information which may help when displaying the graph. The augCategory string must be the same for CatTestResult to work properly (it compares the actual string for debugging purposes).
          tieBreakingOrder - Order for breaking distribution ties.
          totalWeight - The total weight of the training data set.
          numErrors - The number of errors this LeafCategorizer will produce.
          pessimisticErrors - Error estimate if this was a leaf node.
          distrArray - Distribution of weight over labels.

protected void induce_tree_from_split(DecisionTree decisionTree, NodeCategorizer splitCat, java.util.LinkedList catNames, int[] tieBreakingOrder, DoubleRef numSubtreeErrors, DoubleRef pessimisticSubtreeErrors, IntRef numLeaves, int remainingSiblings)
          decisionTree - Decision tree induced.
          splitCat - The categorizer for this split.
          catNames - List of category names.
          tieBreakingOrder - Order for breaking distribution ties.
          numSubtreeErrors - Number of errors this subtree produces in categorization of Instances.
          pessimisticSubtreeErrors - Error estimate if this was a leaf node.
          numLeaves - Number of leaves on a subtree.
          remainingSiblings - Siblings that have not been induced yet.

protected void connect(CatGraph catGraph, Node from, Node to, int edgeVal, java.lang.String edgeName)
          catGraph - The CatGraph containing these nodes.
          from - The node from which the edge originates.
          to - The node to which the edge connects.
          edgeVal - The value of the AugCategory associated with that edge.
          edgeName - The name of the edge.

public java.lang.String name_sub_inducer(java.lang.String catDescr, int catNum)
          catDescr - The description of this subinducer.
          catNum - The category number for which this subinducer is inducing.

public abstract TDDTInducer create_subinducer(java.lang.String dscr, CGraph aCgraph)
          dscr - The description for the subinducer.
          aCgraph - The categorizer graph to use for the subinducer.

public void prune_subtree(DecisionTree decisionTree, int[] tieBreakingOrder, Node largestChild, DoubleRef numSubtreeErrors, DoubleRef pessimisticSubtreeErrors, IntRef numLeaves)
"Confidence" pruning is based on C4.5's pruning method. "Penalty" pruning is based on "Pessimistic Decision tree pruning based on tree size" by Yishay Mansour, ICML-97. "Linear" pruning is used to implement cost-complexity pruning as described in CART. Its use is not recommended otherwise. "KLdistance" pruning uses the Kullback-Leibler distance metric to determine whether to prune.
This function is divided into three main parts. First, initial checks are performed and values are set. Second, the test specific to each pruning method is performed. Last, if pruning is necessary, do it.
          decisionTree - Tree to be pruned.
          tieBreakingOrder - Order for breaking distribution ties.
          largestChild - The largest child node.
          numSubtreeErrors - Number of errors this subtree produces in categorization of Instances.
          pessimisticSubtreeErrors - Error estimate if this was a leaf node.
          numLeaves - Number of leaves on a subtree.
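Schematically, the replacement decision described above can be sketched as follows (illustrative names and signature, not the MLJ implementation): keep the subtree only when its error estimate beats both a collapsed leaf and, when branch replacement is enabled, the largest child.

    enum PruneAction { KEEP_SUBTREE, REPLACE_WITH_LEAF, REPLACE_WITH_LARGEST_CHILD }

    // Compare estimated errors of the three alternatives; ties favor the
    // simpler structure, matching the "does not improve the error" wording.
    static PruneAction chooseAction(double subtreeErrors, double leafErrors,
                                    double largestChildErrors, boolean allowBranchReplacement) {
        PruneAction action = PruneAction.KEEP_SUBTREE;
        double best = subtreeErrors;
        if (leafErrors <= best) {
            best = leafErrors;
            action = PruneAction.REPLACE_WITH_LEAF;
        }
        if (allowBranchReplacement && largestChildErrors < best)
            action = PruneAction.REPLACE_WITH_LARGEST_CHILD;
        return action;
    }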
protected double laplace_subtree_loss(LogOptions logOptions, DecisionTree dt, Node subtree, InstanceList il, int[] tieBreakingOrder)
          logOptions - The options for logging messages.
          dt - DecisionTree over which the metric is to be calculated.
          subtree - The current node used for calculation.
          il - The InstanceList containing Instances that reach the current subtree Node.
          tieBreakingOrder - Order for breaking distribution ties.

protected double pessimistic_subtree_errors(LogOptions logOptions, DecisionTree dt, Node subtree, InstanceList il, double zValue, int[] tieBreakingOrder)
          logOptions - The options for logging information.
          dt - The decision tree for which a pessimistic errors value is calculated.
          subtree - The current node in the decision tree.
          il - The list of instances for determining errors.
          zValue - Z value for pessimistic error correction.
          tieBreakingOrder - Order for breaking distribution ties.

protected double pessimistic_subtree_loss(LogOptions logOptions, DecisionTree dt, Node subtree, InstanceList il, double zValue, int[] tieBreakingOrder)
          logOptions - The options for logging information.
          dt - The decision tree for which a pessimistic loss value is calculated.
          subtree - The current node in the decision tree.
          il - The list of instances for determining errors.
          zValue - Z value for pessimistic error correction.
          tieBreakingOrder - Order for breaking distribution ties.

public double calc_KL_distance(DecisionTree decisionTree)
          decisionTree - The decision tree over which the metric is to be used.

public static double kullback_leibler_distance(double[] p, double[] q)
          This is not identical to the one in NaiveBayesCat, as this one has been changed to handle 0's in the arrays as follows:

              p(x)    q(x)    kl(p,q)
              0       0       ignore
              0       >0      0
              >0      0       error
              >0      >0      OK
          p - The child node distribution.
          q - The parent node distribution.
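A direct transcription of the zero-handling table into code (a sketch, using the natural logarithm; not the MLJ source):

    // kl(p,q) = sum over x of p(x) * ln(p(x)/q(x)), with the 0 cases above.
    static double klDistance(double[] p, double[] q) {
        double sum = 0.0;
        for (int i = 0; i < p.length; i++) {
            if (p[i] == 0.0)
                continue;                       // p=0: ignore (q=0) or contribute 0 (q>0)
            if (q[i] == 0.0)
                throw new IllegalArgumentException("p(x) > 0 but q(x) == 0");
            sum += p[i] * Math.log(p[i] / q[i]);
        }
        return sum;
    }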
public Categorizer get_categorizer()

public int num_nodes()

public Categorizer release_categorizer()

public double prune(DecisionTree dTree, Node node, InstanceList il, boolean prune)
          dTree - The decision tree to be pruned.
          node - The current node in the decision tree.
          il - Instance list for determining errors.
          prune - TRUE if the decision tree is to be pruned, FALSE otherwise.

public double predictErrors(int total, int misclass)
          total - The total number of instances.
          misclass - The number of misclassified instances.

public double addErrs(double N, double e)
          N - The number of instances at the leaf.
          e - The number of errors observed at the leaf.
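addErrs carries no description here, but since "confidence" pruning is based on C4.5, it plausibly returns the extra errors implied by an upper confidence bound on the leaf's error rate. A sketch of that standard computation under this assumption, using the normal-approximation bound with z of about 0.67 for a one-sided 75% confidence level (C4.5's default CF = 25%); this is not the library's verified code:

    // Pessimistic extra errors for a leaf with N instances and e errors,
    // assuming C4.5-style upper confidence bounds (an assumption, see above).
    static double addErrsSketch(double N, double e, double z) {
        if (N <= 0) return 0.0;
        double f = e / N;                        // observed error rate
        double ucf = (f + z * z / (2 * N)
                + z * Math.sqrt(f / N - f * f / N + z * z / (4 * N * N)))
                / (1 + z * z / N);               // upper bound on the true rate
        return ucf * N - e;                      // errors beyond those observed
    }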
public Node getNewNode(Node node, int category)
          node - The Node for which a categorizer Node is to be created.
          category - The category to be stored in this Node.

public void prune(boolean prune)
          prune - The new errorprune value.