id3
Class ID3Inducer

java.lang.Object
  |
  +--shared.BaseInducer
        |
        +--shared.Inducer
              |
              +--id3.TDDTInducer
                    |
                    +--id3.ID3Inducer

public class ID3Inducer
extends TDDTInducer

The ID3Class is the Java implementation of the ID3 algorithm. The ID3 algorithm is a top-down decision-tree induction algorithm. This algorithm uses the mutual information (original gain criteria),and not the more recent information gain ratio.

Complexity:

Our split() method uses entropy and takes time O(vy) where v is the total number of attribute values (over all attributes) and y is the number of label values. This can be derived by noting that mutual_info is computed for each attribute.

Node categorizers (for predict) are AttrCategorizer and take constant time, thus the overall prediction time is O(path-length).

See TDDTInducer for more complexity information.

Enhancements:

The ID3Compute entropy once for the node, and pass it along to avoid multiple computations like we do now.


Fields inherited from class id3.TDDTInducer
allOrNothing, confidence, error, evidenceProjection, frequencyCounts, KLdistance, laplaceCorrection, linear, logLoss, lossConfidence, lossLaplace, MSE, none, penalty
 
Fields inherited from class shared.BaseInducer
AHA_IB_INDUCER, AM_INDUCER, BAGGING_INDUCER, BOOSTER_INDUCER, C45_INDUCER, C45AP_INDUCER, C45R_INDUCER, C50_INDUCER, CART_INDUCER, CatDT_INDUCER, CF_INDUCER, CLUSTER_INDUCER, CN2_INDUCER, CONST_INDUCER, COODG_INDUCER, DDT_INDUCER, DF_INDUCER, DISC_NB_INDUCER, DISC_SEARCH_INDUCER, DISC_TAB_INDUCER, ENTROPY_ODG_INDUCER, FCF_INDUCER, FSS_INDUCER, getEnv, HOODG_INDUCER, IB_INDUCER, ID3_INDUCER, LAZY_DT_INDUCER, LIST_HOODG_INDUCER, LIST_ODG_INDUCER, logOptions, NAIVE_BAYES_INDUCER, NULL_INDUCER, OC1_INDUCER, ODT_INDUCER, ONER_INDUCER, OODG_INDUCER, ORDER_FSS_INDUCER, PEBLS_INDUCER, PERCEPTRON_INDUCER, PERF_EST_INDUCER, PROJECT_INDUCER, RIPPER_INDUCER, SGI_DT_INDUCER, STACKING_INDUCER, T2_INDUCER, TABLE_CAS_INDUCER, TABLE_INDUCER, TDDT_INDUCER, TS, WEIGHT_SEARCH_INDUCER, WINNOW_INDUCER
 
Constructor Summary
ID3Inducer(ID3Inducer source)
          Copy Constructor.
ID3Inducer(java.lang.String dscr)
          Constructor.
ID3Inducer(java.lang.String dscr, CGraph aCgraph)
          Constructor.
 
Method Summary
 boolean all_attributes_multi_val()
          Checks if all attributes are multi-valued.
 void best_split_info(SplitAttr[] bestSplit, SplitAttr[] splits)
          Fills in the array of SplitAttr for current subtree.
 NodeCategorizer best_split(java.util.LinkedList catNames)
          Returns the AttrCategorizer that splits on the best attribute found using mutual information(information gain).
 int class_id()
          Deprecated. This method should be replaced with Java's instanceof operator.
 Inducer copy()
          Returns the reference to the copy of ID3Inducer with the same settings.
 TDDTInducer create_subinducer(java.lang.String descr, CGraph aCgraph)
          Create an Inducer for recursive calls.
 boolean find_splits(SplitAttr[] bestSplit, SplitAttr[] splits)
          Fills in the array of splits for current subtree.
 boolean multi_val_attribute(int attrNum)
          Return true if the attribute has many values according to the C4.5 definition.
 void pick_best_split(SplitAttr[] bestSplit, SplitAttr[] splits, StatData allMutualInfo, StatData allNonMultiValMutualInfo)
          Choose the best attribute to split the on from all possible splits.
 void split_info(int attrNum, SplitAttr split)
          Compute the split information for a given attribute.
 void split_info(int attrNum, SplitAttr split, RealAndLabelColumn[] realColumns)
          Compute the split information for a given attribute.
 NodeCategorizer split_to_cat(SplitAttr split, java.util.LinkedList catNames)
          Build categorizer for the given attribute.
 
Methods inherited from class id3.TDDTInducer
accumulate_tree_stats, addErrs, build_categorizer, calc_KL_distance, connect, copy_options, create_leaf_categorizer, create_leaf_categorizer, get_adjust_thresholds, get_categorizer, get_cont_mdl_adjust, get_debug, get_empty_node_parent_dist, get_evaluation_metric, get_evidence_factor, get_have_continuous_attributes, get_leaf_dist_type, get_level, get_lower_bound_min_split_weight, get_m_estimate_factor, get_max_level, get_min_split_weight_percent, get_nominal_lbound_only, get_parent_tie_breaking, get_pruning_branch_replacement, get_pruning_factor, get_pruning_method, get_smooth_factor, get_smooth_inst, get_split_score_criterion, get_total_inst_weight, get_unknown_edges, get_upper_bound_min_split_weight, getNewNode, induce_decision_tree, induce_tree_from_split, kullback_leibler_distance, laplace_subtree_loss, name_sub_inducer, num_cat_errors, num_nodes, num_nontrivial_leaves, num_nontrivial_nodes, pessimistic_subtree_errors, pessimistic_subtree_loss, predictErrors, prune_subtree, prune, prune, release_categorizer, set_adjust_thresholds, set_cont_mdl_adjust, set_debug, set_empty_node_parent_dist, set_evaluation_metric, set_evidence_factor, set_have_continuous_attributes, set_leaf_dist_type, set_level, set_lower_bound_min_split_weight, set_m_estimate_factor, set_max_level, set_min_split_weight_percent, set_nominal_lbound_only, set_parent_tie_breaking, set_pruning_branch_replacement, set_pruning_factor, set_pruning_method, set_smooth_factor, set_smooth_inst, set_split_score_criterion, set_total_inst_weight, set_unknown_edges, set_upper_bound_min_split_weight, set_user_options, train_no_stats, train, was_trained
 
Methods inherited from class shared.Inducer
can_cast_to_inducer, cast_to_inducer, display_struct, display_struct, project_train_and_perf_files, project_train_and_perf_files, project_train_and_perf_files, project_train_and_perf_files, project_train_and_perf, project_train_and_test_files, project_train_and_test_files, project_train_and_test_files, project_train_and_test_files, project_train_and_test, supports_full_testing, train_and_perf, train_and_test
 
Methods inherited from class shared.BaseInducer
assign_data, can_cast_to_incr_inducer, cast_to_incr_inducer, description, get_log_level, get_log_options, get_log_stream, has_data, has_data, instance_list, normalize_weights, read_data, read_data, read_data, release_data, set_log_level, set_log_options, set_log_prefixes, set_log_stream, train_and_test_files, train_and_test_files, train_and_test_files, train_and_test_files
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

ID3Inducer

public ID3Inducer(java.lang.String dscr,
                  CGraph aCgraph)
Constructor.
Parameters:
dscr - The description of this inducer.
aCgraph - A previously developed Cgraph.

ID3Inducer

public ID3Inducer(java.lang.String dscr)
Constructor.
Parameters:
dscr - The description of this inducer.

ID3Inducer

public ID3Inducer(ID3Inducer source)
Copy Constructor.
Parameters:
source - The original ID3Inducer that is being copied.
Method Detail

best_split

public NodeCategorizer best_split(java.util.LinkedList catNames)
Returns the AttrCategorizer that splits on the best attribute found using mutual information(information gain). Returns null if there is nothing good to split on. Ties between this attribute and earlier attributes are broken.
Overrides:
best_split in class TDDTInducer
Parameters:
catNames - The names of the categories that each instance may be catagorized under.
Returns:
The NodeCategorizer that splits on the best attribute found. May be null if no good attribute split is found.

find_splits

public boolean find_splits(SplitAttr[] bestSplit,
                           SplitAttr[] splits)
Fills in the array of splits for current subtree. It does very little, but rarely overriden whereas best_split_info is overridden by subclasses.
Parameters:
bestSplit - This is an array of the best splits found during the splitting process.
splits - This is an array of all splits found during the splitting process.
Returns:
False if there is only one label value, the maximum number of splits is reached, or if there is no reasonable split available.

best_split_info

public void best_split_info(SplitAttr[] bestSplit,
                            SplitAttr[] splits)
Fills in the array of SplitAttr for current subtree. This function is a good candidate to override in subclasses.
Parameters:
bestSplit - This is an array of the best splits found during the splitting process.
splits - This is an array of all splits found during the splitting process.

multi_val_attribute

public boolean multi_val_attribute(int attrNum)
Return true if the attribute has many values according to the C4.5 definition.
Parameters:
attrNum - The number of the attribute being checked.
Returns:
True if this attribute has many values, False otherwise.

pick_best_split

public void pick_best_split(SplitAttr[] bestSplit,
                            SplitAttr[] splits,
                            StatData allMutualInfo,
                            StatData allNonMultiValMutualInfo)
Choose the best attribute to split the on from all possible splits.
Parameters:
bestSplit - The array of the best splits found during splitting process.
splits - The array of all splits found during the splitting process.
allMutualInfo - Statistical information about all instances.
allNonMultiValMutualInfo - Statistical information about instances where an attribute can only have one value at a time.

all_attributes_multi_val

public boolean all_attributes_multi_val()
Checks if all attributes are multi-valued. An attribute is multivalued if an instance can have more than one value at one time. If the attribute contains values that are neither real or nominal, an abort message is issued.
Returns:
False if the values are real numbers or if the nominal values are not multivalued. Otherwise True is returned.

split_info

public void split_info(int attrNum,
                       SplitAttr split)
Compute the split information for a given attribute.
Parameters:
attrNum - The index number of this attribute column.
split - The attribute to be split on.

split_info

public void split_info(int attrNum,
                       SplitAttr split,
                       RealAndLabelColumn[] realColumns)
Compute the split information for a given attribute.
Parameters:
attrNum - The index number of this attribute column.
split - The attribute to be split on.
realColumns - The columns of values specified for each attribute over all instances in a data set.

create_subinducer

public TDDTInducer create_subinducer(java.lang.String descr,
                                     CGraph aCgraph)
Create an Inducer for recursive calls. Since TDDTInducer is an abstract class, it can't do the recursive call.
Overrides:
create_subinducer in class TDDTInducer
Parameters:
descr - The description of the new subinducer.
aCgraph - A previously defined Cgraph for the inducer.
Returns:
The new sub-ID3Inducer created.

split_to_cat

public NodeCategorizer split_to_cat(SplitAttr split,
                                    java.util.LinkedList catNames)
Build categorizer for the given attribute. The specified SplitAttr is assumed to be valid.
Parameters:
split - The attribute to be split on.
catNames - The category names that an instance may be categorized under.
Returns:
The NodeCategorizer containing a categorizer that splits on this attribute.

copy

public Inducer copy()
Returns the reference to the copy of ID3Inducer with the same settings.
Returns:
A reference to an ID3Inducer.

class_id

public int class_id()
Deprecated. This method should be replaced with Java's instanceof operator.

Returns the class id of this of this inducer.
Overrides:
class_id in class BaseInducer
Returns:
Integer assigned to this inducer.