javax.datamining.algorithm.tree
Interface TreeSettings

All Superinterfaces:
AlgorithmSettings, SupervisedAlgorithmSettings

public interface TreeSettings
extends SupervisedAlgorithmSettings

A TreeSettings object is a subinterface of SupervisedAlgorithmSettings that supports capabilities of the classification and regression mining functions and their corresponding algorithms. These include: number of surrogates, termination criteria, tree selection method, number of splits, cost function, and pruning function.

Surrogates are node predicates that best mimic the action of the primary split predicate. These surrogates, if they exist, are applied during scoring when data is missing that is required for evaluation of the primary split predicate. The surrogates are ranked in accordance with how well they mimic the action of the primary predicate. In the event of multiple missing data elements in a record to be scored, the highest ranked surrogate with non-missing predicate data is applied.
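
The surrogate fallback described above can be sketched as follows. This is only an illustration under stated assumptions: the Split class, the route method, and the attribute names are hypothetical and are not part of the JDM API, and a real DME may handle missing data differently.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.function.Predicate;

final class SurrogateSketch {

    /** Hypothetical split predicate: tests one attribute and routes left (true) or right (false). */
    static final class Split {
        final String attribute;
        final Predicate<Object> goesLeft;

        Split(String attribute, Predicate<Object> goesLeft) {
            this.attribute = attribute;
            this.goesLeft = goesLeft;
        }
    }

    /**
     * Routes a row at one node. The primary split is used when its attribute is
     * present; otherwise the highest-ranked surrogate with a non-missing value
     * is used. Surrogates must be ordered from highest to lowest rank.
     * Returns empty when nothing can be evaluated.
     */
    static Optional<Boolean> route(Map<String, ?> row, Split primary, List<Split> rankedSurrogates) {
        Object value = row.get(primary.attribute);
        if (value != null) {
            return Optional.of(primary.goesLeft.test(value));
        }
        for (Split surrogate : rankedSurrogates) {
            Object surrogateValue = row.get(surrogate.attribute);
            if (surrogateValue != null) {
                return Optional.of(surrogate.goesLeft.test(surrogateValue));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Split primary = new Split("income", v -> ((Number) v).doubleValue() < 50000);
        List<Split> surrogates = List.of(
                new Split("age", v -> ((Number) v).intValue() < 30),
                new Split("ownsHome", v -> !((Boolean) v)));
        // "income" and "age" are missing, so the second-ranked surrogate decides.
        Map<String, Object> row = Map.of("ownsHome", Boolean.FALSE);
        System.out.println(route(row, primary, surrogates));
    }
}
```

A missing attribute is modeled here as an absent map key; that choice is an assumption of the sketch, not something the specification prescribes.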

The termination criteria preclude further splitting of a node, which enhances build performance.

Tree models typically produce a hierarchical list of decision trees. Deeper trees extend and encompass all shallower trees. The list is constructed via a pruning process, and a tree selection method picks the decision tree to be used as the final model. An evaluation data set or cross-validation is used to compute the error associated with each tree in the hierarchical list. The associated statistics (misclassification rate, gini, entropy, mean squared error, mean absolute deviation) assist in the selection decision. Tree selection methods include automated choices, such as the One-Standard-Error Tree and the Minimum Variance Tree, and the non-automated choice: Manual.

The One-Standard-Error Tree is the tree whose error rate is within one standard error of the error rate of the minimum error rate tree and which is smaller than the minimum error rate tree. This method is conservative, is justified by the principle of parsimony, and applies to classification targets only. It is restricted to classification targets because the standard error of the error rate in regression depends on 4th order terms, which are typically not reliable.

The Minimum Variance Tree is the tree in the list with the smallest error rate. This technique applies to both regression and classification targets.
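
The two automated selection methods can be sketched as below. The sketch makes several assumptions that the specification does not state: the class and method names are hypothetical, the standard error is estimated with the binomial formula sqrt(e * (1 - e) / n) over n evaluation cases, and "smaller" is taken to mean "fewer nodes".

```java
final class TreeSelectionSketch {

    /** Index of the tree with the smallest error rate (ties go to the earlier index). */
    static int minimumErrorTree(double[] errorRate) {
        int best = 0;
        for (int i = 1; i < errorRate.length; i++) {
            if (errorRate[i] < errorRate[best]) {
                best = i;
            }
        }
        return best;
    }

    /**
     * Index of the tree with the fewest nodes among those whose error rate is
     * within one standard error of the minimum error rate.
     * The binomial standard error is an assumption of this sketch.
     */
    static int oneStandardErrorTree(double[] errorRate, int[] nodeCount, int evaluationCases) {
        int min = minimumErrorTree(errorRate);
        double e = errorRate[min];
        double threshold = e + Math.sqrt(e * (1.0 - e) / evaluationCases);
        int best = min;
        for (int i = 0; i < errorRate.length; i++) {
            if (errorRate[i] <= threshold && nodeCount[i] < nodeCount[best]) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical pruning sequence, from the shallowest tree to the deepest.
        double[] errorRate = {0.30, 0.215, 0.205, 0.20, 0.21};
        int[] nodeCount = {1, 3, 7, 15, 31};
        System.out.println("minimum error tree: " + minimumErrorTree(errorRate));
        System.out.println("one-standard-error tree: " + oneStandardErrorTree(errorRate, nodeCount, 400));
    }
}
```

With these illustrative numbers the minimum error tree has 15 nodes, while the one-standard-error rule settles on the 3-node tree, which shows why the method is described as conservative.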

The maximum number of splits limits the number of child nodes of each interior node. Possible values are binary (exactly 2 children per interior node) or k-ary (2 or more children per interior node). k-ary trees can have a variable number of children per interior node.

The cost function is used to measure the goodness of the split. Cost functions are specific to target type. For example, classification targets can use reduction in misclassification rate, entropy, or gini as a measure of goodness. Regression targets can use reduction in mean squared error or mean absolute deviation as a measure of goodness.
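
As an illustration of the classification measures named above, the following sketch computes gini, entropy, and misclassification rate from the class counts of a node, and the reduction in gini achieved by a binary split. The class and method names are hypothetical, nodes are assumed to be non-empty, and a DME may define or normalize these measures differently.

```java
final class ImpuritySketch {

    private static int sum(int[] counts) {
        int total = 0;
        for (int c : counts) {
            total += c;
        }
        return total;
    }

    /** Gini impurity: 1 minus the sum of squared class proportions. */
    static double gini(int[] counts) {
        double total = sum(counts);
        double sumOfSquares = 0.0;
        for (int c : counts) {
            double p = c / total;
            sumOfSquares += p * p;
        }
        return 1.0 - sumOfSquares;
    }

    /** Entropy in bits: minus the sum of p * log2(p) over non-empty classes. */
    static double entropy(int[] counts) {
        double total = sum(counts);
        double h = 0.0;
        for (int c : counts) {
            if (c > 0) {
                double p = c / total;
                h -= p * Math.log(p) / Math.log(2.0);
            }
        }
        return h;
    }

    /** Misclassification rate when the majority class is predicted. */
    static double misclassificationRate(int[] counts) {
        int max = 0;
        for (int c : counts) {
            max = Math.max(max, c);
        }
        return 1.0 - max / (double) sum(counts);
    }

    /** Goodness of a binary split: parent gini minus the size-weighted gini of the children. */
    static double giniDecrease(int[] parent, int[] left, int[] right) {
        double n = sum(parent);
        return gini(parent) - (sum(left) / n) * gini(left) - (sum(right) / n) * gini(right);
    }

    public static void main(String[] args) {
        int[] parent = {5, 5};
        System.out.println("gini = " + gini(parent));
        System.out.println("entropy = " + entropy(parent));
        System.out.println("decrease = " + giniDecrease(parent, new int[] {5, 0}, new int[] {0, 5}));
    }
}
```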

The pruning function is used to create a set of sub-trees. Pruning functions are specific to target-type. The pruning operation trades off complexity (number of nodes) for error rate and produces a hierarchical list of trees. Error rate is measured by one of the set of metrics that are used to measure goodness of split.

The optional prior probabilities refer to the assumed distribution of classification target values in the data used to build a model prior to any sampling. PriorProbabilities can be specified by the user or computed from the training data (default). This information can be used for creating a more balanced mix of target data values for model building. The user might specify prior probabilities to inform the build process that a target-value-stratified sampling procedure has been used to create the training sample. Another use of prior probabilities is to weight errors differently among the target classes. A target class with a higher prior probability, in essence, represents a target class with greater importance in the training sample. This use is an alternative to use of a cost matrix to specify such weights.
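
One way a build process might use prior probabilities after stratified sampling is to weight each case by the ratio of its class prior to its class share in the sample. The sketch below shows that arithmetic only; the class and method names are hypothetical, and the specification does not say that a DME must apply priors this way.

```java
final class PriorWeightSketch {

    /** Relative class frequencies, as would be computed from the training data by default. */
    static double[] priorsFromCounts(int[] classCounts) {
        double total = 0.0;
        for (int c : classCounts) {
            total += c;
        }
        double[] priors = new double[classCounts.length];
        for (int i = 0; i < classCounts.length; i++) {
            priors[i] = classCounts[i] / total;
        }
        return priors;
    }

    /**
     * Per-class case weights that restore the assumed prior distribution:
     * weight[c] = prior[c] / (share of class c in the stratified sample).
     * Every class is assumed to appear in the sample at least once.
     */
    static double[] caseWeights(double[] prior, int[] sampleCounts) {
        double[] sampleShare = priorsFromCounts(sampleCounts);
        double[] weights = new double[prior.length];
        for (int i = 0; i < prior.length; i++) {
            weights[i] = prior[i] / sampleShare[i];
        }
        return weights;
    }

    public static void main(String[] args) {
        // Hypothetical case: the rare class is 5% of the population but 50% of the sample.
        double[] prior = {0.95, 0.05};
        int[] sampleCounts = {500, 500};
        double[] w = caseWeights(prior, sampleCounts);
        System.out.println(w[0] + " " + w[1]);
    }
}
```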

The optional cost matrix specifies a two-dimensional, N x N matrix that defines the cost associated with a prediction error when the prediction differs from the actual value. A cost matrix is typically used in classification models, where N is the number of classes in the target, and the columns and rows are labeled with class values.
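
To illustrate how an N x N cost matrix can influence a prediction, the sketch below picks the class with the lowest expected cost given the class probabilities at a leaf. It assumes rows are indexed by actual class and columns by predicted class; the specification quoted here does not fix that orientation, and the names used are hypothetical.

```java
final class CostMatrixSketch {

    /**
     * Returns the index of the predicted class with the lowest expected cost.
     * cost[actual][predicted] is the cost of predicting "predicted" when the
     * true class is "actual" (orientation is an assumption of this sketch).
     */
    static int minExpectedCostClass(double[] classProbability, double[][] cost) {
        int best = -1;
        double bestCost = Double.POSITIVE_INFINITY;
        for (int predicted = 0; predicted < classProbability.length; predicted++) {
            double expected = 0.0;
            for (int actual = 0; actual < classProbability.length; actual++) {
                expected += classProbability[actual] * cost[actual][predicted];
            }
            if (expected < bestCost) {
                bestCost = expected;
                best = predicted;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] probability = {0.7, 0.3};
        double[][] zeroOne = {{0, 1}, {1, 0}};
        double[][] asymmetric = {{0, 1}, {5, 0}}; // missing class 1 costs five times more
        System.out.println(minExpectedCostClass(probability, zeroOne));
        System.out.println(minExpectedCostClass(probability, asymmetric));
    }
}
```

With equal costs the more probable class is predicted; with the asymmetric matrix the less probable but more expensive-to-miss class is predicted instead.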

Author:
JSR-73 Java Data Mining Expert Group
See Also:
AlgorithmSettings, SupervisedAlgorithmSettings

Method Summary
 void computeNodeStatistics(boolean computeNodeStatistics)
          If computeNodeStatistics is true, the DME computes node statistics for the tree model.
 boolean determineMaxDepth()
          Returns true if the maximum depth is determined by the DME at runtime.
 void determineMaxDepth(boolean determineMaxDepth)
          If true, the maximum depth is determined by the DME at runtime.
 TreeHomogeneityMetric getBuildHomogeneityMetric()
          Returns the homogeneity metric used to measure goodness of a split.
 boolean getComputeNodeStatistics()
          Returns whether node statistics will be computed.
 int getMaxDepth()
          Returns the maximum depth of the tree model to be built, last set by the user.
 double getMaximumPValue()
          Returns the largest acceptable probability of any target value to split a node.
 int getMaxSplits()
          Returns the maximum number of children at any interior node.
 int getMaxSurrogates()
          Returns the maximum number of surrogate splits to be computed by the model at each node.
 double getMinDecreaseInImpurity()
          Returns the minimum decrease in impurity required to justify splitting a node.
 double getMinNodeSize()
          Returns the minimum node size.
 SizeUnit getMinNodeSizeUnit()
          Returns the size unit of the minimum node size.
 TreeHomogeneityMetric getPruningHomogeneityMetric()
          Returns the homogeneity metric used to establish a pruning path through the full tree.
 TreeSelectionMethod getTreeSelectionMethod()
          Returns the method used to select a tree from the hierarchical list of trees produced by the pruning process.
 void setBuildHomogeneityMetric(TreeHomogeneityMetric buildMetric)
          Sets the homogeneity metric used to measure goodness of a split.
 void setMaxDepth(int maxDepth)
          Sets the maximum depth of the tree model to be built.
 void setMaximumPValue(double maxPValue)
          Sets the largest acceptable probability of any target value to split a node.
 void setMaxSplits(int maxSplits)
          Sets the maximum number of children at any interior node.
 void setMaxSurrogates(int maxSurrogates)
          Sets the maximum number of surrogate splits to be computed by the model at each node.
 void setMinDecreaseInImpurity(double minImpurity)
          Sets the minimum decrease in impurity required to justify splitting a node.
 void setMinNodeSize(double size, SizeUnit unit)
          Sets the minimum node size.
 void setPruningHomogeneityMetric(TreeHomogeneityMetric pruningMetric)
          Sets the homogeneity metric used to establish a pruning path through the full tree.
 void setTreeSelectionMethod(TreeSelectionMethod selectionMethod)
          Sets the method used to select a tree from the hierarchical list of trees produced by the pruning process.
 
Methods inherited from interface javax.datamining.base.AlgorithmSettings
getMiningAlgorithm, verify
 

Method Detail

computeNodeStatistics

public void computeNodeStatistics(boolean computeNodeStatistics)
If computeNodeStatistics is true, the DME computes node statistics for the tree model. If false, these are not computed.

Parameters:
computeNodeStatistics - true to compute node statistics for the tree model; false to skip them.
Returns:
void

determineMaxDepth

public boolean determineMaxDepth()
Returns true if the maximum depth is determined by the DME at runtime. Returns false if the previous maximum depth setting is to be used.

Returns:
boolean

determineMaxDepth

public void determineMaxDepth(boolean determineMaxDepth)
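If true, the maximum depth is determined by the DME at runtime. If false, the previous maximum depth setting is used.

Parameters:
determineMaxDepth - true to let the DME determine the maximum depth at runtime; false to use the previous maximum depth setting.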
Returns:
void

getBuildHomogeneityMetric

public TreeHomogeneityMetric getBuildHomogeneityMetric()
Returns the homogeneity metric used to measure goodness of a split.

Returns:
TreeHomogeneityMetric

getComputeNodeStatistics

public boolean getComputeNodeStatistics()
Returns whether node statistics will be computed. Node statistics are computed if true is returned, and no node statistics are computed if false is returned.

Returns:
boolean

getMaxDepth

public int getMaxDepth()
Returns the maximum depth of the tree model to be built, last set by the user.

If determineMaxDepth(true) has been called, this setting is ignored.

Returns:
int
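
The interplay between the user-set maximum depth and the determineMaxDepth flag can be sketched with a small stand-in class. DepthSettingsSketch is hypothetical and is not a JDM implementation: its defaults, the effectiveMaxDepth helper, and the assumption that setMaxDepth leaves the flag unchanged are illustrative choices only.

```java
final class DepthSettingsSketch {

    private int maxDepth = 10;            // hypothetical default
    private boolean engineDecides = true; // hypothetical default

    /** Stores the user's maximum depth; it must be positive. */
    void setMaxDepth(int maxDepth) {
        if (maxDepth <= 0) {
            throw new IllegalArgumentException("maxDepth must be positive");
        }
        this.maxDepth = maxDepth;
    }

    /** Returns the depth last set by the user, whether or not it is currently in effect. */
    int getMaxDepth() {
        return maxDepth;
    }

    void determineMaxDepth(boolean determineMaxDepth) {
        this.engineDecides = determineMaxDepth;
    }

    boolean determineMaxDepth() {
        return engineDecides;
    }

    /** Depth a build would use: the engine's choice if the flag is true, else the user's setting. */
    int effectiveMaxDepth(int engineChosenDepth) {
        return engineDecides ? engineChosenDepth : maxDepth;
    }

    public static void main(String[] args) {
        DepthSettingsSketch settings = new DepthSettingsSketch();
        settings.setMaxDepth(6);
        settings.determineMaxDepth(true);
        System.out.println(settings.getMaxDepth() + " stored, " + settings.effectiveMaxDepth(12) + " in effect");
        settings.determineMaxDepth(false);
        System.out.println(settings.getMaxDepth() + " stored, " + settings.effectiveMaxDepth(12) + " in effect");
    }
}
```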

getMaximumPValue

public double getMaximumPValue()
Returns the largest acceptable probability of any target value to split a node.

Returns:
double

getMaxSplits

public int getMaxSplits()
Returns the maximum number of children at any interior node. Choices are binary or k-ary where k >= 2.

Returns:
int

getMaxSurrogates

public int getMaxSurrogates()
Returns the maximum number of surrogate splits to be computed by the model at each node.

Returns:
int

getMinDecreaseInImpurity

public double getMinDecreaseInImpurity()
Returns the minimum decrease in impurity required to justify splitting a node.

Returns:
double

getMinNodeSize

public double getMinNodeSize()
Returns the minimum node size. If the unit is count, returns the minimum number of cases per node. If the unit is percentage, returns the minimum number of cases per node expressed as a percentage.

Returns:
double

getMinNodeSizeUnit

public SizeUnit getMinNodeSizeUnit()
Returns the size unit of the minimum node size.

Returns:
SizeUnit

getPruningHomogeneityMetric

public TreeHomogeneityMetric getPruningHomogeneityMetric()
Returns the homogeneity metric used to establish a pruning path through the full tree.

Returns:
TreeHomogeneityMetric

getTreeSelectionMethod

public TreeSelectionMethod getTreeSelectionMethod()
Returns the method used to select a tree from the hierarchical list of trees produced by the pruning process.

Returns:
TreeSelectionMethod

setBuildHomogeneityMetric

public void setBuildHomogeneityMetric(TreeHomogeneityMetric buildMetric)
Sets the homogeneity metric used to measure goodness of a split. These are specific to target type. Ordered (regression targets) metrics include mean squared error and mean absolute deviation, or other vendor metric. Categorical target metrics include gini, entropy, misclassification rate, or other vendor metric. The build metric must not be null.

Parameters:
buildMetric - The homogeneity metric used to measure goodness of a split.
Returns:
void

setMaxDepth

public void setMaxDepth(int maxDepth)
Sets the maximum depth of the tree model to be built. This is a termination criterion. The build process halts the search for extensions of any node at this depth. The maximum depth must be a positive number and less than or equal to the maximum depth allowed.

Parameters:
maxDepth - The maximum depth of the tree model.
Returns:
void

setMaximumPValue

public void setMaximumPValue(double maxPValue)
Sets the largest acceptable probability of any target value to split a node. A node with a target value probability exceeding the maximum value is considered to be homogeneous, i.e., no further splitting is required. This node becomes terminal (a leaf node). The maximum P value must be between 0 and 1.

Parameters:
maxPValue - The largest acceptable probability of any target value to split a node.
Returns:
void
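
A minimal sketch of the homogeneity test described for the maximum P value, assuming the probability of a target value at a node is estimated as its share of the cases in that node. The class and method names are hypothetical.

```java
final class MaxPValueSketch {

    /**
     * Returns true when the most frequent target value's share of the node
     * exceeds maxPValue, so the node is treated as homogeneous and becomes a leaf.
     */
    static boolean isTerminal(int[] classCounts, double maxPValue) {
        if (maxPValue < 0.0 || maxPValue > 1.0) {
            throw new IllegalArgumentException("maxPValue must be between 0 and 1");
        }
        int total = 0;
        int max = 0;
        for (int c : classCounts) {
            total += c;
            max = Math.max(max, c);
        }
        return total > 0 && (double) max / total > maxPValue;
    }

    public static void main(String[] args) {
        System.out.println(isTerminal(new int[] {95, 5}, 0.9));
        System.out.println(isTerminal(new int[] {60, 40}, 0.9));
    }
}
```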

setMaxSplits

public void setMaxSplits(int maxSplits)
Sets the maximum number of children at any interior node. Choices are binary or k-ary where k >= 2.

Parameters:
maxSplits - The maximum number of children at any interior node.
Returns:
void

setMaxSurrogates

public void setMaxSurrogates(int maxSurrogates)
Sets the maximum number of surrogate splits to be computed by the model at each node. The maximum number of surrogates must be a non-negative number.

Parameters:
maxSurrogates - The maximum number of surrogate splits.
Returns:
void

setMinDecreaseInImpurity

public void setMinDecreaseInImpurity(double minImpurity)
Sets the minimum decrease in impurity required to justify splitting a node. This is a termination criterion. If no candidate split can be found with a decrease greater than the minimum, then the build process halts. The minimum decrease in impurity must be a non-negative number.

Parameters:
minImpurity - The minimum decrease in impurity required to justify splitting a node.
Returns:
void
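
The criterion can be sketched as a filter over the candidate splits of one node: the best candidate is kept only if its decrease in impurity is strictly greater than the minimum, otherwise splitting stops for that node. The names are hypothetical, and the decreases are assumed to have been computed already with the build homogeneity metric.

```java
import java.util.OptionalInt;

final class MinDecreaseSketch {

    /**
     * Returns the index of the candidate split with the largest impurity
     * decrease, or empty if no candidate exceeds minDecrease.
     */
    static OptionalInt bestSplit(double[] impurityDecrease, double minDecrease) {
        if (minDecrease < 0.0) {
            throw new IllegalArgumentException("minDecrease must be non-negative");
        }
        int best = -1;
        for (int i = 0; i < impurityDecrease.length; i++) {
            boolean qualifies = impurityDecrease[i] > minDecrease;
            if (qualifies && (best < 0 || impurityDecrease[i] > impurityDecrease[best])) {
                best = i;
            }
        }
        return best < 0 ? OptionalInt.empty() : OptionalInt.of(best);
    }

    public static void main(String[] args) {
        System.out.println(bestSplit(new double[] {0.01, 0.08, 0.03}, 0.02));
        System.out.println(bestSplit(new double[] {0.01, 0.015}, 0.02));
    }
}
```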

setMinNodeSize

public void setMinNodeSize(double size,
                           SizeUnit unit)
Sets the minimum node size. If unit is count, size is the minimum number of cases per node. If unit is percentage, size is the minimum number of cases per node expressed as a percentage. The node size must be a non-negative number.

Parameters:
size - The minimum node size.
unit - The unit of the size.
Returns:
void
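
How a size and unit pair might be resolved to a case count is sketched below. The Unit enum is a local stand-in for SizeUnit, and two points are assumptions rather than documented behavior: the percentage is taken on a 0 to 100 scale relative to the total number of training cases, and fractional results are rounded up.

```java
final class MinNodeSizeSketch {

    /** Local stand-in for the two size units mentioned in the text. */
    enum Unit { COUNT, PERCENTAGE }

    /** Resolves a (size, unit) pair to a minimum number of cases per node. */
    static long minCases(double size, Unit unit, long totalCases) {
        if (size < 0.0) {
            throw new IllegalArgumentException("size must be non-negative");
        }
        if (unit == Unit.COUNT) {
            return (long) Math.ceil(size);
        }
        // Assumed: percentage of all training cases, on a 0-100 scale.
        return (long) Math.ceil(size * totalCases / 100.0);
    }

    public static void main(String[] args) {
        System.out.println(minCases(25, Unit.COUNT, 1000));
        System.out.println(minCases(5, Unit.PERCENTAGE, 1000));
    }
}
```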

setPruningHomogeneityMetric

public void setPruningHomogeneityMetric(TreeHomogeneityMetric pruningMetric)
Sets the homogeneity metric used to establish a pruning path through the full tree. These are specific to target type. Ordered (regression targets) metrics include mean squared error and mean absolute deviation, or other vendor metric. Categorical target metrics include gini, entropy, misclassification rate, or other vendor metric. The pruning metric must not be null.

Parameters:
pruningMetric - The homogeneity metric used to establish a pruning path through the full tree.
Returns:
void

setTreeSelectionMethod

public void setTreeSelectionMethod(TreeSelectionMethod selectionMethod)
Sets the method used to select a tree from the hierarchical list of trees produced by the pruning process. The tree selection method must not be null.

Parameters:
selectionMethod - The method used to select a tree from the hierarchical list of trees produced by the pruning process.
Returns:
void