A TreeSettings object is a subclass of SupervisedAlgorithmSettings that supports the capabilities of the classification and regression mining functions and their corresponding algorithms. These capabilities include the number of surrogates, termination criteria, tree selection method, number of splits, cost function, and pruning function.
Surrogates are node predicates that best mimic the action of the primary split predicate. These surrogates, if they exist, are applied during scoring when data is missing that is required for evaluation of the primary split predicate. The surrogates are ranked in accordance with how well they mimic the action of the primary predicate. In the event of multiple missing data elements in a record to be scored, the highest ranked surrogate with non-missing predicate data is applied.
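The fallback rule described above can be sketched as follows. This is a minimal illustration, not the DME's implementation: the `SurrogateChooser` and `Split` classes are hypothetical, a record is modeled as a map in which a missing value is simply an absent key, and the surrogate list is assumed to be ordered from best to worst mimic of the primary split.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Sketch: choose which split predicate to apply when a record has missing values. */
final class SurrogateChooser {

    /** Hypothetical split predicate that tests one numeric attribute against a threshold. */
    static final class Split {
        final String attribute;
        final double threshold;

        Split(String attribute, double threshold) {
            this.attribute = attribute;
            this.threshold = threshold;
        }
    }

    /**
     * Returns the primary split if its attribute is present in the row; otherwise
     * the highest-ranked surrogate whose attribute is present. The surrogate list
     * is assumed to be sorted from best to worst.
     */
    static Optional<Split> choose(Split primary, List<Split> rankedSurrogates,
                                  Map<String, Double> row) {
        if (row.get(primary.attribute) != null) {
            return Optional.of(primary);
        }
        for (Split surrogate : rankedSurrogates) {
            if (row.get(surrogate.attribute) != null) {
                return Optional.of(surrogate);
            }
        }
        return Optional.empty(); // nothing usable; the caller decides the fallback
    }

    public static void main(String[] args) {
        Split primary = new Split("income", 50000.0);
        List<Split> surrogates = Arrays.asList(new Split("age", 40.0), new Split("tenure", 5.0));
        // income and age are missing, so the second-ranked surrogate is chosen
        Map<String, Double> row = Collections.singletonMap("tenure", 3.0);
        System.out.println(choose(primary, surrogates, row).map(s -> s.attribute).orElse("none"));
    }
}
```

What happens when neither the primary split nor any surrogate can be evaluated is not stated in the text above, so the sketch returns an empty result and leaves that decision to the caller.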
The termination criteria preclude further splitting of a node, which enhances build performance.
Tree models typically produce a hierarchical list of decision trees. Deeper trees extend and encompass all shallower trees. The list is constructed via a pruning process and a tree selection method picks the decision tree which is to be used as the final model. An evaluation data set or cross-validation is used to compute the error associated with each tree in the hierarchical list. The associated statistics (misclassification rate, gini, entropy, mean square error, mean absolute deviation) assist in the selection decision. Tree selection methods include automated choices, such as One-Standard-Error Tree and the Minimum Variance Tree, and the non-automated choice: Manual.
The One-Standard-Error Tree is the tree whose error rate is within one standard error of the error rate of the minimum error rate tree and that is smaller than the minimum error rate tree. This method is conservative, is justified by principles of parsimony, and applies to classification targets only. The restriction to classification targets exists because the standard error of the error rate in regression depends on fourth-order terms, which are typically not reliable.
The Minimum Variance Tree is the tree in the list with the smallest error rate. This technique applies to both regression and classification targets.
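The two automated selection methods can be compared with a small sketch. It assumes each tree in the pruned list is summarized by a node count, an estimated error rate, and the standard error of that estimate; the `TreeSelectionSketch` and `Candidate` names are hypothetical and are not part of the API. When no smaller tree falls within one standard error, this sketch falls back to the minimum-error tree, which is an assumption rather than documented behavior.

```java
import java.util.Arrays;
import java.util.List;

/** Sketch of the two automated tree selection rules over a pruned list of trees. */
final class TreeSelectionSketch {

    /** Hypothetical summary of one tree in the hierarchical list. */
    static final class Candidate {
        final int nodeCount;
        final double errorRate;
        final double standardError;

        Candidate(int nodeCount, double errorRate, double standardError) {
            this.nodeCount = nodeCount;
            this.errorRate = errorRate;
            this.standardError = standardError;
        }
    }

    /** Tree with the smallest estimated error rate; ties go to the smaller tree. */
    static Candidate minimumError(List<Candidate> trees) {
        Candidate best = trees.get(0);
        for (Candidate c : trees) {
            boolean lowerError = c.errorRate < best.errorRate;
            boolean sameErrorButSmaller =
                    c.errorRate == best.errorRate && c.nodeCount < best.nodeCount;
            if (lowerError || sameErrorButSmaller) {
                best = c;
            }
        }
        return best;
    }

    /** Smallest tree whose error is within one standard error of the minimum-error tree. */
    static Candidate oneStandardError(List<Candidate> trees) {
        Candidate min = minimumError(trees);
        double threshold = min.errorRate + min.standardError;
        Candidate chosen = min;
        for (Candidate c : trees) {
            if (c.errorRate <= threshold && c.nodeCount < chosen.nodeCount) {
                chosen = c;
            }
        }
        return chosen;
    }

    public static void main(String[] args) {
        List<Candidate> trees = Arrays.asList(
                new Candidate(3, 0.30, 0.02),
                new Candidate(7, 0.21, 0.02),
                new Candidate(15, 0.20, 0.02),
                new Candidate(31, 0.22, 0.02));
        System.out.println("minimum error: " + minimumError(trees).nodeCount + " nodes");
        System.out.println("one standard error: " + oneStandardError(trees).nodeCount + " nodes");
    }
}
```

In this example the 15-node tree has the lowest error, but the 7-node tree is within one standard error of it, so the one-standard-error rule prefers the simpler tree.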
The maximum number of splits indicates the maximum number of child nodes of each interior node. Possible values include binary (exactly 2 children per interior node) or k-ary alternatives (2 or more children per interior node). k-ary trees can have a variable number of children per interior node.
The cost function is used to measure the goodness of the split. Cost functions are specific to target type. For example, classification targets can use reduction in misclassification rate, entropy, or gini as a measure of goodness. Regression targets can use reduction in mean squared error or mean absolute deviation as a measure of goodness.
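For classification targets, the three measures named above have standard textbook definitions, sketched below over per-class case counts at a node. The `ImpuritySketch` class is hypothetical; which formulas a particular DME uses internally is not specified here. The decrease computed by `giniDecrease` is the kind of quantity that a minimum-decrease-in-impurity threshold would be compared against.

```java
/** Sketch of common impurity measures for classification, computed from class counts. */
final class ImpuritySketch {

    private static long total(int[] counts) {
        long sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return sum;
    }

    /** Gini impurity: 1 - sum of squared class proportions. */
    static double gini(int[] counts) {
        long n = total(counts);
        if (n == 0) {
            return 0.0;
        }
        double sumSquares = 0.0;
        for (int c : counts) {
            double p = (double) c / n;
            sumSquares += p * p;
        }
        return 1.0 - sumSquares;
    }

    /** Entropy in bits: -sum of p * log2(p) over classes with p > 0. */
    static double entropy(int[] counts) {
        long n = total(counts);
        if (n == 0) {
            return 0.0;
        }
        double h = 0.0;
        for (int c : counts) {
            if (c > 0) {
                double p = (double) c / n;
                h -= p * Math.log(p) / Math.log(2.0);
            }
        }
        return h;
    }

    /** Misclassification rate: 1 - proportion of the majority class. */
    static double misclassification(int[] counts) {
        long n = total(counts);
        if (n == 0) {
            return 0.0;
        }
        int max = 0;
        for (int c : counts) {
            max = Math.max(max, c);
        }
        return 1.0 - (double) max / n;
    }

    /** Decrease in Gini impurity from splitting a parent into left and right children. */
    static double giniDecrease(int[] left, int[] right) {
        int[] parent = new int[left.length];
        for (int i = 0; i < left.length; i++) {
            parent[i] = left[i] + right[i];
        }
        double n = total(parent);
        double weighted = total(left) / n * gini(left) + total(right) / n * gini(right);
        return gini(parent) - weighted;
    }

    public static void main(String[] args) {
        // A perfect split of a balanced two-class node removes all impurity.
        System.out.println(giniDecrease(new int[] {5, 0}, new int[] {0, 5}));
    }
}
```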
The pruning function is used to create a set of sub-trees. Pruning functions are specific to target-type. The pruning operation trades off complexity (number of nodes) for error rate and produces a hierarchical list of trees. Error rate is measured by one of the set of metrics that are used to measure goodness of split.
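One common way to express the trade-off between complexity and error rate is cost-complexity pruning, as used in CART, where each subtree is scored as its error rate plus a penalty per node. The text above does not say which pruning function a given DME applies, so the sketch below is only an illustration of the idea; the `CostComplexitySketch` class and the `alpha` penalty parameter are assumptions.

```java
/** Sketch: pick the subtree that minimizes errorRate + alpha * nodeCount. */
final class CostComplexitySketch {

    /**
     * Returns the index of the subtree with the lowest penalized cost.
     * nodeCounts[i] and errorRates[i] describe the i-th subtree in the list;
     * alpha is the assumed per-node complexity penalty.
     */
    static int bestSubtree(int[] nodeCounts, double[] errorRates, double alpha) {
        if (nodeCounts.length == 0 || nodeCounts.length != errorRates.length) {
            throw new IllegalArgumentException("arrays must be non-empty and of equal length");
        }
        int best = 0;
        double bestCost = errorRates[0] + alpha * nodeCounts[0];
        for (int i = 1; i < nodeCounts.length; i++) {
            double cost = errorRates[i] + alpha * nodeCounts[i];
            if (cost < bestCost) {
                best = i;
                bestCost = cost;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        int[] nodes = {1, 3, 7, 15};
        double[] errors = {0.40, 0.25, 0.20, 0.19};
        // Larger penalties favor smaller subtrees.
        for (double alpha : new double[] {0.0, 0.01, 0.05, 0.1}) {
            System.out.println("alpha=" + alpha + " -> subtree with "
                    + nodes[bestSubtree(nodes, errors, alpha)] + " nodes");
        }
    }
}
```

Sweeping `alpha` from zero upward visits progressively smaller subtrees, which is one way a hierarchical list of trees can arise.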
The optional prior probabilities refer to the assumed distribution of classification target values in the data used to build a model prior to any sampling. PriorProbabilities can be specified by the user or computed from the training data (default). This information can be used for creating a more balanced mix of target data values for model building. The user might specify prior probabilities to inform the build process that a target-value-stratified sampling procedure has been used to create the training sample. Another use of prior probabilities is to weight errors differently among the target classes. A target class with a higher prior probability, in essence, represents a target class with greater importance in the training sample. This use is an alternative to use of a cost matrix to specify such weights.
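One conventional way to account for stratified sampling is to weight each case by the ratio of its class's prior probability to that class's share of the training sample. The sketch below illustrates that convention only; it is an assumption that a DME applies priors this way, and the `PriorWeightSketch` class is hypothetical.

```java
/** Sketch: derive per-class case weights from priors and stratified sample counts. */
final class PriorWeightSketch {

    /**
     * weight[i] = prior[i] / (sampleCounts[i] / totalSampleCount).
     * A class that is over-represented in the sample relative to its prior
     * receives a weight below 1; an under-represented class, a weight above 1.
     */
    static double[] classWeights(double[] priors, long[] sampleCounts) {
        if (priors.length != sampleCounts.length) {
            throw new IllegalArgumentException("priors and sampleCounts must have equal length");
        }
        long total = 0;
        for (long c : sampleCounts) {
            total += c;
        }
        double[] weights = new double[priors.length];
        for (int i = 0; i < priors.length; i++) {
            if (sampleCounts[i] == 0) {
                throw new IllegalArgumentException("class " + i + " has no cases in the sample");
            }
            double sampleFraction = (double) sampleCounts[i] / total;
            weights[i] = priors[i] / sampleFraction;
        }
        return weights;
    }

    public static void main(String[] args) {
        // The population is 90% / 10%, but the sample was balanced to 50% / 50%.
        double[] w = classWeights(new double[] {0.9, 0.1}, new long[] {500, 500});
        System.out.println(w[0] + " " + w[1]);
    }
}
```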
The optional cost matrix specifies a two-dimensional, N x N matrix that defines the cost associated with a prediction error when the prediction differs from the actual value. A cost matrix is typically used in classification models, where N is the number of classes in the target, and the columns and rows are labeled with class values.
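A cost matrix is typically applied by predicting the class with the lowest expected cost rather than the class with the highest probability. The sketch below assumes that rows are indexed by the actual class and columns by the predicted class; the text above does not fix that orientation, and the `CostMatrixSketch` class is hypothetical.

```java
/** Sketch: choose the prediction that minimizes expected cost under an N x N cost matrix. */
final class CostMatrixSketch {

    /**
     * probabilities[a] is the estimated probability that the actual class is a.
     * cost[a][p] is the assumed cost of predicting p when the actual class is a.
     * Returns the index of the predicted class with the lowest expected cost.
     */
    static int minExpectedCostClass(double[] probabilities, double[][] cost) {
        int best = -1;
        double bestCost = Double.POSITIVE_INFINITY;
        for (int predicted = 0; predicted < cost.length; predicted++) {
            double expected = 0.0;
            for (int actual = 0; actual < cost.length; actual++) {
                expected += probabilities[actual] * cost[actual][predicted];
            }
            if (expected < bestCost) {
                best = predicted;
                bestCost = expected;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] probabilities = {0.7, 0.3};
        // Missing an actual class 1 (predicting 0) is five times as costly as the reverse.
        double[][] cost = {{0.0, 1.0}, {5.0, 0.0}};
        System.out.println(minExpectedCostClass(probabilities, cost));
    }
}
```

With these numbers, predicting class 0 has an expected cost of 0.3 × 5 = 1.5 and predicting class 1 has an expected cost of 0.7 × 1 = 0.7, so class 1 is chosen even though class 0 is more probable.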
See Also: AlgorithmSettings, SupervisedAlgorithmSettings

| Method Summary |

| Return type | Method | Description |
| --- | --- | --- |
| void | computeNodeStatistics(boolean computeNodeStatistics) | If computeNodeStatistics is true, the DME computes node statistics for the tree model. |
| boolean | determineMaxDepth() | Returns true if the maximum depth is determined by the DME at runtime. |
| void | determineMaxDepth(boolean determineMaxDepth) | If true, the DME determines the maximum depth at runtime; if false, the previous maximum depth setting is used. |
| TreeHomogeneityMetric | getBuildHomogeneityMetric() | Returns the homogeneity metric used to measure goodness of a split. |
| boolean | getComputeNodeStatistics() | Returns whether node statistics will be computed. |
| int | getMaxDepth() | Returns the maximum depth of the tree model to be built, last set by the user. |
| double | getMaximumPValue() | Returns the largest acceptable probability of any target value to split a node. |
| int | getMaxSplits() | Returns the maximum number of children at any interior node. |
| int | getMaxSurrogates() | Returns the maximum number of surrogate splits to be computed by the model at each node. |
| double | getMinDecreaseInImpurity() | Returns the minimum decrease in impurity required to justify splitting a node. |
| double | getMinNodeSize() | Returns the minimum node size. |
| SizeUnit | getMinNodeSizeUnit() | Returns the size unit of the minimum node size. |
| TreeHomogeneityMetric | getPruningHomogeneityMetric() | Returns the homogeneity metric used to establish a pruning path through the full tree. |
| TreeSelectionMethod | getTreeSelectionMethod() | Returns the method used to select a tree from the hierarchical list of trees produced by the pruning process. |
| void | setBuildHomogeneityMetric(TreeHomogeneityMetric buildMetric) | Sets the homogeneity metric used to measure goodness of a split. |
| void | setMaxDepth(int maxDepth) | Sets the maximum depth of the tree model to be built. |
| void | setMaximumPValue(double maxPValue) | Sets the largest acceptable probability of any target value to split a node. |
| void | setMaxSplits(int maxSplits) | Sets the maximum number of children at any interior node. |
| void | setMaxSurrogates(int maxSurrogates) | Sets the maximum number of surrogate splits to be computed by the model at each node. |
| void | setMinDecreaseInImpurity(double minImpurity) | Sets the minimum decrease in impurity required to justify splitting a node. |
| void | setMinNodeSize(double size, SizeUnit unit) | Sets the minimum node size. |
| void | setPruningHomogeneityMetric(TreeHomogeneityMetric pruningMetric) | Sets the homogeneity metric used to establish a pruning path through the full tree. |
| void | setTreeSelectionMethod(TreeSelectionMethod selectionMethod) | Sets the method used to select a tree from the hierarchical list of trees produced by the pruning process. |

| Methods inherited from interface javax.datamining.base.AlgorithmSettings |
getMiningAlgorithm, verify |
| Method Detail |
public void computeNodeStatistics(boolean computeNodeStatistics)
computeNodeStatistics - If true, the DME computes node statistics for the tree model.
public boolean determineMaxDepth()
public void determineMaxDepth(boolean determineMaxDepth)
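determineMaxDepth - If true, the DME determines the maximum depth at runtime; if false, the previous maximum depth setting is used.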
public TreeHomogeneityMetric getBuildHomogeneityMetric()
public boolean getComputeNodeStatistics()
public int getMaxDepth()
If determineMaxDepth(true) has been called, this setting is ignored.
public double getMaximumPValue()
public int getMaxSplits()
public int getMaxSurrogates()
public double getMinDecreaseInImpurity()
public double getMinNodeSize()
public SizeUnit getMinNodeSizeUnit()
public TreeHomogeneityMetric getPruningHomogeneityMetric()
public TreeSelectionMethod getTreeSelectionMethod()
public void setBuildHomogeneityMetric(TreeHomogeneityMetric buildMetric)
buildMetric - The homogeneity metric used to measure goodness of a split.
public void setMaxDepth(int maxDepth)
maxDepth - The maximum depth of the tree model.
public void setMaximumPValue(double maxPValue)
maxPValue - The largest acceptable probability of any target value to split a node.
public void setMaxSplits(int maxSplits)
maxSplits - The maximum number of children at any interior node.
public void setMaxSurrogates(int maxSurrogates)
maxSurrogates - The maximum number of surrogate splits.
public void setMinDecreaseInImpurity(double minImpurity)
minImpurity - The minimum decrease in impurity required to justify splitting a node.
public void setMinNodeSize(double size,
SizeUnit unit)
If unit is count, size is the minimum number of cases per node. If unit is percentage, size is the minimum number of cases per node expressed as a percentage.
The node size must be a non-negative number.
size - The minimum node size.
unit - The unit of the size.
public void setPruningHomogeneityMetric(TreeHomogeneityMetric pruningMetric)
pruningMetric - The homogeneity metric used to establish pruning path through the full tree.
public void setTreeSelectionMethod(TreeSelectionMethod selectionMethod)
selectionMethod - The method used to select a tree from the hierarchical list of trees produced by the pruning process.
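The count and percentage units described for setMinNodeSize can be reduced to a single threshold in cases, as in the sketch below. The `MinNodeSizeSketch` class and its `Unit` enum are hypothetical stand-ins, not the API's SizeUnit type. The sketch also assumes that a percentage is on a 0 to 100 scale and is taken relative to the total number of training cases, which this page does not state.

```java
/** Sketch: turn a (size, unit) minimum node size into a minimum number of cases. */
final class MinNodeSizeSketch {

    /** Hypothetical stand-in for the two units mentioned for setMinNodeSize. */
    enum Unit { COUNT, PERCENTAGE }

    /**
     * Returns the smallest whole number of cases that satisfies the setting.
     * For PERCENTAGE, size is assumed to be a percentage (0 to 100) of totalCases.
     */
    static long minCases(double size, Unit unit, long totalCases) {
        if (size < 0) {
            throw new IllegalArgumentException("node size must be non-negative");
        }
        if (unit == Unit.COUNT) {
            return (long) Math.ceil(size);
        }
        // Multiply before dividing so that whole-number results stay exact.
        return (long) Math.ceil(size * totalCases / 100.0);
    }

    public static void main(String[] args) {
        System.out.println(minCases(10, Unit.COUNT, 1000));
        System.out.println(minCases(2.5, Unit.PERCENTAGE, 1000));
    }
}
```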