Decision Tree and CART Operator General Principles

Decision trees are rule-based classifiers that consist of a hierarchy of decision points (the nodes of the tree).

A decision tree models how the value of a target (dependent) variable can be predicted from the values of a set of predictor (independent) variables.

A decision tree can be either a classification tree, whose labels take discrete values, or a regression tree, whose labels take continuous (numeric) values.

In classification decision tree analysis, the predicted outcome is the class to which the data belongs. The Decision Tree operator in Team Studio performs classification decision tree analysis using the C4.5 (Quinlan 1993) algorithm. In decision tree classification analysis, each dependent variable can take only a discrete set of possible values.
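As an illustration of how a C4.5-style algorithm might score a candidate split on a discrete feature, the following Python sketch computes the gain ratio (information gain divided by split information), which is the criterion described by Quinlan for C4.5. The function names entropy and gain_ratio are hypothetical and chosen for this example; this is not the Team Studio implementation, and it ignores continuous features and missing values.

```python
import math
from collections import Counter

def entropy(items):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(items)
    return -sum((n / total) * math.log2(n / total) for n in Counter(items).values())

def gain_ratio(values, labels):
    """Gain ratio of splitting `labels` by the discrete feature `values`.

    Illustrative only: assumes one branch per distinct feature value.
    """
    total = len(labels)
    # Group the class labels by the feature value of each row.
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    # Weighted entropy of the labels after the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    # Split information penalizes features with many distinct values.
    split_info = entropy(values)
    return gain / split_info if split_info > 0 else 0.0

# A feature that separates the two classes perfectly.
outlook = ["sunny", "sunny", "rain", "rain"]
play = ["no", "no", "yes", "yes"]
ratio = gain_ratio(outlook, play)
```

A feature with a higher gain ratio is a better candidate for the decision point at a node; a feature with a single distinct value cannot split the data and scores 0.0 here.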

In regression decision tree analysis, the predicted outcome is a real number, such as the price of a house or a patient's length of stay in a hospital. The Team Studio CART operators perform regression decision tree analysis using the CART (Breiman et al. 1984) algorithm. In CART decision tree analysis, each dependent variable can take either discrete or continuous values.

Decision trees are trained by recursive partitioning: the dataset is split into groups, and each group is then analyzed and split further into subgroups. The algorithm stops splitting when one of several stopping conditions is met, which helps keep the model from overfitting.
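Recursive partitioning can be sketched in a few lines of Python. The example below is a minimal illustration under stated assumptions, not the algorithm used by the Team Studio operators: it handles numeric features only, scores splits with Gini impurity, and uses two hypothetical stopping conditions (max_depth and min_samples) in addition to stopping when a group contains a single class. All function and parameter names are invented for this example.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def best_split(rows, labels):
    """Return the (feature index, threshold) that most reduces impurity, or None."""
    best, best_score = None, gini(labels)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue  # a split must produce two non-empty groups
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (f, t), score
    return best

def build_tree(rows, labels, depth=0, max_depth=3, min_samples=2):
    """Recursively partition the rows; returns a nested dict."""
    majority = Counter(labels).most_common(1)[0][0]
    split = None
    # Stopping conditions (assumed for illustration): depth limit,
    # too few rows, or a group that already contains a single class.
    if depth < max_depth and len(rows) >= min_samples and len(set(labels)) > 1:
        split = best_split(rows, labels)
    if split is None:
        return {"leaf": True, "prediction": majority}
    f, t = split
    left = [i for i, r in enumerate(rows) if r[f] <= t]
    right = [i for i, r in enumerate(rows) if r[f] > t]
    return {
        "leaf": False, "feature": f, "threshold": t,
        "left": build_tree([rows[i] for i in left], [labels[i] for i in left],
                           depth + 1, max_depth, min_samples),
        "right": build_tree([rows[i] for i in right], [labels[i] for i in right],
                            depth + 1, max_depth, min_samples),
    }

rows = [[1.0], [2.0], [8.0], [9.0]]
labels = ["a", "a", "b", "b"]
tree = build_tree(rows, labels)
```

On this toy data, the root splits on feature 0 at threshold 2.0, and both resulting groups are pure, so each becomes a leaf without further splitting.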

When you use a decision tree to classify an unknown instance, a single feature is examined at each node of the tree. Based on the value of that feature, the next node is selected. Each node represents a set of records (rows) from the original dataset. Nodes that have child nodes are called "interior" nodes. Nodes that do not have child nodes are called "terminal" or "leaf" nodes. The topmost node is called the "root" node and represents all of the rows in the dataset. Unlike real trees, decision trees are drawn with the root at the top.
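The traversal described above can be sketched as follows. The tree here is a hypothetical, hand-built nested dictionary (its keys, thresholds, and class names are invented for this example and do not reflect any Team Studio model format); the predict function examines one feature at each interior node and follows the matching branch until it reaches a leaf.

```python
def predict(node, row):
    """Walk from the root to a leaf, examining one feature per interior node."""
    while not node["leaf"]:
        # Choose the child node based on the value of this node's feature.
        branch = "left" if row[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["prediction"]

# Hypothetical tree: the root is an interior node with one leaf child
# and one interior child, which in turn has two leaf children.
tree = {
    "leaf": False, "feature": 0, "threshold": 2.0,
    "left": {"leaf": True, "prediction": "low"},
    "right": {
        "leaf": False, "feature": 1, "threshold": 5.0,
        "left": {"leaf": True, "prediction": "medium"},
        "right": {"leaf": True, "prediction": "high"},
    },
}

result = predict(tree, [3.0, 6.0])
```

For the row [3.0, 6.0], the root sends the instance to the right (3.0 is greater than 2.0), and the second interior node sends it right again (6.0 is greater than 5.0), so the instance ends at the "high" leaf.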
