Intuiting Decision Trees

by Justin Skycak on

Decision trees are able to model nonlinear data while remaining interpretable.

This post is part of the series Intuiting Predictive Algorithms.


One drawback of SVMs and NNs is that they are black-box models, meaning they are uninterpretable. Although they can model highly nonlinear data, we can’t make much sense of what the model has learned by looking at the parameters. On the other hand, the parameters in linear regressions make intuitive sense as the contributions of individual factors to the overall decision, but they are restricted by linearity and thus won’t make a good predictive model for highly nonlinear data. Decision trees bridge the gap and are able to model nonlinear data while remaining interpretable.

A decision tree constructs a model by recursively partitioning (“splitting”) the training data along some value of a predictor (“attribute”) until each partition represents a single class of data. The tree starts with a single node that represents all of the data, and then splits into two child nodes to separate the data into two groups that are as homogeneous (“pure”) as possible. Then, each child node performs the same splitting process to produce two more child nodes of maximum purity, and so on, until each terminal node (“leaf”) of the tree is 100% pure or the data cannot be split any more (sometimes otherwise identical records may have different classes). The predicted probability distribution for the class of any input $x$ is computed as the frequency distribution of classes within the input’s corresponding leaf.
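To make this concrete, here is a minimal Python sketch of the procedure (not from the original post; the function names and the dictionary-based tree representation are just illustrative). It greedily chooses the split whose children have the lowest weighted impurity, stops when a node is pure or its records are identical, and predicts by returning the class frequency distribution of the leaf that an input lands in. The `impurity` argument can be any node-impurity function, such as the entropy or Gini impurity defined below.

```python
import numpy as np
from collections import Counter

def class_distribution(y):
    # Frequency distribution of classes within a node: this is the
    # predicted probability distribution for any input that lands here.
    counts = Counter(y)
    return {c: n / len(y) for c, n in counts.items()}

def grow_tree(X, y, impurity):
    # Stop when the node is 100% pure, or when all remaining records are
    # identical (so no further split is possible).
    if len(set(y)) == 1 or len(np.unique(X, axis=0)) == 1:
        return {"leaf": True, "distribution": class_distribution(y)}

    # Greedily choose the (attribute, threshold) split whose two children
    # are as pure as possible on (weighted) average.
    best = None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j])[:-1]:
            left = X[:, j] <= t
            score = (left.mean() * impurity(y[left])
                     + (~left).mean() * impurity(y[~left]))
            if best is None or score < best[0]:
                best = (score, j, t, left)

    _, j, t, left = best
    return {
        "leaf": False, "attribute": j, "threshold": t,
        "left": grow_tree(X[left], y[left], impurity),
        "right": grow_tree(X[~left], y[~left], impurity),
    }

def predict_distribution(tree, x):
    # Walk down the tree until reaching a leaf, then report its class frequencies.
    while not tree["leaf"]:
        branch = "left" if x[tree["attribute"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree["distribution"]
```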

The metric that is used to quantify the purity of a split is called the splitting criterion, and it is often chosen as information gain or Gini impurity. Information gain measures the reduction in impurity (“information entropy”) achieved by a split. Information entropy for a node is measured by the expectation value

$\begin{align*} H &= \left< \log \left(\frac{1}{P(y=c)} \right) \right>_{c} \\[5pt] &= - \sum_{c} P(y=c) \log P(y=c) \end{align*}$


over data points $(x,y)$ and classes $c \in \lbrace y \rbrace$ within the node, where $P(y=c)$ is the proportion of data points in the node that have $y=c.$ Information entropy is largest for uniform distributions, and zero for distributions that are concentrated at a single point. Information gain is the entropy of the parent node minus the weighted average entropy of the child nodes, where each child is weighted by the proportion of the parent’s data points that it contains. Gini impurity is very similar to information entropy, just a little faster to compute. It is given by

$\begin{align*} G = 1- \sum_{c} P(y=c)^2 \end{align*}$
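As a rough illustration (again, not the post’s code), these quantities can be computed as follows, assuming classes are encoded as non-negative integers:

```python
import numpy as np

def entropy(y):
    # H = -sum_c P(y=c) log P(y=c); zero for a pure node,
    # largest when the classes are uniformly distributed.
    p = np.bincount(y) / len(y)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def gini(y):
    # G = 1 - sum_c P(y=c)^2; also zero for a pure node.
    p = np.bincount(y) / len(y)
    return 1.0 - np.sum(p ** 2)

def information_gain(y_parent, y_left, y_right):
    # Parent entropy minus the weighted average entropy of the children,
    # each child weighted by its share of the parent's data points.
    n = len(y_parent)
    return entropy(y_parent) - (len(y_left) / n * entropy(y_left)
                                + len(y_right) / n * entropy(y_right))
```

Either `entropy` or `gini` could be passed as the `impurity` argument in the earlier sketch. For example, perfectly splitting two equal classes, `information_gain(np.array([0,0,1,1]), np.array([0,0]), np.array([1,1]))`, yields $\log 2 \approx 0.69$: the parent’s entropy is $\log 2$ and both children are pure.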


To prevent the tree from overfitting the data, which it will almost certainly do if left to construct an unlimited number of partitions, the tree is “pruned.” Pruning can be achieved by stopping the tree prior to full growth, in which case it is called pre-pruning, or by cutting the tree back after full growth, in which case it is called post-pruning. Pre-pruning can be achieved by declining to split a node if the purity gain from the split falls below some threshold value – though, any choice of such a value is rather ad hoc. With post-pruning, it is possible to take a more principled approach, using cross-validation to check the effect of pruning on the tree’s test accuracy.
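As one possible illustration (the post does not prescribe a particular library, and the dataset and parameter values here are only placeholders), scikit-learn exposes pre-pruning through parameters like `max_depth` and `min_impurity_decrease`, and supports post-pruning through cost-complexity pruning, whose strength `ccp_alpha` can be chosen by cross-validation:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Pre-pruning: refuse splits whose impurity decrease falls below an
# ad hoc threshold, and cap the depth of the tree.
pre_pruned = DecisionTreeClassifier(min_impurity_decrease=0.01, max_depth=4).fit(X, y)

# Post-pruning: grow the full tree, then use cross-validation to pick the
# cost-complexity parameter (ccp_alpha) that gives the best test accuracy.
path = DecisionTreeClassifier().cost_complexity_pruning_path(X, y)
scores = [
    cross_val_score(DecisionTreeClassifier(ccp_alpha=a), X, y, cv=5).mean()
    for a in path.ccp_alphas
]
best_alpha = path.ccp_alphas[int(np.argmax(scores))]
post_pruned = DecisionTreeClassifier(ccp_alpha=best_alpha).fit(X, y)
```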
