Thursday, December 10, 2015

Decision Trees and Random Forests: The Problem

Talk given to the Olin College Machine Learning Reading Group.

Here we'll investigate visualization and model introspection for one of the most popular and useful machine learning models: Random Forests, and their constituent Decision Trees.  I use Random Forests for many first-pass models because there is only one hyperparameter in need of tuning: the max depth.  Increasing the number of estimators (decision trees) does not cause overfitting; the only danger comes from allowing the trees to expand fully, past inferences that generalize.  K-fold cross validation over max depths in range(5, 36, 5) often suffices.
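That tuning strategy can be sketched as follows. Note this uses sklearn's current module layout (cross_val_score moved to sklearn.model_selection after this post was written), and a synthetic dataset stands in for a real problem:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Stand-in data; swap in your own X, y.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Sweep max_depth over range(5, 36, 5) with 5-fold CV.
# n_estimators is held fixed: adding trees doesn't overfit.
scores = {}
for depth in range(5, 36, 5):
    rf = RandomForestClassifier(n_estimators=100, max_depth=depth,
                                random_state=0)
    scores[depth] = cross_val_score(rf, X, y, cv=5).mean()

best_depth = max(scores, key=scores.get)
```

The coarse step of 5 is usually enough for a first pass; refine around the winner if the score curve looks peaked.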


Suppose we don't yet care about the ensemble method, and just want to know about a single tree.  How might we go about visualizing it?

The simplest method would be to show the data structure itself: all of the nodes, their split features and thresholds, the Gini impurity at each node, the volume of samples that passes through them, and maybe which class has the majority in each node.  Within sklearn, tree.export_graphviz together with the dot tool converts a tree (which can be extracted from an RF ensemble) into its graphical form. It's not very pretty, but we can see what is happening.
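A minimal sketch of that workflow, using the Iris data and a shallow forest as illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_graphviz

iris = load_iris()
rf = RandomForestClassifier(n_estimators=10, max_depth=3, random_state=0)
rf.fit(iris.data, iris.target)

# Extract a single tree from the ensemble and write it in Graphviz
# dot format; render with e.g. `dot -Tpng tree.dot -o tree.png`.
single_tree = rf.estimators_[0]
export_graphviz(single_tree, out_file="tree.dot",
                feature_names=iris.feature_names,
                class_names=list(iris.target_names),
                filled=True)
```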


But this fails with the exponential horizontal expansion of the tree at greater depths.

Recently sklearn's documentation was updated to include this much nicer visualization on the toy Iris dataset:
It adds coloration based on class affiliation, which significantly improves the usefulness of the visualization.  By not perfectly aligning the nodes, we free up space.  This could buy one or two more layers of depth before the expansion makes it unusable, but the problem persists.

We could take a different approach and look at the volume running through the nodes, colored by their classes, in order to focus on the parts of the tree that matter most.
We can also emphasize the split features:
Good luck making sense of the relations or reading the labels past a depth of about 6.  We know there is useful information beneath these depths, because the trees and forests we train become much more powerful with this expansion, but it can't be expressed easily.  We'll keep trying.
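The raw numbers behind such a view — the sample volume and majority class at each node — are available directly on a fitted sklearn tree through its tree_ attribute. A sketch, again using Iris as a stand-in dataset:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=6, random_state=0)
clf.fit(iris.data, iris.target)

t = clf.tree_
volumes = t.n_node_samples                          # samples reaching each node
majority = t.value.squeeze(axis=1).argmax(axis=1)   # majority class per node
is_leaf = t.children_left == -1                     # leaves have no children
# Node 0 is the root, so volumes[0] equals the full training set size.
```

Any alternative rendering — color by majority class, scale by volume — starts from exactly these arrays.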

In the excellent decision trees tutorial by r2d3, the creators offer a tree with histogram nodes.  The interactive visualizations are impressive, and quite good for teaching purposes.

The insight into the feature space is much better than in the previous approaches, as the histograms take a stab at showing how well the model operates on the input data set, rather than just showing the model space.
Offsets between the nodes and the outputs make the arrangement a bit nicer and free up some width.  The exact visual arrangement needs some work given the overlaps, but I'd say this is preferable to what we've seen so far.  The utility to a data scientist is a bit higher, but still not great.

There's one last presentation that I think is worth examining because of its clever transformation.  In the visualizations reviewed so far, the limiting factor was the available width.  Here we gain more visual space through a radial arrangement.


The Sunburst visualization by BigML takes the distribution of sample volume, arranges it radially, and gains space with each level of depth from the increasing circumference of each ring.  This buys us two or three more levels before failure.  The interactive visualization is quite shiny, and allows switching between class distribution, confidence/mixedness, and split feature.  Unfortunately, reading the feature splits hierarchically is a bit more difficult, though there is help text off the figure.  It allows zooming by clicking on a region, but overall exploration of the tree is difficult.

Are These Decision Tree Visualizations Useful?

Maybe these are useful within the context of a tutorial, but a great deal of inspection work is required, even with the interactive sunburst visualization.  Exploring these trees becomes impossible past depths of ~10.  If we could combine the zooming and labeling of the sunburst with the histograms of r2d3's tree, we could make some headway into seeing the model in the data space, but moving through the tree would still be quite tedious.  

Trying to show the data structure directly preserves the relational information between features and shows how the model makes predictions, but it tries to do too much.  I hypothesize that we only need the most important relational information, which is contained in the highest nodes of the tree, plus most of the information on feature importance.  How the model predicts gets the spotlight; if you'd like to know (or estimate) what the model predicts on a given input, you can only try to follow the decisions through most of the tree, and only if you have a practical number of features.  The sunburst UI is pretty, but finding the prediction for a desired input is not straightforward.  With categorical variables, the number of features explodes - see the difference between the clean linear thresholds possible on the housing dataset (r2d3) versus all of the features in the forest cover dataset (the linked sunburst).  The histograms of r2d3's visualization mix showing the why of the training data with the what of the confidences of predictions on new data.  We make a bit of progress on these dimensions, but I don't trust it to scale to larger, practical data sets.
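Feature importance, at least, is cheap to read off a fitted forest without any tree diagram.  A sketch, once more on the Iris toy data:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(iris.data, iris.target)

# Mean decrease in impurity, aggregated across all trees in the forest.
ranked = sorted(zip(iris.feature_names, rf.feature_importances_),
                key=lambda pair: -pair[1])
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```

This single ranked list often carries more practical weight with a client than any of the tree diagrams above, though it discards the relational structure entirely.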

The Wrong Question

In practical machine learning, decision trees are 'poorly conditioned' (too fragile and dependent on the feature distributions within the input set) and prone to overfitting, so they are rarely used where random forests could be.  I chose this exploratory sequence to illustrate that even data scientists, rather than mere admirers of data visualization, choose graphical representations of their models that do not serve a purpose.  It would shock me if any of these visualizations were used to improve a model, redesign a feature engineering approach, or make a convincing argument to a client.

The obvious choice for visualization - just showing the data structure (a weighted digraph) in progressively more clever ways - doesn't help anyone iterate on the modeling process.  Can we do better?

Random Forests raise the complexity further.  I've yet to find a satisfactory decision tree visualization, much less one for random forests.  We'll examine whether we can present the data as evidence in an argument, rather than as decoration.

