
The Unreasonable Effectiveness of Random Forests - PaulHoule
https://medium.com/rants-on-machine-learning/the-unreasonable-effectiveness-of-random-forests-f33c3ce28883
======
ajtulloch
The article mentions that "The main drawback of Random Forests is the model
size. You could easily end up with a forest that takes hundreds of megabytes
of memory and is slow to evaluate."

A cool trick for speeding up trained random forests/gradient boosted decision
trees is to dump the tree to C/ASM, compile it, and dlopen it as a function
pointer. Depending on the model and the architecture, you can get an 8x
speedup relative to an optimized C implementation that walks the tree.

There's an implementation for scikit-learn with benchmarks at [https://github.com/ajtulloch/sklearn-compiledtrees](https://github.com/ajtulloch/sklearn-compiledtrees) if you're interested in this technique.

~~~
CyberDildonics
Why go through all that though? The fundamental win is flattening the tree into one linear section of memory, so each pointer jump lands in a location that is likely already prefetched and possibly in the same cache line. Dumping to C seems like it is missing the point.

A library like Cereal should be able to do this almost trivially: serialize the data, deserialize it into a flat buffer, then point to the start of that buffer.

I should also say that a random forest built by pointer-hopping to each new level is a very naive implementation: it takes a long time to create due to heap allocation and a long time to traverse due to cache misses.
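
For what it's worth, here's the flat-buffer idea as a minimal Python sketch (the toy arrays are made up, not any library's format): descending the tree becomes index arithmetic over contiguous arrays instead of chasing heap pointers.

```python
# Flat layout: each node's fields live in parallel arrays indexed by
# position, so one contiguous buffer replaces heap-allocated node objects.
import numpy as np

feature   = np.array([0, 1, -1, -1, -1])        # -1 marks a leaf
threshold = np.array([0.5, 0.3, 0.0, 0.0, 0.0])
left      = np.array([1, 3, -1, -1, -1])        # child indices, not pointers
right     = np.array([2, 4, -1, -1, -1])
label     = np.array([-1, -1, 1, 0, 1])         # class stored at the leaves

def predict(x):
    node = 0
    while feature[node] != -1:
        node = left[node] if x[feature[node]] <= threshold[node] else right[node]
    return label[node]

print(predict([0.2, 0.7]))  # 0.2 <= 0.5 -> node 1; 0.7 > 0.3 -> node 4 -> 1
```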

~~~
ajtulloch
That's actually not the whole story - flattening is one effect, sure, but when you compare the compiled trees to flattened trees (the strategy you just described), the compiled trees are substantially faster across a range of parameters. See [http://tullo.ch/articles/decision-tree-evaluation/](http://tullo.ch/articles/decision-tree-evaluation/) for a detailed evaluation.

~~~
CyberDildonics
That's really awesome that you put that together, but I have to think that 200 points at depth < 2 would basically be a linear search over about 100 elements, right?

Fundamentally, I can't think of any reason compiling a tree would be a good approach in a general sense.

Another optimization is to make splits span multiple levels instead of just one. How many levels might depend on how many splits you can squeeze into a cache line. If your split position is a float and the dimension is a byte, each split is 5 bytes, so you can squeeze 12 of them into a 64-byte line. You can go 3 levels down instead of one by packing 7 of them into a chunk (1 split + 2 splits + 4 splits); that's 35 bytes, leaving 29 bytes - nearly enough for a 32-bit index or pointer to each of the chunk's 8 subtrees.
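
Here's the packing arithmetic as a rough Python sketch (the real thing would be C reading records straight off the cache line; the splits and constants are made-up toys):

```python
# Pack 7 five-byte splits (float32 threshold + uint8 dimension) covering
# 3 tree levels into one 35-byte chunk that fits a 64-byte cache line.
import struct

splits = [(0.5, 0), (0.3, 1), (0.7, 0), (0.1, 1), (0.4, 0), (0.6, 1), (0.9, 0)]
chunk = b"".join(struct.pack("<fB", thr, dim) for thr, dim in splits)
assert len(chunk) == 35  # 7 splits * 5 bytes

def descend(x, chunk):
    # Implicit heap inside the chunk: children of split i sit at 2i+1, 2i+2.
    i = 0
    for _ in range(3):  # 3 levels per chunk
        thr, dim = struct.unpack_from("<fB", chunk, 5 * i)
        i = 2 * i + 1 if x[dim] <= thr else 2 * i + 2
    return i - 7  # which of the 8 subtrees/leaves to jump to next

print(descend([0.2, 0.2], chunk))  # -> 1
```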

------
daenz
Random forests are great. I use them for real-time silhouette identification of people against backgrounds by porting the random forest to the GPU, which allows blazingly fast forest evaluation for every pixel of the input image concurrently. I used this paper for the design: [http://www.msr-waypoint.com/pubs/71445/ForestFire.pdf](http://www.msr-waypoint.com/pubs/71445/ForestFire.pdf)

The main trick is storing each tree as a binary heap in a row of a texture, with the left and right children of node i at 2i and 2i+1 respectively, and the attribute index and split value encoded in the color channels. Your texture width is 2^depth - 1, and your texture height is the number of trees in your forest. The texture ends up looking something like this (scaled up 10x): [http://i.imgur.com/afH5iFl.png](http://i.imgur.com/afH5iFl.png)

With a GPU shader, you can navigate these trees very quickly to classify an
entire input image at once.
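
For a CPU-side feel of that layout, here's a rough Python sketch; the shapes, the leaf encoding, and the majority vote are my assumptions, not the paper's exact format:

```python
# "Texture" as an array: row t holds tree t as a 1-based binary heap,
# each texel storing (attribute index, split value) in its two channels.
# texture shape: (n_trees, 2**depth - 1, 2).
import numpy as np

def classify_pixel(x, texture, depth):
    votes = []
    for row in texture:                 # one tree per texture row
        i = 1                           # 1-based heap index of the root
        for _ in range(depth - 1):      # descend to a leaf texel
            attr, split = row[i - 1]
            i = 2 * i if x[int(attr)] <= split else 2 * i + 1
        votes.append(int(row[i - 1][1]))  # leaf: class id in the 2nd channel
    return max(set(votes), key=votes.count)  # majority vote over trees

# The shader performs this walk for every pixel of the image in parallel.
```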

------
nl
This is so true. Random forests should be the default choice for most problem
sets.

Now we have _The Unreasonable Effectiveness of Random Forests_ and _The Unreasonable Effectiveness of Recurrent Neural Networks_ [1]. We just need _The Unreasonable Effectiveness of XGBoost (for winning Kaggle competitions)_ and we'll have the whole set.

[1] [http://karpathy.github.io/2015/05/21/rnn-effectiveness/](http://karpathy.github.io/2015/05/21/rnn-effectiveness/)

~~~
xiler
Another one is _The Unreasonable Effectiveness of Deep Learning_ [1] by LeCun.

[1] [http://on-demand.gputechconf.com/gtc/2014/webinar/gtc-expres...](http://on-demand.gputechconf.com/gtc/2014/webinar/gtc-express-convolutional-networks-webinar.pdf)

~~~
sdenton4
All of which are a play on _The Unreasonable Effectiveness of Mathematics in the Natural Sciences_, I believe.

[https://en.wikipedia.org/wiki/The_Unreasonable_Effectiveness...](https://en.wikipedia.org/wiki/The_Unreasonable_Effectiveness_of_Mathematics_in_the_Natural_Sciences)

------
chestervonwinch
I'm surprised the author didn't mention model variance reduction, which is
what RFs _were designed_ to do. For example, if we generate 100 data sets
which are the same except for noise, the 100 decision boundaries drawn by a
classification tree will vary _much more_ than the decision boundaries drawn
by RFs on the same sets.

Here's what this means visually:

[http://i.imgur.com/IjfXFkm.png](http://i.imgur.com/IjfXFkm.png)

Only one of the 100 generated data sets is shown.
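
If you want to reproduce the effect yourself, here's a minimal sklearn sketch (my own toy setup, not the code behind that image) that refits a lone tree and a forest on 100 noisy data sets and measures how much their predictions move:

```python
# Variance of the decision rule across resampled data sets: a single
# tree's per-point predictions vary far more than a forest's.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

grid = np.random.RandomState(0).uniform(-2, 3, size=(500, 2))  # fixed test points
preds = {"single tree": [], "random forest": []}
for seed in range(100):  # 100 data sets: same structure, different noise
    X, y = make_moons(n_samples=200, noise=0.3, random_state=seed)
    preds["single tree"].append(DecisionTreeClassifier().fit(X, y).predict(grid))
    preds["random forest"].append(
        RandomForestClassifier(n_estimators=100).fit(X, y).predict(grid))

for name, p in preds.items():
    print(name, "mean per-point variance:", np.array(p).var(axis=0).mean())
```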

~~~
reagency
Hmm, those CT boundaries aren't just high-variance; they have extremely high error, which seems like the more significant problem. They are linear!

~~~
chestervonwinch
Each CT has zero error on its training set. The overall structure of each training set is the same (shown by the circles and triangles), but each CT also fits the noise that is specific to its training set. That is what causes the "linear" regions you mention: because CTs recursively subdivide the feature space, small subdivisions get fit to noise and extend beyond the class region to which they "should" belong.

------
atpaino
Nice post. However, I was kind of surprised to see the author list
interpretability as one of the drawbacks of random forests:

> Another point that some might find a concern is that random forest models
> are black boxes that are very hard to interpret.

I generally agree that random forests are more difficult to interpret than linear models such as logistic regression, but I think they're still far more interpretable than comparable non-linear models such as neural networks or SVMs with non-linear kernels. At the end of the day, a random forest is just a bunch of decision trees, each of which is very straightforward for humans to understand. Additionally, there are a number of straightforward methods for assessing the importance of each individual feature in a random forest in aggregate ([1], section 15.4). Neural networks, on the other hand, result in models that are too cryptic to be meaningfully inspected by a human, and have more complex variable importance measures [2].
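
As a concrete example, both the impurity-based measure and the permutation importance that [1] describes are available off the shelf in scikit-learn (permutation importance only in recent versions; the dataset below is just an arbitrary built-in):

```python
# Two standard RF importance measures: mean impurity decrease (computed
# at fit time) and permutation importance (ESL section 15.4's approach).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = load_breast_cancer(return_X_y=True)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

print(rf.feature_importances_)  # impurity-based, one score per feature
print(permutation_importance(rf, X, y, n_repeats=5,
                             random_state=0).importances_mean)
```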

The relative interpretability of these non-linear models was a large factor in our decision to add random forests to our modeling stack at Sift Science, which you can read more about here: [http://blog.siftscience.com/blog/2015/large-scale-decision-f...](http://blog.siftscience.com/blog/2015/large-scale-decision-forests-lessons-learned)

[1]: [http://statweb.stanford.edu/~tibs/ElemStatLearn/](http://statweb.stanford.edu/~tibs/ElemStatLearn/)

[2]: [http://www.massey.ac.nz/~mkjoy/pdf/Olden,Joy&DeathEM.pdf](http://www.massey.ac.nz/~mkjoy/pdf/Olden,Joy&DeathEM.pdf)

------
thenomad
I wish there were some examples in the article of their effectiveness,
mentions of best areas for their use, etc. Currently, as a non-ML expert, my
takeaway from the article was "there's a thing called a Random Forest. That's
quite cool."

Maybe I'm not the target audience, though.

~~~
micro_cam
RFs work well on heterogeneous data (i.e., a mix of data types from different sources, which might include numerical, categorical, and text data). They also tend to be fairly resistant to noise in the data and to work well on data sets that are wide (i.e., have more features/dimensions than cases/observations). Finally, they are fairly resistant to overfitting (i.e., fitting quirks in the training data instead of generalizable signal) and don't need much parameter tuning.

This is largely because they are a randomized ensemble of weaker models. Individual decision trees are quite prone to overfitting and other issues, but in an RF you grow a bunch of them on different bootstrap samples of the data and let them vote, and it turns out the combined performance is much better and much less error prone than a single model's, as the sketch below shows.
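
Sketch (bagging only, with plain sklearn trees as the weak models; a real RF also subsamples features at each split):

```python
# Bagging by hand: each tree sees a bootstrap resample of the rows, and
# predictions are combined by majority vote (binary 0/1 labels assumed).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def grow_forest(X, y, n_trees=100, seed=0):
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), size=len(X))  # sample rows with replacement
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def vote(trees, X):
    preds = np.array([t.predict(X) for t in trees])
    return (preds.mean(axis=0) > 0.5).astype(int)   # majority vote
```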

Specific examples where they work well include genetic data (many more noisy variables than observations) and customer/consumer data. They also get used on image data and in signal processing, but deep neural networks have recently tended to beat them there and on similar, less heterogeneous data sets.

------
stared
I share the awe of RFs (especially as they look naive but turn out to be extremely good for a wide range of problems); however, the main problem is that they are black boxes. Typically it is easy to get good results, but almost no insight or further pointers.

Many times I ended up using linear regression (with properly engineered variables), or something as simple, because it gave almost as good results as RF but I could interpret the results, inputs, etc. (And my task was academic or business analytics, so CV score was not the only figure of merit.)

~~~
andosa
Random forests being black boxes is something that is widely repeated, even among active practitioners, but it isn't really true any more. There are methods for decomposing random forest predictions into feature contributions, so that each prediction is represented as the sum of a bias term and a contribution from each feature (similar to what you get in linear regression, but at the level of each individual prediction instead of once for the whole model); see for example [http://blog.datadive.net/interpreting-random-forests/](http://blog.datadive.net/interpreting-random-forests/). There are also methods for extracting, pruning, and summarizing random forest rules to make them human readable; see for example the inTrees package in R.
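
Concretely, that decomposition is available as the treeinterpreter package on PyPI (as far as I know, by the author of that post); a sketch of the API as I recall it:

```python
# Each RF prediction decomposes exactly as bias + per-feature contributions.
# API as I recall it: ti.predict(model, X) -> (prediction, bias, contributions).
from treeinterpreter import treeinterpreter as ti
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

X, y = load_diabetes(return_X_y=True)
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

pred, bias, contrib = ti.predict(rf, X[:1])
print(pred[0])                     # the forest's prediction for the first row
print(bias[0] + contrib[0].sum())  # identical, by construction
```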

------
ScottBurson
Here's an interesting paper on learning methods for high-dimensional data:
[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.149...](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.149.4944&rep=rep1&type=pdf)

The abstract:

 _In this paper we perform an empirical evaluation of supervised learning on
high-dimensional data. We evaluate performance on three metrics: accuracy,
AUC, and squared loss and study the effect of increasing dimensionality on the
performance of the learning algorithms. Our findings are consistent with
previous studies for problems of relatively low dimension, but suggest that as
dimensionality increases the relative performance of the learning algorithms
changes. To our surprise, the method that performs consistently well across
all dimensions is random forests, followed by neural nets, boosted trees, and
SVMs._

~~~
ScottBurson
Another paper comparing classifier methods, that also concludes RF is one of
the best:
[http://jmlr.org/papers/volume15/delgado14a/delgado14a.pdf](http://jmlr.org/papers/volume15/delgado14a/delgado14a.pdf)

------
SwellJoe
Any recommendations for beginner materials on this and other modeling and machine learning topics? I worked for a few years with the folks behind SciPy, NumPy, and IPython, and helped package the suite of tools many use for this kind of work, so I have quite a bit of familiarity with the tooling for experimenting with this stuff, but I never dug into the math and I'm not even sure where to start on it.

------
j2kun
A bigger-picture "unreasonable effectiveness" article would be "The
Unreasonable Effectiveness of Weighted Majority Votes in Machine Learning."

This Monday I'm publishing a post on [1] that gives some rigorous explanation
for this, which essentially covers the results of this paper [2] whose main
theorem is a claim about majority voting schemes.

[1]: [http://jeremykun.com/](http://jeremykun.com/)

[2]:
[http://cseweb.ucsd.edu/~yfreund/papers/BoostingtheMargin.pdf](http://cseweb.ucsd.edu/~yfreund/papers/BoostingtheMargin.pdf)

~~~
bagboost
Thanks, [2] is a very nice paper with intuitive ideas and insights. To summarize it: when most of the voters agree and few disagree, there is good reason to expect that the learned rule generalizes well.
------
getup8
One thing to watch out for when using RFs is a test set that doesn't have identical factor levels to your training data. This can actually lead to significant bias in the predictions. I can't find a great paper on this, but there's some discussion here: [http://stats.stackexchange.com/questions/29446/random-forest...](http://stats.stackexchange.com/questions/29446/random-forest-and-new-factor-levels-in-test-set)
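
A concrete instance of the trap, using one-hot encoding in pandas (toy data, and just one of several ways the mismatch can bite):

```python
# Encoding train and test separately yields mismatched columns, so the
# forest silently reads the wrong feature; align on the training columns.
import pandas as pd

train = pd.DataFrame({"color": ["red", "blue", "red"]})
test = pd.DataFrame({"color": ["green", "blue"]})  # 'green' unseen in training

print(pd.get_dummies(train).columns.tolist())  # ['color_blue', 'color_red']
print(pd.get_dummies(test).columns.tolist())   # ['color_blue', 'color_green']

aligned = pd.get_dummies(test).reindex(
    columns=pd.get_dummies(train).columns, fill_value=0)
```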

------
JD557
I really enjoyed the point made in the first paragraph, especially the following phrase:

> Some like SVMs for the elegance of their formulation or the quality of the
> available implementations, some like decision rules for their simplicity and
> interpretability, and some are crazy about neural networks for their
> flexibility.

Nowadays it seems that ANNs (and related algorithms) get all the credit. It's refreshing to see someone point out why some people actually prefer other methods.

------
eternalban
aside: There really needs to be a law :) capping CPU cycles per unit of information gained. Energy ain't free. Emissions are not harmless. OP appears to be a gas-guzzler algorithm.

~~~
1971genocide
What are you even talking about??

~~~
eternalban
The article. Did you read it all the way to the end?

