
Random Forests for Complete Beginners - signa11
https://victorzhou.com/blog/intro-to-random-forests/
======
anthony_doan
My thesis is related to trees, forests, and ensembles of forests.

This is pretty concise, but it's mostly about decision trees, and it only covers half of that topic.

CART is the framework for decision trees for both classification and regression.
This article only addresses the classification part, which usually uses Gini
impurity, a class of split criteria that splits along parallel axes (there are
also oblique trees). The regression part uses more traditional statistical
linear regression to calculate split points (SSTO).
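
Concretely, the Gini impurity of a node is just one minus the summed squared class proportions; a quick sketch in Python (just for illustration, not from the article):

    import numpy as np

    def gini(labels):
        # 1 - sum of squared class proportions; 0 means the node is pure.
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return 1 - (p ** 2).sum()

    print(gini([0, 0, 1, 1]))  # 0.5, maximally impure for two classes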

It's very light on Random Forests though; it doesn't talk about out-of-bag
error, the implications of bootstrapping for ordinal data, etc., but overall I
think it's a neat introduction.
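
For anyone curious about out-of-bag error: each tree can be scored on the bootstrap rows it never saw, which gives a free validation estimate. For example, scikit-learn will compute it if you ask:

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier

    X, y = load_iris(return_X_y=True)
    # oob_score=True scores each tree on the rows left out of its bootstrap sample.
    forest = RandomForestClassifier(oob_score=True, random_state=0).fit(X, y)
    print(forest.oob_score_)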

> we only try a subset of the features,

You sample a subset of features without replacement at every split, rather
than just bootstrapping rows/observations like in bagging.
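
A minimal sketch of the difference, assuming some hypothetical find_best_split helper that searches the given columns:

    import numpy as np

    rng = np.random.default_rng(0)

    def bagging_split(X, y, find_best_split):
        # Plain bagging: every split searches all features.
        return find_best_split(X, y, np.arange(X.shape[1]))

    def forest_split(X, y, find_best_split, mtry):
        # Random forest: each split searches a fresh feature subset,
        # sampled without replacement.
        cols = rng.choice(X.shape[1], size=mtry, replace=False)
        return find_best_split(X, y, cols)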

The concept of ensembling weak learners together into a strong learner comes
from Dr. Ho's work in her random subspace paper, where she does it with
decision trees and essentially proposed Random Forests before Dr. Leo Breiman
(both arrived at Random Forests independently). Her advisor wrote the
theoretical paper proving the weak-learner result, on stochastic
discrimination.

~~~
sandGorgon
Is there a better tutorial/course for a beginner in this field?

the end goal not being academia, but being able to think and write
_reasonable_ production code.

~~~
savagedata
I wrote this regression tree tutorial a few years back that might be a good
complement to the tutorial above, since it covers regression instead of
classification and goes on to talk about bagging vs. random forests,
out-of-bag samples, and tuning parameters:
[https://github.com/savagedata/regression-tree-tutorial](https://github.com/savagedata/regression-tree-tutorial)
I wrote it at the start of my career and haven't shared it beyond my study
group, so I'm happy to hear feedback.

~~~
anthony_doan
It's a really good tutorial.

I like how you talk about Conditional Inference. My thesis is supposed to
overcome the brute-force exhaustive search for best splits that Random Forest
does (I use Dr. Loh's GUIDE trees) by using statistical methods instead.

> Many implementations of random forest default to 1/3 of your predictor
> variables.

This is interesting. I had heard it was sqrt(total number of predictors).

> Ensemble methods combine many individual trees to create one better, more
> stable model.

I think "stable" can be clarified to mean having good training accuracy and
low generalization error (error rate on unseen data) compared to an individual
tree. This is what Dr. Ho talks about with forests.

But other than that I think it's an awesome tutorial.

One thing I've seen other tree and forest methods do for better generalization
to unseen data is pruning: using CV and choosing 0.5 to 1.0 standard errors as
a cutoff point. That may be a thing to talk about if you are interested in it.
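
A rough sketch of that kind of rule using scikit-learn's cost-complexity pruning path (the 1.0-standard-error cutoff below is just one choice from that range):

    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)

    # Candidate pruning strengths from the cost-complexity path.
    alphas = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y).ccp_alphas

    # Cross-validate each pruning strength.
    means, ses = [], []
    for a in alphas:
        scores = cross_val_score(
            DecisionTreeClassifier(ccp_alpha=a, random_state=0), X, y, cv=5)
        means.append(scores.mean())
        ses.append(scores.std(ddof=1) / np.sqrt(len(scores)))
    means, ses = np.array(means), np.array(ses)

    # Keep the most aggressive pruning within 1 SE of the best CV score.
    best = means.argmax()
    print(alphas[means >= means[best] - ses[best]].max())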

~~~
savagedata
Thank you for the useful feedback! I'll have to look up GUIDE trees.

> This is interesting. I had heard it was sqrt(total number of predictors).

I was probably looking at the randomForest R package documentation [1], which
says:

> mtry Number of variables randomly sampled as candidates at each split. Note
> that the default values are different for classification (sqrt(p) where p is
> number of variables in x) and regression (p/3)

I checked the H2O implementation of random forest [2] and they use the same
defaults.

I'll add a note about the one third default being specific to regression since
that seems like an important distinction.
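
In scikit-learn the same knob is max_features; mirroring the R defaults quoted above would look like this (sklearn's own regressor default differs, so the fraction is set explicitly):

    from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

    # Classification: sqrt(p) candidate features per split.
    clf = RandomForestClassifier(max_features="sqrt")

    # Regression: p/3 candidate features per split
    # (a float is read as a fraction of the feature count).
    reg = RandomForestRegressor(max_features=1 / 3)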

[1] [https://www.rdocumentation.org/packages/randomForest/version...](https://www.rdocumentation.org/packages/randomForest/versions/4.6-14/topics/randomForest)

[2] [http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/d...](http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/drf.html)

------
djaychela
I've started studying ML (as someone who's hoping to transition at 47 to a
different career from teaching and music), and some concepts I've found easy
to get, others not so much (I'm old, so give me a break!). Often I need to
have a concept explained in a different manner before I'll have an 'Oh, I
see!' moment, and because I don't have a great grasp of the maths needed
(which I'm working on, but it's s-l-o-w), as soon as equations I can't follow
appear, I tend to lose focus and think I can't do it.

This post covers things that I've seen in the past, and seems to sum up my
internal understanding of the concept, which is good for me: reading through
it I had a number of 'see, I do get it' moments. The only criticism I have of
it is that it seems to gloss over what differences the trees within the random
forest have - as I understand it, they are all slightly different, and this
gives them greater accuracy?

Anyway, thanks for posting it - I'll read the other posts when I get a chance.

~~~
eden_h
> it seems to gloss over what differences the trees within the random forest
> have - as I understand it, they are all slightly different, and this gives
> them greater accuracy?

They _kinda_ cover it in sections 3.1 and 3.2 with Bagging and Bagging ->
RandomForest, but it'd be good for them to explain Boosted Trees here as
well.

As far as I understand it, random forests are an aggregation of trained trees
based on randomly sampled data points from the original data set. It doesn't
necessarily make them more accurate on the training dataset, but it makes them
more generalised and less likely to overfit
([https://en.wikipedia.org/wiki/Overfitting](https://en.wikipedia.org/wiki/Overfitting)),
because the different trees are likely to focus on different characteristics
of the dataset.
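
A bare-bones sketch of that aggregation (bootstrap the rows, train a tree per sample, majority-vote at prediction time), assuming numpy arrays and integer class labels:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def fit_forest(X, y, n_trees=100, seed=0):
        rng = np.random.default_rng(seed)
        trees = []
        for _ in range(n_trees):
            # Bootstrap: sample rows with replacement from the original data.
            idx = rng.integers(0, len(X), size=len(X))
            # max_features adds the per-split feature sampling that turns
            # plain bagging into a random forest.
            trees.append(DecisionTreeClassifier(max_features="sqrt").fit(X[idx], y[idx]))
        return trees

    def predict_forest(trees, X):
        # Majority vote across the trees' predictions.
        votes = np.stack([t.predict(X) for t in trees]).astype(int)
        counts = np.apply_along_axis(np.bincount, 0, votes, minlength=votes.max() + 1)
        return counts.argmax(axis=0)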

Boosted trees do become more accurate: they resample, but give more weight to
data points that weren't correctly classified by the earlier models.
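
For concreteness, the reweighting step in AdaBoost-style boosting looks roughly like this (a toy sketch, not any particular library's implementation; labels assumed to be in {-1, +1}):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def boost(X, y, n_rounds=50):
        w = np.full(len(X), 1 / len(X))  # start with uniform sample weights
        stumps, alphas = [], []
        for _ in range(n_rounds):
            stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
            pred = stump.predict(X)
            err = w[pred != y].sum()
            alpha = 0.5 * np.log((1 - err) / (err + 1e-10))
            # Upweight the points this round misclassified.
            w *= np.exp(-alpha * y * pred)
            w /= w.sum()
            stumps.append(stump)
            alphas.append(alpha)
        return stumps, alphas

    def boost_predict(stumps, alphas, X):
        # Weighted vote of all rounds.
        return np.sign(sum(a * s.predict(X) for s, a in zip(stumps, alphas)))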

~~~
psandersen
Just to add here that the columns, not just the rows, are also sampled at each
node, so the trees are deliberately prevented from learning too much. This
helps improve diversity and reduce overfitting. The subset size is typically
around sqrt(p) for classification or p/3 for regression, and is controlled by
the mtry parameter.

------
vzhou842
Hey, Author here. If you're new to ML you might also like my introduction to
Neural Networks: [https://victorzhou.com/blog/intro-to-neural-networks/](https://victorzhou.com/blog/intro-to-neural-networks/)

Discussion of my neural networks post on HN:
[https://news.ycombinator.com/item?id=19320217](https://news.ycombinator.com/item?id=19320217)

~~~
rrggrr
Great contributions, and I ought not ask for more, but gosh, if you could put
in real-world examples with code it would be great.

~~~
vzhou842
Thanks!

When you say "real world examples", what do you have in mind? A lot of "real
world" uses of random forests are basically just directly calling scikit-learn
or something like that. In my neural network post, I implemented a simple
neural network from scratch because I felt like it'd be valuable for
beginners, but I wouldn't call that "real world".

------
b_tterc_p
While there’s nothing wrong with random forests, they’re a bit of a red flag
for me as they’re easy to implement without any real understanding of what’s
going on. A lot of junior data scientists just default to saying random forest
to solve any problem because it tends to have the most predictive power of the
models they’re comfortable with. That’s a bad sign.

~~~
RavlaAlvar
I am probably one of the junior DS you are referring to. But I genuinely want
to know the reason for using anything other than gradient boosted trees for
classification on structured data.

~~~
onlyrealcuzzo
Is there a place that tells you: if you have this type of data and want this
kind of answer, here's the best algorithm (and why)?

~~~
donjigweed
[https://scikit-learn.org/stable/tutorial/machine_learning_ma...](https://scikit-learn.org/stable/tutorial/machine_learning_map/)

------
techno_modus
Most of the post (~80%) is actually about decision trees: principles, how to
train them, choosing splits, etc.

------
chaosbutters
This is actually a really good explanation because it shows the concepts in
simple pictures and words, so there is no confusion. Well done.

------
delinka
I was so hoping this would be more like "random cities" and "random islands",
where the forest is a collection of tall flora to be rendered for the
viewer's/gamer's pleasure.

------
queercode
Thanks for posting this - from a fellow HH member who saw this yesterday!

