This is pretty concise, but it's mostly about decision trees, and it only covers half of that topic.
CART is the framework behind decision trees for both classification and regression. This article only addresses the classification part, which usually uses Gini impurity, a split criterion that splits along axis-parallel boundaries (oblique trees also exist). The regression part uses more traditional statistical measures to calculate split points (SSTO).
It's very light on random forests, though; it doesn't talk about out-of-bag error, the implications of bootstrapping for ordinal data, etc. But overall I think it's a neat introduction.
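For reference, the Gini impurity used on the classification side fits in a few lines. This is a generic sketch, not code from the article:

```python
from collections import Counter

def gini(labels):
    """Gini impurity: the probability of misclassifying a randomly drawn
    sample if it were labeled according to the node's class distribution."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

# A 50/50 two-class node is maximally impure; a pure node scores 0.
print(gini([0, 0, 1, 1]))  # 0.5
print(gini([1, 1, 1, 1]))  # 0.0
```

CART tries candidate splits and keeps the one that most reduces the weighted Gini impurity of the two child nodes.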
> we only try a subset of the features,
You sample a subset of features without replacement at every split, rather than just bootstrapping rows/observations like in bagging.
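To make the distinction concrete, here's a hedged sketch (function names are mine, not from any particular library): bagging resamples rows with replacement once per tree, while random forests additionally sample a feature subset without replacement at every split:

```python
import random

def bootstrap_rows(n_rows):
    """Bagging: sample row indices WITH replacement, once per tree."""
    return [random.randrange(n_rows) for _ in range(n_rows)]

def candidate_features(n_features, mtry):
    """Random forest: at EACH split, sample mtry feature indices
    WITHOUT replacement; only these compete for the best split."""
    return random.sample(range(n_features), mtry)

random.seed(0)
print(bootstrap_rows(6))          # duplicates allowed
print(candidate_features(10, 3))  # 3 distinct feature indices
```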
The concept of ensembling weak learners into a strong learner comes from Dr. Ho's work in her Random Subspace paper, where she does it with decision trees and essentially proposed random forests before Dr. Leo Breiman (both arrived at random forests independently). Her advisor wrote the theoretical paper proving the weak-learner result, on stochastic discrimination.
You sound like you know what you're talking about. My site is open source, and if you want you can make a few edits to this article and submit a pull request! https://github.com/vzhou842/victorzhou.com/blob/master/conte...
Aren't you effectively claiming authorship for other's work?
the end goal not being academia, but being able to think and write reasonable production code.
For Random Forests, I like this one: https://www.gormanalysis.com/blog/random-forest-from-top-to-..., which also has a link to a decision-tree post. That blog also has the best GBM explainer I've seen yet (Gradient Boosted Machines are the _other_ tree-ensembling method in common use, where the trees are fit _sequentially_, each correcting the last, instead of _bagged_)
Your goal should not be to know enough to write an RF implementation, but rather to have some intuition for how it works, so you can better choose when to use it or not. It's extremely unlikely it will ever make sense for you to write an RF algorithm for production use; use the great code that already exists for most languages.
I like how you talk about Conditional Inference. My thesis is supposed to overcome the brute force of the exhaustive search for best splits that Random Forest does (I use Dr. Loh's GUIDE trees), using statistical methods instead.
> Many implementations of random forest default to 1/3 of your predictor variables.
This is interesting. I'd heard it was sqrt(total number of predictors).
> Ensemble methods combine many individual trees to create one better, more stable model.
I think "stable" could be clarified as having good training accuracy and low generalization error (error rate on unseen data) compared to an individual tree. This is what Dr. Ho talks about with forests.
But other than that I think it's an awesome tutorial.
What I've seen other tree and forest methods do for better generalization on unseen data is pruning, using CV and choosing 0.5 to 1.0 standard errors as a cutoff point. That might be worth talking about if you're interested.
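That cutoff is the classic one-standard-error rule from CART pruning: among the cross-validated subtrees, pick the simplest one whose CV error is within k standard errors of the minimum. A small illustrative sketch (the numbers are made up):

```python
def one_se_rule(subtrees, k=1.0):
    """subtrees: list of (n_leaves, cv_error, std_error) tuples.
    Return the subtree with the fewest leaves whose CV error is
    within k standard errors of the best (lowest) CV error."""
    best_err, best_se = min((e, se) for _, e, se in subtrees)
    threshold = best_err + k * best_se
    return min((t for t in subtrees if t[1] <= threshold), key=lambda t: t[0])

# Hypothetical CV results: (leaves, cv_error, std_error)
results = [(2, 0.30, 0.02), (5, 0.22, 0.02), (9, 0.20, 0.02), (15, 0.19, 0.02)]
print(one_se_rule(results))  # (9, 0.2, 0.02): simplest tree within 1 SE of 0.19
```

With k=0 you'd just take the minimum-error subtree; raising k trades a bit of CV error for a smaller, more generalizable tree.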
> This is interesting. I hear it was sqroot(total number of predictors).
I was probably looking at the randomForest R package documentation, which says:
> mtry Number of variables randomly sampled as candidates at each split. Note that the default values are different for classification (sqrt(p) where p is number of variables in x) and regression (p/3)
I checked the H2O implementation of random forest and they use the same defaults.
I'll add a note about the one third default being specific to regression since that seems like an important distinction.
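Following the randomForest docs quoted above, the two defaults work out quite differently as p grows; a quick sketch of both:

```python
import math

def mtry_classification(p):
    # randomForest's classification default: floor(sqrt(p))
    return int(math.sqrt(p))

def mtry_regression(p):
    # randomForest's regression default: floor(p / 3), at least 1
    return max(1, p // 3)

for p in (9, 100, 400):
    print(p, mtry_classification(p), mtry_regression(p))
# e.g. with 400 predictors: 20 candidates per split for classification,
# but 133 for regression
```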
I have no clue. I have a degree in Computer Science and am finishing my master's in Applied Statistics. It just happens that my skill set, statistical modeling, overlaps heavily with Machine Learning.
For a general overview, maybe https://r4ds.had.co.nz/
But for straight-up modeling I would start with and recommend this book: https://otexts.org/fpp2/ (update: this book doesn't cover EDA, outliers, and such... r4ds does EDA. For outliers and imputation, some statistics books cover that.)
It's for time series modeling (statistically slanted), but I think it covers the modeling process really well and is very introductory (Coursera has a course a step above that book, with a little more depth in time series). To be fair, statistical models are the number-one-ish choice for univariate time series; see the M4 competition or the Uber blog post on time series.
> the end goal not being academia, but being able to think and write reasonable production code.
For the thinking part, which affects writing code: from my experience (I could be wrong), the data science/ML approach to modeling is different from the statistics approach.
The vast majority of the time, Data Science/ML is handed the data, which is why I believe AI algorithms can be biased (see Gfycat and its computer vision misclassification problems with Asian faces). Whereas in statistics, you usually figure out your hypothesis or what you're trying to answer first, and then create a designed experiment or a data collection strategy that avoids bias and hopefully controls for confounding factors. Statistics also works with given data, but the vast majority of its models are slanted toward explanation rather than forecasting/prediction. DS/ML seems to care more about forecasting/prediction.
I also think how each discipline approaches modeling affects the way people in that field think. I haven't figured out the unified way the ML/DS discipline approaches models, but I'm confident it's not the same as statistics. Speaking of statistics vs. applied math, though, I can give an example. For time series data, statisticians model on the assumption that all we're given is the data: we try to extract every bit of information from it and explain away the variance with models (each predictor explains away some of the noise; whatever is left is error/chance). It's more data-focused. Applied mathematicians instead posit how the data is generated: their models assume a generating mechanism, so they end up with stochastic processes, leaning more on probability than statistics.
So thinking affects code, and how statisticians code things is probably different from ML/DS. I've seen people call imputation witchcraft and throw temporal data into random forests >___>.
> write reasonable production code.
I do R for modeling.
If you're not creating any fancy new algorithms, you can just use packages. You train them on the data set and ship the trained model. You can wrap it as a REST service via https://www.rplumber.io/. I imagine Python has something similar?
Do note I know very little about deep learning or how to ship that. I've chosen to specialize in statistical modeling and R for modeling.
This post covers things that I've seen in the past, and seems to sum up my internal understanding of the concept, which is good for me as reading through it I had a number of 'see, I do get it' moments. The only criticism I have of it is that it seems to gloss over what differences the trees within the random forest have - as I understand it, they are all slightly different, and this gives them greater accuracy?
Anyway, thanks for posting it - I'll read the other posts when I get a chance.
I can give you an example from my own work. We have a random forest on a 400+ attribute input (i.e., 400+ variables). All we want at the end is a probability from 0.0 to 1.0.
Our random forest model builds around 500 trees. Each tree randomly selects a small subset of those 400+ input attributes and asks, "what's the best I can do using only these attributes?" Generally, it does okay. But when you average the 500 trees, the accuracy is pretty darned good.
Edit later: To be clear, each new tree is generated using a random subset of the variables. The point is that each tree may glean some insight about that small combination of variables.
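The averaging step can be sketched with stand-in "trees" (just functions here; real trees would be fitted models, and the numbers are made up):

```python
import random

def average_forest(trees, x):
    """Each 'tree' maps an input row to a probability in [0, 1];
    the forest prediction is the mean over all trees."""
    return sum(tree(x) for tree in trees) / len(trees)

# Hypothetical: 500 noisy stand-in trees, each a weak estimate around 0.7.
random.seed(1)
trees = [(lambda x, b=random.uniform(-0.2, 0.2): min(1.0, max(0.0, 0.7 + b)))
         for _ in range(500)]
print(round(average_forest(trees, x=None), 2))  # averages out close to 0.7
```

Each individual "tree" here can be off by as much as 0.2, but the mean of 500 of them sits very close to the true value; that's the variance reduction the parent is describing.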
That's the point where you should backtrack.
Since this is a blog post it may not be copacetic on all the details. But if you're studying from books and papers, all the notation will either be assumed to be well-known (set theory, Cartesian products, R^d vector spaces, lp/Lp norms) or explicitly explained.
If you're behind or fuzzy on the more basic stuff (what's an equivalence class? what's a Cartesian product?), I recommend the first two chapters of Munkres' topology book. The book builds pretty far out into advanced territory, but its recap of the basics is rigorous and superbly explained with copious illuminating prose.
They kinda cover it in sections 3.1 and 3.2 with Bagging and Bagging -> Random Forest, but it'd be good for them to explain Boosted Trees here as well.
As far as I understand it, random forests are an aggregation of trained trees based on randomly sampled data points from the original data set. It doesn't necessarily make them more accurate on the training dataset, but it makes them more generalised and less likely to overfit (https://en.wikipedia.org/wiki/Overfitting), because the different trees are likely to focus on different characteristics of the dataset.
Boosted trees do become more accurate, as they resample, but give more priority to data points that weren't correctly classified by the earlier models.
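A minimal sketch of that reweighting idea (AdaBoost-style and deliberately simplified, not tied to any particular library):

```python
def reweight(weights, correct, factor=2.0):
    """Boosting step: multiply the weight of each misclassified point
    by `factor`, then renormalize so weights sum to 1. The next weak
    learner then pays more attention to the hard points."""
    raised = [w if ok else w * factor for w, ok in zip(weights, correct)]
    total = sum(raised)
    return [w / total for w in raised]

w = [0.25, 0.25, 0.25, 0.25]
# Suppose the first weak learner misclassifies only the last point:
w = reweight(w, [True, True, True, False])
print(w)  # [0.2, 0.2, 0.2, 0.4]
```

Real AdaBoost derives `factor` from the weak learner's weighted error rate rather than fixing it, but the priority-shifting mechanism is the same.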
In the first step of the decision tree training we pick the best feature split, divide the data into two groups, then train each of the two data groups independently. But what about the 2nd-best feature split? In some sense, we lose the information the other splits could provide.
To see this, when testing queries, the first step is to look at that best split and pass the query to one of the two sub-trees. But those trees have only been trained with half of the training set data, and thus have weaker discriminatory power. Every split down the tree has diminishing returns in terms of how much information it provides.
Now think about what the random forest does. If the feature which contains the best split is available to a particular tree, the split will be the same. But if it isn't, then if the feature for the second-best split is present, it will be chosen. If the top two features aren't present, then the third-best split will be chosen, and so on.
Thus, across our forest we have representatives of a range of feature splits, each trained on more data and thus with more discriminatory power per split. The aggregation step at the end combines the information gleaned from these different models. Each one of them is weaker than the original CART decision tree, but has gotten more information out of the data for the features it was given. Thus, together, they are much better predictors than by themselves.
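This argument can be made concrete with a toy simulation (the ranking and subset sizes are hypothetical): rank features by split quality, give each tree a random subset, and let each tree pick the best feature available to it.

```python
import random

# Hypothetical global ranking: feature 'a' hosts the best split, then 'b', ...
ranking = ['a', 'b', 'c', 'd', 'e']

def split_chosen(available):
    """Each tree picks the best-ranked feature present in its subset."""
    return min(available, key=ranking.index)

random.seed(2)
chosen = [split_chosen(random.sample(ranking, 2)) for _ in range(1000)]
# Across many trees, the second- and third-best splits get representatives too:
print({f: chosen.count(f) for f in ranking})
```

The best feature still dominates, but a substantial fraction of trees end up exploring the second- and third-best splits, which is exactly the diversity the aggregation step exploits.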
Discussion of my neural networks post on HN: https://news.ycombinator.com/item?id=19320217
When you say "real world examples", what do you have in mind? A lot of "real world" uses of random forests are basically just directly calling scikit-learn or something like that. In my neural network post, I implemented a simple neural network from scratch because I felt like it'd be valuable for beginners, but I wouldn't call that "real world".
There’s nothing wrong with random forest. It’s a perfectly good model. But when it is someone’s only tool, it implies they don’t know much about the broader toolkit, or how that one particular tool works.
I rarely use anything but linear models, trees and forests fwiw.
If they start going to GBM's or neural nets first...I'd call _that_ a bad sign (and it happens).
There are also lots of good ways to peer into the inner workings of a tree ensemble model nowadays. It's not laid out plainly for you like a linear model, but it's not an impenetrable black box as people like to suggest.