There's always a point in the lecture or explanation where they go
"So we just find the optimal split/feature based on entropy"

...which no one talks a ton about, but naively implemented it's something on the order of O(kN log N). For each split. Multiply that by the number of leaves (2^depth), and then by the number of trees in your forest.
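To make the per-split cost concrete, here's a rough sketch of the standard approach: sort each feature once, then sweep the thresholds while updating class counts incrementally. This is illustrative only (every name here is made up, and it assumes integer class labels) - not sklearn's actual Cython:

```python
import numpy as np

def entropy_from_counts(counts, total):
    # Shannon entropy from class counts; zero counts are skipped
    # so there's no 0 * log(0).
    p = counts[counts > 0] / total
    return -np.sum(p * np.log2(p))

def best_split(X, y, n_classes):
    # For each of the k features: one O(N log N) sort plus an O(N)
    # sweep with incremental count updates -> O(k * N log N) per split.
    n, k = X.shape
    best = (None, None, np.inf)  # (feature, threshold, weighted entropy)
    for f in range(k):
        order = np.argsort(X[:, f])      # the O(N log N) part
        xs, ys = X[order, f], y[order]
        left = np.zeros(n_classes)
        right = np.bincount(ys, minlength=n_classes).astype(float)
        for i in range(1, n):            # the O(N) sweep
            left[ys[i - 1]] += 1
            right[ys[i - 1]] -= 1
            if xs[i] == xs[i - 1]:
                continue                 # can't split between equal values
            h = (i * entropy_from_counts(left, i)
                 + (n - i) * entropy_from_counts(right, n - i)) / n
            if h < best[2]:
                best = (f, (xs[i - 1] + xs[i]) / 2, h)
    return best
```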
I learned this the hard way when I tried implementing random forests on GPU for a class (would not recommend: efficiently building decision trees seems to involve a lot of data copying and shifting around). I actually learned a lot from reading sklearn's implementation of decision trees in Cython - it uses quite a number of neat tricks to make things really fast.
Scikit is shockingly slow in comparison. Also bloated, but that's more a matter of 1) not having a "release" impl that ditches data only useful for debugging, and 2) using 64-bit data types all over the place, despite storing everything in parallel arrays! (https://github.com/scikit-learn/scikit-learn/blob/master/skl...)
What you're talking about, where you simply generate a set of random splits across features, is Extremely Randomized Trees (https://link.springer.com/article/10.1007%2Fs10994-006-6226-...).
Another difference is that RFs use a bootstrapped dataset and ERTs use the full dataset.
In any case, I agree with the thrust of your original comment that the specifications of the RF algorithm can be relaxed, usually for performance reasons, while still retaining strong performance. But this goes back to my original point that the performance considerations of random forests often aren't highlighted to new learners (whereas introducing ERTs to a beginner would probably shock them: how could you take totally random splits and still get any reasonable performance?)
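If anyone wants to see that contrast for themselves, here's a minimal scikit-learn sketch - the synthetic dataset and hyperparameters are placeholders, nothing tuned:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# RF: bootstrap resampling + best split within a random feature subset.
rf = RandomForestClassifier(n_estimators=100, random_state=0)
# ERT: full dataset (bootstrap=False by default) + fully random thresholds.
ert = ExtraTreesClassifier(n_estimators=100, random_state=0)

for name, model in [("random forest", rf), ("extra trees", ert)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```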
You are correct. For classification the usual rule of thumb is to select the square root of the number of features.
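scikit-learn exposes that rule of thumb directly, if you want it in code (a tiny sketch; the estimator is just an example):

```python
from sklearn.ensemble import RandomForestClassifier

# max_features="sqrt": each split considers ~sqrt(n_features)
# randomly chosen features rather than all of them.
clf = RandomForestClassifier(n_estimators=100, max_features="sqrt")
```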
The main problem with decision trees is that they will overfit to the maximum extent possible, which is why pruning and depth limitation are used to reduce overfitting and improve the tree's ability to generalize to the test set. On the dataset provided for my assignment, the ranking of performance was: (traditional algorithm) < (traditional with depth limit) < (traditional with pruning).
Basically the maximum depth of the tree is limited to some height. http://scikit-learn.org/stable/modules/generated/sklearn.tre... (see `max_depth`).
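A minimal sketch of what that looks like (the parameter values are arbitrary, just for illustration):

```python
from sklearn.tree import DecisionTreeClassifier

# Without max_depth the tree grows until its leaves are pure and
# effectively memorizes the training set; capping the depth (and/or
# requiring a minimum number of samples per leaf) regularizes it.
tree = DecisionTreeClassifier(max_depth=5, min_samples_leaf=10)
```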
Author here. Thanks for the links. I originally planned to try a bigger dataset but didn't want to make the article too long, so left out the optimizations.
I had to do the same exercise for my ML class and had similar results. Random pruning (mostly) worked great for me.
IMHO, while DTs do overfit a lot, they are a great starting point for beginners because of their (relative) simplicity. Better to start light and then introduce the math-heavy neural nets and SVMs.
Fairly often, people writing tutorials jump straight to Random Forests or Gradient Boosting; those are great to use, but maybe too big a conceptual leap to grasp straight away if your theoretical background is weak.
I wrote an article earlier this year on how I use decision trees to classify players for daily fantasy sports into different groups, which people may find useful: https://medium.com/@bmb21/why-is-caris-levert-projected-for-...
Short version: always use permutation importance, but use leave-it-out importance when it really matters.
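To make the distinction concrete, here's a minimal sketch of both, assuming a scikit-learn version that ships sklearn.inspection.permutation_importance; the dataset and model are placeholders:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Permutation importance: shuffle one column at a time on held-out data
# and measure how much the score drops. Cheap - no retraining.
perm = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
print("permutation:", perm.importances_mean.round(4))

# Leave-it-out importance: retrain with each feature removed and compare
# cross-validated scores. Much slower - one full retrain per feature.
base = cross_val_score(model, X, y, cv=5).mean()
for f in range(X.shape[1]):
    reduced = np.delete(X, f, axis=1)
    score = cross_val_score(
        RandomForestClassifier(n_estimators=100, random_state=0),
        reduced, y, cv=5).mean()
    print(f"drop feature {f}: importance = {base - score:.4f}")
```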
Using them to get a feel for a dataset is a use case I haven't tried. It does sound interesting though. Something to try on my next ML problem, I suppose. :)
Just skimmed through your article (class final exams tomorrow); it's informative. Thanks for sharing.
If you have any specific questions about ensembles, feel free to reply here or send me a mail at contact(at)<myusername>(dot)com.
It probably makes little difference whether you use map or LabelEncoder, but I tend to prefer LabelEncoder to avoid typos and keep things concise (those dicts can become very long if we have a lot of labels).
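For anyone following along, a small sketch of the two options (the column and labels here are made up):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Manual dict: explicit, but it grows with the label set,
# and a typo silently produces NaN.
df["color_map"] = df["color"].map({"red": 0, "green": 1, "blue": 2})

# LabelEncoder discovers the labels itself.
le = LabelEncoder()
df["color_le"] = le.fit_transform(df["color"])
print(df)
print(list(le.classes_))
```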
They are included in the IG formula, but since they are multiplied by 0, they have no effect.
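Concretely, I read this as the weighted child terms in IG = H(parent) - sum_i (N_i/N) * H(child_i); a tiny worked sketch with illustrative names:

```python
import numpy as np

def entropy(labels):
    # Shannon entropy of a non-empty label array.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, children):
    n = len(parent)
    # An empty child has weight len(c)/n == 0, so it drops out of the sum.
    weighted = sum(len(c) / n * (entropy(c) if len(c) else 0.0) for c in children)
    return entropy(parent) - weighted

parent = np.array([0, 0, 1, 1])
# Third branch is empty: weight 0/4, contributes nothing.
print(information_gain(parent, [parent[:2], parent[2:], parent[:0]]))  # -> 1.0
```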