
Introduction to Decision Tree Learning - austin_kodra
https://heartbeat.fritz.ai/introduction-to-decision-tree-learning-cd604f85e236?source=collection_home---1------0----------------
======
sh33mp
A funny thing about decision trees (or random forests) is how conceptually
simple they are, yet how non-trivial they are to implement.

There's always a point in the lecture or explanation where they go

 _So we just find the optimal split/feature based on entropy_

which no one talks much about, but which naively implemented is something on
the order of O(kN log N) per split. Multiply that by the number of leaves
(2^depth), and multiply that by the number of trees in your forest.

I learned this the hard way when I tried implementing random forests on GPU
for a class (would not recommend: efficiently building decision trees seems to
involve a lot of data copying and shifting around). I actually learned a lot
from reading sklearn's implementation of decision trees in Cython - it uses
quite a number of neat tricks to make things really fast.
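
For the curious, here's a minimal sketch of what that naive exhaustive split
search looks like (my own illustrative NumPy code, not sklearn's actual
implementation); the per-feature sort is where the N log N comes from:

```python
import numpy as np

def entropy_from_counts(counts):
    """Shannon entropy from an array of class counts."""
    total = counts.sum()
    p = counts[counts > 0] / total
    return -np.sum(p * np.log2(p))

def best_split(X, y):
    """Exhaustive search over every feature and threshold for the split with
    the highest information gain. Sorting each of the k feature columns costs
    O(N log N), and the scan keeps running class counts, so one call is
    roughly O(k * N log N) -- and that's per node."""
    n, k = X.shape
    classes, y_enc = np.unique(y, return_inverse=True)
    parent = entropy_from_counts(np.bincount(y_enc, minlength=len(classes)))
    best_feature, best_threshold, best_gain = None, None, -np.inf
    for j in range(k):
        order = np.argsort(X[:, j])                  # O(N log N) per feature
        xs, ys = X[order, j], y_enc[order]
        left = np.zeros(len(classes), dtype=np.int64)
        right = np.bincount(ys, minlength=len(classes))
        for i in range(1, n):                        # O(N) scan of thresholds
            left[ys[i - 1]] += 1
            right[ys[i - 1]] -= 1
            if xs[i] == xs[i - 1]:
                continue                             # no valid threshold here
            child = (i * entropy_from_counts(left)
                     + (n - i) * entropy_from_counts(right)) / n
            gain = parent - child
            if gain > best_gain:
                best_feature, best_threshold = j, (xs[i - 1] + xs[i]) / 2
                best_gain = gain
    return best_feature, best_threshold, best_gain
```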

~~~
folli
In random forests you don't really need to find the optimal split. You usually
generate a number of random splits and select the best one.

~~~
sh33mp
I believe the formulation of random forests requires you to find the optimal
split, albeit over a subset of features.

What you're talking about, where you simply generate a set of random splits
across features, is Extremely Randomized Trees
([https://link.springer.com/article/10.1007%2Fs10994-006-6226-...](https://link.springer.com/article/10.1007%2Fs10994-006-6226-1)).

~~~
folli
Since we're splitting hairs instead of training sets: classical RFs select one
random feature and choose the optimal split, whereas Extremely RTs choose the
best feature (out of a random subset), testing only one random split per
feature.

Another difference is that RFs use a bootstrapped dataset and ERTs use the
full dataset.
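
In scikit-learn terms (a rough sketch, not a claim about the exact original
formulations), the two show up as separate estimators with different defaults:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

X, y = load_iris(return_X_y=True)

# RandomForestClassifier: bootstrap samples by default, and each split picks
# the best threshold within a random subset of candidate features.
rf = RandomForestClassifier(n_estimators=100, bootstrap=True, random_state=0)
rf.fit(X, y)

# ExtraTreesClassifier: uses the full dataset by default (bootstrap=False) and
# draws a random threshold per candidate feature, keeping the best of those.
et = ExtraTreesClassifier(n_estimators=100, bootstrap=False, random_state=0)
et.fit(X, y)
```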

~~~
sh33mp
Ah, I was under the impression that RFs choose from a subset of features, not
just one feature.

In any case, I agree with the thrust of your original comment that the
specifications of the RF algorithm can be relaxed, usually for performance
reasons, and still retain strong performance. But this goes back to my
original comment that the performance considerations of random forests often
aren't highlighted to new learners (whereas introducing ERTs to a beginner
would probably shock them: how could you take totally random splits and still
get any reasonable performance?).

~~~
Scea91
> Ah, I was under the impression that RFs choose from a subset of features,
> not just one feature.

You are correct. For classification, the usual rule of thumb is to select the
square root of the number of features.
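
In scikit-learn that rule of thumb corresponds to the max_features parameter
(a quick illustration, not from the article):

```python
import math

from sklearn.ensemble import RandomForestClassifier

n_features = 16
print(math.floor(math.sqrt(n_features)))   # 4 candidate features per split

# max_features="sqrt" applies the same rule of thumb automatically: each split
# considers a random subset of about sqrt(n_features) features.
clf = RandomForestClassifier(n_estimators=100, max_features="sqrt")
```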

------
kevindong
I actually had to implement decision trees this semester for my Data Mining &
Machine Learning class. We couldn't use any libraries, though, and had to
implement everything from scratch (aside from being able to read in the data
with pandas).

The main problem with decision trees is that they will overfit to the maximum
extent possible, which is why pruning [0] and depth limitation [1] are used to
reduce overfitting and improve the decision tree's ability to generalize to
the test set. On the data set provided for my assignment, the ranking of
performance was: (traditional algorithm) < (traditional with depth limit) <
(traditional with pruning).

[0]: [https://en.wikipedia.org/wiki/Pruning_(decision_trees)](https://en.wikipedia.org/wiki/Pruning_\(decision_trees\))

[1]: Basically the maximum depth of the tree is limited to some height.
[http://scikit-
learn.org/stable/modules/generated/sklearn.tre...](http://scikit-
learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)
(see `max_depth`).
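
A rough sketch of both knobs in scikit-learn (illustrative dataset and
parameter values, not the assignment's data):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unconstrained tree keeps splitting until its leaves are pure and overfits.
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Depth limitation [1]: cap how deep the tree may grow.
limited = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)

# Pruning [0]: newer scikit-learn versions expose cost-complexity pruning via
# ccp_alpha (larger alpha = more aggressive pruning).
pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_train, y_train)

for name, model in [("unlimited", full), ("max_depth=4", limited), ("pruned", pruned)]:
    print(name, model.score(X_test, y_test))
```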

~~~
ishansharma
Hi,

Author here. Thanks for the links. I originally planned to try a bigger
dataset but didn't want to make the article too long, so left out the
optimizations.

I had to do the same exercise for my ML class and had similar results. Random
pruning (mostly) worked great for me.

IMHO, while DTs do overfit a lot, they are a great starting point for
beginners because of their (relative) simplicity. Better to start light and
then introduce the math-heavy neural nets and SVMs.

~~~
mliswhat
Thanks for stopping by the thread. I'm doing my thesis on regression trees and
appreciate any love trees get in the ML space. I quite liked the article; my
only comment is that the pseudo-code for ID3 was a little hard to read,
formatting-wise.

~~~
ishansharma
Ah, I guess I could have used a Gist for that. Thanks for the suggestion, let
me see if I can update the article.

------
pcprincipal
This is great and has an awesome level of detail on information gain. I have
found decision trees really useful for getting a feel for what features in a
dataset matter and trying out different max depths on trees to get more
insight into the data.

I wrote an article earlier this year on how I use decision trees to classify
players for daily fantasy sports into different groups that people may find
useful:
[https://medium.com/@bmb21/why-is-caris-levert-projected-for-53-points-a-decisiontreeregressorstory-deugging-story-c6071ee44efb](https://medium.com/@bmb21/why-is-caris-levert-projected-for-53-points-a-decisiontreeregressorstory-deugging-story-c6071ee44efb)

~~~
claytonjy
I use RF's commonly for getting that same feel, but I recently learned that
I've been making some big mistakes when interpreting default feature-
importance outputs; this recent article really opened my eyes:
[http://parrt.cs.usfca.edu/doc/rf-
importance/index.html](http://parrt.cs.usfca.edu/doc/rf-importance/index.html)

Short version: always use permutation importance, but use leave-it-out
importance when it really matters.
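
For what it's worth, newer scikit-learn versions ship a permutation importance
helper, so the comparison can be sketched roughly like this (illustrative
dataset, not from the linked article):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Default (impurity-based) importances are computed on training data and are
# biased toward high-cardinality features.
print(rf.feature_importances_)

# Permutation importance: shuffle one feature at a time on held-out data and
# measure how much the score drops.
result = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=0)
print(result.importances_mean)
```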

------
lordnacho
Is there going to be a next part about ensembles of trees? Thinking boosting /
random forest.

~~~
ishansharma
I am considering doing a more programming-focused part, where I take a big
dataset and implement a decision tree using scikit-learn. After that, I will
look into ensembles.

If you have any specific questions about ensembles, feel free to reply here or
send me a mail at contact(at)<myusername>(dot)com.

------
dx034
Great article, just one small nitpick: sklearn actually does support encoding
text labels. I just tested with your dataset to reconfirm:
sklearn.preprocessing.LabelEncoder() has no problem encoding labels correctly.

It probably makes little difference whether you use map or LabelEncoder, but I
tend to prefer LabelEncoder to avoid typos and keep things concise (those
dicts can become very long if we have a lot of labels).
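
Something like this is what I mean (made-up column and values, just to
illustrate):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column of text labels (illustrative; not the article's dataset).
df = pd.DataFrame({"color": ["brown", "blue", "brown", "white", "blue"]})

# dict + map approach:
df["color_mapped"] = df["color"].map({"blue": 0, "brown": 1, "white": 2})

# LabelEncoder does the same without hand-writing the dict:
le = LabelEncoder()
df["color_encoded"] = le.fit_transform(df["color"])
print(le.classes_)   # the learned mapping, in sorted order
```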

------
kqr2
For the Information Gain calculations, why are the chocolates we don't want to
eat (Blue & Kit Kats) discarded, i.e. considered to have entropy 0? Shouldn't
they be included as part of the Information Gain formula?

~~~
ishansharma
Since we are looking at entropy with respect to the class (whether we want to
eat or not eat), the entropy for blue & kit kats is 0, as we don't want to eat
any of them. So there's no randomness.

They are included in the IG formula, but since their entropy terms are 0, they
don't have any effect on the weighted sum.
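
A tiny worked example (made-up counts, not the exact numbers from the article)
showing that the zero-entropy branch still appears in the weighted sum, it
just contributes nothing:

```python
import math

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return sum(-p * math.log2(p) for p in probs)

parent = ["eat"] * 5 + ["dont"] * 5        # mixed parent node: entropy = 1.0
left   = ["dont"] * 4                      # pure "don't eat" branch: entropy 0
right  = ["eat"] * 5 + ["dont"]            # mixed branch: entropy > 0

# Weighted child entropy: the pure branch's term is (4/10) * 0 = 0.
weighted = (len(left) / len(parent)) * entropy(left) \
         + (len(right) / len(parent)) * entropy(right)

info_gain = entropy(parent) - weighted
print(entropy(left))   # 0.0 -- included in the sum, contributes nothing
print(info_gain)
```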

------
michaelhoffman
Nice tutorial! If you want a description of how random forests build on
decision trees, my student and I wrote a little commentary. Sadly, it is
behind a paywall for now.

[http://www.pnas.org/content/115/8/1690](http://www.pnas.org/content/115/8/1690)

