Hacker News new | comments | ask | show | jobs | submit login
Introduction to Decision Tree Learning (fritz.ai)
294 points by austin_kodra 9 months ago | hide | past | web | favorite | 26 comments



A funny thing about decision trees (or random forests) is how conceptually simple they are, but in terms of implementation they're very non-trivial.

There's always a point in the lecture or explanation where they go

So we just find the optimal split/feature based on entropy

which no one talks a ton about, but naively implemented is something on the order of O(kNlogN). For each split. Multiply that by the number of leaves (2^depth), and multiply that by the number of trees in your forest.

I learned this the hard way when I tried implementing random forests on GPU for a class (would not recommend: efficiently forming decision trees seem to involve a lot of data copying and shifting around). I actually learned a lot from reading sklearn's implementation of decision trees in Cython - it uses quite a number of neat tricks to make things really fast.


For optimizations, have a look at section 3 of the XGBoost paper: https://arxiv.org/pdf/1603.02754.pdf. LightGBM has similar features (e.g. binning: http://lightgbm.readthedocs.io/en/latest/Parameters.html#io-...).

Scikit is shockingly slow, in comparison. Also bloated, but that's more a matter of 1) not having a "release" impl that ditches data only useful for debugging, and 2) using 64-bit data types all over the place, despite running in parallel arrays! (https://github.com/scikit-learn/scikit-learn/blob/master/skl...)


In random forests you don't really need to find the optimal split. You usually generate a number of random splits and select the best one.


I believe the formulation of random forests requires you to find the optimal split, albeit over a subset of features.

What you're talking about, where you simply generate a set of random splits across features, is Extremely Randomized Trees (https://link.springer.com/article/10.1007%2Fs10994-006-6226-...).


Since we're splitting hairs instead of training sets: Classical RFs select one random feature and choose the optimal split, whereas Extremely RTs choose the best feature (out of a random subset) whereby for each feature only one random split is tested.

Another difference is that RFs use a bootstrapped dataset and ERTs use the full dataset.


Ah, I was under the impression that RFs choose from a subset of features, not just one feature.

In any case, I agree with the thrust of your original comment that the specifications of the RF algorithm can be relaxed, usually for performance reasons, and still retain strong performance. But this goes back to my original comment that the performance considerations of random forests often aren't highlighted to new learners (whereas introducing ERTs to a beginner would probably shock them - how could you take totally random splits and still get any reasonable performance!)


> Ah, I was under the impression that RFs choose from a subset of features, not just one feature.

You are correct. For classification the usual rule of thumb is to select square root of the number of features.


Thanks for sharing. I will take a look at Cython implementation, that should help me learn a few things.


I actually had to implement decision trees this semester for my Data Mining & Machine Learning class. Albeit we couldn't use any libraries and had to implement everything from scratch (aside from being able to read in the data with pandas).

The main problem with decision trees is that they will overfit to the maximum extent possible, which is why pruning [0] and depth limitation [1] are used to reduce overfitting and improve the decision tree's ability to generalize onto the test set. On the data set provided for my assignment, the ranking of performance was: (traditional algorithm) < (traditional with depth limit) < (traditional with pruning).

[0]: https://en.wikipedia.org/wiki/Pruning_(decision_trees)

[1]: Basically the maximum depth of the tree is limited to some height. http://scikit-learn.org/stable/modules/generated/sklearn.tre... (see `max_depth`).


Hi,

Author here. Thanks for the links. I originally planned to try a bigger dataset but didn't want to make the article too long, so left out the optimizations.

I had to do the same exercise for my ML class and had similar results. Random pruning (mostly) worked great for me.

IMHO, While DTs do overfit a lot, they are a great starting point for beginners because of their (relative) simplicity. Better to start light and then introduce the math heavy neural nets and SVMs.


I don’t know why but I felt regression to be a very easy starting point. y = mx + c is just high school math.


Thanks for stopping by the thread. I'm doing my thesis in regression trees and appreciate any love trees get in the ML space. I quite liked the article, but my only comment is that the pseudo-code for ID3 was a little hard for me to read formatting wise.


Ah, I guess I could have used a Gist for that. Thanks for the suggestion, let me see if I can update the article.


Hey, great explanation of decision trees. I might recommend, instead of going for a bigger dataset, maybe following it up by adding bagging or boosting.

Fairly often people writing tutorials jump straight to Random Forest or Gradient Boosting, and those are great to use but maybe too big a conceptual leap to understand straight away if your theoretical background is weak.


Thank you. I will consider doing that. I will have to create or find a relevant dataset that's small enough for those concepts but I guess that'll help rather than jumping straight to random forest or gradient boosting.


I have some notes on the Computational Learning Theory view of decision tree fitting (and over fitting) here: http://www.win-vector.com/blog/2017/01/why-do-decision-trees...


There is method called random forests, which basically construct trees on random slices of data and let them vote on outcome. It is still very simple method and it has better over-fitting performance.


This is great and has an awesome level of detail on information gain. I have found decision trees really useful for getting a feel for what features in a dataset matter and trying out different max depths on trees to get more insight into the data.

I wrote an article earlier this year on how I use decision trees to classify players for daily fantasy sports into different groups that people may find useful https://medium.com/@bmb21/why-is-caris-levert-projected-for-...


I use RF's commonly for getting that same feel, but I recently learned that I've been making some big mistakes when interpreting default feature-importance outputs; this recent article really opened my eyes: http://parrt.cs.usfca.edu/doc/rf-importance/index.html

Short version: always use permutation importance, but use leave-it-out importance when it really matters


Thank you for the kind words. My goal was to make the explanations as simple as possible.

Using them to get a feel about dataset is an interesting use case and I haven't tried that. It does sound interesting though. Something to do with my next ML problem I suppose. :)

Just skimmed through your article(class final exams tomorrow), it is informative. Thanks for sharing.


Is there going to be a next part about ensembles of trees? Thinking boosting / random forest.


I am considering doing a more programming focused part, where I take a big dataset and implement a decision tree using Scikit Learn. After that, I will look into ensembles.

If you have any specific questions about ensembles, feel free reply here or send me a mail to contact(at)<myusername>(dot)com.


Great article, just one small nitpick: sklearn actually does support encoding text labels. I just tested with your dataset to reconfirm, sklearn.preprocessing.LabelEncoder() has no problem encoding labels correctly.

It probably makes little difference if you use map or LabelEncoder but I tend to prefer LabelEncoder to avoid typos and make it more concise (those dicts can become very long if we have a lot of labels).


For the Information Gain calculations, why are the chocolates we don't want to eat (Blue & Kit Kats) discarded, i.e. considered entropy 0. Shouldn't it be included as part of the Information Gain formula?


Since we are looking at entropy with respect to class (whether we want to eat or not eat), entropy for blue & kit kats is 0 as we don't want to eat them at all. So there's no randomness.

They are included in the IG formula but since they are multiplied by 0, they don't have any effect.


Nice tutorial! If you want a description of how random forests build on decision trees, my student and I wrote a little commentary. Sadly, it is behind a paywall for now.

http://www.pnas.org/content/115/8/1690




Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact

Search: