Decision trees are the fundamental building block of gradient boosting machines and Random Forests™, probably the two most popular machine learning models for structured data. Visualizing decision trees is a tremendous aid when learning how these models work and when interpreting models. Unfortunately, current visualization packages are rudimentary and not immediately helpful to the novice. So, we've created a general package called animl for scikit-learn decision tree visualization and model interpretation.
This is cool. I like it, and will probably use it in my work, but it feels like there’s a lot going on. I don’t like how some of the final leaf nodes seem to be shown differently than the nodes higher up: sometimes different chart types, sometimes reversed axes. I would also recommend swarm plots for your regression scatter plots. Swarm plots are sexy, but not in the laughably uncomfortable way of the very similar violin plot.
Yep, the leaves are predictor nodes whereas internal nodes are decision nodes. They are doing different things so we figured we should show them using different visualizations.
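To make the leaf vs. internal node distinction concrete, here is a minimal sketch using plain scikit-learn (not the animl package itself). It assumes the iris dataset and a depth-3 classifier purely for illustration; `describe_nodes` is a hypothetical helper name, not part of any library.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Illustrative data and model; any fitted scikit-learn tree would do.
X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
tree = clf.tree_


def describe_nodes(tree):
    """Hypothetical helper: label each node as a decision node or a leaf."""
    rows = []
    for node_id in range(tree.node_count):
        left = tree.children_left[node_id]
        right = tree.children_right[node_id]
        if left == right:
            # Leaves have no children (both child ids are -1); they predict.
            predicted = int(tree.value[node_id][0].argmax())
            rows.append((node_id, "leaf", f"predict class {predicted}"))
        else:
            # Internal nodes test one feature against a threshold; they decide.
            feature = int(tree.feature[node_id])
            threshold = float(tree.threshold[node_id])
            rows.append((node_id, "decision", f"x[{feature}] <= {threshold:.2f}"))
    return rows


nodes = describe_nodes(tree)
for row in nodes:
    print(row)
```

Decision nodes carry a feature and a threshold, while leaves carry only a class distribution, which is one way to see why the two might reasonably get different plots.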
Wow, I wondered why you put a TM on Random Forests. I guess it is a trademark of Salford Systems, which is kind of weird. Maybe we can just call them random forests and ignore that.
> Although owners of trademarked names may suggest otherwise, publishers are not obligated to denote the trademark status of a name when that name is mentioned in text. Authors representing trademark owners frequently feel obligated to use the trademark or registered-trademark symbol (™ or ®) after the first mention of their product names but often do not use these symbols consistently to indicate the trademark status of other names not owned by their particular sponsor or employer.
The people who own the trademark may feel obligated to use those marks, but nobody else ever is.
There's a lot of "folk law" (that is, urban legends repeated by the ignorant) surrounding this concept, so if you think I'm wrong, please do yourself and the rest of us a favor and research good cites to show that there's actual law saying I'm wrong. Thanks.
I'm often guilty of this too - but we really should put the (tm) there. It's nice that they made code of the algorithm publicly available and all they ask is that we respect their trademark in return. I think that's more than fair. :)
(I discussed this a few years ago with the co-inventor of random forests, Adele Cutler, and she confirmed that this is something that she wants to see happen.)
Not the answer to your question, but in case it helps anyone: trademarks are unrelated to patents. You can use a random forest but you cannot call it a “random forest”. “Aleatory jungle” is fine, though.
"stochastic treeset". Sounds way more scientific, which can be required to convince a pointy-hair boss. "Random" forest sounds... well, I can flip a coin too, how is that going to solve my problem?
For the same reason, "naive" Bayes classifiers are very hard to sell, to the point that I stopped naming them and now just say "a very fast machine learning algorithm", unless specifically asked.
Indeed. They were the inspiration for this visualization. I wanted to do something for my book with Jeremy Howard https://mlbook.explained.ai/ and those guys show the way, but of course it isn't a general library. Love that r2d3.us page.
Good to see others looking into tree model viz. I've done work with larger scale tree visualizations and found you quickly run out of space. I wound up using interactivity to reveal branch level info, dynamically pruned the tree based on train support, and I used a more sophisticated layout technique to pack more info in. https://www.google.com/amp/s/blog.bigml.com/2012/01/23/beaut...
FWIW visualizing trees like that helps spot problems really quickly. Overfitting behavior typically involves overusing a certain field, or growing long and relatively narrow branches.
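For anyone who wants a rough numeric version of those two warning signs before drawing anything, here is a sketch against scikit-learn's fitted tree arrays. It assumes an unpruned classifier on the iris dataset for illustration; `split_counts` and `leaf_depths_and_support` are hypothetical helper names, and what counts as "overused" or "too narrow" is left to the reader.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Illustrative data; the tree is deliberately left unpruned.
X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0).fit(X, y)
tree = clf.tree_


def split_counts(tree):
    """Count how many internal nodes split on each feature index."""
    # Leaves store a negative placeholder in tree.feature, so skip them.
    return Counter(int(f) for f in tree.feature if f >= 0)


def leaf_depths_and_support(tree):
    """Return (node_id, depth, n_training_samples) for every leaf."""
    leaves = []
    stack = [(0, 0)]  # (node_id, depth), starting at the root
    while stack:
        node_id, depth = stack.pop()
        left = tree.children_left[node_id]
        right = tree.children_right[node_id]
        if left == right:  # both -1, so this node is a leaf
            leaves.append((int(node_id), depth, int(tree.n_node_samples[node_id])))
        else:
            stack.append((left, depth + 1))
            stack.append((right, depth + 1))
    return leaves


usage = split_counts(tree)
leaves = leaf_depths_and_support(tree)

print("splits per feature:", dict(usage))
# Deep leaves with little training support are the long, narrow branches.
for node_id, depth, support in sorted(leaves, key=lambda r: (-r[1], r[2])):
    print(f"leaf {node_id}: depth={depth}, train samples={support}")
```

A feature that dominates `usage`, or leaves that sit deep in the tree with only a handful of training samples, are the same patterns that tend to jump out of the picture.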
Not sure about the choice of pie chart as the default leaf format (humans are bad at guessing proportions from pie charts) but otherwise it does look great and convey the information efficiently.
Howdy! We use pie charts for classifier leaves, despite their bad reputation. For the purpose of indicating purity, the viewer only needs an indication of whether there is a single strong majority category. The viewer does not need to see the exact relationship between elements of the pie chart, which is one key area where pie charts fail.
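As a rough illustration of what "purity" means here, the sketch below computes the share of the majority class in each leaf of a scikit-learn classifier. This is not how the package draws its pies; it assumes the iris dataset and a depth-2 tree, and `leaf_purity` is a hypothetical helper name.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Illustrative data and a shallow tree so some leaves stay impure.
X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
tree = clf.tree_


def leaf_purity(tree):
    """Map each leaf id to the fraction held by its majority class."""
    purities = {}
    for node_id in range(tree.node_count):
        if tree.children_left[node_id] == tree.children_right[node_id]:
            # Per-class counts or proportions, depending on the
            # scikit-learn version; the ratio is the same either way.
            dist = tree.value[node_id][0]
            purities[node_id] = float(dist.max() / dist.sum())
    return purities


purities = leaf_purity(tree)
for node_id, purity in purities.items():
    print(f"leaf {node_id}: majority share = {purity:.2f}")
```

A single number per leaf is all the pie has to get across: near 1.0 reads as one dominant wedge, and anything much lower reads as a visibly mixed leaf.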