Why do tree-based models still outperform deep learning on tabular data? (arxiv.org)
315 points by isolli on Aug 3, 2022 | 139 comments



I have a theory - tree based models require minimal feature engineering. They are capable of handling categorical data in principled ways, they can handle the most skewed/multimodal/heteroskedastic continuous numeric data just as easily as a 0-1 scaled normal distribution, and they are easy to regularize compared to a DL model (which could have untold millions of possible parameter combinations, let alone getting the thing to train to a global optimum).

I think if you spent months getting your data and model structure to a good place, you could certainly get a DL model to out-perform a gradient boosted tree. But why do that, when the GBT will be done today?


this is along the lines of my thinking. people organize and summarize data before throwing it into spreadsheets, where deep learning models do their thing by generating new representations from raw data.

in a sense, most data in spreadsheets is compressed and deep learning models prefer to find their own compression that best suits the task at hand.

or in human terms: "these spreadsheets are garbage. i can't work with this. can you bring me the raw data please?" :)


> I have a theory - tree based models require minimal feature engineering.

Actually, the whole premise of Deep Learning is to learn proper feature representations from data with minimal data preprocessing. And it works wonderfully in CV and NLP but is less performant in tabular data. The paper indicates that there are several contributing factors to the DL underperforming.


I think another aspect is that most modern GBT models prefer the entire dataset to be in memory, thereby doing a full scan of the data for each iteration to calculate the optimal split point. That’s hard to compete with if your batch size is small in a NN model.


that's an interesting idea. but at the end of the paper they do an analysis of the effect of different hyperparameters for the nets with their dataset and find that the batch size doesn't seem to matter much. (although they're trying size ranges like [256, 512, 1024] as opposed to turning batching off entirely)


The issue isn’t batch size as a parameter but rather needing to load the entire dataset into memory


> thereby doing a full scan of the data for each iteration to calculate the optimal split point

> (although they're trying size ranges like [256, 512, 1024] as opposed to turning batching off entirely)

> The issue isn’t batch size as a parameter but rather needing to load the entire dataset into memory

what's stored in memory is an implementation detail. the key idea is that the tree algorithms are choosing an optimal split based on the entire dataset, whereas sgd is working on small randomly chosen batches. turning off batching means computing gradients on the entire dataset instead.

although the typical bottleneck in gpu computing is moving data to and from the gpu's working memory (which is probably why you mention memory), there is nothing in theory that says these computations could not be implemented in a streaming manner.
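for concreteness, a toy numpy sketch of the two regimes on a made-up least-squares problem (not the paper's setup):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(10_000, 20))
    y = X @ rng.normal(size=20) + 0.1 * rng.normal(size=10_000)
    w = np.zeros(20)

    def grad(Xb, yb, w):
        # gradient of mean squared error for a linear model
        return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

    lr, batch = 0.05, 256
    for step in range(1000):
        # "batching off" would be: w -= lr * grad(X, y, w), i.e. every row, every step
        # mini-batch sgd instead works on a small random subset each step:
        idx = rng.integers(0, len(X), size=batch)
        w -= lr * grad(X[idx], y[idx], w)

the tree algorithms get the full-dataset view for free; whether that view sits in ram or gets streamed in chunks is, as above, an implementation detail.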


Oh, you touched my favorite topic of whole dataset training.

Take a look at [1] and go straight to the page 8, figure 2(b).

[1] http://proceedings.mlr.press/v48/taylor16.pdf

The paper talks about whole dataset training, and one of the datasets used is HIGGS [2]. Figure 2(b) shows two whole-dataset training approaches (L-BFGS and ADMM) vs SGD. SGD basically tops out at the accuracy both whole-dataset approaches start from.

[2] https://archive.ics.uci.edu/ml/datasets/HIGGS#

HIGGS is a strange dataset. It is narrow, having only 28 features. It is also relatively long, about 11M samples (10M to train, 0.5M to validate and the last 0.5M to test). It is also hard to get right with SGD.

But if you perform whole-dataset optimization, even logistic regression can get you good accuracy [3] (some experiments of mine).

[3] https://github.com/thesz/higgs-logistic-regression


they also do subsampling of the data though.


I think you’re on the right track that trees are good at feature engineering. But the key problem is that DL researchers are horrible at feature engineering, because they have never had to do it. These folks included.

The feature engineering they do here is absolutely horrible! They use a QuantileTransformer and that's it. They don't even tune the critical hyperparameter of the number of quantiles. Do they always use the scikit-learn default of 1,000 quantiles? No wonder uninformative features are hurting - they are getting expanded into 1,000 even more uninformative features! Also with a single quantile transform like that, the relative values of the quantiles are completely lost! If the values 86 and 87 fall into different bins, the model has literally no information that the two bins are similar to each other, or even that they come from the same raw input.

For a very large dataset a NN would learn its way around this kind of bone headed mistake. But for this size dataset, these researchers have absolutely crippled the nets with this thoughtless approach to feature engineering.

There is plenty more to criticize about their experiments, but it’s probably less important. E.g. Their HP ranges are too small to allow for the kind of nets that are known to work best in the modern era (after Double Descent theory has been worked out) - large heavily regularized nets. They don’t let the nets get very big and they don’t let the regularization get nearly big enough.

So. Bad comparison. But it’s also very true that XGB “just works” most of the time. NN’s are finicky and complicated and very few people really understand them well enough to apply them to novel situations. Those who do are working on fancy AI problems, not writing poor comparison papers like this one.


I think you misunderstand what QuantileTransformer does; it just transforms the distribution of the data to be more normal, it's not a binning technique. Too many quantiles will just result in excess noise, not loss of information.
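A quick sketch of what it does (scikit-learn's QuantileTransformer with default-ish settings): it's a monotonic per-column map toward a target distribution, so neighbouring raw values like 86 and 87 stay neighbours rather than being binned apart:

    import numpy as np
    from sklearn.preprocessing import QuantileTransformer

    rng = np.random.default_rng(0)
    x = rng.lognormal(size=(5000, 1))   # a heavily skewed column
    qt = QuantileTransformer(output_distribution="normal", n_quantiles=1000)
    z = qt.fit_transform(x)

    # monotonic: sorting by the raw column also sorts the transformed one
    order = np.argsort(x[:, 0])
    print(np.all(np.diff(z[order, 0]) >= 0))   # True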


Oh I see. Interesting.


It occurs to me that a system, trained on peer-reviewed applied-machine-learning literature and Kaggle winners, that generates candidates for structured feature-engineering specifications, based on plaintext descriptions of columns' real-world meaning, should be considered a requisite part of the "meta" here.

Ah, and then you could iterate within the resulting feature-engineering-suggestion space as a hyper-parameter between experiments, which could be optimized with e.g. https://github.com/HIPS/Spearmint . The papers write themselves!


I agree. The majority of DL layers are about feature engineering, not performing classification.


Could you say more about this?

One of the things that interests me about nominal AI applications is the extent to which they're sort of a Mechanical Turk or what I've heard called Artificial Artificial Intelligence. By which I mean it's sold as computer magic, but most of the magic is actually humans sneaking in a lot of human judgement. That can come through humans directly massaging the output or through human-driven selection of results. But I've also been wondering to what extent natural human intelligence is getting put in at the lower layers of the system, like feature engineering.


This is how I see it:

Statistical modeling: input -> feature extraction (manual) -> model selection (manual) -> output

Machine learning: input -> feature extraction (manual) -> model selection (auto) -> output

Deep learning: input -> feature extraction (auto) -> model selection (auto) -> output

So take a DL image classifier. The convolution + pooling layers perform automatic feature extraction. Back to OPs point, why use something like DL when you've already engineered your features?
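A minimal PyTorch sketch of that split for images (shapes are arbitrary): the conv/pool stack is the automatic feature extraction, the final linear layer is the part that actually classifies:

    import torch
    import torch.nn as nn

    feature_extractor = nn.Sequential(      # learned "feature engineering"
        nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
        nn.MaxPool2d(2),                    # 32x32 -> 16x16
        nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
        nn.MaxPool2d(2),                    # 16x16 -> 8x8
        nn.Flatten(),
    )
    classifier = nn.Linear(32 * 8 * 8, 10)  # the "model" on top of the features

    x = torch.randn(4, 3, 32, 32)           # a batch of fake RGB images
    print(classifier(feature_extractor(x)).shape)   # torch.Size([4, 10])

With tabular data you have typically already done the feature_extractor part by hand, which is the point above.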


Wow, what a good summary. Thanks!


looks like AI, quacks like a bunch of linear equations


Lots of linear functions with \x -> max(0, x) thrown in between.

(That's literally what neural nets with relu as activation unit do.)
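In numpy, the forward pass really is just that (a toy 2-layer net with made-up shapes):

    import numpy as np

    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(size=(64, 10)), np.zeros(64)
    W2, b2 = rng.normal(size=(1, 64)), np.zeros(1)

    def forward(x):
        h = np.maximum(0, W1 @ x + b1)   # linear map, then \x -> max(0, x)
        return W2 @ h + b2               # another linear map

    print(forward(rng.normal(size=10)))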


What are the principled ways that tree based models handle categorical data? If you end up having to do one-hot encoding it feels like you need very wide forests or very deep trees. If your categorical data is actually vaguely continuous then splits can be quite efficient but that’s rare.

I assume some day someone will be able to explain all this in information theoretic terms. I’m never sure if we’re comparing like with like (are the deep learning models we’re comparing against actually that deep, for example?) but clearly there’s something to the intuition that many small overfit models are more efficient than one big general model.


Some implementations will group the categories into two groups, right and left. An easy way to do this is to create the groups based on whatever grouping creates the biggest decrease in the loss. There are more complicated implementations; LightGBM groups by sum(gradient) / (sum(hessian) + smooth), where smooth is a smoothing parameter which prioritizes class size over loss decrease.

They link to this paper in their docs: https://www.tandfonline.com/doi/abs/10.1080/01621459.1958.10...
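A rough numpy sketch of the idea (not LightGBM's actual code): sort the categories by that statistic, then only scan splits along the sorted order instead of trying all 2^k groupings:

    import numpy as np

    def best_categorical_split(cats, grad, hess, smooth=10.0):
        # cats: category id per row; grad/hess: per-row gradient/hessian
        # (for squared error, grad = pred - y and hess = 1)
        ks = np.unique(cats)
        stat = np.array([grad[cats == k].sum() / (hess[cats == k].sum() + smooth)
                         for k in ks])
        order = ks[np.argsort(stat)]        # categories sorted by the statistic
        G, H = grad.sum(), hess.sum()
        best_gain, best_left, gl, hl = -np.inf, None, 0.0, 0.0
        for i, k in enumerate(order[:-1]):  # scan prefixes of that ordering
            gl += grad[cats == k].sum()
            hl += hess[cats == k].sum()
            gain = (gl**2 / (hl + 1e-9)
                    + (G - gl)**2 / (H - hl + 1e-9)
                    - G**2 / (H + 1e-9))
            if gain > best_gain:
                best_gain, best_left = gain, set(order[:i + 1])
        return best_left, best_gain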


One way is to use categorical set splits [1] (proposed for categorical set inputs, but works for categorical features as well), used in TF-DF [1]. Greedy, and expensive to train (cheap inference though), but it gives great results.

[1] https://arxiv.org/pdf/2009.09991.pdf
[2] https://www.tensorflow.org/decision_forests/text_features


do you reckon it's possible to somehow transfer learn from a GBT to a NN ?


occam's razor


> Results show that tree-based models remain state-of-the-art on medium-sized data (∼10K samples) even without accounting for their superior speed.

Is that really "medium"? That seems very small to me. MNIST has 60,000 samples and ImageNet has millions.

I think the title overstates the findings. I'd be interested to hear how these methods compare on much larger datasets. Is there a threshold at which deep learning outperforms tree-based models?

Edit: They touch on this in the appendix:

> A.2.2 Large-sized datasets

> We extend our benchmark to large-scale datasets: in Figures 9, 10, 11 and 12, we compare the results of our models on the same set of datasets, in large-size (train set truncated to 50,000 samples) and medium-size (train set truncated to 10,000 samples) settings.

> We only keep datasets with more than 50,000 samples and restrict the train set size to 50,000 samples (vs 10,000 samples for the medium-sized benchmark). Unfortunately, this excludes a lot of datasets, which makes the comparison less clear. However, it seems that, in most cases, increasing the train set size reduces the gap between neural networks and tree-based models. We leave a rigorous study of this trend to future work.


Many real world problems that result in data are decidedly medium: small enough to fit in Excel, large enough to be too big to comfortably handle in Excel.


Then these real world problems don’t actually warrant deep learning ?

I thought the biggest leap in NN and deep learning in recent history was the realization that we need a ton of data to get maximal effectiveness from them; it now sounds counterproductive to forget this and cry they don’t work well with 10,000 rows.


> Then these real world problems don’t actually warrant deep learning ?

It is an important lesson to be communicated.

I'd like to present a conjecture: everyone thinks their data is big until they have worked on a much larger dataset. ("We have 10k samples, it is quite big!" -> "We have 1M data records, it is quite big!" -> "Our process outputs that much per day")


That's right. Most problems don't warrant deep learning with the current state of the art.


I've put in production numerous models with millions of tabular data points and a 10^5-10^6 feature space where tree-based models (or FF nets) outperform more complex DL approaches.


I'll chime in with billions of data points and 100-300 feature space with some smart feature engineering outperforming DL in runtime/compute (by orders of magnitude) and performance. But the domain was very specific and everything prior to the tree was doable with matrix operations, with the tree model summarizing a mixture of experts that chose optimal features.


Damn, this sounds fascinating! Have you shared more anywhere else?


What kind of freakish tabular data do you have with a million columns??


A badly defined one probably? At least for one of the test arms.


Categorical variables in the datasets of large tech companies can take a lot of different values.


MNIST is not your typical real world tabular data. Many if not most data science problems out there are still in the range of a few k samples from my perspective (trying to "sell" ML to the average company). From a statistical point of view I would not call the datasets small (you can decently compare two means from subsets without needing a Student's t-distribution).


Assuming the categories are meant to apply to any data sets, anything amenable to machine learning at all is at least medium data. "Small" data would be something like a human trial with n=6 because the length and compliance of the protocol is so onerous. There are entirely different statistical techniques for finding significance in the face of extremely low power.


It's baffling to me how little research attention there has been to improving tree-based methods, considering their effectiveness.

For example, LightGBM and XGBoost allow some regularization terms, but the variance/bias is still mostly controlled by globally setting the max depth and max node count (and then parameter searching to find good settings).
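For example, the knobs I mean (a sketch using LightGBM's scikit-learn wrapper; the values are arbitrary):

    from lightgbm import LGBMRegressor

    model = LGBMRegressor(
        num_leaves=63,          # global cap on leaves per tree
        max_depth=8,            # global cap on tree depth
        min_child_samples=50,   # don't split nodes with too few rows
        reg_alpha=0.1,          # L1 penalty on leaf values
        reg_lambda=1.0,         # L2 penalty on leaf values
        subsample=0.8,          # row subsampling per tree...
        subsample_freq=1,       # ...applied every iteration
        colsample_bytree=0.8,   # feature subsampling per tree
    )

Everything there is either a hard structural cap or a leaf-value penalty; nothing decides per-node, in a data-dependent way, that a tree has learned enough.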

Surely there must be more powerful and sophisticated ways of deciding when to stop building each tree than counting the number of nodes? If this was neural nets there would be a hundred competing papers proposing different methods and arguing over their strengths and weaknesses.

I'm not sure whether the problem is that neural nets are just fundamentally more sexy, or that in order to make SOTA improvements in GBMs you need to dive into some gnarly C++. Probably both.


Why do you think there has been little research attention? Time was, 'machine learning' was little but tree-based methods (and that was how they distinguished themselves from 'AI'). Go look at Breiman's CV or random conference proceedings. Or as tree-based method proponents love to point out, pretty much everyone on Kaggle up until recently used trees for everything non-image-based; that's a ton of effort invested in tweaking trees. And there were hardware efforts to accelerate them (I recall MS talking about how they were investing in FPGAs for MS Azure to run trees better), so 'GPUs' isn't an excuse.


I think people throwing algorithms at problems on Kaggle, or combining them, is not the kind of research the GP was referring to.

Limiting a tree by its depth is a very general global parameter for a tree. One could try to use any kind of criterion for deciding when to stop making more child nodes in a tree, depending on what information is locally available, and that depends on how the tree algorithm is actually implemented. So people doing Kaggle challenges would have to dig into the source code of the tree implementation, then change things there, to modify locally available knowledge, in order to allow for more fine-grained decision making at each node.

That is only the constructive side of things, when the tree is created. Even more powerful is the post-processing / destructive / pruning side of things, because theoretically the whole tree structure can be taken into account when deciding which branch to cut.

I think the GP is referring to research in the area of what other useful things one can come up with here. As far as I know, these are not the usual things people do in Kaggle challenges. Correct me if I am wrong.


Because anytime I search for literature on basic tweaks to the structure of decision trees, I find nothing.

Another example: modern GBM implementations all use binary trees. How would they perform with ternary trees? Or k-way trees for larger k plus some additional soft penalty that encourages minimizing the number branches, unless the information gain is really worth it?

(You can simulate ternary trees with binary, but the splitting behavior is different because ternary can more easily identify good splitting regions in the middle range of the histogram values.)

This seems like such a basic structural question, but the only relevant search result was this downvoted Stack Exchange question from 5 years ago: https://stats.stackexchange.com/questions/305685/ternary-dec...

There are lots of papers on ternary trees in old-school contexts like Ternary Decision Diagrams etc., but nothing relevant to the context of modern tree ensemble performance. Or maybe I'm just bad at searching?

(I implemented this and saw a small accuracy increase from ternary on large datasets, but much worse training speed because you get less mileage from the histogram subtraction trick. Maybe the accuracy would be even better with a more clever choice of soft penalty.)


> LightGBM not improving, meanwhile 8-figure budgets builds GPU and auto-logins..

My take? A management agenda to build plug-and-play researchers (humans on jobs), rather than domain specialists. Deep learning fits that description with all-plumbing, all-the-time.. domain specialists want graduate school, weekends and health benefits..


Is this supposed to be an ironic take on the "plug-and-play HN comment blaming cost cutting management for everything"?

Because it's not this.

Deep learning researchers are specialists in their specific technologies. For example, recent advances in transformers notwithstanding, it takes quite a long time for, say, someone who knows how to build deep learning models on images using CNNs to come up to speed on how to build good natural language processing models.


There are a fair number of papers (start with DART/dropout and BART, Bayesian sampling of the whole GBM), but they start to look like global optimization problems, and part of the reason trees work so well is that the local greedy optimization can be made super fast on modern CPU caches.

So even if you can fit a more compact forest that performs well through clever regularization, it's usually better/faster in practice to grow more simple trees with more randomization and let the overfitting average out.


I think part of the problem is that the upper bound on neural nets, as far as we can tell, might very well be general intelligence, and things like self-driving cars, and other nearly magical use-cases that seem within reach. Whereas tree based models, for a series of reasons, many related to scaling, don't offer that feeling of limitless potential.


Maybe. But then again we often try to solve very specific problems, which are very far from requiring anything close to a general intelligence. Heck, a general intelligence might even be bad at things, just like humans are not as good as computers at certain things.


There's a science-fiction story somewhere in there. Imagine an alien planet where brains evolved using a structure completely different than a neural net... some sort of classification model where they were incredibly efficient inside their domain, but broke unpredictably when brought outside of it.


If you’re interested in how tree-based models work, I wrote an interactive explanation on decision trees here: https://mlu-explain.github.io/decision-tree/

and random forests here: https://mlu-explain.github.io/random-forest/

It’s also worth noting that a recentish paper shows neural networks can perform well on tabular data if well-regularized: https://arxiv.org/abs/2106.11189v1?utm_source=jesper&utm_med...


That was super easy to digest, thank you!


Really nice interactive explanations!


I like decision trees and this helps support my case for using them. I often go even further and don't use an algorithm to build the trees but build trees myself along intuitive causal lines and use the data to train its parameters. I sometimes build a few models manually and see which fits better with the data.

Prior knowledge can prevent the pitfalls of automatically built models.

Trees may be better than NNs because they overfit less but you can overfit even less with a bespoke model. For example, I've seen an automatically generated tree made to tune the efficiency of a factory end up using as a main feature "be after a specific date" because a machine was upgraded on that date and so the learning algorithm latched on to that unactionable piece of data as a main predictor for the model.

This was an easy fix (just don't feed timestamp data to the model), but there are lots of more subtle cases like this, and I've seen people spend much time cleaning and "tweaking" the input data to get the answers they want out of their ML models.

If you have to try to make your ML model behave by manually selecting what data to feed it, you might as well go all the way and build a clean causal model yourself that reflects your priors and domain knowledge about the subject.

I have an ML background but I often get more performance out of my models by doing something along the lines of what a Bayesian statistician would do.

Of course, with high-dimensional data like pixels in images you almost have no choice but to use NNs. There's no way to hand-build these models.


I want to know more about your process. How do you optimise the split point? How do you choose which feature to split on first?


I start with common sense priors or "intuition" but also sometimes write code to try different combinations and assess accuracy.

For example, not too long ago I was trying to estimate and predict ATM usage and considered using days of the week as a feature. The simplest case is a single node tree that assumes every day of the week has similar usage. You just calculate mean and variance.

But then, a lot of ATMs have different usage patterns on weekends than on weekdays. So I considered a split into two groups, calculating different mean/variance for weekend days and weekdays. I could have tried different groupings of days, but those two were easy to understand for users and covered most cases I was interested in, with less risk of overfitting given the amount of data that I was working with (with more data I might have considered more splits, or even taken into account special days like holidays). The more splits, the more data you need to train the model, especially if you want not just a point estimate but an idea of variance (more parameters to estimate require more data).
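A toy sketch of those two candidate models (made-up numbers, obviously nothing like the real ATM data):

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    days = pd.date_range("2022-01-01", periods=120)
    weekend = days.dayofweek >= 5
    # hypothetical per-day withdrawal totals for one ATM (weekends a bit busier)
    withdrawn = rng.normal(100, 20, size=len(days)) + 40 * weekend

    df = pd.DataFrame({"date": days, "withdrawn": withdrawn, "is_weekend": weekend})

    # single-node "tree": one mean/variance for every day
    root = df["withdrawn"].agg(["mean", "var"])

    # two-leaf tree: separate mean/variance for weekdays vs weekend days
    leaves = df.groupby("is_weekend")["withdrawn"].agg(["mean", "var"])
    print(root, leaves, sep="\n")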

The decision whether to split or not can be done on a per-ATM basis, cross-validating the splitting strategy by assessing a fleet of ATMs.

People optimize different things in these kinds of problems. Here https://www.saedsayad.com/decision_tree_reg.htm for example they try to reduce prediction variance. Oftentimes decision trees try to optimize information gain or gain ratio (https://en.wikipedia.org/wiki/Information_gain_(decision_tre...). Gain ratio seems like a pretty ad-hoc technique to me, so to prevent over-fitting and select the correct model (to split or not), I sometimes use a relative entropy penalty on the trained parameters, kinda like described here: http://www.inference.org.uk/mackay/itprnn/ps/343.355.pdf. I've never used it, but apparently if you have problems where you can't calculate that parameter relative entropy, there are approximation techniques such as https://en.wikipedia.org/wiki/Bayesian_information_criterion.

Then there are things I sometimes do with hyperparameters to improve performance, like mixing into an individual ATM's data fake data points that represent the prior for a wider group or class of ATMs. Again, using common sense to adjust those priors (and always cross-validating my tweaks against real data).


To add to this: one of the best things about this approach is how little computational power it requires compared to NNs. I was able to finish this project in little time without having to get extra hardware approved. One core of our DB is now constantly used to feed real-time data to continually retrain the models, but that, plus a partial core doing the analysis, is all it takes to analyze billions in transactions per year in near real time and keep our predictions up to date and live on our portal.

I'm also not too worried that I could get better performance using NNs, as Bayesian methods are basically proven to be near optimal under some fairly reasonable assumptions, and the more automatic methods, by overfitting or latching onto artifacts of imperfect inputs, actually have worse performance, as the article points out.

Until you get to really high dimensional problems, this is a better approach imo.


With these partially hand built trees, are you using only 1 or 2 at inference time? Or is it still a full on large forest of 200 trees?

To elaborate: Are you coding up this bayesian logic as part of the training process, and then you run this code 200 times to end up with 200 trees? That code mixes the hand built rules with the normal approach to tree construction (random split location, random feature choice out of N features, etc)


In this case I ended up with two very simple trees which I train for each ATM (with a prior trained on the larger fleet). You could try a mix though. I didn't have time to try too many things.

The following step, which I didn't mention, was to use the trees for generating predictions into the future: little Monte Carlo simulations using one of the trees (selected using that relative entropy method; you could even use a weighted mix of both trees) that I run a few thousand times for each ATM to generate a forecasted range of when the ATM is going to run out of cash.


Our lab works on changing this. I think it might still take some years for a full solution, but so far we are successful with NNs for small datasets with meta-learning (https://arxiv.org/abs/2207.01848) and large datasets with regularisation (https://arxiv.org/abs/2106.11189). The first is very new; the second is also cited in this paper, but they didn't run it as a baseline.


Why not run your own method on their benchmark? If your method is general-purpose, it should be very easy. And it would be very impressive to see your method succeed on a task that was not chosen by you to write your paper.


Yes, good point! The paper I am responsible for is targeted at smaller datasets, but I will propose this to the authors of the second paper :)


Tree based algorithms have the advantage that global or robust optimization of them is possible. Do your approaches offer similar guarantees?


The first work I linked arguably gets around this issue completely. There we train a neural network to approximate Bayesian prediction directly. It accepts the dataset as input. Thus, there is no optimisation procedure per dataset, just a forward pass. So far, this seems to be pretty stable.


Just because you optimize network parameters to fit a surrogate model of the data doesn't mean you are not doing local optimization without guarantees on robustness or global optimality w.r.t. network parameters to fit the surrogate. Am I missing something?


Related, an article from 2019 [0] on how neural networks finally beat statistical models (e.g. ARIMA) in time-series forecasting.

[0] https://towardsdatascience.com/n-beats-beating-statistical-m...


I can only assume this was achieved years ago, as a black project in the financial market.


I had an econometrics professor who would have objected to calling ARIMA models statistics. He thought they were no better than data mining.


A great paper and an important result.

However, it omits to cite the highly relevant SRBench paper from 2021, which also carefully curates a suitable set of regression benchmarks and shows that Genetic Programming approaches also tend to be better than deep learning.

https://github.com/cavalab/srbench

cc u/optimalsolver


We use XGBoost as the core learner for reinforcement learning at https://improve.ai despite the popularity of neural networks in academia.

With tabular or nested data a human has already done a lot of work to organize that data in a machine friendly form - much of the feature engineering is performed by the data schema itself.


Your quick start link just links to the iOS SDK on GitHub. Perhaps things would make more sense if that were available, but right now no idea how to use your tool.

And it appears that there are some HTML docs in the Python repo, but they don't seem to be hosted anywhere. So it's not clear how to use it either.


Thanks so much for pointing this out. I’ll get that fixed!


Your app sounds amazing, as is the ethics model of it. Can’t wait to test it out!


Thank you. I’m surprised that anyone read that part. Thank you for noticing.


Does anyone see explainability as another good reason to use trees on tabular data, for which I think users would expect more digestible outputs?


The kinds of trees that come out of these algorithms are so huge they really aren’t any more interpretable than a NN.


Not exactly. These tree models are ensemble methods, meaning they comprise several trees. Each individual tree may be small, but it is difficult to pinpoint explanations when that tree is but one amongst a forest.


Yes, I've been looking at using decision trees for explaining models that are difficult to understand. Currently seeing useful results on real data sets. If you're interested, I've implemented parts of TREPAN [1] and it's very approachable. However, it's also important to have interpretable features, which is a whole other thing.

[1] https://research.cs.wisc.edu/machine-learning/shavlik-group/...


Wait until we see deep learning AI creating tree-based models. /s


Not actually that complicated of an approach! Hard to implement correctly, which is why it's not more popular right now.


Technically you can transform any finite / not runtime defined computation graph into a "tree" model.


See symbolic regression and genetic programming.


triggering co-pilot to use xgboost might count as one today lol?


It seems to me that one crucial difference between tabular data and images or text is that in the latter there is a huge amount of structure available. In text, all the words depend on all their neighbor words, and images tend to be well-approximated by low-rank things in one form or another.

Tabular data doesn't have any of that.


Exactly. I get why people like to compare things like NN's and trees because it's a good way to learn. But it doesn't take too much understanding to see that they both have strengths/weaknesses and are suitable for different problems.


I can't explain it, but I help maintain TensorFlow Decision Forests [1] and Yggdrasil Decision Forests [2], and in an AutoML system at work that trains models on lots of various users' data, decision forest models get selected as best (after AutoML tries various model types and hyperparameters) somewhere between 20% and 40% of the time, systematically. It's pretty interesting. Other ML types considered are NN, linear models (with auto feature crossings generation), and a couple of other variations.

[1] https://github.com/tensorflow/decision-forests
[2] https://github.com/google/yggdrasil-decision-forests


Super interesting! Do you know the kind of data that it's usually used for? And in the remaining 60% to 80%, do NNs account for a large portion of the best models?

Bonus question: are the stats you're mentioning publicly available?


Sadly (but correctly) nothing is public, no one ever sees any data, it's a service. Pure FNN (feedforward NN) models, if I recall correctly, are also ~30 to 40%.

Since the server doesn't work for all types of data, and folks who are experts in ML would probably do their own hyperparameter tuning and custom models, this leads to a bias in the types of datasets that compete.

But this share has been consistent over many months of various unrelated datasets, I believe.


I think one of the issues is that there are no pretrained universal spreadsheet models (that I'm aware of, granted I don't do much work with tabular data) equivalent to ImageNet-based models that you can use as a base and then transfer-learn on top of.


Because tabular data doesn't have enough complexity compared to language or images.


An important point is that it's an absolute pain in the ass to preprocess tabular data for neural networks.

Categorical > one-hot encoding > deal with new categories at test time (sklearn does this, but it's really slow and clunky)

Numerical > either figure out the data distribution for each column and normalize by that, or normalize everything by z-score. Found an outlier?? Oops, every feature collapsed to 0.

Can you do that for 10 features? Sure, now try it again with 500; it's not fun.

Ok, now that you've done all that you can begin training and possibly get some reasonable result.

Compare that with tree models: data>model>results
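For the record, the kind of plumbing that means in scikit-learn (column names here are placeholders), versus just handing the frame to a GBT:

    from sklearn.compose import ColumnTransformer
    from sklearn.neural_network import MLPRegressor
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import OneHotEncoder, QuantileTransformer

    num_cols = ["age", "income"]   # placeholders
    cat_cols = ["city", "plan"]

    prep = ColumnTransformer([
        # unseen categories at test time become all-zeros instead of crashing
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
        # squash skew/outliers instead of hand-tuning a scaler per column
        ("num", QuantileTransformer(output_distribution="normal"), num_cols),
    ])
    nn = make_pipeline(prep, MLPRegressor(hidden_layer_sizes=(64, 64)))
    # vs the tree route, roughly: LGBMRegressor().fit(X, y) and you're done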


You could encode categorical variables into embeddings; they are more natural to deal with that way.

https://tech.instacart.com/deep-learning-with-emojis-not-mat...
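A minimal PyTorch sketch of that idea (sizes arbitrary): each category index gets a learned dense vector instead of a one-hot column:

    import torch
    import torch.nn as nn

    n_categories, emb_dim, n_numeric = 1000, 16, 20

    class TabularNet(nn.Module):
        def __init__(self):
            super().__init__()
            self.emb = nn.Embedding(n_categories, emb_dim)   # learned per-category vectors
            self.mlp = nn.Sequential(
                nn.Linear(emb_dim + n_numeric, 64), nn.ReLU(), nn.Linear(64, 1))

        def forward(self, cat_idx, numeric):
            x = torch.cat([self.emb(cat_idx), numeric], dim=1)
            return self.mlp(x)

    model = TabularNet()
    out = model(torch.randint(0, n_categories, (32,)), torch.randn(32, n_numeric))
    print(out.shape)   # torch.Size([32, 1])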


It's all about features and data scale. A recommendation system itself is actually a large table, and DL methods have already proved effective there. Say you have text in your tabular data: a tree model (with a traditional method such as tf-idf) will do much worse than a transformer-based model. DL always suffers from inadequate data, so if there isn't enough data or inductive bias, a tree model can be a better choice.


Well, iirc a convenient trait of a random forest classifier is that it cannot overfit the training data. Something that's not exactly true for deep learning.


Any reference for this claim? In my opinion this is most certainly not the case - random forests are hard to overfit compared to gradient boosted trees but you can overfit with them too if you don't tune your parameters right.

Overfitting is generally a function of the size of your data and the complexity expressible in your model.


I think the origin of the claim is from Leo Breiman himself [0]. I’m not sure if he discusses this in any papers, but probably.

Sibling comments have some nuanced discussion of overfitting in tree-based models

0: https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home...


RFs can overfit but in practice usually won't.

I think Hastie & Tibshirani have an article on this.


My reference would be my machine learning prof mentioning it at the lectures a few years ago, something about decorrelation as the other guy in the replies mentions. It's possible he meant hard to overfit in comparison to something, not completely impossible.


A random forest is an ensemble model - i.e. lots of instances of a model, all separately fitted to the data, with some randomness so they're all different (e.g. bootstrapping, random subset of the data used to train etc). These models are then all used to classify, and their results averaged. This averaging will work to mitigate any overfitting that a single model might suffer from.


If you keep adding more and more trees to a random forest, variance will go down (they stop being 'random' and become more like the average over all possible trees). In this respect they can't overfit, but in practice there are other tweaks you can make (such as tree depth and number of features) that you often want to adjust to increase model performance. These can certainly cause overfitting.


I'm curious about this claim. Feels like any model can overfit.


A model can only overfit if it has enough parameters. But indeed, any model with enough parameters can overfit.

The thing about deep learning is that the number of parameters is astronomical.


Hence, the interesting question is why deep learning models don't overfit all the time.


This is from 2019 so I suspect there’s a more recent paper now, but this is interesting: https://openai.com/blog/deep-double-descent/


Agreed, but it doesn't really explain it in any real sense.


Work is ongoing, this is a whole field of research now, a good keyword to search for is "implicit regularization".


“However, in the over-parameterized regime, there are many models that fit the train set and there exist such good models.”

One explanation is that a very large model effectively fits several models and penalises over-fit models itself.


I really like how AI research is all about empirical analysis of mathematics.


The answer is really very simple: because we use training and test sets, and generally use iterative training methods. As soon as test set error plateaus or starts to degrade, you stop training. This is called early stopping and can be considered a form of regularization. In theory it would seem a neural network (with early stopping "correctly" implemented) can never overfit, although I suppose in practice it can vary a little (because of test variance, too large step sizes, etc.). It's possible to train huge overparametrized networks this way.

One big problem I can see with this approach (maybe compared to other forms of regularization) is that it might be very space/memory inefficient. But if you've got a big budget I think it's a surefire way to make sure your model performs as well as possible (without having to tune hyperparameters too much or deal with local minima that probably happen at near-capacity networks).
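A bare-bones sketch of that loop (toy data, generic architecture, patience-based stopping):

    import copy
    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    X = torch.randn(2000, 20)
    y = X @ torch.randn(20, 1) + 0.5 * torch.randn(2000, 1)
    X_tr, y_tr, X_va, y_va = X[:1500], y[:1500], X[1500:], y[1500:]

    model = nn.Sequential(nn.Linear(20, 256), nn.ReLU(), nn.Linear(256, 1))
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    best, best_state, patience, bad = float("inf"), None, 10, 0

    for epoch in range(1000):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(X_tr), y_tr)
        loss.backward()
        opt.step()

        with torch.no_grad():
            val = nn.functional.mse_loss(model(X_va), y_va).item()
        if val < best:
            best, best_state, bad = val, copy.deepcopy(model.state_dict()), 0
        else:
            bad += 1
            if bad >= patience:   # held-out error stopped improving: stop and roll back
                break

    model.load_state_dict(best_state)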


Random Forests can overfit, but with sufficient data (and no data leakage) it is difficult to do so.

The final output of a random forest regression for example is just the average of all the final prediction nodes of the individual trees, which thanks to bootstrapping are decorrelated. So - adding more trees does not tend to alter predictions very much.


With algorithms that, for each additional input entry, either add a predictable amount of complexity to improve fitness (e.g. another tree split) or decide that the model is already fit enough, one needs to do something deliberate or very stupid to overfit the data, compared to neural networks, where overfitting is hard to detect, let alone prevent quantitatively, and which are sometimes designed to overfit in order to interpolate "intelligently" between samples.


Random forests can overfit, but they naturally have fewer parameters than nets, so don't overfit as a function of training time.

Neural nets have far more parameters and so are susceptible to overfitting with more training time.


If you train your ensemble long enough, won't it overfit too?


No, it's not sequential. For a given number of bootstrap replicates you can overfit by using trees that are too complex, but you won't overfit in the number of bootstrap replicates like you would with Gradient Boosted Trees.


Can you elaborate? I am unfamiliar with gradient boosted trees.


I think the idea is that gbt's are trained sequentially on the "residuals" left from the previous tree, while rf's are uncorrelated (can be trained in parallel, don't depend on one another). This means that in rf's there's no "compounding" effect that can lead to overfitting. The main way to overfit rf's would be to train many large trees.
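A bare-bones sketch of that difference for the squared-error case (nothing like the real implementations):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))
    y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=1000)

    # gradient boosting, roughly: each tree fits what the previous ones got wrong
    pred, lr = np.zeros_like(y), 0.1
    for _ in range(100):
        residual = y - pred                 # "what's left over"
        t = DecisionTreeRegressor(max_depth=3).fit(X, residual)
        pred += lr * t.predict(X)           # errors compound -> more rounds can overfit

    # random forest, roughly: independent trees on bootstrap samples, then average
    forest_preds = []
    for _ in range(100):
        idx = rng.integers(0, len(X), size=len(X))
        t = DecisionTreeRegressor(max_features="sqrt").fit(X[idx], y[idx])
        forest_preds.append(t.predict(X))
    rf_pred = np.mean(forest_preds, axis=0)  # more trees only averages out variance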


What is tabular data precisely?

I can represent an image as a table of RGB values. I can represent hierarchical data as a table of unnested values.


I've mostly come to the conclusion that DL is not going to beat tree-based models, especially when it comes to training speed.

What I am now looking at is whether it's possible to feed in world knowledge from models like BERT as features to XGBoost, but without resorting to creating templates for table rows.


The answer is extremely straightforward: less computation.

That’s less computation when traversing nodes in a set which can result in either an exponential or logarithmic difference depending upon the data and means of traversal. This holds true almost universally.


Are the tree based models being provided with data that's "kind of like" the data they were designed for/at training time for the deep learning model? Or "exactly like"?

I think that's perhaps the difference?


That should only be the case when the variables are conditionally independent?


I think K nearest neighbor is an underrated learning model.

Assuming you can define a good distance measurement, how cool is it to use every piece of data? And with no training.

I wonder how they compare against tree based models on tabular data just in accuracy.
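Something like this would be a quick way to check (a rough sketch, not a real benchmark; California housing from sklearn, with a crude scaler so KNN's distances make sense):

    from sklearn.datasets import fetch_california_housing
    from sklearn.ensemble import HistGradientBoostingRegressor
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = fetch_california_housing(return_X_y=True)
    knn = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=10))
    gbt = HistGradientBoostingRegressor()

    print("knn r2:", cross_val_score(knn, X, y, cv=5).mean())
    print("gbt r2:", cross_val_score(gbt, X, y, cv=5).mean())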


I like using KNN as much as the next person but tabular data can be quite wide, and KNN suffers from the curse of dimensionality.


The next course of action would be to split your distance calculation into multiple steps - a few dimensions at a time.

Then we need to decide on an order to these comparisons - what about using entropy derived from the training data?

Then as a further improvement, what if the order need not be constant, depending on which group the data point is closer to? And what if not all dimensions need to be considered?

Before you know it, woah! You've reinvented trees!

This comment is meant as a joke


That’s a great way to explain it though.


why would anyone assume deep learning to beat everything at everything, even applied relatively naively?


I didn't get to what extent they compared capability of these methods for extrapolation, which tree-based methods are not really suited for.


Deep learning doesn't actually work unless you're doing vision or NLP.


And RL, graphs, audio/speech, approximating simulations (weather), etc.


Ok true I forgot speech recognition


What an inane comment.


Tree based models seem like obvious approaches that should just work. It's more or less how a human would parse a problem. Throw some X and O's on a 2d plot. Draw lines to partition it into X sections and O sections. That's a tree based model.


Now you can contact the authors and tell them to replace the publication by your 4 lines.

And tree-based models are not what you described; at most that could be a tree classifier.

EDIT: First of all, sorry for my comment, I think you did not like it. My point is that what you state as the obvious way tree-based methods work is most probably not what makes them powerful; rather, it's the fact that you have a bunch of them. To me, if there is some intuition, statistics is more prevalent than separating spaces with hyperplanes.

Here you can see the boundaries of a random forest compared to a single tree; the more dimensions you add, the more blurry and unintuitive in terms of hyperplanes it gets:

https://scikit-learn.org/stable/_images/sphx_glr_plot_classi...


I did not like the comment. It is needlessly aggressive and implies I have the arrogant view that my post is better than the author's paper. I do not.

I also suspect your hypothesis is not correct. The top row of your link is a good example. The NN tries to infer a gradient, but that's really tough to do with limited data. That is to say, the tree-based models will locally fit their partitions to the exact training data, while the NN will try to view it in the big picture, filling in the gaps. Tree-based models work better for most real world tabular data.

The paper concludes something similar:

> This superiority is explained by specific features of tabular data: irregular patterns in the target function, uninformative features, and non rotationally-invariant data where linear combinations of features misrepresent the information.

I daresay my perspective is better aligned with the paper than yours. Are YOU trying to replace the publication with your 4 lines?


Totally agree, and sorry again for my wording, as I probably misunderstood what you meant.

And no, I am trying to replace what I consider a wrong intuition (tree methods are strong models mostly because single trees separate data with hyperplanes) with my 4 lines. This is just my opinion.


To be fair, "obvious" doesn't mean correct or proven. Even if their results boil down to "the obvious method is better than the more complicated method", it is still worth a paper.




So it sounds like you’d need your 4 lines of pseudo-code from your first post and then another few lines of comments to describe how your 4 lines replace the authors’ publication? Something like “All tree models are partitions of the space based on training data, and the extension to any particular tree model is an obvious exercise left to the reader”?


Why have 2 separate people asserted I have an issue with the authors' publication?

My point is that tree based models are models that partition the space. This is not obvious. But the concept of partitioning the space is a very obvious thing to try because it just makes visual sense. Take a look at the example linked above: https://scikit-learn.org/stable/_images/sphx_glr_plot_classi...

All models indirectly partition the space. Tree based models partition the space explicitly. Other model types do so implicitly.


Patience will be necessary; a bad reading of your meaning will be a bit sticky because it allows others to pick up computation where others left off. In the short term it looks like misunderstanding. Over time it means a great deal more thoughtfulness than any one person would have been able to give.


You are saying it sounds as if his explicit denial is what he was saying? You're like a kid that hears another kid getting upset at a mispronunciation of their name. "My name is John," he whines. "Sounds like you are Joonnna," you taunt. It is an exceedingly childish thing to do. Maybe others here aren't mature enough to realize you are being a bully, but I'm not stupid enough to think that "not X" said with emotional force means something "sounds like X". I'd also rather give this response than make spyware go through the humiliation of repeating himself. He got downvoted because he cussed, not because intentionally distorting what people say is funny.


I actually agree with this take! Deep learning models shine in problem domains where the gap between the data representation and the way humans parse it is large. In images, for example, we just have an RGB matrix, but what human vision does is very complex. So the representation learning done by deep learning models tries to close the gap. For a lot of tabular data, humans don't go from the raw representation to a mental model quickly. We in fact follow a model similar to a tree-based classifier: look at how a variable splits the output, compare averages across groups, etc.


Why would a human do that, and why would it work?

How would that work for effectively finding the boundary of an ellipse or a spiral shaped set?



