Deep learning really shines when the input is raw and at a very low abstraction level: pixels, byte-pair encodings, etc. Using deep learning for classification on tabular data is just needless complexity, as the variables are often at a very high abstraction level already. Also, with tabular data there are generally not many spatial or temporal relationships between the variables, which is what CNNs and transformers excel at.
Also, images and text have tons of recurring patterns that can be exploited to train big models with lots of data. There is an internet's worth of each modality, which can, at least in general, all contribute to helping a model build up a better overall understanding.
There is no analog for tabular data, it's all different.
I wonder if tree-based models would outperform in situations concerning source code as well, given it's already quite structured. Going a step further and supplying an AST may be even more beneficial.
Yes, what I meant was deep learning is great at deriving those higher level abstractions from low level raw data. Words can be seen as something in between, bag of words can be fairly effective at simpler tasks, but LLMs embed words into higher and higher abstractions.
I'm not sure this is surprising. Say you were to glue together 10 datasets with the same 10 explanatory features and 1 response feature, but distributed very differently from each other. This would be no problem for a tree-based model, because it will conditionalise indefinitely to get a good fit. If the number of records is relatively small (say 10k), the dataset will be much too sparse for an NN to learn these discontinuities -- it's like it has 1,000 records per segment.
Similarly, tabular data is often of this nature. It's not i.i.d.; it tends to cluster.
I still don't get the impetus or desire to make NNs work better for tabular data. Regression works pretty well and is easy to interpret/diagnose/work with. GBMs work really well (given a few considerations) and are trickier to work with, but nothing crazy. When I see all the fancy hijinks people get up to when applying NNs to audio/text/pictures I think it's really cool, but also not something I'd want to have to do if I didn't absolutely need to when working with data out of a relational db. And anyway, how much of a benefit could it actually bring? GBMs are already capable of fitting and dramatically overfitting most datasets.
The paper offers a reason why NNs working for tabular data would be good:
>Creating tabular-specific deep learning architectures is a very active area of research (see section 2) given that tree-based models are not differentiable, and thus cannot be easily composed and jointly trained with other deep learning blocks.
Here is a second reason, from the paper
>Impressed by the superiority of tree-based models on tabular data, we strive to understand which inductive biases make them well-suited for these data.
which is a great reason, because understanding the inductive biases of different learning/regression techniques gets us closer to a more general understanding of how to encode inductive biases in a generic learning algorithm.
My hypothesis is decision trees are more robust to nonstationary distributions. If the variance and means of the features shift dramatically, the model isn't going to blow up, because it's not additive.
In the domains where NNs work well (image processing and language), you're dealing with a predictable and stable distribution of values. Elephants might look a bit different in the train and test set, but you're not randomly getting 100x the variance of the input data. The decision tree just isn't going to care as much, because splits around the mean will lead to the same outcome.
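A quick way to see this hypothesis concretely (synthetic data, illustrative only; the variable names and the 100x factor are just assumptions for the demo):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_train = rng.normal(0, 1, size=(5000, 3))
y_train = X_train[:, 0] + 0.5 * X_train[:, 1] + rng.normal(0, 0.1, 5000)

tree = GradientBoostingRegressor().fit(X_train, y_train)
lin = LinearRegression().fit(X_train, y_train)

# Test distribution with 100x the training standard deviation.
X_test = rng.normal(0, 100, size=(1000, 3))

# The tree ensemble's outputs stay inside the range of leaf values it learned,
# while the additive model's outputs blow up along with the inputs.
print(tree.predict(X_test).min(), tree.predict(X_test).max())
print(lin.predict(X_test).min(), lin.predict(X_test).max())
```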
Another hypothesis is that zooming in on bivariate relationships is more important in tabular data. Neural nets are better at local and global context, but they struggle if all that matters is the relationship between two columns of data, because of their additive nature. Large networks can figure it out due to model capacity, but then you'll run into overfitting.
In case anyone's sufficiently motivated (no promises, but I might test it out eventually), a couple deep architectures that might address those concerns are:
1. Something like a deep support vector machine. Instead of (linear) -> (any activation), you want to create a bunch of features that look like testing the vector against a splitting hyperplane. One option is (bias) -> (matmul) -> (1-bit sigmoid); there's a rough sketch of this after the list. Applying a bias term _for each row_ lets you choose the branch location, and the matmul's result will be positive or negative at each output feature depending on which side of the hyperplane (normal to the vector described by the corresponding row) you happen to fall on. Then just bring that down to -1 or 1 so you can't sneak much nonstationary drift variance into the output (perhaps train with a normal sigmoid annealed to behave more like this one, and a suitable regularizing term to keep the network from sneaking in values near 0 to thwart your annealing).
2. Use an attention-like mechanism, but across features (this would likely require an additional tensor channel, so that each "feature" carries information in a high enough dimensional space for this to do something meaningful). You apply the inductive bias that sparse feature interactions are important and need to be discovered.
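For idea 1, here's a minimal PyTorch sketch of one possible reading of (bias) -> (matmul) -> (1-bit sigmoid); the tanh-with-temperature annealing, the regularizer, and the layer sizes are my assumptions, not anything tested:

```python
import torch
import torch.nn as nn

class SplitLayer(nn.Module):
    """Each output unit behaves like 'which side of a splitting hyperplane is this row on?'."""
    def __init__(self, in_features, out_features):
        super().__init__()
        # nn.Linear's per-output bias plays the role of the branch location
        # (equivalent, up to reparameterization, to shifting the input per hyperplane).
        self.linear = nn.Linear(in_features, out_features)
        self.temperature = 1.0  # anneal toward 0 during training for harder, more split-like outputs

    def forward(self, x):
        z = self.linear(x)
        out = torch.tanh(z / self.temperature)  # approaches sign(z), i.e. a "1-bit sigmoid"
        # A regularizer like (1 - out.pow(2)).mean() would penalize values lingering near 0.
        return out

# A small "forest-flavoured" network for tabular inputs with 20 features.
model = nn.Sequential(SplitLayer(20, 64), SplitLayer(64, 64), nn.Linear(64, 1))
```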
There is no such thing as "best possible model, full stop". Models are always context dependent, have implicit or explicit assumptions about what is signal and what is noise, have different performance characteristics in training or execution. Choosing the "best" model for your task is a form of hyperparameter optimization in itself.
Plenty of places use DL models, even if it's just a component of their stack. I would guess that gradient-boosted trees are more common in applications, though.
Still mostly NLP and image stuff. Most actual data in the wild is tabular, for which GBTs are usually some combination of better and easier. In some circumstances, NNs can still work well on tabular problems with the right feature engineering or model stacking.
They are also more attractive for streaming data. Tree-based models can't learn incrementally. They have to be retrained from scratch each time.
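A minimal sketch of what incremental updating looks like in practice (the batch stream is a stand-in; scikit-learn's partial_fit is just one concrete API for it):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier  # any estimator exposing partial_fit

def stream_of_batches():
    """Stand-in for a real stream (Kafka consumer, log tailer, ...)."""
    rng = np.random.default_rng(0)
    for _ in range(100):
        X = rng.normal(size=(32, 10))
        y = (X[:, 0] > 0).astype(int)
        yield X, y

model = SGDClassifier()
classes = np.array([0, 1])
for X_batch, y_batch in stream_of_batches():
    # The model is updated in place on each incoming batch; a gradient-boosted
    # tree would typically have to be retrained on the full accumulated history.
    model.partial_fit(X_batch, y_batch, classes=classes)
```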
ML is very good at figuring out stuff like: every day at 22:00 this asset goes up if another asset is not at its daily maximum and the volatility of the market is low.
You might call this overfitting/noise/.... but if you do it carefully it's profitable.
Real-time parsing of incoming news events and live scanning of internet news sites - coupled with sentiment analysis. Latency is an interesting challenge in that space.
Multiple parts of the iPhone stack run DL models locally on your phone. They even added hardware acceleration for the camera, because most of the picture-quality upgrades are software rather than hardware.
At this point I wish every junior DS could read this paper and not come in to every problem with the new bright idea that they’re going to beat XGBoost with their DL architecture. Free promotion if they never say the words “latent subspace”
When working with tabular data, there are very few situations where absolute model performance is the only criteria that's important. In practice, the following are equally as important:
- Explainability / debug-ability of models
- Effort to train, deploy, and manage NN models in production
- Capturing, collating, and organizing new & better datasets
- Local developer experience and human-model-iteration time
Building all of your software in C or assembly will be faster and more performant. But at what cost and with what tradeoffs? Building a website has a different set of tradeoffs than building a program for the Mars rover.
It's funny; as a regular non-ML programmer, the optimum for every one of those factors for "tabular data" would seem to me, to be to "throw the tabular data into a relational data warehouse, and ask your questions in the form of SQL queries."
Or, if the "tabular data" is heavily relationship-based, then possibly replace "relational data warehouse" with "graph database", and "SQL queries" with whatever querying language that graph DB is natively / most expressively queried in.
Of course, this is the most important implicit "equally important" factor, one that an ML dev would think goes without mentioning: the generality or "power" of the model in what questions it can answer. You can only make these trade-offs in the context of knowing what kinds of questions you want your model to solve for! If all your questions are quantitative ones, maybe the right "model" for you is an RDBMS!
---
Though, that being said... why can't a deep-learning model emulate the thing that an RDBMS does, "at runtime", as part of its "mental toolkit" for approaching problems? That would be the best of both worlds, no?
I know that LLMs in particular have been observed to have "emergent numeracy" above a certain training-set size. There is a step function in how they approach such problems, going from their only being able to answer arithmetic questions on numbers of bounded size, and sometimes getting the answers wrong (probably this is due to a memorization-based approach); to being able to answer arbitrary arithmetic questions on operands of unbounded size, and always getting the answer correct.
I would guess that what's happening is that they are developing a functional component of their network that works akin to an Arithmetic Logic Unit, operating not on tokens, but on tokens transformed into a "numeric register" representation that is amenable to having math done to it with stable, quantized, position-independent results. (Just like the functional component that human brains develop after seeing enough math problems... probably.)
Do you, as an ML dev, think it would ever be possible for any of the model architectures we're familiar with today, to be trained such that they would develop an analogous emergent functional component for handling tabular-data questions, by transforming its internal working state into relational-DB/graph-DB data structures — e.g. page-heaps of binary-packed row-tuples; B-tree indices; etc — and then manipulating the working state in that form, using learned algorithms applicable to that type of data?
It seems to me (possibly just because I don't know any better) that just as with numeracy, "being able to put the data into a different and better internal representation" is what would be needed for deep-learning models to become truly good at dealing with tabular-data problems.
But, unlike with numeracy, "thinking as if you were a relational database" is not something a single human would ever intuit how to do without being taught. Relational algebra — and the data-structures and algorithms to make it practical to have a Turing machine do said relational algebra — wasn't even a single intuition, but a conscious effort, of multiple humans, working together over years. I strongly doubt that there's any number of "tabular-data problems" that you could show a human being, that would result in them developing an intuitional ability to do what a relational database does with its memory to efficiently answer queries.
(I suppose we could give an ML model an RDBMS, and hardwire it to interact with it. I know there are hybrid ML + formal-logic systems. Are there hybrid ML + data-warehouse systems? Not where the model queries an external DB — while that can be done, it'd be only in the same "stop and do this" way that ChatGPT runs Python code, which wouldn't make it a thinking tool the way that the formal-logic proof engines are for hybrid ML systems. Rather, I mean that some data-warehouse execution engine could be embedded into the ML execution framework itself, deployed as part of the GPU shader-program to each tensor core, such that data-warehouse operations can be done as a native part of the network's per-node instruction-set. Anyone ever tried this?)
> It's funny; as a regular non-ML programmer, the optimum for every one of those factors for "tabular data" would seem to me, to be to "throw the tabular data into a relational data warehouse, and ask your questions in the form of SQL queries."
It's doubly funny; as someone that comes from an ML background, and has developed and maintained multiple ML systems at multiple orgs, that I also think the answer very often is, "throw the tabular data into a relational data warehouse, and ask your questions in the form of SQL queries."
model.predict() is pretty easy to call, but judging the validity of the model is still hard and very manual. A linear model is less flexible and powerful, but easier to analyse/validate.
Begging the question again, as there is no evidence that "intelligence" runs on neurons. (And plenty of evidence that "intelligence" can exist without neurons.)
Single-celled organisms are pretty intelligent, and they have zero neurons.
Meanwhile there is no evidence at all that intelligence runs on neurons except that "brains contain lots of neurons", which is a logical fallacy because brains contain lots of other spurious things too.
You already made a faulty assumption — that we're interested in "classifying the data" in the first place.
Maybe we already know everything about the dataset. For example, if it's line-of-business customer data gradually built up by a sales team, then the brains of the salespeople have likely already done all the "implicit classification" needed to generate good questions about the dataset.
And this is, by far, the usual scenario for Business Intelligence questions: someone with "business-domain knowledge", e.g. an executive, has formed an intuitional hypothesis about the data based on their personal experience; and so they ask someone with "data-domain knowledge", e.g. a business analyst or data scientist, to test that hypothesis.
It's actually rare, in my experience, to have a tabular-data dataset that someone is motivated to understand, that doesn't also "come with" a set of people who can already act as (good!) models trained on that dataset, to aid them in that understanding. (Sometimes these people can't find each other — but they do usually exist.)
AFAIK, having reams of entirely opaque and ill-understood tabular data, such that you need classification/clustering to get started on asking questions, only really happens in the sciences: sensor-network climate data; longitudinal-study medical-outcome data; census data; housing-market data; etc. In other words, it's almost always universities and governments — not businesses — that care about analyzing opaque tabular data.
And that's a key to understanding the constraints in play for choosing models! Because business-driven analyses are usually time-constrained in some way (potentially even needing post-training question-answers to be generated in soft-realtime); while institutional analyses usually aren't. Big difference!
I might be misunderstanding your point, but there are use cases that have repeatedly come up for me in multiple businesses; below are some examples, without getting too specific:
- identify latent features of customers via their behavioral data, to be used for profiling customers or recommending products to them
- within a large amount of customer behavioral data, identify potentially fraudulent behavior
- identify causes of seasonality (e.g. temporal patterns) in the data in order to improve forecasting (sales, traffic, whatever)
In those cases part of the investigation is to initially take a hands-off (unsupervised) approach, so that we can compare our initial top-down hypotheses with actual patterns in the data.
In both of those cases there's considerable (and sometimes adversarial) noise in the data.
Answers to these questions are actually Bayesian statistical models ("what is the probability of Y given a high likelihood of X"). Treating these problems as unsupervised classification might work, but that's a very crude way of approaching them.
I wouldn't say it's a crude way of approaching the problem; it's a crude way of solving the problem. Taking the fraud example, using unsupervised approaches to understand patterns in the data before you impose assumptions on it is a very useful process. For example, what might be fraudulent behaviors in the first place, assuming you aren't even sure you know what fraud looks like, or that it's actually all been detected? Your goal there might be to detect latent features, period, not to look at their predictive power for X.
Having understood that question, and built an understanding of what predicts fraud, you would then graduate to build models to understand the extent to which features predict fraudulence.
My point in context of the conversation is that it's useful in a business context to explore and understand that data.
I'm really not clear on why you're arguing against this. A proper data warehouse tackles the known unknowns, i.e. supervised learning. But you can glean new insights using unsupervised learning, like the textbook example of Target knowing a woman is pregnant based on sales data.
>You already made a faulty assumption — that we're interested in "classifying the data" in the first place.
It's not clear what your point is. If you're not interested in the predictions that tree-based models provide, do not use tree-based models on your tabular data. A predictive model and a SQL query are not the same thing.
Data teams in companies often aim to enable the answering of future questions nobody has asked yet, by creating denormalizations of their data that offer maximum flexibility in what classes of questions they can answer. Maximum "power."
Lately, that means they're often spending a lot of resources (and even novel R&D time!) getting various kinds of ML models trained on the data.
My point is that this is often pointless, because, given the type of data they're working with (tabular, quantitative line-of-business data), they won't actually see "arbitrary questions"; they'll see the strict subset of arbitrary questions that could have been solved just as well — if not much better! — with a SQL query. And for much less capital expenditure — because the LOB data usually already lives in an RDBMS in the first place.
For those curious about what we have been up to on the topic of tabular learning, we have found a setting where deep learning does seem to bring sizable benefits (spoiler alert, it's about being able to pre-train, and transferring to new data works best when there are some strings to be recognized): https://arxiv.org/abs/2402.16785
In the above work, pre-trained tabular models markedly outperform tree-based models (including catboost, which is a very strong baseline).
As someone who has been banging on tabular data for years, I'm really excited about this development.
Paper seems interesting but I don't like the question title. I think the answer to the question would just be that tabular data is not fully in the "big data" regime yet so there is no reason a priori to expect deep NNs to do better. Factor in computational simplicity of tree-based models and I think the deck is stacked against deep learning from the start.
I've worked on models trained on ultra-large tabular data. It still took substantial effort to beat tree models (custom architecture specifically for this particular domain, something I haven't seen elsewhere out in the open).
When tabular data is mentioned, one of the unspoken applications is finance. There, my guess is that one of the issues is that data is not very IID and thus latent "events" are fairly sparse. Combine that with the humongous amount of raw data, and you get models that overfit.
I think there are certain types of tabular data that lend themselves naturally to tree models. But when you're talking about tabular data for finance I guarantee you very few hedge funds are running tree models for trading strategies. When your scale of data is the past X quarters of all stock prices and trade volumes you have enough data that you can fit an NN and there are a number of techniques you can use to reduce overfitting (large amount of data, good regularization, dropout, etc.)
> But when you're talking about tabular data for finance I guarantee you very few hedge funds are running tree models for trading strategies
What do you base this on? Having only neural nets on tabular data is mostly done due to laziness of the creator since neural nets are much easier to use, not because neural nets perform better even with large amounts of data. In general you want both since they are good at finding different kinds of patterns.
Do you know of any (families of) examples of tabular datasets of any size (you can choose what "big" means) where deep learning convincingly outperforms traditional methods? I would love some quality examples of this nature to use in my teaching.
Regression targets where extrapolation may be needed. Decision tree methods cannot extrapolate; their predictions have to be a mean of a subgroup of the training data.
Consider: Predicting how much a customer might pay by end of month, with information we have at the start of the month.
In this example, if a customer had a record $10m of open invoices due by EoM and the largest payment received in prior months was $5m, the decision tree cannot possibly predict that the payment amount will be ~$10m, even when the best feature indicates the payment will be $10m.
There are some hacks/techniques which can maybe reduce this issue, but they don't always work.
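For concreteness, a toy version of the invoice example (the numbers and the 0.95 payment rate are made up):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
open_invoices = rng.uniform(0, 5, size=(2000, 1))             # feature, in $m
paid = open_invoices[:, 0] * 0.95 + rng.normal(0, 0.1, 2000)  # target roughly tracks the feature

tree = GradientBoostingRegressor().fit(open_invoices, paid)
lin = LinearRegression().fit(open_invoices, paid)

# A new customer with $10m of open invoices, double anything seen in training:
print(tree.predict([[10.0]]))  # capped near the largest training target (~$4.8m)
print(lin.predict([[10.0]]))   # ~$9.5m, extrapolating the linear trend
```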
What? Can you explain the mechanism by which an NN can “extrapolate” an invoice where a tree model couldn’t? This is all just how the modeler builds the features.
Also all models are a “mean of the subgroup of the data.” The prediction is by definition the conditional mean as a function of the input values.
Recommendation engines: search, feeds (tiktok / youtube shorts / etc), ads, netflix suggestions, doordash suggestions, etc etc. Also happens to be my specialty.
I worked with search and ads models at Google, and for most things tree models were better. What evidence do you have that neural nets are better there? I worked with large parts of Google search ranking, so I know what I'm talking about: some parts you want a neural net, but most of the work is done by tree models and similar, which both perform better and run faster.
I'm not sure that is true. I think inference speed is often the bottleneck for the use cases stated, as is the need for frequent re-training. As a result algorithms like catboost are very popular in those domains. I think catboost was actually invented by Yandex.
PS: It's weird that you are being down-voted. I think your opinion is reasonable.
Inference speed: more sophisticated stacks use multiple stages. Early stage might be a sublinear vector search, and the heavy hitting neural nets only rerank the remainder. Bytedance has a paper on their fairly fancy sublinear approach.
Retraining - online training solves this for the most part.
Frameworks - the only battle-tested, batteries-included one I've seen is Vespa. No one else publishes any of the interesting bits. KDD is the most relevant conference if you're interested in the field. IIRC Xiaohongshu has some papers that can only really be done with NNs.
Since this is from 2022, I’m wondering how “tabular foundation models” could change this. The incredible success of DL we see at the moment comes partially from foundation models learning an “understanding” of the behavior from a lot of “semi-related” data. Something similar has been explored for tabular data as well, IIRC.
So I would be curious to see latest DL results.
On the other hand, it is also the case that in most settings where DL based on foundation models is used, specific heavily tuned models outperform the generalist models. And for tabular data there is a lot of experience in how to make it great with tree-based models.
What would these tabular foundation models look like? LLMs work as foundation models because the input is fixed in format (a sequence of text). Would the model be for a specific fixed tabular format?
One promising approach is to encode each feature key and feature value as embedding vectors, concatenate them into "feature tokens", then feed them into a Transformer (without positional encodings). This takes advantage of column-order invariance. See:
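A rough sketch of that idea (dimensions and tokenizer details are my assumptions, loosely in the spirit of FT-Transformer-style tokenizers):

```python
import torch
import torch.nn as nn

class FeatureTokenizer(nn.Module):
    """One token per column: a learned 'key' vector for the column identity
    plus the scalar cell value scaled into the same d_model space."""
    def __init__(self, n_features, d_model):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(n_features, d_model) * 0.02)
        self.value_scales = nn.Parameter(torch.randn(n_features, d_model) * 0.02)

    def forward(self, x):  # x: (batch, n_features), numeric features
        return self.keys + x.unsqueeze(-1) * self.value_scales  # (batch, n_features, d_model)

n_features, d_model = 12, 64
tokens = FeatureTokenizer(n_features, d_model)(torch.randn(8, n_features))
layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)
out = encoder(tokens)  # no positional encoding, so the column order carries no information
# Pool over the feature axis (e.g. mean) and attach a small head for the actual prediction.
```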
A calculator outperforms deep learning on basic arithmetic tasks.
Tree-based models are extremely good at finding clustering patterns; they outperform trained humans at that, thus we have commercial applications such as fraud detection.
Deep learning is the most promising way of getting us to general intelligence. The only known general intelligence so far, human intelligence, has many quirks at specific tasks, and I think deep learning won't be any different. However, deep learning models can recognise their own weaknesses and call a tree-based model if they think that's appropriate.
It's very important to note that this is from 2022. I'm not saying it's not true today but neural models have gotten much better in 2 years.
(I'm personally using NN models for predicting certain values for tabularly structured data and at least for my case, the NN works better than state-of-the art tree models.)
There has been some work on training on lots of different data sets and then specializing on the one you care about. But I think people were trying that approach pre-2022 as well.
Sorry, I don't have references off the top of my head. I just recall coming across it while I was working on something related to timeseries forecasting.
Tooling around embeddings has improved. Creating and fine-tuning custom embeddings for your tabular data should be easier and more powerful these days.
Not the parent, but NNs typically work better when you can't linearize your data. For classification, that means a space in which hyperplanes separate classes, and for regression a space in which a linear approximation is good.
That doesn't look immediately linearly separable, but since it is 2D we have the insight that parameterizing by radius would do the trick. Now try doing that in 1000 dimensions. Sometimes you can, sometimes you can't or don't want to bother.
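For concreteness (using concentric rings as a stand-in for the kind of 2D example meant here):

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression

X, y = make_circles(n_samples=1000, factor=0.3, noise=0.05, random_state=0)

# Raw (x, y) coordinates: no separating hyperplane exists, accuracy near chance.
print(LogisticRegression().fit(X, y).score(X, y))

# Add the radius as a feature and a linear classifier separates the rings easily.
X_lifted = np.hstack([X, np.linalg.norm(X, axis=1, keepdims=True)])
print(LogisticRegression().fit(X_lifted, y).score(X_lifted, y))
```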
Note that if linear separability is the only issue, you can just use kernel methods. In fact, Gaussian processes are equivalent to a single-hidden-layer neural network with infinitely many hidden units.
The magic of deep neural networks comes from modeling complicated conditional probability distributions, which lets you do generative magic but isn't going to give you significantly better results than ensemble kNN when you're discriminating and the conditional distribution is low variance. Ensemble methods are like a form of regularization, and they also act as a weak bootstrap to better model population variance, so it's no surprise that when they're capable of modeling the domain, they perform better than an unregularized, un-bootstrapped neural network model. There are still tons of situations where ensemble methods can't model the domain, and if you incorporated regularization and bootstrapping into a discriminative NN model it would probably perform equivalently to the ensemble model.
That's an advantage over linear models, but GBTs handle non linearly-separated data just fine. Each individual tree can represent an arbitrary piecewise-constant function given enough depth, and then each tree in turn tries to minimize the loss on the residual of the previous trees. As such, they're effectively like a neural network with two hidden layers in terms of expressiveness.
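A quick check of that claim on the same kind of toy data as above (illustrative only):

```python
from sklearn.datasets import make_circles
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_circles(n_samples=1000, factor=0.3, noise=0.05, random_state=0)
# Axis-aligned splits carve out the inner ring without any feature engineering.
print(GradientBoostingClassifier().fit(X, y).score(X, y))  # close to 1.0
```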
This explanation doesn’t make sense to me. What do you mean by “linearize your data”—tree methods assume no linear form and are not even monotonically constrained. Classification is not done by plane-drawing but by probability estimation + cost function
I assume it's because there are some very complex relationships and patterns that cannot be captured by decision trees. Tree models work better on simpler data at least that is my gut feeling based on previous experiments with similar data.
Interesting. Usually I have better luck with xgboost for tabular data, even when the relationships are complex (which usually means deeper trees). It does fall flat a lot of the time for very high dimensions, though. All data is different, I guess.
There is some work on zero-shot (decoder-only) time-series prediction by Google and an open-source variant. Curious to see how these approaches stack up as they are explored.
> deep learning architectures have been crafted to create inductive biases matching invariances and spatial dependencies of the data. Finding corresponding invariances is hard in tabular data, made of heterogeneous features, small sample sizes, extreme values
Transformers without positional encodings are invariant to the order of the input tokens. CNNs have translation invariance and can have a little rotational invariance.
It's harder to find similar invariances for tabular data. Maybe applying methods from GNNs would help?
The team behind Yggdrasil tree library at Google was doing some interesting research into tree differentiability (and thus unlocking SGD & end-to-end learning for hybrid architectures).
This is interesting. Are BART models differentiable? I haven’t looked closely at them but I would have thought for posterior sampling they’d have to be. BART has been around for a while, too
I have a lot of experience working with both families of models. If you use an ensemble of 10 NNs, they outperform well-optimized tree-based models such as XGBoost & RFs.
To both questions above: just simple averaging of the logits (classification) or raw outputs (regression) usually works well. If I had to guess why people don't use this approach often in Kaggle competitions, it's the relative difficulty of training an ensemble of NNs. Also, NNs are a bit more sensitive than decision trees (DTs) to the type of features used and their distribution.
Ensemble models work well because they reduce both bias & variance errors. Like DTs, NNs have low bias errors and high variance errors when used individually. The variance error drops as you use more learners (DTs/NNs) in the ensemble. Also, the more diverse the learners, the lower the overall error.
Simple ways to promote the diversity of the NNs in the ensemble are to start their weights from different random seeds and to train each one of them on a random sample from the overall training set (say 70-80%, without replacement), as in the sketch below.
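A minimal sketch of that recipe (model sizes, iteration counts, and the subsample fraction are arbitrary illustration choices):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def fit_ensemble(X, y, n_models=10, frac=0.8, seed=0):
    """Train n_models MLPs, each with its own random init and ~80% subsample."""
    rng = np.random.default_rng(seed)
    models = []
    for k in range(n_models):
        idx = rng.choice(len(X), size=int(frac * len(X)), replace=False)
        m = MLPRegressor(hidden_layer_sizes=(128, 128), random_state=k, max_iter=500)
        models.append(m.fit(X[idx], y[idx]))
    return models

def predict_ensemble(models, X):
    # Simple averaging of raw outputs; for classification, average the logits instead.
    return np.mean([m.predict(X) for m in models], axis=0)
```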