Regularization is all you need: simple neural nets can excel on tabular data (arxiv.org)
216 points by tracyhenry 38 days ago | 93 comments

This is interesting, but the paper still notes that in most "real-life" applications people will likely still prefer gradient-boosted trees, just because you need to allocate significant computation to hyperparameter tuning even in the case of the MLP-with-regularization-cocktail. For just getting something off the ground quickly based on tabular data, GBDTs are still unbeatable.

> GBDTs are still unbeatable.

You'd be surprised how many times I've replaced a GBDT with logistic regression and had negligible drop off in model performance with a dramatic improvement in both training time as well as debugging and fixing production models.

I've had plenty of cases where a bit of reasonable feature transformation can get a logistic model to outperform a GBDT. Any non-linearity you're picking up with a GBDT can often easily be captured with some very simple feature tweaking.
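To make the feature-tweaking point concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the toy data (a label that depends on |x|), the hand-rolled gradient-descent fit, and the helper names `fit_logreg` and `accuracy`. It is not anyone's production pipeline. A logistic model on x alone cannot represent this target, while adding x^2 as a feature makes it linearly separable.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logreg(rows, labels, epochs=3000, lr=0.1):
    """Fit weights and bias by plain batch gradient descent on the log loss."""
    w, b, n = [0.0] * len(rows[0]), 0.0, len(rows)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for x, y in zip(rows, labels):
            err = sigmoid(b + sum(wi * xi for wi, xi in zip(w, x))) - y
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def accuracy(rows, labels, w, b):
    """Fraction of rows whose thresholded prediction matches the label."""
    hits = 0
    for x, y in zip(rows, labels):
        pred = 1 if b + sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0
        hits += pred == y
    return hits / len(labels)

# Toy data (assumption): the label depends on |x|, so it is not monotone in x.
# Points with 1 < |x| < 2 are left out to keep the toy problem cleanly separable.
xs = [i / 10 for i in range(-30, 31) if abs(i) <= 10 or abs(i) >= 20]
ys = [1 if abs(x) >= 2 else 0 for x in xs]

raw = [[x] for x in xs]              # original feature only
squared = [[x, x * x] for x in xs]   # one simple engineered feature added

acc_raw = accuracy(raw, ys, *fit_logreg(raw, ys))
acc_squared = accuracy(squared, ys, *fit_logreg(squared, ys))
print(f"x only: {acc_raw:.2f}, with x^2: {acc_squared:.2f}")
```

On real data the useful transformation is rarely this obvious, which is where the trade-off against a tree ensemble comes in.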

My experience has been that GBDTs are only particularly useful in Kaggle contests, where minuscule improvements in an arbitrary metric are valuable and training time and model debugging are completely unimportant.

There are absolutely cases where NNs can go places that logistic regression can't touch (CV and NLP), but I have yet to see a real world production pipeline where GBDT provides enough improvement over Logistic Regression, to throw out all of the performance and engineering benefits of linear models.

I strongly agree with this. Not to mention parameter interpretability and, in the case of Bayesian models, uncertainty estimates and convergence diagnostics. Such things are very important when making decisions under uncertainty. Kaggle competitions and empirical benchmarks are very biased samples of model performance in real life.

I feel these two things often influence the course of machine learning research and communities too much, and this is not good. Most ML researchers and practitioners are barely aware of the latest advances in parametric modelling, which is a shame. Multilevel models allow you to model response variables with explicit dependent structures. This is done through random (sometimes hierarchical) effects constrained by variance parameters. These parameters regularize the effects themselves and converge really well when fitting factors with high cardinality.

Also, multilevel models are very interesting when it comes to the bias-variance tradeoff. Having more levels in a distribution of random effects actually DECREASES [1] overfitting, which is fascinating.

[1] https://m-clark.github.io/posts/2019-05-14-shrinkage-in-mixe...
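A rough sketch of the shrinkage idea, assuming the simplest possible case: a normal-normal model with known within-group variance `sigma2` and between-group variance `tau2`. The function name `partial_pool` and all numbers are hypothetical. Real multilevel software estimates the variances from the data instead of taking them as given.

```python
def partial_pool(group_mean, n, grand_mean, sigma2, tau2):
    """Posterior mean of a group's effect under a normal-normal model.

    The weight on the group's own mean approaches 1 as the group gets more
    observations, so small groups are pulled harder toward the grand mean.
    """
    weight = n / (n + sigma2 / tau2)
    return weight * group_mean + (1 - weight) * grand_mean

grand_mean = 5.0   # assumed overall mean
sigma2 = 4.0       # assumed within-group (noise) variance
tau2 = 1.0         # assumed between-group variance

# Two groups with the same observed mean but very different sample sizes.
small = partial_pool(group_mean=10.0, n=2, grand_mean=grand_mean, sigma2=sigma2, tau2=tau2)
large = partial_pool(group_mean=10.0, n=200, grand_mean=grand_mean, sigma2=sigma2, tau2=tau2)
print(f"n=2 estimate: {small:.3f}, n=200 estimate: {large:.3f}")
```

The two-observation group lands much closer to the grand mean than the 200-observation group, which is the regularizing behavior described above for high-cardinality factors.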

While I agree and it is surprising that multi-level/hierarchical modeling is rarely applied in industry (I used them extensively in academia and industry), dealing with hundreds or thousands of random effects in large data sets, especially in non-linear models, is a computational nightmare. And the benefits may not warrant those nightmares.

Finally multi-level/hierarchical modeling is starting to permeate industry thanks to Stan and company.

I use hierarchical modeling regularly to help build Zapier. So do other companies like Generable: https://www.generable.com/

I suspect hierarchical models will become the next “new” hot data structure in software engineering due to their ability to compact logic. https://twitter.com/statwonk/status/1363104221747421184?s=21

I don't know about permeating the industry. I know for example that the model that Airbnb used 3 years ago (things may have changed in the meantime) to forecast occupancy was a random-effects model maintained by a single person in Europe. I don't know about the penetrance of Generable and companies providing similar probabilistic modeling solutions, although I hope they succeed.

When I was working for one of the FAANGs, I was the only one using random effects models (that I know of), in particular non-linear random effects models with ~ hundreds of random effects. I was using a language/tool faster than Stan (fitting the same model with Stan would have taken hours, or more likely days), but making the models converge was always challenging. In addition, since most of my colleagues had a CS background and were in love with the latest not interpretable, brute force algorithm, and were scared of a more statistical approach they made no effort to learn, I faced pushback and skepticism despite the model working very well.

I love random effects models, and I built my technical career on them.

I think one of the main reasons is that there is no good Python library for doing linear mixed effect models. There is no sklearn implementation. There are some libraries that wrap R's lmer (probably using rpy2 or something). The best native Python library I could find is statsmodels, and it has several shortfalls (saving a model to disk consumes hundreds of megabytes; the predict method is useless, since it just predicts using the fixed effects; multi-level beyond just 1 group is not even clearly documented, and the syntax sucks if you really do it, never mind actually implementing a predict method using those random effects). I think once someone does a decent sklearn implementation, it might take off. I've been thinking of doing an implementation for sklearn as a side project, but I'm not an ML researcher, just a practitioner, so it might suck :)

I used statsmodels for a while ... it's definitely possible to predict arbitrary inputs, it's just a pain to fiddle in the right inputs ...

>You'd be surprised how many times I've replaced a GBDT with logistic regression and had negligible drop off in model performance with a dramatic improvement in both training time as well as debugging and fixing production models.

Not only reduced training time, but also less data needed for training. Which is particularly important if training on time-series data for something that changes over time, as older data is less useful.

> I have yet to see a real world production pipeline where GBDT provides enough improvement over Logistic Regression

Not my field at all, so "I know nooothing".

Are GBDTs very different from "plain" binary decision trees? I've seen the latter a lot in the context of particle experiments[1][2][3].

[1]: https://arxiv.org/abs/physics/0408124

[2]: http://cds.cern.ch/record/2289251/

[3]: https://arxiv.org/abs/2002.02534

Very simply: plain decision trees usually overfit to training data (and, therefore, perform very badly out of sample). So the important part isn't the tree but the boosting. How you go from an ensemble of weak learners to something that works.

And this boosting generalises to any learner. You can apply it to regression too. Again, the boosting part is really the key. The innovation isn't a new technique either, it is just the aggressive application of computing power to these problems.
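For anyone who wants to see the boosting loop itself, here is a toy sketch in plain Python, assuming squared-error loss (so the negative gradient is just the residual) and one-split "stumps" as the weak learner. The names `fit_stump` and `boost` and the sine-curve data are made up for illustration. This is not how XGBoost is implemented, which adds second-order information, regularization, and a lot of engineering.

```python
import math

def fit_stump(xs, residuals):
    """Find the single threshold split that minimises squared error."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x <= thr]
        right = [r for x, r in zip(xs, residuals) if x > thr]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, thr, left_mean, right_mean)
    return best[1:]

def boost(xs, ys, rounds, lr=0.3):
    """Start from the mean, then repeatedly fit a stump to the residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        thr, left_mean, right_mean = fit_stump(xs, residuals)
        # Each weak learner only nudges the ensemble (shrinkage by lr).
        preds = [p + lr * (left_mean if x <= thr else right_mean)
                 for x, p in zip(xs, preds)]
    return preds

def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

# Toy regression target (assumption): a sine curve sampled on a grid.
xs = [i / 10 for i in range(40)]
ys = [math.sin(x) for x in xs]

errors = {rounds: mse(ys, boost(xs, ys, rounds)) for rounds in (0, 1, 50)}
print(errors)
```

A single stump barely dents the training error; the accumulated sequence of residual fits is what does the work. Note this only shows training error, so it says nothing about the overfitting question by itself.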

They are the same concept under the hood, but a GBDT is an ensemble model using a number of trees in tandem that are grown to improve the performance of the overall model.

Uhm how do you deal with imbalanced data? Like I mean 99:1 or something? I’m always worried about feature engineering - in the right hands it’s great but I’d posit that majority of DSes out there do not have said hands. Much rather take a random forest with no manipulation and shittier (and hopefully less biased) results.

What is the size of the datasets? I have a hard time conceptualizing tabular business data large enough to be a problem.

consider the problem of "online advertising"

When you have billions of rows the performance savings can be nice.

One of my projects several years back ran both a LR model and a DNN against the same input data (albeit featurized differently). Accuracy, P&R were roughly the same (minor differences depending on the time horizon), but the LR model took maybe a half hour to train and five minutes to run; the DNN took about 24 hours to train and an hour or two to run.

This wasn't even particularly huge data compared to my other projects. But certainly at that scale, there are huge differences between regression & NNs.

An overlooked advantage of the MLP is that it is differentiable. Essentially, one trades the extra CPU for a classifier that one can pipe gradients through. That can be extremely useful in larger NN architectures.

Images -> correlation in space -> Deep convolution networks

Time series -> correlation in time -> Recurrent networks

Tabular data without clear correlation structure -> good old ML (ANN, SVM, DT, LR, KNN).

This is obvious when following the field since 2006 or so. Deep Convolutional Networks were considered a special case for data with local correlations at a hierarchy of spatial scales. Same for RNNs in time, although they came much later (when was the LSTM rediscovery again? 2016?)

For most data without clear spatial or temporal structure to exploit, the good old ML techniques will work just fine.

Not only will they work just fine, they'll save you compute costs, training time and lower your need for data.

It's not necessarily true that MLPs are very compute-expensive. It is a couple of layers at most, and if your input is sparse (categorical) you can gain even more. In my experience, for some problems it can be the fastest decent non-linear model.
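One way to read the sparse-input point, sketched in plain Python with made-up sizes and weights: when the input is one-hot (or multi-hot) encoded categories, the first layer's matrix product reduces to summing the weight rows of the active categories. The function names are hypothetical.

```python
def dense_forward(one_hot, weights):
    """Full matrix-vector product: touches every weight row."""
    n_out = len(weights[0])
    return [sum(one_hot[i] * weights[i][j] for i in range(len(one_hot)))
            for j in range(n_out)]

def sparse_forward(active_indices, weights):
    """Sum only the rows selected by the active categories."""
    out = [0.0] * len(weights[0])
    for i in active_indices:
        for j, value in enumerate(weights[i]):
            out[j] += value
    return out

# Hypothetical first layer: 5 category slots, 3 hidden units.
weights = [[i + j / 10 for j in range(3)] for i in range(5)]

active = [1, 4]                  # indices of the categories that are "on"
one_hot = [0, 1, 0, 0, 1]        # the same input, densely encoded

dense_out = dense_forward(one_hot, weights)
sparse_out = sparse_forward(active, weights)
print(dense_out, sparse_out)
```

The sparse version does work proportional to the number of active categories rather than the size of the vocabulary, which is the same trick embedding layers rely on.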

I don’t think that was the claim… MLP/ANNs are fine except for the difficulties around interpretability. DTs and LR are preferable on that front. Or an SVM if you know a kernel/similarity metric that kills it in your data.

Not to mention the shedloads of “X is all you need” papers which you can ignore, because Bishop’s “Pattern Recognition and ML” book is actually all you need (plus perhaps a good reference on linear algebra).

It is true that MLPs are classic, but the regularization methods that apparently make a big empirical difference in this paper are newer concepts (data augmentation, skip connections/residual blocks, dropout, batch norm, lookahead, stochastic weight averaging, etc.). They compare against a good old MLP without the bells and whistles in Table 2, and the classic MLP is quite a poor performer (XGBoost beats a classical MLP very significantly). Which leads to the conclusion that we need all these recent deep learning advances in regularization techniques to make the difference.

From paper: Tabular datasets are the last "unconquered castle" for deep learning, with traditional ML methods like Gradient-Boosted Decision Trees still performing strongly even against recent specialized neural architectures.

My but this statement seems more than a little grandiose.

Never mind that XGBoost still does well on a substantial portion of ML challenges (supposedly). The bigger problem is that there's a confusion of maps and territories in this way of talking about machine learning. The field of ML has made a certain level of palpable progress by creating a number of challenges and benchmarks and then doing well on them. But success on a benchmark isn't necessarily the same as success at the "task" broadly. Passing an NLP test doesn't imply mastering real language, and passing a driving benchmark doesn't imply mastery of over-the-road driving, etc. Notably, success on a benchmark also "isn't nothing". In a situation like the game of Go, the possibilities can be fully captured "in the lab", and success at tests indeed became success against humans. But with driving or language, things are much more complicated.

What I would say is that benchmark success seems to produce at least a situation where the machine can achieve human-like performance for some neighborhood (or tile or etc) limited in time, space and subject. Of course, driving is the poster child for the limitations of "works most of the time", but a lot of "intelligent" activities require an ongoing thread of "reasonableness" aside from having an immediate logic.

Anyway, it would be nice if our ML folks looked at this stuff more as a beginning than as a situation where they're poised on success.

Paper is missing the control: how good is this 'cocktail of regularization' when applied to traditional methods like XGBoost?

At best you can claim the result here that neural networks with regularization methods can beat traditional methods without it, but to be apples to apples both methods must have access to the same 'cocktail of regularization'.

From the paper:

> This paper is the first to provide compelling evidence that well-regularized neural networks (even simple MLPs!) indeed surpass the current state-of-the-art models in tabular datasets, including recent neural network architectures and GBDT (Section 6).

> Next, we analyze the empirical significance of our well-regularized MLPs against the GBDT implementations in Figure 2b. The results show that our MLPs outperform both GBDT variants (XGBoost and auto-sklearn) with a statistically significant margin.

They test against XGBoost, GBDT Auto-sklearn, and others. Did you read the paper?

> They test against XGBoost, GBDT Auto-sklearn, and others. Did you read the paper?

Yes. Did you read my comment?

They compare NN + Cocktail vs. vanilla XGB. They don't compare NN + Cocktail vs. XGB + Cocktail.

To make it crystal clear, if I wrote a paper "existing medicine A enhanced with novel method B is more effective than existing medicine C" and I did not include the control "C + B" (assuming if relevant, which is the case here), that'd be bad science. It's very much possible that novel method B is doing the heavy lifting and A isn't all that relevant. s/A/NN, s/B/Cocktail, s/C/XGBoost.

How would you even apply layer normalization or SWA to XGB? These methods are neural net specific.

GBDTs have their own set of hyperparameters such as learning rate, number of trees, min samples per bin, l0, l1, etc. So you could definitely also create an appropriate cocktail to optimize on, although GBDTs are typically more robust wrt hyperparameters.

The authors do claim to do a hyperparameter sweep but only for vanilla XGB hyperparams.

The old "my method (with as much optimization as I could get away with) beats the other method (with as little optimization as I could get away with)"

Yup. The Least Publishable Unit strikes again.

There is nothing neural-network-specific about batch normalization if you use it on the input layer. I don't think it matters for a tree algorithm like XGBoost either way, though.

SWA is pretty NN specific. So leave it out for XGB. There's a bunch that are relevant, and they could be very important.

Batch norm has an advantage for iterative methods on mini-batches, while XGB uses the full training set. Using batch norm on the full training set is equivalent to Z-normalizing the features, which has no effect at all for XGB, as the scale of the features plays no role in the split decisions of the tree nodes. Apart from a few non-parametric data augmentation methods (notice that adversarial augmentation is also NN-specific), I do not think any other regularization used in that paper can be directly/intuitively applied to XGB.
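The scale-invariance claim is easy to check on a toy example. The sketch below (plain Python, made-up data, hypothetical helper names) finds the best squared-error split on a raw feature and on its Z-normalized version, and compares which samples end up on the left side. It covers a single exact split search only; it does not model the histogram binning real GBDT libraries use.

```python
def best_split_left(xs, ys):
    """Indices sent to the left child by the best squared-error split."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best_sse, best_left = None, None
    for k in range(1, len(order)):
        if xs[order[k - 1]] == xs[order[k]]:
            continue  # cannot split between identical feature values
        sse = 0.0
        for side in (order[:k], order[k:]):
            mean = sum(ys[i] for i in side) / len(side)
            sse += sum((ys[i] - mean) ** 2 for i in side)
        if best_sse is None or sse < best_sse:
            best_sse, best_left = sse, frozenset(order[:k])
    return best_left

def z_normalize(xs):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = sum(xs) / len(xs)
    std = (sum((x - mean) ** 2 for x in xs) / len(xs)) ** 0.5
    return [(x - mean) / std for x in xs]

# Made-up feature on an awkward scale, with a step-like target.
xs = [3.0, 120.0, 45.0, 7.5, 300.0, 60.0]
ys = [0.0, 1.0, 1.0, 0.0, 1.0, 1.0]

left_raw = best_split_left(xs, ys)
left_scaled = best_split_left(z_normalize(xs), ys)
print(sorted(left_raw), sorted(left_scaled))
```

The threshold value changes after normalization, but the partition of the samples does not, because a split only depends on the ordering of the feature values.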

I don't think every single regularization method in the cocktail can be applied to non-neural-network methods, but I'm pretty sure some of them can, like data augmentation. The authors could have figured out which methods can be applied to non-NN models or considered if equivalent/analogous methods exist. I agree it would make a fairer comparison.

I agree with your point, e.g. data augmentation can be added, but that's pretty much it. All the other regularization techniques they use are neural network specific and cannot be applied to gradient-boosted trees. What I find particularly striking in this paper is that their method trains a single neural network which outperforms an ensemble of decision trees (XGBoost). Asking for perfect apples-to-apples comparisons also means comparing an ensemble of the MLPs vs. XGBoost. In this context, at least the message here is that XGBoost and/or other gradient-boosted methods are no longer a silver bullet for tabular datasets. Boosting for trees was great at reducing both bias and variance, but apparently neural networks can achieve the same effect with a high capacity (low bias) and a mix of modern regularization techniques (low variance).

This paper compares XGBoost vs. neural nets and an ensemble [Tabular Data: Deep Learning is Not All You Need] https://arxiv.org/abs/2106.03253

A lot of these papers can be titled "Tuning hyperparameters on the evaluation dataset is all you need." I see a few cases in this paper.

Per the paper, 37.5% of the regularization regimes were Batch Norm + Weight Decay. According to other findings, Batch Norm and Weight decay have an interesting interaction that results in a learning rate adjustment, not regularization:


If you want to use NNs on tabular data look up the work that's being done on point clouds. They both share the same major symmetry over permutations.

They don't? One is a row in table (R^n) (order matters) and the other is a set of points (R^nxd).

If you consider the whole table, any dataSET is like point clouds.

The table is the point cloud the row is the point. The symmetry is permutations of rows or permutations of points.

In which case would you use the whole table?

Well yeah, there's no permutation symmetry between columns. (Unless there is...)

Then as parent said, almost all datasets are like point cloud, except time-series datasets.

Yeah, they are.

Oh that's interesting. Can you say more? Haven't seen much relating the two topics before.

A row in a table of data is a point in R^n. I'm not sure how much there is to write about it other than to say, that's a point cloud.

What's the intuition for why that might be desirable? I can sort of see that you might care to consider the relation between a given row and other rows (not dissimilar to something like kernel methods) and then you can use something like Deep Sets[1] to featurize the data?

[1] https://arxiv.org/abs/1703.06114

I think the way it works is, you have one network that produces global permutation-invariant (maintained so by training loss) metrics and another that recognizes based on those metrics. The big prior you're putting in is that the order of the points doesn't matter. Relationships between points do matter but only in a permutation-invariant way. I would recommend reading the literature because of course, it's not my idea. :)
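A minimal sketch of the two-stage idea in plain Python, with loud caveats: `phi` and `rho` are fixed toy functions standing in for learned networks, the table is made up, and invariance here comes by construction from sum pooling (the Deep Sets recipe linked above) rather than from a training loss as the comment describes.

```python
import math
import random

def phi(row):
    """Per-row feature map (toy stand-in for a learned network)."""
    return [math.tanh(row[0] + 0.5 * row[1]), math.tanh(row[0] * row[1])]

def encode_table(rows):
    """Sum-pool the per-row features; a sum does not care about row order."""
    pooled = [0.0, 0.0]
    for row in rows:
        features = phi(row)
        pooled[0] += features[0]
        pooled[1] += features[1]
    return pooled

def rho(pooled):
    """Second stage: maps the pooled summary to an output (toy stand-in)."""
    return pooled[0] - 2.0 * pooled[1]

# A made-up table: each row is one point in R^2.
table = [[0.1, 0.9], [1.5, -0.3], [-0.7, 0.2], [0.4, 0.4], [2.0, 1.0]]

shuffled = list(table)
random.Random(0).shuffle(shuffled)

out_original = rho(encode_table(table))
out_shuffled = rho(encode_table(shuffled))
print(out_original, out_shuffled)
```

The two outputs agree up to floating-point rounding, since reordering rows only reorders the terms of a sum. Relationships between rows can still influence the result, but only through the pooled summary.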

No, that's a point.

The table's the point cloud. I guess if you want to be really pedantic about grammar then I should point out that a set of one item is still a set. :-)

True - I misread your post. Your first post was intriguing but the second was dismissive. Surely there's more to be said than data sets are point clouds? Images are points in R^N too, right?

If your unit of recognition was a set of images, and not an image alone, then you would have permutation symmetry and want to use point cloud techniques to design your first layer. So yes images are points in R^N.

Oh thanks, this was not clear from your other posts: the whole table is used as a data point, not the line. Much clearer now, why you would compare this to point clouds

Which work do you suggest?

Would you please stop it with the "$whatever is all you need"?

Isn’t it a reference to the “Attention is All You Need” machine learning paper?


Yes, and this is the fourth or fifth paper I've seen copying that format. The attention paper was pretty damn significant. This won’t be, because it just shows that hyperparameters are important. We know that.

"All you need considered harmful"

"The Unreasonable Effectiveness of All You Need Is Considered Harmful"

-- All interpretations worth considering...

"The Unreasonable Effectiveness of All You Need Is Considered Harmful for Fun and Profit"

FANG doesn't want you to know how this one weird trick about the unreasonable effectiveness of all you need considered harmful for fun and profit and you will never guess what happened next.

We live in a society where FANG doesn't want you to know how this one weird trick about the unreasonable effectiveness of all you need considered harmful for fun and profit and you will never guess what happened next in my social experiment gone wrong.

I often interpret that pattern as a reference to the lyric "love is all you need" by the Beatles (e.g., an attempt at being playful and relatable), but that may be my musical bias.

Either way, totally agree. Overused, almost always incorrect, and easily misconstrued especially by people who don't speak English as their first language.

What if the paper which finally gives us The Singularity is titled "All you need is all you need?"

If the paper had not used this clickbait title, it would probably have gone unnoticed and no one would have paid attention? Sad truth about sensation-oriented modern lives :(

Sometimes it feels like paper authors treat their audience like children.

To be fair, many of the tasks of machine learning are things children do, like telling if something is a dog, coloring in a picture, solving a maze, etc.

The field is certainly moving in a direction of infantilization due to the social media attention economy. Your paper competes with cute flashy cat gifs etc. on Twitter. And Twitter is ridiculously important in AI/ML, except perhaps above 50-60 years of age.

A little whimsy doesn't hurt... Nobody said science and research had to be dry and witless.

Communication is based on the lowest common denominator. The best communication can be understood by a wide range of audiences. Have you considered that you weren't the intended audience?

I don't know about that. It's probably driven by the need for a short attention*-gathering title.

* I see what I did there.

It's such a cheap trick too, basically clickbait.

But it works, for example it got to the top of HN.

I strongly agree with this paper. Regularizing a NN is the key to better performance. But first, one needs to know what exactly to regularize. I don't think trial-and-error hyperparameter bingo should be the way to go. We need better insight into and understanding of these networks: analyze their structure and find out what exactly is wrong with it. Then a TARGETED regularization (layerwise, or maybe per neuron) has huge potential to let very simple networks perform extremely well. I would even suggest that adaptive regularization (on/off/strength) should be researched more. A network does not have to be regularized the same way all the time.

Any references/recommendations on the best practices to analyze a network, weights, etc.?

Anyone who finds this paper interesting should read "Self-Attention Between Datapoints", which shows non-parametric transformers often beat GBDTs on tabular data.


Short description: Overfit is All you need

> As a result, we propose regularizing plain Multilayer Perceptron (MLP) networks by searching for the optimal combination/cocktail of 13 regularization techniques for each dataset using a joint optimization over the decision on which regularizers to apply and their subsidiary hyperparameters.

Pardon? https://youtu.be/RXJKdh1KZ0w

Ridiculous clickbait title. How the heck did this neural network "can" Excel wrt. this tabular dataset? What's the underlying objective and the evaluation benchmark? And what feature of Excel was this tested against in the first place?

I suppose the follow up will be titled "One Weird Trick is All You Need To Destroy SOTA On This Dataset!"

Not Excel: the spreadsheet software, Excel: the verb.

They aren't saying that they "canned" Excel (the software), they are saying that neural nets have the potential to perform well on tasks involving tabular data that are traditionally performed by other ML techniques.

I, for one, very much appreciate the comedy of this post.

This makes no sense. Neural nets already tear through tabular data like butter. We're literally dropping out half of the learned parameters. This points to the fundamental inefficiency of the approach. Better regularization is the answer, but it's not about round-robin of existing techniques.

If you spend the same amount of time on XGBoost and LightGBM looking for hyperparameters and build trees with a small enough eta, will you achieve the same performance? That's not answered in the paper, and if I were a reviewer I would ask them to compare time spent.

After reading the paper I believe they do address both aspects that you mention: i) they give XGBoost and all the baselines up to 4 CPU days of hyperparameter optimization time on 20 CPU core servers per dataset, the same compute as the authors' proposed method. ii) the search space for the XGBoost hyperparameters includes low eta (starting from 0.001) and large num_round ranges (up to 1000 trees) in Table 5 in the appendix.

Attention, regularization, someone tell me what I really need.

Massive hyperparameter sweeps without any cross validation are the perfect way to overfit holdout sets on small tabular datasets.

But they evaluate on a test set while sweeping on a dev set, so there is no risk of overfitting to the test set.
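To see why the dev/test distinction matters for a big sweep, here is a toy simulation in plain Python. All of it is assumed for illustration: 200 "configurations" that guess labels at random (so none has any real skill), a small dev set of 50 rows, and a larger held-out test set. It is not a reproduction of the paper's protocol.

```python
import random

def accuracy(preds, labels):
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

rng = random.Random(0)
dev_labels = [rng.randint(0, 1) for _ in range(50)]      # small dev set
test_labels = [rng.randint(0, 1) for _ in range(2000)]   # untouched test set

best_dev, best_test = -1.0, None
for config in range(200):
    # Each "configuration" is pure noise: it guesses every label at random.
    guess = random.Random(1000 + config)
    dev_acc = accuracy([guess.randint(0, 1) for _ in dev_labels], dev_labels)
    test_acc = accuracy([guess.randint(0, 1) for _ in test_labels], test_labels)
    # Selection uses the dev score only; the test score is just recorded.
    if dev_acc > best_dev:
        best_dev, best_test = dev_acc, test_acc

print(f"winner's dev accuracy: {best_dev:.2f}, test accuracy: {best_test:.2f}")
```

The winner's dev accuracy looks well above chance purely because it was the maximum of many noisy draws, while its test accuracy stays near 50%. So the test number remains an honest estimate as long as it plays no part in selection, though the selection itself can still be driven by dev-set noise on small datasets.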

To piggy back on the topic of tabular data, has anyone experienced transfer learning on tabular data in their work or research?

I see what you did there.
