You'd be surprised how many times I've replaced a GBDT with logistic regression and seen a negligible drop in model performance, with a dramatic improvement in both training time and in debugging and fixing production models.
I've had plenty of cases where a bit of reasonable feature transformation can get a logistic model to outperform a GBDT. Any non-linearity you're picking up with a GBDT can often easily be captured with some very simple feature tweaking.
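To make that concrete, here's a minimal sketch of the kind of pipeline I mean, using scikit-learn. The column names and the choice of transforms (log for heavy tails, binning for non-monotone effects) are purely illustrative, not from any particular project:

```python
# Minimal sketch: simple feature tweaks feeding a logistic model.
# Column names here are made up; the point is the shape of the pipeline.
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, KBinsDiscretizer, StandardScaler
from sklearn.linear_model import LogisticRegression

numeric_skewed = ["income", "transaction_amount"]   # log-transform heavy tails
numeric_nonlinear = ["age", "tenure_days"]          # bin to capture non-monotone effects
numeric_plain = ["num_logins"]

preprocess = ColumnTransformer([
    ("log", Pipeline([
        ("log1p", FunctionTransformer(np.log1p)),
        ("scale", StandardScaler()),
    ]), numeric_skewed),
    ("bins", KBinsDiscretizer(n_bins=10, encode="onehot-dense"), numeric_nonlinear),
    ("scale", StandardScaler(), numeric_plain),
])

model = Pipeline([
    ("features", preprocess),
    ("clf", LogisticRegression(C=1.0, max_iter=1000)),
])
# model.fit(X_train, y_train); model.predict_proba(X_test)
```

The upside is that every coefficient is inspectable, so when a production model misbehaves you can usually point to the exact feature responsible.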
My experience has been that GBDTs are only particularly useful in Kaggle contests, where minuscule improvements in an arbitrary metric are valuable and training time and model debugging are completely unimportant.
There are absolutely cases where NNs can go places that logistic regression can't touch (CV and NLP), but I have yet to see a real-world production pipeline where GBDT provides enough improvement over logistic regression to throw out all of the performance and engineering benefits of linear models.
I feel these two things often have too much influence on the course of Machine Learning research and communities, and this is not good. Most ML researchers and practitioners are barely aware of the latest advances in parametric modelling, which is a shame. Multilevel models allow you to model response variables with explicit dependence structures. This is done through random (sometimes hierarchical) effects constrained by variance parameters. These parameters regularize the effects themselves and converge really well when fitting factors with high cardinality.
Also, multilevel models are very interesting when it comes to the bias-variance tradeoff. Having more levels in a distribution of random effects actually DECREASES overfitting, which is fascinating.
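For anyone who hasn't used them, here's a minimal sketch of a random-intercept model fit with statsmodels' MixedLM. The data and column names are made up; the point is that the estimated group variance is what shrinks each group's effect toward the population mean:

```python
# Hypothetical example: per-store random intercepts on synthetic data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_stores, n_obs = 50, 2000
store = rng.integers(0, n_stores, n_obs)
store_effect = rng.normal(0, 1.0, n_stores)          # true random intercepts
price = rng.normal(0, 1, n_obs)
promo = rng.integers(0, 2, n_obs)
sales = 10 + store_effect[store] - 2.0 * price + 1.5 * promo + rng.normal(0, 1, n_obs)
df = pd.DataFrame({"sales": sales, "price": price, "promo": promo, "store_id": store})

# Random intercept per store; the estimated group variance acts as the
# regularizer that shrinks each store's effect toward the overall mean.
model = smf.mixedlm("sales ~ price + promo", data=df, groups=df["store_id"])
result = model.fit()
print(result.summary())
print(result.cov_re)   # estimated variance of the random intercepts
```

With a high-cardinality factor like this, the shrinkage means stores with few observations borrow strength from the rest instead of getting wild per-store estimates.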
I use hierarchical modeling regularly to help build Zapier. So do other companies like Generable: https://www.generable.com/
I suspect hierarchical models will become the next “new” hot data structure in software engineering due to their ability to compact logic. https://twitter.com/statwonk/status/1363104221747421184?s=21
When I was working for one of the FAANGs, I was the only one (that I know of) using random effects models, in particular non-linear random effects models with on the order of hundreds of random effects. I was using a language/tool faster than Stan (fitting the same model with Stan would have taken hours, more likely days), but getting the models to converge was always challenging. In addition, since most of my colleagues had a CS background, were in love with the latest uninterpretable brute-force algorithm, and were scared of a more statistical approach they made no effort to learn, I faced pushback and skepticism despite the model working very well.
I love random effects models, and I built my technical career on them.
Not only reduced training time, but also less data needed for training, which is particularly important if you're training on time-series data for something that changes over time, as older data is less useful.
Not my field at all, so "I know nooothing".
Are GBDTs very different from "plain" binary decision trees? I've seen the latter a lot in the context of particle experiments.
And this boosting generalises to any learner. You can apply it to regression too. Again, the boosting part is really the key. The innovation isn't a new technique either; it's just the aggressive application of computing power to these problems.
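If it helps, here's a toy sketch of the idea for squared-error regression: fit a shallow tree to the residuals of the ensemble so far, shrink it, add it, and repeat. Synthetic data, and not any particular library's implementation:

```python
# Toy sketch of gradient boosting for squared-error regression:
# each shallow tree is fit to the residuals of the ensemble so far.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 500)

learning_rate = 0.1
n_rounds = 200
pred = np.full_like(y, y.mean())     # start from the mean prediction
trees = []

for _ in range(n_rounds):
    residual = y - pred                          # negative gradient of squared error
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X, residual)                        # weak learner fit to the residuals
    pred += learning_rate * tree.predict(X)      # shrink and add to the ensemble
    trees.append(tree)

def predict(X_new):
    out = np.full(len(X_new), y.mean())
    for t in trees:
        out += learning_rate * t.predict(X_new)
    return out
```

So the difference from a single "plain" decision tree is mostly the sequential residual fitting plus shrinkage, not the tree itself.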
This wasn't even particularly huge data compared to my other projects. But certainly at that scale, there are huge differences between regression & NNs.
Time series -> correlation in time -> Recurrent networks
Tabular data without clear correlation structure -> good old ML (ANN, SVM, DT, LR, KNN).
This is obvious if you've been following the field since 2006 or so. Deep Convolutional Networks were considered a special case for data with local correlations at a hierarchy of spatial scales. Same for RNNs in time, although they came much later (when was the LSTM rediscovery again? 2016?)
For most data without clear spatial or temporal structure to exploit, the good old ML techniques will work just fine.
My, but this statement seems more than a little grandiose.
Never mind that XGBoost still does well on a substantial portion of ML challenges (supposedly). The bigger problem is that there's a confusion of maps and territories in this way of talking about machine learning. The field of ML has made a certain level of palpable progress by creating a number of challenges and benchmarks and then doing well on them. But success on a benchmark isn't necessarily the same as success at the "task" broadly. An NLP test doesn't imply mastering real language, a driving benchmark doesn't imply mastery of real-world driving, etc. Notably, success on a benchmark also "isn't nothing". In a situation like the game of go, the possibilities can be fully captured "in the lab" and success at tests indeed became success against humans. But with driving or language, things are much more complicated.
What I would say is that benchmark success seems to produce at least a situation where the machine can achieve human-like performance for some neighborhood (or tile, etc.) limited in time, space, and subject. Of course, driving is the poster child for the limitations of "works most of the time", but a lot of "intelligent" activities require an ongoing thread of "reasonableness" aside from having an immediate logic.
Anyway, it would be nice if our ML folks looked at this stuff more as a beginning than as a sign that they're on the cusp of success.
At best you can claim the result here is that neural networks with regularization methods can beat traditional methods without them, but to be apples to apples, both methods must have access to the same 'cocktail of regularization'.
> This paper is the first to provide compelling evidence that well-regularized neural networks (even simple MLPs!) indeed surpass the current state-of-the-art models in tabular datasets, including recent neural network architectures and GBDT (Section 6).
> Next, we analyze the empirical significance of our well-regularized MLPs against the GBDT implementations in Figure 2b. The results show that our MLPs outperform both GBDT variants (XGBoost and auto-sklearn) with a statistically significant margin.
They test against XGBoost, GBDT Auto-sklearn, and others. Did you read the paper?
Yes. Did you read my comment?
They compare NN + Cocktail vs. vanilla XGB. They don't compare NN + Cocktail vs. XGB + Cocktail.
To make it crystal clear, if I wrote a paper "existing medicine A enhanced with novel method B is more effective than existing medicine C" and I did not include the control "C + B" (assuming if relevant, which is the case here), that'd be bad science. It's very much possible that novel method B is doing the heavy lifting and A isn't all that relevant. s/A/NN, s/B/Cocktail, s/C/XGBoost.
SWA is pretty NN-specific, so leave it out for XGB. But there's a bunch that are relevant, and they could be very important.
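To sketch what the missing control arm could look like: give XGBoost its own regularization budget (shrinkage, subsampling, L1/L2, etc., its rough analogues of parts of the cocktail) and tune it with comparable effort to the NN side. The grid below is purely illustrative and not taken from the paper:

```python
# Sketch of the missing control arm: tune XGBoost's own regularizers
# with the same search effort as the NN + cocktail arm.
from xgboost import XGBClassifier
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    "learning_rate": [0.01, 0.05, 0.1],
    "max_depth": [3, 5, 7],
    "subsample": [0.6, 0.8, 1.0],
    "colsample_bytree": [0.6, 0.8, 1.0],
    "reg_alpha": [0.0, 0.1, 1.0],      # L1 penalty on leaf weights
    "reg_lambda": [0.5, 1.0, 5.0],     # L2 penalty on leaf weights
    "n_estimators": [500, 1000],
}

search = RandomizedSearchCV(
    XGBClassifier(tree_method="hist"),
    param_distributions,
    n_iter=50,
    cv=5,
    scoring="balanced_accuracy",
)
# search.fit(X_train, y_train)
# Compare search.best_score_ against the NN + cocktail result under the same budget.
```

Until something like that is in the paper, it's hard to say how much of the gain comes from the MLP versus the tuning/regularization effort itself.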
If you consider the whole table, any dataSET is like point clouds.
-- All interpretations worth considering...
Either way, totally agree. Overused, almost always incorrect, and easily misconstrued, especially by people who don't speak English as their first language.
* I see what I did there.
I suppose the follow up will be titled "One Weird Trick is All You Need To Destroy SOTA On This Dataset!"
They aren't saying that they "canned" Excel (the software), they are saying that neural nets have the potential to perform well on tasks involving tabular data that are traditionally performed by other ML techniques.