This is a silly claim, backed only by tiny toy experiments on tiny toy datasets with tiny toy neural networks. See links to the paper in [a] and [b] below.
Using the same logic as the authors of the paper, I could claim that neural nets are "essentially the same" as any universal approximator (GPs, SVMs, RFs, GLMs, etc.) and then run a few experiments only on tiny toy problems to back up my claim.
The authors should try to fit a polynomial regression model to cognitive tasks such as language translation, image generation, or modeling the game of Go... they won't get very far without running into the curse of dimensionality.
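To put a rough number on that curse-of-dimensionality point: the count of monomials of total degree at most d in n variables is C(n + d, d), which explodes even for small images. A quick back-of-the-envelope sketch:

```python
from math import comb

# Number of monomials of total degree <= d in n variables: C(n + d, d).
# Even a tiny 32x32 grayscale image makes full polynomial regression
# infeasible past degree 1.
def n_monomials(n_vars, degree):
    return comb(n_vars + degree, degree)

n = 32 * 32  # pixels in a small image
print(n_monomials(n, 1))  # 1,025 terms: plain linear regression
print(n_monomials(n, 2))  # ~526k terms
print(n_monomials(n, 3))  # ~180 million terms
```

And that's before you get anywhere near ImageNet-sized inputs or the board states of Go.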
Please stop posting this on HN.
It's no surprise that in domains where low-parameter models work at all, they often work better than NNs. In econometric forecasting, where the traditional methods encapsulate more prior knowledge than could ever be inferred from the data, neural networks aren't anywhere near competitive. Even in domains where NNs are state-of-the-art, they might not be worth the training costs over GLMs or gradient-boosted forests or whatever.
But there are many other domains, like raw waveform modeling, large natural images (not MNIST), or NLP, where anything that isn't neural fails miserably. These are the domains where people are most excited about neural networks — problems that otherwise have no solution at all. Neural networks won't help you get a better interpretation of your n=25 daily-resolution cohort study, or predict user retention from your 300-column, 20K-row Excel spreadsheet.
In their paper, the authors dismiss most of NN's greatest recent successes as being "specialized" applications that only work because they incorporate tricks like recurrence and convolution, and not because they are neural. This is a much clearer, bolder claim, but they hardly support it as far as I can tell. The things that they call "specialized" (RNNs and CNNs!) are what most would recognize as the bread-and-butter of neural network design.
This might sound like nonsense. On the one hand, most trivial convolutions use trivial operators; "polynomials" might include higher operations anyway, or approximate some of the more important ones, none of which is appealing if simplicity equals efficiency. On the other hand, I never really understood diff-eqs; ODEs seem to me like polynomials over self-similar polynomials, hence "recurrent". All the other diff-eqs I can't begin to fathom.
Polynomial regression is nice because it's a little easier to interpret, and it's also a convex problem with a single, global minimum. OTOH you have to design features yourself, otherwise for high-dimensional problems polyreg quickly requires way too much memory and compute to solve in a reasonable amount of time.
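The convexity point is just that polyreg is linear in its coefficients, so fitting it is ordinary least squares with one global minimum. A minimal numpy sketch (synthetic data, made-up coefficients):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=200)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(scale=0.01, size=200)

# Design matrix [1, x, x^2]: the model is linear in its coefficients,
# so least squares is convex and solvable in closed form.
X = np.column_stack([np.ones_like(x), x, x**2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)  # approximately [1, 2, -3]
```

The catch, as above, is that the `x**2` column (and every interaction term) has to be constructed by hand, and the design matrix grows combinatorially with degree and dimension.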
This paper might have been more interesting if it had somehow connected neural nets to those facts, or if it had shown how insight into how neural nets work can be used to improve polyreg. But (admittedly, from a brief reading) I don't see anything like that here.
There are lots of ways to approximate functions; the property of NNs that makes them attractive for ML isn't the universal approximation theorem. It's that there's a fast, robust method of training them that's easy to implement, easy to vectorize, easy to parallelize, and easy to customize for different applications.
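To illustrate how little code that training method takes: here's a hedged toy sketch of full-batch gradient descent on a one-hidden-layer net fitting y = x^2 (numpy only, arbitrary sizes and learning rate):

```python
import numpy as np

# Toy one-hidden-layer net trained by plain gradient descent to fit
# y = x^2 on [-1, 1]. Forward pass, two chain-rule steps, update: done.
rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 64).reshape(-1, 1)
y = x**2

W1, b1 = rng.normal(scale=0.5, size=(1, 16)), np.zeros(16)
W2, b2 = rng.normal(scale=0.5, size=(16, 1)), np.zeros(1)
lr = 0.1
for _ in range(2000):
    h = np.tanh(x @ W1 + b1)          # forward pass
    pred = h @ W2 + b2
    g = 2 * (pred - y) / len(x)       # d(MSE)/d(pred)
    gW2, gb2 = h.T @ g, g.sum(0)      # backprop through the output layer
    gh = (g @ W2.T) * (1 - h**2)      # backprop through the tanh layer
    gW1, gb1 = x.T @ gh, gh.sum(0)
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2

print(np.mean((pred - y) ** 2))  # small final MSE
```

Every line here is a vectorized matrix operation, which is exactly why the same recipe scales up and ports trivially to GPUs.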
I suppose the finding isn't novel. The contribution here might still be didactic: demystifying NNs by restating a fundamental notion in familiar terms. For example, it implicitly exposes (leaving the insight as a payoff for the learner) that "OTOH you have to design features yourself [for the polyreg, because ...]". Although, since "feature engineering" is a buzzword in NN development too, I still don't see the difference. Indeed, the paper implies there is none except for the approach and terminology.
Essentially, they argue that the traditional terminology should not be discarded in favour of the new, but rather understood in context. The understanding part is left to the reader, of course.
Relatedly, in a linear algebra course, Terry Tao's one-pixel camera was credited with the subsequent ML resurgence, while we were otherwise talking about convolutions, Fourier syntheses, wavelets and the like. It's no secret that linear algebra is a cornerstone of ... just a cornerstone, and that abstract algebra, topology and the like lie very near. At least this lecturer made it a main concern of the course to get that across, which fit in nicely with a logic course on universal algebra that I took in parallel, while others taught rather monotonously towards Taylor series and Fourier transforms. At any rate, pretty much all professional researchers in the field stress that the mathematical basis needs to be understood.
PS: The paper is written with an undergrad, whatever that means; no offence, I really don't know the slang, much less how much study vs research this implies. The publishing coauthor slash blog host shows some resentment against newfangled, fanciful terminology in the blog's About section, which might explain the scope of the paper, as well as the intended extent, as far as I outlined my impressions above.
Your criticism is not quite correct, insofar as the blog post states the conclusions of the paper explicitly, which seem to be about explaining common pitfalls in statistical terms.
PPS: Many observers lament that the results of NNs are intractable, nigh impossible to verify. This is a strong contrast to mathematical rigor. Hoping for the traditional methodology to get up to speed is in principle justified. I'm sure that your remark about design by hand being intractable holds as well, I'm just not sure to what extent. Showing that it can be done reasonably for some problems is a start, and chronicling that endeavor is par for the course, however perhaps not yet enough, I guess, as another comment asks for benchmarks.
Those problems are precisely the ones that this paper doesn't evaluate (image recognition, text processing, etc.). In most other domains, tree-based models crush neural nets and statistical regression methods, and so this whole thing just ends up being an academic exercise.
NNAEPR suggests that one may abandon using NNs altogether, and simply use PR instead.
> Our work so far has primarily been on general feedforward NNs. Our investigations have not yet involved much on specialized networks such as convolutional NNs (CNNs) for image classification, recurrent NNs (RNNs) for text processing, and so on. Though we intend to adapt our ideas to these frameworks, we view them as separate, orthogonal issues.
The authors should read about dynamical systems. I don't understand why the physics/controls folks, DSP folks, and ML folks don't read each other's papers. They're using the same math and using the same concepts. The latter two groups seem to be working together more often however.
NNs and the associated toolkit shine with structured high-dimensional data where CNNs, RNNs, or modern shenanigans like Transformer networks excel. I sincerely doubt that these networks turn out to be reducible to polynomial regression in any practically useful sense of the notion. But who knows.
If this seems backwards to you (as it did to me at first), note that unstructured data can be captured raw from instruments like cameras and microphones, while structured data usually involves a programmer coding exactly what ends up in each variable.
As you say, deep neural networks based on CNNs are SOTA on unstructured image data, RNNs are SOTA on unstructured voice and text data, while tree models like random forests and boosted trees are usually SOTA on problems involving structured data. The reason seems to be that the inductive biases inherent to CNNs and RNNs, such as translation invariance, are a good fit for the natural structure of such data, while the strong ability of trees to find rules is well suited to data where every variable is cleanly and unambiguously coded.
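That translation bias can be shown in a couple of lines: convolution commutes with shifts, so a CNN layer's response to a shifted input is just the shifted response. A small numpy sketch (toy signal, arbitrary kernel):

```python
import numpy as np

# Convolution is translation-equivariant: filtering a shifted signal
# gives the shifted output -- the inductive bias CNN layers inherit.
signal = np.array([0.0, 0, 1, 2, 1, 0, 0, 0])
kernel = np.array([1.0, -1.0])

out = np.convolve(signal, kernel, mode="full")
out_shifted_input = np.convolve(np.roll(signal, 2), kernel, mode="full")

print(np.allclose(np.roll(out, 2), out_shifted_input))  # True
```

A flat polynomial regression on raw pixels has no such built-in constraint; it would have to relearn every feature at every position.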
It's widely known to practitioners that vanilla neural nets don't do very well on small datasets because of overfitting issues -- which can probably be corrected with very careful regularization and parameter optimization, but where generally it's best to just use simpler methods.
Polynomial regression is one of those, but I'm not sure how it compares to other methods such as random forests and KNN variations.
Those methods don't scale very well, in particular PR (if I'm interpreting correctly) doesn't scale, because depth is essential to solving complex problems with computational efficiency.
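A toy version of that depth argument: composing the one-parameter map f(x) = x^2 with itself k times represents x^(2^k) with k "layers", while a flat polynomial for the same function carries 2^k + 1 coefficients. A hedged sketch:

```python
# Depth buys exponential degree for linear parameter count:
# k nested squarings represent x**(2**k), while a flat polynomial
# of that function needs 2**k + 1 coefficients.
def deep(x, k):
    for _ in range(k):
        x = x**2
    return x

k = 10
x = 1.001
flat_degree = 2**k  # 1024: the degree a flat polynomial must carry
print(deep(x, k), x**flat_degree, flat_degree)
```

It's only a caricature of the real depth-separation results, but it shows why "just use a wide shallow polynomial" runs out of steam on compositional problems.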
The researchers conveniently refer to those more complex tasks as a 'work in progress'... when they are what NNs are mostly used for (i.e. it starts getting interesting with CIFAR-10). That said, I would be curious to see performance of deeper polynomial regression networks. They become almost identical to NNs then, but maybe some methods of PR could be extended to give an edge in some cases (e.g. perhaps better regularization techniques than dropout/etc).
This is an interesting insight. Future work is needed, though. The vast majority of impressive results using NNs have been achieved by specialized networks, with special mention for resnets in vision [and Go] and transformers for NLP.
The main interesting finding for me was the utility of variance inflation factors in debugging neural networks. But not only does the thing lack a lot of polish (what is their metric for "accuracy"?), it's also still quite "unscholarly", most noticeably lacking:
* serious comparisons with classic literature benchmarks (I'm not even saying state-of-the-art) of NNs rather than whatever they were able to find with R. (Who uses R in academic machine learning?)
* a better-developed argument for why Taylor approximation is good enough. I'm not asking for rigor, but an article about reinforced concrete that just asserted what they assert wouldn't be enough. A workable engineering-type Taylor argument would say something like: all minima are locally like parabolas, and h ~= 0.01 is a very conservative upper bound on the needed step because the material is almost 100 times as strong as required and we only need a factor of 10 for this calculation.
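For concreteness, here is the flavor of bound that kind of argument would rest on (my own illustration, not from the paper): near a smooth extremum the quadratic Taylor term dominates, with the error controlled by the next even term.

```python
import math

# Near a smooth extremum a function is locally a parabola: for
# f(x) = cos(x) around x = 0, cos(x) ~ 1 - x**2/2, with the error
# bounded by the next Taylor term, x**4/24.
h = 0.01
approx = 1 - h**2 / 2
err = abs(math.cos(h) - approx)
print(err, h**4 / 24)  # the error sits inside the fourth-order bound
```

An engineering argument would then compare that bound against the accuracy the application actually needs.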
* reproducibility, reproducibility, reproducibility. Don't R people have Jupyter notebooks? Does their Keras package not serialize network structure and weights? Keras for Python does.
Yes, though I prefer Rmarkdown documents.
Very well noted. There's also this, which is usually taught in second-year engineering "numerical calculus" and studied again extensively in graduate programs involving [something numbers inside computers something]:
Most neural models don't use spikes or dynamical systems, or some of the most intricate stuff going on in neurons.
Our brains grow as we mature and along the way reinforce pathways as we internalize information. Our interpretation of the world is internal and unique to our individual consciousness.
The part that is "trained" is how we relate that information externally to our environment and other entities.
A child says "Da da" because it learns that sound elicits a reaction. The internal representation of its father is not trained. It is where the neurons grew and reinforced.
Maybe poorly communicated on my part to people who know much more about it than me. Tl;dr: maybe focus less on perfecting the training of recognition and more on the training of mapping the external to what was naturally recognized where the neurons fell.