
Neural Networks Are Essentially Polynomial Regression - ibobev
https://matloff.wordpress.com/2018/06/20/neural-networks-are-essentially-polynomial-regression/
======
cs702
How and why did this end up on the homepage again?[a][b]

This is a silly claim, backed only by _tiny toy experiments_ on _tiny toy
datasets_ with _tiny toy neural networks_. See links to the paper in [a] and
[b] below.

Using the same logic as the authors of the paper, I could make the claim that
neural nets are "essentially the same" as _any_ universal approximator (GPs,
SVMs, RFs, GLMs, etc.) and then run a few experiments only on tiny toy
problems to back up my claim.

The authors should try to fit a polynomial regression model to cognitive tasks
such as language translation, image generation, or modeling the game of Go...
they won't get very far without running into the curse of dimensionality.

Please stop posting this on HN.

[a]
[https://news.ycombinator.com/item?id=17385383](https://news.ycombinator.com/item?id=17385383)

[b]
[https://news.ycombinator.com/item?id=19145706](https://news.ycombinator.com/item?id=19145706)

~~~
knowuh
I came here to say this ⬆ Glad someone else did it for me, better than I could
have.

------
goodside
This feels overly sensationalist.

It's no surprise that in domains where low-parameter models work at all, they
often work better than NNs. In econometric forecasting, where the traditional
methods encapsulate more prior knowledge than could ever be inferred from the
data, neural networks aren't anywhere near competitive. Even in domains where
NNs are state-of-the-art, they might not be worth the training costs over GLMs
or gradient-boosted forests or whatever.

But there are many other domains, like raw waveform modeling, large natural
images ( _not_ MNIST), or NLP, where anything that isn't neural fails
miserably. These are the domains where people are most excited about neural
networks — problems that otherwise have no solution at all. Neural networks
won't help you get a better interpretation of your n=25 daily-resolution
cohort study, or predict user retention from your 300-column, 20K-row Excel
spreadsheet.

In their paper, the authors dismiss most of NN's greatest recent successes as
being "specialized" applications that only work because they incorporate
tricks like recurrence and convolution, and not because they are neural. This
is a much clearer, bolder claim, but they hardly support it as far as I can
tell. The things that they call "specialized" (RNNs and CNNs!) are what most
would recognize as the bread-and-butter of neural network design.

~~~
ummonk
Right, the whole point is that neural networks let you apply tricks like
convolution or recurrence to capture a lot of the structure in the model
domain and exponentially reduce the search space for your model when dealing
with large inputs like detailed images or waveforms.
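
A rough back-of-the-envelope illustration of that reduction, with hypothetical
layer sizes (nothing from the paper), in Python:

    # Parameter counts for a 256x256 grayscale input, ignoring biases.
    h, w = 256, 256

    # Fully connected layer mapping the flattened image to a same-sized output:
    dense_params = (h * w) * (h * w)   # ~4.3 billion weights

    # A single 3x3 convolution with 64 output channels over the same image:
    conv_params = 3 * 3 * 1 * 64       # 576 weights, shared across all positions

    print(f"dense: {dense_params:,}  conv: {conv_params:,}")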

~~~
posterboy
Could one say, in a sense, that recurrence is essentially differential
equations and convolution essentially more complicated operations than those
of arithmetic polynomials?

This might sound like nonsense. On the one hand, most trivial convolutions use
trivial operators; "polynomials" might include higher operations anyhow, or
approximate some of the more important ones, none of which is appealing if
simplicity equals efficiency. On the other hand, I never really understood
diff-eqs; ODEs seem like polynomials over self-similar polynomials to me, hence
"recurrent"; all the other diff-eqs I can't begin to fathom.

~~~
apl
There are many perspectives on everything. Deep ConvNets, for instance, can be
expressed as a continuously evolving ODE. Here's a fantastic paper on this
view:

[https://papers.nips.cc/paper/7892-neural-ordinary-differential-equations](https://papers.nips.cc/paper/7892-neural-ordinary-differential-equations)
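
For a flavor of the connection: a residual update x_{t+1} = x_t + f(x_t) is a
forward-Euler step of dx/dt = f(x). A tiny Python sketch with a toy f (not the
paper's actual architecture):

    import numpy as np

    rng = np.random.default_rng(0)
    W = 0.1 * rng.normal(size=(4, 4))

    def f(x):
        # Toy vector field standing in for one residual block's transformation.
        return np.tanh(W @ x)

    x0 = rng.normal(size=4)

    # Residual network: repeated x <- x + f(x), i.e. Euler steps with dt = 1.
    x = x0.copy()
    for _ in range(10):
        x = x + f(x)

    # ODE view: integrate the same field dx/dt = f(x) with smaller steps.
    y = x0.copy()
    for _ in range(100):
        y = y + 0.1 * f(y)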

------
wish5031
At the beginning the authors claim they're not just making a restatement of
the Stone-Weierstrass theorem (any continuous function on a compact set can be
approximated arbitrarily well by a polynomial), but reading through, I'm not
sure what they're proving besides that. In fact, on pages 6-7 they simply
appeal to that theorem before stating "NNs can loosely be viewed as a form of
polynomial regression". Most of the rest of the paper is comparing polyreg and
NNs on various datasets. Sec. 9 doesn't have anything
especially novel in it. For instance, in 9.3 they say they'll explore the
causes of overfitting in the context of their "NN <-> PR" principle, but never
actually do so...

Polynomial regression is nice because it's a little easier to interpret, and
it's also a convex problem with a single, global minimum. OTOH you have to
design features yourself, otherwise for high-dimensional problems polyreg
quickly requires way too much memory and compute to solve in a reasonable
amount of time.
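
To put rough numbers on that blow-up: a full polynomial of degree d in n
variables has C(n + d, d) terms, which explodes quickly (illustrative sizes,
not the paper's experiments):

    from math import comb

    def n_terms(n_features, degree):
        # Number of monomials of degree <= `degree` in `n_features` variables.
        return comb(n_features + degree, degree)

    print(n_terms(10, 2))    #         66
    print(n_terms(100, 3))   #    176,851
    print(n_terms(784, 3))   # ~80.9 million (e.g. raw MNIST pixels at degree 3)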

This paper might have been more interesting if it had somehow connected neural
nets to those facts, or if it had shown how insight on the way in which neural
nets work can be used to improve polyreg. But (admittedly, from a brief
reading) I don't see anything like that here.

~~~
olooney
I agree. All I see here is: "We know any reasonably smooth function can be
approximated by a neural net, likewise it can be approximated by a polynomial.
So they're the same!" Imagine how it's going to blow their minds when they
find out about Fourier transforms or wavelets!

There are lots of ways to approximate functions; the property of NNs that makes
them attractive for ML isn't the universal approximation theorem. It's that
there's a fast, robust method of training them that's easy to implement, easy
to vectorize, easy to parallelize, and easy to customize for different
applications.

------
rcar
This is an essentially meaningless finding. Neural networks had the dark days
they did in the late 90's/early 2000's because computational power didn't
allow for the depth they needed to be applicable to most problems. Since GPUs
caught up to the method, deep neural nets have thrived on some problems.

Those problems are precisely the ones that this paper doesn't evaluate (image
recognition, text processing, etc.). In most other domains, tree-based models
crush neural nets and statistical regression methods, and so this whole thing
just ends up being an academic exercise.

~~~
thanatropism
It's not a finding. They never compare their results to literature benchmarks,
just to the networks they've managed to estimate in R. They also say that
they've charitably refrained from reporting the "worst results", but fail to
present a summary of the quality of results across the many networks they ran.

~~~
rcar
I agree, though they do make the pretty strong statement of

    NNAEPR suggests that one may abandon using NNs altogether, and simply use PR instead.

in the blog post, which sounds to me like they're going for a finding.

------
thermonot
Important point:

> _Our work so far has primarily been on general feedforward NNs. Our
> investigations have not yet involved much on specialized networks such as
> convolutional NNs (CNNs) for image classification, recurrent NNs (RNNs) for
> text processing, and so on. Though we intend to adapt our ideas to these
> frameworks, we view them as separate, orthogonal issues._

~~~
ta0xdeadbeef
>Though we intend to adapt our ideas to these frameworks, we view them as
separate, orthogonal issues.

The authors should read about dynamical systems. I don't understand why the
physics/controls folks, DSP folks, and ML folks don't read each other's
papers. They're using the same math and the same concepts. The latter two
groups seem to be working together more often, however.

------
apl
Their applications are tabular data (for which MLPs have never been the method
of choice) and MNIST (which I could classify at 85% with a rusty nail), so
it's not super impressive.

NNs and the associated toolkit shine with structured high-dimensional data
where CNNs, RNNs, or modern shenanigans like Transformer networks excel. I
sincerely doubt that these networks turn out to be reducible to polynomial
regression in any practically useful sense of the notion. But who knows.

~~~
olooney
Terminology note: data like images and voice which have strong spatial or
temporal patterns are actually referred to as "unstructured" data; while data
you get from running "SELECT * FROM some_table" or the carefully designed
variables of a clinical trial are referred to as "structured" data.

If this seems backwards to you (as it did to me at first) note that
unstructured data can be captured raw from instruments like cameras and
microphones, while structured data usually involved a programmer coding
exactly what ends up in each variable.

As you say, deep neural networks based on CNNs are SOTA on unstructured image
data, RNNs are SOTA on unstructured voice and text data, while tree models
like random forests and boosted trees are usually SOTA on problems involving
structured data. The reason seems to be that the inductive biases inherent
to CNNs and RNNs, such as translation invariance, are a good fit for the
natural structure of such data, while the strong ability of trees to find
rules is well suited to data where every variable is cleanly and unambiguously
coded.

~~~
apl
Yeah, that's right. Doing too little proofreading with HN comments...

------
darkmighty
Is the relevant model here simply a sum of 2nd order terms?

It's widely known to practitioners that vanilla neural nets don't do very well
on small datasets because of overfitting issues -- which can probably be
corrected with very careful regularization and parameter optimization, but
where it's generally best to just use simpler methods.

Polynomial regression is one of those, but I'm not sure how it compares to
other methods such as random forests and KNN variations.

Those methods don't scale very well; in particular, PR (if I'm interpreting it
correctly) doesn't scale, because depth is essential to solving complex
problems with computational efficiency.

The researchers conveniently refer to those more complex tasks as a 'work in
progress'... when they are what NNs are mostly used for (i.e. it starts
getting interesting with CIFAR-10). That said, I would be curious to see the
performance of deeper polynomial regression networks. They become almost
identical to NNs then, but maybe some methods of PR could be extended to give
an edge in some cases (e.g. perhaps better regularization techniques than
dropout/etc).
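
For reference, a minimal Python sketch of the "sum of 2nd order terms" model
asked about above -- all linear, squared, and pairwise-interaction features fit
by ordinary least squares (toy data, not the paper's setup):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    y = 1.0 + X[:, 0] - 2 * X[:, 1] * X[:, 2] + 0.1 * rng.normal(size=200)

    def degree2_features(X):
        # Constant, linear, squared, and pairwise-interaction terms.
        n, p = X.shape
        cols = [np.ones(n)] + [X[:, i] for i in range(p)]
        cols += [X[:, i] * X[:, j] for i in range(p) for j in range(i, p)]
        return np.column_stack(cols)

    Phi = degree2_features(X)
    beta, *_ = np.linalg.lstsq(Phi, y, rcond=None)  # convex; single global optimum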

------
mxwsn
Previous discussion (June 24, 2018):
[https://news.ycombinator.com/item?id=17385383](https://news.ycombinator.com/item?id=17385383)

------
gbrown
Dismissing more interesting architectures and ignoring the different
approaches to fitting makes this seem very incomplete. I've long thought that
neural networks are principally useful when the architecture can be chosen to
exploit structural features of a problem - it hardly seems surprising that
feed-forward neural networks with a given architecture have an alternate
representation as polynomial regression. The focus on multicollinearity is
also sort of odd here - I usually take it as a given that neural networks
provide an over-complete basis which only works due to various kinds of
regularization (and the use of optimization techniques like gradient
descent/back-propagation, which aren't really concerned with globally optimal
parameter estimates, but rather with producing good fitted values/predictions).

------
pacala
> Our work so far has primarily been on general feedforward NNs. Our
> investigations have not yet involved much on specialized networks such as
> convolutional NNs (CNNs) for image classification, recurrent NNs (RNNs) for
> text processing, and so on. Though we intend to adapt our ideas to these
> frameworks, we view them as separate, orthogonal issues.

This is an interesting insight. Future work is needed, though. The vast majority
of impressive results using NNs have been achieved by specialized networks,
with special mention for resnets in vision [and Go] and transformers for NLP.

------
thanatropism
This is a... preprint. It's clearly not ready for prime time yet.

The main interesting finding for me was the utility of variance inflation
factors in debugging neural networks. But not only does the thing lack a lot of
polish (what is their metric for "accuracy"?), it's also still quite
"unscholarly", most noticeably lacking:

* serious comparisons with classic literature benchmarks (I'm not even saying state-of-the-art) of NNs rather than whatever they were able to find with R. (Who uses R in academic machine learning?)

* a better-developed argument for why a Taylor approximation is good enough. I'm not asking for rigor, but an article about reinforced concrete that just asserted what they assert wouldn't be enough. An example of a workable engineering-type Taylor argument would be something that says all minima are locally like parabolas and h ~= 0.01 is a very conservative upper bound for the needed approximation -- the way you'd argue the material is almost 100 times as strong as required and we only need a factor of 10 for this calculation.

* reproducibility, reproducibility, reproducibility. Don't R people have Jupyter notebooks? Does their Keras package not serialize network structure and weights? Keras for Python does.

~~~
gbrown
> Don't R people have Jupyter notebooks?

Yes, though I prefer Rmarkdown documents.

------
carapace
They revised the paper in response to earlier, uh, response.

[https://matloff.wordpress.com/2018/07/01/update-on-polynomial-regression-in-lieu-of-neural-nets/](https://matloff.wordpress.com/2018/07/01/update-on-polynomial-regression-in-lieu-of-neural-nets/)

[https://arxiv.org/abs/1806.06850](https://arxiv.org/abs/1806.06850)

------
excessive
I suppose it depends on the activation function, but many of them are bounded.
Polynomials are unbounded at the extremes, and as such should really only be
trusted (sometimes) for interpolation, not extrapolation. If you get a new
input which is outside the radius of your training set, the answer could be
pretty extreme and not supported by the training data.
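
A quick Python illustration of that failure mode (toy data, assumed setup):

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(-1, 1, 50)
    y = np.tanh(3 * x) + 0.05 * rng.normal(size=x.size)  # bounded target

    coeffs = np.polyfit(x, y, deg=9)

    print(np.polyval(coeffs, 0.5))  # inside the training range: close to tanh(1.5)
    print(np.polyval(coeffs, 3.0))  # outside it: the x^9 term takes over and the
                                    # prediction lands wildly far from the true ~1.0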

~~~
thanatropism
> Polynomials are unbounded at the extremes, and as such should really only be
> trusted for interpolation, not extrapolation.

Very well noted. There's also this, which is usually taught in second-year
engineering "numerical calculus" and again extensively studied in graduate
programs involving [something numbers inside computers something]:

[https://en.wikipedia.org/wiki/Runge's_phenomenon](https://en.wikipedia.org/wiki/Runge's_phenomenon)
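
A small Python demo of the effect, with arbitrary degree choices: interpolating
1/(1 + 25x^2) at equispaced nodes gets worse, not better, as the degree goes
up, with wild oscillations near the endpoints.

    import numpy as np

    def f(x):
        return 1.0 / (1.0 + 25.0 * x**2)

    dense = np.linspace(-1, 1, 1000)
    for deg in (5, 9, 13):
        nodes = np.linspace(-1, 1, deg + 1)        # equispaced interpolation nodes
        coeffs = np.polyfit(nodes, f(nodes), deg)  # exact interpolation here
        err = np.max(np.abs(np.polyval(coeffs, dense) - f(dense)))
        print(deg, err)                            # max error grows with degree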

~~~
excessive
That topic is exactly why I included the "sometimes" qualification :-)

------
peterwoerner
ANNs aren't that much different from polynomial regression; the issue is that
there are upwards of millions of parameters. Totally intractable for the human
mind.

------
bobinaz
They are also almost exactly functional composition. Were it not for the
activation terms, this composition would reduce to a simple linear equation.
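
A quick Python check of that collapse, with random matrices standing in for the
layer weights:

    import numpy as np

    rng = np.random.default_rng(0)
    W1, W2, W3 = (rng.normal(size=(4, 4)) for _ in range(3))
    x = rng.normal(size=4)

    layered = W3 @ (W2 @ (W1 @ x))    # three "layers", no activations
    collapsed = (W3 @ W2 @ W1) @ x    # one equivalent linear map
    print(np.allclose(layered, collapsed))  # True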

------
rahulmehta95
This isn't a serious ML result; it reads more like course notes/toy
experiments than a serious research effort. The funny thing is, this is
still only for tabular data; we don't see any actual classification results
with any proper robust metrics (also the PCA setup seems a bit bizarre in this
setting).

------
kgwgk
The actual blog post: [https://matloff.wordpress.com/2018/06/20/neural-networks-are-essentially-polynomial-regression/](https://matloff.wordpress.com/2018/06/20/neural-networks-are-essentially-polynomial-regression/)

~~~
sctb
Thanks, we've updated the link from [https://www.r-bloggers.com/neural-networks-are-essentially-polynomial-regression/](https://www.r-bloggers.com/neural-networks-are-essentially-polynomial-regression/).

------
cozzyd
I mean both are ~general classes of functions, so not surprising they can
represent each other?

------
29athrowaway
A biological neural network does much more than neural network models do. How
much more, and why? We don't even know.

Most neural models don't use spikes or dynamical systems, or some of the most
intricate stuff going on in neurons.

------
cabaalis
I'm not an expert in ML. However, with what little I do know, I wonder how
well "training" a neural network really mimics the brain.

Our brains grow as we mature and along the way reinforce pathways as we
internalize information. Our interpretation of the world is internal and
unique to our individual consciousness.

The part that is "trained" is how we relate that information externally to our
environment and other entities.

A child says "Da da" because it learns that sound elicits a reaction. The
internal representation of its father is not trained. It is where the neurons
grew and reinforced.

Maybe poorly communicated on my part to people who know much more about it
than me. Tl;dr: maybe focus less on perfecting the training of recognition and
more on the training of mapping the external to what was naturally recognized
where the neurons fell.

~~~
otabdeveloper2
The so-called 'neural networks' have nothing to do with the brain. They're
only called that as a branding exercise: math is scary and boring, while
biology and brains are warm and fuzzy.

