
Neural Networks Are Essentially Polynomial Regression (2018) - ruifigueiredo
https://matloff.wordpress.com/2018/06/20/neural-networks-are-essentially-polynomial-regression/
======
cs702
How the heck did this end up on the homepage again?[a]

This paper makes broad grandiose claims, backed only by tiny toy experiments
on tiny toy datasets with tiny toy neural networks.

By the same logic, I could claim that neural networks are equivalent to
_lookup tables_ , and then run a few experiments showing how look up tables
are a better choice for tiny toy problems in which the answer for every
possible combination of inputs is known in advance. That would be silly,
wouldn't it?

I bet the author's polynomial regression library would do _horrendously bad_
compared to SOTA deep models in cognitive tasks such as language translation,
photo generation, unsupervised language modeling, playing Go, Dota2, or a
similarly difficult game in a reinforcement learning setting, and so on.

[a]
[https://news.ycombinator.com/item?id=17385383](https://news.ycombinator.com/item?id=17385383)

------
gambler
There are dozens of papers "explaining" neural networks, using basic math,
highly abstract math and even theoretical quantum physics[1] models. I don't
see any of them used in constructing, improving or interpreting results of
actual neural networks[0]. Moreover, while most of them have math that would
take me weeks to fully understand, they rarely make concrete, testable
predictions.

Most of the recent advances in ANN learning came from using more data and
using different types of layers. A real theory of neural nets would explain
why architecture A works better than architecture B for a particular problem.
Or why a certain architecture plateaus after working with X samples, while
other can learn from many more.

The way I understand it, the core problem of understanding deep learning is
that it does not scale down. You can make a simple framework of 3-4 neurons
and observe how it works. Then you make it bigger and it works _differently_.
No one _properly_ explained why this happens. People who do applications seem
to just state their intuitions and then throw a couple formulas on top to make
it seem scientific. There is a great essay on that by Piekniewski[2]. No
research I've seen lists all the parameters/architectures/datasets they _tried
that didn 't work_.

[0] - Maybe with the exception of the paper on neural networks as ODEs.

[1] -
[http://neurophysics.huji.ac.il/sites/default/files/PhysRevA....](http://neurophysics.huji.ac.il/sites/default/files/PhysRevA.32.1007.pdf)

[2] - [https://blog.piekniewski.info/2018/10/13/deep-learning-
the-w...](https://blog.piekniewski.info/2018/10/13/deep-learning-the-why-
question/)

~~~
dnautics
Really? Can you honestly say with a straight face that there is no theory
behind why convnets are the right choice? Or no theory behind how an LSTM is
structured?

~~~
zazen
There is plenty of room to be dissatisfied with the current state of deep
learning theory without making absolute claims that there is "no theory"
whatsoever, which GP clearly isn't.

If you wish to challenge GP's comment, you could do so constructively by
discussing any specific theoretical contributions you feel GP may have
overlooked.

~~~
dnautics
> You can make a simple framework of 3-4 neurons and observe how it works.
> Then you make it bigger and it works differently. No one properly explained
> why this happens.

~~~
minkzilla
Can you cite an explanation? I'd love to read one.

~~~
dnautics
Sure. you can _watch_ one that's in very easy to understand plain english.

prereq:
[https://www.youtube.com/watch?v=C_zFhWdM4ic](https://www.youtube.com/watch?v=C_zFhWdM4ic)

substance:
[https://www.youtube.com/watch?v=py5byOOHZM8](https://www.youtube.com/watch?v=py5byOOHZM8)

I mean what is the difference between like using 50 layers versus 75 layers?
Ok, yeah we don't know, but to presume that we don't understand _anything_
beyond 3-4 neurons is just nonsense.

By comparison. Go look at any synthetic chemistry paper. There's going to be
crazy procedures. One time Phil Baran used Xenon Tetrafluoride as an oxidizing
agent. And yeah, for _some reason_ none of the other standard oxidizing agents
worked. We don't know why. Hell it could have been that the grad student had
bad hands with one of the standard ones. We can make a SWAG an to why XeF4
worked. But we'd be fooling ourselves to claim "we know", so we don't say we
do. Does that discredit synthetic chemistry as an enterprise? Does that mean
synthetic chemistry is not backed by theory?

~~~
zazen
That video is an introduction to the concept of CNNs for beginners. It
explains in informal terms the structure and mechanics of a CNN, and probably
gets around to discussing why it might work in very broad terms.

This is extremely far from what academically inclined people are talking about
when they are using the word "theory".

People are hoping to see, ultimately, mathematical proofs that neural networks
can e.g. efficiently learn target functions given a certain amount of training
data and a certain number of trainable parameters.

To address your chemistry analogy: no, it doesn't discredit synthetic
chemistry as an enterprise, it just means there is a gap in the theory.
Similarly, the people bemoaning gaps in the deep learning theory aren't trying
to discredit deep learning as an enterprise: the practical successes are very
real; the theory is just lagging, as it often has in application-led fields.

~~~
dnautics
> It explains in informal terms the structure and mechanics of a CNN

It actually gets quite formal. you could implement a (non-performant) CNN from
everything in that video. The point is, that someone came up with a hypothesis
"hey imposing this structure should yield faster convergence to usable
results" is totally inductive science, and that is able to be explained quite
well and simply from simple primitives that you don't need a math degree to
understand really indicates some level of _maturity_ in the field.

>People are hoping to see, ultimately, mathematical proofs that neural
networks can e.g. efficiently learn target functions given a certain amount of
training data and a certain number of trainable parameters.

yeah I don't dispute that, but to characterize our understanding of machine
learning as not going beyond 3 or 4 neurons is just plain crazy.

Moreover, for some RNNs, e.g. character-based LSTMs trained on language
models, we can extract feature details. I recall one case (sorry not bothered
to find it) where there was a memory line dedicated to detetcting opening and
closing quotation marks. Now this is a feature that one would _expect_ to be
obvious, or at least more so than other language features, but it's still a
rather high level (and successful) attempt to understand how these machines
function.

~~~
gambler
_> but to characterize our understanding of machine learning as not going
beyond 3 or 4 neurons is just plain crazy._

There is a difference between understanding _how_ something works and
understanding _why_ it works. Completely different levels of predictive power.

~~~
dnautics
I'm sorry, lstms and convnets would seem to represent a rather high level of
predictive power, given that someone postulated that they would work, and it
turned out they were correct.

------
myWindoonn
At risk of embarrassing the rest of the thread, please note that the paper was
released with an R package [0] which performs the polynomial regression
algorithm, and additionally an entire chapter of the paper is dedicated to
running this package against two other R packages which implement neural
networks, with competitive results.

[0] [https://github.com/matloff/polyreg](https://github.com/matloff/polyreg)

~~~
LolWolf
I don't know if it's at all embarrassing (though there are some HNers that
definitely did not at least skim the paper, which is a whole other thing).

The examples they gave in the paper were all tiny: MNIST was the biggest with
784 features (Crossfit has <40, Forest has <60, Taxi has ~11), so this is an
absurd comparison. You can get better results than all of the models
presented, in much less time, using random kitchen sinks (also sometimes known
as 'random features') [0].

Plus, it's a little weird to me that their polynomial regression "runs out of
memory"—they're solving a least squares problem for which we have a _ton_ of
good algorithms for. More specifically, there's no need to generate the data
all at once (e.g., why they mention the "polynomial matrix would be huge")
much less run QR factorization on this super dense matrix... rather, you would
do this in a streaming fashion (e.g., generate the points, one by one, and
proceed via SGD, or a nice pre-conditioned version of it since we have a good
approximation of the Hessian via some simple streaming dimensionality
reduction).

At the risk of sounding a bit harsh, this paper reads like someone recently
found out about least squares and tried a few of the bases until they realized
that simple things do pretty well in small cases when compared to fancy models
like NNs (which is something almost every researcher that I work with
knows—though I don't work directly in ML).

\---

[0]
[https://people.eecs.berkeley.edu/~brecht/papers/08.rah.rec.n...](https://people.eecs.berkeley.edu/~brecht/papers/08.rah.rec.nips.pdf)
, perhaps a shorter, much simpler exposition can be found in 11.3 of here:
[https://web.stanford.edu/~boyd/vmls/](https://web.stanford.edu/~boyd/vmls/)
under "Random Features."

~~~
myWindoonn
Indeed, there is the unfortunate fact that directly pointing out that a
particular person has not read the paper is against HN's rules. Let us not
violate the Prime Directive further.

If I understand the origin of this paper correctly, it is mainly the idea and
work of an undergraduate student. From this perspective, several problems
become more understandable. The limited computation done in support of the
paper may have been done by a setup as meager as a lone student's laptop. The
naïve view of machine learning as a field could be explained by undergraduate
eagerness; indeed, perhaps they _did_ "recently [find] out about least
squares".

I agree with your thoughts on SGD. The authors did do PCA for some of their
tests, but this entire setup should be buildable in a more efficient and
incremental way.

I think, however, that the reason that this paper got the other three co-
authors on it, despite its controversial nature, is that the core of the
original idea is interesting and has yielded related fruit like UMAP [0] in
the past. We should be more comfortable exploring it.

I like your book reference. It's informative and direct, without mumbo-jumbo.
To accelerate the lookup for the next follower, start at p279.

[0] [https://github.com/lmcinnes/umap](https://github.com/lmcinnes/umap)

------
lisper
When I was an AI grad student back in the late 80's and early 90's logic was
all the rage. People published papers about things like non-monotonic logics,
intentional vs extensional logics, higher-order logics, the frame problem,
circumscription... One thing that always bothered me about the field was that
the whole premise behind using logic as the basis for AI seemed to me to be
_obviously_ flawed because neither the inputs to nor the outputs from my brain
were logical propositions, they were images and sounds on the input side, and
sounds and gestures on the output side. Even my internal reality didn't seem
to consist of logical propositions. When I went to open the refrigerator to
make a sandwich, I didn't feel like I needed to prove any theorems in order to
do it. And I think subsequent developments have vindicated my skepticism.

Well, I have a similar objection to neural networks. Human learning is
fundamentally qualitatively different from contemporary neural network
learning. It is simply not the case that I learn what a cat looks like by
being shown a bunch of images of cats and having some meta-input that says
"label:cat". Instead, I see a cat -- and a whole bunch of other things,
including the environment that the cat happens to be in -- and I hear some
sounds, and I somehow manage to make sense of the whole melange and sort out
what a cat is and what the sound of my mother's voice is when she says,
"That's a cat!, and the fact that the "cat" part of the audio stream is a
label, and that it stands for this furry purry thing moving around in front of
me, and that by producing some outputs that cause _me_ to produce the same
sounds from my own body I get the reward of hearing mommy say, "That's right!"

What we call neural networks today are interesting and useful, but I don't
think they are even remotely close to anything interesting that the human
brain does, and so quibbling over whether or not they are "essentially
polynomial regression" or not is, IMHO, missing the point.

[EDIT: I had closed with "missing the point rather badly" but subsequent
comments have convinced me that this extra emphasis was unwarranted.]

~~~
breck
> It is simply not the case that I learn what a cat looks like by being shown
> a bunch of images of cats and having some meta-input that says "label:cat".

How do you know this? It seems to me this is likely how it works. If you spend
an hour playing with a cat, and you look at the cat with your eyes at 60hz,
you see about 5 million labeled augmented images of a cat.

~~~
short_sells_poo
Intuitively I feel like there is vastly more context involved in any
biological neural network learning, but particularly in the human brain. You
might be inclined to reduce the learning process to just a sequence of images,
but the internal reference system brings in an enormous amount of other
information that can get attached: sounds, smells, touch, taste, feelings,
desires and so on. These all shape not just the way the new information is
pre-processed but also how it is stored and in turn that newly stored memory
influences future learning. The amount of feedback loops and indirect inputs
is staggering.

Perhaps I'm naive, but the difference in apparent complexity and capability
between artificial neural networks as we use and understand them today, and
their biological counterparts seem vast.

I'm certainly not doing research into new forms of ML and AI algorithms, but I
have used the existing concepts to some degree and just intuitively it feels
like what we have today is still a very primitive approximation of what nature
has achieved over the course of evolution. It feels like it's not merely a
question of more data and more computational power, but we are missing
fundamental building blocks to actually go beyond fancy curve fitting.

Note that I'm not making an assumption that fancy curve fitting cannot
eventually give rise to human level AI, but certainly at the moment I see zero
evidence that it can.

~~~
jayd16
I think you assume too much. Humans can identify pictures of cats and I'd be
hard pressed to assume that sounds, smells, touch, taste and feel matter
during that evaluation.

I'm sure you could identify a large number of animals you've only seen on
film, a snow leopard perhaps.

------
Chabs
If I'm reading this right, the core of the argument is that since any
continuous function is can be approximated via a taylor series expansion, then
activation functions can be seen, in-effect, as polynomial in nature, and
since a neuron layer is a linear transformation followed by an activation
function, the the whole system is polynomial.

That's "technically" correct, but it feels like an academic cop-out.
Interesting/useful transfer functions tend to be functions that take very
large expansions to be approximated with any accuracy.

~~~
leereeves
That is a serious practical problem that we find even in this paper. The
authors were unable to fit a degree 3 polynomial to a subset (just 26
components) of MNIST data (handwritten digits) due to a "memory issue".

But mathematical theory need not be practical. The relation between NNs and
polynomial regression might be a fruitful theoretical observation even if the
equivalent polynomial regression is incalculable.

~~~
lalaithion
They were unable to fit a degree three polynomial, but were still able to
obtain a 97% accuracy with a degree two polynomial.

------
tw1010
I haven't read the paper, but how can that be true when the polynomials
produced by polynomial regression always converge at the "edges" (taking x far
left or far right) to either positive or negative infinity (unless the
polynomial is a constant), while DNNs can be built to have a sigmoid (or
gaussian) function at the output, which doesn't have the same problem (it
constrains the output to a range)?

Polynomials (without "activation functions" at the end) make for poor
classifiers.

(This is a practical problem I've actually had when trying to replace DNNs
with polynomial regression.)

~~~
cochne
It's not true. They make their main claim as follows: "For general activation
functions and implementations, we can at least say that the function is at
close to a polynomial, by appealing to the famous Stone-Weierstrass Theorem
[Katznelson and Rudin(1961)], which states that any continuous function on a
compact set can be approximated uniformly by polynomials.5 In any event, for
practical purposes here, we see that most activation functions can be
approximated by a polynomial. Then apply the same argument as above, which
implies"

Note that it's only true on a compact set, implying bounded set, implying it
will not be valid out to infinity, only on some bounded, closed set.

~~~
tw1010
Ok, but then that makes this a purely theoretical paper about something
already pretty well known. Almost all interesting problems that DNNs are
solving nowdays are some kind of classification problem (or with finite
discrete outputs), which polynomials are really unsuited for (unless there are
an infinite number of terms).

------
est31
The wordpress page doesn't mention it but with "Neural Networks" they appear
to mean "only" traditional deep neural networks, so no CNNs for example, which
they categorize as "preprocessing". Which is a legitimate view, I've seen it
expressed in other places as well, but if you hear "neural networks" then you
usually mean the entire thing and not just the fully connected and softmax
readout layers at the end.

~~~
czr
Conv layers are strictly special cases of FC layers (with respect to
expressive power).

> For any CONV layer there is an FC layer that implements the same forward
> function. The weight matrix would be a large matrix that is mostly zero
> except for at certain blocks (due to local connectivity) where the weights
> in many of the blocks are equal (due to parameter sharing).

(per [http://cs231n.github.io/convolutional-
networks/](http://cs231n.github.io/convolutional-networks/))

In general, if you can make assumptions about your problem and the form the
solution takes, you can find a good solution with fewer parameters and fewer
data, at the cost of suboptimal performance on problems that violate your
assumptions. Conv layers are an example of this.

~~~
est31
> The weight matrix would be a large matrix that is mostly zero except for at
> certain blocks (due to local connectivity) where the weights in many of the
> blocks are equal (due to parameter sharing).

It's correct that the inferrence mode of a convolutional neural network can be
reduced to the inference mode of a FC network. However, during training, you
need to make sure that the weights are the same, and FC networks don't ensure
that. So you need to train CNNs-embedded-in-FC slightly differently from
normal FCs, otherwise what you're getting is not an CNN any more because the
weights are different. What you need to do is to a) initialize all kernel
entries with the same weights and b) instead of applying the weight changes
directly, average the weight changes over all offsets and only then apply them
to the specific kernel positions.

~~~
czr
Absolutely–if you wanted to get an FC layer that satisfies Conv layer
constraints, you would need to perform gradient descent subject to those
constraints; normal gradient descent won't do that unless the actual optimum
is of convolutional form. That's why I said (with respect to expressive power)
:)

------
mettamage
Hmm interesting, in the well-known book Neural Networks and Deep Learning [1]
by Michael Nielsen, it is shown that neural networks can approximate any
function [2].

This leaves the question: are neural networks able to produce more functions
than simple polynomials? Apparently the answer of the summary of the mentioned
paper is no (I didn't read the actual paper, only the summary).

I simply only know the basics of neural networks (I fully understand chapter 1
and 2 of [1]), so I wouldn't know understand why the two are practically
equivalent. I mean, I can imagine why but I'd be making the fallacy of simply
understanding that a neural net is able to approximate any polynomial and then
think "yup, that's it, no clue how to generate counter examples, QED for all I
care O:)"

So with this in mind is anyone able to argue why neural nets are essentially
_only_ polynomial regression?

[1]
[http://neuralnetworksanddeeplearning.com/](http://neuralnetworksanddeeplearning.com/)

[2]
[http://neuralnetworksanddeeplearning.com/chap4.html](http://neuralnetworksanddeeplearning.com/chap4.html)

~~~
jefft255
They are not "equivalent". Polynomials can only produce polynomials, but these
polynomials can be made to be as close as you want to any continuous function
(see Stone-Weierstrass theorem). For example, a NN with a ReLU activation
function produce a function that is not a polynomial.

Now I understand that this not the main point; both polynomials and NN can
approximate any function that is not artificially nasty. The only thing is
that a NN is easier to optimize and generalizes better for high dimensional
problems such as image classification.

The issue is more complicated than simply expressiveness. The author claim
that "one should use PR instead" seems dubious at best, but I'm now going to
read the paper and figure it out.

~~~
cochne
It's not true that they have the same expressiveness on the real line, -inf to
+inf

~~~
jefft255
You're right! When talking about function approximation most theorems work on
compact sets only (for most purposes that's a good enough hypothesis).

------
typicalguy
Very interesting topic. Indeed, one way to interpret the neural networks is to
design the deep architecture interpretable at the first place. Namely, one can
design the structure and activation functions of each layer to mimic one step
of gradient descent. Then, K layer of this tailored DNN resembles K iterations
of an optimization algorithm, e.g., see [1-3]. But still, one need to prove
the convergence of such networks (empirical results show very promising
convergence speed).

[1] [https://arxiv.org/abs/1812.00797](https://arxiv.org/abs/1812.00797)

[2]
[https://openreview.net/pdf?id=B1lnzn0ctQ](https://openreview.net/pdf?id=B1lnzn0ctQ)

[3] [https://papers.nips.cc/paper/8120-theoretical-linear-
converg...](https://papers.nips.cc/paper/8120-theoretical-linear-convergence-
of-unfolded-ista-and-its-practical-weights-and-thresholds.pdf)

------
gubbrora
Back before deep learning was all the rage textbooks used to argue that two
layers were enough because a two layer network can approximate anything
arbitrarily well.

They were right and wrong. Two layer networks CAN indeed approximate any
function arbitrarily well. They just do a piss poor job of it. It can take
exponentially more para meters than a better formulation[0].

The lesson is that representation matters a lot. There are lots of ways to
construct _universal approximators_ but they are not made equal.

[0]
[https://www.researchgate.net/publication/3505534_On_the_powe...](https://www.researchgate.net/publication/3505534_On_the_power_of_small-
depth_threshold_circuits)

------
citilife
I actually always describe neural networks as a robust taylor series. What we
are trying to do by calculating the weights are matching the "plane" of your
data space. In the case of classification, you may have multiple taylor series
generated, where each one represents a class, such that the highest value
output of all the functions concatenated is your classification.

Once you understand that a neural network "learns" the underlying data space,
it is easy to understand why/how you can just change the output layer of a
trained network in many cases to solve a different problem (on the same
dataset).

------
kwhitefoot
Essentially this was pointed out to me by someone who's name escapes me at SAS
in Heidelberg close to twenty years ago. He gave me a paper on it which I seem
to have mislaid.

:-(

------
LaserToy
Saw this recently: Neural Networks seem to follow a puzzlingly simple strategy
to classify images [https://medium.com/bethgelab/neural-networks-seem-to-
follow-...](https://medium.com/bethgelab/neural-networks-seem-to-follow-a-
puzzlingly-simple-strategy-to-classify-images-f4229317261f)

It is no magic

------
k__
So the idea is that NNs are equivalent to PRs and PRs are easier to handle?

~~~
princeofwands
PR is quite similar to a neural network with a single extremely wide layer.
Unless you are hashing the polynomials, you will quickly run into an
intractible problem, while properly setup NNs automatically learn which
polynomials are needed for solving the problem, giving the budget of the
network architecture (number of layers, number of neurons, etc.).

Nobody in industry will abandon NNs over PRs if they are looking to making it
easier to handle. I doubt on most industrial problems, that PR even comes
close to NNs in performance.

~~~
zozbot123
I think what you literally need is a network that is "narrow" but also has
many layers. This is because what PR features over simple linear regression is
interaction terms, and multiple layers are an alternate way to get those. This
also hints at one problem with true PR - in a multi-variate context, the
amount of coefficients you'll need to fit is going to scale very badly given
the number of variables and the degree of your polynomial. NN layers have
natural dimenionality-reduction properties that make them more flexible.

------
tathougies
At first this seems very insightful, but then one recalls that Neural networks
can approximate any computable function and so can polynomials. This is not
particularly insightful in that regard.

------
ykevinator
Neural nets are not better than regression, they're far worse, but the sample
size for statistical significance is the deal breaker.

------
adanhawth
The old seems to be new again. Advancements in hardware are dusting off the
ANN genre, making it shiny and bright, and sparking reinventions an order of
magnitude greater than the previous revolution circa the 80's.

[https://drum.lib.umd.edu/bitstream/handle/1903/3901/umi-
umd-...](https://drum.lib.umd.edu/bitstream/handle/1903/3901/umi-
umd-3747.pdf;sequence=1)

------
js8
This is what I have always been wondering. If according to no free lunch
theorem, there is a bias in every learning method, what kind of bias do the
neural networks have?

