
Polynomial Regression as an Alternative to Neural Nets - Gatsky
https://arxiv.org/abs/1806.06850
======
unixpickle
I don't understand how papers like this end up on the homepage of HN. The
abstract makes very broad claims, but the paper itself only runs toy
experiments on tiny datasets with tiny neural networks.

No good ML practitioner believes that tiny, shallow, fully-connected neural
networks are the best algorithm for every problem (just look at Kaggle results
if you don't believe me). Especially for small, easy datasets like the ones
analyzed in this paper, NNs are often not the best choice. However, for large
scale image classification, density modeling, sequence modeling, etc. (none of
which are tested in the paper), NNs are SOTA.

Hilariously, the paper only compares polynomial regression to tiny neural
networks. I bet if they had thrown in results from XGBoost or other classical
ML techniques, polynomial regression would be blown out of the water.

There are also some heuristics you can use to tell when a paper like this
isn't necessarily representative of the field of ML:

\- Missing citations (e.g. "It is well-known that NNs are prone to overfitting
[Chollet and Allaire(2018)], which has been the subject of much study, e.g.
[?].").

\- Inconsistent formatting (e.g. the tables on page 7)

\- Sentences like "Much more empirical work is needed to explore these
issues." following something that sounds like an easy experiment to try.

~~~
mindcrime
_I don't understand how papers like this end up on the homepage of HN._

I don't understand what it is you don't understand. Clearly HN is not a site
dedicated strictly to academics and people focused on the ceremony of
academia. So if a paper contains even an _idea_ that might be interesting, and
seems likely to foster some interesting discussion, it's probably going to get
upvotes. Whether it makes the front page presumably depends on which other
stories are contending for a front-page spot at the same time. In any
case, getting a lot of upvotes doesn't mean that the paper was "good", that
its conclusions were justified, or anything else. It just means it was
something that people thought was worth talking about.

_\- Missing citations (e.g. "It is well-known that NNs are prone to overfitting
[Chollet and Allaire(2018)], which has been the subject of much study, e.g.
[?].")._

_\- Inconsistent formatting (e.g. the tables on page 7)_

It's a pre-print, and not a published paper, no? My assumption was that the
authors intended to fix those in a subsequent revision.

~~~
sdenton4
At least in my end of the world (math), typically the paper is sent for review
and posted to arxiv at the same time; the paper should already pretty much
meet the quality bar for publication.

The ML community, I think, suffers from some (shall we say) poor sportsmanship,
with people occasionally posting rather half-baked papers to establish
priority on an idea, before it's even been fully fleshed out or explored. This
paper feels like that, on both the 'science' and 'bothering to use reasonable
LaTeX' fronts.

~~~
mindcrime
_At least in my end of the world (math), typically the paper is sent for
review and posted to arxiv at the same time; the paper should already pretty
much meet the quality bar for publication._

Ah, interesting. At first blush it seems like that kinda defeats the purpose
of releasing a pre-print, but I can kinda see why someone might do it that
way.

------
mxwsn
The neural nets they compare to are remarkably shallow and thin (2 layers of
12 units each, written as "KF/DN, layers 12,12") for something described as
"deep learning".

The surprising resurgence of deep learning since 2012 is not the observation
that neural network models can learn at all -- neural nets with a handful of
hidden layers were known to work fairly well since the 90s -- but that deep
nets with dozens of hidden layers (enabled by GPUs) broke records and achieved
state-of-the-art results on remarkably challenging tasks (famously, ImageNet
2012) in multiple important machine learning subfields (it has since
revolutionized NLP and powered AlphaGo).

This paper fails to recognize why people care about deep learning as a tool
for machine learning in our modern day, and by making meaningless comparisons
the authors restrict themselves to making meaningless observations.

~~~
anjc
The paper wasn't about deep learning.

------
MarkMMullin
I've long wondered about this - I see and understand the comments by people
pointing out this paper may be a bit shallow - in the same sort of vein, I
have this on my desk to grind thru - interested if anyone else has looked at
it --
[https://arxiv.org/abs/1806.07366](https://arxiv.org/abs/1806.07366)

~~~
lostmsu
Very interesting from a mathematical point of view. Too bad there's no SOTA
result. When they compare to a ResNet, that last 0.01% of accuracy can cost
more than they expect, which would render the result useless.

~~~
MarkMMullin
I'm wondering if it might not be a good solution for >some< layers in a deep
network, i.e. it's just another building block - practically I'm wondering
just how far this puppy can distribute over gpus, and if you could pipeline
the whole thing - i dunno tho, as far as I've gotten is that it might be an
interesting experiment - and I don't want to get completely sidetracked by an
ODE solver :-)

------
sdenton4
The core observation here is that NN nonlinearities can be approximated by
polynomials. And sure, that's true, but especially with low degree polynomials
the approximation might suck, or have very different properties from the
actual function in question. In particular, RELU activation doesn't explode
towards infinity the way a high degree polynomial activation would.

OTOH, no one cares what the nonlinearity actually is so long as the net
trains, and there's a lot of effort to keep layer inputs in a neighborhood
near zero (batch normalization), so polynomial explosion may not be such an
issue.
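
A minimal numpy sketch (my own toy example, not from the paper) of both
points: a least-squares polynomial fit to RELU looks fine on the training
interval but explodes just outside it.

    import numpy as np

    # Fit a degree-6 polynomial to RELU on [-1, 1] by least squares.
    x = np.linspace(-1.0, 1.0, 201)
    relu = np.maximum(0.0, x)
    coeffs = np.polyfit(x, relu, deg=6)

    print(np.polyval(coeffs, 0.5))  # ~0.5: decent inside the fit interval
    print(np.polyval(coeffs, 3.0))  # in the hundreds: the x**6 term dominates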

I feel like I would like a more serious comparison of their model's results
with best-of-class NNs. I'm suspicious that their NN character detector was
basically failing...

~~~
contravariant
Well, this is getting into some rather unfamiliar territory for me, but a
function like max(0,x) could be considered a polynomial in the max-plus
algebra [1]. In this algebra the values kind of behave like the base-b log of
a real number, with b taken to infinity [2]. If you replace b by a really big
number you're just dealing with regular polynomials.

In this sense I think you _could_ consider a big neural network with RELU
activation a (really big) polynomial.

This is about the extent of my knowledge on the subject, though; for more
info, search for tropical geometry [3].

[1]: [https://en.wikipedia.org/wiki/Max-plus_algebra](https://en.wikipedia.org/wiki/Max-plus_algebra)

[2]: [https://mathoverflow.net/questions/83624/why-tropical-geometry](https://mathoverflow.net/questions/83624/why-tropical-geometry)

[3]:
[https://en.wikipedia.org/wiki/Tropical_geometry](https://en.wikipedia.org/wiki/Tropical_geometry)
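
A quick numeric check of the base-b claim (my own sketch, not from [2]):
log_b(b^a + b^x) tends to max(a, x) as b grows, so max-plus "addition" is the
limit of ordinary addition viewed through a base-b logarithm.

    import numpy as np

    def log_b_sum(a, x, b):
        # log_b(b**a + b**x): ordinary "+" seen through a base-b log
        return np.log(b**a + b**x) / np.log(b)

    for b in [2.0, 10.0, 1e6]:
        print(b, log_b_sum(0.0, 1.5, b))  # -> max(0, 1.5) = 1.5 as b grows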

~~~
rors
I wrote my thesis on tropical geometry for machine translation[1]. You can use
it to train a linear model for structured prediction, which used to be the
dominant model before RNNs.

[1]:
[http://www.aclweb.org/anthology/N15-1041](http://www.aclweb.org/anthology/N15-1041)

------
cozzyd
I would be less surprised if the paper claimed rational functions instead of
polynomials. NNs are ~general non-linear functions, which polynomials are very
poor at representing (e.g. you can't encode asymptotic behavior in a
polynomial).
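
For a concrete sense of the difference, here's a minimal scipy sketch (mine,
not from the linked paper) comparing a truncated Taylor polynomial of exp with
its [2/2] Padé approximant (a rational function) far from the expansion point:

    import numpy as np
    from scipy.interpolate import pade

    # Taylor coefficients of exp(x) up to degree 4, lowest order first.
    an = [1.0, 1.0, 1.0 / 2, 1.0 / 6, 1.0 / 24]
    p, q = pade(an, 2)  # [2/2] Pade approximant, a rational function

    x = -5.0
    print(np.exp(x))                # 0.0067
    print(np.polyval(an[::-1], x))  # degree-4 Taylor polynomial: ~13.7
    print(p(x) / q(x))              # ~0.10, and stays bounded as x -> -inf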

edit: here's a paper about rational functions and NNs:
[https://arxiv.org/abs/1706.03301](https://arxiv.org/abs/1706.03301) . I'm not
really in a position to evaluate the claims, but the author just won an NSF
CAREER award, so that's something.

~~~
jl2718
Do you know of any open source packages for doing multi-dimensional Chebyshev
regression? I have seen coefficient determination only when the function is
known, and one guide for how to do collocation in one dimension, but not
really for a ‘best-fit’ in an overdetermined system.

~~~
cozzyd
I don't, other than attempting to use a general minimization package, which I
would expect not to work so well without hand-tuning of initial guesses.

I think GSL and chebfun (Matlab, unfortunately) can do regression in one
dimension, but I don't think they can in 2 or 3 dimensions (much less general
high dimensions).
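
That said, in low dimensions you can roll a best-fit yourself: the model is
linear in the coefficients, so an ordinary least-squares solve handles the
overdetermined system with no initial guesses needed. A minimal numpy sketch
(my own, untuned):

    import numpy as np
    from numpy.polynomial.chebyshev import chebvander2d, chebval2d

    # Noisy samples of an unknown function on [-1, 1]^2.
    rng = np.random.default_rng(0)
    x, y = rng.uniform(-1, 1, (2, 500))
    z = np.exp(-(x**2 + y**2)) + 0.01 * rng.standard_normal(500)

    deg = (5, 5)  # Chebyshev degree in each dimension
    V = chebvander2d(x, y, deg)  # (500, 36) design matrix
    coef, *_ = np.linalg.lstsq(V, z, rcond=None)

    # chebval2d wants the coefficients as a 2-D grid.
    c = coef.reshape(deg[0] + 1, deg[1] + 1)
    print(chebval2d(0.3, -0.2, c), np.exp(-(0.3**2 + 0.2**2)))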

------
anjc
I'm surprised that the comments are so critical here. Regarding the dataset
criticisms: these are canonical datasets in ML research; I see no problem with
using them for the evaluations here. Using canonical datasets helps with
reproducibility and should be encouraged.

Secondly, there was no claim that polynomial regression is better than NNs in
all cases, or more effective than SOTA boosting techniques. The claim seems to
be that PR is simpler, more scrutable, and effective on certain datasets when
compared to NNs. This seems like a non-controversial finding to me, given that
it's common knowledge. I appreciate seeing results like this.

~~~
p1esk
I have a feeling that no one cares about these "canonical ML datasets" in
2018. If this paper had been published back in 1998, it might have generated
more interest. Today we have a whole range of "canonical" DL datasets for
every important application (image classification, object detection, speech
recognition, NLP, games, etc).

I might be wrong. How are their results relevant today?

~~~
anjc
I don't understand your point. Of course there is a whole range of datasets;
this paper chose some of them and performed evaluations on them. If you want
to reproduce their research, you can use the same algorithms on the same
datasets. A quick search shows that hundreds of published papers have used
these datasets just in 2018.

Did you somehow already know the findings that they've made?

~~~
p1esk
My point is that the datasets they chose to test their ideas on are obscure
and/or small. From their results, it's not clear, even to someone familiar
with the field, if they achieved anything impressive. For example, I've done a
ton of work on image classification. Pretty much all papers on image
classification published in the last 5 years have used one of the three
"canonical" datasets: MNIST, CIFAR, or ImageNet. In fact, one can argue that
MNIST has been a canonical image classification dataset since LeCun's famous
paper in 1998. Also, anything published on the topic in 2018 had better
demonstrate good results on ImageNet to be taken seriously (unless it's from
Hinton).

I see "Letter Recognition Dataset" from UCI ML repository. What's the hell is
that? Can you point to any paper published in a respectable conference
(NIPS/ICML/CVPR) in the last 5 years that used it? They showed results from 9
irrelevant datasets, while a single one on ImageNet would instantly convinced
everyone they are on to something.

~~~
anjc
> I see "Letter Recognition Dataset" from UCI ML repository. What's the hell
> is that? Can you point to any paper published in a respectable conference
> (NIPS/ICML/CVPR) in the last 5 years that used it?

Why are you pro MNIST but anti Letter Recognition Dataset? A quick search of
Scholar shows hundreds of papers that have used it in the last 4 years.

Again, I'm still not sure what you want. They aren't doing a paper on NN vs PR
for image classification. They aren't trying to get a higher score on a Kaggle
classification challenge. Their hypothesis is that PR is a good alternative to
NN, and their findings show that - for certain datasets - they are correct.
Will more research arise from this? Will future results be mindblowing? Maybe,
who knows. It's just research, and somebody has to do it.

~~~
p1esk
_Their hypothesis is that PR is a good alternative to NN_

No, they are not correct, because they have not compared their model to
state-of-the-art NNs. That's exactly why I insist they use standard, canonical
datasets, such as ImageNet: because it's easy to find the well-known state of
the art and compare the models. Google Scholar, last 4 years: "Letter
Recognition Dataset": 98 results, "MNIST": 10,000 results, "ImageNet": 18,000
results.

_They aren't doing a paper on NN vs PR for image classification_

If they include image classification results, and claim their model compares
favorably to NNs, then yes, their paper _is_, at least in part, on NN vs PR
for image classification. I only commented on the image classification results
because that's the field I'm familiar with. I suspect the other results they
presented would also be unconvincing to people working in those areas.

_Will more research arise from this?_

If they want more research to arise from that, then they should have made sure
their results are convincing, or at least promising. As presented in the
paper, they are not.

~~~
nightski
Your obsession with state-of-the-art results on the specific subset of data
that deep NNs excel at blinds you to the entire point of this discussion.

------
claytonjy
You can find the code for the paper here:
[https://github.com/matloff/polyreg](https://github.com/matloff/polyreg)

It's worth noting they do not provide enough information to replicate most,
if any, of their empirical results. They recently (today) removed their
experiments/ directory (it can still be found in the commit history), leaving
only the CrossFit data analysis in the repository.

------
beyondCritics
They are arguing that NNs are _essentially_ polynomials, because they can
approximate each other. By the same argument you could claim that NNs are
_essentially_ Fourier series, since under very mild conditions those
approximate each other as well. Therefore this is not a valid argument. Is
this some kind of joke? I checked the date, but that gives no clue.

~~~
geezerjay
> By the same argument you could claim that NNs are essentially Fourier
> series, since under very mild conditions those approximate each other as
> well.

Actually, that's very wrong. Some classes of NNs (step function, ReLU, etc.)
do define piecewise polynomial functions, albeit in a contrived way compared
with simply generating a spline. So the functional space is exactly the same
as that of splines.
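
A five-line check (my own toy example) that a one-hidden-layer RELU net in
1-D is piecewise linear, i.e. a linear spline with knots at x = -b/w1:

    import numpy as np

    # f(x) = sum_j w2[j] * relu(w1[j]*x + b[j]); kinks at x = 0.0, 0.5, 2.0
    w1 = np.array([1.0, -2.0, 0.5])
    b = np.array([0.0, 1.0, -1.0])
    w2 = np.array([1.0, 0.5, -2.0])

    def f(x):
        return np.maximum(0.0, np.outer(x, w1) + b) @ w2

    x = np.linspace(0.6, 1.9, 5)       # an interval containing no kink
    print(np.diff(f(x)) / np.diff(x))  # constant slope: exactly linear there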

------
graycat
Warning: polynomial regression via the normal equations tends to generate a
badly conditioned, Hilbert-like matrix and, thus, numerical problems. But one
can also do the fitting using orthogonal polynomials, which are much more
stable numerically.
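
A quick numpy illustration (my own numbers): compare the normal-equations
condition number in the raw power basis against a Chebyshev basis on the same
points.

    import numpy as np

    x = np.linspace(0.0, 1.0, 50)
    deg = 12

    # Power basis on [0, 1]: the normal-equations matrix is Hilbert-like.
    V = np.polynomial.polynomial.polyvander(x, deg)
    print(np.linalg.cond(V.T @ V))  # astronomically large

    # Chebyshev basis with the points mapped to [-1, 1].
    C = np.polynomial.chebyshev.chebvander(2 * x - 1, deg)
    print(np.linalg.cond(C.T @ C))  # many orders of magnitude smaller

    # Better still, skip the normal equations entirely and use
    # np.linalg.lstsq (QR/SVD) on the design matrix directly.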

------
yodon
To a non-mathematician (a physicist), this paper sounds like a pretty big
deal, both in terms of the insights it offers and the experimental results. Is
anyone in the field able to comment more on it?

~~~
p1esk
This paper would be a big deal if they reported good results on ImageNet (e.g.
beat a ResNet with fewer parameters or faster convergence).

~~~
unixpickle
Even _MNIST_ with a decent CNN comparison would be better than what they have
here.

------
banku_brougham
The claim in the summary is amazing (especially the one about the tuning
parameters); I can’t wait to read the paper, as well as the commentary here.

------
jl2718
Well shit, there goes the originality of my topic idea (probably already
wasn’t so much).

But anyway, should not be surprising that function approximation can be done
with polynomials. Regression has big problems though, as the problem size
increases, which SGD/backprop seems impervious to. That’s kind of the mystery
behind neural nets.
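
To put a number on the scaling problem (my own back-of-the-envelope, not from
the paper): the number of monomials of degree <= d in n variables is
C(n+d, d), which explodes quickly even for modest input dimensions.

    from math import comb

    # Number of monomials of degree <= d in n variables: C(n + d, d).
    # 784 is the MNIST input dimension.
    for n, d in [(10, 3), (100, 3), (784, 3), (784, 5)]:
        print(n, d, comb(n + d, d))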

~~~
dboreham
Not all functions can be approximated by polynomials. On a compact interval,
Weierstrass says any continuous function can be; on an unbounded domain, even
a bounded function like a sigmoid can't be approximated uniformly, since every
non-constant polynomial blows up at infinity.

