
One parameter is always enough [pdf] - tmalsburg2
http://colala.bcs.rochester.edu/papers/piantadosi2018one.pdf
======
foob
Their equation is cute, but this really isn't remotely surprising and the
implications aren't as significant as they imply. The result relies heavily on
the fact that their theta parameter has infinite precision. You can encode as
much information as you want in a single real number with infinite precision.
Think of it this way: a single precision float requires 4 bytes to store while
a double requires 8. If all you need is single precision, then you can store
two floats inside of one double variable with each occupying 4 of the 8 bytes.
Now replace the double with an infinite precision number that takes infinite
bytes to represent. Once you have an infinite number of bytes to work with,
you can pack in as many floats of finite precision in there as you want.
That's basically what they're doing here; they just have a simple closed-form
expression for decoding it.
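
To make the packing trick concrete, here's a minimal sketch in Python (my illustration, not anything from the paper): two single-precision floats hidden inside one double.

    import struct

    # Two single-precision floats occupy the same 8 bytes as one double.
    a, b = 3.14, -2.71
    payload = struct.pack('<ff', a, b)       # 8 bytes: two 32-bit floats
    theta = struct.unpack('<d', payload)[0]  # the single "parameter"

    # Decoding: reinterpret the double's bytes as two floats again.
    a2, b2 = struct.unpack('<ff', struct.pack('<d', theta))
    print(a2, b2)  # ~3.14 and ~-2.71, exact to single precision
    # (Caveat: byte patterns that happen to form NaNs may not
    # round-trip bit-exactly on every platform.)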

The reason that the implications are a bit overblown is that their model is
tremendously and chaotically dependent on the value of theta. The plots that
they include in the paper require _hundreds of thousands of digits of
precision_ in the model parameter. Nobody evaluates a model based simply on
how well it fits the data and its number of parameters; you also look at how
well the model parameters are constrained and what the uncertainty bands on
the fit are. With this model, theta would be completely unconstrained and
their uncertainty bands would cover the entire range of the data that they're
fitting. It simply doesn't matter how many parameters you have when that's the
case, it means that your fit is useless.

~~~
jmalicki
"It simply doesn't matter how many parameters you have when that's the case,
it means that your fit is useless." That is their _entire point_ \- that model
complexity can't be measured in number of coefficients, since this model only
has one coefficient, yet has model complexity as high as is possible.

This is meant as a counterexample to existing practices, not as something you
should be doing.

~~~
ajtulloch
But since everyone (implicitly or explicitly) specifies the precision of the
coefficients (e.g. fp32, fp64, int8, etc.), it’s a complete straw man you’re
arguing against.

------
todd8
Of course one arbitrarily large precision parameter is enough. Over 85 years
ago, Gödel demonstrated that every computable function, no matter how complex,
has a representation as a single number. This idea was used in the
diagonalization step of the proof of his incompleteness theorem.
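
As a toy illustration of Gödel-style numbering (a sketch of the general idea, not Gödel's actual construction): a finite sequence of naturals packed into a single integer via prime exponents.

    def primes(n):
        """First n primes, by trial division (fine for a toy example)."""
        ps, k = [], 2
        while len(ps) < n:
            if all(k % p for p in ps):
                ps.append(k)
            k += 1
        return ps

    def godel_encode(seq):
        """Encode [s0, s1, ...] as 2^(s0+1) * 3^(s1+1) * 5^(s2+1) * ...
        (the +1 keeps zero entries recoverable)."""
        n = 1
        for p, s in zip(primes(len(seq)), seq):
            n *= p ** (s + 1)
        return n

    def godel_decode(n, length):
        """Recover the sequence by reading off prime exponents."""
        out = []
        for p in primes(length):
            e = 0
            while n % p == 0:
                n //= p
                e += 1
            out.append(e - 1)
        return out

    print(godel_encode([3, 0, 2]))  # 2^4 * 3^1 * 5^3 = 6000
    print(godel_decode(6000, 3))    # [3, 0, 2]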

------
userbinator
Yes, encoding data into numbers and creating an equation that essentially
provides a bitmap renderer. Basically what computers do all the time, but we
just don't realise it because most of the time they don't show the data that
way. Reminds me of
https://en.wikipedia.org/wiki/Tupper%27s_self-referential_formula

Also related to how arithmetic coding works: it too encodes an entire message
as a single number in [0, 1).
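
The mechanism behind Tupper's formula is nothing more than bit extraction from a constant; here's a toy Python version (my sketch, with a small made-up bitmap rather than Tupper's 106x17 grid and 543-digit constant):

    # The "formula" is a fixed bit-extraction rule; the image lives in k.
    W, H = 5, 5  # toy bitmap size (Tupper uses 106 x 17)

    def pixel(x, y, k):
        """True iff bit (H*x + y) of k is set -- the integer form of the trick."""
        return (k >> (H * x + y)) & 1 == 1

    def encode(rows):
        """Pack strings like '#...#' into a single integer k."""
        k = 0
        for y, row in enumerate(rows):
            for x, c in enumerate(row):
                if c == '#':
                    k |= 1 << (H * x + y)
        return k

    art = ["#...#",
           "#...#",
           "#####",
           "#...#",
           "#...#"]  # the letter 'H'
    k = encode(art)
    print(k)  # one (big) number encoding the whole bitmap
    for y in range(H):
        print(''.join('#' if pixel(x, y, k) else '.' for x in range(W)))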

------
madez
One can easily embed R into R^N with a dense image, so it's straightforward to
see that one parameter is always enough. In the case of computable reals
(which contain everything we deal with when doing things numerically), the
embedding is even bijective. This completely misses the point, though, because
the "natural" number of parameters is important in understanding reality.

~~~
laretluval
How do you determine the natural number of parameters?

~~~
madez
That is generally an unsolved problem, and highly specific to the situation.
Even the interpretation of the question depends on the context. To read more
about it, I would look under the terms "System Identification" and
"Mathematical Modelling", but I might have different contexts in mind than
you.

------
comex
So the equation comes out to

f(x) = sin^2(A * B^x)

for particular choices of A and B. But I don't think you actually need the
squaring - you can do the same thing with just f(x) = sin(A * B^x), and
there's no particular need to talk about iterated logistic maps. For example,
let B = 256, and let A = 2π * A', where A' is a number from 0 to 1. Then for
some positive integer n,

- f(n) ignores the first 8n binary digits of A', because sin(2π * q) ignores
the integer part of q, and f(n) = sin(2π * A' * 2^(8n)).

- But f(n) can be determined to a high degree of precision from the
_following_ 8 binary digits. This is just because sin is continuous and
doesn't have a high slope anywhere, so if you split [0, 2π] into 256
intervals, knowing which interval q is in gives you a suitably small range for
sin(q). Therefore, you can put whatever you want in the digits after those 8.

- Thus, the procedure for encoding a series of y values into A' is just to
set each group of 8 binary digits in A' to an integer from 0 to 255
proportional to arcsin(y_value).

Of course, more than 8 bits can be used to achieve higher precision.

Indeed, you can replace sin with _any_ continuous periodic function, though
the required precision depends on how steep the function gets.

Here's a simple demo:
https://drive.google.com/file/d/1SPdsHCZjH9wY0xjUvrX9ga1TTU2Zpz7P/view?usp=sharing
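
Since the demo link may rot, here's my own small Python sketch of the same scheme (using 16-bit digit groups instead of 8, and an exact rational standing in for the infinite-precision A'):

    from fractions import Fraction
    import math

    BITS = 16  # binary digits per encoded value; the description above uses 8

    def encode(ys):
        """Pack values in [-1, 1] into a single rational A' in [0, 1)."""
        a = 0
        for y in ys:
            q = (math.asin(y) / (2 * math.pi)) % 1.0  # phase: sin(2*pi*q) = y
            a = (a << BITS) | int(q * (1 << BITS))    # append one digit group
        return Fraction(a, 1 << (BITS * len(ys)))

    def decode(a_prime, n):
        """Recover the n-th value: f(n) = sin(2*pi * A' * 2^(BITS*n))."""
        frac = (a_prime * (1 << (BITS * n))) % 1  # exact fractional part
        return math.sin(2 * math.pi * float(frac))

    ys = [0.25, -0.9, 0.5, 0.0]
    a = encode(ys)
    print([round(decode(a, n), 3) for n in range(len(ys))])
    # -> approximately [0.25, -0.9, 0.5, 0.0], all from the one number a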

~~~
AstralStorm
What is the log probability ratio of two such fits? It is required by both
AIC and BIC.

------
tomtimtall
> that “parameter counting” fails as a measure of model complexity when the
> class of models under consideration is only slightly broad.

This seems plainly false. They only show it for this one specific function,
with one parameter determined to several million digits. That would not by any
means be “only slightly broad”.

They are essentially just restating that any dataset can be expressed as a
single binary number, and therefore can be “fitted” by a function that has a
completely covering map from integers to integers. While I find it
interesting, though not surprising, that they did it with the logistic map,
their claimed purpose and conclusion are really far-fetched.

------
rectangletangle
Based on the title alone, I thought this was going to be about currying an
n-ary function into multiple unary routines.

~~~
mlevental
same

------
shmageggy
Seems related to recent results about compression and generalization in neural
networks (https://arxiv.org/abs/1804.05862). Also to Bayesian Occam's razor
arguments about universal priors, etc. All of these seem to be converging on
the same point about how the information content of a representation relates
to generalization. Glad this paper points out that measures like BIC ignore
information, instead substituting a weak heuristic such as the number of
parameters.

~~~
AstralStorm
No, they don't. It only fails because some people ignore the function itself,
which is effectively a second parameter.

Isn't the likelihood function of a given fit, with any parameter theta of this
silly function, almost always 0, making it wrong to use either AIC or BIC?

~~~
shmageggy
Ok, assume Gaussian noise with a fixed variance, hence 0 additional
parameters.

~~~
AstralStorm
AWGN would not help with fit probabilities; you just get an additional
constant term in the log likelihood L. You still have to at least evaluate the
log likelihood function, or show that the AWGN term dominates the other term.

~~~
shmageggy
With the proposed method, you can fit the data arbitrarily closely, so you can
make your likelihood as good as you like, still using a single parameter. So
you get a good k (= 1) and a good likelihood, thus a good AIC. The likelihood
does not need to dominate the other term; it just has to be as good as the
likelihood of the model you are comparing against.
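
For reference, the standard definition (a reminder, not a claim from the paper): with k parameters and maximized likelihood \hat{L},

    \mathrm{AIC} = 2k - 2 \ln \hat{L}

so a single-parameter model whose fit drives \ln \hat{L} arbitrarily high makes both terms as favorable as possible.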

------
speedplane
Many comments here criticize the paper because it proposes combining many
parameters into a single, more precise parameter: if you have something with
infinite precision, you can encode any number of parameters into it.

But these criticisms miss a larger point. Mathematics has long recognized
differing levels of infinity. The number of integers, despite being infinite,
is smaller than the number of real numbers. Thus, any system based on a fixed
representation of digits, even with infinite precision, won't be sufficient to
represent all real numbers.

------
AstralStorm
Let's start with the fact that AIC is misused here: it is valid only when you
have a likelihood function for the data. The scatter plots in question do not
have one (or have something implicitly assumed) as a prior. Likewise for BIC.

The posterior likelihood function has very interesting values for this
logistic-map fit.

This is on the level of a "gotcha". Usually the AIC and BIC are supposed to be
negative for a correct model, and the fit is supposed to have a quantifiable
information loss.

~~~
jmalicki
AIC and BIC are only really valid when comparing nested models anyway.

~~~
gammarator
This article at least claims that's not right:
https://robjhyndman.com/hyndsight/aic/

(And Wikipedia:
https://en.m.wikipedia.org/wiki/Akaike_information_criterion#How_to_apply_AIC_in_practice)

Do you have a reference? Or were you thinking of the Likelihood Ratio Test?

~~~
jmalicki
Okay, reading further, it sounds like before Burnham and Anderson (2002) it
was only known to be valid for nested models, but they were able to derive it
from more fundamental assumptions and remove that restriction.

https://en.wikipedia.org/wiki/Akaike_information_criterion#History

I wasn't aware of that update until now (despite it being 16 years old), thank
you!!

------
tempodox
Is there a typo in the definition of S(z)? Where does the x come from?

~~~
yorwba
You're right, from the description it is clear that x should be z instead.

------
the_svd_doctor
Wow, pretty cool actually

------
spaceman1331
I don't see how this improves upon the Stone-Weierstrass theorem, aside from
expanding the range of functions that can uniformly approximate a continuous
function or scatter plot. A scatter plot can be turned into a continuous
function by setting the value between data points to the linear function
connecting the two nearest points, i.e. a piecewise linear function.

https://en.wikipedia.org/wiki/Stone%E2%80%93Weierstrass_theorem
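
For concreteness, the piecewise-linear construction is a one-liner in Python (my sketch; numpy's interp does exactly the connect-the-nearest-points interpolation described above):

    import numpy as np

    # A scatter plot...
    xs = np.array([0.0, 1.0, 2.0, 3.0])
    ys = np.array([0.2, 0.9, 0.1, 0.5])

    # ...turned into a continuous function: linear between neighboring points.
    def f(x):
        return np.interp(x, xs, ys)

    print(f(1.5))  # midway between (1, 0.9) and (2, 0.1) -> 0.5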

~~~
comex
The improvement is that the function can represent any plot by adjusting a
single coefficient, whereas a polynomial approximation has a number of
coefficients equal to the number of points in the plot.

~~~
mlthoughts2018
This is only an improvement in a superficial, linguistic sense, though. If the
single coefficient is just a bit-packed representation of many more degrees of
freedom (because of its huge precision), then from a model information
complexity point of view, the polynomial model could actually have fewer
parameters, in the sense that the overall size of its combined parameter space
is smaller, i.e. it amounts to a smaller program size.

It reminds me of the Grue vs. Bleen question in philosophy.

------
stevebmark
Papers like this should just be a Github link. Stop letting researchers
communicate with the real world. They write sentences filled with phrases like
"can be shown to be."

