The reason that the implications are a bit overblown is that their model is tremendously and chaotically dependent on the value of theta. The plots that they include in the paper require hundreds of thousands of digits of precision in the model parameter. Nobody evaluates a model based simply on how well it fits the data and its number of parameters; you also look at how well the model parameters are constrained and what the uncertainty bands on the fit are. With this model, theta would be completely unconstrained and their uncertainty bands would cover the entire range of the data that they're fitting. It simply doesn't matter how many parameters you have when that's the case, it means that your fit is useless.
This is incorrect. They said "Both use r = 8 and require hundreds to thousands of digits of precision in θ." That's hundreds to thousands, not hundreds of thousands.
It's still a ton. For reference, a single precision (32 bit) float has about 6-7 decimal digits of precision, double (your typical float in most things that aren't neural nets) has 15ish, and quad has 35ish decimal digits of precision.
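If you want to check those figures yourself, numpy will report them (note that numpy's longdouble is usually 80-bit extended on x86 rather than true quad, so it only reaches ~18 digits):

    import sys
    import numpy as np

    print(sys.float_info.dig)                 # 15: decimal digits a Python float (a C double) carries
    print(np.finfo(np.float32).precision)     # 6: single precision
    print(np.finfo(np.float64).precision)     # 15: double precision
    print(np.finfo(np.longdouble).precision)  # 18 on x86 (80-bit extended, not true quad)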
So it's totally impractical, yeah, but I just wanted to point out that it's not nearly as nuts as you emphasized.
This is meant as a counterexample to existing practices, not as something you should be doing.
The existing practice isn't to blindly look at the number of parameters in a model without considering the actual fit. It's a strawman argument to pretend that that's the case. A model that statistically overfits the data is just as suspect as one that underfits it. If you try to publish a paper about a model with a 0.01 chi-squared value being fit to some data, then it's going to be rejected. It doesn't matter if it's one parameter or one hundred parameters, it's clearly encoding information about the actual dataset rather than being a general model. The model that they present in this paper would have a chi-squared value of essentially zero, and someone would be laughed out of the room if they tried to present it at a conference.
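To make that concrete, here's a toy illustration (made-up numbers) of why a chi-squared of essentially zero reads as memorizing the data rather than modelling it:

    import numpy as np

    # Hypothetical measurements and their uncertainties (made-up numbers).
    y_obs = np.array([1.0, 2.1, 2.9, 4.1, 4.9])
    sigma = np.full(5, 0.1)

    def chi2(y_model):
        return np.sum(((y_obs - y_model) / sigma) ** 2)

    y_sane = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # plausible model, residuals ~ sigma
    y_memorized = y_obs.copy()                    # "model" that just encodes the data

    print(chi2(y_sane))        # ~4, of order the number of data points
    print(chi2(y_memorized))   # 0.0 -- a "perfect" fit, which is exactly the red flag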
So, the paper mentions that their "model" has infinite VC-dimension, which means you basically shouldn't expect it to generalize; in other words, existing theory already says it's a model that won't work.
The problem is that VC-dimension (and Rademacher complexity, etc) also claim that modern neural nets are too complicated to generalize with the amount of data we have.
And yet they do. So the deep learning community has fallen back on counting parameters, not as a way to measure generalization, but as a way to compare models, based on the empirical observation that a lot of the "improvements" we see in papers disappear when you compare to equally sized models.
Yes, the author was clearly trying to make a point.
If you could take some construct and decompose it further into a bunch of independent components that separately account for the prediction behavior, then the number of parameters would really be the number of components (or something close), and not “one parameter” counted artificially because of the way the separate parameters were packaged and decoded.
So even on this intended point, the paper’s result doesn’t really matter. In a Kolmogorov complexity sense, this “one parameter” model is more complex than many multi-parameter models that represent shorter programs, which is a way of saying this “one parameter” model is not really one parameter.
I'm not a mathematician, but I think the point of the exercise was to show that the number of parameters that a function requires is not a relevant indicator of the complexity of the function itself. It specifically uses a multi-parameter function converted to a very complicated single-parameter function to show this.
In the case you described where the arguments are spread into their components (as is the case in "normal" functions), the arguments can still have different precision, or more generally, different complexities. Look at a function that takes a quad, then look at a function that takes 5 bools. One is obviously more complex than the other (using the meaning of "complex" as discussed in the comments), and it has nothing to do with the number of arguments.
Disclaimer: I haven't read the article, just the comments here so please don't sue me if I'm incorrect.
Think of it this way: suppose someone read this article and as a result they thought, “aha, Occam’s Razor is a sham because this shows that ‘simpler’ models (just “one” parameter!) are possibly even more susceptible to overfitting or generalization error. So I’m free to use whatever complex model I want!”
Then they go and fit a 427-degree polynomial to their data set with 427 observations, and when someone says, “but using that many parameters is overly complex, you should prefer a simpler model” they’ll reply, “No way! Read this paper! It says that even simple models with a small number of parameters can have this problem.”
I know it’s a simplification, but it’s important. The notion of “a parameter” has to meaningfully capture an independent degree of freedom, and be a unit of constraint of the complexity (theoretically, Kolmogorov complexity, minimum description length of a program that models the underlying data generating process), or else it’s not a mathematical parameter, rather just a linguistic construct.
In the paper’s example, saying that the model “has one parameter” is misleading, because you need so much precision specifically to allow the “one parameter” to control many parameters worth of complexity.
So if someone had a takeaway point of “there’s not necessarily a reason to favor simpler models” that would be a big misunderstanding — yet it seems like the paper almost intentionally is set up to mislead in this way.
> The superficial “number” of parameters is not important.
Whether the number of parameters is small or large does not change the complexity of the function, though if you're gonna pass 64 bools might as well call it a long long with bitwise OR to simplify the usability.
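For what it's worth, the packing itself is trivial (a rough Python sketch; no information is gained or lost by calling the result "one parameter"):

    def pack(bools):
        # Pack up to 64 booleans into one integer, least significant bit first.
        n = 0
        for i, b in enumerate(bools):
            n |= int(bool(b)) << i
        return n

    def unpack(n, count):
        # Recover the individual booleans from the packed integer.
        return [bool((n >> i) & 1) for i in range(count)]

    flags = [True, False, True, True] + [False] * 60
    assert unpack(pack(flags), 64) == flags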
Edit: Dammit, userbinator's post below also mentions this an hour ago. :P
I showed it to my teacher and he was immediately annoyed with it. He was like "yeah bro here, when he takes the modulo, he's basically shaving off bits from the end of the binary representation of that number. When he divides, he's shaving off bits from the start."
A humbling experience.
Once, during his class, I created a tiny Python script that would let you draw monochromatically in the terminal, then find the proper Tupper Formula Y offset that represented the image you drew.
It would simply loop over the drawing (I don't remember the directions but I believe it was bottom-to-top then left-to-right) and annotate whether there was a drawn pixel or not. This sequence was the binary representation of the Tupper Formula Y offset. "We should make something out of this", I jokingly said to my teacher. "They already did, it's called binary", he replied.
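In case anyone wants to recreate it, here's a rough sketch of that encoder (assuming the usual 17-pixel-tall grid, bits read bottom-to-top within each column, columns left-to-right, and the result multiplied by 17; the exact orientation convention varies between write-ups):

    def tupper_k(grid):
        # grid: 17 rows (top row first) of equal-length lists, truthy = drawn pixel.
        assert len(grid) == 17
        width = len(grid[0])
        n = 0
        for x in range(width):       # columns, left to right
            for y in range(17):      # y = 0 is the bottom row of the drawing
                if grid[16 - y][x]:
                    n |= 1 << (17 * x + y)
        return 17 * n                # Tupper's formula plots the strip k <= y < k + 17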
This is standard practice in deep learning...
So, both BIC and AIC are ill-defined for this family of functions... Part of the reason the measures return worthless garbage. This also happens with fits based on neural nets with tan or clipped activations. (Because the sum of activations is nonsmooth, as is the fit probability.) But not with RBM or GMM or exponential neurons. These produce Gaussian or Pareto fit probabilities. (Polynomials probably also fail because they're not smooth functions, but both measures could be corrected for nonsmoothness of the likelihood in this case.)
Sum of sinc should work too as you get Wishart fit probabilities.
This funny chaotic function? I have no idea how the answers are distributed, or whether the distribution is even continuous.
Also related to how arithmetic compression works.
f(x) = sin^2(A * B^x)
for particular choices of A and B. But I don't think you actually need the squaring - you can do the same thing with just f(x) = sin(A * B^x), and there's no particular need to talk about iterated logistic maps. For example, let B = 256, and let A = 2π * A', where A' is a number from 0 to 1. Then for some positive integer n,
- f(n) ignores the first 8n binary digits of A', because sin(2π * q) ignores the integer part of q, and f(n) = sin(2π * A' * 2^(8n)).
- But f(n) can be determined to a high degree of precision from the following 8 binary digits. This is just because sin is continuous and doesn't have a high slope anywhere, so if you split [0, 2π] into 256 intervals, knowing which interval 2π·frac(q) falls in gives you a suitably small range for sin(2π * q). Therefore, you can put whatever you want in the digits after those 8.
- Thus, the procedure for encoding a series of y values into A' is just to set each group of 8 binary digits in A' to an integer from 0 to 255 proportional to arcsin(y_value).
Of course, more than 8 bits can be used to achieve higher precision.
Indeed, you can replace sin with any continuous periodic function, though the required precision depends on how steep the function gets.
Here's a simple demo: https://drive.google.com/file/d/1SPdsHCZjH9wY0xjUvrX9ga1TTU2...
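For anyone who'd rather read code than prose, here's a minimal sketch of that scheme (8 bits per value; exact rational arithmetic stands in for the huge-precision A'; the names and the 8-bit choice are just for illustration):

    import math
    from fractions import Fraction

    BITS = 8             # binary digits of A' spent per encoded value
    SCALE = 1 << BITS    # 256 intervals per period of sin

    def encode(ys):
        # Build A' so that f(n) = sin(2*pi * A' * 2**(BITS*n)) ~ ys[n-1] for n = 1..len(ys).
        a = Fraction(0)
        for n, y in enumerate(ys, start=1):
            # The 8 digits *after* the first 8n hold an integer proportional to arcsin(y).
            k = round(SCALE * math.asin(y) / (2 * math.pi)) % SCALE
            a += Fraction(k, SCALE ** (n + 1))
        return a

    def f(a, n):
        arg = a * SCALE ** n     # exact; sin(2*pi*q) only sees the fractional part of q
        frac = arg - int(arg)
        return math.sin(2 * math.pi * float(frac))

    ys = [0.10, -0.70, 0.93, 0.00, -0.25]
    a = encode(ys)
    print([round(f(a, n), 2) for n in range(1, len(ys) + 1)])  # within a few percent of ys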
This seems plainly false. They only show that for this one specific function with one parameter determined to several million digits. That would not by any means be “only slightly broad”.
They are essentially just restating that any dataset can be expressed as a single binary number, and therefore can be “fitted” by a function with a completely covering map from integers to integers. While I find it interesting, though not surprising, that they did it with the logistic map, their claimed purpose and conclusion are really far-fetched.
Isn't the likelihood of a fit with this silly function, for any given parameter theta, almost always 0, making it wrong to use either AIC or BIC?
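For reference, here's what both criteria actually compute (standard definitions; k is the parameter count that the paper's construction drives down to 1):

    import math

    def aic(k, log_likelihood):
        # AIC = 2k - 2 ln(L-hat), where L-hat is the maximized likelihood
        return 2 * k - 2 * log_likelihood

    def bic(k, n, log_likelihood):
        # BIC = k ln(n) - 2 ln(L-hat), where n is the number of observations
        return k * math.log(n) - 2 * log_likelihood

Both depend only on k and the maximized log-likelihood, which is why a construction that hides arbitrary amounts of information inside a single parameter breaks them.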
But these criticisms miss a larger point. Mathematics has long had differing levels of infinity. The number of integers that exist, despite being infinite, is fewer than the number of real numbers that exist. Thus, any system based on a fixed representation of digits, even with infinite precision, won't be sufficient to represent all real numbers.
The posterior likelihood function has very interesting values for this logistic map fit.
This is on the level of "gotcha". Usually the AIC and BIC are supposed to be negative for a correct model, and the fit is supposed to have a quantifiable information loss.
(And Wikipedia: https://en.m.wikipedia.org/wiki/Akaike_information_criterion...)
Do you have a reference? Or were you thinking of the Likelihood Ratio Test?
I wasn't aware of that update until now (despite being 16 years old), thank you!!
It reminds me of the Grue vs. Bleen question in philosophy.