 One parameter is always enough [pdf] 143 points by tmalsburg2 on May 26, 2018 | hide | past | favorite | 46 comments Their equation is cute, but this really isn't remotely surprising and the implications aren't as significant as they imply. The result relies heavily on the fact that their theta parameter has infinite precision. You can encode as much information as you want in a single real number with infinite precision. Think of it this way: a single precision float requires 4 bytes to store while a double requires 8. If all you need is single precision, then you can store two floats inside of one double variable with each occupying 4 of the 8 bytes. Now replace the double with an infinite precision number that takes infinite bytes to represent. Once you have an infinite number of bytes to work with, you can pack in as many floats of finite precision in there as you want. That's basically what they're doing here, they just have a simple closed form expression for decoding it.The reason that the implications are a bit overblown is that their model is tremendously and chaotically dependent on the value of theta. The plots that they include in the paper require hundreds of thousands of digits of precision in the model parameter. Nobody evaluates a model based simply on how well it fits the data and its number of parameters; you also look at how well the model parameters are constrained and what the uncertainty bands on the fit are. With this model, theta would be completely unconstrained and their uncertainty bands would cover the entire range of the data that they're fitting. It simply doesn't matter how many parameters you have when that's the case, it means that your fit is useless. > The plots that they include in the paper require hundreds of thousands of digits of precision in the model parameter.This is incorrect. They said "Both use r = 8 and require hundreds to thousands of digits of precision in θ." That's hundreds to thousands, not hundreds of thousands.It's still a ton. For reference, a single precision (32 bit) float has about 6-7 decimal digits of precision, double (your typical float in most things that aren't neural nets) has 15ish, and quad has 35ish decimal digits of precision.So it's totally impractical, yeah, but I just wanted to point out that it's not nearly as nuts as you emphasized. "It simply doesn't matter how many parameters you have when that's the case, it means that your fit is useless." That is their entire point - that model complexity can't be measured in number of coefficients, since this model only has one coefficient, yet has model complexity as high as is possible.This is meant as a counterexample to existing practices, not as something you should be doing. But since everyone (implicitly or explicitly) specifies the precision of the coefficients (eg fp32, fp64, int8, etc), it’s a complete straw man you’re arguing against. Nobody evaluates a model based simply on how well it fits the data and its number of parameters; you also look at how well the model parameters are constrained and what the uncertainty bands on the fit are.The existing practice isn't to blindly look at the number of parameters in a model without considering the actual fit. It's a strawman argument to pretend that that's the case. A model that statistically overfits the data is just as suspect as one that underfits it. If you try to publish a paper about a model with a 0.01 chi-squared value being fit to some data, then it's going to be rejected. It doesn't matter if it's one parameter or one hundred parameters, it's clearly encoding information about the actual dataset rather than being a general model. The model that they present in this paper would have a chi-squared value of essentially zero, and someone would be laughed out of the room if they tried to present it at a conference. > This is meant as a counterexample to existing practicesSo, the paper mentions that their "model" has infinite VC-dimension, so you basically shouldn't expect it to generalize, so existing theory says that it's a model that won't work.The problem is that VC-dimension (and Rademacher complexity, etc) also claim that modern neural nets are too complicated to generalize with the amount of data we have.And yet they do. So the deep learning community has fallen back on counting parameters, not as a way to measure generalization, but as a way to compare models, based on the empirical observation that a lot of the "improvements" we see in papers disappear when you compare to equally sized models. Spot on analysis. I assume this was intended to be a bit of a joke. Number of parameters is not really a useful metric when talking about the amount of data input to a function. What's important is how much data is encoded within the given parameters. I'm also relatively confident that there are likely a large number of functions with which you could achieve similar results (and probably without going to hundreds of thousands of digits). From the paper: "Thus, the construction shows that even a single parameter can overfit the data, and therefore it is not always preferable to use a model with fewer parameters. "Yes the author was clearly trying to make a point. This is just a trick of language though, because their single, hundreds-of-thousands-of-digits-of-precision “parameter” is really just a way of packing an agglomeration of other parameters into one bit bucket.If you could take some construct and decompose it further into a bunch of independent components that separately account for the prediction behavior, then the number of parameters was really the number of components (or something close), and not “one parameter” artificially because of the way the separate parameters were packaged and decoded.So even on this intended point, the paper’s result doesn’t really matter. In a Kolmogorov complexity sense, this “one parameter” model is more complex than many multi-parameter models that represent shorter programs, which is a way of saying this “one parameter” model is not really one parameter. > "In a Kolmogorov complexity sense, this “one parameter” model is more complex than many multi-parameter models that represent shorter programs"I'm not a mathematician, but I think the point of the exercise was to show that the number of parameters that a function requires is not a relevant indicator of the complexity of the function itself. It specifically uses a multi-parameter function converted to a very complicated single-parameter function to show this.In the case you described where the arguments are spread into their components (as is the case in "normal" functions), the arguments can still have different precision, or more generally, different complexities. Look at a function that takes a quad, then look at a function that takes 5 bools. One is obviously more complex than the other (using the meaning of "complex" as discussed in the comments), and it has nothing to do with the number of arguments.Disclaimer: I haven't read the article, just the comments here so please don't sue me if I'm incorrect. This is exactly what I mean by “a trick of language” though. The superficial “number” of parameters is not important.Think of it this way: suppose someone read this article and as a result they thought, “aha, Occam’s Razor is a sham because this shows that ‘simpler’ models (just “one” parameter!) are possibly even more susceptible to overfitting or generalization error. So I’m free to use whatever complex model I wan’t!”Then they go and fit a 427-degree polynomial to their data set with 427 observations, and when someone says, “but using that many parameters is overly complex, you should prefer a simpler model” they’ll reply, “No way! Read this paper! It says that even simple models with a small number of parameters can have this problem.”I know it’s a simplification, but it’s important. The notion of “a parameter” has to meaningfully capture an independent degree of freedom, and be a unit of constraint of the complexity (theoretically, Kolmogorov complexity, minimum description length of a program that models the underlying data generating process), or else it’s not a mathematical parameter, rather just a linguistic construct.In the paper’s example, saying that the model “has one parameter” is misleading, because you need so much precision specifically to allow the “one parameter” to control many parameters worth of complexity.So if someone had a takeaway point of “there’s not necessarily a reason to favor simpler models” that would be a big misunderstanding — yet it seems like the paper almost intentionally is set up to mislead in this way. I absolutely do not disagree. My entire point is in your first paragraph.> The superficial “number” of parameters is not important.Whether the number of parameters is small or large does not change the complexity of the function, though if you're gonna pass 64 bools might as well call it a long long with bitwise OR to simplify the usability. In a classic stack-based calling convention, every function takes one parameter: SP It kind of seems like cheating, the same way Tupper's formula is cheating: https://en.wikipedia.org/wiki/Tupper%27s_self-referential_fo...Edit: Dammit, userbinator's post below also mentions this an hour ago. :P I love the Tupper Formula.I showed it to my teacher and he was immediately annoyed with it. He was like "yeah bro here, when he takes the modulo, he's basically shaving off bits from end of the the binary representation of that number. When he divides, he's shaving off bits from the start."A humbling experience.Once, during his class, I created a tiny python code that would let you draw monochromatically in the terminal then it would find the proper Tupper Formula Y offset that represented the image you drew.It would simply loop over the drawing (I don't remember the directions but I believe it was bottom-to-top then left-to-right) and annotate wether there was a drawn pixel or not. This sequence was the binary representation of the Tupper Formula Y offset. "We should make something out of this", I jokingly said to my teacher. "They already did, it's called binary", he replied. > Nobody evaluates a model based simply on how well it fits the data and its number of parametersThis is standard practice in deep learning... Not to mention this function is not a regular parametric model - Fischer information matrix is undefined. (Not differentiable in logarithm.) How is it not differentiable in its logarithm? The function is just f(x) = sin^2(A * B^x), for some constants A and B (A encodes the data while B just determines the precision). The function itself is infinitely differentiable, so its logarithm will be infinitely differentiable everywhere except where f(x) = 0. However, it's not zero at the points that matter, and in any case we can avoid the issue by just adding a constant. Not the function itself, the log ratio of fit probabilities of any given pair of these. That is differentiable at most the Lebesgue sense and the likelihood requires something stronger. Specifically it has to be smooth everywhere to have KL divergence well defined. Adding a constant will give you a logarithm that breaks at zero still in log probability ratio.So, both BIC and AIC are ill defined for this family of functions... Part of the reason the measure returns worthless garbage. This also happens with fits based on neural nets with tan or clipped activations. (Because sum of activations is nonsmooth as is fit probability.) But not with RBM or GMM or exponential neurons. These produce Gaussian or Pareto fit probabilities. (Polynomial probably also fail because they're not smooth functions but both measures could be corrected for nonsmoothness of likelilihood in this case.) Sum of sinc should work too as you get Wishart fit probabilities.This funny chaotic function? I have no idea how distributed the answers are and whether the distribution is continuous. Of course one arbitrarily large precision parameter is enough. Over 85 years ago Gödel demonstrated that every computable function, no matter how complex, has a representation as a single real number. This idea was used in the diagonalization step in his proof of Gödel's Theorem. Yes, encoding data into numbers and creating an equation that essentially provides a bitmap renderer. Basically what computers do all the time, but we just don't realise it because most of the time they don't show the data that way. Reminds me of https://en.wikipedia.org/wiki/Tupper%27s_self-referential_fo...Also related to how arithmetic compression works. One can easily embed R into R^N with a dense image, so it's straightforward to see that one parameter is always enough. In the case of computable reals (which contain everything we deal with when doing things numerically), the embedding is even bijective. This completely misses the point, though, because the "natural" number of parameters is important in understanding reality. How do you determine the natural number of parameters? That is generally an unsolved problem, and highly specific to the situation. Even the interpretation of the question depends on the context. To read more about it, I would look under the term "System Identification" and "Mathematical Modelling", but I might have different contexts in mind that you. So the equation comes out tof(x) = sin^2(A * B^x)for particular choices of A and B. But I don't think you actually need the squaring - you can do the same thing with just f(x) = sin(A * B^x), and there's no particular need to talk about iterated logistic maps. For example, let B = 256, and let A = 2π * A', where A' is a number from 0 to 1. Then for some positive integer n,- f(n) ignores the first 8n binary digits of A', because sin(2π * q) ignores the integer part of q, and f(n) = sin(2π * A' * 2^(8n)).- But f(n) can be determined to a high degree of precision from the following 8 binary digits. This is just because sin is continuous and doesn't have a high slope anywhere, so if you split [0, 2π] into 256 intervals, knowing which interval q is in gives you a suitably small range for sin(q). Therefore, you can put whatever you want in the digits after those 8.- Thus, the procedure for encoding a series of y values into A' is just to set each 8 binary digits in A' to an integer from 0 to 255 proportional to arcsin(y_value).Of course, more than 8 bits can be used to achieve higher precision.Indeed, you can replace sin with any continuous periodic function, though the required precision depends on how steep the function gets.Here's a simple demo: https://drive.google.com/file/d/1SPdsHCZjH9wY0xjUvrX9ga1TTU2... What is the for log probability ratio of two such fits? It is required by both AIC and BIC. > that “parameter counting” fails as a measure of model complexity when the class of models under consideration is only slightly broad.This seems plainly false. They only show that for this one specific function with one parameter determined to several million digits. That would not by any means be “only slightly broad”.They are essentially just restating that any dataset can be expressed as a single binary number therefore it can be “fitted” by a function that has a completely covering map between integers to integers. While I find it an interesting though not surprising element that they did it with the logistic map, their claimed purpose and conclusion are really far fetched. Based off of the title alone, I thought this was going to be about currying an n-ary function into multiple unary routines. same Seems related to recent results about compression and generalization in neural networks (https://arxiv.org/abs/1804.05862). Also to Bayesian Occam's razor arguments about univeral priors etc. All of these seem to be converging on the same point about how information content of a representation relates to generalization. Glad this paper points out that measures like BIC ignore information, instead substituting a weak heuristic such as number of parameters. No they don't. Only because some people ignore the function which is the second parameter is where it fails.Isn't the likelihood function of given fit with any parameter theta with this silly function almost always 0, making it wrong to use either AIC or BIC? Ok, assume Gaussian noise with a fixed variance, hence 0 additional parameters. AWGN would not help with fit probabilities, you get an additional constant term in log likelihood L. You still get to at least evaluate the log likelihood function or show that AWGN dominates the other term. With the proposed method, you can fit arbitrarily closely to the data, so you can get your likelihood as good as you like, still using a single parameter. So you get good k(=1) and good likelihood, thus good AIC. The likelihood does not need to dominate the other term, it just has to be as good as the likelihood of the model you are comparing to. Many comments here criticize the paper because it proposes combining many parameters into a more accurate parameter, if you have something with infinite precision, you can encode any number of parameters into it.But these criticisms miss a larger point. Mathematics has long had differing levels of infinity. The number of integers that exist, despite being infinite, is fewer than the number of real numbers that exist. Thus, any system, even with infinite precision, that is based on fixed representation of digits won't be sufficient to represent all real numbers. Let's start that AIC is misused here as it is valid only when you have a likelihood function for the data. The mentioned scatter plots do not have any (or have something assumed) for prior. Likewise BIC.The posterior likelihood function has very interesting values for this logistic map fit.This is on the level of "gotcha". Usually the AIC and BIC are supposed to be negative for a correct model and the for is supposed to have a quantifiable information loss. AIC and BIC are only really valid when comparing nested models anyway. This article at least claims that's not right: https://robjhyndman.com/hyndsight/aic/(And Wikipedia: https://en.m.wikipedia.org/wiki/Akaike_information_criterion...)Do you have a reference? Or were you thinking of the Likelihood Ratio Test? Okay reading, it sounds like before Burnham and Anderson (2002) it was only known to be valid for nested models, but they were able to derive it from more fundamental assumptions to remove that restriction.https://en.wikipedia.org/wiki/Akaike_information_criterion#H...I wasn't aware of that update until now (despite being 16 years old), thank you!! Is there a typo in the definition of S(z)? Where does the x come from? You're right, from the description it is clear that x should be z instead. Waw, pretty cool actually I don't see how this improves upon the Stone-Weierstrass theorem, aside from expanding the range of functions that can uniformly approximate a continuous function or scatter plot. A scatter plot can be turned into a continuous function by setting the value to be the linear function that connects the two nearest data points, a piecewise linear function. The improvement is that the function can represent any plot by adjusting a single coefficient, whereas a polynomial approximation has a number of coefficients equal to the number of points in the plot. This is only an improvement in a superficial, linguistic sense though. If the single coefficient is just a bit-packing representation of many more degrees of freedom (because of its huge precision), then from a model information complexity point of view, the polynomial model could actually have fewer parameters, in the sense that the overall size of the combined parameter space is smaller, e.g. it’s a smaller program size.It reminds me of the Grue vs. Bleen question in philosophy. Papers like this should just be a Github link. Stop letting researchers communicate with the real world. They write sentences filled with phrases like "can be shown to be." Search: