It's an idea about how to judge models. A model's predictive capability is cast as compression: roughly, "how many bits do you need to set up your model and correct its output".
It might be nice to compare fairly complete models on a well defined domain, but I can't see it as a general guiding principle. It would get theorizing stuck in a local minimum.
Science isn't interested in this kind of prediction. That's just engineering.
Causal models give counter-factual predictions for existence claims (eg., that a planet exists because the orbits of two other planets don't follow the causal model).
Science, in most cases, prefers models with poor "engineering predictions" (ie., point estimates of observables) because they have vastly superior explanatory power.
In most cases it would be a catastrophe for a scientific model to be making good estimates of observables, because we know a priori that observables aren't fully determined by the model (eg., consider that F = GMm/r^2 basically didn't apply to most observations of the solar system when Newton formulated it; nor does it really today).
Explanatory power is not a property of compression, nor association, nor "prediction" in this engineering sense. Consider here that a lossless model of the solar system would never have yielded Newton's law of gravitation (since most of the objects in the solar system are unknown).
This entire project is just, "what if science were like ML?" -- an interesting question only because of how vast the gap is, and how absurd the suggestion is.
How can you tell the difference between a causal model and a predictive model? Isn’t it just the elegance of the model? And isn’t elegance just succinctness?
Because another approach could focus on human comprehensibility. Indeed, a leading theory of scientific progress focuses on the advancement of noetic understanding. But, then we should really be irritated by things like quantum mechanics and love things like nutrition (we might get poor predictions but good understanding).
F = GMm/r^2 does not mean anything about {P(F|M), P(F|m), ...} -- that is the semantics of associative statistical models, as used in ML.
Rather it means a gravitational force is caused by the interaction of two masses over a distance r. Here `F` refers to a force via a scientific model, etc.
The formula is a short-hand consequence of a family of explanatory models about mass, inertia, gravity, forces, etc. and is only valid when used in the context of those models.
Eg., you cannot equate F = GMm/r^2 to F = kQq/r^2. Not least since we have no gravitational model which applies to tiny charged particles.
The formulae used in scientific modelling have neither mathematical nor statistical semantics (ie., they don't refer to numbers, nor to associations). They refer to bits of the world via explanations; and are only valid insofar as these explanations apply.
Though science seems obsessed with formulae, in terms of the goal of science they're the least important part of the scientific model. Explanations are the goal, and highly circumstantial consequences of these are given as formulae, partly for illustration, partly for engineering.
There's rarely anything in any dataset whose model would even be useful for building an explanation. Explanations are built via counter-factuals, and these are resolved by experimental data -- they are not made by it.
Indeed, in many cases, the experimental data would be nowhere near modelled losslessly by the candidate hypothesis.
Here’s a toy example based on a project I’d like to do.
I love faraday waves and Chladni plates — roughly speaking, cymatics. Now, if I record high res video of the waves on water in a dish, vibrating at different frequencies, I could likely create a diffusion model that had some latent “understanding” of the relationship between the proportional frequencies of sound and the waves in a particular sized dish. I could test this by holding out certain frequencies and observing whether the diffusion model could recreate them. So, I can ask the question of whether the AI model learned wave physics.
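To make the test concrete, here's a minimal sketch of the held-out evaluation I have in mind, with a synthetic standing-wave pattern standing in for the real footage and a crude interpolator standing in for the trained diffusion model (everything here is a made-up stand-in, not a real model):

```python
# Toy stand-ins: `standing_wave` fakes a dish frame, `predict_frame` fakes the
# trained diffusion model by interpolating between frames it has "seen".
import numpy as np

def standing_wave(freq, size=64):
    """Fake 'video frame': a standing-wave pattern on a square dish at `freq`."""
    x = np.linspace(0, 1, size)
    X, Y = np.meshgrid(x, x)
    return np.sin(np.pi * freq * X) * np.sin(np.pi * freq * Y)

train_freqs = [2, 3, 5, 6, 8]          # frequencies the model trains on
held_out = [4, 7]                      # frequencies withheld for the test
train_set = {f: standing_wave(f) for f in train_freqs}

def predict_frame(freq):
    """Stand-in for the diffusion model: average the nearest training frames."""
    lo = max(f for f in train_freqs if f <= freq)
    hi = min(f for f in train_freqs if f >= freq)
    return 0.5 * (train_set[lo] + train_set[hi])

for f in held_out:
    mse = np.mean((predict_frame(f) - standing_wave(f)) ** 2)
    print(f"held-out frequency {f}: reconstruction MSE = {mse:.3f}")
```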
Here, there would be no formula, per se, and no explanation. Merely a computational model that could make predictions about physical phenomena.
Now, what makes this less scientific than a formula and set of explanations for the experiments? Is it because I can relate the explanations linguistically to everything else I know about science?
So, in that case, if I jointly developed a language model to do the same thing as the diffusion model (ie, make predictions based on data) but linguistically capable of connecting the outcomes to scientific concepts, would it then be scientific?
The relationship between terms in scientific models and reality isn't linguistic.
You cannot create explanatory models from associative models of pixels -- you can create associative models that may have some limited engineering use. (Because, under highly fragile, engineered conditions, the pixels track unknown and unstated properties of the physical system.)
In the case of waves, the governing dynamical wave equation, ie., f(x, t) = potential(x, t) + kinetic(x, t), needs to have terms related to the physics of the system (eg., the properties of the material) to count as even a basic explanation.
Broadly, you'd need to describe the dynamics of material properties and how they give rise to the dynamics of sound properties, which requires a family of explanatory models.
You would produce those explanatory models by creating novel materials, novel experimental conditions, reasoning counter-factually, etc. over a long period of time. Eventually you may be able to formalise small, circumstantial parts of those explanations and refute them by using logically entailed experimental data.
The relevant relationships here are: explanation, causation, counter-factual possibility, necessity, and logical entailment. Nowhere is "association", nor should it ever be if it's science.
An associative model of data isn't an explanation; it's a sort of pseudo-empiricism that's equally present in superstition. You can associate personality markers with positions of constellations with an associative model and get arbitrary predictive accuracy (since, eg., there are enough stars in the sky to choose ones which correlate with anything).
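To make that concrete, here's a toy sketch (all numbers invented): generate enough random "stars" and one of them will correlate with any trait you like -- and the "finding" evaporates on fresh data.

```python
# Toy demonstration: enough random "stars" guarantees a spurious correlation.
import numpy as np

rng = np.random.default_rng(0)
n_people, n_stars = 50, 10_000

personality = rng.normal(size=n_people)        # the trait to "predict"
stars = rng.normal(size=(n_stars, n_people))   # completely unrelated noise

# Pick whichever star best correlates with the trait in this sample.
corrs = np.array([np.corrcoef(s, personality)[0, 1] for s in stars])
best = int(np.argmax(np.abs(corrs)))
print(f"best in-sample correlation: {corrs[best]:+.2f}")   # typically around 0.5

# The same star against a fresh, independent sample of the trait.
fresh = rng.normal(size=n_people)
print(f"same star on fresh sample:  {np.corrcoef(stars[best], fresh)[0, 1]:+.2f}")
```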
This has nothing to do with science, and even as a claim to science, it's outright pseudoscience.
Thank you. But I don’t like the aggression of calling this “pseudoscience.” As though you have a clear claim? Especially if I can easily claim that my “pseudoscience” works better than your “science.” (Because you admit as much.)
It's pseudoscience to treat it as science, "works" is an engineering condition. "Explains" is all that interests scientists.
I can encode an arbitrary amount of information, losslessly, by picking out various stars and listing their positions in some order. This is not an explanation of that information. For some purpose, "it works".
Many things "work"; it is trivial to rig situations so that coincidences can be exploited. This isn't science.
It's hard to over-estimate how profoundly pseudoscientific associative-modelling-as-science is: it's the basis of the history of human superstition, fraud, magical thinking, and so on.
So now it can be automated: how obscene it is that vain engineers go around proclaiming to have automated science. This is ridiculous, and to claim to be able to do physics by correlating pixel patterns is a dangerous religion: no such model is safe, no such model reasons, no such model...
These models are extremely fragile houses-of-cards that must be understood as the magic tricks they are. It is charlatanism to host a stage show and call it science -- there are many gurus in the world on that grift.
But if “working” is a necessary condition for an explanation (some explanations work better than others), then won’t scientific explanations eventually become subject to the optimization drive of engineering?
Well, Kepler’s model does a good job predicting? Not perfectly, but astonishingly well.
I’m still not sure that the distinction between predictive model and explanatory model is so clear. Kepler wanted to explain the universe through the harmony of the spheres. Through that objective, he used the data to discover a beautiful and robust predictive model. Was he doing science?
Insofar as modelling is a 'predictive' activity in the engineering sense of useful estimates of observables -- it tends to end in pseudoscience.
Originally the idea of spheres was a good one (and not obtainable via any compression of measurements) -- it was obtained through reasoning by analogy. But when epicycles were added over and over, you were effectively using a universal function approximator to match the observable data.
Since the solar system doesn't change much, the epicycle approach works (by coincidence) -- but it's pseudoscience.
A model of gravity which can account for any possible solar system is an explanation, even if it's so hard to use we cannot actually do predictions with it (the status of much science).
Scientific models clearly represent a compression of measurement data?
Scientific models aren't necessarily "causal" to begin with. They are functions that give predictions about measurements. It is these predictions that are tested against data, not the function's confabulated "causal" justifications.
People learn from data that doesn't adhere to predictions. The difference from the model's predictions can itself be compressed, if it isn't random. That compression might then show up as modularization in the justifications, which in turn gets interpreted as causal relationships.
> Scientific models clearly represent a compression of measurement data?
Nope! No theory of heat is a compression of thermometer readings; no theory of gravity, of orbits; no theory of atoms, of spectra. No theories compress measurements!
Such a thing is pure superstition. Heat is not the motion of thermometer fluid.
> not the function's confabulated "causal" justifications.
Nope!
We construct an experiment by counter-factual analysis of its causal semantics; we do not simply test whether observable quantities match prior data. Arbitrary associative models match arbitrary amounts of prior data. This is the opposite of science.
We test scientific models by creating new experiments; it isn't "the data" which matters here, but that the experiment is designed to test the causal assumptions of the model.
If the experiment doesn't: control causes, identify novel measures with potential causes, etc. then any data collected is useless.
This is why you need, you know: randomised controlled trials, microscopes, satellites, ... etc.
"Data" in the ML sense does not matter. This is pure superstitious pseudoscience. Science is a process of creating data under experimental conditions designed to be counter-factual tests of theories. Science is about the data generating process (reality), not our measurements of it.
I'm sorry, but I think you misinterpret what compression is all about?
Heat is a random process. That process has a non-random component though. The phenomenon is compressed by describing it as a random distribution of impulses around a mean, which is given by the temperature.
In effect, you construct an algorithm that translates model parameters to predicted measurements. Any algorithm can be described as a function.
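To make this concrete, here's a minimal sketch of the idea, using idealised Shannon code lengths (-log2 probability, ignoring discretisation) rather than a real coder, and invented numbers:

```python
# Sketch: impulses modelled as Gaussian noise around a mean set by temperature.
# "Code length" here is the idealised Shannon length; the numbers are invented.
import numpy as np

rng = np.random.default_rng(1)
temperature = 300.0
impulses = rng.normal(loc=temperature, scale=10.0, size=10_000)   # "measurements"

def code_length_bits(samples, mean, std):
    """Idealised code length of samples under N(mean, std), in bits."""
    nats = 0.5 * np.sum(((samples - mean) / std) ** 2) \
         + samples.size * np.log(std * np.sqrt(2 * np.pi))
    return nats / np.log(2)

with_model = code_length_bits(impulses, temperature, 10.0)   # knows the mean
without_model = code_length_bits(impulses, 0.0, 300.0)       # vague default model
print(f"bits with the temperature model:    {with_model:,.0f}")
print(f"bits without the temperature model: {without_model:,.0f}")
```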
Your idea that models are settled by testing causal assumptions is cargo-cult science. It is only partially correct and vastly misleading. In particular, such testing isn't always possible even in principle. There are untestable properties and parts of models that are simply chosen in lieu of better alternatives.
By restricting yourself to such simple-minded testing, you even blind yourself to large parts of reality. Many interesting topics have no way of completely controlling conditions, for example. They are not arbitrarily repeatable either. Astronomy, economics, psychology, biology... the world is bigger than your approach can account for.
> In particular, such testing isn't always possible even in principle
You're the one saying it's a compression of measurements. How do you think we build models without any then? If you are aware that we do.
> Many interesting topics have no way of completely controlling conditions for example
Yes, and in these cases, much of what's produced is pseudoscience. Much of psychology is just taking surveys and compressing them; that's why it's unreproducible trash.
> The phenomenon is compressed by describing it as a random distribution of impulses around that mean
'The phenomenon' isn't something 'compressible'. Compression in this sense means only a condition on an equation describing an association.
The science of heat includes a model of atoms, molecules, flow, motion, etc. -- none of which are "equations", but rather semantics for interpreting equations. The equations, given a "merely associative" semantics, are useless.
All the things we discovered exist are science; and all the ways they work. The very language you're using here is a product of scientific discovery.
There was nothing to "compress", and there is never anything to "compress". The world behaves differently depending on how you measure it; there are an infinite number of measures for any given system; and no "compression" of them -- not even all of them -- is that system.
A sphere is not its shadow, nor even all of its shadows. It's the cause of its shadow -- and it possesses properties that no model of its shadows possesses.
There is no such thing as "causal data". A causal model is an interpretation of data.
Eg., to say "increasingly energetic motion of molecules leads to increasingly hot water" is an interpretation of a very wide class of equations.
It posits the existence of molecules (a scientific discovery), water, energy, motion, heat, etc. and it provides a means of creating equations and measures tied to each of these terms.
Science is the production of those interpretations. There is no bare "data" which tells you how reality is.
Science isn't "magic trick engineering", it's Explanation. "Compressing tables of data" is something they do in the pseudosciences -- as you've seen, none of it is reproducible: "IQ" is just a compression of survey quizzes. Do you really think it exists?
Do you think you can just compress survey results and claim to have an explanatory model of the most complex system in the entire universe? (a person, society, and their joint interaction) etc.
ML is a temple to pseudoscience, permitted only because the situations it's used in are engineered and low-risk. The whole thing is a dumb trick. You cannot build models of the world from associations in data: that is called superstition.
You flip a coin to randomize choice of a treatment and record the results. The coin-flips+results is a stream of binary data that can then be compressed well or poorly. A compressor which has built a correct causal model of the effects, whatever those are, will compress better than one which is unable to and can only blindly predict pretreatment results (or worse, predict conditional on the correlations which were just broken by the coin-flip, thereby actually wasting bits to fix its especially erroneous predictions). This is in line with the compression paradigm.
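To put a toy number on it (invented data, with idealised Gaussian code lengths standing in for a real compressor):

```python
# Toy simulation: a compressor whose model includes the randomised treatment
# effect spends fewer bits on the outcome stream than one that predicts blindly.
import numpy as np

rng = np.random.default_rng(2)
n = 5_000
treatment = rng.integers(0, 2, size=n)              # the coin flips
outcome = 2.0 * treatment + rng.normal(size=n)      # true causal effect = 2

def bits(residuals, std):
    """Idealised code length of residuals under N(0, std), in bits."""
    nats = 0.5 * np.sum((residuals / std) ** 2) \
         + residuals.size * np.log(std * np.sqrt(2 * np.pi))
    return nats / np.log(2)

causal = bits(outcome - 2.0 * treatment, 1.0)          # model with the causal effect
blind = bits(outcome - outcome.mean(), outcome.std())  # blind marginal prediction
print(f"bits, causal model: {causal:,.0f}")
print(f"bits, blind model:  {blind:,.0f}")
```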
Where do you disagree? Do you think that causal models are completely useless for shortening predictions? Or do you think causality just doesn't exist?
> "Compressing tables of data" is something they do in the pseudosciences -- as you've seen, none of it is reproducible: "IQ" is just a compression of survey quizzes. Do you really think it exists?
That's not even close to correct about IQ. You can measure it from lots of things which are not 'survey quizzes'; fMRIs, for example.
fMRIs are also pseudoscience. Again, just associative models of blood flow.
Causal models do not compress measurement data. There are an infinite number of ways of measuring any phenomenon (consider all possible devices which measure temperature) -- in this sense, they non-uniquely compress "all possible data about the phenomenon across all possible measurement systems". Ie., even with "all possible data" there are an infinite number of lossless models of it.
(But we would not even want a lossless model, since "all possible data" includes all measurement systems which have their own dynamics).
When we have an explanatory model of heat (as the kinetics of molecules), we have a textbook of explanations which we use (via reasoning, imagination, etc.) to write down whole families of causal models. So when creating a new device we can determine what its behaviour will be.
This has nothing to do with a compression of measurement data.
We do not, never have, and never could determine the causal structure of reality using compression of measurements. Measurement devices are physical systems whose properties are causally determined by their target systems -- how they are determined is not "in the data". Absent knowing this, ie., absent science, the data is just a description of the measurement system -- not the target.
Isn't noise in the data going to dominate output size of lossless compression? Wouldn't linguistics and vision be better off with direct measurements of predictive strength?
Noise certainly affects the compression rate. But you are not concerned with the absolute compression rate, you are only concerned with the relative rate achieved by two theories A and B. Both theories will be negatively impacted to the same degree by the noise, so the comparison still works to select which theory is better.
You can easily quantify the variance and do standard model-comparison/hypothesis-testing if you want statistical-significance levels. For many datasets these days, this is hardly even a consideration: even a 1% compression improvement is clear.
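A quick toy illustration (invented models and noise levels): the noise inflates the code length of both theories, but the better theory still wins the comparison at every level.

```python
# Toy comparison: theory A (linear) vs theory B (quadratic) on data whose true
# law is quadratic, at increasing noise levels. Code lengths are idealised.
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0, 1, 2_000)
true_y = 3.0 * x ** 2

def bits(residuals):
    """Idealised Gaussian code length of residuals, fitting the scale."""
    std = residuals.std() + 1e-12
    return residuals.size * (0.5 + np.log(std * np.sqrt(2 * np.pi))) / np.log(2)

for noise in (0.3, 0.6, 1.0):
    y = true_y + rng.normal(scale=noise, size=x.size)
    theory_a = np.polyval(np.polyfit(x, y, 1), x)   # linear fit
    theory_b = np.polyval(np.polyfit(x, y, 2), x)   # quadratic fit
    a, b = bits(y - theory_a), bits(y - theory_b)
    print(f"noise={noise}: A={a:,.0f} bits, B={b:,.0f} bits, B still wins by {a - b:,.0f}")
```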
Recent events in ML make me feel about 2/3 vindicated in the claims made in the book. Based on the book's ideas, I began training LLMs based on large corpora in the early 2010s, well before it was "cool". I figured out that LLMs could scale to giga-parameter complexity without overfitting, and that the concepts developed under this training would be reusable for other tasks (I called this the Reusability Hypothesis, to emphasize that it was deeply non-obvious; other terms like "self-supervision" are more common in the literature).
I missed on two related points. Technically, I did not think DNNs would scale up forever; I thought that they would hit some barrier, and the engineers would not be able to debug the problem because of the black-box nature of DNNs. Philosophically, I wanted this work to resemble classical empirical science in that the humans involved should achieve a high degree of knowledge relating to the material. In the case of LLMs, I wanted researchers (including myself) to develop understanding of key concepts in linguistics such as syntax, semantics, morphology, etc.
This style of research actually worked! I built a statistical parser without using any labelled training data! And I did learn a ton about syntax by building these models. One nice insight was that the PCFG is a bad formalism for grammar; I wrote about this here:
Obviously, I fell into the "Bitter Lesson" trap described by Rich Sutton. The DNNs can scale up, and can improve their understanding much faster than a group of human researchers can.
One funny memory is that in 2013 I went to CVPR and told a bunch of CV researchers that they should give up on modeling P(L|I) - label given image - and just model P(I) instead - the probability of an image. They weren't too happy to hear that. I'm not sure that approach has yet taken over the CV world, but based on the overwhelming success of GPT in the NLP world, I'm sure it's just a matter of time.
In hindsight, I regret the emphasis I placed on the keyword "compression". To me, compression is a nice and rigorous way to compare models, with a built-in Occam's principle. But "compression" means many different things to different people. The important idea is that we're modeling very large unlabelled datasets, using the most natural objective metric in this setting.
> I'm not sure that approach has yet taken over the CV world, but based on the overwhelming success of GPT in the NLP world, I'm sure it's just a matter of time.
Yeah, iGPT was the writing on the wall there, but CLIP gave cheap non-generative modeling a new lease on life. Contrastive learning sucks in many ways, but it's substantially cheaper: compare the cost of training a CLIP to the cost of training a DALL-E 1. (CLIP itself was originally generative, doing the obvious generation of caption & image separately, but they found it was like 8x cheaper to go full contrastive.) So, everyone flocked into that to avoid paying the Bitter Lesson. However, people increasingly run into the limits of contrastive learning (eg. about half the examples you'll see of DALL-E 2 or Midjourney or SD failing on a prompt are probably due solely to the use of contrastive embeddings) and compute/resources keep piling up, so we'll get to generative-everything in images eventually.
Hi, Dan! Just want to say thanks for your work on this topic. I really loved your book so I wanted to share it with the HN community. Supervised learning always seemed to rub me the wrong way, both when I learned it in college and when I saw it used in practice in industry.
I was led to your book by recent research in self-supervised learning by LeCun et al [1] [2]. Since reading your book, I have been digging into the work by Rissanen [3], Grunwald [4], and Hinton [5], among many others. I'm trying to build up my knowledge so that I can apply it to TinyML [6] (e.g. running a neural network on a microcontroller with 256kb of RAM). In a TinyML context, power usage must be low and labeled data is non-existent. I have a vague intuition of how MDL can be used to guide the engineering constraints of TinyML, and I'm hoping to formalize this in my research.
Dan, if you know of any papers or research groups that would be related to this area, I'd love to read more about it.
Hi Bob, thanks for the kind words and for sharing with HN. For TinyML, you need to go in almost the opposite direction of what my book suggests, since the model complexity limits are so strict! I think MDL should be very helpful, but make sure you understand the danger of "manual overfitting" that I described in the book. I would also encourage you to read Vapnik's book the Nature of Statistical Learning Theory, which shows the relation b/t MDL and VC theory. Feel free to reach out to me at firstname dot lastname at gmail.com, I'm always happy to chat about these ideas.
I don't really know what more needs to be said here.