So, as a distance measure for classifying MNIST digits, GZIP has lower accuracy, and is much more computationally demanding than other measures.
I'm not that familiar with how the GZIP algorithm works, so it's kind of interesting that its accuracy is so much lower. I wonder whether image-focused compression algorithms might do better?
(Edit: Btw, I enjoyed the post; it's a creative idea, the writing and code was great, and it's sparked some good discussion. But after having a closer look, I think the baselines above provide some context to the gzip scores.)
Best one I found is Normalized Mutual Information at 95%. It's a little more complex, but you could still compute it relatively quickly on binarized images.
Holy cow, I knew that MNIST was simple, but not that simple.
Could you post a snippet of the code that you used to achieve this? It would be really, really nice to have a baseline to work from, I'm sure, and I feel like this could be really useful in a few other areas (my personal obsession is speed-training on CIFAR10 :')))) )
I used the notebook linked in the original post [1]. It evaluates on 100 samples from the test set (I'm guessing because the gzip method is slow; it would take ~7 hours on the full test set on my machine).
I plugged the distance measures in as replacements for the `compute_ncd` function. (Jaccard/Dice have been negated and the -1 removed.)
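For anyone who wants to reproduce this, here is a rough sketch of the kinds of measures I mean, assuming 28x28 uint8 MNIST arrays and a binarization threshold of 128; the evaluation harness itself comes from the linked notebook, so the exact plumbing may differ:

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

def binarize(img, thresh=128):
    # 28x28 uint8 MNIST image -> flat boolean vector
    return np.asarray(img).ravel() >= thresh

def jaccard_distance(x1, x2):
    a, b = binarize(x1), binarize(x2)
    union = np.logical_or(a, b).sum()
    return 1.0 - (np.logical_and(a, b).sum() / union if union else 1.0)

def dice_distance(x1, x2):
    a, b = binarize(x1), binarize(x2)
    denom = a.sum() + b.sum()
    return 1.0 - (2.0 * np.logical_and(a, b).sum() / denom if denom else 1.0)

def nmi_distance(x1, x2):
    # Treat the two binarized pixel vectors as labelings of the 784 pixels.
    a, b = binarize(x1).astype(int), binarize(x2).astype(int)
    return 1.0 - normalized_mutual_info_score(a, b)
```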
Ohhh, you know what? I think this makes sense, since it's basically like making prototypes from the training set, only it's instead comparing against every training example, then averaging. Cool. Did not know that would be remotely close to 90%, but that makes sense to me.
I wonder if something like a weighted max_mean, or an L1 distance, would perform better. Or maybe L2 is ideal; it is at the center of a lot of things in information theory, after all! ;PPPP
Very cool. I tried using PNG compression, it does actually do better, slightly:
PNG : ~15.1 seconds, 83% accuracy
I also tried dropping in zstandard compression:
Zstd (level=3) : ~3.5 seconds, 88% accuracy
Much faster than gzip, at least. And if I use (x1-x2)*2 instead of x1+x2 to calculate Cx1x2, that pushes zstd up to 93% accuracy.
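For reference, swapping the compressor is nearly a one-liner with the `zstandard` package. Here is a rough sketch of what I mean, assuming the notebook's convention that Cx1x2 compresses the elementwise combination of the two image arrays:

```python
import numpy as np
import zstandard

cctx = zstandard.ZstdCompressor(level=3)

def csize(arr) -> int:
    # Compressed size of an image buffer.
    return len(cctx.compress(np.asarray(arr, dtype=np.uint8).tobytes()))

def ncd_zstd(x1, x2) -> float:
    # Cx1x2 compresses the elementwise sum of the two images (cast to a wider
    # dtype first if overflow is a concern); the 93% variant above replaces
    # x1 + x2 with (x1 - x2) * 2.
    c1, c2, c12 = csize(x1), csize(x2), csize(x1 + x2)
    return (c12 - min(c1, c2)) / max(c1, c2)
```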
I'm somewhat interested in the fact that if I stack the two arrays on top of each other instead of adding them together, I get absolutely garbage performance (<20%); however, as far as I can tell, that approach actually works well when classifying strings.
Re garbage performance on stacking: I had tried interleaving the values (creating a 56x28 img), it dropped one percentage point of accuracy when using gzip.
Also, it seems from reading online articles that people are able to obtain much better results just by using K-NN, so I imagine that the author just made his job harder by using gzip, but I could be wrong about this.
Most people don't realize that Logistic regression can get ~90% accuracy on MNIST.
As a big fan of starting with simple models first and adding complexity later, I've frequently been told that "logistic regression won't work!" for problems where it can in fact perform excellently.
When faced with this resistance to logistic regression I'll often ask what they think the baseline performance of it would be on MNIST. The guesses I hear most often are 20-30%.
People, even machine learning people, often don't realize the rapidly diminishing returns you get for adding a lot of complexity in your models. Sometimes it's worth it, but it's always good to start simple first.
It's also been my experience that if you don't get good performance with a simple model, you are very unlikely to get great performance from a more complex one.
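To make the ~90% claim concrete, here's a minimal sketch of that baseline in scikit-learn; the exact number depends on solver, regularization, and iteration budget, so treat the figures as illustrative:

```python
from sklearn.datasets import fetch_openml
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X = X / 255.0  # scale pixels to [0, 1]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=10_000, random_state=0)

clf = LogisticRegression(max_iter=200)  # multinomial softmax with lbfgs
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # typically lands in the low 0.9s
```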
Came here to say the same thing; actually, NCD can probably do much better than 78%. Li & Vitanyi's book about Kolmogorov complexity has some interesting unsupervised examples.
A simple CNN as implemented in Keras tutorial can easily exceed 98%. 78% is very poor performance for MNIST even if model complexity is penalized.
When would 90% accuracy on a dataset like MNIST ever be useful? And I mean useful as in usable for actual products or software, especially considering MNIST is more of a toy dataset.
I think that's why machine learning is the way to go for this type of detection: why go with anything other than a CNN (in this case) when it is now trivial to set up and train? Again, unless it's just to mess around with, 90% MNIST accuracy is not useful in the real world.
I don't think the point was that you should use logistic regression on MNIST. In lesser-known problems, say a custom in-house model, if you don't try the simpler approach first, you'll never know that your more complex solution is not worth the extra expense, or is actually worse than a simpler, cheaper model. MNIST is well-known to have nearly perfect solutions at this point, but for most novel problems, the data scientist has no idea what is theoretically possible.
Now, you can say that CNNs or other techniques are easily accessible these days, and almost trivial to set up. But they may not be trivial to train and run in terms of compute in the real world.
> People, even machine learning people, often don't realize the rapidly diminishing returns you get for adding a lot of complexity in your models.
Are you me? I was just having this argument at work with someone about using an old (fasttext/word2vec) model vs. the overhead on fine-tuned BERT model for a fairly simple classification problem.
To a large degree this happens because logistic regression is not a sexy approach that one can add to their CV. Everyone wants to solve problems with big, complicated and buzzwordy models, because that sells (and is perhaps more interesting as well).
It's really a tragedy, because so many classic models would work fine for real world applications.
Extending the existing features with non-linear mappings of them would improve logistic regression too, probably to the SVC level (an RBF or polynomial kernel is effectively that, just implicit).
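As a sketch of that idea (parameters here are illustrative, not tuned): approximate the RBF feature map explicitly and fit plain logistic regression on top, which is roughly what an RBF-kernel SVC does implicitly.

```python
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Explicit (approximate) non-linear feature map, then a linear classifier on top.
model = make_pipeline(
    RBFSampler(gamma=0.01, n_components=1000, random_state=0),
    LogisticRegression(max_iter=300),
)
model.fit(X_train, y_train)   # X_train/y_train as in the earlier sketch
print(model.score(X_test, y_test))
```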
Linear models are really well researched, and today with the compute and with proper training and data preparation they can easily get to satisfying levels of performance for a variety of tasks.
This is true unless you can use large pretrained models, which are very simple to use, are very resistant to all sorts of noise, and would get ultra-high accuracy with just logistic regression on the output of the model.
The whole point isn't to do better. It's to show that enough information survives compression that you can still get very large signal. Compression is intended to make things harder and still does.
The article OP linked is inspired by the gzip+KNN paper that was released a few months ago; compression is not intended to make things harder but to extract useful features that can be used by simple algorithms like KNN or SVM, etc.
For example, it would be almost impossible to distinguish cat and dog images by using an SVM or KNN directly on the raw data, but it would be much easier if the images were first passed into an image encoder and a small embedding vector were used for each one instead. In the article OP linked, it doesn't seem that gzip is very good at extracting useful patterns from the MNIST dataset.
And what it accidentally showed, was that NCD between individual digits in the training set is a really terrible distance metric for classification.
You can do classification with KNN, which is obvious. You can also do classification with compression, which is less obvious, and neat. This approach tries to combine them in a way which doesn't work.
Sure it's not great at differentiating between SotA techniques, but it's very useful for sanity checks like this one.
Even for SotA models, it's still useful to verify that you can get greater than 98% accuracy on MNIST before exploring larger, more complex benchmarks.
It certainly shouldn't be the only benchmark but it's a great place to start iterating on ideas.
It's a bad benchmark because it's artificially clean. It's effectively a 2D dataset with no occlusions, so nearly everything you try on it will work, and many things you try on it won't scale to typical image problems. There are good 3D datasets with more realistic examples that are still fairly simplistic compared to the state-of-the-art large datasets, but that at least give you a signal that your technique is robust to common problems in vision. MNIST is so simplistic that you encounter none of the typical problems in computer vision settings, so it doesn't give you a good prediction of how good your technique is.
Can you imagine a model that performs very poorly on MNIST that would perform well in real-world computer visions problems? If you can't, then MNIST is a nice simple smoke test when assessing models.
What's wrong with "artificially clean"? The goal of benchmarks is to compare and know whether one model is better than the other. There is never a "perfect" or "objective" benchmark. Different benchmarks may highlight advantages in certain models, which is a good thing, but there is absolutely nothing wrong with using MNIST as a dataset to give you a basic idea of how models perform.
Artificially clean gives you too many false positives. Most researchers I know these days start on CIFAR10 which is much more real world and has far fewer false positive signals. A portion of the hype in CV is a paper showing a technique works on MNIST that then fails to go anywhere on any other dataset.
It's still a great data set because it's real-world, useful, high-quality, and an excellent size (~60k examples), and the "correct" classification isn't very subjective at all - humans can easily get 99% accuracy.
MNIST is pretty good as a proof of concept. While I agree it's trivial, I see MNIST as being more like how when you're making a toy programming language, the first thing you do with it is writing a recursive Fibonacci function.
So there is a better compression algorithm for the relation between the MNIST inputs and outputs!
Other models tend to add noise somewhere in there; what about feature engineering before the gzip? Maybe a Gaussian blur and a convolution first, and then deep learning for feature selection.
The article emphasizes the wrong thing, in my view. The interesting part is that compression -- without learning a model -- can be used for classification. This raises the question of what other information-theoretic measures can be used; cheaper, lossy ones.
I remember seeing an example of using zip to classify languages. You take a set of documents of equal size where you know the languages, then individually concatenate and zip them with the unknown text. The smallest compressed output is likely to be the target language.
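A rough sketch of that scheme, assuming you have a small reference text per language (the sample names below are placeholders); with equal-size references you can compare the compressed sizes of the concatenations directly, otherwise subtract the size of the reference alone:

```python
import gzip

def guess_language(unknown: str, references: dict) -> str:
    # Pick the language whose reference text lets gzip encode the unknown
    # text most cheaply.
    def csize(s: str) -> int:
        return len(gzip.compress(s.encode("utf-8")))
    return min(
        references,
        key=lambda lang: csize(references[lang] + unknown) - csize(references[lang]),
    )

# references = {"en": english_sample, "fr": french_sample}  # hypothetical texts
# print(guess_language("the quick brown fox", references))
```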
Ideally, you'd take all the documents in each language, and compress them in turn with the unclassified text, to see which compresses it better. But this won't work very well with gzip, since it compresses based on a 32KB sliding window. You might as well truncate the training data for each class to the last 32KB (more or less). So to get any performance at all out of a gzip-based classifier, you need to combine a ton of individually quite bad predictors with some sort of ensemble method. (The linked code demonstrates a way of aggregating them which does not work at all).
How much better would that get if you append all but one of the equal size documents? (or other combinations like 2 of the top results after using a single one)
Better, if the compressor can use all that extra context. Gzip, and most traditional general purpose compressors, can't.
It's hard to use distant context effectively. Even general purpose compression methods which theoretically can, often deliberately reset part of their context, since assuming a big file follows the same distribution throughout as in its beginning often hurts compression more than just starting over periodically.
An increasingly common refrain in machine learning is “intelligence is compression.” Folks who believe that might bristle at the distinction between learning and compression.
I doubt it because learned features already constitute lossy compression. The question is what kind of compression; lossy vs. lossless, learned vs. unlearned.
The point is not to have "elegant and compact" code, this is meant to be a fun curiosity, and doing it in 10 lines is just an additional layer of challenge for the heck of it.
The interesting thing is not in whether GZip can achieve SOTA, it's that it can do a decent job at all. (The interesting thing is not in whether the bear can recreate Mozart exactly, it's that it can play the piano at all.)
Yeah, it does demonstrate that you can use compression to measure similarity of two images.
But it also demonstrates that it's a pretty poor similarity measure. Something as simple as counting % of matches between the black and white pixels performs much better.
It's not trying to break records, it just shows a neat aspect of compression. It's still 8 times better than baseline, which showcases that compression can learn representation.
My favorite book about the deep connections between information theory, compression, and learning algorithms is MacKay (most probably know about it but I didn’t for a long time so maybe some will benefit from the mention).
I gather this is common knowledge (if not sufficiently emphasized at times) among those with serious educations, but as a self-taught, practical-application-type ML person, this profound thread running through all these topics (and seemingly into heavy-ass particle physics and cosmology and stuff like that) was such a blinding "Aha!" moment that I'll venture this comment in the hope that even one other person gets to have that unforgettable moment.
I was quite struck when I learned that the original Lempel-Ziv compression (which gzip is partly based on) came out of their study of the "complexity of finite sequences", not necessarily from trying to shrink things: https://ieeexplore.ieee.org/document/1055501
In fairness, you can run MNIST through UMAP and get near-perfect separation. I'm of the belief that you have to try pretty hard not to do well on MNIST these days.
EDIT: I should add, unless it isn't clear, that we really should retire the dataset. Something like the QuickDraw dataset makes a lot more sense to me.
Sorry if my comment came off as dismissive of the post! I was just trying to frame the results for those unfamiliar with the properties of MNIST.
I think it is indeed interesting, as it shows that gzip can capture some of the same things that e.g. UMAP and co. are capturing when they achieve good scores on MNIST.
Also, I'll add, that even despite some of the suspicions people have cast on the Gzip results in other experiments, I'm bullish on the utility of reducing entropy for classification problems.
> I should add, unless it isn't clear, that we really should retire the dataset.
From a research perspective, MNIST is basically solved. I think current performance on MNIST is better than humans, correct?
But it's still valuable as a teaching tool or a "Hello world" data set, because it's so simple and because most reasonable algorithms will reach 97% accuracy. Even if you build your tools from scratch, it makes a nice homework-sized problem. And it's an obviously useful task that anyone can understand. "Oh, digit recognition for mail!"
I was thinking about this just now... the real utility of MNIST is that, since it's basically impossible to get a bad result, it's a good sanity check.
Yeah, but the thing is, gzip isn't "these days". It was a thing long before UMAP and even MNIST itself. And the approach isn't particularly novel either; this is a very simple idea if you understand compression. It could have been written the first day MNIST was published, and it would still get 78% accuracy. This is kinda amazing to me.
Y'all are making the (rude) person below complaining about acronyms look reasonable. The repository doesn't define UMAP either, but if you believe ChatGPT it is:
> UMAP, which stands for Uniform Manifold Approximation and Projection, is a dimensionality reduction technique and data visualization method commonly used in machine learning and data analysis.
We work in our own ecological niches which sometimes entails having a specialized language that allows us to be precise when referring to things.
I sometimes see HN comments complaining that acronyms and terms are dropped with no explanation. These comments come across as entitled to me — instead of complaining, why not be curious and ask “what does that mean in this context?”
Imagine someone dropping in here and complaining, “the article sucks because it uses the word ‘compiler’ and assumes we all know what a ‘compiler’ is.” This is what it sounds like to those of us to work in specialized fields.
Instead if someone said, “I’m new to this, could someone help me understand what a ‘compiler’ is?”, it demonstrates curiosity and people are more inclined to explain.
I'm merely a hobbyist in this domain but isn't highly compressed data (like encrypted data) also high entropy? If this is finding patterns in the compressed data to figure out which digit the uncompressed data represents, shouldn't those patterns be exploitable for better compression?
The demonstration isn't classifying based on compressed data, rather it's classifying on the compressibility of data. The idea is that "7 7" should be more compressible than "7 3", and similarly raster images of "7 7" should be more compressible than raster images of "7 3".
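A tiny check of that intuition (the image variables below are placeholders for 28x28 uint8 arrays pulled from the dataset):

```python
import gzip
import numpy as np

def csize(arr) -> int:
    return len(gzip.compress(np.asarray(arr, dtype=np.uint8).tobytes()))

# seven_a, seven_b, three: hypothetical 28x28 uint8 MNIST images.
same = csize(np.concatenate([seven_a.ravel(), seven_b.ravel()]))
diff = csize(np.concatenate([seven_a.ravel(), three.ravel()]))
print(same, diff)  # expect same < diff more often than not
```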
Edit: One of my favorite concepts in the domain of compression is the pigeonhole principle, which implies that for any lossless compression algorithm, some outputs must be larger than their inputs.
Well-designed encrypted payloads may be run through a compressor, but the outputs should on average be larger than the inputs, rendering compression useless; thus such data is said to be "incompressible".
I can't immediately find it, but a couple of years back there was a "meta-feature" which was just the size of the MNIST image. I think that scored about 90-ish% accuracy on its own, without even looking at the image.
A few years back I worked on a project that involved fingerprinting screenshots of web pages, and compressed image size was pretty much as good as any fingerprinting method we could come up with for comparing the similarity between them.
What do you mean by "size"? Gzipped size? If you simply look at how dark an MNIST image is (count the percentage of dark pixels) you'll get about 20% accuracy, which is twice as good as a random guess but a long way from 90-ish%.
What do you mean with accuracy here?
Usually 50% accuracy means a coin toss, meaning 20% accuracy is equivalent to 80% accuracy, which is better than the article's 78% and not that far from 90%.
Only if your model is outputting a yes/no answer, right? And only if your definition of accuracy is "class with highest probability" (and not "top-N classes by probability").
If your dataset has more than 2 classes, like MNIST, a super-low accuracy only tells you to ignore the class the model guesses. It doesn't tell you which of the remaining classes is correct.
Didn't it turn out that the authors of that paper made mistakes that catapulted their results to the top of the benchmark charts? I thought the theory was inconsistent after that incident. 78% accuracy from just GZIP is impressive.
Leaving aside whether this problem is a good application of this compression trick, I want to say that everyone experimenting with this should stop using `gzip` and start using `zlib`.
If you change the first line from `gzip.compress` to `zlib.compress` you should get the same classifier performance with a 3x speedup.
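Concretely, that one-line change looks something like this (the function signature here is a guess at the notebook's `compute_ncd`; the point is just the compressor swap):

```python
import zlib

def compute_ncd(x1: bytes, x2: bytes) -> float:
    # Same DEFLATE engine underneath, but zlib skips gzip's file-style
    # header/trailer and (in older Pythons) the GzipFile wrapper overhead.
    c1 = len(zlib.compress(x1))
    c2 = len(zlib.compress(x2))
    c12 = len(zlib.compress(x1 + x2))
    return (c12 - min(c1, c2)) / max(c1, c2)
```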
General purpose compressors and information distance measures have become super interesting to me while I've been investigating alternative language models.
I've been playing around with an attention mechanism that combines the idea of using normalized compression distance (gzip) with discrete convolution between candidate sequences (a sliding window of N bytes over each). Another round of normalization over the convolution outputs, accommodating varying lengths, allows us to compare candidate sequences for relevant information on equal grounds.
So, if I understand your sliding window explanation right, would the distance between two strings X and Y then be a feature vector of all the NCDs of its windows?
Kind of reminds me of auto correlation :)
Also, just a note about your NCD formula, if C(xy) is the compressed size of the concatenation of x and y, then I would recommend also trying (C(xy)+C(yx))/2 for that term, because a lot of compressors don't compress xy the same as yx, and you probably want your distance to be symmetrical.
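A quick sketch of that symmetrized term, where `C` is whatever compressed-size function you're already using:

```python
def c_concat_symmetric(C, x: bytes, y: bytes) -> float:
    # Average both concatenation orders, since C(xy) != C(yx) for most compressors.
    return (C(x + y) + C(y + x)) / 2
```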
If you'd like to play around with MNIST yourself, I wrote a PyTorch training implementation that gets ~99.45%+ val accuracy in <13.6 seconds on a V100, est. < 6.5 seconds on an A100. Made to be edited/run in Colab: https://github.com/tysam-code/hlb-CIFAR10
It's originally kitted for CIFAR10, but I've found the parameters to be quite general. The code is very easy to read and well-commented, and is a great starting place for exploration.
Compute for the project was funded by Carter Brown and Daniel Gross; my appreciation to them both for helping make this possible. Their support via encouragement has been very helpful as well. <3 :)
Has anyone tried how different compression algorithms compare when doing NCD classification?
gzip first does LZ77 and then Huffman coding. ANS is a more modern alternative to Huffman coding that can achieve higher compression ratios than Huffman coding.
I'm sure there was something like this mentioned in "Managing Gigabytes" or thereabouts. It might have been using bitwise arithmetic compression which makes sense for the problem at hand.
Wouldn't it make more sense to create ten streams of images, one per class, and then see which one results in the smallest compressed size for a test image? That is, if I gzip a hundred '6's plus my test image and the test image adds 10 bytes to the compressed size, but doing the same for the other digits adds, say, 15 bytes, then I conclude the test image is a '6'.
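Something along these lines, as a sketch (assuming `train_images` is a numpy array of 28x28 uint8 images and `train_labels` a matching label array); note that gzip's 32 KB window means only the tail of each class stream actually provides context:

```python
import gzip
import numpy as np

def classify_by_class_stream(test_img, train_images, train_labels, per_class=100):
    best_label, best_cost = None, None
    for label in np.unique(train_labels):
        imgs = train_images[train_labels == label][:per_class]
        stream = np.concatenate([im.ravel() for im in imgs]).astype(np.uint8).tobytes()
        base = len(gzip.compress(stream))
        test_bytes = np.asarray(test_img, dtype=np.uint8).ravel().tobytes()
        # Marginal cost of appending the test image to this class's stream.
        cost = len(gzip.compress(stream + test_bytes)) - base
        if best_cost is None or cost < best_cost:
            best_label, best_cost = label, cost
    return best_label
```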
I want someone to show me how to use compression to compile exploits on emulated logic gates, like NSOGroup keeps doing
Seems like a sufficiently novel and important advancement that should be taught in universities at this point, since we need to harden software around this kind of possibility.
The internet has turned into a game of “is it him or is it autocorrect?”
I can’t count how many times I’ve looked up at a line of text to spot check it, type in a couple more letters, hit save without looking, and find later that I sound like I’m having a stroke. Never mind the times I’ve simply forgotten to look. Yes that’s a word, stop trying to replace it.
(Five minutes ago I had to type “laissez-faire” three times. Yesterday it turned “understanding” into god knows what. And if iOS doesn’t stop adding apostrophes to “its” I’m gonna fuckin lose my shit)
I ended up turning autocorrect off due to iOS being more aggressive with its corrections than my previous Android phones (both of which frequently fail to recognize common words, and maybe both, but certainly the Android, were absurdly brand-aware).
I got so tired of Mathematica italicizing its own name when I was in college so I figured out how to type it in two phases so it wouldn’t. It was my tiny little petty victory over Stephen Wolfram ruining math for me.
Additionally, try flipping your images and averaging the size before comparing the distance between them. I'd expect a boost of about 78% -> 84% or so, based on how this typically works as TTA.
facepalm I just realized I was talking about this in the context of MNIST -- my apologies.
It would be better in a problem with horizontal symmetry like CIFAR, esp if the zipping is serialized by unwinding the 2d pixel array into 1d.
One trick that _should_ work is comparing the distances you get from starting the compression at the 4 different corners of the image, in the 2 scan directions for each corner. That should provide way more than enough information for k-means clustering.
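One way to read that suggestion, as a sketch: rasterize the image from each corner in both scan orders and use the eight compressed sizes as a feature vector.

```python
import gzip
import numpy as np

def corner_scan_sizes(img: np.ndarray) -> list:
    # 4 corners x 2 scan directions (row-major / column-major) = 8 compressed sizes.
    sizes = []
    for flip_v in (False, True):
        for flip_h in (False, True):
            view = img[::-1] if flip_v else img
            view = view[:, ::-1] if flip_h else view
            for order in ("C", "F"):
                raster = view.flatten(order=order).astype(np.uint8).tobytes()
                sizes.append(len(gzip.compress(raster)))
    return sizes
```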
My apologies again for my mistake, thank you for asking, I wouldn't have really seen that otherwise :')))).
What they mean by "intelligence is compression" is that there's actually an equivalence between the two.
See, an "intelligence" is a predictor. Like an LLM, it predicts the next char/token depending on what it's seen before.
In order to turn this into a compressor, you store the "difference" between the predicted token and the actual token. If the predictor is good, this stream of data will be very low entropy which can be compressed with arithmetic encoding or something similar.
In the case of arithmetic encoding you get lossless compression. (additionally, because arithmetic encoding is based on probabilities (frequencies), if the predictor outputs probabilities instead of a single prediction, this can also be used to crunch the entropy)
Now look at for instance the speech codecs in GSM. They use Linear Prediction Coding, which has the same concept of using a predictor and storing the error stream. Except the latter's coefficients are rounded down, making it a form of lossy compression.
And yes, you can probably make a pretty good (lossless or perhaps even lossy, but I don't think you want that) text compressor by using an LLM to (deterministically) predict (the likelihood of) tokens and storing only the errors/differences. It should be able to outperform zip or gzip, because it can make use of language knowledge to make predictions.
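The cost accounting behind that equivalence, as a sketch: an arithmetic coder driven by a predictor can get within a few bits of the predictor's cross-entropy on the data, so a better predictor directly means fewer bits. (The `predictor` below is a placeholder for any model that returns next-token probabilities.)

```python
import math

def ideal_compressed_bits(tokens, predictor) -> float:
    # predictor(prefix) -> dict mapping candidate next tokens to probabilities.
    total = 0.0
    for i, tok in enumerate(tokens):
        p = predictor(tokens[:i]).get(tok, 1e-9)  # floor to keep the sum finite
        total += -math.log2(p)                    # bits an arithmetic coder would spend
    return total
```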
There's a catch, however, which is that in the case of LLM compression you also need to store all those weights somewhere, because you need the predictor to decode. This is always the catch with compression: there is always some kind of "model" or predictor algorithm that is implicit to the compressor. In the case of gzip it's a model that says "strings tend to repeat in patterns", which is of course "stored" (in some sense) in the gzip executable code. But with gzip we don't add the size of the gzip executable to our compressed files either, because we get to compress a lot of files with one model. The same goes for this hypothetical LLM text-compression scheme; you just need to compress a whole lot more text before it's worth it.
All that said, however, like many others pointed out, 78% isn't a great score for MNIST.
Then again, I also don't think that gzip compression is the best predictor for grayscale image similarity. For one, if you have two pixels with grayscale values 65 and 66, gzip will see them as "different bytes", regardless of them being very similar in grayscale level. You might even be able to increase the score by thresholding the training set to B/W 0/255.
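That thresholding step would just be something like the following, so that near-identical shades become identical bytes the compressor can match exactly:

```python
import numpy as np

def to_black_and_white(img, thresh: int = 128) -> np.ndarray:
    return np.where(np.asarray(img) >= thresh, 255, 0).astype(np.uint8)
```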
MNIST is a classic image classification exercise - a dataset of 60,000 training images and 10,000 testing images where each image is a handwritten numeral as a 28x28 pixel grayscale image.
The challenge is to build a computer vision model that can tell which numeral each handwritten digit represents.
MNIST (and OCR) are acronyms so well known to anyone who has taken any kind of intro ML class that there's no more need to define them in a short blog post than there would be for us to define HTML. I learned about MNIST in 2008, when I was just taking a class on numerical methods in Matlab as a physics major, and MNIST was just one project.
This isn’t a forum about machine learning, though. It’s a general forum of geek news. I could talk to you all day about compression. Gzip plus MNIST rings no bells.
Are you running google searches on https://news.ycombinator.com/news before clicking on stuff? Or expecting an executive summary. I think you can guess which one I’m doing. And which one I believe is normal human behavior.
No, but if the article headline interested me enough to click into the article, I would Google MNIST if I didn't know it already (ML is full of acronyms for datasets, methods, etc. - can't expect blog posts to define all of them. If it were a paper that didn't define the term, I would be annoyed.)
Also, just because an acronym isn't used doesn't mean you'll understand the jargon. For example, also on the front page is a paper titled "Neurons in Large Language Models: Dead, N-Gram, Positional". No acronyms, but I would certainly need to Google (or actually read the paper, not just the abstract) to know what the 3 terms at the end means (well, I do know what a dead neuron is, but not the other two).
Two questions and you answered the more rhetorical one. Is that a no?
Anyway my main beef isn’t with this GitHub repo or the author, it’s that nobody remembers how to write a goddamned overview anymore. The point of tree structured data is making it cheap to backtrack when doing a semi random search. The overview is an important part of making hypertext work, particularly when it feeds through catch-all lists like a news site or a landing page for a wiki.
When a hyperlink is organically embedded into a paragraph, you can usually guess what it’s for, from the sentence and the context. When it’s just a title that is lost.
I think the idea is that compression exploits recognized similarities, which raises the question of whether a compressor is actually recognizing things usefully.
Oh my ex and some acquaintances were bitching about it in 1999.
I recall a guy with good transcription skills starting a long thread on… must have been slashdot? About how it was faster for him to transcribe the numbers again than to verify the OCR was correct, but he couldn’t tell his boss that and what should he do?
There’s also a famous case where the compression algorithm in a copy/fax machine had a bug in a predefined dictionary that resulted in changing digits to other digits when making copies or faxes. Because they weren’t just compressing, they were trying to use a specialized OCR to do aggressive compression. I believe it was spitting out perfectly formed zeroes where another digit was in the original. Yikes.
> This is not an OCR problem (as we switched off OCR on purpose), it is a lot worse – patches of the pixel data are randomly replaced in a very subtle and dangerous way: The scanned images look correct at first glance, even though numbers may actually be incorrect.
It's lossy compression turned to 11 that looks convincingly non-lossy.
I know technically you are right, but I feel like if you sat down and described the high-level features and goals of that image compression (to compress text to a ridiculous degree) and of OCR, you'd be hard-pressed to say which list is which unless it was given away by jargon words. My brain compressed it to "OCR" and filed it there.
It’s like Jim Gaffigan’s joke about working in a TexMex restaurant in the northern Midwest. It’s a tortilla, cheese, meat and beans… I tell you what I’ll just bring you something.
YES. God what a nightmare. Can you imagine all of the arguments and threats of lawsuits that bug caused for small and medium businesses? You changed the contract after we had a verbal agreement and tricked me into signing!
I wonder if any executive assistants got fired over that. Surely.
I believe you missed the point, though I could be wrong.
Person A complains about MNIST not being defined. Person B was sending them up for complaining about undefined acronyms and then using one they didn't define themselves.