Hacker News new | past | comments | ask | show | jobs | submit login
78% MNIST accuracy using GZIP in under 10 lines of code (jakobs.dev)
363 points by js98 on Sept 20, 2023 | hide | past | favorite | 137 comments



I tried replacing the distance function in the code with some simpler distance measures:

    Gzip distance: ~3 minutes, 78% accuracy
    Euclidean distance: ~0.5 seconds, 93% accuracy
    Jaccard distance * : ~0.7 seconds, 94% accuracy
    Dice dissimilarity * : ~0.8 seconds, 94% accuracy

    * after binarising the images
So, as a distance measure for classifying MNIST digits, GZIP has lower accuracy, and is much more computationally demanding than other measures.

I'm not that familiar with how the GZIP algorithm works, it's kind of interesting that it's so much lower. I wonder whether image focused compression algorithms might do better?

(Edit: Btw, I enjoyed the post; it's a creative idea, the writing and code was great, and it's sparked some good discussion. But after having a closer look, I think the baselines above provide some context to the gzip scores.)


Best one I found is Normalized Mutual Information at 95%. It's a little more complex, but you could still compute it relatively quickly on binarized images.

    NMI skimage: ~30 seconds, 95% accuracy
    NMI numba *: ~0.6 seconds, 95% accuracy
Here's the code, courtesy of ChatGPT:

    @njit(cache=True)
    def compute_joint(img1, img2):
        stacked = 2 * img1.ravel() + img2.ravel()
        return np.bincount(stacked, minlength=4).reshape(2, 2)

    @njit(cache=True)
    def entropy(counts, total):
        # Convert counts to probabilities
        p = counts / total
        # Only calculate for non-zero entries to avoid log(0)
        return -np.sum(p[p > 0] * np.log2(p[p > 0]))

    @njit(cache=True)
    def normalized_mutual_information(img1, img2):
        joint_counts = compute_joint(img1, img2)
        total_pixels = 28*28
        
        # Marginal counts
        c1 = np.sum(joint_counts, axis=1)
        c2 = np.sum(joint_counts, axis=0)
        
        # Compute entropies directly from counts
        h1 = entropy(c1, total_pixels)
        h2 = entropy(c2, total_pixels)
        joint_entropy = entropy(joint_counts.flatten(), total_pixels)
        
        mutual_info = h1 + h2 - joint_entropy
        return 2 * mutual_info / (h1 + h2)


Holy cow, I knew that MNIST was simple, but not that simple.

Could you post a snippet of the code that you used to achieve this? It would be really, really nice to have a baseline to work from I'm sure, and I feel like this could be really useful to a few other areas (my personal obsession is speed-training on CIFAR10 :')))) )

Holy cow, that's insane. :O


I used the notebook linked in the original post [1]. It evaluates using 100 samples from the test set, (I'm guessing because the gzip method is slow - it would take ~7 hours on the full test set, on my machine).

I plugged in the distance measures, for the `compute_ncd` function. (Jaccard/Dice have been negated and the -1 removed.)

    def euclidean(x1, x2):
        return np.sum(np.square(x1 - x2))

    def jaccard(x1, x2):
        x1_binary = x1 > 0.5
        x2_binary = x2 > 0.5
        return np.logical_or(x1_binary, x2_binary).sum() / np.logical_and(x1_binary, x2_binary).sum()

    def dice(x1, x2):
        x1_binary = x1 > 0.5
        x2_binary = x2 > 0.5
       return 2 * np.logical_or(x1_binary, x2_binary).sum() / (x1_binary.sum() + x2_binary.sum())

[1] https://github.com/Jakob-98/mono/blob/73168bc0ea904e75865815...


Ohhh, you know what? I think this makes sense, since it's basically like making prototypes from the training set, only it's instead comparing against every training example, then averaging. Cool. Did not know that would be remotely close to 90%, but that makes sense to me.

I wonder if something like a weighted max_mean would perform better or something L1. Or maybe L2 is ideal, it is at the center of a lot of things information theory, after all! ;PPPP


MNIST is what the Post Office solved in the late 1990s. It's basically the "hello world" of visual recognition these days.

https://www.govexec.com/federal-news/1999/02/postal-service-...


ben recht's kernel method implementation in 10 lines hits 98%

https://github.com/benjamin-recht/mnist_1_pt_2/tree/main


Great link, thank you so much, lol.

This line freaking killed me:

https://github.com/benjamin-recht/mnist_1_pt_2/blob/c0fe96bc...


Very cool. I tried using PNG compression, it does actually do better, slightly:

    PNG : ~15.1 seconds, 83% accuracy
I also tried dropping in zstandard compression:

    Zstd (level=3) : ~3.5 seconds, 88% accuracy
Much faster than gzip, at least. And iff I use (x1-x2)*2 instead of x1+x2 to calculate Cx1x2 that pushes zstd up to 93% accuracy.

I'm somewhat interested in the fact that if I stack the two arrays on top of each other instead of adding them together, I get absolutely garbage performance (<20%), however as far as I can tell that actually works well when classifying strings.


Re garbage performance on stacking: I had tried interleaving the values (creating a 56x28 img), it dropped one percentage point of accuracy when using gzip.


That's pretty amazing.

So the whole gzip thing - while fancy - really is extra steps for less bang.


For a comparison with others techniques:

Linear SVC (best performance): 92 %

SVC rbf (best performance): 96.4 %

SVC poly (best performance): 94.5 %

Logistic regression (prev assignment): 89 %

Naive Bayes (prev assignment): 81 %

From this blog page: https://dmkothari.github.io/Machine-Learning-Projects/SVM_wi...

Also it seems from reading online articles that people are able to obtain much better results just by using K-NN, so I imagine that the author just made his job harder by using gzip but I could be wrong about this.


Thanks for posting this.

Most people don't realize that Logistic regression can get ~90% accuracy on MNIST.

As a big fan of starting with simple models first and adding complexity later, I've frequently been told that "logistic regression won't work!" for problems where it can in fact perform excellent.

When faced with this resistance to logistic regression I'll often ask what they think the baseline performance of it would be on MNIST. The guesses I hear most often are 20-30%.

People, even machine learning people, often don't realize the rapidly diminishing returns you get for adding a lot of complexity in your models. Sometimes it's worth it, but it's always good to start simple first.

It's also been my experience that if you don't get good performance with a simple model, you are very unlikely to get great performance from a more complex one.


Came here to say the same thing, actually NCD can probably do much better than 78%. Li & Vitanyi's book about Kolmogorov complexity has some interesting unsupervised examples.

A simple CNN as implemented in Keras tutorial can easily exceed 98%. 78% is very poor performance for MNIST even if model complexity is penalized.


When would a 90% accuracy on a dataset like mnist ever be useful? And I mean useful as in usable for actual products or software. especially considering mnist is more of a toy dataset.

I think that's why machine learning is the way to go for this type of detection, why go with anything else than a CNN (in this case) when it is now trivial to set up and train? Again, unless it's just to mess around with, 90% mnist accuracy is not useful in the real world


I don't think the point was that you should use logistic regression on MNIST. In lesser-known problems, say a custom in-house model, if you don't try the simpler approach first, you'll never know that your more complex solution is not worth the extra expense, or is actually worse than a simpler, cheaper model. MNIST is well-known to have nearly perfect solutions at this point, but for most novel problems, the data scientist has no idea what is theoretically possible.

Now, you can say that CNNs or other techniques are easily accessible these days, and almost trivial to set up. But they may not be trivial to train and run in terms of compute in the real world.


If you stack logistic regression in layers, you get a neural network.


> People, even machine learning people, often don't realize the rapidly diminishing returns you get for adding a lot of complexity in your models.

Are you me? I was just having this argument at work with someone about using an old (fasttext/word2vec) model vs. the overhead on fine-tuned BERT model for a fairly simple classification problem.


To a large degree this happens because logistic regression is not a sexy approach that one can add to their CV. Everyone wants to solve problems with big, complicated and buzzwordy models, because that sells (and is perhaps more interesting as well).

It's really a tragedy, because so many classic models would work fine for real world applications.


Extending existing features with their non-linear mappings would improve logistic regression too, probably to the svc level (rbf or poly kernel is that but implicit).

Linear models are really well researched, and today with the compute and with proper training and data preparation they can easily get to satisfying levels of performance for a variety of tasks.


this is true unless you can use large pretrained models, which are very simple to use, are very resistant to ask sorts of noise, & would get ultra high accuracy with just logistic regression on the output of the model


You can get 80ish percent with a linear model (if you one hot the labels).


That blog page is not showing state-of-the-art results, it's just taking relatively naive SVM implementations and comparing them.

The original research paper that introduced the MNIST data set was getting around 98% accuracy, and today neural nets are getting 99.87% accuracy.

https://paperswithcode.com/sota/image-classification-on-mnis...


Yep I was showing others simple techniques, of couse a neural network would be much better at it.


The whole point isn't to do better. It's to show that enough information survives compression that you can still get very large signal. Compression is intended to make things harder and still does.


The article OP linked is inspired by Gzip+Knn paper that was released a few months ago, compression is not intended to make things harder but to extract useful features that can be used by simple algorithms like Knn or SVG etc...

For example it would be almost impossible to distinguish cat and dog images by using SVG or Knn directly on the data but it would be much easier if the images were first passed into an image encoder and a small embeddings vector would be used for each one instead. In the article OP linked it doesn't seem that Gzip is very good at extracting useful patterns for the MNIST dataset.


Why? I mean it's trivial to add a layer of encryption ...

(Also the terminology seems off here, as 100% of information survives gzip.)


No, that wasn’t the point at all. Compressibility was used to determine similarity.


And what it accidentally showed, was that NCD between individual digits in the training set is a really terrible distance metric for classification.

You can do classification with KNN, which is obvious. You can also do classification with compression, which is less obvious, and neat. This approach tries to combine them in a way which doesn't work.


While it's cool that this works at all, I wish we would stop using MNIST as a benchmark given how trivial it is.


It's a good benchmark because it's so trivial.

Sure it's not great at differentiating between SotA techniques, but it's very useful for sanity checks like this one.

Even for SotA models, it's still useful to verify that you can get greater than 98% accuracy on MNIST, before exploring larger, more complex bench marks.

It certainly shouldn't be the only benchmark but it's a great place to start iterating on ideas.


It's a bad benchmark because it's artificially clean. It's effectively a 2d dataset with no occlusions. So nearly everything you try on it will work, and many things you try on it won't scale to typical image problems. There are good 3d datasets with more realistic examples that are still fairly simplistic compared to the state of the art large datasets, but at least give you signal that your technique is robust to common problems in vision. MNIST is so simplistic that you encounter none of the typical problems in computer vision settings so it doesn't give you a good prediction of how good your technique is.


Can you imagine a model that performs very poorly on MNIST that would perform well in real-world computer visions problems? If you can't, then MNIST is a nice simple smoke test when assessing models.


That's what makes it a good benchmark.

It's a benchmark.

Not a real world problem.

That's why the traditional path has been MNIST -> CIFAR10 (optionally -> CIFAR100) -> ImageNet -> ????!?.

Because it gets gradually more complicated.

Your iteration time is the constraint to development progress.

Keep that down, and the bugs from your initial implementation will be significantly less impactful.


What's wrong with "artificially clean"? The goal of benchmarks is to compare and know whether one model is better than the other. There is never a "perfect" or "objective" benchmark. Different benchmarks may highlight advantages in certain models, which is a good thing, but there is absolutely nothing wrong with using MNIST as a dataset to give you a basic idea of how models perform.


Artificially clean gives you too many false positives. Most researchers I know these days start on CIFAR10 which is much more real world and has far fewer false positive signals. A portion of the hype in CV is a paper showing a technique works on MNIST that then fails to go anywhere on any other dataset.


So MNIST for ML models is kind of like FizzBuzz for humans doing software development.


It's still a great data set because it's real-world, useful, high-quality, and an excellent size (~60k examples), and the "correct" classification isn't very subjective at all - humans can easily get 99% accuracy.


MNIST is pretty good as a proof of concept. While I agree it's trivial, I see MNIST as being more like how when you're making a toy programming language, the first thing you do with it is writing a recursive Fibonacci function.


Take a look at the EMNIST - Extended MNIST dataset. It has both digits and letters of the alphabet.


So there is a more optimal compression algorithm for the relation between the MNIST inputs and outputs!

Other models tend to add noise somewhere in there; what about feature engineering before the gzip? Maybe a gaussian blur and a convolution first and then deep learning for feature selection


Obviously, the code may be elegant and compact, 78% accuracy is considered very very bad for MNIST.

A dummy model written with Tensorflow easilly reaches 90% accuracy. The best models ranked at 99,87%, see the benchmark : https://paperswithcode.com/sota/image-classification-on-mnis...


The article emphasizes the wrong thing, in my view. The interesting part is that compression -- without learning a model -- can be used for classification. This raises the question of what other information-theoretic measures can be used; cheaper, lossy ones.

To Compress or Not to Compress- Self-Supervised Learning and Information Theory: A Review https://arxiv.org/abs/2304.09355\*


I remember seeing an example of using zip to classify languages. You take a set of documents of equal size where you know the languages, then individually concatenate and zip them with the unknown text. The smallest compressed output is likely to be the target language.

I can't find the original blog, but there's a note about it here - https://stackoverflow.com/questions/39142778/how-to-determin...


Ideally, you'd take all the documents in each language, and compress them in turn with the unclassified text, to see which compresses it better. But this won't work very well with gzip, since it compresses based on a 32KB sliding window. You might as well truncate the training data for each class to the last 32KB (more or less). So to get any performance at all out of a gzip-based classifier, you need to combine a ton of individually quite bad predictors with some sort of ensemble method. (The linked code demonstrates a way of aggregating them which does not work at all).


How much better would that get if you append all but one of the equal size documents? (or other combinations like 2 of the top results after using a single one)


Better, if the compressor can use all that extra context. Gzip, and most traditional general purpose compressors, can't.

It's hard to use distant context effectively. Even general purpose compression methods which theoretically can, often deliberately reset part of their context, since assuming a big file follows the same distribution throughout as in its beginning often hurts compression more than just starting over periodically.


Now that you mention it, I vaguely recall writing a language classifier based on character histograms as a youth. Good times.


I mean, they’re using knn underneath. You can apparently get 97% accuracy with normal knn, at n=4 if you compare pixel distance.

So another way to frame this might be that gzip costs a lot of accuracy but may lead to better performance.

https://newpblog.netlify.app/2018-01-24-knn-analysis-on-mnis...


An increasingly common refrain in machine learning is “intelligence is compression.” Folks who believe that might bristle at the distinction between learning and compression.


I doubt it because learned features already constitute lossy compression. The question is what kind of compression; lossy vs. lossless, learned vs. unlearned.


The point is not to have "elegant and compact" code, this is meant to be a fun curiosity, and doing it in 10 lines is just an additional layer of challenge for the heck of it.

The interesting thing is not in whether GZip can achieve SOTA, it's that it can do a decent job at all. (The interesting thing is not in whether the bear can recreate Mozart exactly, it's that it can play the piano at all.)


Yeah, it does demonstrate that you can use compression to measure similarity of two images.

But it also demonstrates that it's a pretty poor similarity measure. Something as simple as counting % of matches between the black and white pixels performs much better.


It's not trying to break records, it just shows a neat aspect of compression. It's still 8 times better than baseline, which showcases that compression can learn representation.


Replacing the NCD

  distances = [(compute_ncd(x1, x), label) for x, _, label in compressed_lengths]
with the Euclidean distance

  distances = [(np.sqrt(np.sum(np.square(x1-x))), label) for x, _, label in compressed_lengths]
gives you +15% test accuracy and saves you a lot of compute.


My favorite book about the deep connections between information theory, compression, and learning algorithms is MacKay (most probably know about it but I didn’t for a long time so maybe some will benefit from the mention).

I gather this is common knowledge (if not sufficiently emphasized at times) among those with serious educations, but as a self-taught, practical application-type ML person this profound thread running through all these topics (and seemingly into heavy-ass particle physics and cosmology and stuff like that) was a blinding “Aha!” moment that I’ll venture this comment in the hopes that even one other person has that unforgettable moment.


i have put MacKay on my to do list, thank you

i was quite struck when i learned the original Lempel Ziv compression (which gzip is partly based on) came out of their study of the "complexity of finite sequences" not necessarily trying to shrink things, https://ieeexplore.ieee.org/document/1055501


In fairness you can run MNIST through UMAP and get near perfect seperation. I'm of the belief that you have to try pretty hard not to do well on MNIST these days.

https://github.com/lmcinnes/umap_paper_notebooks/blob/master...

EDIT: I should add, unless it isn't clear, that we really should retire the dataset. Something like the QuickDraw dataset makes a lot more sense to me.


Author here: completely agree. I think it is not much of an achievement by itself, but it is interesting to see that it works!

I will add a comment/edit to the post once I am home to clarify the relative ease of solving MNIST


Sorry if my comment came off as dismissive of the post! I was just trying to frame the results for those unfamiliar with the properties of MNIST.

I think it is indeed interesting as it shows that Gzip can capture the same things that e.g. UMAP et co. are capturing when they acheive good scores on MNIST.

Also, I'll add, that even despite some of the suspicions people have cast on the Gzip results in other experiments, I'm bullish on the utility of reducing entropy for classification problems.


> I should add, unless it isn't clear, that we really should retire the dataset.

From a research perspective, MNIST is basically solved. I think current performance on MNIST is better than humans, correct?

But it's still valuable as a teaching tool or a "Hello world" data set, because it's so simple and because most reasonable algorithms will reach 97% accuracy. Even if you build your tools from scratch, it makes a nice homework-sized problem. And it's an obviously useful task that anyone can understand. "Oh, digit recognition for mail!"


I was thinking about this just now... the real utility of MNIST is that, since it's basically impossible to get a bad result, it's a good sanity check.


Yeah, but the thing is, gzip isn't "these days". It was a thing long before UMAP and even MNIST itself. And the approach isn't perticulary novel too, this is a very simple idea, if you do understand compression. It could've been written the first day MNIST was published, and it would be still 78% accuracy. This is kinda amazing to me.


Y'all are making the (rude) person below complaining about acronyms look reasonable. The repository doesn't define UMAP either, but if you believe ChatGPT it is:

> UMAP, which stands for Uniform Manifold Approximation and Projection, is a dimensionality reduction technique and data visualization method commonly used in machine learning and data analysis.


We work in our own ecological niches which sometimes entails having a specialized language that allows us to be precise when referring to things.

I sometimes see HN comments complaining that acronyms and terms are dropped with no explanation. These comments come across as entitled to me — instead of complaining, why not be curious and ask “what does that mean in this context?”

Imagine someone dropping in here and complaining, “the article sucks because it uses the word ‘compiler’ and assumes we all know what a ‘compiler’ is.” This is what it sounds like to those of us to work in specialized fields.

Instead if someone said, “I’m new to this, could someone help me understand what a ‘compiler’ is?”, it demonstrates curiosity and people are more inclined to explain.


It's[0] a non-linear dimensionality reduction technique in the vein of t-SNE[1] that is very well known in the field.

My two cents is that if your problem can be solved with something like UMAP + kNN[2], then you really shouldn't be using Deep Learning to solve it.

[0] https://en.wikipedia.org/wiki/Nonlinear_dimensionality_reduc...

[1] https://en.wikipedia.org/wiki/T-distributed_stochastic_neigh...

[2] https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm


Doesn’t your choice of initialization really affect the results? Lior Pachter really seems to keep beating that drum


There's some work[0] that argues that UMAP's performance relative to t-SNE is not that great when you control for the initialization.

I'm not sure where I sit on that, but regardless, my understanding is that getting UMAP to separate MNIST doesnt really require you to turn knobs.

[0] https://www.biorxiv.org/content/10.1101/2019.12.19.877522v1


UMAP is as ubiquitous as T-SNE these days, though to be fair those on HN that focus on JVM, LLVM, NPM, or TLS may not know about UMAP/T-SNE.


I'm merely a hobbyist in this domain but isn't highly compressed data (like encrypted data) also high entropy? If this is finding patterns in the compressed data to figure out which digit the uncompressed data represents, shouldn't those patterns be exploitable for better compression?


The demonstration isn't classifying based on compressed data, rather it's classifying on the compressibility of data. The idea is that "7 7" should be more compressible than "7 3", and similarly raster images of "7 7" should be more compressible than raster images of "7 3".


Ah ok, that makes sense. Thank you for the explanation.


Encrypted data ideally is incompressible. Incompressibility is a hallmark of efficient cryptographic operations.

See the Wikipedia article on Kolmogorov complexity, which has a short section about compression. https://en.wikipedia.org/wiki/Kolmogorov_complexity#Compress...

Edit: One of my favorite concepts in the domain of compression is the pigeonhole principle, that states that for all compression algorithms, some outputs will be larger than the inputs.

Well designed encrypted payloads may be compressed, but the outputs should on average be larger than the inputs, rendering compression useless thus it is said to be "incompressible".

https://en.wikipedia.org/wiki/Pigeonhole_principle#Uses_and_...


I don't immediately find it, but couple of years back there was a "meta-feature" which was the size of the MNIST image. I think that scored about 90'ish % accurate results on its own - without even looking at the image.


A few years back I worked on a project that involved fingerprinting screenshots of web pages, and compressed image size was pretty much as good as any fingerprinting method we could come up with for comparing the similarity between them.


The off-the-beaten-path nature of this reminds me of banks sending $<arbitrary decimal amount> as a PIN for auth.


Makes sense, considering that's not too terribly far off from what the KL divergence does


What do you mean by “size”? Gzipped size? If you simply look at how dark a Mnist image is (count the percentage of dark pixels) you’ll get about 20% accuracy, which is twice better than random guess but a long way from 90’ish %.


What do you mean with accuracy here? Usually 50% accuracy means cointoss, meaning 20% accuracy is equal to 80% accuracy, which is better than the article's 78% and not that far from 90%.


> meaning 20% accuracy is equal to 80% accuracy,

Only if your model is outputting a yes/no answer right? And that your definition of accuracy is "class with highest probability" (and not "N classes with highest prob")

If your dataset has more than 2 classes like MNIST, a super low accuracy only tells you to ignore the class the model guesses. It doesn't tell you which of the remaining classes is correct


There are ten choices, so getting the answer right 20% of the time is very plausible.


"One simple trick to beat the statistical odds...."


There are 10 classes in Mnist. Random guess would is 10%.


Didn't it turn out that authors of that paper have made mistakes that catapulted their results to the top of the benchmark charts? I thought the theory was inconsistent after that incident. 78% accuracy from just GZIP is impressive.


haven’t looked through this post yet but I think this is what you have in mind: https://kenschutte.com/gzip-knn-paper/


This is 78% accuracy from Gzip-based compression distance + KNN, which seems to be worse than any other distance measure you can think of + KNN.


Leaving aside whether this problem is a good application of this compression trick, I want to say that everyone experimenting with this should stop using `gzip` and start using `zlib`.

If you change the first line from `gzip.compress` to `zlib.compress` you should get the same classifier performance with a 3x speedup.


General purpose compressors and information distance measures have become super interesting to me while I've been investigating alternative language models.

I've been playing around with an attention mechanism that combines the idea of using normalized compression distance (gzip) with discrete convolution between candidate sequences (sliding window of N bytes over each). Another round of normalization over the convolution outputs - accommodating varying lengths - allows for us to compare candidate sequences for relevant information on equal grounds.

The NCD formula I am using right now:

  NCD(x,y) = (C(xy) - MIN(C(x),C(y))) / MAX(C(x),C(y))
No weird parameters or any other things to tune. The only parameters are the source documents and the input context/query.


So, if I understand your sliding window explanation right, would the distance between two strings X and Y then be a feature vector of all the NCDs of its windows?

Kind of reminds me of auto correlation :)

Also, just a note about your NCD formula, if C(xy) is the compressed size of the concatenation of x and y, then I would recommend also trying (C(xy)+C(yx))/2 for that term, because a lot of compressors don't compress xy the same as yx, and you probably want your distance to be symmetrical.


All these ideas date back to at least 2010, probably earlier. It was called "information distance". https://arxiv.org/abs/1006.3520


If you'd like to play around with MNIST yourself, I wrote a PyTorch training implementation that gets ~99.45%+ val accuracy in <13.6 seconds on a V100, est. < 6.5 seconds on an A100. Made to be edited/run in Colab: https://github.com/tysam-code/hlb-CIFAR10

It's originally kitted for CIFAR10, but I've found the parameters to be quite general. The code is very easy to read and well-commented, and is a great starting place for exploration.

Min-cut deltas to run MNIST:

  .datasets.CIFAR10(' -> .datasets.MNIST(' (both occurences)

  'whiten': Conv(3, -> 'whiten': Conv(1,

  crop_size = 28 - > `crop_size = 28
Compute for the project funded by Carter Brown and Daniel Gross, my appreciation to them both for helping make this possible. Their support via encouragement has been very helpful as well. <3 :)


It goes along the same lines as the recent paper from Deepmind stating that DL (language modeling in their case) is compression.

https://arxiv.org/pdf/2309.10668.pdf


Similar result (2019) using ZIP, getting 74%: https://www.blackhc.net/blog/2019/mnist-by-zip/


Has anyone tried how different compression algorithms compare when doing NCD classification?

gzip first does LZ77 and then Huffman coding. ANS is a more modern alternative to Huffman coding that can achieve higher compression ratios than Huffman coding.


I'm sure there was something like this mentioned in "Managing Gigabytes" or thereabouts. It might have been using bitwise arithmetic compression which makes sense for the problem at hand.


What explains the huge difference in perf with the approach from 2019 mentioned at the end?

For image classification, couldn't one use the residual from applying a jpeg codebook learned on a training example as metric?


Wouldn't it make more sense to create ten streams of images of the same class and then see which one results in the smallest compressed size for a test image? That is, if I gzip a hundred '6's plus my test image and get a compressed size for my test image of 10 bytes, but doing the same for other digits gives me say 15 bytes then I conclude the test image is a '6'.


I want someone to show me how to use compression to compile exploits on emulated logic gates, like NSOGroup keeps doing

Seems like a sufficiently novel and important advancement that should be taught in universities at this point, since we need to harden software around this kind of possibility.


I wonder why you started with a code-golfed version? That seems original to the main point of the post


original -> orthogonal?

I like short code! It's like a joke "this is so easy, 10 loc" :)


The internet has turned into a game of “is it him or is it autocorrect?”

I can’t count how many times I’ve looked up at a line of text to spot check it, type in a couple more letters, hit save without looking, and find later that I sound like I’m having a stroke. Never mind the times I’ve simply forgotten to look. Yes that’s a word, stop trying to replace it.

(Five minutes ago I had to type “laissez-faire” three times. Yesterday it turned “understanding” into god knows what. And if iOS doesn’t stop adding apostrophes to “its” I’m gonna fuckin lose my shit)


I ended up turning auto correct off due to iOS being more aggressive with its corrections than my previous android phones (both of which frequently fail to recognize common words, and maybe both but certainly the android was absurdly brand aware).


> brand aware

I got so tired of Mathematica italicizing its own name when I was in college so I figured out how to type it in two phases so it wouldn’t. It was my tiny little petty victory over Stephen Wolfram ruining math for me.

I know that struggle.


Having some artificial intelligence algorithm that is completely understood and tu able like gzip is would be great for many uses.

I think it’s pretty hard to just improve a NN without more data or some non trivial amount of effort.


Additionally, try flipping your images and averaging the size before comparing the distance between them. I'd expect a boost of about 78% -> 84% or so, based on how this typically works as TTA.


Can you elaborate more on that technique?

Why would it improve the result so much? Sounds very interesting so I'm curious.


facepalm I just realized I was talking about this in the context of MNIST -- my apologies.

It would be better in a problem with horizontal symmetry like CIFAR, esp if the zipping is serialized by unwinding the 2d pixel array into 1d.

One trick that _should_ work is by comparing the distances of starting the compression at the 4 different corners of the image, in the 2 separate directions for each corner. That should provide way more than enough information for k-means clustering.

My apologies again for my mistake, thank you for asking, I wouldn't have really seen that otherwise :')))).


What they mean with "“intelligence is compression”" is that there's actually an equivalence between them.

See, an "intelligence" is a predictor. Like a LLM, it predicts the next char/token depending on what it's seen before.

In order to turn this into a compressor, you store the "difference" between the predicted token and the actual token. If the predictor is good, this stream of data will be very low entropy which can be compressed with arithmetic encoding or something similar.

In the case of arithmetic encoding you get lossless compression. (additionally, because arithmetic encoding is based on probabilities (frequencies), if the predictor outputs probabilities instead of a single prediction, this can also be used to crunch the entropy)

Now look at for instance the speech codecs in GSM. They use Linear Prediction Coding, which has the same concept of using a predictor and storing the error stream. Except the latter's coefficients are rounded down, making it a form of lossy compression.

And yes, you can probably make a pretty good (lossless or perhaps even lossy, but I don't think you want that) text compressor by using an LLM to (deterministically) predict (the likelihood of) tokens and storing only the errors/differences. It should be able to outperform zip or gzip, because it can make use of language knowledge to make predictions.

There's a catch, however, which is that in the case of LLM compression you also need to store all those weights somewhere, cause you need the predictor to decode. This is always the catch with compression, there is always some kind of "model" or predictor algorithm that is implicit to the compressor. In the case of gzip it's a model that says "strings tend to repeat in patterns", which is of course "stored" (in some sense) in the gzip executable code. But with gzip we don't count the size of gzip to our compressed files either, because we get to compress a lot of files with one model. Similarly for this hypothetical LLM-text compression scheme, but you just need to compress a whole lot more text before it's worth it.

All that said, however, like many others pointed out, 78% isn't a great score for MNIST.

Then again, I also don't think that gzip compression is the best predictor for gray scale image similarity. For one if you have two pixels of grayscale values 65 and 66, gzip will see them as "different bytes", regardless of them being very similar in grayscale level. You might even be able to increase the score by thresholding the training set to BW 0/255.


"MNIST"?

accuracy - but of what?

what's this about?


MNIST is a classic image classification exercise - a dataset of 60,000 training images and 10,000 testing images where each image is a handwritten numeral as a 28x28 pixel grayscale image.

The challenge is to build a computer vision model that can tell which numeral each handwritten digit represents.

https://en.wikipedia.org/wiki/MNIST_database

78% accuracy on a solution is pretty bad, but achieving it just using GZIP is a very neat hack.


What's the average human performance for this task?


I don't know off hand, but go take a look at the images - I would expect near 100%.


Sadly,there are several errors in the labeled data, so no one should get 100%.

See https://labelerrors.com/


Just looking at a few of those I think I see them mostly as MNIST reports them? But yes, no one could get 100% due to ambiguity.

Very neat site though, I appreciate you showing that to me


thank you


[flagged]


MNIST (and OCR) are acronyms that are so well-known to anyone who has taken any kind of intro to ML class that there's no more need to define them in a short blog post than there would be for us to define HTML. I learned about MNIST in 2008, and I was just taking a class on numerical methods as a physics major in Matlab where MNIST was just one project.


This isn’t a forum about machine learning, though. It’s a general forum of geek news. I could talk to you all day about compression. Gzip plus MNIST rings no bells.


Fortunately MNIST is a pretty distinctive Google search term. At most you would need to search for (MNIST machine learning).


Are you running google searches on https://news.ycombinator.com/news before clicking on stuff? Or expecting an executive summary. I think you can guess which one I’m doing. And which one I believe is normal human behavior.


No, but if the article headline interested me enough to click into the article, I would Google MNIST if I didn't know it already (ML is full of acronyms for datasets, methods, etc. - can't expect blog posts to define all of them. If it were a paper that didn't define the term, I would be annoyed.)

Also, just because an acronym isn't used doesn't mean you'll understand the jargon. For example, also on the front page is a paper titled "Neurons in Large Language Models: Dead, N-Gram, Positional". No acronyms, but I would certainly need to Google (or actually read the paper, not just the abstract) to know what the 3 terms at the end means (well, I do know what a dead neuron is, but not the other two).


No need to guess, you requested a summary:

> you’ve missed the most important hyperlink: what the fuck is a MNIST?


Two questions and you answered the more rhetorical one. Is that a no?

Anyway my main beef isn’t with this GitHub repo or the author, it’s that nobody remembers how to write a goddamned overview anymore. The point of tree structured data is making it cheap to backtrack when doing a semi random search. The overview is an important part of making hypertext work, particularly when it feeds through catch-all lists like a news site or a landing page for a wiki.

When a hyperlink is organically embedded into a paragraph, you can usually guess what it’s for, from the sentence and the context. When it’s just a title that is lost.


i think the idea is that compression exploits recognized similarities, so that begs the question of if a compressor is actually recognizing things usefully.


The fuck is a GZIP? ;)


Goddamnit where is the spray bottle?

Psst psst psst. NO. Bad kitty.


> Such as OCR.

Such as what?


Optical character recognition... Has been pretty widespread tech since ~2016 or so.


Oh my ex and some acquaintances were bitching about it in 1999.

I recall a guy with good transcription skills starting a long thread on… must have been slashdot? About how it was faster for him to transcribe the numbers again than to verify the OCR was correct, but he couldn’t tell his boss that and what should he do?

There’s also a famous case where the compression algorithm in a copy/fax machine had a bug in a predefined dictionary that resulted in changing digits to other digits when making copies or faxes. Because they weren’t just compressing, they were trying to use a specialized OCR to do aggressive compression. I believe it was spitting out perfectly formed zeroes where another digit was in the original. Yikes.


> I believe it was spitting out perfectly formed zeroes where another digit was in the original. Yikes.

JBIG2 lossy compression. Covered in another hn story: https://news.ycombinator.com/item?id=29223815

From the story and the comments:

> This is not an OCR problem (as we switched off OCR on purpose), it is a lot worse – patches of the pixel data are randomly replaced in a very subtle and dangerous way: The scanned images look correct at first glance, even though numbers may actually be incorrect.

It's lossy compression turned to 11 that looks convincingly non-lossy.


I know technically you are right, but I feel like if you sat down and described the high level features and goals of that image compression (to compress text to a ridiculous degree) and OCR , you’d be hard pressed to say which list is which unless they gave it away with jargon words. My brain compressed it to “OCR” and filed it there.

It’s like Jim Gaffigan’s joke about working in a TexMex restaurant in the northern Midwest. It’s a tortilla, cheese, meat and beans… I tell you what I’ll just bring you something.


you're referring to the Xerox Workcenter: http://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres_...


YES. God what a nightmare. Can you imagine all of the arguments and threats of lawsuits that bug caused for small and medium businesses? You changed the contract after we had a verbal agreement and tricked me into signing!

I wonder if any executive assistants got fired over that. Surely.


If you can’t hear my eyes rolling from here I’ll do it again, harder… how bout now?

Optical Character Recognition. Very old, and very infamous attempts at replacing humans with software. Secretaries and executives know what OCR means.


I believe you missed the point, though I could be wrong.

Person A complains about MNIST not being defined. Person B was sending them up for complaining about not defining the meaning of acroynms and then using one they don't define.


No I got it. I just don’t want it.


[flagged]


> I suggest all HN articles must define all abbreviations and acronyms or be removed.

I can't decide if you are joking?

HR = Human Resources

PC = Personal Computer

HN = Hacker News

OP = Original Poster (?)

M.S. = Master of Science

certs = certificates


You solved some problem, including implementing the GZIP algorithm, in less than 10 lines of code?

...

Oh, ok, not that.




Consider applying for YC's Spring batch! Applications are open till Feb 11.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: