
This is clever -- and useful in many settings that preclude the use of a deep neural network for classification.

Intuitively, the key idea is this: given two documents, say x1 and x2, and a target document y, if x1's statistical regularities are more similar to y's than to x2's, then len(compress(x1+y)) - len(compress(y)) < len(compress(x2+y)) - len(compress(y)), where "+" means concatenation and "compress" is a compression program like gzip.

len(compress(x1+y)) - len(compress(y)) is, quite literally, the number of additional bytes we need to compress the statistical regularities in x1 given the statistical regularities in y. The more similar the statistical regularities between x1 and y, the fewer bytes we need to compress them together.
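
A minimal sketch of that comparison in Python, using the standard library's gzip; the documents x1, x2, and y below are toy strings made up for illustration, not from the paper:

    import gzip

    def clen(s: str) -> int:
        """Compressed length in bytes."""
        return len(gzip.compress(s.encode("utf-8")))

    x1 = "the cat sat on the mat and the cat purred on the mat"
    x2 = "quarterly revenue grew by twelve percent year over year"
    y = "a cat slept on the warm mat near the kitchen door"

    # Extra bytes needed to encode each candidate together with y.
    print(clen(x1 + " " + y) - clen(y))  # typically smaller: x1 shares regularities with y
    print(clen(x2 + " " + y) - clen(y))

On strings this short the gzip header and block overhead dominate, so the gap is much clearer with longer documents.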

Based on this idea, the authors use kNN with a distance function called normalized compression distance (NCD). Remarkably, this simple, intuitive method outperforms BERT on a variety of zero-shot classification tasks!
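
A hedged sketch of what kNN classification with NCD can look like; the NCD formula here is the standard one, (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), while the training-set format and k value are placeholders of my own, not the paper's setup:

    import gzip
    from collections import Counter

    def clen(s: str) -> int:
        return len(gzip.compress(s.encode("utf-8")))

    def ncd(a: str, b: str) -> float:
        # Normalized compression distance:
        # (C(ab) - min(C(a), C(b))) / max(C(a), C(b))
        ca, cb = clen(a), clen(b)
        return (clen(a + " " + b) - min(ca, cb)) / max(ca, cb)

    def knn_classify(query: str, train: list[tuple[str, str]], k: int = 3) -> str:
        # train holds (text, label) pairs; vote among the k nearest under NCD.
        nearest = sorted(train, key=lambda pair: ncd(query, pair[0]))[:k]
        return Counter(label for _, label in nearest).most_common(1)[0][0]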


(I did not read the original paper, but...) I think you want to subtract len(compress(x1)) and len(compress(x2)) on the left- and right-hand sides, instead of len(compress(y)) -- we care about the incremental complexity of adding y on top of x1 or x2, ignoring the base complexity of x1/x2.
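
In code, the quantity this correction suggests would look something like the following (same toy setup as upthread; clen is just the gzip-compressed length):

    import gzip

    def clen(s: str) -> int:
        return len(gzip.compress(s.encode("utf-8")))

    def added_cost(x: str, y: str) -> int:
        # Bytes needed to encode y given x, roughly C(x+y) - C(x),
        # so the base complexity of x drops out of the comparison.
        return clen(x + " " + y) - clen(x)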


There's a typo in my comment. Instead of "than to x2's", I actually meant to write "than X2's". Sorry about that!


Related, there's also the Normalised Google Distance (how many results you get searching for X, Y, and X+Y, basically). I don't know if that still works.
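
For reference, a sketch of the usual normalised Google distance formula, computed from hit counts; how you obtain the counts is up to you, and the fx/fy/fxy arguments and the index size n below are placeholders:

    import math

    def ngd(fx: float, fy: float, fxy: float, n: float = 50e9) -> float:
        # fx, fy: hit counts for each term alone; fxy: hits for both together;
        # n: assumed number of pages indexed (50e9 is an arbitrary placeholder).
        lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
        return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))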

These techniques have been around for a long time, as these things go.
