
>gzip approach not better than dnn models but mostly competes and much cheaper to run

Does it? It looks like it does worse than FastText on all the benchmarks, and kNN is not a cheap algorithm to run, so it might actually be slower than FastText.

edit: It looks like FastText takes 5 seconds to train on the Yahoo Answers data set while the gzip approach took them 6 days. So definitely not faster.




I'm not familiar with most of these models in detail, but training time is generally less interesting than inference time to me. I don't care if it takes a month to train on $10k of GPU rentals if it can be deployed and run on a Raspberry Pi. I should definitely look into FastText though.


As described in the paper, it didn't look like the gzip classifier trained at all. Inference involved reading the entire training set.

One could surely speed this up by preprocessing the training set and snapshotting the resulting gzip state, but that wouldn't affect the asymptotic complexity. In effect, the number of parameters is equal to the size of the entire training set. (Of course, lots of fancy models scale roughly like this, too, so this isn't necessarily a loss.)
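
For concreteness, a minimal sketch of that shape in the paper's normalized-compression-distance + kNN setting, with the per-example compressed lengths precomputed up front (the function names, the separator, and k=2 are illustrative assumptions, not the authors' exact code):

    import gzip

    def clen(s: str) -> int:
        """Length of the gzip-compressed UTF-8 bytes of s."""
        return len(gzip.compress(s.encode("utf-8")))

    # "Training" is just caching the compressed length of every training text.
    def preprocess(train: list[tuple[str, str]]) -> list[tuple[str, str, int]]:
        return [(text, label, clen(text)) for text, label in train]

    def classify(query: str, cached: list[tuple[str, str, int]], k: int = 2) -> str:
        """kNN with normalized compression distance; still scans every training example."""
        cq = clen(query)
        dists = []
        for text, label, ct in cached:
            cqt = clen(query + " " + text)           # must be recomputed per example
            ncd = (cqt - min(cq, ct)) / max(cq, ct)  # normalized compression distance
            dists.append((ncd, label))
        dists.sort()
        top = [label for _, label in dists[:k]]
        return max(set(top), key=top.count)          # majority vote among k nearest

Even with the per-example lengths cached, inference still compresses the query against every training text, so the cost stays linear in the training set size.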


The gzip approach is much slower at inference time because you need to compute the gzip representation of the concatenated strings (query + target) for every comparison. Intuitively, that should cost significantly more than a dot product of two embedding vectors.
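
As a rough illustration of that gap, here's a toy timing sketch (the strings, the 300-dimensional vectors, and the loop count are made up for illustration, not measurements from the paper):

    import gzip, timeit
    import numpy as np

    query = "how do i reset my router password " * 4
    target = "steps to factory reset a home wifi router " * 4
    q_emb, t_emb = np.random.rand(300), np.random.rand(300)  # FastText-sized vectors

    gzip_pair = timeit.timeit(lambda: gzip.compress((query + target).encode()), number=1000)
    dot = timeit.timeit(lambda: float(q_emb @ t_emb), number=1000)

    print(f"gzip of concatenated pair: {gzip_pair * 1000:.1f} us per call")
    print(f"embedding dot product:     {dot * 1000:.1f} us per call")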


The latter depends very strongly on how much computation is needed to compute those embedding vectors.

If you run a GPT-3.5-sized model to compute that embedding (which would be a bit absurd, but if you really want GPT-3.5-quality classification, you may well be doing something like this), you're pushing the text through quite a few tens of billions of parameters and doing a correspondingly large number of FLOPs, which could be just as expensive as running gzip over your whole (small, private) training set.
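
As a loose back-of-envelope, every number below is an assumption: the common ~2 FLOPs per parameter per token estimate for a transformer forward pass, a hypothetical 20B-parameter model, made-up hardware throughput, and ballpark gzip speed.

    # Order-of-magnitude arithmetic only; all of these inputs are assumptions.
    params = 20e9              # hypothetical "GPT-3.5-sized" embedding model
    tokens = 100               # assumed query length in tokens
    flops = 2 * params * tokens              # ~2 FLOPs per parameter per token
    hw = 1e12                                # assume ~1 TFLOP/s of usable compute
    embed_seconds = flops / hw               # ~4 s per query under these assumptions

    train_mb = 10                            # a small, private training set (assumed)
    gzip_mb_per_s = 30                       # ballpark gzip throughput (assumed)
    gzip_seconds = train_mb / gzip_mb_per_s  # ~0.3 s to gzip-scan it once

    print(f"big-model embedding: ~{embed_seconds:.0f} s per query")
    print(f"gzip pass over the training set: ~{gzip_seconds:.2f} s per query")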


No, because the compute cost scales with the number of classes you wish to classify into: if you have n classes, you need to do n gzip compressions at inference time. In the embedding world, you only call the embedding model once on insert, and only need a dot product at inference time.

The same logic extends to using a self-hosted embedding model, which tends to be as good as Ada on most benchmarks and, yes, can be fine-tuned over your private data.
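
A sketch of that embed-once, dot-product-at-query-time flow, using sentence-transformers as one possible self-hosted model (the model name and texts are placeholders, not a recommendation from this thread):

    import numpy as np
    from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

    model = SentenceTransformer("all-MiniLM-L6-v2")  # example self-hosted model

    # Embed each document once, at insert time, and keep the vectors around.
    docs = ["refund policy for damaged items", "how to change my shipping address"]
    doc_vecs = model.encode(docs, normalize_embeddings=True)   # shape (n_docs, dim)

    # At query time: one embedding call for the query, then cheap dot products.
    query_vec = model.encode(["my package arrived broken"], normalize_embeddings=True)[0]
    print(docs[int(np.argmax(doc_vecs @ query_vec))])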


>The latter depends very strongly on how much computation is needed to compute those embedding vectors.

Sure, but the gzip metrics are worse than FastText's, and FastText computes its embeddings in essentially no time: tokenize, look up embeddings by token id, and do some averaging. Compared to that, the gzip approach is very slow.
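
For reference, the FastText supervised pipeline described above is roughly this (the file path and label are placeholders; FastText expects one "__label__<class> text" line per example):

    import fasttext  # pip install fasttext

    # train.txt holds lines like: "__label__sports the match went to penalties"
    model = fasttext.train_supervised(input="train.txt", epoch=5, wordNgrams=2)

    # Inference: tokenize, look up token embeddings, average, apply a linear classifier.
    labels, probs = model.predict("who won the world cup in 2018")
    print(labels[0], probs[0])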


FastText isn't an LLM; it's a token embedding model with a simple classifier on top.


Sure, but its existence means the statement is really "the gzip approach is not better than DNN models, and it neither competes with nor is cheaper to run than previous models like FastText." That's not a very meaningful value statement for the approach (although why gzip is even half-decent might be a very interesting research question).



