(1) Treat your data as a stream of tokens. Use your machine learning gadget to give you the predicted probability of each possible next token. Then use those probabilities in arithmetic coding to specify which token actually came next.
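A minimal sketch of approach (1). The `model` function here is a toy stand-in for a real ML predictor (its bias toward repeating the previous token is an invented assumption), and exact rational arithmetic replaces the fixed-precision renormalization a practical arithmetic coder would use:

```python
from fractions import Fraction

ALPHABET = ["a", "b", "c"]

def model(prefix):
    """Toy stand-in for an ML predictor: assigns extra probability
    to repeating the previous token, uniform mass otherwise."""
    probs = {t: Fraction(1, 6) for t in ALPHABET}
    last = prefix[-1] if prefix else ALPHABET[0]
    probs[last] += Fraction(1, 2)  # probabilities sum to 1
    return probs

def encode(tokens):
    """Narrow the interval [0, 1) once per token, giving each token
    a subinterval whose width equals its predicted probability."""
    low, high = Fraction(0), Fraction(1)
    for i, tok in enumerate(tokens):
        probs = model(tokens[:i])
        width = high - low
        cum = Fraction(0)
        for t in ALPHABET:
            if t == tok:
                high = low + (cum + probs[t]) * width
                low = low + cum * width
                break
            cum += probs[t]
    # Any number in [low, high) identifies the whole sequence.
    return (low + high) / 2

def decode(code, n):
    """Replay the same predictions and see which subinterval
    the code number falls into at each step."""
    tokens = []
    low, high = Fraction(0), Fraction(1)
    for _ in range(n):
        probs = model(tokens)
        width = high - low
        cum = Fraction(0)
        for t in ALPHABET:
            t_low = low + cum * width
            t_high = t_low + probs[t] * width
            if t_low <= code < t_high:
                tokens.append(t)
                low, high = t_low, t_high
                break
            cum += probs[t]
    return tokens
```

The key property: the better the model's predictions, the wider the chosen subintervals, and the fewer bits are needed to pin down a number inside the final interval.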
(2) Take your data D. Apply lossy compression to it and store the result L := lossy(D). Also compute the residue R := D - uncompress(L). If your lossy compression is good, R will be mostly zeroes (with only a few actual differences), so it will compress well with a lossless compression algorithm.
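A toy sketch of approach (2), with crude byte quantization standing in for a real lossy codec (the `step` quantizer is an illustrative assumption; a real scheme would also store L in its compact encoded form rather than as raw bytes):

```python
import zlib

def lossy(data, step=8):
    """Crude lossy 'compressor': round each byte down to a multiple of step."""
    return bytes((b // step) * step for b in data)

def compress(data, step=8):
    l = lossy(data, step)
    # Residue R := D - uncompress(L), byte-wise mod 256.
    # Every residue byte lands in [0, step), so zlib handles it well.
    residue = bytes((d - u) % 256 for d, u in zip(data, l))
    return l, zlib.compress(residue)

def decompress(l, packed_residue):
    residue = zlib.decompress(packed_residue)
    # Adding the residue back reconstructs D exactly.
    return bytes((u + r) % 256 for u, r in zip(l, residue))
```

The overall scheme stays lossless: whatever detail the lossy stage throws away is recorded in the residue, and the bet is that L plus a compressed R is smaller than losslessly compressing D directly.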
Approach (1) is a more sophisticated version of (2). None of this is anything I came up with; both approaches are well known.