
Lossless compression of English messages using GPT-2 - kleiba
http://textsynth.org/sms.html
======
cs702
...by the one and only Fabrice Bellard: "gpt2tc is a small program using the
GPT-2 language model to complete and compress (English) texts. It has no
external dependency, requires no GPU and is quite fast...The compression
ratios are much higher than conventional compressors at the expense of speed
and of a much larger decompressor. See the documentation to get results on
text files from well known compression data sets."

A natural question I've pondered from time to time is whether Fabrice is
really a time traveler from a more advanced civilization in the future, sent
back in time to show us, mere mortals, what humankind will be capable of in
the future.

If this sounds far-fetched, consider that he has created FFmpeg, QEMU, LibBF,
SoftFP, BPG, TinyEMU, a software implementation of 4G/LTE, a PC emulator in
JavaScript, the TCC compiler, TinyGL, LZEXE, and a tiny program for computing
the biggest known prime number.

And that's just a partial list of his successful projects, which now of course
also include software for lossless compression with Transformer neural
networks.

Any of these projects, on its own, would be considered a notable achievement
for an ordinary human being.

Source: [https://bellard.org](https://bellard.org)

\--

Copied and edited some text from my post a year ago:
[https://news.ycombinator.com/item?id=19591308](https://news.ycombinator.com/item?id=19591308)
\-- I never cease to be amazed by the guy.

~~~
londons_explore
This particular project is noteworthy mostly for its completeness and 'it just
works' functionality. Tens of researchers before him have used arithmetic
coding on the outputs of various neural network models to do lossless
compression of text or images.

Bellard's contributions are a packaged tool (as opposed to PoC code) and a demo
webpage, and the idea of using CJK characters rather than outputting binary
data (in today's world of JSON, binary data has fallen out of fashion).

------
goodside
Not to diminish what a cool idea this is, but isn’t it cheating to not count
the size of the GPT2 parameters as part of the final compression ratio?

Assuming the decompressor already has GPT2 weights is analogous to assuming it
has a massive fixed dictionary of English words and phrases and doing code
substitution — it’s likely the pragmatic answer in some scenario, but it’s not
a fair basis for comparison. Real-world compressors use dictionary coders, but
they build the dictionary specifically for the data when it’s compressed and
then count that dictionary in the compressed size. For competitions like the
Hutter Compression Prize (1GB of English Wikipedia) the reported size includes
the complete binary of the decompressor program too.

GPT2 model weights require over 5GB of storage, so you’d need a corpus orders
of magnitude larger for it to be even close to competitive by that standard.
And it appears it would lose anyway — the OP claims ~15% ratio even with
“cheating”, and the current Hutter Prize winner for 1GB of enwiki is ~11%
without “cheating”.

~~~
Jaxkr
Static dictionaries or models in compression algorithms are not “cheating”.
Brotli, for example, achieves amazing results with its [static
dictionary]([https://gist.github.com/klauspost/2900d5ba6f9b65d69c8e](https://gist.github.com/klauspost/2900d5ba6f9b65d69c8e)).

However, I agree with you on the real-world uselessness of a GPT-based
compression algorithm.

~~~
goodside
That’s why I put “cheating” in quotes — it’s pragmatic, but it complicates the
comparison into something that can’t be measured in a single number. I grant
you that typical benchmarks ignore the static dictionary in comparing Brotli to
other compressors, but they also ignore the size of the binary itself. This is
because both are assumed to be small and highly general, and GPT2 violates
both assumptions. Brotli’s dictionary is 122 KB and covers many natural and
programming languages, whereas GPT2 weights are 5 GB and only cover English.
No real-world static dictionary is even a thousandth of that size.

Large static dictionaries exploit a loophole that would make comparisons
meaningless if carried to the extreme — you could trivially include the entire
benchmark corpus in the decompressor itself and claim your compressed file
size is 0 bytes. That’s why the Hutter Prize rules are what they are.

------
matthewfcarlson
Just for kicks and giggles, I threw in some rather obscure words to see what
would happen. It's been compressing for a few minutes and showing no sign of
progress. Cool project!

~~~
jkhdigital
For anyone who doesn't get why this would happen: GPT-2 basically outputs a
probability distribution for its guess of the next word, and then the encoder
uses these distributions to perform arithmetic coding adaptively. If the next
word in the source text is not actually present anywhere in the output
distribution, it cannot encode it.
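
For concreteness, here is a minimal sketch of the first half of that pipeline
using the Hugging Face transformers package (gpt2tc itself is dependency-free
C, so this is only an illustration of the distribution the coder consumes):

    # Get the next-token distribution GPT-2 assigns after a prefix.
    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    prefix = "I am going to work outside"
    ids = tokenizer(prefix, return_tensors="pt").input_ids

    with torch.no_grad():
        logits = model(ids).logits                   # (1, seq_len, vocab_size)
    probs = torch.softmax(logits[0, -1], dim=-1)     # P(next token | prefix)

    # The coder needs exactly this: a probability for every possible token.
    top = torch.topk(probs, 5)
    for p, tok in zip(top.values, top.indices):
        print(repr(tokenizer.decode([int(tok)])), float(p))

The coder then narrows its interval according to the probability of the token
that actually occurs, appends that token to the context, and repeats.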

~~~
londons_explore
I may be wrong, but I thought GPT2 could also output partial words/syllables
(for unknown words), or individual letters if they don't make a syllable.

The simple way to achieve that is to have an encoding dictionary of words, but
then add to the end of the dictionary "sh", etc., and then add to the end of
that "a", "b", "c", etc. When tokenizing words, prefer to use a whole word,
but if you can't do that, split to syllables, and failing that, individual
letters. That has the benefit that any ASCII string can go through the system.
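
A quick check with the Hugging Face tokenizer suggests this is roughly right
(GPT-2's BPE is byte-level, so it can always fall back to smaller and smaller
pieces; I'm assuming gpt2tc tokenizes the same way, since it's the same model):

    # Rare words don't disappear, they just get chopped into smaller pieces.
    from transformers import GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

    for word in [" compression", " floccinaucinihilipilification"]:
        ids = tokenizer.encode(word)
        print(word, "->", [tokenizer.decode([i]) for i in ids])
    # The common word comes out as one or two tokens; the obscure one becomes
    # many short fragments, each of which still gets a probability.

So an obscure word presumably doesn't make encoding impossible, just
expensive: every fragment is very improbable, so it costs many bits and many
slow model evaluations.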

~~~
jkhdigital
Yes, this is why I said "basically". The fact that GPT-2 tokens are not
necessarily prefix-free can be a problem for arithmetic coding, but I've found
that "greedy" parsing almost never fails in practice.

So yes, there are ways to work around this but it seems like the simplest
explanation for why unusual words break the encoder.

------
speedgoose
I don't understand why it shows Chinese characters. Assuming utf-8, English
characters are a lot more compact than Chinese characters. So we can't really
compare.

Otherwise it's a good idea and it works, but it's super slow, it only works
for English text, and the system requirements are huge. I like it.

~~~
sp332
It's counting characters, so it is comparable.

This is useful for applications that limit the number of characters, e.g.
Twitter.

~~~
m4rtink
Yep, as far as I can tell, you can cram about twice as much information into
the same number of Japanese characters as you can into Latin characters.

I wonder if Chinese is even more info-dense, as it does not have the syllabic
hiragana/katakana characters?

~~~
dheera
Modern Chinese is typically more dense than modern Japanese (which is
partially phonetic), and ancient formal Chinese is even more compact than
modern Chinese.

However, it's worth noting that Chinese characters are analogous to entire
words in English, and are composed of components much like English words are
composed of letters.

For example "thanks" is spelled "t h a n k s"

"謝" is made up of "言 身 寸"

(Of course, the components in Chinese have less correlation to their
pronunciation, but the main point I'm making here is that there is a LOT of
overlap in the common components used to assemble the entire Chinese lexicon.)

It is really not fair to compare languages in terms of the number of
characters needed to represent something.

Better measures would be the fastest time (in seconds) needed to use speech to
convey a concept intelligibly to an average native speaker, or the square
centimeters of paper needed to convey an idea given the same level of
eyesight.

~~~
m4rtink
Indeed, what I meant was basically how much information you can cram into a
message in a digital medium that is character-limited but not really limited
in which characters you can use, like SMS or Twitter messages back when they
were still limited to 140 characters.

------
minimaxir
A neat trick I found while working with GPT-2 is that byte-pair encoding is,
in itself, a compression method. With Huggingface Transformers,
encoding/decoding this way is very fast.

I've implemented this approach in my aitextgen package
([https://github.com/minimaxir/aitextgen/blob/master/aitextgen...](https://github.com/minimaxir/aitextgen/blob/master/aitextgen/TokenDataset.py#L238))
to encode massive input datasets as a uint16 Numpy array; when gzipped on
disk, it's about 1/10th of the original data set size.

However, the technique in this submission gets compression to about 1/10
without the gzipping. Hmm.
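
Roughly, the idea looks like this (a sketch, not aitextgen's actual code; the
file names are made up):

    # GPT-2's vocab has 50257 entries, so every token id fits in a uint16:
    # a multi-character token costs 2 bytes before gzip even sees it.
    import gzip
    import numpy as np
    from transformers import GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

    text = open("corpus.txt", encoding="utf-8").read()
    ids = np.array(tokenizer.encode(text), dtype=np.uint16)

    with gzip.open("corpus.tokens.gz", "wb") as f:
        f.write(ids.tobytes())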

~~~
jkhdigital
This is really just a way to show how good GPT-2 is at predicting text. If you
know anything about information theory, you'll know that the entropy of the
information source places a hard limit on how much it can be compressed. If
GPT-2 is really good at predicting English text, then the entropy of its
output should be very very close to the entropy of natural English text. Thus,
using GPT-2 predictions as an adaptive source encoder will achieve compression
ratios that approach the information content (entropy) of English text.
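
A rough way to see that bound, again assuming the Hugging Face package: the
total surprisal of a text under the model, the sum of -log2 P(token | context),
is the size an ideal coder driven by GPT-2 approaches (the first token has no
context here, so a real coder has to handle it separately):

    import math
    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    text = "To be, or not to be, that is the question."
    ids = tokenizer(text, return_tensors="pt").input_ids

    with torch.no_grad():
        logits = model(ids).logits
    logp = torch.log_softmax(logits[0, :-1], dim=-1)   # predicts tokens 1..n-1
    token_logp = logp[torch.arange(ids.shape[1] - 1), ids[0, 1:]]

    bits = -token_logp.sum().item() / math.log(2)
    print(f"{bits:.1f} bits ~ {bits / 8:.1f} bytes for {len(text)} characters")

The better the model predicts the text, the smaller that sum; arithmetic
coding gets within a couple of bits of it.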

------
starpilot
I compressed "I am going to work outside today," then put the compressed
output in Google Translate. Google translated the Chinese characters back to
English as "raccoon."

~~~
dhosek
I think the Chinese text that comes out confuses Google translate. I took the
whole first sentence of Hamlet's soliloquy which compressed to 䮛趁䌆뺜㞵蹧泔됛姞音逎贊
and plugged that into Google Translate. It came back with "Commendation." The
reverse translation is 表彰

~~~
james412
It's not Chinese text, it's an arithmetic-coded stream of bits mapped so the
bits fall within the range of some codepoints. It's basically a variant of
base64 except for Unicode.

(Side note: aren't these codepoints very expensive to encode in UTF-8? It
seems there must be a lower-valued range more suited to it)
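
Right. A toy sketch of the mapping (the base codepoint and the 15-bit group
size here are my guesses for illustration, not what gpt2tc actually uses):

    # Pack an arbitrary byte string into characters drawn from a contiguous
    # block of codepoints, 15 bits per character -- "base64, but for Unicode".
    BASE = 0x4E00            # start of the CJK Unified Ideographs block
    BITS = 15

    def bits_to_chars(data: bytes) -> str:
        n = int.from_bytes(data, "big")
        out = []
        for shift in range(0, len(data) * 8, BITS):
            out.append(chr(BASE + ((n >> shift) & ((1 << BITS) - 1))))
        return "".join(out)

    print(bits_to_chars(b"hello"))   # a few CJK/Hangul-looking chars for 40 bits

On the side note: each character in that range takes 3 bytes of UTF-8 for 15
bits of payload, so byte-wise it is indeed wasteful. It only wins when the
constraint is a character count, as in UTF-16 storage or a tweet.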

~~~
toast0
The page for base32768 has some efficiency charts for different binary-to-text
encodings on top of different UTF encodings, as well as how many bytes each
lets you stuff into a tweet. Depends on where you're going to house the data,
I guess.

[https://github.com/qntm/base32768](https://github.com/qntm/base32768)

~~~
infogulch
In addition to being 94% efficient in UTF-16 (!), this highlights another
reason why one might want to optimize for the number of characters: fitting as
many bytes as possible into a _tweet_, which is bounded in the number of
characters, not bytes.

------
fla
Try swapping a few characters in the compressed string before decompressing
and get a totally unrelated, but somewhat plausible, sentence.

~~~
VMG

       Try swapping a few characters in the compressed string before decompressing and get a totally unrelated, but somewhat plausible, sentence. -->
    
       䔹䧹焫놉勏㦿顱㦽膑裚躈葊
    

Swapping last two:

    
    
       䔹䧹焫놉勏㦿顱㦽膑裚葊躈 -->
    
       Try swapping a few characters in the compressed string before decompressing and get a totally unrelated, but somewhat applied tlh
    
    

Swapping first two:

    
    
       䧹䔹焫놉勏㦿顱㦽膑裚躈葊 -->
    
       Sexy Shania Twain acting as a sprite for sexy Hogan's Alley demo dude
    
       my site
    
       my favorite animal's name is camelid 2 my favorite artist is david maile my favorite movie's are
    
    

Pretty wild!

~~~
jkhdigital
It's just adaptive arithmetic coding, with the distribution provided by GPT-2
instead of some other statistical analysis of the source. He uses CJK simply
to make the output printable, but it's really just random bits. I mean, it's a
neat idea, but certainly not novel.
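
For anyone who wants to see the mechanism end to end, here is a toy exact
arithmetic coder with a stand-in adaptive model (gpt2tc plugs in GPT-2's
next-token distribution instead, and real coders use fixed-precision integer
arithmetic rather than Fractions):

    from fractions import Fraction
    from math import ceil

    def model(context):                  # toy adaptive model: 'a' gets likelier
        pa = Fraction(1 + context.count("a"), 2 + len(context))
        return {"a": pa, "b": 1 - pa}

    def encode(symbols):
        low, high = Fraction(0), Fraction(1)
        for i, s in enumerate(symbols):  # narrow the interval for each symbol
            cum, width = Fraction(0), high - low
            for sym, p in model(symbols[:i]).items():
                if sym == s:
                    low, high = low + width * cum, low + width * (cum + p)
                    break
                cum += p
        k = 1                            # shortest dyadic point in [low, high)
        while Fraction(ceil(low * 2**k), 2**k) >= high:
            k += 1
        return format(ceil(low * 2**k), f"0{k}b")

    def decode(bits, n):
        x = Fraction(int(bits, 2), 2 ** len(bits))
        out, low, high = "", Fraction(0), Fraction(1)
        for _ in range(n):               # replay the same interval narrowing
            cum, width = Fraction(0), high - low
            for sym, p in model(out).items():
                if low + width * cum <= x < low + width * (cum + p):
                    out += sym
                    low, high = low + width * cum, low + width * (cum + p)
                    break
                cum += p
        return out

    msg = "aababaaab"
    code = encode(msg)
    assert decode(code, len(msg)) == msg
    print(msg, "->", code)

The more probable the model thinks each symbol is, the less the interval
shrinks and the fewer bits the codeword needs, which is exactly why a strong
predictor like GPT-2 compresses well.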

------
dmarchand90
I'm really impressed that this seems largely written from scratch in C. "This
demo has no external dependency."

------
vessenes
I am guessing that Fabrice is planning on some sort of commercialization here;
this is a re-issue of something originally on his website.

A fun game to play is to see how many characters a name takes: it’s an
indication of your importance to the Internet.

As for why Chinese: it seems to me easier to read and more compact to display
than hexlified bytes.

~~~
dhosek
My last name compressed to 3 characters. I tried my wife's last name and it
was 3 characters, then I decided to add the accent to it that normally gets
dropped in an English-language context and it compressed to 2. Adding first
names, I was 4 characters and she was 5 with and without the accent. William
Shatner went to 6 characters. Barack Obama went to 2. William Shakespeare also
to 2.

~~~
vessenes
Right, I guess your compressed last name reflects the importance of all
Hoseks worldwide, albeit modulo some chunking of the word, so it has to
compete with the importance of other "Hos"es like hospitals and so on.

------
hint23
FYI, the corresponding standalone Linux command line version is available at
[https://bellard.org/nncp/gpt2tc.html](https://bellard.org/nncp/gpt2tc.html) .
It also does text completion and file compression.

------
lxe
> using the probability of the next word computed by the GPT-2 language model

Can the same effect be achieved by looking at the actual probability of the
next word from a large corpus of existing text (à la Markov chains)?

~~~
duskwuff
Less effectively. GPT-2 and a Markov chain are both predictive models; GPT-2
just happens to be a much more complex (and, in most cases, more accurate)
model for English text, so fewer bits are required on average to encode the
delta between its predictions and the actual text.

------
jkhdigital
Paste encrypted bits (mapped to the CJK range he uses) in the "decompress" box
and you've got format-transforming encryption.

------
maest
I'm not at all familiar with arithmetic encoding (or the adaptive version
thereof), but, after reading some guides, it seems to me that the novel thing
here is using GPT2 to somehow generate a character probability distribution?

The theory being that GPT2 should have a distribution closely matching
"reality" and thus minimizing the output size?

------
aapeli
So if you end up being famous and talked about a lot on Wikipedia, your name
will compress better?

The impact of bias in training data is interesting in general here. What's
the impact of Wikipedia's article biases? That's probably one of the main
corpora used.

------
nmca
This guy should enter the Hutter Prize -
[http://prize.hutter1.net/](http://prize.hutter1.net/)

This won't win, but it seems he cares and has some talent :)

------
d_burfoot
A fun game is to compress some text, then look up some random Chinese words,
cut-and-paste them into the compressed output, and then decompress again.

~~~
nmstoker
Yes, or even just swap the compressed character order and it still results in
interesting, somewhat similar texts.

------
aquajet
Title should be renamed to "Language Models are Lossless Compressors"

~~~
jkhdigital
Exactly, along with a link to some basic information theory Wikipedia
articles.

------
Vvector
Looks like the work is done server-side. And we've hit a bottleneck.
------
knolax
You boomers need to know that 먓띑뒢끟 are precomposed Hangul characters, not
Chinese. It's not hard to just say CJK.

