...by the one and only Fabrice Bellard: "gpt2tc is a small program using the GPT-2 language model to complete and compress (English) texts. It has no external dependency, requires no GPU and is quite fast...The compression ratios are much higher than conventional compressors at the expense of speed and of a much larger decompressor. See the documentation to get results on text files from well known compression data sets."
A natural question I've pondered from time to time is whether Fabrice is really a time traveler from a more advanced future civilization, sent back in time to show us mere mortals what humankind will one day be capable of.
If this sounds far-fetched, consider that he has created FFmpeg, QEMU, LibBF, SoftFP, BPG, TinyEMU, a software implementation of 4G/LTE, a PC emulator in JavaScript, the TCC compiler, TinyGL, LZEXE, and a tiny program for computing the biggest known prime number.
And that's just a partial list of his successful projects, which now of course also include software for lossless compression with Transformer neural networks.
Any of these projects, on its own, would be considered a notable achievement for an ordinary human being.
This particular project is noteworthy mostly for its completeness and 'it just works' functionality. Dozens of researchers before him have used arithmetic coding on the outputs of various neural network models to do lossless compression of text or images.
Bellard's contributions are a packaged tool (as opposed to proof-of-concept code) and a demo webpage, plus the idea of using CJK characters rather than outputting binary data (in today's world of JSON, binary data has fallen out of fashion).
Not to diminish what a cool idea this is, but isn’t it cheating to not count the size of the GPT2 parameters as part of the final compression ratio?
Assuming the decompressor already has GPT2 weights is analogous to assuming it has a massive fixed dictionary of English words and phrases and doing code substitution — it’s likely the pragmatic answer in some scenario, but it’s not a fair basis for comparison. Real-world compressors use dictionary coders, but they build the dictionary specifically for the data when it’s compressed and then count that dictionary in the compressed size. For competitions like the Hutter Compression Prize (1GB of English Wikipedia) the reported size includes the complete binary of the decompressor program too.
GPT2 model weights require over 5GB of storage, so you’d need a corpus orders of magnitude larger for it to be even close to competitive by that standard. And it appears it would lose anyway — the OP claims ~15% ratio even with “cheating”, and the current Hutter Prize winner for 1GB of enwiki is ~11% without “cheating”.
That's why I put "cheating" in quotes — it's pragmatic, but it complicates the comparison into something that can't be measured in a single number. I grant you that typical benchmarks ignore the static dictionary in comparing Brotli to other compressors, but they also ignore the size of the binary itself. This is because both are assumed to be small and highly general, and GPT2 violates both assumptions. Brotli's dictionary is 122 KB and covers many natural and programming languages, whereas GPT2 weights are 5 GB and only cover English. No real-world static dictionary is even a thousandth of that size.
Large static dictionaries exploit a loophole that would make comparisons meaningless if carried to the extreme — you could trivially include the entire benchmark corpus in the decompressor itself and claim your compressed file size is 0 bytes. That’s why the Hutter Prize rules are what they are.
Yes, but you can compress arbitrarily many texts of arbitrary size with that 5GB fixed cost. As the total number of bytes compressed goes to infinity, the fixed cost's contribution to the overall ratio vanishes.
In theory, yes, but a constant 5 GB penalty is enormous in practice — orders of magnitude bigger than anything used in the real world. Brotli’s static dictionary is only 122 KB, and covers many natural and programming languages beyond just English.
Even a tiny personal mailserver can get into that size range. There are loads of applications where a 5GB penalty is well below the amount of text you're looking at (Wikipedia springs to mind, since it's in the same kind of size range for text).
Obviously bodies of text bigger than 5GB exist. I was talking about static compressor dictionaries, which are tiny. Hence mentioning Brotli’s 122KB dictionary. Static dictionaries are an optimization to improve the compression of very small text files — they aren’t useful for compressing large files, because once you have lots of data you can build a more efficient dictionary at compression time and include it in the compressed stream.
This is exactly what I was thinking. It seems like he’s just taking advantage of the fact that CJK characters are a lot more information-dense than English letters, and has a turnkey encoder/decoder in the form of the pre-trained GPT-2 language model.
I am, in fact, currently involved in research that uses GPT-2 for format-transforming encryption and we follow the exact same recipe but in reverse. Encrypt a message, then "decompress" the random bits into English text using GPT-2. Assuming the receiver has the same GPT-2 state (model parameters and prompt) then they will reproduce the same random bits by arithmetic compression, which can then be decrypted with the shared key.
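Roughly, the data flow looks like the sketch below. Everything in it is a toy stand-in for illustration only: XOR plays the role of the cipher and a four-word table plays the role of the GPT-2-driven arithmetic coder, so this shows the shape of the recipe, not the actual system.

    # Sketch of the format-transforming encryption recipe described above, with
    # toy stand-ins: XOR is the "cipher" and a trivial 2-bits-per-word coder maps
    # bits to English-looking words. In the real recipe the coder is GPT-2-driven
    # arithmetic coding and the cipher is a proper one.
    import os

    WORDS = ["the", "of", "and", "a"]              # toy "language model", 2 bits per word

    def toy_decode_bits_to_text(data: bytes) -> str:       # bits -> cover text
        out = []
        for byte in data:
            for shift in (6, 4, 2, 0):
                out.append(WORDS[(byte >> shift) & 0b11])
        return " ".join(out)

    def toy_encode_text_to_bits(text: str) -> bytes:       # cover text -> original bits
        vals = [WORDS.index(w) for w in text.split()]
        return bytes(
            (vals[i] << 6) | (vals[i+1] << 4) | (vals[i+2] << 2) | vals[i+3]
            for i in range(0, len(vals), 4)
        )

    def xor(data: bytes, key: bytes) -> bytes:             # toy "cipher"
        return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

    key = os.urandom(16)
    message = b"meet at noon"
    cover = toy_decode_bits_to_text(xor(message, key))     # encrypt, then "decompress" into text
    print(cover)
    print(xor(toy_encode_text_to_bits(cover), key))        # receiver recovers b"meet at noon"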
No, having an arbitrarily complex dictionary or compressor is not counted as cheating. The model is basically that you are allowed to grab as many hard drives as you want before going out into the wilderness. From then on all your news arrives over a low-bandwidth carrier pigeon, and you have to decompress the transmissions with what you remembered to bring.
Counted by whom? What benchmark follows the model you’re describing? Does any real-world compressor use dictionaries anywhere near this big?
If you can bring the complete benchmark corpus (or substantial subsets of it) “into the wilderness”, the benchmark isn’t worth running. It’s not a compressor, it’s a database with stable keys. A Library of Congress LCCN code uniquely identifies the complete text of any published book, but it doesn’t contain a compressed copy of that book.
Just for kicks and giggles, I threw in some rather obscure words to see what would happen. It's been compressing for a few minutes and showing no sign of progress. Cool project!
For anyone who doesn't get why this would happen: GPT-2 basically outputs a probability distribution for its guess of the next word, and then the encoder uses these distributions to perform arithmetic coding adaptively. If the next word in the source text is not actually present anywhere in the output distribution, it cannot encode it.
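To make the mechanics concrete, here's a toy interval-narrowing coder with a fixed four-word distribution standing in for GPT-2's context-dependent predictions. Floats are used for clarity; this is an illustration, not gpt2tc's actual coder, which would use integer range coding.

    # Toy arithmetic coder: each symbol narrows [low, high) in proportion to
    # the probability the "model" assigns it.
    probs = {"the": 0.5, "cat": 0.2, "sat": 0.2, "zyzzyva": 0.1}

    def encode(tokens):
        low, high = 0.0, 1.0
        for tok in tokens:
            span, cum = high - low, 0.0
            for sym, p in probs.items():
                if sym == tok:
                    low, high = low + cum * span, low + (cum + p) * span
                    break
                cum += p
        return (low + high) / 2        # any number in the final interval names the message

    def decode(code, n):
        out, low, high = [], 0.0, 1.0
        for _ in range(n):
            span, cum = high - low, 0.0
            for sym, p in probs.items():
                lo, hi = low + cum * span, low + (cum + p) * span
                if lo <= code < hi:
                    out.append(sym)
                    low, high = lo, hi
                    break
                cum += p
        return out

    print(decode(encode(["the", "cat", "sat"]), 3))    # ['the', 'cat', 'sat']
    # Probable tokens barely shrink the interval (cheap to encode); a token the
    # model considers near-impossible shrinks it drastically (many bits).

With GPT-2 in place of the toy table, a genuinely unlikely next token still gets some probability mass, but it costs a lot of bits, which may be part of why unusual words make the tool crawl.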
I may be wrong, but I thought GPT2 could also output partial words/syllables (for unknown words), or individual letters if they don't make a syllable.
The simple way to achieve that is to have an encoding dictionary of words, but then add to the end of the dictionary "sh", etc., and then add to the end of that "a", "b", "c", etc. When tokenizing words, prefer to use a whole word, but if you can't do that, split to syllables, and failing that, individual letters. That has the benefit that any ascii string can go through the system.
Yes, this is why I said "basically". The fact that GPT-2 tokens are not necessarily prefix-free can be a problem for arithmetic coding, but I've found that "greedy" parsing almost never fails in practice.
So yes, there are ways to work around this but it seems like the simplest explanation for why unusual words break the encoder.
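For illustration, here's a minimal sketch of that greedy longest-match idea with a made-up vocabulary. GPT-2's real byte-level BPE works from learned merge rules rather than a simple longest-match scan, but the whole-word/chunk/letter fallback behaves similarly in spirit.

    # Greedy longest-match tokenization with fallback to shorter pieces.
    # The vocabulary is a toy stand-in, not GPT-2's actual BPE table.
    vocab = {"hello", "wor", "ld", "sh", "o", "r", "t"}

    def tokenize(text):
        tokens, i = [], 0
        while i < len(text):
            for length in range(len(text) - i, 0, -1):   # prefer the longest match
                piece = text[i:i + length]
                if piece in vocab:
                    tokens.append(piece)
                    i += length
                    break
            else:                                        # nothing matched: emit raw character
                tokens.append(text[i])
                i += 1
        return tokens

    print(tokenize("helloworld"))   # ['hello', 'wor', 'ld']
    print(tokenize("short"))        # ['sh', 'o', 'r', 't']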
I don't understand why it shows Chinese characters. Assuming UTF-8, English characters are a lot more compact than Chinese characters, so we can't really compare.
Otherwise it's a good idea and it works, but it's super slow, it only works for English text, and the system requirements are huge. I like it.
Modern Chinese is typically more dense than modern Japanese (which is partially phonetic), and ancient formal Chinese is even more compact than modern Chinese.
However it's worth noting that Chinese characters are analogous to entire words in English, and are composed of components much like English characters are composed of letters.
For example "thanks" is spelled "t h a n k s"
"謝" is made up of "言 身 寸"
(Of course, the components in Chinese have less correlation to their pronunciation, but the main point I'm making here is that there is a LOT of overlap in the common components used to assemble the entire Chinese lexicon.)
It is really not a fair comparison to compare languages in terms of their number of characters needed to represent something.
Better measures would be the fastest time (in seconds) needed to use speech to convey a concept intelligibly to an average native speaker, or the square centimeters of paper needed to convey an idea given the same level of eyesight.
Indeed, what I meant was basically how much information you could cram into a message in a digital medium that is character-limited, but not really limited in what characters you can use. Like SMS messages, or Twitter messages back when they were still limited to 140 characters.
I agree, this is confusing. It also shows Korean and hiragana mixed in with the Chinese, which just reads as gibberish to CJK speakers.
If you're counting by "number of characters" you might as well use the entire Unicode range, including all the emoji, if you're going to mix up Chinese+Japanese+Korean, which nobody would ever do anyway.
Also, "number of characters" is a bit meaningless in the sense that human-intelligible Chinese is already far more compact than human-intelligible English in number of characters, and that's only because each character inherently carries more information, and not because the language itself is a compressed representation of ideas. Chinese characters are also made up of a standard set of components that are reused throughout the lexicon and assembled into different ways to make different characters, so it isn't "fair" to count a Chinese character on the same footing as an English character. A Chinese character is loosely more analogous to an entire word in English, and the components inside Chinese characters are kind of analogous to English letters (except they are arranged two-dimensionally and are most of the time not related to a character's pronunciation).
A more interesting study would be compressing English text into shorter strings that use English characters only.
The raw output is actually closer to random bits; I'm certain he just mapped those bits to CJK characters so they would be printable. The output is not intended to be intelligible, as far as I can tell.
A neat trick I found while working with GPT-2 is that byte-pair encoding is, in itself, a compression method. With Huggingface Transformers, encoding/decoding this way is very fast.
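A minimal sketch of that, assuming the transformers package is available: GPT-2's vocabulary has roughly 50k entries, so each token ID fits in two bytes, and a token typically covers several characters.

    # Byte-pair encoding alone as a (modest) compressor: pack each token ID
    # into 2 bytes and compare against the raw UTF-8 size.
    from transformers import GPT2TokenizerFast

    tok = GPT2TokenizerFast.from_pretrained("gpt2")
    text = "The quick brown fox jumps over the lazy dog. " * 10

    ids = tok.encode(text)
    raw_bytes = len(text.encode("utf-8"))
    packed_bytes = 2 * len(ids)              # 2 bytes per ID covers a ~50k vocabulary

    print(raw_bytes, packed_bytes)           # packed form should be noticeably smaller
    print(tok.decode(ids) == text)           # byte-level BPE round-trips losslessly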
This is really just a way to show how good GPT-2 is at predicting text. If you know anything about information theory, you'll know that the entropy of the information source places a hard limit on how much it can be compressed. If GPT-2 is really good at predicting English text, then the entropy of its output should be very very close to the entropy of natural English text. Thus, using GPT-2 predictions as an adaptive source encoder will achieve compression ratios that approach the information content (entropy) of English text.
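To make that concrete: an ideal arithmetic coder driven by the model spends about -log2 p(token | context) bits per token, so the model's cross-entropy on the text is essentially the compressed size. A sketch assuming transformers and torch are installed (the first token has no context, so it's left out of the count):

    # Measure GPT-2's ideal code length for a piece of text: add up
    # -log2 of the probability it assigned to each actual next token.
    import math
    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tok = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    text = "To be, or not to be, that is the question."
    ids = torch.tensor([tok.encode(text)])

    with torch.no_grad():
        logits = model(ids).logits                         # next-token scores at every position
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    picked = log_probs[torch.arange(ids.shape[1] - 1), ids[0, 1:]]

    bits = -picked.sum().item() / math.log(2)              # ideal compressed size in bits
    print(f"{bits / 8:.1f} bytes vs {len(text.encode('utf-8'))} bytes raw")

The better the model predicts the text, the smaller that number, which is exactly why a strong language model doubles as a strong compressor.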
I compressed "I am going to work outside today," then put the compressed output in Google Translate. Google translated the Chinese characters back to English as "raccoon."
I think the Chinese text that comes out confuses Google translate. I took the whole first sentence of Hamlet's soliloquy which compressed to 䮛趁䌆뺜㞵蹧泔됛姞音逎贊 and plugged that into Google Translate. It came back with "Commendation." The reverse translation is 表彰
It's not Chinese text, it's an arithmetic-coded stream of bits mapped so the bits fall within the range of some codepoints. It's basically a variant of base64 except for Unicode.
(Side note: aren't these codepoints very expensive to encode in UTF-8? It seems there must be a lower-valued range more suited to it)
The page for base32768 has some efficiency charts for different binary-to-text encodings on top of different UTF encodings, as well as how many bytes each lets you stuff into a tweet. Depends on where you're going to house the data, I guess.
In addition to being 94% efficient in UTF-16 (!), this reveals another reason why one might want to optimize for number of characters: fitting as many bytes as possible into a tweet, which is bounded in the number of characters, not bytes.
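Something in the spirit of what's being described, purely as an illustration (this is an assumed mapping, not necessarily the one gpt2tc uses): take the compressed bit stream 15 bits at a time and offset into the codepoint range starting at U+4E00, which reaches from the CJK ideographs up into the Hangul syllables, hence the Chinese/Korean jumble people are noticing.

    # Illustrative base-2^15 binary-to-text encoding: 15 bits per character,
    # offset from U+4E00. Assumed mapping for demonstration only.
    import os

    BASE, BITS = 0x4E00, 15

    def bits_to_text(data: bytes) -> str:
        n = int.from_bytes(data, "big")
        return "".join(chr(BASE + ((n >> shift) & ((1 << BITS) - 1)))
                       for shift in range(0, len(data) * 8, BITS))

    def text_to_bits(s: str, nbytes: int) -> bytes:
        n = 0
        for i, ch in enumerate(s):
            n |= (ord(ch) - BASE) << (i * BITS)
        return n.to_bytes(nbytes, "big")

    payload = os.urandom(8)                 # stand-in for the arithmetic coder's output
    encoded = bits_to_text(payload)
    print(encoded, text_to_bits(encoded, len(payload)) == payload)   # round-trips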
It's so that the output uses printable characters, that's all. The raw output would actually just be random bits, or at least something approximating random bits if GPT-2 is as good as we hope it is.
This is Chinese characters mixed with Korean characters, and this is pretty much never done by humans. It is analogous to mixing English and hieroglyphics and typing out some gibberish with both.
The author might as well have included the rest of the Unicode range including Arabic, Emoji, and math symbols.
I hit the reverse button again, and got "recognition" which translated back to 承认 which finally got into a closed loop to recognition and back to the same Chinese text.
For more fun enter "I am going to work outside today"
compress, delete the second character and decompress, the result is...
"I know what you're thinking –"
Try swapping a few characters in the compressed string before decompressing and get a totally unrelated, but somewhat plausible, sentence. -->
䔹䧹焫놉勏㦿顱㦽膑裚躈葊
Swapping last two:
䔹䧹焫놉勏㦿顱㦽膑裚葊躈 -->
Try swapping a few characters in the compressed string before decompressing and get a totally unrelated, but somewhat applied tlh
Swapping first two:
䧹䔹焫놉勏㦿顱㦽膑裚躈葊 -->
Sexy Shania Twain acting as a sprite for sexy Hogan's Alley demo dude
my site
my favorite animal's name is camelid 2 my favorite artist is david maile my favorite movie's are
It's just adaptive arithmetic coding, with the distribution provided by GPT-2 instead of some other statistical analysis of the source. He uses CJK simply to make the output printable, but it's really just random bits. I mean, it's a neat idea, but certainly not novel.
I used

    import random

    for i in range(20):
        print(''.join(chr(random.randrange(20000, 25000)) for x in range(4)))
to generate some random text; one string like 劓惂儶宓 turns up this bizarre output:
> Honeybees ( Apis mellifera ) are splendidly beautiful little creatures. They have shapely abdomens, amphistales and pedipalps, round chests, and square backs … all of them beautifully highly marketable. Exactly what has caused the popularity of bees I do not quite know; just what they do is a mystery to me. I beg to differ. The
oh this is so much fun! It's like tuning an old radio and suddenly hearing speech amid the static. A tiny nudge on the dial and it's a different accent/topic/language altogether.
My last name compressed to 3 characters. I tried my wife's last name and it was 3 characters, then I decided to add the accent to it that normally gets dropped in an English-language context and it compressed to 2. Adding first names, I was 4 characters and she was 5 with and without the accent. William Shatner went to 6 characters. Barack Obama went to 2. William Shakespeare also to 2.
Right, I guess your last name's code length reflects the importance of all Hoseks worldwide, albeit filtered through some chunking of the word, so it has to compete with the importance of other "Hos" words like hospitals and so on.
FYI, the corresponding standalone Linux command line version is available at https://bellard.org/nncp/gpt2tc.html . It also does text completion and file compression.
Less effectively. GPT-2 and a Markov chain are both predictive models; GPT-2 just happens to be a much more complex (and, in most cases, more accurate) model for English text, so fewer bits are required on average to encode the delta between its predictions and the actual text.
I'm not at all familiar with arithmetic encoding (or the adaptive version thereof), but, after reading some guides, it seems to me that the novel thing here is using GPT2 to somehow generate a character probability distribution?
The theory being that GPT2 should have a distribution closely matching "reality" and thus minimizing the output size?
So if you end up being famous and talked about a lot on Wikipedia, your name will compress better?
The impact of bias in training data is interesting in general here. What's the impact of Wikipedia's article biases? That's probably one of the main corpora used.
Source: https://bellard.org
--
Copied and edited some text from my post a year ago: https://news.ycombinator.com/item?id=19591308 -- I never cease to be amazed by the guy.