
Transformer-XL: Unleashing the Potential of Attention Models - headalgorithm
https://ai.googleblog.com/2019/01/transformer-xl-unleashing-potential-of.html
======
PaulHoule
I am very impressed with how well this does on the Hutter Prize.

~~~
YjSe2GMQ
How well does it do? Would you have a link?

~~~
PaulHoule
[https://paperswithcode.com/sota/character-level-models-hutte...](https://paperswithcode.com/sota/character-level-models-hutter-prize)

~~~
terrelln
That site claims 0.99 BPC, which I guess means the compressed size is 10^8
characters * 0.99 bits/character / 8 ~= 12,375,000 bytes.

Does that include the size of the model, or is this just the cost to encode
the errors?
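
For concreteness, a minimal sketch of that arithmetic (assuming the enwik8
test file of 10^8 one-byte characters):

    # Convert bits-per-character (BPC) into a compressed size in bytes.
    num_chars = 10**8   # enwik8: 10^8 characters, 1 byte each
    bpc = 0.99          # reported bits per character
    compressed_bytes = num_chars * bpc / 8
    print(f"{compressed_bytes:,.0f} bytes")  # -> 12,375,000 bytes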

~~~
robrenaud
That is just the cost to encode errors.

------
Klasiaster
Relevant discussion at the end of this thread:
[https://encode.ru/threads/3059-How-much-further-can-the-best...](https://encode.ru/threads/3059-How-much-further-can-the-best-compression-go)

tl;dr: the measurements here are not meaningful for a comparison; read
[https://encode.ru/threads/3059-How-much-further-can-the-best...](https://encode.ru/threads/3059-How-much-further-can-the-best-compression-go?p=59178&viewfull=1#post59178)

~~~
yorwba
That comment complains that the model is not useful for _compression_,
because it stores all that linguistic knowledge in the 3GB of model weights.
But for NLP applications, you want the model to store as much information as
possible; all that matters is whether it generalizes to unseen data. The
compression measurement is just there to show that, not to imply that the
model would be useful as a general-purpose compression algorithm.
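
As an illustrative sketch (assuming a hypothetical model.prob interface, not
the paper's actual evaluation code): BPC is the model's average cross-entropy
on held-out text, expressed in bits per character, which is why the same
number measures both generalization and compressibility:

    import math

    # Hypothetical interface: model.prob(context, char) returns the model's
    # predicted probability of the next character given the preceding text.
    def bits_per_character(model, text):
        total_bits = 0.0
        for i, char in enumerate(text):
            p = model.prob(text[:i], char)
            total_bits += -math.log2(p)  # ideal code length for this char
        return total_bits / len(text)

    # An arithmetic coder driven by the same probabilities would compress the
    # text to about bits_per_character(model, text) * len(text) bits, which is
    # why held-out BPC doubles as a compression figure (model weights excluded).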

