
Building an efficient neural language model over a billion words - jamesgpearce
https://code.facebook.com/posts/1827693967466780/
======
sharemywin
Oh, that is so yesterday. I did that last year with my laptop... oh wait, I
can't eavesdrop on a billion people's private conversations.

~~~
sp332
The corpus was published in 2013. [https://github.com/ciprian-chelba/1-billion-word-language-mo...](https://github.com/ciprian-chelba/1-billion-word-language-modeling-benchmark) (And it seems to only have 0.8 billion words?)

Edit: Here's one mentioned in one of the other papers; it has over 7 billion
"tokens", which I think includes words and punctuation.
[https://ibm.ent.box.com/v/booktest-v1](https://ibm.ent.box.com/v/booktest-v1)

