
Natural Language Corpus Data: Beautiful Data - espeed
http://norvig.com/ngrams/
======
rspeer
This is a great start, but as you get more serious about NLP, keep in mind
that a Web crawl from 2006 is not the same as ground truth about language. The
top 100,000 words in this data include a lot of ephemeral spam. If you're
looking for slang, it'll mostly be outdated. And words like "sitemap" and
"Shockwave" are not as common throughout language as the Web alone would
indicate.

The later Google Books Ngrams [1] data sets are cleaner, but that comes at a
cost. You lose spam, but also a lot of other uses of language, when you only
consider text that has been published in print. It's all in the formal
register. And then you also get correlated OCR errors, as in [2], and you
still get data whose collection ended in 2008.

So what am I suggesting you should use, if data from the Web is bad for some
things and data from books is bad for others? Well, both of them, and a lot of
other things too. An analysis that depends on word frequencies should use the
consensus of many sources.
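The "consensus of many sources" idea can be sketched in a few lines: take the median of log-scale frequencies across corpora, so a single skewed source (e.g. the Web's fondness for "sitemap") can't dominate. The corpora and numbers below are made up for illustration, and wordfreq's actual combination method may differ:

```python
from statistics import median
from math import log10

# Hypothetical per-corpus frequencies (fraction of all tokens).
# These values are illustrative, not real measurements.
corpus_freqs = {
    "web":       {"the": 0.050, "sitemap": 1e-4, "press": 2e-5},
    "books":     {"the": 0.055, "sitemap": 1e-9, "press": 8e-5},
    "subtitles": {"the": 0.048, "sitemap": 1e-9, "press": 3e-5},
}

def consensus_zipf(word, corpora, floor=1e-9):
    """Median Zipf value (log10 of frequency per billion words) across corpora.

    The median resists one corpus that wildly over-represents a word,
    the way the Web over-represents "sitemap"."""
    zipfs = [log10(c.get(word, floor) * 1e9) for c in corpora.values()]
    return median(zipfs)

for w in ("the", "sitemap", "press"):
    print(w, round(consensus_zipf(w, corpus_freqs), 2))
```

With the median, "sitemap" ends up rated as rare because two of the three sources agree it is, despite its inflated Web count.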

I've been working on compiling together many of those sources, in many
languages, in the Python package wordfreq [3]. Now I'm working on an update to
include excellent corpus data just released by the OPUS project, including
OpenSubtitles 2016 [4].

[1]
[http://storage.googleapis.com/books/ngrams/books/datasetsv2....](http://storage.googleapis.com/books/ngrams/books/datasetsv2.html)

[2]
[http://drhagen.com/blog/the-missing-11th-of-the-month/](http://drhagen.com/blog/the-missing-11th-of-the-month/)

[3]
[https://github.com/LuminosoInsight/wordfreq](https://github.com/LuminosoInsight/wordfreq)

[4]
[http://opus.lingfil.uu.se/OpenSubtitles2016.php](http://opus.lingfil.uu.se/OpenSubtitles2016.php)

~~~
espeed
Hi Rob - Thanks for posting this. I took a peek at the wordfreq project --
looks interesting. Are you decomposing to a unicode normal form, such as NFKD
[0] or storing words as UTF-32 (that's what Python uses internally [1]),
combining multi-character code points into the 32-bit code unit [2]
represented by UTF-32?

[0]
[https://en.wikipedia.org/wiki/Unicode_equivalence#Normalizat...](https://en.wikipedia.org/wiki/Unicode_equivalence#Normalization)

[1]
[https://en.wikipedia.org/wiki/UTF-32#Use](https://en.wikipedia.org/wiki/UTF-32#Use)

[2]
[https://en.wikipedia.org/wiki/Character_encoding#Terminology](https://en.wikipedia.org/wiki/Character_encoding#Terminology)

~~~
rspeer
I use NFKC form for scripts that seem to require it, such as Arabic, and NFC
for others. If I used NFKC for English, for example, then encountering a brand
name with a trademark sign on it would add the letters "TM" to the end of the
word.
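The stdlib `unicodedata` module shows the difference directly: the trademark sign has a compatibility decomposition to "TM" but no canonical one, so NFC leaves it alone while NFKC folds it into the word:

```python
import unicodedata

s = "Acme™"
print(unicodedata.normalize("NFC", s))   # Acme™ — unchanged
print(unicodedata.normalize("NFKC", s))  # AcmeTM — compatibility folding
```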

In general I use tokenization rules that follow the Unicode standards in [UAX
29], with language-specific external libraries for Chinese, Japanese, and
Korean, and with some language-specific tweaks to handle cases the Unicode
Consortium didn't go into. [0]

I use Python 3 strings, and it's a peculiar bit of abstraction-busting to
worry about what they look like inside the Python interpreter. It's only
UTF-32 for strings that contain high codepoints. See [PEP 393], "Flexible
String Representation".
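A quick, CPython-specific way to see the flexible representation at work is to compare `sys.getsizeof` on strings whose widest code points differ:

```python
import sys

# Under PEP 393, CPython stores each string at the narrowest fixed width
# that fits its widest code point: 1, 2, or 4 bytes per character.
ascii_s  = "a" * 100           # 1 byte/char (Latin-1 range)
bmp_s    = "\u4e00" * 100      # 2 bytes/char (BMP, above Latin-1)
astral_s = "\U0001F600" * 100  # 4 bytes/char (full UTF-32 width)

print(sys.getsizeof(ascii_s), sys.getsizeof(bmp_s), sys.getsizeof(astral_s))
```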

I don't think there is such a thing as "multi-character code points". At no
point do I use UTF-16 (which encodes supplementary code points as pairs of
surrogate code units, which are not characters on their own), if that's what
you're asking about.

[0]
[https://github.com/LuminosoInsight/wordfreq/blob/master/word...](https://github.com/LuminosoInsight/wordfreq/blob/master/wordfreq/tokens.py)

[UAX 29] [http://unicode.org/reports/tr29/](http://unicode.org/reports/tr29/)
(fixed link)

[PEP 393]
[https://www.python.org/dev/peps/pep-0393/](https://www.python.org/dev/peps/pep-0393/)

~~~
espeed
Thanks for the info. I'm looking at this from the perspective of designing a
backend datastore and query engine for a knowledge system. The idea is to
encode a spatial data structure (similar to Google's S2 Geometry Library [0])
that enables content-based addressing of non-spatial data types for data
fusion.

One idea is to make a lattice of Unicode characters that builds up to
combinations of words a la Formal Concept Analysis [1] -- on one level, the
characters compose into words that represent properties (key/value pairs), and
then the KV pairs compose into higher-level objects. Each property and higher-
level object is encoded with an integer derived from its constituent
objects/properties, and each object is encoded in such a way that its
constituent objects/properties can be determined algorithmically from the
integer without having to traverse the structure [2] -- ANS encoding [3]
embedded into a space with a VI metric
([https://en.wikipedia.org/wiki/Variation_of_information](https://en.wikipedia.org/wiki/Variation_of_information))
[4] might make this work. Have you played with this type of design?

[0]
[http://blog.christianperone.com/2015/08/googles-s2-geometry-...](http://blog.christianperone.com/2015/08/googles-s2-geometry-on-the-sphere-cells-and-hilbert-curve/)

[1]
[https://www.youtube.com/watch?v=Xuxm929tIRY](https://www.youtube.com/watch?v=Xuxm929tIRY)

[2]
[http://www09.sigmod.org/sigmod/record/issues/0506/p47-articl...](http://www09.sigmod.org/sigmod/record/issues/0506/p47-article-tropashko.pdf)

[3]
[https://en.wikipedia.org/wiki/Asymmetric_Numeral_Systems](https://en.wikipedia.org/wiki/Asymmetric_Numeral_Systems)

[4]
[http://www.sciencedirect.com/science/article/pii/S0047259X06...](http://www.sciencedirect.com/science/article/pii/S0047259X06002016/pdf?md5=71251a6300127946404246c8f8d8f7ea&pid=1-s2.0-S0047259X06002016-main.pdf)

------
otakucode
Aw, I was hoping for an actual corpus. A couple years ago I had wanted to find
a corpus of news stories or similar things in order to analyze relationships
between entities (people/places/events) mentioned. It was then I discovered
that every corpus I could find was specifically only permitted to be used for
natural language analysis, and content analysis was specifically forbidden.
Very disappointing. Still hoping to find a dump of a few decades worth of news
or something some day to play with.

~~~
mark_l_watson
Take a look at the Google Books corpus. I used it several years ago by renting
the largest disk/RAM server Hetzner offers and collecting the most common
1-grams, 2-grams, ..., 5-grams in the corpus. Many classification,
summarization, etc. systems that rely on simple bag-of-words approaches can be
improved by including n-grams. For a simple example, to label the sentiment of
a text, considering 2-grams lets you correctly handle "not good", whereas
considering one word at a time, ignoring word order, etc. is not so good.
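The "not good" point can be shown with a toy scorer; the lexicon and the one-token negation rule here are entirely made up, and a bag-of-words version of the same scorer would miss the polarity flip:

```python
# Toy bigram-aware sentiment scorer. A pure bag-of-words model would score
# "not good" as positive because it sees only the word "good".
POSITIVE = {"good", "great"}
NEGATIVE = {"bad", "awful"}
NEGATORS = {"not", "never"}

def sentiment(text):
    tokens = text.lower().split()
    score = 0
    # Walk (previous word, current word) pairs, i.e. bigrams.
    for prev, word in zip([""] + tokens, tokens):
        if word in POSITIVE:
            score += -1 if prev in NEGATORS else 1
        elif word in NEGATIVE:
            score += 1 if prev in NEGATORS else -1
    return score

print(sentiment("the movie was good"))      # 1
print(sentiment("the movie was not good"))  # -1
```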

------
espeed
What's the best source/method to get the frequency distribution to build an
Arithmetic coding model [0] for English (multilingual ideally) with emoticons?

[0]
[https://en.wikipedia.org/wiki/Arithmetic_coding#Defining_a_m...](https://en.wikipedia.org/wiki/Arithmetic_coding#Defining_a_model)

~~~
web007
One of the main points of arithmetic coding is that you don't have to do a
frequency count first, as you would for Huffman coding. It's efficient enough
that you can build your model piecewise as you go, sacrificing only a few bits
of coding inefficiency.
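A minimal sketch of such an adaptive model (just the frequency bookkeeping, not the coder itself), assuming Laplace-smoothed counts so unseen symbols never get zero probability:

```python
from collections import Counter

class AdaptiveModel:
    """Minimal adaptive frequency model for an arithmetic coder.

    Every symbol starts with a count of 1 (Laplace smoothing), and counts
    are updated after each symbol is coded. Encoder and decoder stay in
    sync by applying the same updates in the same order."""

    def __init__(self, alphabet):
        self.counts = Counter({s: 1 for s in alphabet})
        self.total = len(alphabet)

    def prob(self, symbol):
        return self.counts[symbol] / self.total

    def update(self, symbol):
        self.counts[symbol] += 1
        self.total += 1

model = AdaptiveModel("abc")
print(model.prob("a"))        # 1/3 before any input
for s in "aab":
    model.update(s)
print(model.prob("a"))        # 3/6 = 0.5 after seeing "aab"
```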

You're not going to find many public datasets of SMS or other emoji-heavy
text, regardless of language. Twitter or Instagram are your best bets, but a
model built on those will only work well on those kinds of input.

~~~
espeed
Just discovered Matthew Rothenberg's Emojitracker project:

* [http://www.emojitracker.com/](http://www.emojitracker.com/)

* [http://github.com/mroth/emojitrack](http://github.com/mroth/emojitrack)

* [https://medium.com/@mroth/how-i-built-emojitracker-179cfd823...](https://medium.com/@mroth/how-i-built-emojitracker-179cfd8238ac#.hhoymvyp1)

* [https://medium.com/@mroth/how-i-kept-building-emojitracker-c...](https://medium.com/@mroth/how-i-kept-building-emojitracker-c31378810136)

------
saycheese
(2009)

Just noting this content is from 2008-2009 and the related book was published
in 2009.

------
sonabinu
Beautiful data is a beautiful book!

