Google Books Ngram datasets

moultano · on Dec 19, 2010

Here's some fun I had with the viewer: http://moultano.blogspot.com/2010/12/history-through-google-...

I've always kind of scoffed when people complain that a history teacher didn't make what they were learning "relevant" or "relatable", but after playing with this for the first time I understand the benefits. At this point in my life, making something "relatable" means expressing it in terms of term frequency statistics and graphs, so this totally blew me away. I was obsessed with it to the point of mania for the first 24 hours. I believe in history now. :)

Smerity · on Dec 19, 2010

This is hugely exciting news. I previously used Google's Web1T corpus in NLP experiments and the restrictive license limited a number of potential uses.

This new corpus has a temporal aspect (it tracks each word's usage by publication year) and is also released under a Creative Commons license. I'd love to see this become the basis of a large-scale database benchmark / competition or an open-source linguistic application.

kristopher · on Dec 19, 2010

Lots of interesting treasures hidden in this dataset. For example, here is Benford's Law:

http://ngrams.googlelabs.com/graph?content=1,2,3,4,5,6,7,8,9...
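For reference, Benford's Law predicts that a leading digit d shows up with probability log10(1 + 1/d), so 1 should be the most common and 9 the rarest. Here is a minimal Python sketch of that expected curve, plus a helper for comparing it to observed counts. The function names are made up for illustration, and no real ngram counts are included; you would have to supply your own from the dataset.

```python
import math


def benford_expected():
    """Expected share of each leading digit 1..9 under Benford's Law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def first_digit_shares(counts):
    """Turn raw counts per digit into shares that sum to 1.

    `counts` is a hypothetical dict mapping digit -> occurrence count,
    e.g. totals pulled from the 1-gram files for "1" through "9".
    """
    total = sum(counts.values())
    return {d: c / total for d, c in counts.items()}


def max_deviation(observed_shares):
    """Largest absolute gap between observed shares and Benford's curve."""
    expected = benford_expected()
    return max(abs(observed_shares.get(d, 0.0) - p) for d, p in expected.items())


if __name__ == "__main__":
    for digit, share in benford_expected().items():
        print(f"{digit}: {share:.3f}")
```

A small max_deviation would suggest the observed digit frequencies follow the curve closely; how small counts as "close" is a judgment call and not something this sketch settles.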

kraemate · on Dec 19, 2010

For anyone in NLP (Natural Language Processing), this is a goldmine.