Google Books Ngram datasets (googlelabs.com)
20 points by abraham on Dec 19, 2010 | 4 comments



Here's some fun I had with the viewer: http://moultano.blogspot.com/2010/12/history-through-google-...

I've always kind of scoffed when people complain that a history teacher didn't make what they were learning "relevant" or "relatable", but after playing with this for the first time I understand the benefits. At this point in my life, making something "relatable" means expressing it in terms of term-frequency statistics and graphs, so this totally blew me away. I was obsessed with it to the point of mania for the first 24 hours. I believe in history now. :)


This is hugely exciting news. I previously used Google's Web1T corpus in NLP experiments, and its restrictive license ruled out a number of potential uses.

This new corpus has a temporal aspect (it tracks each word's usage by publication year) and is additionally released under a Creative Commons license. I'd love to see it become the basis of a large-scale database benchmark or competition, or of an open source linguistic application.
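To make the temporal aspect concrete, here is a rough sketch of pulling a per-year usage series for one word. It assumes the files are tab-separated with the n-gram, the year, and a match count as the first three fields; that layout is an assumption to check against the dataset's own documentation. The `yearly_counts` helper and the sample rows are made up for illustration.

```python
from collections import defaultdict

def yearly_counts(lines, word):
    """Sum match counts per year for one word (hypothetical helper).

    Assumes each record is tab-separated as: ngram, year, match_count, ...
    """
    totals = defaultdict(int)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed records
        ngram, year, match_count = fields[0], fields[1], fields[2]
        if ngram == word:
            totals[int(year)] += int(match_count)
    return dict(totals)

# Made-up sample rows, not real dataset values.
sample = [
    "history\t1900\t12\t10\t5\n",
    "history\t1901\t15\t11\t6\n",
    "future\t1900\t7\t6\t4\n",
]

series = yearly_counts(sample, "history")
```

For real files you would pass an open file handle instead of `sample`, since the function only iterates over lines.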


Lots of interesting treasures hidden in this dataset. For example, here is Benford's Law:

http://ngrams.googlelabs.com/graph?content=1,2,3,4,5,6,7,8,9...
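For reference, Benford's Law predicts that a leading digit d shows up with probability log10(1 + 1/d). A minimal sketch of those expected shares is below, for comparing against whatever relative frequencies you read off the graph. The function names are mine, and the `observed` counts in the usage are placeholders you would supply yourself.

```python
import math

def benford_expected(digit):
    """Expected share of leading digit `digit` (1-9) under Benford's Law."""
    return math.log10(1 + 1 / digit)

def to_shares(counts):
    """Normalize raw counts per digit into shares that sum to 1."""
    total = sum(counts.values())
    return {d: c / total for d, c in counts.items()}

expected = {d: benford_expected(d) for d in range(1, 10)}

for d in range(1, 10):
    print(f"{d}: {expected[d]:.3f}")
```

With your own counts in a dict keyed by digit, `to_shares(observed)` gives numbers directly comparable to `expected`.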


For anyone in NLP (Natural Language Processing), this is a goldmine.



