

Google 5gram corpus has unreasonable 5grams - numeromancer
http://nlpers.blogspot.com/2010/02/google-5gram-corpus-has-unreasonable.html

======
wrs
It's funny that the comments on this post have been spammed with the sort of
duplicative trash he may be seeing in the corpus. I never imagined that spam
comments could actually be relevant.

------
ubernostrum
The trolls/dwarfs one is almost certainly due to the works of Terry Pratchett.

~~~
ZeroGravitas
Most of the weird ones are explained in the comments; you just have to dig
through some spam.

* The trolls/dwarfs one was part of the blurb repeated in bookstores and reviews for a Terry Pratchett novel.

* "the poet wicked the woman" was part of a list of plays, with "Wicked" (about the witch from Oz) in the middle.

* "The prince compiled the Mishna" is apparently something from Jewish lore.

I'd guess there'd be similar explanations for the others just like "the matrix
reloaded the matrix" which he calls out, just not as obvious.

One of the commenters points out that any string of 5 words is already an
outlier so you're naturally going to get noise like this mixed in.

If I google the first one it seems to be some standard boilerplate text used
in collecting data about chemical exposure. The second is a list of UK reality
TV series "Popstars: The Rivals, Shattered, The Farm". So they are widely used
on the web, just not in normal speech.

------
Caligula
I am glad to find out there are some other issues with it. My main
disappointment with it, however, remains its license. It would have been really
cool if Google had released it with a CC or MIT license, but instead it's
restricted to academic usage only. Better to spend time with other corpora.

~~~
steveitis
What other n-gram corpora are there? I've been looking, and all I can seem to
find is some web spammer trying to sell me one.

I don't particularly care about the license, but I'd rather not have to build
my own spider to crawl and generate one for me.

~~~
GFischer
Apparently Microsoft has one:

[http://research.microsoft.com/apps/pubs/default.aspx?id=1307...](http://research.microsoft.com/apps/pubs/default.aspx?id=130762)

------
gojomo
Quite possibly the crawler collecting the corpus hit a crawler trap
(intentional or unintentional) -- or perhaps web-based game output (which when
visited by a crawler became a de facto trap) -- which multiplied the
implausible phrases.
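
To illustrate how a trap could inflate counts, here is a minimal sketch. It assumes whitespace tokenization and exact-match page deduplication, which is almost certainly not how Google built the corpus; the function names and the sample pages are made up for illustration.

```python
from collections import Counter


def five_grams(text, n=5):
    """Return all n-word windows of a text (naive whitespace tokenization)."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(pages, dedupe=False, n=5):
    """Count n-grams over pages, optionally skipping exact duplicate pages."""
    counts = Counter()
    seen = set()
    for page in pages:
        if dedupe:
            # Hypothetical dedup key: the page text with whitespace normalized.
            key = " ".join(page.lower().split())
            if key in seen:
                continue
            seen.add(key)
        counts.update(five_grams(page, n))
    return counts


# A made-up "trap": the same page served 1000 times, plus one ordinary page.
pages = ["the matrix reloaded the matrix revolutions"] * 1000
pages.append("the cat sat on the mat today")

raw = count_ngrams(pages)
deduped = count_ngrams(pages, dedupe=True)

phrase = ("the", "matrix", "reloaded", "the", "matrix")
print(raw[phrase], deduped[phrase])
```

Without deduplication the repeated page dominates the counts; with it, the phrase appears once. A real trap usually generates near-duplicates rather than exact copies, so exact matching like this would only catch the simplest case.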

