
Tokenising the English text of 30TB Common Crawl - LiveTheDream
http://matpalm.com/blog/2011/12/10/common_crawl_visible_text/
======
muyuu
Having the full Common Crawl uncompressed on disk, with some indexing and
RAID-1, is becoming quite feasible (easily under $10k). For maybe 10 years the
text part of the web has been growing a lot more slowly than the price of
storage has been falling. At this rate anyone will be able to keep a full
crawl of the web on a mid-range desktop computer in just a few years. Here's
hoping for good weather in Thailand. :-)

How much do you need right now, 200TB tops? That includes index files,
uncompressed data and replication. With some clever compression and data
structures you can probably cut that to half or less. Financially speaking
this is already in local club territory.
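
As a rough sanity check on those figures, here is a minimal sketch that works out what average price per raw terabyte the budget implies. It uses only the two numbers from the comment above (a $10k budget and 200TB of raw capacity); the function name and the idea of reading "raw capacity" as already including replication are assumptions made for illustration, not anything measured.

```python
def implied_price_per_tb(budget_usd, raw_capacity_tb):
    """Highest average price per raw terabyte that still fits the budget."""
    if raw_capacity_tb <= 0:
        raise ValueError("raw_capacity_tb must be positive")
    return budget_usd / raw_capacity_tb


BUDGET_USD = 10_000  # "easily under $10k" from the comment
RAW_TB = 200         # "200TB tops", taken to include indexes and replication

full = implied_price_per_tb(BUDGET_USD, RAW_TB)
# The comment's guess that compression roughly halves the footprint.
halved = implied_price_per_tb(BUDGET_USD, RAW_TB / 2)

print(f"{full:.0f} USD/TB at {RAW_TB} TB, {halved:.0f} USD/TB at {RAW_TB / 2:.0f} TB")
```

So the claim holds only if disks average out at or below those per-terabyte prices, enclosures and controllers not counted.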

------
rjurney
Very good to see an example of working with the common crawl. I would
encourage YC applicants to think about what kind of opportunities it presents.

------
monatron
Very interesting. Surprising to see Lithuanian as the second most popular
language. Oddly enough, I was having a conversation with someone this weekend
when they mentioned that Lithuania had the highest fiber-to-the-home
penetration in Europe and one of the fastest average internet connections in
the world. Does anyone know why this is?

~~~
malkung
I once tried to identify the language of (supposedly) English-language
Facebook posts. The second most frequent language was Estonian. Looking at
actual posts classified as Estonian, it turned out they contained text like
"Soooooo nice". Because the language identifier looks at character sequences,
and double vowels are unusually common in Estonian, such posts were
classified as Estonian. Dutch and Norwegian were the other two languages for
which English was mistaken; apparently some character sequences there have
frequency distributions similar to English.
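
To make the failure mode concrete, here is a small sketch of what a character-sequence model sees. It is not the identifier that was actually used, just a toy that counts character bigrams and measures how many of them are a doubled vowel; the function names and the vowel set are assumptions for illustration.

```python
from collections import Counter

# Assumed vowel set: English vowels plus the Estonian letters õ, ä, ö, ü.
VOWELS = set("aeiouõäöü")


def char_bigrams(text):
    """Count overlapping character bigrams inside each lowercased word."""
    counts = Counter()
    for word in text.lower().split():
        letters = "".join(ch for ch in word if ch.isalpha())
        counts.update(letters[i:i + 2] for i in range(len(letters) - 1))
    return counts


def doubled_vowel_share(text):
    """Fraction of bigrams that are one vowel repeated, such as 'oo'."""
    counts = char_bigrams(text)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    doubled = sum(
        n for bigram, n in counts.items()
        if bigram[0] == bigram[1] and bigram[0] in VOWELS
    )
    return doubled / total


for sample in ("So nice", "Soooooo nice"):
    top = char_bigrams(sample).most_common(3)
    print(sample, top, round(doubled_vowel_share(sample), 2))
```

The stretched spelling turns "oo" into the dominant bigram of the post, so a model trained on frequency profiles would plausibly score it closer to a language where doubled vowels are common than to ordinary English.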

------
ars
A bit premature to post this - no results yet, he's still working on it.

~~~
mat_kelcey
+1

