

Show HN: Building n-grams from 9m tweets for predictive typing (autocomplete) - chime
http://ktype.net/wiki/research:articles:progress_20110209

======
arctangent
I've been doing something similar for the last month or so, but I decided that
Project Gutenberg data was fine for my purposes, since what I'm producing
(haiku poetry) is slightly more formal than how real people talk.

I also ran up against the problem of storing huge data structures in memory -
my list of n-grams had something like 57 million entries, which I was able to
collapse into a dictionary with 9 million keys.
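
Roughly, the collapse looked like this - a minimal sketch with hypothetical
names, not my exact code - using collections.Counter to fold the raw n-gram
list into a dict of counts:

    from collections import Counter

    def ngrams(tokens, n=3):
        # Yield every n-token window from the token list.
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])

    # Hypothetical corpus file; in practice, the Gutenberg texts.
    tokens = open('corpus.txt').read().split()
    counts = Counter(ngrams(tokens))  # duplicate n-grams collapse into unique keys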

Unfortunately this was still too large, so I turned to the Redis key-value
store. I'd been somewhat dubious about the benefit of these so-called "NoSQL"
databases, but Redis turned out to be perfect for storing these huge data
structures, and the Python libraries made the whole thing really easy to work
with. I recommend giving it a try if memory issues are limiting what you can
do.
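
If it helps anyone, here's a minimal sketch of the pattern I mean - the key
scheme is hypothetical and the calls assume the redis-py 3.x API, so treat it
as an illustration rather than my actual code:

    import redis

    r = redis.Redis(host='localhost', port=6379)

    # One sorted set per context, with candidate next words scored by
    # frequency, so a lookup returns ranked completions directly.
    def record(context, next_word):
        r.zincrby('ngram:' + context, 1, next_word)

    def top_completions(context, k=5):
        return r.zrevrange('ngram:' + context, 0, k - 1, withscores=True)

    record('how are', 'you')
    print(top_completions('how are'))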

~~~
chime
I will certainly hit memory limits if I try to analyze the Google N-grams. Did
you install Redis locally?

I would most likely go with <http://aws.amazon.com/elasticmapreduce/> if I
were going to parse through TBs of data.

~~~
arctangent
Yes, I installed Redis locally. I'm using it only as a key-value store for a
locally running Python script.

