

Ask HN: Best framework for parsing thousands of feeds? - kez

I am currently working on what I hope will be a startup (lean, bootstrapped, etc.), and I am dealing with thousands of feeds.

Presently I am downloading feeds in batches of 5-10, each batch in parallel threads, from Ruby using FeedZirra (https://github.com/pauldix/feedzirra), and then parsing them.

*Has anyone been in a similar situation and done something particularly innovative they care to share?* I plan on ranking feeds by frequency of updates after some analysis, but in the meantime I am resigned to pulling everything down as quickly as possible.

I would love to use Superfeedr for this, but the cost is prohibitive for me and I do not want to stump up the cash for the credits while still in development (although I could move to it in the future).

I am not too bothered about the technology/language; the current stack is a hodgepodge of Ruby, Ramaze, MySQL, Solr, and good old file system storage.

Thanks in advance, and I appreciate any and all comments!
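
For the curious, here is the rough shape of what I'm doing, sketched in Python rather than Ruby since I'm not wedded to the language (the feed list and batch size below are placeholders, not my real setup):

    # Sketch of the current approach: fetch a batch of feeds in
    # parallel threads, then parse each one. FEEDS and BATCH_SIZE
    # are placeholder values.
    from concurrent.futures import ThreadPoolExecutor
    import feedparser

    FEEDS = ["http://example.com/a.xml", "http://example.com/b.xml"]
    BATCH_SIZE = 10

    def fetch(url):
        # feedparser downloads and parses the feed in one call
        return url, feedparser.parse(url)

    with ThreadPoolExecutor(max_workers=BATCH_SIZE) as pool:
        for url, parsed in pool.map(fetch, FEEDS):
            print(url, len(parsed.entries))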
======
swanson
You might want to take a look at Samuel Clay's NewsBlur project:
<https://github.com/samuelclay/NewsBlur> and see how he handles this problem.

~~~
conesus
As others have said: task queues (look into Celery, which is in Python), Mark
Pilgrim's feedparser, and make sure you don't fetch more often than you need
to. A few thousand feeds is fine, but if you grow into the hundreds of
thousands of feeds and want to update them more than once a day, you're going
to have to pass in the cache controls (ETags and Last-Modified dates). Even
then, 50,000+ feeds strains the limits of a single DB.

If you were to reach that point, I recommend moving to a NoSQL DB (like Mongo,
which is what I use for NewsBlur) and sharding it, so you can read/write more
feeds without a problem. All of your analysis will then have to be done as
MapReduce jobs, so the work can be sent out to the different shards, but
that's not too difficult to learn. NewsBlur has a few examples of how to do
this.
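
As a rough sketch of what the analysis side looks like (this is not NewsBlur's actual code; the collection and field names are invented), here is a grouped per-feed count using Mongo's aggregation pipeline, which gets farmed out across shards much like a MapReduce would be:

    # Hypothetical example: count stories per feed across a sharded
    # collection. "stories" and "feed_id" are made-up names.
    from pymongo import MongoClient

    db = MongoClient()["feeds_db"]
    pipeline = [
        {"$group": {"_id": "$feed_id", "stories": {"$sum": 1}}},
        {"$sort": {"stories": -1}},
    ]
    for row in db.stories.aggregate(pipeline):
        print(row["_id"], row["stories"])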

I used Python, but Ruby would also be a fine choice, although I'm not sure
which Ruby libraries you would use.

------
dclaysmith
Check out <http://www.feedparser.org/>. It's for Python and pretty robust: it
handles ETags and Last-Modified headers, is well documented, and has loads of
unit tests.
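
A minimal example of the conditional-fetch flow (the URL is a placeholder): hand back the etag and modified values from the previous poll and feedparser sends the conditional headers for you, so an unchanged feed costs almost nothing:

    # Conditional GET with feedparser: pass the ETag and
    # Last-Modified values saved from the previous fetch; the
    # server answers 304 with no body if nothing changed.
    import feedparser

    url = "http://example.com/feed.xml"  # placeholder URL
    first = feedparser.parse(url)
    etag = first.get("etag")
    modified = first.get("modified")

    second = feedparser.parse(url, etag=etag, modified=modified)
    if second.get("status") == 304:
        print("Feed unchanged, nothing to parse")
    else:
        print(len(second.entries), "entries")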

------
bmelton
I've done my fair share of it, and while I don't know that we ultimately
tackled it 100%, there were plenty of gotchas.

1) Respect ETags / Last-Modified headers. This will save you a ton of
bandwidth, for one, and keep you from getting banned by the feeds you're
pulling. It's important. What I ended up doing was using a different method
for new feeds vs. ones I already knew about: on the initial parse (and
subsequent ones too), I would check for an ETag or Last-Modified indicator.
If I couldn't detect either, I set the poll frequency to something like half
an hour. This kept me from slamming servers that didn't properly implement
ETags, while letting me check headers on the ones that did more frequently.
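
A sketch of that decision rule (the intervals are illustrative, not tuned values):

    # Illustrative scheduling rule: feeds that expose cache
    # validators can be checked often (a 304 is cheap); feeds that
    # don't get a conservative half-hour interval.
    from datetime import timedelta

    def poll_interval(has_etag, has_last_modified):
        if has_etag or has_last_modified:
            return timedelta(minutes=5)   # cheap conditional check
        return timedelta(minutes=30)      # full fetch, so back off

    print(poll_interval(True, False))   # 0:05:00
    print(poll_interval(False, False))  # 0:30:00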

2) Hang on to your sockets. Opening/closing sockets is expensive for this
particular task. What ended up working for us was queueing entries and reusing
the same urllib handle for as many feeds as we were polling at a time.
Otherwise, we were flooding the box with open sockets.
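
In sketch form, the same idea with a shared connection pool (urllib3 here stands in for the urllib handle we actually used; the pool size is arbitrary):

    # Reuse sockets by fetching every feed through one shared
    # connection pool instead of opening a fresh socket per URL.
    import urllib3

    pool = urllib3.PoolManager(maxsize=10)  # arbitrary pool size

    def fetch(url):
        # Connections to the same host are kept alive and reused.
        return pool.request("GET", url).data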

3) Use a task queue. My environment was Python, so I had the beautiful
RabbitMQ and Celery to work with. We never ended up having to scale, but the
system was built so that, with a distributed task queue, we could just add
more nodes to handle the fetching tasks if we needed to.
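
A minimal sketch of that setup (the broker URL and task body are placeholders, not our production code):

    # tasks.py -- minimal Celery sketch: each fetch becomes a task
    # on a RabbitMQ-backed queue, so scaling out is just starting
    # more workers on other nodes. Broker URL is a placeholder.
    import feedparser
    from celery import Celery

    app = Celery("tasks", broker="amqp://guest@localhost//")

    @app.task
    def fetch_feed(url):
        return len(feedparser.parse(url).entries)

    # Enqueue from anywhere:
    #   fetch_feed.delay("http://example.com/feed.xml")
    # Add capacity by running on each new node:
    #   celery -A tasks worker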

