
Ask YC: How to Build an RSS Aggregator? - ridertech
What do you recommend for reading and parsing 100s of RSS/Atom feeds on an hourly basis?

I'm able to write a custom script in PHP or preferably Rails, but wondering if there is a sweet app or tutorial that others have used and liked.
======
billturner
Look at Sam Ruby's Venus (which I've used):
<http://intertwingly.net/code/venus/> (Python)

Or, his Mars version (haven't used): <http://intertwingly.net/code/mars/>
(written in Ruby, and newer)

------
petercooper
Disclaimer: I built, ran (for two years), and sold a Web app that processed
tens of thousands of feeds each hour and distributed summaries based on those
feeds hundreds of millions of times per month.

That out of the way, the difficulty varies somewhat with the scale. At a
large scale, you run into all sorts of issues, including arbitrary blocks
from feed providers, database locking, and so on. If you're really just
doing "100s" on an "hourly" basis, hopefully you'll stay well under that
level, but if you think it'll need to scale up quickly, the decisions you
make now will need to be different from those for something that stays small.

I can't provide any code here, but just some quick pointers.

Our crawler (which is still running under the new owner) was entirely custom
and written in Ruby, and it performed very well. Instead of using a specific
feed parsing library, it uses Hpricot (the Ruby library) and a set of
custom-built rules for parsing RSS and Atom. The reason for this is that we
wanted speed, reliability (no shifting libraries), and it HAD to work with
invalid (and even extremely broken) feeds - many "proper" RSS and Atom
parsers have issues with busted feeds. Put it this way, though: Ruby is
definitely up to the task, as long as you rely on a parsing library (Hpricot,
in this case) and aren't just using regular expressions or something ;-)
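
For a flavour of that approach, here's a rough sketch (illustrative only,
not our production code; the feed URL is a placeholder):

    require 'rubygems'
    require 'hpricot'
    require 'open-uri'

    # Hpricot is forgiving, so even fairly broken feeds usually parse.
    doc = Hpricot.XML(open('http://example.com/feed.xml'))

    # Crude format detection: RSS uses <item>, Atom uses <entry>.
    items = (doc / 'item')
    items = (doc / 'entry') if items.empty?

    items.each do |item|
      title = item % 'title'
      puts title.inner_text if title
    end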

One nasty thing you'll need to deal with is knowing whether items in feeds are
new or not. You _could_ delete all items associated with a feed before
processing that feed each time.. but what if you want to keep an archive of
older items? What if you need to maintain database performance? How are you
going to track what's new, what was deleted, etc?

I used a hash that was _either_ based on each item's GUID and the feed's ID
OR (if no GUID was present) the item's link and title. Unfortunately this was
not foolproof. If someone changed the description of an item, the change
wouldn't get picked up! And not all feeds use GUIDs - and of the feeds that
do, some have GUIDs that change when descriptions change and some don't :)
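
A sketch of that keying scheme (the item fields here are hypothetical, and
note the caveat above about edited descriptions hashing the same):

    require 'digest/sha1'

    # Dedup key: GUID + feed ID when a GUID exists, otherwise fall back
    # to link + title. An edited description yields the same key, so
    # those changes are missed.
    def item_key(feed_id, item)
      basis = if item[:guid] && !item[:guid].empty?
                item[:guid]
              else
                "#{item[:link]}#{item[:title]}"
              end
      Digest::SHA1.hexdigest("#{feed_id}:#{basis}")
    end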

Feed formats are really, really dirty, despite being officially specified.
All sorts of nasty publishing systems mangle the formats and you need to be
able to deal with it. These are issues that go far beyond choosing a feed
parsing library - it's about the organization of items. You need to do a lot
of sanitizing to be 100% effective. You'll find feeds that use wholly
inappropriate date/time formats, and the content provider will not care. You
need to be able to deal with that. Oh, and watch out for feeds with wacky
dates way into the future.. which can then end up "stuck" at the top of your
list of items if you're ordering by date ;-)
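
For example, one defensive way to handle those dates (a sketch that assumes
falling back to the fetch time is acceptable for your app):

    require 'time'

    # Parse whatever the feed claims; fall back to "now" on garbage, and
    # clamp future dates so they can't stick to the top of the list.
    def safe_published_at(raw, now = Time.now)
      time = Time.parse(raw.to_s) rescue now
      time > now ? now : time
    end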

This all just scratches the surface of how tricky it is. I was doing it full
time for over two years and even now I feel I've only seen half the picture.
You either strive for 100% effectiveness in processing and parsing feeds and
drive yourself nuts - or settle for 90% and sleep at night ;-)

------
ridertech
Sorry, I'm looking to build an app that aggregates feeds, not a normal
"consumer app".

------
nreece
SimplePie - <http://simplepie.org>

~~~
FiReaNG3L
I use this (as a Drupal module) for <http://esciencenews.com>

Don't reinvent the wheel; you don't want to custom-code this yourself.

~~~
ridertech
I've used Drupal's aggregator module in the past, and it often has issues
with publish dates and/or with detecting whether a link already exists.

------
ridertech
Thanks Peter! I was looking into FeedTools...
<http://sporkmonger.com/2008/2/1/feedtools-0-2-27>

But I'll probably just build something custom w/ Ruby. I was hoping someone
else had already done the work and open-sourced it ;)

~~~
petercooper
Unfortunately I'm unable to share the code, as I sold the intellectual
property.

However, I can share that it used Hpricot, and had a number of XPath rules
for each discrete element of both the general feed and its "items" that
needed to be extracted (title, link, description, time, etc). The rules for
each element were kept in an array, so rules for Atom and RSS could be
mixed.. the first to match dictated the format. This is a pretty quick and
dirty (but ever so effective) way of doing it - parsing feeds as XML in the
"technically correct" way is an absolute nightmare given the poor validity
of the XML out there ;-)
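
A rough reconstruction of that rule-array idea (the selectors are
illustrative, not the originals):

    require 'rubygems'
    require 'hpricot'

    # Ordered rules per element, Atom and RSS mixed; the first rule to
    # match wins, and which one matched tells you the format.
    ITEM_RULES = ['//entry', '//item']   # Atom first, then RSS

    def first_match(doc, rules)
      rules.each do |rule|
        found = doc / rule
        return found unless found.empty?
      end
      []
    end

    doc   = Hpricot.XML(File.read('feed.xml'))
    items = first_match(doc, ITEM_RULES)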

All that said, one thing worth looking at is
<http://rfeedparser.rubyforge.org/> - it's based on the Python Universal Feed
Parser, which is generally considered to be the most awesome of feed parsers
out there :)

~~~
ridertech
rfeedparser looks perfect, but I'm in dependency hell...

Gem::Exception (can't activate hpricot (= 0.6), already activated
hpricot-0.6.164])

~~~
petercooper
Either force-install Hpricot 0.6 (your usual stuff will still use the newer
one; rfeedparser will use 0.6) or go mangle rfeedparser to remove the lock on
that version (of course, that removes any guarantee of it working 100%).
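
Roughly, the force-install route (assuming both versions end up installed
side by side):

    # One-off at the shell: gem install hpricot --version 0.6
    require 'rubygems'
    gem 'hpricot', '= 0.6'   # activate exactly 0.6 before rfeedparser loads
    require 'rfeedparser'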

BTW, today is an awesome day to write a new feed parser. Here's why:

<http://www.rubyinside.com/nokogiri-ruby-html-parser-and-xml-parser-1288.html>

------
eventhough
Zend Framework has an RSS reader.
<http://framework.zend.com/manual/en/zend.feed.html>

------
qhoxie
I use Google Reader with great success, but there are tons of options out
there, both desktop and web.

------
lpgauth
Yahoo Pipes!

------
TweedHeads
Magpie

<http://magpierss.sourceforge.net>

~~~
FiReaNG3L
This has been unsupported for years.

