
Full Text RSS Feed: Get the whole feed and nothing but the feed - timrosenblatt
http://fulltextrssfeed.com/
======
nicpottier
We built a backend similar to this for our NewsRoom mobile client. (Android
and Pre) Actually used some genetic algorithms to do the training for our
content extraction, one of the more fun projects I've done.

Word of warning, if it takes off, you basically start turning into someone who
is both caching and harvesting the web every 15 minutes. There is an
incredibly long tail on RSS feeds and it starts killing you to keep them all
up to date. Storing and serving it is no big deal, but harvesting actually
turns into real money when you figure out total bandwidth used. (we harvest
about 30,000 feeds every 15 minutes)
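
To get a feel for why harvesting costs real money, here is a rough back-of-envelope sketch. The feed count and interval come from the comment above; the average feed size of 50 KB is purely an assumption for illustration, and the estimate covers only the feed fetches themselves, not the article pages fetched for full-text extraction.

```python
def monthly_harvest_gb(feeds, interval_minutes, avg_feed_kb, days=30):
    """Estimate bandwidth (GB) used by polling every feed on a fixed interval.

    avg_feed_kb is a hypothetical average payload size, not a measured value.
    """
    polls_per_day = (24 * 60) // interval_minutes
    fetches = feeds * polls_per_day * days
    return fetches * avg_feed_kb / (1024 * 1024)


# 30,000 feeds polled every 15 minutes is 96 polls per feed per day.
fetches_per_day = 30_000 * ((24 * 60) // 15)
print(fetches_per_day)  # 2880000

# Assumed 50 KB per feed; swap in a measured average for a real estimate.
print(round(monthly_harvest_gb(30_000, 15, avg_feed_kb=50), 1), "GB/month")
```

Skipping feeds that have not changed since the last poll would shrink this number, but whether that applies depends on what each feed's server supports.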

~~~
lurchpop
how do you handle scale like that? Do you have a hadoop cluster or something?
How many concurrent downloads do you run?

~~~
nicpottier
Amazon EC2, MongoDB, S3. The EC2 instances scale with how many stale feeds we
have, but it is usually less than 2.

Just checked and we have ~25k feeds in the system, though not all are deep
harvesting as we call it.

Note we do a few things over just extracting the full content as well, we also
try to grab out images and create a pleasing thumbnail using face detection
etc. So that probably slows things down a good deal as well.

------
geuis
Could you darken up on the grey a bit? Grey on white on grey isn't exactly
easy to read.

~~~
jonkelly
Could be a way to slow down the feed owners' lawyers a bit? ;-)

~~~
timrosenblatt
lol.

What's the difference between a Cease & Desist letter from a big company, and
free advertising for a startup?

------
grayrest
What's he using to pull out the articles? I had a hacky version set up using
the Readability algorithm but never bothered to make it public.

~~~
yesimahuman
Boilerpipe is by far the best tool for this that I've ever found
(<http://code.google.com/p/boilerpipe/>). I'd be interested to hear if he is
using something better, but I'd be surprised if he is.

I think this is a great idea and very similar to a lot of stuff I have worked
on recently. It's cool to see so much interest in these text-related services.

~~~
benmccann
I don't think it's quite as good as what he's doing though. He has the title
and date specifically pulled out and he doesn't have any extra text included.
I think he manually handles CNN. If I try a HuffPost feed it doesn't work at
all.

~~~
yesimahuman
Yea, I'd be curious to see exactly what he's doing. I can only guess there is
a heuristic that doesn't generalize well, which would explain the many failed
feeds people are noticing on here (I know it's just a weekend project :)).
Boilerpipe, in my experience, works very well on almost all news/blog type
content. Finding the date in the first few sentences and the title are extra
heuristics that can be added later.

EDIT: The date and title are in the RSS feed already! No further analysis
needed.
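
For illustration, reading the title and date straight out of the feed could look roughly like the sketch below. It assumes a plain RSS 2.0 feed with `title`, `link` and `pubDate` elements on each item; the sample feed and the `item_metadata` helper are made up for this example, and Atom feeds would need different element names.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Hypothetical sample feed, not taken from any real site.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First post</title>
      <link>http://www.example.com/first</link>
      <pubDate>Mon, 06 Sep 2010 16:20:00 +0000</pubDate>
      <description>Teaser only...</description>
    </item>
  </channel>
</rss>"""


def item_metadata(rss_text):
    """Return title, link and parsed publication date for each RSS item."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        pub = item.findtext("pubDate")
        items.append({
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            # RSS 2.0 dates use the RFC 822 format, which email.utils parses.
            "published": parsedate_to_datetime(pub) if pub else None,
        })
    return items


for entry in item_metadata(SAMPLE_RSS):
    print(entry["title"], entry["published"].isoformat())
```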

------
spidaman
Does this work for anybody? I've plugged in 3 feeds, one was "unable to
retrieve full-text content" (an sfgate.com feed) and the other two returned
nothing at all in the preview (one a feed from kqed.org, the other an older
wordpress blog).

~~~
guptaneil
The preview for Lifehacker returned nothing at all, but adding the feed to
Google Reader worked as advertised. I guess, don't rely on the preview box.

~~~
matsur
This may be common knowledge, but all gawker blogs are available in full feed,
ad free form at <gawker entity>.com/vip.xml

i.e. <http://lifehacker.com/vip.xml>

~~~
yahelc
Ah, I see I'm not the first to point this out :)

------
clvv
A similar service: <http://fivefilters.org/content-only/> and it is open source
too. It uses a PHP version of readability to extract the full content. Also
can the author of fulltextrssfeed.com explain some of the implementation
details? I was planning on a similar project with node.js, jsdom and
readability.

------
hokkos
Is it legal? Can you legally copy all the content of a site and publish it
while stripping the ads?

I've thought about this idea for 2 years, but I am so ineffective at building my
own ideas that it doesn't surprise me that someone else built it, as the idea
was really floating more and more since instapaper mobilizer.

On the legal aspect, I had some more ideas: hide behind the DMCA takedown
process, and provide an email address for taking down a feed. But do not map
www.example.com/feed.xml to
<http://fulltextrssfeed.com/www.example.com/feed.xml>; use an alias instead, so
a takedown removes just the alias, not the whole *.example.com.
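
The alias idea could be sketched as a small lookup table, roughly as below. This is only an illustration of the mapping described above, not how fulltextrssfeed.com works; the `AliasRegistry` class and its method names are hypothetical.

```python
import secrets


class AliasRegistry:
    """Hypothetical in-memory mapping between opaque aliases and source feeds."""

    def __init__(self):
        self._by_alias = {}
        self._by_source = {}

    def register(self, source_url):
        """Return the existing alias for a feed, or create a new random one."""
        if source_url in self._by_source:
            return self._by_source[source_url]
        alias = secrets.token_urlsafe(6)
        self._by_alias[alias] = source_url
        self._by_source[source_url] = alias
        return alias

    def resolve(self, alias):
        """Look up the source feed for an alias, or None if unknown."""
        return self._by_alias.get(alias)

    def take_down(self, alias):
        """Remove a single alias; other feeds on the same domain are untouched."""
        source = self._by_alias.pop(alias, None)
        if source is not None:
            del self._by_source[source]
        return source is not None


registry = AliasRegistry()
feed_alias = registry.register("www.example.com/feed.xml")
other_alias = registry.register("www.example.com/other.xml")
registry.take_down(feed_alias)
print(registry.resolve(feed_alias), registry.resolve(other_alias))
```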

------
swombat
Immediate swap of current PG essays feed for:

[http://fulltextrssfeed.com/www.aaronsw.com/2002/feeds/pgessa...](http://fulltextrssfeed.com/www.aaronsw.com/2002/feeds/pgessays.rss)

------
yahelc
Considering the impending lawyer-takedown, it would be great if this was made
open source, so people can implement their own local versions on their own
servers.

------
yagibear
Could you also do the opposite: Take bulky feeds (e.g.
<http://feeds.feedburner.com/tedblog>) and truncate them; showing title &
first para & include a link? I use RSS primarily to scan what is available and
mark some for later reading, and bulky feeds interrupt the scanning process.
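
A minimal sketch of that kind of truncation, assuming an RSS 2.0 feed whose item `description` holds escaped HTML. The sample feed and helper names are made up for illustration; feeds that carry the body in `content:encoded` are not handled here, and the regex-based tag stripping is a rough shortcut rather than a real HTML parser.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample feed with a two-paragraph item body.
SAMPLE_FEED = """<rss version="2.0"><channel><item>
<title>Long post</title>
<link>http://www.example.com/long</link>
<description>&lt;p&gt;First &lt;b&gt;para&lt;/b&gt;.&lt;/p&gt;&lt;p&gt;Second para.&lt;/p&gt;</description>
</item></channel></rss>"""


def first_paragraph(html):
    """Return the plain text of the first <p> element, or of the whole input."""
    match = re.search(r"<p\b[^>]*>(.*?)</p>", html, flags=re.IGNORECASE | re.DOTALL)
    chunk = match.group(1) if match else html
    text = re.sub(r"<[^>]+>", "", chunk)  # crude tag removal
    return " ".join(text.split())


def truncate_feed(rss_text):
    """Rewrite each item description to its first paragraph plus a link."""
    root = ET.fromstring(rss_text)
    for item in root.iter("item"):
        desc = item.find("description")
        if desc is not None and desc.text:
            link = item.findtext("link") or ""
            desc.text = f'{first_paragraph(desc.text)} <a href="{link}">Read more</a>'
    return ET.tostring(root, encoding="unicode")


print(truncate_feed(SAMPLE_FEED))
```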

~~~
baddox
This would probably be trivial with Yahoo! Pipes.

<http://pipes.yahoo.com/pipes/>

~~~
timrosenblatt
there's a commenter further down the page by the handle of "Roll" that says
Pipes didn't work for him.

And yeah, it was built in a weekend :) But now you don't have to.

------
cvandyck76
Is there an argument to be made that the content providers only get 'paid' if
the RSS reader is enticed to click through to the site? I'm all for neat
services, but I think that this is a little bit unfair to the other party.

------
tuhin
Not trying to be the show stopper here, but this is illegal right? I mean
especially news sites like Reuters do create a fuss when this is done. Is that
(legal drama) only in commercial projects or otherwise too?

------
netmau5
Nice, this will come in very useful for an RSS-based project I'm working on
too. Hopefully I won't slam your servers too hard. Are you considering making
the source available?

~~~
aaroneous
I wasn't expecting much interest in it, but I'd be happy to clean it and
package it up if you guys want to play.

~~~
shadowpwner
Yes please.

------
pak
This is nice but what's the difference from ViewText
(<http://www.viewtext.org>)? ViewText has a JSONP API, which made it perfect
for building into a recent little project I did (it was a web app). Plus, it's
been around for a lot longer.

~~~
ericgs
Just tried viewtext on lifehacker's feed and got:

"We understand you'd like to delete your account. If you delete your account
all of your information including your comments, messages, posts, and friends
and followers associations will be removed from our system. Please consider
the following options before clicking delete."

Yikes! =X

~~~
ianvanness
Try the full feeds that Lifehacker (et al) already offer:
<http://lifehacker.com/vip.xml> (the article about it came up in a Google
search - <http://lifehacker.com/5489210/>)!

(also, that needs to be fixed asap, lest anyone get the wrong idea)

------
AdamGibbins
Excellent thanks, shall be applying this to all my Gawker feeds.

~~~
yahelc
You know they make a full-feed version available of all their sites, right?
It's just of the form <http://lifehacker.com/vip.xml>

------
gnosis
##sigh##

Yet another service that requires me to hand over information on what I read.

Why couldn't this be made as a privacy-respecting application I can run from
my own machine?

------
roll
interesting project. I was doing a similar thing with yahoo pipes, but it got
blocked because of robots.txt. What do you do about it?
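
For reference, a fetcher that wants to honor robots.txt can check it with Python's standard library, roughly as sketched below. The robots.txt content, the `ExampleFeedBot` user agent and the `allowed_by_robots` helper are all made up for this example; nothing here reflects what Yahoo Pipes or fulltextrssfeed.com actually do.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body, inlined so the example needs no network access.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""


def allowed_by_robots(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(allowed_by_robots(SAMPLE_ROBOTS, "ExampleFeedBot", "http://www.example.com/feed.xml"))
print(allowed_by_robots(SAMPLE_ROBOTS, "ExampleFeedBot", "http://www.example.com/private/feed.xml"))
```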

~~~
shadowpwner
You can always disregard robots.txt.. ;)

------
palak55
The same service is offered by www.getrss.in, and I have been a happy client of
theirs for more than 8 months

~~~
palak55
<http://www.getrss.in>

------
timrosenblatt
Nice. Grabs the whole article text so you don't have to leave the RSS reader.

~~~
dholowiski
The content thieves will love it too. This makes it much easier to
automatically copy content.

~~~
getsat
Not really. I can Right Click -> Copy XPath in Firebug's element inspector
then just Nokogiri::HTML(page_source).xpath('/blah') to get at it. You can do
it with the CSS selector as well. :)

Setting up a quick script to rip all the content from another site is trivial.
There's also wget -m
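
For readers without Ruby at hand, a rough Python analogue of the XPath approach is sketched below. It uses the standard library's ElementTree, which supports only a subset of XPath and requires well-formed markup, so it is a stand-in for Nokogiri rather than an equivalent; the sample page and the `extract_text` helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed page; real-world HTML usually needs a lenient parser.
PAGE = """<html><body>
<div id="nav">Home | About</div>
<div id="article"><h1>Title</h1><p>First paragraph.</p><p>Second paragraph.</p></div>
</body></html>"""


def extract_text(page_source, xpath):
    """Return the whitespace-normalized text under the first node matching xpath."""
    root = ET.fromstring(page_source)
    node = root.find(xpath)
    if node is None:
        return None
    return " ".join(" ".join(node.itertext()).split())


print(extract_text(PAGE, ".//div[@id='article']"))
```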

~~~
geoffw8
I literally cannot get nokogiri set up on my Mac for love nor money, I'm a
noob who's been trying for a week or two. Tried everything. It's preventing me
from running tests. Damn libxml2.

~~~
getsat
Install MacPorts: <http://www.macports.org>

Add /opt/local/bin to your PATH (bashrc or zshrc or whatever you use).

 _sudo port install libxml2_ and _sudo port install libxslt_

Then _sudo gem install nokogiri --no-rdoc --no-ri_ should run with no issues.
That's all I had to do for the system ruby (1.8.7 on OSX 10.6) and 1.9.2 via
rvm.

~~~
geoffw8
So it turned out it was webrat that was the problem, but your instructions
actually fixed the problem! I'd researched for hours previously. Genuinely
much appreciated!

~~~
getsat
Glad I could be of assistance!

------
adrianwaj
Is there a time delay between the source feed and the full feed?

------
austintaylor
Works great! Nice to have: concatenation of multi-page articles.

------
sankara
Works with BBC. Thanks.

------
mariuskempe
Thank you so much! :-)

~~~
timrosenblatt
you're welcome!

~~~
andy_mason
Just wanted to second the thanks. I read a lot more of authors' content now.

------
irfn
nice! how exactly does this generate the full text version?

------
iphoneedbot
I love it! It works for tumblr rss -- I really wish, though, that you could
open-source it. (Well, I would just hate it if you started having hosting
problems or other problems that forced you to shut down)

I'm currently using "Readable Feeds" by Nirmal J. Patel
(<http://www.nirmalpatel.com/hacks/hnrss.html>) and Andrew Trusty
(<http://andrewtrusty.com/2009/06/29/readable-feeds/>)

I like it -- but it's really inconsistent!

Cheers,

