
Ever wanted arc90′s Readability as an API? - Swizec
http://preona.net/2010/11/ever-wanted-arc90s-readability-as-an-api/
======
martinkallstrom
Readability has been ported to Python and PHP:
[http://blog.arc90.com/2009/06/20/readability-now-available-i...](http://blog.arc90.com/2009/06/20/readability-now-available-in-three-delicious-flavors/)
There is also a C# port here:
<http://code.google.com/p/nreadability/>

~~~
Swizec
There is also a Ruby port here:
<https://github.com/iterationlabs/ruby-readability>

The beauty of our approach is that we didn't have to port it and we can adapt
much quicker when new versions come out.

~~~
lamnk
Great, I searched for a Ruby port of Readability on GitHub, but the closest I
could find was Pismo: <https://github.com/peterc/pismo>

~~~
petercooper
Pismo used to use ruby-readability (linked above) but I ended up writing my
own system. It works similarly to Readability but is better on certain types
of poorly formatted content (but worse on others, so YMMV). Pismo is more a
general purpose content extraction library than Readability and better suited
for machine processing and summarization (which is what I use it for).

Pismo also comes with a command line client built-in, so you can do stuff like
this:

    
    
      $ pismo http://preona.net/2010/11/ever-wanted-arc90s-readability-as-an-api/ title sentences
    
      :title: "Ever wanted arc90's Readability as an API?"
      :sentences: Over at Preona we have been wanting something just like that for a while now. So we built it! Some time ago, while developing LazyReadr, we were faced with the fact that RSS feeds simply aren't all that lovely anymore.
    

Note: "sentences" picks the first few sentences by default, but this is ideal
for a summary by an automated system or for a news page :-)
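
As a rough illustration of the "first few sentences" idea, here is a minimal
Python sketch. It is not Pismo's actual implementation: the `first_sentences`
helper and its naive regex splitter are assumptions made up for this example.

```python
import re


def first_sentences(text, count=3):
    """Return the first `count` sentences of `text` as one string.

    Hypothetical helper, not part of Pismo. The splitter is naive: it
    breaks after '.', '!' or '?' followed by whitespace, so abbreviations
    such as "e.g. this" will be split too early.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(sentences[:count])


if __name__ == "__main__":
    sample = "So we built it! It was fun. It was also slow. We fixed that."
    print(first_sentences(sample, count=2))
```

A real extractor would need a proper sentence tokenizer, but for news-style
copy the leading sentences usually carry the gist.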

~~~
Swizec
You should check out topicmarks. It does summaries in a very smart way from
what I've seen. We'll likely start using it to do automated summaries for RSS
content in our iPad app.

~~~
petercooper
Sadly, though, it "takes minutes" (and they even seem to make a big point of
that..) It might be useful for slightly better summaries though I've had great
luck with going with the first paragraph of an article so far (or certain
other metadata if it scores better, like <meta> description).
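
A minimal Python sketch of that fallback, using only the standard library.
Everything named here is an assumption for illustration: `SummaryCandidates`,
`pick_summary`, and especially `score`, which is a placeholder word-count
heuristic and not the scoring Pismo uses.

```python
from html.parser import HTMLParser


class SummaryCandidates(HTMLParser):
    """Collect the first non-empty <p> text and the <meta> description."""

    def __init__(self):
        super().__init__()
        self.meta_description = ""
        self.first_paragraph = ""
        self._in_p = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.meta_description = (attrs.get("content") or "").strip()
        elif tag == "p":
            # Keep collecting only until one non-empty paragraph is captured;
            # a later <p> also implicitly ends an unclosed one.
            self._in_p = not self.first_paragraph.strip()

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self.first_paragraph += data


def score(text):
    # Placeholder heuristic (an assumption): more words is better, up to 40.
    return min(len(text.split()), 40)


def pick_summary(html):
    """Return whichever candidate scores higher; ties favour the paragraph."""
    parser = SummaryCandidates()
    parser.feed(html)
    parser.close()
    candidates = [parser.first_paragraph.strip(), parser.meta_description]
    # max() returns the first maximal item, so the paragraph wins a tie.
    return max(candidates, key=score)
```

Swapping in a better `score` (for example, penalising boilerplate phrases)
changes the choice without touching the parsing.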

------
scraplab
I wrote something similar a while ago as a bit of an experiment - again, it
uses Readability and jsdom.

It's slow because there's a lot of unoptimised code in jsdom which makes
certain DOM manipulations pretty slow. It's early days for this stuff.

So in lieu of them open sourcing their code, which is almost certainly more
robust, here it is:

<https://github.com/tomtaylor/thelma>

~~~
Swizec
Having the benefit of hacking around jsdom for a while to fix certain bugs, I
could tweak Readability to work around the slowness.

Also node-htmlparser now contains a patch of mine so it's a bit more robust.

I think you'll find your code might be a bit more robust than you left it, in
part due to my wanting to make our API robust :)

------
waterside81
We, too, have ported Readability to a web based API, however ours is
synchronous. It's lightning fast under the hood (thank you Cython) and you can
hit it as much as you like, as quickly as you like. We currently process about
5 million calls to clean-html.json a month, which is odd, because we added it
to our set of API calls on a whim. It's turned out to be our most popular API
call, by a large margin.

We're also going to be issuing an update soon to include multi-page reading.
So you can grab those long, paginated New Yorker articles all in one API call.

<http://www.repustate.com>

~~~
scraplab
Thanks - that looks quite interesting. Do you offer anything that cleans up
the HTML but preserves formatting and images in the article?

~~~
waterside81
Internally, yes. Give me a day or two and I'll add a flag to let you preserve
some of the formatting.

------
Swizec
UPDATE at midnight: the view-scraped-page part of the architecture was meant
as a helper function. Being frontpaged on HackerNews obviously crashed it. The
two examples have been moved to static files served by nginx. Sorry for any
inconvenience.

If anyone didn't manage to view the two examples, please go check them again
:)

--> <http://plateboiler.lazyreadr.com/static/example1.html> and
<http://plateboiler.lazyreadr.com/static/example2.html>

------
yesimahuman
Boilerpipe has been amazing for me as a Java library:
<http://code.google.com/p/boilerpipe/>

------
nreece
And there's ViewText as well - <http://viewtext.org/help/api>

~~~
ronnier
Thanks, I created ViewText. It also populates RSS feeds to contain the full
content and extracts text from PDFs.

------
pak
You guys are awesome. I literally just embarked on building an RSS reader that
fits my personal taste and was wondering to myself how to get page content
filtered through Readability. I was envisioning weird hacks, but this is
perfect.

~~~
Swizec
I have two suggestions about making an RSS reader:

1\. Help us instead :)

2\. Use Superfeedr, it will make your life easier.

Also thanks for calling us awesome, we're really not that awesome, just have too
much time on our hands :P

------
antimatter15
I had a similar idea two months ago, but instead of being a web api (with all
the scalability and privacy issues that ensue), I made it a browser extension
where people could add parameters to embedded iframes in order to display them
nicely formatted.

[https://chrome.google.com/extensions/detail/nahmdndkmncjhppb...](https://chrome.google.com/extensions/detail/nahmdndkmncjhppbaomnecihdmijgmne)

------
ludwigvan
Does Safari really use Readability? Any sources on that?

~~~
Swizec
[http://www.downloadsquad.com/2010/06/08/think-safari-reader-...](http://www.downloadsquad.com/2010/06/08/think-safari-reader-looks-familiar-thats-because-apple-used-op/)

First hit on Google for "Safari Readability"

