

Instapaper-like article extractor demo now online - beagledude
http://jimplush.com/blog/goose

======
petercooper
Nice! I worked on something similar a year ago but for the Ruby world. If
you're on a Unix with Ruby installed (e.g. OS X), you can mostly repeat the
linked demo like so:

    gem install pismo
    pismo http://techcrunch.com/2011/06/09/twitter-ios/ title lede author datetime body

And then enjoy the output. It doesn't pick out an image, but the first IMG tag
in the 'html_body' would do. I just never got around to implementing it, as I
didn't need that feature.
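
For anyone who wants the image anyway, picking the first IMG tag out of the
extracted HTML takes only a few lines. A minimal sketch in Python using only
the standard library; `first_image_src` is a hypothetical helper, not part of
pismo, and the sample string merely stands in for whatever 'html_body'
contains:

```python
from html.parser import HTMLParser


class _FirstImageFinder(HTMLParser):
    """Remembers the src of the first <img> tag that has one."""

    def __init__(self):
        super().__init__()
        self.src = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names,
        # so <IMG SRC=...> and <img src=...> both match.
        if tag == "img" and self.src is None:
            self.src = dict(attrs).get("src")


def first_image_src(html_body):
    """Return the src of the first <img> with a src, or None if there is none."""
    finder = _FirstImageFinder()
    finder.feed(html_body)
    return finder.src


# Made-up sample standing in for an extracted article body.
sample = '<p>Intro</p><IMG SRC="a.png"><img src="b.png">'
print(first_image_src(sample))  # prints a.png
```

The first image is only a guess at the lead image; it could just as easily be
an icon or a tracking pixel, so treat the result as a starting point.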

The downside is I haven't worked on it for months and it's in sore need of
improvements. For its current in-production use though, it's proving
sufficient and a reasonable option for Rubyists. More info at
<https://github.com/peterc/pismo>

Not knocking Jim's work on Goose, btw. He's actively working on it, so if Java
works out for you, stick with him! :)

------
shazow
One thing I've always wanted is something to extract multi-page forum threads
and render them in a normalized readable way. For example, Reddit comment
threads like IAMAs.

Anyone know of a service or library that does that?

~~~
btucker
I actually started playing with the concept one weekend for IAmAs
specifically. I was trying to do it all client-side, and the issue I kept
running up against was that reddit's JSONP responses get VERY slow on large
threads.
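
Once the JSON has arrived, the normalizing step itself is small. A rough
sketch in Python, assuming the nested shape reddit's comment listings use
(each comment under `data`, with `replies` either an empty string or another
listing). `flatten_thread`, `render`, and the sample data are made up for
illustration, and fetching or paging is not covered:

```python
def flatten_thread(listing, depth=0):
    """Yield (depth, author, body) for every comment in a nested listing."""
    for child in listing["data"]["children"]:
        if child.get("kind") != "t1":  # skip non-comment entries such as "more" stubs
            continue
        data = child["data"]
        yield depth, data["author"], data["body"]
        replies = data.get("replies")
        if replies:  # assumed to be "" when a comment has no replies
            yield from flatten_thread(replies, depth + 1)


def render(listing):
    """Render the thread as plain text, indenting each reply under its parent."""
    return "\n".join(
        f"{'  ' * depth}{author}: {body}"
        for depth, author, body in flatten_thread(listing)
    )


# Made-up sample in the assumed shape: one comment, one reply, one "more" stub.
sample = {
    "kind": "Listing",
    "data": {"children": [
        {"kind": "t1", "data": {
            "author": "alice",
            "body": "Question?",
            "replies": {"kind": "Listing", "data": {"children": [
                {"kind": "t1", "data": {"author": "bob", "body": "Answer.", "replies": ""}},
            ]}},
        }},
        {"kind": "more", "data": {"children": ["abc"]}},
    ]},
}

print(render(sample))
```

The slow part described above is the fetch, not this traversal, so a sketch
like this does nothing for the large-thread problem by itself.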

------
metaprinter
It didn't extract images for the sites I tried, all WordPress blogs. EDIT: I
just realized why. Those blogs are pulling images from Flickr.

------
StavrosK
How is this different from the myriad of other text extraction services, APIs
and libraries out there?

~~~
beagledude
1. it's open sourced
2. it's embeddable
3. it extracts images
4. it's one of the most accurate
   (<http://www.readwriteweb.com/hack/2011/06/head-to-head-comparison-of-tex.php>)
5. it's named after the best Top Gun character

~~~
HardyLeung
I agree. It works quite well! If the output could be formatted to look like
Readability's default (which is really quite pleasant to read), that would be
nice.

------
ltamake
Cool! I really like how accurate it is; works with almost every site I try. :)

------
chopsueyar
Nice work!

------
chrismealy
Everything I try gives a 404.

