
Instapaper-like Article Extractor open sourced - beagledude
https://github.com/jiminoc/goose/wiki
======
petercooper
I created something similar in Ruby: <https://github.com/peterc/pismo>

~~~
beagledude
Isn't the image extraction based on the goose algo?

------
yesimahuman
Looks interesting. There is another good project that does this called
Boilerpipe: <http://code.google.com/p/boilerpipe/>

I recently developed a web service that performs these sorts of operations
called Linguini (<http://linguini.me>). The service hasn't been officially
launched but it's in beta and usable. It can extract html/blog text through
JSON web service calls and can tag names/companies/locations in that text.

~~~
bravura
Boilerpipe is great. Also see the diffbot article API which is quite good:
<http://www.diffbot.com/docs/api/article>

There is some deeper discussion of text extraction tools on Tomaz Kovacic's
blog:
[http://tomazkovacic.com/blog/56/list-of-resources-article-te...](http://tomazkovacic.com/blog/56/list-of-resources-article-text-extraction-from-html-documents/)

One advantage of Goose (the linked project) is that it also tries to extract
the best _image_ from the webpage. None of the competitors do this. I am not
sure how good Goose's _text_ extraction is compared to boilerpipe or diffbot.

A new tool called justext just popped up on my radar:
<http://code.google.com/p/justext/> It was written by the BiWeC web-as-corpus
people, who crawl the web and expose it through the SketchEngine for NLP
research.

------
dageshi
Stupid question: what's the legality, or what are the "rules", on extracting
a pic + chunk of text from random websites?

Are there any rules? I presume there must be.

~~~
TillE
Well, we're talking about copyright. And so long as you're not re-publishing
anything, you're definitely fine.

Instapaper arguably is republishing, though the copies are private to each
user. I suspect they could still be in trouble if sued, though.

~~~
vilpponen
This is one of the reasons Readability links you back to the original article
when you share a document inside the service. It keeps a small overlay at the
top which you can use to make the page more readable again.

Slightly off topic: I've been playing around with an idea for a service that
would be able to share content this way - would make reading a lot more fun
again.

------
tomjen3
Super nice. I was playing around with the idea of running an article extractor
and piping the output through festival (with a little clean-up the output is
surprisingly good, and certainly a lot better than it used to be a few years
ago) so that I could listen to blog posts while I was working.
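
A minimal sketch of the kind of clean-up step mentioned above, assuming the
extractor has already produced plain text. The function name
`clean_for_speech` and the specific rules (drop URLs, drop bracketed footnote
markers, collapse whitespace) are illustrative assumptions, not the actual
clean-up tomjen3 used.

```python
import re

URL_RE = re.compile(r"https?://\S+")
FOOTNOTE_RE = re.compile(r"\[\d+\]")
WHITESPACE_RE = re.compile(r"\s+")


def clean_for_speech(text: str) -> str:
    """Strip things a speech synthesizer tends to read badly (hypothetical rules)."""
    text = URL_RE.sub("", text)          # URLs would be spelled out piece by piece
    text = FOOTNOTE_RE.sub("", text)     # footnote markers such as [1]
    text = WHITESPACE_RE.sub(" ", text)  # collapse newlines and repeated spaces
    return text.strip()
```

The cleaned text could then be saved to a file and read aloud with something
like `festival --tts article.txt` (the file name is just an example).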

------
johnwatson11218
Is anyone else interested in seeing something like this used with zap reader?
Zap reader takes articles and renders them a word at a time in the same spot
on the screen. You can adjust the words per minute; I can understand stuff in
the 300 to 500 wpm range. I think their UI is not very good, and it seems to
just read through the whole page, including navigation and header/footer.
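
The word-at-a-time idea can be sketched roughly as follows, assuming the text
has already been run through an article extractor so that navigation and
header/footer are gone. `word_schedule` is a hypothetical helper, not anything
from zap reader itself; it only computes how long each word would stay on
screen at a given words-per-minute setting.

```python
from typing import Iterator, Tuple


def word_schedule(text: str, wpm: int) -> Iterator[Tuple[str, float]]:
    """Yield (word, seconds_to_display) pairs for one-word-at-a-time reading."""
    if wpm <= 0:
        raise ValueError("wpm must be positive")
    delay = 60.0 / wpm  # seconds each word stays on screen
    for word in text.split():
        yield word, delay
```

At 300 wpm each word is shown for 0.2 seconds; at 500 wpm, 0.12 seconds. A
real reader would probably want longer pauses on punctuation, which this
sketch leaves out.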

------
Immortal
There's also a C# port of the Readability algorithm:
<https://github.com/marek-stoj/NReadability> which is being used by
InstaFetch - an Instapaper client for Android
(<http://instafetch.immortal.pl/>).

------
beagledude
list of all the unit tests with the extraction results:
[https://github.com/jiminoc/goose/blob/master/src/test/java/c...](https://github.com/jiminoc/goose/blob/master/src/test/java/com/jimplush/goose/GoldSitesTest.java)

~~~
riffraff
accessing the URLs in the unit tests seems like a recipe for a slow build. Why
not put the test inputs in the repo?

~~~
beagledude
it's on the list of things to do. However, one of the challenges is that those
sites are always changing or updating their templates, so it's nice to have
the most up-to-date links to hit against.

Goose also pulls all the images down as well to inspect them to try and find
the most likely main image for the page.
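
As a rough illustration of what "inspecting" downloaded images might involve,
here is a sketch under loudly stated assumptions: it is not Goose's actual
algorithm. The `Image` type, `pick_main_image`, and the thresholds
(`min_side`, `max_ratio`) are all hypothetical. The idea is simply to discard
tiny icons and banner-shaped images, then prefer the largest remaining one.

```python
from typing import Iterable, NamedTuple, Optional


class Image(NamedTuple):
    url: str
    width: int
    height: int


def pick_main_image(images: Iterable[Image],
                    min_side: int = 100,
                    max_ratio: float = 3.0) -> Optional[Image]:
    """Return the largest image that is neither tiny nor banner-shaped."""
    candidates = []
    for img in images:
        short, long_ = sorted((img.width, img.height))
        if short < min_side or short <= 0:
            continue  # icons, spacers, tracking pixels
        if long_ / short > max_ratio:
            continue  # very wide or very tall: likely an ad banner
        candidates.append(img)
    # Largest pixel area wins; None when nothing qualifies
    return max(candidates, key=lambda i: i.width * i.height, default=None)
```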

It will probably be split into an online/offline test. If that sounds like
something someone wants to help hack on, the more the merrier!
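
One way the online/offline split could look, sketched in Python rather than
the project's own test code: tests read a saved copy of each page by default
and only hit the live site when explicitly asked. The environment variable
`GOOSE_ONLINE_TESTS`, the `load_page` helper, and the fixture layout are all
assumptions made up for this example.

```python
import os
from pathlib import Path
from typing import Callable, Optional


def load_page(url: str, fixture: Path,
              fetch: Optional[Callable[[str], str]] = None) -> str:
    """Return page HTML from a saved fixture, or live when online mode is on.

    Online mode is enabled with the (hypothetical) GOOSE_ONLINE_TESTS=1
    environment variable; `fetch` is any function that downloads a URL.
    """
    online = os.environ.get("GOOSE_ONLINE_TESTS") == "1"
    if online and fetch is not None:
        html = fetch(url)
        fixture.write_text(html, encoding="utf-8")  # refresh the saved copy
        return html
    return fixture.read_text(encoding="utf-8")
```

The offline run stays fast and deterministic, while an occasional online run
both catches template changes and refreshes the fixtures in the repo.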

