

Sunflower: (clojure) extract story text from HTML markup tree - gtani
http://blog.danieljanus.pl/clojure/sunflower.html

======
zacharypinter
Cool idea. The readability bookmarklet
(<http://lab.arc90.com/experiments/readability/>) does a great job at
extracting the main document text without having to compare to other files,
though it's currently all client-side.

~~~
nirmal
I wrote a Python port which I used to create a version of the HN RSS feed with
the content of each article embedded into the feed itself. It became too
popular and caused my server to crash.

However, Andrew Trusty put a wrapper around my code so that you could use it
with any website. It's written in Python and runs on Google AppEngine. Check
it out: <http://andrewtrusty.appspot.com/readability/>

If my host weren't acting up again, I'd link to my port.

------
devin
Looks interesting! I was not able to build the jar due to the flyingsaucer
dependency failing. I tried using the maven repo's [groupId/artifactId "8RC1"]
in my project.clj but no dice. I've msg'd the author on GitHub. Likely a quick
fix and nothing to be upset over, but just so you're all aware...

~~~
jefffoster
I found that running

    lein deps
    lein uberjar

resulted in a clean build for me. I guess you could also stick the JAR in the
local Maven repository - flyingsaucer lives at
<https://xhtmlrenderer.dev.java.net/>
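
For the local-repository route, a rough sketch using Maven's `install:install-file` goal might look like this. The function name `install_local_jar` and the `DRY_RUN` switch are made up for illustration, and the coordinates in the usage line are placeholders; use whatever group, artifact, and version your project.clj actually asks for.

```shell
# Hypothetical helper: install a downloaded JAR into the local Maven repository (~/.m2).
# usage: install_local_jar <jar-file> <groupId> <artifactId> <version>
install_local_jar() {
  # Build the mvn command line as the positional parameters.
  set -- mvn install:install-file \
    -Dfile="$1" \
    -DgroupId="$2" \
    -DartifactId="$3" \
    -Dversion="$4" \
    -Dpackaging=jar
  if [ -n "${DRY_RUN:-}" ]; then
    # With DRY_RUN set, only print the command instead of running it.
    echo "$@"
  else
    "$@"
  fi
}
```

Something like `DRY_RUN=1 install_local_jar ./some.jar some.group some-artifact 8RC1` prints the command first, so you can check the coordinates before actually installing anything.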

~~~
nathell
I've uploaded the jar to the Downloads section and also added a note about
Flying Saucer to the readme. Thanks for checking it out!

------
evanrmurphy
Of course it wouldn't be able to filter out embedded marketing / product
placement, an increasingly popular paradigm for advertisement. (But who can
filter that out when it's integrated well enough?)

Neat idea.

------
regularfry
Sounds related to ariel: <http://ariel.rubyforge.org/>

