
The bulk of it was:

News article title extraction.
News article relevant thumbnail extraction.
News article text body extraction.
Generating publicly traded stock symbols from business news articles.
Some Techmeme-style document clustering.

I am working on a project (more of a public service than a startup) that needs this. I've looked through all of the resources linked in the articles above, and nothing works as well as I need it to. The best performer is Readability, so I will probably go with the Python port of that.

If your code works well, I also think you should put it up on GitHub. You can see what I intend to use this technology for by reading this text snippet: https://github.com/sbuss/revisionews/blob/develop/web/index....

I worked on something similar to this in Ruby last year, though it's currently in the middle of a re-architecture/rewrite due to its flaws: http://github.com/peterc/pismo

This looks pretty great, I'll definitely keep an eye on it. Thanks :)

Which Python port are you using? Last time I looked, all the Python Readability code I could find was either incomplete, old, or buggy.

I've experimented with https://github.com/gfxmonk/python-readability but it's extremely slow. There's a decent fork called decruft that is a couple of orders of magnitude faster: http://www.minvolai.com/blog/decruft-arc90s-readability-in-p...

Decruft also has a couple of bug fixes to python-readability. They both need a lot of work, though. You'll have to do some spelunking to figure out how to actually call the libraries correctly.
