
Heuristically finding content in html - hydrogen18
http://www.hydrogen18.com/blog/finding-content-in-html.html
======
jaytaylor
Nice write-up.

This reminds me of goose-ng [0].

There are a number of implementations and ports of the heuristic approach out
there, including for Go and Scala.

It works nicely, I'm a fan!

[0] [https://github.com/jaytaylor/goose-ng](https://github.com/jaytaylor/goose-ng)

------
darshandsoni
Nicely written post! I've run into this problem when trying to focus on just
the content of a page. It would be interesting to see whether you can find
deeper filtering ideas in browsers - Firefox, for example, has a "reader view"
that strips away cruft from pages. You could also look at the source code of
open-source feed reader apps like Flym - they do a great job of scraping
content and caching it online in a searchable format.

~~~
hydrogen18
Thank you for the feedback. I intend to go back and try to find a better
basis for the mathematics I used.

