From a quick test, it seems to treat almost every bit of content on a page equally, even elements which are clearly smaller and next to an image.
Might I recommend taking CSS styles into account? Large text is usually headlines, <strong> text is usually important, and darker greys generally suggest a side comment. Would be much easier if everybody used <aside> and <h1> but even in 2013 that's too high an expectation.
You are right, I'm not taking HTML tags into account. That's because I extract the text beforehand using Python Goose, so only the plain text, without any HTML tags, is fed into the algorithm.
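For what it's worth, here is a rough sketch of how tag information could be kept alongside the text instead of being dropped at extraction time. It is not how the service or Goose works; it uses only Python's standard-library `html.parser`, and the `TAG_WEIGHTS` table, the `WeightedTextExtractor` class and the `weighted_chunks` helper are all made-up names with illustrative, untuned values. It only looks at tag names, not at computed CSS (font size, colour), which would need a real rendering engine.

```python
from html.parser import HTMLParser

# Hypothetical weights: illustrative values only, not tuned on any data.
TAG_WEIGHTS = {"h1": 3.0, "h2": 2.0, "strong": 1.5, "b": 1.5, "aside": 0.5}


class WeightedTextExtractor(HTMLParser):
    """Collects (text, weight) pairs, weighting text by its enclosing tags."""

    def __init__(self):
        super().__init__()
        self._stack = []   # currently open tags that carry a weight
        self.chunks = []   # extracted (text, weight) pairs

    def handle_starttag(self, tag, attrs):
        if tag in TAG_WEIGHTS:
            self._stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self._stack:
            # Close the most recent matching tag and anything left open inside it.
            idx = len(self._stack) - 1 - self._stack[::-1].index(tag)
            del self._stack[idx:]

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        weight = 1.0
        for tag in self._stack:
            weight *= TAG_WEIGHTS[tag]
        self.chunks.append((text, weight))


def weighted_chunks(html):
    """Return a list of (text, weight) pairs for the given HTML string."""
    parser = WeightedTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


sample = "<h1>Title</h1><p>Body <strong>key</strong></p><aside>note</aside>"
for text, weight in weighted_chunks(sample):
    print(weight, text)
```

The weights could then scale each sentence's score before ranking, so a headline counts for more than a caption or a side note.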
Try https://github.com/visualrevenue/reporter :) I'm looking at your service now and it is really massively awesome. Can I ask if you are considering monetizing it, or going the venture path (boo)? I ask because I'm curious about the viability of using your service/library on a long-term project.