
Show HN: Automatic web page content extractor for Node.js and the command line - ageitgey
https://github.com/ageitgey/node-unfluff
======
JangoSteve
Reminds me of an article I wrote a while back on the "meat algorithm" I had
developed and used as part of LeadNuke to create more printable versions of
pages. What surprised me at the time was how simple it was to get just the
main content of a page for roughly 95% of the pages I tried.

[http://www.alfajango.com/blog/create-a-printable-format-
for-...](http://www.alfajango.com/blog/create-a-printable-format-for-any-
webpage-with-ruby-and-nokogiri/)

~~~
michaelmcmillan
I fear you're exaggerating the success rate of your algorithm. Have you
tested it? If so, how did you measure a successful extraction?

The reason for my skepticism is Arc90's readability extension [1]. On the
surface it looks far more complex. I could of course be wrong!

[1] [https://github.com/Kerrick/readability-
js/blob/develop/reada...](https://github.com/Kerrick/readability-
js/blob/develop/readability.js)

~~~
JangoSteve
A successful extraction was one where the title and body of the page came
through intact and readable, and where my application could call `content.text`
on the result and get the plain text of the page without the header, footer,
navigation, etc.

The complexity of the readability plugin seems to be due to the fact that it
actually does a lot more than just making something readable. For example, the
point of my algorithm was to strip all style information from a page and show
only the content, leaving it to be styled according to the global stylesheet.
Notice that my script not only removes linked stylesheets and style tags, but
also style attributes of all elements. The readability plugin actually does
things like counting reference links and styling them a certain way [1].

It has 53 lines dedicated to getting and normalizing the article title, when
the vast majority of the time it's just the first h1 or the HTML title tag
(which could be a one-liner, and is also outside the scope of the "meat
algorithm", since that only tries to get the body) [2].

It has 22 lines dedicated to injecting a custom readability footer into the
page [3].

It has 74 lines dedicated to converting all inline links to footnotes [4].

It has 55 lines dedicated to injecting Typekit fonts [5]. And on and on.

It also does things like figuring out when to float an image in the article
and when to make it full-width [6] (as opposed to just leaving the image
inline with no styling, as my script does).

And it even dedicates 333 lines of code to finding pagination links to build
content from multi-page articles [7], which my script simply doesn't do, since
it only cares about the content of the current page.

It also does things like computing a content-weight score for parts of the
page, I'm guessing to determine a relative heuristic for which parts are most
likely to be the main content [8]. This is actually the path I had started to
go down, before I realized that my much simpler version solved 95% of the use-
cases I had, and that, for my purposes, I didn't really care if it failed 5%
of the time.
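For the curious, that kind of content-weight scoring fits in a few lines. This is my own toy sketch (in Python for brevity, not readability's actual code): score each container by how much plain text it holds, penalizing text that sits inside links, since navigation is link-heavy.

```python
from html.parser import HTMLParser

class ContentScorer(HTMLParser):
    """Toy content-weight heuristic: score each <div> by how much plain
    text it holds, counting text inside links against it (navigation
    blocks are mostly links; article bodies are mostly plain text)."""
    def __init__(self):
        super().__init__()
        self.scores = {}      # div id -> accumulated score
        self.div_stack = []   # ids of currently open divs
        self.in_link = 0

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.div_stack.append(dict(attrs).get("id", "?"))
        elif tag == "a":
            self.in_link += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.div_stack:
            self.div_stack.pop()
        elif tag == "a" and self.in_link:
            self.in_link -= 1

    def handle_data(self, data):
        n = len(data.strip())
        for div_id in self.div_stack:
            # Plain text counts for a container; link text counts against it.
            self.scores[div_id] = self.scores.get(div_id, 0) + (-n if self.in_link else n)

html = """
<div id="nav"><a href="/">Home</a> <a href="/about">About</a></div>
<div id="article"><p>A long paragraph of real article text that clearly
outweighs the navigation links above.</p></div>
"""
p = ContentScorer()
p.feed(html)
best = max(p.scores, key=p.scores.get)
print(best)  # the link-heavy nav scores negative, so the article div wins
```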

I think the discrepancy in complexity can be explained really easily:

a) The readability plugin does a _lot_ of things not directly related to
simply grabbing the content of the page.

b) There's a lot of complexity involved in trying to get it to work for that
last 5% (those are the really weirdly-structured sites, for which you'd need
to develop some sort of heuristic and/or learning algorithm).

In other words, no, I didn't exaggerate the success-rate. The readability
plugin is just very functionally different. The algorithm in my article is
also not the complete script; the algorithm was meant to be a base starting
point to build from (certainly you would need to if you need greater than 95%
success rate, which most applications would).

_EDIT: I should also point out that my article was meant to document my
surprise at how easy it was to get a 95% solution. Most of the complexity I've
seen in other scripts is in trying to identify every HTML node on a page that
could possibly be related to the main content, so that you can scan for those
nodes and reassemble only them. The breakthrough for me was that if you can
find one paragraph tag, you can just go up a level or two to the containing
node and blindly grab all nodes, whatever they may be, within that container.
The main pages this doesn't work for are those that don't use paragraph tags
in the article body (e.g. plain text with break tags all over the place, which
were surprisingly few and far between in my sample set)._
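The whole idea fits in a short script. Here's a rough sketch (in Python with the stdlib parser for a self-contained example; the article itself uses Ruby/Nokogiri, so the names and structure here are mine): find the first `<p>`, climb to its container, and blindly take everything inside it.

```python
from html.parser import HTMLParser

class TreeBuilder(HTMLParser):
    """Builds a minimal DOM-ish tree. A sketch only: void tags like a
    bare <br> are not handled (<br/> works via handle_startendtag)."""
    def __init__(self):
        super().__init__()
        self.root = {"tag": "root", "parent": None, "children": [], "text": ""}
        self.cur = self.root

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "parent": self.cur, "children": [], "text": ""}
        self.cur["children"].append(node)
        self.cur = node

    def handle_endtag(self, tag):
        if self.cur["parent"] is not None:
            self.cur = self.cur["parent"]

    def handle_data(self, data):
        self.cur["text"] += data

def find_first(node, tag):
    if node["tag"] == tag:
        return node
    for child in node["children"]:
        hit = find_first(child, tag)
        if hit:
            return hit
    return None

def all_text(node):
    parts = [node["text"].strip()] + [all_text(c) for c in node["children"]]
    return " ".join(p for p in parts if p)

def meat(html, levels_up=1):
    """The 'meat algorithm': find the first <p>, climb to its
    containing node, and grab all text inside that container."""
    builder = TreeBuilder()
    builder.feed(html)
    p = find_first(builder.root, "p")
    if p is None:
        return ""  # the failure case: article body without <p> tags
    container = p
    for _ in range(levels_up):
        if container["parent"] is not None:
            container = container["parent"]
    return all_text(container).strip()

html = """<html><body>
<div id="header">Site name | <a href="/">Home</a></div>
<div id="content"><h1>Title</h1><p>First paragraph.</p><p>Second.</p></div>
<div id="footer">Copyright</div>
</body></html>"""
print(meat(html, levels_up=1))
```

Note how the header and footer never get touched: the container lookup starts from the paragraph, so everything outside its parent div is simply ignored.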

[1] [https://github.com/Kerrick/readability-
js/blob/4596857da3cc4...](https://github.com/Kerrick/readability-
js/blob/4596857da3cc45fbbb18bac12d6ddbfd04e83d64/readability.js#L490-L536)

[2] [https://github.com/Kerrick/readability-
js/blob/4596857da3cc4...](https://github.com/Kerrick/readability-
js/blob/4596857da3cc45fbbb18bac12d6ddbfd04e83d64/readability.js#L295-L348)

[3] [https://github.com/Kerrick/readability-
js/blob/4596857da3cc4...](https://github.com/Kerrick/readability-
js/blob/4596857da3cc45fbbb18bac12d6ddbfd04e83d64/readability.js#L350-L372)

[4] [https://github.com/Kerrick/readability-
js/blob/4596857da3cc4...](https://github.com/Kerrick/readability-
js/blob/4596857da3cc45fbbb18bac12d6ddbfd04e83d64/readability.js#L462-L536)

[5] [https://github.com/Kerrick/readability-
js/blob/4596857da3cc4...](https://github.com/Kerrick/readability-
js/blob/4596857da3cc45fbbb18bac12d6ddbfd04e83d64/readability.js#L538-L593)

[6] [https://github.com/Kerrick/readability-
js/blob/4596857da3cc4...](https://github.com/Kerrick/readability-
js/blob/4596857da3cc45fbbb18bac12d6ddbfd04e83d64/readability.js#L230-L249)

[7] [https://github.com/Kerrick/readability-
js/blob/4596857da3cc4...](https://github.com/Kerrick/readability-
js/blob/4596857da3cc45fbbb18bac12d6ddbfd04e83d64/readability.js#L1160-L1493)

[8] [https://github.com/Kerrick/readability-
js/blob/4596857da3cc4...](https://github.com/Kerrick/readability-
js/blob/4596857da3cc45fbbb18bac12d6ddbfd04e83d64/readability.js#L693-L772)

------
zaidf
I wish there were an online demo. I'd like to compare it to diffbot.com (a
SaaS tool we pay for).

------
JoshTriplett
Nice!

As an aid to extracting the right body content, have you considered comparing
multiple pages from the same site, and giving greater weight to content that
differs (the article) rather than content that stays the same (the
navigation)?
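The simplest version of that idea - treat text that repeats verbatim across pages of the same site as boilerplate - might look like this (a hypothetical Python sketch of the suggestion, not anything unfluff does):

```python
def boilerplate_filter(target_blocks, other_blocks):
    """Cross-page heuristic: text blocks that also appear verbatim on
    another page of the same site (navigation, footer) are boilerplate;
    blocks unique to this page are probably the article content."""
    seen_elsewhere = set(other_blocks)
    return [block for block in target_blocks if block not in seen_elsewhere]

# Two pages from the same site share their nav and footer text.
page_a = ["Home | About | Contact", "Article A body text.", "(c) 2014 Example"]
page_b = ["Home | About | Contact", "Article B body text.", "(c) 2014 Example"]
print(boilerplate_filter(page_a, page_b))  # ['Article A body text.']
```

A real version would compare rendered text per DOM node rather than whole blocks, and would tolerate small differences (dates, "you are here" markers), but the weighting principle is the same.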

------
rpedela
How does this compare to Apache Tika? There's also a Node wrapper for Tika,
but I don't remember the name of the module off the top of my head.

~~~
ageitgey
I'm sure Apache Tika is much more capable. For example, it supports HTML, CSV,
PPT, etc. instead of just HTML. But it also requires Java/Maven, and the
installation process is far from simple.

Unfluff is a small, simple .js library that can be installed and used in
seconds. It doesn't have any external dependencies on data files or other
language runtimes. So it just depends on which tool is right for your job. If
you are writing a quick script, this might be a lot easier to use.

~~~
frik
> Unfluff is a small, simple .js library

It's written in CoffeeScript.

(The transpiled JS files lack comments and meaningful new lines of the
original CoffeeScript source.)

~~~
nateguchi2
You can, of course, configure the CoffeeScript compiler to produce more
readable output.

------
edwinyzh
Can I use it to monitor source code changes of a Google Code project, e.g.
([https://code.google.com/p/dcef3/source/list](https://code.google.com/p/dcef3/source/list)),
or a GitHub project? If not, can anybody recommend a good tool for this kind
of task? Thanks.

~~~
MattJ100
If you look at the source of that page, you'll see:

    <link type="application/atom+xml" rel="alternate" href="/feeds/p/dcef3/gitchanges/basic">

which refers to an Atom feed for the project's commits.

Same for Github projects, e.g. at
[https://github.com/petdance/ack2](https://github.com/petdance/ack2) you will
find:

    <link href="https://github.com/petdance/ack2/commits/dev.atom" rel="alternate" title="Recent Commits to ack2:dev" type="application/atom+xml" />
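Pulling that feed URL out of a page programmatically only takes a few lines. Here's a sketch using Python's stdlib parser (the class name and example are mine):

```python
from html.parser import HTMLParser

class FeedFinder(HTMLParser):
    """Collects <link rel="alternate" type="application/atom+xml"> hrefs,
    which is how both Google Code and GitHub advertise their commit feeds."""
    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if (tag == "link" and a.get("rel") == "alternate"
                and a.get("type") == "application/atom+xml"):
            self.feeds.append(a.get("href"))

page = ('<html><head><link href="https://github.com/petdance/ack2/commits/dev.atom" '
        'rel="alternate" title="Recent Commits to ack2:dev" '
        'type="application/atom+xml" /></head></html>')
f = FeedFinder()
f.feed(page)
print(f.feeds)  # ['https://github.com/petdance/ack2/commits/dev.atom']
```

From there you'd poll the Atom feed itself rather than scraping the HTML page.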

~~~
edwinyzh
Great! Thanks for the info, and sorry for the dumb question without doing any
research myself first - that idea had occurred to me, but I never had the time
to investigate it :P

------
sferoze
Thanks! I'm making my own tool to manage my bookmarks and research, and this
is going to be very useful!

------
johnernaut
Pretty cool! I wrote a similar utility in Go [1] that also extracts and
compresses related CSS, JS, and images for offline use.

[1]
[https://github.com/johnernaut/webhog](https://github.com/johnernaut/webhog)

------
bakareika
Not long ago I started a similar project,
[https://github.com/mvasilkov/readability2](https://github.com/mvasilkov/readability2)

Works as a SAX consumer, pretty fast, not sure how accurate.

------
bndr
I wrote a similar library a while back ([https://github.com/bndr/node-
read](https://github.com/bndr/node-read)), but without the command line tool.
This seems cool.

------
christiangenco
I love it! Very simple and straightforward interface. Makes it very easy to
incorporate into a bigger workflow. You've followed the "do one thing well"
command line tool philosophy to a T.

------
jgmmo
What's the unique value proposition of this compared to the bajillion other
web scrapers out there?

~~~
ageitgey
Most basic scraping libraries require you to input a bunch of regexes or CSS
selectors to manually specify what you want to extract from a page. They
require custom coding for each page you want to scrape. This library is
totally automatic - you just pass in an HTML page and it returns the most
'texty' text on the page with no custom coding.

There are of course other libraries like this (boilerpipe, Goose, etc.), but
they tend to be written in Java or Python. The few existing Node solutions
didn't fit my needs, so I hacked this together. For people looking for a quick
and simple Node solution, this might be useful.

~~~
walshemj
there is the boilerpipe library which I used to do a test on New Scientist a
while back - we wanted to use ML to identify usefull clusters of content.

