
Article extraction benchmark: open-source libraries and commercial services - lopuhin
https://github.com/scrapinghub/article-extraction-benchmark/blob/master/README.rst
======
lopuhin
Author here, ready to answer any questions.

This is an evaluation of article body extraction for AutoExtract (ours),
Diffbot, newspaper3k, readability-lxml, dragnet, boilerpipe and html-text.
Since we're evaluating the quality of our own system as well, we tried to be
extra careful to be fair and transparent, releasing the dataset, the evaluation
scripts, and all details in the technical report.
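A minimal sketch of one plausible way such a benchmark could score an extractor: token-level precision/recall/F1 between the extracted text and a ground-truth article body. The actual metric used by the benchmark is defined in its evaluation scripts and technical report; the function below is illustrative only.

```python
# Hedged sketch: token-level F1 for comparing extracted text against the
# ground-truth article body. Not the benchmark's actual metric.
from collections import Counter

def tokenize(text):
    return text.lower().split()

def f1_score(predicted, expected):
    """F1 over the multiset of whitespace-separated tokens."""
    pred = Counter(tokenize(predicted))
    true = Counter(tokenize(expected))
    overlap = sum((pred & true).values())  # tokens present in both
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(true.values())
    return 2 * precision * recall / (precision + recall)
```

A perfect extraction scores 1.0; extracting boilerplate (navigation, ads) lowers precision, while dropping article content lowers recall.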

~~~
ssivark
Thanks, as an avid reader who saves a lot of articles from the web, this is
quite interesting. (I’m a Pocket subscriber btw)

How do these compare with popular services like Pocket, Instapaper, Firefox &
Safari extractors, etc — or do those services use these libraries/algorithms
in the backend?

~~~
lopuhin
Great question! I'm an active Pocket user myself and would love to know what
they use on the backend. Judging from their failures - cases where they decide
something is not an article, or exclude relevant content - I would guess they
use something working on pure HTML, more similar to the current open-source
solutions. Diffbot's failures, by contrast, looked quite similar to ours, as
we seem to use a similar approach (and it's quite rare for it to miss a large
chunk of the article). I imagine Pocket's margins must be quite slim, so they
can't throw a headless browser + neural network at every page. Maybe they
could use higher-quality, more expensive extractors for popular articles.

Browser extensions are in an interesting position here, as they probably have
access to much richer features from the browser context (element size,
position, CSS properties), but still want to be low overhead. I think I saw
such an implementation, maybe even from Mozilla, but can't find it right now.

~~~
ssivark
Hmm, that’s interesting. BTW, are there known/standard ways to “ensemble”
these different algorithms to build more robust solutions (at the price of
some extra computation)? It’s not obvious to me how one would combine
different extraction results, but maybe one could use some more heuristics to
pick the best result for each example.

~~~
lopuhin
Yeah, I don't think it's trivial. If an algorithm made predictions at the
level of HTML elements - whether an element should be part of the article body
or not - it would be possible to combine probabilities, or at least vote. But
(a) that's probably a non-trivial modification, and (b) many methods also use
heuristics/postprocessing or have other quirks which make combining results
difficult.
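The element-level combining idea above can be sketched as soft voting. This assumes each extractor exposes per-element probabilities, which (as noted) real libraries generally do not; the interface and element IDs here are hypothetical.

```python
# Hedged sketch: soft voting over hypothetical per-element probabilities
# from several extractors. Illustrative only - real extractors return
# final text, not element-level scores.

def ensemble_vote(element_ids, extractor_probs, threshold=0.5):
    """Keep an element if its average probability across extractors
    reaches the threshold (soft voting)."""
    kept = []
    for el in element_ids:
        scores = [probs.get(el, 0.0) for probs in extractor_probs]
        if sum(scores) / len(scores) >= threshold:
            kept.append(el)
    return kept

# Usage: three hypothetical extractors scoring three DOM elements.
probs = [
    {"p1": 0.9, "p2": 0.2, "ad1": 0.1},
    {"p1": 0.8, "p2": 0.6, "ad1": 0.0},
    {"p1": 0.7, "p2": 0.7, "ad1": 0.3},
]
print(ensemble_vote(["p1", "p2", "ad1"], probs))  # ['p1', 'p2']
```

Hard (majority) voting would instead threshold each extractor's score first and count votes; either way, the hard part in practice is getting comparable element-level outputs out of each method at all.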

------
freediver
Do you plan to release this? If not, can you discuss your approach?

~~~
lopuhin
The service itself is already released; you can play with the demo without
registration at https://www.scrapinghub.com/data-api-news (scroll to "Try it
out here for yourself"), and we also have a free API trial. Would be curious
to know how you plan to use it.

In terms of the approach, the whole page is rendered in a headless browser; we
extract a full-page screenshot, the text, and other features, and feed them
into one neural network where all modalities are joined and which handles the
extraction.

~~~
freediver
Did you base it off Dragnet? Can you comment on the other important parameter,
the extraction speed?

~~~
kmike84
The approach is very different from Dragnet. AutoExtract uses neural networks.
CSS and HTML features can only get you so far; we actually process screenshots
as pixels (like humans do), rather than just shallow features as in Dragnet.

Speed of the AutoExtract ML part is not a concern (many pages per second on a
GPU); the bottleneck is the browser rendering.

