
How can I identify primary article text when scraping news pages - invisiblerobot
I suspect this is a hard problem and that deep learning is the state of the art. But maybe I'm missing something?

Just to be clear, given the HTML of a WaPo article, I want to discard all the affiliate links/comments and focus on the article text. I want a generalized solution for many blogs and news sites.

Any tips?
======
tlack
I've had some good luck with "Unfluff"[0], a credible Node.js package that
uses a cascade of logical conditions to figure out what to extract.

It's a very practical start.

I thought the science of it was called "envelope detection" but I'm not
getting any relevant hits on that keyword. Will report back if I recall the
name.

[0]
[https://www.npmjs.com/package/unfluff](https://www.npmjs.com/package/unfluff)
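
To make the "cascade of logical conditions" idea concrete, here is a minimal Python sketch using only the standard library. It is not Unfluff's actual rule set, just an illustration of the general approach: skip obviously non-article containers, then keep paragraphs that are long enough and not dominated by link text. The names `ParagraphCollector` and `extract_main_text`, the `SKIP_TAGS` list, and both thresholds are assumptions made up for this example.

```python
from html.parser import HTMLParser

# Containers whose contents are ignored outright.
# This list is an assumption for the sketch, not taken from any library.
SKIP_TAGS = {"script", "style", "nav", "footer", "aside", "form"}


class ParagraphCollector(HTMLParser):
    """Collect the text of each <p> and count how much of it is link text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []   # list of (text, link_chars) tuples
        self._skip_depth = 0   # > 0 while inside a SKIP_TAGS element
        self._link_depth = 0   # > 0 while inside an <a> element
        self._in_p = False
        self._buf = []
        self._link_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag == "p" and not self._skip_depth:
            self._flush()          # an unclosed <p> ends where the next begins
            self._in_p = True
        elif tag == "a":
            self._link_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag == "p":
            self._flush()
        elif tag == "a":
            self._link_depth = max(0, self._link_depth - 1)

    def handle_data(self, data):
        if self._in_p and not self._skip_depth:
            self._buf.append(data)
            if self._link_depth:
                self._link_chars += len(data.strip())

    def close(self):
        super().close()
        self._flush()              # keep a trailing unclosed paragraph

    def _flush(self):
        text = " ".join("".join(self._buf).split())  # normalize whitespace
        if self._in_p and text:
            self.paragraphs.append((text, self._link_chars))
        self._in_p = False
        self._buf = []
        self._link_chars = 0


def extract_main_text(html, min_chars=80, max_link_density=0.3):
    """Return paragraphs that pass the length and link-density checks."""
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    kept = []
    for text, link_chars in parser.paragraphs:
        if len(text) < min_chars:
            continue               # too short: likely a button or caption
        if link_chars / len(text) > max_link_density:
            continue               # mostly links: likely affiliate or menu text
        kept.append(text)
    return "\n\n".join(kept)
```

Calling `extract_main_text(page_html)` returns the surviving paragraphs joined by blank lines. The thresholds are guesses and will need tuning per site; a real extractor layers many more conditions on top of these two.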

------
nmstoker
You haven't given any details about programming language preferences, but if
you're interested in a Python approach then Newspaper3k is worth a look:

[https://github.com/codelucas/newspaper](https://github.com/codelucas/newspaper)

