
Wonderful Python3 Library for working with webpages’ semantic content - Siira
https://newspaper.readthedocs.io/en/latest/
======
everling
I’ve been looking for a Python library to replace Boilerpipe in Java and
stumbled across this one the other week. Does anyone have experience
comparing the results of the two?

~~~
shakna
Newspaper for the most part "just works", and when it doesn't, it doesn't do
much beyond handing you the source so you can munge it and try again. (It
also seems to work well for non-English languages, though I've only tried it
against Spanish.)

Boilerpipe provides a lot more knobs for tweaking the output, and it ships a
lot more extractors than the basics that Newspaper provides.

Newspaper will give you the basics: article content, titles, some metadata.
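
For anyone who hasn't used it, here is a minimal sketch of pulling those
basics out of Newspaper. It assumes the newspaper3k package and its Article
class with download() and parse(), as I understand the documented API. The
helper name extract_basics and its article_cls parameter are my own,
hypothetical additions so the function can be tried with a stand-in class.

```python
def extract_basics(url, article_cls=None):
    """Return the title, body text, and some metadata for one article URL.

    `extract_basics` and `article_cls` are hypothetical names for this sketch;
    only `Article`, `download()`, `parse()` and the attributes read below are
    assumed to come from Newspaper itself.
    """
    if article_cls is None:
        # Imported lazily so the helper can also run against a stand-in class.
        from newspaper import Article  # assumes: pip install newspaper3k
        article_cls = Article

    article = article_cls(url)
    article.download()  # fetch the raw HTML
    article.parse()     # extract title, body text, authors, publish date

    return {
        "title": article.title,
        "text": article.text,
        "authors": article.authors,
        "publish_date": article.publish_date,
        # Raw source, kept so you can munge it yourself if extraction fails.
        "html": article.html,
    }

# Usage (the URL is a placeholder):
# info = extract_basics("https://example.com/some-article")
# print(info["title"])
```

Returning the raw HTML alongside the parsed fields is just one way to keep
the fallback handy; drop it if you only care about the extracted text.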

Boilerpipe lets you make a turn-key solution for particular sets of sites,
which can be more or less helpful depending on where you want to use it.

Lastly... Newspaper isn't the fastest thing in the world. It does tend to have
high accuracy for article extraction, but it tends to be slow.

------
ashtonbaker
Thanks for posting this - I spent most of today kicking around the idea of an
Instapaper-style webpage parser, and came to the conclusion that I probably
didn't have time to do it well.

