
Depending on what you're trying to do, there's also newspaper3k: https://github.com/codelucas/newspaper

It's quite easy to get "good" extraction for large numbers of outlets/articles without a massive amount of special-casing, as news articles are nearly universally marked up with RDF metadata (partly for Google News's benefit). Article discovery and perfect parsing are quite a bit harder. I ended up rolling a new Scrapy project with site-specific parsing code for an academic project, as I had quite specific requirements.
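
To illustrate the metadata point, here is a rough stdlib-only sketch that pulls a few fields out of `<meta>` tags. It assumes the page exposes Open Graph style properties (`og:title`, `og:url`, `article:published_time`), which is only one common form of this markup; the tag names an outlet actually emits will vary. `MetaTagParser`, `extract_article_metadata`, and `SAMPLE_HTML` are made-up names for this example, not part of newspaper3k or Scrapy.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collects key/content pairs from <meta property=...> and <meta name=...> tags."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        # Open Graph uses the "property" attribute; older tags use "name".
        key = attrs.get("property") or attrs.get("name")
        content = attrs.get("content")
        if key and content is not None:
            # Keep the first occurrence of each key.
            self.meta.setdefault(key, content)


def extract_article_metadata(html):
    """Return a small dict of article fields; missing fields are None."""
    parser = MetaTagParser()
    parser.feed(html)
    parser.close()
    m = parser.meta
    return {
        "title": m.get("og:title"),
        "url": m.get("og:url"),
        "published": m.get("article:published_time"),
        "author": m.get("article:author") or m.get("author"),
    }


# Hypothetical page head, used only to exercise the function.
SAMPLE_HTML = """<html><head>
<meta property="og:title" content="Example headline">
<meta property="og:url" content="https://example.com/story">
<meta property="article:published_time" content="2020-01-02T03:04:05Z">
<meta name="author" content="A. Reporter">
</head><body><p>Body text.</p></body></html>"""

if __name__ == "__main__":
    print(extract_article_metadata(SAMPLE_HTML))
```

This only covers the easy part. It says nothing about finding article URLs in the first place or about cleanly extracting the body text, which is where the site-specific code tends to come in.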

Hey, I used this for my newsbetting site - https://www.rashomonnews.com/

I haven't pulled any articles in a while, so it's a little outdated, but I love newspaper3k.

Is your code on GitHub? I'm actually working on something very similar, and would love to get some ideas from what you've done.
