
Show HN: Eat the News – A text extraction playground for global RSS feeds - copypirate
https://extractarticletext.com/online-extractor/
======
copypirate
Extract article text from RSS feeds around the globe en masse.

Mini guide here -
[https://www.reddit.com/r/webscraping/comments/fbknv4/extract...](https://www.reddit.com/r/webscraping/comments/fbknv4/extract_article_text_from_rss_feeds_around_the/?utm_medium=android_app&utm_source=share)

------
bhl
Articles given via RSS feeds usually don't contain ads, right? Might it be
possible to use content with no ads from RSS and the same content from the
home webpage with ads to build a text extraction model that works beyond RSS?
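
A rough sketch of how that labeling could work, assuming the RSS body is a near-verbatim copy of the article text on the page. The names `BlockCollector` and `label_blocks` are made up for illustration, and exact substring matching is a simplification; a real pipeline would likely need fuzzy matching.

```python
import re
from html.parser import HTMLParser


def normalize(text):
    """Lowercase and collapse whitespace so small formatting differences don't matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


class BlockCollector(HTMLParser):
    """Collects the normalized text of each <p> element as one block (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._depth = 0
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                text = normalize("".join(self._buf))
                if text:
                    self.blocks.append(text)
                self._buf = []

    def handle_data(self, data):
        # Only keep text that sits inside a <p> element.
        if self._depth:
            self._buf.append(data)


def label_blocks(page_html, rss_text):
    """Pair each page block with True if it also appears in the RSS text, else False."""
    collector = BlockCollector()
    collector.feed(page_html)
    clean = normalize(rss_text)
    return [(block, block in clean) for block in collector.blocks]
```

The resulting (block, label) pairs could then serve as training data for a classifier that separates article text from ads and other page chrome.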

~~~
imduffy15
There are AI solutions that can work on the HTML and manage to avoid the ads.
Check out [https://blog.scrapinghub.com/extracting-clean-article-html-w...](https://blog.scrapinghub.com/extracting-clean-article-html-with-autoextract)

------
foreigner
What is this actually for?

~~~
copypirate
The main API extracts text from news pieces, blogs, PR content, and other
text-heavy pages with a GET request.
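
A minimal sketch of what such a GET call might look like. The endpoint `https://api.example.com/extract` and the `url` query parameter are placeholders I made up for illustration, not the real API; check the service's docs for the actual URL, parameters, and any authentication.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Hypothetical placeholder endpoint, NOT the real service URL.
API_BASE = "https://api.example.com/extract"


def build_request(article_url, api_base=API_BASE):
    """Build a GET request that passes the article URL as a query parameter."""
    query = urlencode({"url": article_url})  # percent-encodes the article URL
    return Request(f"{api_base}?{query}", method="GET")


def fetch_text(article_url, api_base=API_BASE):
    """Send the request and return the response body as text (assumed UTF-8)."""
    with urlopen(build_request(article_url, api_base), timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

Building the request separately from sending it makes it easy to inspect the final URL before hitting the network.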

The free account gives you access to a GUI where you can paste/upload your
URLs to extract text from. The "playground" is a collection of RSS feeds with
links you can extract text from, to try out the capabilities of the API.

~~~
foreigner
Right, I understand what it does, but what is it for? Why would someone want
to do this?

~~~
copypirate
My personal use case was compiling a news dataset to train a topic model.

