
Show HN: Python package to collect normalized news from almost any website - kizy25
https://github.com/kotartemiy/newscatcher/blob/master/README.md
======
kingofpandora
> Programmatically collect normalized news from (almost) any website.

By this you mean ... Fetch RSS feeds from the websites that we've hardcoded
into a Python library?

It seems like the work here was in collecting and cataloging a lot of RSS
feeds. You don't seem to be able to arbitrarily "collect news from any
website".

------
oceanbreeze83
For any news junkies here, I've built
[https://maagnit.com](https://maagnit.com) which gathers both Left and Right
leaning sources for any story and displays them all together on one page.

My approach has been, if we can't get neutral/objective coverage, getting
comprehensive, 360-degree coverage is a good alternative.

These days, news bias is not only in the way a story is covered, but also
_which_ stories are covered. So, maagnit automatically collects stories from
the left and right.

~~~
SoylentOrange
When I was a kid CNN used to do this: they would invite a single climate
scientist and a single climate change skeptic and have them make their points
on equal time. CNN would then say “who’s right? You decide.” And the program
would end.

This was extremely harmful to the overall mission of informing people and
created a false balance between sides which are not equally valid. Today, this
is almost universally seen as a failure of journalism.

Given the lessons learned, I wouldn’t be so quick to replicate that experience
with a website.

~~~
nautilus12
So they present the facts and let them speak for themselves? Sounds like they
had it right. Are you saying if they don't present their opinion it's a failure
of journalism? I think this new way of defining journalism is a bad standard.

~~~
aidanlister
People don’t know what is fact and what is fiction. Giving 50/50 air time to
climate scientists and climate change deniers is on one side presenting facts
and on the other total bullshit and saying “hey these are both valid opinions,
you decide”.

~~~
textgel
This isn't climate scientists vs deniers; this is major news organisations on
both sides, both of which are guilty of presenting total bullshit as facts.
Allowing them to be viewed side-by-side permits users to get a better image
of who is willing to mislead on which subjects and get facts that have been
omitted by the other side.

~~~
ineedasername
Just because one side might be more willing to mislead doesn't mean that their
stance on a topic is more likely to be the incorrect one.

Showing which news sources employ the most underhanded rhetorical devices may
be a positive goal in itself, but it doesn't, by itself, help the audience
make their own determination on an issue. Even more of an issue is that a
viewer's determination of which source is more willing to mislead or omit
relevant details is much more likely to be influenced by prior opinion than by
the content of either source.

Basically, the problem isn't, in itself, biased news sources, it's that the
format is fundamentally ill-suited towards giving individuals enough
information to come to a reasonably well-supported position on just about any
topic of moderate complexity. Further take any topic that appears to be of
simple complexity and scratch the surface a bit and there's a decent chance it
will turn out to be not so simple.

~~~
textgel
Respectfully; it does. Not 100%, but for the most part it does. The moment you
have to lie to make your point you concede that your argument never had a
grounding in reality.

Even putting that aside, the utility of bringing to light underhand tactics
isn't meant to be used in and by itself but instead serves as one of many
aspects of debate to help decide what is right/wrong true/false.

Regarding the poorly suited nature of news for getting the full story across to
the reader, I fully agree, but again just because a tool isn't perfect, it doesn't
mean it gets cast aside; more perspectives (and these are mainstream
organisations) on a subject doesn't hurt at all.

~~~
ineedasername
I agree that the side that is more deceptive & manipulative in their
persuasive tactics will tend to be the one with less potential substance; my
point was only that such a scenario isn't necessarily the case. Even an
"honest" person can find themselves coming to the correct conclusions through
faulty reasoning. In such cases they are only accidentally correct. In
the hands of someone that understands that facts don't win arguments, but
nonetheless believes their "side" is correct, it is all too easy to justify
sensational, emotional arguments, rhetorical flourishes, etc. in an "ends
justify the means" sort of way.

I don't have an answer on the issue of news organizations being poorly suited
here. On the one hand, there is an appeal to the idea you convey that
something is better than nothing. However, that status quo is also what has
led us to the current situation. There is a correlation between the rise of
24-hour news networks and the internet and the increasingly vitriolic,
polarizing, and propagandist tone of things. The need to fill air time was a big
part of that. I don't think that was wholly the cause. There was some trend in
that direction already:

 _Note to readers:_ This next part is not intended to cast blame only in one
direction. It is simply one concrete example of the type of things that became
commonplace.

Around 1990 Newt Gingrich penned a memo titled "Language: A Key Mechanism
of Control". It went on to explain how language could be used to manipulate
people, complete with a guide for how to use demonizing, dehumanizing language
against political opponents. Over the years it was systematically disseminated
through his party, and when Newt became House Speaker around 1995 he literally
made it required reading. Shortly after is around the time that the term
"liberal" went from being a fairly neutral description like "conservative" to
being a hated moniker for a political opponent. (Though right-wing, alt-right,
etc., fill that purpose now for the other side.)

------
nojito
This is literally just a sqlite db with rss feeds and a python script with a
very misleading description.

The real purpose of this post is to get traffic for their paid news api.

Also this closed issue is hilarious!
[https://github.com/kotartemiy/newscatcher/issues/3](https://github.com/kotartemiy/newscatcher/issues/3)

------
pmyteh
Depending on what you're trying to do, there's also newspaper3k:
[https://github.com/codelucas/newspaper](https://github.com/codelucas/newspaper)

It's quite easy to get "good" extraction for large numbers of outlets/articles
without a massive amount of special-casing, as news articles are nearly
universally marked up with RDF metadata (partly for Google News's benefit).
Article discovery, and perfect parsing, is quite a bit harder. I ended up
rolling a new Scrapy project with site-specific parsing code for an academic
project as I had quite specific requirements.
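
To illustrate the "easy" part, here is a minimal sketch that reads article metadata out of a page's `<meta>` tags using only the standard library. It assumes the outlet exposes Open Graph style properties (`og:title`, `article:published_time`, and so on), which is one common form of this markup; the sample HTML and the function names are made up for the example, and this is not how newspaper3k or RISJbot do it.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collect <meta property="..."> / <meta name="..."> values from a page."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        key = attrs.get("property") or attrs.get("name")
        content = attrs.get("content")
        if key and content is not None:
            # Keep the first value seen for each key.
            self.meta.setdefault(key, content)


def extract_article_metadata(html):
    """Return the subset of metadata fields we care about, if present."""
    parser = MetaTagParser()
    parser.feed(html)
    wanted = ("og:title", "og:description", "og:url", "article:published_time")
    return {key: parser.meta[key] for key in wanted if key in parser.meta}


# Hypothetical page head, for illustration only.
SAMPLE_HTML = """
<html><head>
  <meta property="og:title" content="Example headline">
  <meta property="og:url" content="https://example.com/story">
  <meta property="article:published_time" content="2020-05-20T10:00:00Z">
</head><body><p>Body text.</p></body></html>
"""

if __name__ == "__main__":
    print(extract_article_metadata(SAMPLE_HTML))
```

Discovery and clean body-text extraction are not covered by this; that is the part that tends to need site-specific code.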

~~~
jerzyt
Is your code on github? I'm actually working on something very similar, and
would love to get some ideas from what you've done.

~~~
pmyteh
Yes! [https://github.com/pmyteh/RISJbot](https://github.com/pmyteh/RISJbot)

------
raziel2p
This doesn't "collect" anything, it's just a python package wrapped on top of
a sqlite database. If you choose an arbitrary website that doesn't have
information in said database you just get "website not supported". There's not
even any logic to guess a website's RSS feed URL... And in the end it's just
an RSS feed reader?

~~~
justaguyhere
What would the logic to guess the RSS feed URL look like? I suppose it is easy
for wordpress sites, might not be so simple for others?

~~~
slightwinder
The common way is to set a link tag in the HTML with an RSS type, so others can
discover the feed for the active URL. If this is not set, chances are slim that
there is a proper feed available. Not that you can't still try guessing and
googling for it...
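
A rough sketch of that discovery step, standard library only. It assumes the page has already been downloaded and that feeds are advertised as `<link rel="alternate">` with an RSS or Atom MIME type; `discover_feeds` and `FeedLinkParser` are names invented for this example, not part of newscatcher.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# MIME types commonly used to advertise a feed in a <link> tag.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkParser(HTMLParser):
    """Collect feed URLs advertised via <link rel="alternate" type="...">."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attrs = dict(attrs)
        rel = (attrs.get("rel") or "").lower().split()
        mime = (attrs.get("type") or "").lower()
        href = attrs.get("href")
        if "alternate" in rel and mime in FEED_TYPES and href:
            # Resolve relative hrefs against the page URL.
            self.feeds.append(urljoin(self.base_url, href))


def discover_feeds(html, base_url):
    """Return the feed URLs a page advertises, in document order."""
    parser = FeedLinkParser(base_url)
    parser.feed(html)
    return parser.feeds


if __name__ == "__main__":
    page = (
        '<html><head>'
        '<link rel="alternate" type="application/rss+xml" href="/feed.xml">'
        '</head></html>'
    )
    print(discover_feeds(page, "https://example.com/news/"))
```

If the returned list is empty, you are back to guessing common paths or searching by hand.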

------
fuball63
This looks like a cool library, but I have a question about the newscatcher
API. How does the licensing work for the content? Seems odd (but great) that I
can just read the news in my terminal from NYT but not pay a subscription or
see ads. I read in some of the comments it's an RSS feed, is that freely
available all the time? Surely even the RSS feed is protected with copyright
and has restrictions on republishing?

If that's not the case, the larger implication here is that the news is free if
it's in a format that is not as widely used (RSS) compared to what the mass
populace uses (mobile browsers/apps).

Cool library, thanks for sharing!

~~~
artembugara
Hi. Co-founder is here. Short answer. We do not know if it is legal!

~~~
hundchenkatze
It seems a little risky to build a paid service that you're not sure is legal.

Also, the Terms of Service, Privacy Policy, and GDPR Policy links in the
footer of your site don't work. They all have empty hrefs.

~~~
artembugara
Will be updated once we start to sell

------
0x006A
Why did you choose to ship the list of RSS feeds as an SQL database? This
makes it hard to keep it up to date and submit pull requests with additional
sources. Would it not be better to keep that info in a json file or a dict /
list in a python module?
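
Something like the following is what I have in mind: a sketch only, with a made-up schema (`site`, `language`, `topic`, `rss_url`) and placeholder sites, not the package's actual data model. A plain-text list like this diffs cleanly in a pull request.

```python
import json

# Hypothetical contents of a sources.json file; field names are assumptions.
SOURCES_JSON = """
[
  {"site": "example.com", "language": "en", "topic": "news",
   "rss_url": "https://example.com/feed.xml"},
  {"site": "esempio.it", "language": "it", "topic": "news",
   "rss_url": "https://esempio.it/rss"}
]
"""


def load_sources(text):
    """Parse the JSON list into a dict keyed by site name."""
    return {entry["site"]: entry for entry in json.loads(text)}


def sites_by_language(sources, language):
    """Return the sorted site names available for a language code."""
    return sorted(
        site for site, entry in sources.items() if entry["language"] == language
    )


if __name__ == "__main__":
    sources = load_sources(SOURCES_JSON)
    print(sites_by_language(sources, "it"))
```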

~~~
artembugara
Hey. Yes. You are right. We will change that!

~~~
remram
Will you?
[https://github.com/kotartemiy/newscatcher/issues/3](https://github.com/kotartemiy/newscatcher/issues/3)

------
tyingq
The metadata of language, topic, rss url, etc, is nice work.

For use outside of python, here's a gist with a sorted list of sites and a
sqlite dump of the site data:
[https://gist.github.com/tyingq/8e921eed10bf2ecf9c40ebdd70ff1...](https://gist.github.com/tyingq/8e921eed10bf2ecf9c40ebdd70ff1871)

------
permanent
Hi, looks interesting and useful! What are the differences between newscatcher
(python) vs. newscatcherapi.com? Is there any limit on using newscatcher
(python) as shown in the pricing pages?

Also, I was looking at [https://newsapi.org](https://newsapi.org). How does
your python API compare? I see that newsapi also gets old articles; do
you implement similar features?

~~~
pheme1
There seem to be a lot of news crawler APIs lately; a quick Google search
found these:

[1]
[https://currentsapi.services/en/product/price](https://currentsapi.services/en/product/price)

[2] [https://aylien.com/news-api/](https://aylien.com/news-api/)

[3] [https://www.cityfalcon.com/products/api/financial-news](https://www.cityfalcon.com/products/api/financial-news)

~~~
bussyscorp
Another guy: [https://newsscraper.com](https://newsscraper.com)

------
nmstoker
How complete are the articles that are returned? Last time I looked into RSS
news feeds a lot of sources would just put abbreviated / teaser content in and
then try to get you to click through to the story on their site. That
obviously didn't make for the best experience with an RSS downloader though.

~~~
k1m
I work on Full-Text RSS which can help convert abbreviated feeds into
full-text versions. The idea is you'll get a new feed URL from Full-Text RSS to use
instead of the original partial feed in your news reader or application.

Free to try here: [http://ftr.fivefilters.org/](http://ftr.fivefilters.org/)
and code for a slightly older version available here:
[https://bitbucket.org/fivefilters/full-text-rss/src/master/](https://bitbucket.org/fivefilters/full-text-rss/src/master/)

------
0x006A
Your git repo contains the dist folder even though it's in .gitignore, might be
good to remove that. No need to check in generated artifacts.

~~~
artembugara
Thx. Will check.

------
charlesdaniels
As others have noted, this doesn't seem to collect the full article text, just
stuff that you would get from an RSS feed. From the title, I expected something
more like newspaper3k[0]. I used that for an NLP class during undergrad to
collect full-text news articles, in conjunction with Selenium (many mainstream
sites don't work with just plain wget or requests).

Lately I've started using EpubPress[1] to grab full-text articles and
generate an ePub, which happens every night via cron. Then I can get a full
digest on my iPad over sftp at my leisure. Sadly EpubPress is not very
sophisticated, sites like Bloomberg or ArsTechnica return "are you a robot"
challenges which it can't bypass.

I wish there was some kind of community driven library for retrieving
full-text articles from common sites. In my vision of how that would work, users
would contribute hand-crafted Selenium scripts to download and extract the
article text, bypassing the bot-detection for each site. Then something like
EpubPress would work a lot better.

The "modern web" just has too much junk to be interesting any more. Sometimes
news sites publish articles I would like to read, but I'm not interested in
dealing with 1000 different implementations of crappy mobile UIs,
advertisements, animations, etc. I know reader view exists, but you still have
to wait for the page to load, and it doesn't work very well for some sites.
For the sites it doesn't break with, the experience with EpubPress is much
better.

0 -
[https://github.com/codelucas/newspaper](https://github.com/codelucas/newspaper)

1 - [https://epub.press/](https://epub.press/)

------
uptown
Neat! Pair this with some sketchy ad tech, and you can start raking in the
bucks.

[https://www.cnbc.com/2020/05/17/broken-internet-ad-system-ma...](https://www.cnbc.com/2020/05/17/broken-internet-ad-system-makes-it-easy-to-earn-money-with-plagiarism.html)

------
slightwinder
I would be more interested in a ready-to-use server which I can selfhost and
to which I can throw any script for collecting data, filter and process them,
which also handles errors and storage.

At the moment the only real solution for this seems to be using your own RSS
reader (Tiny Tiny RSS for me at the moment), which has the disadvantage of
being limited to RSS sources and of not allowing much filtering and
processing. But with more and more sources moving away from RSS, I want
something which can fetch alternative sources and integrate them into a
unified interface.

In the best case it would even work with any language, be it a shellscript or
python, ruby or even java.

~~~
artembugara
Well, that is quite similar to what we are doing at newscatcherapi.com.

We collect tons of news data and let you query it with an API.

~~~
slightwinder
Not selfhosted, isn't it? By which I mean it costs money.

~~~
artembugara
Yeah. Though it will have a free plan. And our hosting costs quite a lot, so
(imho) it’s almost always much cheaper to buy a ready solution.

------
kuu
This seems really cool!

How does it work internally? Is it downloading the news from an RSS feed or is it
crawling the content of the website? Or is the content coming from an external
service?

How are the feeds selected? Can we add more? Who is maintaining them (in case
the data is crawled)?

Thanks!

~~~
artembugara
hey, there is a sqlite DB that stores the RSS endpoints. Then we use
feedparser python package to parse it.
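
Roughly, the flow looks like the sketch below. The table and column names (`feeds`, `site`, `rss_url`) and the demo row are made up for illustration and are not the real schema; `feedparser` is the third-party package and is only imported if you actually call `fetch_titles`.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory stand-in for the bundled database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE feeds (site TEXT PRIMARY KEY, rss_url TEXT NOT NULL)"
    )
    conn.execute(
        "INSERT INTO feeds VALUES (?, ?)",
        ("example.com", "https://example.com/feed.xml"),
    )
    conn.commit()
    return conn


def lookup_feed_url(conn, site):
    """Return the stored RSS endpoint for a site, or raise if it is unknown."""
    row = conn.execute(
        "SELECT rss_url FROM feeds WHERE site = ?", (site,)
    ).fetchone()
    if row is None:
        raise LookupError(f"website not supported: {site}")
    return row[0]


def fetch_titles(feed_url):
    """Download and parse the feed, returning entry titles (needs network)."""
    import feedparser  # third-party: pip install feedparser

    parsed = feedparser.parse(feed_url)
    return [entry.get("title", "") for entry in parsed.entries]


if __name__ == "__main__":
    conn = build_demo_db()
    print(lookup_feed_url(conn, "example.com"))
```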

~~~
slightwinder
Then why should I use this instead of a real Feedreader? What advantage does
this have?

~~~
artembugara
I think you mean feedparser. It knows how to parse the feed. So you have to
give it the feed’s url

~~~
slightwinder
No, I mean feedreader. A feedreader is a service which collects and parses
RSS files, then makes them accessible in an interface for the user. Similar to
a mail client, but for RSS.

This package seems to do the first part, collecting and parsing, but lacks the
interface. So what is the point of it?

And if this is maintaining its own config for sources, can I even add my own
sources? Or is this just an elaborate OPML file with attached business
logic?

~~~
artembugara
Hey. Alright. Thanks for letting us know. I will check it tonight.

------
polymorph1sm
Seems like a repost of previous Show HN [1].

[1]
[https://news.ycombinator.com/item?id=22407835](https://news.ycombinator.com/item?id=22407835)

------
pheme1
What does "normalized news" mean? Being able to query a predefined RSS feed?
I was expecting news coverage from different partisan sources pointing
to the same news event.

------
forgingahead
Neat project! Definitely useful to have this, folks can build "headline-edit
trackers" or services that collect news using other filters more easily using
this.

------
nreece
This looks great!

I read your blog post about how open-sourcing helped you find testers, which
is a great step for such tools.

* _Shameless plug_ *: Our web service, Feedity - [https://feedity.com](https://feedity.com), helps create a custom feed for any news webpage, via a point-and-select feed builder and REST API.

------
ryanalam
I've been experimenting with a side project: www.glancereport.com, which pulls
the top headline from a variety of news sources. You can also sort these
headlines by neutral, progressive, or conservative sources.

I'm still learning how to play with Python and Redis, and it's not perfect,
but would love feedback.

~~~
modlinska
I see that the Neutral/Progressive/Conservative filter is most applicable to
Politics news, but I'd be interested in seeing if you can filter by topic like
Business, Sports, Lifestyle etc.

------
zzuko
There is a similar library for this called newspaper which I had used in my
undergrad thesis. Not dismissing this work, but I am curious about what it is
offering on top of it? It doesn't offer a comparison to newspaper, at
least not on the GitHub page.

------
anonytrary
The demo gif is really long (a few minutes), but it's well worth watching and
summarizes the capabilities of the library well. It'd be cool if you could
register additional sources via a registration API or some passed in
configuration.

~~~
pydry
This would be very cool - a bit like how youtube-dl handles non YouTube
websites.

------
marban
Self-Plug: [https://www.hvper.com](https://www.hvper.com) (Official Successor
of popurls which more or less started the single page aggregator craze.)

~~~
zwaps
That pink bar takes up almost half of my screen on my iphone 11 and makes the
website de facto unusable.

Not to be unkind, but you should fix that. Instead of being incentivized to
pay you, my only instinct was to close that page asap.

~~~
zo1
How about you switch to desktop mode on your browser instead of forcing sites
to be mobile friendly so you can consume it on a tiny screen. Half the web is
broken because of "responsiveness".

~~~
marban
To be fair, it's not broken — but coming up with clever viewport-dependent
layout changes takes up enough resources to consider it a low-level
priority for some business cases.

------
A4ET8a8uTh0
My initial reaction is very positive. I believe there is a market for a tool
like this. Let's see how it handles marketwatch.

------
jerzyt
First, kudos to the OP. If you don't mind, would you compare and contrast it
to newspaper3k?

------
napolux
Hey, looks cool. Which news sources are available for the Italian
market/language?

~~~
artembugara
hey, you can get the news sources by `it` language and check it

~~~
napolux
thx!

------
nickthemagicman
How do subscriptions work with this? Or is it treated like anonymous browsing?

------
amelius
Perhaps nice to show news headlines as an alternative to /etc/motd

------
hitpointdrew
What is "normalized news"? What does that even mean?

------
spiritplumber
This looks amazing.

~~~
artembugara
Thanks!

------
seemslegit
I'm not sure how to parse the following sentence:

 _By newscatcherapi.com (this package is fully self-sufficient, you can just
use it. No dependency on external services /API)_

~~~
stevenjohns
I interpret this as

 _Created by <api provider> but you do not need anything from us or from
anyone else to get the software going, it just works out of the box._

~~~
seemslegit
Well the 'anyone else' part is wrong, someone has to provide the news.

------
unixhero
Will this work for fetching financial data from the public bloomberg.com
website?

~~~
unixhero
I see your downvotes. It means I am on to something. I will try.

------
gitgud
> " _Almost any website_ "

Is there a list of supported websites?

------
carbolymer
So, the centralized, paid version of RSS?

~~~
artembugara
It's a Python package to collect news data. Nothing paid

