
Show HN: News Searching API - caballeto
https://datanews.io
======
caballeto
This is a side project that I have been building recently. The API allows
searching over data from many news sites and news aggregators; it also
makes it easy to extract whole news sites. I built it for two reasons. First,
I wanted a source of recent news data to play with, essentially to run
data analysis and train ML models on. Second, alternative services
had wildly high pricing, like $400-500 per month, and I thought I could build
and deploy such a system much more cheaply and improve my programming
skills along the way. I would very much appreciate any comments/suggestions
about the service or how to make it better :)

~~~
bvm
How are you licensing the content?

~~~
jwhiz22
He's not.

~~~
caballeto
Right, I will definitely work on adding a data policy to the terms of service.
I think it will be along the lines that you cannot remove the
attributions/trademarks if you are republishing the data, but doing analytics
on it is OK.

~~~
bvm
It's a pretty fine line you're treading here. A lot of the reason the
competition is so expensive is that they are licensing the data in bulk.
There's lots of litigation in this area; see the Meltwater cases in the US and
UK.

~~~
caballeto
If you have say 10k news sources from tens of countries around the globe, I
doubt that it would be feasible to contract with every specific site out
there, especially if you get data say from Google News.

------
asdkhadsj
Do you host the content as well? What's the legality of that?

I ask because I want to embed some archiving and "reader mode" logic into an
app of mine that would be FOSS and self hosted. However that means each
individual would be effectively scraping and archiving, and possibly p2p
spreading, news content _(as data sources)_.

So I'm curious if there is some underlying "fair use"-like mechanism that
allows Archive, Outline.com, and you to consume news content without it being
considered piracy.

~~~
caballeto
This is a great question. Thus far I thought that content that is
publicly available without any limitation (e.g. membership access) can be
scraped by anyone. You can take a look at hiQ Labs v. LinkedIn [1].
LinkedIn's public data was scraped by a data analytics company, and the ruling
was against LinkedIn.

I also think that the use case matters: I don't republish their content on the
site, but merely provide it via the API. Technically, it could be argued that
they could get this data themselves, but it is easier for them to use a
service like this one to simplify things.

[1]
[https://en.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn](https://en.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn)

~~~
artembugara
hiQ Labs vs. LinkedIn is a bad example. It is super specific.

LinkedIn did not own the data (it was the users').

In your case, you are reselling a copyrighted product.

Of course you can scrape. It does not mean you can distribute this.

~~~
hboon
What about Google Search API?

------
graupel
I used to run a large US publisher with 50+ sites writing 50,000 articles a
month; if we found someone scraping full articles and offering them for use
via an API it would be a problem - even more so if that person was charging for
it. I'm all for hacking and innovation around this space, just be careful here
and don't break TOS.

~~~
caballeto
Hi, thanks for the feedback. I am going to push changes to display only
snippets from the articles soon.

------
xiekomb
Hi,

Simple and effective API. The documentation lacks details about the query
parameter. I want to search for A AND B, but ?q=A%20B does not yield the
expected result (it seems to be an OR query), and searching for an exact phrase
like "A invests in B" does not seem to work. Can you please post details on
advanced syntax like this?

~~~
caballeto
Hi, thanks for the feedback! I haven't implemented this feature yet.
The query parameter gets tokenized as is, and then the tokens are used to
search the index. Could you provide more details about your use case? As I
understand it, you need AND, OR, NOT, and grouping () operators; anything
else? I will try to implement this today and write back to you.
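
To illustrate why a query that is simply tokenized can behave like an OR query, here is a toy Python sketch of an inverted index with both a union (OR) and an intersection (AND) lookup. The function names and sample documents are made up for illustration; this is not the service's actual implementation.

```python
from collections import defaultdict
from typing import Dict, Set


def build_index(docs: Dict[int, str]) -> Dict[str, Set[int]]:
    """Map each lowercased token to the ids of the documents containing it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def search_any(index: Dict[str, Set[int]], query: str) -> Set[int]:
    """OR semantics: a document matches if it contains any query token."""
    hits: Set[int] = set()
    for token in query.lower().split():
        hits |= index.get(token, set())
    return hits


def search_all(index: Dict[str, Set[int]], query: str) -> Set[int]:
    """AND semantics: a document matches only if it contains every token."""
    postings = [index.get(token, set()) for token in query.lower().split()]
    return set.intersection(*postings) if postings else set()


# Hypothetical sample data.
docs = {
    1: "Google buys startup",
    2: "Facebook launches feature",
    3: "Google and Facebook settle",
}
index = build_index(docs)
print(search_any(index, "google facebook"))  # every document mentioning either
print(search_all(index, "google facebook"))  # only documents mentioning both
```

With plain tokenization and a union over tokens, "A B" returns everything that mentions A or B, which matches the behavior described in the question.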

~~~
xiekomb
Yes, this is exactly what I (most users?) would need. Exact-phrase search
would also be a plus, e.g. "donald trump" instead of donald AND trump. What
kind of backend are you using: a simple database or a search index like
Elasticsearch (which would be more appropriate for such a project)?

~~~
caballeto
Hi, just finished implementing this. To search for A AND B, use 'q=A+B'. Here
are a few examples:

> curl -XGET -G 'api.datanews.io/v1/news' -d 'apiKey=API_KEY' --data-urlencode 'q="Europe throws new rule book at Google, tech giants to loosen market grip"'

> curl -XGET -G 'api.datanews.io/v1/news' -d 'apiKey=API_KEY' --data-urlencode 'q=google+facebook+amazon'

Supported operators:

+ AND, | OR, - NOT, () parentheses for grouping, "" exact match

EDIT: examples
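
For anyone calling the API from code rather than curl, here is a minimal Python sketch that builds the same kind of request URL. The `https://` scheme and the helper names (`build_search_url`, `all_of`, `phrase`) are assumptions for illustration; only the endpoint path, the `apiKey` and `q` parameters, and the operators come from this thread. The point is that a literal `+` has to be percent-encoded (as `%2B`), which is what `--data-urlencode` does in curl; an unencoded `+` in a query string is normally decoded as a space.

```python
from urllib.parse import urlencode

# Assumption: the thread gives the host and path without a scheme.
BASE_URL = "https://api.datanews.io/v1/news"


def all_of(*terms: str) -> str:
    """Join terms with the + (AND) operator listed in the thread."""
    return "+".join(terms)


def phrase(text: str) -> str:
    """Wrap text in double quotes for an exact-match search."""
    return f'"{text}"'


def build_search_url(query: str, api_key: str = "API_KEY") -> str:
    """Return a request URL with apiKey and q percent-encoded."""
    return BASE_URL + "?" + urlencode({"apiKey": api_key, "q": query})


print(build_search_url(all_of("google", "facebook", "amazon")))
print(build_search_url(phrase("donald trump")))
```

The resulting URL can be passed to any HTTP client; `API_KEY` is a placeholder, as in the curl examples.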

~~~
xiekomb
Kudos!

------
ibdf
Tried it... pretty straightforward. The free account rate is good for personal
use. I think it comes down to 4 calls per day, which is reasonable if you are
using this as maybe your daily news fetcher.

~~~
caballeto
Thanks for the feedback! Concerning rate limiting, I think I may have made the
rules a bit too strict, as the system is hosted on the AWS Free Tier. There is
definitely room to increase RPS per user, but I wanted to play it safe.

~~~
fluffernutter
This is great!

------
todsacerdoti
I added your API to Pipedream. Check out an example --
[https://pipedream.com/@tod/datanews-io-example-p_ljC13p/edit](https://pipedream.com/@tod/datanews-io-example-p_ljC13p/edit)

If you or your users need to build automated workflows with Datanews, please
let us know any feedback.

------
artembugara
Disclaimer: I am a co-founder of a similar News API service
[https://newscatcherapi.com/](https://newscatcherapi.com/)

No news API solution returns the full body text of the article (including us).
The reason is copyright infringement in the US and EU.

You can return only a chunk of it.

At least, that's what all the lawyers I spoke to told me.

~~~
caballeto
That's interesting. I've browsed a few alternative APIs, and if I understand
correctly, newsapi.org returns full-text articles on paid plans. I will
research this question further and may change the API to return only
part of the content.

------
moneywoes
Can you talk about how you avoid getting blocked by captcha? Do you use RSS
feeds? In that case do you not return images?

~~~
caballeto
The scraper only scraps the recent content, so it avoids scraping the same
links. As the daily number of articles published by a website is not very big,
I make requests with a break in between; the total number of requests is not
that big, so the IP is not blocked. Also, adding proxy servers is a possible
solution, i.e. having a proxy pool to route requests through. I guess in this
scenario it is easy to bypass CAPTCHA.
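
The approach described above (skip links already seen, pause between requests) could be sketched in Python roughly as follows. This is only an illustration under assumptions: the function names, the 2-second default delay, and the in-memory `seen` set are hypothetical, and the proxy pool is left out.

```python
import time
from typing import Callable, Dict, Iterable, Set
from urllib.request import urlopen


def fetch_html(url: str) -> str:
    """Plain HTTP GET of a page, decoded as text."""
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def crawl_new_links(
    links: Iterable[str],
    seen: Set[str],
    fetch: Callable[[str], str] = fetch_html,
    delay_seconds: float = 2.0,  # assumed value; the thread gives no number
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, str]:
    """Fetch only links not seen before, pausing between requests."""
    pages: Dict[str, str] = {}
    for url in links:
        if url in seen:
            continue  # already fetched on an earlier run, or a duplicate
        if pages:
            sleep(delay_seconds)  # break between consecutive requests
        pages[url] = fetch(url)
        seen.add(url)
    return pages
```

In a real deployment the `seen` set would presumably live in a persistent store rather than in memory, so that it survives between runs.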

~~~
TechBro8615
Nitpick: it’s “scrapes,” not “scraps.” To scrap something is to throw it away.
To scrape something is to ingest it. The two words are, in fact, nearly
opposite.

Sorry — I don’t mean to pick on you — it’s just that I see this mistake with
an inexplicably high frequency, and it’s like nails on a chalkboard for me.

Also, very cool service!

~~~
caballeto
Thanks, will take a note for the future.

------
jslakro
How do you get news from its sources?

~~~
caballeto
First, I extract the links to the news articles by searching for them on the
home page or in sitemaps. Then, for each article, I scrape the raw HTML and
extract key things like title, description, publish date, authors, etc.
Usually each news article has a schema embedded within it, which greatly
simplifies extracting article details.
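
As an illustration of the "embedded schema" step, here is a minimal Python sketch that pulls article fields out of a JSON-LD block. It assumes the schema is schema.org metadata in a `<script type="application/ld+json">` tag, which is common on news sites but is not confirmed in this thread; the class and function names are hypothetical.

```python
import json
from html.parser import HTMLParser
from typing import Any, Dict, List, Optional


class JsonLdCollector(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> tag."""

    def __init__(self) -> None:
        super().__init__()
        self._in_json_ld = False
        self._chunks: List[str] = []
        self.blocks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self.blocks.append("".join(self._chunks))
            self._in_json_ld = False


def _author_names(author: Any) -> List[str]:
    """Normalize a single author or a list of authors to a list of names."""
    authors = author if isinstance(author, list) else [author]
    return [a.get("name") if isinstance(a, dict) else a for a in authors if a]


def extract_article(html: str) -> Optional[Dict[str, Any]]:
    """Return title, description, date and authors from the first article schema."""
    collector = JsonLdCollector()
    collector.feed(html)
    for block in collector.blocks:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # skip malformed metadata
        for item in data if isinstance(data, list) else [data]:
            if isinstance(item, dict) and item.get("@type") in ("NewsArticle", "Article"):
                return {
                    "title": item.get("headline"),
                    "description": item.get("description"),
                    "published": item.get("datePublished"),
                    "authors": _author_names(item.get("author")),
                }
    return None
```

Pages without such a block return `None`, in which case a scraper would have to fall back to parsing the visible HTML.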

~~~
erwinkle
Do you scrape using web requests? Or do you use headless browsing to load the
page and extract the contents out of the DOM?

~~~
caballeto
I use standard web requests to get the raw HTML and parse the data from it, no
headless browsing. I guess all the news pages I've come across so far were
static.

------
sagunsh
What tech stack are you using? And how many news sources does it collect data
from?

~~~
caballeto
The tech stack is Python, Java, Spring, Elasticsearch, Redis/MySQL. This is a
beta version with explicit support for 50 sources + the Google News aggregator.
I plan to release a new version in about a week, which will support another
400-500 sources.

