
AWS Data Exchange - jeffbarr
https://aws.amazon.com/blogs/aws/aws-data-exchange-find-subscribe-to-and-use-data-products/
======
buboard
Wow we were just talking about selling shovels in a (ML) gold rush.

Incidentally, what is the open source alternative of this? Data is so cheap
that it should be actually free, unlike counterfeit nike shoes.

(Does a bittorrent tracker specifically for research data exist? Edit: there's
[http://academictorrents.com/](http://academictorrents.com/))

~~~
meritt
> What is the open source alternative of this?

There are many public-domain datasets:

* [https://aws.amazon.com/opendata/public-datasets/](https://aws.amazon.com/opendata/public-datasets/)

* [https://cloud.google.com/public-datasets/](https://cloud.google.com/public-datasets/)

* [https://github.com/awesomedata/awesome-public-datasets](https://github.com/awesomedata/awesome-public-datasets)

> Data is so cheap that it should be actually free

I feel like you haven't had the opportunity to work with data that has value.

~~~
buboard
if a dataset is really valuable, it won't be available for public
consumption/copying, and, like nike, won't be selling itself on amazon

~~~
meritt
> if a dataset is really valuable, it won't be available for public consumption

Data vendors sell valuable data all the time to essentially anyone who is
willing to pay their asking price. I don't see how this is any different? Just
because aws is trying to run a marketplace doesn't suddenly make the data
"public" in an open/free sense, there's still a price tag attached to it.

> and, like nike, won't be selling itself on amazon

Not every data vendor (or retailer since you keep bringing up Nike) has the
necessary brand recognition, marketing budget, or technical proficiency to
only sell direct to customers: that's why centralized marketplaces and
alternate distribution channels exist.

~~~
buboard
> valuable data all the time to essentially anyone

if it's truly valuable it isn't sold to 'essentially anyone'. like high-value
financial info or security.

------
exhilaration
This is nice, but I don't see the pricing after the free trials. The Pitney
Bowes data [1] they used as an example in the linked article only shows $0 for the
free trial, not what it's going to cost you afterwards. It'd be nice to know
the long term cost before tying this data into your business.

[1] [https://aws.amazon.com/marketplace/pp/prodview-bwf7mapyyjzom...](https://aws.amazon.com/marketplace/pp/prodview-bwf7mapyyjzom?qid=1573682854054&sr=0-1&ref_=srh_res_product_title)

~~~
skissane
Their screenshot shows an "autorenew" enabled on a $0 free trial with zero
information on how much it will cost after the free trial is up.

You shouldn't allow people to sign up to "autorenew" at the end of their free
trial period without clearly displaying the price that will apply when the
trial ends.

~~~
bowmessage
It looks like the trial dataset in this case is simply a truncated subset of
the full dataset (only San Francisco zip code boundaries). It would likely
remain free forever under this model of "trial".

------
obituary_latte
It looks like this is targeted at ML/AI but I have a tangentially related
question: does anyone know of open source or other publicly available lists of
US businesses? Just business name and address?

I’m building out an app and we receive documents from all kinds of vendors
from all over the country. The app is for our business to manage our client
data. I was hoping to find a list of businesses I could throw in the db rather
than piecemeal add the addresses in one by one as the documents come in.

I looked at some of the data service providers (infoUsa I think was one and
d&b being another), but one dataset for just business names and addresses they
were asking $50,000 for. I think my use-case is unique in that these companies
typically sell this data as sales lead data which it definitely is not in my
case (we don’t even sell b2b).

Anyone know of anything like this? I suppose I could just scrape phone books
but I think if I can’t find the data we will just resort to one by one entry.

~~~
jedieaston
It’s been a minute since I looked at the docs, but could you use the Google
Maps API for this (or perhaps OpenStreetMap)? You could query all of the POI
categories for a city/state/country and save the addresses and names. Might be
something, unless you need the legal name of the business rather than whatever
they make public.
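
A rough sketch of the OpenStreetMap half of that idea, assuming you've already fetched the POIs as Overpass-style JSON (a list of elements, each with a `tags` dict). The function name and the sample records are made up for illustration, and real OSM data will often be missing some of the `addr:*` tags:

```python
def osm_business_rows(elements):
    """Turn Overpass-style elements into (name, address) rows, skipping unnamed ones."""
    rows = []
    for element in elements:
        tags = element.get("tags", {})
        name = tags.get("name")
        if not name:
            continue  # benches, trees, etc. have no business name
        # Build "12 Main Street" from whichever pieces are present.
        street = " ".join(
            p for p in (tags.get("addr:housenumber"), tags.get("addr:street")) if p
        )
        parts = [street, tags.get("addr:city"), tags.get("addr:state"), tags.get("addr:postcode")]
        address = ", ".join(p for p in parts if p)
        rows.append((name, address))
    return rows


# Hypothetical sample records, shaped like the "elements" list in an Overpass JSON reply.
sample_elements = [
    {"type": "node", "id": 1, "tags": {
        "name": "Example Hardware", "shop": "hardware",
        "addr:housenumber": "12", "addr:street": "Main Street",
        "addr:city": "Springfield", "addr:postcode": "00000"}},
    {"type": "node", "id": 2, "tags": {"amenity": "bench"}},
    {"type": "node", "id": 3, "tags": {"name": "Example Cafe", "amenity": "cafe"}},
]

rows = osm_business_rows(sample_elements)
```

Rows with an empty address would still need to be filled in by hand, so this only cuts down the one-by-one entry rather than replacing it.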

~~~
obituary_latte
Legal names aren't critical but would be nice. I think that if the business
posts the name that would be close enough.

Very interesting idea - many thanks!

------
slenk
As someone who deals with HIPAA every day, it really bothers me that Change
Healthcare is there even if the data is "anonymized".

~~~
Supermancho
I'm not sure why it would bother you. I can generate random data in the same
format and you can't figure out which is real or whose data it is. Thinking
about data as if it has intrinsic value by being recorded is foolish, as many
find out during acquisitions.

~~~
kupopuffs
With another dataset that is non-anonymous, you can cross-verify with the
anonymous data to increase your confidence of who they are. This happened when
Netflix released anonymized user ratings. Researchers were able to deanonymize
some users by using IMDB ratings.

[https://www.wired.com/2007/12/why-anonymous-data-sometimes-i...](https://www.wired.com/2007/12/why-anonymous-data-sometimes-isnt/)
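
To make the cross-referencing concrete, here is a toy sketch of that kind of linkage. It is not the method the researchers used, just the general idea: count how many (movie, rating, approximate date) triples an anonymous id shares with each public profile and keep the best match if the overlap is large enough. All names, ids, thresholds and data below are invented:

```python
from collections import defaultdict


def link_users(anon_ratings, public_ratings, min_overlap=3, max_day_gap=3):
    """Map anonymous ids to the public name with the most agreeing ratings.

    Both inputs are lists of (user, movie, stars, day) tuples.
    """
    # Index the public ratings by (movie, stars) for quick lookup.
    public_index = defaultdict(list)
    for name, movie, stars, day in public_ratings:
        public_index[(movie, stars)].append((name, day))

    # Count agreeing ratings per (anonymous id, public name) pair.
    votes = defaultdict(lambda: defaultdict(int))
    for anon_id, movie, stars, day in anon_ratings:
        for name, public_day in public_index.get((movie, stars), []):
            if abs(day - public_day) <= max_day_gap:
                votes[anon_id][name] += 1

    # Keep only the best candidate, and only if the overlap is large enough.
    matches = {}
    for anon_id, counts in votes.items():
        name, score = max(counts.items(), key=lambda kv: kv[1])
        if score >= min_overlap:
            matches[anon_id] = name
    return matches


# Made-up data: (user, movie, stars, day number)
anon = [(101, "A", 5, 10), (101, "B", 2, 12), (101, "C", 4, 15),
        (102, "A", 5, 40), (102, "D", 1, 41)]
public = [("alice", "A", 5, 11), ("alice", "B", 2, 12), ("alice", "C", 4, 16),
          ("bob", "A", 5, 10), ("bob", "E", 3, 20)]

matches = link_users(anon, public)
```

Here id 101 lines up with "alice" on three ratings, while id 102 has nothing to match against and stays anonymous. The more obscure the rated items, the fewer overlaps you need to be confident.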

~~~
jcims
Similar thing happened a while back when AOL released 'anonymized' search data
-
[https://en.wikipedia.org/wiki/AOL_search_data_leak](https://en.wikipedia.org/wiki/AOL_search_data_leak)

I dug through it just out of morbid curiosity and unfortunately stumbled upon
a few searches indicating that a friend privately suffered a miscarriage. This
was only possible because I recognized her as the only person that I knew at
the intersection of a few other search terms that were correlated to the same
'anonymized' id.

GP is assuming a naive starting point when that is rarely the case.

------
ekzhu
There are many free public datasets available on the web.

I have an open source project that crawls public datasets and makes them
searchable in one place:
[https://github.com/findopendata/findopendata](https://github.com/findopendata/findopendata).

------
say_it_as_it_is
They're crowd-sourcing valuable information services from third parties to
become a market data provider.

What could go wrong for information providers where Amazon controls their
market and infrastructure? They become commoditized "data providers". They are
coerced into profit sharing with Amazon. They are eventually replaced by
Amazon-provided data.

I won't buy from this market because I see where this is heading. It's the
same reason I avoid buying many other services and products from
Amazon. It offers no additional value other than minor convenience to a
customer at a much greater cost to the economy and providers.

Buying local isn't just for produce.

~~~
heyflyguy
I've been thinking about your comment for nearly an hour now, wondering about
all of the potential paths this could take.

If your company's goal is to sell commodity data and make money on volume rather
than on high-margin, less frequent sales, then maybe this is not so bad.

To your point about Amazon eventually taking this over, I do believe that to
be the case when there is near ubiquitous demand for the data type. And just
looking over what is in there now, these are some pretty specific datasets.

I'm not saying you're wrong, I will be considering your comments for quite a
while.

I have said this on here before and I think it still holds true. Amazon is
like Jay Leno.

Conan O'Brien once said:

“Hosting The Tonight Show has been the fulfillment of a lifelong dream to me.
And I want to say to the kids out there watching, you can do anything you want
in life unless Jay Leno wants to do it, too.” Apr 17, 2014

------
djs070
As a consumer of a paid dataset, how would I trust that the vendor is
publishing accurate and complete data?

~~~
rhizome
You can't, and they aren't.

[https://adexchanger.com/the-sell-sider/third-party-data-is-a...](https://adexchanger.com/the-sell-sider/third-party-data-is-a-bad-habit-we-need-to-kick/)

------
borramakot
How does a data provider prevent someone from copying the data from their S3
bucket into a new one, then cancelling the subscription and owning the data
forever?

~~~
markolschesky
At least in the case of subscriptions I looked at:

1) It's a continually updated file, so if you didn't subscribe you wouldn't
get new data weekly/monthly. Likely the subscription aligns with the data
refreshes for most products.

2) It's a one-time fee with a hefty cost attached (I saw some healthcare data
sets that were $100K+). You are paying for that in its entirety and just have
data rights to it.

~~~
maximegarcia
On one of the examples, the update is "quarterly"...

------
shubb
Probably money to be made grabbing hard to get hold of open datasets and
listing them here until someone complains?

~~~
bootloop
Like stolen or leaked data?

~~~
justaguyhere
Easy to find datasets but hard to use data collected by the government - for
example, tons of gov data is released in PDF format (and other obscure XML
formats). Even when the data is available in machine readable formats, you
still need to read through 50-page readme/data dictionary files to understand
the meaning of the data.

There is tremendous value in cleaning and repackaging this data in easy
formats (offering an API would be awesome too). Even better if the provider
can offer human support.

Obviously it would be ideal if the government releases all this data in easy
formats in the first place, which a lot of governments do, but not all. The
second best thing is for some private companies to help, even if they charge
for it.

Not your parent commenter, but speaking from experience.

Here is a trivial example : I was looking for a list of all government
websites (and their social media accounts), starting with federal all the way
to small towns. There are lists, but none complete - at least none that I
could find, and definitely not with their social media handles.

------
sansnomme
Will this affect Bloomberg's business?

------
rshnotsecure
Did anyone else find Jeff’s first sentence terribly unoriginal and somewhat
wimpy?

“We live in a data-intensive, data-driven world!”

I know these blog posts are turned out fast, but especially for such a
sensitive issue as a world awash in data that no one understands and no one -
yet - controls...it seemed like it was “whistling past the graveyard”.

~~~
jeffbarr
Sorry to let you down.

------
RocketSyntax
This used to be called AWS Juntos lol

~~~
RocketSyntax
Seriously! That's what it was called last year.

