
Google launches Public Datasets program - vgt
https://cloud.google.com/bigquery/public-data/
======
pbh101
I remember the earlier Google attempt at the same thing. I was in undergrad
and an engineer came to my Uni to give a presentation on the topic. A big area
of concern was "Why won't Google just shut this down when the going gets
rough?" The question was mostly batted away along the lines of "this is lunch
money to Google anyway." But just a couple years later, it did indeed bite the
dust.

Covered in Wired:
[http://www.wired.com/2008/01/google-to-provi/](http://www.wired.com/2008/01/google-to-provi/)

Update: And the shutdown article, 11 months later:
[http://www.wired.com/2008/12/googlescienceda/](http://www.wired.com/2008/12/googlescienceda/)

~~~
fhoffa
2008.

I remember those days. GMail was still in beta and ' _a big area of concern
was "Why won't Google just shut this down when the going gets rough?" '_

8 years later GMail has a billion users, and it's not going anywhere but up.

I haven't been to the future (yet), but I'll happily take any bet you have
against BigQuery.

Disclaimer: I'm Felipe Hoffa, and I work at Google.
([https://twitter.com/felipehoffa](https://twitter.com/felipehoffa))

~~~
philovivero
Can you give me an update on Wave and Orkut? I never use this Reader thing,
but I hear others do.

~~~
vgt
I'm happy to respond to every single person that makes this argument in
perpetuity :)

There's a difference between experimental free services and Enterprise-grade
SLA'd SLO'd fully-supported paid services with a very clear and recorded
deprecation policy and an army of customers with contracts and full support
from Google CEO and chairman.

~~~
dingaling
Indeed, but _who_ is paying to maintain 546GB of online storage (plus backups)
for Reddit comments in BigQuery, for example?

If the answer is "Google" then I think people are still right to be cautious.

Or to invert the question; could I put 1TB of my own 'interesting' data into
BigQuery and have Google maintain it in perpetuity for free? If not, then why
are any of these datasets considered safer?

~~~
illumin8
If you think Google even cares about 546GB of online storage you've gotta be
kidding. This is like asking if Google is going to eliminate the free water in
the water cooler because it costs a few cents per employee.

You can put 1TB of your own interesting data in BigQuery and pay
$0.02/GB/month for the data storage fees. I'm sure Google will happily take
your money as a paying customer, and keep your data for as long as you want to
store it and pay for it.
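
As a rough sanity check on those numbers, here is a minimal sketch of the arithmetic. It assumes the flat $0.02/GB/month rate quoted above applies to every gigabyte and counts 1TB as 1024GB; `monthly_storage_cost` is a made-up helper name, and actual pricing may differ.

```python
# Back-of-the-envelope storage cost at the rate quoted in this comment.
PRICE_PER_GB_MONTH = 0.02  # USD per GB per month, as quoted above (assumed flat)


def monthly_storage_cost(gigabytes, price_per_gb=PRICE_PER_GB_MONTH):
    """Return the monthly storage bill in USD for the given number of gigabytes."""
    return gigabytes * price_per_gb


REDDIT_COMMENTS_GB = 546  # dataset size mentioned upthread
ONE_TB_IN_GB = 1024       # assumption: 1TB counted as 1024GB

for label, size_gb in [("Reddit comments", REDDIT_COMMENTS_GB), ("1TB of your own data", ONE_TB_IN_GB)]:
    per_month = monthly_storage_cost(size_gb)
    print(f"{label}: ${per_month:.2f}/month, ${per_month * 12:.2f}/year")
```

At that rate the 546GB dataset comes to roughly $11 a month, which is the scale the water cooler comparison is pointing at.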

------
vgt
One large difference between this program and alternative programs is that
data already resides in Google BigQuery:

\- You do not need to spin up a database to work with BigQuery

\- You can simply start writing SQL on top of BigQuery

\- You may leverage Dataflow and MapReduce connectors to work with this data
directly in Hadoop, Spark, or Dataflow

\- BigQuery has a free tier - one Terabyte of data processed per month

Finally, for folks who would like to share their datasets, BigQuery offers
free hosting and credits to help get a pipeline going.

(disclosure: work on BigQuery)

~~~
dwmintz
And if you've got public data in BigQuery that you want to make
explorable/shareable/visualizable to anybody (no SQL required), let Looker
know and we'll see what we can do. (disclosure: work at Looker)

~~~
vgt
Looker has been a great launch partner for this program. Really helped us
develop and deliver great interactive dashboards.

------
ortusdux
"HACKER NEWS - A dataset that contains all stories and comments from Hacker
News since its launch in 2006."

I know what I'm doing this weekend.

~~~
panarky
I'm dying to get comment scores, not just submission scores.

There's a lot you could do with that to find the best comments, which is
really why HN is so awesome.

~~~
minimaxir
Blame HN for that. It was removed from public access right after I made a blog
post about it, although it was coincidental I think.

~~~
jrockway
Though the data could be assembled; I can see my comment scores, you can see
your comment scores...

------
polartx
I work and play in data. By far the best resource I've encountered is
[https://app.enigma.io/](https://app.enigma.io/)

Signups are free. The aggregated public data is plentiful and easily
discovered, indexed, filtered, and exported.

Free accounts have API limitations, but as far as govt data is concerned, I
don't find that it's updated often enough to peg my API rate limiter anyway.

~~~
frikk
Awesome. Do you have any other resources to share? I run occasional local
hackathons and am always looking for inspiring data sets.

------
dwmintz
Great to see folks digging into this. Looker (where I work) is what you're
seeing visualize the underlying datasets and make them explorable. Our founder
did a great blog post explaining how it works and how quickly these datasets
yield interesting insights with just a few lines of code.

Full blog post explaining and walking through the process (and giving access
to explore the data fully yourself, no SQL required) is here:
[http://looker.com/blog/hacking-hacker-news](http://looker.com/blog/hacking-hacker-news)

Felipe and all the other folks at Google have done a great job getting this
project off the ground and we're psyched to partner with them. We're working
on some new public datasets now, but if you have particular ones you'd like to
explore, let us know.

What should we look at next? Census? Medicare?

~~~
vgt
Kudos to Looker, great partners here! We're just getting started :)

------
nl
This is a much more useful list:
[https://www.reddit.com/r/bigquery/wiki/datasets](https://www.reddit.com/r/bigquery/wiki/datasets)

I'm not sure why Google hosts it on Reddit. There's some interesting (and more
up-to-date) stuff on there.

~~~
jpatokal
That's a list of random data from random people, curated by Googler
felipehoffa in an unofficial capacity.

The ones in the submitted page are maintained by Google itself in the public-
data project.

~~~
nl
I believe that the GDelt, Freebase and Genomics tables (at least) are
officially supported by Google.

------
flashman
I just ran a Markov text generator over the USA Name Database and created a
few thousand girls' names, including:

Aracella, Ashla, Blakelyne, Carylou, Damariah, Enchantrelle, Francenza,
Iridia, Jalexius, Lilliotte and Scotlanta.

Hey, maybe I can start a bespoke baby name business...
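
For anyone who wants to try the same trick, here is a minimal sketch of a character-level Markov name generator. It is not the code used above: the order-2 model, the helper names `build_model` and `generate_name`, and the tiny seed list are all assumptions for illustration. A real run would train on the names exported from the USA Names dataset instead.

```python
import random
from collections import defaultdict

START, END = "^", "$"  # padding markers; assumed not to occur in real names


def build_model(names, order=2):
    """Map each `order`-character context to the list of characters seen after it."""
    model = defaultdict(list)
    for name in names:
        padded = START * order + name.lower() + END
        for i in range(len(padded) - order):
            model[padded[i:i + order]].append(padded[i + order])
    return model


def generate_name(model, order=2, max_len=12, rng=random):
    """Walk the chain from the start context until END is drawn or max_len is reached."""
    state = START * order
    letters = []
    while len(letters) < max_len:
        nxt = rng.choice(model[state])
        if nxt == END:
            break
        letters.append(nxt)
        state = state[1:] + nxt  # slide the context window forward by one character
    return "".join(letters).capitalize()


# Illustrative seed list only, not pulled from the dataset.
SEED_NAMES = ["Emma", "Olivia", "Isabella", "Sophia", "Charlotte",
              "Amelia", "Evelyn", "Abigail", "Madelyn", "Scarlett"]

name_model = build_model(SEED_NAMES)
rng = random.Random(42)  # fixed seed so repeated runs give the same names
print([generate_name(name_model, rng=rng) for _ in range(5)])
```

A higher `order` makes the output stick closer to real names, while a lower one produces stranger blends. With a seed list this small many outputs will simply reproduce the inputs; the full dataset gives the chain far more transitions to mix.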

~~~
alexwebb2
I'm building a baby name app and this is a fantastic idea that I am probably
going to steal.

------
talles
Who is the King of Hacker News?

[https://cloud.google.com/bigquery/public-data/hacker-news#wh...](https://cloud.google.com/bigquery/public-data/hacker-news#who_is_the_king_of_hacker_news)

~~~
lloydt
Click on the Stories Count and you can see the individual stories :)

------
bakztfuture
Any chance we'll be seeing the Common Crawl data on there anytime soon?

~~~
vgt
Sounds like a good one. We'll be sure to reach out to them. Or, if you know
them, have them shoot us a note at:

bq-public-data@google.com

------
tmannen
I just noticed the Freebase data available as one of the public datasets, so
I've been wondering what happened to the Freebase team?

I liked what Freebase, DBpedia etc. were doing a few years ago with "semantic
DBs". I remember Freebase Gridworks being extremely useful for data cleanup.
I haven't had a reason to follow the space... is it dead? I haven't seen much
semantic talk of late. Where would one go to catch up on the latest news?

~~~
fhoffa
Freebase data is being moved into Wikidata.

I loaded Wikidata into BigQuery too!

Fun with movies and cats:
[https://medium.com/google-cloud/oscars-2016-movies-that-got-...](https://medium.com/google-cloud/oscars-2016-movies-that-got-the-most-attention-on-wikipedia-151cd56f4fc0)

Implementation notes:
[https://lists.wikimedia.org/pipermail/wikidata/2016-March/00...](https://lists.wikimedia.org/pipermail/wikidata/2016-March/008414.html)

More notes:
[https://lists.wikimedia.org/pipermail/wikidata/2016-March/00...](https://lists.wikimedia.org/pipermail/wikidata/2016-March/008427.html)

~~~
krishna2
Super interesting. Thanks for curating and promptly posting all these relevant
links. StackOverflow data is available for free too; that would be one awesome
dataset to let folks get their hands on. Any plans for that?

------
danvoell
[https://cloud.google.com/bigquery/public-data/hacker-news](https://cloud.google.com/bigquery/public-data/hacker-news)

------
marchenko
Looking at the suggested textual analyses of the Reddit and HN datasets, and
thinking about how they can be combined, makes me wonder if
anonymity-through-multiple-avatars will even be remotely possible 5 years from
now.

------
th0br0
I get 404s for all Looker iframes...

~~~
apahwa
where are you seeing Looker being used?

~~~
dwmintz
Looker is powering all the iframes (any 404s should be fixed now). BigQuery is
hosting the data, but Looker is generating the queries and visualizations of
the data.

~~~
apahwa
Interesting, I'm surprised Google is using Looker rather than their own tool

edit: ah, I see, it is just one of the partners they have with BigQuery and
they are apparently using it for their website (I'm guessing because Looker
lets you iframe graphs directly into the page for free).

~~~
dwmintz
Yup, Looker is a Premier Partner for the GCP launch
([https://cloud.google.com/partners/?q=Looker#search](https://cloud.google.com/partners/?q=Looker#search)).
And then we (I work at Looker) are specifically partnering with GCP for the
Public Datasets project.

Because LookML (Looker's modeling language) makes it easy to explore and
visualize big datasets, we're building out models and dashboards so visitors
can get a sense of BigQuery/Looker's power and find insights from the datasets
quickly (whether or not they write SQL).

Frankly, there's a ton of public data that's "available" in the sense that you
can technically download and clean a CSV, but isn't actually easy to extract
meaning from. So we figured we'd select some interesting datasets, do the
cleaning, uploading and modeling, and then let folks have at it for free.

If there are specific datasets you'd be interested in seeing, let us know and
we'll see what we can do.

------
voltagex_
Any chance of getting GNAF and some other Australian datasets up there?

[http://data.gov.au/dataset/geocoded-national-address-file-g-...](http://data.gov.au/dataset/geocoded-national-address-file-g-naf)

A CKAN->BigQuery connector would be interesting (think of an "Open in
BigQuery" button)

~~~
fhoffa
Twitter noticed:

[https://twitter.com/_pwalsh/status/715054406073524224](https://twitter.com/_pwalsh/status/715054406073524224)

------
cosmeen
Seems like it has a dataset of hacker news since it launched in 2006

\--

    
    
      SELECT * FROM [bigquery-public-data:hacker_news.stories] ORDER BY score DESC LIMIT 100
    

\--

made a screenshot with Top 20 stories on HN
[http://i.imgur.com/czXYEyQ.png](http://i.imgur.com/czXYEyQ.png)
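
To try the shape of that query without a BigQuery account, here is a minimal local sketch using Python's built-in `sqlite3`. The `stories` table, its two columns, and the sample rows are made-up stand-ins for illustration, not the real dataset or its schema.

```python
import sqlite3

# In-memory stand-in for the public table; rows are invented sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stories (title TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO stories (title, score) VALUES (?, ?)",
    [("story a", 120), ("story b", 45), ("story c", 300), ("story d", 7)],
)


def top_stories(connection, limit):
    """Return the `limit` highest-scoring stories, best first."""
    cursor = connection.execute(
        "SELECT title, score FROM stories ORDER BY score DESC LIMIT ?",
        (limit,),
    )
    return cursor.fetchall()


for title, score in top_stories(conn, 2):
    print(score, title)
```

The sketch names its columns instead of using `SELECT *`. On BigQuery that habit matters more than it does locally, since queries are billed by data processed and unneeded columns count against the free tier.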

------
mark_l_watson
Public data is good! I have used their book text data on two projects to
generate common ngrams. These newly released data sets look useful and using
BigQuery is reasonable if you just need parts of data sets.

Amazon does something similar: keeping useful data in S3 for cheap access when
running on AWS.

------
javiramos
There's a Hacker News dataset. In case anyone wants to take a crack at it:
[https://cloud.google.com/bigquery/public-data/hacker-news](https://cloud.google.com/bigquery/public-data/hacker-news)

------
temuze
I hope someone submits voice recognition data!

It's fantastically hard to find audio->phoneme datasets and only slightly less
difficult to find audio->word and word->possible_phonemes...

------
dcdevito
Sorry Google, but my trust in cloud solutions resides in AWS and Azure.

Why? Because when Amazon and Microsoft announce something I know there's a
good chance it will still exist 24 months later.

~~~
vgt
This narrative gets repeated time and time again, and it really doesn't hold
up to even surface debate:

\- Google TI and Cloud merged. Same teams. Per Urs Holzle in Jan 2014

\- Just last week Sundar (CEO), Eric Schmidt (chairman), Jeff Dean, Urs
Holzle, and Diane were on stage talking about how they spent $10 billion last
year investing in infrastructure and how serious Google is about cloud
([https://www.youtube.com/watch?v=HgWHeT_OwHc&list=PLIivdWyY5s...](https://www.youtube.com/watch?v=HgWHeT_OwHc&list=PLIivdWyY5sqIFd0E6JG1hVr8sXQaLmmBP))

\- In many cases, publicly available technologies are actually replacing
internal technologies (BigQuery and Dremel, Dataflow and FlumeJava/Millwheel
for example).

\- Disney, Coca Cola, Snapchat, Spotify, Home Depot are just some of the
customers running on Google Cloud. It'd be very foolish to abandon these.

\- If you watch the day two GCP NEXT keynote, you'll see how Google is the
cloud leader in environmental responsibility and open source. We're the good
guys :)
([https://www.youtube.com/watch?v=axhdIa_co2o&list=PLIivdWyY5s...](https://www.youtube.com/watch?v=axhdIa_co2o&list=PLIivdWyY5sqIFd0E6JG1hVr8sXQaLmmBP&index=2))

Happy to debate further!

------
kwoff
How can they copy our comments from here to there and sell queries on them...?

------
kuschku
Can we extract the datasets to work with them on our own (cheaper to query)
servers?

~~~
vgt
Of course! BigQuery export operations are free.

Although compared with other cloud technologies, BigQuery is incredibly cost-
effective :)

~~~
kuschku
Well, I’ve checked the prices, and dedicated servers from old-style hosters
are still far cheaper if you’re going to query data 24/7.
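
A rough way to frame that comparison is a break-even calculation. The sketch below uses placeholder prices: the $100/month server and the $5 per TB processed are hypothetical numbers, not quotes from any provider, and the 1TB free tier is the figure mentioned elsewhere in this thread. It also ignores whether a single server could actually handle the workload.

```python
def on_demand_cost(tb_scanned, price_per_tb, free_tb=1.0):
    """Monthly pay-per-query bill in USD: only terabytes beyond the free tier are charged."""
    return max(0.0, tb_scanned - free_tb) * price_per_tb


def break_even_tb(server_cost, price_per_tb, free_tb=1.0):
    """Monthly scan volume (TB) at which the pay-per-query bill equals a flat server bill."""
    return free_tb + server_cost / price_per_tb


SERVER_COST_USD = 100.0  # hypothetical flat monthly price of a dedicated server
PRICE_PER_TB_USD = 5.0   # hypothetical on-demand price per TB processed

threshold = break_even_tb(SERVER_COST_USD, PRICE_PER_TB_USD)
print(f"Break-even at {threshold:.1f} TB scanned per month")
print(f"Bill at that volume: ${on_demand_cost(threshold, PRICE_PER_TB_USD):.2f}")
```

Below the break-even volume, paying per query is cheaper under these assumptions; above it, the flat-rate server wins, which is the 24/7 querying case described above.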

------
benhamner
In a similar vein, Kaggle datasets enables you to run Python, R, Julia, and
SQL on many public datasets
[https://www.kaggle.com/datasets](https://www.kaggle.com/datasets)

------
baabaa
Site doesn't render well on my Moto X.

