Hacker News | Google launches Public Datasets program (cloud.google.com)
423 points by vgt on Mar 29, 2016 | 93 comments

I remember the earlier Google attempt at the same thing. I was in undergrad and an engineer came to my Uni to give a presentation on the topic. A big area of concern was "Why won't Google just shut this down when the going gets rough?" The question was mostly batted away along the lines of "this is lunch money to Google anyway." But just a couple years later, it did indeed bite the dust.

Covered in Wired: http://www.wired.com/2008/01/google-to-provi/

Update: And the shutdown article, 11 months later: http://www.wired.com/2008/12/googlescienceda/

Google BigQuery is a paid service with Dremel under the hood, used internally at Google by ~80% of Googlers every week. It's simply not going away, because Google relies on it heavily.

Externally, companies like Spotify and Kabam Games rely heavily on BigQuery as well:



So the story of Google Cloud products, which are paid, supported, and Enterprise-grade, and which are covered by a clear one-year deprecation policy, is very different from the story of free services.

Public Datasets is just a function of BigQuery's unique ACL functionality.

>> Externally, companies like Spotify

I hate people bringing them up; it's just being a sucker for marketing. Spotify and Netflix are clearly brands recruited to promote their respective platforms. The only question is how large a discount they are getting, among other special treatment, in return for their name. I wouldn't be surprised if it's conditional on a gag clause barring criticism of the platform, too. These are the sorts of situations that are revealed years later to have required a totally isolated and custom solution that bore no resemblance to the service it was intended to promote.

Special partnerships prove nothing because the economics and service you get won't be remotely similar.

You can listen and read Spotify engineers talking about their experiences with Google, including running 2,000,000 qps load tests without ever telling anyone at Google.

If you read some of their blogs, you can see how brutally honest they are about the pros and cons of our services:




Finally, does this sound like an engineer who got paid to speak? https://twitter.com/mpjme/status/702167231623516160

> It's simply not going away because Google relies on it heavily.

Still doesn't mean Google will maintain public access to this database. They might well shut it down and limit access to Googlers, like they already did a few years ago.

When you use these datasets you pay for how much data the query has to scan. For example, if you do a select on a single 8-byte column in a table with 1 billion rows, you are paying for 8 GB.

Every month the first TB is free, but beyond that you pay $5 per TB scanned. They are not exactly doing this for free.
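To make that concrete, here's a back-of-the-envelope calculator for the scan-based pricing described above. It's a sketch, not official pricing logic; the figures ($5/TB scanned, first 1 TB free per month) are the ones quoted in this thread.

```python
TB = 1024 ** 4          # bytes per terabyte
PRICE_PER_TB = 5.00     # USD per TB scanned, as quoted above
FREE_TB_PER_MONTH = 1   # first TB scanned each month is free

def query_cost(bytes_scanned, bytes_already_scanned_this_month=0):
    """Estimated cost in USD of a query scanning `bytes_scanned` bytes."""
    free_left = max(0, FREE_TB_PER_MONTH * TB - bytes_already_scanned_this_month)
    billable = max(0, bytes_scanned - free_left)
    return billable / TB * PRICE_PER_TB

# Selecting a single 8-byte column over 1 billion rows scans ~8 GB,
# which fits entirely inside the monthly free tier:
print(query_cost(8 * 10**9))  # -> 0.0

# The same query after the free tier is already used up costs a few cents:
print(round(query_cost(8 * 10**9, bytes_already_scanned_this_month=TB), 4))
```

Note that you pay for the full size of the columns you reference, not for the rows a WHERE clause happens to match.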

Google employees using something internally doesn't mean that it won't go away for everyone else... Also, they could decide to honor existing contracts and stop issuing new ones...

BigQuery is a truly fully-managed service. This means that SREs are on call 24/7; it's not software that one downloads. Hence it really doesn't make a lot of sense to have it available to just a few folks, since the incremental overhead is low.

I'd love to hear industry standards around deprecation of paid services. This is simply how business is done the world over, and I'm not sure why Google is being held to an unreasonable and impossible standard of "never deprecate anything ever!". For example, here is Microsoft's:


I was not blaming anybody, it's possible that a service would be more valuable to a company if it's kept internally.

Ah yes, point well taken. I think we've seen this historically at Google, but things have changed in the past few years.

Primarily as a reaction to AWS and Azure?

Google announced Colossus in 2011; is that going to make it to GCP?

You can use Colossus today! It's a very low level service, leveraged by other GCP services.

For example, Google Cloud Storage and BigQuery Storage both sit on Colossus. Read more on BQ here:


BigQuery is covered by the Google Cloud deprecation policy: https://cloud.google.com/terms/deprecation

Google is required to give at least a 1 year warning on shutting down released Cloud products.


I remember those days. GMail was still in beta and ' a big area of concern was "Why won't Google just shut this down when the going gets rough?" '

8 years later GMail has a billion users, and it's not going anywhere but up.

I haven't been to the future (yet), but I'll happily take any bet you have against BigQuery.

Disclaimer: I'm Felipe Hoffa, and I work at Google. (https://twitter.com/felipehoffa)

They are not really comparable.

The value of an email service is how many users it has.

The value of a big database is... who knows? If Google one day decides the content of that database is a strategic asset, they might just as well decide to shut it down to the public and continue using it internally.

As long as IMAP support doesn't go the way of XMPP support for gtalk/hangouts...

GMail is a low-margin business compared to search advertising. Does it have any non-creepy way to make huge margins? If not, why does getting big mean anything but danger?

Can you give me an update on Wave and Orkut? I never use this Reader thing, but I hear others do.

I'm happy to respond to every single person that makes this argument in perpetuity :)

There's a difference between experimental free services and Enterprise-grade, SLA'd, SLO'd, fully-supported paid services with a very clear and recorded deprecation policy, an army of customers with contracts, and full support from Google's CEO and chairman.

Indeed, but who is paying to maintain 546GB of online storage ( plus backups ) for Reddit comments in BigQuery, for example?

If the answer is "Google" then I think people are still right to be cautious.

Or to invert the question; could I put 1TB of my own 'interesting' data into BigQuery and have Google maintain it in perpetuity for free? If not, then why are any of these datasets considered safer?

If you think Google even cares about 546GB of online storage you've gotta be kidding. This is like asking if Google is going to eliminate the free water in the water cooler because it costs a few cents per employee.

You can put 1TB of your own interesting data in BigQuery and pay $0.02/GB/month for the data storage fees. I'm sure Google will happily take your money as a paying customer, and keep your data for as long as you want to store it and pay for it.

This is a paid service.
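At the rates quoted above, the storage arithmetic is simple. A quick sketch (again using the thread's figures, not an official rate card):

```python
STORAGE_PRICE_PER_GB_MONTH = 0.02  # USD, as quoted in this thread

def monthly_storage_cost(gb):
    """Monthly BigQuery storage cost in USD for `gb` gigabytes."""
    return gb * STORAGE_PRICE_PER_GB_MONTH

# 1 TB of your own "interesting" data:
print(round(monthly_storage_cost(1024), 2))

# The 546 GB Reddit-comments dataset mentioned above:
print(round(monthly_storage_cost(546), 2))
```

Both come out to tens of dollars a month, which is the point being made: hosting these datasets is pocket change at Google's scale.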

One large difference between this program and alternative programs is that data already resides in Google BigQuery:

- You do not need to spin up a database to work with BigQuery

- You can simply start writing SQL on top of BigQuery

- You may leverage Dataflow and MapReduce connectors to work with this data directly in Hadoop, Spark, or Dataflow

- BigQuery has a free tier: one terabyte of data processed per month

Finally, for folks who would like to share their datasets, BigQuery offers free hosting and credits to help get a pipeline going.

(disclosure: work on BigQuery)

And if you've got public data in BigQuery that you want to make explorable/shareable/visualizable to anybody (no SQL required), let Looker know and we'll see what we can do. (disclosure: work at Looker)

Looker has been a great launch partner for this program. Really helped us develop and deliver great interactive dashboards.

> Finally, for folks who would like to share their datasets, BigQuery offers free hosting and credits to help get a pipeline going.

Is this available in an automated fashion or is it something one has to contact Google for?

You can always create a dataset and make it public by setting open ACLs on it. The only downside is paying $0.02 per GB per month.
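For illustration, "setting open ACLs" boils down to adding a reader entry for everyone to the dataset's access list. A minimal sketch of that entry, assuming the JSON shape of the BigQuery v2 datasets API (`access[]` with a `specialGroup` of `allAuthenticatedUsers`):

```json
{
  "access": [
    {"role": "READER", "specialGroup": "allAuthenticatedUsers"}
  ]
}
```

Any signed-in user can then query your tables; you keep paying the $0.02/GB/month storage fee, and readers pay for the bytes their own queries scan.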

To participate in the Public Datasets program, you must contact us - there's a lot of legal mumbo jumbo associated with us hosting your datasets that I wish weren't the case :)

"HACKER NEWS - A dataset that contains all stories and comments from Hacker News since its launch in 2006."

I know what I'm doing this weekend.

Some resources to get you started:

- How to use the Hacker News dataset https://medium.com/google-cloud/big-data-stories-in-seconds-...

- Discussion of the HN dataset announcement here https://news.ycombinator.com/item?id=10440502

- An iPython notebook: https://github.com/fhoffa/notebooks/blob/master/analyzing%20...

- More: http://debarghyadas.com/writes/looking-back-at-9-years-of-ha...

Disclaimer: I'm Felipe Hoffa, and I work at Google. (https://twitter.com/felipehoffa)

How fresh is the data, and how often is it refreshed? It doesn't seem to be described anywhere...

Looks like the data goes up through 2015-10-13. I created a Look (disclosure: work for Looker) that shows story counts by day for the last 365 days here: https://looker.com/publicdata/looks/169?show=viz

It's mildly amusing:

> SELECT sum(length(text)) FROM [bigquery-public-data:hacker_news.comments] where author="jrockway"


That's almost 5 MB of comments I've written.

Get outside more :)

Being outside does not preclude internet access... :)

I'm dying to get comment scores, not just submission scores.

There's a lot you could do with that to find the best comments, which is really why HN is so awesome.

Blame HN for that. It was removed from public access right after I made a blog post about it, although it was coincidental I think.

Though the data could be assembled; I can see my comment scores, you can see your comment scores...

> There's a lot you could do with that to find the best comments.

In any given situation, how often does the best comment you could write coincide with the one that would get you the most upvotes? I'm guessing not frequently.

Your best is different from his/her best.

Would be interesting to find out the value (upvotes) of sponsored content vs. unsponsored content, as well as comments from green users vs. normal users.

There isn't sponsored content on HN. (if you are referring to the job ads, those do not receive upvotes and follow a steady rank decay.)

What? The lack of upvotes (or steady decay) does not make it “not content”.

similar: https://hn.algolia.com/ ( Search Hacker News )

I work and play in data. By far the best resource I've encountered is https://app.enigma.io/

Signups are free. The aggregated public data is plentiful and easily discovered, indexed, filtered, and exported.

Free accounts have API limitations, but as far as govt data is concerned, I don't find that it's updated often enough to peg my API rate limiter anyway.

Awesome. Do you have any other resources to share? I run occasional local hackathons and am always looking for inspiring data sets.

can you talk more about your work? what kind of projects do you work on? what tools do you use other than enigma?

Great to see folks digging into this. Looker (where I work) is what you're seeing visualizing the data and making the underlying datasets explorable. Our founder did a great blog post explaining how it works and how quickly these datasets yield interesting insights with just a few lines of code.

Full blog post explaining and walking through the process (and giving access to explore the data fully yourself, no SQL required) is here: http://looker.com/blog/hacking-hacker-news

Felipe and all the other folks at Google have done a great job getting this project off the ground and we're psyched to partner with them. We're working on some new public datasets now, but if you have particular ones you'd like to explore, let us know.

What should we look at next? Census? Medicare?

Kudos to Looker, great partners here! We're just getting started :)

This is a much more useful list: https://www.reddit.com/r/bigquery/wiki/datasets

I'm not sure why Google hosts it on Reddit. There's some interesting (and more up-to-date) stuff on there.

This page may interest you: https://support.google.com/cloud/answer/3466163?hl=en

I think the Google Cloud team watches the community and helps promote the forums that have the most activity for a given product.

That's a list of random data from random people, curated by Googler felipehoffa in an unofficial capacity.

The ones in the submitted page are maintained by Google itself in the public-data project.

I believe that the GDelt, Freebase and Genomics tables (at least) are officially supported by Google.

I just ran a Markov text generator over the USA Name Database and created a few thousand girls' names, including:

Aracella, Ashla, Blakelyne, Carylou, Damariah, Enchantrelle, Francenza, Iridia, Jalexius, Lilliotte and Scotlanta.

Hey, maybe I can start a bespoke baby name business...
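For anyone who wants to try the same trick, here's a minimal sketch of the kind of character-level Markov generator described above. It's seeded with a handful of hand-picked names for illustration; the actual run above used the full USA Names public dataset.

```python
import random
from collections import defaultdict

def train(names, order=2):
    """Build a character-level Markov model mapping each `order`-gram
    to the list of characters observed after it ('$' marks end-of-name)."""
    model = defaultdict(list)
    for name in names:
        padded = "^" * order + name.lower() + "$"
        for i in range(len(padded) - order):
            model[padded[i:i + order]].append(padded[i + order])
    return model

def generate(model, order=2, max_len=12, rng=random):
    """Walk the model from the start state, emitting characters until
    the end marker is drawn or `max_len` characters are produced."""
    state = "^" * order
    out = []
    while len(out) < max_len:
        nxt = rng.choice(model[state])
        if nxt == "$":
            break
        out.append(nxt)
        state = state[1:] + nxt
    return "".join(out).capitalize()

# A tiny seed corpus standing in for the USA Names dataset:
seed_names = ["aracely", "ashley", "blakely", "carol", "damaris",
              "francesca", "lillian", "charlotte", "scarlett"]
model = train(seed_names)
random.seed(42)
print([generate(model) for _ in range(5)])
```

With order=2 the output stays pronounceable but novel; raising the order makes the names converge back toward the training set.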

I'm building a baby name app and this is a fantastic idea that I am probably going to steal.

Click on the Stories Count and you can see the individual stories :)

Any chance we'll be seeing the Common Crawl data on there anytime soon?

Sounds like a good one. We'll be sure to reach out to them. Or, if you know them, have them shoot us a note at:


I just noticed the Freebase data available as one of the public datasets, so been wondering what happened to the Freebase team?

I liked what Freebase, DBpedia etc were doing few years ago with "semantic db's". I remember Freebase Gridworks being extremely useful for data cleanup. Haven't had a reason to follow the space...is it dead? Haven't seen much semantic talk off late. Where would one go to catch up on the latest news?

Freebase data is being moved into Wikidata.

I loaded Wikidata into BigQuery too!

Fun with movies and cats: https://medium.com/google-cloud/oscars-2016-movies-that-got-...

Implementation notes: https://lists.wikimedia.org/pipermail/wikidata/2016-March/00...

More notes: https://lists.wikimedia.org/pipermail/wikidata/2016-March/00...

Super interesting. Thanks for all the curation and relevant set of links that you are promptly posting/replying. StackOverflow data is available for free too - that would be one awesome dataset to let folks get their hands on. Any plans for that?

Looking at the suggested textual analyses of the Reddit and HN datasets, and thinking about how they can be combined, makes me wonder if anonymity through multiple avatars will even be remotely possible 5 years from now.

I get 404s for all Looker iframes...

Sorry about that! This is a known issue we're working with Google to get resolved right now. In the meantime, a refresh to the page usually fixes the issue.

Same here in Safari. Chrome seems to work.

where are you seeing Looker being used?

Looker is powering all the iframes (any 404s should be fixed now). Bigquery is hosting the data, but Looker is generating the queries and visualizations of the data.

Interesting, I'm surprised Google is using Looker rather than their own tool

edit: ah, I see, it is just one of the partners they have with BigQuery and they are apparently using it for their website (I'm guessing because Looker lets you iframe graphs directly into the page for free).

Yup, Looker is a Premier Partner for the GCP launch (https://cloud.google.com/partners/?q=Looker#search). And then we (I work at Looker) are specifically partnering with GCP for the Public Datasets project.

Because LookML (Looker's modeling language) makes it easy to explore and visualize big datasets, we're building out models and dashboards so visitors can get a sense of BigQuery/Looker's power and find insights from the datasets quickly (whether or not they write SQL).

Frankly, there's a ton of public data that's "available" in the sense that you can technically download and clean a CSV, but isn't actually easy to extract meaning from. So we figured we'd select some interesting datasets, do the cleaning, uploading and modeling, and then let folks have at it for free.

If there are specific datasets you'd be interested in seeing, let us know and we'll see what we can do.

Any chance of getting GNAF and some other Australian datasets up there?


A CKAN->BigQuery connector would be interesting (think of an "Open in BigQuery" button)

Seems like it has a dataset of hacker news since it launched in 2006


  SELECT * FROM [bigquery-public-data:hacker_news.stories] ORDER BY score DESC LIMIT 100

Made a screenshot of the top 20 stories on HN: http://i.imgur.com/czXYEyQ.png

Public data is good! I have used their book text data on two projects to generate common ngrams. These newly released data sets look useful and using BigQuery is reasonable if you just need parts of data sets.

Amazon does something similar: keeping useful data in S3 for cheap access when running on AWS.

There's a Hacker News dataset. In case anyone wants to take a crack at it: https://cloud.google.com/bigquery/public-data/hacker-news

I hope someone submits a voice recognition data!

It's fantastically hard to find audio->phoneme datasets and only slightly less difficult to find audio->word and word->possible_phonemes...

Sorry Google but my trust in cloud solutions reside in AWS and Azure.

Why? Because when Amazon and Microsoft announce something I know there's a good chance it will still exist 24 months later

This narrative gets repeated time and time again, and it really doesn't hold up to even surface-level scrutiny:

- Google TI and Cloud merged. Same teams. Per Urs Holzle in Jan 2014.

- Just last week Sundar (CEO), Eric Schmidt (chairman), Jeff Dean, Urs Holzle, and Diane were on stage talking about how they spent $10 billion last year investing in infrastructure and how serious Google is about cloud (https://www.youtube.com/watch?v=HgWHeT_OwHc&list=PLIivdWyY5s...)

- In many cases, publicly available technologies are actually replacing internal technologies (BigQuery and Dremel, Dataflow and FlumeJava/Millwheel, for example).

- Disney, Coca Cola, Snapchat, Spotify, Home Depot are just some of the customers running on Google Cloud. It'd be very foolish to abandon these.

- If you watch the day two GCP NEXT keynote, you'll see how Google is the cloud leader in environmental responsibility and open source. We're the good guys :) (https://www.youtube.com/watch?v=axhdIa_co2o&list=PLIivdWyY5s...)

Happy to debate further!

How can they copy our comments from here to there and sell queries on them...?

Can we extract the datasets to work with them on our own (cheaper to query) servers?

Of course! BigQuery export operations are free.

Although compared with other cloud technologies, BigQuery is incredibly cost-effective :)

Well, I’ve checked the prices, and dedicated servers from old-style hosters are still far cheaper if you’re going to query data 24/7.

Most of the datasets seem to be available to download here https://www.reddit.com/r/bigquery/wiki/datasets.

After looking at those:

What’s "big" about any of them?

That’s literally a standard database size, and can easily be handled by any Postgres install.

All of that data combined can be stored and quickly queried on just 10 dedicated servers, for an overall price of 200€/month.

And with "all" I mean "all 20TB".

And with "quickly" I mean "faster than a network request to Google".


Due to rate limiting, I can't answer you at the moment, vgt. So my answer will happen here, inline:

> Well, like this:

> If someone wants to decide whether to build a business on BigQuery, they want to evaluate its performance on an average dataset beforehand, for free.

> Assuming the test data is meant for similar cases, it has to be of similar size to the average dataset used with BigQuery.

Apologies, but where do you find the implication that these datasets are "big"? There is "big" in BigQuery, so apologies if you misunderstood.

We try to keep datasets reasonable, so that folks can get the most out of the BigQuery free pricing tier :)

If you want bigger, here's an example of a 10TB dataset:


Hi Janne!

I'm sure you'll enjoy a challenge.

Let's talk about Wikidata. Can you download this 8 GB compressed file?


I want to know the id of the JSON row whose length is 102 bytes.

It took me 4 seconds with BigQuery - how can we improve this with "any postgres install"?
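Locally, the brute-force version of that challenge is just a sequential scan. A minimal sketch, assuming a newline-delimited JSON dump with Wikidata-style "id" fields; here it runs against a tiny synthetic file standing in for the 8 GB dump:

```python
import gzip
import json

def find_ids_by_line_length(path, target_len):
    """Scan a (optionally gzipped) newline-delimited JSON dump and return
    the 'id' of every row whose raw line is exactly `target_len` bytes."""
    opener = gzip.open if path.endswith(".gz") else open
    ids = []
    with opener(path, "rb") as f:
        for line in f:
            line = line.rstrip(b"\n")
            if len(line) == target_len:
                ids.append(json.loads(line)["id"])
    return ids

# Tiny synthetic sample standing in for the real dump:
with open("sample.ndjson", "wb") as f:
    f.write(b'{"id": "Q1", "label": "universe"}\n')
    f.write(b'{"id": "Q2", "label": "Earth"}\n')

print(find_ids_by_line_length("sample.ndjson", 33))  # -> ['Q1']
```

The catch, of course, is that this single-threaded scan is bounded by local disk throughput on the full 8 GB file, whereas BigQuery fans the scan out across many workers; that difference, not the code, is the point under debate here.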


What you just did isn't how data is usually processed — a single 8 GB JSON file never occurs in real-life cases, except maybe as a transitional medium while you migrate from one database to another.

In most cases, you'll have it stored in a relational or graph database, easily accessible.

Especially in the sciences.

But sure, I can go through the data and tell you the amount of cats with specific properties, or similar.

I personally use pgsql currently to analyse data from Android crash reports.

Like, "list me all exceptions which have reports from more than 50 unique users, which do not all use Samsung phones"

Or, my favourite (as I store, for each report, the exception, for which I store the stack trace elements, for which I store methods, files, lines of files, and classes):

Show me a breakdown of operating systems of users affected by exceptions occurring in this method which have sent more than 40 fatal crashes each in the past month.

It manages that faster than just the latency to BigQuery would be — 18 ms over a test dataset of several gigabytes by now.
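For readers following along, the first query described above might look roughly like this in Postgres. The schema is entirely hypothetical (a flattened `reports` table with `exception_id`, `user_id`, and `device_vendor` columns), and the `FILTER` clause assumes Postgres 9.4 or later:

```sql
-- "Exceptions with reports from more than 50 unique users,
--  where not all reporting users are on Samsung phones."
SELECT exception_id
FROM reports
GROUP BY exception_id
HAVING COUNT(DISTINCT user_id) > 50
   AND COUNT(*) FILTER (WHERE device_vendor <> 'Samsung') > 0;
```

On a few gigabytes with an index on `exception_id`, it's entirely plausible this runs in milliseconds, which is the commenter's point: below a certain data size, local Postgres wins on latency alone.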


Seriously, @dang, can you maybe consider deactivating downvotes unless people also post a comment explaining why they consider a comment not constructive?

Getting mass downvotes within seconds of posting is very annoying, and it just destroys the discussion culture.

> Especially in the sciences.

Take a look at how Stanford is leveraging BigQuery for their genomics analysis:


> It manages that faster than just the latency to bigquery would be — 18ms over a test dataset of several gigabyte by now.

Yes!!! Exactly! That's the whole point. If something is taking you less than a second, don't bring BigQuery into the mix. But when you start hitting processes that take hours or days to run... try BigQuery. Going from hours to seconds changes your life.

> I personally use pgsql currently to analyse data from Android crash reports.

Cool! Guess what Motorola uses...


> If something is taking you less than a second, don't bring BigQuery into the mix. But when you start hitting processes that take hours or days to run... try BigQuery

But that's my entire point? Why are all the test datasets small enough to be faster in pgsql than the latency to Google would take?

I mean, if I wanted to showcase my software, I'd use a huge dataset with a complex problem — say protein folding, or superconductor molecular analysis — and show it in comparison on bigquery and a standard local database.

Although these two examples are bad, as I know from my university that they can be solved in the same time for less money locally than by using bigquery

Look... if you don't have problems that take hours to solve in your current environment, you haven't found big data problems (yet). And that's OK. Not everyone works with big data.

But if one day you do, please ping me, it will be fun to do a follow up.

Yeah, that's why I had hoped the test datasets would be big data problems — so I could see what that actually looks like ;)

In a similar vein, Kaggle datasets enables you to run Python, R, Julia, and SQL on many public datasets https://www.kaggle.com/datasets

Site doesn't render well on my Moto X.
