I remember the earlier Google attempt at the same thing. I was in undergrad and an engineer came to my Uni to give a presentation on the topic. A big area of concern was "Why won't Google just shut this down when the going gets rough?" The question was mostly batted away along the lines of "this is lunch money to Google anyway." But just a couple years later, it did indeed bite the dust.
Google BigQuery is a paid service, Dremel under the hood, used internally at Google by ~80% of Googlers every week. It's simply not going away because Google relies on it heavily.
Externally, companies like Spotify and Kabam Games rely heavily on BigQuery as well:
So the story of Google Cloud products, which are paid, supported, and enterprise-grade, and which are covered by a clear one-year deprecation policy, is very different from the story of free services.
Public Datasets is just a function of BigQuery's unique ACL functionality.
I hate people bringing them up - it's just being a sucker for marketing. Spotify and Netflix are clearly brands recruited to promote their respective platforms. The only question is how large a discount they are getting, among other special treatment, in return for their name. I wouldn't be surprised if it's conditional on a gag order against criticising the platform, too. These are the sort of situations that are revealed years later to have required a totally isolated and custom solution that bore no resemblance to the service it was intended to promote.
Special partnerships prove nothing because the economics and service you get won't be remotely similar.
You can listen to and read Spotify engineers talking about their experiences with Google, including running 2,000,000 qps load tests without ever telling anyone at Google.
If you read some of their blogs, you can see how brutally honest they are about the pros and cons of our services:
> It's simply not going away because Google relies on it heavily.
That still doesn't mean Google will maintain public access to this database. They could just as well shut it down and limit access to Googlers, like they already did a few years ago.
When you use these datasets you pay for how much data the query has to scan. For example, if you do a select on a single 8-byte column in a table with 1 billion rows, you are paying for 8 GB.
Every month the first TB is free, but if you go over that you pay $5 per TB. They are not exactly doing this for free.
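A rough sketch of how that adds up (table and column names here are made up; the prices are just the on-demand numbers quoted above):

    -- Hypothetical table with 1 billion rows; `score` is an 8-byte INT64 column.
    -- On-demand queries are billed by bytes scanned in the referenced columns,
    -- not by rows returned, so this scans roughly 8 bytes * 1e9 rows = 8 GB.
    SELECT AVG(score)
    FROM my_dataset.my_table;
    -- Past the free 1 TB/month, 8 GB at $5/TB costs about 8 / 1024 * $5 ≈ $0.04.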
Google employees using something internally doesn't mean that it won't go away for everyone else... Also, they could decide to honor existing contracts and stop issuing new ones...
BigQuery is a truly fully-managed service. This means that SREs are on call 24/7. It's not software that one downloads. Hence it really doesn't make a lot of sense to have it available to just a few folks; the incremental overhead is low.
I'd love to hear industry standards around deprecation of paid services. This is simply how business is done across the entire industry, and I'm not sure why Google is being held to the unreasonable and impossible standard of "never deprecate anything, ever!". For example, here is Microsoft's:
The value of an email service is how many users it has.
The value of a big database is... who knows? If Google one day decides the content of that database is a strategic asset, they might just as well close it to the public and continue using it internally.
Gmail is a low-margin business compared to search advertising. Does it have any non-creepy way to make huge margins? If not, why does getting big mean anything but danger?
I'm happy to respond to every single person that makes this argument in perpetuity :)
There's a difference between experimental free services and enterprise-grade, SLA'd, SLO'd, fully-supported paid services with a very clear and recorded deprecation policy, an army of customers with contracts, and full support from Google's CEO and chairman.
Indeed, but who is paying to maintain 546 GB of online storage (plus backups) for Reddit comments in BigQuery, for example?
If the answer is "Google" then I think people are still right to be cautious.
Or to invert the question; could I put 1TB of my own 'interesting' data into BigQuery and have Google maintain it in perpetuity for free? If not, then why are any of these datasets considered safer?
If you think Google even cares about 546GB of online storage you've gotta be kidding. This is like asking if Google is going to eliminate the free water in the water cooler because it costs a few cents per employee.
You can put 1TB of your own interesting data in BigQuery and pay $0.02/GB/month for the data storage fees. I'm sure Google will happily take your money as a paying customer, and keep your data for as long as you want to store it and pay for it.
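For rough context, a back-of-the-envelope on those storage prices (this is just the arithmetic, nothing BigQuery-specific):

    -- Storage cost at the $0.02/GB/month rate quoted above:
    SELECT 546  * 0.02 AS reddit_546gb_usd_per_month,  -- ~$10.92 for the Reddit example upthread
           1024 * 0.02 AS one_tb_usd_per_month;        -- ~$20.48 for a full TB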
And if you've got public data in BigQuery that you want to make explorable/shareable/visualizable to anybody (no SQL required), let Looker know and we'll see what we can do. (disclosure: work at Looker)
You can always create a dataset and make it public by setting open ACLs on it. The only downside is paying $0.02 per GB per month.
To participate in the Public Datasets program, you must contact us - there's a lot of legal mumbo jumbo associated with us hosting your datasets that I wish weren't the case :)
Looks like the data goes up through 2015-10-13. I created a Look (disclosure: work for Looker) that shows story counts by day for the last 365 days here: https://looker.com/publicdata/looks/169?show=viz
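If you'd rather poke at it in raw SQL, something like the following should reproduce that Look. Note that the exact public table and timestamp column names are my assumption of how the HN dataset is exposed, so adjust as needed:

    -- Story counts by day for the last 365 days (table and column names assumed).
    SELECT DATE(time_ts) AS day,
           COUNT(*)      AS stories
    FROM `bigquery-public-data.hacker_news.stories`
    WHERE time_ts >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 365 DAY)
    GROUP BY day
    ORDER BY day;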
> There's a lot you could do with that to find the best comments.
In any given situation, how often does the best comment you could write coincide with the one that would get you the most upvotes? I'm guessing not frequently.
It would be interesting to find out the value (upvotes) of sponsored content vs. unsponsored content, as well as comments from green users vs. normal users.
Great to see folks digging into this. Looker (where I work) is what visualizes the data you're seeing and makes the underlying datasets explorable. Our founder did a great blog post explaining how it works and how quickly these datasets yield interesting insights with just a few lines of code.
Full blog post explaining and walking through the process (and giving access to explore the data fully yourself, no SQL required) is here: http://looker.com/blog/hacking-hacker-news
Felipe and all the other folks at Google have done a great job getting this project off the ground and we're psyched to partner with them. We're working on some new public datasets now, but if you have particular ones you'd like to explore, let us know.
I just noticed the Freebase data available as one of the public datasets, so I've been wondering: what happened to the Freebase team?
I liked what Freebase, DBpedia, etc. were doing a few years ago with "semantic DBs". I remember Freebase Gridworks being extremely useful for data cleanup. I haven't had a reason to follow the space... is it dead? I haven't seen much semantic talk of late. Where would one go to catch up on the latest news?
Super interesting. Thanks for all the curation and the relevant links you are promptly posting in replies. StackOverflow data is available for free too - that would be one awesome dataset to let folks get their hands on. Any plans for that?
Looking at the suggested textual analyses of the Reddit and HN datasets, and thinking about how they can be combined, makes me wonder if anonymity through multiple avatars will even be remotely possible 5 years from now.
Sorry about that! This is a known issue we're working with Google to get resolved right now. In the meantime, a refresh to the page usually fixes the issue.
Looker is powering all the iframes (any 404s should be fixed now). BigQuery is hosting the data, but Looker is generating the queries and visualizations of the data.
Interesting, I'm surprised Google is using Looker rather than their own tool
edit: ah, I see, it is just one of the partners they have with BigQuery and they are apparently using it for their website (I'm guessing because Looker lets you iframe graphs directly into the page for free).
Yup, Looker is a Premier Partner for the GCP launch (https://cloud.google.com/partners/?q=Looker#search). And then we (I work at Looker) are specifically partnering with GCP for the Public Datasets project.
Because LookML (Looker's modeling language) makes it easy to explore and visualize big datasets, we're building out models and dashboards so visitors can get a sense of BigQuery/Looker's power and find insights from the datasets quickly (whether or not they write SQL).
Frankly, there's a ton of public data that's "available" in the sense that you can technically download and clean a CSV, but isn't actually easy to extract meaning from. So we figured we'd select some interesting datasets, do the cleaning, uploading and modeling, and then let folks have at it for free.
If there are specific datasets you'd be interested in seeing, let us know and we'll see what we can do.
Public data is good! I have used their book text data on two projects to generate common ngrams. These newly released data sets look useful and using BigQuery is reasonable if you just need parts of data sets.
Amazon does something similar: keeping useful data in S3 for cheap access when running on AWS.
That's literally a standard database size, and can easily be handled by any Postgres install.
All of that data combined can be stored and quickly queried on just 10 dedicated servers, for an overall price of 200€/month.
And with "all" I mean "all 20TB".
And with "quickly" I mean "faster than a network request to Google".
________________
Due to rate limiting, I can't answer you at the moment, vgt. So my answer will happen here, inline:
> Well, like this:
> If someone wants to decide whether to build a business on BigQuery, they want to evaluate its performance on an average dataset beforehand, for free.
> Assuming the test data is meant for similar cases, it has to be of similar size to the average dataset used with BigQuery.
What you just did isn't how data is usually processed — a single 8GB JSON never occurs in real life cases, except maybe as transitional medium while you migrate from one database to another.
In most cases, you'll have it stored in a relational or graph database, easily accessible.
Especially in the sciences.
But sure, I can go through the data and tell you the number of cats with specific properties, or similar.
I personally use pgsql currently to analyse data from Android crash reports.
Like, "list all exceptions which have reports from more than 50 unique users which do not all use Samsung phones".
Or, my favourite (for each report I store the exception, for which I store the stack trace elements, which in turn reference methods, files, line numbers and classes):
Show me a breakdown of the operating systems of users affected by exceptions occurring in this method which have sent more than 40 fatal crashes each in the past month.
It manages that faster than the latency to BigQuery alone would be: 18 ms over a test dataset of several gigabytes by now.
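For reference, that second query looks roughly like this against a simplified version of my schema (all table and column names here are placeholders):

    -- Breakdown of OS for users hit by exceptions in a given method,
    -- restricted to exceptions with >40 fatal crash reports in the past month.
    -- Placeholder schema: reports(id, exception_id, user_id, os, fatal, created_at),
    -- stack_frames(exception_id, method_id), methods(id, signature).
    SELECT r.os, COUNT(DISTINCT r.user_id) AS affected_users
    FROM reports r
    JOIN stack_frames sf ON sf.exception_id = r.exception_id
    JOIN methods m       ON m.id = sf.method_id
    WHERE m.signature = 'com.example.SyncService.run'
      AND r.exception_id IN (
            SELECT exception_id
            FROM reports
            WHERE fatal
              AND created_at > now() - interval '1 month'
            GROUP BY exception_id
            HAVING COUNT(*) > 40
      )
    GROUP BY r.os;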
________________
Seriously, @dang, can you maybe consider deactivating downvotes unless people also post a comment explaining why they consider a comment not constructive?
Getting mass downvotes within seconds of posting is very annoying, and it just destroys the discussion culture.
> It manages that faster than the latency to BigQuery alone would be: 18 ms over a test dataset of several gigabytes by now.
Yes!!! Exactly! That's the whole point. If something is taking you less than a second, don't bring BigQuery into the mix. But when you start hitting processes that take hours or days to run... try BigQuery. Going from hours to seconds changes your life.
> I personally use pgsql currently to analyse data from Android crash reports.
> If something is taking you less than a second, don't bring BigQuery into the mix. But when you start hitting processes that take hours or days to run... try BigQuery
But that's my entire point? Why are all the test datasets small enough that querying them in pgsql is faster than the round trip to Google would take?
I mean, if I wanted to showcase my software, I'd use a huge dataset with a complex problem (say protein folding, or superconductor molecular analysis) and show it in comparison on BigQuery and a standard local database.
Although these two examples are bad, as I know from my university that they can be solved in the same time, for less money, locally than by using BigQuery.
Look... if you don't have problems that take hours to solve in your current environment, you haven't found big data problems (yet). And that's OK. Not everyone works with big data.
But if one day you do, please ping me, it will be fun to do a follow up.
This narrative gets repeated time and time again, and it really doesn't hold up to even surface debate:
- Google TI and Cloud merged. Same teams. Per Urs Holzle in Jan 2014.
- Just last week Sundar (CEO), Eric Schmidt (chairman), Jeff Dean, Urs Holzle, and Diane were on stage talking about how they spent $10 billion last year investing in infrastructure and how serious Google is about cloud (https://www.youtube.com/watch?v=HgWHeT_OwHc&list=PLIivdWyY5s...)
- In many cases, publicly available technologies are actually replacing internal technologies (BigQuery and Dremel, Dataflow and FlumeJava/Millwheel, for example).
- Disney, Coca Cola, Snapchat, Spotify, Home Depot are just some of the customers running on Google Cloud. It'd be very foolish to abandon these.
Covered in Wired: http://www.wired.com/2008/01/google-to-provi/
Update: And the shutdown article, 11 months later: http://www.wired.com/2008/12/googlescienceda/