Covered in Wired: http://www.wired.com/2008/01/google-to-provi/
Update: And the shutdown article, 11 months later: http://www.wired.com/2008/12/googlescienceda/
Externally, companies like Spotify and Kabam Games rely heavily on BigQuery as well:
So the story of Google Cloud products, which are paid, supported, and Enterprise-grade, and which are covered by a clear 1-year deprecation policy, is very different from the story of free services.
Public Datasets is just a function of BigQuery's unique ACL functionality.
I hate people bringing them up - it's just being a sucker for marketing. Spotify and Netflix are clearly brands recruited to promote their respective platforms. The only question is how large a discount they are getting, among other special treatment, in return for their name. I wouldn't be surprised if it's conditional on a gag clause against criticising the platform too. These are the sort of situations that are revealed years later to have required a totally isolated and custom solution that bore no resemblance to the service it was intended to promote.
Special partnerships prove nothing because the economics and service you get won't be remotely similar.
If you read some of their blogs, you can see how brutally honest they are about the pros and cons of our services:
Finally, does this sound like an engineer who got paid to speak? https://twitter.com/mpjme/status/702167231623516160
Still doesn't mean Google will maintain the public access to this database. They might just as well shut it down and limit its access to Googlers. Like they already did a few years ago.
Every month the first TB is free, but if you go over that you have to pay $5 per TB. They are not exactly doing this for free.
I'd love to hear industry standards around deprecation of paid services. This is really how business is done the world over, and I'm not sure why Google is being held to an unreasonable and impossible standard of "never deprecate anything ever!". For example, here is Microsoft's:
Google announced Colossus in 2011; is that going to make it to GCP?
For example, Google Cloud Storage and BigQuery Storage both sit on Colossus. Read more on BQ here:
Google is required to give at least a 1 year warning on shutting down released Cloud products.
I remember those days. GMail was still in beta, and a big area of concern was "Why won't Google just shut this down when the going gets rough?"
8 years later GMail has a billion users, and it's not going anywhere but up.
I haven't been to the future (yet), but I'll happily take any bet you have against BigQuery.
Disclaimer: I'm Felipe Hoffa, and I work at Google. (https://twitter.com/felipehoffa)
The value of an email service is how many users it has.
The value of a big database is... who knows? If Google one day decides the content of that database is a strategic asset, they might just as well decide to shut it down to the public and continue using it internally.
There's a difference between experimental free services and Enterprise-grade, SLA'd, SLO'd, fully-supported paid services with a very clear and recorded deprecation policy and an army of customers with contracts and full support from Google's CEO and chairman.
If the answer is "Google" then I think people are still right to be cautious.
Or to invert the question; could I put 1TB of my own 'interesting' data into BigQuery and have Google maintain it in perpetuity for free? If not, then why are any of these datasets considered safer?
You can put 1TB of your own interesting data in BigQuery and pay $0.02/GB/month for the data storage fees. I'm sure Google will happily take your money as a paying customer, and keep your data for as long as you want to store it and pay for it.
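Back-of-envelope, at that rate a full terabyte of your own data works out to about:

    1 TB ≈ 1,024 GB × $0.02/GB/month ≈ $20.48/month

(that's storage only; queries against it are billed separately, with the first TB processed per month free as noted elsewhere in this thread)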
- You do not need to spin up a database to work with BigQuery
- You can simply start writing SQL on top of BigQuery (see the example query after this list)
- You may leverage Dataflow and MapReduce connectors to work with this data directly in Hadoop, Spark, or Dataflow
- BigQuery has a free tier - one Terabyte of data processed per month
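For instance, here's about the smallest useful query you could paste straight into the BigQuery console; it only touches the public hacker_news.comments table and the fields already used elsewhere in this thread:

    -- most prolific HN commenters, counted from the public dataset
    SELECT author, COUNT(*) AS comments
    FROM [bigquery-public-data:hacker_news.comments]
    GROUP BY author
    ORDER BY comments DESC
    LIMIT 10

No cluster to spin up, no loading step: just SQL against a table someone else is hosting.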
Finally, for folks who would like to share their datasets, BigQuery offers free hosting and credits to help get a pipeline going.
(disclosure: work on BigQuery)
Is this available in an automated fashion or is it something one has to contact Google for?
To participate in the Public Datasets program, you must contact us - there's a lot of legal mumbo jumbo associated with us hosting your datasets that I wish weren't the case :)
I know what I'm doing this weekend.
- How to use the Hacker News dataset https://medium.com/google-cloud/big-data-stories-in-seconds-...
- Discussion of the HN dataset announcement here https://news.ycombinator.com/item?id=10440502
- An IPython notebook: https://github.com/fhoffa/notebooks/blob/master/analyzing%20...
- More: http://debarghyadas.com/writes/looking-back-at-9-years-of-ha...
Disclaimer: I'm Felipe Hoffa, and I work at Google.
> SELECT SUM(LENGTH(text)) FROM [bigquery-public-data:hacker_news.comments] WHERE author = "jrockway"
That's almost 5 MB of comments I've written.
There's a lot you could do with that to find the best comments, which is really why HN is so awesome.
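For instance, a small variation on the query above (a sketch using only the fields already shown; "best" here is crudely approximated by length, since I don't know of a score column on the comments table):

    -- your longest comments, as a rough stand-in for "best"
    SELECT LENGTH(text) AS len, text
    FROM [bigquery-public-data:hacker_news.comments]
    WHERE author = "jrockway"
    ORDER BY len DESC
    LIMIT 5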
In any given situation, how often does the best comment you could write coincide with the one that would get you the most upvotes? I'm guessing not frequently.
Signups are free. The aggregated public data is plentiful and easily discovered, indexed, filtered, and exported.
Free accounts have API limitations, but as far as govt data is concerned, I don't find that it's updated often enough to peg my API rate limiter anyway.
Full blog post explaining and walking through the process (and giving access to explore the data fully yourself, no SQL required) is here: http://looker.com/blog/hacking-hacker-news
Felipe and all the other folks at Google have done a great job getting this project off the ground and we're psyched to partner with them. We're working on some new public datasets now, but if you have particular ones you'd like to explore, let us know.
What should we look at next? Census? Medicare?
I'm not sure why Google hosts it on Reddit. There's some interesting (and more up-to-date) stuff on there.
I think the Google Cloud team watches the community and helps promote the forums that are the most active for a given product.
The ones in the submitted page are maintained by Google itself in the public-data project.
Aracella, Ashla, Blakelyne, Carylou, Damariah, Enchantrelle, Francenza, Iridia, Jalexius, Lilliotte and Scotlanta.
Hey, maybe I can start a bespoke baby name business...
I liked what Freebase, DBpedia, etc. were doing a few years ago with "semantic DBs". I remember Freebase Gridworks being extremely useful for data cleanup. I haven't had a reason to follow the space... is it dead? I haven't seen much semantic talk of late. Where would one go to catch up on the latest news?
I loaded Wikidata into BigQuery too!
Fun with movies and cats: https://medium.com/google-cloud/oscars-2016-movies-that-got-...
Implementation notes: https://lists.wikimedia.org/pipermail/wikidata/2016-March/00...
More notes: https://lists.wikimedia.org/pipermail/wikidata/2016-March/00...
edit: ah, I see, it is just one of the partners they have with BigQuery and they are apparently using it for their website (I'm guessing because Looker lets you iframe graphs directly into the page for free).
Because LookML (Looker's modeling language) makes it easy to explore and visualize big datasets, we're building out models and dashboards so visitors can get a sense of BigQuery/Looker's power and find insights from the datasets quickly (whether or not they write SQL).
Frankly, there's a ton of public data that's "available" in the sense that you can technically download and clean a CSV, but isn't actually easy to extract meaning from. So we figured we'd select some interesting datasets, do the cleaning, uploading and modeling, and then let folks have at it for free.
If there are specific datasets you'd be interested in seeing, let us know and we'll see what we can do.
A CKAN->BigQuery connector would be interesting (think of an "Open in BigQuery" button)
SELECT * FROM [bigquery-public-data:hacker_news.stories] ORDER BY score DESC LIMIT 100
I made a screenshot of the top 20 stories on HN
Amazon does something similar: keeping useful data in S3 for cheap access when running on AWS.
It's fantastically hard to find audio->phoneme datasets and only slightly less difficult to find audio->word and word->possible_phonemes...
Why? Because when Amazon and Microsoft announce something, I know there's a good chance it will still exist 24 months later.
- Google TI (Technical Infrastructure) and Cloud merged. Same teams. Per Urs Holzle in Jan 2014
- Just last week Sundar (CEO), Eric Schmidt (chairman), Jeff Dean, Urs Holzle, and Diane were on stage talking about how they spent $10 billion last year investing in infrastructure and how serious Google is about cloud (https://www.youtube.com/watch?v=HgWHeT_OwHc&list=PLIivdWyY5s...)
- In many cases, publicly available technologies are actually replacing internal technologies (BigQuery and Dremel, Dataflow and FlumeJava/Millwheel, for example).
- Disney, Coca-Cola, Snapchat, Spotify, Home Depot are just some of the customers running on Google Cloud. It'd be very foolish to abandon these.
- If you watch the day two GCP NEXT keynote, you'll see how Google is the cloud leader in environmental responsibility and open source. We're the good guys :) (https://www.youtube.com/watch?v=axhdIa_co2o&list=PLIivdWyY5s...)
Happy to debate further!
Although compared with other cloud technologies, BigQuery is incredibly cost-effective :)
What’s "big" about any of them?
That's literally a standard database size, and can easily be handled by any postgres install.
All of that data combined can be stored and quickly queried on just 10 dedicated servers, for an overall price of 200€/month.
And with "all" I mean "all 20TB".
And with "quickly" I mean "faster than a network request to Google".
Due to rate limiting, I can't answer you at the moment, vgt. So my answer will happen here, inline:
> Well, like this:
> If someone wants to decide whether to build a business on BigQuery, they want to evaluate its performance on an average dataset beforehand, for free.
> Assuming the test data is meant for similar cases, it has to be of similar size to the average dataset used with BigQuery.
We try to keep datasets reasonable, so that folks can get the most out of the BigQuery free pricing tier :)
If you want bigger, here's an example of a 10TB dataset:
I'm sure you'll enjoy a challenge.
Let's talk about Wikidata. Can you download this 8 GB compressed file?
I want to know the id of the JSON row whose length is 102 bytes.
It took me 4 seconds with BigQuery - how can we improve this with "any postgres install"?
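For reference, the 4-second query looks something like this (a sketch: the project, table, and column names here are illustrative, not the real ones, and it assumes the dump was loaded one JSON object per row into a single-column table):

    -- find the id of the row whose raw JSON is exactly 102 bytes long
    SELECT JSON_EXTRACT_SCALAR(line, '$.id') AS id
    FROM [myproject:wikidata.dump_lines]
    WHERE LENGTH(line) = 102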
In most cases, you'll have it stored in a relational or graph database, easily accessible.
Especially in the sciences.
But sure, I can go through the data and tell you the number of cats with specific properties, or similar.
I personally use pgsql currently to analyse data from Android crash reports.
Like, "list me all exceptions which have reports from more than 50 uniqie users which do not all use Samsung phones"
Or, my favourite, as I store for each report the exception, for which I store the stack trace elements, for which I store methods, files, lines of files and classes, I can ask:
"Show me a breakdown of operating systems of users affected by exceptions occurring in this method which have sent more than 40 fatal crashes each in the past month."
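In Postgres that might look roughly like this (a sketch against a hypothetical schema; every table and column name here is my guess, not the real one):

    -- OS breakdown of users hit by exceptions passing through one method,
    -- restricted to exceptions with >40 fatal crashes in the past month
    SELECT r.os, COUNT(DISTINCT r.user_id) AS affected_users
    FROM reports r
    JOIN exceptions e ON e.id = r.exception_id
    WHERE EXISTS (SELECT 1
                    FROM stack_frames f
                    JOIN methods m ON m.id = f.method_id
                   WHERE f.exception_id = e.id
                     AND m.name = 'com.example.Foo.bar')
      AND e.id IN (SELECT exception_id
                     FROM reports
                    WHERE fatal
                      AND created_at > now() - interval '1 month'
                    GROUP BY exception_id
                   HAVING COUNT(*) > 40)
    GROUP BY r.os;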
It manages that faster than just the latency to BigQuery would be: 18 ms over a test dataset of several gigabytes by now.
Seriously, @dang, can you maybe consider deactivating downvotes unless people also post a comment explaining why they consider a comment not constructive?
Getting mass downvotes within seconds of posting is very annoying, and it just destroys the discussion culture.
Take a look at how Stanford is leveraging BigQuery for their genomics analysis:
> It manages that faster than just the latency to BigQuery would be: 18 ms over a test dataset of several gigabytes by now.
Yes!!! Exactly! That's the whole point. If something is taking you less than a second, don't bring BigQuery into the mix. But when you start hitting processes that take hours or days to run... try BigQuery. Going from hours to seconds changes your life.
> I personally use pgsql currently to analyse data from Android crash reports.
Cool! Guess what Motorola uses...
But that's my entire point! Why are all the test datasets small enough that pgsql handles them faster than the network latency to Google alone?
I mean, if I wanted to showcase my software, I'd use a huge dataset with a complex problem (say protein folding, or superconductor molecular analysis) and compare it side by side on BigQuery and a standard local database.
Although these two examples are bad, as I know from my university that they can be solved in the same time, for less money, locally than by using BigQuery.
But if one day you do, please ping me; it will be fun to do a follow-up.