
BigQuery public datasets now include Stack Overflow Q&A - fhoffa
https://cloud.google.com/blog/big-data/2016/12/google-bigquery-public-datasets-now-include-stack-overflow-q-a
======
vgt
Google BigQuery has pure separation of storage and compute, which allows for
Public Datasets [0] to exist in ready-to-query highly optimized format. Run
SQL immediately!

BigQuery has a perpetual free query tier of 1 Terabyte per month ($5). In
addition, you can get $300 in Google Cloud credits for two months to do more
work [1].
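
As a rough sketch of the arithmetic above: this assumes the $5 figure is the on-demand price per terabyte scanned and that the first terabyte each month is free. The constant and function names are illustrative only, not part of any BigQuery API, and current pricing may differ.

```python
FREE_TIER_TB = 1.0      # free query volume per month, per the comment above
PRICE_PER_TB_USD = 5.0  # assumed on-demand price per terabyte scanned


def monthly_query_cost(tb_scanned: float) -> float:
    """Estimate one month's on-demand query cost in USD."""
    # Only volume beyond the free tier is billed.
    billable_tb = max(0.0, tb_scanned - FREE_TIER_TB)
    return billable_tb * PRICE_PER_TB_USD


print(monthly_query_cost(0.5))  # stays inside the free tier
print(monthly_query_cost(3.0))  # 2 billable terabytes
```

The $300 trial credit would simply offset whatever this returns during the trial period.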

(work on Google Cloud and used to work on BigQuery)

Edit: spaces, how do they work?!?!

[0] [https://cloud.google.com/bigquery/public-
data/](https://cloud.google.com/bigquery/public-data/)

[1] [https://cloud.google.com/free-trial/](https://cloud.google.com/free-
trial/)

~~~
dsl
Are there any plans to bring predictable pricing (i.e. reserved slots) to the
sub-$40k/mo user?

I'm trying to bootstrap an idea for a startup dealing with a massive dataset,
but the fear of future costs is a major blocker of building around BQ.

The pricing model for Azure SQL Data Warehouse is amazing. You buy a fixed
number of slots and can find your own sweet spot of response time/cost, but
it still has a $1k/mo barrier to entry.

~~~
vgt
Feel free to PM me.

~~~
ddorian43
Damn, and the whole dataset for 1 query is in memory. Talk about fast network.
Any other juice on the internals?

~~~
mikecb
Juice:

[1]
[https://static.googleusercontent.com/media/research.google.c...](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43119.pdf)

[2]
[https://static.googleusercontent.com/media/research.google.c...](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/37217.pdf)

[3] [https://cloud.google.com/blog/big-data/2016/01/bigquery-
unde...](https://cloud.google.com/blog/big-data/2016/01/bigquery-under-the-
hood)

[4] [https://cloud.google.com/blog/big-data/2016/08/in-memory-
que...](https://cloud.google.com/blog/big-data/2016/08/in-memory-query-
execution-in-google-bigquery)

[5] [https://cloud.google.com/blog/big-data/2016/04/inside-
capaci...](https://cloud.google.com/blog/big-data/2016/04/inside-capacitor-
bigquerys-next-generation-columnar-storage-format)

I think there are more, but these are the ones I have on pinboard.

~~~
vgt
Some more:

[0] [https://cloud.google.com/blog/big-data/2016/01/anatomy-
of-a-...](https://cloud.google.com/blog/big-data/2016/01/anatomy-of-a-
bigquery-query)

[1] [https://cloud.google.com/blog/big-data/2016/08/google-
bigque...](https://cloud.google.com/blog/big-data/2016/08/google-bigquery-
continues-to-define-what-it-means-to-be-fully-managed)

[2] [https://medium.com/@thetinot/paying-it-forward-how-
bigquerys...](https://medium.com/@thetinot/paying-it-forward-how-bigquerys-
data-ingest-breaks-tech-norms-8bfe2341f5eb)

[3] [https://medium.com/google-cloud/15-awesome-things-you-
probab...](https://medium.com/google-cloud/15-awesome-things-you-probably-
didnt-know-about-google-bigquery-6654841fa2dc)

[4] [https://cloud.google.com/blog/big-
data/2016/02/visualizing-t...](https://cloud.google.com/blog/big-
data/2016/02/visualizing-the-mechanics-of-on-demand-pricing-in-big-data-
technologies)

[5] [https://cloud.google.com/blog/big-
data/2016/02/understanding...](https://cloud.google.com/blog/big-
data/2016/02/understanding-bigquerys-rapid-scaling-and-simple-pricing)

------
jsproc
For me, the killer dataset would be the Google Scholar data. That would just
blow the whole scientometric space wide open. It would also be a nice
introduction for researchers into BigQuery (calculate your own h-index).
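
For reference, the metric is usually defined as the largest h such that h of an author's papers have at least h citations each (commonly called the h-index). A minimal sketch of that calculation, assuming you already have a list of per-paper citation counts; no Google Scholar dataset or schema is implied here:

```python
def h_index(citations):
    """Largest h such that at least h papers have >= h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank  # the top `rank` papers all have at least `rank` citations
        else:
            break
    return h


print(h_index([10, 8, 5, 4, 3]))  # 4
```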

~~~
petters
I completely get that it's hard to make money off Google Scholar. But it could
do so many more cool things with more effort.

------
wjossey
I'd highly recommend anyone who hasn't played with BigQuery yet to take a few
moments to give it a shot. I was highly skeptical at first when my old company
moved to it, but now I've come to love it as a platform.

What I love most: [0] Exceptionally fast queries against large datasets. [1]
Very cost-effective (although, as others have called out, it can be
accidentally misused, resulting in a big bill). [2] Non-data-engineers can
set up, use, and manage it with minimal difficulty. [3] Gets you away from
high-priced solutions like Vertica or Teradata. [4] No management headaches
like Redshift.

Downsides: [0] Quotas can get annoying to work with. [1] Not a ton of wrappers
in a diverse set of languages. [2] Not a ton of support with desktop SQL
clients.

~~~
vgt
Great to hear from a happy customer!

Some more color commentary:

\- Quotas can be a little annoying indeed, but they're generally there for a
reason. Over time, we've moved BigQuery in the direction of having fewer and
fewer quotas in general.

\- BigQuery recently released ODBC and JDBC drivers, as well as Standard SQL
support, so you can plug in your favorite desktop client.

\- On predictability of pricing, there are several flavors of proactive
controls - cost per query, cost per user, cost per group, etc. There are
reactive billing alerts as well. And as you get to Petabyte scale, there's a
flat rate pricing model.

(work on GCP)

~~~
wjossey
Happy to hear about the JDBC addition. I had missed that. Thank you VGT!

------
minimaxir
The complexity of the Stack Overflow Data Explorer was the only reason I never
played around with SO Data.

I am looking through the tables now and there is certainly a lot of cool stuff
that can be done! :)

Although, there appear to be a few tables with garbage data and only a few rows,
like posts_privilage_wiki and posts_wiki_placeholder.

~~~
minimaxir
Here's a quick example from merging the example questions and tweaking a bit:
% SO Questions Answered in 2016, by Type of Technology

[https://docs.google.com/spreadsheets/d/1QlXayFZYGb2U_2OrFDv4...](https://docs.google.com/spreadsheets/d/1QlXayFZYGb2U_2OrFDv416ZaayuysS_fT31IDvt843o/edit?usp=sharing)

------
sambrand
Hey, I figured it out!

[https://bigquery.cloud.google.com/savedquery/809799891616:6b...](https://bigquery.cloud.google.com/savedquery/809799891616:6bda226894244e1f82ed214cfd1c2af3)

There's a learning curve. Getting in without giving up your credit card isn't
exactly intuitive. And Leslie isn't the most popular name in the US -- it's
a gender-neutral name, so there are lots of rows.

~~~
chaz6
Thanks for sharing but unfortunately when I try to access it I just get
prompted to create a new project. Maybe Google doesn't like people sharing
their own projects.

------
__coaxialcabal
I really hope GCP continues to do this with other types of syndicated public
data feeds like American FactFinder, FRED, etc. They are missing a substantial
audience that would see this as a killer app and a compelling reason to move to
GCP for this. The success of Enigma.io suggests there is a legit (although
easily reproduced) business case here with minimal effort. Anyone who has ever
worked to prep and integrate this type of data could now spend that time doing
science.

------
amelius
Since the article is on a "big data and machine-learning blog", I wonder: what
machine-learning applications would this data-set enable?

~~~
garysieling
One option is training a system to identify the topic of text, since
stackoverflow has that. E.g. I'm looking to build something to tag lecture
transcripts with their programming topics for
[https://www.findlectures.com](https://www.findlectures.com).
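
A toy sketch of that idea, assuming questions paired with their tags are used as training examples. This is a naive-Bayes-style bag-of-words scorer with add-one smoothing and no class priors; the example texts and tags are made up for illustration, and a real system would need far more data and a proper library.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split into simple word-like tokens."""
    return re.findall(r"[a-z0-9+#]+", text.lower())


def train(examples):
    """examples: iterable of (text, tag). Returns per-tag word counts and the vocabulary."""
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, tag in examples:
        tokens = tokenize(text)
        word_counts[tag].update(tokens)
        vocab.update(tokens)
    return word_counts, vocab


def predict(text, word_counts, vocab):
    """Return the tag whose smoothed word distribution best explains the text."""
    tokens = tokenize(text)
    best_tag, best_score = None, -math.inf
    for tag, counts in word_counts.items():
        total = sum(counts.values())
        # Sum of log-probabilities with add-one (Laplace) smoothing.
        score = sum(
            math.log((counts[t] + 1) / (total + len(vocab))) for t in tokens
        )
        if score > best_score:
            best_tag, best_score = tag, score
    return best_tag


# Hypothetical training data standing in for tagged questions.
examples = [
    ("how do I merge two dataframes in pandas", "python"),
    ("list comprehension versus generator expression", "python"),
    ("select rows with a left join and group by", "sql"),
    ("create index on a table column", "sql"),
]
word_counts, vocab = train(examples)
print(predict("join two tables and group the rows", word_counts, vocab))  # sql
```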

------
anamoulous
I wish they included the common crawl datasets.

~~~
fhoffa
Can do :)

~~~
orf
Please add this! I was looking for this exact thing recently and would love to
play with it.

------
hawski
I have two Google accounts and BigQuery has a problem if I want to use the
non-default one. It inhibited my curiosity a bit.

I had an idea to query the usage of string handling functions in C code bases,
so I could do something like a manual linting around them.
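
A minimal local sketch of that kind of check, assuming you have the C source as a string. The watch list of functions is hypothetical, and the regex is naive: it also counts matches inside comments and string literals, so it is a starting point for manual review rather than a real linter.

```python
import re
from collections import Counter

# Hypothetical watch list of string-handling functions; extend as needed.
STRING_FUNCS = ("strcpy", "strncpy", "strcat", "strncat", "sprintf", "strlen")

# Match a listed name as a whole identifier followed by an opening parenthesis.
CALL_RE = re.compile(r"\b(" + "|".join(STRING_FUNCS) + r")\s*\(")


def count_string_calls(c_source: str) -> Counter:
    """Count apparent calls to the watched functions in a C source string."""
    return Counter(CALL_RE.findall(c_source))


sample = """
char buf[16];
strcpy(buf, src);
strncpy(buf, src, sizeof buf);
strcpy (other, buf);
my_strcpy(buf, src);
"""
counts = count_string_calls(sample)
print(counts["strcpy"], counts["strncpy"])  # 2 1
```

The `\b` word boundary keeps wrappers such as `my_strcpy` from being counted as `strcpy`.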

~~~
fhoffa
If multi-login doesn't work, try using different profiles on Chrome. It works
better :)

[https://support.google.com/chrome/answer/2364824?co=GENIE.Pl...](https://support.google.com/chrome/answer/2364824?co=GENIE.Platform%3DDesktop&hl=en)

------
koolba
Very cool. I love the idea of public data sets like these.

I wonder if AWS can swing something similar with Athena. They already have
"requester pays" buckets for S3, so it should be in line with that to have a
similar offering for Athena connected to S3 resources.

------
WatchDog
It would be nice if they added a public dataset of the certificate
transparency logs.

------
yagga
Did Stack Overflow agree to offer the data, or did Google just take it because
they can? Stack Overflow has their own query tool. Why would they give their
data to Google?

~~~
gbrayut
Blog post: [https://stackoverflow.blog/2016/12/You-Can-Now-Play-With-
Sta...](https://stackoverflow.blog/2016/12/You-Can-Now-Play-With-Stack-
Overflow-Data-on-Googles-BigQuery/)

~~~
hashhar
It's kinda funny that the blog link was posted to HN even before this
Google one, and still the third-party Google one made it to the front page.

~~~
fhoffa
I guess my queries linking the dataset to HN posts contributed to this post's
success :)

