Hacker News
BigQuery public datasets now include Stack Overflow Q&A (cloud.google.com)
190 points by fhoffa on Dec 15, 2016 | 33 comments



Google BigQuery has pure separation of storage and compute, which allows Public Datasets [0] to exist in a ready-to-query, highly optimized format. Run SQL immediately!

BigQuery has a perpetual free query tier of 1 Terabyte per month (worth $5). In addition, you can get $300 in Google Cloud credits for two months to do more work [1].

(work on Google Cloud and used to work on BigQuery)

Edit: spaces, how do they work?!?!

[0] https://cloud.google.com/bigquery/public-data/

[1] https://cloud.google.com/free-trial/
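A rough sketch of the arithmetic behind the free tier described above. It assumes on-demand queries bill at $5 per terabyte scanned (implied by "1 Terabyte per month ($5)") and that a terabyte is counted as 2**40 bytes. The function and constant names are hypothetical, not part of any BigQuery API.

```python
TB = 2 ** 40               # assumption: 1 TB counted as 2**40 bytes
PRICE_PER_TB_USD = 5.0     # implied by the comment: 1 free TB is worth $5
FREE_TB_PER_MONTH = 1.0    # perpetual free query tier from the comment


def monthly_query_cost(bytes_scanned_per_query, queries_per_month):
    """Estimate the monthly on-demand bill in USD after the free tier."""
    total_tb = bytes_scanned_per_query * queries_per_month / TB
    billable_tb = max(0.0, total_tb - FREE_TB_PER_MONTH)
    return billable_tb * PRICE_PER_TB_USD


if __name__ == "__main__":
    # Three full-terabyte scans in a month: the first is free, two are billed.
    print(monthly_query_cost(TB, 3))
```

The estimate only covers query charges; storage and any promotional credits are left out.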


Are there any plans to bring predictable pricing (i.e. reserved slots) to the sub $40k/mo user?

I'm trying to bootstrap an idea for a startup dealing with a massive dataset, but the fear of future costs is a major blocker of building around BQ.

The pricing model for Azure SQL Data Warehouse is amazing. You buy a fixed number of slots and can find your own sweet spot of response time/cost, but it still has a $1k/mo barrier to entry.


feel free to pm me.


Damn, and the whole dataset for 1 query is in memory. Talk about a fast network. Any other juice on the internals?




For me, the killer dataset would be the Google Scholar data. That would just blow the whole scientometric space wide open. It would also be a nice introduction to BigQuery for researchers (calculate your own H-factor).


I completely get that it's hard to make money off Google Scholar. But it could do so many more cool things with more effort.


I'd highly recommend anyone who hasn't played with BigQuery yet to take a few moments to give it a shot. I was highly skeptical at first when my old company moved to it, but now I've come to love it as a platform.

What I love most: [0] Exceptionally fast queries against large datasets. [1] Very cost effective (although as others have called out, it can be accidentally misused, resulting in a big bill). [2] Non-data-engineers can set up, use, and manage it with minimal difficulty. [3] Gets you away from high-priced solutions like Vertica or Teradata. [4] No management headaches like Redshift.

Downsides: [0] Quotas can get annoying to work with. [1] Not a ton of wrappers in a diverse set of languages. [2] Not a ton of support with desktop SQL clients.


Great to hear from a happy customer!

Some more color commentary:

- Quotas can be a little annoying indeed, but they're generally there for a reason. Over time, we've moved BigQuery in the direction of having fewer and fewer quotas in general

- BigQuery recently released ODBC and JDBC drivers, as well as Standard SQL support, so you can plug in your favorite desktop client

- On predictability of pricing, there are several flavors of proactive controls - cost per query, cost per user, cost per group, etc. There are reactive billing alerts as well. And as you get to Petabyte scale, there's a flat rate pricing model.

(work on GCP)
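To make the "cost per query" and "cost per user" controls mentioned above concrete, here is a minimal client-side sketch. It is not how BigQuery implements its controls; the class, method names, and byte limits are all hypothetical, and it assumes you already have an estimate of the bytes a query would scan.

```python
class BudgetExceeded(Exception):
    """Raised when a query would break a per-query or per-user limit."""


class QueryBudget:
    """Hypothetical guard that tracks estimated bytes scanned per user."""

    def __init__(self, per_query_limit_bytes, per_user_limit_bytes):
        self.per_query_limit_bytes = per_query_limit_bytes
        self.per_user_limit_bytes = per_user_limit_bytes
        self.used_bytes = {}  # user name -> bytes already authorized

    def authorize(self, user, estimated_bytes):
        """Record the query if it fits both limits; return the user's remaining bytes."""
        if estimated_bytes > self.per_query_limit_bytes:
            raise BudgetExceeded("query exceeds the per-query limit")
        used = self.used_bytes.get(user, 0)
        if used + estimated_bytes > self.per_user_limit_bytes:
            raise BudgetExceeded("user would exceed the per-user limit")
        self.used_bytes[user] = used + estimated_bytes
        return self.per_user_limit_bytes - self.used_bytes[user]
```

A rejected query leaves the user's recorded usage unchanged, so a smaller query can still be authorized afterwards.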


Happy to hear about the JDBC addition. I had missed that. Thank you VGT!


Would be nice if quotas actually worked on time-partitioned tables.


The complexity of the Stack Overflow Data Explorer was the only reason I never played around with SO Data.

I am looking through the tables now and there is certainly a lot of cool stuff that can be done! :)

Although there appear to be a few tables with garbage data and only a few rows, like posts_privilage_wiki and posts_wiki_placeholder.


Here's a quick example from merging the example questions and tweaking a bit: % SO Questions Answered in 2016, by Type of Technology

https://docs.google.com/spreadsheets/d/1QlXayFZYGb2U_2OrFDv4...


Looking forward to your always inspiring posts! Let me know how I can help.


Hey, I figured it out!

https://bigquery.cloud.google.com/savedquery/809799891616:6b...

There's a learning curve. Getting in without giving up your credit card isn't exactly intuitive. And Leslie isn't the most popular name in the US -- it's a gender-neutral name, so there are lots of rows.


Thanks for sharing, but unfortunately when I try to access it I just get prompted to create a new project. Maybe Google doesn't like people sharing their own projects.


I really hope GCP continues to do this with other types of syndicated public data feeds like American FactFinder, FRED, etc. They are missing a substantial audience that would see this as a killer app and compelling reason to move to GCP for this. The success of Enigma.io suggests there is a legit (although easily reproduced) business case here with minimal effort. Anyone who has ever worked to prep and integrate this type of data could now spend that time doing science.


Since the article is on a "big data and machine-learning blog", I wonder: what machine-learning applications would this data-set enable?


One option is training a system to identify the topic of text, since Stack Overflow has tags for that. E.g. I'm looking to build something to tag lecture transcripts with their programming topics for https://www.findlectures.com.
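As a toy illustration of the topic-tagging idea above, the sketch below scores a text against per-tag word frequencies learned from tagged examples. It is a deliberately simple word-frequency scorer, not a production classifier, and the four training sentences are made up for illustration rather than taken from the Stack Overflow dataset.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9_+#]+", text.lower())


def train(examples):
    """Build per-tag word counts from (text, tag) pairs."""
    counts = defaultdict(Counter)
    for text, tag in examples:
        counts[tag].update(tokenize(text))
    return counts


def predict(counts, text):
    """Return the tag whose relative word frequencies best match the text."""
    words = tokenize(text)

    def score(tag):
        total = sum(counts[tag].values()) or 1
        return sum(counts[tag][w] / total for w in words)

    return max(counts, key=score)


# Hypothetical training data standing in for tagged questions.
EXAMPLES = [
    ("How do I join two tables with a foreign key", "sql"),
    ("Select rows where column is null in a query", "sql"),
    ("Segfault when freeing a pointer from malloc", "c"),
    ("Buffer overflow with strcpy and char array", "c"),
]

if __name__ == "__main__":
    model = train(EXAMPLES)
    print(predict(model, "lecture on pointer arithmetic and malloc"))
```

With real data, the (text, tag) pairs would come from question bodies and their tags, and a proper model would replace the frequency score.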


Kaggle has had several competitions using SO Data. One example: https://www.kaggle.com/c/predict-closed-questions-on-stack-o...


I wish they included the common crawl datasets.


Can do :)


Please add this! I was looking for this exact thing recently and would love to play with it.


I have two Google accounts, and BigQuery has a problem if I want to use the non-default one. It dampened my curiosity a bit.

I had an idea to query the usage of string handling functions in C code bases, so I could do something like a manual linting around them.
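A local sketch of the kind of check described above: counting calls to a handful of C string functions in a piece of source text. The function list is illustrative only, and the regex is naive; it will also match calls that appear inside comments or string literals.

```python
import re
from collections import Counter

# Illustrative list of string-handling functions to look for.
STRING_FUNCS = ("strcpy", "strcat", "sprintf", "gets", "strncpy")

# Match a listed function name as a whole word followed by an opening parenthesis.
CALL_RE = re.compile(r"\b(" + "|".join(STRING_FUNCS) + r")\s*\(")


def count_string_calls(c_source):
    """Count apparent calls to the listed functions in C source text."""
    return Counter(CALL_RE.findall(c_source))


if __name__ == "__main__":
    sample = "char b[8]; strcpy(b, s); strcat(b, t); fgets(b, 8, stdin);"
    print(count_string_calls(sample))
```

The word boundary keeps names such as fgets or my_strcpy from being counted as gets or strcpy.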


If multi-login doesn't work, try using different profiles on Chrome. It works better :)

https://support.google.com/chrome/answer/2364824?co=GENIE.Pl...


Very cool. I love the idea of public data sets like these.

I wonder if AWS can swing something similar with Athena. They already have "requester pays" buckets for S3, so it should be in line with that to have a similar offering for Athena connected to S3 resources.


It would be nice if they added a public dataset of the certificate transparency logs.


Did Stack Overflow agree to offer the data, or did Google just take it because they can? Stack Overflow has their own query tool. Why would they give their data to Google?



It's kinda funny that the blog link was posted to HN even before this Google one, and still the third-party Google one made it to the front page.


I guess my queries linking the dataset to HN posts contributed to this post's success :)


Stack Overflow (and all the other SEs) is CC-licensed, and you can download the dump yourself: https://stackoverflow.blog/2009/06/stack-overflow-creative-c...



