

Analyze terabytes of data with just a click of a button - wr1472
https://developers.google.com/bigquery/

======
tomnewton
We use this product extensively to analyse users who play our games. We record
billions of rows of data per day and can query them efficiently with BigQuery.
It is a pretty awesome product and getting better all the time.

~~~
zabar
That's interesting; you're the only one in this thread who is using it.

Could you tell us a bit more about what kind of analytics you're using it for?
And why do you use BigQuery rather than some other analytics solution
(Mixpanel, Kontagent, Flurry, etc.) for these scenarios?

------
alooPotato
At streak.com we use BigQuery to analyze our logs. All of our application logs
are shipped over to BigQuery, and we run SQL queries over this large data set.
It's been amazingly helpful.
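
To give a flavor of the kind of queries we mean, here is a minimal sketch
using the google-cloud-bigquery Python client (the project, dataset, table,
and field names are all hypothetical):

    # Sketch: run an aggregate query over a hypothetical logs table in BigQuery.
    # Requires the google-cloud-bigquery package and application credentials.
    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")  # hypothetical project id

    query = """
        SELECT status, COUNT(*) AS hits
        FROM `my-project.app_logs.requests`
        GROUP BY status
        ORDER BY hits DESC
    """

    for row in client.query(query).result():
        print(row.status, row.hits)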

We open sourced two things that helped us do this:

1) A Chrome extension to make BigQuery's UI better, see here:
<http://blog.streak.com/2012/07/streak-developer-tools-chrome-extension.html>

2) A library to push logs to BigQuery if you're using App Engine:
<http://blog.streak.com/2012/07/export-your-google-app-engine-logs-to.html>

------
karterk
Tangential to the discussion, but I am currently working on a system that will
handle some 100 million semi-structured rows. Despite all the buzz about "big
data", Hadoop, NoSQL, etc., there is no system that gives me a way to reliably
store AND _search_ through this data on multiple columns in near real-time
(assuming I can throw a reasonable number of nodes at it).

I looked at BigQuery, but it doesn't support updates. It's just dump and
analyze.

~~~
emidln
I needed to do something like this and I ended up settling on a custom
workflow connected through ZeroMQ, dumping data into sharded SQLite databases.
A single process handled writing to each database, and lots of reads were
mapped out and then reduced (also via zmq). This handled around 14 billion
rows when I left that job. If you want to talk some more about it, my gmail is
the same as my username.
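
In rough outline, each shard's writer looked something like this sketch
(heavily simplified; the port, schema, and field names are made up):

    # Sketch of one shard's writer: a single process owns one SQLite file and
    # drains rows pushed to it over a ZeroMQ PULL socket (pyzmq + sqlite3).
    import sqlite3
    import zmq

    db = sqlite3.connect("shard_00.db")
    db.execute("CREATE TABLE IF NOT EXISTS events (ts REAL, user_id TEXT, payload TEXT)")

    ctx = zmq.Context()
    sock = ctx.socket(zmq.PULL)
    sock.bind("tcp://*:5550")  # producers connect with PUSH sockets and send JSON rows

    while True:
        row = sock.recv_json()  # e.g. {"ts": 1345.6, "user_id": "u1", "payload": "..."}
        db.execute("INSERT INTO events VALUES (?, ?, ?)",
                   (row["ts"], row["user_id"], row["payload"]))
        db.commit()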

~~~
karterk
Thanks - I have emailed you.

------
tocomment
Dumb question, but how do you load the terabytes of data?

~~~
SatvikBeri
One company I know of literally ships hard drives; it's faster and more
reliable than transferring the data over the internet. I'm not sure if that's
what they do with BigQuery.

~~~
alooPotato
To get data into BigQuery you need to get it into Google Cloud Storage first
and then import it from there. To get it into Cloud Storage you can use the
REST APIs or the command-line utilities. See
<https://developers.google.com/storage/>
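
As a sketch with the current google-cloud Python clients (the bucket, dataset,
table, and file names are hypothetical; gsutil and the bq command-line tool
cover the same two steps):

    # Sketch: upload a local file to Google Cloud Storage, then load it into
    # BigQuery. All bucket/dataset/table/file names here are hypothetical.
    from google.cloud import bigquery, storage

    # 1) Push the file into Cloud Storage.
    bucket = storage.Client().bucket("my-log-bucket")
    bucket.blob("logs/2012-08-20.csv").upload_from_filename("2012-08-20.csv")

    # 2) Import it into a BigQuery table from the Cloud Storage URI.
    bq = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        autodetect=True,  # let BigQuery infer the schema
    )
    load_job = bq.load_table_from_uri(
        "gs://my-log-bucket/logs/2012-08-20.csv",
        "my-project.app_logs.requests",
        job_config=job_config,
    )
    load_job.result()  # block until the load job finishes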

------
ajays
The associated Wired article:
<http://www.wired.com/wiredenterprise/2012/08/google-dremel-versus-hadoop/>

and the ensuing HN discussion: <http://news.ycombinator.com/item?id=4395164>

------
d0vs
IIRC it's been around for months now. Why is it resurfacing?

~~~
drucken
Wired.

------
vog
This is a web service; do I have this right?

Wow, so Google is going to collect business and scientific _raw data_ as well.

~~~
sp332
Well it won't be publicly searchable unless you decide to share it. One of the
features is user-based access permissions.

~~~
dbaupp
But Google has access to it, which I think is vog's point.

~~~
tonfa
Just because you technically have access to something doesn't mean you should
or will access it, for a myriad of reasons.

~~~
dmishe
> should

------
snorkel
You can do similar feats with Amazon Elastic MapReduce + Karmasphere at a
reasonable cost, except that Amazon S3 has an annoying 5GB max file size
limit, which is a pain to work around, whereas Google Cloud Storage has a 5TB
max file size.
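
(For what it's worth, the usual way past the 5GB single-upload cap is S3's
multipart upload. A rough sketch with boto3, where the bucket and file names
are made up and boto3 handles the part splitting automatically:)

    # Sketch: work around the single-PUT size cap with S3 multipart upload.
    # Bucket, key, and file names are made up.
    import boto3
    from boto3.s3.transfer import TransferConfig

    s3 = boto3.client("s3")
    config = TransferConfig(
        multipart_threshold=64 * 1024 * 1024,   # switch to multipart above 64 MB
        multipart_chunksize=256 * 1024 * 1024,  # upload in 256 MB parts
    )
    s3.upload_file("huge_dataset.csv", "my-analytics-bucket",
                   "input/huge_dataset.csv", Config=config)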

~~~
sandfox
Isn't Amazon S3 now running with a 5 terabyte limit[1]? But still, shipping
drives is far more sane than sending things of that size across the
t'interwebs.

[1] <http://aws.amazon.com/s3/faqs/#How_much_data_can_I_store>

