Hacker News
Google launches BigQuery - Analyze big data on the cloud (cloud.google.com)
62 points by neya 1853 days ago | 19 comments

I tried running some simple SELECT queries on the sample Wikipedia dataset of a few GBs. Most queries took 5 to 10 seconds, which is not exactly fast.

Moreover, it is super expensive with their 'data processed' based pricing. It actually costs more than $1 to run a single query on a database of 30 GB. So it would cost over $300,000 a month to power my analytics app at 10,000 queries a day. This cost will skyrocket further if your database size is anywhere near a terabyte.
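The arithmetic behind that figure, as a quick sketch (the per-query price is an assumption taken from this comment, not official pricing):

```python
# Back-of-envelope monthly cost for the scenario above.
# All numbers are illustrative, from the comment, not a price list.
cost_per_query = 1.05      # dollars: "more than $1" per ~30 GB query
queries_per_day = 10_000
days_per_month = 30

daily_cost = cost_per_query * queries_per_day   # dollars per day
monthly_cost = daily_cost * days_per_month      # dollars per month
print(f"${monthly_cost:,.0f} per month")        # → $315,000 per month
```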

Hi, I work on the BigQuery team.

Keep in mind that you only pay for the columns you query. The Wikipedia table as a whole is 35GB, but if you only query one column, it might only need to scan a couple of GB. If you can limit the columns you are querying, you can save a lot of money.
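A rough sketch of why column pruning matters under bytes-scanned billing (the column size and per-GB price below are assumptions for illustration, not official numbers):

```python
# BigQuery bills by bytes scanned, and a columnar store only reads
# the columns a query actually references. Illustrative numbers:
table_size_gb = 35.0    # whole Wikipedia sample table (per the comment)
column_size_gb = 2.0    # a single queried column (assumed)
price_per_gb = 0.035    # assumed dollars per GB processed

full_scan_cost = table_size_gb * price_per_gb    # SELECT * style query
one_column_cost = column_size_gb * price_per_gb  # SELECT one_column ...

print(f"full scan: ${full_scan_cost:.3f}")   # → full scan: $1.225
print(f"one column: ${one_column_cost:.3f}") # → one column: $0.070
```

Under these assumed numbers, restricting the query to one column cuts the cost of that query by more than 15x.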

If you have a high-traffic analytics app, it would probably make sense to cache some of the materialized results, which are usually orders of magnitude smaller than the source data. BigQuery supports writing its output to another table, but it would probably be even faster for you to cache these results client-side.
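A minimal sketch of the client-side caching idea, assuming results are keyed by query text with a TTL (`run_query` here is a stand-in for an actual BigQuery call, not a real API):

```python
import time

class QueryCache:
    """Cache query results client-side so repeated queries
    don't trigger repeated (billed) scans."""

    def __init__(self, run_query, ttl_seconds=300):
        self.run_query = run_query      # callable: query text -> result
        self.ttl = ttl_seconds
        self._cache = {}                # query -> (timestamp, result)

    def get(self, query):
        now = time.time()
        hit = self._cache.get(query)
        if hit and now - hit[0] < self.ttl:
            return hit[1]               # cache hit: no scan billed
        result = self.run_query(query)  # cache miss: pay for the scan once
        self._cache[query] = (now, result)
        return result
```

For a dashboard-style analytics app, where thousands of users see the same handful of reports, a cache like this turns 10,000 queries a day into however many distinct queries you actually have, refreshed once per TTL window.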

What about query latency? 5-8 seconds?

5-8 seconds isn't a problem at all.

It's Analytics, not an online database for website authentication or something.

It competes with data warehouse solutions, where the typical reporting model involves submitting a job, waiting an hour, and then getting a report.

I'd expect a lot of the most useful applications running on this will use queries that take many minutes (if not hours) to complete.

"...waiting an hour..."

Uhm... what?

I work in Business Intelligence, and anything over 5 seconds to return a typical report (e.g. year-to-date or daily sales) is unacceptable by my standards.

Could you perhaps go into more detail on the definition of 'job' in your post? Are we talking about giant year-end actuarial runs, or something like that?

If you know the exact dimensions you're going to query, then you can construct a data cube to fit your exact scenario up front, and indeed, those will return fast.

BigQuery allows you to explore the data without the pain of setting up your star schemas upfront. Think of it this way: you've got XXX TB of log data, and you have a new question you want to ask of it. At this point, you're heading back to MapReduce, or Pig/Hive, etc. BigQuery is based on Google's Dremel (check out the paper, it's a great read), and carries all the operational and performance learnings from wide deployment at Google. Type in your SQL-like query, and BQ takes care of the rest, within a matter of seconds.

tl;dr: you're comparing apples and oranges.

No, I'm talking about ad-hoc reports over large datasets.

For example: find the original source of all users who bought more than 6 different items over any 6-week period during the last 5 years, then find every web page loaded by IPs from the same subnet as those users in the same time periods.

Got it, makes sense.

OP ran "simple SELECT queries on the sample Wikipedia dataset", not complex queries.


That's like complaining that an F-22 takes a lot longer to start than a motorbike. It's completely true in every way, and yet not something anyone who uses either of them cares about.

Simple selects from Wikipedia are a nice demo for this, nothing more.

Where is this sample Wikipedia dataset described?

The Wikipedia dataset `publicdata:samples.wikipedia` is described here: https://developers.google.com/bigquery/docs/dataset-wikipedi...

Unfortunately it's not that interesting, as it holds just the revision history. Earlier this week I was contemplating writing a script to import the entire Wikipedia dataset into BigQuery. Has anyone else already done this, or would anyone be interested in such a script?

Queries on hundreds of TB in seconds?

Anyone know what powers this? Is it custom SQL optimization on top of BigTable and/or MapReduce?

Cool. Columnar, read-only, nested records.

The ability to JOIN (even if it is limited to 8 MB) is pretty useful for a couple of specific use cases we came up against recently.

It can reduce the disk (and therefore, more importantly, cache) space requirements of the materialised views you otherwise have to maintain with a product like Cassandra (which is still ACE! IMHO).


Considering they started it, there is still more scope for Google to leverage the MapReduce paradigm and build more products around it, as the Hadoop ecosystem has done. This looks quite useful to start with.

This is not MapReduce, though, but rather an execution engine specialized for data analysis: http://research.google.com/pubs/pub36632.html

How does it compare with http://www.vertica.com ?
