Moreover, it is super expensive with their 'data processed' based pricing. It costs more than $1 to run a single query against a 30 GB database, so at 10,000 queries a day my analytics app would cost roughly $300,000 a month. That cost skyrockets further if your database is anywhere near a terabyte.
Keep in mind that you only pay for the columns you query. The Wikipedia table as a whole is 35 GB, but if you only query one column, the scan might only touch a couple of GB. If you can limit the columns you query, you can save a lot of money.
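A rough sketch of how column pruning changes the bill under bytes-scanned pricing. The per-GB rate and the table layout below are made-up illustrations, not actual BigQuery prices:

```python
# Back-of-envelope cost model for columnar, bytes-scanned pricing.
# PRICE_PER_GB is an assumed rate for illustration only; check the
# provider's published pricing for real numbers.
PRICE_PER_GB = 0.035

def query_cost(column_sizes_gb, columns_queried):
    """Cost of one query: you pay only for the columns actually scanned."""
    scanned_gb = sum(column_sizes_gb[c] for c in columns_queried)
    return scanned_gb * PRICE_PER_GB

# Hypothetical 35 GB table: one narrow column, one huge one.
table = {"title": 2.0, "body": 30.0, "metadata": 3.0}

full_scan = query_cost(table, table.keys())   # like SELECT *
one_col = query_cost(table, ["title"])        # like SELECT title
```

Scanning just the narrow column is more than an order of magnitude cheaper than the full-width scan, which is the whole point of limiting the columns you query.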
If you have a high-traffic analytics app, it would probably make sense to cache some of the materialized results, which are usually orders of magnitude smaller than the source data. BigQuery supports writing its output to another table, but it would probably be even faster for you to cache these results client-side.
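A minimal sketch of the client-side caching idea. The `execute` callable here is a hypothetical stand-in for the real (expensive) BigQuery request, and the TTL is arbitrary:

```python
import time

class ResultCache:
    """Tiny in-memory cache for query results. Aggregated results are
    usually far smaller than the source data, so even a naive cache
    cuts the cost of repeat queries to zero."""

    def __init__(self, ttl_seconds=300):
        self.ttl = ttl_seconds
        self._store = {}  # query text -> (timestamp, rows)

    def get(self, query):
        entry = self._store.get(query)
        if entry and time.time() - entry[0] < self.ttl:
            return entry[1]
        return None

    def put(self, query, rows):
        self._store[query] = (time.time(), rows)

def run_query(cache, query, execute):
    """Serve from cache when fresh; otherwise run `execute` (the
    placeholder for the real query call) and cache its result."""
    cached = cache.get(query)
    if cached is not None:
        return cached
    rows = execute(query)
    cache.put(query, rows)
    return rows
```

With 10,000 queries a day, even a modest hit rate on the hot reports translates directly into money saved.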
It's Analytics, not an online database for website authentication or something.
It competes with data-warehouse solutions where the typical reporting model involves submitting a job, waiting an hour, and then getting a report.
I'd expect a lot of the most useful applications running on this will use queries that take many minutes (if not hours) to complete.
I work in Business Intelligence, and anything over 5 seconds to return a typical report (e.g. year-to-date or daily sales) is unacceptable by my standards.
Could you perhaps go into more detail on your definition of 'job'? Are we talking about giant year-end actuarial runs, or something like that?
BigQuery lets you explore the data without the pain of setting up your star schemas upfront. Think of it this way: you've got XXX Tb of log data, and you have a new question you want to ask it. At this point, you're heading back to Map-Reduce, or Pig/Hive, etc. BigQuery is based on Google's Dremel (check out the paper, it's a great read), and carries all the operational and performance learnings from wide deployment at Google. Type in your SQL-like query and BQ takes care of the rest, returning results within seconds.
tl;dr: you're comparing apples and oranges.
For example: find the original source of all users who bought more than 6 different items over any 6-week period during the last 5 years, then find every web page loaded by IPs from the same subnet as those users in the same time periods.
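The first half of that question can be sketched in a few lines. This is illustrative application code only (in a warehouse it would be a SQL query, not a Python loop), and the data shape is assumed:

```python
from datetime import timedelta

def heavy_buyers(purchases, min_items=6, window=timedelta(weeks=6)):
    """Users who bought more than `min_items` distinct items within any
    `window`-long period. `purchases` maps user -> list of
    (timestamp, item) tuples."""
    result = set()
    for user, events in purchases.items():
        events = sorted(events)
        # Slide a window starting at each purchase; count distinct
        # items falling inside it.
        for i, (start, _) in enumerate(events):
            items = {item for ts, item in events[i:] if ts - start <= window}
            if len(items) > min_items:
                result.add(user)
                break
    return result
```

The point of the example is that an ad-hoc question like this is painful to pre-model as a star schema, but is a single query in a system that scans the raw data fast enough.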
That's like complaining that an F-22 takes a lot longer to start than a motorbike. It's completely true in every way, and yet not something anyone who uses either of them cares about.
Simple selects from Wikipedia are a nice demo for this, nothing more.
Unfortunately it's not that interesting, as it holds just the revision history. Earlier this week I was contemplating writing a script to import the entire Wikipedia dataset into BigQuery. Has anyone already done this, or would anyone be interested in such a script?
Anyone know what powers this? Is this custom SQL optimization on top of BigTable and/or Map Reduce?
It can reduce the disk (& therefore more importantly, cache) space requirements of the materialised views you otherwise have to maintain with a product like Cassandra (which is still ACE! IMHO).