
Busting myths about BigQuery - tzury
https://cloud.google.com/blog/big-data/2017/12/busting-12-myths-about-bigquery
======
tzury
We at Reblaze [0] push nearly 3.5 billion records into BQ every day.

We use it as our infrastructure for big-data analysis and we get the results
at the speed of light.

Every 60 seconds, nearly 1K queries analyze user and IP behavior.

Our entire ML pipeline is based on it, and half a dozen applications in our
platform make use of BQ - from detecting scrapers and analyzing user input to
defeating DDoS attacks.

The built-in math and window functions (in the legacy SQL variant [1]) are
super handy, the DataLab [2] integration is easy, and, last but not least,
Google DataStudio [3] lets us generate interactive reports in literally
minutes - all of which have made our choice (3+ years now) a great one.

BQ replaced 2,160,000 core-hours per month (250 instances * 12 cores each,
always on) and petabytes of storage, and costs us several grand per month.

This is perhaps one of the greatest hidden gems available today in the cloud
sphere, and I recommend everyone give it a try.

A very good friend of mine replaced MixPanel with BigQuery and has saved
nearly a quarter of a million dollars a year since [4].

\--

[0] https://blog.reblaze.com/how-to-stop-a-ddos-attack-in-under-5-minutes-and-block-them-in-the-future-d279fe3c7e15

[1] https://cloud.google.com/bigquery/docs/reference/legacy-sql

[2] https://cloud.google.com/datalab/

[3] https://datastudio.google.com

[4] https://blog.doit-intl.com/replacing-mixpanel-with-bigquery-dataflow-and-kubernetes-b5f844710674

------
pentium10
We have been using BQ for 4 years now, from the beginning dual-writing to
different engines. We built an architecture that routes data to Redis, Riak,
Elasticsearch, MySQL, and BigQuery. We often have 500+ line queries in
BigQuery, written in standard SQL, that run our reporting system. Our average
query time is 10 seconds, usually scanning at least a terabyte. We truly love
the ability to write UDFs in JavaScript, and our marketing team loves to
combine their Google Sheets (mapping data) with a real BQ query. We heavily
use the "dry run" feature of BQ to see, without running a query, how many
resources it will consume and how much it will cost. We even implemented a
per-user quota based on this feature, which works great and alerts users only
when a mistake in a query exceeds a certain threshold. After a couple of years
we naturally started to consolidate all our data into BigQuery. Since it's
standard SQL, all our team members can start using it, and reporting + tooling
works quite well. DataStudio is another service which we have started to use
more and more as well.
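The dry-run cost check described above can be sketched like this (a minimal sketch: the $5-per-TiB on-demand rate and the helper names are assumptions, and the google-cloud-bigquery client call is shown only in comments since it needs credentials):

```python
# Assumed on-demand rate at the time: $5 per TiB scanned (check current pricing).
PRICE_PER_TIB_USD = 5.0


def estimated_cost_usd(total_bytes_processed: int) -> float:
    """Convert a dry run's byte estimate into an estimated dollar cost."""
    return total_bytes_processed / 2**40 * PRICE_PER_TIB_USD


# The byte estimate comes from a dry-run job; with the google-cloud-bigquery
# client it looks roughly like this (nothing executes, nothing is billed):
#
#   config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
#   job = client.query(sql, job_config=config)
#   cost = estimated_cost_usd(job.total_bytes_processed)
#
# A per-user quota check can then compare `cost` against a remaining budget
# before the real query is ever submitted.
```

A full terabyte-scale scan prices out immediately, before any slot time is spent, which is what makes the quota scheme described above cheap to enforce.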

~~~
indogooner
How do you handle updates in BigQuery in case your queries require the latest
version of records? Do you use views? If so, does that affect query latency
too much?

~~~
pentium10
During the 4 years we at REEA.net tried a lot of approaches, in different
projects.

1. We stream every update as a new row directly to the table. For every table
we have a view, with the naming convention "table_latest", that targets the
latest version of each record. We didn't see a big latency increase. Deleted
rows have a 'deleted' flag, and we use EXCEPT in the query.

2. Since DML became available we started a dual approach on newer projects: a
dedicated streaming table, plus rematerializing into a flat table. It adds
developer complexity but it's a win, as the materialized table is easier for
our marketing team to use. We added a "Based on data from 10 minutes ago"
label to the reports, and a CTA "re-run using live data", which in the
background calls the script to rematerialize the table and rerun the report.
Our marketing team loves this, and they don't abuse the functionality. On some
projects we rematerialize every hour, on others daily.

3. Regarding partitioned tables, where you can stream only into a 7-day
window of partitions, we also use a complementary load job: we stream within
the 7-day window, and for our heavy "future date" use case we use load jobs
to a temp table, then query and write to the respective partitions.
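In BigQuery standard SQL, a "latest version" view like this is typically a `ROW_NUMBER() OVER (PARTITION BY id ORDER BY version DESC)` filter. The semantics can be sketched in Python (the table shape and column names here are invented for illustration, not REEA.net's actual schema):

```python
# Hypothetical append-only stream: every update is a new row,
# and deletes are soft-deletes via a 'deleted' flag.
rows = [
    {"id": 1, "version": 1, "deleted": False, "name": "a"},
    {"id": 1, "version": 2, "deleted": False, "name": "a2"},
    {"id": 2, "version": 1, "deleted": False, "name": "b"},
    {"id": 2, "version": 2, "deleted": True,  "name": "b"},
]


def latest(rows):
    """What a `table_latest` view computes: keep only the newest version
    of each id, then drop records whose newest version is deleted."""
    newest = {}
    for r in rows:
        if r["id"] not in newest or r["version"] > newest[r["id"]]["version"]:
            newest[r["id"]] = r
    return [r for r in newest.values() if not r["deleted"]]
```

Running `latest(rows)` keeps only id 1 at version 2; id 2's newest version carries the deleted flag, so it is filtered out entirely - the same effect the view achieves over the append-only base table.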

------
bzillins
As long as you turn a PG table into a BQ dataset with enough partitions (BQ
tables) that queries don't touch a disproportionate amount of unnecessary
data, you use date partitioning, and you can live with a minimum ~1.5-second
query time, the service is truly amazing. My only frustrations are that most
streaming-service degradations are never noted on the status dashboard, and
that you can get really odd error states, like headers stating you should be
getting JSON while the body contains an HTML error message.

------
scrollaway
Tangential: Anyone here used both BigQuery and Redshift?

Here's the primary reason that is currently keeping me on Redshift:

Redshift is basically Postgres at the query layer, which is insanely cool:
all the Postgres tooling works with it, and all the Postgres expertise I have
carries over. It feels like a lot less of a lock-in.

Anyone care to chime in?

~~~
nothrabannosir
I have used both. My biggest aha moment with BQ was realising it isn’t “SQL
to query a database”, but “map-reduce as SQL”. Redshift really feels like a
DB: I’m always worrying about the runtime complexity of my SQL, the number of
nodes in my fleet, how my joins work - it just feels like a big, big DB. BQ
is less like that. I’m thinking in terms of the memory complexity of my
operation: N, log N, etc., always having as many nodes available as the job
can be parallelised to. You’re writing a map-reduce job on your data, and
it... feels like the more appropriate boundary for abstraction.

This is a bit hand wavy, but there is really no better way to get a feel for
it than to try it. And think of it less as a db, and more as a map reduce
cluster :) if that helps.

~~~
barrkel
Your comment is a little bit perplexing to me because I don't really recognize
much difference between composing relational operations and map / reduce
operations. SQL `select` is map, `group by` is reduce, `where` is a flatMap to
zero or one items, and `join` is a kind of specialized flatMap to zero or more
items. You have to consider IO quantity and throughput (both disk and memory)
to get decent performance either way.

For example, if you're writing straight-up SQL, the fewer columns you project
out before you start sorting, the better off you are; and it's better to join
stuff after you've done your sort than the reverse, because it means
physically less data shuffling.
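That correspondence can be made concrete with a toy sketch (the rows and column names are invented for illustration):

```python
from functools import reduce
from itertools import groupby

rows = [
    {"country": "de", "amount": 10},
    {"country": "de", "amount": 5},
    {"country": "us", "amount": 7},
]

# WHERE amount > 4  ->  a flatMap to zero or one items (i.e. filter)
filtered = [r for r in rows if r["amount"] > 4]

# SELECT country, amount  ->  map (project out columns early)
projected = [(r["country"], r["amount"]) for r in filtered]

# GROUP BY country, SUM(amount)  ->  shuffle by key, then reduce per key
projected.sort(key=lambda t: t[0])  # groupby needs sorted (shuffled) input
totals = {
    country: reduce(lambda acc, t: acc + t[1], group, 0)
    for country, group in groupby(projected, key=lambda t: t[0])
}
# totals == {"de": 15, "us": 7}
```

The sort before `groupby` is the stand-in for the shuffle phase, which is exactly where projecting fewer columns first pays off: less data physically moves between nodes.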

~~~
nothrabannosir
In redshift, I control the size of nodes. That means I can brute force my way
through an expensive operation by increasing the memory capacity of the nodes.
In BigQuery, you play by the rules or you’re out. It forces you to think in
map reduce, because there really is nothing else under the hood.

Also, you get immediate parallelisation across O(n) nodes. Again, not in
Redshift.

These are crucially different from regular DBs. They’re both semantically
SQL, nobody denies that, but they describe different underlying models.

E.g.: in BigQuery, you can’t sort your entire column, even if you do other
stuff afterwards. That makes sense in a map reduce system, but not in a
“normal” DB.

~~~
barrkel
Thanks for the clarification.

------
lima
I recently discovered ClickHouse[1], Yandex's newly open-sourced BigQuery
equivalent (someone on HN pointed me to it!).

If you're looking for an OLAP database running on your own infrastructure,
make sure to give it a try.

[1]: [https://clickhouse.yandex/](https://clickhouse.yandex/)

~~~
tzury
Thanks for the link. Seems interesting. BQ is serverless, though.

~~~
lima
Serverless is just another word for "third party service".

~~~
vgt
I respectfully disagree. While what "serverless" truly means varies depending
on who you ask, in this context I think it's:

- Multi-tenant architecture.

- Pay-per-job/query.

- Complete abstraction of underlying resources.

- High availability out of the box.

- Automatic, seamless scaling on demand.

So services like BigQuery qualify, and services like Redshift don't.

(work at G)

------
hbarka
Inside Google, aren’t F1 and Plx preferred over BQ?

~~~
ddorian43
See "oltp vs olap"

~~~
elvinyung
F1 isn't strictly OLTP though. Even the paper says it:

> F1 also provides a full-fledged SQL interface, which is used for low-latency
> OLTP queries, large OLAP queries, and everything in between.

~~~
ddorian43
ok see "row store vs columnar store"

~~~
elvinyung
F1 does not itself have a storage engine, and can use a variety of them.
Spanner is actually now columnar [1] in a way that's similar to Parquet:

> Ressi stores a database as an LSM tree, whose layers are periodically
> compacted. Within each layer, Ressi organizes data into blocks in row-major
> order, but lays out the data within a block in column-major order
> (essentially, the PAX layout).

[1] https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/46103.pdf

------
jnordwick
So Google ads are now HN material?

This is literally just Google putting out a press release touting one of
their products. It isn't unbiased and doesn't attempt to be complete, but
people here seem to think Google's opinion on their own technology is beyond
reproach.

Think of something like this being on Microsoft or Oracle website. No way it
would be voted up.

~~~
bloudermilk
It’s a shame you’re getting downvoted, since you’re right about the content.
The list of “truths” isn’t anything more than a feature list. I got to #4
before coming to the comments, where I found the top three comments were all
well-written positive reviews of the product. The whole submission screams of
a well-executed marketing play.

~~~
fhoffa
Oh hi! I work for Google. I browse Hacker News in my free time - like now,
from a plane.

Conspiracy fodder: how smart (or not) would a team need to be to plan a
"well-executed marketing play" for an enterprise data warehouse -- on the
morning of freakin' Saturday, December 23rd?

http://images5.fanpop.com/image/photos/31100000/Classic-Patrick-spongebob-squarepants-31126065-338-500.jpg

~~~
bloudermilk
Respectfully, I’m not seeing how your employment at Google or your being on a
plane is relevant to my comment. If you had the experience I do working at
marketing agencies for brands such as Google, I suspect you wouldn’t find my
theory near conspiracy territory. I concede that such a play likely wouldn’t
be considered well-executed given the timing, but it’s entirely within the
realm of possibility.

