
Redshift vs. BigQuery vs. Snowflake benchmark - georgewfraser
https://blog.fivetran.com/warehouse-benchmark-dce9f4c529c1
======
boulos
Disclosure: I work on Google Cloud (but not BigQuery specifically).

Cool! First, thanks for doing this.

You allude to it a little in your text, but I think your cost model:

(cost_per_hour / 3600) * query_in_seconds / expected_utilization

doesn't (often) work out. Your benchmarks only run one query at a time, so
BigQuery will look pretty expensive compared to the smallest
Redshift/Snowflake cluster.

Have you tried simulating, say, a team of 10 analysts by sending concurrent
queries? (You will see the average runtime go up as you fill the box.)
Alternatively, even 18% utilization is probably way too high for a single
person/analyst!
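
To make the model concrete, here is a minimal sketch of that formula (the
$2/hour price and 10-second runtime are made-up numbers, not any vendor's
real pricing):

```python
# Sketch of the cost model quoted above: the amortized cost of one query
# on an always-on, fixed-price cluster. All numbers are illustrative.
def cost_per_query(cost_per_hour, query_seconds, expected_utilization):
    return (cost_per_hour / 3600) * query_seconds / expected_utilization

# A 10-second query on a $2/hour cluster at 18% utilization:
print(round(cost_per_query(2.0, 10, 0.18), 4))  # 0.0309
```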

P.S.: You seem to have checked in export PGPASSWORD=...

~~~
georgewfraser
(author here)

Trying to compare costs between BigQuery and Redshift/Snowflake is really,
really hard because of their different pricing models. I tried to be very
explicit about what assumptions I am making so people can make their own
judgements. You are correct that a single-user sending queries to the
warehouse is not realistic, but it should still produce an accurate price
comparison. Here's my logic:

* Suppose I were to send 10 times as many concurrent queries to the same warehouse. That would drop the effective cost-per-query to 1/10th, which makes BigQuery look 10 times worse.

* But suppose these queries now take 10 times as long. According to my formula, that makes the effective cost-per-query 10x larger, which makes BigQuery look 10 times better.

As long as the performance of the warehouse is linear-ish with respect to
number-of-users, these two effects should cancel each other out.
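
A sketch of that cancellation using the same formula (assuming perfectly
linear degradation; the price and timings are made up):

```python
# Under the linear-ish assumption, 10x concurrency raises utilization 10x
# and also makes each query ~10x slower, so cost-per-query is unchanged.
def cost_per_query(cost_per_hour, query_seconds, utilization):
    return (cost_per_hour / 3600) * query_seconds / utilization

single_user = cost_per_query(2.0, 10, 0.018)
ten_users = cost_per_query(2.0, 10 * 10, 0.018 * 10)
assert abs(single_user - ten_users) < 1e-9  # the two effects cancel
```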

RE: PS: Don't worry, I made that one up just for that test cluster, which is
now gone :)

~~~
boulos
I should have been more explicit: it won't be linear-ish. As you fill up any
per-box sort of system, you start contending (imagine simulating 100
analysts), while the BigQuery results will be closer to linear. It does suck
that to benchmark that "accurately", you have to make the queries different
enough that no reuse occurs (unless that's reasonable!).

Again, thanks for doing this!

~~~
ShaunK
When we've looked at BigQuery, it seemed that if you prepay you essentially
get a similar effect to what you're describing. You're given a certain number
of "units" of compute, and if you exceed the concurrent units available you
end up with the same compute resource contention you would see with an
improperly scaled Snowflake warehouse or Redshift cluster.

If you're willing to just pay per gigabyte scanned with BigQuery you can scale
near linearly I'm sure (although I haven't actually tried it), but you could
accomplish the same thing using Snowflake's API to add warehouses as
concurrent query load increases. That's what we do (although we just pre-
allocate and suspend the warehouses because you only pay when they're on).

Redshift does suffer from this problem because the compute is tied to the
data, but Redshift Spectrum is attempting to rectify that as well. I don't
know anything about its performance though.

------
mulmen
I have to consider these results to be meaningless based only on the fact that
you do not use the "advanced" features like sort and distribution keys. In
Redshift especially sort and distribution keys are not "advanced" features,
they are fundamental factors in determining performance.

Your justification for this is that you don't know how queries will run but
that's a poor excuse. With your star schema example dataset it should be easy
to pick distribution keys that match the access patterns of commonly joined
tables. There may be tradeoffs here but it's far from impossible and you are
crippling performance by not using these features.

~~~
mulmen
It's too late to edit, but this comment feels like it dismisses your hard
work, which I do not intend. Having said that, I am concerned that this
approach leads to the Ford-vs-Chevy type arguments that don't help anyone
make an informed decision.

What do you think about doing a series on the workloads that each database
engine is best at? Surely each model has benefits and disadvantages. Finding
what each system is best at will help people decide what system they really
need.

You are in a unique position to answer this question because as you say, you
have no interest in any specific solution and your customers use all of these
systems.

~~~
fraserharris
Hi mulmen - I work with George (author). Dist and sort keys are absolutely
invaluable tools for optimizing your common BI workloads. What we are seeing
with our customers is extensive use of arbitrary querying via tools like
Looker or Tableau, which often do not use the sort or dist key. This is
becoming the most common workload for Fivetran customers because of the vast
number of frequently updating dashboards that are powered by those queries.

~~~
mulmen
Thank you for your insight. My team uses both Tableau and Redshift but we
don't do a lot of frequently updating dashboards.

This sounds like a major limitation of these query tools. Are these arbitrary
queries written by users or created by the tools programmatically? Since dist
keys are ideally join keys why do these tools (or users) not understand the
join keys? Can the queries be identified via WLM and triaged for optimization?

I am skeptical of any solution that promises to be magic and do the hard work
for us. I think the "big data" industry does itself a disservice by
perpetuating the myth that hard problems can be solved with mouse clicks.

~~~
georgewfraser
It's not a limitation of the query tools. The problem is that users are
running so many different queries against the same schema that you can't
optimize for them all. Therefore, it's most important how a data warehouse
performs when the optimization "tricks" don't work.

Reasonable people can disagree on this last claim---we'll publish a v2 of this
benchmark in a few months where we do more tuning
[https://github.com/fivetran/benchmark/issues/1](https://github.com/fivetran/benchmark/issues/1)

~~~
mulmen
There will always be trade-offs so disagreements are to be expected. My
interest is largely in what models work best for a given system, specifically
Redshift. What data models have you seen in use? Are the schemas dimensionally
modeled or are they flat tables?

Where have you run into limitations with dimensional modeling? Is it a lack of
engineering resources on the customer side or is the approach limited in some
fundamental way?

Thanks for sharing your invaluable experience. I look forward to part two!

------
robbles
I'm not sure leaving out sort and dist keys is actually being fair to Redshift
here.

I get the argument that you're avoiding any "special tuning", but these are
basic features that the docs and most guides will tell you are not really
optional. In my mind it's more like leaving out the primary key declaration on
a table in Postgres or MySQL.

~~~
georgewfraser
(author here)

It's a tough call. In my experience, sort and dist keys only really work out
in < 50% of queries, and it's the other 50% where you judge the performance of
your warehouse.

If we had chosen to use sort and dist keys in Redshift, then we would have
also used clustering keys in Snowflake, and it probably would have been
somewhat of a wash. BigQuery doesn't really have an equivalent feature yet---
they have date partitioning, but that's a very narrow special case that won't
work for most queries.

~~~
ruw1090
It looks like you had to rewrite the queries pretty significantly to get them
to run on all three. Redshift/BigQuery don't support ROLLUP, GROUPING SETS,
INTERSECT, EXCEPT, etc., which the TPC-DS spec queries use. Is there a reason
you don't mention this in the blog post? SQL surface area seems like a pretty
important thing to consider when choosing a DW.

~~~
dimfeld
GROUP BY ROLLUP support actually just arrived in BigQuery a week or two ago.
Still a lot of things missing, but they are steadily improving.

------
manigandham
BigQuery is one of the few products that truly feels magical. It lives up to
the "no-ops" model of just working without thinking about details and handles
any scale of data you have.

Only downsides would be the per-query pricing model and the lack of
low-latency queries.

~~~
JPKab
Agreed.

One point that this article doesn't address (not that it necessarily SHOULD)
is the fact that BigQuery achieves these results with zero database
administration or schema design.

Redshift is great, but it's not "set and forget".

Also, there are a huge number of use cases out there which are daily reports
on large datasets. Fixed vs. Variable cost models are obviously going to shift
in favor of variable when this kind of scenario comes into play.

Lastly, what kind of cost factor do you think comes into play when it comes
to paying someone to administer your Redshift DB, vs. not paying anyone to
administer the BigQuery setup? I can tell you from experience that this is a
huge, huge factor.

~~~
mattbillenstein
I think you need someone to run and optimize the data warehouse in any case --
figuring out data exports to BQ, table partitioning, etc don't really come for
free. Perhaps BQ is easier in that you don't have to figure out how to run and
scale a Redshift cluster, but as a BQ user, I do a decent amount of work
mucking around with the schemas and figuring out how to do things more
performantly or cheaply.

------
CobrastanJorji
This is super well written! I've seen a lot of benchmark queries show up on
Hacker News, and I think this one is exceptional for calling out and
justifying all of its assumptions, then comparing it to other benchmarks and
explaining how they might be different. I really appreciate all of that
clarity. That academic but easy-to-read style doesn't make for flashy
headlines but it's so much more informative.

------
flgr
I don't know; personally I feel like the scale here is much too low to be
relevant. Also I think not investing the time to tune sort and dist keys makes
the comparison meaningless.

But maybe that just becomes meaningful at larger data sizes and maybe most
people work with less data most of the time.

------
tziki
Do you plan to take into account the developer cost? BigQuery is more managed
than Redshift so that definitely factors into the "real world" cost analysis.

~~~
georgewfraser
Yes! We're going to write a "part 2" where we talk about the qualitative
differences, which are huge.

~~~
milesward
Sweet! Another wrinkle to add in: if you're comparing a Redshift system
that's >$35k/yr, BQ does have a flat-rate pricing option which might fare
better in your TCO analysis. Info here:
[https://cloud.google.com/bigquery/pricing#flat_rate_pricing](https://cloud.google.com/bigquery/pricing#flat_rate_pricing)
Yes, I’m a biased cloud Googler.

------
fibo
I have been using Redshift for two years, and while every database has its
own SQL dialect and its own tricks, it is a really good product. It is a pity
that Amazon forked the early PostgreSQL v8 rather than PostgreSQL v9 (with
array support). But when I think about the databases I used before (I was
admin of a Netezza instance and user of an Oracle instance; they were data
warehouses too), I realize what a big step forward it is that we can now
create, with a few clicks and really fast provisioning, a database that used
to take days, even months (contracts, waiting for hardware to arrive at the
datacenter, DBAs configuring and installing it, etc.) to get up and running.
Thinking about Redshift or BigQuery, it seems like another era.

------
ozataman
There is a different side to the cost benchmark that's not captured by the
description here. If your use case needs a lot of stored data but not
necessarily a matching degree of peak CPU (even if your query load is
otherwise pretty consistent), Redshift will become really expensive really
fast and it will feel like a waste. BigQuery, meanwhile, will keep costs
(almost) linear in your actual query usage, with very low storage costs.

For example, you may need to provision a 20-node cluster only because you need
the 10+ terabytes in storage across several datasets you need to keep "hot"
for sporadic use throughout the day/week, but don't nearly need all that
computational capacity around the clock. Unlike BigQuery, Redshift doesn't
separate storage from querying. Redshift also doesn't offer a practically
acceptable way to scale up/down; resizes at that scale take up to a day,
deleting/restoring datasets would cause lots of administrative overhead and
even capacity tuning between multiple users is a frequent concern.

Making matters worse, it is common for a small number of tables to be the
large "source of truth" tables that you need to keep around to re-populate
various intermediate tables even if they themselves don't get queries that
often. In Redshift, you will provision a large cluster just to be able to keep
them around even though 99% of your queries will hit one of the smaller
tables.

That said, I haven't tried the relatively new "query data on S3" Redshift
functionality. It doesn't seem quite the equivalent of what BigQuery does, but
may perhaps alleviate this issue.

Sidenote: I have been a huge Redshift fan pretty much since its release under
AWS. I do however think that it is starting to lose its edge and show its age
among the recent advances in the space; I have been increasingly impressed
with the ease of use (including intra team and even inter-team collaboration)
in the BigQuery camp.

~~~
joeharris76
Redshift offers hard disk based nodes with huge amounts of storage at low cost
for precisely the use case you mention. The performance of these is actually
very good, especially with a little effort applied to choosing sort keys and
dist keys.

Spectrum extends that even further, allowing you to have recent and reference
data locally stored and keep archival data in S3 available for query at any
time.

------
afpx
I’ve been using Redshift fairly heavily for the past couple of years.
Unfortunately, I’ve been too strapped for time to do comparisons against
BigTable. And we have only just started experimenting with Snowflake
recently.

I would like to try out BigTable, but experimenting with it has required too
much work. I would have to move my entire data set (Several Petabytes) over to
Google.

Redshift has been fine. But, with Redshift, we have two big issues: 1)
Loading data takes too long. Even with efficient copy operations from S3, it
takes too long to import data at scale. 2) Inefficient queries frequently
overflow to disk and consume the entire SSD. I’ve tried to train my data
science team on how to avoid inefficient queries, but it’s been impossible.

So, I’m really looking forward to seeing the results from part 2!

~~~
mattbillenstein
Have you thought about using EMR to generate rollups of the S3 data and then
putting that into Redshift?

~~~
afpx
We do that, for the most part. It definitely helps.

------
marc19
Also, here is a rudimentary, 2-part comparison of Redshift vs Azure SQL DW (in
preview) for good measure:

Part 1 - [http://bicortex.com/microsoft-azure-sql-data-warehouse-
quick...](http://bicortex.com/microsoft-azure-sql-data-warehouse-quick-review-
and-amazon-redshift-comparison-part-1/) Part 2 -
[http://bicortex.com/microsoft-azure-sql-data-warehouse-
quick...](http://bicortex.com/microsoft-azure-sql-data-warehouse-quick-review-
and-amazon-redshift-comparison-part-2/)

------
justinsaccount
How hard would it be to add
[https://clickhouse.yandex/](https://clickhouse.yandex/) to the benchmarks?

~~~
manigandham
On-premise installs tend to be very different from cloud managed warehouses
and not really comparable.

The 1 billion taxi rides benchmark by Mark Litwintschik is pretty
comprehensive though:
[http://tech.marksblogg.com/benchmarks.html](http://tech.marksblogg.com/benchmarks.html)

~~~
justinsaccount
Yes.. that's a good starting point but the hardware used for each platform
varies a lot.

ClickHouse on a single i5 compared to Redshift on a 6-node ds2.8xlarge
cluster.

I've been throwing everything I can at clickhouse on a little VM, and the
performance is amazing.

~~~
IanCal
I've recently moved to clickhouse for some analytical work, and it's been
awesome. It's not perfect for what I want, and I'm sure I'm using it wrong,
but it's torn through what I have to throw at it and I've not spent any time
doing perf tweaks.

There's a connector for redash too, so I can make nice little dashboards from
the results.

Major problem is a lack of updates to data. At least loading is fast enough
for my use.

~~~
justinsaccount
You can't do updates, but you can create tables using the AggregatingMergeTree
table engine.

I've been working out how to use this for tracking count/first seen/last seen
for data sets. With normal SQL you need to do upserts, but with ClickHouse you
can just create the table using countState, minState, maxState and insert:

    
    
      1, now, now
    

for count, first, last, and when it compacts the table it'll apply the
aggregations.

Basically what you do with
[https://github.com/facebook/rocksdb/wiki/merge-operator](https://github.com/facebook/rocksdb/wiki/merge-operator),
just with a few lines of SQL instead of a ton of custom code.
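
The merge semantics can be sketched outside ClickHouse in a few lines (plain
Python, illustrative only; countState/minState/maxState produce opaque
partial states, but they combine with exactly this shape):

```python
# Rough model of the AggregatingMergeTree idea: every insert carries a
# partial aggregate state (count, first_seen, last_seen) and compaction
# merges rows sharing a key. Not ClickHouse internals, just the semantics.
def merge_states(rows):
    """rows: iterable of (key, count, first_seen, last_seen) tuples."""
    merged = {}
    for key, count, first, last in rows:
        if key in merged:
            c, f, l = merged[key]
            merged[key] = (c + count, min(f, first), max(l, last))
        else:
            merged[key] = (count, first, last)
    return merged

rows = [("a.example", 1, 100, 100),
        ("a.example", 1, 250, 250),
        ("b.example", 1, 150, 150)]
print(merge_states(rows))
# {'a.example': (2, 100, 250), 'b.example': (1, 150, 150)}
```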

~~~
IanCal
Interesting, I'll have an experiment with this today. I think I can see the
pattern but I'm not entirely sure it'll work quite as I'm hoping.

So assuming I've got a nice primary key (which thankfully I do for the main
data I care about), I could have a table something like (syntax will be wrong
but):

    
    
        PrimaryKey
        Something
        Otherthings
        maxState(LastUpdated)
    

Then when selecting I can run queries efficiently to only get the latest
version of a row?

The bit I'm not sure on is I feel a lot of this is designed to work nicely
with columns, and in this case I have a fairly row-like constraint (latest row
with a key).

Should be an interesting day of investigating though, thanks!

~~~
justinsaccount
Here's a workable example that I came up with:

    
    
      CREATE TABLE mt2
      (
        whatever Date DEFAULT '2000-01-01',
        key String,
        value String,
        first AggregateFunction(min, DateTime),
        last AggregateFunction(max, DateTime),
        total AggregateFunction(count, UInt64)
      ) ENGINE = AggregatingMergeTree(whatever, (key, value), 8192)
    

      insert into mt2 (key, value, first, last, total)
      select 'www.google.com', '1.2.3.4',
        minState(toDateTime(1498241729)),
        maxState(toDateTime(1498241729)),
        countState(cast(1 as UInt64));

      insert into mt2 (key, value, first, last, total)
      select 'www.google.com', '1.2.3.5',
        minState(toDateTime(1498242729)),
        maxState(toDateTime(1498242729)),
        countState(cast(1 as UInt64));

      insert into mt2 (key, value, first, last, total)
      select 'www.google.com', '1.2.3.6',
        minState(toDateTime(1498242729)),
        maxState(toDateTime(1498242829)),
        countState(cast(1 as UInt64));

      select key, value, minMerge(first), maxMerge(last), countMerge(total)
      from mt2
      group by key, value;

      -- compact the table and merge multiple rows with the same (key, value)
      optimize table mt2;

You have to add the date column, but you can ignore it.

The last query can be a view or even a materialized view.

~~~
IanCal
Thanks!

I was also playing yesterday with CollapsingMergeTree.

    
    
        CREATE TABLE cmt
        (
          whatever Date DEFAULT '2000-01-01',
          key String,
          value String,
          sign Int8
        ) ENGINE = CollapsingMergeTree(whatever, (key, value), 8192, sign)
    
    

Now you can 'delete' a row by sending the same row again but with a sign of
-1:

    
    
        insert into cmt (key, value, sign) values ('k1', 'v1', 1)
        insert into cmt (key, value, sign) values ('k1', 'v1', -1)
        insert into cmt (key, value, sign) values ('k1', 'v1 update', 1)
        insert into cmt (key, value, sign) values ('k2', 'just delete this one', 1)
        insert into cmt (key, value, sign) values ('k2', 'just delete this one', -1)
    

As far as I can tell, for this to work you have to either add FINAL to the
query, optimise the table, or hope it's been done in the background.

I thought you had to use the sign in the query or it wouldn't work, but
creating this example this morning it works fine. If you're not getting the
response you expect after optimising, try adding the sign column to the
query.

    
    
        :) select key, value, sign from cmt
    
        SELECT
            key,
            value,
            sign
        FROM cmt
        
        ┌─key─┬─value─────┬─sign─┐
        │ k1  │ v1 update │    1 │
        └─────┴───────────┴──────┘
        
        1 rows in set. Elapsed: 0.002 sec.
        
        :)
    
    

[disclaimer - I know very little about how to make high performance things in
clickhouse, tbh the Log format has been easily fast enough for my data so far]
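
The collapsing behaviour itself reduces to a few lines (plain Python, just
the semantics; real ClickHouse merges happen lazily in the background):

```python
# Toy model of CollapsingMergeTree: sign=+1 adds a row version, sign=-1
# cancels a matching earlier one; a merge keeps rows whose signs don't
# fully cancel. Illustrative only, not ClickHouse internals.
from collections import Counter

def collapse(rows):
    """rows: iterable of ((key, value), sign) pairs."""
    counts = Counter()
    for row, sign in rows:
        counts[row] += sign
    return [row for row, n in counts.items() if n > 0]

rows = [(("k1", "v1"), 1),
        (("k1", "v1"), -1),
        (("k1", "v1 update"), 1),
        (("k2", "delete me"), 1),
        (("k2", "delete me"), -1)]
print(collapse(rows))  # [('k1', 'v1 update')]
```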

------
chinkatie
Looks like Fortune picked this up. Nice!
[http://fortune.com/2017/10/09/amazon-google-snowflake-
data-w...](http://fortune.com/2017/10/09/amazon-google-snowflake-data-
warehouses/)

------
justinsaccount
[https://github.com/fivetran/benchmark](https://github.com/fivetran/benchmark)
Does not exist?

~~~
georgewfraser
Fixed! Thanks for pointing that out.

------
greggyb
Curious, any reason you've left out MS Azure DW?

I'd be interested in seeing their numbers in there as well.

~~~
marc19
I ran some benchmarks comparing Azure SQL DW (in preview) to Redshift. They
can be found here: [http://bicortex.com/microsoft-azure-sql-data-warehouse-
quick...](http://bicortex.com/microsoft-azure-sql-data-warehouse-quick-review-
and-amazon-redshift-comparison-part-1/) and here:
[http://bicortex.com/microsoft-azure-sql-data-warehouse-
quick...](http://bicortex.com/microsoft-azure-sql-data-warehouse-quick-review-
and-amazon-redshift-comparison-part-2/)

------
goldfishcaura
A BigQuery partner here and a long-time user of Redshift.
([https://cloud.google.com/partners/directory/#?q=caura](https://cloud.google.com/partners/directory/#?q=caura)
[https://discourse.looker.com/t/bigquery-migrating-from-
legac...](https://discourse.looker.com/t/bigquery-migrating-from-legacy-sql-
to-standard-sql/3568) and you can Google my creds for Redshift with
"Troubleshooting Redshift performance")

So having benchmark tests is great as a general guideline for what works
under different architectures/schema designs. Unfortunately, benchmarking is
highly sensitive to the initial choices. I am a big fan of BigQuery (enough
to go through Google's vetting process), but there are plenty of performance
issues I've run into that would have been easily resolved with Redshift. Here
are some concrete examples:

1) Running a query across several very small tables. It turns out that
occasionally querying small tables causes heavy network traffic within
Google's distributed system. The solution on Redshift would be to adjust
distribution. On Google, however, you don't have any control over this. You
just have to hope that Google's algorithms pick up the issue based on usage
(they don't).

2) Joining large tables. Avoid joining large tables in BigQuery. In Redshift
the join would be handled by making sure the sortkey is set on the column
used for the join on the (typically) right table, then having a common
distkey between the two tables (this way the relevant data from both tables
lives on the same node). BigQuery just throws resources at the problem. Well,
it turns out that throwing resources at the problem is super slow (think 5-15
Redshift seconds vs. 200 BQ seconds).
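
A toy model of why the shared distkey matters (illustrative Python, not
Redshift internals): rows hash to nodes by the distribution key, so matching
join keys land on the same node and the join needs no network shuffle.

```python
import zlib
from collections import defaultdict

NUM_NODES = 4

def node_for(distkey):
    # Deterministic hash-distribution of a row by its distkey value.
    return zlib.crc32(distkey.encode()) % NUM_NODES

def distribute(rows):
    nodes = defaultdict(list)
    for row in rows:
        nodes[node_for(row[0])].append(row)  # row[0] is the distkey
    return nodes

orders = [("cust42", "order-1"), ("cust7", "order-2")]
customers = [("cust42", "Alice"), ("cust7", "Bob")]

o_nodes, c_nodes = distribute(orders), distribute(customers)
joined = []
for n in range(NUM_NODES):
    # Each node joins only its local slices; no match lives elsewhere.
    local_names = dict(c_nodes.get(n, []))
    joined += [(cid, oid, local_names[cid]) for cid, oid in o_nodes.get(n, [])]
print(sorted(joined))
# [('cust42', 'order-1', 'Alice'), ('cust7', 'order-2', 'Bob')]
```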

Re: Snowflake. Can't speak to it as I haven't had personal experience. I
have worked with data people whose opinions fell on both the favorable and
negative sides of the spectrum. This just suggests to me that, like Redshift
and BigQuery, Snowflake is not a universal solution. You really need to
understand: 1) what your goals are for usage among the varying consumers, and
2) what skill set the various users of the database have.

