
Data Warehouse Benchmark: Redshift, Snowflake, Azure, Presto, BigQuery - oconnore
https://fivetran.com/blog/warehouse-benchmark
======
xs83
Speed is only one factor to consider when choosing a data warehouse. As
someone who has moved through three of these in the past year: we have settled
on a mix of Spark, Presto, and BigQuery depending on the workload.

\- Presto is not good at longer queries: if a node dies, the query fails and
needs to be restarted. It is, however, orders of magnitude faster than any of
the other solutions when it comes to geospatial functions; the team behind it
are simply wizards.

\- BigQuery is also super fast and a fantastic tool for ad-hoc analysis of
huge amounts of data, and they have just started to implement GIS
functionality, so we are watching it closely. Some of our analysts have been
stung on pricing where partitioning wasn't possible, meaning we were charged
for scanning 10TB+ of data for a relatively simple query - it has a learning
curve for sure!

\- Redshift was our original data warehouse. It was great for prescribed data
in an ETL pipeline, but scaling a cluster takes hours, and data skew meant
that the entire cluster would fill up during queries unless sort keys and
distribution keys were precisely calibrated - quite difficult when the
dimensions of your data keep changing.

\- Spark / EMR / Tez has been our standout workhorse for many things now. It
is much slower than any of the above, but there are many tools that work with
Spark and the ecosystem is growing rapidly. We had to perform a cross join of
16B records against 140M ranges, and every single one of the above solutions
either crapped out on us or became prohibitively expensive to run at scale
with meaningful output. Spark took longer (1h 25m), but the progress was
steady and quantifiable. Presto often died mid-query for a number of reasons
(including that we wanted to run this on preemptible instances on GCP, and it
doesn't support fault tolerance).

File formats are a HUGE differentiator when it comes to these systems as well
- we chose ORC as our file format due to the availability of bloom filters and
predicate pushdown in Presto. This means we can load a 10TB dataset in a
couple of minutes and query the files directly, without having to load them
into a dedicated store first.

Our preference, in order: ORC > Parquet > Avro > CSV.
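A toy sketch of why the predicate pushdown mentioned above makes ORC-style formats fast: each stripe carries min/max statistics, so a reader can skip whole stripes whose value range cannot match the predicate, without decoding them. (Real ORC readers also consult bloom filters for point lookups; the stripe layout below is a simplification for illustration, not the actual ORC on-disk format.)

```python
# Each "stripe" holds rows plus min/max stats over the filtered column.
stripes = [
    {"min": 0,   "max": 99,  "rows": list(range(0, 100))},
    {"min": 100, "max": 199, "rows": list(range(100, 200))},
    {"min": 200, "max": 299, "rows": list(range(200, 300))},
]

def scan_eq(stripes, target):
    """Return rows equal to `target`, reading only stripes whose stats admit it."""
    matches, stripes_read = [], 0
    for s in stripes:
        # Pushdown: skip the stripe entirely if stats rule the predicate out.
        if s["min"] <= target <= s["max"]:
            stripes_read += 1
            matches.extend(r for r in s["rows"] if r == target)
    return matches, stripes_read

rows, stripes_read = scan_eq(stripes, 150)  # touches 1 of 3 stripes
```

The same stats-before-data check is what lets Presto answer selective queries over a 10TB dataset without scanning most of it.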

Basically, these benchmarks are quite good for determining speed, but there
are factors other than raw speed that will bite you in the ass unless you are
aware of them :)

~~~
vgt
Geospatial for BigQuery just went Beta:

[https://cloud.google.com/bigquery/docs/gis-intro](https://cloud.google.com/bigquery/docs/gis-intro)

~~~
voycey
I've been in the private beta for a while now, using it for ad-hoc queries,
still missing a few things I require but I think it will be better down the
line!

~~~
havermeyer
Late reply, but which things is it missing for your needs? I work on the
BigQuery team, so I can pass along any feedback that you have.

~~~
voycey
I've given most of this feedback in the beta group. First off, I love the
format conversion capabilities; it's just missing a few things like clustering
algorithms. If parity can be achieved with the PostGIS ST functions, it will
be fantastic!

~~~
havermeyer
Thanks!

------
wgjordan
Full-page, un-closeable 'sign up for our newsletter!' modal shows up ~10
seconds into article, making it unreadable on mobile.

~~~
waterlooalex
I didn't have that problem, maybe they've since fixed it?

------
swimorsinka
Hmm, I use Redshift every day and I've also used BigQuery. You can't have a
valid benchmark without adding sort keys and dist keys to Redshift. It really
isn't meant to work without them.

I understand it's more work that way, which is why BigQuery is so nice. But if
you're trying to do lots of queries, BigQuery is also more expensive.

~~~
georgewfraser
We've gotten a lot of passionate feedback in this category, and we actually
did a comparison with sort and dist keys in the extended version. They don't
make as much difference as people think: these charts would shift by only ~15%
with sort and dist keys.

~~~
vgt
Did you guys implement cluster keys in BigQuery as well?

(BQ PM)

~~~
georgewfraser
No we just did a comparison between Redshift with no tuning, conservative
tuning, aggressive tuning. The point was just to assess "how much does tuning
matter anyway?"

You are correct that if we wanted to do a full-on "tuned comparison", we would
need to implement partitioned tables in BQ, Snowflake and Presto.

~~~
vgt
(and clustering in BQ! [0])

[0] [https://cloud.google.com/bigquery/docs/clustered-tables](https://cloud.google.com/bigquery/docs/clustered-tables)

------
pcarolan
To say that Redshift doesn't support Arrays vs BigQuery's UDF support is a
little confusing. Redshift has built-in functions for array operations and
Python UDF support.

[https://docs.aws.amazon.com/redshift/latest/dg/JSON_EXTRACT_ARRAY_ELEMENT_TEXT.html](https://docs.aws.amazon.com/redshift/latest/dg/JSON_EXTRACT_ARRAY_ELEMENT_TEXT.html)

~~~
georgewfraser
The fundamental problem is that Redshift doesn't have an UNNEST/FLATTEN
operator. So even if you write a UDF that does something with an array, it's
extremely awkward to work with the results in SQL.
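A minimal sketch of what an UNNEST/FLATTEN operator does semantically, in plain Python: each input row with an array column becomes one output row per array element. (In BigQuery this is roughly `SELECT id, tag FROM t, UNNEST(tags) AS tag`; the row data here is made up for illustration.)

```python
rows = [
    {"id": 1, "tags": ["a", "b"]},
    {"id": 2, "tags": ["c"]},
]

def unnest(rows, array_col, out_col):
    """Emit one output row per element of the array column."""
    for row in rows:
        for value in row[array_col]:
            flat = {k: v for k, v in row.items() if k != array_col}
            flat[out_col] = value
            yield flat

flattened = list(unnest(rows, "tags", "tag"))
# one row per (id, tag) pair: (1, "a"), (1, "b"), (2, "c")
```

Without a built-in operator for this row-multiplying step, an array-returning UDF leaves you with a scalar blob that plain SQL can't easily turn back into rows, which is the awkwardness described above.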

~~~
pcarolan
That is a big downside and I hope Redshift does add this to their product.
Unnesting + flattening is difficult (but not impossible) to do in Redshift.
I'd argue that Python UDFs plus built-in JSON support make it an orange rather
than a red in that category - but thank you for the clarification, and thank
you for this awesome analysis!

~~~
cosmie
One downside to using Redshift's Python UDFs is that they don't parallelize;
instead, they process on a single node. So while technically possible, it's
only really feasible to use them for purposes like JSON flattening as part of
your ETL load, rather than on the fly. And since they only allow scalar
outputs, you either have to run a bunch (one per flattened value) or use a
hacky solution where you leverage some nested CTEs to try to output all of the
flattened records as a single value with a custom delimiter, then leverage
built-in functions to further split that scalar into multiple rows. I've
heavily used Python UDFs in Redshift, and they're fantastic to fall back on
when you're backed into a corner. But they're such a second-class citizen
within Redshift[1] that I wouldn't consider it a viable alternative to the
lack of unnesting and flattening support.

[1] I've used Python UDFs extensively since they came out, but haven't
evaluated their performance characteristics in about 6 months. Please let me
know if my analysis is out of date.
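The delimiter-packing workaround described above, in miniature: since a scalar UDF can only return one value, pack the flattened elements into a single delimited string inside the UDF, then split it back into rows with built-in string functions. (The `|` delimiter and the function names are arbitrary choices for illustration, not Redshift APIs.)

```python
import json

DELIM = "|"  # custom delimiter; must not occur in the data itself

def udf_flatten(json_array_text):
    """Stand-in for a scalar Python UDF: array in, one delimited string out."""
    return DELIM.join(json.loads(json_array_text))

packed = udf_flatten('["a","b","c"]')  # scalar result: "a|b|c"
rows = packed.split(DELIM)             # split back into one value per row
```

The fragility is visible even in the sketch: a `|` inside the data silently corrupts the split, which is part of why this counts as a hack rather than real flattening support.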

------
kwillets
Since they forgot Vertica, here are a few cross-comparisons (admittedly funded
by Vertica, but on fairly standard criteria):

Vertica is about 10x cheaper than Snowflake:
[https://www.vertica.com/resource/vertica-eon-mode-snowflake-benchmark/](https://www.vertica.com/resource/vertica-eon-mode-snowflake-benchmark/)

Faster than Redshift on the Amplab benchmark:
[https://www.vertica.com/wp-content/uploads/2017/01/Vertica-Redshift-Benchmark.pdf](https://www.vertica.com/wp-content/uploads/2017/01/Vertica-Redshift-Benchmark.pdf)

------
bradhe
I appreciate the comparison between the different systems! Especially with the
services, which tend to move quickly in terms of changes to performance.

------
jmpman
Any plans to add Teradata? It’s available in AWS and Azure.

Also, your performance numbers weren’t normalized for system price. Many of
these MPP systems will scale performance linearly as compute nodes (roughly
equivalent to cost) grow.

~~~
georgewfraser
They are normalized, that’s the second chart.

------
jmpman
Can you add arithmetic mean to your analysis? Arithmetic mean would more
closely reflect total completion time of the entire benchmark.
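The distinction being drawn here: a geometric mean (which benchmarks like this typically report) weights every query's ratio equally, while the arithmetic mean tracks total wall-clock time for the suite, and the two can rank systems differently. The timings below are made up purely to show the divergence.

```python
import math

def geomean(times):
    """Geometric mean: nth root of the product, computed via logs."""
    return math.exp(sum(math.log(t) for t in times) / len(times))

def amean(times):
    """Arithmetic mean: proportional to total completion time."""
    return sum(times) / len(times)

sys_a = [1.0, 1.0, 100.0]  # mostly fast, one very slow query
sys_b = [8.0, 8.0, 8.0]    # uniformly middling

# geomean ranks sys_a ahead (~4.64 vs 8.0); arithmetic mean,
# i.e. total runtime, ranks sys_b ahead (8.0 vs ~34.0).
```

This is why a reader who cares about end-to-end batch completion time can be misled by a geometric-mean-only comparison.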

~~~
jmpman
[http://www.tpc.org/tpcd/faq.asp](http://www.tpc.org/tpcd/faq.asp)

By policy of the TPC, published comparisons of any TPC benchmark results must
include all metrics for that particular benchmark. Therefore, the comparison
of TPC-D results for two or more systems must include the power metric, the
throughput metric, and the price-performance metric. While one metric may be
emphasized more than another, the three metrics are considered a unit, and
none may be omitted.

------
dksidana
Any reason why the Azure timing decreased at the higher data size?

~~~
georgewfraser
Gen2 architecture, which is not yet available in a small size appropriate for
a 100GB benchmark.

------
georgewfraser
I’m one of the authors, happy to answer questions!

~~~
vgt
Great post!

A couple of thoughts:

1\. It'd be interesting to see these benchmarks "under real-world adversity".
For example, as data is ingested into the DW, as data shapes change, as
concurrency of workloads varies.

2\. It'd be great to measure "effort" to reach and maintain this performance.

DW vendors are trained on optimizing for static TPC benchmarks with
concurrency of 1, which are useful, but can miss the boat on what users
actually experience.

(BQ PM)

~~~
massaman_yams
Thanks for the insight! Maybe it was just a brain fart, but I had to look up
BQ (BigQuery) just now - the acronym might not be as well-known outside of
Google.

~~~
vgt
noted :)

