
Debunking Misleading Benchmarks Of Redshift vs BigQuery - ranman
https://aws.amazon.com/blogs/big-data/fact-or-fiction-google-big-query-outperforms-amazon-redshift-as-an-enterprise-data-warehouse/
======
Cidan
"...8-node DC1.8XL Amazon Redshift cluster for the tests."

Well, yeah. That's 28,108.80 a month if you're running queries on demand and
don't want a delay/coordination in Amazon instance creation/destruction.
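
For anyone wondering where that number comes from, here is a back-of-the-envelope sketch. The $4.80 per node-hour on-demand rate and the 732-hour month are my assumptions, not figures from the article; they happen to reproduce the total exactly.

```python
# Assumptions (not from the article): $4.80 per node-hour on demand,
# and a 732-hour month (30.5 days * 24 hours).
NODES = 8
RATE_PER_NODE_HOUR = 4.80
HOURS_PER_MONTH = 732


def monthly_cost(nodes=NODES, rate=RATE_PER_NODE_HOUR, hours=HOURS_PER_MONTH):
    """On-demand cost of keeping the whole cluster up for one month."""
    return round(nodes * rate * hours, 2)


def yearly_cost(nodes=NODES, rate=RATE_PER_NODE_HOUR, hours=HOURS_PER_MONTH):
    """Twelve months of the same always-on cluster."""
    return round(monthly_cost(nodes, rate, hours) * 12, 2)


print(monthly_cost())
print(yearly_cost())
```

With those inputs the monthly total is 28,108.80, which works out to roughly 337k a year if the cluster never shuts down.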

BQ may or may not be as fast, but it's truly a managed service; I give it our
data and it just works. I don't have to worry about instances, boot up time,
maintenance, hourly costs, etc. It's silly to focus just on query speed when
there's a whole layer of management and cost that comes with it.

~~~
beachstartup
> _28,108.80 a month_

maybe i'm completely out of touch, but i'm really wondering who can afford
this kind of stuff without $20M series A money in the bank. and i'm also
wondering what they're going to do when they run out of that money and have
zero database expertise in-house because they outsourced everything to amazon.

~~~
earino
Maybe I'm jaded, but 330k a year or so for the very backbone of your entire
real time query analytics infrastructure just doesn't seem unreasonable? This
is highly specialized software, tuned and architected for the purpose of
running relatively complex analytics queries and aggregates fast enough to
allow interactive real time exploratory analysis.

I don't think it's insane to believe that a profitable mature data informed
business should be expected to spend 1/3 of a million a year for the ability
to have all that data live and available without deep latency.

I can also agree that this is quite a bit for a startup, but if the unit
economics of the startup require that this kind of data be available for
interactive exploring, there's going to be a deep challenge down the road in
scaling it I think?

To your point, if a startup runs through 20 million and their biggest problem
is paying AWS bills, then they are probably a failed startup and folks should
move on towards whatever next exciting thing is on the horizon.

~~~
tw04
With 10TB of data, you could also do it a hell of a lot cheaper and faster on
premise.

~~~
trhway
One engineer/DBA to deal with your cluster is those 330k/year.

~~~
foobarian
Don't those computers cost a lot? Like $10k per blade? Granted it's probably
amortized but still...

On the other hand the Redshift cluster doesn't run itself, despite what Amazon
says. You still need at least a part time DBA guy.

~~~
eitland
Last time I worked with physical servers the rule was to avoid blades and run
rackmount until space or thermal was an issue.

And as long as you were not into GPUs you could get some really decent HP
servers at somewhere around $2000 - $7000 depending on your storage needs.

------
ktamura
Before this thread turns into a vim-v-emacs-esque flame war.

I really think it's a good thing that AWS and GCP are punching each other in
the cloud data warehousing market. It means that the market is maturing, and
we are all benefiting from their arms race against each other.

I am of the opinion that Redshift and BigQuery are philosophically different
enough that performance differences, while important, shouldn't be the
deciding factor. I've written about this in a blog post awhile back, and it
might be relevant for folks weighing their options

[https://blog.treasuredata.com/blog/2016/06/09/redshift-bigqu...](https://blog.treasuredata.com/blog/2016/06/09/redshift-bigquery-similarities-differences-and-serverless-future/)

Disclosures:

1\. I don't work at either.

2\. My employer, however, partners with both.

~~~
ranman
I agree that they were both created with different problem domains. I mainly
take issue with the fact that they selected a single query in a large
benchmarking suite and used that to claim that BigQuery was categorically
better than redshift. Hell, I actually really like BigQuery! I do find the
pricing for queries a tad opaque and difficult to estimate... and the
variability in performance is fine for my pet projects but I imagine a larger
enterprise would prefer something more predictable. I agree with your
statement that they are two very very different products. I enjoyed reading
your post.

Disclosures: I wrote the parent article.

~~~
thesandlord
> I do find the pricing for queries a tad opaque and difficult to estimate

BigQuery just introduced flat-rate pricing for this exact reason:
[https://cloud.google.com/bigquery/pricing#flat_rate_pricing](https://cloud.google.com/bigquery/pricing#flat_rate_pricing)

Disclosure: I work for Google Cloud

~~~
ranman
You're still dealing with query slots though -- which remains difficult to
estimate. As far as I know the only way to figure that out is to actually run
the query and then check how many slots it used. Am I wrong? I'm certainly not
a BigQuery expert.

If you're downvoting do you mind leaving a comment on why I'm wrong? I'm
interested in knowing more. The documentation for BigQuery doesn't really
explain how slots are allocated and what they're capable of.

~~~
honkhonkpants
Best part is that what you get for a slot can and does change over time.

------
manigandham
We use BigQuery after trying all other options including running our own VMs
with database software, 3rd party data vendors and the big managed cloud
services.

BigQuery's no-ops model makes a huge difference. It saves a lot of effort when
all you have to do is just load up your data and run queries, no other cluster
starting, sizing, tweaking or maintenance necessary.

It has real-time streaming input, nested/repeated records, powerful RegEx
support and UDFs so you can run some really complex queries that you can't do
with traditional SQL. It also has the cheapest storage at $0.01/gb after 90
days. They even have DML statements in beta now to support update/delete
statements.

If you have lots of data to store (TBs to PBs), which can be archived by time
or other dimension as it gets older but also needs to be available at any
time, and have a decent number of queries but also need to run massive scans,
and don't need sub-second response times, and want really complex query logic,
BQ wins.

------
mavster
My team uses AWS for all of our servers, CDN, SQL and cache boxes but use BQ
for all of our telemetry data - about 20 billion metrics a month.

Not having to manage a Redshift cluster and just letting BQ do all the work for us
is worth the query time; even then, most of our complex queries run sub 30
seconds over hundreds of GB. If your data can be queried using a date-range
then use date-partitioned tables. There is a 1000 table query limit so there
are some pit-falls but BQ makes it easy for you to aggregate data into weekly
and monthly tables to query against.
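
A rough sketch of how one might stay under that limit: pick per-day tables when the range is short and fall back to monthly rollups when it is not. The `metrics_YYYYMMDD` / `metrics_YYYYMM` naming and the helper functions are hypothetical, just to illustrate the idea; this is not a BigQuery API.

```python
from datetime import date, timedelta

MAX_TABLES_PER_QUERY = 1000  # the limit mentioned above


def daily_tables(prefix, start, end):
    """Names of per-day tables covering start..end inclusive (hypothetical naming)."""
    days = (end - start).days + 1
    return [f"{prefix}_{(start + timedelta(d)):%Y%m%d}" for d in range(days)]


def monthly_tables(prefix, start, end):
    """Names of per-month rollup tables that cover the same range."""
    names, year, month = [], start.year, start.month
    while (year, month) <= (end.year, end.month):
        names.append(f"{prefix}_{year:04d}{month:02d}")
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return names


def tables_for_range(prefix, start, end):
    """Use daily tables when they fit under the limit, else monthly rollups."""
    daily = daily_tables(prefix, start, end)
    if len(daily) <= MAX_TABLES_PER_QUERY:
        return daily
    return monthly_tables(prefix, start, end)


print(len(tables_for_range("metrics", date(2016, 1, 1), date(2016, 1, 31))))
print(len(tables_for_range("metrics", date(2013, 1, 1), date(2016, 12, 31))))
```

Note that the monthly fallback covers whole months, so the query itself would still need a date filter to trim the first and last month.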

Disclosures:

1\. I don't work at either.

2\. My employer, however, partners with both.

------
georgewfraser
It would be great if AWS would publish the code necessary to reproduce this
benchmark. They mentioned applying sort and dist keys; it matters A LOT how
these are chosen. For example, if you know in advance how you're going to
join, you can distribute both tables on the foreign key column so the join is
a local operation. This isn't always an option in real data warehouses, where
you do lots of different joins on the same tables. So it represents an
opportunity to "cheat" the benchmark. It's also important whether they used
BigQuery partitioned tables, which are the equivalent of SORTKEY in some ways.
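
To make the co-location point concrete, here is a toy simulation, not Redshift code. Both tables are distributed on the join column, so every matching pair of rows lands on the same node and the join never has to move data. The node count, the modulo "hash", and the table contents are all made up for illustration.

```python
from collections import defaultdict

NUM_NODES = 4  # hypothetical cluster size


def node_for(key, num_nodes=NUM_NODES):
    """Pick a node from the key; stands in for the warehouse's distribution hash."""
    return key % num_nodes


def distribute(rows, key_field, num_nodes=NUM_NODES):
    """Place each row on the node chosen by its distribution key."""
    nodes = defaultdict(list)
    for row in rows:
        nodes[node_for(row[key_field], num_nodes)].append(row)
    return nodes


def local_join(left_nodes, right_nodes, key_field, num_nodes=NUM_NODES):
    """Join node by node; no row ever leaves the node it was stored on."""
    joined = []
    for n in range(num_nodes):
        by_key = defaultdict(list)
        for right in right_nodes.get(n, []):
            by_key[right[key_field]].append(right)
        for left in left_nodes.get(n, []):
            for right in by_key.get(left[key_field], []):
                joined.append({**right, **left})
    return joined


customers = [{"customer_id": i, "name": f"c{i}"} for i in range(10)]
orders = [{"customer_id": i % 10, "order_id": i} for i in range(25)]

result = local_join(
    distribute(orders, "customer_id"),
    distribute(customers, "customer_id"),
    "customer_id",
)
print(len(result))
```

If the two tables were distributed on different columns, the per-node loop would miss matches unless rows were shuffled between nodes first, and that shuffle is the cost a well-chosen dist key avoids.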

Having said all this, my company is a fully-managed data pipeline that
supports both data warehouses, and we find they both work extremely well for
our customers' real datasets.

~~~
maslam
George - fan of Fivetran. What, in your opinion, are common cases where
customers choose BQ over Redshift? Or is it really a function of which cloud
you're on?

~~~
georgewfraser
If you are already familiar with one, that is actually a really good reason to
use it. If you store a lot of data but rarely query it, BQ will be WAY cheaper
because you pay for compute and storage separately. On the other hand, you
will never see sub-second queries in BQ.

Another option absolutely worth checking out is Snowflake, which compromises
between the approach of BQ and Redshift in a very interesting way.

------
smegel
> You can’t just cherry-pick the one query where a given product is the
> fastest. The workloads in the database domain have broad requirements

Well you can if that test matches the use-case of the database. Big Data
databases are not like transnational databases, tend to perform column
oriented analytics and tend to not join very well (they may lack indexes
altogether).

I find this post to be more misleading than Google - he seems to be
conceitedly ignoring the fact that many specialised databases exist that do
certain things really well - and selecting tests that match their use-case is
entirely appropriate.

~~~
ranman
I'm _primarily_ trying to point out that Google's repeated claims that
their performance on TPC-H #22 means that BQ is categorically better than
Amazon Redshift are misleading.

Also I think you meant transactional not transnational

~~~
timv
I'm not the parent poster, but I agree with his/her sentiment.

I understand what your purpose was, but the blog entry would have felt more
balanced if you'd said something like:

_It's true that for a small subset of queries, BigQuery out-performed
RedShift. If your workload is similar to those queries then BigQuery may be a
good match for you, but our testing shows that for general purpose
datawarehousing / analytics workloads, RedShift is the better performing
database and would be more suitable for most customers._

~~~
solipsism
+10! That would have been much classier. Also the language dripped with contempt
for Google. It really comes off as ugly.

~~~
bartl
Google attempted to mislead its audience. The contempt is understandable.

~~~
1024core
... of which we have no proof, just his word. Do you?

------
filereaper
Performance aside, we've found BigQuery to be far cheaper than Redshift.

With BQ you pay for the amount of data that flows through the system, as
opposed to how many instances are running with Redshift.

We also don't really need Redshift running all the time so something like BQ
or Azure SQL Datawarehouse where you can pause and scale up/down fits our
usage patterns and budget much better.

~~~
ranman
Just FYI you can pause and scale with Redshift as well... but if you query
very infrequently (or low volume) cost structure of BQ might be on your side.

------
jack9
Yeah, this didn't debunk much of anything (to me) and the "caveats" at the end
were laughably minor. I'm not sure why they listed them other than to say "but
they aren't ANSI SQL compatible!" as FUD.

~~~
dhd415
+1. They'd have a much stronger case if they didn't drag in silly complaints
such as "[Google BQ] does not support the ANSI SQL syntax of 'substring'. Need
to change to 'substr'".

~~~
santoshalper
It might seem petty, but I can't see how it hurt their case. If you're
migrating hundreds of queries or reports from a local SQL environment to the
cloud, being 100% query compatible is nice.

~~~
twalk
+1

~~~
twalk
Kinda funny that a +1 with no verbiage is still rated.

I will note I am enjoying reading these comments especially the ones that have
real data points, real implementations and considerations of database
structures/architectures on both of the products vs. gripes of the brands.

------
sptmbr
We have used reserved Redshift instances for close to 3 years now. There were a few
moments that we were slightly screwed by new features and bugs in Redshift,
but most of the time, we don't need to deal with any maintenance, and because
we are fully on AWS, Redshift works way better with other systems we build or
use (we have a few use cases of bigquery). We do have lots of regular queries
running every hour, and we did lots of optimizations on Redshift. Overall, we
are happy with Redshift.

I think in terms of EDW solutions, it is really not a one size fits all
situation. I am happy for the people (especially small companies who can't
afford running Redshift 24/7) who went with BigQuery, but I am also happy that
we chose Redshift and so far, it has been great.

------
ntoshev
No link to the original comparison by Google? It was only presented on a
private event?

Well, it's impossible to know the context in which this comparison was made.
BigQuery is very different from the rest of the data warehouse solutions, I
wouldn't think a standard benchmark suite would be appropriate for it.

------
maslam
We use Amazon Redshift a lot. The biggest pain point is its lack of elasticity
and strong coupling between compute and storage. You can see this pattern
emerging in warehouses from Azure and Snowflake. I fear that Redshift will
never get there because it's based on old Paraccel tech that was never really
designed for the cloud to begin with.

Is Redshift worth it? Yes, it's a powerful EDW system. That said, it's going
to cost you more than BQ because of always-on nodes.

Is it better than BQ? It really depends on your analytics workload.

------
gtrubetskoy
Google is pretty transparent about the underlying technologies such as Dremel
X, ColumnIO, Capacitor, etc. Even if there is no source code, there is quite a
bit in the form of papers, talks, peer OSS projects (e.g. Apache Drill), etc.

But I can't say the same about RedShift, it's a complete black box. (Other
than that it's a mod of an ancient version of Postgres)

~~~
querulous
redshift uses paraccel

------
pkolaczk
While their raw benchmark data may be ok, the summary chart makes no sense.
They calculated the average and median runtime of all the queries, without
first normalizing the results, despite the fact that there is a very huge
runtime variance between these queries. Therefore the final number comes
mostly from a few most costly queries and almost totally ignores the others.
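
A small sketch of the effect, using made-up runtimes rather than the article's data. The plain average is dominated by the single slow query, while the geometric mean of per-query ratios (one common way to normalize, not necessarily what the authors should have used) weights every query equally.

```python
from statistics import geometric_mean, mean

# Made-up runtimes in seconds, purely illustrative; not the article's data.
system_a = {"q1": 1.0, "q2": 2.0, "q3": 1.5, "q4": 300.0}
system_b = {"q1": 3.0, "q2": 6.0, "q3": 4.5, "q4": 280.0}


def raw_mean_ratio(a, b):
    """Ratio of plain average runtimes (A / B); dominated by the slow queries."""
    return mean(a.values()) / mean(b.values())


def normalized_ratio(a, b):
    """Geometric mean of per-query ratios (A / B); every query counts equally."""
    return geometric_mean([a[q] / b[q] for q in a])


print(raw_mean_ratio(system_a, system_b))
print(normalized_ratio(system_a, system_b))
```

In this toy data, system A is three times faster on three of the four queries, yet the raw average makes it look slightly slower because of q4 alone. The normalized figure points the other way.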

------
alecco
Database researcher here: TPC-H sucks. TPC-DS is the thing to use but sadly
practically nobody does it.

See point 9

[http://www.exasol.com/en/newsroom/blog/10-questions-the-tpc-...](http://www.exasol.com/en/newsroom/blog/10-questions-the-tpc-h-benchmark/)

Note: Exasol has been dominating TPC-H for a while but it's kind of weird they
don't publish much about what's under the hood (unlike Actian Vectorwise, for
example, who has a huge library of amazing papers)

~~~
ranman
That's why we ran TPC-DS as well.

~~~
alecco
Yeah, and all queries. Good stuff. It'd be nice to have some kind of
instructions to reproduce, though. And some explanation on the technology
behind it. For example, do you guys scale indexed search or is it just
partitioned without indexing?

~~~
ranman
Do you mind shooting me an email (in profile) with questions you would like to
see answered? I'll try to do that in a followup post.

------
f4703
Would have been much nicer to read if it wasn't written in such a tactless way.
Could have stayed classy, but they chose not to, and it came off as petty. For
me at least.

------
dkarapetyan
Present the numbers. Show the trade-offs involved. Any good engineer will then
be able to make up their own mind. The narrative was somewhat defensive for
benchmarking analysis.

I don't have a pony in this race since either one at large enough scale will
cost an arm and a leg and both seem to be within same order of magnitude with
a constant factor here and there.

------
bitmapbrother
Anyone else find it interesting that AWS is so concerned about GCP?

------
AtlasLion
I would recommend checking out SnowFlake who by the way just announced a big
price cut.

[http://www.zdnet.com/article/will-snowflake-spark-a-cloud-da...](http://www.zdnet.com/article/will-snowflake-spark-a-cloud-data-warehouse-price-war/)

------
jknoepfler
So, genuinely curious, how much did the author spend to run the benchmarks on
each service?

~~~
querulous
the pricing models for bigquery and redshift are different enough that it is
hard to compare on price. redshift is fixed cost (hourly per node like ec2)
while on bigquery you pay proportional to data scanned during queries

if you do only occasional queries on mostly static data bigquery is probably
fine but redshift is a fraction of the cost for more frequent use
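
a rough way to find the crossover point, as a sketch only. the $5 per TB scanned price and the 1 TB scanned per query are assumptions on my part; the cluster figure is the 28,108.80 a month on-demand number quoted at the top of the thread. storage costs are ignored.

```python
def bigquery_monthly_cost(tb_scanned_per_query, queries_per_month, price_per_tb=5.0):
    """Pay-per-scan model: cost grows with the data each query reads."""
    return tb_scanned_per_query * queries_per_month * price_per_tb


def break_even_queries(cluster_monthly_cost, tb_scanned_per_query, price_per_tb=5.0):
    """Queries per month at which pay-per-scan matches the fixed cluster cost."""
    return cluster_monthly_cost / (tb_scanned_per_query * price_per_tb)


CLUSTER_MONTHLY = 28108.80  # on-demand figure quoted at the top of the thread
TB_PER_QUERY = 1.0          # hypothetical: each query scans 1 TB

print(bigquery_monthly_cost(TB_PER_QUERY, 100))
print(break_even_queries(CLUSTER_MONTHLY, TB_PER_QUERY))
```

below the break-even count the pay-per-scan model is cheaper, above it the fixed-cost cluster wins. the real answer depends heavily on how much data each query actually scans.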

------
wnevets
>It’s not surprising to see old guard companies (like Oracle) doing this, but
we were kind of surprised to see Google take this approach, too.

Indeed it is, I wonder if that benchmark claim is being taken out of context
of that private talk.

