
Ask HN: BigQuery vs. Redshift vs. Athena vs. Snowflake - paladin314159
I'm investigating potential hosted SQL data warehouses for ad-hoc analytical queries. I've read quite a bit about all 4 of these technologies and their high-level tradeoffs, but I'd love to get people's first-person accounts of the pros and cons of each.

Dimensions I care about: query performance, load performance, cost, stability, maintainability, features (e.g. views, UDFs), limitations (e.g. concurrent queries, schema restrictions), and all of these at scale.

If you've used any of these and have thoughts, I would appreciate hearing them, thanks!
======
frew
Based on random use over the past few years.

Redshift: Pros: Has the most adoption, so most integrations from SaaS services
etc. are built with Redshift as their sink. Relatively fast and battle-tested.

Cons: In an awkward middle ground where you're responsible for a lot of
operations (e.g. capacity planning, setting up indexes), but don't have a lot
of visibility. Some weirdness as a result of taking PostgreSQL and making it
distributed.

BigQuery: Pros: Rich feature set. Pay-per-TB pricing. Recently released
standard-ish SQL dialect. Very fast.

Cons: JDBC driver is recent and doesn't have support for e.g. CREATE TABLE AS
SELECT (as of a couple of months ago) so harder to integrate with existing
systems. There are ways to run out of resources (e.g. large ORDER BY results)
without a good path to throw more money at the problem.

Athena: Pros: Built off of the open-source Presto database so can use the
documentation there. Pay-per-TB pricing.

Cons: Slower than the other options listed here. Very early product so lacking
some in documentation and some cryptic errors. Not a lot of extensibility, but
you could theoretically move to just using open-source Presto.
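For what it's worth, pointing Athena at data already sitting in S3 is mostly a matter of DDL, since it's Presto underneath. A rough sketch (bucket name, schema, and partition column are all hypothetical):

```sql
-- Expose Parquet files already in S3 as a queryable table.
CREATE EXTERNAL TABLE events (
  user_id  STRING,
  url      STRING,
  ts       TIMESTAMP
)
PARTITIONED BY (dt STRING)
STORED AS PARQUET
LOCATION 's3://my-bucket/events/';

-- Register the partitions that already exist under the S3 prefix.
MSCK REPAIR TABLE events;

-- Pruning on the partition column keeps the per-TB scan cost down.
SELECT url, COUNT(*) AS hits
FROM events
WHERE dt = '2017-05-01'
GROUP BY url;
```

Because you're billed per TB scanned, partitioning and a columnar format like Parquet matter a lot more here than they would on a provisioned cluster.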

Haven't had a chance to evaluate Snowflake.

~~~
antoaravinth
Just a question: which of these has good BI tool support?

~~~
itamarwe
All of the above data warehouses have good support: you can use almost any
popular BI tool with any of them. All four have ODBC/JDBC drivers that BI
tools can use, and since these are the most popular data warehouses out
there, most BI tools ship dedicated connectors for them.

------
bsg75
My perspective on a brief trial of BigQuery and RedShift:

1) RedShift is PostgreSQL 8 with some additional features; while it can
significantly improve some query runtimes, it comes with the usual DBA
burdens of an on-premise database.

2) BigQuery is a true cloud implementation. Its SQL syntax is evolving and may
not have some features older platforms do, but the concepts of the physical
implementation are completely removed from the user's concern.

Were I going to start a new project without existing dependence on AWS,
BigQuery would have an advantage.

------
user5994461
BigQuery for data storage + Looker for visualization.

RedShift is not a real managed service. You have to handle all the
maintenance/scaling/indexing/provisioning yourself; it's a hassle.

------
DeGi
We have ~80TB of (compressed) data in Snowflake at Celtra and I'm working with
Snowflake on a daily basis. We've been using it for the last ~1 year in
production. Overall maintenance is minimal and the product is very stable.

Pros:

    
    
      - Support for semi-structured nested data (think json, avro, parquet) and querying this in-database with
      custom operators
      - Separation of compute from storage. Since S3 is used for storage, you can just spawn as many compute
      clusters as needed - no congestion for resources.
      - CLONE capability. Basically, Snowflake allows you to do a zero-copy CLONE, which copies just the metadata,
      but not the actual data (you can clone a whole database, a particular schema or a particular table). This is
      particularly useful for QA scenarios, because you don't need to retain/backup/copy over a large table - you
      just CLONE and can run some ALTERs on the clone of the data. Truth be told, there are some privilege bugs
      there, but I've already reported those and Snowflake is working on them.
      - Support for UDFs and Javascript UDFs. We've had to do a full ~80TB table rewrite and being able to do this
      without copying data outside of Snowflake was a massive gain.
      - Pricing model. We didn't like BigQuery's query-based model much, because it makes costs harder to control.
      Storage on Snowflake costs the same as S3 ($27/TB compressed), while BigQuery charges for scans of uncompressed data.
      - Database-level atomicity and transactions (instead of table-level on BigQuery)
      - Seamless S3 integration. With BigQuery, we'd have to copy all data over to GCS first.
      - JDBC/ODBC connectivity. At the time we were evaluating Snowflake vs. BigQuery (1.5 years ago), BigQuery
      didn't support JDBC.
      - You can define separate ACLs for storage and compute
      - Snowflake was faster when the data size scanned was smaller (GBs)
      - Concurrent DML (insert into the same table from multiple processes - locking happens on a partition level)
      - Vendor support
      - ADD COLUMN, DROP COLUMN, RENAME all work as you would expect from a columnar database
      - Some cool in-database analytics functions, like HyperLogLog objects (that are aggregatable)
    

Cons:

    
    
      - Nested data is not first-class. It's supported by semi-structured VARIANT data type, but there is no schema
      if you use this. So you can't have nested data + define a schema both at the same time, you have to pick just
      one.
      - Snowflake uses a proprietary data storage format and you can't access data directly (even though it sits on
      S3). For example when using Snowflake-Spark connector, there is a lot of copying of data going on: S3 ->
      Snowflake -> S3 -> Spark cluster, instead of just S3 -> Spark cluster.
      - BigQuery was faster for full table scans (TBs)
      - Does not release locks if the connection drops. It's a pain to handle that yourself, especially if you
      can't control the clients that get killed.
      - No indexes. Also no materialized views. Snowflake allows you to define Clustering keys, which will retain
      sort order (not global!), but it has certain bugs and we've not been using it seriously yet. Particularly,
      it doesn't seem to be suited for small tables, or tables with frequent small inserts, as it doesn't do file
      compaction (number of files just grows, which hits performance).
      - Which brings me to the next point. If your use case is more streaming in nature (more frequent inserts, but
      smaller ones), I don't think Snowflake would handle this well. For one use case, we're inserting every minute,
      and we're having problems with number of files. For another use case, we're ingesting once per hour, and this
      works okay.
    

Some (non-obvious) limitations:

    
    
      - 50 concurrent queries/user
      - 150 concurrent queries/account
      - streaming use cases (look above)
      - 1s connection times on ODBC driver (JDBC seems to be better)
    

If you decide on Snowflake or have more questions, I'm happy to help with
specific questions/use cases.
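For reference, the zero-copy CLONE and VARIANT querying mentioned above look roughly like this (table and column names are made up for illustration):

```sql
-- Zero-copy clone: only metadata is copied, not the underlying ~80TB.
CREATE TABLE events_qa CLONE events;
ALTER TABLE events_qa ADD COLUMN score FLOAT;  -- experiment safely on the clone

-- Querying semi-structured data stored in a VARIANT column:
-- colon syntax traverses the nested structure, :: casts the result.
SELECT payload:device.os::STRING AS os,
       COUNT(*) AS n
FROM raw_events
GROUP BY 1;
```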

~~~
jbs40
Here are some useful updates and additional information on some of the items
mentioned above regarding Snowflake:

\- The concurrency limits mentioned above are soft limits that can be raised
on customer request (those defaults are there so that runaway applications can
be detected easily). Snowflake can handle very high concurrency--we have
customers running hundreds of concurrent queries.

\- We’ve recently released a new Spark connector with a bunch of
optimizations, including additional push-down capabilities that speed up
performance significantly.

\- The clustering capability is currently in "preview", we're definitely
taking input and have been working on incorporating feedback we've received so
far into it.

\- One important thing to note when it comes to full table scans is that
Snowflake allows you to choose how much horsepower you apply to the job, so
you can easily scale up to get faster scans.

~~~
jbs40
Forgot to state the disclaimer that I work for Snowflake.

------
scapecast
What does the rest of your infrastructure look like? Are you running on AWS
already? or on GCP?

~~~
paladin314159
In AWS, so we understand there will be some extra costs sending data to
BigQuery.

------
vgt
I worked on the BigQuery team and happy to answer any questions you may have.
There are quite a few AWS customers running on BigQuery. (See [0], [1], [7],
[8].)

I'll offer some resources:

\- A session from last Google NEXT, where Yahoo, NY Times, and Blue Apron
detail their experience with migrating to BigQuery [0]

\- A session from previous Google NEXT, where Kabam Games detailed their
experience with migrating to BigQuery (and running the rest of their workload
on AWS) [1]

\- A blog "15 things you probably didn't know about BigQuery" [2]

\- A blog detailing what BigQuery means by "fully managed" [3]

\- A blog offering some details on most recent iteration of Dremel "In-memory
query execution in Google BigQuery" [4]

\- A blog describing Capacitor, the opinionated storage format that replaced
ColumnIO (on which Parquet is based) [5]

\- "BigQuery Under The Hood" [6]

\- Since you specifically asked about Load Performance, [9]

(A wholly redundant disclosure: I work at Google Cloud)

[0] https://www.youtube.com/watch?v=TLpfGaYWshw

[1] https://www.youtube.com/watch?v=6Nv18xmJirs

[2] https://medium.com/google-cloud/15-awesome-things-you-probably-didnt-know-about-google-bigquery-6654841fa2dc

[3] https://cloud.google.com/blog/big-data/2016/08/google-bigquery-continues-to-define-what-it-means-to-be-fully-managed

[4] https://cloud.google.com/blog/big-data/2016/08/in-memory-query-execution-in-google-bigquery

[5] https://cloud.google.com/blog/big-data/2016/04/inside-capacitor-bigquerys-next-generation-columnar-storage-format

[6] https://cloud.google.com/blog/big-data/2016/01/bigquery-under-the-hood

[7] https://cloud.google.com/customers/sharethis/

[8] https://storify.com/felipehoffa/dow-jones-on-bigquery

[9] https://medium.com/google-cloud/paying-it-forward-how-bigquerys-data-ingest-breaks-tech-norms-8bfe2341f5eb

~~~
paladin314159
Thanks for the wealth of resources. After reading through, the only big
concern I have with BigQuery is pricing. I like the model of pay-per-query,
but it seems like costs can get out of hand easily.

These two issues particularly worry me:

\- No non-date partitioning, which means you'll often end up (in my particular
use case) scanning 10-100x as much data as you actually want to.

\- Pricing is based on uncompressed data size, which is at least several times
more expensive than the alternative (e.g. Athena).

Any thoughts on these? We'll have several hundreds of TB of data, so cost
matters.

~~~
vgt
Happy to help!

On partitioning - this sounds like a great candidate for sharding. BigQuery
standard SQL makes it very easy to run unions over shards (e.g. SELECT * FROM
tables_*).

On pricing - BigQuery is a very different product functionally, and I
encourage you to compare it via a proof of concept. That said, definitely
reach out to me on price.
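Concretely, the wildcard-table approach in standard SQL lets you filter on _TABLE_SUFFIX so that only the matching shards are scanned and billed (dataset and table names below are made up):

```sql
-- Shard by customer (or any non-date key) into tables like
-- events_acme, events_initech, etc., then query across them:
SELECT user_id, COUNT(*) AS sessions
FROM `mydataset.events_*`
WHERE _TABLE_SUFFIX = 'acme'   -- only this shard is scanned and billed
GROUP BY user_id;
```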

------
scapecast
Disclaimer: I'm the co-founder of a start-up that provides performance
analytics for Amazon Redshift. So I'm biased towards using Redshift because
otherwise I can't sell you our service. With that cleared, some
considerations.

One of our customers made a great observation: "I see a lot of people using
Redshift. But I don't see anybody using it happily." That's because of all
the issues people point out in this thread: lots of operations, little
visibility ("black box"). All very true. It's the problem our product solves,
so let me talk about how we see companies successfully using Redshift at
scale. (And that customer is now a happy Redshift user, btw.)

Here's the architecture that we're seeing companies on AWS moving to:

\- all your data in S3 (I hate the term, but call it a "data lake")

\- a subset of your data in Redshift, for ongoing analysis

\- daily batch jobs to move data in and out of Redshift from / to S3

\- Athena to query data that's in S3 and not in Redshift

The subset of the data sitting in Redshift is determined by your needs / use
cases. For example, let's say you have 3 years of data, but your users only
query data that's less than 6 months old. Then moving data older than 6 months
to S3 makes a lot of sense. Much cheaper. For the edge cases where a user
does want to query data older than 6 months, you use Athena to query the data
sitting in S3.
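The batch jobs moving data between Redshift and S3 are typically just UNLOAD and COPY statements, e.g. (bucket names and the IAM role ARN are placeholders):

```sql
-- Archive rows older than 6 months out to S3 as compressed files...
UNLOAD ('SELECT * FROM events WHERE event_date < DATEADD(month, -6, GETDATE())')
TO 's3://my-archive-bucket/events/'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-s3'
GZIP;

-- ...and load fresh data in from the landing bucket.
COPY events
FROM 's3://my-landing-bucket/events/'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-s3'
GZIP DELIMITER '|';
```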

Why not use Athena for everything? Two reasons. For one, Athena is less
mature. And then you still need a place to run your transformations, and
Redshift is the better choice for that.

Three use cases for Redshift:

1) classic BI / reporting ("the past")

2) log analysis ("here and now")

3) predictive apps ("the future")

Your use case will determine how you have to think about your data
architecture. For some it's ok if a daily report takes a few minutes to run.
I'll go get a coffee. It also means my batch jobs are running on daily or
hourly cycles. But if I'm running a scoring model, e.g. for fraud prevention,
I want that score to be as fresh / real-time as possible. And make sure that
the transformations leading up to that score have executed in their proper
order.

For Redshift to work at scale, there are really only three key points you need
to check the box on:

\- setting up your WLM queues so that your important workloads are isolated
from each other

\- allocating enough concurrency and memory to each queue to enable high
throughput

\- monitoring your stack for real-time changes that can affect those WLM
settings
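As a starting sketch, the monitoring part can begin with Redshift's own system tables, e.g. watching per-queue waits in real time:

```sql
-- Queries currently queued or executing, per WLM service class (queue).
-- queue_time is reported in microseconds, hence the division.
SELECT service_class, state, COUNT(*) AS queries,
       AVG(queue_time) / 1000000.0 AS avg_queue_secs
FROM stv_wlm_query_state
GROUP BY service_class, state
ORDER BY service_class;
```

Persistent queueing in a given service class is the usual signal that its WLM concurrency or memory allocation needs adjusting.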

That's it. And then Redshift has a rich ecosystem and rich feature set to
enable all use cases (ETL vendors, dashboard tools, etc.). Those users who
have addressed the performance challenges that do come at scale with Redshift
are happy users who never looked back.

On the pay-per-TB pricing for BigQuery and Snowflake. I think that's more of a
marketing spin. Most companies we work with are compute-bound. So per-TB
pricing helps them very little. They want to crunch their data as fast as
possible. More CPUs give them more I/O. If you feel you're storage-bound -
take a hard look at the data that really needs to be available in Redshift for
analysis, and move everything else to S3.

For inspiration, watch the AWS Reinvent videos on S3 / Redshift / Athena from
Netflix and NASDAQ. Like this one:

https://www.youtube.com/watch?v=o52vMQ4Ey9I&t=256s

If your entire data set is already within AWS, I think Redshift is the way to
go. But then again, I'm biased :)

~~~
vgt
> On the pay-per-TB pricing for BigQuery and Snowflake. I think that's more of
> a marketing spin. Most companies we work with are compute-bound. So per-TB
> pricing helps them very little. They want to crunch their data as fast as
> possible. More CPUs give them more I/O. If you feel you're storage-bound -
> take a hard look at the data that really needs to be available in Redshift
> for analysis, and move everything else to S3.

Do you mind elaborating on what you mean by that? I've worked a whole lot on
BigQuery pricing, and am a little confused by your comment. Do you mean that
separation of storage and compute is an overrated concept?

(work on Google Cloud)

~~~
marcinzukowski
To clarify, Snowflake doesn't charge for processing per-TB like BigQuery, but
for the (fully elastic) compute time.

(disclaimer: I work at Snowflake)

~~~
vgt
Thanks for the clarification!

BigQuery's on-demand model charges just for the resources consumed during the
job execution (via a per-TB proxy), rather than resources provisioned. This is
the highest order of cloud-native pricing models, and good on Athena for doing
the same!

If one were to equate this to VM pricing, you're getting:

\- The ability to scale from 0 to thousands of cores in roughly 1 second (what
it takes for Dremel to put together the execution plan)

\- Pipelined execution (meaning each processing stage ONLY uses the resources
required for that stage, rather than net resources; think of this as per-
second autoscaling from query stage to query stage. This is part of the
reason why BigQuery does very well in concurrency tests)

\- Equivalent of per-second billing

This in practice equates to 100% resource efficiency - never pay for resources
you deploy but don't use.

I've written on this topic at [0] and at [1] and happy to chat further!

(And BigQuery did introduce a Flat Rate pricing model last Fall for the more
enterprisey crowd)

[0] https://cloud.google.com/blog/big-data/2016/02/understanding-bigquerys-rapid-scaling-and-simple-pricing

[1] https://cloud.google.com/blog/big-data/2016/02/visualizing-the-mechanics-of-on-demand-pricing-in-big-data-technologies

~~~
scapecast
This is a very elaborate and insightful answer. Appreciate the clarification.

------
codeonfire
BigQuery is like a midrange luxury car: easy to do anything, looks great, and
good performance. Redshift is like someone's turbo-boosted Honda Civic. Athena
is maybe like a Camaro: you dislike the concept, but they made it powerful and
easy enough that you might want it. This is probably no help.

------
marklit
I've used BQ, RS and Athena. My benchmark results, with links to how I
conducted the benchmarks on each, can be found here:
http://tech.marksblogg.com/benchmarks.html

I could probably write a book on my experiences but I'll keep this short.

BigQuery requires the least amount of work to keep performant, but this comes
at a cost (roughly the same per-TB cost as Athena).

Redshift seems to have support for just about every BI tool out there,
including Periscope, which I've found to be popular with my customers. The
downside is that you've really got to think about how you lay out your data
and how you write your queries, and there is something of a weekly task of
bringing up the console and hunting down the worst-offending queries for
optimisation.

Keep in mind if you're already on Google Cloud or AWS you could incur
bandwidth charges moving data to the other provider.

