Ask HN: BigQuery vs. Redshift vs. Athena vs. Snowflake
26 points by paladin314159 100 days ago | 21 comments
I'm investigating potential hosted SQL data warehouses for ad-hoc analytical queries. I've read quite a bit about all 4 of these technologies and their high-level tradeoffs, but I'd love to get people's first-person accounts of the pros and cons of each.

Dimensions I care about: query performance, load performance, cost, stability, maintainability, features (e.g. views, UDF), limitations (e.g. concurrent queries, schema restrictions), and all of these at scale.

If you've used any of these and have thoughts, would appreciate hearing them, thanks!




Based on random use over the past few years.

Redshift: Pros: Has the most adoption, so most integrations from SaaS services etc. are built with Redshift as their sink. Relatively fast and battle-tested.

Cons: In an awkward middle ground where you're responsible for a lot of operations (e.g. capacity planning, setting up indexes), but don't have a lot of visibility. Some weirdness as a result of taking PostgreSQL and making it distributed.

BigQuery: Pros: Rich feature set. Pay-per-TB pricing. Recently released standard-ish SQL dialect. Very fast.

Cons: The JDBC driver is recent and doesn't support e.g. CREATE TABLE AS SELECT (as of a couple of months ago), so it's harder to integrate with existing systems. There are ways to run out of resources (e.g. large ORDER BY results) without a good path to throw more money at the problem.
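
As a rough illustration of the ORDER BY limitation (the project, dataset and table names here are made up), a global sort over a large result can hit "resources exceeded", while bounding the sorted output usually avoids it:

  -- May fail on a very large table: the final global sort is the bottleneck.
  SELECT user_id, event_time
  FROM `my_project.my_dataset.events`
  ORDER BY event_time;

  -- Limiting the sorted output usually sidesteps the problem.
  SELECT user_id, event_time
  FROM `my_project.my_dataset.events`
  ORDER BY event_time
  LIMIT 1000;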

Athena: Pros: Built off of the open-source Presto database, so you can use the documentation there. Pay-per-TB pricing.

Cons: Slower than the other options listed here. Very early product, so documentation is somewhat lacking and some errors are cryptic. Not a lot of extensibility, but you could theoretically move to just using open-source Presto.

Haven't had a chance to evaluate Snowflake.


Just a question: which of these has good BI tool support?


All of the above data warehouses have good support; you can use almost any popular BI tool with them. All of them have ODBC/JDBC drivers that BI tools can use, and since these are the most popular data warehouses out there, most BI tools implement connectors for them.


BigQuery is like a midrange luxury car: easy to do anything, looks great, and performs well. Redshift is like someone's turbo-boosted Honda Civic. Athena is maybe like a Camaro: you dislike the concept, but they may have made it powerful and easy enough that you want it anyway. This is probably no help.


My perspective on a brief trial of BigQuery and RedShift:

1) RedShift is PostgreSQL 8 with some additional features, and while it can significantly improve some query runtimes, it comes with the usual DBA burdens of an on-premise database.

2) BigQuery is a true cloud implementation. Its SQL syntax is evolving and may lack some features older platforms have, but the concepts of the physical implementation are completely removed from the user's concern.

Were I going to start a new project without existing dependence on AWS, BigQuery would have an advantage.


BigQuery for data storage + Looker for visualization.

RedShift is not a real managed service. You have to handle all the maintenance/scaling/indexing/provisioning; it's a hassle.


I've used BQ, RS and Athena. Benchmark results for each, along with links describing how I conducted the benchmarks, can be found here: http://tech.marksblogg.com/benchmarks.html

I could probably write a book on my experiences but I'll keep this short.

BigQuery requires the least amount of work to keep performant, but this comes at a cost (its pricing matches Athena's).

Redshift seems to have support for just about every BI tool out there, including Periscope, which I've found to be popular with my customers. The downside is that you've really got to think about how you lay out your data and how you write your queries, and there's something of a weekly task of bringing up the console and hunting down the worst-offending queries for optimisation.
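
For that weekly hunt, a minimal sketch of one way to find the worst offenders using Redshift's system tables (the exact columns you pick are a judgment call):

  -- Find the 20 longest-running queries from the past week.
  SELECT q.query,
         TRIM(q.querytxt) AS query_text,
         DATEDIFF(second, q.starttime, q.endtime) AS duration_s
  FROM stl_query q
  WHERE q.starttime > DATEADD(day, -7, GETDATE())
  ORDER BY duration_s DESC
  LIMIT 20;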

Keep in mind that if you're already on Google Cloud or AWS, you could incur bandwidth charges moving data to the other provider.


What does the rest of your infrastructure look like? Are you running on AWS already, or on GCP?


In AWS, so we understand there will be some extra costs sending data to BigQuery.


I work on the BigQuery team and am happy to answer any questions you may have. There are quite a few AWS customers running on BigQuery (see [0], [1], [7], [8]).

I'll offer some resources:

- A session from last Google NEXT, where Yahoo, NY Times, and Blue Apron detail their experience with migrating to BigQuery [0]

- A session from previous Google NEXT, where Kabam Games detailed their experience with migrating to BigQuery (and running the rest of their workload on AWS) [1]

- A blog "15 things you probably didn't know about BigQuery" [2]

- A blog detailing what BigQuery means by "fully managed" [3]

- A blog offering some details on the most recent iteration of Dremel, "In-memory query execution in Google BigQuery" [4]

- A blog describing Capacitor, the opinionated storage format that replaced ColumnIO (on which Parquet is based) [5]

- "BigQuery Under The Hood" [6]

- Since you specifically asked about load performance, see [9]

(A wholly redundant disclosure that I work at Google Cloud)

[0] https://www.youtube.com/watch?v=TLpfGaYWshw

[1] https://www.youtube.com/watch?v=6Nv18xmJirs

[2] https://medium.com/google-cloud/15-awesome-things-you-probab...

[3] https://cloud.google.com/blog/big-data/2016/08/google-bigque...

[4] https://cloud.google.com/blog/big-data/2016/08/in-memory-que...

[5] https://cloud.google.com/blog/big-data/2016/04/inside-capaci...

[6] https://cloud.google.com/blog/big-data/2016/01/bigquery-unde...

[7] https://cloud.google.com/customers/sharethis/

[8] https://storify.com/felipehoffa/dow-jones-on-bigquery

[9] https://medium.com/google-cloud/paying-it-forward-how-bigque...


Thanks for the wealth of resources. After reading through, the only big concern I have with BigQuery is pricing. I like the model of pay-per-query, but it seems like costs can get out of hand easily.

These two issues particularly worry me:

- No non-date partitioning, which means you'll often end up (in my particular use case) scanning 10-100x as much data as you actually want to.

- Pricing is based on uncompressed data size, which is at least several times more expensive than the alternative (e.g. Athena).

Any thoughts on these? We'll have several hundreds of TB of data, so cost matters.


Happy to help!

On partitioning - sounds like a great candidate for sharding. BigQuery standard SQL makes it very easy to run unions over shards (e.g. SELECT * FROM tables_*).
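
A minimal sketch of that wildcard pattern in standard SQL (the project, dataset and table names are hypothetical):

  -- Query all date-sharded tables in a range without listing them explicitly.
  SELECT event_id, payload
  FROM `my_project.analytics.events_*`
  WHERE _TABLE_SUFFIX BETWEEN '20170101' AND '20170331';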

On pricing - BigQuery is a very different product functionally, and I encourage you to compare it via a PoC. That said, definitely reach out to me on price.


We have ~80TB of (compressed) data in Snowflake at Celtra and I work with Snowflake on a daily basis. We've been using it in production for about a year. Overall maintenance is minimal and the product is very stable.

Pros:

  - Support for semi-structured nested data (think JSON, Avro, Parquet) and querying it in-database with custom operators (see the SQL sketch after this list).
  - Separation of compute from storage. Since S3 is used for storage, you can just spawn as many compute clusters as needed - no congestion for resources.
  - CLONE capability (also in the sketch after this list). Snowflake allows you to do a zero-copy CLONE, which copies just the metadata but not the actual data (you can clone a whole database, a particular schema or a particular table). This is particularly useful for QA scenarios, because you don't need to retain/backup/copy over a large table - you just CLONE and can run some ALTERs on the clone of the data. Truth be told, there are some privilege bugs there, but I've already reported those and Snowflake is working on them.
  - Support for UDFs, including Javascript UDFs. We've had to do a full ~80TB table rewrite, and being able to do this without copying data outside of Snowflake was a massive gain.
  - Pricing model. We did not like BigQuery's query-based model much, because it's harder to control costs. Storage on Snowflake costs the same as S3 ($27/TB compressed), while BigQuery charges for scans of uncompressed data.
  - Database-level atomicity and transactions (instead of table-level on BigQuery).
  - Seamless S3 integration. With BigQuery, we'd have to copy all data over to GCS first.
  - JDBC/ODBC connectivity. At the time we were evaluating Snowflake vs. BigQuery (1.5 years ago), BigQuery didn't support JDBC.
  - You can define separate ACLs for storage and compute.
  - Snowflake was faster when the data size scanned was smaller (GBs).
  - Concurrent DML (insert into the same table from multiple processes - locking happens on a partition level).
  - Vendor support.
  - ADD COLUMN, DROP COLUMN, RENAME all work as you would expect from a columnar database.
  - Some cool in-database analytics functions, like HyperLogLog objects (that are aggregatable).
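
To make the semi-structured and CLONE points concrete, a minimal sketch in Snowflake SQL (the table and column names are made up):

  -- Semi-structured data lands in a VARIANT column and is queried with path syntax.
  CREATE TABLE raw_events (payload VARIANT);

  SELECT payload:user.id::string    AS user_id,
         payload:event.type::string AS event_type
  FROM raw_events
  WHERE payload:event.type::string = 'click';

  -- Zero-copy clone: only metadata is copied, so it's cheap even for large tables.
  CREATE TABLE raw_events_qa CLONE raw_events;
  ALTER TABLE raw_events_qa ADD COLUMN processed BOOLEAN;
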
Cons:

  - Nested data is not first-class. It's supported via the semi-structured VARIANT data type, but there is no schema if you use it. So you can't have nested data and a defined schema at the same time; you have to pick one.
  - Snowflake uses a proprietary data storage format and you can't access the data directly (even though it sits on S3). For example, when using the Snowflake-Spark connector there is a lot of copying of data going on: S3 -> Snowflake -> S3 -> Spark cluster, instead of just S3 -> Spark cluster.
  - BigQuery was faster for full table scans (TBs).
  - Does not release locks if the connection drops. It's a pain to handle that yourself, especially if you can't control the clients which are killed.
  - No indexes. Also no materialized views. Snowflake allows you to define clustering keys (see the sketch after this list), which will retain sort order (not global!), but it has certain bugs and we've not been using it seriously yet. In particular, it doesn't seem to be suited for small tables, or tables with frequent small inserts, as it doesn't do file compaction (the number of files just grows, which hits performance).
  - Which brings me to the next point. If your use case is more streaming in nature (more frequent but smaller inserts), I don't think Snowflake would handle it well. For one use case we're inserting every minute, and we're having problems with the number of files. For another use case we're ingesting once per hour, and this works okay.
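
For the clustering-key point, a minimal sketch (again, the table and column names are made up):

  -- Ask Snowflake to keep the table roughly sorted by event_date.
  -- Per the caveats above: the sort order is not global, and frequent small
  -- inserts can still leave many small files.
  ALTER TABLE raw_events CLUSTER BY (event_date);
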
Some (non-obvious) limitations:

  - 50 concurrent queries/user
  - 150 concurrent queries/account
  - streaming use cases (see above)
  - 1s connection times on the ODBC driver (JDBC seems to be better)
If you decide on Snowflake or have more questions, I can help with more specific questions/use cases.


Here are some useful updates and additional information on some of the items mentioned above regarding Snowflake:

- The concurrency limits mentioned above are soft limits that can be raised on customer request (those defaults are there so that runaway applications can be detected easily). Snowflake can handle very high concurrency--we have customers running hundreds of concurrent queries.

- We’ve recently released a new Spark connector with a bunch of optimizations, including additional push-down capabilities that speed up performance significantly.

- The clustering capability is currently in "preview"; we're definitely taking input and have been working on incorporating the feedback we've received so far.

- One important thing to note when it comes to full table scans is that Snowflake allows you to choose how much horsepower you apply to the job, so you can easily scale up to get faster scans.


Forgot to state the disclaimer that I work for Snowflake.


Many thanks! That's by far the most detailed analysis of Snowflake that I've seen.


Disclaimer: I'm the co-founder of a start-up that provides performance analytics for Amazon Redshift. So I'm biased towards using Redshift because otherwise I can't sell you our service. With that cleared, some considerations.

One of our customers made what I thought was a great observation: "I see a lot of people using Redshift. But I don't see anybody using it happily." That's because of all the issues that people point out in this thread: lots of operations, little visibility ("black box"). All very true. It's the problem our product solves, so let me talk about how we see companies successfully using Redshift at scale (and that customer is now a happy Redshift user, btw).

Here's the architecture that we're seeing companies on AWS moving to:

- all your data in S3 (I hate the term, but call it a "data lake")

- a subset of your data in Redshift, for ongoing analysis

- daily batch jobs to move data in and out of Redshift from / to S3

- Athena to query data that's in S3 and not in Redshift

The subset of the data sitting in Redshift is determined by your needs / use cases. For example, let's say you have 3 years of data, but your users only query data that's less than 6 months old. Then moving data older than 6 months to S3 makes a lot of sense. Much cheaper. For the edge cases where a user does want to query data older than 6 months, you use Athena to query the data sitting in S3.
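
A hedged sketch of that daily archive step (the bucket, IAM role ARN, table name and cutoff date are all placeholders):

  -- Move rows older than the cutoff out of Redshift into S3...
  UNLOAD ('SELECT * FROM events WHERE event_date < ''2016-07-01''')
  TO 's3://my-bucket/archive/events_'
  IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
  GZIP DELIMITER '|';

  -- ...then drop them from the hot table.
  DELETE FROM events WHERE event_date < '2016-07-01';

  -- Loading new data goes the other way with COPY.
  COPY events
  FROM 's3://my-bucket/incoming/events_'
  IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
  GZIP DELIMITER '|';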

Why not use Athena for everything? Two reasons. For one, Athena is less mature. And then you still need a place to run your transformations, and Redshift is the better choice for that.

Three use cases for Redshift:

1) classic BI / reporting ("the past")

2) log analysis ("here and now")

3) predictive apps ("the future")

Your use case will determine how you have to think about your data architecture. For some it's ok if a daily report takes a few minutes to run. I'll go get a coffee. It also means my batch jobs are running on daily or hourly cycles. But if I'm running a scoring model, e.g. for fraud prevention, I want that score to be as fresh / real-time as possible. And make sure that the transformations leading up to that score have executed in their proper order.

For Redshift to work at scale, there are really only three key points you need to check the box on:

- setting up your WLM queues so that your important workloads are isolated from each other (see the sketch after this list)

- allocating enough concurrency and memory to each queue to enable high throughput

- monitoring your stack for real-time changes that can affect those WLM settings
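
One simple way to make the first point concrete, assuming a WLM queue keyed on a query group named 'etl' has already been defined in the cluster's parameter group (the group name is made up):

  -- Route this session's statements to the ETL queue so they don't compete
  -- with dashboard queries for concurrency slots and memory.
  SET query_group TO 'etl';
  -- ... run the batch/transform statements here ...
  RESET query_group;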

That's it. And then Redshift has a rich ecosystem and rich feature set to enable all use cases (ETL vendors, dashboard tools, etc.). Those users who have addressed the performance challenges that do come at scale with Redshift - they're happy users and never looked back.

On the pay-per-TB pricing for BigQuery and Snowflake. I think that's more of a marketing spin. Most companies we work with are compute-bound. So per-TB pricing helps them very little. They want to crunch their data as fast as possible. More CPUs give them more I/O. If you feel you're storage-bound - take a hard look at the data that really needs to be available in Redshift for analysis, and move everything else to S3.

For inspiration, watch the AWS re:Invent videos on S3 / Redshift / Athena from Netflix and NASDAQ. Like this one:

https://www.youtube.com/watch?v=o52vMQ4Ey9I&t=256s

If your entire data set is already within AWS, I think Redshift is the way to go. But then again I'm biased :)


> On the pay-per-TB pricing for BigQuery and Snowflake. I think that's more of a marketing spin. Most companies we work with are compute-bound. So per-TB pricing helps them very little. They want to crunch their data as fast as possible. More CPUs give them more I/O. If you feel you're storage-bound - take a hard look at the data that really needs to be available in Redshift for analysis, and move everything else to S3.

Do you mind elaborating on what you mean by that? I've worked a whole lot on BigQuery pricing, and am a little confused by your comment. Do you mean that separation of storage and compute is an overrated concept?

(work on Google Cloud)


To clarify, Snowflake doesn't charge for processing per-TB like BigQuery, but for the (fully elastic) compute time.

(disclaimer: I work at Snowflake)


Thanks for the clarification!

BigQuery's on-demand model charges just for the resources consumed during the job execution (via a per-TB proxy), rather than resources provisioned. This is the highest order of cloud-native pricing models, and good on Athena for doing the same!

If one were to equate this to VM pricing, you're getting:

- The ability to scale from 0 to thousands of cores in roughly 1 second (what it takes for Dremel to put together the execution plan)

- Pipelined execution (meaning each processing stage ONLY uses the resources required for that stage, rather than net resources. Think of this as per-second autoscaling from query stage to query stage. This is part of the reason why BigQuery does very well in concurrency tests)

- Equivalent of per-second billing

This in practice equates to 100% resource efficiency - never pay for resources you deploy but don't use.

I've written on this topic at [0] and at [1] and happy to chat further!

(And BigQuery did introduce a Flat Rate pricing model last Fall for the more enterprisey crowd)

[0] https://cloud.google.com/blog/big-data/2016/02/understanding...

[1] https://cloud.google.com/blog/big-data/2016/02/visualizing-t...


This is a very elaborate and insightful answer. Appreciate the clarification.



