It's a bit of a shame they never released a paper on how it works, well, nothing since Dremel anyway. And Collosus isn't public either.
I can already see this in built ML stuff being useful for trialling models, especially as it's built right into the SQL.
Happy to answer any questions!
Has anything changed on BigQuery since then that would warrant rerunning those benchmarks? If Amazon’s largest cluster outperforms BigQuery then is your decision about which service to use just a cost calculation of dedicated per month for Redshift vs. projected units scanned per month on BigQuery?
In the article you linked, they gloss over the part that takes by far the longest when working with redshift: setting up "compression, distribution keys on large tables, and sort keys on commonly filtered columns". With BigQuery, you don't really worry about that crap, or about reloading your terabytes of data if you made a mistake somewhere in the schema. You also don't worry about vacuuming, or running out of space, or taking down the database from excessive CPU usage. Do you want to have a team of people at your company whose job is just to keep Redshift running smoothly? Or do you want another team of analysts or engineers?
I could see some companies being forced to use Redshift, especially if they're using S3 a lot, and I could see people saving money in some specialized use cases, but for most people BigQuery is almost certainly going to be cheaper and faster in almost every way.
That’s exactly what the TPC benchmarks are designed to show: how different appliances perform under the same diverse set of generic workloads. As of 2 years ago, they show that Redshift is faster.
> With BigQuery, you don't really worry about that crap
The benefit this depends on your organization’s level of expertise. If you grok sorting and distribution then you can leverage those to increase performance—but it’s not a prerequisite.
For example, in Redshift, when you bulk load data into a table, if the user didn’t specify a compression scheme in the table definition, Redshift will analyze the data and find a scheme that works best, and automatically apply it to the table for you. BigQuery almost certainly does something similar. The difference is, with BigQuery, you’re not invited to take part in that discussion. And you’re charged as if the data is uncompressed. Psychologically, this is a huge relief if you don’t (want to) know how compression works but rest assured you’re paying for it somehow.
To draw a tired analogy, vehicles with automatic transmission still have to shift gears. If you’re driving to the grocery store, not having to worry about that is a win. But if you’re racing stock cars, you’re definitely going to want a stick.
- We finance compute resources required for you to ingest data into BigQuery. With Redshift, you pay for ingest directly via compute cluster consumption (again, more expensive than storage). This also increases your complexity due to on-cluster contention of resources between ingest and query.
- Separation of storage and compute gives you lots of options. With BigQuery ,you don't need to attach relatively expensive compute just for the luxury of getting more storage. Spectrum helps somewhat, but ultimately with Redshift you don't even get to pay for storage - you pay for compute/storage combos.
- BigQuery's Long-Term Storage is not an archival storage tier - it's a discount on storage, with identical performance and durability characteristics. At only $0.01 per GB per month.
This is likely a result of origins of the two technologies. BigQuery is Dremel, written and operated by Google since 2006. Redshift purchased source code to an on-premise fork of Postgres.
a) Do you look at the benchmarks that each company produces, and choose the one that publishes the numbers that make them look the best?
b) Do you ask "I have n data analysts with m different questions and I want to give them the most productive platform I can".
Well, this is what Twitter chose:
- How Twitter Migrated its On-Prem Analytics to Google Cloud
Or listen to Nick Caldwell, Reddit VP Engineering, moving away from AWS to BigQuery:
- "2017, which effectively brought us to the present system, we began forking all of our event data into BigQuery, after considering a lot of different alternatives" https://youtu.be/tKISLQ87GO8?t=426
I prefer method b) :)
Disclosure: I'm https://twitter.com/felipehoffa and I work for GCP
> BigQuery has achieved approximately 5x improvement in TPC-DS 10T power run geomean over the past 18 months.
Outside of that, BigQuery now provides better partitioned table functionality (https://cloud.google.com/bigquery/docs/partitioned-tables) as well as clustering (https://cloud.google.com/bigquery/docs/clustered-tables), both of which can help to reduce costs and improve performance. In terms of pricing, many large customers have reserved slots (https://cloud.google.com/bigquery/docs/slots), where you pay a fixed monthly cost in exchange for dedicated resources, similar to running a large Redshift instance.
As a disclosure, I work on the BigQuery team.
Here's some food for thought.
- We recently announced that we've improved our TPC-DS 10T geomean by 5x over past 18 months . The nice thing is that users never had to set maintenance windows or "upgrade their clusters" to get these performance improvements.
- Here's a session by our eng lead and CTO of Looker discussing various performance improvements, including querying a PB of data in ~5 seconds and paying a fraction of a penny for it.
- Benchmarks are important, but they miss three key scenarios.
------ Effort - how much complexity is there to achieve maximum performance. As someone else stated in this thread, we feel that we differentiate here.
------ Maintenance - when data changes, new data comes in, you run DML, what happens to this pristine benchmark of yours? What is required to maintain this performance? Again, we feel that our offering is compelling (we don't ever have messy storage state that requires a vacuum, our compute is entirely stateless so re-sorting/re-distributing data is not needed, and BigQuery ingest never affects query capacity, even at PB/day ingest scale).
------ Real-world usage - a serially-executed set of queries poorly represents what happens in reality - high volatility, high concurrency workloads. Again, BigQuery has stateless compute, separation of compute and storage, and separation of compute and state.
- customers - in addition to what's been stated already, PTAL at Kabam  & NYT/Yahoo/BlueApron  for their motivations for migrating from Redshift to BigQuery
That said, lots of folks really like running on Redshift, and they've extended their platform with Spectrum and other bits. I'd invite you to find out what works best for you. Competition is good for the end user!
I suspect you might be hearing some feedback from us soon w.r.t this trial, assuming the google people we are working with relay that back to you?
Anything you don't like is an opportunity to make our product better
- multidimensional / hierarchy modeling for analytical purposes
- permissioning model (roles with element-level granularity, ie. User A - allow Country=USA, User B - allow Product=Bike)
- historical modeling (built-in slowly-changing dimensions support)
What is the underlying storage model? Is it a column store ? Is it closer to traditional row-based stores?
Are you taking advantage of GPUs or other dedicated hardware to accelerate BQ?
tl;dr for underlying storage model: distributed column store which pushes computation down a tree to leaf nodes to parallelize disk I/O. Parent nodes aggregate computations before returning to the client.
We at BuzzFeed have seen similar results for migrating to BigQuery (and we did a talk about it at Next too, although it’s not online yet)
BuzzFeed talk: https://youtu.be/74PKZmL-Qjs
I quoted the reddit talk on the thread above/below :).
There was also a public talk on D, the GFS chunkserver replacement, but there don't seem to be recordings or slides from that.
It seems like for many use cases, you'd have to pay quite a bit of attention to partitioning the database in a way that minimizes query costs.
As BQ only supports DAY partitioning, I had to create multiple tables that represent the other dimensions, but it seems to work OK, if a little clunky when you have to use wildcards/_TABLE_SUFFIX
Disclosure: I'm https://twitter.com/felipehoffa and I work for Google Cloud. And I'm really excited to reprocess all public tables into clustered ones.
Compared to many players in different verticals, our data is small. But in our vertical of asthma care, we probably have one of the largest (possibly the largest) asthma data.
We've been looking at different way of plumbing the data to automate and run some rudimentary analysis on it, since we found BigQuery a bit limiting. Now seeing this announcement, it could be a great start for us.
I hope this is just a start and people like us can send Google Cloud team a wish list as we come across various needs. Good job and thanks to everyone behind this release.
I wonder how they're doing logistic regression. It doesn't have a closed form and normally solved with an iterative, multi-pass optimization. edit: Read more closely. They're using (non-stochastic) batch gradient descent.
Pandas is now the standard for data analysis (as long as things fit into memory). It's much much easier to debug than a SQL command. You can write operations as a succession of small logical steps (instead of one huge query that is hard to debug).
It's raw Python, so you can do something like:
df.groupby('movie_id').agg(dict(ratings='median', price=lambda p : np.percentile(p, .95))).plot.bar(bins=50)
I just use SQL to grab and if necessary aggregate the data and then do everything else in Pandas - using Python custom reducers to deployed trained models although we are migrating to GCP now so soon that won't be necessary.