Hacker News new | past | comments | ask | show | jobs | submit login
Machine Learning in Google Bigquery (googleblog.com)
199 points by zitterbewegung on July 25, 2018 | hide | past | favorite | 49 comments

We're trialling BigQuery at the moment, pushed about 130TB into it last week (about ~300 billion rows) and have been blown away by the performance of it.

It's a bit of a shame they never released a paper on how it works, well, nothing since Dremel anyway. And Collosus isn't public either.

I can already see this in built ML stuff being useful for trialling models, especially as it's built right into the SQL.

Bq pm here. No papers, yes, but lots of talks and blog posts. Here's a list:


Happy to answer any questions!

Did you all work together with the Firebase team on the recent schema change? We're scrambling to mitigate the ~2x cost increase that caused for us, since it blew our analytics budget out of the water.

Can you post on https://stackoverflow.com/questions/tagged/google-bigquery? I haven't heard about this, but if you give the community good specifics, we'll work on it.

I second Felipe's comment. Shoot me a note.

It was a change that Firebase implemented that went into effect about a month ago. Check this out and the look at the "Old Export Schema" at the bottom. The export schema went from multiple events per record to one event per record, with the User Parameters repeated on each record. That bloated the exports from Firebase and increased costs. https://support.google.com/firebase/answer/7029846?hl=en

This blog post from 2 years ago from AWS has Redshift beating BigQuery pretty handily on TPC-H and TPS-DS: https://aws.amazon.com/blogs/big-data/fact-or-fiction-google...

Has anything changed on BigQuery since then that would warrant rerunning those benchmarks? If Amazon’s largest cluster outperforms BigQuery then is your decision about which service to use just a cost calculation of dedicated per month for Redshift vs. projected units scanned per month on BigQuery?

I've worked with both, and BigQuery is so much easier to work with. Redshift does give you more control (to a limited extent) but it takes a lot more work and specialized knowledge to perform simple tasks well.

In the article you linked, they gloss over the part that takes by far the longest when working with redshift: setting up "compression, distribution keys on large tables, and sort keys on commonly filtered columns". With BigQuery, you don't really worry about that crap, or about reloading your terabytes of data if you made a mistake somewhere in the schema. You also don't worry about vacuuming, or running out of space, or taking down the database from excessive CPU usage. Do you want to have a team of people at your company whose job is just to keep Redshift running smoothly? Or do you want another team of analysts or engineers?

I could see some companies being forced to use Redshift, especially if they're using S3 a lot, and I could see people saving money in some specialized use cases, but for most people BigQuery is almost certainly going to be cheaper and faster in almost every way.

> I could see people saving money in some specialized use cases, but for most people BigQuery is almost certainly going to be cheaper and faster in almost every way

That’s exactly what the TPC benchmarks are designed to show: how different appliances perform under the same diverse set of generic workloads. As of 2 years ago, they show that Redshift is faster.

> With BigQuery, you don't really worry about that crap

The benefit this depends on your organization’s level of expertise. If you grok sorting and distribution then you can leverage those to increase performance—but it’s not a prerequisite.

For example, in Redshift, when you bulk load data into a table, if the user didn’t specify a compression scheme in the table definition, Redshift will analyze the data and find a scheme that works best, and automatically apply it to the table for you. BigQuery almost certainly does something similar. The difference is, with BigQuery, you’re not invited to take part in that discussion. And you’re charged as if the data is uncompressed. Psychologically, this is a huge relief if you don’t (want to) know how compression works but rest assured you’re paying for it somehow.

To draw a tired analogy, vehicles with automatic transmission still have to shift gears. If you’re driving to the grocery store, not having to worry about that is a win. But if you’re racing stock cars, you’re definitely going to want a stick.

Compute resources tend to be significantly more expensive than storage. Our approach is two-fold:

- We finance compute resources required for you to ingest data into BigQuery. With Redshift, you pay for ingest directly via compute cluster consumption (again, more expensive than storage). This also increases your complexity due to on-cluster contention of resources between ingest and query.

- Separation of storage and compute gives you lots of options. With BigQuery ,you don't need to attach relatively expensive compute just for the luxury of getting more storage. Spectrum helps somewhat, but ultimately with Redshift you don't even get to pay for storage - you pay for compute/storage combos.

- BigQuery's Long-Term Storage is not an archival storage tier - it's a discount on storage, with identical performance and durability characteristics. At only $0.01 per GB per month.

This is likely a result of origins of the two technologies. BigQuery is Dremel, written and operated by Google since 2006. Redshift purchased source code to an on-premise fork of Postgres.

Let's say you are Twitter, Reddit, or the CTO of your current company. How do you choose your next analytical platform:

a) Do you look at the benchmarks that each company produces, and choose the one that publishes the numbers that make them look the best?

b) Do you ask "I have n data analysts with m different questions and I want to give them the most productive platform I can".

Well, this is what Twitter chose:

- How Twitter Migrated its On-Prem Analytics to Google Cloud https://www.youtube.com/watch?v=N3JAwCYGHU8

Or listen to Nick Caldwell, Reddit VP Engineering, moving away from AWS to BigQuery:

- "2017, which effectively brought us to the present system, we began forking all of our event data into BigQuery, after considering a lot of different alternatives" https://youtu.be/tKISLQ87GO8?t=426

I prefer method b) :)

Disclosure: I'm https://twitter.com/felipehoffa and I work for GCP

From today's release notes (https://cloud.google.com/bigquery/docs/release-notes#july_25...):

> BigQuery has achieved approximately 5x improvement in TPC-DS 10T power run geomean over the past 18 months.

Outside of that, BigQuery now provides better partitioned table functionality (https://cloud.google.com/bigquery/docs/partitioned-tables) as well as clustering (https://cloud.google.com/bigquery/docs/clustered-tables), both of which can help to reduce costs and improve performance. In terms of pricing, many large customers have reserved slots (https://cloud.google.com/bigquery/docs/slots), where you pay a fixed monthly cost in exchange for dedicated resources, similar to running a large Redshift instance.

As a disclosure, I work on the BigQuery team.

Great question. At the time we had a hard time replicating results of that blog post.

Here's some food for thought.

- We recently announced that we've improved our TPC-DS 10T geomean by 5x over past 18 months [0]. The nice thing is that users never had to set maintenance windows or "upgrade their clusters" to get these performance improvements.

- Here's a session by our eng lead and CTO of Looker discussing various performance improvements, including querying a PB of data in ~5 seconds and paying a fraction of a penny for it.[1]

- Benchmarks are important, but they miss three key scenarios.

------ Effort - how much complexity is there to achieve maximum performance. As someone else stated in this thread, we feel that we differentiate here.

------ Maintenance - when data changes, new data comes in, you run DML, what happens to this pristine benchmark of yours? What is required to maintain this performance? Again, we feel that our offering is compelling (we don't ever have messy storage state that requires a vacuum, our compute is entirely stateless so re-sorting/re-distributing data is not needed, and BigQuery ingest never affects query capacity, even at PB/day ingest scale).

------ Real-world usage - a serially-executed set of queries poorly represents what happens in reality - high volatility, high concurrency workloads. Again, BigQuery has stateless compute, separation of compute and storage, and separation of compute and state.

- customers - in addition to what's been stated already, PTAL at Kabam [2] & NYT/Yahoo/BlueApron [3] for their motivations for migrating from Redshift to BigQuery

That said, lots of folks really like running on Redshift, and they've extended their platform with Spectrum and other bits. I'd invite you to find out what works best for you. Competition is good for the end user!

[0] https://cloud.google.com/bigquery/docs/release-notes#july_25...

[1] https://www.youtube.com/watch?v=6Nv18xmJirs

[2] https://www.youtube.com/watch?v=6Nv18xmJirs

[3] https://www.youtube.com/watch?v=TLpfGaYWshw

Plenty to chew on here. Thank you

Please don't hesitate to reach out for follow ups

Thanks for the links, will definitely take a look.

I suspect you might be hearing some feedback from us soon w.r.t this trial, assuming the google people we are working with relay that back to you?

I'd love to hear your feedback, so yes please :)

Anything you don't like is an opportunity to make our product better

How would you compare it to a traditional OLAP solutions (MS SQL Analysis Services in OLAP model (not tabular), Essbase etc.) in terms of :

- multidimensional / hierarchy modeling for analytical purposes

- permissioning model (roles with element-level granularity, ie. User A - allow Country=USA, User B - allow Product=Bike)

- historical modeling (built-in slowly-changing dimensions support)

What is the underlying storage model? Is it a column store ? Is it closer to traditional row-based stores?

Are you taking advantage of GPUs or other dedicated hardware to accelerate BQ?

If you haven't read it already, I strongly suggest reading the original Dremel paper [0]. It's no doubt somewhat out of date, but I believe BQ is based in Dremel.

tl;dr for underlying storage model: distributed column store which pushes computation down a tree to leaf nodes to parallelize disk I/O. Parent nodes aggregate computations before returning to the client.

[0]: https://ai.google/research/pubs/pub36632

This video gives a walkthrough of the new version of Dremel that's been running in production for past 3 years (no paper sorry!):


At the Reddit talk at Google Next, the VP of Engineering noted that they moved to BigQuery and saved in both performance and cost.

We at BuzzFeed have seen similar results for migrating to BigQuery (and we did a talk about it at Next too, although it’s not online yet)

I really enjoyed this slide from the BuzzFeed talk "Hundreds of terabytes across 300+ tables, 100K+ queries a day" on BigQuery.

I quoted the reddit talk on the thread above/below :).

Having recently moved to data science at another big gcp early adopter, BQ was a complete game changer having come from redshift and hive

We (Sojern) have been using BigQuery since late 2015, we moved from Hive. It has been an amazing experience in terms of cost savings (managed infra, operational cost of a hadoop cluster), and performance. We use it across the org from analytics, ingestion, reporting, ML, and many more uses.

There is information about Colossus: the original GFS paper, an interview with Sean Quinlan with McKusick on Colossus improvements, as well as a talk by Denis Serenyi on more of the gory details, e.g. rebalancing.

There was also a public talk on D, the GFS chunkserver replacement, but there don't seem to be recordings or slides from that.

How do you find the querying costs?

It seems like for many use cases, you'd have to pay quite a bit of attention to partitioning the database in a way that minimizes query costs.

For this trial we are transferring some data from AWS, which is already partitioned in S3 by quite a few dimensions, including DAY.

As BQ only supports DAY partitioning, I had to create multiple tables that represent the other dimensions, but it seems to work OK, if a little clunky when you have to use wildcards/_TABLE_SUFFIX

The relatively new date-partitioning feature for tables has cut down query costs/time significantly.

And with the just announced clustering now you get even better costs and performance.

Disclosure: I'm https://twitter.com/felipehoffa and I work for Google Cloud. And I'm really excited to reprocess all public tables into clustered ones.

Cost (or rather bytes scanned) is sent back from the API on any queries or dry run/validation calls. There's a common chrome extension to put it in the legacy ui, and I believe the beta ui has it baked in (as part of validation)

This is very exciting for us, even at its nascent limited ability point.

Compared to many players in different verticals, our data is small. But in our vertical of asthma care, we probably have one of the largest (possibly the largest) asthma data.

We've been looking at different way of plumbing the data to automate and run some rudimentary analysis on it, since we found BigQuery a bit limiting. Now seeing this announcement, it could be a great start for us.

I hope this is just a start and people like us can send Google Cloud team a wish list as we come across various needs. Good job and thanks to everyone behind this release.

Can I ask more about the structure of the data and what you'd like to do with it? My email is in my profile.

Only logistic and linear regression for now. For use cases it covers, this will save a lot of plumbing.

Independent of DL, I'm curious why these two regression cases haven't been made available in a SQL-like interface until now. Kudos to Google for putting this in the hand of folks who otherwise just use SQL.

Oracle has had linear regression for well over a decade now.

Linear regression makes a lot of sense since it can be done streaming in a single pass. We have it as an aggregator in Hail: https://hail.is/docs/devel/aggregators.html#hail.expr.aggreg...

I wonder how they're doing logistic regression. It doesn't have a closed form and normally solved with an iterative, multi-pass optimization. edit: Read more closely. They're using (non-stochastic) batch gradient descent.

Good point! I wouldn’t be surprised to see a Postgres plugin for this. Wouldn’t get the vast scalability but it would offer convenience.

Good, I hope this trend of shoving ML into SQL (instead of the other way around) continues. I always thought it was silly that every "data wrangling" system like Pandas and R needed to (poorly) re-invent SQL.

Unless you present clear arguments, I'd refrain from saying that Pandas is "poorly re-inventing SQL".

Pandas is now the standard for data analysis (as long as things fit into memory). It's much much easier to debug than a SQL command. You can write operations as a succession of small logical steps (instead of one huge query that is hard to debug).

It's raw Python, so you can do something like:

df.groupby('movie_id').agg(dict(ratings='median', price=lambda p : np.percentile(p, .95))).plot.bar(bins=50)

Yeah, also in Pandas you can do stuff that otherwise requires writing a custom reducer or UDAF in which case you aren't using SQL anyway.

I just use SQL to grab and if necessary aggregate the data and then do everything else in Pandas - using Python custom reducers to deployed trained models although we are migrating to GCP now so soon that won't be necessary.

I appreciate how convenient it is to have statistical analysis tools available directly in BQ SQL, but are linear and logistic regression really considered "machine learning"?

Actually most of online ads ctr prediction machine learning systems are still logistic regression based until very recently. It's simple but very effective.

Just because it's not a neural net doesn't make it not ML. Also, can echo the other folks here: linear models and basic logistic regression are still competitive.

I wholeheartedly agree with you. I'm suggesting that it might be time to have the counter conversation that just because it has numerical statistical analysis does not make it ML. I've been around the field for a couple of decades and understand linear regression is covered in the first pages of chapter one. If that is the standard, then Excel can claim to have had machine learning capabilities for years.

Yes, it definitely is.

Does anyone here tried Druid? I hear at Druid performance is much better in terms of response time, especially for arbitratry queries over large set of dimensions. Did anyone do an in-depth comparison between Bq vs Druid?

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact