
Machine Learning in Google BigQuery - zitterbewegung
http://ai.googleblog.com/2018/07/machine-learning-in-google-bigquery.html
======
djhworld
We're trialling BigQuery at the moment, pushed about 130TB into it last week
(~300 billion rows) and have been blown away by its performance.

It's a bit of a shame they never released a paper on how it works, well,
nothing since Dremel anyway. And Colossus isn't public either.

I can already see this built-in ML stuff being useful for trialling models,
especially as it's built right into the SQL.

~~~
vgt
Bq pm here. No papers, yes, but lots of talks and blog posts. Here's a list:

[https://medium.com/@thetinot/bigquery-required-reading-
list-...](https://medium.com/@thetinot/bigquery-required-reading-
list-71945444477b)

Happy to answer any questions!

~~~
jimmytucson
This blog post from 2 years ago from AWS has Redshift beating BigQuery pretty
handily on TPC-H and TPC-DS: [https://aws.amazon.com/blogs/big-data/fact-or-
fiction-google...](https://aws.amazon.com/blogs/big-data/fact-or-fiction-
google-big-query-outperforms-amazon-redshift-as-an-enterprise-data-warehouse/)

Has anything changed on BigQuery since then that would warrant rerunning those
benchmarks? If Amazon’s largest cluster outperforms BigQuery then is your
decision about which service to use just a cost calculation of dedicated per
month for Redshift vs. projected units scanned per month on BigQuery?
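
To make that cost comparison concrete, here is a minimal sketch of the break-even arithmetic. The function names and every number in it are hypothetical placeholders, not real Redshift or BigQuery prices; plug in your own quotes.

```python
def break_even_tb_scanned(flat_cost_per_month, price_per_tb_scanned):
    """Return the TB scanned per month at which pay-per-scan equals the flat cost."""
    if price_per_tb_scanned <= 0:
        raise ValueError("price_per_tb_scanned must be positive")
    return flat_cost_per_month / price_per_tb_scanned


def cheaper_option(flat_cost_per_month, price_per_tb_scanned, projected_tb_scanned):
    """Compare a flat monthly cluster bill with a projected pay-per-scan bill."""
    pay_per_scan_cost = price_per_tb_scanned * projected_tb_scanned
    return "flat" if flat_cost_per_month < pay_per_scan_cost else "pay-per-scan"


# Placeholder inputs only: a 1000/month cluster vs. 5 per TB scanned.
print(break_even_tb_scanned(1000.0, 5.0))
print(cheaper_option(1000.0, 5.0, projected_tb_scanned=50.0))
```

This ignores storage, ingest, and staffing costs, which the replies below argue matter at least as much.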

~~~
blaisio
I've worked with both, and BigQuery is so much easier to work with. Redshift
does give you more control (to a limited extent) but it takes a lot more work
and specialized knowledge to perform simple tasks well.

In the article you linked, they gloss over the part that takes by far the
longest when working with redshift: setting up "compression, distribution keys
on large tables, and sort keys on commonly filtered columns". With BigQuery,
you don't really worry about that crap, or about reloading your terabytes of
data if you made a mistake somewhere in the schema. You also don't worry about
vacuuming, or running out of space, or taking down the database from excessive
CPU usage. Do you want to have a team of people at your company whose job is
just to keep Redshift running smoothly? Or do you want another team of
analysts or engineers?

I could see some companies being forced to use Redshift, especially if they're
using S3 a lot, and I could see people saving money in some specialized use
cases, but for most people BigQuery is almost certainly going to be cheaper
and faster in almost every way.

~~~
jimmytucson
> I could see people saving money in some specialized use cases, but for most
> people BigQuery is almost certainly going to be cheaper and faster in almost
> every way

That’s exactly what the TPC benchmarks are designed to show: how different
appliances perform under the same diverse set of generic workloads. As of 2
years ago, they show that Redshift is faster.

> With BigQuery, you don't really worry about that crap

The benefit of this depends on your organization’s level of expertise. If you
grok sorting and distribution then you can leverage those to increase
performance—but it’s not a prerequisite.

For example, in Redshift, when you bulk load data into a table, if the user
didn’t specify a compression scheme in the table definition, Redshift will
analyze the data and find a scheme that works best, and automatically apply it
to the table for you. BigQuery almost certainly does something similar. The
difference is, with BigQuery, you’re not invited to take part in that
discussion. And you’re charged as if the data is uncompressed.
Psychologically, this is a huge relief if you don’t (want to) know how
compression works but rest assured you’re paying for it somehow.

To draw a tired analogy, vehicles with automatic transmission still have to
shift gears. If you’re driving to the grocery store, not having to worry about
that is a win. But if you’re racing stock cars, you’re definitely going to
want a stick.

~~~
vgt
Compute resources tend to be significantly more expensive than storage. Our
approach is two-fold:

\- We finance compute resources required for you to ingest data into BigQuery.
With Redshift, you pay for ingest directly via compute cluster consumption
(again, more expensive than storage). This also increases your complexity due
to on-cluster contention of resources between ingest and query.

\- Separation of storage and compute gives you lots of options. With BigQuery,
you don't need to attach relatively expensive compute just for the luxury of
getting more storage. Spectrum helps somewhat, but ultimately with Redshift
you don't even get to pay for storage - you pay for compute/storage combos.

\- BigQuery's Long-Term Storage is not an archival storage tier - it's a
discount on storage, with identical performance and durability
characteristics. At only $0.01 per GB per month.

This is likely a result of origins of the two technologies. BigQuery is
Dremel, written and operated by Google since 2006. Redshift purchased source
code to an on-premise fork of Postgres.

------
salimmadjd
This is very exciting for us, even in its nascent, limited state.

Compared to many players in different verticals, our data is small. But in our
vertical of asthma care, we probably have one of the largest (possibly the
largest) asthma datasets.

We've been looking at different ways of plumbing the data to automate and run
some rudimentary analysis on it, since we found BigQuery a bit limiting. Now
seeing this announcement, it could be a great start for us.

I hope this is just a start and people like us can send Google Cloud team a
wish list as we come across various needs. Good job and thanks to everyone
behind this release.

~~~
cottonseed
Can I ask more about the structure of the data and what you'd like to do with
it? My email is in my profile.

------
mooneater
Only logistic and linear regression for now. For use cases it covers, this
will save a lot of plumbing.

~~~
Karrot_Kream
Independent of DL, I'm curious why these two regression cases haven't been
made available in a SQL-like interface until now. Kudos to Google for putting
this in the hands of folks who otherwise just use SQL.

~~~
mark_l_watson
Good point! I wouldn’t be surprised to see a Postgres plugin for this.
Wouldn’t get the vast scalability but it would offer convenience.

~~~
ximeng
[https://www.postgresql.org/docs/current/static/functions-
agg...](https://www.postgresql.org/docs/current/static/functions-
aggregate.html#FUNCTIONS-AGGREGATE-STATISTICS-TABLE)
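
For anyone wondering what aggregates like regr_slope and regr_intercept on that page boil down to: a simple least-squares line can be computed from plain SUM/COUNT aggregates. A minimal sketch, using Python's built-in sqlite3 purely as a stand-in engine (SQLite has no regr_* functions); the table name obs and the helper fit_line_in_sql are made up for illustration.

```python
import sqlite3


def fit_line_in_sql(points):
    """Fit y = slope * x + intercept using only SQL aggregates."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE obs (x REAL, y REAL)")
    conn.executemany("INSERT INTO obs (x, y) VALUES (?, ?)", points)
    # Closed-form ordinary least squares for a single predictor:
    #   slope     = (n*Sxy - Sx*Sy) / (n*Sxx - Sx^2)
    #   intercept = (Sy - slope*Sx) / n
    row = conn.execute(
        """
        WITH s AS (
            SELECT COUNT(*) AS n,
                   SUM(x) AS sx, SUM(y) AS sy,
                   SUM(x * y) AS sxy, SUM(x * x) AS sxx
            FROM obs
        )
        SELECT (n * sxy - sx * sy) / (n * sxx - sx * sx) AS slope,
               (sy - (n * sxy - sx * sy) / (n * sxx - sx * sx) * sx) / n AS intercept
        FROM s
        """
    ).fetchone()
    conn.close()
    return row


# Points that lie exactly on y = 2x + 1.
slope, intercept = fit_line_in_sql([(1, 3), (2, 5), (3, 7), (4, 9)])
print(slope, intercept)
```

This only covers one predictor with no regularization, so it is a convenience rather than a replacement for a real training pipeline.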

------
hotpotjunkie
Good, I hope this trend of shoving ML into SQL (instead of the other way
around) continues. I always thought it was silly that every "data wrangling"
system like Pandas and R needed to (poorly) re-invent SQL.

~~~
halflings
Unless you present clear arguments, I'd refrain from saying that Pandas is
"poorly re-inventing SQL".

Pandas is now the standard for data analysis (as long as things fit into
memory). It's much much easier to debug than a SQL command. You can write
operations as a succession of small logical steps (instead of one huge query
that is hard to debug).

It's raw Python, so you can do something like:

df.groupby('movie_id').agg(dict(ratings='median', price=lambda p:
np.percentile(p, 95))).plot.bar()
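
Here is a self-contained sketch of the same idea written as a succession of small steps. The sample values are invented for illustration; only the column names come from the one-liner above. Note that np.percentile expects a value on a 0-100 scale, so the 95th percentile is 95.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; replace with your own DataFrame.
df = pd.DataFrame({
    "movie_id": [1, 1, 1, 2, 2],
    "ratings": [3, 4, 5, 2, 4],
    "price": [10.0, 20.0, 30.0, 5.0, 15.0],
})


def p95(values):
    """95th percentile of a group's values (np.percentile uses a 0-100 scale)."""
    return np.percentile(values, 95)


# Step 1: group rows by movie.
grouped = df.groupby("movie_id")

# Step 2: aggregate each column with its own function (named aggregation).
summary = grouped.agg(
    median_rating=("ratings", "median"),
    p95_price=("price", p95),
)

# Step 3: inspect the intermediate result before plotting or exporting it.
print(summary)
```

Each intermediate (grouped, summary) can be inspected on its own, and summary.plot.bar() would chart the result if matplotlib is installed.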

~~~
alexgmcm
Yeah, also in Pandas you can do stuff that otherwise requires writing a custom
reducer or UDAF in which case you aren't using SQL anyway.

I just use SQL to grab and if necessary aggregate the data and then do
everything else in Pandas - using Python custom reducers to deploy trained
models, although we are migrating to GCP now, so soon that won't be necessary.

------
akadien
I appreciate how convenient it is to have statistical analysis tools available
directly in BQ SQL, but are linear and logistic regression really considered
"machine learning"?

~~~
Xorlev
Just because it's not a neural net doesn't make it not ML. Also, can echo the
other folks here: linear models and basic logistic regression are still
competitive.

~~~
akadien
I wholeheartedly agree with you. I'm suggesting that it might be time to have
the counter conversation that just because it has numerical statistical
analysis does not make it ML. I've been around the field for a couple of
decades and understand linear regression is covered in the first pages of
chapter one. If that is the standard, then Excel can claim to have had machine
learning capabilities for years.

------
spocklivelong
Has anyone here tried Druid? I hear that Druid's performance is much better in
terms of response time, especially for arbitrary queries over a large set of
dimensions. Has anyone done an in-depth comparison of BQ vs Druid?

