In-Database Machine Learning [pdf] (uni-rostock.de)
77 points by _moll on Dec 3, 2020 | 27 comments



"This means we need a way to represent loss functions in SQL, so we first introduce some necessary mathematical background and define the terms used in our implementation. "

Jesus flippin' Crikey, kill it with fire. Academics; where do they come from?

Having scoped out a design for this, seen a design for it in a large corporate research lab, and noting that several large-scale machine learning packages .... almost do it .... themselves: you just need to lay out the data in a memory-friendly way, and you're freaking done. It's a completely straightforward task which could be done with any number of existing columnar data stores, if it hasn't been done already. The only reason it hasn't been done outside of (large research lab) is that the type of skill needed to lay out the data in the DB properly and the type of skill needed to write efficient machine learning algorithms almost never exist in the same person. Then you get academics .... who think you need to write automatic differentiation in SQL to integrate ML with databases, having never heard of, I dunno, online BFGS or Pegasos, and, like, C.


They're getting paid to research different ideas, so that's why they might be proposing ideas you find misguided.

There are instances where researchers spend their time on work whose outcome is mostly a thought experiment / not directly applicable, and there are instances where the work is useful from an application perspective. Such is life.


You left out "there are instances where the researchers actually have no idea what they're talking about, but hope to get a lucrative job in the subject to escape teaching C++ to 19 year olds for the rest of their lives." This may not be such an instance, but it sure looks like one!


> The only reason it hasn't been done outside of (large research lab) is that the type of skill needed to lay out the data in the DB properly and the type of skill needed to write efficient machine learning algorithms almost never exist in the same person. Then you get academics .... who think you need to write automatic differentiation in SQL to integrate ML with databases, having never heard of, I dunno, online BFGS or Pegasos, and, like, C.

I'm curious. In your opinion, is there a scenario where there is a possibility and need to do what the authors have done?


I can't imagine it, just like I can't imagine why you'd want to do it in a DSL oriented around set theory instead of, say, the DB implementation language. I assume they're thinking of deep learning? Why you'd want to do it "in database" is beyond me.

The only reason I thought of doing it was skipping the "marshal the data" step for online algorithms; performance, basically. If you look at something like Vowpal Wabbit, it stores/consumes data in this ridiculous text format dating from when people invented the Support Vector Machine in the 80s. This annoys the shit out of me, as it involves creating the ridiculous 80s text format. What if it consumed database columns instead of this nonsense, and did the hashing trick and everything you needed to solve the problem? That would be cool. That, in fact, would be something like how the universe is supposed to function instead of the stunted grotesqueries we have today. You could do crap like exploratory analysis right on your database, as a query. You could even get fancy and do wackadoo online matrix decompositions while you're writing the data out in the first place (or at least when nobody's looking), and store them as metadata, meaning you know all kinds of good shit about your data even as you're writing it down.

Anyway, because marketing departments keep bellowing about "deep learning" instead of the actual breakthroughs in machine learning and linear algebra of the last 20 years, nobody gave a shit about it. Even (large research group in gigantor corp) couldn't figure out a way of selling the idea. I went on to a productive career in something entirely different, and all I got out of it was the ability to make snarky comments about seemingly clueless academics.


Why would you want to do the machine learning within the DB anyway? To me this looks like rather a corner case. E.g. I usually want to scale my ML infrastructure independently of the DB. To me it seems much easier to pipe the data into a log like Kafka to compute the result, just because that is much easier to scale. Similarly for training, using some files on an object store as the input is usually much easier to scale.


I guess if you don't care about data exploration, efficiency or doing work on data much larger than memory, there's no reason to do it. As you note, most people seem to get along fine without this idea.


Hm. Maybe because you could do it directly in RAM in the future?

Maybe you want to use an encrypted DB without decrypting it?

This is research, not a vanilla solution.


You’re an engineer. You care about engineering. This is fine.

They care about mathematical background and laying out definitions. Academics care about those pointless things. This is also fine. If you don’t care about these things, don’t read it.

You don’t write a system like this to say: “Here’s a system I wrote, go and use it in practice” as much as “What have we learned from trying this?” It’s the job of an engineer to synthesize useful things from the abstract knowledge.


>You’re an engineer.

You failed, first sentence.

You're also the second person asserting some secret knowledge of academic ding dongs who published a paper (as if this is some kind of achievement) without, you know, actually even attempting to make the case.


It was just an assumption. At the very least you think like one.

And I’m not following. I’m not asserting some secret knowledge. This work is knowledge. It’s not practically useful for you. But it doesn’t have to be. And you’re the “ding dong” if you think it has to be valuable to you for someone to get something from it.


Stumbled upon this paper. '...automatic symbolic differentiation framework and the gradient descent operator together with the proposed tensor algebra to be integrated into database systems'. This seems quite a bit more advanced than, say, BigQuery ML in its current form, which as I understand it supports batch gradient descent for some pre-specified models. Any thoughts? Should we expect automatic symbolic differentiation support in future database offerings?
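
For anyone wondering what "gradient descent in SQL" even looks like: here's a hand-rolled sketch (not the paper's operator, and with none of the symbolic differentiation -- the gradients are written out by hand), fitting y ~ w*x + b on an assumed table points(x, y):

    -- Full-batch gradient descent on squared loss, as a recursive CTE (PostgreSQL).
    -- Learning rate 0.01 and 100 iterations are arbitrary choices.
    WITH RECURSIVE gd (iter, w, b) AS (
      SELECT 0, 0.0::float8, 0.0::float8                                    -- init
      UNION ALL
      SELECT iter + 1,
             w - 0.01 * (SELECT avg(2 * (w * x + b - y) * x) FROM points),  -- dL/dw
             b - 0.01 * (SELECT avg(2 * (w * x + b - y)) FROM points)       -- dL/db
      FROM gd
      WHERE iter < 100
    )
    SELECT w, b FROM gd ORDER BY iter DESC LIMIT 1;

What the paper adds on top of this kind of thing, as far as I can tell, is deriving those gradient expressions automatically instead of having you write them by hand.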


Not if there aren't many other papers and projects on the same topic. Lots of papers look enticing on their own, but when judged in context they disappoint.


Microsoft SQL Server 2017 (and later) with Machine Learning Services already does in-database ML [1]. Models are trained, stored and invoked via stored procedures which call R or Python code (SQL is not the best language to do ML in).

The advantage of this approach is that data is never moved outside SQL Server or over the network. The downside I guess is that you need a pretty beefy machine to run the database server.

For uncomplicated ML applications, say a logistic regression over a few columns, this is a relatively easy approach to get results quickly. To me, the actual use cases of in-db ML are limited, but the one case in which I can imagine it being useful is performing live ML on a SQL view that has constantly evolving data -- you save ETL roundtrips to an external ML algorithm.
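
Roughly, a train-and-score call looks like the sketch below -- the table and columns (dbo.patients with age, bmi, outcome) are invented for illustration, but sp_execute_external_script and the InputDataSet/OutputDataSet conventions are the real mechanism [1]:

    -- Train and score in one round trip; the data never leaves the server.
    EXEC sp_execute_external_script
        @language = N'Python',
        @script = N'
    from sklearn.linear_model import LogisticRegression
    df = InputDataSet                          # rows from @input_data_1
    X, y = df[["age", "bmi"]], df["outcome"]
    df["score"] = LogisticRegression().fit(X, y).predict_proba(X)[:, 1]
    OutputDataSet = df[["age", "bmi", "score"]]
    ',
        @input_data_1 = N'SELECT age, bmi, outcome FROM dbo.patients'
    WITH RESULT SETS ((age INT, bmi FLOAT, score FLOAT));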

[1] https://docs.microsoft.com/en-us/sql/machine-learning/sql-se...


I've had in-database machine learning, and I'll always pick a stupid datastore over a "smart one" that comes with a fixed set of tools.

Because eventually the smart datastore is just an unperformant dumb one.


This sounds like an interesting idea. And databases can be held in RAM (e.g. HANA), and things are moving in the direction of having calculations done directly in RAM (without a processor).


What mechanism enables RAM to do calculations/processing? Or are you just talking in-memory processing?


Special RAM :-)

VC buddy of mine invested in such a company. So basically the RAM can do small things by itself. Should be very interesting for databases and HANA. When he and a major chip producer invested, I even considered asking SAP if they wanted to be on board.


> Our approach enables common machine learning tasks to be performed faster than by extended disk-based database systems or as well as dedicated tools by eliminating the time needed for data extraction

Data extraction time is insignificant compared to actual training except for trivial models. Databases can read hundreds of thousands of items per second while a model can only process dozens to hundreds.


Trivial models can be rather common in some applications. For example, anomaly detection per user or per sensor. The simplest models might basically be computing how many standard deviations* away the new datapoint is from a typical distribution, and then having a well-chosen threshold on this. Since these models are per entity, if one has 100'000 sensors deployed, then one gets 100'000 such trivial models. Inference might run every 1 hour or so, and consume the last 48 hours of data for example. Training might run once per week on the last 4 weeks.
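
A minimal sketch of such a per-sensor model as one plain (PostgreSQL-flavored) query -- the readings(sensor_id, ts, value) table, the windows and the 4-sigma threshold are all assumptions:

    -- Robust per-sensor anomaly check: median/MAD fitted on the last 4 weeks,
    -- applied to the last 48 hours.
    WITH stats AS (
      SELECT sensor_id,
             percentile_cont(0.5) WITHIN GROUP (ORDER BY value) AS med
      FROM readings
      WHERE ts > now() - interval '4 weeks'
      GROUP BY sensor_id
    ), spread AS (
      SELECT r.sensor_id,
             percentile_cont(0.5) WITHIN GROUP (ORDER BY abs(r.value - s.med)) AS mad
      FROM readings r JOIN stats s USING (sensor_id)
      WHERE r.ts > now() - interval '4 weeks'
      GROUP BY r.sensor_id
    )
    SELECT r.sensor_id, r.ts, r.value
    FROM readings r
    JOIN stats  s USING (sensor_id)
    JOIN spread m USING (sensor_id)
    WHERE r.ts > now() - interval '48 hours'
      AND abs(r.value - s.med) > 4 * 1.4826 * m.mad;  -- 1.4826 * MAD ~ one sigma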

This is to me a good candidate for an ML workload to run directly in a database:

- Low compute vs storage ratio for the models.

- High number of models.

- Often only want a small subset of data as input to model (a few numbers typically).

- Relational data highly relevant, for contextual data around the entity.

- Simple models with few parameters to store.

- Frequent updates to models.

- Historical models are interesting, e.g. to check new models against old ones by running them in parallel.

Other examples with similar characteristics would be Timeseries Forecasting on many different time-series. Could be sensor data, stock tickers or whatever.

* Better to use a robust analog such as the Median Absolute Deviation.


Basic Anomaly Detection can be done with no support from the database (example: https://towardsdatascience.com/anomaly-detection-with-sql-77...).

But more effective (though still rather simple) models, like Mahalanobis distance or k-nearest-neighbour distance, are harder to do efficiently.

One rather mature project for doing ML integrated in SQL databases is Apache MADlib. It has been around since 2015. https://madlib.apache.org/
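
E.g. training a logistic regression there looks roughly like this -- the patients example is the one from the MADlib docs, if I remember it right:

    -- Coefficients land in the output table patients_logregr.
    SELECT madlib.logregr_train(
        'patients',                           -- source table
        'patients_logregr',                   -- output (model) table
        'second_attack',                      -- dependent variable
        'ARRAY[1, treatment, trait_anxiety]'  -- features; the 1 is the intercept
    );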


One more thing: easy ingress/egress for DB <---> model, without resorting to networks, encodings, decodings, etc.


I'm sure someone will create a PostgreSQL extension for this... :-)



Just when you think people couldn't come up with a worse abomination than PL/SQL, here they are.


I don't want to start a flame war, but what's wrong with PL/SQL? I'm genuinely curious.


It's a procedural programming language (in a lot of ways similar to Pascal) that tries to pretend to be SQL.

There is no adequate tooling; all you get is the horrible SQL Developer.

That means you won't have a debugger, you won't have unit tests, you won't have profilers.

Debugging it is a pain in the ass. Even setting up an environment to work in is a pain in the ass, because setting up an Oracle server is a pain in the ass.

The ecosystem is non-existent.

If somebody wants to write business logic with it, why not use a normal language like Java or C# instead?

In general, having business logic inside the database is almost universally a bad idea.



