
Could predictive database queries replace machine learning models? - tlarkworthy
https://aito.ai/blog/could-predictive-database-queries-replace-machine-learning-models
======
llarsson
I don't see it as a problem that this will likely just work for, and thus
correctly target, "easy" problems. Nobody looking at this blog post should
think "hmm, sober of this lets me implement an autonomous vehicle in a single
query", but rather "this looks like a great way to explore data without
requiring any data science expertise". Many businesses have data and could
gain insight and improve their business if they just ran some rudimentary
queries to understand customer patterns etc, without needing a data science
PhD to crunch numbers for them.

This seems to provide that type of value.

If I were developing this, I would make sure it can read from Excel sheets
and integrate with popular invoicing services etc., because that is where I
believe the primary customers are in terms of technological maturity.

~~~
sails
Agree entirely. I see this being implemented as a column type on analytics DBs
such as Snowflake or BigQuery (a feature of a bigger product) rather than as a
specific DB designed for ML.

The reason is that, as you pointed out, this would be most useful for "easy"
problems. These problems live with analysts who use analytics-oriented DBs as
part of their business intelligence workflow.

~~~
arauhala
I can see why having predictive queries in solutions like Snowflake or
Postgres would be extremely tempting.

Still, the problem with integrating Aito-like functionality into existing
databases is that it requires a lot of specialized data structures to work
fast enough. While getting it to work in existing DBs is plausible, at minimum
it would require a completely new storage engine, or at least a wide-ranging
refactoring of the old ones.

Regarding the analyst workflows: could you tell us more about the main use
cases? We haven't had analyst / BI customers (yet), though they seem like a
plausible audience for an Aito-like solution.

~~~
willvarfar
You work on this?

Writing storage engines for MySQL or even Postgres isn’t that hard.

~~~
arauhala
Would you be our first customer if we do it? ;-)

The fact of the matter is that, while it is a tempting idea, it's far from
easy. The interfaces may not be that hard, but the storage itself will have
its challenges, and building fast ad-hoc inference & representation learning
layers on top of it is a huge project.

After working on Aito's DB and ML parts for several years, I can promise: it's
more work & harder than it looks :-)

~~~
willvarfar
Yeah, I trust you; I use and contribute to RDBMSes and am a bit familiar with
their innards.

The key thing you want to enable is, e.g., Tableau. Building classifications
and predictions into something business people rather than devs use would be a
promising strategy.

Recently I’ve been using Presto to make various things appear to be
conventional DB tables, and getting computed data into Tableau that way.

------
crazygringo
The idea of doing predictive ML inside of, say, MySQL sounds tremendously
appealing. E.g., you could add something like a "predictive index" that models
one column based on certain other columns, with a variety of "prediction
types" (just like choosing among index types such as B-tree, hash, etc.).

Although instead of a SELECT statement on existing rows, you'd need to use
something like a PREDICT statement on non-existent rows:

    PREDICT product_category
    FROM invoice_data
    WHERE item_description = "Packaging design"
      AND vendor_code = "VENDOR-1676"

The big question for me, however, would be: when would the predictive model be
updated? It seems too computationally expensive to update it on every UPDATE,
INSERT and DELETE. I mean, if it's of any decent complexity, it would require
a full table read every time it was updated, no? Would you have to manually
issue a command to update it? How long would that take?

But it absolutely seems like a _very_ natural place to add predictive
capabilities -- directly in the database.

~~~
zitterbewegung
The power of SQL is that you don't even have to decide "when would the
predictive model be updated?" This could be configurable and depend on the
data found in the tables.

Also, another solution that is sort of the opposite is Datasette.
[https://datasette.readthedocs.io/en/stable/](https://datasette.readthedocs.io/en/stable/)

~~~
jsemrau
That sounds more like fuzzy inference rather than prediction.

------
simonhughes22
Short answer - no. Where is the explanation of how the predictive queries
work? Is it some sort of Bayesian model? It's not too hard to quickly fit a
naive Bayes or regression model on some dataset on the fly, given the
simplicity of those models. However, just throwing random features at it
without consideration of bias vs. variance, i.e. whether the model is over-
fitting or is not powerful enough to answer the question, can easily result in
a useless model. To make this useful you would need to build in all of the
functionality data scientists use to build regular models. In doing so you
would lose all of the speed and flexibility of the tool you are pushing. Also,
given the prevalence of deep learning models for unstructured data, and also
for search and recommendations, such an approach would not work, since it
relies on structured data. A lot of modern data science work focuses on those
kinds of problems, as learning from structured data is mostly quick and easy
with today's ML tools. I don't see how this framework would solve these more
complex and more typical business problems.

~~~
arauhala
I would say that the answer is partially yes and partially no.

We have done several projects with simple machine learning problems, where
e.g. semi-technical RPA developers have been able to implement ML-based
automation just fine. We have gotten compliments that Aito is easy to use, and
one intelligent automation demo was implemented in 5 hours, with some
integrations, by 2 RPA developers. It's worth noting that there is an absolute
abundance of ML problems (especially in domains like automation or UIs) that
are simple to understand and easy as ML problems.

At the same time, we have run into many ML problems which require a data
scientist even to formulate the problem and think about it. There are also
problems where Aito's Bayesian approach is inadequate and you need a data
scientist to do a good amount of engineering to make it possible to model the
patterns and then find the right model.

So TBH: I don't think predictive queries can fully replace traditional models
or data science work, but there are large application domains that can be
handled just fine with predictive queries, even by normal developers.

Regarding text: Aito can already handle simple texts just fine, and with
representation-learning-based 'world modeling' approaches I believe that we
can also do more complex analysis on text.

Overall, Aito does not seek to provide the best models or solve the hardest
problems; its value proposition is speed and ease of use. We focus on the
investment instead of the return in the return-on-investment equation. This
gives an advantage in the lower-value 'tail' of the ML market, where the
importance of costs is higher and the traditional data science approach is
economically not that attractive.

------
plain4
I didn't finish reading the article because it didn't give a succinct summary
of what predictive databases are. But at a glance this seems to be a SQL
interface to an AutoML system. Is that a correct summary? I don't get the
distinction between ML and predictive databases. It seems predictive databases
use ML.

~~~
fouc
Seems like "predictive database queries" is more about the queries and less
about the database, and there's nothing relational (or RDBMS/SQL) about it.

~~~
plain4
The title implies that it's going to replace ML models, but it seems that it
still uses ML models and just provides a different interface. It also seems to
be using some AutoML training system, so that in theory little ML expertise is
required to use the system.

~~~
FridgeSeal
> in theory little ML expertise is required to use the system

Maybe I'm just a cynical data scientist, but this is how we get people using
and interpreting models whose complexities they don't necessarily understand.
If some data violates an underlying assumption or has some complexities around
representation and meaning, then there's nothing really stopping someone from
getting a model that appears to fit correctly but gives answers that are
meaningless or just wrong.

------
chroem-
Glad to see my technical debt has gone full circle, and become the bleeding
edge (joking-not-joking). Hierarchical linear models implemented as
aggregation queries are surprisingly powerful and easy to scale. We use them
in production to do time series forecasting, among other things.
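
A minimal sketch of the pattern (invented schema; the shrinkage pseudo-count
of 20 is hand-picked, where a real hierarchical model would estimate it): fit
a linear trend per group with aggregate functions, then shrink small-sample
groups toward the global fit.

    -- Per-store linear trends via aggregation, shrunk toward the global
    -- fit for stores with little data (hypothetical schema).
    WITH global_fit AS (
      SELECT REGR_SLOPE(sales, day_number)     AS slope,
             REGR_INTERCEPT(sales, day_number) AS intercept
      FROM daily_sales
    ),
    store_fit AS (
      SELECT store_id, COUNT(*) AS n,
             REGR_SLOPE(sales, day_number)     AS slope,
             REGR_INTERCEPT(sales, day_number) AS intercept
      FROM daily_sales
      GROUP BY store_id
    )
    SELECT s.store_id,
           -- 20 acts as a pseudo-count controlling the shrinkage strength
           (s.n * s.slope + 20 * g.slope) / (s.n + 20)         AS slope,
           (s.n * s.intercept + 20 * g.intercept) / (s.n + 20) AS intercept
    FROM store_fit s CROSS JOIN global_fit g;

Forecasting a store's sales at a future day then amounts to evaluating
intercept + slope * day_number per row of the result.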

~~~
samplatt
>Glad to see my technical debt has gone full circle, and become the bleeding
edge

This exact cycle is why I'm starting to feel serious burnout in this industry.
Even when I look at other related branches of data analysis, everything just
feels like a gigantic cesspit of corporate ignorance feeding on itself, fueled
by new buzzwords.

I wonder if anyone's done studies on the cycle time of technology adoption in
corporate life; I swear it's getting faster.

------
arauhala
The author here. If you have any questions about the article, I'm happy to
help. :-)

I do believe that query-based ML will replace trained-model-based ML in the
long run. I believe this not because the results would be better, but because
it offers higher productivity and greater simplicity.

What are your thoughts? Does query-based ML make sense?

~~~
vaidhy
I think there is a huge underlying assumption that, given some data, building
a model for it is trivial and can be done on the fly. I have seen the same
kind of approach from people who have built toy models in 10 lines of PyTorch
and seem to equate fizzbuzz code with production code.

If you can clearly articulate how you do feature engineering, model debugging,
meeting latency requirements, handling constant updates, dealing with non-
numerical data, and all the other issues that real-world ML faces inside a
query engine automatically, we can sit together and have a meaningful chat.

~~~
arauhala
I do understand your point. There are definitely tons of hard data science
problems which are simply not suitable for the predictive query kind of
approach.

At the same time, there are tons of ML problems, e.g. in process automation or
user interaction, which have extremely strong patterns and are very easy to
treat with a sophisticated enough ML model.

Regarding your list of items: feature engineering is largely managed by the
user selecting relevant facts in the query, by analyzers, by MDL-based feature
learning, and by information-theory-based feature selection. I feel this
approach is pretty robust for many problems, although not complete. There are
special queries like $on for making conditional variables of the form A|B, and
$numeric for dealing with numeric data, that can be used manually.

Model debugging can be partly done with $why explanations, which are easy to
create with the Bayesian approach. I feel that model debugging has been good
enough.

Latency requirements and constant updates are more about software/database
engineering and they are solvable, but right now we recommend batch updates
and applications that can deal with sparsely occurring multi-second latency.
And of course, if you have limited data sets (less than 100k), there shouldn't
be such problems.

I feel that all the problems you listed are solvable, but they are of course
hard problems, and fully solving them for a larger set of applications is
still on our roadmap. For many applications (like RPA, internal tools, and
analytics) these are not real issues, while the benefits (ease, speed) are
extremely concrete and relevant.

------
jamez1
I'd like to learn a bit more about your architecture/process and why it
creates value over the standard ML toolkit. It makes sense philosophically to
increase capability in the database to handle uncertainty and so forth.
Databases were built for transactions, not analytics, and a rethink would
likely be fruitful.

Also, do you have any funding?

~~~
arauhala
The reason why predictive queries create value relates to the simplified
workflow and the simplified architecture.

Instead of defining a model, training it, and then using it, you merely ask
for an arbitrary unknown variable based on any arbitrary facts. This provides
a much easier interface, a much faster iteration cycle, and other technical
benefits like the ability to create generic query templates. These benefits
stand even when compared to AutoML platforms (which also do a lot of heavy
lifting to simplify the workflow).
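
To make the contrast concrete, here is a rough sketch. The first snippet
follows the BigQuery ML style of define/train/predict (dataset and model names
invented, syntax approximate); the second reuses the hypothetical PREDICT form
from upthread.

    -- Train-then-predict workflow (BigQuery ML-style, approximate):
    CREATE MODEL shop.category_model
      OPTIONS (model_type = 'logistic_reg',
               input_label_cols = ['product_category']) AS
    SELECT item_description, vendor_code, product_category
    FROM shop.invoice_data;

    SELECT *
    FROM ML.PREDICT(MODEL shop.category_model,
      (SELECT 'Packaging design' AS item_description,
              'VENDOR-1676' AS vendor_code));

    -- Predictive-query workflow: one ad hoc statement, no named model:
    PREDICT product_category
    FROM invoice_data
    WHERE item_description = "Packaging design"
      AND vendor_code = "VENDOR-1676"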

Regarding the architecture and process: the system bears a lot of resemblance
to normal databases (and especially to Lucene-like search engines), but in
order to serve arbitrary predictive queries, the entire database is
specialized in-and-out for counting statistics and doing millisecond-timeframe
ML modeling. These things are somewhat described in the article, but I'm also
happy to answer additional questions about the system.

As an interesting detail, the underlying database is very functional-
programming oriented and built on a Git-like system. We'd like to expose the
database's snapshot and branching abilities in the future.

~~~
jamez1
So effectively, you've added a set of ML/statistics scripts to the query
engine? But the query engine is otherwise still relational based?

~~~
arauhala
No scripts. The change is much deeper, because Aito uses ad hoc / lazy models
to provide the predictive query capabilities. If you were to thinly integrate
some 3rd-party ML library, you would end up with separate 1) model definition
and 2) training steps in addition to 3) the prediction. Aito's database is
specialized for counting statistics, so that it can create ML models on a
millisecond scale to answer fairly arbitrary prediction queries instantly.
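
To illustrate the counting idea with a toy sketch (plain SQL over the invoice
example from this thread; this is not Aito's actual implementation): once
per-class and per-(class, condition) counts are available, a Bayes-rule
prediction is just a few lookups and some arithmetic, which is what makes
millisecond-scale ad hoc models plausible.

    -- Toy Bayes-rule scoring from count statistics alone, with add-one
    -- smoothing; with pre-indexed counts this is a handful of lookups.
    WITH class_counts AS (
      SELECT product_category, COUNT(*) AS n
      FROM invoice_data
      GROUP BY product_category
    ),
    match_counts AS (
      SELECT product_category, COUNT(*) AS n_match
      FROM invoice_data
      WHERE vendor_code = 'VENDOR-1676'
      GROUP BY product_category
    )
    SELECT c.product_category,
           (c.n * 1.0 / (SELECT SUM(n) FROM class_counts))  -- P(class)
           * (COALESCE(m.n_match, 0) + 1.0) / (c.n + 2.0)   -- P(evidence | class), smoothed
           AS score
    FROM class_counts c
    LEFT JOIN match_counts m USING (product_category)
    ORDER BY score DESC;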

There is a quite normal query engine working inside Aito, but the basic
database query capabilities haven't been our focus so far. We have an SQL API
on our roadmap, but it will likely take time before we can even start working
on it.

------
softwaredoug
It sounds a bit like search and search relevance, which is a bit like trying
to guess what's wanted in a ranked or probabilistic ordering.

It's often used not just for finding blog articles or products, but also in
many other ranking situations we don't think about. One is "fuzzy joins"
across databases: match this person in this database with another person in
another database.
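
A minimal sketch of such a fuzzy join (PostgreSQL with the pg_trgm extension;
table and column names invented): join two person tables on trigram name
similarity above a threshold.

    -- Fuzzy join on name similarity using PostgreSQL's pg_trgm.
    CREATE EXTENSION IF NOT EXISTS pg_trgm;

    SELECT a.person_id AS crm_id,
           b.person_id AS billing_id,
           similarity(a.full_name, b.full_name) AS score
    FROM crm_people a
    JOIN billing_people b
      ON similarity(a.full_name, b.full_name) > 0.6  -- threshold is a tuning choice
    ORDER BY score DESC;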

~~~
arauhala
This is a good point, and Aito's inference engine has a lot of similarities
with search engines. As an interesting detail, we can provide TF-IDF-scored
full-text search functionality from the same indexes we use for inference.

Still, while there are tons of similarities, I feel that inference engines are
fundamentally different from search engines. The data structures are
different, and I can see them diverging even more in the future. The
algorithms and modes of operation are very different, even if there is some
overlap.

From the user's point of view, there is still a striking similarity between
Aito and Elasticsearch. Both currently act as auxiliary databases (although we
would like to make Aito fully ACID with an SQL interface in the future) and
provide search engine / inference engine-like functionality rather than full
database functionality.

------
aSplash0fDerp
The DB implementation of PA is the ultimate turn-key avenue if it's
indistinguishable in the market from ML.

They could just call it artificial light, pre-pivot the marketing, and open up
a new field in the minds of customers if it's just a friendly game of spin-
the-data among nerds.

------
tabtab
Kind of reminds me of Factor Tables:
[https://github.com/RowColz/AI](https://github.com/RowColz/AI)

~~~
jamez1
This looks like a fairly low-powered Monte Carlo system. You just store
samples, and the inference is sampling from that sample set? That's just
bootstrapping, which has been explored far more extensively by random forests
and so forth.

You've shoehorned logic that usually belongs in the linear algebra world into
a database table form. There's some initiative there, but this has also been
heavily explored in the academic field under the topic of probabilistic
databases. BayesDB is a full implementation of what you've just described,
with a much deeper inference engine utilizing joint distributions rather than
just distributions that exactly match the sample.

~~~
tabtab
Re: "You just store samples"

Not necessarily. One can "summarize" the samples, as shown, to get
approximations using much less data. And various sub-sets can be switched on
and off as needed (or their weights turned down).

Re: "BayesDB is a full implementation of what you've just described"

Perhaps, but using tools similar to what office workers currently use, staff
without PhDs can study and adjust results based on direct observation and
specialty sub-division. It's more about an approachable tool-set and division
of labor than technical accuracy. It's about "de-esoteric-izing" AI so that
more people can assist in its tuning.

~~~
jamez1
BayesDB provides a toolset; what you've proposed is something where you need
knowledge of the underlying process. You've shown some initiative here, but
I'd really recommend studying what's out there and doing a true comparison
with your solution to see the shortfalls.

