Stochastic gradient descent written in SQL (maxhalford.github.io)
295 points by Lemaxoxo on March 7, 2023 | 182 comments



>A machine learning algorithm which can be trained using SQL opens a world of possibilities. The model and the data live in the same space. This is as simple as it gets in terms of architecture. Basically, you only need a database which runs SQL.

First paragraph of the conclusion, and this very much fits with the mindset that's been growing in me in the data world over the past few years. Databases are much more powerful than we think; they're not going to go away, only get better, and having the data and the logic in the same space removes tons of headaches. ML models, data transformations, and generating JSON for an API can all be done within the database rather than in an outside scripting language.

Are others seeing this? Are the current tradeoffs just that more people know python vs sql or database specific languages to where moving logic to postgres or snowflake is looked down on?
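The article trains the model with SGD inside the database. As a toy illustration of the same idea (my own sketch, not the article's code, and full-batch gradient descent rather than stochastic), here is a one-parameter linear regression trained by a recursive CTE in SQLite, driven from Python. The table and column names are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE points(x REAL, y REAL);
INSERT INTO points VALUES (1, 2), (2, 4), (3, 6), (4, 8);  -- y = 2x
""")

# Each recursion step applies one gradient update to the weight w,
# minimizing mean((w*x - y)^2). The gradient is 2*(w*avg(x*x) - avg(x*y)),
# so the needed aggregates are computed once in a helper CTE.
(w,) = conn.execute("""
WITH RECURSIVE
stats(sxx, sxy) AS (
    SELECT AVG(x * x), AVG(x * y) FROM points
),
gd(step, w) AS (
    SELECT 0, 0.0
    UNION ALL
    SELECT step + 1, w - 0.05 * (2 * (w * sxx - sxy))
    FROM gd, stats
    WHERE step < 100
)
SELECT w FROM gd ORDER BY step DESC LIMIT 1
""").fetchone()
print(w)  # converges to 2.0
```

The model (a single weight) and the data live in the same query; nothing leaves the database until the final SELECT.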


The challenge I, with limited knowledge, see with developing detailed algorithms in SQL is a lack of good testing, abstraction, and review tooling. Similarly for a lot of the user-defined functions for the larger data warehouses (Redshift, BigQuery, etc.).

dbt solves a lot of it, but I'd love to learn more about good resources for building reliable and readily-readable SQL algorithms for complex logic.


I've supported 3 different models over the years with inference implemented in SQL. First one I inherited, loved it so much that I implemented it twice again. Amazingly fast for TBs of data and no waiting on other tech teams.

That tooling you're describing is definitely not there. Bigquery has BQML but it's very much in its infancy. I tend to do all the modeling in Python/R on sampled data and then deploy to SQL.


BQML should become standard.


I mean, this is an opportunity, right? Build those things.


I agree. Databases are going to be here for a long time, and we're barely scratching the surface of making people productive with them. dbt is just the beginning.


The problem with throwing everything in a database is you end up with brittle stored procedures all over the place, which are painful to debug. There is no good support for version control or testing, which means you end up creating a dozen copies of each function (sp_v1, sp_v2, ..., etc.). It's much harder to practice the iterative development that the rest of software engineering implements effectively.

Also traditional relational databases have a way to go before they can support parallelized machine learning workloads. You do not have control or the ability to spin up threads or processes to boost your performance. You rely on the query processor to make those decisions, and depending on your platform the results will be mixed.


No comment re RDBs supporting parallelized ML, but re stored procedures: if your workflow evolves to treat them as ‘1st class’ code assets, they'll be just the same as the rest of your code.

We always had them in version control, unit tested etc. The tools are there if you want to use them.


I'll take this as a learning opportunity. I have looked around to find a reliable framework to implement within our team and failed to find anything usable.

How do you guys manage to implement versioning and testing? If you had a new stored procedure to deploy, where would you deploy it? How do you integrate with existing applications which rely on it?


Depends on what you're using. The Java world has a bunch of JDBC wrappers for testing SPs. SQL Server's GUI has unit testing tools that can also be run from the command line. @dventimi has a good Postgres recommendation.

Version control - each SP is in a file sp_descriptive_name.sql that's on the source tree under /db.

Deployment - compiled from the command line. Deployment target managed by env variables. Systems can require odd workarounds for ensuring no downtime - this is where systems end up with sp_descriptive_name_vN with N for each version (still stored in sp_descriptive_name.sql tho) so the new SP can be loaded alongside the old version, and then application code can be updated (or config updated) to use the new SP name. Good practice to remove the old version on next release.

I think it's true that SPs introduce a compromise, but having logic operate on data without having to send data back and forth over DB connection is a pretty big win.

It can also be a good abstraction because it avoids application code needing to know (for example) which field in which table indicates a customer is active, and code can execute something like sp_deactivate_customer(cust_id) and then let the stored procedure take care of updating flags, dates etc that describe a customer's active status. ymmv depending on whether your team owns the data and database, or if that's managed separately.


For testing if I'm in PostgreSQL I use pgTAP.

For version control I use git just as I do for other program units.


Updating stored procedures was a pain in the ass last time I checked. Checking for changes to them on deployed servers was also very painful.


> The problem with throwing everything in a database is you end up with brittle stored procedures all over the place, which are painful to debug. There is no good support for version control or testing...

Shell scripts with the SQL as here-docs, instead of stored procedures. Ordinary text files, eminently gittable. Also lets you insert varying table/column names as environment variables, instead of the regex rigmarole in TFA. (Or was that in another post on his blog? Just returned from reading several of them.) Best of both worlds.


> Databases are much more powerful than we think

The older I get the more I agree with this.

There is nothing you cannot build by combining SQL primitives. Side effects can even be introduced - on purpose - by way of UDFs that talk to the outside world.

I've seen more than one system where the database itself was directly responsible for things like rendering final HTML for use by the end clients. You might think this is horrible, and indeed many manifestations are. But there lurks an opportunity for incredible elegance down this path, assuming you have the patience to engage in it.


> I've seen more than one system where the database itself was directly responsible for things like rendering final HTML for use by the end clients.

I did this for a side project a few months ago and even used postgrest to serve the page with correct headers for html. It felt simultaneously really cursed and obvious. Shit you could even use plv8 to run mustache or whatever in the db if you really wanted to piss people off.


I'm doing this right now with a DO droplet built with PostgreSQL, postgrest, nginx, and not much else. Do you have any tips, tricks, or blog posts you can share based on your experience? You should post it to HN. Strike while the iron's hot. With a little luck you'll hit the front page.


Nah nothing that refined. Pretty predictably supabase is doing some weird stuff along these lines and I found some abandoned and semi-active repos associated with them and people working for them that were useful examples of some things.

As for posting to HN absolutely no thanks. These people are so fucking hostile there is no accomplishment too small to tear apart for entertainment here. I have no interest in it.


disclosure: supabase ceo

plv8 doesn't have network access, but this can be done with PostgREST.

That said, we actually restricted the HTML content-type on our platform so that users cannot do this. This is because we don't want to get too deep into the "frontend" world, with ever-changing frameworks.

We've found that if we offer a small feature, our users demand feature-completeness. In this case we're better off waiting until we have solved a few of the major database tasks (branching + migrations, scale-to-zero, edge caching, etc.).

FWIW, I think it's very cool serving HTML through PostgREST. I would be pretty happy to do this personally


disclosure: Supabase user

Long time user, first time caller. Besides the DO droplet, separately I also use Supabase and I love it.

I understand completely about restricting the content type, maybe just for now. Fun fact: I work around it by serving up text/xml plus XSLT. I'm sure you don't mind.


> I work around it by serving up text/xml plus XSLT. I'm sure you don't mind.

Not at all, I hear "application/xhtml+xml" is a nice hack from some of the team


> As for posting to HN absolutely no thanks. These people are so fucking hostile there is no accomplishment too small to tear apart for entertainment here. I have no interest in it.

You make a really good point.


> Side effects can even be introduced - on purpose - by way of UDFs that talk to the outside world.

> I've seen more than one system where the database itself was directly responsible for things like rendering final HTML for use by the end clients.

I built BI web pages entirely in PL/SQL (using some Oracle modules I can't quite recall the names of) over twenty years ago. Got a bit hairy to keep the meta-recursive stuff in your head when hard-coding JavaScript into it, calling the next such page for each value on the page, for drill-down to the next level... Not fun, but quite doable.

And with a bit more modern tooling -- as I said, this was turn-of-the-century tech -- I have no doubt it could be done much more effectively and efficiently nowadays.


This PL/SQL Overlord is running banks https://www.avaloq.com/solutions/products/avaloq-core


+1, SQL is extremely elegant and composable, and it's underrated.

Postgres is very powerful. I took a short detour into NoSQL/MongoDB land, but I'm now back in MySQL/PostgreSQL territory and glad for it.

Being able to generate views and stored procedures is useful as well. Having SQL take over more, like ML and gradient descent, does open up good possibilities.

Also, since SQL is declarative, it's often easier to work with than imperative scripting languages.


SQL has some positives but it is not composable. At all. This is because relations are not first-class values in SQL.


CTEs go a long way to making SQL more composable.


Is a query not a relation?


Basically, but queries are not first class in SQL. You can't assign a query to a variable, or pass it as a parameter to a stored procedure, for example. This would make SQL composable:

    declare @x = (select * from Person)
    select Name from @x where Birthdate < '2000-01-01'


You can do composition through CTE or Table-Valued Function.


CTE and TVF still treat tables as second class citizens, so while they enable some forms of composition, they're still very restricted. This has been the consistent story with SQL, 15 ways to use queries and tables, all to work around the limitation that they are not first class values.


  with persons as (select * from Person)
  select Name from persons
  where Birthdate < '2000-01-01'


Where is the assignment to a variable? Where can you construct a query using a variable in table/query position? That's the whole point of being first class and composable, a query becomes like any other value so you should be able to parameterize any query by another query assigned to a variable that may have been set inside an if-statement, or accepted as a parameter to a stored procedure. You know, the same kinds of composition we see in ordinary programming languages.


  create table x as (select * from person);
  select name from x where ...;
there you go, just configure your editor to display "create table x" as "declare x = " ;)

or even a version with lazy evaluation:

  create view x as (select * from person);
  select name from x where ...;


You're still not getting it. First-class status means that anywhere a value or variable can be used, a query or table should also be able to appear, and vice versa. This means a table or query can appear as a return type, a parameter to a stored procedure or query, a variable, and so on.

SQL just does not have this, it instead has 15 different second class ways to handle tables and queries that try to make up for the fact that they are not first-class values. These include CTEs, table valued functions, views, etc.


Usually I balk at the idea of someone describing a language feature as “first class” because it seems too wishy-washy. But in this thread you've shown me that maybe the best definition is through “proof by negation,” by patiently responding to arguments and demonstrating why a certain usage and the ensuing restriction around it means it is not first class. Bravo!


I agree the term is often abused, but I think the wikipedia page actually does a decent job of making the notion somewhat precise, along the lines I've been arguing here:

https://en.wikipedia.org/wiki/First-class_citizen

If you want to see what queries as first-class values looks like, LINQ in .NET is pretty close. I can actually write a series of queries that build on and compose with each other, like this:

    IQueryable<Person> RunQuery(int userSelection)
    {
        var first = from x in People
                    select x;
        var second = userSelection == 1
            ? (from x in first where x.Birthday > new DateTime(2000, 1, 1) select x)
            : (from x in first where x.Name.Contains("Jane") select x);
        return DumbJoin(first, second);
    }

    IQueryable<Person> DumbJoin(IQueryable<Person> first, IQueryable<Person> second)
    {
        return from x in second
               join y in first on x.Role equals y.Role
               select x;
    }
This query is nonsense, but it just shows you what composition really looks like when queries are first-class values. I wish raw SQL were like this!


> You're still not getting it. First-class status means that anywhere a value or variable can be used, a query or table should also be able to appear, and vice versa. This means a table or query can appear as a return type, a parameter to a stored procedure or query, a variable, and so on.

I doubt you could implement a query planner that would cope with that degree of flexibility. Which means you’d be forced to deal with the mechanics of the query, pushing you away from declarative SQL and into procedural and functional programming. At which point you might as well ditch SQL anyway.


Without these features, people have to resort to dynamically generated SQL queries in procedural or functional languages, which is much worse! SQL has also become significantly more complicated by adding all sorts of second-class features to get around this composability limitation (CTEs, table valued functions, views, etc.).

Besides, I don't think it would be as bad as you say. You can approach it as a simple template expansion into flat SQL queries except where a data dependency occurs, at which point template expansion proceeds in stages, one for each dependency.

LINQ on .NET provides most of the composability I'm talking about, although it has a few limitations as well. Still worlds better than raw SQL.
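In practice the dynamic generation mentioned above can be as crude as string formatting in the host language, where the query really is a first-class value. A sketch with an invented table, ignoring the quoting and injection care real code would need:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person(name TEXT, birthdate TEXT);
INSERT INTO person VALUES ('Alice', '1995-05-01'), ('Bob', '2003-02-11');
""")

# The inner query is a first-class value in Python (a plain string),
# even though it is not one in SQL itself.
base = "SELECT * FROM person"
young = f"SELECT name FROM ({base}) WHERE birthdate >= '2000-01-01'"
rows = conn.execute(young).fetchall()
print(rows)  # [('Bob',)]
```

Assigning `base` to a variable, passing it around, and splicing it into a larger query is exactly the composition SQL itself won't let you express.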


In PostgreSQL at least, a table can appear as a return type of a function and as a parameter to a function. That's not nothing.


What if I wrote a very long, complicated query that I'd like to test against different tables (like test tables), and let's say I can't use stored functions or procedures. How could I pass different tables to my query?


  CREATE TABLE data_a AS (SELECT 'a' AS test_case, 1 AS value);
  CREATE TABLE data_b AS (SELECT 'b' AS test_case, 2 AS value);
  CREATE VIEW data AS (SELECT * FROM data_a UNION ALL SELECT * FROM data_b);

  CREATE VIEW complicated_query AS (SELECT test_case, value+1 FROM data);

  SELECT * FROM complicated_query WHERE test_case = 'a';
  SELECT * FROM complicated_query WHERE test_case = 'b';


Nice, that is what I was looking for. Of course, it'd need to point to production data as well, so maybe test_case is null, in that case:

  CREATE TABLE data_a AS (SELECT 'a' AS test_case, 1 AS value);
  CREATE TABLE data_b AS (SELECT 'b' AS test_case, 2 AS value);
  CREATE TABLE data_prod AS (SELECT NULL AS test_case, prod_table.value FROM prod_table);

  CREATE VIEW data AS (SELECT * FROM data_a UNION ALL SELECT * FROM data_b UNION ALL SELECT * FROM data_prod);

  CREATE VIEW complicated_query AS (SELECT test_case, value+1 FROM data);

  -- when testing
  SELECT * FROM complicated_query WHERE test_case = 'a';
  SELECT * FROM complicated_query WHERE test_case = 'b';

  -- when in 'production'
  SELECT * FROM complicated_query WHERE test_case IS NULL;


You just reinvented defunctionalization, which is a transformation from a domain that has first-class values to a domain where support is only second-class. Defunctionalization is typically used in programming languages to simulate first-class functions in languages where functions are only second-class citizens, like C and Pascal.

This perfectly illustrates my point. You had to manually defunctionalize your data model and queries to support what I'm saying should be inherently part of SQL.


> languages where functions are only second-class citizens, like C and Pascal.

1) Only if you define Pascal as only Wirth's very first version. That changed almost immediately.

2) Only if you refuse to equate “pointer to function” with “function”. Which in C, where “everything is a pointer” (a bit like in Unix/Linux “everything is a file”), seems rather silly.


Check out dbt - it's a great tool for organizing queries and solving such patterns


If you can't use stored procedures which are good for this very case, many databases offer dynamic SQL. That might work in some cases.


That's a good point. This would rule out SQLite for me.


Dynamic SQL isn’t SQL, and it’s not relational. It’s no different from using a language like Python to generate SQL queries.


It's a little different. Anyway, this is under the constraint "no stored procedures."


TVF (Table-Valued Function) with CROSS APPLY.


Exactly. Since it's declarative, the style lends itself to using stored procedure calls to become composable.


You cannot abstract over stored procedures either, so that's still not composable.


No idea what this means


Think of first-class functions. Can't do that with stored procedures, just like you can't do that with queries or tables in SQL.


Furthermore, stored procedures/functions are not queries.


Stored procedures can be relations. Queries are relations. Ergo, stored procedures can be queries.


Isn't that the point of common table expressions (CTEs)?



Postgres has read/write parallelism that can scale across millions of reads and writes. If the ML model lives inside the Postgres DB it is indeed very elegant. It reminds me of the glory days of tsearch2: doing text search in Postgres, searching our internal document repository using nothing but Postgres, and boy was it faster than Java-based search systems.


I think general programming languages are better for general programs than SQL.

Specifically they have: Type systems, compilers, debuggers, text editors, package managers, C FFI etc.

But I agree that having the data and the program in the same process has benefits.

Writing programs in SQL is one way.

Another way is to move your data to your general program with SQLite.

I like using SQL for ACID, and queries as a first filter to get the data into my program where I may do further processing with the general language.
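That division of labor can be sketched in a few lines, with invented data: SQL does the coarse filtering, and the host language does the further processing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE readings(sensor TEXT, value REAL);
INSERT INTO readings VALUES ('a', 1.0), ('a', 3.0), ('b', 100.0);
""")

# SQL does the coarse, index-friendly filtering...
rows = conn.execute("SELECT value FROM readings WHERE sensor = 'a'").fetchall()

# ...and the general-purpose language does the further processing.
mean = sum(v for (v,) in rows) / len(rows)
print(mean)  # 2.0
```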


Another is MS SQL Server, which lets you run .NET on the database server :D "you can author stored procedures, triggers, user-defined functions, user-defined types, and user-defined aggregates in managed code"


I have had nothing but bad experiences trying to run .NET in SSIS packages -- is there another way?


I've never had the pleasure(ha) of using SSIS, but this is the stuff that I was talking about: https://learn.microsoft.com/en-us/dotnet/framework/data/adon...


> Type systems

SQL has types

> compilers

For what specifically do you need a compiler?

> debuggers

Some tasks - like the concurrency SQL enables - are just very difficult to debug with debuggers. It would be the same with any other language. What SQL does here though is to allow you to focus on the logic, not the actual concurrency.

> text editors, package managers

I feel like these two are just for filling up the space.

> C FFI

Many SQL engines have UDFs


> Type systems

Sure SQL has types, but they are checked at runtime, not compile time. Also you cannot define function input and return arguments with types that are checked before you run the program.

> compilers

If you want efficient and/or portable code. They will check your code for type errors before you run them. They give you coding assistance in your editor.

> debuggers

Being able to break a program and see its state and function stack is useful. The quality of the tools for real languages is much better than for SQL.

I agree that databases do concurrency better than most languages with their transactions (I mentioned I would use the db for ACID).

> text editors, package managers.

Editor support of real languages is much better than SQL.

Package managers enable code re-use.

> C FFI

Take for example Python. A huge amount of the value comes from re-using C libraries that are wrapped with a nice Python API.

You might be able to do this in SQL, but you'll have to wrap the C library yourself as there is no package manager, and no one else is doing the same thing.


> > text editors, package managers.

> Editor support of real languages is much better than SQL.

So text editors suck at supporting SQL... How is that SQL’s fault?!? Go complain to the text editor authors.

> Package managers enable code re-use.

Yup. Build one for SQL, then you can re-use SQL code. Just like someone had to build one for every other language. What does this prove about SQL being inferior to other languages? A: Nothing at all.


> What does this prove about SQL being inferior to other languages

My point is that I think it is inferior for general application/business logic programs. For queries that filter large datasets to small result sets, SQL is probably better as it has the built in indexes and query planner (plus ACID).

I am pointing out that the current environment (better text editors and package managers) favours general languages, so they are a better current choice (to use in combination with SQL) over just writing everything, including application logic, in SQL.


> I think general programming languages are better for general programs than SQL. Specifically they have: Type systems, compilers, debuggers, text editors, package managers, C FFI etc.

Non sequitur. SQL is typed; SQL can be edited in any text editor; there are lots of SQL IDEs and, arguably, debuggers and package managers. Sure, the package managers are specific to each RDBMS, but so what? Npm is no use in COBOL either. And sure, the “debuggers”, to the extent they can be said to exist, are radically different from those of “conventional” – of other – programming languages. But again, so what? A Smalltalk debugger is no use to fix the output from Intel’s latest C compiler either, or vice versa.

IOW: There is no such “SQL vs programming languages” dichotomy. SQL is just another programming language, with its own strengths and weaknesses, just like all the rest of them. “The rest” are not distinguished from SQL by somehow magically all having the attributes you claim for them: Some have them, some don't; some have this but not that, some others, the other way around. Someone built all those IDEs and debuggers and package managers for (some / many / most of) those other languages; you can build them for SQL too.


Another option is to use an approach like Prevayler: https://prevayler.org/

The basic notion is you keep your data hot in RAM and manage it directly. You make every change an object (or a command), and write that out serially to a log before you execute it. That gets you the ACID guarantees but with no I/O but linear writes, so it can be extremely fast.

It only makes sense when your data fits conveniently in RAM, but that's a lot of things.
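A minimal sketch of the pattern (not Prevayler's actual API, just the idea of a write-ahead command log plus in-RAM state):

```python
import json
import os
import tempfile

class Store:
    def __init__(self, log_path):
        self.state = {}
        self.log_path = log_path
        # Replay the command log on startup to rebuild the in-RAM state.
        if os.path.exists(log_path):
            with open(log_path) as f:
                for line in f:
                    self._apply(json.loads(line))

    def execute(self, command):
        # Write-ahead: the command is durable before it mutates state.
        with open(self.log_path, "a") as f:
            f.write(json.dumps(command) + "\n")
            f.flush()
            os.fsync(f.fileno())
        self._apply(command)

    def _apply(self, command):
        if command["op"] == "set":
            self.state[command["key"]] = command["value"]

path = os.path.join(tempfile.mkdtemp(), "commands.log")
store = Store(path)
store.execute({"op": "set", "key": "answer", "value": 42})

# A fresh instance rebuilds the same state purely from the log.
recovered = Store(path)
print(recovered.state["answer"])  # 42
```

The only I/O on the hot path is the linear append, which is why the approach can be so fast when the working set fits in RAM.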


I don't consider querying a relational database, transforming its data, or validating its state transitions to be general programs. I consider those to be special purpose programs, which benefit from special purpose tools tailored for writing them. SQL is one such tool.


I spend most of my time in the parallel universe that is scientific computing/HPC. In this alternate reality SQL (not to mention databases) never really took off. Instead of scalable, performant databases, we have only the parallel filesystem. I'm convinced the reason contemporary scientific computing doesn't involve much SQL is sociological path-dependency, but there are also very good technical reasons. Optimizing software in scientific computing involves two steps: 1) think hard about your problem until you can frame it as one or more matrix multiplications, and 2) plug that into a numerical linear algebra library. The SQL abstraction (in my experience) takes you very much in the opposite direction.


For sure. Anything done in SQL is running on top of a million lines or more of extremely complicated non-SQL code. If that works for a given use case, great, but if not, optimizing can get very challenging. I'd much rather deal with something closer to the metal.


Self-proclaimed database expert here. What a database is good at depends on what you're trying to get that database to do, at least in part.

Break it into pieces: elegance and efficiency. These correspond to a logical statement of what you're trying to do, and how quickly the database will actually do it in practice.

SQL can do some nice things in areas, making it elegant in those areas. Elsewhere it can be pretty wordy and ugly.

In efficiency, it comes down largely to how the database is implemented and that also includes the capability of the optimiser. Both of these are out of your control. In my experience trying to turn a database into a number cruncher is just not going to work.

I guess that's long way round of me saying that I don't think I agree with you!


In the data warehouse / OLAP space, I think we are heading towards a world where the underlying data storage consists of partitioned Arrow arrays, and you can run arbitrary UDFs in arbitrary languages on the (read-only) raw data with zero copy and essentially no overhead other than the language runtime itself and marshalling of the data that is emitted from the UDF.

Something like the Snowflake data storage model + DuckDB as an execution engine + a Pandas/Polars-like API.

There is no reason why we have to be stuck with "database == SQL" all the time. SQL is extremely powerful, but sometimes you need a bit more, and in that case we shouldn't be so constrained.

But in general yes, the world is gradually waking up to the idea that performance matters, and that data locality can be extremely important for performance, and that doing your computations and data processing on your warehouse data in-place is going to be a huge time and money saver in the longer term.


> Databases are much more powerful than we think

And what people think is partly a function of attitudes towards working at the DB level. I see this often with ORMs in the web dev sphere (rather than data science). Yes, ORMs are great, but many people rely on them to completely abstract away the database and are terrified by raw SQL or even query building. You also see it with services that abstract away the backend, like PocketBase, Fireship, etc. Writing a window function or even a subselect looks like black magic to many.

I say this after several experiences with codebases where joins and filtering were often done at the application layer and raw SQL was seen as the devil.
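For what it's worth, window functions are less magic than their reputation suggests. A per-customer running total in SQLite (invented schema), the kind of thing that otherwise gets reimplemented as a loop in application code:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders(customer TEXT, amount REAL);
INSERT INTO orders VALUES ('a', 10), ('a', 30), ('b', 5), ('b', 20), ('b', 25);
""")

# PARTITION BY restarts the running total for each customer;
# ORDER BY inside OVER defines the order of accumulation.
rows = conn.execute("""
SELECT customer, amount,
       SUM(amount) OVER (PARTITION BY customer ORDER BY amount) AS running
FROM orders
ORDER BY customer, amount
""").fetchall()
for row in rows:
    print(row)
```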


Opposite here - don't like ORMs. Too much overhead - though I get their value.


Database first designs make a lot of sense in a lot of ways. I've worked for a company with an Oracle database that has SQL scripts that do all the validation and create text files for downstream usage. I think it makes more sense than a ton of Java, but there are pros and cons. One is that SQL is relational and the advanced stuff can be extra hard to troubleshoot if you don't have enough experience. Even those that can't code can usually understand a for loop and can think imperatively.

Unfortunately it's an expensive commercial product, or I'd recommend you look at kdb+ if you work with time series data. The big banks use it and essentially put all their latest RT data into kdb+, and then can write extremely succinct queries with a SQL-like syntax, but with the ability to approach it far more programmatically than is typically doable with something like PL/SQL. You can even write your neural network or whatever code in less than a page, as the language of kdb+ is extremely powerful, although also basically incomprehensible until someone puts some time into learning it. It's extremely lightweight though, so very easy to deal with in an interactive fashion.

All that to say I agree with you that it's nice to just have everything you want all in one spot rather than to deal with 4 different tools and pipelines and shared drives and so on.


Absolutely.

Consider too that PostgreSQL databases support different languages, like Python.

Loads of for-profit companies have tried to cash in on this. SAP HANA is one of the ones I've had recent experience with. It is unfortunately a poor implementation. The right architecture tends to be: put your model behind an API interface, not internal on the system. Train your model separately from production systems, and so on.

You might also be interested in checking out MLOps platforms like Kubeflow, Flyte, and others.


There is a recent trend in database research: ML in databases. Not sure how much of an impact it can make though; the sweet spot is doing relatively linear stuff, arguably just a step up from analytical functions in queries, while cutting-edge ML needs dedicated GPUs for the compute load and often uses unstructured and even binary data.


> Are the current tradeoffs just that more people know python vs sql or database specific languages to where moving logic to postgres or snowflake is looked down on?

Yes, mostly development, deployment, etc. concerns. I haven't ever seen an org that versions their SQL queries unless they are in a codebase. The environment is just unfriendly towards that type of management. Never mind testing! These things have solutions, but we haven't matured enough here, because that type of development has been happening in application code.

Also, SQL is generally more complex than application logic, because they are designed to do different things. What is a simple exercise in iteration or recursion can easily become a bit more of a headache.

Problems that can be resolved, but they are problems.


I would like for this to be the case. I was DBA of a Postgres database for a LIMS almost two decades ago. Back then you could code functions for the database to execute on data, and it was very powerful but also very clunky. The tools to do software development in the database were not mature at all. A lot has changed in the past 20 years and SQL has evolved. Do you think SQL will expand that much, or that there will be APIs built into the database? Near-data functions are powerful and useful, but I would want my development environment to be more like version-controlled code than "built-in".

I wonder if near-data functions on small databases is the solution to the limit of statelessness that you have with functions as a service.


With “traditional” RDBMS systems, putting a lot of code in the DB led to a lot of scaling issues where you'd need gigantic machines with a lot of RAM and CPU, because the DB was doing so much work. It was expensive and clunky to get HA right.

With more modern DBs being distributed horizontally, this approach may see a rebound. The big “but” is still cost: in my experience in AWS, for example, managed Postgres Aurora was surprisingly expensive in terms of monthly cost.


Why would running code in a database process be intrinsically more expensive than running it in some other process?


The issue is that many relational databases are not horizontally scalable, so you want to be frugal with their resources.


The resources I'm familiar with are I/O, memory, and CPU. The only one I believe can be spared in the database by using that resource outside of the database, is CPU. When the database is far from saturated on CPU and latency and throughput are determined by I/O and memory, using CPU on some other machine that isn't the database can't possibly have any impact on latency and throughput.


> When the database is far from saturated on CPU

The issue here is that if you scale enough to saturate the database, you'll have to rewrite essentially all your code if you're a typical CRUD webapp. Basically all of your business logic is about data retrieval. There are probably some companies that can get away with this, but it would be way too expensive for most.


If I scale up to saturate the database CPU...by doing data retrieval? Setting aside my skepticism about saturating the CPU with mere data retrieval, how is that solved by moving the data to another host's CPU, when moving the data involves the very data retrieval that's saturating the database's CPU?


If you mix in compute-heavy calculations (like stochastic gradient descent) in with your pure data-retrieval, yes you'll saturate the database's CPU, and you won't have a reasonable way to scale it.

If you do it on a host that's not a database, then you can horizontally scale it. There's a reason stateless apps are the default.


Ok so if I'm not mixing in compute heavy workloads like the stochastic gradient descent described in the article, then it's less likely I'll saturate the CPU. Perhaps that won't happen at all and then I won't have to scale horizontally.

On the other hand if I'm doing stochastic gradient descent that's saturating the CPU then there's a good chance I'm doing offline training of an ML model. In that case my latency tolerances are probably much much higher. In other words, I can also avoid scaling horizontally provided I can live with longer training times. That might be a worthwhile trade-off to me given the added complexity of horizontal scaling.

Good to know!


You can easily scale your app layer linearly across multiple machines, while that is harder with previous-gen databases.


What good is that going to do if the data always resides in the database?


I agree. I took a course in databases and SQL and was blown away by its power. With CTEs and PL/SQL you can do a lot of stuff inside the database.

I played with SQLite and its JSON columns. Once you get the hang of the syntax for walking a JSON structure, you can do all sorts of neat things. Doing the same thing in Python would be tedious.

And I also believe it ended up being way faster than what I did in python.
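For anyone curious, here's a small sketch of the kind of JSON walking I mean, driven from Python's bundled SQLite (this assumes a build with the JSON1 functions compiled in, which modern builds include):

```python
import sqlite3

# In-memory database; json_extract and json_each come from SQLite's
# JSON1 extension, available in most modern builds.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE docs (body TEXT)")
con.execute("""INSERT INTO docs VALUES ('{"user": "ada", "scores": [3, 5, 8]}')""")

# Pull a scalar out of the document.
user = con.execute("SELECT json_extract(body, '$.user') FROM docs").fetchone()[0]

# Explode the array with json_each and aggregate it in plain SQL.
total = con.execute(
    "SELECT SUM(value) FROM docs, json_each(docs.body, '$.scores')"
).fetchone()[0]

print(user, total)  # ada 16
```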


had a very good chat with https://postgresml.org/ last week which is focusing on bringing ML to postgres: https://youtu.be/j8hE8-jZJGU


I'm watching it, it's really good. Montana makes a great point: you can move data to the models, or move the models to the data. Data is typically larger than models, so it makes sense to go with the latter.


thanks for watching! i should really up the production quality haha but also this is what i can kinda manage with my existing workload. idk how the pro youtubers make these calls interesting


Junior developers like me were uncomfortable with SQL twenty years ago. Java ORM frameworks became popular because of the object-relational impedance mismatch. I see the same kind of sentiment nowadays among newer generations, but in Python & Co.

The success of the Apache Spark engine can at least partially be attributed to

* being able to have the same expressive power as SQL but with a real Scala API (including having reusable libraries based on it)

* being able to embed it into unit tests at a low price of additional ~20 seconds latency to spin up a local Spark master


A DBMS is really its own operating system; usually it is hosted on another operating system, one that understands the hardware.

I remember one place I worked where we had several old graybeard programmers who considered the DBMS[1] the operating system. As a Unix sysadmin I had some interesting discussions with them, since I was always confused and confined by the DBMS, and they felt the same about Unix.

1. UniData, if curious: a weird multi-value (not relational) database, very vertically integrated compared to most databases today.


I've created a big system with only SQL in the past like you just said. I wouldn't do it again because of these two pain points that I had to deal with:

1. it's really hard to debug SQL queries and stored procedures (at least it was in Postgres 11)

2. when you hit a performance bottleneck, you don't have much control over it - parallelizing is hard and you have to trick the query planner to do what you want (and it doesn't work sometimes)


Not challenging your experience but genuinely trying to learn from it, in broad strokes what kinds of things were you doing in these stored procedures?


The problem with this approach is that:

1) No static typing.

2) Updating the logic requires migrations.

3) You put the logic in the place that is hardest to scale and is often a single point of failure.

4) You cannot compose the code effectively, and it is generally verbose.

It's one of those ideas that sounds great on paper and maybe works for smaller problems, but as you go up in complexity things get worse and worse.


It is tempting to combine a web server, a database, and some imperative language with built-in data-oriented/SQL features into a single executable, call it an application server, and have it communicate with the outside world using, for example, JSON-based RPC. I think there were/are some products in this area, even with a built-in IDE (like Wakanda).


Fully agree.

And using PostgREST [0] you can serve your PostgreSQL database as a REST API. And if you throw foreign data wrappers / Multicorn into the mix, you can map any other data source into your PostgreSQL DB as a table.

[0] https://postgrest.org/en/stable/


> Databases are much more powerful than we think

and data has mass. One example of bringing the work to the data is https://madlib.apache.org/ (works on Postgres and Greenplum)

[Disclaimer - former employee of Pivotal]


I have yet to see a decent IDE or system which allows great version control, unit testing and collaboration with SQL source code.

So I think a lot of the reluctance is from practical concerns.


We store individual SQL files in GitHub and keep them in separate folders

This is very simple and scales well for our purposes


The title here is wrong. The title and headings in the article are right: ONLINE gradient descent

It's specifically not stochastic. From the article:

Online gradient descent

Finally, we have enough experience to implement online gradient descent. To keep things simple, we will use a very vanilla version:

- Constant learning rate, as opposed to a schedule.

- Single epoch, we only do one pass on the data.

- Not stochastic: the rows are not shuffled. ⇠ ⇠ ⇠ ⇠

- Squared loss, which is the standard loss for regression.

- No gradient clipping.

- No weight regularisation.

- No intercept term.
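For reference, that vanilla variant is only a few lines in Python (my own sketch, not the article's SQL):

```python
def online_gradient_descent(rows, lr):
    """One unshuffled pass over the rows: constant learning rate,
    squared loss, no intercept, no clipping, no regularisation."""
    weights = None
    for x, y in rows:
        if weights is None:
            weights = [0.0] * len(x)
        # Squared-loss gradient: d/dw (y_hat - y)^2 = 2 * (y_hat - y) * x
        y_hat = sum(w * xi for w, xi in zip(weights, x))
        g = 2.0 * (y_hat - y)
        weights = [w - lr * g * xi for w, xi in zip(weights, x)]
    return weights

# Single feature, y = 2 * x: one pass already lands close to w = 2.
rows = [((float(i),), 2.0 * i) for i in range(1, 101)]
w = online_gradient_descent(rows, lr=1e-5)
```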


Hehe I was wondering if someone would catch that. Rest assured, I know the difference between online and stochastic gradient descent. I admit I used stochastic on Hacker News because I thought it would generate more engagement.


Then just call it Non-stochastic Gradient Descent? You can't editorialize titles per HN guidelines

https://news.ycombinator.com/newsguidelines.html


Thanks, I wasn't aware.


My pleasure. You can still edit the title, by the way ;-)


What are some adversarial cases for gradient descent, and/or what sort of e.g. DVC.org or W3C PROV provenance information should be tracked for a production ML workflow?

Gradient descent: https://en.wikipedia.org/wiki/Gradient_descent

Stochastic gradient descent: https://en.wikipedia.org/wiki/Stochastic_gradient_descent

Online machine learning: https://en.wikipedia.org/wiki/Online_machine_learning

adversarial gradient descent site:github.com inurl:awesome : https://www.google.com/search?q=awesome+adversarial+gradient...

https://github.com/EthicalML/awesome-production-machine-lear...

Robust machine learning: https://en.wikipedia.org/wiki/Robustness_(computer_science)#...

Robust gradient descent


We built model & data provenance into our open source ML library, though it's admittedly not the W3C PROV standard. There were a few gaps in it until we built an automated reproducibility system on top of it, but now it's pretty solid for all the algorithms we implement. Unfortunately some of the things we wrap (notably TensorFlow) aren't reproducible enough due to some unfixed bugs. There's an overview of the provenance system in this reprise of the JavaOne talk I gave here https://www.youtube.com/watch?v=GXOMjq2OS_c. The library is on GitHub - https://github.com/oracle/tribuo.


Just don't.

SQL:

- does not allow for easy and clean importing of modules/libraries

- is not easy to write tests for

- has limited support for a debugger

- lacks a consistent style for such large queries (plus most textbooks cover fairly simple stuff), which means it's hard for a developer to start reading someone else's code (more so than in other languages)

- clearly indicates in its name that it is a Query language.

Save yourself the trouble and all your collaborators the pain of working with this code in the future, of trying to add new features, of trying to reuse it in another project.

If you want to operate near the data, use PL/Python for PostgreSQL.

EDIT: Fixed formatting.


- PostgreSQL extensions are easy to include and use.

- pgTAP exists for testing.

- A large query in SQL is not made smaller by translating it into an ORM DSL.

- If "Query" in "SQL" means it's for querying data, then evidently "Query" not being in, say, "Java" or "Python" means those languages are NOT meant for querying data. If that's true, then why would you use them for querying data?


> If "Query" in "SQL" means it's for querying data, then evidently "Query" not being in say Java or Python means those languages are NOT meant for querying data

If X then Y does not imply if not X then not Y. Java and Python do not indicate a purpose in their name because they are general-purpose.


Are they meant for querying data?


Re modules/libraries: I meant it is not easy to write a piece of SQL code and then import it into several queries to reuse it, or lend it to someone else for use on their own schema. It is possible, yes, but seldom done, because it is hell. PostgreSQL extensions could be used for this purpose, but developing an extension requires a different set of SQL statements (or, luckily, Python or C) than those used by the user of the extension, which makes compounding them a bit hard. Not impossible, just hard to maintain.

About your last point, I don't think that was my line of reasoning, but, yes, for the love of what is precious, don't open SQL files as python/java file objects and then parse and rummage through them to find the data you are looking for. Not impossible, just hard to maintain.

Thanks for pointing out pgTAP, didn't know about this.

For some reason, data-science folks haven't yet caught up with ORMs. I don't know if this is good or bad, but (as the OP shows) they are more used to rows and columns (or graphs) than objects. Maybe that will change one day.


> maybe that will change one day

I pray that it never does.

https://blog.codinghorror.com/object-relational-mapping-is-t...


As for sharing SQL, that's easy to do within a database using views. Across databases with possibly different data models, that's not something I personally ever want to do.


Also, there is MindsDB: https://mindsdb.com/


This is great! Moving away from the proprietary nature of GPUs and complex math gatekeeping should help democratize AI.

Has anyone converted stuff like gradient descent to set theory?

https://support.unicomsi.com/manuals/soliddb/7/SQL_Guide/2_G...

https://www.sqlshack.com/mathematics-sql-server-fast-introdu...

https://www.sqlshack.com/learn-sql-set-theory/

Right now AI algorithms kind of look imperative and stateful to me, like state machines. But there should be a functional or relational representation, where pattern matching would be used to derive the current state instead.

It's trivial to go from functional to imperative representation, but often nearly impossible to go the other way. That's why monadic (sorry if I'm mincing terms) logic is so troublesome. Mutable variables, futures/promises, async, etc can't be statically analyzed, which is why most code today is difficult to parallelize and stuff like C++ optimizers don't even increase speed to the next order of magnitude. But functional languages have nearly infinite optimization potential through divide and conquer approaches like sharding and scatter-gather arrays that can run internally without side effects. In other words, we can throw hardware at SQL for linear speedup since it's embarrassingly parallel, but might have limited success optimizing something like Python.


> This is great! Moving away from the proprietary nature of GPUs and complex math gatekeeping should help democratize AI.

I understand the sentiment, but you can run ML on plain CPUs; in fact it is easier to do so, you just don't get the massive speed benefits of GPUs. Hugging Face's public models are all run on CPUs to save costs.

> Right now AI algorithms kind of look imperative and stateful to me, like state machines.

The maths is not stateful or imperative, only our implementation. Much like how functional languages are executed on a stateful CPU.

Research in parallelization is ongoing.


Your question "Has anyone converted stuff like gradient descent to set theory?" doesn't make sense from the perspective that gradient descent uses differential calculus to find min/max points of an objective function and differential calculus requires a lot of additional assumptions on top of set theory.

That being said, current deep learning libraries such as JAX and PyTorch use automatic differentiation to efficiently compute the partial derivatives used by optimization algorithms such as gradient descent, and it's not clear to me what the level of effort would be to convert that to something that could run efficiently in SQL.


Thank you, I knew that derivatives were used in gradient descent, but automatic differentiation is new to me. The Wikipedia article is fairly opaque compared to what I learned in school, but this tidbit stood out:

https://en.wikipedia.org/wiki/Automatic_differentiation#Impl...

Source code transformation (SCT): the compiler processes source code so that the derivatives are calculated alongside each instruction.

Operator overloading (OO): operators are overridden so that derivatives are calculated for numbers and vectors.

Based on the state of software these days, I'm guessing that OO (the "bare hands" method) is what's mainstream. It would be far better IMHO to use SCT, since it's a universal solution that doesn't require manually refactoring programs.
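To make the OO flavour concrete, here's a toy forward-mode version using dual numbers (my own sketch, not how any real autodiff library is implemented):

```python
class Dual:
    """Dual number: carries a value and its derivative, so overloaded
    operators propagate derivatives alongside each arithmetic operation."""
    def __init__(self, value, deriv=0.0):
        self.value, self.deriv = value, deriv

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__

def f(x):
    return x * x + 3 * x

x = Dual(2.0, 1.0)  # seed the input's derivative with 1
y = f(x)
print(y.value, y.deriv)  # 10.0 7.0, since f(2) = 10 and f'(2) = 2*2 + 3 = 7
```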

But stuff like SCT might be considered metaprogramming, which seems to have fallen out of fashion. I grew up with C++ macros and templates, so I feel that this is somewhat tragic, although readability and collaboration are much better today. A modern example might be something like aspect-oriented programming (AOP):

https://en.wikipedia.org/wiki/Aspect-oriented_programming

I once used the AOP library AspectJ to trace a Java app's execution, since Java made the (unfortunate) choice to focus on objects rather than functions, which makes it generally a poor fit for data processing, due to its high use of mutable state within objects (mutable state is what limits most object-oriented projects to around 1 million lines). Meaning that I couldn't remember the program's context as I was stepping through it, and had to analyze traces instead. AspectJ allows one to hook into the code without modifying it, sort of like a debugger, so that stuff like function calls and variable mutations can be watched:

https://en.wikipedia.org/wiki/AspectJ

https://www.eclipse.org/aspectj/doc/released/progguide/index...

Looks like this might still be an open problem in Python:

https://stackoverflow.com/questions/12356713/aspect-oriented...

https://docs.spring.io/spring-python/1.2.x/sphinx/html/aop.h...

But it seems like SQL would be a good candidate for AOP:

https://stackoverflow.com/questions/12271588/aspect-oriented...

https://technology.amis.nl/it/aspect-oriented-programming-ao...

If we had that, maybe we could automatically generate derivatives for the set operations. Then either access them as variables in stored procedures, or possibly as something like views or via metadata stored somewhere like MySQL's INFORMATION_SCHEMA.

I don't really know, but maybe these breadcrumbs could be helpful.


Most ML algorithms are numerical approximations, so yes, they look procedural. You can try and find analytical solutions, but except for the simplest models they’re intractable.


Replying to this to remind myself to research later, as this is a very important question


This is really interesting, but a basic part I don't understand: What would it actually look like to run this on a live dataset?

If I understand correctly: you'd run the recursive query, it produces results for every step, effectively showing you the progression of output over time, and then once it hits "present day", it completes and stops?

How would you generate results going forward? I.E. A minute elapses after the results return, do you have to re-run the whole query for all time?


Good question. I touched upon this in the conclusion. Basically, if you run this in a streaming SQL database, such as Materialize, then you would get a true online system which doesn't restart from scratch.


In a non-streaming DB, what would prevent you from storing the result set and just using the last iteration to calculate the next?


This is definitely a possibility. What I meant to say is that the implementation in the blog post doesn't support that.


Makes sense: just an insert instead of a select. I used a similar approach with Newton-Raphson to implement XIRR in SQL, and it worked well.
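As a toy illustration of Newton-Raphson in SQL (the select-only version, not my insert-based XIRR code; run here through Python's bundled SQLite):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Each recursive step applies one Newton-Raphson update for f(x) = x^2 - 2,
# i.e. x_next = x - (x^2 - 2) / (2 * x); six steps converge to sqrt(2).
root = con.execute("""
    WITH RECURSIVE newton(i, x) AS (
        SELECT 0, 1.0
        UNION ALL
        SELECT i + 1, x - (x * x - 2.0) / (2.0 * x)
        FROM newton
        WHERE i < 6
    )
    SELECT x FROM newton ORDER BY i DESC LIMIT 1
""").fetchone()[0]
print(root)  # ≈ 1.4142135623730951
```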


i’m trying to learn here so please pardon my ignorance. wouldn’t a pre-aggregated set affect the new aggregate result? i suppose you could store avg, sum, count and then add the new value(s) to sum, new count to count, and recalculate average from that. or even just avg and count and then re-aggregate as ((avg*count)+new values)/(count + new values count) but i didn’t know if there’s a better way to process new values into a set of data that’s already been aggregated


Yep, that's what I would do: effectively a functional approach where no memory is required besides the current set and iteration.

A big part of materializing datasets for performance is finding the right grain that is both easy enough to calculate and also can do nice things like be resumable for snapshots.
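The parent's re-aggregation formula, written out as a minimal sketch:

```python
def merge_avg(avg, count, new_values):
    """Fold a batch of new values into a stored (avg, count) pair without
    revisiting the original rows: ((avg * count) + sum(new)) / (count + len(new))."""
    new_count = count + len(new_values)
    new_avg = (avg * count + sum(new_values)) / new_count
    return new_avg, new_count

avg, count = merge_avg(10.0, 4, [20.0, 30.0])  # ((10 * 4) + 50) / 6 = 15.0
```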


Did you try running it using DuckDB?


DuckDB is what I used in the blog post. Re-running this query simply recomputes everything from the start. I didn't store intermediary state that would allow starting off from where the query stopped. But it's possible!


Does this work with Postgres?


Postgres has excellent support for WITH RECURSIVE, so I see no reason why it wouldn't. However, as I answered elsewhere, you would need to set some stateful stuff up if you don't want the query to start from scratch when you re-run it.
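Sketching what that stateful setup could look like for a running-balance case (using Python's bundled SQLite purely for illustration; the table names are made up, and the same SQL shape works in Postgres): snapshot closing balances once, then window only over transactions newer than the snapshot.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE tx (id INTEGER PRIMARY KEY, account TEXT, amount REAL);
    INSERT INTO tx (account, amount) VALUES
        ('a', 100), ('b', 50), ('a', -30), ('a', 10), ('b', 5);
    -- Snapshot of closing balances after the first batch.
    CREATE TABLE snapshot (account TEXT PRIMARY KEY, balance REAL, last_id INTEGER);
    INSERT INTO snapshot
    SELECT account, SUM(amount), MAX(id) FROM tx GROUP BY account;
    -- New transactions arrive later.
    INSERT INTO tx (account, amount) VALUES ('a', 7), ('b', -5);
""")

# Resume the running balance: seed each account with its snapshot, then
# only window over transactions newer than the snapshot.
rows = con.execute("""
    SELECT t.account,
           s.balance + SUM(t.amount) OVER (
               PARTITION BY t.account ORDER BY t.id
           ) AS balance
    FROM tx t JOIN snapshot s ON s.account = t.account
    WHERE t.id > s.last_id
    ORDER BY t.id
""").fetchall()
print(rows)  # [('a', 87.0), ('b', 50.0)]
```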


I for one would welcome a second blog post with the stateful example. I’ve been doing running totals (basically balances based on adding financial tx per account) but had to do it in some client code because I couldn’t figure out how to do a stateful resume in SQL


From the start I assumed this is a nice playful "Hexing the technical interview" kinda joke. But given the tone of the article, and some of the comments here… Uh, this cannot be serious, right?


Why couldn't it be serious? Your database already has your data so you don't have to stream it to another machine for training. If you have a developed model that needs retraining on new data, save yourself the network bandwidth (and time!) and just make the DB retrain. This doesn't replace the exploration phase of model-building but if you already know gradient descent works for you then what's wrong with this approach?


I'm very confused as well. Are we facing a wall of GPT-generated comments?


Incredible post. I laughed when I saw the title, snickered at the first paragraph, and then proceeded to be blown away by the rest of it. Thought I was in for a joke and instead I'm thinking about the nature of ML Ops and what it's become.


My sentiment exactly. The premise of the article comes across a little naive because there are so many fine-tuned libraries for specialized hardware architectures that already do this computation very efficiently.

However, it did make me wonder what this might look like on a gpu-accelerated database engine that is designed to leverage the SIMD parallelism of GPGPU architectures.

Beyond using SQL/NoSQL databases for CRUD apps I am not a "database guy", so I'm not sure about the feasibility, but it would be interesting to see it implemented.


In the comments here so far, we see a pattern we've seen before. When someone suggests doing something in SQL, there's a lot of concern about SQL being a very limited programming language where it's hard to do proper engineering.

Here's what I would really love to know: why is it that SQL is, to first order, the only language used to interact with databases, and SQL has about the same features as it did in the 70s? It seems analogous to the general-purpose programming world having stopped with C.


I believe the PC revolution gave a few generations of programmers the mistaken belief that imperative languages with familiar features (Basic, Pascal, C, Perl, Python, Java, etc.) are the only or the best way to interact with computers. Forms of interaction developed earlier (Forth, Prolog, SQL, etc.) were learned later, grudgingly, if at all. SQL was viewed with disdain, a necessary evil to be held at arm's length. That stalled evolution of SQL, the success of other query languages, or really anything that didn't fit the Algol-style "general purpose programming language" blinders.

Mercifully, finally, the cold war against SQL in particular and against relational databases in general seems to be thawing. Articles like this one exemplify that trend


Plenty of SQL features have been added since the 70s, notably window functions (which TFA relies on heavily). Most of the major databases are from the 80s, and even new kid on the block MySQL has been around since the mid 90s.
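For instance, a trailing moving average, which needed awkward self-joins in older SQL, is a single clause with a window function (a sketch via Python's bundled SQLite; window functions need SQLite 3.25+):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE prices (day INTEGER, close REAL);
    INSERT INTO prices VALUES (1, 10), (2, 12), (3, 11), (4, 15);
""")

# 3-day trailing moving average in one window clause.
rows = con.execute("""
    SELECT day,
           AVG(close) OVER (
               ORDER BY day ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
           ) AS ma3
    FROM prices
""").fetchall()
print(rows)  # day 3's average is (10 + 12 + 11) / 3 = 11.0
```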

SQL draws a very hard line between the expression of the query and the AST that is used in the actual implementation. Database vendors like this aspect because they are free to implement whichever optimizations they want, but application developers want to build queries programmatically and optimize them themselves before passing them to the db engine, hence the tension in threads like these.


> why is it that SQL is, to first order, the only language used to interact with databases, and SQL has about the same features as it did in the 70s?

Because SQL is effectively a domain-specific language. If you added 100+ additional keywords/functions/etc., do you think it would be easier or more difficult for the average developer to build something clean with it?

I look at SQL like a pile of bricks. You don't want complicated, unicorn bricks or you won't be able to fit anything meaningful together. Experienced masons almost certainly prefer their materials to be as dumb and consistent as possible.


> You don't want complicated, unicorn bricks or you won't be able to fit anything meaningful together. Experienced masons almost certainly prefer their materials to be as dumb and consistent as possible.

And now we know what's gone wrong with Lego for the last two-three (close to four?) decades.

Funny – and sad! – how popular this shit has been on sites like this, though. Dare one hope that that popularity might begin to wane? Idunno


This is genius, I love it.

On-stream inference with something like "continuous" data in the same structure is, I think, the final material form for "AI", so this is a great step towards that.

Thanks for the writeup


You might want to consider checking out ClickHouse, which supports many ML functions natively:

- stochasticLinearRegression (https://clickhouse.com/docs/en/sql-reference/aggregate-funct...)

- stochasticLogisticRegression (https://clickhouse.com/docs/en/sql-reference/aggregate-funct...)


Thanks for the links, I wasn't aware of them. The Russians often seem to be a step ahead in the ML world.


That is an outdated view, it is a global and culturally diverse company.


I always love DuckDB articles because I can learn what is new in ClickHouse.


At least one F500 company had iteratively reweighted least squares implemented in MS SQL as of ~20y ago. When all you have is a hammer...


This is great. The only thing I dislike about this is using these variables to try to predict Adj Close when they are not at all correlated.

There are countless meaningful correlations in financial data that would have been just as easy to play around with. One truly valuable example would be to look at trading multiples of comparable companies. Sticking to P/E would be easier as P is easily observable and forward-looking EPS estimates are generally always available. This would limit the exercise to more mature companies than the ones commonly discussed on HN but would make it actually meaningful


Do you have some data/resources on this? I'm a total snowflake at this, but I'm willing to learn.


In hindsight this is harder than it seems if you don't already have access to data, so I'll try to be as informative as possible in my response

It's hard to find this information out there, so here's ~all you need to know.

Data is usually behind paywalls, unfortunately. Industry standards are Bloomberg terminal (ridiculously expensive, 5 digits $), FactSet (very expensive, 4 digits), Capital IQ (expensive, not sure)... but there are a number of up-and-coming startups trying to disrupt the space so you may be able to grab data from them. I think https://atom.finance has a 7-day free trial you could use to play around with.

P/E simply means the company's _P_rice per share divided by _E_arnings per share. Cancel out the "per share" terms and you get total market capitalization (which is the value of the total equity) divided by net income (since "earnings per share" really means "net income per share")

So the "P" is easy to get. It's your Adj Close.

The "E" is trickier as it can mean a lot of things. Diluted EPS from financial statements? Last year's EPS? Management's guidance for EPS? None of those are actually correct even if they are all "EPS"

Importantly--and contrary to 99% of the info you will find online--the most relevant EPS number are forward estimates of EPS, usually for the next twelve months ("NTM"). That is based on an average or median of analyst estimates which is called "consensus". These are analysts from financial institutions who build their own little models based on their own views of where the business is going to go, informed by recent earnings, management's color in earnings calls and filings, etc.

Believe it or not, as hairy as that sounds, EPS is fairly easy to get as it's a metric that has less room for interpretation than, say, EBITDA.

So you're not going to go out there, read all these (paid) analyst reports, find their EPS, calculate the median, etc. Bloomberg, Capital IQ, FactSet do this for you and it's easily observable for the end user (that's their business).

The thing is, as you may have guessed, "next twelve months" are a moving target across time. Analysts usually provide estimates for the current fiscal year (i.e. FY 2023, ending 12/31/2023 for most companies) and the following year, ending 12/31/2024. Let's call these FY0_EPS and FY1_EPS, for simplicity

You might be tempted to just take a moving average of these two estimates, so that on 1/1/2023 it is 100% of FY0_EPS + 0% of FY1_EPS, on 1/2/2023 it is 99.9% + 0.1% and gradually "move forward in time" as the days pass. That sort of works (and definitely checks the box for a proof-of-concept like in your post) but for the sake of completeness, I'll just say that the right-er approach is to only "move forward in time" when new earnings are released. So it doesn't matter if we're in 1/1/2023 or 2/1/2023--what matter is what is the latest reported quarter. Take Coca-Cola for instance (https://www.bamsec.com/companies/21344/coca-cola-co). Let's roll the tape backward one year. They reported FY 2021 earnings on 2/22/2022, at which point analysts published new estimates in revised models, so on from that day forward until the next quarterly earnings we take 100% FY0_EPS + 0% FY1_EPS, in which these correspond to estimates for FY 2022 and FY 2023, respectively.

On 4/1/2022, Coca-Cola reported Q1 2022 results, analysts published new estimates, and we now take 75% FY0_EPS + 25% FY1_EPS. On 7/1/2022, we move forward another quarter so 50% + 50%, then 25% + 75% starting on 10/26 and then back to square one with 100% + 0% except FY0_EPS now means FY 2023 vs FY 2022 previously, and FY1_EPS means FY 2024

So your table is something like (I'm making up numbers)

    +------------+--------+-----------+---------+---------+-------------+-------------+-------------+
    | Date       | Ticker | Adj_Close | FY0_EPS | FY1_EPS | Period      | NTM0_Weight | NTM1_Weight |
    | 01/01/2022 |     KO |     90.10 |   20.00 |   24.00 | Q1 2022     |        1.00 |        0.00 |  
    | 01/02/2022 |     KO |     91.14 |   20.00 |   24.00 | Q1 2022     |        1.00 |        0.00 |  
    | 01/03/2022 |     KO |     89.30 |   20.00 |   24.00 | Q1 2022     |        1.00 |        0.00 |  
    | 01/04/2022 |     KO |     91.09 |   20.00 |   24.00 | Q1 2022     |        1.00 |        0.00 |  
    | 01/05/2022 |     KO |     92.01 |   20.00 |   24.00 | Q1 2022     |        1.00 |        0.00 |  
    | 01/06/2022 |     KO |     89.05 |   20.00 |   24.00 | Q1 2022     |        1.00 |        0.00 |  
    ...
    | 07/02/2022 |     KO |     89.05 |   19.50 |   23.20 | Q2 2022     |        0.75 |        0.25 |
With that you can take NTM0_Weight and NTM1_Weight to calculate NTM_EPS by multiplying those weights by FY0_EPS and FY1_EPS. And then you can take Adj_Close / NTM_EPS to calculate P/E
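In code, that last step is trivial (using the made-up numbers from the table above):

```python
def ntm_pe(adj_close, fy0_eps, fy1_eps, w0, w1):
    """Blend the two fiscal-year estimates into next-twelve-months EPS,
    then divide price by it to get the forward P/E."""
    ntm_eps = w0 * fy0_eps + w1 * fy1_eps
    return adj_close / ntm_eps

# First row of the table: 100% weight on FY0.
pe_jan = ntm_pe(90.10, 20.00, 24.00, 1.00, 0.00)  # 90.10 / 20.00 = 4.505
# Last row: one quarter reported, so a 75% / 25% blend.
pe_jul = ntm_pe(89.05, 19.50, 23.20, 0.75, 0.25)  # 89.05 / 20.425
```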

Why is this useful? Because in theory you can take the average P/E of companies X, Y and Z in one industry and compare it to a fourth company W. Is W's P/E multiple above or below the industry average? You now know if they are over or undervalued, respectively, which means you know if you should buy or sell that stock (if you believe you picked the right "comparable" companies in that industry)

This is just one example... there are all sorts of similar analyses done daily in the financial services industry. I'm not saying it's easy to extract alpha from trading on these, but that's the framework


Thank you so much for this! It's very generous of you to have taken the time.


My pleasure! I've spent the better part of the last decade doing this stuff and I appreciate how hard it is to find resources on it, so thought I'd share since you mentioned you were interested in learning

At the risk of exhausting said interest, here's one real life example of a similar analysis: https://wsp-blog-images.s3.amazonaws.com/uploads/2011/09/171...

See PDF page 14. Note the lines called "Composite P / NTM EPS" which they built as a blend of American Eagle's, Tilly's and Zumiez's P/E multiple, which are companies X, Y and Z in my comment above (for some reason they gave AE double the weight which is unusual) and compared it to Heat's P/E multiple (Heat was the codename for retailer Rue21, or hypothetical company W in my example above)

I got this deck from https://www.wallstreetprep.com/knowledge/investment-banking-... which has a few other examples and a variety of other resources on the topic FYI


Just to disambiguate, here machine learning = linear regression. So no auto-grad needed as gradients are explicit and simple. Still interesting though.
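For the curious, the explicit gradients in question: for squared error on a 1-D linear model y_hat = w*x + b, d/dw = 2*(y_hat - y)*x and d/db = 2*(y_hat - y), which makes a plain-Python SGD loop a few lines (a toy sketch, not the article's SQL implementation):

```python
import random

def sgd_fit(points, lr=0.01, epochs=200, seed=0):
    # Stochastic gradient descent with closed-form gradients:
    # d/dw = 2*(y_hat - y)*x, d/db = 2*(y_hat - y). No autograd needed.
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(points)
        for x, y in points:
            err = (w * x + b) - y
            w -= lr * 2 * err * x
            b -= lr * 2 * err
    return w, b

# Noise-free data from y = 3x + 1; SGD should recover w ~ 3, b ~ 1.
data = [(x, 3 * x + 1) for x in [-2, -1, 0, 1, 2]]
w, b = sgd_fit(data)
print(round(w, 2), round(b, 2))  # 3.0 1.0
```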


From a comment by the author:

> Also, I can't justify why, but my gut feeling is that the database should be an order of magnitude faster than Python, provided the implementation were to be improved.

Would be curious how that could end up being the case. Perhaps if NumPy wasn't used at all? That would mean no vectorization or optimized storage.

Would be interesting to see how it scaled with length and dimensionality


We've been doing this type of gradient descent in production for a couple of years now, and we've been very happy with it. A few insights from my experience:

One of the main benefits of doing differentiable programming over a relational database is that you can use joins to traverse the normalized data schema, and this acts as a very effective prior for your model. For example, you can learn parameters associated with shirt colors by having a parameter in the color table, and joining it with the shirt sales table (through the shirts table). Knowledge of the table structure also lets you improve convergence, for example by adding a factor to compensate for the fact that not all colors have the same number of sales (so some parameters are updated fewer times during each epoch). Here's a paper from last year: https://arxiv.org/abs/2209.03771v1
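A toy sqlite3 sketch of that idea (my own simplification, not the production system described here): one learnable parameter per color, with the gradient computed through the color -> shirts -> sales joins, and averaged per color, which stands in for the normalization factor mentioned above:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE color  (id INTEGER PRIMARY KEY, param REAL);
    CREATE TABLE shirts (id INTEGER PRIMARY KEY, color_id INTEGER);
    CREATE TABLE sales  (shirt_id INTEGER, y REAL);
    INSERT INTO color  VALUES (1, 0.0), (2, 0.0);          -- red, blue
    INSERT INTO shirts VALUES (1, 1), (2, 1), (3, 2);
    INSERT INTO sales  VALUES (1, 10), (1, 12), (2, 14), (3, 20);
""")

# Model: predicted sale value = the shirt color's param.
# Loss: (param - y)^2, so d(loss)/d(param) = 2 * (param - y).
# AVG (rather than SUM) normalizes per color for unequal sales counts.
for _ in range(100):
    con.execute("""
        UPDATE color SET param = param - 0.1 * COALESCE((
            SELECT AVG(2 * (color.param - s.y))
            FROM shirts sh JOIN sales s ON s.shirt_id = sh.id
            WHERE sh.color_id = color.id
        ), 0)
    """)

params = [p for (p,) in con.execute("SELECT param FROM color ORDER BY id")]
print([round(p, 2) for p in params])
# Each param converges to the mean sale value of its color: [12.0, 20.0]
```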

Automatic differentiation is an absolute must-have. Once you start having a few complex joins and aggregations, it's too hard to derive the gradient by hand. Doing automatic differentiation on relational queries requires some adjustments (in particular, because you cannot use a tape), so you end up having to define a subset of relational algebra that is closed by automatic differentiation. The general ideas were presented in https://ceur-ws.org/Vol-2971/paper07.pdf

On the other hand, while it's possible to do SQL-to-SQL automatic differentiation, the resulting queries have poor performance, so it's better to go one step lower (at the level of the executor for an already-planned query) and perform the automatic differentiation there. And we had an excellent intern work with us on dedicated parallelization for gradient descent queries: https://blog.lokad.com/pdf/reproducible-parallel-sgd-ziyad-b...

Finally, in practice, it usually doesn't matter if the gradient descent is stochastic, so long as you run several passes over the data.

I'm very interested in differentiable programming on relational languages, as the mainstream research concentrates on unstructured, low-information-density data (images/sound/text) when there are many domains where the data is more dense and structured, and (because of that structure) is stored in a database.


Is there a risk that postgres becomes sentient in our lifetime?


I thought I was decent at SQL until I saw this. Nice article!


This is interesting, but is not using the GPU.


Or use MindsDB: https://mindsdb.com/


There was also this neural network module in Redis if you want to do training and inference in a Redis DB (quite old, btw): https://github.com/antirez/neural-redis


Please don't do this in prod


Many ML models are trained offline, and some "online training" is just fine-tuning a model that was pretrained offline. Hence, I see no particular reason to worry about doing this "in prod."


R has been natively supported in SQL Server for a long while (at least since 2017), same for Java and Python, e.g. writing stored procedures in these languages.


I tried to replicate this in SQLite. The first few steps worked OK, e.g.

https://lite.datasette.io/?json=https://gist.github.com/simo...

(I replaced "figures" with "raw" due to the way Datasette Lite assigns a default table name to the imported JSON)

But the more complex recursive queries gave me this error and I'm not sure how to work around it:

    recursive reference in a subquery: state
E.g. https://lite.datasette.io/?json=https://gist.github.com/simo...


That's too bad, I would have expected it to work out of the box. Other than rewriting the query in a different way, I'm not sure I see an easy workaround. Are you still working on this?


No I managed to drag myself away from it and get back to what I'd been intending to work on!


Super interesting! Love the scrappy mindset ;)


Cheers, I learnt with the best :)


Reminds me of the time I coded up an optimized nearest neighbor search in SQL.


Next step would be to implement autodiff. If only PostgreSQL had CUDA support.


Uhhhh, ya know that Python has a library for that :-)


god tier


An even MORE useful idea is to do this in CSS.


I challenge you to add more emojis to your bio ;)


Yes I know some people look down on that. I hope it doesn't take away the merits of the article for you hehe.



