>A machine learning algorithm which can be trained using SQL opens a world of possibilities. The model and the data live in the same space. This is as simple as it gets in terms of architecture. Basically, you only need a database which runs SQL.
First paragraph of the conclusion, and this very much fits with the mindset that's been growing on me in the data world over the past few years. Databases are much more powerful than we think, they're not going away, only getting better, and having the data and the logic in the same space removes tons of headaches. ML models, transformation of data, and generating JSON for an API can all be done within the database rather than in an outside scripting language.
Are others seeing this? Are the current tradeoffs just that more people know python vs sql or database specific languages to where moving logic to postgres or snowflake is looked down on?
The challenge I, with limited knowledge, see with developing detailed algorithms in SQL is a lack of good testing, abstraction, and review tooling. The same goes for a lot of the user-defined functions in the larger data warehouses (Redshift, BigQuery, etc.)
dbt solves a lot of it, but I'd love to learn more about good resources for building reliable and readily-readable SQL algorithms for complex logic.
I've supported 3 different models over the years with inference implemented in SQL. The first one I inherited, and I loved it so much that I implemented it twice more. Amazingly fast for TBs of data, and no waiting on other tech teams.
That tooling you're describing is definitely not there. BigQuery has BQML, but it's very much in its infancy. I tend to do all the modeling in Python/R on sampled data and then deploy to SQL.
I agree. Databases are going to be here for a long time, and we're barely scratching the surface of making people productive with them. dbt is just the beginning.
The problem with throwing everything in a database is you end up with brittle stored procedures all over the place, which are painful to debug. There is no good support for version control or testing, which means you end up creating a dozen copies of each function (sp_v1, sp_v2, ..., etc.). It's much harder to practice the kind of iterative development that the rest of software engineering seems to implement effectively.
Also traditional relational databases have a way to go before they can support parallelized machine learning workloads. You do not have control or the ability to spin up threads or processes to boost your performance. You rely on the query processor to make those decisions, and depending on your platform the results will be mixed.
No comment re rDBs supporting parallelized ML but re stored procedures - if your workflow evolves to treat them as ‘1st class’ code assets they’ll be just the same as the rest of your code.
We always had them in version control, unit tested etc. The tools are there if you want to use them.
I'll take this as a learning opportunity. I have looked around to find a reliable framework to implement within our team and failed to find anything usable.
How do you guys manage to implement versioning and testing?
If you had a new stored procedure to deploy, where do you deploy it? How do you integrate with existing applications which rely on it?
Depends on what you're using. The Java world has a bunch of JDBC wrappers for testing SPs. SQL Server's GUI has unit testing tools that can also be run from the command line. @dventimi has a good Postgres recommendation.
Version control - each SP is in a file sp_descriptive_name.sql that's on the source tree under /db.
Deployment - compiled from the command line. Deployment target managed by env variables. Systems can require odd workarounds for ensuring no downtime - this is where systems end up with sp_descriptive_name_vN with N for each version (still stored in sp_descriptive_name.sql tho) so the new SP can be loaded alongside the old version, and then application code can be updated (or config updated) to use the new SP name. Good practice to remove the old version on next release.
I think it's true that SPs introduce a compromise, but having logic operate on data without having to send data back and forth over DB connection is a pretty big win.
It can also be a good abstraction because it avoids application code needing to know (for example) which field in which table indicates a customer is active, and code can execute something like sp_deactivate_customer(cust_id) and then let the stored procedure take care of updating flags, dates etc that describe a customer's active status. ymmv depending on whether your team owns the data and database, or if that's managed separately.
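A rough sketch of what that looks like (Postgres flavor; the table and column names here are invented for illustration):

CREATE OR REPLACE PROCEDURE sp_deactivate_customer(p_cust_id int)
LANGUAGE sql
AS $$
    -- all knowledge of which flags and dates mark a customer inactive
    -- lives here, not in the application
    UPDATE customers
    SET is_active = false,
        deactivated_at = now()
    WHERE cust_id = p_cust_id;
$$;

CALL sp_deactivate_customer(42);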
> The problem with throwing everything in a database is you end up with brittle stored procedures all over the place, which are painful to debug. There is no good support for version control or testing...
Shell scripts with the SQL as here-docs, instead of stored procedures. Ordinary text files, eminently gittable. Also lets you insert varying table/column names as environment variables, instead of the regex rigmarole in TFA. (Or was that in another post on his blog? Just returned from reading several of them.) Best of both worlds.
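A minimal sketch of the pattern (names and paths invented; assumes psql and a DATABASE_URL):

#!/bin/sh
# ordinary text file, lives in git like everything else
TABLE="${TARGET_TABLE:-customers}"
psql "$DATABASE_URL" <<SQL
SELECT count(*) AS active_customers
FROM ${TABLE}
WHERE is_active;
SQL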
There is nothing you cannot build by combining SQL primitives. Side effects can even be introduced - on purpose - by way of UDFs that talk to the outside world.
I've seen more than one system where the database itself was directly responsible for things like rendering final HTML for use by the end clients. You might think this is horrible, and indeed many manifestations are. But there lurks an opportunity for incredible elegance down this path, assuming you have the patience to engage in it.
> I've seen more than one system where the database itself was directly responsible for things like rendering final HTML for use by the end clients.
I did this for a side project a few months ago and even used postgrest to serve the page with correct headers for html. It felt simultaneously really cursed and obvious. Shit you could even use plv8 to run mustache or whatever in the db if you really wanted to piss people off.
I'm doing this right now with a DO droplet built with PostgreSQL, postgrest, nginx, and not much else. Do you have any tips, tricks, or blog posts you can share based on your experience? You should post it to HN. Strike while the iron's hot. With a little luck you'll hit the front page.
Nah nothing that refined. Pretty predictably supabase is doing some weird stuff along these lines and I found some abandoned and semi-active repos associated with them and people working for them that were useful examples of some things.
As for posting to HN absolutely no thanks. These people are so fucking hostile there is no accomplishment too small to tear apart for entertainment here. I have no interest in it.
plv8 doesn't have network access, but this can be done with PostgREST.
That said, we actually restricted the HTML content-type on our platform so that users cannot do this. This is because we don't want to get too deep into the "frontend" world, with ever-changing frameworks.
We've found that if we offer a small feature, our users demand feature-completeness. In this case we're better off waiting until we have solved a few of the major database tasks (branching + migrations, scale-to-zero, edge caching, etc.)
FWIW, I think it's very cool serving HTML through PostgREST. I would be pretty happy to do this personally
Long time user, first time caller. Besides the DO droplet, separately I also use Supabase and I love it.
I understand completely about restricting the content type, maybe just for now. Fun fact: I work around it by serving up text/xml plus XSLT. I'm sure you don't mind.
> Side effects can even be introduced - on purpose - by way of UDFs that talk to the outside world.
> I've seen more than one system where the database itself was directly responsible for things like rendering final HTML for use by the end clients.
I built BI web pages entirely in PL/SQL (using some Oracle modules I can't quite recall the names of) over twenty years ago. Got a bit hairy to keep the meta-recursive stuff in your head when hard-coding JavaScript into it, calling the next such page for each value on the page, for drill-down to the next level... Not fun, but quite doable.
And with a bit more modern tooling -- as I said, this was turn-of-the-century tech -- I have no doubt it could be done much more effectively and efficiently nowadays.
+1. SQL is extremely elegant and composable, and is underrated.
Postgres is very powerful. After a short detour into NoSQL MongoDB land, I'm now back in MySQL/PostgreSQL territory and glad for it.
Being able to generate views and stored procedures is useful as well. Having SQL take over more, like ML and gradient descent, does open up good possibilities.
Also, since SQL is declarative, it's often easier to work with than imperative scripting languages.
Basically, but queries are not first class in SQL. You can't assign a query to a variable, or pass it as a parameter to a stored procedure, for example. This would make SQL composable:
declare @x = (select * from Person)
select Name from @x where Birthdate < '2000-01-01'
CTEs and TVFs still treat tables as second-class citizens, so while they enable some forms of composition, they're still very restricted. This has been the consistent story with SQL: 15 ways to use queries and tables, all to work around the limitation that they are not first-class values.
Where is the assignment to a variable? Where can you construct a query using a variable in table/query position? That's the whole point of being first class and composable, a query becomes like any other value so you should be able to parameterize any query by another query assigned to a variable that may have been set inside an if-statement, or accepted as a parameter to a stored procedure. You know, the same kinds of composition we see in ordinary programming languages.
You're still not getting it. First-class status means that anywhere a value or variable can be used, a query or table should also be able to appear, and vice versa. This means a table or query can appear as a return type, a parameter to a stored procedure or query, a variable, and so on.
SQL just does not have this, it instead has 15 different second class ways to handle tables and queries that try to make up for the fact that they are not first-class values. These include CTEs, table valued functions, views, etc.
Usually I balk at the idea of someone describing a language feature as “first class” because it seems too wishy-washy. But in this thread you've shown me that maybe the best definition is through “proof by negation,” by patiently responding to arguments and demonstrating why a certain usage and the ensuing restriction around it means it is not first class. Bravo!
I agree the term is often abused, but I think the wikipedia page actually does a decent job of making the notion somewhat precise, along the lines I've been arguing here:
If you want to see what queries as first-class values looks like, LINQ in .NET is pretty close. I can actually write a series of queries that build on and compose with each other, like this:
IQueryable<Person> RunQuery(int userSelection)
{
    var first = from x in People
                select x;
    var second = userSelection == 1
        ? (from x in first where x.Birthday > new DateTime(2000, 1, 1) select x)
        : (from x in first where x.Name.Contains("Jane") select x);
    return DumbJoin(first, second);
}

IQueryable<Person> DumbJoin(IQueryable<Person> first, IQueryable<Person> second)
{
    return from x in second
           join y in first on x.Role equals y.Role
           select x;
}
This query is nonsense, but it just shows you what composition really looks like when queries are first-class values. I wish raw SQL were like this!
> You're still not getting it. First-class status means that anywhere a value or variable can be used, a query or table should also be able to appear, and vice versa. This means a table or query can appear as a return type, a parameter to a stored procedure or query, a variable, and so on.
I doubt you could implement a query planner that would cope with that degree of flexibility. Which means you’d be forced to deal with the mechanics of the query, pushing you away from declarative SQL and into procedural and functional programming. At which point you might as well ditch SQL anyway.
Without these features, people have to resort to dynamically generated SQL queries in procedural or functional languages, which is much worse! SQL has also become significantly more complicated by adding all sorts of second-class features to get around this composability limitation (CTEs, table valued functions, views, etc.).
Besides, I don't think it would be as bad as you say. You can approach it as a simple template expansion into flat SQL queries except where a data dependency occurs, at which point template expansion proceeds in stages, one for each dependency.
LINQ on .NET provides most of the composability I'm talking about, although it has a few limitations as well. Still worlds better than raw SQL.
What if I wrote a very long, complicated query that I'd like to test against different tables (like test tables), and let's say I can't use stored functions or procedures. How could I pass different tables to my query?
CREATE TABLE data_a AS (SELECT 'a' AS test_case, 1 AS value);
CREATE TABLE data_b AS (SELECT 'b' AS test_case, 2 AS value);
CREATE VIEW data AS (SELECT * FROM data_a UNION ALL SELECT * FROM data_b);
CREATE VIEW complicated_query AS (SELECT test_case, value+1 FROM data);
SELECT * FROM complicated_query WHERE test_case = 'a';
SELECT * FROM complicated_query WHERE test_case = 'b';
Nice, that is what I was looking for. Of course, it'd need to point to production data as well, so maybe test_case is null, in that case:
CREATE TABLE data_a AS (SELECT 'a' AS test_case, 1 AS value);
CREATE TABLE data_b AS (SELECT 'b' AS test_case, 2 AS value);
CREATE TABLE data_prod AS (SELECT NULL AS test_case, prod_table.value FROM prod_table);
CREATE VIEW data AS (SELECT * FROM data_a UNION ALL SELECT * FROM data_b UNION ALL SELECT * FROM data_prod);
CREATE VIEW complicated_query AS (SELECT test_case, value+1 FROM data);
-- when testing
SELECT * FROM complicated_query WHERE test_case = 'a';
SELECT * FROM complicated_query WHERE test_case = 'b';
-- when in 'production'
SELECT * FROM complicated_query WHERE test_case IS NULL;
You just reinvented defunctionalization, which is a transformation from a domain that has first-class values to a domain where support is only second-class. Defunctionalization is typically used in programming languages to simulate first-class functions in languages where functions are only second-class citizens, like C and Pascal.
This perfectly illustrates my point. You had to manually defunctionalize your data model and queries to support what I'm saying should be inherently part of SQL.
> languages where functions are only second-class citizens, like C and Pascal.
1) Only if you define Pascal as only Wirth's very first version. That changed almost immediately.
2) Only if you refuse to equate “pointer to function” with “function”. Which in C, where “everything is a pointer” (a bit like in Unix/Linux “everything is a file”), seems rather silly.
Postgres has read/write parallelism that can scale across millions of reads and writes; if the ML model lives in the Postgres DB, it is indeed very elegant. It reminds me of the glory days of tsearch2, doing text search in Postgres to search our internal document repository using nothing but Postgres, and boy was it faster than Java-based search systems.
Another is MS SQL Server, which lets you run .NET on the database server :D "you can author stored procedures, triggers, user-defined functions, user-defined types, and user-defined aggregates in managed code"
Some tasks - like the concurrency SQL enables - are just very difficult to debug with debuggers. It would be the same with any other language. What SQL does here, though, is allow you to focus on the logic, not the actual concurrency.
> text editors, package managers
I feel like these two are just for filling up the space.
Sure SQL has types, but they are checked at runtime, not compile time. Also you cannot define function input and return arguments with types that are checked before you run the program.
> compilers
If you want efficient and/or portable code. They will check your code for type errors before you run them. They give you coding assistance in your editor.
> debuggers
Being able to break a program and see its state and function stack is useful. The quality of the tools for real languages is much better than for SQL.
I agree that databases do concurrency better than most languages with their transactions (I mentioned I would use the db for ACID).
> text editors, package managers.
Editor support of real languages is much better than SQL.
Package managers enable code re-use.
> C FFI
Take for example Python. A huge amount of the value comes from re-using C libraries that are wrapped with a nice Python API.
You might be able to do this in SQL, but you'll have to wrap the C library yourself as there is no package manager, and no one else is doing the same thing.
> Editor support of real languages is much better than SQL.
So text editors suck at supporting SQL... How is that SQL’s fault?!? Go complain to the text editor authors.
> Package managers enable code re-use.
Yup. Build one for SQL, then you can re-use SQL code. Just like someone had to build one for every other language. What does this prove about SQL being inferior to other languages? A: Nothing at all.
> What does this prove about SQL being inferior to other languages
My point is that I think it is inferior for general application/business logic programs. For queries that filter large datasets to small result sets, SQL is probably better as it has the built in indexes and query planner (plus ACID).
I am pointing out that the current environment (better text editors and package managers) favours general languages, so they are a better current choice (to use in combination with SQL) over just writing everything, including application logic, in SQL.
> I think general programming languages are better for general programs than SQL.
Specifically they have: Type systems, compilers, debuggers, text editors, package managers, C FFI etc.
Non sequitur. SQL is typed; SQL can be edited in any text editor; there are lots of SQL IDEs and, arguably, debuggers and package managers. Sure, the package managers are specific to each RDBMS, but so what? Npm is no use in COBOL either. And sure, the “debuggers”, to the extent they can be said to exist, are radically different from those of “conventional” – of other – programming languages. But again, so what? A Smalltalk debugger is no use to fix the output from Intel’s latest C compiler either, or vice versa.
IOW: There is no such “SQL vs programming languages” dichotomy. SQL is just another programming language, with its own strengths and weaknesses, just like all the rest of them. “The rest” are not distinguished from SQL by somehow magically all having the attributes you claim for them: Some have them, some don't; some have this but not that, some others, the other way around. Someone built all those IDEs and debuggers and package managers for (some / many / most of) those other languages; you can build them for SQL too.
The basic notion is you keep your data hot in RAM and manage it directly. You make every change an object (or a command), and write that out serially to a log before you execute it. That gets you the ACID guarantees but with no I/O but linear writes, so it can be extremely fast.
It only makes sense when your data fits conveniently in RAM, but that's a lot of things.
I don't consider querying a relational database, transforming its data, or validating its state transitions to be general programs. I consider those to be special purpose programs, which benefit from special purpose tools tailored for writing them. SQL is one such tool.
I spend most of my time in the parallel universe that is scientific computing/HPC. In this alternate reality SQL (not to mention databases) never really took off. Instead of scalable, performant databases, we have only the parallel filesystem. I'm convinced the reason contemporary scientific computing doesn't involve much SQL is sociological/path-dependency, but there are also very good technical reasons. Optimizing software in scientific computing involves two steps:
1) Think hard about your problem until you can frame it as one or more matrix multiplications
2) Plug that into a numerical linear algebra library
The SQL abstraction (in my experience) takes you very much in the opposite direction.
For sure. Anything done in SQL is running on top of a million lines or more of extremely complicated non-SQL code. If that works for a given use case, great, but if not, optimizing can get very challenging. I'd much rather deal with something closer to the metal.
Self-proclaimed database expert here. What a database is good at depends on what you're trying to get that database to do, at least in part.
Break it into two pieces: elegance and efficiency. These correspond to a logical statement of what you're trying to do, and how quickly the database will actually do it in practice.
SQL can do some nice things in certain areas, making it elegant there. Elsewhere it can be pretty wordy and ugly.
As for efficiency, it comes down largely to how the database is implemented, which includes the capability of the optimiser. Both of these are out of your control. In my experience, trying to turn a database into a number cruncher is just not going to work.
I guess that's a long way round of me saying that I don't think I agree with you!
In the data warehouse / OLAP space, I think we are heading towards a world where the underlying data storage consists of partitioned Arrow arrays, and you can run arbitrary UDFs in arbitrary languages on the (read-only) raw data with zero copy and essentially no overhead other than the language runtime itself and marshalling of the data that is emitted from the UDF.
Something like the Snowflake data storage model + DuckDB as an execution engine + a Pandas/Polars-like API.
There is no reason why we have to be stuck with "database == SQL" all the time. SQL is extremely powerful, but sometimes you need a bit more, and in that case we shouldn't be so constrained.
But in general yes, the world is gradually waking up to the idea that performance matters, and that data locality can be extremely important for performance, and that doing your computations and data processing on your warehouse data in-place is going to be a huge time and money saver in the longer term.
It's also a function of attitudes towards working at the DB level. I see this often with ORMs in the web dev sphere (rather than Data Science). Yes, ORMs are great, but many people rely on them to completely abstract away the database and are terrified by raw SQL or even query building. You also see it with services that abstract away the backend like PocketBase, Firebase, etc. Writing a window function or even a sub-select looks like black magic to many.
I say this after several experiences with codebases where joins and filtering were often done at the application layer and raw SQL was seen as the devil.
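For what it's worth, the "black magic" is usually only a few lines. A sketch of a common case, latest order per customer, with invented table and column names:

-- one pass over the data, no application-side loop
SELECT *
FROM (
    SELECT o.*,
           ROW_NUMBER() OVER (PARTITION BY customer_id
                              ORDER BY ordered_at DESC) AS rn
    FROM orders o
) latest
WHERE rn = 1;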
Database first designs make a lot of sense in a lot of ways. I've worked for a company with an Oracle database that has SQL scripts that do all the validation and create text files for downstream usage. I think it makes more sense than a ton of Java, but there are pros and cons. One is that SQL is relational and the advanced stuff can be extra hard to troubleshoot if you don't have enough experience. Even those that can't code can usually understand a for loop and can think imperatively.
Unfortunately it's an expensive commercial product, or I'd recommend you look at kdb+ if you work with time series data. The big banks use it: they essentially put all their latest RT data into kdb+ and can then write extremely succinct queries with a SQL-like syntax, but with the ability to approach it far more programmatically than what is typically doable with something like PL/SQL. You can even write your neural network or whatever in less than a page of code, as the language of kdb+ is extremely powerful, although also basically incomprehensible until someone puts some time into learning it. It's extremely lightweight though, so very easy to deal with in an interactive fashion.
All that to say I agree with you that it's nice to just have everything you want all in one spot rather than to deal with 4 different tools and pipelines and shared drives and so on.
Consider too that PostgreSQL databases support different languages, like Python.
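For example (a toy sketch, assuming the plpython3u extension is available; the function itself is invented):

CREATE EXTENSION IF NOT EXISTS plpython3u;

CREATE OR REPLACE FUNCTION zscore(x double precision,
                                  mean double precision,
                                  stddev double precision)
RETURNS double precision AS $$
    # the body is ordinary Python, executed inside the database
    if not stddev:
        return None
    return (x - mean) / stddev
$$ LANGUAGE plpython3u;

SELECT zscore(12.5, 10.0, 2.5);  -- returns 1.0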
Loads of for-profit companies have tried to cash in on this. SAP HANA is one of the ones I've had recent experience with. It is unfortunately a poor implementation. The right architecture tends to be: put your model behind an API interface, not internal to the system. Train your model separately from production systems, and so on.
You might also be interested in checking out MLOps platforms like Kubeflow, Flyte, and others.
There is a recent trend in database research, ML in databases. Not sure how much an impact it can make though, the sweet spot is doing relatively linear stuff, arguably just a step up from analytical functions in queries, while cutting edge ML needs dedicated GPUs for compute load and often uses unstructured and even binary data.
> Are the current tradeoffs just that more people know python vs sql or database specific languages to where moving logic to postgres or snowflake is looked down on?
Yes, mostly development, deployment, etc. concerns. I haven't ever seen an org that versions their SQL queries unless they are in a codebase. The environment is just unfriendly towards that type of management. Never mind testing! These things have solutions, but we haven't matured enough here, because that type of development has been happening in application code.
Also, SQL is generally more complex than application logic, because they are designed to do different things. What would be a simple exercise in iteration or recursion elsewhere can easily become a bit more of a headache.
Problems that can be resolved, but they are problems.
I would like for this to be the case. I was DBA of a Postgres database for a LIMS almost 2 decades ago. Back then you could code functions for the database to execute on data, and it was very powerful, but also very clunky. The tools to do software development in the database were not mature at all. A lot has changed in the past 20 years and SQL has evolved. Do you think SQL will expand that much, or will there be APIs built into the database? Near-data functions are powerful and useful, but I would want my development environment to be more like version-controlled code than "built-in".
I wonder if near-data functions on small databases is the solution to the limit of statelessness that you have with functions as a service.
With “traditional” RDBMS systems, putting a lot of code in the DB led to a lot of scaling issues where you'd need gigantic machines with a lot of RAM and CPU, because the DB was doing so much work. It was expensive and clunky to get HA right.
In more modern DBs being distributed horizontally, this approach may see a rebound. The big “but” is still costs, in my experience in AWS as an example, managed Postgres Aurora was surprisingly expensive in terms of monthly cost.
The resources I'm familiar with are I/O, memory, and CPU. The only one I believe can be spared in the database by using that resource outside of the database, is CPU. When the database is far from saturated on CPU and latency and throughput are determined by I/O and memory, using CPU on some other machine that isn't the database can't possibly have any impact on latency and throughput.
The issue here is if you scale enough to saturate the database, you'll have to rewrite essentially all your code if you're a typical CRUD webapp. Basically all of your business logic is about data retrieval. There are probably some companies that can get away with this, but it would be way too expensive for most.
If I scale up to saturate the database CPU...by doing data retrieval? Setting aside my skepticism about saturating the CPU with mere data retrieval, how is that solved by moving the data to another host's CPU, when moving the data involves the very data retrieval that's saturating the database's CPU?
If you mix in compute-heavy calculations (like stochastic gradient descent) in with your pure data-retrieval, yes you'll saturate the database's CPU, and you won't have a reasonable way to scale it.
If you do it on a host that's not a database, then you can horizontally scale it. There's a reason stateless apps are the default.
Ok so if I'm not mixing in compute heavy workloads like the stochastic gradient descent described in the article, then it's less likely I'll saturate the CPU. Perhaps that won't happen at all and then I won't have to scale horizontally.
On the other hand if I'm doing stochastic gradient descent that's saturating the CPU then there's a good chance I'm doing offline training of an ML model. In that case my latency tolerances are probably much much higher. In other words, I can also avoid scaling horizontally provided I can live with longer training times. That might be a worthwhile trade-off to me given the added complexity of horizontal scaling.
I agree. I took a course in databases and SQL and was blown away by its power. With CTEs and PL/SQL you can do a lot of stuff inside the database.
I played with SQLite and its JSON columns. Once you get the hang of the syntax for walking a JSON structure you can do all sorts of neat things. Doing the same thing in Python would be tedious.
And I also believe it ended up being way faster than what I did in Python.
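For anyone curious, a sketch of the kind of thing I mean, using SQLite's JSON1 functions (the events table and JSON paths are invented):

-- pull a nested field out of a JSON column
SELECT id, json_extract(payload, '$.user.name') AS user_name
FROM events
WHERE json_extract(payload, '$.type') = 'login';

-- explode a JSON array into rows and aggregate it
SELECT e.id, count(*) AS n_tags
FROM events e, json_each(e.payload, '$.tags')
GROUP BY e.id;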
I'm watching it, it's really good. Montana makes a great point: you can move data to the models, or move the models to the data. Data is typically larger than models, so it makes sense to go with the latter.
thanks for watching! i should really up the production quality haha but also this is what i can kinda manage with my existing workload. idk how the pro youtubers make these calls interesting
Junior developers like me were uncomfortable with SQL twenty years ago. Java ORM frameworks became popular because of the object-relational impedance mismatch. I see the same kind of sentiment nowadays among newer generations, but in Python & co.
The success of the Apache Spark engine can at least partially be attributed to
* being able to have the same expressive power as SQL but with a real Scala API (including having reusable libraries based on it)
* being able to embed it into unit tests, at the low price of ~20 seconds of additional latency to spin up a local Spark master
A DBMS is really its own operating system; usually it's hosted on another operating system, one that understands the hardware.
I remember one place I worked where we had several old graybeard programmers who considered the DBMS[1] the operating system. As a unix sysadmin I had some interesting discussions with them: I was always confused and confined by the DBMS, and they felt the same about unix.
1. Unidata, if curious: a weird multi-value (not relational) database, very vertically integrated compared to most databases today.
I've created a big system with only SQL in the past like you just said. I wouldn't do it again because of these two pain points that I had to deal with:
1. it's really hard to debug SQL queries and stored procedures (at least it was in Postgres 11)
2. when you hit a performance bottleneck, you don't have much control over it - parallelizing is hard and you have to trick the query planner to do what you want (and it doesn't work sometimes)
Not challenging your experience but genuinely trying to learn from it, in broad strokes what kinds of things were you doing in these stored procedures?
1) No static typing
2) Updating the logic requires migrations
3) You put the logic into the place which is the hardest to scale and many times a single point of failure
4) Cannot compose the code effectively and general verbosity
It's one of those ideas that sounds great on paper and maybe works in some smaller problems but as you go up in complexity things get worse and worse
It is tempting to combine a web server, a database, and some imperative language with built-in data-oriented / SQL features into a single executable, call it an application server, and have it communicate with the outside world using, for example, JSON-based RPC. I think there were / are some products in this area, even with a built-in IDE (like Wakanda).
And using PostgREST [0] you can serve your PostgreSQL database as a REST API.
And if you throw foreign data wrappers / Multicorn into the mix, you can map any other data source into your PostgreSQL DB as a table.
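The setup is pleasantly small. A sketch with postgres_fdw (host, credentials, and schema names made up):

CREATE EXTENSION IF NOT EXISTS postgres_fdw;

CREATE SERVER other_db
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'other-host', dbname 'analytics');

CREATE USER MAPPING FOR CURRENT_USER
    SERVER other_db
    OPTIONS (user 'reader', password 'secret');

CREATE SCHEMA remote_tables;

-- every table in the remote schema is now queryable as a local table
IMPORT FOREIGN SCHEMA public FROM SERVER other_db INTO remote_tables;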
Hehe I was wondering if someone would catch that. Rest assured, I know the difference between online and stochastic gradient descent. I admit I used stochastic on Hacker News because I thought it would generate more engagement.
What are some adversarial cases for gradient descent, and/or what sort of e.g. DVC.org or W3C PROV provenance information should be tracked for a production ML workflow?
We built model & data provenance into our open source ML library, though it's admittedly not the W3C PROV standard. There were a few gaps in it until we built an automated reproducibility system on top of it, but now it's pretty solid for all the algorithms we implement. Unfortunately some of the things we wrap (notably TensorFlow) aren't reproducible enough due to some unfixed bugs. There's an overview of the provenance system in this reprise of the JavaOne talk I gave here https://www.youtube.com/watch?v=GXOMjq2OS_c. The library is on GitHub - https://github.com/oracle/tribuo.
- does not allow for easy and clean importing of modules/libraries
- is not easy to write tests for
- has limited support for a debugger
- lacks a consistent style for such large queries (plus most textbooks cover fairly simple stuff), which means it's hard for a developer to start reading someone else's code (more so than in other languages)
- clearly indicates in its name that it is a Query language.
Save yourself the trouble and all your collaborators the pain of working with this code in the future, of trying to add new features, of trying to reuse it in another project.
If you want to operate near the data, use PL/Python for PostgreSQL.
- PostgreSQL extensions are easy to include and use.
- pgTAP exists for testing.
- A large query in SQL is not made smaller by translating it into an ORM DSL.
- If "Query" in "SQL" means it's for querying data, then evidently "Query" not being in, say, Java or Python means those languages are NOT meant for querying data. If that's true, then why would you use them for querying data?
> If "Query" in "SQL" means it's for querying data, then evidently "Query" not being in say Java or Python means those languages are NOT meant for querying data
If X then Y does not imply if not X then not Y. Java and Python do not indicate a purpose in their name because they are general-purpose.
Re modules/libraries: I meant it is not easy to write a piece of SQL code and then import it into several queries to reuse it, or lend it to someone else for use on their own schema. It is possible, yes, but seldom done, because it is hell. PostgreSQL extensions could be used for this purpose, but developing an extension requires a different set of SQL statements (or luckily, Python or C) than those used by the user of the extension, which makes compounding them a bit hard. Not impossible, just hard to maintain.
About your last point, I don't think that was my line of reasoning, but, yes, for the love of what is precious, don't open SQL files as python/java file objects and then parse and rummage through them to find the data you are looking for. Not impossible, just hard to maintain.
Thanks for pointing out pgTAP, didn't know about this.
For some reason, data-science folks haven't yet caught up with ORMs. I don't know if this is good or bad, but (as the OP shows) they are more used to rows and columns (or graphs) than objects. Maybe that will change one day.
As for sharing SQL, that's easy to do within a database using views. Across databases with possibly different data models, that's not something I personally ever want to do.
Right now AI algorithms kind of look imperative and stateful to me, like state machines. But there should be a functional or relational representation, where pattern matching would be used to derive the current state instead.
It's trivial to go from functional to imperative representation, but often nearly impossible to go the other way. That's why monadic (sorry if I'm mincing terms) logic is so troublesome. Mutable variables, futures/promises, async, etc can't be statically analyzed, which is why most code today is difficult to parallelize and stuff like C++ optimizers don't even increase speed to the next order of magnitude. But functional languages have nearly infinite optimization potential through divide and conquer approaches like sharding and scatter-gather arrays that can run internally without side effects. In other words, we can throw hardware at SQL for linear speedup since it's embarrassingly parallel, but might have limited success optimizing something like Python.
> This is great! Moving away from the proprietary nature of GPUs and complex math gatekeeping should help democratize AI.
I understand the sentiment, but you can run ML on plain CPUs; in fact it is easier to do so, you just don't get the massive speed benefits of GPUs. Hugging Face's public models are all run on CPUs to save costs.
> Right now AI algorithms kind of look imperative and stateful to me, like state machines.
The maths is not stateful or imperative, only our implementation. Much like how functional languages are executed on a stateful CPU.
Your question "Has anyone converted stuff like gradient descent to set theory?" doesn't make sense from the perspective that gradient descent uses differential calculus to find min/max points of an objective function and differential calculus requires a lot of additional assumptions on top of set theory.
That being said, current deep learning libraries such as JAX and PyTorch use automatic differentiation to efficiently compute the partial derivatives used by optimization algorithms such as gradient descent, and it's not clear to me what the level of effort would be to convert that to something that could run efficiently in SQL.
Thank you, I knew that derivatives were used in gradient descent, but automatic differentiation is new to me. The Wikipedia article is fairly opaque compared to what I learned in school, but this tidbit stood out:
Source code transformation (SCT): the compiler processes source code so that the derivatives are calculated alongside each instruction.
Operator overloading (OO): operators are overridden so that derivatives are calculated for numbers and vectors.
Based on the state of software these days, I'm guessing that OO (the "bare hands" method) is what's mainstream. It would be far better IMHO to use SCT, since it's a universal solution that doesn't require manually refactoring programs.
But stuff like SCT might be considered metaprogramming, which seems to have fallen out of fashion. I grew up with C++ macros and templates, so I feel that this is somewhat tragic, although readability and collaboration are much better today. A modern example might be something like aspect-oriented programming (AOP):
I once used the AOP library AspectJ to trace a Java app's execution, since Java made the (unfortunate) choice to focus on objects rather than functions, which makes it generally a poor fit for data processing, due to its high use of mutable state within objects (mutable state is what limits most object-oriented projects to around 1 million lines). Meaning that I couldn't remember the program's context as I was stepping through it, and had to analyze traces instead. AspectJ allows one to hook into the code without modifying it, sort of like a debugger, so that stuff like function calls and variable mutations can be watched:
If we had that, maybe we could automatically generate derivatives for the set operations. Then either access them as variables in stored procedures, or possibly as something like views or via metadata stored somewhere like MySQL's INFORMATION_SCHEMA.
I don't really know, but maybe these breadcrumbs could be helpful.
Most ML algorithms are numerical approximations, so yes, they look procedural. You can try and find analytical solutions, but except for the simplest models they’re intractable.
This is really interesting, but a basic part I don't understand: What would it actually look like to run this on a live dataset?
If I understand correctly: you'd run the recursive query, it produces results for every step, effectively showing you the progression of output over time, and then once it hits "present day", it completes and stops?
How would you generate results going forward? I.E. A minute elapses after the results return, do you have to re-run the whole query for all time?
Good question. I touched upon this in the conclusion. Basically, if you run this in a streaming SQL database, such as Materialize, then you would get a true online system which doesn't restart from scratch.
i’m trying to learn here so please pardon my ignorance. wouldn’t a pre-aggregated set affect the new aggregate result? i suppose you could store avg, sum, count and then add the new value(s) to sum, new count to count, and recalculate average from that. or even just avg and count and then re-aggregate as ((avg*count)+new values)/(count + new values count) but i didn’t know if there’s a better way to process new values into a set of data that’s already been aggregated
Yep that would be what I would do - effectively a functional approach where no memory is required besides the current set and iteration.
A big part of materializing datasets for performance is finding the right grain that is both easy enough to calculate and also can do nice things like be resumable for snapshots.
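A sketch of that in SQL: keep (count, sum) per group, merge new batches in, and derive the average on read (Postgres upsert; table names invented, and stats needs a unique constraint on grp):

INSERT INTO stats (grp, cnt, total)
SELECT grp, count(*), sum(value)
FROM new_rows
GROUP BY grp
ON CONFLICT (grp) DO UPDATE
    SET cnt   = stats.cnt + EXCLUDED.cnt,
        total = stats.total + EXCLUDED.total;

-- the average never goes stale because it's derived, not stored
SELECT grp, total / cnt AS avg FROM stats;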
DuckDB is what I used in the blog post. Re-running this query simply recomputes everything from the start. I didn't store intermediate state that would allow starting off from where the query stopped. But it's possible!
Postgres has excellent support for WITH RECURSIVE, so I see no reason why it wouldn't. However, as I answered elsewhere, you would need to set some stateful stuff up if you don't want the query to start from scratch when you re-run it.
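For a flavor of what the recursion looks like, here's a self-contained toy (not the article's full query) that minimizes (w - 3)^2 with 100 steps of gradient descent:

WITH RECURSIVE gd (step, w) AS (
    SELECT 0, 0.0
    UNION ALL
    -- w <- w - lr * f'(w), with f'(w) = 2 * (w - 3) and lr = 0.1
    SELECT step + 1, w - 0.1 * 2 * (w - 3)
    FROM gd
    WHERE step < 100
)
SELECT step, w FROM gd ORDER BY step DESC LIMIT 1;  -- w converges to 3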
I for one would welcome a second blogpost with the stateful example. I’ve been doing running totals (basically balances based on adding financial tx per account) but had to do it in some client code because I couldn’t figure out how to do a stateful resume in Sql
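The running balance itself was the easy part, e.g. with a window function (sketch; table and column names invented):

SELECT account_id, tx_at, amount,
       sum(amount) OVER (PARTITION BY account_id
                         ORDER BY tx_at) AS balance
FROM transactions;

It's resuming from a stored snapshot without recomputing all history that I couldn't work out.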
From the start I assumed this is a nice playful "Hexing the technical interview" kinda joke. But given the tone of the article, and some of the comments here… Uh, this cannot be serious, right?
Why couldn't it be serious? Your database already has your data so you don't have to stream it to another machine for training. If you have a developed model that needs retraining on new data, save yourself the network bandwidth (and time!) and just make the DB retrain. This doesn't replace the exploration phase of model-building but if you already know gradient descent works for you then what's wrong with this approach?
Incredible post. I laughed when I saw the title, snickered at the first paragraph, and then proceeded to be blown away by the rest of it. Thought I was in for a joke and instead I'm thinking about the nature of ML Ops and what it's become.
My sentiment exactly. The premise of the article comes across as a little naive, because there are so many fine-tuned libraries for specialized hardware architectures that already do this computation very efficiently.
However, it did make me wonder what this might look like on a gpu-accelerated database engine that is designed to leverage the SIMD parallelism of GPGPU architectures.
Beyond using SQL/NoSQL databases for CRUD apps I am not a "database guy", so I'm not sure about the feasibility, but it would be interesting to see it implemented.
In the comments here so far, we see a pattern we've seen before. When someone suggests doing something in SQL, there's a lot of concern about SQL being a very limited programming language where it's hard to do proper engineering.
Here's what I would really love to know: why is it that SQL is, to first order, the only language used to interact with databases, and SQL has about the same features as it did in the 70s? It seems analogous to the general-purpose programming world having stopped with C.
I believe the PC revolution gave a few generations of programmers the mistaken belief that imperative languages with familiar features (Basic, Pascal, C, Perl, Python, Java, etc.) are the only or the best way to interact with computers. Forms of interaction developed earlier (Forth, Prolog, SQL, etc.) were learned later, grudgingly, if at all. SQL was viewed with disdain, a necessary evil to be held at arm's length. That stalled evolution of SQL, the success of other query languages, or really anything that didn't fit the Algol-style "general purpose programming language" blinders.
Mercifully, finally, the cold war against SQL in particular and against relational databases in general seems to be thawing. Articles like this one exemplify that trend.
Plenty of SQL features have been added since the 70s, notably window functions (which TFA relies on heavily). Most of the major databases are from the 80s, and even new kid on the block MySQL has been around since the mid 90s.
SQL draws a very hard line between the expression of the query and the AST that is used in the actual implementation. Database vendors like this aspect because they are free to implement whichever optimizations they want, but application developers want to build queries programmatically and optimize them themselves before passing them to the db engine, hence the tension in threads like these.
> why is it that SQL is, to first order, the only language used to interact with databases, and SQL has about the same features as it did in the 70s?
Because SQL is effectively a domain-specific language. If you added 100+ additional keywords/functions/etc., do you think it would be easier or more difficult for the average developer to build something clean with it?
I look at SQL like a pile of bricks. You don't want complicated, unicorn bricks or you won't be able to fit anything meaningful together. Experienced masons almost certainly prefer their materials to be as dumb and consistent as possible.
> You don't want complicated, unicorn bricks or you won't be able to fit anything meaningful together. Experienced masons almost certainly prefer their materials to be as dumb and consistent as possible.
And now we know what's gone wrong with Lego for the last two-three (close to four?) decades.
Funny – and sad! – how popular this shit has been on sites like this, though. Dare one hope that that popularity might begin to wane? Idunno
On-stream inference with something like "continuous" data in the same structure is, I think, the final material form for "AI", so this is a great step towards that.
This is great. The only thing I dislike about this is using these variables to try to predict Adj Close when they are not at all correlated.
There are countless meaningful correlations in financial data that would have been just as easy to play around with. One truly valuable example would be to look at trading multiples of comparable companies. Sticking to P/E would be easier as P is easily observable and forward-looking EPS estimates are generally always available. This would limit the exercise to more mature companies than the ones commonly discussed on HN but would make it actually meaningful
In hindsight this is harder than it seems if you don't already have access to data, so I'll try to be as informative as possible in my response
It's hard to find this information out there, so here's ~all you need to know.
Data is usually behind paywalls, unfortunately. Industry standards are Bloomberg terminal (ridiculously expensive, 5 digits $), FactSet (very expensive, 4 digits), Capital IQ (expensive, not sure)... but there are a number of up-and-coming startups trying to disrupt the space so you may be able to grab data from them. I think https://atom.finance has a 7-day free trial you could use to play around with.
P/E simply means the company's _P_rice per share divided by _E_arnings per share. Cancel out the "per share" terms and you get total market capitalization (which is the value of the total equity) divided by net income (since "earnings per share" really means "net income per share")
So the "P" is easy to get. It's your Adj Close.
The "E" is trickier as it can mean a lot of things. Diluted EPS from financial statements? Last year's EPS? Management's guidance for EPS? None of those are actually correct even if they are all "EPS"
Importantly--and contrary to 99% of the info you will find online--the most relevant EPS number are forward estimates of EPS, usually for the next twelve months ("NTM"). That is based on an average or median of analyst estimates which is called "consensus". These are analysts from financial institutions who build their own little models based on their own views of where the business is going to go, informed by recent earnings, management's color in earnings calls and filings, etc.
Believe it or not, as hairy as that sounds, EPS is fairly easy to get as it's a metric that has less room for interpretation than, say, EBITDA.
So you're not going to go out there, read all these (paid) analyst reports, find their EPS, calculate the median, etc. Bloomberg, Capital IQ, FactSet do this for you and it's easily observable for the end user (that's their business).
The thing is, as you may have guessed, "next twelve months" are a moving target across time. Analysts usually provide estimates for the current fiscal year (i.e. FY 2023, ending 12/31/2023 for most companies) and the following year, ending 12/31/2024. Let's call these FY0_EPS and FY1_EPS, for simplicity
You might be tempted to just take a moving average of these two estimates, so that on 1/1/2023 it is 100% of FY0_EPS + 0% of FY1_EPS, on 1/2/2023 it is 99.9% + 0.1% and gradually "move forward in time" as the days pass. That sort of works (and definitely checks the box for a proof-of-concept like in your post) but for the sake of completeness, I'll just say that the right-er approach is to only "move forward in time" when new earnings are released. So it doesn't matter if we're in 1/1/2023 or 2/1/2023--what matter is what is the latest reported quarter. Take Coca-Cola for instance (https://www.bamsec.com/companies/21344/coca-cola-co). Let's roll the tape backward one year. They reported FY 2021 earnings on 2/22/2022, at which point analysts published new estimates in revised models, so on from that day forward until the next quarterly earnings we take 100% FY0_EPS + 0% FY1_EPS, in which these correspond to estimates for FY 2022 and FY 2023, respectively.
On 4/1/2022, Coca-Cola reported Q1 2022 results, analysts published new estimates, and we now take 75% FY0_EPS + 25% FY1_EPS. On 7/1/2022, we move forward another quarter so 50% + 50%, then 25% + 75% starting on 10/26 and then back to square one with 100% + 0% except FY0_EPS now means FY 2023 vs FY 2022 previously, and FY1_EPS means FY 2024
So your table is something like (I'm making up numbers)
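Date        FY0_EPS  FY1_EPS  NTM0_Weight  NTM1_Weight
2/22/2022   2.45     2.60     100%         0%
4/1/2022    2.48     2.62     75%          25%
7/1/2022    2.50     2.65     50%          50%
10/26/2022  2.52     2.68     25%          75%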
With that you can take NTM0_Weight and NTM1_Weight to calculate NTM_EPS by multiplying those weights by FY0_EPS and FY1_EPS. And then can take AdjClose / NTM_EPS to calculate P/E
Why is this useful? Because in theory you can take the average P/E of companies X, Y and Z in one industry and compare it to a fourth company W. Is W's P/E multiple above or below the industry average? You now know if they are over or undervalued, respectively, which means you know if you should buy or sell that stock (if you believe you picked the right "comparable" companies in that industry)
This is just one example... there are all sorts of similar analyses done daily in the financial services industry. I'm not saying it's easy to extract alpha from trading on these, but that's the framework
My pleasure! I've spent the better part of the last decade doing this stuff and I appreciate how hard it is to find resources on it, so thought I'd share since you mentioned you were interested in learning
See PDF page 14. Note the lines called "Composite P / NTM EPS" which they built as a blend of American Eagle's, Tilly's and Zumiez's P/E multiple, which are companies X, Y and Z in my comment above (for some reason they gave AE double the weight which is unusual) and compared it to Heat's P/E multiple (Heat was the codename for retailer Rue21, or hypothetical company W in my example above)
Just to disambiguate, here machine learning = linear regression. So no auto-grad needed as gradients are explicit and simple.
Still interesting though.
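Concretely, for a single example (x, y) with prediction w·x and squared loss, the whole calculus the SQL has to express is:

    loss:     L(w) = (y - w·x)^2
    gradient: dL/dw = -2 (y - w·x) x
    update:   w <- w + 2 * lr * (y - w·x) x

which is just arithmetic on columns, hence no autograd needed.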
> Also, I can't justify why, but my gut feeling is that the database should be an order of magnitude faster than Python, provided the implementation were to be improved.
Would be curious how that could end up being the case. Perhaps if NumPy wasn't used at all? That would mean no vectorization or optimized storage.
Would be interesting to see how it scaled with length and dimensionality
We've been doing this type of gradient descent in production for a couple of years now, and we've been very happy with it. A few insights from my experience:
One of the main benefits of doing differentiable programming over a relational database is that you can use joins to traverse the normalized data schema, and this acts as a very effective prior for your model. For example, you can learn parameters associated to shirt colors by having a parameter in the color table, and joining it with the shirt sales table (through the shirts table). And knowledge of the table structure also lets you improve convergence, for example by adding a factor to compensate for the fact that not all colors have the same number of sales (and therefore have been updated fewer times during each epoch). Here's a paper from last year: https://arxiv.org/abs/2209.03771v1
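As a toy illustration of the join idea (invented names, and a deliberately silly one-parameter-per-color model): predict units sold from a per-color effect alone, and take one gradient step on squared loss across the two joins:

UPDATE colors c
SET param = c.param + 0.01 * sub.grad
FROM (
    -- minus d/dparam of (units - param)^2, averaged per color
    SELECT sh.color_id, avg(2 * (s.units - c2.param)) AS grad
    FROM sales s
    JOIN shirts sh ON sh.shirt_id = s.shirt_id
    JOIN colors c2 ON c2.color_id = sh.color_id
    GROUP BY sh.color_id
) sub
WHERE sub.color_id = c.color_id;

The avg() per color is also where the compensation for colors with few sales naturally slots in.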
Automatic differentiation is an absolute must-have. Once you start having a few complex joins and aggregations, it's too hard to derive the gradient by hand. Doing automatic differentiation on relational queries requires some adjustments (in particular, because you cannot use a tape), so you end up having to define a subset of relational algebra that is closed by automatic differentiation. The general ideas were presented in https://ceur-ws.org/Vol-2971/paper07.pdf
On the other hand, while it's possible to do SQL-to-SQL automatic differentiation, the resulting queries have poor performance, so it's better to go one step lower (at the level of the executor for an already-planned query) and perform the automatic differentiation there. And we had an excellent intern work with us on dedicated parallelization for gradient descent queries: https://blog.lokad.com/pdf/reproducible-parallel-sgd-ziyad-b...
Finally, in practice, it usually doesn't matter if the gradient descent is stochastic, so long as you run several passes over the data.
I'm very interested in differentiable programming on relational languages, as the mainstream research concentrates on unstructured, low-information-density data (images/sound/text) when there are many domains where the data is more dense and structured, and (because of that structure) is stored in a database.
There was also this neural network module for Redis, if you want to do training and inference in a Redis DB (quite old, btw):
https://github.com/antirez/neural-redis
Many ML models are trained offline, and some online training is just fine-tuning a model that was pretrained offline. Hence, I see no particular reason to worry about doing this "in prod."
That's too bad, I would have expected it to work out of the box. Other than rewriting the query in a different way, I'm not sure I see an easy workaround. Are you still working on this?