Composable SQL (borretti.me)
84 points by earnestinger 7 hours ago | 31 comments

The article talks about how CTEs are efficiency minefields and how Postgres doesn't do a good job of pushing down predicates, for example.

TBH, isn't the solution to improve the Postgres engine in this regard?

For example, in BigQuery I often use CTEs and see predictable speedups, explained by predicate pushdown, when slapping a WHERE clause on the consumer of a CTE, etc. BigQuery has plenty of performance gotchas, but I don't think CTEs are one of them; they seem to be transparent to the optimiser.

Something else that BigQuery supports is user-defined 'table' functions. These are like the 'functors' the article creates.
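For reference, roughly what that looks like (a sketch from memory; dataset, table, and column names here are invented):

  -- BigQuery table function: parameterized by scalars, returns a table
  CREATE OR REPLACE TABLE FUNCTION mydataset.visits_on(d DATE)
  AS (
    SELECT user_id, COUNT(*) AS n_visits
    FROM mydataset.web_logs
    WHERE DATE(ts) = d
    GROUP BY user_id
  );

  -- called like a table:
  SELECT * FROM mydataset.visits_on(DATE '2024-01-01');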

However, in practice they aren't very widely used, and I think this hints that they aren't as useful as the article is hoping for.

The thing I _really_ want is scoping. If only functions could be defined like CTE WITH blocks, e.g.

   WITH visits AS (
      SELECT ...
      FROM web_logs
      WHERE ...
   ),
   foo(bar INT, baz STRING) AS (
      ... function body here
   )
   SELECT ..., foo(a, b), ...
   FROM visits
   WHERE ...
Apart from the tidiness of this, it allows inlining of functions into the SQL statement body rather than as separate statements. Various tooling, e.g. dbt, often makes it really hard to work with separate function definitions, and being able to inline logic would make things possible.

BigQuery also supports user-defined aggregate functions. There is rarely a need to stray from the standard built-in aggregate functions, but when you do need them, they are brilliant.
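A hedged sketch of what a SQL-based one looks like (names invented):

  -- all parameters are aggregated across rows by default
  CREATE OR REPLACE AGGREGATE FUNCTION mydataset.weighted_avg(x FLOAT64, w FLOAT64)
  RETURNS FLOAT64
  AS (
    SUM(x * w) / SUM(w)
  );

  SELECT mydataset.weighted_avg(price, qty) FROM mydataset.orders;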

(Anyone working _on_ BigQuery reading? If so, please ping me, I have loads of feedback on what works well and where the user pain and toil is ;) )


As of Postgres 12 (Oct. 2019), CTEs are no longer materialized by default and are eligible for predicate pushdown.

The planner may still struggle but it’s not true that a CTE is an optimization fence.
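You can also control the behavior per CTE; a minimal sketch (table and column invented):

  -- default since PG 12 for single-reference CTEs: the planner may
  -- push the outer predicate down into the scan of big_table
  WITH w AS NOT MATERIALIZED (
    SELECT * FROM big_table
  )
  SELECT * FROM w WHERE key = 123;

  -- restores the old optimization-fence behavior when you want it
  WITH w AS MATERIALIZED (
    SELECT * FROM big_table
  )
  SELECT * FROM w WHERE key = 123;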

https://www.depesz.com/2019/02/19/waiting-for-postgresql-12-...


SQL cannot be "fixed". It's too broken, too ubiquitous, too customized, too optimized. Will these functors integrate with the thousands of optimizations modern database systems rely on?

This post has the format: "SQL is extremely terribly awful at writing business logic. But if people implemented this one half-baked, unrealistic idea, then it would suck a little bit less!" The first part is spot-on: SQL is a horrible language for writing business logic. So: stop. Stop doing that.

Stop writing business logic in SQL!

Imagine you joined a new team, and they said: our program uses a bunch of XML files as its data format, and that's why most of our business logic is implemented in XPath. Just sit there and ponder that situation for a little bit. This is what you all sound like to me. Stop it!

> Testing the simplest query requires building up a massive object graph, just to test an infinitesimal slice of it.

Yes, correct, because SQL is awful at representing business logic, and foreign key constraints are business logic! The fact that an employee has one supervisor who is also an employee (FK Employee.SupervisorID -> Employee.ID) is a business constraint. (And one that can easily change, by the way.) All these database engineers who can't pull themselves away from their stored procedures will argue with me to their graves about foreign key constraints, but they're generally terrible. They're a business constraint that just happens to be supported by the database engine, but they're still a business constraint. Stop putting your business logic in your persistence layer.
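For concreteness, that self-referencing constraint as DDL (schema invented):

  CREATE TABLE Employee (
    ID           INT PRIMARY KEY,
    Name         TEXT,
    SupervisorID INT REFERENCES Employee(ID)  -- the "business rule" in question
  );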

We've been getting better at building programming languages that actually support software engineering every year. Use one of those. SQL is an output of one of your program layers. Just like XPath would be an output of your program. Compose SQL in an actual real programming language that isn't horrible instead of doing any of this.


100% completely agree.

You just can’t scale SQL code. If you have a small app, then sure. But then I don’t know why you’d invest so much in testing. But if you need to scale your code to more than 3 people, SQL should only be your persistent state store. You probably already have a BL layer in another, more manageable programming language, so use that.

In my team I set a rule to remove all SQL functions and views, and move all the logic to our typescript backend. That resulted in code which is much easier to test, and is version controlled.


And what are the results of your decision in terms of performance?

These days? It doesn't really matter. For most purposes, you can just hold the entire important dataset in RAM.

For example, a humongous company with 500,000 employees (Amazon) can trivially hold, on a fairly low-end server, the database of all the personnel, all their salary payments, all their employment history, badge photos, and office access logs.
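Back-of-envelope, with made-up per-record sizes: 500,000 employees at ~10 KB of structured records each is only ~5 GB; even a generous 1 MB per employee (photos, access logs) is ~500 GB, which a single commodity server can hold.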


What is business logic? What specifically is SQL bad at?

I'm not sold on your foreign key example. Constraints are great. I guess you can do that in Python? Good luck getting ACID right.

As soon as your data stops resembling a tree and starts looking like a graph, all your “business logic” looks a lot like operations on relations, and you are back where you started.


> Compose SQL in an actual real programming language that isn't horrible instead of doing any of this.

Got any recommendations? That's a serious question, I've tried many (and built my own) and some are okay but none feel remotely good enough when you consider how much progress we've made basically everywhere else in the wider world of software engineering. Relational databases are so good; they deserve better than SQL!


I don't know why you are getting downvoted, and I generally don't like a lot of business logic in the persistence layer, but representing the things you absolutely don't want another part of your app doing via FKs is... pretty normal.

If you are arguing it spreads out your business logic into multiple places, that's fair, but so does... a lot of things. Databases with decent fks and constraints have a tendency to outlive the apps that build atop them, and communicating these constraints in long lived fashions can be ... good actually.


Similar functionality already exists in some SQL implementations, notably DuckDB:

  create table macro author_books() as   
  (
    select
        a.name as author,
        b.title,
    from
        a
    inner join
        b on b.author_id = a.author_id
  );
is the same as the example in the article, and it is called like:

  with a as (...),
       b as (...)
       from author_books();

https://duckdb.org/docs/sql/statements/create_macro.html#tab...

I really dislike leading off with "create table". My brain is unable to read it except as DDL.

I understand the primary premise about the difficulty with testing SQL and fully agree with it.

I do have a question though - while I understand how functors can help make the problem easier to tackle, I am not sure I fully understand how functors are different from a similar existing tool - stored procedures.

Some DB flavors:

- can take tables as arguments to stored procedures (see the sketch below)
- can return tables
- offer the additional benefit of being able to run almost all flavors of SQL commands (DDL, DML, DQL, DCL, TCL) in those stored procedures
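For the first point, SQL Server's table-valued parameters are one example (a sketch; type, procedure, and table names invented):

  -- T-SQL: a user-defined table type passed into a procedure (must be READONLY)
  CREATE TYPE dbo.IdList AS TABLE (id INT PRIMARY KEY);
  GO
  CREATE PROCEDURE dbo.GetBooksByIds @ids dbo.IdList READONLY
  AS
  BEGIN
    SELECT b.*
    FROM dbo.books AS b
    JOIN @ids AS i ON i.id = b.id;
  END;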

Netezza stored procedures, for example, can do what you describe here:

https://www.ibm.com/docs/en/netezza?topic=nsg-return-result-...

As can SQL Server & Oracle (which both return cursors, which are just ordered tables):

https://learn.microsoft.com/en-us/sql/relational-databases/s...

https://docs.oracle.com/cd/B28359_01/appdev.111/b28843/tdddg...

Am I missing something fundamental here? How are functors different from stored procedures? To me, they seem to be just a sub-class / sub-type of stored procedures.


The goal is that the composable parts get woven into an efficient, planner-friendly query. Stored procedures completely undermine that unless something very exciting has happened since last I checked (SQL Server, but probably applies to all of them). You will likely end up in a "row by agonizing row" situation.

(Okay, maybe you can find a few cases where stored procs work decently, but in general both composability and performance will be much worse than the proposed functors.)
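For anyone unfamiliar with "row by agonizing row", a rough T-SQL contrast (table and column names invented):

  -- RBAR: a cursor touches one row per iteration; the planner
  -- cannot optimize across the loop
  DECLARE @id INT;
  DECLARE c CURSOR FOR SELECT id FROM dbo.orders WHERE status = 'open';
  OPEN c;
  FETCH NEXT FROM c INTO @id;
  WHILE @@FETCH_STATUS = 0
  BEGIN
    UPDATE dbo.orders SET status = 'expired' WHERE id = @id;
    FETCH NEXT FROM c INTO @id;
  END
  CLOSE c;
  DEALLOCATE c;

  -- set-based equivalent the planner can optimize as a whole
  UPDATE dbo.orders SET status = 'expired' WHERE status = 'open';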


OK, this I understand, that is a good insight - cursors are row-processing based, so it's gonna be slow.

I think Netezza, SQL Server and Oracle all do cursor-based processing "by default", so this makes a lot of sense. I suspect they all have bulk-operation capability, but I can't immediately think of how I would have worked bulk processing in a way that maps to this article - maybe something like analytic functions (windowing, partitioning etc.), which are definitely not row by row.

Having said that, the examples I see for actual testing in the article are DQL / DML, so they would be multi-row processing by default... yes, the functor definition / creation is a DDL process, but it is a "do once and reuse the definition" thing (like, as the author correctly observes, a view, which is the point of functors), and the functor in use would just be DML. In which case, functors go back to looking like stored procedures...

I also understood composability as being built into SQL already - for example, in Oracle, packages allow composability of stored procedures; triggers, sequences etc. allow composability of DML; and views allow composability of queries and tables - which the author points out in the article.

With functors, DDL, DML, DQL, DCL and TCL would still be the only command tools available, unless a new command language was invented for SQL testing - let's call it something like DTL (Data Test Language) - with a whole bunch of new SQL keywords, capabilities and functionality built right into the core of the DB engine and optimized for what functors are trying to achieve.


They seem somewhat like stored procedures, but not stored? As in a query can contain a functor in it and then use it immediately. I didn't see those `create functor` statements as anything other than ephemeral - or am I wrong?

EDIT: also stored procs that use imperative logic and cursors can be quite a bit slower than queries that achieve the same logic - the capability here is purposefully a subset of that and is just to help build standard SQL queries that can go through the standard query planner.


I think they have to be long-lived, else they cannot make sense for performant testing - i.e. they are created as DB objects, using DDL, in the same way tables, views, functions etc. are.

They can certainly be created at test run time, but that would slow things down a lot - you would essentially be creating a ton of objects every time you run the tests, which means having a setup that checks whether they exist, tears them down if they do, or fixes them if they don't match spec (e.g. column and data type changes, etc.).

The more I think about this, the more complicated I realize it would be to manage this dynamically:

You essentially have to build a test harness environment that figures out your testing elements dynamically from your data environment, with some kind of parameterization engine and data set to tell it what to look for so as to "make functors and run them" (e.g. all PKs of FKs, or all columns starting with a certain prefix, or all columns of a certain data type), gets the most up-to-date definitions of those elements from system tables, and uses that data to create, update, or drop functor objects... wow, ok, this is getting complicated, I am going to stop now before I see the void.


Malloy is a newish data language with semantic models... Perhaps what semantic SQL would look like: https://www.malloydata.dev/

This is what Kusto Query Language looks like:

  StormEvents
  | summarize
      TotalStormsInState = count(),
      StormsWithCropDamage = countif(DamageCrops > 0)
      by State
  | extend PercentWithCropDamage =
      round((todouble(StormsWithCropDamage) / TotalStormsInState * 100), 2)
  | sort by StormsWithCropDamage

I find it to be much more readable. With SQL, CTEs are as close as you can get.
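For comparison, a rough CTE translation (Postgres-flavored; schema assumed from the KQL above):

  WITH per_state AS (
    SELECT
      State,
      COUNT(*) AS TotalStormsInState,
      COUNT(*) FILTER (WHERE DamageCrops > 0) AS StormsWithCropDamage
    FROM StormEvents
    GROUP BY State
  )
  SELECT
    *,
    ROUND(StormsWithCropDamage::numeric / TotalStormsInState * 100, 2)
      AS PercentWithCropDamage
  FROM per_state
  ORDER BY StormsWithCropDamage;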

Nodding along furiously. One area where this comes up is when you want to export some piece of logic that effectively “returns a table”.

For example, I work with a source of data that requires a particularly hairy JOIN strategy. The logic is the same for any table with the same “shape”, but no one can reuse it without copy / paste.


On the testing aspect - the tSQLt assembly (https://tsqlt.org/) is an attempt at unit testing within MSSQL. It works okayish for what it is. I've at least found it to be nice when testing inputs for SQL functions and procedures, as well as some business rules within tables and views.

A few IDEs provide integrations for tSQLt: Management Studio with the Redgate extension, DataGrip, maybe others.


I prefer to keep the SQL itself as simple as possible. I've gotten by OK by composing SQL in my outer code as needed. Despite relying crazy heavily on SQL, that's rarely needed. And the outer code has tests.

It'd just be nice if Postgres could loosen the syntax in a few annoying places, like allowing a trailing comma in "WITH q1 AS (...), q2 AS (...)", or if something could be done to avoid needing the "WHERE true AND" trick.
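For anyone who hasn't seen it, the "WHERE true AND" trick lets query-generating code append each optional predicate uniformly (a sketch; table and columns invented):

  -- generated skeleton; each optional filter is appended as "AND <cond>"
  SELECT * FROM orders
  WHERE true
    AND status = 'open'                           -- only if a status filter is set
    AND created_at >= now() - interval '7 days';  -- only if a date filter is set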


SQL select-from-where queries can indeed be formalized as "pro-functors": https://arxiv.org/pdf/1602.03501

Ah, but “functor” is a name from category theory and algebraic topology. I had read along thinking that this would come up in the conclusion! Instead it seems it's not even mentioned in the sources for the name.

There’s potentially something like this in the “esqueleto” package for Haskell. But it has some limitations. It composes on syntactic fragments.

Although speaking of functors, the category theory ones, maybe categorical databases would fit the bill: https://categoricaldata.net/


Very nice. Adding this one to my list of "things other people wrote that nicely elucidate a shortcoming of SQL".

> Why would this be useful? Because SQL tables are global variables. By vanishing global variables, we automatically make every query fully testable.

And even if you don't care about testability, certainly you can appreciate that global variables are bad for composability.


I already have functors, effectively, by writing SQL queries in Clojure with honeysql. The Clojure data structures make composing behavior from reusable SQL snippets easy, and testing SQL snippets is just based on literals, as the article describes. Works really well.

I’ve done the same with SQLAlchemy in Python and SQLKata in C#.

Sadly the whole idea of composable query builders seems to have fallen out of fashion.


MS-SQL would seem to already support everything you’ve suggested.

Isn't this a table-valued function? IIRC the SQL standard still doesn't have it, but it's an almost universally supported extension across vendors.

At least in Postgres, table-valued functions can't take tables as arguments, only scalars. That's the main difference: functors can not just return tables, but take tables satisfying some interface as arguments.
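That is, what Postgres does offer is a set-returning function parameterized only by scalars (a sketch; table and column names invented):

  CREATE FUNCTION books_by_author(aid INT)
  RETURNS TABLE (title TEXT) AS $$
    SELECT b.title FROM books AS b WHERE b.author_id = aid
  $$ LANGUAGE sql;

  SELECT * FROM books_by_author(42);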

https://www.postgresql.org/docs/7.3/xfunc-tablefunctions.htm...

I thought I had written a footnote or appendix about this but I guess I forgot.



Attempt at SQL composability.


