SQL + M4 = Composable SQL (emiruz.com)
119 points by usgroup on Dec 28, 2022 | 81 comments



POSIX M4 is one of those things I really want to be good for production, but whenever I've tried using it in production it's been insufficient for anything even moderately complicated. (It's not that things are impossible, only that it's more maintainable long term to replace the m4 invocation with a Perl script or a more potent templater like Jinja2 in the cases where the macro expansion needs are large.)

In the only place where I used M4 in production and know it's still in use, the final system ended up generating both the M4 templates and the script to invoke them from another script, and then yet another script fixes up some loose ends left by M4. Effectively the M4 step is entirely redundant and was only kept because the people picking up the project after me didn't want to remove Chesterton's fence. (I learned this later.)

Maybe there's a more useful non-POSIX M4, but I really like the idea of a POSIX standard templating/macro-expansion language. Only M4 isn't it, in my experience.


Autotools has a library of m4 macros called m4sugar that is quite powerful. I can't quite recall whether it's limited to GNU M4, and as for readability, well, it's still m4...

https://www.gnu.org/software/autoconf/manual/autoconf-2.67/h...


I think that M4 can be fine in production. Many years ago, I used it to rewrite a massive Squid proxy configuration file which was mostly a collection of repeating blocks. The file was so big that no one really understood how it worked, and it evolved by constant monkey patching. Post M4, distilled to just its non-repeating elements, it was an order of magnitude smaller but also clearer, because the macros designated the patterns and anything not in a macro was actually unique.
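
For flavour, a minimal sketch of that kind of distillation (acl and http_access are real Squid directives, but the macro and the networks here are invented):

    dnl squid.conf.m4 -- illustrative, not the original file
    define(`ALLOW_NET', `acl $1_net src $2
    http_access allow $1_net')dnl
    ALLOW_NET(`office', `10.1.0.0/16')
    ALLOW_NET(`vpn',    `10.8.0.0/16')

Each call expands back into the repeated acl/http_access pair, so only the genuinely unique bits (names and subnets) remain in the source.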


m4 is fine in production if what you're doing really is just string manipulation. It just tends to be used in cases like the one in this article, where you are manipulating structured text. In that context, you run the risk of errors/complexity/fragility because you are manipulating strings rather the structures.


"rather than structures"


I chose M4 at $WORK for a context where we had config files with templated environment values and we didn't want to depend on a programming language for those templates. The configs were created and placed at container start, M4 came with the base image, and it was easy to use just to set the few values in the template that depended on the container environment.

The whole thing works by having a "template file" which looks mostly like the config, plus a couple of M4 macros to source env vars. The env var sourcing logic came from a base M4 file we included at the top of each template. There were a couple of sharp edges with escaping, but the obviousness of the whole setup made M4 fine. If M4 were a bit easier to work with I'd love to use it for something more complicated, but for our purposes it was fine.
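
A minimal sketch of that kind of setup, assuming GNU m4's esyscmd is available (file, macro and variable names are invented):

    dnl env.m4 -- included at the top of every template
    define(`ENV', `esyscmd(`printf %s "$$1"')')dnl

    dnl app.conf.m4 -- a template
    listen_port = ENV(`APP_PORT')
    db_host     = ENV(`DB_HOST')

Running m4 env.m4 app.conf.m4 > app.conf at container start fills in the values from the environment; the $$1 idiom turns the macro argument into a shell variable reference.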


"Note that M4 doesn’t support non-ASCII characters in definitions so my ideal macros had to lose their $ prefixes."

$ is ASCII character 36. I believe the intended restriction is that M4 restricts macro names to be made up of letters, digits, and the '_' character (and cannot start with a digit), which means you can't have a $ in the macro name.
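
In other words, names have to match [A-Za-z_][A-Za-z0-9_]*, so something like this works while a $-prefixed name never would (macro name invented):

    define(`orders_q', `select * from orders')dnl
    orders_q

A call written as $orders_q simply passes through to the output untouched, because the scanner never treats it as a macro name.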


I've been using dbt to achieve the same goals and it's honestly been bliss. Macros, packages, cross-referencing views, unit tests, documentation generation, snapshotting.

And I don't even swear as much at jinja as I thought I would.

dbt allows macros to resolve arguments at "evaluation" time by nesting queries (for example, getting a list of DISTINCT values so that they can then be used as column names), which is really useful. I have a particularly nasty database schema (thanks, WordPress) to deal with, and I now have plenty of macros that do the dirty stuff for me.
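
Something along these lines, for example (a sketch assuming the dbt_utils package; model and column names are invented):

    -- models/orders_by_status.sql
    {% set statuses = dbt_utils.get_column_values(table=ref('stg_orders'), column='status') %}

    select
        order_date
        {% for s in statuses %}
        , sum(case when status = '{{ s }}' then amount else 0 end) as {{ s }}_amount
        {% endfor %}
    from {{ ref('stg_orders') }}
    group by order_date

The set line issues a query against the warehouse when the model is compiled, so the distinct status values become real column names in the rendered SQL.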


Intriguing! I would like to learn more about how well dbt can "take over"/reverse engineer an existing database.

> What is dbt?

https://docs.getdbt.com/docs/introduction

Collaborate on data models, version them, and test and document your queries before safely deploying them to production, with monitoring and visibility.

https://docs.getdbt.com/docs/supported-data-platforms

--

core: https://github.com/dbt-labs/dbt-core - Python, Apache 2 license

cloud: https://www.getdbt.com/pricing/


I've been using a bit of dbt to smooth some of the edges of an existing SQL Server analytics solution.

dbt's biggest strength is that you can incrementally expand it to fill the gaps in your existing solution: some source data quality tests here, a few automatic history-of-changes tables powered by dbt snapshot there, and with enough context you can then create push-button documentation and data-lineage graphs, which tends to be a lot more documentation than most companies' BI teams maintain natively.
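
For reference, one of those history-of-changes tables is just a snapshot block like this (a sketch; source, key and timestamp names are assumptions):

    {% snapshot orders_snapshot %}
    {{ config(
        target_schema='snapshots',
        unique_key='order_id',
        strategy='timestamp',
        updated_at='updated_at'
    ) }}
    select * from {{ source('erp', 'orders') }}
    {% endsnapshot %}

dbt snapshot then maintains the valid-from/valid-to bookkeeping columns for you on each run.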

The downsides that I have found so far (subjectively) are basically that it's wholly geared for batch operations, it seems to expect to export to only one data warehouse and thus lacks a "many-to-many" data-mesh-like mode, and it wasn't easy to hack the documentation page to describe things like "the ETL process that populates this table". Also, it does not seem to have a way to define or create source tables if they do not exist (so it will not black-start a project for you; it's really only for existing data), and if your code is already checked in to SSDT you might have to move some code around.

...But I have complex problems due to $CURRENT_JOB's existing code base, organizational structure and business needs. (And yes, I am really liking my job.)

My needs aside, dbt is refreshingly easy to get started with, it provides value very fast relative to time invested, and you could do far worse than to spend a day or a week seeing what small annoyances dbt could help you with.


I especially enjoy being able to write "unit tests". Since I tend to mess up my SQL, or some code change might taint the data in a way I can potentially expect (null values, for example), I can add some pretty high-level checks like "the sum of all order line items should match the sum of the prices of all line-item products", built out of two separate data sources, and make sure my analytics stay in range. All checked into git.
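
In dbt terms that can be a "singular test": a SQL file under tests/ whose returned rows count as failures. A sketch with invented model and column names:

    -- tests/order_totals_reconcile.sql
    with line_totals as (
        select order_id, sum(quantity * unit_price) as total_from_lines
        from {{ ref('order_line_items') }}
        group by order_id
    )
    select l.order_id
    from line_totals l
    join {{ ref('orders') }} o on o.order_id = l.order_id
    where abs(l.total_from_lines - o.order_total) > 0.01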

I do agree that the iterative workflow could be a bit smoother, and the doc side of things is a bit rough. I expected the project to gain a higher velocity than it seems to have right now.


I just started incrementally throwing views at our existing schema, mostly lifting SQL out of the code base and committing it to the dbt repository. From there, I started heavily annotating the columns, and starting to write unit tests for the views.

This uncovered a ton of hidden assumptions and dirty data which I either cleaned up, or manually flagged in the views (as in, skip the orders listed in this CSV file because they stem from the period of time where shipstation blablabla).

Since then I'm continuously adding to it, refactoring the macros as if they were code, and it really feels in line with the normal workflow of a developer.


The big problem with macro approaches is that at nontrivial sizes they become impossible to troubleshoot. You get an error from the SQL engine, but it points at the expanded output. If you've availed yourself of "composability", there's a deep tower of macros, and you typically can't easily tell from the error location and message on the expanded SQL which macro is the cause of the problem. It can become really difficult to maintain.

Good approaches here have some form of higher level validation and error messages, such that by the time the SQL engine gets your query, it runs well almost all the time.


You get this problem with any sufficiently tall sandwich of abstractions. SQLAlchemy has a query builder whose DSL is easily translatable to actual SQL statements in your head. But since you have the full power of Python at your disposal, it gets quite convenient to abstract away some of the boilerplate. And then some of the common parameters. And then some more stuff, just for good measure.

Then you try to bite the sandwich, and half the ingredients go sideways: change one parameter, and the generated SQL can look wildly different, etc.

I still think DSLs are superior to macros though; a smart DSL transformer will catch a whole class of errors, while a macro engine like M4 could happily spit out garbage with syntax errors.


I have also seen this before. The knee-jerk reaction of programmers who are subjected to the imperfection of SQL is to introduce more abstractions.

But it is perfectly possible to write correct SQL code. Just like it is possible to solve mathematical equations with pen and paper.

I try to accept SQL as it is and avoid any abstractions because they make the query impossible to read.

I write my queries top down using CTEs and test manually along the way.

And it turns out that 90% of the problems are not in the query logic but with the input data and the two villains of tabular data.

Duplicate rows and NULL values.
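
A minimal sketch of that top-down style (the DISTINCT ON dedup step is Postgres-specific; table and column names are invented):

    with raw_orders as (
        select * from orders
    ),
    deduped as (                      -- villain #1: duplicate rows
        select distinct on (order_id) *
        from raw_orders
        order by order_id, updated_at desc
    ),
    cleaned as (                      -- villain #2: NULL values
        select order_id, coalesce(amount, 0) as amount
        from deduped
    )
    select sum(amount) as total from cleaned;

Each CTE can be selected from directly while you build the query, which is the "test manually along the way" part.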


Abstract INSERT, UPDATE, DELETE and select by id, as these are common idiomatic tasks.

And then write sql statements for everything else.


One could use post-1970s macro technology, which allows far better error tracking:

https://docs.racket-lang.org/syntax/index.html


Having personally encountered a similar need to manage duplication and complexity in database code, and having previously considered templating as described in the article, I have some thoughts on the subject.

In the example given in the article, composability mainly centers around extracting and reusing the contents of the CTE. Rather than resorting to an external template preprocessing tool and introducing a "compile" stage that brings its own set of challenges (e.g. you can no longer directly review or run queries, and template syntax is not supported by IDEs), one could consider extracting the body of the CTE into a SQL view or SQL function, which could then be reused as needed. You can even nest such reusable primitives. It does require some discipline, but it is much simpler to keep straight than relying on text templates.

Depending on the database engine, and subject to some constraints, many (Postgres and SQL Server, for example) can already inline and expand such SQL constructs at planning and execution time, for instance pushing down predicates as an optimization technique.
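
A sketch of that approach, with names invented rather than taken from the article (Postgres flavour):

    -- the former CTE body, promoted to a reusable view
    create view recent_orders as
    select order_id, customer_id, amount, created_at
    from orders
    where created_at >= now() - interval '30 days';

    -- queries now compose by referencing the view instead of repeating the CTE
    select customer_id, sum(amount) as total
    from recent_orders
    group by customer_id;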


The article also abstracts the joins, a field list and an aggregate function.

It also notes the potential to do so with WHERE clauses, window functions and so on.


Poor choice of words on my part - I should've said that the main value of the refactoring as presented in the article, as I see it, is extracting an "entity", as often captured in the CTE or at other times in a subquery.

Extracting and templating the other structural components of the query in the list you provided I personally interpret as needless complexity that is likely to make things harder rather than easier, especially considering that there's no database-native alternative mechanism that I can point to, and I already explained that I consider text templating an anti-pattern (though perhaps it does have value in those instances where the database engine is not as full-featured as Postgres - SQLite, for example).


I've found text manipulation of SQL can quickly degenerate into a nightmare of concatenation, nested quoting & escaping.

I wrote up Using plain Jinja[1] or SQLAlchemy+Jinja[2] as complementary techniques. The post is a bit dated but the principles should still work.

- [1](https://github.com/gregn610/plpgsql/wiki/meta-sql-with-pytho...)

- [2](https://github.com/gregn610/plpgsql/wiki/meta-sql-with-pytho...)


I'm challenged to see the virtue of M4 for string mangling over a scripting tool (e.g. python).

Squeezing the repetition out of strings makes sense, but Yet Another Mouth To Feed (YAMTF) has to bring something substantial to the project to justify inclusion.


Isn't that the very point of picking m4, as it's always there on a POSIX system, whereas a proper scripting language might not be? It's been a while since I saw someone pick m4 when alternatives were present.

These days it might be a bit different, as you might either be operating on a very stripped OS (container), and thus might need to provide all your tools anyway, or you might have a standard image with a wider range of scripting opportunities (even Perl might be a good candidate here, as it's available on any system where you've got a non-restricted git installation).


OK, historical reasons.


Other than availability on POSIX systems, there's also the fact that the macros are just text without fanfare, i.e. no double curlies etc., so it's easy to accept them as just an extension to SQL, much as you might with a C preprocessor.


Fewer deprecation warnings?


The M4 Macro Processor by Brian W. Kernighan & Dennis M. Ritchie, 1977-07-01 https://wolfram.schneider.org/bsd/7thEdManVol2/m4/m4.pdf


If the author was willing to learn Haskell there’s a library, esqueleto, which does exactly this. It’s better than the m4 solution in that the type system will ensure the query fragments are composed correctly with respect to the typing rules of Haskell.

The drawback is that query fragments written in esqueleto’s experimental DSL are a bit funny looking compared to the plain SQL you’re trying to write. And you’d have to learn a non-trivial amount of Haskell which is a tall order.

Using m4 to this purpose is an interesting, out of the box idea. I’m curious what challenges it has in practice.


There's also jOOQ for Java. https://www.jooq.org/


jOOQ is amazing, one of the best ORM/SQL libraries I have ever worked with.


I used XSLT for somewhat similar purposes twice upon a time. It hurts to realize that that was 20+ years ago, but I was producing many dozens of megabytes of complex SQL and PL/SQL. There was no way to do that "on time" while writing everything by hand, and "refactoring" would have been laughable.


I did something similar when I figured out XSLT was for more than just front-end stuff. And I was working with Oracle as the database too; these enterprise code bases acquire huge amounts of often poorly written SQL and it feels like making at least some of the repeated queries DRY would help - but later you realize the people who need to maintain this system when you're gone are the same people who came up with the spaghetti queries in the first place.


Could you say a bit about how you'd manage a SQL repo using XSLT? It isn't obvious to me how that works.


Not OP but I reckon OP would have XML documents from which to generate SQL schema and queries, with one XSL for generating schema, and one or more XSLs for generating queries.

You could do the same with jq.


M4 is a nightmare in many respects, in my experience at least. I worked with it on a project and it was the biggest source of issues. It is hard to debug, doesn't handle i18n, has convoluted rules, and everything is a macro.


Have a look at PRQL (www.prql-lang.org)!


Nice, this reminds me of LINQ, which it looks like they acknowledge as one of their inspirations. SQL relational servers should really move more toward this kind of language.

I always thought that building a pure .NET DB server using its types and language model would be very productive, rather than translating to SQL as PRQL or Entity Framework do.


Have you seen RavenDB: https://ravendb.net/

It's a nosql/multimodal document store written in .NET and supports LINQ-like syntax.


Yes, not my cup of tea. I had some poor experience with it in the past and don't really like its design philosophy, unless something's changed. It stores data as BSON I believe, which is better than text JSON I guess, but pretty inefficient compared to something schema-based that doesn't have to store keys with values for every column/property. I didn't like the async indexing with stale results either; nice to have as an option, but the default should be sync/consistent.

It's written in .NET but doesn't really use the .NET data type/model; it's more of a JSON DB. It used to use the Windows-only ESENT data engine, but I think they got around to building their own K/V store in .NET at some point.

I am thinking something more like https://velocitydb.com


RavenDB has changed massively and your assumptions are about 6 years and 2 major versions out of date. I suggest trying it again as none of those are issues anymore.

> ".Net but not really using the .Net data type/model"

What's that mean? The .NET driver has seamless object persistence.


That means that at the storage level it's storing a JSON data model, not a .NET one. So the length of a property name matters because it's stored with every document, rather than having the type info stored in a schema and the individual object instances storing only the values, similar to how a relational DB stores column names in a schema and rows only hold data. This can save significant space and overhead.

If I am designing a .Net database from scratch with Linq as its target query language then I would want the storage engine to understand and take advantage of types to optimize storage with minimal translation rather than having to serialize/deserialize from .Net to BSON/JSON with key overhead included.


When you say object "instances" of "types" then what you're referring to is a strong schema. That works in relational databases but RavenDB is a document/schemaless database so there will always be some overhead.

RavenDB has collection-level compression across multiple documents which minimizes this: https://ravendb.net/articles/ravendb-5-0-features-smart-docu...


>When you say object "instances" of "types" then what you're referring to is a strong schema.

Yes that's exactly what I said.

.Net is a strongly typed language with a strongly typed data model, hence my point of RavenDB not being a .Net data model even though it's written in .Net. I want a .Net DB that takes advantage of the type system rather than a JSON DB I can access with a .Net client, there are plenty of those.

Document compression with dictionary training in a schemaless database is a bandaid over the fact that there is no schema. A relational DB saves a ton of space with increased performance and lower CPU (rather than decreased performance and higher CPU with compression) because it has a strongly typed schema and therefore does not need to read/write the column names and types for every row, and it can serialize column data to its most efficient form based on the type. This sort of efficiency flows through the entire system, including better cache utilization, smaller indexes, and less data sent over the wire to the client.

You can still do compression on top of a schema for even more savings, like most DBs do, but with a schema you can also use more interesting storage techniques like column storage vs row storage, which can really save space and increase performance depending on usage. A .NET DB could do similar things, again because of a strong type system.

RavenDB looks like it has come a long way and Ayende definitely knows databases and .NET, however I am not a big fan of the JSON/document database philosophy; just my opinion based on experience. I prefer a strongly typed system that has an untyped escape hatch when needed, like PG with JSONB columns. That way you can stick to strong types and reach for loose documents sparingly when needed. In a .NET DB that would be dynamic types being transparently stored with a loss of efficiency.


I'm not really sure what you're talking about then. What exactly is a .NET database? Every language has its own implementation of in-memory classes and instances, and these are translated over some protocol to a database that then translates to a on-disk format for persistence.

That's how they all work, with various trade-offs. If you want a ".NET database", the only example would be to simply dump the in-memory bytes to disk, which you can through the various serializers in the framework.


Thanks for the reminder. I knew this was reminding me of something!

Here's a previous discussion on PRQL:

https://news.ycombinator.com/item?id=31897430


Full disclosure: I'm a PRQL contributor.

There will be a big PRQL release in January which makes that previous discussion look quite dated. Best source for current state is the prql-lang.org website and the github repo (github.com/PRQL/prql).

Also try the playground - there you can run live queries on real data in your browser.


Hey, maybe you can help me with something about PRQL that I've never bothered to investigate (and I'll confess this is laziness on my part). If newlines just extend a pipeline, what separates pipelines? Do semicolons separate pipelines just like they separate queries in SQL?


I don't recall us discussing that before.

You can have named pipelines which basically become CTEs, like

    table employees_usa = (from employees | filter country=='USA')
but ultimately there is usually just one "main" pipeline/query per file/input.

Because PRQL only targets queries / DQL at the moment, I'm not sure what the sense of multiple pipelines per file would be at this point.

You're welcome to open an issue or start a discussion on the Github repo about your use case though, we'd love to hear about it!


Does PRQL allow for a shared library? Ie can fragments be saved and reused?


That's definitely on the roadmap but we're not quite there yet.

With the next release you will be able to create functions that are reusable pipeline segments (e.g. something like "top 5 items per group"). This works at the moment but is currently undocumented. Do expect there to be breaking changes on this still though as there are some issues around resolving names and namespaces etc... that we still need to work out.

There is a (currently rather small) stdlib of functions but beyond that we still need to think about a good way to share code and create reusable libraries.


I think a composable SQL alternative would be great, but if it doesn't compose I personally don't see too much value in just an alternative syntax.


Thank you for giving so much credit to dplyr in your FAQ. The influence is obvious and I think dplyr deserves more credit for its revolutionary design, which has influenced a whole new generation of data manipulation tools.

(I am not affiliated with dplyr/R/RStudio in any way, just a huge fan.)


Thinking about why the PRQL syntax is as it is, does it have something to do with chaining CTEs together? I roughly see how given a pipeline with discrete components you could translate it into a collection of CTEs which lead into each other.


This sounds very similar to what dbt does.


Do you have a link to an example of DBT doing something similar?


dbt has two mechanisms to implement the blog post. The first is macros, which closely resemble m4. The second is "ephemeral" models, which are separate SQL files that get compiled into a CTE when referenced in the main model.
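
A sketch of the ephemeral case (file and model names invented; date arithmetic is Postgres-style):

    -- models/recent_orders.sql
    {{ config(materialized='ephemeral') }}
    select * from {{ ref('stg_orders') }}
    where created_at >= current_date - 30

    -- models/customer_totals.sql
    select customer_id, sum(amount) as total
    from {{ ref('recent_orders') }}
    group by customer_id

When customer_totals is compiled, the ephemeral model is spliced in as a CTE rather than built as its own table or view.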





Wow who else has PTSD from M4 use for sendmail configuration way back when?


I find it surprising that people still use M4 for anything. I encountered M4 in my first job out of college. The company was using it for email templating. It still leaves a bad taste in my mouth.

Velocity, Jinja, Mustache... all are a breath of fresh air compared to M4.


Or autoconf. Sendmail is a thing of the past, but autoconf will still be around for quite some time.


I’ve used M4 when I’ve needed to generate a large number of configuration files for different systems from a single source.

Query languages are a great target for metaprogramming, but something like jOOQ that works at the AST level is going to be a lot better than M4. (Recent versions of jOOQ allow you to write transformations that work on that tree.)


At work we have a language called TQL (Transform Query Language) that allows us to transform T-SQL, which lets us make reusable stored procs.

I made an online demo here:

https://kuinox.github.io/TQLBlazorDemo/

Sadly we don't have any language docs...


Reminds me of JupySQL, which allows you to compose large SQL queries: https://jupysql.readthedocs.io/en/latest/compose.html

Disclaimer: My company develops JupySQL


This appears to be the domain of Dagster + dbt, with which you get data lineage, not just composition.


Gnarly queries aside, what this highlights for me is the verbosity and poor readability of SQL code. Making me look up queries being used as modules or libraries makes things more complicated, and on top of that there are all the idiosyncrasies of the templating language as well.


Things like this are why I've always been a fan of LINQ. Not only do you get excellent first-class language support for writing and composing queries, you also get a very good typechecker to ensure that things are the right shape.


At my previous job, the founder did everything in Linq. He created a reporting engine in EF6 with Linq. Reports were run against the OLTP db. All queries were in LINQ.

LINQ is a nightmare in these scenarios. Imagine 10s to 100s to 1000s of lines of indecipherable LINQ.

We spent so much time debugging these things and trying to make them performant.

After he sold the company I began replacing the LINQ garbage with inline queries with dapper when appropriate. So satisfying to see queries go from minutes to seconds to unmeasurable in SSMS.

The problem with LINQ is that it's upside down and backwards to people who know SQL. You can easily create horrible queries, and unless you're an expert in both LINQ and SQL you'll have no idea how to fix them. The problem was that in many cases it was difficult to get LINQ to create performant SQL, when writing a simple(r) query by hand was both shorter and faster.

We regularly hit the recursion limit in SQL Server and other errors I had honestly never seen in all my time using SQL Server.


The idea is good, but the implementation isn't quite ideal. It's possible to fully model SQL as it is with a query builder in a 1:1 unambiguous manner, in a way where absolutely nothing about "plain SQL" is faster.

I know SQL. I don't think LINQ is upside-down and backwards.


Whilst I hate it - Jinja is so much more accessible for SQL coders.


While I love SQL, there's something deeply twisted in the mind that combines SQL with M4.


Macros only look good at first glance, but quickly fall flat when you need to debug a query or any of the underlying components change.

The entire approach of CTEs is flawed, because you are declaring your schema in the query - by saying CTE1 is my $orders_query - instead of referring to views formalized in the schema: select * from views_tenders join views_sales join views_items, rather than a sandwich of CTEs.

Using macros with SQL is like using a Python script to autogenerate classes and interfaces in Java for $my_class and $my_interface, instead of having a formally defined hierarchy of classes and interfaces with methods and signatures.


You can always macroexpand and debug that.

CTEs exist for very good reasons. Chief among them is that they make recursive queries easy to express and understand. But also because they let you have one statement that implements what might otherwise look like temp table/view creation, population of those tables, various queries, and deletion of said temp schema, while the optimizer gets to see everything you are doing and optimize accordingly (there might be no temp tables in the optimized query). CTEs also help with readability by letting you abstract sub-queries and outdent them.
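
For instance, the recursive case mentioned above is the canonical org-chart walk, which is awkward to express any other way (table and column names assumed):

    with recursive subordinates as (
        select employee_id, manager_id
        from employees
        where manager_id is null          -- start at the top of the tree
      union all
        select e.employee_id, e.manager_id
        from employees e
        join subordinates s on e.manager_id = s.employee_id
    )
    select * from subordinates;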

CTEs do have some downsides, mainly that if you want to reuse them across statements, well, you can't.

> Using macros with SQL is like using a Python script to autogenerate classes and interfaces in Java for $my_class and $my_interface, instead of having a formally defined hierarchy of classes and interfaces with methods and signatures

We live in a world full of generated code. The alternative is much worse, so we have to be able to debug through generated code.


I agree that CTEs can help compose complex query from several ingredient queries.

But once several different queries start sharing common CTE ingredients, that is an indicator that you are relying on an implicit (informal) schema, which should be formalized.

Just take your CTE ingredient queries and declare them as views, isolated to your namespace. And let everyone reuse your views, instead of copy-pasting CTE ingredients, or using code-generation to achieve the same.

The benefit is a single source of truth - there will be only one definition of the CTE subquery, and it can evolve/extend independently while letting everyone reuse your parts.

I assure you - if you take your CTE with 4 subqueries, and instead create 4 views, the query plan will be the same regardless of using CTEs with code-generation or using views.

The benefit is you don't have to use code generation, and each view will be testable/verifiable.


Has anyone attempted to do anything like this in Racket?


Ecto in Elixir/Phoenix world provides much of this.


Hmm I wonder what's the synergy between this and something like PostgREST or Directus.


CTEs considered harmful. This query will run normally as long as those CTEs stay unmaterialized. But all you need is a single intermediate CTE and, due to the "two-plus references cause CTE materialization" rule in Postgres, the query will start to load an entire table into an unindexed set and use that for the query. Basically, that disables all indexes for that table.

Idk why Postgres doesn't disable materialization by default. IME it's a niche use case compared to using CTEs as intermediate views (where you want to expand all joins and use indexes). It'd be better to have an explicit opt-in: "Yes, store this result on disk, disable all indexes, and process it in O(N) fashion."


> You can override that decision by specifying MATERIALIZED to force separate calculation of the WITH query, or by specifying NOT MATERIALIZED to force it to be merged into the parent query. The latter choice risks duplicate computation of the WITH query, but it can still give a net savings if each usage of the WITH query needs only a small part of the WITH query's full output.

https://www.postgresql.org/docs/current/queries-with.html#id...
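
So on Postgres 12+ the workaround is a small annotation, sketched here with invented table names:

    with recent as not materialized (
        select *
        from big_table
        where created_at >= now() - interval '1 day'
    )
    select count(*)
    from recent r
    join other_table o on o.big_id = r.id;

With NOT MATERIALIZED the planner folds the CTE into the parent query, so indexes on big_table stay usable even if the CTE is referenced more than once.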


Thanks, it is important for people to know this. In this article 5 tables are pulled into CTEs which aren’t marked as not materialized.

It wasn’t the point of the example, but the default is scary, leading to disabling all indexes for the query once it gets more layers.



