
Modularizing SQL? (2007) - Kinrany
http://lambda-the-ultimate.org/node/2440
======
MarkusWinand
The most important code management tools SQL has are views and the WITH clause.

[https://modern-sql.com/use-case/literate-sql](https://modern-sql.com/use-case/literate-sql)
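
As a small illustration of the literate style (my own sketch, not from the linked article; the table and data are made up, run here on SQLite):

```python
import sqlite3

# Hypothetical data just for the demo.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, amount INTEGER);
    INSERT INTO sales VALUES ('north', 10), ('north', 20), ('south', 5);
""")

# Each CTE names one step, so the query reads top-down like prose.
rows = conn.execute("""
    WITH per_region AS (
        SELECT region, SUM(amount) AS total FROM sales GROUP BY region
    ),
    big_regions AS (
        SELECT * FROM per_region WHERE total > 10
    )
    SELECT region, total FROM big_regions ORDER BY region
""").fetchall()
print(rows)  # [('north', 30)]
```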

Something like "polymorphic" views would be extremely useful, but is not there
yet (in standard SQL).

In other areas there are subtle features like the explicit WINDOW clause to
avoid repeating similar OVER clauses. Briefly mentioned at the end of this
doc:

[https://www.postgresql.org/docs/current/tutorial-window.html](https://www.postgresql.org/docs/current/tutorial-window.html)
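
For illustration, here is a named window defined once and reused (my own sketch, not from the linked tutorial; the WINDOW clause needs SQLite 3.25+ or PostgreSQL, and the table is made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (grp TEXT, v INTEGER);
    INSERT INTO t VALUES ('a', 1), ('a', 2), ('b', 3);
""")

# The window is defined once in the WINDOW clause and reused by name,
# instead of repeating OVER (PARTITION BY ... ORDER BY ...) per column.
rows = conn.execute("""
    SELECT grp, v,
           SUM(v)   OVER w AS running_sum,
           COUNT(*) OVER w AS running_count
    FROM t
    WINDOW w AS (PARTITION BY grp ORDER BY v)
    ORDER BY grp, v
""").fetchall()
print(rows)  # [('a', 1, 1, 1), ('a', 2, 3, 2), ('b', 3, 3, 1)]
```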

This was adopted in recent years by many FOSS-DBs, but not yet by the big
commercial ones:

[https://modern-sql.com/static/sm-blog-2019-postgesql-11.over...](https://modern-sql.com/static/sm-blog-2019-postgesql-11.over.en/T612.window-clause.zrNWMjbF.svg)

Similarly, the chaining of grouping sets can be helpful in those very rare
cases where you need them.

The Oracle Database introduced "SQL Macros" recently. For my taste, they are
too 'rude' to make me happy on first sight (much like C #define):

[https://blog.dbi-services.com/oracle-20c-sql-macros-a-scalar...](https://blog.dbi-services.com/oracle-20c-sql-macros-a-scalar-example-to-join-agility-and-performance/)

~~~
yen223
Be careful with WITH clauses in Postgres though. Prior to Postgres 12, the
engine always materialized queries inside WITH clauses, even when it was
unnecessary. This can lead to certain queries being orders of magnitude slower
than normal when written with a WITH clause.

~~~
eatonphil
You don't need to follow this blindly though. Benchmark with EXPLAIN to see
the loss/gain on performance vs. readability for your particular query.
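
For instance, with SQLite's EXPLAIN QUERY PLAN (Postgres would use EXPLAIN / EXPLAIN ANALYZE instead; this is my own sketch with a made-up table):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE big (id INTEGER PRIMARY KEY, v INTEGER)")

# EXPLAIN QUERY PLAN shows whether the CTE is flattened into the outer
# query or materialized into a temporary table.
plan = conn.execute("""
    EXPLAIN QUERY PLAN
    WITH w AS (SELECT * FROM big WHERE v = 1)
    SELECT * FROM w
""").fetchall()
for step in plan:
    print(step[-1])  # human-readable description of each plan step
```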

------
default-kramer
Darn, I was going to do a Show HN in a few days, but my side project is too
relevant here not to link to: [https://docs.racket-lang.org/plisqin/index.html](https://docs.racket-lang.org/plisqin/index.html)

As the warning says, it is NOT READY for any kind of use. But the SQL it
generates is pretty close to what a human would write, so maybe it's fine to
use.

There are two main things that Plisqin offers that I haven't seen elsewhere:

1\. Joins are values. This is wonderful for composability. When you say (from
x SomeTable ....), x is bound as an "instance of SomeTable". Joins are also
instances in the same way.

2\. Traditional SQL-style aggregates are supported, but "grouped join
aggregation" is a new concept in Plisqin. Described here:
[https://docs.racket-lang.org/plisqin/Aggregates.html](https://docs.racket-lang.org/plisqin/Aggregates.html)

Besides those two, there is also a different approach to comparisons which
eliminates 3VL: [https://docs.racket-lang.org/plisqin/Nullability.html](https://docs.racket-lang.org/plisqin/Nullability.html).
This is only in the "strict" variant of Plisqin; the "unsafe" variant gives
you 3VL back if you like it.

I plan to finish up the documentation, and enhance the "Plisqin as a Research
Language" section with more analysis of what works well and what gaps in the
language I'd still like to patch.

------
bootywizard
Whether you're trying to do this for a data warehousing use case, or in an
environment where you are creating production database queries, will change
the approach you need to take here.

We have been working on SQLX for the data warehousing use case. It lets you
embed JavaScript into your SQL queries, making things like code reuse much
easier while fitting in with your existing SQL dialect. It might be of
interest:
[https://docs.dataform.co/guides/sqlx](https://docs.dataform.co/guides/sqlx).
Some of the concepts in there are specific to our framework, some are not.

For data warehousing, create more views! They are IMO the right level of
abstraction for encapsulating common business logic, data definitions, joins
etc, making composing downstream queries much easier. Managing lots of views
like this requires some investment in data modelling tooling, however
(Dataform, DBT etc).

For generating queries used against production databases for building user
applications, none of this applies.

~~~
chrisjc
> For data warehousing, create more views!

Agreed, but... the problem with views is that they aren't parameterizable. In
effect they are static templates. Fortunately, modern data warehouses often
provide user-defined table functions that accomplish pretty much the same
thing, but allow you to "create input parameters for your view".

Dataform looks interesting (how have I not heard of this before) but I wonder
if they support UDTFs?

~~~
nightski
Sure they are: you just slap a WHERE clause on your query when you query a
view. Or am I missing something?
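
A minimal sketch of that pattern (my own example on SQLite, with made-up tables): the view stays generic and the caller supplies the "parameter" as a predicate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount INTEGER);
    INSERT INTO orders VALUES (1, 'north', 10), (2, 'south', 5);
    -- The view encapsulates the shared logic; the caller adds the predicate.
    CREATE VIEW order_totals AS
        SELECT region, SUM(amount) AS total FROM orders GROUP BY region;
""")
rows = conn.execute(
    "SELECT * FROM order_totals WHERE region = ?", ("north",)
).fetchall()
print(rows)  # [('north', 10)]
```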

~~~
chrisjc
Sure, that's sufficient most of the time. However, there are times when your
view might span multiple tables and you want to ensure that full table scans
aren't triggered before your view predicate is applied.

This may also come down to how sophisticated your DB's SQL optimizer is.

------
branko_d
SQL Server's inline table-valued functions are a good start. They are
essentially parametrized views, which are inlined and "disappear" at the point
of use, so all the usual transformations are available to the query optimizer.

They are also quite limited in a number of ways:

\- Can't parametrize on the underlying table or view, even when it is
"compatible" with the function's body. Some sort of OOP-style polymorphism or
even JavaScript-style "duck-typing" would be nice.

\- Can't parametrize the SELECT list or ORDER BY clause.

\- No type generics (i.e. ability to "vary" the parameter's type when that
doesn't require changing the function's body).

\- Can't pass another function as an argument (what would be called delegate
or function pointer in other languages).

\- Only one SQL statement in the function body (SQL Server 2019 has lifted
this limitation somewhat, but we are not quite there yet).

\- Can't (explicitly) throw exceptions.

\- And probably other things that escape me at the moment...

Some of these limitations apply to the stand-alone SQL as well, of course.

~~~
throwaway_pdp09
Reason for most of these is the optimiser. What if you have e.g. two tables t1
and t2 which are structurally identical, and have

    select *
    from parameterised_t
    where parameterised_t.x = 99

where parameterised_t is either t1 or t2 passed in as a param. The tables can
still have very different statistics, and the optimiser needs to know
statically which it's getting.

Of course, if t1 and t2 are structurally identical, you should not usually be
splitting them into 2 tables, they should be one table partitioned logically
with some predicate (an extra column) rather than physically (having 2
tables).

> \- Can't parametrize the SELECT list or ORDER BY clause.

Ditto, but worse.

> \- Can't (explicitly) throw exceptions.

and that one's a bitch, and quite unnecessary.

~~~
branko_d
> Reason for most of these is the optimiser.

Well, the optimiser as currently implemented, tying only one plan to each
unique SQL text.

There is nothing, in principle, preventing an implementation that produces
multiple plans based on the parameter values. This would be useful not just
for the "advanced" parametrization I mentioned above, but also for simpler cases
such as when a parameter being NULL could be handled via a different access
path compared to non-NULL.

> if t1 and t2 are structurally identical, you should not usually be splitting
> them into 2 tables

Depends on constraints. For example, you might have two structurally same
tables that reference (or are referenced by) different tables.

~~~
throwaway_pdp09
> ... nothing, ... preventing an implementation that produces multiple plans
> based on the parameter values

I'm afraid there is - time and space. Optimising is a _very_ expensive process
that can take well over a second for a complex query on MSSQL (and that's not
because the best plan is always found; there is a timeout after which the best
plan _so far found_ is returned. Complex queries _will_ hit the timeout, and a
suboptimal plan will most likely be returned), and the compiled query plan can
take megabytes.

Suppose you want to parameterise by the fields selected. That's a
combinatorial problem. tl;dr: that grows fast.

> you might have two structurally same tables that reference (or are
> referenced by) different tables

Umm. I'd say if their semantics were different enough that they should be 2
distinct tables, then their being identical is coincidental. I'd have to see
some actual cases before arguing further (edit: though I have come across
this, so maybe, but it's not IME common).

~~~
branko_d
> I'm afraid there is - time and space.

Sure, I'm well aware of that. I did say "in principle", after all.

However, we are generating multiple plans already by changing the SQL text (by
concatenating-in the table names, SELECT list, ORDER BY and so on).

There is currently no good solution in SQL even for parametrizing different
"shapes" of WHERE, so we use dynamic SQL for that, generating even more plans.
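
That dynamic-SQL workaround for the WHERE "shapes" can be sketched as follows (a hypothetical helper, not from the thread; real code must take column names from a trusted allowlist to avoid injection):

```python
# Build the WHERE "shape" dynamically but keep values as bind parameters,
# so each shape gets its own plan while values stay out of the SQL text.
def build_query(filters):
    clauses, params = [], []
    for column, value in filters.items():
        if value is None:
            clauses.append(f"{column} IS NULL")  # different access path than '='
        else:
            clauses.append(f"{column} = ?")
            params.append(value)
    where = " AND ".join(clauses) or "1=1"
    return f"SELECT id FROM users WHERE {where}", params

sql, params = build_query({"name": "bob", "deleted_at": None})
print(sql)     # SELECT id FROM users WHERE name = ? AND deleted_at IS NULL
print(params)  # ['bob']
```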

There is some low-hanging fruit here.

> Suppose you want to parameterise by fields selected. That's a combination,
> which involves factorial. tl;dr that grows fast.

You don't have to do it in advance, though.

Today, when you change the text of the SELECT list, you generate a new plan
anyway. Parametrized SELECT list would be no different.

> I'd have to see some actual cases before arguing further (edit: though I
> have come across this, so maybe, but it's not IME common).

Most junction tables have essentially identical structure, but different
foreign keys. This is also pretty common when implementing things like "tags"
or "attributes" that apply to more than one kind of object...

But it doesn't really matter whether all fields are identical, only those that
are actually used by the given SQL statement. In a hypothetical future SQL,
you could implement your algorithm generically and then apply it on any table
or view that is "compatible enough", even if it has some extra fields that
your SQL statement doesn't use.

In today's SQL, unfortunately, you are condemned to repeating the
(essentially) same code.

~~~
throwaway_pdp09
Your point about generating query plans anyway if using dynamic sql is well
taken. I can't argue - but see below.

> There is some low-hanging fruit here.

Syntactically, perhaps. But it is bloody expensive to generate new plans. It's
quite reasonable for a query to take one or two or even three orders of
magnitude more CPU for plan generation than for actually running it. Neither
dynamic sql nor parameterisation will deal with that (that I can see). I think
we're stuck with expensive repeated query planning.

> Most junction tables have essentially identical structure

Yes, but they are used only to link 2 tables together (I call these join
tables because they're just used for joining 2 tables). They aren't used in
any other context.

> This is also pretty common when implementing things like "tags" or
> "attributes" that apply to more than one kind of object...

Maybe. I'd need to sit down with you and talk it over (not happening sadly!)

I think parameterised SQL is necessary, but it's also a wide-open door to
clueless programmers massively increasing the CPU load, and I've seen what
cluelessness can do to performance in an exists-right-now non-parameterised
RDBMS. OTOH I've never liked the idea that a tool's power should be reduced to
avoid numpties doing damage. You modify the numpty, not the tool.

~~~
branko_d
> I think we're stuck with expensive repeated query planning.

I agree with that (sadly). I'd like to see more expressive SQL, but I'm not
saying it will make the situation with planning any better. I'd argue it
should not make it much worse either, if done right.

------
chishaku
I _highly_ recommend dbt (data build tool) for a modular approach to SQL on
Postgres, Snowflake, BigQuery or Redshift.

[https://github.com/fishtown-analytics/dbt](https://github.com/fishtown-analytics/dbt)

~~~
throwaway_pdp09
I'm afraid I don't understand it. I assume a data warehouse is in a relational
DB anyway, so from
[https://docs.getdbt.com/docs/introduction/](https://docs.getdbt.com/docs/introduction/)

"dbt (data build tool) enables analytics engineers to transform data in their
warehouses by simply writing select statements. dbt handles turning these
select statements into tables and views"

So you are taking some runnable SQL and... running it? Why does SQL need to be
turned into tables and views anyway? What problem is this solving?

~~~
chrisjc
> Why does SQL need to be turned into tables and views anyway? What problem is
> this solving?

So that data is aggregated, summarized, cleansed, transformed, joined,
normalized, whatever, before you query it. It's much cheaper, faster, etc. to
do all of that ahead of time instead of at query time.

> I assume a data warehouse is in a relational DB anyway

Not sure what you mean by a data warehouse being in a relational DB? Data in
data warehouses can be set up in a relational way, but it certainly doesn't
have to be. Perhaps you need to get a better understanding of OLAP vs OLTP?

[https://stackoverflow.com/questions/21900185/what-are-oltp-a...](https://stackoverflow.com/questions/21900185/what-are-oltp-and-olap-what-is-the-difference-between-them)

~~~
throwaway_pdp09
> Not sure what you mean by a data warehouse being in a relational DB?

That you're using an RDBMS of some sort. It understands SQL and the data is in
the conventional tabular form. I understand OLAP vs OLTP. My question was
whether e.g. Snowflake was an RDBMS (I did check Wikipedia and the Snowflake
website, but neither said). If it wasn't, then I could see the benefit of this
tool.

> So that data is aggregated, summarized...

That's what SQL is good for, why (if we're already in an RDBMS) do I need
another tool on top to do this?

> It's much cheaper, faster, etc. to do all of that ahead of time instead of
> at query time.

Of course, that's what a DW does as it pulls in new data. But that is what SQL
also does rather well, so what does this tool bring to the party?

~~~
chrisjc
DWs like Snowflake certainly share some features you'll find in traditional
RDBMSs, but I'm not sure I would classify it as one.
[https://www.quora.com/Is-Snowflake-a-RDBMS](https://www.quora.com/Is-Snowflake-a-RDBMS)

> That's what SQL is good for, why (if we're already in an RDBMS) do I need
> another tool on top to do this?

Writing, versioning, testing, deploying and monitoring all that complicated
SQL is very challenging. This is what DBT is good for. DBT doesn't replace
SQL; it simply augments it by allowing you to templatize, modularize, figure
out all the inter-dependencies, etc...

> Of course, that's what a DW does as it pulls in new data. But that is what
> SQL also does rather well, so what does this tool bring to the party?

Sounds like you're talking more about ETL, but from what I understand DBT is
really useful for what happens after ingestion: transforming, refining,
aggregating, etc. from one internal DW source to another. This is where you
would want to use SQL, but again, DBT doesn't replace SQL, it augments it.

\----

BTW, I've never used DBT and have only really looked into it in the last 24hrs
since we're really struggling with the pain of managing all of our ETL
(table->table) SQL. I really see where it could make our lives a lot simpler,
but it doesn't seem to provide the functionality that we need for our
workloads.

We're using Snowflake change streams to build refined, summary and aggregation
tables off an ingestion table (Kafka Connect to SF). We use change streams on
that ingestion table to track new data that we can refine, summarize and
aggregate into final tables in near real time for analytics and reporting.
Change streams (and tasks and procedures) are not something that DBT seems to
be ready for, but that's probably understandable since they're trying to
support more DWs than just SF.

~~~
throwaway_pdp09
Snowflake is, per your link, an RDBMS. Where it lives and on what storage
substrate it operates is irrelevant.

On your recommendation I'll have a look at DBT, thanks.

Whatever you're doing at work sounds really complex, possibly worsened by
sheer quantity of data. I wish you luck with it!

------
ramraj07
I solved this problem (to my personal satisfaction) by adapting sqlalchemy for
this. Sqlalchemy was not designed for complex data warehouse ELT for sure, but
all it took was abstract out a few "quirks" of sqlalchemy and I was able to
get a framework that's as expressive as spark, and hence extremely
modularizable (not to mention testable), and compiles to regular SQL very
easily.

~~~
chrisjc
Any chance you could get into a little more detail about this framework?
Sounds very interesting. Any reason you developed your framework instead of
using a tool like dbt?

~~~
ramraj07
I will go into brief details, but will clear with my org and post more on HN
later for sure. But the gist is I try to extract any sqlalchemy representation
as an 'alias' object like:

    ORMModel.__table__.alias()
    select([query.c.id]).alias()

And wrap it with a class that's called a DataFrame. This class can then have
methods implemented such as 'with_column_renamed' and 'join' which can just
take simple strings and do all the sqlalchemy heavy boilerplate inside. The
goal is to make it almost similar to spark code, where each operation returns
the DataFrame with the logic applied (continuously updating the internal alias
object).

With this DataFrame, we always know the schema of the query, so it makes it
easy to write generic transformer functions and pass them to an 'apply' method
just like in spark. You can always reach out to the power of sqlalchemy in any
step to customize the logic, and write custom compilers for temp tables, etc.

This also means you can write unit tests in a very generic fashion if you have
standard transformations that process one DataFrame.
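
A toy sketch of the chaining idea (my own illustration, not the poster's actual framework, and without sqlalchemy; it just composes SQL text, but shows how each operation returns a new object wrapping the query so far):

```python
# Each method returns a new DataFrame wrapping the query built so far,
# so transformations chain like Spark operations.
class DataFrame:
    def __init__(self, sql, columns):
        self.sql = sql
        self.columns = columns  # schema is always known

    def select(self, *cols):
        return DataFrame(
            f"SELECT {', '.join(cols)} FROM ({self.sql}) q", list(cols)
        )

    def where(self, predicate):
        return DataFrame(
            f"SELECT * FROM ({self.sql}) q WHERE {predicate}", self.columns
        )

    def apply(self, transformer):
        # Generic transformers work because the schema is always known.
        return transformer(self)

df = DataFrame("SELECT * FROM users", ["id", "name", "age"])
adults = df.where("age >= 18").select("id", "name")
print(adults.columns)  # ['id', 'name']
```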

One reason I didn't go with dbt is that we already have our codebase in
python, and going with dbt would have fully siloed the pipelines to Jinja SQL.
Also not a huge fan of a pipeline mandatorily split into hundreds of files
often having just a few lines each!

------
Tarean
You need an SQL optimizer to _efficiently_ abstract over SQL. There are a
couple of implementations; empirically, producing good SQL at runtime is kinda
slow.

The Pathfinder compiler is meant for C# LINQ to mix SQL and XML queries. It
has backends in a couple of languages but is kinda slow, and IIRC has no
support for aggregates.

Database Supported Haskell builds on this but has support for basically all
SQL features and some NoSQL backends. It needs some slight work to fix bit rot
by now, and the compiler is kinda slow.
[https://hackage.haskell.org/package/DSH](https://hackage.haskell.org/package/DSH)

The Ur language has great support for SQL, including updateable views.
[https://en.wikipedia.org/wiki/Ur_(programming_language)](https://en.wikipedia.org/wiki/Ur_\(programming_language\))

TL;DR: if you allow general nested results in your abstraction, you need an
optimizer. Optimizing aggregates in general is REALLY hard and slow.

~~~
throwaway_pdp09
If you do mere joins, you need an optimiser.

------
naranha
As said in the comments, some abstractions are possible with stored functions
and views; in PostgreSQL you also have generated columns now. I'm sure there
are more.

Abstractions built on ORMs or query builders have the disadvantage that they
are tied to one language and framework and application, making them less
reusable.

------
kpmah
I had a go at writing a type-inferred composable language for database queries
a few months ago:
[https://github.com/KMahoney/squee](https://github.com/KMahoney/squee)

------
ksri
I have been working on JinjaSQL[1] for a while to simplify complex SQL queries.

JinjaSQL is a templating language, so you can use interpolation, conditionals,
macros, includes and so on. It automatically identifies bind parameters, so
you don't have to keep track of them.

1\. [https://github.com/hashedin/jinjasql](https://github.com/hashedin/jinjasql)
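
The auto-binding idea can be illustrated with a toy re-implementation (my own sketch, not JinjaSQL's actual code): placeholders become bind markers and their values are collected in order.

```python
import re

# Find {{ name }} placeholders, replace each with a bind marker, and
# collect the values in order, so the caller never tracks parameters.
def render(template, context):
    params = []

    def bind(match):
        params.append(context[match.group(1)])
        return "?"

    sql = re.sub(r"\{\{\s*(\w+)\s*\}\}", bind, template)
    return sql, params

sql, params = render(
    "SELECT * FROM users WHERE name = {{ name }} AND age > {{ age }}",
    {"name": "bob", "age": 21},
)
print(sql)     # SELECT * FROM users WHERE name = ? AND age > ?
print(params)  # ['bob', 21]
```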

------
aformella
My dream - enabling/disabling columns and tables:

    CREATE TEMPLATE VIEW tpl(t1, needed) AS SELECT
      t1.*,
      t2.* IF needed
    FROM t1
    JOIN t2 ON (true) IF needed
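
Standard SQL has no such TEMPLATE VIEW; the closest approximation today is generating the view text in a host language, e.g. (a hypothetical sketch mirroring the tpl above; the table names are made up):

```python
# Generate the SELECT text, including or excluding the joined table's
# columns depending on the 'needed' flag.
def tpl(t1, needed):
    cols = [f"{t1}.*"]
    joins = ""
    if needed:
        cols.append("t2.*")
        joins = " JOIN t2 ON (true)"
    return f"SELECT {', '.join(cols)} FROM {t1}{joins}"

print(tpl("t1", needed=False))  # SELECT t1.* FROM t1
print(tpl("t1", needed=True))   # SELECT t1.*, t2.* FROM t1 JOIN t2 ON (true)
```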

~~~
throwaway_pdp09
Don't get it - why do you think you need this?

~~~
aformella
All object-relational mappings would be in the database. Why not a function
returning a query? Because you can't EXPLAIN ANALYZE a function body.

------
eeereerews
Not really on topic, but what happened to LtU anyway?

~~~
throwaway_pdp09
It's evidently still there. It's a slow-post high-info site so not much
happens for long periods. I'm not sure what you're asking.

~~~
eeereerews
It's slowed down a lot. There's been a total of 8 posts and 50 comments since
2019. There were 32 posts just in 2017.

~~~
throwaway_pdp09
It's a user-driven site. If you have something relevant to it, post it! That's
what will keep it alive.

If you're like me, who lurks because they've nothing to post but are concerned
about it dying, have a word with the guy running it, Ehud I think.

------
cryptonector
SCHEMAs, TYPEs, FUNCTIONs, VIEWs with and without INSTEAD OF TRIGGERs,
TRIGGERs -- these things let you build abstractions in SQL.

------
Kinrany
A tangent: I wonder if a ReactJS-like rerendering could be used as a
declarative alternative to schema migrations.

~~~
Kinrany
Hmm, I guess that's "State-based database delivery/migrations". I couldn't
find much about them though.

~~~
andreypopp
Check out [https://github.com/prometheusresearch/baseline-codebase/tree...](https://github.com/prometheusresearch/baseline-codebase/tree/baseline/master/src/rex.deploy),
which does that. Very happy with how it works.

------
javisantana
We found a way to organize large SQL queries by decomposing them into smaller
chunks and organizing those chunks like in a "Jupyter notebook", so you can
also see intermediate results. An example:

[https://ui.tinybird.co/snapshot/7437e5efa1f944a0b7cc01775169...](https://ui.tinybird.co/snapshot/7437e5efa1f944a0b7cc017751696ea3)

------
tlarkworthy
Terraform templating + SQL plays together pretty well. It certainly helps a
fair bit with pattern reuse.

------
bokwoon
Don't typical query builders allow for composing SQL from smaller building
blocks?

~~~
kolme
Yes, this is the main reason why they exist in the first place.

~~~
throwaway_pdp09
From my perhaps minimal experience with query builders, they're there to make
SQL queries graphical and therefore presumably easier. IME they don't do a
good job of that. If you can handle the semantics of SQL, you can put in a bit
more work and learn the syntax. It'll pay for itself rapidly.

