Hacker News new | past | comments | ask | show | jobs | submit login
Against SQL (scattered-thoughts.net)
499 points by deafcalculus 11 months ago | hide | past | favorite | 339 comments

I think the problem of this essay is that it's overly technical: only those versed well enough in SQL will really care to read the whole thing, and if they are already at that level, either they accepted that "SQL will get the job done in the end", or they learned to live along it and now even kinda embrace it, and are happy to write about how the examples are very poor and dismiss the critique based on that, when the essay kinda explains it main point pretty well:

>> The core message [...] is that there is potentially a huge amount of value to be unlocked by replacing SQL

To me, a lot of people defends SQL saying that "perfect is the enemy of good" and that SQL simply works. Not the favourite of anyone, but everyone kinda accepts it.

And yeah, it's true. People use SQL because it's good enough, and trying to reinvent the wheel would take more work (individually speaking) than just dealing with SQL as it is right now. For large organizations where the effort could be justified, all your engineers already know SQL anyway, so it's not so great either.

But for something so relevant as relational databases, perfect is not the enemy of good. We do deserve better. We generally agree that SQL has many pitfalls, it's not great for any kind of user (for non-technical users, a visual programming language would work well here, more like what Airtable does, closing the bridge between spreadsheet and hardcore database, and for technical users, it does feel unwieldy and quirky). We should be more open to at least consider critiques and proposals for better. We might find out that people, from time to time, are making some good points.

I think I'm experienced enough to understand the article, and I agree. I've written multiple optimizing SQL generators (altering generated SQL to access better plans), and rewritten hundreds of queries for better performance, which involves trying many semantically identical rewrites of the same query.

I agree with Jamie. I think SQL is irritatingly non-composable, many operations require gymnastics to express, and I'd like a more expressive and regular language to write queries in.

I also suspect that such a language would be harder to optimize, and might be more practical to implement if it was less declarative and closer to the plan.

I also think that as you scale up, offloading computation to the database is a false economy. It's closer to the data, but the database itself is a bottleneck. As you scale, you want to limit the complexity of your queries. But this is actually an argument for things like Materialize, i.e. making copies of the data elsewhere that bake in some of the computation ahead of time.

> I think SQL is irritatingly non-composable, many operations require gymnastics to express

One approach to radically simplify operations with data is to use mathematical functions (in addition to mathematical sets) which is implemented in Prosto data processing toolkit [0] and (new) Column-SQL [1].

[0] https://github.com/asavinov/prosto Prosto is a data processing toolkit - an alternative to map-reduce and join-groupby

[1] https://prosto.readthedocs.io/en/latest/text/column-sql.html Column-SQL (work in progress)

What is your opinion on abstractions on top of SQL queries?

On paper, a more expressive language that spits out SQL queries sounds great, but I've never seen a single one not become a pain in the ass to use.

> On paper, a more expressive language that spits out SQL queries sounds great, but I've never seen a single one not become a pain in the ass to use.

That's precisely because of some the flaws of SQL outlined in the article.

Generating SQL is complex. Generating portable SQL is impossible.

Take a look at the queries sent by something like Power BI to MS SQL Server when running in direct-query mode. It's just obscene how complex the queries can get!

I've tried to write some SQL generators before, of various types. I always got bogged down in the ludicrous complexity.

The author of the article makes so many good points that it's easy to gloss over entire categories of mistakes in the SQL spec. For example, he briefly mentions that getting a column schema back in query is weird and non-standard. The real problem here is that SQL is not homoiconic: tables, columns, and their types could have been represented as tables, including when defining them. E.g.: instead of "CREATE TABLE" the syntax should have been more like "INSERT INTO sys.tables ... ". (Similarly for columns, constraints, foreign keys, etc...)

Instead, we have a language designed for data that until recently had implementation-defined read-only system views for the schema data! Writing into that data uses an entirely different set of key words and syntax, which is "not really SQL" in the sense that it isn't relational and cannot use tables as input unless coded in an external language like Java, or built up using string manipulation an called via some sort of non-standard "exec string". Of course, then you have to worry about things like not having a built-in standard escape/unescape function! (QUOTENAME is the MS SQL version of this, surely non-standard).

There's a lot more wrong with DDL than the inability of expressing it as DML. Some SQL doesn't scale well, but DDL fundamentally scales really badly. It needs either an async or resumable execution model, for one thing, which doesn't map well to connection-oriented transactions. Big migrations take days to complete. That's too long to be very reliable in a distributed system when operating on a synchronous basis.

Big migrations take days to complete because -- like the author alluded to -- SQL merges very low level data layout concerns with very high level abstract data representations. So to make a change to one you have to make a change to the other.

Decoupling the materialisation of a schema from its theoretical representation would make all migrations "instant". Many column-oriented databases can do this, or nearly so, but most row-oriented databases can't or won't.

I.e.: ideally database engines should be able to keep track of multiple "versioned" physical schemas and transparently map in-flight queries to them. Big data migrations often implement this kind of thing manually with "writeable views" or similar techniques. This shouldn't be manual, the database engine should be able to do it behind the scenes with minimal human involvement.

I'd agree with this. Despite hating databases whilst studying for my undergrad, I've come to realise after almost 30 years in the industry that expertise in SQL and database optimizations are key, and all the various frameworks I've tried to use to "hide" the complexity of SQL for my team, end up causing more headaches.

e.g. when I use sqlalchemy, I end up writing queries using core than ORM, because it becomes simple for me to optimize versus having to keep printing str(query_statement) to analyze sqlalchemy thinks it should do versus what it really should do.

Do remember to find a good DB optimization expert (hint: not the type that answers "add an index" when asked how they would speed up a query) and you'd be surprised how well SQL continues to scale today.

And for those of you about to say "but KV/NoSQL", by all means don't shoehorn SQL into every single thing; use the right tool for the right job; SQL will handle that 80+% of what you need, then implement the "other" stuff in memcached/redis/mongo/kafka/whatever you find is "easiest" to implement, deploy, manage and maintain.

> versus having to keep printing str(query_statement) to analyze sqlalchemy thinks it should do versus what it really should do.

ORM generally aren’t meant to totally replace sql, it’s an anti pattern. There was a quote by Gavin king explicitly calling this out; ORM for crud operations on object hierarchies, sql for everything else (reporting, adhoc queries etc). I try to educate this whenever I can but so many teams resist it.

ORMs assume that they are the only access point to the database and cache result sets based on that. Disable the cache or keep in mind that bugs can happen due to that behaviour.

No it's not, and that link isn't relevant. Some ORMs do have caching layers and when you use those ORMs you have to know how to invalidate the appropriate caches when you update without going through them.

.Net's LINQ comes close, at least for querying. Update statements using LINQ usually translate to a select query followed by an in-language loop, but that's probably more an Entity Framework limitation than of LINQ itself.

The problem with most abstractions on top of SQL is that the first thing they abstract away is the relational model; you end up with an opaque result set, a mapped object, or an untyped collection. In my opinion, in order for a higher-level SQL language to be useful it must retain all the low-level data type information. The abstraction should be on the operation level (DML), not the data level.

SQL abstractions are stuck with a market problem, for reasons in this thread. Everyone who wants better-than-SQL is generally already adept at SQL relational gymnastics. And everyone who isn't adept at SQL wants to be able to ignore relational thinking.

End result? Abstractions target the easier of the two markets: people who don't want to learn relational modeling.

I think there's some low hanging fruit out there. Simple improvements to make queries easier to write and maintain.

As an example, ActiveRecord allows you to define scopes. If you have an account things like:

platinum: -> value >100000

churn_risk: -> churn>.95

You can then chain these together i.e Account.platinum.churn_risk to get all the platinum accounts at risk of churn.

Afaik there's nothing similar in SQL. If your definition of a platinum customer changes you have to change a bunch of different queries instead of one definition.

While I agree it’s a PITA, most SQL-interfaced RDBMS support views, which is a way to abstract the Platinum customer issue you pose into something composable, albeit in SQL terms, not in the dotted property chaining style you propose. In the past I’ve created view generators to automatically populate a sweet syntactic palette for application semantics. Modern query planners are hardly bothered by the implied excess of joins any more, and will often use the embedded literals to optimize the query plan before execution…

I’ve plucked that low hanging fruit before, and it can taste great even in SQL sauce.

> .Net's LINQ comes close, at least for querying

It breaks down in really stupid and unintuitive ways. Now you're not just learning LINQ, but also the ideosyncracies of how LINQ translates queries to SQL and why sometimes your performance plummets even though you expressed yourself in the same way, but if you put some query in a variable first and then continue grouping it does a query that takes milliseconds. It's the best example of a leaky abstraction I've ever come across.

Something as fully expressive as SQL is hard to design well. I think jOOQ is one of the best, but it relies on code generation to get good typing.

The context for my generators was querying over data whose schema was customer defined. I wrote two generators in Ruby and one in Java. They were designed to take advantage of quirks and performance characteristics of MySQL when doing arbitrary ad-hoc sorts and filters, and the optimization was mostly about delaying joins if they weren't necessary for the predicates, and creating an inside out nest of subtable queries to control the execution plan.

Doing this without a generator of some kind is impossible - the columns are dynamic - and then you leave a lot of performance on the table if you don't tune.

I never understood the point of this. SQL is easy to learn and usually easy to write, and usually easy to performance tune.

I find that if my query is getting overly complex, that some of the logic is often better suited in the app layer.

At best, you are spending time learning something niche with little upside. At worst, you are introducing buggy bad performing complexity.

Hense, so far they all seem to suck.

While I agree with you, it's too bad you can't create macros over common patterns. Not sure about you but I copy and paste a ton for this.

Really, I'd just want a coffee script, or moon script (https://moonscript.org/) where you have the idea for a query in your head, and know what the constructs will ultimately generate.

EnntityFramework in .NET core is actually very good for an ORM that allows you to get right into the weeds of your queries if you need to. LINQ was ok, but entityframework is a lot better, no rubbish gui layout and the relationships are explicit and can be specified through object annotations right down to the level of controlling the table column collation.

If you have bounced disappointingly off ORMs like I have since the 1990s, EntityFramework is a great step forward.

Agree, probably because it's not what people are looking for. Most devs complain they don't have enough control over how a query is executed, they want to be closer to the actual storage, as they think they can do a better job deciding how data should be retrieved than the database engine.

Adding another abstraction layer on top would only increase this frustration, as they'd have even less control.

Elixir's Ecto.

Have you tried Logica?


I always thought that was what hadoop had in mind with HDFS, i.e. move the storage and query engines closer in terms of spatial and temporal locality, albeit only if you partitioned/clustered properly.

Speaking of which, what has happened to hadopp and HDFS? Used to be flavor of the month about 10 years ago, but now I hardly ever hear people talk about it.

SQL is exactly like the QWERTY layout: A first quickshot with little design thoughts and unfixable architectural issues that‘s so widespread that everyone is used to it by now.

Trying to change to the Dvorak layout taught me a lot about enacting change on such a grand scale.

After a lot of hassle switching machines and OSes, typing on other user‘s computers, them typing on mine and general headaches among internationalization and inaccessible key combinations, I switched back to QWERTY in the end.

As said in other threads, there is no shortage of attempts to replace SQL. A lot of them are pretty good. But having learned SQL the hard way, I feel zero urge to learn another language from scratch right now.

It‘s why nearly all big databases eventually switched to it despite the FORTRAN vibe and its general ugliness.

Anyway, probably time to SET TRANSACTION LEVEL READ UNCOMMITTED and call it a day ^^

While I agree with your point, I wonder why that does not seem to be the case with programming languages.

For example, in iOS development (and, more in general, on Apple platforms), there has been a huge shift from Objective-C to Swift.

The same arguments should apply there. Swift is much better, but Objective-C got the work done, and many codebases were written in it, especially at Apple. And yet, the whole community switched pretty quickly.

One could argue that Swift was easier to pick up for newcomers. While that's true, I would then expect the argument to apply also to SQL alternatives.

So, what is the difference here?

I don't think this is a great example. The Objective-C to Swift transition going smoothly is only because of Apple's almost total control of the ecosystem (I mean this positively).

Apple is automatically the loudest voice in the room for iOS development. If they embrace Swift, the writing is on the wall for Objective-C. It's not just sticks, I'm sure they also put a lot of effort into making the transition as easy as possible.

In SQL - there is no equivalent. I think a better example is x86.

It took a generational new form of computing (mobile) to give ARM the momentum to seriously challenge x86. It's been almost 15 years since the original iPhone and we're only now seeing ARM based processors in computers.

SQL IMO is EVEN harder to displace than x86. x86 has a massive ecosystem but only only two serious manufacturers. SQL has a similarly massive ecosystem AND is the de facto language of the vast majority of major databases. Going to be very hard to unseat.

> It's been almost 15 years since the original iPhone and we're only now seeing ARM based processors in computers.

Are we just ignoring the Acorn Archimedes series of computers which gave rise to the ARM processors in the first place, fully 20 years before the first iPhone was launched?

No, we’re factoring in the market impact of the Acorn Archimedes on the wider desktop computing ecosystem compared to the impacts we’re seeing now around ARM on the server and the iMac of M1 based desktop and laptop machines.

Dating from the iPhone release makes sense in the "It took a generational new form of computing (mobile)" context.

The iPhone is the relevant reference because it's the reason why ARM is relevant in general-purpose computing today, and e.g. MIPS or POWER aren't

There was a strong push for "NoSQL" about a decade ago, but it got marred by document databases trying to usurp relational databases around the same time. When people realized they chose the wrong tool for the job (that is, the document database), they were happy to return to their relational databases using SQL. That completely killed any momentum that had been built to replace SQL with different languages.

I don't think that NoSQL ever was about better query languages for relational databases. It was about making databases faster and easier to use by using simpler data models with fewer guarantees.

And now that we can query JSON fields in PG we get the best of both world.

I like this (and use it) but it has to be said, the syntax is grim.

Still, I'd rather query JSON in postgres than do joins in mongo.

For now. Next version (currently in beta) will support the usual [] syntax in queries: https://blog.crunchydata.com/blog/better-json-in-postgres-wi...

Same for MSSQL, all XML or JSON Syntax is very… unfun.

The author seems to think that putting structured data in columns is a good idea. That is pretty clearly contrary to the basics of the relational model itself, never mind SQL. In fact it's quite close to how document databases work, so a very NoSQLish proposal overall.

>That is pretty clearly contrary to the basics of the relational model itself

The consensus among relational theorists appears to be that data types can be arbitrarily complex. For instance, C.J Date and Hugh Darwen write the following ([1] page 56):

Third, we remind you that types are not limited to simple things like integers. Indeed, we saw in Chapter 1 that values and variables can be arbitrarily complex—and that is so precisely because the types of those values and variables can be arbitrarily complex. Thus, to paraphrase a remark from that chapter, a type might consist of geometric points, or polygons, or X rays, or XML documents, or fingerprints, or arrays, or stacks, or lists, or relations (and on and on).

[1] https://www.dcs.warwick.ac.uk/~hugh/TTM/DTATRM.pdf

Alright, JSON in RDBMS is now canon!

Ironically, at least in the case of ObjectiveC->Swift, the answer is almost certainly "because those languages compose with each other reasonably well". I say ironically because I believe most of this article boils down to how uncomposable SQL is.

I think the same can be said of Java -> Kotlin, C -> Python (I know, I know), and lots of other medium-to-large scale language migrations over the past several decades. When people move to a new language, it's because there's strong interoperability with what came before that everyone would like to quit using but can't because they have too much invested in it.

This suggests to me that anything that wants to beat SQL will in fact have to compose with it - probably partly by generating it, but also by having a fairly solid "drop down to SQL" story. In other words, a language that, at least on the read-side, can somehow take two separate SQL queries and automatically rewrite them as subparts of a different SQL query. It might not be fast, but it needs to work, because you're going to want to reuse that work on that one gnarly query you did that gets all the important business metrics, and you also are going to want your results to be free of consistency issues.

> ”And yet, the whole community switched pretty quickly.”

This is only true for hobbyists and indie/startup developers. The largest and most popular iOS apps remain Obj-C (with a lot of C++ and in-house frameworks in the mix). There’s no incentive to rewrite something like the Facebook app in Swift.

Maybe the objc / swift difference is great enough to justify such change.. meanwhile the query languages have not evolved enough to do so .. just wondering. Maybe ask the datomic/datalog folks.

I don’t know if that’s really true these days. Each SQL server has its own dialect with nuances you have to pick up sooner or later, and if you’re a backend engineer for more than a couple years you’ve likely dealt with multiple types of data stores.

> So, what is the difference here?

If one company had all of SQL market share and they switched to a new lang, everyone else would have too.

But that's not the case here.

SQL is not controlled by single entity? Arguably there is foundationdb (sort of) from Apple.

You're right that it got established early and network effects, but Codd had two goes at a relational query language before Chamberlin and Boyce had a go. So not literally "a first quickshot".

I think the fact that many people were willing to try switching to mongodb despite the fact that it doesn’t use sql and had quite poor semantics should suggest that there would be some way for a better relational database query language (and engine) to catch on.

On the other hand, the fact that many new time series databases (and other engines, including Materialize which the author of the article worked on) wanted sql (or fake sql-like languages when they couldn’t manage sql) suggests that not being sql can be a big hindrance.

> typing on other user‘s computers

Off-topic, but changing layouts is easy. Also, using other people's keyboards is not the best for hygiene even before covid.

Dvorak-intl-code user here, SQL has so much more engineering than the QWERTY layout it's hardly comparable at all.

I don’t know. If someone gives me an interface like Prisma, and the backing database does not use SQL, but is otherwise the same, that just removes one needless translation step, and I wouldn’t have to do anything.

I find the fact that others cannot borrow my computer to be a major plus of using non-QWERTY layouts. Your point about keyboard shortcuts is spot-on: forget hjkl navigation in vim if you’re not QWERTY.

The point about hjkl is the biggest deal here. Everything else has been non-issues me.

Interesting to note that Emacs standard key-bindings don't have the issue (from what I heard). However, I use evil in Emacs and Colemak-dhm layout and I decided to rebind all evil / vim keys. This fixes the position issues, mnemonic issues, and gave me the opportunity to improve on a bunch things over the defaults in QWERTY.

I very rarely type on anyone else's computer. Usually when I do it's my wife's computer (maybe 5 time's a year). I have a small 40% keyboard with QMK firmware so it's easy to transport and everything I need is flashed into the firmware. No need for configuration of the keyboard layout on their OS.

There's a lot of talk about what is wrong with SQL, but i haven't seen something yet that is actually better than SQL for most use cases.

Stop fighting SQL so much, and just focus on bringing a better solution. If potential users see it has significant benefits they'll start using it.

I agree. I’ve seen people bashing SQL ever since I started using it 25 years ago, but I’ve yet to see an attempt to replace or supersede it succeed.

Many of these blog articles seem to be written by devs that have only really experienced the user-facing application side of things, and really don’t realise the sheer number of financial, analytic & business systems, that keep the world spinning round, that are all happily buzzing away using SQL.

Moaning about JSON manipulation in SQL is madness. A client should use appropriate joins to get the data from the database & then do the transformation. Storing JSON when you actually need to perform operations on said data makes 0 sense. It’s a huge overhead, an utter nightmare to index etc.

> A client should use appropriate joins to get the data from the database & then do the transformation

Please don't set up "object joins" against an object/key/value table. It's better to flatten the highly used bits of the json out into the table that hosts the "source json" as a blob in one column. There is a place in our codebase where someone did these "object joins", and it's a nightmare to work with. After about two months, there is a comment in one of the tests I found, "I really regret this pattern now". Now we have enough customer data in that object "kv table" that we can't migrate back to json + select flattened fields without extreme pain.

In theory datalog is better than SQL. However the reality is that we have millions of programmer-hours invested in making SQL perform well where datalog has no where near this effort.

I think that's why NoSQL is where the gains are being made. You can't magically improve things with a query language without putting the effort into performance, but you can get huge gains by changing your assumptions about how data is structured.

Can you? For every article hating SQL there is one for hating NoSQL: http://www.sarahmei.com/blog/2013/11/11/why-you-should-never...

NoSQL is much bigger than Mongo, which deservers books written on hating it. The relational algebra is great, but its also overkill for things like event streams, message brokers, caches, does hit scalability issues for certain legitimately large workloads, and in comparison to Mongo and other document-based databases, fails when your data simply is not relational.

Then there's the whole object/relational impedance missmatch, which if you insist on using OOP, there's a variety of object databases that make a small persistance layer much easier.

> fails when your data simply is not relational

What's an example of data that's not relational?

One case is where you have fairly ragged entities that may have several of many attributes but result in very sparse rows.

Another case is when users are effectively designing their own schemas but you want to keep them, for example I have a database for a CMS where users can define nested hierarchies of keys with translations that would be impossible to fit into a schema. You could just toss this data into a JSON column, but its an example of nonrelational data.

Inventing a new language and a new query engine and a new storage engine at the same time, competitive with the state of the art, is maybe just too much to be feasible.

There were multiple optimizing compilers which can do a lot of these asks, allowing composable queries that return nested types, but produce SQL queries. I think the pathfinder compiler had the most real-world use, it was meant to efficiently mix SQL and xquery to query SQL+XML documents stored in postgres. It had a c# linq frontend, but also fairly hefty compile times.

I think most of the criticism is about SQL not being the best programming language - a language in which pros build application, especially web applications. And SQL is truly not best at that. It is not the best API.

But in my world, SQL is much more of a Human Data Interface than Application Programming Interface. SQL is used for both ad-hoc and automated data querying and transformation. It is something manufacturing intelligence and supply chain experts and researchers and others learn to empower themselves to get to the data - and yes, you won't see any CEO running reports in SQL themselves, but it is not for programmers only.

Those people would not benefit from syntax with more brackets instead of words, easier way to define and call functions, or the ability to install libraries - in fact I think it would make it harder and less accessible for them.

The OP is right that for machine-to-machine interface, all the syntax baroqueness is not worth it. And of course, having more portability would be great.

But while machine-to-machine interfaces written by skilled developers who knows exactly what are they doing might actually be the most common use of SQL (since half the internet runs on Wordpress and MySQL and most smartphone apps might have SQLite in them), it is not where the majority of complexity or time is spent - that one is with the people working with data.

Spot on.

Replacing SQL is a tough sell because at its core, it really does work pretty well for what it was designed to do. (i.e. not most of the things they've bolted onto it and the corresponding databases this century) But it's not a silver bullet and people should stop trying to jam things into relational databases that aren't a good fit. There are already a number of potentially useful alternatives out there depending on your particular problem.

In the event that nothing exists that is right for the job: don't worry about replacing SQL, worry about creating/using a matched DSL/API/(whatever is appropriate) and data store for the particular problem you're trying to solve that SQL isn't a good fit for. If you create something that solves its particular problem even half as well as SQL did for relational databases you'll probably have a winner in that space that people with the same problems would be happier using.

To me, SQL looks like something I should be using 79-char punchcards for.

Scalable databases are just so difficult that we’re still driving a ‘64 IMPALA

Most of this opinion comes from “SQL” being vendor-specific. Is JSON vendor-specific? Is anything else, that we actually use by choice?

Mad at you too, Graph DBs, for sending us on another snipe hunt by adding vendor-imposed innovations, because it makes the enterprise marginally profitable. It’s how the world works, I’m still not happy about it.

(Disclaimer: I possibly just traveled back to 2002 and said this on slashdot)

> Most of this opinion comes from “SQL” being vendor-specific. Is JSON vendor-specific? Is anything else, that we actually use by choice?

Two things that come to mind are Markdown with all its flavors and Regex with multiple engines.

edit: and to a lesser extent maybe C/C++ compilers and JS engines.

edit2: also JVM, Python and Ruby runtimes

But both edits describe technologies with an official spec and slightly different implementations. Markdown/Regex are more comparable to SQL because they have vendor-specific syntax.

CommonMark and ECMAScript+bundlers exist, so it's very easy to write Markdown and JS against a standard and be interpreter-agnostic.

SQL standard only covers the basics and is only useful to humans, not machines.

You're mixing up different things that don't nescessarily need to be addressed at the same time. SQL is the declarative programming language used to express what data you need.

What type of database engine is used to execute that query is independent of the language. It could be highly scalable or it could be focussed on single user, single process.

What is the problem that needs solving?

> We do deserve better.

Totally! SQL is like shell, c and js that somehow generate this "never ever try to improve them for real, lets stay suffering for all eternity!"

And the AMOUNT of energy in mask them (transpilers, linters, "good practiques", tricks, etc) is a monumental waste of effort that cost so much than actually go and fix the core issues.

And in this case, SQL is the most trivial to fix. It could get simplified and streamlined far easier that is to replace C, for example.

And a lot of interactions to the DBs are using bridges already. Create a "good sql" is similar to create WASM and then for legacy reasons support in EOL fashion all the old sql and all the new go against the new way.

But this mean at least 2 or 3 major database vendors go for it (I dream: sqlite & postgresql).

P.D: I'm exploring building a relational language, so already have a simplified dialect for it: https://tablam.org/tutorial

Well there were a lot of atempts to relace SQL. And there will be more. But untill then SQL gets job done®. By the way, why "replace" let people choose. If you have better tool for the job, pick that over SQL.

There are probably a few of us who simply disagree. The fact that there aren’t libraries for SQL is a huge advantage in my book, and I say that as someone who’s overseen his developers move more and more from C# to Python because it has libraries that are simply easier to work with.

In this you might be able to guess my objection to many of OPs points. Because the key thing we look for when we decide to implement tech isn’t how good it is, but how easy it is to work with and more importantly, how easy it is to find other people who can pick up when someone’s career take them forward in life.

SQL is the one technology that has been stable enough over the years that no matter who we hire, we can always count on them being capable of working with our Databases. Yes, some of the more expansive functions, especially some of the recursive ones, take a little time to get accustomed to and there will always be some back-and-forth preferences in regards to stores procedures, but on a whole, it’s the one technology that never causes us any pain and never requires retraining.

I’m sure it’s not efficient, and OP is completely correct about the parts about the untapped potential, and I fully support people developing rational alternatives to SQL, I’m simply pointing out why SQL has been the power horse it has since, well basically, since we moved beyond punch cards and why it’ll likely still be so in 50 years.

Because at the end of the day, technology doesn’t really matter to the business. Cost vs benefit does, and as long as it’s easier and therefor cheaper to have people work with SQL (in the big picture), it’ll be what businesses work with.

The only reason you don’t see something similar in declarative programming (or even functional) is really beyond me but it probably has to do with how things evolved and how much easier it is to change that part of the tech stack compared to your databases. If we get a new hire who can do magic with JavaScript/Python/whatever we may consider allowing him/her to do that. We don’t want to, again because the big picture is easier to manage with fewer technologies, but it’s much easier to justify the cost-benefit of that compared to someone wanting to use a different database than the one in our managed cluster. Already you need more staff than that person to maintain it, as developers don’t run database server operations, aren’t part of the contingency staff and so on.

Like I said, I fully encourage the direction OP wants to go. Enterprise rarely invent things, we’d be happy with ball frames if that’s what made us the most money, so we need people like OP to invent things for us to adopt.

I completely agree with you. Is SQL perfect? No. Have I accepted and embraced it? Yes, because it'll get the job done.

I also happen to really dislike how the author hasn't capitalized the syntax like SELECT FROM WHERE or CREATE TABLE which, to me, poorly affects the legibility and therefore makes me less interested in reading the argument overall.

Among the various languages I use, why is SQL the only one that favors ALL CAPS EVERYTHING? When writing ad hoc queries I ignore that convention just to be ornery.

My honest take is that it's rare to name your tables and columns with all caps, whereas there are various cases where people do so with camelCase and snake_case.

In the end this makes spotting SQL-specific functions much clearer. I find it much easier to read SQL with this convention since my brain immediately knows when it's looking at context/project-specific data or a FUNCTION.

HTML used to be the same, it fell out of practice at some point. The same may happen to SQL under the right circumstances.

You're not alone. Myself and many coworkers don't feel the need to capitalize everything, preferring `select * from Table` to the shouting version `SELECT * FROM TABLE`. It ain't the 80s anymore, we can use lowercase.

This reminds me, case sensitivity of identifiers (table names, column names etc) in SQL is a whole mess that can fill another blog post.

It's not 1960 any more. We have colour video. We are allowed to highlight syntax with colours instead of the less legible all caps.

We certainly do deserve better, and the author made some good suggestions.

Unfortunately, building a new database is a huge project and there appears to be no party currently willing to sponsor it.

I wonder how much of the limitation are necessary in order for the query optimizer to have any chance at finding a good execution plan. As you add more and more abstractions and more and more general computations in the middle of your queries, it will probably become harder and harder for the query optimizer to understand what you are actually trying to do and figure out how to do it efficiently. Are you not running the risk that the database will have to blindly scan through entire tables calling your user defined functions as black boxes on each row?

I would also guess that we could have a better SQL but I do not think it could and should look anything like a general purpose programming language because otherwise you might get in the way of efficient query execution. Maybe some kind of declarative data flow description with sufficient facilities to name and reuse bits and pieces.

And maybe you actually want two languages which SQL with its procedural parts already kind of has. One limited but well optimizable language to actually access the data, one more general language to perform additional processing without having to leave the server. Maybe the real problem of SQL lies mostly in its procedural part and how it interfaces and interacts with the query part.

> I wonder how much of the limitation are necessary in order for the query optimizer to have any chance at finding a good execution plan.

Probably in exactly the opposite way: the limitations of SQL put a lot of work on the back of the query optimiser without allowing for said optimiser to easily reason about the queries, or for the writer to easily feed the optimiser (short of dedicated extensions e.g. Oracle's optimizer hints).

Or so I would think, I've never heard of SQL being praised for its optimisability.

I think it's important to distinguish between what's highly optimizable in theory and what's easily optimizable in practise. The latter working here and now, and the latter being a (possibly perpetual) decade of compiler development away.

An example here is how, sure, in theory, JITs can outpace AOT compilation because they have all the information the AOT compiler has plus runtime insights. But the ability to truly do that always seems to be a decade of compiler development away, with many giving up on the idea entirely.

It's also important to consider what we're comparing SQL's optimizability against. If it's against typical NoSQL databases, most of which seem to favour a lower-level query specification, I can defend SQL's optimizability to the end - with SQL databases having the freedom to dynamically switch algorithms to adapt to the actual contents of the database at the time of the query. Something which, ironically, a stale optimizer hint (i.e. written with the size & shape of the database as it was last year in mind) can actually get in the way of. Not that I'm saying that SQL planners never produce stupid query plans of course.

> An example here is how, sure, in theory, JITs can outpace AOT compilation because they have all the information the AOT compiler has plus runtime insights. But the ability to truly do that always seems to be a decade of compiler development away, with many giving up on the idea entirely.

JIT has been used very successfully for a couple of languages, e.g. Java. But adding it requires a lot of effort. The same thing is unfortunately true for new query languages, building a new database is really expensive, this is probably the main reason that there have been no serious alternatives to SQL.

> JIT has been used very successfully for a couple of languages, e.g. Java

I'm not at all disputing that - but its promises of "beating AOT" that we used to hear have not come to fruition. Where it really matters, e.g. HPC, you don't really see any JIT. C, C++ and Fortran still rule. And even for Java, you tend to be able to achieve equal or better performance with AOT compilers.

With an AOT compiler a lot of work is done at compile time, but with JIT it is done at runtime, which makes these programs slower to start. But for programs that do run for a longer time, like servers or IDE's the performance is quite decent. But Java is a memory safe language which means that it has to do a lot of checks that unsafe languages do not have to do, which means that it will always be slower that unsafe languages. In some cases the JIT can eliminate those safety checks, but that is not always possible.

Java always had (and still has) an AOT that compiled to byte code. Adding JIT was a major improvement.

Optimizability is one of the core ideas of the relational model. The others being data independence and a declarative query language. SQL is based on relational calculus and relational algebra, which in turn are based on first-order logic. The reason that every attribute in the relational model must be atomic is that first-order logic cannot be applied to nested sets. However, first-order logic is what makes query optimization possible in the first place. Thus, if one were to develop a query language that did not have these constraints, then one would lose optimizability.

As has been mentioned in this article and many previous takedowns of SQL the language, it's a pretty poor approximation of both relational calculus and relational algebra. So in that sense, SQL barely has a leg to stand on. If anything, SQL not adhering more closely to those techniques is why query planning is so fraught in modern relational DBs. A language like the one described in TFA would be both closer to the relational model as theorized and probably easier/more consistent to optimize.

> otherwise you might get in the way of efficient query execution

Maybe? But I kind of feel that the query optimizer often gets in the way of efficient query execution.

There’s so much stuff that’s semantically the same, but if you shuffle some strings around, the query is suddenly 10 times faster.

I think the criticisms of the article are basically right. SQL sucks in a lot of ways, and the base SQL standard really sucks such that virtually everyone has extended it at least somewhat, but they've all done it in a completely nonstandard way so nothing is portable, and the standard is never officially updated anymore. Oh also it's a massively leaky abstraction and you may have to tune your query to the database anyway, like even just normally, not even something ported, hence hints/etc.

There are quite a lot of pretty basic things many programmers would want to do that just are way more stupid than they should be. One that irks me is things like "return me the top 100 items", "return me/join against the most event for item X in a history log", etc) that end up requiring way more shit than they should just because there's no standards-compliant way to say "select first X rows from ... where ... order by ..." or "join first row where .. order by ...". In the case of top 100 you just wrap it in another query, but for more analytical stuff like "join the most recent X for this item" you have to use window functions and the syntax fucking sucks.

Since you mention optimization, perhaps it would help to allow the abstraction to be peeled away and write clauses that express imperative commands. Like being able to write that what you want is an "index join table X on index Y", etc. That's sorta what hints do, but roll them into the language in a portable way. It could also allow the query planner to look at it and tell you "that's just not possible and you need to check your keys/indexes/queries/etc", rather than having to run it and guess what it's trying to do.

Because I kinda feel that's a lot of the tension for SQL queries (beyond the QOL stuff that everyone knows is stupid). It's a guessing game. You're trying to write queries that will be planned well, but the query planner is not a simple thing that you can easily predict. The closest analogy is probably something like OpenGL/DirectX where you're just guessing at what behaviors might trigger some given heuristic in the driver and there are a lot of unseen weights and flags. There are "equivalent" SQL expressions that are much heavier than another seemingly similar one (like "select from ... join ... where exists (select ...)" vs "select where x in (select ...)". There are operations that are mysteriously expensive because you're missing an index or the query isn't getting planned the way you expect.

The suggestion of a "procedural" expression is, I think, also probably correct for some situations. PL/SQL functions are extremely useful, just obnoxious and in some cases arcane to write (as someone who never had formal education in DBA). It would be even nicer if you could have an embedded Python-style thing that let you have "python" syntax with iteration and list comprehensions and shit, that represent DB queries using cursors and shit, and perhaps defer execution until some final "execute" command while transforming it all into a SQL query. Like C#'s LINQ but for Python, but instead of buffering streams of objects transform it into a database query. Transform operations/etc on fields into SQL statements that work on columns, etc.

Or java if you will. Imagine a Java Streams API that compiles down to planner ops. I know Hibernate Criteria and JPA exists, but skip that API and express it as streams: iterations and transforms and so on, and map that onto the DB. Being able to build subqueries, then attach them to larger queries, etc. That way they execute in bytecode rather than pl/java.

Postgres allows you to run python in the database, Oracle has support for java, Postgres can do limit 100, you can build subqueries using views, and reuse these in larger queries...

Problem with query optimization is that it needs to be done at runtime, you can't optimize it in some procedure language easily. The optimal way to retrieve data depends on the number of records in your tables, the where clauses you use, the actual values you are filtering by, the records already in cache, etc.

99% of all programmers would not be able to program better performing queries when doing this with a procedural language or streams expressions or it would take them way too long.

> Postgres allows you to run python in the database, Oracle has support for java, Postgres can do limit 100, you can build subqueries using views, and reuse these in larger queries...

Compiling down to fast bytecode is important and what I am thinking is more of a "pythonesque"/"javaesque" than an actual thing on a python VM / java VM. Limit the allowable syntax, it doesn't need to be full-spectrum python/java where you can recursively declare objects/etc, just what can be mapped into SQL.

Give me Javacard+LINQ that compiles to bytecode with cursors. https://youtu.be/31D94QOo2gY?t=607

(OK not real javacard but... a minimal java equivalent that compiles to C bytecode, with LINQ, and with stack allocated variables/etc.)

Views as subqueries are a good idea though, we don't hardly ever use them so I mostly don't think about it. It takes more authorization to modify a DB object (view) than program code, so we have a political bias against it...

> Problem with query optimization is that it needs to be done at runtime, you can't optimize it in some procedure language easily. The optimal way to retrieve data depends on the number of records in your tables, the where clauses you use, the actual values you are filtering by, the records already in cache, etc.

> 99% of all programmers would not be able to program better performing queries when doing this with a procedural language or streams expressions or it would take them way too long.

That's probably fair but also leaves you at the mercy of the stats/query planner when it doesn't work the way you want.

Maybe a middle ground would be to make the query command an "expectation" and if the plan doesn't match the expectation then a connection flag is raised and you can check that at the end of your session, so if it's set you know to look for that session's query planner data.

lol I know there's no way that wouldn't get muted and I'm sure programmers would end up tilting at the windmill anyway but

(SQL Server) Query hints(OPTIMIZE FOR), fixed plans, compiled stored procedures - they all do things like this but they all have tradeoffs that you generally would rather just let the engine take care of and write more clearly expressive queries than take on, mostly "hey wait why did my performance suddenly get worse?" type of reasons.

https://github.com/adsharma/fquery/ https://adsharma.github.io/fquery/

Uses python expressions and generates SQL.

Also does static typing, so you can run a type checker on the code

I feel like most frustrations with SQL boil down to fighting against a shitty schema.

When you are sitting in a properly normalized database, it is a lot easier to write joins and views such that you can compose higher order queries on top.

If you are doing any sort of self-joins or other recursive/case madness, the SQL itself is typically not the problem. Whoever sat down with the business experts on day 1 in that conference room probably got the relational model wrong and they are ultimately to blame for your suffering.

If you have an opportunity to start over on a schema, don't try to do it in the database the first few times. Build it in excel and kick it around with the stakeholders for a few weeks. Once 100% of the participants are comfortable and understand why things are structured (related) the way they are, you can then proceed with the prototype implementation.

Achieving 3NF or better is usually a fundamental requirement for ensuring any meaningfully-complex schema doesn't go off the rails over time.

Only after you get it correct (facts/types/relations) should you even think about what performance issues might arise from what you just modeled. Premature optimization is how you end up screwing yourself really badly 99% of the time. Model it correctly, then consider an optimization pass if performance cannot be reconciled with basic indexing or application-level batching/caching.

The categories were made for man, not man for the categories.

A not-shitty schema is always to be preferred over a shitty one, to be sure. But for any data schema, there will be queries which cut against unexpected joints (joins?).

And SQL is bad for this. The entire Fine Article is a detailed exploration of how it's bad for this. A decent query language would keep the easy things easy (SQL is ok at this) and wouldn't make the hard things pointlessly difficult or impossible-except-at-the-application-layer.

Sometimes it's just a weird or one-off query, and sometimes there are so many of them that migrating to a better schema is indicated. Since it's difficult to do the dozen or so things the article sketches out, it's also difficult to iterate your way to a better schema arrangement.

The sort of waterfall database design you advocate (which, yes, think long and hard about your data!) always breaks down, because building business logic is a process of discovery. There are always missing pieces.

> I feel like most frustrations with SQL boil down to fighting against a shitty schema.

Which one of the frustrations from the article boils down to fighting against shitty schema?

None of them in particular. The author of this article seems mostly concerned with constructing pedantic strawmen and handily defeating them with clever demonstrations. Claiming SQL is not expressive is a fault on the part of the developer of the schema. The notion of a broader ER model and why a good one matters seems to have completely eluded the author, aside from a brief mention of 6NF (which is definitely not the right place to start the conversation on normalization).

It is really easy to take a negative stance on a technology and poke holes in it without having some grounding in reality. If you are dealing with a problem domain that has hundreds of types/facts/relations, and the need to author logic spanning all of these things all at once, you are going to have a hell of a time managing this with any other technology. Looking at SQL through the lens of toy problems is a massive mistake. You have to look at it through the lens of the worst problems imaginable to begin seeing the light.

I used to think SQL was some ancient trash until I saw the glory of a 40 table join in a factory automation setting. If you need dynamic, hard answers right now, SQL is the only option that makes sense at this kind of scale.

I felt it would contribute more meaningfully to the overall discussion on HN if I were to sidestep a direct critique and present a view of the problem from a different perspective.

The vast majority of recursion in SQL is probably the result of a bad schema.

I agree with this. If you are doing recursion in SQL, its probably because you have a bad schema. If the schema is clean, then you should probably be doing the recursion in your application logic. There are still cases where it makes sense but its super rare in my experience.

Storing trees for instance, but you can always rewrite your recursive CTE to use a while loop and a stack.

Yeah, this has been my experience, too. Sql and RDBs are a lot less frustrating when someone takes the time to actually do some design and planning.

Personal experience incoming: At startups, it's usually a mess because hiring someone who knows databases seems to always come so late in the game. At bigger corporations, well, hopefully the developers and database people get along and talk - otherwise, one of those teams is going to be a bottleneck.

> Model it correctly, then consider an optimization pass if performance cannot be reconciled with basic indexing or application-level batching/caching

So true. This also extends into general purpose languages, everything is so much easier when you take the time to model things correctly.

Schemas and requirements evolve over time. There is no perfect schema.

Schema design always results in endless "are we gonna need this?" questions, and usually everything is postponed down the road because it adds needless upfront complexity.

All data are naturally triplets, and with the relational model we are forced to group the triplet attributes into tables. The difficulty arises because these groupings are often incorrect, and then difficult to change. There is always premature optimization, because the SQL query statements become insane with too much normalization.

A bigger problem is how sticky schemas are due to how difficult they are to change, and how difficult it is to update your code to support a schema modification.

I think these two problems need more attention.

We've gone too far down the code-first / code-generation ORM approach which makes tinkering difficult.

I think all database design should happen visually. MS Access style. You should design your schema alongside your data (as you suggested by using spreadsheets). But instead of then transferring that to code, it should be a living schema that connects to your IDE code refactoring. The more declarative queries are and the closer they are to what needs to be finally rendered the more chance of reliable refactoring. The more imperative code modifying the queried data, the more complex and difficult to refactor things become. A lot of the cruft in a codebase is when schema changes are additive because people are too lazy or too risk averse to refactor.

E.g. Think about a User having one address, and then needing to support multiple addresses. There's quick and dirty way by adding more columns or adding to a JSON column, or we add a new Address entity and keep the existing User address columns, or we delete those columns and migrate data to the new Address table - which is the cleanest way to represent this schema change. I think there are few who would do the last option though, and this is the problem. Hiding this behind an API also causes more problems down the line because our frontend data model starts to separate from our actual DB data model.

We need better tools, and less code.

Without any stakes in the game, Intellij IDEA does a pretty good job at integrating the application being written with the db schemas it uses. It can autocomplete string SQL queries, and I think it can do database refactors - and though it doesn’t automatically does it at code site, it does add warnings when a string query becomes incorrect.

>The usual response to complaints about the lack of union types in sql is that you should use an id column that joins against multiple tables, one for each possible type.

>create table json_value(id integer);

>create table json_bool(id integer, value bool)

>create table json_number(id integer, value double);

No, the usual response is "Don't do that!"

99% of the time you either know the data types (so each JSON object becomes a row in a table where the column names are keys) or you don't know the data types and store the whole object as a BLOB

I'd be on board with adding unions to SQL, but I doubt I'd use that feature very often.

But the world just works like that. There are unions everywhere. No matter how it's implemented under the hood, it should really be a first class concept in any language, including SQL.

SQL already have first class support for union types in the form of relations. Adding union-type columns would actually be second class compare to this. You would have to add special-case operators and it wouldn't give you more power compared to just using relations.

That said, I'm not averse to the idea if someone can provide a realistic use case. The JSON example in the article is misguided though - you should not save the structure of the syntax of a serialization format, you should save the data model which is represented by the syntax.

Relations don’t provide sum types in a first-class way.

If you want a list of employees each of which have different roles, and depending on the role you have different fields guaranteed to be available (not null), this is not possible to express directly in SQL.

You can express this with sum types in ML, Haskell or Rust.

In SQL you would have to split it out into separate relations, like:

create table employee(id it)

And then employee_boss, employee_dev, employee_sales which each have different sets of non-null fields.

You can express a foreign key constraint that these tables must link to an employee, but not the other way around. You can’t guarantee in the limited type system of SQL that every employee row has a corresponding employee_X table, and exactly one, not more.

This is trivial in languages with sum types but not in SQL. In this sense they are strictly more powerful.

Various triggers can be added to try to enforce this constraint, but that’s a runtime check: better languages tell you when you type check the code.

Yes! This! Exactly! There’s so much that _can be knowable_ with SQL based on the constraints of the _query_ (not just the relational model) that it should be possible to know the characteristics of the data (and in some cases even the performance characteristics) just from type characteristics.

> You can express a foreign key constraint that these tables must link to an employee, but not the other way around. You can’t guarantee in the limited type system of SQL that every employee row has a corresponding employee_X table, and exactly one, not more.

True. (You can in some databases using check constraints, but it is not necessarily simple) The solution is to improve the constraint model to make it much simpler to define such constraints. For example it is trivial (in most databases) to add primary key and foreign key constraints, but more complex constraints are difficult or impossible.

I think the desire for tagged unions are really an XY problem, just like the desire for array types and other composite types. The relation type itself is powerful enough. But the desire for such features show that certain patterns are too complex and un-intuitive to apply in SQL.

>You can express a foreign key constraint that these tables must link to an employee, but not the other way around. You can’t guarantee in the limited type system of SQL that every employee row has a corresponding employee_X table, and exactly one, not more.

in this case instead of an employee table you would have an employee view that’s the union of the separate tables - engines vary dramatically in what sort of constraints they’ll enforce on the view now but the relational model is fine

In most languages where unions are a first class concept, are they not generally warned against?

The author uses the term "union", but unfortunately they picked the one wrong term against a long list of correct alternatives: sum type, tagged union, discriminated union, coproduct, disjoint union, variant, algebraic data type, ...

"Union" does not imply disjointness (which is the source of many problems), whereas all those other terms do.

Correct, I just continued to use the term of OP. Sum types r coproducts are what I mean.

By the "author", I meant the author of the article. I figured you were just being consistent with them.

You’re probably thinking of C/C++ unions, and they do indeed have quite tricky semantics and are difficult to use correctly. What the article and the other commentators are talking about is more properly called a “sum type”, like Rust’s enums, for instance. Those are different things from C unions. In languages with first class support for sum types, they are used everywhere, it’s an incredibly useful concept.

Gotcha, ta. That becomes a tricky problem when we're talking about something that's primarily a storage engine though, right?

Well, sum types are not magic - they must be stored in the memory somehow. Rust's enums tend to use a simple approach: there's a large enough integer tag (usually an 8-bit integer), followed by enough space to store any of the variants. The same exact encoding could be used with a database storage engine too.

Not sure - because in the end, a programming language also has to store this concept in memory. I don't think there is a fundamental difference.

Not at all. Untagged unions, sure, but no decent modern language has first-class support for untagged unions. First-class tagged unions are a real treat to use.

I was talking specifically about the JSON example in the article. Needing to store objects in that manner is just a silly problem to have. Any solution will be slow or ugly or both.

You're right, unions are everywhere. Right now a human has to think about each union and how to represent it in a database. It would be really cool if I could store capnp objects like the one below and still get optimal query performance and aesthetics without thinking about it:

struct Shape {

  area @0 :Float64;   
  union {
    circle :group {
      radius @1 :Float64;
    rectangle :group {
      width @2 :Float64;
      height @3 :Float64;

Maybe I didn't get your point correctly but this seem actually doable in PostgreSQL.

You could create a custom type based on jsonb type with a trigger to enforce the types inside the jsonb. You could as well write some functions to ease manipulation of such custom type.

There is no out of the box light syntax to do that right now but this could be implemented in a future update/extension. Just look a the complex and successfully implemented geodata types brought in by the PostGis extension.

You can already do that by having circle and rectangle relations. Union types dos not give you any additional power compared to relations.

Your proposal might be more convenient though than creating multiple relations, so we should look into making it just as convenient to create the necessary relations to express this. So lets say your syntax proposal creates multiple relations under the hood - then I'm all aboard!

The "everything is a relation" is a really powerful concept in the relational model. Adding composite column types like arrays, unions etc. are still less powerful than just using relations. But people like it because the syntax is more convenient to use.

Agreed, that would solve my aesthetic concerns and kinda solve OP's XY problem with JSON.

But the performance of auto-created tables for every union member would kinda suck if you never take advantage of the relation and still pay the cost. You'd also have to auto-create indexes so that queries that read full objects is as fast (IOPS) as storing BLOBs.

I was imagining a future database that takes this a step further and lets you define named parameterized queries next to your tables and chooses the best implementation of unions based on your queries. That's what I meant by "without thinking about it". We already do this for algorithms (query optimizer), and I'd love to see it work with data structures

The basic premise of the relational model is that the physical storage structure and optimizations (including indexes) are independent of the logical model. A table could be stored row-by-row or column-by-column or any other clever way without it affecting the schema and queries - although it will affect the performance of the queries.

But if I have a circle relation and a triangle relation, how do I create a foreign key for the favourite_shape column in my user relation?

Thinking a bit more about it I think this would be a very elegant solution for Tagged-Unions in SQL:

- simply allow tagged unions for foreign key constraints: CONSTRAINT my_fk FOREIGN KEY (own_column) REFERENCES EITHER table_a (a_id) OR table_b (b_id) OR table_c (c_id)

- add join syntax to join via defined foreign key: FROM own_table JOIN VIA my_fk

The union FK would store an additional flag determining the target table and the JOIN VIA would switch on that flag and it would be allowed to use any columns from all target tables in the query but only the matching ones would be not null.

If you put the foreign keys on the table_a/b/c tables, then you can express the same without having to introduce tagged unions.

As I said in response to a sibling: But can you can not (in a simple way) prevent that both a circle and a triangle point to the same user and you can not enforce a NOT NULL for a user to must have a favourite shape.

You would have the foreign key in the circle/triangle relations pointing to favourite_shape rather than the other way around.

But can you can not (in a simple way) prevent that both a circle and a triangle point to the same user and you can not enforce a NOT NULL for a user to must have a favourite shape.

You can do that with a check constraint, although I don't know if you consider it simple. It would be cool with better support for common constraint patterns.

Could you give a concrete example for such a check constraint? I imagine each shape-table would need to have it's own check to test the existence of the same id in all other shape tables, which scales qudratically.

Or if you inverse the FK to make the user-table reference the shape tables you would need multiple potential nullable columns and a check constrain that exactly one is non-null. And it would still not solve the query side of things.

Check constraints are just SQL expressions, so in pesudocode it would be something like:

    (SELECT COUNT(*) shape JOIN square JOIN circle) = 1
How conveniently this can be expressed depend on the database system in question.

But if we are talking about adding features to SQL, it could have built-in support for such a constraint, just as there is built-in support for say foreign key constraints.

Are you by any chance generating GPT-3 comments as part of an experiment?

If not please take time to actually dive into modern SQL concepts there is likely a way to achieve your goal but I personally consider that it's pretty rude to ask random strangers a proof for a complex and loosely defined problem in the comments section. Stack Exchange should be a better place for instance.

I do love SQL and at least where I live (MS SQL Server) it can be made to run amazingly fast if you take some care with your queries and indexes. It's not portable though: as far as I know not a single one of the big sql vendors follows the standards 100% and more importantly, spending some time with one vendor will give you some habits that are sure to not work as well with another (cursor constructs are generally a death blow to performance in tsql but they are the way to do it on oracle for example). So I kind of agree with the author here.

But I also feel that maybe they are asking a bit much from SQL. The complaint that complex subqueries are complex... Well then don't use them? I would use WITH constructs in that situation because I find them easier to read but that's beside the point. I think its perfectly fine to pull out multiple result sets from simple queries and then do the complex stuff in your host language.

> maybe they are asking a bit much from SQL

But this article is thought provoking to say the least. It follows the courtroom logic of holding the defendant SQL on trial for as much as possible. And SQL is guilty of a lot of crimes.

I do hope GraphQL and similar query languages become more prevalent and standardized, as it seems SQL could really use some stiffer competition.

SQL is often conflated with the relational model (hence the term NoSQL for non-relational databases), but the article is careful to explain that the relational model is great, but SQL is a clunky syntax/standard.

Going to graph databases is certainly throwing the baby out with the bathwater. The relational model was invented to address shortcomings in the hierarchical and graph database models.

> SQL is often conflated with the relational model (hence the term NoSQL for non-relational databases)

Still not confused enough? Check this out: http://www.strozzi.it/cgi-bin/CSA/tw7/I/en_US/NoSQL/

Do graph databases predate relational databases? Do you have a source for that? I thought graph databases were a fairly new thing.

I only recently started working with a graph database (a bit over two years now), and it struck me just how terrible relational databases are at relationships, compared to graph databases. I know nothing about the history of databases, but my impression is that relational DBs are basically an expansion of simpler table-based databases that can also handle relations between different entries, if awkwardly. Graph databases are a much better fit and smoother to work with.

Graph databases have been around nearly as long as databases generally, at least since the 1970s. The sole feature that makes a database a "graph database" is support for a minimal amount of recursion in queries, which was a feature before SQL even existed. There are good technical reasons graph databases have never been commercially successful.

Most databases support some form of recursion and have for decades. The reason they don't market themselves as a "graph database" is that the performance and scalability of graph data models in conventional database architectures is typically very poor, so representing your entire data model that way is not encouraged. The way indexing is implemented in many traditional database engine designs, e.g. B-trees, is pathological for some common graph-like traversals. Even databases that market themselves as "graph databases" have conventional internals and perform only marginally better for graph data models than databases that are positioned differently.

The idea of graph databases are great. Unfortunately, the existing implementations all suffer from very poor scalability due to fundamental limitations of the data structures and algorithms used internally. The core theoretical challenges are well-understood but few people are credibly working on addressing those.

Interesting. But are those old graph databases similar to modern ones like Neo4j? Neo4j has pretty amazing performance compared to doing the same thing in SQL. Although it's certainly possible that SQL database was poorly designed. Even so, a handful of developers new to graph DBs managed to easily beat its performance.

I've also heard that some modern graph databases are not true native graph databases below the surface, and therefore perform worse at graph-specific queries.

Edit: is it possible you're talking about Network Model DBs[0]? Wikipedia mentions them being around since the late 1960s, and that they could model graphs, but suggests it's more of a predecessor to graph DBs, which saw a lot of improvements until the arrival of modern commercial graph DBs in the 2000s[1].

Again, I'm not an expert on this at all, but it sounds like there have been significant improvements in graph databases since the 1970s. Much more so than in relational databases.

[0] https://en.wikipedia.org/wiki/Network_model

[1] https://en.wikipedia.org/wiki/Graph_database

I wouldn't say relational databases are poorly designed, any more than saying a hammer is poorly designed because it makes for a bad screwdriver. A hammer is still excellent at working with nails, it's hard to find a better tool to work with that. This is just about using the right tool for the job.

Back when data was simpler and not as big, relational databases were perfect, and there have been years of engineering and bug fixes that have gone into them. They are excellent at what they do, and they continue to improve.

But as technology has improved, as our disks and memory have gotten bigger, as the data we collect and want to query over has gotten bigger, and as our queries have gotten more complex, we've been running against the limitations of log(n) joins and relational database technology for some use cases. Now, not every problem is a nail. Some are screws. Some are more exotic.

That's been the reason for nosql databases in the first place, to try to address the shortcomings that arise as data gets bigger, more complex, and as queries and operations become more complex over large data.

log(n) joins are fine...until data explodes, and you're no longer doing just a handful of joins per query, but a very large number of them, maybe even unbounded, and maybe the rules for what data to traverse has soft or even no restrictions. When your data is graphy, when the questions you want to answer require traversals of this scale and nature, and when you want to make sure your traversal costs are proportional only to the graph you want to traverse (and not proportional to the total data in the database), then graph databases provide a very good tool for modeling and efficiently querying over that data.

Graph databases are relatively young, compared to relational databases. Yet their usage has been proven, especially as more graphy problems and data have grown more common.

Relational databases are still useful, and still improving, and graph databases will also continue to grow and improve side by side with them.

We even have a GQL initiative, on the language side, aimed at becoming an ISO standard that will hold an equivalent position as SQL, but for graph querying. That should speak to the value and importance of the paradigm.

The fundamental premise of the relational model is the physical/logical distinction. The relational model deliberately does not make any requirements or assumptions about how data is physically stored or structured.

The difference between relational and graph (and other NoSQL database systems) is not about particular sizes and shapes of data, it is about level of abstraction. For example assuming joins are "log(n)" makes certain assumptions about how relations and indexes are implemented which is only true for some naive implementation (like Access or MySQL).

Just as an example, materialized views is a physical-level optimization where an arbitrary complex query result is stored and kept updated, which means data can be retrieved as fast as physically possible. Of course this has a cost at insert-time, since materialized views also have to be updated - but this is a performance trade-off just like the structure of a graph database is a performance trade-off.

NoSQL databases has a tight coupling between the physical and logical structure, which makes them easier to optimize towards particular usage patterns but harder to adapt to changing requirements over time. The relational model was specifically designed for large databases used by multiple applications and changing over time.

Performance was not the selling point of relational databases in the first place. It is hard to beat the performance of a perfectly tuned hierarchical or graph database on the workloads they are designed for. At best a relational database can be equally fast. For this reason hierarchical/graph databases never went away and are still appropriate for specialized tasks.

The selling point of relational database is they can accommodate long-lived multi-purpose databases used by multiple different users and applications over time and where the data model would change and evolve over time as the world changes. The is achieved by "data independence" - the logical model is decoupled from physical storage structure and is not hard-coded to accommodate particular query patterns. So you can have ad-hoc queries - cross-cutting queries which you didn't know you would need when the schema was designed. The key insight is that relationships between entities are also just data.

You can build almost any kind of database on top of any decent database kernel -- they are similarly expressive. The "kind" of database is just a skin over the kernel. However, different kernel designs make different performance tradeoffs in several dimensions. A graph database is much more join-intensive than a relational database, for example, so it may make sense to reduce select performance to increase join performance.

Graph database technology was essentially stagnant for decades, and the design of Neo4j is from that era. The basic technology is the same across all graph databases, just incrementally modernized and micro-optimized, hence why they all have the same scalability and performance problems. The market gap is that most interesting graph data models have at least 10^12 edges and graph databases stop working well long before you get to that scale.

There was a major breakthrough on this topic around 2008 at a corporate research lab, a new way of representing graph data models that was highly scalable. By coincidence, I was working in graph database R&D at the same time elsewhere and had some exposure to the research, made several material improvements to the new approach, and even built working prototype database based on it that addressed a lot of classic scalability and performance issues. I stopped working in graph databases shortly thereafter but to this day have not seen a single implementation in the wild based on that research. It isn't the kind of thing you can build on top of an existing database kernel, the internals were necessarily pretty exotic, which is part of the problem.

tl;dr: Graph databases have been stuck in the same local architectural minima for half a century, receiving a fresh coat of paint every decade or so but not addressing any of the reasons almost no one can use them.

Sure, in the 1970 paper "A Relational Model of Data for Large Shared Data Banks" where E.F.Codd introduced the relational model, he specifically mention how it is superior to hierarchical and graph databases, which were the dominant models at the time.

The hierarchical database evolved from the flat file database by allowing nested records (i.e a record could have sets of child records, arbitrary deep). The graph model further added support for navigation links or pointers between records across the structure, hence supporting a graph model (also commonly called network model at the time).

> Do graph databases predate relational databases? Do you have a source for that? I thought graph databases were a fairly new thing.

I believe they just called them "network databases" back then.

Aren’t you just doing the same thing in reverse here? You can use GraphQL with a relational database.

Fair enough! But GraphQL returns data in a hierarchical format, so the property of relational closure is lost.

I do not understand the comparison between SQL and GraphQL. As far as I can see they’re not close to the same. And GraphQL is also missing a bunch of really necessary things.

GraphQL is very limited in expression power compared to SQL. I doubt you could do 20% with GraphQL compared to what you can do with SQL.

Are you also talking about switching from a relational model to something graph-based, or is there some way one could use GraphQL with a relational database?

GraphQL isn't a graphdb querying language. The "graph" in its name is highly misleading, and should basically be ignored.

It's an API querying language.

SQL is also an API querying language. The primary difference is that SQL is designed to query relational data (rows and columns), while GraphQL graph data (tree-like structures).

The parent is questioning how useful GraphQL is as a relational query language or if the suggestion centres around moving to the database speaking in graphs. Conceivably, the data storage could still follow what suits the relational model internally with some kind of ORM-style engine that sits in the middle to translate relations to graphs for the GraphQL query processor, but at that point you could argue that the database is a graph database.

GraphQL is stiff competition, in the sense of "stiff as a corpse". Yes, it has the words "query language" in the name, but that's where the similarities end.

In GraphQL, you can select fields of JSON objects, optionally a field may have parameters which affect the values it returns in unspecified ways. That's it. Because of this design, unlike in SQL where you are concerned with modeling the structure of your data, GraphQL also requires you to think about the API all users will use to access the data. In SQL, I can write a table defining the structure of my dataset, and then users of that table can perform arbitrary queries when they know what data they need (aggregate with a month average grouped by account ID, filter to only rows where x = 7, JOIN this to grab some data from some other table etc.).

GraphQL has no aggregates (sum, average...), no grouping, no joins, no sorting, no filtering, other than what you manually design each using parameters at schema design time. Good luck anticipating the use cases of every future consumer of your data. Miss one? Better dig back into your implementation code & implement that use case each & every time a new one comes up.

The only part of GraphQL that is standardized is the query syntax. In SQL, the actual underlying relational data model exists and the syntax of queries exists within that context, not so in GraphQL land. In SQL, I define my data structures, and users can write queries and access the data. But GraphQL throws up it's hands and says "not my problem, try one of these libraries that ask you to implement your own custom data access functionality for all your data types".

OK, so it's a rubbish query language, but even the graph part of the name is misleading. Assuming that you even have a domain that it makes sense to model with a graph of types, GraphQL provides you no tools for dealing with the backend complexity of such a design. Because the syntax is so simplified, there is no mechanism within the syntax to define rules about relationships between types. For example, imagine a simple parent/child relationship. There is no mechanism within the syntax to tell GraphQL that for parent X, parent.child = parent.child.parent . So you can't even think about writing a GraphQL query optimizer, because there isn't enough information about the structure of the data encoded into the schema or query to do so.

So in practice no GraphQL implementations that I know of have anything resembling a query optimizer - someone asks for cart.item, and then item.cart for a cart with 1000 items? Have fun re-requesting the cart from your DB 1000 times (yes you can cache the item by ID to save a DB lookup, but we shouldn't even need to hit cache here! Every programmer involved knows the data is the same, it's just dumb GraphQL has no clue about the details of the relationship).

You are focusing on the wrong aspects of the comparison here. The point is not that GraphQL is dreamy from a developer perspective (as opposed to the api end-user perspective). The popularity just demonstrates that there is a real end-user need for the query to be correctly structured out of the gate instead of being re-combined/nested/etc after it's left the database.

> The original idea of relational databases was that they would be queried directly from the client. With the rise of the web this idea died - SQL is too complex to be easily secured against adversarial input, cache invalidation for SQL queries is too hard, and there is no way to easily spawn background tasks (eg resizing images) or to communicate with the rest of the world (eg sending email). And the SQL language itself was not an appealing target for adding these capabilities.

> So instead we added the 'application layer' - a process written in a reasonable programming language that would live between the database and the client and manage their communication. And we invented ORM to patch over the weaknesses of SQL, especially the lack of compressibility.

> This move was necessary, but costly.

Your complaints are about GraphQL at the application layer, ie, the layer of concern that exists because SQL is inadequate for the burden it's been made to bear.

Imagine if GraphQL added aggregates and support for types besides what JSON limits you to, and it then just sat atop an RDMS like Postgres so that you no longer have to fuck around with it at the application layer to get it going. That would clearly be an improvement, right? Or really, just the ability to implement a language like GraphQL atop the RDMS so you can design your own QL that you don't hate.

My bet is that somepoint we are going to get very good NLP model where you feed tabular data, and then you are going query througt normal language. That would like be pretty big change and would eat pretty big share of sql market share.

>spending some time with one vendor will give you some habits that are sure to not work as well with another

It seems like having multiple vendors is only valuable if their products are to some degree differentiated, no?

For me the opposite is true. The value of multiple vendors with a similar product is that it drives prices down and liberates me from a lock in. I genrally avoid managed services that are not available from all three major cloud providers, and put abstractions between my suff and their so I can move my workloads around.

As to SQL. Its a weird feeling to read all that; I've spent 15 years working with very large relational databases -- and about 5 years ago I ditched it all in favor of using key-value object stores, wasting some storage but saving a metric ton of manhours of development work. Not looking back.

What do you use for your KV storage?

I agree with your last point to an extent. There's something about "inexpressiveness" that can actually be good, in that it requires you to simplify your data model. However, I imagine as it gets more complex, SQL becomes completely unwieldable. You basically have to use a NoSQL db as the author points out.

Ideally a RDBMS can also be a NoSQL DB. There is no reason it can't/shouldn't, if you have JSON-formatted columns.

Mostly the noSQL pattern is a mistake because you almost certainly have columns that recur reasonably frequently, but if you do have high-dimensional data or want to store processed documents/etc, you can represent them as a JSON/JSONB in postgres/etc, and even run indexes or queries on them.

Use a tool to generate the SQL schema. Then you can generate it to different DB types. Stored procedures is what breaks things in my experience.

Yeah, however it is kind of strange to single out SQL, when we have examples like a very famous kernel that can only make use of specific compiler, or Web APIs that are only implemented by a specific browser.

I don't see it as strange. They probably spend lots of time mucking around in SQL trying to force it to things it's not good at, so now they have strong opinions on SQL. They specifically state that it's too late to turn back the clock on SQL, but it's not too late to expose RDMS access with a better language, or with a sufficiently flexible API to allow arbitrary QL implementations. They aren't shitting on anyone here either. They say that the assumptions made when SQL were designed ultimately didn't pan out:

> The original idea of relational databases was that they would be queried directly from the client. With the rise of the web this idea died - SQL is too complex to be easily secured against adversarial input, cache invalidation for SQL queries is too hard, and there is no way to easily spawn background tasks (eg resizing images) or to communicate with the rest of the world (eg sending email). And the SQL language itself was not an appealing target for adding these capabilities.

> So instead we added the 'application layer' - a process written in a reasonable programming language that would live between the database and the client and manage their communication. And we invented ORM to patch over the weaknesses of SQL, especially the lack of compressibility.

<message has been deleted>

Any more info? ;)

No I just hate mssql

SQL was not made for programmers alone. It has been invented also for not so technical people so that verboseness and overhead is part of the deal.

>When Ray and I were designing Sequel in 1974, we thought that the predominant use of the language would be for ad-hoc queries by planners and other professionals whose domain of expertise was not primarily data- base management. We wanted the language to be simple enough that ordinary people could ‘‘walk up and use it’’ with a minimum of training.


> We wanted the language to be simple enough that ordinary people could ‘‘walk up and use it’’ with a minimum of training.

In that case it has been an abject failure. I have been using SQL since the mid 1980s (so pretty much since the start of its widespread adoption) and I have never met "ordinary people" (by which I assume intelligent business-oriented professionals) who could (or wanted to) cope with it.

I like it, but the idea of sets does not come naturally to most people (me included). But I once worked with a programmer who had never been exposed to SQL - I leant him an introductory book on it and he came in the next day and said "Oh, of course, it's all sets, isn't it?" and then went on to write some of the most fiendish queries I have ever seen.

I don't think this is disputed that the original goal of SQL was a flop. The designers grossly underestimated the technical chops of a layman. However, I would argue that us tech people did benefit from that original goal of simplicity. I mean, SELECT first, last FROM employees WHERE id = 10 is not too bad at all. Kind of elegant, no?

If SQL was designed "by engineers for engineers", you would be using esoteric Git commands just to blow off steam.

> Kind of elegant, no?

Oh, I agree, which I said "I like it", and compared to things like CODASYL it was and is shining sanity.

This is very important. I personally like old technologies and languages where the designers considered users who had limited technical skills, and most importantly, assumed that those users had no interest or need to improve their technical skills. Removing the assumption that users are willing to increase their technical sophistication forces a designer to think more about what they're designing. Looking at older languages is interesting - for all their warts, they do feel more intentional in their design than modern things that have a clear developer-centric mindset baked in.

It's similar to the discussions around COBOL back in the day. What's interesting is that the primary "end user" language is Excel formulas. Who would have thought a declaratory relational language with matrices would "win" that battle.

Any arguments that "users will write their own" languages are basically flawed. Users want results, if there's no alternative, they'll do it themselves, in the simplest, but probably most inefficient way possible.

This is a recurring theme. I was once amazed to learn that ClearCase was made for lawyers to use. The name makes suddenly a lot of sense!

I share the author's point of view, which led me to start a new relational programming language that compiles to SQL. It's a way to build on existing databases, like postgres or mysql, with all of their advantages, but improve on many of SQL's limitations.

If that sounds interesting, you can find it here: https://github.com/erezsh/Preql

I was sketching a syntax for python to work with pandas, and avoid extra annoying code, and surprisingly some of the elements and approaches I see in your code are similar. My syntax for subsetting was

   df{\*, -column1}
   df{column1: new_name1, column2, column3: new_name3}
I noticed in working with pandas you often need to do lookups into other dataframes. It's partially solved by assignment operator if left field equals right index, or .map method, the same way.

But often you need a lookup with merging by an arbitrary column, then grouping and aggregation by the left table items. This is partially doable without special functions. But this can be a killer feature if one makes this for spatial joins.

Very often you need to do the following:

    gdf1 = geodataframe of points
    gdf2 = geodataframe of points
need to make gdf1.geometry.buffer(500 m) and sjoin it with gdf2.geometry, then lookup gdf2 fields and bring them to gdf1, and keep original gdf1.geometry (points). This operation takes a dozen of lines and leaves lots of variable garbage if not put into a separate function.

But IMO it could be condensed to something more natively supported.

What I really wanted was something compatible with SQL, but that augmented it (and compiled to plain sql). Like typescript is for javascript, but instead of adding a type system, adding constructs for better composability.

Something like.. BQL http://intelligiblebabble.com/a-better-query-language-bql-la... which went nowhere https://github.com/lelandrichardson/BQL

Looks interesting. I've been thinking about trying this myself and one of my goals has been to create a language that's easily introspectible. I think it's much more important for a query language as opposed to an application language, since you'll want to see what code in the former does without running it for integration into application code.

My approach has been to design a very simple (in the lisp sense) syntax, kind of the opposite to SQL where everything is hard-coded into the parser. I've adopted a "pipeline programming"-like approach, where the operations are just (special) functions, which also helps with extensibility. Have you thought about this? From a cursory look, it seems Preql does rely on keywords. Admittedly fewer than SQL, but it also doesn't cover all of its features.

I don't have a lot of experience writing Lisp-y code, so perhaps I'm speaking from ignorance, but I think there is a reason that syntax never gained huge traction. Imho a syntax that's concise and expressive is important for effective coding. Having operators for the most common operations is just a small complication that yields a big reward.

Having said that, the amount of keywords and operators that you see in Preql right now, isn't likely to grow by much. I have the basic and common operators covered, and for the most part, the rest can be done with regular functions.

I agree about introspection, which is why in Preql you can ask for the type of any value, list the columns of a table, get the resulting SQL of an expression as a string, and so on. And certainly more can and should be done towards that.

Thanks for the reply. I'm happy that others are also tackling the problem of a better query language. My approach isn't actually a full-on Lisp with parentheses and all, rather it's based on a single compound data structure (like Lisp's cons cell, but more like Lua tables). I call it an "arg-tuple", and it's basically a function's arguments in Python, but as a data type (which allows nesting). Add in a simple function call syntax, infix operartors and a "pipeline" operator (like F#'s |>) and you get something like this:

    from stops
    | let region :be case(zoneid,
        :when ("P", "0", "B") :then "Prague",
        :when ("1", "2") :then "Almost-Prague",
        :else "Regional")
    | where lat != 0 && lon != 0 && region != "Prague"
    | select name, region
"from", "let", "where" and "select" are filters, functions that take a query description object and return a new one. "name, region" is just an arg-tuple with two unnamed elements and "case" is a function taking one polymorphic value and a variable number of nested arg-tuples with ":when: and ":then" named elements.

Filters can be arranged in any order, unlike SQL, so this would work:

    from stops
    | group_by lat, lon
    | group_by lat
    | select min(group.lon)

I tried developing a syntax like this: a table is set as context, then you do some transformations in pipeline fashion. That seems promising, but trying to rewrite some of my reallife examples, I found out that pipelines are much shorter than I expected. The reason was that you take a couple of tables, then make 3-4 of them (by filtering and grouping), then mix them together. And this does not fit well into pipeline structure, even with heavy nesting. In the end, the most gain on real examples (Python & Pandas) was where I made a function-like operation, which can be as well written in Python.

I can share some sketches if you wish (maybe it can give some ideas).

But generally, I think the code could be rather structured around calculation graph, than a linear sequence of operations.

I've also tried to rewrite some of my challenging queries [0] in my hypothetical syntax and while I think your observation about pipeline length is correct, the result still came out much better than SQL. Frankly, even in F#, most of my pipelines are around 5 functions too. In my view, pipelines are just a convenient mental model. I'd love to see your sketches, here are my (very WIP) concepts: [1]

I don't quite see how a computation graph would work as the core of a textual language and I'm sceptical about using it with existing DBMSs (I'd like to actually use my creation one day :) ), but I'm open to ideas.

[0]: for example this monstrosity: https://gitlab.com/dvdkon/jrutil/-/blob/dbb971c18526e68dcc97...

[1]: https://gitlab.com/-/snippets/2147895

Python has a data structure like this called NamedTuple.

I also noticed it's popping up everywhere, in this case sql rows (kind of). Conceptually I like the idea of using it as the mechanism for function arguments, but I didn't want the syntax to be too confusing for beginners.

Although yours seems a little bit different, because different fields can have the same name.

Just for the heck of it, here's how your second example would look in preql:

    { lat, lon => }
    { lat => lon: min(lon) } 
    { lon }    // not necessary. just drops the lat column

It's not quite that arg-tuples can contain multiple values under the same name, but rather ":a 1, :a 2" is actually an arg-tuple containing two nested arg-tuples as unnamed values. I agree that basing all of syntax on a single element (especially one that's so "light" on syntax), it's hard making the resulting syntax intuitive. I think it's worth it, but I can see other approaches' benefits.

It's definitely an interesting approach. I imagine it makes the parsing much easier.

This is a great article and you can tell the author has deep experience with SQL from the way they speak and the other projects they're involved in.

I think many of the comments here are missing the point by saying "Oh you can get around that issue in that example snippet by doing X Y Z". Sure there are workaround for everything if you know the One Weird Trick with these 10 gotchas that I won't tell you about... but that just makes the authors point.

We can do better. We deserve better.

What could things look like if you could radically alter the SQL language, replace it altogether, or even move layers from databases or applications into each other?

Who knows if it will be better or worse, but I'd like to find out.

SQL and the relational model mostly works well and it's probably not practical to redo the enormous amount of work that was invested in SQL and its implementations and extensions.

As someone who frequently used SQL for analytics and less frequently for app development, I would gladly use a language that would transparently translate to SQL while adding some syntactic niceties, like Coffeescript did to JS:

- Join / subquery / CTE shortcuts for common use cases (e. g. for the FK lookups that are mentioned in the article)

- More flexible treatment of whitespace (e. g. allow trailing commas, allow reordering of clauses etc.)

And for the language to be usable, it would probably need: - First class support for some extended SQL syntax commonly used in practice (e.g. Postgres's additions)

- integration with console tools (e.g. psql), common libraries (e.g. pandas, psycopg2) and schema introspection tools

- editor support / syntax highlighting.

It would probably be good to model the syntax of that language on some DSL-friendly general purpose language (like Scala, Kotlin or Ruby).

Isn't it what query builders, such as Knex.js, are for?

Basically it would be ideal to have a query builder that is easily integrated with shells and notebooks (so that it can be used outside the context of writing programs in a specific languages) and that is accepted across the community.

After a little bit more than 2 decades of coding, SQL is nearly the only thing that was constant in my career.

It's a skill I used every working day, I'm pretty sure I will still use it in 20 years.

On the other side, tt's very unlikely that the ORM 'du jour' will exist in 3 years from now.

I think you're right. I hope your name isn't though!

Haha, in fact I believe js (vanilla, not framework of the week) is probably as durable as SQL

> First, while SQL allows user-defined types, it doesn't have any concept of a union type.

Isn't a union type essentially a de-normalized field?

This seems like attacking arithmetic operators for their lousy character string support.

Weren't XML databases (briefly) a (marketing) thing some decades back?

One idea might be to have everyone integrate jq[1] into their database engines. My understanding is that one can make the JSON do back flips with jq. Then we can move to complaining about queries that appear to have been written in Klingon instead of boring ol' SQueaL.

[1] https://stedolan.github.io/jq/manual/

> Isn't a union type essentially a de-normalized field?

No? You have to denormalize to emulate unions when they're missing. Sum types are a fundamental category of types, that SQL only supports product types is a problem you have to work around.

See mannykannot's reply.

The argument for union types seems to get weak when one asks: how do we index their components?

Because there seems little middle ground between needing discrete fields and safely just parking the data as a memo field and deferring the management to the application.

Unix win by letting the system utilities specialize.

SQL need not be "one language to rule them all".

> See mannykannot's reply.

Their answer clearly, explicitly, assumes a C-style `union`. Despite the essay literally using a proper sum type as example.

> The argument for union types seems to get weak when one asks: how do we index their components?

With the system creating partial indices under the cover? I fail to see what's complicated about it.

> SQL need not be "one language to rule them all".

SQL has "won" the relational battle and is literally the only way to query relational databases. Any time SQL is unable to do the job and you have to move that job to application code, you're making the schema less reliable and less of a source of truth, because parts of the schema's information have to be embedded in each application instead.

That doesn't seem desirable to me, unless you assert SQL should just be a trivial data storage and retrieval interface, which it has not been… possibly ever, but at the very least since the introduction of window functions.

I would contend that there is a minimum set of tools across which an system will span.

However, attempts to crunch the system into One True Tool will hit the diminishing returns curve early and hard.

Quite to the contrary, many users tend toward Object-Relational Mappers, just to ease the transition between data and logic portions of the application.

It is more fundamental than that: a union violates the fundamental tenets of the relational model. In that model, each element of a relation must be a simple value of some type.

If you are working with unions and, for whatever reason, want to put unions into a data store, this may seem to be a capricious limitation, but this rule, together with the other principles Codd stipulated, are the basis of the semantic transparency of the relational model, in which each relation expresses an atomic fact about the world of discourse. This, in turn, is the basis of its desirable features, such as its openness to ad-hoc querying, and the applicability of referential integrity constraints.

To look at it in more concrete terms, suppose you had an attribute with a union type: the meaning of any particular bit-pattern would be ambiguous - it might depend on the value of a different datum, or, worse, be context-dependent in a more complex way. This is going to make querying more complicated, whether you are using SQL or some replacement for it, and while one or two cases may seem expedient and harmless, these are the sort of accommodations that, as they accumulate, lead to programs becoming hard to understand and brittle.

At this point, I am unsure whether it would be acceptable, within the relational model, to have types that are, structurally, a union together with a flag disambiguating it. On the one hand, this would avoid the problem of disambiguation I mentioned above, but if it were implemented in such a way that the flag value is independently queryable and/or settable, that would seem to open a back door to let in all the seems-expedient-but-ends-badly design choices that raw union types would facilitate.

SQL is a pretty warty implementation of relational databases (with non-composability being its primary sin IMO), but we're stuck with it at this point. A new querying DSL that fixes all of SQL's flaws is only half the story, getting enough programmers on the planet to buy into it is another half. To do that you'd need this new piece of software to at least be as fast and as battle tested as existing SQL databases. Even the new generation of massively-scalable relational databases stick with some form of SQL instead of inventing a new DSL because of the sheer momentum behind this sorry syntax.

Anybody can criticise SQL, programming languages, etc. It isn't hard and it doesn't make you better than the people that wrote them. When someone says "this thing that has been working fine for decades needs to be completely replaced" and barely mentions any alternative, I don't think they understand the process involved in replacing things or the terrible (non) proposition they are offering.

Increment on SQL, write a translation layer, and see if people adopt it. Maybe 10 years from now your idea will be more popular than standard SQL. Most likely your idea sucks though and you will stay in the easy land of criticising things.

The front-end is infinitely more complex than SQL on the backend. I write fairly common web applications and the SQL part is maybe 10% of my time, and very easy. React is where I spend most of my time. I don't have any problem that really needs to be solved. SQL works for me even though it isn't perfect. Any imperfections can most likely be incrementally fixed. I use tagged templates in JavaScript to deal with parameters, composability, and reusability.

The fact the the author highlights GraphQL as supposedly the great alternative shows how ridiculous the proposition is. GraphQL does basically nothing. It is 10% of the functionally of SQL.

When I compare the codebase of a tool like gqlgen (a GraphQL library in Go) with the code base of PostgreSQL or SQLite, I'd say even 10% is generous.

It would be really cool if databases had an Option<T> type. Then you could remove all the NULLs. Although you can mark a column as NOT NULL, that restriction doesn't "travel": it isn't present for function inputs/outputs, subquery results, etc. Adding it to the type system gives you a lot more mileage. And then joins could be option-aware: an inner join would have outputs matching the input types, but an outer join would have Option outputs (for at least one side).

I'm curious how much work has been done on optimizers for Tutorial D or other D variants. It looks way nicer to use, but I wonder if it is easier to stumble into pathological cases.

This is how it is done in ClickHouse. It has Nullable(T) type. The functions of non-Nullable types will return non-Nullable types (except some specific functions).


you can use COALESCE to remove nulls

Explanation for your downvotes: you misunderstood GP's post, he refers to removing the concept of null as it is currently specified from the language, not removing null values from data.

> It would be really cool if databases had an Option<T> type. Then you could remove all the NULLs.

A nullable T column _is_ an Option<T> column, with NULL representing "Empty" or "No Value".

OP explains why this is not the same in all the other sentences of their comment

The only way out that I can see is to design embedded domain specific languages (EDSLs) that inherit the expressiveness, composability and type safety from the host language. That's what Opaleye and Rel8 (Postgres EDSLs for Haskell do. Haskell is particularly good for this. The query language can be just a monad and therefore users can carry all of their knowledge of monadic programming to writing database queries.

This approach doesn't resolve all of the author's complaints but it does solve many.

Disclaimer: I'm the author of Opaleye. Rel8 is built on Opaleye. Other relational query EDSLs are available.

[1] https://github.com/tomjaguarpaw/haskell-opaleye/ [2] https://github.com/circuithub/rel8/

As someone from the analytics side who's been working with SQL for 30 years (First Choice, remember that?) (but who also wrote a fair share of ORM boilerplate), I find these debates fascinating .. but also kind of trivial, in the sense that SQL has a lot of other pros and cons that app devs rarely consider.

Truly it is blind men evaluating an elephant.

Given SQL's roots as a human-friendly declarative interface, the only thing I see completely replacing it in the near future is a Copilot-style neural implant where you just think of the results you want.

A more pragmatic view in that article: https://blog.nelhage.com/post/some-opinionated-sql-takes/

Thanks, that's a great article. It has just the right balance of some interesting things I didn't know with enough things I agree with that I believe it!

I agree with the desire for a data-based language, rather than text-based one as SQL is. A classic example of this is MongoDB: you can add a new filter by just adding a new entry to a dict in Python or object in JS etc. I think 99% of the reason MongoDB was successful, at least in the early days, was because of its data based API. (Polite request to all: please don't reply to this comment with pros/cons of MongoDB except it's query language.)

I especially agree with the point about having to trick query planners into using the indices you wrote. I get that sometimes it's nice to let the database engine cleverly choose the best strategy (dynamically building queries with a data-based API would be a case in point). But in other situations you'll have carefully designed the tables and indices around one or more specific queries, and then it's frustrating not being able to directly express that design in the code.

I don't have any experience with live migration of production databases (thankfully!) so that was interesting, especially the conclusion that MySQL is best for this, which I didn't expect. The idea of separating out the type system into lower-level "storage types" and higher-level "semantic types" was also food for thought.

I agree that a programmatic API semantically equivalent to SQL – for example like the JSON-based MongoDB API you mentioned – would go a long way making things easier for application developers and removing the need for ORMs (which I consistently avoid).

F1, Google's SQL database, uses Protocol Buffer in an interesting way: https://storage.googleapis.com/pub-tools-public-publication-...

SQL is not a great API but its a great human interface. JSON based query language is much better API but terrible human interface. So depending on your objectives, it might make things better or worse.

MySQL is strange choice but think I understand why the author picked it - from his other critique he seems to look at databases as a building block of hyper-scaleable applications, not as a tool for humans to do often-ad hoc things with data.

I would never recommend MySQL for "business data" - it had and possibly still has way too many footguns with regards to number behavior, character encodings and Unicode, date and timestamp handling, and so on - hell, it doesn't have a proper MERGE and it only got CTEs in the latest version.

But if you're using it as persistence store that barely more than key-value store, why not? I have no problem believing the author that that kind of use is more common.

I've been thinking about this problem a lot, CRUDs, GraphQL,ORMs, Models etc. Mostly in the "CRUD-Like" environment. I have been thinking about a "client side SQL impl".

In most CRUD's we currently have on the backend layers and layers of software with ORMS, frameworks etc, and it all boils down to "Writing/Generating the correct(good-enough) SQL"

We now have added stuff like GraphQL, which if you squint hard enough (ok very hard) can be seen as being a SQL alternative(Language to get the actual data).

Maybe SQL + "GraphQL-Like" Layers should "evolve" into ONE common "data scripting language" ?

Maybe we have something like "ClientSide-SQL" - which can be a subset of ServerSide-SQL ?

We need the "TypeScript" of "data-querying" which can be run on the server,client, moon and my device, where one can also only define any "Types" ONCE.

Anywhoo - I think there is still a lot to be done, researched and discovered in this section of CS :)

Yeh, this is lacking right now. GQL makes joins much easier to write than SQL, which is what you want to be using in your components. But GQL is not great for offline-support and caching. You want your frontend to know about how your data relates to each other. Your frontend GQL should query a local SQL db.

I'm not sure you need a full SQL implementation on the frontend though, as the data is not going to get all that large to need the optimizations it affords, but it would be nice to be able to use the same queries on your browser DB as your backend DB.

We have a general rule on our team that complex SQL is a code smell. In our project complex queries are usually an indication of a poor design.

Anything SQL that can be made simpler via dynamic generation (which is safe as long as you use proper parameters for user inputs) is favored over creating logical branches in queries. Anything that can be processed further quickly in memory in the app (mapping operations, string ops, ordering/filtering predictably small data sets, etc.) we tend to offload from SQL into something more suitable.

And we tend to solve a class of problem in our data layer and reuse those generalized patterns heavily. This makes our codebase predictable even when dealing with unfamiliar subject matter.

Of course there are always places where some complex query is necessary (especially when building reports), but if it’s status quo then you’re doing something wrong—-it’s only a matter of time until you end up with a performance nightmare on your hands.

I think the author makes a valid point about this though - that it is precisely because of the limitations of SQL that we think this way. If SQL was a much more expressive data processing language we may not even require an app layer.

I do agree with this - after all SQL is supposed to be the data layer - why should we think that data processing shouldn't happen there?

> complex queries are usually an indication of a poor design.

Can you give an illustrative example of one. I suspect that framing it this way biases designs away from 'poor ones that use complex queries' into one that foregoes other good aspects such as normalization. Sometimes the best design uses a complex query for something other than a report.

Design is not something that should be done by application of dogma and avoiding smells.

No, but by treating a SQL server as a way of storing and retrieving data efficiently FIRST then the times when a complex query is actually necessary tend to stand out better.

In reality most tables and queries start out simple enough, and poor schema choices are usually accompanied by poor architectural choices. It can be painful to come up with a decent migration scheme when the business needs change, especially if there’s fear/pressure involved, but often that’s going to be better than trying to keep the data layer the same/similar and shoehorning in data to represent new scenarios. This is what leads to a fragmented design IMO and allows the schema to diverge from the actual goal of efficient data storage/retrieval.

Though verbose and somewhat strange at times, one thing I love about SQL is that the query statements read like a set definition from set theory. That declarative nature is pretty powerful IMO, sure there are hiccups but it is a different way of thinking.

The (mostly) declarative nature of SQL is not something the post is criticizing. You could have a good declarative language for relational database that doesn't suffer from the things the post criticizes.

I agree. I also agree with the post. I’m a big fan of sql and write a lot of it by hand. TFA makes a lot of excellent points to which I could add quite a few more.

If we ever do get a replacement, I hope it retains the declarative set theory approach of SQL while addressing the warts.

> By far the most common case for joins is following foreign keys. SQL has no special syntax for this

You can use NATURAL JOIN

select * from foo natural join bar

Works as long as the keys are named the same. However, a lot of people have a habit of naming keys differently in the two tables.

And what if there is more than one way of joining two tables, or a table to itself?

A foreign key is effectively a reference to another column, but to de-reference it you have to tell the database which table and column it's a reference to. Every time. Even when this information is already specified in a foreign key constraint.

The author is talking about (not) being able to specify the "from" table and column without having specify the "to" table and column (i.e. tell the database how to de-reference it) on each query. A natural join removes the need to specify the columns, but still requires specifying both tables. So besides requiring a de facto single namespace for columns across tables and generally seeming like a footgun, it doesn't achieve the same thing.

And it also breaks if two non-key columns are named the same.

This makes naming key columns differently a defence technique, so you stop people from using natural joins.

I generally use a prefix for columns based on the relation name, but preserving the name of keys (so a foreign key to user_id is user_id, not order_user_id etc). You obviously can't use natural joins if you end up with _multiple_ foreign keys with different roles, but generally I find this a better way to live all round. Never having to rename five 'name' columns in some output to make it clear which is which etc.

or ON:

select * from foo join bar on (foo.x = bar.y) if the columns have a different name.

I tend to write my joins first, then use where clauses as filters. A select * from foo left inner join on (foo.x = bar.y) is semantically equivalent to foo, bar where foo.x = bar.y, but keeping the joins separate from the filters makes the query more clear.

>what if we want to return the salary too?

>the only solution is to change half of the lines in the query

How about adding a second subquery for the salary.

It is a toy example. Perhaps imagine a more realistic subquery that is much longer. Are you going to duplicate a 50 line subquery to get the salary when all you want is one more value? No, you'd probably want to restructure the query with a join or CTE instead. To the author's point, a large change relative to the gain.

The author explicitly said "the only solution" and that's plain wrong.

No, I will write a SQL function for the subquery and call it from the main query.

This example seemed wrong to me as well. You can have a subquery, or CTE that returns as many fields as you want and can join on the manager key

An additional subquery and a CTE are both restructuring the query significantly, which is the author's point.

Well if he is insisting in doing the SQL version of writing everything inside a lambda without modularizing the code, then yeah that is bound to happen.

There's already a subquery in there. I'm saying this can still be done with one subquery, just in a different place and it returns the desired results.

The GROUP BY section is odd:

> You can use as to name scalar values anywhere they appear. Except in a group by.

  -- can't name this value
  > select x2 from foo group by x+1 as x2;
  ERROR:  syntax error at or near "as"
  LINE 1: select x2 from foo group by x+1 as x2;
  -- sprinkle some more select on it
  > select x2 from (select x+1 as x2 from foo) group by x2;
  (0 rows)
Looking at that first one I'm just kinda like "well duh, there's nothing special there" - it doesn't work with ORDER BY either, you use that to rename columns (on SELECT) or tables (on FROM and JOIN).

And then it goes on to show ways to work around that:

> Rather than fix this bizaare oversight, the SQL spec allows a novel form of variable naming - you can refer to a column by using an expression which produces the same parse tree as the one that produced the column.

Instead of just... using the renamed column?

  select x+1 as x2 from foo group by x2;

Sadly using the renamed column name in a group by or order by doesn't work on all database engines.

This is also fixed in ClickHouse - you can set and reuse aliases in any expressions in the query.

I agree with the Author. SQL is not a great query language. Almost every decently sized app I have written I have needed some sort of a query compiler so I don’t have to deal with nuances.

Also agree that GraphQL is a pretty fantastic language for working with graphs. And that relational databases are essentially graphs. Hasura is neat.

Admit have not read the article but has of my personal experience I think the hostility of developers vs SQL came from lack of fundamental formation and experience in declarative programming and full constant every day immersion in imperative programming.

The article isn't against declarative programming. The author alternative declarative solutions throughout the article.

I was in this state for years, thinking in an imperative way. The change for me was realising that your starting point is every possible combination of the rows in the tables. Just start with that (regardless of how massive) and then filter it down.

The problem with SQL is the language, not the paradigm.

This article promotes other relational alternatives and query languages.

> So instead the best we can do is add json to the SQL spec and hope that all the databases implement it in a compatible way (they don't).

Of course they are incompatible. That's just par for the course when it comes to SQL.

This isn't just a matter of some constant programmer overhead, like SQL queries taking 20% longer to write.

20% longer to write than what alternative? And how is this being measured?

And.. am I missing something?

By far the most common case for joins is following foreign keys. SQL has no special syntax for this:

  select foo.id, quux.value 
  from foo, bar, quux 
  where foo.bar_id = bar.id and bar.quux_id = quux.id
Why can't this be expressed as an INNER JOIN?

And can't some of these subqueries be written using a WHERE EXISTS or a windowing function?

For SQLite I use 'natural join':

    select foo_id, quux.value
      from foo natural join bar natural join quux
Unfortunately this doesn't actually use the foreign key relation; it matches on the same element name. So you have to have `foo.bar_id` and `bar.bar_id`, as well as `bar.quux_id` and `quux.quux_id`. But I find that actually makes queries more readable.

I find that the use of this technique, plus banning NULLs (handle them in application code), promotes user sanity. My 0,02€.

While some of their complaints are legit I think most of the SQL in this article are straw men. Most of it can be made more readable and more performant.

Care to back up those claim with evidence? The SQL queries are toy examples. Experienced SQL users don't write SQL like in the article because they know SQL's pitfalls and avoid them. It doesn't mean SQL isn't full of pitfalls and bad decisions (like many very old programming languages still in use).

Evidence? This is just, like, my opinion, man. But nested subqueries and old style joins... ripe for rewrite if I had say.

> Why can't this be expressed as an INNER JOIN?

`from foo, bar, quux` is an inner join, it's a shorthand syntax. He's lamenting that he has to keep specifying and matching ids, when the database can figure it out on its own from the foreign keys.

Depending on your database, it may be worse than just shorthand syntax. I encountered once a DB (that shall remain nameless) where replacing this syntax with the actual JOIN keyword resulted in dramatically better query plans everywhere it was used.

They should use NATURAL JOIN if auto-selection of the join keys is that important. I wouldn't recommend relying on that type of automagic behavior because it is very brittle; adding a column to a table might accidentally break existing queries.

Yes, but that's only automagic and brittle because it's not an actual join on the foreign key.

It's a join on whatever happens to have the same name! Which sometimes happens to be the foreign key.

The point here is that SQL doesn't have a way of specifying "join on whatever the foreign key is, and nothing else".

> adding a column to a table might accidentally break existing queries

Just a column, no. Adding a new foreign key might, but that's something that a type-checker/compiler can let you know about.

> They should Use NATURAL JOIN

NATURAL JOINs are better than nothing, but they're flawed too. Column names are supposed to convey meaning or intention, not just the type of data that they contain.

A natural join selects all matching names which is not the same as what the article is saying. The database already knows about foreign keys. Why do I have to say

  select * from A a join B b on b.Id = A.OtherId
SQL IDEs will auto-suggest that "on b.Id = A.OtherId" because it's the foreign key and could be inferred. That's what you need 99% of the time.

> A natural join selects all matching names

Yes, which is why I called it "brittle". It can easily break if you aren't careful with column names. However, if someone really wanted (unwisely, in my opinion) to have the DB automatically handle their inner joins, they could use NATURAL JOIN. This would require being very careful with you column names, which is why I (strongly) recommend doing it the usual way with INNER JOIN.

> the foreign key and could be inferred

Only in simple cases. If there are multiple foreign keys referencing the same table, automagically inferring joins will probably do the wrong thing, just like NATURAL JOIN.

> IDEs will auto-suggest

The editor/IDE is where this type of automagic inference should be done! You have the opportunity to inspect the suggestion and fix it if necessary, instead of hoping the DB does the right thing at runtime.

Partly for persistence, I would imagine. The query you write today should function the same tomorrow and a year from now. If you want this implicit behavior, you may use natural joins

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact