Oops, You Wrote a Database (dx.tips)
105 points by swyx 23 days ago | 91 comments

The same can be said about ORMs...

- You start off thinking: my problem is simple/unique, so I don't need an ORM

- You design your database schema, start building your application, inserting, updating, querying it...

- First job is to write a few abstractions that represent a row in each of the tables in your db. Each one handles serialising/deserialising between the database types and your language's types, plus a few of your custom types.

- You find yourself duplicating code for simple select/update/insert/delete queries, so you build another layer of abstraction so you have only one generic implementation of these that your table row abstractions use.

- You start writing more complex queries based on user input, and it starts to get a little complex with all the string concatenation. You write a query builder for one area of your application, but soon find yourself longing to use it elsewhere. So you generalise it to be usable on other tables.

- Hand writing database migrations is simple to start with, but as your team grows, and you have more deployment environments, and more branches, it becomes a massive headache. Then one day someone runs the wrong migration and drops the main db... you write a database migration tool, reuse a bunch of your application table abstractions as the definitive source of truth for what your tables look like. Job done!
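The early steps above, rows mapped to language types plus one generic select/insert layer, fit in a few lines. A minimal sketch (Python with sqlite3 purely for illustration; the `Table` class and all names are made up):

```python
import sqlite3

class Table:
    """One generic CRUD implementation shared by every row abstraction."""
    def __init__(self, conn, name, columns):
        self.conn, self.name, self.columns = conn, name, columns

    def insert(self, row):
        cols = ", ".join(self.columns)
        marks = ", ".join("?" for _ in self.columns)
        self.conn.execute(
            f"INSERT INTO {self.name} ({cols}) VALUES ({marks})",
            [row[c] for c in self.columns],
        )

    def get(self, id_):
        cur = self.conn.execute(
            f"SELECT {', '.join(self.columns)} FROM {self.name} WHERE id = ?", (id_,)
        )
        hit = cur.fetchone()
        # Deserialise the tuple back into the language's types (a dict here).
        return dict(zip(self.columns, hit)) if hit else None

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
users = Table(conn, "users", ["id", "name"])
users.insert({"id": 1, "name": "ada"})
```

From here the next request is invariably filtering, and then you're writing the query builder.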

Oops, you wrote an ORM...

Obviously the world isn't that simple, and there are legitimate reasons not to use one. But the majority of the time it's better to start with an off the shelf ORM than not.

The issues with ORMs really come down to restrictions on escape hatches: easy dropping down to SQL, but with mapping back to your higher-level abstractions.

I have quite the opposite experience: you start with an ORM because your app is simple CRUD; later, when the app becomes more complex and the team/userbase grows, you start having problems:

- N+1 query problem leading to subpar performance (down to unusable)

- escape hatches in the form of raw SQL increase cognitive load (two ways to do the same thing, and you always have to remember how the ORM maps to SQL and back when writing business logic) and may introduce inconsistency/bugs when the ORM caches object state in memory while raw SQL doesn't and you accidentally mix the two in one request (we've had a few such bugs)

- ORMs often encourage coupling business logic with DB access logic so it becomes harder to write tests and refactor code

- with a large amount of data (millions/billions of rows), built-in ORM stuff stops working when you care about zero-downtime migrations (i.e. no table locking, support for custom sharding, etc.), so you end up writing custom tools tailored to your use cases anyway

- it's harder to debug because there's a lot of magic
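The first point is easy to make concrete. A minimal sketch of N+1 versus a batched IN query, with sqlite3 standing in for the real database (schema and names invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE book (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    INSERT INTO author VALUES (1, 'ann'), (2, 'ben');
    INSERT INTO book VALUES (1, 1, 't1'), (2, 1, 't2'), (3, 2, 't3');
""")

# N+1: one query for the parents, then one more per parent (what lazy loading does).
queries_n_plus_1 = 1
authors = conn.execute("SELECT id FROM author").fetchall()
for (author_id,) in authors:
    conn.execute("SELECT title FROM book WHERE author_id = ?", (author_id,))
    queries_n_plus_1 += 1

# Batched: one query for the parents, one IN query for all their children.
ids = [a[0] for a in authors]
marks = ", ".join("?" for _ in ids)
books = conn.execute(
    f"SELECT author_id, title FROM book WHERE author_id IN ({marks})", ids
).fetchall()
queries_batched = 2
```

With 2 parents the lazy path already issues 3 queries to the batched path's 2; with thousands of parents the gap is what takes performance from subpar to unusable.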

Our product was initially written using an ORM, and we ended up rewriting it using a layered architecture, DDD and CQRS. It's much easier to reason about now (because all DB access is encapsulated in one place and doesn't spill over into the rest of the application), and we're in the process of moving away from the ORM in our CRM as well. I don't find it that hard to write raw SQL when implementing repositories and query services in the infrastructure layer; it's mostly write-and-forget for me.

>- N+1 query problem leading to subpar performance (down to unusable)

There are two problems in Hibernate. The join fetch syntax is pointlessly verbose, which means doing the right thing can turn a beautiful 7-line query into a 21-line monster as you constantly have to thread the same IDs back into the query.

I mean it doesn't take many brain cells to do this:

    select member
      from Member member
      left join fetch member.groups groups
     where member in (:members)
     order by member.fullname;

except you have to repeat this for every single relation, hardly anyone is aware of it, and it is inconvenient enough that it isn't done by default. Other ORMs usually provide some sort of .includes() or .with() call which makes it less painful, but even those could probably do with implementing a batching select fetch mode. The N+1 problem is mostly the result of bad default behaviour that "always works". Honestly, I would prefer it if lazy loading were disabled by default and unloaded relations just gave you an uninitialized error the moment you access them, so you know you should fix your query instead of blaming the tool.
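What a batching .includes()-style fetch does under the hood is roughly: one query for the parents, one IN query for all the children, then group the children back onto their parents. The bare mechanics sketched with Python/sqlite3 (invented schema; this is not any ORM's actual API):

```python
from collections import defaultdict
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE member (id INTEGER PRIMARY KEY, fullname TEXT);
    CREATE TABLE membership (member_id INTEGER, grp TEXT);
    INSERT INTO member VALUES (1, 'ada'), (2, 'bob');
    INSERT INTO membership VALUES (1, 'admins'), (1, 'devs'), (2, 'devs');
""")

# One query for the parents...
members = conn.execute("SELECT id, fullname FROM member ORDER BY fullname").fetchall()

# ...one IN query for all the children...
ids = [m[0] for m in members]
marks = ", ".join("?" for _ in ids)
rows = conn.execute(
    f"SELECT member_id, grp FROM membership WHERE member_id IN ({marks})"
    " ORDER BY member_id, grp",
    ids,
).fetchall()

# ...then group the children back onto their parents.
groups = defaultdict(list)
for member_id, grp in rows:
    groups[member_id].append(grp)
loaded = [(name, groups[mid]) for mid, name in members]
```

Two queries total, regardless of how many parents there are, which is exactly what the verbose join-fetch incantation is buying you.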

The other problem is absurdly janky handling of inheritance, where it just goes straight back to N+1. I would recommend simply not using inheritance with discriminators: the feature basically doesn't work if you don't have a 1:1 mapping of discriminators to subclasses. You can now say ORMs are like the Vietnam war, or you can accept that the way you are using the tool is extremely counterproductive. I have seen code bases where the discriminator is converted via a case statement, and Hibernate simply doesn't know what the heck to do with that.

Another problem with ORMs is their absurdly bad support for arbitrarily complex computed properties. CTEs (WITH) are not some arcane construct, but the @Formula support in Hibernate doesn't use them, which means computed properties have unusably bad performance.

From my perspective 90% of the ORM performance problems could be solved and in theory the ORM could even provide enough batch operation features to make your code faster than not using an ORM.

This is a debate that has been going on for decades, but I rather like the point made here: http://www.revision-zero.org/orm-haters-do-get-it

"You simply can't use an ORM the right way

You can use them in the intended way but unfortunately there is no right way. The reason is that the ORM approach is intrinsically flawed. This is the claim I'd like to discuss today."

It's a short and interesting read, so I would like to see others' takes on his fundamental argument. It kind of reminds me of kyren's Rust keynote and the difficulty of shoehorning OOP into places it doesn't really fit.


It's the reverse for me. When I join a team, I get hit in the face with an existing ORM. It makes debugging performance issues completely opaque.

Two issues in my current code base are on my mind:

Firstly, our slowest query involved the ORM generating a "CROSS JOIN" clause in a query which returned 0 or 1 results. Of course there was no way of tying it back to the code which was emitting it without some investigation. Real SQL would have shown up in a text search.

While investigating the above problem, I came up with a replacement query which would take 5 ms rather than the ~400ms it currently took. Unfortunately I couldn't use the query because my change (ordering by B.time instead of A.id) didn't return strictly the same results. But they should have matched. It turned out there was a slight race in writing out our rows via the ORM. We're currently not touching it because we're worried about breaking it. If it were SQL strings, I'd just wrap them in a transaction.

SQL is already the do-everything high-level abstraction. Trying to put Java code on top of SQL is like trying to put a for-loop on top of a map-reduce cluster.

> - First job is to write a few abstractions that represent a row in each of the tables in your db. It handles serialising/deserialising between the database types and your language's types, plus a few of your custom types.

> - You find yourself duplicating code for simple select/update/insert/delete queries, so you build another layer of abstraction so you have only one generic implementation of these that your table row abstractions use.

> - You start writing more complex queries based on user input, and it starts to get a little complex with all the string concatenation. You write a query builder for one area of your application, but soon find yourself longing to use it elsewhere. So you generalise it to be usable on other tables.

I think most of your problems seem like they would be solvable within SQL. SQL goes far deeper than INSERT/UPDATE/SELECT/DELETE.

Most DBMSes have functions and views that can provide shared functionality and remove the need for that abstraction. They were designed to deal with business logic, not just be a dumb model store. Obviously that means you have to write non-trivial amounts of SQL code, which can be a bit of a pain, but in most cases it will perform a lot better and end up a lot cleaner than wrangling the data as though it were a bunch of objects.
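For instance, a shared business rule can live in a view that every caller reuses, instead of in an application-side abstraction (sketched here through Python's sqlite3; the schema and the "ignore cancelled orders" rule are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, customer TEXT, total REAL, cancelled INTEGER);
    INSERT INTO orders VALUES (1, 'ada', 10.0, 0), (2, 'ada', 5.0, 1), (3, 'bob', 7.5, 0);

    -- The shared business rule lives in the database, not in every caller.
    CREATE VIEW customer_revenue AS
        SELECT customer, SUM(total) AS revenue
          FROM orders
         WHERE cancelled = 0
         GROUP BY customer;
""")

revenue = dict(conn.execute("SELECT customer, revenue FROM customer_revenue").fetchall())
```

Every consumer now selects from customer_revenue and cannot accidentally forget the cancelled filter.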

I don't really share your opinion here (and, likely, the supporting ones about how to use ORMs). But simply disagreeing here would waste both of our time, so I thought I'd ask you about specifics instead.

Can you give me some short links to examples demonstrating the upsides you see there? Which off-the-shelf ORMs are good ones in your opinion?

Since my take is likely based on my biases: I really haven't had good experiences with a couple of off-the-shelf ORMs, especially Hibernate and Entity Framework, both in green-field and "entrenched before I joined the project" scenarios. Both models I've seen (generating a DB schema from code annotations, and generating code from some magic tool you run against your DB schema) seem really ill-advised to me (from an architectural standpoint) if the respective code is supposed to form the core of an architectural model. And if it doesn't, I can't really see the benefit in avoiding code duplication unless supporting a wide range of database vendors is a requirement.

That said, I also never worked on CRUD-style applications, and none of the projects I worked on needed to support multiple database vendors. And I've certainly seen the "oops, you wrote an ORM!" effect, but the thing that irks me there was that it wasn't necessary (at least in my opinion).

I'd also readily agree on schema migrations, but there are good non-ORM solutions for that problem, too. So, what I've seen there, as problems go, is mostly a lack of care, structure and proper process akin to every occurrence of "bad code". I'm not sure that this problem is easily solved by any solely technical solution.

Do you think any of that is what made the difference in our experiences?

Try using GORM for Grails. You still have to specify your fetching strategy but it is so much more sane than using raw Hibernate.

I don't know what people mean by code duplication.

When using GORM I just use Liquibase to create and update tables. The default data type mapping rules are easy enough to memorize.

Does GORM allow the use of window functions, pivot tables, recursive queries, filtered aggregates, unnesting arrays, splitting a string into rows, or lateral joins?

I mean without an explicit escape valve to write raw SQL queries, which means you're bypassing the ORM anyway.

It seems your requirements mostly center around OLAP use cases? In that case, an ORM isn't the right tool to begin with, i.e. lacking support for these things doesn't diminish the value proposition of ORMs, which lies elsewhere (mapping results to managed object graphs in OLTP applications, tracking changes to the same, and synching them efficiently back to the DB). For OLAP, object mapping typically doesn't provide much value.

Nope, I'm mostly OLTP. But even mostly OLTP systems need dashboards, even if they're read from periodically refreshed materialized views. I also use things like lateral unnest of arrays for ordering dependent rows. Actually LATERAL shows up a lot to compute on the fly. And while everything I listed can and should be used in OLAP, OLTP has uses for these items too.
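For example, "latest row per key" is a bread-and-butter OLTP query that wants a window function. A sketch (assumes a SQLite new enough for window functions, 3.25+; schema invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE reading (sensor TEXT, at INTEGER, value REAL);
    INSERT INTO reading VALUES ('a', 1, 1.0), ('a', 2, 2.0), ('b', 1, 9.0);
""")

# Latest reading per sensor: rank rows within each sensor by time, keep the first.
latest = conn.execute("""
    SELECT sensor, value FROM (
        SELECT sensor, value,
               ROW_NUMBER() OVER (PARTITION BY sensor ORDER BY at DESC) AS rn
          FROM reading
    ) WHERE rn = 1
    ORDER BY sensor
""").fetchall()
```

Doing the same through an ORM that lacks window support means either N queries (one per sensor) or fetching everything and filtering in application code.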

It’s funny how people’s experiences can differ. I find ORMs have enough downsides that I wouldn’t assume it was best to use one for a new project.

ORMs encourage pulling knowledge and behaviour that belongs at the database level up into application code.

ORMs by their nature end up duplicating the database schema in application code, resulting in multiple sources of truth that need to be synchronised.

ORMs all have their own ideas about how to build queries. None is as powerful or explicit as just writing SQL in the first place, none is standardised and portable to other ORMs, and the queries they produce might or might not be efficient.

SQL isn’t perfect, but it’s stable and universal. If you learn SQL to an intermediate level, you can probably already solve 95% of real world relational database problems, and you can do it using any database and any programming language with minimal adaptation. To be a good choice for any given project and team, I think an ORM needs to offer something more than what the database and SQL already provide. Quite often, I suspect that isn’t really the case, and the ORM is just being used for dogmatic reasons.

> ORMs by their nature end up duplicating the database schema in application code, resulting in multiple sources of truth that need to be synchronised.

There is always a duplication of "schema": both your database and your application language need a definition of it. It then makes sense to have a tool to, as you say, synchronise the two one way or the other. I prefer the ORM approach of the application code being the source of truth, but for others it's completely fair and allowed to go the other way.

It comes down to the type of work you do, the size and number of teams, and who has ownership of the database and data.

>who has ownership of the database and data

I absolutely loathe projects that communicate via a shared database. It completely ties the hands of everything that touches it. Data should be shared via an API, so that the underlying representation can be migrated, updated and changed, and the API can munge the new reality of the data to still look like the old. You should be able to switch between backends with an environment variable update, allowing your project to run in-memory, against SQLite, or against whatever actual relational or document store you plan to use in production.

ORMs are "easier", in that they can get code running faster, but I've seen a good number of teams write the most atrocious code because the abstraction allows them to not think about things. Code that manages to fire tens of thousands of queries in loops instead of doing a single large query or walking the data in a conscious and efficient manner.

It is my opinion that a single object inside a project should wrap the database, with each query thereto represented by a function that, once again, allows the underlying representation to change without the rest of the code being aware. ORMs are far too tightly bound, and teams using them are likely to spatter ORM code all over the codebase instead of neatly arranging it behind a properly cut abstraction.

> There is always a duplication of "schema", both your database and application language need a definition of this.

Why can’t an application just run SQL queries directly against the database? I agree that there are many reasonable ways to manage a database, but not all of them involve reproducing the entire schema directly in client code the way an ORM does.

Certainly it depends on the use case. ORMs are wonderful for CRUD-style applications, but tend to be (in my experience) awful for reports. They also tend to discourage developers from thinking about query performance, because ORM query DSLs don't usually have first-class profiling.

You missed the last sentence of the comment.

Even with custom queries, using the escape hatch brings a lot to the table, because you benefit from the ORM core handling configuration, serialization, pooling, etc.

Not to mention that some cores (like SQLAlchemy Core) in fact include a functional query language capable of representing low-level SQL in any form. In that case you benefit even more, and you can do your report perfectly.

> You missed the last sentence of the comment

I didn't, and I agree with the comment. I'm elaborating upon it.

So now you need to know two languages to access the database. :-/

Unless you only access the database directly and don't have the reports accessed through an end-user application, you usually have some language involved to call your SQL queries, normalize input, abstract them away, etc.

You do have to learn an additional API, but as a programmer, that's something I have to do every day anyway.

At the very least you should know SQL. The purpose of having a query DSL in some other language isn't to avoid understanding SQL; it's that SQL isn't easily composable.

CTEs. Views. Set-returning functions.

"SQL isn't easily composable" is a pernicious lie. Does it compose like your favorite functional language? No.

SQL is not, nor has it ever been, a functional or object-oriented or generic or procedural language. It's a fourth-generation declarative set-oriented language, so its composability is going to be based on sets. Period.

Until you think of SQL as a set-oriented declarative language, you will be perpetually trying to push water uphill, ORM or no.
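Concretely, set-oriented composition means naming intermediate sets and building the next set on top of them, e.g. with CTEs (a sketch driven through Python's sqlite3; schema invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE payment (customer TEXT, amount REAL);
    INSERT INTO payment VALUES ('ada', 60), ('ada', 50), ('bob', 10);
""")

# Each CTE names an intermediate set; the next one composes on top of it.
big_spenders = conn.execute("""
    WITH totals AS (
        SELECT customer, SUM(amount) AS total FROM payment GROUP BY customer
    ),
    over_threshold AS (
        SELECT customer, total FROM totals WHERE total > 50
    )
    SELECT customer FROM over_threshold ORDER BY customer
""").fetchall()
```

Either CTE could equally be promoted to a view and reused by other queries, which is the set-level analogue of extracting a function.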

It's not a pernicious lie at all. You can't just say it's composable so long as you only compose with the special mentality that you consider to be blessed. Also, the fact that SQL acts on sets is completely irrelevant to whether the syntax is composable. JSON is programmatically composable. XML is programmatically composable. S-expressions are programmatically composable. Most programming language syntaxes are not.

If you build a UI or reporting mechanism that wraps around an SQL engine and has multiple configuration options that combine to determine the SQL expression that should be used, the fact of the matter is that you will either abstract the combination of the configuration into a DSL that then gets converted to SQL, or you will have a mess of string concatenation that isn't type-safe and isn't verifiably correct until runtime.

Show me the code you'd write to generate an SQL expression with ten optional filters, where each filter compares a different column to a set, and if the set is empty the filter should be ignored. I'll show you a DSL that does the same thing, is readable and statically verifiable, and ensures that if anybody else ever writes a similar query the parameters will be the same, so that the execution plan can be cached.
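For reference, the string-building half of that challenge looks roughly like this (a sketch, not a recommendation; table and column names are hypothetical):

```python
def filtered_query(table, set_filters):
    """Compose a SELECT from optional set filters; empty sets are simply skipped."""
    clauses, params = [], []
    for column, values in set_filters.items():
        if not values:          # empty set => filter is ignored
            continue
        marks = ", ".join("?" for _ in values)
        clauses.append(f"{column} IN ({marks})")
        params.extend(values)
    sql = f"SELECT * FROM {table}"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql, params

sql, params = filtered_query(
    "orders", {"status": ["open"], "region": [], "owner": ["ada", "bob"]}
)
```

It stays parameterised, but nothing checks the column names until runtime, which is exactly the gap a statically verifiable DSL closes.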

S-Expressions are not a set-based construct. Neither is JSON nor XML. Those are all document-oriented and/or hierarchical tree structures.

I did not "bless" SQL as a set-based language. It's literally a 4th generation, declarative DSL for set theory. That was its design goal. That's the lens through which every change to the language and the engines that implement it are discussed and adopted.

If I said LISP sucked at object-oriented programming and OO design patterns, how would you react? How about if someone said LISP shouldn't be viewed solely through the lens of functional programming just because that's your "blessed mentality" for it?

I don't even know how to respond to that assertion. "It's not even wrong."

I would probably say that you are misunderstanding the difference between syntax and the evaluator that evaluates that syntax.

Both of these expressions contain the same data:

    (select (col1 col2) (table1) (= col1 "foo"))

    "select col1 col2 from table1 where col1 = 'foo'"

Both of these are "4th generation, declarative DSLs for set theory", as you like to brandish around.

> If I said LISP sucked at object-oriented programming and OO design patterns, how would you react? How about if someone said LISP shouldn't be viewed solely through the lens of functional programming just because that's your "blessed mentality" for it?

S-expressions are just a syntax. It's quite normal to see Lisps that are functional, declarative, object-oriented or procedural. The only thing that differentiates Lisps from other programming languages is the simplicity of the syntax enabling meta-programming, which is specifically the problem with SQL's syntax and its composability.

JSON, and XML are data structures, not programming languages. Can you loop or iterate in JSON or does some other language do the iterating? Can XML execute instructions (without a script tag or processing instruction, which is just punting to an external 3rd-gen language)? S-Expressions blur the line a bit, but clearly not in the direction of 4GL.

A 4th-gen language is specifically one that describes what you want, not how to get it. S-Expressions fail this metric in that they can describe the logic for retrieval in addition to the data (where the data portion is not conducive to it being a programming language).

    SELECT a.id
         , a.name
         , p.name
      FROM alien a
      JOIN planet p ON a.planet = p.id
     WHERE a.name = 'Kaminoan'
Now here I'd completely agree that things might have been better with some non-implicit syntax bounds, e.g.:

    SELECT [ a.id
           , a.name
           , p.name
      FROM [ alien a ]
      JOIN [ planet p, (a.planet = p.id) ]
     WHERE [ a.name = 'Kaminoan' ]
As it stands today, SQL has some pretty gnarly EBNF to describe it. But that is still missing the point. Where are the datasets for "alien" and "planet"? Are they on the same disk? The same server? The same cluster? The same kind of database?

Are there a lot of aliens? Would a btree index lookup be more appropriate, a sequential search, a bloom index, or some other algorithm? Is alien name enforced to be unique or not so that we can be sure to return either optional one, exactly one, zero or more, or one or more?

Is "planet" a single table, a view into one or more tables, a materialized view that caches its results, or a foreign table accessing the data from another source that isn't even relational?

A 3rd-gen language would need to know these things. It would require telling the system HOW to get the data rather than just describing WHAT data subsets are desired based on other known data sets. It couldn't just transform one or more sets into a new set without any consideration of what serialization format those data sets are stored in, on disk or in memory.

Here we have the need for separation between DDL and DML. Yet DDL still only describes the structure and makes access patterns available. Creating an index does not mandate use of that index. It merely puts that data access pattern on the menu. DML (as seen above) makes the query result set definition but has nothing to do with telling the engine how it should go about doing it.

Ergo: 4th-gen declarative language

This doesn't mean SQL is the best language or even better than all 3rd-gen languages simply because it's a 4th-gen language. But if you're going to engage in this topic, I strongly encourage you to learn what a 4th-gen language actually is and how laughable calling JSON, XML, or S-Expressions one is.

I could give you HTML as a 4th-gen language in that it describes layout and general behavior without reference to how the layout engine renders it or implements its linking, form-handling, and embedding behavior. HTML+CSS make a compelling case for a composite 4th-gen language, describing what you want, not how to render it. Whether it be Links, Firefox, or Mobile Safari, the description is the same and the desired behavior is well-defined despite wildly different implementations from the engines.

I keep replying to you, by the way, because I'm really hoping you'll have the aha moment that is holding you back here. Again, you are misunderstanding the difference between syntax and how it is used. JSON and XML aren't data structures; they are syntaxes, which are commonly parsed into a data structure (usually a tree). S-expressions are also a syntax that can be parsed into a data structure.

Did you know that C# is also a syntax that can be parsed into a data structure? The Roslyn compiler does this, which allows for symbolic code manipulation and generation.

Did you know that's how almost every programming language parser works? They tokenise a syntax into a symbolic data structure which is then either interpreted and executed or output as a binary. You can quite easily write a parser for XML that uses it as a programming language. Two well-known examples of this are XSLT and XHTML. XHTML is even one of your lauded 4th-gen declarative languages.

Whether or not the language is popular enough to have multiple parsers isn't relevant to the class of language that it is. In the example of a c++ program that reads xml into a tree, and then converts that xml into sql to be executed on a database, that xml syntax has become a 4th gen domain specific language.

The important distinction between languages and data structures is the implementation of the parser/evaluator and what it does with the text or symbols that it reads. A language implies execution and a data structure implies serialisation.

Oh come on. You are basically arguing that Python is just syntax, not a programming language; it's just the interpreter that parses the text and makes it go. No duh! You've made a distinction without a difference.

If an XML dialect were to be parsed by a processor to declaratively perform a sequence of tasks, the AST is not relevant to this discussion, only the tasks being performed. The XML dialect—the semantic structure behind it—is what I'm discussing.

JSON and XML as they are used today are "just syntax", but they are examples of syntax that almost universally are parsed directly into data structures. The Liquibase XML dialect on the other hand moves beyond a simple data structure into a declarative language for migrating database structures from one form to another by virtue of the Liquibase engine.

SQL is not "just syntax" any more than your speech is "just phonemes". Yes, English, Xhosa, and Mandarin are all just collections of phonemes. All written languages are "just syntax" waiting to be processed by smart meat.

It is needlessly reductive.

SQL absolutely implies execution in a way that JSON/XML do not. However, the style of execution is clearly different between SQL and C#. That style can accurately be described as set-oriented execution, at a higher level than 3rd-gen languages such as Python or C#.

Writing a parser for SQL (or XHTML) is a very different exercise from processing the AST from that parsed SQL/XHTML if that processing happens immediately after the parsing step and is largely driven by that AST.

Relational database engines are interpreters for SQL.

But these are actually for different purposes and requirements. KISS and DRY are respected.

For me, it never gets more complex than the first 3 points you listed. Write a few wrapper functions: sql(), rows(), row(), map(), value(), insert(), update(), delete(). That's it.

Use a hashmap to send parameters and receive the result set. Use sql() to write direct SQL statements for complex tasks.
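Those helpers stay tiny. A sketch of what rows()/row()/value() amount to (Python with sqlite3; the function names follow the comment, the bodies are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, "ada"), (2, "bob")])

def rows(sql, params=()):
    """Run a query and return every row as a dict keyed by column name."""
    cur = conn.execute(sql, params)
    cols = [d[0] for d in cur.description]
    return [dict(zip(cols, r)) for r in cur.fetchall()]

def row(sql, params=()):
    """First matching row as a dict, or None."""
    found = rows(sql, params)
    return found[0] if found else None

def value(sql, params=()):
    """First column of the first row, or None (for counts, sums, etc.)."""
    r = conn.execute(sql, params).fetchone()
    return r[0] if r else None
```

That's the whole data-access layer; everything else is plain SQL strings passed in.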

This is what happens the first time around. Maybe a few more times, if you don't learn the first time. It doesn't really lead to the creation of an ORM, though. This is more like how a lot of people without appropriate tools would "naturally" think the Earth is flat.

Another (better) way to deal with this stuff is to evolve the database instead of growing a defective translation layer and to keep the interface to the minimum.

You want simple CRUD and the highest performance (because 2ms vs 4ms is 2 times faster!), so you go with a micro-ORM. You start adding associations two days later; 3 weeks later you've finished coding an abstraction layer to handle simple associations; then you start wanting an aggregate root, and 2 months later you get a simple, basic one... Now you want to handle cascading features because they are also nice to have for your use case. Congratulations, you have implemented something 100× worse and slower than a full-scale ORM, without any of the benefits.

I have a huge amount of sympathy for people who accidentally attempt to implement a database. I believe they are usually motivated by their desire to avoid SQL - an extremely worthy goal. The industry, by and large, puts a crazy amount of effort into avoiding SQL by writing ORMs, translation layers and similar tech because it really is that bad.

But unfortunately databases have solved common-as-mud problems to do with data, disks, multiple users (which, if your data is worth storing, you will eventually have) and the interactions thereof. It is worth putting up with SQL to get all the lessons of the 70s, 80s, 90s, 00s, 10s and now 20s in how to store data without creating buggy software. People probably have these problems and it is usually faster to solve them the usual way that works. Write some SQL and maul a stress ball, it'll be easier long term.

SQL is pretty good though. It could compose better, sure, but it’s well suited to data queries. I think people dislike learning a new language, even to the point of writing bad software. I wonder why?

I'd say that pretty much everything about SQL is bad. Apart from composability, the syntax is super annoying (the ordering of keywords), many databases have pretty much no error reporting if you get it wrong and the linters suck. You can't unit test it. It's not very well standardized and it's actually more than one language (DML vs DDL). Plus, it's not very readable. And the declarative nature of the language is eventually going to back-stab you when you discover yet another edge case of your query planner that flips a reasonable plan to one that's literally going to take ages.

A few things. Regarding unit testing: it's now very possible to productively unit test against a SQL database. I'm familiar with a Java package that boots up an embedded PostgreSQL, and obviously Docker is now in vogue, etc. It's not quite "just run a function", but you write your code in the same style as a unit test and you're running hundreds of tests in seconds. You can use transactions to automagically clear out your state, which means performance is also excellent.

Regarding the standardization of the language: no, it's not totally standard (although arguably that's a good thing; we don't want 10 implementations of exactly the same thing, they'd never evolve), but because they all work sort of the same, a lot of language support ends up transferable. You can use ActiveRecord on many backing databases, for example.

The query planner problem is definitely true; I know folks who've spent a lot of time hinting Oracle and scaring Postgres into the same, though personally it only took me a few days.
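The transaction trick is simple: run each test inside a transaction and roll it back, so every test starts from a clean database without paying for setup/teardown. Sketched here with sqlite3 rather than an embedded PostgreSQL (the harness names are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.isolation_level = None          # manage transactions explicitly
conn.execute("CREATE TABLE account (id INTEGER, balance INTEGER)")

def run_in_rolled_back_tx(test):
    """Run a test inside a transaction, then roll back whatever it wrote."""
    conn.execute("BEGIN")
    try:
        test()
    finally:
        conn.execute("ROLLBACK")

def test_insert():
    conn.execute("INSERT INTO account VALUES (1, 100)")
    count = conn.execute("SELECT COUNT(*) FROM account").fetchone()[0]
    assert count == 1               # the test sees its own writes...

run_in_rolled_back_tx(test_insert)
leftover = conn.execute("SELECT COUNT(*) FROM account").fetchone()[0]
# ...but nothing leaks into the next test.
```

One caveat: code under test that manages its own transactions (commits, savepoints) needs more care than this sketch shows.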

However, SQL is IMO the best of a bad bunch. Most non-SQL databases I've used have been extremely sharp-cornered, whether via: unpredictable/unbounded worst-case performance in pessimal storage cases (RocksDB, Cassandra), various parts of the code where unbounded memory/storage usage is required (Cassandra, Spark, Elasticsearch), or extremely low-level APIs making it easy to build unobservably (as in, you won't notice it and will eat the pain on every read) slow access patterns (RocksDB). I'm not saying there's nothing better, but it's probably not an all-rounder.

There are certainly things I don't like about most SQL databases. Few SQL databases are columnar, which is _the_ key to good performance in the 2010s, and native LZ4-style data compression is also not widely used. But for most engineers they solve a lot of problems for you without having to think about them.

> I'm familiar with a Java package that boots up an embedded PostgreSQL, obviously Docker is now in vogue, etc, etc.

I think you are talking about Testcontainers, although it only matches the second part of the comment (can manage different external moving parts in containers for integration tests): https://www.testcontainers.org/

I'd be interested in seeing an embedded Postgres library; I only see H2 and maybe Derby being used for tests and "getting to know the ropes" use cases, eg. a default install.

Pretty sure they are talking about https://github.com/zonkyio/embedded-postgres-binaries. It's not exactly embedded postgresql since it still runs the full postgresql in a separate process but it is close enough for most purposes.

Cool, thank you!

Let’s pretend at least half of this is true. Even so, an ORM pretty much adds a layer of complexity on top and hides it, opening a door for a lot more complex behaviour. That’s a leaky abstraction. As is often the case, add on top of that a person using an ORM without actually knowing _any_ SQL and/or what’s behind the ORM, and the ball just rolls further into the mud.

So on one hand, most of the time people are doomed to re-invent an ORM... poorly. On the other, if the people using “a proper ORM”(tm) don’t know what’s behind the scenes, they will roll mud around; and if they do know, most will object to an ORM in the beginning and, over time, start to roll their own. Overall, there must be a thin line one should walk, but it’s unclear exactly where, since it’s highly project-specific.

You can unit test SQL by spinning up Docker containers.

You should just write SQL for the database you are using (e.g. PostgreSQL). Switching databases without changing queries is unrealistic anyways.

I agree with your other points, but I have yet to see a layer above SQL that improves on them enough to justify its additional complexity.

I know a team who used JooQ (code-generated query builder for Java) to generate SQL which worked on both Postgres and Oracle and worked well enough that they could support both in production (Oracle for top end perf, Postgres for everything else). There were certain places where they had to switch on SQL format, but over years they felt comfortable enough only unit testing on one kind of database.

Here we go…

SQL is composable through views, CTEs, temp tables, and set-returning functions. It's a set-oriented language, so its composability is going to be based on sets, not the constructs in your favorite general purpose programming language using a completely different paradigm. Just because that's unfamiliar to you doesn't deny its composability. Impedance mismatch.
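To make the composability point concrete, here's a tiny runnable illustration (sqlite3 for portability): a view is a named, reusable subquery, and a CTE composes on top of it the way functions compose, just over sets instead of values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount INTEGER);
INSERT INTO orders (customer, amount) VALUES ('a', 10), ('a', 25), ('b', 5);
-- a view is a reusable, named subquery:
CREATE VIEW customer_totals AS
  SELECT customer, sum(amount) AS total FROM orders GROUP BY customer;
""")

# ...and a CTE composes on top of the view without modifying it:
rows = conn.execute("""
WITH big_spenders AS (
  SELECT * FROM customer_totals WHERE total > 20
)
SELECT customer, total FROM big_spenders ORDER BY customer
""").fetchall()
print(rows)  # [('a', 35)]
```

Each layer consumes and produces a set of rows, which is exactly the composition model the comment describes.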

Regarding syntax, the best "solution" I've seen are query builders that put the FROM first but fail to implement the full range of functionality available to an engine, mostly due to catering to the lowest common denominator of functionality. (Looking at you, MySQL!)

This "I hate the syntax" is generally a petty argument. No popular multiuser database allows direct access to its internals, but I don't see Cypher or Mongo's QL getting anywhere near the same level of critical scrutiny SQL gets, and it's not due to their clarity, elegance, or comprehensiveness. You simply want a functional (or OO or whatever) syntax model, and it's not gonna happen because the problem SQL is solving doesn't match up. Impedance mismatch.

Unit testing: https://pgtap.org/ https://sqitch.org/docs/manual/sqitch-verify/ https://learn.microsoft.com/en-us/sql/ssdt/creating-and-defi...

SQL linters suck, I'll grant you that one. At least the syntax and error highlighting is there for good editors like DataGrip, but folks have a hard time accepting a universal style guide for their SQL, though it's sorely needed. Most folks just cobble together stuff with haphazard newlines and spacing, then complain about how no one can read each other's SQL.

Wait… you're pissed there's a different set of keywords for generation and modification of structure vs generation and modification of data? You think… they should be the same… because… you think they're related?

> Plus, it's not very readable.

Ever notice how someone in every thread on every language says, "I hate the syntax. I think it should look like <insert personal favorite language>." Just accept that your difficulty with the language is a personal preference or reflection of your familiarity with it, not some kind of objective fact. It's like claiming Chinese is less understandable than English when you were born and raised in London.

> …when you discover yet another edge case of your query planner…

That's not SQL's fault. Any query language would do this. If you had to write accessors by hand such as for DynamoDB, you'd find that the original algorithm you wrote is no longer effective because the dataset you're querying got much larger. Fix your indexes, vacuum, and reset statistics. Stop blaming your tools.

You don't like SQL? That's fine. But that hatred is also not universal. I and many other folks LOVE SQL. That isn't to say it's flawless or cannot be improved. But most folks always seem to focus on those aspects that probably don't matter as much, missing the relational forest for syntactical trees.

"Beware the old man in a young man's game. He is always much more capable than he seems."

> It's like claiming Chinese is less understandable than English when you were born and raised in London.

There is a proverbial consensus that written Chinese is very difficult to understand [0, 1]. It is a supremely challenging script. A well educated Chinese person can conceivably encounter words in a written text that they know the meaning of but they cannot understand or verbalise without consulting a reference book.

[0] https://en.wikipedia.org/wiki/Greek_to_me

[1] https://flowingdata.com/wp-content/uploads/2015/04/Greek-to-...

And folks speaking any of the variations of Chinese speech, eg. Cantonese, can still read the ideographs. That's the tradeoff.

Whereas in English, a well-educated American can (and does) conceivably encounter words in written English whose meaning they don't know and which they can't understand or pronounce correctly without consulting a dictionary.

Hell, English pronounces the exact same written words in wildly different ways as well as multiple words spelled differently but spoken identically to the point that a non-trivial part of the written language is merely memorized rather than learned.

But we learned English when we were young, so I guess it seems "normal" that "Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo" is a grammatically correct sentence.


Or how animal poops can have so many semantic meanings.


I wish SQL operated on sets but it really operates on tables that have an order and may contain duplicates.

LATERAL, VALUES, et al would object to that narrow designation.

Tables are just sets of records. Nothing more. Nothing less. SQL references tables as sets of records, but JOINs are referential subsets of tables, not complete tables themselves. And LATERALs may not even have a table underneath at all! Same with set-returning functions. generate_series(…) doesn't have any table under the covers; it's purely a dynamic set of values generated on the fly.

It's turtles all the way down.

As for duplicates, that's up to your defined constraints to sort out. You're right that SQL DML knows nothing about whether the underlying set of records contains duplicates. The DDL however can trivially define a set of records to have no duplicates based upon varying conditions/attributes.

Whether a set is ordered or contains duplicates has no bearing whatsoever on whether the data is set-oriented or not.

How is SQL bad? If anything, my experience with KV DBs is that most of the time they'll eventually end up implementing some form of structured data storage and retrieval that in some way resembles SQL. I love KVs only when I use them as embedded DBs, like Pebble and Badger, but you can never beat a good ol' SQL server that just does what you need from it, nothing more, nothing less. ORMs most of the time aren't that great either: you always have to keep up with their fancy new API changes, and instead of just doing your work you find yourself relearning the same ORM every few years. Not only that, but some ORMs are extremely sensitive to how you order your operations, so the order in which you write your code can potentially kill your performance or, worse, accidentally delete data.

Why is SQL bad?

I'll take SQL over almost any language used to write the "application layer" any day...

Also, why do you think that databases solved problems with disks? Do you mean disks as in block devices? If so, none of the popular databases ever touches that level; most popular SQL databases work with the filesystem.

It doesn’t compose properly, lacks type safety, you have to codegen your statements in the application anyways and is generally an utter bitch to do so because it’s horrifically inconsistent (and doesn’t compose properly). You can dump everything into stored procs and functions, but now you get worthless compiler errors, indirect version control at best, and really your whole tooling support is reverted to the 80’s.

It’s a pretty shit language; it’s just the only one we get to have, and the value of what it interfaces with (the RDBMS itself, ACID, and the relational algebra) is… invaluable. You can’t get away with not knowing it, and it’s generally dumb to avoid it (because you’re then avoiding the glory behind it), but as a language it’s basically COBOL-tier.

SQL, as a set-oriented language, composes just fine.

• views • CTEs • temporary tables • set-returning functions

It will not compose like an OO language or like a functional language because it isn't one of those. It's set-oriented. This is the biggest hangup I've seen from folk who hate SQL: mistaking it for something it isn't, never was, and was never intended to be given its problem domain.

> It doesn’t compose properly

Stored procedures?

> lacks type safety

How so? Can you give some examples? I have no idea what this is even about.

> you have to codegen your statements

Never happened to me. Not in any capacity where I would feel like this was a burden. The only cases where I needed to generate SQL code were some fuzzing tests intended to figure out problems with database performance, such as feeding a bunch of generated SQL to pgbench to see what might cause the database to slow down or use too many resources. This seems very strange in an application that is simply using a SQL database rather than trying to do something weird with it (eg. testing performance). SQL is a very high-level language, which means you don't need to do as much code generation compared to, say, Assembly or even Java.

I do see that people do this when they use ORM, but this is a self-inflicted pain. They shouldn't have been doing ORM to begin with.

> get worthless compiler errors

From what? Are you talking about writing SQL functions (eg. in Java) and then using them from SQL (eg. Oracle)? This doesn't sound like a SQL problem, more like Java compiler problem... But, yeah, I can see how this is a complication and an inconvenient thing to do, so, if that's what you mean, I'm generally with you on this one. Well, I think that the benefits are still worth doing this.

> whole tooling support is reverted to the 80’s.

Again, I don't see the connection. But, even if it was so... what's the problem? You get to work with an old and reliable technology, with decades of knowledge and experience to support it. And it's not like it died in the 80's. Tons of effort is being invested into this technology, it's amongst the first to get updates and integrations with new hardware and system capabilities (eg. vectorized ISAs support, even GPU access, bet you that when "smart SSD" become common enough, SQL databases will be the first to support them). What's not to like? This is like complaining about C compilers for being "old tech"... for all that's wrong with C (compilers), being old is not one of those things.

> It’s a pretty shit language

If you were honest, you'd write it like this: "I don't like it". And, well, you are allowed not to like whatever you want. It's subjective, and completely up to you. It just has less of a convincing power. But, really, this is your entire argument...

Yeah, the "lacks type safety" comment threw me too. SQL has some of the most rigid type safety in my daily interactions in computing.

    • "Exactly one"
    • "Optional"
    • "Text but matching this pattern"
    • "A 16-bit integer from 1 to 1000"
    • "Many to many without any duplicates"
    • "A unique combination of id and a time range"
    • "This field can be NULL but only if this other field is NOT NULL"

Now store a 64-bit unsigned integer in Postgres.

CREATE DOMAIN unsigned_64bit_int AS numeric CONSTRAINT range CHECK (VALUE >= 0 AND VALUE < 2^64 AND round(VALUE) = VALUE);

Use bigint with check-between constraint?

Maybe inconvenient, but not impossible... but if this is the kind of problem you have, then try creating integers that are 63 or 65 bits long in your favorite language and see how it goes. The comparison likely won't be in that language's favor.

Postgres bigint is a signed 64-bit integer, not an unsigned one. https://www.postgresql.org/docs/current/datatype-numeric.htm...

Trying to use it to store the full range of unsigned values would mean awkward math everywhere, to convert between the two ranges.

Lots of software uses uint64s, for e.g. as identifiers handed out by some API, and not being able to store them conveniently definitely hurts.
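The "awkward math" in question is typically an offset mapping: shift the unsigned range down by 2**63 so it fits in a signed BIGINT column, and shift back on read. A sketch (plain Python; the function names are illustrative):

```python
OFFSET = 2 ** 63

def uint64_to_bigint(u: int) -> int:
    """Map [0, 2**64) onto the signed BIGINT range [-2**63, 2**63 - 1]."""
    assert 0 <= u < 2 ** 64
    return u - OFFSET

def bigint_to_uint64(s: int) -> int:
    """Inverse mapping, applied when reading the column back."""
    return s + OFFSET

assert uint64_to_bigint(0) == -(2 ** 63)              # min uint64 -> min bigint
assert uint64_to_bigint(2 ** 64 - 1) == 2 ** 63 - 1   # max uint64 -> max bigint
assert bigint_to_uint64(uint64_to_bigint(12345)) == 12345
```

The mapping preserves ordering (so indexes and range scans still work), but the raw stored values look nothing like the original ids, which is exactly the inconvenience being complained about.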

Now store a 64-bit unsigned integer in Java. Or TypeScript. Or Lua. Or Python.

Now explain why you need an unsigned bigint in this set-oriented language. What's your use case?

That said, it's not a limitation of SQL, just Postgres, which avoided it for maintenance and interoperability reasons similar to Java's. MySQL supports unsigned bigint just fine. There is nothing in the SQL spec that forbids unsigned integer values.

Apparently there's an extension for that! https://github.com/petere/pguint

> "The industry, by and large, puts a crazy amount of effort into avoiding SQL by writing ORMs, translation layers and similar tech "

In my experience (I am retired from a telecom operator) using a database made applications quite inflexible yet real life implies constant changes in their data model and as a result when there was a "database migration" there was always some information lost.

On the contrary, Pick databases (MultiValue databases) were very useful.


I have done, quite literally, thousands of database migrations in a telco context and we never - not one time - lost information.

I don’t think there is any particular problem, or solution. Applications get old, assumptions change, the world moves in.

It’s true that our database design became inflexible over time and needed to be hoisted. But so too did our authentication layer (hand built - JaaS - JWT), the UI (AWT - Swing - Web 1.0 - Web 2.0), the deployment model (on prem - cloud - docker), even the API scheme (RMI - soap - rest - graphql).

Any application that lives long enough is going to go through this sort of change. The only constant in all of the above was SQL.

Beware the old man in a young man's game. You will be tempted to underestimate him. As many unremembered before you have done.

I'm of the opposite opinion. Database migrations force you to think through the changes in your data more explicitly. You catch logic oversights, boundary conditions and implicit transformations much more thoroughly and with fewer errors later on.
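The bookkeeping a migration tool does is itself small: a version table plus an ordered list of migrations, each applied at most once. A hypothetical minimal runner (sqlite3; names are illustrative, not any real tool's API):

```python
import sqlite3

# Ordered, append-only list of schema changes. Real tools read these from files.
MIGRATIONS = [
    "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)",
    "ALTER TABLE users ADD COLUMN email TEXT",
]

def migrate(conn):
    conn.execute("CREATE TABLE IF NOT EXISTS schema_version (version INTEGER)")
    current = conn.execute("SELECT max(version) FROM schema_version").fetchone()[0] or 0
    for version, sql in enumerate(MIGRATIONS, start=1):
        if version > current:          # apply only migrations we haven't seen
            conn.execute(sql)
            conn.execute("INSERT INTO schema_version VALUES (?)", (version,))
    conn.commit()

conn = sqlite3.connect(":memory:")
migrate(conn)
migrate(conn)  # idempotent: already-applied migrations are skipped
print([row[1] for row in conn.execute("PRAGMA table_info(users)")])  # ['id', 'name', 'email']
```

The explicit, reviewable list of changes is what forces the thinking-through described above: every schema change is a diff you can read before it runs.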

I've seen a lot of different enterprise systems. In my experience it's not the db that's causes this inflexibility issue, but designing the architecture of systems data-first (using Data Dictionaries / ER diagrams etc.) rather than use-case-first that leads to this sort of problem.

When a data schema is the foundation of a system, it's often the most difficult thing to change.

I don’t know how that works. Your data schema is the foundation of almost any system, regardless of how you generate it.

I think this is a common mistake. People think their code is what’s important. But it’s the data and how it gets transformed that is the core of most applications.

Data-based architecture means that you have all data about an entity in a single place, even when analyzing the domain and all its use cases shows there are slices of that data that never interact.

An example is storing a customer's address and purchase history in the same place, only because they "belong to the customer" in a data sense. This is very bad because it leads to an entangled architecture where everything depends on everything and you cannot extract truly independent modules. A project implemented like this simply cannot scale beyond a certain size.

So no, it's not the data that is important, but how it is used.

Ok. Perhaps I’m not familiar with this concept of “data-based architecture”. The GP talked about designing around features, which I think is also bad. But maybe they meant more like a DDD approach?

So - I certainly agree 100% with what you say about keeping thing independent. Maybe I’m just unfamiliar with the terms.

A more specific example is the idea of sitting down and listing all the attributes that define a customer profile (name, addresses, date of birth, etc) and then designing the CRUD around those things. Contrast this to starting with what are the interactions that a customer actually needs to worry about (send me this product to the address in my PayPal* account)

*or whatever external system.

If you had to guess how many different pieces of information about you exist in databases for services worldwide (and how many of those are inaccurate, or unnecessary to provide the service), that number would be fairly ridiculous for most people (OOM thousands or tens of thousands).

Taking the address idea - no e-commerce site really needs to store this as a profile item. It needs to be in my browser so I can autocomplete for any shipping form.

The second point I’m making is the same as your last. “How it gets transformed” is often more important to start with than how it is at rest.

(I acknowledge in advance this is a weak argument btw).

I actually agree with all of this.

Ultimately where I’m coming from is that the flow of data through a system and how it is transformed is actually what’s important. For me that’s taking a “data first” approach, to use your term, but I took the term to mean something different to what you meant.

You are totally right about things like customer info. If you have an email component then that’s where email addresses live. If you have a fulfilment component then that’s where physical addresses live. 100% agree that they don’t belong together, at least not by default.

In fact, I tend to think of entities like customers as just a uuid. The bits and pieces we attach to that uuid depend on the system. But many domain objects should be treated in a very abstract way.

I’m not sure this is widely accepted, but it’s where I got to.

For this to be true (which it is in some contexts), the important question is what the boundary of the "system" is.

I agree.

On reflection, I'd also add that I do design systems "data first" however to me this doesn't mean "database first". It's important to think about how data flows through the system, how it gets transformed, and what parts of the system "own" which data.

All of this requires understanding of the boundaries of the various systems that make up the whole, which I think is exactly your point.

I don't see anything particularly bad about SQL. If your problem is a recipe manager CRUD, it's trivial. If it's more complex than that, any DIY thing is gonna be hard too. If you need some insane performance, SQL might not be right, but any DIY thing will still be hard.

This is not a bad take, but then every so often I have to insert 10 million rows into a database and for some reason that takes way longer than it should, and now I'm stuck figuring out the particular knobs and buttons needed to make it consume a reasonable amount of time and space.

Most databases do suffer from "common denominator" syndrome, and you need to know a LOT, or be comfortable with hybrids, to get the right DB for the job if you want to avoid writing your own.

What you're describing is the common symptom among many programmers which is their unfamiliarity with disk limitations. It's not the database's fault for not being able to write 10M rows to the disk quickly. Disks, especially the cloud SSDs are quite slow defaulting very often to just 3000 write IOPS.

On the other hand, programmers are very familiar with CPU and memory limitations. Very rarely will a programmer try to hold 10M or 1B rows in memory because they intuitively understand memory limitations which is usually around 4GB per app/process.

> It's not the database's fault for not being able to write 10M rows to the disk quickly.

I am primarily saying this in regards to reads, but sometimes the database management system is indeed at fault, by picking sub-optimal ways to do things.

For example, I had an Oracle query that took 45 minutes to execute and I was called in to help out with it. This drastic slowdown hadn't been alleviated by reviewing what indices are in use, nor partitioning the table, however I did a bit of research and was able to optimize the execution to 3 seconds, with the NO_UNNEST optimizer hint: https://docs.oracle.com/cd/B12037_01/server.101/b10752/hints... for an EXISTS check.

Thankfully cases like that are somewhat rare, but I'd still generally pick boring and battle tested DBMSes for most of my use cases (maybe along the lines of PostgreSQL and MariaDB for a variety of reasons, or even SQLite for simpler cases), as opposed to newer ones. But when dealing with performance considerations, then it indeed is a good idea to look at whatever metrics are available, to see whether it's the hardware or the software that's responsible for certain characteristics.

Of course, for some workloads Redis or RabbitMQ might be better suited, if available (though I've seen key-value stores and message queues implemented in SQL that worked fine, at least up to a point).

If you have queries with joins or subqueries, and you have filters on multiple tables in the query, the DBMS has to make a best guess at order in which to execute the query.

If you are joining, for example, two very large tables and there is a filter on each, if the first table has 10 matching rows and the second table has 1 billion, you would hope the execution plan will filter by the first table first, but the DBMS may not know how many rows match each filter unless there are appropriate column statistics, so it may execute the query in the other order.

3 seconds is often horrible performance for an SQL query (although it may be "good enough" for your use-case), and implies hundreds of thousands or millions of comparisons during execution, which implies that the query is not covered well by indexes. In short, if you understand the shape of the data in the DBMS better than the database does, then you ought to provide index hints and order hints so that it can do as well as it can. (And if this particular multi-table filter is something that is run by users multiple times a minute, you should be investigating multi-table indexes, denormalising tables, or some third-party external index like CDC into Lucene.)
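You don't have to guess what the planner chose: every mainstream engine will show you its plan. A small sketch using sqlite3's EXPLAIN QUERY PLAN (Postgres's EXPLAIN and Oracle's EXPLAIN PLAN are analogous), checking that a filter is actually served by an index rather than a full scan:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (id INTEGER PRIMARY KEY, user_id INTEGER, payload TEXT);
CREATE INDEX idx_events_user ON events(user_id);
""")

# Ask the planner how it would execute the query, without running it.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM events WHERE user_id = ?", (42,)
).fetchall()
detail = " ".join(row[-1] for row in plan)
print(detail)  # e.g. "SEARCH events USING INDEX idx_events_user (user_id=?)"
assert "USING INDEX" in detail  # index lookup, not a table scan
```

Making this kind of check part of a test suite is one cheap way to catch plan regressions before users do.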

Writing millions of rows per second through a database on somewhat arbitrary cloud VMs is pretty normal. This isn’t an intrinsic disk limitation. IOPS doesn’t map that closely to write performance in many modern databases; an IOPS can carry a lot of logical writes. Storage bandwidth is the main limitation, which isn’t great with ubiquitous disaggregated storage in particular, but still sufficient for writing millions of rows per second without doing anything too special.

Obviously this is somewhat dependent on the specific data model, but most rows are pretty small.

Another thing I’ve seen (perhaps even been guilty of :) is assuming we don’t have to worry about our table or index structure.

Inserting 10 million narrow rows of integers and enums into a database with just a Pkey index will be super fast. Inserting 10mm wide rows containing json blobs and long text strings will be bad. Inserting 10mm json blobs with a bunch of indexes will be horrible.

I’ve been as guilty as anyone for thinking the database can work miracles and that I don’t have to worry about how the underlying hardware works.

But it’s that lack of care that is at fault, not the database engine. It’s just doing what you told it to do.

Actually what I'm describing is exactly the opposite: because I know the limitations of the disk, if I had to write custom code to solve my problem, I could easily outperform the database. Naturally that has many other issues that still make me choose a database, basically all of ACID, but it is still disheartening to know that fundamentally the operation can be done an order of magnitude faster than what you're seeing, and that improving the situation means understanding how the DB maps its operations to the disk, and how that produces limitations.

So in fact it is the lack of knowledge of the DB, not disk, that causes the problem. Apparently at this scope, the DB is a leaky abstraction.

This. I moved multiple web applications from bare metal to the cloud, and many times had to refactor code to hit the disk less. What worked before (dozens of SQL queries, or loading session data from a file, for each request) was all too slow in the cloud.

Also: I find that many engineers wouldn't be able to tell you if the storage attached to their VMs was networked, or actual hardware on the hypervisor. A distinction that has big implications for throughput and latency.

> It's not the database's fault for not being able to write 10M rows to the disk quickly.

It usually is though. Being able to write 10m rows quickly is entirely a matter of how the engine is designed to work and the required durability semantics.

Some designs can write 10m rows to magnetic storage in a few seconds. Others might take minutes.

The most ideal designs (from a raw throughput perspective) leverage batching and can write multiple transactions per disk IO operation.
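The batching effect is easy to see for yourself. A sketch (sqlite3 used for illustration; absolute numbers depend entirely on the engine and storage, so the timings printed are only relative): committing per row pays a durability barrier each time, while one transaction amortizes it over the whole batch.

```python
import sqlite3
import time

rows = [(i, f"name-{i}") for i in range(10_000)]

def insert_per_row_commit(conn):
    for r in rows:
        conn.execute("INSERT INTO t VALUES (?, ?)", r)
        conn.commit()                    # one durability barrier per row

def insert_single_transaction(conn):
    with conn:                           # one BEGIN/COMMIT around the whole batch
        conn.executemany("INSERT INTO t VALUES (?, ?)", rows)

for fn in (insert_per_row_commit, insert_single_transaction):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, name TEXT)")
    start = time.perf_counter()
    fn(conn)
    print(fn.__name__, round(time.perf_counter() - start, 3), "s")
    assert conn.execute("SELECT count(*) FROM t").fetchone()[0] == 10_000
```

On a real disk with fsync-backed commits the gap between the two is far larger than in memory, which is exactly the "multiple logical writes per IO" point above.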

> Disks, especially the cloud SSDs are quite slow defaulting very often to just 3000 write IOPS.

And then there's cloud solutions like AWS Aurora that blow those disk IO issues out of the water.

Where I see this happen a lot: testing.

Initially, test automation people think about tests as stateless, and so don't design the framework to have any persistence at all. Eventually, life proves that to be wrong, and tests start needing some persistence between sessions... well, it usually starts between different cases, and that's where AQA start building ad hoc "databases".

Unfortunately, it almost never dawns on them that what they needed all along was a database, so the tests grow a weird, slow, and buggy persistence layer.

this is a brilliant point, i wish i had thought of it and figured out a way to work it into the setup of the story. done this one myself enough times.

newer databases like planetscale now offer “branching” which in theory lets you use a copy of production data for tests. not really sure when this does/doesn't make sense but it seems interesting

While we're on the subject of ORM's I really like the https://github.com/cornucopia-rs/cornucopia way of doing things.

Basically write SQL in a file and code generate a function that runs the SQL for you and puts it into a struct (this one is for rust)

I think there's a library to do the same thing with typescript.

For me, the best way to talk to the database is with SQL and I don't have to learn an ORMs way of doing it.
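A toy version of that approach (what cornucopia and sqlc generate for you), to show the shape of it: the query lives as plain SQL, and a small generated function runs it and maps rows into a typed struct. All names here are illustrative, not any real tool's output.

```python
import sqlite3
from dataclasses import dataclass

# In the real tools this SQL lives in a .sql file; the wrapper below is generated.
GET_USERS_BY_MIN_AGE = """
SELECT id, name, age FROM users WHERE age >= :min_age ORDER BY id
"""

@dataclass
class User:
    id: int
    name: str
    age: int

def get_users_by_min_age(conn, min_age):
    rows = conn.execute(GET_USERS_BY_MIN_AGE, {"min_age": min_age})
    return [User(*row) for row in rows]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?, ?)",
                 [(1, "ada", 36), (2, "bob", 17)])
print(get_users_by_min_age(conn, 18))  # [User(id=1, name='ada', age=36)]
```

You keep writing real SQL, and the only "ORM-ish" part is the mechanical row-to-struct mapping, which is exactly what's nice to have generated.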

I used sqlc[1] for Golang; while it’s very cool and keeps things simple, in the end I switched to GORM as I needed hooks, and dynamic queries were annoying to write. I could have built these things myself on top, but that would have been a homegrown ORM in the end. But like every ORM, GORM also has its problems; pick your poison.

[1] https://sqlc.dev/

I used to write small 2D engines when I was a teenager. I had no clue about databases at the time, so I ended up writing a NoSQL-ish database to save games. I soon learned that rolling your own db is looking for trouble.

I believe the article has the wrong take on this; it seems more like a rant to generate views than an actual analysis of why people write things that already exist in the wild.

It's not about spending time, but about all the actual real learning from making a complex piece of software (though if you focus too many dev cycles on it, that might become harmful). Still, at the end of the day, you will have a team of near-experts in topic X who can go back and even improve on what they are doing (using any other common, well-known framework).

And yes, I agree with the author that it doesn't make sense to write one just for the sake of it, but as Feynman said: "What I cannot create, I do not understand."

I think there’s a big difference between building something to learn about it, vs accidentally building something complex in order to support the thing you’re actually trying to build.

The point is, just use a database. We’re talking 15 minutes of effort.

Or to paraphrase someone wiser than me, 4 weeks of programming can save 15 minutes of installing.

Welcome to the world of today's tech gurus. These guys are busy sharing what's wrong and what's right on Twitter all day. The number of people who talk about real world applications is almost non-existent.

for what its worth i was just writing a database complement to https://rachitnigam.com/post/you-have-built-a-compiler/ because i thought the format was really helpful in understanding why we independently evolve the same things from first principles. i’ve done some of these mistakes in real life, but adding anecdotes would have spoiled this particular story format.
