Hacker News new | past | comments | ask | show | jobs | submit login
Upsert in SQL (antonz.org)
140 points by ingve on Sept 25, 2023 | hide | past | favorite | 55 comments



> insert...on conflict is a fine way to do upserts on older PostgreSQL versions. For modern ones (15+) there is a better alternative [SQL Standard: MERGE]

This is incorrect. To quote the Postgres MERGE docs:

"When MERGE is run concurrently with other commands that modify the target table, the usual transaction isolation rules apply; see Section 13.2 for an explanation on the behavior at each isolation level. You may also wish to consider using INSERT ... ON CONFLICT as an alternative statement which offers the ability to run an UPDATE if a concurrent INSERT occurs. There are a variety of differences and restrictions between the two statement types and they are not interchangeable."

MERGE can give duplicate violation errors when concurrent inserts take place (at the default READ COMMITTED isolation level). MERGE is no more capable of avoiding these sorts of race conditions than totally naively application code. In short, upserting is precisely what you're not supposed to use MERGE for -- ON CONFLICT is expressly designed for that use case.


I agree. In SQL Server, merge comes with a number of quirks that cause me more pain when I try to use it. The ONLY reason I use MERGE today, isn't to merge, but to include non-inserted columns in the OUTPUT clause; only the merge statement can do that. This can be useful when you want to output a lookup from a source to a newly inserted identity (auto-increment) column.

Generally, stay away from MERGE. At least in SQL Server.


I've done a lot of work on MERGE queries recently. For postgres I suspect that self updating CTEs that report back what actually happened are maybe a better idea.


MERGE is standardized in ANSI SQL 2003, and updated in 2008:

https://en.wikipedia.org/wiki/Merge_(SQL)

No question, everyone should use standardized, cross-platform syntax if at all possible.

Maybe someday ANSI will standardize some kind of upsert syntax; until that time use upsert only if you are explicitly tying your SQL to your particular database.


> No question, everyone should use standardized, cross-platform syntax if at all possible.

There are big questionmarks around this statement. SQL isn't an interchange format. Standard table design goes a long way because data has lots of different consumers and needs frequently to be moved around between systems, standard SQL is nothing useful to aim for. Everyone already uses DB-specific drivers and syntax because it is better than the alternative.

- The standard is not generally available. Most of us will never learn what is in it.

- SQL standardises terrible syntax. The SQL standards committee has a unique view on programming where "function_name(a, b, c);" is some sort of weird alternative syntax. They haven't gotten as far as working out variables or function composition either [0].

- Tying SQL to your specific database is the best option for performance. Writing database-independent SQL is somethign of a fools errand, it is either trivial performance-insensitive enough that you should be using a code generator or complex enough to deserve DB-specific turning.

[0] I assume. Not going to pay for a copy of the standard to check.


> The standard is not generally available. Most of us will never learn what is in it.

It also not aimed at users, but at implementors. Funny enough, they don't read the standard either ;) But more seriously: Some implementations are old and generally vendors prefer not changing the behavior of their product. When the standard comes later, the train has already departed. The most critical incompatibilities are in the elder features. The newer ones have a tendency the be more aligned with standard behavior (e.g. window functions are typically implemented just fine).

> They haven't gotten as far as working out variables or function composition either [0].

Part 2 SQL[0] is declarative and intentionally doesn't have variables. Part 4 SQL (aka "pl" SQL) does have variables. I personally consider Part 4 obsolete on systems that have sufficient modern Part 2 support.

> Tying SQL to your specific database is the best option for performance. Writing database-independent SQL is somethign of a fools errand, it is either trivial performance-insensitive enough that you should be using a code generator or complex enough to deserve DB-specific turning.

While this is certainly true for some cases there are also plenty of examples where the standard SQL is more concise and less error prone than product-specific alternatives. E.g. there COALESCE is the way to go rather than isnull, ifnull, nvl, or the like (typically limited to two arguments, sometimes strange treatment of different types).

There is a lot of *unnecessary* vendor-lock in in the field.

[0] https://modern-sql.com/standard/parts


> Part 2 SQL[0] is declarative and intentionally doesn't have variables...

The issue I have with that is that making a bad decision for a reason doesn't change the fundamental correctness of the decision.

SQL semantics clearly support variables, because they are available as WITH ... clauses and subqueries. So either these elements are a mistake and should be discouraged, or the language has variables and they should be made to read like variables in almost all other languages for consistency and readability.

This is literally a cosmetic issue but a lot of thought on syntax has happened since the 1970s and there seems to be an overwhelming on-the-ground consensus that SQL's syntax is not the way to lay code out. The semantics? Great. But there is nearly nothing from SQL's syntax choices that is present in any surviving programming language since. Many of the more popular languages explicitly reject SQL syntax decisions like Python (significant whitespace), "function(arg,arg,arg)" instead of constructs like "VERB arg,arg WITH TEXT,arg TEXT arg" (everything I've met bar lisp) for functions and not replicating broken English (everything).

Variables are a different thing because the declarative nature of SQL is a useful property. But standard SQL overall does such an appalling job of handling syntax I feel confident that they have gotten this part of the syntax wrong too.

Gotten a bit off topic, just happens to be a thing I feel strongly about.


Lisp is in no way closer to SQL syntax than languages like python. On the contrary, lisp syntax is much more regular.


Sigh, I never wanted to read the SQL standard before, but now that I know it’s behind a pay wall I want to read it.


I don’t believe there is a SQL standard.


What most users want is a statement that allows them to UPSERT, with reasonable guarantees around that never throwing a duplicate violation error, and never deadlocking (at least not in any novel way caused by implementation details). Something practical.

That's not the kind of thing that the standard takes a position on, for a variety of reasons. In general, it says very little about concurrency control/isolation. And what little is does say is all very abstract.

For whatever reason, a lot of people believe that MERGE promises something over and above what using multiple INSERT, UPDATE, and DELETE statements will promise. But that simply isn't true. The standard doesn't say anything that even vaguely suggests otherwise. Same is true of both the SQL Server MERGE docs, and the Oracle MERGE docs.


Before you recommend anything, read this about MSSQL's dangerously borked MERGE https://www.mssqltips.com/sqlservertip/3074/use-caution-with...

Anyone using MSSQL should read it.


I recently covered MERGE on modern-sql.com: https://modern-sql.com/caniuse/merge

There I also look at limitations of some implementations and problems such as not reporting ambiguous column names — just guessing what you mean ;)


Great work as always! I still remember looking at your site when I was working on an ETL system for sensor data and realising in frustration how MariaDB was missing all the nice modern SQL functions like merge.


I've noticed that MERGE in MSSQL can have a performance penalty as opposed to have separate insert and update statements.


mssql merge has a lot of problems. it's good for the most basic and straightforward of use cases, but it struggles with a variety of features:

https://michaeljswart.com/2021/08/what-to-avoid-if-you-want-...


We stripped MERGE out of all of our production code. Not worth the many corner cases and footguns that remain even after over 15 years of being in the product.


In ms sql you can do an update query, check the rowcount and if it's 0, then do an insert.

That's how I got an improved query performance for syncing data.


I think you need a unique index if you're doing something like this in MS-SQL, Postgres, or MySQL. Otherwise, the default isolation level won't prevent this data race:

  writer A: begin xact, update, see 0 rows updated
  writer B: begin xact, insert same thing, end xact
  writer A: insert, end xact
One reason I use ON CONFLICT in Postgres is for explicitness. You're specifically telling the DBMS to upsert, and it'll throw an error if you don't have a unique index to support this.


Also mentioned above, ON CONFLICT DO UPDATE would be able to update the concurrent writer's row instead of throwing a uniqueness violation.


The problem with plain/primitive upset/merge operations is they don’t always support handling of what should happen when a row is deleted from the source. People new to data engineering don’t always recognize this.

Im those cases, one needs “delete-then-upsert” or “soft-delete-then-upsert” in your upset/merge implementation. I was a tad disappointed to see this only peripherally alluded to.

With some flavor of merge implementations you can do this with a pair of merge calls but in my opinion, ideally it should be one merge operation and db vendors should wise up to this.

Per https://modern-sql.com/caniuse/merge only SQL Server and BigQuery currently support “when not matched by source”. But Snowflake lets you have multiple “when not matched” clauses with “and” expressions so perhaps you can do it on that. Redshift docs show examples of using two merge statements.


> The problem with plain/primitive upset/merge operations is they don’t always support handling of what should happen when a row is deleted from the source.

I didn't understand what you meant by this - you seem to be referencing a larger workflow (e.g. keeping 2 tables in sync), and I'm not understanding how is relevant to the "upsert" primitive. Can you give an example?


You are bringing data from a transactional (OLTP) system into a reporting/analytics (OLAP, lakehouse, data warehouse, data hub) system.

The analytic database is expected to maintain a list of all customers and their transactions going back years. The transactional system stores a more limited history for operational performance reasons.

You daily pull the list of customers from source transactional system and upsert/merge it into your analytic database. Over time the source transaction team removes old unused or mistakenly created customers.

If you use a pure upsert without awareness/handling of source-side deletes, your analytic list of customers grows without any way to indicate to reporting users which customers are still in your transaction system (or when they got deleted from the transactional system.) But various analytics and/or APIs from your analytic system need to exclude such no-longer-present customers…

“Why am I seeing these customers in analytics that aren’t in the source system?” “Can I just get the current customer list from the analytic system?” “Can your analytics component X quit transmitting to me these old/bogus customers I don’t see in the transactional system?” people will ask you…


UPSERT support is really for OLTP databases with frequent transactions that may conflict. If you're doing periodic batch jobs then you can just use plain insert/update/delete (and I believe a common pattern for syncing to an analytics database would be to first dump all the data from the transactional system into a staging table, from which you can run easily comparative queries (e.g. to check for records no longer present in the transactional database) before updating the main tables.


Re ‘UPSERT is really for OLTP’, I find that claim surprising. If you mean the oracle UPSERT command, you may be right and should ignore my post, but if you mean upsert concept/terminology/MERGE-usage, I’m not sure I agree.

I admit I have no visibility into the true history of Oracle UPSERT vs ANSI merge (2003) and their initial target audience or the audience for early implementations. A web search on my part turned up no clarity. But I have three reasons I don’t perceive upsert as as OLTP-not-OLAP feature:

1) I vaguely recall seeing documentation from Microsoft around 2008 when they first implemented merge (I was excited at the time, forgive me) that it was for analytic use cases.

2) Anecdotally I did telecom+web “OLTP” programming for many years without hearing about the upset concept, but heard about upserts in my first few weeks working in the analytic (ETL) space and it was clearly a fundamental operation there.

I suppose the upsert concept was also fundamental for web crud operations and ensuring idempotency but nobody called it upsert back when I was doing that — we just wrapped an insert and update in a transaction. I acknowledge this is an anecdotal claim but it is my Bayesian prior.

3) Merge is a bit notorious for wonky transactional behavior in corner cases… so I’m not confident OLTP was/is the target audience.

Additionally I am not sure why you suggest batch usage implies you wouldn’t upsert. That’s a bit bizarre to me. The operation for moving data from staging tables (for dimensional/attribute data) to main target tables often uses upsert logic of some form.


I guess I would consider the defining feature of true UPSERT support as being an atomic operation that does not exhibit "wonky transactional behaviour". Which means Postgres's ON CONFLICT...DO UPDATE, and not MERGE.

Given that MERGE has this "wonky transactional behaviour", I don't really see the point of using it at all. I guess it gives a more convenient syntax in some cases, but it seems like it will always be possible to replicate the functionality by something like separate INSERT/UPDATE operatiosn (optionally wrapped in a transaction). And OLAP use cases are unlikely to have lots of concurrent transactions that are liable to issues with this approach.

So basically, while you might want to UPSERT in an OLAP use case, you probably don't need watertight atomic UPSERTs, which are the feature that requires special database support.


Something I find somewhat annoying about MySQL's INSERT...ON DUPLICATE KEY UPDATE is that it increments auto-increment columns even if no new row is inserted. Does anyone know if Postgres' ON CONFLICT and MERGE do the same?


I expect this behavior to be pretty common. In PostgreSQL, auto incrementing keys like this are backed by sequences.

Sequences, at least in the database products I'm familiar with, do not exhibit transactional behavior; I'm not sure if this is actually part of the standard, but if not, I at least expect this to be pretty common. Part of the reason for this is to avoid introducing a point of serialization in concurrent transaction processing. If we were to do what you're asking for, you can get quite a performance hit. Consider a long running transaction involving an insert into a table using an auto-incrementing key. Any other transaction that wanted to get an ID would have to wait for that long running transaction to finish to get a number from the sequence... the waiting transaction wouldn't know if the long running transaction was going to commit or rollback and so the sequence number would be undefined until the waiting transaction could get its own lock.

If we set forth the expectation that a sequence will only guarantee you that you'll get a unique number from within the range, but that any other desire like gapless numbering is out of scope for the sequence you can avoid that whole waiting game. Sure, you might have gaps as a result; if the long running transaction finally rolls back, you'd get a gap for the number it consumed.... but no one waited to find out if that was the case and all the records got unique IDs.

If you need to ascribe meaning beyond merely "unique identity" to such a record ID, it might be that an auto-sequencing is simply the wrong tool. I think the compromise and expectation that sequences, and extended functionality depending on sequence mechanics, are not transactional is a good one given the performance costs of ensuring gapless numbering whether you really need it or not.


worth re-upping -

postgres requests an id from the sequence for each row of the incoming data ahead of time since it doesn't know which rows are updates and which are inserts (although presumably this could be changed?). the sequence doesn't reset down for the unused so this can eat through it unexpectedly quickly if you have a table with a relatively large volume of updates.

also as a tip if you hit the max integer for the sequence and need space to implement a fundamental fix you can quickly change the sequence to start at -1 and go down. there's no issue with negative ids since they're also integers.


> postgres requests an id from the sequence for each row of the incoming data ahead of time since it doesn't know which rows are updates and which are inserts (although presumably this could be changed?).

Not in any sort of general way - the sequence can be part of the unique key that you're conflicting on.


PG upsert also increments the counter.

I somehow feel it's wasteful and feel a bit conflicted. I also realize that while it feels natural to somehow get a feeling for the row numbers and it can help with debugging because you somewhat know some IDs / ranges after a while it's probably more something I just got used to and maybe shouldn't rely on.


Yes, unfortunately it does and I don't think you can work around this...


Good article, but I wouldn't characterize MERGE as strictly better than the older "ON CONFLICT ..." for Postgres, because MERGE doesn't support any sort of RETURNING clause.


INSERT ... ON CONFLICT has a problem with CHECK constraints. The constraint must be satisfied in the INSERT tuple even if the row already exists and UPDATE will be executed. I recently dealt with this at work and the resulting CTE query was very similar to MERGE, except MERGE has better syntax.


I found a post on the mailing list about this: https://www.postgresql.org/message-id/CANbhV-EXmw9i4YhJTgsrL...

I'm not sure how to go about making an official feature/change request though.


Yeah, we were happy to see MERGE in latest postgres but abandoned it after we found there is no RETURNING...


The challenge with UPSERTs is ambiguity, and the variety of syntax options you see on the page indicates different attempts to try to address that ambiguity while still keeping a simple syntax.

Personally, I work with MSSQL which has skipped adoption of UPSERT syntax and supported MERGE statements for probably 15 years now. While there's a certain degree of complexity in a MERGE statement, there's also a strong level of explicitness that is helpful in making sure certain conditions don't fall through the cracks or don't perform an operation that isn't expected.

The ability to handle different conditions in match statements is incredibly powerful and allows potentially complex ETL statements to be handled within a single merge statement. I'm not saying that's always the best practice, but if things are structured in the right way, it may be.

I believe the transaction isolation of a MERGE statement (in MSSQL at least) means that the lock on the rows is not released between the read and the write, which means you'll have fewer issues with reading data, attempting an update, then having it fail bc another process made a change in between the two calls.

I regularly use output tables to debug the results of a MERGE statement, and in this case the $action keyword is incredibly helpful for knowing what was done.


> allows potentially complex ETL statements to be handled within a single merge statement. I'm not saying that's always the best practice, but if things are structured in the right way, it may be.

I wonder how often one has the ideal schemas for such. The advantage of an ETL is that it's probably easier to debug, log, and trouble-shoot problems. One-Giant-Do-All-Command is often a recipe for headaches. Being able to "X-Ray" the intermediate pipes really helps troubleshooting.

The X-ray issue is one reason I have trouble debugging functional programming. Maybe there's a way to X-ray better, but I haven't found it yet.


It's been many years since I worked at a MSSQL shop but at least at the time (2008R2) there were issues with MERGE upsert statements deadlocking with themselves when run concurrently.

Based on https://dba.stackexchange.com/questions/187776/merge-deadloc... it seems that's still an issue as of 5 years ago, and not the recommended way of doing an upsert on MSSQL.


Is anyone aware of performance benchmarks comparing Postgres's new MERGE to the old INSERT ... ON CONFLICT?

The "WHEN NOT MATCHED THEN DO NOTHING" is very appealing, and we currently implement this with a transaction and a DELETE after the upsert because we want to do updates only, not unintentionally insert new rows.


> we currently implement this with a transaction and a DELETE after the upsert because we want to do updates only, not unintentionally insert new rows.

Huh??? Why not just do an UPDATE then? After all, you seem to be saying "we want to do an upsert but without the insert part".

Edit: I'm also trying to understand the value of "WHEN NOT MATCHED THEN DO NOTHING". How is that any different that just leaving off the "WHEN NOT MATCHED" clause altogether?


Batch update many many rows at once with entirely different values.

(Agree that the “WHEN NOT MATCHED DO NOTHING” is effectively the default behavior.)

Use case is the Outbound Prober processes for https://heiioncall.com/ which can do thousands of HTTP requests per second, then aggregate all those row updates into a single query (or single transaction, which in practice is a single query).

We can do these batched updates very, very fast using INSERT … ON CONFLICT. No need to re-parse a differently shaped query, only a single index rebuild. Smooth :)


Still not really understanding - you can do all of that with plain UPDATEs (using join syntax if you want to "batch update many many rows at once with entirely different values.") I would be hella confused as a developer if I saw an INSERT ... ON CONFLICT statement where you specifically wanted to prevent any inserts.


Ah, by “join syntax” do you mean like this answer: https://stackoverflow.com/a/70803859 ?


Unnesting is a bit orthogonal, I'm basically saying any UPDATE statement that has a FROM clause in Postgres.


FYI, article doesn’t say it but Redshift nowdays supports MERGE: https://docs.aws.amazon.com/redshift/latest/dg/r_MERGE.html


Great explanation,

I highly recommend the author’s book

https://antonz.org/sql-window-functions-book/


Excellent article.

I did not realize that PostgreSQL now supported MERGE


That will save me a few headaches--I wasn't aware of the MYSQL ability to reference the values in the update part.


An old Python package with the implementation of MERGE in BigQuery - https://github.com/ZiggerZZ/pygbq


Its annoying that a standard upset cannot be part of SQL.


But MERGE is now the upsert that is part of the SQL standard.


Unfortunately MERGE doesn't seem to have the transactional semantics that people want from an UPSERT feature (I believe in any implementation, but perhaps Oracle supports properly transactional MERGE?)


Thanks for the article.

You have ton of interesting posts in your personal website!


I don't know how recently the "UPSERT" nomenclature came into use but I laughed the first time I heard the term used several years ago, thinking it was a verbal contraction for a concept that, while logical, didn't actually exist in database engines (my ignorance was quickly remedied after my expression of mirth). It certainly wasn't part of the SQL courses back in 1998.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: