
Upsert anti-pattern in SQL Server - hobs
https://sqlperformance.com/2020/09/locking/upsert-anti-pattern
======
mumblemumble
This is specific to Microsoft SQL Server, which has a rather fraught history
in this department.

I believe the best option in PostgreSQL is still INSERT ... ON CONFLICT ...

In general, this is one of those spots where every single RDBMS behaves
differently, so knowledge from one platform doesn't necessarily transfer well.

(Which, in turn, is one of the reasons why it's not necessarily a good idea to
try to be database-agnostic.)

~~~
liability
Sqlite has the same as Postgres; INSERT ... ON CONFLICT ... DO ...

It's very ergonomic and intuitive. I understand why SQL Server needs to retain
older forms, but why can't they also adopt these new forms that are more
pleasant for the end user? I suppose the people buying licenses to SQL Server
usually aren't the ones actually using it.

~~~
mumblemumble
Officially, MERGE is the method defined in the SQL standard, and is by far the
most widely supported syntax.

Everyone else uses different, DBMS-specific, non-portable syntaxes. A lot of
them (e.g., PostgreSQL and Sqlite) have some similarities, but are still
mutually incompatible.

I think that Microsoft arguably made the right decision by choosing to conform
to the standard. It's just kind of unfortunate that the ISO standard syntax is
(in my personal opinion) not very good.

~~~
petergeoghegan
I am the primary author of INSERT ... ON CONFLICT in Postgres. I extensively
studied similar features from other database systems as part of that project.

MERGE simply doesn't provide the guarantees that most users want -- not in
theory and not in practice. For example, the Oracle docs introduce MERGE by
saying: "This statement is a convenient way to combine multiple operations. It
lets you avoid multiple INSERT, UPDATE, and DELETE DML statements."

It's likely impossible to make the concurrency guarantees work sensibly in
READ COMMITTED mode while supporting MERGE's very general semantics. There are
subtle but very real problems with making it behave like ON CONFLICT in one
context but not in others. In my view MERGE is a perfectly reasonable feature,
but it just isn't what people want when they say they want to upsert.

The SQL standard deliberately underspecifies anything related to transaction
isolation or concurrency. I cannot see the standard providing any guidance on
upsert for this reason.

~~~
andorov
Thank you for your work! Postgres is great to use and the regular release of
useful features has been a very nice benefit.

One unobvious (to me) but possibly necessary side effect of the INSERT...ON
CONFLICT pattern is that it reserves a primary id for every row ahead of time,
presumably because it does not know what will be updated and what will be
inserted. We have a heavily updated table that started failing inserts one day
because the primary id sequence hit the max int (2.1b), despite the table only
reaching 100mm rows. The immediate fix was to reset the sequence to -1 and
have it start going down...

~~~
petergeoghegan
I hear that one from time to time. It's a consequence of the fact that
sequences are generally not transactional, and the fact that the underlying
"speculative insertion" infrastructure really does have to do most of the work
of the insertion before it can decide whether or not the insertion really
should go ahead.

I'm sympathetic, but unfortunately I cannot think of any tractable way of
avoiding the problem at the implementation level. Maybe it would be possible
to do something with an identity column (as opposed to a SERIAL column or a
raw sequence) -- we have access to sufficient context there. Maybe we could
skip consuming a sequence in the common case where there is a clear conflict
at the start, and the sequence isn't accessed through the EXCLUDED pseudo
table. We wouldn't actually promise anything more, but in practice we wouldn't
burn through sequences at the same rate in cases where updates are common.

This is quite a significant project, but maybe it'll happen some day.

------
bosswipe
It's annoying that databases don't implement UPSERT natively. As far as I can
tell the solution provided by every db requires the repetition of all the damn
fields. To pick on the supposedly more ergonomic postgres method, here's a
typical example,

    
    
      INSERT INTO students(id, firstname, lastname, gender, d_o_b, email)
      VALUES (1516, 'Gerard', 'Woodka', 'M', 'January 27 1995', 'gerardo_woodka@gmail.com')
      ON CONFLICT (id) DO UPDATE SET
        firstname = EXCLUDED.firstname, 
        lastname = EXCLUDED.lastname, 
        gender = EXCLUDED.gender, 
        d_o_b = EXCLUDED.d_o_b, 
        email = EXCLUDED.email;
    

At least it doesn't force you to repeat the values like many other servers.
But still the repetition is bug prone and hard to maintain as the table
changes. And it's just annoying to type.

~~~
nkozyra
I think you really want to repeat the fields because inserting versus updating
could require dramatically different behavior.

~~~
masklinn
There are many cases where you only want to update a subset of the fields, or
even not update anything, but it's also true that a shortcut to update
everything but the conflicting fields would be nice.

~~~
nkozyra
Oh I get the convenience, it just seems like an easy way to shoot yourself in
the foot, like an update without a limit clause.

------
gfody
> I think everyone already knows my opinions about MERGE

not everyone is stuck on a 15-year-old version of SQL Server. odds are you can
go ahead and use modern language features. I can understand why a DBA
consultant that frequently encounters production issues with old bugs would
feel this way though.

~~~
masklinn
It's certainly not jus sql server. The implementor of INSERT…ON CONFLICT
UPDATE in Postgres considered and rejected MERGE[0] on grounds that it's
simply not specified to do what people generally _want_

> MERGE, as implemented within other major systems and as described by the SQL
> standard does not require an atomic insert-or-update for even the simplest
> possible case of OLTP style insert-or-update.

> SQL MERGE as implemented within other systems doesn't require that a unique
> index be defined on the column or columns of the MERGE statement's outer
> join. I am not enthusiastic about the idea of falling back to table-level
> locking when an appropriate index isn't available. That could be a foot-gun
> for the large majority of users that just want to solve their basic insert-
> or-update problem.

A MERGE patch was submitted for pg11, it got bumped multiple time and returned
with comments early 2019 with no visible activity since.

[0] [https://www.postgresql.org/message-
id/CAM3SWZRP0c3g6+aJ=YYDG...](https://www.postgresql.org/message-
id/CAM3SWZRP0c3g6+aJ=YYDGYAcTZg0xA8-1_FCVo5Xm7hrEL34kw@mail.gmail.com)

~~~
gfody
this has nothing to do with SQL Server. on SQL Server MERGE is atomic, if the
source has no indexes it's probably a temp table, of course evaluating "not
matched" requires joining the source to the target, on SQL Server you should
be thinking and working in sets and avoiding single row (RBAR) whenever
possible. if you're doing highly concurrent single-row upserts you're already
SQL Servering wrong and I'll concede that's a case where you probably need a
bunch of lock hints (regardless if it's a merge statement or an update
followed by an insert)

------
TekMol
It is such a common operation that I think MariaDB is doing the right thing by
having a dedicated command for it, "INSERT ... ON DUPLICATE KEY UPDATE":

[https://mariadb.com/kb/en/insert-on-duplicate-key-
update/](https://mariadb.com/kb/en/insert-on-duplicate-key-update/)

~~~
jakobdabo
Also, REPLACE INTO...

~~~
gweinberg
I've been burned with replace. The problem is that if you have fields with
default values they will be set/reset to the default even if nothing changed.
What I'd like to happen is that for any values that I don't supply, the table
values remain unchanged.

------
adontz
Actually for many rows I delete and insert.

Much easier to write.

DELETE FROM t WHERE id IN (SELECT id FROM new_data); INSERT INTO t SELECT id,
other_field FROM new_data;

It seems so natural to me that I do not have to repeat field list.

Queries were mostly in context of ETL, so never considered performance issues.
Never wrote such code for OLTP.

~~~
jimktrains2
That'll update all ties though, including fields like created_at that you may
not want updated.

------
grugagag
I agree with this but this is a small nitpick compared to what kind of SQL
queries Ive seen in production in different places I worked at. Not to mention
the amount of redundant queries from ORMS performance hit which a lot of
developers aren't even aware of.

------
mulmen
Shouldn't the IF EXISTS be done after starting the transaction? What happens
if the key is deleted before the transaction can start?

------
mjevans
Title possibly: "Stop writing upserts the wrong way (MS-SQL Server)" since
it's less clear which flavor of SQL this is relevant?

~~~
hobs
SQL Server is the name of the product, hence the inclusion in the title.

~~~
bragh
In a common forum that is not purely dedicated to Microsoft technology, MS
product names really ought to be prefixed — context is important. For example,
it applies to Entity Framework also: it has happened a few times to me that a
non-MS developer assumed that ORMs in general were discussed, while it was
related only to the MS implementation.

------
Semaphor
when I need to do some kind of upsert on SQL server, I tend to refer to this
post: [https://michaeljswart.com/2017/07/sql-server-upsert-
patterns...](https://michaeljswart.com/2017/07/sql-server-upsert-patterns-and-
antipatterns/)

------
betimsl
You also gain by executing one less transaction at the times of successful
update.

Quite a good pattern actually.

------
kardianos
True. Don't use MERGE in MS SQL Server. Use two separate statements.

~~~
teh_klev
It's better to back up assertions, even if you're correct:

By the same author (who I highly respect):

[https://www.mssqltips.com/sqlservertip/3074/use-caution-
with...](https://www.mssqltips.com/sqlservertip/3074/use-caution-with-sql-
servers-merge-statement/)

------
dathinab
And that is (part of) why RDB "failed" to keep people even for use-cases where
they should be better then NoSQL.

\- To much hidden complexity which can leaf to bugs if you don't know about
the _hidden_ complexity.

\- To messy and/or misunderstood concurrency.

\- Many People having fundamental misunderstandings about how to do things and
how far transactions are useful for given (often implicit default) options.

One thing I had to learn the hard way is that many programmers which
supposedly "know" about RDMS/SQL don't even understand isolation levels
properly.

Also is it just me or is this way easier in postgres (with on conflict ...).

~~~
teh_klev
It makes me weep when interviewing and I ask developers about fundamental
stuff such as "what is ACID?"[0]

None of this stuff is hard to learn, and _I 'm_ not exactly Mr Ace Developer.
But time and again I've been asked to debug/fix data inconsistencies that were
caused simply because the developer didn't know what a transaction was, whilst
running multiple inserts and updates, and one of them failed along the way.

[0]: [https://en.wikipedia.org/wiki/ACID](https://en.wikipedia.org/wiki/ACID)

~~~
ragnese
I don't know if it's worth a damn, but I have no idea what the acronym stands
for. I'm sure I knew it at some point and I know it's a database compliance
standard kinda thing.

But I use the shit out of transactions. I rather have contention issues than
non-atomic operations done to corrupt our data. When I get to Facebook scale
I'll let you know what I think about eventually-consistent and NoSQL, etc. ;)

~~~
dathinab
The problem is people not even knowing that if you use the default isolation
level (on pg, read-commited) that you can't do something like "read cell X,
modify X in code (e.g. X+=1), write X back" without running into race
conditions which you can either avoid by using a lock (e.g. a row lock) or a
different isolation level.

No need to know what ACID is an acronym for (I forget it all the time) or how
the isolation levels are called, but you need to know what to look out for
(mainly data dependencies between read an writes in the same transaction
having race conditions you need to "team" and multiple read inconsistencies)

~~~
teh_klev
> No need to know what ACID is an acronym for

Sure, but if you're mentoring a junior developer then I'd like to think you'd
pass this knowledge on and make them understand its importance in a formal
way, rather than some partial hand-wavey "this kinda means this". Get them to
go and learn these fundamentals and test them on it, which is good for you
because it keeps it fresh in your mind to just roll off the tongue at parties
:)

------
tuatoru
I don't understand why INSERT ... ON CONFLICT ... DO is even a thing.

Yesterday there was a item on a piece by rachelbythebay about checking for and
dealing with unexpected conditions in your code as a practice that separated
professionals from ... er, idiots.

This would appear to be a prime example of that antipattern -- in general, not
just in SQL Server. If a tuple snuck into your relation while you weren't
looking, don't you need to rollback and find out the current state of things?

edit: Rachel's piece:
[https://rachelbythebay.com/w/2020/09/02/check/](https://rachelbythebay.com/w/2020/09/02/check/)

~~~
manigandham
What do you mean by "you weren't looking"? The SQL statement which is doing
what you tell it to do, _when there 's a conflict_.

~~~
tuatoru
I'm asking, why is there a conflict? Why don't you know whether or not that
tuple is in the relation before you start?

In OLTP, this practice (UPSERT) would cause all sorts of mayhem.

~~~
icedchai
In practice, these conflicts can happen quite often during data
synchronization. Say someone uploads a list of inventory, keyed by ITEM_ID.
You want all your items to be either added (new) or modified (previously
existing.) You _could_ loop through them one-by-one, or you can just let the
database handle it with a bulk upsert.

