Skip down to the bottom for a simple approach that just works in sufficiently recent PG releases (MERGE). Back up just a bit for something that's nearly as good and works with everything since PostgreSQL 9.5 (INSERT ... ON CONFLICT). Most of the post describes more complicated approaches that have subtle race conditions, and those are worth looking at mainly to understand what the problems are.
Small nitpick - the author mentions upsert at the beginning of the article only to "forget" about it and use it at the end (INSERT ... ON CONFLICT DO NOTHING), and that's the obvious solution here.
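For reference, a minimal sketch of that approach, assuming the same tags(id, name) table used in the transcripts further down:

    INSERT INTO tags (name) VALUES ('B')
    ON CONFLICT (name) DO NOTHING
    RETURNING id;

Note that RETURNING yields no row when the tag already existed, so a follow-up SELECT is still needed; that SELECT is exactly where the races discussed downthread come in.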
The final query is very neat though! And special thanks for mentioning "MERGE ... RETURNING" in PG17, that's really cool
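For the curious, a rough sketch of what the PG17 form might look like against the same tags table; the no-op UPDATE in the MATCHED branch is there so the existing row is visible to RETURNING:

    MERGE INTO tags t
    USING (VALUES ('B')) AS input(name)
    ON t.name = input.name
    WHEN MATCHED THEN
      UPDATE SET name = t.name          -- no-op, makes the matched row returnable
    WHEN NOT MATCHED THEN
      INSERT (name) VALUES (input.name)
    RETURNING t.id, t.name;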
I wonder why this really common use case requires so much esoteric SQL/Postgres knowledge. While everything is well explained, the solution is not simple by any means. I don't like complex solutions to common problems.
It's not that difficult to do in the general case. Usually, you're just doing it through the program (see the sketch after the list):
- Open connection
- Begin transaction
- Do INSERT
- If you get a PK violation, do an UPDATE instead
- Commit
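A minimal sketch of that flow, written as server-side PL/pgSQL so it fits a psql transcript (the tags table is borrowed from the sessions below; a driver would do the same thing from application code):

    DO $$
    BEGIN
      INSERT INTO tags (name) VALUES ('B');
    EXCEPTION WHEN unique_violation THEN
      -- the row already exists, so UPDATE instead (with a single column
      -- this particular UPDATE is a no-op; a real table would set its
      -- non-key columns here)
      UPDATE tags SET name = 'B' WHERE name = 'B';
    END $$;

One detail worth knowing: PL/pgSQL's EXCEPTION block runs inside an implicit savepoint, which is what lets the transaction continue after the failed INSERT; application code doing the same thing needs an explicit SAVEPOINT, as discussed downthread.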
The thing is, while it's a common pattern, it has many equally common variations. Sometimes you want to try the UPDATE first. Sometimes you want the program to throw an exception if the INSERT fails. Sometimes you're doing something with row versioning or pessimistic locking.
It's the same reason why your file IO library might have a ReadAllText() method, but only after it has Open() and Read() and ReadLine().
If that's not safe, then Postgres has not implemented transactions correctly. That seems unlikely, so my guess is that what you think I'm suggesting is not what I'm actually suggesting.
there is nothing wrong with postgres transactions, they work exactly as intended, but they aren't magic. this is pretty easy to test:
Session 1:
psql (16.3 (Debian 16.3-1.pgdg120+1))
Type "help" for help.
postgres=# CREATE TABLE tags (
id INT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
name VARCHAR(50) NOT NULL
);
CREATE TABLE
postgres=# ALTER TABLE tags ADD CONSTRAINT tags_name_unique UNIQUE(name);
ALTER TABLE
postgres=# INSERT INTO tags (name) VALUES ('A'), ('B') RETURNING *;
id | name
----+------
1 | A
2 | B
(2 rows)
INSERT 0 2
postgres=# BEGIN;
BEGIN
postgres=*# INSERT INTO tags (name) VALUES ('B') ON CONFLICT DO NOTHING RETURNING *;
id | name
----+------
(0 rows)
INSERT 0 0
Session 2:
psql (16.3 (Debian 16.3-1.pgdg120+1))
Type "help" for help.
postgres=# DELETE FROM tags WHERE name = 'B';
DELETE 1
Session 1 continues:
postgres=*# SELECT * FROM tags WHERE name = 'B';
id | name
----+------
(0 rows)
postgres=*# END;
COMMIT
(yes, I added `ON CONFLICT DO NOTHING` to avoid aborting the transaction prematurely)
I did some more research. Postgres doesn't handle this gracefully: once a constraint error occurs, subsequent commands are rejected and you can't handle the error within the transaction itself, short of rolling back to a savepoint. Most RDBMSs don't require a rollback in that situation. It's weird.
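For completeness, a sketch of the savepoint escape hatch, using the same tags table as above:

    BEGIN;
    SAVEPOINT before_insert;
    INSERT INTO tags (name) VALUES ('B');    -- fails with unique_violation
    ROLLBACK TO SAVEPOINT before_insert;     -- transaction is usable again
    UPDATE tags SET name = 'B' WHERE name = 'B';
    COMMIT;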
Yeah, the error handling is a side note here. The main thing to understand is that postgresql transactions use the "read committed" isolation level by default, which means that reads will see the writes of any other committed transactions, which means there can be arbitrary changes to the data between operations, e.g. between insert and select, even if no errors or conflicts are involved. This is pretty much what the sample I posted in the parent comment demonstrates.
I believe PG will abort the transaction on an exception (PK violation), and the subsequent update will not run in the same context that had the original violation. So it could result in a data race. I don't know what isolation level would fix that, if any.
My understanding is that in general, if you hit an exception in postgres at all then you can't trust the isolation of the current transaction anymore.
That's what MERGE and speculative insertion (ON CONFLICT DO UPDATE) address.
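A sketch of the speculative-insertion form against the same tags table; unlike DO NOTHING, the (redundant) DO UPDATE means RETURNING always produces the row, whether it was just inserted or already there:

    INSERT INTO tags (name) VALUES ('B')
    ON CONFLICT (name) DO UPDATE
      SET name = EXCLUDED.name   -- no-op update so RETURNING sees the row
    RETURNING id, name;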
well, yes, you are right that an exception does abort the transaction. but even if you use something like 'on conflict do nothing' to avoid the exception, you can still get problems with concurrent writes (see my sibling comment https://news.ycombinator.com/item?id=41169638)
REPEATABLE READ or SERIALIZABLE isolation levels would help with that; as the name suggests, repeatable read ensures that the read made by the insert's constraint check is repeatable in successive SELECT (or UPDATE) statements.
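For reference, the isolation level is requested on BEGIN; a sketch with the same tags table (under these levels, conflicting concurrent writes can surface as serialization failures that the application is expected to retry):

    BEGIN ISOLATION LEVEL REPEATABLE READ;
    INSERT INTO tags (name) VALUES ('B') ON CONFLICT (name) DO NOTHING;
    SELECT id FROM tags WHERE name = 'B';  -- reads from the transaction's snapshot
    COMMIT;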
Most RDBMSs don't behave this way. They allow you to correct exceptions that occur for anything less than a deadlock without a total rollback, but not Postgres.
Oh, I've been doing the `on conflict do update set id = EXCLUDED.id` thing for ages now. I thought the update would be a no-op because nothing changed. Interesting.
> The suppress_redundant_updates_trigger function, when applied as a row-level BEFORE UPDATE trigger, will prevent any update that does not actually change the data in the row from taking place. This overrides the normal behavior which always performs a physical row update regardless of whether or not the data has changed.
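The function ships with PostgreSQL and is attached per table as a regular trigger; a sketch for the tags table from upthread (the trigger name is arbitrary):

    CREATE TRIGGER tags_skip_noop_updates
    BEFORE UPDATE ON tags
    FOR EACH ROW
    EXECUTE FUNCTION suppress_redundant_updates_trigger();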
This is super informative, thank you. A while ago I did a small OSS project around postgresql to implement a simple document store with Kotlin and a low-level, coroutine-friendly postgresql driver called jasync-postgresql that doesn't do blocking IO. I ended up using a few of the other solutions mentioned in the article. But based on this, I might want to upgrade it to use MERGE.
My intention with this project is to use it for storage and transactions. Most of our query-heavy lifting is done via external search engines like Elasticsearch. And my attitude with databases is that if you are not querying on it, it doesn't need separate tables, columns, and indices. The point of a document store is that you only use limited querying strategies. I've used document stores for various projects for over ten years, and that style of persistence just fits a lot of the things I do.
I'm currently planning to roll this out for one of our products, so this is not just a hobby project. If somebody has time, I wouldn't mind some eyeballs on it: https://github.com/formation-res/pg-docstore
I do wonder why, in the first example of inserting multiple tags with get_or_create_tag, the id of 'C' suddenly changes to 4. Were some queries perhaps not run in the order they appear?
I got the same impression, that the demo wasn't implemented as a single "clean" run. Another indication of this is that the next passage, about eliminating "bloat" from failed inserts, shows the table containing two live tuples again, not three or four as it should after the preceding example.
Since at least two people noticed, I would just say that the insert-select-insert approach was added later, after I took some inspiration from Django's implementation of get_or_create.