On a related note, a difference in Postgres is that both Repeatable Read and the higher Serializable isolation levels use Snapshot Isolation under the hood.
I've been trying to grok this stuff for a few years, and it's good to keep in mind that implementations of the various isolation levels can vary significantly between vendors.
I've found the resources on Jepsen very helpful as well if you want to go down a rabbit hole :)
The problem with all of these examples is that they use a model of Snapshot Isolation which guarantees writes don’t conflict, but doesn’t provide a similar guarantee for reads. What you really want is the pair of constraints:
- Anything I write hasn’t been concurrently written to by another transaction
- (added): Anything I read also hasn’t been concurrently written to by another transaction
Adding “conflicting reads” fixes every single example in this post. The linked list example can be fixed in any number of ways. The most efficient might be to “spread out” the set of conflicting keys to also include the next / prev pointers of the deleted items b and c. This makes the transactions conflict. Or maybe better - include all the next pointers of all items before the modified item. This adds an implicit guarantee that the visited / modified item must be in the list for the transaction to be successfully committed.
Now, adding reads to the conflict set might change SI into a different concurrency model. The blog post does use the same definition as Wikipedia. And that definition only mentions writes. But I think any SI system can be made to work this way by setting X = X for any value read during each transaction.
The “two threads” example becomes this:
beginTxn();
if (x == 0 && y == 0) x = 1;
x = x; y = y; // implicit
endTxn();
beginTxn();
if (x == 0 && y == 0) y = 1;
x = x; y = y; // implicit
endTxn();
And the two transactions will function correctly.
Having the database handle this for you is best because:
- Humans will forget
- If two transactions read some value but neither transaction writes to it, the two transactions don’t actually need to conflict; the manual X = X trick would force a conflict anyway, which is too strict.
FoundationDB, in my opinion, has the best API I’ve seen for this. Reads and writes within a transaction are tracked like this automatically, and you can manually add, remove or change the set of conflicting keys using an explicit API if you want to. But I’m sure it’s not alone.
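A minimal sketch of what that looks like with the FoundationDB Python bindings; the key names and the remove_node helper are made up for illustration:

import fdb

fdb.api_version(710)
db = fdb.open()

@fdb.transactional
def remove_node(tr, node_key, prev_key, next_key):
    # An ordinary read registers a read conflict: if another transaction
    # writes this key before we commit, our commit fails and is retried.
    node = tr[node_key]

    # A snapshot read does NOT register a read conflict...
    prev = tr.snapshot[prev_key]
    # ...so if correctness depends on it, add the conflict back explicitly.
    tr.add_read_conflict_key(prev_key)

    # You can also declare a write conflict on a key you never actually
    # write, e.g. a neighbour's pointer, to widen the conflict set.
    tr.add_write_conflict_key(next_key)

    del tr[node_key]

remove_node(db, b'list/b', b'list/a', b'list/c')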
There's a typo: clearly the second transaction is removing C, not D as stated.
> The first transaction modifies fields in records A and C, while the second transaction modifies fields in records B and D.
> Under Snapshot Isolation both transactions will commit, which means that the doubly linked list will be left in an inconsistent state
That's not how the DB we use works[1]. Both transactions will run, but as soon as either touches a row the other has already modified it will raise an "Update conflict on snapshot transaction" error. If the first transaction races the second, the second will be blocked on the row's write lock until the first is committed, and then raise the same error.
And yes, I verified this just now, and I verified the connections are running snapshot isolation.
So does SQL Anywhere not implement the standard way?
> Both transactions will run, but as soon as either touches a row the other has already modified […]
The whole point of this scenario is that this never happens. The two writes touch unrelated sets of rows. There’s no point where they both try to modify the same row and a conflict is observed.
I was thinking of our DB, so when I read "remove", I assumed actual deletion. Of course you're right: if you're not actually removing B and C, there's no conflict.
If you actually remove B and C in the transaction, i.e. "deallocate" them either by deleting the rows or by setting prev/next/data to null, you'll get the conflict I mentioned.
I really like working with snapshot isolation and rarely have any problems with it, but you do have to be aware of what guarantees it gives and does not give.
In the example given in the article, there would not be a problem if the links on the removed node were reset (which is cleaner imho), as then an update conflict would be triggered.
When using snapshot isolation, I sometimes implement ‘dummy’ updates (using a version field, for example) on ‘root’ objects to trigger update conflicts when concurrent updates to data linked to that same ‘root’ object must be prevented. (This is similar to implementing optimistic locking using a version field on a root object.)
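A minimal sketch of that pattern, assuming Postgres with psycopg2; the orders (root) / order_lines (child) tables and column names are made up:

import psycopg2

conn = psycopg2.connect("dbname=app")
conn.set_session(isolation_level="REPEATABLE READ")  # snapshot isolation in Postgres

with conn, conn.cursor() as cur:
    # Dummy update: bump the version on the root row. Any concurrent
    # transaction doing the same bump now write-conflicts with us, even
    # though the real changes below touch disjoint child rows.
    cur.execute("UPDATE orders SET version = version + 1 WHERE id = %s", (42,))
    # The actual change to the linked data.
    cur.execute("DELETE FROM order_lines WHERE order_id = %s AND line_no = %s", (42, 3))
# One of two concurrent transactions will fail with a serialization error
# on the root row and can be retried.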
This is a lot like how CouchDB handles MVCC. There is a `_rev` field that represents some specific mutation of a document, old revisions stay available until compaction, and you will receive an error if you attempt to write to a document with a different revision than you read.
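Roughly like this, as an illustration only (assuming a local CouchDB with a hypothetical "inventory" database and "widget-1" document):

import requests

url = "http://localhost:5984/inventory/widget-1"

doc = requests.get(url).json()           # includes the current _rev
doc["count"] = doc.get("count", 0) + 1

resp = requests.put(url, json=doc)       # the write carries the _rev we read
if resp.status_code == 409:
    # Someone committed a newer revision since our read:
    # re-fetch to get the fresh _rev and retry the update.
    print("update conflict")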
Snapshot isolation, after all, is basically a method for implementing MVCC. I guess it's no big surprise that this isolation level is problematic for people who don't implement the other half.
It’s the ABA problem that also comes up with normal multithreading when trying to go lock free. Lock free data structures are hard. SI is giving you the same guarantees (your update is atomic, but things might have changed between when you read and when you write). If you can handle this extra complexity, all good, but it’s generally something that needs to be carefully abstracted. Nobody wants to be thinking this hard every time they have to interact with a DB.
MSSQL on Azure also has READ_COMMITTED_SNAPSHOT (a database-level setting, default ON on Azure SQL Database, default OFF on SQL Server <<wtf?>>). When ON, it transparently changes the explicitly specified "READ COMMITTED" isolation level in your query/transaction to "READ COMMITTED SNAPSHIT" (genuine typo made here, decided to not correct) instead. Good luck fixing your prod incident when locally you're using the mssql docker image (default OFF).
If you want to say "just use their edge mssql container image", yeah man – mssql/Azure SQL Database/the mssql docker image all support JSON_PATH_EXISTS, while the edge sql docker image doesn't, so that's it for better compatibility.
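A quick way to check which behaviour a given database actually has for the READ_COMMITTED_SNAPSHOT gotcha above (a sketch assuming pyodbc and your own connection string; the ALTER is only needed if you want a local instance to match the Azure default):

import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};SERVER=localhost;DATABASE=app;"
    "UID=sa;PWD=<password>;TrustServerCertificate=yes"
)
row = conn.execute(
    "SELECT is_read_committed_snapshot_on FROM sys.databases WHERE name = DB_NAME()"
).fetchone()
print("READ_COMMITTED_SNAPSHOT is", "ON" if row[0] else "OFF")

# To flip a local instance to the Azure SQL Database default:
#   ALTER DATABASE app SET READ_COMMITTED_SNAPSHOT ON;  -- needs exclusive access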
It’s enough, just make sure you’re not trying to fly an airplane or prevent the detonation of nuclear weapons using a database, it’s the wrong tool for the job.
If you’re storing doubly linked lists in a DB you’re doing it wrong.
Updating a doubly linked list can be done at about 200 million ops/sec, single threaded. I'm not sure why you need multiple threads updating the list at the same time; what exactly are you doing that can't be solved by putting a ring buffer in front of a single thread that updates the values? It needs no locks and is cache coherent.
> It’s enough, just make sure you’re not trying to fly an airplane or prevent the detonation of nuclear weapons using a database, it’s the wrong tool for the job.
Your intuition should be the opposite. You should always reach for correctness first and only sacrifice it after careful consideration when performance demands require it.
>It’s enough, just make sure you’re not trying to fly an airplane or prevent the detonation of nuclear weapons
My gripe with this kind of argument is that today, you aren't. You can write application-side duct tape to deal with any kind of wonky database situation, but the whole point of having a database with strong and easy-to-reason-about guarantees is that it makes future development easier.
OP provides even simpler examples than doubly linked lists. It made me realize how pg's default isolation level is actually quite a nice footgun, and I reread the docs about the various isolation levels much more carefully.
> If you’re storing doubly linked lists in a DB you’re doing it wrong.
This was my reaction on finding that TFA's key example is a doubly-linked list. I've never implemented any kind of linked list in a database, nor have I ever come across someone else's schema that involved linked lists. The kinds of operations you do on linked lists (traverse, insert, append, delete) all involve sequences of row accesses that can't be (conveniently?) described in SQL, so they have to be expressed as multiple distinct accesses.
More generally, I'm suspicious of a schema in which a table contains a "foreign" key to itself. Foreign keys are keys to other tables.
I haven't given it any thought; but could there have been a better example? Or is updating a doubly-linked list the best illustration of why SI is dangerous?
Yes, this. Because of this, I saw substantial data corruption that had built up over years in live medical databases: The on-disk B+trees had corrupted internal links due to incorrect handling of concurrent updates and rollbacks. This made some queries return incorrect results.
The developers had been writing messy heuristic workarounds for database low-level corruption in application code for over a decade, instead of figuring out the cause and fixing that.
For example, application code had special routines to attempt to detect and ignore duplicate records in some queries (but not others), in response to customer bug reports, but of course that didn't fix missing records or make the data correct. It just patched over an observed symptom. It also didn't fix all the places it could happen, so much of the application still had occasional flaky behaviour, but not consistently enough to be reported in detail and get the (bad) workarounds to hide it.
Correctly implemented transactions are very helpful for database internals and on-disk structures too, not just for the API level presented to applications.
So don't use it for writes. Snapshot is good for analysis. Let's say you want to show the user a report of their sales by category, with some additional detail. You want to load those things as of the same cut-off time, because otherwise one of the detail queries could disagree with the summary query. This is what people expect in their results.
If you really need a query for modifying data that depends on itself, you should read it exclusively, i.e. serializable.
If you don't need your snapshot read-only queries to be linearizable, then you don't need MVCC for this: just take a consistent checkpoint every second or minute or hour. A simple way to implement low-frequency snapshots is what Redis made famous: fork the server process so all future updates in the parent are to CoW pages and read the snapshot from the child. (HyPer did this originally but I'm not sure they still do.)
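As a toy illustration of the fork trick (not HyPer's or Redis's actual code, just the mechanism, and Unix-only):

import os

data = {"x": 0, "y": 0}      # stand-in for the in-memory database

def snapshot_report():
    pid = os.fork()
    if pid == 0:
        # Child: sees a copy-on-write image of memory frozen at fork time,
        # so it can serve read-only queries without blocking the parent.
        print("snapshot:", dict(data))
        os._exit(0)
    return pid

pid = snapshot_report()
data["x"] = 1                # parent keeps applying writes; CoW keeps the child's view intact
os.waitpid(pid, 0)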
I think this approach makes more sense than MVCC when linearizable abort-free read-only transactions aren't really necessary: you can avoid a lot of complexity and a lot of overhead by not retaining old versions.
I was referring to using recent database snapshots for read-only queries. HyPer used the same snapshotting technique as Redis backups for this purpose:
Ok, thanks. I was wondering because last year I spent a few hours poring through Redis documentation, trying to figure out how to get access to read-only snapshots...
Good luck troubleshooting why your data is subtly (or not-so-subtly) corrupt when the database transactions that started the problem occurred months previously, because the guarantees were too loose/not well-understood.
https://techcommunity.microsoft.com/t5/sql-server-blog/seria...