
How Postgres Makes Transactions Atomic - craigkerstiens
https://brandur.org/postgres-atomicity
======
brandur
Author here with just a quick note that writing this was a pretty great
learning experience. Postgres has always been the ultimate black box — it does
some amazing things, but I generally have no idea how.

I read a lot of Postgres code to get this finished, and I'm happy to say that
for a codebase with so many authors, the quality of Postgres' code is very
high.
Like any complex program it can be a little hard to trace through, but most of
the naming is pretty self-explanatory, and it comes with some really amazing
comments that walk you through the high-level design. I'd definitely recommend
taking a look if anyone is curious.

~~~
ralusek
Thanks for the writeup, it's at the necessary level of depth to convey the
important concepts. I didn't have time to read through all of it yet, so maybe
this is addressed, but there has always been something about transactions that
I don't understand. How do they accomplish queries that rely on indices within
a transaction? If they're not keeping a copy of the index for that
transaction, but you wanted to do a sorted query that may or may not include
some of your newly inserted rows prior to committing, do they simply do the
query without the use of indexes? Do they do the query with indexes to create
a temporary table, then run it again non-indexed on the temporary table
including your new records? It would seem like that would have many edge cases
to address.

~~~
felixge
Not OP, but I think the answer is that PostgreSQL updates the index right
away, and the index entry points to the heap tuple (and its visibility
information). This index value can already be accessed by concurrent
transactions, but they decide whether it's visible by looking at the xmin/xmax
of the heap tuple it belongs to. If the transaction is rolled back, it's up to
VACUUM to remove both the dead heap tuple and its index tuples.
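
For illustration, a rough sketch you can try in psql (the accounts table and
values are made up):

    -- setup (made-up table):
    CREATE TABLE accounts (id int PRIMARY KEY, balance int);
    
    -- session 1
    BEGIN;
    INSERT INTO accounts (id, balance) VALUES (1, 100);
    SELECT txid_current();                 -- say it returns 770
    SELECT xmin, xmax FROM accounts WHERE id = 1;
    -- xmin = 770, xmax = 0: created by transaction 770, never deleted
    
    -- session 2, concurrently: the index entry for id=1 already exists,
    -- but the heap tuple it points to is invisible here until
    -- transaction 770 commits
    SELECT * FROM accounts WHERE id = 1;   -- returns no rows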

AFAIK there are also optimizations that allow queries to avoid having to look
up the individual heap tuples for visibility information by marking the entire
page on the index as visible (useful for index-only scans). Yet other
optimizations exist to not require updating indexes when updating existing
rows where none of the indexed columns change (HOT).
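
The HOT effect is visible in the stats views, by the way (sketch, same
made-up table):

    -- updates that took the HOT path didn't create new index entries
    SELECT n_tup_upd, n_tup_hot_upd
    FROM pg_stat_user_tables
    WHERE relname = 'accounts';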

Maybe somebody with a more in-depth understanding can comment, but hopefully
the above is somewhat correct :).

~~~
petergeoghegan
That is correct.

There is one small inaccuracy in your summary: a structure called the
visibility map summarizes whether a heap page is all-visible; that's how
index-only scans work. At no point do we mark an index page as all-visible,
because a HOT update would invalidate that.
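
A sketch of both pieces (the pg_visibility contrib extension is 9.6+; assumes
the made-up accounts table from above, with its index on id):

    -- the visibility map is per heap page, not per index page:
    CREATE EXTENSION pg_visibility;
    SELECT blkno, all_visible FROM pg_visibility_map('accounts');
    
    -- once VACUUM has set the bits, an index-only scan can skip the
    -- heap visibility checks ("Heap Fetches: 0" in the plan):
    VACUUM accounts;
    EXPLAIN (ANALYZE) SELECT id FROM accounts WHERE id > 0;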

~~~
felixge
Thanks for the clarification about the visibility map.

That being said, I don't understand why HOT would interfere with a
hypothetical mechanism for marking index pages as all-visible. If a tuple gets
a HOT update, I think the index values should remain unchanged? And if they
do, shouldn't their visibility on an all-visible index page remain unchanged
as well?

Unrelated: Thank you so much for your work on "UPSERT" :)

~~~
petergeoghegan
I guess that I said HOT UPDATE because you talked about HOT. What you describe
goes against how visibility information is maintained. There is an interlock
between indexes and tables for VACUUM that would have to be considered. It
would be hard to set an all-visible bit on a leaf page in a race-free manner.

A much simpler way of putting it would be that deletes would have to touch
indexes if summarized visibility info was stored in indexes, even in cases
where overall there is hardly any benefit from putting per-page visibility
information in index pages. It's not impossible to make something like this
work, but it's not obviously compatible with the current design.

I'm glad that you found UPSERT useful.

~~~
felixge
I wasn't really proposing to change how visibility information is maintained,
my grasp on the internals is far too weak for that :). I just tried to make
sure I understood your comment, which I think I do now. So thanks again for
another insightful reply :)

------
munro
> [https://brandur.org/assets/postgres-atomicity/heap-tuple-visibility.svg](https://brandur.org/assets/postgres-atomicity/heap-tuple-visibility.svg)

These diagrams are really beautiful, I love the subtle separation in the lines
showing that they're made of hyphens. Is there some software that makes these
easy to generate?

~~~
jdelaney
I don't know if this is what the author used but I use Monodraw[0] to make
diagrams like this.

[0] [https://monodraw.helftone.com/](https://monodraw.helftone.com/)

~~~
willglynn
I had the same question and dug around in the repo to find out. It is indeed
Monodraw:

[https://github.com/brandur/sorg/tree/master/content/raws/postgres-atomicity](https://github.com/brandur/sorg/tree/master/content/raws/postgres-atomicity)

I'm excited to use this tool.

~~~
pritambaral
I wish I could use it on Linux.

------
pselbert
This is precisely the kind of article I hope for when I browse the front page.

Thank you for all the work you put into writing this. It is clearly the
product of a lot of effort and craft.

------
aneutron
As a beginner in the software engineering realm, I cannot thank people who
give these kinds of write-ups enough. They help me explore patterns and
beautiful pieces of engineering. On a side note, I wonder if the functions are
really named like that or whether the author renamed some for clarity.

~~~
aidos
Postgres has a super readable codebase and the functions in the article seem
to come directly from the code. When you understand the general model of how
postgres works it's fairly easy to browse the code. Having said that, I
haven't looked at the deeper structures at all before and have mostly just
perused the query planner / executor code.

We were looking through it earlier to see why our nested loop nodes were
performing badly — which ultimately made us realise we were just misreading
the EXPLAIN output :-)
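
(The classic way to misread EXPLAIN ANALYZE, in case it helps anyone else:
the per-node "actual time" and "rows" figures are averaged per loop, so an
inner node's numbers have to be multiplied by its loops count. A sketch with
made-up numbers:)

    ->  Index Scan using accounts_pkey on accounts
          (actual time=0.011..0.012 rows=1 loops=50000)
    -- ~0.012 ms is per iteration; the node really takes ~0.012 * 50000 ms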

------
combatentropy
Thank you for the article, but I do have one quibble:

    
    
      > We could ensure that any process takes out an exclusive lock
      > on a file before reading or writing it, or we could push all
      > operations through a single flow control point so that they
      > only run one at a time. Not only are these workarounds slow,
      > but they won’t scale up to allow us to make our database fully
      > ACID-compliant
    

As much as I like Postgres, I also like SQLite, which uses a simple file lock
and yet is utterly acidic
([https://www.sqlite.org/transactional.html](https://www.sqlite.org/transactional.html)).
From SQLite's FAQ:

    
    
      > SQLite allows multiple processes to have the database file
      > open at once, and for multiple processes to read the database
      > at once. When any process wants to write, it must lock the
      > entire database file for the duration of its update. But that
      > normally only takes a few milliseconds. . . .
      >
      > If your application has a need for a lot of concurrency, then
      > you should consider using a client/server database. But
      > experience suggests that most applications need much less
      > concurrency than their designers imagine.
    

\--- [https://www.sqlite.org/faq.html#q5](https://www.sqlite.org/faq.html#q5)

~~~
felixge
AFAIK SQLite doesn't achieve atomicity by virtue of its global write lock. It
does so by using a rollback journal (default) or WAL.

So it's not really all that different from PostgreSQL in that regard and the
author's point remains valid :).
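
You can check which one a given database uses from the sqlite3 shell (sketch):

    sqlite> PRAGMA journal_mode;       -- "delete" = rollback journal (default)
    delete
    sqlite> PRAGMA journal_mode=WAL;   -- switch to write-ahead logging
    wal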

You might also enjoy this talk by the main author of SQLite:
[https://www.youtube.com/watch?v=ZvmMzI0X7fE](https://www.youtube.com/watch?v=ZvmMzI0X7fE)
(SQLite: Protégé of PostgreSQL)

~~~
combatentropy
The paragraph I quoted is talking about two things, one right after the other:
concurrency and acidity.

First, it talks about concurrency. It says that a naive solution to
concurrency is file locking. It says this is slow. So I first trot out SQLite
as an example that is not slow.

It says, "Not only are these workarounds slow, but they won't scale up to
allow us to make our database fully ACID-compliant." To me it sounded like he
said choosing file-locking for concurrency will paint you into a corner, that
you won't ever become ACID, regardless of what you tack on, like a journal.

One other clarification about concurrency: SQLite says it's fast enough for
most websites. Then it turns around and says that if you need a lot of
concurrency, you should go with another database. The contradiction evaporated
when it dawned on me that when database makers say "concurrency," they're
talking about many users at a time connecting to a database directly: picture
a bunch of SQL gurus connecting directly to an SQLite file and issuing raw SQL
statements. Or replace the gurus with several "application" servers that all
write to a database on yet another server. For that, SQLite would be bad, we
all agree, including SQLite.

But that's not what most of us, the 99% of app developers, are thinking of
when we hear "concurrency." We're thinking of several people hitting our
website. In that case, SQLite would still hold up
([https://sqlite.org/whentouse.html](https://sqlite.org/whentouse.html), see
_Server-side database_ ).

Anyway, again, I'm not against Postgres. It's what I use. I'm not talking
about SQLite because I'm invested in it but because I admire it. It's just an
example of a simple solution that meets many people's needs, without resorting
to more complexity.

~~~
kuschku
> But that's not what most of us, the 99% of app developers, are thinking of
> when we hear "concurrency." We're thinking of several people hitting our
> website. In that case, SQLite would still hold up
> ([https://sqlite.org/whentouse.html](https://sqlite.org/whentouse.html), see
> Server-side database).

It doesn’t. At all.

The quassel project supports PostgreSQL and SQLite as backends.

And, as expected when you use a database as log storage for an IRC bouncer,
you end up with many concurrent threads reading from and writing to the same
table.

The result is that if a user is reading a table right now, no one can write to
it - and that, in turn, means that the thread receiving messages ( _and
responding to PINGs, which is required to keep the connection to the IRC
server alive!_ ) gets blocked, and has to wait on the threads currently
requesting messages.

So if you set your client to request the last 500 messages for each channel
on connection to the bouncer, you might actually time out. Yes: when you have
enough data, connecting to the bouncer can kill the bouncer's other
connections.

If you have more than one thread doing writes or reads – in a usual web
application you’ll have dozens or hundreds of worker threads reading and
writing concurrently – then SQLite just isn’t useful.

Disclaimer: Not speaking for the quassel project here. This comment is just my
opinion.

~~~
combatentropy

      > if a user is reading a table right now,
      > no one can write to it
    

In 2010 SQLite came out with a mode that keeps the writer from blocking
readers: [https://www.sqlite.org/wal.html](https://www.sqlite.org/wal.html)

    
    
      > > 99% [...] SQLite would still hold up
      > an IRC bouncer
    

I would consider a chat client one of the 1%. Although there are many chat
clients, and other examples that we could think of, there are thousands of
websites and apps of the "small to medium" size that SQLite might be a good
fit for.

[https://www.sqlite.org/features.html](https://www.sqlite.org/features.html)

~~~
kuschku
> I would consider a chat client one of the 1%. Although there are many chat
> clients, and other examples that we could think of, there are thousands of
> websites and apps of the "small to medium" size that SQLite might be a good
> fit for.

The issue with those is still that if you have too high a write load, SQLite
becomes a bad fit. It's perfect for low-write, high-read situations, as many
websites are, but for more dynamic applications it seems to be less usable.

Also, SQLite's FAQ still says:

> SQLite supports an unlimited number of simultaneous readers, but it will
> only allow one writer at any instant in time. For many situations, this is
> not a problem. Writers queue up. Each application does its database work
> quickly and moves on, and no lock lasts for more than a few dozen
> milliseconds. But there are some applications that require more concurrency,
> and those applications may need to seek a different solution.

------
sgt
_Postgres is entirely designed around the idea of durability, which dictates
that even in extreme events like a crash or power loss, committed transactions
should stay committed._

Although if I had $100 for every time I've had DB corruption in Postgres
over the years...

That being said, since 9.4 (or maybe 9.5) these incidents have mostly stopped
happening and it's been remarkably stable.

~~~
orthecreedence
Hi, very curious under what situations you had data corruption. I use postgres
a lot and haven't experienced (or noticed) this. I'm wondering if I'm using it
differently than you or if I've been lucky =]

~~~
sgt
Most of our cases have been standby servers having corrupted WAL files, to be
discovered only when the master has had a physical disk crash or similar.

After a few tough lessons like that we've put monitoring in place to ensure
that the standby and/or PITR servers are actually tested, e.g. PITR servers
must have uninterrupted WAL sequences.
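
(For anyone setting this up: WAL segment file names are sequential, so a gap
in the archive directory means a broken PITR chain. A minimal sketch with
made-up paths, based on the documentation's archive_command example:)

    # postgresql.conf on the master
    archive_mode = on
    archive_command = 'test ! -f /mnt/archive/%f && cp %p /mnt/archive/%f'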

Most Postgres corruption (on a master) actually happens due to dodgy hardware,
but we don't really use bad hardware. In cases where we used VMs we've
suspected something fishy in VMware itself. In one case we suspected the SAN
that the VMware host was using.

So mostly not Postgres' fault. There were however some known replication bugs
in the 9.x versions that were quite nasty. I think those bugs have now all or
mostly been fixed, so 9.6 and 10 (beta) onwards are extremely stable.

But the risk of HW issues still remains the biggest cause of Postgres
corruption.

I cannot emphasize enough how important backups are (both standby servers and
PITR) and also to verify that these are indeed valid for that day when you're
going to need to use them.

~~~
orthecreedence
Great response, thank you. Seems mostly replication-related, which admittedly
I haven't delved into much. I've used postgres to hold a lot of data, but
haven't gotten to the point where it needed to scale out yet. This is great to
know for when that starts to be a concern. I'll definitely heed your advice on
backups (already do them, but could put more automation into verifying they
work).

------
christophilus
"Every transaction in the database is applied in serial order, with a global
lock ensuring that only one is being confirmed committed or aborted at a
time." Is this true? I assumed you could transact in parallel, if there was no
conflict, e.g. different tables were involved or maybe even different rows.

~~~
dullgiulio
The confirmation for non-conflicting transactions is a very fast operation,
so it's not a bottleneck to protect it with a global lock. The whole process
of writing the transaction happens in parallel; you just end up with written
data that might be discarded if the final commit fails. I don't think you can
tell that two transactions do not conflict without holding a lock on the
global state.
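
Roughly, with a made-up table and two psql sessions: the writes below proceed
in parallel, and only the moment where each transaction is marked committed
is serialized behind that lock:

    -- session 1:
    BEGIN;
    UPDATE accounts SET balance = balance - 10 WHERE id = 1;
    
    -- session 2, at the same time (doesn't block on session 1):
    BEGIN;
    UPDATE accounts SET balance = balance + 10 WHERE id = 2;
    COMMIT;
    
    -- session 1:
    COMMIT;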

------
andrewshatnyy
Those diagrams are badass! Anyone know what was used to make 'em?

~~~
arnarbi
Not OP, but I like
[https://monodraw.helftone.com/](https://monodraw.helftone.com/)

There are a few online variants, e.g.
[http://asciiflow.com/](http://asciiflow.com/) and
[https://textik.com/](https://textik.com/)

For extra credit use these fonts:
[https://int10h.org/oldschool-pc-fonts/fontlist/](https://int10h.org/oldschool-pc-fonts/fontlist/)

------
bullen
This is a non-problem IMO. Why would you ever need to write the same data from
two different sources simultaneously? What is the use case?

------
exabrial
Wait, they're -not- atomic? This surprises me because of the number of
comparisons between MySQL and Postgres... Comparing apples to oranges

~~~
lightbyte
You misread the title and didn't see the "How" (I did as well).

~~~
exabrial
Yep! I did, ooops

~~~
exabrial
Ok Everyone, I realized my mistake, no need to keep downvoting, sheesh

~~~
kuschku
This is why sometimes the ability to edit a comment later on (and if it would
only be additive editing) would be useful, to correct mistakes made.

