How long will a 64 bit Transaction-ID last in PostgreSQL? (scherbaum.la)
101 points by vinayan3 46 days ago | 32 comments

There's no chance we go for 64bit transaction ids on the tuples themselves - the space increase would be far too big. The overhead of tuple headers is already a problem, and xmin/xmax are a significant portion of that.

There were patches, however, that kept an 'epoch' (the upper 32 bits of a 64-bit transaction id) at the page level, plus some rewrite logic for when transactions too far apart to be represented as offsets from a base epoch are about to be present on one page. That would effectively allow 64-bit xids.

The in-development zheap storage engine does something roughly akin to that, removing the need to perform freezing when a table becomes older than ~2^31 minus safety-window transactions.

The transaction id that the system tracks internally is effectively already 64-bit, albeit in a somewhat over-complicated manner, because the epoch is tracked separately (there's a patch likely to land in the next version that just goes to 64 bits there). That's why you can see e.g. txid_current() return 64-bit transaction ids.
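
A minimal sketch (not PostgreSQL source) of how a separately tracked epoch plus a 32-bit xid combine into the single 64-bit value that txid_current() reports — the epoch simply occupies the upper 32 bits:

```python
# Sketch: combining a 32-bit epoch counter with a 32-bit xid into one
# 64-bit transaction id, and splitting it back apart.
def full_xid(epoch: int, xid32: int) -> int:
    """Pack a 32-bit epoch and a 32-bit xid into one 64-bit id."""
    assert 0 <= epoch < 2**32 and 0 <= xid32 < 2**32
    return (epoch << 32) | xid32

def split_xid(xid64: int) -> tuple[int, int]:
    """Inverse: recover (epoch, 32-bit xid) from a 64-bit transaction id."""
    return xid64 >> 32, xid64 & 0xFFFFFFFF

assert split_xid(full_xid(3, 123_456)) == (3, 123_456)
```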

Why write the transaction ids in the tuples at all?

In most production cases, transactions in flight are going to affect a small number of rows overall, so you could just keep the data in memory, and spill it to disk in a separate table if it gets large.

You need to access that data from different connections, so it needs to be correctly locked etc. Looking purely at the tuple, you need to know where to look for the tuple visibility information. Accessing data stored in some datastructure off to the side will also have drastically worse cache locality than just storing it alongside the data. E.g. for a sequential scan these checks need to be done for every tuple, so they really need to be cheap.

Except in most usecases, the tuple is old enough that it was committed long ago and visible to everyone.

It's only a tiny fraction of tuples that are recently committed, where visibility rules come into play. That can be a special-cased slow path.

In a system like postgres' current heap you cannot know whether it was committed long ago without actually looking in that side table (or modifying the page the one time you do, to set a hint bit). You pretty fundamentally need something like the transaction id to do so.

Also, in OLTP workloads you often have a set of hotly modified data, where you then very commonly access recently modified tuples and thus need to do visibility checks.

There's obviously systems with different visibility architectures (either by reducing the types of concurrency allowed, using page level information about recency of modification + something undo based), but given this post is about postgres, I fail to see what you're arguing about here.

The transaction ID per tuple is a core piece of data used in MVCC, the transaction management protocol underlying PostgreSQL in its current form.

Isn’t 64 bit, even duplicated, completely irrelevant to how much data is generally stored in a row?

Depends on your data. If you have a link table that just contains two 4 byte integers, then yes, that's pretty significant overhead. Even if you have a few hundred bytes per row, it's still not entirely negligible.
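
Some rough arithmetic on that point, assuming the current per-tuple header is about 24 bytes after alignment (with xmin/xmax at 4 bytes each today), so widening both to 8 bytes adds 8 bytes per tuple:

```python
# Rough overhead arithmetic (assumed sizes: ~24-byte aligned tuple
# header, xmin/xmax at 4 bytes each today -> widening adds 8 bytes).
HEADER = 24           # approximate current per-tuple header, bytes
WIDENED = HEADER + 8  # xmin/xmax at 8 bytes each

def overhead_pct(payload_bytes: int) -> float:
    """Extra space, in percent, from widening xmin/xmax to 64 bits."""
    return 100 * (WIDENED - HEADER) / (HEADER + payload_bytes)

print(overhead_pct(8))    # link table of two 4-byte ints: 25% bigger
print(overhead_pct(400))  # a few hundred bytes per row: still ~1.9%
```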

Any overhead is generally considered irrelevant when you’re on a small database. For example, going from 50GB in a table to 75GB is 50% overhead, but easily handled in a time where you just pay for more GB on RDS.

That doesn’t hold true when working on multi-terabyte tables or databases when physical disk space in a chassis is actually a boundary you have to consider.

> Any overhead is generally considered irrelevant when you’re on a small database. For example, going from 50GB in a table to 75GB is 50% overhead, but easily handled in a time where you just pay for more GB on RDS.

Often disk-space is less the issue, and it's more about whether things can fit in memory or not...

This seems to be (with coincidental timing) the cause of Mandrill’s current outage [1]:

> Mandrill uses a sharded Postgres setup as one of our main datastores. On Sunday, February 3, at 10:30pm EST, 1 of our 5 physical Postgres instances saw a significant spike in writes. The spike in writes triggered a Transaction ID Wraparound issue. When this occurs, database activity is completely halted. The database sets itself in read-only mode until offline maintenance (known as vacuuming) can occur.

> The database is large—running the vacuum process takes a significant amount of time and resources, and there’s no clear way to track progress.

[1] https://news.ycombinator.com/item?id=19084525

It is probably too late now, but for later reference: when this talk is up, check it out; it has some answers.


The author talks a bit about the architecture of PostgreSQL transactions, touching on lazy transaction ID consumption and vacuuming. Notably, writes require IDs but reads do not. So this is focused on write-optimized workloads.

If you want the basic tl;dr which answers the headline: these IDs will last so long it's almost not worth quantifying. This is an obvious calculation even if you assume ostentatious performance requirements three orders of magnitude greater than the author's:

    2^64 / (86,400 * 1,000,000,000) = 213,503.9
The author uses 1,000,000 writes/second; I prefer 1,000,000,000 since it's more ridiculous. There are 86,400 seconds in a day. It will take you the better part of a millennium to exhaust those IDs, assuming you consume an average of one billion every single second.
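
The back-of-the-envelope above, written out (86,400 seconds per day at a deliberately absurd one billion transactions per second):

```python
# Days to burn through 2**64 transaction ids at 1B xids/second.
SECONDS_PER_DAY = 86_400
RATE = 1_000_000_000  # xids consumed per second

days = 2**64 / (SECONDS_PER_DAY * RATE)
years = days / 365.25
print(f"{days:,.1f} days = {years:,.1f} years")  # ~213,504 days, ~584 years
```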

The author didn’t talk about collisions, but those are worth mentioning because you could even confidently assign these randomly instead of incrementally. Since a collision will occur (in expectation) after 2^63 transactions, you shouldn’t even have to worry about a single one occurring (on average) for almost 300 years.

Of course, using 64-bit IDs comes with a nontrivial space increase: the xmin/xmax fields in every single tuple header double in size, adding 8 bytes per tuple.

EDIT: Original collision estimate is wrong, see corrections. I took (2^n)/2 = 2^(n-1) as the birthday bound instead of 2^(n/2).

Actually you'd expect a collision with 50% probability after only a much smaller fraction of the 2^64 space. This is the birthday paradox, and unfortunately I can't find a calculator or software at the moment that can handle the enormous factorials needed to compute it exactly for 2^64.

Square rooting will get you in the proper ballpark. I imagine that's why UUIDs are 128-bit values.

Correct on both counts.

Back of the envelope math for birthday paradox is root n. So in this case with ~50% probability you'll get a collision after 2^32 IDs (or after 4 seconds of 1 billion writes a second)
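
The square-root approximation above can be tightened slightly: from p ≈ 1 − exp(−k²/2N), the 50% point is at k ≈ 1.177·√N rather than exactly √N, which for N = 2^64 is still only about five billion draws:

```python
import math

# Birthday bound: with an N-sized id space and p = 1 - exp(-k**2 / (2N)),
# the number of random draws for a collision probability p is
# k = sqrt(2 * N * ln(1 / (1 - p))); at p = 0.5 this is ~1.177 * sqrt(N).
def draws_for_collision(bits: int, p: float = 0.5) -> float:
    space = 2.0 ** bits
    return math.sqrt(2 * space * math.log(1 / (1 - p)))

k = draws_for_collision(64)
print(f"{k:.3e} draws")                    # ~5.1e9, on the order of 2**32
print(k / 1_000_000_000, "s at 1B/sec")    # ~5 seconds
```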

Ah, good point. I'm aware of the birthday paradox but I took (2^n)/2 = 2^(n-1) instead of 2^(n/2) as the birthday bound. Nice correction :)

Another way of looking at it: CPUs run at about 3 GHz; assume you can increment your 64-bit variable once per cycle (extremely optimistic); it will still take a ludicrous amount of time to overflow the 64-bit variable. I find this a good rough intuition for an upper bound on writes/sec. (Your 1 billion writes/sec figure works out to more or less the same assumption, at 1 GHz.) The conclusion is the same, of course.

My admittedly naive understanding of transaction IDs and MVCC is that they can’t be random because transactions are ordered and (depending on your isolation level) that ordering controls what’s visible inside a given transaction. Generally speaking a transaction can’t see any rows with a transaction ID greater than its own transaction ID.

The transaction ordering you mentioned is sufficient if transactions are always retired in the order they begin, but it's not a necessary condition for even the highest transaction isolation levels. However, I don't think anyone has discovered a trick to getting high concurrent performance out of a database that strictly retires transactions in chronological order.

At the highest transaction isolation levels, you need to be able to perform a topological sort of the transaction dependency graph. That does require that the graph is acyclic, but doesn't preclude the DB engine from pretending a transaction that started later actually started earlier (or even had its first write chronologically earlier). For full transaction isolation, the DB engine just needs to be able to pretend transactions happened in a linear order, but that order isn't dictated by the chronological order of the first operation of each transaction. (It's not even constrained by the chronological order of the first write operation of each transaction in DBs that only track write-write conflicts.)

The easiest way to keep track of this consistency is some kind of monotonically increasing counter "timestamp" (or something like a vector of them in an asynchronous / distributed system ... Lamport vector clocks or similar), but this timestamp doesn't need to be identical to a transaction ID, and doesn't have to be unique. It's possible that most databases make unique transaction IDs moonlight as timestamps, but it's not a fundamental constraint.

We can't use transaction ids for that however, because they're necessarily assigned by the time a transaction starts to write, whereas visibility is determined by the time transactions commit. Snapshots, the datastructure that determines which transaction's writes ought to be visible and which not, use transaction ids as cutoff values however, to make visibility determinations cheaper.

A second major reason we'd not want to go for randomness is that we need to store data for each transaction id, namely whether it committed or not. If we assigned them randomly we'd need to keep around a lot more of that data, and accesses would be a lot more expensive because there'd basically not be any locality.

Moving to 64-bit transaction IDs has been discussed on HN before: https://news.ycombinator.com/item?id=9936711 Discussion within Postgres about moving goes back as early as 2005.

Moving to 48-bit seems possible, but as the other HN discussion says, other real-world systems have supposedly wrapped around with that too.

Just imagine running `VACUUM` on that table that wrote 1M rows/second for 300 years, and now you need to vacuum quickly because the transaction ids will wrap around next year...

Unless I’m missing something, you’d still have some 299700 years for your VACUUM to complete.

I highly recommend using flexible-freeze if you run Postgres in production - does not take very much effort to set up and almost certainly will help you avoid issues with txnid wraparound:


It just runs `VACUUM FREEZE` when you schedule it (usually daily), starting with the tables closest to hitting a wraparound.

What if we interpret Moore's law to say that transaction speed will double every 2 years?

15 years for 1B tx/s, 35 years for 1M tx/s.

Or more generally:

    t_years = doubling_time*(txid_bits - log( initial_tx_rate_per_year * doubling_time / log(2)) / log(2))
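
The formula above, implemented directly — it falls out of integrating an exponentially growing rate, total = r₀·T/ln2·(2^(t/T) − 1), and solving for when the total hits 2^bits:

```python
import math

# Years until a `bits`-bit txid counter is exhausted when the transaction
# rate starts at `rate_per_s` and doubles every `doubling_time` years.
def t_years(bits: int, rate_per_s: float, doubling_time: float = 2.0) -> float:
    rate_per_year = rate_per_s * 365.25 * 86_400
    return doubling_time * (
        bits - math.log2(rate_per_year * doubling_time / math.log(2))
    )

print(round(t_years(64, 1e9)))  # ~15 years starting at 1B tx/s
print(round(t_years(64, 1e6)))  # ~35 years starting at 1M tx/s
```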

Let me guess - for this insight, you're asking one upvote for the first square of the chessboard, twice as many upvotes for the next square, and so on..

If transaction speed doubles every 2 years, then just imagine for a sec, if you could also increase the word length by 1 bit every 2 years. So in 2 years you increase the transaction ID from 64 bits to 65 bits -- you've now just DOUBLED the number of transaction IDs available. In 2 years times 64 additional bits, or in 128 years, and I would say long before 128 years, it will begin to seem like a no-brainer to just move to 128-bit transaction IDs, since by that time a simple 'int' will be 256 bits and a short will be 128 bits. Machine instruction sizes will be long enough to encode machine instructions in ASCII, making assemblers / disassemblers unnecessary. The number of bits per pixel will be a steganographer's delight. We could increase the bit rate of mp3s from 64 kbps to 96 kbps, which should please all audiophiles everywhere. Dial-up internet speeds should reach 128 kbps by then. Colonies on Mars, but still no male contraceptive.

What I don't understand is how this only affects Postgres. How do DB2/MSSQL/Oracle handle MVCC? Is it superior, or is it a case of trade-offs? Supposing the answer is publicly available.

I guess people had similar arguments when designing IPv4.
