Hacker News new | past | comments | ask | show | jobs | submit login
50 years later, is two-phase locking the best we can do? (concurrencyfreaks.blogspot.com)
329 points by ingve 6 months ago | hide | past | favorite | 81 comments

I'm a beginner in this topic and I find this topic interesting. I really want there to be an easy-to-deploy consistency solution.

If I have a distributed microservice architecture and I want to keep multiple datastores in synchronization or "consistent" what's the industry best practice?

A few days ago I was trying to solve the inconsistency problem with "settled timestamps" which is a kind of multiversioning idea except that timestamps elapsed with the absence of reported error represent a valid save/commit. Kind of like two phase commit with the second phase being time. The idea is that we watch the clocks of other servers and if they don't update then we know we cannot trust their settled timestamps. (My intent was to allow scaling consistency across many servers, because we don't need to wait for response for every update, we only need to wait for the next timestamp interval)

Here's my Multithreaded multiprocessing Python code to test indeterminancy. 10 threads all send eachother random updates. They also broadcast their own timestamp and the timestamps of their own perspective of the timestamps every other server.

https://replit.com/@Chronological/InconsistencySimulation#ma... (click Run and watch the output, you'll have to wait 10 seconds)

A read in this simulation is the MIN of all timestamps of all servers reported timestamps.

10 seconds into the simulation, we ask every thread for its own perspective of what the counter value is. Sometimes they will all report the same value, a lot of the time they shall be split brained.

I am aware that wall clock timestamps are not suitable for ordering in a distributed system and that logical or vector clocks should be used for ordering.

If you can get the simulation to all report the same number at any point in time, then that would be great :-)

Ordering in distributed systems is significant, as the eventual consistency of the simulation means that some values can arrive late but affect the value, meaning it is not linearizable. Bloomlang tries to solve this.

I'm specifically interested in scaling WITH consistency but I think this is quite difficult.

> If I have a distributed microservice architecture and I want to keep multiple datastores in synchronization or "consistent" what's the industry best practice?

Not to use a distributed microservice architecture.

This is the answer. But if they are going to, I would recommend a book called "Designing Data-Intensive Applications" by Martin Kleppmann. It is a book with the most clear communication on this topic and others that I have ever read. Even if you have a degree in CS, I think it is eye-opening to see how very complex ideas can be communicated vs the same ideas in textbooks.

So much this.

I was part of an engineering team that built a payments switch from scratch in the early oughts.

We built it in Java, on commodity hardware and OS, on top of our own replicated, stateful, distributed computing platform.

This was a bonkers thing to do then, pre-cloud. It’s probably still a bonkers thing to do _today_.

Anyway, we did a lot of work with 2PC and other consensus mechanisms and came to the conclusion they 2PC wasn’t up to scratch for what we needed (it’s actually provably less than ideal).

We ended up building (again, from scratch) one of the earliest (that I know of) implementations of the Virtual Synchrony protocol. VS has some robust maths behind it that you can use to make some stronger consistency claims than you can with 2PC. These are important when you’re dealing with interbank settlements for payments switching.

If we started again today I’d say that we might use something like Raft, but I’ve been away from the space for ~20 years now so I’m not entirely sure.

However, one thing I do know is this: distributed consensus is Very Hard™ to get right.

If the answer to your question involves any kind of multiple master implementation using distributed consensus I can unequivocally guarantee that (other than in a few very specific circumstances that you almost certainly do not have) you’re asking the wrong question.

For distributed systems, the main idea is a centralized write ordering journal that is replayed by individual nodes.

Multiple systems write sequentially to the central journal. The journal is simply taking requests like a key value store. The journal is replicated to all the nodes. The nodes read from the journal and performs the complex logic requested.

If business software practices is not enough of a proof, look at any massive online games. They all use one central server as a source of truth about the game world, and broadcast that state to the clients. Anything a client reports that diverges from the central server view is either corrected, rejected, or becomes a reason to disconnect the client for cheating attempts.

If you need strict order, that order should happen in strictly one place. (The universe itself does not support strict order at a distance, as Special Relativity shows.)

Bitcoin: Am I joke to you?

Everyone: yes.

There's no performance gain from bitcoin's approach though. The distributed consensus is there for trustless operation, not performance. It's many orders of magnitude slower (and a few more orders of magnitude more power inefficient!) than just having one server handle it.

Isn't bitcoin actually eventual consensus determined by increasing unlikelihood of finding a conflicting sequence of hashes?

Yes, although if the consistency is good enough for _payments infrastructure_, it's probably good enough for just about everything.

If a particualr mechanic part is good enough for a _space ship_, it's probably good enough for just about everything.

But its price may be prohibitive.

Bitcoin is an interesting exception to the above. Issue is latency.

Latency might be OK. The true price of trustless consensus is proof of work, that is, energy.

> Latency might be okay.

Nope. Not for real-time gaming which is the topic at hand. Certainly not at scale. Certainly not at the kind of volume of events that gaming produces.

that is server/many clients, not distributed

I’d suggest to re-evaluate if you really, really need (a) distributed data stores and (b) synchronous consistency. Things become much simpler if you can forego one of them.

I suspect TigerBeetle DB will be the industry benchmark for consistent, high throughput, fault tolerant, distributed databases in 5 years.

Look at the Raft protocol. In general you’d rather not integrate at the protocol layer; it’s standard to use a consistent store like etcd that implements Raft for the coordination/data that needs to be serializable.

(Kubernetes uses etcd so it scales pretty well for a strongly-consistent k/v store.)

You said “multiple datastores” so I’m assuming you have heterogenous data and something like CockroachDB isn’t an option.

> I'm a beginner in this topic and I find this topic interesting.

Not trying to gatekeepe but rolling your own is dangerous. See https://aphyr.com/ for the gold standard in testing (great educational material). You can use Jepsen to test your distributed systems. But better to just use datastores that Kyle has shown are solid.

Thanks for your reply.

I've experimented with a toy Raft implementation but I haven't Jepsen tested that and it's incomplete

I did write a Jepsen test for a different eventually consistent protocol which understandably fails the linearizability test because I'm still learning - eventually consistent is not linearizable.


I want to have my cake and eat it too. Scalability and consistency.

Is your inter-machine messaging asynchronous and is it possible for one of your machines to crash?

I wonder how this new approach compares to serializable snapshot isolation (SSI) https://wiki.postgresql.org/wiki/SSI

I'm not familiar with these technologies, but when I was studying databases, SSI was touted as the "better" 2PL on the horizon. I wonder how SSI compares to 2PLSF, and why it wasn't mentioned here?

I’m working on a memory model for a new platform which is almost entirely based around copy on write, snapshots and SSI. But in distributed effects you still need locks and two phase transactions and so on. These are largely complementary features imo.

Locking algorithms are great, but before applying them to your problem, I think it is important to take a step back and ponder whether you really need to have lots of threads fighting for the same resources. If you're working with in-memory data structures that is expected, but if you're dealing with an external database or another shared external resource, you may be able to do better.

Often, you can batch the requests and hit the external resource with less concurrency, but larger payloads. If the resource handles well batches, you'll need far less concurrency and locking. For example, if you're working with Postgres, you'll need fewer connections, and you may be able to forego adding PgBouncer, which makes things more complicated.

Granted, batching requests is not something most programming languages are well suited to do. Those that are optimized for high concurrency like Go (channels) and Elixir (processes) can do it well, but for languages that do everything with threads it can be painful.

2PLSF paper the author claims is what 2PL should have been. https://zenodo.org/record/7886718

We seriously need to have a new HN policy that requires every link posted to be HTTPS link

Why? There are plenty of older useful sites which work just fine over HTTP. If you mean for cases where https is supported, but link is http - I agree.

There are plenty of good reasons to use HTTPS. [0]

It doesn't make sense to link to HTTP when the site works fine over HTTPS, which is the case here. I'm not sure I'd want to completely ban all HTTP though.

[0] https://news.ycombinator.com/item?id=27507886

Honest question: what is the consequence of visiting an HTTP link rather than HTTPS for a site where my interaction is read only? Is there some security issue? Or is it privacy concerns.

There is a security issue and a privacy issue.

The privacy issue is that your local WiFi provider, direct isp, and all the intermediate isps can see not only which site you visit, but all your activity within that site (like which pages you visit or things you download).

The security part is that any of those who can view can also do a “man in the middle” attack. Comcast could decide to send you a different version of the website that was more favorable to their company, or inject ads (ISPs have been known to inject ads on sites they don’t own before https was big).

A hacker could send you a version that gets you to download malware by replacing content or links. They can see and effect everything you do and see in such a site if they can intercept your request.

Just run Firefox in HTTPS only mode. [0]

You'll get a warning for any site that CAN'T be upgraded to HTTPS, but any site that supports both you'll just go straight to the HTTPS version.

[0] https://support.mozilla.org/en-US/kb/https-only-prefs

After reading something on concurrent algorithms you come up with this irrelevant observation?

Also, no if an http link is about a good concurrent algorithm, I will read it anyways.

There still are a few HTTP-only websites.

For Wait-Or-Die, do you really need to fetch_and_add to get a transaction ID? Do you need a transaction ID at all? It sounds like the goal is just to have an arbitrary-but-consistent ordering of active transactions so that different transactions can agree on who should wait and who should die in the event of a conflict. So why not just use the thread ID? Or even a random number might work (if ties are treated as “die”, so the worst that happens is that both transactions unnecessarily abort, and then retry with new random numbers).

While it’s not mentioned, I suppose you want to prioritize older transactions in order to prevent long-running transactions from being starved by shorter-running transactions. (If one long transaction conflicts with an average of, say, three short transactions, and on each conflict it’s effectively random who wins, then each long transaction has only a 1/8 chance of winning all three conflicts and being able to commit.)

But preventing starvation only requires older transactions to be prioritized most of the time, not every single time, especially not if the transaction is only slightly older. So some kind of timestamp / cycle counter should work fine, even if there’s skew between threads or other sources of inaccuracy. Ties could be broken by thread ID, or again by having both sides abort.

I would use a ULID rather than thread ID.


They (and others) are great for this kind of case - and many others.

ulid is great, I use it a lot.

There is also a draft to make a new uuid variant – uuid v7 – that will be very similar to how ulid works.


I use ULID a lot too and It’s frustrating how close the new spec UUIDs are to it without actually being the same… so I’ve got a bunch of code to modify once Postgres supports generation of the new UUIDs server side without extensions or stored procedures. Relatively painless work, but frustrating since it could have been avoided.

Yes. V7 looks good but ULID already does it. Rather than change code, I'm just staying with ULID. All give us 128 bits. That should be enough for anyone ;/

> frustrating since it could have been avoided.

How? UUID is a structured format so the only options I can see is ulid creating their own unregistered uuid variant (probably a terrible idea) or adding ulid support to postgres (nothing to do with uuid).

I'd love for Postgres to adopt ULID as a first class variant of the same basic 128bit wide binary optimized column type they use for UUIDs, but I don't expect they will, while its "popular" its not likely popular enough to have support for them to maintain it in the long run... Also the smart money ahead of time would have been for the ULID spec to sacrifice a few data bits to leave the version specifying sections of the bit field layout unused in the ULID binary spec (https://github.com/ulid/spec#binary-layout-and-byte-order) for the sake of future compatibility with "proper" UUIDs... Performing one quick bitfield modification on every row in PostgreSQL would have been less painful as in set bit, vs load parse, repack in order to re-computing the appropriate UUIDv7s (or UUIDv8s for some reason) since the primary key update transaction should be roughly the same speed either way.

ULIDs technically fall under UUID v8 since they map onto the 128bit space. So it's possible that you can just keep using them and PostgreSQL might have something for v8 specifically.

I haven’t had my coffee yet but I’m pretty sure v8 still needs to set the version bits to specify its a v8 UUID, which means there’s 6 bits of data that need to be thrown away to migrate from ULID to a fully conforming UUID.

this is a great paper that provides a framework for comparing 2 phase and paxos.


Two-phase locking is different from two-phase commit, in spite of an overlap in their naming. Two-phase commit is relevant to be compared against Paxos - both of which fall under the category of consensus protocols.

Two-phase locking is a concurrency control mechanism.

A little summary:

2PC: An algorithm used in the context of distributed transactions where each machine handles a different part of the transaction. This means that nothing is redundant - the success of each and every participant is required for the transaction to be committed.

Paxos/Raft/consensus: An algorithm usually used in the context of distributed replication. Since every participant is doing the same thing, it's tolerable if a few fail or give outputs that diverge from the majority.

2PL: A method of acquiring multiple locks such that first you acquire all the required locks (first phase), then you do what you need to do, and then you release all the locks (second phase). This is in contrast to a locking scheme where lock acquisitions and releases are interspersed. This isn't strictly limited to distributed systems, although it's common to see 2PC with 2PL.

If this piques your interest, read the Spanner paper! Spanner uses all three - 2PC with 2PL for distributed read-write transactions, and Paxos for replication.

PS: "Distributed" just means there's more than one machine involved, any of which may fail independently, and communication among these machines happens over unreliable wire.

I find it interesting that “distributed” and “concurrent” end up falling under the mathematical concept of nondeterminism with respect to correctness. Of course a practically efficient implementation has additional concerns.

In a multi-replica system, where, say, we cannot tolerate any failures or lags, is 2PC used in practice to achieve consensus? Or are there other methods for achieving such strict consensus?

There are other means, the most common being Calvin like protocols where every replica deterministically processes some log, and so only one round of distributed calls is necessary.

Which systems use Calvin-like protocols?

FaunaDB is the most popular, I believe.

2PC (or atomic commitment more generally) is needed for sharded/partitioned systems with different data on each node. In these systems, each node gets a vote on whether a transaction should be allowed to commit. Replication, making multiple copies of the same data, doesn't need 2PC. Instead, algorithms like Paxos, Raft, or chain replication are used.

Concurrency control > Concurrency control in databases > Why is concurrency control needed?: https://en.m.wikipedia.org/wiki/Concurrency_control#Why_is_c...?

Locks (computer science) > Disadvantages: https://en.wikipedia.org/wiki/Lock_(computer_science)#Disadv...

Two-phase locking (2PL) https://en.wikipedia.org/wiki/Two-phase_locking

Two-phase commit protocol (2PC) https://en.wikipedia.org/wiki/Two-phase_commit_protocol

Paxos: https://en.wikipedia.org/wiki/Paxos_(computer_science)

Raft: https://en.wikipedia.org/wiki/Raft_(algorithm)

Consensus (computer science) https://en.wikipedia.org/wiki/Consensus_(computer_science)

Spanner: https://en.wikipedia.org/wiki/Spanner_(database)

Non-blocking algorithm; "lock-free concurrency", "wait-free" https://en.wikipedia.org/wiki/Non-blocking_algorithm

"Ask HN: Why don't PCs have better entropy sources?" [for generating txids/uuids] https://news.ycombinator.com/item?id=30877296

"100-Gbit/s Integrated Quantum Random Number Generator Based on Vacuum Fluctuations" https://link.aps.org/doi/10.1103/PRXQuantum.4.010330

Re: tests of randomness: https://mail.python.org/archives/list/python-ideas@python.or...

TIL there's a regular heartbeat in the quantum foam; there's a regular monotonic heartbeat in the quantum Rydberg wave packet interference; and that should be useful for distributed applications with and without vector clocks and an initial time synchronization service (WhiteRabbit > PTP > NTP Network Time Protocol) https://journals.aps.org/prresearch/abstract/10.1103/PhysRev... :

> The [quantum time-keeping application of this research] relies on the unique fingerprint that is created by the time-dependent photoionization of these complex wave packets. These fingerprints determine how much time has passed since the wave packet was formed and provide an assurance that the measured time is correct. Unlike any other clock, this quantum watch does not utilize a counter and is fully quantum mechanical in its nature. The quantum watch has the potential to become an invaluable tool in pump-probe spectroscopy due to its simplicity, assurance of accuracy, and ability to provide an absolute timestamp, i.e., there is no need to find time zero.

IIUC a Rydberg antenna can read and/or write such noise?

"Patterns of Distributed Systems (2022)" https://news.ycombinator.com/item?id=36504073

Forever it is. And never perfect it is.

The issue is if your first locking message gone how do you know it is not the reply message that is lost. For simple one just go ahead and deal with conflict later like GitHub and Dropbox etc. for database good luck. For bank.

I do not quite understand the last figure for the relaxed avl tree. For the 100 % lookup (rightmost) the TL2 algo should scale linearly with the number of threads. For read-only transactions, TL2 needs to sample the global version, then for all reads make sure the local version is less than or equal to the sampled version. given this, it is difficult to understand why the graph is sub linear and that TL2 is not as fast as the other STM implementations.

I don't see such a graph for TL2? I do see one for TLRW though which does use reader locks, hence the scalability cap.

the chart doesn't seem visible on ios safari but i can see it on firefox desktop however, the figure seems to be the same as from the linked paper: https://zenodo.org/record/7886718

Ah ok I found it there, I see what you mean now. My only guess would be cache effects? With that large AVL tree (1 M entries, so likely dozens of MiB), you are escaping L2 cache and hitting shared L3 or main memory for a large portion of lookups, and are bandwidth-constrained at the die level, thus adding that knee (which I think is visible with some of the other algorithms as well).

but TL2 shouldn't be using more memory than say TinySTM, i don't think. If the implementations are object based and nodes are cache aligned, adding an extra 8 bytes for the versioned-lock shouldn't bump the node size to greater than a cache line.

however, i do think the tl2 implementation as described in the paper is memory based, as is TinySTM so every read needs to do a hash to locate the corresponding lock / meta data. the read-only transactions for tl2 and TinySTM seem identical to me which is why i am so confused.

looking at other figures from the 2PLSF paper, the TL2 for 100% lookup on hash set and skip list it looks like such a dog compared to the other algos.

You could simply add randomized queues.

Example, typically have 1000 tasks and 10..100 hardware threads.

So, make one ordered list of 1000 tasks, and than make copy of it for each thread, each time randomize order of copy.

Than, each thread will just read it's list and execute task and subscribe to one list implemented as non-blocking multi-thread queue.

In worst case, few threads will do some task repeatedly.

And atomic operations this way will scale up to 1000 times.

Isn’t 2PL a limited applicability construct?

All my incoming and outgoing “transactions” are over loosely-coupled API calls.

Is there a point I’m missing?

IMO it’s about what’s happening underneath those API calls. You might be doing lots of POST requests that touch the same underlying sql table.

One thing I didn’t quite catch was: I thought most write transactions needs consistent reads on many (contended) “objects”, but the actual writes are often just one or two objects. Is 2PL addressing this or does a write transaction take write locks on all objects?

You can take locks on all objects before you write. Most systems don’t do that.

Usually there are multiple writes to complete a business process. In each step stale data is read to create a write request similar to how you place your order with the wait staff at the restaurant. The wait staff will orchestrate your request. First the chef prepares your meal according to the order ticket. Second the wait staff delivers your meal based on the chef’s internal work order. Lastly they take your payment based on the original order ticket. Notice there are at least three microtransactions based on stale data. No one is holding locks until the diner finishes eating.

IIUC, That’s a micro service way, missing cancellations and rollback and timeouts

What happens if the chef can’t fulfill the meal, the customer leaves cause the order is taking long … etc Most real life will not charge the customer is the chef can’t make the meal and cancel the cooking if the customer leaves

The analogy maybe going far ;)

True. If the customer leaves in mid process, we would have to issue a reversal transaction. Eat the financial loss, dispose of the food, and void the order the ticket.

Or do it like a fast food restaurant: take the order, take the money, finalise the transaction but not commit (irreversible from that point on) then cook the food then commit: if the customer leaves before taking the meal, serve it to the next customer ordering the same meal.

You take either read OR write locks for whatever you need, and keep all locks until you commit the transaction.

That (possibly brief) moment where you hold all the necessary locks is your point of linearizability: it's as if everything happened at that exact moment.

Interesting re the transaction ID generation. Sled generates ~70-120 million transaction IDs per second. The author points out that it’s still a scaling bottleneck but I wonder if Sled’s approach might raise the TPS for the contended case.

huh? no mention of serializable snapshot isolation (SSI), used by most commercial dbs for years. what gives?

MVCC is what most databases do.

MVCC is optimistic concurrency control – you may end up having to retry your write transaction multiple times, which might suck if the transaction is expensive. Pessimistic concurrency control like locking may be cheaper in those cases, since the cost of locking may be drastically lower. Having one doesn't preclude the other!

has its own tradeoffs tho

We are so arrogant sometimes. The passage of time has nothing to do with reality. People seem to think that given enough time and energy and ingenuity we can do anything we want. This is not how the real world works.

Sometimes, we find a good solution, and it's the best possible one and we found it early on. We think we can do better, but we can't. A classic example of this is Euclid's fifth axiom. It wasn't proven until the 19th century that this axiom was necessary but everyone from Euclid to Gauss tried to get rid of it. Foolish.

It seems to me that while it's foolish to assume we can always improve on something, it's not at all foolish to try to do better (even if the attempts don't succeed). Asking "can we do better than this?" and trying to find a better way is a prerequisite for progress, after all.

And trying to improve something will often result in improving ourselves (by learning).

Yeah, that's very true! Honestly, a lot of programming projects I've done are never going to be of use to anyone (even myself). But I learned from all of them, even the ones which went nowhere. That is valuable in itself.

An excellent illustration of how a lack of diversity in thinking can hinder forward progress. Thank God everyone doesn't think the same as you.

I think it's a testament to the human condition there are those of us willing to entertain the folly of pushing the boundaries, or of being foolish as you put it.

2500 years later and the best hypotenuse algorithm is still Pythagoras’.

You jest but this is a computationally slow algorithm that only works for right angle triangles in a Cartesian space. There’s very reasonable and fast trigonometric approximations. The general form would be the cosine law which applies to any triangle. We can derive this from more abstract metrics over an inner product space.. a sort of vector space of any dimension.

Funny you should say that. Here[1] is an interesting improvement. More interestingly he shows his work.

[1] https://www.cs.utexas.edu/~EWD/transcriptions/EWD09xx/EWD975...

Thank you for this gem.

It is a delight to read.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact