How to do distributed locking (kleppmann.com)
277 points by martinkl on Feb 8, 2016 | 71 comments

Thanks to Martin for analyzing Redlock; I was looking forward to an analysis. I don't agree with the two main arguments of the analysis. TL;DR: I think the unique random token of Redlock is enough for check-and-set when the lock holder's work materializes into a database write (note also that checking token_A_ID > token_B_ID, as in the solution Martin proposed, requires a linearizable database). I also think the system model is very real-world: AFAIK the algorithm does not depend on bounded network delays (I think there is an issue in the analysis; details in my blog post so other people can evaluate), and the assumption that different processes can count relative time, without any absolute clock, with a bounded error (like 10%) is absolutely credible. However, I'm analyzing the analysis in depth right now and writing a blog post with the details, to be posted tomorrow. Also note that usually when you need a distributed lock, it's because you have no sane way to handle race conditions; otherwise you can start with some weaker and/or optimistic form of locking.
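For readers following along, the majority-acquisition core of Redlock can be sketched with in-memory stand-ins for the Redis nodes. This is an illustration only, not the real client, which would also release partial acquisitions on failure and subtract the elapsed acquisition time from the lock's validity window:

```python
import time
import uuid

class FakeRedisNode:
    """In-memory stand-in for one Redis instance (SET key value NX PX ttl)."""
    def __init__(self):
        self.store = {}  # key -> (value, expiry time on the monotonic clock)

    def set_nx_px(self, key, value, ttl_ms):
        now = time.monotonic()
        entry = self.store.get(key)
        if entry is not None and entry[1] > now:
            return False  # key is held and has not yet expired
        self.store[key] = (value, now + ttl_ms / 1000.0)
        return True

def acquire_redlock(nodes, key, ttl_ms):
    """Try to take the lock on a strict majority of nodes with a unique token."""
    token = uuid.uuid4().hex  # the "unique random token" under discussion
    acquired = sum(node.set_nx_px(key, token, ttl_ms) for node in nodes)
    return token if acquired > len(nodes) // 2 else None

nodes = [FakeRedisNode() for _ in range(5)]
t1 = acquire_redlock(nodes, "resource", ttl_ms=10_000)  # wins on all 5 nodes
t2 = acquire_redlock(nodes, "resource", ttl_ms=10_000)  # loses: lock is held
```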

The quality of your C code running on a single node is by all accounts really good. But you sometimes speak about distributed systems in a way that does not ring true to people who have really spent a lot of time making these things work at scale.

I've done distributed systems, and in fact I usually don't implement systems that are 100% correct under maximally-adverse conditions, for performance and convenience reasons. But it's important to know and acknowledge where and how the system is less than 100% bullet-proof, so that you can understand and recover from failures. When you have 1000 servers, some virtual in the cloud, some in a colocation facility, and maybe 10 or 20 subtle variants in hardware, you will get failures that should not happen, but do. For example, I've had an AWS host where the system time jumped backwards and forwards by 5 minutes every 30 seconds. And I've seen a tcp load-balancer appliance managed by a colo facility do amazingly wrong things.

The point is, you need to understand where the consistency scheme in a distributed system is "fudged", "mostly reliable", "as long as conditions aren't insane", and where it is really guaranteed (which is almost nowhere). Then you can appropriately arrange for "virtually impossible" errors/failures to be detected and contained. The ad-hoc logic you bring to the problem of distributed systems doesn't really help with that.

Not to "dogpile", but to elucidate for others who may not know some of the history here: https://aphyr.com/posts/287-asynchronous-replication-with-fa... and the also-interesting reply: http://antirez.com/news/56

Hello ploxiln, systems with jumping clocks are indeed possible; I'll make this clearer in my blog post. Basically it is possible, with some effort, to prevent this in general, but more specifically Martin's suggestion of using the monotonic time API is a good one and was already planned. It's very hard to argue that monotonic time jumps back and forth, AFAIK, so I think it's a credible system model. The monotonic API surely avoids a lot of pitfalls that you can have with gettimeofday() and the sysadmin poking the computer time one way or the other (ntpd or manually).
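A quick illustration of the distinction being discussed: the wall clock can be stepped by ntpd or an admin, while the monotonic clock only moves forward within a process, which is what you want for measuring lock validity intervals. A minimal Python sketch:

```python
import time

wall_start = time.time()       # wall clock: ntpd or an admin can step it, even backwards
mono_start = time.monotonic()  # monotonic clock: can only move forward within a process

time.sleep(0.01)               # stand-in for holding the lock and doing some work

# Safe for measuring lock TTLs and timeouts: this delta can never be negative,
# while (time.time() - wall_start) can be if the wall clock is stepped back.
elapsed = time.monotonic() - mono_start
```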

About the ad-hoc logic: in this case I think the problem is actually that you want to accept only what already exists. For example, there are important distributed systems papers where an assumption of absolute bounded time error among processes is made, using GPS units. This is a much stronger system model (much more synchronous) compared to what I assume, yet it is regarded as acceptable. So you have to ask yourself whether, regardless of the fact that this idea is from me, you think that a computer, using the monotonic clock API, can count relative time with a maximum percentage of error. Yes? Then it's a viable system model.

About your linking to Aphyr: Sentinel was never designed to make Redis strongly consistent, nor to make it able to retain writes. I always analyzed the system for what it is: a failover solution that, combined with the asynchronous replication semantics of Redis, has certain failure modes that are considered fine for the intended use case of Redis. That is, it has just a few best-effort checks to try to minimize data loss, and it has ways to ensure that only one configuration eventually wins (so that there are no permanent split-brain conditions after partitions heal), and nothing more.

I'm not sure what sense it makes to link to this. If you believe Redlock is broken, just state it in a rigorous form, and I'll try to reply.

If you ask me to be rigorous while I'm trying to make clear, with facts, what is wrong with Martin's analysis, you should do the same and just use arguments, not hand-waving about me being lame at distributed systems. Btw, in my blog post I'll detail everything and show why I think Redlock is not affected by unbounded network delays.

It's fair to criticize my somewhat harsh and yet hard-analysis-free critique. But let me explain why it matters to me, and what the relation to Redis Sentinel is:

"Redlock" is a fancy name, involves multiple nodes and some significant logic, less experienced developers will assume it's bulletproof.

"Redis Sentinel" is a fancy name, involves multiple nodes and some significant logic, less experienced developers will assume it's bulletproof.

I guess I'd prefer that people shoot themselves in the foot with their own automatic failover schemes, instead of using a publicized ready-to-use system whose marketing materials (name, website, spec, multiple implementations) lead them to believe it is bulletproof, and these systems end up being used for a bunch of general-purpose cases without specific consideration of their failure modes. In this case, the "marketing" is just the fact that you're behind it, and Redis really is solid for what it is.

(Usually I avoid automatic promotion/fail-over altogether. It's often not a good idea. Instead I find some application-specific way of operating in a degraded state until a human operator can confirm a configuration change. So service is degraded for some minutes, but potential calamity of automatic systems doing wrong things with lots of customer data is averted.)

It's really not your responsibility to make sure the open source software you offer to the world for free is used appropriately. But people like me will be annoyed when it contributes to the over-complicated-and-not-working messes (not directly your fault) we have to clean up :)

One specific comment: http://redis.io/topics/distlock notes the random value used to lock/unlock, which prevents unlocking a lock held by another process. But at that point, the action just before the unlock could very well also have been performed after the lock was acquired by another process. So while this is a good way to keep the lock state from being corrupted, it is really a side note rather than a prominent feature, since if this happens, the data the lock was protecting could well have been corrupted.

I have to disagree here.

"fancy names == people think bulletproof" is not a credible criticism.

Now "Oracle Enterprise Manager Fusion Middleware Control" is a fancy name!!! Pretty sure no one with a clue thinks a product named like that is bulletproof, and it's probably got significant logic/nodes etc.

I apparently have the opposite experience to you. I'll often find complex distributed systems that are painful to troubleshoot when they misbehave, and find it refreshing, on the other hand, when someone is just using Redis, because typically THAT system works a lot more predictably.

If people are too inexperienced to realize that ALL systems have tradeoffs, and don't read up on what those are (because apparently the name is fancy), then they'll get burned. Antirez does a pretty good job of explaining and documenting where he sees Redis's limits to be.

Arguing that everyone should be home-baking their own failovers instead of learning the limits of well-known ones out there doesn't seem like responsible advice.

using the monotonic clock api may not actually be an improvement. that gets you deterministic ordering on a single node but increases possible clock skew (you will still see clock drift, but it will now be clamped, limiting the ability to bring it back in sync)

also note that you need deterministic ordering over the cluster and deterministic ordering on individual nodes does not actually grant you any greater certainty

i believe you're referring to google's chubby lock system with your reference to gps but the clocks were only used as an optimization in chubby, it still had to implement a consensus protocol

> I think the unique random token of Redlock is enough for Check & Set when the lock holder work materializes into a database write

But that's optimistic concurrency control! In that case the locking service is a pure optimization and doesn't affect correctness. So why use this complex multi-node locking thing then, instead of a single redis node?
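The optimistic pattern referred to here can be sketched as a version-number compare-and-set, where the store itself, rather than a lock service, arbitrates conflicts (names are illustrative):

```python
class VersionedStore:
    """Accepts a write only if the writer read the store's current version."""
    def __init__(self):
        self.value, self.version = None, 0

    def read(self):
        return self.value, self.version

    def write(self, new_value, expected_version):
        if expected_version != self.version:
            return False  # someone wrote in between: the caller must re-read and retry
        self.value = new_value
        self.version += 1
        return True

store = VersionedStore()
_, v = store.read()               # both clients read version 0
ok_a = store.write("from A", v)   # first write succeeds, version becomes 1
ok_b = store.write("from B", v)   # second write fails: its read is now stale
```

Any lock acquired in front of such a store only reduces the retry rate; correctness comes from the version check.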

It's a long read, but the gist of this article is "use the right tool for the job". Redis is really good for some things, but in its current implementation, distributed locking is not one of them.

Along with the author of this blog post, I would recommend the usage of Zookeeper if you have a need for obtaining a lock in a distributed environment. You can read an analysis of how Zookeeper performs under a (intentionally) partitioned network here:


The author dedicates a good chunk of the article to describing a potential problem that might happen during long GC pauses or similar situations. That problem is not unique to Redis. Is there any way to avoid it in ZK?

Yep, and the author even (briefly) describes it. But you can't do it without moving some of the logic into the storage service, to check fencing tokens. Zookeeper exposes various kinds of IDs that can be used to do fencing; apparently (I haven't read the details) Redlock doesn't.
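The fencing check being described can be sketched as a storage service that remembers the highest token it has accepted and rejects anything older (illustrative, using the article's token numbers):

```python
class FencedStorage:
    """Rejects any write carrying a fencing token older than one already seen."""
    def __init__(self):
        self.max_token = -1
        self.data = None

    def write(self, token, value):
        if token <= self.max_token:
            return False  # stale holder, e.g. a client that was paused past its lease
        self.max_token = token
        self.data = value
        return True

storage = FencedStorage()
r1 = storage.write(33, "from client 1")       # client 1 holds token 33
r2 = storage.write(34, "from client 2")       # lease expired, client 2 got token 34
r3 = storage.write(33, "late from client 1")  # client 1 wakes from a GC pause: rejected
```

Note that this only works because tokens increase monotonically; purely random tokens can be compared for equality but not for staleness.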

Interestingly, even though HBase uses Zookeeper for coordination, it does not use it for fencing. Instead, fencing is handled by atomic "log-rolling" operations on the HDFS namenode, which can be configured to use its own quorum-based journal. (The equivalent of a "token" is the monotonically-increasing namenode operation sequence number.) So the principle is the same: the system responsible for doing mutual exclusion, and the system that actually stores the data, must be coupled.

The solution of fencing tokens at the storage layer that's proposed there just pushes the problem one layer down: you still need the storage layer to be consistent for this to work. It also means the storage layer must be aware of those semantics, which complicates things. Redlock has non-incremental tokens, so I guess it can be used as well. Read @antirez's reply above.

If you want to have a truly parallel system, you cannot share resources between processes. If you need a lock, then that implies sharing of resources... Which implies limited scalability according to Amdahl's law.

To achieve unlimited scalability, you need to make sure that no single data store is shared between all processes and that the amount of interprocess communication (on a per-process basis) doesn't increase as you add more processes.

The more sharing you have, the less you can scale your system.

This. Sometimes it is better to duplicate computation than to wait for shared state (this is in the context of analytics and not a single-source of truth).

Although no one starts out wanting to have a truly parallel system. They start out wanting to solve a problem and when that requires scalability or redundancy a parallel system is a means to that end. Not every problem can or should be solved with no shared state, and when that's the case, you should know how to do it right.

I find the code of highly parallel systems much more elegant and easy to read.

Personally, I find it easier to write highly parallel code than parallel-up-to-a-certain-point code that involves locks like mutexes and/or semaphores. Highly parallel code just takes a little bit of extra planning and thinking up-front, but it saves lots of time and stress down the line.

What if you have a shared resource? Is there any way around distributed locking?

"If you still don’t believe me about process pauses, then consider instead that the file-writing request may get delayed in the network before reaching the storage service. Packet networks such as Ethernet and IP may delay packets arbitrarily, and they do [7]: in a famous incident at GitHub, packets were delayed in the network for approximately 90 seconds [8]. This means that an application process may send a write request, and it may reach the storage server a minute later when the lease has already expired."

This statement confused me. It seems to say that the packets were delayed in the network for 90 seconds before being delivered. From reading the original sources it actually sounds like packets were discarded by the switches, so the original requests were discarded, and the nodes were partitioned for 90 seconds. When the partition was removed both nodes thought they were the leader and simultaneously requested the other to shut down. Can anyone confirm? Keeping packets delayed in a network for 90 seconds would seem quite difficult (though not impossible assuming certain bugs).

Edit: On re-reading I think this is just talking about the network stack in general - not the network. A temporary partition may delay delivery of your request until max TCP retries is exceeded on your host; if it's recovered before then, your request may arrive later than you intended.

I only have the blog post to go by, and don't have first-hand information. It seems possible to me that a few packets could indeed be delayed by 90 seconds, perhaps stuck in a switch buffer somewhere, although this would be a small number of packets since those buffers are not very big.

However, yes, I was thinking about the network stack as a whole. Any kind of retries, e.g. TCP retransmission on timeout, effectively turns packet loss into packet delay (within certain bounds). Thus, even if you have a network interruption ("partition") and all packets are dropped, it could happen that after the interruption is fixed, a node receives packets that were sent before or during the interruption. For this reason, I find it helpful to think of it as delay, not just packet loss.

Why not just use the database to handle the "locking" for you? For example, to ensure that an email with ID=123 gets sent only once, just check if "email 123 sent" is in the database, otherwise commit it to the database, wait for the transaction to be committed, and send the email.

Edit: Why is this downvoted? It is a serious question.

Perhaps because your example fails to describe what happens if there is a crash, after updating the database to say "email sent", but before the email has actually been sent.

Well, I didn't want to go into that much detail here, because that also presents a problem in the case of locks (if the locking process crashes, the lock is never released).

Good observation! The article also covers that exact scenario - plus a solution called fencing which a proper locking system can help facilitate.

I like the fencing solution. Couldn't such a scheme be implemented over a database instead?

Why I'm thinking more in terms of a database-oriented solution is because there is still the problem of crashing. What happens if the system crashes just before the email is sent? The system somehow needs to remember to restart that task (sending the email) when it comes back up. And this is probably best done through a database anyway.

You need locking around the entire transaction, which includes sending the email.

You shouldn't be downvoted; people should help you get up to speed.

In the scenario you describe, the database commit works, you send the email but... you get an error back from the email send function. Now what?

This is a basic distributed computing problem. Choosing how to solve it can have a drastic effect on your code and your infrastructure. If the secondary function is "charge a credit card" obviously you can't do that twice. But if it's merely "send an email" then maybe it's okay if people get a duplicate.
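The at-most-once flavor of the email example can be sketched with a database unique constraint: claim the id first, then send, accepting that a crash between commit and send loses the email rather than duplicating it (illustrative schema, using sqlite as a stand-in for the database):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sent_emails (email_id INTEGER PRIMARY KEY)")

def send_email(email_id):
    # At-most-once: claim the id in the database before sending. A crash after
    # the commit but before the send loses the email; it is never duplicated.
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO sent_emails VALUES (?)", (email_id,))
    except sqlite3.IntegrityError:
        return False  # some other worker already claimed this id: skip it
    # ... actually hand the message to the mail system here ...
    return True

first = send_email(123)   # claims the row and "sends"
second = send_email(123)  # the primary key rejects the duplicate claim
```

Marking the row after sending instead gives you at-least-once, the "maybe a duplicate is okay" choice described above.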

Google lease/lock/deadlock/race condition and read up on why databases tend to make for bad implementations.

"When used as a failure detector, timeouts are just a guess that something is wrong. (If they could, distributed algorithms would do without clocks entirely, but then consensus becomes impossible [10].)"

Having just re-read Lynch's paper, can you explain what you mean here? I didn't see anything explicitly relying on time. It could be there is some implicit usage I didn't see. Additionally, the paper's impossibility result is about "perfectly correct consensus", which applies with and without clocks, and then has a positive result for "partially correct consensus" (i.e. not deciding a value is a correct result). I'm not sure which you mean when you say "consensus becomes impossible", as it is either already impossible (the perfectly correct protocol) with one faulty process or (to my understanding) not dependent on time (the partially correct protocol).

p.s. great article!

The FLP result (Fischer, Lynch, Paterson) shows that consensus cannot reliably be solved in an asynchronous system (if you do not make any timing assumptions) if one or more processes can fail. The impossibility result shows that any consensus algorithm in such a system will have executions in which it waits forever and never makes a decision, which breaks the liveness property of consensus.

However, if you do allow some timing assumptions — just enough to measure a timeout, nothing more — then consensus becomes possible. In this case, the timeout is known as an "unreliable failure detector". See Chandra and Toueg's paper for details.

In a good algorithm, the safety properties do not depend on timing, only the liveness properties do. Rather than "consensus being impossible" perhaps it would be clearer to say "consensus may never terminate" in an asynchronous system. But "consensus impossible" is the standard way of phrasing this issue in the distributed systems literature.

I'm having trouble relating this comment to the referenced paper [1], which seems to be explicit that its impossibility result does not apply when synchronised clocks are available.

For example, on page 375: “Crucial to our proof is that processing is completely asynchronous; that is, we make no assumptions about the relative speeds of processes or about the delay time in delivering a message. We also assume that processes do not have access to synchronized clocks, so algorithms based on time-outs, for example, cannot be used.”

1. http://www.cs.princeton.edu/courses/archive/fall07/cos518/pa...

You mean the OP's comment or mine? I agree with you if you mean the OP's comment. Though your quote refers to synchronized clocks, and I don't think the OP is referring to synchronized clocks, your point that Lynch et al. do not rely on timeouts is what I'm getting at.

"When used as a failure detector, timeouts are just a guess that something is wrong."

I send a packet over the network. I configured the timeout to be 10 seconds. 10 seconds pass. Is the packet truly lost? Or maybe the packet was received, processed, and a response is on the way back, but it takes 20 seconds instead of 10. Or maybe it takes an hour instead of 20 seconds. Or a decade.

We don't know if the packet is truly lost or is just taking longer than $timeout_timer, and making assumptions is dangerous.

My question was about his claim in regards to the result in the paper. In practice you are right, but a consensus algorithm is still "partially correct" if it never decides on a value; it is only incorrect if different nodes decide on different values. For example, Paxos is not guaranteed to decide on a value. So timeouts are useful guessing tools, but I don't see how no consensus is possible without them.

For those interested in distributed locking with Redis and C#, check out this blog post we did which also links to a github project. We use this very heavily. Hope it helps someone.


> In a reasonably well-behaved datacenter environment, the timing assumptions will be satisfied most of the time – this is known as a partially synchronous system [12]. But is that good enough?

Please consider this answer [0] (which I personally understood as - YES, real systems are always partial sync): "Asynchronous systems have no fixed upper bounds. In practice, systems tend to exhibit partial synchrony, which is described as one of two models by Dwork and Lynch in Consensus in the Presence of Partial Synchrony. [1]"

[0] http://bravenewgeek.com/category/distributed-systems-2/

[1] http://groups.csail.mit.edu/tds/papers/Lynch/jacm88.pdf

Interesting document a colleague of mine wrote: https://specs.openstack.org/openstack/openstack-specs/specs/...

Off-topic but that URL is remarkably repetitive. Is that some kind of SEO play?

No, it's an effect of autobuilding from a git repo, I think.

The specs are submitted for review at https://github.com/openstack/openstack-specs

Repo called openstack-specs sitting in the openstack namespace. Inside that repo is a folder called specs.


Yo, Dawg, I heard you like some openstack in your openstack...

Who's the author? I can't find it in the document.

Yup that's me.

https://lwn.net/Articles/662140/ is where this also was written up.

Not directly related, but in a sublink within that piece I found an interesting read about 2 phase commits in Postgres.


Interesting stuff. In all my distributed systems work so far, I've assumed that a distributed lock is a thing to avoid. I really should take another look at them, just as a tool to have at my disposal.

If you assume that your distributed lock gives you transactional guarantees that you are the only lock holder then you are making a mistake. If, however, you can tolerate small overlaps in lock holders you are fine and this helps with numerous distributed algorithms. Further, using other facilities such as fences can make it even more secure. Another feature of ZooKeeper is write-with-version. You could obtain a lock (using Apache Curator - note I wrote this), then do a write-with-version to achieve 100% certainty that you are the only writer.

BTW - I wrote a Curator Tech Note about this a while ago: https://cwiki.apache.org/confluence/display/CURATOR/TN10

Like anything, it depends on what you're using it for. I wouldn't put a distributed lock into some massively high volume request path, or where absolute availability is required, but it's perfectly fine for some scenarios.

As Martin points out though, many (most?) distributed lock implementations are or can be broken in various ways. He hints at one of the fundamental problems - consensus. Many distributed lock implementations fail simply because they cannot achieve reliable consensus, or don't even try to.

That said, distributed locks can be safe and handle reasonably high throughput. The Atomix distributed lock is one example:



Since consensus requires quorum and quorum requires availability, there is a risk that your ability to obtain a lock or learn about lock-related events could be affected by availability, but at least in the case of Atomix, the system is fairly resilient, with auto-failovers to passive or inactive nodes as needed (as compared to, say, a ZooKeeper-based lock).

Thanks for posting about Atomix; I had not heard about it before. This seems to be based on Copycat, discussed previously on HN here: https://news.ycombinator.com/item?id=8180360

It is indeed built on Copycat. We're updating the docs right now ahead of an RC, this week or next.

Oracle has a distributed transaction system that'll work cross-country, but I can't find the details. It's basically a complete rip-off of the OpenVMS implementation from the late 70s (or whenever all those VAX/VMSstations came out - post PDP), no surprises there, and it costs an insane amount just for that functionality (separate from RAC), but it works really well. I'm trying to find the docs since it's been more than half a decade since I used it, but [1] is the closest I can get (and that's certainly not it, though RAC does its job quite well also, and that page is worth a read if only for exposing yourself to time-tested transactional models). It might be HA-NFS over ACFS, but that doesn't seem quite right to me. Anyways, read section 5 of this[2] Oracle doc to have a master class on maintaining database integrity across the datacenter or across the country. I've used a litany of DBs and I think I wouldn't pick anything made in the last 10 years for my primary store other than Datomic (which I've beaten into the ground as hard as I could, and it took my abuse) and or maybeeee VoltDB (if only because Stonebraker is DJB-level smart).

[1] http://docs.oracle.com/cd/B28359_01/server.111/b28318/consis...

I mean, it's definitely a thing you want to avoid whenever possible, because it's a strict point of slowdown in a distributed application.

Perf for sure, but I was typically more concerned about availability.

Why would you ever want to do distributed locking when you can do MVCC via eg vector clocks, possibly with cryptographic signatures?
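For context, the vector-clock mechanics alluded to can be sketched as follows. Note that this detects concurrent updates rather than preventing them, so conflict resolution still falls on the application (a sketch, not a full MVCC implementation):

```python
def vc_merge(a, b):
    """Element-wise max of two vector clocks (dicts of node id -> counter)."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}

def vc_compare(a, b):
    """Return 'equal', 'before', 'after', or 'concurrent' for clocks a and b."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"  # neither dominates: a real conflict the app must resolve

x = {"A": 2, "B": 0}      # two replicas updated independently
y = {"A": 1, "B": 1}
rel = vc_compare(x, y)    # "concurrent": both versions must be kept or merged
merged = vc_merge(x, y)   # {"A": 2, "B": 1}
```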

If you want to acquire some shared resource, then elect an owner for that resource.

If you want to do mutual exclusion using distributed locking then you end up in a painful place (as the article points out). In general you can't distinguish between a process that is running really slowly and a process that is dead. So when confronted with a non-responsive lock owner you end up with two unpleasant options: 1) Assume the lock owner is alive and don't do anything. This approach guarantees mutual exclusion but isn't especially useful because it gets stuck when things die. 2) Assume the lock owner is dead and take the lock. This doesn't guarantee mutual exclusion but "probably" works if you make the timeout "long enough." This can work correctly if you have a realtime system which will monitor the lock owner and kill the owner if the lease on the shared resource expires (this would probably require a special hardware/OS/programming language stack).

Instead of the fencing token, could the scenario in the blog post be prevented using a good hashing function?

When you perform a write, instead of the token, send a hash of the object previously read. The storage can then compare this against a hash of the resource's current state. If it doesn't match, the lock expired and the write is not accepted.

This would reduce the state to keep track of to the resources themselves.
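The suggestion above amounts to a content-based compare-and-swap, which can be sketched like this (illustrative; note that if the resource can return to a byte-identical earlier state, the check would pass spuriously, the classic ABA problem):

```python
import hashlib

def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

class HashCheckedStore:
    """Accepts a write only if the writer's snapshot hash matches current state."""
    def __init__(self, initial: bytes):
        self.data = initial

    def write(self, new_data: bytes, expected_hash: str):
        if digest(self.data) != expected_hash:
            return False  # state changed since the read: the lock must have expired
        self.data = new_data
        return True

store = HashCheckedStore(b"v1")
snapshot = digest(store.data)          # taken while (believed to be) holding the lock
ok = store.write(b"v2", snapshot)      # hash still matches: accepted
stale = store.write(b"v3", snapshot)   # hash of b"v2" no longer matches: rejected
```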

Has anyone used both Zookeeper and etcd in production for management of distributed state?

Generally when I think of this problem I reach for etcd, not Zookeeper first, in the hopes of it being lighter (with a relatively vague definition of "light"), and easier to use.

Not etcd, but zookeeper yes. It's mostly been set up and forget kind of infrastructure for us, except for some snafus created by our own config errors.

etcd would be a fine choice, but Consul provides a lock feature[1] out of the box.

1. https://www.consul.io/docs/guides/leader-election.html

As the article explains at length, using Zookeeper or etcd as a "locking service" in this way is not safe, even if the service is perfect and failure-free. There is a fundamental race condition between finding out that you've obtained a lock, and doing some other operation that assumes mutual exclusion.

It's not really "fundamental". It's simply that the process that acquires the lock can fail (or be paused, or partitioned away from everything else, etc), and if it does, but then comes back later with a valid lock, bad things may happen.

The author's solution is to push serialization logic into the resource/storage layer (by checking fencing tokens). But what if the resource is itself distributed? Then it needs its own synchronization mechanism? It's locks all the way down.

Thinking more about it, this is a fundamental weakness of having self-policing processes, which I suppose is the OP's main point. It can be mitigated by having infinite lock TTLs, at the cost of risking system deadlock on process failure. Thank you to GPP for spurring me to think more deeply about this.

As I stated, though, if the resource being protected is either a distributed system itself, or a system that cannot support fencing logic, this failure mode is difficult or impossible to prevent. The frequency of failure should be kept in mind here: most services can probably guarantee 99.99% uptime against the likelihood of 5 minute GC pauses.

Doesn't it have the same issue that redis does?

Nope, zookeeper nodes have ids that always increment and can be used for fencing, compare-and-set and more...

I was talking about consul and etcd.

In the linked consul documentation, we see a `LockIndex` field in the lock state; https://www.consul.io/docs/agent/http/kv.html confirms that "LockIndex is the number of times this key has successfully been acquired in a lock. If the lock is held, the Session key provides the session that owns the lock."

So you can do fencing with consul.

For etcd: https://coreos.com/etcd/docs/0.4.7/etcd-api/

node.modifiedIndex appears sensible for fencing.

> If they could, distributed algorithms would do without clocks entirely, but then consensus becomes impossible

Consensus can be achieved without clocks using blockchains: instead of time-locking the resource, a client can broadcast its altered version to the network after making changes. The next client can then start working on top of this changed resource, but since multiple versions might be floating around (due to timing issues in the first place), the longest tree wins. So if a client submits work done on a shorter tree, it is rejected by the network.

This has other issues, like longer latency and reorganization risks, but it does away with clocks altogether by providing a different method of timestamping.

This is similar to the proof-of-work timestamp server that bitcoin uses, but we can do away with proof-of-work because the resource and membership in the network are centralised.

a blockchain does not make consensus without clocks possible; it is just a grossly inefficient vector clock implementation (that incidentally does not provide consensus in the distributed systems sense)

I just described how you can do it without clocks. Is there a mistake somewhere, sorry?

a clock in the context of a distributed system is a method of providing a partial ordering (a happened before b). most blockchain implementations use a merkle tree that uses predecessor transactions as inputs to the output as the clock

regardless, blockchains as implemented in certificate transparency and bitcoin aren't actually consensus systems, they are just (highly accurate) approximations

Both of the points you make are true but how are they relevant here?

We are talking about clocks that tell time in the quoted passage (time is an ordering of events by definition), and consensus in bitcoin is as final (accurate) and practical an approximation as the centralised system described in the OP is. Either I am misunderstanding, or this is just your dislike of bitcoin.

The blockchain is a consensus protocol.
