
How to do distributed locking - martinkl
http://martin.kleppmann.com/2016/02/08/how-to-do-distributed-locking.html
======
antirez
Thanks to Martin for analyzing Redlock; I had looked forward to an analysis. I
don't agree with its two main arguments. TLDR: I think Redlock's unique random
token is enough for Check & Set when the lock holder's work materializes into
a database write (also note that Martin's proposed solution, checking
token_A_ID > token_B_ID, requires a linearizable database), and I think the
system model is very real-world (AFAIK the algorithm does not depend on
bounded network delays; I think there is an issue in the analysis, with
details in my blog post so other people can evaluate), and the claim that
different processes can count relative time, without any absolute clock, with
a bounded error (say, 10%) is absolutely credible. However, I'm analyzing the
analysis in depth right now and writing a blog post with the details, to be
posted tomorrow. Also note that usually when you need a distributed lock, it's
because you have no sane way to handle race conditions; otherwise you could
start with some weaker and/or optimistic form of locking.
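
(For readers evaluating the argument: Redlock's acquisition step, as described
on redis.io/topics/distlock, relies only on counting elapsed relative time
against the key's TTL. A minimal sketch in Python using the redis-py client;
the ports, quorum math, and drift factor are illustrative, not canonical:)

    import time
    import redis

    # Minimal sketch of Redlock acquisition: N independent masters,
    # majority quorum, validity computed from relative elapsed time.
    masters = [redis.Redis(port=p) for p in (6379, 6380, 6381)]
    quorum = len(masters) // 2 + 1

    def acquire(key, token, ttl_ms, drift_factor=0.01):
        start = time.monotonic()          # relative time; no wall clock needed
        acquired = 0
        for m in masters:
            try:
                if m.set(key, token, nx=True, px=ttl_ms):
                    acquired += 1
            except redis.RedisError:
                pass                      # an unreachable master counts as a miss
        elapsed_ms = (time.monotonic() - start) * 1000
        drift_ms = ttl_ms * drift_factor  # the bounded clock-rate-error assumption
        validity_ms = ttl_ms - elapsed_ms - drift_ms
        if acquired >= quorum and validity_ms > 0:
            return validity_ms            # how long the lock can be trusted
        return None                       # failed; caller should release partial locks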

~~~
ploxiln
The quality of your C code running on a single node is by all accounts really
good. But you sometimes speak about distributed systems in a way that does not
ring true to people who have spent a lot of time making these things work at
scale.

I've done distributed systems, and in fact I usually don't implement systems
that are 100% correct under maximally adverse conditions, for performance and
convenience reasons. But it's important to know and acknowledge where and how
the system is less than 100% bullet-proof, so that you can understand and
recover from failures. When you have 1000 servers, some virtual in the cloud,
some in a colocation facility, and maybe 10 or 20 subtle variants in hardware,
you will get failures that _should not happen_, but they do. For example, I've
had an AWS host where the system time jumped backwards and forwards by 5
minutes every 30 seconds. And I've seen a TCP load-balancer appliance managed
by a colo facility do amazingly wrong things.

The point is, you need to understand where the consistency scheme in a
distributed system is "fudged", "mostly reliable", or "as long as conditions
aren't insane", and where it is really guaranteed (which is almost nowhere).
Then you can appropriately arrange for "virtually impossible" errors/failures
to be detected and contained. This ad-hoc logic you bring to the problem of
distributed systems doesn't really help.

Not to "dogpile", but elucidate for others who may not know some of the
history here: [https://aphyr.com/posts/287-asynchronous-replication-with-
fa...](https://aphyr.com/posts/287-asynchronous-replication-with-failover) and
the also interesting reply:
[http://antirez.com/news/56](http://antirez.com/news/56)

~~~
antirez
Hello ploxiln, seeing systems with jumping clocks is indeed possible; I'll
make it more clear in my blog post. Basically, with some effort it is possible
to prevent this in general, and more specifically Martin's suggestion of using
the monotonic time API _is a good one_ and was already planned. It's very hard
to argue that monotonic time jumps back and forth, AFAIK, so I think it's a
credible system model. The monotonic API surely avoids a lot of pitfalls that
you can have with gettimeofday() and the sysadmin poking the computer time in
one way or the other (ntpd or manually).
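
(The difference is easy to see in code; a tiny illustration in Python:)

    import time

    # time.time() follows the wall clock, which ntpd or an admin can step
    # backwards or forwards, so elapsed-time math can go wrong:
    t0 = time.time()
    time.sleep(0.1)
    wall_elapsed = time.time() - t0       # can even be negative after a clock step

    # time.monotonic() only ever moves forward, so it is a sane basis for
    # measuring elapsed intervals, like a lock's remaining validity:
    start = time.monotonic()
    time.sleep(0.1)
    elapsed = time.monotonic() - start    # always >= 0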

About the ad-hoc logic: in this case I think the problem is actually that you
want to accept only what already exists. For example, there are important
distributed systems papers where the assumption of a _bounded absolute time
error_ among processes is made, using GPS units. This is a much stronger
system model (much more synchronous) compared to what I assume, yet it is
regarded as acceptable. So you have to ask yourself, regardless of the fact
that this idea comes from me, whether you believe that a computer, using the
monotonic clock API, can count relative time with a maximum percentage of
error. Yes? Then it's a viable system model.

About you linking to Aphyr: Sentinel was never designed to make Redis strongly
consistent, nor to make it able to retain writes. I always analyzed the system
for what it is: a failover solution that, combined with the asynchronous
replication semantics of Redis, has certain failure modes that are considered
fine for the intended use case of Redis. That is, it has just a few
best-effort checks to try to minimize data loss, and it has ways to ensure
that only one configuration eventually wins (so that there are no permanent
split-brain conditions after partitions heal), and nothing more.

I'm not sure what sense it makes to link to that. If you believe Redlock is
broken, just state it in a rigorous form, and I'll try to reply.

If you ask me to be rigorous while I'm trying to make clear, with facts, what
is wrong with Martin's analysis, you should do the same and just use
arguments, not hand-waving about me being lame at distributed systems. Btw, in
my blog post I'll detail everything and show why I think Redlock is not
affected by unbounded network delays.

~~~
ploxiln
It's fair to criticize my somewhat harsh and yet hard-analysis-free critique.
But let me explain why it matters to me, and what the relation to Redis
Sentinel is:

"Redlock" is a fancy name, involves multiple nodes and some significant logic,
less experienced developers will assume it's bulletproof.

"Redis Sentinel" is a fancy name, involves multiple nodes and some significant
logic, less experienced developers will assume it's bulletproof.

I guess I'd prefer that people shoot themselves in the foot with their own
automatic failover schemes, instead of using a publicized, ready-to-use system
whose marketing materials (name, website, spec, multiple implementations) lead
them to believe it is bulletproof, so that these systems end up being used for
a bunch of general-purpose cases without specific consideration of their
failure modes. In this case, the "marketing" is just the fact that you're
behind it, and Redis really is solid for what it is.

(Usually I avoid automatic promotion/fail-over altogether. It's often not a
good idea. Instead I find some application-specific way of operating in a
degraded state until a human operator can confirm a configuration change. So
service is degraded for some minutes, but potential calamity of automatic
systems doing wrong things with lots of customer data is averted.)

It's really not your responsibility to make sure the open source software you
offer to the world for free is used appropriately. But people like me will be
annoyed when it contributes to the over-complicated-and-not-working messes
(not directly your fault) we have to clean up :)

One specific comment: http://redis.io/topics/distlock notes the random value
used to lock/unlock, to prevent unlocking a lock held by another process. But
by that point, the action taken just before the unlock could very well also
have been performed after the lock was already acquired by another process. So
while this is a good way to keep the lock state from being corrupted, it is
really a side note rather than a prominent feature, since if this happens, the
data the lock was protecting could well have been corrupted.
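
(For concreteness, the check in question is the atomic compare-and-delete on
unlock; the Lua script is the one shown on the distlock page, wrapped here in
a Python/redis-py sketch:)

    import redis

    r = redis.Redis()

    # Delete the key only if it still holds our random value. Running the
    # comparison as a Lua script makes check-and-delete atomic server-side.
    UNLOCK_SCRIPT = """
    if redis.call("get", KEYS[1]) == ARGV[1] then
        return redis.call("del", KEYS[1])
    else
        return 0
    end
    """

    def unlock(key, token):
        return r.eval(UNLOCK_SCRIPT, 1, key, token) == 1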

~~~
wyaeld
I have to disagree here.

"fancy names == people think bulletproof" is not a credible criticism.

Now "Oracle Enterprise Manager Fusion Middleware Control" is a fancy name!!!
Pretty sure no-one with a clue thinks a product named like that is
bulletproof, and its probably got significant logic/nodes etc

I apparently have the opposite experience to you. I'll often find complex
distributed systems that are painful to troubleshoot when they misbehave, and
find it refreshing on the other hand when someone is just using Redis, because
typically THAT system is working a lot more predictable.

If people are too inexperienced to realize that ALL systems have tradeoffs,
and not read up on what those are (because apparenly the name is fancy) then
they'll get burned. Antirez does a pretty good job of explaining and
documenting where he sees Redis limits to be.

Arguing that everyone should be home-baking their own failovers until of
learning the limits of well-known ones our there doesn't seem like responsible
advice.

------
dmuth
It's a long read, but the gist of this article is "use the right tool for the
job". Redis is really good for some things, but in its current implementation,
distributed locking is not one of them.

Along with the author of this blog post, I would recommend using Zookeeper if
you need to obtain a lock in a distributed environment. You can read an
analysis of how Zookeeper performs under an (intentionally) partitioned
network here:

https://aphyr.com/posts/291-jepsen-zookeeper

~~~
dvirsky
The author dedicates a good chunk of the post to describing a potential
problem that can happen during long GC pauses or similar situations. That
problem is not unique to Redis. Is there any way to avoid it in ZK?

~~~
teraflop
Yep, and the author even (briefly) describes it. But you can't do it without
moving some of the logic into the storage service, to check fencing tokens.
Zookeeper exposes various kinds of IDs that can be used for fencing;
apparently (I haven't read the details) Redlock doesn't.

Interestingly, even though HBase uses Zookeeper for coordination, it does
_not_ use it for fencing. Instead, fencing is handled by atomic "log-rolling"
operations on the HDFS namenode, which can be configured to use its own
quorum-based journal. (The equivalent of a "token" is the monotonically
increasing namenode operation sequence number.) So the principle is the same:
the system responsible for doing mutual exclusion and the system that actually
stores the data must be coupled.
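
(A toy sketch of what "checking fencing tokens" means on the storage side;
`FencedStore` and its shape are illustrative, not any real system's API:)

    import threading

    class FencedStore:
        """Toy storage service that rejects writes carrying stale fencing tokens."""

        def __init__(self):
            self._mu = threading.Lock()
            self._data = {}
            self._last_token = {}

        def write(self, key, value, token):
            with self._mu:
                # A client that paused and came back with an old token is
                # rejected, because a newer token has already been seen.
                if token <= self._last_token.get(key, -1):
                    raise RuntimeError("stale fencing token: %r" % token)
                self._last_token[key] = token
                self._data[key] = value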

~~~
dvirsky
The solution of fencing tokens at the storage layer that's proposed there just
pushes the problem one layer down: you still need the storage layer to be
consistent for this to work. It also means the storage layer has to be aware
of those semantics, which complicates things. Redlock has non-incremental
tokens, so I guess it can be used as well. Read @antirez's reply above.

------
jondubois
If you want to have a truly parallel system, you cannot share resources
between processes. If you need a lock, then that implies sharing of
resources... which implies limited scalability, according to Amdahl's law.

To achieve unlimited scalability, you need to make sure that no single data
store is shared between all processes and that the amount of interprocess
communication (on a per-process basis) doesn't increase as you add more
processes.

The more sharing you have, the less you can scale your system.
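
(Concretely, Amdahl's law bounds the speedup of N processes at

    S(N) = 1 / ((1 - p) + p / N)

where p is the fraction of the work that can run in parallel. As N grows,
S(N) approaches 1 / (1 - p), so any serialized section, such as work done
under a shared lock, puts a hard ceiling on scalability no matter how many
processes you add.)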

~~~
TallGuyShort
Although no one starts out wanting a truly parallel system. They start out
wanting to solve a problem, and when that requires scalability or redundancy,
a parallel system is a means to that end. Not every problem can or should be
solved with no shared state, and when that's the case, you should know how to
do it right.

~~~
jondubois
I find the code of highly parallel systems much more elegant and easy to read.

Personally, I find it easier to write highly parallel code than
parallel-up-to-a-certain-point code that involves locks like mutexes and/or
semaphores. Highly parallel code just takes a little bit of extra planning and
thinking up front, but it saves lots of time and stress down the line.

------
erentz
"If you still don’t believe me about process pauses, then consider instead
that the file-writing request may get delayed in the network before reaching
the storage service. Packet networks such as Ethernet and IP may delay packets
arbitrarily, and they do [7]: in a famous incident at GitHub, packets were
delayed in the network for approximately 90 seconds [8]. This means that an
application process may send a write request, and it may reach the storage
server a minute later when the lease has already expired."

This statement confused me. It seems to say that the packets were delayed in
the network for 90 seconds before being delivered. From reading the original
sources, it actually sounds like packets were discarded by the switches, so
the original requests were discarded, and the nodes were partitioned for 90
seconds. When the partition was removed, both nodes thought they were the
leader and simultaneously requested the other to shut down. Can anyone
confirm? Keeping packets delayed in a network for 90 seconds would seem quite
difficult (though not impossible, assuming certain bugs).

Edit: On re-reading, I think this is just talking about the network stack in
general, not the network. A temporary partition may delay delivery of your
request until max TCP retries is exceeded on your host; if it's recovered
before then, your request may arrive later than you intended.

~~~
martinkl
I only have the blog post to go by, and don't have first-hand information. It
seems possible to me that a few packets could indeed be delayed by 90 seconds,
perhaps stuck in a switch buffer somewhere, although this would be a small
number of packets since those buffers are not very big.

However, yes, I was thinking about the network stack as a whole. Any kind of
retries, e.g. TCP retransmission on timeout, effectively turns packet loss
into packet delay (within certain bounds). Thus, even if you have a network
interruption ("partition") and all packets are dropped, it could happen that
after the interruption is fixed, a node receives packets that were sent before
or during the interruption. For this reason, I find it helpful to think of it
as delay, not just packet loss.

------
amelius
Why not just use the database to handle the "locking" for you? For example, to
ensure that an email with ID=123 gets sent only once: check whether "email 123
sent" is already in the database; if not, commit it to the database, wait for
the transaction to be committed, and send the email.

Edit: Why is this downvoted? It is a serious question.
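
(A minimal sketch of that idea in Python with sqlite3; the schema and names
are illustrative. A unique key on the email ID makes "check, then mark as
sent" atomic:)

    import sqlite3

    db = sqlite3.connect("app.db")
    db.execute("CREATE TABLE IF NOT EXISTS sent_emails (id INTEGER PRIMARY KEY)")

    def send_once(email_id, send):
        try:
            with db:  # transaction: commits on success, rolls back on error
                db.execute("INSERT INTO sent_emails (id) VALUES (?)", (email_id,))
        except sqlite3.IntegrityError:
            return False    # already claimed: some other worker got there first
        send(email_id)      # note: a crash right here loses the email (see below)
        return True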

~~~
abraae
Perhaps because your example fails to describe what happens if there is a
crash after updating the database to say "email sent", but before the email
has actually been sent.

~~~
amelius
Well, I didn't want to go into that much detail here, because that also
presents a problem in the case of locks (if the locking process crashes, the
lock is never released).

~~~
bryanh
Good observation! The article also covers that exact scenario - plus a
solution called fencing which a proper locking system can help facilitate.

~~~
amelius
I like the fencing solution. Couldn't such a scheme be implemented over a
database instead?

The reason I'm thinking more in terms of a database-oriented solution is that
there is still the problem of crashing. What happens if the system crashes
just before the email is sent? The system somehow needs to remember to restart
that task (sending the email) when it comes back up. And this is probably best
done through a database anyway.
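
(It could, at least as a sketch: a conditional UPDATE gives you the token
check. This assumes a hypothetical table with a last_token column; shown here
with sqlite3 for brevity:)

    import sqlite3

    db = sqlite3.connect("app.db")
    db.execute("""CREATE TABLE IF NOT EXISTS resource
                  (id TEXT PRIMARY KEY, data TEXT, last_token INTEGER)""")

    def fenced_write(key, value, token):
        with db:  # transaction around the compare-and-write
            cur = db.execute(
                "UPDATE resource SET data = ?, last_token = ? "
                "WHERE id = ? AND last_token < ?",
                (value, token, key, token))
        return cur.rowcount == 1   # 0 rows touched means the token was stale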

------
sdab
"When used as a failure detector, timeouts are just a guess that something is
wrong. (If they could, distributed algorithms would do without clocks
entirely, but then consensus becomes impossible [10]."

Having just re-read Lynch's paper, can you explain what you mean here? I
didn't see anything explicitly relying on time. It could be there is some
implicit usage I didnt see. Additionally, the paper's impossibility result is
about "perfectly correct consensus" which applies with and without clocks and
then has a positive result for "partially correct consensus" (i.e. not
deciding a value is a correct result). Im not sure which you mean when you say
"consensus becomes impossible" as it is either already impossible (the
perfectly correct protocol) with one faulty process or (to my understanding)
not dependent on time (the partially correct protocol).

p.s. great article!

~~~
robinhouston
I'm having trouble relating this comment to the referenced paper [1], which
seems to be explicit that its impossibility result does not apply when
synchronised clocks are available.

For example, on page 375: “Crucial to our proof is that processing is
completely asynchronous; that is, we make no assumptions about the relative
speeds of processes or about the delay time in delivering a message. We also
assume that processes do not have access to synchronized clocks, so algorithms
based on time-outs, for example, cannot be used.”

1. http://www.cs.princeton.edu/courses/archive/fall07/cos518/papers/flp.pdf

~~~
sdab
Do you mean the OP's comment or mine? I agree with you if you mean the OP's
comment. Your quote refers to synchronized clocks, though, and I don't think
the OP is referring to synchronized clocks; still, your point that Lynch et
al. do not rely on timeouts is what I'm getting at.

------
spo81rty
For those interested in distributed locking with Redis and C#, check out this
blog post we did, which also links to a GitHub project. We use this very
heavily. Hope it helps someone.

http://stackify.com/distributed-method-mutexing/

------
bogdando
> In a reasonably well-behaved datacenter environment, the timing assumptions
> will be satisfied most of the time – this is known as a partially
> synchronous system [12]. But is that good enough?

Please consider this answer [0] (which I personally understood as: YES, real
systems are always partially synchronous): "Asynchronous systems have no fixed
upper bounds. In practice, systems tend to exhibit partial synchrony, which is
described as one of two models by Dwork and Lynch in Consensus in the Presence
of Partial Synchrony. [1]"

[0] http://bravenewgeek.com/category/distributed-systems-2/

[1] http://groups.csail.mit.edu/tds/papers/Lynch/jacm88.pdf

------
praneshp
Interesting document a colleague of mine wrote:
https://specs.openstack.org/openstack/openstack-specs/specs/chronicles-of-a-dlm.html

~~~
tyre
Off-topic but that URL is remarkably repetitive. Is that some kind of SEO
play?

~~~
praneshp
No, it's an effect of autobuilding from a git repo, I think.

The specs are submitted for review at
https://github.com/openstack/openstack-specs

------
brightball
Not directly related, but in a sublink within that piece I found an
interesting read about two-phase commits in Postgres.

https://aphyr.com/posts/282-call-me-maybe-postgres

------
mattiemass
Interesting stuff. In all my distributed systems work so far, I've assumed
that a distributed lock is a thing to avoid. I really should take another look
at them, just as a tool to have at my disposal.

~~~
Randgalt
If you assume that your distributed lock gives you transactional guarantees
that you are the only lock holder, then you are making a mistake. If, however,
you can tolerate small overlaps in lock holders, you are fine, and this helps
with numerous distributed algorithms. Further, using other facilities such as
fences can make it even more secure. Another feature of ZooKeeper is
write-with-version. You could obtain a lock (using Apache Curator - note I
wrote this), then do a write-with-version to achieve 100% certainty that you
are the only writer.
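
(A rough Python equivalent of that pattern using the kazoo client, since
Curator itself is Java; the paths and values are illustrative:)

    from kazoo.client import KazooClient
    from kazoo.exceptions import BadVersionError

    zk = KazooClient(hosts="127.0.0.1:2181")
    zk.start()

    lock = zk.Lock("/locks/resource", "client-1")
    with lock:
        data, stat = zk.get("/resource")
        new_data = data + b"!"
        try:
            # Versioned write: succeeds only if nobody changed the znode
            # since our read, even if the lock silently expired meanwhile.
            zk.set("/resource", new_data, version=stat.version)
        except BadVersionError:
            pass  # we were not the only writer after all; retry or abort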

~~~
Randgalt
BTW - I wrote a Curator Tech Note about this a while ago:
https://cwiki.apache.org/confluence/display/CURATOR/TN10

------
EGreg
Why would you ever want to do distributed locking when you can do MVCC via,
e.g., vector clocks, possibly with cryptographic signatures?

If you want to acquire some shared resource, then elect an owner for that
resource.

~~~
elbee
If you want to do mutual exclusion using distributed locking, then you end up
in a painful place (as the article points out). In general you can't
distinguish between a process that is running really slowly and a process that
is dead. So when confronted with a non-responsive lock owner, you end up with
two unpleasant options:

1) Assume the lock owner is alive and don't do anything. This approach
guarantees mutual exclusion but isn't especially useful, because it gets stuck
when things die.

2) Assume the lock owner is dead and take the lock. This doesn't guarantee
mutual exclusion but "probably" works if you make the timeout "long enough".
This can work correctly if you have a realtime system which will monitor the
lock owner and kill the owner if the lease on the shared resource expires
(this would probably require a special hardware/OS/programming-language
stack).

------
waxjar
Instead of the fencing token, could the scenario in the blog post be prevented
using a good hashing function?

When you perform a write, instead of the token, send a hash of the object as
previously read. The storage can then compare this against a hash of the
resource's current state. If it doesn't match, the lock expired and the write
is not accepted.

This would reduce the state to keep track of to the resources themselves.
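
(A sketch of the idea in Python; `HashCheckedStore` and its API are
illustrative:)

    import hashlib

    def _digest(value):
        return hashlib.sha256(value).hexdigest()

    class HashCheckedStore:
        """Toy store that accepts a write only if the client read the
        current state: an optimistic, content-based compare-and-swap."""

        def __init__(self):
            self._data = {}

        def read(self, key):
            value = self._data.get(key, b"")
            return value, _digest(value)

        def write(self, key, new_value, expected_hash):
            if _digest(self._data.get(key, b"")) != expected_hash:
                return False    # state changed since the read: reject the write
            self._data[key] = new_value
            return True

(One caveat: unlike a monotonically increasing token, a content hash has the
classic ABA problem. If the resource is modified and then restored to exactly
the bytes you read, the check passes even though the lock was lost in
between.)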

------
hardwaresofton
Has anyone used both Zookeeper and etcd in production for management of
distributed state?

Generally, when I think of this problem I reach for etcd first, not Zookeeper,
in the hope that it is lighter (with a relatively vague definition of "light")
and easier to use.

~~~
bkeroack
etcd would be a fine choice, but Consul provides a lock feature [1] out of the
box.

1. https://www.consul.io/docs/guides/leader-election.html

~~~
teraflop
As the article explains at length, using Zookeeper or etcd as a "locking
service" in this way is _not_ safe, even if the service is perfect and
failure-free. There is a fundamental race condition between finding out that
you've obtained a lock, and doing some other operation that assumes mutual
exclusion.

~~~
bkeroack
It's not really "fundamental". It's simply that the process that acquires the
lock can fail (or be paused, or partitioned away from everything else, etc.),
and if it does, but then comes back later with a lock it still believes to be
valid, bad things may happen.

The author's solution is to push serialization logic into the resource/storage
layer (by checking fencing tokens). But what if the resource _is itself
distributed_? Then it needs its own synchronization mechanism? It's locks all
the way down.

~~~
bkeroack
Thinking more about it, this _is_ a fundamental weakness of having
self-policing processes, which I suppose is the OP's main point. It can be
mitigated by having infinite lock TTLs, at the cost of risking system deadlock
on process failure. Thank you to GPP for spurring me to think more deeply
about this.

As I stated, though, if the resource being protected is either a distributed
system itself or a system that cannot support fencing logic, this failure mode
is difficult or impossible to prevent. The frequency of failure should be kept
in mind here: most services can probably guarantee 99.99% uptime against the
likelihood of 5-minute GC pauses.

------
kang
> If they could, distributed algorithms would do without clocks entirely, but
> then consensus becomes impossible

Consensus can be achieved without clocks using blockchains: instead of
time-locking the resource, a client can broadcast its altered version to the
network after making changes. The next client can then start working on top of
this changed resource, but since multiple versions might be floating around
(due to timing issues in the first place), the longest tree wins. So if a
client submits work done on a shorter tree, it is rejected by the network.

This has other issues, like longer latency and reorganization risk, but it
does away with clocks altogether by providing a different method of
timestamping.

This is similar to the proof-of-work timestamp server that bitcoin uses, but
we can do away with proof-of-work because the resource and membership in the
network are centralised.

~~~
querulous
a blockchain does not make consensus without clocks possible; it is just a
grossly inefficient vector clock implementation (that, incidentally, does not
provide consensus in the distributed-systems sense)

~~~
kang
I just described how you can do it without clocks. Is there a mistake
somewhere, sorry?

~~~
querulous
a clock, in the context of a distributed system, is a method of providing a
partial ordering (a happened before b). most blockchain implementations use a
merkle tree, with predecessor transactions as inputs to each output, as the
clock.

regardless, blockchains as implemented in certificate transparency and bitcoin
aren't actually consensus systems; they are just (highly accurate)
approximations

~~~
kang
Both of the points you make are true, but how are they relevant here?

We are talking about clocks that tell time in the quoted passage (time being
an ordering of events, by definition), and consensus in bitcoin is a final
(accurate) enough practical approximation, just as much as the centralised
system described in the OP is. Either I am misunderstanding, or this is your
dislike for bitcoin.

