Is Redlock Safe? Reply to Redlock Analysis (antirez.com)
181 points by antirez 341 days ago | 135 comments

This response is incorrect at a very fundamental level.

First, Antirez claims that requiring the "materialization" store to tie-break with a monotonically increasing token requires linearizability. This is completely false. The monotonically increasing token allows for eventual consistency. That's the entire point of it: it's monotonically increasing.

Anyone who claims that once you have coordination (locking) in your system you have already lost is completely ignoring the research coming out of BOOM (Berkeley Orders of Magnitude). You can design your system to push the coordination out to the edge and away from your "choke points" and use these monotonically increasing tokens to keep your bottlenecks coordination-free.

Secondly, Antirez's argument that you can use a compare & swap in a transactional storage layer is also wrong. This is not possible to write safely.

I'm not even going to touch his argument that using system clocks in a distributed locking algorithm is safe...

From Martin post: "However, the storage server remembers that it has already processed a write with a higher token number (34), and so it rejects the request with token 33."

This is not eventual consistency, this is refusing any new write with ID < past_ID, which requires linearizability.

Second, about eventual consistency: the token ID, even when increasing, may not be causally related to the work performed while the lock is held. If you just need an arbitrary order when there are concurrent accesses, you can lexically order the random token IDs, for what it's worth.

Compare & Swap: could you elaborate on that? If you want to avoid races in a read-modify-write scenario, you can set the token, do your work, and only write if the token is still the same, so that if your write succeeds, it is because the state of the shared resource is still the same as when you started operating on it.
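That read-modify-write scheme can be sketched in Python. This is a toy single-process sketch, not antirez's implementation: the in-memory `SharedResource`, the `update` helper, and the assumed atomicity of the conditional write are all illustrative.

```python
import secrets

class SharedResource:
    """Toy in-memory stand-in for a shared store with a currlock field."""
    def __init__(self, value):
        self.currlock = None
        self.value = value

    def write_if_token_matches(self, token, new_value):
        # Atomic here only because this is a single-process toy; a real
        # store must provide this conditional write as one atomic operation.
        if self.currlock == token:
            self.value = new_value
            return True
        return False

def update(resource, modify):
    token = secrets.token_hex(20)  # "20 bytes from /dev/urandom"
    resource.currlock = token      # set the token on the resource
    current = resource.value       # read
    new_value = modify(current)    # modify
    return resource.write_if_token_matches(token, new_value)  # write-if-same
```

If another client replaces `currlock` between our read and our write, the final conditional write fails instead of silently clobbering their work.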

Martin made no mention of strong vs. eventual consistency.

If you write to a store like Cassandra using a monotonic token, the result will converge correctly. If you need to never read "stale" data, you can read/write with quorum. There is no linearization in the storage layer in this case.

In this case the lock is not superfluous and you have safety.
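A minimal sketch of the convergence claim, assuming a last-write-wins register keyed on the monotonic token (the tuple encoding and `merge` helper are illustrative, not Cassandra's API; Cassandra reconciles on write timestamps in an analogous way):

```python
def merge(version_a, version_b):
    """Last-write-wins: the (token, value) pair with the higher token
    survives, so replicas converge regardless of delivery order."""
    return max(version_a, version_b, key=lambda version: version[0])

# Each write carries the monotonic token handed out by the lock service.
stale   = (33, "write from client holding an expired lock")
current = (34, "write from client holding the current lock")

# Both merge orders yield the same winner: the replicas converge.
assert merge(stale, current) == merge(current, stale) == current
```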

Regarding CAS, I said that your argument (in support of Redlock) was wrong. I would be happy to respond to a proposed algorithm, but as someone else mentioned: either the storage layer can provide a CAS operation, in which case your lock is not required and the workers simply read, mutate, and CAS; OR the storage layer has weaker semantics, in which case it is not possible to ensure safety with a token.

>the storage layer can provide a CAS operation, in which case your lock is not required and the workers read, mutate, CAS

Yeah, so Optimistic Concurrency Control. Possibly a Redlock-like lease scheme could be used to improve performance in the high-contention case.

you don't need a token service if you are going to set the token on the locked resource, perform work and then unset the token. your lock is completely superfluous in that scenario

That's my point in the blog post I wrote. Please read it; I would love your feedback. Or better: you could even make do with a best-effort lock, given that occasional race conditions are harmless in that scenario.

  2. If your data store can always accept the write only if 
  your token is greater than all the past tokens, then it’s a 
  linearizable store. If you have a linearizable store, you 
  can just generate an incremental ID for each Redlock 
  acquired, so this would make Redlock equivalent to 
  another distributed lock system that provides an 
  incremental token ID with every new lock. However in the 
  next point I’ll show how this is not needed.
you confuse which system is the linearizable one in this point. your storage system is just a dumb storage system that simply returns the object version with the highest clock. what makes it linearizable to an outside observer is that the system generating the lock ids is itself linearizable. if every insert is accompanied by an authentic lock id the system is linearizable. if you generate this lock id via the storage system itself then the storage system must also be linearizable

  3. However “2” is not a sensible choice anyway: most of the times the 
  result of working to a shared resource is not writing to a linearizable 
  store, so what to do? Each Redlock is associated with a large random 
  token (which is generated in a way that collisions can be ignored. The 
  Redlock specification assumes textually “20 bytes from /dev/urandom”). 
  What do you do with a unique token? For example you can implement Check 
  and Set. When starting to work with a shared resource, we set its state 
  to “`<token>`”, then we operate the read-modify-write only if the token 
  is still the same when we write.
would you say this sequence of events is correct:

* process A acquires lock

* process B acquires a lock on the same resource after a time out and sets its lock id on the resource

* process A sets its lock id on the resource

* process A writes a new version of the resource

* process B attempts and fails to write a new version of the resource

if it's not correct, would you say redlock prevents it? how?
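For concreteness, that sequence can be replayed against a naive set-token/check-token store (a toy sketch with illustrative names); the outcome is that process A's stale write is accepted and B's is rejected:

```python
# Toy shared store: a token slot plus a value.
store = {"currlock": None, "value": "v0"}

def conditional_write(token, new_value):
    """Write only if our token is still the one set on the resource."""
    if store["currlock"] == token:
        store["value"] = new_value
        return True
    return False

store["currlock"] = "token-B"  # B (newer lock) sets its id first
store["currlock"] = "token-A"  # A (older, possibly expired lock) overwrites it

assert conditional_write("token-A", "A's stale write")      # accepted!
assert not conditional_write("token-B", "B's fresh write")  # rejected
assert store["value"] == "A's stale write"
```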

now consider what happens if the resource is only one of a group of resources that need to be transactionally updated

the bottom line is that distributed transactional locking requires linearization and redlock is plainly not linearizable. please reconsider its continued development and steer users towards something safer like zookeeper

> your storage system is just a dumb storage system that simply returns the object version with the highest clock. what makes it linearizable to an outside observer is that the system generating the lock ids is itself linearizable.

I don't think you understand Martin's motivations. He says that the system has to reject the write with a lower ID, in order to protect from concurrent accesses, and to do so in a way that the client attempting a stale write is notified.

If you mount an eventually consistent system returning the object with the higher version, there is nothing linearizable about it. If the shared resource is a list of names, one client will try to add "Hellen" with version 8, one will try to add "Josh" with version 9. Only one name will end up in the list in your schema, and the client adding "Josh" will not even know that the write failed.

So the goal is, when race conditions happen, to avoid them in some way. While I believe that when you use a distributed lock the problem is very often that you don't have other controls over the shared resource, if you really need to do that, a random token is equivalent: using CAS you can reject one of the writes if two clients are writing at the same time.

Note also that when it happens that, because of delays, the clients are accessing concurrently, the lock ID has little to do with the order in which the operations were intended to happen. Back to the "add name to a list" problem:

T0: Client A receives new name to add from web.

T0: Client B is idle

T1: Client A is experiencing pauses.

T1: Client B receives new name to add from web.

T2: Client A is experiencing pauses.

T2: Client B receives a lock with ID 1

T3: Client A receives a lock with ID 2


So the client that received the name before, can do the operation after, or you can have delays in any other part of the system.

i disagree that i don't understand martin kleppmann's method and also that his method is 'eventually consistent'.

using a random token is not equivalent to using a linearizable lock, see: https://medium.com/@talentdeficit/redlock-unsafe-at-any-time...

i am not sure what your point about delays and concurrent clients is, of course a system can't infer order from events that happen outside of its visibility. the lock system can only provide synchronization at the point the lock is acquired

> Secondly, Antirez's argument that you can use a compare & swap in a transactional storage layer is also wrong. This is not possible to write safely.

Could you elaborate on this? I'm not sure I'm as pessimistic as you on this one point.

Compare and swap has a consensus number of 2. A consensus number is the maximum number of processes for which the object can solve a simple consensus problem. It is impossible to construct a wait-free implementation of an object with consensus number n from an object with a lower consensus number. In other words, test-and-set is weaker than transactions or Paxos for >2 concurrent clients.


Wait, compare-and-swap has an infinite consensus number (from Herlihy's paper).

Oops, you're right. I was thinking of test-and-set.

In short, either the storage layer provides enough guarantees to ensure safety without redlock OR it is not possible to ensure safety.

I believe that antirez's proposal is the former, but if he has a safe algorithm that does not require a linearizable CAS operation in the storage layer and is safe w/ redlock, I would love to hear it.

I also provided an example of how to safely use Cassandra (which is eventually consistent) with a locking service that provides monotonic tokens.

> I'm not even going to touch his argument that using system clocks in a distributed locking algorithm is safe...

Please do

Clock drift is a real thing in distributed systems.

If I have a 5 second lock that expires at 12:01:00 and my clock is off by 1 second I could write at 12:01:01 potentially after someone else has a lock, or worse, after someone else has written.
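A toy illustration of that arithmetic, assuming a client whose clock runs one second behind true time (all numbers illustrative):

```python
LOCK_EXPIRY = 100.0   # true time (seconds) at which the lock expires
CLOCK_SKEW = -1.0     # the client's clock reads 1 second behind true time

def client_thinks_lock_held(true_now):
    """The client compares its own (skewed) clock against the expiry."""
    client_now = true_now + CLOCK_SKEW
    return client_now < LOCK_EXPIRY

# At true time 100.5 the lock has already expired (another client may
# hold it, or may already have written), yet the skewed client still
# believes its lock is valid.
assert client_thinks_lock_held(100.5)
assert not client_thinks_lock_held(101.5)
```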

but the expiration time is relative to the redis server's time, and the time measurement on the client side is done relative to the client's time. Other clients don't care about your clock and you don't care about redis' clock AFAIK. From TFM:

> The client computes how much time elapsed in order to acquire the lock, by subtracting from the current time the timestamp obtained in step 1. If and only if the client was able to acquire the lock in the majority of the instances (at least 3), and the total time elapsed to acquire the lock is less than lock validity time, the lock is considered to be acquired.
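The quoted step can be sketched as follows; this is a simplified illustration, not any real client implementation — `try_lock`, the drift factor, and the instance objects are assumptions:

```python
import time

def acquire(instances, ttl_ms, drift_factor=0.01):
    """Attempt a Redlock-style acquisition across N independent instances.

    Returns the remaining validity time in ms, or None on failure.
    Elapsed time is measured on the client with a monotonic clock, so
    only the client's own clock rate matters for this check."""
    start = time.monotonic()
    acquired = sum(1 for inst in instances if inst.try_lock(ttl_ms))
    elapsed_ms = (time.monotonic() - start) * 1000.0
    validity_ms = ttl_ms - elapsed_ms - ttl_ms * drift_factor
    majority = len(instances) // 2 + 1
    if acquired >= majority and validity_ms > 0:
        return validity_ms
    return None
```

The lock counts as held only if a majority of instances granted it *and* the acquisition itself was fast enough to leave some validity time.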

What is the effect on the client of a VM live migration "pause"?

or daylight savings change?

or ntp updates which change the time?

Since the system clock can change relative to itself at any time, what effect does that have on the algorithm?

Distributed time is tricky, but it depends on what the intended use is. Under normal operation it will only move forward, for example.

> VM live migration "pause"?

System clock will take a larger step than usual. It won't go backwards.

> daylight savings change?

None. System clock is UTC for a reason.

> ntp updates which change the time?

NTP only drifts the clock (under normal operation).

Not that I think it affects your point, but just off the top of my head:

System clock will take a larger step than usual. It won't go backwards.

Assumes that checkpoint and restore to a previous time won't happen.

NTP only drifts the clock (under normal operation).

Unless it's a system like this. Note that this was pulled at about 1pm EST.

  Waiting for clock tick...
  ...got clock tick
  Time read from Hardware Clock: 2016/02/09 22:36:50
  Hw clock time : 2016/02/09 22:36:50 = 1455057410 seconds since 1969
  Tue 09 Feb 2016 02:36:50 PM PST  -1.038948 seconds
Under normal operation it will only move forward

'Under normal operation' is not really a high bar for distributed systems. After all, the network is reliable 'under normal operation' too.

> Assumes that checkpoint and restore to a previous time won't happen.

Or migration to a host whose clock is behind the original host's clock.

Use CLOCK_MONOTONIC. Or you could make your client abort when the clock goes backwards.
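In Python, for example, the monotonic clock is exposed directly; within a single process it is guaranteed never to step backwards, unlike the wall clock:

```python
import time

# time.monotonic() maps to CLOCK_MONOTONIC on Linux: it can pause along
# with the whole VM, but it never jumps backwards the way time.time()
# can after an NTP step or a manual clock change.
t0 = time.monotonic()
t1 = time.monotonic()
assert t1 >= t0  # always holds; no such guarantee exists for time.time()
```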

If someone forks a VM from a running snapshot, you're screwed whether they mess with the clock or not.

> NTP only drifts the clock (under normal operation).

That shouldn't be an assumption. NTP may jump the clock if the difference is too high. I've seen it happen: a system was connected to two different NTP servers, each with a different clock, and the system would jump the clock every now and then based on which NTP server it thought was more correct.

I don't disagree, I'm just saying that time drift between nodes is not an issue with this algorithm. Also, using a monotonic clock, as antirez says he intends to in the post, will take care of most of these scenarios besides VM migration pauses.

What about requiring PTP sync or shutting down the node? I have no idea about the practical implications, just something I thought of. You can get a CDMA time server on eBay for a few hundred bucks that supports PTP.

Offhand, you either need multiple time servers, or you might as well have a central lock (since you are bottlenecked in either case). That just punts the issue: now the various time servers need to be in sync, or the same problem comes up.

I don't need to, Martin did an excellent job touching on this in his original post, including real world references.

Antirez responded to that and disagreed to this point in his new post. What do you make of his response?

Replying here since I can't reply to your deeper comment.

A client can never know with certainty that it has successfully acquired the lock because, for example, the process could freeze between reading the current time from hardware and the steady_time function returning.

You can not rely on time, for safety. Only for advice.

yes, that goes back to the first issue in Martin's article, and he mentioned the tokens thing for this. But time drift between the machines is not an issue, is my point.

> I'm not even going to touch his argument that using system clocks in a distributed locking algorithm is safe...

He's using it relatively, i.e. if (stop time - start time <= $got_lock_fast_enough) { continue... } Is this not legit?

What if the system freezes between reading the current time from the hardware and the steady_time function returning?

In short, you CANNOT rely on time in any way, shape, or form for safety. You can only use it as advice and to get liveness.

Martin's point is that Redlock pays the cost of other safe locking algorithms w/o the benefit of safety.

This, usually. It's fair to point out that very specialized clocks, such as Google uses with Spanner, can allow you to achieve some consistency without coordination/locks, but for pretty much anyone else the rule you stated very much applies.

FWIW, I went digging for more recent info about TrueTime and found the following.


In a virtualized environment, system time and wall clock time can vary widely. Imagine a VM that's paused and restored.

I would assume that introducing a VM that has "time-travelled" (especially the restore-from-previous-state case someone mentioned above) into a distributed system would inherently break that distributed system, no matter what it's doing. It's a really crazy thing to do, much crazier than a regular network partition.

Personally, though, I don't see why it's so hard to just say "this distributed lock system requires that soft real-time guarantees can be made of the host platform. If the platform misses the algorithm's scheduling deadline, the system breaks. If the platform has no way of providing for a soft real-time deadline (because e.g. it's running in a VM that can be paused), the software will detect this and fail to start."

Because you don't want the system to break.

But it wouldn't, because the surrounding infrastructure would be guaranteeing that. My point is that distributed systems engineers are assuming too few things about the hardware and OS. A much wider space of distributed algorithms become useful if you have full control over your stack.

Do you think NASA quibbles about how to make a perfect consensus bit-coding algorithm between the fault-tolerant processors of a mars probe? No; they just use hardware+firmware that provides a hard real-time platform, allowing the algorithm itself to be simple.

It doesn't require a VM pause to cause a system freeze, an SMI interrupt can do that as well.

It doesn't take a lot to have backwards time travel, NTP can do that for you as well in some bad configurations.

It's very hard to rely on time in a distributed system. If you want a simple algorithm, just don't rely on time at all; use it for logs so a human can correlate things when debugging an issue, but don't assume time will flow at the same rate for all systems. Always use a monotonic clock for internal timers; time does move backwards in systems.

I have seen SMI pauses for almost 5 seconds!

Do you have any references for this?

This discussion is very interesting and I would like to read up a bit more on your argument and why this is so.

I know that distributed systems are hard, but this is a problem that I find interesting to learn more about.

References to what specifically?

Martin provided a number in his original blog post, I also elaborated a bit more on my various points in other comments.

Upon reading a bit more on the subject I realized that I would need more background before asking for references that would be very specific in terms of subject depth.

I read a bit on BOOM and it's not a field I'm keeping up with, so I figured there would be no point in more references if I didn't understand the shallower levels of the discussion first.

Sorry about that, I tried to jump a bit too soon, and thanks for answering :)

Whoah, whoah, whoah, no:

1. Most of the times when you need a distributed lock system that can guarantee mutual exclusivity, when this property is violated you already lost. Distributed locks are very useful exactly when we have no other control in the shared resource.

In context, Sanfilippo seems to be arguing that providing monotonic logical timestamps or sequence numbers is futile, since a system that relies on locks that break is going to fail one way or another.

That is probably true, but wildly misses the point: when a sound system that relies on distributed locks fails, it doesn't silently corrupt data. But that's exactly what happens in the "unfenced" lock-breaking scenario Kleppmann describes.

There are worse things than crashing, is the point.

slightly later

The argument about storage servers taking fencing tokens being "linearizable" also seems flimsy: aren't they "linearizable" (in whatever sense Sanfilippo means) in Kleppmann's case because the lock service arranges for them to be? That's functionality Kleppmann says is missing from Redlock.

Maybe you missed the point ;-) I said this, and later showed how Redlock's unique token per lock can be used to guarantee the same. So it's equivalent, yet a not very credible model in production. When you have a way to avoid a mess when a race condition happens, it's better to avoid a distributed lock altogether, to start with.

Can you further explain your "check" and "set" storage example? I feel like you might be begging the question. "check" and "set" seem like they describe another lock. How does that lock time out?

Whatever you can build with a random token is always simpler than checking whether Token_ID_A < Token_ID_B, which needs a linearizable store. But basically this is how it works:

You want to modify locked resource X. You set X.currlock = token. Then you read, do whatever you want, and when you write, you "write-if-currlock == token". If another client did X.currlock = somethingelse, the transaction fails.

Edit: note that I noticed some confusion about this. Any other client is free to replace X.currlock even if set to a past value.

The problem with this solution is that if A writes Token_A to the resource and then A dies, the resource can never be written to again. How do you avoid needing to time out the 'currlock'?

No, when you acquire a lock, you are allowed to write to currlock immediately. It's AFTER you've done whatever processing you need to do that you do a compare-and-set transactional write that verifies no one else has written to currlock between your acquisition and the completion of your task.

This pseudocode for random fence tokens should be as safe as Martin's scenario with incremental IDs in the original article:



  val = raw(shared, key, token)
  val = do_things(val)
  awar(shared, key, val, token) or
    fail("Can't do!")

Where a shared object has: 1) a key to store a value being changed in the critical section; 2) a currlock to store a random fence token associated with a lock. The raw reads the shared key's value after the currlock was written to the shared object. The awar does an atomic write of the value to the key, only if the currlock has an expected token (i.e. is unchanged).

When things go wrong, the do_things may be executed more than once, in parallel, but the key is safe and guaranteed to have no lost updates to its value.

But your "write" action is separate from the "X.currlock" operations, right?

Doesn't it mean that this can happen?

    A: X.currlock = Token_ID_A
    A: resource read
    A: is X.currlock still Token_ID_A? yes
    B: X.currlock = Token_ID_B
    B: resource read
    B: is X.currlock still Token_ID_B? yes
    B: resource write
    A: resource write
Is there anything that makes the "write-if-currlock == token" operation atomic?

Yes: it turns out, he's stipulating a single server, in which case this is a trivial feature to build. The problem then is, if you're willing to have that single point of failure, you don't need a lock service at all.

Perhaps he means test-and-set?

Or the more efficient test-and-test-and-set?


Issue here is detection of conflicting ops vs. fact of conflicting ops.

The monotonic sequence provides that detection means. So in the example you gave, the tardy (expired) tx commit would abort, since its lock id would break the monotonic sequence of accepted lock-ops (which can clearly have gaps: 1, 2, 4, 6, 7, etc.).
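A toy version of a store enforcing that rule (illustrative; `FencedStore` is not any particular database's API), using the token numbers from Kleppmann's 33/34 example:

```python
class FencedStore:
    """Accept a write only if its fencing token is strictly higher than
    every token seen so far; a tardy (expired) lock holder is rejected."""
    def __init__(self):
        self.highest_token = 0
        self.value = None

    def write(self, token, value):
        if token <= self.highest_token:
            return False  # token breaks the monotonic sequence: abort
        self.highest_token = token
        self.value = value
        return True

store = FencedStore()
assert store.write(34, "from current lock holder")
assert not store.write(33, "from expired lock holder")  # rejected
assert store.value == "from current lock holder"
```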

A lot of the back-and-forth here reminds me of:

"Note that, with careful optimization, only 14 gajillion messages are necessary. This is still too many messages; however, if the system sends fewer than 14 gajillion messages, it will be vulnerable to accusations that it only handles reasonable failure cases, and not the demented ones that previous researchers spitefully introduced in earlier papers in a desperate attempt to distinguish themselves from even more prior (yet similarly demented) work."

from James Mickens' "The Saddest Moment" - https://www.usenix.org/system/files/login-logout_1305_micken...

This is such an elegant way to phrase my general sense of frustration I felt reading certain exchanges here. Thanks =)

As with all distributed systems algorithms, we should assume Redlock is unsafe until proven otherwise. A proof looks like http://ramcloud.stanford.edu/raft.pdf (Raft), http://research.microsoft.com/en-us/um/people/lamport/pubs/p... (Paxos) or http://www.tcs.hut.fi/Studies/T-79.5001/reports/2012-deSouza... (Zookeeper).

Personally I'd prefer stronger proofs than even those, but I think those papers get us to the point where implementation issues are as important as theoretical issues.

In my personal opinion, Redlock is not at the point where we even need to consider the implementation.

A good starting point. But those proofs are notoriously based on fairy tales such as reliable networking, Fail/Stop, and other white lies.

The fact that a protocol (e.g. Raft) has been proven correct sets the floor but it is certainly not sufficient to assume correctness of an implementation.

It should always be a prerequisite to have a proof before one begins an implementation. If one cannot construct such a proof, then they have no business trying to design new distributed algorithms. Fortunately there are a few proven algorithms for replicated state machines available, as previously noted.

Proofs will typically show the limitations of an algorithm; for instance, the original Paxos paper notes that the contained algorithm does not provide liveness.

A system based on an algorithm with a proof has some chance of being correct; a system based on an algorithm with no proof may technically have a non-zero probability of being correct, but that probability approaches zero.

Having worked on such systems, and having talked to some people who have debugged issues with Paxos implementations at extreme scale: while it is virtually impossible to ensure that the system will always be able to make forward progress, it is possible to ensure that the state of the state machine remains correct.

When I look at an implementation of such an algorithm to evaluate it, I pay very careful attention to the test suites. The more bizarre the tests they contain, the more likely the implementation is to handle very weird network edge cases.

> those proofs are notoriously based on fairy tales such as reliable networking

Raft isn't. (I actually don't believe Paxos or Zookeeper do either, but I only understand Raft, so let me speak for that one.) Directly from the link in the post you're responding to:

> Consensus algorithms for practical systems typically have the following properties:

> They ensure safety (never returning an incorrect result) under all non-Byzantine conditions, including network delays, partitions, and packet loss, duplication, and reordering.

This applies to Raft, as well: Raft-the-algorithm will behave correctly during these events. The paper linked to above provides a proof of such.

> The fact that a protocol (e.g. Raft) has been proven correct sets the floor but it is certainly not sufficient to assume correctness of an implementation.

These are two separate things, and you need both. You need to know that an algorithm is sound, that it accomplishes what it sets out to do. You also need to know that a particular implementation of an algorithm is correct.

While I said I haven't read the ZK paper, it does mention,

> Paxos does not require FIFO channels for communication, so it tolerates message loss and reordering.


> is certainly not sufficient to assume correctness of an implementation

Indeed. This is where tools like Jepsen and good fuzz testing can help.

The algorithm needs a proof and the implementation needs rigorous testing (Jepsen).

If you need a global lock in your distributed system, then you're already in trouble. In most cases, the best solution is to rewrite your software so that you don't need the lock. I don't want to say all, because I'm sure there are use cases where it would be impossible to do so. However, for my own research, I'm having trouble finding examples where a global lock is absolutely necessary. If anyone has any examples, please send them to me.

For example, a lot of people say, "updating a bank balance". But even that can easily be eventually consistent. You send increments and decrements and settle up at the end. Yes, your balance might go negative, but that's what ATM limits are for -- the most you could go negative is that limit. Most other systems can be written the same way. And for non-numeric values, there are always vector clocks or timestamps or quorum to determine which update came next, or you could send a client a list of all recent updates and let the client "do the right thing".
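The settle-up idea works because increments and decrements commute; a toy illustration (numbers are made up):

```python
import itertools

# Deltas issued against one balance, possibly delivered to replicas in
# different orders. Addition commutes, so every order settles the same.
deltas = [+100, -30, -25]

settled = {sum(order) for order in itertools.permutations(deltas)}
assert settled == {45}  # delivery order never changes the final balance

# The balance can dip below zero mid-stream (one order visits 70 then 45;
# another visits -30 then -55 first); an ATM limit bounds how far.
```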

While I agree with the general sentiment, the fact is, there were a ton of people using a single Redis instance for distributed locks, since people sometimes find this model useful, even when better alternatives exist, or maybe because of limitations of the systems they interact with. So I tried to provide something better than what people were using, while remaining comparably easy.

I think what you have provided is a far better solution than how most people were using redis. :) I was just hoping to convince people that they should think a little deeper before going straight for the global lock.

Hope you're doing well BTW!

Have definitely thought through this process and wound up using Redlock for the time being.

I didn't expect it to be a perfect system but I don't want to deal with setting up a queuing system right now when there simply isn't enough activity taking place.

Will it fail at some point? Possibly, but it's a great intermediary solution as you state. I for one would be curious to see other solutions to this problem. Are there any resources you'd recommend?

That may be the case, but all other protocols that guarantee consistency suffer from similar subtle issues (which can be resolved through careful algorithm design). Eventual consistency, however, is never the answer, unless your question is "how do I get eventual consistency?" The consistency required by an application is entirely imposed by the domain, and it is that requirement that causes algorithmic difficulties. Using less-consistent protocols to achieve a consistent view simply pushes the problem to a different layer.

Also, eventual consistency is a weak requirement that, when sufficient, enables a certain level of fault tolerance and low latency not achievable with other consistency levels (see [1]), although other consistency levels may offer good-enough fault-tolerance and latency (rivaling EC in many cases). EC is not in general the solution to the problem of "how do I best distribute my data?"

[1]: http://www.bailis.org/blog/when-does-consistency-require-coo...

> Using less-consistent protocols to achieve a consistent view simply pushes the problem to a different layer.

Absolutely, but if you can avoid using global locks, which are error prone and reduce reliability, isn't that a good thing?

It's not a question of good or bad. Providing a higher level of consistency than what your application needs will only hurt you, but many applications just cannot do with weak consistency.

I think the point being made here is that "what your application needs" is often something you chose your way into, and can choose your way back out of by backing out some other constraints—and that it's often more sensible to back out these constraints than to dive head-first into a problem domain that requires distributed locking.

In other words: if you set out to create Bitcoin in 1998, you might end up running into distributed consensus problems that are difficult enough that you decide to "solve an easier problem"—such that you end up creating Paypal instead.

> I think the point being made here is that "what your application needs" is often something you chose your way into, and can choose your way back out of

I think that while this may be true for many requirements, consistency is among the least flexible ones. Besides, losing consistency usually makes your system more complicated, not less. Weak consistency has some real benefits (latency in some cases, tolerance for more failure modes), but simplicity is not usually one of them.

> In other words: if you set out to create Bitcoin in 1998, you might end up running into distributed consensus problems that are difficult enough that you decide to "solve an easier problem"—such that you end up creating Paypal instead.

I don't think that the design problems behind Bitcoin are any easier than distributed consensus, which, BTW, is not necessary for quite a few consistency levels.

What happens when you send the increment and then fail to send the decrement? The problem isn't typically the eventual consistency aspect, though that can also be a problem since you want the balance to be accurate, but rather the transactional nature.

How would a global lock fix this? Presumably you would have a process that is trying to decrement get a global lock and then crash, and eventually release the lock anyway, right? So you get the same outcome.

You can build things like 2 phase locking on top of that. You're right though that simply taking a lock before updating two values doesn't solve the atomicity of that update problem. However if you're updating a single value then taking a lock can make your life easier vs. systems with no locks and eventual consistency.

You can make your life as a programmer easier in exchange for making your life as an operator harder. :) Locks are brittle and hurt reliability.

I'd imagine it being pretty unsettling to watch a bank balance fluctuate. Also, what if there are downstream operations that depend on the value being accurate, e.g. triggering an overdraft condition?

> I'd imagine it being pretty unsettling to watch a bank balance fluctuate.

Banks have generally solved this by batching updates, but yes, it could be unsettling for the customer in a similar situation. There are ways around that though, for example you update a cache that stores the data instantly and then let the eventually consistent source of truth update the cache again if it is not in sync, or perhaps have a short timeout on the cache, so the customer gets a consistent and smooth experience but the source of truth is still accurate, eventually.

Is leader election somehow a usage of a global lock? I do agree it should be avoided actively though.

It's related.

Leader election generally[1] requires consensus among distributed processes, and global locks generally[1] require consensus as well. The benefit of this is that both problems can be solved on top of a common consensus implementation which is what http://atomix.io does.

1: I say generally because you can do fancy things with fancy clocks to avoid running operations through a quorum under certain circumstances, but these carry caveats that preclude them from being reliable enough to use in many use cases.

I don't think so, there are distributed leader election schemes that don't require a lock. The simplest is using HRW hashing[0] so that each node independently determines the same leader.

[0] https://en.wikipedia.org/wiki/Rendezvous_hashing
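To illustrate the point about coordination-free leader election, here is a minimal sketch of rendezvous (HRW) hashing: every node runs the same deterministic computation and independently arrives at the same leader. The node names, key, and choice of SHA-256 are illustrative, not from any particular implementation.

```python
import hashlib

def hrw_leader(nodes, key):
    """Pick a leader via rendezvous hashing: each (node, key) pair gets a
    deterministic weight, and the node with the highest weight wins. Since
    every participant computes the same weights, all of them agree on the
    leader without exchanging any messages."""
    def weight(node):
        return hashlib.sha256(f"{node}:{key}".encode()).hexdigest()
    return max(nodes, key=weight)

nodes = ["node-a", "node-b", "node-c"]
# Every participant runs the same computation and gets the same answer.
leader = hrw_leader(nodes, "shard-42")
```

Note that if the elected leader dies, the survivors again agree on a new leader by simply re-running the function over the remaining node set; the unresolved part is agreeing on membership, which is where failure detection comes back in.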

Another potential issue with antirez's argument:

""" If you read the Redlock specification, that I hadn't touched for months, you can see the steps to acquire the lock are:

1. Get the current time.

2. … All the steps needed to acquire the lock …

3. Get the current time, again.

4. Check if we are already out of time, or if we acquired the lock fast enough.

5. Do some work with your lock.


The above also addresses “process pauses” concern number 3. Pauses during the process of acquiring the lock don’t have effects on the algorithm's correctness. """

What if the process GCs for 2 minutes between 4 and 5? Then the lock times out, but the process still does its work with the lock, even though another process may think it has the lock as well.

This is a specific example of the criticism that you cannot rely on clock checks for correctness that has been made elsewhere.
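The acquisition steps quoted above can be sketched roughly as follows. This is a simplified illustration, not the real Redlock client: `acquire_majority` stands in for the actual majority-acquisition step, and the final comment marks the window this criticism is about.

```python
import time

LOCK_TTL = 10.0  # lock validity time in seconds; illustrative value

def acquire_with_validity(acquire_majority, resource):
    """Sketch of the timing check in the Redlock acquisition steps.
    Returns the remaining validity time, or None if acquisition failed
    or took too long."""
    start = time.monotonic()                # step 1: get the current time
    token = acquire_majority(resource)      # step 2: acquire lock on a majority
    elapsed = time.monotonic() - start      # step 3: get the current time again
    validity = LOCK_TTL - elapsed
    if token is None or validity <= 0:      # step 4: did we run out of time?
        return None
    # step 5 (do the work) happens in the caller -- and nothing prevents a
    # GC pause *here*, after the check but before or during the work.
    return validity
```

The check bounds how long acquisition took, but it cannot bound a pause that happens after it returns, which is the gap the parent comment points at.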

This is clearly addressed in the blog post. A GC pause after step 3 is equivalent to any other pause after the lock is acquired. For example, it's no different from a lock server replying OK to a lock request but the client only reading the reply after two minutes.

Yes, but you can't predict how long such a pause would take, so you can't be sure you still hold the lock you acquired while operating on a shared resource.

Yes, that's the most obvious flaw in his argument (amongst so many others). I'm surprised he (and his friends) didn't spot at least that one.

I recently had to help a developer who was stuck trying to have MongoDB lock a resource and started using redis+redshift to prevent multiple webservers from updating a document in parallel (MongoDB's basic locking features weren't enough in his case).

As an RDBMS aficionado, that whole stack made me puke. BUT, to all the really smart people here with a PhD in distributed systems, please: try to help antirez build something that works. Many developers are trying to reach the shore of consistent systems after having been lost in the ocean of "NoSQL is great for everything".

Note: I'm not saying Redis is to blame here. It's obviously great software. It's mostly the users that are to blame in this case.

Redlock, not Redshift, I assume?

Yeap, typo

I'll take one point here. Antirez's point that clients have to time out on locks, otherwise the resource will be deadlocked, is not correct. We implemented an alternative model based on distributed shared memory. Clients take a lock on the root of a subtree (of potentially millions of inodes, in our version of HDFS), persist the lock to shared memory, and look for existing locks on the subtree in shared memory. If the client dies while processing the subtree, its lock is still persisted in shared memory along with the clientID. If another client tries to take the lock and notices that the client holding it has died, it takes the lock. We used heartbeats to shared memory to implement the failure detector for clients (based again on shared memory http://www.jimdowling.info/sites/default/files/leader_electi... ). Importantly, timeouts for heartbeats expand when the system is congested.

There may be systems where you can reliably detect whether a client has a lock at all, and whether it's still alive or failing. In most systems this is not possible, so your locks need to have an auto-release feature. Distributed locks with auto release are a necessary evil: you avoid using them at all costs if you can, but there are problems where they are needed.

In this case, we transformed the problem from detecting whether the client has the lock into one of detecting whether the client is dead, for which there are many more protocols (with different levels of accuracy). The point stands that auto-releasing distributed locks are not the only possible solution.

It is impossible to write a reliable distributed failure detector in this use case, in a practical system model.

You mean that in the asynchronous system model, in the presence of failures, it is impossible to write a reliable failure detector: http://the-paper-trail.org/blog/a-brief-tour-of-flp-impossib... However, you can write a failure detector for a practical system with the help of omega (the weakest failure detector): http://www.cs.utexas.edu/~lorenzo/corsi/cs380d/papers/p685-c...

In practice, failure detectors mean timeouts. What I am arguing is that you can have systems that adapt their timeouts for failure detection (such as ours), based on the current level of network congestion or load or whatever. The setting/changing of the timeout is averaged over a period of time, to ensure it is high enough not to give false positives. You could take the same approach at the per-message level, but for all possible messages it may be prohibitive. If you have a small number of messages, it may work. The problem with messages, compared to failure detection, however, is the following: how do you figure out you have had a false positive? For failure detection it's easy: the timeout expired, but I'm still alive and my heartbeat arrives late, so we increase the timeout for that node. Vector clocks (fencing tokens) make it easier to reliably find out if messages arrive very late (identify false positives).
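An adaptive timeout in the spirit described might look like the following. This is a deliberately simplified sketch, not the actual HDFS-variant implementation the commenter describes: it just doubles the timeout whenever a heartbeat arrives after the node would already have been declared dead.

```python
class AdaptiveFailureDetector:
    """Simplified heartbeat-based failure detector whose timeout grows when
    heartbeats arrive late (i.e. after a false-positive suspicion)."""

    def __init__(self, initial_timeout):
        self.timeout = initial_timeout
        self.last_heartbeat = 0.0

    def heartbeat(self, now):
        # A heartbeat arriving after the deadline means our earlier
        # suspicion was a false positive: back the timeout off.
        if now - self.last_heartbeat > self.timeout:
            self.timeout *= 2
        self.last_heartbeat = now

    def suspect(self, now):
        """Suspect the node is dead once the timeout has elapsed."""
        return now - self.last_heartbeat > self.timeout
```

A production detector would average over recent inter-arrival times (as the comment says) rather than doubling on a single late heartbeat, but the feedback loop is the same.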

I feel a bit like how a juror must feel when two expert witnesses are making opposing arguments...

No kidding. Unfortunately, it seems people sometimes get kind of snarky with these discussions, which I don't like. It starts to remind me of the Comic Book Guy.

I would encourage everyone to be civil and patient, and if you have the time, expand on what you're writing so that those of us who are not experts can learn something.

Short answer: no.

Longer answer: provided that things never go wrong, it works. But the assumption that things never go wrong is false. In particular, the whole basis of the original point is that there is no true universal time, and this post (essentially) says "provided that there is universal time, I can't see what you are complaining about".

Swing and a miss.

I'm actually exploring the usage of Redlock in one of my projects, and the first question I have is: how can you count on a program to end in a given time? That feels similar to the Halting problem. The only option I have is to use an external program to kill the program that's using the lock after the lock's expiration period.

Note: you have the same issue with all the other dist locks with auto-release. If you have a mission critical use case you could:

1. Try real time kernel extensions.

2. Try a time bomb or watchdog process as you suggested.

3. Put monitoring systems to avoid load to go over a certain level.

4. Tune the Linux scheduler params.

5. Use TTLs that are comparatively very large compared to the work you need to do while the lock is held.

You don't have to count on a program ending in a given time. You require the program to renew the lock if it fails to complete in sufficient time.
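A renewal loop along those lines might look like this. It is a sketch: `extend_lock` is a hypothetical client call that refreshes the lock's TTL and is not part of any particular Redlock library.

```python
import threading

def start_renewer(extend_lock, lock_id, ttl, stop_event):
    """Periodically extend the lease well before it expires, so a
    slow-but-alive holder does not lose the lock mid-work.
    extend_lock(lock_id, ttl) -> bool is assumed to refresh the TTL and
    return False if the lock has already been lost."""
    def loop():
        # Renew at a third of the TTL to leave headroom for delays.
        while not stop_event.wait(ttl / 3):
            if not extend_lock(lock_id, ttl):
                break  # lost the lock; the worker must stop trusting it
    t = threading.Thread(target=loop, daemon=True)
    t.start()
    return t
```

Note this only narrows the window: a renewal can still succeed right before a long pause, so it does not remove the need for fencing on the resource side.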

I'm not distributed systems expert, but I don't see any attempt at responding to:

> 1. Distributed locks with an auto-release feature (the mutually exclusive lock property is only valid for a fixed amount of time after the lock is acquired) require a way to avoid issues when clients use a lock after the expire time, violating the mutual exclusion while accessing a shared resource. Martin says that Redlock does not have such a mechanism.

The section "Distributed locks, auto release, and tokens" talks about why Martin's solution isn't right (I'm not well-read enough to tell if that argument is correct), but never actually argues that Redlock has a mechanism to avoid violation of mutual exclusion due to timeouts.

Perhaps I'm missing some background here: does such a mechanism exist?

I don't address what you do when mutual exclusion is violated, because I show how a different token for each lock is good enough to avoid the races that Martin worries about, but at the same time I'm deeply skeptical that in real-world use cases you often have this luxury. Very often distributed locks are used when you need the lock as the sole way to avoid race conditions, so IMHO this is a purely theoretical argument for most people who are going to use Redlock. When you can use the token, use it, and even consider not using the distributed lock at all, other than to gain performance by avoiding race conditions whenever you can.

It doesn't exist. I think Martin called this out because of the fundamental impossibility of guaranteeing consensus asynchronously, i.e. with auto-release.

(tl;dr, I don't think he does)

I'm also not a distributed systems expert, but this is how I interpreted the arguments. Antirez is responding to the scenario where a client acquires a lock, hangs, a new client acquires a lock, and the old client comes back and tries to write something based off of a now-stale lock, as demonstrated in this image [0].

So the random token ensures that even though many clients may "think" they have a lock, only up to a single client can ever actually have one. The write in the diagram which corrupts the database shouldn't happen, because you'll need to check that your lock token matches the service. [1] (excuse the terrible MSPaint modifications).

Which covers the common cases of lock timeout. But the clients aren't the only part of the system that can violate the mutex. Martin seems to have already thought of this rebuttal:

    You cannot fix this problem by inserting a check on the lock expiry just before writing back to storage. Remember that GC can pause a running thread at any point, including the point that is maximally inconvenient for you (between the last check and the write operation).
    [ ... ]
    If you still don’t believe me about process pauses, then consider instead that the file-writing request may get delayed in the network before reaching the storage service. Packet networks such as Ethernet and IP may delay packets arbitrarily
And I don't believe antirez answers this (or I'm not understanding when he does :P ). Even if a client makes a write request at a perfectly valid point in time, when it owns the lock, how do I know that by the time the write request reaches storage the lock is still valid? The difference boils down to "redlock guarantees requests are sent during a valid lock", but not "redlock guarantees requests transact during a valid lock".

Martin's fix [2] is similar to the token method suggested, except: A) the write-if-token-matches logic is handled directly by the storage layer. B) It uses an auto-incrementing token.

He spends a few paragraphs making the point that redlock cannot generate good fencing tokens, because it has no good auto-increment consensus. But I don't see how a random token is any worse than an incrementing one if you're just going to straight up reject requests where the token doesn't match. Antirez addresses this, without addressing A.

I would also really like someone to correct me if I'm missing something. The Redlock implementations are very bare-bones, with few tests or examples of how to actually use them. The Ruby library has a more "advanced" example [3], which says this at the point where it tries writing to the file system:

                # Note: we assume we can do it in 500 milliseconds. If this
                # assumption is not correct, the program output will not be
                # correct.
Which doesn't give me much faith in this...

[0] http://martin.kleppmann.com/2016/02/unsafe-lock.png

[1] http://i.imgur.com/8rezAaX.png

[2] http://martin.kleppmann.com/2016/02/fencing-tokens.png

[3] https://github.com/antirez/redlock-rb/blob/master/example2.r...

> But I don't see how a random token is any worse than an incrementing one if you're just going to straight up reject requests where the token doesn't match.

No, random tokens are not worse per se, they just have no place in the algorithm Martin described. It's a completely different system. In that system incremental tokens presume a mechanism that gives you a way to have some order of events. And one can rely on that order to decide which requests with which tokens to reject.

> All the time Martin says that “the system clock jumps” I assume that we covered this by not poking with the system time in a way that is a problem for the algorithm, or for the sake of simplicity by using the monotonic time API. So:

> About claim 1: This is not an issue, we assumed that we can count time approximately at the same speed, unless there is any actual argument against it.


> Timestamps, as implemented in Riak, Cassandra, et al, are fundamentally unsafe ordering constructs. In order to guarantee consistency you, the user, must ensure locally monotonic and, to some extent, globally monotonic clocks. This is a hard problem, and NTP does not solve it for you. When wall clocks are not properly coupled to the operations in the system, causal constraints can be violated. To ensure safety properties hold all the time, rather than probabilistically, you need logical clocks.

> A somewhat less safe but reasonable option is to use NTP rigorously on your machines, and sync it to TAI or GPS instead of POSIX time or UTC. Make sure you measure your clock skew: everyone I know who says they run NTP has discovered, at one point or another, that a node was way out of sync. If you want rough correspondence to POSIX time, you can still ensure monotonicity by running your own NTP pool and slurring leap seconds over longer time frames.

Clock skew is a very real problem and virtually impossible to avoid 100% of the time, particularly when you have things like leap seconds, where the fundamental concept of how many seconds are in a day gets changed.


System clocks are fundamentally unsafe if you just use NTP. That isn't to say you can never use them. In some use cases they are "safe enough" (e.g. analytics, metrics), where the occasional fuck-up isn't substantial, but Antirez isn't making that argument with Redlock as far as I can tell.

> “Okay, so maybe you think that a clock jump is unrealistic, because you’re very confident in having correctly configured NTP to only ever slew the clock.” (Yep we agree here ;-) he continues and says…)

That still isn't reliably safe because there is no guarantee all nodes will perform the slew identically.

This entire debate reminds me quite strongly of Peter Bailis's talk at MesosCon last year on designing distributed systems to avoid coordination where possible. It is a really excellent talk for anyone into this sort of thing:



I don't understand many of the leaps you make in your argument. From what I can see, you're defending that Redlock is "good enough" in cases where you don't care about your computers hating you. This is not a good environment for distributed systems when you need strong correctness properties.

Reading through the original description of Redlock[1] and your "rebuttal" is interesting. The algorithm described in the spec isn't safe if a GC pause or process pause happens between <checking how long it took to acquire a majority> and <doing work and releasing the lock>. You don't even touch this argument (pretending to not understand which step is the problem and defending that everything before the gap I just described is safe).

You claim that fenced tokens require linearizability. I don't know if this is true, but even if it is, all you've proven is that the solution offered isn't good enough. The fact that something that tries to fix your system doesn't fix it properly doesn't mean that your system isn't broken.

The article calls on someone(?) to Jepsen test Redlock, which is a good idea, but I think that the authors of such systems should release their own Jepsen test suite along with the software they're purporting to be safe (as others have done [1][2][3]). I would suspect that operations on Redlock, when subjected to nemeses that partition nodes, kill processes, and skew clocks, would not be found linearizable.

In general, I don't understand why one would build a system that attempts to approximate consensus without just using one of the proven consensus algorithms. Redlock is not the only one here, there are other systems that do this as well.

[1] https://github.com/atomix/atomix-jepsen

[2] https://www.datastax.com/dev/blog/testing-apache-cassandra-w...

[3] https://foundationdb.com/blog/call-me-maybe-foundationdb-vs-...

Redlock has nothing to do with linearizability. The only strong property it tries to provide is mutual exclusivity. I'm interested in working toward a Jepsen test for this, but Jepsen's learning curve and current level of documentation mean spending a lot of time on writing the test.

I'm not sure that merely checking the result of some set of operations - whether a mutex is held by the expected process - tells you very much about the validity of how that mutex was obtained (assuming interleaving requests and such). This is where you need to explore the history, which is what the linearizability checker does and which is what Jepsen tests generally use.

It is a bit of work and learning curve writing a Jepsen test suite, but it's not too bad, particularly with the excellent docs that Kyle has recently written:


The problem with Jepsen is that it can prove that systems are unsafe, but it can't be used to prove that systems are safe.

For example, Consul did their own Jepsen testing [1], which they passed, but Kyle pointed out [2] that they passed only because they changed timeouts from 1 s to 300 ms, effectively making the race-condition window smaller.

[1] https://www.consul.io/docs/internals/jepsen.html [2] https://aphyr.com/posts/316-call-me-maybe-etcd-and-consul

It's typically not tractable to prove that a system is safe. For example, RethinkDB used the Raft consensus algorithm, which on its own is proven correct, but due to a bug could in some cases break linearizability. Jepsen was able to uncover this.

The best method we currently have for demonstrating that large systems are safe is to try really hard to prove that they are unsafe and fail.

For problem 1, you'd use a distributed lock to update more than one resource. A single resource could very well implement its own lock in which case you don't need a distributed lock. But once you have more than one resource to update, you do.

A lock should result in linearized access (all resources updated while holding the lock should be updated in the same order based on the time when the lease was granted). A counter is enough to put the updates in order. Giving leases a unique id doesn't do that - the network could reorder writes seen by two different resources. Fencing ensures that if the network does reorder writes, one of them fails.

The lock is acting sort of like a clock that ticks whenever it hands out a lease. The resources effectively update their local clock whenever they see an update with a new token from the central clock, and use this local clock to detect stale writes.

Note that fencing doesn't guarantee that writes don't happen late - all updates could be delayed by 10 seconds and it wouldn't catch it. But it isn't feasible to prevent this on systems with bad clocks connected by a flaky network.
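A toy sketch of the storage-side behavior described above: the resource remembers the highest token it has seen (its "local clock") and rejects anything older, matching Martin's token-33-vs-34 example. Class and method names are illustrative.

```python
class FencedStore:
    """Toy storage service that enforces fencing tokens: it remembers the
    highest token it has processed and rejects writes carrying older ones,
    even if the network delivers them late or out of order."""

    def __init__(self):
        self.highest_token = -1  # "local clock" ticked by incoming tokens
        self.value = None

    def write(self, token, value):
        if token < self.highest_token:
            return False  # stale lock holder: write rejected
        self.highest_token = token
        self.value = value
        return True

store = FencedStore()
store.write(33, "from client 1")   # accepted
store.write(34, "from client 2")   # accepted, local clock advances to 34
late = store.write(33, "delayed")  # rejected: token 33 < 34
```

This depends on tokens being totally ordered, which is exactly why a counter works here and a random token does not: the store has no way to tell which of two random tokens is the stale one.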

Have you read this Salvatore? http://research.microsoft.com/en-us/um/people/lamport/tla/fo...

> ... The good thing is that distributed systems are, unlike other fields of programming, pretty mathematically exact, or they are not, so a given set of properties can be guaranteed by an algorithm or the algorithm may fail to guarantee them under certain assumptions. In this analysis I’ll analyze ..

Saw no mathematics in your article.

More important than that paper are the rest of the papers by Leslie Lamport.

Specifically the paper entitled Time, Clocks and the Ordering of Events in a Distributed System. [1]

This is basically the seminal paper on distributed systems and should be required reading for anyone considering playing with distributed locking mechanisms.

[1] http://research.microsoft.com/en-us/um/people/lamport/pubs/t...

The paper discusses the benefits of applying the theoretical tools and was written by engineers. Your cite is of course essential, but no doubt OP knows this as well :)

Very true, I didn't consider it like that.

Unfortunately, testing Redlock with a formal method is pretty useless. The algorithm is so trivial, for the part you can mathematically model, that modeling it is wasted effort: even with a proof of the lock acquisition process, it all boils down to "is the system model credible?" And that's what we are debating. Note that, similarly, Martin doesn't even try to address the problem of how the lock is acquired, for the same reason I guess. Also note that if you try to model the problem with unbounded process pauses, and put mutual exclusivity with auto-release as a requirement, I don't think there is any way the model will ever be validated, whatever system you model implementing this.

I'm still relatively new to formal specification but it seems that some form of weak/strong fairness would suffice for the correctness of your invariant and then it's a matter of testing your liveness properties while holding the invariant with a model checker.

TLA+ seems pretty good at this. Why would this be a waste of effort?

So since it depends on the correctness of the underlying system is there a TLA+ (or other) proof for that? Last time jepsen ran on redis it showed that it was broken unless something has changed in the meantime.

Before any proof, the debate would benefit from more precise definitions.

For instance, the issue around the usage of fencing tokens to reject any late usage of a lock is unclear just because the protected resource and its access are themselves unspecified. Is the resource distributed or not? If distributed, does the resource have a means to ensure that tokens are increasing over all the nodes? Does the resource have a means to roll back any effects done by a client whose session is interrupted by a timeout?

Likewise for the timing arguments. It seems clear to everyone that timeouts provide only some protection rather than a guarantee. A more precise timeline of the worst-case scenario would help to estimate that protection given the various timeout values and clock shifts.

I don't see a distributed lock implementation being both 100% correct without any support of the resources guarded by the lock and actually feasible.

For data which absolutely must never violate the lock then the underlying data storage has to be involved. The vast majority of times where a distributed lock is used is where the underlying storage doesn't provide any support and the distributed lock is good enough, failures are exceptional and not disastrous.

If you want perfect correctness then you can just not have auto-release but practically this will cause more problems than it solves.

I understand how a fencing token can prevent out of order writes when 2 clients get the same lock. But what happens when those writes happen to arrive in order and you are doing a value modification? Don't you still need to rely on some kind of value versioning or optimistic locking? Wouldn't this make the use of a distributed lock unnecessary?

> But what happens when those writes happen to arrive in order and you are doing a value modification?

I believe the "first" write fails, because the token being passed in is no longer the latest, which indicates their lock was already released or expired.

If your tokens come from a linearizable source (like ZooKeeper) you can do a compare-and-set to mutate state safely. You read the state, perform your mutations on it, then in an atomic transaction read the state again and, if it is the same as the initial state, save your new state. If your fencing token is the highest yet seen, the service uses your state as the authoritative state; if it's not, it rejects the change.
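The read-modify-write pattern described above can be sketched against a plain in-memory dict. This is an illustration only: here the "atomic section" is simulated by sequential code, whereas a real store (ZooKeeper, an RDBMS transaction, etc.) must provide the atomicity itself.

```python
def cas_update(store, key, modify):
    """Optimistic read-modify-write: read the state, compute the new value,
    then re-check inside the (simulated) atomic section that the state is
    unchanged before writing. Returns False if a concurrent change was
    detected, in which case the caller should retry."""
    snapshot = store.get(key)
    new_value = modify(snapshot)
    # --- begin atomic section (server-side in a real system) ---
    if store.get(key) != snapshot:
        return False  # someone else changed it under us
    store[key] = new_value
    return True
    # --- end atomic section ---
```

The fencing-token check described in the comment would sit alongside the snapshot comparison: the store also rejects the write if the accompanying token is lower than the highest it has seen.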

Would something like Google TrueTime help in this scenario. My takeaway being accurate with time...

I don't understand how the migration to a monotonic clock source is treated as a trivial move. In practice, using a monotonic clock source makes coordinating lease-time records correctly even more complicated.
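For context on what the monotonic-clock switch does and does not buy you, here is a minimal sketch: measuring a lease's age with `time.monotonic()` is immune to wall-clock steps, but the readings are only meaningful within one process, which is exactly why coordinating lease records across machines remains the hard part.

```python
import time

def lease_expired(acquired_at_monotonic, ttl):
    """Check lease age against the monotonic clock, which never jumps
    backwards when NTP steps the wall clock. Caveat: monotonic readings
    cannot be compared across processes or machines, so this only helps
    each node reason about durations it measured itself."""
    return time.monotonic() - acquired_at_monotonic > ttl
```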

Neither the specification nor the analysis contains a formal, verifiable specification. How can one test an invariant and properties and make real conclusions about the safety/liveness of Redlock without it?

Is it not useful to do so?

What does "linearizable" means in this context?

There's a formal definition, but informally it means every participant agrees on an order of events in a system. If a system is linearizable you can safely compare-and-set to mutate state without introducing conflicts.

The first step to getting out of the hole is to stop digging. The author of this article clearly hasn't learned this lesson yet.

There's no salty like antirez salty. I have mad respect for the guy as a programmer, but he needs to take things a bit less personally IMO.

Sorry, I did not mean to be rude.

Fwiw, I didn't perceive your piece to be rude. I'm not sure who's right, but it seems like everyone is being civil.

I did not read anything you said as rude, either. Sometimes I find that people can't tell the difference between saying "you are wrong" and "you are stupid" when it's about something they care about :(
