I've done distributed systems, and in fact I usually don't implement systems that are 100% correct under maximally adverse conditions, for performance and convenience reasons. But it's important to know and acknowledge where and how the system is less than 100% bullet-proof, so that you can understand and recover from failures. When you have 1000 servers, some virtual in the cloud, some in a colocation facility, and maybe 10 or 20 subtle hardware variants, you will get failures that should not happen, but do. For example, I've had an AWS host where the system time jumped backwards and forwards by 5 minutes every 30 seconds. And I've seen a TCP load-balancer appliance managed by a colo facility do amazingly wrong things.
The point is, you need to understand where the consistency scheme in a distributed system is "fudged", "mostly reliable", or "fine as long as conditions aren't insane", and where it is really guaranteed (which is almost nowhere). Then you can appropriately arrange for "virtually impossible" errors/failures to be detected and contained. Ad-hoc logic brought to the problem of distributed systems doesn't really help with that.
Not to "dogpile", but to elucidate for others who may not know some of the history here: https://aphyr.com/posts/287-asynchronous-replication-with-fa...
and the also interesting reply: http://antirez.com/news/56
About the ad-hoc logic: in this case I think the problem is actually that you only want to accept what already exists. For example, there are important distributed systems papers that assume an absolute bound on clock error among processes, achieved using GPS units. That is a much stronger (much more synchronous) system model than the one I assume, yet it is regarded as acceptable. So you have to ask yourself, regardless of the fact that this idea is from me, whether you think a computer, using the monotonic clock API, can count relative time with a maximum percentage of error. Yes? Then it's a viable system model.
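The bounded-drift assumption can be sketched with Python's monotonic clock. The TTL and drift factor below are illustrative parameters, not values from any implementation; the structure mirrors how a Redlock-style client discounts the lock's validity by the time spent acquiring it, plus a drift allowance:

```python
import time

# Hypothetical parameters: a lock TTL and an assumed maximum clock-rate error.
LOCK_TTL = 10.0            # seconds the lock is valid for
CLOCK_DRIFT_FACTOR = 0.01  # assume the monotonic clock is accurate within 1%

def remaining_validity(acquire_start: float, acquire_end: float) -> float:
    """Time the lock can still be trusted under the bounded-drift assumption.

    The elapsed time spent acquiring the lock, plus a drift allowance,
    is subtracted from the TTL. A non-positive result means the lock
    should not be used at all.
    """
    elapsed = acquire_end - acquire_start
    drift = LOCK_TTL * CLOCK_DRIFT_FACTOR
    return LOCK_TTL - elapsed - drift

start = time.monotonic()
# ... talk to the lock servers here ...
end = time.monotonic()
validity = remaining_validity(start, end)
```

The whole question in the thread is whether that 1% bound is an assumption a real computer can honor.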
About you linking to Aphyr: Sentinel was never designed to make Redis strongly consistent, nor to make it able to retain writes. I always analyzed the system for what it is: a failover solution that, combined with the asynchronous replication semantics of Redis, has certain failure modes that are considered fine for the intended use case of Redis. That is, it has just a few best-effort checks to try to minimize data loss, and it has ways to ensure that only one configuration eventually wins (so that there are no permanent split-brain conditions after partitions heal), and nothing more.
I'm not sure what sense it makes to link to this. If you believe Redlock is broken, just state it in a rigorous form, and I'll try to reply.
If you ask me to be rigorous while I'm trying to make clear, with facts, what is wrong about Martin's analysis, you should do the same and just use arguments, not hand-waving about me being lame at distributed systems. Btw, in my blog post I'll detail everything and show why I think Redlock is not affected by unbounded network delays.
"Redlock" is a fancy name, involves multiple nodes and some significant logic, less experienced developers will assume it's bulletproof.
"Redis Sentinel" is a fancy name, involves multiple nodes and some significant logic, less experienced developers will assume it's bulletproof.
I guess I'd prefer that people shoot themselves in the foot with their own automatic failover schemes, instead of using a publicized ready-to-use system whose marketing materials (name, website, spec, multiple implementations) lead them to believe it is bulletproof, so that these systems end up being used for a bunch of general-purpose cases without specific consideration of their failure modes. In this case, the "marketing" is just the fact that you're behind it, and Redis really is solid for what it is.
(Usually I avoid automatic promotion/fail-over altogether. It's often not a good idea. Instead I find some application-specific way of operating in a degraded state until a human operator can confirm a configuration change. So service is degraded for some minutes, but potential calamity of automatic systems doing wrong things with lots of customer data is averted.)
It's really not your responsibility to make sure the open source software you offer to the world for free is used appropriately. But people like me will be annoyed when it contributes to the over-complicated-and-not-working messes (not directly your fault) we have to clean up :)
One specific comment: http://redis.io/topics/distlock notes the random value used to lock/unlock, which prevents unlocking a lock held by another process. But at that point, the action just before the unlock could very well also have been done after the lock was acquired by another process. So while this is a good way to keep the lock state itself from being corrupted, it is really a side note rather than a prominent feature, since if this happens, the data the lock was protecting could well have been corrupted anyway.
"fancy names == people think bulletproof" is not a credible criticism.
Now "Oracle Enterprise Manager Fusion Middleware Control" is a fancy name! Pretty sure no one with a clue thinks a product named like that is bulletproof, and it probably has significant logic/nodes etc.
I apparently have the opposite experience to you. I'll often find complex distributed systems that are painful to troubleshoot when they misbehave, and find it refreshing on the other hand when someone is just using Redis, because typically THAT system is working a lot more predictably.
If people are too inexperienced to realize that ALL systems have tradeoffs, and don't read up on what those are (because apparently the name is fancy), then they'll get burned. Antirez does a pretty good job of explaining and documenting where he sees Redis's limits to be.
Arguing that everyone should be home-baking their own failovers instead of learning the limits of the well-known ones out there doesn't seem like responsible advice.
also note that you need deterministic ordering over the cluster; deterministic ordering on individual nodes does not actually grant you any greater certainty
i believe you're referring to google's chubby lock service with your reference to gps, but the clocks were only used as an optimization in chubby; it still had to implement a consensus protocol
But that's optimistic concurrency control! In that case the locking service is a pure optimization and doesn't affect correctness. So why use this complex multi-node locking thing then, instead of a single redis node?
Along with the author of this blog post, I would recommend the usage of Zookeeper if you have a need for obtaining a lock in a distributed environment. You can read an analysis of how Zookeeper performs under an (intentionally) partitioned network here:
Interestingly, even though HBase uses Zookeeper for coordination, it does not use it for fencing. Instead, fencing is handled by atomic "log-rolling" operations on the HDFS namenode, which can be configured to use its own quorum-based journal. (The equivalent of a "token" is the monotonically-increasing namenode operation sequence number.) So the principle is the same: the system responsible for doing mutual exclusion, and the system that actually stores the data, must be coupled.
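The storage-side principle can be sketched in a few lines; `FencedStore` and its integer token scheme are hypothetical, standing in for the namenode's monotonically increasing sequence-number check:

```python
class FencedStore:
    """Toy store that remembers the highest fencing token it has seen
    and rejects any write carrying an older one."""

    def __init__(self):
        self.highest_token = -1
        self.data = {}

    def write(self, token: int, key: str, value: str) -> bool:
        if token < self.highest_token:
            return False           # stale lock holder; reject the write
        self.highest_token = token
        self.data[key] = value
        return True

store = FencedStore()
assert store.write(33, "file", "from old holder") is True
assert store.write(34, "file", "from new holder") is True
# The old holder wakes up from a long GC pause and retries with its stale token:
assert store.write(33, "file", "late write") is False
assert store.data["file"] == "from new holder"
```

This only works because the store itself enforces the check, which is the coupling the comment describes.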
To achieve unlimited scalability, you need to make sure that no single data store is shared between all processes and that the amount of interprocess communication (on a per-process basis) doesn't increase as you add more processes.
The more sharing you have, the less you can scale your system.
Personally, I find it easier to write highly parallel code than parallel-up-to-a-certain-point code that involves locks like mutexes and/or semaphores. Highly parallel code just takes a little bit of extra planning and thinking up front, but it saves lots of time and stress down the line.
This statement confused me. It seems to say that the packets were delayed in the network for 90 seconds before being delivered. From reading the original sources, it actually sounds like packets were discarded by the switches, so the original requests were discarded, and the nodes were partitioned for 90 seconds. When the partition was removed, both nodes thought they were the leader and simultaneously requested the other to shut down. Can anyone confirm? Keeping packets delayed in a network for 90 seconds would seem quite difficult (though not impossible, assuming certain bugs).
Edit: On re-reading, I think this is just talking about the network stack in general, not the network. A temporary partition may delay delivery of your request until max TCP retries is exceeded on your host; if it's recovered before then, your request may arrive later than you intended.
However, yes, I was thinking about the network stack as a whole. Any kind of retries, e.g. TCP retransmission on timeout, effectively turns packet loss into packet delay (within certain bounds). Thus, even if you have a network interruption ("partition") and all packets are dropped, it could happen that after the interruption is fixed, a node receives packets that were sent before or during the interruption. For this reason, I find it helpful to think of it as delay, not just packet loss.
Edit: Why is this downvoted? It is a serious question.
The reason I'm thinking more in terms of a database-oriented solution is that there is still the problem of crashing. What happens if the system crashes just before the email is sent? The system somehow needs to remember to restart that task (sending the email) when it comes back up. And this is probably best done through a database anyway.
In the scenario you describe, the database commit works, you send the email but... you get an error back from the email send function. Now what?
This is a basic distributed computing problem. Choosing how to solve it can have a drastic effect on your code and your infrastructure. If the secondary function is "charge a credit card" obviously you can't do that twice. But if it's merely "send an email" then maybe it's okay if people get a duplicate.
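One common compromise for the "maybe it's okay if people get a duplicate" case is a deduplication key, sketched below; `send_email_once` and the key scheme are illustrative, not from any particular library:

```python
# Make the side effect idempotent with a deduplication key, so that
# at-least-once retries after a crash are safe to issue.
sent_keys: set[str] = set()
outbox: list[str] = []

def send_email_once(dedup_key: str, body: str) -> bool:
    if dedup_key in sent_keys:
        return False              # duplicate attempt after a crash/retry
    outbox.append(body)           # the actual side effect ("send" the email)
    sent_keys.add(dedup_key)      # record completion *after* the effect
    return True

assert send_email_once("order-17-receipt", "Your receipt") is True
# The worker crashes before marking the task done, and the task is retried:
assert send_email_once("order-17-receipt", "Your receipt") is False
assert len(outbox) == 1
```

Note the window between the side effect and recording it: a crash exactly there still yields a duplicate, which is why this pattern suits emails better than credit-card charges.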
Google "lease", "lock", "deadlock", and "race condition", and read up on why databases tend to make for bad implementations.
Having just re-read Lynch's paper, can you explain what you mean here? I didn't see anything explicitly relying on time; it could be there is some implicit usage I didn't see. Additionally, the paper's impossibility result is about "perfectly correct consensus", which applies with and without clocks, and it then has a positive result for "partially correct consensus" (i.e. not deciding a value is a correct result). I'm not sure which you mean when you say "consensus becomes impossible", as it is either already impossible (the perfectly correct protocol) with one faulty process, or (to my understanding) not dependent on time (the partially correct protocol).
p.s. great article!
However, if you do allow some timing assumptions — just enough to measure a timeout, nothing more — then consensus becomes possible. In this case, the timeout is known as an "unreliable failure detector". See Chandra and Toueg's paper for details.
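In miniature, such a detector is just a timeout check; the 5-second value below is an assumed tuning parameter:

```python
# A node is *suspected* failed if we haven't heard a heartbeat within the
# timeout. The suspicion can be wrong (the node may just be slow), which is
# exactly what makes the detector "unreliable"; Chandra and Toueg show this
# weak primitive still suffices to solve consensus.
TIMEOUT = 5.0  # seconds; an assumed tuning parameter

def suspected(last_heartbeat: float, now: float) -> bool:
    return (now - last_heartbeat) > TIMEOUT

assert suspected(last_heartbeat=0.0, now=10.0) is True   # silent too long
assert suspected(last_heartbeat=8.0, now=10.0) is False  # recently heard from
```

The detector never *knows* the node is dead; it only suspects, and consensus protocols built on it tolerate wrong suspicions.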
In a good algorithm, the safety properties do not depend on timing, only the liveness properties do. Rather than "consensus being impossible" perhaps it would be clearer to say "consensus may never terminate" in an asynchronous system. But "consensus impossible" is the standard way of phrasing this issue in the distributed systems literature.
For example, on page 375: “Crucial to our proof is that processing is completely asynchronous; that is, we make no assumptions about the relative speeds of processes or about the delay time in delivering a message. We also assume that processes do not have access to synchronized clocks, so algorithms based on time-outs, for example, cannot be used.”
I send a packet over the network. I configured the timeout to be 10 seconds. 10 seconds pass. Is the packet truly lost? Or maybe the packet was received, processed, and a response is on the way back, but it takes 20 seconds instead of 10! Or maybe it takes an hour instead of 20 seconds. Or a decade.
We don't know whether the packet is truly lost or is just taking longer than the timeout, and making assumptions is dangerous.
Please consider this answer (which I personally understood as: yes, real systems are always partially synchronous):
"Asynchronous systems have no fixed upper bounds. In practice, systems tend to exhibit partial synchrony, which is described as one of two models by Dwork and Lynch in Consensus in the Presence of Partial Synchrony. "
The specs are submitted for review at https://github.com/openstack/openstack-specs
https://lwn.net/Articles/662140/ is where this also was written up.
As Martin points out though, many (most?) distributed lock implementations are or can be broken in various ways. He hints at one of the fundamental problems - consensus. Many distributed lock implementations fail simply because they cannot achieve reliable consensus, or don't even try to.
That said, distributed locks can be safe and handle reasonably high throughput. The Atomix distributed lock is one example:
Since consensus requires quorum and quorum requires availability, there is a risk that your ability to obtain a lock or learn about lock-related events could be affected by availability, but at least in the case of Atomix, the system is fairly resilient, with auto-failovers to passive or inactive nodes as needed (as compared to, say, a ZooKeeper-based lock).
If you want to acquire some shared resource, then elect an owner for that resource.
When you perform a write, instead of the token, send a hash of the object as previously read. The storage can then compare this against a hash of the resource's current state. If it doesn't match, the lock has expired and the write is not accepted.
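That proposal amounts to hash-based compare-and-swap, which can be sketched like this (the class and its API are made up for illustration):

```python
import hashlib

class HashCheckedStore:
    """Toy store that accepts a write only if the writer's hash of the
    previously-read state still matches the current state."""

    def __init__(self, initial: bytes):
        self.state = initial

    @staticmethod
    def digest(data: bytes) -> str:
        return hashlib.sha256(data).hexdigest()

    def write(self, expected_hash: str, new_state: bytes) -> bool:
        if self.digest(self.state) != expected_hash:
            return False          # state changed since the client read it
        self.state = new_state
        return True

store = HashCheckedStore(b"v1")
h = store.digest(store.state)     # client reads and hashes the object
assert store.write(h, b"v2") is True
# A client holding an expired lock still has the hash of the *old* state:
assert store.write(h, b"v3") is False
```

As noted, the store then only tracks the resources themselves, at the cost of rejecting any concurrent modification rather than just stale lock holders.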
This would reduce the state to keep track of to the resources themselves.
Generally when I think of this problem I reach for etcd, not Zookeeper first, in the hopes of it being lighter (with a relatively vague definition of "light"), and easier to use.
The author's solution is to push serialization logic into the resource/storage layer (by checking fencing tokens). But what if the resource is itself distributed? Then it needs its own synchronization mechanism? It's locks all the way down.
As I stated, though, if the resource being protected is either a distributed system itself, or a system that cannot support fencing logic, this failure mode is difficult or impossible to prevent. The frequency of failure should be kept in mind here: most services can probably guarantee 99.99% uptime against the likelihood of 5 minute GC pauses.
So you can do fencing with consul.
node.modifiedIndex appears sensible for fencing.
Consensus can be achieved without clocks using blockchains: instead of time-locking the resource, a client can broadcast its altered version to the network after making changes. The next client can then start working on top of this changed resource, but since multiple versions might be floating around (due to timing issues in the first place), the longest tree wins. So if a client submits work done on a shorter tree, it is rejected by the network.
This has other issues, such as higher latency and reorganization risks, but it does away with clocks altogether by providing a different method of timestamping.
This is similar to proof-of-work timestamp server that bitcoin uses but we can do away with proof-of-work because the resource and membership in the network is centralised.
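The "longest tree wins" rule described above can be sketched as a trivial chain-selection function (no proof-of-work, per the centralised-membership assumption; chains here are just lists of version strings):

```python
# Each client builds on the tip it has seen; everyone converges on whichever
# branch has grown longest. Ties are broken arbitrarily by max().
def pick_winner(chains: list[list[str]]) -> list[str]:
    return max(chains, key=len)

chain_a = ["v1", "v2a"]
chain_b = ["v1", "v2b", "v3b"]   # this branch got extended first
assert pick_winner([chain_a, chain_b]) == chain_b
# Work submitted on the shorter branch (chain_a's tip) loses and is rejected:
assert pick_winner([chain_a, chain_b]) != chain_a
```

The reorganization risk mentioned above is visible here: a client on `chain_a` only learns its work lost after the longer branch propagates.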
regardless, blockchains as implemented in certificate transparency and bitcoin aren't actually consensus systems; they are just (highly accurate) approximations
We are talking about clocks that tell time in the quoted passage (time is the ordering of events, by definition), and consensus in Bitcoin is as final (accurate) a practical approximation as the centralised system described in the OP. Either I am misunderstanding, or this is just your dislike for Bitcoin.