
Call me maybe: Redis redux - llambda
http://aphyr.com/posts/307-call-me-maybe-redis-redux
======
antirez
What Aphyr tested was not my "toy" example model, which was never proposed as
something to actually implement, but only to show that WAIT per se is neither
broken nor correct on its own: it is just a low-level building block. The
consistency achieved depends on the whole system, especially the safety
guarantees of the failover procedure.

What I proposed is a toy system as described here:
[https://gist.github.com/antirez/7901666](https://gist.github.com/antirez/7901666)

It is a toy because it assumes a super-strong coordinator that can partition
away instances, is itself never partitioned, can reconfigure clients magically,
and so forth. Under those assumptions, I believe the theoretical system is
trivially capable of reaching linearizability.

Aphyr tested a different model, with an implementation that cannot even
guarantee some of the model's weak assumptions (for example, the slave resets
the replication offset to 0 when restarted), so I'm not sure what the result
means.

I could test the actual model I proposed even with Redis, by manually following
the steps I outlined in the Redis mailing list thread. The point was that if
you can guarantee certain properties in the system, data and higher offsets are
always "transferred" to a majority of replicas, and the system becomes strongly
consistent.
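
A minimal sketch of that building-block usage (an illustration on my part,
assuming a 3-node group and the redis-py client; the coordinator and failover
machinery the gist assumes are not shown): a write counts as committed only if
the master plus the replicas that acknowledged it form a majority of the group.

```python
import redis

# Illustrative setup: a 3-node group (one master, two replicas).
N_NODES = 3
NEEDED_ACKS = N_NODES // 2   # replica acks so that master + acks form a majority

r = redis.Redis(host="localhost", port=6379)

def write_with_majority_ack(key, value, timeout_ms=1000):
    r.set(key, value)
    # WAIT blocks until NEEDED_ACKS replicas have acknowledged the replication
    # stream up to this write, or until the timeout; it returns the count reached.
    acked = r.execute_command("WAIT", NEEDED_ACKS, timeout_ms)
    return acked >= NEEDED_ACKS   # caller decides what to do on failure

print(write_with_majority_ack("some-key", "some-value"))
```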

Those properties are hard to achieve in practice once you try to move the
features of the mythical super-strong coordinator into the actual system, which
is why, for example, Raft uses epochs and other mechanisms to guarantee both
safety and liveness.

Unfortunately the focus is on showing other people are wrong, without even
caring where the discussion is headed.

--- EDIT ---

Btw, now that I've read the full post carefully: Aphyr also cherry-picked parts
of the thread to construct a story that does not exist, as if I were going to
implement strong consistency in Redis based on the proposed toy system, which
was only useful to show that WAIT per se is not a system, just a building
block. Note that yesterday I wrote the opposite in my blog: that there is no
interest in strong consistency in Redis Cluster.

Very unfair IMHO... At first I read only the analysis part, and thought this
was just a "let's check this model against the current implementation anyway".

~~~
mmcnickle
So to clarify: you see WAIT as a replication primitive. It can be used as part
of a larger scheme with a strong coordinator to provide strong consistency.

~~~
antirez
Yes, with a strong coordinator or with a distributed system that can provide
similar guarantees. But currently there is no plan to add this to Redis
Cluster, for three main reasons:

* Our main business is low latency, so very few users will use synchronous replication.

* Redis Cluster is composed of multiple master-slave groups, each holding a subset of the key space. This means that if you need a majority of replicas to promote a new master in order to achieve consistency, after a partition you may end up with different hash slots having their majority on different sides of the partition.

* With the current weaker consistency guarantees, Redis Cluster can elect a replica that is isolated from the others but is on the right side of the partition, where the majority of the other masters are. Similarly, you can have just two total nodes for every hash slot (a setup that I believe will be widely used) and still get some availability if the master fails.

So the consistency premises and tradeoffs of Redis Cluster are not exactly
compatible with strong consistency.

Is WAIT still a useful tool? I believe yes, if documented as such:

WAIT in the context of Redis Cluster / Sentinel cannot provide strong
consistency, but it lowers the probability that a failure mode resulting in
data loss will occur.

A trivial example: what happens if a client is partitioned away with just a
master? Without WAIT there is a window of NODE_TIMEOUT in which data can be
lost; with WAIT this problem does not exist in this specific failure scenario.
There are still a number of failure modes with WAIT, but they are fewer than
the full set of failure modes without it, so in practice it does not provide
strong consistency but does give the user a smaller probability of data loss.

Another example: manual failover for some kind of very important data. You may
have 5 nodes and use WAIT 4 to always write everywhere and improve durability.

And so forth.
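
A minimal sketch of that last pattern, assuming one master with four replicas
and the redis-py client; the key name and timeout are illustrative:

```python
import redis

r = redis.Redis(host="localhost", port=6379)

def durable_set(key, value, timeout_ms=2000):
    r.set(key, value)
    # Ask for acknowledgement from all 4 replicas; WAIT returns how many
    # actually acknowledged within the timeout.
    acked = r.execute_command("WAIT", 4, timeout_ms)
    if acked < 4:
        raise RuntimeError("only %d/4 replicas acknowledged the write" % acked)

durable_set("important:key", "value")
```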

------
rdtsc
> Ultimately I was hoping that antirez and other contributors might realize
> why their proposal for a custom replication protocol was unsafe nine months
> ago, and abandon it in favor of an established algorithm with a formal model
> and a peer-reviewed proof, but that hasn’t happened yet. Redis continues to
> accrete homegrown consensus and replication algorithms without even a
> cursory nod to formal analysis.

That is kind of my feeling too. Redis is an outstanding product with a
beautiful code base. This replication feature has been tough, though. That is
partly due to external factors, as I mentioned in the previous post. Everyone
and their cousin are talking about distributed databases; everyone likes CAP,
CRDTs, Vector Clocks, Raft, Zookeeper and so on. It is hard to come out and say
"Here, I have made this custom replication protocol." Everyone stares and asks,
"Hey, where is your whitepaper or your partition-tolerance tests?" Five to
seven years ago, there would have been only nods and approval. The other aspect
is that this is a database, so it potentially touches users' valuable data. If
that data gets lost, whether through a bug, miscommunication in the docs, a bad
default, anything, it will not be taken lightly.

In the end I think it is fine to have it for what it is, with warnings and
disclaimers that data could be lost, and without papering over or hiding
issues.

As an extra side note: simply put, partition tolerance is hard. Net-splits are
the devil of the distributed world. Some claim they don't exist or don't happen
often. Others fear and tremble when their name is mentioned. When a partition
does happen, it means having to resolve conflicts, throw away user data, or
kill your availability by stopping some nodes from accepting writes in order to
provide consistency. This is a tough test (the one Aphyr runs), and not very
many databases fare well in it. But it is good that these things are discussed.

~~~
jasonwatkinspdx
I think the blog post is valuable, but I'd point out that aphyr is also not
publishing formal proofs or using the verification he suggests is necessary. I
don't think this is intentional, but they should realize and own that they are
borrowing the authority of formal methods they are not demonstrating, while
criticizing others for that same lack of demonstration.

~~~
aphyr
Well, this one's a proof by construction, so I don't feel the need to go
particularly deep into the math. If you prefer a more rigorous approach, take a
look at the TLA+ proofs earlier in the Jepsen series; I think I linked one on
Redis with async replication in the article.

~~~
jasonwatkinspdx
I read and enjoyed them, thanks. I'd just like people to be in a more
collaborative vs combative mode, and I think a big part of that is not overly
relying on rhetoric or overreaching what you've actually demonstrated. Using
formalism on a blog is commendable, but it's also not the same thing as a
properly reviewed paper. People should keep straight the levels of dialog here
and value them each for the light they shed. This has been mostly a productive
exchange despite a few "yer doin' it wrong bro" attitude folks, and I just
want to throw my hat in to strongly advocate for keeping it that way.

~~~
davidw
As a sanity check: when a phrase sounds in character for "Comic Book Guy", you
know it's probably a bit too far on the mean-spirited side of things.

------
derefr
To rephrase antirez from the previous thread:

People use Redis, in large part, for its time and space guarantees on data
structure operations. (Without those, you may as well be using a serialized
object store.) Strong consistency requires rollbacks, and the bookkeeping
necessary to do rollbacks throws away the time and space guarantees. So either
you have strong consistency, or you have Redis, but you don't get both.

But Redis Cluster is a compromise: something which is roughly good enough for
_most cases_ people actually use Redis for, while failing horribly at things
Redis isn't used for anyway, and still providing Redis's time and space
guarantees.

Theorists balk, because there are obvious places where Redis Cluster falls
down, and they can demonstrate this. Engineers shrug, because Redis isn't
being used in their companies in such a way that those demonstrations are
relevant to their problems.

Most people who need Redis Cluster have already Greenspunned a Redis Cluster
themselves, and they're already happily living with the compromise it entails.
They'll gladly hand the support burden of writing cluster-management code
upstream to antirez; it won't change any of the facts about the compromise.

~~~
moe
Most companies also simply don't have Redis-type data problems that can't be
solved by throwing a pair of 768GB ($40k) or 2TB ($120k) servers at them.

When that option is available it tends to beat complex software solutions in
every way.

------
justin66
A person who found themselves sympathetic to the kind of hand-wavey feel-good
explanation of things in yesterday's Redis thread might find this conclusion
kind of snotty:

> I wholeheartedly encourage antirez, myself, and every other distributed
> systems engineer: keep writing code, building features, solving problems–but
> please, please, use existing algorithms, or learn how to write a proof.

That person should be sure to note these experimental results:

> These results are catastrophic. In a partition which lasted for roughly 45%
> of the test, 45% of acknowledged writes were thrown away. To add insult to
> injury, Redis preserved all the failed writes in place of the successful
> ones.

~~~
antirez
Regardless of the fact that a random, meaningless model was tested? This does
not appear to be very formal to me.

~~~
justinsb
I would treat this as a great bug report. Aphyr has done a lot of work,
showing that a (reasonable to me) formal interpretation of your gist produces
catastrophically bad results.

If you're unhappy with the formal interpretation of your gist, publishing the
formal interpretation you intended (or, better yet, code) would allow others
to build the system you actually intended.

~~~
antirez
1) The thing that Aphyr tested is not the gist.

2) There was never any plan for it to go into the Redis implementation. It was
just an argument in a mailing list thread to show that synchronous replication,
as implemented by WAIT, depends on the rest of the system.

What bug report are we talking about?

~~~
justinsb
I was saying that the blog post is the best bug report I've ever seen. To
extend the analogy: in my mind, the way you're treating it is like closing it
as "INVALID" without any comment, which tends to annoy bug reporters :-)

If your argument is #1 (that Aphyr tested the wrong thing), then a reasonable
reply would be to provide the model you did intend. If you tested it as well,
that would be great, but it is reasonable to require the "bug submitter" to
retest.

If your argument is #2 (that WAIT is just best-effort replication, and that it
does not provide any guarantees) then that's fine, just say so clearly. But
you should then stop disputing the model that Aphyr tested, because to do so
implies the existence of a model which does provide guarantees.

------
mjs
Note that antirez's reply (in the comments) begins with "thanks to Aphyr for
spending the time to try stuff, but the model he tried here is not what I
proposed..."

------
rcoh
As a gut check: if you're solving a replication problem for which you'd
consider using Paxos, be /very/ wary, and reason extremely carefully about why
your weaker solution will provide the same guarantees. Chances are it will fail
under certain network outages or system failures.

~~~
erichocean
And if you're considering Paxos, PaxosLease[0] is a nice, fast variation that's
much less complex to actually implement in a production setting, and it doesn't
require coordinated clocks (clocks only need to be consistent, i.e. monotonic,
locally).

It's amazing how far you can get with a reliable leader election protocol in a
distributed system. We're getting to the point where all of the other stuff is
just picking specific algorithms with specific tradeoffs, much like you do
today in a non-distributed setting.

[0] [http://arxiv.org/pdf/1209.4187.pdf](http://arxiv.org/pdf/1209.4187.pdf)
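
A toy sketch of that locally-monotonic-lease idea (this is not PaxosLease
itself: it omits ballots, message loss, retries, and the proposer-side safety
margin, and all names here are made up for illustration):

```python
import time

class LeaseAcceptor:
    """Grants a time-bounded lease; expiry is judged only on this node's own
    monotonic clock, so no cross-node clock synchronization is required."""

    def __init__(self, lease_seconds=10.0):
        self.lease_seconds = lease_seconds
        self.holder = None
        self.granted_at = None   # local monotonic timestamp of the last grant

    def _expired(self):
        return (self.holder is None or
                time.monotonic() - self.granted_at >= self.lease_seconds)

    def try_acquire(self, node_id):
        # Grant (or renew) only if no unexpired lease is held by another node.
        if self._expired() or self.holder == node_id:
            self.holder = node_id
            self.granted_at = time.monotonic()
            return True
        return False

def elect(acceptors, node_id):
    """A node may act as leader only if a majority of acceptors granted it the
    lease, and it must stop leading before that lease can have expired."""
    granted = sum(a.try_acquire(node_id) for a in acceptors)
    return granted > len(acceptors) // 2

# Toy usage: three acceptors, two candidates racing for the lease.
acceptors = [LeaseAcceptor(lease_seconds=10.0) for _ in range(3)]
print(elect(acceptors, "node-A"))   # True: A gets a majority
print(elect(acceptors, "node-B"))   # False: B is blocked until A's lease expires
```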

------
banachtarski
Guys, use Redis for your _real time data_. Why else would you care about
having the benefits of in-memory speed? Jesus. If a partition happened to my
redis setup, you know what I'd do? Trash the whole thing and start again.

------
falcolas
I'm not personally very familiar with Redis or its HA tooling, but (based
solely on reading this article) they seem to suffer from problems (improper
handling of non-quorum situations) that have been solved by tools like
Pacemaker and Corosync.

Has anyone attempted to use Pacemaker to wrangle Redis instances?

