

Call me maybe: etcd and Consul - nwjsmith
http://aphyr.com/posts/316-call-me-maybe-etcd-and-consul

======
philips
Thanks to aphyr for doing this sort of testing. It is important that we not
only verify etcd with our own internal testing but also have dedicated third
party feedback. From the beginning we have wanted something that was simple
and worked correctly.

It is also great to see etcd showing up in lots of interesting projects like
skydns and kubernetes. I think we have built something that is not just a
great building block for CoreOS but the OSS community at large.

Thanks to everyone who has helped get the project to where it is today; there
is a bright future ahead.

------
dinedal
I think the most surprising fact, despite all the amazing work here, is that
Comcast helps pay for this kind of research!

Other than that, it's great that we are getting some good results after
serious vetting for more modern replacements to ZK. I may not like ZK's
crufty-ness, but I could always trust it. Now with these results, I am
seriously going to consider Consul as a replacement.

~~~
hallmark
In fact, Comcast has a formal program to fund research grants and open source
work. There is a lot more "real" technology in place here at Comcast than one
may expect for a telecommunications and entertainment provider.

[http://techfund.comcast.com/](http://techfund.comcast.com/)

The ideal project would be something used within Comcast that doesn't
otherwise have a corporate sponsor.

------
antirez
This sounds like a report of a bug, but I believe this is not the actual
story. It is more a report of a design tradeoff: the authors of those CP
systems completely understand what happens, but were not happy to pay this
performance price for reads. It is one thing to have a data store with very
limited write performance but very fast reads; it is another to have a data
store where both writes and reads are very slow. However, once you read
potentially stale data from nodes, many of the advantages of having a CP
system are gone. IMHO reverting those systems to a default where reads are
applied to the state machine like writes is the sanest thing to do, even if
options to read potentially stale data are also useful in some contexts.

~~~
lomnakkus
> This sounds like a report of a bug, but I believe this is not the actual
> story. It is more a report of a design tradeoff: the authors of those CP
> systems completely understand what happens, but were not happy to pay this
> performance price for reads

If the authors were aware of these issues then the documentation was
dangerously misleading[1] and they should be docked points for that.

[1] As reported by aphyr; I haven't read through it all myself. I'm thinking
primarily of the labeling of "read from leader without going through log" as
"consistent".

~~~
antirez
That's why I think this is a design decision in both cases:

In one of these products (etcd if I remember correctly) there was a clear
statement in the documentation about these semantics, and anyway, anyone who
implements Raft knows that for reads to be consistent they need to go through
the same path as writes. In the Raft paper you can find a whole section about
this.

If you check the paper, the following points are clearly stated:

Leaders can't reply to read queries without doing additional checks otherwise
the reads are not linearizable.

For the reads to be linearizable, the following two things must be performed
by leaders.

1) Commit a NOP at the start of its term, which is not a problem from a
performance point of view. The problem is "2".

2) A leader needs to check if it is still the leader before every read, and
this requires contacting a majority. That's the performance problem of
linearizable reads, because you pay a latency equal to that of the slowest
reply among the N/2+1 acks you need.

However note that even linearizable reads _don't require_ fsync() to be
called, so they are still better than writes.
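
As a rough illustration of point 2, here is a minimal Python sketch of the latency arithmetic. It assumes the leader counts itself toward the majority and contacts all followers in parallel; the function name and the latency numbers are made up for illustration and are not taken from etcd, Consul, or the Raft paper.

```python
def quorum_read_latency(follower_latencies_ms):
    """Latency of one leadership check before a linearizable read.

    Assumptions: the cluster is the leader plus the given followers, the
    leader counts as one member of the majority, and all heartbeats are
    sent in parallel.
    """
    cluster_size = len(follower_latencies_ms) + 1
    majority = cluster_size // 2 + 1
    acks_needed = majority - 1  # the leader already counts itself
    if acks_needed == 0:
        return 0.0  # single-node cluster: nothing to wait for
    # The read waits for the slowest of the fastest `acks_needed` replies.
    return sorted(follower_latencies_ms)[acks_needed - 1]


# Hypothetical 5-node cluster: the leader plus four followers.
print(quorum_read_latency([2.0, 40.0, 3.0, 150.0]))
```

Under these assumptions the slow followers do not matter as long as a majority answers quickly, but every read still pays at least one round trip.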

------
Dave_Rosenthal
Wow. I thought of etcd as taking consistency very seriously. Aphyr's
discoveries are indeed quite surprising: "The very first test I ran
reported a linearizability failure. I was so surprised I spent another week
double-checking Knossos and Jepsen, then writing my own etcd client, to make
sure I hadn’t made a mistake. Sure enough, etcd’s registers are not
linearizable."

~~~
leorocky
I'm not sure what you've added here other than to exclaim surprise. Bugs can
be surprising; bugs are usually not intentional. Given the fast response by
the etcd team, they seem to be taking consistency very seriously.

~~~
aaronblohowiak
This was not a bug, it was a design decision (source:
[https://github.com/coreos/etcd/issues/741](https://github.com/coreos/etcd/issues/741)).
You can get stale data in some circumstances because they wanted to avoid
the performance penalty of ensuring the latest really is the latest value from
a quorum of followers.
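
To make the stale-read case concrete, here is a toy Python sketch, not a Raft implementation. The `Node` class, the `reachable` set, and both read functions are hypothetical names for illustration. "Can reach a majority" stands in for the real leadership check, which would also compare terms.

```python
class Node:
    """Toy replica; not a Raft implementation."""

    def __init__(self, name):
        self.name = name
        self.value = None
        self.reachable = set()  # names of peers this node can currently contact


def local_read(node):
    # Fast path: answer from local state without asking anyone.
    return node.value


def quorum_read(node, cluster_size):
    # Slow path: refuse to answer unless a majority (including this node)
    # is reachable. Real Raft also compares terms; that is omitted here.
    majority = cluster_size // 2 + 1
    if 1 + len(node.reachable) < majority:
        raise RuntimeError(f"{node.name} cannot confirm it is still leader")
    return node.value


a, b, c = Node("a"), Node("b"), Node("c")
for n in (a, b, c):
    n.value = 1  # value replicated everywhere before the partition

# Partition: a is cut off while still believing it is the leader.
a.reachable = set()
b.reachable = {"c"}
c.reachable = {"b"}
b.value = c.value = 2  # new leader b commits a write on the majority side

print(local_read(a))      # stale: a never saw the new write
print(quorum_read(b, 3))  # served by the majority side
try:
    quorum_read(a, 3)
except RuntimeError as err:
    print(err)            # the isolated node refuses instead of answering
```

The fast path keeps answering with the old value on the isolated node, while the quorum-checked path trades availability and latency for freshness.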

~~~
teraflop
There's the design decision, and then there's the incorrect documentation that
says reads with "consistent=true" are guaranteed to return the latest value.

[https://github.com/coreos/etcd/blob/master/Documentation/api...](https://github.com/coreos/etcd/blob/master/Documentation/api.md#read-consistency)

------
rubyn00bie
Small pet peeve, please for the love of god and people stop with the "call me
maybe" titles... they aren't cute, funny, or anything but a waste of
characters. By the title alone, I assume the content to be juvenile (despite
already knowing the content of _that_ blog is anything but).

~~~
syntern
It is not about being funny, "call me maybe" is very relevant in the eventual
consistency world. And if you know any other article by aphyr, you jump
immediately to read it.

~~~
rubyn00bie
"It is not about being funny, 'call me maybe' is very relevant in the eventual
consistency world."

That's not fact, please don't state it as such. That's your opinion on modern
day vernacular influencing an unrelated field/topic.

Perhaps you didn't actually read what I wrote, but I said I'm aware of how
good their content is/can be. My issue is that a blog of that caliber is using
something in our modern vernacular that has been beaten to death and adds no
actual value.

It'd probably do you best to read what people write instead of putting words
in their text or assuming things outside of the scope of what they said...

~~~
syntern
You may criticize the author for their choice of words. You may disagree with
what I state or don't state. But telling people that they should add value
while all you do is trash them is a bit controversial, and it has no place
here.

On the actual critique: if you have worked with an eventually consistent
database, a 'Call' record's presence is really a great pun on the 'call me
maybe' phrase. I'm really sorry if you don't appreciate that part either.

