
Jepsen: Etcd 3.4.3 - aphyr
https://jepsen.io/analyses/etcd-3.4.3
======
jzelinskie
Congrats on the community blog launch! <3

>We would love to see someone in the etcd community integrate the etcd Jepsen
tests directly into the existing etcd release pipeline.

I consider this to be an issue of higher priority than any of the bugs they
just found, because this will ensure preventable bugs don't crop up in the
future. It's shocking to me that Jepsen goes through all the effort and then
very few projects build a permanent pipeline for it. It's debatable these bugs
would've existed if a Jepsen pipeline had been consistently in use from the
0.4.x days. I'm sure it's no simple task, but neither is a lot of the existing
testing infrastructure for etcd.

~~~
aphyr
> It's debatable these bugs would've existed if a Jepsen pipeline had been
> consistently in use from the 0.4.x days.

I don't think it would have helped: the Jepsen tests I wrote in 2014 only
checked single-key gets, puts, and CaS operations; the problems we found in
this report were in watches and locks.

~~~
jzelinskie
That's a good point that I hadn't remembered (first-class locks weren't
totally baked back then), but I think the intent of my original comment holds
true: we should have valued the Jepsen test suite more and continuously
leveraged and improved our usage of it, rather than doing one-off tests every
now and then. As a result, the community has no idea if there were regressions
between now and then. I will admit what I don't know: I have no idea if this
was actually feasible or desirable for us or you at the time, but I'd feel
more comfortable if every release of etcd, Zookeeper, etc... had some kind of
Jepsen stamp of approval on it in terms of API coverage. I'm pretty sure
CockroachDB has tried this[0], but it's been a few years and I don't know how
it turned out for them long term.

[0]: [https://www.cockroachlabs.com/blog/diy-jepsen-testing-cockroachdb/](https://www.cockroachlabs.com/blog/diy-jepsen-testing-cockroachdb/)

~~~
irfansharif
CockroachDB runs the Jepsen test suite nightly. We've been following along
with Aphyr's recent test additions (`multi-register` for instance, which
immediately caught [0]), porting them over when appropriate. We definitely
have work to do incorporating the more DDL-focused tests that tripped up
YugaByte.

[0]:
[https://github.com/cockroachdb/cockroach/pull/40600](https://github.com/cockroachdb/cockroach/pull/40600)

~~~
breakingcups
That's pretty damn cool. I wish I had more time to seriously try out
CockroachDB.

------
kodablah
> This is, apparently, not correct. Asking for revision 0 causes etcd to
> stream updates beginning with whatever revision the server has now, plus
> one, rather than the first revision. Asking for revision 1 yields all
> changes. This behavior was not documented.

I had worked on an alternative etcd impl and had to work around this assumption
as well. It is technically documented in the proto[0], and numeric 0 is of
course "unset" or "default" in proto3 land.

One thing I would like to see tested is nested transactions where one txn
child mutates something then the second sibling txn child uses that something.
I've found that implementation is lacking.

0 - [https://github.com/etcd-io/etcd/blob/53f15caf73b9285d6043009fa64c925d5a8f573c/etcdserver/etcdserverpb/rpc.proto#L684](https://github.com/etcd-io/etcd/blob/53f15caf73b9285d6043009fa64c925d5a8f573c/etcdserver/etcdserverpb/rpc.proto#L684)
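To make the quoted behavior concrete, here is a minimal toy model, not etcd's actual API. `ToyStore`, `put`, and `history_from` are hypothetical names, and numbering the first write as revision 1 is a simplification. It only illustrates why a start revision of 0 returns no history while 1 returns everything.

```python
from dataclasses import dataclass, field


@dataclass
class ToyStore:
    """Hypothetical in-memory revision log; a stand-in, not etcd's API."""

    events: list = field(default_factory=list)  # events[i] carries revision i + 1

    def put(self, key, value):
        # Each write gets the next revision number (simplified numbering).
        self.events.append((len(self.events) + 1, key, value))

    def history_from(self, start_revision):
        # A proto3 scalar cannot distinguish "unset" from 0, so 0 is
        # interpreted as "start after the current revision".
        if start_revision == 0:
            start_revision = len(self.events) + 1
        return [e for e in self.events if e[0] >= start_revision]


store = ToyStore()
store.put("a", 1)
store.put("a", 2)
store.put("b", 3)

print(store.history_from(0))  # no historical events
print(store.history_from(1))  # every change since the first write
```

In this model a caller who wants the full history has to ask for revision 1 explicitly, which is the surprise the report describes.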

~~~
aphyr
This is also one of the things I suggested that'd be nice to have in the etcd
API, along with an AST for boolean operations over guard expressions, more
flexible comparators etc. You can emulate these with a sequence of independent
transactions, but it'd be nice if the micro-transaction system were a tad more
general.

------
dang
There's also
[https://etcd.io/blog/jepsen-343-results/](https://etcd.io/blog/jepsen-343-results/),
via
[https://news.ycombinator.com/item?id=22191925](https://news.ycombinator.com/item?id=22191925).

------
candiddevmike
This is a powerful validation for etcd and its status as a mission critical
backend. You don't see a lot of positive Jepsen results these days!

~~~
shaklee3
I didn't think it was too positive. He found some documentation bugs, as well
as what seem like really bad locking bugs.

~~~
senderista
It is not possible to implement distributed locking correctly in the general
case. Fencing tokens only work if your data store supports RMW operations, in
which case you could just implement locking inside the data store itself.
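A rough sketch of the fencing-token point, under the assumption that the store can atomically compare the incoming token against the highest one it has seen and then write (the read-modify-write step). `FencedStore` and its methods are hypothetical names for illustration only.

```python
import threading


class FencedStore:
    """Toy store that rejects writes carrying a stale fencing token."""

    def __init__(self):
        # The mutex stands in for the store's atomic read-modify-write support.
        self._mutex = threading.Lock()
        self._highest_token = 0
        self.value = None

    def write(self, token, value):
        with self._mutex:
            if token < self._highest_token:
                return False  # an older lock holder woke up late; refuse
            self._highest_token = token
            self.value = value
            return True


store = FencedStore()

# Client A acquires the lock with token 1, then stalls (e.g. a long GC pause).
# The lock expires and client B acquires it with token 2 and writes first.
b_ok = store.write(token=2, value="written by B")
# Client A resumes, still believing it holds the lock.
a_ok = store.write(token=1, value="written by A")

print(b_ok, a_ok, store.value)
```

The check-and-write inside `write` is itself a small critical section in the store, which is the substance of the comment above: once the store can do that, it could host the lock too.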

------
kbenson
_This is, apparently, not correct. Asking for revision 0 causes etcd to stream
updates beginning with whatever revision the server has now, plus one, rather
than the first revision. Asking for revision 1 yields all changes. This
behavior was not documented._

Whoops, looks suspiciously like someone tested the revision integer for
truthiness to see if something was passed.

~~~
lvh
Nope. It's because protobufs can't differentiate between nil/unset and a zero
value in this field:
[https://github.com/etcd-io/etcd/blob/53f15caf73b9285d6043009fa64c925d5a8f573c/etcdserver/etcdserverpb/rpc.proto#L684](https://github.com/etcd-io/etcd/blob/53f15caf73b9285d6043009fa64c925d5a8f573c/etcdserver/etcdserverpb/rpc.proto#L684)

~~~
m0zg
Not all protobufs. Protobuf 2 can do it just fine. The decision to not have
emptiness support is the dumbest part of Proto 3 IMO.

~~~
dilap
I don't think it's that crazy a decision, since it was also kind of crazy to
pay the cost of "was this field set" for every single field, when the vast
majority of the time you're not going to use it.

But it can require a bit more thought in your protocol design. So for
example in this case, the options would be

- design things such that 0 is ok to mean "most current" (so start revisions
at 1) (this is hard after the fact, but if you know from day 0 that missing
values for int types will be 0, you can design everything to start at 1)
(Edit: maybe this is how etcd works?)

- explicitly break out revision into a message type (so you can notice if
it's not provided)

- use something like "-1" to mean "now", so that 0 isn't overloaded

etc...

(You could argue maybe the right call for proto3 was to have a flag on a field
saying if you want to be able to notice if it was provided or not. Best of all
worlds, at cost of a bit of complexity.)
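The difference between a bare scalar and the "message type" option can be sketched with plain Python dataclasses. This is an analogy, not generated protobuf code: `WatchRequestImplicit`, `WatchRequestWrapped`, and `describe` are hypothetical names, and `Int64Value` only mimics the shape of a wrapper message.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WatchRequestImplicit:
    # Scalar with an implicit default: "unset" and 0 look identical.
    start_revision: int = 0


@dataclass
class Int64Value:
    # Stand-in for a wrapper message around a primitive.
    value: int = 0


@dataclass
class WatchRequestWrapped:
    # Message-typed field: absence is observable as None.
    start_revision: Optional[Int64Value] = None


def describe(req):
    """Report what the receiver can actually tell about the caller's intent."""
    if isinstance(req, WatchRequestWrapped):
        if req.start_revision is None:
            return "not provided"
        return f"revision {req.start_revision.value}"
    if req.start_revision == 0:
        return "0 or not provided (ambiguous)"
    return f"revision {req.start_revision}"


print(describe(WatchRequestImplicit()))
print(describe(WatchRequestImplicit(start_revision=0)))
print(describe(WatchRequestWrapped()))
print(describe(WatchRequestWrapped(Int64Value(0))))
```

With the scalar field the first two requests are indistinguishable to the receiver; with the wrapped field an explicit 0 and a missing value are separate cases, at the cost of an extra layer in the schema.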

~~~
gen220
At my place of work, we typically follow the "message type" solution in
situations like this. I don't think it's the most legible solution, but it's
the best we can do with the proto spec: I always feel like I should qualify
these fields with a comment explaining the apparently pointless wrapper.

Google themselves provide
[https://github.com/protocolbuffers/protobuf/blob/master/src/google/protobuf/wrappers.proto](https://github.com/protocolbuffers/protobuf/blob/master/src/google/protobuf/wrappers.proto)
to deal with this situation.

Quoted from the documentation:

> Wrappers for primitive (non-message) types.

> These types are useful for places where we need to distinguish between the
> absence of a primitive typed field and its default value.

It should probably be advertised more, as we've experienced that default
values of optional fields are a surprising feature for smart developers who
are new to protobufs. Maybe it's seen as a wart in the design? Getting rid of
Null is hard.

I kinda wish they went with the approach that all fields are required, unless
they are explicitly declared as optional. This is how Rust does it, and people
seem to like it.

~~~
m0zg
I think you got it backwards. The current situation where unset field results
in a value is the equivalent of "null", it's the opposite of "getting rid of
null". The previous design didn't have that problem.

~~~
gen220
I might be misunderstanding you, but the current situation results in a “zero”
value, whereas “Null” would represent “unspecified”, or the absence of a
value. I wouldn’t say that the current spec supports null, unless you’re using
a wrapper like the one linked.

It’s the distinction between an optional type and an optional value (default
value). With optional values but no optional types, you can’t be certain about
the caller’s intentions. It’s a distinction that’s subtle but important,
therefore a “gotcha”.

Getting rid of null is a noble idea, because of the headaches that null tends
to induce. Optional types (like Rust’s) are a neat way to get the behavior of
null without the value of null. Proto3 doesn’t have null, it’s replaced with
arcane wrapping that’s arguably less straightforward.

Please let me know if I’m talking past you, it isn’t intentional, just late in
the day. :)

------
Gepsens
For me, etcd is the go to for dynamic configuration management.

------
hprotagonist
The corresponding jepsen post:
[http://jepsen.io/analyses/etcd-3.4.3](http://jepsen.io/analyses/etcd-3.4.3)

I do no work at all in this area, but I love these reports. They're examples
of well-written, clear, "engineer-mind" reports that we would all do well to
emulate.

~~~
cpitman
I agree they are great. I kind of miss the older style of reports though, with
memes all over the place!

------
continuations
I mostly see etcd being used to store metadata & configuration data for
distributed systems.

Can etcd also be used as a general distributed database like FoundationDB or
ScyllaDB? If so, how does it compare to those other options?

~~~
gregwebs
Etcd is designed to be able to cache its entire working set in memory.

[https://etcd.io/docs/v3.3.12/dev-guide/limit/](https://etcd.io/docs/v3.3.12/dev-guide/limit/)
[https://github.com/etcd-io/etcd/blob/master/Documentation/op-guide/hardware.md#memory](https://github.com/etcd-io/etcd/blob/master/Documentation/op-guide/hardware.md#memory)

TiKV, a distributed kv store that can store many terabytes of data, actually
uses etcd internally for metadata storage.

~~~
fulafel
Most databases fit in memory though.

~~~
aphyr
Many do, and even large DBs can be accommodated by throwing terabytes of RAM at
the problem, but etcd is fundamentally targeted at a different problem domain
than a general-purpose database. Everything in etcd is represented by a
single-threaded state machine, which is great for simplicity and correctness,
but that comes with performance limitations.

------
shaklee3
As usual, Kyle provides an awesome write-up. As much as I miss the old, funny
prose, the level of detail is still unmatched.

------
senderista
The elementary confusions apparent in the quoted documentation do not inspire
confidence in the design or implementation.

~~~
jupp0r
If you look at other reports by Jepsen (elasticsearch and mongodb are real
gems), you might find that this one found very few very minor bugs and
documentation problems. Etcd is pretty solid for what it does.

edit: seems like my recollection of the mongodb one is the 2.4.x one [1] and
the later ones are much better. Also mongo included the Jepsen test suite into
their CI.

[1] [https://aphyr.com/posts/284-call-me-maybe-mongodb](https://aphyr.com/posts/284-call-me-maybe-mongodb)

------
nif2ee
How many nodes can etcd handle without a noticeable decay in performance?
Their FAQ says 7, but has anybody used it in a distributed app other than k8s
with more nodes? Assume that most of the operations are get and watch (i.e.
write/read <<< 1.0): how big a cluster, in terms of number of nodes, can we
scale up to?

~~~
ideal0227
To improve read perf, you can add learner members or add a layer of cache
proxy.

~~~
nif2ee
Thanks! I had never heard of the learner member feature until now. I was
asking for an app where most of the nodes are issuing watch operations while
only a few other nodes do PUTs due to human intervention. This means that I
have a very low write/read ratio. Also, I assume that the number of nodes is
usually stable, so it's not a very dynamic system where nodes join and leave
frequently. Does this make it easier to have a cluster of 50-100 nodes in
different datacenters without breaking etcd down?

------
jascii
Going off the name I thought it would be something to manage Unix
configurations (/etc Daemon)

~~~
klysm
It’s kind of used for things of that nature in a distributed context

------
jinqueeny
For the impatient: “The etcd key-value store is a distributed database based
on the Raft consensus algorithm. In our 2014 analysis, we found that etcd
0.4.1 exhibited stale reads by default. We returned to etcd, now at version
3.4.3, to investigate its safety properties in detail. We found that key-value
operations appear to be strict serializable, and that watches deliver every
change to a key in order. However, etcd locks are fundamentally unsafe, and
those risks were exacerbated by a bug which failed to check lease validity
after waiting for a lock.”

