
Etcd v3: increased scale and new APIs - philips
https://coreos.com/blog/etcd3-a-new-etcd.html
======
sandstrom
Sounds interesting! What are the benefits of using Etcd over Consul?
([https://www.consul.io/](https://www.consul.io/))

~~~
ideal0227
I think the two projects focus on different things.

Consul provides features like health checking and failure detection in
addition to its consistent key-value store. It aims to provide an all-in-one
solution[0].

etcd focuses on the consistent key-value store. The key-value store has more
advanced features like multi-version keys and reliable watches, and provides
better performance. People build additional features on top of etcd's
key-value store/raft.
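
To make "multi-version keys" concrete, here is a minimal in-memory sketch in Go. It is a toy under stated assumptions, not etcd's implementation or API: the `Store`, `Put`, and `GetAt` names are hypothetical, and the only idea it shows is that every write gets a store-wide revision and older values stay readable at their revision.

```go
package main

import "fmt"

// version is one historical value of a key, tagged with the store-wide
// revision at which it was written.
type version struct {
	rev   int64
	value string
}

// Store is a toy multi-version key-value store (hypothetical, not etcd's API).
type Store struct {
	rev  int64                // current store-wide revision
	hist map[string][]version // per-key history, ordered by revision
}

// NewStore returns an empty store at revision 0.
func NewStore() *Store {
	return &Store{hist: make(map[string][]version)}
}

// Put records a new version of key and returns the revision assigned to it.
func (s *Store) Put(key, value string) int64 {
	s.rev++
	s.hist[key] = append(s.hist[key], version{rev: s.rev, value: value})
	return s.rev
}

// GetAt returns the value key had as of revision rev, if it existed then.
func (s *Store) GetAt(key string, rev int64) (string, bool) {
	versions := s.hist[key]
	// Walk backwards to find the newest version that is not newer than rev.
	for i := len(versions) - 1; i >= 0; i-- {
		if versions[i].rev <= rev {
			return versions[i].value, true
		}
	}
	return "", false
}

func main() {
	s := NewStore()
	r1 := s.Put("color", "red")
	s.Put("shape", "circle")
	r3 := s.Put("color", "blue")

	old, _ := s.GetAt("color", r1)
	cur, _ := s.GetAt("color", r3)
	fmt.Println(old, cur) // prints "red blue"
}
```

Keeping old versions around is also what lets a watcher that was disconnected resume from a known revision instead of missing updates, at least in this simplified model.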

(I work on etcd)

[0]
[https://www.consul.io/intro/vs/index.html](https://www.consul.io/intro/vs/index.html)

~~~
otterley
Do you have proof of the "better performance" claim?

~~~
ideal0227
Consul:
[https://github.com/hashicorp/consul/blob/master/bench/result...](https://github.com/hashicorp/consul/blob/master/bench/results-0.3.md)

etcd:
[https://github.com/coreos/etcd/blob/master/Documentation/op-...](https://github.com/coreos/etcd/blob/master/Documentation/op-guide/performance.md#benchmarks)

Note that the testing environments are not exactly the same in the docs I
listed, but they are comparable. Also, Consul's performance has improved after
a few releases.

So we ran the benchmark internally in the same environment. The result is
still comparable to what I listed in the two official docs.

The best way to compare performance is still probably to run the benchmark in
your own environment.

------
gshx
Does someone have any benchmarks or comparisons with Zk? We have run it for
many yrs without a problem and are very happy with it. Would be interested in
hearing from anyone who switched from zk->etcd for distributed locking,
presence, leader election type idioms and ran it in prod for a few months/yrs
and their takeaways.

~~~
Randgalt
etcd is a key value store that has features of ZK. ZooKeeper is strictly
distributed coordination. Apples and oranges, no?

~~~
atombender
Apples all the way. Etcd is pretty much a clone of ZooKeeper in Go; they both
support hierarchical keys, atomic autoincrement, and watches, though Etcd uses the
Raft consensus algorithm, whereas ZooKeeper uses its own homegrown algorithm,
and there are other minor differences. Both are intended for configuration
management and coordination. (Of course, both are ultimately clones of
Google's internal tool, Chubby.)

~~~
embiggen
I agree that they are similar in some ways, but under the covers they are
fundamentally different beasts in almost all ways!!!

Etcd uses Raft and ZooKeeper uses its own protocol called Zab [0]. Zab shares
some characteristics with Paxos but certainly IS NOT Paxos.

[0]
[https://cwiki.apache.org/confluence/display/ZOOKEEPER/Zab+vs...](https://cwiki.apache.org/confluence/display/ZOOKEEPER/Zab+vs.+Paxos)

~~~
lobster_johnson
As I understand it, both Paxos and Zab will result in similar performance
characteristics, since writes need, by design, to be coordinated with peers
and serialized in a strict manner. In this sense, Etcd and ZK are very much
alike, irrespective of how they are implemented internally. I wouldn't be
surprised if Etcd was found to be faster and more scalable than ZK, however.

~~~
ideal0227
It really depends on how you view the problem. Yes, the latency of agreeing on
a proposal is similar, since it is limited by physical factors (network
latency + disk I/O). However, there are ways to put more stuff into one
proposal (batching) and to submit proposals continuously (pipelining). These
optimizations depend heavily on the implementation.
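
A rough sketch of the batching half of that idea, in Go. This is a toy model and not code from etcd or ZooKeeper: the `batch` helper and the "one slice = one consensus round" accounting are assumptions made only for illustration.

```go
package main

import "fmt"

// batch groups pending writes into slices of at most maxBatch entries.
// In this toy model each slice stands for a single consensus proposal,
// so fewer slices means fewer network round trips and disk syncs.
func batch(pending []string, maxBatch int) [][]string {
	if maxBatch < 1 {
		maxBatch = 1
	}
	var out [][]string
	for len(pending) > 0 {
		n := maxBatch
		if len(pending) < n {
			n = len(pending)
		}
		out = append(out, pending[:n])
		pending = pending[n:]
	}
	return out
}

func main() {
	pending := []string{"w1", "w2", "w3", "w4", "w5", "w6", "w7", "w8", "w9", "w10"}
	batches := batch(pending, 4)
	// prints "10 writes -> 3 consensus rounds"
	fmt.Printf("%d writes -> %d consensus rounds\n", len(pending), len(batches))
}
```

Per-proposal latency is unchanged here; the gain is in throughput, because the fixed cost of one round is shared by several writes. Pipelining is complementary: it would start the next round before the previous one has been acknowledged.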

------
Perceptes
It's not mentioned in the blog post—is there a document that explains the
migration plan for the etcd that ships on the host in CoreOS?

Edit: They haven't been published to the docs on the CoreOS website yet, but
there are two documents listed under "upgrading and compatibility" at the
bottom of
[https://github.com/coreos/etcd/blob/v3.0.0/Documentation/doc...](https://github.com/coreos/etcd/blob/v3.0.0/Documentation/docs.md)

~~~
philips
Great question. The upgrade path is a rolling upgrade from v2.3.y series to
v3.0.0 series. This is how all of the etcd upgrades have worked since the
start of v2.x.y.

Doc is here:
[https://github.com/coreos/etcd/blob/master/Documentation/upg...](https://github.com/coreos/etcd/blob/master/Documentation/upgrades/upgrade_3_0.md)

~~~
Perceptes
Seems worth noting as well that this only upgrades the version of the cluster.
Data populated via the v2 API will not magically be available via the v3 API
as they have separate data stores/keyspaces.
[https://github.com/coreos/etcd/blob/v3.0.0/Documentation/op-...](https://github.com/coreos/etcd/blob/v3.0.0/Documentation/op-guide/v2-migration.md)
talks about how to migrate data that was stored with v2 to v3's data store.

------
Ygor
Etcd looks more and more promising as its usage and development activity
increase. Anyone using it internally as a standalone part in the system (i.e.
not just for k8s or coreos)?

Using e.g. gRPC shows great promise, but systems like ZooKeeper still play
nicer in more traditional Java shops, or do they? How hard is it to use etcd
from the JVM?

~~~
bogomipz
Etcd looks more promising?

Zookeeper has been around now for over 5 years with an extremely large install
base.

Kafka, Hadoop, Solr, Mesos and HBase are all projects that leverage Zookeeper
for distributed coordination.

Zookeeper has already delivered.

~~~
harlowja
I would tend to agree. Knowing what zookeeper has been doing, and actually
using zookeeper (and etcd), I can say that the API and the primitives offered
by zookeeper are IMHO better (although this multi-version concurrency control
model is interesting) and more mature.

It feels like etcd is 'still discovering itself' for lack of better words.

Btw:
[https://issues.apache.org/jira/browse/ZOOKEEPER-2169](https://issues.apache.org/jira/browse/ZOOKEEPER-2169)
(this is the equivalent of TTLs for zookeeper).

------
Randgalt
Are there high level APIs for etcd like there are for ZooKeeper? I'm the main
author of Apache Curator and I know that writing "recipes" is not trivial.

~~~
ideal0227
Yes. Here:
[https://godoc.org/github.com/coreos/etcd/clientv3/concurrenc...](https://godoc.org/github.com/coreos/etcd/clientv3/concurrency)

It would be great if you could provide opinions, comments or help on these
high level APIs. We also might move these to an internal proxy layer, so that
clients in other languages can use them more easily.

(some more here:
[https://github.com/coreos/etcd/tree/master/contrib/recipes](https://github.com/coreos/etcd/tree/master/contrib/recipes))

~~~
Randgalt
I assume there's a Java API for etcd? If so, it would be interesting to try to
port Curator.

~~~
ideal0227
We are working on it
([https://github.com/coreos/etcd/issues/5067](https://github.com/coreos/etcd/issues/5067)).
Probably we could work together on the Java client first? It should not be
hard given that gRPC supports Java.

------
agentultra
Do they publish formal specifications of these distributed algorithms?

Has anyone verified the implementation?

Zookeeper is at least based on Paxos which has a TLA+ model one can check.

~~~
ideal0227
etcd is based on Raft. Raft has a TLA+ spec. But do note that the
implementation usually diverges from its algorithm [0].

For etcd, we try to keep the core algorithm as self-contained and
deterministic (no I/O, no timer) as possible, so it can stay very close to the
pure algorithm. We are very confident in it since we have thoroughly tested
it, and the implementation is shared with other consistent large-scale
database systems too (cockroachdb, tikv).
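
To illustrate what "no I/O, no timer" can look like in practice, here is a small Go sketch. It is a hypothetical toy, not etcd's raft package: the `Node`, `Message`, `Tick`, and `Heartbeat` names and fields are made up for this example. The point is that the caller advances logical time and delivers messages, so the same sequence of calls always produces the same result.

```go
package main

import "fmt"

// Message is an outbound request that the caller is responsible for
// delivering; the state machine itself never touches the network.
type Message struct {
	From, To int
	Type     string
}

// Node is a toy deterministic state machine (hypothetical, not etcd's API).
// It never reads a clock: time only moves when the caller invokes Tick.
type Node struct {
	id              int
	peers           []int
	electionTimeout int  // timeout measured in ticks
	elapsed         int  // ticks since the last heartbeat
	candidate       bool // true once an election has been started
}

// Tick advances logical time by one step. When the election timeout elapses
// without a heartbeat, the node becomes a candidate and returns vote requests.
func (n *Node) Tick() []Message {
	n.elapsed++
	if n.candidate || n.elapsed < n.electionTimeout {
		return nil
	}
	n.candidate = true
	msgs := make([]Message, 0, len(n.peers))
	for _, p := range n.peers {
		msgs = append(msgs, Message{From: n.id, To: p, Type: "RequestVote"})
	}
	return msgs
}

// Heartbeat resets the timeout, as if a leader's heartbeat had been delivered.
func (n *Node) Heartbeat() {
	n.elapsed = 0
	n.candidate = false
}

func main() {
	n := &Node{id: 1, peers: []int{2, 3}, electionTimeout: 3}
	for tick := 1; tick <= 3; tick++ {
		fmt.Printf("tick %d: %d message(s)\n", tick, len(n.Tick()))
	}
}
```

Because nothing here depends on wall-clock time or sockets, a test can replay any interleaving of ticks and heartbeats exactly, which is the property that makes this style of core easy to test thoroughly.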

ZooKeeper uses ZAB [1] under the hood. I do not think there is a TLA+ spec for
ZAB.

[0]
[https://www.cs.utexas.edu/users/lorenzo/corsi/cs380d/papers/...](https://www.cs.utexas.edu/users/lorenzo/corsi/cs380d/papers/paper2-1.pdf)

[1]
[http://www-users.cselabs.umn.edu/classes/Spring-2016/csci821...](http://www-users.cselabs.umn.edu/classes/Spring-2016/csci8211/Papers/Distributed%20Systems%20Zab-%20High-performance%20broadcast%20for%20primary-backup%20systems.pdf)

~~~
carterschonwald
I took a look at the etcdv3 raft.go code, it is indeed pretty nicely written!
Definitely better than any other one I've seen.

(though I'm pretty excited about the raft impl my summer intern and I are
hacking on currently :) )

~~~
sagichmal
Consul's Raft implementation is leagues better than etcd's, unfortunately.

~~~
justinsb
I'd be very interested in hearing more about this.

I tried using both the Consul and etcd Raft implementations as libraries. I
found the Consul library much easier to interface with. But it was my
impression that the etcd library was much better tested in the real world,
with big projects like Kubernetes and with the library being embedded into
projects like CockroachDB. I also wasn't sure if the details that the Consul
implementation was hiding were actually important.

------
qwertyuiop924
This doesn't look like feature creep, but coreos bought into systemd big time,
and with etcd being used in more places, the temptation of feature creep
grows... and that's worrying.

------
hn_rate_limiter
Will CORS ever be enabled for the
[https://discovery.etcd.io/new](https://discovery.etcd.io/new) endpoint?

~~~
philips
This is the GitHub issue for this service. Please +1 the thing on GitHub and
we will try and get it fixed in prod:
[https://github.com/coreos/discovery.etcd.io/issues/12](https://github.com/coreos/discovery.etcd.io/issues/12)

~~~
meta_AU
Wait... there are maintainers that encourage +1s all over their GitHub issues?

~~~
jvoorhis
Probably referring to the new "reactions" feature.
[https://github.com/blog/2119-add-reactions-to-pull-requests-...](https://github.com/blog/2119-add-reactions-to-pull-requests-issues-and-comments)

------
jondubois
Last time I looked into Etcd, I got the impression that it was great at
handling a high volume of read operations (theoretically, read scalability)
but bad at handling a high volume of write operations (since every write has
to be propagated to every node in the cluster via Raft). Is this still the
case in v3?
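
One detail that bears on the write-cost question: in Raft, an entry counts as committed once a majority of the cluster has stored it, even though the leader keeps replicating it to the remaining members afterwards. A small Go sketch of that arithmetic follows; the `quorum` and `committed` helpers are hypothetical names for this example and say nothing about etcd's actual write throughput.

```go
package main

import "fmt"

// quorum returns the number of acknowledgements needed to commit an entry
// in a cluster of n voting members (a strict majority).
func quorum(n int) int {
	return n/2 + 1
}

// committed reports whether an entry acknowledged by acks members
// (leader included) can be considered committed.
func committed(acks, n int) bool {
	return acks >= quorum(n)
}

func main() {
	for _, n := range []int{3, 5, 7} {
		fmt.Printf("cluster of %d: quorum %d, tolerates %d failure(s)\n",
			n, quorum(n), n-quorum(n))
	}
}
```

So adding members raises the number of acknowledgements each write has to wait for, which is one reason such clusters are usually kept small.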

