
Etcd 3.4 - jinqueeny
https://kubernetes.io/blog/2019/08/30/announcing-etcd-3-4/
======
meddlepal
I really wish Kubernetes would make its storage backend pluggable, or that the
k3s folks would push their work allowing a SQL database as the backend
upstream. Then you could just back Kubernetes with some Cloud SQL offering.

~~~
jacques_chester
I'd like that too (I trust SQLite more than almost anything), but it would
prove difficult in some respects. Kubernetes relies intimately on etcd's watch
capability. k3s handles this with polling, which works fine, but some thought
would need to be given to how best to replicate that behaviour (e.g. would
LISTEN/NOTIFY work well on PostgreSQL? I have no idea).
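
To make that concrete, PostgreSQL can push change events to a client through
a trigger plus pg_notify. A minimal sketch in Go with lib/pq, against a
hypothetical kv table - it demonstrates the plumbing only, not whether it
holds up under etcd-watch-style load:

    package main

    import (
        "database/sql"
        "fmt"
        "log"
        "time"

        "github.com/lib/pq" // also registers the "postgres" driver
    )

    // Notify listeners whenever the (hypothetical) kv table changes.
    // EXECUTE FUNCTION needs PostgreSQL 11+; older versions spell it
    // EXECUTE PROCEDURE.
    const setup = `
    CREATE OR REPLACE FUNCTION notify_kv_change() RETURNS trigger AS $$
    BEGIN
      PERFORM pg_notify('kv_events', TG_OP || ' ' || NEW.key);
      RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER kv_watch AFTER INSERT OR UPDATE ON kv
    FOR EACH ROW EXECUTE FUNCTION notify_kv_change();`

    func main() {
        conn := "postgres://localhost/k3s?sslmode=disable"
        db, err := sql.Open("postgres", conn)
        if err != nil {
            log.Fatal(err)
        }
        if _, err := db.Exec(setup); err != nil {
            log.Fatal(err)
        }

        // LISTEN on the channel; lib/pq reconnects on its own, backing
        // off between the given min/max intervals.
        l := pq.NewListener(conn, time.Second, time.Minute, nil)
        if err := l.Listen("kv_events"); err != nil {
            log.Fatal(err)
        }
        for n := range l.Notify {
            if n != nil { // a nil notification signals a reconnect
                fmt.Println("change:", n.Extra)
            }
        }
    }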

Further out, a case can be made that the whole business of reconciliation is
one of event stream processing. My gut feeling for some time has been that the
intersection of event stream and bitemporalism[1][2] makes a lot of the
problems that motivate consensus protocols approximately moot. Done properly
it might be possible to do away with the need for a master for a lot of
things, which would improve scalability, failure tolerance and security.

[1] Or Johnston's tritemporalism; I don't yet feel confident enough in my
understanding to say.

[2] I am not alone and probably not the first to think so. For an example of a
streaming-oriented bitemporal store, see Crux:
[https://juxt.pro/crux/index.html](https://juxt.pro/crux/index.html)

------
ec109685
One of the failure conditions is having a new follower read data off the
leader as it bootstraps, which adds extra load on the system.

It seems like a follower could pull the initial snapshot from another
follower instead?

------
MichaelMoser123
Why is client-go sending HTTP requests to kube-apiserver? I wonder if a
message queue would have been a more reliable and scalable transport option.

~~~
whalesalad
Sometimes you need or want synchronous communication.

~~~
MichaelMoser123
With client-go one is often watching for notifications, which looks like what
you would do with an MQ.

~~~
paulfurtado
The way Kubernetes uses this streaming is to keep data structures up to date
in memory. A node needs a list of all its pods, so it makes a list call at
startup and then starts a watch from that resource version onwards to keep the
in-memory list in sync.
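
In code (recent client-go; the namespace and kubeconfig path are just
placeholders), the list-then-watch pattern looks something like:

    package main

    import (
        "context"
        "fmt"

        corev1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/watch"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/tools/clientcmd"
    )

    func main() {
        cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
        if err != nil {
            panic(err)
        }
        cs, err := kubernetes.NewForConfig(cfg)
        if err != nil {
            panic(err)
        }
        ctx := context.Background()

        // 1. List once to seed the in-memory state.
        pods, err := cs.CoreV1().Pods("default").List(ctx, metav1.ListOptions{})
        if err != nil {
            panic(err)
        }
        known := map[string]bool{}
        for _, p := range pods.Items {
            known[p.Name] = true
        }

        // 2. Watch from the list's resource version to stay in sync.
        w, err := cs.CoreV1().Pods("default").Watch(ctx, metav1.ListOptions{
            ResourceVersion: pods.ResourceVersion,
        })
        if err != nil {
            panic(err)
        }
        for ev := range w.ResultChan() {
            pod, ok := ev.Object.(*corev1.Pod)
            if !ok {
                continue // e.g. a Status object on watch errors
            }
            if ev.Type == watch.Deleted {
                delete(known, pod.Name)
            } else {
                known[pod.Name] = true
            }
            fmt.Println(ev.Type, pod.Name, "-", len(known), "pods tracked")
        }
    }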

The whole architecture is designed so that the etcd storage backend could be
swapped out completely and the only components that would care are the API
servers, much like the transition from etcd v2 over HTTP with JSON objects to
etcd v3 over gRPC with protobufs.
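
And on the etcd side, a v3 gRPC watch through the Go client looks roughly
like this (the endpoint is a placeholder; /registry/ is the prefix the
apiserver stores its objects under):

    package main

    import (
        "context"
        "fmt"
        "time"

        "go.etcd.io/etcd/clientv3"
    )

    func main() {
        cli, err := clientv3.New(clientv3.Config{
            Endpoints:   []string{"127.0.0.1:2379"},
            DialTimeout: 5 * time.Second,
        })
        if err != nil {
            panic(err)
        }
        defer cli.Close()

        // One streaming RPC delivers every change under the prefix.
        for resp := range cli.Watch(context.Background(), "/registry/",
            clientv3.WithPrefix()) {
            for _, ev := range resp.Events {
                fmt.Printf("%s %s\n", ev.Type, ev.Kv.Key)
            }
        }
    }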

You can also create alternative implementations of the kubelet, kube-proxy,
the scheduler, the controller-manager, and so on, because they all access the
data via the API server's well-defined, public-facing API. Anyone using that
API can easily watch objects in any programming language with the same
semantics as the rest of the API; it even works from browsers.

Additionally, Kubernetes supports RBAC for the nodes themselves, such that a
node can only see updates for objects related to pods running on it - you
wouldn't want every node to be able to watch all secrets in the cluster
needlessly.

Overall, I think we'd lose a lot if Kubernetes switched to having all of its
components access the data store directly. Every operator ultimately needs the
same things that the controller-manager, the scheduler, and the other
Kubernetes components need.

~~~
MichaelMoser123
I think that in addition to the initial full list, it has to issue further
full list requests whenever the resync interval expires.

I didn't ask about accessing etcd directly; I just wonder whether HTTP is the
best transport for this case.
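
For reference, that interval is the resync period you pass when building a
shared informer in client-go. A minimal, purely illustrative sketch (the 30s
period and the kubeconfig path are assumptions):

    package main

    import (
        "fmt"
        "time"

        corev1 "k8s.io/api/core/v1"
        "k8s.io/client-go/informers"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/tools/cache"
        "k8s.io/client-go/tools/clientcmd"
    )

    func main() {
        cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
        if err != nil {
            panic(err)
        }
        cs, err := kubernetes.NewForConfig(cfg)
        if err != nil {
            panic(err)
        }

        // Every 30s the informer re-delivers its cached objects to the
        // handlers as Update events, so missed changes eventually heal.
        factory := informers.NewSharedInformerFactory(cs, 30*time.Second)
        inf := factory.Core().V1().Pods().Informer()
        inf.AddEventHandler(cache.ResourceEventHandlerFuncs{
            UpdateFunc: func(oldObj, newObj interface{}) {
                fmt.Println("sync:", newObj.(*corev1.Pod).Name)
            },
        })

        stop := make(chan struct{})
        factory.Start(stop)
        factory.WaitForCacheSync(stop)
        <-stop
    }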

------
anonymousJim12
Which k8s version will use this etcd release by default? Has it been tested
with any current versions?

~~~
antpls
According to
[https://github.com/kubernetes/kubernetes/pull/81434](https://github.com/kubernetes/kubernetes/pull/81434)
, some of the bugs will be fixed in k8s 1.16 with the update to etcd 3.3. All
previous versions are affected.

------
networkimprov
Note that the etcd project ignored this report of a data loss/corruption bug
on macOS:

[https://github.com/etcd-io/bbolt/issues/124](https://github.com/etcd-io/bbolt/issues/124)

~~~
segmondy
IMHO, why shouldn't they? macOS is not a server OS; it's a consumer OS. It
doesn't matter that it's derived from the FreeBSD family - Apple will
prioritize user experience over anything else. Anyone working on a server app
cares about Linux or Windows first, then the pure BSDs such as
FreeBSD/OpenBSD; everything else is a maybe.

------
jacques_chester
> _For instance, a flaky (or rejoining) member drops in and out, and starts
> campaign. This member ends up with higher terms, ignores all incoming
> messages with lower terms, and sends out messages with higher terms. When
> the leader receives this message of a higher term, it reverts back to
> follower. This becomes more disruptive when there’s a network partition._

I'm glad this has been fixed, considering that one of the use cases for using
partition-tolerant data stores is to tolerate partitions.
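
For the curious: the 3.4 fix is Raft's pre-vote phase, in which a would-be
candidate first asks whether a quorum would vote for it before bumping any
terms. A toy sketch of the idea, not etcd's actual code:

    // Toy sketch of Raft's pre-vote phase.
    package raft

    type Peer interface {
        // WouldVoteFor reports whether this peer would grant a vote to
        // a candidate at the given term, without the peer changing its
        // own term or vote state.
        WouldVoteFor(term, lastLogIndex uint64) bool
    }

    type Node struct {
        term         uint64
        lastLogIndex uint64
        peers        []Peer // the other cluster members
    }

    func (n *Node) campaign() {
        // Pre-vote: probe at term+1, but do NOT increment our term yet.
        granted := 1 // we would vote for ourselves
        for _, p := range n.peers {
            if p.WouldVoteFor(n.term+1, n.lastLogIndex) {
                granted++
            }
        }
        quorum := (len(n.peers)+1)/2 + 1
        if granted < quorum {
            // A flaky or partitioned member stops here, so its inflated
            // term never reaches - and deposes - a healthy leader.
            return
        }
        n.term++ // only a viable candidate starts a real election
        n.startElection()
    }

    func (n *Node) startElection() { /* ordinary Raft election */ }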

Cloud Foundry used earlier versions of etcd, and this category of problem was
the leading cause of severe outages - to the point that several years of
effort were invested in tearing it out of everything and replacing it with
bog-ordinary RDBMSes.

Disclosure: I work for Pivotal and we did a lot of that work, but I wasn't on
the front line - just watching from a safe distance.

