I am the creator of the k3sup (k3s installer) tool that was mentioned and have a fair amount of experience with K3s on Raspberry Pi too.
You might also like my video "Exploring K3s with K3sup" - https://www.youtube.com/watch?v=_1kEF-Jd9pw
Kube is designed for nodes to have continuous connectivity to the control plane. If connectivity is disrupted and the machine restarts, none of the workloads will be restarted until connectivity is restored.
I.e., if network disruption can last up to 10m, then at worst a restart / reboot will take 10m to restore the apps on that node.
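One partial mitigation worth noting: static pods, i.e. manifests the kubelet reads directly from disk, are started and restarted by the kubelet itself without any API server connectivity. A minimal sketch, assuming the kubelet's `staticPodPath` points at `/etc/kubernetes/manifests` (the common kubeadm default); the pod name and image are placeholders:

```yaml
# /etc/kubernetes/manifests/edge-agent.yaml
# A static pod: the kubelet runs and restarts this on its own,
# even while the control plane is unreachable after a reboot.
apiVersion: v1
kind: Pod
metadata:
  name: edge-agent        # hypothetical name
  namespace: kube-system
spec:
  restartPolicy: Always
  containers:
  - name: agent
    image: example.com/edge-agent:latest   # hypothetical image
```

This only covers node-local workloads, though; anything scheduled through the API server still waits for connectivity.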
Many other components (networking, storage, per-node workers) will likely also have issues if they aren't tested in those scenarios (I've seen some networking plugins hang or otherwise fail).
That said, there are lots of people successfully running clusters like this as long as worst case network disruption is bounded, and it’s a solvable problem for many of them.
I doubt we’ll see a significant investment in local resilience in Kubelet from the core project (because it’s a lot of work), but I do think eventually it might get addressed in the community (lots of OpenShift customers have asked for that behavior). The easier way today is run edge single node clusters, but then you have to invent new distribution and rollout models on top (ie gitops / custom controllers) instead of being able to reuse daemonsets.
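For context, the daemonset model being referred to is a single manifest that the cluster fans out to every node, which is exactly what you lose when each edge site becomes its own single-node cluster. A minimal sketch (names and image are placeholders):

```yaml
# One manifest, one pod per node -- the pattern single-node
# edge clusters force you to reinvent with gitops/controllers.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: edge-worker      # hypothetical name
spec:
  selector:
    matchLabels:
      app: edge-worker
  template:
    metadata:
      labels:
        app: edge-worker
    spec:
      containers:
      - name: worker
        image: example.com/edge-worker:latest   # hypothetical image
```

Incidentally, the DaemonSet controller adds `node.kubernetes.io/unreachable` and `not-ready` tolerations automatically, so daemonset pods aren't evicted when a node blips offline, which is part of why they fit flaky edge nodes well.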
We are experimenting in various ecosystem projects with patterns that would let you map a daemonset on one cluster to smaller / distributed daemonsets on other clusters (which gets you local resilience).
Don't do this. Have two K8s clusters. Even if the network were reliable you might still have issues spanning the overlay network geographically.
If you _really_ need to manage them as a unit for whatever reason, federate them (keeping in mind that federation is still not GA). But keep each control plane local.
Then set up the data flows as if K8s wasn't in the picture at all.
Where have you been suffering from this?
I don't want to have to restart the whole thing on each site every time it happens. I'd like a deployment/orchestration system that can work in such scenarios, showing a node as unreachable but then back online when it gets network back.
Isn't that exactly what happens with K8s worker nodes? They will show as "not ready" but will be back once connectivity is restored.
EDIT: Just saw that the intention is to have some nodes in a DC and some nodes in the edge and the intention is to have a single K8s cluster spanning both locations with unreliable network in between. No idea how badly the cluster would react to this.
No, it's ok. What you don't want to have is:
* Poor connectivity between K8s masters and etcd. The backing store needs to be reliable or things don't work right. If it's an IoT scenario, it's possible you won't have multiple K8s master nodes anyway. If you can place etcd and the K8s master on the same machine, you are fine.
* Poor connectivity between masters and workers. If connectivity gets disrupted for a long time and nodes start going NotReady then, depending on how your cluster and workloads are configured, K8s may start shuffling things around to work around the (perceived) node failure (which is normally a very good thing). If this happens too often and for too long, it can be disruptive to your workloads. If it's sporadic, it can be a good thing to have K8s route around the failure.
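That "shuffling" is the taint-based eviction path: when a node goes unreachable, the node controller taints it, and pods are evicted once their toleration window expires (300s by default). If you'd rather ride out longer outages in place, you can stretch that window per workload. A sketch of a pod-spec fragment; the 30-minute value is an example, tune it to your worst-case disruption:

```yaml
# Pod spec fragment: stay bound to an unreachable/not-ready node
# for up to 30 minutes before eviction (default is 300s).
tolerations:
- key: node.kubernetes.io/unreachable
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 1800
- key: node.kubernetes.io/not-ready
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 1800
```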
So, if that is your scenario, then you will need to adjust. But keep in mind that no matter what you do, if the network is really bad, you will have to mitigate the effects regardless, Kubernetes or not. I can only really see a problem if a) the network is terrible and b) your workloads are mostly compute-heavy and don't rely on the network (or they communicate in bursts). Otherwise, a network failure means you can't reach your applications anyway...