
Kured – A Kubernetes Reboot Daemon - awh0
https://www.weave.works/blog/announcing-kured-a-kubernetes-reboot-daemon
======
Artemis2
I recently learned about a CoreOS-specific operator for that purpose:
[https://github.com/coreos/container-linux-update-operator](https://github.com/coreos/container-linux-update-operator).
In standard CoreOS installations, once system updates are installed, the node
is automatically rebooted, with a semaphore coordinating reboots across nodes.
This operator instead uses the Kubernetes API to drain the pods from the node
before rebooting, for a clean shutdown.
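
The drain step is the interesting part. Roughly, the pattern the operator
automates looks like this (a sketch that shells out to kubectl; the node name
is illustrative, and the operator itself talks to the API directly):

    import subprocess

    NODE = "worker-1"  # hypothetical node name

    # Cordon the node and evict its pods (DaemonSet pods can't be
    # evicted, so they are skipped).
    subprocess.run(["kubectl", "drain", NODE, "--ignore-daemonsets"],
                   check=True)
    # ...the update agent reboots the machine here...
    # Once it is back, make it schedulable again:
    subprocess.run(["kubectl", "uncordon", NODE], check=True)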

Kured can also hold off reboots while Prometheus alerts are firing; that's a
nice touch.
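
Conceptually that check is just "are any alerts firing?" against Prometheus's
/api/v1/alerts endpoint. A rough sketch of the idea (the Prometheus address
and the filter regex are assumptions, not kured's exact configuration):

    import json, re, urllib.request

    PROM_URL = "http://prometheus:9090"       # assumed address
    IGNORE = re.compile(r"^RebootRequired$")  # alerts that shouldn't block

    with urllib.request.urlopen(PROM_URL + "/api/v1/alerts") as resp:
        alerts = json.load(resp)["data"]["alerts"]

    blocking = [a["labels"]["alertname"] for a in alerts
                if a["state"] == "firing"
                and not IGNORE.match(a["labels"]["alertname"])]

    if blocking:
        print("holding reboot; firing alerts:", blocking)
    else:
        print("no blocking alerts; safe to reboot")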

------
otterley
Performing unattended upgrades seems like a great idea, until a bug introduced
in an upgrade causes a performance or operability regression, introduces an
even worse security bug, or is otherwise problematic. At worst, you could
automate yourself into an outage.

IMO, good reliability practice counsels against this approach; at the very
least, these upgrades should be tested in a staging environment first.

~~~
alexk
In Telekube we have hybrid auto-updates with a fallback to manual mode in
case things go wrong, so we can recover:

[https://gravitational.com/docs/cluster/#performing-upgrade](https://gravitational.com/docs/cluster/#performing-upgrade)

~~~
otterley
How does this play out in an actual failure scenario? How can an
administrator detect failures? Does it apply to the entire OS, including
kernel upgrades, or just your own software?

~~~
alexk
Imagine etcd or Docker fails during the upgrade on one of the master nodes.

Our monitoring system, Satellite, will report the failure to the upgrade
process, which will stop.

The administrator will then use our diagnostic utility to diagnose the
problem, fix it, and proceed with the upgrade from the last failed step, but
in manual mode.
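
The general shape of that pattern, independent of our tooling (all the names
here are hypothetical, for illustration only):

    import json, subprocess, sys

    STEPS = ["upgrade-etcd", "upgrade-docker", "upgrade-kubelet"]  # illustrative

    def healthy() -> bool:
        # Stand-in for the monitoring check; False on any failure.
        return subprocess.run(["cluster-health-check"]).returncode == 0

    for step in STEPS:
        subprocess.run(["run-upgrade-step", step], check=True)
        if not healthy():
            # Record where we stopped so a manual run can resume here.
            with open("upgrade-state.json", "w") as f:
                json.dump({"failed_step": step}, f)
            sys.exit(f"step {step!r} left the cluster unhealthy; "
                     "fix it and resume in manual mode")
    print("upgrade complete")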

> Does it apply to the entire OS, including kernel upgrades, or just your own
> software?

Right now it only applies to our software.

------
djb_hackernews
Why not just spin up fresh nodes and age out the old ones?

See: pets vs cattle
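
For completeness, the replacement flow is roughly (node name and cloud
commands are illustrative):

    import subprocess

    OLD_NODE = "worker-1"  # hypothetical node name

    # Move workloads off and remove the node from the cluster...
    subprocess.run(["kubectl", "drain", OLD_NODE, "--ignore-daemonsets"],
                   check=True)
    subprocess.run(["kubectl", "delete", "node", OLD_NODE], check=True)
    # ...then terminate the underlying instance and let the autoscaling
    # group (or equivalent) bring up a fresh, fully patched replacement.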

~~~
lambda
One reason I could see is if upgrading in place is substantially faster or
cheaper than migrating data to new ones.

For instance, if the nodes are storage nodes in a redundant storage system,
taking each one offline briefly for a reboot, and letting the others handle
the slightly higher load, is a lot quicker than spinning up a new node and
replicating all of the data over there so you can de-provision the old one.
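
A back-of-envelope comparison makes the gap concrete (all figures here are
assumptions):

    # Re-replicating a storage node's data versus rebooting it in place.
    data_tb = 10    # data held by the node, in terabytes (assumed)
    link_gbps = 1   # usable replication bandwidth (assumed)

    replicate_s = data_tb * 8e12 / (link_gbps * 1e9)    # bits / bits-per-second
    print(f"replication: ~{replicate_s / 3600:.1f} h")  # ~22.2 h
    reboot_s = 120                                      # a quick reboot
    print(f"reboot:      ~{reboot_s / 60:.0f} min")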

Also, even if the nodes don't hold much data, the time and resulting expense
of spinning up and provisioning a new node while the old one is still online
can add up to more than the cost of simply rebooting during a period of lower
load.

------
nvarsj
This is pretty neat. We accomplished something similar but simpler: unattended
reboots at a randomly selected time within a 24-hour window, with a script
that drains the node first. Not perfect, but simple to set up.
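
Roughly (a sketch; the node name and commands are illustrative):

    import random, subprocess, time

    NODE = "worker-1"  # hypothetical; in practice, the machine's own hostname

    time.sleep(random.uniform(0, 24 * 3600))  # random moment in the next 24 h
    subprocess.run(["kubectl", "drain", NODE, "--ignore-daemonsets"],
                   check=True)
    subprocess.run(["reboot"], check=True)    # needs root; or: systemctl reboot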

------
susane123
Even though unattended upgrades are a good idea, they may introduce new bugs.

