
Monitoring Kubernetes in Production - twakefield
http://blog.gravitational.com/satellite-monitoring-kubernetes-in-production/
======
sciurus
I don't get this. They say "Monitoring the state of a Kubernetes cluster is
not straightforward using traditional monitoring tools." but they didn't try
monitoring it using a traditional monitoring tool. They tried monit, which is
a process watchdog that's limited to a single host.

In the end it sounds like they created three things:

1) Health checks of a kubernetes cluster's individual components
2) End-to-end checks of a kubernetes cluster's functionality
3) A distributed monitoring system for running those checks

I'm pretty sure they could have plugged their checks into e.g. Nagios (which
is about as traditional as you can get) and been fine.

~~~
alexk
Think of it as a system of low-level, K8s-specific checks that is designed to
be autonomous and does not need a third party to run.

Regardless of any node failures we can still get the diagnostic information
about cluster state (as long as there is at least one node left). We actually
plug these checks into a higher-level monitoring and metrics system (in our
case it's InfluxDB and not Nagios as you've suggested).

------
nzoschke
Nice. Thanks for sharing lessons learned and tools that encode this knowledge.

I work on container cluster management full time myself. I am focusing on AWS
ECS so the problems are very different technically but very similar
conceptually.

The question is who watches the watchmen?

A container scheduler is supposed to be responsible for maintaining the entire
health of the cluster. But if it has fundamental troubles in doing so, how do
you automatically detect this and get it back into a working state?

On ECS I have an agent container running on every instance that terminates
the instance on observed failures. The most common problems I have observed
are a bad disk (full, read-only, or too slow) and a locked up docker daemon.

I also schedule one more monitor process in the cluster that periodically
monitors the ECS, EC2 and ASG APIs. A common failure is instances that lose
ECS agent connectivity and need to be terminated.
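The node-local half of that agent can be sketched roughly as follows — the 90% threshold is an illustrative assumption, not Convox's actual value, and the Docker-daemon probe is omitted:

```python
"""Sketch of node-local failure detection: full or read-only disk."""
import os
import shutil
import tempfile

DISK_FULL_PCT = 90.0  # threshold is an assumption, not Convox's actual value


def disk_full(path="/", threshold_pct=DISK_FULL_PCT):
    """True if the filesystem holding `path` is past the usage threshold."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100.0 >= threshold_pct


def disk_read_only(path=None):
    """True if we cannot create and remove a small file under `path`."""
    try:
        fd, name = tempfile.mkstemp(dir=path or tempfile.gettempdir())
        os.close(fd)
        os.unlink(name)
        return False
    except OSError:
        return True


def should_terminate(path="/"):
    # A real agent would also probe the Docker daemon with a timeout
    # (e.g. its /_ping endpoint); that half is omitted here.
    return disk_full(path) or disk_read_only()
```

On a positive result the agent would ask the ASG to terminate and replace the instance rather than try to repair it in place.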

All this hard won knowledge is encoded in the open source Convox platform:
[https://github.com/convox/rack](https://github.com/convox/rack)

The next problem is that sometimes this monitor container stops working due to
the very problems it's trying to correct! I plan to move it to a Lambda task
to remove the correlated failure.

But I always wonder. Why aren't these problems handled natively by Amazon and
ECS?

The same question applies to this post. If you have to run additional
monitoring to make kubernetes work reliably long term, can we consider that a
kubernetes bug?

~~~
brendandburns
Kubernetes handles most of this seamlessly for the cluster infrastructure.

The central master handles node failures by removing nodes that aren't
heartbeating.

On the node, we require a process monitor for the kubelet (by default we use
supervisord), but then the kubelet monitors Docker (and also does garbage
collection and resource limiting), and then all of the other node daemons
(e.g. the kubernetes proxy) are run/monitored/restarted by the kubelet.
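That supervisord arrangement might look roughly like this — the binary path is a placeholder and the kubelet's flags (which vary per deployment) are omitted:

```ini
; Keep the kubelet itself under a process monitor; everything else
; on the node is then monitored by the kubelet.
[program:kubelet]
command=/usr/local/bin/kubelet
autostart=true
autorestart=true
startretries=1000
```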

~~~
alexk
Fully agree - K8s has tons of self-healing capabilities as long as it
functions correctly. Our goal was to extend this with a lower layer that
detects failures in components that are mission-critical for k8s deployments
to run properly, like etcd, docker state and skydns.
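The skydns half of such a check can be as simple as resolving a name that should always exist. A sketch — the idea that a failed lookup of a well-known service name indicates broken cluster DNS is the assumption here:

```python
"""Sketch of a cluster-DNS (skydns) liveness probe via name resolution."""
import socket


def dns_healthy(name):
    """True if `name` resolves through the node's configured resolver."""
    try:
        socket.getaddrinfo(name, None)
        return True
    except socket.gaierror:
        return False


# In-cluster, a check might probe the API service's well-known name:
#   dns_healthy("kubernetes.default.svc.cluster.local")
```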

------
jbaptiste
We had the very same issue with kube-dns not that long ago. Have you
considered running prometheus against your cluster?

You can leverage the power of the kubernetes services to get viable
monitoring along with automatic discovery of new metrics/services.

We've been using it for some time on three different clusters (kubernetes,
aws and bare metal) and are very pleased with the performance.

~~~
deemok
Yes, I see prometheus as a step towards a more sophisticated monitoring setup
if you consider (and enable) Satellite as a prometheus service. Satellite can
push metrics to one of the prometheus servers for all the benefits prometheus
provides. So neither is really a replacement - rather a complement. Satellite
was born out of an immediate need to monitor the cluster as it is being
created, and it is definitely naive in most other respects - it implements
just the bits to do that. So, pairing it with an established solution like
prometheus (or InfluxDB, for that matter) is an undeniable benefit.

------
hathym
and who monitors the monitor of a monitor?

~~~
nzoschke
Customers. They send Twitter messages.

