Borg, Omega, Kubernetes: Lessons learned from container management over a decade

hosh · on March 3, 2016

This is a great read. It's a fairly coherent summary of the things I intuitively felt were great about Kubernetes, and it discusses them plainly and explicitly.

Towards the end of the article, the authors mention the open challenges.

Unbeknownst to me, I had been working on something that attempts to address some of that. It is my sixth attempt at a tool like this. It grew out of seeing how many dev environments attempting to use Docker Compose ended up with a large, ad-hoc collection of Bash scripts.

The project is called Matsuri, and it is found here: https://github.com/shopappsio/matsuri

It's intended as a framework for programmatically generating manifests and executing kubectl commands. It doesn't try to have a lot of opinions (at least, opinions that you can't change). The idea is that your platform support tool is like any other app, and should be tailored to your specific collection of apps.

It also has a notion of Apps, which are a bundle of K8S resources declared as dependencies. But that's as far as I've gotten -- it has a fairly anemic convergence tool. I was more concerned about standardizing builds, pushes, updates (collection of rollouts, migrations, etc.), shell commands, "console" commands, etc. These are all achieved by expecting Apps to define callback hooks for those actions.

I don't know if this is useful or interesting to anyone else. But if anyone is interested, feel free to contact me about it.

mkulke · on March 4, 2016

Yes, the part "Some open hard problems/Configuration" was particularly interesting to me. Over time we evolved our tooling around k8s to a versioned (e)yaml + jq + rc.tpl.json => rc.json solution, with our own implementations for rolling-updates (to make them dependent on readiness checks) and node evacuation - all in bash. While this is nifty, a wrong indent in a yaml file can still spoil the party and require debugging at container level.

It did not occur to me that a robust, typesafe approach when dealing with k8s configuration objects could be a good idea.
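To illustrate what a typed approach to configuration objects could look like, here is a minimal sketch in Python. It is not the tooling described above: the class names `Container` and `ReplicationController` and the image name are made up for illustration, and since Python dataclasses do not enforce type hints at runtime, the checks are done by hand (a static checker would catch more). The point is that a bad value fails when the object is built, not later inside a container.

```python
import json
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Container:
    """Hypothetical typed wrapper for one container entry in a pod spec."""
    name: str
    image: str

    def __post_init__(self):
        if not self.name or not self.image:
            raise ValueError("container needs a non-empty name and image")

    def to_manifest(self) -> dict:
        return {"name": self.name, "image": self.image}


@dataclass
class ReplicationController:
    """Hypothetical typed wrapper that renders a v1 ReplicationController."""
    name: str
    replicas: int
    labels: Dict[str, str]
    containers: List[Container]

    def __post_init__(self):
        # Reject strings such as "2" (and booleans) before anything is rendered.
        is_int = isinstance(self.replicas, int) and not isinstance(self.replicas, bool)
        if not is_int or self.replicas < 0:
            raise ValueError("replicas must be a non-negative integer")
        if not self.labels:
            raise ValueError("at least one label is required for the selector")
        if not self.containers:
            raise ValueError("at least one container is required")

    def to_manifest(self) -> dict:
        return {
            "apiVersion": "v1",
            "kind": "ReplicationController",
            "metadata": {"name": self.name, "labels": dict(self.labels)},
            "spec": {
                "replicas": self.replicas,
                "selector": dict(self.labels),
                "template": {
                    "metadata": {"labels": dict(self.labels)},
                    "spec": {"containers": [c.to_manifest() for c in self.containers]},
                },
            },
        }

    def to_json(self) -> str:
        # The serializer handles indentation, so there is nothing to mis-indent.
        return json.dumps(self.to_manifest(), indent=2, sort_keys=True)
```

Calling `ReplicationController("web", 2, {"app": "web"}, [Container("web", "example/web:1.0")]).to_json()` would produce the equivalent of an rc.json, while passing `"2"` for the replica count raises a `ValueError` immediately.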

hosh · on March 4, 2016

That's an interesting solution!

Mine isn't typesafe. But now that I'm thinking about it, it would be an interesting approach too. In Matsuri, debug options are available to show the manifest or the kubectl commands. I specifically don't use yaml or json to define the template, and instead have it programmatically generated from Ruby directly. This lets me use Ruby class inheritance and module mixins to manage everything. So no indent problems, though I sometimes run into specs that don't validate for Kubernetes.

The rolling-update in Matsuri still uses the kubectl rolling-update under the covers. What it does do is introspect to find the current revision numbers and the current image tag (if you do not provide them).
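As a rough illustration of that kind of introspection (in Python, and not Matsuri's actual code), the sketch below works out the current revision and image tag from data you would already have fetched from the cluster. The `<app>-r<revision>` naming convention and all function names are assumptions made up for this example.

```python
import re
from typing import Iterable, Optional, Tuple

# Hypothetical naming convention: "<app>-r<revision>", e.g. "web-r7".
_RC_NAME = re.compile(r"^(?P<app>.+)-r(?P<rev>\d+)$")


def current_revision(app: str, rc_names: Iterable[str]) -> Optional[int]:
    """Return the highest revision among the app's replication controllers."""
    revisions = []
    for name in rc_names:
        match = _RC_NAME.match(name)
        if match and match.group("app") == app:
            revisions.append(int(match.group("rev")))
    return max(revisions) if revisions else None


def image_tag(image: str) -> Optional[str]:
    """Return the tag of an image reference, or None if it has no tag."""
    # Look only at the last path component so that a registry port
    # (registry.example.com:5000/web) is not mistaken for a tag.
    last = image.rsplit("/", 1)[-1]
    _, sep, tag = last.partition(":")
    return tag if sep else None


def plan_rolling_update(
    app: str,
    rc_names: Iterable[str],
    running_image: str,
    new_tag: Optional[str] = None,
) -> Tuple[str, str, Optional[str]]:
    """Return (old controller name, new controller name, tag to deploy)."""
    revision = current_revision(app, rc_names)
    if revision is None:
        raise LookupError(f"no replication controller found for {app!r}")
    # Fall back to the tag that is already running when none is supplied.
    tag = new_tag or image_tag(running_image)
    return f"{app}-r{revision}", f"{app}-r{revision + 1}", tag
```

The old and new controller names are the two arguments a rolling update needs; everything else here is plain string handling and does not talk to the cluster.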

pnathan · on March 3, 2016

Sounds cool, but your readme is quite anemic.

hosh · on March 3, 2016

That's correct. There is not much documentation or examples with the code.

anentropic · on March 4, 2016

what does it do?

hosh · on March 4, 2016

It's a framework for building out your platform. If Kubernetes is a CAAS (containers as a service), then Matsuri is a framework for building a custom PAAS (platform as a service) on top of Kubernetes, specifically tailored to what you are trying to deploy. It can only be like that because Kubernetes was designed from the ground up to be composable building blocks for PAASs.

At its core, Matsuri does some of the things that Kubernetes doesn't do, but you still need to do. A lot of people use Bash scripts for this, and I didn't want to. For example, a single app might be coordinating among 3 replication controllers, 1 secret, and different environmental flags. You might also have different needs in dev mode (mounting source paths to the containers), staging (reduced resources, testing secrets), and production (full-blown HA, live secrets, etc). Matsuri takes advantage of certain features in the Ruby language to accomplish that. It doesn't use a template. Instead, you write Ruby code to generate the manifest.
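The general idea of generating manifests from code with inheritance and mixins can be sketched as follows. This is Python, not Ruby, and it is not Matsuri's API: the class names, image name, source path, and `APP_ENV` variable are all hypothetical. A base class describes the app, a mixin adds the dev-only source mount, and each environment overrides only what differs.

```python
class WebApp:
    """Hypothetical base description of an app, shared by all environments."""
    name = "web"
    image = "example/web:1.0"  # made-up image name
    replicas = 1

    def env(self) -> dict:
        return {"APP_ENV": "base"}

    def volumes(self) -> list:
        return []

    def volume_mounts(self) -> list:
        return []

    def manifest(self) -> dict:
        labels = {"app": self.name}
        container = {
            "name": self.name,
            "image": self.image,
            "env": [{"name": k, "value": v} for k, v in sorted(self.env().items())],
            "volumeMounts": self.volume_mounts(),
        }
        return {
            "apiVersion": "v1",
            "kind": "ReplicationController",
            "metadata": {"name": self.name, "labels": dict(labels)},
            "spec": {
                "replicas": self.replicas,
                "selector": dict(labels),
                "template": {
                    "metadata": {"labels": dict(labels)},
                    "spec": {"containers": [container], "volumes": self.volumes()},
                },
            },
        }


class SourceMountMixin:
    """Dev-only behaviour: mount the source tree from the host into the container."""
    source_path = "/home/dev/src/web"  # made-up host path

    def volumes(self) -> list:
        return super().volumes() + [{"name": "src", "hostPath": {"path": self.source_path}}]

    def volume_mounts(self) -> list:
        return super().volume_mounts() + [{"name": "src", "mountPath": "/app"}]


class DevWebApp(SourceMountMixin, WebApp):
    def env(self) -> dict:
        return {**super().env(), "APP_ENV": "dev"}


class ProductionWebApp(WebApp):
    replicas = 3  # more replicas for availability

    def env(self) -> dict:
        return {**super().env(), "APP_ENV": "production"}
```

`DevWebApp().manifest()` and `ProductionWebApp().manifest()` then differ only in the replica count, the environment flag, and the source mount, without a template or a copy of the shared parts.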

For more technical details, check out this thread: https://groups.google.com/forum/#!topic/kubernetes-sig-confi...

jamesblonde · on March 3, 2016

It is interesting to see the evolution of cluster scheduling at Google. Mostly, it has been driven by the desire to improve resource utilization - with a second thought given to usability. Their first go with Borg was centralized, correct, and enabled global state to make good scheduling decisions. Then they tried Omega, which never made it to production. But they subsequently separated scheduling from resource tracking in Borg (one scheduler node, ~5 paxos nodes handling resource notifications/communicating scheduling decisions with workers). Again, all optimizing utilization. Kubernetes, however, is about making it easier for developers. So, one IP per pod - unlike Borg, where containers share the host IP. I like Kubernetes, but adoption has been slow. It will be interesting to see how it develops - whether it will subsume Mesos, and how it will stack up against Hadoop/YARN.

cpitman · on March 3, 2016

Mesos and Kubernetes are not really competitors, in so far as Kubernetes is more than just a scheduler and there is a partial implementation and work ongoing to get Mesos plugged in as a scheduler for Kubernetes: https://github.com/kubernetes/kubernetes/blob/master/docs/ge...

I'm hoping that Red Hat's two Kubernetes based products will help kickstart adoption. Red Hat Atomic Enterprise Platform is basically hardened and supported Kubernetes (https://github.com/projectatomic/atomic-enterprise) and OpenShift v3 is a PaaS built on top of Kubernetes (https://www.openshift.org/).

Disclaimer, I work for Red Hat. But anyway, Kubernetes is awesome and could change the way data centers are run.

alexc05 · on March 4, 2016

The thing I'm still struggling to wrap my head around with kubernetes (and docker in general) is the management of data persistence.

If I have a pod of containers everything I read seems to say that containers need to be 'stateless' and if you're running kubernetes, will likely also be transient based on the system load & scale.

So if any container in a pod could go away or be spun up at any time, if it could live on any virtual machine in a cluster of physical machines...

where do you keep the physical database file so that it is accessible to all different instances? How do the multiple instances access it at the same time? Generally, how does the database layer "work" in container land?

It's the only piece that I really struggle to understand.

binocarlos · on March 4, 2016

It's a good question! k8s has the concept of Persistent Volumes - these are either NFS volumes or block devices (e.g. EBS) that are attached to a single node. When k8s moves a pod to another node - it will re-attach the volume to the new node before the pod starts. There is also Flocker by clusterHQ which has support for different volume types and works with k8s - disclaimer - I work for ClusterHQ.

https://github.com/kubernetes/kubernetes/blob/master/docs/us...

alexc05 · on March 4, 2016

Ok ... So that makes a little more sense. It really struck me that I just didn't know how to trust the persistence in a world where the application could run on any virtual cluster and even across different sets of physical hardware. (Looking at CoreOS for example as a system designed to appear as "one machine" despite spanning multiple hardware instances).

So when I specify the VOLUME (in a docker file/kubernetes-pod-spec) this is managed by the system.

And with respect to ensuring uninterrupted service on the database layer then, how are ongoing changes synched? Or is that something I must design for by adding a queueing layer?

To make that concrete, I have one pod with MySQL and the associated VOLUME is configured on the KUBERNETES instance.

I tell K to scale MySQL to 2 instances.

It points the second instance to the same physical files and starts up?

So VOLUMES live outside of KUBERNETES and do not spin up or down in a transient nature?

Can 2 instances of a database server use the same files without stepping on each other?

How does the VOLUME scale without becoming the new single point of failure?

Maybe that is the point where you use massive RAID striping and replication to scale storage? (That's far from a new concept and is a pretty stable tech).

If I'm on the right track here I might almost be ready to say I understand and trust persistence!

justinsb · on March 4, 2016

Your limitation is your database. Getting MySQL or Postgres to run on multiple machines is not trivial. You can do Master/Slave with failover (manual or automated), or try to find a clustered SQL database (Galera, or what I was working on at FathomDB).

If you have a single node database, Kubernetes can make sure that only a single pod is running at any one time, attached to your volume. It will automatically recover if e.g. a node crashes.

If you have a distributed SQL database, Kubernetes can make sure the right volumes get attached to your pods. (The syntax is a little awkward right now, because you have to create N replication controllers, but PetSets will fix this in 1.3). Each pod will automatically be recovered by Kubernetes, but it is up to your distributed database to remain available as the pods crash and restart.

In short: Kubernetes does the right thing here, but Kubernetes can't magically make a single-machine database into a distributed one.

alexc05 · on March 4, 2016

This was very helpful. Thank you.

mkulke · on March 4, 2016

Kubernetes is integrated with a lot of IaaS providers to get you a block storage volume to persist your data on. Once you request a persistent volume in a container it will provision the volume and attach it to the node where the container is scheduled. It is then mounted (formatted if empty) into the container. When the container is killed and restarted on another node the volume moves with the container to that node.

Now when you want clustering of things like mysql/mongo/elasticsearch/rabbitmq/etc it's a bit more complex, b/c they bring their own sharding/clustering concepts, which you have to implement on top of kubernetes. So you won't be able to simply scale mysql up via "kubectl scale rc --replicas=5", but you will have to implement a specific clustering solution, with five unique mysql-pods with their own volumes. For mysql there is "vitess" which is an attempt to build such an abstraction upon kubernetes.
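To make the "five unique mysql-pods with their own volumes" point concrete, here is a rough Python sketch that generates one replication controller per cluster member, each pinned to a single replica and its own persistent volume claim. The names (`mysql-0`, `mysql-0-data`, the `member` label) and the image tag are assumptions for illustration, and the sketch leaves out everything MySQL-specific such as credentials and the actual clustering or replication setup.

```python
from typing import List


def mysql_member(index: int) -> dict:
    """Build a ReplicationController manifest for one MySQL cluster member."""
    name = f"mysql-{index}"
    # A per-member label keeps each controller's selector distinct.
    labels = {"app": "mysql", "member": str(index)}
    return {
        "apiVersion": "v1",
        "kind": "ReplicationController",
        "metadata": {"name": name, "labels": dict(labels)},
        "spec": {
            # Exactly one pod per controller: scaling a controller up would
            # point several pods at the same data volume.
            "replicas": 1,
            "selector": dict(labels),
            "template": {
                "metadata": {"labels": dict(labels)},
                "spec": {
                    "containers": [
                        {
                            "name": "mysql",
                            "image": "mysql:5.7",
                            "volumeMounts": [
                                {"name": "data", "mountPath": "/var/lib/mysql"}
                            ],
                        }
                    ],
                    "volumes": [
                        {
                            "name": "data",
                            # Hypothetical claim name; the claim itself must
                            # be created separately.
                            "persistentVolumeClaim": {"claimName": f"{name}-data"},
                        }
                    ],
                },
            },
        },
    }


def mysql_cluster(size: int) -> List[dict]:
    """Return one manifest per member instead of one controller with N replicas."""
    return [mysql_member(i) for i in range(size)]
```

`mysql_cluster(5)` yields five manifests whose claims never overlap, which is the part that a plain "--replicas=5" cannot express.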

sschueller · on March 4, 2016

I feel the same way. There are a lot of tutorials but most ignore this aspect of data persistence and backup/disaster recovery. Even in those with persistent storage there is nothing about backups to deal with data corruption (application level).

If I build up a fluid infrastructure where containers get created and deleted all the time I need to make absolutely sure my persistent data is safe.

I think a big issue is that most of these setups use a central storage device, which those of us coming from a single server with local storage don't know much about or don't have the budget for. Setting up your own EBS is hard.

dcuthbertson · on March 4, 2016

I'm learning as well. In fact, I didn't know EBS referred to Amazon's Elastic Block Store. Googling "EBS block", the first link is http://ebs-block.com/. So no one feels rick rolled, the site is about using shipping containers to create sustainable housing <sigh>.

Yes, searching for "ebs block device" is much more informative.

anentropic · on March 4, 2016

I'm wondering too, I think this might be useful though http://kubernetes.io/v1.1/examples/flocker/

fh973 · on March 4, 2016

Kubernetes and the like are not solving this problem. But you can just drop in a distributed POSIX filesystem that provides file systems that are accessible (concurrently) from any host and hence any container.

For Quobyte we added an implicit file locking mechanism that exclusively locks any file on open. This way you can protect files from corruption through unintended concurrent access from applications that are not prepared to run in this environment. This way you can also build HA mysql without using mysql's replication mechanisms but resorting to the rescheduling of the container scheduler in case of a machine failure.
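For readers unfamiliar with lock-on-open, the effect can be approximated by hand with an explicit advisory lock. The Python sketch below is not Quobyte's mechanism (which is described as implicit and enforced by the file system); it only shows the behaviour an application would see: the first opener gets the file, and a second opener is refused until the first one closes it. The helper name `open_exclusive` is made up, and `fcntl` is only available on Unix-like systems.

```python
import fcntl
from contextlib import contextmanager


@contextmanager
def open_exclusive(path: str, mode: str = "r+b"):
    """Open a file and hold an exclusive advisory lock on it while it is open."""
    handle = open(path, mode)
    try:
        # LOCK_NB makes a second opener fail fast with BlockingIOError
        # instead of waiting for the first one to finish.
        fcntl.flock(handle.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)
        yield handle
    finally:
        # Closing the file also releases the lock.
        handle.close()
```

A database process that opens its data files this way would fail to start while another instance still holds them, rather than silently writing to the same files.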

lisivka · on March 4, 2016

You need to use a networked file system, e.g. nfs, samba, gluster, gfs, ceph, MooseFS, etc. Just mount it into the container and then you can move the container freely.

mentat · on March 5, 2016

I found the presentation on Vitess to be informative regarding this. You basically do it via sharding and anti-affinity, with reasonable restore times to pull from backup and get back into the replication chain.

abraae · on March 3, 2016

I always admire papers that make the virtual things we wrangle with all day seem even more physical.

This achieves that with its (afaik) novel usage of "hermetic":

  The key to making this abstraction work is having a hermetic container image.

Poetry.

atombender · on March 3, 2016

Not a novel usage at all. It's frequently used to mean "insulated" or "protected from the outside". E.g.:

http://www.nytimes.com/2015/04/07/opinion/roger-cohen-iran-u...

http://learning.blogs.nytimes.com/2012/08/08/word-of-the-day...

http://www.nytimes.com/2016/01/14/t-magazine/art/lucy-jorge-...

(Googling lowercase "hermetic" is nearly impossible since the uppercase meaning is much more prevalent.)

Of course, talking about "hermetic containers" makes it a sort of pun.

xplot · on March 3, 2016

This is gold.