
Borg, Omega, Kubernetes: Lessons learned from container management over a decade - alanfranz
http://queue.acm.org/detail.cfm?id=2898444
======
hosh
This is a great read. It's a fairly coherent summary of the things I
intuitively felt were great about Kubernetes, and it discusses them plainly
and explicitly.

Towards the end of the article, the authors mention the open challenges.

Unbeknownst to me, I had been working on something that attempts to address
some of those. It is my sixth attempt at a tool like this. It grew out of
seeing how many dev environments that tried to use Docker Compose ended up
with a large, ad-hoc collection of Bash scripts.

The project is called Matsuri, and it is found here:
[https://github.com/shopappsio/matsuri](https://github.com/shopappsio/matsuri)

It's intended as a framework for programmatically generating manifests and
executing kubectl commands. It doesn't try to have a lot of opinions (at
least, opinions that you can't change). The idea is that your platform support
tool is like any other app, and should be tailored to your specific collection
of apps.

It also has a notion of Apps, which are bundles of K8S resources declared as
dependencies. But that's as far as I've gotten -- it has a fairly anemic
convergence tool. I was more concerned with standardizing builds, pushes,
updates (collections of rollouts, migrations, etc.), shell commands, "console"
commands, etc. These are all achieved by expecting Apps to define callback
hooks for those actions.
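
Roughly, the idea looks like this (a hypothetical sketch to show the shape,
not Matsuri's actual API):

    # Hypothetical sketch, not Matsuri's actual API: an App bundles
    # its K8S resources and defines callback hooks for the
    # standardized actions (build, push, console, ...).
    class MyApp
      # K8S resources this app declares as dependencies
      def resources
        %w(web-rc web-svc worker-rc)
      end

      # Callback hooks the platform tool invokes for each action
      def build
        system('docker', 'build', '-t', image_tag, '.')
      end

      def push
        system('docker', 'push', image_tag)
      end

      def console
        system('kubectl', 'exec', '-it', 'myapp-web', '--', 'rails', 'console')
      end

      private

      def image_tag
        "registry.example.com/myapp:#{`git rev-parse --short HEAD`.strip}"
      end
    end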

I don't know if this is useful or interesting to anyone else. But if anyone is
interested, feel free to contact me about it.

~~~
mkulke
Yes, the part "Some open hard problems/Configuration" was particularly
interesting to me. Over time we evolved our tooling around k8s into a
versioned (e)yaml + jq + rc.tpl.json => rc.json solution, with our own
implementations of rolling updates (to make them dependent on readiness
checks) and node evacuation - all in Bash. While nifty, a wrong indent in a
yaml file can still spoil the party and require debugging at the container
level.

It had not occurred to me that a robust, typesafe approach to dealing with
k8s configuration objects could be a good idea.

~~~
hosh
That's an interesting solution!

Mine isn't typesafe, but now that I think about it, that would be an
interesting approach too. In Matsuri, debug options are available to show the
manifest or the kubectl commands. I specifically don't use yaml or json to
define the template; instead, it is programmatically generated from Ruby
directly. This lets me use Ruby class inheritance and module mixins to manage
everything. So no indent problems, though I sometimes run into specs that
don't validate against Kubernetes.
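
For a rough flavor of what that buys you (again a hypothetical sketch, not
the actual DSL):

    # Hypothetical sketch, not Matsuri's actual DSL: manifests are
    # plain Ruby data structures, so class inheritance and module
    # mixins apply, and there is no yaml indentation to get wrong.
    require 'yaml'

    class BasePod
      def name;  raise NotImplementedError; end
      def image; raise NotImplementedError; end

      def manifest
        { 'apiVersion' => 'v1',
          'kind'       => 'Pod',
          'metadata'   => { 'name' => name },
          'spec'       => {
            'containers' => [{ 'name' => name, 'image' => image }]
          } }
      end
    end

    # Inheritance instead of copy-pasted yaml
    class WebPod < BasePod
      def name;  'web';             end
      def image; 'myorg/web:1.2.3'; end
    end

    puts WebPod.new.manifest.to_yaml   # pipe into: kubectl create -f -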

The rolling-update in Matsuri still uses kubectl rolling-update under the
covers. What it does add is introspection to find the current revision
numbers and the current image tag (if you do not provide them).
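
Something along these lines (a sketch of the introspection idea, not the
actual code):

    # Sketch of the introspection idea (hypothetical helper, not
    # Matsuri's actual code): read the current image off the live
    # replication controller, then let kubectl do the rolling update.
    require 'json'

    def current_image(rc_name)
      rc = JSON.parse(`kubectl get rc #{rc_name} -o json`)
      rc['spec']['template']['spec']['containers'].first['image']
    end

    def rolling_update(rc_name, new_tag)
      repo = current_image(rc_name).split(':').first
      system('kubectl', 'rolling-update', rc_name, "--image=#{repo}:#{new_tag}")
    end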

------
jamesblonde
It is interesting to see the evolution of cluster scheduling at Google.
Mostly, it has been driven by the desire to improve resource utilization -
with only a second thought given to usability. Their first go, with Borg, was
centralized and correct, and used global state to make good scheduling
decisions. Then they tried Omega, which never made it to production, but they
subsequently separated scheduling from resource tracking in Borg (one
scheduler node, ~5 Paxos nodes handling resource notifications and
communicating scheduling decisions to workers). Again, all optimizing for
utilization. Kubernetes, however, is about making things easier for
developers. So, one IP per pod - unlike Borg, where containers share the
host's IP. I like Kubernetes, but adoption has been slow. It will be
interesting to see how it develops - whether it will subsume Mesos, and how
it will stack up against Hadoop/YARN.

~~~
cpitman
Mesos and Kubernetes are not really competitors, insofar as Kubernetes is
more than just a scheduler, and there is a partial implementation and ongoing
work to get Mesos plugged in as a scheduler for Kubernetes:
[https://github.com/kubernetes/kubernetes/blob/master/docs/ge...](https://github.com/kubernetes/kubernetes/blob/master/docs/getting-started-guides/mesos.md)

I'm hoping that Red Hat's two Kubernetes-based products will help kickstart
adoption. Red Hat Atomic Enterprise Platform is basically hardened and
supported Kubernetes ([https://github.com/projectatomic/atomic-enterprise](https://github.com/projectatomic/atomic-enterprise)),
and OpenShift v3 is a PaaS built on top of Kubernetes
([https://www.openshift.org/](https://www.openshift.org/)).

Disclaimer: I work for Red Hat. But anyway, Kubernetes is awesome and could
change the way data centers are run.

------
alexc05
The thing I'm still struggling to wrap my head around with Kubernetes (and
Docker in general) is the management of data persistence.

If I have a pod of containers, everything I read seems to say that the
containers need to be 'stateless' and, if you're running Kubernetes, will
likely also be transient, coming and going with system load and scale.

So if any container in a pod could go away or be spun up at any time, and if
it could live on any virtual machine in a cluster of physical machines...

Where do you keep the physical database file so that it is accessible to all
the different instances? How do multiple instances access it at the same
time? Generally, how does the database layer "work" in container land?

It's the only piece that I really struggle to understand.

~~~
binocarlos
It's a good question! k8s has the concept of Persistent Volumes - these are
either NFS volumes or block devices (e.g. EBS) that are attached to a single
node. When k8s moves a pod to another node, it will re-attach the volume to
the new node _before_ the pod starts. There is also Flocker by ClusterHQ,
which supports different volume types and works with k8s - disclaimer: I
work for ClusterHQ.

[https://github.com/kubernetes/kubernetes/blob/master/docs/us...](https://github.com/kubernetes/kubernetes/blob/master/docs/user-guide/volumes.md)
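
Concretely, the wiring looks roughly like this (a hypothetical sketch,
borrowing the Ruby-hash style from upthread; the names and volume ID are
made up):

    # Sketch of a pod that mounts an EBS volume (hypothetical names
    # and volume ID). k8s attaches the device to whichever node the
    # pod is scheduled onto, before the containers start.
    require 'yaml'

    pod = {
      'apiVersion' => 'v1',
      'kind'       => 'Pod',
      'metadata'   => { 'name' => 'db' },
      'spec'       => {
        'containers' => [{
          'name'         => 'mysql',
          'image'        => 'mysql:5.6',
          'volumeMounts' => [{ 'name' => 'data', 'mountPath' => '/var/lib/mysql' }]
        }],
        'volumes' => [{
          'name' => 'data',
          'awsElasticBlockStore' => { 'volumeID' => 'vol-0123abcd', 'fsType' => 'ext4' }
        }]
      }
    }

    puts pod.to_yaml   # feed to: kubectl create -f -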

~~~
alexc05
Ok ... so that makes a little more sense. It really struck me that I just
didn't know how to _trust_ the persistence in a world where the application
could run on any virtual cluster, and even across different sets of physical
hardware. (Looking at CoreOS, for example, as a system designed to appear as
"one machine" despite spanning multiple hardware instances.)

So when I specify the VOLUME (in a Dockerfile/Kubernetes pod spec), it is
managed by the system.

And with respect to ensuring uninterrupted service at the database layer,
how are ongoing changes synced? Or is that something I must design for by
adding a queueing layer?

To make that concrete: I have one pod with MySQL, and the associated VOLUME
is configured on the KUBERNETES instance.

I tell K to scale MySQL to 2 instances.

It points the second instance at the same physical files and starts it up?

So VOLUMES live outside of KUBERNETES and do not spin up or down
transiently?

Can 2 instances of a database server use the same files without stepping on
each other?

How does the VOLUME scale without becoming the new single point of failure?

Maybe that is the point where you use massive RAID striping and replication
to scale storage? (That's far from a new concept and is pretty stable tech.)

If I'm on the right track here I might almost be ready to say I understand and
trust persistence!

~~~
justinsb
Your limitation is your database. Getting MySQL or Postgres to run on multiple
machines is not trivial. You can do Master/Slave with failover (manual or
automated), or try to find a clustered SQL database (Galera, or what I was
working on at FathomDB).

If you have a single-node database, Kubernetes can make sure that only a
single pod is running at any one time, attached to your volume. It will
automatically recover if, e.g., a node crashes.
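
In manifest terms, that looks roughly like this (a hypothetical sketch,
again as a Ruby hash): a replication controller pinned at one replica, with
the volume in the pod template.

    # Hypothetical sketch: a replication controller pinned at
    # replicas: 1, with the data volume in the pod template. k8s
    # keeps one pod running and re-attaches the volume on recovery.
    require 'yaml'

    rc = {
      'apiVersion' => 'v1',
      'kind'       => 'ReplicationController',
      'metadata'   => { 'name' => 'mysql' },
      'spec'       => {
        'replicas' => 1,                      # a single-node database
        'selector' => { 'app' => 'mysql' },
        'template' => {
          'metadata' => { 'labels' => { 'app' => 'mysql' } },
          'spec'     => {
            'containers' => [{
              'name'         => 'mysql',
              'image'        => 'mysql:5.6',
              'volumeMounts' => [{ 'name' => 'data', 'mountPath' => '/var/lib/mysql' }]
            }],
            'volumes' => [{
              'name' => 'data',
              'persistentVolumeClaim' => { 'claimName' => 'mysql-data' }
            }]
          }
        }
      }
    }

    puts rc.to_yaml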

If you have a distributed SQL database, Kubernetes can make sure the right
volumes get attached to your pods. (The syntax is a little awkward right now,
because you have to create N replication controllers, but PetSets will fix
this in 1.3). Each pod will automatically be recovered by Kubernetes, but it
is up to your distributed database to remain available as the pods crash and
restart.

In short: Kubernetes does the right thing here, but Kubernetes can't magically
make a single-machine database into a distributed one.

~~~
alexc05
This was very helpful. Thank you.

------
abraae
I always admire papers that make the virtual things we wrangle with all day
seem even more physical.

This achieves that with its (afaik) novel usage of "hermetic":

    The key to making this abstraction work is having a hermetic container image.

Poetry.

~~~
lobster_johnson
Not a novel usage at all. It's frequently used to mean "insulated" or
"protected from the outside". E.g.:

[http://www.nytimes.com/2015/04/07/opinion/roger-cohen-iran-u...](http://www.nytimes.com/2015/04/07/opinion/roger-cohen-iran-united-states-embassy-tehran.html)

[http://learning.blogs.nytimes.com/2012/08/08/word-of-the-day...](http://learning.blogs.nytimes.com/2012/08/08/word-of-the-day-hermetic/)

[http://www.nytimes.com/2016/01/14/t-magazine/art/lucy-jorge-...](http://www.nytimes.com/2016/01/14/t-magazine/art/lucy-jorge-orta-antarctica-art.html)

(Googling lowercase "hermetic" is nearly impossible since the uppercase
meaning is much more prevalent.)

Of course, talking about "hermetic containers" makes it a sort of pun.

------
xplot
This is gold.

