
Lmctfy: Google's Linux application container stack - friism
https://github.com/google/lmctfy/
======
WestCoastJustin
There is a great Wired article [1] which outlines how Google uses
orchestration software to manage Linux containers in its day-to-day
operations. Midway through the article there is a great diagram showing
Omega (Google's orchestration software) and how it deploys containers for
images, search, Gmail, etc. onto the same physical hardware. There is an
amazing talk by John Wilkes (Google Cluster Management, Mountain View) about
Omega at Google Faculty Summit 2011 [2]; I would highly recommend watching it!

By the way, one of the key concepts behind containers is control groups
(cgroups) [3, 4], which were initially added to the kernel back in 2007 by
two Google engineers, so they have definitely given back in this area. I know
all this because I have spent the last two weeks researching control groups
for an upcoming screencast.

I am happy Google released this, and cannot wait to dig through it!

[1] [http://www.wired.com/wiredenterprise/2013/03/google-borg-twi...](http://www.wired.com/wiredenterprise/2013/03/google-borg-twitter-mesos/all/)

[2] [http://www.youtube.com/watch?v=0ZFMlO98Jkc](http://www.youtube.com/watch?v=0ZFMlO98Jkc)

[3] [http://en.wikipedia.org/wiki/Cgroups](http://en.wikipedia.org/wiki/Cgroups)

[4] [https://www.kernel.org/doc/Documentation/cgroups/cgroups.txt](https://www.kernel.org/doc/Documentation/cgroups/cgroups.txt)

~~~
stormbrew
Am I the only one who feels like cgroups are extraordinarily complex for the
problem they're trying to solve? It seems like a simpler structure could have
achieved most of the same goals without requiring one or two layers of
abstraction (in Docker's case, cgroup->lxc->docker) to find widespread use.

In particular, was the ability to migrate a process, or to have a process in
two cgroups, really essential to containerization? Without those, it seems it
would be a simple matter of nice/setuidgid-style privilege de-escalation
commands to get the same kinds of behaviour, without adding a whole other
layer of resource management (the named groups) to the mix.

The cgroups document you link to as [4] has such a weirdly contrived use-case
example that it makes me think they were trying really hard to come up with a
way to justify the complexity they baked into the idea.

~~~
menage
(Original cgroups developer here, although I've since moved on from Google and
don't have time to play an active role anymore.)

It's true that cgroups are a complex system, but they were developed to solve
a complex group of problems (packing large numbers of dynamic jobs on servers,
with some resources isolated, and some shared between different jobs). I think
that pretty much all the features of cgroups come either from real
requirements, or from constraints due to the evolution of cgroups from
cpusets.

Back when cgroups was being developed, cpusets had fairly recently been
accepted into the kernel, and it had a basic process grouping API that was
pretty much what cgroups needed. It was much easier politically to get people
to accept an evolution of cpusets into cgroups (in a backward-compatible way)
than to introduce an entirely new API. With hindsight, this was a mistake, and
we should have pushed for a new (binary, non-VFS) API, as having to fit
everything into the metaphor of a filesystem (and deal with all the VFS logic)
definitely got in the way at times.

If you want to be able to manage/tweak/control the resources allocated to a
group after you've created the group, then you need _some_ way of naming that
group, whether it be via a filesystem directory or some kind of numerical
identifier (like a pid). So I don't think a realistic resource management
system can avoid that.
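
To make that concrete: in the cgroup filesystem API the group's name is just
its directory, and post-creation tweaks are writes to control files under it.
A minimal sketch (not lmctfy code; the group name, limit, and cgroup v1 mount
point are illustrative, and this needs root):

    #include <fstream>
    #include <sys/stat.h>

    int main() {
      // The directory itself is the group's name/identifier.
      mkdir("/sys/fs/cgroup/memory/data_jobs", 0755);

      // Later, adjust the group's memory limit by referring to that name.
      std::ofstream limit("/sys/fs/cgroup/memory/data_jobs/memory.limit_in_bytes");
      limit << (512 * 1024 * 1024);  // 512 MiB
      return 0;
    }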

The most common pattern of the need for a process being in multiple cgroups is
that of a data-loader/data-server job pair. The data-loader is responsible for
periodically loading/maintaining some data set from across the network into
(shared) memory, and the data-server is responsible for low-latency serving of
queries based on that data. So they both need to be in the same cgroup for
memory purposes, since they're sharing the memory occupied by the loaded data.
But the CPU requirements of the two are very different - the data-loader is
very much a background/batch task, and shouldn't be able to steal CPU from
either the data-server or from any other latency-sensitive job on the same
machine. So for CPU purposes, they need to be in separate cgroups. That (and
other more complex scenarios) is what drives the requirement for multiple
independent hierarchies of cgroups.
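
A minimal sketch of that split, assuming cgroup v1 `tasks` files and made-up
group names:

    #include <fstream>
    #include <sys/types.h>

    // Writing a pid into a hierarchy's "tasks" file moves that task into
    // the named group; each hierarchy is independent of the others.
    void attach(const char* tasks_file, pid_t pid) {
      std::ofstream(tasks_file) << pid;
    }

    int main() {
      pid_t loader_pid = 1234;  // hypothetical pid of the data-loader
      attach("/sys/fs/cgroup/memory/serving_data/tasks", loader_pid);  // shared memory accounting
      attach("/sys/fs/cgroup/cpu/batch/tasks", loader_pid);            // background CPU treatment
      return 0;
    }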

Since the data-loader and data-server can be stopped/updated/started
independently, you need to be able to launch a new process into an existing
cgroup. It's true that the need to be able to move a process into a different
cgroup would be much reduced if there was an extension to clone() to allow you
to create a child directly in a different set of cgroups, but cpusets already
provided the movement feature, and extending clone in an intrusive way like
that would have raised a lot of resistance, I think.
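
Launching into an existing group is then just fork, attach, exec; a sketch
(the cgroup path and binary are illustrative):

    #include <fstream>
    #include <unistd.h>

    int main() {
      pid_t pid = fork();
      if (pid == 0) {
        // Child: join the existing cgroup before exec'ing the real job.
        std::ofstream("/sys/fs/cgroup/memory/serving_data/tasks") << getpid();
        execl("/usr/local/bin/data-server", "data-server", (char*)nullptr);
      }
      return 0;
    }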

~~~
stormbrew
Cool, thanks for the details.

------
fletchowns
I assume the name is a reference to [http://lmgtfy.com](http://lmgtfy.com) ?

~~~
Sevein
haha that's true! let me contain that for ya!

------
icambron
Can someone explain why you would use this instead of LXC? Is it just that
Google built this before LXC existed, or are there differences in what it's
useful for or capable of?

------
jamesaguilar
One of the pieces of code that has a more general purpose:

[https://github.com/google/lmctfy/blob/master/util/task/codes...](https://github.com/google/lmctfy/blob/master/util/task/codes.proto)

One thing I really like about working with Google software is that you can
count on the same namespace of error codes being used pretty much everywhere.
Generally speaking, the software I write and work with can't differentiate
between errors at a finer grain than these. The machine-readable codes are
the ones you can respond to differently, and the detailed messages go in the
status message. This is how it should be, according to this semi-humble
engineer!
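
The consumption pattern looks roughly like the sketch below; the `Status`
type and handler are stand-ins I made up, though the code names mirror the
canonical set in codes.proto:

    #include <string>

    enum class Code { OK, INVALID_ARGUMENT, NOT_FOUND, UNAVAILABLE };

    struct Status {
      Code code;            // machine-readable, from the shared namespace
      std::string message;  // detailed, human-readable part
    };

    void Handle(const Status& s) {
      switch (s.code) {
        case Code::UNAVAILABLE: /* retry with backoff */ break;
        case Code::NOT_FOUND:   /* create the resource, then retry */ break;
        default:                /* surface s.message and fail */ break;
      }
    }

    int main() {
      Handle(Status{Code::UNAVAILABLE, "backend not ready"});
      return 0;
    }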

------
contingencies
I wouldn't get super excited about this... there's very little that's new
here (it's almost an LMRTFY: _let me reimplement that for you_!). As a layer
on top of kernel functionality, this code seems very thin, basically doing
the same thing as the existing LXC userspace tools but far less of it. It is
targeted at process CPU and memory isolation, rather than entire-system-image
virtualization.

Key quotes:

(1) _Currently only provides robust CPU and memory isolation_

(2) _In our roadmap... Disk IO Isolation ... Network Isolation ... Support
for Namespaces ... Support for Root File Systems ... Disk Images ... Support
for Pause/Resume ... Checkpoint Restore_

~~~
thockin
FWIW, we have a lot of these things done internally, but not in a releasable
form yet.

~~~
contingencies
Could you perhaps explain what the overall motivations were for not using LXC
userspace tools and instead creating an alternative?

Watching the Google Omega talk video linked in the comment above, I am
guessing your implementation probably mostly exists to _instantiate Google
Omega cluster cell-manager specified jobs including intra-google properties of
resource shape, constraints and preferences_ in a local container. I am
guessing that part is not released because the current code has too much to do
with your internal standards for expressing the above job-related metadata.

~~~
menage
LXC didn't really exist in a usable form when Google started work on kernel-
based resource isolation.

~~~
contingencies
Hrrm OK... when was that? 2009 or earlier I suppose. I was using it in 2010
with few hiccups.

~~~
menage
We started building a large-scale datacenter automation system in 2003; by
late 2005 it was deployed on most machines, but it became apparent that
achieving the high density of job packing we wanted was going to be
impossible by relying on post-hoc user-space enforcement of resource
allocation (killing jobs that used too much memory, nicing jobs that used too
much CPU, etc.).
Sensitive services like websearch were insisting on their own dedicated
machines or even entire dedicated clusters, due to the performance penalties
of sharing a machine with a careless memory/CPU hog. We clearly needed some
kind of kernel support, but back then it didn't really exist - there were
several competing proposals for a resource control system like cgroups but
none of them made much progress.

One that did get in was cpusets, and on the suggestion of akpm (who had
recently joined Google) we started experimenting with using cpusets for very
crude CPU and memory control. Assigning dedicated CPUs to a job was pretty
easy via cpusets. Memory was trickier - by using a feature originally intended
for testing NUMA on non-NUMA systems, we broke memory up into many "fake" NUMA
nodes, and dynamically assigned them to jobs on the machine based on their
memory demands and importance. This started making it into production in late
2006 (I think), around the same time that we were working on evolving cpusets
into cgroups to support new resource controls.

~~~
contingencies
Interesting history, makes sense. (I thought your name was familiar: you are
the author of the _cgroups.txt_ kernel documentation! Do you still get to work
on this stuff much? What is your take on the apparent popularization of
container-based virt? What are the kernel features you would like to see in
the area that do not yet exist?)

Was there a reason you guys didn't open source this many years ago?

~~~
menage
I left the cluster management group over three years ago, so I've not had much
chance to work on / think about containers since then.

This code grew symbiotically with Google's kernel patches (big chunks of which
were open-sourced into cgroups) and the user-space stack (which was tightly
coupled with Google's cluster requirements). So open-sourcing it wouldn't
necessarily have been useful for anyone. It looks like someone's done a lot of
work to make this more generically-applicable before releasing it.

------
teraflop
People are comparing this to Docker, but lmctfy seems to be much more limited
-- it doesn't even try to isolate the filesystem, for instance.

In fact, based on the documentation, I don't see how this is anything
different from the "cgroup-bin" scripts that have shipped with Ubuntu for
years: [http://linuxaria.com/article/introduction-to-cgroups-the-lin...](http://linuxaria.com/article/introduction-to-cgroups-the-linux-conrol-group?lang=en)

------
natch
Decent description in the readme, but I could not find the part that explains
"what does this buy me?"

Finally, it came down to:

"This gives the applications the impression of running exclusively on a
machine."

OK but as an outsider that still doesn't tell me what it buys me (or what it
buys you, or Google).

(By outsider, I mean I have reasonable ability to administer my own Linux
system, but wouldn't trust myself to do so in a production environment... so
I'm not up on the latest practices in system administration or especially
Google-scale system administration.)

Setting aside whether I need it (I'm pretty sure I don't, so no need to tell
me that), I'm really curious what this is good for. Can someone explain it in
more layperson's terms? It sounds like applications can still stomp on each
others' files, and consume memory that takes away from what's available for
other applications, so what is the benefit?

I'm not questioning that there's a benefit, just wondering what it is, and how
this is used.

~~~
menage
cgroups (and lmctfy) do support limiting memory usage on a per-application
basis, as well as controlling a bunch of other resources (the ability to run
on particular CPUs, access to certain network ports, disk I/O, etc.).

You can also prevent applications from stomping on each others' files, with a
combination of permissions, chroots and mount namespaces.
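
A minimal sketch of that file-isolation combination (the jail path is made
up, and this needs root; a real setup would also set up mounts and drop
privileges):

    #include <sched.h>
    #include <unistd.h>

    int main() {
      // Give this process a private mount namespace (needs CAP_SYS_ADMIN)...
      unshare(CLONE_NEWNS);
      // ...and a private root, so it can't see other apps' files.
      chroot("/jails/app1");  // hypothetical per-app root directory
      chdir("/");
      execl("/bin/sh", "sh", (char*)nullptr);
      return 0;
    }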

This is basically a low-level API for a controller daemon. The daemon knows
(via a centralized cluster scheduler) what jobs need to be running on the
machine, and how much of each kind of resource they're guaranteed and/or are
limited to. lmctfy translates those requirements into kernel calls to set up
cgroups that implement the required resource limits/guarantees.

While you _could_ use it for hand administration of a box, or even config-
file-based administration of a box, you probably wouldn't want to (lxc may
well be more appropriate for that).
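
For a feel of the controller-daemon shape described above, here's a sketch in
which every name is a hypothetical stand-in, not lmctfy's real API:

    #include <iostream>
    #include <string>

    struct JobSpec { std::string name; long memory_bytes; int cpu_shares; };

    JobSpec WaitForSchedulerAssignment() {
      return {"websearch_shard_7", 1L << 30, 1024};  // pretend scheduler RPC
    }

    void ApplyResourceSpec(const JobSpec& job) {
      // In the real stack this is lmctfy's job: translating the spec into
      // cgroup setup; here we only print what would happen.
      std::cout << "create container " << job.name
                << " mem=" << job.memory_bytes
                << " cpu_shares=" << job.cpu_shares << "\n";
    }

    int main() {
      ApplyResourceSpec(WaitForSchedulerAssignment());
      return 0;
    }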

------
macarthy12
Great, some competition for Docker! Which is good news for everyone (esp.
Docker). Containers are the future of hosting web apps, in my opinion. The
more implementations there are, the better.

~~~
yannk
It's not competition for Docker; it's more of a complement to it.

edit: see shykes reply:
[https://news.ycombinator.com/item?id=6487080](https://news.ycombinator.com/item?id=6487080)

~~~
vishal0123
It's somewhat of a competitor to Docker, but in no way a complement to it.
In fact, it is recommended not to run lmctfy alongside LXC (which Docker uses
at its core).

~~~
nickstinemates
What about linking to the lmctfy .so instead of shelling out to LXC, as
Docker currently does? There's a lot of promise in such an approach, should
lmctfy solidify further.

------
silas
Hacked together a Vagrantfile if anyone wants to try it out:

[https://github.com/silas/vagrant-lmctfy](https://github.com/silas/vagrant-lmctfy)

------
tomrod
After hearing about this and Docker, I'm interested in learning more about
containers from a high level point of view. What are they used for, beyond
very slim VMs? What direction is the technology heading?

~~~
vidarh
Consider it very fine-grained resource control: you can restrict exactly
what memory, CPU, process visibility, network access, etc. a process or group
of processes can have.

~~~
tomrod
So you have a base set necessary for an OS of some sort, then match and choose
whatever you want?

------
chrisreichel
\- Off Topic -

Is it just me, or do you guys think it's strange that Google is using GitHub
instead of its own code-hosting infrastructure?

~~~
mkr-hn
I don't think it's strange that Google lets teams use their preferred code
hosting/versioning system.

~~~
mh-
that, and GitHub's offering isn't going to be spring-cleaned anytime soon.

------
jared314
> it is not recommended to run lmctfy alongside LXC or another container
> system

I guess this makes it a choice between this and Docker.io, unless it becomes a
docker backend.

------
fsniper
It should have a better name, one that's more pronounceable and more robust
against typos :)

Other than that, why open source it now? Is it a race against Docker and LXC?
Or is it simply Google paying back to FLOSS?

~~~
vmarmol
We've wanted to open source this for some time, but it took a while to make
it less Google-specific. Things just lined up recently and we were finally
able to release it.

------
nwmcsween
Someone (probably me, eventually) needs to write a libcontainer and a
libresource that can be used on BSD / Linux without the LXC mess

~~~
contingencies
Like this? [http://libvirt.org/](http://libvirt.org/)

It seems the FreeBSD port originally failed due to deficiencies in kernel
features [1] but is now available [2], _at least partially_, possibly only to
manage workloads on remote systems (typically Linux).

[1] [http://forums.freebsd.org/showthread.php?p=100894](http://forums.freebsd.org/showthread.php?p=100894)

[2] [http://svnweb.freebsd.org/ports/head/devel/libvirt/](http://svnweb.freebsd.org/ports/head/devel/libvirt/)

~~~
nwmcsween
No, libvirt is a disgusting mess; look at the internals sometime.

------
United857
Interesting that they are using GitHub and not Google Code. I wonder if the
latter is next on the Google kill list?

------
teddyh
Others have commented that this is a very thin API on top of the already
existing LXC system. This worries me; maybe this is a move by Google to be
able to switch away from Linux (their last GPL component in Android).

------
zurn
From README:

> lmctfy was originally designed and implemented around a custom kernel with a
> set of patches on top of a vanilla Linux kernel.

No sign of said patches though. Anyone know if they're available?

~~~
jnagal
We'll put a kernel image and the set of Google-specific patches up on the
site. We're in the process of cleaning them up to work with some of the
upstream kernels.

------
zurn
Interesting that this is C++ (vs Docker in Go).

