By the way, one of the key concepts behind containers is control groups (cgroups) [3, 4], which were initially added to the kernel back in 2007 by two Google engineers, so they have definitely given back in this area. I know all this because I have spent the last two weeks researching control groups for an upcoming screencast.
I am happy Google released this, and cannot wait to dig through it!
It can be used as a C++ library, too - I'm going to evaluate it as a possible low-level execution backend for docker :)
* Building is relatively straightforward on an Ubuntu system. You'll need to install re2 from source, but that's about it.
* No configuration necessary to start playing. lmctfy just straight up mounts cgroups and starts playing in there.
* Containers can be nested which is nice.
* I really couldn't figure out useful values for the container spec. Even the source doesn't seem to have a single reference - it's all dynamically registered by various subsystems in a rather opaque way. I opened a few issues to ask for more details.
* This is a really low-level tool. Other than manipulating cgroups it doesn't seem to do much, which is perfect for my particular use case (docker integration). I couldn't figure out how to set namespaces (including mnt and net namespaces which means all my containers shared the host's filesystem and network interfaces). I don't know if that functionality is already in the code, or has yet to be added.
* Given the fairly small footprint, limited feature set, and clean build experience, this really looks like an interesting option to use as a backend for docker. I like that it carries very few dependencies. Let's see what the verdict is on these missing features.
Are you referring to the container spec in the proto file? https://github.com/google/lmctfy/blob/master/include/lmctfy.... Which attributes are you having trouble setting a useful value for?
And you can approximate mount namespaces with chroots and bind mounts. (In some ways that's better, since it's a bit easier for a process outside the container to interact with the container's filesystem).
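To make that concrete, here's a minimal sketch (mine, not lmctfy's) of what the chroot + bind-mount approximation looks like. The paths, the read-only remount, and the helper name are all illustrative assumptions, and you'd need root to actually run the generated commands:

```python
# Sketch: approximate a mount namespace with chroot + bind mounts.
# Paths and the command list are illustrative, not anything from lmctfy.

def bind_chroot_commands(root, shared_dirs):
    """Return the shell commands that would set up a chroot at `root`,
    with read-only bind mounts for host directories the container shares."""
    cmds = []
    for host_dir in shared_dirs:
        target = root.rstrip("/") + host_dir
        cmds.append("mkdir -p %s" % target)
        cmds.append("mount --bind %s %s" % (host_dir, target))
        # Bind mounts inherit writability; remount read-only on top.
        cmds.append("mount -o remount,ro,bind %s" % target)
    cmds.append("chroot %s /bin/sh" % root)
    return cmds

if __name__ == "__main__":
    for c in bind_chroot_commands("/containers/app1", ["/usr", "/lib"]):
        print(c)
```

The nice property mentioned above falls out naturally: the container's filesystem is just a directory on the host, so outside processes can inspect or modify it with ordinary tools.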
In particular, was the ability to migrate a process, or to have a process in two cgroups, really essential to containerization? It seems like without those it'd be a simple matter of nice/setuidgid-style privilege de-escalation commands to get the same kinds of behaviour, without adding a whole 'nother resource management layer (the named groups) to the mix.
The cgroups document you link to has such a weirdly contrived use-case example that it makes me think they were trying really hard to come up with a way to justify the complexity they baked into the idea.
It's true that cgroups are a complex system, but they were developed to solve a complex group of problems (packing large numbers of dynamic jobs on servers, with some resources isolated, and some shared between different jobs). I think that pretty much all the features of cgroups come either from real requirements, or from constraints due to the evolution of cgroups from cpusets.
Back when cgroups was being developed, cpusets had fairly recently been accepted into the kernel, and it had a basic process grouping API that was pretty much what cgroups needed. It was much easier politically to get people to accept an evolution of cpusets into cgroups (in a backward-compatible way) than to introduce an entirely new API. With hindsight, this was a mistake, and we should have pushed for a new (binary, non-VFS) API, as having to fit everything into the metaphor of a filesystem (and deal with all the VFS logic) definitely got in the way at times.
If you want to be able to manage/tweak/control the resources allocated to a group after you've created the group, then you need some way of naming that group, whether it be via a filesystem directory or some kind of numerical identifier (like a pid). So I don't think a realistic resource management system can avoid that.
The most common pattern of the need for a process being in multiple cgroups is that of a data-loader/data-server job pair. The data-loader is responsible for periodically loading/maintaining some data set from across the network into (shared) memory, and the data-server is responsible for low-latency serving of queries based on that data. So they both need to be in the same cgroup for memory purposes, since they're sharing the memory occupied by the loaded data. But the CPU requirements of the two are very different - the data-loader is very much a background/batch task, and shouldn't be able to steal CPU from either the data-server or from any other latency-sensitive job on the same machine. So for CPU purposes, they need to be in separate cgroups. That (and other more complex scenarios) is what drives the requirement for multiple independent hierarchies of cgroups.
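You can actually see this split on any Linux box: /proc/&lt;pid&gt;/cgroup has one line per hierarchy, in the form "hierarchy-id:controllers:path". Here's a small sketch with a fabricated sample that mirrors the data-loader/data-server case (the job names are invented):

```python
# Sketch: one process can sit in a different cgroup per hierarchy.
# /proc/<pid>/cgroup lines look like "hierarchy-id:controllers:path";
# the sample below is fabricated to mirror the loader/server example.

def cgroup_memberships(proc_cgroup_text):
    """Map each controller to the cgroup path the process belongs to."""
    memberships = {}
    for line in proc_cgroup_text.strip().splitlines():
        _, controllers, path = line.split(":", 2)
        for ctrl in controllers.split(","):
            memberships[ctrl] = path
    return memberships

sample = """\
3:memory:/jobs/dataset
2:cpu,cpuacct:/jobs/dataset/loader
"""
# The loader shares the memory cgroup with the server, but has its own
# background cpu cgroup:
print(cgroup_memberships(sample))
```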
Since the data-loader and data-server can be stopped/updated/started independently, you need to be able to launch a new process into an existing cgroup. It's true that the need to be able to move a process into a different cgroup would be much reduced if there was an extension to clone() to allow you to create a child directly in a different set of cgroups, but cpusets already provided the movement feature, and extending clone in an intrusive way like that would have raised a lot of resistance, I think.
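For the curious, the raw cgroup-v1 filesystem operations behind that setup look roughly like this. This is my own sketch (group names, limits, and helper functions invented), written against a configurable root directory so it can be tried without root privileges or a real /sys/fs/cgroup:

```python
# Sketch of the raw cgroup v1 operations described above: one memory
# group shared by loader+server, separate cpu groups. `root` stands in
# for /sys/fs/cgroup; point it at a scratch directory to experiment.
import os

def make_group(root, hierarchy, group, limits):
    """mkdir the group under one hierarchy and write its limit files."""
    path = os.path.join(root, hierarchy, group)
    os.makedirs(path, exist_ok=True)
    for knob, value in limits.items():
        with open(os.path.join(path, knob), "w") as f:
            f.write(str(value))
    return path

def enter_group(group_path, pid):
    """Attach a pid by appending it to the group's tasks file."""
    with open(os.path.join(group_path, "tasks"), "a") as f:
        f.write("%d\n" % pid)

def set_up_job(root, loader_pid, server_pid):
    mem = make_group(root, "memory", "dataset",
                     {"memory.limit_in_bytes": 8 << 30})
    enter_group(mem, loader_pid)
    enter_group(mem, server_pid)        # shared memory accounting
    cpu_bg = make_group(root, "cpu", "dataset/loader",
                        {"cpu.shares": 2})      # background weight
    cpu_fg = make_group(root, "cpu", "dataset/server",
                        {"cpu.shares": 1024})   # latency-sensitive
    enter_group(cpu_bg, loader_pid)     # separate CPU treatment
    enter_group(cpu_fg, server_pid)
```

Launching a new process into an existing cgroup is just another append to the tasks file, which is why restart/update of one half of the pair works without touching the other.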
cgroups are indeed a mess. The api is highly unstable and there is an effort underway to sanitize it, with the help of a "facade" userland api. In other words kernel devs are basically saying: "use this userland api while we fix our shit". (I don't claim to understand the intricacies of this problem. All I know is that, as a developer of docker, it is better for my sanity to maintain an indirection between my tool and the kernel - until things settle down, at least).
warden has a C server core wrapping cgroups and other features currently on the lmctfy roadmap, like network and file system isolation. The current file system isolation uses either aufs or overlayfs, depending on the distro/kernel version you are using. The network isolation uses namespaces and additional features.

warden also has early/experimental support for CentOS in addition to Ubuntu, although some of the capabilities are degraded. For example, disk isolation falls back to a less efficient, but still workable, copy-based file system approach.

The client orchestration of warden is currently written in Ruby, but there is also a branch to move that to Go, which has not yet been hardened and merged into master.

Recently Cloud Foundry started using bosh-lite, which leverages warden to build full dev environments using Linux containers instead of separate Linux hosts on many virtual machines from an IaaS provider; this has dramatically reduced the resources and time required to create, develop and use the full system.
One thing I really like about working with Google software is that you can count on the same namespace of error codes being used pretty much everywhere. Generally speaking, the software I write and work with can't differentiate between errors finer than these. The machine readable errors are the ones you can respond to differently, and you put detailed messages in the status message. This is how it should be, according to this semi-humble engineer!
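For anyone who hasn't seen the pattern: a small, fixed set of machine-readable codes, with all the detail pushed into the message. A rough Python illustration follows; the Status class is my own, but the code names and numbers follow Google's canonical error space (as later standardized in gRPC):

```python
# Sketch of the "one shared error-code namespace" pattern: callers branch
# on a small fixed set of codes; detail lives only in the message string.
# Status here is illustrative, not lmctfy's actual class.
import enum

class Code(enum.IntEnum):
    OK = 0
    INVALID_ARGUMENT = 3
    NOT_FOUND = 5
    ALREADY_EXISTS = 6
    PERMISSION_DENIED = 7
    INTERNAL = 13
    UNAVAILABLE = 14

class Status:
    def __init__(self, code=Code.OK, message=""):
        self.code = code          # what callers branch on
        self.message = message    # human-readable detail only

    def ok(self):
        return self.code == Code.OK

def find_container(name, containers):
    """Hypothetical lookup showing the calling convention."""
    if name not in containers:
        return Status(Code.NOT_FOUND, "no container named %r" % name)
    return Status()
```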
(1) Currently only provides robust CPU and memory isolation
(2) In our roadmap... Disk IO Isolation ... Network Isolation ... Support for Namespaces ... Support for Root File Systems ... Disk Images ... Support for Pause/Resume ... Checkpoint Restore
Watching the Google Omega talk video linked in the comment above, I am guessing your internal implementation mostly exists to instantiate jobs specified by the Omega cluster cell-manager, including intra-Google properties like resource shape, constraints and preferences, in a local container. I am guessing that part is not released because the current code has too much to do with your internal standards for expressing that job-related metadata.
One that did get in was cpusets, and on the suggestion of akpm (who had recently joined Google) we started experimenting with using cpusets for very crude CPU and memory control. Assigning dedicated CPUs to a job was pretty easy via cpusets. Memory was trickier - by using a feature originally intended for testing NUMA on non-NUMA systems, we broke memory up into many "fake" NUMA nodes, and dynamically assigned them to jobs on the machine based on their memory demands and importance. This started making it into production in late 2006 (I think), around the same time that we were working on evolving cpusets into cgroups to support new resource controls.
Was there a reason you guys didn't open source this many years ago?
This code grew symbiotically with Google's kernel patches (big chunks of which were open-sourced into cgroups) and the user-space stack (which was tightly coupled with Google's cluster requirements). So open-sourcing it wouldn't necessarily have been useful for anyone. It looks like someone's done a lot of work to make this more generically-applicable before releasing it.
We have a lot of really well tested and battle-hardened code and behavior, so we chose to keep that.
lmctfy is designed from the ground up as automation which can be used by humans, and never the other way around.
Who came up with the name?
And yeah, it's been retooled a lot to extract it more cleanly from the surrounding code.
In fact, based on the documentation, I don't see how this is any different from the "cgroup-bin" scripts that have shipped with Ubuntu for years: http://linuxaria.com/article/introduction-to-cgroups-the-lin...
Finally, it came down to:
"This gives the applications the impression of running exclusively on a machine."
OK but as an outsider that still doesn't tell me what it buys me (or what it buys you, or Google).
(By outsider, I mean I have reasonable ability to administer my own Linux system, but wouldn't trust myself to do so in a production environment... so I'm not up on the latest practices in system administration or especially Google-scale system administration.)
Setting aside whether I need it (I'm pretty sure I don't, so no need to tell me that) I'm really curious what this is good for. Can someone explain it in more lay person's terms? Sounds like applications can still stomp on each others files, and consume memory that then takes away from what's available for other applications, so what is the benefit?
I'm not questioning that there's a benefit, just wondering what it is, and how this is used.
You can also prevent applications from stomping on each others' files, with a combination of permissions, chroots and mount namespaces.
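As a trivial sketch of the permissions half of that (per-app uids and chroots would complete the picture, but need root, so this only sets the mode; the helper is mine):

```python
# Sketch: give each app a private directory only its owning uid can enter.
# In a real setup each app would also run as its own uid (chown needs root).
import os

def private_dir(base, app):
    path = os.path.join(base, app)
    os.makedirs(path, exist_ok=True)
    os.chmod(path, 0o700)   # owner-only: no other uid can even list it
    return path
```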
This is basically a low-level API for a controller daemon. The daemon knows (via a centralized cluster scheduler) what jobs need to be running on the machine, and how much of each kind of resource they're guaranteed and/or are limited to. lmctfy translates those requirements into kernel calls to set up cgroups that implement the required resource limits/guarantees.
While you could use it for hand administration of a box, or even config-file-based administration of a box, you probably wouldn't want to (lxc may well be more appropriate for that).
edit: see shykes reply: https://news.ycombinator.com/item?id=6487080
Is it just me, or do you guys also think it's strange that Google is using GitHub instead of its own code hosting infrastructure?
I guess this makes it a choice between this and Docker.io, unless it becomes a docker backend.
Other than that, why open source it now? Is it a race against docker and lxc? Or is it just simply Google's paying back to FLOSS?
If so I'm curious how they compare with cgroups.
I agree a platform-independent API would be very useful, but I wonder how close the semantics are.
I think a process-isolation model, possibly with Capsicum, is more interesting than the LXC-like "VM model" (which does seem messy to me). I don't need an init process and fake hardware inside the container - just Unix process tree isolation.
For example, I think BSD jails have the option to use host networking, which in Linux is analogous to not using network namespaces.
It seems the FreeBSD port originally failed due to missing kernel features, but it is now at least partially available - possibly only for managing workloads on remote (typically Linux) systems.
> lmctfy was originally designed and implemented around a custom kernel with a set of patches on top of a vanilla Linux kernel.
No sign of said patches though. Anyone know if they're available?