
I'm not familiar with any of the technologies used in this. Anybody care to comment on how strong the isolation would be, security-wise, compared to normal virtualization?

If the security is almost on par and the isolation is good enough that one bad process can't bring the whole system down, might this be a good alternative to virtualization? I imagine it would use far fewer resources.




Container-based virtualization can provide an impressive amount of isolation while improving density dramatically over hardware virtualization on light-duty loads. Solaris zones are very well regarded and are used for multi-tenancy by Joyent, and many Linux hosts provide multi-tenant solutions based on Virtuozzo, which predates Linux containers by a good number of years.

The main theoretical difference between hypervisor isolation and container isolation is that the hypervisor sits beneath each guest kernel, so a kernel-level exploit only applies to a single virtual machine. With containers you're relying on the kernel itself to provide the isolation, so you are still subject to (some) kernel-level exploits.

Practically speaking, Linux containers (the mainline implementation) have only provided full isolation in recent patches and probably shouldn't be considered fully shaken out for something like in-the-wild, root-level multi-tenant access.

They are super for application isolation when delivering multiple single-tenant workloads on one machine, though - something people use hypervisors for quite a bit. The resources used can be a small fraction of what you'd commit to with a hypervisor.


As trotsky mentions, we at Joyent are fervent believers in OS-based virtualization -- to the point that in SmartOS, we run hardware virtualization within an OS container. There are many reasons to favor OS-based virtualization over hardware-based virtualization, but first among these (in my opinion) is DRAM utilization: with OS-based virtualization, all unused DRAM is available to the system at large, and in the SmartOS case is used as adaptive replacement cache (ARC) that benefits all tenants. Given that few tenants consume every byte of their allocated DRAM, this alone leads to huge efficiencies from the perspective of both the cloud operator and the cloud user -- a higher-performing, higher-margin service. By contrast, for hardware-based virtualization, unused DRAM remains with the guest and is simply wasted (kludges like kernel samepage merging and memory ballooning notwithstanding).

DRAM isn't the only win, of course: for every other resource in the system (CPU, network, disk), OS-based virtualization offers tremendous (and insurmountable) efficiency advantages over hardware-based virtualization -- and it's great to see others make the same realization!

For more details on the relative performance of OS-based virtualization, hardware-based virtualization and para-virtualization, see my colleague Brendan Gregg's excellent blog post on the subject[1].

[1] http://dtrace.org/blogs/brendan/2013/01/11/virtualization-pe...


Solaris zones use concepts similar to LXC/namespaces, but actually provide secure isolation.

Recent patches DO NOT provide "full isolation" and never did. What they add is usermode containers (user namespaces). Those have been broken weekly since release. Seriously. Have a look at http://blog.gmane.org/gmane.comp.security.oss.general


> Those have been broken weekly since release. Seriously.
> Have a look at http://blog.gmane.org/gmane.comp.security.oss.general

Funny you should say that. The latest virtualization-related CVEs there are actually in KVM -- a trio including two host memory corruptions, which usually enable completely owning the host. http://permalink.gmane.org/gmane.comp.security.oss.general/9...

And on the other hand, I don't see any container-related CVEs at all from 2013 in the CVE database: http://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=linux+kernel (The KVM issues I mentioned don't show up yet either, because they're from today.) What vulnerabilities are you referring to?

Maybe you mean kernel vulnerabilities in general, some of which could be usable by a user inside a container. Everyone should stay on top of kernel updates in any event. If you hate the rebooting, Ksplice is free for Ubuntu (and Fedora).


The Linux namespace stuff is evolving pretty fast, and I personally wouldn't trust it as the main line of defense for anything important.

With virtualization, a buggy or malicious guest is still limited to its sandbox unless there's a flaw in the hypervisor itself. With containers/namespaces, the host and guest are just different sets of processes that see different "views" of the same kernel, so bugs are much more likely to be exploitable. Plus, if you enable user namespaces, some code paths (like on-demand filesystem module loading) that used to require root are now available to unprivileged users.
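To make that concrete, here's a minimal sketch (assuming a kernel with user namespaces enabled, i.e. CONFIG_USER_NS) of how an unprivileged process can hand itself uid 0 inside its own namespace - exactly the kind of formerly root-only surface that becomes reachable:

    /* userns_demo.c -- minimal sketch: an unprivileged user becomes "root"
     * inside a new user namespace. Assumes a kernel with CONFIG_USER_NS.
     * Build: gcc -o userns_demo userns_demo.c
     */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <sched.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        uid_t outer = getuid();
        char map[64];

        if (unshare(CLONE_NEWUSER) != 0) {   /* no privileges required */
            perror("unshare");
            return 1;
        }

        /* Map uid 0 inside the new namespace to our real uid outside it. */
        snprintf(map, sizeof(map), "0 %d 1", (int)outer);
        int fd = open("/proc/self/uid_map", O_WRONLY);
        if (fd < 0 || write(fd, map, strlen(map)) < 0) {
            perror("uid_map");
            return 1;
        }
        close(fd);

        /* getuid() now reports 0 -- but only within this namespace. */
        printf("outside uid %d, inside uid %d\n", (int)outer, (int)getuid());
        return 0;
    }

The uid 0 only means something inside the namespace, but every kernel code path that checks for "root" in that namespace is now reachable by an unprivileged user, and that's where most of the recent exploits have come from.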

There's already been at least one local root exploit that almost made it into 3.9: https://lkml.org/lkml/2013/3/13/361


> The Linux namespace stuff is evolving pretty fast, and I personally wouldn't trust it as the main line of defense for anything important.

If I recall correctly, Heroku uses cgroups (EDIT: and namespaces) exclusively for multi-tenant isolation (and by the looks of this, dotCloud does too), so that's two big votes in the "if it's good enough for them" category.


Sure, but cgroups and namespaces are kind-of-orthogonal features that both happen to be useful for making container-like things. cgroups are for limiting resource usage; namespaces are for giving processes their own isolated view of the system (and, with user namespaces, the illusion of root access) while actually being in a sandboxed environment.

And as far as I'm aware (speaking as an interested non-expert, so please correct me if I'm wrong) cgroups have no effect on permissions, whereas UID namespaces required a lot of very invasive changes to the kernel.


That's correct: cgroups have no effect on permissions. They only enforce resource usage limits.
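For what it's worth, here's a rough sketch of what that resource limiting looks like from userspace, assuming the (v1) memory controller is mounted at /sys/fs/cgroup/memory - the mount point and the "demo" group name are made up and vary by distro:

    /* cgroup_demo.c -- rough sketch: cap this process at 256 MB of RAM with
     * the memory cgroup (v1 interface). Assumes the controller is mounted at
     * /sys/fs/cgroup/memory; run as root or with delegated cgroup permissions.
     */
    #include <errno.h>
    #include <stdio.h>
    #include <sys/stat.h>
    #include <unistd.h>

    static int write_str(const char *path, const char *value)
    {
        FILE *f = fopen(path, "w");
        if (!f)
            return -1;
        int rc = (fputs(value, f) < 0) ? -1 : 0;
        fclose(f);
        return rc;
    }

    int main(void)
    {
        char pid[32];

        /* A cgroup is just a directory in the controller's hierarchy. */
        if (mkdir("/sys/fs/cgroup/memory/demo", 0755) != 0 && errno != EEXIST) {
            perror("mkdir");
            return 1;
        }

        /* Set the limit (256 MB, in bytes)... */
        write_str("/sys/fs/cgroup/memory/demo/memory.limit_in_bytes", "268435456");

        /* ...and move ourselves (and any children we spawn) into the group. */
        snprintf(pid, sizeof(pid), "%d", (int)getpid());
        write_str("/sys/fs/cgroup/memory/demo/tasks", pid);

        /* From here on, allocations beyond 256 MB hit the limit (reclaim,
         * then the cgroup OOM killer) -- no effect on permissions at all. */
        return 0;
    }

Note there's nothing permission-related in there beyond ordinary filesystem access to the cgroup hierarchy; all the controller does is account for and cap the group's memory usage.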

Shameless plug: I work at dotCloud, and I wrote four blog posts explaining namespaces, cgroups, AUFS, GRSEC, and how they are relevant to "lightweight virtualization" and the particular case of PaaS. The articles have been grouped into a PDF that you can get here if you want a good technical read for your next plane/train/whatever trip ;-) http://blog.dotcloud.com/paas-under-the-hood-ebook


Fundamentally, the cgroups framework is just a way of creating some arbitrary kernel state and associating a set of processes with that state. For most cgroup subsystems, the kernel state is something to do with resource usage, but it can be used for anything that the cgroup subsystem creator wants. At least one subsystem (the devices cgroup) provides security (by controlling which device ids processes in that cgroup can access) rather than resource usage limiting.
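A hedged sketch of that security-flavored case, using the same file-based interface (v1 devices controller, assumed to be mounted at /sys/fs/cgroup/devices; the "jail" group name is made up):

    /* devices_cgroup.c -- sketch of the devices cgroup used for security
     * rather than resource limiting (v1 interface; assumes it is mounted at
     * /sys/fs/cgroup/devices and that we run as root).
     */
    #include <errno.h>
    #include <stdio.h>
    #include <sys/stat.h>

    static void write_str(const char *path, const char *value)
    {
        FILE *f = fopen(path, "w");
        if (f) {
            fputs(value, f);
            fclose(f);
        }
    }

    int main(void)
    {
        if (mkdir("/sys/fs/cgroup/devices/jail", 0755) != 0 && errno != EEXIST)
            return 1;

        /* Deny access to every device node by default... */
        write_str("/sys/fs/cgroup/devices/jail/devices.deny", "a");

        /* ...then whitelist a few harmless ones, e.g. /dev/null (char 1:3). */
        write_str("/sys/fs/cgroup/devices/jail/devices.allow", "c 1:3 rwm");

        /* Any task later added to 'jail' via its tasks file can then open
         * only the whitelisted device nodes, regardless of file permissions. */
        return 0;
    }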


Personally I think the biggest value these days with lightweight virtualization like this is in development. I can be running twenty or so different applications on the same physical machine, and for the most part (as long as they're idle, since I'm only working with one at a time) I don't even notice that they're running.


Yes, you probably don't want to run untrusted code with root privileges inside a container if anything valuable is running on the same host.

However if that code is trusted, or if you're running it as an unprivileged user, or if nothing else of importance is sharing the same host, then I would not hesitate to use containers.

Containers are awesome because they represent a logical component in your software stack. You can also use them as a unit of hardware resource allocation, but you don't have to: you can map a container 1-to-1 to a physical box, for example. But the logical unit remains the same regardless of the underlying hardware, which is truly awesome.


Barring kernel bugs, it should protect against the resource-monopolization issues mentioned. Normal virtualization is pretty wasteful of resources, especially if the guests are not hypervisor-aware.

Getting away from huge per-VM block devices is a step in the right direction.


This is still technically a virtualization technique, known as "operating system-level virtualization". http://en.wikipedia.org/wiki/Operating_system-level_virtuali...

Here are some of the technologies explained:

cgroups: a Linux kernel feature that provides resource limiting and metering. The process-isolation piece is actually provided by namespaces, a separate kernel feature that is usually used alongside cgroups; it's important because it prevents a process from seeing or terminating processes outside its own namespace.

lxc: this is a utility that glues together cgroups, namespaces, and chroots to provide virtualization. It helps you easily set up a guest OS by downloading your favorite distro and unpacking it (kind of like debootstrap). It can then "boot" the guest OS by starting its "init" process. The init process runs in its own namespaces, inside a chroot. This is why they call LXC a chroot on steroids: it does everything that chroot does, plus full process isolation and metering.
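If it helps, here's a stripped-down sketch of the syscalls such a tool builds on - this is not how lxc-start is actually implemented, and the rootfs path is hypothetical; real tools also set up mounts, devices, networking and cgroups:

    /* mini_container.c -- stripped-down sketch of what an LXC-style tool
     * builds on: new namespaces + chroot + exec of the guest's init.
     * Build: gcc -o mini_container mini_container.c    Run as root.
     */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <signal.h>
    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define ROOTFS "/var/lib/containers/guest"   /* hypothetical unpacked distro */

    static char child_stack[1024 * 1024];

    static int child(void *arg)
    {
        (void)arg;
        sethostname("guest", 5);                 /* our own UTS namespace */
        if (chroot(ROOTFS) != 0 || chdir("/") != 0) {
            perror("chroot");
            return 1;
        }
        /* PID 1 inside the new PID namespace: the container's "init". */
        execl("/sbin/init", "init", (char *)NULL);
        perror("execl");
        return 1;
    }

    int main(void)
    {
        int flags = CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWUTS |
                    CLONE_NEWIPC | CLONE_NEWNET | SIGCHLD;
        pid_t pid = clone(child, child_stack + sizeof(child_stack), flags, NULL);
        if (pid < 0) {
            perror("clone");
            return 1;
        }
        waitpid(pid, NULL, 0);
        return 0;
    }

The exec'd /sbin/init becomes PID 1 of the new PID namespace, which is what lets a stock distro "boot" inside the container.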

aufs: this is sometimes called a "stacked" (union) file system. It allows you to mount one file system on top of another. Why is this important? Because if you are managing a large number of virtual machines, each with a 1 GB+ OS image, they use a lot of disk space. Also, the slowest part of creating a new container is copying the distro (it can take up to 30 seconds). Using something like AUFS avoids the copy and gives you much better performance.
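Roughly, the union mount behind that looks like the sketch below (assuming an aufs-patched kernel; the paths are hypothetical and the "br=" branch syntax follows the aufs docs, so it may differ between versions):

    /* aufs_union.c -- rough sketch of the copy-on-write layering idea,
     * assuming an aufs-patched kernel. Paths are hypothetical.
     * Equivalent shell:  mount -t aufs -o br=/writable=rw:/base=ro none /mnt
     */
    #include <stdio.h>
    #include <sys/mount.h>

    int main(void)
    {
        /* One read-only base image shared by many containers, plus a small
         * per-container writable layer stacked on top of it. */
        const char *opts = "br=/containers/guest1/delta=rw:/images/ubuntu-base=ro";

        if (mount("none", "/containers/guest1/rootfs", "aufs", 0, opts) != 0) {
            perror("mount");
            return 1;
        }
        /* Writes land in the delta branch; the base image is never copied,
         * so creating a new container is near-instant and costs little disk. */
        return 0;
    }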

So what about security? Well, like every (relatively) new technology, LXC has its issues. If you use Ubuntu 12.04, it ships a set of AppArmor profiles to mitigate known security risks (for example, blocking the reboot and shutdown commands inside containers, and blocking write access to the /sys filesystem).


I am familiar with Microsoft App-V, VMware ThinApp, and Symantec Workspace Virtualization. They can help as a security sandbox, but not as full protection. A virtual machine will be much more secure (and is theoretically very strong), although there have been security bugs that let you escape it.

Those products work at two levels: filter drivers for the registry and the filesystem, and hooks into the Windows operating system API.


Virtual machines are not more secure. In fact there have been more documented attacks where root access on a guest VM was escalated to shell access on the host than there have been against containers.

This doesn't mean that containers are more secure than VMs either. Attacking VMs attracts more security researchers, from what I've seen (but I may be wrong on that point). However, whether you're running a container or a virtual machine, there is still some shared machinery (e.g. the 'ticks' of a system clock), and any sufficiently complicated code WILL have bugs that can potentially be exploited.

However, the crux of the matter is that regardless of whether you're running containers or full-blown virtual machines, you cannot escape the sandbox without having elevated privileges on the guest to begin with. And if an attacker has those, then you've already lost - regardless of whether or not the attacker can escape the sandbox.

Lastly, I'm not sure if you're aware of this or not, but this is a Linux solution and has nothing to do with Windows (I only say this because your post seemed tailored towards Windows-hosted virtualisation).


Are you saying that both approaches have the same level of security (or probable insecurity), or that you can't currently estimate the difference?

Even though I'm aware this is a Linux solution, I mentioned the Windows technologies that I know at a technical level.


> Are you saying that both approaches have the same level of security (or probable insecurity), or that you can't currently estimate the difference?

A bit of both, but mostly the former. In practical terms, they both have the same level of security. But - as with any software - something could be published tomorrow exposing a massive flaw that totally blows one or the other out of the water. However, neither offers any technical advantage over the other from a security standpoint, and from a practical perspective the real question is whether your guest OSes are locked down to begin with (e.g. it's no good arguing about which home security system is most effective if you leave the front door open).

> Even though I'm aware this is a Linux solution, I mentioned the Windows technologies that I know at a technical level.

That's fair enough and I had suspected that was the case. I just wanted to make sure that we were both talking about the same thing :)


Back in January I got a new laptop, installed Arch on it, got it all nicely set up. And I decided it was high time to start playing with LXC because container virtualization seems extremely promising to me. Created an Ubuntu container, seemed to work fine, and then used lxc-destroy, which took some time.

It destroyed my entire file system. I have no clue how the hell it happened--it floors me that something like that would be possible--and I suspect it's probably simply the result of a newbie like myself somehow misusing userspace tools. But it was enough to turn me off of it for the time being.



