
An introduction to Clear Containers (2015) - sillysaurus3
http://lwn.net/Articles/644675/
======
rwmj
I've been writing a paper and hoping to give a talk about this at KVM Forum. I
wasn't ready to publish this, but here's an early version of the paper for
anyone interested:

[http://oirase.annexia.org/tmp/paper.pdf](http://oirase.annexia.org/tmp/paper.pdf)

Note: I'm not connected to Intel, just trying to reproduce their work in
QEMU/KVM and regular kernels. KVM Forum details:
[http://events.linuxfoundation.org/events/kvm-forum/](http://events.linuxfoundation.org/events/kvm-forum/)

~~~
wyldfire
> "DAX is also working in modern kernels, and it was a relatively trivial job
> to implement DAX."

What is DAX? Is that like an execute-in-place feature that means we don't need
paging to execute code?

If so, is this feature generally useful for all QEMU use cases?

~~~
rwmj
In real hardware, DAX comes in two parts. There is an NVDIMM device (basically
ordinary DRAM, but backed by additional flash chips so it preserves its
contents when powered off). NVDIMMs work at RAM speeds so although you can use
them as fast block devices, it's better to direct map them to avoid the block
layer completely. The other part is a filesystem implementation (I have used
ext4, but xfs exists too) which lets you mmap files directly if they come from
a device which is backed by NVDIMMs. DAX also provides execute-in-place for
binaries (like the old XIP support in ext2, which DAX obsoletes).

QEMU has a virtual NVDIMM ("vNVDIMM") so you can test this without needing the
real NVDIMM hardware. But in this case it's also a useful way to reduce memory
usage, since you're avoiding the block layer and page cache in the guest.
That's the theory. Although I've been able to make it function correctly, I
didn't observe any great improvements in speed or memory usage (see the paper
for details).
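For anyone who wants to try this, the QEMU invocation usually has roughly this shape (sizes, paths and image names below are placeholders, and the nvdimm option syntax has varied across QEMU releases, so check the docs for your version):

```shell
# Hypothetical example: back a virtual NVDIMM with a plain file on the host.
qemu-system-x86_64 \
    -machine pc,nvdimm=on \
    -m 2G,slots=2,maxmem=4G \
    -object memory-backend-file,id=mem1,share=on,mem-path=/tmp/nvdimm.img,size=1G \
    -device nvdimm,id=nvdimm1,memdev=mem1 \
    -drive file=guest.qcow2,if=virtio

# Inside the guest the NVDIMM shows up as a pmem device; mount it with DAX:
mkfs.ext4 /dev/pmem0
mount -o dax /dev/pmem0 /mnt
```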

Here is a patch which adds DAX support to libguestfs which should give you
some ideas how to try out DAX in QEMU:
[https://www.redhat.com/archives/libguestfs/2016-May/msg00138...](https://www.redhat.com/archives/libguestfs/2016-May/msg00138.html)

------
ymse
Needs (2015) tag. See also the official release announcement[0] and
documentation[1].

0: [https://coreos.com/blog/rkt-0.8-with-new-vm-support/](https://coreos.com/blog/rkt-0.8-with-new-vm-support/)

1: [https://github.com/coreos/rkt/blob/master/Documentation/runn...](https://github.com/coreos/rkt/blob/master/Documentation/running-lkvm-stage1.md)

~~~
sillysaurus3
Good catch. I've added the year.

------
sillysaurus3
This explains how it's possible to boot a VM + a container in 150ms.

It's surprising that you can boot a server in the blink of an eye.

~~~
rwmj
It's not really surprising, it's just a lot of hard work on small details of
the boot process. Intel are way ahead here. See my other comment for a draft
of a paper I've been writing about this.

------
nickpsecurity
This is an improvement on regular containers, but an expected one given it's
old news for microkernels and separation kernels. Dresden, I think, was first:
with L4 and L4Linux they showed you could boot a VM every second on a slow
machine with a low TCB. I imagine it could've been faster. Then OKL4 and
LynxSecure both showed the context-switch time could be negligible, with joint
Intel and Lynx work demonstrating 100,000 context switches per second at 97%
idle CPU or something. Both offer vastly stronger isolation, since the Linux
part is deprivileged and you can run apps directly on the separation kernel.

So, the problem becomes easier once you transition from building on components
and architecture not designed for security to those designed for it from the
ground up. An example of the latter is the Genode OS framework. You can have
as much or as little complexity in your app deployment as you want.

------
sargun
I have a few questions if anyone can answer them: 1) How does this deal with
networking? With containers we can delegate IPs to them in a fine-grained way,
and they can even use our own IPs. 2) How does this deal with storage?

