
Containers in 2019: They're Calling It a Hypervisor Comeback - ingve
https://www.infoq.com/articles/containers-hypervisors-2019/
======
dopylitty
I've had this sneaking but hard to articulate suspicion that datacenters, bare
metal servers, VMs, operating systems, containers, OS processes, language VMs,
and threads are all really attempts to abstract the same thing.

You want to run business code in a way that's protected from other business
code but also able to interact with other business code and data in a well
defined way.

I also have this sneaking suspicion that new generations are re-inventing the
wheel in a lot of ways. If you have minimal containers running on a
hypervisor, how is that different from processes running on an operating
system? You have
all these CPU provided virtualization instructions to protect guests from each
other and the host from the guests but there's no reason those instructions
couldn't have been developed to protect processes from each other. You have
indirections to protect one guest from accessing another's storage but there's
no reason processes couldn't have the same protections (and they do in many
operating systems). Why container orchestration and overlay networks instead
of OS scheduling and IPC?
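
Linux really does expose much of this at the process level already: namespaces
are just flags on clone(2). As a minimal, Linux-only illustration in Go (it
needs root, and is only meant to show the primitive that container runtimes
build on), here is a shell spawned into fresh UTS, PID and mount namespaces:

    package main

    import (
        "os"
        "os/exec"
        "syscall"
    )

    func main() {
        // Spawn a shell in new UTS, PID and mount namespaces -- the same
        // kernel primitives that container runtimes layer their tooling on.
        cmd := exec.Command("/bin/sh")
        cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
        cmd.SysProcAttr = &syscall.SysProcAttr{
            Cloneflags: syscall.CLONE_NEWUTS |
                syscall.CLONE_NEWPID |
                syscall.CLONE_NEWNS,
        }
        if err := cmd.Run(); err != nil {
            panic(err)
        }
    }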

I'm sure people in academic computer science have already published many a
paper about this but it feels different seeing it from inside an IT
organization where people don't seem to apply the lessons from older
technologies to newer ones and we end up in this constant churn of reinvention
which, as far as I can tell, is mostly a way for people, both in management
and in the trenches, to keep their jobs, at least until you're "too old to
learn new things" and pushed aside.

~~~
no_wizard
What I’ve always failed to understand is how FreeBSD jails[0] never got very
popular (discounting the fact that FreeBSD isn’t very popular on the whole
from what I can tell) but Docker is huge. I personally think jails are
superior in implementation in that they require no other abstractions on top
of the OS.

The only thing I can surmise is that Docker might have a better secure
default, but improvements to jails could emulate that, and so would following
best practices. Beyond that I'm still baffled that this never caught on in
the mainstream.

[0]
[https://www.freebsd.org/doc/handbook/jails.html](https://www.freebsd.org/doc/handbook/jails.html)

~~~
zbentley
This happened because Docker, in addition to an isolation system, also
bundled a user-friendly interface to a per-app persistent filesystem. No
matter how many people sing the praises of Docker's isolation and security, I
will continue to suspect that almost all of its adopters use it because
packaging software with dependencies is hard, poorly understood, terribly
tooled (looking at you, Python), and even more poorly executed in the vast
majority of projects and companies.

Docker gives you a very simple way to not think about that (at least until
your massive container that bundles ancient versions of a dozen different CVE-
filled libraries bites you in the ass down the road).

It's not novel. It's not even particularly elegant. But Docker users don't
want elegance; most of them just don't want to think about how to configure a
production environment to work like their development environment. So we get
the old joke: "It works on my machine!" "Then we'll ship your machine"...and
that's how we got Docker.

~~~
sedachv
> No matter how many people sing the praises of Docker's isolation and
> security, I will continue to suspect that almost all of its adopters use it
> because packaging software with dependencies is hard, poorly understood,
> terribly tooled (looking at you, Python), and even more poorly executed in
> the vast majority of projects and companies.

Here is a quote from Eberhard Wolff's _A Practical Guide to Continuous
Delivery_:

"It is laborious to install a real application including all components. Of
course, it is even more laborious to automate this process... When an
installation crashes, it has to be restarted. In such a scenario the system is
in a state where some parts have already been installed. Just to start the
installation again can create problems. The script usually expects to find the
system in a state without any installed software... This problem is also the
reason why updates to a new software version are problematic. In such a case
there is already an old version of the software installed on the system that
has to be updated. This means that files might already be present and have to
be overwritten. The script has to implement the necessary logic for this. In
addition, superfluous elements that are not required anymore for the new
version have to be deleted. If this does not happen, problems can arise with
the new installation because old values might still be read out and used.
However, it is very laborious to cover and automate all update paths that
occur in practice."

If a self-professed expert who writes books on the subject does not realize
that package managers exist, and proposes that the only alternatives for
software installation are either hand-hacked shell scripts or Docker, what do
you think the average developer knows?

~~~
icebraining
To be fair to Wolff, it's not uncommon for deb and rpm packages of complex
daemons (e.g. postgres) to include shell scripts - postinst, preinst, etc -
that make changes to the system, and which do have to be coded in such a way
as to handle re-execution after partial failure. Taking my /var/cache/apt/
directory as a (poor) sample, about 1/3 of packages had such scripts.
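
The property those scripts need is idempotence: check state before acting, so
that a re-run after a partial failure converges instead of crashing. A minimal
Go sketch of the pattern - the function and file here are hypothetical, not
taken from any real package:

    package main

    import (
        "fmt"
        "os"
    )

    // ensureConfig is a hypothetical idempotent install step: it is safe
    // to re-run after a partial failure because it checks state first and
    // commits the file atomically via rename.
    func ensureConfig(path string, contents []byte) error {
        if _, err := os.Stat(path); err == nil {
            return nil // already installed; re-execution is a no-op
        }
        tmp := path + ".tmp"
        if err := os.WriteFile(tmp, contents, 0o644); err != nil {
            return err
        }
        return os.Rename(tmp, path) // atomic on POSIX filesystems
    }

    func main() {
        // Running this twice has the same effect as running it once.
        fmt.Println(ensureConfig("/tmp/demo.conf", []byte("key=value\n")))
        fmt.Println(ensureConfig("/tmp/demo.conf", []byte("key=value\n")))
    }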

~~~
burpsnard
_that_ libc update spawned a wealth of dependency solutions. And 'DLL hell',
of course.

------
hinkley
And let me state again for the record that all of these promises being made by
container systems sound an awful lot like the promises I was offered by 'real'
operating systems in the early nineties.

I think the only real difference is that there has been a sea change in
public opinion on whether this kind of aggressive isolation by default is
worthwhile.

But a hypervisor publishing a bunch of services that talk to the world and
each other? Things are beginning to look a bit more like microkernels as time
goes on.

~~~
naasking
> Things are beginning to look a bit more like microkernels as time goes on.

Hypervisors typically are microkernels. The very first "hypervisor" was L4 in
fact, with L4Linux.

Also, you're right that virtualization and containers are fulfilling the
promises of operating systems. The problem with earlier OSes is that they
didn't take security seriously enough. Modern hypervisors aren't much better
at isolation than real microkernels like L4, KeyKOS and EROS. They're better
than containers, though.

~~~
msla
> Hypervisors typically are microkernels.

Except they have vastly different histories (look up IBM's VM) and uses and
underlying technologies.

Here's a good post on the difference:

[https://utcc.utoronto.ca/~cks/space/blog/tech/HypervisorVsMi...](https://utcc.utoronto.ca/~cks/space/blog/tech/HypervisorVsMicrokernel)

> Microkernels are intended to create a minimal set of low-level operations
> that would be used to build an operating system. While it's popular to slap
> a monolithic kernel on top of your microkernel, this is not how microkernel
> based OSes are supposed to be; a real microkernel OS should have lots of
> separate pieces that used the microkernel services to work with each other.
> Using a microkernel as not much more than an overgrown MMU and task
> switching abstraction layer for someone's monolithic kernel is a cheap hack
> driven by the needs of academic research, not how they are supposed to be.

> (There have been a few real microkernel OSes, such as QNX; Tanenbaum's Minix
> is or was one as well.)

> By contrast, hypervisors virtualize and emulate hardware at various levels
> of abstraction. This involves providing some of the same things that
> microkernels do (eg memory isolation, scheduling), but people interact with
> hypervisors in very different ways than they interact with microkernels.
> Even with 'cooperative' hypervisors, where the guest OSes must be guest-
> aware and make explicit calls to the hypervisor, the guests are far more
> independent, self-contained, and isolated than they would be in a
> microkernel. With typical 'hardware emulating' hypervisors this is even more
> extremely so because much or all of the interaction with the hypervisor is
> indirect, done by manipulating emulated hardware and then having the
> hypervisor reverse engineer your manipulations. As a consequence, something
> like guest to guest communication delays are likely to be several orders of
> magnitude worse than IPC between processes in a microkernel.

~~~
bogomipz
>"Using a microkernel as not much more than an overgrown MMU and task
switching abstraction layer for someone's monolithic kernel is a cheap hack
driven by the needs of academic research, not how they are supposed to be."

I found this interesting. Can you or anyone else say what the context was
where academic researchers have needed to do this? What problem was it solving
for them in a cheap way?

~~~
spinlok
There is a linked article on his blog
[https://utcc.utoronto.ca/~cks/space/blog/tech/AcademicMicrok...](https://utcc.utoronto.ca/~cks/space/blog/tech/AcademicMicrokernels)
that expands on this. Specifically:

>the whole 'normal kernel on microkernel' idea of porting an existing OS
kernel to live on top of your microkernel gives you at least the hope of
creating a usable environment on your microkernel with a minimum amount of
work (ie, without implementing all of a POSIX+ layer and TCP/IP networking and
so on). Plus some grad student can probably get a paper out of it, which is a
double win.

~~~
bogomipz
This is a great read. Thank you.

------
jacques_chester
"Containers" is an unfortunate term, since it really better describes the
container _image_ than an actual running process with API virtualisation.

I think VMs-as-containers is where we'll wind up. The container image has
turned out to be the real thing of interest, the runtime is almost secondary.
Virtual machine systems have closed the performance gap in a variety of ways.

For example: tearing out kernel checks for devices that will never be
connected to the VM; taking advantage of hyper-privileged CPU instructions;
being able to make and restore higher-fidelity checkpoints than an OS can,
for faster launches; etc.

At which point the isolation benefits of VMs really begin to outweigh
everything else. A hypervisor has a much smaller attack surface and has a much
simpler role than a full monolithic OS kernel. It partitions the hardware and
that's about it. It doesn't exist in a constant tension between kernel-as-
resource-manager and kernel-as-service-provider.

~~~
LIV2
This is kinda where VMware is going with project pacific & vSphere integrated
containers; containers running as individual VMs on a hypervisor. I wonder if
we will see others following the same pattern?

I’m not sure what the drawbacks might be though

~~~
jacques_chester
I think yes. Red Hat and others are working on KubeVirt, plus I believe there
are CRI implementations for gVisor and Firecracker.

Disclosure: I work for Pivotal, which is in the middle of being acquired by
VMware.

------
ris
I'm quite hopeful for the limited re-introduction of hypervisors to
"containment", if only because I've become disappointed with the lack of power
app authors have to specify fine-grained, strict security details. This is
because most of the fun lockdown toys that linux has, selinux, seccomp et al
aren't easily "stackable", and those toys have been used by the orchestrator
to perform the containment. And that's great, but it means containers all end
up with broad, generic policies (that only care about container breakout)
which can't be further restricted by app/container authors.

My hope is that lightweight virtualisation will give back the ability for app
authors to tighten their container's straight jackets. Personally I've got my
eye on kata containers.
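
For what it's worth, seccomp is the partial exception: the kernel runs every
filter a process has installed and the most restrictive answer wins, so an app
can tighten (but never loosen) whatever profile the runtime loaded. A rough
sketch using the libseccomp Go bindings; the denied syscalls are arbitrary
examples:

    package main

    import (
        seccomp "github.com/seccomp/libseccomp-golang"
    )

    func main() {
        // Start from "allow everything" and deny a handful of syscalls.
        // Loading this inside a container stacks it on top of whatever
        // profile the runtime already installed; it can only subtract.
        filter, err := seccomp.NewFilter(seccomp.ActAllow)
        if err != nil {
            panic(err)
        }
        for _, name := range []string{"ptrace", "mount", "kexec_load"} {
            sc, err := seccomp.GetSyscallFromName(name)
            if err != nil {
                panic(err)
            }
            if err := filter.AddRule(sc, seccomp.ActErrno); err != nil {
                panic(err)
            }
        }
        // From here on, the denied syscalls fail with an errno for this
        // process and its children, regardless of the outer profile.
        if err := filter.Load(); err != nil {
            panic(err)
        }
    }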

------
denton-scratch
I have a Xen hypervisor at home, running on a (well-configured) NUC. VMs boot
in about 15 seconds. I'm patient!

I'm not doing devops stuff; I no longer code, so I don't need a testing
pipeline.

I looked into containers based on LXC, when it was first introduced. I decided
to stay away - I don't want to get tied into Poettering's code. Yeah, I'm
running systemd on some of the VMs, but you don't really have much choice
nowadays. I still use Debian, but I'm bitter about their adoption of systemd -
and my hypervisor machine runs SysV Init, because I know how that works.

I never mixed it with Docker. From what I've been reading, Docker is already
old-hat (it's only about 3 years old; how did that happen?) Kubernetes seems
to be the thing nowadays. I don't even know how to pronounce "Kubernetes".

What was wrong with LXC? Like, LXC comes with the OS. Nothing to install. Why
do people love these 3rd-party container engines?

And "wrappers"? what purpose is served by container wrappers?

Serious questions, I'm not trolling. Promise. I'm just a bit out-of-date.

~~~
meddlepal
Some of us have problems that are actually well solved by these tools and run
on more sophisticated hardware than our home computer... It doesn't sound like
you've hit that set of problems.

~~~
denton-scratch
Yeah, thanks. It's a home computer because I don't work any more - I set up
and operated a continuous production VM pipeline on my boss's rack, back when
I had a boss.

Docker had turned up; I evaluated it (quite briefly - I had work to do, and
nobody had asked me to evaluate Docker). We had fairly specific requirements
that I couldn't match to Docker.

I recall now that Kubernetes is principally a deployment system - sorry.
Perhaps it is well-integrated with some container systems, such as Docker.
Kubernetes didn't exist when I had a boss. I use Ansible.

I get that LXC is not very friendly. But it works OK, doesn't it? Just wrap a
shiny skin around it. Don't re-invent the wheel; we already have enough wheel
inventions.

I'm not by any means knowledgeable about containers; I evaluated LXC way back
when, and decided that VMs were more secure. And never revisited that
decision.

------
blablabla123
> The Docker engine default seccomp profile blocks 44 system calls today,
> leaving containers running in this default Docker engine configuration with
> just around 300 syscalls available.

...preventing devs/ops from running tools like iotop, unless extra
capabilities are added.
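
For illustration, a syscall blocked by the default profile surfaces inside
the container as EPERM rather than ENOSYS. A tiny Go probe, assuming the
default profile (which blocks unshare(2), among others):

    package main

    import (
        "fmt"
        "syscall"
    )

    func main() {
        // Under Docker's default seccomp profile this returns EPERM;
        // on a bare host (as root) it succeeds.
        err := syscall.Unshare(syscall.CLONE_NEWUTS)
        fmt.Println("unshare:", err)
    }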

I'm all in for containers and cgroups/namespaces, but at the moment it's
namespace isolation at the price of fewer features. Unless namespaces become
first-class citizens in the Linux kernel, it will always be more efficient to
just run on VMs or even bare metal. At least for non-planet-scale workloads.
:-)

~~~
ris
> ...preventing devs/ops from running tools like

This is because Docker makes the fundamental mistake of conflating packaging
with isolation. "Packaging" is achieved in Docker by using the OS to do a
certain amount of sandboxing and then letting the user perform whatever non-
reproducible crap they like before balling the whole thing up and calling it
a package.

If instead you make app authors actually figure out what their dependencies
are and how to fetch/build them - as in a system such as Nix - you not only
get reproducible packages, you also get to decide whether to apply actual
OS-level isolation on a case-by-case basis; a developer doesn't necessarily
need/want these barriers on their dev machine.

~~~
gtirloni
So a system like Android's? Honest question - I don't know if that's a good
model for general-purpose Linux systems. There's also Fuchsia, but I'm not
sure if it's POSIX.

~~~
ris
> So a system like Android's?

Not really, I'm essentially describing nix/guix.

------
xfitm3
I see container systems, such as Docker, as packaging systems more than
anything else.

~~~
hinkley
A Turing complete, language agnostic one.

If you were trying to write one that wasn’t container focused, how would that
look?

~~~
sedachv
[https://nixos.org/nix/](https://nixos.org/nix/)

[https://guix.gnu.org/](https://guix.gnu.org/)

Containers are a very poor substitute for package managers.

------
pjmlp
Just waiting for the new re-discovery of hypervisors, but with better
marketing names.

~~~
pojntfx
"Workload Orchestrators"

K8s can already do VMs with KubeVirt, so yeah.

~~~
tyingq
K8S is unnecessarily complicated. I fully expect "serverless", warts and all,
to take all comers. And, I get the irony. It's basically cgi-bin 2.0. It will
win not because it is better, but because it is better "understood".

~~~
meddlepal
Serverless has been around now for several years and it hasn't taken off
yet... definitely not to the same degree containers have.

I'm skeptical it will. To adopt serverless you need to be willing to
rearchitect your product and retool your developers... that's expensive.

~~~
Havoc
>(Serverless) I'm skeptical it will.

They'd get more uptake if it were easier, I think. Not by re-architecting
your product...but with small things here and there.

I wanted to play with Azure Python functions, but despite VS Enterprise &
lots of credits I can't. Without admin rights on the local machine it's
basically impossible. (You need VS Code & the AZ toolkit.)

(Unrelated - that kinda blew my mind - no, you can't do that in the 2,000 USD
VS Enterprise...you need to use the free one.)

~~~
snazz
> (Unrelated - that kinda blew my mind - no, you can't do that in the 2,000
> USD VS Enterprise...you need to use the free one.)

I’m guessing this is because Microsoft wants it to be more accessible—they
probably realize that there isn’t much money to be made in $2,000 developer
tools. Visual Studio was never a Python IDE; VS Code is much more language-
agnostic.

~~~
Havoc
Sure, but the whole "this feature is impossible in our $2,000 suite but it's
possible in our free one" thing doesn't seem strange to you?

If I'm buying the top-end product, I'm expecting the full feature set, no?

------
spion
The future is probably WASI: a sandboxed compile target with a
capability-based security model not owned by a single corporation.
[https://wasi.dev](https://wasi.dev)

------
curt15
Is there still any place in 2019 for "system containers" like LXC/LXD?

~~~
simosx
They are quite useful as "VMs at the speed of application containers".

When you use application containers, you tend to create a more complicated
setup and lose visibility as to what's going on in the system.

Just like you use VMs on a bare-metal server, you use system containers in a
VM.

But I find it even more important to use LXD as a tool for software
development and for Linux desktop use.

If you want to set up Node.js or something similar, it makes sense to put it
in an LXD container so that it does not mess up your host. Each step of the
workflow (creation, deletion, etc.) takes a few seconds or less.

You can also set up LXD containers to run GUI applications. That makes a lot
of sense if you want to run games (like Steam) so that your desktop does not
get polluted with i386 packages. You can also use it for development tools,
such as those from JetBrains, Android Studio and more.

------
kresten
Key point: AWS Firecracker does NOT run on AWS.

Unless you want to pay for bare metal instances - regular EC2 instances don't
expose the nested virtualization (KVM) that Firecracker needs.

AWS Firecracker DOES run on Google Cloud, Azure and DigitalOcean, which do
expose nested virtualization.

~~~
jacques_chester
Firecracker is used as the runtime for Lambda, I believe.

------
panpanna
For the record, hypervisors were introduced almost as early as operating
systems. In the beginning it was just a switch to partition memory and run
multiple jobs on a big, expensive machine.

So this is not the first time hypervisors are making a comeback.

------
jasoneckert
Wait... this isn't new - hypervisor-based container isolation has been in
Windows Server since 2016 (it's called Hyper-V Containers).

~~~
sayhello
Did the Windows Server isolation bundle a slim kernel?

One of the core elements of this for the Serverless fast boot requirement is
using a kernel that has legacy modules excised.

------
nunchuckninja
So I read something succinct a while back: Docker et al. are distribution
platforms, not security platforms. They add "0" security against an adversary
(think padlocks). How true is it? And what high-performance, securely
contained systems are there? OpenVZ?

~~~
upofadown
The generic issue seems to be that stuff like containers can be escaped with
pretty much any privilege escalation exploit ... and such exploits are
reasonably common in the world of Linux.

~~~
nunchuckninja
So either take the perf hit or don't expect isolation at all?

~~~
dilyevsky
For completely untrusted workloads, basically - yeah. For semi-trusted,
there's lots of tech that provides reasonable, lightweight isolation. There's
no reason why hardware vendors can't ship products that are both
virtualizable with high performance and secure, so that may still come.

------
amscanne
I’ve been thinking about these problems for a while. Previously, I thought
that the “put a VM on it” approach was the right one. In 2015, I wrote novm
[1], which I think served as inspiration for some developments that followed.
My thinking has changed over the years and I actually work on gVisor today
(disclaimer!). I’d like to share some thoughts here.

Hypervisors never left. They are a fundamental building block for
infrastructure and will continue to be.

The question is whether there will be a broad shift to start relying on
hypervisors to isolate every individual application. In my opinion, just
wrapping containers in VMs is not a solution. (Nor do I find it
technologically interesting, but that’s me.) I agree that the approach
addresses some of the challenges of isolation, but is one step forward, two
steps back in other ways.

Virtualizing at the hardware boundary lets you do some things very well. For
example, device state is simple, and hardware support lets you track dirty
memory passively and efficiently, so you can implement live migration for
virtual machines much better than you could for processes. It can divide big
machines into fungible, commodity sizes (freeing applications from having to
care about NUMA, etc.). It lets you pass through and assign hardware devices.
It gives you a strong security baseline.

But abstractions work best when they are faithful. Virtual machines operate on
virtual CPUs, memory and devices, and operating systems work best when those
abstractions behave like the real thing. That is, CPUs and memory are mostly
available, and hardware acts like hardware (works independently, interactions
don’t stall).

Containers and applications operate on OS-level abstractions: threads, memory
mappings, futexes, etc. These abstractions are the basis for container
efficiency — not because startup time is fast, but because these abstractions
allow for a lot of statistical multiplexing and over-subscription while still
performing well. The abstractions provide a lot of visibility for the OS to
make good choices with global information (e.g. informing the scheduler,
reclaim policy, etc.).
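
As a toy illustration of that over-subscription (my sketch, nothing from any
particular runtime): a Linux process can reserve far more address space than
the machine has RAM, because the kernel backs pages lazily out of a globally
managed pool.

    package main

    import (
        "fmt"
        "syscall"
    )

    func main() {
        // Reserve 64 GiB of anonymous address space; the kernel allocates
        // physical pages only on first touch, so this succeeds on a machine
        // with far less RAM. A hard VM memory boundary gives this up.
        const size = 64 << 30
        mem, err := syscall.Mmap(-1, 0, size,
            syscall.PROT_READ|syscall.PROT_WRITE,
            syscall.MAP_PRIVATE|syscall.MAP_ANON|syscall.MAP_NORESERVE)
        if err != nil {
            panic(err)
        }
        mem[0] = 1 // only now does one physical page get allocated
        fmt.Println("reserved", len(mem), "bytes; touched one page")
    }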

A problem arises when you decide that you want to bind single applications to
single VMs, and then run many VMs instead of many containers. Effectively, the
abstractions that you expose are now CPUs and memory, and these just don’t
work as well for over-subscription and overall infrastructure efficiency.
There’s no shared scheduler or cooperative synchronization (e.g. in an OS,
threads waking each other will be moved to the same core), there’s no shared
page cache, etc.

There are other problems too: virtualization gives you a very strong security
baseline, but you have to start punching significant holes to get the
container semantics you want. E.g. the cited virtfs is a great example: it’s
easy to reason about the state of a block device, but an effective FUSE API
(and shared memory for metadata) is a much larger system surface. The hardware
interface itself is not a silver bullet. Devices are still complex (escapes
happen), and the last few years have taught us that even the hardware
mechanisms can have flaws. For example, AFAIK Kata Containers is still
vulnerable to L1TF unless you’re using instance-exclusive cpusets or have
disabled hyper-threading. (Whereas native processes and containers are not
vulnerable to this particular bug.)

The “put a VM on it” approach also may not have the standard image problems
that plain hypervisors have, but you’ve got portability challenges. It seems
non-ideal that a container isolation solution can run in infrastructure X and
Y, but not in standard public clouds or your on-prem VMware hosts, etc. (There
might be specific technologies for each case, but that’s rather the point.)

That’s my 2c. I’m pretty optimistic that we can have strong isolation while
still preserving the efficiency, portability and features of container-based
infrastructure. I like a lot of these projects (especially the ones doing
technologically interesting things, e.g. nabla, x-containers, virtfs, etc.)
but I don’t think the straight-up “put a VM on it” approach is going to get us
there.

~~~
ricarkol1
Hi, completely agree with all of this. In fact, we've been focusing on the
problem you mention about needing FS holes for VMs to regain container
semantics ([https://www.usenix.org/system/files/hotstorage19-paper-
kolle...](https://www.usenix.org/system/files/hotstorage19-paper-koller.pdf)).
Just in case, these are some of the container semantics we care about: FS
crash consistency, file sharing (write+write), and efficient use of memory due
to having a single page cache. The key question is: what's the smallest hole
we can poke (smaller than allowing every single FS operation in the host)?

