
Making containers safer - chmaynard
https://lwn.net/SubscriberLink/796700/9bc9daa32a8fe499/
======
sansnomme
The whole container safety story has been a mess/cluster____/bag-of-tricks
since the very beginning. Unlike BSD jails, security was never the number one
priority for containers. Just take a look at what GCE and AWS use, respectively:
the former built an entire syscall proxy with gVisor, while the latter uses a
hypervisor-based solution with Firecracker (think mini-VMs). Anything in
production that touches foreign code can't really use generic out-of-the-box
container systems like Docker. There is a whole bunch of tech such as Hyper
Container, Kata, runq, etc., all so that you can run containers on a hypervisor
and get proper sandboxing. All in all quite disappointing, to be honest.

(The main downside of hypervisors is that they are difficult to run on low-
cost commodity clouds like Digital Ocean. You are forced to use bare-metal
offerings like Hetzner, AWS, Packet, etc. The whole point of cloud is that the
hardware/infrastructure is mostly abstracted away. If I have to run my own
hypervisor just to ensure a container can't be broken out of, what's the
point of even calling it cloud?)

~~~
hardwaresofton
> (The main downside of hypervisors is that they are difficult to run on low-
> cost commodity clouds like Digital Ocean. You are forced to use bare-metal
> offerings like Hetzner, AWS, Packet, etc. The whole point of cloud is that
> the hardware/infrastructure is mostly abstracted away. If I have to run my
> own hypervisor just to ensure a container can't be broken out of, what's
> the point of even calling it cloud?)

Sorry if I'm misunderstanding, but the hypervisor in the cloud (i.e. the one
running under your "instance") is there to protect the cloud provider _from
you_. The hypervisors _you're_ running (Firecracker, Kata Containers, gVisor,
Nabla, etc.) are there to protect _your_ dangerous workloads from your _other_
workloads.

It is totally within your power to choose to run one cloud "instance" per
workload so you don't have to worry about cross-workload contamination, but if
you want to effectively multiplex your workloads onto constantly available
machines, you're going to have to mix them, and that almost certainly requires
protection of _some_ kind.

In an ideal world we might all be getting unikernel VMs that are a bit safer
to use, but in general it feels like you need to isolate these workloads
somehow if you're going to run them on the same "instance". What does your
perfect/preferred world look like?

~~~
paulfurtado
Kata works pretty well. You set it as a runtime for Docker and it gives a
fantastic illusion that the container is no different from any other Docker
container, while actually using a KVM VM with minimal virtual hardware to run
it. It starts in under 200ms (and faster if you make it use Firecracker), etc.
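
As a rough sketch (the runtime name and binary path are assumptions and vary
by install), wiring Kata into Docker amounts to registering an extra runtime
in daemon.json and then selecting it per container:

    # /etc/docker/daemon.json -- register kata-runtime as an extra runtime
    {
      "runtimes": {
        "kata": { "path": "/usr/bin/kata-runtime" }
      }
    }

    # restart dockerd, then opt individual containers into it:
    sudo systemctl restart docker
    docker run --rm --runtime=kata alpine uname -r   # reports the guest kernel, not the host's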

The problem your parent is pointing out is that AWS instances don't support
nested virtualization (and it sounds like neither does Digital Ocean), so on
both of these cloud providers you can't use something like Kata Containers or
any nested VMs. GCP does support nested virtualization in any VM, and AWS
`.metal` instance types do as well, but those are rather expensive. It's really
a shame that even with the new EC2 KVM-based hypervisor they still didn't
enable nested virtualization for most instance types, otherwise we'd definitely
be making heavy use of Kata Containers.

I wish Oracle hadn't bought Ravello Systems, and that Ravello had open-sourced
their binary translation technology that made nested VMs possible in EC2
without the full overhead of software virtualization. Unfortunately, there are
no open source implementations of similar software that I know of. Their blog
is now hosted on Oracle's site:
[https://blogs.oracle.com/ravello/nested-virtualization-with-binary-translation](https://blogs.oracle.com/ravello/nested-virtualization-with-binary-translation)

~~~
dragonsh
We tried Kata; the user experience is not as good as LXD's.

Moreover, with LXD you just use the same deployment platform and scripts as
for bare metal and VMs. No need to fiddle with a mix of shell scripts and a
DSL for container orchestration, or to work out how to integrate it with your
own code. No zombie processes like in Docker.

~~~
paulfurtado
Kata is great if you're already using Docker; it also gives you increased
isolation versus LXD.

If you don't already have a use case for Docker, Kubernetes, the image
repository features, etc., then of course Kata wouldn't be useful to you over
LXC.

Also, in case you're dealing with zombie processes in Docker again and can't
ditch Docker, running a container with `docker run --init` gets you a minimal
init process that reaps the zombies. Adding the equivalent option to dockerd
instead does this for all containers by default. It's insane that this isn't
turned on by default, since it's such a common problem. Additionally, if you
want to run systemd inside the container similar to LXC, you can install the
systemd OCI hook and it will do the necessary setup for systemd to be happy.
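
For reference, a minimal sketch of both forms (the image name is just a
placeholder):

    # per-container: run a tiny init as PID 1 so orphaned children get reaped
    docker run --init --rm my-image

    # or make it the daemon-wide default in /etc/docker/daemon.json
    {
      "init": true
    }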

~~~
dragonsh
Thanks for the information; I did look at Kata again, but it just didn't click
the way LXD did. I found that with LXD the image repositories are very easy to
use, my team can re-use for deployment whatever they learned for bare metal
and virtual machines, and, last but not least, it's performant with nice
tooling.

Initially, when I started using it, I had a few networking issues due to
differing support in GCP, AWS and Azure, but by now my team is proficient
enough to use it even there. Also, the constant updates Stéphane and his team
deliver are fantastic.

If you have time, please look into the weekly updates and play with LXD. It
works great for reasonably sized container clusters, which covers most
startups.

It won't be good for Google-scale deployments yet without more tooling like
Kubernetes.

------
nisa
The more I use Linux containers (big fan of LXD), the more I'm convinced
Solaris got it right with Zones, Crossbow, ZFS, and SMF for zone services too.
Linux is still not there; instead we have lots of vendor glue code in Go and
many asterisks about what is not possible. IMHO Linux should implement
something like the zone concept that conceals cgroups, network, and mount
namespaces. Quota is still broken in btrfs, you can't delegate ZFS commands
into a ZFS-backed LXD container (where quota actually works), and if you
attempt to run something like Docker Swarm in LXD you realize only parts of
the kernel are network-namespace aware. It will probably converge to either
some vendor solution or get resolved in the kernel (I doubt the ZFS
integration will happen, as it looks like the current devs are actively
fighting ZFS), but conceptually a zone as a security boundary feels more sane
to me than the glued-together mess in the kernel at the moment. With kernel
zones there was even a concept for stronger security.

~~~
paulfurtado
What issue did you face around network namespace awareness in this setup?

~~~
nisa
br_netfilter is not namespace-aware, at least (it's fixed in 5.3 AFAIK), but
there are probably other issues.

------
hardwaresofton
LXC+LXD is one of the most undervalued container technologies out there. It
can do _a lot_ of cool things (like live migration via CRIU), and is IMO more
production-ready than Docker ever was.

For those wondering about the distinction between "system" containers and
"app" containers, the difference is user namespacing -- I think we should stop
using that naming distinction and instead simply specify whether a container
has full user namespacing. User namespacing is the "magic sauce" for projects
like Podman and the related project Buildah, which builds images without root
privileges using the combination of user namespaces and FUSE.
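
To make the "magic sauce" concrete, here's a quick sketch of inspecting the
UID mapping of an unprivileged container (the container and image names are
just examples, and the 100000 base is only the typical default):

    # unprivileged LXD container: root inside maps to a high, unprivileged host UID
    lxc launch ubuntu:18.04 c1
    lxc exec c1 -- cat /proc/self/uid_map
    #          0     100000      65536   -> UID 0 in the container is 100000 on the host

    # same idea with rootless Podman: the mapping is anchored at your own user
    podman unshare cat /proc/self/uid_map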

BTW, for those wondering whether container orchestrators like Kubernetes will
be shaken by this -- they very likely won't. Kubernetes has sidestepped this
problem by introducing and relying on RuntimeClass[0], which allows you to
swap out the runtime that's running underneath Kubernetes. In fact, there is
already a shim being worked on[1].
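
As an illustration (the handler name "kata" assumes the node's CRI runtime has
a handler configured under that name, and the pod is just a placeholder),
picking a different runtime per workload looks roughly like this:

    kubectl apply -f - <<'EOF'
    apiVersion: node.k8s.io/v1beta1
    kind: RuntimeClass
    metadata:
      name: kata
    handler: kata
    ---
    apiVersion: v1
    kind: Pod
    metadata:
      name: untrusted-workload
    spec:
      runtimeClassName: kata
      containers:
      - name: app
        image: nginx
    EOF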

[EDIT] - While I'm here talking about container tech, please check out
containerd[2]. It's been my runtime of choice for a very long time, has been
very well maintained, and received many of the really advanced features first
(e.g. untrusted runtimes). IMO it should be the default container runtime of
k8s.

[0]: [https://kubernetes.io/docs/concepts/containers/runtime-class/#upgrading-runtimeclass-from-alpha-to-beta](https://kubernetes.io/docs/concepts/containers/runtime-class/#upgrading-runtimeclass-from-alpha-to-beta)

[1]:
[https://github.com/automaticserver/lxe](https://github.com/automaticserver/lxe)

[2]: [https://github.com/containerd](https://github.com/containerd)

~~~
scaryclam
I'm curious as to how you get your images built. I'd love to remove the over
reliance on docker in my team, but the sticking point seems to be building and
storing images. I know that Kubernetes can run with different runtimes, and
that docker images can be oci compliant, but it's a hard sell to ask everyone
to try something different in production when they do everything else with
docker. Do you have a workflow that could help with that sort of thing?

~~~
wikibob
Bazel can natively build images without Docker.

Bazel has many other benefits too, like creating a full dependency build graph
and fully reproducible builds.

[https://bazel.build/](https://bazel.build/)

[https://github.com/bazelbuild/rules_docker](https://github.com/bazelbuild/rules_docker)

~~~
oblio
Both your solution, Bazel, and that of the neighboring comment, GitLab,
involve way too many parts you don't necessarily want or need. I'm not going
to rewrite my build process (which is generally a thankless job with very low
customer ROI) just to build container images :-)

Same story for GitLab: nobody's going to migrate from GitHub or Bitbucket just
to get container image builds.

------
dragonsh
I have been a user of LXC and LXD since 2013. Early on, when Docker was based
on LXC, I tried it. Then Docker went its own way and built its own
libcontainer library. Around that time the LXC project added support for
unprivileged containers, and since then I haven't used Docker. Still today,
when the majority of container runtimes, including Kubernetes, are based on
Docker containers (some on OCI), I continue to use LXD and LXC.

LXD has been more secure by default, given that it allowed unprivileged
containers very early on, and that works very nicely. Recently, when there was
a security problem with Kubernetes, I was still OK because we only use
unprivileged containers.
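
As a small illustration (the container name is just an example), privilege in
LXD is an explicit per-container opt-in rather than the default:

    # unset/false by default: the container runs unprivileged, in a user namespace
    lxc config get c1 security.privileged

    # the explicit opt-in you generally want to avoid
    lxc config set c1 security.privileged true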

I love that LXD containers are lightweight compared to Kubernetes, and that
the same Ansible (or other orchestration platforms like Puppet or Chef) can
work with bare metal, VMs and containers, with no need to fiddle with shell
scripts, Dockerfiles, or a container-orchestration-specific domain-specific
language (DSL). Hopefully LXD gets more popular.

So far OpenStack, OpenNebula and Proxmox support native LXD containers
alongside KVM and other virtualization. For most small websites with thousands
or millions of users, LXD by itself can work pretty well without relying on
any cloud orchestration platform.

~~~
xarope
How secure do you feel LXD containers are, given the defaults? No worse than
bare metal? No worse than a QEMU/KVM VM?

I do use LXD containers too, but mainly to create a bunch of testing nodes.

~~~
dragonsh
OK, to answer your question: QEMU/KVM is more secure than LXD, as each VM runs
its own kernel code. For containers there are basically two choices: use a
Dockerfile with Docker-style containers, or use LXD with LXC. There are Kata
containers, but they are not as user-friendly as LXD. Most other container
runtimes run as the privileged root user. In the case of LXD, each container
runs in userspace, so your security is like managing a multi-user Linux
system, and we understand the management of multi-user Linux very well
compared to other, more esoteric schemes. So I feel LXD offers better security
than Docker-style containers. This is one of the reasons most of the big cloud
providers like GCP, AWS and Azure do not offer bare-metal containers; most run
on top of their VMs, which are more secure in multi-tenant systems.

We use LXD for production and the containers are very lightweight. Running
hundreds or thousands of them on a single bare-metal machine is fine.

------
bscphil
This was a pretty good read. I use containers quite a lot on my home server
to maintain a bunch of utilities. Mostly I'm using systemd-nspawn.

> User namespaces have been around since the 3.12 kernel, but few other
> container management systems use the feature to isolate their containers.
> Part of the reason for that is the difficulty in sharing files between
> containers because of the UID mapping. LXD is currently using shiftfs on
> Ubuntu systems to translate UIDs between containers and the parts of the
> host filesystem that are shared with it. Shiftfs is not upstream, however;
> there are hopes to add similar functionality to the new mount API before
> long.

This is exactly the problem I've run into. If you're trying to share files
between containers, or between the host and a container (e.g. with
systemd-nspawn's "--bind" option), it's _much_ harder to have your permissions
set properly and still access the files from inside the container if you're
using user namespacing.
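
A quick sketch of the mismatch (the container and bind paths are just
examples): with user namespacing enabled, host-owned files in a bound
directory show up as unowned inside the container unless their UIDs are
shifted into the container's range:

    # boot a container with user namespacing (-U) and a host directory bound in
    sudo systemd-nspawn -D /var/lib/machines/mycontainer -U --bind=/srv/shared -b

    # then, inside the container, files owned by an unmapped host UID appear as
    # 65534 (nobody/nogroup):
    ls -ln /srv/shared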

There's also the issue of creating the container in the first place. If you
follow the instructions for creating a container on the Arch Wiki[1], the
files will end up owned by the host's root (mostly), which is a problem when
you then try to boot the container as a different (namespaced) user. I don't
know of a straightforward way to create a namespaced container with systemd-
nspawn, and I don't think there's any way to convert an existing container to
a namespaced container.

Yet another problem that sometimes arises is that various distributions have
user namespacing disabled (it has to be enabled when the kernel is built) -
notably Arch [2], though this may have recently changed. This is apparently
due to concerns that the namespacing code is buggy and can itself lead to
privilege escalation vulnerabilities.

[1] [https://wiki.archlinux.org/index.php/Systemd-nspawn#Examples](https://wiki.archlinux.org/index.php/Systemd-nspawn#Examples)

[2]
[https://bugs.archlinux.org/task/36969](https://bugs.archlinux.org/task/36969)

~~~
e12e
I'm a little surprised that LXD doesn't cater more to NFS, rather than bind
mounts etc., for file sharing. I mean, if you already have private/secure
networking, just use a network filesystem for your... network filesystem
needs?

~~~
stgraber
NFS currently cannot be used inside a user namespace.

There are patches floating around (similar to the work we did to allow FUSE),
but they haven't made it upstream yet, and my understanding is that there are
some tricky corner cases with NFSv3 which still need to be sorted out (NFSv4
was easier due to already having UID mapping capabilities).

------
awirth
Note to anyone confused: the Docker concept of '--privileged' is separate
from what the LXD folks are referring to as 'privileged containers'. The LXD
folks are talking about mapping UID 0 into the container, whereas (IIRC) the
Docker flag disables dropping capabilities and the seccomp syscall filters
(and maybe some other things? I can't remember off the top of my head).

The equivalent Docker functionality is userns-remap, or sometimes just "user
namespaces".

~~~
mav3rick
I hate unnecessary abstractions. All this "docker functionality" is actually
just based on namespaces and cgroups. I get what you're trying to say though.

~~~
awirth
As I was trying to understand this space coming from a 'docker user'
background I was incredibly confused by the two different definitions of
"privileged containers" ("what do you mean if I don't add --privileged it's
still privileged?") -- so I figured others might appreciate the pointer as
well.

Now that I understand what's going on better, I definitely agree that Docker
does not abstract the kernel APIs in a way that makes it easy to understand
what's going on under the hood. Is that a good thing? I honestly don't know.

I'd encourage anyone interested in learning more to check out the codebase for
JessFraz's contained.af.

------
laurieodgers
Honestly, it's horses for courses.

If you want to run your workload in the most secure manner using BSD jails,
then go for it, but you will soon find that people with the expertise to
maintain it are hard to find. Almost all DevOps/SRE/systems engineers want to
run their workloads (containers) in Docker.

The same goes for LXD.

Let's get one thing right: none of these systems is a virtual machine with its
own hardware and kernel. All three share the host's kernel, although BSD
treats separation of workloads as a consideration within the kernel itself.
The Linux kernel has cgroups as a building block, which to me almost makes it
feel like an afterthought.

I would never run a container with hostile code on the same Docker host as my
mission-critical workload. Hell, not even on the same network
(physical/VPC/overlay). However, I would give it more consideration if the
same workloads were on a BSD machine with jail separation. This was the
premise of what I was employed to build in my previous job:
[https://github.com/tredly/tredly](https://github.com/tredly/tredly) . A
better separation of host and jail workloads can be achieved through BSD
jails, as separation of concerns has been taken into consideration within the
kernel, instead of being built on top of the kernel.

This article is two years old. Linux cgroups and BSD jails have come on leaps
and bounds since. Docker may have been the "flavour of the month" back then,
but it continues to be the DevOps/SRE/systems engineer's tool of choice. It's
not a silver bullet by any means, and has many shortcomings when used on its
own. This is why systems continue to be built on top of systems (tar -> apt ->
docker -> k8s -> ?).

Of course, all of the above assumes someone in my position as a DevOps
engineer consuming resources either on on-prem infrastructure or in the cloud.
Workloads from an IaaS perspective are a completely different kettle of fish.

~~~
chmaynard
> This article is 2 years old.

Really? You just shot your credibility in the foot.

------
trabant00
I have used containers in production since 2002. We knew back then that
containers are escapable, and nothing has changed since. Containers are for
ease of management - package and configuration separation, one role per
container. In absolutely no case are they meant for multi-tenant use.

Containers are fine the way they are. Just use them for the right job.

~~~
necovek
So instead of trying to improve their security while keeping their
orders-of-magnitude startup/overhead advantage over VMs, you simply suggest
doing nothing?

FWIW, I like that both camps are working on their shortcomings: the
LXD/container camp on isolation, and the VM camp on overhead (e.g.
Firecracker).

~~~
trabant00
I suggest using the right tool for the right job. There is no such thing as
"shared but separate". After Spectre, Meltdown and co., everybody should
understand this. It's not a limitation of current hardware/code; it's a
limitation in principle.

------
hapless
summary: containers have no security story without SELinux.

