Making containers safer (lwn.net)
211 points by chmaynard 62 days ago | 52 comments

The whole container safety story has been a mess/cluster/bag-of-tricks since the very beginning. Unlike with BSD jails, security was never the number one priority for containers. Just take a look at what GCE and AWS use, respectively: the former built an entire syscall proxy with gVisor, while the latter uses a hypervisor-based solution with Firecracker (think mini-VMs). Anything in production that touches foreign code can't really use generic out-of-the-box container systems like Docker. There is a whole bunch of tech such as Hyper Container, Kata, Runq etc., all so that you can run containers on a hypervisor and get proper sandboxing. All in all quite disappointing, to be honest.

(The main downside of hypervisors is that they are difficult to run on low-cost commodity cloud like Digital Ocean. You are forced to use bare metal stuff like Hetzner, AWS, Packet etc. The whole point of cloud is that the hardware/infrastructure is mostly abstracted away. If I have to run my own hypervisor just to ensure the container doesn't get broken out of, what's the point of even calling it cloud?)

> (The main downside of hypervisors is that they are difficult to run on low-cost commodity cloud like Digital Ocean. You are forced to use bare metal stuff like Hetzner, AWS, Packet etc. The whole point of cloud is that the hardware/infrastructure is mostly abstracted away. If I have to run my own hypervisor just to ensure the container doesn't get broken out of, what's the point of even calling it cloud?)

Sorry if I'm misunderstanding, but the hypervisor in the cloud (i.e. the VM running under your "instance") is there to protect the cloud provider from you. The hypervisor you're running (Firecracker, Kata Containers, gVisor, Nabla, etc.) is there to protect your dangerous workloads from your other workloads.

It is totally within your power to choose to run 1 cloud "instance" per workload so you don't have to worry about cross-workload contamination, but if you want to effectively multiplex your workload on constantly available machines, you're going to have to mix your workloads, and that almost certainly requires protection of some kind.

In an ideal world we might all be getting VMs that were unikernels, which are a bit safer to use, but in general it feels like you need to isolate these workloads somehow if you're going to run them on the same "instance". What does your perfect/preferred world look like?

Kata works pretty well. You set it as a runtime for Docker and it gives a fantastic illusion that the container is no different from any other Docker container, while actually using a KVM VM with minimal hardware to run it. It starts in under 200ms (and faster if you make it use Firecracker).
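For anyone wondering what "set it as a runtime for Docker" looks like, here is a minimal sketch that generates the relevant `daemon.json` fragment. The runtime name `kata` and the binary path `/usr/bin/kata-runtime` are assumptions for illustration; check where your distribution actually installs the Kata binary.

```python
import json


def daemon_config(runtime_name, runtime_path, make_default=False):
    """Build a Docker daemon.json fragment that registers an extra OCI runtime."""
    config = {"runtimes": {runtime_name: {"path": runtime_path}}}
    if make_default:
        # Optional: use this runtime for every container unless overridden.
        config["default-runtime"] = runtime_name
    return config


# The name and path below are hypothetical; adjust them to your installation.
fragment = daemon_config("kata", "/usr/bin/kata-runtime")
print(json.dumps(fragment, indent=2))
```

With the fragment merged into the daemon configuration and dockerd restarted, a single container can opt in with `docker run --runtime kata ...`.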

The problem your parent is pointing out is that AWS instances don't support nested virtualization (and it sounds like neither does Digital Ocean), so on both of these cloud providers you can't use something like Kata containers or any nested VMs. GCP does support nested virtualization in any VM, and AWS `.metal` instance types do as well, but those are rather expensive. It's really a shame that even with the new EC2 KVM-based hypervisor they still didn't enable nested virtualization for most instance types, otherwise we'd definitely be making heavy use of Kata containers.

I wish Oracle hadn't bought Ravello Systems and that Ravello open sourced their binary translation stuff that made nested VMs possible in EC2 without the full overhead of software virtualization. Unfortunately, there are no open source implementations of similar software that I know of. Their blog is now hosted on oracle's site: https://blogs.oracle.com/ravello/nested-virtualization-with-...

We tried Kata; the user experience is not as good as LXD's.

Moreover, with LXD you just use the same deployment platform and scripts you already use for bare metal and VMs. No need to fiddle with a mix of shell scripts and a container-orchestration DSL, or to work out how to integrate them with your own code. No need to deal with zombie processes like in Docker.

Kata is great if you're already using Docker; it also offers increased isolation versus LXD.

If you don't already have a use case for Docker, Kubernetes, the image repository features, etc., then of course Kata wouldn't be useful to you over LXC.

Also, in case you're dealing with zombie processes in docker again and can't ditch docker, running a container with `docker run --init` gets you a minimal init process that reaps the zombies. Adding that flag to dockerd instead does this for all containers by default. It's insane that this isn't turned on by default since it's such a common problem. Additionally, if you want to run systemd inside the container similar to lxc, you can install the systemd-oci-hook and it will do the necessary setup for systemd to be happy.
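To make the zombie-reaping part concrete, here is a rough sketch of the core job such a minimal init performs: repeatedly collecting the exit status of any child that has finished. This is only an illustration of the idea, not the code of the init that Docker ships; the helper names `reap_exited_children` and `spawn_child` are made up for this example.

```python
import os
import time


def reap_exited_children():
    """Collect every child that has already exited, without blocking.

    Returns a list of (pid, exit_code) pairs; exit_code is None when the
    child was terminated by a signal instead of exiting normally.
    """
    reaped = []
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # this process has no children at all
        if pid == 0:
            break  # children exist, but none has exited yet
        code = os.WEXITSTATUS(status) if os.WIFEXITED(status) else None
        reaped.append((pid, code))
    return reaped


def spawn_child(exit_code):
    """Fork a child that exits immediately (stand-in for an orphaned process)."""
    pid = os.fork()
    if pid == 0:
        os._exit(exit_code)  # child: exit without running parent cleanup
    return pid
```

A real init would call something like `reap_exited_children()` whenever it receives SIGCHLD, and would also forward signals to the main process. Until the parent calls `waitpid`, an exited child stays in the process table as a zombie, which is exactly what piles up when PID 1 in the container never reaps.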

Thanks for the information; I did indeed look at Kata again, but it just didn't click the way LXD did. I found that with LXD the image repositories are easy to work with, my team can reuse whatever they learned deploying to bare metal and virtual machines, and last but not least it's performant, with nice tooling.

When I first started using it I had a few networking issues due to differing support in GCP, AWS and Azure, but my team is now proficient enough to use it even on those. Also, the constant updates from Stephen and his team are fantastic.

If you have time, please look into the weekly updates and play with LXD. It works great for a reasonably sized container cluster, which covers most startups.

It won't be a good fit for Google-scale deployments yet without more tooling like Kubernetes.

As a person who runs on (and really likes) Hetzner... I totally misunderstood, thanks for clearing it up.

Now that you mention that, it is kind of odd that AWS continues to not have nested virt, considering both Azure and GCP have it.

In the public cloud (aws/gce) we provision unikernels as vms themselves - that is there is no underlying linux instance. It's just the unikernel riding on top of the hypervisor (in the case of google that's kvm, for the t2s on aws it's xen but their c5s are kvm). For small services you use t2-micros (aws) and f1-micros (gce). Both will give you a thread. If you need more you scale more.

This works great if you have a handful of services - if you have hundreds or thousands of services yes you might want to look at running on your own hardware as the public cloud could get expensive at that point or conversely you can re-evaluate if you need hundreds of services and maybe just opt for larger services that use heftier amounts of threads/memory (eg: larger instance sizes). As for running your own hardware we have crappy (eg: $800) servers in our office that have 32 threads and we haven't really stressed it at all but we can easily run thousands of unikernels on those boxes. Also, in the bay area you can grab a cabinet from Hurricane Electric for $400/month. Imagine what you could do with real hardware.

What problem domain are you working in that lets you run unikernels effectively? I.e., you don't need anything that ultimately doesn't run well in unikernels. For example, our application is a web app - that by itself can run reasonably well in a unikernel (not with how we've written it, but it can be done). But we call out to imagemagick for some processing, which would pretty much preclude a unikernel approach (barring breaking it out into its own service, which is totally doable - it's hard to provide a perfect example).

Also, what are you using to create the unikernels?

> The hypervisor you're running (Firecracker, Kata Containers, gVisor, Nabla, etc.) is there to protect your dangerous workloads from your other workloads.

I think it's more likely that the hypervisor someone is running on top of their cloud provider's hypervisor exists because many people like building their own cloud providers and like to add layers of unnecessary complexity (like a second layer of hypervisors) at their employer's or investor's expense.

If someone doesn't trust their own workloads, it is vastly simpler and less expensive to run them on separate cloud instances.

Spawn time is fairly significant for typical VMs. An entire class of applications (such as serverless) can't run on them, because traditional VM creation takes several minutes, while the hybrid container-VM types do it in milliseconds, allowing for near real-time applications.

It's not clear how what you're saying relates to nested hypervisors, but I don't think anyone does serverless on nested hypervisors. AWS Lambda uses MicroVMs which to my knowledge are not nested.

Your typical open source "serverless" systems that run on Kubernetes have a security story that's "trusted code only". If your application involves routinely running hostile code (e.g. something like Cloudflare Workers), your usual put-it-in-a-container story just won't cut it. You need on-demand, rapidly deployed containers that are capable of providing VM-level security and isolation with the spawn time and resource consumption of a container. This isn't new technology; AWS, Google etc. have had some variant of it from day 1. Heroku and a lot of the secondary PaaS providers all run on top of the big players; they don't do it themselves. It is only recently that the technologies involved were open sourced and standardized.

Oh I view typical serverless systems as AWS Lambda on MicroVMs and lambda simulators. Or Azure functions etc.

Cooking your own serverless platform with Kubernetes seems like the waste of resources I mentioned earlier, and most people who work on serverless would regard maintaining a custom isolation framework as not serverless.

We don't need unikernels. The technology for running containers securely has existed for years. The only problem is that few mainstream cloud providers support it, since it requires nested virtualization among other things. And traditional providers like AWS and GCE make you pay through the nose for it, because if it became widely available, the whole "serverless", elastic container service etc. industry might be eroded. Right now you have two choices: run your own co-located bare metal server with all the complexity that comes with it, or pay for a VPS with nested virtualization support that is an order of magnitude more expensive.

cloud providers are starting to expose the virtualisation cpu feature these days

I've seen it on gcp and ovh (not tested aws)
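If you want to check whether a given instance exposes the virtualisation CPU feature, a quick sketch is to look for the `vmx` (Intel VT-x) or `svm` (AMD-V) flag in `/proc/cpuinfo`. This assumes an x86 Linux guest; other architectures report their features under different keys. The function names are made up for this example.

```python
def virtualization_flags(cpuinfo_text):
    """Return the hardware virtualization flags found in /proc/cpuinfo text.

    Looks only at the x86 "flags" lines and keeps "vmx" (Intel VT-x)
    and "svm" (AMD-V).
    """
    found = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            found.update(flag for flag in value.split() if flag in ("vmx", "svm"))
    return found


def read_cpuinfo(path="/proc/cpuinfo"):
    """Read the cpuinfo file; the default path is the usual Linux location."""
    with open(path) as handle:
        return handle.read()
```

On a Linux box, `virtualization_flags(read_cpuinfo())` returning an empty set means the provider is not passing the feature through, so KVM-based runtimes such as Kata will not be able to use hardware virtualization there.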

The more I use Linux containers (big fan of LXD), the more I'm convinced Solaris got it right with Zones, Crossbow, ZFS, and SMF for zone services. Linux is still not there; instead we have lots of vendor glue code in Go and many asterisks about what is not possible. IMHO Linux should implement something like the zone concept that conceals cgroups, network and mount namespaces. Quota is still broken in btrfs, you can't delegate ZFS commands into a ZFS-backed LXD container (where quota actually works), and if you attempt to run something like Docker Swarm in LXD you realize only parts of the kernel are network-namespace aware. It will probably either converge on some vendor solution or be resolved in the kernel (I doubt the ZFS integration will happen, as it looks like the current devs are actively fighting ZFS), but conceptually a zone as a security boundary feels more sane to me than the glued-together mess in the kernel at the moment. With kernel zones there was even a concept for stronger security.

Preach, brother! At a previous company we used ZFS snapshots to do with Solaris containers what we now do with Docker. We had push-button revertible containers (zones) for developers to muck around and test with circa 2009, and we could send them around to different pieces of hardware with `zfs send`. It tied in very nicely with our continuous integration, too.

HP-UX vaults were also quite good.

We have something like kernel zones in the form of OpenVZ. I think we'll eventually have them again in mainline, but it'll be built from those pieces, just like the current vendor glue-code.

What issue did you face around network namespace awareness in this setup?

br_netfilter is not namespace aware at least (it's fixed in 5.3 afaik) but there are probably other issues.

LXC+LXD is one of the most undervalued container technologies out there. It can do a lot of cool things (like live migration via criu), and is IMO more production ready than Docker ever was.

For those wondering about the distinction between "system" containers and "app" containers, the difference is user namespacing -- I think we should stop using the naming difference and instead specify that the containers have "full user namespacing". User namespacing is the "magic sauce" for projects like Podman and the related project Buildah, which builds images without root privileges using the combination of user namespaces and FUSE.

BTW for those wondering if container orchestrators like Kubernetes will be shaken by this -- they very likely won't. Kubernetes has sidestepped this problem by introducing and relying on RuntimeClass[0], which allows you to swap out the runtime that's running underneath kubernetes. In fact, there is already a shim being worked on[1].

[EDIT] - While I'm here talking about container tech, please check out containerd[2]. It's been my runtime of choice for a very long time, has been very well maintained, and received many of the really advanced features first (ex. untrusted runtimes). IMO it should be the default container runtime of k8s.

[0]: https://kubernetes.io/docs/concepts/containers/runtime-class...

[1]: https://github.com/automaticserver/lxe

[2]: https://github.com/containerd
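To illustrate the RuntimeClass mechanism mentioned above, here is a small sketch that builds the two manifests involved: a RuntimeClass that names a handler, and a pod that opts into it. The names `sandboxed`, `kata`, `demo` and the image are placeholders, the handler must match whatever your node's container runtime is actually configured with, and the `node.k8s.io/v1` API version assumes a reasonably recent cluster (older ones used a beta version).

```python
import json


def runtime_class(name, handler):
    """Manifest for a RuntimeClass mapping a name to a node-level runtime handler."""
    return {
        "apiVersion": "node.k8s.io/v1",
        "kind": "RuntimeClass",
        "metadata": {"name": name},
        "handler": handler,
    }


def pod_using(runtime_class_name, pod_name, image):
    """Manifest for a single-container pod that selects the given RuntimeClass."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": pod_name},
        "spec": {
            "runtimeClassName": runtime_class_name,
            "containers": [{"name": pod_name, "image": image}],
        },
    }


# All names below are hypothetical placeholders.
manifests = [
    runtime_class("sandboxed", "kata"),
    pod_using("sandboxed", "demo", "example.invalid/demo:latest"),
]
print(json.dumps(manifests, indent=2))
```

The point of the design is that the pod spec only refers to the class name, so swapping the underlying runtime is a cluster-level change and workloads don't need to be rewritten.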

I'm curious as to how you get your images built. I'd love to remove the over-reliance on Docker in my team, but the sticking point seems to be building and storing images. I know that Kubernetes can run with different runtimes, and that Docker images can be OCI compliant, but it's a hard sell to ask everyone to try something different in production when they do everything else with Docker. Do you have a workflow that could help with that sort of thing?

I build my images for private & professional projects in Gitlab CI and I use the provided free registry.

The biggest downside to my set up is that I can't say I have enterprise-level security. Getting it nailed down properly requires the use of tools like TUF/Notary[0] & signed-image aware container repositories like Harbor[1] and a deployment gate mechanism like Portieris[2]. That's a lot of complexity to tack on.

Just a note -- Docker already runs containerd underneath via a shim[3]

If you're relying on docker-specific features then by all means it makes sense to continue using docker but if you're just looking for a thing to quietly run your containers (or power your kubernetes cluster), containerd should probably be that thing. It's all of the building and none of the extra stuff that docker the company is trying to do/become.

[0]: https://github.com/theupdateframework/notary

[1]: https://github.com/goharbor/harbor

[2]: https://github.com/IBM/portieris

[3]: https://groups.google.com/forum/#!topic/docker-dev/zaZFlvIx1...

Bazel can natively build images without Docker.

Bazel has many other benefits too, like creating a full dependency build graph and fully reproducible builds.



Both your solution, Bazel, and that of the neighboring comment, Gitlab, involve way too many parts you don't necessarily want or need. I'm not going to rewrite my build process (which is generally a thankless job with very low customer ROI) just to build container images :-)

Same story for Gitlab: nobody's going to migrate from Github or Bitbucket just to get container image builds.

cri-o and containerd run standard docker images seamlessly. They're mostly drop-in replacements for dockerd. So even if you're using standard dockerd at build time, that's perfectly fine.

Things like buildah, kaniko, img, and whatnot can build standard Dockerfiles fine and push to standard docker registries. I don't have much experience with them though.

I believe containerd is the default on Google Kubernetes Engine (GKE).

Didn't know this but this is great news -- last I heard it was experimental[0].

[0]: https://cloud.google.com/blog/products/containers-kubernetes...

I have been a user of LXC and LXD since 2013. Earlier, when Docker was based on LXC, I tried it. Then Docker went its own way and built its own libcontainer library. During that time the LXC project added support for unprivileged containers, and since then I haven't used Docker. Even today, when the majority of container platforms, including Kubernetes, are based on Docker containers (some on OCI), I continue to use LXD and LXC.

LXD has been more secure by default, given that it supported unprivileged containers very early and they work very nicely. Recently, when there was a security problem with Kubernetes, I was still OK because we only used unprivileged containers.

I love that LXD containers are lightweight compared to Kubernetes, and that the same Ansible setup, or other orchestration tools like Puppet and Chef, can work with bare metal, VMs and containers. There's no need to fiddle with shell scripts and Dockerfiles or to learn a container-orchestration-specific domain-specific language (DSL). Hopefully LXD gets more popular.

So far OpenStack, OpenNebula and Proxmox support native LXD containers alongside KVM and other virtualization. For most small websites with thousands or even millions of users, LXD by itself can work pretty well without relying on any cloud orchestration platform.

I too have been using LXC/LXD since before Docker but it's not the same thing.

LXC is a system container. When you want a full system, instead of a VM, you can use LXC, have ssh, give people accounts. It has the same issues as a VM or a regular server, it's easy to leave snowflakes on it, unless you're very disciplined and automate everything.

Docker is an application container; it's not for hosting a user, but just an app. It's easy to reproduce: you can share the images with others or have them rebuild the image to run the app. So they are two very different things. When Docker first came out, I was very hesitant to use it, and thought it was very stupid to have just one container for one application. But as I thought about it and played with it, it made sense and I have come around. I still use both.

You compared LXD to kubernetes, they are not comparable. k8s is a container OS, it's for orchestrating tons of containers, so you don't have to manually deploy and network them, and restart them.

How secure do you feel LXD containers are, given the defaults? No worse than any bare-metal? No worse than any Qemu/KVM VM?

I do use LXD containers too, but mainly to create a bunch of testing nodes.

OK, to answer your question: Qemu/KVM is more secure than LXD, as it runs a separate kernel for each VM. With containers there are really just two choices: use a Dockerfile with Docker-style containers, or use LXD with LXC. There are Kata containers, but they are not as user-friendly as LXD. Most other container runtimes run as the privileged root user. In the case of LXD, each container runs unprivileged, in userspace, so your security is like managing multi-user Linux. We understand the management of multi-user Linux very well compared to other esoteric schemes, so I feel LXD offers better security than Docker-style containers. This is one of the reasons most of the big cloud providers like GCP, AWS and Azure do not offer bare-metal containers. Most run them on top of their VMs, which are more secure in multi-tenant systems.

We use LXD in production and the containers are very lightweight. Running hundreds or thousands of them on a single bare-metal host is fine.

Note to anyone confused: The docker concept of '--privileged' is separate from what the LXD folks are referring to as 'privileged containers'. The LXD folks are talking about mapping UID 0 into the container, whereas (IIRC) the docker flag disables dropping capabilities and the seccomp syscall filters (and maybe some other things? I can't remember off the top of my head).

The equivalent docker functionality is userns-remap or sometimes just "user namespaces".
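A small sketch may help with the "mapping UID 0" part. The kernel describes a user namespace's mapping in `/proc/<pid>/uid_map` as lines of three numbers: the first UID inside the namespace, the first UID outside it, and the length of the range. The helper names below are made up for illustration, and the `100000`/`65536` range is just an example of a typical unprivileged mapping.

```python
def parse_uid_map(text):
    """Parse uid_map text into (inside_start, outside_start, length) tuples."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 3:
            entries.append(tuple(int(field) for field in fields))
    return entries


def to_host_uid(container_uid, entries):
    """Translate a UID inside the namespace to the host UID, or None if unmapped."""
    for inside, outside, length in entries:
        if inside <= container_uid < inside + length:
            return outside + (container_uid - inside)
    return None


def root_is_host_root(entries):
    """True when UID 0 inside maps to UID 0 outside (the LXD sense of 'privileged')."""
    return to_host_uid(0, entries) == 0


# Example mapping (hypothetical range): container root becomes host UID 100000.
unprivileged = parse_uid_map("0 100000 65536\n")
print(to_host_uid(0, unprivileged), root_is_host_root(unprivileged))
```

In the LXD sense, a container is "privileged" when root inside is also root outside, regardless of which capabilities or seccomp filters are applied on top.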

Any container not running as a normal user is considered privileged. A root container is privileged but isn't --privileged.

An important distinction. From a security point of view, running --privileged is just lazy. If you need things like kernel permissions, run as root and then request the specific kernel permissions in the deployment YAML... and if you're running something like k8s, make sure to apply a pod security spec limiting permissions and then apply the right seccomp profile.
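As a rough sketch of what "request kernel permissions" instead of --privileged can look like in a Kubernetes container spec: drop every capability, add back only the ones the workload needs, and keep the runtime's default seccomp profile. The capability `NET_ADMIN` is only an example and the function name is made up; which capabilities you need depends entirely on the workload.

```python
import json


def container_security_context(extra_capabilities):
    """Build a container-level securityContext that avoids 'privileged: true'.

    The container runs as root (UID 0) but starts with no capabilities and
    only gets back the ones listed in extra_capabilities.
    """
    return {
        "privileged": False,
        "runAsUser": 0,
        "capabilities": {"drop": ["ALL"], "add": sorted(extra_capabilities)},
        "seccompProfile": {"type": "RuntimeDefault"},
    }


# NET_ADMIN is only an illustrative choice.
print(json.dumps(container_security_context({"NET_ADMIN"}), indent=2))
```

The resulting dictionary goes under `securityContext` for the container in the pod template; anything not explicitly added stays dropped.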

I hate unnecessary abstractions. All this "docker functionality" is actually just based on namespaces and cgroups. I get what you're trying to say though.

As I was trying to understand this space coming from a 'docker user' background I was incredibly confused by the two different definitions of "privileged containers" ("what do you mean if I don't add --privileged it's still privileged?") -- so I figured others might appreciate the pointer as well.

Now that I understand what's going on better, I definitely agree that docker does not abstract the kernel APIs in a way that makes it easy to understand what's going on underneath the hood. Is that a good thing? I honestly don't know.

I'd encourage anyone interested in learning more to check out the codebase for JessFraz's contained.af

You nailed it. If you can take a hard to grasp or orchestrate concept and make it easy for the layperson to use, then you've built a linux tool

Honestly it's horses for courses.

If you want to run your workload in the most secure manner using BSD jails then go for it, but you will soon find that anyone with expertise to maintain it is hard to find. Almost all DevOps/SRE/Systems Engineers want to run their workloads (containers) in Docker.

The same goes for LXD.

Let's get one thing right - none of these systems is a virtual machine with its own hardware and kernel. All 3 systems share the host's kernel, albeit BSD has separation of workloads as a consideration within the kernel itself. The Linux kernel has cgroups as a building block, which to me almost makes it an afterthought.

I would never run a container with hostile code on the same Docker host as my mission critical workload. Hell, not even the same network (physical/vpc/overlay). However, I would give it more consideration if the same workloads were on a BSD machine with Jail separation. This was the premise of what I was employed to build in my previous job - https://github.com/tredly/tredly . A better separation of host and jail workload can be achieved through BSD jails, as separation of concerns has been taken into consideration within the kernel, instead of being built on top of the kernel.

This article is 2 years old. Linux cgroups and BSD jails have come on in leaps and bounds since. Docker may have been the "flavour of the month" back then, but it continues to be the DevOps/SRE/Systems Engineer's tool of choice. It's not a silver bullet by any means, and has many shortcomings when used on its own. This is why systems continue to be built on top of systems (tar -> apt -> docker -> k8s -> ?)

Of course all of the above assumes someone in my position as a DevOps engineer consuming resources upon either on-prem infrastructure or within the cloud. Workloads from an IaaS perspective are a completely different kettle of fish.

> This article is 2 years old.

Really? You just shot your credibility in the foot.

This was a pretty good read. I use containers quite a lot on my server at home and maintain a bunch of utilities. Mostly I'm using systemd-nspawn.

> User namespaces have been around since the 3.12 kernel, but few other container management systems use the feature to isolate their containers. Part of the reason for that is the difficulty in sharing files between containers because of the UID mapping. LXD is currently using shiftfs on Ubuntu systems to translate UIDs between containers and the parts of the host filesystem that are shared with it. Shiftfs is not upstream, however; there are hopes to add similar functionality to the new mount API before long.

This is exactly the problem I've run into. If you're trying to share files between containers or between the host and a container (e.g. with systemd-nspawn's "--bind" option), it's much harder to have your permissions set properly and still access them from in the container if you're using user namespacing.

There's also the issue of creating the container in the first place. If you follow the instructions for creating a container on the Arch Wiki[1], the files will end up owned by the host's root (mostly), which is a problem when you then try to boot the container as a different (namespaced) user. I don't know of a straightforward way to create a namespaced container with systemd-nspawn, and I don't think there's any way to convert an existing container to a namespaced container.

Yet another problem that sometimes arises is that various distributions have user namespacing disabled (it has to be enabled when the kernel is built) - notably Arch [2], though this may have recently changed. This is apparently due to concerns that the namespacing code is buggy and can itself lead to privilege escalation vulnerabilities.

[1] https://wiki.archlinux.org/index.php/Systemd-nspawn#Examples

[2] https://bugs.archlinux.org/task/36969
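For checking whether a given machine has user namespaces available at all, here is a hedged sketch that inspects a few files under `/proc`. The presence of `/proc/self/ns/user` indicates the kernel was built with user namespace support, and `user.max_user_namespaces` is an upstream sysctl on newer kernels. The `kernel.unprivileged_userns_clone` sysctl is, as far as I know, a distribution patch that only some kernels carry, so treat that key as an assumption; a missing file is reported as `None`.

```python
import os


def user_namespace_support(proc_root="/proc"):
    """Report what the running kernel exposes about user namespace support.

    proc_root is a parameter so the function can be pointed at a fake tree.
    """

    def read_int(relative_path):
        # Return the integer stored in the file, or None if absent or unreadable.
        try:
            with open(os.path.join(proc_root, relative_path)) as handle:
                return int(handle.read().strip())
        except (OSError, ValueError):
            return None

    return {
        "kernel_built_with_userns": os.path.exists(
            os.path.join(proc_root, "self", "ns", "user")
        ),
        "max_user_namespaces": read_int("sys/user/max_user_namespaces"),
        # Distribution-specific knob; None simply means this kernel lacks it.
        "unprivileged_userns_clone": read_int("sys/kernel/unprivileged_userns_clone"),
    }
```

Calling `user_namespace_support()` with the default argument inspects the running system. A `max_user_namespaces` of 0, or an `unprivileged_userns_clone` of 0, would explain why an unprivileged container refuses to start even though the kernel was built with the feature.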

While it still needs work for performance and other things, Podman already supports rootless containers using user namespaces. It's actually pretty easy to set up too, especially on Arch:


I'm a little surprised that lxd doesn't cater more to nfs rather than bind mounts etc for file sharing. I mean if you already have private/secure networking - just use a network filesystem for your... Network filesystem needs?

NFS currently cannot be used inside a user namespace.

There are patches floating around (similar to the work we did to allow FUSE) but they haven't made it upstream yet, and my understanding is that there are some tricky corner cases on NFSv3 which still need to be sorted (NFSv4 was easier due to already having uid mapping capabilities).

If the container doesn't require root, just do a `FROM image:tag` and then add a `USER` instruction in the Dockerfile.

I have used containers in production since 2002. We knew back then that containers are escapable, and nothing has changed since. Containers are for ease of management - package and configuration separation. One role per container. In absolutely no case are they meant for multi-tenant use.

Containers are fine the way they are. Just use them for the right job.

So instead of trying to improve their security and keep the startup/overhead advantage over VMs in the orders of magnitude, you simply suggest doing nothing?

FWIW, I like it that both camps are working on their shortcomings: LXD/container camp on isolation, and the VM camp on overhead (eg. firecracker).

I suggest using the right tool for the right job. There is no such thing as "shared but separate". After Spectre, Meltdown and co. everybody should understand this. It's not a limitation of current hardware/code, it's a limitation of principle.

Container breakouts typically require running a "bad container" or image that contains malicious code or that has been added after the fact. With the right security tools you can create "white lists" that only allow specific image SHAs to run and that monitor (and even enforce to block) any changes to a running container.

So there are solutions out there to solve your security issues.
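As a toy illustration of the "white list of image SHAs" idea, the check boils down to refusing any image reference that is not pinned to an approved sha256 digest. This sketch is not any particular product's API; the function names and the example registry are invented, and a real admission tool would also verify signatures and watch running containers.

```python
HEX_DIGITS = set("0123456789abcdef")


def image_digest(image_ref):
    """Return the 'sha256:<hex>' part of a digest-pinned reference, else None."""
    _, sep, digest = image_ref.partition("@")
    if not sep or not digest.startswith("sha256:"):
        return None  # tag-only references are mutable, so they never qualify
    hex_part = digest[len("sha256:"):]
    if len(hex_part) != 64 or not set(hex_part) <= HEX_DIGITS:
        return None
    return digest


def is_allowed(image_ref, allowed_digests):
    """Allow an image only if it is pinned to a digest on the allowlist."""
    digest = image_digest(image_ref)
    return digest is not None and digest in allowed_digests


# Hypothetical digest and registry, for illustration only.
approved = {"sha256:" + "ab" * 32}
print(is_allowed("registry.example.invalid/app@sha256:" + "ab" * 32, approved))
print(is_allowed("registry.example.invalid/app:latest", approved))
```

Pinning by digest matters because a tag like `latest` can be repointed at different content later, while a digest identifies one specific image.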

summary: containers have no security story without SELinux.
