(The main downside of hypervisors is that they are difficult to run on low-cost commodity cloud like Digital Ocean. You are forced to use bare-metal offerings like Hetzner, AWS, Packet, etc. The whole point of cloud is that the hardware/infrastructure is mostly abstracted away. If I have to run my own hypervisor just to ensure the container doesn't get broken out of, what's the point of even calling it cloud?)
Sorry if I'm misunderstanding, but the hypervisor in the cloud (i.e. the one running under your "instance") is there to protect the cloud provider from you. The hypervisors you run yourself (Firecracker, Kata Containers, gVisor, Nabla, etc.) are there to protect your dangerous workloads from your other workloads.
It is totally within your power to choose to run 1 cloud "instance" per workload so you don't have to worry about cross-workload contamination, but if you want to effectively multiplex your workload on constantly available machines, you're going to have to mix your workloads, and that almost certainly requires protection of some kind.
In an ideal world we might all be getting unikernel VMs, which are a bit safer to use, but in general it feels like you need to isolate these workloads somehow if you're going to run them on the same "instance". What does your perfect/preferred world look like?
The problem your parent is pointing out is that AWS instances don't support nested virtualization (and it sounds like neither does Digital Ocean), so on both of these cloud providers you can't use something like Kata Containers or any nested VMs. GCP supports nested virtualization in any VM, and AWS `.metal` instance types do as well, but those are rather expensive. It's really a shame that even with the new KVM-based EC2 hypervisor they still didn't enable nested virtualization for most instance types, otherwise we'd definitely be making heavy use of Kata Containers.
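For what it's worth, a quick way to see whether a given instance can even host nested VMs is to check for the virtualization CPU flags and /dev/kvm — a rough sketch, not specific to any provider:

```
# Does the (virtual) CPU expose hardware virtualization (VT-x/AMD-V)?
grep -E -c '(vmx|svm)' /proc/cpuinfo    # 0 means no nested virtualization here

# Is KVM usable? Needed by kata-runtime, firecracker, etc.
ls -l /dev/kvm

# If Kata is installed, it ships its own host capability check
kata-runtime kata-check
```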
I wish Oracle hadn't bought Ravello Systems and that Ravello had open sourced their binary translation technology that made nested VMs possible in EC2 without the full overhead of software virtualization. Unfortunately, there are no open source implementations of similar software that I know of. Their blog is now hosted on Oracle's site: https://blogs.oracle.com/ravello/nested-virtualization-with-...
Moreover, with LXD you can use the same deployment platform and scripts as you do for bare metal and VMs. No need to fiddle with a mix of shell scripts and a container-orchestration DSL, or to figure out how to integrate them with your own code. No zombie processes like in Docker, either.
If you don't already have a use case for Docker, Kubernetes, the image repository features, etc., then of course Kata wouldn't be useful to you over LXC.
Also, in case you're dealing with zombie processes in docker again and can't ditch docker, running a container with `docker run --init` gets you a minimal init process that reaps the zombies. Adding that flag to dockerd instead does this for all containers by default. It's insane that this isn't turned on by default since it's such a common problem. Additionally, if you want to run systemd inside the container similar to lxc, you can install the systemd-oci-hook and it will do the necessary setup for systemd to be happy.
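Concretely, something like this (a small sketch; the daemon.json location is the usual default, adjust for your distro):

```
# Per-container: run a tiny init as PID 1 to reap zombies and forward signals
docker run --init --rm my-image

# Daemon-wide default for every container: put  { "init": true }  in
# /etc/docker/daemon.json, then restart the daemon
sudo systemctl restart docker
```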
Initially, when I started using it, I had a few networking issues due to differing support in GCP, AWS, and Azure, but by now my team is proficient enough to use it even on those. Also, the constant updates Stéphane and his team put out are fantastic.
If you have time, please look into the weekly updates and play with LXD. It works great for reasonably sized container clusters, which covers most startups.
They won't be good for Google-scale deployments yet without more tooling like Kubernetes.
This works great if you have a handful of services. If you have hundreds or thousands of services, you might want to look at running on your own hardware, since the public cloud can get expensive at that point; or, conversely, you can re-evaluate whether you really need hundreds of services and opt for fewer, larger services that use heftier amounts of threads/memory (e.g. larger instance sizes). As for running your own hardware: we have cheap (e.g. $800) servers in our office with 32 threads that we haven't really stressed at all, and we can easily run thousands of unikernels on those boxes. Also, in the Bay Area you can grab a cabinet from Hurricane Electric for $400/month. Imagine what you could do with real hardware.
Also, what are you using to create the unikernels?
I think it's more likely that the hypervisor someone is running on top of their cloud provider's hypervisor exists because many people like building their own cloud providers and like to add layers of unnecessary complexity (like a second layer of hypervisors) at their employer's or investor's expense.
If someone doesn't trust their own workloads, it is vastly simpler and less expensive to run them on separate cloud instances.
Cooking up your own serverless platform with Kubernetes seems like the waste of resources I mentioned earlier, and most people who work on serverless would regard maintaining a custom isolation framework as not serverless.
I've seen it on GCP and OVH (haven't tested AWS).
For those wondering about the distinction between "system" containers and "app" containers, the difference is user namespacing -- I think we should stop using that naming distinction and instead just specify whether the containers have full user namespacing. User namespacing is the "magic sauce" for projects like Podman and the related project Buildah, which builds images without root privileges via the combination of user namespaces and FUSE.
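As a quick illustration of what that user namespacing looks like in practice (a sketch; the user name and UID ranges are just examples):

```
# Subordinate UID range that rootless Podman/Buildah map into containers
cat /etc/subuid
# alice:100000:65536

# Run a container as an ordinary user; "root" inside is an unprivileged UID outside
podman run --rm alpine id

# Inspect the UID mapping of the rootless user namespace
podman unshare cat /proc/self/uid_map
#          0       1000          1
#          1     100000      65536
```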
BTW, for those wondering if container orchestrators like Kubernetes will be shaken by this -- they very likely won't. Kubernetes has sidestepped this problem by introducing and relying on RuntimeClass, which allows you to swap out the runtime that's running underneath Kubernetes. In fact, there is already a shim being worked on.
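For reference, the RuntimeClass wiring looks roughly like this (a sketch; the `kata` handler name assumes your nodes' CRI runtime has a handler configured under that name, and the API version depends on your cluster):

```
kubectl apply -f - <<'EOF'
apiVersion: node.k8s.io/v1beta1
kind: RuntimeClass
metadata:
  name: kata
handler: kata             # must match a handler in the node's containerd/CRI-O config
---
apiVersion: v1
kind: Pod
metadata:
  name: untrusted-workload
spec:
  runtimeClassName: kata  # this pod's containers run under the kata runtime
  containers:
  - name: app
    image: nginx
EOF
```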
[EDIT] - While I'm here talking about container tech, please check out containerd. It's been my runtime of choice for a very long time, has been very well maintained, and received many of the really advanced features first (ex. untrusted runtimes). IMO it should be the default container runtime of k8s.
The biggest downside to my setup is that I can't say I have enterprise-level security. Getting it nailed down properly requires the use of tools like TUF/Notary, signed-image-aware container registries like Harbor, and a deployment gate mechanism like Portieris. That's a lot of complexity to tack on.
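If you want a lighter-weight starting point, Docker's content trust (which uses Notary underneath) can at least be flipped on per shell -- a sketch, with placeholder registry/image names:

```
# Only pull/run signed images, and sign images on push
export DOCKER_CONTENT_TRUST=1
docker pull registry.example.com/team/app:1.0   # fails if there's no valid signature
docker push registry.example.com/team/app:1.0   # prompts to set up signing keys
```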
Just a note -- Docker already runs containerd underneath via a shim
If you're relying on docker-specific features then by all means it makes sense to continue using docker but if you're just looking for a thing to quietly run your containers (or power your kubernetes cluster), containerd should probably be that thing. It's all of the building and none of the extra stuff that docker the company is trying to do/become.
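If you've never driven containerd directly, the bundled `ctr` client is enough to watch it quietly run a container (rough sketch; image and container name are arbitrary):

```
# Pull an image into containerd's content store
sudo ctr image pull docker.io/library/alpine:latest

# Run a throwaway container from it
sudo ctr run --rm -t docker.io/library/alpine:latest demo sh
```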
Bazel has many other benefits too, like creating a full dependency build graph and fully reproducible builds.
Same story for GitLab: nobody's going to migrate from GitHub or Bitbucket just to get container image builds.
Things like buildah, kaniko, img, and whatnot can build standard Dockerfiles fine and push to standard docker registries. I don't have much experience with them though.
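With buildah, for example, the flow is roughly this (a sketch; the registry name is a placeholder):

```
# Build from an ordinary Dockerfile -- no docker daemon, and it can run rootless
buildah bud -t myapp:latest .

# Push the result to a standard registry
buildah push myapp:latest docker://registry.example.com/myteam/myapp:latest
```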
LXD has been more secure by default, given that it supported unprivileged containers very early on, and that works very nicely. Recently, when there was a security problem with Kubernetes, I was still fine because we only use unprivileged containers.
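For anyone curious, this is easy to see on a stock LXD install (a sketch; the image and container name are just examples):

```
# Containers are unprivileged unless you explicitly opt in to privileged mode
lxc launch ubuntu:18.04 c1
lxc config get c1 security.privileged   # empty/false = unprivileged (the default)
# i.e. root inside c1 maps to a shifted, unprivileged UID on the host
```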
I love that LXD containers are lightweight compared to Kubernetes, and that the same Ansible (or other orchestration like Puppet or Chef) works with bare metal, VMs, and containers, with no need to fiddle with shell scripts, Dockerfiles, or a container-orchestration-specific domain specific language (DSL). Hopefully LXD gets more popular.
So far OpenStack, OpenNebula, and Proxmox support native LXD containers besides KVM and other virtualization. For most small websites with thousands or even millions of users, LXD itself can work pretty well without relying on any cloud orchestration platform.
LXC is a system container. When you want a full system instead of a VM, you can use LXC, have ssh, give people accounts. It has the same issues as a VM or a regular server: it's easy to leave snowflakes on it unless you're very disciplined and automate everything.
Docker is an application container; it's not for hosting a user, just an app. Easy to reproduce, and you can share the images with others or have them rebuild the image to run the app. So two very different things. When Docker first came out, I was very hesitant to use it and thought it was very stupid to have just one container for one application. But as I thought about it and played with it, it made sense and I have come around. I still use both.
You compared LXD to Kubernetes, but they are not comparable. k8s is a container orchestrator: it's for orchestrating tons of containers so you don't have to manually deploy, network, and restart them.
I do use LXD containers too, but mainly to create a bunch of testing nodes.
We use LXD in production and the containers are very lightweight. Running hundreds or thousands of them on a single bare-metal machine is fine.
The equivalent docker functionality is userns-remap or sometimes just "user namespaces".
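i.e. in /etc/docker/daemon.json (a sketch; "default" tells Docker to create and use the dockremap user's subordinate UID/GID ranges):

```
# /etc/docker/daemon.json:
#   { "userns-remap": "default" }
sudo systemctl restart docker

# root inside a container is now an unprivileged UID on the host
docker run --rm alpine id
```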
An important distinction. From a security point of view, running --privileged is just lazy. If you need things like kernel capabilities, run as root and then request those specific capabilities in the deployment YAML... and if you're running something like k8s, make sure to apply a pod security spec limiting permissions and apply the right seccomp profile.
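Roughly, instead of privileged: true, the pod spec requests only what it actually needs (a sketch; the image name is a placeholder, and at the time seccomp was still set via the annotation shown):

```
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: needs-net-admin
  annotations:
    seccomp.security.alpha.kubernetes.io/pod: runtime/default  # default seccomp profile
spec:
  containers:
  - name: app
    image: my-image                       # placeholder
    securityContext:
      runAsUser: 0                        # root, but...
      allowPrivilegeEscalation: false
      capabilities:
        drop: ["ALL"]
        add: ["NET_ADMIN"]                # ...with only the capability actually required
EOF
```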
Now that I understand what's going on better, I definitely agree that Docker does not abstract the kernel APIs in a way that makes it easy to understand what's going on under the hood. Is that a good thing? I honestly don't know.
I'd encourage anyone interested in learning more to check out the codebase for JessFraz's contained.af
If you want to run your workload in the most secure manner using BSD jails, then go for it, but you will soon discover that people with the expertise to maintain it are hard to find. Almost all DevOps/SRE/Systems Engineers want to run their workloads (containers) in Docker.
The same goes for LXD.
Let's get one thing right: none of these systems is a virtual machine with its own hardware and kernel. All three share the host's kernel, although BSD treats separation of workloads as a consideration within the kernel itself. The Linux kernel has cgroups as a building block, which to me makes containers feel almost like an afterthought.
I would never run a container with hostile code on the same Docker host as my mission-critical workload. Hell, not even the same network (physical/vpc/overlay). However, I would give it more consideration if the same workloads were on a BSD machine with jail separation. This was the premise of what I was employed to build in my previous job - https://github.com/tredly/tredly . A better separation of host and jail workloads can be achieved through BSD jails, as separation of concerns has been taken into consideration within the kernel itself, instead of being built on top of the kernel.
This article is 2 years old. Linux cgroups and BSD jails have come leaps and bounds since. Docker may have been the "flavour of the month" back then, but it continues to be the DevOps/SRE/Systems Engineer's tool of choice. It's not a silver bullet by any means, and has many shortcomings when used on its own. This is why systems continue to be built on top of systems (tar -> apt -> docker -> k8s -> ?).
Of course, all of the above assumes someone in my position as a DevOps engineer consuming resources either on on-prem infrastructure or in the cloud. Workloads from an IaaS perspective are a completely different kettle of fish.
Really? You just shot your credibility in the foot.
> User namespaces have been around since the 3.12 kernel, but few other container management systems use the feature to isolate their containers. Part of the reason for that is the difficulty in sharing files between containers because of the UID mapping. LXD is currently using shiftfs on Ubuntu systems to translate UIDs between containers and the parts of the host filesystem that are shared with it. Shiftfs is not upstream, however; there are hopes to add similar functionality to the new mount API before long.
This is exactly the problem I've run into. If you're trying to share files between containers or between the host and a container (e.g. with systemd-nspawn's "--bind" option), it's much harder to have your permissions set properly and still access the files from inside the container if you're using user namespacing.
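For example (a rough sketch of the symptom; the paths are made up): with user namespacing enabled, files bind-mounted from the host keep their host UIDs, which fall outside the container's mapped range and so show up as the overflow user:

```
# Boot a user-namespaced container with a host directory bind-mounted in
sudo systemd-nspawn -U -D /var/lib/machines/test --bind=/srv/shared -b

# Then, inside the container: host-owned files appear as the overflow UID
ls -ln /srv/shared    # owner shows as 65534 (nobody) and is typically not writable
```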
There's also the issue of creating the container in the first place. If you follow the instructions for creating a container on the Arch Wiki, the files will end up owned by the host's root (mostly), which is a problem when you then try to boot the container as a different (namespaced) user. I don't know of a straightforward way to create a namespaced container with systemd-nspawn, and I don't think there's any way to convert an existing container to a namespaced container.
Yet another problem that sometimes arises is that various distributions ship with user namespacing disabled (it has to be enabled when the kernel is built) - notably Arch, though this may have recently changed. This is apparently due to concerns that the namespacing code is buggy and can itself lead to privilege escalation vulnerabilities.
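You can check what your own kernel was built with and allows (a sketch; the sysctl only exists on kernels carrying the Debian-style patch):

```
# Was the kernel built with user namespace support at all?
zgrep CONFIG_USER_NS /proc/config.gz
# or: grep CONFIG_USER_NS /boot/config-$(uname -r)

# Some distro kernels additionally gate *unprivileged* user namespaces behind a sysctl
sysctl kernel.unprivileged_userns_clone   # only present on patched kernels
```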
There are patches floating around (similar to the work we did to allow FUSE) but they haven't made it upstream yet, and my understanding is that there are some tricky corner cases on NFSv3 which still need to be sorted out (NFSv4 was easier due to already having UID mapping capabilities).
Containers are fine the way they are. Just use them for the right job.
FWIW, I like that both camps are working on their shortcomings: the LXD/container camp on isolation, and the VM camp on overhead (e.g. Firecracker).
So there are solutions out there to solve your security issues.