Containers vs. Zones vs. Jails vs. VMs (jessfraz.com)
714 points by adamnemecek on Mar 29, 2017 | 235 comments

Jails are actually very similar to Linux namespaces / unshare. Much more similar than most people in this thread think.

There's one difference though:

In namespaces, you start with no isolation, from zero, and you add whatever you want — mount, PID, network, hostname, user, IPC namespaces.

In jails, you start with a reasonably secure baseline — processes, users, POSIX IPC and mounts are always isolated. But! You can isolate the filesystem root — or not (by specifying /). You can keep the host networking or restrict IP addresses or create a virtual interface. You can isolate SysV IPC (yay postgres!) — or keep the host IPC namespace, or ban IPC outright. See? The interesting parts are still flexible! Okay, not as flexible as "sharing PIDs with one jail and IPC with another", but still.

So unlike namespaces, where user isolation is done with weird UID mapping ("uid 1 in the container is uid 1000001 outside") and PID isolation I don't even know how, jails are at their core just one more column in the process table. PID, UID, and now JID (Jail ID). (The host is JID 0.) No need for weird mappings, the system just takes JID into account when answering system calls.
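The "weird mapping" on the Linux side is just range arithmetic over the /proc/&lt;pid&gt;/uid_map entries ("inside-start outside-start count"). A toy sketch of that lookup, purely illustrative, not the kernel's code:

```python
# Each uid_map entry mirrors the three columns of /proc/<pid>/uid_map:
# (inside_start, outside_start, count).
def map_uid(uid_map, inside_uid):
    for inside_start, outside_start, count in uid_map:
        if inside_start <= inside_uid < inside_start + count:
            return outside_start + (inside_uid - inside_start)
    return None  # unmapped uids show up as the overflow uid (65534)

# "uid 1 in the container is uid 1000001 outside":
example_map = [(0, 1000000, 65536)]
print(map_uid(example_map, 1))  # -> 1000001
```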

By the way, you definitely can run X11 apps in a jail :) Even with hardware accelerated graphics (just allow /dev/dri in your devfs ruleset).

P.S. one area where Linux did something years before FreeBSD is resource accounting and limits (cgroups). FreeBSD's answer is simple and pleasant to use though: https://www.freebsd.org/cgi/man.cgi?rctl
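For the curious, rctl rules are one-liners of the form subject:subject-id:resource:action=amount. A few hypothetical examples of what could live in /etc/rctl.conf (the jail names and limits here are made up):

```
# /etc/rctl.conf (hypothetical rules)
jail:www:memoryuse:deny=1g
jail:www:pcpu:deny=50
user:alice:maxproc:deny=200
```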

While I'm not sure I agree entirely with the "Complexity == Bugs" section, the main point, that containers are not first-class citizens but a (useful) combination of independent mechanisms, is spot-on. This has real repercussions: most people I've spoken to don't know these things exist. They know containers do, they have a very vague idea what containers are, but they have no fundamental understanding of the underlying concepts. (And who can blame them? Really, it was marketed that way.)

For example, pid_namespaces, and subreapers are an awesome feature¹, and are extremely handy if you have a daemon that needs to keep track of a set of child jobs that may or may not be well behaved. pid_namespaces ensure that if something bad happens to the parent, the children are terminated; they don't ignorantly continue executing after being reparented to init. Subreapers (if a parent dies, reparent the children to this process, not init) solve the problem of grandchildren getting orphaned to init if the parent dies. Both excellent features for managing subtrees of processes, which is why they're useful for containers. Just not only containers.
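On Linux, marking a process as a subreaper is a single prctl(2) call. A minimal sketch via ctypes (Linux-only; the constants come from &lt;sys/prctl.h&gt;, kernel 3.4+):

```python
import ctypes

# Constants from <sys/prctl.h> (Linux >= 3.4)
PR_SET_CHILD_SUBREAPER = 36
PR_GET_CHILD_SUBREAPER = 37

libc = ctypes.CDLL(None, use_errno=True)

def set_child_subreaper(enabled):
    # Orphaned descendants get reparented to us instead of to init.
    if libc.prctl(PR_SET_CHILD_SUBREAPER, int(enabled), 0, 0, 0) != 0:
        raise OSError(ctypes.get_errno(), "prctl(PR_SET_CHILD_SUBREAPER) failed")

def get_child_subreaper():
    flag = ctypes.c_int(0)
    if libc.prctl(PR_GET_CHILD_SUBREAPER, ctypes.byref(flag), 0, 0, 0) != 0:
        raise OSError(ctypes.get_errno(), "prctl(PR_GET_CHILD_SUBREAPER) failed")
    return bool(flag.value)

set_child_subreaper(True)  # notably, this works unprivileged
```

Unlike pid namespaces, this needs no privileges at all, which is part of why it is so handy for ordinary daemons.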

But of course developers aren't going to take advantage of syscalls they have no idea exist.

¹although I wish someone could tell me why pid_namespaces are root-only: what's the security risk of allowing unprivileged users to create pid_namespaces?

This is definitely true, but only as long as docker (or $container_runtime) remains lightweight enough that you can still use those independent parts on their own, compatibly with docker. The risk is that docker grows in complexity such that it creates new dependencies between these independent parts and therefore handicaps their power when used individually.

As an example, it's easy to create network namespaces and add routing rules, interfaces, packet forwarding logic, etc all by using `ip netns exec`. But there is no easy way to launch a docker container into an existing netns. You need to use docker's own network tooling or build your own network driver, which may be more complex than what you need. This strikes me as a code smell in docker.

> This is definitely true, but only as long as docker (or $container_runtime) remains lightweight enough that you can still use those independent parts on their own, compatibly with docker.

As someone who has exclusively used LXC containers (which Docker is/was initially built on), none of this applies to me.

Your issue is with Docker, the implementation, not containers as a concept.

Sometimes I feel HN needs to get its head out of Docker's butt and see that there's a world out here too. How many people here even know there are other container types at all? I'm often inclined to think none.

No really. Just once try real, raw containers without all that docker wrapping bloat. How everything works and is tied together is just clear as day and obvious at once. It's all simple and super refreshing.

The comment you replied to is clearly talking about Docker, and not containers in general. So I don't think the snark was warranted.

Docker is to containers what OAuth 2.0 is to cryptography: a roll-your-own solution with sprawling complexity.

Whereas jails/zones/VMs have complexity that is mutualized, Docker has the advantage of being more flexible, which comes at the price of potentially introducing more escape scenarios.

As a result, as in cryptography, Docker is kind of a roll-your-own crypto solution, secured by obfuscation, which can become your own poison if you don't have a lot of knowledge of the topic.

From this article you can derive two conclusions:

- docker is good for a big business with enough knowledge to devote a specialized team to the topic, because FEATURES

- jails/zones are better suited to securing a small business

Wasn't OAuth 1 more of a roll-your-own crypto solution? OAuth 2 was created precisely to reuse more of what browsers already provide.

Not according to one of its main architects.


I think the biggest problem is most namespace functionality is root-only.

If I could create pid namespaces for my user-space apps, then every program I write forever would, as its first step, launch into a pid namespace.

You can do that by creating an unprivileged user namespace. To be fair, this does break some things, but this is the key feature that makes rootless containers (in runC) possible.

Is it safe to turn that on in current kernels?

Well, I can't guarantee there are no kernel bugs in user namespaces, but the work that Eric and others have done to make user namespaces more secure does make me personally confident about running machines that have CONFIG_USERNS=y.

If you have SELinux, AppArmor, SMACK, Yama, or even good seccomp filters set up then I would classify it as "relatively" secure (most of the security issues in user namespaces have revolved around POSIX ACLs providing access where it doesn't make sense -- supplementing those ACLs with something like SELinux will eliminate entire classes of bugs).

Ultimately though, security is relative. Is a kernel with CONFIG_USERNS=n more secure than one with it enabled? Yes (because it has less code running), but that doesn't mean that CONFIG_USERNS=y is insecure (it depends on where your paranoia level is dialed).

I don't know if this addresses your question but...

Check out 'runc'. It is the tool Docker uses to start a Docker container. In that program, there is a '-u' option to start a container as whatever user ID (not username) you choose. Meaning you can start a container as a non-root user, although I don't know if that bubbles up as an option in the Docker public API, or can be set in a Dockerfile.

I'm not sure if it's what GP was talking about exactly, but there is a USER instruction in Dockerfiles that lets you specify the user to run as in the final image. Many Dockerfiles 'adduser' and then set USER to the newly created user.
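A minimal sketch of that pattern (the base image, user name, and uid here are arbitrary):

```
FROM alpine:3.5
RUN adduser -D -u 1000 app
USER app
CMD ["id"]
```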

(going from memory here, I believe this is accurate...)

When attaching to a running Foo container w/ runc, the default user is root. That's true even if the Foo container was started w/ the Foo user. Can a Dockerfile specify the default user when someone attaches to the running container?

Is it possible to create pid_namespaces for unprivileged users by wrapping pid_namespace creation in a suid shell script that takes care of loading everything as the current unprivileged user?

If you enable user namespaces as well, then you don't need any of that. For example:

  [mrunal@local rootfs]$ id
  uid=1000(mrunal) gid=1000(mrunal) groups=1000(mrunal),10(wheel) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
  [mrunal@local rootfs]$ unshare -m -u -n -i -p -f --mount-proc -r sh
  sh-4.4# ps -ef
  UID        PID  PPID  C STIME TTY          TIME CMD
  root         1     0  0 09:42 pts/12   00:00:00 sh
  root         2     1  0 09:42 pts/12   00:00:00 ps -ef

Shell scripts can't be suid. But a binary wrapper could work.

You can create a pid namespace if you also create a user namespace at the same time.

Ignorance admission time: I still have no idea what problem containers are supposed to solve. I understand VMs. I understand chroot. I understand SELinux. Hell, I even understand monads a little bit. But I have no idea what containers do or why I should care. And I've tried.

Containers are just advanced chroots. They do the same with the network interface, process list and your local user list as chroot does with your filesystem. In addition, containers often throttle the running application's consumption of CPU, memory, block I/O and network I/O to provide some QoS for other colocated applications on the same machine.

It is the spot between chroot and VM: it looks like a VM from the inside, provides some degree of resource-usage QoS, and does not require you to run a full operating system the way a VM does.
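The throttling part is cgroups under the hood; Docker's --cpus flag, for example, boils down to CFS bandwidth values, quota = cpus * period. A sketch of just that arithmetic (not Docker's actual code):

```python
def cfs_quota_us(cpus, period_us=100_000):
    # A group limited to `cpus` CPUs may run for cpus * period
    # microseconds of CPU time within each scheduling period.
    return int(cpus * period_us)

# "--cpus 1.5" corresponds roughly to
# cpu.cfs_period_us=100000, cpu.cfs_quota_us=150000
print(cfs_quota_us(1.5))  # -> 150000
```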

Another concept that is now often automatically connected to containers is the distribution mechanism that Docker brought. While provisioning is an orthogonal topic to runtime, it is nice that these two operational topics are solved at the same time in a convenient way.

rkt did some nice work to allow you to choose the runtime isolation level while sticking to the same provisioning mechanism:


Unfortunately, containers provide about the same security as chroots too. Nothing even close to a true virtual machine, and at not much lower cost.

Chroot does not provide security, just a restricted view of the file system. Containers can provide pretty OK security, but fail against kernel exploits. VMs provide better security, but also fail against VM exploits (of which there are quite regularly some).

Actually, many of the VM exploits are related to QEMU device emulation or paravirtualization drivers, which are closed by the use of Xen stubdoms. Only very few were privilege escalations via another vector, in both KVM and Xen. I have no idea about other hypervisors.

And in turn, most QEMU vulnerabilities are closed by SELinux, if your distribution enables it. Libvirt (and thus the virt-manager GUI) automatically confines each QEMU process so that it can only access the resources for that particular VM.

Seccomp, apparmor, and namespacing (especially user!) do add a lot more security than plain old chroots, but still not at the level of a VM.

But couldn't containers have been designed that way? One thing I have in mind is one of Windows 10's recent features, which consists in running certain applications using the same hardware-level memory protection mechanisms as VMs, so that the application is safe from the OS/kernel and the OS/kernel is safe from the application (I can't find the exact name of this new feature, unfortunately).

Containers can't be designed that way as long as the primitives to build them that way (which are mostly part of the Linux kernel) are missing. That's a core part of the article. Containers aren't an entity by themselves, they're a clever and useful combination of an existing set of capabilities.

It is like that... but in Zones or Jails, not in the Linux "container toolkit".

No, the Windows 10 feature he's talking about uses Hyper-V internally. It's called, unsurprisingly, Hyper-V containers: https://docs.microsoft.com/en-us/virtualization/windowsconta...

Actually I found it. It's called Windows 10 Virtual Secure mode


(or Windows 10 isolated user mode, which seems kind of similar)


Oh yeah that's another use of Hyper-V, somewhat similar to ARM TrustZone. It's used to implement "Credential Guard".

You can design all you like, but implementation takes work.

Seccomp only landed for Docker in about 1.12

in Linux...

as the article shows, this is not the point for Zones and Jails.

Is there any fundamental difference between containers and shared-kernel virtualization (OpenVZ) that I am missing?

OpenVZ was an early container implementation that required patches to the Linux kernel that never made it into mainline. Parallels acquired the company behind OpenVZ, Virtuozzo, and then worked to mainline much of the functionality into what are now Linux namespaces.

Oh really? I didn't know that namespaces in Linux are the OpenVZ changes. I thought they were a completely new implementation, mostly driven by Google?

They aren't, but they share some similarities. OpenVZ can be considered an inspiration for LXC (which was mostly implemented by Red Hat, not Google).

Correction: the LXC project was mainly Daniel Lezcano and Serge Hallyn from IBM, with some cgroup support from Google. Canonical then hired Serge Hallyn and Stephane Graber around 2009 to continue work on LXC, which they have kept developing to this day. Docker was based on LXC in 2013.

Very helpful. Thanks.

I'm with you, but I've found a single use case that I'm running with, and potentially a second that I'm becoming sold on. So far, the most useful thing for me is being able to take a small application I've written and package it as a container, in a manner where I know it will run identically on multiple remote machines that I won't have proximity to manage should something go wrong. I can also make a Big Red Button to blow the whole thing away and redownload the container if need be, since I was (correctly) forced to externalize storage and database. I can also push application updates just by having a second Big Red Button marked "update" which performs a docker pull and redeploy. So now, what was a small, single-purpose Rails app can be pushed to a dozen or so remote Mac minis with a very simple GUI to orchestrate docker commands, and less-than-tech-savvy field workers can manage this app pretty simply.

I'm also becoming more sold on the Kubernetes model, which relies on containers. Build your small service, let the system scale it for you. I don't have as much hands-on here yet, but so far it seems pretty great.

Neither of those are the same problems that VMs or chroot are solving, as I see it, but a completely different problem that gets much less press.

Everyone says containers help resource utilization, but I think their killer raison d'être is that they are a common static-binary packaging mechanism. I can ship Java, Go, Python, or whatever, and the download-and-run mechanism is all abstracted away.

Does this mean we're admitting defeat with shared libraries and we're going back to static libraries again?

Disk space is cheap. And we've got multi CPU core servers.

So now we have the issue that lots of applications are running on the same server: how do we make sure the right version of some shared lib is on there, and that we won't break another program by updating it?

Containers solve that. No more worrying if that java 8 upgrade will break some old application.

So now every application stack is a static application.

It isn't just about disk space, though. Dynamic linking also allows you to quickly apply API-compatible vulnerability fixes without a rebuild of your application.

This isn't a virtue. Containers solve problems in automated continuous-deployment environments where rebuilding and deploying your fleet of cattle is one click away. In the best case, no single container is alive for more than O(hours). Static linking solves way more operational problems than the loss of dynamic linking introduces, security or otherwise.

> This isn't a virtue. Containers solve problems in automated continuous-deployment environments where rebuilding and deploying your fleet of cattle is one click away.

This has literally zero to do with containers and everything to do with an automated deployment pipeline.

As a quick FYI: Those are not unique to containers.

> rebuilding and deploying your fleet

...this applies only to software developed and run internally, which is a small fraction of all the software running in the world.

I agree that moving towards static linking, on balance, seems like a reasonable tradeoff at this point, but it is hardly as cut and dried as a lot of people seem to think.

As one very minor point, it turns vulnerability tracking into an accounting exercise, which sounds like a good idea until you take a look at the dexterity with which most engineering firms manage their AWS accounts. (Sure, just get better at it and it won't be a problem. That advice works with everything else, right?)

One's choice of deployment tools may slap a bandaid on some things, but that is not the same thing as solving a problem; that is automated bandaid application.

And odd pronouncements, like that no container should be long-lived, are... odd. I guess if all you do is serve CRUD queries with them, that's probably OK.

As a final point, I feel like the container advocates are selling an engineer's view of how ops should work. As with most things, neutral to good ideas end up wrapped up with a lot of rookie mistakes, not to mention typical engineer arrogance[1]. Just the same thing you get anywhere amateurs lecture the pros, but the current hype train surrounding docker is enough to let it actually cause problems[2].

My takeaway is still the same as it was when the noise started. Docker has the potential to be a nice bundling of Linux capabilities as an evolution of a very old idea that solves some real problems in some situations, and I look forward to it growing up. In the mean time, I'm bored with this engineering fad; can we get on with the next one already?

[1] One very simple example, because I know someone will ask: Kubernetes logging is a stupid mess that doesn't play well with... well, anything. And to be fair, ops engineers are no better with the arrogance.

[2] Problems like there being not even a single clearly production-ready host platform out of the box. Centos? Not yet. Ubuntu? Best of the bunch, but still hacky and buggy. CoreOS? I thought one of the points was a unified platform for dev and prod.

Linking with static libraries takes more time (the poor programmer has to wait longer on average while it links); also, when something crashes, you can see from the backtrace, or from ldd, which version of Foo is involved.

Mostly, yes. Notice that Go and Rust (two of the newer languages popular at least on HN) also feature static compilation by default. Turns out that shared libraries are awesome, until the libraries can't provide a consistently backwards compatible ABI.

Go has no versioning; in Rust everything is version 0.1... then one day you update that serialization library from 0.1.4 to 0.1.5 and all hell breaks loose, because you didn't notice they changed their data format internally, and now your new process can't communicate with the old ones, and your integration test missed that because it ran all tests with the new version on your machine. This makes you adopt the policy "only rebuild and ship the full stack", and there you are, scp'ing 1 GB of binaries to your server because libleftpad just got updated.

Except outside of JavaScript nobody on earth makes a libleftpad. And whose binaries are 1 GB?

In D it is part of the standard library. ;)


> Except outside of Javascript nobody on earth makes libleftpad

Static libraries can't be replaced/updated post-deployment; you need to rebuild. Shared libraries in a container can be, which is useful if you're working with dependencies that are updated regularly (in a non-breaking fashion) or with proprietary binary blobs.

> Static libraries can't be replaced/updated post-deployment

And that's great news. Immutable deployment artifacts let us reason about our systems much more coherently.

No, they prevent an entire class of reasoning from needing to take place. It is still possible to reason coherently in the face of mutable systems, and people still "reason" incoherently about immutable ones.

Is rebuilding and redeploying a container really any different from rebuilding and redeploying statically linked binaries?

For a lot of applications: no, it's very similar, and if you have a language that can be easily statically compiled to a binary which is free of external dependencies and independently testable, and you've set up a build-test-deployment pipeline relying on that, then perhaps in your case containers are a solution in search of a problem :-)

But there are more benefits like Jessie touches upon in her blog post, wrt flexibility and patterns you can use with multiple containers sharing some namespaces, etc. And from the perspective of languages that do not compile to a native binary the containers offer a uniform way to package and deploy an application.

When I was at QuizUp and we decided to switch our deployment units to Docker containers, we had been deploying using custom-baked VMs (AMIs). When we first started doing that it was due to our immutable-infrastructure philosophy, but soon it became a relied-upon and necessary abstraction for homogeneously deploying services whether they were written in Python, Java, Scala, Go, or C++.

Using Docker containers allowed us to keep that level of abstraction while reducing overhead significantly, and because the containers are easy to start and run anywhere, we became more infrastructure-agnostic at the same time.

Not everyone has container source code - or it might be impractical. If you run RabbitMQ in your container would you want to build that from source as part of your build process?

"Container source code" is usually something like "run pkg-manager install rabbitmq" though.

It would be nice to have a third option when building binaries: some kind of tar/jar/zip archive with all the dependencies inside. It would give the pros of static and shared libraries without everything else containers imply. The OS could then be smart enough to only load identical libraries once.

That's equivalent to static linking, but with extra runtime overhead. You can already efficiently ship updates to binaries with something like bsdiff or Courgette, so the only reason to bundle shared libraries in an archive is for LGPL license compliance, or for poorly-thought-out code that wants to dlopen() itself.

Upgrading a library that has been statically linked isn't as nice as with a shared lib; plus, AFAIK, the OS doesn't reuse memory for static libs.

A container image is a tarball of the dependencies.

Yes, but containers also provide more stuff that I might not want to deal with.

The OS can be smart enough to load identical libraries once, but it requires them to be the same file. This can be achieved with Docker image layers, sharing the same base layer between images. It could also be achieved with a content-addressable store that deduplicates files across different images, which would be helped by a container packaging system that used the same files across images.
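A toy model of such a content-addressable store, where identical files across images collapse to a single blob keyed by digest (per-file here; Docker dedups per-layer):

```python
import hashlib

class ContentStore:
    def __init__(self):
        self.blobs = {}   # sha256 digest -> file contents
        self.images = {}  # image name -> {path: digest}

    def add_file(self, image, path, data):
        digest = hashlib.sha256(data).hexdigest()
        self.blobs.setdefault(digest, data)            # stored only once
        self.images.setdefault(image, {})[path] = digest

store = ContentStore()
store.add_file("app-a", "/lib/libc.so.6", b"identical library bytes")
store.add_file("app-b", "/lib/libc.so.6", b"identical library bytes")
print(len(store.blobs))  # -> 1: both images reference the same blob
```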

Page-cache sharing can also depend on the storage driver: overlayfs supports it and btrfs does not.

That's basically what OS X does with bundles.

jars already support this.

yes, I think we should have the same capabilities in a language agnostic way.

Signed jars are a little painful to use (you can't easily bundle them), but that's a minor issue.

This is why something like 25% of containers in the docker registry ship with known vulnerabilities.

Or you could have done the same thing years earlier with AMIs?

But AMIs are full VM images as opposed to container images, aren't they?

Most Docker images also contain a full OS.

Yes, everyone overlooks this and talks about how Docker containers are "app-only images" or something. They're not app-only. They may be using a thin OS like Alpine, but there's still a completely independent userspace layer. The only thing imported from the host is the kernel. If you made VM images in the same way, they'd also be 200 MB.

The benefit of "containers" is that you don't need to siphon off a dedicated section of RAM to the application.

I'm very new to containers, but I think I'm starting to get the hype a bit. Recently I was working on a couple of personal projects, and for one I wanted a Postgres server, and for the other PhantomJS so that I could do some webscraping. Since I try to keep my projects self-contained I try to avoid installing software onto my Mac. So my usual workflow would be to use Vagrant (sometimes with Ansible) to configure a VM. I do this infrequently enough that I can never remember the syntax, and there's a relatively long feedback loop when trying to debug install commands, permissions etc. I gave Docker a try out of frustration, but was simply delighted when I discovered that I could just download and start Postgres in a self-contained way. And reset it or remove it trivially. I know there's a lot more to containers than this, but it was an eye-opener for me.

You can do this with Vagrant already. Before Vagrant, people distributed slimmed-down VM images for import and execution. Why is this ascribed as a unique benefit of containers?

Yeah this fits my experience exactly. I suppose I use docker a lot like a package manager (easy to install software and when I remove something I know it will be cleaned up).

Nearly every time I install actual software on my mac (beyond editors & a few other things) I feel like I end up tripping over it later when I find half my work wants version N and the other wants version M

Am also a huge newcomer to this.

Yeah, I think a lot of it is better resource utilization compared to VMs. At the same time, though, I don't think containers are the thing, but just a thing that paves the way for something very powerful: datacenter-level operating systems.

In 2010, Zaharia et al. presented [1], which basically made the argument that increasing scale of deployments and variety of distributed applications means that we need better deployment primitives than just at the machine level. On the topic of virtualization, it observed:

> The largest datacenter operators, including Google, Microsoft, and Yahoo!, do not appear to use virtualization due to concerns about overhead. However, as virtualization overhead goes down, it is natural to ask whether virtualization could simplify scheduling.

But what they didn't know was that Google has been using containers for a long time. [2] They're deployed with Borg, an internal cluster scheduler (probably better known as the predecessor to the open-source Kubernetes), which essentially serves exactly as an operating system for datacenters that Zaharia et al. described. When you think about it that way, a container is better thought of not as a thinner VM, but as a thicker process.

> Because well-designed containers and container images are scoped to a single application, managing containers means managing applications rather than machines.

In the open-source world, we now have projects like Kubernetes and Mesos. They're not mature enough yet, but they're on the way.

[1] https://cs.stanford.edu/~matei/papers/2011/hotcloud_datacent...

[2] http://queue.acm.org/detail.cfm?id=2898444

The big missing "virtualization" technology is the Apache/CGI model. You essentially upload individual script-language (or compiled-on-the-spot) functions that are then executed on the server directly in the context of the host process.

This exploits the fact that one webserver only differs from another in the contents of its response method, and other differences are actually unwanted. You can make this a lot more efficient by simply having everything except the contents of the response method be shared between different customers.

This meant that the Apache mod_x modules (famously mod_php and mod_perl) could manage websites on behalf of large numbers of customers on extremely limited hardware.

It does provide for a challenging security environment. That can be improved when starting from scratch though.

I think the modern equivalent of what you are describing is basically the AWS Lambda model of "serverless" applications. In the open source world, there are projects like Funktion[1] and IronFunctions[2] for Kubernetes

[1] https://github.com/funktionio/funktion

[2] https://github.com/iron-io/functions

I get that that's what they're saying, but it just isn't. Functions are just a way to start containers on an as-needed basis, then shut them down when not needed.

Mod_php is 3 syscalls and a function call, and can be less if the cache is warm. Despite the claims on that page, there is no comparison in performance.

"Extremely efficient use of resources"

It is utterly baffling that one would use those words to describe spinning up either a container or a VM to run these lines of code (their example), and nothing else:

  p := &Person{Name: "World"}
  fmt.Printf("Hello %v!", p.Name)
The number of syscalls it needs to switch into this code... I don't know. I'd say between 1e5 and 1e8 or so. It probably needs to start bash (as in, exec() bash) a number of times, probably in the three digits.

So I guess my issue is that functions use $massive_ton_of_resources (obviously the lines of code printed above need their own private linker loaded in memory, don't you agree? It's not even used for statically linked binaries, but it's there anyway. Running the init scripts of a Linux system from scratch... yep... I can see how that's completely necessary), but when they're not called for long enough, that goes to 0, at the cost of needing $even_more_massive_fuckton_of_resources the next time they're called.

Of course, for Amazon this is great. They're not paying for it, and taking a nice margin (apparently about 80%, according to some articles) when other people do pay for it.

And the really sick portion is that if you look at how you're supposed to develop these functions, what does one do ? Well you have this binary running "around" your app, that constantly checks if you've changed the source code. If you have, it kills your app (erasing any internal state it has, so it needs to tolerate that), and then restarts the app for the next request. Euhm ... what was the criticism of mod_perl/mod_php again ? Yes, that it did exactly that.

A container needs 10-100 syscalls, depending how much isolation you want. A single unshare() and exec gets you some benefit. You are out by orders of magnitude.

And then of course the system inside the container needs to start up, configure, run init scripts, ... Did you count that in those 100 syscalls ?

Take the example here: https://github.com/kstaken/dockerfile-examples/blob/master/n...

Which does something a lot of these functions will do: get nodejs, use it to run a function. Just the apt-get update: on my machine those instructions alone, ignoring actually running the function (because it's insignificant), make close to 1e6 syscalls.

Lightweight application containers do not run init or anything like that! They're just chroots but with isolated networking, PIDs, UIDs, whatever.

For example, on my FreeBSD boxes, I have runit services that are basically this:

exec jail -c path='/j/postgres' … command='/usr/local/bin/postgres'

Pretty much the same as directly running /usr/local/bin/postgres except the `jail` program will chroot and set a jail ID in the process table before exec()'ing postgres. No init scripts, no shells, nothing.
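The same thing can be written declaratively in /etc/jail.conf; a sketch (only the path and command come from the example above, the other settings are assumed):

```
postgres {
    path = "/j/postgres";
    persist;
    exec.start = "/usr/local/bin/postgres";
}
```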

I don't understand the criticism. FreeBSD jail is more like chroot than like a container. A container, as I understand it, runs its own userland. Otherwise, you can't really isolate programs in it. If that postgres was compiled with a libc different from the one on the host system, or, say, required a few libraries that aren't on the host system, would it run?

Does it have its own filesystem that can migrate along with the program? Does it have its own IP that can stay the same if it's on another machine?

Even a basic chroot runs its own userland! "Userland" is just files.

In my example, /j/postgres is that filesystem that can migrate anywhere. (What's actually started is /j/postgres/usr/local/bin/postgres.) Yeah, you can just specify the IP address when starting it.

You're correct. Containers do contain their own userlands, a fact many gloss over. PgSQL will have to load its containerized version of all libraries instead of using any shared libraries linked by the outside system.

This is often done via a super thin distribution like Alpine Linux to keep image size down, despite the COW functionality touted by Docker that's supposed to make it cheap to share layers.
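As an illustration of that (image tag and package choices here are assumptions, not from the thread), an Alpine-based image bundles a complete musl/busybox userland in a few megabytes:

```dockerfile
# Illustrative only: the image carries its own libc and libraries,
# so nothing is resolved against the host's userland.
FROM alpine:3.5
RUN apk add --no-cache postgresql
USER postgres
CMD ["postgres", "-D", "/var/lib/postgresql/data"]
```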

The difference is that unlike a fully virtualized system, the container does not have to execute a full boot/init process; it executes only the process you request from within the container's image. Of course, one could request a process that starts many subservient services within the container, though that is typically considered bad form.

What people really want is super cheap VMs, but they're fooling themselves into believing they want containers, and pretending that containers are a magic bullet with no tradeoffs. It's scary times.

What system? Your link just starts a nodejs binary, no init process. And you also don't seem to realise that a Docker image is built only once? Executing apt happens when building the image (and is then cached in case a rebuild happens later), not when starting it.

These steps are only run for initial creation of the container image. Running the container itself is only the last step from that file: Executing the node binary.

I am not quite sure what it is that you want. It seems obvious to me that containers should have more overhead than CGI scripts; it also provides a better isolation story. I mean, you already said it:

> [the Apache/CGI model] does provide for a challenging security environment. That can be improved when starting from scratch though.

And the number of lines of code in the example probably doesn't quite matter so much, because that's all it is: an example. I am sure that you can run more lines of code than that.

> Euhm ... what was the criticism of mod_perl/mod_php again? Yes, that it did exactly that.

I mean, that's also basically my point, that Lambda is basically the CGI of the container world. Lambda and CGI scripts really do seem like they are basically the same thing; I still speculate that they will be used to fill similar use cases. I am not really opining on which one is actually better.

You can share resources between VMs (frontswap etc. and deduplication, using network file systems like V9FS instead of partitions) but it complicates security.

It is still safer than containers, as a single kernel local-root bug breaks out of a container but not out of a VM. The access to hardware support also allows compartmentalized drivers and hardware.

I will show you some use cases:

- have different versions of libs/apps on the same OS (or run different OS's)

- tinker with the linux kernel, etc. without breaking your box (remember the 90's?)

- building immutable images packed with dependencies, ready for deploy

- testing distributed software without VMs (because containers are faster to run)

- if you have a big box (say 64gb, eight core or whateva) or multiple big boxes, you can manage the box resources through containerization, which can be useful if you need to run different software. Say every team builds a container image, then you can deploy any image, do HA, load balancing, etc. Ofc this use case is highly debatable

These comments are helpful. Thanks. Sounds like for a given piece of hardware you might be able to fit 2 or 3 VMs on it, or a lot more containers. But without the security barriers of VMs.

That being the case, why not just use the OS? And processes and shared libraries?

The article touches on the technical details of this briefly, but the underlying point here is that containers effectively do use the OS, and processes. Like Frazelle says in the article: "a 'container' is just a term people use to describe a combination of Linux namespaces and cgroups." If that's nonsense to you, check out some of her talks, they treat those topics in a friendly way. At the most basic level, though, a container is just a process (or process tree) running in an isolated context.

Sharing library code between processes running in containers is more complicated, since it depends on whether and how you've set up filesystem isolation for those processes, but it's possible to do.

The isolation means that you don't have to worry about containers interfering with each other. It is more about separating and hiding processes than about protecting from hostile attacks.

The other big advantage is containers provide a way to distribute and run applications with all their dependencies except for the kernel. This means not having to worry about incompatible libraries or installing the right software on each machine.

It can be easier to run a jail (or container) and assign it an IP and run standard applications with standard configs than to run a second instance of something in a weird directory listening in a special way.

The other big difference between this and a VM is that timekeeping just works.

You're not necessarily restricted to friendly-only tenants, either. Depending on how you configure it, there can be pretty good isolation between the inside and the outside and the other insides. You lose a layer of isolation, but it's not impossible to escape a virtual machine either.

> That being the case, why not just use the OS? And processes and shared libraries?

That's essentially what a Linux container is: a process (that can fork) with its own shlib. If you have lots of processes that don't need to be isolated from each other and can share the same shlib, then no, you don't need this mechanism.

Okay, so it's a nice self-contained packaging mechanism that obviates dependency hell. Sounds a bit like a giant lexical closure that wraps a whole process. And from which escape is somewhat difficult. Makes sense.

> the same shlib, then no, you don't need this mechanism

And if you want to use a modern development tool chain you don't really have this choice. They produce statically linked binaries that need, at minimum, their own process and TCP port (if you run a proxy, which, when you think about it, is pretty wasteful).

There is no good reason for that (other than ease of tool-chain development), and it has probably cost hundreds of millions or even billions of dollars in servers and power, but there you go.

PHP and Java are essentially the only languages with good support for running without containers, and Java isn't even used that way usually.

I think it's important to make the distinction that containers do provide a level of security isolation, but that in most cases it's not as much protection as is provided by VM isolation.

There are companies doing multi-tenant Container setups, with untrusted customers, so it's not an unknown concept for sure.

What I'd say is that the attack surface is much larger than a VM hypervisor's, so there's likely more risk of a container breakout than a VM one.

> There are companies doing multi-tenant Container setups, with untrusted customers, so it's not an unknown concept for sure.

I'm a little shocked to hear this (given everything everybody else has said about container security), but I guess it means the security of containers can be tweaked to be good enough in this environment.


How do you make a docker container secure? Run it in a bsd jail :p. But I'm sure that people with the right expertise can do this. For the rest of us Docker is mainly a packaging mechanism which helps alleviate accidents and makes deployment a little more predictable.

I don't understand Docker to be honest. It was a big pain to have unexplainable race conditions when I tried to use it for production apps.

Ended with a spectacular data loss, of my own company's financial data. Luckily I had 7-day old SQL exports.

In my experience it does two things that VMs don't do as well:

1. More efficient use of hardware (including spin-up time)

2. Better mechanisms for tying together and sharing resources across boundaries.

But in the end they don't really do anything you couldn't do with a VM. It's just that people realized that VMs are overkill for many use cases.

They make shared folders and individual files a lot easier than VMs, also process monitoring from the "host".

Very much not worth the cost of reduced security and reliability. You also have vastly more complicated failover due to no easy migration.

Increase server utilization by packing multiple non-hostile tenants on it, quickly create test environments, have a volatile env. You can have all of those with VMs although at much higher CPU, RAM usage cost.

With one big limitation: they must all run the same os kernel (so you cannot run say a Windows or FreeBSD container on a Linux host).

In fact, nobody guarantees that say Fedora will run on an Ubuntu-built kernel. Or even on a kernel from a different version of Fedora. So, IMO, anything other than running the exact same OS on host and in container is a hack.

> In fact, nobody guarantees that say Fedora will run on an Ubuntu-built kernel.

"nobody guarantees" just means that you can't externalize the work of trying it and seeing if it works. I don't think that's a huge loss, considering the space of all possible kernels, configuration switches, patches and distro packages is huge.

It's like refusing to use a hammer because nobody can assure you that hammer A was thoroughly tested with nail type B.

No, its like using a nailgun A with nails Y when it's only guaranteed to work with nails X. Or like using a chainsaw A with chain Y when it says you should use X. But hey, at least you are not trying to use nails on a chainsaw... ;-)

No. It's like shooting yourself in the face because your friend survived it.

> nobody guarantees

As long as the ABI is stable and you don't reach out to something that would have moved within /{proc,sys,whatevs}, you're good. [0]

[0]: https://en.wikipedia.org/wiki/Linux_kernel_interfaces

Measure the "much higher" before deciding. Especially after you apply solutions to reduce the memory and disk cost.

I'd say the "much higher" is nowadays a relic of the past.

Same with me. This plays right into the complexity issue.

Even if you understand them, you have to understand the specific configuration (unlike VMs, where you have a very limited set of configurable options, and the isolation guarantees are pretty much clear).

Eliminates the redundancy of maintaining an OS across more than 1 service.

They're VM's but much more efficient and start faster. There's a clever but shockingly naive build system involved. That's pretty much it.

Going beyond this you get orchestration - which you can certainly do with VM's but it's slow; and various hangovers from SOA rebadged and called microservices.

But they're really, really efficient compared to VM's.

> They're VM's

They are definitely not VMs.

> But they're really, really efficient compared to VM's.

I think that the virtualisation CPU overhead is below 1%. Layered file systems are possible with virtual machines as well so disk space usage could be comparable.

What do you mean that they are "really, really efficient" ?

5-10%, realistically, with some very informal testing. Not really a particularly big deal.

Really really efficient relates to how many containers can be run on a given system vs VM's. About 10x as many.

As a lowly user¹: linux containers are more like gaffer tape around namespaces and cgroups than something like lego. You want real memory usage in your cgroup? let's mount some fuse filesystem: https://github.com/lxc/lxcfs - https://www.cvedetails.com/vulnerability-list/vendor_id-4781...

We have to gaffer tape with AppArmor and SELinux to fix all the holes the kernel doesn't care about: https://github.com/lxc/lxc/blob/master/config/apparmor/conta...

Solaris Zones are more deliberately designed, and an evolution of FreeBSD Jails. Okay, the military likely paid for that: https://blogs.oracle.com/darren/entry/overview_of_solaris_ke...

Maybe it's Deathstar vs. Lego. But I assume you can survive a lot longer in a Deathstar in vacuum than in your Lego spaceship hardened by gaffa tape.

1: I have the utmost respect for anyone working on this stuff. No offense, but as a user, the occasional lack of design and implementation of bigger concepts (not as in more code, but better design, more secure) in the Linux world is sad. It's probably the only way to move forward, but you could read on @grsecurity's Twitter years ago that this idea was going to be a fun ride full of security bugs. There might be a better way?

I really wish this post went into more detail. It feels too high level to be useful.

I ran into the memory issue recently. In DC/OS when you use the marathon scheduler, if you go above the allocated memory limit, the scheduler kills your task and restarts it.

The trouble is, if you ran top inside your container, and you're running on a DC/OS node with 32GB of memory, top reports all 32GB of memory. So interpreters that do garbage collection (like Java) will just continue to eat memory if you don't specify the right limits/parameters. The OS will even let it allocate past the container limit, but just kill the container afterwards.

Now the container limit is available somewhere under the cgroup filesystem, but now interpreters need to check to see if they're running in a container and adjust everything accordingly.
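A sketch of the kind of check an interpreter (or your own startup code) has to do, probing the conventional cgroup paths; the path layout varies by distro and cgroup version, so treat this as an assumption-laden illustration:

```python
def container_memory_limit():
    """Return the cgroup memory limit in bytes, or None if unlimited/unknown."""
    candidates = [
        "/sys/fs/cgroup/memory.max",                    # cgroup v2
        "/sys/fs/cgroup/memory/memory.limit_in_bytes",  # cgroup v1
    ]
    for path in candidates:
        try:
            with open(path) as f:
                raw = f.read().strip()
        except OSError:
            continue
        if raw == "max":        # cgroup v2 spelling of "no limit"
            return None
        limit = int(raw)
        if limit >= 2**62:      # v1 reports a huge number when unlimited
            return None
        return limit
    return None

print(container_memory_limit())
```

Outside any memory cgroup limit this prints None; inside a limited container it prints the byte count top never shows you.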

Of course you could always tell your scheduler not to hard kill something when it goes over a memory limit, which is why we never ran into that when we were running things on CoreOS since we didn't configure hard limits per container.

This is an interesting observation. It seems the simple fix would be tricking the containerized application into reading the "total system memory" from a syscall hooked by the container runtime to return the configured memory limit of the container making the syscall. I'm surprised this is not already done; is there some inherent limit that prohibits this?

It seems an unintended consequence of containerization is that the responsibility for garbage collection effectively moves from the containerized application (e.g. the interpreter) to the container runtime, which "collects garbage" by terminating containers at their memory limit, just like a process-level garbage collector terminates (or prunes) functions or data structures at their memory limit.

I'm not sure this is a bad thing. Moving garbage collection up one level in the "stack" of abstraction seems in line with the idea that containers are the building blocks of a "data center operating system."

Naturally then, shouldn't garbage collection happen at the level of the container runtime? Otherwise you're wasting compute cycles by collecting garbage at two levels.

When garbage collection moves to the container runtime, it should mean that the application no longer has to worry about garbage collection, since the container runtime will terminate the container when it reaches its memory limit. Therefore, the application (e.g. the java interpreter) only needs to make sure it can handle frequent restarts. In practice this means coding stateless applications with fast startup times.

Applications like the java interpreter were designed in an era dominated by long running, stateful processes. Now we are seeing a move to stateless applications with fast boot times (i.e. "serverless" shudder). Stateless applications are a prerequisite to turning the data center into an "operating system" because they essentially take the role of function calls in a traditional operating system. Both containers and function calls are the "building blocks" of their respective levels of abstraction. In a traditional OS, you wouldn't expect a single function call to run forever and do its own garbage collection, so why would you expect the same from a containerized application in a datacenter OS?

> When garbage collection moves to the container runtime, it should mean that the application no longer has to worry about garbage collection, since the container runtime will terminate the container when it reaches its memory limit.

I can't be the only one horrified at that statement.

Whatever happened to writing software that doesn't use infinite RAM + infinite CPU? Why rely on the container / OS to just kill your misbehaving process and restart it?

My background is embedded 8-bit RTOS. We cared about the resources we used.

> top reports all 32GB of memory.

Same applies to CPU and number of cores. A common pitfall, for example, for ElasticSearch, which bases its default threadpools on the number of visible CPUs. The isolation layer is indeed very thin and leaky in all places.

But with CPU cores, it's not a hard limit. At least in Marathon, you can ask for 0.5 CPUs and Marathon will use that to decide where your container gets scheduled (so if it has 16 cores, all scheduled containers should add up to 16; which is why you should use fractions of a core to start and then monitor to see real CPU usage), but the container gets all of the CPUs.

I don't know if that's specific to our Marathon/DCOS setup or if that's a limitation of the underlying Docker daemon. Not sure how K8s or Nomad handle CPU/memory limits either.

Nomad's exec & java drivers handle CPU and memory through cgroups (via libcontainer). In particular CPU uses CPU shares[0] so all cores/CPUs are treated as a single pool of shares. In the future we plan on adding CPU pinning and might add cores as a unit of CPU resources.

The idea of constraining CPU resources by core is especially important on ARM servers which don't always report their clockspeed. Packet.net for example has 96 core ARM boxes, so users would probably prefer to set how many cores they need instead of MHz.

[0] https://access.redhat.com/documentation/en-US/Red_Hat_Enterp...

Is there a common kernel patch dealing with this problem yet? I know it's possible, but Google is light on links and it's a problem I have right now.

Your Java container must have a limit too, regardless of where it is running. You must be in control of the application.

Even if that's true, applications that try to automatically tune themselves to be able to use all available resources sound valid.

Yeah, Linux security features work like: throw ... against a wall and see what sticks. I find it amusing when people say: "We should write a new kernel" and their only proposed security feature is using that memory safe language (TM)... they'd have my attention if they said "We should write a new kernel and design all the permissions/isolations/resource limits from the ground up".

I.e. an enterprise operating system.

Yes, and as long as your life isn't threatened and you live in a world full of other people, problems, opportunities, the lego ship with gaffa tape is way more useful.

It feels like Ms Frazelle's essay ends abruptly. I was looking forward to the other use cases of non-Linux containers.

I think most people are considering these OS-level virtualization systems for the same or very similar use cases: familiar, scalable, performant and maintainable general-purpose computing. Linux containers win because Linux won. Linux didn't have to be designed for OS virt. People have been patient as long as they've continued to see progress -- and be able to rely on hardware virt. Containers are a great example of where even with all of the diverse stakeholders of Linux, the community continues to be adaptive and create a better and better system at a consistent pace in and around the kernel.

That my $job - 2, Joyent, re-booted Lx-branded zones to make Linux applications run on illumos (a descendant of OpenSolaris) is more than a "can't beat them, join them" strategy, as it allows their Triton (OSS) users full access not only to Linux APIs and toolchains, but to the Docker APIs and image ecosystem, and has been an environment for their own continued participation in microservices evolution.

Although Joyent adds an additional flavor, it targets the same scalable, performant and maintainable cloud/IaaS/PaaS-ish use case. In hindsight, it's crazy that I worked at three companies in a row in this space, Piston Cloud, Joyent, Apcera, and each time I didn't think I'd be competing against my former company, but each time the business models as a result of the ecosystems shifted. Thankfully with $job I'm now a consumer of all of the awesome innovations in this space.

I think an interesting bit here is that e.g. Solaris first had Zones (i.e. "Containers"), while virtualisation was added later (sun4v), while the story is exactly the other way around for Linux.

I also felt that way. It is like the beginning of an awesome blog post. Maybe she'll continue later on after thinking more about it.

It's probably a good time to stop using "containers" to mean LXC, considering the new OCI runc spec supports containers on Solaris using Zones and on Windows using Hyper-V.


I don't think anyone in the container dev community thinks that containers means LXC only. Even back in 2013, docker's front end api was designed to support other runtimes such as VMs and chroot. Perhaps this is a marketing story gone awry?

Afaik even the docker noob tutorial already points out that it is not just LXC only.

Docker hasn't supported LXC as a backend for at least a year.

You mean they replaced it? May also be what I've read. Unsure.

They've been using libcontainer and then moved to runC (which is effectively a standard-compliant wrapper around libcontainer) since before 1.0. LXC was only used in the early history of Docker, and it was pretty bad to be quite honest (it's better now but there's no chance Docker will switch back).

It's probably a good idea time to stop thinking that anyone cares what Solaris calls anything.

I think it's important to realize that the reduced isolation of containers can also have pretty significant upsides.

For example monitoring the host and all running containers and all future containers only means running one extra (privileged) container on each host. I don't need to modify the host itself, or any of the other containers, and no matter who builds the containers my monitoring will always work the same.

The same goes for logging. Mainly there is an agreed-upon standard that containers should just log to stdout/stderr, which makes it very flexible to process the logs however you want on the host. But also if your application uses a log file somewhere inside the container, I can start another container (often called "sidecar") with my tools that can have access to that file and pipe it into my logging infrastructure.

If I want, multiple containers can share the same network namespace. So I listen on "localhost:8080" in one container, and connect to "localhost:8080" in another, and that just works without any overhead. I can share socket files just the same.
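A sketch of that in Docker Compose terms (service names and images are made up; network_mode: "service:<name>" is the option that joins another service's network namespace):

```yaml
# "sidecar" shares "app"'s network namespace, so they see the same
# localhost and can talk over 127.0.0.1 with no virtual network hop.
services:
  app:
    image: example/app:latest      # listens on localhost:8080
  sidecar:
    image: example/proxy:latest
    network_mode: "service:app"    # join app's network namespace
```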

I can run one (privileged) container on each host that starts more containers and bootstraps f.e. a whole kubernetes cluster with many more components.

You can save yourself much "infrastructure" stuff with containers, because the host provides them or they are done conceptually different. For example ntp, ssh, cron, syslog, monitoring, configuration management, security updates, dhcp/dns, network access to internal or external services like package repositories.

My main point is that by embracing what containers are and using that to your advantage, you gain much more than by just viewing them as lightweight virtualisation with lower overhead and a nicer image distribution.

Edit: I want to add that not all of that is necessarily exclusive to containers or mandatory. For example throwing away the whole VM and booting a new one for rolling updates is done a lot, but with containers it became a very integral and universally accepted standard workflow and way of thinking, and you will get looked at funny if you DON'T do it that way.

The meme image ("Can't have 0days or bugs... if I don't write any code") is incorrect.

You can't have bugs if you don't have any code, but not writing code just means that your bugs are guaranteed to be someone else's bugs. Now, this may be a good thing -- other people's code has probably been reviewed more closely than yours, for one thing -- but using other people's code doesn't make you invulnerable, and other people's code often doesn't necessarily match your precise requirements.

If you have a choice between writing 10 lines of code or reusing 100,000 lines of someone else's code, unless you're a truly awful coder you'll end up with fewer bugs if you take the "10 lines of code" option.

There's probably no good way to pick up this context from the article, but the meaning of that particular meme is that the caption is supposed to be a shortsighted analysis. See http://knowyourmeme.com/memes/roll-safe , which lists examples like "You can't be broke if you don't check your bank account" or "If you're already late.. Take your time.. You can't be late twice."

> If you have a choice between writing 10 lines of code or reusing 100,000 lines of someone else's code, unless you're a truly awful coder you'll end up with fewer bugs if you take the "10 lines of code" option.

I disagree, this is only true if you understand why the other code has 100k lines [Although this example is a bit extreme].

A good example that could send a junior developer astray is date handling. Or most likely date mishandling if they are coding it themselves.

Sure. I'm talking about the case where the 100k lines of code provides a large set of features you're not intending to use.

These container and container-like solutions are not 10 lines of code; no implementation will be 10 lines. Therefore solutions which have had time to stabilize will be better, since 10 lines of code isn't even a valid solution. New code causes new issues and increased complexity; that's the only point to be made by the meme.

Nobody mentioned unikernels yet? It's a bit unrelated to the containers discussion in this thread, but I thought I'd mention it anyway. They let you create an operating system image, which only includes the code you need. Nothing more, nothing less. This improves security, because the attack surface is reduced.

It makes a lot of sense to me when I think about how cloud computing works. Most of the time an operating system container, zone, jail, VM... is booted just to run a select number of processes. There is absolutely no need for a general-purpose system. I think unikernels could really shine in this area.

MirageOS is a project that lets you create unikernels. It's written in OCaml, so it's interesting in more than one way. MirageOS images mostly run on Xen, by the way.

[1] https://en.wikipedia.org/wiki/Unikernel

[2] https://mirage.io/

So first step is containers as a service. And then this container can run as a unikernel as an implementation detail. Did I understand it correctly?

I asked a similar question yesterday. [0] The problem is that containers share the kernel of the host OS, so you cannot host a unikernel without some kind of hardware virtualization, since the unikernel is obviously a different kernel from that of the host OS. However you can run qemu inside docker if you want to inherit its sandboxing and namespace configurations. The problem comes when you have to isolate resources, like network devices, at both the namespace level on the host OS, and the virtualization level inside qemu.

Intel's Clear Container project tries to solve this problem, but it's still limited by some virtualization overhead because qemu requires a tap device, which then connects to eth0 in a netns, which is one half of a veth pair with the host. So you end up creating 3 or 4 virtual Ethernet links just to route packets down to the guest.

[0] https://news.ycombinator.com/item?id=13976125

Yeah, it seems to me that there's a sort of duality between a server OS + containers vs. a hypervisor OS + unikernels. They're both attempting to minimize the overhead of process isolation and deployment flexibility.

Meh, what's the big difference between providing PID1 and the kernel? You don't have nor want direct hardware access (bus, MMU), so what would be the principal advantage?

It's a sad reflection on a technical community when, 3 years later, many still do not seem to clearly understand the bare basics of how containers work. HN has been complicit in massively hyping containers without a corresponding understanding of how containers work outside the context of Docker.

How many container users understand namespaces and how easy it is to launch a process in its own namespace, both as root and non root users? Or know overlay file systems and how they work. Or linux basics like bind mounts, and networking.
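On the overlay point, the mechanism under a Docker image layer is a single mount: a read-only lower directory plus a writable upper one (a sketch; needs root and overlayfs support):

```shell
# lowerdir is the read-only "image"; writes land in upperdir (the
# "container layer"); workdir is overlayfs scratch space.
mkdir -p lower upper work merged
echo "from the image" > lower/base.txt
sudo mount -t overlay overlay \
    -o lowerdir=lower,upperdir=upper,workdir=work merged
echo "container write" > merged/new.txt   # lands in upper/, lower/ untouched
```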

The Docker team leveraged LXC to grow, from its tooling to container images, but didn't shy from rubbishing it and misleading users about what it is. LXC was presented as 'some low level kernel layer' when it has always been a front-end manager for containers, like Docker; the only difference is that LXC launches a process manager in the container and Docker doesn't. Just clearly articulating this in the beginning would have led to a much better understanding of containers, and of Docker itself, among users and the wider community.

How many docker users know the authors of aufs and overlayfs? The hype is so intense around the front end tools that few know or care to know the underlying tools. This has led to a complete lack of understanding of how things work and an unhealthy ecosystem as critical back end tools do not get funding and recognition, with the focus solely on front ends as they 'wrap' projects, make things more complex and build walls to justify their value. Launch 5000 nodes and 500000 containers. How many users need this?

And this complexity has a huge cost and technical debt, when you are scaling as many stories here itself report and when you are trying to figure out the ecosystem so much so that its now at risk of putting people off containers.

A stateless PAAS has never been the general use case; it's a single use case pushed as a generic solution because that's Docker's origin as a PAAS provider. The whole problem with scaling, for the vast majority, is managing state. Running stateless containers or instances does not even begin to solve that in any remote way. Yes, it sounds good to launch 5000 stateless instances, but how is it useful? Without state, scaling has never been a problem. A few bash scripts, which is what Dockerfiles are, will do it. But now, because of hype around Docker and Kubernetes, users must deal with needless complexity around basic process management, networking and storage, and re-architect their stack to make it stateless, without any tools to manage state. Congratulations on becoming a PAAS provider.

A couple of observations from someone not-so-familiar with containers:

If the consensus is that containers for the most part are just a way to ship and manage packages along with their dependencies to ease library and host OS dependencies, I'm missing a discussion about container runtimes themselves being a dependency. For example, Docker has a quarterly release cadence I believe. So when your goal was to become independent of OS and library versions, you're now dependent on Docker versions, aren't you? If your goal as IT manager is to reduce long-term maintenance cost and have the result of an internally developed project run on Docker without having to do a deep dive into the project long after the project has been completed, then you may find yourself still not being able to run older Docker images because the host OS/kernel and Docker has evolved since the project was completed. If that's the case, the dependency isolation that Docker provides might prove insufficient for this use case.

Another point: if your goal is to leverage the Docker ecosystem to ultimately save ops costs, managing Docker image landscapes with, e.g., Kubernetes (or to a lesser degree Mesos) might prove very costly after all, since these setups can turn out to be extremely complex, absolutely require expert container knowledge across your ops staff, and are evolving quickly at the same time.

Another problem and weak point of Docker might be identity management for internally used apps; e.g., containers don't isolate Unix/Linux user/group IDs and permissions, but they do take away resolution mechanisms like (in the simplest case) /etc/passwd and /etc/group, or PAM/LDAP. Hence you routinely need complex replacements for them, adding to the previous point.

As a sysadmin, I just want to point out to this mostly-dev crowd that my current favorite method of operations is to have multiple compartmentalized VMs, which then may or may not hold containers or jails.

Why do I do it this way? Because having a full-stack VM for each use case on a good server is realistically not that much more resource-hungry than a container, but the benefits are noticeable.

Much of the core reason stems from security concerns. For example, there are quite a few Microsoft Small Business Server-styled Linux attempts at hitting the business space, but instead of playing to the strengths of modern hardware, they mostly throw every service on the same OS just like SBS does... which is a major weakness. So instead of an AD server that also does DNS and DHCP and so on, each thing in my environments gets its own separate VM (e.g., Samba4 by itself, BIND by itself, ISC Kea by itself, and so on).

Another reason for this is log parsing. It's much easier: when the BIND VM's OSSEC logs go full alert, I know exactly what to fix. On multi-container systems, a single failure or compromise can end up affecting many containers and convoluting the problem/solution process.

Of course, the main weakness of such a system is that any illicit attempt to break out of the VM space could compromise many systems. But that's why you harden the VMs and have good logging in the first place, do the same to the host system, and use distributed separation of hosts and good backups.

Just some real-world usage from a sysadmin I wanted to convey. I will still do a container, or a VM with many containers, for the devs if needed, but when it comes time to deploy to prod, I tend to use a full-stack VM. I'm also open to talking about weaknesses in this system, as I'd be curious to hear what devs think.

To be fair, I still haven't fully caught up with the whole devops movement either, so perhaps I'm behind.

Also, a big shoutout to Proxmox for a virtual environment system, FOSS and production-quality since 4.0. I have also run BSD systems with jails in a similar way. The key point of the article is that zones/jails/VMs are top-level isolations and containers are not (but that doesn't make containers bad!)

Have you considered a solution like Graylog for centralized logging with both the VMs and containers? That's what my company tends to do with clients, and we frequently choose containers and multiple VMs together, running something like docker stack. As long as you follow proper security practices on the container itself, such as not running the application as root, I don't see any downside to running things in containers over VMs, as long as you have proper logging. Combined with something like Ansible or another config management tool, you can automate this whole process, and it works really well for us.

Yeah, I use Graylog and the ELK stack and Ansible already; there is still an isolation level with full VMs that you don't get with containers (namespace sharing issues, kernel-level system call attack issues, etc.). Of course VMs have the same issues with the host OS, but there is one more layer for an attacker to penetrate, and as we know, security is all about layers.

On that note, a FOSS ansible tower alternative popped up on my radar recently that looks interesting.


Thanks for sharing! Currently only using jenkins for our ansible deployments but this looks like a great WIP for an eventual replacement.

+1 for proxmox. I use it to run VMs and LXC containers at home for my toy projects.

A little bit off topic, but I've been following Jess for a while, and I think that developers like her are great. In my country it's hard to find a happy developer, and she seems to enjoy everything she does. That's why I follow her: because of her great work and great personality. I'm happy to see one of her blog posts here on HN.

In this post the author links to one of her previous posts[0], where she wrote:

> As a proof of concept of unprivileged containers without cgroups I made binctr. Which spawned a mailing list thread for implementing this in runc/libcontainer. Aleksa Sarai has started on a few patches and this might actually be a reality pretty soon!

Does anybody know if this made it into runc/libcontainer? I'm not an expert on these technologies but would love to read through docs if it has been implemented.

[0] https://blog.jessfraz.com/post/getting-towards-real-sandbox-...

I had to read the post twice before I really got what she was saying. I think the distinction I would make is that while there are many more use cases that you can apply to Containers that may not apply to Jails, Zones, or VMs the most common use case of "run an app inside a pre-built environment" applies to all of them. Since I believe most users (or potential users) of Containers are only looking at that use case, it's harder to see the differences between the different technologies.

My only hope is that anyone in a position of making a decision on which technology to use can at least explain at a high level the difference between a Container and a VM.

I'm not an OS person, so forgive me if this is a stupid question: Lots of people are excited about Intel SGX and similar things. Are there any interesting ways people are thinking about combining, like, Docker containers with SGX enclaves and such? One could imagine (e.g.) using remote attestation to verify an entire container image.

Yeah, it's called SCONE: https://www.usenix.org/conference/osdi16/technical-sessions/... It's pretty kludgey due to SGX limitations like not supporting fork().

It's definitely not a stupid question. I think you can't do it very well because a contained process is just a process on the host system, albeit with a few things changed (like what network devices it sees), and it's pretty hard to make an enclave as big as an entire OS process.

However, see VMware's Overshadow paper for a pretty clever system that lets you run a verified process inside an untrusted kernel instide a trusted hypervisor: https://labs.vmware.com/academic/publications/overshadow You might be able to do something similar.

It doesn't matter how many distinctions you make between these things (first-class, last-class, second-class, poor-class, etc.). These kinds of discussions are always relative.

All is good as long as your decision is conscious of the compromises taken by each approach and what they entail (what other security mechanisms do you have at your disposal? how could they enhance your app? will your solution depend on external tools like Ansible/Puppet/etc.? do you actually need "containers" or jails or [insert your favorite trendy tech here]?).

Running a *BSD or a Linux is a far bigger design decision than what kind of isolation mechanisms you have, as many of the underlying parts are diverging.

As a novice, this was a great informative read. More posts like this on the Internet, please!

I'm trying to understand something. At my last work we had a big problem with "works for me". We started using Vagrant and all those problems disappeared. Then Docker became popular and all of a sudden people wanted to use that instead.

But is Docker really suitable for this? While each Vagrant instance is exactly the same, Docker runs on the host system. It feels like it would be prone to all sorts of dissimilarities.

It depends on what you want to keep constant between hosts; for many (dare I say most?) projects, docker will provide a sufficiently identical environment that it "just works", because the filesystem as seen by the same image on two different computers will be exactly the same. This is sufficient for many, many projects, and is the primary source of "works for me" problems, in my experience.

However, if your application requires that things like the CPU feature set or the amount of memory available to the process be the same, then no, docker will not give you this level of "sameness". I have had bugs that manifested on one docker host and not another because one host had certain x86 extensions available (AVX, or something similar) and the other didn't, and this caused a certain codepath to be followed. However, this is probably extremely rare for non-performance-targeted code, as most projects will compile for a conservative subset of x86_64 instructions so that they can run everywhere.

Anecdotally, I find learning the "Docker way" has transformed my development paradigm in a much more fundamental way than Vagrant ever did, mostly because docker containers are almost instantaneous to start, which enables their use for a whole range of applications that I would never have even considered for something like Vagrant. Because I can launch a docker container in ~300ms, I can use containers like native applications. I do everything from building software to compiling latex documents using docker just so that I don't have to worry about installing toolchains or configuring things just right; I get an image that does what I want, and then I invoke the docker containers as if I were invoking the actual tools themselves.
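That "containers as native applications" pattern might look something like the sketch below. The image name `texlive/texlive` and the mount paths are illustrative, not the commenter's actual setup:

```shell
# Hypothetical wrapper: invoke pdflatex through a container as if it were a
# local tool. The container sees the current directory mounted at /work.
pdflatex() {
  docker run --rm -v "$PWD":/work -w /work texlive/texlive pdflatex "$@"
}

pdflatex paper.tex   # runs in a throwaway container; no local TeX install needed
```

Because the container starts in well under a second, calling the wrapper feels like calling the tool itself, while the whole toolchain lives in the image.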

Docker is just a tool, but it's a darned powerful one, and it's pretty fun to see how it continues to evolve right now. I highly suggest you check it out.

In my experience, Vagrant only solves 50% of the problem. In complex setups it's not really reproducible enough. Docker, so far, has always done what I asked of it, so I'm confident that it solves at least 75% of the problem. I can suggest fully moving to Docker.

SmartOS runs containers in zones: you get the best of both worlds.

I always run my containers in a jail which I run in a zone which is running in a VM just in case.

I appreciate the humor but having an ability to run containers in a zone gives you very good isolation without overhead of a VM.

And never connect it to the internet, ever. Island is best land.

All behind NAT to make it secure

"container is not a real thing". But what could we say about real things within software field?

In earlier versions of Proxmox, the OpenVZ VMs were called containers and the KVM VMs were called VMs. So it is pretty confusing overall.

For myself, I would point out that Zones, Jails, OpenVZ, and LXC, even KVM, all pretend that they are fully separate from the host node OS.

Docker et al. do not pretend this; in fact, if you are running Apache on your host system and try to run a Dockerized web server on port 80, the Docker container might refuse to start. The other methods mentioned can't even determine what they are running under.

> pretend that they are fully separate from the host node OS

No! You can see all processes of OpenVZ containers and jails from the host system.

>in fact if you are running Apache on your host system and try to run a Dockerized web server on port 80 the Docker container might refuse to start

That depends on the network options.

Sorry for not being clear ... OpenVZ, LXC, and Zones (not sure about Jails) all have their own startup options, and if they don't start "init" as the first process inside, they have a method that fakes it. You can install random new packages of software and run them from inside the "VM"; from what I know, you can't do this from within a container.

FWIW, using KVM, all you can see from the host node (unless you use debugging tools) is the PID and related process info of the KVM instance (one process per VM you have started).

You can easily run top inside a Xen VM via xl console, or via SSH, and manage it with a stack like Salt or Ansible.

This is a super weak argument.

Hm, that isn't what I am trying to say at all. In drawing a distinction, some of the virtualization tech acts like a full-blown OS in and of itself, while other tech does not.

>- from what I know, you can't do this from within a container.

You most certainly can.

We wrote a paper comparing containers and VMs for the middleware conference: http://people.cs.umass.edu/~prateeks/papers/a1-sharma.pdf

I don't understand why anyone would say to give up containers and just use zones or VMs. Containers are solving a very real problem. The problem is that containers weren't as well marketed before Docker (or as user-friendly).

Containers were incredibly popular in the hosting industry for a long time (https://en.wikipedia.org/wiki/Virtuozzo_(company)), back when SWsoft ran their Virtuozzo hosting platform. Docker marketed the container idea to a different audience.

Many people's use case for containers is more or less identical to that of Zones/VMs -- they aren't sharing anything across containers.

The majority of use cases for containers seem to be as more of a deployment and/or scheduling solution and less of an overlap with the use cases for virtualization.

Because the vast majority of use cases are already solved by zones/VMs.

I have been learning about containers fairly recently. What are these security vulnerabilities that the post talks about? I haven't come across any docs that mention security yet.

It's mainly an attack surface issue. A process running in a Linux container is just a process running in a regular OS with some extra bells and whistles for resource constraints and isolation. So when it comes to making a kernel call, it's a call directly to the same kernel all of the other containers are calling.

This means the entire kernel/userspace API is the attack surface for a malicious container. Compare that to a VM where the attack surface is the API the hypervisor exposes to a virtual machine.

It's not that the former is necessarily smaller; it's just that the modus operandi in systems administration has always been that if a person executes malicious code as a user on the OS, you'd better wipe the system, because kernel vulnerabilities aren't treated with the same severity as hypervisor vulnerabilities.

This is the reason your containers are actually executed in a dedicated VM if you use something like GCE.

Earlier implementations had bugs like this http://blog.bofh.it/debian/id_413

Is it possible to design a process that isn't a self-contained OS instance depending on a lot of horseshit overlay controls to perform a task in a well designed way that still allows privilege separation from the host OS in a manageable way?

Of course not. It's the basis for all this other shite.

I think Jess is on the money here. The complexity in Linux containers vs. zones shows up in two ways:

1) The Linux kernel container primitives are implemented in more complicated ways. For example, in zones, PID separation is implemented by just checking the zone_id: if the zone_id is different, then processes can't access each other. This also means that in zones, PIDs are unique and you can't have two processes from two different zones with the same PID (with an exception: I believe they may have hacked something in to handle pid 1 on Linux).

Similarly, in zones there is no user mapping: if you are root inside the zone, you are also root outside the zone. The files you create inside a zone are uid 0 inside and also uid 0 outside the zone.

If you look at how device permission is handled in Linux, we have cgroups controlling which devices can be accessed and created, while Solaris zones use the existing Role-Based Access Control and device visibility. So inside a zone you can either have permission to create all devices (very bad for security) or create no devices; access to devices is mediated by whatever devices the administrator has created in your zone.

In zones there is no mount namespace; instead there is something very similar to chroot: just a vnode in your proc struct above which you are restricted from going. Zones have mostly been implemented by adding a zone_id to the process struct and leveraging features that already existed in Solaris (I guess the big exception would be the network virtualization), while in Linux there are all these complicated namespace things.

This complexity means there are probably going to be more bugs in the Linux kernel implementation. However, because you don't have as much fine-grained control, zones can also give you security bugs in your deployment. For example, I found an issue in Joyent's version of Docker where you could trick the global zone into creating device files in your zone, and these could be used to compromise the system. Under a default LXC container this would not be possible, because cgroups would prevent you from accessing the device even if you could trick someone else into creating it. You also have to be careful in zones with child zones getting access to files inside the parent zone: if you ever leak a filesystem fd or hard link into the child zone from the parent zone, then all bets are off, because the child is able to write into the parent zone as root. (I believe this situation was covered in a zones paper describing the risk of a non-privileged user in the global zone collaborating with a root user in a child zone to escalate privileges on the system.)

2) Because all the pieces are separate in Linux, something has to put them together and make sure they are assembled correctly. I wouldn't trust sysadmins to do this on their own, and luckily there are projects like LXC/LXD/Docker etc. that assemble these pieces in a secure way.

Yep, exactly. Since zones were inspired by jails, FreeBSD works in pretty much the same way. For device permissions, though, we have rule support directly in devfs: https://github.com/freebsd/freebsd/blob/master/etc/defaults/... There's a default ruleset for jails that makes sense (it allows log, null, zero, crypto, random, stdin/out/err, fd, and tty stuff).

By the way, I couldn't find this anywhere on the internet — is there a simple way to just run something in a zone on illumos, without the installer stuff? Like on FreeBSD you can just do this:

jail -c path=/my/chroot/path command=/bin/sh

and you have a jailed shell. What's the Solaris/illumos equivalent of this?

I found this article to be humorous and informative.


This is almost always the harbinger of lies, deception, propaganda, or lack of nuance.

Before I even heard of containers, I found out about jails and wondered if you could serve each user's jail using an nginx config file in their home directory.

"Lego" it's "Lego"


emphasis added

Some points are just wrong. Containers and jails have many design similarities which the author dismissed. Notably PIDs: both containers and jails are nearly identical in that regard; you can kind of have one leg here and another there, although that is harder to achieve with FreeBSD jails, and both implementations leave PIDs visible from the host system. Networking: jails can run on top of a non-virtualized IP/net device, and containers can run in such modes as well. The link is someone's rant without technical details.

I think you missed the point of the article completely.

Which is:

Containers are not actually a _thing_. BSD Jail is a _thing_, Zones are a _thing_... Linux containers are just a particular configuration of _multiple things_.

PIDs in containers CAN be like PIDs in BSD Jails... if that is what you want. It's up to you to use what Linux primitives you want in your containers.

For example:

I can run an application in a 'Linux container' that shares the PID, user, and network namespaces with the main OS, where the only thing that is different is that the filesystem is namespaced. I can run cgroups without namespaces. I can run namespaces without cgroups.

Now if you want to talk about _Docker containers_ then, yes, that is a _thing_, but it's just one of many different possible ways to have Linux containers.
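The à-la-carte nature described above is visible from the shell with util-linux's unshare(1): each flag opts into exactly one namespace, and everything you don't ask for stays shared with the host. This sketch assumes unprivileged user namespaces are enabled on the kernel:

```shell
# Create ONLY a new user namespace; PIDs, mounts, and networking remain
# shared with the host. --map-root-user maps your uid to uid 0 inside.
unshare --user --map-root-user id -u
# reports uid 0 inside the namespace, while you remain an ordinary user outside
```

This is also the UID mapping the parent comment mentions ("uid 1 in the container is uid 1000001 outside") in its simplest form: a single-entry map from your own uid to root.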

No. From the article: 'Solaris Zones, BSD Jails, and VMs are first class concepts.' It just happens that jail as a name, jail as a command-line tool, and jail as a system call bear the same name. Nothing stops one from implementing superfancyjails on top of that system call. Same story with Linux containers: we have clone(), unshare(), and setns(), and a couple of popular implementations on top of them. Thus, let's say, a 'man systemd-nspawn' container is as much a _thing_ as a 'man 8 jail' jail is.

You're splitting hairs to explain something that doesn't matter. The article stands well on its own two feet without nitpicking the similarities as you have done.

Do you have a good overview of how to use pieces of FreeBSD jails without using the whole thing? I've interacted (under duress) with FreeBSD jails in production, but I definitely found it a lot easier to learn about Linux containers / namespaces / cgroups.

sudo jail -c path=/ command=/bin/sh

to get a shell in the least isolated jail possible. It's that simple. Read the "Jail Parameters" section in `man jail` to see what you can add to this, e.g.:

sudo jail -c path=/ ip4.addr= command=/bin/sh

to isolate the IP address…

'man jail_attach' is somewhat instructive if you wish to do things like one fork in one jail and another in another. I have to admit you can do a similar thing with Linux containers just by spawning 'nsenter' with various parameters.

So Jails support things like allowing the same process to be visible in multiple jails, or sharing a root filesystem, or sharing a network interface?

Jails (on FreeBSD) each get their own IP address(es) which may be associated with the same network interface.

With ZFS, one can use a common "template" filesystem for jails such that updates to the userland or the ports tree only need to be applied to the base file system once and become visible in all jails (as far as I understand ZFS, at least).

To my knowledge, it is not possible to have a process be visible in several jails at once. Each process has a jail ID associated with it, and it is visible only inside the corresponding jail (and the host system, of course).

FreeBSD jails can share an IP with the host system. Also, multiple FreeBSD jails can share the same IP from the host system. Jails are IP-level isolated, in contrast with Linux namespace containers, which do interface-level isolation.

Jails can get their own interfaces too (VNET/VIMAGE). This functionality has been buggy in the past, but in 11 it's ready to go.

Shared file systems, sure; sharing the TCP/IP stack, sure; the same process in multiple jails, no (except that processes are visible both in a jail and in its parent jail; jails can be nested).

Ok thanks. Can you recommend a higher quality article for people new to containers?

Yes. Perhaps you can start with the historical material at http://linux-vserver.org/Paper — that happened long before current Linux namespaces and influenced them in many respects. As for the modern implementation, see https://lwn.net/Articles/531114/ and onward.

This article helped me clear up a few things about container internals while I was preparing a Docker introduction presentation. It contains a few C programs to give a better idea than just theory: https://blog.selectel.com/containerization-mechanisms-namesp...

Yeah. I feel like the author could really have used a little more time in the industry, really maybe gotten to know about the subject matter before writing a blog post. If only she had spent some time working at a container company, or maybe studied up to work at one of the big tech firms. What a missed opportunity to be more technical.

I'm going to be captain obvious here and let people know that the post I'm replying to is being extremely sarcastic. The author of the article is extremely well known in the Linux container community.

It's really strange to reply sarcastically to someone pointing out the author made some fundamental mistakes.

An appeal to authority in the form of a sarcastic reply really adds nothing to the discussion.

Adding to what discussion, exactly? To my eyes most comments so far seem to be little more than "I agree with the author" or "I disagree".

OP's comment on the other hand at least told me something about the person behind the article.

That said. I am now curious about comparisons between containers/jails/etc across various different metrics that people care about.

Also, what cool non-containerization uses of cgroups and namespaces have some of you gotten up to?

>Adding to what discussion, exactly?

Please read the context to which I replied. That's usually the discussion someone is referring to when they mention 'the discussion' in a reply. The OP highlighted specific issues with the analysis in the article and then someone shit on it with a sarcastic reply.

Since you are being sarcastic: I don't think a Docker engineer has had much experience with the jail() system call, enough to speak about the flexibility of non-Linux OS primitives.

Where does Docker fit into all of this? Asking for a friend.

> Legos

"LEGO is always an adjective. So LEGO bricks, LEGO elements, LEGO sets, etc. Never, ever "legos."" [1]

The other one that gets me is "math". I know it's not really plural, but "mathematics" has an s on the end, so it's "maths"! Or do Americans say "stat" for "statistics" as well?


If you have two red lego bricks and two yellow lego bricks, do you not have two reds and two yellows? You definitely have A couple of bricks of each colour.

I think that the important feature of containers is that they run from an image containing the application. Zones, jails, and VMs are other isolation mechanisms that could be used to run containers. Running an application image unpacked into a VM would be a container.

One place where the difference in namespaces is visible is with Kubernetes pods. Containers running in a pod share the network and volume namespaces.

Your definitions are not the common ones. Dropping the "Linux" from the full name "Linux Containers" doesn't make it a different fundamental. Containers, zones, and jails are all OS-level virtualization: https://en.m.wikipedia.org/wiki/Operating-system-level_virtu... The style of application packaging that came with Docker complements OS virtualization, but the prior art demonstrates that it is its own innovation. "Docker will do to apt what apt did to tar" (@bcantrill)

There are Windows Server containers. Docker runs on FreeBSD using jails and on Solaris using zones. Hyper runs containers on hypervisors, and RancherVM on KVM.

Container now has two meanings: the isolated application, and the container-based virtualization on Linux. The latter is the original usage (LXC is older than Docker).

But the important innovation of Docker is combining containers with image distribution. The isolated application is more general than a virtual machine, so the concept has been extended to other virtualization techniques.
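The distribution half can be sketched in three commands; the registry host and tag below are illustrative:

```shell
# Build an image once, publish it, then run it anywhere a Docker engine exists.
docker build -t registry.example.com/myapp:1.0 .
docker push registry.example.com/myapp:1.0
docker run --rm registry.example.com/myapp:1.0
```

The isolation mechanism underneath (namespaces, a jail, a zone, even a VM) is an implementation detail from the point of view of this workflow.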

Kubernetes pods share the IPC and network namespaces (and PID soon). By "volume namespace" you may mean the mount namespace. The mount namespace is the minimum that distinguishes one container from another, as it is the means by which they see the different filesystem hierarchies created from their images.
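Pod-style network sharing can be approximated with plain Docker; the image and container names here are illustrative. The second container joins the first one's network namespace, so the two share localhost just as containers in a pod do:

```shell
# Start a server, then run a client inside the SAME network namespace.
docker run -d --name web nginx
docker run --rm --network container:web curlimages/curl -s http://localhost/
docker rm -f web
```

Only the network namespace is shared; each container still has its own mount namespace and hence its own filesystem view.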

k8s volumes are bind mounted into the containers in a pod.

I wasn't sure if pods shared the mount namespace or just the external volumes. It sounds like the same external volumes are mounted in each container, but the rest of the mounts are different.

Great rant about something way over my head by someone who knows way more than me about it!

If I can transfer to a domain I understand better (front-end dev): It sounds like VMs, Jails, and Zones are like Ember.js: it comes with everything built in and is simple if you stay within the design.

Containers are more like React: it gives you the pieces to build it yourself, and building it all yourself can lead to complexity, bugs, and performance issues.

Disclosure: I have no idea what I'm talking about

Thanks for sharing; it's good for those of us here who don't usually work in the backend to see this.
