What are Linux containers and how did they come about? (bitmason.blogspot.ca)
153 points by WestCoastJustin | 63 comments



In practice, containers are process groups with access restrictions.

In theory, though, containers are simply an optimization of virtual machines, and are best understood that way. If you can use virtual machines to solve a problem--and your virtual machines are all generally based on a recent Linux--then you can achieve the same thing, using far fewer resources, by using containers.


Generally speaking, containers have worse resource separation. Last time I looked, a system with containers shared a single page cache.

Sharing the page cache means that one user thrashing the disk can make the system unusable for everyone, or at least degrade it to roughly the performance of a system with no page cache at all.

Xen and KVM mitigate this by giving you your own RAM and thus your own page cache.

Personally? I think this makes a way bigger difference than any cgroup limiting of IOPS. Page cache is not free RAM, and should not be treated as such.
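(For reference, a rough sketch of what that cgroup limiting looks like, assuming the v1 controllers are mounted under /sys/fs/cgroup; the group name and $PID are made up. As I understand it the memory controller does charge a group's page cache against its limit, so it can at least bound how much of the cache one tenant eats:)

    mkdir /sys/fs/cgroup/memory/tenant1
    echo $((512*1024*1024)) > /sys/fs/cgroup/memory/tenant1/memory.limit_in_bytes   # counts RSS + page cache
    echo $PID > /sys/fs/cgroup/memory/tenant1/tasks
    mkdir /sys/fs/cgroup/blkio/tenant1
    echo "8:0 10485760" > /sys/fs/cgroup/blkio/tenant1/blkio.throttle.read_bps_device   # ~10 MB/s reads on /dev/sda
    echo $PID > /sys/fs/cgroup/blkio/tenant1/tasks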

I think containers are one of those tradeoffs; they are more efficient when it comes to hardware, but they are less efficient when it comes to sysadmin time tracking down users who are using more resources than they ought.

On the other hand, some find managing multiple containers easier than managing multiple complete virtual servers.


Part of the problem, Luke, is that you're thinking about containers in terms of a hosting provider. That's not your fault. You're a hosting provider. To you, cotenancy is not only a security concern but also performance and SLA. For almost everyone else, efficiency is way paramount.

I still think right now the best way forward is:

- Hosting provider: Xen/etc

- Internal clusters: containers

Containers merely bring commodity computing closer to HPC operational techniques. Containers are not useful for you at prgmr. Using every drop of RAM on the machine for page cache is a positive for my use case, not a negative.

Virt overhead makes no sense in internal clusters, in my opinion. I've worked in an all virt environment and an all container environment. The latter is way better, and these are scales where Puppet burning a core for several seconds across the fleet has a demonstrable impact on Opex.


>Part of the problem, Luke, is that you're thinking about containers in terms of a hosting provider. That's not your fault. You're a hosting provider. To you, cotenancy is not only a security concern but also performance and SLA. For almost everyone else, efficiency is way paramount.

Oh yeah. More fair is almost always preferable to more efficient when you are in a multi-user situation, and certainly, containers are more efficient.

I... have always had a problem, though, with this idea that you should virtualize clustered production applications; virtualization takes a big server and cuts it up into little servers (at a cost) - whereas clustering takes a bunch of little servers and combines them into a big server, again at a cost. It seems to me like once you are out of testing and actually need a full server worth of capacity, you don't want to use virtualization anymore. Multi-tenancy is a feature I need, sure, but for the cluster sysadmin? it's all downside.

Containers/chroots do make a lot more sense than Xen or KVM if you must virtualize your production clusters; I guess that's the idea behind Docker: a clean way to pass around directories full of binaries.

I mean, to be fair, making packages /is/ a pain in the ass, so I can understand wanting a clean way to pass around your production code without using packages.

Really, the more I think about it, the more I see containers as an easy way to roll back all the benefits and problems of using shared libraries without going through the incredible pain of statically linking everything. Distribute your "binary" along with all the libraries and the environment it requires. Hah. And actually? I can understand why you would want to do that; Before I was a service provider I was a cluster sysadmin, and I've put in my time resolving shared-library conflicts.
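Something like this, I suppose; a minimal sketch of the "directory full of binaries" idea with plain chroot (the paths and the tarball name are invented):

    # unpack the app plus the exact shared libraries it was built/tested against
    mkdir -p /srv/apps/myapp
    tar -C /srv/apps/myapp -xzf myapp-rootfs.tar.gz
    # run it against its bundled libraries instead of the host's
    chroot /srv/apps/myapp /bin/myapp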


A ton of production applications do not need a full server worth of capacity per server process.

And as someone who uses virtualisation like this extensively: it is about separating functional concerns. I have images of our "standard" web frontends, and our "standard" database backends and their slaves. We can deploy one of those with a few commands to any of our VM hosts (currently using openvz, but looking at both LXC and KVM and combinations). I don't want them co-mingled because it complicates migrations as needs change.

E.g. some customers could easily fit on a single server when we start deploying our apps for them, but may a year down the line require a complicated multi-server setup. When they do, and it's all in containers, we can migrate the web container to another machine (or duplicate it and load balance), and database container somewhere else, and it's all generic tools. We don't need to start messing around and untangle dependencies on a single server.

Having them neatly isolated in small manageable chunks also means that we are free to change hardware specs, and move containers around depending on what is most cost-effective rather than having to make sure everything fits.

And there's the dependency issue. In my experience, I don't trust for a second that you will end up with a working setup if you ship packages and install them on a pre-existing OS install. It takes about 5 seconds from when someone other than a very experienced sysadmin gets an ssh login to a server before it deviates from the servers it is supposed to be identical to, at which point all bets are off as to whether or not your app will run once deployed to your production servers at all. VMs and containers provide the opportunity to easily guarantee that the images you deploy are unchanged from what you tested on.


>And as someone who uses virtualisation like this extensively: it is about separating functional concerns. I have images of our "standard" web frontends, and our "standard" database backends and their slaves. We can deploy one of those with a few commands to any of our VM hosts (currently using openvz, but looking at both LXC and KVM and combinations). I don't want them co-mingled because it complicates migrations as needs change.

Once you are at scale, each node, doing one thing, should be able to fill a server (at scale, you will need a lot more than one server for each service you've got, assuming the app has enough scale to support a team of sysadmins). The exception I can think of is 'lopsided' services that require, say, more CPU and less RAM than your servers were bought with; in which case, yeah, if you can find another service that is lopsided in another way, you are in good shape.

Until then, sure, you need less than a server, at which point some sort of virtualization makes sense. (And if you own and sysadmin all apps, containers could make a lot of sense, as they are rather lighter weight.) And hey, sometimes even if just one app doesn't have the scale to support a team of sysadmins, the whole lot of them do.

If you are saying that virtualization and containers make a lot of sense for small-scale stuff, or any use case where you need servers that are smaller than the physical servers you buy, we are in violent agreement.

>E.g. some customers could easily fit on a single server when we start deploying our apps for them, but may a year down the line require a complicated multi-server setup. When they do, and it's all in containers, we can migrate the web container to another machine (or duplicate it and load balance), and database container somewhere else, and it's all generic tools. We don't need to start messing around and untangle dependencies on a single server.

This makes it sound like you are as much in the 'service provider' realm as I am. That is an interesting way to think of the cluster administrator role; thinking of the different development groups you are running code for as customers, helping them scale up and down. I actually don't have as much experience with that style of cluster administration. At all my cluster administration jobs, the products I was sysadmining were already many-server affairs by the time babysitting them became my job. In those cases, I (and/or the team) pretty much owned the application, with a little bit of help/support from the devs when we couldn't figure something out, which is different from the customer/provider setup.

>And there's the dependency issue. In my experience, I don't trust for a second that you will end up with a working setup if you ship packages and install them on a pre-existing OS install. It takes about 5 seconds from when someone other than a very experienced sysadmin gets an ssh login to a server before it deviates from the servers it is supposed to be identical to, at which point all bets are off as to whether or not your app will run once deployed to your production servers at all. VMs and containers provide the opportunity to easily guarantee that the images you deploy are unchanged from what you tested on.

Sure, sure.... but generally speaking, if you are using whole-server images? you wipe and re-install the whole goddamn thing every time you get half a chance because of this. With PXE this isn't much more difficult than copying over a new tarball. Most large clusters have some software that let you burn-in and re-install thousands of servers by typing one or two lines.

Now, you can argue that copying a new chroot to a thousand hosts is generally easier, or at least faster to recover from if you screw it up than pxe-installing a thousand hosts, say, if you accidentally target the wrong servers.

>Having them neatly isolated in small manageable chunks also means that we are free to change hardware specs, and move containers around depending on what is most cost-effective rather than having to make sure everything fits.

Sure, different hardware needs different drivers, but I think the difficulty of dealing with that is overstated.

The hard part of having different hardware specs isn't the drivers, it's the different performance characteristics, and virtualization doesn't always help you with that. You've got some servers with 15K SAS disks and others with 'IntelliPower'; no amount of virtualization is going to make the two servers perform the same. But then, sometimes virtualization does help with that, for instance if the problem is RAM-to-CPU (or RAM-to-disk) ratio differences: if you have one app that requires a lot of RAM and disk but little CPU, and another app that requires a lot of CPU and nothing else, and your new boxes with the best CPUs happen to have a lot of RAM, you could benefit from running them both on the same servers.


You hit the nail on the head. This approach of containers as a "static binary on steroids" is a major aspect of my motivation in docker's design. I hope this will become more obvious as we progress in the implementation.


Except when you need snapshotting, live-migration, or have a highly disparate set of underlying OS requirements.

The author of the article explicitly says, "This isn't to say that they're going to replace virtual machines." and I agree - there's no point in pretending like they're solving all of the same problems. I guess you could say that containers are an optimization of some use cases of VMs.


Actually, I've been doing live-migration & snapshotting with OpenVZ for years.

Really, the only reason to use virtual machines is if you truly need different OS kernels.


OS level virt can easily do the first two. The third is harder due to a common kernel. But FreeBSD for instance can run CentOS 6 with the Linuxulator. I believe Solaris has a similar feature called "branded zones".


Branded zones do not run Linux any more and emulation was never complete. Nor is the FreeBSD emulation complete.


But that's simply not true. 20 years ago you could live migrate processes around a VAXcluster with nary a VM in sight.


But since history does not necessarily flow in one direction, today you need a hypervisor (or experiment with CRIU) to do live migration.


Almost. Although I don't remember the details, IBM has Live Partition Mobility for WPARs on AIX. They were looking for differentiation from Solaris Containers at the time and bought a startup called Meiosys, which had some tech which helped them add live migration to WPARs.


You actually have to buy an add-on called WPAR Manager[1] to enable live mobility; it's not available in baseline AIX. They have been shipping this tech for several years now. Disclaimer: I was on the WPAR Manager dev team for several releases.

[1] http://www-03.ibm.com/systems/power/software/aix/sysmgmt/wpa...


HN is awesome. Where else do you get someone who had something to do with a project post a comment about it? :)


On Reddit ;-)


Heh. Even though I work for IBM I wouldn't bother mentioning AIX on HN; it simply doesn't exist inside the echo chamber.


And people in the academic world have implemented live migration for just plain Linux processes as well. But when you migrate at that level you run into issues with things like open files and open sockets that are not easy to solve.

LXC is mainly a way to manage multifunction servers in a clean way, and Docker is a nice tool to help you leverage LXC without having to learn all the details. The nicest thing about Docker is that a developer on Windows could set up a VirtualBox VM with Linux, run all the same Docker Scripts as the production server, and have a reasonable facsimile of production to do testing.


Here's an example of process migration implemented in Linux:

http://en.wikipedia.org/wiki/MOSIX


Live migration from a container is pretty close to working (i.e. it works if you are careful). Linux live migration is not whole-kernel, it is per-resource, so you can migrate just a container's resources.

I still do not understand the use case of live migration though...


The main use case is doing maintenance on the host machine without incurring planned downtime.

E.g. You may have multiple VMs running applications like sharepoint, exchange, zimbra, or whatever, all on a host. You want to perform some planned maintenance on the host, e.g. update the firmware, add memory, upgrade the hypervisor, etc. You place the host in maintenance mode and the VMs are migrated off the host onto another one within the cluster without having to resort to planned downtime.
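With KVM/libvirt, for instance, it is a one-liner once shared storage is in place (a sketch; the domain and host names are invented):

    virsh migrate --live appvm qemu+ssh://host2.example.com/system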


There's a vicious pack of wolves roaming one floor of the server complex.


As someone who has used OpenVZ for a long time, I'd have to disagree with the notion that containers are simply process groups with access restrictions. You have a distinct namespace for them (including networking), which really changes everything.


I think we used something similar some time ago (maybe Xen?) and had the following issues; how does Docker compare:

- what happens when one container changes the system clock?

- are iptables rules per container or global?


Linux network namespaces let you have multiple independent network stacks in Linux. That means routing tables, interfaces, IP addresses, etc.

Each container gets its own network namespace (along with other namespaces like hostname, pids, users, ipc, filesystem mounts). Anything not handled by one of the 6 namespaces is the same across all containers. That includes things like what kernel modules are loaded, the system clock, etc.
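You can poke at a couple of these directly with the iproute2 and util-linux tools, e.g. (as root; the names here are arbitrary):

    ip netns add demo                 # a fresh network namespace
    ip netns exec demo ip addr        # only a loopback device in here, and it's down
    unshare --uts sh -c 'hostname only-in-here; hostname'   # UTS namespace: the change stays inside
    hostname                          # unchanged out here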

Because a user with root can manipulate the kernel in many ways, I wouldn't give root to an untrusted user and assume containers were enough to contain them. Certainly if they can load a custom kernel module it's game over, but I'd bet there's plenty of other ways to break out too.


See the excellent series of LWN articles, "Namespaces in operation" <https://lwn.net/Articles/531114/> for an overview of how namespaces work on Linux.

To answer your time question: AFAIK there is no namespace for system time in Linux. If you don't want processes within a container to be able to set the system clock then don't launch them with the CAP_SYS_TIME capability.
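For example, capsh from libcap can drop it from the bounding set before launching anything (a sketch; run as root):

    capsh --drop=cap_sys_time -- -c 'date -s "2001-01-01"'
    # date: cannot set date: Operation not permitted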


As far as I understand, the clock and such are tied to the base host. Typing 'uptime' in my linux container shows 5 days, the uptime of the base host.


In theory things like the clock could be namespaced too, if anyone found it useful.


That would be useful - sometimes you have to change the system time to debug some code, and if that trips up all the other containers, it's a deal breaker for us.


I was thinking more like FreeBSD jails. Containers can't change the system clock.


Networking for Docker containers is bridged, so iptables rules are per container.

http://blog.docker.io/2013/08/containers-docker-how-secure-a...
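Underneath, that's because netfilter state lives in the network namespace, which is easy to check by hand (as root; names are arbitrary):

    ip netns add c1
    ip netns exec c1 iptables -A INPUT -j DROP   # exists only inside c1
    iptables -S INPUT                            # host chain is untouched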


Welcome back, OpenVZ... except that this will be part of the OS now. And I wouldn't call these "an optimization of virtual machines": containers are part of the kernel/OS, so call it kernel/OS-based virtualization if you want, whereas virtual machines are (or are supposed to be) hardware-based virtualization.

E.g. OpenVZ, LXC - containers; KVM, Xen - virtual machines.


OpenVZ was something that required hacked kernels.

A better analogy is LXC = Solaris Zones

KVM and Xen are hypervisors for virtualization. VirtualBox is a non-hypervisor virtual machine.


OpenVZ is something that requires "hacked kernels" only because the changes were not accepted upstream. I don't think your analogy is any better. We could tack on an " = OpenVZ" on there - all three of them are pretty similar.


I think the virtual machine analogy is a bit dicey because lxc+cgroups/Docker containers have to be much more aware of the host kernel api and devices.

edit: yes derefr, I agree. My point was also that if there were a distribution- or kernel-specific Ruby or Ruby library bug, for example, then you might have to account for it in your code if you ran in containers (if you don't control the host on which your code is running), whereas you control the whole stack in a virtual machine.


But it's not idiomatic to take advantage of those close ties (by, say, having two containers share a named pipe mounted from the host.)

I suppose what you mean is that containers aren't applicable in every instance that VMs are--like, say, providing VPS instances, where some of the instances might want to do funny fiddling things with their virtualized "hardware." This is true.

But for the situations where the applicability of VMs and containers intersect--using them to run Heroku-like isolated "app slugs", for example--then you can think of containers as lighter-weight VMs, and you won't really go wrong.


But there should be no such thing as a VPS. You should be able to just give n people a shell on a box and quotas, permissions etc would suffice. That you can't is a shortfall of the OS and a hypervisor is a bandaid.


And what's your security model? Each user that needs to run services has to run them under their user account? Or are you going to also allocate some number of service accounts to each user and pay the cost to manage these accounts and the combined quotas they entail?

Now how do your users' service accounts share resources? Someone will need a web server running under one account and a job processor running under a different account (different accounts in order to allow for limited privileges). Are they restricted to communicating through named pipes (hope no one else binds the pipe)? What if they want to share some disk-based resources? Shove them all in a group and set someone's home directory to 770?

How do you handle it when one of your users needs a SQL server? Do they need to install this in their home directory, too? And lock down their install so that hopefully other users on the box can't call it?

What about port binding? Just luck of the draw, and everyone binds on what they want? Then they file a request for you to forward to the port they happened to grab? And they hope that they never lose the port due to a server reboot, or a service crash?

People could work in the environment you describe, but it would be miserable. And you (the server owner) would have to sink a massive amount of resources into maintenance. This isn't a shortcoming of the OS. It's a fundamental mismatch between the goal of sharing a server among many users and giving those users sufficient control to build what they need.


>How do you handle it when one of your users needs a SQL server? Do they need to install this in their home directory, too? And lock down their install so that hopefully other users on the box can't call it?

NixOS can handle situations like this. It enables per-user package management with deterministic, immutable installs. If multiple users install the same package, it only puts one copy on the system. Each user gets their own environment where they can customize the package without affecting the other users.
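e.g. (a sketch; the attribute path depends on how your channels are named):

    nix-env -iA nixpkgs.postgresql   # installs into this user's ~/.nix-profile, backed by the shared /nix/store
    nix-env -q                       # list what this user has installed
    nix-env --rollback               # profiles are versioned, so rollbacks are cheap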


How does that work when you factor in quotas? What if a malicious user installs every possible package? Do they kill the system? How are versions handled? Can N users install N versions of a package? (Looks like the answer is yes.)

This sounds like a nice system. I'm just wondering how much of the core issue is really addressed.


For quotas, you probably want something like ZFS. It allows you to set both quotas and reservations. Reservations are nice in that they guarantee the specified amount of space to the given filesystem. ZFS allows the creation of many nested filesystems with different properties (including block-level compression and encryption) and uses inheritance and overrides to give you a very flexible, fine-grained level of control.

ZFS also has real-time, block-level deduplication. This will dramatically cut down on the space usage when users install multiple different versions of the same package.

By combining all this, you can give every user their own nix store with its own set of packages and a quota as well as a minimum reservation of space. Deduplication will take care of all the wasted space from multiple users having the same/similar packages installed.
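Roughly like this (the pool and dataset names are made up):

    zfs create tank/stores/alice
    zfs set quota=20G tank/stores/alice        # hard cap
    zfs set reservation=2G tank/stores/alice   # guaranteed minimum
    zfs set compression=lz4 tank/stores/alice
    zfs set dedup=on tank                      # inherited by child datasets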


Which could be made to work until someone wants to run an alternative OS (ie, not Linux) or use features that are only available in newer kernel versions. Or until someone realizes that a shared kernel with a security policy is a larger attack surface than a tiny hypervisor with no concept of sharing.


If you could do this, it would be implemented by using LXC. I used to use an MAI OS that was based on Charles River Unix which used something like this. Instead of just executing /bin/sh on login, they started a chroot jail and ran /bin/sh inside that. To the user it did look a lot like a private machine. Only admin user accounts could see the whole system.

It would be trivial to do this with LXC too.


Yep, this is more like it


That's kinda true, indeed. The containers people want to use are VMs that talk directly to the local kernel (i.e. they do not run another kernel on top, like virtualization does).

You can make process groups without making the container look like a VM ... BUT. That's hard and time consuming for little gain.


That is why micro-hypervisors, where all OS instances are virtualized, even the one acting as master (like on mainframes), are gaining acceptance.


the concept--and products based on that concept--has been around for almost a decade

Closer to 6 decades, this is old hat for IBM mainframes.

For me the question is tho', why virtualize? And the answer I keep coming back to is that a) it is too hard to retrofit good manageability onto processes as the basic unit of applications and b) people are used to it, the idea of having 1 machine per app, when really it isn't cost effective to do that.


Segregate applications. If app one gets compromised, app two is off in another land. If my linode box gets hacked, without containers, all 16 websites will be compromised/go down.

Desktop virtualization using the NX protocol and Linux containers is also really friendly. Our company uses old Pentium D boxes for desktops. The performance gain from using a thin client + lxc + lubuntu is insane! I can actually watch youtube videos and draw in OpenOffice Writer. (Also great for disaster recovery scenarios).


But processes - theoretically - are segregated. Own address space, running as different users, bound to a processor set, etc. The use of hypervisors is because processes in practice don't really do what they say on the tin, so you need to force another layer of protection and manageability in, and just eat the overhead.


I suppose I was thinking from the 'lazy throw up apache, mysql and wordpress' mindset. If example.com is in its own container with its own mysql database, I do not have to worry about www.test.com getting exploited and example.com's mysql data getting leaked into the wild. I'm also from the days when buffer overflows were on every daemon. Unauthorized remote shell access was always a threat back in those days.


I don't think you have to; I'm working on a prototype of full process isolation with decent guarantees.


This article curiously fails to mention Linux VServer (http://linux-vserver.org/), which has done this since 2003 and, unlike Virtuozzo, has always been completely open source. We've used it with great success in a hosting product.


I probably should have mentioned it. Virtuozzo was much higher profile at the time I was closely following various workload separation technologies--in part because it was making a serious attempt to position itself as an alternative to VMware in the enterprise space. (Something it ultimately failed to do although it continued to enjoy considerable success in hosting.) The fact is that there have been lots of different isolation approaches, including OS virtualization, and Virtuozzo's implementation of OS virtualization--while hardly unique--was the best known along with Solaris Containers.


It also didn't mention OpenVZ by name. Honestly, ever since OpenVZ came out, I haven't felt like the case for using VServer was terribly strong.


Can anyone please give a simple example, or a link, of a use case where containers are needed?

I use KVM/Qemu VMs at home and at work (dev box); I have a dozen always-running VMs. However, I started to read more and more about LXC containers, and still cannot grasp their importance... for example, I manage a few VPSs too, each with its own purpose (web server, database); how could LXC help me?

Thank you in advance.


Basically every Platform-as-a-Service uses containers. They're much more efficient than VMs (resources, spin-up/down times) and the underlying infrastructure is standardized anyway. They may very well not be relevant to a couple of home systems. We're talking about large-scale infrastructures.


Yes, fast spin-up/down are an advantage.

Thanks.


Basically you can use them just like you use your VMs. The main difference is you lose the ability to have a different kernel for each VM.
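If it helps, the day-to-day feel is very VM-like; a sketch with LXC 1.x-style tooling (the container name is invented, template availability varies by distro):

    lxc-create -n web1 -t download   # or -t ubuntu, -t debian, ...
    lxc-start -n web1 -d
    lxc-attach -n web1               # shell inside the container, no ssh needed
    lxc-ls --fancy                   # state, IP addresses, etc.
    lxc-stop -n web1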


Ok, thank you.


This is the current text:

> to change the text of this page send 0.1 btc to 16JMNc3B5vCkuuPxNcNj388gmhP8UDKBuW

Well done, it didn't take more than an hour for someone to find a loophole.


Isn't that $100? The mind boggles.


Wrong thread?



