Joyent has solved some really hard problems to make this work. I was lucky enough to get the chance to sit down with Bryan last week and talk with him about what this took. It's essentially reimplementing Linux on top of illumos, which is no small feat.
There are still messy hurdles to running Docker in production, but it's clear that Joyent has really tried to make something awesome here.
As someone who works deep in the container space, I applaud folks who are taking us deeper down the rabbit hole. I think it's clear that this whole 'back-to-the-future' isolation technology is really cool stuff. Jails/zones/containers have been around forever, and I think it's great that we're finally taking advantage of this technology.
Edit: at Terminal.com, we want people to push Linux forward, and this is a great example of taking Linux to new and intriguing heights. I did not think we would have Linux on illumos in quite this way in 2015, and it's delightful to see. We are all standing on the shoulders of giants and it's great to reach new vistas.
That's a great point, but I think the point of what Bryan has been doing is to make Linux work with Zones (and DTrace).
That's a primitive that Joyent has wanted to upstream into the Linux kernel for a long time but has never been able to build the necessary consensus around (similar to OpenVZ's troubles getting their work upstreamed).
In short, this is sort of a hack to give you zones on Linux without needing to get zones into the upstream kernel. Yes, there's no Linux code, but a lot of understanding of Linux code is required to make something like this work.
It's kinda amazing that they got 64-bit Linux to run on top of illumos, right? I did not see that coming, and maybe that's because I'm ignorant in some capacity, but it's been a pleasant surprise.
Emulation layers have actually been a Unix feature for a long time. NetBSD has had 64-bit Linux emulation for ages, for example, but it's not very complete because no one has cared enough to implement more. For example, illumos is AFAIK the first system to emulate epoll. The Linux API is huge, and historically the process has just been fixing stuff for a binary someone wants to run. It is very tedious work...
I don't really see it as zones in Linux. More a gateway drug for non-Linux.
Hey Justin -- do you have insight into how hard any particular remapping (e.g. epoll) is to perform? I was talking to @bcantrill about their effort at a Docker meetup and mentioned the NetBSD emulation (he said "Oh! Of course!"), but what's interesting (in retrospect) is that they (Joyent) just tried running stuff and played whack-a-mole with unimplemented APIs... how tough would it be for "us" (NetBSD) to occasionally implement pieces?
It is just tedious, and you need motivation. Especially as Linux has a lot of interfaces, many of which are frustratingly annoying: there are three file-change-notification interfaces, of different vintages. In fact there are at least two of everything!
I imagine much of the Joyent code could be easily ported to NetBSD/FreeBSD (which now has a 64-bit Linux interface, as of a few months back). epoll may well be the most difficult (it has edge- and level-triggered events and other annoyances), but a not-very-performant version should be doable.
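If you want a concrete feel for the edge/level distinction that any emulation has to preserve, here's a tiny Linux-only demo (purely illustrative, not anyone's emulation code): with the same unread data sitting in a pipe, a level-triggered epoll keeps reporting readiness, while an edge-triggered one reports it once and then goes quiet until new data arrives.

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/epoll.h>

    /* Wait up to 100ms for a single event; returns 1 if ready, 0 on timeout. */
    static int wait_once(int epfd) {
        struct epoll_event ev;
        return epoll_wait(epfd, &ev, 1, 100);
    }

    int main(void) {
        int p[2];
        if (pipe(p) != 0) return 1;
        (void) write(p[1], "hello", 5);           /* unread data now sits in p[0] */

        /* Level-triggered (the default): readiness is re-reported every time. */
        int lt = epoll_create1(0);
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = p[0] };
        epoll_ctl(lt, EPOLL_CTL_ADD, p[0], &ev);
        int lt1 = wait_once(lt);
        int lt2 = wait_once(lt);
        printf("level-triggered: %d %d\n", lt1, lt2);   /* prints "1 1" */

        /* Edge-triggered: only a readiness transition produces an event, so
         * the second wait times out even though the data is still unread. */
        int et = epoll_create1(0);
        ev.events = EPOLLIN | EPOLLET;
        epoll_ctl(et, EPOLL_CTL_ADD, p[0], &ev);
        int et1 = wait_once(et);
        int et2 = wait_once(et);
        printf("edge-triggered:  %d %d\n", et1, et2);   /* prints "1 0" */
        return 0;
    }

Most of the annoyance is in faithfully reproducing the edge-triggered half (plus things like EPOLLONESHOT) on top of whatever the host kernel natively provides.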
Mostly, few people have been interested. I have a decent test suite, though (rump-based), so email me if you are interested...
Speaking without familiarity with NetBSD, I think it depends on what kernel facilities the system happens to have; speaking for SmartOS/illumos, in many cases we were able to slightly rephrase Linux facilities as extant facilities -- saving a considerable amount of time and effort. For example, the big realization with epoll was just how naive it is -- so much so, in fact, that it actually looks very similar to a mechanism (/dev/poll) that we developed nearly 20 years ago (!!) and later deprecated in favor of event ports. epoll would have been much nastier without /dev/poll -- which is likely the greatest service that /dev/poll has ever provided anyone...
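To make the resemblance concrete, here's a rough sketch (illustrative only, not our actual emulation code) of registering a descriptor and waiting for readiness both ways -- the basic register-and-wait shape is nearly identical:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    #ifdef __sun                      /* illumos/SmartOS: the old /dev/poll */
    #include <poll.h>
    #include <sys/devpoll.h>

    static int watch_and_wait(int fd) {
        int dp = open("/dev/poll", O_RDWR);
        struct pollfd reg = { .fd = fd, .events = POLLIN, .revents = 0 };
        (void) write(dp, &reg, sizeof reg);          /* register interest */

        struct pollfd ready[8];
        struct dvpoll req = { .dp_fds = ready, .dp_nfds = 8, .dp_timeout = -1 };
        return ioctl(dp, DP_POLL, &req);             /* wait; returns # ready */
    }
    #else                             /* Linux: epoll */
    #include <sys/epoll.h>

    static int watch_and_wait(int fd) {
        int ep = epoll_create1(0);
        struct epoll_event reg = { .events = EPOLLIN, .data.fd = fd };
        epoll_ctl(ep, EPOLL_CTL_ADD, fd, &reg);      /* register interest */

        struct epoll_event ready[8];
        return epoll_wait(ep, ready, 8, -1);         /* wait; returns # ready */
    }
    #endif

    int main(void) {
        int p[2];
        if (pipe(p) != 0) return 1;
        (void) write(p[1], "x", 1);
        printf("ready: %d\n", watch_and_wait(p[0]));
        return 0;
    }

The real differences are in the details (EPOLLET, EPOLLONESHOT, the fork semantics discussed elsewhere in this thread) rather than in that basic shape.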
Hi Bryan -- I'm also aware that epoll may have been a bad example on my part, because isn't it subject to some nasty fork/sharing bugs with respect to the handle (the file descriptor) versus the file description it's actually associated with -- so a parent can get notifications for a handle it no longer has, or worse, notifications for a socket it does have that isn't really the same handle issuing the event?
In cases like that, did you end up trying to be bug-compatible, or did you make a design decision to clear up the trouble?
Funny you should mention that one in particular -- from our (SmartOS's) epoll(5) man page:
    While a best effort has been made to mimic the Linux semantics, there
    are some semantics that are too peculiar or ill-conceived to merit
    accommodation. In particular, the Linux epoll facility will -- by
    design -- continue to generate events for closed file descriptors
    where/when the underlying file description remains open. For example,
    if one were to fork(2) and subsequently close an actively epoll'd file
    descriptor in the parent, any events generated in the child on the
    implicitly duplicated file descriptor will continue to be delivered to
    the parent -- despite the fact that the parent itself no longer has any
    notion of the file description! This epoll facility refuses to honor
    these semantics; closing the EPOLL_CTL_ADD'd file descriptor will
    always result in no further events being generated for that event
    description.
So while we do aspire to be bug-compatible, we're not about to compromise our principles over it. More details (or some of them, anyway) can be found in the talk on LX-branded zones that I gave at illumos Day at Surge 2014.
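For the curious, the Linux behavior in question is easy to see for yourself with a small, purely illustrative program (just a demo, not lx-brand code): register a pipe with epoll, fork, close the registered descriptor in the parent, and have the child -- whose inherited descriptor keeps the file description alive -- write to the pipe.

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/epoll.h>
    #include <sys/wait.h>

    int main(void) {
        int p[2];
        if (pipe(p) != 0) return 1;

        int epfd = epoll_create1(0);
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = p[0] };
        epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev);

        pid_t pid = fork();
        if (pid == 0) {
            /* Child: its inherited copy of p[0] keeps the underlying file
             * description open even after the parent closes its descriptor. */
            (void) write(p[1], "x", 1);        /* generate an event */
            _exit(0);
        }

        close(p[0]);                           /* parent drops the epoll'd fd */

        struct epoll_event out;
        int n = epoll_wait(epfd, &out, 1, 2000);
        /* On Linux this prints 1: the parent is told about a descriptor it no
         * longer holds. Under the lx epoll described above, it would print 0. */
        printf("events for a closed fd: %d\n", n);
        waitpid(pid, NULL, 0);
        return 0;
    }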
Terminal.com is building next-generation virtualization. This includes live migration without hypervisors (already in production) and live resizing (also in production).
We'd like to think we're working on the ugly bits of infrastructure that people don't care about.
This includes a distributed file system optimized for speed and for storing machine state the way GitHub stores code, software-defined networking for migrating IPs across metal, and other abstractions for making devops easier.
We're looking for people who want to think about hard computer-science problems, like managing stateful systems more intelligently (think automatic failure detection and recovery, and less I/O-intensive WAN replication strategies).
Just make a snapshot on terminal.com and then the state of your web application can be distributed at an instant in time. You can verify it's exactly what you say it is, since the snapshot is bound to your user name.
You can make a snapshot for free right now if you want to. This should solve your problem.
Let me know if you have any questions, but using Terminal's snapshot feature you can distribute your web app at a known state.
While it seems cool, this doesn't solve anything. At best it just shifts trust onto Terminal. And really, there's nothing stopping a malicious VM owner from "fixing" things up to a good-looking state, showing off that snapshot, then reverting and continuing.
A solid example to keep in mind is Mt. Gox. How can we run something and know that no invalid trades are added, no fake password resets are processed, etc.?
So we implemented a variant of Berkeley Lab Checkpoint/Restart (BLCR) on terminal.com. You can snapshot RAM state at any given moment and commit it to disk without interrupting operations. You can use it to, for example, start a Spark cluster with a dataset already in memory. I've tested it with a lot of software, so I'd say it works irrespective of the application, but you're welcome to test.
When you use our snapshotting, RAM state, CPU cache, and disk state are all captured and can be resumed seconds later. This all happens without a hypervisor.
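Our code is BLCR-derived and does far more than this, but if you want a feel for how you can capture a consistent view of memory without pausing the thing that owns it, the classic fork()/copy-on-write trick (the same one Redis uses for its dumps) is a reasonable mental model. This sketch is purely illustrative and is not our implementation:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/wait.h>

    #define STATE_SIZE (64 * 1024 * 1024)      /* the "RAM state" we care about */

    int main(void) {
        char *state = malloc(STATE_SIZE);
        if (state == NULL) return 1;
        memset(state, 'A', STATE_SIZE);

        pid_t pid = fork();                    /* child gets a CoW view of memory */
        if (pid == 0) {
            FILE *f = fopen("snapshot.bin", "wb");
            if (f != NULL) {
                fwrite(state, 1, STATE_SIZE, f);   /* consistent as of fork() time */
                fclose(f);
            }
            _exit(0);
        }

        /* The parent keeps mutating its memory immediately; the pages the child
         * is writing out are unaffected, so the snapshot stays consistent. */
        memset(state, 'B', STATE_SIZE);
        waitpid(pid, NULL, 0);
        printf("snapshot written while the parent kept running\n");
        free(state);
        return 0;
    }

A real checkpoint also has to capture registers, open file descriptors, sockets, and so on, which is where the BLCR-style machinery comes in.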
Snapshotting this way also sort of obviates the need for separate configuration storage: you snapshot systems at an initial state, and then you can bring up new machines at that initial state without config files. If you need to pass an argument to a machine on boot, you can do it programmatically by passing the shell commands.
Solving multi-tenancy issues is hard, but not impossible. I think it's a lot easier with live migration. If a box is giving you problems, just move the load to a new box while maintaining the same IP addressing.
With respect to cost, yes, AWS gets expensive at scale, but if you're at scale your servers are generally not your major cost center (it's usually payroll and licensing).
That is what I like about the cloud. Running on your own hardware lets you be, well, "lazy" about application architecture. Running somewhere shared, where instances can disappear, forces you to design a lot more robustly and nimbly.
Of course, that design discipline is great wherever you are running....
When you run a Docker container on Terminal, we handle all of the networking and storage problems for you. In addition, we're the only place in the world offering live migration of Docker containers right now (metal to metal, without downtime; we carry everything from the RAM state to the IP address automatically).
If you're really worried about having to deal with reboots, you can run Terminal on top of Linode and gain the ability to live-migrate all of your workloads (so you never have to take down your application because of the underlying metal rebooting).
I discussed it in detail previously on HN, but we give you the ability to live-migrate your workloads, even onto heterogeneous kernels. If that's something you really need, you can get it from Terminal today.
I have a friend who works at Patreon.com; it's a different take on this.
It's basically a way to have a direct financial relationship between artists and patrons. I am curious as to how this model might work over time. It could be a good way to deal with the inability of musicians, for example, to sell records.
If you use Terminal on top of AWS (one deployment option) we can just migrate your workloads without rebooting.
The way it works is that you copy the RAM pages from one machine to the other in real time and, when the RAM state is almost synchronized, you slam the IP address over to the new box (and then you let Amazon reboot your old box, and migrate back post-upgrade if you want to).
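If it helps, here's a toy, single-process sketch of the pre-copy idea (made-up names, not our production code): keep re-copying whatever pages the workload dirties while it continues to run, and only pause for the final, small remainder before switching the IP over.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define NPAGES    1024
    #define PAGESZ    4096
    #define THRESHOLD 16          /* pause for the final copy when this few remain */

    static char src[NPAGES][PAGESZ];   /* memory still live on the source box */
    static char dst[NPAGES][PAGESZ];   /* copy being assembled on the target box */
    static int  dirty[NPAGES];

    /* Stand-in for the running workload: it keeps touching a few pages. */
    static void workload_touches_memory(void) {
        for (int i = 0; i < 8; i++) {
            int p = rand() % NPAGES;
            src[p][0]++;
            dirty[p] = 1;
        }
    }

    int main(void) {
        for (int p = 0; p < NPAGES; p++) dirty[p] = 1;   /* nothing copied yet */

        int round = 0, remaining = NPAGES;
        while (remaining > THRESHOLD) {
            /* Ship every dirty page; in reality this is a network transfer. */
            for (int p = 0; p < NPAGES; p++) {
                if (dirty[p]) { memcpy(dst[p], src[p], PAGESZ); dirty[p] = 0; }
            }
            /* The workload keeps running during the copy and re-dirties pages. */
            workload_touches_memory();
            remaining = 0;
            for (int p = 0; p < NPAGES; p++) remaining += dirty[p];
            printf("round %d: %d pages re-dirtied\n", ++round, remaining);
        }

        /* Brief pause: copy the last few pages, then move the IP to the target. */
        for (int p = 0; p < NPAGES; p++)
            if (dirty[p]) memcpy(dst[p], src[p], PAGESZ);
        printf("switchover after %d rounds\n", round);
        return 0;
    }

The hard parts in practice are tracking dirty pages cheaply and bringing the network and storage along with the memory.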
You can try it out on our public cloud at terminal.com if you'd like to (we auto-migrate all of our customers off of the degrading hardware before it reboots on our public cloud, but you can control that if you're running Terminal as your infrastructure).
Or on bare metal. It works just the same on Xen, but slower because of the VM overhead.
Containers are really only part of the solution, there's a lot of other things you have to think about if you want to build a better mousetrap in the virtualization world (like networking and storage).
Are you running your VMs inside Amazon VMs? Or are you running containers instead, to avoid the overhead of having 3 nested OSs (the Xen host > the Amazon Xen guest > your VM)? If you run containers, how do you guarantee isolation of tenants (it is generally considered to be very difficult to achieve)?
We are running a custom container implementation. The goal of our implementation is containers that perform like VMware.
Process isolation is hard, but we've achieved it. We currently have tens of thousands of users on our public cloud with zero container breakouts, and while no security is perfect, we're constantly trying to improve our offering through white-hat bounties and constant security testing. In this case, I can tell you heuristics with which you can infer security, but I can't blanket-label something as secure. I would say I think it's the most secure new virtualization tech, but I would also note that's a matter of personal opinion. Again, zero container breakouts is probably the main point.
You can run our virtualization inside of Amazon, in which case you only really have the pain of Xen host + Amazon Xen, but it performs faster on bare-metal (as one might expect).
Isn't your ability to "migrate workloads without rebooting" similar to Google Compute Engine transparent maintenance and to the live-update capability that Amazon is progressively deploying (which is explained in the post)?
How is it different from Xen or KVM live migration?
It's faster because we made it fast. What's the difference between migrating a container with network, RAM, CPU and disk state and migrating a VM? IMHO, the difference is that the VM has a massive performance penalty and the container implementation does not.
It's not enough to just migrate the container, you also need to migrate the network and the storage too. That's actually the harder part, IMHO. Everyone forgets about network and storage until it's time to go into production and then it gets hard.