
Joyent has solved some really hard problems to make this work. I was lucky enough to get the chance to sit down with Bryan last week and talk with him about what this took. It's essentially reimplementing the Linux kernel interface on top of illumos, which is no small feat.

There are still messy hurdles to running Docker in production, but it's clear that Joyent has really tried to make something awesome here.

As someone who works deep in the container space, I applaud folks who are taking us further down the rabbit hole. I think it's clear that this whole 'back-to-the-future' isolation technology is really cool stuff. Jails/zones/containers have been around forever, and I think it's great that we're finally taking advantage of this technology.

Edit: at Terminal.com, we want people to push Linux forward, and this is a great example of taking Linux to new and intriguing heights. I did not think we would have Linux on illumos in quite this way in 2015, and it's delightful to see. We are all standing on the shoulders of giants, and it's great to reach new vistas.

reply


Isn't it an example of taking Solaris to new and intriguing heights? There is no actual Linux code in the implementation.

reply


That's a great point, but I think the thrust of what Bryan has been doing is to make Linux work with zones (and DTrace).

That's a primitive that Joyent has wanted to get upstream into the Linux kernel for a long time but has never been able to build the necessary consensus for (similar to OpenVZ's troubles getting their work upstreamed).

In short, this is sort of a hack to give you zones on Linux without needing to get zones into the upstream kernel. Yes, there's no Linux code, but making something like this work requires a deep understanding of Linux code.

It's kinda amazing that they got 64-bit Linux to run on top of illumos, right? I did not see that coming, and maybe that's because I'm ignorant in some capacity, but it's been a pleasant surprise.

reply


Emulation has been a Unix feature for a long time, actually. NetBSD has had 64-bit Linux emulation for ages, for example, but it's not very complete because no one has cared enough to implement more. For example, illumos is AFAIK the first system to emulate epoll. The Linux API is huge, and historically the process has just been fixing stuff for a binary someone wants to run. It is very tedious work...

I don't really see it as zones in Linux. It's more a gateway drug for non-Linux.

reply


Hey Justin -- do you have insight into how hard any particular remapping (e.g. epoll) is to perform? I was talking to @bcantrill about their effort at a Docker meetup and mentioned the NetBSD emulation (he said "Oh! Of course!"), but what's interesting (in retrospect) is that they (Joyent) just tried running stuff and played whack-a-mole with unimplemented APIs... how tough would it be for "us" (NetBSD) to occasionally implement pieces?

edit: parens

reply


It is just tedious and you need motivation, especially as Linux has a lot of interfaces, many of which are frustratingly annoying: there are three file-change notification interfaces of different vintages. In fact there are at least two of everything!

I imagine much of the Joyent code could be easily ported to NetBSD/FreeBSD (which now has a 64-bit interface as of a few months back). epoll may well be the most difficult (it has edge- and level-triggered events and other annoyances), but a not-very-performant version should be doable.

Mostly, few people have been interested. I have a decent test suite, though (rump-based), so email me if you are interested...
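For the curious, here is roughly what the "not very performant version" mentioned above could look like: a level-triggered-only shim that fakes the core epoll calls on top of plain poll(2). This is purely an illustrative sketch, not anyone's actual compat code; the my_epoll_* names and the fixed-size table are invented for the example.

    /* Illustrative sketch only: a level-triggered "epoll" faked on top of
     * poll(2).  Edge triggering, EPOLLONESHOT, removal, etc. are ignored.
     * The my_epoll_* names and fixed-size table are invented for this example. */
    #include <poll.h>
    #include <stdlib.h>

    #define MY_EPOLL_MAX 1024

    struct my_epoll {
        struct pollfd fds[MY_EPOLL_MAX];
        void *data[MY_EPOLL_MAX];       /* caller's per-fd cookie */
        int nfds;
    };

    struct my_epoll *my_epoll_create(void) {
        return calloc(1, sizeof(struct my_epoll));
    }

    int my_epoll_add(struct my_epoll *ep, int fd, short events, void *data) {
        if (ep->nfds == MY_EPOLL_MAX)
            return -1;
        ep->fds[ep->nfds].fd = fd;
        ep->fds[ep->nfds].events = events;   /* POLLIN/POLLOUT, not EPOLL* flags */
        ep->data[ep->nfds] = data;
        ep->nfds++;
        return 0;
    }

    /* Returns the number of ready descriptors and fills out[] with their cookies. */
    int my_epoll_wait(struct my_epoll *ep, void **out, int maxout, int timeout_ms) {
        int n = poll(ep->fds, ep->nfds, timeout_ms);
        int i, nout = 0;
        if (n <= 0)
            return n;
        for (i = 0; i < ep->nfds && nout < maxout; i++) {
            if (ep->fds[i].revents != 0)
                out[nout++] = ep->data[i];
        }
        return nout;
    }

The hard parts in a real compat layer are the edge-triggered semantics, EPOLLONESHOT, and the descriptor-lifetime quirks discussed further down the thread, which is why mapping onto an existing kernel facility (kqueue, /dev/poll, event ports) is so attractive.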

reply


Speaking without familiarity with NetBSD, I think it depends on what kernel facilities the system happens to have; speaking for SmartOS/illumos, in many cases we were able to slightly rephrase Linux facilities as extant facilities -- saving a considerable amount of time and effort. For example, the big realization with epoll was just how naive it is -- so much so, in fact, that it actually looks very similar to a pre-port mechanism (/dev/poll) that we developed nearly 20 years ago (!!) and later deprecated in favor of ports. epoll would have been much nastier without /dev/poll -- which is likely the greatest service that /dev/poll has ever provided anyone...

reply


Yes, NetBSD added some Linux-like facilities (and generally missing functions) where that made sense. No one did epoll, as kqueue is a bit of a mismatch and we never had /dev/poll...

A lot of the issue is just testing: NetBSD does not have any in-tree tests for compat. I have some out of tree, which help a lot.

reply


Hi Bryan -- I'm also aware that epoll may have been a bad example on my part, because isn't it subject to some nasty fork/sharing bugs around the handle itself and which file is actually associated with the handle -- so a parent can get notifications on a handle it doesn't have, or worse, notifications for a socket that it does have that is not really the same handle that's issuing the event?

In cases like that, did you end up trying to be bug-compatible, or make a design decision to clear up the trouble?

[edit -- spell "Bryan" correctly]

reply


Funny you should mention that one in particular -- from our (SmartOS's) epoll(5) man page:

       While  a  best effort has been made to mimic the Linux semantics, there
       are some semantics that are too  peculiar  or  ill-conceived  to  merit
       accommodation.   In  particular,  the  Linux  epoll facility will -- by
       design -- continue to  generate  events  for  closed  file  descriptors
       where/when  the underlying file description remains open.  For example,
       if one were to fork(2) and subsequently close an actively epoll'd  file
       descriptor  in  the  parent,  any  events generated in the child on the
       implicitly duplicated file descriptor will continue to be delivered  to
       the parent -- despite the fact that the parent itself no longer has any
       notion of the file description!  This epoll facility refuses  to  honor
       these  semantics;  closing  the  EPOLL_CTL_ADD'd  file  descriptor will
       always result in no further  events  being  generated  for  that  event
       description.

So while we do aspire to be bug-compatible, we're not about to compromise our principles over it. More details (or some of them, anyway) can be found in the talk on LX-branded zones that I gave at illumos Day at Surge 2014.[1][2]

[1] http://www.slideshare.net/bcantrill/illumos-lx

[2] https://www.youtube.com/watch?v=TrfD3pC0VSs
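To make the quirk above concrete, here is a small stand-alone C program (an untested sketch, so treat the exact output as an assumption) that exercises the fork/close case the man page describes. On stock Linux, the parent's epoll_wait() can report an event for a descriptor the parent has already closed, because the child still holds the underlying file description open; under the illumos behaviour quoted above, the close should silence the event.

    /* Sketch: demonstrate the Linux epoll-after-close behavior described above.
     * Linux-specific (uses <sys/epoll.h>); on Linux it is expected to print an
     * event in the parent even though the parent closed the fd. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/epoll.h>
    #include <sys/wait.h>

    int main(void) {
        int p[2];
        if (pipe(p) != 0) { perror("pipe"); exit(1); }

        int epfd = epoll_create1(0);
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = p[0] };
        epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev);

        pid_t pid = fork();
        if (pid == 0) {
            /* Child: inherits p[0]/p[1]; write so the read side becomes ready. */
            sleep(1);
            write(p[1], "x", 1);
            sleep(2);              /* keep the file description open a while */
            _exit(0);
        }

        /* Parent: drop its own references to the pipe entirely. */
        close(p[0]);
        close(p[1]);

        struct epoll_event got;
        int n = epoll_wait(epfd, &got, 1, 3000);
        if (n > 0)
            printf("got event for fd %d, which the parent already closed\n",
                   got.data.fd);
        else
            printf("no event: closing the fd removed the interest\n");

        waitpid(pid, NULL, 0);
        return 0;
    }

If the man page above is any guide, the same program under the SmartOS lx emulation should take the "no event" branch instead.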

reply


Agreed. The application is what matters... in the end, most people don't care what operating system their apps run on.

I've felt for a long time that with the right tooling, SmartOS/illumos would make an ideal container OS. Glad to see that they're moving hard in this direction.

reply


+1 on terminal.com. Love it.

reply


Terminal.com is building next-generation virtualization. This includes live migration without hypervisors (already in production) and live resizing (also in production).

We'd like to think we're working on the ugly bits of infrastructure that people don't care about.

This includes a distributed file system optimized for speed that stores machine state the way GitHub stores code, software-defined networking for migrating IPs across metal, and other abstractions for making devops easier.

We're looking for people who want to think about hard computer-science problems, like managing stateful systems more intelligently (think automatic failure detection and recovery, and less I/O-intensive WAN replication strategies).

reply


I just heard of Terminal.com right here and am taking it for a spin, but it looks like the disk performance is really low. Is this expected for lower-tier or trial instances?

reply


This is really interesting. I had never heard of Terminal.com before and I'm now deeply interested. Do you hire remotely or sponsor visas?

reply


Feel free to ping me on the email address in my profile. Be forewarned, we have a pretty serious technical interview for all candidates. I would say it's not easy to get a technical job at Terminal.

reply


Thank you. I pinged you ;)

reply


Just make a snapshot on terminal.com and the state of your web application can be distributed as of an instant in time. You can verify it's exactly what you say it is, since the snapshot is bound to your username.

You can make a snapshot for free right now if you want to. This should solve your problem.

Let me know if you have any questions, but using terminal's snapshot feature you can distribute your web app at a known state.

Edit: it's like git versioning for machine state.

reply


While it seems cool, this doesn't solve anything. At best it just shifts trust onto Terminal. And really, there's nothing stopping a malicious VM owner from "fixing" things up to a good-looking state, showing off that snapshot, then reverting and continuing.

A solid example to keep in mind is Mt. Gox. How can we run something and know that no invalid trades are added, no fake password resets are processed, and so on?

reply




So we implemented a variant of Berkeley Lab Checkpoint/Restart [0] on terminal.com. You can snapshot RAM state at any given moment and commit it to disk without interrupting operations. You can use it to, for example, start a Spark cluster [1] with a dataset already in memory. I've tested it with a lot of software, so I'd say it works irrespective of the application, but you're welcome to test.

When you use our snapshotting, RAM state, CPU cache, and disk state are all captured and can be resumed seconds later. This all happens without a hypervisor.

This largely obviates the need for separate configuration storage: you snapshot systems at an initial state and then bring up new machines at that state without config files. If you need to pass an argument to a machine on boot, you can do it programmatically by passing shell commands [2].

[0] http://crd.lbl.gov/departments/computer-science/CLaSS/resear...

[1] https://www.terminal.com/snapshot/c81e6215eba5799335a45b6936...

[2] https://blog.terminal.com/tutorial-terminal-startup-scripts-...
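For anyone wondering what "snapshot RAM state and commit it to disk" means at the lowest level, the usual Linux building blocks are /proc/<pid>/maps and /proc/<pid>/mem. The sketch below is emphatically not Terminal's code (their system also captures CPU and disk state and does it without pausing the workload); it is just a bare-bones illustration of walking a stopped process's readable mappings and dumping them to a file. The program name dumpmem is made up for the example.

    /* Minimal illustration: attach to a process, walk its readable mappings via
     * /proc/<pid>/maps, and dump their contents from /proc/<pid>/mem to a file.
     * Real checkpoint/restore (BLCR, CRIU, etc.) also captures registers, file
     * descriptors, credentials, and much more.  Usage: ./dumpmem <pid> <outfile> */
    #define _FILE_OFFSET_BITS 64
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <sys/ptrace.h>
    #include <sys/wait.h>

    int main(int argc, char **argv) {
        if (argc != 3) {
            fprintf(stderr, "usage: %s <pid> <outfile>\n", argv[0]);
            return 1;
        }
        pid_t pid = (pid_t)atoi(argv[1]);

        /* Stop the target so its memory is consistent while we copy it. */
        if (ptrace(PTRACE_ATTACH, pid, NULL, NULL) != 0) { perror("ptrace"); return 1; }
        waitpid(pid, NULL, 0);

        char path[64], line[512];
        snprintf(path, sizeof(path), "/proc/%d/maps", pid);
        FILE *maps = fopen(path, "r");
        snprintf(path, sizeof(path), "/proc/%d/mem", pid);
        int mem = open(path, O_RDONLY);
        FILE *out = fopen(argv[2], "w");

        while (fgets(line, sizeof(line), maps) != NULL) {
            unsigned long start, end;
            char perms[5];
            if (sscanf(line, "%lx-%lx %4s", &start, &end, perms) != 3)
                continue;
            if (perms[0] != 'r')            /* skip unreadable mappings */
                continue;
            size_t len = end - start;
            char *buf = malloc(len);
            /* Some special regions (e.g. [vsyscall]) may fail; just skip them. */
            if (buf && pread(mem, buf, len, (off_t)start) == (ssize_t)len) {
                fprintf(out, "region %lx-%lx %s\n", start, end, perms);
                fwrite(buf, 1, len, out);
            }
            free(buf);
        }

        fclose(out); close(mem); fclose(maps);
        ptrace(PTRACE_DETACH, pid, NULL, NULL);    /* let the target run again */
        return 0;
    }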

reply


You can use abstraction layers to isolate yourself from issues with the underlying metal. For example, I had a good thread the last time the maintenance reboots happened: https://news.ycombinator.com/item?id=9120289.

Solving multi-tenancy issues is hard, but not impossible. I think it's a lot easier with live migration. If a box is giving you problems, just move the load to a new box while maintaining the same IP addressing.

With respect to cost, yes, AWS gets expensive at scale, but if you're at scale your servers are generally not your major cost center (it's usually payroll and licensing).

reply


That is what I like about the cloud. Running on your own hardware lets you be, well, "lazy" about application architecture. Running in a place where it is shared and instances can disappear forces you to design a lot more robustly and nimbly.

Of course, that design discipline is great wherever you are running....

reply


Hardware is almost always cheaper than engineering time. Don't optimize prematurely.

reply


Yes, but engineering is cheaper than chasing problems in real time.

reply


Now we get to the technical debt debate. Sometimes, you have to make good decisions now instead of perfect decisions later. The market doesn't care how elegant your code is.

reply


No debate, I agree with everything you have said. But sometimes you have to clean that garbage up because it does matter to the market.

reply


The market cares about uptime, and cost relative to return.

Perfect should never be the enemy of the good.

reply


> Perfect should never be the enemy of the good.

Only in a world where resources are infinite does this work.

reply


How to run Docker in production:

http://blog.terminal.com/docker-without-containers-pulldocke...

When you run a Docker container on Terminal, we handle all of the networking and storage problems for you. In addition, we're the only place in the world offering live migration of Docker containers right now (metal to metal, without downtime; we carry everything from the RAM state to the IP address automatically).

reply


If you're really worried about having to deal with reboots, you can run Terminal on top of Linode and gain the ability to live-migrate all of your workloads (so you never have to take down your application because the underlying metal is rebooting).

I discussed it in detail previously on HN [0], but we give you the ability to live-migrate your workloads, even onto heterogeneous kernels. If that's something you really need, you can get it from Terminal today.

[0] https://news.ycombinator.com/item?id=9120289

-----


Meh, it was just a snarky comment. I don't really care; I'm not doing much with my Linode anyway. I'm mostly just using it as a shellbox/IRC client/Mercurial backup.

-----




If you use Terminal on top of AWS (one deployment option), we can just migrate your workloads without rebooting.

The way it works is that you copy the RAM pages from one machine to another in real time, and when the RAM state is almost synchronized you slam the IP address over to the new box (and then you let Amazon reboot your old box and migrate back post-upgrade if you want to).

You can try it out on our public cloud at terminal.com if you'd like (we auto-migrate all of our customers off of degrading hardware before it reboots on our public cloud, but you can control that yourself if you're running Terminal as your infrastructure).
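I have no visibility into Terminal's actual code, but the iterative "pre-copy" idea described above is a standard pattern, and a toy version of the control loop looks something like this. Everything here (the page array, dirty bitmap, thresholds, and cutover step) is a made-up simulation for illustration, not their implementation.

    /* Toy simulation of pre-copy migration: keep copying dirty "pages" while the
     * source keeps running, then pause briefly, copy the remainder, and cut over
     * (in real life: move the IP to the destination).  Purely illustrative. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define NPAGES     4096
    #define PAGE_SIZE  4096
    #define THRESHOLD  64          /* stop iterating when this few pages are dirty */

    static char src[NPAGES][PAGE_SIZE], dst[NPAGES][PAGE_SIZE];
    static char dirty[NPAGES];

    /* Stand-in for the guest workload touching memory while we copy. */
    static void simulate_writes(int n) {
        for (int i = 0; i < n; i++)
            dirty[rand() % NPAGES] = 1;
    }

    static int copy_dirty_pages(void) {
        int copied = 0;
        for (int i = 0; i < NPAGES; i++) {
            if (dirty[i]) {
                memcpy(dst[i], src[i], PAGE_SIZE);
                dirty[i] = 0;
                copied++;
            }
        }
        return copied;
    }

    int main(void) {
        memset(dirty, 1, sizeof(dirty));        /* round 0: everything is dirty */
        int round = 0, remaining = NPAGES;

        while (remaining > THRESHOLD) {
            int copied = copy_dirty_pages();
            simulate_writes(copied / 4);        /* workload dirties pages as we go */
            remaining = 0;
            for (int i = 0; i < NPAGES; i++)
                remaining += dirty[i];
            printf("round %d: copied %d pages, %d still dirty\n",
                   ++round, copied, remaining);
        }

        /* "Stop the world" for the short final pass, then switch the IP over. */
        int final = copy_dirty_pages();
        printf("final pass: copied %d pages, cutting the IP over to the target\n",
               final);
        return 0;
    }

Real systems track dirtying with kernel help (e.g. Linux's soft-dirty page bits) rather than a toy bitmap; the interesting engineering is in keeping that final stop-the-world pass short.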

-----


... how?? That is seriously nifty.

Are you migrating just a process tree / other contained environment, or the entire machine?

Are you using CRIU or similar? Do open TCP connections survive the transfer?

-----


We wrote a bunch of hacks to the Linux kernel to do it.

Custom container implementation, custom networking, custom storage.

It's just really good hardcore kernel engineering.

If you wanna talk more and you're in SF, come to our meetup on the 10th: machinelearningsf.eventbrite.com.

Edit: the whole machine including RAM cache, CPU instructions, IP connections, etc. is carried over. We can also resize your machine in seconds while it's running.

-----


Is this somehow different from Xen live migration / VMware vMotion / etc.?

-----


Yes. VMware vMotion and Xen live migration are both VM migration tools; they don't migrate containers.

The difference is subtle but important. VMs have overhead because they virtualize the kernel; containers don't (or rather, containers benefit from native kernel performance much more than VMs do).

In other words, you can achieve the same thing with vMotion, but it's slower, has more overhead, and is harder to manage.

-----


Ah, I didn't even know you're a container-based shop. So you're moving live containers between AWS-provided Xen VMs.

-----


Or on bare metal. It works just the same on Xen, only slower because of the VM overhead.

Containers are really only part of the solution; there are a lot of other things you have to think about if you want to build a better mousetrap in the virtualization world (like networking and storage).

-----


Wow, that's pretty sweet. Any plans for trying to get that upstream?

-----


Don't know that we have plans around that at this time. I'll try to dig and see if I can flesh out our story around this.

-----


With TCP_REPAIR, presumably they could... but both ends need to implement the REPAIR option, I think, so maybe not in practice yet.

Or if the SDN of your cloud is good enough, even TCP_REPAIR might not be needed!
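For anyone curious what the "repair" side of that looks like, the rough shape on Linux (this is the mechanism CRIU relies on) is: put a fresh socket into repair mode, restore the local address and the send/receive sequence numbers, and then connect(), which in repair mode rebuilds the connection state without a new handshake. Below is a heavily simplified sketch; the saved_* values are assumed to have been captured from the original socket at checkpoint time, and error handling and queue-data restore are omitted.

    /* Sketch of restoring an established TCP connection with TCP_REPAIR (Linux,
     * requires CAP_NET_ADMIN).  Illustrative only; not a complete restore. */
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <stdint.h>
    #include <sys/socket.h>

    #ifndef TCP_REPAIR
    #define TCP_REPAIR        19
    #define TCP_REPAIR_QUEUE  20
    #define TCP_QUEUE_SEQ     21
    #define TCP_RECV_QUEUE    1
    #define TCP_SEND_QUEUE    2
    #endif

    int restore_connection(struct sockaddr_in *saved_local,
                           struct sockaddr_in *saved_peer,
                           uint32_t saved_snd_seq, uint32_t saved_rcv_seq) {
        int one = 1, zero = 0, queue;
        int fd = socket(AF_INET, SOCK_STREAM, 0);

        setsockopt(fd, IPPROTO_TCP, TCP_REPAIR, &one, sizeof(one));

        /* Rebind to the connection's original local address/port. */
        bind(fd, (struct sockaddr *)saved_local, sizeof(*saved_local));

        /* Restore the next sequence numbers for each direction. */
        queue = TCP_SEND_QUEUE;
        setsockopt(fd, IPPROTO_TCP, TCP_REPAIR_QUEUE, &queue, sizeof(queue));
        setsockopt(fd, IPPROTO_TCP, TCP_QUEUE_SEQ, &saved_snd_seq,
                   sizeof(saved_snd_seq));
        queue = TCP_RECV_QUEUE;
        setsockopt(fd, IPPROTO_TCP, TCP_REPAIR_QUEUE, &queue, sizeof(queue));
        setsockopt(fd, IPPROTO_TCP, TCP_QUEUE_SEQ, &saved_rcv_seq,
                   sizeof(saved_rcv_seq));

        /* In repair mode this does not send a SYN; it just installs the state. */
        connect(fd, (struct sockaddr *)saved_peer, sizeof(*saved_peer));

        /* Leave repair mode; the socket now behaves as an established connection. */
        setsockopt(fd, IPPROTO_TCP, TCP_REPAIR, &zero, sizeof(zero));
        return fd;
    }

As far as I understand it, only the end being checkpointed and restored needs repair mode; the remote peer just sees a pause in the conversation, which is why an SDN layer that keeps the same address routed to the new host matters so much.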

-----


It's a custom SDN layer we wrote.

-----


Are you running your VMs inside Amazon VMs? Or are you running containers instead, to avoid the overhead of having 3 nested OSs (the Xen host > the Amazon Xen guest > your VM)? If you run containers, how do you guarantee isolation of tenants (it is generally considered to be very difficult to achieve)?

-----


We are running a custom container implementation. The goal of our implementation is containers that perform like VMware.

Process isolation is hard, but we've achieved it. We currently have some tens of thousands of users on our public cloud with zero container breakouts, and while no security is perfect, we're constantly trying to improve our offering through white-hat bounties and continuous security testing. I can tell you heuristics from which you can infer security, but I can't blanket-label something as secure. I would say it's the most secure new virtualization tech, though I'd also note that's a matter of personal opinion. Again, zero container breakouts is probably the main point.

You can run our virtualization inside of Amazon, in which case you only really have the pain of the Xen host + the Amazon Xen guest, but it performs faster on bare metal (as one might expect).

-----


Interesting! Thanks for the clarification.

How does it compare with OpenVZ, which is also able to live migrate containers?

I guess I have to use the kernel provided by you and cannot use a kernel of my choice?

-----


Isn't your ability to "migrate workloads without rebooting" similar to Google Compute Engine transparent maintenance and to the live-update capability that Amazon is progressively deploying (which is explained in the post)?

How is it different from Xen or KVM live migration?

-----


It's much faster and doesn't use VMs.

-----


I guess it's much faster because you migrate containers instead of full VMs? (and also because your implementation is probably very good!)

-----


It's faster because we made it fast. What's the difference between migrating a container with network, RAM, CPU and disk state and migrating a VM? IMHO, the difference is that the VM has a massive performance penalty and the container implementation does not.

It's not enough to just migrate the container; you also need to migrate the network and the storage. That's actually the harder part, IMHO. Everyone forgets about network and storage until it's time to go into production, and then it gets hard.

-----


I don't see anything on your web page about running on top of AWS...? It looks like you guys only run your own cloud. Can you point me at some docs or anything about running on AWS?

-----


I don't have docs yet because I haven't written them, but it's running on AWS right now.

It runs inside of any hypervisor or on bare metal.

Feel free to email me at josh[at]terminal[dot]com if you want to talk more. I can peel back the kimono quite far (we're also in SF if you wanna meet up).

-----


Neat! I'll probably take you up on that. :-)

I was definitely impressed by the pulldocker err... now pullcontainer project and think it'd be great to see how you secure your containers and handle networking.

-----
