
> Unikernels are entirely undebuggable

I'm pretty sure you debug an Erlang-on-Xen node in the same way you debug a regular Erlang node. You use the (excellent) Erlang tooling to connect to it, and interrogate it/trace it/profile it/observe it/etc. The Erlang runtime is an OS in every sense of the word; running Erlang on Linux is just redundant, since you've already got all the OS you need. That's what justifies making an Erlang app a unikernel.

But that's an argument coming from the perspective of someone tasked with maintaining persistent long-running instances. When you're in that sort of situation, you need the sort of things an OS provides. And that's actually rather rare.

The true "good fit" use-case of unikernels is in immutable infrastructure. You don't debug a unikernel, mostly; you just kill and replace it (you "let it crash", in Erlang terms.) Unikernels are a formalization of the (already prevalent) use-case where you launch some ephemeral VMs or containers as a static, mostly-internally-stateless "release slug" of your application tier, and then roll out an upgrade by starting up new "slugs" and terminating old ones. You can't really "debug" those (except via instrumentation compiled into your app, à la New Relic.) They're black boxes. A unikernel just statically links the whole black box together.

Keep in mind, "debugging" is two things: development-time debugging and production-time debugging. It's only the latter that unikernels are fundamentally bad at. For dev-time debugging, both MirageOS and Erlang-on-Xen come with ways to compile your app as an OS process rather than as a VM image. When you are trying to integration-test your app, you integration-test the process version of it. When you're trying to smoke-test your app, you can still use the process version—or you can launch (an instrumented copy of) the VM image. Either way, it's no harder than dev-time debugging of a regular non-unikernel app.

> You don't debug a unikernel, mostly; you just kill and replace it

Curious how you would root-cause the bugs that caused the crash in the first place. If you don't root-cause them, won't subsequent versions still retain the same bugs?

There are bugs that can only manifest themselves in production. Any system where we don't have the ability to debug and reproduce these classes of problems in prod is essentially a non-starter for folks looking to operate reliable software.

Personally, I wouldn't use unikernels for my custom app-tier code† (unless my app-tier either was Erlang, or had a framework providing all the same runtime infrastructure that Erlang's OTP does.)

Instead, unikernels seem to me to be a good way to harden high-quality, battle-tested server software. Of two main kinds:

• Long-running persistent-state servers that have their own management capabilities. For example, hardening Postgres (or Redis, or memcached) into a unikernel would be a great idea. A database server already cleans and maintains itself internally, and already offers "administration" facilities that render OS-level interaction with the underlying hardware moot. (In fact, Postgres often requires you to avoid doing OS-level maintenance, because Postgres needs to manage changes on the box itself to be sure its own ACID guarantees are maintained. If the software has its own dedicated ops role—like a DBA—it's likely a perfect fit for unikernel-ification.)

• Entirely stateless network servers. Nginx would make a great unikernel, as would Tor, as would Bind (in non-authoritative mode), as would OpenVPN. This kind of software can be harshly killed+restarted whenever it gets wedged; errors are almost never correlated/clustered; and error rates are low enough (below the statistical noise-floor where the problem may as well have been a cosmic ray flipping bits of memory) that you can just wash your hands of the responsibility for the few bugs that do occur (unless you're operating at Facebook/Google scale.) This is the same sort of software that currently works great containerized.


† The best solution for your custom app-tier, if you really want to go the unikernels-for-everything route, might be a hybrid approach. If you run your app in both unikernel-image, and OS-process-on-a-Linux-VM-image setups, you'll get automatic instrumentation "for free" from the Linux-process instances, and production-level performance+security from the unikernel instances. You could effectively think of the Linux-process images as being your "debug build."

Two ways of deploying such hybrid releases come to mind:

• Create cluster-pools consisting of nodes from both builds and load-balance between them. Any problems that happen should eventually happen on an instrumented (OS-process) node. This still requires you to maintain Linux VMs in production, though.

• Create a beta program for your service. Run the Linux-process images on the "beta production" environment, and the unikernel-images on the "stable production" environment. Beta users (a self-selected group) will be interacting with your new code and hopefully making it fall down in a place where it's still instrumented; paying customers will be working with a black-box system that—hopefully—most of the faults will have been worked out of. Weird things that happen in stable can be investigated (for a stateful app) by mirroring the state to the beta environment, or (for a stateless app) by accessing the data that causes the weird behavior through the beta frontend.
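The first option, load-balancing across both builds, comes down to a weighted pick: most traffic lands on the black-box unikernel nodes, a slice lands on the instrumented Linux-process nodes. A minimal Ruby sketch of that idea (node names and weights are invented for illustration, not from any real deployment):

```ruby
# Hypothetical mixed pool: mostly unikernel ("black box") nodes, plus a
# small share of instrumented Linux-process nodes.
POOL = [
  { node: "unikernel-1", weight: 9 },  # production build, uninstrumented
  { node: "linuxproc-1", weight: 1 },  # instrumented "debug build"
].freeze

# Pick a node given a random draw r in [0, 1), e.g. r = rand.
def pick_node(pool, r)
  total = pool.sum { |n| n[:weight] }
  x = r * total
  pool.each do |n|
    return n[:node] if x < n[:weight]
    x -= n[:weight]
  end
  pool.last[:node]  # guard against floating-point edge cases
end
```

With these weights, roughly 10% of requests hit an instrumented node, so any problem that happens often enough should eventually happen somewhere observable.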

> In fact, Postgres often requires you to avoid doing OS-level maintenance, because Postgres needs to manage changes on the box itself to be sure its own ACID guarantees are maintained.

Huh? What are you talking about?

With unikernels you get a lot more consistency. E.g. I once saw a bug that came down to one server using reiserfs and another using ext2. But there's no way to have that problem with unikernels.

But sure, you need a debugger. So you use one. I'm not sure why the author seems to think that's so hard.

> But sure, you need a debugger. So you use one. I'm not sure why the author seems to think that's so hard.

The author wrote and continues to contribute to DTrace, which is an incredibly advanced facility for debugging and root-causing problems. GDB (for example) doesn't help you solve performance problems or root-cause them, because now your performance problem has become ptrace (or whatever tracing facility GDB uses on that system).

The point he was making is that there are problems with porting DTrace to a unikernel (it violates the whole "let's remove everything" principle, and you couldn't practically modify which DTrace probes you're using at runtime because the only process is your misbehaving app -- good luck getting it to enable the probes you'd like to enable).

You can't modify them from within your app, sure. You modify them from a (privileged) outside context. Allowing the app to instrument itself that thoroughly violates the principle of least privilege the author was so fond of.

> With unikernels you get a lot more consistency

That's not unique to unikernels. You can get that with Docker containers or EC2 instances on AWS.

There's a lot more that can happen differently there. Docker doesn't hide all the details of the filesystem, kernel version, or the like. With EC2 instances you're still running a kernel that has a lot of moving parts of its own.

> With EC2 instances you're still running a kernel that has a lot of moving parts of its own.

I'm sure that's true, but it's not a statement directly about consistency.

It's a lot harder to get consistency out of a non-unikernel system running on EC2; e.g., IME the Linux boot/hardware-probing process can behave nondeterministically before it even starts running your user program.

The problem is how to debug sporadic production problems. It doesn't matter how good your tooling is to debug a problem in a dev environment if you have no idea how to reproduce it there.

You need some level of decent logging in the production environment (with optional extra logging that can be turned on) to capture WTF went wrong. THEN you can try to reproduce it. When a logged system goes boom, you need it to come out of production and stick around until those logs are saved somewhere.
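That pattern (keep a bounded ring of recent debug output in memory, dump it only when something actually goes boom) can be sketched in a few lines of Ruby; the class and method names here are invented for illustration:

```ruby
# Hypothetical sketch: a fixed-size ring of recent debug lines that is
# flushed somewhere durable only when an exception escapes.
class RingLog
  def initialize(capacity = 1024)
    @capacity = capacity
    @buffer = []
  end

  def log(line)
    @buffer << line
    @buffer.shift if @buffer.size > @capacity  # drop the oldest entry
  end

  # Run a block of work; on an uncaught exception, dump the recent
  # history before re-raising so the crash still propagates.
  def capture
    yield
  rescue => e
    dump(e)
    raise
  end

  def dump(error, io = $stderr)
    io.puts "crash: #{error.class}: #{error.message}"
    @buffer.each { |line| io.puts "  #{line}" }
  end
end
```

In a unikernel, `dump` would write to whatever durable sink the platform gives you (serial console, block device, network log collector) instead of stderr.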

It is old, but http://lkml.iu.edu/hypermail/linux/kernel/0009.1/1307.html is an interesting data point about IBM's experience. They implemented exactly the kind of logging that I advocate in all of their systems until OS/2. And the fact that they didn't have it in OS/2 was a big problem.

You're right. Coming from the Erlang world, I almost implicitly assume good logging (and extremely useful task-granular crashdumps.) If you're implementing your app on a custom unikernel "platform" that isn't basically-an-OS, the first and most important step is logging the hell out of everything.

But that's the case for every system of sufficient complexity (that I've seen).

There's an alternate approach that I was mentally contrasting with. You tend to see it with e.g. Ruby apps on Heroku: you can't "log into" a live app, but you can launch an IRB REPL—with all the app's dependencies loaded—in the context of the live app.

Having this ability to "futz around in prod" frequently obviates the need for prescient levels of logging. You can poke at the problem until it crashes for you.
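Outside Heroku, the same "REPL into the live app" trick can be approximated with nothing but the Ruby stdlib. Here is a deliberately minimal DRb sketch (port, class, and method names are all invented; in practice you'd use something like pry-remote, and in either case bind it to internal interfaces only):

```ruby
require 'drb/drb'

# Hypothetical sketch: expose a tiny eval service over DRb so an
# operator in another process can poke at live app state.
class Console
  def initialize(app_state)
    @app = app_state
  end

  # Evaluate an expression with the app object as `self`.
  def eval_in_app(expr)
    @app.instance_eval(expr)
  end
end

app = Object.new
def app.queue_depth; 42; end  # stand-in for real in-process state

DRb.start_service('druby://localhost:18787', Console.new(app))
# From another process on the same (internal) network:
#   DRbObject.new_with_uri('druby://localhost:18787')
#     .eval_in_app('queue_depth')
```

This is the essence of what pry-remote gives you, minus the actual REPL niceties (completion, history, `ls`, and so on).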

Ruby folks, don't let not being in Heroku stop you from taking this approach, because it's the best. pry-remote (are you using irb instead of pry? please stop) will give you the same fantastic behavior. It's part of every persistent Ruby service I deploy (guarded for internal access only, of course, don't be crazy).


I haven't tried deploying a unikernel in production yet - I've been mostly using/debugging only OCaml code on Linux in production - but it should be possible to implement the kind of logging you describe. For example, I've seen a project that would collect and dump its log ring-buffer when an exception was encountered in a unikernel, and one that collects very detailed profiling information from a unikernel.

It would be nice to have some kind of a "shell" to inspect details about the application when something goes wrong, but that applies equally to non-unikernel applications. The difference is that with unikernels the debugging tool would be provided as a library, whereas with traditional applications debugging is provided as a language-independent OS tool.

Just to add some links for those (I assume these are the ones you mean):

Dumping the debug-level log ring-buffer on exception:


Detailed profiling (interactive JavaScript viewer for thread interactions):


I use Go on App Engine and have never been able to SSH or GDB those machines. Nevertheless, I am still able to debug issues in my app.

I admit that debugging is easier on platforms where you get more control.

> It's only the latter that unikernels are fundamentally bad at.

Well, that's the point of the article.

What good does restarting your service do if the issue will stay there, and come back again later?

It's also not true though.

Running a LING unikernel gets you a huge set of tracing and debugging capabilities that aren't 100% compatible with what's provided by BEAM, but it's close, and a couple of orders of magnitude better than what almost any other language provides for mucking with a live system, unikernel or not.

Running Erlang via Rumprun gets you a standard BEAM/erts release packaged as a VM image or a bootable ISO, and that is just straight up Erlang. So all the absolutely excellent debugging, tracing, and profiling facilities that any random Erlang application has access to are also accessible when deployed as a unikernel (rumpkernel).

Exactly my thoughts. In addition, I doubt that the author is familiar with the principles of OTP (or immutable services in general), or he wouldn't dismiss restarting a service once it misbehaves as impractical.

I think most people are unfamiliar with Erlang/OTP. Even if I converted my entire team over, I don't know how I'd hire new people without developing strong internal training.

Perhaps that means the real argument is "Unikernels are unfit for production for people who aren't comfortable with the Erlang/OTP way of doing things." Yes, this isn't a technical argument, but—most people, most customers, don't even see SmartOS + Linux emulation as suitable for production (and I imagine the author knows that), not for any technical reason but just for unfamiliarity. And that's still a UNIX working in UNIXy ways.

Hell, a lot of the base docker images I've been using lately don't even come with netstat or vim on them.

I'm already feeling the "can't debug" pinch and we haven't even started getting into unikernels yet.

I thought the idiomatic way of debugging containerized applications was from the host system, not from inside the container. The tooling surely can still be improved, but fundamentally the host has full view to the innards of containers.

I can't tell if I'm exposing the wrong port(s) (or rather, PORTing the wrong EXPOSES, please shoot whoever inverted that terminology). I also can't easily tell if I'm getting relative paths wrong, and I can't rapidly iterate on fixing the problem. That's where you really feel the debugging problems. The drag of day to day troubleshooting of new configuration or tools can still be pretty challenging.

Stuff tends to stay working once you get it working though, so that evens up the score a bit, but I'd really love it if getting there were smoothed out a bit.

Are you talking about debugging code running in a container or debugging Docker? Because we're all talking about the former, not the latter.

Ah, I tend to define "debug" not as "the act of using a debugger tool" but as "diagnosing the broken things".

Makes me fairly useful in a crisis, but possibly less so in this thread.

I would think a remote debugger would work just fine in a container, at least once you documented how, but to zokier's point the only reason you'd do that is if the problem isn't repeatable outside the container. To me that means a couple of possibilities: a bug in the Docker scripts, a bug in the unikernel generator, or a timing issue, which a debugger won't help you with very much (timing bugs are often also heisenbugs).

