I'm pretty sure you debug an Erlang-on-Xen node in the same way you debug a regular Erlang node. You use the (excellent) Erlang tooling to connect to it, and interrogate it/trace it/profile it/observe it/etc. The Erlang runtime is an OS, in every sense of the word; running Erlang on Linux is truly just redundant, since you've already got all the OS you need. That's what justifies making an Erlang app a unikernel.
But that's an argument coming from the perspective of someone tasked with maintaining persistent long-running instances. When you're in that sort of situation, you need the sort of things an OS provides. And that's actually rather rare.
The true "good fit" use-case of Unikernels is in immutable infrastructure. You don't debug a unikernel, mostly; you just kill and replace it (you "let it crash", in Erlang terms.) Unikernels are a formalization of the (already prevalent) use-case where you launch some ephemeral VMs or containers as a static, mostly-internally-stateless "release slug" of your application tier, and then roll out an upgrade by starting up new "slugs" and terminating old ones. You can't really "debug" those (except via instrumentation compiled into your app, à la New Relic.) They're black boxes. A unikernel just statically links the whole black box together.
Keep in mind, "debugging" is two things: development-time debugging and production-time debugging. It's only the latter that unikernels are fundamentally bad at. For dev-time debugging, both MirageOS and Erlang-on-Xen come with ways to compile your app as an OS process rather than as a VM image. When you are trying to integration-test your app, you integration-test the process version of it. When you're trying to smoke-test your app, you can still use the process version—or you can launch (an instrumented copy of) the VM image. Either way, it's no harder than dev-time debugging of a regular non-unikernel app.
Curious how you would drive to the root cause of the bugs that caused the crash in the first place. If you don't root-cause them, won't subsequent versions retain the same bugs?
There are bugs that can only manifest themselves in production. Any system where we don't have the ability to debug and reproduce these classes of problems in prod is essentially a non-starter for folks looking to operate reliable software.
Instead, unikernels seem to me to be a good way to harden high-quality, battle-tested server software. Of two main kinds:
• Long-running persistent-state servers that have their own management capabilities. For example, hardening Postgres (or Redis, or memcached) into a unikernel would be a great idea. A database server already cleans and maintains itself internally, and already offers "administration" facilities that render OS-level interaction with the underlying hardware moot. (In fact, Postgres often requires you to avoid doing OS-level maintenance, because Postgres needs to manage changes on the box itself to be sure its own ACID guarantees are maintained. If the software has its own dedicated ops role—like a DBA—it's likely a perfect fit for unikernel-ification.)
• Entirely stateless network servers. Nginx would make a great unikernel, as would Tor, as would Bind (in non-authoritative mode), as would OpenVPN. This kind of software can be harshly killed+restarted whenever it gets wedged; errors are almost never correlated/clustered; and error rates are low enough (below the statistical noise-floor where the problem may as well have been a cosmic ray flipping bits of memory) that you can just wash your hands of the responsibility for the few bugs that do occur (unless you're operating at Facebook/Google scale.) This is the same sort of software that currently works great containerized.
† The best solution for your custom app-tier, if you really want to go the unikernels-for-everything route, might be a hybrid approach. If you run your app in both unikernel-image and OS-process-on-a-Linux-VM-image setups, you'll get automatic instrumentation "for free" from the Linux-process instances, and production-level performance+security from the unikernel instances. You could effectively think of the Linux-process images as being your "debug build."
Two ways of deploying such hybrid releases come to mind:
• Create cluster-pools consisting of nodes from both builds and load-balance between them. Any problems that happen should eventually happen on an instrumented (OS-process) node. This still requires you to maintain Linux VMs in production, though.
• Create a beta program for your service. Run the Linux-process images on the "beta production" environment, and the unikernel-images on the "stable production" environment. Beta users (a self-selected group) will be interacting with your new code and hopefully making it fall down in a place where it's still instrumented; paying customers will be working with a black-box system that—hopefully—most of the faults will have been worked out of. Weird things that happen in stable can be investigated (for a stateful app) by mirroring the state to the beta environment, or (for a stateless app) by accessing the data that causes the weird behavior through the beta frontend.
Huh? What are you talking about?
But sure, you need a debugger. So you use one. I'm not sure why the author seems to think that's so hard.
The author wrote and continues to contribute to DTrace, which is an incredibly advanced facility for debugging and root-causing problems. GDB (for example) doesn't help you solve performance problems or root-cause them, because now your performance problem has become ptrace (or whatever tracing facility GDB uses on that system).
The point he was making is that there are problems with porting DTrace to a unikernel (it violates the whole "let's remove everything" principle, and you couldn't practically modify what DTrace probes you're using at runtime because the only process is your misbehaving app -- good luck getting it to enable the probes you'd like to enable).
That's not unique to unikernels. You can get that with Docker containers or EC2 instances on AWS.
I'm sure that's true, but it's not a statement directly about consistency.
You need some level of decent logging in the production environment (with optional extra logging that can be turned on) to capture WTF went wrong. THEN you can try to reproduce it. When a logged system goes boom, you need it to come out of production and remain around until those logs are saved somewhere.
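A minimal sketch of that "optional extra logging that can be turned on" idea, in Python (the logger name and signal choice are illustrative, not from any particular project): flip a live process between INFO and DEBUG with a signal, so the extra detail can be enabled without pulling the instance out of production.

```python
import logging
import signal

log = logging.getLogger("svc")  # hypothetical service logger
logging.basicConfig(level=logging.INFO)

def toggle_debug(signum, frame):
    """Flip between INFO and DEBUG without restarting the process."""
    if log.getEffectiveLevel() > logging.DEBUG:
        log.setLevel(logging.DEBUG)   # turn the extra logging on
    else:
        log.setLevel(logging.INFO)    # back to the quiet default

# `kill -USR1 <pid>` on a live instance toggles verbosity
signal.signal(signal.SIGUSR1, toggle_debug)
```

The same toggle could just as easily be driven by a config-reload endpoint or an admin socket; the point is only that the hooks have to exist before the instance misbehaves.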
It is old, but http://lkml.iu.edu/hypermail/linux/kernel/0009.1/1307.html is an interesting data point about IBM's experience. They implemented exactly the kind of logging that I advocate in all of their systems until OS/2. And the fact that they didn't have it in OS/2 was a big problem.
Having this ability to "futz around in prod" frequently obviates the need for prescient levels of logging. You can poke at the problem until it crashes for you.
It would be nice to have some kind of a "shell" to inspect details about the application when something goes wrong, but that applies equally to non-unikernel applications.
The difference is that with unikernels the debugging tool would be provided as a library, whereas with traditional applications debugging is provided as a language-independent OS tool.
One useful pattern is dumping the debug-level log ring-buffer on exception.
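A minimal sketch of that pattern in Python (the handler class and names are illustrative, not from any particular library): keep the newest N debug records in a fixed-size in-memory ring, and only write them out when something actually goes wrong.

```python
import collections
import io
import logging
import sys

class RingBufferHandler(logging.Handler):
    """Keep the newest `capacity` formatted records in memory."""
    def __init__(self, capacity=1000):
        super().__init__(level=logging.DEBUG)
        self.buffer = collections.deque(maxlen=capacity)  # oldest entries fall off

    def emit(self, record):
        self.buffer.append(self.format(record))

    def dump(self, stream=sys.stderr):
        for line in self.buffer:
            print(line, file=stream)

log = logging.getLogger("app")
log.setLevel(logging.DEBUG)
ring = RingBufferHandler(capacity=1000)
ring.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
log.addHandler(ring)

def main():
    log.debug("detail too noisy to persist on every request")
    raise RuntimeError("boom")

captured = io.StringIO()
try:
    main()
except RuntimeError:
    # Only now does the cheap debug history hit disk/stderr;
    # a real service would re-raise or exit after saving it.
    ring.dump(captured)
```

Because the ring lives in memory and is bounded, the debug logging stays cheap in the steady state; you only pay the I/O cost on the failure path, which is exactly when you want the detail.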
I admit that debugging is easier on platforms where you get more control.
Well, that's the point of the article.
What good does restarting your service do if the issue will stay there, and come back again later?
Running a LING unikernel gets you a huge set of tracing and debugging capabilities that aren't 100% compatible with what's provided by BEAM, but it's close, and a couple of orders of magnitude better than what almost any other language provides for mucking with a live system, unikernel or not.
Running Erlang via Rumprun gets you a standard BEAM/erts release packaged as a VM image or a bootable ISO, and that is just straight up Erlang. So all the absolutely excellent debugging, tracing, and profiling facilities that any random Erlang application has access to are also accessible when deployed as a unikernel (rumpkernel).
Perhaps that means the real argument is "Unikernels are unfit for production for people who aren't comfortable with the Erlang/OTP way of doing things." Yes, this isn't a technical argument, but—most people, most customers, don't even see SmartOS + Linux emulation as suitable for production (and I imagine the author knows that), not for any technical reason but just for unfamiliarity. And that's still a UNIX working in UNIXy ways.
I'm already feeling the "can't debug" pinch and we haven't even started getting into unikernels yet.
Stuff tends to stay working once you get it working though, so that evens up the score a bit, but I'd really love it if getting there were smoothed out a bit.
Makes me fairly useful in a crisis, but possibly less so in this thread.
I would think a remote debugger would work just fine in a container, at least once you documented how, but to zokier's point the only reason you'd do that is if the problem isn't repeatable outside the container. To me that means a couple of possibilities: a bug in the Docker scripts, a bug in the unikernel generator, or a timing issue, which a debugger won't help you with very much (timing bugs are often also heisenbugs).