And yes, for those comparing this to The Birth And Death Of JavaScript: "After watching it for the first time, I realized that the joke idea that Gary proposed could actually work, and work well! I'd say that that talk is probably why I started writing Nebulet."
The idea of a multi-user time-sharing virtual memory operating system is seriously outdated.
Today, a server OS could be based on a one-user, multi-computer abstraction, with multiple gigs of memory, solid-state storage, and a reliable GigE+ network as assumed essentials.
Entire sub-systems of the current linux kernel can be omitted.
Most of the VFS layer, with so many optimizations for spinning rust. Right out.
Paging, swapping, shared objects, address randomization and PIC. Right out.
User access controls, file access controls, network access controls / firewalls. Right out.
Of the 50,000 device drivers in the linux kernel, probably 50 deserve to be supported.
For a workstation with graphics / GPU and console support, and support for every pluggable external device, maybe some sort of legacy emulation layer would work. Basically run a Linux VM for backwards compatibility with those 50,000 devices. Would be less work than implementing those drivers...
Remember, Linus was once a 19-year-old with a dream and a small repo of prototype kernel demo code, 25 years ago.
You don't need to implement networking either, because connecting it to the network is out of the question -- with no process isolation, access privileges, ASLR etc., any security bug means immediate privilege escalation to ring 0.
The great thing about using a transparently compiled ISA is that there is no way you can get arbitrary code execution. The only things exposed to the outside will be SIPs, and if you exploit one of those, the most you can do is make it crash.
Yeah, you have a point -- after all, nobody published any code execution bug in V8 in, like, forever, and by forever I mean in the last 6 months.
FYI, there have been 6 CVEs for code execution in V8 in the last year alone. Fortunately, Chrome has great sandboxing and mitigation mechanisms to limit the impact of these, the same mechanisms the parent explicitly recommends doing away with.
The point being, when everything is ring 0, you bet on both the hardware and software being perfect. And if there's anything that years of vulnerabilities in cryptographic software has taught me, it's that perfection is REALLY DAMN HARD.
> Most of the VFS layer, with so many optimizations for spinning rust. Right out.
You already aren't running 'em! Also, that's the block device layer, not the VFS layer. The VFS is the bit between the syscall interface and the concrete FS (which then in turn works with the block layer, if you are using a block device).
Don't go signing the death warrant for something you clearly don't understand.
As part of the lightly educated masses, this sounds good to me. Can anyone more knowledgeable than me comment on the feasibility of something like this?
To the extent the OP took it, as a universal way that 'servers' could go, not very feasible. Partly because we already HAVE what's being described, in the form of hypervisors.
We then layer within it all that other stuff that's chopped away to bring us back to a system that has a bunch of stuff we want, but most importantly that supports a wide range of applications we want to deploy.
The big problem with stripping back to the absolute bare essentials is that you optimize towards a local maximum, and severely limit your flexibility. This is certainly the way you want to go if you have deep pockets coupled with a need for bare-bones speed that can't be sharded in an effective manner. But that's not the majority of workloads.
Honestly, the overhead of a modern OS, especially one like Linux which has been fine-tuned for these sorts of workloads, is negligible. More importantly, any gains you make by stripping out parts of the OS you don't need, you immediately throw away by going for a virtualised, sandboxed ISA.
If you want to squeeze more performance per watt than you can get from a modern server, the only way forward is to code your application in Verilog and to run it on an FPGA or ASIC.
JavaScript only ran in the browser until Ryan Dahl decided to create Node.js. Nowadays it is a big platform, and it can also run WebAssembly.
This is just another step in that direction, where the OS is the one running that code.
Does it have problems? Yes. Adding more complexity to the kernel will open new attack vectors for breaking its security. But that is true for any added layer, even in hardware, whose complexity has brought us Meltdown and Spectre. Running JavaScript in your kernel is a lot riskier than running it in your app.
So in light of Spectre, the Chrome developers don't believe it's safe to have any sensitive data in the same memory space as V8, but WebAssembly is safe in ring 0? What am I missing here?
"Normally, this would be super dangerous, but WebAssembly is designed to run safely on remote computers, so it can be securely sandboxed without losing performance."
I'm staring at this sentence hoping the author is being supremely sarcastic...
I think they're saying that WebAssembly code doesn't lose any performance when you sandbox it, not that WebAssembly has equivalent performance to native code.
Note: I haven’t read the source code so I’m not sure how it’s actually implemented, this is off my personal knowledge of WebAssembly and Spectre.
WebAssembly isn’t assembly. It can only refer to memory offsets within its allocated block (so it always does *(baseAddr + offset)), so to generate assembly for it, you already need to add checks. One way to prevent some spectre-like attacks is to mask the offset after those checks. Another way is to use virtual memory to keep program spaces very far apart and only use int32 displacements in memory loads (since they’re always relative to the base memory address).
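As a concrete illustration of the mask-after-check idea (my own sketch, not Nebulet's or any engine's actual codegen), here is roughly what a wasm JIT's linear-memory load reduces to, written as Rust and assuming a power-of-two memory size:

    // Hypothetical shape of a wasm linear-memory load: an architectural
    // bounds check, then an index mask so that even a mispredicted
    // branch cannot speculatively read outside the guest heap.
    const MEM_SIZE: usize = 1 << 16; // one 64 KiB wasm page (power of two)

    fn wasm_load_u8(memory: &[u8; MEM_SIZE], offset: usize) -> Option<u8> {
        if offset >= MEM_SIZE {
            return None; // a real JIT would trap here
        }
        // The mask keeps the access in bounds even under misspeculation.
        let masked = offset & (MEM_SIZE - 1);
        Some(memory[masked])
    }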
He mentions Singularity as inspiration which is exactly what came to my mind.
This isn’t an unproven space. Singularity proved you can use a single global address space (given 64-bit) and software to isolate processes - something MS Research called Software Isolated Processes.
This requires a verifiable bytecode/VM system so the kernel can verify the instruction stream at load time. In a way, WebAssembly is even easier to verify than C#.
It’s obviously a research toy but that isn’t a bad thing.
That was before spectre happened. Software isolated processes are now basically impossible to do. All the CPU microcode updates & workarounds are purely about fixing isolation between processes & rings that the CPU is aware of.
>Software isolated processes are now basically impossible to do.
I don't think that's the case at all. To my understanding, software has always been capable of doing this just fine, but at a performance cost. So the mechanism was ported to hardware. Optimized implementations have proven to contain security issues. It's not that the security problems are unsolvable, it's that you can't just patch a broken CPU in the wild.
So we fall back to software implementations again, resulting in an overall performance decrease, but with more correct security.
Specifically regarding speculative sidechannel attacks, your rebuttal does not address the comment you're replying to. kllrnohj is specifically noting that the mitigations for speculative sidechannel attacks rely on seeing the patterns of a preemptive operating system using the facilities expected by the hardware for that task.
Does that really follow?
That purely software-based solutions are doomed because hardware-based solutions failed? Weren't most fixes done at least partially in software?
I think you misunderstood. The hardware-based solutions are being fixed with various approaches whereas the software-based solutions nobody has any clue how to fix.
Depends on how you define hardware and software. There are global workarounds that amount to crippling CPU behavior, to provide guarantees for code not written to be spectre safe. I guess that’s what you mean by “hardware”. But “software” approaches are localized mitigations to code written to avoid spectre attacks - and these localized mitigations use arch specific sequences to do so.
Another major problem with the software level mitigations is that they cost considerably more. Speculation fences cut performance to a small integer fraction of prior performance.
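To make the cost concrete, the fence variant of the mitigation looks something like this sketch (my own example, x86-64 only): the lfence prevents any later load from executing until the bounds check has resolved, which is exactly why it hurts so much in hot loops.

    #[cfg(target_arch = "x86_64")]
    fn read_checked(data: &[u8], i: usize) -> Option<u8> {
        if i < data.len() {
            // Speculation barrier: no later load may execute until the
            // bounds check above has actually resolved.
            unsafe { core::arch::x86_64::_mm_lfence() };
            Some(data[i])
        } else {
            None
        }
    }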
> Process-based sandboxing was too heavy in old hardware systems, but on modern ones is the best approach.
They actually started first with full process isolation, then realized that it had an outsize impact on performance, and now they're going back because the consequences for security are now known. There have been spectre-related patches going into Chromium and V8 for the last few months, and they just keep coming; it's hard to overstate how much work must be done to have a decent probability of not being exposed to the known variants of these bugs in a software-isolated system.
AFAIK, all the CPU microcode updates were for fixing Meltdown, not Spectre. Unless and until CPU manufacturers find a way to fix speculative execution for good, software-based mitigations would seem to be mandatory.
The problem is that "fixing speculative execution" means no longer making things faster in any measurable way. Given that the caches are there to make things 2 orders of magnitude faster, this is rather crippling.
The hardware based protections are the only boundary where you can "fix" speculative execution without slowing down your system by 100x.
The hardware boundaries effectively act as hints that say to the hardware "slowing down things by 100x here won't kill performance too badly -- we should turn off speculation for safety". Without those hints, everything needs to get that much slower.
Considering the hardware mitigations aren’t some magic bullet, but are a “big hammer” that cripples CPU OoO behavior in certain ways for everything, software mitigations don’t seem that unreasonable. Of course, “software mitigations” use CPU facilities as well, and can allow for making critical sections safe. In practice, it is well understood what code needs to be written in a spectre-safe way. The difficult question is how to write code to anticipate the next 10 varieties of speculation attacks...
I do not believe that it's impossible to fix Spectre in hardware without disabling speculation. How about we fix the state changes that Spectre detects. For example, we could have a buffer that stores evicted cache lines during speculation and restores them if a rollback is necessary. Yeah, it wouldn't be free, but it would be faster than disabling all speculation.
> For example, we could have a buffer that stores evicted cache lines during speculation and restores them if a rollback is necessary
The problem is that with pure software isolation, as being discussed here, every access is observable within the "same program", so every speedup would need to be rolled back.
With hardware boundaries, you know which speculations to roll back (or just avoid doing), so you can put a fix in silicon.
I understand that. You cannot make this work, because in the case of pure software isolation, any instruction can theoretically be used to snoop on any other instruction. That means speculation can theoretically be detectable if it speeds up anything.
Unfortunately, this is not correct. Intel and AMD exposed MSRs for the Indirect Branch Prediction Barrier (variant 2), Indirect Branch Restricted Speculation (variant 2), and Memory Disambiguation Disable (variant 4). Variant 1 is unfixed. Variant 3 is Meltdown.
Thanks for the links. But the Ars link seems to confirm that all these mitigations need to be done in software, and that the firmware updates merely give the OS the hooks necessary to implement its own mitigations (implying that application software didn't get or need any help from the CPU in applying mitigations). And if all this is being done in software anyway, then it would seem that it should at least partially suffice for a scheme like the OP's (someone who knows more about what these hooks actually do should say more). I'm not sure whether that means it's as secure in practice as conventional OSes--naively we should expect that it's not until proven otherwise--but it also appears to be a far cry from the above claim that "software isolated processes are now basically impossible to do", which, even if true, seems true as well for the software we already have.
Imagine you have a BASIC interpreter with a million cooperative threads. Can you write a BASIC program using spectre to break from one thread to another?
Singularity was nice, but not the only OS in this domain.
As Joe Duffy complained about Midori, the problem is getting management to be willing to push it no matter what.
Eventually Microsoft took some of those ideas into Windows 8.x and 10, but they still are implemented in the context of a COM based world, thus only partially implemented.
> Normally, this would be super dangerous, but WebAssembly is designed to run safely on remote computers, so it can be securely sandboxed without losing performance.
This seems like a pretty strong claim. I hope that it's true, but I'm not going to be running WASM modules in ring 0 any time soon.
Agreed. I highly advise against this. What I heard when I read this headline was 'WebAssembly has no vulnerabilities, so we can run remote WebAssembly code in ring 0 without worry'. Almost everything has some kind of vulnerability; it is simply a matter of the time necessary to find it. The more analyzed and tested a piece of software is, the more time it will take to find the next vulnerability (Barnhill's Law).
The protection ring 3 offers is greatly reduced by current operating systems. Processes usually can do everything the user could do via the file system.
WASM is a stack machine, which does not add to security. You can sense that no real VM specialist, nor anyone in the field of security, had a hand in its design.
When I first read the specs, it screamed "VM design 101" at me. It feels more like someone's master's thesis than a piece of production software, just as the original Netscape JavaScript 1.0 was.
It will have its fair share of "typeof null" style bugs to come.
Why is a stack machine bad for security? The JVM also had sandboxed execution as a goal and also uses a stack machine. But perhaps the stack machine was chosen because it tends to produce smaller binaries (which is important for things you send over the net) and not for security reasons?
>But perhaps the stack machine was chosen because it tends to produce smaller binaries (which is important for things you send over the net) and not for security reasons?
Who knows what was in their heads, but stack-level attacks are as easy as exploiting unsafe type casting in anything that amounts to a stack pointer.
My guess why they chose to do it that way is simply that there is more literature available for mid-tier coders in the style of "VMs for Dummies", and they wanted to always have the option not to do extensive research on every small matter and just copy the JVM's behaviour.
The stack in a stack-based VM does not refer to the real stack that contains return pointers that can be manipulated. You don’t have access to that from WebAssembly.
The security problems of Java are not related to it being a stack-based VM at all. The problems are that the API lets applets do things they shouldn’t be able to, and arbitrary code execution during deserialisation.
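To make the operand-stack point concrete, here is a toy sketch of a stack machine (mine, not taken from the wasm spec): the operand stack is just an array of values, with no return addresses or frame pointers in it for guest code to corrupt.

    // Toy stack-machine evaluator. The operand stack holds only values,
    // never return addresses, so "smashing the stack" in the classic
    // sense is not even expressible from inside the bytecode.
    enum Op { Push(i32), Add, Mul }

    fn eval(ops: &[Op]) -> Option<i32> {
        let mut stack: Vec<i32> = Vec::new();
        for op in ops {
            match op {
                Op::Push(v) => stack.push(*v),
                Op::Add => {
                    let (b, a) = (stack.pop()?, stack.pop()?);
                    stack.push(a.wrapping_add(b));
                }
                Op::Mul => {
                    let (b, a) = (stack.pop()?, stack.pop()?);
                    stack.push(a.wrapping_mul(b));
                }
            }
        }
        stack.pop()
    }

    // (1 + 2) * 3 compiles to: Push 1, Push 2, Add, Push 3, Mul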
It's a really funny talk, but this is nothing we haven't seen before with Java, Python, .Net... heck one of the most popular unikernel implementations is in OCaml! We'll survive having more languages in/as kernels.
Linux already allows untrusted user code to be run in a ring 0 VM, via BPF/eBPF. Honestly, given the choice between Web Assembly and eBPF to run in kernel mode, I'd go with Web Assembly. The fewer secure VMs that have to be audited, the better.
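For anyone who hasn't touched it, this is roughly what handing the kernel a filter program looks like from userspace. A sketch of my own, using classic BPF rather than eBPF to keep it short, with the opcodes written out numerically the way tcpdump -dd prints them; Linux and the libc crate are assumed:

    use libc::{setsockopt, sock_filter, sock_fprog, socket, AF_PACKET,
               ETH_P_ALL, SOCK_RAW, SOL_SOCKET, SO_ATTACH_FILTER};

    fn main() {
        // "Accept only IPv4 frames": ldh [12]; jeq #0x0800; ret #-1; ret #0
        let mut prog = [
            sock_filter { code: 0x28, jt: 0, jf: 0, k: 12 },       // ldh [12] (EtherType)
            sock_filter { code: 0x15, jt: 0, jf: 1, k: 0x0800 },   // jeq IPv4 ? next : skip 1
            sock_filter { code: 0x06, jt: 0, jf: 0, k: u32::MAX }, // ret -1 (accept packet)
            sock_filter { code: 0x06, jt: 0, jf: 0, k: 0 },        // ret 0  (drop packet)
        ];
        let fprog = sock_fprog { len: prog.len() as u16, filter: prog.as_mut_ptr() };
        unsafe {
            let fd = socket(AF_PACKET, SOCK_RAW, (ETH_P_ALL as u16).to_be() as i32);
            assert!(fd >= 0, "socket() failed (needs CAP_NET_RAW)");
            // The kernel verifies the program here, before it ever runs.
            let rc = setsockopt(fd, SOL_SOCKET, SO_ATTACH_FILTER,
                                &fprog as *const _ as *const libc::c_void,
                                std::mem::size_of::<sock_fprog>() as u32);
            assert_eq!(rc, 0, "SO_ATTACH_FILTER failed");
        }
    }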
To be fair, BPF/eBPF is far more designed for simple and obviously correct verification. eBPF is a 10-register, two-address, load-store design, so on both CISC and RISC platforms there's generally a 1-to-1 mapping between eBPF asm and the JITed asm. So the in-kernel JIT doesn't even need a register allocator. The control flow graph is constrained to be a DAG, so you can solve the halting problem for these programs. And so on.
You're looking at at least a couple of orders of magnitude more work to get to the same level of correctness for a WebAssembly runtime.
Is it really that much worse? A simple greedy register allocator is very, well, simple. (You don't even need a register allocator, in fact, though your performance will be poor.)
I'm not sure guaranteeing that a program halts really matters, either; really what you want is the ability to limit the amount of time a filter can run, which is simple to do directly. (In fact, it's simpler to add a timeout than to perform control flow analysis.)
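For a sense of scale, "simple greedy" here means something on the order of linear scan over live intervals. A toy sketch of my own (ignoring spill-slot reuse, register classes, and fixed-register constraints) is roughly the whole algorithm:

    // Toy greedy/linear-scan allocation over precomputed live intervals.
    #[derive(Clone, Copy)]
    struct Interval { vreg: usize, start: u32, end: u32 }

    enum Loc { Reg(usize), Spill(usize) }

    fn linear_scan(mut intervals: Vec<Interval>, num_regs: usize) -> Vec<(usize, Loc)> {
        intervals.sort_by_key(|iv| iv.start);
        let mut free: Vec<usize> = (0..num_regs).collect();
        let mut active: Vec<(Interval, usize)> = Vec::new(); // (interval, physical reg)
        let (mut out, mut next_slot) = (Vec::new(), 0);

        for iv in intervals {
            // Return registers whose intervals ended before this one starts.
            active.retain(|(a, r)| {
                if a.end < iv.start { free.push(*r); false } else { true }
            });
            match free.pop() {
                Some(r) => {
                    active.push((iv, r));
                    out.push((iv.vreg, Loc::Reg(r)));
                }
                None => {
                    // No register free: greedily spill the new interval.
                    out.push((iv.vreg, Loc::Spill(next_slot)));
                    next_slot += 1;
                }
            }
        }
        out
    }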
> Is it really that much worse? A simple greedy register allocator is very, well, simple. (You don't even need a register allocator, in fact, though your performance will be poor.)
I mean, if you're doing something so crazy as pushing user controlled code into interrupt context, you care about performance. And the BPF scheme is within spitting distance of natively compiled code.
> I'm not sure guaranteeing that a program halts really matters, either; really what you want is the ability to limit the amount of time a filter can run, which is simple to do directly. (In fact, it's simpler to add a timeout than to perform control flow analysis.)
Right now, I don't think that there's a way for a BPF filter to 'fail' once it's been verified. It's sort of like a graphics shader in that regard.
And timeouts can't be implemented with a timer since the filters run at interrupt context already, and manual bookkeeping comes with a perf cost (at least a lost register, and some basic block epilogue code). And that's in addition to the "well, the filter failed, now how do we handle that" question that's hinted at above.
That's the kicker. BPF is simpler, _and_ in spitting distance of native code perf. And if you're doing something crazy like injecting user code into interrupts, you care about perf.
That being said, you're totally right that it's possible to get parity, just with orders of magnitude more work.
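As an aside, the "manual bookkeeping" alternative usually takes the form of fuel metering: the compiler inserts a counter decrement and test at every loop back-edge, which is exactly the lost register and extra per-block code mentioned above. A hand-written sketch of the shape of the emitted code (my example, not how any particular runtime does it):

    // Fuel metering by hand: every iteration pays a decrement-and-test,
    // the overhead that a verified-to-terminate BPF program never pays.
    struct OutOfFuel;

    fn metered_sum(data: &[u64], mut fuel: u64) -> Result<u64, OutOfFuel> {
        let mut acc = 0u64;
        for &x in data {
            fuel = fuel.checked_sub(1).ok_or(OutOfFuel)?; // inserted at the back-edge
            acc = acc.wrapping_add(x);
        }
        Ok(acc)
    }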
"Nebulet is just a way to try new ideas in operating system design. It’s a playground for interesting and exotic techniques that could improve performance, usability, and security."
> Normally, this would be super dangerous, but WebAssembly is designed to run safely on remote computers, so it can be securely sandboxed without losing performance.
This throws away the very important security property of defense in depth. A system design should include interlocking levels of security, so even if there is a vulnerability in one place, extra work may be required to exploit it.
I agree with you for the most part here. For the things that I typically use computers for, I would prefer to have both hardware and software protected sandboxes. There's a reason that browsers are switching to using multiple processes.
I suspect you are being downvoted for being overly emphatic. I can certainly think of scenarios where having this extra security is more costly than helpful.
An interesting point of note is that the mill architecture has been designed to have much cheaper hardware protection than other architectures. [1]
It's since been extended quite a bit from even that, and can be used for a variety of things, including dynamic tracing (BCC is an excellent frontend here), processing beyond just filtering (XDP), and more.
https://lwn.net/Articles/740157/
It might be less complex than the WASM VM, but it's quite a bit beyond just a packet filter these days.
Hardware protection rings are a holdover from a time when we didn't have program formats that could be statically transformed and verified to do exactly what the operator desires. They have a nonzero performance overhead even when implemented in hardware, and in an ideal design should not be required at all.
Re: your followup about defense in depth, this is a common and frankly boring fallback argument. At some point computers had much less reliable internals, and for example even the result of strlen() could vary across runs. Should we also perpetually account for the presence of unreliable registers or memory too?
> when we didn't have program formats that could be statically transformed and verified to do exactly what the operator desires.
Which we still don't. Rowhammer, spectre/meltdown, etc... proved that even if the code doesn't violate any sandbox constraints that doesn't mean it didn't violate everything the sandbox was attempting to protect.
Hardware isolation is still very important and very necessary, now more than ever.
Yes, which means they also spanned software protection domains. Meaning your software protection didn't work, regardless of how "verifiable" the bytecode is.
And a few of those so far have no known software protection domain fix, relying instead on hardware domains (e.g. spectre, which is why Chrome is pushing site isolation hard - because they can't fix the software protection and are relying on the hardware ones instead).
I think that having a user-accessible VM in the kernel is a useful thing that a lot of OS's eventually need for other reasons (e.g. packet filtering, syscall sandboxing). We might as well design for it properly up front.
Thank you for bringing that up. That's an entirely different discussion, and one that I agree is valid to have. Though, I really don't see the need for user space programs to be able to insert executable code into kernel memory, when hw ring protections would otherwise be in use. Maybe if there were already a verified set of interlocking software controls, then hw protections could be replaced, but I don't see that with running all wasm in the kernel space.
The language is memory safe, so it is theoretically limited to the surface area that the non-memory-safe OS libraries allow. Of course, that is assuming there are no exploits in the WebAssembly compiler, and no exploits in the OS libraries.
Access to syscalls would not be unusual in any case. This is the normal attack surface for an OS. The new attack surface is the WebAssembly compiler and checker.
To be honest though, given the processor vulnerabilities that have come about, I don't know if I really feel so bad about software protections like this anymore. Nothing is a panacea, even magical processor protection rings.
I don't see how to square "makes essentially arbitrary syscalls so we can run faster than Linux code" and "is safe". WebAssembly is not Platonically safe; it's safe for specific reasons. If you bind syscalls into it, or more generally, bind anything that allows you to do what syscalls do in an OS, you're going to be poking holes in its safety at a distressing pace, because you're taking some of the reasons WebAssembly is safe, and throwing them away.
It may be memory safe, but honestly, memory-safety is no great trick anymore... pretty much everything except the languages currently used to implement kernels is memory safe. Memory safe is still not "generally safe".
It all depends on how safe the OS software sandbox is, which is something that recent HW vulnerabilities have really emphasized, even if you're not running in ring 0. And it's not like this is a new idea: as mentioned, Microsoft already built a research OS on this idea.
Personally, I say more power to this high schooler. It's just a research project, and it'll be interesting to see where it goes. Nobody is suggesting you replace Linux/OSX/Windows with this thing anytime soon.
This is what the Java Virtual Machine could have been but never was. Most applications are so lightweight for today's computer capabilities that running them over a virtualization layer is not a problem at all. As a developer I want my applications to run everywhere without having to take into account different OSs, architectures, etc.
You will still have native applications for a lot of different situations where performance is critical enough to be worth paying the higher cost of creating, testing and distributing for different OSs. That is a skill that should never die, like creating the hardware itself.
HTML5 is a good attempt at this, but it has problems with not-well-defined behaviours and the fact that not all applications, e.g. games, fit the hypertext approach.
The Java virtual machine could have had better integration into OSs and browsers, but its licensing model made that impossible. I worked with applets in the browser, and they had a lot of potential, but probably it was too soon for them, as computers were slower and Java was not so well optimized back then.
WebAssembly doesn't have this history, and it can be used in ways for which Java has been discarded for historical reasons.
As someone who was in high school just a few years ago and is now working, I would argue that is definitely not true. High school can quite easily take a lot more time than work, especially if you are an honors student or do any extra-curricular activities. I would wake up at 6am, be at school at 6:45am, and leave school at 5pm most days of the week due to sports and class. Then I would have 2-5 hours of homework to do. Now, working for a large company is 40-50 hours a week, 9am-6pm, with little work after hours.
Depends on the job, some of us do a lot more than 40.
Getting married and having children will make you reassess your statements. Heck, just getting married will. Of course, I am assuming that you are single.
What's safer, running untrusted code at ring 0, or running it at a lower privilege level? The answer should be obvious to anyone that's not dreaming of some ideal world.
Ideals are great to strive for BTW, just don't lie to yourself about already having gotten there.
Lower privilege levels don't really convey performance benefits. I suppose syscall overhead may impact some applications, but one should not be running untrusted code where the performance bottleneck is accessing the system.
This is an interesting idea, putting some userspace stuff back into the kernel. But I wonder how much overhead you can actually save, as the supporting argument is just one sentence.
Context switches / syscalls cost a lot of cycles to save and reload CPU context. You basically need to push everything related to the context (registers, flags, stack pointers, etc.) to memory, then load the ones for kernel space, perform some call, and switch back. This means a lot of overhead can be saved, especially for small functions.
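If you want to put a number on it yourself, a quick-and-dirty sketch (assuming Linux and the libc crate; the absolute figures vary a lot with the CPU and with whether Meltdown/Spectre mitigations are enabled):

    use std::time::Instant;

    fn main() {
        const N: u32 = 1_000_000;

        // One round trip into the kernel per iteration.
        let t = Instant::now();
        for _ in 0..N {
            unsafe { libc::syscall(libc::SYS_getpid); }
        }
        let syscall_ns = t.elapsed().as_nanos() / N as u128;

        // A plain userspace loop for comparison.
        let t = Instant::now();
        let mut x = 0u64;
        for i in 0..N as u64 {
            x = std::hint::black_box(x.wrapping_add(i));
        }
        let loop_ns = t.elapsed().as_nanos() / N as u128;

        println!("syscall: ~{} ns/iter, plain loop: ~{} ns/iter", syscall_ns, loop_ns);
    }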
lol, I'm well informed on this topic, but thanks anyway. Maybe I didn't make myself clear in my previous post. I was questioning the motivation -- is there any existing profiling work that shows a significant amount of time is spent on context switches in these particular workloads, and how much can you save by adopting this approach?
Well, do you not recall the massive amount of moaning that took place when the spectre/meltdown patches were applied, because everyone's I/O-bound stuff (most webservers/databases) now works far slower than it used to? If we could get rid of the cache flush that occurs when a userspace process makes a syscall, there'd be a lot less heat produced at your local data center.
Interesting project. It's not safe, but I definitely like this kind of application of systems software.
It might be interesting to try running the WebAssembly in a VM's ring 0, where ring -1 can do some security routines and checks on it, but that might be beyond the scope and intent of this project.
Well, congratulations are in order, if this is the first step towards throwing out most of the OS.
Face it. Linux had its time; its conception of the world reflects a developer's wet dream from the mainframe era, with many users, few resources, and the soon-to-be-extinct dictator, err, universally hated system administrator.
Who needs elaborate permissions when you're the only user on the system? Which user still shares data without sending it, manipulating permissions? Who likes managing installs and incompatible dependencies? Hell, just copy the data, that's what we all do. The file-system dedups it anyways if necessary.
The list of unused and unwanted features goes on, but developers just keep reincarnating this same old fantasy of an anachronistic OS. It seems to me nobody but Linus can make them see again.
Linus, there's so much pain. What's the use case these days? Do you see Linux going over its own horizon, how? I suspect your answer involves a server OS.
PS: this idea of course is also reminiscent of Microsoft's Singularity OS.
Virtually all production servers in the world today are running GNU/Linux. The substantial majority of smartphones run a Linux kernel. I wouldn't say Linux has "had its time".
You may be right that the traditional permissions model is outdated, but it represents only a very small part of what Linux does, and indeed it seems totally unrelated to this project, so I'm not sure what point you're making.
To use the Android example again, it's proof that you can layer a very different permissions model (single-human-user app sandboxing) on top of the Linux kernel.
Android uses the Unix permission model heavily - sandboxing is achieved by creating a new user per app, and filesystem, network, etc permissions are managed via user and group IDs just like with Unix.
So how is it "very different" (in implementation, not surface appearance)?
Surface appearance was my point -- it's possible to create a model that's very different from the user's perspective without changing the underlying model of the kernel.
I'd just like to interject for a moment. What you’re referring to as Linux, is in fact, GNU/Linux, or as I’ve recently taken to calling it, GNU plus Linux. Linux is not an operating system unto itself, but rather another free component of a fully functioning GNU system made useful by the GNU corelibs, shell utilities and vital system components comprising a full OS as defined by POSIX.
Many computer users run a modified version of the GNU system every day, without realizing it. Through a peculiar turn of events, the version of GNU which is widely used today is often called “Linux”, and many of its users are not aware that it is basically the GNU system, developed by the GNU Project. There really is a Linux, and these people are using it, but it is just a part of the system they use.
Linux is the kernel: the program in the system that allocates the machine’s resources to the other programs that you run. The kernel is an essential part of an operating system, but useless by itself; it can only function in the context of a complete operating system. Linux is normally used in combination with the GNU operating system: the whole system is basically GNU with Linux added, or GNU/Linux. All the so-called “Linux” distributions are really distributions of GNU/Linux.