Antikernel: A Decentralized Secure Operating System Architecture (2016) [pdf] (iacr.org)
88 points by ingve 42 days ago | 32 comments

Here's the GitHub repo: https://github.com/azonenberg/antikernel

I was curious about the state of the project and whether any source was available. Thought others might feel the same way.

Interesting and wildly impractical :D

For this to fly, it would take a ground-up rewrite of the fundamental principles of hardware.

For example, how on earth would paging work in such a world? Who or what would be in charge of swapping memory?

How would a system admin go about killing an errant process?

How would a developer code on such a machine, which allows no inspection of remote application memory?

Beyond that, it seems like instead of decreasing the amount of trust needed to function, it increases it with every hardware manufacturer, all while making security holes harder to fix (since they are now part of the hardware itself).

Imagine, for example, your ram having a bug that allows you to access memory regions outside of what you are privileged to do.

It's a nice idea, but I think Google's Zircon microkernel approach is far more practical while being nearly as secure and not requiring special-purpose hardware.

That was the point of the research: throw away how everything has been done and explore what a clean slate redesign would look like if we had the benefit of 30-40 years of hindsight.

The easiest way to implement transparent paging would be to have a paging block that exposed the memory manager API (allocate, free, read, write) and would proxy requests to either RAM or disk as appropriate. But since I was targeting embedded devices, it was assumed that you would explicitly allocate buffers in on-die RAM, off-die RAM, or flash as required. The architecture was very much NUMA in nature.
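To make the paging-block idea concrete, here is a minimal sketch in Python. It is not taken from the Antikernel source: the class name, the FIFO eviction policy, the page size, and the dictionary-backed "disk" are all assumptions made for illustration. It only shows a block that exposes allocate/free/read/write and decides internally whether a page currently lives in RAM or on the backing store.

```python
PAGE_SIZE = 4096  # assumed page size, for illustration only


class PagingBlock:
    """Hypothetical proxy: same four calls as a memory manager, backed by RAM plus disk."""

    def __init__(self, ram_pages):
        self.ram_pages = ram_pages  # how many pages fit in RAM
        self.ram = {}               # page id -> bytearray (resident, insertion-ordered)
        self.disk = {}              # page id -> bytearray (swapped out)
        self.next_id = 0

    def _make_room(self):
        # Evict the oldest resident page (simple FIFO policy) when RAM is full.
        if len(self.ram) >= self.ram_pages:
            victim = next(iter(self.ram))
            self.disk[victim] = self.ram.pop(victim)

    def _resident(self, page):
        # Bring the page back from disk if it is not in RAM.
        if page not in self.ram:
            data = self.disk.pop(page)  # KeyError if the page was never allocated
            self._make_room()
            self.ram[page] = data
        return self.ram[page]

    def allocate(self):
        self._make_room()
        page = self.next_id
        self.next_id += 1
        self.ram[page] = bytearray(PAGE_SIZE)
        return page

    def free(self, page):
        if self.ram.pop(page, None) is None:
            del self.disk[page]

    def read(self, page, offset):
        return self._resident(page)[offset]

    def write(self, page, offset, value):
        self._resident(page)[offset] = value
```

A caller never sees where a page lives: with `PagingBlock(ram_pages=2)`, allocating a third page silently pushes the oldest one to the backing store, and a later `read` on it pulls it back in.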

The prototype made for my thesis only allowed the "terminate" request to come from the process requesting its own termination. It would be entirely plausible to add access controls that allowed a designated supervisor process to terminate it as well.

Remote memory can be inspected, but only with explicit consent. I had a full gdbserver that would access memory by proxying through the L1 cache of the thread context being debugged. (Which, of course, had the ability to say no if it didn't want to be debugged).

The goal was to have a system that had a very small number of trusted hardware blocks which did only one thing, then build an OS on top of it microkernel style - except that all you have is unprivileged servers, there's no longer any privileged software on top.

What was your plan to address things like hardware bugs/errata/updates? How could someone initiate a firmware flash in this sort of system?

Keep the hardware design as simple as possible, formally verify everything critical, then run the rest as userspace software and hope you got all the critical bugs.

The point wasn't to move everything into silicon, it was to move just enough into silicon that you no longer needed any code in ring-0.

As an example, the memory controller's access control list and allocator was a FIFO of free pages and an array storing an owner for each page. Super simple, very few gates, hard to get wrong, and easy to verify.
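A software model of that structure might look like the sketch below. This is an illustration in Python, not the actual RTL: the class and method names are hypothetical, and the requester is assumed to be identified by an integer NoC address. The only state is the two things described above, a FIFO of free pages and an owner array.

```python
from collections import deque


class MemoryController:
    """Hypothetical model: a free-page FIFO plus one owner entry per page."""

    def __init__(self, num_pages):
        self.free_pages = deque(range(num_pages))  # FIFO of free page numbers
        self.owner = [None] * num_pages            # owner[page] = NoC address or None

    def allocate(self, requester):
        # Hand out the page at the head of the FIFO and record who owns it.
        if not self.free_pages:
            return None
        page = self.free_pages.popleft()
        self.owner[page] = requester
        return page

    def free(self, requester, page):
        # Only the current owner may return a page to the FIFO.
        if self.owner[page] != requester:
            return False
        self.owner[page] = None
        self.free_pages.append(page)
        return True

    def may_access(self, requester, page):
        # The entire access check is a single array lookup and compare.
        return self.owner[page] == requester
```

For example, after `p = mc.allocate(0x8003)`, `mc.may_access(0x8003, p)` is true while `mc.may_access(0x8004, p)` is false.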

What about the more complex hardware such as the CPU? There are plenty of opportunities for mistakes there, some not so obvious (such as Spectre attacks). And I can't imagine you'd get away with completely isolating it like the memory.

My long term plan was actually to do a full formal verification of the CPU against the ISA spec and prove no state leakage between thread contexts, but I didn't have time to do that before I graduated.

I deliberately went with a very simple CPU (2-way in order, barrel scheduler with no pipeline forwarding, no speculation or branch prediction) to minimize opportunities for things to go wrong and keep the design simple enough that full end-to-end formal verification in the future would be tractable. Spectre/Meltdown are a perfect example of attack classes that are entirely eliminated by such a simple design.

I was targeting safety critical systems where you're willing to give up some CPU performance for extreme levels of assurance that the system won't fail on you.

At what point does it become easier to formally verify the software than the hardware? Certainly it's easier to change the software than the hardware, so there is already incentive to not move things into hardware that need not be there.

Since software is so easy to change, nobody bothers verifying it. The more we put into software, the more room there is for cutting corners in testing because, "we can always send out a firmware upgrade later." Me, I'd rather go back to the old days when updates to hardware were expensive so you better get it right.

It's like writing with a ball pen vs a pencil. But you could be just as careful with a pencil. I'm just wondering if what you are proposing is the equivalent of runes.

> To create a process, the user sends a message to the CPU core he wishes to run it on

What happens if the user doesn't care which CPU the process runs on?

This is certainly an interesting execution model. I think this should be adopted for IoT & given the simplicity of devices there I think it's a perfect fit. I do worry that this may increase the BOM & make this model only cost-effective for small runs. The code is generally a "fixed" upfront cost amortized over the number of units you sell (the more units you reuse the software across, the cheaper that software cost to develop) whereas more complex hardware is a constant % increase for each unit (ignoring the costs for building that HW).

I think traditional CPUs/operating systems may pose additional obstacles to adoption & I'm interested to hear the author's thoughts on this. Ignoring the ecosystem aspect (which is big, but let's assume this idea is revolutionary enough that everyone pivots to making this happen), how would you apply security & resource usage policies? Let's say I want to access the file at "/bin/ps". Traditionally we have SW that controls your access to that file. Additionally, we can choose to sandbox your process' access to that file, restrict how much of the available HW utilization can be dedicated to servicing that, etc etc. If we implement it in HW, is the flexibility of these policies to evolve over time fixed at manufacturing time? I wonder if that proves to be a serious roadblock. In theory you could have something sitting in the middle to apply the higher level policy, but I think that's basically going to effectively reintroduce a micro kernel for that purpose. You could say that that could be implemented, in this example, at the filesystem layer (i.e. the thing that reads for example EXT4 & is the only process talking to that block device), but resource usage would prove tricky (e.g. if a block device had an EXT4 & EXT3 partition, how would you restrict a process to only occupy 5% of the I/O time on the block device?).

I was targeting industrial control systems, medical implants, and other critical applications. So a lot of the constraints of desktop/mobile/server computing just don't apply.

The assumption was that you'd end up with something close to a microkernel architecture, but without any all-powerful software up top. A purely peer to peer architecture running on bare metal.

Yup. I've worked with all kinds of CPUs so I can definitely see it in more embedded applications.

I skimmed the paper but I couldn't get a sense of how the CPU scheduler works. What causes the CPU to switch between threads? A common need that I've seen, even in the more limited applications you're describing, is preemption & thread priorities. For example, a driver thread may need to preempt other threads that may be running or we want more of the CPU to be devoted to running some thread even when all threads are runnable. How is this handled?

Re blocking of threads, is the only way they get blocked because they issue an RPC request & thus the CPU then magically knows to switch to a different thread? Is there any other blocking operation that might require needing to switch threads & if so how would that be accomplished?

Another thing I've experienced is that not all HW is DMA'able for cost or space reasons & thus you frequently find such I/O busses to be implemented via bit-banging (e.g. IIRC sometimes you might have 2 UARTs & only 1 can be used with DMA & the other must be bit-banged). Is that compatible with your approach with all the same guarantees or does this 100% require every I/O system to be part of the DMA network? Or maybe your DMA design is more cost-effective than how CPUs deal with I/O today?

The CPU used in the thesis is a barrel processor. There is a forcible context switch every clock cycle, equal priority round robin for now although more sophisticated prioritization schemes could be used.

The advantage of a barrel scheduler is that you never can have more than one pair of instructions in the pipeline from the same thread at a given time, so data hazards are impossible and all of the checking/forwarding logic can be entirely absent from the CPU.

You lose single-thread performance with this vs more conventional hyperthreading, but it's much simpler to implement.
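The round-robin property can be checked with a toy simulation. The sketch below is an assumption-laden model, not the thesis CPU: it issues one slot per cycle (the real core is 2-way, so each slot would hold a pair), and the thread count and pipeline depth are made-up parameters. It reports the largest number of in-flight slots that ever belonged to a single thread.

```python
from collections import deque


def max_slots_per_thread(num_threads, pipeline_depth, cycles):
    """Simulate strict round-robin issue and return the worst-case number of
    pipeline stages simultaneously occupied by one thread."""
    pipeline = deque(maxlen=pipeline_depth)  # thread ID held in each stage
    worst = 0
    for cycle in range(cycles):
        thread = cycle % num_threads  # forced context switch every cycle
        pipeline.append(thread)       # oldest stage retires automatically
        worst = max(worst, max(pipeline.count(t) for t in set(pipeline)))
    return worst
```

In this model, as long as there are at least as many thread contexts as pipeline stages, the result is 1, which is why hazard checking and forwarding between a thread's own instructions are unnecessary. With fewer threads than stages (for example 2 threads and 5 stages) the same thread shows up in several stages at once, and the guarantee no longer holds.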

This reminds me of Qubes Air, which I think will be implemented earlier: https://www.qubes-os.org/news/2018/01/22/qubes-air/

Decentralizing control over resources is a pretty interesting idea!

Sad choice of name though. If you're anti something, the biggest your ideas can reach is the size of what you are against.

One might, for instance, explore the "embedded systems" world and how one wants to distinguish oneself from it as well. I would say, embedded + everything that runs a kernel is a bigger scope already.

So, what do you want to stand for instead of against?

It is an interesting idea, but I think it would be easy to simply spam/scan the handle space to elevate privilege. I couldn't find any mention of how handles/capabilities are protected from impersonation.

[Update] clearly it wasn't as simple as I thought, thanks!

NoC addresses are implicitly part of the handle. So for example, a pointer isn't just physical address 0x41414141. The full capability is "physical address 0x41414141 being accessed by hardware thread 3 of CPU 0x8000" because when you dereference that pointer, it creates a DMA packet from the CPU's L1 cache with source address 0x8003 addressed to the NoC address of the RAM controller.

If malicious code running in thread context 4 tries to reach the same physical address, the packet will come from 0x8004. Since the RAM controller knows that the page starting at 0x41414000 is owned by thread 0x8003, the request is denied.

The source address is added by the CPU silicon and isn't under software control, so it can't be overwritten or spoofed. If the attacker were somehow able to change the "current thread ID" registers, all register file and cache accesses would be directed to that of the new thread, and all memory of their evil self would be gone. Thus, all they did was trigger a context switch.
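The example in this comment can be sketched in a few lines of Python. This is a hedged illustration only: the class names, the dictionary packet format, and the 4 KiB page mask are assumptions, while the addresses (CPU 0x8000, threads 3 and 4, pointer 0x41414141) come from the comment above. The key point is that software supplies only the pointer; the source address is filled in by the (modeled) hardware.

```python
PAGE_MASK = ~0xFFF  # assumes 4 KiB pages, so 0x41414141 falls in page 0x41414000


class RamController:
    """Hypothetical RAM controller that checks the packet's source address."""

    def __init__(self):
        self.owner = {}  # page base address -> NoC address of the owning thread

    def grant(self, page_base, noc_addr):
        self.owner[page_base] = noc_addr

    def handle_read(self, packet):
        # Allow the access only if the stamped source owns the page.
        return self.owner.get(packet["addr"] & PAGE_MASK) == packet["src"]


class Cpu:
    """Hypothetical CPU whose L1 cache stamps every request with the thread's address."""

    def __init__(self, noc_base, ram):
        self.noc_base = noc_base
        self.ram = ram

    def load(self, thread_id, addr):
        # The source field is derived from the running thread, never from software.
        packet = {"src": self.noc_base | thread_id, "addr": addr}
        return self.ram.handle_read(packet)
```

With the page at 0x41414000 granted to 0x8003, `cpu.load(3, 0x41414141)` is allowed and `cpu.load(4, 0x41414141)` is denied, even though both threads hold the same pointer value.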

This is awesome.

In the doc, they claim:

> Antikernel is designed to ensure that following are not possible given that an attacker has gained unprivileged code execution within the context of a user-space application or service: ... or (iv) gain access to handles belonging to another process by any means.

There are only 16 address bits; that's a very small space to search.

Which is why the interconnect fabric adds the source address based on who made the request.

Even hardware IP blocks don't have the ability to send packets from arbitrary addresses. You can send to anywhere you want, but you can't lie about who made the request. Which allows access control to be enforced at the receiving end.

That "something you are" as well as "something you know" factor prevents any capability from being spoofed, since the identity of the device initiating the request is part of the capability.

I guess I am not understanding something still about this spoof protection.

What prevents me, an attacker, from making (or simulating) a bunch of hardware with different identities? Perhaps by setting up my hardware to run through multiple different 'self' identities and then trying to talk to other hardware with each disguise.

If the address space is small and cheap, it seems that I could quickly map any decentralized hardware at low cost.

This is a SoC; addresses of on-chip devices are fixed when the silicon is made.

The source address is added to a packet by the on-chip router as it's received. You can send anything you want into a port, but you can't make it seem like it came from a different port without modifying the router.

This model allows you to integrate IP cores from semi-untrusted third parties, or run untrusted code on a CPU, without allowing impersonation.

Let me put it another way: I ran a SAT solver on the compiled gate-level netlist of the router and proved that it can't ever emit a packet with a source address other than the hard-wired address of the source port. Any packet coming from a peripheral into port X will have port X's address on it when it leaves the router, end of story.
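The router behaviour described here can be modeled as follows. This is only a Python sketch of the property, not the verified netlist, and it proves nothing by itself: the class name, packet layout, and port addresses are hypothetical. Whatever source address a peripheral writes into a packet, the router overwrites it with the hard-wired address of the port the packet arrived on.

```python
class Router:
    """Hypothetical on-chip router with one fixed NoC address per port."""

    def __init__(self, port_addresses):
        # Fixed at "manufacturing" time; a tuple so it cannot be changed later.
        self.port_addresses = tuple(port_addresses)

    def forward(self, port, packet):
        # Copy the packet and overwrite the source with the ingress port's address.
        stamped = dict(packet)
        stamped["src"] = self.port_addresses[port]
        return stamped
```

For instance, a device on port 1 of `Router([0x8000, 0xC000])` that sends a packet claiming `src` 0x8000 still has it delivered with `src` 0xC000, so the receiver's access check sees the true sender.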

One of the problems with microkernels is performance - messages are being sent between parts and that takes time. Here we have sort of an ultimate microkernel - can we have enough hardware support to make performance competitive, while keeping the main advantages?

This architecture also borrowed a lot from the exokernel philosophy, which would actually have provided significant speedups by cutting unnecessary bloat and abstraction.

My conjecture was that this would make up for most of the overhead, but I didn't have time during the thesis to optimize it sufficiently to do a fair comparison with existing architectures.

Andrew, I think it's a wonderful project, but meaningfully evaluating it definitely takes way more than an hour to figure out what is done here, why, and how :) Reading articles is hard. Thanks for your work!

I can see how this design might isolate allocated resources, but how does this design prevent a malicious agent from just requesting all free resources and DoS'ing the system other than by having a fully pre-allocated static configuration?

I had some ideas for quotas but never got to the point of fully building that part of the system.

This is a neat idea, one that I’ve been wanting to explore recently. The devil is in the details for implementation practicality though.

One question is: how does this compare with micro-kernel OSes? People thought that was the future even before Linux was invented.

My guiding design principle was to take a trendline from monolithic kernels to microkernels, then keep on going until you had no kernel left.

Interesting ideas, this could be very useful for the IoT space.
