Hacker News new | past | comments | ask | show | jobs | submit login
X86 MMU fault handling is turing complete (github.com/jbangert)
417 points by mman on Feb 22, 2013 | hide | past | favorite | 39 comments

This is more or less the greatest thing I've learned about in the last couple years.

What's happening here is that they're getting computation without executing any instructions, simply through the process of using the MMU hardware to "resolve addresses". The page directory system has been set up in such a way that address resolution effects a virtual machine that they can code to.

This works because when you attempt to resolve an invalid address, the CPU generates a trap (#PF), and the handling of that trap pushes information on the "stack". Each time you push data to the stack, you decrement the stack pointer. Eventually, the stack pointer underflows; when that happens, a different trap (#DF) fires. This mechanism put together gives you:

    if x < 4 { goto b } else { x = x - 4 ; goto a }
also known as "subtract and branch if less than or equal to zero", also known as "an instruction adequate to construct a one-instruction computer".

The virtual machine "runs" by generating an unending series of traps: in the "goto a" case, the result of translation is another address generating a trap. And so on.

The details of how this computer has "memory" and addresses instructions is even headachier. They're using the x86 TSS as "memory" and for technical reasons they get 16 slots (and thus instructions) to work with, but they have a compiler that builds arbitrary programs into 16-colored graphs to use those slots to express generic programs. Every emulator they could find crashes when they abuse the hardware task switching system this way.

Here's it running Conway's Life:


Here's their talk for a few months back:


The talk is great, but if you're not super interested in X86/X64 memory corruption countermeasures, you might want to skip the first 30 minutes.

Reminds me of "ICMP delay-line memory": http://stackoverflow.com/questions/12748246/sorting-1-millio...

The slides in the github repo ( https://github.com/jbangert/trapcc/blob/master/slides/PFLA-s... ) also have a few interesting points, like "No publicly available simulator implements this correctly" (how did they record the youtube video?) and a few vague hints about exploiting this for doing VM escapes.

> "No publicly available simulator implements this correctly" (how did they record the youtube video?)

Probably by patching the emulator.

> how did they record the youtube video?

By running it on a physical machine. Unless there is a requirement that the processor not be multitasking that I am missing.

The Life video pretty clearly shows it running in Bochs. I assume they fixed it up.

Author here: Actually, the slide 'no publically available simulator implements this correctly' is somewhat misleading (it had a follow up slide that I cut and replaced by verbal comments - I should re-add it to the PDF). What I meant is that the entire mechanism (Task switching, page fault handling, etc.) is not implemented correctly on any sim, i.e. you have subtly different behavior. This was quite a challenge in debugging this in the first place, so I actually wrote the code to work around any bochs quirks I encountered -- just so that I have a debug environment.

By fixing bugs in Bochs?

Isn't it a bit misleading to say that they are getting computation without executing any instructions? It's true that they are not executing any x86 instructions from the CPU, but the MMU is doing all the work by executing its instructions of address resolution.

Actually is it even true that they are not executing any x86 instructions on the CPU? From my understanding, the handling of page faults needs the CPU to execute some instructions. Maybe I'm wrong, if so could you enlighten me? Thanks.

I'm not very familiar with the x86 architecture, but usually when there's a fault the CPU generally attempts to lookup the address of a "callback" function in an interrupt/fault vector.

I suppose if you setup everything very carefully you can make it fault over and over again without giving it the time to execute any instruction.

Without looking into the specifics, I think it's very possible that the CPU is not actually executing any instructions, just waiting for the MMU to get a hold of itself. After all, in order to simply load the instructions you need the MMU to be responsive (or deactivated I suppose, if there's such a thing as no-MMU x86).

If I understand MMU correctly, it is merely an address translator and physical memory data fetcher. It cannot process page faults, and when it encounters one, it will have to signal the CPU, because the CPU and the OS on top of it knows how to handle faults. Even if faults are generated repeatedly, doesn't the CPU still have to execute the instructions to push the stack which is how this "instruction-less" machine works? Unless there are certain PFs where the MMU will not signal the CPU and tries to handle the fault by itself.

The MMU is not so cleanly separable from the CPU on x86.

386 has both a segment mechanism and a paging mechanism. The segment mechanism has several luxury features like automatic saving and restoring of task context. It is possible to set up a segment descriptor so that the CPU, when jumping (or faulting) to an address in the segment, will automatically save task state at one address (taken from a register) and restore task state from another address (taken from the descriptor). Ibelieve that's what they use here. Hence, free memory accesses.

Author here: While it is true that with the current implementation, memory access is extremely limited (essentially one DWORD per page, or about 0.1% of the available physical RAM) that limitation can certainly be avoided. For one, you could shift how the TSS is aligned (and align them differently for different instructions), multiplying your address space by a factor of 10 or so. Furthermore, you could also place another TSS somewhere in memory (only a few of the variables need to actually contain sane values) with an invalid EIP and use that as a 'load' instruction.

The easiest way however would be to use the TrapCC mechanism to transfer control between bits of normal assembler code (perhaps repurposed from other functions already in your kernel), doing something similar to ROP. Of course, for additional fun, feel free to throw in BX's Brainfuck interpreter in ELF and James Oakley's DWARF exception handler. We might drop a demo of this soon, i.e. implementing a self-decrypting binary via page faults.

"memory access is extremely limited (essentially one DWORD per page" – referring to non-code addresses, yes? In the current (simplest) implementation, each instruction (a TSS) must be aligned across a page boundary. You do comment below that altering alignment could increase the available code space.

I'm wondering what method PFLA uses to read/write non-code addresses. Only one address per page can be addressed? I'll take a look at the compiler.

By simply expanding the addressing capability, a very tiny program could emulate an instruction stream from memory, overcoming the limited code space (at the cost of execution speed).


>Move, Branch if Zero, Decrement

This is basically the canonical instruction for OISCs (one instruction set computers). Wikipedia describes it pretty well: https://en.wikipedia.org/wiki/One_instruction_set_computer#S....

Another place for root kits to hide.

Could one write a preemptive rootkit that sniffs for other rootkits?

Would that slow things down incredibly?

That's called an antivirus :)

can you please elaborate?

It's computation that you can't see with a debugger, or with any sort of tracing.

You can't see the code, but surely it can't do anything useful—i.e. change anything outside its extremely limited memory.

If you read through the slides @ https://github.com/jbangert/trapcc/blob/master/slides/PFLA-s... there is potential to bypass hardware memory protection. Very interesting.

I must have missed the limited ram access when I browsed the slides the first time, I just assumed the youtube Game of Life proved they were poking ascii 'X' chars into video ram at 0xb8000 or whatever. Is there a trick/cheat to how that visualization was done?

Edit: browsed around the code some more and it seems the ascii visualization stuff is done in regular c/asm polling the "virtual game of life" mem(?)

Yes, there is a cheat, it moves values from RAM to VGA ...

Thanks for clarifying, still extremely impressive!

I wonder if you'll receive any fun comments from Intel/AMD/etc. :)

well, that's misleading.

The spec says "don't put a tss cross page boundary". This might also have been the case in some x86 implementations, but surely they fixed that once that hardware virtualization was implemented.

They actually depend on the sane behavior: they trigger the double fault when the cpu spills the tss across a page boundary.

They mention the manual because according to the manual their idea wouldn't work, but it does.

Then they also say "We should test it", and the result of the test is "CPU translates DWORD by DWORD" (i.e. doesn't miss a byte^H^H^H^Hword) And then a nice kitty to show how relieved they were from the good news! No dragons in the way.


(Unless I got all wrong, but this topic is not the simplest one to reverse engineer from some slides, did anybody find a presentation video? It would really help to hear somebody commenting those slides, otherwise is a nested puzzle. And where is the patched qemu so one can play with that?)

Shmoocon will release the video of the presentation in a bit, until then here is our talk at CCC http://www.youtube.com/watch?v=NGXvJ1GKBKM

Ahh, ok. Might actually be enough to, say, copy an encryption key out of kernel memory then?

Most of the techniques described in the slides require ring-0 privileges (replacing descriptor tables and page tables etc). If you have those privileges, you can copy what you want anyway.

Unless the encryption key is guarded by something with SMM privileges -- has that been done?

Well the original idea was a rootkit, which traditionally requires ring-0 privileges to install in the first case.

How fast (slow) is this relative to the host CPU?

Probably incredibly slow given the reduced instruction set and that it relies on context switches/pushing stuff on the stack for functioning.

This is really interesting. In a way, it's a form of computer self-replication. Could the virtual machine created by the computer be considered offspring?

Is there a way the virtual machine might spawn another virtual machine child of its own?

if you like this kind of things there is also:


That the hardware version of the brainfuck philosophy.

Best explanation ever. I second this. lol

Expect this technique in the future malwares and software protection DRM systems for making code analyzing harder.

somebody checked in vim backup files :-)

Applications are open for YC Summer 2021

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact