Hacker News
X86 mov is turing complete: mov-only compiler (github.com)
238 points by thejj on June 20, 2015 | 54 comments

Who needs instructions? Page faults are Turing complete: https://github.com/jbangert/trapcc

HN discussion of the paper that this is based on: https://news.ycombinator.com/item?id=6309631

That paper is so funny:

    Finding Turing-completeness in unlikely places has long been a
    pastime of bored computer scientists.

    Removing all but the mov instruction from future iterations of the
    x86 architecture would have many advantages: the instruction
    format would be greatly simplified, the expensive decode unit
    would become much cheaper, and silicon currently used for complex
    functional units could be repurposed as even more cache. As long
    as someone else implements the compiler.
Well, I guess someone has implemented it now!


> Thus, while it has been known for quite some time that x86 has far too many instructions, we can now contribute the novel result that it also has far too many registers.

That's a fun program. Similar techniques to this are used to create "ROP" (return-oriented programming) exploits. You get enough building blocks to get general purpose programming, and it's "just" a matter of mapping to those building blocks.

X86 "mov" is a bit of a cheat, though. It's a single mnemonic for what's really dozens of quite different instructions. For example, ARM has "mov" for register-register operations, but "ldr/str" for register-memory, while X86 just uses mov for everything. Even ARM mov optionally puts a barrel shifter into the path, so it's not that pure either.

Even if they weren't a single mnemonic, if you read the paper you'll find that reg = mem[reg + C], mem[reg + C] = reg, and reg = C appear to be the only 3 instruction types that are actually needed; these are only register-memory and register-constant data transfers. Even register-register isn't needed.
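To make that concrete, here's a toy interpreter in Python for a machine with only those three mov forms. The encoding, opcode names, and register names are invented for illustration; this is not the paper's machine, just a sketch of the idea:

```python
# A toy machine whose only operations are the three mov forms:
#   const: reg = C
#   load:  reg = mem[reg2 + C]
#   store: mem[reg2 + C] = reg
def run(program, mem, regs):
    for op, *args in program:
        if op == "const":            # reg = C
            r, c = args
            regs[r] = c
        elif op == "load":           # reg = mem[reg2 + C]
            r, r2, c = args
            regs[r] = mem[regs[r2] + c]
        elif op == "store":          # mem[reg2 + C] = reg
            r, r2, c = args
            mem[regs[r2] + c] = regs[r]
    return regs, mem

# Copy mem[0] to mem[1] without any register-register move:
mem = [42, 0, 0, 0]
regs = {"a": 0, "b": 0}
run([("const", "b", 0),        # b = 0 (base address)
     ("load", "a", "b", 0),    # a = mem[b + 0]
     ("store", "a", "b", 1)],  # mem[b + 1] = a
    mem, regs)
```

Note how a value can always be shuttled between registers by bouncing it through a known memory cell, which is why the register-register form is dispensable.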

Register-register movs are needed because compilers are not the world's smartest code generators, and we don't have an infinite number of registers. The biggest problem arises when some calling convention expects certain things to be in certain registers - unless you do some pretty extreme long-term planning in your compiler's register allocation code, you're pretty likely to eventually wind up with some piece of information in a register that you need to move to another register before making or returning from a call.

Of course, modern processors don't really care about register-register moves because they have hundreds of hidden intermediate registers and register renaming which makes those types of moves "free." (e.g. Haswell has 168 GP registers, only 16 of which are exposed through the AMD64 ISA.)

Note that you don't need a dumb compiler, register pressure, or an instruction with physical register constraints to require a copy. Two-address instructions suffice: simply have a register for the first operand that contains a value that is used later. A copy is required to preserve the value stomped by the mutating instruction.

In any case I believe the point of the GP is that any necessary copy can be done by going through memory, so register-register moves can be dispensed with where minimalism is the goal.

Do you have a citation for the 168 GP registers? Just wondering if Intel has "officially" stated the number they use in a given microarch.

You can find an official Intel source for Haswell in here (page 15): http://pages.cs.wisc.edu/~rajwar/papers/ieee_micro_haswell.p...

I might be misreading this, but I think it's on slide 24 of this PDF http://www.hotchips.org/wp-content/uploads/hc_archives/hc25/...

Yes, and these are effectively 3 different instructions sharing the same mnemonic. If you assemble them:

        mov     0x1000(%rax,%rax), %eax
        mov     %eax, 0x1000(%rax,%rax)
        mov     $7, %eax
You get:

        0000000000000000	8b840000100000  	movl	0x1000(%rax,%rax), %eax
        0000000000000007	89840000100000  	movl	%eax, 0x1000(%rax,%rax)
        000000000000000e	b807000000      	movl	$0x7, %eax
Which is basically 0x8b->Load, 0x89->Store, 0xb8->Constant.

ARM mov can be used as a JMP <addr in reg>. Program counter is in fact register R15 in ARM.

This changed with ARMv8; IIRC the PC can only be updated by a branch or an exception...

If you're going down that path, all of x86 is used as a 'bytecode' that turns into microcode instructions anyway in modern processors.

Not exactly, most instructions are not "microcoded" for performance reasons. But there is a class of instructions that is.

Intel CPUs translate x86 to some internal RISC-like "uops" before doing optimizations and execution. This is a widely documented fact (though Intel doesn't really talk publicly about it as far as I know): see for example §2.1 in http://www.agner.org/optimize/microarchitecture.pdf

For a (hopefully!) more accessible introduction to the crazy world of what really happens inside the x86, I did a talk at GOTO: http://youtu.be/hgcNM-6wr34 - it's a really fascinating subject!

For marketing reasons, Intel made quite a big commotion over the fact that its Pentium was "RISC-like" when it was first released, but in reality what it meant was using a decoder that could emit multiple uops in parallel instead of sequentially like it was in the 486.

Even CPUs considered RISC today, like ARMs, need to decode instructions into one or more wider uops for efficient execution.

Is there a way to estimate how many times slower a program compiled to use only "mov" would be compared with one using standard instructions? 3x, 10x, 100x, more?

It depends. In this case, I would say it would be at least 100x, probably much more. But that's because this only compiles BF (it is basically a POC after all), and thus you have to use that as an intermediate language. BF is extremely suboptimal, and the compiler here spits out a completely unoptimized one-to-one conversion of the BF source to assembly that uses only `mov` instructions. The size of the assembly thus balloons extremely quickly, simply because it takes a lot of BF to express fairly simple things. Branching using only `mov` is also impossible, so the code instead uses a flag to control where the assembly writes its results. Because of this, even when you hit the condition of a loop, if your condition check is before the loop's code, you end up running the entire loop body one more time and just throwing out the results. It's extremely wasteful.
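The flag trick can be sketched in Python. This is a simplification with invented names (`guarded_store`, `SCRATCH`), not the compiler's actual output; the point is that every store always executes, and the flag only selects where it lands:

```python
# Both arms of every "branch" always execute; an `on` flag decides
# whether each store lands in the real cell or in a scratch cell
# whose contents are simply thrown away.

SCRATCH = 0          # index of the throwaway cell
data = [0, 10]       # data[1] holds the program variable x

def guarded_store(mem, addr, value, on):
    # mov-style select: scratch when off (0), the real address when on (1)
    mem[[SCRATCH, addr][on]] = value

x_addr = 1
# Equivalent of `if x > 5: x = 99`, but no branch is ever taken:
on = int(data[x_addr] > 5)   # the comparison itself is also mov-encodable
guarded_store(data, x_addr, 99, on)
```

When `on` is 0, the exact same instructions run, but the result is quietly dumped into the scratch cell, which is exactly the "throwing out the results" cost described above.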

Based on this POC alone, it's pretty much impossible to guess how much slower a real `mov`-only compiler would be vs. a regular compiler. It depends a lot on how much of this can be optimized. Directly compiling C to this via clang or gcc would be a good start. I have no idea how much work that would take though, and compiling anything more complex than BF is a real challenge in itself because of the branching problem.

"compiling anything more complex than BF is a real challenge in itself because of the branching problem."

That's not true.

It is very hard to estimate the relative speed of a mov-only CPU (with a jump at the end of the program). The mov-only restriction applies only to how a program is input into the machine; it does not restrict the CPU architecture in any way. Obviously, the CPU would first analyze the whole mov-only program and convert it into something that is closer to x86 instructions: arithmetic instructions, unconditional jumps, conditional jumps. The success of this procedure, and ultimately the execution speed, depends on how many mov-patterns the CPU can recognize.

Compilers would need to produce code that consists of standardized mov-only code patterns. This would help keep the CPU architectures relatively simple.

Coming very close to native x86 performance seems possible, at least in theory.

Sounds like if we make a JS-to-BF transpiler, then we can use this with any language that can be transpiled to JS.

Just imagine the applications of this!

The next step would be to make this into an eventual Quine.


It's turtles all the way down.

I'm not sure if I'm more amazed by how slow that is or how fast it is.

Oh the world we live in: playing minesweeper in a graphical OS running in a javascript x86 emulator.

The Github description is misleading:

> The single instruction C compiler

I was seriously impressed until I got further down.

Well, if it reaches a point where it can compile itself, you'll get your compiler written with just one instruction (assuming that's what you were hoping for).

I meant that this software compiles brainfuck, rather than C. I was expecting a C compiler from the description.

From README.md:

    M/o/Vfuscator 2.0 is a complete C compiler, and will be available soon.
Just be patient, or submit patches :-)

Would be an interesting LLVM code generator too.

Forgive my ignorance but I thought being Turing complete was important because a Turing complete machine can in principle perform any arbitrary computation. As far as I know the program counter cannot be a destination for an X86 MOV. So you can't branch with only MOVs. How can you do an arbitrary computation if you can't branch?

Maybe you have to reorganize if(a) { b; } else { c; } as a kind of permuted b; followed by c; such that either the permuted b; or the permuted c; are effectively no-ops depending on the value of a.

You're right, they never actually do any jumping. From what I can tell, they use a variable called `on` to control where every set of commands writes its results to. They do some funky `mov` stuff to do a comparison, and then set `on` with that comparison. They use a stack of `on` variables and pop off the value of `on` from that stack when you exit a loop.

They cheat to get a jump without using anything but `mov`s: they use `mov` to perform an illegal load of the `cs` segment register, and then use the SIGILL generated to execute a signal handler, which points at the same spot the code starts at. IMO, it doesn't really count, so a `mov`-only program requires at least one jmp instruction to work.

I think loops are achieved by simply running the code to the end without saving any of the results, and then hitting the signal and going back to the beginning - ignoring all the code until you hit the loop you're executing. But the `mov` code is fairly hard to understand, so I could be wrong on that.

Could one use self-modifying code (such as rewriting the SIGILL handler) to simplify this a bit?

It also uses a single jmp instruction to branch back to the start in a loop.

Two values a and b can be compared by storing 0 into * a and 1 into * b. Then a == b if * a is 1. Then this value can be used to index into an array to provide different behavior based on if a == b.
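That trick is easy to sandbox in Python, treating a and b as small integer addresses into a scratch array (the array size and names here are illustrative, not from the paper):

```python
def mov_eq(mem, a, b):
    # Store 0 at address a, then 1 at address b. If a == b, the second
    # store overwrites the first, so loading back from a yields 1.
    mem[a] = 0
    mem[b] = 1
    return mem[a]          # 1 iff a == b, 0 otherwise

scratch = [0] * 16         # small scratch "memory"

# The 0/1 result can then index a table of continuations, which is how
# you get different behavior without a conditional jump:
branch_targets = ["not_equal_path", "equal_path"]
chosen = branch_targets[mov_eq(scratch, 3, 3)]   # -> "equal_path"
```

Every step is a plain data movement (two stores, one load, one indexed load), so the whole comparison maps directly onto the mov forms shown earlier in the thread.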

Edit: apparently HN doesn't allow me to escape * with \, which the markdown spec says I should be able to do.

> HN doesn't allow me to escape * with \, which the markdown spec says I should be able to do

So? What does the markdown spec have to do with... anything? The XML spec says you have to begin your comment with a version number declaration, but you're not doing that.

I thought it was trying to implement markdown by italicizing half my sentence when I put * in. Markdown was what popped into my mind when I saw that behavior. It would be nice if I could find a way of escaping them without having to insert spaces where I don't want spaces.

There isn't one; HN doesn't use markdown. See https://news.ycombinator.com/formatdoc

It should use markdown, though.

It's not just movs. There's an unconditional jump at the end that loops back to the beginning. Termination is accomplished by reading from address 0.

It's based on this paper: http://www.cl.cam.ac.uk/~sd601/papers/mov.pdf

It's a superset of the One Instruction Computer https://en.wikipedia.org/wiki/One_instruction_set_computer

Could be good to defeat RF snooping attacks.

"-O" enables optimizations

Replaces movs with fewer movs??


This is almost surely a dumb question, but what would happen if someone made a processor that only executed mov instructions, which would be extremely simple, and presumably would be really goddamn fast. Would that be competitive with today's fastest CPUs? If it were, what would be the advantages and disadvantages of this design?

Well, Turing machines do not have random-access memory (which changes the big-theta complexity of algorithms), and operations such as multiplication would be complicated and multi-cycle, whereas conventional CPUs can easily parallelize them.

If you were comparing the performance of mov-only programs, a specialized CPU would likely be faster. Otherwise, there is a good reason we do not actually use Turing machines.

"what would happen if someone made a processor that only executed mov instructions, which would be extremely simple, and presumably would be really goddamn fast."

The laws of physics get in the way. It's essentially the same reason why clock frequencies have stopped going up.

"a processor that only executed mov instructions, which would be extremely simple"

I believe such a CPU wouldn't end up simple due to competition for better performance. It would end up being more complex than x86 CPUs.

Might want to look at the examples. They turn an equality check into two register stores and a load.

Extremely Reduced Instruction Set Computer.

Or is it a Ridiculous Instruction Set?

When I read the title I thought "ah of course cmov is turing complete", but no, it's only mov apparently.

Which is quite amazing (and probably dog slow)

