Hacker News new | past | comments | ask | show | jobs | submit login
Weird things I learned while writing an x86 emulator (timdbg.com)
218 points by azhenley 46 days ago | hide | past | favorite | 72 comments

This is a good list! Another fun quirk: because x86 is a register-memory architecture and allows all kinds of variants of reg/mem operand encodings, there are a handful of equivalent encodings with exactly the same lengths (and just slightly different ModR/M bytes). You can take advantage of this to do software fingerprinting or, in my case, steganography without changing an executable’s size or semantics[1].

[1]: https://github.com/woodruffw/steg86

The (in)famous A86/A386 assembler claimed to use this technique to identify code produced with it. As far as I know, it was just a simple inversion of reg-reg ModRMs depending on the register values being used, although I didn't test it too exhaustively.

Speaking of things with the same names, what happened to Linux a386 and what the heck was it even? I can't find a link nowadays, but it was something like User Mode Linux, but not really.

Used the exact same trick when researching dumb ways to break AV signatures at uni :)

You can also abuse the displacement math with eiz:

  ff 34 24 push   DWORD PTR [esp]
  ff 34 e4 push   DWORD PTR [esp+eiz*8]
Or some useless prefixes may work:

  80 c0 53 add    al,0x53
  36 04 53 ss add al,0x53

This is the first time I learned about eiz. Cool trick!

What’s eiz: https://stackoverflow.com/a/2553556/3125367

You won't find "eiz" in any Intel manual, and NASM doesn't recognize it either. x86 has no zero register¹.

This appears to be some GNU-specific syntax meaning "encode SIB byte with no index register". The only case where this would be required by the hardware is when using ESP/RSP as a base register, and every assembler should produce the correct encoding for that if you simply write [ESP].

So using "eiz" on GAS lets you control what is put into the (unused) scale field. One might call that a feature, but it is a meaningless encoding detail similar to which variant of "register to register" opcodes is emitted, something that I don't think any assembler gives you control over.

¹ except maybe on the microarchitectural level, but that isn't visible to the programmer

If I remember correctly there are also tricks like

  lea rax, [eiz+rbx*4]
instead of

  lea rax, [rbx*4]
because the latter will produce an heavier binary by being interpreted as

  lea rax, [00000000h+rbx*4]

The article itself exposes some encoding redundancy too: "[...] the count is masked against 1FH, essentially using only the lowest five bits".

So each rotation instruction is 3 bits free for out-of-band information.

Would love to learn more and help with this project if you need extra work done

It’s relatively feature complete, but there are some ideas for increasing the steganographic capacity listed in the README and issues! In particular, we could use the flexibility of the multi-byte NOP sequences to hide some more information.

What are some interesting use-cases for information hiding in the binaries?

Embed your customer's serial number in the binary at download time. When the file ends up on The Pirate Bay, send them a bill.

This is a huge issue in the cloud computing space.

“Golden Images” are often used to try to counter exploits hidden in the binaries of cloud images used by containers, etc

Why would I want to hide a message beside a program?

I think this has mostly curiosity/show off value. Encoding messages into executable binaries without changing their running behavior is kinda cool.

It gives you room to try to cause a hash collision. Could be used as a vector if a broken checksum is used.

Maybe to be able to trace individual copies of a piece of software back to a licensee.

It’s primarily a cute trick.

An ADD instruction will update the carry flag but the INC instruction does not!

Neither does DEC, and I believe the original reason for this was multiple-precision arithmetic routines --- you need to propagate the carry flag, but also update loop counters and pointers.

It seems that the actual behavior of the undefined flags is related to the internal implementation of the shift operation, and is different between different architectures.

According to https://www.sandpile.org/x86/flags.htm which unfortunately hasn't been updated nor is exhaustive, all of the P6 family behave the same, but different from the P5 and the P4, and probably the earlier generations too. "Undefined" values are often used by anti-debugging/anti-emulation/VM detection code to determine if the CPU is real hardware or not, so it's actually quite important to emulate them correctly.

IIRC from our similar code, the LEA (load effective address) instruction is useful for doing adds / subtracts of arbitrary integers without updating the flags.

Flags turn out to be quite the annoyance for the kind of in-process virtualization needed by Time Travel Debug. You need to instrument code with minimal overhead so, on the one hand, you don't want to save/restore flags all the time .... And on the other hand it still all has to work when flags get used.

Yes, saving and restoring flags is very expensive. I thought about talking about that in the article but figured that was too much of a detour.

Darek Mihocka wrote a really interesting article about how to optimize flag calculations in an x86 emulator:


Although looking at your username I suspect you may have read this one before...

I had not read it before! Thanks for the link. I don't get too involved with our JIT other than to occasionally peer into some code and go "ooooh"

Multiplication even.

This behaviour goes all the way back to (at least) the Z80 and i8080, and I bet it was intentional for exactly the reason you describe (updating loop counters inbetween a sequence of ADC/SBC instructions without destroying the carry flag from the previous ADC/SBC).

> "Undefined" values are often used by anti-debugging/anti-emulation/VM detection code to determine if the CPU is real hardware or not, so it's actually quite important to emulate them correctly.

That seems quite brittle, unless all real CPUs implement them the same way. If they do, one has to wonder whether it's really undefined or an undocumented part of the x86 spec instead.

I'd guess it's "undocumented", not "undefined". Don't know how the situation is on x86, but on the Z80 there were indeed some slight differences in undocumented behaviour between CPU vendors, but those are so obscure that it hardly affected any real world code (they affected the undocumented flag bits 3 and 5, and their behaviour was only properly 'decoded' in the 2000's (https://github.com/floooh/emu-info/blob/master/z80/memptr_en...).

The only CPU I know with actual 'undefined' behaviour is the 6502 for some of the undocumented/illegal opcodes which can yield different results based on things like current CPU temperature (see the ANE/XAA instruction description: https://www.masswerk.at/nowgobang/2021/6502-illegal-opcodes)

> According to https://www.sandpile.org/x86/flags.htm which unfortunately hasn't been updated nor is exhaustive

Do you know if there's a more exhaustive source (besides the official manuals)?

> You can add quite a few until you get to 15 bytes. This length is a hard limit on current x86-compatible CPUs. Any instruction longer than 15 bytes is considered invalid and will generate an exception.

There's a few valid 16 byte instructions though.. Sandpile lists a few examples: https://www.sandpile.org/x86/opc_enc.htm

  36 67 8F EA 78 10 84 24 disp32 imm32 =  bextr eax,[ss:esp*1+disp32],imm32
  64 67 8F EA F8 10 84 18 disp32 imm32 =  bextr rax,[fs:eax+ebx+disp32],imm32

The longest structurally valid x86 instruction is 26 bytes, from some research I did a few years ago[1]. But as others have noted, structurally valid does not mean that any x86 CPU will accept them: they’ll all produce #UD or similar, including these 16 byte ones.

[1]: https://yossarian.net/res/pub/mishegos-langsec-2021.pdf

You can just keep sticking prefixes on an instruction to get something longer, no? The processor will refuse to decode it but it's "legal" otherwise.

Ostensibly, only one prefix from each group can matter. Although I did notice that XACQUIRE and LOCK are both group 1 prefixes, which makes that statement kind of a lie (but it's an intentional design to make XACQUIRE do nothing on processors that don't support it).

In any case, there's only a finite number of prefixes you can meaningfully stick onto an instruction, and repeating the same prefix will do absolutely nothing.

Yep: in that sense the longest structurally valid x86 instruction is unbounded in length, so long as you conform to the group restrictions.

What is it? I can't seem to find it in the paper.

There are many structurally valid 26 byte productions; the diagram on page 2 shows the layout and page 5 has pseudocode for generating them.

(There may be even longer valid productions; my analysis was pretty naive. But 26 is already substantially longer than the limit!)

Thanks, I saw that, but I was hoping for explicit examples.

Figure 6 has examples (in the context of instruction generation). But note that those instructions are “nonsense,” in the sense that they’re guaranteed to either #UD at 15 bytes or well before then.

Ah, thanks, I missed that! Kind of disappointing that it doesn't give the instructions' dissassembly (so to speak -- obviously as you point out in reality they are nonsense), but I suppose I can sit down and figure that out myself. :)

Valid meaning "valid according to spec" or "actually executable by processors in practice"?

Going by the linked webpage, the answer appears to be "valid if we ignore the limit, but in fact the limit is enforced and you'll get an exception".

Assuming this limit has even been the cause of a vulnerability before: https://lists.gnu.org/archive/html/qemu-devel/2017-03/msg038...

That's really interesting. I had no idea there were structurally valid instructions that are longer than 15 bytes.

I thought this was going to be about a toy emulator/hobby project, but it's apparently about an emulator used in Time Travel Debugging!

Ha, I was just reading some of the discussion and thinking it sounded quite similar to the JIT we use in our own (Linuxy) Time Travel Debugger.

It's similarly am x86-on-x86 JIT / emulator but (and I'm sure WinDbg's TTD is similar) most of what you want to do there is just code copying for any instructions that don't need special instrumentation.

And you want to run entirely in the cache of JITted code so you're close to native speed, rather than exit to C code and make decisions.

Funny enough, we tried going down the road of doing a JIT, but for our use cases pure emulation was often fast enough and it wasn't worth the extra complexity to do JIT. Part of that is probably due to the architecture of how TTD works, but part of it is how efficient we were able to get with the emulation mode. Later, a different emulator was created for running x86 code on ARM64, and that one uses an extremely efficient JIT. (I didn't get to work on the x86-on-arm64 emulator though. Would have been fun)

Working on the JIT in TTD was a lot of fun though, and it's a shame it didn't make sense to keep it around.

Very impressive you got such good performance out of straight emulation! And WinDbg TTD runs with truly parallel threaded execution within one process, which we do not (yet).

I'm surprised a JIT wasn't worth it but, from what I'm aware of, I can see a few reasons why the trade offs are different.

Very cool! Writing emulators for simple CPUs is challenging enough because it's extremely easy to overlook a quirk or implement a flag calculation incorrectly that is used infrequently and have it silently lurking like a mine waiting for the rare piece of code to trip it. x86 is very complex. Instruction decoding alone looks like a nightmare. I guess the one saving grace here is that since most of us still run on x86, side-by-side validation is a lot easier.

Yes, that made it 100x easier, because I was able to write unit tests that literally did a single step over the instruction and compare the register context to the emulated register context. The single step approach didn't work for all instructions (and didn't work for undefined flags) but was extremely effective and let me brute force a whole lot of testing.

> Other instructions leave some of the flags in an undefined state, although in some cases they are conditionally undefined. > I’ve heard that the Atom line of CPUs used a cheaper/slower way of doing bit shifts in the ALU, which results in the undefined flags having different values, although I haven’t tested this myself.

That seems... problematic. Do compilers know this and just ignore the flags if/when they use these instructions?

Yes. In general, compilers only test emit flag testing instructions after instructions that are explicitly defined as having flag-modifying semantics.

(This is a common feature of ISAs, and makes sense in terms of component reuse: it allows you to cheaply reuse the ALU, for example, without needing extra wires/control signals telling it not to update the flag state.)

> compilers only emit flag testing instructions after instructions that are explicitly defined as having flag-modifying semantics.

I think compiler writers sometimes also have to know that instructions do not modify the flags.

Such instructions can be inserted between flag-setting ones and the test of that flag.

Yes, good point! It would have been more accurate for me to say “the compiler should always have a model of the flags during lowering.”

Nice article, but nothing really surprising when you follow the progression starting from 8080.

One correction though:

> EAX is called the “Accumulator register” is not just a convention, it actually makes a difference to the encoding (and potentially the performance, as a result)

No. As the decoder doesn't decode byte-by-bytes, but whole strings together and registers are all renamed meaning that %eax isn't really different from %ebx etc. the only difference is that it saves one byte of code space which is an extremely minor density improvement and you'd have to make a very contrived example to be able to measure the difference.

My point is the smaller encoding is what gives you the performance benefit. When done across an entire function/module, you can get a measurable increase in hit rate for the instruction cache. Not that the instruction itself is faster. Apologies if that wasn't clear.

No, I got that point and it's vanishingly small. It _only_ affects miss rate (ie. does nothing if you are hitting in the cache) and the effect on miss rate is minuscule.

This reminds me a lot of the defcon 25 talk by xlogicx about assembly being too high level.


Has anyone used TTD, or rr for that purpose in Linux world, to debug a complex multi-threaded application? What is the main use-case of these tools?

For example, limitations of rr seem to suggest that it is almost of no use for multi-threaded programs so I have never actually tried it. I don't know about the TTD though.

  > rr limitations

  > ...

  > emulates a single-core machine. So, parallel programs incur the slowdown of running on a single core. This is an inherent feature of the design.

It can still debug multi-threaded programs, but if your bug relies on multi-core interactions it won't appear under rr (you can still catch a lot of race conditions though, and it has a 'chaos mode' which generate irregular scheduling intended to increase the likelyhood of such bugs appearing). It's been used to debug race conditions as well as other bugs in firefox, for example.

That's what I thought as well, limited to the single-core race conditions era. Thanks for the evidence.

It's still much better than the alternative, which is not having any kind of way to reproduce race conditions :-) QEMU's a pretty complex multithreaded program and my experience with chaos mode has been that it's quite good at tripping up the races (and you only need to catch the race once).

It's not limited to any era - it's a general-purpose tool that can perfectly reproduce bugs, even most race conditions. The lack of multi-threading will mostly just slow things down. Being able to perfectly reproduce a bug is immensely valuable.

I didn't say anything about the usefulness of rr other than in the context of concurrency bugs so I don't quite understand a defending attitude of yours.

> limitations of rr seem to suggest that it is almost of no use for multi-threaded programs

Only if your main use case is debugging race conditions only possible with multiple cores, which is a tiny subset of all bugs.

I use rr all the time for a complex multi-threaded (although not extremely parallel) application and it works wonders. It frequently saves me hours of debugging. Practically none of the issues I use it for are race conditions (not because it doesn't work well for those, but because I rarely get any).

Even if I had to suffer an 8x slowdown from running it on a single core, it would still be worth it nearly every time.

(Disclaimer: rr maintainer.)

As others have said here, in practice rr works very well for debugging a wide range of bugs in multithreaded programs, including race conditions.

Where it falls down:

* It can only use a single core, so highly parallel programs run very slowly when recorded by rr.

* Some race conditions may be difficult or impossible to reproduce under rr recording (but rr's chaos mode helps a lot with this).

* rr imposes sequential consistency on the recorded program, so bugs due to weak memory models do not show up under rr. Such bugs are pretty rare on x86 (because x86's memory model is pretty strong); this may be more of an issue on ARM.

TTD (the WinDbg one) works very well for complex multithreaded apps. (The caveat being the performance hit you get from emulation). It's one of the big advantages it has over rr.

The main use case I saw for TTD was debugging complex memory corruption issues. Certain types of issues like stack corruption became trivial to debug under TTD. It was also very useful for capturing a repro. If a customer complained about something and I couldn't immediately reproduce it or get a crash dump, I'd ask them to record a TTD trace. More than 75% of the time I'd say it was enough to root cause the bug, without spending tons of time figuring out the repro steps.

Fascinating list, and really well presented and written! Lots of stuff I've never read about before.

What resources did he use for this research?

I would expect at least a reference to an Intel reference manual, but perhaps there are better ways to learn about the ISA.

There are reference manuals with a lot of details. Specifically you'd want to read the "Intel® 64 and IA-32 Architectures Software Developer’s Manual", Volumes 2, which includes specifications of each instruction including pseudo-code of how the processor executes them. You can freely access a PDF of the entire manual here: https://www.intel.com/content/www/us/en/developer/articles/t...

But there's a lot of details there — that volume alone spans about 4000 printed pages in the manual — so you're bound to mess things up.

I've written emulators for simpler instruction sets (RISC-V and the 6502) and found that the best way to ensure they work correctly is to do instruction-by-instruction comparisons against a reference. The good news is that if you've got a PC computer, you've got easy access to a reference machine! So you can do side-by-side comparisons where you set up an initial state on both a real machine and an emulated machine, then execute some code on both, and compare the final states. By using the trap flag you could also single-step through the instructions to extract the intermediate states after each instruction and ensure they match the emulated state.

So that's my guess about how he did his too. It's painstaking work, but also kind of fun when you get into it.

Yes, that's exactly what I did. Most useful thing I did was setting up a unit test framework to test a single instruction, and then generating a huge number of variations of those instructions.

I talked a bit about the Intel SDM in the last post I wrote (linked at the top). The SDM is great for getting the specific details, but isn't always approachable for high level overview things.

I wish this wasn't written largely using a gray font on a gray background, because it's a pain to read.

Apologies. I'm just using a Hugo template and haven't spent a lot of time figuring out how to customize it. You're right though, the contrast isn't great.

Applications are open for YC Summer 2023

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact