You can also abuse the displacement math with eiz:
ff 34 24 push DWORD PTR [esp]
ff 34 e4 push DWORD PTR [esp+eiz*8]
80 c0 53 add al,0x53
36 04 53 ss add al,0x53
What’s eiz: https://stackoverflow.com/a/2553556/3125367
This appears to be some GNU-specific syntax meaning "encode SIB byte with no index register". The only case where this would be required by the hardware is when using ESP/RSP as a base register, and every assembler should produce the correct encoding for that if you simply write [ESP].
So using "eiz" on GAS lets you control what is put into the (unused) scale field. One might call that a feature, but it is a meaningless encoding detail similar to which variant of "register to register" opcodes is emitted, something that I don't think any assembler gives you control over.
¹ except maybe on the microarchitectural level, but that isn't visible to the programmer
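To make the "unused scale field" concrete, here is a rough sketch that splits the two SIB bytes from the push example above into their fields. The function name `decode_sib` is made up for illustration, and the field meanings assume 32-bit mode with no REX/VEX extension bits.

```python
def decode_sib(sib: int) -> dict:
    """Split a SIB byte into its scale, index, and base fields."""
    return {
        "scale": 1 << (sib >> 6),     # top 2 bits encode a factor of 1, 2, 4, or 8
        "index": (sib >> 3) & 0b111,  # 0b100 here means "no index register"
        "base": sib & 0b111,          # 0b100 here means ESP
    }

plain = decode_sib(0x24)  # from: ff 34 24  push DWORD PTR [esp]
eiz8 = decode_sib(0xE4)   # from: ff 34 e4  push DWORD PTR [esp+eiz*8]

print(plain)
print(eiz8)
```

Both bytes carry the same index (none) and base (ESP); only the two scale bits differ, and with no index register there is nothing for the scale to multiply.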
lea rax, [eiz+rbx*4]
lea rax, [rbx*4]
lea rax, [00000000h+rbx*4]
So each rotation instruction has 3 bits free for out-of-band information.
“Golden Images” are often used to try to counter exploits hidden in the binaries of cloud images used by containers, etc.
Neither does DEC, and I believe the original reason for this was multiple-precision arithmetic routines --- you need to propagate the carry flag, but also update loop counters and pointers.
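A rough sketch of the multiple-precision case, assuming 32-bit limbs stored least-significant first. This only models the carry flag as a Python variable (`cf`), and the helper names `adc` and `add_multiword` are made up for illustration. The point is that the counter and pointer updates sit between one add-with-carry and the next, so they must not disturb the carry.

```python
MASK32 = 0xFFFFFFFF

def adc(a, b, carry_in):
    """Model of add-with-carry on 32-bit values: returns (result, carry_out)."""
    total = a + b + carry_in
    return total & MASK32, total >> 32

def add_multiword(xs, ys):
    """Add two equal-length little-endian lists of 32-bit limbs."""
    assert len(xs) == len(ys)
    out = []
    cf = 0           # carry flag, cleared before the loop
    count = len(xs)  # loop counter
    i = 0            # stands in for the pointer into the limb arrays
    while count:
        limb, cf = adc(xs[i], ys[i], cf)  # reads the old carry, writes the new one
        out.append(limb)
        i += 1       # pointer bump: like INC, it does not touch cf
        count -= 1   # counter update: like DEC, it does not touch cf
    return out, cf

print(add_multiword([0xFFFFFFFF, 0], [1, 0]))
```

If the counter update clobbered `cf`, the carry out of the low limb would be lost before the next limb was added.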
It seems that the actual behavior of the undefined flags is related to the internal implementation of the shift operation, and is different between different architectures.
According to https://www.sandpile.org/x86/flags.htm which unfortunately is neither up to date nor exhaustive, all of the P6 family behave the same, but different from the P5 and the P4, and probably the earlier generations too. "Undefined" values are often used by anti-debugging/anti-emulation/VM detection code to determine if the CPU is real hardware or not, so it's actually quite important to emulate them correctly.
Flags turn out to be quite the annoyance for the kind of in-process virtualization needed by Time Travel Debug. You need to instrument code with minimal overhead so, on the one hand, you don't want to save/restore flags all the time .... And on the other hand it still all has to work when flags get used.
Darek Mihocka wrote a really interesting article about how to optimize flag calculations in an x86 emulator:
Although looking at your username I suspect you may have read this one before...
That seems quite brittle, unless all real CPUs implement them the same way. If they do, one has to wonder whether it's really undefined or an undocumented part of the x86 spec instead.
The only CPU I know with actual 'undefined' behaviour is the 6502 for some of the undocumented/illegal opcodes which can yield different results based on things like current CPU temperature (see the ANE/XAA instruction description: https://www.masswerk.at/nowgobang/2021/6502-illegal-opcodes)
Do you know if there's a more exhaustive source (besides the official manuals)?
There's a few valid 16 byte instructions though.. Sandpile lists a few examples: https://www.sandpile.org/x86/opc_enc.htm
36 67 8F EA 78 10 84 24 disp32 imm32 = bextr eax,[ss:esp*1+disp32],imm32
64 67 8F EA F8 10 84 18 disp32 imm32 = bextr rax,[fs:eax+ebx+disp32],imm32
In any case, there's only a finite number of prefixes you can meaningfully stick onto an instruction, and repeating the same prefix will do absolutely nothing.
(There may be even longer valid productions; my analysis was pretty naive. But 26 is already substantially longer than the limit!)
It's similarly an x86-on-x86 JIT / emulator but (and I'm sure WinDbg's TTD is similar) most of what you want to do there is just code copying for any instructions that don't need special instrumentation.
And you want to run entirely in the cache of JITted code so you're close to native speed, rather than exit to C code and make decisions.
Working on the JIT in TTD was a lot of fun though, and it's a shame it didn't make sense to keep it around.
I'm surprised a JIT wasn't worth it but, from what I'm aware of, I can see a few reasons why the trade offs are different.
That seems... problematic. Do compilers know this and just ignore the flags if/when they use these instructions?
(This is a common feature of ISAs, and makes sense in terms of component reuse: it allows you to cheaply reuse the ALU, for example, without needing extra wires/control signals telling it not to update the flag state.)
I think compiler writers sometimes also have to know that instructions do not modify the flags.
Such instructions can be inserted between flag-setting ones and the test of that flag.
One correction though:
> EAX is called the “Accumulator register” is not just a convention, it actually makes a difference to the encoding (and potentially the performance, as a result)
No. The decoder doesn't work byte by byte; it decodes whole byte strings together, and registers are all renamed, so %eax isn't really different from %ebx etc. The only difference is that it saves one byte of code space, which is an extremely minor density improvement, and you'd have to make a very contrived example to be able to measure the difference.
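For a sense of what the one byte looks like, here is a small sketch that hand-assembles `add reg, imm32` both ways. The helper names are made up for illustration; the opcodes are the short accumulator form (05 id) and the general form (81 /0 id).

```python
import struct

EAX, EBX = 0, 3  # register numbers as used in the ModRM byte

def add_eax_imm32(imm):
    """Short accumulator form: opcode 05 followed by a 32-bit immediate."""
    return bytes([0x05]) + struct.pack("<I", imm)

def add_reg_imm32(reg, imm):
    """General form: opcode 81, ModRM = 11 000 reg, then a 32-bit immediate."""
    modrm = 0xC0 | (reg & 7)
    return bytes([0x81, modrm]) + struct.pack("<I", imm)

short = add_eax_imm32(0x12345678)
general_ebx = add_reg_imm32(EBX, 0x12345678)
general_eax = add_reg_imm32(EAX, 0x12345678)  # also valid, just one byte longer

print(short.hex(" "), len(short))
print(general_ebx.hex(" "), len(general_ebx))
```

Note that the saving only shows up when the immediate needs the full 32 bits. For immediates that fit in a signed byte there is a sign-extended imm8 form (83 /0 ib) that is the same length for every register.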
For example, limitations of rr seem to suggest that it is almost of no use for multi-threaded programs so I have never actually tried it. I don't know about the TTD though.
> rr limitations
> emulates a single-core machine. So, parallel programs incur the slowdown of running on a single core. This is an inherent feature of the design.
Only if your main use case is debugging race conditions only possible with multiple cores, which is a tiny subset of all bugs.
I use rr all the time for a complex multi-threaded (although not extremely parallel) application and it works wonders. It frequently saves me hours of debugging. Practically none of the issues I use it for are race conditions (not because it doesn't work well for those, but because I rarely get any).
Even if I had to suffer an 8x slowdown from running it on a single core, it would still be worth it nearly every time.
As others have said here, in practice rr works very well for debugging a wide range of bugs in multithreaded programs, including race conditions.
Where it falls down:
* It can only use a single core, so highly parallel programs run very slowly when recorded by rr.
* Some race conditions may be difficult or impossible to reproduce under rr recording (but rr's chaos mode helps a lot with this).
* rr imposes sequential consistency on the recorded program, so bugs due to weak memory models do not show up under rr. Such bugs are pretty rare on x86 (because x86's memory model is pretty strong); this may be more of an issue on ARM.
The main use case I saw for TTD was debugging complex memory corruption issues. Certain types of issues like stack corruption became trivial to debug under TTD. It was also very useful for capturing a repro. If a customer complained about something and I couldn't immediately reproduce it or get a crash dump, I'd ask them to record a TTD trace. More than 75% of the time I'd say it was enough to root cause the bug, without spending tons of time figuring out the repro steps.
I would expect at least a reference to an Intel reference manual, but perhaps there are better ways to learn about the ISA.
But there's a lot of details there — that volume alone spans about 4000 printed pages in the manual — so you're bound to mess things up.
I've written emulators for simpler instruction sets (RISC-V and the 6502) and found that the best way to ensure they work correctly is to do instruction-by-instruction comparisons against a reference. The good news is that if you've got a PC, you've got easy access to a reference machine! So you can do side-by-side comparisons where you set up an initial state on both a real machine and an emulated machine, then execute some code on both, and compare the final states. By using the trap flag you could also single-step through the instructions to extract the intermediate states after each instruction and ensure they match the emulated state.
So that's my guess about how he did his too. It's painstaking work, but also kind of fun when you get into it.
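A toy sketch of that side-by-side comparison, under heavy assumptions: the "reference machine" here is just a second Python function standing in for real hardware, the instruction is a made-up 8-bit ADD that reports only a result, a carry flag and a zero flag, and all the names are hypothetical. Because the state space is tiny it is checked exhaustively; for a real instruction set you would sample initial states instead.

```python
def reference_add8(a, b):
    """Stand-in for the reference machine: 8-bit ADD with carry and zero flags."""
    total = a + b
    result = total & 0xFF
    return {"result": result, "cf": int(total > 0xFF), "zf": int(result == 0)}

def emulated_add8(a, b):
    """Emulator under test, written independently of the reference."""
    result = (a + b) % 256
    return {"result": result, "cf": int(result < a), "zf": int(result == 0)}

def buggy_add8(a, b):
    """Same idea with a deliberate bug: ZF is computed before truncation."""
    total = a + b
    return {"result": total & 0xFF, "cf": int(total > 0xFF), "zf": int(total == 0)}

def compare(emulator, reference):
    """Run both on every initial state and collect the states that differ."""
    mismatches = []
    for a in range(256):
        for b in range(256):
            expected, actual = reference(a, b), emulator(a, b)
            if expected != actual:
                mismatches.append((a, b, expected, actual))
    return mismatches

print(len(compare(emulated_add8, reference_add8)))
print(compare(buggy_add8, reference_add8)[:1])
```

The buggy variant only disagrees when the sum is exactly 256, which is the kind of corner a hand-written test suite tends to miss and a state-by-state comparison catches.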