Self-modifying code was useful for optimisation back in the 80s, but these days it's usually awful for performance (with JIT compilation as the main exception to this rule).
Your CPU has an instruction cache and a data cache, and on ARM these caches are not coherent (x86 keeps them coherent in hardware, at some cost). So if you modify your instruction stream with a write, you have to clear the instruction cache to ensure that your modified instructions are actually executed by the processor. If you do this a lot, it will make things S-L-O-W, because it forces you to go all the way to main memory to find the next instruction to execute.
This means that if you do want to generate code at runtime, you want to batch the modifications into large groups, so that you invalidate the i-cache less frequently. This actually is useful -- it's what JIT compilation is! The reason JIT can be helpful (even in statically typed languages like Java or Haskell) is that programs often get passed functions as arguments (e.g., qsort in C). A static compiler can't optimise those call sites much, because it doesn't know what the function argument will be. But at runtime you do know what the function is, and by inlining it your code can be made much faster.
On x86 you can use self-modifying code without explicitly flushing caches. However, if you execute the modified code soon after writing it, the penalty can be tens to hundreds of cycles (source: Agner Fog’s optimization manual).
ARM quite famously requires the explicit cache flush, and will usually fail to work without it. However, some emulators, e.g. QEMU, don’t require the cache flush, which can lead to confusion if you usually test on emulators.
Came here hoping someone had written this. I wasted many hours of my young life trying to figure out why my self-modifying assembler program worked perfectly in the debugger but not without it.
Could you elaborate? I feel like the OP is saying that it works just painfully slowly but your comment/problem indicates it didn't work at all without the debugger. Can you say how these relate? Am I missing something obvious?
Self-modifying code on x86 has always been a bit microarchitecture-dependent: in the past there was a minimum distance required between the write and the modified instruction for the processor to notice the change. This was one of the tricks anti-piracy code used to keep people from reverse engineering it. Changes made close enough to the IP wouldn't be hazarded properly, so the stale instruction would be executed anyway. If the code was run under a debugger, the extra breaks/traps would change the behavior, and the newer instruction would get executed rather than the stale one.

Someone who plays with this on more recent x86s could say how this works on modern parts, but I would guess that if the CPU detects a hazard and has to roll back to a previous state, it probably goes into some kind of strongly in-order mode around the code in question. This might mean modern processors behave better than some older models, but likely with an absolutely massive perf hit when it happens (think > 10x).

On something like an ARM or RISC-V(?) without coherent I/D caches, this "window" could basically be forever. That raises an interesting security question: in theory it's possible to have code executed for extended periods of time which isn't actually visible anywhere, because page/cache invalidation hasn't cleared the stale cache lines.
This is a really fascinating subject and seems like a rich area for research.
I was curious about your comment regarding ARM and RISC-V not having coherent instruction and data caches. Is this a toggle on these chips, then, for turning it on and off? I think I remember reading about some SoCs that have this configurable.
On older x86 chips - anything before the 486 - there was no cache, but opcode bytes were prefetched into a queue, ranging from 4 bytes (on the 8088) to 16 bytes on the 386. The 286 and 386 had an additional queue holding up to 3 decoded instructions (regardless of length).
These queues were "visible" in their effect on self-modifying code. After modifying an instruction that could already be in the queue, you had to do a jump to flush it.
If you know about this, it is obvious how certain code only works when single-stepped through, or perhaps when run on an 8088 with its shorter queue. But few people did, even among experienced programmers.
IIRC the 486 and everything newer can detect when a cache line containing code is changed, so this is no longer necessary (but bad for performance as other commenters said).
Inlining a passed function pointer is not really a JIT-only optimization. As long as the pointer is a constant, it only requires interprocedural optimization and/or link-time optimization.
The JIT can help if the value of the pointer varies dynamically and in an unpredictable way (otherwise PGO would also help).