Self-modifying code was useful for optimisation back in the 80s, but these days it's usually awful for performance (with JIT compilation as the main exception to this rule).
Your CPU has an instruction cache and a data cache, and on ARM these caches are not coherent (x86 keeps them coherent in hardware, at some cost). So if you modify your instruction stream with a write, you have to clear the instruction cache to ensure that your modified instructions are actually executed by the processor. If you do this a lot, it will make things S-L-O-W, because it forces you to go all the way to main memory to find the next instruction to execute.
This means that if you do want to generate code at runtime, you want to batch the modifications into large groups, so that you invalidate the i-cache less frequently. This actually is useful -- it's what JIT compilation is! The reason JIT can be helpful (even in statically typed languages like Java or Haskell) is that programs often get passed functions as arguments (e.g., qsort in C). A static compiler can't optimise those call sites much, because it doesn't know what the function argument will be. But at runtime you do know what the function is, and by inlining it your code can be made much faster.
On x86 you can use self-modifying code without explicitly flushing caches. However, if you execute the modified code soon after writing it, the penalty can be tens to hundreds of cycles (source: Agner Fog’s optimization manual).
ARM quite famously requires the explicit cache flush, and will usually fail to work without it. However, some emulators, e.g. QEMU, don’t require the cache flush, which can lead to confusion if you usually test on emulators.
Came here hoping someone had written this. I wasted many hours of my young life trying to figure out why my self-modifying assembler program worked perfectly in the debugger but not without it.
Could you elaborate? I feel like the OP is saying that it works just painfully slowly but your comment/problem indicates it didn't work at all without the debugger. Can you say how these relate? Am I missing something obvious?
Self-modifying code on x86 has always been a bit microarchitecture-dependent: in the past there was a minimum distance required between the write and the modified instruction for the processor to notice the change. This was one of the tricks anti-piracy code used to keep people from reverse engineering it. Changes made close enough to the IP wouldn't be hazarded properly, so the stale instruction would be executed anyway. If the code was run under a debugger, the extra breaks/traps would change the behavior, and the newer instruction would get executed rather than the stale one.

Someone who plays with this on more recent x86s could say how this works on modern parts, but I would guess that if the CPU detects a hazard and has to roll back to a previous state, it probably goes into some kind of strongly in-order mode around the code in question. This might mean modern processors behave better than some older models, but likely with an absolutely massive perf hit when it happens (think > 10x).

On something like an ARM or RISC-V(?) without coherent I/D caches, this "window" could basically be forever. That raises an interesting security question: in theory it's possible to have code executed for extended periods of time which isn't actually visible anywhere, because page/cache invalidation hasn't cleared the stale cache lines.
This is a really fascinating subject and seems like a rich area for research.
I was curious about your comment regarding ARM and RISC-V not having coherent instruction and data caches. Is this a toggle on these chips, then, for turning it on and off? I think I remember reading about some SoCs that have this configurable.
On older x86 chips - anything before the 486 - there was no cache, but opcode bytes were prefetched into a queue, ranging from 4 bytes (on the 8088) to 16 bytes on the 386. The 286 and 386 had an additional queue holding up to 3 decoded instructions (regardless of length).
These queues were "visible" in their effect on self-modifying code. After modifying an instruction that could already be in the queue, you had to do a jump to flush it.
If you know about this, it is obvious how certain code only works when single-stepped through, or perhaps when run on an 8088 with its shorter queue. But few people did, even among experienced programmers.
IIRC the 486 and everything newer can detect when a cache line containing code is changed, so this is no longer necessary (but bad for performance as other commenters said).
Inlining a passed function pointer is not really a JIT-only optimization. As long as the pointer is a constant, it only requires interprocedural optimization and/or link-time optimization.
The JIT can help if the value of the pointer varies dynamically and in an unpredictable way (otherwise PGO would also help).