Way back in the day (mid high-school), I wrote a Tetris game in Pascal and Z80 assembly for the TRS-80. At some point I realized that since what was displayed on the screen was just a representation of an area of memory, I could keep track of the "cup" and all of the pieces that had already fallen into it directly in video memory instead of variables. I also optimized the code for drawing the "cup" so all three sides of it would get rendered in one shot in a single loop. The first time I ran it, a bug caused half of the bottom not to get drawn, and the first Tetris piece dropped through the hole and sliced straight into whatever vital memory came after the video area, immediately crashing the computer.
Reminds me of the concept of 'compiled bitmaps': instead of having a function that reads an arbitrary image and writes it to the screen buffer (like the one in this article), you'd have a function that writes a single specific hardcoded image to the screen buffer, which could be done in fewer cycles than if you had to read the data from somewhere.
The machine code for such functions can be generated from any image in a straightforward way, with all kinds of useful optimizations (e.g. simply skipping transparent pixels instead of checking transparency every single time), but it comes with a space tradeoff, as the code is noticeably larger than the source image.
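To make the compiled-bitmap idea concrete, here is a rough sketch in Python rather than machine code. It generates the source of a function that draws one specific sprite and compares it with a generic blit. The names `TRANSPARENT`, `generic_blit` and `compile_sprite`, the sprite data and the screen size are all made up for illustration; a real implementation would emit machine code for the target CPU.

```python
TRANSPARENT = None  # hypothetical marker for a transparent pixel


def generic_blit(screen, width, sprite, x, y):
    """Data-driven blit: reads every sprite pixel and tests transparency each time."""
    for row, line in enumerate(sprite):
        for col, pixel in enumerate(line):
            if pixel is not TRANSPARENT:
                screen[(y + row) * width + (x + col)] = pixel


def compile_sprite(sprite, width, name="draw_sprite"):
    """Emit and compile the source of a function that draws exactly this sprite."""
    lines = [f"def {name}(screen, x, y):", f"    base = y * {width} + x"]
    for row, line in enumerate(sprite):
        for col, pixel in enumerate(line):
            if pixel is TRANSPARENT:
                continue  # transparent pixels generate no code at all
            # One hardcoded store per visible pixel: no loop, no data read, no test.
            lines.append(f"    screen[base + {row * width + col}] = {pixel!r}")
    source = "\n".join(lines)
    namespace = {}
    exec(source, namespace)
    return namespace[name], source


SPRITE = [
    [TRANSPARENT, 7, TRANSPARENT],
    [7, 9, 7],
    [TRANSPARENT, 7, TRANSPARENT],
]
WIDTH, HEIGHT = 8, 6

draw_sprite, generated_source = compile_sprite(SPRITE, WIDTH)

screen_a = [0] * (WIDTH * HEIGHT)
screen_b = [0] * (WIDTH * HEIGHT)
generic_blit(screen_a, WIDTH, SPRITE, 2, 1)
draw_sprite(screen_b, 2, 1)
```

Printing `generated_source` shows the space tradeoff: one line of code per visible pixel, where the original sprite needed only one value per pixel.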
So the core of this trick is to point the stack pointer into video memory, because push/pull instructions can deliver higher throughput than ordinary loads and stores.
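Here is a small Python model of the mechanics, not of any particular CPU from the article. `TinyCPU`, `VIDEO_BASE` and `LINE_BYTES` are hypothetical, and the stack is assumed to grow downward with a 16-bit little-endian push. Cycle counts are not modeled; the point is only that one push both stores two bytes and moves the pointer, work that otherwise takes separate store and pointer-update instructions.

```python
class TinyCPU:
    """Hypothetical CPU model: 64 KiB of memory and a descending stack."""

    def __init__(self):
        self.mem = bytearray(0x10000)
        self.sp = 0xFFFF

    def push16(self, value):
        # A single push stores two bytes AND updates the pointer.
        self.sp = (self.sp - 1) & 0xFFFF
        self.mem[self.sp] = (value >> 8) & 0xFF   # high byte at the higher address
        self.sp = (self.sp - 1) & 0xFFFF
        self.mem[self.sp] = value & 0xFF          # low byte at the lower address


VIDEO_BASE = 0x4000   # assumed start of video memory
LINE_BYTES = 32       # assumed bytes per screen line


def fill_line_with_pushes(cpu, line, pattern):
    """Fill one screen line by aiming the stack pointer at it and pushing."""
    saved_sp = cpu.sp
    # The stack grows downward, so start one byte past the end of the line.
    cpu.sp = VIDEO_BASE + (line + 1) * LINE_BYTES
    for _ in range(LINE_BYTES // 2):
        cpu.push16(pattern)
    end_sp = cpu.sp
    cpu.sp = saved_sp   # restore the real stack before anything else uses it
    return end_sp


cpu = TinyCPU()
end_sp = fill_line_with_pushes(cpu, 3, 0xAA55)
```

While the stack pointer is hijacked, an interrupt or subroutine call would push its return address into the screen, so on real hardware this is only safe with those ruled out.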
The Atari 2600 has hardware designed to take advantage of such a trick. It has some 1-bit graphical objects that are turned on or off by writing to a certain register. The hot bit of that register is located at bit 1.
Why that, why not the more obvious low or high bit? Because bit 1 corresponds to the location of the 6502's zero flag in the flags register. So you can correctly turn the object on or off with literally just a compare on its coordinate followed by a push-flags instruction. You'll end up pushing a status byte whose bit 1 is 1 or 0, which enables or disables the object. No branches or conditionals, and the code is even time-invariant, a helpful property in the tightly cycle-counted world of this machine.
Finally, the video chip registers are laid out with several such objects in succession, so you can continue pushing into several of them without resetting the stack pointer. (This machine has no interrupts, so hijacking the stack pointer into video memory is perfectly safe as long as you never issue a call or other push-pop instruction.)
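A rough Python model of the compare-then-push-flags sequence described above. The register addresses follow the usual TIA register map as I remember it, so treat them as assumptions. `status_after_compare` and `draw_objects_for_scanline` are made-up names, and only the flags a compare affects are modeled.

```python
Z_FLAG = 0x02       # the zero flag is bit 1 of the 6502 status register
ENABLE_BIT = 0x02   # the enable registers only look at bit 1 of what is written

# Assumed register addresses: missile 0, missile 1, ball, in ascending order.
ENAM0, ENAM1, ENABL = 0x1D, 0x1E, 0x1F


def status_after_compare(register, operand):
    """Status byte a push-flags would store right after a 6502 compare."""
    diff = (register - operand) & 0xFF
    status = 0x30                 # bits 4 and 5 are set in the pushed copy
    if register >= operand:
        status |= 0x01            # carry
    if diff == 0:
        status |= Z_FLAG          # zero: the coordinates match
    if diff & 0x80:
        status |= 0x80            # negative
    return status


def draw_objects_for_scanline(scanline, object_lines):
    """Enable or disable each object using only compare + push, no branches."""
    tia = {}
    sp = ENABL                    # aim the stack pointer at the highest register
    for wanted_line in object_lines:   # order: ball, missile 1, missile 0
        tia[sp] = status_after_compare(scanline, wanted_line)  # compare, then push
        sp -= 1                   # the push itself moves on to the next register
    return {addr: bool(value & ENABLE_BIT) for addr, value in tia.items()}


enabled = draw_objects_for_scanline(50, [50, 60, 50])
```

Here `enabled` marks the ball and missile 0 as on and missile 1 as off, because only their coordinates equal the current scanline. Every scanline runs the same instructions whatever the outcome, which is where the time invariance comes from.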
The machine also aligns some of its read registers the same way. The joystick button is mapped to bit 7 of its input register, so you can read and then immediately branch on the sign flag without a compare instruction.
I never click on articles with titles like "classic game programming hack", because it's invariably another discovery of that Quake inverse square root. The old chestnut that hypnotizes anyone who has never read HAKMEM or "Hacker's Delight".
1. The specific approaches required were different for each particular CPU, and that time was very heterogeneous, with lots of very different machines in use. For example, the article involves two people; the developed optimization is intended for one of them but would not even work on the author's own slightly older CPU.
2. Information exchange was much less efficient than now. They literally would have no way of knowing what the other guys did unless each new developer was ready to reverse engineer their code themselves, which requires a significant investment of time that most likely exceeds the time they would be investing in the actual platformer. Nowadays a single person can reverse engineer the technique and everyone else can just use it, but back then it simply wouldn't be published and you were on your own.
The tools available were crude; disassembly or easy finding of the 'hot' inner loops just wasn't available, and simply attaching a debugger would often result in the system not working, as the game code would often rely on specific cycle timing, interrupts not happening (e.g. the article mentions that any interrupt would corrupt their data because of the stack pointer abuse), etc. Reverse engineering could and did happen, but it wasn't that feasible for two boys in a basement at the time.
Furthermore, 'the pros' at the time would deploy their code in firmware (arcade machines or cartridges), where it was even less accessible to hobbyists. The only actual way of spreading such knowledge was for some of those early professionals to publish books that, alongside basic techniques, also listed the optimization tips and tricks they knew.
You can just take apart other games already out there.
Finding the loop that takes 80% of the CPU should be simple enough, and it can't be big.
You might spend weeks making stuff faster and arrive at an inferior or wrong solution. Better to ship it weeks earlier and enjoy lying on a beach in the sun.