I think this is where the real gems start. The biggest contribution that SMB had was the "physics engine", to retrofit a modern term. The friction, the jumping, the inertia. If you compare it with the primitive physics in Donkey Kong or Mario Brothers, you can really grasp the groundbreaking novelty that was SMB. You can change direction in mid-air, but not too much. When you run, you skid if you try to run in the opposite direction. The height of your jumps is affected by your running speed.
It's all of these little details combined, barely noticed in tandem, which made the game new and fun.
This line shows the calculation that makes use of the JumpMForceData.
There was a funky "bounce" at the edge of the screen in the arcade version that I captured pretty well by reversing X and Y velocity on a wall collision; this could not have been done well with a lookup table. It wasn't that much code, and the simple parameter space made playability tuning really easy.
On modern CPUs, data dependency, such as lookup tables often cause pipeline stalls — worse pipelining.
L1 cache is at a premium as well, you rarely want to waste it to access LUTs.
You can compute a lot in 12 cycles caused by L2 hit (L1 miss). In theory up to 32 * 12 = 384 floating point operations.
What adding a lookup table can do is to add the load-latency to the dependency chain involving the calculation, which seems to be what you are talking about here. For an L1 hit that's usually 4 or 5 cycles, and for L2 hits and beyond it's worse, as you point out. How much that actually matters depends on whether the code is latency-bound and the involved lookup is on the critical path: in many cases where there is enough ILP it won't be (an general rule is that in most code most instructions are not on a critical dependency-chain).
If the involved method isn't that hot then L1 misses (like your example) or worse are definitely a possibility. On the other hand, in that case performance isn't that critical by definition. If the method is really hot, e.g., in a tight(ish) loop, then you are mostly going to be getting L1 hits.
The comparison with 384 FOPs seems a bit off: I guess you are talking about about some 32-FOP per cycle SIMD implementation (AVX512?) - but the assumption of data dependencies kind of rules that out: one would assume it's scalar code here. If it's vectorization, then the whole equation changes!
If it's not vectorizable, LUT result is often used for indirect jump/call (like large switch statement) or memory access (say, a histogram etc.).
> What adding a lookup table can do is to add the load-latency to the dependency chain involving the calculation, which seems to be what you are talking about here.
Yeah, used a bit sloppy terminology. Loads can affect performance system wide. To be a win, LUT function needs to be something pretty heavy, while LUT itself needs to be small (at least <16 kB, preferably <1 kB).
> If the method is really hot, e.g., in a tight(ish) loop, then you are mostly going to be getting L1 hits.
That really depends. It's generally good to keep L1 footprint small. There are just 512 of 64-byte L1 cache lines. Hyperthread shares L1 as well. There can be other hot loops nearby that could also benefit from hot L1. It's very easy to start to spill to L2 (and further). Microbenchmarks often miss "system" level issues.
> The comparison with 384 FOPs seems a bit off
384 was for the extreme vectorization case, 12 x 2 x 8 FMACs (AVX). Most vendors count FMACs nowadays as two FOPs...
> If it's vectorization, then the whole equation changes
Well, isn't that where the performance wins are and what you need to do to extract maximum performance from that hot loop? A good truly parallel vector gather implementation could make (small) LUTs very interesting performance wise.
You've lost me. The parent comments were specifically talking about using LUTs to replace calculation, and particular calculations involving branches. So basically rather than some ALU ops + possibly some branches, you use a LUT and fewer or zero ALU ops, and fewer or zero branches.
No one is talking about the context of a LUT of function pointers being used for a switch statement.
A histogram is generally a read/write table in memory and doesn't have much to do with a (usually read-only) LUT - unless I missed what you're getting at there.
> Yeah, used a bit sloppy terminology. Loads can affect performance system wide. To be a win, LUT function needs to be something pretty heavy, while LUT itself needs to be small (at least <16 kB, preferably <1 kB).
Definitely. Useful LUTs are often a few dozen bytes: the JumpMForceData one referring to about was only 8 bytes!
> Microbenchmarks often miss "system" level issues.
But it's like a reverse myth now: at some point (apparently?) everyone loved LUTs - but now it's popular to just dismiss any LUT use with: "yeah but a cache miss takes yyy cycles which will make a LUT terrible!" or "microbenchmarks can't capture the true cost of LUTs!".
Now the latter is certainly true, but you can certainly put reasonable bounds on the cost. The key observation is the "deeper" the miss (i.e., miss to DRAM being the deepest, ignoring swap), the less the implied frequency the LUT-using method was being called anyways. If the method is always missing to DRAM, the LUT entries are being cleared out before the next invocation (that hits the same line) which must be a "while". Conversely, when the LUT method is very hot (high of calls), the opportunity for the LUT to stay in cache is great.
You can even analyze this more formally: basically looking at the cost-benefit of every line of cache used by the LUT: if the LUT didn't use that line, what would be the benefit to the surrounding code? For any non-trivial program this gets hard to do exactly, but certainly with integration benchmarks and performance counters you can make some reasonable tests. Associativity makes everything tougher and less linear though...
> Well, isn't that where the performance wins are and what you need to do to extract maximum performance from that hot loop?
Sure, if it can be vectorized. My point there was that you were discussing the latency of L2 misses rather than the throughput, which implies that the operations on consecutive elements were dependent (otherwise it would be the throughput of 1 or 2 per cycle that would be important). So you just have to keep it apples to apples: if you assume independent ops, you can perhaps vectorize, but then the LUT comparison is a throughput one (as is the scalar alternative), but if the code is dependent and non-vectorizable, then the LUT latency more becomes important.
> A good truly parallel vector gather implementation could make (small) LUTs very interesting performance wise.
Anything that can be vectorized usually adds another factor of 4 or 8 to performance making it much harder for LUTs to come out on top, since the gather implementations on x86 anyways just use the same scalar ports and are thus still limited to 2/cycle, and without any "smarts" for identical or overlapping elements.
Sometimes you can use pshufb to do what amounts to 16 or 32 parallel lookups in a 16-element table of bytes. If your LUT can be made that small, it works great.
A lookup table meshes much better with fixed frame rate gameplay, either with one entry per frame, or quantizing countdown timers of how many frames to wait to go to the next state.
 not the full story
From Takashi Tezuka:
> “In the end we used the New Super Mario Bros. U system for all of the game styles. There was quite a lot of discussion about this within the team. Staff who had strong attachment to the original games expressed a strong desire to see implemented the same system they remembered. However, when players who are used to the modern Mario physics tried playing with the old physics, they found it much more difficult than they remembered."
People on reddit have done some pretty extensive breakdowns of the physics of different Mario versions vs. SMM:
Always remember to view past "Tricks" from the era they were written in.
Lookup tables, in the 90's, for example, were used EVERYWHERE. People always used for sin/cos/tan functions. Fixed-point math was very common as well. Nowadays, it may seem like "magic" but it's not. It was just "the way" to get things done.
I'm not downplaying the skill of the programmers of those eras, it's just keep it in perspective. Many "tricks" weren't invented for that one game, they were common place for all programmers using those platforms.
I used to use compiled sprites all the time in the 90's. They seem like magic today, but they were just another kind of drawing to us. In Allegro 4, you could even draw one by simply calling a "make compiled sprite" function and making sure to not draw it outside of the clipping rectangle (since they can't be clipped). That was it. You build it with a builder function, and then you call draw on it. But it "seems" insane nowadays to convert the bits of a bitmap into raw machine code that draws those bits to save some CPU cycles.
speeds = [0, 1, 2, 4, 8, 4, 2, 1, 0]
And then you keep track of how many frames it's been since you jumped, and then just do
yVelocity = speeds[ticksSinceJump]
No math needed at all.