Hacker News new | past | comments | ask | show | jobs | submit login

https://gist.github.com/1wErt3r/4048722#file-smbdis-asm-L601...

I think this is where the real gems start. The biggest contribution that SMB had was the "physics engine", to retrofit a modern term. The friction, the jumping, the inertia. If you compare it with the primitive physics in Donkey Kong or Mario Brothers, you can really grasp the groundbreaking novelty that was SMB. You can change direction in mid-air, but not too much. When you run, you skid if you try to run in the opposite direction. The height of your jumps is affected by your running speed.

It's all of these little details combined, barely noticed in tandem, which made the game new and fun.




I once read that the way SMB was able to pull off the physics engine on such limited hardware was that it used lookup tables for physics instead of actually calculating velocity. My assembly-fu is weak but it looks like your link is to the section that contains all the lookup tables. I think JumpMForceData for example is a series of offsets for each successive frame after you hit the jump button.

https://gist.github.com/1wErt3r/4048722#file-smbdis-asm-L611...

This line shows the calculation that makes use of the JumpMForceData.


When I wrote Atari 800 Donkey Kong, I used 16-bit precision arithmetic for the motion, with pretty good results. All the jumping was done with a horizontal velocity, a vertical velocity and a vertical acceleration. A jump is just setting Mario's vertical velocity to some value, and you need to do floor collisions (but you want those anyway, so other objects can interact with floors).

There was a funky "bounce" at the edge of the screen in the arcade version that I captured pretty well by reversing X and Y velocity on a wall collision; this could not have been done well with a lookup table. It wasn't that much code, and the simple parameter space made playability tuning really easy.


You can also fine-tune the feel of a jump when you're directly editing a handful of values vs trying to find a function that describes your desired results.


How so? "Trying to find a function" is editing a handful of parameters to a polynomial+exponential model -- the same thing


When you edit one coefficient of a polynomial, you change its behavior everywhere. When you change one value in a lookup table, you're only changing the value for one point in time.


i'd say a lookup table is more easily edited than a parameter in a function (provided that you are editing a function with less parameters than the entries in the lookup table).


If they used tables it was probably to account for a bunch of edge case and to fine tune the jumps to feel right and controllable. Counter-intuitively, plain inertial physics doesn't seem to feel very good in platform games. But inertial physics is pretty cheap to implement on the 6502 using fixed point arithmetics.


Lookup tables were indeed a common technique used by games in the past.


And present, too, right? It's not the same reason as it would have been in the 80s, but today in performance critical code it is not uncommon to reduce the number of conditionals for better CPU pipelining, and lookup tables are a very common tool for this.


> it is not uncommon to reduce the number of conditionals for better CPU pipelining, and lookup tables are a very common tool for this.

On modern CPUs, data dependency, such as lookup tables often cause pipeline stalls — worse pipelining.

L1 cache is at a premium as well, you rarely want to waste it to access LUTs.

You can compute a lot in 12 cycles caused by L2 hit (L1 miss). In theory up to 32 * 12 = 384 floating point operations.


To be fair, replacing a series of ALU ops with a lookup table doesn't usually add a "data dependency" - the data dependency probably already existed, but perhaps flowed through registers rather than memory.

What adding a lookup table can do is to add the load-latency to the dependency chain involving the calculation, which seems to be what you are talking about here. For an L1 hit that's usually 4 or 5 cycles, and for L2 hits and beyond it's worse, as you point out. How much that actually matters depends on whether the code is latency-bound and the involved lookup is on the critical path: in many cases where there is enough ILP it won't be (an general rule is that in most code most instructions are not on a critical dependency-chain).

If the involved method isn't that hot then L1 misses (like your example) or worse are definitely a possibility. On the other hand, in that case performance isn't that critical by definition. If the method is really hot, e.g., in a tight(ish) loop, then you are mostly going to be getting L1 hits.

The comparison with 384 FOPs seems a bit off: I guess you are talking about about some 32-FOP per cycle SIMD implementation (AVX512?) - but the assumption of data dependencies kind of rules that out: one would assume it's scalar code here. If it's vectorization, then the whole equation changes!


> To be fair, replacing a series of ALU ops with a lookup table doesn't usually add a "data dependency"

If it's not vectorizable, LUT result is often used for indirect jump/call (like large switch statement) or memory access (say, a histogram etc.).

> What adding a lookup table can do is to add the load-latency to the dependency chain involving the calculation, which seems to be what you are talking about here.

Yeah, used a bit sloppy terminology. Loads can affect performance system wide. To be a win, LUT function needs to be something pretty heavy, while LUT itself needs to be small (at least <16 kB, preferably <1 kB).

> If the method is really hot, e.g., in a tight(ish) loop, then you are mostly going to be getting L1 hits.

That really depends. It's generally good to keep L1 footprint small. There are just 512 of 64-byte L1 cache lines. Hyperthread shares L1 as well. There can be other hot loops nearby that could also benefit from hot L1. It's very easy to start to spill to L2 (and further). Microbenchmarks often miss "system" level issues.

> The comparison with 384 FOPs seems a bit off

384 was for the extreme vectorization case, 12 x 2 x 8 FMACs (AVX). Most vendors count FMACs nowadays as two FOPs...

> If it's vectorization, then the whole equation changes

Well, isn't that where the performance wins are and what you need to do to extract maximum performance from that hot loop? A good truly parallel vector gather implementation could make (small) LUTs very interesting performance wise.


> If it's not vectorizable, LUT result is often used for indirect jump/call (like large switch statement) or memory access (say, a histogram etc.).

You've lost me. The parent comments were specifically talking about using LUTs to replace calculation, and particular calculations involving branches. So basically rather than some ALU ops + possibly some branches, you use a LUT and fewer or zero ALU ops, and fewer or zero branches.

No one is talking about the context of a LUT of function pointers being used for a switch statement.

A histogram is generally a read/write table in memory and doesn't have much to do with a (usually read-only) LUT - unless I missed what you're getting at there.

> Yeah, used a bit sloppy terminology. Loads can affect performance system wide. To be a win, LUT function needs to be something pretty heavy, while LUT itself needs to be small (at least <16 kB, preferably <1 kB).

Definitely. Useful LUTs are often a few dozen bytes: the JumpMForceData one referring to about was only 8 bytes!

> Microbenchmarks often miss "system" level issues.

Definitely.

But it's like a reverse myth now: at some point (apparently?) everyone loved LUTs - but now it's popular to just dismiss any LUT use with: "yeah but a cache miss takes yyy cycles which will make a LUT terrible!" or "microbenchmarks can't capture the true cost of LUTs!".

Now the latter is certainly true, but you can certainly put reasonable bounds on the cost. The key observation is the "deeper" the miss (i.e., miss to DRAM being the deepest, ignoring swap), the less the implied frequency the LUT-using method was being called anyways. If the method is always missing to DRAM, the LUT entries are being cleared out before the next invocation (that hits the same line) which must be a "while". Conversely, when the LUT method is very hot (high of calls), the opportunity for the LUT to stay in cache is great.

You can even analyze this more formally: basically looking at the cost-benefit of every line of cache used by the LUT: if the LUT didn't use that line, what would be the benefit to the surrounding code? For any non-trivial program this gets hard to do exactly, but certainly with integration benchmarks and performance counters you can make some reasonable tests. Associativity makes everything tougher and less linear though...

> Well, isn't that where the performance wins are and what you need to do to extract maximum performance from that hot loop?

Sure, if it can be vectorized. My point there was that you were discussing the latency of L2 misses rather than the throughput, which implies that the operations on consecutive elements were dependent (otherwise it would be the throughput of 1 or 2 per cycle that would be important). So you just have to keep it apples to apples: if you assume independent ops, you can perhaps vectorize, but then the LUT comparison is a throughput one (as is the scalar alternative), but if the code is dependent and non-vectorizable, then the LUT latency more becomes important.

> A good truly parallel vector gather implementation could make (small) LUTs very interesting performance wise.

Anything that can be vectorized usually adds another factor of 4 or 8 to performance making it much harder for LUTs to come out on top, since the gather implementations on x86 anyways just use the same scalar ports and are thus still limited to 2/cycle, and without any "smarts" for identical or overlapping elements.

Sometimes you can use pshufb to do what amounts to 16 or 32 parallel lookups in a 16-element table of bytes. If your LUT can be made that small, it works great.


I think the biggest difference is that modern games tend to be written for variable frame rates, stuffing floating point time deltas through equations.

A lookup table meshes much better with fixed frame rate gameplay, either with one entry per frame, or quantizing countdown timers of how many frames to wait to go to the next state.


Actually, modern (and not so modern) physics engines as used in games generally use a fixed time delta for each step, and just iterate faster/slower to keep the simulation in sync[1]. This is done for many reasons, but predominantly numerical stability.

[1] not the full story


I have a strong feeling that Super Mario Maker always has the same physics engine going, but just has four different lookup tables that it switches between depending on the level theme. Anyone want to partially disassemble it for comparison?


Super Mario Maker uses "New SMB" physics for all themes. It doesn't actually mimic the exact physics of each game (aside from small tweaks like not letting you wall-jump or carry shells depending on the theme). They did this because newer players found it super confusing to go back to the older physics.

From Takashi Tezuka:

> “In the end we used the New Super Mario Bros. U system for all of the game styles. There was quite a lot of discussion about this within the team. Staff who had strong attachment to the original games expressed a strong desire to see implemented the same system they remembered. However, when players who are used to the modern Mario physics tried playing with the old physics, they found it much more difficult than they remembered."

People on reddit have done some pretty extensive breakdowns of the physics of different Mario versions vs. SMM:

- https://imgur.com/XC38rcX - https://www.reddit.com/r/MarioMaker/comments/4iqa5s/super_ma...


Yes I recall a Forza motorsport physics guy saying they used a simple lookup table for the chart which holds the curve for the limit if grip on a tyre. Beats the crab out if calculating it every time.


Thank you. I wrote a long post but then I decided not to post it...

Always remember to view past "Tricks" from the era they were written in.

Lookup tables, in the 90's, for example, were used EVERYWHERE. People always used for sin/cos/tan functions. Fixed-point math was very common as well. Nowadays, it may seem like "magic" but it's not. It was just "the way" to get things done.

I'm not downplaying the skill of the programmers of those eras, it's just keep it in perspective. Many "tricks" weren't invented for that one game, they were common place for all programmers using those platforms.

I used to use compiled sprites all the time in the 90's. They seem like magic today, but they were just another kind of drawing to us. In Allegro 4, you could even draw one by simply calling a "make compiled sprite" function and making sure to not draw it outside of the clipping rectangle (since they can't be clipped). That was it. You build it with a builder function, and then you call draw on it. But it "seems" insane nowadays to convert the bits of a bitmap into raw machine code that draws those bits to save some CPU cycles.


I was going to respond to an earlier comment of yours, decided I couldn't quite figure out what to ask, and so kept reading. But after this comment I think my question is a bit clearer: please write longer posts or start a blog with stories of the kind of work you were doing > 10 years ago. It's fascinating, and you're a good writer!


I may be missing something, but how is that not "actually calculating velocity"?


That is, instead of using y velocity = acceleration * time and then adding in gravity, they made something like this, as I understand it:

speeds = [0, 1, 2, 4, 8, 4, 2, 1, 0]

And then you keep track of how many frames it's been since you jumped, and then just do

yVelocity = speeds[ticksSinceJump]

No math needed at all.


The second game, melee, definitely had a more traditional engine running it, too many ~bugs~ fantastic features to have been taking advantage of lookup tables.


Are you thinking of Super Smash Bros.? This is Super Mario Bros. for the NES, over a decade older :)


I actually wonder if a version of Smash Bros could have worked on NES with lookup tables for physics? The existing games are 3d but the physics are all 2d.


Yes I read super and bros and didn't read the middle.


I'll plug the book http://www.game-feel.com/ here because it has a 28-page chapter devoted to the physics of SMB.


Not platformers I realize, but lots of outer space games with actual acceleration/velocity calculations preceded SMB. Off the top of my head: Lunar Lander (1979), Asteroids (1979), Defender (1981), Gravitar (1982), Sinistar (1982). Just want to keep the history straight. Several of these modeled gravity as well.


Used to have a Gravitar clone for DOS. It was one of the harder games I've ever played :)




Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact

Search: