I think this is where the real gems start. The biggest contribution SMB made was the "physics engine", to retrofit a modern term: the friction, the jumping, the inertia. If you compare it with the primitive physics of Donkey Kong or Mario Bros., you can really grasp what a groundbreaking novelty SMB was. You can change direction in mid-air, but not too much. When you run, you skid if you try to run in the opposite direction. The height of your jumps is affected by your running speed.
It's all of these little details combined, each barely noticed on its own, that made the game new and fun.
This line shows the calculation that makes use of the JumpMForceData.
There was a funky "bounce" at the edge of the screen in the arcade version that I captured pretty well by reversing X and Y velocity on a wall collision; this could not have been done well with a lookup table. It wasn't that much code, and the simple parameter space made playability tuning really easy.
On modern CPUs, the data dependencies introduced by lookup tables often cause pipeline stalls.
L1 cache is at a premium as well; you rarely want to waste it on LUTs.
You can compute a lot in the ~12 cycles an L2 hit (L1 miss) costs: in theory up to 32 × 12 = 384 floating point operations.
What adding a lookup table can do is to add the load-latency to the dependency chain involving the calculation, which seems to be what you are talking about here. For an L1 hit that's usually 4 or 5 cycles, and for L2 hits and beyond it's worse, as you point out. How much that actually matters depends on whether the code is latency-bound and whether the involved lookup is on the critical path: in many cases where there is enough ILP it won't be (a general rule is that in most code, most instructions are not on a critical dependency chain).
If the involved method isn't that hot then L1 misses (like your example) or worse are definitely a possibility. On the other hand, in that case performance isn't that critical by definition. If the method is really hot, e.g., in a tight(ish) loop, then you are mostly going to be getting L1 hits.
The comparison with 384 FOPs seems a bit off: I guess you are talking about some 32-FOP-per-cycle SIMD implementation (AVX512?), but the assumption of data dependencies kind of rules that out: one would assume it's scalar code here. If it's vectorization, then the whole equation changes!
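To make the latency-vs-throughput distinction concrete, here is a minimal C sketch (the names and the 256-byte table are illustrative, not from any code discussed here). In the first loop the lookup sits on the critical path, so its load-to-use latency is paid serially; in the second the lookups are independent and the CPU can overlap them:

```c
#include <stdint.h>

uint8_t lut[256];  /* hypothetical 256-entry byte LUT */

/* Latency-bound: each iteration's index depends on the previous load,
   so the ~4-5 cycle L1 load-to-use latency is paid once per iteration. */
uint8_t chase_dependent(uint8_t x, int n) {
    for (int i = 0; i < n; i++)
        x = lut[x];              /* next address needs previous result */
    return x;
}

/* Throughput-bound: the lookups are independent, so many loads can be
   in flight at once and the per-load latency is largely hidden by ILP. */
uint32_t sum_independent(const uint8_t *in, int n) {
    uint32_t s = 0;
    for (int i = 0; i < n; i++)
        s += lut[in[i]];         /* independent of the other iterations */
    return s;
}
```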
If it's not vectorizable, a LUT result is often used for an indirect jump/call (like a large switch statement) or a memory access (say, a histogram, etc.).
> What adding a lookup table can do is to add the load-latency to the dependency chain involving the calculation, which seems to be what you are talking about here.
Yeah, I used a bit of sloppy terminology. Loads can affect performance system-wide. To be a win, the function the LUT replaces needs to be something pretty heavy, while the LUT itself needs to be small (under 16 kB at the very least, preferably under 1 kB).
> If the method is really hot, e.g., in a tight(ish) loop, then you are mostly going to be getting L1 hits.
That really depends. It's generally good to keep the L1 footprint small: there are just 512 64-byte lines in a typical 32 kB L1 data cache, and the sibling hyperthread shares the L1 as well. There can be other hot loops nearby that would also benefit from a hot L1, so it's very easy to start spilling to L2 (and further). Microbenchmarks often miss "system" level issues.
> The comparison with 384 FOPs seems a bit off
384 was for the extreme vectorization case: 12 cycles × 2 FMA units × 8 lanes (AVX) gives 192 FMACs, and most vendors nowadays count each FMAC as two FOPs, hence 384...
> If it's vectorization, then the whole equation changes
Well, isn't that where the performance wins are and what you need to do to extract maximum performance from that hot loop? A good truly parallel vector gather implementation could make (small) LUTs very interesting performance wise.
You've lost me. The parent comments were specifically talking about using LUTs to replace calculation, in particular calculations involving branches. So basically, rather than some ALU ops plus possibly some branches, you use a LUT and fewer or zero ALU ops, and fewer or zero branches.
No one is talking about the context of a LUT of function pointers being used for a switch statement.
A histogram is generally a read/write table in memory and doesn't have much to do with a (usually read-only) LUT - unless I missed what you're getting at there.
> Yeah, I used a bit of sloppy terminology. Loads can affect performance system-wide. To be a win, the function the LUT replaces needs to be something pretty heavy, while the LUT itself needs to be small (under 16 kB at the very least, preferably under 1 kB).
Definitely. Useful LUTs are often a few dozen bytes: the JumpMForceData one referred to above was only 8 bytes!
> Microbenchmarks often miss "system" level issues.
But it's like a reverse myth now: at some point (apparently?) everyone loved LUTs - but now it's popular to just dismiss any LUT use with: "yeah but a cache miss takes yyy cycles which will make a LUT terrible!" or "microbenchmarks can't capture the true cost of LUTs!".
Now the latter is certainly true, but you can put reasonable bounds on the cost. The key observation is that the "deeper" the miss (a miss to DRAM being the deepest, ignoring swap), the lower the implied frequency at which the LUT-using method was being called anyway. If the method is always missing to DRAM, the LUT entries are being evicted before the next invocation that hits the same line, which means invocations must be quite far apart. Conversely, when the LUT method is very hot (a high rate of calls), the opportunity for the LUT to stay in cache is great.
You can even analyze this more formally: basically looking at the cost-benefit of every line of cache used by the LUT: if the LUT didn't use that line, what would be the benefit to the surrounding code? For any non-trivial program this gets hard to do exactly, but certainly with integration benchmarks and performance counters you can make some reasonable tests. Associativity makes everything tougher and less linear though...
> Well, isn't that where the performance wins are and what you need to do to extract maximum performance from that hot loop?
Sure, if it can be vectorized. My point there was that you were discussing the latency of L2 misses rather than the throughput, which implies that the operations on consecutive elements are dependent (otherwise it would be the throughput of 1 or 2 loads per cycle that mattered). So you just have to keep it apples to apples: if you assume independent ops, you can perhaps vectorize, but then the LUT comparison is a throughput one (as is the scalar alternative); if the code is dependent and non-vectorizable, then the LUT latency becomes more important.
> A good truly parallel vector gather implementation could make (small) LUTs very interesting performance wise.
Anything that can be vectorized usually adds another factor of 4 or 8 to performance, making it much harder for LUTs to come out on top, since the gather implementations on x86 just use the same scalar load ports and are thus still limited to 2 loads/cycle, without any "smarts" for identical or overlapping elements.
Sometimes you can use pshufb to do what amounts to 16 or 32 parallel lookups in a 16-element table of bytes. If your LUT can be made that small, it works great.
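Here's a hedged sketch of that pshufb trick with SSE intrinsics (SSSE3's `_mm_shuffle_epi8`; the nibble-popcount table is a classic example of a LUT small enough to live in a register, not something from this thread):

```c
/* Compile with -mssse3 (or equivalent). Illustrative example: per-byte
   popcount via a 16-entry in-register LUT, i.e. 16 lookups per pshufb. */
#include <stdio.h>
#include <stdint.h>
#include <tmmintrin.h>   /* SSSE3: _mm_shuffle_epi8 (pshufb) */

int main(void) {
    /* The 16-entry LUT lives in a register: popcount of each nibble 0..15. */
    const __m128i lut = _mm_setr_epi8(0, 1, 1, 2, 1, 2, 2, 3,
                                      1, 2, 2, 3, 2, 3, 3, 4);
    const __m128i low_mask = _mm_set1_epi8(0x0F);

    uint8_t in[16] = {0x00, 0x01, 0x03, 0x07, 0x0F, 0x10, 0x33, 0x55,
                      0x77, 0xAA, 0xF0, 0xFF, 0x12, 0x34, 0x56, 0x78};
    __m128i v = _mm_loadu_si128((const __m128i *)in);

    /* pshufb indexes with the low 4 bits of each byte, so split each
       input byte into its low and high nibble first. */
    __m128i lo = _mm_and_si128(v, low_mask);
    __m128i hi = _mm_and_si128(_mm_srli_epi16(v, 4), low_mask);

    /* Two pshufbs = 32 parallel table lookups; summing the two gives
       the popcount of every input byte at once. */
    __m128i cnt = _mm_add_epi8(_mm_shuffle_epi8(lut, lo),
                               _mm_shuffle_epi8(lut, hi));

    uint8_t out[16];
    _mm_storeu_si128((__m128i *)out, cnt);
    for (int i = 0; i < 16; i++)
        printf("popcount(0x%02X) = %d\n", in[i], out[i]);
    return 0;
}
```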
A lookup table meshes much better with fixed-frame-rate gameplay, either with one entry per frame, or by quantizing countdown timers of how many frames to wait before going to the next state.
That's not the full story, though.
From Takashi Tezuka:
> “In the end we used the New Super Mario Bros. U system for all of the game styles. There was quite a lot of discussion about this within the team. Staff who had strong attachment to the original games expressed a strong desire to see implemented the same system they remembered. However, when players who are used to the modern Mario physics tried playing with the old physics, they found it much more difficult than they remembered."
People on reddit have done some pretty extensive breakdowns of the physics of different Mario versions vs. SMM:
Always remember to view past "Tricks" from the era they were written in.
Lookup tables, in the 90's, for example, were used EVERYWHERE. People always used for sin/cos/tan functions. Fixed-point math was very common as well. Nowadays, it may seem like "magic" but it's not. It was just "the way" to get things done.
I'm not downplaying the skill of the programmers of those eras; just keep it in perspective. Many "tricks" weren't invented for that one game; they were commonplace for all programmers using those platforms.
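As a concrete taste of the era's style, here's a minimal sketch of a fixed-point sine LUT in C (the 256-step angle and the 8.8 fixed-point format are illustrative choices, not from any particular game):

```c
/* Illustrative sizes: 256 "binary degrees" per circle, 8.8 fixed point. */
#include <stdio.h>
#include <stdint.h>
#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

#define ANGLE_STEPS 256   /* full circle = 256 steps */
#define FIX_ONE     256   /* 8.8 fixed point: 1.0 == 256 */

static int16_t sin_lut[ANGLE_STEPS];

/* Fill the table once at startup; sin() is never called at runtime. */
static void init_sin_lut(void) {
    for (int a = 0; a < ANGLE_STEPS; a++)
        sin_lut[a] = (int16_t)lround(sin(a * 2.0 * M_PI / ANGLE_STEPS) * FIX_ONE);
}

/* uint8_t wraps at 256, so no explicit range reduction is needed;
   cos(x) = sin(x + 90 degrees) = sin(x + 64 steps). */
static int16_t fix_sin(uint8_t angle) { return sin_lut[angle]; }
static int16_t fix_cos(uint8_t angle) { return sin_lut[(uint8_t)(angle + 64)]; }

int main(void) {
    init_sin_lut();
    /* Walk a point around a circle of radius 100 using integers only. */
    for (int a = 0; a < 256; a += 32)
        printf("angle %3d -> (%4d, %4d)\n", a,
               100 * fix_cos((uint8_t)a) / FIX_ONE,
               100 * fix_sin((uint8_t)a) / FIX_ONE);
    return 0;
}
```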
I used to use compiled sprites all the time in the 90's. They seem like magic today, but they were just another kind of drawing to us. In Allegro 4, you could even create one by simply calling a "make compiled sprite" function and then making sure not to draw it outside the clipping rectangle (since compiled sprites can't be clipped). That was it: you build it with a builder function, and then you call draw on it. But it "seems" insane nowadays to convert the bits of a bitmap into raw machine code that draws those bits, just to save some CPU cycles.
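If memory serves, the Allegro 4 flow looked roughly like this (a hedged sketch from memory; check the Allegro 4 docs for the exact signatures):

```c
/* Hedged sketch of the Allegro 4 compiled-sprite flow, from memory. */
#include <allegro.h>

/* Assumes screen_buf and sprite_bmp are valid BITMAPs and that the
   sprite lies fully inside the destination: compiled sprites can't clip. */
void draw_fast(BITMAP *screen_buf, BITMAP *sprite_bmp, int x, int y) {
    /* "Compile" the bitmap into machine code that draws it. */
    COMPILED_SPRITE *cs = get_compiled_sprite(sprite_bmp, 0 /* not planar */);

    draw_compiled_sprite(screen_buf, cs, x, y);

    destroy_compiled_sprite(cs);
}
```

In practice you'd compile the sprite once at load time and reuse it every frame, of course.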
speeds = [0, 1, 2, 4, 8, 4, 2, 1, 0]
Then you keep track of how many frames it's been since you jumped, and just do
yVelocity = speeds[ticksSinceJump]
No math needed at all.
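A minimal runnable version of that idea in C (the names are illustrative; a real game would also reset the counter on landing, switch to falling physics after the arc, and so on):

```c
#include <stddef.h>

/* One Y-velocity entry per frame of the jump arc (values from the
   comment above; not from any actual game). */
static const int jump_speeds[] = {0, 1, 2, 4, 8, 4, 2, 1, 0};

int jump_y_velocity(size_t ticks_since_jump) {
    const size_t n = sizeof jump_speeds / sizeof jump_speeds[0];
    /* Past the end of the arc, just hold the last entry. */
    if (ticks_since_jump >= n)
        ticks_since_jump = n - 1;
    return jump_speeds[ticks_since_jump];
}
```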
I've seen Shigeru Miyamoto speak at several game developer conferences over the years. He's absolutely brilliant, a really nice guy, and there's so much to learn by studying his work and listening to him talk. Will Wright calls him the Steven Spielberg of games.
At one of his earlier talks, he explained that he starts designing games by thinking about how you touch, manipulate and interact with the input device in the real world, instead of thinking about the software and models inside the virtual world of the computer first. The instantaneous response of Mario 64 and how you can run and jump around is a great example of that.
Shigeru Miyamoto GDC 1999 Keynote (Full): https://www.youtube.com/watch?v=LC2Pf5F2acI
At a later talk, about how he designed the Wii, he said that he now starts designing games by thinking about what kind of expression he wants the game to evoke on players' faces, and how to make the players themselves entertain the other people in the room who aren't even playing. That's why the Wii has so many great party games, like Wii Sports. Then he showed a video of a little girl sitting in her grandfather's lap playing a game -- http://youtu.be/SY3a4dCBQYs?t=12m29s -- with a delighted expression on her face. The grandfather was delighted and entertained by watching his granddaughter enjoy the game.
This photo -- https://i.imgur.com/zSbOYbk.jpg -- perfectly illustrates exactly what he means!
Shigeru Miyamoto 2007 GDC Keynote - Part 1: https://www.youtube.com/watch?v=En9OXg7lZoE
Shigeru Miyamoto 2007 GDC Keynote - Part 2: https://www.youtube.com/watch?v=jer1KCPTcdE
Shigeru Miyamoto 2007 GDC Keynote - Part 3: https://www.youtube.com/watch?v=SY3a4dCBQYs
Shigeru Miyamoto 2007 GDC Keynote - Part 4: https://www.youtube.com/watch?v=jqBee2YlDPg
Shigeru Miyamoto 2007 GDC Keynote - Part 5: https://www.youtube.com/watch?v=WI3DB3tYiOw
Shigeru Miyamoto 2007 GDC Keynote - Part 6: https://www.youtube.com/watch?v=XvwYBSkzevw
Shigeru Miyamoto Keynote GDC 07 - Wife-o-meter: https://www.youtube.com/watch?v=6GMybmWHzfU
I've always loved this as a visualization of machine code. It's simultaneously amazing how complex it is and yet how simple it is when you consider that it represents the entire game all in one graphic. This is from 2007 I believe. I love that 10 years later the machine code is now fully annotated and understood.
I tried to see if there was something very central, called from lots of places, but there doesn't seem to be any such place. There are a few, but not something singular that stands out as completely dominant.
Long jumps get a more dominant visual appearance than short jumps. It would be interesting to see what it would look like with each address printed at a size proportional to, e.g., how many places call it, or how CPU-intensive that subroutine is.
It just simulates player input and runs it through the regular game engine. (The alternative, playing a recorded video, would have been laughably data intensive.)
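A hedged sketch of what "simulated player input" usually means: a run-length-encoded script of button states fed through the same input-reading path the player uses (the format and button bits here are illustrative, not SMB's actual demo data):

```c
#include <stdint.h>

/* One demo event: hold this button bitmask for `frames` frames.
   (Illustrative format; real demo data is often even more compact.) */
typedef struct { uint8_t buttons; uint8_t frames; } DemoEvent;

enum { BTN_RIGHT = 0x01, BTN_A = 0x80 };  /* made-up bit assignments */

static const DemoEvent demo[] = {
    {BTN_RIGHT,         120},  /* walk right for two seconds */
    {BTN_RIGHT | BTN_A,  20},  /* jump while moving */
    {BTN_RIGHT,          60},
    {0,                   0},  /* terminator */
};

/* Called once per frame in place of polling the real controller. */
uint8_t demo_read_input(void) {
    static int idx = 0;
    static int left = -1;
    if (left < 0)
        left = demo[0].frames;  /* lazy init on the first frame */
    if (demo[idx].frames == 0)
        return 0;               /* script exhausted: no input */
    uint8_t buttons = demo[idx].buttons;
    if (--left == 0) {
        idx++;
        left = demo[idx].frames;
    }
    return buttons;
}
```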
Part 1: https://www.youtube.com/watch?v=UnU7DJXiMAQ
Part 2: https://www.youtube.com/watch?v=f1kbABTyeo8
Here's one of my favorite parts:
The byte here is the opcode for the BIT instruction, but why is it just a lonely byte? Well, the BIT instruction in this case also consumes the two following bytes as its operand. When the game processes that instruction, the `ldy #$04` is swallowed up as part of the BIT instruction, effectively skipping over it. IIRC this was a pretty common trick among 6502 programmers: it lets you jump over the next (2-byte) instruction with just a single byte!
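Sketching the byte layout of that trick as a raw byte array (illustrative values, not the actual SMB bytes):

```c
/* Two entry points into one instruction stream (illustrative bytes).
   Entering at entry_a runs ldy #$02; the lone $2C is then decoded as a
   3-byte "BIT absolute", consuming the A0 04 pair as its operand
   address ($04A0), so ldy #$04 never executes. Entering at entry_b
   runs ldy #$04 normally. One byte replaces a 2-byte jump. */
static const unsigned char stream[] = {
    /* entry_a: */ 0xA0, 0x02,  /* ldy #$02                              */
                   0x2C,        /* BIT opcode: swallows the next 2 bytes */
    /* entry_b: */ 0xA0, 0x04,  /* ldy #$04 (BIT operand when skipped)   */
};
```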
And some other Pokémon games:
It's a much larger game, though.
I haven't checked, but I'm assuming that a lot of the giant chunks of statically-defined data are just graphics and audio (unlike many games, LoZ stored graphics interspersed with the program code and copied data over to an in-cartridge RAM chip, instead of storing the complete graphics data in a ROM chip.)
This is because Zelda 1 was a port from the Famicom Disk System - no memory-mapped ROM chip to rely on, so you've got to load everything you're going to use into RAM. (Also like this: Metroid.)
I believe this is why both LoZ's and Metroid's maps are built out of individual "screens" with a "pause to transition" effect between them: in the FDS version, the game would be reading the new map from disk, and there'd (sometimes, if the load took long enough) be a loading screen involved. (You can see the screen for LoZ here: http://tcrf.net/The_Legend_of_Zelda/Console_Differences#Load...)
Would the original game have been written in assembly? And if so, would the source have looked similar to this?
Having never touched assembly language (aside from learning some very basic cracking many years ago swapping JE for JNE in the serial check routine, haha), it seems like a true dark art to me, so I’m really curious to know!
The source would've looked very similar to this, although I can assume the original labels would've been in Japanese. The difficulty in creating a disassembly like this isn't converting the machine code back into assembly, which can be done rather simply, but instead re-adding all the label names, which are lost when the game is built. It's quite the undertaking, and the author must know the complete game back to front.
Here's some actual Atari 7800 (a less popular console from the same generation) code that was found on disks in a dumpster when some Atari offices closed. They both use a 6502-based CPU but have very different sound/graphics chips. I'd bet the NES code looked a lot more like this - https://github.com/OpenSourcedGames/Atari-7800
Yes, and probably somewhat. The programmers were Japanese, so variable names would almost certainly be different, and the assembler this code is written for actually targets the CPU in the Super Famicom/SNES; although I'm not sure when that assembler was written, it certainly wasn't around in 1984 when this game was being written. I think the notation style is based on Nintendo's system development documentation, though.
The NES would actually be a good place to look at some assembly, at least to get a basic idea of how it works. There aren't many operations, and they're pretty easy to understand. There are only a few registers, and no historic cruft layered on top of it. The same (well, very close) CPU was used in a lot of computers from the same era: https://en.wikipedia.org/wiki/MOS_Technology_6502#Computers_...
Super Mario Bros is an NES game.
I was commenting on the fact that since the assembler itself supports the 16-bit variants of the processor family, it couldn't be the same assembler that would have been used to build the original NES code, so the exact assembly language used might not be a precise match either.
I'm absolutely not meaning that as a pejorative. I am just curious if there's an actual goal here. Is the goal to make a faithful reproduction, historical reference, a hacking challenge, or? Is it just curiosity?
I see lots of links in the thread, many for different games. Curiosity is as valid a reason as any other, but is there some sort of end result trying to be had? The various links don't actually seem to enumerate this very well, unless I missed it.
What's this endless loop for: https://gist.github.com/1wErt3r/4048722#file-smbdis-asm-L712
(Yes, please ;)
I'd love to see the process of extracting actual audio from that.
(Yes, I know it was written in ASM).