
A Comprehensive Super Mario Bros. Disassembly - shubhamjain
https://gist.github.com/1wErt3r/4048722
======
jordigh
[https://gist.github.com/1wErt3r/4048722#file-smbdis-
asm-L601...](https://gist.github.com/1wErt3r/4048722#file-smbdis-asm-L6011)

I think this is where the real gems start. The biggest contribution that SMB
had was the "physics engine", to retrofit a modern term. The friction, the
jumping, the inertia. If you compare it with the primitive physics in Donkey
Kong or Mario Brothers, you can really grasp the groundbreaking novelty that
was SMB. You can change direction in mid-air, but not too much. When you run,
you skid if you try to run in the opposite direction. The height of your jumps
is affected by your running speed.

It's all of these little details combined, barely noticed in tandem, which
made the game new and fun.

~~~
davidscolgan
I once read that the way SMB was able to pull off the physics engine on such
limited hardware was that it used lookup tables for physics instead of
actually calculating velocity. My assembly-fu is weak but it looks like your
link is to the section that contains all the lookup tables. I think
JumpMForceData for example is a series of offsets for each successive frame
after you hit the jump button.

[https://gist.github.com/1wErt3r/4048722#file-smbdis-
asm-L611...](https://gist.github.com/1wErt3r/4048722#file-smbdis-asm-L6116)

This line shows the calculation that makes use of the JumpMForceData.
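As a rough illustration of the table-driven approach (the values below are invented for the sketch, not the actual JumpMForceData bytes from the ROM), the idea looks something like this in Python:

```python
# Toy sketch of table-driven jump physics: instead of computing velocity,
# each frame after the jump button is pressed indexes a precomputed table.
# These table values are made up for illustration only.

JUMP_FORCE = [-40, -40, -38, -34, -28, -20, -10, 0]  # hypothetical per-frame offsets

def jump_arc(frames):
    """Integrate a jump using table lookups instead of computed physics."""
    y = 0
    arc = []
    for frame in range(frames):
        # Clamp the index so the final entry repeats once the table runs out.
        force = JUMP_FORCE[min(frame, len(JUMP_FORCE) - 1)]
        y += force
        arc.append(y)
    return arc

print(jump_arc(4))  # [-40, -80, -118, -152]
```

Tuning the feel of the jump then means editing eight table bytes rather than re-deriving any physics, which is cheap both at design time and on a 6502.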

~~~
dieterrams
Lookup tables were indeed a common technique used by games in the past.

~~~
asveikau
And present, too, right? It's not the same reason as it would have been in the
80s, but today in performance critical code it is not uncommon to reduce the
number of conditionals for better CPU pipelining, and lookup tables are a very
common tool for this.
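A minimal sketch of that branch-elimination pattern, with an invented byte-classification example: the conditional chain runs once to build the table, and every later call is a single index.

```python
# Replace a chain of conditionals with a 256-entry lookup table.
# The category names and ranges here are invented for illustration.

def classify_branchy(b):
    if 48 <= b <= 57:                      # '0'..'9'
        return "digit"
    if 65 <= b <= 90 or 97 <= b <= 122:    # 'A'..'Z', 'a'..'z'
        return "alpha"
    return "other"

# Build the table once; afterwards no branches are needed per lookup.
CLASS_LUT = [classify_branchy(b) for b in range(256)]

def classify_lut(b):
    return CLASS_LUT[b]

print(classify_lut(ord("7")), classify_lut(ord("z")))  # digit alpha
```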

~~~
vardump
> it is not uncommon to reduce the number of conditionals for better CPU
> pipelining, and lookup tables are a very common tool for this.

On modern CPUs, data dependencies such as lookup tables often cause pipeline
stalls, i.e. worse pipelining.

L1 cache is at a premium as well, you rarely want to waste it to access LUTs.

You can compute a _lot_ in the 12 cycles caused by an L2 hit (L1 miss). In
_theory_, up to 32 * 12 = 384 floating point operations.

~~~
BeeOnRope
To be fair, replacing a series of ALU ops with a lookup table doesn't usually
add a "data dependency" - the data dependency probably already existed, but
perhaps flowed through registers rather than memory.

What adding a lookup table can do is to add the load-latency to the dependency
chain involving the calculation, which seems to be what you are talking about
here. For an L1 hit that's usually 4 or 5 cycles, and for L2 hits and beyond
it's worse, as you point out. How much that actually matters depends on
whether the code is latency-bound and the involved lookup is on the critical
path: in many cases where there is enough ILP it won't be (a general rule is
that in most code, most instructions are not on a critical dependency chain).

If the involved method isn't that hot then L1 misses (like your example) or
worse are definitely a possibility. On the other hand, in that case
performance isn't that critical by definition. If the method is really hot,
e.g., in a tight(ish) loop, then you are mostly going to be getting L1 hits.

The comparison with 384 FOPs seems a bit off: I guess you are talking about
some 32-FOP-per-cycle SIMD implementation (AVX512?) - but the assumption
of data dependencies kind of rules that out: one would assume it's scalar code
here. If it's vectorization, then the whole equation changes!

~~~
vardump
> To be fair, replacing a series of ALU ops with a lookup table doesn't
> usually add a "data dependency"

If it's not vectorizable, LUT result is often used for indirect jump/call
(like large switch statement) or memory access (say, a histogram etc.).

> What adding a lookup table can do is to add the load-latency to the
> dependency chain involving the calculation, which seems to be what you are
> talking about here.

Yeah, I used a bit of sloppy terminology. Loads can affect performance
system-wide. To be a win, the function using the LUT needs to be something
pretty heavy, while the LUT itself needs to be small (at least <16 kB,
preferably <1 kB).

> If the method is really hot, e.g., in a tight(ish) loop, then you are mostly
> going to be getting L1 hits.

That really depends. It's generally good to keep the L1 footprint small. There
are just 512 64-byte L1 cache lines. The hyperthread sibling shares L1 as
well. There can
be other hot loops nearby that could also benefit from hot L1. It's very easy
to start to spill to L2 (and further). Microbenchmarks often miss "system"
level issues.

> The comparison with 384 FOPs seems a bit off

384 was for the extreme vectorization case, 12 x 2 x 8 FMACs (AVX). Most
vendors count FMACs nowadays as two FOPs...

> If it's vectorization, then the whole equation changes

Well, isn't that where the performance wins are and what you need to do to
extract maximum performance from that hot loop? A good truly parallel vector
gather implementation could make (small) LUTs very interesting performance
wise.

~~~
BeeOnRope
> If it's not vectorizable, LUT result is often used for indirect jump/call
> (like large switch statement) or memory access (say, a histogram etc.).

You've lost me. The parent comments were specifically talking about using LUTs
to replace calculation, and particular calculations involving branches. So
basically rather than some ALU ops + possibly some branches, you use a LUT and
fewer or zero ALU ops, and fewer or zero branches.

No one is talking about the context of a LUT of function pointers being used
for a switch statement.

A histogram is generally a read/write table in memory and doesn't have much to
do with a (usually read-only) LUT - unless I missed what you're getting at
there.

> Yeah, I used a bit of sloppy terminology. Loads can affect performance
> system-wide. To be a win, the function using the LUT needs to be something
> pretty heavy, while the LUT itself needs to be small (at least <16 kB,
> preferably <1 kB).

Definitely. Useful LUTs are often a few dozen bytes: the JumpMForceData one
referred to above was only _8 bytes_!

> Microbenchmarks often miss "system" level issues.

Definitely.

But it's like a reverse myth now: at some point (apparently?) everyone loved
LUTs - but now it's popular to just dismiss any LUT use with: "yeah but a
cache miss takes yyy cycles which will make a LUT terrible!" or
"microbenchmarks can't capture the true cost of LUTs!".

Now the latter is certainly true, but you can certainly put reasonable bounds
on the cost. The key observation is that the "deeper" the miss (a miss to DRAM
being the deepest, ignoring swap), the lower the implied frequency at which
the LUT-using method was being called anyway. If the method is always missing
to DRAM, the LUT entries are being cleared out before the next invocation
(that hits the same line), which must be a "while". Conversely, when the LUT
method is very hot (a high rate of calls), the opportunity for the LUT to stay
in cache is great.

You can even analyze this more formally: basically looking at the cost-benefit
of every line of cache used by the LUT: if the LUT didn't use that line, what
would be the benefit to the surrounding code? For any non-trivial program this
gets hard to do exactly, but certainly with integration benchmarks and
performance counters you can make some reasonable tests. Associativity makes
everything tougher and less linear though...

> Well, isn't that where the performance wins are and what you need to do to
> extract maximum performance from that hot loop?

Sure, if it can be vectorized. My point there was that you were discussing the
_latency_ of L2 misses rather than the throughput, which implies that the
operations on consecutive elements were dependent (otherwise it would be the
throughput of 1 or 2 per cycle that would be important). So you just have to
keep it apples to apples: if you assume independent ops, you can perhaps
vectorize, but then the LUT comparison is a throughput one (as is the scalar
alternative); if the code is dependent and non-vectorizable, then the LUT
latency becomes more important.

> A good truly parallel vector gather implementation could make (small) LUTs
> very interesting performance wise.

Anything that can be vectorized usually adds another factor of 4 or 8 to
performance, making it much harder for LUTs to come out on top, since the
gather implementations on x86 anyway just use the same scalar load ports and
are thus still limited to 2/cycle, without any "smarts" for identical or
overlapping elements.

Sometimes you can use pshufb to do what amounts to 16 or 32 parallel lookups
in a 16-element table of bytes. If your LUT can be made that small, it works
great.
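A behavioral sketch of that pshufb-style parallel lookup, with Python standing in for the SSSE3 intrinsic: each of 16 index bytes selects from a 16-byte table in one operation.

```python
# Rough model of pshufb semantics: each of the 16 index bytes selects an
# entry from a 16-byte table; an index with the high bit set yields 0.
# This is only a behavioral sketch of what the real instruction does.

def pshufb(table, indices):
    assert len(table) == 16 and len(indices) == 16
    return bytes(0 if i & 0x80 else table[i & 0x0F] for i in indices)

# Example LUT: map each nibble to its bit-reversed value.
NIBBLE_REVERSE = bytes(int(f"{n:04b}"[::-1], 2) for n in range(16))

out = pshufb(NIBBLE_REVERSE, bytes(range(16)))
print(list(out[:4]))  # [0, 8, 4, 12]
```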

------
DonHopkins
I wrote this earlier on another forum but I'll repost it here:

I've seen Shigeru Miyamoto speak at several game developer conferences over
the years. He's absolutely brilliant, a really nice guy, and there's so much
to learn by studying his work and listening to him talk. Will Wright calls him
the Steven Spielberg of games.

At one of his earlier talks, he explained that he starts designing games by
thinking about how you touch, manipulate and interact with the input device in
the real world, instead of thinking about the software and models inside the
virtual world of the computer first. The instantaneous response of Mario 64
and how you can run and jump around is a great example of that.

Shigeru Miyamoto GDC 1999 Keynote (Full):
[https://www.youtube.com/watch?v=LC2Pf5F2acI](https://www.youtube.com/watch?v=LC2Pf5F2acI)

At a later talk, about how he designed the Wii, he said that he now starts
designing games by thinking about what kind of expression he wants the game to
evoke on the players' faces, and how to make the players themselves entertain
the other people in the room who aren't even playing the game. That's
why the Wii has so many great party games, like Wii Sports. Then he showed a
video of a little girl sitting in her grandfather's lap playing a game --
[http://youtu.be/SY3a4dCBQYs?t=12m29s](http://youtu.be/SY3a4dCBQYs?t=12m29s) ,
with a delighted expression on her face. The grandfather was delighted and
entertained by watching his granddaughter enjoy the game.

This photo --
[https://i.imgur.com/zSbOYbk.jpg](https://i.imgur.com/zSbOYbk.jpg) --
perfectly illustrates exactly what he means!

Shigeru Miyamoto 2007 GDC Keynote - Part 1:
[https://www.youtube.com/watch?v=En9OXg7lZoE](https://www.youtube.com/watch?v=En9OXg7lZoE)

Shigeru Miyamoto 2007 GDC Keynote - Part 2:
[https://www.youtube.com/watch?v=jer1KCPTcdE](https://www.youtube.com/watch?v=jer1KCPTcdE)

Shigeru Miyamoto 2007 GDC Keynote - Part 3:
[https://www.youtube.com/watch?v=SY3a4dCBQYs](https://www.youtube.com/watch?v=SY3a4dCBQYs)

Shigeru Miyamoto 2007 GDC Keynote - Part 4:
[https://www.youtube.com/watch?v=jqBee2YlDPg](https://www.youtube.com/watch?v=jqBee2YlDPg)

Shigeru Miyamoto 2007 GDC Keynote - Part 5:
[https://www.youtube.com/watch?v=WI3DB3tYiOw](https://www.youtube.com/watch?v=WI3DB3tYiOw)

Shigeru Miyamoto 2007 GDC Keynote - Part 6:
[https://www.youtube.com/watch?v=XvwYBSkzevw](https://www.youtube.com/watch?v=XvwYBSkzevw)

Shigeru Miyamoto Keynote GDC 07 - Wife-o-meter:
[https://www.youtube.com/watch?v=6GMybmWHzfU](https://www.youtube.com/watch?v=6GMybmWHzfU)

~~~
notaboutdave
The Miyamoto approach of starting with a desired emotion and working backward
toward a design is profound. This is radically different from most things I've
read which involve cramming emotion into existing designs. This changes
everything for me. Thanks for sharing.

------
jzl
Ben Fry, the creator of Processing, created a beautiful visualization of the
disassembled machine code of Super Mario Bros with arrows representing jump
instructions:

[http://benfry.com/dismap/mario-large2.jpg](http://benfry.com/dismap/mario-
large2.jpg)

via: [http://benfry.com/dismap/](http://benfry.com/dismap/)

I've always loved this as a visualization of machine code. It's simultaneously
amazing how complex it is and yet how simple it is when you consider that it
represents the entire game all in one graphic. This is from 2007 I believe. I
love that 10 years later the machine code is now fully annotated and
understood.

~~~
retSava
Wow, that is amazingly complex! It's beautiful, in a way.

I tried to see if there was something very central, called from lots of
places, but there doesn't seem to be any such place. There are a few, but not
something singular that stands out as completely dominant.

Long jumps get a more dominant visual appearance than short jumps. It would be
interesting to see what it would look like with each address printed at a size
proportional to, e.g., how many places call it, or how CPU-intensive that
subroutine is, etc.

~~~
retSava
Also check out the other example dismap shows: Excitebike! Far fewer and
longer subroutines. Interesting; I wonder whether that is mostly a result of
differences in the developers' coding styles, or of the game mechanics being
different.

------
raldi
Check out the section under "DemoActionData": this is where it stores (and
plays) the demo you see when you don't push Start and Mario runs around of his
own volition.

It just simulates player input and runs it through the regular game engine.
(The alternative, playing a recorded video, would have been laughably data
intensive.)
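A hypothetical sketch of that scheme; the button names and (buttons, frames) encoding below are invented for illustration, not taken from DemoActionData:

```python
# Hypothetical sketch of demo playback as simulated input: the demo is
# stored as (buttons, frame_count) pairs and fed into the same input path
# a real controller would use. All entries here are invented.

DEMO_SCRIPT = [("RIGHT", 90), ("RIGHT|A", 20), ("RIGHT", 60), ("", 30)]

def demo_inputs():
    """Yield one simulated controller state per frame."""
    for buttons, frames in DEMO_SCRIPT:
        for _ in range(frames):
            yield buttons

frames = list(demo_inputs())
print(len(frames), frames[0], frames[95])  # 200 RIGHT RIGHT|A
```

A few bytes per held-button run versus a full video stream makes the data savings obvious, and the engine exercises exactly the same code paths as real play.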

~~~
dEnigma
Same thing goes for Super Mario 64. The popular TASer pannenkoek actually
explored whether it was possible to manipulate Demo-Mario's starting position
in such a way that he collects a star with the demo input (this was for the
purposes of the special "A-Button Challenge" speedruns, where pressing the
A button must be kept to a minimum, but since the demo input isn't actual
player input it isn't counted). Sadly, I think there was no conceivable way to
do it (i.e., manipulating the starting position is possible, but not in a way
that leads to collecting a star).

[https://youtu.be/-0emgkIEobI](https://youtu.be/-0emgkIEobI)

~~~
opdahl
That's an amazing YouTube channel. Here [1] the creator describes, in a video
over seven minutes long, the intricacies of Mario falling asleep.

[1] [https://www.youtube.com/watch?v=7OtW-
LLZ2OA](https://www.youtube.com/watch?v=7OtW-LLZ2OA)

~~~
dEnigma
That's nothing! He has two videos on walls, floors and ceilings, each longer
than half an hour (and both of them extremely interesting)

Part 1:
[https://www.youtube.com/watch?v=UnU7DJXiMAQ](https://www.youtube.com/watch?v=UnU7DJXiMAQ)

Part 2:
[https://www.youtube.com/watch?v=f1kbABTyeo8](https://www.youtube.com/watch?v=f1kbABTyeo8)

------
justin_
I remember using this disassembly many years ago when writing a little NES
emulator. Having a reference available for a popular game is _incredibly_
useful.

Here's one of my favorite parts:
[https://gist.github.com/1wErt3r/4048722#file-smbdis-
asm-L942](https://gist.github.com/1wErt3r/4048722#file-smbdis-asm-L942)

The byte here is for the BIT instruction, but why is it just a lonely byte?
Well, the BIT instruction in this case also includes the two following bytes.
When the game processes that instruction, the `ldy #$04` is swallowed up as
part of the BIT instruction, effectively skipping over it. IIRC this was a
pretty common trick among 6502 programmers. It allows you to jump ahead
over the next (2-byte) instruction at the cost of just a single byte!
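A toy decoder showing why the trick works. The opcode facts are real 6502 ($2C is BIT absolute, 3 bytes; $A0 is LDY immediate, 2 bytes); everything else is a simplified stand-in:

```python
# Toy decoder for the ".db $2c" skip trick: the same bytes decode
# differently depending on where execution enters the stream.

CODE = bytes([0x2C, 0xA0, 0x04])   # BIT $04A0  /  LDY #$04
SIZES = {0x2C: 3, 0xA0: 2}         # instruction lengths in bytes
NAMES = {0x2C: "BIT", 0xA0: "LDY"}

def decode(code, pc):
    """Return the mnemonics executed starting from offset pc."""
    out = []
    while pc < len(code):
        op = code[pc]
        out.append(NAMES[op])
        pc += SIZES[op]
    return out

print(decode(CODE, 0))  # ['BIT'] -> the LDY bytes are swallowed as the operand
print(decode(CODE, 1))  # ['LDY'] -> entering one byte later executes LDY #$04
```

Entering at the $2C reads the next two bytes as a harmless BIT operand; branching past it executes the LDY normally, so one byte replaces a two-byte jump.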

------
looperhacks
You might also be interested in the disassembly of the first Pokémon games:

[https://github.com/pret/pokered](https://github.com/pret/pokered)

And some other Pokémon games:

[https://github.com/pret/pokered#see-
also](https://github.com/pret/pokered#see-also)

~~~
DarkTree
I loved this article regarding the algorithm used for capturing Pokémon:
[http://www.dragonflycave.com/mechanics/gen-i-
capturing](http://www.dragonflycave.com/mechanics/gen-i-capturing)

~~~
katastic
Good gosh, that was a long, fun read.

------
khedoros1
There's also a Legend of Zelda disassembly, but it's not as nice:

[https://github.com/camthesaxman/zeldasource](https://github.com/camthesaxman/zeldasource)

It's a much larger game, though.

I haven't checked, but I'm assuming that a lot of the giant chunks of
statically-defined data are just graphics and audio (unlike many games, LoZ
stored graphics interspersed with the program code and copied data over to an
in-cartridge RAM chip, instead of storing the complete graphics data in a ROM
chip.)

~~~
derefr
> unlike many games, LoZ stored graphics interspersed with the program code
> and copied data over to an in-cartridge RAM chip, instead of storing the
> complete graphics data in a ROM chip

This is because Zelda 1 was a port from the Famicom Disk System—no memory-
mapped ROM chip to rely on, so you've got to load everything you're going to
use to RAM. (Also like this: Metroid.)

I believe this is why both LoZ's and Metroid's maps are built out of
individual "screens" with a "pause to transition" effect between them: in the
FDD version, the game would be reading the new map from disk, and there'd
(sometimes, if the load took long enough) be a loading screen involved. (You
can see the screen for LoZ here:
[http://tcrf.net/The_Legend_of_Zelda/Console_Differences#Load...](http://tcrf.net/The_Legend_of_Zelda/Console_Differences#Loading_Screen))

~~~
khedoros1
Cool, I never dug in to see why they were that way, but those were both test
cases when I built an NES emulator.

------
DonHopkins
How I love the sleek smooth razor sharp columns of three letter 6502 opcodes.
The right edge of columns of opcodes in other instruction sets look so rough
and jagged like sandpaper in comparison. That's what I've always hated about
x86 code. It looks rough and torn.

~~~
andyjohnson0
Nice observation. It's 35+ years since I last did any 6502, and this morning
teenage-me looked at those opcodes and smiled a little.

------
indescions_2017
Master list of SMB glitches makes a nice companion to this. Sorry in advance
if it results in anyone staying up well past their bedtime ;)

[https://www.mariowiki.com/List_of_Super_Mario_Bros._glitches](https://www.mariowiki.com/List_of_Super_Mario_Bros._glitches)

~~~
laurent123456
It'd be great if someone who knows the disassembly well enough could explain
some of these glitches, for example the Minus World [0]. I guess the logic (or
bug) to activate it appears somewhere in the disassembly.

[0]
[https://www.mariowiki.com/Minus_World](https://www.mariowiki.com/Minus_World)

~~~
LocalH
[https://www.youtube.com/watch?v=Hv_h_R3o9r8](https://www.youtube.com/watch?v=Hv_h_R3o9r8)

------
binarymax
I love that this is just a gist. Like 'hey just needed to copy and paste this
for a minute'

------
x3ro
In case anyone else is trying to re-assemble this into a working game, here's
one way to do it :)

[https://github.com/x3ro/super-mario-bros-
build](https://github.com/x3ro/super-mario-bros-build)

------
tomduncalf
This seems like a really impressive effort to make sense of all this!

Would the original game have been written in assembly? And if so, would the
source have looked similar to this?

Having never touched assembly language (aside from learning some very basic
cracking many years ago swapping JE for JNE in the serial check routine,
haha), it seems like a true dark art to me, so I’m really curious to know!

~~~
aquova
Yes, all old NES games were written in 6502 assembly (named after the NES's
6502 processor), and even most games into the Super Nintendo and Game Boy days
were written using assembly language.

The source would've looked very similar to this, although I'd assume the
original labels would've been in Japanese. The difficulty in creating a
disassembly like this isn't converting the machine code back into assembly,
which can be done rather simply, but instead re-adding all the label names,
which are lost when the game is built. It's quite the undertaking, and the
author must know the complete game back to front.

~~~
bluedino
For another interesting read, there's the guy who disassembled Robotron,
traced the code out by hand across 512 printed pages of assembly, and fixed
two long-standing bugs.

[http://www.robotron2084guidebook.com/technical/christianging...](http://www.robotron2084guidebook.com/technical/christiangingras/)

------
7scan
I also have a makefile and game genie code generator for this on github:
[https://github.com/nwoeanhinnogaehr/smb-
assembler](https://github.com/nwoeanhinnogaehr/smb-assembler)

------
KGIII
What is the goal of a project like this?

I'm absolutely not meaning that as a pejorative. I am just curious if there's
an actual goal here. Is the goal to make a faithful reproduction, historical
reference, a hacking challenge, or? Is it just curiosity?

I see lots of links in the thread, many for different games. Curiosity is as
valid a reason as any other, but is there some sort of end result trying to be
had? The various links don't actually seem to enumerate this very well, unless
I missed it.

~~~
PyroLagus
It has many uses really. It's a nice resource for learning assembly,
especially 6502 assembly; it's a good reference for creating NES games,
whether those are original titles or simply romhacks; it's also a great
reference for building an emulator for the platform or a level editor for the
game. And of course, disassembling a game and documenting the source code is a
great hacking challenge that will leave you confident that you know asm. So,
all the reasons that you listed, I suppose.

~~~
KGIII
I can also understand a 'just because' answer. I was just curious if there's a
greater goal. I'm not a gamer but I do kind of pay attention to some aspects.

------
webXL
I love stuff like this. Seeing the disassembly somehow adds to the nostalgia
for a childhood pastime.

What's this endless loop for: [https://gist.github.com/1wErt3r/4048722#file-smbdis-asm-L712](https://gist.github.com/1wErt3r/4048722#file-smbdis-asm-L712)

(Yes, please ;)

~~~
jepler
You can see that just above the endless loop, the code will "enable NMIs". I'm
not sure about the nomenclature here (because normally NMI stands for non-
maskable interrupts, meaning you can't disable them) but basically the game at
this point becomes event (interrupt) driven, probably from the vertical
retrace interrupt or another kind of timer interrupt. When no event is being
handled, the CPU idles within this endless loop.
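One way to picture that structure (everything below is an invented stand-in; on the real hardware the per-frame work happens in an NMI handler fired once per vertical blank):

```python
# Sketch of the event-driven structure described above: the main code
# "enables NMIs" and then spins in an idle loop, while all per-frame work
# happens in a handler the hardware fires once per vertical blank.

frame_count = 0

def nmi_handler():
    """Runs once per vblank: read input, update game state, write video."""
    global frame_count
    frame_count += 1

def run(frames):
    """Stand-in for the idle loop; the hardware would interrupt it."""
    global frame_count
    frame_count = 0
    for _ in range(frames):  # each iteration models one vblank NMI
        nmi_handler()
    return frame_count

print(run(60))  # 60 -> roughly one second of NTSC vblanks
```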

~~~
jepler
[https://wiki.nesdev.com/w/index.php/PPU_registers#Controller...](https://wiki.nesdev.com/w/index.php/PPU_registers#Controller_.28.242000.29_.3E_write)
states that bit 7 of the register at $2000 (which they call PPUCTRL, and this
disassembly calls PPU_CTRL_REG1) will "generate an NMI at the start of the
vertical blanking interval"

------
camhenlin
If you're interested in the SMB disassembly, you may also be interested in
this one:
[http://bisqwit.iki.fi/jutut/megamansource/](http://bisqwit.iki.fi/jutut/megamansource/)
NES MegaMan disassembly, with some comments

------
wgrover
Looks like data for the various songs here:

[https://gist.github.com/1wErt3r/4048722#file-smbdis-
asm-L160...](https://gist.github.com/1wErt3r/4048722#file-smbdis-asm-L16027)

I'd love to see the process of extracting actual audio from that.

~~~
strangecasts
The NES had a memory-mapped APU[1], so the game just sets sound registers to
play the appropriate notes, and ticks down a timer until it's time to switch
to the next note: [https://gist.github.com/1wErt3r/4048722#file-smbdis-
asm-L156...](https://gist.github.com/1wErt3r/4048722#file-smbdis-asm-L15610)

[1]
[https://wiki.nesdev.com/w/index.php/APU](https://wiki.nesdev.com/w/index.php/APU)
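A behavioral sketch of that timer-driven sequencing, with invented song data and a simulated register write in place of the real APU:

```python
# Sketch of the note-timer scheme: each frame the music driver decrements
# a countdown, and only when it hits zero does it load the next note and
# "write the sound registers" (simulated here as a list append).
# The song data and register model are invented for illustration.

SONG = [(440, 30), (494, 15), (523, 45)]  # hypothetical (frequency, frames) pairs

def play(song):
    writes = []  # simulated APU register writes: (frame, frequency)
    timer = 0
    notes = iter(song)
    for frame in range(sum(f for _, f in song)):
        if timer == 0:
            freq, timer = next(notes)
            writes.append((frame, freq))  # one register write per note change
        timer -= 1
    return writes

print(play(SONG))  # [(0, 440), (30, 494), (45, 523)]
```

The key point is that the CPU touches the sound hardware only on note boundaries; in between, the APU keeps generating the tone on its own.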

------
kchr
Lots of retro game reverse engineering stuff on HN lately; please continue
this trend :-)

------
andyjohnson0
I'm not familiar with the history of this work, so I'll just ask. Is this the
assembler code that was written by the original dev team, or were the comments
reverse-engineered from a disassembly of the game's machine code?

~~~
thristian
It's a "disassembly", so it's the work of somebody converting machine code
back into assembler, and studying it long and hard to add comments, break out
numbers into sensibly named constants, and choose decent names for loop
labels. The original dev team was almost certainly not involved.

~~~
andyjohnson0
Thanks for the clarification. Given what you've said and the size of the
thing, this is an awesome piece of work. The level of detail and perseverance
required is enormously impressive.

------
segmondy
The entire game in 16,000 lines of assembly code. :-)

------
tinus_hn
I love how the ‘author’ expresses his thanks for everyone and his dog, except
for the people at Nintendo who actually wrote the code.

------
zxy_xyz
Is the gameplay logic in a different file or did i miss something?

~~~
psyc
It's all in there. Probably start with PlayerCtrlRoutine.

------
1zael
This is insane.

------
salqadri
Tu tu tu turuturu turuturutururur...

------
MBCook
Is there a version of this somewhere that people have changed it to C or some
other higher level language for easier skimming? I’d love to see that.

(Yes, I know it was written in ASM).

