
Achieving full-motion video on the Nintendo 64 (2000) [pdf] - dmitrygr
https://ultra64.ca/files/other/Game-Developer-Magazine/GDM_September_2000_Achieving_Full-Motion_Video_on_the_Nintnedo_64.pdf
======
phoboslab
Lately there has also been an effort by a hobbyist to use an MPEG1 decoder
(PL_MPEG)[1] on the N64[2].

Disclaimer: I wrote PL_MPEG, but not the N64 port.

[1]
[https://github.com/phoboslab/pl_mpeg](https://github.com/phoboslab/pl_mpeg)

[2]
[https://www.reddit.com/r/n64/comments/dr15py/i_just_started_...](https://www.reddit.com/r/n64/comments/dr15py/i_just_started_a_dragons_lair_port_to_n64/)

~~~
giovannibajo1
Hi, I'm the guy working on that.

I've actually since moved on to porting an H264 implementation to N64. It's been a
long journey and I'm now at around 18 FPS, after nights of manual RSP assembly
optimizations, vectorizing most of the intra-prediction and inter-prediction
algorithms. I want to reach 30FPS so there's still some work to do.

~~~
fulafel
What kind of vector operations does the N64 support for this case?

~~~
giovannibajo1
For H264? Basically everything.

Vector registers are 8 lanes, signed 16-bit, so they map quite well to
per-pixel calculations on each plane (YUV), which is what video codecs do, as
you can process 8 pixels at a time, and you have 16-bit precision to handle
intermediate results.
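
A minimal Python sketch of why the 16-bit lanes matter, assuming a rounded average of two rows of 8-bit pixels as the per-pixel operation (the helper name `avg_rows` and the sample values are made up for illustration). The intermediate sum can reach 511, which no longer fits in 8 bits but sits comfortably in a signed 16-bit lane:

```python
LANES = 8  # one vector register holds 8 lanes


def avg_rows(a, b):
    """Rounded average of two 8-pixel rows, lane by lane."""
    assert len(a) == len(b) == LANES
    sums = [x + y + 1 for x, y in zip(a, b)]  # up to 511: too big for 8 bits
    assert all(-32768 <= s <= 32767 for s in sums)  # fits a signed 16-bit lane
    return [s >> 1 for s in sums]


row_a = [255, 255, 0, 10, 20, 30, 40, 200]
row_b = [255, 0, 0, 11, 20, 31, 41, 100]
averaged = avg_rows(row_a, row_b)
```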

The most complex hurdle is that RSP only has 4K of RAM so you need to DMA
macroblocks in and out a lot (especially since I can't possibly rewrite a FULL
H264 decoder in RSP assembly, not in this lifetime: I need to write only
specific performance-sensitive algorithms, while the bulk of the decoder stays
in C; this means that the same data ends up going in and out of the RSP a lot,
especially since the H264 decoder I'm using is not aware of this problem).

That said, RSP DMA is even rectangle based, so it's another perfect fit: I can
DMA a macroblock by specifying the pointer in RAM, width and height (usually
16x16, but some algos work on sub-partitions of 8x8 or 4x4) and the stride
(screen width), so that a single DMA call will transfer the block from the
middle of a frame, skipping the rest of the data.
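
The rectangle transfer described above can be modelled in a few lines of Python. This is only a software sketch of the addressing (pointer, width, height, stride), not the RSP DMA interface; `dma_read_block` and the tiny test frame are hypothetical:

```python
def dma_read_block(frame: bytes, stride: int, x: int, y: int,
                   width: int, height: int) -> bytes:
    """Copy a width x height block whose top-left pixel is (x, y).

    `frame` is one 8-bit plane stored row by row, `stride` bytes per row.
    """
    out = bytearray()
    offset = y * stride + x  # the "pointer in RAM" of the block
    for _ in range(height):
        out += frame[offset:offset + width]  # one row of the block
        offset += stride                     # skip the rest of the frame row
    return bytes(out)


# Hypothetical 32x4 plane in which every pixel stores its own column index.
STRIDE = 32
frame = bytes(col for _ in range(4) for col in range(STRIDE))
block = dma_read_block(frame, STRIDE, x=8, y=1, width=4, height=2)
```

For a real macroblock the call would use width=16 and height=16, with the stride set to the width of the frame in bytes.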

Vector multiplications in RSP were designed to write DSP-like filters, so they
map quite well to the pixel filters required by H264. There are several
different multiplication instructions for different fixed point precisions,
and there's even one that automatically adds 0.5 (in the correct fixed point
precision) which is also a common pattern in FIR filters, and also used in
H264.
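
As an illustration of the "add 0.5, then shift" pattern, here is a Python sketch of the H264 luma half-sample filter, whose six taps (1, -5, 20, 20, -5, 1) sum to 32, so the result is rounded by adding 16 and shifting right by 5. It models only the arithmetic, not any particular RSP multiply opcode; the function name is made up:

```python
TAPS = (1, -5, 20, 20, -5, 1)  # H264 luma half-sample filter, taps sum to 32
SHIFT = 5                      # 1.0 is represented as 1 << 5
ROUND = 1 << (SHIFT - 1)       # 0.5 in the same fixed-point format


def half_pel(pixels):
    """Filter 6 neighbouring pixels into one half-sample value (unclamped)."""
    acc = sum(t * p for t, p in zip(TAPS, pixels))
    return (acc + ROUND) >> SHIFT


flat = half_pel([100] * 6)                   # flat area stays flat
edge = half_pel([0, 0, 0, 255, 255, 255])    # halfway across a hard edge
peak = half_pel([0, 0, 255, 255, 0, 0])      # overshoots the 8-bit range
```

The `peak` case ends up above 255, which is why the filter output still has to be saturated afterwards.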

Saturation (VCH/VGE/VLT opcodes) is also supported; this is useful as most
algorithms eventually need to saturate the calculated value in the 0-255
range, so that's another thing which usually requires 1 clock cycle for 8
pixels.
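
A Python sketch of the effect of that saturation step on one 8-lane register. It shows only the end result (every lane clamped to 0-255), not how the VCH/VGE/VLT opcodes achieve it; `clamp_u8` and the lane values are illustrative:

```python
def clamp_u8(lanes):
    """Saturate every 16-bit lane into the 0-255 pixel range."""
    return [min(255, max(0, v)) for v in lanes]


lanes = [-37, 0, 12, 128, 255, 256, 319, 1000]
clamped = clamp_u8(lanes)
```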

When working with 4x4 partitions, half of the vector lanes are ignored; when
writing back to memory, you need to do a read / combine / write sequence (as
you may want to write 4 pixels and keep the existing 4 pixels, but vector
writes will write 8 pixels); in this case, the VMRG instruction is used, which
basically lets you combine two vector registers into one, with a bitmask to
specify where to get each lane from.
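
The read / combine / write sequence can be sketched in Python as follows. The mask layout (bit i selects lane i) is an assumption made for this example and is not meant to describe how the RSP encodes the mask; `merge_lanes` is a hypothetical helper:

```python
def merge_lanes(mask, new, old):
    """Take lane i from `new` where bit i of `mask` is set, else from `old`."""
    return [n if (mask >> i) & 1 else o
            for i, (n, o) in enumerate(zip(new, old))]


existing = [10] * 8                    # read: the 8 pixels already in memory
computed = [1, 2, 3, 4, 5, 6, 7, 8]    # only the first 4 lanes are meaningful
merged = merge_lanes(0b00001111, computed, existing)  # combine
# write: `merged` would now be stored back as one 8-pixel vector
```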

For IDCT, it comes in very handy that most RSP opcodes allow partial
broadcasts of the lanes of one of the input registers; this lets you keep a
4x4 matrix in 2 consecutive registers and then play some tricks with
broadcast to multiply by rows and by columns (which is required by IDCT where
you need to compute A' x B x A, with A and B being 4x4 matrices, so if you
expand that you will see that you need to rotate vectors a lot).
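
To make the role of broadcasts concrete, here is a Python sketch that computes A' x B x A using only "broadcast one element, multiply a whole row, accumulate" steps. It is plain integer matrix arithmetic, not the exact H264 inverse transform and not RSP code; the matrices A and B are arbitrary example values:

```python
def transpose(m):
    return [list(row) for row in zip(*m)]


def matmul_broadcast(p, q):
    """Row i of P x Q = sum over k of (P[i][k] broadcast to 4 lanes) * row k of Q."""
    out = []
    for i in range(4):
        acc = [0] * 4
        for k in range(4):
            s = p[i][k]  # one element, broadcast across the lanes
            acc = [a + s * v for a, v in zip(acc, q[k])]
        out.append(acc)
    return out


def transform(a, b):
    """Return A' x B x A for 4x4 integer matrices."""
    return matmul_broadcast(matmul_broadcast(transpose(a), b), a)


# Arbitrary example inputs, chosen only to exercise the code.
A = [[1, 1, 1, 1], [2, 1, -1, -2], [1, -1, -1, 1], [1, -2, 2, -1]]
B = [[r * 4 + c for c in range(4)] for r in range(4)]
result = transform(A, B)
```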

So well, it's actually a pretty good fit.

PS: in the Gamasutra article, it shows the RSP code used to do colorspace
conversion (YUV->RGB). The article says that it gives a big boost (and I can
believe it: especially in MPEG1, CSC is like 30% of decoding time), but I
brought it to basically 0% by letting the RDP do it (RDP is the GPU in N64).
In fact, the RDP supports YUV textures: so in my H264 player, the RSP just
does the interleaving (that is, merges the 3 separate Y, U, V planes into one)
and then asks the RDP to blit a textured rectangle in the correct format. The
RDP even runs in parallel to both RSP and CPU. It might be that, back in 2000,
this wasn't fully documented by Nintendo, though I found several references in
old Nintendo docs. I can't see otherwise why it wasn't used. Once you reverse
engineer how to pass the correct constants, it works really well and brings
the CSC cost to basically zero.
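
The interleaving step could look roughly like this in Python. Both the 4:2:0 planar input and the packed U, Y0, V, Y1 output order are assumptions made for the sketch; the actual texture layout expected by the RDP is not given in this thread, and `interleave_yuv420` is a hypothetical name:

```python
def interleave_yuv420(y, u, v, width, height):
    """Pack planar 4:2:0 into one U, Y0, V, Y1 stream (4 bytes per 2 pixels)."""
    out = bytearray()
    chroma_width = width // 2
    for row in range(height):
        chroma_row = row // 2  # each chroma row serves two luma rows
        for col in range(0, width, 2):
            c = chroma_row * chroma_width + col // 2
            out += bytes((u[c], y[row * width + col],
                          v[c], y[row * width + col + 1]))
    return bytes(out)


# Tiny 2x2 example: four luma samples sharing one U and one V sample.
packed = interleave_yuv420(bytes([1, 2, 3, 4]), bytes([50]), bytes([60]),
                           width=2, height=2)
```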

~~~
fulafel
Really fascinating. The DMA latency must be pretty small? The 4K of RAM and
DMA make the programming model a lot like the Cell; I wonder if experiences
with game devs using the RSP microcode encouraged the Cell design. I also
wonder how much in common this HW has with the SGI GPUs of the day...

~~~
giovannibajo1
The DMA transfers one 64-bit word per bus clock cycle between the main
shared memory (RDRAM) and the internal RSP 4K DMEM (or IMEM, to transfer
code).

So it's quite fast, but you need to remember that the main RDRAM is shared
among the main CPU and the whole RCP (eg: it's also used as video memory for
textures and frame buffers by the RDP), so contention is really high.

------
derefr
See also, for techniques used on a much earlier console with much tighter
constraints:

• [https://youtu.be/c-aQvP7CUAI](https://youtu.be/c-aQvP7CUAI)

• [https://youtu.be/IehwV2K60r8](https://youtu.be/IehwV2K60r8)

~~~
dleslie
That Sonic 3D hack is damn clever

~~~
nitrogen
Regarding the other hack, people did some really cool stuff with palette
swaps, e.g.
[http://www.effectgames.com/demos/canvascycle/](http://www.effectgames.com/demos/canvascycle/)

And also the Windows 95/98 startup screens.

------
porsupah
I rather wish the 3DO version of The 11th Hour had seen release, but the
publisher canned the project quite late in development.

There, we had about an hour of video - on CD, it should be noted, not a
cartridge - at 288x320 (using an interlaced display mode, which virtually
nobody used beyond splash screens), with a perfectly solid 30fps. Left
unregulated, the decoder yielded around 40-70fps. All ARM assembly, and a
_huge_ amount of fun to write. ^_^

No hardware acceleration, needless to say, other than page blitting to copy
the previous frame to the current buffer.

------
corysama
The Gamasutra web article on this is not as pretty. But, it includes the RSP
microcode.

[http://web.archive.org/web/20081221184231/http://www.gamasut...](http://web.archive.org/web/20081221184231/http://www.gamasutra.com/features/20001004/meynink_pfv.htm)
(edit, link fixed, thanks!)

If you like this kind of stuff, check out
[https://www.reddit.com/r/TheMakingOfGames/](https://www.reddit.com/r/TheMakingOfGames/)
and
[https://www.reddit.com/r/videogamescience/](https://www.reddit.com/r/videogamescience/)
I often post games-related stuff I find here to there. But, this might be the
first time I’ve seen something posted to HN because it was first posted there
:)

~~~
mambodog
Your link seems to be broken, here's a working version:
[http://web.archive.org/web/20040313034855/http://www.gamasut...](http://web.archive.org/web/20040313034855/http://www.gamasutra.com/features/20001004/meynink_pfv.htm)

------
STRML
Absolutely fascinating, high-quality, deeply technical article with a healthy
dose of nostalgia. I'd love to see more of this (even though we often have
quite a bit!) on HN.

Does anyone know if this technique was ever used again on the N64?

~~~
CrashOveride95
Not to my knowledge; the only other time _good looking_ FMV was used on N64
was in Pokemon Puzzle League. I am not proficient enough in RSP to disassemble
that microcode, though I can tell you Resident Evil 2's FMV codec came after
the HVQM FMV system used in Puzzle League.

------
awscherb
It always amazes me reading about the novel techniques N64 devs used to
squeeze every last ounce of performance out of such a hardware / storage
constrained platform. Nowadays I get excited if a Call of Duty patch is under
10GB....

~~~
Causality1
It's incredible to me to see just how much smaller Nintendo Switch versions of
games are than their PC counterparts, even when the visual differences are
minimal.

~~~
Grazester
Note also that the textures on the Switch are for a console that does 1080p
max, so they would be smaller.

Edit: visual differences minimal, you say?

~~~
Causality1
For example, Cuphead at 3.3GB is 1/4 the size on Switch as it is on Xbox One.
Another is Doom 2016, which, while obviously having lower fidelity visuals, is
13.2GB on Switch and 77GB on PC.

------
fghyjsrtyhjsw
I have a feeling this is a lost art and hasn't been used since the Xbox One /
PS4 release (and probably way less before that).

~~~
selectodude
Fitting things onto a 32MB cartridge isn’t a constraint anymore. Current
consoles use 50GB discs and next-gen ones are going to use 100GB BD-XL discs.
Makes it less necessary to do stuff like that.

~~~
mattl
Some use 32GB cartridges.

------
kevinventullo
I owned this game as a kid and I remember the cartridge being noticeably
physically heavier than all of my other N64 cartridges. As I remember it, the
FMV sequences looked basically identical to their PS counterparts.

------
mambodog
Interestingly the N64's RSP has some instructions specifically intended for
implementing MPEG decoding (ctrl+f for "MPEG" in
[http://ultra64.ca/files/documentation/silicon-graphics/SGI_N...](http://ultra64.ca/files/documentation/silicon-graphics/SGI_Nintendo_64_RSP_Programmers_Guide.pdf))

~~~
giovannibajo1
Yes, unfortunately they're _very_ specific to MPEG1, so they're not really
useful for other codecs in the MPEG family. I'm not using them in my H264
implementation (see sibling answer).

