
I do wonder what would have happened if the N64 had included a much bigger texture cache. The tiny size seemed to be its biggest con.



The other big problem with the N64 was that the RAM had such high latency that it completely undid any benefit from RDRAM's supposedly higher bandwidth, leaving the console constantly memory-starved.

The RDP could rasterize hundreds of thousands of triangles a second but as soon as you put any texture or shading on them, the memory accesses slowed you right down. UMA plus high latency memory was the wrong move.

In fact, in many situations you can "de-optimize" the rendering to draw and redraw more, as long as it uses less memory bandwidth, and end up with a higher FPS in your game.


That's mostly correct. It is as you say, except that shading and texturing come for free. You may be thinking of the PlayStation, where you do indeed get decreased fillrate when texturing is on.

Now, if you enable 2-cycle mode, the pipeline will recycle the pixel value back into the pipeline for a second stage, which is used for 2 texture lookups per pixel and some other blending options. Otherwise, the RDP is always outputting 1 pixel per clock at 62.5 MHz (though it will be frequently interrupted because of RAM contention). There are faster drawing modes, but they are for drawing rectangles, not triangles. It's been a long time since I've done benchmarks on the pipeline though.

You're exactly right that the UMA plus high-latency memory murders it. It really does. Enable the zbuffer? Now the poor RDP is thrashing read-modify-writes and you only get 8-pixel chunks at a time. Span caching is minimal. Simply using the zbuf will torpedo your effective fill rate by 20 to 40 percent. That's why stuff I wrote for it avoided using the zbuffer whenever possible.

The other bandwidth hog was enabling anti-aliasing. AA processing happened in 2 places: first in the triangle-drawing pipeline, for interior polygon edges; second in the VI, which applies smoothing to exterior polygon edges when the framebuffer gets displayed, based on coverage information stored in the pixel's extra bits.

On average, you get a roughly 15 to 20 percent fillrate boost by turning both of those off. If you run only at lowres, it's a bit less, since more of your render time is occupied by triangle setup.


I was misremembering; I was thinking of cases involving the zbuffer and significant overdraw, as demonstrated by Kaze: https://www.youtube.com/watch?v=GC_jLsxZ7nw

Another example from that video: changing a trig function from a lookup table to an evaluated approximation improved performance, because evaluating it uses less memory bandwidth than fetching table entries.

Was the zbuffer in main memory? Ooof

What's interesting to me is that even Kaze's optimized stuff is around 8k triangles per frame at 30fps. The "accurate" microcode Nintendo shipped claimed about 100k triangles per second. Was that ever achieved, even in a tech demo?


Did you see the video of the guy who super optimized Mario 64 to run at 60 fps https://youtu.be/t_rzYnXEQlE?si=MpucGm0r_5KN-Nc_


So Kaze is hitting 240k tris/second right?


There were many microcode versions and variants released over the years. IIRC one of the official figures was ~180k tri/sec.

I could draw a ~167,600 tri opaque model with all features (shaded, lit by three directional lights plus an ambient one, textured, Z-buffered, anti-aliased, one cycle), plus some large debug overlays (anti-aliased wireframes for text, 3D axes, Blender-style grid, almost fullscreen transparent planes & 32-vert rings) at 2 FPS/~424 ms per frame at 640x476@32bpp, 3 FPS/~331 ms at 320x240@32bpp, 3 FPS/~309 ms at 320x240@16bpp.

That'd be somewhere around 400k to 540k tri/sec. Sounds weird, right? But that's extrapolated straight from the CPU counter on real hardware and eyeballing, so it's hard to argue.

I assume the bottleneck at that point is the RSP processing all the geometry: a lot of the triangles get backface-culled, and because of the sheer density at such a low resolution, comparatively most of them are drawn in no time by the RDP. Or, y'know, the bandwidth. Haven't measured, sorry.

Performance depends on many variables, one of which is how the asset converter itself can optimise the draw calls. The one I used, a slight variant of objn64, prefers duplicating vertices just so it can fully load the cache in one DMA command (gSPVertex) while also maximising gSP2Triangle commands IIRC (check the source if curious). But there's no doubt many other ways of efficiently loading and drawing meshes, not to mention all the ways you could batch the scene graph for things more complex than a demo.

Anyways, the particular result above was with the low-precision F3DEX2 microcode (gspF3DLX2_Rej_fifo), which doubles the vertex cache size in DMEM from 32 to 64 entries but removes the clipping code: polygons too close to the camera get trivially rejected. The other side effect with objn64 is that the larger vertex cache massively reduces the memory footprint (far less duplication): might've shaved like 1 MB off the 4 MB compiled data.

Compared to the full precision F3DEX2, my comment said: `~1.25x faster. ~1.4x faster when maxing out the vertex cache.`.

All the microcodes I used have a 16 KB FIFO command buffer held in RDRAM (as opposed to the RSP's DMEM for XBUS microcodes). It goes like this, if memory serves:

1. CPU starts RSP graphics task with a given microcode and display list to interpret from RAM

2. RSP DMAs display list from RAM to DMEM and interprets it

3. RSP generates RDP commands into a FIFO in either RDRAM or DMEM

4. When the output command buffer is full, the RSP waits for the RDP to be ready and then asks it to execute the command buffer

5. The RDP reads the 64-bit commands from either RDRAM or the cross-bus, the 128-bit internal bus connecting the RSP and RDP directly, which avoids RDRAM bus contention

6. Once the RDP is done, go to step 2/3.

To quote the manual:

> The size of the internal buffer used for passing RDP commands is smaller with the XBUS microcode than with the normal FIFO microcode (around 1 Kbyte). As a result, when large OBJECTS (that take time for RDP graphics processing) are continuously rendered, the internal buffer fills up and the RSP halts until the internal buffer becomes free again. This creates a bottleneck and can also slow RSP calculations. Additionally, audio processing by the RSP cannot proceed in parallel with the RDP's graphics processing. Nevertheless, because I/O to RDRAM is smaller than with FIFO (around 1/2), this might be an effective way to counteract CPU/RDP slowdowns caused by competition on the RDRAM bus. So when using the XBUS microcode, please test a variety of combinations.


I'm glad someone found objn64 useful :) looking back it could've been optimized better but it was Good Enough when I wrote it. I think someone added png texture support at some point. I was going to add CI8 conversion, but never got around to it.

On the subject of XBUS vs FIFO, I trialled both in a demo I wrote, with a variety of loads. Benchmarking revealed that over 3 minutes, the two methods were within a second of each other. So in all my time messing with them, I never found XBUS to help with contention. I'm sure in some specific application it might be a bit better than FIFO. By the way, I used a 64k FIFO size, which is huge; I don't know if that gave me better results.


Oh, you're the MarshallH? Thanks so much for everything you've done!

I'm just a nobody who wrote DotN64, and contributed a lil' bit to CEN64, PeterLemon's tests, etc.

For objn64, I don't think PNG was patched in. I only fixed a handful of things like a buffer overflow corrupting output by increasing the tmp_verts line buffer (so you can maximise the scale), making BMP header fields 4 bytes as `long` is platform-defined, bumping limits, etc. Didn't bother submitting patches since I thought nobody used it anymore, but I can still do it if anyone even cares.

Since I didn't have a flashcart to test with for the longest time, I couldn't really profile, but the current microcode setup seems to be more than fine.

Purely out of curiosity, as I now own an SC64: is the 64drive abandonware? I tried reaching out via email a couple times since my 2018 order (receipt #1532132539), and I still don't know if it's even in the backlog or whether I could update the shipping address. You're also on Discord servers, but I didn't want to be pushy.

I don't even mind if it never comes, I'd just like some closure. :p

Thanks again !


Recently Sauraen demonstrated on YouTube the performance profiling of their F3DEX3 optimizations. One thing they could finally do was profile the memory latency, and it is BAD! Out of a 50 ms frame render time, about 30 ms is the processors just waiting on the RAM. Essentially, at least in Ocarina of Time, the GPU is idle 60% of the time!

The whole video is fascinating, but skip to the 29-minute mark to see the discussion of this part.

https://www.youtube.com/watch?v=SHXf8DoitGc


RAM latency is bad and the GPU spends half the time doing nothing, but in return, using RDRAM allowed for a 2-layer PCB, making the whole thing insanely cheap to manufacture.

https://gmanmodz.com/2020/01/30/2020-the-year-of-n64-again/

https://gmanmodz.com/2020/10/05/n64-3x3/


Wasn't the thing you put in the slot in front of the cart a RAM extension?

I think you can play Rogue Squadron with and without if you want to compare.

Or do you mean some lower cache level?


That pack added 4 MB of extra RAM; OOT and Majora's Mask are like night and day thanks to it.

The N64 had mere kilobytes of texture cache. AFAIK the solution was to stream textures, but it took a while for developers to figure that out.


The N64 had a 4 KB texture cache, while the PS1 had a 2 KB cache. But the N64's mip-mapping requirement meant that it essentially had 2 KB plus lower-resolution maps.

The streaming helped it a lot but I think the cost of larger carts was a big drag on what developers could do. It is one thing to stream textures but if you cannot afford the cart size in the first place it becomes merely academic.


The real problem was different. The PS1 cache was a real cache, managed transparently by the hardware. Textures could take the full 1 MB of VRAM (minus the framebuffer, of course).

In contrast, the N64 had a 4 kB texture RAM. That's it: all your textures for the current mesh had to fit in just 4 kB. If you wanted bigger textures, you had to come up with all sorts of programming tricks.


I don’t think storage space was an issue for graphics in particular. Remember, the base system had 4 MB of memory and a 320x240 resolution.



