
I built games in the 90s. Graphics was obviously the hardest part.

We thought about things in terms of how many instructions per pixel per frame we could afford to spend. Before the 90s it was hard to even update all pixels on a 320x200x8bit (i.e. mode 13h) display at 30 fps. So you had to do stuff like only redraw the part of the screen that moved. This led to games like Donkey Kong, where there was a static world and only a few elements updated.

In the 90s we got to the point where you had a Pentium processor at 66 MHz (woo!). At that point 66 MHz / 320 (width) / 200 (height) / 30 (fps) gave you 34 clocks per pixel. 34 clocks was way more than needed for 2D bitblt (e.g. memcpy'ing each line of a sprite), so we could go beyond 2D Mario-like games to 3D ones.

With 34 clocks, you could write a texture mapper (in assembly) that was around 10-15 clocks per pixel (if memory serves) and have a few cycles left over for everything else. You also had to keep overdraw low (meaning, each part of the screen was only drawn once or maybe two times). With those techniques, you could make a game where the graphics were 3D and redrawn from scratch every frame.
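
To make that concrete, here is a minimal sketch of such an inner loop in C rather than assembly (the names and the 16.16 format are illustrative, not the original code): step u and v by a constant per pixel and fetch one texel each time, with no divides in sight.

    /* Affine texture-mapped span, 16.16 fixed point.
       "texture" is assumed to be a 256x256, 8-bit bitmap; names are made up. */
    typedef int fixed;                      /* 16.16 fixed point */
    #define FIX_SHIFT 16

    void draw_span(unsigned char *dest, int count,
                   fixed u, fixed v, fixed du, fixed dv,
                   const unsigned char *texture)
    {
        while (count--) {
            int tu = (u >> FIX_SHIFT) & 255;    /* wrap to texture width */
            int tv = (v >> FIX_SHIFT) & 255;    /* wrap to texture height */
            *dest++ = texture[(tv << 8) + tu];  /* one texel per pixel */
            u += du;                            /* constant step, no divide */
            v += dv;
        }
    }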

The other big challenge was that floating point was slow back then (and certain processors did or didn't have floating-point coprocessors, etc.) so we used a lot of fixed point math and approximations. The hard part was dividing, which is required for perspective calculations in a 3D game, but was super slow and not amenable to fixed-point techniques. A single divide per pixel would blow your entire clock budget! "Perspective correct" texture mappers were not common in the 90s, and games like Descent that relied on them used lots of approximations to make it fast enough.
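
For anyone who hasn't used fixed point: you store fractions in plain integers scaled by a power of two, so adds stay single-cycle and a multiply only needs a wide intermediate and a shift, while the divide remains the expensive operation you design around. A rough 16.16 sketch (modern C, names mine):

    typedef int fixed;                         /* 16.16 fixed point */
    #define TO_FIXED(x)   ((fixed)((x) * 65536.0))
    #define FROM_FIXED(x) ((x) / 65536.0)

    /* multiply: widen, multiply, shift back down */
    static fixed fix_mul(fixed a, fixed b)
    {
        return (fixed)(((long long)a * b) >> 16);
    }

    /* divide: the costly one -- perspective needs something like u/z per
       pixel, which is exactly the budget-killer described above */
    static fixed fix_div(fixed a, fixed b)
    {
        return (fixed)((((long long)a) << 16) / b);
    }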




Agree with everything you said. We used x87 though and paid extreme attention to fpu stalls to ensure nothing wasted clocks.

As developers, we were also forced to give the graphics guys a really hard time: "no, that texture is too big! 128x128" & "you need to do it again with fewer polygons". We used various levels of detail in textures and models to minimise calcs and rendering issues, e.g. a tank with only 12 vertices when it would only be a pixel or three on screen. I think it only used 2x2 texels as part of a 32x32 texture (or thereabouts)...

This was around the mid-90s.
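
The model swap described above comes down to picking a mesh by distance (or projected size); a minimal sketch with made-up thresholds:

    /* Distance-based LOD selection; thresholds and names are illustrative. */
    typedef struct { int vertex_count; /* ...mesh data... */ } Model;

    const Model *pick_lod(const Model *lods[3], float distance)
    {
        if (distance < 50.0f)  return lods[0];   /* full-detail model */
        if (distance < 200.0f) return lods[1];   /* reduced mesh */
        return lods[2];                          /* ~12-vertex stand-in */
    }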


Ha. Yeah, there was not a lot of memory. The toolchain we built at the time automatically packed all the individual game texture maps into a single 256x256 texture (IIRC). If the artists made too many textures, everything started to look bad because everything got downsampled too much.

And yeah, the design of the game content was absolutely affected by things like polygon count concerns: "Say, wouldn't it be cool to have a ship that's shaped like a torus with a bunch of fins sticking out? Actually, on second thought... How about one that looks like a big spike? :)"
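
A sketch of the budget rule implied here (not the actual toolchain): if the combined texel area of the source art won't fit the atlas, halve every texture and try again, so over-budget art simply comes out blurrier rather than failing to build.

    /* Halve texture dimensions until the total area fits a 256x256 atlas.
       A real packer also has to lay the rectangles out; this only checks area. */
    #define ATLAS_AREA (256 * 256)

    void fit_to_atlas(int widths[], int heights[], int n)
    {
        for (;;) {
            long total = 0;
            int shrunk = 0;
            for (int i = 0; i < n; i++)
                total += (long)widths[i] * heights[i];
            if (total <= ATLAS_AREA)
                return;                              /* fits: done */
            for (int i = 0; i < n; i++) {            /* global downsample step */
                if (widths[i]  > 1) { widths[i]  /= 2; shrunk = 1; }
                if (heights[i] > 1) { heights[i] /= 2; shrunk = 1; }
            }
            if (!shrunk)
                return;                              /* everything is 1x1; give up */
        }
    }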


A fantastic early example of memory limitations in games is that the clouds and bushes in Mario 1 are the same sprite with a different palette.

Those were certainly different times. :) It was so much cooler to see what developers and artists did within those limitations than what we are doing today. The entire game dev community was like a demoscene in a way.


> If the artists made too many textures, everything started to look bad because everything got downsampled too much.

That's really clever. On the surface of it, it's "just" about dynamically adapting fidelity to fit in the available memory.

But really, it's a deeper organisational solution: it pushes the breadth vs fidelity of assets trade-off back to the designers, who are the ones that should decide along that curve anyway. It provides a cheap way for them to visually evaluate various points on that trade-off curve. Very clever.


Honestly, though the numbers are bigger and floating point arithmetic is fast on a modern GPU, this still sounds a lot like how we work nowadays.

I recently spent two years on the performance team for a large indie title and a huge portion of it was asking artists to simplify meshes, improve LODs, increase the amount of culling we could do on the CPU, etc.

My own work was mainly optimising the rendering itself.


The 1990s were a time when we tried all sorts of tricks to get the last bit of performance out of the hardware. One book which stands out is the one Michael Abrash wrote:

Michael Abrash's "Graphics Programming Black Book" https://github.com/jagregory/abrash-black-book

https://www.drdobbs.com/parallel/graphics-programming-black-...

As CPU power, core counts, RAM sizes, HDD sizes, and graphics card capabilities have increased, developers are no longer as careful about squeezing out performance.


I may be naïve/out of the loop here, but it’s fun to imagine what would be possible with the same gusto and creativity applied to today’s hardware. I imagine that a significant amount of modern hardware, even in games, is eaten up by several layers of abstraction that make it easier for developers to crank out games faster. What would a 90’s developer do with a 16-core CPU and an RTX3080?


You can check out some demoscene demos [0], which usually do this (albeit to save executable size instead of just to run fast). These days you don't even have to run them yourself; most have YouTube recordings.

[0]: https://www.pouet.net/prodlist.php?platform%5B%5D=Windows&pa...


Well, to be in the spirit of the parent, it should also try to use terabytes of storage to push the storage to the limit.


You see a bit of this sort of optimization at the end of a game console's life cycle.

I expect the PS3 still has some headroom since few games adequately exploited the Cell processor's SPUs.

Since the Switch is underpowered compared to PS5/Xbox Series X/PC perhaps we'll see some aggressive optimization as developers try to fit current-gen games onto it.


I remember reading the comp.graphics.algorithms news group back when Descent was just out. People were going a little bit crazy trying to figure out how the hell that thing worked. I found this page that talks about some of the things done to do texture mapping: https://www.lysator.liu.se/%7Ezap/speedy.html


Descent blew my mind, as well. IIRC it predated Quake and was the first ’true’ 3DoF FPS?

The source code has since been released on GitHub, if you’re ever interested in seeing it!


6DoF, as there are six degrees of freedom. Three correspond to rotational movement around the x, y, and z axes, commonly termed pitch, yaw, and roll. The other three correspond to translational movement along those axes, moving forward or backward, left or right, up or down.


Yep, close. I think it was even more constrained than that though... https://news.ycombinator.com/item?id=20973306


I dabbled with graphics using mode 13h and later with VGA. It was orders of magnitude simpler than using Vulkan or DX12.

CPUs were simpler, and DOS and Windows 95 were very simple compared to Windows 10.

That means that writing optimized C or even assembler routines was pretty easy.

If we go 10 years back in time, programming Z80 or MOS Technology 6510 or Motorola 68k was even simpler.


Yes, the instruction sets were simpler, but the developers at the time had invented a lot of clever solutions to hard problems.

I think the most innovative timespan was from 1950 to 1997 or so, and I hope we get back to treating getting the most out of the hardware as common sense.


Developers at that time were great. But it is harder to be great today when the complexity is 1000x.


Some interesting related stuff in this talk:

HandmadeCon 2016 - History of Software Texture Mapping in Games

https://www.youtube.com/watch?v=xn76r0JxqNM

I think they say at one point it went from 14 to 8 instructions, and then the Duke Nukem guy (Ken Silverman) got it down to around 4.

Quake would do something where it only issued a divide every 8 pixels or something, and then just interpolated in between, and the out-of-order execution on the Pentium Pro (I think?) would let it all work out.


For perspective-correct texture mapping, Quake did the distance divide every 8 pixels on the FPU, and the affine texture mapping on the integer side of the CPU in between (you can actually see a little bit of “bending” if you stand right next to a wall surface in low res like 320x200).

Since the FPU could work in parallel with the integer instructions on the pentium, this was almost as fast as just doing affine texture mapping.

This worked even on the basic Pentium.

It was likely also the reason Quake was unplayable on the 486 and Cyrix 586.
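
A hedged sketch of that scheme (names mine, not Quake's actual code): do a real divide only at 8-pixel boundaries, then step u and v linearly between those correct endpoints. The FPU/integer overlap that made it nearly free isn't visible in C, but the structure is:

    /* Perspective-correct span: divide every 8 pixels, affine in between.
       uz, vz, iz are u/z, v/z and 1/z, interpolated linearly across the span. */
    #define SUBSPAN 8

    void draw_pc_span(unsigned char *dest, int count,
                      float uz, float vz, float iz,
                      float duz, float dvz, float diz,
                      const unsigned char *texture /* 256x256, 8-bit */)
    {
        float z = 1.0f / iz;                           /* the real divide */
        float u0 = uz * z, v0 = vz * z;
        while (count > 0) {
            int n = count < SUBSPAN ? count : SUBSPAN;
            float uz1 = uz + duz * n, vz1 = vz + dvz * n, iz1 = iz + diz * n;
            float z1 = 1.0f / iz1;                     /* next divide, 8 px later */
            float u1 = uz1 * z1, v1 = vz1 * z1;
            float u = u0, v = v0;
            float du = (u1 - u0) / n, dv = (v1 - v0) / n;
            for (int i = 0; i < n; i++) {              /* affine between endpoints */
                int tu = (int)u & 255, tv = (int)v & 255;
                *dest++ = texture[(tv << 8) + tu];
                u += du;
                v += dv;
            }
            u0 = u1; v0 = v1;
            uz = uz1; vz = vz1; iz = iz1;
            count -= n;
        }
    }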


Great discussion! Early on, they mention what I think is the book Applied Concepts in Microcomputer Graphics. It sounds like it would be right up my alley, but I can only find it with very expensive shipping.

Does anyone know of a book like it? I'm very interested in getting started with software rendering from the ground up, mainly to scratch an intellectual itch of mine. I learn better from well-curated books than online material in general.


This calls back a lot of good memories. In addition to the cool tricks you mentioned, I recall the limited 256 or 16 colors also created some innovative ways to use color palettes dynamically. The limited built-in PC speaker also pushed the boundary for sound FX and music implementations.


Out of curiosity, how did you know how many clock cycles your rendering code took?


You look at the assembly code, grab an Intel Programmer Reference manual (they were about 1000 pages), look up each instruction opcode and that would tell you the clock cycles. For memory operations it is much more difficult due to caching. However, for many hot regions of code, the data is already in the L1s so manual counting is sufficient. (At the time there was a book called The Black Art of ... Assembly? I can't recall, and I should be flogged for forgetting it... but it was directed at game programmers and covered all sorts of assembly tricks for Intel CPUs.)

Also, a little later in the 90's: VTune. When VTune dropped it was a game changer. Intel started adding performance counters to the CPUs that could be queried in real-time so you could literally see what code was missing branches or stalling, etc. Source: I worked with Blizzard (pre-WoW!) developing VTune, feeding back requirements for new performance counter requests from them and developers.


It was common back in the day for machines to ship with detailed technical documentation.

I spent many an hour as a young child reading the C64 programmers reference guide, calculating the op speed and drawing memory maps.

https://archive.org/details/c64-programmer-ref


What were you drawing memory maps for?


Mainly making cheats for games. Back then you had devices which allowed you to view memory, so I would play games and look for changes made in memory when I did certain actions. Made a map of them. Most of the time a life counter was good enough.

Also for my own games - you don't have pointers as such, you have your memory, and you need to know what lives where (and when it lives where).
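
The "watch what changes" approach is basically a snapshot diff; a simple modern-style sketch of the idea (not the actual cartridge tooling):

    /* Snapshot a region of memory, then report which bytes changed after
       some in-game action (losing a life, picking up an item, ...). */
    #include <stdio.h>
    #include <string.h>

    void snapshot(unsigned char *copy, const unsigned char *mem, size_t len)
    {
        memcpy(copy, mem, len);
    }

    void report_changes(const unsigned char *before,
                        const unsigned char *mem, size_t len)
    {
        for (size_t i = 0; i < len; i++)
            if (before[i] != mem[i])
                printf("offset %04zx: %d -> %d\n", i, before[i], mem[i]);
    }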


You might be thinking of "Zen of Assembly Language" or "Zen of Code Optimization" by the brilliant Michael Abrash. I own the latter and in addition to plenty of low-level code optimization advice for the microprocessors available at the time it also includes timer code for DOS that lets you measure code performance with high precision.


Yep! Thanks! I was also thinking of his "Graphics Programming Black Book". Black Art ... hah! My bad memory. That dude abused the hell out of int 13! Kinda surprised he's with Oculus under Facebook. I wish his brain wasn't owned by Zuck. Maybe he's in that phase that musicians hit, when they get old and gray and start performing their classics at Vegas for a million bucks a night to audiences of boomers.


> (At the time there was a book called The Black Art of ... Assembly? I can't recall, and I should be flogged for forgetting it...

Probably not the book you're thinking of, but Michael Abrash's Graphics Programming Black Book was a mix of reprinted old articles and new content that had a bunch of highly optimized (for 486) graphics routines. IIRC there was a 9-cycle-per-texel texture mapping routine.


> You look at the assembly code, grab an Intel Programmer Reference manual (they were about 1000 pages), look up each instruction opcode and that would tell you the clock cycles.

Wouldn't this "just" tell you how many "cycles of code" there are, and not how many cycles actually run? Branches etc. will of course cause some cycles to be double-counted and others to be skipped, in the dynamic view.


Then you just calculate them all.

If Branch A is taken ... total X clock-cycles.

If Branch B is taken ... total X clock-cycles.

If Branch C is taken ... total X clock-cycles.

And so on. And then make sure that the "longest" branch fits into the clock-cycle budget.


Were games at this point of low enough complexity that the combinatorial explosion of branches could be contained and reasoned about by humans? Or did you have software doing this?


There is no such thing as combinatorial explosion here; just take the longest branch every time.


But this means if you have a situation like

    if (a) { ... }
    ...
    if (b) { ... }
Where the branches are long, and b = !a, you significantly overestimate the amount of code. I guess that was considered good enough, then?


Yes.

However, this is part of the reason why you always try to avoid performing if/then in critical loops. Obviously the loop's index check cannot be hoisted, but if you are doing 1000 iterations, its cost under 2-bit (Yeh-style) branch prediction (which was common at the time) gets amortized.

Later CPU architectures speculatively executed past branches and threw away the work from mispredicted paths, and VLIW allowed you to "shut off" instructions in a loop (predication) rather than have to predict.
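
A small illustration of the "no if/then in critical loops" rule: test a per-call condition once outside the loop and run two straight-line loops, instead of paying for the test (and its prediction) on every pixel. Names are made up.

    /* Per-pixel branch: predicted or not, the test costs cycles every iteration. */
    void blit_maybe_transparent_slow(unsigned char *dst, const unsigned char *src,
                                     int count, int transparent)
    {
        for (int i = 0; i < count; i++) {
            if (transparent) {                 /* same answer every iteration */
                if (src[i] != 0) dst[i] = src[i];
            } else {
                dst[i] = src[i];
            }
        }
    }

    /* Hoisted: decide once, then run a branch-light inner loop. */
    void blit_maybe_transparent_fast(unsigned char *dst, const unsigned char *src,
                                     int count, int transparent)
    {
        if (transparent) {
            for (int i = 0; i < count; i++)
                if (src[i] != 0) dst[i] = src[i];
        } else {
            for (int i = 0; i < count; i++)
                dst[i] = src[i];
        }
    }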


You just need your code to finish before the next frame. Then you wait and start the next cycle when you receive the interrupt from the start of the next frame from your graphics hardware.

Of course once you start writing for heterogeneous hardware like PCs with very different performance between models, you may use adaptive approaches to use all the available resources rather than just providing the most basic experience for everyone.
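
On DOS-era VGA, games that synced this way often didn't even use an interrupt; they polled the input status register at port 0x3DA until the vertical retrace bit flipped. A hedged sketch using the old DOS-compiler port I/O (inportb from a Borland-style dos.h):

    /* Wait for the start of the next vertical retrace on VGA.
       inportb() is the DOS-compiler port-read intrinsic; not portable today. */
    #include <dos.h>

    #define VGA_STATUS 0x3DA
    #define VRETRACE   0x08

    void wait_for_vsync(void)
    {
        while (inportb(VGA_STATUS) & VRETRACE)    /* if already in retrace, let it finish */
            ;
        while (!(inportb(VGA_STATUS) & VRETRACE)) /* now wait for the next one to begin */
            ;
    }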


You could just measure the isolated inner loop with accurate timers and figure out down to the clock how many cycles it was taking.

You also basically knew how many cycles each instruction took (add, multiply, bit shift, memory reference, etc.) so you just added up the clock counts. (Though things got a bit harder to predict starting with Pentium as it had separate pipelines called U and V that could sometimes overlap instructions.)
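
From the Pentium onward you could also read the cycle counter directly with RDTSC; a rough sketch of timing an isolated routine that way (GCC-style inline asm, overhead subtraction left out):

    /* Read the Pentium time-stamp counter around a hot loop (GCC inline asm).
       For honest numbers, measure and subtract the empty-loop/RDTSC overhead. */
    static unsigned long long rdtsc(void)
    {
        unsigned int lo, hi;
        __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
        return ((unsigned long long)hi << 32) | lo;
    }

    unsigned long long time_loop(void (*fn)(void), int iterations)
    {
        unsigned long long start = rdtsc();
        for (int i = 0; i < iterations; i++)
            fn();
        return (rdtsc() - start) / iterations;   /* ~cycles per call */
    }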


Yeah, it's helpful to remember that games in the early 90s at least would have been expected to run on a 486, which was still very widespread, and the 486 was neither superscalar nor out of order. It was pipelined (the main difference between a 486 and a 386), but it was still simple to reason about how long your instructions would take. And there was no SpeedStep or any of that stuff yet.


This is so different from the current state of affairs.

Code for the CPU might get optimized beyond recognition, vectorized, executed out of order, ...

The shader code I write these days is translated to a whole list of intermediate formats across different platforms, maybe scheduled in a jobs system, then handed to graphics drivers which translate it to yet another format...


On the C64 I would change the border color when my frame routine started and change it back once it finished. That tells you, quite literally, what fraction of the total frame time you are using. I wonder if similar tricks were used for VGA. I think you could change color index 0 to the same effect.
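
The VGA version of that trick is what the last sentence suggests: reprogram DAC color 0 at the start of your frame work and restore it afterwards, and the height of the tinted band shows the fraction of the frame you are using. A hedged DOS-style sketch (outportb from a Borland-style dos.h):

    /* Tint palette entry 0 via the VGA DAC (index port 0x3C8, data port 0x3C9,
       6-bit R, G, B). outportb() is the DOS-compiler port-write intrinsic. */
    #include <dos.h>

    void set_color0(unsigned char r, unsigned char g, unsigned char b)
    {
        outportb(0x3C8, 0);    /* select DAC entry 0 */
        outportb(0x3C9, r);    /* each component is 0..63 */
        outportb(0x3C9, g);
        outportb(0x3C9, b);
    }

    /* usage: set_color0(63, 0, 0); do_frame_work(); set_color0(0, 0, 0); */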


Speaking of the C64, exact instruction timing was also key to the two rather clever hacks used to put sprites outside the nominally drawable 320x200 pixel screen area.

First, the top and bottom border. This was done by waiting for the video chip to start drawing the bottom line of text on the screen and then switching to a mode with one less line of text. The sprite hardware would then not get the signal to stop drawing after the last line, since that presumably already happened with fewer lines to display. This would cause it to keep drawing sprites below the bottom border. Then, in the vertical blanking before the top line was drawn, you would switch back from 24 to 25 lines of text.

The side border is a variation on this where you time your code to wait for the 39th character of a scan line and then switch from 40 characters wide to 38. Again the sprite engine would not get the signal to stop drawing and would continue drawing in the side border, outside the 320x200 pixel nominal capabilities of the hardware.

For the side border it was necessary to do this for every scan line (and every 8th line had different timing, probably related to fetching the next line of text), so timing was critical down to the exact CPU cycle.

These machines were not much by modern standards, but for the hacker mind they were an amazing playground where you had just your assembly code and the hardware to exploit, without the layers and layers of abstractions of modern gear.

Edit: spelling


I did that for a game in the DOS days. It needed a virtual retrace interrupt: a timer interrupt set to trigger shortly before retrace, busy-waiting until retrace, then recalibrating the timer for the next frame.

Pretty soon after that, operating systems managed memory and interrupts themselves. The added latency on timers made precision tricks like that impractical.



Good write-up. I’d also add that modern techniques are written to use the modern hardware.

It’s easy to forget that each upgrade to graphics is an exponential jump. Going from 8 colours to 256 colours on screen. Jumps in resolution. Jumps in sound, the number of sprites the hardware can track. Etc.

When we look at graphics now, with the tangible visible difference between 8K, 4K, and HD, and with Moore’s Law no longer in effect, it is easy to forget just how significant the jumps in tech were in the 80s and 90s if you hadn’t lived through it and/or developed code during it.


A game like Donkey Kong uses a static tile mapped background and some small dynamic objects that move around, and the hardware combines them.

These machines don’t even have enough memory to store one frame buffer; you can’t program them like a modern game, where everything is customizable and you can just do whatever you want as long as it’s fast enough.

In a game like Donkey Kong you do what the hardware allows you to do (and of course the hardware is designed to allow you to do what you want to do).



