You look at the assembly code, grab an Intel Programmer's Reference manual (they were about 1000 pages), look up each instruction's opcode, and the manual would tell you its clock cycles. For memory operations it is much more difficult due to caching. However, for many hot regions of code the data is already in L1 cache, so manual counting is sufficient. (At the time there was a book called The Black Art of ... Assembly? I can't recall, and I should be flogged for forgetting it... but it was directed at game programmers and covered all sorts of assembly tricks for Intel CPUs.)
Also, a little later in the '90s: VTune. When VTune dropped it was a game changer. Intel started adding performance counters to its CPUs that could be queried in real time, so you could literally see which code was mispredicting branches, stalling, etc. Source: I worked with Blizzard (pre-WoW!) while developing VTune, feeding back requirements for new performance counters from them and other developers.
Mainly making cheats for games. Back then you had devices that let you view memory, so I would play games and look for changes in memory when I did certain actions, and made a map of them. Most of the time a life counter was good enough.
Also for my own games - you don't have pointers as such, you have your memory, and you need to know what lives where (and when it lives where).
You might be thinking of "Zen of Assembly Language" or "Zen of Code Optimization" by the brilliant Michael Abrash. I own the latter and in addition to plenty of low-level code optimization advice for the microprocessors available at the time it also includes timer code for DOS that lets you measure code performance with high precision.
Yep! Thanks! I was also thinking of his "Graphics Programming Black Book". Black Art ... hah! My bad memory. That dude abused the hell out of mode 13h! Kinda surprised he's with Oculus under Facebook. I wish his brain wasn't owned by Zuck. Maybe he's in that phase that musicians hit, when they get old and gray and start performing their classics at Vegas for a million bucks a night to audiences of boomers.
> (At the time there was a book called The Black Art of ... Assembly? I can't recall, and I should be flogged for forgetting it...
Probably not the book you're thinking of, but Michael Abrash's Graphics Programming Black Book was a mix of reprinted old articles and new content that had a bunch of highly optimized (for 486) graphics routines. IIRC there was a 9-cycle-per-texel texture mapping routine.
> You look at the assembly code, grab an Intel Programmer Reference manual (they were about 1000 pages), look up each instruction opcode and that would tell you the clock cycles.
Wouldn't this "just" tell you how many "cycles of code" there are, and not how many cycles actually run? Branches etc. will of course cause some cycles to be double-counted and others to be skipped, in the dynamic view.
Were games at this point of low enough complexity that the combinatorial explosion of branches could be contained and reasoned about by humans? Or did you have software doing this?
However, this is part of the reason why you always try to avoid if/then in critical loops. Obviously the loop-index branch cannot be hoisted, but if you are doing 1000 iterations, 2-bit branch prediction (Yeh-style predictors were common at the time) amortizes the cost almost completely.
Later CPU architectures speculatively executed instructions and then squashed the ones that turned out to be on the wrong side of a branch, and VLIW let you predicate ("shut off") instructions in a loop rather than having to predict.
You just need your code to finish before the next frame. Then you wait, and start the next cycle when your graphics hardware signals the start of the next frame (the vertical retrace interrupt).
Of course, once you start writing for heterogeneous hardware like PCs, with very different performance between models, you may use adaptive approaches to exploit all the available resources rather than just providing the most basic experience for everyone.
You could just measure the isolated inner loop with accurate timers and figure out down to the clock how many cycles it was taking.
You also basically knew how many cycles each instruction took (add, multiply, bit shift, memory reference, etc.) so you just added up the clock counts. (Though things got a bit harder to predict starting with Pentium as it had separate pipelines called U and V that could sometimes overlap instructions.)
Yeah, it's helpful to remember that games in the early '90s would at least have been expected to run on a 486, which was still very widespread, and the 486 was neither superscalar nor out of order. It was pipelined (the main difference between a 486 and a 386), but it was still simple at that time to reason about how long your instructions would take. And there was no SpeedStep or any of that stuff yet.
This is so different from the current state of affairs.
Code for the CPU might get optimized beyond recognition, vectorized, executed out of order, ...
The shader code I write these days is translated to a whole list of intermediate formats across different platforms, maybe scheduled in a jobs system, then handed to graphics drivers which translate it to yet another format...
On the C64 I would change the border color when my frame routine started and change it back once the routine finished. This told me, quite literally, what fraction of the total frame time I was using. I wonder if similar tricks were used for VGA; I think you could change color index 0 to the same effect.
Speaking about the C64 the exact instruction timing was also key to the two rather clever hacks to put sprites outside the nominally drawable 320x200 pixel screen area.
First, the top and bottom border. This was done by waiting for the video chip (the VIC-II) to start drawing the bottom line of text on the screen and then switching to a mode with one less line of text. The sprite hardware would then never get the signal to stop drawing after the last line, since that had presumably already happened with fewer lines to display, so it would keep drawing sprites below the bottom border. Then, in the vertical blanking before the top line was drawn, you would switch back from 24 to 25 lines of text.
The side border is a variation on this where you time your code to wait for the 39th character of a scan line, then switch from 40 characters wide to 38. Again, the sprite engine would not get the signal to stop drawing and would continue drawing in the side border, outside the nominal 320x200 pixel capabilities of the hardware.
For the side border it was necessary to do this on every scan line (and every 8th line had different timing, probably related to fetching the next line of text), so timing was critical down to the exact CPU cycle.
These machines were not much by modern standards but for the hacker mind they were an amazing playground where you had just your assembly code and the hardware to exploit without the layers and layers of abstractions of modern gear.
I did that for a game in the DOS days. It needed a virtual retrace interrupt: a timer interrupt fired shortly before retrace, busy-waited until retrace began, then recalibrated the timer for the next frame.
Pretty soon after that, operating systems managed memory and interrupts themselves. The added timer latency made precision tricks like that impractical.