Optimizing BitBlt by generating code on the fly (msdn.microsoft.com)
49 points by luu on Feb 10, 2018 | hide | past | favorite | 10 comments


> Nearly all of GDI was written in assembly language in versions of Windows up to and including the Windows 95 family. In that era, being able to not only read but also write assembly language was a core developer skill.

Quite true.

C compilers are cherished nowadays as the ultimate speed demon, but back on MS-DOS, even before Windows became relevant, all high-level languages could be considered a kind of "scripting".

I remember seeing Basic, Pascal and C listings where the function bodies were 100% inline assembly; the actual high-level language was only used for data structures and the slow application paths.


> C compilers are cherished nowadays as the ultimate speed demon,

Doesn't that title still belong to Fortran?


Only from the point of view of the code being easier to write.

Since C99, `restrict` makes it possible to give the compiler the same no-aliasing guarantee that Fortran gets for free.

However, if you get it wrong, your program will crash and burn, just as in other cases of undefined behavior in C.


A few years ago I bought a couple of old PCs from eBay, one being an original IBM PC, the other some unknown 286-based PC, so I decided to create some stuff for them. One thing I wrote was a 3D maze, and of course I wanted to make it run at interactive rates on the IBM PC's 4.77MHz CPU.

Initially I wrote a map editor and a first implementation in QBasic which, of course, was far from fast, but it allowed me to work out an algorithm for displaying a simple first-person maze. Then I ported it to Turbo C, which made it faster, but it was still too slow. After several optimizations, I ended up writing a machine code generator for rendering the vertical spans (one of the slowest parts), which made the maze draw almost instantly on the 286 with its "turbo" mode turned off [1]. The IBM PC wasn't working at the time (I had to buy a few more components to get it going), but once I fixed it, I tried the maze and after a few tweaks it worked there too [2] (sadly no video).

The generated code actually draws four spans in one go (since CGA packs 4 pixels into a byte), so you first get a coarse frame of the walls that is then quickly "patched" with some C code to fill in the few pixels that differ and smooth it out. I found it amusing that it takes almost as much time to fill the entire wall as to patch those few pixels.

I hadn't written a program that generated its own code before, and I found it very interesting. And I felt a bit sad that this sort of optimization isn't really needed much anymore (in the stuff I am working on, anyway).

[1] https://www.youtube.com/watch?v=tfbQIvRYph4

[2] https://i.imgur.com/M3gKlfA.jpg


Were you inspired by any of the Wolfenstein 3D code when you were doing this? The approach sounds very similar to what I've read about in Fabien Sanglard's work analyzing the Wolf3D code - Carmack used some similar tricks there, from what I recall.


"Nearly all of GDI was written in assembly language in versions of Windows up to and including the Windows 95 family. In that era, being able to not only read but also write assembly language was a core developer skill."

For graphics, an important skill today is writing inlined SIMD optimizations using language "intrinsics" instead of pure assembly (you can write inline assembly too, but in most cases intrinsics are enough to reach 80-90% of maximum performance while avoiding manual register allocation).


Using intrinsics leaves a lot of performance on the table if there is any register contention. Even Intel's compiler doesn't do that good a job. I've seen gains of over 2x by hand coding.


Sure, especially with some operations, e.g. code using "shuffle" instructions [1], which take constants that program the byte reordering. Most compilers have problems with complex functions: under register pressure the "shuffle" constants get spilled in and out of registers (even when using the `register` keyword to "help" the compiler).

[1] https://software.intel.com/en-us/node/524215


This is very similar to optimized array-copy code generated by JIT compilers today.


I think the people who wrote the blit code for early Windows were Marlin Eller and Walt Moore. Paul Klingler may have worked on that aspect of GDI, too - not sure.

Those were exciting days.



