
Optimizing BitBlt by generating code on the fly - luu
https://blogs.msdn.microsoft.com/oldnewthing/20180209-00/?p=97995
======
pjmlp
> Nearly all of GDI was written in assembly language in versions of Windows up
> to and including the Windows 95 family. In that era, being able to not only
> read but also write assembly language was a core developer skill.

Quite true.

C compilers are cherished nowadays as the ultimate speed demon, but back in
the MS-DOS days, even before Windows became relevant, all high-level
languages could be considered some kind of "scripting".

I remember seeing Basic, Pascal and C listings where the function bodies were
100% inline assembly; the high-level language was only used for data
structures and the non-critical application paths.

~~~
viraptor
> C compilers are cherished nowadays as the ultimate speed demon,

Doesn't that title still belong to Fortran?

~~~
pjmlp
Only in the sense that it is easier to write fast code in Fortran.

Since C99, with the _restrict_ keyword it is possible to give the compiler
the same no-aliasing guarantee that Fortran gets for free.

However, if you get it wrong, the generated code will crash and burn at
runtime, just like other undefined behavior in C.
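A minimal sketch of the difference (function names are illustrative): with
`restrict` the compiler may assume the two pointers never alias, which is the
guarantee Fortran dummy arguments carry by default. Passing overlapping
pointers to the `restrict` version is then undefined behavior.

```c
#include <stddef.h>

/* With restrict, the compiler may assume dst and src never overlap,
 * so it can keep src values in registers and vectorize freely. */
void scale_restrict(float *restrict dst, const float *restrict src,
                    float k, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = k * src[i];
}

/* Without restrict, the compiler must assume a store through dst may
 * change src[i], forcing reloads and blocking some vectorization. */
void scale_alias(float *dst, const float *src, float k, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = k * src[i];
}
```

Both produce the same result for non-overlapping arrays; only the
optimization latitude differs.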

------
badsectoracula
A few years ago I bought a couple of old PCs from eBay, one being an original
IBM PC, the other some unknown 286-based PC, so I decided to create some stuff
for them. One thing I wrote was a 3D maze, and of course I wanted to make it
run at interactive rates on the IBM PC's 4.77 MHz CPU.

Initially I wrote a map editor and a first implementation in QBasic which, of
course, was far from fast, but it allowed me to work out an algorithm for
displaying a simple first-person maze. Then I ported it to Turbo C, which made
it faster, but it was still too slow. After several optimizations, I ended up
writing a machine code generator for rendering the vertical spans (this was
one of the slowest parts), which made the maze draw almost instantly on the
286 with its "turbo" mode turned off [1]. The IBM PC wasn't working at the
time (I had to buy a few more components to make it work), but once I fixed
it, I tried the maze and after a few tweaks it worked there too [2] (sadly no
video).

The generated code actually draws four spans in one go (since CGA packs 4
pixels into a byte), so you first get a coarse frame of the walls that is then
quickly "patched" by some C code that fills in the few pixels that differ and
smooths it out. I found it amusing that filling the entire wall takes almost
as little time as patching up those few pixels.

I hadn't written a program that generated its own code before, and I found it
very interesting. And I felt a bit sad that this sort of optimization isn't
really needed much anymore (in the stuff I am working on, anyway).

[1]
[https://www.youtube.com/watch?v=tfbQIvRYph4](https://www.youtube.com/watch?v=tfbQIvRYph4)

[2] [https://i.imgur.com/M3gKlfA.jpg](https://i.imgur.com/M3gKlfA.jpg)

~~~
megaman22
Were you inspired by any of the Wolfenstein 3D code when you were doing this?
The approach sounds very similar to what I've read about in Fabien Sanglard's
work on analyzing the Wolf3D code - Carmack used some similar tricks there,
from what I recall.

------
faragon
"Nearly all of GDI was written in assembly language in versions of Windows up
to and including the Windows 95 family. In that era, being able to not only
read but also write assembly language was a core developer skill."

For graphics, today the important skill is writing inlined SIMD
optimizations, using language "intrinsics" instead of pure assembly. You can
write inline assembly too, but in most cases intrinsics are enough to reach
80-90% of maximum performance while avoiding manual register allocation.
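As an illustrative sketch of the intrinsics style (function name is mine, not
from any particular codebase): averaging two rows of 8-bit pixels sixteen at
a time with SSE2, leaving register allocation entirely to the compiler.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Average two rows of 8-bit pixels, 16 pixels per iteration. */
void average_rows(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
{
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        /* pavgb: rounded average (a + b + 1) >> 1, per byte */
        _mm_storeu_si128((__m128i *)(dst + i), _mm_avg_epu8(va, vb));
    }
    for (; i < n; i++)  /* scalar tail, same rounding */
        dst[i] = (uint8_t)((a[i] + b[i] + 1) / 2);
}
```

No register names appear anywhere; the compiler decides which xmm registers
hold `va` and `vb`.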

~~~
slededit
Using intrinsics leaves a lot of performance on the table if there is any
register contention. Even Intel's compiler doesn't do that good a job. I've
seen gains of over 2x by hand coding.

~~~
faragon
Sure, especially with some operations, e.g. code using "shuffle" instructions
[1], which take a constant that programs the byte reordering. Most compilers
have problems with complex functions, where because of register pressure some
"shuffle" constants get moved in and out of registers (even when using the
"register" keyword to 'help' the compiler).

[1] [https://software.intel.com/en-us/node/524215](https://software.intel.com/en-us/node/524215)
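A sketch of the kind of code meant here (assuming an SSSE3-capable x86-64 CPU
and GCC/Clang's `target` attribute; function name is illustrative): `pshufb`'s
control vector is exactly the sort of constant that must live in a register
and can get spilled under pressure.

```c
#include <stddef.h>
#include <stdint.h>
#include <tmmintrin.h>  /* SSSE3: _mm_shuffle_epi8 (pshufb) */

/* Byte-swap each 32-bit value in place. The shuffle control vector
 * must be materialized in a register; in a large function, register
 * pressure can force the compiler to spill and reload it. */
__attribute__((target("ssse3")))
void bswap32_block(uint32_t *v, size_t n)
{
    const __m128i ctl = _mm_set_epi8(12, 13, 14, 15, 8, 9, 10, 11,
                                     4, 5, 6, 7, 0, 1, 2, 3);
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128i x = _mm_loadu_si128((const __m128i *)(v + i));
        _mm_storeu_si128((__m128i *)(v + i), _mm_shuffle_epi8(x, ctl));
    }
    for (; i < n; i++)  /* scalar tail */
        v[i] = __builtin_bswap32(v[i]);
}
```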

------
filereaper
This is very similar to optimized array-copy code generated by JIT compilers
today.

------
pauldavis
I think the people who wrote the blit code for early Windows were Marlin Eller
and Walt Moore. Paul Klingler may have worked on that aspect of GDI, too - not
sure.

Those were exciting days.

