(Disclaimer: I've only ever written extremely simple games and 3-D programs, and no 3-D games.)

Sure, you can recreate this level of efficiency on modern systems. But you have to throw most of the modern systems away.

httpdito is 2K and comes close enough to complying with the HTTP spec that you can serve web apps to any browser from it: http://canonical.org/~kragen/sw/dev3/server.s. Each child process uses two unshared 4K memory pages, one for its stack and one for its globals, plus three other shared memory pages. On my laptop it can handle more load than all the HTTP servers on the entire WWW had to handle when I started using the Web in 01992. It's only 710 lines of assembly language. I feel confident that no HTTP/2 implementation can be smaller than 30 times this size.
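
httpdito itself is those 710 lines of assembly, so the code below is not from it; it's just a rough C sketch of the same accept-fork-serve-exit shape, where the port number, buffer sizes, and missing niceties (MIME types, path sanitization, argument handling) are my own placeholders:

    #include <arpa/inet.h>
    #include <fcntl.h>
    #include <netinet/in.h>
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* Serve one request on connection c, then exit; no keepalive, no MIME
       types, no ".." checking -- those are left out of this sketch. */
    static void serve(int c)
    {
        char req[2048], url[1024] = "/index.html", path[1100];
        ssize_t n = read(c, req, sizeof req - 1);
        if (n <= 0) _exit(0);
        req[n] = '\0';
        sscanf(req, "GET %1023s", url);          /* crude request-line parse */
        snprintf(path, sizeof path, ".%s", url); /* serve out of the cwd */

        int fd = open(path, O_RDONLY);
        if (fd < 0) {
            const char *err = "HTTP/1.0 404 Not Found\r\n\r\nnot found\n";
            write(c, err, strlen(err));
            _exit(0);
        }
        const char *ok = "HTTP/1.0 200 OK\r\n\r\n";
        write(c, ok, strlen(ok));
        char buf[8192];
        for (ssize_t k; (k = read(fd, buf, sizeof buf)) > 0;) write(c, buf, k);
        _exit(0);
    }

    int main(void)
    {
        signal(SIGCHLD, SIG_IGN);                /* kernel reaps dead children */
        int s = socket(AF_INET, SOCK_STREAM, 0), one = 1;
        setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &one, sizeof one);
        struct sockaddr_in a = { .sin_family = AF_INET, .sin_port = htons(8080) };
        bind(s, (struct sockaddr *)&a, sizeof a);
        listen(s, 128);
        for (;;) {
            int c = accept(s, 0, 0);
            if (c < 0) continue;
            if (!fork()) { close(s); serve(c); } /* child: serve one request */
            close(c);                            /* parent: back to accept() */
        }
    }

One short-lived process per connection sounds extravagant, but it's the design that lets httpdito keep each child down to two private pages.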

BubbleOS's Yeso is an experiment to see how far you can get by wasting CPU to simplify graphics: it doesn't try to update the screen incrementally or use the GPU. It turns out you can get pretty far. I have an image viewer (121 lines of C), a terminal emulator (328 lines of C), a Tetris game (263 lines of C), a real-time SDF raymarcher (51 lines of Lua), a death clock (864 lines of C, mostly actuarial tables), and a Mandelbrot browser (75 lines of Lua or 21 lines of Python), among other things. Most of these run in X-windows, on the Linux frame buffer, or on BubbleOS's own windowing protocol, Wercam. I haven't gotten around to the Win32 GDI and Android SurfaceFlinger ports yet.
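
Yeso's actual API isn't shown here, but the shape of a no-damage-tracking, no-GPU renderer is simple enough to sketch directly against the Linux framebuffer device; the 32-bits-per-pixel assumption and the test pattern are mine, and you'd run it on a text console rather than under X:

    #include <fcntl.h>
    #include <linux/fb.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/dev/fb0", O_RDWR);
        if (fd < 0) { perror("/dev/fb0"); return 1; }

        struct fb_var_screeninfo v;
        struct fb_fix_screeninfo f;
        ioctl(fd, FBIOGET_VSCREENINFO, &v);      /* resolution, depth */
        ioctl(fd, FBIOGET_FSCREENINFO, &f);      /* stride, mapping size */

        uint8_t *fb = mmap(0, f.smem_len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (fb == MAP_FAILED) { perror("mmap"); return 1; }

        /* No damage tracking, no GPU: recompute every pixel of every frame. */
        for (unsigned t = 0; t < 1000; t++) {
            for (unsigned y = 0; y < v.yres; y++) {
                uint32_t *row = (uint32_t *)(fb + y * f.line_length);
                for (unsigned x = 0; x < v.xres; x++)
                    row[x] = ((x + t) ^ y) * 0x010101;  /* arbitrary moving pattern */
            }
        }
        munmap(fb, f.smem_len);
        close(fd);
        return 0;
    }

At several megapixels and 32 bits per pixel, that's on the order of a gigabyte per second of pixel writes at 60Hz, which is a lot of waste but well within modern memory bandwidth.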

On Linux, if you strip BubbleOS's terminal emulator executable, admu-shell https://gitlab.com/kragen/bubbleos/blob/master/yeso/admu-she..., it's only 35 kilobytes, though its glyph atlas is a JPEG that adds another 81K. About a quarter of that is the overhead of linking with glibc. If you statically link it, it ends up being 1.8 megabytes, because the resulting executable contains libjpeg, libpng, zlib, and a good chunk of glibc, including lots of stuff about locales that is never useful, just subtle bugs waiting to happen. A huge chunk of that code is just pixel-slinging routines from these various libraries, optimized for every possible CPU.

Linked with shared libraries instead, an admu-shell process on this laptop has a virtual memory size (VSZ in ps u) of 11.5 megabytes, four megabytes of which are the pixmap it shares with the X server, containing the pixels it's showing on the screen. Several megabytes of the rest are memory maps for libc, libm (!), libX11, libjpeg, and libpng, which are in some sense not real, because they're mostly shared with this browser process and most of the other processes on the system. There's a relatively unexplained 1.1-megabyte heap segment, which might be the font glyph atlas (a quarter of a megapixel, so about a megabyte at 32 bits per pixel). If not, I assume I can blame it on libX11.

The prototype "windowing system" in https://gitlab.com/kragen/bubbleos/blob/master/yeso/wercaμ.c so far only alpha-blends an internally generated sprite onto an internally generated background, but it does that at 230 frames per second (in a 512x828 X window, admittedly) without even using SSE. The prototype client/server protocol in wercamini.c and yeso-wercam.c is 650 lines of C, about 7K of executable code.
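
For concreteness, here's a sketch of the kind of per-pixel work that alpha blending involves; this is not wercaμ.c's code, just the textbook integer formula:

    #include <stddef.h>
    #include <stdint.h>

    /* Blend n BGRA source pixels over n destination pixels, one channel at a
       time: out = (src*a + dst*(255-a) + 127) / 255, all in plain integer math. */
    static void blend_row(uint32_t *dst, const uint32_t *src, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            uint32_t s = src[i], d = dst[i];
            uint32_t a = s >> 24, na = 255 - a;
            uint32_t r = ((s >> 16 & 255) * a + (d >> 16 & 255) * na + 127) / 255;
            uint32_t g = ((s >> 8 & 255) * a + (d >> 8 & 255) * na + 127) / 255;
            uint32_t b = ((s & 255) * a + (d & 255) * na + 127) / 255;
            dst[i] = 0xff000000u | r << 16 | g << 8 | b;
        }
    }

At 512x828 that's only about 420,000 pixels per frame, so 230 frames per second works out to roughly 100 million blended pixels per second, which a single scalar core can manage without breaking a sweat.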

Speaking of SSE, nowadays you have not only MMX, but also SSE, AVX, and the GPU to sling your pixels around. This potentially gives you a big leg up on the stuff people were doing back then.
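
For a taste of what that looks like at the intrinsics level, here's a hedged sketch (not Yeso code) of a solid fill done four pixels at a time with SSE2, which every x86-64 CPU supports:

    #include <emmintrin.h>   /* SSE2 intrinsics */
    #include <stddef.h>
    #include <stdint.h>

    /* Fill n BGRA pixels with one color, 16 bytes per store. */
    static void fill_row_sse2(uint32_t *dst, uint32_t color, size_t n)
    {
        __m128i c = _mm_set1_epi32((int)color);         /* four copies of the pixel */
        size_t i = 0;
        for (; i + 4 <= n; i += 4)
            _mm_storeu_si128((__m128i *)(dst + i), c);  /* unaligned 128-bit store */
        for (; i < n; i++)                              /* scalar tail */
            dst[i] = color;
    }

AVX2 doubles the store width and the GPU makes even this look quaint, but the principle is the same: move pixels in the widest chunks the hardware gives you.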

In the 01990s programs usually used ASCII and supported a small number of image file formats, and the screen might be 1280x1024 with a 256-color palette; but a lot of games used 640x480 or even 320x240. Nowadays you likely have a Unicode font with more characters than the BMP, a 32-bit-deep screen containing several megapixels, and more libraries than you can shake a stick at; ImageMagick supports 200 image file formats. And you probably have XML libraries, HTML libraries, CSS libraries, etc., before you even get to the 3-D stuff. The OS has lots of complexity to deal with things like audio stream mixing (PulseAudio), USB (systemd), and ACPI, which is all terribly botched, one broken kludge on top of another.

The underlying problems are not really that complicated, but organizationally the people solving them are all working at cross-purposes, creating extra complexity that doesn't need to be there, and then hiding it like Easter eggs for people at other companies to discover through experimentation. Vertical integration is the only escape, and RISC-V is probably the key. Until then, we have to suck it up.

Most of this doesn't really affect you, except as a startup cost of however many hundreds of megs of wasted RAM. Once you have a window on the screen, you've disabled popup notifications, and you're successfully talking to the input devices, you don't really need to worry about whether Wi-Fi roaming changes the IP address the fileserver sees and invalidates your file locks. You can live in a world of your own choosing (the "bubble" in "BubbleOS"), and it can be as complex or as simple as you figure out how to make it. Except for the part which deals with talking to the GPU, I guess. Hopefully OpenCL 3.0 and Vulkan Compute, especially with RADV and WGSL, will have cleaned that up. And maybe if the underlying OS steals too much CPU from you for too long, it could tank your framerate.

To avoid CPU death, use anytime algorithms; when you can't use anytime algorithms, strictly limit your problem size to something that your algorithms can handle in a reasonable amount of time. I think GPGPU is still dramatically underexploited for game simulation and AI.
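
As a sketch of what "anytime" means in practice (none of this is from BubbleOS; the time budget and the toy problem are mine): refine an answer in a loop, check a deadline occasionally, and return the best result you have when time runs out, so a slow frame degrades quality instead of blowing the frame budget.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    static double now_ms(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec * 1000.0 + ts.tv_nsec / 1e6;
    }

    /* Monte Carlo estimate of pi that stops at its deadline: more budget
       means a better answer, never a missed frame. */
    static double estimate_pi(double budget_ms)
    {
        double deadline = now_ms() + budget_ms;
        long inside = 0, total = 0;
        do {
            for (int i = 0; i < 4096; i++) {  /* check the clock only occasionally */
                double x = rand() / (double)RAND_MAX;
                double y = rand() / (double)RAND_MAX;
                inside += x * x + y * y <= 1.0;
                total += 1;
            }
        } while (now_ms() < deadline);
        return 4.0 * inside / total;
    }

    int main(void)
    {
        printf("2 ms:  %f\n", estimate_pi(2));
        printf("50 ms: %f\n", estimate_pi(50));
        return 0;
    }

The same pattern applies to pathfinding, level-of-detail selection, and AI planning: keep a valid answer around at all times and spend whatever CPU is left improving it.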

Unreal Engine 5's "Nanite" system is a really interesting approach to maintaining good worst-case performance for arbitrarily complex worlds, although it doesn't scale to the kind of aggregate geometry riddled with holes that iq's SDF hacking excels at. That kind of angle seems really promising, but it's not the way games were efficient 30 years ago.

Most "modern systems" are built on top of Blink, V8, the JVM, Linux, MS-Windoze, AMD64, NVIDIA drivers, and whatnot, so they're saddled with this huge complexity under the covers before the horse is out of the barn door. These systems can give you really good average-case throughput, but they're not very good at guaranteeing worst-case anything, and because they are very complex, the particular cases in which they fall over are hard to understand and predict. Why does my Slack client need a gibibyte of RAM? Nobody in the world knows, and nobody can find out.



