Fixing the Loading in Myst IV: Revelation (medium.com/tomysshadow)
264 points by davikr 33 days ago | 71 comments



I’ve just read both parts of the article and I still feel like I’m left with more questions than answers.

The game is bottlenecked on memcpy so hard it takes two seconds to load each time? On a modern machine with double-digit GB/s RAM bandwidth and single-digit GB/s SSD bandwidth, when the game was released on two DVDs and thus can’t have more than a couple dozen GB of assets total[1]. How? OK, they’re doing a memcpy per image row, that’s not nice and can probably cost you an order of magnitude or so, and the assets are JPEG-compressed so it’s another order of magnitude to copy around uncompressed pixels, but still, how?
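
Just to make the per-row point concrete, here is a rough sketch of the two copy strategies (dimensions are made up, and it assumes rows are contiguous); the row-by-row version pays a call and some branches per row, but that alone still shouldn’t add up to seconds:

    #include <cstddef>
    #include <cstdint>
    #include <cstring>

    // Hypothetical image dimensions, purely for illustration.
    constexpr std::size_t kWidth  = 1024;
    constexpr std::size_t kHeight = 1024;
    constexpr std::size_t kBpp    = 4;                 // bytes per pixel
    constexpr std::size_t kStride = kWidth * kBpp;     // bytes per row

    // One memcpy per row, roughly the pattern the article describes.
    void copyRowByRow(std::uint8_t* dst, const std::uint8_t* src) {
        for (std::size_t y = 0; y < kHeight; ++y)
            std::memcpy(dst + y * kStride, src + y * kStride, kStride);
    }

    // The same data moved in a single call (rows assumed contiguous).
    void copyWhole(std::uint8_t* dst, const std::uint8_t* src) {
        std::memcpy(dst, src, kHeight * kStride);
    }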

Furthermore, if it really is bottlenecked on memcpy, why does running on a modern machine not improve things? I almost want to think there’s a fixed amount of per-frame work hardcoded somewhere, and loading DDS is just accounted for incorrectly.

[1] In fact, a screenshot in part 1 shows data.m4b taking up 1.4GB, and the rest of the files shown are either video, sound, or small.


It's what the profiling hinted at, at least, but I don't know how much overhead that tool adds per function call, so if you profile a lot of very small/fast functions you basically just measure which function gets called most.

But you should not underestimate the impact of unnecessarily shoving data around in memory, even with fast RAM. CPU speed has improved much, much more than memory speed over the past decades. If your data layout sucks and you hit L3 or, even worse, actual memory, it's slow as heck relative to L1, or, better still, no copy at all. And then there's the overhead of the plain function call itself. As this is a third-party library, you're guaranteed that each call to this wrapped memcpy is an actual call and not inlined.

But in addition to that I'm pretty sure the decoding library used originally isn't nearly as fast as mango.


If the profiler used is a sampling profiler (and it seems to be), then unlike with instrumentation-based profilers, it doesn't add any function call overhead. It just pauses the program every few ms and records what the call stack is at that point. While this makes the data noisier compared to instrumenting all calls, it also makes the data an unbiased approximation of how the program behaves when not being profiled.

But sampling profilers do still tend to "basically just measure which function gets called most". They can tell a program spent a lot of time in a particular function, but they can't count how many times that function was called – so they can't determine whether it's a slow function, or a fast function that was called many times.
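
To make that concrete, the core of one sampling step on Windows looks roughly like this (a sketch of the general technique, not how Luke Stackwalker is actually implemented — a real sampler also walks the whole call stack and resolves addresses to symbols via DbgHelp):

    #include <windows.h>

    // Take one sample from a thread opened with
    // THREAD_SUSPEND_RESUME | THREAD_GET_CONTEXT access. The target code is
    // never modified, which is why there is no per-call overhead.
    DWORD64 sampleInstructionPointer(HANDLE hThread) {
        DWORD64 rip = 0;
        if (SuspendThread(hThread) != (DWORD)-1) {
            CONTEXT context = {};
            context.ContextFlags = CONTEXT_CONTROL;
            if (GetThreadContext(hThread, &context))
                rip = context.Rip;   // x64; a 32-bit build reads context.Eip
            ResumeThread(hThread);
        }
        return rip;                  // map back to a function name via symbols
    }

Repeat that every few milliseconds, tally which function each sampled address falls in, and you get the kind of profile shown in the article.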


There is a hardcoded value of 550ms, but otherwise you're right. I guess it's still bottlenecked because it runs in a 32-bit environment.


To be completely honest, it's surprising to me as well. I would expect it to be bad, but not as bad as it was. I entirely expected that the slow part would be decoding, not copying. In fact, my initial plan was to convert the remaining images that couldn't be DDS to Targa, on the assumption it would decode faster. However, when I investigated the slow functions and found they were only copying, I changed tactic because then in theory that would not make a difference.

There is no fixed amount of per-frame work. After the 550ms hardcoded timer is up, it is blocking during the loading of those images, and during this phase all animations on screen are completely still. I thought to check for this, because it did occur to me that if it tried to render a frame in between loading each image to keep the app responsive, that would push it to be significantly longer, and that would be a pretty normal thing to want to do! But I found no evidence of this happening. Furthermore, I never changed anything but the actual image loading related code - if it tried to push out a frame after every image load or every x number of image loads, those x number of frames wouldn't go away only by making the images load faster, so it'd have never gotten as instant as it did without even more change.

The only explanation I can really fathom is the one I provided. The L_GetBitmapRow function has a bunch of branches at the start of it, it's a DLL export so the actual loop happens in a different DLL, and that happens row by row for 500+ images per node... I can only guess it must be because of a lack of CPU caching, it's the only thing that makes sense given the data I got. It probably doesn't help that the images are loaded in a single-threaded fashion, either.

That said, there have been plenty of criticisms of my profiling methodology here in these comments, so it would be nice to perhaps have someone more experienced in low level optimizations back me up. At the end of the day, I'm pretty sure I'm close enough to right, at least close enough to have created a satisfactory solution :)


I absolutely did not mean to imply that you did a bad job at any point, or to discourage you. The mere fact that you reached that far into the game’s internals, achieved the speedup you were aiming for, and left it completely functional is extremely impressive to me.

And that’s part of why I’m confused. If you’d screwed up the profiling in some obvious way, I’d have chalked it up to bad profiling and been perfectly unconfused. But your methods are good as far as I can see, and with the detail you’ve gone into I feel I see sufficiently far. Also, well, whatever you did, it evidently did help. So the question of what the hell is happening is all the more poignant.

(I agree with the other commenter that you may have dismissed WaitForSingleObject too quickly—can your tools give you flame graphs?.. In general, though, if machine code produced by an optimizing compiler takes a minute on a modern machine—i.e. hundreds of billions of issued instructions—to process data not measured in gigabytes, then something has gone so wrong that even the most screwed-up of profiling methodologies shouldn’t miss the culprit that much. A minute of work is bound to be a very, very target-rich environment, enough so that I’d expect even ol’ GDB & Ctrl-C to be helpful. Thus my discounting the possibility that your profiling is wrong.)


I was thinking the same thing the whole time. How can a game that was even remotely playable in 2004 still be bottlenecked on memcpy?


Unfortunately the author and the paper he links apply alpha premultiply to the gamma-compressed image. To be correct, this should be done in a linear colorspace. His solution will make some color edge combos get halos.

Basically, alpha in all formats I’ve seen is stored linearly, but colors are gamma compressed (sRGB, HDR stuff, etc.). If you apply alpha premultiply, then linearize, you’ve misapplied alpha. If you ignore linearizing (as even this author shows), you get immediate black halos since your blend is effectively multiplying colors, not adding them.
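
For one channel, the correct order looks roughly like this (a sketch using the standard sRGB transfer function; the actual types and names in the project will differ):

    #include <cmath>
    #include <cstdint>

    // Decode one 8-bit sRGB channel to a linear [0, 1] value.
    float srgbToLinear(std::uint8_t c) {
        float x = c / 255.0f;
        return x <= 0.04045f ? x / 12.92f
                             : std::pow((x + 0.055f) / 1.055f, 2.4f);
    }

    // Encode a linear [0, 1] value back to the sRGB curve.
    float linearToSrgb(float x) {
        return x <= 0.0031308f ? x * 12.92f
                               : 1.055f * std::pow(x, 1.0f / 2.4f) - 0.055f;
    }

    // Correct premultiply: linearize first, then scale by (linear) alpha.
    float premultiplyLinear(std::uint8_t srgbChannel, std::uint8_t alpha) {
        return srgbToLinear(srgbChannel) * (alpha / 255.0f);
    }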


This is something I'd love to get right. Pixman does appear to support sRGB input, in the form of the PIXMAN_a8r8g8b8_sRGB format, which might work well enough for the premultiply step. It's the unpremultiply that I'm struggling to wrap my head around - I'm guessing I'd need Pixman to output to 16-bit channels in the destination, otherwise I wouldn't be able to convert back to sRGB? That's kind of a massive pain though, I'd have to allocate a whole other temporary buffer that's double the size, for something that is imperceptible enough I never noticed it with my test images or in my playthrough. So I'm unsure what the cheapest way to do it would be. This is all well outside of my area of expertise which is primarily hacking and reverse engineering, but I'm always open to learn.

I tried my hardest to create something that was as "technically correct" as I could approximate given my lack of graphics experience and the performance constraints I was under, but I kind of knew it was likely I could mess up some small detail. Maybe since it's open source someone will eventually come along to correct it? One can dream :P


Right, looking at it again, I think I get it now. You'd need to go from the 8-bit sRGB source to the premultiplied image as floating point (Pixman can't output 16-bit channels, it seems), then to the resized image as floating point, then unpremultiply that, and then go back to 8-bit sRGB. It makes sense in my head, I just don't know if it's really worth all that tradeoff, it's a lot of extra steps... I don't know that the original resize algorithm would've even done it, given its age, and my goal is to replicate that. But maybe I'll test and see how it goes eventually
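
Per pixel, that pipeline would look something like this (a sketch only: it assumes the floating-point intermediate is linear, reuses the srgbToLinear/linearToSrgb helpers from the sketch above, and leaves out the resize step itself):

    #include <algorithm>
    #include <cstdint>

    float srgbToLinear(std::uint8_t c);   // as in the earlier sketch
    float linearToSrgb(float x);

    struct LinearPixel { float r, g, b, a; };

    // 8-bit straight-alpha sRGB -> premultiplied linear float.
    LinearPixel toPremultipliedLinear(std::uint8_t r, std::uint8_t g,
                                      std::uint8_t b, std::uint8_t a) {
        float alpha = a / 255.0f;
        return { srgbToLinear(r) * alpha, srgbToLinear(g) * alpha,
                 srgbToLinear(b) * alpha, alpha };
    }

    // Premultiplied linear float (after resizing) -> 8-bit straight-alpha sRGB.
    void toStraightSrgb(const LinearPixel& p, std::uint8_t out[4]) {
        float alpha = p.a > 0.0f ? p.a : 1.0f;   // avoid dividing by zero
        auto encode = [&](float c) {
            float v = linearToSrgb(std::min(c / alpha, 1.0f));
            return static_cast<std::uint8_t>(v * 255.0f + 0.5f);
        };
        out[0] = encode(p.r); out[1] = encode(p.g); out[2] = encode(p.b);
        out[3] = static_cast<std::uint8_t>(p.a * 255.0f + 0.5f);
    }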


Followup: I've now implemented this and determined the extra work doesn't take long enough to have a noticeable impact. It so happens mango has an sRGB to linear function that is much faster than using Pixman to do the conversion, so that's what I used. I kept it 32-bit all the way through, which will introduce some colour banding, but it's not really noticeable with the photorealistic images being resized. So I expect this will be ready for whenever I release my next version


> As any good programmer knows, division is slow, it’s a serializing instruction, and we want to avoid it as much as possible. The favourite programmer tricks to avoid division are to use bit shifts (for division by multiples of two) or flip it into a multiplication — for example, to multiply by 0.333 instead of dividing by 3. In this case though, we are dividing by the alpha, so we can’t know what number we will be dividing by in advance.

> However, because the channels are 8-bit, we will only ever be dividing numbers from 1 to 255 (yes, some values will be zero — but we sure won’t be dividing by them then!) That means there are only about 65K possible combinations, so we can use another classic solution: a lookup table! This is a perfect place to use constexpr to bake the array directly into the compiled result.

Interestingly, when I benchmarked this same problem, three integer divisions would easily beat the LUT on my computer. Maybe because it's easier on the cache? (Or I did something wrong.)
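
For reference, the kind of table the quoted passage describes would look roughly like this (my own naming and rounding; the article's exact formula may differ) - 64 KB of uint8_t, baked at compile time:

    #include <array>
    #include <cstdint>

    // table[value][alpha] = value * 255 / alpha, i.e. the unpremultiply step,
    // precomputed for every 8-bit value/alpha pair (alpha 0 is left at 0).
    constexpr std::array<std::array<std::uint8_t, 256>, 256> makeUnpremultiplyTable() {
        std::array<std::array<std::uint8_t, 256>, 256> table{};
        for (int value = 0; value < 256; ++value) {
            for (int alpha = 1; alpha < 256; ++alpha) {
                int result = value * 255 / alpha;
                table[value][alpha] = static_cast<std::uint8_t>(result > 255 ? 255 : result);
            }
        }
        return table;
    }

    constexpr auto unpremultiplyTable = makeUnpremultiplyTable();   // C++17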


I think, on a modern processor, a lookup table of 65k is way too big to be worthwhile for something as simple as a division

memory access is relatively very slow; OP probably should have benchmarked it


Chances are only a small portion of the lookup table is used. It's conceivable that this is almost entirely L1 access.

But I'm also very skeptical that this is actually faster. Would have been nice to see some numbers


I think an L2 cache hit is about the same number of cycles as a single-byte divide, might as well save the cache for something else

but yeah, this is just my speculation


Depends on the use and context; benchmarking can hint at the "truth" that it is faster to just divide, but ultimately the benchmark would have to run the two implementations through all possible inputs to know for sure.

It may clearly show that memoization is slower, until you plop it into a benchmark that inlines the division explicitly without using a given hardware accelerator.

Division by a constant input can be solved faster still, so there can be optimizations on top of memoization that would beat raw division on most processors/runtime environments.


Instead of compiling and running code, just replace all of your code with composable functions that are each small enough to look up the results for in a precomputed hardware dictionary. In this way you have a Turing-complete CPU with only one instruction.

I started this comment as a tongue-in-cheek satire of yours, but now I’m honestly wondering if it could be a viable idea for a radically simplified CPU (I’m not a computer engineer). I suppose the lookup tables rapidly become too large, possibly before they are useful?


For an 8-bit CPU you could maybe do it. For a 64-bit CPU, the lookup table for addition alone would be massive. (You can of course do addition in smaller increments, like adding one digit at a time and keeping track of the carry, but then you just have a normal full adder.)

The biggest issue is that this CPU would be conceptually simple, but very difficult to make fast. Memory is slow, and accessing a 64K lookup table uses more transistors than just doing the addition.
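
As a toy illustration of the scaling problem (not a real CPU design, obviously): an 8-bit table-lookup adder already needs 64 KB, and each extra operand bit quadruples the entry count, which is why anything wider has to be built by chaining small units with a carry.

    #include <array>
    #include <cstddef>
    #include <cstdint>

    // 64 KB table: addTable[a * 256 + b] = (a + b) & 0xFF, carry-out discarded.
    std::array<std::uint8_t, 256 * 256> makeAddTable() {
        std::array<std::uint8_t, 256 * 256> table{};
        for (int a = 0; a < 256; ++a)
            for (int b = 0; b < 256; ++b)
                table[static_cast<std::size_t>(a) * 256 + b] =
                    static_cast<std::uint8_t>((a + b) & 0xFF);
        return table;
    }

    std::uint8_t add8(const std::array<std::uint8_t, 256 * 256>& table,
                      std::uint8_t a, std::uint8_t b) {
        return table[static_cast<std::size_t>(a) * 256 + b];
    }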


Just build 4 of them and chain them together to get 64-bits.

(Beyond just being a joke, it is often how things like this get scaled. 1-Bit Adders become 2-Bits with a Carry Line between, then 3-bits with another Carry Line, and so forth, ad infinitum and beyond. The real joke is the complexity involved in however you "just" chain them together.)


Again, I'm no expert here but find this stuff fascinating. Could the simplicity possibly allow some radically different CPU design that makes the lookups alone nearly instantaneous? I could imagine some types of optical/photonic physical 3D hash table structures - even ones where the same physical structure could support a large number of read operations in parallel if pre-arranged by a compiler to not physically interfere. I imagine a CPU with only one instruction could be physically minuscule, and therefore pack a lot of cores into a small space.

Hypothetically, if it were orders of magnitude faster than a normal CPU, one could still perform rapid computation on larger numbers as needed while keeping the CPU physically 8-bit - yet be able to revert to even higher performance when less precision is needed.


If you're interested in One Instruction Computers, check out: https://en.m.wikipedia.org/wiki/One-instruction_set_computer


Thanks! I half expected to be told one instruction computers were impossible.


There are packages for those things in npm, but then you'll have to start using JavaScript...

https://www.npmjs.com/package/@samuelmarina/is-even

The javascript crowd. They are always one step ahead!


Haha, a 100 MB library, it doesn't say so but is this really a dictionary of numbers with even/odd as values? I love that they have an entirely separate is-odd package.

//edit: looked at the code, it's literally 100 MB of if-else statements - and a bunch of them are wrong! The GitHub pull requests to add additional integers are hilarious.



I've been (very slowly) working on this for a 4-bit CPU. Not for any practical reason, just for fun.


Any more details you could share? Is this a physical cpu design, or a VM/assembler?


CPUs do use a ton of LUTs under the hood.


I wonder if this was inspired by this discussion, but this just made the front page. I didn’t realize the famous Pentium FDIV bug was due to an error in a lookup table used for division: https://news.ycombinator.com/item?id=42391079


It also depends a lot on your CPU. Operations can be made faster by dedicating more resources to them. Most CPUs go for a slowish division operation that isn’t too expensive in terms of resources, or lower-end ones don’t provide it at all and make you do it in software. Apple seems to have decided that integer division is pretty important, so they dedicated a lot of resources to it and the M1 does a divide in something like two cycles. Even thinking about loading from memory is probably slower than that.


A 256-item float32 LUT for 8-bit sRGB -> linear conversion is definitely still faster than doing the division live (I re-benchmarked it on Zen4 and Apple M3 last month), however floating point division with the newer microarchs is not as slow as it was on processors 10 years ago or so, so I can imagine using a much larger LUT cache is not worth it.


does this include vectorized code? I stopped using LUTs for anything “trivial” probably 20 years ago because I rarely see any improvements (in particular where it would benefit the overall runtime noticeably).


With all the discussion of lookup tables here it is worthwhile to remember the https://en.wikipedia.org/wiki/Pentium_FDIV_bug from the mid 1990s.


I was a little kid the age my son is now when I went with my dad to buy a 486 cpu and motherboard… he said he wasn’t getting the pentium because he did scientific computing and was worried about this bug. It’s funny that I only now understand what the significance of that was due to this thread.


I think any software engineer can identify with the feeling you get at the moment you do the first run of the solution you have implemented that you are 100% sure has to fix it, only to find nothing has changed.


Corollary: the relief/anguish when you discover that the reason none of your fixes have worked, nor your debugging print statements produced output, is because you were editing a different copy of the file than was getting built/run because you moved or renamed something and your editor didn't notice.


Running the Windows build but editing the WSL files


This reminds me of when I was trying to do Minecraft-style chunking in Bevy. I was in a situation where (instead of doing the not-so-obvious fix) I threw parallelization, compiler optimization, caching, release flags, etc. at my project and nothing made it go faster. I could not figure out why it was so slow. Turns out what I was doing was so unoptimized that I might as well have loaded the whole world per frame.

You live and you learn :)


I was genuinely concerned that everything I was doing with mango and Pixman was going to turn out to be pointless. It wasn't, thankfully, there was a noticeable difference after introducing them. But it was a gamble for sure, because there was no smaller test I could really do to know it was worth it in advance - if I wanted to replace that DLL, I was going to have to replace the whole DLL because it was C++, the DLL exports were all mangled names for classes that all kind of interacted with each other, so I couldn't just cleanly replace one call and see that it was a good idea. I try to gather as much evidence as I can to back the idea it'll work before I make the leap, but I've learned that if you really want to get stuff done sometimes you just have to go for it and assume there is a way to salvage it if it fails


This has happened to me so many times. Especially in the distributed database I work on ... "hmm maybe I need to let the experiment run for longer, this data is noisy so it probably needs more time to show a trend line".


I really enjoyed the author's technical deep-dive and approach to debugging performance issues. Mild spoilers for anyone who hasn't played Riven, but the method for fixing Gehn's faulty linking books is a perfect analogy for the author's more counterintuitive performance optimizations.

While I don’t have a write-up as detailed as this one, I spent a month on a similar journey optimizing an animated ASCII art rasterizer. What started as an excuse to learn more about browser performance became a deep dive into image processing, WebGL, and the intricacies of the Canvas API. I’m proud of the results but I’ve annotated the source for a greater mind to squeeze another 5 or 10 FPS out of the browser.

Maybe it’s time to brush up on those WebGL docs again…

- [1] https://asciify.sister.software/

- [2] https://github.com/sister-software/asciify/blob/main/Asciify...


Very good read! Love the detailed explanations of the "bad" original code and the steps taken toward improving it. A lot of it comes down to personal preference, and the author did a good job of respecting what might have been an intentional design decision by making their optimizations all configurable.


Great writeup. A typical "this shouldn't be too hard" story with yet another surprise around every corner. Seems familiar... :)

One thing I wondered is whether with that optimized loader library, is it even still necessary to do the DXT conversion at all? Sounds like mango and pixman could be fast enough already....


For me at least, with the optimized loader library, yes, the impact of the DDS conversion is almost unnoticeable. However, I have a pretty fast CPU so I don't want to assume that it'd be the case for everyone, and the DDS conversion was done first, so even if it is overkill it costs me nothing to leave in and will better serve anyone with a slow CPU where mango and Pixman aren't enough.

Shortly after I released my tool, I had someone report that it was crashing for them because they were using a third-gen Intel CPU that didn't have an instruction set the x64 command line portion used (specifically, the BMI instruction set). It was a bug in the mango image library anyway, because I had disabled that particular instruction set when I built it, but it goes to show that when you're doing a retro game hacking project, a lot of gamers are keen to use older hardware for them, and I'm quite aware of this fact


Love seeing how the optimization parameters were different back then - that is, size constraints were more important than loading speeds, even though both drives and CPUs were much slower.

Ideally, companies that make games like this keep all the original assets and make things like image format a build switch, for when the parameters change in the future. That said, back then they released on DVD (I'm reading it would've taken 12 CDs otherwise), and I don't believe any higher-capacity storage devices were in the pipeline yet at that point. Then again, hard drives back then were around the 100 GB mark, so a multi-DVD release would've been doable.

Ironically, nowadays some games (like the FFVII Remakes) are on two discs again, an install and a run disc, despite discs having a 50 or 100 GB capacity.


It was already on two discs. You could choose whether to copy one or both to the hard drive.

The game ran on machines as old as late 1999, so the typical disk size would’ve been more in the 10-20 GB range. Even the existing 3.4 GB minimum requirement (installing just one disc and streaming content off the other) was a pretty hefty ask.


I remember on PS1 it sold with 3* CDs. It was beautiful, as they had to make a special double-width holding box. It stood nicely aligned with all the other games but could be spotted from afar.

We see physical media as a burden today; that's sad, as they used to be pieces of art.

Edit: 3, not 4.


Someone on Reddit pointed out to me afterwards that DDS is DirectX-specific and the DVD was for Win/Mac, so it may have been more of a multiplatform issue than a space one. In an ideal world you'd use DDS on Windows because it's fastest there, and whatever the OpenGL equivalent is on Mac, but that would've been well, well outside of space constraints at that point


It was released on 2 DVDs as far as I remember.


stb_image didn't get used in the end because some other library was faster, but I think the author missed the possibility of #defining their own allocator using STBI_MALLOC (which could just return a pointer to an existing memory block).


No, actually, I did see this. The problem with using STBI_MALLOC in that way is that it is used for _all_ allocations by STB, not just the main image buffer. The image buffer I need to put the data into is passed in as an argument to Gfx Tools. It is already existing, and I couldn't touch the code where it was created, that code lived in yet another DLL, the one calling into Gfx Tools. So if I overrode STBI_MALLOC to just return that same buffer every time, then any time STB calls malloc, each malloc call would return the same buffer, that same buffer would be used for different allocations with entirely different purposes. So, it's close, but it wouldn't work. I would need to have done some hack like checking the requested buffer size to malloc matches the expected size of the image buffer, and only return it then, but that's of course quite fragile and error prone. Or, you know, just go into the STB source and find the one malloc I need to replace, but that's kind of dirty.
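
For anyone curious, the sketch below (hypothetical names, purely to illustrate the fragility described above) shows why the override is awkward here: the macros intercept every allocation stb_image makes, so redirecting just the pixel buffer means guessing by size:

    #include <cstddef>
    #include <cstdlib>

    // Hypothetical stand-ins for the existing Gfx Tools buffer to be reused.
    static unsigned char* g_targetBuffer = nullptr;
    static std::size_t    g_targetSize   = 0;

    static void* customStbMalloc(std::size_t size) {
        // Only an allocation whose size happens to match the image buffer can
        // be redirected; every other stb_image allocation still needs the heap.
        if (g_targetBuffer && size == g_targetSize) return g_targetBuffer;
        return std::malloc(size);
    }

    static void customStbFree(void* p) {
        if (p != g_targetBuffer) std::free(p);
    }

    // stb_image requires all three to be overridden together, before the
    // implementation is included.
    #define STBI_MALLOC(sz)        customStbMalloc(sz)
    #define STBI_REALLOC(p, newsz) std::realloc((p), (newsz))
    #define STBI_FREE(p)           customStbFree(p)
    #define STB_IMAGE_IMPLEMENTATION
    #include "stb_image.h"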


Thanks for the detailed explanation, good to know!


> the author explains they used a tool called Luke Stackwalker to profile the game

Can anyone confirm my memory that Microsoft had a tool called Luke Heapwalker in the mid-1980s, and that Lucasfilm demanded they change the name?


I can't find the exact story, but this PDF [1] has a footnote about the naming being changed (search for Luke).

[1] https://libros.metabiblioteca.org/server/api/core/bitstreams... [pdf]


Many thanks. I love the whole book.


> In this profile, we can see that approximately 50% of the time is spent on WaitForSingleObject, but we know that is a part of the game’s normal rendering loop so we can dismiss it as background noise.

That's not an entirely safe assumption. Even a single threaded game could wait on different handles at different points in its logic.


My hat is off to this, I really appreciate how he documented every step he took. It's lengthy but definitely worth the read.


> So we know that WaitForSingleObject is where the majority of CPU time should be spent during normal operation, and we can dismiss anything that appears in this first list as not the source of the problem.

This heuristic might have worked this time but I don't think it's great in general. System functions can be used for many different purposes and even the same use might be fine in one place and a bug in another. For example the game could have been unintentionally vsyncing many times during the loading process, i.e. to update a progress bar. And no, that's not a purely hypothetical scenario.


Great read! Though there is an unnecessary double map lookup in part 2: https://github.com/tomysshadow/M4Revolution/blob/094764c87aa...
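
(For anyone reading along, the usual fix for that pattern is to look the key up once and reuse the iterator, rather than doing a contains/count check followed by operator[]. A generic sketch, not the actual code from the linked line:)

    #include <string>
    #include <unordered_map>

    int lookupOnce(std::unordered_map<std::string, int>& cache,
                   const std::string& key) {
        auto it = cache.find(key);              // single lookup
        if (it != cache.end())
            return it->second;
        // Not found: insert a default and return it, still without a second find.
        return cache.emplace(key, 0).first->second;
    }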


This is awesome (and very impressive)!

Two questions:

1. What tool was used to generate that "Ange Albertini-inspired file format diagram"?

2. Is there an emulator that would make this easy to play under Android?


1. There are probably better tools to do it, but I just used Google Drawings.

2. Not for Myst IV currently. All of the prior games are supported by ScummVM, which would work on an Android device, but Myst IV is not in there yet. Maybe someday though


My manager takes one look at this and asks: so, in the end the effort was unsuccessful? No impact? That’s OK, let’s get you refocused on something productive :)


Tangentially related, previous releases of the game were also hit by a DirectInput device enumeration regression that Microsoft (behind the scenes) refused to fix. (I haven't checked the latest re-release.)

https://github.com/riverar/IndirectInput


This guy Mysts.


What is this weird game?


Myst is a very old game where you move around by clicking through a series of pre-rendered 3D images. This is back when each image would take minutes to hours to render (like Pixar did with Toy Story 1), so graphically it looks amazing for its time, but it actually existed at the same time the Super Nintendo was popular (the 1990s)


The first game was unimaginably popular after its release in 1993. It quickly became the best-selling computer game of all time, and stayed there for nine years until The Sims unseated it in 2002. It's still the 25th highest-selling computer game ever[0]. Kids these days!

[0]: https://en.wikipedia.org/wiki/List_of_best-selling_PC_games


Why not just do a search and get an answer in five seconds?


This was very impressive to read, and while I don’t have the technical knowledge to do this, it reminded me of “fixing” my mental health when Stanford psychiatrists diagnosed me and said I’d be on pills the rest of my life, incurable.

Years later, after rebuilding my psyche from scratch, happy to report they were wrong.

But striking similarities, where the “professionals” just didn’t bother to solve a solvable problem.


You might feel that your mind is a computer program you can simply rewrite, but I feel that mine is a horse that does not listen to me

Pills work for a lot of things (ADHD for instance) and they're a lot faster than years


Trade-offs for sure. Short-term vs. long-term impacts.

Having fully healed from so-called bipolar, schizoaffective, anxiety and suicidal depression, I’m quite grateful I read the primary research and didn’t listen to the pharma-industrial complex.

Wild that things like dancing have proven to be more effective (and with no side effects) than SSRIs.

And sure, you can always tranquilize a horse, or you can take the time to learn to ride.



