> I would have measured the time directly inside of the Doom allocator
Exactly! The author already has all the infra necessary (rdtsc for each call). Measuring time inside the game would be more accurate and simpler. Why did the author do things this way? I must be missing something.
(By the way, the author even seems to be aware of this issue, since they added code that simulates using the allocated blocks (touching one byte for every allocated 4 KiB), but that doesn't feel like nearly enough.)
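For the curious, "timing inside the allocator" plus the page-touching trick amounts to roughly this kind of thing (a minimal sketch, not the author's actual code; TimedAlloc and the plain malloc call are placeholders for the engine's allocator):

    #include <cstdint>
    #include <cstdlib>
    #include <x86intrin.h>   // __rdtsc()

    // Time a single allocation and touch one byte per 4 KiB so the pages
    // are actually faulted in, similar to what the article describes.
    void *TimedAlloc(std::size_t size) {
        std::uint64_t start = __rdtsc();
        void *p = std::malloc(size);              // stand-in for the engine's allocator
        std::uint64_t cycles = __rdtsc() - start;
        (void)cycles;                             // the real thing would log this per call
        if (p) {
            volatile unsigned char *bytes = static_cast<volatile unsigned char *>(p);
            for (std::size_t off = 0; off < size; off += 4096) {
                bytes[off] = 1;                   // touch one byte in every 4 KiB page
            }
        }
        return p;
    }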
> grab the call stack as well
Maybe this part could be done without writing any code, using VTune / Linux perf? Sure, those only gather sampled (stochastic) measurements, so they're not ideal for the original latency measurement. But to get a rough idea of where the costly allocations come from, they could be an easy way.
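The no-code perf version would look something like this (sampled, so it shows which call stacks the expensive allocations come from rather than exact per-call latency; './doom3' is a placeholder for however the binary is actually launched):

    perf record --call-graph dwarf ./doom3
    perf report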
> Why did the author do things this way? I must be missing something.
Likely because it's easier to test against different allocators with a small replay tool than it is to try to get Doom 3 to compile against dlmalloc, jemalloc, mimalloc, rpmalloc, and tlsf.
I would also bet that getting the game to perform the exact same series of allocations would be an intractable problem to solve. I don't think Doom 3 has a benchmark mode; the author just recorded themselves loading the game, loading a level, doing a bit of gameplay, etc.
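A replay tool along these lines is trivial to relink against jemalloc, mimalloc, and so on (a minimal sketch with an assumed trace format, not the author's actual tool):

    #include <cstdint>
    #include <cstdio>
    #include <cstdlib>
    #include <x86intrin.h>

    // One recorded event: op 0 = alloc, op 1 = free; 'id' ties a free back to
    // the allocation it releases. The binary trace format is assumed here.
    struct Event { std::uint32_t op, id; std::uint64_t size; };

    static void *live[1 << 20];  // recorded id -> live pointer (assumes ids fit)

    int main() {
        Event e;
        while (std::fread(&e, sizeof e, 1, stdin) == 1) {
            std::uint64_t t0 = __rdtsc();
            if (e.op == 0) {
                live[e.id] = std::malloc(e.size);  // link against jemalloc/mimalloc/... to compare
            } else {
                std::free(live[e.id]);
                live[e.id] = nullptr;
            }
            std::printf("%u %llu\n", e.op, (unsigned long long)(__rdtsc() - t0));
        }
        return 0;
    }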
Carmack is a great empiricist, meaning he pays a lot of attention to how things work in a running system. All of his engines provide rich facilities for record and replay as well as running benchmarks.
There are lots of facilities for recording and playing back demos. The author could record their own gameplay and play it back in realtime vs running timedemo (which plays back as fast as possible).
id Tech games have a "timedemo" mode, where they replay a pre-recorded demo as fast as possible and report the average frame rate achieved. These were very popular benchmarks long ago.
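For reference, the relevant Doom 3 console commands look roughly like this (names from memory and possibly off in spelling or case; `listCmds` in the console shows the real ones):

    recordDemo mytest     // record your own gameplay to a demo file
    stopRecording
    playDemo mytest       // play it back in realtime
    timeDemo mytest       // replay as fast as possible and report FPS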