

IdTech 4, 15% frame rate increase through semiautomatic parallelization - willvarfar
http://www.vectorfabrics.com/blog/item/vector_fabrics_accelerates_idtech_4_game_engine

======
randallu
So looking at the patch:

1\. The allocator was a bottleneck; wonder if tcmalloc would have done better?

2\. Computing what to draw? (idInteraction) was reduced to a for-loop and
parallelized

3\. ... lots more pre-rendering stuff ...

Unfortunately they didn't annotate the patch with the size of the win for any
given change, or write up an analysis on what the changes were (I'm unfamiliar
with idTech, too).

If they got the biggest win from improving the allocator, well, that's not
very interesting. If they got the biggest win from the other stuff, and the
tool told them what was safe to parallelize then that's much more interesting.

Actually, the whole thing would be better if the write up explained how they
used the tool and what it did and how much improvement they got at each
step...

~~~
jspthrowaway
glibc's malloc() is known to be lackluster under parallel workloads.
Alternatives do perform better in certain circumstances, but profile before
making the switch -- your situation might not be one of them.

It's possible to inject tcmalloc into an executable using Linux's dynamic
linker preload mechanism (LD_PRELOAD), too, without recompiling[1].

[1]:
[http://gperftools.googlecode.com/svn/trunk/doc/heapprofile.h...](http://gperftools.googlecode.com/svn/trunk/doc/heapprofile.html)
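For reference, the preload trick looks roughly like this (the library path and binary name are illustrative and vary by distro; this is a sketch of the mechanism, not a tested recipe for Doom 3):

```shell
# Swap glibc's malloc for tcmalloc without recompiling: the dynamic
# linker resolves malloc/free from the preloaded library first.
LD_PRELOAD=/usr/lib/libtcmalloc_minimal.so ./doom.x86

# Confirm the interposition by checking the process's memory maps:
LD_PRELOAD=/usr/lib/libtcmalloc_minimal.so ./doom.x86 &
grep tcmalloc "/proc/$!/maps"
```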

~~~
KaeseEs
I'm quite surprised that idTech4 was using _any_ malloc implementation aside
from the initial allocation of the memory pool that it would use for its
actual runtime allocator.

~~~
BlackAura
It actually does use its own heap allocator, with some additional allocators
built on top of it. I don't know if it also uses the system / libc allocator
for anything, but I'd be surprised if it did.

Aside from glQuake (which used a custom allocator for most of the game, but
used malloc quite often in the GL renderer), all of id's other engines used
custom allocators for everything.

Edit: just skimmed the white paper. They were talking about the custom
allocator which was, of course, not thread safe, because it was designed for a
single threaded game engine. Replacing that with an allocator designed for
parallel usage (tcmalloc for example) would have helped, but so would having
separate allocation pools for different threads (which you might do for a game
engine designed for multithreading). Instead, they made the allocator thread
safe, using their tool to analyze it, and then sprinkling it with mutexes.
That would probably be why it was a bottleneck.
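The contrast between the two approaches can be sketched in a few lines (class and method names are hypothetical, not idTech 4's actual allocator API; the pool logic is a stand-in):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <mutex>

// 1) "Sprinkle it with mutexes": one lock serializes every allocation,
//    so all threads contend on it -- hence the bottleneck.
class LockedHeap {
    std::mutex m_;
public:
    void* alloc(std::size_t n) {
        std::lock_guard<std::mutex> lock(m_);
        return std::malloc(n);  // stand-in for the engine's pool logic
    }
    void free(void* p) {
        std::lock_guard<std::mutex> lock(m_);
        std::free(p);
    }
};

// 2) Per-thread pools: each thread bump-allocates from its own arena,
//    so the fast path needs no lock at all (growth and cross-thread
//    frees are elided in this sketch).
class PerThreadHeap {
public:
    void* alloc(std::size_t n) {
        thread_local char arena[1 << 16];
        thread_local std::size_t used = 0;
        if (used + n > sizeof(arena)) return nullptr;  // fall back in real code
        void* p = arena + used;
        used += (n + 15) & ~std::size_t(15);  // keep 16-byte alignment
        return p;
    }
};
```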

------
egypturnash
For those of us who haven't played Doom 3, is it normal to have those rapid
flashes of numbers when you're smacked by one of those red guys in the demo
they're showing it off with? Or is this a weird bug introduced somewhere along
the line?

~~~
kevingadd
Those are absolutely a bug. They probably screwed up and didn't put mutexes
around every piece of shared state.

~~~
fons
It is not a bug. That effect was already present in the pristine unoptimized
code. Quoting Vector Fabrics new blog post [1]:

"Some people wondered if the HUD textures are messed up in the optimized
version, because the demo shows that as soon as you get hit by a monster the
health and score numbers seem to blow up in your face. The answer is that this
is part of the game play and already present in the original unoptimized
version."

[1]
[http://www.vectorfabrics.com/blog/item/accelerating_the_idte...](http://www.vectorfabrics.com/blog/item/accelerating_the_idtech_4_engine_with_pareon_a_follow_up)

------
fons
There is a new blog post from Vector Fabrics on this topic. It includes some
clarifications and tries to address the questions raised so far:
[http://www.vectorfabrics.com/blog/item/accelerating_the_idte...](http://www.vectorfabrics.com/blog/item/accelerating_the_idtech_4_engine_with_pareon_a_follow_up)

------
kevingadd
A 15% framerate increase from 3 weeks of work isn't particularly impressive.
It would be more meaningful if the measurement were in terms of frame time
(i.e. elapsed time per frame), since framerate is not linear.

From looking at the patch (nice of them to provide it), it seems like most of
the changes are putting mutexes around things to guard against simultaneous
access and then parallelizing some loops. It seems to me like anyone with
basic knowledge of the game's codebase could have made these changes without
any support from a tool like the one they're trying to sell, and probably done
it in far less than 3 weeks. If your goal is to run a piece of code in
parallel, you don't need a program to walk through that code and identify all
the shared state it manipulates (and then go through and make sure that state
has a mutex guard). If the software they're trying to sell _did that for you_
, now that would be nice. It seems unlikely that you could automatically
modify C++ source code without a lot of constraints, though.

Software to aid authoring parallel C++, or detect places where you can safely
parallelize existing code, would certainly be valuable but this is a pretty
poor demonstration of any actual advantages offered by their software.

EDIT: From watching the included demo video, the game is running well below
60fps which suggests that they're running it on a low end graphics card (or
generally low-clocked machine) to improve the gains from their optimizations.
I don't think their results are necessarily real-world as a result of this -
someone playing Doom 3 on an actual gaming rig would probably not see a 15%
framerate increase because their framerate would already be well above 60fps
(framerate is nonlinear, as I mentioned above).

~~~
jblow
This is not just middlebrow dismissal, it's outright wrong. In the AAA world
we would _love_ to spend only 3 weeks of an engineer's time to get a 15%
speedup. Seriously, that is a great deal, it's like, where do I sign up?

However, it becomes substantially less impressive when you notice that you're
using 2x or 4x the amount of processor hardware (2 or 4 cores) and only
getting a 15% speedup. In a by-hand implementation that would be very
disappointing.

If it were fully automated that would still be pretty valuable, but it appears
that this isn't. So it seems to be of questionable utility.

~~~
BlackAura
Worse, they didn't actually get a 15% speedup. They reduced the frame time
from 16.7ms (around 60 FPS) to 15.3ms (around 65 FPS), which is only a speedup
of around 9%.

They didn't seem to mention how much they're actually utilizing the remaining
cores either. From the look of it, they've basically parallelized a couple of
loops in the renderer. That strikes me as being the wrong way to go about
this, especially considering that modern AAA games apparently break everything
down into tasks, and run the separate tasks in parallel (idTech5 apparently
does this, for example).
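For contrast, the two strategies look roughly like this (a generic sketch, not idTech code; stage workloads are placeholders):

```cpp
#include <algorithm>
#include <cstddef>
#include <future>
#include <thread>
#include <vector>

// Loop parallelism: split one hot loop across cores. It only helps
// while that particular loop is running; the rest of the frame stays
// serial (Amdahl's law caps the overall gain).
void update_all(std::vector<float>& v) {
    const unsigned n = std::max(1u, std::thread::hardware_concurrency());
    const std::size_t chunk = (v.size() + n - 1) / n;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < n; ++t) {
        std::size_t lo = t * chunk;
        std::size_t hi = std::min(v.size(), lo + chunk);
        workers.emplace_back([&v, lo, hi] {
            for (std::size_t i = lo; i < hi; ++i) v[i] *= 2.0f;
        });
    }
    for (auto& w : workers) w.join();
}

// Task parallelism: independent frame stages (AI, physics, audio, ...)
// run concurrently, keeping cores busy across the whole frame rather
// than only inside one loop.
float frame() {
    auto ai      = std::async(std::launch::async, [] { return 1.0f; });
    auto physics = std::async(std::launch::async, [] { return 2.0f; });
    return ai.get() + physics.get();
}
```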

~~~
to3m
p21 in the white paper suggests the speedup is ~2x, for the parts that were
actually parallelized. That's more interesting (well, to me) than the rather
limited increase in overall frame rate.

You may be right that they are doing the wrong thing, but I think the fact
it's doable at all is reasonable evidence that their tool is useful. I'm
pretty impressed that somebody who's totally unfamiliar with the code is able
to jump in and start parallelizing stuff - particularly if they've never
worked on a game before, and so might not have any real idea where would be a
good place to start, or how things are likely to work.

Maybe my standards are too low.

(Those screwy HUD textures might be evidence that they managed to get it
completely wrong, of course ;) - or maybe it's just something simple.)

~~~
BlackAura
Good point. Their tool is obviously useful, even though something like this
might not be the best use case for it.

The HUD thing - it looks like they screwed up the post-processing effects
somehow. As far as I remember, the original game did apply post-processing
effects to the HUD. Probably something to do with OpenGL not getting a decent
way to do render-to-texture until 2005, so they'd have had to use the previous
frame's render buffer. It wasn't that noticeable in the original though.

