I couldn't help but laugh reading the first link. Sure, there are some good ideas for optimization. But (1) no profiling, and (2) there are a couple really bogus suggestions.
For example, suggestion #33, to shift by a constant amount instead of a variable amount. You see, it only looks like you're shifting by a variable amount. This is one of those things that compilers have been optimizing for years and are very good at: strength reduction wrt variables that depend on the loop variable.
You'll also see game programmers do things like "x >> 4" instead of "x/16", because "x >> 4" is faster. It is faster in assembly language, but your compiler already knows that and you are just making the code harder to read every time you do a micro-optimization.
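To see this directly, here's a minimal sketch (the function names are mine; note the signed case is subtler, as comes up further down the thread). For an unsigned operand, an optimizing compiler emits the same single shift instruction for both spellings, so the readable `/ 16` version costs nothing:

```c
/* Two spellings of the same operation; for unsigned x, optimizing
   compilers generate an identical single-shift instruction for both. */
unsigned div_by_16(unsigned x)  { return x / 16; }
unsigned shift_by_4(unsigned x) { return x >> 4; }
```

Compile either with -O2 and compare the disassembly if you don't believe it.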
Game programmers spread the gospel on lots of premature optimization nonsense, in my opinion, and are mostly inexperienced when it comes to writing maintainable code. It's a kind of hazard of the industry. Performance problems mean cutting beloved features, and rather than doing any maintenance you just start a new game from scratch. (Not universally, of course. There are a few programmers professionally working on engines.)
Slide #33 is actually reasonable advice. Variable shift = 11 cycle latency on PS3/Xbox360, and it blocks both threads and disables interrupts as it runs. (Will the compiler figure this out? Maybe, maybe not. But if you write the code you want - which in this case you might as well, since the transformation is simple - then you won't have to worry about that. As a general point of principle you should write the code you want, rather than something else that you hope will become the code you have in mind; computers are very bad at figuring out intent, but excellent at simply following instructions.)
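A sketch of the transformation in question (hypothetical functions, assuming the shift amount tracks the loop variable): instead of shifting by `i` each iteration, carry the shifted value across iterations and shift it by a constant.

```c
/* Variable-shift form: the shift amount depends on the loop variable. */
unsigned sum_shifts(unsigned x) {
    unsigned sum = 0;
    for (int i = 0; i < 32; i++)
        sum += x >> i;      /* variable-width shift every iteration */
    return sum;
}

/* Hand-reduced form: same sums, but only a constant shift by 1. */
unsigned sum_shifts_reduced(unsigned x) {
    unsigned sum = 0;
    for (int i = 0; i < 32; i++) {
        sum += x;
        x >>= 1;            /* fixed-width shift */
    }
    return sum;
}
```

The two are equivalent, and the second never issues a variable-width shift.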
What are the other bogus suggestions? The overall thrust of the slides seems to me valid: know what's expensive (branches, memory access, mini-pitfalls like the microcoded thing), pick a strategy that avoids all of that, and don't throw away performance. Performance always ends up being an issue; you don't have to carefully preserve every last cycle, but that's no excuse for just pissing away cycles doing pointless stuff.
Not explicitly stated, but apparent from comparing the suggested approach to the original one, is that you can't always just leave this stuff to the last minute, when you've had the profiler tell you what's a bottleneck and what isn't. The requisite changes might have far-reaching consequences, and so it's worth giving a bit of thought to performance matters when you start and things are a bit more in flux.
A bit of domain knowledge also won't hurt. I bet if this function were ten times worse, but called "PostDeserializeFromDVD", it wouldn't attract very much attention.
> but that's no excuse for just pissing away cycles doing pointless stuff.
Yes there is: maintainable code, and programmer time and effort. What are you worried about, a large power bill due to your CPU doing what it's supposed to be doing?
On an unrelated note, this kind of attitude is the first thing I test for during a programmer interview, and is my strongest case for eliminating potential candidates. I made the mistake once of letting one through - his first task was a relatively straightforward modification to a large C program. A week later I was a bit worried he hadn't reported back as done, so I went to check up on him, and it turned out he was busy changing every line of code to adjust formatting and variable names, not to mention making these kinds of pointless micro-optimizations. And he hadn't even checked in once; he was saving the checkin itself for another multi-day effort. Sigh. I tried using the "premature optimization is the root of all evil" line on him to get my point across (and to see if he had heard of it), and when I saw his eyes flare up in anger, I knew he had to go. Sad really, because he was otherwise quite bright.
Now I basically put C++/game programmer applications in a "special pile" to be considered as a last resort. I just don't need this kind of arrogance and cowboy mentality wrecking the place. It's like sending a bull into a china shop.
If performance is a requirement, it's a requirement, and you need to bear it in mind. And working in games, it usually is. Virtually every project has problems with performance, and dealing with the issues properly at the end of a project can be very hard. By that point, the code is usually insufficiently malleable to be safely transformed in the necessary fashion, and there's a very good chance that you'll introduce new bugs anyway (or cause problems by fixing existing ones).
So, armed with a few simple rules of thumb about what is particularly expensive (let's say: memory accesses, branching, indirect jumps, square root/trig/pow/etc., integer division), and a bit of experience about which parts tend to cause problems that can be rather intrusive to fix (and object culling is one of these), one might reasonably put in a bit of forethought and try to structure things in a way that means they're relatively efficient from the start. Better that than just producing something that's likely to be a bottleneck, but written in a way that means it is never going to be efficient, whatever you do to it, without a ground-up rewrite. And that's the sort of approach the slide deck appears to be advocating.
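As one concrete instance of those rules of thumb (my example, not the slide deck's): a range check doesn't need the square root at all if you compare squared distances.

```c
/* Range test without sqrtf: compare squared distances instead.
   Assumes r is non-negative. */
int within_range(float dx, float dy, float r) {
    return dx * dx + dy * dy <= r * r;
}
```

Structuring the check this way from the start costs nothing in readability and removes a comparatively expensive call from what may become a hot loop.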
Seems uncontroversial, to my mind. Some (and I'd agree) might even just call this planning ahead. Seems that when you plan ahead by drawing up diagrams of classes and objects, because you've been burned by having people just diving in and coming up with a big ball of spaghetti, that's good planning ahead. But when you plan ahead by trying to ensure that the code executes in an efficient manner, because you've been burned by having people come up with slothful code that wastes half its execution time and requires redoing because of that, that's premature optimisation, and a massive waste of time.
As with any time you make plans for the future, sometimes you get it wrong. Ars longa vita brevis, and all that.
> Variable shift = 11 cycle latency on PS3/Xbox360
That's not the issue here. It doesn't matter if the variable shift has to get its results by carrier pigeon from a monk in Tennessee, because the variable shift is eliminated by the compiler. When the programmer is manually doing work that has already been automated, the process is inefficient.
I tried some super-simple test code and ran it through the PS3 compilers... but was unable to get either to turn the variable-width shift into a fixed-width one. With the right flag included, gcc was even good enough to warn me about the microcoded instruction.
I also tried gcc (OS X, x64) and llvm (OS X, x64/ARM) and they didn't do the transformation either. (I'm not sure I would expect this for ARM anyway, but for x64 the variable shift code looked a bit more involved than I was expecting. Perhaps a transformation into a fixed-width shift would be beneficial in that case as well.)
Compile options were "-O3" for SNC (this seems to be about your lot as far as options go) and "-O6 -fstrength-reduce" for gcc (obviously I could spend all day fiddling with all the possible options, but I won't, which I admit is a flaw in my experiment - but I believe SNC is supposed to produce much better code anyway). And in both cases, the code for `f` included a variable shift, and the code for `f2` didn't.
Still, I would stand by my maxim even if the data in this case were against me. It's the winning strategy. Why rely on the compiler's ability to second-guess your intentions, when you could just tell it what you want, straight up?
While I'm normally of the school of thought that you should let the compiler do the optimization, modern compilers often miss what would seem like rather trivial optimizations, often due to assumptions that the language spec won't let the compiler make but that the programmer otherwise can.
I'm not sure there's a point in testing on x64, because I'm not sure that changing to a fixed shift is actually an optimization there. I think the variable shift was slow back in the Pentium days, but that's really a thing of the past.
I'm honestly surprised that the variable shift didn't get converted. This strength reduction will happen if you use a multiply instead of a shift: the multiply will get reduced to an addition. I had assumed that the same would hold for turning variable-width shifts into fixed-width ones.
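To make the multiply case concrete (a hedged sketch; the names are mine): the induction-variable multiply in the first version is what compilers reduce to the running addition written out by hand in the second.

```c
/* Multiply form: i * stride is recomputed each iteration; compilers
   strength-reduce it to the running addition in the second version. */
long sum_strided(const int *a, int stride, int n) {
    long sum = 0;
    for (int i = 0; i < n; i++)
        sum += a[i * stride];
    return sum;
}

/* Hand-reduced form: the multiply becomes an addition per iteration. */
long sum_strided_reduced(const int *a, int stride, int n) {
    long sum = 0;
    for (int i = 0, off = 0; i < n; i++, off += stride)
        sum += a[off];
    return sum;
}
```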
I would file this as a bug against the compilers in question. I don't think it would take very long to fix.
Keep in mind that for games you are often targeting three different consoles plus PC. You may also port to other platforms eventually, and won't be able to easily track down all these lines of code at a later date to guarantee they optimize down correctly on your new target.
Since I actually happen to know Mike, I can guarantee you he profiled. He might not share the results in the slides, because they were about making a point and getting laughs in the process, not production code.
> This is one of those things that compilers have been optimizing for years and are very good at: strength reduction wrt variables that depend on the loop variable.
Many compilers are good at it. In many instances. But if you do multi-platform development, with entirely different compilers for each platform, doing manual strength reduction is a good investment of your time if performance really matters.
> and rather than doing any maintenance you just start a new game from scratch
Thankfully, that model is fading. It was a result of rapidly changing architectures: often, at the beginning of the next console cycle, you had to rewrite anyway - the hardware was so different that what was previously fast was suddenly a disaster.
Given the fact that hardware is moving closer and closer to being bog-standard, and that code bases are large enough that a rewrite is actually insanity, not a couple of bored weekends, this mentality is fading out.
I think the points that you bring up are mostly true for well established and common platforms and architectures, but the author, Mike Acton, focuses mostly on the PS3, where it is more conceivable that a compiler would not e.g. optimize a division operation to a shift, especially for an SPU.
I don't think it's fair to generalize about game programmers spreading premature optimization nonsense. Game development certainly has much stricter requirements on performance than, say, Joe's Ruby on Rails website, and so I would be much more inclined to trust them for optimization tips. That being said, though, yes, a lot of them have no idea what they are talking about.
I have been on both sides of the fence. I have worked with people who attempt to optimize every single instruction, and with people who gladly misinterpret "premature optimization is the root of all evil." Both of them are obviously toxic in their own ways.
Be careful: "x / 16" won't get converted to "x >> 4" by the compiler if x is a signed int. I lost an argument about this in a code review once. Look at the disassembly for the case that x is and is not signed if you don't believe me.
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char *argv[]) {
    int x = strtol(argv[1], NULL, 10);
    printf("result is %d\n", x / 16);
}
The resulting assembly contained no division instructions, nor a call to a divide routine, but there is one instruction that does an arithmetic right shift by 4. Changing x to unsigned changed the instruction to a logical right shift by 4.
It first generates an offset value in ESI: 15 if the value to divide was negative, or 0 if it was positive.
Then it adds this to the original value. This is the cunning part - well I think so anyway. For negative x, an arithmetic right shift rounds toward negative infinity, whereas C's division truncates toward zero, so a bare x>>4 would come out one too low whenever a negative x isn't an exact multiple of 16. Adding 15 first pushes such values into the next 16-block, so the shift lands on the truncated quotient (e.g. for -15<=x<0, x+15 is in [0,14], so the shift correctly gives zero). For non-negative values and for exact multiples of 16, the offset doesn't change which block x is in, so the result is unaffected.
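Written back out as C, the sequence looks roughly like this (my reconstruction; it assumes 32-bit int and an arithmetic right shift on signed values, which is implementation-defined in C but what the targets discussed here do):

```c
/* Signed division by 16, spelled the way the compiler emits it:
   bias negative values by 15 so the arithmetic shift truncates
   toward zero the way C's '/' does.
   Assumes 32-bit int and arithmetic right shift of signed values. */
int div16(int x) {
    int bias = (x >> 31) & 15;  /* 15 if x < 0, else 0 */
    return (x + bias) >> 4;
}
```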
(Don't think I've ever seen VC++ generate anything like this, but I'd pretty much always have an explicit shift rather than a divide by 16 so maybe it does and I'd just never noticed. The choice of 16 usually comes from having some 4-bit field somewhere, and when dealing with bitfields you're best off using bitwise instructions.)
Strangely, I can't reproduce the effect with a divide by 16. With GCC 4.5.2 -O2, I see a "shrl $4, %eax" for unsigned, and "sarl $4, %eax" for signed. However, if I divide by 2, the results are different.
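For reference, the signed divide-by-2 case typically comes out like this when written back as C (my sketch, with the same caveat about arithmetic shifts of signed values being implementation-defined):

```c
/* Signed division by 2 as compilers usually emit it: add the sign
   bit before shifting so that -1/2 gives 0 rather than -1.
   Assumes 32-bit int and arithmetic right shift of signed values. */
int div2(int x) {
    int bias = (int)((unsigned)x >> 31);  /* 1 if x < 0, else 0 */
    return (x + bias) >> 1;
}
```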
> mostly inexperienced when it comes to writing maintainable code
Mostly. Just like the average guys in any other industry.
But the game industry continuously brings back many interesting technologies, especially performance-related ones: LuaJIT, rapidly becoming mainstream and widely used; GPU-assisted computation, used even in PostgreSQL - https://wiki.postgresql.org/wiki/PGStrom - and so much more.