C++ for Games: Performance, Allocations and Data Locality (ithare.com)
171 points by ingve on May 30, 2016 | 64 comments

Mike Acton's "Data Oriented Design" talk from CppCon is great


This man is really my hero.

I cheered when he said "if you want to talk about premature optimization, you get out now".

Thanks for linking this talk. I've been wanting to get more into modern C++ after spending the last ten years in backend webapps in Perl/Python/PHP/Ruby.

What you'll see in this talk is hardly modern C++. Mike Acton even said that they only use C++ because it's the game-industry standard, but that he would prefer C. If you want to start with modern C++, watch this: CppCon 2015: Bjarne Stroustrup “Writing Good C++14”


No, it's not about modern C++. Please don't follow his tips for regular C++ programming. His approach is "C with classes", which is common in extreme game dev (e.g. Carmack's engines).

Modern C++ practitioners care about security and best practices to avoid data corruption; you'll hardly find anything like that in talks that use C++ as a "C with classes" subset.

> use “++it” instead of “it++”

Has anyone actually benchmarked this recently? It's such a trivial optimization for a modern compiler.

I have trouble believing that they'll produce different code when the return value isn't used.

I remember 15 years ago people were saying the same thing! (not to bother with this micro-optimisation because the compiler will do it anyway).

I still often write "++it" out of habit, but usually I'll just go with whichever variant conveys my intent better and leave this optimisation to the compiler.


It's also not always true that "++it" is faster than "it++", as this quote from Agner Fog's C++ optimisation guide notes:

    For example, x = array[i++] is more efficient than
    x = array[++i] because in the latter case, the
    calculation of the address of the array element has
    to wait for the new value of i which will delay the
    availability of x for approximately two clock cycles.
    Obviously, the initial value of i must be adjusted
    if you change pre-increment to post-increment.

The example you gave is actually different code, so the two are not directly comparable. With i = 0, x = array[i++] will access the first element, while x = array[++i] will access the second element. So of course in the second case the array load has to wait for the incremented value.

Edit: The same code with a preincrement notation would be x = array[i]; ++i;

The cited statement accounts for this:

    > Obviously, the initial value of i must be adjusted
    > if you change pre-increment to post-increment.

If you are using a simple type (int, uint, ...), it could be optimized away by the compiler.

But if i is an iterator, you will create an unnecessary copy. Maybe special treatment can be done for STL types. But in other cases, expecting ++i's behavior when explicitly writing i++ would surprise the user.
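
For illustration, the canonical pair of overloads looks something like this (a sketch of a hypothetical iterator; real ones vary):

    struct Iter {
        int* p;
        Iter& operator++() { ++p; return *this; }   // prefix: increments in place, no copy
        Iter operator++(int) {                      // postfix: must return the old value
            Iter tmp = *this;                       // ...so it makes a copy first
            ++p;
            return tmp;
        }
        int& operator*() const { return *p; }
        bool operator!=(const Iter& other) const { return p != other.p; }
    };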

But since the copy won't be used, the compiler should omit it even for class types, since it can assume the copy constructor has no side-effects (if the copy constructor is not trivial in the first place).

I don't see why the compiler is allowed to assume that. For example, the class might collect statistics on #allocations and log them (e.g. for debugging purposes), or the iterator might create new objects (e.g. when it is iterating over a (potentially) infinite sequence that gets created in a database on the fly).
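
A contrived sketch of such a type (names hypothetical) whose copy constructor has an observable side effect:

    struct LoggingIter {
        static int copies;                        // observable statistic
        int* p;
        explicit LoggingIter(int* q) : p(q) {}
        LoggingIter(const LoggingIter& o) : p(o.p) { ++copies; }  // side effect
        LoggingIter& operator++() { ++p; return *this; }
        LoggingIter operator++(int) { LoggingIter tmp = *this; ++p; return tmp; }
    };
    int LoggingIter::copies = 0;
    // After a loop that uses it++, 'copies' must account for every postfix
    // copy, so the compiler cannot silently drop them.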

I think the compiler should prove that constructing and destroying the object doesn't have side effects before skipping those steps.

If you want your iterator to be fast, its implementation will be in a header file, at which point the C++ compiler can inline and omit the copy entirely.

Of course, but that isn't making an assumption, as the OP said the compiler might do.


I used to wonder the same thing, but C++ is just weird in that way.

That's only partially related to the topic at hand. Even in cases where copy elision takes place, an object is still constructed.

However, as was said before, any performant iterator ought to be a header-only implementation, and if it has no side effects, I don't see why a compiler couldn't do away with the copy altogether.

Personally I always use the prefix increment (unless the other behaviour is needed).

They compile down to the same code on a simple example in release.

Post increment: https://godbolt.org/g/k5rcPA

Pre increment: https://godbolt.org/g/WgfMzU
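
(Presumably something along these lines; a sketch, not the exact snippet from the links:)

    #include <vector>

    int sum(const std::vector<int>& v) {
        int s = 0;
        for (auto it = v.begin(); it != v.end(); ++it)  // or it++ in the other version
            s += *it;
        return s;
    }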

For std::vector, they should compile to the same thing; the iterator is just a pointer into a flat array. For std::map (i.e. a red-black tree) or something more complex, it might make a bigger difference. But even then, the iterator is quite lightweight (just a pointer), so copying it is not a big deal.

Even for std::map gcc generates identical code:



godbolt is a great resource to see just how good modern compilers are.

I wouldn't say "even" -- that's just another type where the iterator is a trivially copyable/destructible object.

Both are equally clear, and equally easy to type, so unless “++it” comes with a drawback, I see nothing wrong with the suggestion. If the compiler is smart it doesn't matter, if it isn't it will help.

Do you have anything in particular against this suggestion?

It just sets off my BS detector. Some people will think it's OK to lie to get others to use their preferred coding style.

But it's not BS. If you could use one or the other, so this isn't "a[i++]" or whatever, you should always use the prefix version, particularly if you're operating on a non-primitive type.

Assuming the two are written sensibly, and you don't have a Sleep(1000000) in your prefix version or something, the best case is that the two perform the same. The worst case is that the compiler can't spot that the result of the postfix version can be ignored, and the postfix version is then a bit less efficient.

I'm not saying this is likely to end up being the bottleneck in your program... just that choosing ++it is +EV.

I suspect lots of people use it because they learned it that way and now it's been blah blah blah years and it's too late to change. But for some reason, they don't want to admit this, so they come up with spurious reasons for changing being the wrong thing. Which is odd, because when the two would be equivalent, force of habit is basically the only good reason to prefer it++ over ++it.

> just that choosing ++it is +EV

Sounds like Pascal's wager.

Pascal's Wager is about assuming the EV is infinite, not that it simply exists.

That's why I said "like". They're alike in that you're choosing the option with the +EV rather than the one the evidence points to.

The error in Pascal's Wager is assuming there are only two possibilities. Once you realize that there's the possibility of a God who, say, punishes all the faithful, the argument falls down.

That's not the case here. There are actually only two alternatives.

The only reason you'd have +EV is if evidence suggested it to be so. You compute EV using evidence, and if you don't have any evidence, you don't have any EV at all.

This suggestion feels dogmatic. Instead, the suggestion should be to use a better compiler.

If you only have the platform's compiler, then you have very little choice. I'd guess it's becoming a smaller problem lately, though; there's little reason to use a custom architecture.

You often don't have that option in game dev.

For the majority of iterators it doesn't make a difference. For something like a std::regex iterator, which holds a shared pointer to the regex object, it can make a big difference. Optimizing that copy away is hard, as it requires intimate knowledge of the iterator's semantics.

Anyway, using ++i is idiomatic C++ today.

Spot on. It seems completely against some of the higher level philosophies of optimization presented in the article.

For example, the 90/10 "rule": you should focus your optimizations on things that matter. If applying the pre-increment operator vs. the post-increment operator makes any substantive difference, that warrants better solutions such as:

* Don't use STL. If you're running into overhead problems here you can almost certainly build faster custom data structures more specific to your usecases.

* Use inline assembly for creating the tightest inner loops. If the overhead for an increment is significant, then that suggests the total number of operations in a loop is small. There may be some gains to be had by optimizing the assembly for the specific architecture and usage constraints.

* While not all folks get to use C++11, you should also be using range-based for loops in general, which makes the number of cases where you see a manual increment on an object very small these days (see the sketch below).
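
For instance, a range-based loop removes the manual increment entirely:

    #include <vector>

    int total(const std::vector<int>& v) {
        int sum = 0;
        for (int x : v)   // no visible iterator, so no ++it vs it++ choice at all
            sum += x;
        return sum;
    }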

The STL is not inherently slow per se; some data structures are just not fast on modern CPUs, so they have to be chosen wisely. Here is a good talk: https://www.youtube.com/watch?v=fHNmRkzxHWs

Nice link! It's not that the algorithms are slow in some Big-O notation; it's that the utilization of hardware resources is typically poor. It's the general problem of balancing a general purpose solution versus an optimized high performance one. When you have more domain knowledge you can do better. A handful of examples:

* An STL vector is typically at least 3 * sizeof(void*), holding the begin, end, and allocated-capacity pointers, to be a general purpose solution [0]. You can trivially trim down space if you assume your memory is contiguous and then store deltas rather than pointers, which can be a nice win with 64-bit pointers.

* A std::function blossoms to 48 bytes to support all the various use cases, from lambdas to functors to function pointers. Using a template for lambda functions can allow inlining and be a win here, or write your own that doesn't support all the use cases.

* The lifetime of a std::string can involve a lot of allocations and deallocations. Fixed size char arrays can be a win.
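
These sizes are implementation-defined; a quick sketch for checking your own toolchain (the values in the comments are what 64-bit libstdc++ typically reports; other implementations, like the 48-byte std::function mentioned above, will differ):

    #include <cstdio>
    #include <functional>
    #include <string>
    #include <vector>

    int main() {
        std::printf("vector<int>:      %zu\n", sizeof(std::vector<int>));      // 24 (3 pointers)
        std::printf("function<void()>: %zu\n", sizeof(std::function<void()>)); // 32 here; up to 64 elsewhere
        std::printf("string:           %zu\n", sizeof(std::string));           // 32 (includes the SSO buffer)
    }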

You can address all these with your own library functions. This is very consistent with the rest of the post's content, and in fact they suggest some of these approaches themselves.

While these don't directly attack some fictitious ++i vs i++ performance bottleneck example, they will improve cache utilization and probably actually improve performance of your application :)

[0] http://info.prelert.com/blog/stl-container-memory-usage

In C++ it does, because they are different function calls.

If they are overloaded.

Does C++ assume that i++ and ++i are the same for overloaded operators?

edit: i.e. the same as they are in the case of primitives

It cannot assume that, especially with binary libraries.

Two separate overloads need to be provided, and the compiler cannot know whether the behavior is the same.

I always use the same approach as in pure OOP and FP languages: assume everything is a method call.

It's to get in the habit for when you've got a hairy custom type for your iterator.

When I was doing console game dev, all the compilers were behind the times. I wouldn't be surprised if there are some compilers that produce different code. I also wouldn't be surprised if there aren't.

- BUT -

Even if it is different, in my 20 years of production C++ and 10 years of console game dev with plenty of profiling, optimizing and assembly inspection, I have never seen a case where optimizing the loop counter would have made a meaningful difference. It is possible, and I'm sure someone has seen it, but I never have, and I've seen a lot of code.

As others mentioned, it's unlikely to produce different code as long as the iterator is internally a simple pointer and the increment operators are inlined.

It could be a bit more costly if, say, you're using a library that does something like added runtime checks for iterator validity. Or maybe you're using an STLish type that has a more complicated iterator.

So, no, it's unlikely that there will be a difference 99% of the time. On the other hand, the pre-increment is never slower so it's a good habit to have anyway... and it's more idiomatic C++ as well.

It should be optimized out. The keyword is "should". It may not be for all iterators (especially custom stuff), and it may not be at all optimization levels (important for debugging).

So it becomes "why depend on maybes?"

I think they have to produce different code by definition, because the two have different meanings (in C++, at least).

In the prefix form (the first one) the expression's result is the incremented value, while in the postfix form, the expression's result is the value before it was incremented.
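
Concretely:

    int i = 5, j = 5;
    int a = ++i;  // prefix:  i becomes 6 first, so a == 6
    int b = j++;  // postfix: b gets the old value 5, then j becomes 6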

However, compilers are so crazy with optimizations these days that I wouldn't be surprised if they can figure out that the code isn't actually relying on the semantics of the postfix form and convert it to prefix form.

I also recommend watching https://www.youtube.com/watch?v=Nsf2_Au6KxU , made by an engineer at Valve.

One comment: the "dragging a slider to the right" math assumes that all players paying $25/month play non-stop for the entire month. In reality it goes in waves. Using the "22.7 hours a week" figure from WoW, that's only about 0.13 server-months per user per month. Cloud computing capitalizes on this by renting per hour to capture the waves. This means a $25/month user only costs about $2.50 in server usage, which seems right.
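
Rough numbers behind that 0.13 figure (assuming a ~730-hour month):

    22.7 h/week * 52 weeks / 12 months ≈ 98 hours of play per user per month
    98 h / 730 h ≈ 0.13 server-months per user per month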

For anyone interested in optimizing C++, I recommend Agner Fog's blog.


EDIT: typo

Does anyone know who's behind this site/these books? There's lots of great material here; I always look forward to an update. And are these books going to be published?

There was a Kickstarter campaign recently for the book which has some background info on who's behind it: https://www.kickstarter.com/projects/341701686/the-most-comp...

> Does anyone know who's behind this site/books?


Not sure how I missed those. Thanks for the link, and for the link to the Kickstarter; I would like to support this book. Cheers.

I found this to be fairly interesting content. I knew most of it, so it was useful as a refresher, and one could really argue it's not game-specific in many ways, though certainly slanted in that direction. Anything with real-time performance considerations, like control/embedded systems programming, can also use the advice. I found Alexandrescu's Modern C++ Design valuable many years ago for similar advice. I used Loki and its allocator and smart-pointer strategies in a lot of code over the years for control over data locality.

I found the next posts/chapters on Marshalling and Encoding even more interesting, with comparisons and pros/cons of different encoding techniques. I was not aware of the FlatBuffers library; I see why the author likes it and may have to look into using it in future projects.

I prefer this explanation from Herb Sutter, a member of the ISO C++ committee: https://youtu.be/TJHgp1ugKGM?t=21m33s

Thanks, very interesting reading. It can be valuable not just for games development but for any case where performance is important.

That applies to most of the information on this website and in the upcoming book. It's labeled as for Game Development, but most things also apply to network programming in general (e.g. implementing RPC systems, message brokers, protocols like HTTP, ...). It's a great resource!

Small question: I don't really see how switching to unordered_map solves the locality problem the author mentioned... unordered_map most likely uses chaining, and even if we were to allocate a contiguous block of memory for each bucket (as opposed to having a linked list), we'd still have a locality issue.

I think C++'s built-in unordered_map has to use chaining due to restrictions on iterators mandated for standard library containers, which is a drag. You've still got to roll your own open-addressed container. :( This comes up in one of Chandler Carruth's talks [1].

[1] https://www.youtube.com/watch?v=fHNmRkzxHWs
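
For reference, the core of a rolled-your-own open-addressed (linear probing) lookup might look like this sketch (hypothetical layout; no deletion or resizing handled):

    #include <cstddef>
    #include <vector>

    struct Slot { int key = 0; int value = 0; bool used = false; };

    // Linear probing keeps all slots in one flat array, so a probe walks
    // adjacent memory instead of chasing per-node pointers as chaining does.
    // Assumes table.size() is a power of two and the table is never full.
    int* find(std::vector<Slot>& table, int key) {
        std::size_t mask = table.size() - 1;
        for (std::size_t i = static_cast<std::size_t>(key) & mask;
             table[i].used;
             i = (i + 1) & mask) {
            if (table[i].key == key) return &table[i].value;
        }
        return nullptr;  // hit an empty slot: key is absent
    }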

Yep, that's exactly what I did (because of the same concerns) in one of my projects, which is why I was initially surprised to see them recommend unordered_map. But, as the other commenter pointed out, unordered_map has way fewer "hops" and thus provides a significant improvement over a search tree in terms of locality.

Collisions should not be common enough to produce chains (or skip patterns) of 20 elements in an (unordered) hash map. It should be more like one jump to the hashed index, and then maybe another jump or two if there is a collision. That's significantly fewer jumps in memory than for a balanced ordered binary tree.

Thanks, that makes sense.

Well, I think you're overthinking it; this is "traversing a tree" at O(log n) vs. a "hash table lookup" at O(1).

But the main thing is that the hash table lookup mostly stays in the same memory area until you try to get the actual value out of it. The author also mentions, as a side note, the problem with them: "(though they do have their drawbacks in the worst-case scenarios, so you need to be careful)".

A tree access is ~log n memory lookups, and every one of them can miss cache. A hash map access is ~1 memory lookup, which can miss cache. Even when you have hash collisions and need to do a linear search in a bucket, it's not going to cause additional cache misses if the bucket is allocated as an array.
