This brings up a thought that I've had for a very long time. Almost every type of optimization that a programmer could employ is repeatable. It involves matching patterns ("Identification" in the context of this article), analysis ("Comprehension"), and rewriting ("Iteration").
All of these steps can be efficiently automated. And it turns out that compiler writers collectively know about the vast majority of these techniques, but refuse to implement most of them for what I would consider to be the ultimate copout ever: compile times. I don't know about you, but I would take a 100x increase in compilation times for a release build over a 2x increase in development time due to manual optimization. I'm not sure who wouldn't, especially if it also allows you to eliminate technical debt, eliminate leaky abstractions, and improve code comprehensibility.
Perhaps I'm being overly idealistic, but I can't help but hope for a day when I can work with a high level language and have the compiler take care of optimizations that range from removing redundant elements from struct definitions all the way down to bitshift optimizations like i * 28 == (i<<4) + (i<<3) + (i<<2). And if I have to wait all day long for a release build of something, so be it.
Insulation from performance concerns also makes your performance model less predictable. Because the ends of optimisations are observable, but their means are not, people resort to “butterfly programming”. As a small example, in ActionScript 3, people used to write their conditionals like this:
if (a) if (b) {
...
}
Rather than this:
if (a && b) {
...
}
Because the dumb compiler generated better code for the former.
In the real world, I’ve seen seemingly benign code changes prevent optimisations from firing—optimisations we didn’t know we were relying on. So people start coding to the implementation, rather than the language or the human, without even being able to see what they’re doing. I really don’t like that.
Optimisers are like garbage collectors—they give you better ergonomics than, and equal average performance to the manual alternative, if you’re willing to relinquish predictability and a non-voodoo mental model of your code’s performance.
> Optimisers are like garbage collectors—they give you better ergonomics than, and equal average performance to the manual alternative, if you’re willing to relinquish predictability and a non-voodoo mental model of your code’s performance.
This is why I'd like to see optimizers become less opaque. It would be great if we could not only suggest optimizations to the compiler (and more sophisticated ones than just inlining), but actually see the process our code went through.
One language that has some interesting things going on in this area is Nim, which lets you write your own domain-specific optimizations[1].
Unfortunately, the nature of optimizations makes tracking these sorts of things really hard. They don't occur in a vacuum - nearly every optimization transformation interacts with a dozen other transforms, sometimes with very surprising (and usually counterproductive) results.
If you simply implement "book" optimizations, you'll find that they often don't even work without some considerable re-engineering from the bottom up. Many book optimizations make the code slower. And then there are the myriad of optimizations that make code faster on one CPU and slower on the next release of that CPU.
If you think that compiler optimizer writers are holding out on you, they aren't. It's a bit of an arcane art, like samurai sword making technique.
Of course. I have no problem with generic, implicit optimisation. However, people often need to optimise further, manually. Without good language support, they start coding to a particular implementation: using less-than-portable intrinsics, or worse, trying to trigger heuristics that produce the optimisations they actually want.
I wonder if it would be practical to expose the optimizations to the programmer and allow the complex ones that are easy to accidentally disable to be invoked explicitly. Your code would look something like:
optimization(fold_constant_vector_lookup_access_frobnitz) {
complex code here
}
And if the compiler can't apply the optimization, it can produce an error rather than slow code.
That works. We already have some things like that, like “__attribute__((always_inline))”. Of course, forcing some optimisations could implicitly prevent others, but that’d also be true if you did it manually.
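For what it's worth, here's a minimal sketch of how far that already goes with GCC/Clang attributes (the function names are invented for illustration); notably, if GCC cannot inline an always_inline function it reports an error rather than silently generating slow code:

    // Sketch: GCC/Clang let you force or forbid one specific optimisation
    // (inlining) per function, and failure to honour always_inline is a
    // hard error instead of quietly slower code.
    static inline __attribute__((always_inline)) int lerp_fixed(int a, int b, int t) {
        return a + ((b - a) * t) / 256;
    }

    __attribute__((noinline)) void log_rare_error(const char *msg);  // keep cold code out of line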
With expressive enough metaprogramming, you could also do some of these explicit optimisations in user code.
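As a rough illustration of that last point (my own sketch, nothing from the article): in C++17, constexpr already lets user code state "compute this table at compile time" directly, instead of hoping the optimiser folds it.

    #include <array>

    // Hypothetical example: bake a small lookup table into the binary at
    // compile time -- one explicit, user-driven optimisation.
    constexpr std::array<int, 256> make_gamma_table() {
        std::array<int, 256> t{};
        for (int i = 0; i < 256; ++i)
            t[i] = (i * i) / 255;          // stand-in for a more expensive curve
        return t;
    }

    constexpr auto kGammaTable = make_gamma_table();   // evaluated by the compiler

    int gamma_lookup(int x) { return kGammaTable[x & 0xFF]; }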
There is such a thing as profile-guided optimization. I don't think its results are particularly impressive. You're basically saying "can't you just", to which the answer is almost always no. No, you can't just.
I've gotten pretty good results from PGO as well as JIT optimizations (which isn't conceptually much different), but even if I hadn't, your comment is still a fallacy on two counts:
1) JITs were slow until they weren't. Poor results from existing experimental PGO compilers do not prove that the concept is flawed.
2) Profiling is a heuristic substitute for a real cost model. The most advanced optimizing compilers forego profiling altogether because they evaluate optimizations against a hardware/architecture cost model. In other words, PGO isn't even necessary to accomplish what I'm talking about.
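To make the distinction concrete, here is a toy sketch (the struct and numbers are entirely invented) of what a cost-model-driven decision looks like, using the i * 28 example from upthread:

    // Toy cost model: decide how to lower x * 28 from per-target latency
    // estimates rather than from profiling runs.
    struct TargetCosts { int mul_latency; int shift_latency; int add_latency; };

    bool prefer_shift_add_for_mul28(const TargetCosts& c) {
        int mul_cost = c.mul_latency;
        // (x<<4) + (x<<3) + (x<<2): the critical path is one shift plus two adds
        int shift_add_cost = c.shift_latency + 2 * c.add_latency;
        return shift_add_cost < mul_cost;   // often roughly a wash on modern CPUs
    }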
And in theory, dependent types are the awesomest thing ever. But they're not ready yet, and there's no guarantee they ever will be. There's a gulf between even having a few focused, carefully-chosen proofs of concept and a practically-useful technology.
(For that matter, I just grabbed that example off the top of my head as the thing I knew that most resembled the various fancy optimizers I've seen over the years, but it seems to me that dependent typing is full of things that would be darned useful for an optimizer. It would be funny if the fastest path towards whole-program optimization went through dependent types being practical.)
> And in theory, dependent types are the awesomest thing ever. But they're not ready yet, and there's no guarantee they ever will be.
I think the reason dependent types haven't made it into practical programming has less to do with them being not ready and more to do with them being fundamentally challenging to learn. I have a pretty strong pure math background and it still took me a significant amount of time to wrap my head around them to the point that I could write non-trivial proofs using them, let alone programs. For programmers who have never done much formal logic or who have never heard of a category, I can't imagine how much time it would take to get the prerequisite knowledge.
Another issue is the fundamental problem of creating a new language; you have to build up a ton of extra machinery (libraries, tooling, etc.) to even think about it becoming more than a toy, even if you don't care about it being mainstream. Rust has the backing of a large open-source company and a ton of really smart people dedicated to bringing it to primetime, but it is still going to take a long time for it to get there. Dependently typed languages like F* and Idris don't have the same manpower behind them and also have to face fundamental challenges like integrating formally verified code with a world full of unverified (but useful) code.
I think dependent types are amazing, but even the tiny subset of dependent types known as refinement types opens up huge avenues for optimization. Some optimizations (loop nest optimizations for a simple example) might only be feasible with static range definitions or with profile guided optimization...but if you throw in refinement types, you now have static guarantees as to the cardinality or range of a variable, and loop nest optimizations can be statically determined. I think full dependent typing could be great, but compilers can still be improved with small language improvements.
They're ready in small forms in Haskell today. The compiler doesn't take a whole lot of advantage of these yet, but you start to see avenues toward it.
Actually, they can, and quite well. The .NET Native compiler does quite a few analyses for "hotness", mostly centered around maintaining cache locality. However, if there is a lot of entropy in the possible code paths, it might not be computationally feasible to analyze the hotness of all of the paths. In those cases, we accept heuristic substitutes, of which both JITs and PGO have shown good results.
> it might not be computationally feasible to analyze the hotness of all of the paths
It has nothing to do with computational feasibility. It has to do with the fact that it depends on the input. In many cases there simply is not enough information at compile time to determine which paths are hotter than others.
For example, error paths are generally very cold, but how is a compiler supposed to know that a path is an error path? They just look like regular conditionals.
> In those cases, we accept heuristic substitutes, of which both JITs and PGO have shown good results.
JIT and PGO aren't heuristic, they are based on measurement. By your logic, if I look at the speedometer in my car that is just a "heuristic" of my speed. It's not a heuristic, it's an empirical measurement.
> It has nothing to do with computational feasibility. It has to do with the fact that it depends on the input. In many cases there simply is not enough information at compile time to determine which paths are hotter than others.
That's a fair point, but still quite a bit overstated. There are a variety of analyses that compilers can and do perform to understand possible input before it ever receives the input. I'm familiar with a handful of probabilistic and set-cardinality estimation techniques used in math optimization pre-solvers (a form of compiler) that do exactly this, and I'm pretty sure both GCC and LLVM already do several similar analyses.
Furthermore, a very large subset of optimizations won't have any path dependence in the optimization that it chooses, and a significant subset after that has choice thresholds that are trivially computable given an accurate cost model. When the former happens, profiling gives no advantage at all, and when the latter happens, it is typically more efficient to branch on that threshold than it is to try to guess at which choice is more efficient by measuring sample input.
> For example, error paths are generally very cold, but how is a compiler supposed to know that a path is an error path? They just look like regular conditionals.
That might be true of some languages, but not so with others. Exceptions are obviously error paths, and several strongly typed languages have error paths encoded in the type system.
> JIT and PGO aren't heuristic, they are based on measurement. By your logic, if I look at the speedometer in my car that is just a "heuristic" of my speed. It's not a heuristic, it's an empirical measurement.
But that's the thing, measurement of your input is a heuristic. If your program execution is heavily influenced by its input, and you optimize using profiling, you are assuming that past input is predictive of future input. That assumption is a heuristic. It might be a really good heuristic, but it still is a heuristic.
> Furthermore, a very large subset of optimizations won't have any path dependence in the optimization that it chooses
Hot code should be: optimized for speed, inlined, unrolled (where it helps), kept local with other hot code, and register-allocated with other hot code.
Cold code should be: optimized for size, not inlined, not unrolled, kept away from hot code, and should never be accommodated in register allocation for any reason (ie. should never reduce availability of registers needed by hot paths).
So I don't agree that optimizations should be performed in a path-independent way.
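For what it's worth, GCC and Clang already expose roughly this split as function attributes; a small sketch (function names invented):

    // cold: optimise for size and place in a separate section away from hot code;
    // hot: optimise aggressively and group with other hot functions.
    __attribute__((cold, noinline)) void report_parse_error(const char *msg);
    __attribute__((hot)) void update_particles(float dt);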
> But that's the thing, measurement of your input is a heuristic.
By that standard, everything is heuristic. The assumption that your cost model matches the actual CPU the code will run on is a heuristic.
If we followed your advice, the differential between worst case and best case performance would be large.
There are good reasons for preferring a less extreme difference in potential performance for general computing with AOT compiled languages. In particular, we should ensure the worst case performance isn't awful. We wouldn't want e.g. an occasional exception getting thrown to act like a DoS attack on a web server.
Nothing I was suggesting would create "awful" performance. Inlining and unrolling are not necessary to get reasonable performance -- just try gcc or Clang with -Os.
For an occasional exception to "act like a DoS attack", the cold path would have to be thousands or millions of times slower than the hot paths. But compiler optimizations don't create anywhere near these kinds of constant factors. Even from -O0 to -O3 is more like a 5-10x difference, maybe 100x in extreme cases. The difference between -Os and -O3 is much more modest, more like 0-50%. Nothing that is going to cause anything remotely resembling a DoS.
>For example, error paths are generally very cold, but how is a compiler supposed to know that a path is an error path? They just look like regular conditionals.
At least for .NET that path is likely throwing an exception. If you don't have a profile, statically assuming those paths are cold is likely to be correct.
That's an interesting point -- my background is more in C and C++-without-exceptions. But in exception-based languages, I can see how the presence of "throw" would be a very useful signal for static analysis.
Profiling comes with its own baggage, nothing's perfect. A lot of times one actually knows the likely/unlikely paths at development time; that's certainly true for error conditions. It makes sense to inform the compiler of those cases using the builtins.
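For readers who haven't used them, a minimal sketch of those builtins (GCC/Clang's __builtin_expect; the parsing function is just an invented example):

    #include <cstddef>

    #define LIKELY(x)   __builtin_expect(!!(x), 1)
    #define UNLIKELY(x) __builtin_expect(!!(x), 0)

    int parse_header(const unsigned char *buf, std::size_t len) {
        if (UNLIKELY(len < 4)) {
            return -1;   // cold error path, laid out off the hot fall-through
        }
        return buf[0] | (buf[1] << 8) | (buf[2] << 16) | (buf[3] << 24);
    }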
Sure I'm not saying profiling is perfect. I was just taking exception to the statement that profiling is nothing but a poor man's substitute for a cost model.
Explicit annotations for hot/cold paths are useful, I agree. I would probably be more excited about them except last time I actually tried them with GCC, they slowed my code down (admittedly this was five years ago or so). But that's just an implementation problem, the concept is sound.
It's not that the results are unimpressive, but the market has dictated faster compile times. When a PGO compilation is an order of magnitude slower, then for the 90% of applications out there where the extra 5-25% improvement amounts to less than 60 seconds saved, you won't care about PGO. You need an application an order of magnitude larger (in run time) to care about PGO. Smaller than Google, bigger than your startup. See xkcd [0][1] for when to care.
For video games you usually have at least Debug, Release, and Retail builds, where Retail turns on Link Time Code Generation (LTCG), which is quite slow but can give a few percent extra perf. If PGO took 10x longer to compile but gave a 20% perf boost, it'd be an absolute no-brainer for game development. Keep in mind that the article shared here was about a video game.
It's possible that PGO is pretty awesome these days and people just don't realize it. Just because I've heard unimpressive results from other devs doesn't mean it's actually bad!
I work on a major AAA title due to be released in a few months - our compile times are 20-25 minutes for a Windows build (Debug/Release/Final), 10-15 minutes for Xbox One/PS4 (consoles link much faster), and for Retail builds, with link time optimization, the build takes anywhere between an hour and an hour and a half. That's on a machine with a 6-core, 12-thread Intel Xeon (we use distributed builds anyway), 64GB of RAM, and the source code sitting on a 1TB SSD. The baking of data takes another 10-30 minutes depending on config. Anything that improves our workflow is very, very, very welcome.
I probably misunderstand, but if this is a critical part in your workflow, why wouldn't you
- put 'data baking' (I presume that means things like packing images into blobs etc?) on a separate server
- put another few computers into your 'compile farm' if you're using distributed builds already anyway
How much of, let's say, an hour is spent on compiling, linking and packaging? Compiling is 'trivially' (for some values of that word) parallelizable.
> Compiling is 'trivially' (for some values of that word) parallelizable.
Really? I've always heard that it's virtually impossible to parallelize. Or do you mean a particular step in the pipeline can be parallelised - like parsing separate source files?
The actual compilation of each "translation unit" (roughly, ".cpp file"), so the preprocessing, lexing, parsing, AST generation as well as the first-pass code generation, can be done in parallel. A project with 1000 .cpp files would (theoretically) need only as much time for that phase as the longest of the 1000 would take (plus time for the overhead of transferring files to build machines).
Of course after that there is global optimization and linking; maybe those take the majority of the time in the GP's case, which is why I was asking. In my experience, even for optimizing release builds, those phases are just a small part though - 25% or so for my projects, as a high-end guesstimate? But as I said, maybe it's very different for others.
Products like Incredibuild do just that.
(of course it's not 'trivial' as in 'I'll set it up over lunch', there are many many details to work out, which is why I said 'for some values of that word' - i.e., the engineer's meaning of 'trivial').
We actually use Incredibuild(and SM-DBS as well). Those times are distributed build times. Compiling on a single machine would take at least an hour.
Also - we do bake data on a separate server too, I just prefer to do it myself because then I have several configs(one small one with a test map, one bigger one with the main story etc), and they all work with my changes.
I would say that if I got latest right now, it would take me at least an hour to start the game for the first time on either console(game server is part of Win64 solution so I need to build that first).
In C++ (and C for that matter) separate source files are separate compilation units, so it is trivial to parallelize. The compilation process produces object code, which generally has to be linked into an executable to be useful, and that is usually not a parallelized task, even though nothing prevents you from writing a parallel linker in theory. Unless LTO (link-time optimization) is enabled, the bulk of the linker's time is in I/O, so making it parallel would not have produced any performance gains until recently. Nowadays we have SSDs that can benefit from deeply queued I/O and, probably, somebody will write a parallel linker eventually.
Thank you for proving my point, gambiting. @forrestthewoods, asset baking is already 50-200% of gambiting's non-Retail build time, and that's before PGO. Ask him: if he had to choose between reducing his build time by 10% (without a game performance penalty), or increasing his build time by 1000% (with a game performance improvement of 15%), I think he'd quickly choose the first option. Debug time/feedback is much more valuable (usually) [0].
If you offered me mildly faster iteration OR a mildly better framerate at 10x the compile time, I'd also choose the former. Iteration speed is king.
But it's not either/or. If you offered me Debug/Release/Retail(LTCG)/SuperRetail(PGO) I'd love to have that fourth option. It doesn't have to come at the expense of any other build configuration. Even if PGO takes 10 hours to compile, it's part of the overnight build. Great! There's no shortage of AAA games that have an overnight 12-hour process to build lightmaps for a single level. And that's using a build farm.
But so far I've never heard a major, or even minor, PGO success story so it's all kinda moot. :)
> All of these steps can be efficiently automated.
Good luck with that. Yes, a compiler will do a good job with i * 28, but the optimization in the article - which is a pretty simple case, chosen to fit in a blog post - requires manipulating the algorithm beyond the capability of any compiler I'm aware of.
We do have good tooling to show what code is slow, for instance, so it's not like automation doesn't help here.
Indeed. Often the slowest code for us is due to cache friendliness more than algorithmic complexity.
For the data to be laid out correctly, the problem needs to be solved way earlier in your toolchain, as well as downstream with a rewrite of any algorithm that touches it.
Please show us some of these promised optimizations. I don't think they actually exist. Most of the larger code / algorithm tweaks require subtle changes to the behaviour. The compiler can't choose to do that by itself, as it could break the program. Programmers need to make these choices for themselves.
Besides which, compilers can already select slow or fast optimizations on the command line, so users can minimize compile time if they wish. There are plenty of people who would give anything for faster code, no matter what the compile time is.
Please prove me wrong and explain a few of these supposed too-slow optimizations.
> Please show us some of these promised optimizations. I don't think they actually exist.
I already gave two examples: "removing redundant elements from struct definitions all the way down to bitshift optimizations like i * 28 == (i<<4) + (i<<3) + (i<<2)". Neither of those optimizations is used by mainstream compilers, although it is possible that LLVM might include some optimizations like the latter due to work from the Souper project.
> Most of the larger code / algorithm tweaks require subtle changes to the behaviour. The compiler can't choose to do that by itself, as it could break the program. Programmers need to make these choices for themselves.
If it introduces a bug, then it is not an optimization, and it has no business being included in an optimizer (at least without explicit flags like -ffast-math).
> Besides which, compilers can already select slow or fast optimizations on the command line, so users can minimize compile time if they wish. There are plenty of people who would give anything for faster code, no matter what the compile time is.
My whole point is that -O3 is still weak sauce. There are tons of compiler optimizations that are excluded from both GCC and LLVM because of fears of compilation time impacts. Take a look at the specific techniques section of this wiki article: https://en.wikipedia.org/wiki/Optimizing_compiler
I can guarantee you that the mainstream C compilers cover maybe half of those, and for lesser known languages far less than half.
gcc will do these kinds of optimizations when they make sense. But for 28, most processors will actually be faster doing the multiply.
Multiply instructions only take a few clock cycles. A sequence of adds and shifts can also be slower because they are a sequence of dependent instructions - you need the result of the first one to calculate the next. (You can do some of the adds in parallel, but you still need to total them up)
So your 'optimization' would take 6 cycles on an ARM 9, while the MUL takes 3.
It's really easy to double-check this. Compile this dumb code with 'gcc -S -O3 test.c':
#include <stdio.h>

int main(int argc, char *argv[])
{
    int i = argc;
    printf("%d\n", i*28);
    return 0;
}
Try changing the 28 to different values and run a diff against each version (e.g. try 8, 9 and 10). On x86, gcc will use a variety of techniques based upon the multiplier.
"removing redundant elements from struct definitions"
When is an element in a struct definition 'redundant'? Let's say I write a struct to a network socket, and my receiver (on the other side of the socket) expects a certain memory layout. But one of the fields isn't used in my code (but maybe it is by another program that uses the same header, like two programs that share a library with common data types).
How would that work? I just don't understand what you're saying.
I don't know what OP meant, but I could see some potential advantage to removing fields from structs at interfaces. Say you have a loop that goes over a million structs, but only accesses one field. Drop the extras, do the loop, then add them back.
Though it would probably take more time to rearrange things than it would to just do the original loop. The point is, you might be able to look at where the struct is used and change its structure when it moves in memory. If a function is a 'consumer' of structs, and doesn't need all the fields, there is no need to copy all the fields.
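A sketch of the manual version of what I mean (my own example, nothing from the article): pull the one hot field out of the big struct ahead of time, the classic array-of-structs to struct-of-arrays move, so the loop only streams the bytes it actually reads.

    #include <cstdint>
    #include <vector>

    struct Particle {                 // dozens of bytes per element...
        float x, y, z;
        float vx, vy, vz;
        std::uint32_t color;
        float age;
        float lifetime;
    };

    float total_age_aos(const std::vector<Particle>& ps) {
        float sum = 0.0f;
        for (const auto& p : ps) sum += p.age;   // ...but this touches every cache line of every Particle
        return sum;
    }

    float total_age_soa(const std::vector<float>& ages) {
        float sum = 0.0f;
        for (float a : ages) sum += a;           // dense array: sequential, prefetch-friendly
        return sum;
    }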
"Say you have a loop that goes over a million structs, but only accesses one field. Drop the extras, do the loop, then add them back."
Why? Why would you copy at all? Are you saying that with small structs you'll be able to fit more of your data into caches? I haven't timed it (anyone have 20 minutes to spare?) but then you have to make a full copy, which is very expensive, not to mention having major implications on memory usage. If my compiler did that under the hood, even when set to 'optimize for speed', I'd be pretty annoyed. The only way to know if such a thing is faster is by looking at each use specifically - at which point we're not talking about a compiler anymore. Well, OK, we'd be talking about "runtime PGO", and approaching JIT territory.
Furthermore, such an 'optimization' would also have to, e.g., detect offsetof() and make sure that that works correctly, and account for packing and people relying on that, and unions, etc etc. I'm not going to spend much time here further on examining and deconstructing this whole concept, especially as I still fail to comprehend the goal.
Those kinds of bitshift optimizations have been basic for 25 years or more. Read Hacker's Delight by Henry Warren for hundreds of pages of this kind of standard compiler technique.
Not the OP, but one I think about a lot is converting functional, monadic code that uses first class functions/closures to equivalent imperative variants.
For example, manipulating collections with functions like map, flatMap, filter, etc. has become common, even in popular imperative languages like C# and Java. These calls can be chained together to create non-strict sequences (IEnumerable, Stream, etc.) which are made strict at a later time. Each call creates a new sequence, and many take closures. Both of these require memory allocation and indirect dispatch.
However, in many cases it is extremely straightforward to convert these into loops. For example, the following C# code:
list.Select(f).Where(p).SelectMany(g).ToList()
Could be turned into:
var outList = new List<T>();
foreach (var x in list) {
    var y = f(x);             // Select(f)
    if (!p(y))                // Where(p)
        continue;
    foreach (var z in g(y)) { // SelectMany(g)
        outList.Add(z);
    }
}
This works even if we know nothing about f, p, and g (e.g. they can be impure functions). This optimization is especially effective if these functions are lambdas. It is also always going to be faster and safe; the only difference is that we removed a series of extra allocations and indirect function calls. This example is admittedly simple, but many functional patterns and behaviors can be rewritten into longer imperative variants that avoid extra allocations.
You are right that you can have subtle changes in behavior with similar high-level optimizations. For example, in most functional languages calls like these actually create a new collection at each step. If any of these functions can throw an exception or cause an effect, then we cannot perform the above rewriting because it will call the functions out of order. But if a compiler can take the time to analyze functions for external purity, these can be optimized as well.
Your example doesn't necessarily bolster this thread's original point, because compiler backends are already sophisticated enough to apply this sort of transformation regularly (e.g. you'll see this sort of code generated in Rust, where closures are stack-allocated, iterators are lazy and single-pass, and the resulting loops can even be vectorized).
You're absolutely right, I chose a simple example because it's one I expect most people have seen in one language or another. A better one would be rewriting a combinator based parser (e.g. one written in Parsec) into an imperative one.
Don't just cave in and blindly believe that the compiler gods of old have done well.
Demand profile logs, demand numbers, demand evidence.
Compilers do the most utterly unexpected nonsense and forget the most trivial optimisations.
Take your code to
https://gcc.godbolt.org/
and read what your "compiler-knows-best" actually does with it.
Yes, it often doesn't matter in the end, because even bad code ends up waiting on slower memory, but still...
Could you give an example of 'forgetting the most trivial optimizations' please? It would be a bug in a compiler and many people would be interested in fixing it.
I think that the Identification and Iteration stages can be automated, but the Comprehension stage is still a while off. No compiler is going to understand more about the context of the program than a human. For example, knowing that a particle system is transparent enough (or somehow insignificant enough) that it doesn't need to be rendered in the shadow pass is a difficult thing to automatically optimise for.
Compilers are generic beasts, they need to work accurately for all valid combinations of code and data (which is a complex problem in itself) and optimisation is another layer of complexity on top of that. Current compiler optimisations work on predominantly local data and code, as that is all the state that the compiler can guarantee is accurate. If a function called from a parent is optimised for that parent, then it is conceivable that this same optimisation could be suboptimal when called from another parent (especially if the compiler was able to modify data layout).
The other issue is iteration time. This is a crucial part of software development - lowering iteration time boosts productivity immensely. If, as a programmer, you no longer care about performance due to a compiler that can optimise your code to make it run twice as fast but the compile time is hours, you are rarely going to run the optimised build. And you are going to end up hand optimising the debug/test builds yourself to make them run fast enough.
I do think that we could have better tools to help us understand where our code bottlenecks could be - I would love a plugin that somehow coloured my code's variables by cache locality.
There is a Scheme compiler called Stalin[1] which apparently does some very sophisticated full-program and lifetime analysis. This of course leads to very long compile times, but the code is also very fast compared to other Schemes, and even C in some cases, from the benchmarks I've seen[2][3] etc. You can find more benchmarks for it, and similar compilers like MLton.
Of course, I can't comment on whether or not it is actually usable in a real-world environment, just that the techniques do exist.
There was a HN thread a day or two ago in which someone suggested the idea of a common, open source KV database where compiler optimizations could be stored and curated by a committee. So whenever a compiler recognizes a pattern, it could quickly look up the latest and greatest optimization.
> I would take 100x increase in compilation times for a release build over a 2x increase in development time due to manual optimization.
If you're referring to super-optimization, it can take far more than 100x and isn't something that is being "classically automated" as humans likely can't do it in the first place. In addition, super-optimization will yield better results if you give it a better starting point. It's nothing at all like the optimization discussed in the blog post. Either way, it can be done against LLVM IR[4] and I'm sure it will creep into clang at some point.
So far as using profiling data to optimize code in the way that humans would: MSVC supports it[1], clang supports it[2] and gcc supports it[3]. Supposedly Microsoft saw a 30% performance increase [with a very specific workload] when they tried POGO on an internal build of SQL Server. Under normal circumstances, PGO should net you around 7% free perf.
You're assuming that the hard part is finding optimizable patterns and transforming them, and not deciding whether applying the transformation will actually be an optimization.
The analysis part mostly is about legality of the transformation. Humans have a larger view of the program and change code to behave slightly different but way faster. A compiler can't change the behavior of the program.
If you don't regularly run your program with the same optimizations you're using on the final product, you haven't tested it. And if your automatic optimizations are so slow that you can't just re-run them all the time, you can't tune what you're sending into them in the first place.
That's for classic compilation, but these days all the good stuff is in JIT on a mobile device, and there your optimizer really can't afford to waste your time.
It is? iOS has always used ahead-of-time compilation and Android switched semi-recently from the JIT-based Dalvik to the AOT Android Runtime. The only important JIT left is in the browser...
Maybe for some cases, but sometimes (IMO) a compiler is just not going to know. For example, if instead of hitting the network on initialization for some information we delay load that later (say, for example, the MOTD), a compiler is never going to know it could do that.
Many compilers will iterate the so-called classical optimizations until they produce no effect (an iteration where no optimization took place) or until some artificial limit is reached (say 4 or 8 iterations through a basic block or a function). At some point, you risk several accidental de-optimizations such as inducing cache thrashing, cache spilling, architectural limitations, or even optimization clashing, where one optimization de-optimizes a previous optimization on one iteration that then gets re-optimized in the following iteration (A->Opt B->"De-"Opt A->Opt B->...). Another poster suggested profile-directed optimization, and that usually is even more effective than naively iterating through optimization steps.
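In pseudo-C++, the fixed-point driver described above looks something like this (the Function and pass types are stand-ins, not any real compiler's API):

    struct Function;                           // stand-in for a compiler's IR container
    bool run_classical_passes(Function& f);    // hypothetical: true if any pass changed the IR

    void optimise(Function& f) {
        const int kMaxIterations = 8;          // the artificial limit mentioned above
        for (int i = 0; i < kMaxIterations; ++i) {
            if (!run_classical_passes(f))      // fixed point: an iteration with no effect
                break;
        }
    }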
>Perhaps I'm being overly idealistic, but I can't help but hope for a day that I can work with a high level language and have the compiler take care of optimizations
I believe optimizing compilers are already a thing. Am I missing something here?
Currently for game development optimizing compilers are useful but hardly free you from worrying about optimizations.
At least when you are using C++:
a) Optimizing compilers can't organize your data to be cache friendly.
b) optimizing compilers can't re-organize your code to prefer sequential access to that data so the HW prefetch can prevent cache misses
c) optimizing compilers can't separate out the parallelizable parts of an algorithm and push those into threaded jobs
d) optimizing compilers can't find most cases of work you are doing that doesn't need to be done. They find some trivial/local cases of this but not any of the deep or difficult cases.
e) compilers can't rewrite your code or data to not need features or to use simpler features that can be optimized
f) compilers can't generate caches (as in the simple example in the article) and find and handle all of the cases where they need to be updated.
If you are working on an application where performance is one of the most important features (like many games) you will find yourself working on performance problems like these often. In even higher level languages compilers have more flexibility around some of the fundamental constraints in the C++ world, but there are still few cases where those compilers produce faster code than humans do with C++. Of course the optimized C++ usually requires vastly more effort and returns to that effort are diminishing over time with faster hardware.
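As a sketch of point (f) in particular (my own toy example, not the article's code): a hand-written cache of an expensive, pure computation. A compiler can't invent this, because only the programmer knows when the inputs change and the cache must be invalidated.

    #include <cstdint>
    #include <unordered_map>

    struct CurveCache {
        std::unordered_map<std::uint32_t, float> values;    // key: quantised curve parameter

        float evaluate(std::uint32_t key) {
            auto it = values.find(key);
            if (it != values.end()) return it->second;       // hit: skip the expensive work
            float v = expensive_curve_eval(key);
            values.emplace(key, v);
            return v;
        }

        void invalidate() { values.clear(); }                // only the programmer knows when to call this

        static float expensive_curve_eval(std::uint32_t key) {
            float t = float(key) * 0.001f;                   // stand-in for a genuinely expensive evaluation
            return t * t * (3.0f - 2.0f * t);
        }
    };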
This is a very important caveat! It probably doesn't mean much for the pragmatic programmer today, and probably tomorrow, but there are research languages and compilers that try to tackle some of these issues. I myself am working on a compiler for a functional language, that already solves (a) and (b) in some cases, and (c) is doable by using an inherently parallel language. Issues (d)-(f) are about deeper algorithmic changes, and beyond even the frontiers of current research (although you can probably do (d) with a supercompiler if you have enough time).
I love that V8 (Node.js/Chromium, JavaScript) does JIT compiler optimizations. It means I can build for maintainability and still get the same performance as if I had built for speed. Ex: array vs object literal, or array vs object properties.
I still do higher level optimizations though, like caching and tweaking formulas (less computations), when needed.
There was an interesting paper linked on here about automatic optimization of Photoshop filters via machine learning. I think the novel part was that it operated on binary code, not source. Can't find the link at the moment, but searching for it yielded a bunch of similar papers about machine learning-aided optimization.
Dynamic optimization of binary code is 25 years old or more. The first widely deployed JITs were for running binary executables on different architectures, such as the Mac 68k -> PowerPC transition. https://en.wikipedia.org/wiki/Binary_translation#Dynamic_bin...
That was highly specific for the type of operation they were looking at. Still spiffy, but not something that can be folded into a general-purpose compiler.
Hey everyone,
I'm the author of this article and I'm glad you've found it interesting. I'll be keeping an eye on this thread, so if you have any questions or comments I'll address them as soon as I can. I can already see some awesome questions here - looking forward to the discussion.
Because we are interpolating between keys which aren't evenly distributed in time. What you have there is pretty much what we do when looking up the precomputed values from the table.
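For readers who haven't seen the technique, here is a rough, hypothetical sketch of that kind of precomputed table (my own code, not Riot's): resample the unevenly spaced keyframes into an evenly spaced table once, so the per-particle lookup becomes an index plus a lerp instead of a keyframe search.

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    struct Key { float time; float value; };   // time assumed normalised to [0, 1]

    // Build once, at load time: slow keyframe search per table slot.
    std::vector<float> build_table(const std::vector<Key>& keys, std::size_t n) {
        std::vector<float> table(n, 0.0f);
        if (keys.empty() || n == 0) return table;
        for (std::size_t i = 0; i < n; ++i) {
            float t = (n > 1) ? float(i) / float(n - 1) : 0.0f;
            std::size_t k = 0;
            while (k + 1 < keys.size() && keys[k + 1].time < t) ++k;
            const Key& a = keys[k];
            const Key& b = keys[std::min(k + 1, keys.size() - 1)];
            float span = b.time - a.time;
            float f = (span > 0.0f) ? std::clamp((t - a.time) / span, 0.0f, 1.0f) : 0.0f;
            table[i] = a.value + (b.value - a.value) * f;
        }
        return table;
    }

    // Evaluate per particle, per frame: constant time, no search (table must be non-empty).
    float sample(const std::vector<float>& table, float t) {
        float x = std::clamp(t, 0.0f, 1.0f) * float(table.size() - 1);
        std::size_t i = std::size_t(x);
        std::size_t j = std::min(i + 1, table.size() - 1);
        float f = x - float(i);
        return table[i] + (table[j] - table[i]) * f;
    }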
What's the value gained from using vTune vs xperf sampled profiling?
I use xperf and friends a lot and find that they're pretty good, but if vTune offers something substantially better I wouldn't mind taking a look at it.
Also - interesting that you use the Chrome tools to visualize the graphs. I use WPA to view performance graphs and the breakdowns are somewhat similar. I think the regions of interest files can get you the rest of the way.
Thanks for the writeup. It's interesting to see how other people tackle perf analysis.
Xperf or WPA is an excellent tool for holistically analysing your application (and all the other applications running on your machine at the same time). We do use that tool for looking at lots of different things: file IO, thread contention, server performance, etc. But for a single client running, I find VTune to be excellent. It's very well integrated into Visual Studio, and provides a number of different perf experiments that you can run to isolate the causes of your bottlenecks. VTune is commercial, but you can use Very Sleepy as a free alternative.
Nice article! I totally agree with the iterated approach to optimization (and refactoring).
Since I am casually playing LoL I hope you will make it better and better ;)
Talking about optimization, I always wonder why some people have those veeery long loading times. Actually I wonder what is loaded at all. Effects or something? Of course I have no idea of the insides of the game but looking from the outside the map and models should be easily cached.
It would be great to hear more about those loading challenges!
Are the data sources for Waffles and the Chrome visualizer different? It sounds like Waffles is real-time and the Chrome visualizer is not real-time. Do they use the same macros and route the buffer differently? Or do they have entirely different systems for gathering the profile data?
Waffles is more than just a profiler. It is a nice, high level interface into our (non-public) debugging API. You are correct that the profiling info there is real-time, while Chrome is post. They all use the same buffers, but just interpret the data a little differently.
There is also other profile info which is gathered by Waffles, stuff like number of visible particles, texture calls, GPU cost per emitter, amongst others, and that information is gathered through another interface.
>In our case we output the profile buffer to a file and read that into the visualization tool which is conveniently built into Chrome. (You can find more information about the tracing tool here and you can try it out by typing “chrome://tracing/” into your Chrome browser. It is designed for web page profiling, but the format of the input profile data is a simple json format that can be easily constructed from your own data.)
Clever! Never heard of doing that but it makes sense.
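For anyone who wants to try it, a minimal sketch of dumping your own profile data in the Trace Event JSON format that chrome://tracing loads (using the "X" complete-event form; timestamps and durations are in microseconds, and the event names here are invented):

    #include <cstdio>

    void write_trace(const char* path) {
        std::FILE* f = std::fopen(path, "w");
        if (!f) return;
        std::fprintf(f, "[\n");
        std::fprintf(f, "  {\"name\":\"ParticleUpdate\",\"cat\":\"sim\",\"ph\":\"X\","
                        "\"ts\":0,\"dur\":2300,\"pid\":1,\"tid\":1},\n");
        std::fprintf(f, "  {\"name\":\"RenderSubmit\",\"cat\":\"gfx\",\"ph\":\"X\","
                        "\"ts\":2300,\"dur\":1100,\"pid\":1,\"tid\":1}\n");
        std::fprintf(f, "]\n");
        std::fclose(f);
    }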
I work with Tony, the author of the article, and can answer (or find someone to answer) any League questions you might have. Tony will most likely be online later in the day, as he works remotely with us from Australia.
One interesting thought for me is, do you have areas of code where you suspect a good optimization exists but are unable to find it or justify the presumably exhaustive time to uncover it?
Could the order of the condition statements in the machine code (e.g. the case-handling code in the particle system) be rearranged according to in-game heat counting?
You could, but I doubt it would have much of an effect on performance. Modern CPUs are ridiculously good at looking ahead and executing paths pre-emptively, meaning that branching is far less of an issue. On the older consoles, branching was a big issue, so that would have helped there.
This is an interesting article, but Riot hasn't really earned any trust of mine at all as it comes to code quality; for a while they couldn't show the damage output of a spell because it caused the user's other summoner spell to go on cooldown for 15 minutes[1] when they tried.
"Its defects do not reflect issues with our code base (though there are many), but rather my own hacky implementation."
Every code base has tech debt. You can't predict the requirements of the future, and even if you could, trying to account for them means you never release your product.
One large storyline in the ongoing LoL world championships revolves around a major, longstanding bug: the 2nd most popular competitive jungler had to be disabled halfway through the tournament because one of his abilities suddenly became useless. This has had a large impact on competitive strategy, forcing teams to rethink strats on the fly.
Tech debt isn't just salty players whining - it has a major impact on LoL's biggest stage.
Not to mention the ward in the baron pit during (CLG's?) group stage game that their pink ward inexplicably could not see -- the game had to be stopped so that the officials could tell CLG that the ward could still see them.
This is one of many such issues -- one of the largest complaints about the game from the community is the stability of the game and the frequency with which 'gamebreaking' bugs occur.
e: Addressing your edit:
> Every code base has tech debt. You can't predict the requirements of the future, and even if you could, trying to account for them means you never release your product.
League is a mature game at this point, they have had a lot of time to fix longstanding issues. This isn't 'release day' woes.
A rewrite would murder them. I'm not going to link that article everyone links about "never rewriting your code", because I don't think it's got the argument right, but I strongly feel that if they tried to rewrite a system as complex as theirs, they would either a) never finish or b) stop developing the game and, by the time they finished their rewrite, have no customers left.
I came to a deeper understanding of this problem earlier this year.
A team that 'made a mess' may think they know all the mistakes they have made in the code, but they have never proven it, and they haven't been practicing doing things 'the right way'. They haven't even learned how to push back on bad strategic decisions that made things worse.
How do you know that the rewrite will be profoundly better than the current version? I think you have rather a lot of evidence that it won't.
Take a team that cleans up their messes as they go. They know whether their new ideas are better or worse. They have proven they 'deserve' a better code base by building one. By the time they know for sure what's really wrong with the code, they no longer need a complete rewrite. They need a partial rewrite that they can string out over the course of a couple years.
If any of that is true, then the people who want a do over won't take advantage of it, and those that would benefit most wouldn't feel comfortable doing it.
They have the most popular video game of all time and you think that a poorly written client is killing them? It has had some problems over the years that are annoying to users (and frankly embarrassing when compared to DOTA 2), but at the end of the day the quality of the client is insignificant to most players compared to the fun-factor of the game itself, as proven by their unprecedented success.
I disagree. They seem to be doing fine, and their client seems to work fine and not detract from the game at all.
I play League a lot. There are bugs, sure, but altogether the game feels very polished. Much more so than many other games I have played. [There were two high-impact bugs in the current tournament that go counter to this, but they've been addressed already.]
The 'hard way' you espouse is ludicrous and far from the best way. It's not even an option. (Of course this is just my hunch, but it's a well-founded one)
[The ways that league does not feel polished are not related to code quality. They're much more around the new user experience and the toolset available to hardcore players (replays, sandboxes, among others).]
"The best choice" is a woolly phrase. Best for whom? Stopping all development to totally rewrite the core game engine is best from a technical perspective, but that's not necessarily the optimal solution for player enjoyment -- and would have impact on other things, e.g. the competitive aspect of the game.
Continuing to develop while totally rewriting the core leads to wasted effort and a "running to keep up" effect, though allows for a strong technical foundation, provided you actually finish it. Incrementally rewriting the codebase takes longer, but is easier to do in flight, and easy to test in a modular fashion -- and that's already happening.
Video game programmers have to learn more and make less. I'll give them the benefit of the doubt, especially since I can barely understand this particular blog post.
Seems they don't want anyone using PIA to access their blog... Tried reconnecting and got denied again, had to switch endpoint country to actually get access.
> Error 1008
Access denied.
The owner of this website (engineering.riotgames.com) has banned your IP address (108.61.57.217).
> Error 1008
Access denied.
The owner of this website (engineering.riotgames.com) has banned your IP address (108.61.13.45).