So it's an issue of the optimizer; as is often the case, it unrolls too aggressively and shoots itself in the foot, all the while missing out on various other opportunities.
In my experience, loop unrolling should basically never be done except in extremely degenerate cases; I remember not long ago someone I know who also optimises Asm remarking "it should've died along with the RISC fad". The original goal was to reduce per-iteration overhead associated with checking for end-of-loop, but any superscalar/OoO/speculative processor can "execute past" those instructions anyway; all that unrolling will do is bloat the code and work against caching. Memory bandwidth is often the bottleneck, not the core.
> In my experience, loop unrolling should basically never be done except in extremely degenerate cases
Not true. Like many such optimizations, loop unrolling can be useful because it makes downstream loads constant.
For example:
```
float identity[4][4];
for (unsigned y = 0; y < 4; y++)
    for (unsigned x = 0; x < 4; x++)
        identity[y][x] = y == x ? 1 : 0;
// ... do some matrix math ...
```
In this case, the compiler probably wants to unroll the loops so that it can straightforwardly forward the constant matrix entries directly to the matrix arithmetic. It'll likely be able to eliminate lots of operations that way.
(You might ask "who would write this code?" As Schemers say: "macros do.")
> (You might ask "who would write this code?" As Schemers say: "macros do.")
To expand on this point - in the more prosaic world of C++ - this sort of code comes about all the time in templated code. For example, the above loop you posted might have been found in something like:
```
template <unsigned N, unsigned M>
class Matrix {
public:
    static Matrix Identity() { ... }
};
auto m = Matrix<4, 4>::Identity();
```
The other major source of these sorts of constants leading to DCE opportunities is inlining. Consider a more classical matrix implementation that is not templated and doesn't lift its dimensions into the type:
```
class Matrix {
    unsigned n;
    unsigned m;
public:
    static Matrix Identity(unsigned n, unsigned m) { ... }
};
// Somewhere else
Matrix m = Matrix::Identity(4, 4);
```
Here, the inlining of the call to `Identity` at the call-site will turn the `n` and `m` in the body of `Identity` into the constant 4.
If I had to make an educated guess, inlining typically generates these situations (i.e. partial evaluation, constant folding, and DCE) most often in compilers. An incredible amount of information can flow from caller to callee when you specialize the callee for that call-site.
I didn't understand dead-code elimination until I wrote enough macros. It is a lot easier to generate code and have the optimizer fix it than to make sure you always generate efficient code.
This is also how compilers do things; it's just that we Schemers can see the intermediate result much more easily, since it's all simple source->source transformations.
As an example: I wrote a clone of Racket's for loops, which use #:when and #:break clauses. Instead of generating them only when they were present, the break clauses just defaulted to #f and the when clauses to #t, meaning that the break test in the generated code was simply optimized away if the user didn't have any break clauses, and the test for the when clauses was optimized to a regular (begin ...).
It simplified the code a lot, and the optimizer was much faster than doing it all myself at expansion time. I lazily generate about 30 lines of code for a simple loop, which in the end is sometimes even unrolled to the final result thanks to Guile's optimizer and partial evaluation.
In addition to the optimisations already mentioned, loop unrolling also typically enables vectorisation in compilers. You might argue that for vectorisation it is not strictly necessary to have the relevant operations next to each other in a contiguous instruction stream, but it makes the vectorisation pass a lot nicer and simpler (if it can be called that to begin with).
Assuming you unroll just enough to fill a SIMD lane. As mentioned, in this case aggressive (16-fold) unrolling actually appears to have prevented vectorization. (A smart enough vectorizer could of course handle this, but unrolling just to "re-roll" in a later pass doesn't sound very smart.)
In most cases on modern systems, small loops should remain as compact as possible, to stay in the uop cache. The "for"-loop overhead (the inc, cmp, and jmp instructions) effectively executes in parallel with the rest of the work. Modern systems are highly out-of-order, and the for-loop overhead is virtually nil.
Actually unrolling is often very important. In some cases it is even more important with modern high speed out-of-order cores. For example, you might need several accumulators to handle instruction or memory latency.
For small loops, unrolling is the most important of all, since loop carried dependency chains are dense, and the loop overhead is a high fraction of the overall work.
It is easy to get a 2x speedup by unrolling a small loop, and even larger speedups are not uncommon.
So this "unrolling rarely helps" idea is just as much of a myth as "unrolling never helps". The main problem with unrolling is that compilers usually don't do it intelligently - usually loops are unrolled if some kind of threshold is met, depending on compiler options - but this always happens in a feed-forward way, rather than a feed-back way, which would involve unrolling the loop and analyzing the benefits and costs after further optimization passes.
Okay, manual loop unrolling definitely helps, but the programmer MUST be aware of dependency chains and ILP. The compiler cannot make the decision, at least not without a pragma or maybe an autovectorization engine.
At least, I haven't seen today's (2018) compilers cut dependency chains without a #pragma omp reduce, or other assistance from the programmer.
I've unrolled loops myself to good success. But it isn't as easy as some people think it is.
Without knowing about dependency chains or ILP, small loops are often best left in smaller compact form. You leverage the branch predictor and minimize uop cache usage. I'd argue the typical program benefits from compact loops more.
You are right, the vast majority of loops should be left without unrolling. In fact, the majority of code should be compiled for size, not for speed. For a typical application you probably have 99% of the code not being very performance sensitive, and the remaining 1% is where the time is spent. Still, people regularly compile for speed to good effect, because it's so important to compile that 1% for speed that you are better off just compiling everything that way if you aren't able/willing to profile and figure out which code falls in the 1%.
That's nothing specific to unrolling though: it applies to all optimizations which trade off size and speed.
About dependency chains and ILP: of course the compiler is in the perfect place to be aware of all of this. Compilers have detailed machine models updated carefully as new CPUs come out (in fact, some of the earliest details about new CPU models often come from compiler commits from insiders, where hardware details are necessarily leaked).
A compiler could certainly statically analyze a loop using an approach similar to Intel IACA or OSACA [1], then unroll the loop a few times, run the analysis again, and see if it improves. So it doesn't need any kind of sophisticated analysis, just try-and-measure. Of course, compilers don't actually work like this. One of the reasons it is not so easy is that optimization is done in layers, against a machine-independent IR, and you might not be able to carefully evaluate the impact of unrolling until some later time, possibly as late as the machine-dependent instruction emission. This issue is pervasive across many compiler optimizations, and leads to many cases where a compiler generates bad code where a human wouldn't.
Integer-based code probably can (theoretically) be optimized by a good enough compiler. I haven't thought through all the overflow/underflow conditions, but most "normal" code, consisting of adds, subtracts, and AND/OR/XORs, can probably be broken up as needed by the compiler.
But floating-point code MUST execute in precisely the order that the programmer specifies. In general, (A+B)+C does NOT equal A + (B+C) for floats.
For integer code, division is definitely order-specific, while multiplication might be order-specific in overflow conditions (I haven't 100% thought this through yet...). But in any case, the compiler can probably split up operations to take advantage of ILP as long as the results output the same bits.
Which isn't true for some operations (ie: floating point add)
--------
In any case, the "#pragma omp reduce" gives the compiler assurance from the programmer that the ordering does not matter. I think that's why something like that could benefit from parallelism a bit better.
Are there no compilers which attempt to look at that code, decide 'that looks like 1<<(num-2) when num>=2', and replace the code entirely?
There must be so many examples of bubble sort where quicksort would be better, and other code patterns which can be identified and replaced with something orders of magnitude faster.
Did you read TFA? The author did that (though using GCC), and the reason the optimizer does what you see is undefined behavior due to signed integer overflow.
Did you understand the comment? The author used GCC, and GCC is only able to vectorize the loop. clang, on the other hand, essentially turned this O(n) algorithm to calculate a particular sum into an O(1) result.
> the reason the optimizer does what you see is undefined behavior due to signed integer overflow
Yes undefined behavior gives the optimizer the right in this case to transform the code into anything, including a nonsense answer, or a trap instruction. But the optimizer did not; it produced the right answer under 2's complement arithmetic.
There is no undefined behaviour, '#pragma GCC optimize("wrapv")' takes care of that.
EDIT: It seems that clang doesn't support #pragma GCC optimize, so it's a no-op in that snippet. It doesn't change the result though. If you pass -fwrapv flag to clang, it will be optimized in exactly the same way.
Just to be clear, undefined behavior means the standard allows implementations to do what they feel is the right thing to do under that scenario, and the outcome will still comply with the standard.
Undefined behaviour, in reality, means: the compiler will assume that it doesn’t happen, so whatever code path leading up to it can also (by definition) not happen and be eliminated. E.g. signed integer overflow “cannot happen” so you never need to emit code checking for it or dealing with it.
That’s the real world implication of undefined behaviour.
Undefined behavior is "literally anything can happen". So yes, implementations doing "what they feel is the right thing to do" is one possible result of UB (as in this case). It could also emit an rm -rf / call and it would still comply with the standard...
So if the compiler is too good you want to trick it to produce less optimal code so you can benchmark it fairly? Isn't it part of the benchmark to allow the compiler reduce the whole expression to a compile time constant?
You want the loop to not be optimized away because the loop itself is not a part of benchmarked code, it's the benchmarking code. It executes the same thing million+ times so that the total execution time is much higher than timer measurement error, measurement overhead and random OS fluctuations, that would otherwise drown your result in noise.
Aside from what the sibling says about the difference between the test harness code and the code being benchmarked, there's a more abstract point: you want the compiler to reduce it to a compile time constant if and only if in the real world cases you're trying to model, it will be able to do so. That's pretty rare, since if that happens, you probably wouldn't have to do performance analysis on that code.
These days I find myself telling people that benchmark numbers don’t matter on their own. It’s important what models you derive from those numbers. Refined performance models are by far the noblest and greatest achievement one could get with the benchmarking — it contributes to understanding how computers, runtimes, libraries, and user code work together. --Aleksey Shipilёv https://shipilev.net/blog/2014/nanotrusting-nanotime/
That's a bit of an obscure comment, but I keep coming back to it as I learn about performance work and benchmarking.
With all the optimisations being implemented in compilers today, it is impressive to see how this opportunity to optimise is missed. Put differently, compiler writers bother about optimisations that gain 0.1% performance in some special cases, but others that could gain 20% performance are not implemented.
Why? Is this optimisation particularly difficult to implement? Or is it just missed low-hanging fruit? It sure looks easy (like: rearrange expressions to keep the expression tree shallow and left-branching to avoid stack operations).
Compiler developers have tons of benchmarks which they run. I’d bet that this is as simple as not being significant in their test suite, with a good chance that it’s both not as simple as it might seem or that there are impacts on more complicated code which is in their benchmark suite or a big customer’s app.
The truth is that the HotSpot compiler is pretty old at this point and never really implemented a lot of good, robust, and thorough optimizations (I've read the source every year or two).
It does some stuff and hopes for the best.
This is why there is a real commercial jvm market with azul.
It's possible that they're working in the frame of mind that there aren't any low-hanging fruit left after so many years of compiler optimizations and forget to even try.
It is a good answer, but my favorite by far is an answer about branch prediction to explain why processing a sorted array is faster than unsorted: https://stackoverflow.com/q/11227809/938695
I find it interesting that there are developers out there who know to look at these nuances when responding to Stack Overflow questions. I've been developing professionally for 10 years, and probably went over branch prediction in my computer architecture class in college (I'm guessing I did; if I didn't, then I never encountered it at all!).
The person who answered the multiplication question dove into bytecode... but also answered questions on Angular.
I am unworthy...and this is what impostor syndrome looks like.
That person works in financial services, which I'm guessing is basically some form of automated trading. It is an industry where every cycle counts (so much so that oftentimes the speed-of-light latency between two endpoints is something you need to consider when placing servers).
He probably has actual experience with branch prediction. He probably dabbled or had experience with angular in other jobs (he worked at google apparently, so maybe there).
He'll most likely be stumped if you provide a graphic problem that a graphic designer with a few years of experience would solve in an instant, or an ML problem for a data scientist with similar experience.
That doesn't mean he isn't extremely smart. He most likely is (it takes a lot of brain to do these things), but the fact that you can't tell branch prediction problems even though you had some computer architecture class in the past is irrelevant.
The author of that answer wrote y-cruncher, which has been used to set world records in the number of digits of pi calculated. So I'm not surprised at all to see that they know how branch prediction works.
I thought at first this was because integer squaring is potentially faster than general integer multiplication and the compiler wasn't seeing the square operation in the second case, but that's not the explanation here.
I find it weird that he doesn't mention this difference as part of the performance difference. A left shift should be considerably faster than a mul operation?
I believe this is far less true than it used to be, but it’s a good example of why these decisions really need to be data driven as compilers and processors change faster than most people can afford to optimize code. I don’t know that this would be the case for something that simple but I’ve seen a fair amount of heavily-tuned C/ASM code which was replaced with the now-faster “reference” code when someone noticed that the old assumptions weren’t true.
The difference is that with fp ops, it's part of the design and understood that you should never directly compare the equality of fp numbers since they are estimates. You should check for equality of fp numbers by checking their difference according to your needs.
Whereas for int ops, equality works within the limits of the design.
In short equality means something different in fp by design. For int, it means what we think it means within its limits. When we overflow, then things get screwy.
Ideally, yes, the compiler should generate the fastest code possible for the same math expression.
However, the compiler ('s optimization step) is not magic and produces suboptimal code sometimes. Back when C was young, this was frequently the case (1970's and 1980's), so dropping into assembly to hand-code performance critical sections is just what people did, in order to get software to run smoothly.
Thankfully this is largely no longer the case; however, it does still happen.
In Java's case, the JVM runs on top of multiple different architectures which makes optimization even more complicated.
Low-level instruction generation and optimization is just one topic under the umbrella of compiler design, which is a huge (and fascinating!) discipline to get into.
While true - and I still remember those days, when many C code bases were full of functions whose body was a big asm { ... } block - there were also compilers which were much better at dealing with optimizations than those C compilers were capable of.
Notice the architecture, quite similar to the layers and compiler phases used in modern compiler toolchains like LLVM.
The secret sauce, if one can call it as such, was that PL.8 had a richer type system, and the System/370 was a bit beefier than most platforms adopting C compilers.
> However, the compiler ('s optimization step) is not magic and produces suboptimal code sometimes.
I agree that compilers are not always perfect, but in this particular case the two expressions are trivially equivalent from the associativity of the multiplication so the distinction had to be intentional.
But as gnuvince pointed out, the two expressions are not equivalent when you consider integer overflow.
The two expressions are equivalent, even under overflow. With a few exceptions (eg division, right shift) fixed size integer math follows the same associative rules as "standard math".
Because it's more work for the compiler to reduce the expression to something canonical (and it might even be impossible).
Also what good will it bring? What if the canonical expression triggers the slow path? Now you have no means to change it into the fast version.
Further, in the case of floating point operations, operation order matters for rounding. And with integer operations, the actual form used can be important for preventing overflow (of intermediate results).
Is Overflow UB so the compiler can choose to ignore the fact that 2x(i x i) could overflow differently from 2 x i x i?
I’m not sure it does overflow differently but I would expect overflow to behave consistently as written, and not be dependent on optimization, is that not the case?
Without UB it must be very hard for the compiler to optimize arithmetic. Even obvious things like (2 x A) x B vs 2 x (A x B) are only equivalent without overflow. I guess it can be specified as being up to the jitter to decide - so not UB but not known from looking at the source either? Would be interesting to know what .NET and Java specifications say on it
You can usually optimize integer arithmetic just fine, including the example you gave (both forms are equivalent - try it!).
Floating point arithmetic is different, but Java gives itself wiggle room by not exactly specifying many results unless you choose "strict math". That's not UB though: it's just a range of possible outcomes.
Java can't have UB in the C/C++ sense, since it would break the security sandbox. It certainly has things without specifically defined values, such as hashCode() and what happens under data races isn't entirely deterministic, but it doesn't approach UB in the C/C++ sense.
Yeah I’m painfully aware of the FP gotchas. But are you saying there are usually never any issues with integer arithmetic and overflow vs. optimizations (reordering, common subexpressions etc)?
A branch like “if a+1 < a” seems like it could under a clever compiler (allowed to do what it wants in unchecked overflow) optimize to a completely removed branch but with less optimization it will not, so the addition is carried out and the wraparound means the branch is entered?
Seems that not checking for overflow and not being able to assume there is no overflow, would give the worst of both worlds (slower because of lack of some optimizations but still not safe against overflow like C#’s “checked”).
I thought a deref of a possibly overflown value was what could risk security, ie so long as all array indices and similar are range checked then nothing bad can happen?
> But are you saying there are usually never any issues with integer arithmetic and overflow vs. optimizations (reordering, common subexpressions etc)?
I'm saying that it is uncommon. For example, the example you gave works fine! Most examples work fine, since you can do most of the same types of transformations. The exceptions are largely operations where the upper bits affect the lower bits. Division is one, and right shift is another, but nearly all the other bitwise and math operations do not have this property.
The other place where signed-wrapping-is-UB is often used for optimization is in things like loop bounds. Given something like:
```
for (int i = start; i < end; i++)
```
Due to the UB of signed wrapping, the compiler can assume that initially i < end, and more importantly that the loop will iterate exactly (end - start) times, and that all accesses will be to contiguous, increasing addresses. This helps in vectorization, among other things.
In Java, the compiler couldn't take advantage of that. However, the impact isn't as big, since in any loop that accesses arrays based on the index, bounds checks have to be made, and a typical pattern is to do some up-front bounds checks [1] which can guarantee that the main body of the loop can then be run without additional checks: and this subsumes the checks that would be needed due to wrapping signed values anyway. So basically you have to do those checks anyway in many cases.
> A branch like “if a+1 < a” seems like it could under a clever compiler
Sure, but these cases aren't very interesting for comparing the effectiveness of the signed overflow optimizations. It's a case where the optimization breaks the (wrongly expressed) intent of the programmer, so the difference is only between a fast, broken program and a slower, perhaps correct one.
Presumably in the signed-overflow-is-UB world, such a check will have to be re-written in a different way (which might even end up slower).
> Seems that not checking for overflow and not being able to assume there is no overflow, would give the worst of both worlds (slower because of lack of some optimizations but still not safe against overflow like C#’s “checked”).
Not exactly - it's just a different point on the spectrum. On one side you have overflow-is-UB, which leads to some (fairly limited) optimization opportunities, but also more unexpected results and broken programs, and on the far other side you have overflow as an always-caught checked error. Java is somewhere in the middle: overflow is well-defined (it wraps), which gets you most of the speed of the UB approach (no checking needed, only a few relatively minor optimizations are lost) without the unexpected results of UB - but you still have broken programs when overflow actually occurs (unless the wrap was intended). See also Rust's strategy here, which is interesting.
> I thought a deref of a possibly overflown value was what could risk security, ie so long as all array indices and similar are range checked then nothing bad can happen?
When "nothing bad can happen" from the point of view of the JVM, i.e., the security sandbox isn't broken, you can't access arbitrary memory, violate the type system, interfere with unrelated code, etc. Of course, overflowing an index and then accessing into the array could still do plenty of "bad things" depending on the higher level semantics of the program, since you are now executing an unexpected code path. You might return sensitive data to an attacker, whatever.
> Ok, I was just mistaken in my belief that integer overflow shenanigans was a major contributor to how a modern compiler optimized e.g. loops.
Yes, there is some impact, but I don't think it's large, at least based on looking at a lot of assembly and going over the typical examples of where it helps. In the cases it does help, it could make a big difference on a loop, let's say 2x the speed, but those cases aren't all that common.
> Right. I was considering the sandbox in the sense only of process security rather than program/type soundness.
Usually those two things end up tightly bound together: it's hard to impossible to enforce a sandbox if the user can escape the type system.
How? Java is compiled to bytecode, you don't know the architecture of the system the code is going to run on. It's one of the reasons javac only implements the simplest optimizations possible (constants folding and the like)
Tried what specifically? This particular example, or something similar where the compiler generates code with different speeds for seemingly equivalent code?