
Why is 2 * (i * i) faster than 2 * i * i in Java? - trequartista
https://stackoverflow.com/questions/53452713/why-is-2-i-i-faster-than-2-i-i-in-java
======
userbinator
_So it's an issue of the optimizer; as is often the case, it unrolls too
aggressively and shoots itself in the foot, all the while missing out on
various other opportunities._

In my experience, loop unrolling should basically never be done except in
extremely degenerate cases; I remember not long ago someone I know who also
optimises Asm remarking "it should've died along with the RISC fad". The
original goal was to reduce per-iteration overhead associated with checking
for end-of-loop, but any superscalar/OoO/speculative processor can "execute
past" those instructions anyway; all that unrolling will do is bloat the code
and work against caching. Memory bandwidth is often the bottleneck, not the
core.

~~~
pcwalton
> In my experience, loop unrolling should basically never be done except in
> extremely degenerate cases

Not true. Like many such optimizations, loop unrolling can be useful because
it makes downstream loads constant.

For example:

    
    
        float identity[4][4];
        for (unsigned y = 0; y < 4; y++)
            for (unsigned x = 0; x < 4; x++)
                identity[y][x] = y == x ? 1 : 0;
        ... do some matrix math ...
    

In this case, the compiler probably wants to unroll the loops so that it can
straightforwardly forward the constant matrix entries directly to the matrix
arithmetic. It'll likely be able to eliminate lots of operations that way.

(You might ask "who would write this code?" As Schemers say: "macros do.")

See LLVM's heuristics:
[http://llvm.org/doxygen/LoopUnrollPass_8cpp.html#ad7c38776d74075aa393534236d5a3d64](http://llvm.org/doxygen/LoopUnrollPass_8cpp.html#ad7c38776d74075aa393534236d5a3d64)

~~~
bjoli
I didn't understand dead code elimination until I wrote enough macros. It is a
lot easier to generate code and have the optimizer fix it than to make sure
you always generate efficient code.

This is also how compilers do things; it is just that we Schemers can see the
intermediate result much more easily, using simple source->source
transformations.

~~~
bjoli
As an example: I wrote a clone of Racket's for loops, which support #:when and
#:break clauses. Instead of only generating those clauses when they were
present, the break clauses just defaulted to #f and the when clauses to #t,
meaning that the break test in the generated code was simply optimized away if
the user didn't write any break clauses, and a default when clause was
optimized into a regular (begin ...).

It simplified the code a lot, and letting the optimizer do it was much faster
than doing it all myself at expansion time. I lazily generate about 30 lines
of code for a simple loop, which in the end is sometimes even unrolled to the
final result thanks to Guile's optimizer and partial evaluation.
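
The Java analogue (my hedged sketch of the same idea, not the actual Scheme)
would be to emit guards that default to compile-time constants and let the
compiler strip them:

    
    
        static final boolean HAS_BREAK = false;  // default when no #:break clause was written
        static final boolean WHEN_TEST = true;   // default when no #:when clause was written
    
        static int loop(int n) {
            int sum = 0;
            for (int i = 0; i < n; i++) {
                if (HAS_BREAK) break;      // dead branch: stripped at compile time
                if (WHEN_TEST) sum += i;   // constant-true guard folds into the plain body
            }
            return sum;
        }
    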

------
jepler
You should translate your program to C++ and build with clang; it turns the
loop into a single constant load.
[https://godbolt.org/z/slznbU](https://godbolt.org/z/slznbU)

~~~
cryptonector
Did you read TFA? The author did that (though using GCC), and the reason the
optimizer does what you see is undefined behavior due to signed integer
overflow.

~~~
geezerjay
> does what you see is undefined behavior

Just to be clear, undefined behavior means the standard allows implementations
to do whatever they feel is the right thing to do under that scenario, and the
outcome will still comply with the standard.

~~~
berti
No, that's implementation-defined behaviour: the implementation chooses what
happens and documents the choice. With undefined behavior the standard places
no requirements on the outcome at all.

------
beeforpork
With all the optimisations implemented in compilers today, it is impressive
to see how this opportunity to optimise was missed. Put differently, compiler
writers bother with optimisations that gain 0.1% performance in some special
cases, while others that could gain 20% are not implemented.

Why? Is this optimisation particularly difficult to implement? Or is it just
missed low-hanging fruit? It sure looks easy (like: rearrange expressions to
keep the expression tree shallow and left-branching to avoid stack
operations).

~~~
acdha
Compiler developers have tons of benchmarks which they run. I'd bet this is
as simple as the pattern not being significant in their test suite, with a
good chance that it's either not as easy as it seems or that it would hurt
more complicated code in their benchmark suite or in a big customer's app.

------
techopoly
That just might be the most dedicated answer I've ever seen on Stack Overflow.

~~~
azhenley
It is a good answer, but my favorite by far is an answer about branch
prediction to explain why processing a sorted array is faster than unsorted:
[https://stackoverflow.com/q/11227809/938695](https://stackoverflow.com/q/11227809/938695)

~~~
fma
I find it interesting that there are developers out there who know to look at
these nuances when responding to Stack Overflow questions. I've been
developing professionally for 10 years and probably went over branch
prediction in my computer architecture class in college (I'm guessing I did;
if I didn't, then I never encountered it at all!).

The person who answered the multiplication question dove into bytecode...but
also answered questions on Angular.

I am unworthy...and this is what impostor syndrome looks like.

~~~
Illniyar
That person works in financial services, which I'm guessing means some form
of automated trading. It is an industry where every cycle counts (so much so
that oftentimes the speed-of-light latency between two endpoints is something
you need to consider when placing servers).

He probably has actual experience with branch prediction. He probably dabbled
or had experience with Angular in other jobs (he worked at Google apparently,
so maybe there).

He'd most likely be stumped by a graphics problem that a graphic designer
with a few years of experience would solve in an instant, or an ML problem
for a data scientist with similar experience.

That doesn't mean he isn't extremely smart. He most likely is (it takes a lot
of brain to do these things), but the fact that you can't spot branch
prediction problems even though you took some computer architecture class in
the past is irrelevant.

~~~
saagarjha
The author of that answer wrote y-cruncher, which has been used to set world
records in the number of digits of pi calculated. So I'm not surprised at all
to see that they how know branch prediction works.

~~~
aristophenes
> they how know branch prediction works.

Can’t tell if clever joke, or typo

~~~
saagarjha
Typo, but I'll leave it up to brighten someone else's day.

------
dreamcompiler
I thought at first this was because integer squaring is potentially faster
than general integer multiplication and the compiler wasn't seeing the square
operation in the second case, but that's not the explanation here.

~~~
garmaine
There isn’t an integer square opcode on any major processor architecture
though, right?

~~~
dreamcompiler
Not that I know of. It's not really worth it for short integers (64 bits or
less). But it's helpful with bignums.
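
Here's a sketch of why, assuming schoolbook arithmetic over base-2^15 limbs
(illustrative only, not any particular library's code): the cross products
a[i]*a[j] with i < j are symmetric, so each is computed once and doubled,
roughly halving the multiplications of a general multiply.

    
    
        // square a little-endian bignum whose limbs are in [0, 2^15)
        static int[] square(int[] a) {
            long[] acc = new long[2 * a.length];
            for (int i = 0; i < a.length; i++) {
                acc[2 * i] += (long) a[i] * a[i];     // diagonal terms, once each
                for (int j = i + 1; j < a.length; j++)
                    acc[i + j] += 2L * a[i] * a[j];   // symmetric cross terms, computed once
            }
            int[] r = new int[acc.length];
            long carry = 0;
            for (int k = 0; k < acc.length; k++) {    // propagate carries back to base 2^15
                long v = acc[k] + carry;
                r[k] = (int) (v & 0x7FFF);
                carry = v >> 15;
            }
            return r;
        }
    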

------
ww520
I'm surprised it's not doing a left shift for the x2.

~~~
jcdavis
It is in the first example (the sal instruction)

~~~
DannyBee
However, if you look at the second, you won't see any left shifts, which is
also interesting

~~~
ascar
I find it weird that he doesn't mention this as part of the performance
difference. Shouldn't a left shift be considerably faster than a mul
operation?

~~~
acdha
I believe this is far less true than it used to be, but it’s a good example of
why these decisions really need to be data driven as compilers and processors
change faster than most people can afford to optimize code. I don’t know that
this would be the case for something that simple but I’ve seen a fair amount
of heavily-tuned C/ASM code which was replaced with the now-faster “reference”
code when someone noticed that the old assumptions weren’t true.

------
podsnap
The graal behavior is a lot more sane:

    
    
        graal:
        [info] SoFlow.square_i_two   10000  avgt   10  5338.492 ± 36.624  ns/op   // 2 *\sum i * i
        [info] SoFlow.two_i_         10000  avgt   10  6421.343 ± 34.836  ns/op   // \sum 2 * i * i
        [info] SoFlow.two_square_i   10000  avgt   10  6367.139 ± 34.575  ns/op   // \sum 2 * (i * i)
        regular 1.8:
        [info] SoFlow.square_i_two   10000  avgt   10  6393.422 ± 27.679  ns/op
        [info] SoFlow.two_i_         10000  avgt   10  8870.908 ± 35.715  ns/op
        [info] SoFlow.two_square_i   10000  avgt   10  6221.205 ± 42.408  ns/op
    

The graal-generated assembly for the first two cases is nearly identical,
featuring unrolled repetitions of sequences like

    
    
        [info]   0x000000011433ec03: mov    %r8d,%ecx
        [info]   0x000000011433ec06: shl    %ecx               ;*imul {reexecute=0 rethrow=0 return_oop=0}
        [info]                                                 ; - add.SoFlow::test_two_i_@15 (line 41)
        [info]   0x000000011433ec08: imul   %r8d,%ecx          ;*imul {reexecute=0 rethrow=0 return_oop=0}
        [info]                                                 ; - add.SoFlow::test_two_i_@17 (line 41)
        [info]   0x000000011433ec0c: add    %ecx,%r9d          ;*iadd {reexecute=0 rethrow=0 return_oop=0}
        [info]                                                 ; - add.SoFlow::test_two_i_@18 (line 41)
        [info]   0x000000011433ec0f: lea    0x5(%r11),%r8d     ;*iinc {reexecute=0 rethrow=0 return_oop=0}
        [info]                                                 ; - add.SoFlow::test_two_i_@20 (line 40)
    
    

while the third case does a single shl at the end.

    
    
        [info]   0x000000010e2918bb: imul   %r8d,%r8d          ;*imul {reexecute=0 rethrow=0 return_oop=0}
        [info]                                                 ; - add.SoFlow::test_square_i_two@15 (line 32)
        [info]   0x000000010e2918bf: add    %r8d,%ecx          ;*iadd {reexecute=0 rethrow=0 return_oop=0}
        [info]                                                 ; - add.SoFlow::test_square_i_two@16 (line 32)
        [info]   0x000000010e2918c2: lea    0x3(%r11),%r8d     ;*iinc {reexecute=0 rethrow=0 return_oop=0}
        [info]                                                 ; - add.SoFlow::test_square_i_two@18 (line 31)                                   
    

Both graal and C2 inline, but as usual the graal output is a lot more
comprehensible.

------
bnegreve
I don't see how generating different code for the same mathematical expression
can be a good thing.

The compiler should detect that the two expressions are strictly equivalent
and generate whatever code it believes is the fastest.

Any idea why it is this way?

~~~
gnuvince
Because of integer overflows and floating-point operations, the notion of
equivalent mathematical expressions is tricky.

    
    
        fn main() {
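            // nb: build with --release (or use wrapping_add); in a debug build the i8 overflow below panics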
            let a: i8 = 125;
            let b: i8 = 3;
            let c: i8 = (a + b) / 2;
            let d: i8 = b + ((a - b) / 2);
            println!("{} {}", c, d);
        }
    

In a release build (where overflow wraps instead of panicking), this program
outputs `-64 64`, although the computations of `c` and `d` are mathematically
equivalent.

Here's another example using floating point numbers:

    
    
        fn main() {
            let mut total1: f32 = 0.0;
            let mut total2: f32 = 0.0;
            let mut counter1: f32 = 0.0;
            let mut counter2: f32 = 100.0;
    
            for _ in 0 .. 10001 {
                total1 += counter1;
                total2 += counter2;
                counter1 += 0.01;
                counter2 -= 0.01;
            }
            println!("{} {}", total1, total2);
        }
    

The output of this program is `500041.16 500012.16`, a difference of about 29
between two totals that should mathematically be identical (unless I made a
mistake).

~~~
bnegreve
Right! Thanks.

------
crb002
TIL about printing ASM from debug JVMs.
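
(For reference, the usual incantation - it needs the hsdis disassembler
plugin on the JVM's library path, and on a product build you first unlock the
diagnostic options; the class name here is just a placeholder:)

    
    
        java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly MyBenchmark
    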

~~~
pjmlp
If you use Oracle Studio you can even see it in the IDE.

[https://www.youtube.com/watch?v=_cFwDnKvgfw](https://www.youtube.com/watch?v=_cFwDnKvgfw)

There are also other tools like JITWatch.

[https://github.com/AdoptOpenJDK/jitwatch/wiki/Videos-and-Slideshows](https://github.com/AdoptOpenJDK/jitwatch/wiki/Videos-and-Slideshows)

[https://vimeo.com/181925278](https://vimeo.com/181925278)

------
alkonaut
Is overflow UB, so the compiler can choose to ignore the fact that 2 x (i x i)
could overflow differently from 2 x i x i?

I'm not sure it does overflow differently, but I would expect overflow to
behave consistently as written and not depend on optimization. Is that not the
case?

~~~
BeeOnRope
Nothing you can do in pure Java code is UB in the C/C++ sense.

~~~
alkonaut
Without UB it must be very hard for the compiler to optimize arithmetic. Even
obvious things like (2 x A) x B vs 2 x (A x B) are only equivalent in the
absence of overflow. I guess it can be _specified_ as being up to the jitter
to decide - so not UB, but not knowable from looking at the source either? It
would be interesting to know what the .NET and Java specifications say on
this.

~~~
BeeOnRope
You can usually optimize integer arithmetic just fine, including the example
you gave (both forms are equivalent - try it!).
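
A quick check (my snippet, not from the thread) that two's-complement
wrapping preserves the equivalence even when the intermediates overflow:

    
    
        public class WrapCheck {
            public static void main(String[] args) {
                int a = 1_000_003, b = 2_000_029;   // both products overflow int
                int left  = (2 * a) * b;            // wraps partway through
                int right = 2 * (a * b);            // wraps at a different point
                System.out.println(left == right);  // true: * is associative mod 2^32
            }
        }
    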

Floating point arithmetic is different, but Java gives itself wiggle room by
not exactly specifying many results unless you choose "strict math". That's
not UB though: it's just a range of possible outcomes.

Java can't have UB in the C/C++ sense, since it would break the security
sandbox. It certainly has things without specifically defined values, such as
hashCode(), and what happens under data races isn't entirely deterministic,
but none of that approaches UB in the C/C++ sense.

~~~
alkonaut
Yeah, I'm painfully aware of the FP gotchas. But are you saying there are
usually no issues with integer arithmetic and overflow vs. optimizations
(reordering, common subexpressions, etc.)? A branch like "if a+1 < a" seems
like something a clever compiler (allowed to do what it wants on unchecked
overflow) could optimize away entirely, while with less optimization it would
not, so the addition is carried out and the wraparound means the branch is
entered?

It seems that not checking for overflow, while also not being able to assume
there _is_ no overflow, would give the worst of both worlds (slower because
some optimizations are lost, but still not safe against overflow like C#'s
"checked").

I thought a deref of a possibly overflown value was what could risk security,
i.e. so long as all array indices and similar are range checked then nothing
bad can happen?

~~~
BeeOnRope
> But are you saying there are usually never any issues with integer
> arithmetic and overflow vs. optimizations (reordering, common subexpressions
> etc)?

I'm saying that it is uncommon. For instance, the example you gave works
fine! Most cases work out, since you can do most of the same kinds of
transformations. The exceptions are largely operations where the _upper bits
affect the lower bits_: division is one, and right shift is another, but
nearly all the other bitwise and arithmetic operations do not have this
property.
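
A quick Java illustration of that property (my example, jshell-ready, just to
make it concrete):

    
    
        long x  = (1L << 32) + 5;                     // upper bits set; low 32 bits are 5
        int  lo = (int) x;
        System.out.println((int) (x * 7) == lo * 7);  // true: truncation commutes with *
        System.out.println((int) (x / 3) == lo / 3);  // false: / lets upper bits leak down
    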

The other place where signed-wrapping-is-UB is often used for optimization is
in things like loop bounds. Given something like:

    
    
        for (int i = start; i < end; i++)
    

Due to the UB of signed wrapping, the compiler can assume that initially i <
end, and more importantly that the loop will iterate exactly (end - start)
times, and that all accesses will be to contiguous, increasing addresses. This
helps in vectorization, among other things.

In Java, the compiler couldn't take advantage of that. However, the impact
isn't as big, since any loop that accesses arrays based on the index has to
perform bounds checks anyway, and a typical pattern is to do some up-front
bounds checks [1] which guarantee that the main body of the loop can then run
without additional checks - and these subsume the checks that would be needed
for wrapping signed values anyway. So basically you have to do those checks
in many cases regardless.
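
Roughly, the transformation looks like this (my sketch, not HotSpot's actual
output):

    
    
        // before: a bounds check on every a[i]
        for (int i = start; i < end; i++) sum += a[i];
    
        // after range check elimination, conceptually:
        if (start < 0 || end > a.length || start > end)
            throw new ArrayIndexOutOfBoundsException();
        for (int i = start; i < end; i++) sum += a[i];  // body runs check-free
    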

> A branch like “if a+1 < a” seems like it could under a clever compiler

Sure, but these cases aren't very interesting for comparing the effectiveness
of the signed overflow optimizations. It's a case where the optimization
breaks the (wrongly expressed) intent of the programmer, so the difference is
only between a fast, broken program and a slower, perhaps correct one.

Presumably, in the signed-overflow-is-UB world, such a check would have to be
rewritten in a different way (which might even end up slower).
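
(In Java, by contrast, that guard is well-defined and must be preserved,
since int overflow wraps:)

    
    
        static boolean incWouldOverflow(int a) {
            return a + 1 < a;   // true exactly when a == Integer.MAX_VALUE
        }
    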

> Seems that not checking for overflow and not being able to assume there is
> no overflow, would give the worst of both worlds (slower because of lack of
> some optimizations but still not safe against overflow like C#’s “checked”).

Not exactly - it's just a different point on the spectrum. On one side you
have overflow-is-UB, which enables some (fairly limited) optimization
opportunities but also more unexpected results and broken programs; on the
far other side you have overflow as an always-caught checked error. Java is
somewhere in the middle: overflow is well-defined (it wraps), which gets you
most of the speed of the UB approach (no checking needed, only a few
relatively minor optimizations are lost) without the unexpected results of UB
- but you still have broken programs when overflow actually occurs (unless
the wrapping was intended). See also Rust's strategy here, which is
interesting: overflow panics in debug builds and wraps in release builds,
with explicit wrapping/checked/saturating operations for when you care.

> I thought a deref of a possibly overflown value was what could risk
> security, ie so long as all array indices and similar are range checked then
> nothing bad can happen?

When "nothing bad can happen" from the point of view of the JVM, i.e., the
security sandbox isn't broken, you can't access arbitrary memory, violate the
type system, interfere with unrelated code, etc. Of course, overflowing an
index and then accessing into the array could still do plenty of "bad things"
depending on the higher level semantics of the program, since you are now
executing an unexpected code path. You might return sensitive data to an
attacker, whatever.

---

[1] More details:
[https://wiki.openjdk.java.net/display/HotSpot/RangeCheckElimination](https://wiki.openjdk.java.net/display/HotSpot/RangeCheckElimination)

~~~
alkonaut
Thanks, it's very kind of you to take the time to write that much!

> leads to some (fairly limited) optimization opportunities,

Ok, I was just mistaken in my belief that integer overflow shenanigans were a
major contributor to how a modern compiler optimizes e.g. loops.

> you can't access arbitrary memory, violate the type system, interfere with
> unrelated code

Right. I was considering the sandbox in the sense only of process security
rather than program/type soundness.

~~~
BeeOnRope
> Ok, I was just mistaken in my belief that integer overflow shenanigans was a
> major contributor to how a modern compiler optimized e.g. loops.

Yes, there is _some_ impact, but I don't think it's large, at least based on
looking at a lot of assembly and going over the typical examples of where it
helps. In the cases where it _does_ help, it can make a big difference to a
loop, let's say 2x the speed, but those cases aren't all that common.

> Right. I was considering the sandbox in the sense only of process security
> rather than program/type soundness.

Usually those two things end up tightly bound together: it's hard to
impossible to enforce a sandbox if the user can escape the type system.

------
Koshkin
At first, I thought it was because i * i == -1.

------
microcolonel
I guess they do not use value numbering, which is typically how you get
equivalent results for cases like this.

------
qwerty456127
IMHO some kind of logic preprocessor should take care of this before the
actual compilation.

~~~
isbvhodnvemrwvn
How? Java is compiled to bytecode; you don't know the architecture of the
system the code is going to run on. That's one of the reasons javac only
implements the simplest optimizations possible (constant folding and the
like).
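
For instance, constant expressions are folded by javac itself (easy to
confirm with javap -c):

    
    
        class Fold {
            static final int SECONDS_PER_DAY = 60 * 60 * 24;  // folded to 86400 at compile time
            int twoDays() { return SECONDS_PER_DAY * 2; }     // compiles to a single ldc of 172800
        }
    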

~~~
pjmlp
Compiling to bytecode is just one of the possibilities.

Since the early days of Java, OEM vendors targeting embedded systems have
supported AOT compilation, with optional PGO feedback.

Some vendors, like IBM, also provide similar capabilities in their regular
Java toolchains.

And Maxine finally graduated as Graal/Substrate, which is yet another way of
compiling Java.

But all in all, everyone is transitioning to the benefits of bytecode as an
intermediate executable format.

Even some cool LLVM optimizations, like ThinLTO, are only possible thanks to
using bitcode.

------
polskibus
I wonder if the same applies to .net (fx/core).

~~~
pjmlp
Depends on the runtime.

You have the old JIT, replaced by RyuJIT on .NET 4.6 and .NET Core.

Then .NET Native, which does AOT compilation via the same backend as Visual
C++.

Followed by Mono's JIT/AOT implementation.

Windows/Windows Phone 8.x used a Bartok-derived compiler for the MDIL format.

The same applies to Java, though: the answer only goes through what HotSpot
does, but there are many other JIT/AOT compilers for Java as well.

------
networkimprov
Has anyone tried this with Go?

~~~
pmarreck
No, because come back when you’re a real language with a runtime error handler

~~~
sabujp
thank you for this!

~~~
pmarreck
Go’s an OK language but

1) This is not the forum to bring it up

2) Given its warts it gets FAAAARRRRR too much attention IMHO

Sorry for snark.

------
JohnL4
The database is fast enough for a few extra trips to it, so this is definitely
what we should be focusing on.

(My cup of bitterness doth overflow.)

