
Deoptimizing a C++ program - ingve
http://stackoverflow.com/q/37361145/563941
======
billforsternz
Inevitably the Stack Overflow mods have moved in to shut this down. They
should be honest and introduce a "too interesting" classification, complete
with a wordy, humorless justification phrase.

~~~
mixedmath
I should note that this wasn't closed by moderators, but instead by normal
StackOverflow users. In fact, one of the mods commented on the question
before it was closed (and did not close it).

~~~
kristianp
I don't follow what you mean; normal users are moderators, they're just
users with sufficient karma.

~~~
caf
Stackoverflow does have a separate concept of "Moderator", which is a voted-on
position and denoted by a blue diamond next to their name.

------
Someone
For caching issues, you could replace the loops by more functional code: write
one loop to generate an array of values, and sum them to _payoff_sum_ in a
separate loop.

    
    
      // 1. Generate all the Gaussian draws up front.
      for (int i=0; i<num_sims; i++) {
        gauss_bm[i] = gaussian_box_muller();
      }
      // 2. Turn each draw into an asset price.
      for (int i=0; i<num_sims; i++) {
        S_cur[i] = S_adjust * exp(sqrt(v*v*T)*gauss_bm[i]);
      }
      // 3. Accumulate each path's payoff into its own slot.
      for (int i=0; i<num_sims; i++) {
        payoff[i] += std::max(S_cur[i] - K, 0.0);
      }
      // 4. Finally, reduce the per-path payoffs into payoff_sum.
      for (int i=0; i<num_sims; i++) {
        payoff_sum += payoff[i];
      }
    

To make this i7-specific, you have to make sure the arrays are mapped to
just the right cache lines, so that writing to one array evicts the data of
the array being read from out of the cache (hm, maybe that requires
combining some of those loops; what does the level 1 cache of these CPUs
look like?).

Next step: DRY. Move the _for (int i=0; i<num_sims; i++)_ loop into a
function taking a lambda for the loop body. That hides the ugliness of the
above code, and it's probably fairly easy to accidentally do some extra
copying of data here.
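
A minimal sketch of that helper (the name _for_each_sim_ and the capture
choices are my own, not from the original):

    
    
      // Hypothetical helper: the per-simulation loop, hoisted into a
      // function template that takes the loop body as a lambda.
      template <typename Body>
      void for_each_sim(int num_sims, Body body) {
        for (int i = 0; i < num_sims; i++) body(i);
      }
      
      // Usage, matching the first loop above. Capturing by value ([=]
      // instead of [&]) is where the accidental copying sneaks in:
      //   for_each_sim(num_sims,
      //       [&](int i) { gauss_bm[i] = gaussian_box_muller(); });
    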

Next: parallelism. You shouldn't have individual threads process contiguous
sections of the arrays; do

    
    
      for (int i = threadNo; i < num_sims; i += numThreads) {
    

instead. And of course, you can multi-thread by having the 4 for loops run in
parallel. That, I think, would be the big gain, as it would introduce quite a
few data dependencies between threads. It also is in the spirit of the
question, I think.
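
A minimal sketch of that strided split, assuming std::thread (the function
name and the trivial loop body are placeholders): adjacent elements now
belong to different threads, so concurrent writes land in shared cache
lines and false sharing bounces them between cores.

    
    
      #include <thread>
      #include <vector>
      
      void run_strided(int num_sims, int numThreads, double* payoff) {
        std::vector<std::thread> pool;
        for (int threadNo = 0; threadNo < numThreads; threadNo++) {
          pool.emplace_back([=] {
            for (int i = threadNo; i < num_sims; i += numThreads)
              payoff[i] += 1.0;  // stand-in for the real loop body
          });
        }
        for (auto& t : pool) t.join();
      }
    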

It has nothing to do with the i7, but I would also worry about the value of
RAND_MAX. It can be as low as 32767, so if you want to pick a double from
the full range of values in the [-1,1] interval, you have to make at least
4 calls to _rand()_. Making the distribution uniform from there will take
quite a bit of work, too (there are as many double values between 1/4 and
1/2 as between 1/2 and 1, but you want to pick the former only half as
often as the latter).
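
A sketch of the bit-count argument (assuming the minimal RAND_MAX of 32767,
i.e. 15 bits per call; a double has a 53-bit mantissa, so four calls give
60 bits). This only fixes the bit count; the distribution-over-exponents
problem described above remains:

    
    
      #include <cstdlib>
      
      double rand_double_01() {
        const double range = (double)RAND_MAX + 1.0;
        double r = 0.0;
        for (int k = 0; k < 4; k++)
          r = (r + rand()) / range;  // fold in up to 15 more bits per call
        return r;  // uniform in [0,1); use 2*r - 1 for [-1,1)
      }
    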

------
acqq
The context is missing: what's allowed and what's not?

If you're allowed to write your own classes, the best thing you can do is
introduce a MyNumber abstract class, then inherit from it to make MyDouble,
then make a NumberFactory and access every double through the virtual
operators. Then let every temporary variable be created (with the new
operator!) on every loop pass, of course using the factories. Define the
operations in a way that needs new temporary MyNumbers too. That way the
creations would be hidden in the classes, and the calculation would still
look relatively innocent.

Additionally, design the classes so that they have to do runtime casts for
every operation, and give them multiple inheritance, to make figuring out
at runtime which cast works as nontrivial as possible.
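
A minimal sketch of the pattern (the operator set and any names beyond
MyNumber, MyDouble, and NumberFactory are my guesses; the real thing would
be far more baroque):

    
    
      // Every value is a heap-allocated polymorphic object, and every
      // operation allocates a fresh temporary via the factory or new.
      struct MyNumber {
        virtual ~MyNumber() {}
        virtual double value() const = 0;
        virtual MyNumber* add(const MyNumber& o) const = 0;  // new'd result
        virtual MyNumber* mul(const MyNumber& o) const = 0;  // new'd result
      };
      
      struct MyDouble : MyNumber {
        double d;
        explicit MyDouble(double d_) : d(d_) {}
        double value() const override { return d; }
        MyNumber* add(const MyNumber& o) const override {
          return new MyDouble(d + o.value());  // allocation per add
        }
        MyNumber* mul(const MyNumber& o) const override {
          return new MyDouble(d * o.value());  // allocation per mul
        }
      };
      
      struct NumberFactory {
        virtual ~NumberFactory() {}
        virtual MyNumber* make(double d) const { return new MyDouble(d); }
      };
    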

I know one commercial program where the programmers actually implemented all
this, even more convoluted, and even believed that they "improved" the code
with all these insanities.

For style points, use every possible Boost class you can to achieve the
above behavior.

~~~
colejohnson66
But would that cause cache misses? Wouldn't it just increase the time
needed to access data?

~~~
dottrap
Virtual functions seem to leave you more vulnerable to instruction cache
misses. Large enough blocks of code combined with bad branch prediction can
give you all sorts of bad performance here.

------
nkurz
I'm finding this more difficult than I expected, at least if I interpret the
problem as finding "deoptimizations" that are specific to modern Intel
processors. Finding ways to slow things down through changing the algorithm,
increasing precision, or adding unnecessary work are easy, but I haven't yet
found anything that would be a plausible optimization made by a beginning
programmer that hurts performance more on modern Intel than would be expected.

In the answers, I think Peter covers most of the likely attacks. One that
isn't mentioned is "choosing the wrong compiler". Not just compiler options,
but compiler. While there isn't a "best" compiler for everything, for a given
program often one will perform better than the others. Especially if one is
targeting recent Intel CPUs, Intel's compiler can occasionally beat GCC and
Clang by surprisingly large amounts. This appears to be one of those times.

Here are the times I see for this example on a 3.4 GHz Skylake running
64-bit Linux when compiled with different modern compilers:

    
    
      icpc 16.03:  0.87 s
      clang++ 3.8: 1.76 s
      g++ 5.3.0:   1.73 s
    

All were compiled with "-march=native -g -Ofast", which I think causes all of
the compilers to use the equivalent of "-ffast-math". At least, I'm pretty
sure icpc uses it by default, and I didn't see any difference in performance
when I added "-ffast-math" or "-funsafe-math-optimizations" to clang++ and
g++. So in this case, I (provocatively) would suggest that using clang++ or
g++ would be a subtle 2x deoptimization. There are definitely reasons to
prefer them, but for this example that choice looks to come at a large cost in
performance.

If you want to do your own comparison, Version 17 of the Intel compiler is
currently available as a free beta for Windows, Mac, and Linux:
[https://software.intel.com/en-us/articles/intel-parallel-studio-xe-2017-beta](https://software.intel.com/en-us/articles/intel-parallel-studio-xe-2017-beta).
I don't know how they are defining the license duration, but
when I downloaded yesterday it gave me a license valid until October 2016.
Separately, I'd be interested to see what MSVC does with this on a comparable
machine.

~~~
dsharlet
Without digging into it too much, I would bet that most of the improvement
from ICC comes from their awesome implementations of transcendental functions
used in this program (exp, sqrt, etc.).

Still, the general point is good. Last time I tried ICC on a heavily
vectorized (with intrinsics) program, it was a 30% boost over clang or gcc.

~~~
nkurz
This was my first guess, but as I dig a bit deeper I'm less sure. It looks
like the difference is the implementation of log and exp, but both of them are
linked to the same libm from glibc. The difference is that Intel is inlining
an AVX2 optimized version that uses FMA, but Clang and GCC are using the
compiled function from libm.

The Ubuntu packages I'm using appear to have been compiled to support AVX,
but not AVX2, and hence no FMA. So while GCC and Clang are much slower,
this is really an issue of having suboptimal system libraries. The speedup
is real, but it's not clear that either GCC or Clang is actually to blame.

~~~
slavik81
Not an expert on this, but I think the compiler has to inline the function
for that to work well. No matter how good the system library implementation
is, the function signature is defined for one scalar value at a time, which
limits vectorization.

~~~
pcordes
That's correct; I'm sure icc has vectorized versions of those functions to
inline. Each Monte Carlo iteration is independent, so the code vectorizes very
easily if you have a vectorized RNG and vectorized exp() and log(). (vector
sqrt() is available in hardware, so even gcc and clang will vectorize that,
but not exp / log even with -ffast-math)

~~~
nkurz
OK, I was able to test this more rigorously. Yes, the vast majority of the
difference between Intel and GCC is just the better implementations of exp()
and log(). These are not vector implementations that calculate multiple
results simultaneously, just scalar implementations that make better use of
the available instruction set.

When I can get g++ to use Intel's libimf instead of glibc's libm, it's only
~5% slower than icpc. I'd guess much of the remaining difference is function
call overhead vs inlining. Clang still lags by another 10%, but I think that's
probably because I haven't figured out quite the right incantation to get it
to use libimf without also disabling some other useful optimization.

Here are the commands that I ended up with:

    
    
      clang++ -fno-finite-math-only -march=native -Wall -Wextra -g -O3 \
        option.cc -o option \
        -Wl,-rpath=/opt/intel/compilers_and_libraries/linux/lib/intel64 \
        -L/opt/intel/compilers_and_libraries/linux/lib/intel64 -limf -lintlc
      
      g++ -fno-finite-math-only -march=native -Wall -Wextra -g -Ofast \
        option.cc -o option \
        -Wl,-rpath=/opt/intel/compilers_and_libraries/linux/lib/intel64 \
        -L/opt/intel/compilers_and_libraries/linux/lib/intel64 -limf -lintlc
      
      icpc -march=native -Wall -Wextra -g -Ofast option.cc -o option
    

The "-fno-finite-math-only" disables an otherwise good clang++ and g++
optimization that requires a __finite_exp() function that libimf does not
have. clang++ seems to also need to switch from -Ofast to -O3 to make this
stick.

What I don't know yet is whether recompiling glibc (or upgrading to the most
recent glibc) will produce better performance out-of-the-box on recent Intel.
My searches aren't turning up much information --- anyone know?

 _gcc and clang will vectorize that, but not exp / log_

I don't know how well it's currently working, but it looks like this might
have changed recently:
[https://sourceware.org/glibc/wiki/libmvec](https://sourceware.org/glibc/wiki/libmvec)

------
TillE
Despite the really long accepted answer, I don't see many suggestions that
actually fulfill the assignment. Playing with compiler flags feels like
cheating IMO.

Pre-generating a bunch of random numbers in a linked list would be amusingly
evil, though.
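
A sketch of that idea, borrowing gaussian_box_muller and the parameter
names from the snippet upthread: each draw lives in its own heap-allocated
list node, so the pricing loop becomes a pointer chase with poor spatial
locality.

    
    
      #include <algorithm>  // std::max
      #include <cmath>      // std::exp, std::sqrt
      #include <list>
      
      double gaussian_box_muller();  // from the original program
      
      double payoff_sum_via_list(int num_sims, double S_adjust, double v,
                                 double T, double K) {
        std::list<double> draws;
        for (int i = 0; i < num_sims; i++)
          draws.push_back(gaussian_box_muller());  // one heap node per draw
      
        double payoff_sum = 0.0;
        for (double g : draws)  // potential cache miss on every node
          payoff_sum +=
              std::max(S_adjust * std::exp(std::sqrt(v*v*T) * g) - K, 0.0);
        return payoff_sum;
      }
    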

~~~
jsnell
Yeah. I'd interpret the spirit of the question as not making the program do
more work, but just arrange things such that the operations it does end up
being more expensive. Maybe the problem is with the sample program. It just
doesn't feel like the right one for the task.

For example, if it had some actual data structures and operations on those,
rather than just arithmetic on a few stack-allocated scalar values,
introducing bad cache behavior would be a lot more plausible. Another
problem is that most of the time is going to be spent in pretty opaque
library functions (exp, log, rand) whose execution cost is hard to affect
just by rearranging data or operations.

~~~
pcordes
That's what I found when writing my answer on SO. The only justification I
could come up with for using memory was to generate all the random numbers
first, then use them later.

I did include several potential microarchitectural slowdowns, like causing
a store-forwarding stall by modifying just the high byte of a `double` with
an XOR to flip the sign bit. That isn't much of a stretch, but it will hurt
a lot.
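
Roughly what that looks like (a sketch assuming little-endian x86; the
function name is mine). The 1-byte store followed by the next full 8-byte
load of the same double is what defeats store forwarding, costing on the
order of ten extra cycles each time:

    
    
      inline void flip_sign_slowly(double* d) {
        unsigned char* p = reinterpret_cast<unsigned char*>(d);
        p[7] ^= 0x80;  // the sign bit lives in the top byte
      }
    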

Multi-threading with a shared atomic loop counter is also really good, and
maybe worse than you'd naively expect (i.e. easy to justify re: diabolically
incompetent).
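
A minimal sketch of that anti-pattern (the worker body is a stand-in):
every iteration does a contended fetch_add, so the counter's cache line
ping-pongs between cores and the "parallel" loop largely serializes. As a
bonus, the adjacent per-thread accumulators share a cache line too.

    
    
      #include <atomic>
      #include <thread>
      #include <vector>
      
      std::atomic<int> next_iter{0};  // shared loop counter
      
      void worker(int num_sims, double* partial_sum) {
        int i;
        while ((i = next_iter.fetch_add(1)) < num_sims)
          *partial_sum += i * 1e-9;  // stand-in for the Monte Carlo body
      }
      
      int main() {
        const int num_sims = 10000000;
        std::vector<double> sums(4);  // adjacent slots: false sharing
        std::vector<std::thread> pool;
        for (int t = 0; t < 4; t++)
          pool.emplace_back(worker, num_sims, &sums[t]);
        for (auto& th : pool) th.join();
      }
    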

------
SeanDav
> _" put on hold as too broad by ..."_

and yet again Stack Overflow does its best to kill a problem/discussion
with a lot of interest and interesting comments.

...sigh...

~~~
TylerH
and yet again someone doesn't understand that Stack Overflow is not a
discussion board.

~~~
SeanDav
It may not be a discussion board, but of course almost every question has
some degree of discussion. For example, even a specific question about how
to initialize a list of lists in Python has more than one answer, and there
will be discussion about the "best" method.

There is almost never only one correct answer, in which case a discussion
is needed and the answer may come down to personal opinion, as there may
not be a single "best" answer. It would depend on circumstances - more
discussion required...

------
jerryhuang100
Cue last week's HBO Silicon Valley, with Gilfoyle's efforts to 'deoptimize'
his pre-processor for their box at 200 MB/s...

Also, several days ago a joke was floating around my Facebook feed from a
Taiwanese programmer about how he gets raises every year by removing
multiple loops from his programs to show efficiency improvements.

~~~
boyter
Reminds me of this story
[http://thedailywtf.com/articles/It_Must_Be_Broken](http://thedailywtf.com/articles/It_Must_Be_Broken)

------
kardos
Increase the constants at lines 18 and 19 a bit ...

