Hacker News new | past | comments | ask | show | jobs | submit login
Deoptimizing a C++ program (stackoverflow.com)
151 points by ingve on May 22, 2016 | hide | past | web | favorite | 37 comments



Inevitably the stackoverflow mods have moved into shut this down. They should be honest and introduce a "too interesting" classification, complete with a wordy, humorless justification phrase.


I should note that this wasn't closed by moderators, but instead just normal StackOverflow users. In fact, one of the mods commented on the question before it was closed (and did not close the question).


Good point (upvoted), I would edit my comment but it is too late for that. My comment was a little snarky, but my greater point is that I don't like the excessive moderation culture that has developed there. Sure clear out the junk and spam, trim the paths and clear the weeds. But don't insist on a literal interpretation of excessively pedantic guidelines when the content is clearly of high quality and is being strongly upvoted as a reflection of that.


I don't follow what you mean, normal users are moderators, just they're just users with sufficient karma.


Stackoverflow does have a separate concept of "Moderator", which is a voted-on position and denoted by a blue diamond next to their name.


SO has very specific goals and being interesting or a forum for discussion is not one of them. While I do think they are occasionally to strict I don't think is is one such case, I would have voted to close.


I'm sort of OK with most discussion based questions and answers being shut down. This particular question is well asked, is focused (i7 pipeline de-optimizations), and has a few answers that have very specific information related to the question. There is no discussion happening in the answers. There's no way this question should be closed, and I've voted to reopen.


Perhaps instead of closing, they should just `promote' content to a discussion site.


If there is an appropriate SE site moderators can already do this.If not users can do this by copying and pasting the question into the submission box for a more appropriate site.


It was closed by regular users. If you would like to see it open, vote to reopen it.

However, the question is broad and vague, and doesn't really fit the SO model.


I'd love it if the OP would clarify what kinds of things are and aren't allowed. I voted to reopen even without that, since I thought it was an interesting question.

Every rule has exceptions, and judging from the voting clearly a lot of people are ok with allowing this specific exception to the rules.

I just wish the 100 votes were spread out over more days. I answered because I personally thought it was interesting, and I'm not really a rep whore, but still. :P


It looks open right now to me (with one vote to close).


For caching issues, you could replace the loops by more functional code: write one loop to generate an array of values, and sum them to payoff_sum in a separate loop.

  for (int i=0; i<num_sims; i++) {
    gauss_bm[i] = gaussian_box_muller();
  }
  for (int i=0; i<num_sims; i++) {
    S_cur[i] = S_adjust * exp(sqrt(v*v*T)*gauss_bm[i]);
  }
  for (int i=0; i<num_sims; i++) {
    payoff[i] += std::max(S_cur[i] - K, 0.0);
  }
  for (int i=0; i<num_sims; i++) {
    payoff_sum += payoff[i];
  }
To make this i7-specific, you have to make sure the arrays are mapped to just the right cache lines, so that writing to an array evicts the dat from the array being read from from the cache (hm, maybe that requires combining some of those loops; what does the level 1 cache of these CPUs look like?)

Next step: DRY. Move the for (int i=0; i<num_sims; i++) loop into a function taking a lambda for the loop body. That hides the ugliness of the above code. Here, it probably is fairly easy to accidentally do some extra copying of data.

Next: parallelism. You shouldn't have individual threads process contiguous sections of the arrays; do

   for (int i=threadNo; i<num_sims; i+= numThreads) {
instead. And of course, you can multi-thread by having the 4 for loops run in parallel. That, I think, would be the big gain, as it would introduce quite a few data dependencies between threads. It also is in the spirit of the question, I think.

It has nothing to do with the i7, but I would also work on worrying about the value of RAND_MAX. It can be as low as 32767. So, if you want to pick a double from the full range of values in the [-1,1] interval you have to make at least 4 calls to rand(). Making the distribution uniform from there will take quite a bit of work, too (there are as many double values between 1/4 and 1/2 as between 1/2 and 1, but you want to pick the former half as often as the latter)


The context is missing, what's allowed and what's not.

If it's allowed to make own classes, the best thing it can be done is to introduce MyNumber abstract class, then inherit it to make MyDouble, then make NumberFactory and access every double through the virtual operators. Then let every temporary variable be created (with the new operator!) on every loop pass, of course using the factories. Define the operations in a way to need the new temporary MyNumbers too. That way the creations would be hidden in the classes, the calculation would still look relatively innocent.

And additionally, make classes so that they have to do runtime casts for every operation. Make the classes to have multiple inheritance, to make sure figuring out in the runtime which cast works as nontrivial as possible.

I know one commercial program where the programmers actually implemented all this, even more convoluted, and even believed that they "improved" the code with all these insanities.

For style points, use all possible boost classes that you can to achieve the above behavior.


Your case seems to fit nicely with "Typical C++ Bullshit" by Mike Acton (and any of his Data Oriented Design talks).

https://macton.smugmug.com/Other/2008-07-15-by-Eye-Fi/n-xmKD...


But would that cause cache fails? Wouldn't that just increase the time needed to access data?


Seems like virtual functions leave you more vulnerable to instruction cache misses. Large enough blocks of code combined with bad branch prediction can leave you all sorts of bad performance here.


For really nice cache misses the arrays (explicit or implicit) are needed. Havng the allocations of the objects for calculations gives us chance for that too: we'd also need the different, crossed lifetimes of allocated objects, which can be done with "smart" pointers (the application I've mentioned used these too! That's why I've mentioned Boost).


I'm finding this more difficult than I expected, at least if I interpret the problem as finding "deoptimizations" that are specific to modern Intel processors. Finding ways to slow things down through changing the algorithm, increasing precision, or adding unnecessary work are easy, but I haven't yet found anything that would be a plausible optimization made by a beginning programmer that hurts performance more on modern Intel than would be expected.

In the answers, I think Peter covers most of the likely attacks. One that isn't mentioned is "choosing the wrong compiler". Not just compiler options, but compiler. While there isn't a "best" compiler for everything, for a given program often one will perform better than the others. Especially if one is targetting recent Intel CPUs, Intel's compiler can occasionally beat GCC and Clang by surprisingly large amounts. This appears to be one of those times.

Here are the times I see for for this example on a 3.4 GHz Skylake running 64-bit Linux when compiled with different modern compilers:

  icpc 16.03:  0.87 s
  clang++ 3.8: 1.76 s
  g++ 5.3.0:   1.73 s
All were compiled with "-march=native -g -Ofast", which I think causes all of the compilers to use the equivalent of "-ffast-math". At least, I'm pretty sure icpc uses it by default, and I didn't see any difference in performance when I added "-ffast-math" or "-funsafe-math-optimizations" to clang++ and g++. So in this case, I (provocatively) would suggest that using clang++ or g++ would be a subtle 2x deoptimization. There are definitely reasons to prefer them, but for this example that choice looks to come at a large cost in performance.

If you want to do your own comparison, Version 17 of the Intel compiler is currently available as a free beta for Windows, Mac, and Linux: https://software.intel.com/en-us/articles/intel-parallel-stu.... I don't know how they are defining the license duration, but when I downloaded yesterday it gave me a license valid until October 2016. Separately, I'd be interested to see what MSVC does with this on a comparable machine.


Without digging into it too much, I would bet that most of the improvement from ICC comes from their awesome implementations of transcendental functions used in this program (exp, sqrt, etc.).

Still, the general point is good. last time I tried ICC on a heavily vectorized (with intrinsics) program, ICC was a 30% boost over clang or gcc.


This was my first guess, but as I dig a bit deeper I'm less sure. It looks like the difference is the implementation of log and exp, but both of them are linked to the same libm from glibc. The difference is that Intel is inlining an AVX2 optimized version that uses FMA, but Clang and GCC are using the compiled function from libm.

The Ubuntu packages I'm using are compiled appear to have been compiled to support AVX, but not AVX2, and hence no FMA. So while GCC and Clang are much slower, this is really an issue of having suboptimal system libraries. The speedup is real, but it's not clear that either GCC or Clang are actually to blame.


Not an expert on this, but I think the compiler has to inline the function for that to work well. No matter how good the system library implementation is, the function signature is defined one scalar float, which limits vectorization.


That's correct; I'm sure icc has vectorized versions of those functions to inline. Each monte-carlo iteration is independent, so the code vectorizes very easily if you have a vectorized RNG and vectorized exp() and log(). (vector sqrt() is available in hardware, so even gcc and clang will vectorize that, but not exp / log even with -ffast-math)


OK, I was able to test this more rigorously. Yes, the vast majority of the difference between Intel and GCC is just the better implementations of exp() and log(). These are not vector implementations that calculate multiple results simulataneously, just scalar implementations that make better use of the available instruction set.

When I can get g++ to use Intel's libimf instead of glibc's libm, it's only ~5% slower than icpc. I'd guess much of the remaining difference is function call overhead vs inlining. Clang still lags by another 10%, but I think that's probably because I haven't figured out quite the right incantation to get it to use libimf without also disabling some other useful optimization.

Here are the commands that I ended up with:

clang++ -fno-finite-math-only -march=native -Wall -Wextra -g -O3 option.cc -o option -Wl,-rpath=/opt/intel/compilers_and_libraries/linux/lib/intel64 -L/opt/intel/compilers_and_libraries/linux/lib/intel64 -limf -lintlc

g++ -fno-finite-math-only -march=native -Wall -Wextra -g -Ofast option.cc -o option -Wl,-rpath=/opt/intel/compilers_and_libraries/linux/lib/intel64 -L/opt/intel/compilers_and_libraries/linux/lib/intel64 -limf -lintlc

icpc -march=native -Wall -Wextra -g -Ofast option.cc -o option

The "-fno-finite-math-only" disables an otherwise good clang++ and g++ optimization that requires a __finite_exp() function that libimf does not have. clang++ seems to also need to switch from -Ofast to -O3 to make this stick.

What I don't know yet is whether recompiling glibc (or upgrading to the most recent glibc) will produce better performance out-of-the-box on recent Intel. My searches aren't turning up much information --- anyone know?

gcc and clang will vectorize that, but not exp / log

I don't know how well it's currently working, but it looks like this might have changed recently: https://sourceware.org/glibc/wiki/libmvec


I use icc on my main application which is cpu limited and it halves the run time compared to either clang or gcc. Intel has some very clever complier engineers.


I added a link to this post in an update to my answer (where I added a section about causing AVX<->SSE transition stalls, which is actually highly plausible).


Despite the really long accepted answer, I don't see many suggestions that actually fulfill the assignment. Playing with compiler flags feels like cheating IMO.

Pre-generating a bunch of random numbers in a linked list would be amusingly evil, though.


Yeah. I'd interpret the spirit of the question as not making the program do more work, but just arrange things such that the operations it does end up being more expensive. Maybe the problem is with the sample program. It just doesn't feel like the right one for the task.

For example if it had some actual data structures and operations on those, rather than just arithmetic on a few stack-allocated scalar values, introducing bad cache behavior would be a lot plausible. Another problem is that most of the time is going to be spent in pretty opaque library functions (exp, log, rand) whose execution cost is hard to affect just by rearranging data or operations.


That's what I found when writing my answer on SO. The only justification I could come up with for using memory was to generate all the random numbers first, then use them later.

I did include several potential microarchitectural slowdowns, like causing a store-forwarding stall by modifying just the high byte of `double` with an XOR to flip the sign bit. That isn't much of a stretch, but will hurt a lot

Multi-threading with a shared atomic loop counter is also really good, and maybe worse than you'd naively expect (i.e. easy to justify re: diabolically incompetent).


> It just doesn't feel like the right one for the task.

> if it had some actual data structures and operations on those, rather than just arithmetic on a few stack-allocated scalar values

I fully agree, as long as the variables are on the stack and not the part of some structure not much can be done, a good compiler can even just keep them in the registers and inline the functions and remove the dead code. As soon as I'd be allowed to make my own classes and "use C++ features", I know what I'd do, as in my other comment.


I finally got around to re-ordering my answer to put the "cheating" ideas like gimping the compiler with atomic<uint64_t> on `-m32`, or atomic<long double>, at the end. I totally agree it was crap, but my answer just kind of grew that way, and left the good stuff buried.

I'm now pretty happy with my answer's presentation of some actually relevant ideas that are specific to slowing down an out-of-order pipelined CPU like Haswell. The assignment itself doesn't seem to have much scope for solving it the way it seems to be intended. There's not a lot you can do with those dependency chains that out-of-order execution isn't going to eat for breakfast.

Since the answer has gotten widely publicized, I've tried to include links for further reading on anything people might want to know more about. It's up to 28227 chars in length now, though :P


> "put on hold as too broad by ..."

and yet again StackOveflow do their best to kill a problem/discussion with a lot of interest and interesting comments.

...sigh...


and yet again someone doesn't understand that Stack Overflow is not a discussion board.


It may not be a discussion board but of course almost every question has some degree of discussion. For example, even a specific question about how to initialize a list of lists in Python has more than 1 answer and there will be discussion about the "best" method.

There is almost never only 1 correct answer, in which case a discussion is needed and the answer may require your personal opinion as there may not be a single "best" answer. It would depend on circumstances - more discussion required...


cue last week's hbo silicon valley for gilfoyle's efforts to 'deoptimize' his pre-processor for their box @ 200MB/s...

also several days ago there was a joke floating on my fb from a taiwanese programmer about how he gets raises every year by removing multiple loops in his programs to show efficiency improvement.



Increase the constants at lines 18 and 19 a bit ...




Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact

Search: