I feel like we are now living in a world in which reasoning about the performance of your code is almost impossible. Between hotspot compilers, bytecode, CPU intrinsics, vectorization, caches, pipelining, GC, branch prediction, etc., how is one supposed to reason about their code? In whatever language I'm programming, I usually just imagine what the equivalent C would look like and make performance decisions from there.
But things are so complicated nowadays that that doesn't even come close to capturing the actual performance.
Is this the only way we can get good performance? Is there an alternative paradigm of optimization in hardware or software that would make performance tradeoffs more explicit? I don't know, but the current state of things makes me uneasy because I feel I never truly understand what's going on (and I definitely don't).
Yes, and it's also complicated by the translation to micro-ops with out of order dispatch and instruction retirement. I remember seeing one statement saying this is now a complicated enough system that the number of cycles it takes to execute a block of code is literally chaotic in the sense that small changes to initial conditions cause large changes in the number of cycles to complete. A butterfly effect, possibly triggered by whatever the kernel chose to do a millisecond before your code runs.
Throw in changing clock rates based on the temperature of the CPU. If your code efficiently uses the ALUs / cores in the processor, it can get hot and slow itself down.
I don't know. In general, it's advisable to write your code first largely ignoring performance considerations, just do the simplest/most readable thing, then profile it and see where it bottlenecks.
So, that's the best general wisdom I know. Don't even bother trying to guess where you'll have performance problems. Profile it.
This just defers the optimization process by one step. It's hard to improve something without measuring it first, but either you know where your inner loop is or the profiler tells you. Now you've still got to improve that bottleneck somehow and take into account all those complicating factors.
I've seen very unexpected results doing this. In one case, I put the inner loop in a separate function to enable swapping implementations and make it simpler to time. Surprisingly, the overall program ran faster without any other changes. Just pulling the inner loop into a function of its own sped things up.
My best guess was that the compiler was willing to put more effort into optimization on the smaller function. This contradicts common wisdom about function call overhead, and it also shows another level of complexity in the problem: Some compiler optimizations are exponential in their cost, so they punt if you put too much of your code in one block or function.
Also note that I wasn't able to prove this is what happened. It's just my best guess based on what I was able to see and infer.
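For concreteness, the shape of the change was roughly the following; the names and loop body here are invented, the point is only that the hot loop ends up in its own small function the optimizer can treat as a unit:

    #include <stddef.h>

    /* Invented example, just to show the shape of the refactoring: the hot
     * inner loop is pulled out into its own small function, which also makes
     * it easy to time in isolation or to swap in another implementation. */
    static double sum_of_squares(const double *v, size_t n)
    {
        double acc = 0.0;
        for (size_t i = 0; i < n; i++)
            acc += v[i] * v[i];
        return acc;
    }

    double process(const double *v, size_t n)
    {
        /* Previously the loop body sat inline in this (much larger) function;
         * now it is just a call.  Any speedup comes from how the compiler
         * treats the small function, which this sketch cannot show by itself. */
        return sum_of_squares(v, n);
    }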
I prefer to keep the approximate performance of my code in my head as I write it, to ensure that I'm using appropriate data structures and abstractions, since it's often costly to change these things after you profile.
I had a fun accidental optimization last week: I needed to zero out a memory buffer before calling a GnuTLS function to read in data (for security reasons - we can't be sure that the callee will fill in the full buffer and don't want to leak uninitialized memory back to the client). I wondered what the performance penalty of this would be. In fact it made the code about 5-10% faster. This is on the edge of being a measurable effect - could be random variability in the test - but it certainly surprised me that it was not slower.
The non-encrypted transport case had a similar change but there was no measurable performance difference before and after.
I'd like to know what perf stats are available to show what's going on. (I guess it's an L2 cache effect)
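The change described above was essentially a memset before the read call. A minimal sketch, assuming the read goes through gnutls_record_recv (the real code may well use a different GnuTLS entry point):

    #include <string.h>
    #include <sys/types.h>
    #include <gnutls/gnutls.h>

    /* Sketch of the change described above.  gnutls_record_recv is assumed
     * here; the real code may read through a different GnuTLS function.
     * Zeroing the buffer first means that if the callee fills in less than
     * the full buffer, no uninitialized memory can leak back to the client. */
    ssize_t read_record(gnutls_session_t session, unsigned char *buf, size_t len)
    {
        memset(buf, 0, len);   /* the "accidental optimization" */
        return gnutls_record_recv(session, buf, len);
    }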
Hi, for Intel processors, the renamer can detect instructions that will always result in a zero (zero-idioms or dependency breaking idioms). As a result, the renamer will not send the instruction to the execution engine.
Basically, just think of this as repointing your registers to a register that's always zero. Note this only works for the value zero.
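In compiled-code terms, the idiom in question is the register-zeroing pattern compilers already emit; a trivial C illustration:

    /* Compilers normally zero a register with "xor eax, eax" rather than
     * "mov eax, 0".  On recent Intel cores that xor is recognized at rename
     * as a zeroing idiom: the register is effectively pointed at zero, no
     * execution port is needed, and any dependency on the register's old
     * value is broken. */
    int always_zero(void)
    {
        return 0;   /* typically compiles to: xor eax, eax ; ret */
    }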
This is one of the reasons you're supposed to wipe out sensitive data with random numbers, not zeros.
Also note that your compiler may copy data to the stack whenever it wants to. So even wiping out data with random numbers may not wipe out the copied data that lives (momentarily) on the stack.
The zeroing was somehow priming the cache. Maybe it was faster this way because all the zero writes were independent and could easily be run in parallel, whereas taking the same cache misses later, during the actual read, would cause more stalling.
Unlikely. The microcode changes are largely about disabling indirect branch prediction--virtual functions, jump tables, function returns. The branch prediction here is generally driven by simple branch history predictors (i.e., at a fundamental level, the branch predictor boils down to "guess what the last branch did"), which are way too useful to disable even to mitigate Spectre.
After running the experiment (OpenJDK 8, Linux) with some variations, I'm convinced it's a JIT optimisation and not a CPU branch-prediction effect, e.g. increasing the threshold from 128 to something beyond the 16-bit range, e.g. 100000:
if (data[c] >= 128)
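For context, the question being discussed is the classic sorted-vs-unsorted-array benchmark, originally posted with C++ and Java versions; a minimal C rendering of the same idea follows. Note that at higher optimization levels a compiler may turn the branch into a conditional move or vectorize the loop, which hides the effect.

    #include <stdio.h>
    #include <stdlib.h>

    #define N 32768

    /* Minimal C rendering of the benchmark discussed above.  With the data
     * sorted, the "data[c] >= 128" branch becomes trivially predictable and
     * the loop typically runs several times faster. */
    static int cmp_int(const void *a, const void *b)
    {
        return *(const int *)a - *(const int *)b;
    }

    int main(int argc, char **argv)
    {
        int *data = malloc(N * sizeof *data);
        for (int i = 0; i < N; i++)
            data[i] = rand() % 256;

        if (argc > 1)                    /* pass any argument to sort first */
            qsort(data, N, sizeof *data, cmp_int);

        long long sum = 0;
        for (int i = 0; i < 100000; i++)
            for (int c = 0; c < N; c++)
                if (data[c] >= 128)      /* the branch under discussion */
                    sum += data[c];

        printf("sum = %lld\n", sum);
        free(data);
        return 0;
    }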
This looks like one of those questions to which the author already knows the answer, and posts the question on an alternate account so they can get all the upvotes for their subsequent detailed explanation for the behavior of their contrived example.
You don't have to do that on StackOverflow (and other StackExchange sites). It is explicitly encouraged to answer your own questions, or put in other words: to share your knowledge in a Q&A format.
If you are on SO for the points, splitting them between two accounts seems lame. Remember that the question also collects points and badges!
I've never checked whether the user answering is the same one who asked, and a couple of times I've even caught myself trying to upvote my own answer.
Really? I think a lot of newbie programmers go through a phase where they benchmark random things. I can easily imagine 15 year old me stumbling upon this and thinking that it’s some kind of black magic.