Even if you're smarter than your compiler, it's usually not worth the readability trade-off.
These sorts of optimisations only make sense if you're doing something incredibly intensive and you've tested it on real-world data.
This is rarely the case and the code suffers.
"Source Code Optimization" => http://www.linux-kongress.org/2009/slides/compiler_survey_fe...
I just tried my age-old trick of doing bit shifts and adds instead of multiplications. Turns out that's slower on an i7!
The cool thing is that llvm-g++ actually recognizes my bad hand-optimization on -O3 and fixes it.
Code at https://gist.github.com/2896137
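For anyone curious, the pattern being benchmarked is roughly this (a sketch, not the gist's exact code):

```c
#include <stdint.h>

/* Multiply by 10 the "clever" way: x*10 == x*8 + x*2. */
static uint32_t mul10_shift(uint32_t x) {
    return (uint32_t)((x << 3) + (x << 1));
}

/* The straightforward version, which the compiler is free to
   turn into whatever is actually fastest on the target. */
static uint32_t mul10_plain(uint32_t x) {
    return x * 10u;
}
```

At -O3 a modern compiler will typically canonicalise both into the same instructions, which is the point: write the plain version and let the compiler choose.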
Or are working on something incredibly small with an incredibly stupid compiler.
For example, 99% of the time I want a binary shift and not a /2. In a dynamically typed language, 201/2 will give you 100.5 when you'd prefer to keep it an integer.
Not necessarily. Python gives 100, PHP gives 100.5, Perl gives 100.5, Ruby gives 100.
Edit: Versions are Python 2.7.3, PHP 5.3.10, Perl 5.14.2 and Ruby 1.8.7, all as shipped with Ubuntu Precise.
Lisp will give you 201/2.
The takeaway is not that "x/2 is faster than x>>1"; it's that you shouldn't assume some code is faster - always time it.
And similarly, it would be an unusual CPU indeed that managed to mess up single-bit shifts so badly that they were slower than an integer division - so unusual, in fact, that you'd probably already know you were working on it, and would mention it as part of this crazy story of the time that divisions were faster than shifts.
Perhaps in a higher-level language, one might find that x>>1 were faster than x/2, for some reason unrelated to the relative speeds of the two operations. But if one is working in C or C++ (and I assume this is the case, if one is bothering with stuff like replacing divides with shifts), and yet the shift is still the slower, then this is most unexpected, and probably evidence that something has gone wrong.
(The division algorithm is inherently more complicated than a shift, and so divisions are always going to take more time than a shift. You can expect a shift to take 1 cycle on a modern CPU. A division would take more like 20. It will probably be the slowest integer instruction possible, and it might not be accessible from every pipeline. Sample division latencies for vaguely modern CPUs: 20+ cycles on x86, ~35 cycles on MIPS, ~20 cycles on POWER, ARM doesn't seem to have it. Shifts, on the other hand - always 1 cycle. ARM even gives them away for free.)
In the end, it's the results that matter, even if you (or I) don't understand why. For all I know, the compiler parallelized some of the code with divisions and didn't for the shift-using code. The thing is, you can't just look at what "should" be fastest, and my experience was a good reminder of that.
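A minimal way to check this yourself (a sketch; the volatile sink is just to keep the compiler from deleting the loops outright):

```c
#include <time.h>

/* Time n iterations of x / 2. */
static double time_div(long n) {
    volatile long sink = 0;
    clock_t t0 = clock();
    for (long x = 1; x < n; x++) sink += x / 2;
    (void)sink;
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

/* Same, for x >> 1. */
static double time_shift(long n) {
    volatile long sink = 0;
    clock_t t0 = clock();
    for (long x = 1; x < n; x++) sink += x >> 1;
    (void)sink;
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}
```

Print both for a large n at different optimisation levels; the ranking can and does change between -O0 and -O3.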
In the old days, one optimization technique we used was to "unroll loops" i.e. instead of:
    for (int i = 0; i < 8; i++)
        a[i] += b[i];
We'd do this:
    // Unrolled loop
    a[0] += b[0];
    a[1] += b[1];
    a[2] += b[2];
    a[3] += b[3];
    a[4] += b[4];
    a[5] += b[5];
    a[6] += b[6];
    a[7] += b[7];
The unrolled loop takes fewer clock cycles, according to the book.
Great - unless the larger "faster" code no longer fits in the CPU cache, now it's actually slower. The only way to tell is to measure it.
Here is the code. Substitute ">>1" for the "/2" portions and compare.
    // pBuffer contains a single horizontal line in a bitmap.
    // destBuffer will contain the first-derivative convolution, i.e. if you
    // graph it, it would look like peaks at the edges of things in the image.
    void Convolution1D_FastFirstDerivative(int* pBuffer, int sizeOfBuffer, int* destBuffer)
    {
        int lastPos = sizeOfBuffer - 1;
        int isum = 0;

        // Special case start of buffer where filter "hangs off edge"
        isum = -pBuffer[0] / 2;          // * -0.5
        //isum += pBuffer[0] * 0;        // * 0
        isum += pBuffer[1] / 2;          // * 0.5
        destBuffer[0] = isum;

        for (int i = 1; i < sizeOfBuffer - 1; i++)
        {
            isum = -pBuffer[i-1] / 2;    // * -0.5
            //isum += pBuffer[i] * 0;    // * 0
            isum += pBuffer[i+1] / 2;    // * 0.5
            destBuffer[i] = isum;
        }

        // Special case end of buffer where filter "hangs off edge"
        isum = -pBuffer[lastPos-1] / 2;  // * -0.5
        //isum += pBuffer[lastPos] * 0;  // * 0
        isum += pBuffer[lastPos] / 2;    // * 0.5
        destBuffer[lastPos] = isum;
    }
Sorry, I wish I could fix the formatting.
Compilers actually generate (x + (x >> 31)) >> 1 for signed division by 2, which on x86 is three more instructions and requires an additional register.
Using x>>1 makes sense if you want floored division instead of truncating towards zero. Floored division is better if you want to reduce precision of the input or something like that and don't want to bias towards zero.
When you're programming a microcontroller, YOU are often the compiler, and therefore need to know the relative clock times of each instruction. I remember one chip where the division instruction could take 100 cycles while a shift took 2. Big difference there, and you'd definitely pick the shift.
One of the later comments there focused on actually disassembling the resulting code. I was surprised that the compiler did not produce the same output.
x = x / 2;
x = @ / 2;
I know, you could already do that with x /= 2. But consider if we wanted to do:
x = x / (x+1);
*(int*)(aRec[RecIndex(p->handle)]->itemCountRef) = @ / (@+1);
*(int*)(aRec[RecIndex(p->handle)]->itemCountRef) = *(int*)(aRec[RecIndex(p->handle)]->itemCountRef) / (*(int*)(aRec[RecIndex(p->handle)]->itemCountRef)+1);
x = @ + (y = @+1)
x = 1 + (@ = @ - 1)
For that matter, in the 1st C implementation the combined-assignment operators were defined this way: i =+ 1; until they noticed it was ambiguous. C has a long history of ambiguity.
So maybe it's good that C was no more ambitious than it was.
x = (@++? @ : @ /= 2);
That's a gnarly code smell and a Law of Demeter violation. I suspect you could find a better way to express that operation without requiring that a single line of your code know so much about so many different things (aRec, aRec members, the return value of RecIndex, p, aRec member -> itemCountRef).
"Apparently, many people have made the mistake of assuming that a shift right signed of k positions divides a number by 2^k"
Surprisingly, the top few answers do not give the correction for negative numbers, which in "Hacker's Delight", is to add 2^k-1 to negative numbers before shifting.
Obviously, this ignores optimization and readability arguments, which although important, are probably less important than correctness.
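The correction can be sketched like this (a branching version; Hacker's Delight also gives a branch-free form built from the sign bit):

```c
/* Signed division by 2^k via shift, truncating toward zero like C's
   division operator: add 2^k - 1 to negative inputs before shifting. */
static int div_pow2(int x, int k) {
    if (x < 0)
        x += (1 << k) - 1;
    return x >> k;  /* arithmetic right shift assumed */
}
```

Without the correction, -7 >> 1 gives -4 while -7 / 2 gives -3; with it, div_pow2(-7, 1) == -3.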
x = x / 2
To DIVIDE, use the DIVISION operator.
This is a case where shifting is worthwhile.
pow is usually a floating-point operation; that's why it's slow. Modern CPUs may actually be fast at a floating-point pow, but it's still expensive to move integers from general-purpose registers to floating-point registers.