

Efficient C Tip – use the modulus (%) operator with caution - rcfox
http://embeddedgurus.com/stack-overflow/2011/02/efficient-c-tip-13-use-the-modulus-operator-with-caution/

======
DarkShikari
Something about this article bugs me. Division and modulus by constants can be
done very quickly with multiply, shift, and add (Hacker's Delight has a whole
chapter on how this works). Furthermore, _the compiler generally does this
automatically_. The clock timings here look rather extreme -- as if the
compiler's built-in integer divide/modulus routine were being called (many of
these chips don't have integer divide units).

I'm generally not one to trust compilers, but in this case either these
compilers are very, very bad or the author didn't have optimization turned on.
Of course, the author doesn't post the disassembly or even say which compilers
he's using, so we'll never know!

Edit: in fact, even without optimization, GCC uses this trick. Here's how gcc
4.3.4 compiles _unsigned int foo(unsigned int a) {return a%60UL;}_ on x86_32,
for example (I don't have an ARM cross-compiler handy):

With optimization:

    
    
            movl    4(%esp), %ecx
            movl    $-2004318071, %edx
            movl    %ecx, %eax
            mull    %edx
            shrl    $5, %edx
            imull   $60, %edx, %edx
            subl    %edx, %ecx
            movl    %ecx, %eax
            ret
    

Without optimization:

    
    
            subl    $8, %esp
            movl    12(%esp), %ecx
            movl    $-2004318071, (%esp)
            movl    (%esp), %eax
            mull    %ecx
            movl    %edx, %eax
            shrl    $5, %eax
            movl    %eax, 4(%esp)
            movl    4(%esp), %eax
            leal    0(,%eax,4), %edx
            movl    %edx, %eax
            sall    $4, %eax
            subl    %edx, %eax
            movl    %ecx, %edx
            subl    %eax, %edx
            movl    %edx, 4(%esp)
            movl    4(%esp), %eax
            addl    $8, %esp
            ret
    

Note how neither series of instructions calls integer divide!
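
For anyone who would rather read the trick in C than in assembly, here is a minimal sketch of what the listings above compute. The function name `mod60_mulshift` is made up for illustration. The constant 0x88888889 is the unsigned value of the `$-2004318071` immediate, and the shift of 37 is the 32 bits dropped by keeping only the high half of the product plus the `shrl $5`.

```c
#include <stdint.h>

/* Hypothetical helper: a % 60 without a divide instruction.
 * 0x88888889 == ceil(2^37 / 60), so the upper bits of the 64-bit
 * product give the quotient a / 60 for every 32-bit a. */
static uint32_t mod60_mulshift(uint32_t a)
{
    uint32_t q = (uint32_t)(((uint64_t)a * 0x88888889u) >> 37); /* q = a / 60 */
    return a - q * 60u;                                         /* r = a - 60 * q */
}
```

On a target with no 32x32->64 multiply this is probably not a win as written; it is only meant to show what the x86 code above is doing.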

~~~
gte910h
gcc is a much better compiler than many embedded compilers.

Many of the chips he's talking about are tiny, underpowered things with
bad-to-middling compilers.

~~~
pmjordan
If that's the case the whole thing becomes a game of figuring out what
optimisations the compiler _does_ make, and doing the rest yourself. Even in
this light, there's not an awful lot to be learned from the article - no
disassembly in sight, and no mention from which operations the division is
synthesized. This is cargo cult optimisation of the worst kind.

~~~
gte910h
>If that's the case the whole thing becomes a game of figuring out what
optimisations the compiler does make, and doing the rest yourself.

You just described much of embedded programming there.

>This is cargo cult optimisation of the worst kind.

Really? He tested and measured a few different approaches. Do you think he's
advocating just using his approaches without measurement?

>no disassembly in sight, and no mention from which operations the division is
synthesized

I'd say most of this stuff is written for people who wouldn't really learn
much from a disassembly. They may work with 3-5 different families of chips,
all with very different (and sometimes moderately strange) instruction sets.
They mostly chose the chip due to cost or power consumption, and it's a tiny
tiny portion of their design. They're trained in EE, not programming, and do
enough to get through their design.

Debugging ARM or MIPS assembly is considerably different from debugging x86.

~~~
nitrogen
I think I'd probably prefer working with either ARM or MIPS assembly over x86,
except for the lack of some of x86's built-in arithmetic instructions (like
divide). As long as you've got an instruction set reference card handy
(preferably in PDF format), it's not too difficult to work with any assembly
language (I've done debugging in ARM and x86, took a class on MIPS, and wrote
entire firmwares in 8-bit PIC assembly).

I was hunting through disassembled ARM code for performance bottlenecks and
managed to speed things up quite a bit by manually removing divides that
should've been caught by the optimizer from my C code. I didn't find it any
more or less difficult than any other assembly language.
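
As an illustration only (this is a guess at the kind of rewrite meant, not the actual code), one common way to remove a divide by hand is to replace a wrap-around modulus with a compare. `BUF_LEN` and both function names are hypothetical.

```c
#include <stdint.h>

#define BUF_LEN 60u  /* hypothetical buffer length, not a power of two */

/* Advance a ring-buffer index with a modulus (may pull in a divide). */
static uint32_t next_index_mod(uint32_t i)
{
    return (i + 1u) % BUF_LEN;
}

/* Same result for 0 <= i < BUF_LEN, using only an add and a compare. */
static uint32_t next_index_cmp(uint32_t i)
{
    i++;
    return (i >= BUF_LEN) ? 0u : i;
}
```

The two only agree while the index stays in range, so the compare version relies on the caller never passing an index of `BUF_LEN` or more.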

------
tspiteri
Article: "Well a little thought shows that C = A % B is equivalent to C = A -
B * (A / B). In other words the modulus operator is functionally equivalent to
three operations."

The integer divide instruction returns both the quotient and the remainder on
the processors I know of. I can't see what sense it makes to have a modulo
operator that is more expensive than a division operator.
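
The identity quoted from the article can be written out as a short sketch. The struct and function names are invented for illustration, and `b` is assumed to be non-zero. The point is that once the quotient exists, the remainder needs only a multiply and a subtract, not a second division.

```c
#include <stdint.h>

/* Hypothetical result type holding both halves of a division. */
struct qr {
    uint32_t quot;
    uint32_t rem;
};

/* One division yields both results (b must be non-zero). */
static struct qr divmod_u32(uint32_t a, uint32_t b)
{
    struct qr out;
    out.quot = a / b;             /* the one expensive operation */
    out.rem  = a - b * out.quot;  /* same value as a % b */
    return out;
}
```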

~~~
pmjordan
Many processors don't have an (integer) division instruction, however,
particularly those in the embedded world. That said, I share DarkShikari's
concerns.

~~~
tspiteri
I am not saying that division is not expensive, or that the modulo operator is
not just as expensive. I am only pointing out that the modulo operator is not
more expensive than division.

~~~
pmjordan
Agreed, _if there is a dedicated instruction available_ [1], and _if the
expression can't be reduced to multiplications/shifts/addition due to division
by a constant_. None of the 3 architectures mentioned in the article have a
division instruction, and the denominators _are_ constant. So this is probably
the least of the problems with this article, considering the audience is
embedded programmers whose code probably never runs on x86, POWER or SPARC
processors.

Where you're _not_ using the division instruction, modulo can actually be
_faster_ than division itself. The extreme case is a power-of-2 divisor, of course; on
x86, a right shift is vastly more expensive than a bitwise AND, for example.
Other denominators don't produce quite so drastic differences. In any case,
once you've done the division, the remainder is practically free; the article
is basically an illustration of how bad the compilers in question are at
spotting this fact. (though not mentioning which compilers and with which
flags are affected makes it pretty useless)

[1] It's technically _conceivable_ that a superscalar, microcoded processor
might actually detect whether the quotient, the remainder or both are actually
used later in the code during the dependency check, and produce 3 different
sets of microcode for these situations. I'd be surprised if any actually did
this, though.
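
To make the power-of-2 case above concrete, here is a minimal sketch for unsigned operands and a divisor of 16 (the function names are made up):

```c
#include <stdint.h>

/* For unsigned x and the power-of-two divisor 16: */
static uint32_t div16(uint32_t x) { return x >> 4; }   /* same as x / 16 */
static uint32_t mod16(uint32_t x) { return x & 15u; }  /* same as x % 16 */
```

This only holds as-is for unsigned values. For signed operands C division truncates toward zero, so the compiler has to add a fix-up for negative inputs. On chips without a barrel shifter, the multi-bit shift in `div16` is where the extra cost would come from.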

~~~
tspiteri
_... if the expression can't be reduced to multiplications/shifts/addition due
to division by a constant._

Ah, I missed that. Even on normal processors, for the following:

    
    
        int div60(int i) { return i / 60; }
        int mod60(int i) { return i % 60; }
    

gcc uses _one_ multiplication and some shifts/additions for div60() and _two_
multiplications and some shifts/additions for mod60().

------
spc476
I'm surprised he didn't look into the use of the div() function (it's a
standard C function) that returns both the quotient and remainder, since he's
using both results anyway.
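
A minimal sketch of what that could look like, assuming the job is splitting a count of seconds into minutes and seconds (`split_seconds` is a hypothetical name; `div()` and `div_t` come from `<stdlib.h>`):

```c
#include <stdlib.h>

/* Hypothetical helper: split a second count into minutes and seconds. */
static void split_seconds(int total, int *minutes, int *seconds)
{
    div_t d = div(total, 60);  /* standard C: quotient and remainder together */
    *minutes = d.quot;
    *seconds = d.rem;
}
```

Whether this beats separate `/` and `%` depends on the toolchain, since `div()` may be an out-of-line library call, so it would need measuring just like the article's other variants.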

------
georgecmu
_No that isn’t a misprint. The ARM was nearly two orders of magnitude more
cycle efficient than the MSP430 and AVR. Thus my claim that the modulus
operator can be very inefficient is true for some architectures – but not
all._

The first thing a real embedded guru would do is look at the assembly listing.
My guess is that ARM actually provides operations for integer division and
modulus, while for AVR they're implemented as low-level gcc library routines.

~~~
dumael
ARM doesn't provide a modulus instruction on any chip; division is available
only on ARMv7-R and ARMv7-M chips.

~~~
georgecmu
Well, once you have division, modulus is one cycle away.

