
Optimizing loops in C for higher numerical throughput and for fun - majke
http://www.lshift.net/blog/2013/11/27/optimizing-loops-in-c-for-higher-numerical-throughput-and-for-fun
======
RyanZAG
I find it pretty interesting that you see a lot of people say 'within an order
of magnitude of C' when talking about the performance of dynamic languages,
JITed VMs, etc., but they rarely take this kind of thing into account. If you
wrote this in Python, you'd be trying to get your implementation close to that
original number - 1733.4 MFLOPS - with little discussion of the real target
being 17985.4 MFLOPS, which takes very specific coding. I think a lot of the
code we write today is more inefficient than we admit.

~~~
tmoertel
Yes. And for the same reason, you sometimes have to do things in assembly. No
compiler is going to go against its own grain when generating code, yet
sometimes that's what you must do to eke out the necessary performance.

As an example, while writing code for a 6809-based video game (see [1] for the
full war story), I had to use every single register on the processor, including
the system-stack register and the direct-page register, to speed up a graphics
routine. This meant that, while my code was running, the system effectively
had no stack and couldn't use certain addressing modes. It also meant that
interrupts would corrupt whatever memory the stack register happened to be
pointing to.

All these problems were solvable – and the pain we had to go through to solve
them was, for us, absolutely worth the 70% speed-up it bought us – but no
compiler writer is going to incorporate tactics this weird into a compiler's
optimization logic. If you need to go "full weird," assembly is often the only
way.

[1] [http://blog.moertel.com/posts/2013-12-14-great-old-timey-gam...](http://blog.moertel.com/posts/2013-12-14-great-old-timey-game-programming-hack.html)

------
pcwalton
Interestingly, Rust does not have the pointer aliasing hazard in the third
example: the compiler ensures that all mutable array slices cannot overlap,
unlike C, and so the introduction of the temporary should not be necessary.
(Unfortunately we don't communicate that information to LLVM today, and we'd
also have to optimize out the bounds checks, so more work is required...but
from a language design perspective I think we're good.)

~~~
sharpneli
C99 and later provide the restrict keyword, which tells the compiler that the
arrays do not overlap.

It pretty much solves the whole issue.
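
Roughly, a minimal sketch of the idea (the function and names here are just
illustrative, not the article's code):

    /* restrict promises the compiler that out and in never alias, so it can
       keep values in registers and vectorize without re-loading from memory. */
    void scale(float *restrict out, const float *restrict in, float k, int n)
    {
        for (int i = 0; i < n; i++)
            out[i] = k * in[i];
    }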

~~~
pcwalton
That's true, although Rust provides for sound enforcement (so if you try to
alias the equivalent of two restrict pointers the compiler will catch it at
compile time).

(restrict is also not technically part of C++, although in practice the same
effect can be achieved through compiler extensions.)

~~~
pbsd
C++'s standard library has valarray, which is supposed to avoid aliasing
effects, and is allowed to use expression templates and other tricks.
Something like:

    
    
        float dot(std::valarray<float> const& x)
        {
            return (x*x).sum();
            // Alternative: inner_product(begin(x), end(x), begin(x), 0.0f);
        }
    

should, in theory, achieve the same kind of performance as the C version (but
none of this is guaranteed, sadly). A smart implementation would actually end
up calling BLAS routines, which is what libraries like Eigen do.

Which reminds me: does Rust have infrastructure that would allow something
like expression templates to exist?

~~~
pcwalton
I believe you can use Rust traits to the same effect, although I haven't tried
it, so you might hit problems.

------
the_af
This is interesting, but the first comment on that site is apt:

> Meanwhile, the Fortran programmer writes y = dot_product (x, x) and moves on
> to the interesting bits. Plus, if auto-parallelization is on, or if this is
> in an OpenMP workshare section…

This is titled "optimizing loops in C" but this level of optimization is
actually programming in assembly language, specifically x86 with SSE
extensions.

~~~
stephencanon
The C programmer will actually just write y = cblas_sdot(n, x, 1, x, 1) and
move on to the interesting bits.
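
Just as a sketch (assuming a CBLAS implementation is available and linked in),
that's all it takes:

    #include <cblas.h>

    /* Sum of squares of x, expressed as the dot product of x with itself. */
    float sumsq(const float *x, int n)
    {
        return cblas_sdot(n, x, 1, x, 1);
    }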

However, someone had to implement dot_product and cblas_sdot at some point
(either in the compiler or in the library), and they need to be rewritten from
time to time for new architectures. More to the point, most programmers, most
of the time, aren't _just_ doing a dot product. They're doing some other more
sophisticated computation for which these techniques may be quite relevant.
The dot product is just a convenient example.

~~~
the_af
Agreed on both counts: someone had to write those library functions, and this
was just an example.

However, the author starts by talking about a "holy war between C and Fortran"
and then proceeds to write... well, assembly language using the C compiler. So
the summary could be "assembly language can be made more efficient than
Fortran, and C lets you coax the compiler into writing the assembly language
you want". I guess this could be seen as a win _for C_, but I'm not so
sure...

------
eliteraspberrie
I really enjoy reading tips like this. The mnemonic I use for looping over
multi-dimensional arrays in C is: the right-most index should be the inner-most
loop (C arrays are row-major, so this keeps the memory accesses sequential):

    
    
        for (i = 0; i < M; i++) {
            for (j = 0; j < N; j++) {
                x[i][j] = ...;
            }
        }

------
shared4you
Doesn't gcc's "-funroll-loops" do the same thing? This looks like more manual
work to me, unrolling loops by hand.

~~~
gillianseed
It does but the heuristics for loop unrolling without any runtime data as
basis is very difficult and thus very much a hit or miss affair (and missing
is expensive) which is why no compiler I know of (GCC included) enables
-funroll-loops or equivalent by default in any of the standard optimization
levels (-On).

In GCC, the only option which enables -funroll-loops (apart from explicitly
enabling it) is -fprofile-generate which is GCC's profile guided optimization.

The reason it enables -funroll-loops is that since it gathers runtime
statistics during the profiling run it has enough information to accurately
perform loop unrolling without risking performance degradation.
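
For illustration, roughly what unrolling means when done by hand (just a
sketch, assuming n is a multiple of 4):

    /* Four independent accumulators: fewer branch checks per element and more
       instruction-level parallelism, at the cost of larger code. */
    float sum_unrolled(const float *x, int n)
    {
        float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        for (int i = 0; i < n; i += 4) {
            s0 += x[i];
            s1 += x[i + 1];
            s2 += x[i + 2];
            s3 += x[i + 3];
        }
        return (s0 + s1) + (s2 + s3);
    }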

------
pjmlp
Yeah nice exercise, except Fortran compilers get to optimize the loops without
requiring so much help from the developer.

~~~
stephencanon
If you declare your pointers with the restrict keyword and tell the compiler
to allow reassociation of floating-point additions, it will perform exactly
the same optimizations, without any hand-holding from the programmer. In some
C compilers, these are even the defaults.
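
A rough sketch of that point (the names are mine; the flag shown is one way to
allow reassociation, e.g. gcc -O3 -ffast-math):

    /* With restrict-qualified pointers and reassociation permitted, the
       compiler is free to vectorize and unroll this reduction on its own. */
    float dot(const float *restrict a, const float *restrict b, int n)
    {
        float sum = 0.0f;
        for (int i = 0; i < n; i++)
            sum += a[i] * b[i];
        return sum;
    }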

------
rsp1984
I find the example incredibly confusing. I expected this to be an optimization
of a dot product or matrix multiplication or similar, but as it stands it
seems to be an element-wise square of a VECTOR_LEN * N matrix with subsequent
accumulation of columns (not the most common operation in typical signal
processing).

It should also be noted that this can be implemented trivially in direct SIMD
using the GCC/clang vector extensions, without having the compiler 'guess' the
SIMD part, which by the way would also make the code much easier to read.
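
Something along these lines (just a sketch; the type and function names are
mine, not the article's):

    /* GCC/clang vector extension: a 4-wide float vector with element-wise
       operators, no intrinsics and no compiler "guessing" required. */
    typedef float v4sf __attribute__((vector_size(16)));

    v4sf square_and_accumulate(const v4sf *x, int n)
    {
        v4sf acc = {0.0f, 0.0f, 0.0f, 0.0f};
        for (int i = 0; i < n; i++)
            acc += x[i] * x[i];   /* element-wise square, column-wise sum */
        return acc;
    }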

------
poulson
Any idea what optimization flags were used? It's rather strange that they're
not reported. I would be surprised if GCC 4.8 was that far from optimal with
-O3.

~~~
nkurz
Do you say this as someone familiar with assembly and GCC? My usual guess
would be that you can often hope for a 50% speedup in a tight loop by dropping
from C to assembly, and that a 2x speedup over GCC is not uncommon.

The original author's code isn't available for this example, but I put
together something I think is comparable. I may still have silly bugs, but my
initial results on Sandy Bridge are something like:

    
    
      icc 13.0.1 -O3 -march=native -fno-inline wrong-loop: 1.35 s
      icc 13.0.1 -O3 -march=native -fno-inline right-loop: 0.78 s
      icc 13.0.1 -O3 -march=native -finline-functions wrong-loop: 0.22 s
      icc 13.0.1 -O3 -march=native -finline-functions right-loop: 0.22 s
    
      gcc 4.8.0  -O3 -march=native -fno-inline wrong-loop: 1.38 s
      gcc 4.8.0  -O3 -march=native -fno-inline right-loop: 1.14 s
      gcc 4.8.0  -O3 -march=native -finline-functions wrong-loop: 1.35 s
      gcc 4.8.0  -O3 -march=native -finline-functions right-loop: 1.14 s
    

There are all sorts of things I might be doing differently (or wrong), but I'm
printing out a total-of-totals so I know it's at least going through the
loops. It's possible a fast-math optimization is involved, but I wouldn't bet
on GCC -O3 being close to optimal.

~~~
exDM69
> My usual guess would be that you can often hope for a 50% speedup in a tight
> loop by dropping from C to assembly

The problem with inline assembler is that it is almost untouchable by the
optimizer. By adding some inline asm, you may inhibit a lot of optimization
that could give better performance overall.

For this kind of task it is often much better to use intrinsics (e.g.
xmmintrin.h for SSE) or compiler extensions such as
__attribute__((vector_size(16))). This way you can utilize the CPU features
you have available while still allowing the optimizer to do high-level
optimizations.
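
For example, with SSE intrinsics (a sketch only; the function name and the
assumption that n is a multiple of 4 are mine):

    #include <xmmintrin.h>  /* SSE intrinsics */

    /* Sum of squares of a float array using SSE; the compiler can still
       inline, schedule and unroll around this. */
    float sum_of_squares_sse(const float *x, int n)
    {
        __m128 acc = _mm_setzero_ps();
        for (int i = 0; i < n; i += 4) {
            __m128 v = _mm_loadu_ps(x + i);            /* load 4 floats */
            acc = _mm_add_ps(acc, _mm_mul_ps(v, v));   /* accumulate v*v */
        }
        float tmp[4];
        _mm_storeu_ps(tmp, acc);                       /* horizontal sum at the end */
        return tmp[0] + tmp[1] + tmp[2] + tmp[3];
    }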

~~~
nkurz
While there is lots to be said for the maintainability of intrinsics, I have
found inline assembly to be significantly better for performance. And this is
precisely because it inhibits the compiler from blindly performing
'optimizations' in the section of code you've already optimized. This thread
offers an example and some numbers: [http://software.intel.com/en-us/forums/topic/480004](http://software.intel.com/en-us/forums/topic/480004)

------
quentusrex
Was anyone able to locate the speed of the Fortran implementation on the same
hardware? It'd be quite interesting to see a similar optimization approach
done by someone experienced in Fortran, with results from the simple
implementation to more optimized ones.

