
Beating the compiler - snoopybbt
http://www.roguelazer.com/2015/02/beating-the-compiler/
======
st3fan
> What's the point of micro-optimizing a 3ms function call when each request
> spends 8 or 9 seconds inside the SQLAlchemy ORM? Well, sometimes it's nice
> to practice those optimization skills anyway

I have a different opinion on those 3ms optimizations:

"It all adds up"

Stopping at one 3ms optimization is not going to move mountains. But doing
that 10 times is already 30ms. Imagine that you found 100 micro optimizations.
Now we are talking!

So keep calm, optimize on. This is how compilers get better over time.

~~~
NeutronBoy
I'm a massive fan of this sort of 'large-scale micro' optimization. There's a
big difference between premature optimization, and having heaps of little
sticking points in your project.

One of my favorite examples is Ubuntu - they run their 'One Hundred Papercuts'
program [1], which is about fixing and optimizing little things that on their
own would never get fixed, but together they are more than the sum of their
parts in terms of user experience.

[1]
[https://wiki.ubuntu.com/One%20Hundred%20Papercuts](https://wiki.ubuntu.com/One%20Hundred%20Papercuts)

~~~
barbs
My favourite recent example was SQLite, where hundreds of micro-optimizations
led to version 3.8.7 being 50% faster than 3.7.17.

[http://permalink.gmane.org/gmane.comp.db.sqlite.general/9054...](http://permalink.gmane.org/gmane.comp.db.sqlite.general/90549)

~~~
panic
Here's the associated HN discussion thread:
[https://news.ycombinator.com/item?id=8420274](https://news.ycombinator.com/item?id=8420274)

------
rogerbinns
Note that the Python code is not limited to 32 bits and will automatically
promote to bignum (aka long in py27).

    
    
        >>> l = [0xffffffff, 17]
        >>> sum(l)
        4294967312
    

C based code will silently overflow/truncate, giving 16 in the example above.
The author handwaved all this away, but it does show the usual tradeoffs
between right/robust answers and fast answers.

~~~
legulere
Signed integer behaviour is undefined on overflow in C.

~~~
anewhnaccount
But not on specific compilers with specific settings on specific machines like
were used for the benchmarks results. Execution time is also undefined by the
C standard.

~~~
MaulingMonkey
"Undefined behavior" is a specific term used by the C and C++ standards. What
you're describing sounds more like "unspecified behavior", something
_entirely_ different.

 _Undefined_ behavior allows for aggressive optimizations where the compiler
assumes the integer overflow can never happen, leading to such wonderful bugs
as this:

[https://gcc.gnu.org/bugzilla/show_bug.cgi?id=30475](https://gcc.gnu.org/bugzilla/show_bug.cgi?id=30475)

Some compilers may offer specific settings, such as fwrapv, which give you a
way to define the behavior of integer overflow. (Although last I read fwrapv
was broken.) I see no mention of any such specific setting in the article. If
you've spotted one I've missed, please call it out more specifically.

------
gridspy
The first instance of 8 way unrolling

    
    
        loop:
            acc += A
            acc += B
            ...
    

was inefficient because each add instruction is dependent on the one before
it. The CPU has to pause constantly to wait for the result in acc in order to
continue.

The correct way to do this is to have 8 accumulators (or whatever the loop
unrolling depth is) and then to sum those together at the end. This helps to
keep the processor's pipelines full.

    
    
        loop:
            acc1 += A
            acc2 += B
            ...
    
        acc = acc1 + acc2
    

The author's use of SIMD instructions, spread across multiple registers, is
better still. However, intrinsics would have been far more readable than
hand-written assembly.

For further speed improvements, streaming (non-temporal) memory intrinsics
could be useful, since the data is read only once and never written. OpenMP
would also be a good fit here for multithreading.

------
richardwhiuk
Of course, the next step is obvious - work out why the compiler didn't do a
four way avx unroll, and then submit a bug fix to clang to make it do that.
That way all of your future code benefits from your single micro-optimization.

It's also possible that you find out that if you enable --generate-for-haswell
or some other arcane compiler flag, it'll do it for you.

~~~
pbsd
All the author had to do was add '-march=native' or '-march=core-avx2' to
the compiler command line: [http://goo.gl/H4f62I](http://goo.gl/H4f62I)

~~~
jevinskie
Clang 3.7.0 (experimental) + -march=skylake gives you AVX512. zmm all the way,
baby! 256 bytes processed in the inner loop!

~~~
pantalaimon
But what gives me a Skylake CPU?

~~~
duskwuff
A time machine, or a job working at Intel? :)

------
yoklov
The reason to use intrinsics and inline assembly (actually, the latter is
pretty rare these days, intrinsics being much more common) isn't only about
beating the compiler.

When you're relying on the compiler to vectorize, you run the risk of a
subtle, innocuous change to the code breaking the vectorization -- and this
will happen a lot. Also, when you target multiple compilers, it's very
difficult to get reliable performance across all of them, unless you do the
vectorizing yourself.

Not to mention, compilers tend to do great on simple test cases like these,
but totally barf as soon as the loop becomes more complex (Try adding some
conditionals to the loops some time... It's not that these loops can't be
vectorized, it's just that the compiler doesn't know how).

To get the best performance out of vectorization, it's mostly about organizing
the data so that it can be easily vectorized. If you've gone through this
work, it's fairly pointless not to take the extra effort to guarantee that
you're getting the performance you expect.

~~~
lmm
Rather than inline ASM, wouldn't it be better to have a marker that says
"error if this can't be vectorised"? Something analogous to Scala's @tailrec.

~~~
yoklov
Again, inline ASM is pretty rare these days (when we do use it, it isn't for
SIMD). Intrinsics are much more common.

The big issue (aside from convincing MSVC to implement it ;) with your
suggestion is that, unlike TCO, vectorization isn't really a boolean. There's
a range of what vectorization might mean (you can vectorize code and do a bad
job with it, only marginally beating out the scalar code), so you'd still need
to check the generated code for the situations where you care.

And honestly, it's not worth the effort. Vectorization shouldn't be as scary
as it is for most programmers. Once you get the hang of how to do it, it's not
bad at all. We write a lot of SIMD code at work, and 'difficulty of writing
SIMD code' isn't a big issue for us. Honestly, it's kind of fun, a bit like
solving a puzzle (an optimization puzzle, something like SpaceChem or
Infinifactory).

Now, a situation where it might be a win is if you have a lot of different
platforms you need vectorized code for... but in my experience you're probably
better off doing it by hand unless this is a huge number.

~~~
blt
Writing SSE code using compiler intrinsics is indeed a fun puzzle, but it has
huge drawbacks: 1) it's Intel-specific, and 2) it's a maintenance risk unless
everyone in the shop knows how to write and maintain SSE code. Unfortunately
nobody else at my job knows how to do it, so I am not allowed to check any in
:(

~~~
yoklov
> it's Intel-specific

Most platforms offer intrinsics.

> it's a maintenance risk unless everyone in the shop knows how to write and
> maintain SSE code.

This is understandable, but not the case where I work. If you need to be
writing SIMD code and this is the case, then you need to hire programmers who
can do it. That or convince them to learn, as (again) it's not that hard.

------
fijal
Er, the author missed a crucial point of "just" using PyPy: for me it's over a
24x speedup over standard Python (if you run it enough times for the JIT to
warm up). Know your tools.

~~~
cge
For that matter, his numpy code, which wasn't described until the comments,
was terrible. Unlike all the other versions, it included a complete
reallocation, conversion and copying step, which accounted for almost the
entire processing time.

People in the comments who tried a reasonable numpy version found it to be
around the same speed as the unoptimized C version.

------
angersock
On the general topic of how much we can rely on compilers, even using old and
well-understood languages (C++), consult Mike Acton:

[http://www.slideshare.net/cellperformance/gdc15-code-clinic](http://www.slideshare.net/cellperformance/gdc15-code-clinic)

[https://www.youtube.com/watch?v=rX0ItVEVjHc](https://www.youtube.com/watch?v=rX0ItVEVjHc)

~~~
Dewie
C++ is well-understood? I don't get that impression from people who know it.

~~~
golergka
C++ is well-understood to not be well-understood.

------
billpg
"Our problem will be a very simple one: sum a list of n integers. To make this
fair, we will assume that the integers all fit in a 32-bit signed int, and
that their sum fits in a 32-bit signed int."

What if my list is: [2000000000, 2000000000, -2000000000, -2000000000]

~~~
ajuc
It will wrap around both ways and correctly return 0. At least it does on my
compiler; I never remember what is part of standard C and what is an accident
of implementation.

But it won't work for {2000000000, 2000000000}, as stated in the article.

------
spiznnx
Why stop at -O3? If you're going to compare with vectorized asm, you can get
the compiler to use avx vector instructions in its optimizations. Add -mavx to
your flags and the vector instructions will show up there as well.

------
frozenport
>>Cray 6600

Should be Seymour Cray's 6600, as it was sold by CDC [1].

[http://en.wikipedia.org/wiki/CDC_6600](http://en.wikipedia.org/wiki/CDC_6600)

------
im2w1l
Is there a library with "optimal" implementations of functions such as these?

~~~
panic
OS X and iOS have Accelerate.framework:
[https://developer.apple.com/library/mac/documentation/Accele...](https://developer.apple.com/library/mac/documentation/Accelerate/Reference/AccelerateFWRef/)

In particular, see vDSP_sve and vDSP_sveD which compute single-precision and
double-precision vector sums respectively.

