Beating the compiler (roguelazer.com)
116 points by snoopybbt on March 22, 2015 | 43 comments



> What's the point of micro-optimizing a 3ms function call when each request spends 8 or 9 seconds inside the SQLAlchemy ORM? Well, sometimes it's nice to practice those optimization skills anyway

I have a different opinion on those 3ms optimizations:

"It all adds up"

Stopping at one 3ms optimization is not going to move mountains. But doing that 10 times is already 30ms. Imagine that you found 100 micro optimizations. Now we are talking!

So keep calm, optimize on. This is how compilers get better over time.


I'm a massive fan of this sort of 'large-scale micro' optimization. There's a big difference between premature optimization, and having heaps of little sticking points in your project.

One of my favorite examples is Ubuntu - they run their 'One Hundred Papercuts' program [1], which is about fixing and optimizing little things that on their own would never get fixed, but together they are more than the sum of their parts in terms of user experience.

[1] https://wiki.ubuntu.com/One%20Hundred%20Papercuts


My favourite recent example was SQLite, where hundreds of micro-optimizations led to version 3.8.7 being 50% faster than 3.7.17.

http://permalink.gmane.org/gmane.comp.db.sqlite.general/9054...


Here's the associated HN discussion thread: https://news.ycombinator.com/item?id=8420274


The cost of maintaining your optimized code adds up too. As does the risk that you introduced a bug. And this takes time away from newer features that customers might be willing to pay money for.

Occasionally performance is a feature, but a lot of times it's just an excuse for developers to have fun writing assembly.


> Stopping at one 3ms optimization is not going to move mountains. But doing that 10 times is already 30ms. Imagine that you found 100 micro optimizations. Now we are talking!

300ms still doesn't put a dent in 8 or 9 seconds.

> So keep calm, optimize on. This is how compilers get better over time.

So work with the compiler rather than against it. As others have pointed out, with the correct -march setting you get code that's just as good as this hand-tuned assembly and much more maintainable; the same compiler setting will also improve the rest of your code.


Note that the Python code is not limited to 32 bits and will automatically promote to bignum (aka long in py27).

    >>> l=[0xffffffff, 17]
    >>> sum(l) 
    4294967312
C-based code will silently overflow/truncate, giving 16 in the example above. The author handwaved all this away, but it does show the usual tradeoffs between right/robust answers and fast answers.
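
For comparison, a minimal C sketch of that truncation (an illustrative snippet, not from the article; it uses unsigned 32-bit arithmetic, since signed overflow is undefined):

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        uint32_t l[] = {0xffffffffu, 17u};
        uint32_t sum = l[0] + l[1];        /* wraps modulo 2^32 */
        printf("%u\n", (unsigned)sum);     /* prints 16 */
        return 0;
    }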


Signed integer behaviour is undefined on overflow in C.


But not on specific compilers with specific settings on specific machines like those used for the benchmark results. Execution time is also undefined by the C standard.


"Undefined behavior" is a specific term used by the C and C++ standards. What you're describing sounds more like "unspecified behavior", something entirely different.

Undefined behavior allows for aggressive optimizations where the compiler assumes the integer overflow can never happen, leading to such wonderful bugs as this:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=30475

Some compilers may offer specific settings, such as -fwrapv, which give you a way to define the behavior of integer overflow. (Although last I read, -fwrapv was broken.) I see no mention of any such setting in the article. If you've spotted one I've missed, please call it out more specifically.


Obviously if this weren't a trivial example and were real code he'd account for this in his C code, and it would still be much faster than the Python code.

He hand-waved it away because it's not really relevant to the point.


The Python code was trivial to write and immediately correct and robust. Trying to get C code to do the right thing is considerably more difficult - for example even trying to detect overflow can result in code the compiler then removes - http://lwn.net/Articles/278137/
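
A minimal sketch of the kind of check that article describes (illustrative code, not from the article or the LWN piece): an overflow test written in terms of the overflow itself is undefined behavior and may be deleted by the compiler, while a check against the limits stays well-defined.

    #include <limits.h>

    /* UB: the compiler may assume a + b never overflows
       and fold this test to false. */
    int will_overflow_broken(int a, int b) {
        return b > 0 && a + b < a;
    }

    /* Well-defined: compare against the limits before adding. */
    int will_overflow_safe(int a, int b) {
        return (b > 0 && a > INT_MAX - b) || (b < 0 && a < INT_MIN - b);
    }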

Robust, correct and fast are a hard combination to do. I do acknowledge the article is about the latter.


It's not hard if you use GMP.


The fancy library you're using in that other part of your project doesn't use GMP.


It likely doesn't need arbitrary precision arithmetic either. A lot of algorithms truly want modular arithmetic (hashes, checksums), many are dealing in domains that can't practically exceed a machine integer size (length of a string, number of vertices in a graph, etc). There is no reason that every library needs to deal in arbitrary precision.


> domains that can't practically exceed a machine integer size

This assumption is common among authors of C code, but is sometimes also exploitably incorrect. Even if you really don't care about supporting things larger than a certain size, you do still have to correctly account for overflow, not just ignore it.


Yes, overflow checks are critical, and tricky to get right in the absence of language support. Yes, code is often exploitably buggy in the absence of such checks. No, this does not mean that (for example) the Linux kernel should use arbitrary precision to represent pids. Yes, this does mean that better approaches for systematically dealing with overflow are a good idea.


The first instance of 8-way unrolling,

    loop:
        acc += A
        acc += B ...
was inefficient because each add instruction is dependent on the one before it. The CPU has to pause constantly to wait for the result in acc in order to continue.

The correct way to do this is to have 8 accumulators (or whatever the loop unrolling depth is) and then to sum those together at the end. This helps to keep the processor's pipelines full.

    loop:
        acc1 += A
        acc2 += B
    
    acc = acc1 + acc2
The author's use of SIMD instructions, with multiple accumulator registers, is better still. However, intrinsics would have been far more readable.

For further speed improvements, streaming intrinsics could be useful, since the data is only read once and never written. OpenMP for multithreading would also be a good fit here.
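
A minimal C sketch of the multiple-accumulator idea (function and variable names are illustrative, not the author's code):

    #include <stddef.h>
    #include <stdint.h>

    /* Four independent accumulators break the serial dependency
       chain, so the CPU can keep several adds in flight at once. */
    int32_t sum_unrolled(const int32_t *a, size_t n) {
        int32_t acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
        size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            acc0 += a[i + 0];
            acc1 += a[i + 1];
            acc2 += a[i + 2];
            acc3 += a[i + 3];
        }
        for (; i < n; i++)     /* scalar tail */
            acc0 += a[i];
        return acc0 + acc1 + acc2 + acc3;
    }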


Of course, the next step is obvious - work out why the compiler didn't do a four-way AVX unroll, and then submit a bug fix to clang to make it do that. That way all of your future code benefits from your single micro-optimization.

It's also possible that you find out that if you enable --generate-for-haswell or some other arcane compiler flag, it'll do it for you.


All the author had to do was add '-march=native' or '-march=core-avx2' to the compiler command line: http://goo.gl/H4f62I


Clang 3.7.0 (experimental) + -march=skylake gives you AVX512. zmm all the way, baby! 256 bytes processed in the inner loop!


But what gives me a Skylake CPU?


A time machine, or a job working at Intel? :)


The reason to use intrinsics and inline assembly (actually, the latter is pretty rare these days, intrinsics being much more common) isn't only about beating the compiler.

When you're relying on the compiler to vectorize, you run the risk of a subtle, innocuous change to the code breaking the vectorization -- and this will happen a lot. Also, when you target multiple compilers, it's very difficult to get reliable performance across all of them, unless you do the vectorizing yourself.

Not to mention, compilers tend to do great on simple test cases like these, but totally barf as soon as the loop becomes more complex (Try adding some conditionals to the loops some time... It's not that these loops can't be vectorized, it's just that the compiler doesn't know how).

To get the best performance out of vectorization, it's mostly about organizing the data so that it can be easily vectorized. If you've gone through this work, it's fairly pointless not to take the extra effort to guarantee that you're getting the performance you expect.
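
For concreteness, a minimal sketch of what a hand-vectorized version of the article's sum might look like using SSE2 intrinsics (illustrative only, not the article's code; it assumes the total fits in an int32_t):

    #include <emmintrin.h>   /* SSE2 */
    #include <stddef.h>
    #include <stdint.h>

    int32_t sum_sse2(const int32_t *a, size_t n) {
        __m128i acc = _mm_setzero_si128();
        size_t i = 0;
        for (; i + 4 <= n; i += 4)
            acc = _mm_add_epi32(acc,
                      _mm_loadu_si128((const __m128i *)(a + i)));
        int32_t lanes[4];
        _mm_storeu_si128((__m128i *)lanes, acc);
        int32_t sum = lanes[0] + lanes[1] + lanes[2] + lanes[3];
        for (; i < n; i++)   /* scalar tail */
            sum += a[i];
        return sum;
    }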


Rather than inline ASM, wouldn't it be better to have a marker that says "error if this can't be vectorised"? Something analogous to Scala's @tailrec


Again, inline ASM is pretty rare these days (when we do use it, it isn't for SIMD). Intrinsics are much more common.

The big issue (aside from convincing MSVC to implement it ;) with your suggestion is that, unlike TCO, vectorization isn't really a boolean. There's a range of what vectorization might mean (you can vectorize code and do a bad job with it, only marginally beating out the scalar code), so you'd still need to check the generated code for the situations where you care.

And honestly, it's not worth the effort. Vectorization shouldn't be as scary as it is for most programmers. Once you get the hang of how to do it, it's not bad at all. We write a lot of SIMD code at work, and 'difficulty of writing SIMD code' isn't a big issue for us. Honestly, it's kind of fun, a bit like solving a puzzle (an optimization puzzle, something like SpaceChem or Infinifactory).

Now, a situation where it might be a win is if you have a lot of different platforms you need vectorized code for... but in my experience you're probably better off doing it by hand unless this is a huge number.


Writing SSE code using compiler intrinsics is indeed a fun puzzle, but it has huge drawbacks: 1) it's Intel-specific, and 2) it's a maintenance risk unless everyone in the shop knows how to write and maintain SSE code. Unfortunately nobody else at my job knows how to do it, so I am not allowed to check any in :(


> it's Intel-specific

Most platforms offer intrinsics.

> it's a maintenance risk unless everyone in the shop knows how to write and maintain SSE code.

This is understandable, but not the case where I work. If you need to be writing SIMD code and this is the case, then you need to hire programmers who can do it. That or convince them to learn, as (again) it's not that hard.


I've been thinking about that for the last few days, and I think that's the best solution, if it's possible. Optimizations might clutter up the code and make the intent not clear. Writing idiomatic code and hoping that the compiler figures it out is also suboptimal, as noted by the grandparent.

I think the best solution is to be able to put some kind of annotation, or other declaration, on a function that says "this function should be no worse than this" - in Scala's case, for example, @tailrec. I'm unfortunately having a hard time coming up with other specific examples, but the gist of it is that the compiler either manages to do all the work on the function itself and the functions it calls, or errors out and reports what it couldn't do. Ideally I would want to write 10 functions which are all pure and referentially transparent, call them all from a top-level function with some kind of annotation that states my demands with regard to optimizations, and then have that function be transformed into a single, efficient, fused loop with no unnecessary allocations or intermediary values. But like I mentioned, the hard part seems to be actually specifying what your demands are.


In my experience, I always write clear, idiomatic C code alongside my intrinsics-based vectorized function, along with comments on the tricky parts (such as using packus for clamping things to 0-255). You get clear code, and also optimized code (you might want the C version anyway for pre-SSE2/MMX and non-x86 targets). The downside is that you have to maintain multiple copies of the code, but if you have a version optimized for each of SSE2/SSE3/SSE4.1/AVX2 anyway, it's not really that much more hassle.


Er. The author missed a crucial point about "just" using PyPy: for me it's over a 24x speedup over standard Python (if you run it enough times for the JIT to warm up). Know your tools.


For that matter, his numpy code, which wasn't described until the comments, was terrible. Unlike all the other versions, it included a complete reallocation, conversion and copying step, which accounted for almost the entire processing time.

People in the comments who tried a reasonable numpy version found it to be around the same speed as the unoptimized C version.


I got 25x on sum_naive_python() and 6x on sum_native_python() (PyPy 2.7.3 vs CPython 2.7.6 on Intel Arrandale), using timeit to measure time and not cycles.

But IMO it's a nice article anyways. I'd wager the final implementation he comes up with is faster than PyPy.


On the general topic of how much we can rely on compilers, even using old and well-understood languages (C++), consult Mike Acton:

http://www.slideshare.net/cellperformance/gdc15-code-clinic

https://www.youtube.com/watch?v=rX0ItVEVjHc


C++ is well-understood? I don't get that impression from people who know it.


C++ is well-understood to not be well-understood.


Indeed, the language is well understood, but code often isn't, because it's easy to write UB.


"Our problem will be a very simple one: sum a list of n integers. To make this fair, we will assume that the integers all fit in a 32-bit signed int, and that their sum fits in a 32-bit signed int."

What if my list is: [2000000000, 2000000000, -2000000000, -2000000000]


It will wrap around both ways and correctly return 0. At least it does on my compiler; I never remember what is part of standard C and what is an accident of implementation.

But it won't work for {2000000000, 2000000000}, as stated in the article.


Why stop at -O3? If you're going to compare with vectorized asm, you can get the compiler to use avx vector instructions in its optimizations. Add -mavx to your flags and the vector instructions will show up there as well.


>>Cray 6600

Should be Seymour Cray's 6600, as it was sold by CDC [1]

http://en.wikipedia.org/wiki/CDC_6600


Is there a library with "optimal" implementations of functions such as these?


OS X and iOS have Accelerate.framework: https://developer.apple.com/library/mac/documentation/Accele...

In particular, see vDSP_sve and vDSP_sveD which compute single-precision and double-precision vector sums respectively.
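
A minimal sketch of calling it (assumes OS X with the Accelerate framework linked; the data buffer is just an illustration):

    #include <Accelerate/Accelerate.h>
    #include <stdio.h>

    int main(void) {
        float data[] = {1.0f, 2.0f, 3.0f, 4.0f};
        float sum = 0.0f;
        /* vDSP_sve(input, stride, output, count) sums a
           single-precision vector */
        vDSP_sve(data, 1, &sum, 4);
        printf("%f\n", sum);   /* 10.000000 */
        return 0;
    }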



