
Hand Coded Assembly Beats Intrinsics in Speed and Simplicity - nkurz
http://danluu.com/assembly-intrinsics/
======
pandaman
I wonder if it's finally time to put to rest the myth that human-coded
assembly for a modern CPU is slower than what a compiler generates.

As far as I know it was semi-true around Pentium time. On that asymmetrical
dual-pipeline architecture, with many pairing exceptions and some microcoded
instructions in the mix, typical hand-optimized code for the previous
generation of CPUs would likely under-utilize the second pipe and hit some
microcode for good measure. However, even the Pentium Pro already had a very
robust front end that did not require non-intuitive instruction reordering to
utilize multiple execution units.

Saying that you cannot hand-code better than the compiler for a non-Pentium
CPU nowadays sounds similar to saying that if you use too many set bits in
your data, the punch cards will warp and tear apart from too many holes.

~~~
userbinator
Speaking from experience, this myth was never true from 8086 up through the
486; the Pentium (P5) might've been slightly harder to optimise for, but a
human could still easily beat the compilers of the time. And of course, it is
not true today either.

Around the time of the P6 (Pentium Pro/II/III) was when I think things really
improved - out-of-order execution/speculation/wide decoders meant instruction
ordering had little significance, and so getting more performance was more
about reducing the number of instructions needed - something compilers tended
to not do well at.

I believe the myth came from the early days of the RISC fad, when CPUs were
designed specifically so that compilers could easily generate code for them
and make human-coded Asm "obsolete"; although to me this appears to be more a
case of "take all the advantage away from humans, making it just as hard for
them to optimise the code as it is for the compiler"... I have only a little
experience with writing Asm for MIPS and ARM, but even from that I can see
they do not have anywhere near the sort of "room for optimisation" that x86
provides.

Another myth that probably needs to die now is "microcoded instructions are
slow". In a recent HN thread about using the old BCD instructions, which were
commonly thought to be much slower than an equivalent sequence of simpler
instructions (https://news.ycombinator.com/item?id=8477254), I did a quick
benchmark and found they were basically the same speed, but much shorter. The
one caveat is that currently these complex instructions can only be decoded
by one of the decoders, so having too many of them close together can slow
things down if the other decoders can't find a simpler non-dependent
instruction to decode; but using them sparingly could be good for making code
smaller (and thus consume less cache) without any speed penalty, or possibly
even increase speed overall due to reduced cache misses.
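
A benchmark of that kind can be sketched roughly as follows (a minimal
sketch with some assumptions: GCC or Clang on x86, built with -m32 since
AAA is not encodable in 64-bit mode, and the "expanded" loop is just a
stand-in sequence of simple ALU ops, not a faithful reimplementation of
AAA):

    #include <stdint.h>
    #include <stdio.h>
    #include <x86intrin.h>  /* __rdtsc() */

    enum { ITERS = 100000000 };

    int main(void) {
        uint64_t t0, t1, t2;
        long i;

        t0 = __rdtsc();
        for (i = 0; i < ITERS; i++)   /* microcoded BCD adjust */
            __asm__ volatile ("aaa" ::: "eax", "cc");
        t1 = __rdtsc();
        for (i = 0; i < ITERS; i++)   /* stand-in sequence of simple ops */
            __asm__ volatile ("mov %%al, %%cl\n\t"
                              "and $0x0f, %%cl\n\t"
                              "cmp $9, %%cl"
                              ::: "eax", "ecx", "cc");
        t2 = __rdtsc();

        printf("aaa:      %llu cycles\n", (unsigned long long)(t1 - t0));
        printf("expanded: %llu cycles\n", (unsigned long long)(t2 - t1));
        return 0;
    }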

------
sharpneli
"Well, at least in clang (3.4). Gcc (4.8.2) is too smart to fall for this
separate destination thing and “optimizes” the code back to something like our
original version."

This resonated quite heavily with me. I've spent a rather large amount of
time figuring out how to trick various compilers. This is especially critical
in cases where there is no inline assembly, such as shaders. It's really sad
that in some cases you can 'optimize' code by hiding information. In what
sane world does code run faster when you make something not a compile-time
constant, for example?
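
A concrete instance of that trick (a minimal sketch, assuming GCC or
Clang): an empty extended-asm statement emits no instructions, but
launders a value through a register so the compiler can no longer prove
it is a compile-time constant:

    #include <stdio.h>

    /* Emits no code, but forces x through a register and makes the
     * compiler treat it as an unknown runtime value. */
    static inline int launder(int x) {
        __asm__ volatile ("" : "+r"(x));
        return x;
    }

    int main(void) {
        int c = 42;           /* c * c is folded to 1764 at compile time */
        int h = launder(42);  /* h * h survives as a runtime multiply */
        printf("%d %d\n", c * c, h * h);
        return 0;
    }

The same barrier is often used in micro-benchmarks to stop the compiler
from deleting the code under test.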

I really wish that every compiler had a "Shut up! I know what I'm doing!"
mode where it would do what you say as closely as possible: a 1:1 mapping of
certain variables to registers, plus the few temp regs required for
evaluating expressions.

This is not needed often, but it's absolutely mandatory for going from
40-50% of theoretical speed to >90%. And in games and other soft real-time
applications, this can spell the difference between 30 and 60 fps.

~~~
stephencanon
That mode is called "assembly". I use it all the time.

~~~
nkurz
It's a very useful mode, but assuming you don't write the entire program in
assembly, what's the best way you've found to integrate the assembly with the
non-assembly code? Is there any Windows/Mac/Linux-portable and self-contained
way of doing it, short of writing assembly functions in separate files and
distributing NASM along with the source? Does even that work?
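
(For concreteness, the in-source alternative would be GCC/Clang extended
inline asm, roughly like the sketch below for the popcnt case --- but that
is exactly the non-portable part, since MSVC for x64 dropped inline asm
entirely:)

    #include <stdint.h>

    /* GCC/Clang extended asm; keeps the assembly inside the C source,
     * but won't build with MSVC x64. */
    static inline uint64_t popcnt64(uint64_t x) {
        uint64_t r;
        __asm__ ("popcnt %1, %0" : "=r"(r) : "r"(x) : "cc");
        return r;
    }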

~~~
J_Darnley
Why would you distribute NASM, or any other assembler, with your own code? You
list it as a requirement to build your code. Would you also distribute GCC?

~~~
nkurz
Logically this makes fine sense, but I don't think I've ever encountered a
popular open source program that required a specific assembler other than
'gas', unless the entire codebase was written for that assembler. I don't
think I've encountered it in academic code either.

But I have encountered a few projects that distribute NASM or another
assembler. I don't know exactly why this would be; I'm trying to find out. Do
you have examples where requiring NASM (or another assembler) as an external
dependency has worked for projects in the past? I'm sure they must exist ---
I just don't know of any.

~~~
J_Darnley
FFmpeg. It requires NASM or YASM if you don't want to cripple it by disabling
the external assembly.

------
nkurz
I feel like the problem isn't really with intrinsics. Rather, the problem is
that compilers just aren't very good at optimizing for the last 30-50% of
performance. One notices this when using intrinsics because if you are using
them you probably are concentrating on performance. But if you were to look
closely at the non-intrinsic code, I think you'd see a lot of the same
problems: bad register allocation leading to unnecessary moves and spills,
failure to take advantage of flags already set by arithmetic and logic
operations, and awkward code layout leading to unnecessary jumps and branches.

Obviously, this isn't because compiler writers are stupid or ignorant.
Instead, it's just a really difficult problem to optimize for one target in a
way that doesn't hurt performance on another. Still, I think there's a lot of
room for improvement, if only because comparing the output from GCC, Clang,
and ICC often shows that one compiler does a much better job than the
others[1]. But even once you know (definitively) that the code could be
faster, it's often really tough to convince a given compiler to generate the
code that you want. And coming up with a single construct that works well for
multiple compilers and remains readable is often impossible.

I'd love it if there were a good way to "lock down" the output for a given
function, without needing to depend on the whims of individual compilers.
Other than distributing only pre-compiled binaries, there doesn't seem to be a
great way to do this. Looking at Dan's example code, he has a solution that's
twice as fast as the non-inline-ASM alternatives. But it's still depending on
the compiler to do register allocation, and adding a global variable somewhere
else in the program might well cause the whole thing to blow up, spilling to
stack and triggering the popcnt false dependency that he's working hard to
avoid.
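
(For that specific popcnt hazard, one common workaround --- a minimal
sketch, assuming GCC/Clang extended asm --- is to make the destination
register the same as the source, so the false output dependency becomes a
real input dependency that must be satisfied anyway. Though this only
protects the one instruction, not the compiler's surrounding allocation
choices:)

    #include <stdint.h>

    /* popcnt with source and destination in the same register: the
     * spurious dependency on the output register becomes a genuine
     * dependency on the input we need anyway. */
    static inline uint64_t popcnt_nodep(uint64_t x) {
        __asm__ ("popcnt %0, %0" : "+r"(x) : : "cc");
        return x;
    }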

What's the best way to do this --- to keep optimized code optimized across the
mix of compilers in the current ecosystem?

[1] I'm Linux based, but I probably should start adding MSVC to the mix. I'm
sure there are cases where it does something better than the others. Is there
a good way to do this? Perhaps an equivalent to http://gcc.godbolt.org/ for
MSVC?

~~~
WalterBright
Generating optimal code is much like the travelling salesman problem.
Optimization algorithms will lead the result to a local minimum while missing
the global minimum.

~~~
userbinator
_Optimization algorithms will lead the result to a local minimum while
missing the global minimum._

I think this is one of the huge differences between humans and compilers
that leads to suboptimal code - the compiler is completely unaware of what the
programmer _intended_ in the HLL source, so the best it can do is to generate
code which it "knows" is the fastest way to implement exactly those
statements; on the other hand, if you are a human writing an algorithm in Asm,
you probably have things like instruction selection and register allocation
already in mind and are willing to slightly change the "semantics" of your
algorithm if it results in e.g. a better use of registers or the possibility
to use a faster/smaller instruction.

In other words, the compiler starts with something horribly unoptimised but
following the semantics of the source language, and tries to make that as
fast/small as it can while still maintaining the same semantics - and so this
is the "local minimum" it can reach. The human starts with a concept of an
algorithm and its intended result, and everything in-between is flexible, so
this is a more "global" view of the problem of code generation. The human
isn't restricted by the semantics of the HLL, but can fully express the
intended algorithm by exploiting the full capabilities of the processor.

~~~
WalterBright
The human also knows things that the compiler doesn't - such as which branches
of the code are speed critical and which aren't. This matters for things like
where precious registers get allocated.

------
dalke
> "For example, as of this writing, the first two google hits for popcnt
> benchmark (and 2 out of the top 3 bing hits) claim that Intel’s hardware
> popcnt instruction is slower than a sofware implementation that counts the
> number of bits set in a buffer, via a table lookup using the SSSE3 pshufb
> instruction."

I believe that my benchmark is one of those top hits. My description is at
http://www.dalkescientific.com/writings/diary/archive/2011/11/02/faster_popcount_update.html
. I wrote "My answer is that _if you have a chip with the POPCNT instruction
built-in then use it_. I still don't have one of those chips, but I know
someone who does, who has given me some numbers. "

My own code's logic is "if POPCNT exists then use it, otherwise test one of a
few possibilities to find the fastest, since the best choice depends on the
hardware."

I now have a machine with a hardware POPCNT, and a version with inline
assembly. I should rerun the numbers...
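
(With a recent GCC, that dispatch logic can be sketched roughly as below.
The function names are mine, __builtin_cpu_supports requires GCC 4.8+, and
Kernighan's method here is just a stand-in for whichever software variant
wins on the host:)

    #include <stddef.h>
    #include <stdint.h>

    __attribute__((target("popcnt")))
    static uint64_t popcount_hw(const uint64_t *buf, size_t n) {
        uint64_t total = 0;
        for (size_t i = 0; i < n; i++)
            total += (uint64_t)__builtin_popcountll(buf[i]); /* POPCNT */
        return total;
    }

    static uint64_t popcount_sw(const uint64_t *buf, size_t n) {
        uint64_t total = 0;
        for (size_t i = 0; i < n; i++) {
            uint64_t x = buf[i];
            while (x) { x &= x - 1; total++; }  /* Kernighan's method */
        }
        return total;
    }

    uint64_t popcount_dispatch(const uint64_t *buf, size_t n) {
        /* use the hardware instruction only if the CPU has it */
        return __builtin_cpu_supports("popcnt")
             ? popcount_hw(buf, n)
             : popcount_sw(buf, n);
    }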

------
Alupis
I seem to recall that Clang consistently produces slower executable binaries
than GCC does, although Clang compiles more quickly. (That's the reason a lot
of people use Clang for development but GCC for the final production build.)

Perhaps the assembly it produces isn't quite as optimized as it could be if
it were carefully hand-crafted.

~~~
stephencanon
That may have been true in the past. All the data I've seen in the past few
years has GCC generating faster code for some benchmarks, and clang generating
faster code for others, with no really consistent pattern (this is by no means
an exhaustive sample, but it is certainly large enough to be vaguely
representative).

Interestingly (to me), I am seeing more and more examples where clang and gcc
are beating icc in performance.

~~~
Alupis
> Interestingly (to me), I am seeing more and more examples where clang and
> gcc are beating icc in performance.

Intel has finite resources (as much as Intel can realistically dedicate to
the project), whereas the Clang and GCC projects have theoretically unlimited
resources (as much as companies the world over can dedicate to them). It
doesn't surprise me, actually. Many say GCC 4.7.1 overtook ICC in performance
by about 15% (and 4.7.1 is ancient now).

------
alexdowad
Interesting to see this on HN, just a couple of months after I ran into the
same problem with GCC's intrinsic popcount on a project myself! I found that
I needed to pass an "-march" argument for GCC to actually emit POPCNT
instructions.
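
(A minimal reproduction, assuming x86-64 GCC: compile the snippet below
with and without the flag and compare the generated .s output:)

    /* gcc -O2 -S popcount.c          -> library call or bit-twiddling
     * gcc -O2 -mpopcnt -S popcount.c -> a single popcnt instruction */
    #include <stdint.h>

    uint64_t count_bits(uint64_t x) {
        return (uint64_t)__builtin_popcountll(x);
    }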

~~~
wolf550e
Isn't gcc's default march=native so it would emit popcount if the host
supports it?

~~~
gsg
No, the default produces binaries that are compatible with as many machines as
possible.

~~~
wolf550e
You're right [0], but I couldn't find which march value [1] corresponds to the
settings used for "x86_64-linux-gnu".

EDIT: running "echo | gcc -v -E - 2>&1 | grep cc1" shows the command line,
which says "-mtune=generic -march=x86-64". I do not know what "-march=x86-64"
implies, except SSE2.

0 - https://gcc.gnu.org/ml/gcc-help/2012-12/msg00074.html

1 - https://gcc.gnu.org/onlinedocs/gcc/i386-and-x86-64-Options.html

------
throwawayaway
finally i think this article and comment thread will settle this contentious
debate - definitively, and we can all move on - with the decision made once
and for all future possible cases and desired outcomes.

