
The surprising cleverness of modern compilers - kiyanwang
http://lemire.me/blog/2016/05/23/the-surprising-cleverness-of-modern-compilers/
======
lisper
There is an often-missed elephant in the compiler optimization living room in
the form of poor language design leading to a tremendous amount of wasted
human effort. First, effort is demanded of the programmer to write the clever
optimized C code in the first place, and then effort is demanded of the
compiler writers to recognize the clever optimized C code and compile it down
to the native instruction that does the Right Thing.

All of this effort could be eliminated simply by adding a function to the
standard library called bitcount. This would make life easier for both the
programmer, who could now count bits by calling bitcount, and the compiler
writers, who no longer have to write and maintain code that recognizes clever
hacks.

This (counting bits) is not an isolated incident either. There is an entire
field of academic endeavor devoted to this sort of thing, focusing
particularly on recognizing common idioms involved in vector and array
processing. All of this could be eliminated at a stroke through better
language design. Instead, an entire academic field dons the ball-and-chain of
a poorly designed language and then pats itself on the back for getting to the
finish line panting and sweating, and people like me who say, "Um, why not
just take off the extra weight and use a better designed language?" get
marginalized at best and vilified at worst.

~~~
mikeash
gcc and clang both provide a builtin for this, __builtin_popcount. There are a
bunch of other useful things like that too:

[https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html](https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html)

Of course, that doesn't save you from writing the long version if you need
portable code, but it's a step in the direction you describe.

~~~
lisper
> it's a step in the direction you describe.

Well, sort of. This is not about any one particular optimization. This is
about the general pattern of having one human take a high-level concept (like
counting bits) and reducing it to a low-level implementation, and then having
a second human writing a compiler which tries to reverse-engineer the work
that the first human did in order to figure out that they were trying to count
bits so that it can emit code that actually does the Right Thing.

So yes, __builtin_popcount does improve the situation a little bit, but it
perpetuates a whole host of other problems. For example:

1. It has the wrong name. The right name for a function that counts bits is
"bitcount", not "popcount" (and certainly not __builtin_popcount!). The
__builtin prefix is there because C doesn't have a proper namespacing system,
and so using __builtin_popcount perpetuates that problem in the same way that
clever compiler optimizations perpetuate other bad aspects of C's design.

2. __builtin_popcount only works on unsigned ints. If you want to count bits
in anything else (a string, say, or a bignum) you're back to square one.

To really fix this problem you need a _general_ mechanism for expressing high-
level concepts that a compiler can know about and optimize directly. For that,
C is completely hopeless. You need a new language.

~~~
WalterBright
> a general mechanism for expressing high-level concepts

functions

> that a compiler can now about and optimize directly. For that, C is
> completely hopeless.

This has been done with C for generations, e.g. memcpy, strlen, sqrt, and even
printf are recognized by the C compiler, and custom code is generated.

The only problem is the C community has been slow to standardize on additional
functions.

~~~
lisper
No, that's not the only problem, it's one of many problems. But even by itself
it's a pretty frickin' substantial problem.

~~~
WalterBright
It means a new language is not required. Even without a Standard update, the
various C compiler vendors could get together and agree on a set of them,
making a "technical report".

Anyhow, what other problems?

~~~
lisper
> the various C compiler vendors could get together and agree on a set of
> them, making a "technical report".

Yes, they could. But they don't.

> Anyhow, what other problems?

The lack of a real multi-dimensional array type is probably the biggest
problem with regard to the goal of making code run fast. The fact that
strings are required to be zero-terminated arrays of signed 8-bit integers is
another huge problem. It means that you can't have native Unicode strings, and
that to create a substring you have to make a copy. The lack of exceptions and
automatic memory management means that you need weird calling conventions if
you want to return composite types, or if a function can fail. The lack of
generics and namespacing means that you need to use weird naming conventions
to avoid collisions, and there is a significant cognitive load placed on the
programmer to figure out the right operation to use for a particular set of
operand types (except for native arithmetic, but that can't be extended to any
non-native types).

~~~
WalterBright
I'm aware of C's language problems, but I don't think the ones you mention
impede encapsulation of algorithms so the compiler can do a better
optimization job.

> strings are required to be zero-terminated arrays of signed 8-bit integers

This is incorrect. C's char type is optionally signed, not required to be
signed. The implementations I know of make them unsigned, because signed
characters make no sense.

~~~
lisper
> I don't think the ones you mention impede encapsulation of algorithms so the
> compiler can do a better optimization job.

Unless you can produce an actual counter-argument I guess we'll just have to
agree to disagree about that.

> signed characters make no sense

I could not agree more. And yet...

    
    
        [ron@mighty:~] gcc -v
        Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
        Apple LLVM version 6.0 (clang-600.0.51) (based on LLVM 3.5svn)
        Target: x86_64-apple-darwin13.4.0
        Thread model: posix
        [ron@mighty:~] cat test.c
        
        int main(int argc, char** argv) {
          unsigned char* x = "baz";
        }
        [ron@mighty:~] gcc test.c
        test.c:3:18: warning: initializing 'unsigned char *' with an expression of type
              'char [4]' converts between pointers to integer types with different sign
              [-Wpointer-sign]
          unsigned char* x = "baz";
                         ^   ~~~~~
        1 warning generated.

~~~
WalterBright
> Unless you can produce an actual counter-argument

There's no problem having a function called popcnt(), and having the C
compiler recognize it just like it recognizes memcpy(), strlen(), sqrt(),
etc., and generate optimal code for it.

~~~
coldtea
Aren't any of those negatively affected by the possibility of aliasing?

~~~
WalterBright
None of those I mentioned are, though C has the 'restrict' type modifier to
deal with that.

------
outworlder
> What does that mean? It means that C is a high-level language.

For this specific example, isn't the reasoning backwards? The machine
instruction is clearly the "high level" here, if you disregard the fact that
it is, in fact, a machine instruction, thus there can be no "lower level"
(ignoring microcode, an implementation detail).

Of course, you cannot say that, in general, C is lower level than X64
assembly. However, just to illustrate, replace the popcntq instruction with
some Scheme function, and try arguing that it is at a "lower" level. The
argument won't work. In fact, people would think you were talking about a
"decompiler", instead.

------
gjm11
One of the comments on that article links to this
[https://llvm.org/bugs/show_bug.cgi?id=1488](https://llvm.org/bugs/show_bug.cgi?id=1488)
which rather suggests that this surprisingly _specific_ optimization was added
in order to speed up some calculations in one of the SPEC benchmarks.

(The benchmark in question is code from a chess program called Crafty, which
represents a chess position as a bunch of 64-bit bitmaps. Unsurprisingly,
64-bit popcount is quite often useful for this.)

~~~
treerex
Compiler vendors have been doing stuff like this (recognizing specific code
patterns from common benchmarks) for 30 years, back in the days when there was
real competition amongst C compilers (Borland vs. Zortech, for example).

~~~
gjm11
Yup. But the HN discussion of this one seemed to be treating it as an instance
of compilers being terribly clever to benefit their actual users, rather than
yet another benchmark hack.

------
spoiler
Just a side-comment: I never actually needed to count bits, but the technique
illustrated in the post is quite clever! I had to stop for a second to
understand what it actually does: essentially, each iteration removes the
least significant set bit from the number (the right-most set bit, when the
number is written in the usual order), and it continues doing so until no set
bits are left (i.e. the number is zero)!

It's so simple, but to be honest, I don't think I would come up with it
without putting in some mental effort.

~~~
rosstex
I teach Data Structures at UC Berkeley, and this is one of our most
challenging problems each year: find a one-line expression to get the least
significant bit of a number: x & (x-1)! It exploits some very interesting
magic of binary numbers.

~~~
kevinwang
Why not (x & 1) ?

~~~
johnp_
This just results in the "first" bit, essentially showing whether x is odd (1)
or even (0).

Example: 6_10 = 110_2 and 1_10 = 001_2

(x&1) = 110 & 001 = 000 -> even number

I think the x&(x-1) is wrong though. I thought (x & -x) extracts the LSB:

010&110 = 010

110&010 = 010

001&111 = 001

111&001 = 001

001010100 & 110101100 = 000000100

This happens because a negative integer in two's complement is just the
positive number with all bits negated, plus 1. If it were only negated, the
result would be 0, since every bit would differ from its counterpart.
Everything to the right of the LSB of the argument is by definition zero, and
in the negated version by definition one. The final +1 of two's complement
turns all those bits back to zero and sets the position of the LSB to one.
That is now the only place where both numbers have a one bit, and therefore
the only bit that survives the & operation. All other bits become 0.

Disclaimer: Still studying and not teaching (yet) ;)

edit: changed order of examples to clarify, that it works from positive to
negative and the other way round.

~~~
kevinwang
Is that a common definition of LSB? I previously learned that the definition
of the LSB is the "first" bit [0]. Would it be more clear to say that this
algorithm finds the least-significant bit that's set?

[0]
[https://en.wikipedia.org/wiki/Least_significant_bit](https://en.wikipedia.org/wiki/Least_significant_bit)

~~~
johnp_
You're right. Least significant _set_ bit is (x & -x).

Got confused by the (x & (x-1)) and thought rosstex must have meant something
more complicated, because it doesn't seem to result in the usual LSB either :/

------
danso
> _We test people’s intelligence in room, disconnected from the Internet, with
> only a pencil. But my tools should get as much or even more credit than my
> brain for most of my achievements. Left alone in a room with a pencil, I’d
> be a mediocre programmer, a mediocre scientist. I’d be no programmer at all.
> And this is good news. It is hard to expand or repair the brain, but we have
> a knack for building better tools._

Reminds me of what Kasparov said in his essay analyzing the rise of "centaurs"
in chess competitions:

[http://www.nybooks.com/articles/2010/02/11/the-chess-master-and-the-computer/](http://www.nybooks.com/articles/2010/02/11/the-chess-master-and-the-computer/)

> _Programming yourself by analyzing your decision-making outcomes and
> processes can improve results much the way that a smarter chess algorithm
> will play better than another running on the same computer. We might not be
> able to change our hardware, but we can definitely upgrade our software._

edit: fixed the errant double-copy-paste

~~~
dTal
Hey, thanks for linking this. What an entertaining, insightful piece of
writing!

------
psuter
Seems to be built-in recognition of an extremely specific use case: [https://github.com/llvm-mirror/llvm/blob/ae889d36724efb174e8...](https://github.com/llvm-mirror/llvm/blob/ae889d36724efb174e8cc05d26e655c4c4ab8867/lib/Transforms/Scalar/LoopIdiomRecognize.cpp#L1073-L1077)

I suppose it makes sense, given: 1) popcount is very common in bit-
manipulating programs, 2) there is no standard popcount in ISO C, 3) it
potentially saves 100+ instructions.

(Edit: I see a comment on Lemire's blog already points to that.)

~~~
pera
I was not aware of the popcnt[1] instruction. It's quite amazing how LLVM can
recognize that code, thanks for the link!

[1] [https://en.wikipedia.org/wiki/SSE4#POPCNT_and_LZCNT](https://en.wikipedia.org/wiki/SSE4#POPCNT_and_LZCNT)

------
EGreg
Yes tools are a big part of augmenting intelligence. However!! The biggest
thing that computers have over humans is the replicability of programs.

A human, somewhere, figures out a correct solution or a better algorithm, then
it gets propagated and replicated.

A self-driving car "learns" something in its experience on the road. That
information is propagated using a pre-existing system. Now all self-driving
cars eventually improve and outperform typical humans.

Watson collects disease incidence rates and diagnoses from around the world,
so the "big data" eventually allows it to make decisions that a single doctor
would not be able to make.

It's the easy copying of bits that makes this possible.

A photo in the 1960s could be imperfectly reproduced on a newfangled "copy
machine". Before that, photos were set up and printed using huge printing
presses. The more copies are made of a photo, the more likely it is to
survive the destruction of the physical media it is embedded in.

But in the past, making perfect copies was very expensive. Today, the ease of
copying bits is what has brought the price of _copying information_ down. And
this is the key to the intelligence revolution.

Humans were able to zoom ahead because of language. It provided ways of
propagating information and copying it. Although the copies were imperfect, it
is what made every generation smarter than the last.

Anyway, the "easy copying" is a feature of a _platform_. The humans are still
the ones banging away at the platform. And that's why open source beat
proprietary software. It's the same reason science made progress: anyone
anywhere can make an incremental improvement, which is then recognized and
propagated according to whatever political system is in place.

And that is what produces the explosion in intelligence.

The things we can improve now are the political systems that govern which
information is propagated. Should they be centralized? Decentralized? Who
determines what goes in? How to deal with versions and dependencies?

The fact that we keep reinventing the same things (NoSQL), going in circles
(thin client / thick client, JS framework fads), locking out users (Apple, MS
Secure Boot), and bloating our computer languages (C++, JS, PHP, etc.) is just
a symptom that we haven't figured out the best ways to keep moving forward.

------
_yosefk
Indeed, Clang/LLVM did a stellar job here. It'd be interesting to know how (in
part to understand in what situations one can expect this sort of thing to
happen; for instance, I'd be surprised if IEEE floating-point emulation code
were compiled down to a hardware floating-point instruction - though perhaps I
underestimate the optimization passes at work here!)

~~~
TorKlingberg
I bet there is a specific optimization that recognizes common patterns for
counting the number of set bits and replaces it with popcntq. There is no
popcnt() function built into C, so everyone who needs it uses one of a few
common implementations.

~~~
psuter
Indeed: [https://github.com/llvm-mirror/llvm/blob/ae889d36724efb174e8...](https://github.com/llvm-mirror/llvm/blob/ae889d36724efb174e8cc05d26e655c4c4ab8867/lib/Transforms/Scalar/LoopIdiomRecognize.cpp#L1073-L1077)

~~~
_yosefk
LoopIdiomRecognize::recognizePopcount? Then I think it's more about "the
surprising ability of modern compilers to recognize popcount" than general
"cleverness."

(I suspected it was this kind of thing, but I thought maybe it was something
more generic.)

~~~
acqq
Yes, most other loops that produce the same result wouldn't be recognized,
just this very specific one. Note the steps:

// step 1: Check if the loop-back branch is in desirable form.

// step 2: detect instructions corresponding to "x2 = x1 & (x1 - 1)"

// step 3: Check the recurrence of variable X

// step 4: Find the instruction which count the population: cnt2 = cnt1 + 1

It's very, very specific. But they do allow the loop to contain something
more; see the resulting transformation at the end:

    
    
       // After this step, this loop (conceptually) would look like following:
       // newcnt = __builtin_ctpop(x);
       // t = newcnt;
       // if (x)
       //  do { cnt++; x &= x-1; t--) } while (t > 0);
    

This can then potentially be optimized further; in the simplest case given,
the rest is removed by later optimization passes.

------
pklausler
The problem with this kind of idiom recognition is that it must be
conservative. That rules out lots of cases involving floating-point arithmetic
and memory references.

In the particular example of population count: this has been a hardware
instruction on many processors since the early 60's. It is a crying shame that
it's not part of the modern C language or standard library.

------
mzs
Site is slow, here is the bug report:

[https://llvm.org/bugs/show_bug.cgi?id=1488](https://llvm.org/bugs/show_bug.cgi?id=1488)

Too bad it won't optimize in the face of clever code like this:

[http://www.hackersdelight.org/hdcodetxt/pop.c.txt](http://www.hackersdelight.org/hdcodetxt/pop.c.txt)

------
flamedoge
> it gets hard to benchmark “algorithms”

The cleverest C compilers that I know of do not even attempt to rewrite an
algorithm into a "smarter" one (they're not there... yet). All this does is
translate a popcount algorithm into a hardware instruction, which one would
presume affects performance by only a constant factor.

