Hacker News new | comments | show | ask | jobs | submit login
GodBolt: Enter C, get Assembly (godbolt.org)
602 points by setra on Dec 15, 2016 | hide | past | web | favorite | 151 comments

Try with gcc 6.2 options -Ofast -march=native:

  int square(int num) {
    int a = 0;
    for (int x = 0; x < num; x+=2) {
      if (!(x % 2)) {
        a += x;
    return a;
All kinds of loop unrolling and vector instructions. Now remove the "!"

For anyone too lazy: https://godbolt.org/g/jvSKCD

I would be much more impressed if I hadn't taken a compilers course. I reckon (god alone knows exactly what GCC does) this is just linear induction variable substitution[1] (so `x` gets replace with `i*2`), then associativity of integer multiplication, then some (probably builtin) rule that `n%n` is always 0. From there, it is pretty straightforward.

Don't get me wrong - the devil is in the details and getting optimizations that are both powerful and only applied when they are valid, and at the right time is difficult as hell. That said, I do expect compilers to be at least this smart.

[1]: https://en.wikipedia.org/wiki/Induction_variable#Induction_v...

> I do expect compilers to be at least this smart.

Does anyone know if, internally, compilers second-guess / double-check themselves ? For instance, when they detect a clever shortcut do they generate and quickly run [1] the optimized bytecode on a sampling of inputs to verify it is functionally-equivalent to what non-optimized bytecode outputs ?

[1] obviously cross-compilers are a thing, so this wouldn't be possible if it were compiling/optimizing for a separate architecture.

> For instance, when they detect a clever shortcut do they generate and quickly run [1] the optimized bytecode on a sampling of inputs to verify it is functionally-equivalent to what non-optimized bytecode outputs ?

No. Optimization transformations are generally expected to result in provably identically-functional code (or sufficiently identical, in case of e.g. floating-point optimizations). Otherwise it is a bug.

Indeed, “provably” being the key word. If I, the compiler developer, prove (in the mathematical sense) that the transformation is functionally identical, I don’t need the compiler to run any experiments to verify it. A proof is much more solid than running a few experiments.

Detecting functional equivalence of Turing-complete languages is a very non-trivial problem. So, no, compilers don't try to construct runtime proofs of their correctness.

There are people who do do more exhaustive checks of the peephole optimizations for correctness--or finding new opportunities (see John Regehr's Souper work). But production compilers are well-known (at least by anyone who works on them) for having lots of bugs in these kinds of optimizations.

I think Nuno Lopes also did some work on proving the correctness of instcombine optimizations, for instance: http://blog.regehr.org/archives/1170

>then associativity of integer multiplication, then some (probably builtin) rule that `n%n` is always 0.

Is this quite right? It's not true in general that (a * b) % c = a * (b % c). E.g. it's not true for 2,3,4. The relevant generalization is that (a * b) % b = 0, which has nothing to do with associativity.

Maybe they meant distributes? Modulo (sort of) distributes over multiplication: (a * b) % c = ((a % c) * (b % c)) % c

If that function is really supposed to return the square of a number, you took a wrong turn somewhere. Because it says the square of 16 is 56.

Now here's a more efficient algorithm which does produce the currect result:

    int square(int num) {
        int a = 0;
        for (int x = 1, n = num; x <= num; x+=x, n+=n) {
          if( num & x) {
              a += n;
       return a;

It would be nice if this site allowed us to run/step through the code, to see exactly what it is doing.

Oh, sorry, I never bothered to change the function name. It's just an interesting loop.

GCC is messing up here. Compare with clang!


We can say that clang is like the young Gauss :)


Wait, that contains no loop. Is clang using mathematical rules for series summation? Awesome!

It gets even better after playing with the code a bit: https://godbolt.org/g/kUMCpl

magic constants galore... where does 3435973837 come from? ;)

First it normalizes to

    if (num <= 0) {
    	return 0;
    int a = 0;
    int loop_n = (unsigned)(num - 1) / 5;
    for (int x = 0; x <= loop_n; x++) {
        a += x * 5 - 1;
    return a;
Division by 5 is done by multiplying by the modular inverse, 3435973837.

From there you just have a summation, which has two components.

    ((loop_n + 1) * loop_n / 2) * 5    // 5x
    - loop_n - 1                       // -1
The assembly between these the manual and automatic methods is slightly different, but the difference is fairly trivial and opaque.

It's 0xCCCCCCCD in hex, which looks a lot more structured at least. Though how exactly this makes sense is still unclear :-)

[edit] and playing around with the increment of x it keeps throwing in some magical values which are a repeating pattern in hex, with the LSB off by one usually...

I thought 0xCCCCCCCD was a very nice number so I asked myself why it would come out like that. I wondered whether

0.CCCCCCCC.... = 4/5 had anything to do with it.

Using hexadecimal we have

    5 * 3 = 10 - 1, so 
    5 * C = 40 - 4 so
    5 * (1 + C + C0 + C00 + C000...) = 5 - 4 + 40 - 40 + 400... where the last term is = 0 mod 2^32
This relates to

    0.CCCCCC.... = 4 / 5
Because in order for that to come out right, multiplying 0.CCCC... with 5 has to have all intermediate terms cancel, leaving only a 4.

So why would it take its digit from 4 / 5 and not 1 / 5? Because it takes x=4 to make the 5-x subtraction in 5 - x + x0 - x0 + x00 to come out to 1.

So this gives us a general rule for inversion (for numbers less than 16). Take the hexadecimal expansion of (n-1) / n, truncate to get enough hex chars, add 1 to the least significant hex digit.

I.e. the inversion of 3 would be 0xAAAAAAAB

Well if A=1 (I know it’s not), D is 5-1. I’ve noticed that whatever you change five to fits this pattern. Ex: +=3 in the code gives you AAAAAAAB, and B would be (3-1) if A=1. I

Alright maybe I'm stupid but I could have sworn that more than

    square(int):                             # @square(int)
        xor     eax, eax
is required to perform this, what's going on?

X can always by devided by 2, meaning the condition will always be false thus returning "a" unchanged. "xor eax, eax" is a shortcut to "mov eax, 0".

"xor r,r" is intel-ese for "set r to zero". For various reasons it's faster and smaller than the more obvious "mov r,0".

From this we can see that gcc and clang have figured out that the "if" can never be true, then they elided the loop, then optimized out the variable, ending in the equivalent of "return 0;".

  > Now remove the "!"
To be fair 4.6 and clang also do that, and I suspect so do earlier versions if I could be arsed to fix the compilation error.

Also as pointed out elsewhere clang turns the "!" version into a simple equation in terms of n, so I'm actually kind of disappointed in GCC here.

Dude, modern compilers are smart.

A simple recursive factorial function:


Not sure if that's smart or stupid.

It does try to vectorize it (doing multiple math operations at once is faster).

(1 * 5 * 9 * 13 * 17 * ...) * (2 * 6 * 10 * 14 * 18 * ...) * (3 * 7 * 11 * 15 * 19 * ...) * (4 * 8 * 12 * 16 * 20 * ...)

(it's not precisely that, but close enough)

It isn't optimal however, optimal code would be pre-computed results (signed integer overflow is undefined, so n <= 12 is defined)

(if signed integer overflow would be defined to be 2's complement overflow, you can still use a table, as n > 33 gives 0)

Dude, your code got:

1. Turned from recursive form into iterative form 2. Unrolled heavily 3. Autovectorized (!)

The throughput will be substantially more than the simple version.

Well, the optimizations are pretty smart to be sure, the stupid thing is that anything over 12! overflows (as GlitchMr points out).

I'd like to know how the recursion to iteration step was done.

Presumably there is a pass to turn f(recur(x), y) into recur(f(y, acc), x) and then tail-call optimization can be applied. This works for any associative f.

I took that and check out what Clang does with 32-bit ints:


vs. what it does with 64-bit ints:


Can anybody explain the 32 bit version?

It's just vectorizing calculations so that it's faster than pure iterative calculation. 64 bit version doesn't get this probably because that optimizer isn't SSE2 aware yet (just a guess I actually don't know) and can't do SIMD arithmetic with 2 64 bit floats

Harder to vectorize 64-bit arithmetic?

was going to say this. It probably only does SSE and not SSE2. Therefore vectorization only happens for 32 bit ints.

Seems to need -mavx2 to really go to town with 64 bit: https://godbolt.org/g/6EFYeY

Thank you both, I appreciate the insight!

Interestingly, all the RISC architectures except for 64-bit ARM seem to compile down to something a lot simpler.

Use -Os to check. Compiler turned recursive function to loop.

Check this one (clang)


That's not smart - that's pattern matching.

Why the hell doesn't gcc eliminate the loop in this case?


Oddly enough, gcc 5.4 does.

Try the same thing again with -Os and see the difference.

This site is extremely valuable to produce good quality reports of GCC bugs. For https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67283, I was able to track multiple regressions on a GCC optimization over the different versions of GCC, within few minutes. Doing the same manually would have been extremely tiresome. Kudos to the website authors!

We use this all the time at work. Also, Matt Godbolt sits about 3 rows away from me, which helps as well!

One optimization I've always found impressive:

    #include <stdint.h>

    uint32_t testFunction(uint32_t x)
        return ((x & 0xff) << 24) | ((x & 0xff00) << 8) | ((x & 0xff0000) >> 8) | ((x & 0xff000000) >> 24);
compiles into:

    testFunction(unsigned int):
        mov     eax, edi
        bswap   eax
Another fun one, that only works with clang:

    #include <stdint.h>

    int testFunction(uint64_t x)
        int count;
        for (count = 0; x; count++)
            x &= x - 1;
        return count;
compiles into:

    testFunction(unsigned long):
        popcnt  rax, rdi

As a compiler engineer, those are some of the least impressive things done by modern compilers. Why? Because there isn't anything smart behind them, it's pure pattern matching.

Now I'm curious, what are the most impressive things modern compilers do?

probably this: http://polyhedral.info/

Impressive? Those are probably specially-coded optimizations rather than the result of generic optimizations.

As I understand it, the bswap optimization doesn't pattern-match for that specific code (and it works for any equivalent shift/mask code); it analyzes the flow of the bits to figure out the final bit pattern relative to the original, realizes the whole expression had the effect of a bswap, and substitutes a bswap.

Actually, in LLVM's case it really is just explicitly checking for a few specific bswap patterns rather than doing anything fancy:


Yes, that's called pattern matching.

Not the kind of pattern matching they were talking about.

The second example makes sense: rdi stores the first subroutine argument in amd64 calling conventions and rax stores the return value. I imagine it's easy for the compiler to infer that the loop is just counting 1 bits, which popcnt does. I imagine if you called functions or initialized local variables in the loop, it would be much more messy

The latter is generally not faster on most modern chips. Results in similar microcode to non vectoring assembly, but you need to put data in the SSE registers which is not free.

A smart compiler would not use the new SSE version. (perhaps you didn't set the flags right?)

This is why: http://0x80.pl/articles/sse-popcount.html

There is no sse usage in the popcnt example posted by the OP. That's probably why you are getting downvoted.

Also, look at Matt's write up of how this works:


as a pragmatic example (incl. all tools & configs) of how to build an auto scaling & deploying site, without overdosing on kool-aid.

I like the highlighting – I finally understand why people say compilers generate faster code than a human could write. Pasted in a pretty straightforward and already fairly optimized function I'd written and of course got back something also pretty straightforward, then I put in "-Ofast -march=native" and, wow. Completely rearranged everything in a totally non-linear way and generated all sorts of bizarre (to me) instructions. I think I'd need to study it for months if not years to understand what the compiler did.

> I think I'd need to study it for months if not years to understand what the compiler did.

Yes, but once you did you might see how you might improve it.

Your brain, while slower at code translation than gcc, can think of things that gcc never will, and yet things that can beat your brain[1] are still slower.

[1]: https://en.wikipedia.org/wiki/Superoptimization

On the other hand, try -Os as in size optimization. You will get readable code that is not slow.

I think most people internally optimise readability and thus size, those who don't generally make unmaintainable mush.

Do realize that no Linux distribution ships such binaries. They do the equivalent of -march=$worstSupportedProcessor (eg x86_64). To actually get any benefits from the smart compiler on these you'll need to use FMV and proprietary compiler extensions (eg. GCC per function options), write runtime dispatch code etc. (GCC also has proprietary extensions for that).

Solus Linux builds some libraries into /usr/$LIB/avx and modified glibc's ld loader to load these libraries if the CPU happens to have avx2. https://github.com/AOSC-Dev/aosc-os-abbs/issues/522#issuecom...

(In that issue I am trying to figure out some FMV stuff.)

In this particular case it just replaced a few of the x86 instructions with AMD64 ones.

If you've never done it before, try stepping through a Haskell program compiled with a recent GHC. The control flow is pretty weird.

You can follow ghc's compilation with '-v5'. Alas, I've never liked the way it compiles - full of unpredictable jumps, makes no sense at asm level, and just doesn't look a good modern CPU citizen.

Of course I am only a minor Haskell learner but jhc could do whole-program compilation with readable assembly.

As for gcc, try 'gcc -fdump-tree-all -fdump-rtl-all-slim' and read the 50 files it prints.

Cool! Certainly quicker than loading up cross compilers (even when they're so easy to get on Debian/Ubuntu), building a binary, and running the right version of objdump.

The Rust Playground at https://play.rust-lang.org/ has a similar function, letting you check ASM, LLVM IR, and MIR (Rust's mid-level intermediate representation) output for current versions of the Rust compiler.

FWIW godbolt also does rust (http://rust.godbolt.org) which is convenient because play.rust-lang.org doesn't perform the aggressive cleanup/removal of assembly godbolt does.

Though it looks like godbolt only has a Rust compiler up to version 1.11 (current stable is 1.13), so it may not be including optimizations in the current Rust release.

Really cool, it surprised me to see even trivial code give very different results in gcc,icc and clang.

  int retNum(int num, int num2) {
      return (num * num2)/num;
gives this in clang

  retNum(int, int):     # @retNum(int, int)
        mov     eax, esi
While icc and gcc give

  retNum(int, int):
        mov     eax, esi
        imul    eax, edi
        idiv    edi

  retNum(int, int):
        imul      esi, edi                                      
        mov       eax, esi                                     
        idiv      edi                                           
The clang version at first sight seem right. But then thinking about it this is integer math.

  4/3 := 1
1 * 3 := 3 leads to 3 != 4 I believe gcc and icc returning 3 there is correct, while clang returning 4 is not. Maybe someone more C/int versed can tell us which are acceptable (knowing C both might be ok)

The product of 'num' and 'num2' is always going to be a multiple of 'num', so there won't be any error introduced by flooring when dividing by 'num' again.

One thing that can happen is an integer overflow: if you pass (0x10000, 0x10000), icc's and gcc's versions will calculate 0x10000 * 0x10000 = 0, 0 / 0x10000 = 0, while clang will return 0x10000. But clang's not wrong: signed integer overflow is undefined behavior in C, so compilers are allowed to just assume it never happens when making optimizations.

There is a flag that enables such optimization behaviour for GCC. I think it is part of Ofast.

If it does not work in new GCC it is a bug.

(num * num2)/num = num2

The division cancels out the multiplication. Just like if you were doing arithmetic on paper as you did in school. There's nothing more to it than that is there? What inputs do you think it's incorrect for in clang?

Overflow is of course undefined.

Unless num is zero. In which case you won't get a floating point signal, you get a number that doesn't male sense.

But a program with division by zero is undefined anyway.

Funny, if you swap the order of the operands in the multiplication, GCC 7.0 with -O3 does optimize it away:

  int retNum(int num, int num2) {
        return (num2 * num)/num;
now becomes

  retNum(int, int):
          mov     eax, esi

It would be an interesting project to take in some (small) source program, and try all sorts of algebraic rearrangements (maybe just associativity and commutativity of addition and multiplication) to see if there is any noticeable performance change.

STOKE, a stochastic superoptimizer, is pretty interesting in this context: http://stoke.stanford.edu/ & https://github.com/StanfordPL/stoke

the performance definitely varies doing this by hand - I guess the current optimizers don't rearrange the code as much as a human can while seeing that it's still equivalent. But sometimes (I'm using LLVM) just rearranging the parentheses to force a different evaluation order makes a difference where I thought it wouldn't have after optimization.

This seems to be the case in 6.1 but not in 5.4. What this proves is also that this tool is really useful to check easily between gcc versions without installing all compilers /toolchains

My thinking was for this C

  return (num / num2) * num2;
Where the division happens first.

That is what I get for thinking about this before coffee.

Err, it's always ok, and shouldn't ever return the wrong answer... Unless I'm misreading it.

What arguments do you think it will do the wrong thing with?

I know C but no Assembler, so this looks like a good way to get more familiar with what's going on under the hood. It would be really neat if you could click on each instruction/register to get a short summary, though.

This is such a good suggestion.

I'd love to see Motorola 68000 support.

Note that "GodBolt" isn't some clever Zeus-inspired service name, "Godbolt" is just this fellow's last name.

But the guy fellow himself may have some divine inspiration, as his work is often next to godliness.

For D, https://d.godbolt.org/

Also this is actually a repost from one year ago,


Was gonna say, pretty sure I know about this site because of HN.

There's a Go version too: https://go.godbolt.org.

Yes, and the title is actually wrong, GodBolt is the last name of the creator, not the project's name (which I believe is Compiler Explorer)

And a Rust version : https://rust.godbolt.org/

Can anyone please explain why this naive pow() implementation is so freaking huge? I lack the chops to figure this out properly https://godbolt.org/g/YFvNWa

The loop was unrolled with a large factor.

The loop was also vectorized, so that each multiply is a 4-element vector multiplication, with the final result being a horizontal reduction.

The rest of the mess is trying to account for all of the cases where the input power is not a multiple of 128.

Is this just a toy or does it have any place at all in real assembly projects? (I don't know assembly besides having tinkered with it a bit)

Cool project regardless.

It's really useful to see what key pieces of code look like in assembly (though easy enough to do locally), and reaaaaaaaallllly useful to see what they look like across compilers and architectures.

Instead of finding (or worse, building) a compiler or cross compiler locally and doing all the boring steps to compile properly (harder than it sounds) and disassemble the code, you can just splat some code into this and take a peek.

Less about "assembly projects", more about breadth of info available to a developer writing C/C++.

When I used to develop firmware, I would write up the critical inner loops code there and look at the assembly output and tweak the functions to result in better assembly code. Sometimes you can express things a little differently to get around a weakness of the compiler.

Useful way to see how different compilers interpret the C standard. Just code up a minimal example and see what assembly they produce.

i think the primary purpose is to find out what your compiler will do out of your C/C++ program, to see e.g. what optimizations are performed. but who knows, maybe it's also used for actually writing assembly :)

I use it to check things like copy elision etc in my C++.

This is an amazing tool, I use it almost daily. Whenever I want to test an idea, or see what is going on in the assembly, I go straight to godbolt.

What kind of work do you do?

I also use it at least weekly - working on a Ruby JIT I want to see what assembly code is generated from simple C code and compare that to what I'm getting.

I don't know about op, but we use the tool a lot. Currently writing performance critical bits of graphics code.

shellcode engineer

I can't edit the code samples using a mobile Firefox browser. Attempting to delete text and then type new text results in the deleted text reappearing appended to whatever new text I typed.

Sadly the mobile support is pretty bad. It's on my list to fix at some point but requires a bunch of changes in the underlying window library (golden-layout.com), plus some work to reconfigure the layout for mobile.

Cool. I missed the target option list on the right the first time I looked at the page.

The closest I've ever come to this back in the day was running (Borland) Turbo Debugger's assembly view after building with Turbo C (w/ or w/out -O...), as either 8086 or 80286 output - "x16". Yeah, that was a while back :-)

That's really cool. Is it actually compiling in the background? Is this a tool you wrote?

It is running the specified compilers. The github page has more info on how it works.


There's a post on his blog about how the entire thing runs a bunch of compilers on EC2:


It uses gcc behind.

I had a chance to read part of the sources and prepare a patched version with a set of toolchains (m68k, amd64, mips). It's nicely written, it's easy to add toolchains as well as to set default compilation options.

It is kind of what you get when you run gcc with the -S option.

folks interested in doing this locally should play around with objdump

Xcode has a disassembly view that's really similar (and works with lldb!).

After moving back to linux I haven't found a replacement ide that's as nice for doing c++ development.

i have heard good things about clion. codelite works fine too for basic needs.

Have you tried CLion?

Awesome project. And Just in time for an assembly-intensive computer organization final

What would be incredibly useful is if I could do this within vim doing side-by-side split.

Java version @ https://javap.yawk.at/

that's just generated bytecode, not the assembly output on a particular platform generated by the optimizing JIT.

Most of the optimizing magic happens in the JITs, not javac.

Reminds me of SPHINX C--, which was great.


        mul     r0, r0, r0

        bx      lr

   Mips is damn awesome.

we need a DevilBolt: Enter Assembly, get C!


Pity it isn't a recompiler.

really nice project - i was playing with compilers and javascript a while ago see http://embeddednodejs.com/compiler - as a simple experiment to learn about javascript's role in compilers

Should be "get assembler". Assembly or assembling is the process of using an assembler to assemble assembler code, which is actually mnemonics.

Assembly language or assembly code is the most typical usage. There's nothing wrong with calling it assembler code either, but it's less common.

Source: I started writing assembly code in 1968, over the years coding for Sigma 5/7, PDP-10, TI 9900, 6502, 8080, Z80, 8086, 68000, x86, and ARM. I occasionally saw it called assembler language, but less often than assembly language.

Update: As I think about it, there is a bit more nuance to the terminology that I remembered at first. While I do recall "assembler language" and "assembler code" as being infrequently used, "assembler" by itself has often been a common usage:

"I'm writing this in assembly code."

"I'm writing this in assembler."

Those were used fairly interchangeably, even by someone like me to whom "assembler code" sounded a bit off.

So you're definitely not wrong to call it "assembler". Where you're going wrong is claiming that "assembly code" is incorrect.

Pick up any book from the '80's on learning assembler and you'll be hard pressed to find "assembly" anywhere, especially for MOS 6502, Z80 or MC68000.

My goodness, if you really insist on arguing this point, here are a few books for you.










Donald Knuth calls it 'assembly' in The Art of Computer Programming, which he started to write in the early 1960s.

When I was coding Z80 in the late 80s/early 90s (on the Amstrad CPC) it was always either "assembly language" or "assembler".

It's definitely also assembly, possibly even as the primary spelling - https://en.wikipedia.org/wiki/Assembly_language

No, that's just lame (and Wikipedia isn't always right). Ask any scene coder and they'll tell you they're coding in assembler, not "assembly".

Why are scene coders more correct than other coders.

If Agner Fog has no problem calling it 'assembly'[0], then neither do I.

0: http://agner.org/optimize/

Why are scene coders more correct than other coders.

Because they've written more applications in assembler than anyone else in the history of computing. Look at Aminet, and csdb.dk, and that's just Amiga and C=64, I haven't even included the Spectrum, Amstrad CPC, Atari ST and PC.

> Because they've written more applications in assembler than anyone else in the history of computing.

Absolute unbridled nonsense.

Have you considered asking https://en.wikipedia.org/wiki/Assembly_(demoparty) to rename itself, then?

They've been using that name since 1992, and if those guys aren't "scene coders" then I've never seen one.

I never went back in the day, but pretty sure I didn't primarily interpret it as "ooh, a congregation of people".

It shouldn't. 'Assembly' is the textual code, 'assembler' is the tool used to convert the text to machine code. I don't think the name is worth an argument, but your suggestion is no more correct than that is already in the title.

Considering I spent my youth writing assembler code, I find it pretty upsetting someone from the internet lecturing me on what I was deeply involved with is called. I think I know what I was writing.

It's a dangerous business, going onto the internet. You step into a site, and if you don't keep your feet, there is no knowing what you might hear.

I don't think anybody is claiming it's not called assembler, just that it's also / mostly called assembly. fwiw I wrote some [machine code via mnemonics] in the '80s (admittedly not lots) and assembly sounds more normal to me.

I like to call it machine code.

But it isn't. Machine code is for computers, assembly for humans.

Although assembly is machine code represented with mnemonics, they're not the same: assemblers take assembly and produce machine code.

To me, machine code would mean the output of the assembler. Maybe the terminology isn't tied down that well though.

> Assembly or assembling is the process of using an assembler to assemble assembler code

Assembly is also the result of something that was assembled.

No, those are called objects or object files. ;)

(which is why the tool to handle them is called objdump)

I'm talking general English. What happens when you assembly a group of people? You have an assembly. The same can be said for other objects.

This should read:

> What happens when you assemble a group of people? You have an assembly. The same can be said for other objects.

Tell that to your National Object File.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact