I would be much more impressed if I hadn't taken a compilers course. I reckon (god alone knows exactly what GCC does) this is just linear induction variable substitution[1] (so `x` gets replaced with `i*2`), then associativity of integer multiplication, then some (probably builtin) rule that `n%n` is always 0. From there, it is pretty straightforward (see the sketch below).
Don't get me wrong - the devil is in the details, and getting optimizations that are both powerful and applied only when they are valid, at the right time, is difficult as hell. That said, I do expect compilers to be at least this smart.
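For reference, the loop under discussion has roughly this shape (my reconstruction; the article's exact code may differ):

    int f(int n) {
        int x = 0, count = 0;
        for (int i = 0; i < n; ++i) {
            if (x % 2 != 0)  /* after substitution x == i*2, and (i*2) % 2   */
                ++count;     /* is always 0, so this branch is provably dead */
            x += 2;
        }
        return count;        /* the whole function folds down to "return 0;" */
    }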
> I do expect compilers to be at least this smart.
Does anyone know if, internally, compilers second-guess / double-check themselves? For instance, when they detect a clever shortcut, do they generate and quickly run [1] the optimized bytecode on a sampling of inputs to verify it is functionally equivalent to what the non-optimized bytecode outputs?
[1] obviously cross-compilers are a thing, so this wouldn't be possible if it were compiling/optimizing for a separate architecture.
> For instance, when they detect a clever shortcut, do they generate and quickly run [1] the optimized bytecode on a sampling of inputs to verify it is functionally equivalent to what the non-optimized bytecode outputs?
No. Optimization transformations are generally expected to result in provably identically-functional code (or sufficiently identical, in case of e.g. floating-point optimizations). Otherwise it is a bug.
Indeed, “provably” being the key word. If I, the compiler developer, prove (in the mathematical sense) that the transformation is functionally identical, I don’t need the compiler to run any experiments to verify it. A proof is much more solid than running a few experiments.
Detecting functional equivalence of Turing-complete languages is a very non-trivial problem. So, no, compilers don't try to construct runtime proofs of their correctness.
There are people who do do more exhaustive checks of the peephole optimizations for correctness--or finding new opportunities (see John Regehr's Souper work). But production compilers are well-known (at least by anyone who works on them) for having lots of bugs in these kinds of optimizations.
>then associativity of integer multiplication, then some (probably builtin) rule that `n%n` is always 0.
Is this quite right? It's not true in general that (a * b) % c = a * (b % c). E.g. it's not true for 2,3,4. The relevant generalization is that (a * b) % b = 0, which has nothing to do with associativity.
It's 0xCCCCCCCD in hex, which looks a lot more structured at least. Though how exactly this makes sense is still unclear :-)
[edit] And playing around with the increment of x, it keeps throwing in magical values which form a repeating pattern in hex, with the LSB usually off by one...
I thought 0xCCCCCCCD was a very nice number so I asked myself why it would come out like that. I wondered whether
0.CCCCCCCC.... = 4/5 had anything to do with it.
Using hexadecimal we have

    5 * 3 = 10 - 1, so
    5 * C = 40 - 4, so
    5 * (1 + C + C0 + C00 + C000 + ...) = 5 - 4 + 40 - 40 + 400 - 400 + ...

where every intermediate term cancels and the last term is 0 mod 2^32, so the whole product comes out to 1. And since 1 + C = D, the sum in parentheses is exactly CCCCCCCD.
This relates to 0.CCCCCC.... = 4/5 because, in order for that to come out right, multiplying 0.CCCC... by 5 has to have all the intermediate terms cancel, leaving only a 4.
So why would it take its digits from 4/5 and not 1/5? Because it takes x=4 to make the 5-x subtraction in 5 - x + x0 - x0 + x00... come out to 1.
So this gives us a general rule for inversion (for odd numbers less than 16; even numbers have no inverse mod 2^32). Take the hexadecimal expansion of (n-1)/n, truncate to get enough hex digits, and add 1 to the least significant hex digit.
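A quick sanity check in C (my own snippet, relying on unsigned wraparound being well-defined):

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        uint32_t magic = 0xCCCCCCCDu;  /* candidate inverse of 5 mod 2^32  */
        printf("%u\n", 5u * magic);    /* prints 1: the product wraps to 1 */
        printf("%u\n", 35u * magic);   /* prints 7: exact division by 5    */
        return 0;
    }

Since 5 * magic wraps to 1, multiplying any exact multiple of 5 by the magic constant divides it by 5.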
Well if A=1 (I know it's not), D is 5-1. I've noticed that whatever you change five to fits this pattern. Ex: +=3 in the code gives you AAAAAAAB, and B would be (3-1) if A=1.
"xor r,r" is intel-ese for "set r to zero". For various reasons it's faster and smaller than the more obvious "mov r,0".
From this we can see that gcc and clang have figured out that the "if" can never be true, then they elided the loop, then optimized out the variable, ending in the equivalent of "return 0;".
Presumably there is a pass to turn f(recur(x), y) into recur(f(y, acc), x) and then tail-call optimization can be applied. This works for any associative f.
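For illustration (my own sketch, not from the article):

    /* Not tail-recursive: the addition happens after the call returns. */
    int sum(int n) {
        if (n == 0) return 0;
        return n + sum(n - 1);
    }

    /* Accumulator version: the recursive call is now in tail position,
       so the compiler can turn it into a loop. The rewrite is valid
       because + is associative. */
    int sum_acc(int n, int acc) {
        if (n == 0) return acc;
        return sum_acc(n - 1, acc + n);
    }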
It's just vectorizing the calculation so that it's faster than a purely iterative calculation. The 64-bit version probably doesn't get this because that optimizer isn't SSE2-aware yet (just a guess, I actually don't know) and can't do SIMD arithmetic with two 64-bit floats.
This site is extremely valuable for producing good-quality GCC bug reports. For https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67283, I was able to track multiple regressions in a GCC optimization across the different versions of GCC within a few minutes. Doing the same manually would have been extremely tiresome. Kudos to the website authors!
As a compiler engineer, those are some of the least impressive things done by modern compilers. Why? Because there isn't anything smart behind them, it's pure pattern matching.
As I understand it, the bswap optimization doesn't pattern-match for that specific code (and it works for any equivalent shift/mask code); it analyzes the flow of the bits to figure out the final bit pattern relative to the original, realizes the whole expression had the effect of a bswap, and substitutes a bswap.
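The classic shift/mask formulation looks something like this (a typical example, not necessarily the code from the article):

    #include <stdint.h>

    /* GCC and clang both collapse this whole expression into a single
       bswap instruction on x86. */
    uint32_t byte_swap(uint32_t x) {
        return ((x & 0x000000FFu) << 24) |
               ((x & 0x0000FF00u) <<  8) |
               ((x & 0x00FF0000u) >>  8) |
               ((x & 0xFF000000u) >> 24);
    }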
The second example makes sense: rdi stores the first subroutine argument in the amd64 calling convention, and rax stores the return value. I imagine it's easy for the compiler to infer that the loop is just counting 1 bits, which popcnt does. I imagine if you called functions or initialized local variables in the loop, it would be much messier.
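Something like this, presumably (my sketch of the kind of loop that gets recognized):

    /* Loop-idiom recognition (clang does this, at least; GCC depends on
       version) can spot that this is a population count and emit a single
       popcnt instruction when the target supports it. */
    int count_bits(unsigned x) {
        int count = 0;
        while (x) {
            x &= x - 1;  /* clear the lowest set bit */
            ++count;
        }
        return count;
    }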
The latter is generally not faster on most modern chips. It results in similar micro-ops to the non-vectorized assembly, but you need to get the data into the SSE registers, which is not free.
A smart compiler would not use the new SSE version. (perhaps you didn't set the flags right?)
I like the highlighting – I finally understand why people say compilers generate faster code than a human could write. Pasted in a pretty straightforward and already fairly optimized function I'd written and of course got back something also pretty straightforward, then I put in "-Ofast -march=native" and, wow. Completely rearranged everything in a totally non-linear way and generated all sorts of bizarre (to me) instructions. I think I'd need to study it for months if not years to understand what the compiler did.
> I think I'd need to study it for months if not years to understand what the compiler did.
Yes, but once you did you might see how you might improve it.
Your brain, while slower at code translation than gcc, can think of things that gcc never will, and yet things that can beat your brain[1] are still slower.
Do realize that no Linux distribution ships such binaries. They do the equivalent of -march=$worstSupportedProcessor (e.g. x86_64). To actually get any benefit from the smart compiler on these, you'll need to use FMV (function multi-versioning) and compiler-specific extensions (e.g. GCC's per-function options), write runtime dispatch code, etc. (GCC also has extensions for that).
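A minimal sketch of the FMV side, using GCC's target_clones attribute (available in GCC 6+; the function itself is my own example):

    /* GCC emits one clone per listed target plus a resolver that picks
       the best version at load time, based on the CPU it runs on. */
    __attribute__((target_clones("avx2", "sse4.2", "default")))
    int sum_array(const int *a, int n) {
        int s = 0;
        for (int i = 0; i < n; ++i)
            s += a[i];
        return s;
    }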
You can follow ghc's compilation with '-v5'. Alas, I've never liked the way it compiles - full of unpredictable jumps, it makes no sense at the asm level and just doesn't look like a good modern CPU citizen.
Of course I am only a novice Haskell learner, but jhc could do whole-program compilation with readable assembly.
As for gcc, try 'gcc -fdump-tree-all -fdump-rtl-all-slim' and read the 50 files it prints.
Cool! Certainly quicker than loading up cross compilers (even when they're so easy to get on Debian/Ubuntu), building a binary, and running the right version of objdump.
The Rust Playground at https://play.rust-lang.org/ has a similar function, letting you check ASM, LLVM IR, and MIR (Rust's mid-level intermediate representation) output for current versions of the Rust compiler.
FWIW godbolt also does rust (http://rust.godbolt.org) which is convenient because play.rust-lang.org doesn't perform the aggressive cleanup/removal of assembly godbolt does.
Though it looks like godbolt only has a Rust compiler up to version 1.11 (current stable is 1.13), so it may not include the optimizations in the current Rust release.
Really cool; it surprised me to see even trivial code give very different results in gcc, icc, and clang.
int retNum(int num, int num2) {
return (num * num2)/num;
}
gives this in clang
retNum(int, int): # @retNum(int, int)
mov eax, esi
ret
While icc and gcc give
retNum(int, int):
mov eax, esi
imul eax, edi
cdq
idiv edi
ret
retNum(int, int):
imul esi, edi
mov eax, esi
cdq
idiv edi
ret
The clang version at first sight seems right. But then, thinking about it, this is integer math.
4/3 := 1
1 * 3 := 3
leads to
3 != 4
I believe gcc and icc returning 3 there is correct, while clang returning 4 is not. Maybe someone more versed in C integer semantics can tell us which results are acceptable (knowing C, both might be OK).
The product of 'num' and 'num2' is always going to be a multiple of 'num', so there won't be any error introduced by flooring when dividing by 'num' again.
One thing that can happen is an integer overflow: if you pass (0x10000, 0x10000), icc's and gcc's versions will calculate 0x10000 * 0x10000 = 0, 0 / 0x10000 = 0, while clang will return 0x10000. But clang's not wrong: signed integer overflow is undefined behavior in C, so compilers are allowed to just assume it never happens when making optimizations.
The division cancels out the multiplication. Just like if you were doing arithmetic on paper as you did in school. There's nothing more to it than that is there? What inputs do you think it's incorrect for in clang?
It would be an interesting project to take in some (small) source program, and try all sorts of algebraic rearrangements (maybe just associativity and commutativity of addition and multiplication) to see if there is any noticeable performance change.
The performance definitely varies doing this by hand - I guess the current optimizers don't rearrange the code as much as a human can while still seeing that it's equivalent. But sometimes (I'm using LLVM) just rearranging the parentheses to force a different evaluation order makes a difference where I thought it wouldn't have after optimization.
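A trivial illustration of what I mean (my own example):

    /* Mathematically the same sum, but floating-point addition is not
       associative, so (absent -ffast-math) the compiler must preserve
       the written association - and the dependency chains differ: */
    double f1(double a, double b, double c, double d) {
        return ((a + b) + c) + d;  /* serial chain of three adds */
    }
    double f2(double a, double b, double c, double d) {
        return (a + b) + (c + d);  /* the two inner adds can run in parallel */
    }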
This seems to be the case in 6.1 but not in 5.4. This also shows that the tool is really useful for easily checking across GCC versions without installing all the compilers/toolchains.
I know C but no assembler, so this looks like a good way to get more familiar with what's going on under the hood. It would be really neat if you could click on each instruction/register to get a short summary, though.
Can anyone please explain why this naive pow() implementation is so freaking huge? I lack the chops to figure this out properly: https://godbolt.org/g/YFvNWa
It's really useful to see what key pieces of code look like in assembly (though easy enough to do locally), and reaaaaaaaallllly useful to see what they look like across compilers and architectures.
Instead of finding (or worse, building) a compiler or cross compiler locally and doing all the boring steps to compile properly (harder than it sounds) and disassemble the code, you can just splat some code into this and take a peek.
Less about "assembly projects", more about breadth of info available to a developer writing C/C++.
When I used to develop firmware, I would write up the critical inner loops code there and look at the assembly output and tweak the functions to result in better assembly code. Sometimes you can express things a little differently to get around a weakness of the compiler.
I think the primary purpose is to find out what your compiler makes of your C/C++ program, to see e.g. what optimizations are performed. But who knows, maybe it's also used for actually writing assembly :)
I also use it at least weekly - working on a Ruby JIT I want to see what assembly code is generated from simple C code and compare that to what I'm getting.
I can't edit the code samples using a mobile Firefox browser. Attempting to delete text and then type new text results in the deleted text reappearing appended to whatever new text I typed.
Sadly the mobile support is pretty bad. It's on my list to fix at some point but requires a bunch of changes in the underlying window library (golden-layout.com), plus some work to reconfigure the layout for mobile.
Cool. I missed the target option list on the right the first time I looked at the page.
The closest I've ever come to this back in the day was running (Borland) Turbo Debugger's assembly view after building with Turbo C (w/ or w/out -O...), as either 8086 or 80286 output - "x16". Yeah, that was a while back :-)
I had a chance to read part of the sources and prepare a patched version with a set of toolchains (m68k, amd64, mips). It's nicely written, it's easy to add toolchains as well as to set default compilation options.
Really nice project - I was playing with compilers and JavaScript a while ago (see http://embeddednodejs.com/compiler) as a simple experiment to learn about JavaScript's role in compilers.
Assembly language or assembly code is the most typical usage. There's nothing wrong with calling it assembler code either, but it's less common.
Source: I started writing assembly code in 1968, over the years coding for Sigma 5/7, PDP-10, TI 9900, 6502, 8080, Z80, 8086, 68000, x86, and ARM. I occasionally saw it called assembler language, but less often than assembly language.
Update: As I think about it, there is a bit more nuance to the terminology that I remembered at first. While I do recall "assembler language" and "assembler code" as being infrequently used, "assembler" by itself has often been a common usage:
"I'm writing this in assembly code."
"I'm writing this in assembler."
Those were used fairly interchangeably, even by someone like me to whom "assembler code" sounded a bit off.
So you're definitely not wrong to call it "assembler". Where you're going wrong is claiming that "assembly code" is incorrect.
Why are scene coders more correct than other coders?
Because they've written more applications in assembler than anyone else in the history of computing. Look at Aminet, and csdb.dk, and that's just Amiga and C=64, I haven't even included the Spectrum, Amstrad CPC, Atari ST and PC.
It shouldn't. 'Assembly' is the textual code, 'assembler' is the tool used to convert the text to machine code. I don't think the name is worth an argument, but your suggestion is no more correct than that is already in the title.
Considering I spent my youth writing assembler code, I find it pretty upsetting to have someone from the internet lecture me on what the thing I was deeply involved with is called. I think I know what I was writing.
I don't think anybody is claiming it's not called assembler, just that it's also / mostly called assembly. fwiw I wrote some [machine code via mnemonics] in the '80s (admittedly not lots) and assembly sounds more normal to me.