
No Sane Compiler Would Optimize Atomics - adamnemecek
https://github.com/jfbastien/no-sane-compiler
======
yosefk
Torvalds on memory ordering and clever compilers
([http://yarchive.net/comp/linux/memory_barriers.html](http://yarchive.net/comp/linux/memory_barriers.html)):

"Compiler people tend to think that people want smart compilers. I as a
compiler power-user can emphatically tell you that that isn't true. People
want _reliable_ compilers, and they want compilers that can be told to just
get the hell out of the way when needed."

I guess I'm largely on the same side of the argument, on the theory that
atomics etc. should be used very sparingly, by people who know what they're
doing, spend time hand-optimizing their code, and think it through
semantically (and these people aren't helped by the semantics of atomics
becoming more complicated for the sake of optimization opportunities). Other
language mechanisms, by contrast, are used all over the place by everyone, so
with those you do gain a lot from clever optimizations. But I understand the
other side of this argument...

~~~
sgift
Torvalds' argument makes for a nice quote but reality tells a different story:
If people just wanted a reliable compiler, they would use the oldest compiler
they could find that is able to compile their code. If it is old, all its
bugs, problems and faults are known. You know how to work around them, and
nothing can surprise you anymore. The performance may not be great, and it may
be a bit of a pain to do some things, but you know it all; not great, but
reliable.

What people really want is something which is reliable _and_ smart _and_ fast
_and_ ... People love smart compilers/programs/whatever; only when those
programs fail them do they cry "why did you try in the first place? No one
wants this!"

There's a thin line between smart and crazy. Trying to be always on the right
side is one of the hardest parts of software design.

~~~
gus_massa
I agree. Another proof is that people like the -O3 optimization mode, when
the -O0 mode provides the same functionality with less compile time (at the
cost of more run time).

~~~
qb45
I was under the impression that, at least with gcc, everybody uses -O2, and
-O3 is reserved for those who have the patience to verify that their software
not only still works correctly, but actually becomes faster.

FWIW, I once tried -O3 on some number-crunching code a few years ago and the
results were mixed. Improvements, if any, came from only one or a few of the
many -f flags enabled by -O3.

Oh, and I also tried -O3 on Gentoo. Some weird gtk crashes; had to rebuild
back with -O2.

------
qb45
I love the WTF potential of this one:

 _Another interesting optimization is to use potentially shared memory
locations (on the stack, heap and globals) as scratch storage, if the compiler
can prove that they are not accessed in other threads concurrently. [..] For
example the following transformation could occur:_

    
    
      // Some code, but no synchronization.
      *p = 1; // Can be on stack, heap or global.
    

_Becomes:_

    
    
      // ...
      *p = RAX; // Spill temporary value.
      // ...
      RAX = *p; // Restore temporary value.
      // ...
      *p = 1;
    

_Since we write to p and there are no synchronization operations, other
threads do not read/write p without exercising undefined behavior. We can
therefore use it as scratch storage._

~~~
mike_hock
I see no WTF here. Concurrent reads from *p _have_ to be allowed to produce
any possible value whatsoever, since if the variable is larger than a system
word, the compiler may have to break up the write into multiple instructions
(and some architectures don't even have instruction-level isolation).

The language doesn't give you any stricter concurrency guarantees just because
a variable is sizeof(void*). And that's _good._ It makes the language
cleaner. No bullshit special cases that require you to know implementation
details. You want atomic behavior, use std::atomic (or a lock), period.
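
A minimal sketch of the distinction (my own example, not code from the paper):

    #include <atomic>

    int plain;                 // unsynchronized access from another thread is
                               // UB, so the compiler may use it as scratch
    std::atomic<int> flag{0};  // concurrent access is defined behavior

    void writer() {
        plain = 1;                                  // a spill/restore pair may
                                                    // legally be inserted here
        flag.store(1, std::memory_order_release);   // the compiler must not
                                                    // invent accesses to this
    }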

~~~
xorblurb
This prevents debugging, probably does not really optimize (except 1% in
useless microbenchmarks), and increases compiler complexity and the risk of
bugs in it. A similar thing has already happened in the past: compilers
started to introduce speculative writes on code paths where the write did not
occur in the source code, and this broke multithreaded programs. IIRC there
was no threading model at that time (the current one forbids that kind of
thing, and it is probably impossible to get a sound threading model that does
not forbid it), but instead of saying that they were technically allowed to do
it, compiler authors recognized that, because it made multithreaded programs
impossible to write, they had to fix their shit.

In the C standard, crashing is not a specified behavior, so when an obscure
architecture crashes on a specific construct, the standard has to technically
say that this is undefined behavior. That compiler authors now interpret most
undefined behavior as a license to be stupid on ALL architectures, instead of
using their brains and allowing non-portable programs to exist, is not them
being smart and optimizing all our code-bases for free, but just proof that
they do not understand part of what they are doing, including the standard
they pretend to know so well.

So the position of "all the standard, nothing but the standard" is utter
bullshit. Another example: the currently published atomic specification is
subtly unsound. Do compiler authors do the right thing instead of trying to
implement the broken spec while waiting for the spec to be fixed? On that
point, yes -- at least for now. So when, on other subjects, they introduce new
bugs into compiled code compared to what all previously known compilers
produced -- and without proving that this _really_ improves performance like
crazy, to justify taking that kind of risk -- they are just being dangerous
asses. -O2 has shown in the past that it can be fast enough without
(unreasonable) code-breaking behavior. In most cases you want to keep the same
behavior as -O0 built programs. An optimisation that does not preserve the
behaviors reasonably expected by skilled programmers, and that requires every
programmer to know the standard by heart at every point of their work under
penalty of crashes and security holes, is not an optimisation but a piece of
crap. This is blurry, but the C/C++ standards are blurry and their usage even
more so -- and more importantly they do not forbid staying safe and sane to
begin with -- so doing formal logic instead of doing engineering, by thinking
about the whole picture and its history, is completely stupid.

~~~
gpderetta
'the current published atomic specification is subtly unsound'

Are you talking about memory_order_consume? Although I believe that
correctness proofs have so far excluded memory_order_consume, I do not think
the current spec has been shown to be unsound; it is just generally regarded
as hard to implement and hard to use correctly. In practice most compilers
(conformingly) just lower it to memory_order_acquire; even Linus has been
considering doing away with all the dependent-load trickery in the kernel and
just converting all of them to explicit load-acquires.

x86 and SPARC are TSO machines and have zero cost acquires, modern POWERs can
also be run in a similar sane TSO mode and ARM64 has cheap explicit load
acquire and store release (which I believe were added explicitly to allow
simple mapping of the C and C++ memory model). Even Itanic has (I assume
cheap) load acquire and store release operations.

Other insane architectures that actually require dependent loads for
performance can go and have anatomically impossible sex with themselves.
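
To illustrate what "zero cost" means here (my own sketch, not from the talk):

    #include <atomic>

    std::atomic<int> guard{0};

    int read_guard() {
        // On x86/SPARC (TSO) this lowers to a plain load; on ARM64 it is a
        // single LDAR instruction -- no separate fence needed.
        return guard.load(std::memory_order_acquire);
    }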

edit: spelling

~~~
jfbastien
The latest on consume: [http://wg21.link/p0190r0](http://wg21.link/p0190r0)

The spec is pretty sound, but unimplemented, because that's not how compilers
currently work for the generality that consume wants. I think the above
proposal is likely to work because it's more restrictive about how consume
works, but we need implementation experience.

Linux uses consume extensively for RCU, and I don't think it'll go away soon
because of the benefits it provides to Power and ARM. It works through a
gentleperson's agreement between Linux's RCU implementation in C and GCC.
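
A sketch of the RCU-style pattern consume targets (hypothetical names, not
actual kernel code):

    #include <atomic>

    struct node { int payload; };
    std::atomic<node*> head{nullptr};

    void publish(node* n) {
        n->payload = 42;
        head.store(n, std::memory_order_release);   // publish the node
    }

    int reader() {
        // consume orders only loads data-dependent on p (like p->payload),
        // which is free on Power/ARM; compilers today conservatively
        // promote it to acquire.
        node* p = head.load(std::memory_order_consume);
        return p ? p->payload : -1;
    }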

~~~
gpderetta
There was a recent thread on lkml where, after a long discussion on
memory_order_consume, Linus wondered what the cost would be of replacing
ACCESS_ONCE with a proper acquire. Someone ran a test on arm64 and there was
no measurable effect. Linus then actively considered the switch. I doubt it
will happen right now or even in the next few years, but it could happen
eventually.

------
alblue
The presentation is unreadable as it jumps all over the place and requires you
to press keys to navigate.

~~~
kingosticks
What sane person would write a presentation like this? Horrible to try and
read, particularly on a phone. Whatever his message was, it's lost on me.

~~~
pjmlp
A hipster one. Reveal.js presentations are everywhere nowadays.

------
Animats
C/C++ language designers used to take the position that concurrency was an OS
issue, not a language issue. That came unglued as processors did more and more
out-of-order execution.

The C/C++ language doesn't really understand concurrency. Even locking is
iffy, because the language doesn't know what data is locked by a lock. Rust,
and to a lesser extent Java, associate locks with data, which gives the
compiler some information to guide what it can't optimize. C/C++ have a whole
series of kludges - "volatile", "atomic", etc. - which patch holes in race
conditions but don't provide a coherent concurrency model. Which is why we see
papers like this.
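
For what it's worth, the association can be approximated in C++ with a
wrapper that only exposes the data under its own lock (a hypothetical sketch,
not a standard facility):

    #include <mutex>

    template <typename T>
    class Locked {              // the data is only reachable with the lock held
        std::mutex m;
        T data{};
    public:
        template <typename F>
        auto with(F f) {
            std::lock_guard<std::mutex> g(m);
            return f(data);     // f receives a reference to the guarded data
        }
    };

    Locked<int> counter;
    void bump() { counter.with([](int& c) { c += 1; }); }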

~~~
gpderetta
"C/C++ language designers used to take the position that concurrency was an OS
issue, not a language issue"

That was the case until Boehm's 'Threads Cannot be Implemented as a Library'
position paper, which led the C++ Standards Committee to formally define the
C++ memory model (whose correctness was also independently proven [1]). This
standardization work was greatly inspired by the corresponding Java MM, and in
a break from the past, it explicitly disallowed a few compiler optimizations
and even made the C++ standard unimplementable on a few quirky older
architectures, for the sake of sanity.

Today there are no arguments about whether a concurrent algorithm (or a
compiler optimization) is conforming or not, as there are both dynamic and
static checkers that can verify it against the C++ memory model.

I wish the same rigorous formalization work were done for the strict-aliasing
rules of the language, which are still a mess of underspecified, ambiguous and
mutually contradictory rules scattered all over the standard document.
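
A classic instance of the mess (illustrative example of mine, not text from
the standard):

    #include <cstdint>
    #include <cstring>

    float f = 1.0f;

    std::uint32_t bits_bad() {
        return *reinterpret_cast<std::uint32_t*>(&f);  // aliasing violation: UB
    }

    std::uint32_t bits_ok() {
        std::uint32_t u;
        std::memcpy(&u, &f, sizeof u);   // the sanctioned way to type-pun
        return u;
    }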

[1] The original drafts in fact had issues, which were found when attempting
the correctness proof; these flaws have been corrected, except for a
still-standing issue with out-of-thin-air values. It is not believed to have
any impact on code with current compilers and CPUs, but there is an ongoing
effort to correct it.

------
kolapuriya
I haven't dealt with concurrency in C++ yet, but I definitely agree with the
"don't write assembly, file bugs for the compiler instead" methodology. And
not because I don't like ASM -- some of my favourite code consists of ASM
hacks -- but because better compilers help everyone.

------
raverbashing
(If it's not clear: you need to navigate the presentation with both the
left/right and down keys.)

You know what? This is getting pathological. It seems keywords are added to
ensure serialization/atomicity, then compilers find a way to optimize it
away/make it useless, then another "now it's the right way" thing gets added.
Rinse, repeat.

And most of the things that are added have some quirks, with people trying to
play it smart.

~~~
sjolsen
>It seems keywords are added to ensure serialization/atomicity, then compilers
find a way to optimize it away/make it useless

No, what happens is keywords (or more broadly, semantics) are added which
provide certain guarantees given certain preconditions. Then, developers write
code that depends on those guarantees and fails to meet the preconditions, but
happens to work anyway. At some point, the implementation adds some
optimization which -- while still providing the appropriate guarantees when
the preconditions are met -- breaks that code.

I know C and C++ can be a royal pain in the ass sometimes, but I have little
sympathy for developers who flip the "I know what I'm doing, please assume my
code is right and make it run fast" switch, then get upset with the compiler
when it turns out that their code actually _isn't_ right and the compiler
performs an optimization they weren't expecting. If you don't want the
compiler to reorder your memory access, then don't fuck around with the
memory_order flags.
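
A sketch of that failure mode (invented example):

    #include <atomic>

    int data;                          // plain, non-atomic
    std::atomic<bool> ready{false};

    void producer() {
        data = 42;
        // Precondition not met: relaxed creates no happens-before edge.
        ready.store(true, std::memory_order_relaxed);
    }

    void consumer() {
        while (!ready.load(std::memory_order_relaxed)) { }
        // Data race on 'data': often happens to work on today's x86,
        // until an optimization (or weaker hardware) breaks it.
        // The fix is release/acquire on 'ready'.
        int d = data;
        (void)d;
    }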

~~~
raverbashing
> Then, developers write code that depends on those guarantees and fails to
> meet the preconditions, but happens to work anyway.

This reminds me of the memcpy/memmove issue. You should use "move" when the
areas overlap, not memcpy.

The real question is: why are there two versions? Checking whether the memory
regions overlap is _very cheap_ compared to copying, say, 16 elements. And if
you're memcpy'ing smaller sizes you can probably do it manually.
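
The check is just two comparisons -- something like this (a sketch, with a
made-up name):

    #include <cstdint>
    #include <cstring>

    // Hypothetical "always safe" copy: test for overlap, then dispatch.
    void* safe_copy(void* dst, const void* src, std::size_t n) {
        std::uintptr_t d = reinterpret_cast<std::uintptr_t>(dst);
        std::uintptr_t s = reinterpret_cast<std::uintptr_t>(src);
        if (d < s + n && s < d + n)            // regions overlap
            return std::memmove(dst, src, n);
        return std::memcpy(dst, src, n);
    }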

> I know C and C++ can be a royal pain in the ass sometimes

Yes, they lack enough information to know what you are actually trying to do
and have to guess a lot of things. I'm not sure how "C++ smart" modern
compilers are. (Example: if you pass an object by value and only read one
field, can it optimize this?)

~~~
david-given
I see they've finally changed the memcpy() specification to forbid overlapping
memory areas completely.

Previously, it was defined to copy from low addresses to high addresses, which
meant you could use this to fill an array:

    
    
        p[0] = 42;
        /* relies on a byte-by-byte, low-to-high copy propagating p[0] forward */
        memcpy(&p[1], &p[0], sizeof(p)-sizeof(*p));
    

And people _did_.

~~~
caf
The original 1989 ANSI C specification stated, in _"4.11.2.1 The memcpy
function"_:

 _If copying takes place between objects that overlap, the behavior is
undefined._

So it has been this way in C since the first official standard. I don't have a
first edition K&R so I can't see what that said, though.

~~~
david-given
Huh. You're quite right.

Well, it may have been undefined, but _in practice_ the behaviour was
standardised and couldn't be changed without breaking existing code... which
is basically C in a nutshell.

~~~
caf
Code like your example would have been broken by real standard library
implementations very early in the piece, because even a forward-copying
implementation that does word-at-a-time copies (a straightforward and obvious
optimisation with real and significant benefits on many machines of the era)
would have broken it.
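
A sketch of why (illustrative only; real implementations also handle
alignment and tails):

    #include <cstddef>

    // Simplified forward, word-at-a-time copy.
    void* word_copy(void* dst, const void* src, std::size_t n) {
        unsigned long* d = static_cast<unsigned long*>(dst);
        const unsigned long* s = static_cast<const unsigned long*>(src);
        for (std::size_t i = 0; i < n / sizeof(unsigned long); ++i)
            d[i] = s[i];  // each whole word is read before any of it is
                          // rewritten, so the overlapping bytes still hold
                          // their old garbage and the "fill" no longer
                          // propagates
        return dst;
    }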

