
Ways to break your systems code using volatile (2010) - leafario2
https://blog.regehr.org/archives/28
======
raphlinus
This should say 2010. I believe much of it is out of date, as C11 _does_ have
a memory model, and _does_ provide both atomics and barriers. Many, if not
most, uses of volatile should probably be replaced by atomics.

[https://en.cppreference.com/w/c/atomic](https://en.cppreference.com/w/c/atomic)

~~~
icedchai
Many projects are stuck in C99, or even C89...

~~~
BubRoss
And many projects aren't. It is still better to label a title correctly.

------
aidenn0
In terms of "Using volatile too much" I found a comment along the lines of
"Not sure why this has to be volatile, but it doesn't work without it" and the
answer was "There is a race condition and volatile slows down one path enough
to make it go away."

Yuck.

------
alain94040
You really should just use volatile for device drivers when accessing IO space
with side-effects. Do not use volatile to build your own synchronization
primitives.

~~~
ridiculous_fish
What do you think about signal handlers?

Atomics may be implemented with locks, which makes them unsuitable for signal
handlers. The only guaranteed lock-free type is `std::atomic_flag` which is
not very useful.

`volatile sig_atomic_t` still seems like the better choice for signals.

~~~
gpderetta
If the architecture is so broken that atomic loads and stores need to use
locks, I can't see how sig_atomic_t would ever be implementable.

------
dahfizz
I think the title and parts of the article are misleading. Using volatile will
_never_ make a correct program incorrect. It cannot "break" a correct
implementation.

It should not be overused: as the article mentions, it makes for slower and
more confusing code. Unnecessary volatile is bad form, but it's not something
to be afraid of either.

------
burfog
That note at the end about Linux is missing a link to the
Documentation/volatile-considered-harmful.txt document. Basically, don't use
volatile. Here, with your choice of formatting:

[https://github.com/torvalds/linux/blob/master/Documentation/...](https://github.com/torvalds/linux/blob/master/Documentation/process/volatile-considered-harmful.rst)

[https://www.mjmwired.net/kernel/Documentation/volatile-consi...](https://www.mjmwired.net/kernel/Documentation/volatile-considered-harmful.txt)

[https://www.kernel.org/doc/html/latest/process/volatile-cons...](https://www.kernel.org/doc/html/latest/process/volatile-considered-harmful.html)

~~~
User23
Neat how inline assembly is one of the valid use cases. I'm given to
understand that essential parts of the Linux kernel can't actually be
implemented in pure C and that some assembly is required.

~~~
Narishma
Isn't that the case for most (all?) operating systems?

------
SAI_Peregrinus
The entire section on declarations can also be fixed by always binding type
modifiers and qualifiers to the left. Rewriting the examples:

    
    
        int* p;                              // pointer to int  
        int volatile* p_to_vol;              // pointer to volatile int  
        int* volatile vol_p;                 // volatile pointer to int  
        int volatile* volatile vol_p_to_vol; // volatile pointer to volatile int
    

This method always starts with the most basic type, then adds modifiers
sequentially. The modifier binds to everything left of it.

------
klingonopera

      > "Side note: although at first glance this code looks like it fails to account for the case where TCNT1 overflows from 65535 to 0 during the timing run, it actually works properly for all durations between 0 and 65535 ticks."
    

From example 1 (ignoring the device- and setup-specific question of what to
do when TCNT1 overflows), it actually works properly for _all_ ticks: both
"first" and "second" are unsigned (therefore the behaviour is defined), and
the delta between them is always between 0 and 65535, no matter what values
they may have.

E.g.:

    
    
      timeDelta = timeStampNow - timeStampLast = 0 - 65535 = 1

~~~
tntn
But if the duration is > 65535 ticks, the calculated duration will be wrong,
no? There is no mechanism to count how many times TCNT1 overflows, so it will
be incorrect if the duration of what you are timing exceeds 65535 ticks.

~~~
klingonopera
That is correct, yes. I had erroneously understood that the author meant "all
durations between 0 and 65535 ticks" as "any duration between the device's 0th
and 65535th tick", my bad... Also makes this entire thread obsolete, but FWIW,
one shouldn't be attempting to measure a duration that can't even be contained
in the variable's bit width. Some workarounds would be to add more bits, slow
down the tickrate or add overflow counters.

------
legohead
I've never had to use volatile in code. This was all very interesting!

For issue #5, a possible solution not mentioned could be to write inline
assembly, no? It would keep the array non-volatile and should be portable.

~~~
kelnos
Inline assembly is basically the definition of non-portable.

------
loeg
volatile should only be used for accessing MMIO registers in device drivers;
that's it.

~~~
ridiculous_fish
There are definitely more uses. For example, shared memory between processes:
you should mark it volatile.

C++ atomics are no good here, because they are not guaranteed to be lock free
or address free.

~~~
loeg
Shared memory is way outside the scope of standard C or C++. It's
implementation-defined. It's inconsistent to insist on the weakest definition
of atomics allowed by the C/C++ standard(s) and simultaneously invoke one of the
weirdest implementation-defined mechanisms defined by POSIX. If your
implementation provides shared memory of some kind, it's up to your
implementation to define some sort of reasonable semantics.

In POSIX's case, it's up to POSIX operating systems to define reasonable
semantics on the memory, using constructs like PTHREAD_PROCESS_SHARED and
"robust" pthread mutexes.

------
kaetemi
Volatile seems quite sufficient for a PleaseExitThread boolean.

------
flafla2
Edit: Looks like the slides had an inaccuracy (see replies). Huh, looks like I
learned something today :)

I think a good way of summarizing volatile is this slide from my parallel
architectures class [1]:

    
    
        > Class exercise: describe everything that might occur during the 
        > execution of this statement
        >     volatile int x = 10
        >
        > 1. Write to memory
        > 
        > Now describe everything that might occur during the execution of
        > this statement
        >     int x = 10
        > 
        > 1.  Virtual address to physical address conversion (TLB lookup)
        > 2.  TLB miss
        > 3.  TLB update (might involve OS)
        > 4.  OS may need to swap in page to get the appropriate page 
        >     table (load from disk to physical address)
        > 5.  Cache lookup (tag check)
        > 6.  Determine line not in cache (need to generate BusRdX)
        > 7.  Arbitrate for bus
        > 8.  Win bus, place address, command on bus
        > 9.  All caches perform snoop (e.g., invalidate their local 
        >     copies of the relevant line)
        > 10. Another cache or memory decides it must respond (let’s 
        >     assume it’s memory)
        > 11. Memory request sent to memory controller
        > 12. Memory controller is itself a scheduler
        > 13. Memory controller checks active row in DRAM row buffer.
        >     (May need to activate new DRAM row. Let’s assume it does.)
        > 14. DRAM reads values into row buffer
        > 15. Memory arbitrates for data bus
        > 16. Memory wins bus
        > 17. Memory puts data on bus
        > 18. Requesting cache grabs data, updates cache line and tags, 
        >     moves line into exclusive state
        > 19. Processor is notified data exists
        > 20. Instruction proceeds
        > * This list is certainly not complete, it’s just 
        >   what I came up with off the top of my head. 
    

It's also worth mentioning that this assumes a uniprocessor model, so out-of-
order execution is still possible which leads to complications in any sort of
multithreaded or networked system (See #5, 6, 7, 8 in the OP article).

I think a lot of the confusion stems from the illusion of a uniprocessor +
in-order execution model, held by programmers who have never dealt with
system-level code. I think in the future, performant software will require a
bit more understanding of the underlying hardware on the part of your average
software developer -- especially when you care about any sort of parallelism.
It doesn't help that almost all common CS curriculum ignores parallelism until
the 3rd year or more.

[1]
[http://www.cs.cmu.edu/~418/lectures/12_snoopimpl.pdf](http://www.cs.cmu.edu/~418/lectures/12_snoopimpl.pdf)
\- the last 2 slides

~~~
tempguy9999
I don't get this, most likely due to my ignorance, but I thought volatile
doesn't necessarily force anything to RAM; it can just push the value out so
that cache coherence handles the rest between cores (and perhaps
peripherals). MESI can do the work without actually hitting memory.

If you want to actually force it to RAM, then perhaps you'd need a memory barrier.

This is not my area though. Wrong? Right?

~~~
spc476
What happens with this code?

    
    
        volatile int x;
        int          y;
        int          z;
        
        x = 10;
        x = 20;
        y = x;
        z = x;
    

Answer:

    
    
        the constant 10 is written to x
        the constant 20 is written to x
        the contents of x is read and written into y
        the contents of x is read and written into z
    

Now, what happens with this code?

    
    
        int x;
        int y;
        int z;
        
        x = 10;
        x = 20;
        y = x;
        z = x;
    

One answer is the same as the above. Another valid answer is:

    
    
        the constant 20 is written to x
        the constant 20 is written to y
        the constant 20 is written to z
    

Why? Because x is not used between the two assignments, so the first will
never be seen. Also, x is not used between its assignment and the assignment
to y, so the compiler can do constant propagation.

All volatile does is tell the compiler "all writes _must_ happen, and no
caching of reads".

~~~
tempguy9999
Understood but we're talking about different things I think (though this is
very much not my area).

You're saying volatile is acting as a kind of memory barrier instruction _for
the compiler_ \- got it. But I'm saying I understand that at the CPU level,
just considering x86 instructions, writes don't have to be forced to RAM,
despite a common assumption that they are; they can remain in caches. See
johntb86's reply confirming this.

------
nullwasamistake
Ironically, volatile is just as bad in Java, for different reasons.
Frequently used for "lock free" synchronization, it's usually actually worse
than using locks, because it can't be cached between cores. The variable is
always loaded from main memory, which is usually much worse than taking a
lock and keeping the value in registers.

~~~
the8472
The standard pattern for working with atomics in java (volatiles are of
limited use without atomic field updaters or varhandles) is to read it into a
local variable, operate on that and only write it back to the volatile once
you're done.

That has many benefits, among them the ability to store its value in
registers.

~~~
nullwasamistake
For primitives, Java uses special CPU instructions via the Atomic* classes.
It's not recommended for plain objects.

------
tus87
Err...volatile just tells the compiler not to cache the value in a register,
that's it. If you don't understand volatile you really, really are not the
kind of programmer who should even think about using it.

~~~
empiricus
This is my understanding of volatile as well: volatile just forces read/write
to memory. What a read/write to memory entails is a different story. What
happens with no volatile is again another story. If my understanding is wrong,
someone please enlighten me.

~~~
moefh
"Forcing read/write to memory" is very different from "not caching the value
in a register". Optimizations can involve not just caching values in
registers, but also reordering operations, calculating things at compile-time
and so on.

For a trivial example, see this code:

    
    
        int f() {
            int sum = 0;
            for (int i = 0; i < 10; i++) sum += i;
            return sum;
        }
    

As you can see from [1], a smart compiler will calculate the sum at compile
time and make the function simply return the resulting number (i.e., no loop
is generated).

If you make "sum" volatile, the compiler is forced to do the loop[2].

[1] [https://godbolt.org/z/3sX5mU](https://godbolt.org/z/3sX5mU)

[2] [https://godbolt.org/z/F5CiDJ](https://godbolt.org/z/F5CiDJ)

