
You Can Do Any Kind of Atomic Read-Modify-Write Operation - signa11
http://preshing.com/20150402/you-can-do-any-kind-of-atomic-read-modify-write-operation/
======
exDM69
When spinning on an atomic operation, you should add some kind of yield
operation to the loop: at minimum the "pause" instruction on x86. On newer CPUs
"monitor" and "mwait" can work a little better (but may not be available in
user space). Do not use thread_yield or other kernel calls, because at that
point you're better off using a mutex.

When spinning on atomics, the interconnect between cores needs to do a lot of
work synchronizing the memory. When adding a pause or monitor instruction, the
core will give an opportunity for the other hyperthread in the same core to
execute, reducing the amount of churn on the interconnect.

Counterintuitively, adding a "pause" in the spin loop will make the program
faster by yielding the CPU core for a few nanoseconds.
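
A minimal sketch of what this could look like in C++. The names `cpu_relax` and `SpinLock` are made up for illustration, and the AArch64 branch (the `yield` hint) is an assumption about a reasonable counterpart, not something measured here. The lock spins on plain loads and only retries the atomic exchange once the lock looks free:

```cpp
#include <atomic>

#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <immintrin.h>  // _mm_pause
#endif

// Hypothetical helper: tell the core we are inside a spin-wait loop.
inline void cpu_relax() {
#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
    _mm_pause();                     // emits the x86 "pause" instruction
#elif defined(__aarch64__) && defined(__GNUC__)
    __asm__ __volatile__("yield");   // assumed AArch64 counterpart (a hint only)
#else
    // No spin hint known for this target; do nothing.
#endif
}

// Hypothetical test-and-test-and-set spin lock, for illustration only.
class SpinLock {
    std::atomic<bool> locked_{false};

public:
    void lock() {
        for (;;) {
            // exchange returns the previous value: false means we took the lock.
            if (!locked_.exchange(true, std::memory_order_acquire)) {
                return;
            }
            // Lock is held: wait with read-only loads and a pause hint
            // instead of hammering the cache line with atomic writes.
            while (locked_.load(std::memory_order_relaxed)) {
                cpu_relax();
            }
        }
    }

    bool try_lock() {
        return !locked_.exchange(true, std::memory_order_acquire);
    }

    void unlock() {
        locked_.store(false, std::memory_order_release);
    }
};
```

No kernel call is involved anywhere in the loop, which is the point of the advice above.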

~~~
vardump
That's interesting. Any references about this? What about ARM CPUs?

~~~
exDM69
See Intel Programmer's manual (vol 3a iirc) about pause, monitor and mwait.
Not sure about ARM: that architecture has different cache coherence rules, no
hyperthreading, and requires explicit memory barriers.

------
dustyleary
It looks like the examples are broken.

    
    
      uint32_t fetch_multiply(std::atomic<uint32_t>& shared, uint32_t multiplier)
      {
          uint32_t oldValue = shared.load();
          while (!shared.compare_exchange_weak(oldValue, oldValue * multiplier))
          {
          }
          return oldValue;
      }
    
    

If shared.compare_exchange_weak() fails because of a concurrent writer, then
the while loop will never exit (unless another writer sets the value of shared
to oldValue).

What the author wants is something like this:

    
    
      uint32_t fetch_multiply(std::atomic<uint32_t>& shared, uint32_t multiplier)
      {
          while (true) {
              uint32_t oldValue = shared.load();
              if (shared.compare_exchange_weak(oldValue, oldValue * multiplier)) {
                  return oldValue;
              }
          }
      }
    

The other examples have the same problem.

~~~
atomic_cheese
I'm pretty sure that example is actually fine - if there's a concurrent
writer, the modified value will be loaded into oldValue. Repeating shared.load
is unnecessary since that operation occurs as part of compare_exchange_weak.
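
To make that concrete, here is a small sketch. `expected_after_failed_cas` is a hypothetical helper written just for this demonstration; it uses compare_exchange_strong so that a failure can only come from a value mismatch, never a spurious one. `fetch_multiply` is the function quoted above, unchanged in behavior:

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical helper: returns what is left in `expected` after the
// compare-exchange. On failure the atomic's current value is written
// back into `expected`; on success `expected` is left untouched.
uint32_t expected_after_failed_cas(std::atomic<uint32_t>& shared,
                                   uint32_t staleGuess,
                                   uint32_t desired)
{
    uint32_t expected = staleGuess;
    shared.compare_exchange_strong(expected, desired);
    return expected;
}

// The loop from the article: every failed attempt refreshes oldValue,
// so the next attempt multiplies the latest value.
uint32_t fetch_multiply(std::atomic<uint32_t>& shared, uint32_t multiplier)
{
    uint32_t oldValue = shared.load();
    while (!shared.compare_exchange_weak(oldValue, oldValue * multiplier))
    {
    }
    return oldValue;
}
```

With `shared` holding 5, calling `expected_after_failed_cas(shared, 3, 10)` leaves `shared` at 5 and returns 5, the value that was loaded into `expected` by the failed exchange.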

~~~
dustyleary
Thanks, my mistake.

I've been bitten by this before. I sometimes wish you had to mark "reference-of"
just like you have to mark "address-of" with &.

It's obvious that blah(&foo) might modify foo, without needing to examine the
signature of blah().

The tradeoff is a bit of code clutter when you add a symbol (something like
shared.compare_exchange_weak(%oldValue, oldValue * multiplier)), and a
violation of DRY, since the information provided by this symbol would be
redundant.
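
Without any new syntax, one way to get a similar effect today is a thin wrapper that takes the expected value by pointer, which is also how C11's atomic_compare_exchange_weak is declared. A sketch, where `cas_weak` and `fetch_multiply_explicit` are hypothetical names and not part of any library:

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical wrapper: `expected` is passed by pointer, so every call
// site has to write `&`, making the possible modification visible.
inline bool cas_weak(std::atomic<uint32_t>& target,
                     uint32_t* expected,
                     uint32_t desired)
{
    return target.compare_exchange_weak(*expected, desired);
}

uint32_t fetch_multiply_explicit(std::atomic<uint32_t>& shared, uint32_t multiplier)
{
    uint32_t oldValue = shared.load();
    // The `&oldValue` signals that a failed attempt rewrites oldValue.
    while (!cas_weak(shared, &oldValue, oldValue * multiplier))
    {
    }
    return oldValue;
}
```

The cost is the clutter mentioned above: one more function to maintain, and a pointer parameter that could in principle be null.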

------
avmich
I remember how surprised I was to find that you can implement compare-and-swap
purely in software (with some caveats :) but still, IMO, preserving the idea).
See, e.g., Peterson's algorithm. It does have a lock, but only for a finite (and
brief, if you will) moment, and, wrapped as a procedure, is externally
lock-free.

You still rely on the ability to locally order operations in time (i.e., you
have to be sure that you - the CPU - read from that memory cell strictly after
you have completed writing into this one) - so there are restrictions on modern
CPUs. You might have issues if you want to synchronize many threads/processes,
as the locking delay may become too long. But still.
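
One possible reading of this idea, as a sketch only: Peterson's algorithm for exactly two threads (ids 0 and 1) guarding a compare-and-swap on a plain, non-atomic word. `PetersonLock`, `software_cas` and `run_two_thread_demo` are hypothetical names. The flags use std::atomic with the default sequentially consistent ordering, which stands in for the "locally order operations in time" requirement; callers still block briefly inside `software_cas` while the other thread holds the lock.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>

// Peterson's mutual exclusion for two threads, identified as 0 and 1.
class PetersonLock {
    std::atomic<bool> interested_[2];
    std::atomic<int> turn_;

public:
    PetersonLock() : turn_(0) {
        interested_[0].store(false);
        interested_[1].store(false);
    }

    void lock(int self) {
        const int other = 1 - self;
        interested_[self].store(true);  // announce intent
        turn_.store(other);             // give the other thread priority
        // Wait only while the other thread wants in and has priority.
        while (interested_[other].load() && turn_.load() == other) {
        }
    }

    void unlock(int self) {
        interested_[self].store(false);
    }
};

// Compare-and-swap on a plain word, serialized by the lock. Mirrors
// compare_exchange: on mismatch, `expected` receives the current value.
bool software_cas(PetersonLock& lock, int self,
                  uint32_t& target, uint32_t& expected, uint32_t desired)
{
    lock.lock(self);
    const bool matched = (target == expected);
    if (matched) {
        target = desired;
    } else {
        expected = target;
    }
    lock.unlock(self);
    return matched;
}

// Two threads each increment a shared counter via software_cas.
uint32_t run_two_thread_demo(uint32_t incrementsPerThread)
{
    PetersonLock lock;
    uint32_t counter = 0;
    auto worker = [&](int self) {
        for (uint32_t i = 0; i < incrementsPerThread; ++i) {
            uint32_t seen = 0;
            while (!software_cas(lock, self, counter, seen, seen + 1)) {
            }
        }
    };
    std::thread a(worker, 0);
    std::thread b(worker, 1);
    a.join();
    b.join();
    return counter;
}
```

If the mutual exclusion holds, `run_two_thread_demo(n)` returns `2 * n`, since no increment can be lost.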

