If you still wanted to fine tune it a bit further, you could have used a WeakAto...

dragontamer · on June 4, 2019

Ha! I think I'm finally correct against you :-)

"AtomicExchange" is just a swap. It is NOT a compare-and-swap, so there's no Weak vs Strong version. A pure-swap is probably faster to implement than a compare-and-swap in microcode, but I admit that I've never actually tested this theory.
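To make the distinction concrete, here is a minimal C++ sketch. It assumes `std::atomic<bool>` as a stand-in for whatever "AtomicExchange" operates on; the helper names `swap_in`, `cas_in` and `demo` are made up for illustration. `exchange` always stores and hands back the old value, while `compare_exchange_strong` only stores if the current value matches what you expected (and that's the one with a weak variant).

```cpp
#include <atomic>
#include <cassert>

// Unconditional swap: always stores `desired` and returns the old value.
bool swap_in(std::atomic<bool>& flag, bool desired) {
    return flag.exchange(desired);
}

// Conditional swap: stores `desired` only if the current value equals
// `expected`. Returns true if the store happened.
bool cas_in(std::atomic<bool>& flag, bool expected, bool desired) {
    return flag.compare_exchange_strong(expected, desired);
}

// Small walk-through of both helpers on a single flag.
void demo() {
    std::atomic<bool> flag{false};
    assert(swap_in(flag, true) == false);       // old value was false
    assert(swap_in(flag, true) == true);        // stores again, old value was true
    assert(cas_in(flag, false, true) == false); // flag is true, so no store
    assert(cas_in(flag, true, false) == true);  // matches, flag becomes false
    assert(flag.load() == false);
}
```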

vardump · on June 4, 2019

Indeed, oops! Brains in pattern matching mode. :-)

Then again, I think XCHG (has an implicit atomic LOCK, no need for a LOCK prefix) is kind of an x86-only thing.

> pure-swap is probably faster to implement than a compare-and-swap in microcode, but I admit that I've never actually tested this theory.

I don't think there's a practical performance difference, because inter-core cache coherency traffic is going to dominate by far.

e12e · on June 4, 2019

For others that are curious, there's an example of lock XCHG spin lock for amd64 here:

https://en.m.wikibooks.org/wiki/X86_Assembly/Data_Transfer

Ed: well, technically x86 I guess - not sure if there are improvements to be made for amd64?

vardump · on June 6, 2019

That example is based on "lock cmpxchg" (x86 compare-and-swap), not on an atomic swap like the "xchg"-based one [0] above.

Someone else [1] also made a good point about how "xchg" can cause more contention, because it generates stores. If loads were used for waiting instead, the cache line could remain in the much cheaper shared state on the waiting core.
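A rough sketch of that "wait with loads" idea (often called test-and-test-and-set), in C++. This is an assumption-laden illustration, not what anyone in the thread posted: `TtasLock` and `run_ttas_demo` are hypothetical names, and `std::this_thread::yield()` stands in for a `hyperthread_yield()` / `pause` style hint.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

struct TtasLock {
    std::atomic<bool> locked{false};

    void lock() {
        for (;;) {
            // One atomic swap attempt; a returned false means we own the lock.
            if (!locked.exchange(true, std::memory_order_acquire)) return;
            // Lock is taken: wait using loads only, so the cache line can
            // stay shared instead of being pulled exclusive by each store.
            while (locked.load(std::memory_order_relaxed)) {
                std::this_thread::yield(); // stand-in for a pause/yield hint
            }
        }
    }

    void unlock() {
        assert(locked.load(std::memory_order_relaxed)); // must be held
        locked.store(false, std::memory_order_release);
    }
};

// Hypothetical demo: several threads bump a plain counter under the lock.
long run_ttas_demo(int threads, int iters) {
    TtasLock lock;
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < iters; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter;
}
```

Whether this actually beats the plain swap loop on a given machine is exactly the kind of thing that needs measuring.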

But regardless, benchmark on as many system types as you can get your hands on. No matter how well you know the architecture or how much you read on the internet, real software systems and CPUs are sometimes rather unpredictable beasts.

[0]: https://news.ycombinator.com/item?id=20098978

[1]: https://news.ycombinator.com/item?id=20099764

e12e · on June 6, 2019

Hm, so the above:

while (AtomicExchange(spinlock, true) != false) hyperthread_yield(); // loop until you grab false.

Does do a cmp (the while construct) - but only the XCHG is atomic. While the other example does both CMP and XCHG as an atomic "unit"/op?

vardump · on June 6, 2019

First of all, I think "AtomicExchange" returns the previous value of "spinlock". The first parameter ("spinlock" here) is actually a pointer to the lock variable, not the value itself. The second parameter is the value to swap into that memory location.

So it atomically stores true to "spinlock" and returns whatever value it contained previously. The while loop compares against that previous value, i.e. what was swapped out of memory.

Unless "spinlock" was false in the first place (in which case the first swap acquires the lock immediately), the loop can only terminate when the other thread eventually stores false to "spinlock". After that point, one of the waiting threads will swap that false into true and notice the previous value was false, meaning the lock was successfully acquired.

So until the other thread releases it by storing false to "spinlock", there'll be just endless swaps of true.
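For anyone who wants to try it, here is roughly how that loop could look in portable C++. It's a sketch under assumptions: `std::atomic<bool>::exchange` plays the role of "AtomicExchange", `std::this_thread::yield()` stands in for `hyperthread_yield()`, and `acquire`, `release` and `contended_count` are names invented for this example.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Keep swapping in true until the swap hands back false,
// which means we were the one who flipped it from free to taken.
void acquire(std::atomic<bool>* spinlock) {
    while (spinlock->exchange(true) != false) {
        std::this_thread::yield(); // stand-in for hyperthread_yield()
    }
}

// Releasing is just a store of false.
void release(std::atomic<bool>* spinlock) {
    assert(spinlock->load()); // the lock should be held when releasing
    spinlock->store(false);
}

// Hypothetical demo: two threads increment a plain int under the lock.
int contended_count(int iters) {
    std::atomic<bool> spinlock{false};
    int counter = 0;
    auto work = [&] {
        for (int i = 0; i < iters; ++i) {
            acquire(&spinlock);
            ++counter;
            release(&spinlock);
        }
    };
    std::thread a(work), b(work);
    a.join();
    b.join();
    return counter;
}
```

If the lock works, `contended_count(n)` comes back as exactly `2 * n`; without the lock the two unsynchronized increments would be a data race.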

I wouldn't recommend using a spinlock like this, because it probably causes a lot of unnecessary and avoidable cache coherency traffic [0] between all of the contending CPU cores.

[0]: Unless, of course, CPUs contain some instruction stream pattern matching to detect this situation (swapping in the same value the memory address contained before) and optimize the endless stores away. Wouldn't surprise me at all if such optimizations existed on modern x86 silicon...

e12e · on June 6, 2019

Thank you. Still a little unclear on how the code maps to actual x86_64 assembler though. There's generally no "return" value - so I'm curious what the machine code would cmp on.

I suppose one puts the value to write in a register, then if the XCHG succeeds, the "return" value will be in that register. And the only way to detect success is if the value has changed (which is OK, overwriting a value with the same value... is a logical no-op...).

I suppose I could try something like the above, or:

http://www.cplusplus.com/reference/atomic/atomic/exchange/

And look at the output...

vardump · on June 6, 2019

See my quick, completely untested idea above in the thread for how it'll probably look in x86[-64] assembler.

vardump · on June 6, 2019

Note: moved this from thread leaf here, because it broke the page formatting.

Can't check any reference now, but I think the whole while loop is roughly:

  loop:
    mov eax, 1 ; "true"
    ; edi points to &spinlock
    xchg [edi], eax 
    ; eax is AtomicExchange retval
    test eax, eax
    jnz loop ; while(eax != false)
    ; we've acquired the lock

The x86-64 version is the same, except the address is probably in a 64-bit register, like RDI. That said, the 32-bit form would also work in 64-bit mode if the pointer just happens to be in the lower 4 GB.