> Good lock free algorithms use double-width instructions like cmpxchg16b which compare 64-bits but swap 128-bits
cmpxchg16b compares 128 bits and swaps 128 bits.
I don't know why 'good' algorithms would use it when they don't need to, because 128-bit operations are slower.
Not only that, a 128-bit compare-and-swap doesn't work unless its operand is 128-bit aligned, while a 64-bit compare-and-swap will work even when its operand isn't 64-bit aligned.
On x86, any CAS on a misaligned address that crosses a cache line boundary will fault in the best case (if the OS has disabled the misfeature) or cost thousands of clock cycles on all cores. So it "works" only for small values of "works".
That only applies when the access crosses a cache line boundary. A 128-bit CAS doesn't work at all when it is unaligned, so you can't do things like swap two pointers, then move down 64 bits and swap two more pointers.
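To make the two constraints in this exchange concrete, here is a rough C sketch. It never executes a CAS. It only checks addresses, so it builds without `-mcx16` or libatomic. The names `dw_cas_aligned`, `crosses_cache_line`, and `struct slots` are made up for illustration, and the 64-byte line size is an assumption that holds on typical x86-64 parts but is not guaranteed everywhere.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdalign.h> /* alignas, alignof */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines, as on typical x86-64 CPUs. */
#define CACHE_LINE 64u

/* True if a 16-byte operand at addr meets cmpxchg16b's 16-byte alignment rule. */
static bool dw_cas_aligned(uintptr_t addr) {
    return (addr % 16u) == 0;
}

/* True if an access of `size` bytes starting at addr straddles two cache lines. */
static bool crosses_cache_line(uintptr_t addr, size_t size) {
    return (addr / CACHE_LINE) != ((addr + size - 1) / CACHE_LINE);
}

/* Four adjacent 64-bit slots (stand-ins for pointers); the array starts 16-byte aligned. */
struct slots {
    alignas(16) uint64_t s[4];
};

static_assert(alignof(struct slots) == 16, "struct slots must be 16-byte aligned");
```

With a `struct slots x`, the pairs starting at `x.s[0]` and `x.s[2]` pass `dw_cas_aligned`, but the pair starting at `x.s[1]` (the "move down 64 bits" case) does not. Separately, an 8-byte operand at byte offset 60 of a line makes `crosses_cache_line` return true, which is the split-lock case described above.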