In terms of the relative cycle cost for instructions, the answer definitely has changed a lot over time.
As CAS has become more and more important as the world has scaled out, hardware companies have been more willing to favor "performance" in the cost / performance tradeoff. Meaning, it shouldn't surprise you if uncontended CAS is as fast as a fetch-and-or, even though the latter is obviously a much simpler operation logically.
But hardware platforms are a very diverse place.
Generally, if you can design your algorithm with a load-and-store, there's a very good chance you're going to deal with contention much better than w/ CAS. But, if the best you can do is use load-and-store but then have a retry loop if the value isn't right, that probably isn't going to be better.
For instance, I have an in-memory debugging "ring buffer" that keeps an "epoch"; threads logging to the ring buffer fetch-and-add to claim an epoch, then mod by the buffer size to find their slot.
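Roughly what that slot claim looks like — a sketch only, with names, sizes, and the C11 atomics all my own choices, not the actual code:

```c
#include <stdatomic.h>
#include <stdint.h>

#define RING_SIZE 1024  /* hypothetical; a power of two keeps the mod cheap */

typedef struct {
    uint64_t epoch;     /* which claim of the counter produced this record */
    const char *msg;    /* the logged message */
} ring_record;

static _Atomic uint64_t ring_epoch = 0;
static ring_record ring[RING_SIZE];

/* Claim a slot: one fetch-and-add, never a retry loop, no matter how
   many threads are logging at once. */
static ring_record *ring_claim(uint64_t *epoch_out)
{
    uint64_t e = atomic_fetch_add(&ring_epoch, 1);
    *epoch_out = e;
    return &ring[e % RING_SIZE];
}
```

The point is that every thread's fetch-and-add succeeds on the first try; contention only shows up as the hardware serializing access to the counter's cache line.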
Typically, the best performance will happen when I'm keeping one ring buffer per thread-- not too surprising, as there's no contention (though the impact of page faults can potentially slow this down).
If the ring buffer is big enough that there's never a collision where a slow writer is still writing when the next cycle through the ring buffer comes around, then the only contention is on the counter, and everything still tends to be pretty good. But the work the hardware has to do to sync the value will 100% slow it down, despite the fact that there are no retries.
If you don't use a big buffer, you have to do something different to keep a true ring buffer: for instance, you can lock each record, and send a fast writer back to get a new index if it sees a lock. Either way, the contention still has the effect of slowing things down.
The worst performance will come with the CAS operation though, because when lots of threads are debugging lots of things, there will be lots of retries.
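For contrast, here's the same counter bump done as a CAS loop — a minimal sketch to show where the retries come from, not anyone's production code:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Incrementing via CAS: load the value, compute, then CAS it in. If another
   thread changed the counter in that window, the CAS fails and we loop.
   Plain fetch-and-add never has that window, so it never retries. */
static uint64_t bump_with_cas(_Atomic uint64_t *counter, unsigned *retries)
{
    uint64_t old = atomic_load(counter);
    while (!atomic_compare_exchange_weak(counter, &old, old + 1)) {
        (*retries)++;   /* lost the race; 'old' was reloaded, try again */
    }
    return old;         /* the value we successfully claimed */
}
```

With many threads hammering the counter, most of them lose the race most of the time, and the retry count — and the wasted work — climbs with the thread count.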