The overhead of an uncontended lock is not much more than a memory operation, but it lets you use the same code in a threaded context and in tokio async, which is a huge benefit. Unless you need the optimization (i.e. you profiled and determined that Arc in a hot loop is slowing you down), I think it's fine to use Arc in general.
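For illustration, a minimal sketch of the pattern being discussed, using only the standard library. The function name `count_with_mutex` is made up for this example. With `threads == 1` every lock is uncontended, which is the cheap case described above.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical helper: spawns `threads` workers that each add 1 to a
/// shared counter `iters` times, then returns the final count.
fn count_with_mutex(threads: usize, iters: u64) -> u64 {
    let counter = Arc::new(Mutex::new(0u64));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            // Each worker gets its own reference-counted handle.
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..iters {
                    // The guard is dropped at the end of the statement,
                    // so the lock is held only for the increment.
                    *counter.lock().unwrap() += 1;
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    let total = *counter.lock().unwrap();
    total
}
```

Since `Arc<Mutex<u64>>` is `Send + Sync`, the same value can also be moved into a task on a multi-threaded async runtime, as long as the guard is not held across an `.await`.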
No, this also applies to (non-relaxed) atomic loads and stores, depending on the platform.
> Atomic non-seq-cst load/stores can be cheap.
Relaxed atomic loads and stores are always cheap, but anything stronger requires additional memory-ordering instructions on many platforms, most notably on ARM.
Here we are talking specifically about mutexes, which follow acquire/release semantics.
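To make the distinction concrete, here is a small sketch of where the orderings differ. The function name `publish_and_read` and the value 42 are just for this example. The payload is written and read with `Relaxed`; only the flag uses `Release`/`Acquire`, and that pair is what may cost extra ordering instructions on weakly ordered hardware.

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical helper: one thread publishes a value, the caller reads it.
fn publish_and_read() -> u64 {
    let data = Arc::new(AtomicU64::new(0));
    let ready = Arc::new(AtomicBool::new(false));

    let producer = {
        let (data, ready) = (Arc::clone(&data), Arc::clone(&ready));
        thread::spawn(move || {
            // Relaxed: no ordering guarantee on its own.
            data.store(42, Ordering::Relaxed);
            // Release: everything written above becomes visible to a
            // thread that observes this store with an Acquire load.
            ready.store(true, Ordering::Release);
        })
    };

    // Acquire: pairs with the Release store on `ready`.
    while !ready.load(Ordering::Acquire) {
        std::hint::spin_loop();
    }
    // Safe to read with Relaxed; the Acquire load already ordered it.
    let seen = data.load(Ordering::Relaxed);
    producer.join().unwrap();
    seen
}
```

If both accesses to `ready` were `Relaxed`, the language would no longer guarantee that the reader sees 42.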
To be clear: locking an uncontended mutex is indeed much, much cheaper than an actual call into the kernel, but it is not free either.
Ok, technically we both used the weasel word 'can' so we are both right.
But even on ARM, these days store releases and load acquires, while not as free as on x86, are very cheap.
To make my statement more precise, typically what is still expensive pretty much everywhere is anything with #StoreLoad barrier semantics, which is what you need to acquire a mutex.
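As a rough sketch of what the uncontended fast path of a lock boils down to: one atomic read-modify-write to acquire, one release store to unlock. `SpinLock` here is a toy type invented for this example, not how `std::sync::Mutex` is implemented; a real mutex adds a slow path that parks the thread. On x86 the compare-exchange becomes a `lock`-prefixed instruction, which acts as a full barrier, so that is where the cost sits.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Toy lock for illustration only; it has no slow path and no guard type.
struct SpinLock {
    locked: AtomicBool,
}

impl SpinLock {
    const fn new() -> Self {
        SpinLock { locked: AtomicBool::new(false) }
    }

    /// Fast path: a single atomic read-modify-write.
    /// Returns true if the lock was free and is now held by the caller.
    fn try_lock(&self) -> bool {
        self.locked
            .compare_exchange(false, true, Ordering::Acquire, Ordering::Relaxed)
            .is_ok()
    }

    /// Unlock: a plain release store, no read-modify-write needed.
    fn unlock(&self) {
        self.locked.store(false, Ordering::Release);
    }
}
```

The asymmetry is the point: unlocking is only a store release, while locking always needs the read-modify-write.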