Maybe I don't understand atomic well enough. If all threads hit an atomic compare and swap on the same address at the same time, doesn't the CPU have to make sure that only one thread is updating the value at a time? Wouldn't this manifest as slightly increased instruction latency for the other threads?
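Yes, the CPU has to make sure that only one thread updates the value at a time, and this definitely increases latency for the other threads: contended atomic operations on the same address are effectively serialized, so each thread waits its turn.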
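To make the cost visible, here is a minimal C++ sketch (not taken from any particular codebase) in which several threads increment one shared counter with a compare-and-swap loop. The names `ContentionResult` and `run_contended_cas` are made up for illustration. Every failed CAS is a thread that lost the race and had to retry, which is one way the extra latency shows up.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical result type, for illustration only.
struct ContentionResult {
    std::uint64_t final_value;  // shared counter after all threads finish
    std::uint64_t failed_cas;   // CAS attempts that lost the race and retried
};

// Hypothetical helper: every thread increments the same counter via a CAS loop.
ContentionResult run_contended_cas(int num_threads, int increments_per_thread) {
    std::atomic<std::uint64_t> counter{0};
    std::atomic<std::uint64_t> failed{0};

    auto worker = [&] {
        for (int i = 0; i < increments_per_thread; ++i) {
            std::uint64_t expected = counter.load(std::memory_order_relaxed);
            // On failure, compare_exchange_weak writes the current value
            // into `expected`, so the loop retries with fresh data.
            while (!counter.compare_exchange_weak(expected, expected + 1,
                                                  std::memory_order_relaxed)) {
                failed.fetch_add(1, std::memory_order_relaxed);
            }
        }
    };

    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t) {
        threads.emplace_back(worker);
    }
    for (auto& t : threads) {
        t.join();
    }
    return {counter.load(), failed.load()};
}
```

Calling `run_contended_cas(4, 10000)` always ends with `final_value == 40000`, because no increment is ever lost. The `failed_cas` count varies from run to run and from machine to machine, and tends to grow as more threads hit the same address.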
"Guaranteed forward progress" is barely of any interest here. What they actually mean is that if a thread randomly stops for some reason, the other threads will still be able to perform atomic operations, because no thread can stop in the middle of an atomic operation.
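A rough C++ sketch of that property, assuming `std::atomic<long>` is lock-free on the target (it usually is on mainstream platforms, but that is an assumption). The stall is only simulated with a spin-wait, and `StallResult` and `run_with_stalled_thread` are hypothetical names. One thread does a single atomic increment and then stops making progress, while a second thread keeps incrementing the same counter and finishes anyway.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical result type, for illustration only.
struct StallResult {
    long while_stalled;  // counter value read while one thread is still stalled
    long after_release;  // counter value after the stalled thread resumes and exits
};

// Hypothetical helper: one thread "stalls" after an atomic op,
// another thread keeps using the same atomic.
StallResult run_with_stalled_thread(int increments) {
    std::atomic<long> counter{0};
    std::atomic<bool> stalled{false};
    std::atomic<bool> release{false};

    std::thread sleeper([&] {
        counter.fetch_add(1);   // the atomic op completes as a whole
        stalled.store(true);    // announce that this thread is now stuck
        while (!release.load()) {
            std::this_thread::yield();  // simulated stall
        }
    });

    // Wait until the sleeper is definitely stalled.
    while (!stalled.load()) {
        std::this_thread::yield();
    }

    // The worker is not blocked by the stalled thread.
    std::thread worker([&] {
        for (int i = 0; i < increments; ++i) {
            counter.fetch_add(1);
        }
    });
    worker.join();
    long while_stalled = counter.load();

    release.store(true);
    sleeper.join();
    return {while_stalled, counter.load()};
}
```

With `run_with_stalled_thread(1000)`, both fields come back as 1001: the worker completed all of its increments while the other thread was still stuck. Had the stalled thread been holding a mutex instead, the worker would have been blocked until it woke up.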