There is another architectural change in Skylake that is relevant to this article. In Skylake the L3 cache is now non-inclusive, which means that a core trying to acquire the lock will have to read the cache line through a cross-core cache read. Making things worse, when contention is high the cache line will frequently be dirty. The latency for a cross-core dirty hit in the L1/L2 cache is documented to be 60 cycles, which is much higher than the L3 cache latency of past architectures. This will increase overall contention, which in turn increases how often the pause instruction is executed. I'm wondering whether the pause instruction's cycle count was increased to reduce the impact of cross-core cache reads.
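
To make the access pattern concrete, here is a minimal sketch of a test-and-test-and-set spinlock in Rust. It is not the lock from the article; `SpinCounter` and `contended_increments` are hypothetical names made up for illustration. The only assumption about hardware is that `std::hint::spin_loop()` compiles to `pause` on x86, which is what the standard library documents. Every failed acquire and every load in the inner loop touches the lock's cache line, which under contention was most likely last written by another core.

```rust
use std::hint::spin_loop;
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical counter guarded by a test-and-test-and-set spinlock.
pub struct SpinCounter {
    locked: AtomicBool,
    // Updated with a separate load and store, so it is only correct
    // if the lock really provides mutual exclusion.
    value: AtomicU64,
}

impl SpinCounter {
    pub fn new() -> Self {
        SpinCounter {
            locked: AtomicBool::new(false),
            value: AtomicU64::new(0),
        }
    }

    pub fn increment(&self) {
        loop {
            // Acquire attempt: needs the lock's cache line in exclusive
            // state, so it pulls the (often dirty) line from another core.
            if self
                .locked
                .compare_exchange_weak(false, true, Ordering::Acquire, Ordering::Relaxed)
                .is_ok()
            {
                break;
            }
            // Wait on plain loads until the lock looks free. On x86,
            // spin_loop() emits `pause` on each iteration.
            while self.locked.load(Ordering::Relaxed) {
                spin_loop();
            }
        }
        // Critical section: deliberately non-atomic read-modify-write.
        let v = self.value.load(Ordering::Relaxed);
        self.value.store(v + 1, Ordering::Relaxed);
        // Release: dirties the line again for the next waiter.
        self.locked.store(false, Ordering::Release);
    }
}

/// Spawns `threads` workers that each call `increment` `iters` times
/// and returns the final counter value.
pub fn contended_increments(threads: usize, iters: u64) -> u64 {
    let counter = Arc::new(SpinCounter::new());
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let c = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..iters {
                    c.increment();
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    counter.value.load(Ordering::Relaxed)
}
```

Calling `contended_increments(4, 10_000)` should return 40000 if the lock is doing its job. The sketch only shows where the cross-core reads and the `pause` sit in the loop; it does not measure latency, and it says nothing about whether the longer `pause` was actually motivated by this.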