
Why Skylake CPUs Are Sometimes 50% Slower - GordonS
https://aloiskraus.wordpress.com/2018/06/16/why-skylakex-cpus-are-sometimes-50-slower-how-intel-has-broken-existing-code/
======
jstarks
The title is misleading — no code is broken, though it is true that .NET
programs that exhibit lock contention will be slower.

The root of the problem is that .NET implemented a rather baroque spin loop
with several assumptions about the CPU and OS architecture. Unlike any other
spin loop I have seen or written, they execute multiple pause instructions per
loop. Arguably this was a poor choice.

It would be interesting to see the history of this design. It was probably
done years ago when .NET was closed source.

~~~
eloff
My experience is to the contrary. Every spin lock I have seen uses more than
one pause instruction. Some may start with one and then execute more. They all
involve a loop over pause instruction(s). I don't think that's the problem per
se. The problem is more likely the back-off logic.

~~~
cesarb
At least the Linux kernel spinlocks do a single pause instruction (which it
calls cpu_relax()) before checking again to see whether the lock is free; it's
a loop over the pause instruction, but the loop condition checks the lock
value. I'd expect most spinlock implementations to do the same.
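
For reference, a minimal sketch of that pattern in C with GCC/Clang builtins (my own
illustration, not the kernel's actual code; x86-only because of the pause intrinsic):

    static volatile int lock;  /* 0 = free, 1 = held */
    
    static void spin_lock(void)
    {
        while (__atomic_exchange_n(&lock, 1, __ATOMIC_ACQUIRE)) {
            /* One pause per iteration, then re-check the lock value. */
            while (__atomic_load_n(&lock, __ATOMIC_RELAXED))
                __builtin_ia32_pause();
        }
    }
    
    static void spin_unlock(void)
    {
        __atomic_store_n(&lock, 0, __ATOMIC_RELEASE);
    }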

~~~
eloff
They basically all do that. I've seen some with multiple pause instructions in
the loop body, and others with a switch or if/else if/else block where it
starts off with one pause instruction, then goes to multiple, then moves to
more drastic back-off techniques like giving up the thread's time slice or
micro-sleeping.
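
Roughly this shape, as a hedged sketch in C (illustration only, not any particular
library's code; the iteration thresholds and the sched_yield/nanosleep fallbacks are
just stand-ins for the "more drastic" steps):

    #include <sched.h>
    #include <time.h>
    
    /* Illustrative escalation: single pause -> burst of pauses -> yield -> micro-sleep. */
    static void backoff(unsigned iteration)
    {
        if (iteration < 10) {
            __builtin_ia32_pause();                 /* single pause */
        } else if (iteration < 20) {
            for (int i = 0; i < 50; i++)            /* burst of pauses */
                __builtin_ia32_pause();
        } else if (iteration < 30) {
            sched_yield();                          /* give up the time slice */
        } else {
            struct timespec ts = { 0, 1000000 };    /* ~1 ms micro-sleep */
            nanosleep(&ts, NULL);
        }
    }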

~~~
acqq
> I've seen some

Then please quote at least one actually used in Linux kernel, including where
it is used.

~~~
eloff
I didn't say I've seen those in the kernel. I haven't ever looked at the spin
lock(s) in the kernel.

~~~
acqq
And your claims are still completely unsupported, at least here in this
discussion. You don't give even a single example of some other publicly known
code that does what .NET does. And the way you formulated your previous claim
appeared to claim that the Linux kernel contains such a kind of code. I'd
still appreciate even a single verifiable example of all that "some" that
you've apparently "seen".

------
al2o3cr
The Intel docs mention that PAUSE isn't the right choice for waits of
"thousands of cycles or more" - even on the old architecture, 10ms is
waaaaaay longer than recommended.

I also have to quibble with this conclusion:

    This is not a .NET issue. It affects all Spinlock implementations
    which use the pause instruction.

The instruction isn't the issue; the exponential backoff that .NET includes
is. At the extreme (last iteration, on a 64-core CPU) it will spin for 1.2
_million_ pause instructions...
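
Back-of-the-envelope, assuming the ~10-cycle vs ~140-cycle pause latencies
from the article and a 3.5 GHz clock:

    1.2 million pauses × ~10 cycles  ≈ 1.2e7 cycles ≈  3.4 ms at 3.5 GHz (pre-Skylake)
    1.2 million pauses × ~140 cycles ≈ 1.7e8 cycles ≈ 48 ms   at 3.5 GHz (Skylake-X)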

~~~
eloff
The whole point of spinning is to avoid the expense of a context switch, i.e.
putting the thread to sleep and letting the OS schedule another. If you're going
to spin for 10ms, that's longer than the timeslice most threads run for. That
is going to be well above the overhead of any context switch. It makes no
sense.

~~~
dragontamer
> That is going to be well above the overhead of any context switch.

Is it though?

The "pause" instruction gives all resources to the hyperthread sibling. So
under high-utilization (ie: your core has a hyperthread-sibling ready and
willing), a "pause" instruction is basically a super-low latency task-switch.

I wouldn't be surprised that a "spinlock with pause" would be better, even in
5ms to 10ms chunks of time (Windows has 15ms quantas, and may task switch as
late as 30ms to 75ms depending on the situation).

Because it's better to do the hardware-assisted "pause" task switch rather than
actually get the OS scheduler involved. In fact, that's what
Microsoft AND Intel have concluded!! To get better real-world metrics, both
Microsoft AND Intel felt the "pause" instruction should wait longer:
Skylake-X increases the latency, while .NET created this exponential backoff.

I bet it was good in isolation, but when combined, the overall effect was too
much pausing.

~~~
eloff
Giving resources to the sibling hyperthread via pause is not the kind of
context switch I'm talking about. I'm talking about the expensive context
switch when the OS scheduler must swap out a running thread for another. That
carries a steep, obvious penalty for switching between user space and the kernel
and back, but it can also involve much heavier penalties where the cache and
TLB are concerned, depending on what thread runs next.

Pausing for as much as 10ms could be better under some circumstances, but that
would seem to me to be heavier than a context-switch. Whether the system
actually achieves better throughput or not in that case would be workload
dependent. Worst case the hyperthread is also running the spin lock loop and
no useful work at all is accomplished on the core.

~~~
dragontamer
> Giving resources to the sibling hyperthread via pause is not the kind of
> context switch I'm talking about.

Isn't this entire thread about "pause"? Anyway, I think we both understand
the difference between "context switch" and "pause" here.

> Pausing for as much as 10ms could be better under some circumstances, but
> that would seem to me to be heavier than a context-switch. Whether the
> system actually achieves better throughput or not in that case would be
> workload dependent.

Seems reasonable. Ideally, threads / cores shouldn't be blocked that long, so
a 10ms+ blocked period is definitely an anomaly.

But whether or not a context switch is actually worth it? Ehhh, it's hard to
say for sure.

------
DoofusOfDeath
I really think the article summary is misleading.

The object code in question uses an instruction that's basically a
microarchitecture-specific hack. IMO the real issue here is that the object
code is optimized for an earlier microarchitecture and needs an update.

Disclaimer: I work for Intel, but not on their CPUs, and I'm not a fanboi.

~~~
int0x80
What amazes me is that they do not test this extremely specific uarch code
when a new uarch is released.

On the other hand, I think the hack is not in using the pause instruction
itself, but in that crazy backoff algo. For example, Linux uses pause too (via
rep; nop) in lots of places, but not in an exponential loop (that I'm aware
of).
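
For what it's worth, the idiom is tiny. A sketch (not a verbatim kernel excerpt):

    /* "rep; nop" assembles to F3 90, the same bytes as "pause", so it
     * degrades to a plain NOP on CPUs that predate the hint. */
    static inline void cpu_relax(void)
    {
        asm volatile("rep; nop" ::: "memory");
    }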

------
mickronome
Unsure what I'm more astonished about: that the correct case is that it would
spin 29ms, or that a CPU instruction has changed its expected latency upwards
by an order of magnitude.

I would expect any spinning lock to spin at most some slight fraction of the
scheduling granularity, but almost always far less than the context switch
time except maybe in some specific case of latency sensitive code.

Maybe the long spin time is for some reason trying to account for the very
long scheduling quanta of Windows Server?

Superficially it seems misdirected in anything but real-time code, since most
schedulers would allow the thread to keep its quantum until the next scheduling
if the thread wait time was low. The only case where I can see this not holding
is a system that is overloaded and all CPUs have multiple threads that are all
starved.

~~~
dragontamer
"Pause" spins give all their resources to the hyperthread-sibling. So unless
the hyperthread-sibling is also stalled out for some reason, the "pause"
instruction is definitely preferred.

So it's not like resources are lost (at least, under high CPU utilization). But
still, you're right in that a 29ms spin-lock sounds outrageous to me. That's
definitely long.

~~~
cma
Pause can be the best choice even without hyperthreading because it can lower
power utilization. And even on desktop that can mean less heat and more time
for other stuff to run at a higher-clocked 'Turbo Boost'.

------
tsh56
There is another architectural change made in Skylake which is related to this
article. In Skylake the L3 cache is now non-inclusive, which means that a core
trying to acquire the lock will have to read the cache line through a cross-core
cache read. Making things worse, when contention is high the cache line
will frequently be dirty. The latency for a cross-core dirty hit of the L1/L2
cache is documented to be 60 cycles, which is much higher than the L3 cache
latency of past architectures. This will increase overall contention, which
will magnify how often the pause instruction is hit. I'm wondering if the
pause instruction cycle count was increased to reduce the impact of cross-core
cache reads.

------
m_eiman
What kind of code is reasonably written as 40+ cores spinlocking the same
lock? Seems like a strange design, IMHO…

~~~
_wmd
You could easily find examples of this - for example in any textbook threaded
socket server design (thread pool contending on accept())

~~~
spacenick88
But wouldn't that use a non-spinning OS lock? I.e. the scheduler knows which
threads are waiting on the accept and can randomly pick one to directly wake
up.

~~~
mickronome
If your design has multiple threads waiting for accept on the same socket,
then leaving it to the OS to schedule said threads on accept should at least
be the first thing to try. If nothing else because it's the least surprising
thing to do.

------
lucb1e
TL;DR: There exists a "pause" CPU instruction, which signals to the CPU that
the code is in a spinlock wait loop. In Skylake this instruction sleeps longer
than before. If your code heavily uses this, then it'll wait longer for things
like locks to be released.

More info on the purpose of the pause instruction:
https://stackoverflow.com/questions/4725676/how-does-x86-pause-instruction-work-in-spinlock-and-can-it-be-used-in-other-sc

Info on what's a "spinlock" (I thought it was another word for busy wait, but
there's a subtle difference):
https://stackoverflow.com/questions/38124337/spinlock-vs-busy-wait

~~~
geezerjay
Great tldr. This should be the top post in this discussion.

~~~
davidhyde
There is also benefit to be had in HOW the author found the root cause of the
problem. That was more interesting to me than the problem itself.

~~~
geezerjay
I agree that's also interesting, but in a discussion with such a clickbaity
title it's important to correctly identify the source of the so-called problem
right upfront, just to avoid wild speculation and conspiracies.

~~~
Dylan16807
> but in a discussion with such a clickbaity title it's important to correctly
> identify the source of the so called problem right upfront

That's fair, but consider it closer to a subtitle than a tl;dr, which is
supposed to be a summary that lets you skip the main body entirely.

> just to avoid wild speculations and conspiracies

The burden there is on people to not comment if all they know about the
article is a wild guess.

------
userbinator
From the screenshot of the Intel document linked:

 _The increased latency ... has a small positive performance impact of 1-2% on
highly threaded applications._

Making an instruction more than 10x slower overall, for a 1-2% gain (who wants
to bet it's some "industry standard" benchmark...?) in a very specific
situation sounds like a textbook example of premature microoptimisation. Of
course they don't want to give the impression that they severely slowed things
down, but the wording of the last sentence is quite amusing: "some performance
loss"? More like "a lot". It reminds me of what they said about the
Spectre/Meltdown workarounds.

~~~
kllrnohj
> a textbook example of premature microoptimisation

No, this is textbook _optimization_

Benchmarks showed a positive gain -> that's a real optimization, not premature
at all.

Premature optimization is when you _guess_ that a change will make things
better _without measuring_. As soon as you measure, it becomes real
optimization work, not premature at all. As in, this is how you're supposed to
make things better.

Feel free to debate the merits of their benchmark, but this is a textbook
example of Doing Performance Correctly otherwise.

~~~
codedokode
> Benchmarks showed a positive gain

But this gain of 1-2% is balanced against the author's 50% performance loss. I
guess it means that the benchmark results here depend on the choice of tested
applications. If they had used more multithreaded .NET applications, the tests
could have shown a performance degradation as well.

------
TimJYoung
What we do with the spin lock classes that we use in our products is a
SwitchToThread call:

https://msdn.microsoft.com/en-us/library/windows/desktop/ms686352(v=vs.85).aspx

If the SwitchToThread call returns False, then the code reverts to a Sleep(0)
call instead.

Credit to StackOverflow and Joe Duffy's excellent article on this:

https://stackoverflow.com/questions/1383943/switchtothread-vs-sleep1

http://joeduffyblog.com/2006/08/22/priorityinduced-starvation-why-sleep1-is-better-than-sleep0-and-the-windows-balance-set-manager/

However, as mentioned in one of the SO replies, you _still_ need to make sure
that you don't use locks that loop on calls like this when those locks
will have to wait on anything long-running - you're just going to burn
CPU/battery needlessly. So, only use such locks when you know that the
protected code doesn't involve any long wait states, or modify the code to
back out even further into a proper kernel wait after looping X times.
Personally, I agree with the second SO answer and don't like such constructs;
I'd say just go with a straight-up kernel wait if there's any question about
whether such conditions could exist now or in the future.
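
A minimal sketch of that yield step against the Win32 API (illustration only,
not the actual spin lock classes):

    #include <windows.h>
    
    /* Yield to another ready thread on this processor; if there isn't one,
     * give up the rest of the time slice instead. */
    static void yield_or_sleep(void)
    {
        if (!SwitchToThread())
            Sleep(0);
    }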

------
amelius
I was expecting CPUs to become slower because of the Meltdown/Spectre
problems, not this.

------
marcoperaza
Great investigation. Does anyone know or can anyone intelligently speculate
what would lead to this tradeoff in the CPU design?

~~~
dragontamer
Well, first you have to understand exactly what "Pause" is.

"Pause" is a hint to the CPU that the current thread is spinning in a
spinlock. You've ALREADY have tested the lock, but it was being held by some
other processor. So why do you care about latency? In fact, you probably want
to free up more processor resources as much as possible.

Indeed, there aren't actually any resources wasted when you do a "pause"
instruction. In a highly threaded environment, the hyperthread sibling of the
thread picks up the slack (you give all your resources to that thread, so it
executes quicker).

10-cycles is probably a poor choice for modern processors. 10-cycles is 3.3
nanoseconds, which is way faster than even the L3 cache. So by the time a
single pause instruction is done on older architectures, the L3 cache hasn't
updated yet and everything is still locked!!

140-cycles is a bit on the long side, but we're also looking at a server-chip
which might be dual socket. So if the processor is waiting for main-memory to
update, then 140-cycles is reasonable (but really, it should be ~40 cycles so
that it can coordinate over L3 cache if possible).

So I can see why Intel would increase the pause time above 10-cycles. But I'm
unsure why Intel increased the timing beyond that. I'm guessing it has
something to do with pipelining?
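
For scale (rough numbers, assuming a ~3 GHz clock):

    10 cycles  ÷ 3.0e9 Hz ≈  3.3 ns   (pre-Skylake pause)
    140 cycles ÷ 3.0e9 Hz ≈ 47 ns     (Skylake-X pause)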

~~~
marcoperaza
Thank you! That’s very informative.

------
retrack
Two elements are fuzzy:

- Clock speed is way different between 3.5GHz and 2.6GHz. This can lead to
huge differences in some computations, and no new instruction set can compete
against brute force.

- Changing between those 2 CPUs is not just a socket swap but a full platform
change: can we get the other details of the 2 platforms?

------
codedokode
This is a kind of bug that can be difficult to reproduce. Imagine if someone
reports this bug to you but your own tests show that nothing has changed
(because you use an older CPU).

------
rocky1138
Excellent research into the problem. Very helpful that this person also
included a test program to see if one is affected by the problem.

------
metalliqaz
Interesting article but, man, the prose was difficult to follow.

------
gchokov
It's high time for somebody to disrupt Intel and their model. I really hope
this can happen soon.

~~~
growlist
AMD are doing a pretty stunning job recently.

~~~
paulie_a
AMD has had better chips at a lower price point since always.

You can get an 8-year-old octocore that still destroys anything Intel has
put out recently.

Intel relies on inertia. AMD consistently beats them on actual performance at
a lower price. I remember my 386 dx 2 had better performance than an Intel
486. And let's not forget the P4 issues with their garbage RAM. Plus they
backdoor their chips with stupid insecure crap.

~~~
mywittyname
I'm not sure where you get this impression, but Intel held the performance
crown over AMD since releasing the Core architecture. In fact, things were
looking pretty dire for AMD during the Phenom days when their highest-tier
offering could only match mid-tier Intel chips in performance, so they had to
compete solely on price. [1] AMD's stock price during this era was the lowest
it's been since like 1980.

It's only with Ryzen that AMD reached performance parity with Intel once
again.

[1] https://www.tomshardware.com/reviews/amd-phenom-ii-x6-1090t-890fx,2613-7.html

~~~
paulie_a
The Core 2 Duo was crap, the celeries were mediocre, the P4s in combo with
RDRAM were junk and a total failure with shitty performance. AMD really should
have held the crown during those years, so I don't know why you think Intel
was a consistent winner. Most of the chips they made from 2000-2010 were less
than great. Intel outright didn't have a good chip for a very long time. Like
I previously stated, you can buy an octocore that still beats most Intel
offerings for far less. If anything, Intel is mid-tier with a huge markup on
price.

Stock price doesn't determine CPU quality. AMD stock is a whipping boy for
shorting... and an easy way to profit when those professional investors are
constantly wrong. (I've invested in AMD over the last 20 years and there is an
extremely predictable cycle in the stock price. Though just a disclaimer, I
currently do not hold any AMD stock.)

~~~
mywittyname
> so I don't know why you think Intel was a consistent winner.

Because I read the benchmarks. Intel was winning by such a huge margin that
AMD nearly bankrupted themselves trying to compete on price. Intel didn't even
need to dip into their margins at all because AMD was several generations
behind.

Intel beat AMD to 32nm fabrication by over a year, with the gap widening over
the next five years. Intel got to 14nm in 2015 while AMD _just_ got there last
year with Ryzen.

> Like I previously stated you can buy an octocore and that's still beats most
> Intel offerings for far less

BS. Nothing pre-Zen was competitive with _contemporary_ Intel offerings, much
less the latest offerings. As I pointed out, AMD was a generation or two
behind Intel for this entire decade.

https://www.anandtech.com/show/6396/the-vishera-review-amd-fx8350-fx8320-fx6300-and-fx4300-tested/2

Yeah, you can still buy these six+ year old CPUs for $90, but, as that review
shows, they were barely competitive with 2012 Intel processors, much less with
CPUs that are two generations newer.

> Stock price doesn't determine CPU quality.

It's a measure of how well a company is doing.

The original Phenom had a major CPU bug that hit once they finally began to
gain market-share in the server market. This cost AMD pretty heavily and
forced them to cut R&D budgets, which, as stated, put AMD generations behind
Intel in tech. This led to a slashing of margins to compete, which put them
even further behind Intel.

Oh yeah, and somewhere during all that, they were the target of class action
lawsuits for overstated performance.

This doesn't even get into the Radeon lineup, which totally missed the AI/ML
craze that sent nVidia profits to the sky.

~~~
Sohcahtoa82
I agree with your post, but I did want to point out...

> This doesn't even get into the Radeon lineup, which totally missed the AI/ML
> craze that sent nVidia profits to the sky.

On the other hand, they scored well on the cryptomining craze. If it wasn't
for the fact that an R9 290 was significantly faster at mining than the
GeForce 980 GTX, I'm not sure if AMD would still be making GPUs.

