
63 Cores Blocked by Seven Instructions - nikbackm
https://randomascii.wordpress.com/2019/10/20/63-cores-blocked-by-seven-instructions/
======
pbsd
For each input source file, cl.exe creates at least 7 temporary files (with
suffixes "gl", "sy", "ex", "in", "db", "md", "lk"). The churn of creating and
deleting those, coupled with the slowness of performing checkpointing on a
huge empty drive, seems to be the root cause here.

This appears somewhat related to this bug report:
[https://developercommunity.visualstudio.com/content/problem/...](https://developercommunity.visualstudio.com/content/problem/310131/clexe-creates-so-many-temp-files-it-freezes-the-sy.html)

Marking the temporary files as FILE_ATTRIBUTE_TEMPORARY could improve things,
without having to go into significant Windows kernel changes.

~~~
NullPrefix
> Moving %TEMP% to a RAM drive made the builds noticeably faster and also left
> the system responsive, even during two concurrent 20-way parallel builds (40
> compile jobs in parallel, no loss of system responsiveness).

This matters more than anything else. You can play all you want with
FILE_ATTRIBUTE_TEMPORARY or whatever else markings but the OS will just not
care about them enough.

~~~
MuffinFlavored
Does creating a RAM drive in Windows still require third-party
software/drivers or is it natively supported by Microsoft these days? I
haven't tried to do it since Windows 7 I believe.

~~~
NullPrefix
That whole Linux-integration-into-Windows marketing campaign would have you
believe that Linux things just work on Windows now and you can just "mount -t
tmpfs none /path/to/my/mount -o size=64G".

I doubt that is going to happen. File system mounts do not run on marketing
fuel.

~~~
jrockway
The latest variant of WSL uses a VM. So that will work, but you won't be
seeing that mount inside Windows.

~~~
Dylan16807
Well, you can access it over the \\wsl$ shared drive. But that share can only
do about 4K IOPS.

------
markdog12
Bruce Dawson does some of Microsoft's most valuable work for Windows. Doesn't
even work for them.

~~~
Darkphibre
I worked with the man for years in the Xbox Advanced Technology Group.
_Amazing_ individual. When he left the team, I conducted my own exit interview
so I could learn from him, and walked away with pages and pages of insights on
growing my own career and becoming a subject matter expert.

He was on my interview loop at ATG, and I recount it as my favorite interview
of all time. He pointed to a circuit diagram poster, and said to me "You have
to write a game for that, what design considerations should you be aware of?"

It looked something like this (can't find the actual poster, it's been a
decade): [https://qph.fs.quoracdn.net/main-qimg-9cdbc7bf35ef8126755175...](https://qph.fs.quoracdn.net/main-qimg-9cdbc7bf35ef8126755175c99410013c)

A bit out of my league, but I identified the important aspects
(multicore/hyperthreaded design, small L0/L1 cache and impacts to
mispredictions, etc.) and spoke to what I could and where my uncertainties lay.
Afterwards he gave up the rest of the time to let me ask questions about the
team.

One XFest he stood on stage giving a Powerpoint presentation on debugging and
multithreaded concerns. An animation was slow, and he broke into it and
started debugging Powperpoint live to demonstrate some of his techniques. A
legend.

A huge loss to Microsoft when he stepped away. I did and do wish him the best!

~~~
anniely
Wow. This was such a valuable post.

Can you describe some of the insights you learned from the exit interview --
about growing your career and becoming a subject matter expert? I'm new to the
field and I feel like that'd be immensely valuable to me and many others.

~~~
Darkphibre
Sure! It's been a decade, but the biggest thing that stuck out was to pick
something that _needs_ an SME. Fill a gap on the team, even if it's
something you may not have a huge interest in. _Find_ the interest in it, and
_own it_.

But most importantly, give talks and brownbags about the technology.
Understand that there's _going_ to be someone in the room that knows more than
you... but they aren't giving the talk and helping everyone else, _you_ are.
They will chime in, and that's OK. _You_ are the one putting yourself out
there educating yourself and others. This helped me so much when I gave talks
at GDC... even if I'm helping ONE person, it makes the event worth it (and the
talk serves as my unique perspective / take on the industry).

Pore over source materials. Bruce read the 600+ page CPU documentation front-
to-back, _twice over_. He said the second time, he gleaned so much more
insight.

The engineers didn't realize just how much knowledge they were trying to
distill, so you might read a comment that says "Of course, the second
parameter determines XYZ." The first read-through, you might gloss over that.
The second read-through, you realize the instruction they're documenting is
doing double-duty elsewhere, and the comment is an important indicator of how
that interaction plays out on the die.

Good luck!!

------
KenanSulayman
Technically this wasn't caused by those instructions but by the spinlocks
waiting for the lock to be released. Also "blocked by seven instructions"
sounds a bit click-baity... you can lock the CPU or power off the computer
with fewer instructions than that :-)

~~~
saagarjha
Or break it, depending on how old it is:

    f0 0f c7 c8: lock cmpxchg8b eax

~~~
Piskvorrr
Does this bug still bite anywhere? Embedded computing with Pentiums?

~~~
lmm
I used a Pentium with that bug as a router up until 2009.

~~~
Piskvorrr
Yeah, I used to have a similar setup then. That was 10 years ago, though.

------
vkaku
It's good to see how features turned on by default (System Restore) can have
such a bad impact on performance. Thank you for doing the profiling!

~~~
acdha
System Restore is commonly disabled for test systems or anyone who has a good
automated deployment system. Now I’m wondering how likely it is that many
engineers at Microsoft disabled it to save space or conserve every available
IOP, especially in the era before large SSDs were widely available.

~~~
pm7
Do you have any good list of such actions?

------
strictfp
So, one busy process performs a file operation that triggers a system restore
checkpoint, and the OS locks the entire drive during this file operation?
Sounds strange to me.

Is the problem that the checkpointing critical section has the same duration
as the triggering file operation?

I get that there must be some sort of critical section for setting a
checkpoint, but I don't understand why it takes so long, and why it would be
affected by how busy the userspace process that triggered it is.

I would expect it to have a short barrier-style critical section; drain all
outstanding writes, record some checksum or counter from a kernel data
structure, and then release all writers again.

In my mind this should be kernel code only, entirely unaffected by userspace,
and if designed nicely, quite fast.

So I guess I don't get what is going on here.

~~~
brucedawson
My understanding is that the system restore checkpoints happen every five
seconds. They hold a lock, which seems reasonable.

The problem is that for some reason on this machine the checkpoint process was
taking a really long time. I also don't understand why it was taking so long.
It normally doesn't. Something went terribly wrong.

> and if designed nicely, quite fast.

Yep, should be. But it wasn't. If everything worked as it should then I'd
never get to write any blog posts!

------
jeffdavis
It looks like this is a case where a process is holding lock A while waiting
on lock B; and every other process is waiting on lock A. That's normal enough,
though it seems like there are two mistakes:

First: Never spin waiting on a lock for 3 seconds. If you expect a lock to be
released very quickly, you spin K times and then, if you still don't have the
lock, try something heavier that can deschedule your process. K should be
small enough that your time slice is unlikely to expire while spinning,
otherwise, it just causes confusion and wasted work because it looks like your
process is doing work when it's not.

Second: It seems dubious that using a feature like system restore causes all
Write calls to wait for a lock held by a process in the middle of I/O. I'm
sure there are some cases where that must happen (like if out of buffer space
to hold the writes), but I would think it would be harder to hit.

EDIT: Rephrased my comment in terms of two problems rather than just the first
one.

~~~
caf
Look again - that tight loop in RtlFindNextForwardRunClear isn't spinning on a
lock - it's scanning forward through memory, 4 bytes at a time, looking for 4
bytes not equal to the pattern in %ebp.

So it looks more like _" process is holding lock A while doing a very long
scan through memory"_. That would fit with the name of the function, too.

~~~
brucedawson
Correct. Nobody was spinning on a lock. Everybody was waiting politely.

The problem was that the system process held the lock for too long, due to
some inefficiency in system restore (root cause not yet understood by me).

------
saagarjha
Why do the sample counts cluster so heavily on the jne, as opposed to the
other instructions in the loop?

~~~
Const-me
It's the same off-by-one issue that often affects the address of the crashing
instruction.

The report is incorrect; the vast majority of the time is taken by the
previous instruction, cmp dword ptr [r8],ebp. It's the only one accessing RAM,
and accessing a cache line shared across cores is very expensive, even more so
than a cache miss.

~~~
BeeOnRope
I don't think that's the case here - see my other longer comment, but I think
ETW uses the PEBS sampling events, so the instruction is usually the slow-to-
retire one, not the subsequent one.

I believe cmp/jne macro fuse on this CPU, so you actually will never get any
samples on the first of the two fused instructions: rather they all show on
the second one. You see this same effect on Linux when sampling with the
cycles:ppp event.

~~~
brucedawson
I have seen no signs of samples hitting the most expensive instruction in an
ETW trace. Across the other cases I have looked at, the samples tend to
cluster after expensive instructions, not on them.

I guess it's possible that all of the cases I looked at were distorted by
macro fusion but I don't think so.

~~~
BeeOnRope
Interesting, so I guess ETW is using "normal" interrupts, which have skid, and
the CPU just has enough retire bandwidth to retire everything in one cycle.

------
alexeiz
"loop running in the system process while holding a vital NTFS lock"

It's not about the seven instructions. It's the lock that's been held while
doing a busy loop.

------
peter_d_sherman
Excerpt: "...I mean, how often do you have one thread spinning for several
seconds in a seven-instruction loop while holding a lock that stops sixty-
three other processors from running. That’s just awesome, in a horrible sort
of way."

I respectfully disagree.

That's because everything in the universe that is perceived as negative --
turns out to have a positive use-case somewhere, sometime, in some context...

In this case, I think the ability for one core to stop 63 other processor
cores is purely awesome, because think of the possible use-cases! Debugger
comes to mind immediately, but how about another if let's say there are 63
nasty self-resurrecting virus threads running on my PC? What about if you were
doing some kind of esoteric OS testing where you needed to return to something
like Unix's runlevel 1 (single user), but you'd rather freeze most of the
machine (rather than destroying the context of everything else that was
previously running?).

Oh, here's the best one I can think of -- don't just do a postmortem,
everything's dead core dump when something fails -- do a full (frozen!) "live"
dump of a system that can be replayed infinitely, from that state!

Now, because I take a contradictory position, doesn't mean we're not friends,
or that I don't acknowledge your technical brilliance! Your article was
absolutely great, and you are absolutely correct that for your use-case,
"That’s just awesome, in a horrible sort of way.".

But for my use-cases, it's absolutely awesome, in the most awesome sort of
way! <g>

~~~
lilyball
63 cores blocked on a single mutex is not at all like any of the scenarios
you're describing. That's almost like describing the Notre Dame fire as having
a positive use-case because what if you want to do a controlled demolition of
a large building.

------
snak
That was a good read. In-depth but understandable. Thanks for sharing.

------
CawCawCaw
These posts by Dawson are always interesting. Now, if only he would
investigate and remediate the performance deficiencies of other complex
systems, such as ... Chrome?

------
mehrdadn
Edit: Never mind... I completely missed the word "empty" when reading the
critical sentence. :(

~~~
wjnc
As per the article: "It is unclear why this code misbehaved so badly on this
particular machine. I assume that it is something to do with the layout of the
almost-empty 2 TB disk."

------
Syzygies
So when did you first realize he was discussing Windows, reading this?

The "of course everyone is a straight white male" attitude that the OS need
not be stated, so often seen in Windows posts, gave it away for me. However,
my biases threw me for way too long: the level of sophistication meant this
must be Linux, right? I should have recognized the graphics style in the
screen grabs. Certainly not MacOS, but Linux can be all over the map
stylistically. Does Windows really still look like that? Wow.

~~~
ncmncm
I caught on when I realized this just wouldn't be happening anywhere else.

Microsoft has, single-handedly, got two generations of people used to
computers working badly, convinced that it's not just unavoidable, but normal.
If cars worked as badly, we would all see multiple explosions every day (and
think it was awesome).

------
ncmncm
The cause is obvious: they were building on Microsoft Windows, using the NTFS
filesystem. Even Microsoft doesn't try to build on NTFS.

Changing any single detail gives better results. Use a Samba share from a
Linux filesystem. Run Mingw on a Linux system. Run MSVS in Wine on a Linux
system.

Windows is an execution environment for applications. There is no need for,
and no value in, actually performing builds in your target execution
environment. Use a system designed from the ground up for builds.

~~~
youdontknowtho
That's the first I've heard that MS doesn't try to build Windows on Windows
with NTFS.

~~~
dijit
I don't have first hand experience, but I know some people at MS who work on
Xbox (which is a modified version of Windows+HyperV underneath);

From what I understood from them, they do not use NTFS for build machines
(they use SMB from a clustered filesystem), but they _do_ use a heavily
modified version of Windows; incidentally, that modified version went on to
become "Windows Nano". What the actual "Windows" team does is a mystery to me,
though; I would assume it is similar or the same.

~~~
youdontknowtho
That's super interesting. I wonder if they have moved to use SMBv3?

I really liked the direction with Nano in 2016, but I guess it makes more
sense as a container OS. Still, the latest version is what a lot of people
wish they could start an operating system with: NT kernel, no WMI, no
servicing, no activation.

