This appears somewhat related to this bug report: https://developercommunity.visualstudio.com/content/problem/...
Marking the temporary files as FILE_ATTRIBUTE_TEMPORARY could improve things, without having to go into significant Windows kernel changes.
This matters more than anything else. You can play all you want with FILE_ATTRIBUTE_TEMPORARY or whatever else markings but the OS will just not care about them enough.
However, having a Samsung SSD and enabling RAPID mode in Samsung Magician (SSD management software), which effectively uses an invisible RAM disk, accelerated my games' startup times by a factor of at least 3x. Some games even start 5x faster.
I do recognise that games have a much different workload than compiler jobs of course; but invisibly utilising a RAM disk might help a lot regardless.
Back in the 3.1 and 9x days it didn't require 3rd party drivers, thanks to ramdrive.sys
I doubt that is going to happen. File system mounts do not run on marketing fuel.
Granted, IT won't be thrilled that I'll need ~4-8x the RAM, but the devs would certainly love the speed.
....now if only I could make fastlane take less than 2 hours....
On Linux, I can run a build repeatedly that touches several GB of files and there is basically zero IO during the build, because everything is in the page cache.
There is some IO at some point after the build as the dirty files (e.g., .o files) get written out, but if you delete them fast enough, even that doesn't happen.
Having used Cygwin and MinGW (and, to a lesser extent, WSL), I'd say NTFS is probably the main factor here. With Cygwin especially, program compilation is very slow, not just due to process creation but also file access, potentially hitting the disk.
I could see checkpointing contributing, but having used backups/file history on Windows Server, you do see CPU-use irregularities that I think have a similar cause to this, just not as bad as described in the article.
The pathological behavior of NTFS with many files is easy to prove once you encounter it. This would be exposed to the kernel as well and is likely holding up NtfsCheckpointVolume. At least in my experience the problem goes deeper, and how NTFS structures are handled also contributes; for example, trying to enumerate and copy files in certain ways is extremely slow even if the files are easily enumerable.
You can say that if disabling checkpointing gets rid of noticeable slowdown it is the right thing to do, but there are people who will rightly not want to disable it.
There's a ZFS driver for Windows now. I haven't tried it yet, but I think I will. Can't be any worse.
He was on my interview loop at ATG, and I count it as my favorite interview of all time. He pointed to a circuit diagram poster and said to me, "You have to write a game for that; what design considerations should you be aware of?"
It looked something like this (can't find the actual poster, it's been a decade):
A bit out of my league, but I identified the important aspects (multicore/hyperthreaded design, small L0/L1 cache and the impact of mispredictions, etc.) and spoke to what I could and where my uncertainties lay. Afterwards he gave up the rest of the time to let me ask questions about the team.
One XFest he stood on stage giving a PowerPoint presentation on debugging and multithreaded concerns. An animation was slow, so he broke into it and started debugging PowerPoint live to demonstrate some of his techniques. A legend.
A huge loss to Microsoft when he stepped away. I did and do wish him the best!
Can you describe some of the insights you learned from the exit interview -- about growing your career and becoming a subject matter expert? I'm new to the field and I feel like that'd be immensely valuable to me and many others.
But most importantly, give talks and brownbags about the technology. Understand that there's going to be someone in the room that knows more than you... but they aren't giving the talk and helping everyone else, you are. They will chime in, and that's OK. You are the one putting yourself out there educating yourself and others. This helped me so much when I gave talks at GDC... even if I'm helping ONE person, it makes the event worth it (and the talk serves as my unique perspective / take on the industry).
Pore over source materials. Bruce read the 600+ page CPU documentation front-to-back, twice over. He said the second time, he gleaned so much more insight.
The engineers didn't realize just how much knowledge they were trying to distill, so you might read a comment that says "Of course, the second parameter determines XYZ." The first read-through, you might gloss over that. The second read-through, you realize the instruction they're documenting is doing double-duty elsewhere, and the comment is an important indicator of how that interaction plays out on the die.
Is there a video of this somewhere? Sounds amazing.
At that point I was an ex-Microsoft person giving a talk at a Microsoft conference using Microsoft tools to profile Microsoft's presentation software. It may have been a cheeky thing to do, but it was so much fun.
I'm not aware of any publicly available video but I did a writeup of the issue:
We were the firefighters when a game studio's experts couldn't figure out what was going wrong. They might provide a code snippet, or in rare cases the full game, and we could debug with the console's OS/driver source code. We even had access to the processor layouts for figuring out hardware bugs. We'd get copies of the Red Disc and Green Disc masters used for duplication, before the game was published (that helped figure out a few 0-day patch bugs).
The other half of the job was proactively figuring out what problems studios would run into with new APIs and new SDKs: how they would want to use them together, and the challenges that posed.
Finally, we were the developer representatives, advocating on their behalf as the platform progressed.
Was an amazing job. I only left because I just couldn't pass up my dream job (reworking the telemetry/stats pipeline for Halo 5, and getting to play with TB of data).
f0 0f c7 c8: lock cmpxchg8b eax
int main = 0xc8c70ff0;
The problem was that the thread that owned the lock was spinning in a seven-instruction loop.
Is the problem that the checkpointing critical section has the same duration as the triggering file operation?
I get that there must be some sort of critical section for setting a checkpoint, but I don't understand why it takes so long, and why it would be affected by how busy the userspace process that triggered it is.
I would expect it to have a short barrier-style critical section; drain all outstanding writes, record some checksum or counter from a kernel data structure, and then release all writers again.
In my mind this should be kernel code only, entirely unaffected by userspace, and if designed nicely, quite fast.
So I guess I don't get what is going on here.
The problem is that for some reason on this machine the checkpoint process was taking a really long time. I also don't understand why it was taking so long. It normally doesn't. Something went terribly wrong.
> and if designed nicely, quite fast.
Yep, should be. But it wasn't. If everything worked as it should then I'd never get to write any blog posts!
Obviously the lock shouldn't be held so long and so often though...
First: Never spin waiting on a lock for 3 seconds. If you expect a lock to be released very quickly, you spin K times and then, if you still don't have the lock, try something heavier that can deschedule your process. K should be small enough that your time slice is unlikely to expire while spinning; otherwise, it just causes confusion and wasted work because it looks like your process is doing work when it's not.
Second: It seems dubious that using a feature like System Restore causes all write calls to wait for a lock held by a process in the middle of I/O. I'm sure there are cases where that must happen (such as running out of buffer space to hold the writes), but I would think it would be harder to hit.
EDIT: Rephrased my comment in terms of two problems rather than just the first one.
So it looks more like "process is holding lock A while doing a very long scan through memory". That would fit with the name of the function, too.
The problem was that the system process held the lock for too long, due to some inefficiency in system restore (root cause not yet understood by me).
I agree that it seems dubious, but it is indisputably what was happening, repeatedly.
micro-op fusion means the seven-instruction loop is actually five micro-ops
Zen2 processors can retire five instructions per cycle
Therefore the loop runs at one iteration per cycle (wow!)
The cmp [r8] instruction occasionally has cache misses
This means that the seven instructions get synchronized such that the cmp [r8] instruction is the last of them to get retired in a seven-instruction block
Therefore the next instruction is usually the jne
TL;DR - the jne gets most of the samples because the cmp [r8] instruction is the most expensive.
Perhaps ETW shows you the precise instruction (i.e., "zero skid") that is slow to retire - this is not like a normal interrupt as described in the article but is available with some performance profiling events like 'cycles:ppp' on Linux perf (in particular, using the zero-skid PEBS events).
In that case, the samples show up on the jne, not the cmp, likely because cmp/jne have fused, so they basically get sampled as a single instruction and the samples point at the jump.
The other scenario is that ETW shows you "skid 1" instructions, i.e., the instructions generally after the slow-to-retire ones (as described in the article), and cmp/jne didn't fuse (perhaps because a cmp with a memory source argument can't fuse on AMD?), and so it again points to the jne.
I haven't looked at many ETW traces, so I couldn't tell you offhand - but for those who have, do the samples usually show on the expensive instructions (things like div and loads that miss are a giveaway), or on the one after?
Added: Per Agner, I guess the "fusion + no skid" is the most likely (from the Ryzen section of microarchitecture.pdf):
> A CMP or TEST instruction immediately followed by a conditional jump can be fused into a single μop. This applies to all versions of the CMP and TEST instructions and all conditional jumps, except if the CMP or TEST instruction has a rip-relative address or both a displacement and an immediate operand.
That also lines up with the cmp having exactly 0 samples, unlike any other instruction of the 7: that's a common indication of fusion.
The report is incorrect; the vast majority of the time is taken by the previous instruction, cmp dword ptr [r8],ebp. It's the only one accessing RAM, and accessing a cache line shared across cores is very expensive, even more so than a cache miss.
Crashing is (with the exception of floating-point exceptions) precise. A particular instruction crashes, the exception record points there, the instructions afterwards are discarded with no side effects. This is necessary to support things like restarting execution.
Sampling, on the other hand, is not even well defined. There are hundreds of instructions in flight, many completing simultaneously, and when an external interrupt happens the CPU has to decide which ones to commit and which to discard. The linked article gives many more thoughts about how the CPU draws the line in the silicon.
Here’s an example for GCC on ARM Linux: https://github.com/dotnet/corert/issues/7826 I think I have observed similar symptoms on Windows, too.
> Sampling, on the other hand, is not even well defined
A crash caused by, e.g., a RAM access violation, and an interrupt generated by the CPU to collect a sample for a profiler, are pretty similar, IMO.
Yes, and precise interrupts, too.
> CPU picks wherever it wants to stop in the program.
I've used profilers quite a lot, and based on my observations they're quite accurate, to exact instruction.
So, you end up with patterns. I linked to some detailed reverse engineering of which instructions are likely to end up being the victim. One common pattern is that the instruction after an expensive one will have the samples assigned to it, but there's more to it than that - I recommend reading it.
TL;DR - I'm not saying you're wrong; it's just that you're not saying anything specific enough for right/wrong to apply. "Accurate, to the exact instruction" has not been meaningful for sampling profilers for more than 2.5 decades.
I believe cmp/jne macro fuse on this CPU, so you actually will never get any samples on the first of the two fused instructions: rather they all show on the second one. You see this same effect on Linux when sampling with the cycles:ppp event.
I guess it's possible that all of the cases I looked at were distorted by macro fusion but I don't think so.
Hadn't thought about µ-op fusion in this case. Yes, that explanation is very plausible.
It's not about the seven instructions. It's the lock that's been held while doing a busy loop.
I respectfully disagree.
That's because everything in the universe that is perceived as negative -- turns out to have a positive use-case somewhere, sometime, in some context...
In this case, I think the ability for one core to stop 63 other processor cores is purely awesome, because think of the possible use-cases! Debugger comes to mind immediately, but how about another if let's say there are 63 nasty self-resurrecting virus threads running on my PC? What about if you were doing some kind of esoteric OS testing where you needed to return to something like Unix's runlevel 1 (single user), but you'd rather freeze most of the machine (rather than destroying the context of everything else that was previously running?).
Oh, here's the best one I can think of -- don't just do a postmortem, everything's-dead core dump when something fails -- do a full (frozen!) "live" dump of a system that can be replayed infinitely, from that state!
Now, just because I take a contradictory position doesn't mean we're not friends, or that I don't acknowledge your technical brilliance! Your article was absolutely great, and you are absolutely correct that for your use-case, "That’s just awesome, in a horrible sort of way."
But for my use-cases, it's absolutely awesome, in the most awesome sort of way! <g>
And there are simpler ways to prevent all access to a drive.
The "of course everyone is a straight white male" attitude that the OS need not be stated, so often seen in Windows posts, gave it away for me. However, my biases threw me for way too long: the level of sophistication meant this must be Linux, right? I should have recognized the graphics style in the screen grabs. Certainly not MacOS, but Linux can be all over the map stylistically. Does Windows really still look like that? Wow.
Microsoft has, single-handedly, got two generations of people used to computers working badly, convinced that it's not just unavoidable, but normal. If cars worked as badly, we would all see multiple explosions every day (and think it was awesome).
Changing any single detail gives better results. Use a Samba share from a Linux filesystem. Run Mingw on a Linux system. Run MSVS in Wine on a Linux system.
Windows is an execution environment for applications. There is no need for, and no value in, actually performing builds in your target execution environment. Use a system designed from the ground up for builds.
Citation needed. I haven't worked at Microsoft for a while but when I did we built on Windows using NTFS. When I found a correctness bug in NTFS last year I was told that it had been affecting Windows builds, which means they were using NTFS as recently as February 2018.
The biggest problem I usually encounter with building on Windows is slow process creation.
> no need for, and no value in, actually performing builds in your target execution environment.
.. is completely antithetical to using Visual Studio, where the convenience of building, running and debugging on the desktop is very handy.
NTFS is horribly slow for certain file operations though. Giant batch deletes can wedge the UI while the system catches up. I have in the past benefited from putting %TEMP% on a RAMdisk, although this is a pain to set up.
I'd love to see Microsoft building a VFS driver for, say, ext2/3. It's not impossible, all the APIs are there to add plugin filesystems to Windows, and the OSS-friendly Microsoft shouldn't have any objection in principle to linking against the GPL'd kernel implementation...
1) user space
2) cache manager
3) filesystem driver
4) disk device driver

versus:

1) user space
2) filesystem driver
3) cache manager
4) disk device driver
Just throwing ext2/3 in there wouldn't help because as I said above, it's mainly the system architecture that's screwy.
From what I understood from them, they do not use NTFS for build machines (they use SMB from a clustered filesystem), but they _do_ use a heavily modified version of Windows; incidentally, that modified version went on to become "Windows Nano". What the actual Windows team does is a mystery to me, though; I would assume it is similar or the same.
I really liked the direction with Nano in 2016, but I guess it makes more sense as a container OS. Still, the latest version is what a lot of people wish they could start an operating system with: NT kernel, no WMI, no servicing, no activation.
The build machines are a different story, and I don't know the specifics.