I read the paper, and they make a lot of good points about fork's warts.
But I really wanted some explanation of why Windows process startup seems to be so heavyweight. Why does anything that spawns lots of little independent processes take so bloody long on Windows?
I'm not saying "lots of processes on Windows is slow, lots of processes on Linux is fast, Windows uses CreateProcess, Linux uses fork, CreateProcess is an alternative to fork/exec, therefore fork/exec is better than any alternative." I can imagine all kinds of reasons for the observed behavior, few of which would prove that fork is a good model. But I still want to know what's going on.
I'm a bit rusty on this, but from memory the overhead is by and large specific to the Win32 environment. Creating a "raw" process is cheap and fast (as you'd reasonably expect), but there's a lot of additional initialisation that needs to occur for a "fully-fledged" Win32 process before it can start executing.
Beyond the raw Process and Thread kernel objects, which are represented by EPROCESS + KPROCESS and ETHREAD + KTHREAD structures in kernel address space, a Win32 process also needs to have:
- A PEB (Process Environment Block) structure in its user address space
- An associated CSR_PROCESS structure maintained by Csrss (Win32 subsystem user-mode)
- An associated W32PROCESS structure for Win32k (Win32 subsystem kernel-mode)
I'm pretty sure these days the W32PROCESS structure only gets created on demand with the first creation of a GDI or USER object, so presumably CLI apps don't have to pay that price. Either way, those three structures are non-trivial to set up, and I assume at least the Csrss part involves a context switch (or several). Some steps also manipulate global data structures that block other process creation/destruction (the Csrss steps only?).
I expect all this Win32-specific stuff largely doesn't apply to e.g. the Linux subsystem, so creating processes there should be much faster. The key takeaway is that it's all the Win32 stuff that contributes the bulk of the overhead, not the fundamental process or thread primitives themselves.
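To make the API side concrete, here's a minimal sketch of what spawning a process looks like from user code (the `cmd.exe /c exit` command line is just a placeholder I picked, and error handling is trimmed). All of the setup above happens behind this one call, before the new process executes its first instruction.

```c
// Minimal Win32 process creation sketch (placeholder command line, minimal error handling).
#include <windows.h>
#include <stdio.h>

int main(void) {
    STARTUPINFOA si = { sizeof(si) };
    PROCESS_INFORMATION pi;
    char cmd[] = "cmd.exe /c exit";   // placeholder; CreateProcess may modify this buffer

    // CreateProcess does all the kernel + Win32 setup (EPROCESS/ETHREAD, PEB,
    // Csrss registration, ...) before the child runs its first instruction.
    if (!CreateProcessA(NULL, cmd, NULL, NULL, FALSE, 0, NULL, NULL, &si, &pi)) {
        fprintf(stderr, "CreateProcess failed: %lu\n", GetLastError());
        return 1;
    }
    WaitForSingleObject(pi.hProcess, INFINITE);
    CloseHandle(pi.hThread);
    CloseHandle(pi.hProcess);
    return 0;
}
```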
EDIT: If you want to learn more, Mark Russinovich's Windows Internals has a whole chapter on process creation which I'm sure explains all this.
That was a super interesting read (and view), thank you. I've been in Linux land for almost two decades, but I've also spent a week (or so) porting our Linux-based development environment over to Windows with the help of WSL. This sheds some light on how it actually works. Maybe I'll have to look over it once more armed with this new information and see if I can squash some of those remaining problems with our solution.
> created on-demand with the first creation of a GDI or USER object, so presumably CLI apps don't have to pay that price
This tickles my brain. I read some blog post bitching that because Windows DLLs are kinda heavyweight, it's way too easy to end up paying that price without realizing it.
I used to work on a cross-platform project, and spent several weeks trying to figure out why our application ran significantly faster on linux than windows. One major culprit was process creation (another was file creation). I never really uncovered the true reason, but I suspect it had to do with the large number of DLLs that Windows would automatically link if you weren't very careful. Linux, of course, can also load shared code objects, but in my experience, they are smaller and lighter weight.
This should not be ignored. Windows machines are a favorite for having lots of heavy anti-virus running on them, which can destroy I/O performance. Windows 10 has a "real-time scanner" running by default, but many corporate-IT security teams will add more and more on top. This alone can seriously slow down Windows vs Linux.
> Anti-virus software makes process and file operations a lot slower.
It was a long time ago (~2006), and I honestly can't remember, but I feel like turning off anti-virus (and also backups, software updaters, and any other resident software) would have been one of the first things I would have checked. There was definitely something more fundamental going on.
This probably isn't the technical explanation you're looking for, but, in general, processes on Windows and processes on Unix aren't the same--or, at least, they're not meant to be used the same way. Creating lots of small processes on Windows has long been discouraged and considered poor design, whereas the opposite is true on Unix.
One could probably argue that processes on Windows need to be lighter-weight now that sandboxing is a common security practice. These days, programs like web browsers opt to create a large number of processes both for security and stability purposes. In much the same way that POSIX should deprecate the fork model, Windows should provide lighter-weight processes.
Windows now has minimal processes that have almost no setup and pico processes (based on minimal processes) that are the foundation for Linux processes in WSL.
The last time I used WSL (perhaps 6 months ago), its per-process overhead was awful. I don't recall the numbers, but I think it managed to start fewer than 10 processes per second. My memory suggests it was more like two processes per second, though I would recommend re-testing before trusting that.
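For reference, a rough way to measure spawn rate is a fork+exec loop like this (a sketch, assuming /bin/true is available; not the exact test I ran back then):

```c
// Rough process-spawn-rate benchmark: fork + exec /bin/true n times and time it.
#include <stdio.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

int main(void) {
    const int n = 1000;
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < n; i++) {
        pid_t pid = fork();
        if (pid == 0) {
            execl("/bin/true", "true", (char *)NULL);
            _exit(127);                       // reached only if exec fails
        }
        waitpid(pid, NULL, 0);
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    double secs = (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9;
    printf("%d processes in %.2fs (%.0f/sec)\n", n, secs, n / secs);
    return 0;
}
```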
On 1803 and 1903 it's 3 to 4 times faster than MsysGit (WSL takes ~1s on my laptops). It's possibly slightly faster on 1903, as the laptop running it beats the other on this bench despite having an older processor.
Now in a Linux VM it's approx 10 times faster than even WSL. And that should probably be even faster natively.
So anyway, WSL is really usable, and if you really only started 10 processes per second, something is wrong. Maybe you're using a crappy antivirus (I've heard that Kaspersky makes WSL extremely slow).
Well, I hadn't installed any antivirus myself. I think Windows Defender was running, though. It's possible that my computer came with additional crapware on it.
CreateProcess requires an application to initialize from scratch. When you fork, you cheaply inherit the initialized state of the whole application image. Only a few pages that are mutated have to be subject to copy-on-write. Even that copy-on-write is cheaper than calculating the contents of those pages from scratch.
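As a sketch of what that buys you (the 64 MB buffer here is just a hypothetical stand-in for whatever expensive state the parent has built up):

```c
// fork sketch: the child sees the parent's fully initialized state immediately;
// pages are shared copy-on-write, so only pages that get written are duplicated.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    size_t sz = 64 * 1024 * 1024;      // pretend this is expensive application state
    char *state = malloc(sz);
    memset(state, 42, sz);

    pid_t pid = fork();                // cheap: the 64 MB is not copied here
    if (pid == 0) {
        printf("child sees inherited byte: %d\n", state[0]);
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    free(state);
    return 0;
}
```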
Yeah, it's not really cheap at all. However! vfork() is cheap, very very cheap, though, of course, you then have to follow it up with an exec(), and the cost of that on Windows depends on the setup cost of the executable being exec'ed.
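The usual pattern is something like this (a sketch, with "ls -l" as an arbitrary example; note that after vfork the child is only allowed to exec or _exit):

```c
// vfork + exec sketch: the child borrows the parent's address space until it
// execs, so nothing is copied at all.
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    pid_t pid = vfork();
    if (pid == 0) {
        execlp("ls", "ls", "-l", (char *)NULL);
        _exit(127);                    // only reached if exec fails
    }
    int status;
    waitpid(pid, &status, 0);          // parent resumes once the child has exec'd or exited
    return 0;
}
```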
Part of the problem is the DLLs, as many have mentioned, and also the fact that each statically links in its own CRT (C run-time). The shared C run-time MSFT is working on should help here. As should more lazy loading and setup.
fork is pretty much always going to be cheaper than starting a new process from scratch over the same executable image (and library images) and then re-playing everything inside that process so that it gets into exactly the same state as the creator, making it a de facto clone.
If I had to guess, I'd point to DLLs. The minimal Windows process loads probably half a dozen, plus the entry points are called in a serialized manner.
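A quick way to see what a supposedly minimal process actually pulls in is to enumerate its own loaded modules (a Windows-only sketch; link against psapi):

```c
// List the modules loaded into the current (trivial) process.
#include <windows.h>
#include <psapi.h>
#include <stdio.h>

int main(void) {
    HMODULE mods[256];
    DWORD needed = 0;
    if (EnumProcessModules(GetCurrentProcess(), mods, sizeof(mods), &needed)) {
        DWORD count = needed / sizeof(HMODULE);
        for (DWORD i = 0; i < count; i++) {
            char name[MAX_PATH];
            if (GetModuleFileNameExA(GetCurrentProcess(), mods[i], name, sizeof(name)))
                printf("%s\n", name);   // every one of these ran its DllMain under the loader lock
        }
    }
    return 0;
}
```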