I read the paper, and they make a lot of good points about fork's warts.
But I really wanted some explanation of why Windows process startup seems to be so heavyweight. Why does anything that spawns lots of little independent processes take so bloody long on Windows?
I'm not saying "lots of processes on Windows is slow, lots of processes on Linux is fast, Windows uses CreateProcess, Linux uses fork, CreateProcess is an alternative to fork/exec, therefore fork/exec is better than any alternative." I can imagine all kinds of reasons for the observed behavior, few of which would prove that fork is a good model. But I still want to know what's going on.
I'm a bit rusty on this, but from memory the overhead is by and large specific to the Win32 environment. Creating a "raw" process is cheap and fast (as you'd reasonably expect), but there's a lot of additional initialisation that needs to occur for a "fully-fledged" Win32 process before it can start executing.
Beyond the raw Process and Thread kernel objects, which are represented by EPROCESS + KPROCESS and ETHREAD + KTHREAD structures in kernel address space, a Win32 process also needs to have:
- A PEB (Process Environment Block) structure in its user address space
- An associated CSR_PROCESS structure maintained by Csrss (Win32 subsystem user-mode)
- An associated W32PROCESS structure for Win32k (Win32 subsystem kernel-mode)
I'm pretty sure these days the W32PROCESS structure only gets created on demand with the first creation of a GDI or USER object, so presumably CLI apps don't have to pay that price. Either way, those three structures are non-trivial to set up, and I assume at least the Csrss part involves a context switch (or several). Some steps also manipulate global data structures that block other process creation/destruction (the Csrss steps only?).
I expect all this Win32-specific stuff largely doesn't apply to e.g. the Linux subsystem, so creating processes there should be much faster. The key takeaway is that it's all the Win32 stuff that contributes the bulk of the overhead, not the fundamental process or thread primitives themselves.
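To make the API side concrete, here's a minimal sketch of what spawning a process looks like from user code (the `cmd.exe /c exit` command line is just a placeholder I picked, and error handling is trimmed). All of the setup above happens behind this one call, before the new process executes its first instruction.

```c
// Minimal Win32 process creation sketch (placeholder command line, minimal error handling).
#include <windows.h>
#include <stdio.h>

int main(void) {
    STARTUPINFOA si = { sizeof(si) };
    PROCESS_INFORMATION pi;
    char cmd[] = "cmd.exe /c exit";   // placeholder; CreateProcess may modify this buffer

    // CreateProcess does all the kernel + Win32 setup (EPROCESS/ETHREAD, PEB,
    // Csrss registration, ...) before the child runs its first instruction.
    if (!CreateProcessA(NULL, cmd, NULL, NULL, FALSE, 0, NULL, NULL, &si, &pi)) {
        fprintf(stderr, "CreateProcess failed: %lu\n", GetLastError());
        return 1;
    }
    WaitForSingleObject(pi.hProcess, INFINITE);
    CloseHandle(pi.hThread);
    CloseHandle(pi.hProcess);
    return 0;
}
```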
EDIT: If you want to learn more, Mark Russinovich's Windows Internals has a whole chapter on process creation which I'm sure explains all this.
That was a super interesting read (and view), thank you. I've been in Linux land for almost two decades, but I've also spent a week (or so) porting our Linux-based development environment over to Windows with the help of WSL. This sheds some light on how it actually works. Maybe I'll have to look over it once more armed with this new information and see if I can squash some of those remaining problems with our solution.
> created on-demand with the first creation of a GDI or USER object, so presumably CLI apps don't have to pay that price
This tickles my brain. I read some blog post bitching that because Windows DLLs are kinda heavyweight, it's way too easy to end up paying that price without realizing it.
I used to work on a cross-platform project, and spent several weeks trying to figure out why our application ran significantly faster on linux than windows. One major culprit was process creation (another was file creation). I never really uncovered the true reason, but I suspect it had to do with the large number of DLLs that Windows would automatically link if you weren't very careful. Linux, of course, can also load shared code objects, but in my experience, they are smaller and lighter weight.
This should not be ignored. Windows machines are a favorite for having lots of heavy anti-virus running on them, which can destroy I/O performance. Windows 10 has a "real-time scanner" running by default, but many corporate-IT security teams will add more and more on top. This alone can seriously slow down Windows vs Linux.
> Anti-virus software makes process and file operations a lot slower.
It was a long time ago (~2006), and I honestly can't remember, but I feel like turning off anti-virus (and also backups, software updaters, and any other resident software) would have been one of the first things I would have checked. There was definitely something more fundamental going on.
This probably isn't the technical explanation you're looking for, but, in general, processes on Windows and processes on Unix aren't the same--or, at least, they're not meant to be used the same way. Creating lots of small processes on Windows has long been discouraged and considered poor design, whereas the opposite is true on Unix.
One could probably argue that processes on Windows need to be lighter-weight now that sandboxing is a common security practice. These days, programs like web browsers opt to create a large number of processes both for security and stability purposes. In much the same way that POSIX should deprecate the fork model, Windows should provide lighter-weight processes.
Windows now has minimal processes that have almost no setup and pico processes (based on minimal processes) that are the foundation for Linux processes in WSL.
The last time I used WSL (perhaps 6 months ago), its per-process overhead was awful. I don't recall the numbers, but I think it managed to start fewer than 10 processes per second. My memory suggests it was more like two processes per second, though I would recommend re-testing before trusting that.
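For reference, a rough way to measure spawn rate is a fork+exec loop like this (a sketch, assuming /bin/true is available; not the exact test I ran back then):

```c
// Rough process-spawn-rate benchmark: fork + exec /bin/true n times and time it.
#include <stdio.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

int main(void) {
    const int n = 1000;
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < n; i++) {
        pid_t pid = fork();
        if (pid == 0) {
            execl("/bin/true", "true", (char *)NULL);
            _exit(127);                       // reached only if exec fails
        }
        waitpid(pid, NULL, 0);
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    double secs = (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9;
    printf("%d processes in %.2fs (%.0f/sec)\n", n, secs, n / secs);
    return 0;
}
```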
On 1803 and 1903 it's 3 to 4 times faster than MsysGit (WSL takes ~1s on my laptops). It's possibly slightly faster on 1903, as the laptop running it beats the other on this bench despite having an older processor.
Now in a Linux VM it's approx 10 times faster than even WSL. And that should probably be even faster natively.
So anyway, WSL is really usable, and if you really only started 10 processes per second, something is wrong. Maybe you're using a crappy antivirus (I've heard that Kaspersky makes WSL extremely slow).
Well, I hadn't installed any antivirus myself. I think Windows Defender was running, though. It's possible that my computer came with additional crapware on it.
CreateProcess requires an application to initialize from scratch. When you fork, you cheaply inherit the initialized state of the whole application image. Only a few pages that are mutated have to be subject to copy-on-write. Even that copy-on-write is cheaper than calculating the contents of those pages from scratch.
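As a sketch of what that buys you (the 64 MB buffer here is just a hypothetical stand-in for whatever expensive state the parent has built up):

```c
// fork sketch: the child sees the parent's fully initialized state immediately;
// pages are shared copy-on-write, so only pages that get written are duplicated.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    size_t sz = 64 * 1024 * 1024;      // pretend this is expensive application state
    char *state = malloc(sz);
    memset(state, 42, sz);

    pid_t pid = fork();                // cheap: the 64 MB is not copied here
    if (pid == 0) {
        printf("child sees inherited byte: %d\n", state[0]);
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    free(state);
    return 0;
}
```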
Yeah, it's not really cheap at all. However! vfork() is cheap, very very cheap, though, of course, you then have to follow it up with an exec(), and the cost of that on Windows depends on the setup cost of the executable being exec'ed.
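The usual pattern is something like this (a sketch, with "ls -l" as an arbitrary example; note that after vfork the child is only allowed to exec or _exit):

```c
// vfork + exec sketch: the child borrows the parent's address space until it
// execs, so nothing is copied at all.
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    pid_t pid = vfork();
    if (pid == 0) {
        execlp("ls", "ls", "-l", (char *)NULL);
        _exit(127);                    // only reached if exec fails
    }
    int status;
    waitpid(pid, &status, 0);          // parent resumes once the child has exec'd or exited
    return 0;
}
```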
Part of the problem is the DLLs, as many have mentioned, and also the fact that each statically links in its own CRT (C run-time). The shared C run-time MSFT is working on should help here. As should more lazy loading and setup.
fork is pretty much always going to be cheaper than starting a new process from scratch over the same executable image (and library images) and then re-playing everything inside that process so that it gets into exactly the same state as the creator, making it a de facto clone.
If I had to guess, I'd point to DLLs. The minimal Windows process loads probably half a dozen, plus the entry points are called in a serialized manner.
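A quick way to see what a supposedly minimal process actually pulls in is to enumerate its own loaded modules (a Windows-only sketch; link against psapi):

```c
// List the modules loaded into the current (trivial) process.
#include <windows.h>
#include <psapi.h>
#include <stdio.h>

int main(void) {
    HMODULE mods[256];
    DWORD needed = 0;
    if (EnumProcessModules(GetCurrentProcess(), mods, sizeof(mods), &needed)) {
        DWORD count = needed / sizeof(HMODULE);
        for (DWORD i = 0; i < count; i++) {
            char name[MAX_PATH];
            if (GetModuleFileNameExA(GetCurrentProcess(), mods[i], name, sizeof(name)))
                printf("%s\n", name);   // every one of these ran its DllMain under the loader lock
        }
    }
    return 0;
}
```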