
O(N^2) in CreateProcess - nikbackm
https://randomascii.wordpress.com/2019/04/21/on2-in-createprocess/
======
magicalhippo
One thing I really miss in a lot of library documentation is performance
guarantees similar to that of the C++ standard library.

Yes yes, this method lets me find the index of an element, but is it O(log N)
or O(N)? I need to know, otherwise I might inadvertently create O(N^2) code.
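
As a minimal hypothetical sketch of the trap (not any particular library's
API): if the lookup is O(N) rather than O(log N), calling it once per element
quietly turns the caller into O(N^2).

    #include <algorithm>
    #include <vector>

    // Looks like one innocent loop, but if the lookup itself is a linear
    // scan, the total cost is N scans of N elements each: O(N^2).
    std::vector<std::size_t> indices_of(const std::vector<int>& haystack,
                                        const std::vector<int>& needles) {
        std::vector<std::size_t> out;
        for (int needle : needles) {                  // N iterations...
            auto it = std::find(haystack.begin(),     // ...each an O(N) scan
                                haystack.end(), needle);
            out.push_back(it - haystack.begin());     // == size() if absent
        }
        return out;
    }
    // Building a hash map from value to index once makes the same job O(N).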

~~~
rightbyte
Yes, that would be nice, but O(...) notation alone is not very useful for
choosing algorithms unless you also compare how, e.g., an O(n) algorithm
performs against an O(n^2) one for a given input size.

~~~
fluffything
Complexity bounds do not answer the question of how fast an algorithm or a
particular implementation of it is. I don't think anybody claimed that, and
that was actually not the problem here.

The problem here is that the implementation exploded when passed an input N
times larger in production than what the implementation was tested with.
Complexity bounds would have prevented that problem, because "maybe process
creation shouldn't be O(N^2)".

So sure, choose an O(N^2) algorithm if it's faster than an O(N) one for small
enough N; there is nothing wrong with that. But don't do that if the size of
the input does not depend on you (or fall back to an O(N) algorithm for very
large inputs, as in the sketch below).
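
A minimal sketch of that fallback idea (hypothetical code, with an arbitrary
cutoff; this is essentially what introsort-style library sorts do internally):

    #include <algorithm>
    #include <vector>

    // Insertion sort is O(n^2) but has tiny constants, so it often wins for
    // small n. The cutoff of 32 is an illustrative guess, not a tuned value.
    void hybrid_sort(std::vector<int>& v) {
        const std::size_t kCutoff = 32;
        if (v.size() <= kCutoff) {
            // O(n^2), but n is bounded by kCutoff, so this can't blow up
            for (std::size_t i = 1; i < v.size(); ++i)
                for (std::size_t j = i; j > 0 && v[j] < v[j - 1]; --j)
                    std::swap(v[j], v[j - 1]);
        } else {
            std::sort(v.begin(), v.end());  // O(n log n) for large inputs
        }
    }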

People doing this is why DDoS attacks are a thing: my O(N^3) server was very
fast for 100 connections; who would have thought that someone would DDoS it
with 1000 connections? Yeah, who would have thought: 1000 connections isn't
even a real DoS, but whatever.

~~~
rightbyte
Yes I agree.

What I think is misleading about the general use of O() by programmers is that
it is a theoretical construct: discontinuous effects like cache misses (cache
size) etc. are not taken into consideration, and furthermore the range of
values of n for which each algorithm is fastest is often ignored.

Some dummy numbers:

T1 = 0.0000001 n^2 + n

T2 = 100000000 n

Which would yield: T1 = O(n^2), T2 = O(n).
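
(Setting T1 = T2 gives 0.0000001 n^2 = 99999999 n, i.e. a crossover near n ≈
10^15, so the "worse" O(n^2) algorithm is the faster one for any n below that.)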

Long before n is big enough for a hypothetical server running the O(n^2)
algorithm to be slower than the O(n) one, other effects might cap it.

The best way to find out is empirical testing.

~~~
jnurmine
Isn't the point the order of growth...? If you double the input size, O(N)
will run twice as long, whereas O(N^2) four times as long.

~~~
rightbyte
No, that's only true in the limit as n goes towards infinity.

And even if you see it as an approximation of growth for big n, you need to
know "twice what", "four times what", and "what counts as a big n" to get any
use out of it.

For the example of multiplying two n-digit numbers:

[https://en.wikipedia.org/wiki/Karatsuba_algorithm](https://en.wikipedia.org/wiki/Karatsuba_algorithm)
It's O(n^1.585). You need about 1000 digits before it's actually faster than
"school book" O(n^2) multiplication.

[https://en.wikipedia.org/wiki/Sch%C3%B6nhage%E2%80%93Strasse...](https://en.wikipedia.org/wiki/Sch%C3%B6nhage%E2%80%93Strassen_algorithm)
It's O(n log n log log n). In practice faster than Karatsuba at 10,000 to
40,000 decimal digits.

Then there is the newest and hottest O(n log n) version, that's not practical
at all.

~~~
brucedawson
Sure, there are some cases where O() can be misleading, but those are the
exceptions. You can work in software for a long time without being affected
significantly by them, whereas using O(n^2) when you could use O(n) can make
your code 1,000 times slower than usable, and it does this frequently.

I have fixed hundreds of O(n^2) algorithms over the years, made them O(n) or
O(n log(n)), and made the product dramatically better. The fact that an O(n)
algorithm can sometimes be slower than an O(n^2) one feels more like pedantry
useful information in this context.

------
nonsince
I absolutely love this series

------
brucedawson
Update: Microsoft has built a fix for the issue.

[https://twitter.com/mamyun/status/1120878048620892166](https://twitter.com/mamyun/status/1120878048620892166)

Pretty quick work - two days after the blog post.

------
peter_d_sherman
It looks like the author of this article tested with Windows 7 and Windows 10.

What would be really interesting is if the same tests could be performed on
all of the other, older versions of Windows... why? Well, then you'd know the
version of Windows in which this phenomenon first appears (it could not
possibly have existed in Windows 1.0 and probably didn't show up until several
versions later). That knowledge of where it first appears could be
interesting... That is, I'd love a Raymond Chen deep-dive explanation of the
Microsoft "why" of this phenomenon...

~~~
garaetjjte
It is caused by Control Flow Guard introduced in 8.1, so there's no point
testing other versions.

------
kristianp
This article is about the Control Flow Guard [1] security feature introduced
in Win 8.1. The same blog has a few articles about it. Seems that it wasn't
performance tested against large binaries in its original development.

[1] [https://docs.microsoft.com/en-us/windows/desktop/secbp/contr...](https://docs.microsoft.com/en-us/windows/desktop/secbp/control-flow-guard)

~~~
aneutron
Seeing how many O(n^2) algorithms he has found, I think you could've stopped
your sentence at "tested".

------
charleslmunger
It's very easy to accidentally create O(n^2) code. For example, take a look at
the test case added here:

[https://github.com/google/guava/commit/4e41d621c3f413eb31da2...](https://github.com/google/guava/commit/4e41d621c3f413eb31da2d31b6c9d9713cd851f6)
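
A hypothetical C++ flavor of the same trap (an illustration, not the Guava
test case above):

    #include <string>
    #include <vector>

    // Each `s = s + piece` copies the whole accumulated string, so n appends
    // cost 1 + 2 + ... + n character copies in total: O(n^2).
    std::string join_slow(const std::vector<std::string>& pieces) {
        std::string s;
        for (const auto& piece : pieces)
            s = s + piece;   // full copy every iteration
        return s;
    }

    // The fix: append in place; geometric buffer growth makes this O(n).
    std::string join_fast(const std::vector<std::string>& pieces) {
        std::string s;
        for (const auto& piece : pieces)
            s += piece;
        return s;
    }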

------
varelaz
Why would someone need 1000 processes created on one server? I've worked with
IO- and CPU-bound tasks and never needed more than ~50 processes, even on very
big servers. Is there any practical reason for this (besides pure interest)?

~~~
seba_dos1
The article literally answers that. It's very common in unit testing and build
infrastructure, especially when ported from POSIX systems.

~~~
varelaz
Sorry, I thought the problem was with executing thousands of processes in
parallel. The problem here is with some remnants from previous process runs.

~~~
seba_dos1
...or with the fact that they're constantly being created and destroyed as
some tests/build tasks finish and the new ones are spawned.

------
w8rbt
O(n^2) is polynomial and perfectly acceptable for many use cases. Basic
multiplication is O(n^2), as are many other operations.

    
    
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j)
                do_something(i, j);  // placeholder body; runs n * n times
    

Although, if the outer loop's bound differs in size from the inner one's, it
should be written O(nm).

~~~
Sharlin
In this case it doesn't seem to be. Although admittedly executables as large
as in the article are rare, process creation on Windows is slow enough as it
is, without extra inefficiencies.

As an aside, basic multiplication is O(n^2) only if your number type is
arbitrarily wide. Standard 32-bit or 64-bit multiplication as implemented in
hardware is constant time.
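
For contrast, here is a small illustrative sketch (mine, not from the thread)
of what the O(n^2) claim refers to: multiplying arbitrary-width numbers stored
as digit arrays, where every digit of one operand meets every digit of the
other.

    #include <vector>

    // Schoolbook multiplication over base-10 digit vectors (least significant
    // digit first): n*m single-digit products, i.e. O(n^2) for equal lengths.
    std::vector<int> schoolbook_mul(const std::vector<int>& a,
                                    const std::vector<int>& b) {
        std::vector<int> r(a.size() + b.size(), 0);
        for (std::size_t i = 0; i < a.size(); ++i)
            for (std::size_t j = 0; j < b.size(); ++j)
                r[i + j] += a[i] * b[j];        // every digit pair multiplies
        for (std::size_t k = 0; k + 1 < r.size(); ++k) {
            r[k + 1] += r[k] / 10;              // propagate carries
            r[k] %= 10;
        }
        return r;
    }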

~~~
rightbyte
It's tautological to state that multiplication is constant time for a fixed
number of bits.

Of course it is. The n is fixed.

Also, as a side note, n^2 is not the lowest-order algorithm for multiplication.

~~~
enedil
No, what is meant by that is that multiplying 2 by 3 takes about the same time
as multiplying 432434232 by 1213213213. Not merely bounded time, but about the
same time.

~~~
w8rbt
What's the runtime of 75^4096 mod 236?

~~~
jerf
That sounds like a problem that is not described by "Standard 32-bit or 64-bit
multiplication as implemented in hardware".

------
rightbyte
Maybe he should have split his 120 MB executable into more executables ... or
used a sane number of processes.

It would be bad to have the OS cover this use case by default.

~~~
AstralStorm
You do understand that Windows creates about 1000 processes on boot alone? A
clean start?

(Of those, about 100 survive as services.)

In short, this flaw adds as much as 10 seconds to boot time? (Sort of hidden
on most hardware by parallelism, but if you have something not super recent,
well...)

~~~
rightbyte
Yes, but as I understand it he creates 1000 processes of the same executable,
with different arguments to run different tests. With a really big executable.

That's not the same thing as it being slow to create normal-sized processes.

~~~
mort96
That's not out of the ordinary though?

Any time you compile C or C++ code, you're creating a process from the same
GCC/Clang/whatever executable but with different arguments for each source
file. For a big project, it's not uncommon for there to be thousands or tens
of thousands of small source files. Creating and maintaining processes is the
core responsibility of an operating system.

~~~
rightbyte
gcc is like 1 MB. It would be interesting if he ran the same test on Debian or
something with default options.

I would personally link the tests in as a shared lib or something. chrome.exe
on my laptop is around 1 MB. That would probably speed up his tests more than
his hacks to the validation routine.

~~~
brucedawson
I have been told that moving the meat of unit_tests.exe to a DLL would avoid
the repeated CFG initialization costs (they would be paid on the first
CreateProcess call and then the CFG data would be shared). This wouldn't speed
up the tests "more" than my change did (it would speed up the tests slightly
less, actually), but it would be a possible fix.

But, such a change would be more work. And, honestly, it shouldn't be
necessary. Windows could avoid this problem by using an O(n) algorithm for the
initialization, or by sharing the CFG data between .exe files as well as .dll
files. I am content with my current workaround and I'll consider reverting it
when the underlying OS issue is fixed.

Note that the chrome executable on Linux contains all of the code, so it is
50-100 MB. It is only an accident of history that chrome.exe on Windows puts
all the code in chrome.dll/chrome_child.dll.

