
A Not-Called Function Can Cause a 5X Slowdown - deafcalculus
https://randomascii.wordpress.com/2018/12/03/a-not-called-function-can-cause-a-5x-slowdown/
======
CoolGuySteve
Windows has a real problem with huge monolithic dlls that inexplicably pull in
other huge monolithic dlls, especially when each DllMain() has its own weird
side effects. It leads to bizarre behavior like this all the time.

The weirdest thing that ever happened to me was when Visual Studio 2012 hung
in the installer. After debugging with an older VS, it turned out the
installer rendered some progress bar with Silverlight, which was hung on audio
initialization, which was hung on a file system driver, which was hung on a
shitty Apple-provided HFS driver. Uninstalling HFS fixed the installer.

Why does an installer even need audio when it never played sound? Because
Microsoft dependencies are fucked.

~~~
juliangoldsmith
I've never gotten the hate for static linking, which would avoid issues like
this. You'd always have the version of your dependencies that you're
expecting.

Also, dynamic linking completely neuters LTO. There's not much point to
(theoretically) saving a few megabytes of RAM when you're pulling in twice as
much unused code.

~~~
cesarb
> I've never gotten the hate for static linking

I believe most of the dislike for static linking can be traced to a single
incident: a really bad (as in "remote code execution" bad) vulnerability in
zlib, CVE-2002-0059. Back then, it was common to statically link to zlib (and
zlib is a very popular library, the DEFLATE algorithm it implements being the
"standard" compression algorithm), so instead of just replacing a single
dynamic library and rebooting, everything had to be audited for the presence
of embedded copies of zlib.

Quoting from a message from a few years later
([http://mail.openjdk.java.net/pipermail/build-dev/2007-May/00...](http://mail.openjdk.java.net/pipermail/build-dev/2007-May/000055.html)):

"[...] Updating the main system zlib package was easy, but finding all the
embedded copies was a nightmare. I think we ended up grepping every binary and
library in every package for symbols and other bits of code that looked like
zlib. All the different versions involved compounded this. And when everything
is found and patched and built and tested there's the cost of bandwidth to
distribute all the extra stuff. So when people talk of removing static
libraries they're talking about real costs -- time and money. After zlib there
was a definite feeling of "never again"."
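
For illustration, the kind of scan described above boils down to something
like this minimal sketch. It assumes the embedded copies still carry the
version banner that zlib's deflate.c compiles in (a string like " deflate
1.2.x Copyright ... Jean-loup Gailly "); a real audit would also inspect
symbol tables:

    // Sketch: flag files that appear to contain an embedded copy of zlib
    // by searching for the version banner compiled into deflate.c.
    // Illustrative only; a real audit would also check symbol tables.
    #include <filesystem>
    #include <fstream>
    #include <iostream>
    #include <iterator>
    #include <string>
    
    namespace fs = std::filesystem;
    
    static bool contains(const fs::path &p, const std::string &needle) {
        std::ifstream in(p, std::ios::binary);
        if (!in) return false;
        std::string data((std::istreambuf_iterator<char>(in)),
                         std::istreambuf_iterator<char>());
        return data.find(needle) != std::string::npos;
    }
    
    int main(int argc, char **argv) {
        const std::string banner = " deflate 1.";  // matches any 1.x copy
        const fs::path root = argc > 1 ? argv[1] : "/usr";
        std::error_code ec;
        for (auto it = fs::recursive_directory_iterator(root, ec);
             it != fs::recursive_directory_iterator(); it.increment(ec)) {
            if (ec) break;  // skip unreadable subtrees; keep the sketch simple
            if (it->is_regular_file(ec) && contains(it->path(), banner))
                std::cout << it->path() << '\n';  // candidate embedded zlib
        }
    }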

~~~
TFortunato
Unfortunately, this is conflating static linking with bad dependency
management. ("Just copy it into your own repo" is a separate step, and it
isn't required in order to do the second step, "and link it statically.")

There is no reason I see why you couldn't just have a build that produces the
library .a file for zlib, which can then be pulled in as a dependency of your
build / linked in statically.

I totally agree, they had a nightmare situation on their hands, but I don't
think static linking was solely to blame :-)

~~~
tedunangst
I have 100s of binaries on my system. Which ones do I need to relink?

~~~
juliangoldsmith
You can write a script to look at package files' makedepends, and rebuild
everything that uses the offending library.
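
As a rough illustration (assuming an Arch-style tree of package directories,
each containing a PKGBUILD with a makedepends=(...) array; the regex is a
stand-in, not a real shell parser), such a script could look like:

    // Sketch: list packages whose PKGBUILD names a given library in
    // makedepends=(...) -- candidates for rebuilding after a library fix.
    #include <filesystem>
    #include <fstream>
    #include <iostream>
    #include <regex>
    #include <sstream>
    #include <string>
    
    namespace fs = std::filesystem;
    
    int main(int argc, char **argv) {
        if (argc != 3) {
            std::cerr << "usage: " << argv[0] << " <pkgbuild-tree> <library>\n";
            return 1;
        }
        const std::string lib = argv[2];
        const std::regex makedeps(R"(makedepends=\(([^)]*)\))");
        std::error_code ec;
        for (auto it = fs::recursive_directory_iterator(argv[1], ec);
             it != fs::recursive_directory_iterator(); it.increment(ec)) {
            if (ec) break;
            if (it->path().filename() != "PKGBUILD") continue;
            std::ifstream in(it->path());
            std::stringstream ss;
            ss << in.rdbuf();
            const std::string text = ss.str();
            std::smatch m;
            if (std::regex_search(text, m, makedeps) &&
                m[1].str().find(lib) != std::string::npos)
                std::cout << it->path().parent_path().filename().string() << '\n';
        }
    }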

------
csours
Great post, but it made me sad because of how people limit themselves when it
comes to tests.

Tests are treated as the thing you do after the code works, maybe, if you have
time. Managers generally only care about hitting a coverage percentage.
Colleagues dodge and avoid them. I don't remember any awards for tests or
testers.

But tests are so powerful and so cheap...

~~~
vmchale
> But tests are so powerful and so cheap...

Most evidence suggests the opposite, though tests have gotten better over time.

~~~
fhood
Yeah, I firmly support testing, but if the code you are working on isn't
explicitly written to make writing tests easier, then odds are (in my
experience) that writing the tests will take longer than writing the code.

~~~
dylan604
I have been known to write shitty code, but am always looking to make the next
set of code less shitty, or even to redo something I'm working on if time
allows. I'm getting better, but have a long way to go. I have a feeling that
my code would be bad to write tests against. What would making code easier to
test entail? Small bits of code, meant to be included, that do one specific
thing? Wrapping that included code into functions and/or objects and then
creating a test suite to hit all the methods, etc.?

~~~
virgilp
I'm not a real proponent of TDD, but... try writing some tests first, when you
develop a new feature. Or at least, write them sooner - before your
implementation is "done"; write them when you think you figured out the
design. See, tests are a new use of your API/interfaces - if you find it hard
to test stuff that's important, maybe you didn't model the problem right?

Rules of thumb:

\- Don't test implementation, test business workflows ("functional/component
tests" are the most important; unit tests are sometimes/often nice, but don't
overdo it! If a simple refactoring breaks lots of tests, you're doing it wrong
- testing implementation details, not business logic); e2e tests are good and
required, but they are often slow, and when they fail they don't always
isolate the problem very well

\- Seek a functional coding style; once you get used to it, you'll find it is
easier to test and easier to reason about (no state means you just test the
logic, and it's easy to unit-test too)

\- Largely ignore code coverage (use it as a guideline to see whether there
are important parts of your app that you ignored, or whether you forgot to add
tests for some business workflows / corner cases).

\- Avoid test hacks like calling private methods via reflection or whatnot.
Remember, tests exercise your APIs - you either have the APIs wrong, or you're
trying to test irrelevant implementation details.

\- Look for invariants, and test those. Things that should always be true.
Often there are multiple acceptable results - avoid exact-match tests when
that happens (e.g. if you make a speech-to-text system, don't test that
audio clip X produces exact output Y; often a genuine improvement in the
algorithm might break some of your tests). A short sketch after this list
illustrates the idea.
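
To make that last rule concrete, here's a minimal sketch (the record type and
comparator are made up for illustration): std::sort promises nothing about the
relative order of records with equal keys, so an exact-match test on the
output could break under a perfectly valid implementation change, while these
invariants hold for any correct result.

    // Sketch: assert invariants of sorting rather than one exact output.
    #include <algorithm>
    #include <cassert>
    #include <vector>
    
    struct Record { int key; int payload; };
    
    int main() {
        const std::vector<Record> input = {{2, 10}, {1, 11}, {2, 12}, {1, 13}};
        auto output = input;
        auto by_key = [](const Record &a, const Record &b) { return a.key < b.key; };
        std::sort(output.begin(), output.end(), by_key);
    
        // Invariant 1: the output is non-decreasing by key.
        assert(std::is_sorted(output.begin(), output.end(), by_key));
    
        // Invariant 2: the output is a permutation of the input.
        auto by_all = [](const Record &a, const Record &b) {
            return a.key != b.key ? a.key < b.key : a.payload < b.payload;
        };
        auto a = input, b = output;
        std::sort(a.begin(), a.end(), by_all);
        std::sort(b.begin(), b.end(), by_all);
        assert(std::equal(a.begin(), a.end(), b.begin(),
                          [](const Record &x, const Record &y) {
                              return x.key == y.key && x.payload == y.payload;
                          }));
        return 0;
    }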

TBH I think correct testing is much harder than the industry gives it credit
for. Maybe that's why it's so rarely encountered in practice :)

~~~
jdmichal
I'll take issue with your last point:

"Look for invariants, and test those. Things that should always be true. Often
times, there are multiple acceptable results - avoid exact-match tests when
that happens (e.g. if you make a speech-to-text system, don't test that audio
clip X produces exact output Y; often times, a genuine improvement in the
algorithm might break some of your tests)."

No, you should test for exactly the output you have coded to generate.
Otherwise, you do not know when you have a behavior regression. You would
expect to have to update the speech-to-text tests when you modify the
speech-to-text algorithm. But if you're modifying another algorithm, and you
start seeing tests break in the speech-to-text code, you're probably
introducing a bug!

A failed test means nothing other than the fact that you have changed
behaviour -- and should therefore trigger on any behavioural change. It's your
opportunity to vet the expectations of your changes against the actual
behaviour of the changed system.

~~~
virgilp
I respectfully disagree. A test should fail, ideally, only when behavior
change is undesirable (i.e. contracts are broken). Optimizations, new features
etc. should not break existing tests, unless old functionality was affected.
And then there's the whole thing about separating functional from performance
concerns - even degraded performance shouldn't fail the functional tests.

In fact, the example I gave was real-life - a friend from Google changed their
speech recognition tests to avoid exact matches and it was a significant
improvement in the life & productivity of the development team.

[edit] There's also another damaging aspect of exact-match tests: they often
test much more than what's intended. Take, for instance, file conversion
software (say, from PDF to HTML). You add a feature to support SVG, and test
it with various SVG shapes - it's easy & tempting to just run the software on
an input PDF, check that the output HTML looks right (especially in the
relevant SVG parts), and then add an exact-match test. Job done, yay! Except
that you do this a lot, and it will slow you down like hell. Because when a
test fails in the future, it's very hard to tell why (was the SVG conversion
broken? or is it some unrelated thing, like a different but valid way to
produce the output HTML?). Do this a lot and you won't be able to trust your
tests anymore - any change and 400 of them fail, ain't nobody got time to
check in depth what happened, "it's probably just a harmless change, let me
take a cursory look and then I'll just update the reference to be the new
output".

~~~
jdmichal
You're building a bit of a straw man. If you have 400 tests that fail with a
single behavioural change, why are you testing the same thing 400 times? And
you don't need an in-depth investigation unless you didn't _expect_ the test
to break. And if you did expect the test to break, then you ensure that the
test broke in the _correct_ place. If a cursory glance is all you need in
order to confirm that, then that's all you need. Tests are there to tell you
exactly what _actually_ changed in behaviour. The only time this should be a
surprise is if you don't have a functional mental model of your code, in which
case it's _doubly_ important that you be made aware of what your changes are
actually doing.

In your Google example, would their tests fail if their algorithm regressed in
behaviour? If they don't fail on minor improvements, I don't see how they
would fail on minor regressions either.

~~~
virgilp
400 is an arbitrary number, but it's what sometimes (often?) happens with
exact-match tests; take the second example with the PDF-to-HTML converter: an
exact-match test would test too much, and thus your SVG tests will fail when
nothing SVG-specific changed (maybe the way you rendered the HTML header
changed). Or maybe you changed the order your HTML renderer uses to render
child nodes, and it's still a valid order in 99% of your cases, but it breaks
100% of your tests. How do you identify the 1% that are broken? It's very hard
if your tests just do exact textual comparison, instead of verifying isolated,
relevant properties of interest.

In my Google example, the problem is that functional tests were testing
something that should've been a performance aspect. The way you identify minor
regressions is by having a suite of performance/accuracy tests, where you
track that accuracy is trending upwards across various classes of input. Those
are not functional tests - any individual sample may fail, and it's not a big
deal if it does. Sometimes a minor regression is actually acceptable (e.g. if
the runtime performance / resource consumption improved a lot).

~~~
jdmichal
> It's very hard if your tests just do exact textual comparison, instead of
> verifying isolated, relevant properties of interest.

I think you're working from an assumption you never actually specified: that
exact-match testing means testing for an exact match on the _entire_ payload.
That's a strawman, and yes, with it you will have issues exactly like you
describe.

If your test is only meant to cover the SVG translation, then you should be
isolating the SVG-specific portion of the payload. But then execute an exact
match on that isolated translation. Now that test only breaks in two ways: It
fails to isolate the SVG, or the SVG translation behaviour changes.
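
Something like this minimal sketch, for instance (run_converter and the
extraction helper are hypothetical stand-ins, not part of any real converter):

    // Sketch of "isolate, then exact-match": pull only the <svg> element
    // out of the converter's output and compare that fragment exactly.
    #include <cassert>
    #include <string>
    
    std::string run_converter(const std::string & /*pdf*/) {
        // Stand-in output; in reality this would invoke the converter.
        return "<html><body><p>hdr</p>"
               "<svg><rect width=\"10\" height=\"10\"/></svg>"
               "</body></html>";
    }
    
    std::string extract_svg(const std::string &html) {
        const auto begin = html.find("<svg");
        const auto end = html.find("</svg>", begin);
        assert(begin != std::string::npos && end != std::string::npos);
        return html.substr(begin, end + 6 - begin);  // include "</svg>"
    }
    
    int main() {
        const std::string svg = extract_svg(run_converter("shapes.pdf"));
        // Exact match on the isolated fragment only: changes to headers,
        // node ordering elsewhere, etc. cannot break this test.
        assert(svg == "<svg><rect width=\"10\" height=\"10\"/></svg>");
        return 0;
    }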

> In my Google example, the problem is that functional tests were testing
> something that should've been a performance aspect. The way you identify
> minor regressions is by having a suite of performance/ accuracy tests, where
> you track that accuracy is trending upwards across various classes of input.
> Those are not functional tests - any individual sample may fail and it's not
> a big deal if it does. Sometimes a minor regression is actually acceptable
> (e.g. if the runtime performance/ resource consumption improved a lot).

... "Accuracy", aka the _output_ of your functionality is a _non-functional_
test? What?

And I never said regressions aren't acceptable. I said that you should know
via your test suite that the regression happened! You are phrasing it as a
trade-off, but also apparently advocating an approach where _you don't even
know_ about the regression! It's not a trade-off if you are just straight-up
unaware that there are downsides.

~~~
virgilp
> That's a strawman

It wasn't intended to be; yes, that's what I meant: don't check the full
output, check the relevant sub-section. Plus, don't check for order in the
output when order doesn't matter, accept slight variation when it is
acceptable (e.g. values resulting from floating-point computations), etc.
Don't just blindly compare against a textual reference, unless you actually
expect that exact textual reference and nothing else will do.

> "Accuracy", aka the output of your functionality is a non-functional test?
> What?

Don't act so surprised. Plenty of products have non-100% accuracy; speech
recognition is one of them. If the output of your product is not expected to
have perfect accuracy, I claim it's not reasonable to test that full output
and expect perfect accuracy (as functional tests do). Either test something
else that does have perfect accuracy, or make the test a "performance test",
where you monitor the accuracy but don't enforce perfection.

> And I never said regressions aren't acceptable.

Maybe you didn't, but I do. I'm not advocating that you don't know about the
regression at all. Take my example with speech - you made the algorithm run
10x faster, and now 3 results out of 500 are failing. You deem this to be
acceptable, and want to release to production. What do you do?

A. Go on with a red build?

B. "Fix" the tests so that the build becomes green, even though the sound clip
that said "Testing is good" is now producing the textual output "Texting is
good"?

I claim both A & B are wrong approaches. "Accuracy" is a performance aspect of
your product and, as such, shouldn't be tested as part of the functional
tests. That doesn't mean you don't test for accuracy - just like it shouldn't
mean that you don't test for other performance regressions. Especially so if
they are critical aspects of your product or part of your marketing strategy!
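
A minimal sketch of that accuracy-as-a-metric style (recognize() and the
corpus are hypothetical stubs standing in for the real system):

    // Sketch: gate on aggregate accuracy over a corpus instead of
    // demanding an exact transcript for every clip. Individual misses
    // are tolerated; dropping below the tracked threshold is not.
    #include <cassert>
    #include <string>
    #include <utility>
    #include <vector>
    
    // Hypothetical recognizer under test, stubbed for illustration.
    static std::string recognize(const std::string &clip) { return clip; }
    
    int main() {
        // Hypothetical (clip, expected transcript) pairs; in reality this
        // would be the 500-sample corpus from the example above.
        const std::vector<std::pair<std::string, std::string>> corpus = {
            {"clip1", "clip1"}, {"clip2", "clip2"}, {"clip3", "clip3"},
        };
        int hits = 0;
        for (const auto &sample : corpus)
            hits += (recognize(sample.first) == sample.second);
    
        const double accuracy = static_cast<double>(hits) / corpus.size();
        assert(accuracy >= 0.97);  // e.g. at least 485 of 500 must match
        return 0;
    }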

~~~
jdmichal
OK, I'm caught up with you now. Yes, I agree with this approach _in such a
scenario_. I would just caution against throwing out stuff like that as a
casual note regarding testing without any context, like you did. Examples like
this should be limited to _non-functional_ testing, aka metrics, which was not
called out at all originally. And it's a cool idea to run a bunch of canned
data through a system to collect metrics as part of an automated test suite!

------
wmu
Bruce's write-ups are just great, a sheer joy to read.

~~~
drudru11
True - and they also remind me why I will stay away from Windows

~~~
rincebrain
I regret to inform you that, while I think the current Windows development
methodology could use better testing (to put it mildly), things like [1][2]
still crop up on other platforms.

[1] -
[https://bugzilla.kernel.org/show_bug.cgi?id=201685](https://bugzilla.kernel.org/show_bug.cgi?id=201685)

[2] -
[http://lkml.iu.edu/hypermail/linux/kernel/1811.2/01328.html](http://lkml.iu.edu/hypermail/linux/kernel/1811.2/01328.html)

~~~
Valmar
[2] was resolved quickly. A much-revised version has landed recently.

[1] is a strange bug, because the devs have consistently been unable to
reproduce it, despite constantly looking over the issue. Users of ZFS have
also hit problems, suggesting that it is not an ext4 bug, but a very subtle
problem elsewhere in the block subsystem.

~~~
rincebrain
[2] was resolved after it landed in stable release branches, which is a bit
late for how much impact it had.

[1] was, in fact, root-caused to a blk-mq bug.

[https://patchwork.kernel.org/patch/10712695/](https://patchwork.kernel.org/patch/10712695/)

------
rkagerer
This reminds me of a desktop heap exhaustion problem IE would regularly
trigger for me back in the XP days:

[https://weblogs.asp.net/kdente/148145](https://weblogs.asp.net/kdente/148145)

It all came down to a registry setting that MS neglected to bump up much from
the original Win 98 defaults. IIRC the conservative default even persisted
into Win 7.

That 3MB limit would bring down my 48GB system...

------
pitterpatter
This bug is fixed in the latest Insider builds, at least.

Using the author's own testing tool:

With the Spring 2018 release:

    
    
        F:\tmp>.\ProcessCreatetests.exe
        Main process pid is 46940.
        Testing with 1000 descendant processes.
        Process creation took 2.309 s (2.309 ms per process).
        Lock blocked for 0.003 s total, maximum was 0.000 s.
        Average block time was 0.000 s.
    
        Process termination starts now.
        Process destruction took 0.656 s (0.656 ms per process).
        Lock blocked for 0.001 s total, maximum was 0.000 s.
        Average block time was 0.000 s.
    
        Elapsed uptime is 7.08 days.
        Awake uptime is 7.08 days.
    
        F:\tmp>.\ProcessCreatetests.exe -user32
        Main process pid is 44584.
        Testing with 1000 descendant processes with user32.dll loaded.
        Process creation took 2.624 s (2.624 ms per process).
        Lock blocked for 0.014 s total, maximum was 0.001 s.
        Average block time was 0.000 s.
    
        Process termination starts now.
        Process destruction took 1.617 s (1.617 ms per process).
        Lock blocked for 1.122 s total, maximum was 0.648 s.
        Average block time was 0.026 s.
    
        Elapsed uptime is 7.08 days.
        Awake uptime is 7.08 days.
    

With an insider build:

    
    
        C:\tmp>.\ProcessCreatetests.exe
        Main process pid is 9928.
        Testing with 1000 descendant processes.
        Process creation took 2.440 s (2.440 ms per process).
        Lock blocked for 0.003 s total, maximum was 0.002 s.
        Average block time was 0.000 s.
    
        Process termination starts now.
        Process destruction took 1.306 s (1.306 ms per process).
        Lock blocked for 0.003 s total, maximum was 0.001 s.
        Average block time was 0.000 s.
    
        Elapsed uptime is 4.78 days.
        Awake uptime is 3.93 days.
    
        C:\tmp>.\ProcessCreatetests.exe -user32
        Main process pid is 14144.
        Testing with 1000 descendant processes with user32.dll loaded.
        Process creation took 4.756 s (4.756 ms per process).
        Lock blocked for 0.022 s total, maximum was 0.004 s.
        Average block time was 0.000 s.
    
        Process termination starts now.
        Process destruction took 1.823 s (1.823 ms per process).
        Lock blocked for 0.003 s total, maximum was 0.001 s.
        Average block time was 0.000 s.
    
        Elapsed uptime is 4.78 days.
        Awake uptime is 3.93 days.
    
    

There's no longer a difference in lock blocked time during process destruction
whether or not you load user32. Nor does the very obvious mouse stuttering
happen anymore.

~~~
brucedawson
Woah! That is fascinating. I had heard nothing about this. Your uptime is a
bit shorter on the insider build but the change in lock blocking is too
dramatic to be explained by that.

I notice that all of the elapsed times are worse on the insider build - is
that perhaps a slower machine? And are there enough CPUs on that machine to
trigger the bug? That is, I'd like to believe that the bug is fixed but I'm
skeptical.

~~~
pitterpatter
True, the above comparison might not have been the most scientific :P The
Spring 2018 results were run on a much more powerful desktop than the Surface
Pro 3 used for the insider results.

Here are the results for the April 2018 Update, rerun on the same Surface for
a more apples-to-apples comparison:

    
    
        C:\tmp>.\ProcessCreatetests.exe
        Main process pid is 6448.
        Testing with 1000 descendant processes.
        Process creation took 4.382 s (4.382 ms per process).
        Lock blocked for 0.007 s total, maximum was 0.000 s.
        Average block time was 0.000 s.
    
        Process termination starts now.
        Process destruction took 0.592 s (0.592 ms per process).
        Lock blocked for 0.002 s total, maximum was 0.002 s.
        Average block time was 0.000 s.
    
        Elapsed uptime is 0.01 days.
        Awake uptime is 0.01 days.
    
        C:\tmp>.\ProcessCreatetests.exe -user32
        Main process pid is 11364.
        Testing with 1000 descendant processes with user32.dll loaded.
        Process creation took 4.707 s (4.707 ms per process).
        Lock blocked for 0.009 s total, maximum was 0.000 s.
        Average block time was 0.000 s.
    
        Process termination starts now.
        Process destruction took 1.248 s (1.248 ms per process).
        Lock blocked for 0.904 s total, maximum was 0.902 s.
        Average block time was 0.181 s.
    
        Elapsed uptime is 0.01 days.
        Awake uptime is 0.01 days.
    
    

The mouse movement hanging behaviour is easily evident with the April 2018
release. I didn't notice the same on the insider build.

------
markpapadakis
Great read, but what's up with the 6 banner ads interspersed in the content
page? Maybe a few too many?

~~~
pwg
With NoScript blocking execution of javascript, there are zero banner ads
interspersed in the content page.

------
mehrdadn
Also good to know: I seem to recall similar, if not worse, slowdowns happening
if you try to pull in Windows Sockets (WS2_32.dll).

------
userbinator
Unless the compiler can _prove_ you're never going to run into that case, it
can't remove the call; and because the call is to an imported function, it
still has to create the import and have an entry in the IAT for it, so it
needs to be resolved at load time. Not all that surprising, IMHO.
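
A minimal sketch of the situation (hypothetical program: the branch is never
taken at runtime, but the compiler can't prove that, so the reference alone
puts shell32.dll in the import table and the loader resolves it at startup;
MSVC's /DELAYLOAD:shell32.dll is the standard way to defer that work):

    // Sketch: a statically unreachable-in-practice call still creates an
    // import. shell32.dll gets mapped at process start unless the binary
    // is linked with /DELAYLOAD:shell32.dll (plus delayimp.lib).
    #include <windows.h>
    #include <shellapi.h>  // CommandLineToArgvW lives in shell32.dll
    
    #pragma comment(lib, "shell32.lib")
    
    int wmain(int argc, wchar_t **argv)
    {
        (void)argv;
        if (argc > 1000000) {  // never true for any real command line
            int n = 0;
            LPWSTR *args = CommandLineToArgvW(GetCommandLineW(), &n);
            LocalFree(args);
        }
        return 0;
    }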

~~~
ascar
The author even said "we immediately knew what to do", which is kind of the
opposite of surprising.

The interesting bit is not that a slow-loading dependency got imported anyway,
but why that dependency is slow, and how easily it can get imported
indirectly.

------
polskibus
Is there a way to tell whether a .net program is also affected by this
behavior?

~~~
mark-r
I just checked a C# app I had lying around with depends, and it depends on
user32.dll, which depends on gdi32.dll.

It's hard to imagine a Windows program that _wouldn't_ depend on one of those
critical DLLs. The only thing that saves us is that we don't often create and
destroy hundreds of processes at a time.

------
chii
Wow, how come the mere presence of a function causes a DLL to get loaded? Is
it because, in order to compile, the DLL (or its export definition) needs to
be present, and the compiler does some magic because of that?

~~~
_wmd
Lazy binding isn't without downsides: it requires internal synchronization of
its own, which means it's possible to write multithreaded programs that will
suffer latencies due to lock contention during symbol resolution. Depending on
the OS (not sure this applies to Windows), it can also mean that what used to
be fatal startup errors are delayed long into the process's life.
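
As a rough POSIX illustration of that resolution cost, this uses explicit
dlopen/dlsym, which does roughly the same symbol lookup that lazy PLT binding
performs on a symbol's first call (and where the locking happens):

    // Sketch: explicit symbol resolution vs. calls through the resolved
    // pointer. Link with -ldl on older glibc; illustrative timing only.
    #include <chrono>
    #include <dlfcn.h>
    #include <iostream>
    
    int main() {
        void *libm = dlopen("libm.so.6", RTLD_LAZY);
        if (!libm) { std::cerr << dlerror() << '\n'; return 1; }
    
        using clk = std::chrono::steady_clock;
        auto t0 = clk::now();
        auto cosine = reinterpret_cast<double (*)(double)>(dlsym(libm, "cos"));
        auto t1 = clk::now();
        double x = cosine(1.0);  // subsequent calls skip the lookup
        auto t2 = clk::now();
    
        using ns = std::chrono::nanoseconds;
        std::cout << "resolve: "
                  << std::chrono::duration_cast<ns>(t1 - t0).count()
                  << " ns, call: "
                  << std::chrono::duration_cast<ns>(t2 - t1).count()
                  << " ns (cos(1.0) = " << x << ")\n";
        dlclose(libm);
        return 0;
    }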

------
gok
The title is really misleading. "Dependencies cause overhead even when
unneeded" would be more accurate.

~~~
klodolph
That strips out the interesting part, though. A 5X slowdown for an actual,
useful, real-world project is interesting. Vague “overhead” could mean nothing
more than some bigger binaries.

------
gcb0
tl;dr: the method's presence pulls in a dependency that runs slow, buggy code.

So, clickbait.

~~~
pjc50
This is a stupid dismissal of a well-written piece of investigative debugging.
This article is the kind of thing I'd like to see more of on HN.

~~~
jackewiehose
I also liked the article but to be fair, the title is actually a little
clickbaity. The slowdown has nothing to do with not-called functions, it's
just about DLL-dependencies.

~~~
mark-r
Not clickbaity at all; it's the essence of what makes the problem interesting.
The fact that a DLL dependency can slow down your program at shutdown is not
at all intuitive, particularly when it's a system DLL that should be
bulletproof.

------
IloveHN84
Imagine how slow it could be if you use layers and layers of abstraction
(e.g. the Java lasagna programming style).

------
xtrapolate
> "The first fix was to avoid calling CommandLineToArgvW by manually parsing
> the command-line."

> "The second fix was to delay load shell32.dll."

If your build pipeline is continuously spawning processes all over, to the
point where "delay loading" makes a significant difference, it's time to start
re-evaluating the entire pipeline and the practices employed.

~~~
wtallis
Do you know of a build system that can handle a source tree as large as _an
entire web browser_ without spawning a lot of processes?

It's hard to tell what, if anything, you are recommending here. Pass thousands
of files to a single compiler invocation? Ignore the problems and stop trying
to make process creation and clean-up faster?

~~~
im3w1l
> Pass thousands of files to a single compiler invocation?

Sure. Or pass it a file with all the filenames. Or have the compiler work as a
server that takes compilation requests over a socket. It's not like passing
thousands of filenames between two processes is a deep unsolved problem.

~~~
mannykannot
So, the solution to concurrency problems is to serialize everything?

~~~
fit2rule
"Concurrency is everything serialised, properly."

~~~
mannykannot
The posts we are replying to here seem to have a very narrow concept of what
'properly' entails in this case.

