
What is Windows doing while hogging that lock? - ingve
https://randomascii.wordpress.com/2017/07/27/what-is-windows-doing-while-hogging-that-lock/
======
pbsd
HmgNextOwned doesn't do, by itself, all that much. There is some mild pointer-
chasing, but nothing too horrible. All it does is go through the global GDI
handle table and look for handles owned by the terminating process. It is used
roughly like this:

    
    
        DWORD index = 0;
        while(true) {
          HANDLE obj;
          index = HmgNextOwned(index, pid, &obj);
          if(index == 0) break; // done / nothing found
          // ... do some clean up of the object here
        }
    

The thing is, this loop is executed 6 (12, if counting the instances inlined
into NtGdiCloseProcess) times when terminating a process, for various object
types, and for each call of HmgNextOwned it will lock and unlock the handle
manager lock.

~~~
brucedawson
Thanks for the explanation.

If it scans the entire GDI handle table then that suggests that its runtime
will be proportional to the system's total number of GDI objects, correct?
That sort of design can easily get expensive. I'm surprised it can't iterate
through its own GDI objects without having to traverse everybody else's as
well.

~~~
pbsd
Yeah, it's proportional to the total number of GDI objects. Approximately:

    
    
      DWORD HmgNextOwned(DWORD index, DWORD pid, HANDLE * handle) {
        GDI_TABLE_ENTRY entry;
        GreAcquireHmgrSemaphore();
        // GetNextEntryIndex is inlined in reality, this is the main loop you see in the disassembly
        while((index = gpHandleManager->GetNextEntryIndex(index, &entry)) != 0) {
          if(entry.Type != 0 && (((pid & 0xFFFFFFFD) ^ entry.ProcessId) & 0xFFFFFFFE) == 0) {
            *handle = index | (entry.Upper << 16);
            break;
          }
        }
        GreReleaseHmgrSemaphore();
        return index;
      }
    

Best information I could find on the relevant structures is [1], which gives
an idea of what you're traversing. I don't know why this is designed this way;
it does seem suboptimal.

[1] [https://cvr-data.blogspot.com/2016/11/windows-10-anniversary...](https://cvr-data.blogspot.com/2016/11/windows-10-anniversary-update-gdi.html)

~~~
brucedawson
That link was extremely helpful! My unsubstantiated belief is that the
process-shutdown bug was new in Windows 10 Anniversary Edition, and that link
supports that idea. It sounds like Windows 10 Anniversary Edition changed the
management of GDI objects to avoid leaking kernel addresses and that made
HmgNextOwned more expensive, thus triggering the visible serialization of
process destruction.

Cool!

I really wish I could upvote the parent post more times. Failing that I've
updated the latest post to link to it as an explanation for how the whole
problem got started.

~~~
pbsd
For what it's worth, I just went back to an earlier version (10240) of
win32kbase.sys, and the search is indeed a lot simpler...it's essentially a
flat linear table search (to be clear, it's still iterating through every GDI
object), much friendlier to the CPU than the structure traversing that happens
now. Could very well be the culprit.

~~~
brucedawson
Thanks for the information. I guess if scanning used to be two to three times
faster maybe that would be fast enough that the issue, while still present,
would not be noticeable.

Note that the scanning happens even when the process has _zero_ GDI objects.
That seems... suboptimal.

I've made a few updates to the end of the post based on the latest
information.

------
SeanBoocock
Good followup to Bruce's previous blog, discussed here:
[https://randomascii.wordpress.com/2017/07/09/24-core-cpu-and...](https://randomascii.wordpress.com/2017/07/09/24-core-cpu-and-i-cant-move-my-mouse/)

Surprised at how much you can divine from the ETW traces with enough
persistence.

------
revelation
Well.. what is it doing?

Analyzing the assembly is neat but this doesn't seem like a CPU performance
problem, it's a high-level design issue. It's a bit like analyzing sleep sort.

~~~
brucedawson
Agreed. I was just curious, and it seemed like a fun way to demonstrate the
technique, which can be useful when profiling CPU intensive parts of your own
code.

------
nitwit005
Might be worth looking to see if the processes are creating hidden windows or
other UI objects. The previous post references NtGdiCloseProcess, and it'd
make sense that some of those handles would require a lock to clean up safely.

~~~
brucedawson
The test code is up there. The master process loads user32.dll and makes a
couple of window-type calls, but the other processes do nothing but wait on
semaphores and then die.

------
brucedawson
It's great to see all this discussion. If anyone has a link to any Intel
papers or systematic analysis of how sampling profilers interact with Intel
super-scalar out-of-order processors that would be appreciated.

~~~
ThrowingPotato
What are the full specs on your 24 core machine?

~~~
brucedawson
HP Z840, two sockets, 64 GB of RAM, SSD. I don't remember the details and I'm
at home now.

The most recent tests were done on my four-core/eight-thread laptop. It's
running a more recent version of Windows so it's a "fairer" test in that way,
and it reproduces the problem just fine with my artificial test program,
CreateProcessTests.

------
Symmetry
>If there are three instructions executing simultaneously when the interrupt
fires then which one “gets the blame”? If a load instruction misses in the
cache and forces the CPU to wait then will the samples show up on the load
instruction, or on the first use of the data? Both seem to happen.

Specifically: a CPU may execute instructions out of order, but there's a final
retire step where the results of all instructions are made 'official' in
order, which is what allows interrupts to be handled precisely. So from
outside it looks like the CPU was executing in order.

The upshot is that, except in crazily optimized code that might sometimes
overwhelm the retire unit, the last instruction you see as not having finished
really hasn't finished, but there might be other instructions further along
that are also being worked on.

------
rootw0rm
meh, i'm an idiot.

~~~
jsnell
The question isn't about three instructions being executed by three separate
OS threads. It's about three instructions being executed at the same time on a
single thread on a single core.

The interrupt context will only contain a single program counter value; which
of the instructions that could currently be considered to be executing on that
thread will it actually point to?

~~~
tedunangst
I would guess the PC is the most recently retired instruction? Or would it be
the oldest not-yet-retired one? Although this is an interesting scenario:

    
    
        slow eax, ebx
        fast ecx, ecx
    

Assuming slow and fast always finish at the same time, will the profiler ever
see the CPU executing slow?

~~~
jsnell
Working on the most recently retired instruction would be odd, since it'd lead
to things like the cost of cache misses being assigned to instructions that
have nothing to do with memory access.

But other than that, I've never been able to build a mental model for exactly
how the accounting works. For example take an inner loop with no memory
accesses, like this:

    
    
        18: add    $0x1,%rax
            cmp    %rax,%rdx
            jne    18
    

On a Skylake this loop will execute one full iteration per cycle (an IPC of
~3), so we're clearly spending equal amounts of time on each instruction. But
perf will assign basically all of the cost to the jump.

    
    
         0.28 │18:┌─→add    $0x1,%rax
         0.27 │   │  cmp    %rax,%rdx
        99.45 │   └──jne    18
    

So it could be consistent with some kind of "first in-flight" or "last in-
flight" model. But change the loop to be 5 instructions long (and with an IPC
of ~5, so still executing the whole loop once per cycle), and suddenly the PC
gets smeared across two instructions:

    
    
         2.12 │20:┌─→add    %rsi,%rax
         0.94 │   │  add    $0x1,%rcx
        45.36 │   │  add    %rax,%rdx
         0.10 │   │  cmp    %rcx,%rsi
        51.48 │   └──jne    20
    

Take an almost equivalent loop, just swapping two instructions. It's still 5
instructions and an IPC of ~5. And...

    
    
        80.79 │20:   add    $0x1,%rcx
         1.29 │      add    %rax,%rdx
         0.89 │      add    %rsi,%rax
         0.00 │      cmp    %rcx,%rsi
        17.02 │    ↑ jne    20
    

Multiple runs of an individual test case will be pretty consistent on a
specific machine; so it's not that the assignment is just random. It's
deterministic, but the exact mechanism is very opaque.

~~~
ot
> it'd lead to things like the cost of cache misses being assigned to
> instructions that have nothing to do with memory access.

But this happens all the time with perf. I often see cache misses attributed
to a jump dependent on a data load.

~~~
jsnell
Sure, but that's still an in-flight instruction that is stalled waiting on a
memory access (just indirectly). You don't see the miss being attributed to
the instructions that preceded the load.

~~~
brucedawson
Take a look at the "cmp ebx,eax" on line 24 of the spreadsheet linked from the
article. ebx is only ever modified by pure register manipulations. eax was
most recently modified by having a register added to it.

And yet the cmp instruction is quite expensive - much more so than those that
precede it and which are executed equally frequently.

So, somehow the cost of the r8d load four instructions earlier must end up
being propagated through the add in the previous instruction and then gets
charged to the "innocent bystander" of the compare. Crazy. I would have
expected the "add eax,r8d" on line 23 to pay the price.

~~~
qb45
This looks funny: it seems that RDI is never modified by the loop (?), so
lines 18-24 always load the same data, presumably from L1. It's mysterious
why, of all those instructions, 20 and 24 get most of the interrupt hits.

~~~
brucedawson
Good observation about rdi and its implications for the load of rcx (line 18)
and therefore the load through rcx (line 20).

I added a note to that effect and also enabled adding comments to the
spreadsheet.

