
Compiler bug? Linker bug? Windows Kernel bug - janvdberg
https://randomascii.wordpress.com/2018/02/25/compiler-bug-linker-bug-windows-kernel-bug/
======
epberry
I used to be surprised when people came across these bugs. I would think: a
kernel bug! They must have truly hit an edge case for something like that to
skip into production. But then I read Showstopper: The Race to Build Windows
NT, specifically the part where that OS shipped with 65,000 known defect (and
it was considered one of the most stable software releases of all time), and
it became clear to me that these bugs are probably known by the team when the
OS ships but just not considered high priority enough to fix. Same with the
recent articles about iOS and engineers not getting time to fix P2 bugs. Long
story short my entire perspective on kernel bugs (and partly software
development) changed and I began to think of them more as known flaws that
weren’t fixed because of schedules rather than unknown bugs that were
uncovered in the wild.

~~~
DoofusOfDeath
This is one aspect of OS development in which I think open source wins: users
are not at the mercy of the OS developers for bug tracking and/or fixes.

~~~
chrisper
Are we not? If I find a bug, I sure hope they fix it soon. I do not have the
time and energy to work myself into the kernel to fix the bug (unless it is a
small bug that does not require a lot of knowledge about the kernel itself).

~~~
mseebach
Sure, as an individual, your direct powers are limited, but they're a lot less
limited than if your kernel is closed. You can offer a cash bounty to anyone
who wants to fix the bug (not just those employed by the kernel-vendor) and
you can deploy their fix to your own systems independently of the kernel
vendor releasing an update (or, accepting the fix at all) because you can
build the kernel from source yourself (or, again, you can pay someone to do
this for you). If you can't afford a big enough bounty to get anyone to bite,
you can team up with others affected by the bug to pool your resources, and
you can even use the kernel bug tracker to coordinate this effort.

~~~
hiccuphippo
At the very least you can write a bug report. I tried writing one for
Microsoft Office, not even found where.

~~~
Someone1234
First off, true, reporting bugs to Microsoft is hell.

But on the other hand reporting feature suggestions to Microsoft is much more
pleasant than anything open source has e.g.
[https://office.uservoice.com/forums/285186-general](https://office.uservoice.com/forums/285186-general)

In Open Source world, because by the nature of it you're non-paying, asking
for features either gets back "submit your pull request" or "go away." Which
makes sense, those who contribute the most time get the most say, and that
would be fine if those that contribute the most time had the same goals and
taste as those that didn't (see the failure of Linux Desktop as an example).

~~~
szatkus
Looks like people really want ClipArts back.

~~~
Someone1234
It was nice having a license free library of art assets even if they were
overused. There are websites that offer that these days but as some of the
commentators pointed out, that requires legal to sign off on in some
businesses or those sites could be hosting stolen images.

~~~
ygra
> It was nice having a license free library of art assets even if they were
> overused.

I guess by now Emoji fill that role to some extent. Including the overused
part ;-)

------
jwilk
The "Avoid Windows kernel bug using Python hack" commit (for people with JS
disabled):

[https://github.com/chromium/chromium/commit/052a09014b2018e2...](https://github.com/chromium/chromium/commit/052a09014b2018e2a1e3b7f046ea9ad3355b831e)

~~~
bla2
And this looks like the lld-link change: [http://llvm.org/viewvc/llvm-
project?view=revision&revision=3...](http://llvm.org/viewvc/llvm-
project?view=revision&revision=325274)

~~~
jwilk
The llvm commit on GitHub (which shows whole diff at once):

[https://github.com/llvm-
mirror/llvm/commit/8b2a3e8b203a332bb...](https://github.com/llvm-
mirror/llvm/commit/8b2a3e8b203a332bbd71cb9a5708f870dec3ae74)

~~~
brucedawson
Thanks for the link. I meant to track that down and I appreciate your doing
that for me.

------
mehrdadn
Huh. There's a memory-mapped I/O bug I've known about for years, I wonder if
this is related to it? It's something along the following lines, but I forget
the exact details, so some bits (no pun intended) might be wrong: if you
create a memory mapping and then close the underlying file handle, you can
even delete the file at that point, but the memory mapping will still hang
around and pretend to work, and you won't know it's not working. I'm not sure
exactly what happens under the hood but I recall it doesn't behave sanely.

~~~
zuzun
A call to unlink removes the name from the filesystem, but keeps the file
around as long as it's still opened by a process; and mmap keeps the file
opened, even if you close the original file descriptor.

~~~
mehrdadn
> A call to unlink removes the name from the filesystem, but keeps the file
> around as long as it's still opened by a process; and mmap keeps the file
> opened, even if you close the original file descriptor.

Are you familiar with Windows or just assuming all OSes works like *nix
systems? That's definitely not at all close to how Windows file systems
behave.

~~~
pjmlp
Not even all UNIXes.

It is implementation defined on the UNIX OS variant and mounted file system.

The beauty of POSIX is how it appears to be portable, while leaving quite a
few things being implementation defined.

~~~
gpderetta
Not sure what UNIX variant are you thinking, but from POSIX:

> The mmap() function adds an extra reference to the file associated with the
> file descriptor fildes which is not removed by a subsequent close() on that
> file descriptor. This reference is removed when there are no more mappings
> to the file.

I'm sure there have been unix variants that got it wrong and might not even
have been in posix from the beginning (it is there since at least SUSv2);
still I think this is traditional unix behavior.

------
Athas
Are PE executable files special in Windows, or could this bug have been hit
with normal IO?

~~~
mehrdadn
I'm not sure in what manner they would need to be special for this, but I
would expect the significant part was the fact that it was memory-mapped
(which isn't normal I/O), not that they were PE files.

Edit (in response to comments): Actually, I _don 't_ think it has anything to
do with PE files being optimized in some special fashion, because they say
loading via LoadLibrary(Ex) can cause this issue too, and those are very much
user-mode constructs, and hence can't mess with the caching/paging behavior
(which are lower, in the kernel). AFAIK the only kernel-mode component to them
is to just create a file mapping (NtCreateSection) and after that all they do
is a bunch of user-mode juggling (fixing up relocations, loading other
libraries that are depended on, calling entrypoints, etc.).

~~~
blattimwind
Since it's only observed by execution, its entirely possible that the PE
loader has been optimized in a low-level fashion leading to bugs that are not
exposed in general I/O facilities.

~~~
gmueckl
My reading is that PE images are memory mapped into RAM instead of being read
from disk. This is why they require write locks on the files during execution:
the program image is backed from the corresponding file instead of the normal
page file.

So my speculation is that this bug is a race condition between the cleanup of
a dirty memory mapped section of a file and the reading of the same file
contents in a separate mapping.

------
rurban
Still a linker bug to me. Where did it became a kernel bug? Looks like wrong
usage of memory-mapped I/O, not like a bug in memory-mapped I/O itself.

~~~
brucedawson
The linker did nothing wrong. It wrote to a file using memory mapped I/O.
Shortly afterwards the build system tried to run the just-linked file. This is
supposed to work. It usually does. Due to this bug it occasionally doesn't.

The linkers are getting a compatibility hack (FlushFileBuffers call) but this
is not supposed to be necessary, by the file-system contract.

------
0xfeba
I just wish they'd fix this bug with incorrect restored window states. Eg,
when monitor layouts changed maximized windows are windowed, but they still
think they are maximized. It requires clicking the maximize button twice to
fix it.

Been there since Windows 98, at least.

------
S3raph
really love reading bug hunting stories

------
dingo_bat
So basically if you write and read immediately, windows will give you bad data
in some cases. Not a good sign for the underlying design.

~~~
huhtenberg
No, basically it's not like that.

From the TFA:

    
    
        if a program writes a PE file (EXE or DLL) using memory mapped file I/O and 
        if that program is then immediately executed (or loaded with LoadLibrary or LoadLibraryEx), and 
        if the system is under very heavy disk I/O load
        then a necessary file-buffer flush may fail

~~~
quietbritishjim
How does your comment contradict the parent one?

~~~
narag
Writing a binary executable file is not a common situation, only compilers do
that. If you add the unusual heavy IO, this is a fringe case.

~~~
mehrdadn
[Edit: whoops, never mind, thanks. Might be worth adding the memory-mapped
part to your comment to clarify for others too.]

~~~
DougBTX
You're right that it is an example of a case that matches the "basic"
description of the conditions to hit the bug, but it probably won't satisfy
any of the three conditions when they are considered in detail:

* The files being written probably won't be using memory mapped IO (I'd expect them to use the equivalent of fwrite instead) as the installer won't need to make complex changes to the binaries in the same way that a compiler/linker would

* There will likely be a significant delay between writing the file and running it compared with a build process that uses its own build output as part of the build (even in the build process case, the bug went away for a year, possibly due to an additional delay between the write and execute steps)

* Once the installer has run, the system will probably not be under heavy load (the author was running a highly parallel build on a 24-core machine, and this bug was still only hit in 3% of builds)

Really the earlier post was just misleading, as it implies the bug is much
easier to hit than it really is.

~~~
kuschku
There are other use cases where this might occur, though – JITs that store
JITed binaries on disk.

That’s rare (usually there you’d read it back in and execute in your process
space), but it’d be at least a second realistic scenario.

------
xtrapolate
Answer is: "none of the above".

Coherency is covered by the WinAPI documentation (check the remarks section).
Always RTFM.

[https://msdn.microsoft.com/en-
us/library/windows/desktop/aa3...](https://msdn.microsoft.com/en-
us/library/windows/desktop/aa366537\(v=vs.85\).aspx)

~~~
masklinn
Impressive. It would be difficult for you to be more wrong, and reading the
article and noting that

> [ex-coworkers at microsoft] confirmed that my fix should mitigate the bug
> (I’d already noted that it had allowed ~600 clean builds in a row), and
> promised to create a proper fix in Windows.

would have saved you the bother of nerd-sniping.

The coherency notes in the article you link are irrelevant because they're
about _concurrent access_. In TFA, the file is created (through mapping),
closed, then executed. Sequentially. Also no ReadFile or WriteFile involved
(in fact though I don't know how Windows implements it one would expect PE
loading to use memory mappings, and thus be covered under coherence
guarantees)

~~~
xtrapolate
Sure, you can call your "ex-coworkers at Microsoft" (most of which are
irrelevant to debugging/verifying this issue) - or, you can just spend some
time reading the documentation, as previously suggested.

The coherency issues are your "tread lightly" warning. The docs regarding
FlushViewOfFile expand on this specific issue:

 _Flushing a range of a mapped view initiates writing of dirty pages within
that range to the disk. Dirty pages are those whose contents have changed
since the file view was mapped. The FlushViewOfFile function does not flush
the file metadata, and it does not wait to return until the changes are
flushed from the underlying hardware disk cache and physically written to
disk. To flush all the dirty pages plus the metadata for the file and ensure
that they are physically written to disk, call FlushViewOfFile and then call
the FlushFileBuffers function._

So, after carefully reading this documentation, we can clearly conclude OP's
pull-request is actually missing a call to FlushViewOfFile prior to calling
FlushFileBuffers.

~~~
brucedawson
Closing the file is supposed to ensure coherency. It generally does. That it
occasionally does not is a bug, that Microsoft will fix.

