
You Got Your Web Browser in my Compiler - joosters
http://randomascii.wordpress.com/2014/03/31/you-got-your-web-browser-in-my-compiler/
======
rohansingh
The unfortunate excerpts from this post that stood out for me:

> Without access to a lot of source code I can’t tell exactly what is going
> on, but...

> Until Microsoft fixes their compiler to not load their web browser it seems
> impossible to avoid this problem when doing lots of parallel builds.

In my past life developing applications and services on Windows, this scenario
was all too common: something is broken, you can't quite tell why because you
don't have the source, and you can't do anything but wait for it to be fixed
or find some hacky workaround (which is tougher to do without the source).

This has been the real win in switching to an open source stack for me: you
can take your tools apart, look inside them, and fix any issues. I know
Microsoft has come a long way on this in the past few years, but it still sucks
to be stuck in that kind of situation.

~~~
NigelTufnel
Were it GCC instead of VC++, that wouldn't change much for a lot of people.
Fixing bugs in a program with 7 million lines of code is not easy.

Fixing bugs in GCC as an uninitiated person is probably super hard.

~~~
kentonv
> Fixing bugs in a program with 7 million lines of code is not easy.

It's not like you have to read all 7 million lines of code. The author of the
article already has a stack trace; all he'd need to do is skim the functions
along that trace. In this case it sounds highly likely that the fix is to
comment out some code somewhere that is initializing a resource that isn't
actually needed.

Frankly it sounds pretty easy.

------
jwise0
I think the true enlightenment in this article (for me, at least) is that
Windows's tracing framework is far more powerful than I had realized. Being
able to capture a system-level trace is an _incredibly_ valuable tool.

Until recently, I had thought that the only real tools out there to do this
kind of analysis lived only on Solaris (DTrace). (As I recall, SystemTap gets
part of the way, but does not reach far enough into userspace to really get a
good view of what's going on -- is that still true?) It would be very
interesting to augment Windows's performance tracing tools with some of the
other things that DTrace has to offer -- pervasive low-overhead trace
scripting seems like it would have made some of the other subsequent analyses
that Bruce was working on much easier.

~~~
wging
> Until recently, I had thought that the only real tools out there to do this
> kind of analysis lived only on Solaris (DTrace).

FYI, there are ports of DTrace to systems other than Solaris.

------
fafner
Ah, the wonders of XML. Somehow somebody decided to use XML to store some text
data. Because, after all, you already have a parser for it in the approved
tools, so why bother using something different?! And look, it does all the cool
stuff like namespaces and validation, and amazingly it can even fetch remote
DTDs by loading IE...

~~~
simias
I hate XML as much as the next guy but in this case I would blame a poorly
designed (or maybe very misused) API.

There's no reason why any file parsing library would end up fetching remote
data without being explicitly asked to do so. Actually, it shouldn't even be
the library's concern to fetch those files, an XML library has no business
with networking. It's a security concern and a maintenance hell.

~~~
pwg
> There's no reason why any file parsing library would end up fetching remote
> data without being explicitly asked to do so.

I would agree. But the XML spec. authors clearly disagreed with both of us:

External Entities:

XML 1.0: [http://www.w3.org/TR/2008/REC-xml-20081126/#sec-external-ent](http://www.w3.org/TR/2008/REC-xml-20081126/#sec-external-ent)

XML 1.1: [http://www.w3.org/TR/2006/REC-xml11-20060816/#sec-external-ent](http://www.w3.org/TR/2006/REC-xml11-20060816/#sec-external-ent)

This is "XML speak" for an "include" statement, analogous to what cpp
provides, with the exception that this "include" can end up performing remote
network fetches to acquire what is being included.

So, technically, to be a proper, standards-compliant XML parser, the parser
has to at least surface requests to "fetch" these entities to the higher-level
code using the library, and let that code decide what to do about the
"includes".
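A sketch of that delegation, using Python's stdlib SAX driver (the entity URL
is a made-up placeholder, and external-entity processing must be opted into
explicitly here):

```python
import io
import xml.sax
from xml.sax.xmlreader import InputSource
from xml.sax.handler import feature_external_ges

class LoggingResolver(xml.sax.handler.EntityResolver):
    """Application-level hook: decide what an external 'include' resolves to."""
    def __init__(self):
        self.requested = []

    def resolveEntity(self, publicId, systemId):
        # Record the request and substitute empty local content
        # instead of letting the parser fetch the remote resource.
        self.requested.append(systemId)
        src = InputSource(systemId)
        src.setCharacterStream(io.StringIO(""))
        return src

doc = """<?xml version="1.0"?>
<!DOCTYPE root [
  <!ENTITY ext SYSTEM "http://example.com/remote.txt">
]>
<root>&ext;</root>"""

resolver = LoggingResolver()
parser = xml.sax.make_parser()
parser.setFeature(feature_external_ges, True)  # opt in to external entities
parser.setEntityResolver(resolver)
parser.setContentHandler(xml.sax.handler.ContentHandler())
parser.parse(io.StringIO(doc))

print(resolver.requested)
```

The parser reports the systemId to the application and uses whatever the
resolver returns, which is exactly the "let the higher-level code decide"
behavior described above.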

As to why Microsoft's implementation is the way it is, absent a Raymond Chen
blog post explaining the why, we can only guess.

~~~
acqq
Ever tried to look at the documents saved by LibreOffice or MS Office? They
all use XML now. An ODT document with only two words in it has this beauty at
the start of its content.xml:

    <?xml version="1.0" encoding="UTF-8"?>
    <office:document-content
    (...)
    xmlns:xlink="http://www.w3.org/1999/xlink"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:math="http://www.w3.org/1998/Math/MathML"
    xmlns:ooo="http://openoffice.org/2004/office"
    xmlns:ooow="http://openoffice.org/2004/writer"
    xmlns:oooc="http://openoffice.org/2004/calc"
    xmlns:dom="http://www.w3.org/2001/xml-events"
    xmlns:xforms="http://www.w3.org/2002/xforms"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:rpt="http://openoffice.org/2005/report"
    xmlns:xhtml="http://www.w3.org/1999/xhtml"
    xmlns:grddl="http://www.w3.org/2003/g/data-view#"
    xmlns:officeooo="http://openoffice.org/2009/office"
    xmlns:tableooo="http://openoffice.org/2009/table"
    xmlns:drawooo="http://openoffice.org/2010/draw"

It's not a Microsoft-specific bug that ate some brains.

~~~
yuhong
These are namespace directives, not references to external DTDs.

~~~
acqq
The problem is, whenever you have a link anywhere and it is assumed that it
should be refreshed sometimes, how can you know that you shouldn't load the
more current version? If you write something like a DLL or library, why not
leave it to the expert: let IE try to fetch it, and if it has already fetched
it, it will return it from its own cache! Brilliant, problem solved! Except
when that happens from 144 instances all the time, and IE needs to create some
windows at startup, which is what Bruce seems to have managed to trigger.

~~~
shawnz
To expand on what the parent commenter was saying: namespace directives aren't
meant to be accessed; they are just used as a unique identifier.
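A minimal illustration with Python's stdlib ElementTree: the namespace URI is
used purely as an opaque key in qualified names, and nothing is ever fetched
from it:

```python
import xml.etree.ElementTree as ET

doc = """<doc xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Example</dc:title>
</doc>"""

# Parsing performs no network access; the URI is only an identifier.
root = ET.fromstring(doc)

# ElementTree expands qualified names to the {namespace-uri}localname form,
# so lookups compare the URI as a plain string.
title = root.find("{http://purl.org/dc/elements/1.1/}title")
print(title.text)
```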

~~~
acqq
Yes. I'm old enough to remember the times though when even some of the
creators of the standard still thought otherwise:

[https://web.archive.org/web/20060613090946/http://textuality...](https://web.archive.org/web/20060613090946/http://textuality.com/tag/Issue8.html)

"Given that namespaces have definitive material, and that such definitive
material is typically available on the Web, and that namespace names may be
"http:"-class URIs, it is a grievous waste of potential if it is not possible
to use the namespace name in retrieving the definitive material."

And in order to do all the processing and transformations popular at the time,
copies of the documents specified by those URIs had to exist somewhere. Bruce
detected loads of some documents stored locally, in the DLLs.

------
bhouston
You need to reduce the parallelism in your build. If your build setup makes it
possible to be running 144 concurrent builds, you are thrashing your caches.
This isn't an optimization. You need to reduce the number of projects being
built concurrently (or even the total number of projects). While I advocate a
bit of over-allocation, you have taken it to the extreme, and this is really
not optimal at all.

Also, you should be using precompiled headers if you can; sometimes this
requires structuring your projects correctly to avoid lots of little
projects.

Even with large projects I usually have really low compilation times for C++.

Here is a blog post I wrote about the right way to do this:

[http://exocortex.com/blog/speeding_up_cpp_builds_on_windows](http://exocortex.com/blog/speeding_up_cpp_builds_on_windows)
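A rough sketch of the bounded-concurrency idea (Python, with hypothetical file
names and a stubbed-out compiler call): cap the worker count near the hardware
thread count instead of launching one job per project:

```python
import os
from concurrent.futures import ThreadPoolExecutor

# Hypothetical translation units; a real build system would discover these.
sources = [f"module{i}.cpp" for i in range(144)]

# Slight over-allocation keeps cores busy during I/O waits, without the
# cache thrashing that 144 simultaneous compiler processes would cause.
jobs = (os.cpu_count() or 4) + 2

def compile_one(src):
    # Stub standing in for a real compiler invocation, e.g.
    # subprocess.run(["cl", "/c", src], check=True)
    return src

with ThreadPoolExecutor(max_workers=jobs) as pool:
    results = list(pool.map(compile_one, sources))
```

The same effect is what `make -j$(nproc)` gives you on the command line; the
point is that the job cap tracks the hardware, not the project count.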

~~~
nknighthb
A standard test for scheduler responsiveness in Linux is to fire up a kernel
compile with unlimited jobs[0]. If responsiveness is lost, the scheduler isn't
ready for prime time. You think Microsoft wants the answer to this issue to be
"we can't even match Linux"?

[0] Well, it used to be. I assume they've moved on to more demanding tests
now, since hardware has far outstripped kernel build complexity.

~~~
bhouston
The issue isn't responsiveness during build for me (that isn't usually an
issue, my build machine is generally responsive), it is compile performance
from start to finish. If you over-allocate to the extreme you will thrash your
CPU caches, and this will slow everything down. You want to minimize CPU
context switches (thread switches) while also ensuring that your CPUs are
running at nearly 100% -- it is a difficult balance, but unlimited jobs and
focusing on the responsiveness of another task is not the solution to
maximizing start-to-finish compile time for a large project.

~~~
TorKlingberg
I think the article author is quite aware that too much parallelism will slow
down the build, but still wanted to investigate why it makes his computer so
unresponsive.

------
bazzargh
This looks like the same bug, back in 2010:
[http://social.msdn.microsoft.com/Forums/fr-FR/155b51f7-60ea-...](http://social.msdn.microsoft.com/Forums/fr-FR/155b51f7-60ea-41c2-a369-2e53c3d16a6f/msxml4dll-pops-up-a-warning-messagebox?forum=xmlandnetfx)

(ie that MSXML is calling URLMon, with unpredictable results). There's a
comment at the bottom of that thread that suggests how to fix the bug if it's
in your own code, but not if it's in Visual Studio...

edited to add: related bugs crop up in many programs that use XML parsers, not
just MS products. Some combination of disabling validation, disabling loading
of external entities, or adding an entity resolver is usually needed (see e.g.
these options for Xerces:
[http://xerces.apache.org/xerces-j/features.html](http://xerces.apache.org/xerces-j/features.html)).
The symptom there was usually errors in production that didn't occur on the
developer's box because in production the network connection to grab the
schema didn't work...or worse, a SOAP service that will fetch external
entities in messages.
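The "disable loading of external entities" option mentioned above looks
roughly like this with Python's stdlib SAX parser (placeholder URL; it is
never contacted):

```python
import io
import xml.sax
from xml.sax.handler import feature_external_ges

class Collector(xml.sax.handler.ContentHandler):
    """Gathers character data so we can see what the entity expanded to."""
    def __init__(self):
        self.text = []

    def characters(self, content):
        self.text.append(content)

doc = """<?xml version="1.0"?>
<!DOCTYPE root [<!ENTITY ext SYSTEM "http://example.com/remote.txt">]>
<root>&ext;</root>"""

handler = Collector()
parser = xml.sax.make_parser()
# Explicitly refuse external general entities; the document still parses,
# the entity simply contributes no content.
parser.setFeature(feature_external_ges, False)
parser.setContentHandler(handler)
parser.parse(io.StringIO(doc))
```

With the feature off there is nothing to fetch, so behavior no longer depends
on whether the network happens to be reachable.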

~~~
acqq
It looks like it's urlmon.dll there, whereas Bruce observed mshtml.dll being
loaded.

~~~
yuhong
It was urlmon that was loading mshtml.

~~~
acqq
You are right, I see. Do you know if urlmon always launches mshtml or is it
something specific for this case?

------
prayerslayer
I wouldn't say that VS loads "Internet Explorer", that's a bit of an
exaggeration. iexplore.exe is Internet Explorer, mshtml.dll holds the DOM
implementation it's using [1,2]. Now that's still close to a web browser, but
one may also use Trident to work e.g. with generic XML documents. I don't know
if VS does that and whether this is a good idea or not though.

[1] N. Zakas: High Performance JavaScript, p. 35

[2] [http://en.wikipedia.org/wiki/Trident_(layout_engine)](http://en.wikipedia.org/wiki/Trident_\(layout_engine\))

~~~
pjc50
He does claim that it creates a window, even if that's a conceptual one rather
than a physical one. Hence the lock contention in the desktop window manager.

------
octo_t
Awesome, in-depth writeup (especially for someone who isn't at all familiar
with Visual Studio / development on Windows in general).

From your report, the issue is that VS uses a builtin XML parser, and that XML
parser has a dependency on IE?

------
bitwize
This is one of the things about Microsoft APIs that gets me. They all depend
on each other in weird ways that have less to do with practicality than they
do with getting people on board to use the various Microsoft APIs. For
example, DirectX depends on COM. It doesn't have to, but they're both
Microsoft technologies so why not? You're going to need to learn how to use
COM anyway; it's the future. Systemd is designed this way too; that's why I
hate it, despite the fact that it won. It seems architected with a "getting
people on board with using certain tools" mindset rather than orthogonally
providing functionality.

But this... this takes the dependency tangle to whole new levels of comedy.

~~~
Negitivefrags
COM isn't really an API like DirectX is. It's a model for making APIs. Thus it
makes sense that they would use it when creating an API.

I can understand your argument in the general case, certainly, but it doesn't
apply very well to COM.

------
psuter
I am not sure whether it is still the case, but the Scala compiler used to
have a dependency on Swing. The reason was that one of the potent -Y flags
gave you access to an AST browser after the phase of your choice. Very
convenient tool for compiler plugin developers, but a questionable dependency
for most.

(Edit: seems to have been taken away at least as of 2.10)

~~~
jfim
scalac -Ybrowse:typer Foo.scala still works on 2.10.3. It's also pretty
convenient when writing macros, not only for compiler plugin developers.

------
muricula
If true, this implies that recent versions of IE compiled with this compiler
needed older versions of IE to compile themselves. IE is probably the first
browser to have the honor of bootstrapping itself.

------
scintill76
Sounds like it might be worth it to write a drop-in replacement DLL that
doesn't parse XML, or at least in a way that doesn't require mshtml. The XML
files that are being parsed appear to be static resources within the analyze
DLL itself. A third party might even be able to patch the DLL to have pre-
parsed resources instead of XML.

------
Flow
Years ago, I used procmon.exe and saw VB6 load project files by doing tons of
1-byte reads...

I wonder if the behavior in the article also applies to C# compiles?

~~~
josteink
MSBuild has come a long way since the VB6 days.

C# compilation is seriously fast compared to C++ and the like (due to its
properly statically typed nature), which allows the compiler to assert certain
things quickly and not waste any more time on them.

Given how .NET developers love separating things into tons of DLLs and
libraries (and how .NET makes this easy), I would be surprised if this part of
the compilation wasn't as optimized as the rest of the process.

~~~
bhouston
C# compilation is fast in part because of the removal of headers and the
replacement of them by well defined assemblies with interfaces. C# compilation
is fast for the same reason Java compilation is fast. :)

C++ compilation is sort of screwy in that, to compile a new C++ file, the
compiler actually has to re-compile all the definitions of the libraries you
are using. It is insanely inefficient compared to just using explicitly
defined and immutable API interfaces on assemblies/modules.

~~~
Locke1689
C# is fast because 1) we try really hard to make it fast :) and 2) it's not an
optimizing compiler.

~~~
Flow
C# sure compiles fast, but if you have several projects in one solution, the
majority of the time is spent shoveling DLLs around like there's no tomorrow.
An SSD helps only a bit. :-/

~~~
Locke1689
Roslyn happily addresses this :)

------
jjaredsimpson
Shoutouts to xperf. It is truly an amazing tool.

------
benched
Long time Microsoft and Windows app developer here. I just thought somebody
should point out the obvious: that this is not even remotely unusual for a
Microsoft app on a Microsoft operating system. You're just describing the
normal architecture of the OS and the way apps work.

------
danielweber
_The question “why is my computer unresponsive when doing dozens of
simultaneous compiles” is difficult to answer_

Doctor, doctor, it hurts when I do this!

