Tools like ccache are really primitive compared to what is possible. Change one bit in a header file and you have to recompile your whole project, even if the output ends up being bit-identical.
Zapcc is more advanced, but still far from ideal. What I really want is a compiler that can incrementally update the binary as I type code, and can start from a snapshot that I download from someone else so that nobody ever needs to do a full recompile. It would require a radically different compiler infrastructure, but I don't see any reason why it would be impossible.
I then copied the includes from a random .cpp file into a new .cpp file containing nothing but those includes. I removed them one by one, except boost/filesystem; compile times barely changed, a few milliseconds at most. The single boost/filesystem include took 2 seconds to compile on its own. I didn't even instantiate any templates in that stripped-down file, so in practice the overhead is even higher. Isn't that absolutely insane?
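For the curious, the whole experiment fits in one file. A reconstruction of what I did (file names and timing commands are illustrative, not exact):

    // bench.cpp - the entire stripped-down translation unit.
    // Illustrative timing commands:
    //   $ time g++ -c empty.cpp   # empty file: effectively instant
    //   $ time g++ -c bench.cpp   # this file: ~2 seconds, nothing used yet
    #include <boost/filesystem.hpp>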
I've seen one experimental IRC bot project that heavily uses template expansion, with the asynchronous model and other components from Boost, as a demonstration of the advantages of modern C++. The whole compile requires 2 GiB+ of memory; with parallel make, 4-8 GiB! The project is nowhere near the scale of Chromium, and building the Linux kernel doesn't need much RAM either. It's just an IRC bot! But the hardware requirements for using Boost and C++ template expansion are spectacular!
The actual binary runs well, though, and does not need such hardware. And the developer simply saw it as a small price to pay for using these language features.
Any bug is strange until you figure it out.
Had something similar in our office with Scala & SBT, but SBT is a POS anyway.
Further, there's GOMA, which is the distcc equivalent. I frequently build Android with 500 cores in the cloud. For that reason, I don't think Google engineers will ever focus on the compile speed of LLVM. That said, LLD (LLVM's linker) is by far the fastest linker in town.
- How long do incremental and non-incremental builds take with the build configuration you describe?
- How does core<->engineer allocation work? Is 500 cores an average number? High-priority work presumably commands more access, but I'm curious if allocation is generally team-specific or engineer-specific (eg, more responsibilities = more resources). (I guess the reason I ask is that 500 cores sounds kind of impressive and I'm going "surely they'll run out of cores if they're handing them out like candy???", but then, thinking about it, that's only like 11 computers...)
- I'm curious if you happen to know how Chromium gets built (IIUC Android builds it 2 (or is it 3?) times). It's tricky to distill (presumably/understandably deliberately so) how many cores the CI/trybot/etc infra is using. Overall my curiosity is what the core/time ratios/graphs are, ie how many cores translates to how short of a build.
- As an aside, I vaguely recall absorbing by osmosis several years ago that Chromium takes approximately 15 minutes to build on a Very Cool™ local workstation (HP Z series). Not sure how out of date this info is. Do engineers still do test builds on local hardware (and if yes, how long does this take? probably longer than 500 cores, heh); or is everything done in throwaway cloud instances nowadays?
Thanks in advance for whatever answers you're happy to provide :)
(disclaimer: I'm an engineer on the Bazel team)
Older versions of Stan on GCC 4.8 would take 60+ seconds to compile a single C++ file.
to then be compiled by the host compiler...
But in cases like "let's spin up 10 compile boxes with <the same Ubuntu, updated within the same week>", or something along the lines of what Xcode offered, where you and your colleagues could help each other out with small work units to make a single compile run much faster, it's a definite possibility.
For that final release build, you might consider running it on one of those 10 VMs from the example above and have 'only' the 9 others help out with the sub-parts, in order to get some kind of guarantee.
If it takes a minute for a full build on one box, getting a hint after 10 seconds that you misspelled something and that it will never link is worth something too, if you value your time as a developer.
I'm not saying it's perfect; I'm only chiming in on the "it must be 100% or it can't ever work" point, hopefully without needing to go into the "coach says we need to give 110% this game, and 120% if it's the finals" nitpicking.
(NB. Googling "lowering" turned up the seemingly-random https://news.ycombinator.com/item?id=14422944; that page isn't such a bad set of starting links, so I figured I'd include it.)
The only time it comes in handy is when:
1. You have enormous files AND
2. You have a small number of files.
1. It can get extremely expensive in memory use. Your parallelism limit can be dictated by how much memory your machine has rather than by its core count. Concentrating your available cores on fewer simultaneous compilation units could also bring data-locality benefits, even in situations where memory headroom isn't a concern (see the sketch after this list).
2. Many large build processes have serialized steps which are currently unable to benefit from any form of parallelism. This is becoming even more the case with the rise of IPO & LTO.
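To put rough numbers on the memory point, here's a back-of-the-envelope sketch; every figure in it is hypothetical:

    #include <algorithm>
    #include <cstdio>

    int main() {
      // Hypothetical machine: 32 cores, 16 GiB RAM, template-heavy TUs
      // peaking at ~2 GiB each while compiling.
      const long cores = 32;
      const long ram_gib = 16;
      const long peak_gib_per_tu = 2;
      // Memory, not core count, caps the usable parallelism here:
      long jobs = std::min(cores, ram_gib / peak_gib_per_tu);
      std::printf("make -j%ld\n", jobs);  // prints "make -j8", not -j32
    }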
But as ever, we'll probably have to wait to find out how large the practical benefits end up being from this work.
It’d also be nice if parallel make had better load balancing.
On the other hand, maybe this project will help improve whole-program LTO link time. I see that is mentioned as future work for this effort:
> Parallelize IPA part. This can also improve the time during LTO compilations
Big kudos to Giuliano Belinassi, who seems to be the one driving this effort.
On the topic of linking: can you sub in the gold linker for faster linking in your projects?
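If it helps, swapping linkers is usually just a driver flag away. A minimal smoke test (the file name and commands are illustrative; -fuse-ld= is a real GCC/Clang driver option):

    // smoke.cpp - hypothetical one-liner to check that the swap links at all.
    //   $ g++ -fuse-ld=gold smoke.cpp      # link with gold
    //   $ clang++ -fuse-ld=lld smoke.cpp   # or with LLD
    int main() { return 0; }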
Linker slowness ~= GC pause. The cost is approximately proportional to the size of the traced graph.
It actually works just fine (IIRC), but occasionally breaks when people do silly things :)
I believe you just need to enable LLVM_ENABLE_THREADS in the CMake config.
The LLVMContext is the unit of isolation between threads. For instance, the multi-threaded ThinLTO optimizer/codegen will use one LLVMContext per thread, so each thread is processing a single Module / TU.
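A minimal sketch of that context-per-thread pattern, assuming LLVM's C++ API (this is not ThinLTO's actual code; the worker and module names are made up):

    #include "llvm/ADT/StringRef.h"
    #include "llvm/IR/LLVMContext.h"
    #include "llvm/IR/Module.h"
    #include <memory>
    #include <thread>
    #include <vector>

    // Hypothetical worker: all IR this thread touches lives in its own
    // LLVMContext, so nothing is shared with the other threads.
    static void processModule(llvm::StringRef Name) {
      llvm::LLVMContext Ctx;                              // one context per thread
      auto M = std::make_unique<llvm::Module>(Name, Ctx); // one Module / TU
      // ... run optimizations / codegen on M here ...
    }

    int main() {
      std::vector<std::thread> Workers;
      for (const char *Name : {"a.ll", "b.ll"})           // hypothetical TUs
        Workers.emplace_back(processModule, Name);
      for (auto &W : Workers)
        W.join();
    }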
The issues for concurrency sit fairly deep in LLVM (use-lists, constants stored uniquely in the context, etc.), which would make intra-module parallelism really difficult to address.
MLIR, for example, is designed with this in mind, and its pass manager is already multi-threaded at every level of nesting (function passes will run on two functions of the same Module in parallel). This causes other complications/inefficiencies in the infrastructure; I'm still not sure how good a tradeoff it is. We'll see...
Not really valuable for building big projects that can already build in parallel at the file level.
If GCC devs try anyway and end up adding mutexes to all their data structures, it could potentially make things slower for everyone. It'd be overkill too. Last time I checked, GCC doesn't even memoize the O(n^2*m) stat system calls it makes when processing includes (where m = # of -I / -isystem / -iquote flags). That would be a very simple patch to make things less bad. But it still doesn't solve the real problem, which is that programming practices need to change.
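The memoization I have in mind really is that simple. A hypothetical sketch, not an actual GCC patch (POSIX-only, single-threaded, and ignoring cache invalidation):

    #include <string>
    #include <unordered_map>
    #include <sys/stat.h>

    // Cache stat() results per path so repeated include-search probes
    // don't hit the kernel again and again.
    static int cached_stat(const std::string &path, struct stat *out) {
      struct Entry { int rc; struct stat st; };
      static std::unordered_map<std::string, Entry> cache;
      auto [it, inserted] = cache.try_emplace(path);
      if (inserted)  // first time we see this path: do the real syscall
        it->second.rc = stat(path.c_str(), &it->second.st);
      *out = it->second.st;
      return it->second.rc;
    }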
It's sad these foundations only seem to exist in Rust.
I recently learned about c-reduce, which minimizes crash-reproducing C/C++ test cases by iteratively transforming the source code and re-invoking the compiler.
I can imagine a similar tool that takes a set of input file(s), carefully instruments them somehow to determine the data interactions (a bit like the dataflow analysis mentioned in the sibling comment), and then iterates through different bucket-sorts (automatically invoking the compiler) until it finds some locality-optimized arrangement of the input that also happens to compile the fastest.
On the one hand, this process would take hours - but on the other hand it can be lifted out of the compile/test cycle, and run eg overnight instead.
Optimizations might include tracing what you're editing right now and what that depends on, so active work can be relocated to the smallest discrete files possible. The system could just aim to minimize the size of all input files, but weighting what you're currently working on might produce additional speedups, I'm not sure.
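As a very rough sketch of the outer search loop: everything below is hypothetical (the file list, the build command, and plain shuffling standing in for smarter bucket-sorting):

    #include <algorithm>
    #include <chrono>
    #include <cstdlib>
    #include <random>
    #include <string>
    #include <vector>

    // Stand-in for regenerating the bucketed sources from `order` and
    // invoking the real compiler; returns the wall-clock build time.
    static double rebuild_with(const std::vector<std::string> &order) {
      (void)order;  // a real tool would write `order` into the build files
      auto start = std::chrono::steady_clock::now();
      std::system("make -s clean all");  // placeholder build command
      std::chrono::duration<double> dt = std::chrono::steady_clock::now() - start;
      return dt.count();
    }

    int main() {
      std::vector<std::string> files = {"a.cpp", "b.cpp", "c.cpp"};  // hypothetical
      std::mt19937 rng(42);
      std::vector<std::string> best = files;
      double bestTime = rebuild_with(best);
      for (int i = 0; i < 100; ++i) {  // e.g. the overnight run
        auto candidate = best;
        std::shuffle(candidate.begin(), candidate.end(), rng);
        double t = rebuild_with(candidate);
        if (t < bestTime) { bestTime = t; best = candidate; }
      }
    }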
In such a model, feeding in something like Boost would result in it eating all the templates that are never referenced.
To me, the biggest problem is that this entire infrastructure would need to understand very large parts of C/C++, and of course would also need to be faster than current infrastructure in order to actually speed anything up. I don't think there are any production-capable research analysis systems out there capable of doing this.
So, the likeliest path forward would be turning LLVM/GCC into something that can a) stay resident in memory (not fundamentally hard, just don't exit() :) ), b) be fed modified source code and accordingly traverse/update its analysis graph(s), and c) (most important) perform (b) efficiently (hah).
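A skeleton of (a) and (b), heavily hedged: the recompile step below is a stub standing in for the actual incremental re-analysis, which is the whole of (c), and the input protocol (paths on stdin) is made up:

    #include <filesystem>
    #include <iostream>
    #include <string>
    #include <unordered_map>

    namespace fs = std::filesystem;

    // Stub for point (c): re-run the front end and patch the dependent
    // analysis graphs for just this file.
    static void recompile(const std::string &path) {
      std::cout << "recompiling " << path << "\n";
    }

    int main() {
      // (a) stay resident; (b) get fed modified files, one path per line.
      std::unordered_map<std::string, fs::file_time_type> seen;
      for (std::string path; std::getline(std::cin, path); ) {
        auto mtime = fs::last_write_time(path);  // assumes the path exists
        auto it = seen.find(path);
        if (it == seen.end() || it->second != mtime) {
          recompile(path);
          seen[path] = mtime;
        }
      }
    }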
One major downside, apart from the total nonsemanticity of the actually-compiled output, would be the introduction of yet another hurdle to jump over to achieve reproducible builds.
I wonder if a design like this could be [part of] an intermediary first stage toward getting something like incremental compilation into LLVM/GCC. Ie, it could be a (temporary) binary of its own, which would allow these features to be developed in a production-usable context without impacting the behavior of the compiler itself; then, once properly built out, the compiler could be made to depend on it more and more, until either a) the compiler itself has the server mode built in, or b) the changes are so dramatic that the server mode is no longer needed (unlikely).
I say all the above as a not-compiler person. I have no idea what I'm talking about.
Huh? One wants a fast compiler even for a single compilation, not just for the multiple compilations that parallel make can handle, as implied here...