
Your Version Control and Build Systems Don't Scale - chadaustin
http://chadaustin.me/2010/03/your-version-control-and-build-systems-dont-scale-introducing-ibb/
======
Bluem00
A friend of mine expressed the problem described here in a more formal manner
in this paper: <http://gittup.org/tup/build_system_rules_and_algorithms.pdf>

He ended up creating his own build system, "tup" (<http://gittup.org/tup/>),
based on it. It also has the property this article asks for: "No-op builds
should be O(1) and instantaneous, and most other builds should be
O(WhateverChanged)".

~~~
WayneS
tup looks very cool, but I don't see how to fetch the source.

~~~
WayneS
this seems to work: git clone git://gittup.org/tup.git

------
aidenn0
I strongly disagree with the assertion that C compilation is I/O bound. If you
have optimizations turned off, this is probably true, but if you are using a
decent optimizing compiler, this is not the case.

We ran some tests where I work, on a pretty good-sized code base, and found
that we were CPU bound, even on an 8-core system.

[edit] There is an exception: if you are using a networked file system of some
kind (especially ClearCase dynamic views), then you are almost certainly I/O
bound.

~~~
__david__
If C compilation were I/O bound then ccache would not speed things up. It
does. I've used it on my NFS mounted filesystems and it speeds things up there
too.

There is clearly a point where, if your filesystem is slow enough, the process
does become I/O bound. I have no experience with ClearCase, but from what I've
heard it's molasses slow and might be past that point.

~~~
JoeAltmaier
Years ago I implemented a network file cache using a file-modification count
returned in every open handle. At that point it became an order of magnitude
faster to build over the network than locally, because the local file system
had no such cache. Anyway, this race between speed, space, and network
performance has been around for a while.
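
A hypothetical Python sketch of that scheme (the remote-filesystem interface
and names are invented here for illustration, not the original implementation):

    class NetworkFileCache:
        """Cache file contents keyed by a per-file modification count that
        the server returns with every open, so an unchanged file never has
        to cross the wire twice."""

        def __init__(self, remote_fs):
            self.remote = remote_fs        # must provide open_with_modcount(path)
            self.entries = {}              # path -> (modcount, data)

        def read(self, path):
            modcount, handle = self.remote.open_with_modcount(path)
            cached = self.entries.get(path)
            if cached is not None and cached[0] == modcount:
                handle.close()
                return cached[1]           # cache hit: no data transfer needed
            data = handle.read()           # miss or stale: fetch once, remember
            handle.close()
            self.entries[path] = (modcount, data)
            return data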

------
henrikschroder
This is the one thing I miss from working with Java in Eclipse. There, you're
always in a built state, so every time you save the file you're working on you
instantly get all the compiler errors, if any. There's also a large class of
errors the editor can detect without invoking the compiler, and you get
constant feedback about those as well.

My current main environment is Visual Studio for C#, and there you get a lot
of errors detected while editing, but not all of them, so you have to
continually press the rebuild button, and the time of that rebuild just keeps
growing...

Then again, I recently bought a SSD, that helps a bit with build times. :-)

~~~
ajross
If only there were some sort of tool you could use which would automatically
detect which dependencies have changed and compile only those files...

I swear, I read "press the rebuild button" and wanted to cry.

 _Edit: the sarcasm-free version of the above amounts to this: there was a
time when the idea of "writing software to help you write software" was a
standard notion, something that everyone did. It seems like the modern world
is training a generation of programmers who don't understand this, and who see
the "end result" software as the only software worth writing. The idea of tool
creation and integration is alien to them. That's what the IDE buttons are
for._

~~~
henrikschroder
I meant the build button; of course you don't have to do a full rebuild every
time, that would be insane. Did you honestly think Visual Studio or the C#
compiler is that retarded?

However, it doesn't change the fact that you still have to trigger it. If
you're changing a file that has a lot of dependencies, all of those have to be
rebuilt, unlike when you're sitting in Eclipse coding Java, where you don't.

~~~
prodigal_erik
The only reason incremental compilation is possible in Java is that the
language completely lacks abstractions like macros and overloading which can
globally affect the meaning of existing code. I already spend more time
reading and writing code than recompiling, so making recompiling faster by
making the code more verbose, repetitive, and error-prone is not a tradeoff
I'm happy about.

~~~
eru
I don't see how macros should make everything depend on everything else. When
you want to use a macro you (should) still need to indicate somehow where it
comes from.

~~~
lincolnq
They make compilation of the macro's clients depend on the implementation of
the macro, and therefore those clients must be recompiled when the macro
changes. In Java, there's no expectation that changing the implementation of a
method in one java file would cause another java file to be recompiled.

~~~
__david__
But if you add a parameter to a function, then don't all the files that have
calls to that function have to be recompiled (so that you can see the error)?

I don't know Java so I'm assuming it's similar to other compilers. Please
correct me if I'm wrong.

~~~
eru
Or how about changing types? Java's weak type system should make it necessary
to re-compile (and even change the sources).

~~~
lincolnq
If you change a parameter, then you are changing the interface, and it makes
sense that that affects recompilation.

Changing the type is an interesting one. If you change the backing class from
one type to another, but you were using an interface to access it, then the
Java compiler doesn't need to recompile that code -- the compiled
InvokeInterface bytecode for that method invocation doesn't change. However --
I feel like I may have read this somewhere -- there are optimizations which
might cause it to replace InvokeInterface with static invocations when it can
determine at compile time what class is used. If that's the case, then it
would have to recompile the client class too.

~~~
eru
I was thinking about changing the type of a parameter and the return type in
parallel --- where the application only treats them as black boxes and just
gives and takes those objects, but never looks at them.

------
wglb
Good article about large builds. I know of at least one outfit with a large
legacy C++ code base that spent the effort to move to C#, in no small part due
to unmanageable build times. This would be under the "constant reduce" leg of
this journey, but with C# you cut your file count in half, all other things
being equal.

A minor nit: _most other builds should be O(WhateverChanged)_ -- consider a
C++ header file used by 10 C++ files. Those 10 would need to be recompiled.
The agony there is that you wish for some less-than-full-file dependency
analysis, so that only the files that use that constant you just changed need
to be recompiled.

On the whole, a good start on those massive builds you have.

~~~
angstrom
Also worth pointing out is the ability to modify the MSBuild file to use
parallel build paths, assuming the dependency tree is fairly flat.

~~~
tpz
Which, in case you didn't know, the MSFT build stack was able to do with C++
projects long before it could with C# projects. I wonder if wglb's referenced
company would have liked to know that. ;)

~~~
angstrom
True, last I checked MSBuild support for VC++ projects was lacking. I think
their solution was to run the build through the Visual Studio command
interface... hopefully it's gotten better.

------
kevingadd
Here's a mirror (don't know how well it will hold up):
<http://hildr.luminance.org/ibb/>

~~~
mainland
Apparently, neither does his hosting provider...

------
nradov
The idea of optimizing build performance for huge monolithic chunks of code
seems somewhat misguided. Isn't it better to break your product up into a set
of reasonably sized libraries that can be built separately? That way when you
change something in one module you can just rebuild that one library, and then
perhaps relink it with the others. That general approach can usually deliver
fast development cycles with any language or build tool.

------
Groxx
I like the idea of in-memory change detection... there's a definite use there.
I don't agree with the idea of storing all files in memory (especially given
their complaint that Git is slow on 20GB of files...), though I admit grepping
that is fast.

But what if it were simplified to _just_ a live ls-diff based on when a
command was last run / a timestamp? If Git / Make would hook into something
like that it'd be really useful.
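
A crude, stdlib-only Python sketch of that idea; a real "live" version would
be fed by change notifications rather than re-walking the tree on every query:

    import os
    import time

    def changed_since(root, last_run):
        """Yield paths under `root` whose mtime is newer than `last_run` (Unix time)."""
        for dirpath, _, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    if os.stat(path).st_mtime > last_run:
                        yield path
                except OSError:
                    pass    # file vanished between listing and stat

    if __name__ == "__main__":
        an_hour_ago = time.time() - 3600
        for path in changed_since(".", an_hour_ago):
            print(path)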

~~~
TimothyFitz
ibb, at its core, is just an in-memory tree of the filesystem metadata.
Search is implemented as a plugin that stores the file contents in memory.

And ibb is a proof of concept, showing that O(1) is possible, desirable, and
useful. Having said that, I've been using it daily, because we have >200MB of
PHP/JS/CSS/HTML at IMVU.
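
For the curious, a rough Python sketch of that shape (not ibb's actual code;
it assumes the third-party watchdog package for the change notifications):

    import os
    import time
    from watchdog.events import FileSystemEventHandler
    from watchdog.observers import Observer

    class MetadataCache(FileSystemEventHandler):
        """In-memory tree of filesystem metadata, kept current by notifications."""

        def __init__(self, root):
            self.stats = {}                      # path -> os.stat_result
            for dirpath, _, filenames in os.walk(root):
                for name in filenames:
                    path = os.path.join(dirpath, name)
                    try:
                        self.stats[path] = os.stat(path)
                    except OSError:
                        pass                     # raced with a delete

        def on_any_event(self, event):
            # Only the touched entry is updated, so the cache stays in sync at
            # O(changes) cost instead of a full O(tree) rescan. (Renames are
            # glossed over in this sketch.)
            if event.is_directory:
                return
            try:
                self.stats[event.src_path] = os.stat(event.src_path)
            except OSError:
                self.stats.pop(event.src_path, None)

    if __name__ == "__main__":
        cache = MetadataCache(".")
        observer = Observer()
        observer.schedule(cache, ".", recursive=True)
        observer.start()
        print("tracking %d files" % len(cache.stats))
        try:
            while True:
                time.sleep(1)
        except KeyboardInterrupt:
            observer.stop()
        observer.join()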

~~~
Groxx
Gotcha, thanks. That known, I quite like the idea. Any way a makefile could be
twonked to use it? (haven't looked at all, don't really know where to start)

------
chipsy
I agree with the general thrust of the article in that iteration time is one
of the most important factors of any kind of creative work, programming
included. The longer you go without feedback or corrections, the less
confident you can be that you're on the "right track."

That said, build systems and VCS aren't the only ways to optimize the process.
For example, a system using scripts processed at runtime lets you reduce the
amount of compiled code and iterate quickly on one section by pressing a
"reload script" button, or, even more conveniently, by monitoring the source
file and auto-reloading when changes are saved. If you're daring enough you
could probably even do this with machine-compiled languages by abusing dynamic
linking, though the ease of doing that would ultimately depend on the runtime
linker's capabilities.
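
A minimal Python sketch of that auto-reload loop (the script name and the
polling approach are just illustrative):

    import os
    import runpy
    import time

    SCRIPT = "game_logic.py"    # hypothetical script being iterated on

    def watch_and_reload(path, interval=0.5):
        """Re-run `path` in a fresh namespace every time its mtime changes."""
        last_mtime = None
        while True:
            mtime = os.stat(path).st_mtime
            if mtime != last_mtime:
                last_mtime = mtime
                try:
                    runpy.run_path(path)
                except Exception as exc:
                    print("reload failed:", exc)   # keep watching; fix and save again
            time.sleep(interval)

    if __name__ == "__main__":
        watch_and_reload(SCRIPT)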

Doing this kind of data-driven thing will cost runtime performance since less
can be compiled and optimized statically, but if you treat it as the
"scaffolding" work that it is and also include ways to reclaim some of the
static factors for release builds, you can get a better overall result than
you would if you were just suffering through long builds.

------
samlittlewood
Whilst I hated ClearCase as a version control system (and I mean 'fire and
stakes' hate, not just 'cross the road') - it did have one good trick up its
sleeve:

Since the dynamic views into the source were implemented as a custom file
system, they could track all the file accesses that went into building each
object. They supplied a custom version of make that pulled that info out to
construct complete dependency descriptions. If you had the toolchain in
ClearCase as well, that would include all the compiler binaries, libs, and
includes.

Once that info was cached, and as they had code at filesystem level, it could
easily be invalidated as files were touched.

Another product of this mechanism was that if you tried to build something
that someone else had already built (all the same inputs), it would just grab
it over the network. In practice this meant that the nightly build would cache
the bulk of the object files.
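
ClearCase got that audit for free at the filesystem layer; as a rough userland
approximation on Linux, you can get the same kind of per-object dependency
record by running one compile under strace. A sketch, assuming strace is
installed (the gcc invocation is just an example):

    import os
    import re
    import subprocess
    import tempfile

    def trace_dependencies(cmd):
        """Run `cmd` under strace and return the set of files it successfully opened."""
        fd, log_path = tempfile.mkstemp(suffix=".trace")
        os.close(fd)
        try:
            subprocess.check_call(
                ["strace", "-f", "-e", "trace=open,openat", "-o", log_path] + cmd)
            deps = set()
            with open(log_path) as log:
                for line in log:
                    match = re.search(r'"([^"]+)"', line)   # first quoted path
                    if match and "= -1" not in line:        # skip failed opens
                        deps.add(match.group(1))
            return deps
        finally:
            os.remove(log_path)

    if __name__ == "__main__":
        # Example: capture everything gcc reads while compiling hello.c
        print(sorted(trace_dependencies(["gcc", "-c", "hello.c", "-o", "hello.o"])))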

------
ajross
FWIW: I'm not sure I buy that 21 minute Mozilla build, unless he was working
on a cold disk, or unless the build is doing a bunch of stuff with broken
dependencies. If the build was just done, the file metadata is in the page
cache, and you can read a _staggering_ number of dents in 21 minutes.

~~~
chadaustin
I've got a slow Pentium 4. I assure you, it's a real timing... A no-op build
on my Core 2 Duo laptop is ~7 minutes.

Either way is ridiculous.

~~~
ajross
7 minutes following a successful build? That just can't be due to
stat/getdents() calls, it can't. Check the log; no doubt there's a broken
dependency in there causing something to happen (e.g. an unconditional copy,
plus things that depend on the copied file). Again, you should be able to get
hundreds of millions (!) of files stat'd in those seven minutes.

There may well be something wrong with the mozilla build, but I swear, it's
not the metadata reads from make.
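
That claim is easy to sanity-check yourself; a throwaway Python timing of
warm-cache stat() calls (any existing file path will do; repeating one path is
the best case, so treat the number as an upper bound):

    import os
    import time

    PATH = "/etc/hostname"    # any small, existing file
    N = 200000

    os.stat(PATH)             # warm the cache once before timing
    start = time.time()
    for _ in range(N):
        os.stat(PATH)
    elapsed = time.time() - start
    print("%.0f stat() calls/second" % (N / elapsed))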

------
sorbits
Disable make’s implicit rules with -r or explicitly in the Makefile:

    
    
        .SUFFIXES:          # clear the built-in suffix list
        %:: %,v             # cancel the built-in RCS/SCCS checkout rules
        %:: RCS/%
        %:: RCS/%,v
        %:: s.%
        %:: SCCS/s.%
        %.c: %.w %.ch       # cancel the built-in CWEB rule
    

This gave me a significant speedup for no-op builds (0.7s → 0.1s, ~500 goals
in the Makefile).

Of course you can only do this if you do not rely on these implicit rules.

------
quellhorst
My version control and build systems scale better than your blog.

~~~
raganwald
Really? If you mean 'your version control and build systems can handle more
load than his blog,' perhaps you are correct. But the expression 'scale'
refers to the rate at which increases in load place increased demand on
resources. Unless you have a very specialized infrastructure for VCS and
builds, it probably doesn't scale at a lower rate than his blog infrastructure
scales.

~~~
quellhorst
Do you know what his blog infrastructure is? Over 3 hours later it's still not
working. Git is decentralized and can and does scale. Builds can be spread
across multiple machines on EC2.

~~~
chadaustin
From 2003 until a few months ago, I hosted my website off a Pentium 2 in my
apartment. Behind slow DSL, no less.

Once my blog posts started reaching Hacker News I thought "Oh, I'll just move
the site out of the apartment and into the cloud!" and bought an account at
prgmr.com (which I highly recommend, by the way).

However, I serve WordPress via Apache on a 256 MB VM, which clearly thrashes
under load.

Tonight I will purchase an upgrade to 512 MB of RAM and play with nginx.

I'm sorry for the inconvenience.

p.s. I do have WP-Supercache enabled and a PHP bytecode cache. I _could_ just
host at wordpress.com, but I might as well learn nginx while I'm at it...

~~~
stevenp
Such a good problem to have, Chad. ;)

