Your Version Control and Build Systems Don't Scale (chadaustin.me)
94 points by chadaustin on March 4, 2010 | 53 comments

A friend of mine expressed the problem described here in a more formal manner in this paper: http://gittup.org/tup/build_system_rules_and_algorithms.pdf

He ended up creating his own build system, "tup" < http://gittup.org/tup/ >, based off of it. It also has the property desired in this article that, "No-op builds should be O(1) and instantaneous, and most other builds should be O(WhateverChanged)".
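To make the O(WhateverChanged) idea concrete, here is a minimal sketch (not tup's actual code): assume the tool already knows which files changed and keeps a reverse dependency map, then walk upward from the changes only. The names `reverse_deps` and `affected_targets` are made up for illustration.

```python
from collections import deque

def affected_targets(reverse_deps, changed):
    """Return the targets that need rebuilding, in breadth-first discovery order.

    reverse_deps maps a file to the files built directly from it. This is a
    hypothetical in-memory form; a real tool would persist it between runs.
    Only nodes reachable from `changed` are ever visited, so the cost tracks
    the size of the affected subgraph, not the size of the whole project.
    """
    seen = set(changed)
    queue = deque(changed)
    rebuild = []
    while queue:
        node = queue.popleft()
        for dependent in reverse_deps.get(node, ()):
            if dependent not in seen:
                seen.add(dependent)
                rebuild.append(dependent)
                queue.append(dependent)
    return rebuild

# Toy project: two C files sharing one header, linked into one binary.
reverse_deps = {
    "a.c": ["a.o"],
    "b.c": ["b.o"],
    "util.h": ["a.o", "b.o"],
    "a.o": ["app"],
    "b.o": ["app"],
}

print(affected_targets(reverse_deps, ["a.c"]))
print(affected_targets(reverse_deps, []))
```

With nothing changed the loop never runs, which is the "no-op builds should be instantaneous" property. Note the result is the set of work to do, not necessarily a valid build order; a real tool would still sort it topologically.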

Thanks for the link! I'll take a look at that and change my strategy accordingly.

The graph on page 25 is thought-provoking (tup-vs-make on a change-one-of-N files test, N = 1 to 100k).

Interesting quote: When a single C file is changed in the 100,000 file case, make can take over half an hour to figure out which file needs to be compiled and linked.

tup looks very cool, but I don't see how to fetch the source.

this seems to work: git clone git://gittup.org/tup.git

I strongly disagree with the assertion that C compilation is I/O bound. If you have optimizations turned off, this is probably true, but if you are using a decent optimizing compiler, this is not the case.

We ran some tests where I work on a pretty good sized code-base and found that we were CPU bound, even on an 8-core system.

[edit] There is an exception, if you are using networked file systems of some type (especially clearcase dynamic views) then you are almost certainly I/O bound.

That was kind of a tangential point to the article, and it was merely the inspiration for the name, but I'll ask Dusty for the citation.

This was years ago, so who knows.

[Update: he says he thinks it was on the D mailing list years ago, and was probably related to the fact that every source file in a nontrivial C program includes dozens or hundreds of headers, especially with a naive compiler.]

[Update 2: TinyCC http://bellard.org/tcc/ compiles at 30 MB/s. It's not hard to imagine a dual-core CPU on a conventional drive failing to feed the compiler enough source to be CPU-bound.]

Have you tested the difference between cold and warm FS caches on a project with a shitload of files? There's a reason why the core Linux devs love Intel SSDs!

The reason that parallel make helps so much (especially with 2x more jobs than processors) is that you aren't idling while blocked for I/O, and are piling up the I/O queue higher so the scheduler can de-randomize some I/O (elevator algorithm FTW).

I wish I could upvote this more. There is no way compilation is I/O bound.

Even over the network this is true. My old lab had homedirs all on NFS (gigabit I think, but that's not too important) and all our builds were CPU-bound, even without raising the -j argument from cores+1. The same lab (before I was there) even wrote a paper on why the "compile" test is a horrible one for measuring filesystem performance; it just doesn't do enough I/O.

I'd be interested to see what happens if you were to compile something large written in assembly (or close to it), but I'd put money that just the sequential optimizer would be slower than I/O.

EDIT: found the paper and an article about it http://www.linux-mag.com/cache/7464/1.html http://www.fsl.cs.sunysb.edu/project-fsbench.html

If C compilation were I/O bound then ccache would not speed things up. It does. I've used it on my NFS mounted filesystems and it speeds things up there too.

There is clearly a point where, if your filesystem is slow enough, the process does become I/O bound. I have no experience with ClearCase, but from what I've heard it's molasses-slow and might be at that point.

Years ago I implemented a network file cache using a file-modification-count returned in every open handle. At that point it became an order of magnitude faster to build over the network than local, because the local file system had no such cache. Anyway this race between speed, space and network performance has been around a while.

This is the one thing I miss from working with Java in Eclipse. There, you're always in a built state so every time you save the file you're working on, you instantly get all the compiler errors if any. There's also a large amount of errors the editor can detect without invoking the compiler, and you get constant feedback about those as well.

My current main environment is Visual Studio for C#, and there you get a lot of errors detected while editing, but not all of them, so you have to continually press the rebuild button, and the time of that rebuild just keeps growing...

Then again, I recently bought a SSD, that helps a bit with build times. :-)

If only there were some sort of tool you could use which would automatically detect which dependencies have changed and compile only those files...

I swear, I read "press the rebuild button" and wanted to cry.

Edit: the sarcasm-free version of the above amounts to this: there was a time when the idea of "writing software to help you write software" was a standard notion, something that everyone did. It seems like the modern world is training a generation of programmers who don't understand this, and who see the "end result" software as the only software worth writing. The idea of tool creation and integration is alien to them. That's what the IDE buttons are for.

I meant the build button, of course you don't have to do a full rebuild every time, that would be insane. Did you honestly think Visual Studio or the cs compiler is that retarded?

However, it doesn't change the fact that you still have to perform it. If you're changing a file that has a lot of dependencies, all of those have to be re-built, unlike if you're sitting in Eclipse coding Java, where you don't.

The only reason incremental compilation is possible in Java is that the language completely lacks abstractions like macros and overloading which can globally affect the meaning of existing code. I already spend more time reading and writing code than recompiling, so making recompiling faster by making the code more verbose, repetitive, and error-prone is not a tradeoff I'm happy about.

I think this is wrong. Incremental rebuilds are supported just fine by languages other than Java. cabal does incremental rebuilds of Haskell (and Template Haskell) projects just fine.

If you define every function in terms of a macro, and change that macro, though, of course you have to rebuild the whole project.

I don't see how macros should make everything depend on everything else. When you want to use a macro you (should) still need indicate somehow where it comes from.

They make compilation of the macro's clients depend on the implementation of the macro, and therefore those clients must be recompiled when the macro changes. In Java, there's no expectation that changing the implementation of a method in one java file would cause another java file to be recompiled.

But if you add a parameter to a function then don't all the files that have calls to that function have to be recompiled (so that you can see the error)?

I don't know Java so I'm assuming it's similar to other compilers. Please correct me if I'm wrong.

Or how about changing types? Java's weak type system should make it necessary to re-compile (and even change the sources).

If you change a parameter, then you are changing the interface, and it makes sense that that affects recompilation.

Changing the type is an interesting one. If you change the backing class from one type to another, but you were using an interface to access it, then the Java compiler doesn't need to recompile that code -- the compiled InvokeInterface bytecode for that method invocation doesn't change. However -- I feel like I may have read this somewhere -- there are optimizations which might cause it to replace InvokeInterface with static invocations when it can determine at compile time what class is used. If that's the case, then it would have to recompile the client class too.

I was thinking about changing the type of a parameter and return type in parallel --- where the application only treats them as black boxes and just gives and takes those objects, but never looks at them.

About Eclipse: in most cases Eclipse never "invokes" a Java compiler--the compiler is built into the IDE! That's right, those .class files are getting built by the JDT Core plug-ins that come with Eclipse [1]. javac is never even used.

Of course, you could configure Eclipse to run ant or make or something else to compile your code, but the editor is still driving the integrated incremental compiler to mark up all your syntax errors.

[1] http://www.eclipse.org/jdt/core/index.php

This is somewhat misleading.

JDT is a compiler in the same sense of the term as javac is a compiler.

It's just a different compiler, with some different features.

I think you're just misparsing the parent post re: in most cases Eclipse never "invokes" a Java compiler. The meaning seems to be that a Java compiler is never invoked (because it's built into the IDE), not that a Java compiler is never used.

it's not hard to invoke whatever compiler you want. when i write python in eclipse using pydev, i get build errors when i save (though i run the programs from the command line for various reasons).

Good article about large builds. I know of at least one outfit with a large legacy C++ codebase that spent the effort to move to C#, in no small part due to unmanageable build times. This would be under the "constant reduce" leg of this journey, but with C# you cut your file count in half, all other things being equal.

A minor nit: most other builds should be O(WhateverChanged) -- consider a C++ header file used by 10 C++ files. Those 10 would need to be recompiled. The agony there is that you wish for some less-than-full-file dependency analysis so that only the files that use that constant you just changed need to recompile.

On the whole a good start to those massive builds you have.

Also worth pointing out is the ability to modify the MSBuild file to use parallel build paths - assuming the dependency tree is fairly flat.

I am sure that they are well aware of this. The issue is that the total compute time to compile their build is significant, and significantly (several multiples of 2) reduced by going to c#. Compile time increases development costs, execute time increases hardware costs.

Which, in case you didn't know, the MSFT build stack was able to do with C++ projects long before it could with C# projects. I wonder if wglb's referenced company would have liked to know that. ;)

True, last I checked MSBuild support for VC++ projects was lacking. I think their solution was to run the build through the visual studio command interface...hopefully it's gotten better.

Here's a mirror (don't know how well it will hold up): http://hildr.luminance.org/ibb/

Apparently, neither does his hosting provider...

The idea of optimizing build performance for huge monolithic chunks of code seems somewhat misguided. Isn't it better to break your product up into a set of reasonably sized libraries that can be built separately? That way when you change something in one module you can just rebuild that one library, and then perhaps relink it with the others. That general approach can usually deliver fast development cycles with any language or build tool.

I like the idea of in-memory change detection... there's a definite use there. I don't agree with the idea of storing all files in memory (especially given their complaint that Git is slow on 20GB of files...), though I admit grepping that is fast.

But what if it were simplified to just a live ls-diff based on when a command was last run / a timestamp? If Git / Make would hook into something like that it'd be really useful.

ibb, at its core, is just an in-memory tree of the filesystem metadata. Search is implemented as a plugin that stores the file contents in memory.

And ibb is proof of concept, showing that O(1) is possible, desirable and useful. Having said that, I've been using it daily, because we have > 200MB of PHP/JS/CSS/HTML at IMVU.
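For anyone wondering what "an in-memory tree of the filesystem metadata" buys you, here is a rough sketch of the idea, not ibb's code. It keeps a flat dict rather than a tree, and it polls with a full walk, which is still O(N) per scan; a live version would presumably be kept up to date by OS change notifications so that asking "what changed?" needs no walk at all. `snapshot` and `diff` are illustrative names.

```python
import os

def snapshot(root):
    """Map each file path under root to its (mtime_ns, size) metadata."""
    meta = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            meta[path] = (st.st_mtime_ns, st.st_size)
    return meta

def diff(old, new):
    """Return (added, removed, modified) as sorted lists of paths."""
    added = sorted(p for p in new if p not in old)
    removed = sorted(p for p in old if p not in new)
    modified = sorted(p for p in new if p in old and new[p] != old[p])
    return added, removed, modified
```

A build tool or VCS could keep the last snapshot in a long-running process and ask only for the diff, which is the "live ls-diff" suggested above.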

Gotcha, thanks. That known, I quite like the idea. Any way a makefile could be twonked to use it? (haven't looked at all, don't really know where to start)

I believe it only stores files in RAM for search, not builds.

I agree with the general thrust of the article in that iteration time is one of the most important factors of any kind of creative work, programming included. The longer you go without feedback or corrections, the less confident you can be that you're on the "right track."

That said, build systems and VCS aren't the only ways to optimize the process. For example, a system using scripts processed at runtime lets you reduce the amount of compiled code and iterate quickly on the one section by pressing a "reload script" button, or, even more conveniently, monitoring the source file and auto-reloading when changes are saved. If you're daring enough you could probably even do this with machine-compiled languages by abusing dynamic linking, though the ease of doing that would ultimately depend on the runtime linker capabilities.

Doing this kind of data-driven thing will cost runtime performance since less can be compiled and optimized statically, but if you treat it as the "scaffolding" work that it is and also include ways to reclaim some of the static factors for release builds, you can get a better overall result than you would if you were just suffering through long builds.
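The "monitor the source file and auto-reload" approach can be sketched in a few lines. This is only an illustration under simple assumptions: the script is a Python file, it is trusted, and the host polls by calling `refresh()` (for example once per frame). `ReloadingScript` is a made-up name.

```python
import os

class ReloadingScript:
    """Re-executes a script file whenever its metadata changes."""

    def __init__(self, path):
        self.path = path
        self.namespace = {}
        self._stamp = None  # (mtime_ns, size) of the version last loaded

    def refresh(self):
        """Reload if the file changed; return True when a reload happened."""
        st = os.stat(self.path)
        stamp = (st.st_mtime_ns, st.st_size)
        if stamp == self._stamp:
            return False
        with open(self.path, encoding="utf-8") as f:
            source = f.read()
        namespace = {}
        # Run the script in a fresh namespace so stale names do not linger.
        exec(compile(source, self.path, "exec"), namespace)
        self.namespace = namespace
        self._stamp = stamp
        return True
```

The host then reads tunables or callbacks out of `script.namespace` after each `refresh()`, so editing and saving the script takes effect without rebuilding anything.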

Whilst I hated clearcase as a version control system (and I mean 'fire and stakes' hate, not just 'cross the road') - it did have one good trick up its sleeve:

Since the dynamic views into the source were implemented as a custom file system, they could track all the file access that went into building each object. They supplied a custom version of make that pulled that info. out to construct complete dependency descriptions. If you had the toolchain in clearcase as well, that would include all the compiler bins, libs and includes.

Once that info was cached, and as they had code at filesystem level, it could easily be invalidated as files were touched.

Another product of this mechanism was that if you tried to build something that someone else had already (all the same inputs), it would just grab it over the network. In practice this meant that the nightly build would cache the bulk of the object files.
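The "all the same inputs" trick can be illustrated with a toy cache keyed on a hash of everything that went into a build step. This is a sketch of the general idea, not ClearCase's implementation: `input_key` and `SharedObjectCache` are hypothetical names, and the dict stands in for whatever shared store sits on the network.

```python
import hashlib

def input_key(inputs):
    """Hash the names and contents of every input to a build step.

    `inputs` maps a file name to its bytes. In the scheme described above,
    the list of inputs would come from the filesystem's access tracking.
    """
    h = hashlib.sha256()
    for name in sorted(inputs):
        h.update(name.encode())
        h.update(b"\0")
        h.update(hashlib.sha256(inputs[name]).digest())
    return h.hexdigest()

class SharedObjectCache:
    """Reuses a build result whenever the exact same inputs were built before."""

    def __init__(self):
        self._objects = {}  # stand-in for a store shared between developers
        self.builds = 0     # how many times the real build step actually ran

    def get_or_build(self, inputs, build_fn):
        key = input_key(inputs)
        if key not in self._objects:
            self.builds += 1
            self._objects[key] = build_fn(inputs)
        return self._objects[key]
```

If the nightly build populates the store, a developer whose inputs match gets the object without compiling; touching any tracked input (a header, or even a compiler binary if the toolchain is tracked too) changes the key and forces a real build.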

FWIW: I'm not sure I buy that 21 minute Mozilla build, unless he was working on a cold disk, or unless the build is doing a bunch of stuff with broken dependencies. If the build was just done, the file metadata is in the page cache, and you can read a staggering number of dents in 21 minutes.

I've got a slow Pentium 4. I assure you, it's a real timing... A no-op build on my Core 2 Duo laptop is ~7 minutes.

Either way is ridiculous.

7 minutes following a successful build? That just can't be due to stat/getdents() calls, it can't. Check the log, no doubt there's a broken dependency in there causing something to happen (e.g. an unconditional copy done, and things that depend on the copied file). Again, you should be able to get hundreds of millions (!) of files stat'd in those seven minutes.

There may well be something wrong with the mozilla build, but I swear, it's not the metadata reads from make.
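A quick way to sanity-check that on your own machine, as a rough sketch only: the rate depends heavily on OS, filesystem, and whether the metadata is cached, and Python's own overhead will understate what a C program like make can do. For scale, 100 million files in 7 minutes works out to roughly 238,000 stat calls per second. `stat_rate` and `required_rate` are names made up for this example.

```python
import os
import time

def stat_rate(paths, repeats=3):
    """Measure stat() calls per second over the given paths (warm cache)."""
    start = time.perf_counter()
    for _ in range(repeats):
        for path in paths:
            os.stat(path)
    # Guard against a zero interval on very coarse clocks.
    elapsed = max(time.perf_counter() - start, 1e-9)
    return len(paths) * repeats / elapsed

def required_rate(file_count, seconds):
    """Stat calls per second needed to cover file_count files in the window."""
    return file_count / seconds

if __name__ == "__main__":
    paths = [os.path.join(os.curdir, name) for name in os.listdir(os.curdir)]
    if paths:
        print("measured stats/sec:", round(stat_rate(paths)))
    print("needed for 100M files in 7 min:", round(required_rate(100_000_000, 7 * 60)))
```

Comparing the measured number against the file count of the source tree gives a rough upper bound on how much of a no-op build could plausibly be metadata reads.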

Is a cold disk unrealistic? I can't imagine the entire source tree for mozilla staying in cache - that's a lot of stat() entries.

regardless, even if it's not 21 minutes, my experience building mozilla with make was definitely slow.

A typical desktop is 4GB, stuff stays in cache for weeks on my laptop. Yes, I'd say it's unrealistic. Especially since the use case here is a "minimally changed build" for a developer. You build it once, then change something, then build again. You don't reboot in the middle.

Yeah, I'm not sure I buy it either. A full Mozilla build takes roughly 24 minutes on my Windows notebook (it's a Core i7, but Windows has slower process creation than Linux).

Disable make’s implicit rules with -r or explicitly in the Makefile:

    %:: %,v
    %:: RCS/%
    %:: RCS/%,v
    %:: s.%
    %:: SCCS/s.%
    %.c: %.w %.ch

This gave me a significant speedup for no-op builds (0.7s → 0.1s, ~500 goals in the Makefile).

Of course you can only do this if you do not rely on these implicit rules.

My version control and build systems scale better than your blog.

Really? If you mean 'your version control and build systems can handle more load than his blog,' perhaps you are correct. But the expression 'scale' refers to the rate at which increases in load place increased demand on resources. Unless you have a very specialized infrastructure for VCS and builds, it probably doesn't scale at a lower rate than his blog infrastructure scales.

Do you know what his blog infrastructure is? Over 3 hours later it's still not working. Git is decentralized and can and does scale. Builds can be spread across multiple machines on EC2.

From 2003 until a few months ago, I hosted my website off of a Pentium 2 in my apartment. Behind slow DSL, no less.

Once my blog posts started reaching Hacker News I thought "Oh, I'll just move the site out of the apartment and into the cloud!" and bought an account at prgmr.com (which I highly recommend, by the way).

However, I serve WordPress via Apache on a 256 MB VM, which clearly thrashes under load.

Tonight I will purchase an upgrade to 512 MB of RAM and play with nginx.

I'm sorry for the inconvenience.

p.s. I do have WP-Supercache enabled and a PHP bytecode cache. I _could_ just host at wordpress.com, but I might as well learn nginx while I'm at it...

Such a good problem to have, Chad. ;)

nginx will be a huge improvement in terms of resource usage. :-)
