Hacker News
How refactoring a macro reduced my Rust project build time from 4 hours to 40sec (rust-lang.org)
124 points by iceiix on Jan 19, 2020 | 23 comments



This is clearly an O(n^2) situation somewhere in LLVM. Usually it is pretty easy to fix by adding a cache or two, or by changing a linear search to a hash map.
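The "linear search to hash map" fix can be sketched in Rust. This is purely illustrative (invented pass-internal data, not actual LLVM code): a naive per-query scan is O(n) per lookup and O(n^2) over a whole pass, while a one-time index brings lookups down to O(1) average.

```rust
use std::collections::HashMap;

// Hypothetical pass data: (value id, defining block id) pairs.
// Naive version: linear scan per query -> O(n) per lookup, O(n^2) overall.
fn find_block_linear(defs: &[(u32, u32)], value: u32) -> Option<u32> {
    defs.iter().find(|&&(v, _)| v == value).map(|&(_, b)| b)
}

// Cached version: build a HashMap once (O(n)), then O(1) average per lookup.
fn build_block_index(defs: &[(u32, u32)]) -> HashMap<u32, u32> {
    defs.iter().copied().collect()
}

fn main() {
    let defs: Vec<(u32, u32)> = (0..10_000).map(|i| (i, i % 7)).collect();
    // Same answer either way; only the asymptotics differ.
    assert_eq!(find_block_linear(&defs, 9_999), Some(9_999 % 7));
    let index = build_block_index(&defs);
    assert_eq!(index.get(&9_999).copied(), Some(9_999 % 7));
}
```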

If OP could produce an LLVM reproduction sample, that would be a great help for the LLVM team.


Does anyone know what LLVM does that breaks for long functions? Is the optimization pass doing something quadratic in the number of basic blocks in the function?


This is likely O(n^2) behavior in the processing of instructions, basic blocks, or something related. Optimization passes are especially prone to this.


I hope this is considered a compiler bug.


Compiling a single function containing over 100,000 lines is a rather pathological case. And while you could argue it's a compiler bug, it's honestly not one I'd expect anyone to be spending any time on.


I think this is a case of "low priority" rather than "not a bug". Should compiling a 100k-line function take an hour? No. Should fixing that be a higher priority than making said function run 2x faster? Maybe not. (Hypothetical, of course; a 2x win in optimized code would imply some unexpectedly bad codegen in optimizing modes, or new techniques.)


Indeed, you can hit large-function-causes-unexpected-behavior situations with even smaller functions sometimes - https://github.com/rust-lang/rust/issues/60753


Almost any "large anything" hits pathological behavior in a compiler.

For example, try to compile a Rust `match int { 0 => 0, ..., MAX => MAX }` statement with one arm per `int` value. For `u8` this works reasonably fast, 1-2 seconds on my machine, so by linear extrapolation I'd expect compilation to take roughly 4-8 minutes for a `u16` with `u16::MAX` match arms, yet it takes more than 30 minutes [0].

So the complexity of compiling Rust match statements is not linear in the number of match arms. And this is with a single language feature in isolation.

Now write a 100k LOC function using many different language features, each of which has a worst-case complexity larger than O(N), with different constant factors.

[0]: It doesn't really matter much to me why this is the case. E.g. the compiler might be building extra data structures that fill up RAM, at which point my system starts swapping and everything crawls. The point is that the problem is too large for my system, and one can hit this problem on any system with a sufficiently large input.
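A sketch of a generator for the pathological input described above (the `gen_exhaustive_match` helper is invented, just to show the shape of the source being fed to rustc):

```rust
// Hypothetical helper: emit a Rust function containing one match arm
// per value of an unsigned integer type in [0, max].
fn gen_exhaustive_match(ty: &str, max: u64) -> String {
    let mut src = format!("fn id(x: {ty}) -> {ty} {{\n    match x {{\n");
    for v in 0..=max {
        src.push_str(&format!("        {v} => {v},\n"));
    }
    src.push_str("    }\n}\n");
    src
}

fn main() {
    // For u8 this is 256 arms; for u16 it is 65_536 arms -- 256x the input,
    // but per the comment above, far more than 256x the compile time.
    let src = gen_exhaustive_match("u8", u8::MAX as u64);
    assert_eq!(src.matches("=>").count(), 256);
}
```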


It is a bug worth fixing if something in there is going O(n^2) for some reason.

If, however, the system is just O(n)*k where k is lines of code, there probably isn't much you can do.


Isn’t O(n) here relative to lines of code? So O(n) * k is just O(n^2)?


Not necessarily.

For example, you could have a register coloring algorithm that runs in O(n^2) time where n has no direct relation to the number of lines of code. It is not unusual for an optimization pass to have quadratic worst-case behavior but much better average behavior, making it worth it 99% of the time while going pathological the other 1%.
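A toy illustration of that point, with an invented "interference check" standing in for a real pass: comparing every pair of items is quadratic in n, where n is something pass-internal (here, live ranges), not lines of source code.

```rust
// Hypothetical pass-internal data: live ranges as (start, end) intervals.
fn ranges_overlap(a: (u32, u32), b: (u32, u32)) -> bool {
    a.0 < b.1 && b.0 < a.1
}

// Pairwise interference check: n*(n-1)/2 comparisons no matter what --
// the quadratic cost is in n (number of ranges), not lines of code.
fn count_interferences(ranges: &[(u32, u32)]) -> (usize, usize) {
    let mut comparisons = 0;
    let mut conflicts = 0;
    for i in 0..ranges.len() {
        for j in (i + 1)..ranges.len() {
            comparisons += 1;
            if ranges_overlap(ranges[i], ranges[j]) {
                conflicts += 1;
            }
        }
    }
    (comparisons, conflicts)
}

fn main() {
    // 100 disjoint ranges: zero conflicts, yet still 100*99/2 = 4950 comparisons.
    let ranges: Vec<(u32, u32)> = (0..100).map(|i| (i * 2, i * 2 + 1)).collect();
    let (comparisons, conflicts) = count_interferences(&ranges);
    assert_eq!(comparisons, 4950);
    assert_eq!(conflicts, 0);
}
```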


May I ask what the n refers to then?


Um, anything internal to the compiler that it is allocating.

It could be number of registers spilled (graph coloring algorithms are NP-complete but have useful heuristic solutions ... most of the time), number of macros allocated, loop iterations unrolled, etc.

If any of those things has a quadratic (or worse(!)) behavior, they may make compile times blow up if you hit a pathological boundary condition.

And, sometimes I might even want that quadratic behavior optimization. If I'm trying to squeeze every single byte out of a program because I am on a memory constrained system, I'm probably pretty happy to let the compiler churn for a couple of hours to crush my program into memory if it means I don't have to start rewriting code that is nominally correct and risk introducing bugs.


A single function that is >100k, damn! I thought a 1300-1440 line Fortran 77 subroutine was over the top back in the day.


I’m not sure if this is still the case, but Akamai previously had over 18,000 lines in the main C function in the heavily modified version of Squid they used for their CDN.


Back in the day, on the Nokia S80 phones, there was the infamous Phone class that had about 30 kLOC in the constructor ;-)


GCC does this too, fwiw. Whether or not it is considered a bug, it doesn't have one particular root cause. Several common optimization passes assume a small function body; since functions are typically small, many compilers use the function body as the window for heavyweight passes.

In other words, intraprocedural optimisation has always been more aggressive than interprocedural optimisation. If you try inlining every function ever into one giant function, you'll be forced to scale back optimization levels.


There seems to be room for a middle ground then...

   if IR_count > ...
     warn "Can't run optimiser ... pass on {func} (too large)"
   else
     run_pass


I can't speak for the toolchain maintainers. But with appropriate flags to force or skip those passes, yes I can see something like this being merged in.


This is good practice in general: put the runtime logic in functions, then write macros with just enough syntax sugar to bridge the gap to those functions.
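A minimal Rust sketch of that pattern (names invented): the macro only handles syntax, while all the code that actually gets compiled and optimized lives in one ordinary function, so it is instantiated once rather than expanded in full at every call site.

```rust
// The runtime logic lives in a plain function: compiled once, easy to test.
fn log_at(level: &str, msg: &str) -> String {
    format!("[{level}] {msg}")
}

// The macro is only syntax sugar bridging to that function.
macro_rules! log {
    ($level:ident: $($arg:tt)*) => {
        log_at(stringify!($level), &format!($($arg)*))
    };
}

fn main() {
    let line = log!(warn: "retrying {} of {}", 2, 3);
    assert_eq!(line, "[warn] retrying 2 of 3");
}
```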


TL;DR: Don't autogenerate a large monolithic function. Generate many small functions instead.


It sounds like this is an LLVM problem. So could this same thing happen in C?


I think I'm seeing this in C code too with clang. I have fairly big code-generated switch-case statements in my CPU emulators with hundreds to thousands of case blocks (Z80: [1], 6502: [2]), and those take up to 7 seconds to compile with -O3, which is unusually long for a single C compilation unit of that size (compared to the same number of lines spread across many smaller functions). If the problem scales superlinearly, it could easily blow up to minutes or hours for larger functions.

[1] https://github.com/floooh/chips/blob/d0f690f2598f967f75cb360...

[2] https://github.com/floooh/chips/blob/d0f690f2598f967f75cb360...



