lingolango's comments

lingolango · 2025-09-11T13:02:53 1757595773

>since the bottleneck in large codebases is not a compute, e.g. headers preprocessing, but it's a memory bandwidth.

SSD bandwidth: 4-10GB/s. RAM bandwidth: 5-10x that, say 40GB/s.

If compute was not a bottleneck, the entire linux kernel should compile in less than 1 second.

jayd16 · 2025-09-11T18:34:22 1757615662

This is making the assumption that source is read once and that there is no intermediate data to write and read. Unless the working set fits in cache, you'll have I/O and can be I/O bound.

menaerus · 2025-09-11T13:22:22 1757596942

On a 40-core or 64-core machine there's more compute than you will ever need for a compilation process. In most cases where it actually matters, compilation is a heavy I/O workload, not a heavy compute workload.

lingolango · 2025-09-11T13:36:43 1757597803

Linux is ~1.5GB of source text and the output is typically a binary less than 100MB. That should take a few hundred milliseconds to read in from an SSD or be basically instant from RAM cache, and then a few hundred ms to write out the binary.

So why does it take minutes to compile?

Compilation is entirely compute bound, the inputs and outputs are minuscule data sizes, in the order of megabytes for typical projects - maybe gigabytes for multi million line projects, but that is still only a second or two from an SSD.
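
As a rough check of that arithmetic, here is a minimal sketch that assumes the source is streamed exactly once at the bandwidths quoted in this thread. The helper name `transfer_seconds` is made up for illustration, and the calculation ignores intermediate files and repeated header reads.

```python
# Lower bound on build time if streaming the data once were the only cost.
# Sizes and bandwidths are the rough figures quoted in this thread.
GB = 1e9
MB = 1e6

source_bytes = 1.5 * GB      # ~1.5GB of Linux source text
output_bytes = 100 * MB      # binary of less than 100MB (upper figure)
ssd_bandwidths = {"slow SSD": 4 * GB, "fast SSD": 10 * GB}  # 4-10GB/s
ram_bandwidth = 40 * GB      # "say 40GB/s"

def transfer_seconds(num_bytes, bytes_per_second):
    """Time to stream num_bytes once at a sustained bandwidth."""
    return num_bytes / bytes_per_second

for name, bw in ssd_bandwidths.items():
    t = transfer_seconds(source_bytes + output_bytes, bw)
    print(f"{name}: {t * 1000:.0f} ms to read the source and write the binary")
print(f"RAM: {transfer_seconds(source_bytes, ram_bandwidth) * 1000:.1f} ms to read the source")
```

Under these assumptions the whole transfer fits in well under a second, which is the gap the comment is pointing at.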

bluGill · 2025-09-11T13:49:53 1757598593

I don't build linux from source, but in my tests with large machines (and my C++ work project with more than 10 million lines of code) somewhere between 40 and 50 cores compile speed starts decreasing as you add more cores. When I moved my source files to a ramdisk the speed got even worse so I know disk IO isn't the issue (there was a lot of RAM on this machine so I don't expect to run low on RAM even with that many cores in use). I don't know how to find the truth, but all signs point to memory bandwidth being the issue.

Of course the above is specific to the machines I did my testing on. A different machine may have other differences from my setup. Still my experience matches the claim: at 40 cores memory bandwidth is the bottleneck not CPU speed.

Most people don't have 40+ core machines to play with, and so will not see those results. The machines I tested on cost > $10,000 so most would argue that is not affordable.

menaerus · 2025-09-11T13:56:13 1757598973

One of the biggest reasons people see such a large compilation speedup on Apple M chips is the massive memory bandwidth improvement over other machines, even some older servers: 100GB/s of main memory bandwidth from a single core. It doesn't scale linearly as you add more cores to the workload, due to L3 contention I'd say, but it goes up to 200GB/s IIRC.

Someone · 2025-09-11T13:51:32 1757598692

> So why does it take minutes to compile?

I’m not claiming anything about it being I/O or compute bound, but you are missing some sources of I/O:

- the compiler reads many source files (e.g. headers) multiple times

- the compiler writes and then reads lots of intermediate data

- the OS may have to swap out memory

Also, there may be resource contention that makes the system do neither I/O nor compute for part of the build.

lingolango · 2025-09-11T14:35:38 1757601338

Tried building sqlite amalgamation just now.

Input: single .c file 8.5MB.

Output: 1.8MB object file.

Debug build took 1.5s.

Release build (O2) took about 6s.

That is about 3 orders of magnitude slower than what this machine is capable of in terms of IO from disk.
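
The "orders of magnitude" claim can be checked with a short sketch. It assumes the 4-10GB/s SSD range quoted earlier in the thread (the actual disk in this machine is not stated), and `slowdown_vs_disk` is a made-up helper name.

```python
import math

MB = 1e6
GB = 1e9

input_bytes = 8.5 * MB                      # sqlite amalgamation, single .c file
build_seconds = {"debug": 1.5, "release -O2": 6.0}
ssd_bandwidth_range = (4 * GB, 10 * GB)     # assumed range, taken from earlier in the thread

def slowdown_vs_disk(num_bytes, seconds, disk_bandwidth):
    """How many times slower the build consumed its input than the disk could deliver it."""
    effective_rate = num_bytes / seconds
    return disk_bandwidth / effective_rate

for label, secs in build_seconds.items():
    low, high = (slowdown_vs_disk(input_bytes, secs, bw) for bw in ssd_bandwidth_range)
    print(f"{label}: {low:.0f}x to {high:.0f}x slower than disk "
          f"(10^{math.log10(low):.1f} to 10^{math.log10(high):.1f})")
```

With these inputs the ratio lands between roughly 10^2.8 and 10^3.8, consistent with "about 3 orders of magnitude".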

sgerenser · 2025-09-11T18:26:13 1757615173

The fact that something doesn’t scale past X cores doesn’t mean that it is I/O bound! For most C++ toolchains, any given translation unit can only be compiled on a single core. So if you have a big project, but there’s a few files that alone take 1+ minute to compile, the entire compilation can’t possibly take any less than 1 minute even if you had infinite cores. That’s not even getting into linking, which is also usually at least partially if not totally a serial process. See also https://en.m.wikipedia.org/wiki/Amdahl%27s_law
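
A minimal sketch of that argument, using a hypothetical project (1000 files at 2s each plus a single 60s file, and a 10s serial link) and made-up function names; it models only the lower bound, not a real scheduler.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Amdahl's law: speedup when only parallel_fraction of the work scales with cores."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)

def build_time_lower_bound(tu_seconds, cores, link_seconds=0.0):
    """Lower bound on wall time when each translation unit compiles on a single core.

    The compile phase can be no faster than the slowest translation unit,
    and no faster than the total work spread evenly across all cores.
    """
    compile_bound = max(max(tu_seconds), sum(tu_seconds) / cores)
    return compile_bound + link_seconds

# Hypothetical project, not measured data.
tus = [2.0] * 1000 + [60.0]
for cores in (8, 40, 64, 10_000):
    bound = build_time_lower_bound(tus, cores, link_seconds=10.0)
    print(f"{cores} cores: at least {bound:.1f}s")
```

In this toy model the bound stops improving once the per-core share drops below the 60s file, so scaling flattens somewhere between 8 and 40 cores without any I/O or bandwidth limit being involved.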

menaerus · 2025-09-11T13:49:44 1757598584

The resulting output is 100MB, but the process of compilation accumulates orders of magnitude more data. The evidence is the constant memory pressure you see on 32GB, 64GB, or even 128GB systems. Given that compilation takes a non-trivial amount of time even on such high-end systems, tens of minutes, how much data do you think bounces in and out of memory? It adds up to a lot more than what you suggest.

anarazel · 2025-09-11T15:00:56 1757602856

This is just wildly wrong.

On an older 2 socket workstation, with relatively poor memory bandwidth, I ran a linux kernel compile.

  perf stat --topdown --td-level 2

indicates that memory bandwidth is not a bottleneck. Fetch latency, branch mispredicts and the frontend are.

I also analyzed the memory bandwidth using

  perf stat --per-socket  -M memory_bandwidth_read,memory_bandwidth_write -a -r0 sleep 1

and it never gets anywhere close to the memory bandwidth the system can trivially utilize (it barely reaches the bandwidth a single core can utilize).

iostat indicates there are pretty much no reads/writes happening on the relevant disks.

Every core is 100% busy.

menaerus · 2025-09-11T15:18:27 1757603907

It is not wildly wrong, be more respectful please since I am speaking from my own experience. Nowhere in my comment have I used Linux kernel as an example. It's not a great example neither since it's mostly trivial to compile in comparison to the projects I had experience with.

Core can be 100% busy but as I see you're a database kernel developer you must surely know that this can be an artifact of a stall in a memory backend of the CPU. I rest my case.

anarazel · 2025-09-11T15:31:20 1757604680

> Nowhere in my comment have I used Linux kernel as an example. It's not a great example neither since it's mostly trivial to compile in comparison to the projects I had experience with.

It's true across a wide range of projects. I build a lot of stuff from source and I routinely look at performance counters and other similar metrics to see what the bottlenecks are (I'm almost clinically impatient).

Building e.g. LLVM, a project with much longer per-translation unit build times, shows that memory bandwidth is even less of a bottleneck. Whereas fetch latency increased as a bottleneck.

> Core can be 100% busy but as I see you're a database kernel developer you must surely know that this can be an artifact of a stall in a memory backend of the CPU. I rest my case.

Hence my reference to doing a topdown analysis with perf. That provides you with a high-level analysis of what the actual bottlenecks are.

Typical compiler work (with typical compiler design) has lots of random memory accesses. Due to access latencies being what they are, that prevents you from actually doing enough memory accesses to reach a particularly high memory bandwidth.
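
One way to see why latency caps achievable bandwidth is Little's law: throughput equals requests in flight divided by latency. The sketch below uses assumed values (64-byte cache lines, ~100ns DRAM latency, 1 versus 10 outstanding misses); none of these figures come from the measurements in this thread.

```python
CACHE_LINE_BYTES = 64  # typical x86-64 line size (assumption)

def latency_bound_bandwidth(latency_ns, outstanding_misses, line_bytes=CACHE_LINE_BYTES):
    """Bytes per second one core can pull from DRAM when limited by miss latency.

    Little's law: throughput = requests in flight / latency per request.
    """
    return outstanding_misses * line_bytes / (latency_ns * 1e-9)

# A dependent pointer chase keeps roughly one miss in flight;
# a streaming access pattern can keep many more (10 is an illustrative guess).
for label, in_flight in (("pointer chase", 1), ("streaming", 10)):
    gb_per_s = latency_bound_bandwidth(100, in_flight) / 1e9
    print(f"{label}: {gb_per_s:.2f} GB/s per core")
```

Under these assumptions a core that waits on one miss at a time moves well under 1GB/s, no matter how much bandwidth the memory system offers.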

bluGill · 2025-09-11T15:14:53 1757603693

How many cores on that workstation? The claim is you need 40 cores to observe that - very few people have access to such a thing - they exist, but they are expensive.

anarazel · 2025-09-11T15:19:02 1757603942

That workstation has 2x10 cores / 20 threads. I also executed the test on a newer workstation with 2x24 cores with similar results, but I thought the older workstation is more interesting, as the older workstation has a much worse memory bandwidth.

Sorry, but compilation is simply not memory bandwidth bound. There are significant memory latency effects, but bandwidth != latency.

menaerus · 2025-09-11T15:43:24 1757605404

I doubt you can saturate the bandwidth with dual-socket configuration with each having 10 cores. Perhaps if you have very recent cores, which I believe you don't, but Intel design hasn't been that good. What you're also measuring in your experiment, and needs to be taken into account, is the latency across the NUMA nodes which is ridiculously high, 1.5x to 2x to the local node, amounting to usually ~130ns. Because of this, in NUMA configurations, you usually need more (Intel) cores to saturate the bw. I know because I have one sitting at my desk. Memory bandwidth saturation usually begins at ~20 cores with the Intel design that is roughly ~5 year old. I might be off with that number but it's roughly something like that. Other cores if you have them burning the cycles are just sitting there and waiting in the line for the bus to become free.

bluGill · 2025-09-11T19:23:18 1757618598

At 48 cores you are right about at the point where memory bandwidth becomes the limit. I suspect you are over the line, but by so little it is impossible to measure with all the other noise. Get a larger machine and report back.

anarazel · 2025-09-11T20:19:13 1757621953

On the 48 core system, building linux peaks at about 48GB/s; LLVM peaks at something like 25GB/s.

The system has well over 450GB/s of memory bandwidth.

menaerus · 2025-09-12T10:12:37 1757671957

> On the 48 core system, building linux peaks at about 48GB/s; LLVM peaks at something like 25GB/s

LLVM peak is suspiciously low since building LLVM is heavier than the kernel? Anyway, on my machine, which is dual-socket 2x22-core skylake-x, for pure release build without debug symbols (less memory pressure), I get ~60GB/s.

   # python do_pair_combined.py out_clang_release
   Peak combined memory bandwidth found in block #180:
   S0_write: 8046.8 MB/s
   S0_read: 23098.2 MB/s
   S1_write: 7611.3 MB/s
   S1_read: 21231.3 MB/s
   Total: 59987.6 MB/s

A release build with debug symbols is much heavier, and it is what I normally use during development, so my experience is probably biased towards that workload. There the peak is >50% larger: ~98GB/s.

  $ python do_pair_combined.py out_clang_relwithdeb
  Peak combined memory bandwidth found in block #601:
  S0_write: 11648.5 MB/s
  S0_read: 17347.9 MB/s
  S1_write: 31686.2 MB/s
  S1_read: 37532.7 MB/s
  Total: 98215.3 MB/s

I repeated the experiment with linux kernel, and I get almost the same figure as you do - ~48GB/s.

  $ python do_pair_combined.py out_kernel 
  Peak combined memory bandwidth found in block #329:
  S0_write: 8963.9 MB/s
  S0_read: 16584.1 MB/s
  S1_write: 7863.4 MB/s
  S1_read: 14371.0 MB/s
  Total: 47782.399999999994 MB/s
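
The script itself is not shown in the thread. As a hypothetical stand-in, a peak-of-sums search over per-interval counters might look like the sketch below; the input format (a list of dicts) and the function name `peak_combined` are assumptions, and only the second sample uses figures quoted above.

```python
# Hypothetical stand-in for a script like do_pair_combined.py; the real script
# and its input format are not shown in the thread.
def peak_combined(blocks):
    """Return (block_index, counters, total) for the block with the highest summed bandwidth.

    `blocks` is a list of dicts mapping counter name -> MB/s, one dict per
    sampling interval.
    """
    best_index = max(range(len(blocks)), key=lambda i: sum(blocks[i].values()))
    best = blocks[best_index]
    return best_index, best, sum(best.values())

# The second block uses the kernel-build figures quoted above; the first is made-up filler.
samples = [
    {"S0_write": 1000.0, "S0_read": 2000.0, "S1_write": 900.0, "S1_read": 1800.0},
    {"S0_write": 8963.9, "S0_read": 16584.1, "S1_write": 7863.4, "S1_read": 14371.0},
]
index, counters, total = peak_combined(samples)
print(f"Peak combined memory bandwidth found in block #{index}:")
for name, value in counters.items():
    print(f"{name}: {value:.1f} MB/s")
# Fixed-precision formatting hides float noise such as 47782.399999999994.
print(f"Total: {total:.1f} MB/s")
```

Note that this sums the two sockets, so the total is a system-wide peak for one interval rather than a sustained rate.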

That was the peak combined figure, but I was also interested in the single highest read and write bandwidth measured. For LLVM/clang release with debug symbols I get ~32GB/s write bandwidth and ~52GB/s read bandwidth.

  $ python do_single.py out_clang_relwithdeb
    Peak memory_bandwidth_write: 31686.2 MB/s
    Peak memory_bandwidth_read: 52038.0 MB/s

This is btw very close to what my socket can handle: store bandwidth is ~40GB/s, load bandwidth is ~80GB/s, and combined load-store bandwidth is 65GB/s.
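
To put "very close" into numbers, here is a small sketch dividing the measured peaks by the stated per-socket limits. It assumes the peaks and the limits are directly comparable (same socket, same kind of traffic), which the thread does not confirm.

```python
# Peak per-socket figures measured for the clang release-with-debug build (MB/s).
measured = {"write": 31686.2, "read": 52038.0}
# Per-socket capability as stated in the comment (GB/s).
capacity_gb = {"write": 40.0, "read": 80.0}

def utilization(measured_mb_s, capacity_gb_s):
    """Fraction of the available bandwidth that the measured peak consumes."""
    return (measured_mb_s / 1000.0) / capacity_gb_s

for kind in measured:
    share = utilization(measured[kind], capacity_gb[kind])
    print(f"{kind}: {share:.0%} of stated socket bandwidth")
```

These are momentary peaks, so they show how close the worst interval gets to the limit, not how much of the build is spent there.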

So, I think it is not unreasonable to say that there are compiler workloads that can be limited by memory bandwidth. I have certainly worked with codebases even heavier than LLVM, and even though I did not take measurements back then, my gut feeling was that the bandwidth was being consumed. Some translation units would literally sit "compiling" for a few minutes without making any progress.

I agree that random memory access patterns and the latency they incur are also a cost that needs to be added to this cost function.

My initial comment on this topic was - I don't really believe that the bottleneck in compilation for larger codebases, of course not on _any_ given machine, is on the compute side, and therefore I don't see how modules are going to fix any of this.

gpderetta · 2025-09-12T11:59:34 1757678374

> This is just wildly wrong.

Indeed! Compilation is notorious for being a classic pointer-chasing workload that is hard to brute force, and a good way to benchmark overall single-thread core performance. It is more likely to be memory latency bound than memory bandwidth bound.