I see a lot of misconceptions about using ML for compilers. You don't ask the model what instructions to emit. Instead, you prepare a set of passes which are guaranteed to preserve correctness (we already have hundreds of them). Then you ask the model - what passes should I apply and in what order.
Writing code to unroll a loop is trivial. The limitations of compilers are that almost all currently existing languages are too low level for optimization. ML has the potential to extract back this lost information. Basically the opposite of lowering an IR.
ML for phase ordering is just one problem that ML could solve within compilers.
Heuristic replacement (like loop unrolling) is another big one. For the specific case of loop unrolling, I would think lower level elements like how much iCache pressure the unrolling creates/whether or not the loop could fit in the DSB buffer would matter more.
For your point about existing IRs being too low-level, there has been a large push to try and work on that. MLIR has been used pretty extensively for that problem in ML applications, and languages like Rust have multiple higher level IRs. There's also a preliminary implementation of a Clang-IR for C/C++, and there's even be some work on higher level representations within LLVM-IR itself.
> prepare a set of passes which are guaranteed to preserve correctness (we already have hundreds of them)
Hundreds of passes, sure. Guaranteed that they preserve correctness is a bit more dubious, that's pretty hard to establish for most transforms. Passes that make no assumptions about prior passes are tricky too since compilers tend to work in terms of a lowering pipeline.
If the compiler has N correct passes that can be combined in arbitrary order without compromising compiler termination, exponentially increasing code size or generally making the output much worse, then you've already built a really good compiler. The subtask of then shuffling the order of passes to see if you missed anything is trivial, using machine learning to control your sort & test loop doesn't seem very compelling here.
My hunch is that the low hanging fruit in compiler dev using LLM is driving a fuzz tester with one. Other things seem worthwhile but difficult.
> The subtask of then shuffling the order of passes to see if you missed anything is trivial, using machine learning to control your sort & test loop doesn't seem very compelling here.
This is called the "phase ordering problem", and it's neither trivial nor solved.
> You don't ask the model what instructions to emit.
You could still do that, you'd just also need to ask the model for a proof. (But I guess that's much harder than heuristically picking which passes to apply.)
This is the right way to interact with LLMs in general. Ask for what you want, but independently verify the result. Don't be like that lawyer "...but I asked ChatGPT if it was telling the truth, and it said yes!"
What you describe is useful, but it's not really what my comment described.
My comment was describing the wacky idea of asking the model to come up with a formal, machine-checkable mathematical proof of correctness, too. That's hard in general.
The idea in the article of just letting the model pick between different, already proven-correct, optimization passes is much saner most of the time.
One of the biggest things that seems to be holding back ML in compilers right now is dataset size. This model was only trained on a gigabyte of source code, 30+% of that synthetic. Even on much simpler models, there have been massive performance gains by just throwing more data at them. Some experimentation with the original MLGO inlining model on a much bigger data corpus doubled the code-size wins. LLMs have also been shown to perform better they more data they are fed [1].
It's possible, nay, mandatory to constrain the outputs of the model at each step of generation in order to guarantee that a given structure or grammar is adhered to. If you can fine-tune the model with these constraints in place you can offload a lot of the effort that the LLM otherwise has to perform in comprehending correctness so it has more capacity for generating good content. To be sure, quality and quantity of data are important, but it's all too easy to introduce subtle bugs that take years to tease out if you don't adhere to the right constraints.
Most of the work in this space is not focused on neural compilation (having a ML model perform the transformation/entire compilation), but on replacing heuristics or phase ordering, where the issue of correctness falls back onto the compiler. For pretty much exactly the reasons you mentioned, neural compilation isn't really tractable.
This specific paper focuses on phase ordering, which should guarantee correctness, assuming the underlying transformations are correct. They do train the model to perform compilation, but as an auxiliary task.
- 3.0% improvement in reducing instruction counts over the compiler
- generating compilable code 91% of the time
- perfectly emulating the output of the compiler 70% of the time.
I read through the paper to see what perfectly emulating the output means. In this case, I think it's that it's also possible to get the same code out of the compiler using a different pass order. I was hoping for still passes original test suite or similar.
The authors are aware that the generated code has different semantics to the input in some cases but don't seem to consider that particularly important. The section is "Evaluation of Generated Code".
So - using machine learning, it is possible to delete a small percentage of the instructions emitted by a compiler while breaking the semantics. A tool which miscompiles programs while making them slightly smaller doesn't seem to be progress in compiler optimization.
Does anyone see some value add here that I'm missing?
Hey, author here. We use machine learning to generate a list of _optimization passes_ for the compiler to run. These optimization passes give us a 3.0% improvement over the default (-Oz), and are what generates correct code. We don't do anything to the code that breaks semantics (assuming no bugs in the compiler passes; some nuance is needed here ;) ).
We also train the model to generate what it thinks the optimized code will look like. We find that this helps the model choose better pass lists, but obviously the code cannot be trusted and semantics are not guaranteed. It only compiles in 91% of cases. "Perfectly emulating the output of the compiler" means the model spat out code that is character-for-character identical to what the compiler generates with the given pass list (even choosing the same variable names etc). IMO this is no mean feat, but is still a long way to go before using LLMs for codegen. We provide a bunch of examples in the paper of things that LLMs can and cannot do.
Hey, I just read through this paper. The phase ordering issues are currently based on heuristics. I noticed that you're only using the instruction count of LLVM as the measurement. However, this metric might not accurately reflect the program's actual performance and code size. LLVM has instructions like GEP that can translate into several lines of assembly code. Additionally, I suggest trying to run some large benchmarks like SPEC to demonstrate the performance benefits of using the LLM.
Hey, yes that's right, and good callout on GEP instructions. We admit that instruction count is a bit handwavy, but we use it as a starting point as that's what the prior works we compare against optimize for. We'll be looking at true binary size next, and code runtime after.
It seems like a poor fit to me precisely because correctness is boolean, difficult to measure and getting it wrong is very bad. I do think there's a place for AI here but it's probably not LLMs in their current form.
They are not using LLM to directly produce the result code, but as tool that lists which optimisations should be done and in which order, which is fairly complex problem to solve. But if optimisation passes are implemented correctly (which is anyway required for a functioning optimising compiler), it cannot produce incorrect code, maybe only suboptimal compared to default heuristics used.
If there's a list of known optimizations that preserve correctness then it becomes an optimization problem based on output length (as a proxy for cycle count). So is the idea that an LLM is more efficient than a search or direct optimization?
There tend to be a lot of heuristics involved when deciding which optimizations to apply and in which order, so there's plenty of room to apply some machine learning.
Whether LLMs are the right approach is a separate question.
In SQL optimization, the problem is a bit trickier (IMO) because compilation is in the query path. One successful approach I know of is Bao: https://arxiv.org/abs/2004.03814
This is scooching into AlphaGo (or whatever the generic implementation is called) territory: mixing predictive AI with traditional optimization algorithms.
For function inlining specifically, I'm not sure LLMs are necessarily the right choice. The original MLGO paper [1] demonstrated a big code-size improvement with a ML model for making inlining decisions (7-20% code size wins), but they used tens of engineered features. Maybe a LLM could squeeze some additional size wins out, but maybe not [2].
Additionally, there are other factors to consider when productionizing these systems. Compile time is important (which LLMs will almost certainly explode), and anyone concerned about code size will probably be doing (Thin)LTO which would require feeding a lot more context into a LLM making inlining decisions.
I'm probably being very naive with this but, could mechanistic interpretability play a role here? Specifically, to experimentally short-list the LLM's most effective optimizations, then try to peek inside the LLM and perhaps fish out some novel optimization algorithm that could be efficiently implemented without the LLM?
I did a bit of work on this last summer on (much) smaller models [1] and it was briefly discussed towards the end of last year's MLGO panel [2]. For heuristic replacements specifically, you might be able to glean some things (or just use interpretable models like decision trees), but something like a neural network works fundamentally differently than the existing heuristics, so you probably wouldn't see most of the performance gains. For just tuning heuristics, the usual practice is to make most of the parameters configurable and then use something like bayesian optimization to try and find an optimal set, and this is sometimes done as a baseline in pieces of ML-in-compiler research.
Quantification can be done by measuring in at least two dimensions: (1) the size of the synthesised code, and (2) how precisely the generated code matches the input (which means roughly: on what fraction of input do the two programs give different output). We have set up a challenge that seeks to entice the community to look into this problem domain more. And we've simplified the assumptions, so as to make it more tractable:
How well does (2) really measure accuracy? It seems like a single output that doesn't match the input code could indicate a fundamental floor in the optimized code, so it's essentially 100% wrong even though it gets the correct answer almost all the time.
Good luck on the challenge though, this seems like an interesting and valuable area of research.
Of course (2) is not a perfect measure of accuracy, since it does not quantify how far wrong an output is, e.g. if 111111111111 is the correct output, then both
111111111110 and 829382934783 count as equally faulty. The main advantage of (2) is that it is natural, easy to understand, and easy to measure and compare. We have to start somewhere. I imagine that, in the future, it can be refined (e.g. taking the Hamming distance between desire and actual output). I expect that more refined quantification emerges in response to the community better understanding exactly what is hard in the synthesis of programs.
Feel free to submit something! A simple submission is probably just a few lines of code.
The approach seems to focus on selecting optimizations to apply for LLVM (e.g., imagime: should this be inlined as an example), but the worst case is that the code is poorly optimized, you can't select optimizations to apply and get a wrong result.
I agree that what you say is true of compilation as a whole, but that doesn't seem to be the focus here (rather, it's used as a sort of crutch to help the LLM learn)
The trick is finding a way to ensure that LLM produces something which is always correct. Like in this case, the LLM only changes compiler optimizations, not the assembly itself, so no matter what it outputs the code is correct, it just may be larger. Other possibilities: an LLM which applies semantics-preserving program transformations, or an LLM combined with a proof assistant to verify the output (more generally and for any domain, an LLM as an NP oracle).
But I agree, as of now I haven't seen good uses where LLMs produce reliable output. Not only do you need that guarantee that whatever output always generates a correct program, you need something where an LLM is considerably better than a simple or random algorithm, and you need a lot of training data (severely restricting how creative you can be with the output).
> understanding.
We evaluate on a large suite of test programs. Our approach achieves a 3.0% improvement in reducing instruction counts over the compiler,
3% code size reduction is really good. The challenge will be having codegen like this that someone is willing to support. And for that they'd want to be able to reason about why the compiler made this decision or that one. IIUC that's an outstanding problem for AI in general.
Also bugs from this approach are going to be funny - program compiled with compiler version X will work as expected and same program compiled with version X+1 will start crashing because AI under some circumstances decided that dereference of a specific pointer was unnecessary, so it won't drop it into the assembly.
Good luck finding such a bug, because you will be looking on correct code, but computer will be executing invalid output.
The focus of this work is finding the optimal ordering of optimization passes to perform, not doing neural compilation. This guarantees correct code, assuming the underlying transformation passes are correct.
Most work in ML for compilers focuses on replacing heuristics and phase ordering precisely because they don't impact correctness. There is some work being done on neural compilation [1], but I'm not sure that's going to be a viable approach anytime soon.
Damn. These things imitate humans too well. Guess we’ll need a giant test suite to feel stable. Sounds like work. Let’s train a GAN for that and put our feet up. Turtles all the way down, my dudes.
Hey Jon, we don't change semantics. We just choose optimization passes to run to match a particular input code. Agreed that code size wins could regress performance. We don't measure that yet, but will be looking into it next. There are some applications where codesize above-all-else is what to optimize for.
Chatgpt4 can do source to source optimization which is pretty cool. I got it to beat gcc at -03 on simple small toy problems. and it can do similar things with python also. But it threw its hands up when I gave it a longer piece of code to optimize.
It does make sense that ChatGPT can beat gcc (or any compiler really); the compiler is forced to optimise code so that it still appears to work as if it's the source you wrote. Most importantly, if there are observable side-effects, compilers must preserve them, even though you know you don't actually care about them. An AI is not bound to preserve those, and can semantically change what you wrote, so long as the one semantic you wanted preserved is still there.
It seems like compilers should have an interactive mode where it suggests code that is very similar but not technically perfectly the same that is more efficient and the can accept or decline the alternate version.
A backend compiler is probably not in a good place to do that. Frontends could, however, and there are lots of tools already to make those suggestions for many languages.
clippy in rust, for example, will tell you about removing needless allocations, etc, which a backend compiler might have a harder time doing.
At my company we have some "loadbearing" code that was written by a mad scientist that no longer works at our company. We have a total ban on AI for source code analysis (so chatgpt, copilot etc are all banned). I've really wanted to throw some of the grosser parts of that codebase into gpt-4 just to see if it could bring some small amount of sanity.
It's probably both that is leaking your source out and the risk of it being a copyright violation if it spits out some GPL source verbatim and you check it in.
Does that ever actually happen? I've only heard of it happening to people who forced the AI's hand by including the comments for said code in the prompt.
There's a setting "Suggestions matching public code" that can be set to "Block" that I think is sufficient for most people's usage. But many companies don't want to involve themselves in any sort of liability.
While many software companies get away with an “it’ll be fine” attitude, that is not sufficient for all companies and industries. Sometimes things have to be provably correct.
I work in a research facility, so the biggest fear is that our super top secret elite info will leak. The reality is it'd be used to refactor a lot of terrible code.
#include <stdio.h>
int main(void) {
int i;
int a[2000] ;
for(i = 0; i < 2000; i++){
a[i] = i ;
}
printf("%d\n", a[5]);
}
This program initializes an array `a` of size 2000, populates it with integers from 0 to 1999, and then prints the value at index 5, which is 5.
Just as with the Python program, we can optimize this program significantly. Given that we're only interested in the sixth element (index 5) of the array, we don't need to construct and populate the entire array.
Here's an optimized version of the code:
#include <stdio.h>
int main(void) {
printf("%d\n", 5);
}
Code indentation is 2 spaces on here, you have to do it to all the lines you want formatted that way. That's why the first one got a bunch of lines combined, it treated them as one paragraph instead of code.
But when I tried to get it to remove the dependency on studio.h, it said that’s too hard. Might have gone further with a more reasonable request I guess. Also not necessary as gcc does that anyway.
puts isn't applicable here because you actually need to format the %d, but any compiler worth using (i.e. not MSVC: https://gcc.godbolt.org/z/zx74vY1za) will optimize a printf of a constant string ending in a newline into a puts.
While this isn't optimization, is there any LLM systems that create "provably correct" transformations yet?
I have used ChatGPT on some fairly awful code, asking it to add comments, rename variables and functions, etc. I find the outputs useful, but a couple of times it's broken the code. I imagine for many (not all) languages, you could ask the LLM to produce suggested changes, then use a code-rewriting tool to apply them in a way you could be 100% sure they weren't going to change the behaviour of the progrma.
Not strictly what you're looking for, but in Lean (functional language/theorem prover), there's some interesting work being done. Using the tool actually shows which suggestions will compile, which certifies correctness to some degree. https://leandojo.org/
There's a half day tutorial at the LLVM Developers Meeting on this, ML-Guided Compiler Optimization in LLVM. However, the authors of this paper aren't giving that tutorial.
I don’t see why would the output language’s complexity matter - that’s clearly not the hard part. You need plenty of “thinking” to do for outputting sensible assembly, let alone whole programs.
With that said, it is not doing neural compilation as others have mentioned, it’s only about ordering/enabling different phases of the compiler based on ML, over the current, simpler heuristics.
Exactly, since its not doing whole program synthesis im thinking it could be done with fewer parameters. However program synthesis is part of the loss function.
Program synthesis is part of the loss function, which is what makes it a auxiliary learning task.
We haven’t experimented with model size yet, we just used the same configuration as the smallest Code Llama. We did play with dataset size and found thah performance tracks the usual scaling laws. Details in the paper
I see a lot of interest in this field which is amazing! I have been working on this problem for a while and have made a lot of progress! If you are interested in building a general purpose code performance optimizer, I am hiring founding engineers. We are still in stealth so can't say too much but we already have great results on customers are already funded by some of the best investors of silicon valley. If you might be interested, pls reach out at misra.saurabh1@gmail.com
I was thinking about writing my bachelor's thesis about LLMs and compilers.
My idea is to maybe attempt to write a C (to assembler)-Compiler and see how well it does.
Attempting to convert assembler to C using an LLM would be interesting too, but the results would probably be poor since there is just so much information that gets lost when compiling, so the C code would be pretty statistical. I guess I could improve it by somehow adding "C code is or is not compilable because of line n" and whether the result is correct to the lost function.
Do LLMs need to see some information just once to answer about it? My understanding was that they need to see many examples of the same thing to be able to answer it correctly (generate text around it correctly).
Shouldn't this mean that it is safe to ask GPTs about something proprietary if it was just once because rare examples should just disappear in weights of everything else. And this also means that even GPT4 won't be able to answer any queries about obscure or rare knowledge it has seen in its training dataset.
What is interesting to me about this approach is that they're fine tuning an existing model. If they could train a model using input/output pairs from scratch and entirely on the optimization task, I could see a smaller LLM performing drastically better than the compiled output - according to whatever the loss function is implemented to optimize for.
I was hoping to see a more ambitious language model based project.
Rather than just picking compiler arguments, an LLM could parse, and then transform AST into its more optimized form in a way that does not rely on a very formal way it is treated by compilers of today, apply guesstimate-based branch reordering and inlining heuristics not dissimilar to how a programmer would do so manually.
Once done, a compiler could perform final AST validation and either route it back to LLM to fix it or auto-fix most common cases.
What's the benefit of having an LLM do those things in a way that guesstimates? There are big wins to be had in code-size and some wins to be had in performance related to inlining [1][2], but I think the implementation in the references directly tied into the compiler's inlining heuristic is a much better way to do that as it guarantees correctness. In addition, there's a reason that compilers basically ignore the `inline` keyword these days.
For branch reordering, techniques like BOLT [5] are pretty effectively able to reorder code layout at the binary level for big performance gains by using profile information. ML models can sometimes synthesize that information [3], but if I recall correctly, the performance of those models wasn't as good.
Neural compilation (like what you're describing) has been tried with LLMs [4], but has a lot of correctness problems currently, and I don't think it's going to be feasible anytime soon to do reinforcement learning for performance/code-size improvements.
> Our approach achieves a 3.0% improvement in reducing instruction counts over the compiler, outperforming two state-of-the-art baselines that require thousands of compilations.
If that's their target, then what's the point? LLVM's optimizations are not done to minimize instructions but to maximize performance. On modern processors, these can be very different things.
You're right that a decrease in code size doesn't mean a performance increase (and oftentimes they can be inversely correlated like in inlining).
But LLVM targets both depending upon what optimization pipeline you select. (-Oz/-Os are targeting minimum code size, -O1,-O2,-O3 are optimization focussed).
Code size reduction is critical in some use cases like embedded environments and mobile apps and it is a significant area of research.
Hey, we’re targeting code size in this work, not runtime performance. You would use an option like -O3 to optimize for runtime and -Oz to optimize for code size. The pass pipelines are different for both
I wonder if you could get a correct compiler (or 100% emulation in their terms) by allowing it to choose optimization pass order rather than doing more arbitrary things
I'm not sure a fully correct production optimizing compiler is that feasible. LLVM gets multiple miscompilation reports per week (from what I've haphazardly seen observing the issue tracker).
Theoretically changing the order of the passes in the optimization pipeline shouldn't cause any correctness issues, but the fact is that the ordering in the default compilation pipelines is the one that is most tested, so there will probably be bugs exposed when fuzzing the pass ordering.
Next step is to add verification for optimized code from the LLM with an SMT solver (like Z3) to remove "hallucinations". If the input and output code can be verified to be equivalent then this would be a great addition to an optimization pipeline. Once that's done the same can be applied to intermediate representations of GPU kernels in a recursive loop of AI optimizing AI code for faster execution times.
There's already tooling available for using SMT to validate LLVM-IR transformations [1]. It's designed for zero false positives however, so some things might slip through the cracks.
Additionally, this work focuses on phase ordering, which produces correct code regardless of what the LLM puts out, assuming there aren't any bugs in the passes being used (which could crop up as random orderings aren't as well tested as the standard orderings in the commonly used pipelines).
“generating compilable code 91% of the time” which means that almost ten percent of the time, it doesn’t generate compilable code. Given the fact that ChatGPT has gotten worse at math over time,¹ I’m wondering if this too will get worse.
⸻
1. Although I also find myself thinking about my kids as they were developing language where they initially correctly conjugated some common irregular verbs, then they started conjugating them as if they were regular and then finally returned to correctly conjugating them, which might be what’s happening with ChatGPT and math.
This is very interesting work, but it's not really a LLM. It doesn't have language abilities. They should have called it a seq2seq model, but I think that term is not in vogue these days :)
We use the same architecture as other LLMs, but we include no natural language in our pretraining. We figured a single-domain training corpus would make evaluation easier. We’ll be looking at layering this on top of something like Code Llama next
Writing code to unroll a loop is trivial. The limitations of compilers are that almost all currently existing languages are too low level for optimization. ML has the potential to extract back this lost information. Basically the opposite of lowering an IR.