Meta LLM Compiler: neural optimizer and disassembler (twitter.com/aiatmeta)
248 points by foobazgt 9 months ago | 100 comments



Huh. This is a very... "interesting" application for an LLM. I'm not the brightest crayon in the box, but if anyone else would like to follow along with my non-expert opinion as I read through the paper, here's my take on it.

It's pretty important for compilers / decompilers to be reliable and accurate -- compilers behaving in a deterministic and predictable way is an important fundamental of pipelines.

LLMs are inherently unpredictable, and so using an LLM for compilation / decompilation -- even an LLM that has 99.99% accuracy -- feels a bit odd to include as a piece in my build pipeline.

That said, let's look at the paper and see what they did.

They essentially started with CodeLlama, and then went further to train the model on three tasks -- one primary, and two downstream.

The first task is compilation: given input code and a set of compiler flags, can we predict the output assembly? Given the inability to verify correctness without using a traditional compiler, this feels like it's of limited use on its own. However, training a model on this as a primary task enables a couple of downstream tasks. Namely:

The second task (and first downstream task) is compiler flag prediction / tuning to optimize for smaller assembly size. It's a bit disappointing that they only seem to be able to optimize for assembly size (and not execution speed), but it's not without its uses. Because the output of this task (compiler flags) is then passed to a deterministic function (a traditional compiler), the instability of the LLM is mitigated.
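
To make that concrete, here's a rough sketch of how a flag suggestion from a model could be consumed safely: hand it to the real compiler, measure the result, and fall back to a known-good baseline if the suggestion is rejected or loses. The clang invocation, the flag lists, and using total object-file size as a proxy for code size are all my own illustrative assumptions, not the paper's actual harness.

    import os, subprocess, tempfile

    def object_size(source, flags):
        """Compile with clang and the given flags; return the .o size in bytes, or None on failure."""
        with tempfile.TemporaryDirectory() as tmp:
            src = os.path.join(tmp, "unit.c")
            obj = os.path.join(tmp, "unit.o")
            with open(src, "w") as f:
                f.write(source)
            r = subprocess.run(["clang", "-c", src, "-o", obj, *flags], capture_output=True)
            return os.path.getsize(obj) if r.returncode == 0 else None

    source = "int add(int a, int b) { return a + b; }\n"
    suggested = ["-Oz", "-fomit-frame-pointer"]   # hypothetical model suggestion
    baseline = ["-Oz"]                            # known-good fallback

    sizes = {tuple(f): object_size(source, f) for f in (suggested, baseline)}
    best = min((f for f, s in sizes.items() if s is not None), key=lambda f: sizes[f])
    print("using flags:", list(best), "->", sizes[best], "bytes")

Because the LLM only proposes the flags and the deterministic compiler does the actual work, a bad suggestion costs you a compile, not correctness.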

The third task (second downstream task) is decompilation. This is not the first time that LLMs have been trained to do better decompilation -- however, because of the pretraining that they did on the primary task, they feel that this provides some advantages over previous approaches. Sadly, they only compare LLM Compiler to Code Llama and GPT-4 Turbo, and not against any other LLMs fine-tuned for the decompilation task, so it's difficult to see in context how much better their approach is.

Regarding the verifiability of the disassembly approach, the authors note that there are correctness issues. So they employ round-tripping -- recompiling the decompiled code (using the same compiler flags) to verify correctness / exact-match. This still puts accuracy at around 45% (if I understand their numbers), so it's not entirely trustworthy yet, but it might still be useful (especially if used alongside a traditional decompiler, with this model's outputs used only when they are verifiably correct).

Overall I'm happy to see this model released, as it seems like an interesting use case. I may need to read more, but at first blush I'm not immediately excited by the possibilities it unlocks. Most of all, I would like to see whether these methods could be extended to optimize for performance -- not just assembly size.


>compilers behaving in a deterministic and predictable way is an important fundamental of pipelines. LLMs are inherently unpredictable, and so using an LLM for compilation / decompilation -- even an LLM that has 99.99% accuracy

You're confusing different concepts here. An LLM is technically not unpredictable by itself (at least the ones we are talking about here; there are different problems with beasts like GPT-4 [1]). The "randomness" of LLMs you are probably experiencing stems from the autoregressive completion, which samples from the probabilities at a temperature T>0 (which is very common because it makes sense in chat applications). But there is nothing that prevents you from simply choosing greedy sampling, which would make your output 100% deterministic and reproducible. That is particularly useful for disassembling/decompiling and has a chance to vastly improve over existing tools, because it is common knowledge that they are often not the sharpest tools and humans are much better at piecing together working code.
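
For what it's worth, greedy decoding is a one-line change in most inference stacks. A minimal sketch with Hugging Face transformers (the checkpoint name is just a stand-in, not necessarily the released LLM Compiler weights; the prompt is a toy):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "codellama/CodeLlama-7b-hf"   # stand-in; swap in whatever checkpoint you mean
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")

    prompt = "define i32 @square(i32 %x) {"    # toy LLVM-IR-ish prompt
    inputs = tok(prompt, return_tensors="pt").to(model.device)

    # Greedy decoding: no sampling, so the same weights + same prompt
    # yield the same tokens on every run.
    out = model.generate(**inputs, do_sample=False, max_new_tokens=64)
    print(tok.decode(out[0], skip_special_tokens=True))

(Strictly speaking, bitwise-identical outputs also depend on running the same kernels on the same hardware; that floating-point angle is part of what [1] is about.)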

The other question here is accuracy for compiling. For that, it is important whether the LLM can follow a specification correctly, because once you write unspecified behaviour, your code is fair game for other compilers as well. So the real question is how well it follows the spec and how well it deals with situations where normal compilers would flounder.

[1] https://152334h.github.io/blog/non-determinism-in-gpt-4/


Determinism for any given input isn't an interesting metric though. Inputs are always different, or else you could just replace it with a lookup function. What's important is reliability of the output given a distribution of inputs, and that's where LLMs are unreliable. Temperature sampling can be a technique to improve reliability, particularly when things get into repetitive loops -- though usually it's used to increase creativity.


> What's important is reliability of the output given a distribution of inputs, and that's where LLMs are unreliable.

Nailed it! Thank you for stating this better than I did. Maybe I should have used the word "trustworthy" instead of "predictable".

If I'm using an LLM as a calculator to solve math problems for my users, I may be able to say with certainty that my model is 99% accurate, but I'm not able to know ahead of time which questions it's going to miss, and which ones it's going to get correct. It's a near-infinite input space, so it's difficult to prove correctness via induction. The attention mechanism is (thus far) fairly inscrutable, and thus proving the correctness via deduction is (currently) not possible either.

It would feel weird to use an LLM for a calculator, just as it feels weird to use an LLM as a compiler. I want something that uses an algorithm that is provably and predictably correct -- not one that is "almost always" correct.

That unpredictability of accuracy isn't connected to the temperature used when sampling so much as to the inscrutable nature of the attention mechanism, which makes it hard to verify trustworthiness over the range of all possible inputs.


> The "randomness" of llms you are probably experiencing stems from the autoregressive completion, which samples from probabilities for a temperature T>0 (which is very common because it makes sense in chat applications).

Even that “random” sampling is deterministic, in that if you use the same PRNG algorithm with the same random seed, then (all else being equal) you should get the same results every time.

To get genuine nondeterminism, you need an external source of randomness, such as thermal noise, keystroke timing, etc. (Even whether that is really non-deterministic depends on a whole lot of contested issues in philosophy and physics, but at least we can say it is non-deterministic for all practical purposes.)
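
To spell that out with a toy example (my own, purely illustrative): a temperature sampler is just a deterministic function of the logits and the seed.

    import math, random

    def sample(logits, temperature, seed):
        """Temperature sampling over a logit vector; fully reproducible given the seed."""
        rng = random.Random(seed)                     # fixed seed -> fixed random stream
        if temperature == 0:
            return max(range(len(logits)), key=lambda i: logits[i])   # greedy: no randomness at all
        scaled = [l / temperature for l in logits]
        m = max(scaled)                               # subtract max for numerical stability
        weights = [math.exp(l - m) for l in scaled]
        return rng.choices(range(len(logits)), weights=weights, k=1)[0]

    logits = [1.0, 3.0, 0.5, 2.0]
    print(sample(logits, 0.8, seed=42) == sample(logits, 0.8, seed=42))   # True: same seed, same pick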


This was a big unlock for me -- I recall antirez saying the same thing in a comment replying to me when I asked a similar question about potential LLM features in Redis [1].

[1] https://news.ycombinator.com/item?id=39617370


It is normally not a necessary feature of a compiler to be deterministic. A compiler should be correct against a specification. If the specification allows nondeterminism, a compiler should be able to exploit it. I remember the story of the sather-k compiler that did things differently based on the phase of the moon.


It's technically correct that a language specification is rarely precise enough to require compiler output to be deterministic.

But it's pragmatically true that engineers will want to murder you if your compiler is non-deterministic. All sorts of build systems, benchmark harnesses, supply chain validation tools, and other bits of surrounding ecosystem will shit the bed if the compiler doesn't produce bitwise identical output on the same input and compiler flags.


Can vouch for this having fixed non-determinism bugs in a compiler. Nobody is happy if your builds aren't reproducible. You'll also suffer crazy performance problems as everything downstream rebuilds randomly and all your build caches randomly miss.


NixOS with its nixpkgs [0] and cache [1] would also not work if compilers weren't reproducible. Though they won't use something like PGO or some specific optimization flags as these would very likely lead to unreproducible builds. For example most distros ship a PGO optimized build of Python while nixos does not.

[0] https://github.com/nixos/nixpkgs

[1] https://cache.nixos.org/


PGO can be used in such situations, but the profile needs to be checked in. Same code + same profile -> same binary (assuming the compiler is deterministic, which is tested quite extensively).

There are several big projects that use PGO (like Chrome), and you can get a deterministic build at whatever revision using PGO as the profiles are checked in to the repository.
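
For anyone who hasn't seen the flow, instrumented PGO with a checked-in profile looks roughly like this (standard clang/llvm-profdata flags; the file names and workload are made up):

    import subprocess

    def run(cmd):
        subprocess.run(cmd, check=True)

    # 1. Occasionally: build instrumented, run a representative workload, merge the profile.
    run(["clang", "-O2", "-fprofile-instr-generate", "app.c", "-o", "app-instr"])
    run(["./app-instr"])                                   # writes default.profraw
    run(["llvm-profdata", "merge", "-output=app.profdata", "default.profraw"])
    # app.profdata is what gets checked in to the repo.

    # 2. Every build: same source + same checked-in profile -> same binary
    #    (assuming the compiler itself is deterministic).
    run(["clang", "-O2", "-fprofile-instr-use=app.profdata", "app.c", "-o", "app"])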


It’s called AutoFDO, although I’ve struggled to get it working well in Rust.


It's not called AutoFDO. AutoFDO refers to a specific sampling-based profile technique out of Google (https://dl.acm.org/doi/abs/10.1145/2854038.2854044). Sometimes people will refer to that as PGO though (with PGO and FDO being somewhat synonymous, but PGO seeming to be the preferred term in the open source LLVM world). Chrome specifically uses instrumented PGO which is very much not AutoFDO.

PGO works just fine in Rust and has support built into the compiler (https://doc.rust-lang.org/rustc/profile-guided-optimization....).


I wasn’t trying to conflate the two. PGO traditionally meant a trace build, but as a term it’s pretty generic -- at least to me it means the general concept of “you have profile information that replaces the generically tuned heuristics the compiler uses”. AutoFDO I’d classify as an extension of that concept to a more general PGO technique, kind of like ThinLTO vs LTO. Specifically, it generates the “same” information to supplant compiler heuristics, but is more flexible in that the sample can be fed back into “arbitrary” versions of the code using normal sampling techniques instead of an instrumented trace. The reason sampling is better is that it more easily fits into capturing data from production, which is much harder to accomplish for the tracing variant (due to perf overheads). Additionally, because it works across versions, the amortized compile cost drops from 2x to 1x, because you only need to reseed your profile data periodically.

I was under the impression they had switched to AutoFDO across the board but maybe that’s just for their cloud stuff and Chrome continues to run a representative workload since that path is more mature. I would guess that if it’s not being used already, they’re exploring how to make Chrome run AutoFDO for the same reason everyone started using ThinLTO - it brought most of the advantages while fixing the disadvantages that hampered adoption.

And yes, while PGO is available natively, AutoFDO isn’t quite as smooth.


I'm not sure where you're getting your information from.

Chrome (and many other performance-critical workloads) is using instrumented PGO because it gives better performance gains, not because it's a more mature path. AutoFDO is only used in situations where collecting data with an instrumented build is difficult.


Last I looked, AutoFDO builds were similar in performance to PGO, much as ThinLTO is to LTO. I’d say that collecting data with an instrumented Chrome build is extremely difficult -- you’re relying on your synthetic benchmark environment, which is very very different from the real world (eg extensions aren’t installed, the pattern of websites being browsed is not realistic, etc). There’s also a 2x compile cost because you have to build Chrome twice in the exact same way, plus you have to run a synthetic benchmark on each build to generate the trace.

I’m just using an educated guess to say that at some point in the future Chrome will switch to AutoFDO, potentially using traces harvested from end user computers (potentially just from their employees even to avoid privacy complaints).


You can make the synthetic benchmarks relatively accurate, it just takes effort. The compile-time hit and additional effort is often worth it for the extra couple percent for important applications.

Performance is also pretty different on the scales that performance engineers are interested in for these sorts of production codes, but without the build system scalability problems that LTO has. The original AutoFDO paper shows an improvement of 10.5%->12.5% going from AutoFDO to instrumented PGO. That is pretty big. It's probably even bigger with newer instrumentation based techniques like CSPGO.

They also mention the exact reasons that AutoFDO will not perform as well, with issues in debug info and losing profile accuracy due to sampling inaccuracy.

I couldn't find any numbers for Chrome, but I am reasonably certain that they have tried both and continue to use instrumented PGO for the extra couple percent. There are other pieces of the Chrome ecosystem (specifically the ChromeOS kernel) that are already optimized using sampling-based profiling. It's been a while since I last talked to the Chromium toolchain people about this though. I also remember hearing them benchmark FEPGO vs IRPGO at some point and concluding that IRPGO was better.


Yeah, and nixpkgs also, last time I checked, patches GCC/clang to ensure determinism. Many compilers and toolchains by default want to, e.g., embed build information that may leak from the build env in a non-deterministic / non-reproducible manner.


Yup. Even so much as inserting the build timestamp into the generated executable (which is strangely common) causes havoc with build caching.
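
A quick way to catch this kind of thing (rough sketch; "unit.c" and the clang invocation are placeholders): build the same source twice with the same flags and compare hashes.

    import hashlib, os, subprocess, tempfile

    def build_hash(source_path, flags):
        """Compile once and return a SHA-256 of the resulting object file."""
        with tempfile.TemporaryDirectory() as tmp:
            obj = os.path.join(tmp, "out.o")
            subprocess.run(["clang", "-c", source_path, "-o", obj, *flags], check=True)
            with open(obj, "rb") as f:
                return hashlib.sha256(f.read()).hexdigest()

    flags = ["-O2"]
    a = build_hash("unit.c", flags)
    b = build_hash("unit.c", flags)
    # With a deterministic toolchain these match; anything that sneaks a timestamp
    # (e.g. __DATE__ / __TIME__) into the output makes them differ and defeats caching.
    print("reproducible" if a == b else "non-reproducible")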


NVCC CUDA builds were nondeterministic last time I checked, it made certain things (trying to get very clever with generating patches) difficult. This was also hampered by certain libraries (maybe GTSAM?) wanting to write __DATE__ somewhere in the build output, creating endlessly changing builds.


In parallel computing you run into nondeterminism pretty quickly anyways - especially with CUDA because of undetermined execution order and floating point accuracy.


Yes, at runtime. Compiling CUDA doesn’t require a GPU, though, and doesn’t really use “parallel computing”. I think CUDA via clang gets this right and will produce the same build every time - it was purely an NVCC issue.


I’m amused by the possibility of a compiler having a flag to set a random seed (with a fixed default, of course).

If you hit a compiler bug, you could try a different seed to see what happens.

Or how about a code formatter with a random seed?

Tool developers could run unit tests with a different seed until they find a bug - or hide the problem by finding a lucky seed for which you have no provable bugs :)

Edit:

Or how about this: we write a compiler as a nondeterministic algorithm where every output is correct, but they are optimized differently depending on an input vector of choices. Then use machine learning techniques to find the picks that produce the best output.


Plus, these models are entirely black boxes. Even given weights, we don't know how to look at them and meaningfully tell what's happening - and not only that, but training these models is likely not cheap at all.

Stable output is how we can verify that attacks like the one described in Reflections on Trusting Trust[0] don't happen.

[0] https://www.cs.cmu.edu/~rdriley/487/papers/Thompson_1984_Ref...


> But it's pragmatically true that engineers will want to murder you if your compiler is non-deterministic. All sorts of build systems, benchmark harnesses, supply chain validation tools, and other bits of surrounding ecosystem will shit the bed if the compiler doesn't produce bitwise identical output on the same input and compiler flags.

I think that’s rather true nowadays, but hasn’t always been thus. Back in the 20th century, non-deterministic compiler output was very common - even if only due to the common practice of embedding the compilation timestamp in the resulting executable - and very few ever cared. Whereas nowadays, there is a much bigger culture of hermetic/reproducible build processes, in which stuff like embedding compilation timestamps in the executable or object files is viewed as an antipattern.


Just fix the random seed :)


LLMs can be deterministic if you set the random seed and pin it to a certain version of the weights.

My bigger concern would be bugs in the machine code would be very, very difficult to track down.


> It is normally not a necessary feature of a compiler to be determistic. A compiler should be correct against a specification.

That sounds like a nightmare. Optimizing code to play nice with black-box heuristic compilers like V8's TurboFan is already, in fact, a continual maintenance nightmare.

If you don't care about performance, non-deterministic compilation is probably "good enough." See TurboFan.


LLMs are deterministic. We inject randomness after the fact, just because we don't like our text being deterministic. Turn temperature to 0 and you're good.


But temperature-0 LLMs don't exhibit the emergent phenomena we like, even in apparently non-creative tasks. The randomness is, in some sense, a cheap proxy for an infeasible search over all completion sequences, much like simulated annealing with zero temperature is a search for a local optimum, but adding randomness makes it explore globally and find more interesting possibilities.


Sure but you could add pseudo random noise instead and get the same behavior while retaining determinism.


Temperature is at ~1.2 in this thread, here's some 0.0:

- Yes, temperature 0.0 is less creative.

- Injecting pseudo-random noise to get deterministic creative outputs is "not even wrong", in the Wolfgang Pauli sense. It's fixing something that isn't broken, with something that can't fix it, that if it could, would be replicating the original behavior - more simply, it's proposing non-deterministic determinism.

- Temperature 0.0, in practice, is an LLM. There aren't emergent phenomena, in the sense "emergent phenomena" is used with LLMs, missing. Many, many, many, applications use this.

- In simplistic scenarios, on very small models, 0.0 could get stuck literally repeating the same token.

- There's a whole other layer of ex. repeat penalties/frequency penalties and such that are used during inference to limit this. Only OpenAI and llama.cpp expose repeat/frequency.

- Temperature 0.0 is still non-deterministic on ex. OpenAI, though substantially the same, and even the same most of the time. It's hard to notice differences. (Reproducible builds require extra engineering effort, the same way ensuring temperature = 0.0 is truly deterministic requires engineering effort.)

- Pedantically, only temperature 0.0 at the same seed (initial state) is deterministic.


Even then, though, the output could change drastically based on a single change to the input (such as a comment).

That's not something you want in a compiler.


It is very important for a compiler to be deterministic. Otherwise you can't validate the integrity of binaries! We already have issues with reproducibility without adding this shit in the mix.


Reproducible builds are an edge case that requires deterministic compilation, for sure. But profile-based optimisation or linker address randomisation are sometimes also useful. Why rule out one thing for the other? Normally you can easily turn optimisation on and off depending on your needs. Just do -O0 if you want determinism. But normally you should not rely on it (also at execution time).


Sure, performance is more interesting, but it's significantly harder.

With code size, you just need to run the code through the compiler and you have a deterministic measurement for evaluation.

Performance has no such metric. Benchmarks are expensive and noisy. Cost models seem like a promising direction, but they aren't really there yet.


LLMs are probably great at this. You can break the code down into functions or basic blocks. You can use the LLM to decompile them and then test whether the decompilation results match the code when executed. You'll probably get this right after a few tries. Then you can train your model with the successful decompilation results so your model gets better.


Thank you for the summary. My memory of SOTA on disassembly about a year ago was sub-30% accuracy, so this is a significant step forward.

I do think the idea of a 90%+-ish forward and backward assembler LLM is pretty intriguing. There’s bound to be a lot of uses for it; especially if you’re of the mind that to get there it would have to have learned a lot about computers in the foundation model training phase.

Like, you’d definitely want to have those weights somehow baked into a typical coding assistant LLM. And of course you’d be able to automate round one of a lot of historical archiving projects that would like to get compilable modern code but only have a binary; you’d be able to turn some PDP-1 code into something that would compile on a modern machine; you’d probably be able to leverage it into building chip simulations / code easily; it would be really useful for writing Verilog (maybe). Anyway, the use cases seem pretty broad to me.


Maybe they are thinking about embedding a program generator and execution environment into their LLM inferencing loop in a tighter way. The model invents a program that guides the output in a specific/algorithmic way, tailored to the prompt.


some comments like this make me want to subscribe to you for all your future comments. thanks for doing the hard work of summarizing and taking the bold step of sharing your thoughts in public. i wish more HNers were like you.


You're very kind, thank you. :)



I continue to be fascinated about what the next qualitative iteration of models will be, marrying the language processing and broad knowledge of LLMs with an ability to reason rigorously.

If I understand correctly, this work (or the most obvious productionized version of it) is similar to the work Deep Mind released a while back: the LLM is essentially used for “intuition”—-to pick the approach—-and then you hand off to something mechanical/rigorous.

I think we’re going to see a huge growth in that type of system. I still think it’s kind of weird and cool that our meat brains with spreading activation can (with some amount of effort/concentration) switch over into math mode and manipulate symbols and inferences rigorously.


How do they verify the output preserves semantics of the input?


For the disassembler we round trip. x86 ->(via model) IR ->(via clang) x86. If they are identical then the IR is correct. Could be correct even if not identical, but then you need to check.
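
In pseudocode the check is roughly this (a simplified sketch, not the actual harness; byte-comparing the objects is the "identical" case described above):

    import os, subprocess, tempfile

    def round_trip_ok(original_obj, model_ir, clang_flags):
        """Recompile the model-produced LLVM IR and byte-compare against the original object."""
        with tempfile.TemporaryDirectory() as tmp:
            ir_path = os.path.join(tmp, "pred.ll")
            new_obj = os.path.join(tmp, "pred.o")
            with open(ir_path, "w") as f:
                f.write(model_ir)
            r = subprocess.run(["clang", "-c", ir_path, "-o", new_obj, *clang_flags],
                               capture_output=True)
            if r.returncode != 0:
                return False                   # the IR didn't even compile
            with open(original_obj, "rb") as a, open(new_obj, "rb") as b:
                return a.read() == b.read()    # exact match => the lift is definitely correct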

For the auto-tuning, we suggest the best passes to use in LLVM. We take some effort to weed out bad passes, but LLVM has bugs. This is something it has in common with any auto-tuner.

We train it to emulate the compiler. The compiler does that better already. We do it because it helps the LLM understand the compiler better and it auto-tunes better as a result.

We hope people will use this model to fine-tune for other heuristics. E.g. an inliner which accepts the IR of the caller and callee to decide profitability. We think things like that will be vastly cheaper for people if they can start from LLM Compiler. Training LLMs from scratch is expensive :-)

IMO, right now, AI should be used to decide profitability not correctness.


Have you guys applied this work internally to optimize Meta's codebase?


> Could be correct even if not identical

Can you be 100% sure that the model-generated IR is correct if the output x86 is identical to the input?


Some previous work in the space is at https://github.com/albertan017/LLM4Decompile


As usual, Twitter is impressed by this, but I'm very skeptical; the chance of it breaking your program is pretty high. The thing that makes optimizations so hard to make is that they have to match the behavior without optimizations (unless you have UB), which is something that LLMs will probably struggle with since they can't exactly understand the code and execution tree.


Hey! The idea isn't to replace the compiler with an LLM; the tech is not there yet. Where we see value is in using these models to guide an existing compiler, e.g. orchestrating optimization passes. That way the LLM won't break your code, nor will the compiler (to the extent that your compiler is free from bugs, which can be tricky to detect -- cf Sec 3.1 of our paper).


I've done some similar LLM compiler work, obviously not on Meta's scale, teaching an LLM to do optimization by feeding an encoder/decoder model pairs of -O0 and -O3 code, and even at my small scale I managed to get the LLM to spit out the correct optimization every once in a while.

I think there's a lot of value in LLM compilers to specifically be used for superoptimization where you can generate many possible optimizations, verify the correctness, and pick the most optimal one. I'm excited to see where y'all go with this.


Thank you for freeing me from one of my to-do projects. I wanted to do a similar autoencoder with optimisations. Did you write about it anywhere? I'd love to read the details.


No writeup, but the code is here:

https://github.com/SuperOptimizer/supercompiler

There's code there to generate unoptimized / optimized pairs via C generators like yarpgen and csmith, then compile, train, run inference, and disassemble the results.
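
Stripped way down, the pair generation is roughly this shape (minus the csmith/yarpgen invocation and the real include/flag handling, which the repo deals with):

    import os, subprocess, tempfile

    def asm_pair(source_path):
        """Compile one translation unit at -O0 and -O3 and return the two assembly listings."""
        out = {}
        with tempfile.TemporaryDirectory() as tmp:
            for opt in ("-O0", "-O3"):
                asm = os.path.join(tmp, opt.lstrip("-") + ".s")
                subprocess.run(["clang", "-S", opt, source_path, "-o", asm], check=True)
                with open(asm) as f:
                    out[opt] = f.read()
        return out["-O0"], out["-O3"]

    # Each (unoptimized, optimized) pair becomes one training example.
    unopt, opt3 = asm_pair("random_program.c")    # e.g. a csmith-generated file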


Yes! An AI building a compiler by learning from a super-optimiser is something I have wanted to do for a while now :-)


then maybe don't name it "LLM Compiler", just "Compiler Guidance with LLMs" or "LLM-aided Compiler Optimization" or something -- it would get the point across much better without overpromising


Yeah, the name was misleading. I thought it was going to be source to object translation maybe with techniques like how they translate foreign languages.


> The idea isn't to replace the compiler with an LLM, the tech is not there yet

What do you mean the tech isn't there yet, why would it ever even go into that direction? I mean we do those kinds of things for shits and giggles but for any practical use? I mean come on. From fast and reliable to glacial and not even working a quarter of the time.

I guess maybe if all compiler designers die in a freak accident and there's literally nobody to replace them, then we'll have to resort to that after the existing versions break.


As this LLM operates on the LLVM intermediate representation, the result can be fed into https://alive2.llvm.org/ce/ and formally verified. For those who don't know what to paste there: here is an example of the C++ spaceship operator: https://alive2.llvm.org/ce/z/YJPr84 (try to replace -1 with -2 there to break it). This is kind of a Swiss Army knife for LLVM developers; they often start optimizations with this tool.

What they missed is any mention of verification (they probably don't know about alive2) and a comparison with other compilers. It is very likely that LLM Compiler "learned" from GCC and, with huge computational effort, simply generates what GCC can do out of the box.


I'm reasonably certain the authors are aware of alive2.

The problem with using alive2 to verify LLM based compilation is that alive2 isn't really designed for that. It's an amazing tool for catching correctness issues in LLVM, but it's expensive to run and will time out reasonably often, especially on cases involving floating point. It's explicitly designed to minimize the rate of false-positive correctness issues to serve the primary purpose of alerting compiler developers to correctness issues that need to be fixed.


Yep, we tried it :-) These were exactly the problems we had with it.


> C++ spaceship operator

> (A <=> B) < 0 is true if A < B

> (A <=> B) > 0 is true if A > B

> (A <=> B) == 0 is true if A and B are equal/equivalent.

TIL of the spaceship operator. Was this added as an April Fools' joke?


This is one of the oldest computer operators in the game: the arithmetic IF statement from FORTRAN.

It's useful for stable-sorting collections with a single test. Also, overloading <=> for a type, gives all comparison operators "for free": ==, !=, <, <=, >=, >


It also has no good definition for its semantics when presented with a NaN.



How would that apply to Fortran’s arithmetic IF statement? It goes to one label for a negative value, or to a second label for a zero, or to a third label for positive. A NaN is in none of these categories.


I mean maybe I'm missing something but it seems like it behaves exactly the same way as subtraction? At least for integers it's definitely the same, for floats I imagine it might handle equals better?


C++ has operator overloading, so you can define the spaceship for any class, and get every comparison operator from the fallback definitions, which use `<=>` in some obvious ways.


The three-way comparison operator just needs to return a ternary, and many comparisons boil down to integer subtraction. strcmp is also defined this way.

In C++20 the compiler will automatically use the spaceship operator to implement other comparisons if it is available, so it's a significant convenience.


I'm not sure it's likely that the LLM here learned from gcc. The size optimization work here is focused on learning phase orderings for LLVM passes/the LLVM pipeline, which wouldn't be at all applicable to gcc.

Additionally, they train approximately half on assembly and half on LLVM-IR. They don't talk much about how they generate the dataset other than that they generated it from the CodeLlama dataset, but I would guess they compile as much code as they can into LLVM-IR and then just lower that into assembly, leaving gcc out of the loop completely for the vast majority of the compiler specific training.


Yep! No GCC on this one. And yep, that's not far off how the pretraining data was gathered - but with random optimisations to give it a bit of variety.


Do you have more information on how the dataset was constructed?

It seems like somehow build systems were invoked given the different targets present in the final version?

Was it mostly C/C++ (if so, how did you resolve missing includes/build flags), or something else?


We plan to have a peer-reviewed version of the paper where we will probably have more details on that. Otherwise we can't give any more details than in the paper or post, etc. without going through legal, which takes ages. Science is getting harder to do :-(


If I understand correctly, the AI is only choosing the optimization passes and their relative order. Each individual optimization step would still be designed and verified manually, and maybe even proven to be correct mathematically.


Right, it's only solving phase ordering.

In practice though, correctness even over ordering of hand-written passes is difficult. Within the paper they describe a methodology to evaluate phase orderings against a small test set as a smoke test for correctness (PassListEval) and observe that ~10% of the phase orderings result in assertion failures/compiler crashes/correctness issues.

You will end up with a lot more correctness issues adjusting phase orderings like this than you would using one of the more battle-tested default optimization pipelines.

Correctness in a production compiler is a pretty hard problem.
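
That flavour of smoke test is easy to sketch (this is my approximation, not the paper's PassListEval; a real check would also build and run a test suite afterwards):

    import subprocess

    def passes_smoke_test(ir_file, pass_list):
        """Apply a candidate pass ordering with `opt` and check the module still verifies."""
        cmd = ["opt", f"-passes={','.join(pass_list)}", "-verify-each", ir_file, "-o", "/dev/null"]
        return subprocess.run(cmd, capture_output=True).returncode == 0

    candidate = ["sroa", "instcombine", "simplifycfg", "gvn"]   # hypothetical ordering
    print(passes_smoke_test("module.ll", candidate))

Passing a check like this only tells you the compiler didn't crash and the IR still verifies; miscompiles can still slip through, which is exactly why adjusting phase orderings is risky.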


There are two models.

- The foundation model is pretrained on ASM and IR. Then it is trained to emulate the compiler (IR + passes -> IR or ASM).

- The FTD model is fine-tuned for solving phase ordering and disassembling.

FTD is there to demo capabilities. We hope people will fine tune for other optimisations. It will be much, much cheaper than starting from scratch.

Yep, correctness in compilers is a pain. Auto-tuning is a very easy way to break a compiler.


Let's be real, at least 40% of those comments are bots


People simply have no idea what they're talking about. It's just jumping onto the latest hype train. My first impression, given the name, was that it was actually some sort of compiler in and of itself -- i.e. programming language in, and pure machine code or some other IR out. It's got bits and pieces of that here and there, but that's not what it really is at all. It's more of a predictive engine for an optimizer, and not a very generalized one at that.

What would be more interesting is training a large model on pure (code, assembly) pairs like a normal translation task. Presumably a very generalized model would be good at even doing the inverse: given some assembly, write code that will produce the given assembly. Unlike human language there is a finite set of possible correct answers here and you have the convenience of being able to generate synthetic data for cheap. I think optimizations would arise as a natural side effect this way: if there's multiple trees of possible generations (like choosing between logits in an LLM) you could try different branches to see what's smaller in terms of byte code or faster in terms of execution.


It can emulate the compiler (IR + passes -> IR or ASM).

> What would be more interesting is training a large model on pure (code, assembly) pairs like a normal translation task.

It is that.

> Presumably a very generalized model would be good at even doing the inverse: given some assembly, write code that will produce the given assembly.

It has been trained to disassemble. It is much, much better than other models at that.


> Presumably a very generalized model would be good at even doing the inverse: given some assembly, write code that will produce the given assembly.

ChatGPT does this, unreliably.


AFAIK this is a heuristic, not a category. The underlying grammar would be preserved.

Personally I thought we were way too close to perfect to make meaningful progress on compilation, but that’s probably just naïveté


I would not say we are anywhere close to perfect in compilation.

Even just looking at inlining for size, there are multiple recent studies showing ~10+% improvement (https://dl.acm.org/doi/abs/10.1145/3503222.3507744, https://arxiv.org/abs/2101.04808).

There is a massive amount of headroom, and even tiny bits still matter as ~0.5% gains on code size, or especially performance, can be huge.


This feels like going insane honestly. It's like reading that people are super excited about using bouncing castles to mix concrete.


Unlike many other AI-themed papers at Meta this one omits any mention of the model output getting used at Instagram, Facebook or Meta. Research is great! But doesn't seem all that actionable today.


This would be difficult to deploy as-is in production.

There are correctness issues mentioned in the paper regarding adjusting phase orderings away from the well-trodden O0/O1/O2/O3/Os/Oz path. Their methodology works for a research project quite well, but I personally wouldn't trust it in production. While some obvious issues can be caught by a small test suite and unit tests, there are others that won't be, and that's really risky in production scenarios.

There are also some practical software engineering things like deployment in the compiler. There is actually tooling in upstream LLVM to do this (https://www.youtube.com/watch?v=mQu1CLZ3uWs), but running models on a GPU would be difficult and I would expect CPU inference to massively blow up compile times.


Wouldn't "Compiler LLM" be a more accurate name than "LLM Compiler"?


Never let a computer scientist name anything :-)


I am curious about CUDA assembly: does it work at the CUDA -> PTX level, or PTX -> SASS? I have done some work on SASS optimization and it would be a lot easier if an LLM could be applied at the SASS level.


Reading the title, I thought this was a tool for optimizing and disassembling LLMs, not an LLM designed to optimize and disassemble. Seeing it's just a model is a little disappointing in comparison.


my knowledge of compilers doesn't extend beyond a 101 course done ages ago, but i wonder how the researchers enriched the dataset to improve these features.

did they just happen to find a way to format the heuristics of major compilers in a half-code, half-language mix? confusingly enough, this is another use case where a (potential) tool that would let us work our way toward the solution is being replaced by an llm.


I don’t understand the purpose of this. Feels like a task for function calling and sending it to an actual compiler.

Is there an obvious use case I’m missing?


This is not a product, it's a research project.

They don't expect you to use this.

Applications might require further research. And the main takeaway might be not "here's a tool to generate code", but "LLMs are able to understand binary code, and thus we can train them to do ...".


GPT 6 can write software directly (as assembly) instead of writing c first.

Lots of training data for binary, and it can train itself by seeing if the program does what it expects it to do.


Is this GPT-6 in the room with us now?


Pretty sure I remember trading 300 creds for a Meta Technologies Neural Optimizer and Disassembler in one of the early Deus Ex games.


I love this company. Advancing ai and keeping the rest of us in the loop.


I hate the company (Facebook), but I still think them having been publicly releasing a bunch of the research they've been doing (and models they've been making) has been a net good for almost everybody, at least in terms of exploring the field of LLMs.


My love for Meta is strictly confined to FAIR and the PyTorch team. The rest of the company is basically cancer.


Is this a bot comment?


It is so funny that meta has to post it on X.




