Very different technologies. The PyPy JIT is a tracing JIT: hot paths of execution are identified, and each such path is then compiled and optimized. Same approach as LuaJIT.
The CPython JIT uses a newer and less invasive technique called copy-and-patch[1]. It's a lot less powerful, but a lot easier to plug into an existing language implementation: known sequences of Python bytecode are mapped to templates of machine code.
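For illustration, the template idea boils down to something like the following sketch (highly simplified Python, not CPython's actual machinery; the opcode name and the x86-64 encoding are just examples):

    # Toy illustration of a template JIT / copy-and-patch: pre-assembled
    # machine-code templates contain "holes" that get patched with runtime
    # values. Real implementations derive the templates and hole locations
    # from compiled C/C++ code plus its relocation records.
    TEMPLATES = {
        # opcode -> (template bytes, byte offsets of 8-byte holes)
        # example template: movabs rax, <imm64> on x86-64
        "LOAD_CONST": (b"\x48\xb8" + b"\x00" * 8, [2]),
    }

    def emit(opcode: str, operand: int) -> bytes:
        code, holes = TEMPLATES[opcode]
        buf = bytearray(code)                                  # copy the template...
        for off in holes:
            buf[off:off + 8] = operand.to_bytes(8, "little")   # ...and patch the holes
        return bytes(buf)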
I'm a bit surprised that copy-and-patch is "new". I remember writing a JIT framework that did something similar back in the aughts and I got the idea from reading docs that were already old then.
I understand people at IBM were doing it industrially for Java bytecode in the late 90s under the name quasi-static compilation, and the DyC/Tempo folks were doing similar things over in C land under different names. There were some minor differences due to the technology of the day, but it was broadly similar. For example, explicitly building to IR was uncommon outside Java land and register scheduling didn't have a lot of choices to make. The Java stuff even allowed the template specializations to exist on other computers and be dynamically loaded and validated over the network, for thin-client reasons. Very cool for the early 2000s.
Template JITs aren't new. Copy-and-patch is a specific scheme for automatically creating a template JIT by using relocations to generate the templates from normal C++ code. That Wikipedia page is just very bad.
Compile enough traces with the accompanying optimizations and you have solid training data for an LLM that can suggest optimizations based on JIT traces.
All the minimized inefficiencies found by my script are already optimizations. They just happen to be rather specific patterns, so they need to be suitably generalized. There's another Regehr et al. paper about how to do that automatically:
https://dl.acm.org/doi/10.1145/3649837
(LLMs aren't involved; it's all based on Z3)
I don't plan on implementing something like this for now; I'd rather take the inefficiencies, manually extract optimizations out of them, and implement those in PyPy's JIT.
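To give a sense of what the Z3 side looks like: checking a single candidate rewrite rule for validity over all 64-bit integers is roughly this (a hypothetical rule for illustration, not one taken from PyPy):

    from z3 import BitVec, prove

    x = BitVec("x", 64)
    # candidate peephole rule: x * 2  can be rewritten to  x << 1
    prove(x * 2 == x << 1)   # prints "proved"
    # an unsound candidate instead yields a counterexample:
    prove(x * 2 >= x)        # fails, e.g. when x * 2 overflows into the sign bit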
If you already know what the optimizations are, you don't need AI to apply them. Just write the optimizations in the compiler. That's literally one of the biggest jobs of the compiler already.
I meant: how do you make sure the optimization suggested by the AI is actually valid? If you're using AI to modify bytecode for faster execution, then you have to make sure the optimized and unoptimized code are semantically equivalent. Neural networks can't do logic, so how would you know the suggestions were not bogus?
You've asked the right question, and for those who think validation is as simple as "run it and see if it gets the right result": good start, but instruction ordering can be critical around multi-thread-aware data structures. Taking out a fence or an atomic operation might give a big performance gain. The trouble is that the structure may now go wrong 1% of the time.
JIT optimizers operate at runtime; there are no test suites to verify before/after. It's happening live as the code is running, so if you use AI you won't know whether the optimization is actually valid. This is why the article is using Z3 instead of neural networks: Z3 can validate semantic equivalence, neural networks can't.
Yes, but this Z3 analysis is not done at runtime. It's done offline, based on JIT traces. A neural network could, in principle, suggest optimizations in the same way, which an expert would then review for possible inclusion into the PyPy JIT.
You'd still have to write a proof verifying semantic equivalence before implementing the optimization, so I don't see what the neural network gains you here, unless it is actually supplying the proof of correctness along with the optimization.
The idea is that the LLM would provide "intuition" to guide the optimizer to find better optimizations, but a formal proof would be necessary to ensure that those optimizations are actually valid.
PyPy doesn't do this in general. The same Z3 model that is used to find these missing optimizations is also used to verify some integer optimizations.
But the point is that as long as optimization rules are hand-written, a human has thought about them and convinced themselves (maybe incorrectly) that the rules are correct. If a machine generates them without a human in the loop, some other sort of correctness argument is needed. Hence the reasonable suggestion that they should be formally verified.
PyPy has formally verified the integer abstract domain using Z3, a quite important part of our JIT optimizer (I will write about that in the coming weeks).
We also run a fuzzer regularly to find optimization bugs, using Z3 as a correctness check:
The peephole optimizations aren't themselves formally verified completely yet. We've verified the very simplest rules, and some of the newer complicated ones, but not systematically all of them. I plan to work on fully and automatically verifying all integer optimizations in the next year or so. But we'll see, I'll need to find students and/or money.
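For readers wondering what verifying an abstract domain with Z3 looks like in practice, here is a rough, self-contained sketch: a toy known-bits domain and the transfer function for bitwise AND, checked for soundness over all inputs (illustrative only, not PyPy's actual code):

    from z3 import BitVec, And, Implies, prove

    WIDTH = 64

    def contains(ones, unknowns, value):
        # a concrete value matches the abstraction if all known bits agree
        return value & ~unknowns == ones

    x, y = BitVec("x", WIDTH), BitVec("y", WIDTH)
    x_ones, x_unk = BitVec("x_ones", WIDTH), BitVec("x_unk", WIDTH)
    y_ones, y_unk = BitVec("y_ones", WIDTH), BitVec("y_unk", WIDTH)

    # transfer function for AND: a result bit is known-one if known-one in
    # both inputs, known-zero if known-zero in either, unknown otherwise
    res_ones = x_ones & y_ones
    known_zeros = ~(x_ones | x_unk) | ~(y_ones | y_unk)
    res_unk = ~(res_ones | known_zeros)

    well_formed = And(x_ones & x_unk == 0, y_ones & y_unk == 0)
    # soundness: whenever the abstractions describe x and y, the result
    # abstraction describes x & y -- checked symbolically for all inputs
    prove(Implies(And(well_formed,
                      contains(x_ones, x_unk, x),
                      contains(y_ones, y_unk, y)),
                  contains(res_ones, res_unk, x & y)))   # prints "proved"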
Ah, yes, I meant that the LLM could output suggestions, which a human would then think about, convince themselves of, and only then implement in PyPy.
Presumably the LLM would generate a lot of proposed rules for humans to wade through. Reviewing lots of proposed rewrites while catching all possible errors would be tedious and error-prone. We have computers to take care of this kind of work.
Perhaps not, but they're based on heuristics and checks that are known, checked and understood by humans, and aren't prone to hallucination like LLMs are. An LLM suggests something that looks plausible, but there's no guarantee that its suggestions actually work as intended, hence the need for a proof.
I added a few somewhat similar optimizations to Racket. The problem is the corner cases.
For example (fixnums are small integers), is it valid to replace
(if (fixnum? x)
    (fixnum? (abs x))
    true)
with just the constant
true
?
Try running a few tests, common unit tests and even random tests. Did you spot the corner case?
It fails only when x is the most negative fixnum, which is also a very rare case in a real program. (IIRC, the random test suite tries to use more of these kinds of problematic values.)
Generate the Z3 too - as the need is to verify, not test. It can be a direct translation: for all inputs, is the optimized output equivalent? (Bootstrapping a compiler prototype via LLMs is nice, though.)
One place LLMs get fun here is where the direct translation to Z3 times out, such as with bigger or more complicated programs; there the LLM can provide intuition for pushing the solver ahead.
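Concretely, a direct translation of the fixnum example above could look like this (modelling a fixnum as a W-bit two's-complement integer; the exact width is implementation-dependent, 63 is just an assumption here):

    from z3 import BitVec, If, prove

    W = 63                      # assumed fixnum width; varies per implementation
    x = BitVec("x", W)
    abs_x = If(x >= 0, x, -x)   # >= and unary minus are signed on Z3 bitvectors
    # is (abs x) of a fixnum always still a fixnum, i.e. never overflowing W bits?
    prove(abs_x >= 0)           # Z3 reports the counterexample: the most negative fixnum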
Sure, for booleans you can just test all combinations of input arguments. In some cases you can do the same for all possible 32-bit float or int values that you have as input. But for 64-bit integers (let alone several of them) that's not feasible.
As long as we can agree that we are testing the application logic and not the compiler or hardware, then if (a > 4) {...} else {...} can be tested with just 3, 4 and 5; no need to test -430 or 5036.
This is known as boundary value testing: you partition all inputs into equivalence classes, then make sure your tests contain a sample from each class.
Making sure the tests contain a sample from each class is the hard part. For example, in your `if` example above, it may happen that the code computing `a` is such that `a >= 5` is impossible, so that equivalence class can never occur. As such you can't have a test for it, and instead you'd have to prove that it can never happen, but this reduces to the halting problem and is not computable in general.
And even ignoring that problem, there may be an infinite number of equivalence classes once you introduce loops or recursion, since the loops can run a different number of times and thus lead to different executions.
Even just considering `if` statements, the number of equivalence classes can be exponential in the number of `if`s (for example, consider a series of `if`s where each checks a different bit of the input; ultimately you need every combination of bits to cover every combination of `if`s, and that number is 2^(number of ifs)).
And most algorithms allow too many inputs anyway: even a simple algorithm computing the addition of two 64-bit numbers allows 2^128 possible input combinations, which would take billions of years to check exhaustively even in the best case.
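This is exactly where a solver earns its keep: instead of enumerating inputs it reasons about all of them symbolically. A toy comparison:

    from z3 import BitVec, prove

    # Exhaustively testing a property of two 64-bit inputs means ~2^128 cases.
    # Z3 instead proves it for all inputs at once, usually in well under a second.
    a, b = BitVec("a", 64), BitVec("b", 64)
    prove((a + b) - b == a)   # holds for every pair, including overflow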
I ask because Z3 has been used for type inference (Typpete) and for solving equations written in Python.