Compilers for free with weval (bernsteinbear.com)
204 points by todsacerdoti 24 days ago | 40 comments



I remember being really excited about the derive-a-compiler-from-a-partial-evaluator premise when I first ran across it. You genuinely do get a compiler from an interpreter plus a partial evaluator. That trick does work.
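
To make the premise concrete, here's the degenerate version in Python pseudocode. This `mix` is a toy that only closes over its static argument; a real partial evaluator would also unroll loops, fold constants, and drop dead code:

    # Toy sketch only: `mix` here just pairs the program with its
    # static input instead of actually specializing anything.
    def mix(program, static_input):
        return lambda dynamic_input: program(static_input, dynamic_input)

    def interp(source, data):
        # stand-in for a real interpreter loop
        return eval(source, {"input": data})

    # First Futamura projection: specializing the interpreter to one
    # source program yields a "compiled" version of that program.
    compiled = mix(interp, "input * input + 1")
    assert compiled(3) == 10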

The gotcha I didn't spot at the time is that a compiler is (I think, could be argued either way) easier to write than a partial evaluator. The "get it for free" bit only really applies if someone else gave you the partial evaluator, and if you somehow don't need to spend ages debugging why that partial evaluator isn't behaving like you hoped.

Now that I'm somewhat more beaten down by toolchain dev, I'm starting to think compilers are easier to write than interpreters. It's definitely not a clear win in favour of the interpreter. If you compile to x64, or to C, or to JavaScript or whatever, you now have a thing you can debug with whatever tools are native to that target. If you run in an interpreter, you get to debug that interpreter running the program, with whatever somewhat ad hoc debugging tools you put into the interpreter yourself.

Getting useful semantic error messages out of an interpreter at partial evaluation time (aka the "compile time" of the aggregate tool) is probably solvable but not likely to work out of the box. Partial eval isn't really a phase-separation-friendly thing.


This is indeed a good point and something I want to write about when I eventually do a blog post on weval.

A few counterpoints that I'd offer (and what led me to still take this approach):

- If the target has sub-par debugging infrastructure, it can be easier to debug an interpreter (which is portable) and then apply the semantics-preserving PE. In particular when targeting Wasm outside the browser, there is... not really a good debug experience, anywhere, for that. It was way easier to get an interpreter right by developing on native with gdb/rr/whatever, and then separately ensure weval preserves semantics (which I tested with lockstep differential execution; sketched at the end of this comment).

- Maintenance: if one is going to have an interpreter and a compiler anyway (and one often wants this or needs this e.g. to handle eval()), easier for them both to come from the same source.

- Amortized cost: in the Wasm world we want AOT compilers for many languages eventually; there are interpreter ports with no Wasm backends; developing weval was a one-time cost and we can eventually apply it multiple times.

- If the semantics of the existing interpreter are quite nontrivial, that can push the balance the other way. I designed weval as part of my work on SpiderMonkey; extremely nontrivial interpreter with all sorts of edge cases; replicating that in a from-scratch compiler seemed a far harder path. (It's since been done by someone else and you can find the "wasm32 codegen" patchset in Bugzilla but there are other phasing issues with it from our use-case's PoV; it's not true AOT, it requires codegen at runtime.)

I don't think the tradeoff is always clear and if one is building a language from scratch, and targeting a simple ISA, by all means write a direct compiler! But other interesting use-cases do exist.
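
For the curious, the lockstep testing mentioned above is conceptually just the following (a simplified sketch with hypothetical harness names, not the actual test setup):

    # Run every test program through both the plain interpreter and
    # its wevaled counterpart, and demand identical results.
    def differential_test(run_interp, run_wevaled, programs, inputs):
        for prog in programs:
            for inp in inputs:
                expected = run_interp(prog, inp)   # plain interpreter
                actual = run_wevaled(prog, inp)    # after partial eval
                assert actual == expected, (prog, inp, expected, actual)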


The divergent semantics risk between the interpreter and the compiler is a really big deal. It's genuinely difficult to get a language implementation to behave exactly as specified, even when the spec is "do the same as some other implementation". Treating "compiled code" as specialising the interpreter with respect to the program is a great solution to that, since the bugs in the optimiser/partial-evaluator (they're kind of the same thing) are unlikely to be of the same class as bugs from independent implementations.

Wasm is a really solid target for heroic compiler optimisations. It's relatively precisely specified, user facing semantic diagnostics are in some language front end out of sight, aliasing is limited and syscalls are finite with known semantics. Pretty much because it was designed by compiler people. You've picked a good target for this technique.


One problem with Wasm is that in practice there's not as much optimization work to do as you might expect, as the higher-level compiler which produced it already did a lot of the work. Of course you still need to lower the stack machine + locals to actual spills/fills/etc., but still a chunk of the work is already done for you.


> The gotcha I didn't spot at the time is that a compiler is (I think, could be argued either way) easier to write than a partial evaluator.

I agree that this is sometimes true, but there are several, IMHO bigger, issues:

1. The partial evaluation of the interpreter might degenerate back to getting a copy of the interpreter loop (or multiple!) and not achieve any speedup, just memory overhead

2. The partial evaluator might need additional annotations on what is and is not mutable; it might need a lot of help in order to constant-fold away a key piece of data. Tuning that and getting it wrong can result in #1.

3. Partially evaluating the interpreter makes for a pretty slow compiler. You need the second Futamura projection (apply the partial evaluator to itself, specializing it to the interpreter before any user code arrives) in order to get a fast compiler. That means the partial evaluator needs to be expressed in its own input language/IR.
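
For reference, the three projections in pseudocode, where mix(p, s) specializes program p to static input s:

    target   = mix(interp, source)   # 1st projection: a compiled program
    compiler = mix(mix, interp)      # 2nd projection: what point 3 needs
    cogen    = mix(mix, mix)         # 3rd projection: a compiler generator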

That said, I've chatted a bit with Chris offline about this, and I think this work is really cool and promising. I might tinker with this a little in Wizard too.


That is better expressed than my rambling! The first Futamura projection is satisfied by passing the "program" directly to the interpreter without optimising either, which is sometimes useful as a packaging exercise and does look somewhat like a non-optimising compiler.

If you want this trickery to make a useful compiler, the partial evaluator picks up global value numbering, dead code elimination, loop transforms; an effective partial evaluator is very much the optimisation passes of a compiler. It can still be a win if someone else wrote said partial evaluator.
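
Concretely, the win mostly comes from the dispatch folding away once the bytecode is static. A toy illustration:

    # Toy dispatch loop: with `bytecode` a known constant, a good
    # partial evaluator folds the `op == ...` tests and the loop away.
    def run(bytecode, x):
        acc = x
        for op, arg in bytecode:   # static
            if op == "MUL":
                acc *= arg
            elif op == "ADD":
                acc += arg
        return acc

    # Specializing run to [("MUL", 2), ("ADD", 1)] should leave the
    # residual program: lambda x: x * 2 + 1. No loop, no dispatch.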

Point 2 is the usual "my compiler is not sufficiently smart, I must annotate the program" problem, with the slight twist that you're annotating the interpreter in the hope that it does useful things with the end program. Interacts with hoping someone else built the dependency well.

And yeah, generated compilers in this fashion have a reputation for being slow, and for not being great optimising compilers, where self-application might dig you out of the hole. Much like adding optimisations to your compiler to make your compiler faster when building your compiler.

All in all, the "compiler for free" tagline (not specifically this post; it's written on a lot of Futamura references) feels a bit like an inside joke. It reminds me of the sad discovery that a metacircular interpreter can't interpret itself after all, since what you've written is a heavily obfuscated infinite loop.


weval author here (thanks Max for the blog post!). Also AMA!

The talk about weval that Max mentions was at NEU and also CMU; the latter was recorded and is here: https://vimeo.com/940568191

I also plan to write up a blog post on weval in depth, plus its application to SpiderMonkey-on-Wasm, in a few months; it's pretty exciting though, currently getting 4-5x speedups on some benchmarks on a decidedly nontrivial interpreter!


This is amazing! I really do look forward to that writeup.

Would investigating the mapping from interpreters down to the assembly that Wasmtime generates be something worth looking into? I got distracted by the Vimeo presentation; does the weval blog post cover this? How high can the interpreter/compiler tower go?

Is Weval looking to become the PyPy of Wasm?


The most elaborate deployed version of this idea is probably PyPy, an alternative Python implementation that works by giving a JIT-compiling partial evaluator a hint-annotated Python interpreter as the program, and the Python code to be executed as its input. Slightly abstruse overview here: https://doc.pypy.org/en/latest/architecture.html
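
The hints look roughly like this (simplified from the RPython tutorial; "green" variables are the ones the tracer treats as static and specializes on):

    from rpython.rlib.jit import JitDriver

    driver = JitDriver(greens=["pc", "bytecode"], reds=["acc"])

    def run(bytecode, acc):
        pc = 0
        while pc < len(bytecode):
            driver.jit_merge_point(pc=pc, bytecode=bytecode, acc=acc)
            # ... dispatch on bytecode[pc], bump pc ...
            pc += 1
        return acc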


PyPy used partial evaluation at one point, but now it's "meta-tracing":

Why did we Abandon Partial Evaluation? - https://www.pypy.org/posts/2018/09/the-first-15-years-of-pyp...


Taking meta-tracing to the next level would reconstruct not just hot loops but even the original CFG, as speculated in: https://news.ycombinator.com/item?id=38778372

Anyone know of non-fictional work in this area?


Futamura projections should still be the basis for GraalVM


They are, for Truffle-based language implementations: https://chrisseaton.com/rubytruffle/pldi17-truffle/pldi17-tr...


I think PyPy moved away from partial evaluation to trace optimization, fwiw


If the wasi+weval version is better than the compiled one, doesn't it mean that the compiler did a bad job of optimizing out the constant parts?

To be fair, in order for weval to do its magic, the code has to be changed and annotated. I'm wondering if a similar set of annotations could convince clang to generate a binary that runs in less than 40 ms.


Clang doesn't know which parts of the input are static, so it can't do the optimizations this does.

You could probably do the same thing with some hacky C++ code gen and templates, but I'm not sure clang et al would be able to optimize them.


In the limit, that's called Copy and Patch!


Isn't copy-and-patch just jitter, the new-fangled kids' version?


As far as I can tell from the jitter docs, no; the big thing about copy-and-patch is that you can piggyback on the existing C compiler and whatever backends it already supports.


> piggyback on the existing C compiler and whatever backends it already supports

that is the whole point of jitter


Does jitter compile C at run-time or just stitch it together? (EDIT: looks like stitch at run-time)

The other thing C&P does is allow for a library of stencils, stitching them together intelligently depending on where arguments are coming from, where the return should go, where the flow of control should go next, etc.

EDIT: So I guess they look fairly similar, huh


yup, jitter does that too; e.g. https://ageinghacker.net/talks/jitter-slides--saiu--bts-2022... page 23 and 35

fwiw i thought this was a stupid, hacky, pointless approach when it was called jitter and i still think it's a stupid, hacky, pointless approach when it's called copy-and-patch ;)

but still like to see credit where it's due


I wonder if they're truly unaware of one another and developed in parallel or not. Hmmm


I will never not be amazed by Futamura Projections.



Author here! Feel free to ask questions or leave comments or whatever.


Thanks for the write-up.

Neil Jones' (free) book about partial evaluation deserves a mention here, since it explains how/why it works in some detail.

Query languages are a very cool application of this stuff. I was planning to use wasm for the Mangle implementation (a language extending Datalog, implemented as a library for deductive databases), following Tiark Rompf and Nada Amin's "A SQL to C compiler in 500 lines of code"... but of course I only have the interpreter for now and started a Rust rewrite instead.


The free book on partial evaluation: http://www.itu.dk/people/sestoft/pebook/


Is the book enlightening beyond what one can learn from sources like this post and the aforementioned paper by Rompf and Amin? I'm trying to decide whether I should spend the time to dig deeper into it.


I understand that by making the bytecode constant a lot of optimization becomes possible. But I can't see how you can unroll the interpreter loop past unknown inputs to the program?

That is, if the program is a pure function like power(x, y) (from the top of the article) then I understand if the bytecode is constant and the inputs x, y are constant then you can unroll completely. In fact, in that case, you can just do a full evaluation of the entire program to a constant. But, if x and y are unknown at compile time then you can do very little, I think?

I guess the example at the top of the article is saying that if you know y but not x then partial evaluation can still help a lot. But in a real-world program the first unknown input that causes a significant branch in execution will block progress, won't it? (edit: in particular unrolling has to stop, unless you consider multiple paths)


Weval has intrinsics that unroll the loop and specialize on the PC, merging paths with the same PC. So if you pick a good specialization context like PC, the unrolling stops eventually.
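
In pseudocode, the specialization driver is roughly this (illustrative only, not weval's actual implementation):

    # One residual block per PC, so unrolling terminates: there are
    # only finitely many PCs to visit.
    def specialize(bytecode):
        blocks = {}          # pc -> residual code for that context
        worklist = [0]
        while worklist:
            pc = worklist.pop()
            if pc in blocks:
                continue     # same PC, same context: paths merge here
            blocks[pc] = compile_block(bytecode, pc)      # hypothetical helper
            worklist.extend(successor_pcs(bytecode, pc))  # hypothetical helper
        return blocks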


For a given function like `power`, you can unroll the interpreter loop and eliminate branches on the instruction, so the code has less branching overall.


But can you only do that when you know what the inputs are? That is, if x and y are determined only at runtime for power(x, y) then I don't see what can be optimized. But that seems like the common case to me. Likely I'm missing something and maybe not asking the question clearly, sorry.


> That is, if x and y are determined only at runtime for power(x, y) then I don't see what can be optimized.

Yes, the example in Max's post is specifically assuming one wants to generate a specialized version of `power` where `y` is fixed.

To take it back to weval: we can know what the bytecode input to the interpreter is; we provide an intrinsic (part of the "wevaling" request) to indicate that some function argument is a pointer to memory with constant, guaranteed-not-to-change content. That, together with context specialization on PC (another intrinsic), allows us to unroll the interpreter loop and branch-fold it so we get the equivalent of a template method compiler that reconstitutes the CFG embedded in the bytecode.
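
The shape of a weval-ready interpreter loop, in Python-flavoured pseudocode (the intrinsic names here are illustrative, not the actual C API):

    def interp(bytecode, args):
        assume_const_memory(bytecode)  # illustrative intrinsic: this memory never changes
        pc = 0
        push_context(pc)               # illustrative intrinsic: specialize per-PC
        while True:
            op = bytecode[pc]          # folds to a constant within each context
            # ... dispatch branches fold away, leaving straight-line code ...
            pc = next_pc(op, pc)       # hypothetical helper
            update_context(pc)         # illustrative intrinsic: continue in pc's block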


Thanks, I think I see now. So `y` is the bytecode, in the analogy. Makes sense.

(For me at least a concrete example would have helped, something like showing the specialized output of running on the bytecode for `power` with that interpreter. But maybe that would be too verbose...)
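
Something like this is what I had in mind (an illustrative residual, not actual weval output): specializing the interpreter to power's bytecode, with x and y still runtime inputs, should leave roughly the loop the bytecode encodes, minus all dispatch:

    def power_specialized(x, y):
        acc = 1
        while y > 0:    # the branch-folded interpreter blocks, one per PC
            acc *= x
            y -= 1
        return acc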


This is cool, but the post swaps in a tiny interpreter because it says that CPython and SpiderMonkey are both huge. It then goes on to talk about modifying the interpreter to make some specific info visible to weval, and I can imagine that those changes would be considerable for CPython or SpiderMonkey. But if we ignore those specialization sections, and just consider the material up through the "Enter weval" section ... could weval have handled CPython or SpiderMonkey (or some other real interpreter that people use regularly)? Or does the partial evaluation have some scaling behavior that explodes or bogs down with large or complex interpreters? I.e. was the tiny interpreter needed just to make the following sections compact, or does even the basic use of weval require a small input?


You should wait for Chris's upcoming post :) The minimal changes are the same as the ones I have for my tiny interpreter. There are some more advanced changes you can make for more performance.


Indeed; the tl;dr is "yes it works on SpiderMonkey, and was designed to do so". Blog post in a few months, I promise!


This is really awesome. Mad props to Chris for working on weval, and to Max for the great overview in the blog post.

I found the video on Vimeo quite useful for understanding how it works: https://vimeo.com/940568191


I'm not sure I understand. This would be useful if, for instance, a runtime (compiled as a wasm function) accepted code as input to be run.

But not if the code was compiled to wasm alongside the runtime?

Asking because I have compiled the Go compiler and the standard library package as wasm. Would weval help here? Is that what wasm-bindgen does?




