JAM (Joe’s Abstract Machine)
In 1989 JAM (Joe’s Abstract Machine) was first implemented. Mike Williams wrote the runtime system in C, Joe Armstrong wrote the compiler, and Robert Virding wrote the libraries.
Now I'm sad :(
found that link here:
It's not working for me right now, but you may have better luck.
Every other language that I’ve studied seems to just cope with the same problems in slightly different ways.
No kitchen sink language here: play by their rules or choose a different environment.
This is the first time I've seen AsmJIT compared to HiPE. Interesting.
There's no goal of heroic optimization; the optimization goal is really just to remove the overhead of interpretation.
Because the JIT is fairly simple, it's fast enough to run at load time, which removes the need to apply it to only some code: all code is JITed to native as it's loaded. (Or you're on an unsupported platform, and all code is interpreted.)
Because it's all or nothing, testing the OTP release should uncover any JIT bugs; you won't have the hard-to-track bugs sometimes seen in other systems, where a function's correctness depends on whether or not it was JITed, and that in turn depends on runtime state. That doesn't mean no JIT bugs, of course, but they should be easier to track down.
Also, in Erlang's case, the language model will reward these first steps even more than most subsequent optimizations, because Erlang code in general doesn't have many of the tight loops that are prominent in benchmarks, where a good register allocator would provide huge wins.
More on the 90/10 rule: we need to remember that those expensive JITs are very complicated, with optimization levels and interpreters combined with tons of GC options, whereas here they explicitly dropped JIT-interpreter cross-calling to simplify the design, along with a more straightforward internal memory model with fewer complicated edge cases.
What it's not doing, but is commonly done in other JITs, is any sort of runtime profiling and choosing which modules to transform; all modules are transformed when loaded.
As described in the article, previous JIT attempts with BEAM did have that functionality, and they didn't meet the project goals: profiling cost too much, the compilation step was too expensive, and mixing modes between interpreted modules and native modules added too much complexity.
I haven't looked at the code behind this, but from articles, I haven't seen anything that would preclude running the bytecode to native code transformation ahead of time (or caching the just-in-time transformation for future use), but it's not part of the implementation as of now.
This sounds pretty similar, except there’s no fast load file.
A great deal of research has been published in the last 25 years, and some of it invalidates earlier wisdom due to changes in processor design. Just following this trail and applying the 80/20 rule could get a lot done for a little effort. And a simple JIT has half a prayer of being correct.
Andy has been working at Igalia on the JS engines of Chrome and Firefox, which makes me believe it might not be easily reached by mere mortals; but looking at the source, it is quite easy to follow, even though I would not trust myself to make any significant changes.
If you're aiming to compete with HotSpot and .NET then you'll need to invest millions, but not all JITs are that ambitious. GNU Lightning is another example of a JIT with few people behind it.
Works, is stable and faster than an interpreter? Sure, that is achievable for a motivated and skilled developer working on their own. At least for a reasonably simple language, maybe not for a beast like C++.
Competitive with one of those bigcorp-funded ones? Nope.
Why's that horrifying? Isn't inlining a super-basic and well-understood optimisation that any serious compiler would be doing?
> BEAM/C generated a single C function for each Erlang module. Local calls within the module were made by explicitly pushing the return address to the Erlang stack followed by a goto to the label of the called function. (Strictly speaking, the calling function stores the return address to BEAM register and the called function pushes that register to the stack.)
> Calls to other modules were done similarly by using the GCC extension that makes it possible to take the address of a label and later jumping to it. Thus an external call was made by pushing the return address to the stack followed by a goto to the address of a label in another C function.
> Isn’t that undefined behavior?
> Yes, it is undefined behavior even in GCC.
From the C standard:
> The identifier in a goto statement shall name a label located somewhere in the enclosing function.
`Shall` is a term of art here:
> If a "shall" or "shall not" requirement that appears outside of a constraint is violated, the behavior is undefined.
(Sections 6.8.6.1 and 4.2, respectively: http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1256.pdf)
> it is undefined behavior even in GCC. It happened to work with GCC on Sparc, but not on GCC for X86.
The BEAM is currently a bytecode interpreter. The article describes it as using threaded code, which, if you look at the C source, is true in a sense on platforms where the C compiler provides the GNU-style extension of taking the address of a label for dynamic `goto` (https://stenmans.org/happi_blog/?p=194 agrees with my skim of `beam_emu.c`). Though if that's the only use, as it seems, I might call it an edge case of the term, since I associate “threaded code” with the style used in Forth, where “user-level” subroutines can be pointed to directly. If computed `goto` isn't available, it falls back to a conventional switch/case loop, repeatedly fetching the next bytecode and dispatching on it.
Though that brings up a question... I wonder if you could write a qemu backend to the BEAM, and start working towards dynamic translation that way?
I also don't know of any other HLL JITs that use TCG, which seems like weak evidence that it wouldn't be a win.
And, would implementing the Beam bytecode using Graal polyglot layer be a good idea then? Allowing it to leverage JVM JIT ?
Erlang (pre-JIT) has a compilation step. The compiler reads the source code and generates bytecode files (.beam). The runtime interprets this bytecode; it doesn't care about the original source.
Erlang (with JIT) has the same compilation process, but at runtime, it also converts the bytecode to machine code in memory. Then it executes that code, which removes the interpretation overhead.
Given that Python uses a GIL (global interpreter lock), and Erlang typically has highly parallel workloads, my guess as an outsider would be that it's different.
Re: graal, there is erjang, but nobody uses it.
One reason is that on platforms that support computed goto, when a BEAM file is loaded, the list of bytecodes for each function is translated into a list of machine code addresses that implement each bytecode (a.k.a. threaded code). Bytecode dispatch then skips one layer of indirection vs. CPython. CPython looks up each bytecode's machine code address in an array every time it dispatches a bytecode. (Or, if the platform doesn't support computed goto, CPython uses a regular switch statement, which hopefully gets compiled to a jump table.)
Threaded code is a pretty standard implementation strategy for some languages (notably Forth), and my understanding is that it was even a pretty common compiler implementation strategy in the 1970s/early 1980s. It's a pretty easy optimization, particularly for a stack-based VM or a VM with a small number of registers.
The main downside is memory usage, as Python bytecode can be memory-mapped directly from disk and therefore shared across processes and discarded by the OS under memory pressure rather than being written out to swap. There's obviously a bit of startup overhead at class load time.
Though, I'm a bit surprised that most stack-based interpreters don't initially load classes with functions implemented as a dead-simple threaded implementation consisting of a pointer to the start of a regular bytecode interpreter and a pointer into a memory-mapped buffer of the on-disk bytecode. Based on performance counters, they could JIT the hotspots into regular threaded code.
Look up which object the method belongs to, look up the namespace, check if there's an implementation of __getattr__ or equivalent, maybe call that, get the result and then actually call the function. More complicated cases get worse.
Erlang/Elixir are much simpler and get their flexibility from functional programming. There's much less indirection so less work the interpreter needs to do.
My interpretation is that the end goal seems similar, but this project starts with the Rust language instead.