The performance work in Erlang at the moment is really exciting. It's easy to ge...

tiffanyh · on April 26, 2022

I don't want this to come across as snarky or unappreciative but note that JIT work on Erlang has been under active develop since at least 2014 (8 years) [0] if not earlier.

The current JIT provides about 25% improvement while other languages like PHP have sen a near 200% improvement during that same timeframe.

I also realize that Erlang is extremely difficult to increase performance of due to the very nature of it vs other languages.

[0] https://llvm.org/devmtg/2014-04/PDFs/Talks/drejhammar.pdf

derefr · on April 27, 2022

> but note that JIT work on Erlang has been under active develop since at least 2014

Kind of, but that work (BEAM JIT, HiPE, etc.) was done by third-party academic researchers and then contributed to Erlang, where it languished, because none of the core developers were academic researchers with the knowledge to improve or even really change what had been contributed. HiPE was disabled in Erlang 22 simply because some new instructions were added and nobody knew how to extend HiPE with support for them. You might call this previous work "false starts" at optimization.

The work going on since Erlang 24, however, is being done in-tree by the core maintainers themselves. It will stick around, and be gradually improved upon.

Alongside this work, comes a number of recent features in the last few Erlang releases, all with the specific aim of supporting performance-oriented code — e.g. atomic counters, persistent_term, ETS high-read-concurrency improvements, etc.

Basically, the Erlang devs have seemingly shifted from their previous "enterprise embedded use-case enablement" posture, more toward a "high-performance OLTP use-case enablement" posture.

tiffanyh · on April 27, 2022

> Kind of, but that work (BEAM JIT, HiPE, etc.) was done by third-party academic researchers and then contributed to Erlang

Is that correct? Lukas Larsson is a core developer and he's been working on JIT since 2012. (While yes, I agree academics have also worked on it - the core team has as well for equally amount of time)

https://www.erlang-factory.com/conference/SFBay2012/speakers...

derefr · on April 27, 2022

My understanding is that the integration work for these huge academic patchsets, took up the lion's share of any time the maintainers booked as "working on JIT." If not, these patchsets at least got in the way of useful progress in JIT work, redirecting "work on JIT" into themselves; where progress gradually slowed due to the "big ball of mud" that each of these patchsets were; and then all such progress was lost when the patchsets were given up on altogether.

jlouis · on April 26, 2022

PHP is the language which would implement a loop by pushing a byte offset to a stack and seek in the file at the end of the loop. Massive improvement if you had a disk cache.

It's all relative to where you come from. The byte code interpreter of BEAM is very close to a poor man's JIT. It has some op-fusion, and uses threaded code. So gaining efficiency will require lot of work.

dgb23 · on April 26, 2022

Also the languages are almost extreme opposites in terms of how they execute as they make entirely different tradeoffs.

nickjj · on April 26, 2022

Is the BEAM's byte code interpreter comparable to something like opcache with PHP?

[0]: https://www.php.net/manual/en/intro.opcache.php

jlouis · on April 27, 2022

BEAM is precompiled bytecode. The loader will peephole optimize the byte code replacing some instruction sequences with optimized variants. Bytecode resides in memory already, and there is no disk seeking.

ReactiveJelly · on April 27, 2022

> PHP is the language which would implement a loop by pushing a byte offset to a stack and seek in the file at the end of the loop.

I didn't quite understand this. Is it some famous PHP implementation thing?

jlouis · on April 27, 2022

It's from one of the first implementations of the language. If you want to move fast, and you have no ambition about execution speed, this is a solution to the problem of implementing loops.

PHP worked by parsing code directly from the file for every executed statement. So you ran the full lex->parse->interpret path for each statement. A loop need to jump back, which can be tracked by the file offset position, so you can redo the lex->parse->interpret step for the loops statements again.

Modern PHP will not do this, because it's highly inefficient. I'm not even sure it survived into the 2000's.

jjtheblunt · on April 27, 2022

> PHP have <seen> a near 200% improvement

perhaps because PHP's implementation was grotesquely wasteful?

kaba0 · on April 26, 2022

Adding to that, wouldn’t adding Erlang’s unique features to the JVM be easier than the other way around? Especially with GraalVM’s unique approach that manages to elevate Ruby’s performance far higher than any other runtime?

derefr · on April 27, 2022

Erlang's semantics are deeply intertwined with the unique things the Erlang runtime does.

For example, the Erlang abstract machine does straight-line non-preemptable atomic execution within bytecode basic-blocks, with reduction-checking for yield exactly/only at stack-frame manipulation points (i.e. call/ret/tail-call.)

Those points are guaranteed to occur after O(1) reductions, because of an ISA design that contains no unbounded local looping primitives — i.e. no way to encode relative jumps with negative offsets. (Note that this design requirement — and not any functional-programming ideal — is why Erlang uses tail-calls for looping. It has to; there's no other way to do loops given the ISA constraints!)

This atomicity of bytecode basic-blocks is what guarantees that actors can be hard-killed without corrupting the abstract-machine scheduler they run on (they die at their next yield-point, with the scheduler in a coherent state). It's a fundamental difference between Erlang scheduling and JVM scheduling.

The JVM doesn't have this atomicity, and so you can't hard-kill a Java thread without corrupting the JVM. Instead, you can only softly "interrupt" threads — sending them a special "please die" signal they have to explicitly check for. This means that JVM languages can't support anything like Erlang's process links — i.e. JVM concurrency frameworks can't propagate failure downwards through a supervision hierarchy in a way that actually releases resources from long-running CPU-bound sub-tasks. This in turn means you can't reliably bound resource usage under high-concurrency scenarios; which means that, essentially, all the things that people get excited about adding to Java with Akka, Loom, etc. don't actually do much to help the use-cases they attempt to address.

This last is personal experience, by the way. My company develops backend server software in both Erlang (Elixir) and Java. We actually tried Loom as a way of fixing some of the robustness-under-concurrency problems with the JVM; but the problems are much more fundamental than just adding features like virtual threads can resolve.

kaba0 · on April 27, 2022

I didn’t meant it as “only” a library, but as a fork. I still don’t necessarily see why it would be an insurmountable problem to fork the OpenJDK project and add atomic execution to it (do note that I’m sure it has the necessary mechanism for that — JIT deoptimizations pretty much require dropping the previously calculated results, so with some change it could be doable).

It wouldn’t be a soft fork and would still require plenty of resources no doubt, but I still feel like going this direction will better utilize the insane dev-hours that went into the OpenJDK’s stellar performance, than trying to fix the performance issues of Erlang. But do take everything I said with a huge grain of salt, as I am neither an OpenJDK contributor and as you can see don’t know much about the Erlang ecosystem. And thank you for the very informative look behind the scenes.

di4na · on April 27, 2022

Yeah no. The problem here is that a huge part of said jit performance on jvm depends on the assumption that you can elide information that the schedulers and tracing would need.

Reverting that would need to revert nearly all advanced optimisations plus to totally change the memory model and the GC. At this point, you have already lost all benefits.

dszoboszlay · on April 27, 2022

This is not exactly how the Beam VM works.

> an ISA design that contains no unbounded local looping primitives

Actually, most of the Beam instructions are non-O(1). For example since integers are unbounded in Erlang even simple arithmetic (+, -, etc.) may turn out to be a non-constant time operation (even though most of the time your integers will fit into a single machine word, so it'd rarely be a problem). But there's also a built-in for appending lists (++), which obviously contains an unbounded loop in it.

The solution is that these instructions are written so that they do work on chunks of data (e.g. ++ may process 1000 elements of a list in a chunk) after which they increment the reduction counter and possibly schedule out the process.

> there's no other way to do loops given the ISA constraints

The Beam ISA allows looping. You can even write a hand-crafted loop that will never increment the reduction counter and thus will deadlock a scheduler. But the Erlang compiler will never generate such a loop for you.

On the ISA level there are only labels where you can jump to. You can jump forwards and backwards. So you could implement a language that offers loops and still compiles to Beam. However, since jumps don't increment the reduction counter, you would either risk your loops breaking the fair scheduling of processes, or you would have to ensure that the loop body contains an operation that increments the reduction counter and allows the scheduler to suspend the process.

> This atomicity of bytecode basic-blocks is what guarantees that actors can be hard-killed without corrupting the abstract-machine scheduler they run on

Well, it is of course important that you don't interrupt the scheduler at an arbitrary point, midway executing an opcode. But there are no atomically executed bytecode blocks. Actors are free to kill not because they would run their code in uninterrupted atomic blocks, but because they don't share state (their heap) with each other. So if you have an actor that holds e.g. a binary tree, and it is half way into inserting a value into the binary tree when you kill it, it may leave the binary tree in an inconsistent state, but that doesn't matter, because no one else have access to this data structure: it lives on this process' own heap.

When processes use shared resources (such as ETS tables, files or a gen_server process) and they are killed, they may very well leave that shared resource in an inconsistent state, just not on the VM layer, but on the application logic layer. So the file will still be usable as a file, but it may contain corrupted data for example.

> The JVM doesn't have this atomicity, and so you can't hard-kill a Java thread without corrupting the JVM. Instead, you can only softly "interrupt" threads.

If you would port Erlang to the JVM, that would be the least of your problems. The compiler could just insert code to check for these signals every now and then. I go further: if you'd run Erlang code (and only Erlang code) on the JVM, it wouldn't even matter that you don't have separate heaps. Every process would only use a separate part of the shared heap, so they couldn't tip on each other's toe. The GC could take care of the rest as usual.

I think there are two real issues with porting to the JVM:

* Mapping an Erlang process to an OS thread would only work up to some reasonably low number of Erlang processes. After that you'd have to switch to a green thread model with schedulers, which is a lot of work to implement. * The Beam put a lot of effort into making the VM scale well to a lot of schedulers. Things like how to implement a message box where 100+ schedulers can concurrently push messages to. You'd probably have to implement similar optimisations for the data structures you'd use for message boxes, ETS tables etc. on the JVM too.

kaba0 · on April 27, 2022

As discussed under my first comment by others, the JVM will soon get Loom, which might solve the first issue you mention. It will effectively give one the option to run a thread on a virtual one, jumping to another one at any blocking operation.

derefr · on April 27, 2022

> Actually, most of the Beam instructions are non-O(1).

I said O(1) in BEAM reductions per basic-block, not O(1) in underlying CPU instructions. This is why reductions, rather than pure "instructions executed", are counted: it allows each op (or BIF/NIF call) to account for how expensive executing it was.

> The solution is that these instructions are written so that they do work on chunks of data (e.g. ++ may process 1000 elements of a list in a chunk) after which they increment the reduction counter and possibly schedule out the process.

I was eliding reference to these for simplicity. The precise semantics are that "simple ops" are required to be O(1) reduction-bounded; while BIFs/NIFs (incl. things like `erlang:++/2`) aren't, but then must necessarily be implemented with their own internal yield points; and the instructions which invoke them will also potentially yield before/after the invocation. Essentially, within the Erlang abstract-machine model, BIF/NIF invocations act as optimized remote function calls that might have associated "bytecode intrinsics" for invoking them, rather than as regular ISA instructions per se.

> The Beam ISA allows looping.

Yes, but BEAM programs that use such code shouldn't be considered valid.

The BEAM loader doesn't do load-time checks like that, but that's because the BEAM loader is on the inside of an assumed trust zone created by the Erlang compiler. (I.e. BEAM bytecode is implicitly assumed by the loader to be pre-validated at compile time. This is the reason every attempt at untrusted mobile code execution in Erlang has failed—to use Raymond Chen's phrasing, load-time is already "on the other side of the airtight hatchway." If you wanted to allow people to execute untrusted code, you'd need to move the hatchway!)

Tangent: this is an annoying aspect of calling the Erlang emulator the "Erlang Abstract Machine" — it's not. An abstract machine is a model of runtime semantics, formed by a compiler/interpreter, VM, runtime libraries, and even standard library, all working together to run code under that model.

(Compare and contrast: the C abstract machine. It is a model of runtime semantics that exists as only 1. compile-time enforcement by C compilers, and 2. libc. It has no VM component at all.)

This part might be "just my opinion, man" but: given that BEAM was designed purely for the execution of Erlang; and given that BEAM is written to assume that you used an Erlang compiler to compile the bytecode it's running (thus the trust-zone); then any feature of BEAM bytecode that goes unused by Erlang codegen, should be considered undefined behavior for the purposes of the Erlang abstract machine. Whether the BEAM VM allows the bytecode or not, the Erlang abstract machine doesn't.

In other words, yes, you can create back-references in a BEAM bytecode file. You can also load and call into a NIF that doesn't bother to do reduction accounting. In both cases, you're breaking the runtime semantics of the Erlang abstract machine by doing so, and thereby discarding the properties (e.g. soft-realtime max-bounded-latency scheduling) that you get when you stay within those runtime semantics.

(And I would argue that, if we did move the trust zone to allow for untrusted mobile code execution, such that we were doing static analysis at load time, then the bytecode loader would almost certainly toss out programs that contain back-references. They're semantically invalid for the abstract-machine model it's trying to enact.)

> Actors are free to kill not because they would run their code in uninterrupted atomic blocks, but because they don't share state (their heap) with each other.

Untrue. Many things in ERTS manipulate global emulator (or more precariously, per-scheduler) state in careful ways: fused port packet generation + enqueue done inside the calling process; ETS updates for tables with write concurrency enabled; module-unload-time constant propagation; etc.

You're even free to manipulate arbitrary shared state yourself, inside a NIF! It's not breaking Erlang abstract-machine semantics as long as that state 1. isn't ERTS state, and 2. the results aren't visible inside the abstract-machine model. Thus NIF memory handles being visible in Erlang as zero-width reference binaries — that's only necessary because NIFs are assumed to be manipulating shared mutable buffers, and so Erlang actually being able to see into those buffers would cause undefined behavior!)

BEAM can't assume that any process isn't currently executing inside a NIF that's doing precarious must-be-atomic things to shared out-of-ERTS resources. (I realize that this wasn't a part of the initial design of Erlang — NIFs didn't always exist — but it wasn't in conflict with the design, either, and after much iteration, is now fundamental to it.)

But this manipulation of global state doesn't break the runtime guarantees of Erlang's abstract machine model, so long as these operations are never pre-empted. And so the BEAM doesn't.

I also didn't mention the other, maybe more interesting things that this constraint gets you: hot-code upgrade, process hibernation, and dynamic tracing. To work, these three features all require that a process's current heap state have a clean bijection to a continuation (i.e. a remote function call MFA tuple.) This is only true at yield points; between these, the heap state's meaning is undefined to the Erlang abstract machine, and has meaning only to BEAM itself. It's only the guarantee of O(1)-in-reductions distance between yield points — and never having to yield between these points — that makes all these features practical.

("Erlang with pre-emptible actors" would basically have to use OS threads for each actor, because anything it did instead would be just as heavy in terms of context-switching costs. No green-threading architecture allows the schedulers to pre-empt the green threads they're running, for exactly this reason.)

> After that you'd have to switch to a green thread model with schedulers, which is a lot of work to implement.

My whole point is: how do you cleanly de-schedule arbitrary JVM bytecode that's doing something compute-bounded without explicit yield points? You can't, without replacing both the ISA and the compiler with ones that enforce Erlang-abstract-machine-alike semantics, as described above. And any attempt to do that would mean that this hypothetical forked JVM would now be unable to load original-JVM bytecode — and that you'd have to code in a version of Java that only supports tail calls — which makes it useless as a JVM. It'd just be an Erlang emulator.

toast0 · on April 26, 2022

Erjang exists, but it seems pretty dormant. JVM and BEAM have pretty different philosophies, so while you could probably add messaging and distribution to JVM, I don't think it would be the same. I think it would be hard to get one thread-like thing per connection and millions of connections per OS process to work on the JVM, unless you can get OS threads to scale that high (which seems unlikely?).

dgb23 · on April 26, 2022

There is Project Loom[0] which seems to promise some of what you describe.

[0] https://wiki.openjdk.java.net/display/loom/Main

toast0 · on April 27, 2022

Yeah, that might do it, although from a brief skim, I suspect it's likely to end up with function color issues making it hard to really embrace virtual threads. Interesting to read about, thanks.

kimi · on April 27, 2022

No - they are doing the opposite. A "lite" thread will be a thread, period. It has a different implementation, but it behaves the same way. You ask for one or the other, but they basically work the same way from the PoV of the programmer.

MrBuddyCasino · on April 27, 2022

Project Loom is the only approach that truly promises to solve function colouring, user code just uses the thread API, while the underlying implementation can be swapped out transparently. I'm not aware of any other ones that do this (not, not even Zig).

the_only_law · on April 26, 2022

> but it seems pretty dormant.

It’s been getting an RC for the newest version like monthly?

toast0 · on April 27, 2022

Last commit is 2016? https://github.com/trifork/erjang/

asabil · on April 26, 2022

You could, and in fact there has been Erlang implementations on top of the JVM[1][2], however you will never be able to get the same runtime characteristics of Erlang running on BEAM due to the fundamental differences in memory management: BEAM is first and foremost designed for reliability, predictability and soft real-time behaviour.

[1] https://github.com/trifork/erjang/ [2] https://github.com/jerlang/jerlang

h0l0cube · on April 27, 2022

That's correct. Java's threading and memory model is very different. You could emulate that model on top of the JVM, but that layer of abstraction would be an unacceptable performance penalty.