> Bytecode/opcodes are translated into more efficient "operations" during a compilation pass, generating pages of meta-machine code
WASM compiled to a novel bytecode format aimed at efficient interpretation.
> Commonly occurring sequences of operations can also be optimized into a "fused" operation.
Peephole optimizations producing fused opcodes; makes sense.
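In translation-pass terms, I'd guess the fusion looks something like this toy peephole (op names made up, immediates assumed to live in a side table; not wasm3's actual scheme):

```c
/* Toy peephole pass over a bytecode array: collapse the common
   PUSH_CONST,ADD pair into one fused ADD_CONST op. */
enum { OP_PUSH_CONST, OP_ADD, OP_ADD_CONST, OP_NOP };

static void fuse(unsigned char *code, int n)
{
    for (int i = 0; i + 1 < n; i++) {
        if (code[i] == OP_PUSH_CONST && code[i + 1] == OP_ADD) {
            code[i]     = OP_ADD_CONST;   /* one dispatch instead of two */
            code[i + 1] = OP_NOP;         /* a later pass can compact */
        }
    }
}
```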
> In M3/Wasm, the stack machine model is translated into a more direct and efficient "register file" approach
WASM translated to register-based bytecode. That's awesome!
> Since operations all have a standardized signature and arguments are tail-call passed through to the next, the M3 "virtual" machine registers end up mapping directly to real CPU registers.
This is some black magic, if it works!
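For the curious, the shape of the trick is roughly this; a minimal sketch with made-up types and op names, not wasm3's actual code. With optimization on, each `NEXT()` compiles to a plain jump, so `pc`/`sp`/`r0` stay pinned in the same CPU registers across ops:

```c
#include <stdint.h>

typedef struct Op Op;

/* Every operation shares one signature; "pc" walks an array of
   pre-decoded operations, "sp" is the value stack, and "r0" is a
   virtual register that stays live across calls. */
typedef int64_t (*OpFn)(Op *pc, int64_t *sp, int64_t r0);

struct Op {
    OpFn    fn;
    int64_t imm;    /* pre-decoded immediate, if the op has one */
};

/* Tail-call the next operation: with optimization this becomes a
   plain jump, and the arguments never leave their CPU registers. */
#define NEXT() return pc->fn(pc + 1, sp, r0)

static int64_t op_const(Op *pc, int64_t *sp, int64_t r0)
{
    r0 = pc[-1].imm;    /* our own slot (one behind pc) holds the imm */
    NEXT();
}

static int64_t op_push(Op *pc, int64_t *sp, int64_t r0)
{
    *sp++ = r0;         /* spill the virtual register to the stack */
    NEXT();
}

static int64_t op_add(Op *pc, int64_t *sp, int64_t r0)
{
    r0 += *--sp;        /* pop one operand, accumulate into r0 */
    NEXT();
}

static int64_t op_halt(Op *pc, int64_t *sp, int64_t r0)
{
    (void)pc; (void)sp;
    return r0;          /* the result rides home in r0 */
}

int main(void)
{
    int64_t stack[8];
    Op prog[] = {
        { op_const, 2 }, { op_push, 0 },
        { op_const, 3 }, { op_add,  0 },
        { op_halt,  0 },
    };
    /* computes 2 + 3; exit status is 5 */
    return (int)prog[0].fn(&prog[1], stack, 0);
}
```

Build with -O2 and the whole chain runs in constant stack space.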
Regardless, kudos to the authors and nice to see a fast wasm interpreter done well.
Translating the stack machine into registers was always a core part of the model, but it's interesting to me that even interpreters are doing it. The necessity of doing coloring to assign registers efficiently is kind of unfortunate; I feel like the WASM compiler would have been the right place to do this offline.
Register-based VMs like Lua don't do this; the register allocation is incredibly simple: https://github.com/LuaJIT/LuaJIT/blob/v2.1/src/lj_parse.c#L3...
You don't have to do register allocation with coloring; it's just that most implementations do.
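Right; Lua-family compilers just treat registers as a stack and track one counter. A toy sketch of the idea (field names loosely echo lj_parse.c, but this isn't LuaJIT's actual code):

```c
/* Toy version of Lua-style register allocation: registers above the
   active locals form a stack, tracked by one counter. No liveness
   analysis, no coloring. */
typedef struct {
    int nactvar;   /* registers pinned by named local variables */
    int freereg;   /* first free register; always >= nactvar */
} FuncState;

static int reserve_reg(FuncState *fs)
{
    return fs->freereg++;          /* grab the next scratch slot */
}

static void expr_free(FuncState *fs, int reg)
{
    if (reg >= fs->nactvar)        /* temporaries die strictly LIFO */
        fs->freereg--;
}
```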
If the hardware executing this code is "stack-based" (or does not offer enough general-purpose registers to accommodate the function call), this will need to be converted back to a stack-based function call (either at runtime or beforehand). Wouldn't this intermediate WASM-to-register-based-bytecode translation be redundant then?
 - https://en.wikipedia.org/wiki/X86_calling_conventions
I would guess negligible.
2. Some platforms prohibit creating new executable pages, which prevents JITing.
3. Memory savings!
Nowadays CPUs often have a bunch of these things anyway, and you'll hear that all CPUs sort of resemble SoCs. They also tend to have auxiliary, lower-power processors that manage power and other things for the main processor.
> 3. Java Extension Module
> The AVR32 architecture can optionally support execution of Java bytecodes by including a Java Extension Module (JEM). This support is included with minimal hardware overhead.
Low-horsepower platforms are probably the best place to give it a go, as they may struggle to run a respectable JIT compiler, but as you say, Jazelle didn't catch on.
The HotSpot JIT is an impressive feat of compiler engineering. The advanced optimisations it performs cannot practically be performed in hardware.
Modern CPUs are capable of, for instance, loop-detection, but they haven't put optimising compilers out of a job, and never will.
Jazelle would add no value on a modern ARM server, for instance. You'd get far better performance out of HotSpot, or some other modern JVM.
Tbh, I didn't quite get the eureka moment, though.
Might try to read in the AM ;)
You can see an example of this particular implementation style (where each operation is a tail call to a C function, passing the registers as arguments) at the second link above, under "continuation-passing style".
One of the big advantages of a threaded interpreter is relatively good branch prediction. A simple switch-based dispatch loop has a single indirect jump at its core, which is almost entirely unpredictable -- whereas threaded dispatch puts a copy of that indirect jump at the end of each opcode's implementation, giving the branch predictor way more data to work with. Effectively, you're letting it use the current opcode to help predict the next opcode!
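Concretely, with GCC/Clang computed gotos (non-standard, but the classic way to write this; a tail-call interpreter gets the same effect because the indirect call at the end of each op plays the same role):

```c
#include <stdio.h>

enum { OP_INC, OP_DEC, OP_HALT };

static int run(const unsigned char *pc)
{
    /* one label per opcode; &&label is a GCC/Clang extension */
    static void *labels[] = { &&do_inc, &&do_dec, &&do_halt };
    int acc = 0;

#define NEXT() goto *labels[*pc++]
    NEXT();                    /* dispatch the first opcode */

do_inc:  acc++; NEXT();        /* its own copy of the indirect jump... */
do_dec:  acc--; NEXT();        /* ...so the predictor learns per-opcode */
do_halt: return acc;
#undef NEXT
}

int main(void)
{
    const unsigned char prog[] = { OP_INC, OP_INC, OP_DEC, OP_HALT };
    printf("%d\n", run(prog));   /* prints 1 */
    return 0;
}
```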
59.5x faster than node.js at what? Executing WebAssembly?
It would be interesting to see how this is designed with security in mind.
Struggling to see it.
1. Control flow is always checked. You can't jump to an arbitrary address, you jump to index N in a control flow table.
2. Calls out of the sandbox are also table based.
3. Indexed accesses are bounds checked. On 64-bit platforms, this is achieved by demoting the wasm to 32 bit and using big guard pages. On 32-bit platforms, it's explicit compares (see the sketch below).
The result is something which may become internally inconsistent (can Heartbleed) but cannot perform arbitrary accesses to host memory.
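For point 3, the two strategies look roughly like this; a sketch of the idea with made-up names, not any engine's actual code:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

static void trap(void) { abort(); }   /* stand-in for a wasm trap */

/* 32-bit host: an explicit compare guards every access. */
static uint32_t load_u32_checked(const uint8_t *mem, uint64_t mem_size,
                                 uint32_t addr)
{
    uint32_t v;
    if ((uint64_t)addr + sizeof v > mem_size)
        trap();
    memcpy(&v, mem + addr, sizeof v);
    return v;
}

/* 64-bit host: reserve 4 GiB of address space plus guard pages up
   front (mmap PROT_NONE, committing pages as memory grows). A 32-bit
   wasm address then needs no compare at all: any out-of-bounds offset
   lands in the guard region and faults. */
static uint32_t load_u32_guarded(const uint8_t *mem_base, uint32_t addr)
{
    uint32_t v;
    memcpy(&v, mem_base + addr, sizeof v);   /* hardware does the check */
    return v;
}
```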
JVM and WASM both have statically-verifiable control flow. No wild jumps, no executable stacks, etc. Phew.
Arrays and pointer arithmetic are a big difference. WASM has a big linear memory block, and instructions may access arbitrary locations within it; the runtime performs bounds checking only at the edges. So your `sprintf` can still overflow the buffer and smash the stack's data, but can't affect the host or the control flow.
JVM goes further: it prohibits pointer arithmetic and makes array accesses explicit instructions. To access a JVM array, you must provide the array reference itself, and the runtime performs bounds checking against its length.
The JVM approach gives you better runtime safety - no Heartbleed! The WASM approach is lower-level and is more easily adapted to existing system languages: C++, Rust, other LLVMites.
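To make the difference concrete, here's a toy sketch of what each model's check amounts to (made-up names, not either runtime's real implementation):

```c
#include <stdint.h>
#include <stdlib.h>

static void trap(void) { abort(); }

/* WASM view: one flat byte array. Only the edge is checked, so a
   buggy index can legally read some *other* object's bytes inside
   the sandbox (Heartbleed-style), as long as it stays in range. */
static uint8_t linear_mem[65536];

static uint8_t wasm_load(uint32_t addr)
{
    if (addr >= sizeof linear_mem) trap();   /* edge check only */
    return linear_mem[addr];
}

/* JVM view: per-object access. The index is checked against *this
   array's* length, so overflowing one buffer can never spill into
   a neighbor. */
typedef struct { uint32_t len; uint8_t data[]; } JArray;

static uint8_t jvm_load(const JArray *a, uint32_t i)
{
    if (i >= a->len) trap();                 /* per-object check */
    return a->data[i];
}
```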
And while WASM blows the security trumpet without actually supporting proper bounds checking, the CLR will taint C++ pointer arithmetic as unsafe, thus marking the whole module unsafe.
So I, as a consumer, can decide whether I am willing to trust an unsafe module or not.
Which is something that WASM isn't being honest about: corruption of internal data structures is allowed.
If I can control what goes into memory just by calling a module's public functions with the right data set and access patterns, CFI won't help a thing.
Suddenly the authorization module that would authenticate me as a regular user might give me another set of capabilities, and off to the races.
/clr:pure uses unsafe and supports those cases though.
And yeah, WebAssembly doing bounds checking only on the single memory block, and not actually offering true per-object bounds checking, is a big downgrade, and a pretty much unjustified one (plus it's rare among JITted languages...).
Consider: should the VM somehow try to analyze the running code and determine when it's about to commit a logical error and return "forbidden" data? What specification states which data is forbidden?
Consider: even the most high-level VM can execute a program with logical errors if it is Turing-complete. A Java or Erlang program running on a bug-free, fully-compliant implementation of the respective VM can still get (e.g.) an authorization condition wrong, or incorrectly implement a security protocol, or return data from the wrong (in-bounds, but wrong!) index in an array.
Secondly, it could have been designed with bounds checking for any kind of data access.
The CLR is honest about it: when C++/CLI uses C-style low-level tricks, the assembly is considered tainted and requires explicit allowance of unsafe-assembly execution.
An idea that goes back to the Burroughs B5000, where tainted binaries (code using unsafe code blocks) require admin permission for execution.
It needs to be memory-safe; otherwise a wasm program can execute arbitrary code, access memory that it should not, etc.
> Because operations end with a call to the next function, the C compiler will tail-call optimize most operations.
It appears that this relies on tail-call optimization to avoid overflowing the stack. Unfortunately, this means you probably can't run it in debug mode, where the compiler won't turn those tail calls into jumps.
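A toy illustration of the hazard (not wasm3 itself): GCC and Clang compile this tail call into a jump at -O2, so it runs in constant stack space; at -O0 it's ~100M frames deep and crashes:

```c
#include <stdio.h>

/* A self tail call: at -O2 it becomes a loop; at -O0 every step
   pushes a new stack frame. */
static long step(long n, long acc)
{
    if (n == 0)
        return acc;
    return step(n - 1, acc + 1);   /* tail position */
}

int main(void)
{
    /* fine at -O2; a stack overflow at -O0 */
    printf("%ld\n", step(100000000L, 0));
    return 0;
}
```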