Hacker News new | past | comments | ask | show | jobs | submit login
Wasm3 – A high performance WebAssembly interpreter in C (github.com)
186 points by sound_and_form 5 days ago | hide | past | web | favorite | 76 comments





This is pretty exciting if real:

> Bytecode/opcodes are translated into more efficient "operations" during a compilation pass, generating pages of meta-machine code

WASM compiled to a novel bytecode format aimed at efficient interpretation.

> Commonly occurring sequences of operations can can also be optimized into a "fused" operation.

Peephole optimizations producing fused opcodes, makes sense.

> In M3/Wasm, the stack machine model is translated into a more direct and efficient "register file" approach

WASM translated to register-based bytecode. That's awesome!

> Since operations all have a standardized signature and arguments are tail-call passed through to the next, the M3 "virtual" machine registers end up mapping directly to real CPU registers.

This is some black magic, if it works!


Sure this is neat stuff, but I don't think any of it is novel. Bochs is a good source for some bytecode vm performance wizardry [1], even if the bytecode in question is the x86 ISA.

Regardless, kudos to the authors and nice to see a fast wasm interpreter done well.

1: http://www.emulators.com/docs/nx25_nostradamus.htm


Yeah, threaded interpretation is nothing new. Notably, Forth compilers have often used it; the iSH x86 emulator (https://github.com/tbodt/ish) is a more recent example of this technique.

It's not only the "threaded code" approach, that makes Wasm3 so fast. In fact, Intel's WAMR also utilizes this method, yet is 30 times slower..

IR getting converted into an interpreter-oriented bytecode is pretty common. Mono does it for its interpreter and IIRC, Spidermonkey has historically done that as well. I'm not sure if V8 has ever interpreted from an IR (maybe now?) but you could view their original 'baseline JIT' model as converting into an interpreter-focused IR, where the IR just happened to be extremely unoptimized x86 assembly.

Translating the stack machine into registers was always a core part of the model but it's interesting to me that even interpreters are doing it. The necessity of doing coloring to assign registers efficiently is kind of unfortunate, I feel like the WASM compiler would have been the right place to do this offline.


> The necessity of doing coloring to assign registers efficiently is kind of unfortunate

Register based VMs like Lua don't do this. The register allocation is incredibly simple https://github.com/LuaJIT/LuaJIT/blob/v2.1/src/lj_parse.c#L3...


But that's an allocator to virtual registers that don't try to correspond to (a valid number of) physical CPU registers. Sure it's easy to allocate to a large number of registers. It's harder to do it to a small number, like the project discussed here seems to claim to do.

I could be completely wrong but it looks like what this does is more like stack-caching. One operand, if it's the result of the last instruction, should already be in "virtual r0" which is already in a register. For the other operand of the opcode, it's an infinite set of "registers" at an offset from the stack pointer: https://github.com/wasm3/wasm3/tree/b1462d450ca367e39e4b2eb4...

You're right. This doesn't use a register machine in the usual sense of the word. It's kind of the software analog to accumulator machines.

> The necessity of doing coloring to assign registers efficiently is kind of unfortunate

You don't have to do register allocation with coloring; it's just that most implementations do.


> "WASM translated to register-based bytecode. That's awesome!"

If the hardware executing this code is "stack-based" (or, does not offer enough general purpose registers to accomodate the funtion call) - this will need to be converted back to a stack-based function call (either at runtime, or beforehand). Wouldn't this intermediate WASM-to-register-based-bytecode translation be redundant then?


You seem to be asking something like "if we always hit the algorithm's slow path, isn't the algorithm slow?". The answer is "yes, but we will (hopefully) almost never hit the slow path". On x64-64 you will typically be able to pass 6 integer/pointer values and 8 floating-point values in registers. That should be enough for most function calls in real-world code.

I don't know of any current physical stack machine CPUs.

Stacks are used extensively across the x86 family [0]

[0] - https://en.wikipedia.org/wiki/X86_calling_conventions


"Has a stack" isn't the same as "Has a stack based ISA".

To expand: a Forth machine or similar would be a stack based ISA. Using a stack is a different matters; Pretty much every ISA uses stacks.

No, stack machine CPUs are pretty obscure things, especially today. See the link below for some examples. All or virtually all modern commercial CPUs are register based.

https://en.wikipedia.org/wiki/Stack_machine#Commercial_stack...


What proportion of machines running web assembly are stack-based, would you say?

I would guess negligible.


Why is an interpreter desirable when JIT compilers create significantly faster code? Is this primarily about embedded use?

1. Some platforms do not have wasm JITs available, because nobody has written one.

2. Some platforms prohibit creating new executable pages, which prevents JITing.

3. Memory savings!


Plus some platforms allow all of the above, but need something that can run while the JIT warms up.

Any JIT needs an interpreter to run code sections that are not compiled (yet), and large sections (I don't have any statistics, but I would guess a very sizable majority) of the code is 'cold' and will never get compiled at all. Compiling bytecode to machine code is a relatively expensive operation in itself so it only makes sense to do it for 'hot' sections (loops, functions that get called many times, etc).

There JIT implementations without interpreter phase, .NET being one of them.

It appears that Mono has added the interpreter back: https://www.mono-project.com/news/2017/11/13/mono-interprete.... I think the CLR has one too, but I haven't looked into it much: https://github.com/dotnet/runtime/blob/master/src/coreclr/sr...

From Microsoft side, regular CLR never had one, only the .NET MF variant.

Shows what I know; I just searched for "interpreter" in the repository. I'm curious how C# runs on iOS, though.

AOT compiled to native code.

Besides that it can help with startup time. If the script that needs to be executed is a one-off thing the removal of compilation time might be save more time than a JIT could save during execution.

From what I know, its because JIT uses a lot more memory than an interpreter. Also in LUAJIT it uses very little memory and thats why people have their minds blown all the time.

iOS prohibits JIT for example.

More specifically, apps submitted to the App Store may not utilize JIT.

Is there any specific reason for this?

Executing instructions dynamically from memory (JIT does this) exposes potential security issues. A good example is the famous Specter vulnerability.

Specter itself is not JIT-specific but it is known to very hard to reproduce that the only current viable target environment is JIT. So JavaScript was affected by Specter.


Allegedly security. Apple only trusts Apple.

Right. Wasm3 targets iOS as well

On the embedded side, I'd rather see a hardware interpreter.

I'm hardware ignorant... would that be a system on a chip?

No, a "system on a chip" means the chip includes things that are not typically part of the CPU but were always parts of computer systems: busses, I/O, sometimes memory, auxillary HW functions like audio, image/video processing or codecs.

Nowadays CPU's often have a bunch of these things anyways and you'll hear that all CPU's sort of resemble SoCs. They also tend to have auxillary, lower-power processors that manage power and other things for the main processor.


I suspect they're just saying they'd rather see a CPU implementing the wasm isa, in a facetious way.

AVR32 with JEM does this to some extent for Java.

http://ww1.microchip.com/downloads/en/devicedoc/doc32000.pdf

  3.   Java Extension Module
  The AVR32 architecture can optionally support execution of Java bytecodes by including a JavaExtension Module (JEM).
  This support is included with minimal hardware overhead.
EDIT: Not by implementing the ISA verbatim, but still neat though

ARM tried to do this with Jazelle, but the effort floundered: https://en.wikipedia.org/wiki/Jazelle

It doesn't make much sense to build hardware to run an intermediate language. Java bytecode isn't optimised; almost all optimisation is meant to happen in the JIT. If you build a processor that runs bytecode directly, when does the optimisation occur? Presumably never.

Low-horsepower platforms are probably the best place to give it a go, as they may struggle to run a respectable JIT compiler, but as you say, Jazelle didn't catch on.


The hardware would perform the optimisation using typical means. Instruction lookahead can be used for caching the stack slots to be used in registers, for instance.

> The hardware would perform the optimisation using typical means.

The HotSpot JIT is an impressive feat of compiler engineering. The advanced optimisations it performs cannot practically be performed in hardware.

Modern CPUs are capable of, for instance, loop-detection, but they haven't put optimising compilers out of a job, and never will.

Jazelle would add no value on a modern ARM server, for instance. You'd get far better performance out of HotSpot, or some other modern JVM.


The neater article seems to be about M3 interpreter https://github.com/soundandform/m3#m3-massey-meta-machine

Tbh, I couldn't get the eureka moment though. Might try to read in the AM ;)


Yeah, this is a good way to design a fast interpreter! It's traditionally called a "threaded interpreter", or (somewhat confusingly) "threaded code":

https://en.wikipedia.org/wiki/Threaded_code

http://www.complang.tuwien.ac.at/forth/threaded-code.html

You can see an example of this particular implementation style (where each operation is a tail call to a C function, passing the registers as arguments) at the second link above, under "continuation-passing style".

One of the big advantages of a threaded interpreter is relatively good branch prediction. A simple switch-based dispatch loop has a single indirect jump at its core, which is almost entirely unpredictable -- whereas threaded dispatch puts a copy of that indirect jump at the end of each opcode's implementation, giving the branch predictor way more data to work with. Effectively, you're letting it use the current opcode to help predict the next opcode!


Yeah, but... It's not only the "threaded code" approach, that makes Wasm3 so fast. In fact, Intel's WAMR also utilizes this method, yet is 30 times slower..

Thanks for the links! I've long searched google trying to find a similar "tail-cail" interpreter. No wonder I couldn't hit anything -- it was so poorly named! :)

> Node v13.0.1 (interpreter) 28 59.5x

https://github.com/wasm3/wasm3/blob/master/test/benchmark/co...

59.5x faster than node.js at what? Executing WebAssembly?


V8 has a built-in (pure, no JIT etc.) interpreter of WASM. Which is quite slow according to this test.

The motivation for WebAssembly rather than plain ARM or x86 assembly = portability, security.

It would be interesting to see how this is designed for security in mind.


Pardon my wasm illiteracy here, what exactly makes it more secure?

Struggling to see it.


WASM has a sandboxing model. The idea is:

1. Control flow is always checked. You can't jump to an arbitrary address, you jump to index N in a control flow table.

2. Calls out of the sandbox are also table based.

3. Indexed accesses are bounds checked. On 64 bit platforms, this is achieved by demoting the wasm to 32 bit and using big guard pages. On 32 bit platforms, it's explicit compares.

The result is something which may become internally inconsistent (can Heartbleed) but cannot perform arbitrary accesses to host memory.


How is that different from CLR or JVM?

I'll speak to JVM; I'm less familiar with CLR but I believe it's the same.

JVM and WASM both have statically-verifiable control flow. No wild jumps, no executable stacks, etc. Phew.

Arrays and pointer arithmetic are a big difference. WASM has a big linear memory block, and instructions may access arbitrary locations within it - the runtime performs bounds checking only at the edges. So your `sprintf` can still overflow the buffer, and smash the stack's data, but can't affect the host, or the control flow.

JVM goes further: it prohibits pointer arithmetic and pushes array accesses down into the instruction stream. To access a JVM array, you must provide the array reference itself, and the runtime will perform bounds checking using the length.

The JVM approach gives you better runtime safety - no Heartbleed! The WASM approach is lower-level and is more easily adapted to existing system languages: C++, Rust, other LLVMites.


CLR was designed to support C++ since day one.

And while WASM trumps the security trumpet, without actually supporting proper bounds checking, the CLR will taint C++ pointer arithmetic as unsafe, thus making the whole module unsafe.

So I as consumer can decide if I am willing to trust an unsafe module or not.


The CLR doesn’t guarantee control flow integrity (modulo type confusion) or any form of isolation when linear memory accesses are used. So here WASM offers another option in the middle between trust and don’t trust: “trust unsafe module only to not compromise its own functionality; no attacking the rest of the process or kernel” (modulo runtime bugs).

Well for that execution of unsafe Assemblies was already enabled anyway, so there isn't much that the verifier can do.

Which is something that WASM isn't being honest about, corruption of internal data structures is allowed.

If I can control what goes into memory just by calling module public functions with the right data set and access patterns, CFI won't help a thing.

Suddenly the authorization module that would authenticate me as regular user, might give me another set of capabilities and off to the races.


Note for C++ on the CLR that you can use /clr:safe as an MSVC compilation argument. This errors out when trying to access to random pointers at compile time.

/clr:pure uses unsafe and supports those cases though.

And yeah, WebAssembly only doing bounds checking within a single memory block and not actually offering true bounds checking is a big downgrade, and a pretty much unjustified one (+ it's rare among JITted languages...).


If you care about "true" bounds checking, just compile to Wasm from a safe source language. Besides, Wasm does support multiple memory blocks so a potentially-unsafe module need not "taint" anything else.

Security is as strong as the weakest link.

Memory blocks have a page granularity, which makes them useless for pervasive checking. (+ you can do the same thing to a reasonable extent with C even, with guard pages.)

Internally inconsistency is already good enough for possible security exploits, by making the module reveal stuff that would otherwise be forbidden.

That's not the goal of a VM/runtime, though: it only ensures that the hosted code cannot escape the sandbox, and provides a correct implementation of the semantics of each VM opcode. The hosted code's computation implemented with those opcodes is a black box to the VM.

Consider: should the VM somehow try to analyze the running code and determine when it's about to commit a logical error and return "forbidden" data? What specification states which data is forbidden?

Consider: even the most high-level VM can execute a program with logical errors if it is Turing-complete. A Java or Erlang program running on a bug-free, fully-compliant implementation of the respective VM can still get (e.g.) an authorization condition wrong, or incorrectly implement a security protocol, or return data from the wrong (in-bounds, but wrong!) index in an array.


Indeed, what WASM advocates cannot do is advocate safety, that it is better than any other bytecode format since 1958, when it isn't so.

Secondly, it could have been designed with bounds checking for any kind of data access.

The CLR is honest about it, when C++/CLI uses C style low level tricks, the Assembly is considered tainted and requires explicit allowance of unsafe assemblies execution.

An idea that goes back to Burroughs B5000, where tainted binaries (code using unsafe code blocks) requires admin permission for execution.


Unlike in x86 or ARM assembly, you can't overwrite the return address in WebAssembly.

Sandboxing is trivial, because you can cover all paths to the outside world.

I edited my earlier comment to remove the sandboxing part thinking this may be embedded specific, so not sure you saw that, so ask again how is this special? would you rather inspect potentially malicious wasm or js?

All web APIs are designed to be secure in the context of the web.

Needs to be memory safe otherwise a wasm program can execute arbitrary code, access memory that it should not, etc.


This is designed for microcontrollers. If you're running untrusted code on microcontrollers, you've got bigger problems.

I wouldn't go as far as to say it's designed for microcontrollers; as others have mentioned, there's a number of potential applications for this in other contexts as well.

These are impressive performance numbers.

> Because operations end with a call to the next function, the C compiler will tail-call optimize most operations.

It appears that this relies on tail-call optimization to avoid overflowing the stack. Unfortunately this means you probably can't run it in debug mode.


It's not that bad even in debug mode (or without TCO). Just not optimal. Also, there is a way to rework this part, so it does not rely on compiler TCO.

If the jump to the next opcode is a tail call, wouldn't an arbitrarily long sequence of instructions take arbitrarily much stack space?

Impressive list of constrained targets for embedded. The AtMega1284 microcontroller for example has only 16 KB of RAM. Which is a lot for an 8-bit micro, but pretty standard for a modern application processors.

Yup. TinyBLE is nRF51 SoC with 16Kb SRAM as well.



Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact

Search: