Hacker News
A bare metal physical implementation of WASM. That's right, a WebAssembly CPU (2018) (github.com/lastmjs)
125 points by Kinrany 49 days ago | hide | past | favorite | 61 comments

Uh, so the code is all typescript, JS and HTML.

Not sure how you'll get "bare metal" performance with those languages


    - "I'm going to make a repo because I have a neat idea"   
    - .... 3 hours later ...   
    - "I'm not sure this is even possible"
And then it died.

Maybe this is what ex-CEO Trevor Milton (of Nikola fame) was talking about when he boasted they had developed HTML5 supercomputers?

Their todo list includes implementing it in hardware description languages. It's not checked as done yet.

One day I will make the HN front page with a repo containing a todolist with one unchecked item. "TODO: Write a todo list cool enough to make the frontpage of HN."

But then you will have achieved the TODO, ruining the joke.

That is the joke...

The joke was that a TODO list with an unchecked item is on the front page of HN. In the grandparent's scenario, the item is checked.

Well, and the last commit is Dec 14, 2018.

> Roadmap > Understand the WebAssembly ISA

Don't forget redux.

Sun at some point built a CPU that could run Java bytecode "natively". That sounds nicer than it actually is once you realize that most practical optimizations in Java happen at run time.

WASM is very similar in that it typically gets executed by a JIT, and compilers don't try to optimize it too much ahead of time. WASM is optimized for loading quickly in a JIT. It also shares the issue that it needs to run on a very wide variety of hardware architectures, so it is not optimized for any particular one.

There was a short time window in browsers where WASM was actually AOT compiled, instead of the tiered compilation model that seems to be common now. I actually preferred the AOT compilation because it meant that there was no JS-like warmup phase until the code reaches full speed. This was a problem for large WASM blobs though because compilation took too long, and instead of "encouraging" people to write small WASM blobs, the browser makers "caved in" and moved towards a more JIT-like model (first pass is "compile fast but run slowly", and then compile hot functions with an optimizing compiler again).
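That tiered model can be sketched in a few lines of Python: run a cheap baseline version, count invocations, and swap in the optimized version once a function gets hot. The threshold and the two tiers here are illustrative assumptions, not any real engine's policy.

```python
# Toy sketch of tiered compilation: every function starts in a cheap
# "baseline" tier; once its call count crosses a threshold, it is
# replaced by an "optimizing-tier" version. The threshold and both
# tiers are invented for illustration, not any browser's actual policy.

HOT_THRESHOLD = 3  # hypothetical promotion threshold

class TieredFunction:
    def __init__(self, baseline, optimized):
        self.baseline = baseline    # compiled fast, runs slowly
        self.optimized = optimized  # compiled slowly, runs fast
        self.calls = 0
        self.tier = "baseline"

    def __call__(self, *args):
        self.calls += 1
        if self.tier == "baseline" and self.calls > HOT_THRESHOLD:
            self.tier = "optimized"  # promote the hot function
        impl = self.baseline if self.tier == "baseline" else self.optimized
        return impl(*args)

square = TieredFunction(lambda x: x * x, lambda x: x * x)
for _ in range(5):
    square(7)
print(square.tier)  # hot by now: "optimized"
```

The point of the real thing is the same as the toy: first execution never waits for the optimizing compiler, at the cost of a warmup phase.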

With application frameworks like Blazor, Uno, Flutter, Composer, Unity taking the place of Flash, Java Applets and Silverlight, trying to force people to write smaller WASM modules was doomed to fail anyway.

WASM isn't specifically meant to be JIT-compiled, and there is no compelling reason it has to be. One prime use is ahead-of-time native compilation from an easy-to-translate intermediate representation.

WASM is specifically intended to run on just about any CPU, which makes JIT compilation a particularly valid strategy for running it. Hardware-accelerating the unoptimized WASM bytecode would have a lot of limitations, since the compiler does not know what hardware the code is going to run on. So it can't do any of the hardware-specific optimizations that a normal compiler would do.

A single compiler pass that takes the WASM and optimises it for your machine is still AOT compilation.

What makes a JIT a JIT is not its ability to execute bytecode optimised for the platform, but its ability to perform optimisations for your current input problem, with its memory and execution characteristics. E.g. stuff like polymorphic inline caching or dynamic decompilation.

Wasm is pretty well AOT-compilable, and I don't see a reason why a WASM CPU shouldn't run a hybrid approach with a hardware AOT step, with WASM kinda being its microcode.

> with WASM kinda being its microcode

Not to be too pedantic, just because you make a good point that's worth clarifying: I think you meant WASM being the processor's *ISA*, which is translated to microcode instructions that are more optimized for actual execution. Exactly like an Intel CPU isn't "really" executing x86 instructions anymore.

You're correct, my brain skipped a couple steps when writing that sentence ^^', thanks for the clarification!

JIT compilers do a bit less optimization than rumored (of course web browsers and HotSpot do some of the heaviest). The results have to be worth the analysis time, which they're often not, because CPUs are already extremely fast and it's memory that's slow, and JIT isn't a technique that can optimize memory layout.

> The results have to be worth the analysis time

On modern systems it may be possible for a JIT compiler to run in the background on cores not being used by the application code.

> JIT isn't a technique that can optimize memory layout

As I understand it, compiler optimizations for memory layouts aren't very well researched in general. Why would the JIT model be a hindrance?

> On modern systems it may be possible for a JIT compiler to run in the background on cores not being used by the application code.

Only as long as the app itself is on-core. It's not worth running background superoptimization pretty much ever, particularly on a battery-powered device, but also in general because optimizations aren't real unless they're reliable. I've had Lisp programmers brag to me about how some implementation can spend 30 minutes optimizing, but in that case you can't tell what's going to happen to your program…

> As I understand it, compiler optimizations for memory layouts aren't very well researched in general. Why would the JIT model be a hindrance?

It's not a help either. Well, function specialization might be helped in some cases.

It doesn't need to be like microcode; it suffices to be like a portable executable format.

This was pretty common in big iron, and survives to this day on IBM and Unisys mainframes.

Also some Java vendors take this approach by AOT-compiling .class files into native code for embedded deployment, as an option.

If anyone's interested, this was the issue of Byte magazine that wrote about Sun's Java CPUs:


I actually owned that copy, but sadly and stupidly gave it away with the rest of my Byte collection that dated from 1983 until Byte's demise.

I have a mostly complete set of BYTEs from the late 70's into the 90's.

Thank you for reminding me why I haul them around every time I move. I would love to donate them but can't think of anywhere worthy.

There were also ARM variants that could run a bunch of JVM bytecode natively - https://en.wikipedia.org/wiki/Jazelle

It was only really competitive against interpreters. It was aimed at a gate count niche that couldn't expect a JIT, for instance feature phones or JavaCards.

+1 on the feature phones - I worked on quite a few of them that had Jazelle acceleration for J2ME.

Solid 60fps on those 206MHz CPUs and the bottleneck was actually the display pipeline!

Thanks, that's what I was referring to.

The optimizer could still output bytecode, which sounds theoretically nice for homomorphy; turtles all the way down (I am not a logician, mind).

A stack-based WASM CPU for stack-based WASM bytecode will be slower than a traditional register based CPU for reasons that have been known at least since the 1970s-1980s, when stack-based CPUs for Smalltalk, Lisp, and C were tried.

This didn't stop Sun from trying again in the 1990s to make stack based CPUs for the JVM (picoJava, UltraJava, ...).

TL;DR: A WASM CPU will need a complicated stack cache to map stack positions to registers. This will involve more transistors, power, and latency than just letting compiler writers use the registers directly.
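The stack-cache idea can be mimicked in miniature: keep the top few stack slots in "registers" and spill the rest to memory. This is a toy model with made-up sizes and policy, not a description of any real design.

```python
# Toy model of a stack cache: the top N stack slots live in fast
# "registers"; pushing past capacity spills the oldest cached entry
# to memory, and pops refill from memory. Sizes are illustrative.

class StackCache:
    def __init__(self, num_regs=4):
        self.regs = []      # top-of-stack slots held in registers
        self.memory = []    # spilled lower stack slots
        self.num_regs = num_regs

    def push(self, value):
        if len(self.regs) == self.num_regs:
            # spill the oldest cached slot to keep the top in registers
            self.memory.append(self.regs.pop(0))
        self.regs.append(value)

    def pop(self):
        value = self.regs.pop()
        if self.memory:
            # refill from memory so the register file stays warm
            self.regs.insert(0, self.memory.pop())
        return value

sc = StackCache(num_regs=2)
for v in (1, 2, 3, 4):
    sc.push(v)
print(sc.regs, sc.memory)  # [3, 4] [1, 2]
```

Even this toy shows where the transistors go: every push and pop is a potential spill or refill decision that a register ISA simply doesn't need.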

One of the earliest important papers is from 1982: "Register Allocation for Free: The C Machine Stack Cache."[1]

Lispers sometimes lament that current processors are 'C machines ill-suited to Lisp like the Symbolics Lisp processors of yore', but in reality, C is very much a stack language, too. A C compiler is soooooo much easier to write if you don't have to worry about register allocation, a hard problem.

I like that WASM is stack-based because it makes compilers targeting WASM much easier to write, though it makes the WASM JIT compiler much more complicated. But it's better that top talent at Google, Mozilla, etc. works on the WASM JIT compiler so that the rest of us can focus on solving other problems.

The Lisp Machines of yore had instructions for tagged arithmetic, which can speed up, say, adding two dynamically typed variables. No need for a compiler to infer and enforce the datatypes of simple variables in advance, the processor checks the datatypes while it is executing the code and signals a type error if you try to add a float to a character. But modern JIT strategies, which can infer the datatypes of loop variables and emit native instructions lightning fast, might ultimately be better.

[1] https://www.eecg.utoronto.ca/~jzhu/csc467/readings/ra-for-fr...
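The tagged-arithmetic behaviour described above can be mimicked in software, with the caveat that a Lisp machine did the tag check in hardware, in parallel with the ALU. The tag names here are invented for illustration.

```python
# Toy software model of tagged arithmetic: every value carries a type
# tag, and ADD checks the tags before operating -- what a Lisp machine
# did in hardware alongside the ALU. Tag names are invented.

FIXNUM, FLOAT, CHAR = "fixnum", "float", "char"

def tagged(tag, value):
    return (tag, value)

def tagged_add(a, b):
    (ta, va), (tb, vb) = a, b
    if ta not in (FIXNUM, FLOAT) or tb not in (FIXNUM, FLOAT):
        # the hardware would trap here and signal a type error
        raise TypeError(f"cannot add {ta} and {tb}")
    result_tag = FLOAT if FLOAT in (ta, tb) else FIXNUM
    return tagged(result_tag, va + vb)

print(tagged_add(tagged(FIXNUM, 2), tagged(FLOAT, 0.5)))  # ('float', 2.5)
try:
    tagged_add(tagged(FLOAT, 1.0), tagged(CHAR, "a"))
except TypeError as e:
    print(e)  # cannot add float and char
```

A JIT's alternative is to prove the tags of hot variables once and then emit untagged native arithmetic, which is why it might ultimately win.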

I mean, you don't even have to go as far as register allocation to see the problems that make it all a big dead end. WebAssembly requires e.g. the call opcode parameter to actually be an index into the global function table, causing an indirection, which is about a trillion times worse than current CPUs which can just directly write the new PC to the register file at once. The spec is littered with stuff like this, for good reason.

Considering WebAssembly is designed to be relatively easy to compile ahead of time, there's no reason to make the hardware worse when you could just ship "firmware" that compiles to an underlying ISA transparently, and reuse decades of existing knowledge. This would make the system behave more like the AS/400 and IBM iSeries, which abstracted away the underlying microarchitecture through its firmware.

And, if you want to literally run the bytecode, there’s that little issue of nested blocks and block start/end tags…a proper WASM CPU would have to perform parsing in order to handle any form of control flow. (Alternatively, you could load not-quite-WASM that’s been gently preprocessed to use normal jumps, but at that point why not just go all the way and generate native code for a real CPU?)
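To make the parsing point concrete, here is a rough sketch of the pre-pass that resolving `br` requires: scan the block/end nesting and turn each branch into a concrete jump target. The mini-bytecode is invented for illustration and is not the real WASM binary encoding.

```python
# Sketch of why structured control flow needs a pre-pass: `br n`
# names the n-th enclosing block, so resolving it to a jump target
# requires scanning the block/end nesting. This mini-bytecode is
# invented for illustration; it is not the real WASM encoding.

def resolve_branches(code):
    """Map each ('br', depth) index to the index just past its target 'end'."""
    targets = {}
    stack = []    # indices of currently open 'block' instructions
    pending = {}  # open block index -> br indices that target it
    for i, instr in enumerate(code):
        op = instr[0]
        if op == "block":
            stack.append(i)
            pending[i] = []
        elif op == "end":
            opener = stack.pop()
            for br_index in pending.pop(opener):
                targets[br_index] = i + 1  # jump past the 'end'
        elif op == "br":
            depth = instr[1]
            pending[stack[-1 - depth]].append(i)
    return targets

code = [
    ("block",),   # 0
    ("block",),   # 1
    ("br", 1),    # 2: branch out of the OUTER block
    ("end",),     # 3
    ("end",),     # 4
    ("nop",),     # 5
]
print(resolve_branches(code))  # {2: 5}
```

A conventional ISA encodes that target directly in the instruction; here something has to compute it, either a software pre-pass or extra hardware.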

Technically there are a few extra tricks you could pull off in microcode if you natively support the instruction set, that even a sufficiently smart JIT compiler cannot. It would be similar to what ARM tried with Jazelle way back when, with some very modest performance gain at best.

Not worth the time.

C is not a stack machine at all because (1) it has randomly accessed local variables with (2) the programmer expectation that these be mapped to registers as well as possible. C even used to have a register keyword; yet it never had any stack manipulation primitives (push, pop, ...).

From the programmer's perspective C is not a stack based language the way that, say, Forth is, because (as you note) there are no explicit push and pop instructions. But from an implementor's perspective, evaluating C function calls and arithmetic expressions is very stack oriented, which is why in the cited paper some Bell Labs researchers in the early 1980s were trying to build stack machine CPUs to execute C code. Given your Lisp experience, just think about how you'd translate various C expressions to Lisp ones, and how Lisp expressions map to stack operations, and you'll see C's stack based nature come out. (I.e. the abstract syntax trees (ASTs) a C compiler builds can naturally be represented as Lisp-like expressions.)


  y = a*x + foo(b);

  (setq y (+ (* a x) (foo b)))
Example stack instructions for the above (many variations possible):

  PUSH a
  PUSH x
  MUL       ; a*x
  PUSH b
  CALL foo  ; pops b, pushes foo(b)
  ADD       ; a*x + foo(b)
  POP y     ; store the result into y
Dennis Ritchie added the register storage qualifier keyword to primeval C (and also auto, inherited from B) to make the earliest C compilers easier to write, because bug-free register allocation is a very hard problem and Ritchie's earliest PDP had only a few kilobytes of core to hold both the C compiler's code and the chunk of program text being currently translated, so a complicated register allocator was out of the question.

C is just implicit about how it pushes each function-local variable onto the stack. No amount of compiler optimization can hide it completely -- recurse too much or use up all the space, and you'll get a stack overflow. And just below C, all mainstream ABIs are designed around a stack too (with a designated stack pointer register, etc).

C has activation frames allocated in a LIFO discipline for supporting recursive functions, implemented using a stack; but I think that is not what we mean by "stack language".

What you are showing is not a purebred stack machine; it's a machine with stack-based temporary calculations that rely on randomly accessed operands.

The first C that I used was like this. It was actually a tiny, toy subset of C powering a programming game called C Robots by Tim Poindexter. I think that's how I first heard about Yacc, too.

Anyway, if you want to defend the hypothesis that C is a stack language under the hood from an implementor's POV, it would help to show how expressions involving function calls and variables translate to something resembling canonical, point-free Forth.

The local variables must turn into anonymous stack locations accessed implicitly. If a variable is used as an operand, and is still live (has a next use), you must DUP it to keep a copy in the stack.

Lisp isn't a stack machine either. Lisp expressions can map to stack operations, and easily so if we can have random access to off-stack operands, but do not have to. My own TXR Lisp uses a register-based VM.

  1> (disassemble (compile-toplevel '(set y (+ (* a x) (foo b)))))
  ** expr-1:1: warning: unbound variable y
  ** expr-1:1: warning: unbound variable a
  ** expr-1:1: warning: unbound variable x
  ** expr-1:1: warning: unbound variable b
  ** expr-1:1: warning: unbound function foo
    0: y
    1: sys:b+
    2: sys:b*
    3: a
    4: x
    5: foo
    6: b
    0: 90040003 getlx t4 3
    1: 90050004 getlx t5 4
    2: 20020003 gcall t3 2 t4 t5
    3: 00040002
    4: 00000005
    5: 90050006 getlx t5 6
    6: 20010004 gcall t4 5 t5
    7: 00050005
    8: 20020002 gcall t2 1 t3 t4
    9: 00030001
   10: 00000004
   11: 94020000 setlx t2 0
   12: 10000002 end t2
  instruction count:
  #<sys:vm-desc: 8c0dc50>
The registers are allocated on the native stack, in a frame whose size is determined at compile time and sized to fit using alloca() at run-time.

Local variables (not seen here) turn into v registers, treated uniformly with t registers.

We don't see t1 and t0 above because t0 is a read-only register that holds nil, and t1 is the assembler temporary.

With locals:

  4> (disassemble (compile-toplevel '(let ((x 1) y (a 4) (b 3)) (set y (+ (* a x) (foo b))))))
  ** expr-4:1: warning: unbound function foo
    0: 1
    1: 4
    2: 3
    0: sys:b+
    1: sys:b*
    2: foo
    0: 20020004 gcall t4 1 d1 d0
    1: 04010001
    2: 00000400
    3: 20010005 gcall t5 2 d2
    4: 04020002
    5: 20020009 gcall t9 0 t4 t5
    6: 00040000
    7: 00000005
    8: 10000009 end t9
  instruction count:
  #<sys:vm-desc: 8c72ce0>
I don't have constant folding through variables working. So gcall t4 1 d1 d0 is still doing the (* a x) multiplication, though directly on the d1 d0 registers that hold these values. Ideally, this would be folded down to a constant.

The original unoptimized code is generated like this:

  5> (let ((*opt-level* 0)) (disassemble (compile-toplevel '(let ((x 1) y (a 4) (b 3)) (set y (+ (* a x) (foo b)))))))
  ** expr-6:1: warning: unbound function foo
    0: 1
    1: 4
    2: 3
    0: +
    1: *
    2: foo
    0: 04020004 frame 2 4
    1: 2C800400 movsr v00000 d0
    2: 2C820401 movsr v00002 d1
    3: 2C830402 movsr v00003 d2
    4: 20020004 gcall t4 1 v00002 v00000
    5: 08020001
    6: 00000800
    7: 20010005 gcall t5 2 v00003
    8: 08030002
    9: 20020801 gcall v00001 0 t4 t5
   10: 00040000
   11: 00000005
   12: 2C020801 movsr t2 v00001
   13: 10000002 end t2
   14: 10000002 end t2
  instruction count:
  #<sys:vm-desc: 8923d90>

The optimization eliminated the variable frame allocation "frame 2 4" and its matching "end t2", and mapped all the v registers belonging to that frame to new t registers. It propagated the d0, d1 and d2 values, eliminating some of those registers. v00001 became t9, and the "movsr t2 t9" was eliminated, which involved changing the subsequent "end t2" to "end t9".

Another thing we see is the reduction of the + and * functions to internal binary-only variants sys:b+ and sys:b*, but that's not done as a pass over the code; the initial instruction selection does that, sensitive to *opt-level*.

I'm not sure how easy this kind of work is on stack machines.

> What you are showing is not a purebred stack machine

I see where you are coming from now, and yes, you are correct these stack CPUs are not pure. The whole point though of trying to do a stack machine CPU (however impure) is to simplify register allocation, which is why the 1982 C Machine Stack Cache paper I cited has a title beginning with Register Allocation for Free.... Whether writing a compiler or writing assembler by hand, at some point you hit the problem that it's not easy to fit all your variables into registers r0 through r15 (or whatever) and you must then spill them onto the stack. You evidently have first-hand experience with the challenges of keeping performance critical variables in registers. So the essence of these stack processors, the whole point of trying to build them, is to just let you spill everything on the stack and let the CPU worry about how to map stack locations to fast registers.

Excerpt from the C Machine Stack Cache paper[1]:

The goal of the Stack Cache is to keep the top elements of the stack in high speed registers. The problem we solve is how to perform the allocation of these registers without placing the burden on the compiler, and at the same time retaining the efficiency of register accesses. Control of the hardware, the instruction set, and a disciplined use of the aforementioned calling sequence allows a memory-to-memory style architecture (i.e. registerless to the compiler) to perform an automatic binding of memory addresses to machine registers.

[1] https://www.eecg.utoronto.ca/~jzhu/csc467/readings/ra-for-fr...

His point is not about what the C abstract machine is, rather that there existed C implementations that used a stack machine as a target.

You can start with "A Book on C" for a description of such an implementation.


Ah yes, I remember the embedded SoCs that started appearing in our coursework and labs that ran Java native. It just never seemed to be worth all the trouble to me. There are whole ecosystems that have to be supplanted, and for what? To help 'get into' a new domain?

> Lispers sometimes lament that current processors are 'C machines ill-suited to Lisp like the Symbolics Lisp processors of yore'

Interestingly, SPARC was designed to run C code well. The register window idea allows cheaper function calls than other architectures where you need to push state to a stack prior to the jump.

SPARC is also one of the very first C Machines, in the sense that SPARC ADI is the first widely deployed CPU architecture to protect against traditional C memory corruption bugs with the help of tagged memory.

I think it’s a real shame SPARC is fading fast thanks to Oracle. The acquisition of Sun by them was a major tragedy of our industry.

SPARC ADI was designed by Oracle.

I liked Sun, but let's not worship them more than they deserve.

Just like, had it not been for Oracle, anyone doing Java would be porting Java 6 code to whatever the hot replacements would be.

And yes, they care more about Oracle Linux than Solaris, just like all remaining UNIX vendors have switched to GNU/Linux to cut development costs.

I grant you Sun's management pre-acquisition was utterly stupid, but their hardware design teams, which continued under Oracle until being disbanded in 2017 or 2018, were nothing short of stellar.

Is the compiler in a better position to allocate registers than the CPU? Considering it's a hard problem, but really only needs to be solved once, I can see how pushing that from the CPU to the compiler would help.

A register based ISA implicitly conveys dependencies between instructions.

This allows the CPU to infer which instructions can be executed in parallel (superscalar execution).

Extracting instruction-level parallelism (ILP) from a stack-oriented ISA is harder, but if a compiler can do it, technically a CPU could do the same. The question is: what would be the advantage?

ILP extraction can be done statically. Doing it on a CPU at runtime costs time and power. OTOH, stack-based instruction sets tend to be more compact, so there is less pressure on the memory hierarchy to pull in the code, leaving more bandwidth for operands and thus reducing stalls.

CPU design is a tradeoff in a tradeoff in a tradeoff.
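The point that register names implicitly convey dependencies can be made concrete: deciding whether two register instructions can issue together is just name comparison. A toy check that only looks at RAW/WAR/WAW hazards between registers and ignores memory entirely:

```python
# Toy illustration: with a register ISA, finding independent
# instructions is name comparison. Each instruction is modeled as
# (dst, src1, src2); this checks RAW, WAR, and WAW hazards only and
# ignores memory -- a sketch, not a real scheduler.

def independent(i1, i2):
    d1, *s1 = i1
    d2, *s2 = i2
    return d1 != d2 and d1 not in s2 and d2 not in s1

mul = ("r3", "r1", "r2")   # r3 = r1 * r2
add = ("r6", "r4", "r5")   # r6 = r4 + r5  (no shared registers)
use = ("r7", "r3", "r6")   # r7 = r3 + r6  (reads both results)

print(independent(mul, add))  # True: can issue in the same cycle
print(independent(mul, use))  # False: RAW hazard on r3
```

With a stack ISA every instruction reads and writes the same implicit top-of-stack, so the hardware would first have to reconstruct this dataflow before it could do any of the above.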

Very modestly. It's better to provide a simple 4-bit "hotness" probability, similar to what CPUs already do on their own in branch prediction and cache allocation.

Considering all the "blobby" multicore and SMT architectures, you could argue the CPU has better information about registers. Just keep the register format even and you'd be fine. There's not much to allocate when you have a lot of registers and they're fast.

The general hotness would especially help with reducing execution costs in JIT compilation, where new code has just been generated and the CPU does not have accurate prediction data.

It's better to think of the compiler allocating register names rather than registers. CPUs usually have far more physical registers which they allocate the ISA registers to, with some weird constraints, like some Intel CPUs have a penalty if you try to read a register that hasn't been written in a few hundred cycles.

But that problem is simpler than allocating a stack to physical registers.
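That allocation of ISA names to physical registers can be sketched as a rename table: each write gets a fresh physical register, which is what removes false (WAW/WAR) dependencies. The names and unlimited register pool here are illustrative, not any real core's design.

```python
# Toy sketch of register renaming: each write to an ISA register is
# assigned a fresh physical register, and reads look up the current
# mapping. The unbounded pool and names are illustrative only.

class Renamer:
    def __init__(self):
        self.mapping = {}   # ISA register -> current physical register
        self.next_phys = 0

    def read(self, isa_reg):
        return self.mapping[isa_reg]

    def write(self, isa_reg):
        phys = f"p{self.next_phys}"   # allocate a fresh physical reg
        self.next_phys += 1
        self.mapping[isa_reg] = phys  # later reads see the new copy
        return phys

r = Renamer()
w1 = r.write("r1")           # first write of r1 lands in p0
w2 = r.write("r1")           # second write gets its own p1 (WAW removed)
print(w1, w2, r.read("r1"))  # p0 p1 p1
```

The key simplification is that the lookup is keyed by a stable name; renaming positions in a constantly moving stack is the harder problem the parent alludes to.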

A WebAssembly CPU in an FPGA could be quite a thing. Not sure how fast, but perhaps the processor could be implemented as an accelerator for opcodes generated by transforming WASM into something lower level. Direct execution might be too slow. Still, speed isn't everything, and a first generation doesn't need to be ready for prime time.

Bigger question is: is this useful? Does it have an actual application that can drive it forwards? A solid yes to that question shoves all the others to the side.

Useful might at first just mean "hey, it's going to run my blinky lights using WASM instead of arm/avr/etc!" If so, then you already have a winner as soon as it hits the FPGA bitstream flash. From there you can add support for all those pmods on your dev board and off you go.

Pretty neat idea though it appears to have stagnated a couple years back.

I was reminded of TornadoVM, a product for doing SIMD high-speed arithmetic processing on FPGAs by converting Java, Python, etc. to OpenCL. https://www.infoq.com/articles/tornadovm-java-gpu-fpga/

The transcompiler is forever.

This was my first thought when I saw the headline. What I once thought was satire was, in fact, prophecy.

We are lucky that what happened in 2020 was COVID, not a nuke. And there are people fleeing San Francisco, albeit for a different reason.

This is “inspired in part” by the same (excellent and a little scary) talk. Which makes you think, when would something like this have appeared if the talk had never happened?

Asm.js was already in the wild and referenced in the talk, so while it is humorous, I think the path was already set.

Some other bytecode among the dozens in use since the early 1960's would have been chosen instead.

Missing 2018 in the title.

