Not sure how you'll get "bare metal" performance with those languages
- "I'm going to make a repo because we I have a neat idea"
- .... 3 hours later ...
- "I'm not sure this is even possible"
WASM is very similar in that it typically gets executed by a JIT, and compilers don't try to optimize it too heavily ahead of time: WASM is optimized for loading quickly into a JIT. It also shares the issue that it needs to run on a very wide variety of hardware architectures, so it is not optimized for any particular one.
What makes the JIT a JIT is not its ability to execute bytecode optimised for the platform, but its ability to perform optimisations for your current input problem with its memory and execution characteristics, e.g. things like polymorphic inline caching or dynamic recompilation.
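To illustrate the inline-caching idea, here is a minimal C sketch. The Obj layout, the tag values, and all the function names are invented for illustration; a real JIT patches the cache into the generated machine code rather than using a struct.

  #include <stdio.h>

  /* Hypothetical dynamically typed value: a type tag plus a payload. */
  typedef struct { int tag; double num; } Obj;   /* tag 1 = double */
  typedef Obj (*AddFn)(Obj, Obj);

  static Obj add_doubles(Obj a, Obj b)           /* fast path */
  {
      return (Obj){ .tag = 1, .num = a.num + b.num };
  }

  static Obj add_generic(Obj a, Obj b)           /* slow-path stub */
  {
      (void) a; (void) b;
      fprintf(stderr, "full dynamic dispatch would go here\n");
      return (Obj){ 0 };
  }

  /* One cache per call site: the last tag seen and the handler chosen.
     A *polymorphic* inline cache keeps a few such entries. */
  typedef struct { int cached_tag; AddFn handler; } InlineCache;

  static Obj cached_add(InlineCache *ic, Obj a, Obj b)
  {
      if (ic->handler && a.tag == ic->cached_tag)
          return ic->handler(a, b);              /* hit: no dispatch  */
      AddFn f = (a.tag == 1) ? add_doubles : add_generic;
      ic->cached_tag = a.tag;                    /* miss: remember it */
      ic->handler = f;
      return f(a, b);
  }

  int main(void)
  {
      InlineCache ic = { 0, 0 };
      Obj x = { 1, 2.5 }, y = { 1, 4.0 };
      printf("%g\n", cached_add(&ic, x, y).num); /* miss, caches handler */
      printf("%g\n", cached_add(&ic, x, y).num); /* hit                  */
      return 0;
  }

On a hit, the code jumps straight to the specialized handler; the expensive dispatch only happens when the cached assumption breaks.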
WASM is pretty well AOT-compilable, and I don't see a reason why a WASM CPU shouldn't run a hybrid approach of a hardware AOT step, with WASM in effect being its microcode.
Not to be too pedantic, just because you make a good point that's worth clarifying: I think you meant WASM being the processor's *ISA*, which is translated to microcode instructions that are more optimized for actual execution. Exactly like an Intel CPU isn't "really" executing x86 instructions anymore.
On modern systems it may be possible for a JIT compiler to run in the background on cores not being used by the application code.
> JIT isn't a technique that can optimize memory layout
As I understand it, compiler optimizations for memory layouts aren't very well researched in general. Why would the JIT model be a hindrance?
Only as long as the app itself is on-core. It's not worth running background superoptimization pretty much ever, particularly on a battery-powered device, but also in general because optimizations aren't real unless they're reliable. I've had Lisp programmers brag to me about how some implementation can spend 30 minutes optimizing, but in that case you can't tell what's going to happen to your program…
> As I understand it, compiler optimizations for memory layouts aren't very well researched in general. Why would the JIT model be a hindrance?
It's not a help either. Well, function specialization might be helped in some cases.
This was pretty common in big iron, and survives to this day on IBM and Unisys mainframes.
Also, some Java vendors take this approach by AOT-compiling .class files into native code for embedded deployment, as an option.
I actually owned that copy, but sadly and stupidly gave it away with the rest of my Byte collection that dated from 1983 until Byte's demise.
Thank you for reminding me why I haul them around every time I move. I would love to donate them but can't think of anywhere worthy.
Solid 60fps on those 206MHz CPUs and the bottleneck was actually the display pipeline!
This didn't stop Sun from trying again in the 1990s to make stack-based CPUs for the JVM (picoJava, UltraJava, ...).
TL;DR: A WASM CPU will need a complicated stack cache to map stack positions to registers. This will involve more transistors, power, and latency than just letting compiler writers use the registers directly.
One of the earliest important papers is from 1982: "Register Allocation for Free: The C Machine Stack Cache."
Lispers sometimes lament that current processors are 'C machines' ill-suited to Lisp, unlike the Symbolics Lisp processors of yore, but in reality C is very much a stack language, too. A C compiler is so much easier to write if you don't have to worry about register allocation, which is a hard problem.
I like that WASM is stack-based because it makes compilers targeting WASM much easier to write, even though it makes the WASM JIT compiler itself much more complicated. Better that top talent at Google, Mozilla, etc. works on the WASM JIT so that the rest of us can focus on solving other problems.
The Lisp Machines of yore had instructions for tagged arithmetic, which can speed up, say, adding two dynamically typed variables. No need for a compiler to infer and enforce the datatypes of simple variables in advance, the processor checks the datatypes while it is executing the code and signals a type error if you try to add a float to a character. But modern JIT strategies, which can infer the datatypes of loop variables and emit native instructions lightning fast, might ultimately be better.
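For comparison, here is roughly what that check costs when software has to do it, assuming an invented one-bit fixnum tag (a common encoding, but the names and macros here are mine). The Lisp Machine performed the equivalent check in hardware, in parallel with the add.

  #include <stdint.h>
  #include <stdio.h>

  typedef intptr_t lispval;   /* one tagged machine word */

  /* Low bit 0 = fixnum stored shifted left; low bit 1 = anything else. */
  #define FIXNUM_P(v)    (((v) & 1) == 0)
  #define MAKE_FIXNUM(n) ((lispval)(n) << 1)
  #define FIXNUM_VAL(v)  ((v) >> 1)

  static lispval generic_add(lispval a, lispval b)
  {
      if (FIXNUM_P(a) && FIXNUM_P(b))
          return a + b;       /* both tags are 0, so the raw add is exact */
      fprintf(stderr, "slow path: floats, bignums, or a type error\n");
      return MAKE_FIXNUM(0);
  }

  int main(void)
  {
      lispval s = generic_add(MAKE_FIXNUM(2), MAKE_FIXNUM(3));
      printf("%ld\n", (long) FIXNUM_VAL(s));   /* prints 5 */
      return 0;
  }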
Considering WebAssembly is designed to be relatively easy to compile ahead of time, there's no reason to make the hardware worse when you could just ship "firmware" that compiles it to an underlying ISA transparently, and reuse decades of existing knowledge. This would make the system behave more like the IBM AS/400 (later iSeries), which abstracted away the underlying microarchitecture through its firmware.
Not worth the time.
y = a*x + foo(b);
(setq y (+ (* a x) (foo b)))
GETL a  GETL x  MUL
GETL b  CALL foo 1  ADD
SETQ y  ; or MOV or whatever your arch wants to call it
The first C that I used was like this. It was actually a tiny, toy subset of C powering a programming game called C Robots by Tom Poindexter. I think that's how I first heard about Yacc, too.
Anyway, if you want to defend the hypothesis that C is a stack language under the hood from an implementor's POV, it would help to show how expressions involving function calls and variables translate to something resembling canonical, point-free Forth.
The local variables must turn into anonymous stack locations accessed implicitly. If a variable is used as an operand and is still live (has a next use), you must DUP it to keep a copy on the stack.
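Here is that translation rendered as runnable C with an explicit operand stack. push/pop, the mnemonics in the comments, and foo's definition are all stand-ins for illustration:

  #include <stdio.h>

  /* y = a*x + foo(b); evaluated the way a stack machine would. */
  static int stack[16], sp;
  static void push(int v) { stack[sp++] = v; }
  static int  pop(void)   { return stack[--sp]; }

  static int foo(int n) { return n + 1; }        /* stand-in callee */

  int main(void)
  {
      int a = 2, x = 5, b = 3, y;

      push(a);                                    /* GETL a          */
      push(x);                                    /* GETL x          */
      { int r = pop(), l = pop(); push(l * r); }  /* MUL             */
      push(b);                                    /* GETL b          */
      push(foo(pop()));                           /* CALL foo, 1 arg */
      { int r = pop(), l = pop(); push(l + r); }  /* ADD             */
      y = pop();                                  /* SETQ y          */

      printf("%d\n", y);                          /* 2*5 + foo(3) = 14 */
      return 0;
  }

If a still-live variable were already on top of the stack, the GETL would become a DUP to keep the copy around.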
Lisp isn't a stack machine either. Lisp expressions can map to stack operations (easily so, if we allow random access to off-stack operands), but they do not have to. My own TXR Lisp uses a register-based VM.
1> (disassemble (compile-toplevel '(set y (+ (* a x) (foo b)))))
** expr-1:1: warning: unbound variable y
** expr-1:1: warning: unbound variable a
** expr-1:1: warning: unbound variable x
** expr-1:1: warning: unbound variable b
** expr-1:1: warning: unbound function foo
0: 90040003 getlx t4 3
1: 90050004 getlx t5 4
2: 20020003 gcall t3 2 t4 t5
5: 90050006 getlx t5 6
6: 20010004 gcall t4 5 t5
8: 20020002 gcall t2 1 t3 t4
11: 94020000 setlx t2 0
12: 10000002 end t2
Local variables (not seen here) turn into v registers, treated uniformly with t registers.
We don't see t1 and t0 above because t0 is a read-only register that holds nil, and t1 is the assembler temporary.
4> (disassemble (compile-toplevel '(let ((x 1) y (a 4) (b 3)) (set y (+ (* a x) (foo b))))))
** expr-4:1: warning: unbound function foo
0: 20020004 gcall t4 1 d1 d0
3: 20010005 gcall t5 2 d2
5: 20020009 gcall t9 0 t4 t5
8: 10000009 end t9
The original unoptimized code is generated like this:
5> (let ((*opt-level* 0)) (disassemble (compile-toplevel '(let ((x 1) y (a 4) (b 3)) (set y (+ (* a x) (foo b)))))))
** expr-6:1: warning: unbound function foo
0: 04020004 frame 2 4
1: 2C800400 movsr v00000 d0
2: 2C820401 movsr v00002 d1
3: 2C830402 movsr v00003 d2
4: 20020004 gcall t4 1 v00002 v00000
7: 20010005 gcall t5 2 v00003
9: 20020801 gcall v00001 0 t4 t5
12: 2C020801 movsr t2 v00001
13: 10000002 end t2
14: 10000002 end t2
Another thing we see is the reduction of the + and * functions to internal binary-only variants sys:b+ and sys:b*, but that's not done as a pass over the code; the initial instruction selection does that, sensitive to *opt-level*.
I'm not sure how easy this kind of work is on stack machines.
I see where you are coming from now, and yes, you are correct these stack CPUs are not pure. The whole point though of trying to do a stack machine CPU (however impure) is to simplify register allocation, which is why the 1982 C Machine Stack Cache paper I cited has a title beginning with Register Allocation for Free.... Whether writing a compiler or writing assembler by hand, at some point you hit the problem that it's not easy to fit all your variables into registers r0 through r15 (or whatever) and you must then spill them onto the stack. You evidently have first-hand experience with the challenges of keeping performance critical variables in registers. So the essence of these stack processors, the whole point of trying to build them, is to just let you spill everything on the stack and let the CPU worry about how to map stack locations to fast registers.
Excerpt from the C Machine Stack Cache paper:
The goal of the Stack Cache is to keep the top elements of the stack in high speed registers. The problem we solve is how to perform the allocation of these registers without placing the burden on the compiler, and at the same time retaining the efficiency of register accesses. Control of the hardware, the instruction set, and a disciplined use of the aforementioned calling sequence allows a memory-to-memory style architecture (i.e. registerless to the compiler) to perform an automatic binding of memory addresses to machine registers.
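Here is a toy software model of that binding, assuming a 4-slot cache that spills and refills eagerly; real stack caches do this lazily, in hardware, invisibly to the program:

  #include <stdio.h>

  #define NREGS 4               /* top-of-stack slots kept in registers */

  static int regs[NREGS];       /* models the register file       */
  static int mem[1024];         /* models the memory-backed stack */
  static int top;               /* logical stack depth            */

  static void push(int v)
  {
      if (top >= NREGS)         /* slot occupied: spill the element */
          mem[top - NREGS] = regs[top % NREGS];  /* pushed NREGS ago */
      regs[top % NREGS] = v;
      top++;
  }

  static int pop(void)
  {
      int v = regs[--top % NREGS];
      if (top >= NREGS)         /* refill the newly exposed element */
          regs[(top - NREGS) % NREGS] = mem[top - NREGS];
      return v;
  }

  int main(void)
  {
      for (int i = 1; i <= 6; i++) push(i * 10);  /* 10..60: two spills */
      for (int i = 0; i < 6; i++)  printf("%d ", pop());
      putchar('\n');            /* 60 50 40 30 20 10 */
      return 0;
  }

Pushing a fifth element evicts the oldest cached slot to memory; popping back down pulls it in again. The compiler never names a register.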
You can start with "A Book on C" for such an implementation description.
Interestingly, SPARC was designed to run C code well. The register window idea allows cheaper function calls than on other architectures, where you need to push state to a stack prior to the jump.
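A toy model of register windows (sizes simplified and names invented; real SPARC has 8 ins/locals/outs per window, treats the window file as circular, and traps to spill when it overflows):

  #include <stdio.h>

  #define NWIN   4
  #define STRIDE 16   /* 8 locals + 8 shared slots per window */

  static int regfile[NWIN * STRIDE + 8];
  static int cwp;     /* current window pointer */

  /* The caller's %o registers are the same physical slots as the
     callee's %i registers: arguments pass without touching memory. */
  static int *in_reg(int i)  { return &regfile[cwp * STRIDE + i]; }
  static int *out_reg(int i) { return &regfile[(cwp + 1) * STRIDE + i]; }

  static void save(void)    { cwp++; }   /* SAVE on call: slide window */
  static void restore(void) { cwp--; }   /* RESTORE on return          */

  int main(void)
  {
      *out_reg(0) = 42;                  /* caller puts the arg in %o0 */
      save();                            /* "call"                      */
      printf("callee's %%i0 = %d\n", *in_reg(0));   /* 42, no push     */
      restore();
      return 0;
  }

The downside is that a deep enough call chain overflows the windows, and then you pay for the spills in bulk.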
I liked Sun, but let's not worship them more than they deserve.
Likewise, had it not been for Oracle, anyone doing Java would still be porting Java 6 code to whatever the hot replacement of the day was.
And yes, they care more about Oracle Linux than Solaris, just as all the remaining UNIX vendors have switched to GNU/Linux to cut development costs.
This allows the CPU to infer which instructions can be executed in parallel (superscalar execution).
Extracting instruction-level parallelism (ILP) from a stack-oriented ISA is harder, but if a compiler can do it, technically a CPU could do the same. The question is: what would be the advantage?
ILP extraction can be done statically. Doing it on the CPU at runtime costs time and power. OTOH, stack-based instruction sets tend to be more compact, so there is less pressure on the memory hierarchy to pull in code, leaving more bandwidth for operands and thus reducing stalls.
CPU design is a tradeoff in a tradeoff in a tradeoff.
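To make the ILP point concrete, here is a variant of the earlier expression with two independent multiplies, in both styles (mnemonics invented; the C function exists only so the comparison compiles):

  /* Register/three-address form: the two multiplies name separate
     temporaries, so their independence is visible immediately:
         t1 = a * x
         t2 = c * d
         y  = t1 + t2
     The stack form serializes everything through the top of stack:
         GETL a  GETL x  MUL
         GETL c  GETL d  MUL
         ADD
     The same parallelism is still there, but the CPU has to track
     stack positions to rediscover it, which costs time and power. */
  int ilp_example(int a, int x, int c, int d)
  {
      return a * x + c * d;
  }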
Considering all the "blobby" multicore and SMT architecture, you could argue CPU has better information about registers. Just keep register format even and you'd be fine. Not too much to allocate when you have a lot of them and fast.
Knowing what's currently hot would especially help with reducing execution costs in JIT compilation, where new code has just been generated and the CPU does not yet have accurate prediction data.
But that problem is simpler than mapping a stack to physical registers.
The bigger question is: is this useful? Does it have an actual application that can drive it forward? A solid yes to that question shoves all the others to the side.
Useful might at first just mean "hey, it's going to run my blinky lights using WASM instead of ARM/AVR/etc.!" If so, then you already have a winner as soon as it hits the FPGA bitstream flash. From there you can add support for all those Pmods on your dev board and off you go.
The transcompiler is forever.