
I agree that UM, Chifir, and uxn are Art, and that wasm is at least potentially a great platform to run this kind of archival virtual machine on top of, as well as having some very interesting ideas about how to design a VM instruction set to be amenable to efficient implementation. RISC-V is a good source of ideas for that, too!

And I appreciate you setting me straight about the extent of wasm's instruction proliferation.

But I find much to disagree with.

— ⁂ —

> There are at least three small Wasm interpreters in Rust

None of those are small; the smallest one you found is 500 lines of code, and it's very incomplete, implementing only 11 of the 436 or however many wasm instructions there are, and even those it only implements partially. (Its only arithmetic is addition and subtraction, for example.) Even the 3kloc one says it doesn't pass the wasm testsuite; its pub enum Instruction has 174 items and most of those are implemented as follows:

                Instruction::F64Store(_, _) => todo!(),
                Instruction::I32Store8(_, _) => todo!(),
                Instruction::I32Store16(_, _) => todo!(),
                Instruction::I64Store8(_, _) => todo!(),
The 4kloc one (which I think is actually closer to 2.8kloc) doesn't include any of the SIMD ops, but it claims to implement the whole wasm spec as of 02019, and it's plausible that it actually does. A casual glance at the code doesn't reveal anything that contradicts that claim; it implements about 200 instructions.

None of them include the peripherals, which are by definition excluded from wasm. But peripherals are usually the part of an emulator that requires the most effort, and they're usually a much bigger compatibility bitrot shitshow than your CPU is.

By contrast, the UM interpreter in the Cult of the Bound Variable paper was I think 55 lines of C (also, not including peripherals). My dumb Chifir interpreter was 75 lines of code; adding Yeso graphical output was another 30 lines https://gitlab.com/kragen/bubbleos/blob/master/yeso/chifir-y....

Uxn, despite its inefficiency, runs useful applications on the Nintendo DS today (in 5200 lines of pretty repetitive C, including the peripherals), and wasm doesn't and probably never will. And 365 teams in the ICFP programming contest independently implemented the UM successfully enough to run at least some existing applications; there will probably never be 300+ independent reimplementations of wasm. So in important ways they're already closer to the goal of eliminating bitrot than wasm ever will be. This is probably because eliminating bitrot isn't part of wasm's goals.

(I realize I forgot to mention the Infocom Z-machine, as well.)

Wasm has another deficiency besides complexity: as far as I know there's no standard way for a wasm program to generate some wasm and start running it, although of course the browser platform does provide that ability. This is essential if the thing you want to run under it is a virtual machine or other sort of emulator, because the only way to do an efficient emulator is to compile the code you want to interpret into the instruction set the emulator is running on. Something wasm-like can dramatically simplify this; if you're compiling to wasm, for example, you don't have to do instruction scheduling or register allocation, and you might not even have to do constant folding and function inlining.
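
To make that concrete, here is a minimal sketch (Rust, with invented names, nothing from any real compiler) of what targeting a stack machine like wasm buys you: a postorder walk over an expression tree emits code directly, and there is no register allocator anywhere.

    // Sketch only: compiling an expression tree to stack-machine code needs no
    // register allocation; a postorder walk emits the instructions directly.
    enum Expr {
        Const(i32),
        Add(Box<Expr>, Box<Expr>),
        Mul(Box<Expr>, Box<Expr>),
    }

    fn emit(e: &Expr, out: &mut Vec<String>) {
        match e {
            Expr::Const(n) => out.push(format!("i32.const {n}")),
            Expr::Add(a, b) => { emit(a, out); emit(b, out); out.push("i32.add".into()); }
            Expr::Mul(a, b) => { emit(a, out); emit(b, out); out.push("i32.mul".into()); }
        }
    }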

— ⁂ —

One of the interesting ideas in the RISC-V ecosystem is that a Cray-style vector instruction set (RV64V) can give you SIMD-instruction-like performance without SIMD-instruction-like instruction set inflation. And, as the APL family shows, such vector instructions can include scalar math as a special case. I haven't been able to come up with a way to define such a vector instruction set that wouldn't be unacceptably bug-prone, though; https://dercuano.github.io/notes/vector-vm.html describes some of the things I tried that didn't work.

— ⁂ —

Why am I being so unreasonable about the amount of code? After all, a few hundred lines of C is something that you can write in an afternoon, right, so what's the big deal about 500 or 3000 lines of code for something you'll use for decades? And nobody has ever written an NES emulator in 500 or 3000 lines of code.

The problem is that, to parody Perlis's epigram, if your virtual machine definition has 500 lines of code, you probably forgot some. If a platform includes that much functionality, you have designed it so that that functionality has to live in the base platform rather than being implemented in programs that run on the platform. And that means that you will be strongly tempted to add stuff to the base platform, which is how you break programs that used to work.

In the case of MS-DOS or NES emulation this is limited by the fact that Nintendo couldn't go out and patch all the Famicoms and NESes in people's houses, so if they wanted to change things, well, too bad. NES emulator authors have very little incentive to add new functionality because the existing ROMs won't use it, and that's what they want to run.




I think SIMD was a distraction to our conversation; most code doesn't use it, and in the future the length-agnostic flexible vectors (https://github.com/WebAssembly/flexible-vectors/blob/master/...) are a better solution. They are a lot like RVV; https://github.com/riscv/riscv-v-spec. Research around vector processing is why RISC-V exists in the first place!

I was trying to find the smallest Rust Wasm interpreters I could; I should have read the source first. I only really use wasmtime, but this one looks very interesting: zero deps, zero unsafe.

16.5kloc of Rust https://github.com/rhysd/wain

The most complete wasm env for small devices is wasm3

20kloc of C https://github.com/wasm3/wasm3

I get what you are saying about being so small that there isn't a place for bugs to hide.

> “There are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies, and the other way is to make it so complicated that there are no obvious deficiencies. The first method is far more difficult.” (C. A. R. Hoare)

Even a 100-line program can't be guaranteed to be free of bugs. These programs need embedded tests to ensure that the layer below them is functioning as intended. They cannot and should not run open loop. Speaking of 300+ reimplementations, I am sure that RISC-V has already exceeded that. The smallest readable implementation is like 200 lines of code; https://github.com/BrunoLevy/learn-fpga/blob/master/FemtoRV/...

I don't think Wasm suffers from the base extension issue you bring up. It will get larger, but 1.0 has the right algebraic properties to be useful forever. Wasm does require an environment; for archival purposes that environment should be written in Wasm, with an API for instantiating more envs passed into the first env. There are two solutions to the Wasm generating and calling Wasm problem. The first would be a trampoline, where one returns Wasm from the first Wasm program, which is then re-instantiated by the outer env. The other would be to pass in the API to create new Wasm envs over existing memory buffers.
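
Roughly, the trampoline could look like the sketch below, using wasmtime's Rust API as the outer env. The export names ("generate", "memory", "run") and the (ptr, len) convention are invented for illustration; this is a sketch, not how any existing env actually does it.

    // Sketch of the trampoline: the guest builds a new Wasm module in its own
    // linear memory and returns (ptr, len); the outer env re-instantiates it.
    use wasmtime::{Engine, Instance, Module, Store};

    fn run_trampoline(outer_wasm: &[u8]) -> anyhow::Result<()> {
        let engine = Engine::default();
        let mut store = Store::new(&engine, ());

        // Instantiate the first (outer) guest module.
        let module = Module::new(&engine, outer_wasm)?;
        let instance = Instance::new(&mut store, &module, &[])?;

        // Ask it to generate a fresh Wasm module into its linear memory.
        let generate = instance.get_typed_func::<(), (i32, i32)>(&mut store, "generate")?;
        let (ptr, len) = generate.call(&mut store, ())?;

        // Copy the generated bytes out of guest memory...
        let memory = instance.get_memory(&mut store, "memory").expect("exported memory");
        let bytes = memory.data(&store)[ptr as usize..(ptr + len) as usize].to_vec();

        // ...and bounce: instantiate and run the freshly generated module.
        let inner = Module::new(&engine, &bytes)?;
        let inner_instance = Instance::new(&mut store, &inner, &[])?;
        let run = inner_instance.get_typed_func::<(), ()>(&mut store, "run")?;
        run.call(&mut store, ())
    }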

See https://copy.sh/v86/

MS-DOS, NES or C64 are useful for archival purposes because they are dead, frozen in time along with a large corpus of software. But there is a ton of complexity in implementing those systems with enough fidelity to run software.

Lua, Typed Assembly; https://en.wikipedia.org/wiki/Typed_assembly_language and Sector Lisp; https://github.com/jart/sectorlisp seem to have the right minimalism and compactness for archival purposes. Maybe it is sectorlisp+rv32+wasm.

If there are directions you would like Wasm to go, I really recommend attending the Wasm CG meetings.

https://github.com/WebAssembly/meetings

When it comes to an archival system, I'd like it to be able to run anything from an era, not just specially crafted binaries. I think Wasm meets that goal.

https://gist.github.com/dabeaz/7d8838b54dba5006c58a40fc28da9...


> When it comes to an archival system, I'd like it to be able to run anything from an era, not just specially crafted binaries. I think Wasm meets that goal.

Wasm can only run specially crafted binaries; it can't run arbitrary i386 or ARM7 code, only Wasm code.

I think the way to run anything from an era is with the following four-layer cake:

0. Whatever hardware and operating system the programmer-archaeologist happens to be running 1000 years from now.

1. An implementation of the forever-platform running on that hardware and OS. This needs to be simple enough for the programmer-archaeologist to write and debug without a running implementation to compare to, even if their hardware is base-10 or balanced ternary or has 21-bit word-addressed memory or whatever, but efficient enough to support real applications.

2. An implementation of a popular platform from the era you want to preserve, running on the forever-platform, such as wasm, MS-DOS, RV64+UEFI, or amd64+BIOS. (This doesn't have to be a platform that was widely used, just a platform to which the software you want to preserve can be compiled.) Probably you want to implement this as a binary-to-binary compiler targeting forever-platform code, not an interpreter, so that the slowdown introduced by this layer is less than an order of magnitude.

3. The application software you want to preserve, the "anything from an era".

Wasm is not even close to being a candidate for the forever-platform, but it might be a reasonable choice for (the CPU part of) the layer on top of it, because there are compilers that can compile a large body of C and C++ to it; and because it's a lot easier to compile from, and compile efficient code from, than botched monstrosities like amd64 and ARM7. (Did you realize that in ARM7 with Thumb an addition instruction can change the instruction set the processor is implementing? Because the low bit of the PC tells you whether it's in Thumb mode, and you can specify the PC as a destination register.)

In that context the fact that Wasm keeps changing isn't a big problem. You can compile some code with Wasm today and bundle it into a cart with an implementation of today's Wasm spec, and 8 or 16 years from now when you compile different code to a different version of Wasm, you bundle that code into a cart with an updated Wasm implementation.


You say, "I get what you are saying as to be so small that there isn't a place of bugs to hide." But that's not really what I was saying.

The issue of perfection, bug-freeness, is an interesting one. You could have a specification that was bug-free but implementations that were buggy, or you could have a specification that was itself buggy. Most specification errors won't give you an unimplementable specification; they'll give you a specification that specifies behavior you didn't want.

In the forever-platform context, most such specification errors are unimportant. It doesn't matter if your bitwise operation is accidentally NOR instead of NAND, or if you accidentally specified division to round toward -∞ instead of toward 0, or whatever. As long as your implementation of the forever-platform follows the spec, and you test your forever-platform programs on the implementation and fix them when they break, they'll continue working on future implementations of the spec.

So, with respect to specification complexity, the reason the specification needs to be very short is not so that the specification has no bugs; it's so that the specification has no ambiguities that result in observably different behavior among implementations. (The spec also needs to specify something Turing-complete and reasonably efficient, but those are less difficult.)

A buggy implementation you test on would be a much bigger problem, because it could result in purportedly working "carts" (or "roms" or "programs") that in fact only operate as intended on the buggy implementation. Then programmer-archaeologists would be forced to guess what the behavior of the actual implementation was. In many cases a metacircular interpreter will not help with this: if reading outside the bounds of memory produces 0 or -1, or if division by 0 produces 0 or ∞, or if arithmetic overflow saturates or wraps mod 2³², the metacircular interpreter is likely to inherit this behavior from its host implementation.
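
Here's the kind of thing I mean, written out as executable checks. The particular choices (truncating division, wrapping addition) are arbitrary examples of mine, not taken from the UM or Chifir specs; the point is only that these are the behaviors a spec, and the carts you test against it, have to pin down unambiguously.

    // Sketch: two corner-case behaviors pinned down as executable checks.
    // The choices here are illustrative, not from any actual spec.
    fn main() {
        // signed division rounds toward zero: -7 / 2 is -3, not -4
        assert_eq!(spec_div(-7, 2), -3);
        // 32-bit addition wraps mod 2^32 rather than saturating
        assert_eq!(spec_add(0xFFFF_FFFF, 1), 0);
        println!("semantics checks passed");
    }

    // reference semantics an implementer (or a test cart) could compare against
    fn spec_div(a: i32, b: i32) -> i32 { a / b }             // Rust's / truncates toward zero
    fn spec_add(a: u32, b: u32) -> u32 { a.wrapping_add(b) } // wrapping, never saturating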

Buggy implementations might be detectable by formal methods or by asking for alternative implementations, perhaps in a contest format as in the UM case.

— ⁂ —

Wasm3 looks very appealing in a lot of ways. 64 KiB (?) for code and 10 KiB for RAM is small enough to run on a lot of small platforms! I didn't know Wasm could scale down that small. However, most things compiled for Wasm will need a lot more than that.

I think getting something like LuaJIT or HotSpot, or even GCC's trampolines for nested functions, working on top of the trampoline approach you suggest for Wasm, is probably technically possible but impractically slow on existing implementations.


There's a lot to think about here, but right now I just want to respond to one point, about vectors and SIMD.

Performance limits what you can program on a platform, and there's a pretty wide range of interactive software where the performance bottleneck is generating pixels to put on the screen, just because there are so goddamned many of the fuckers. And fortunately that is commonly vectorizable; vector operations implemented by special CPU instructions have been a key enabling hack since BitBlt on Smalltalk in the 01970s. (In fact, the very first machine that ran a GUI had SIMD instructions in hardware, the TX-2, completed in 01958, but it didn't have pixels.)

This is even more important now than before, even without bringing GPUs into the picture. With optimized single-threaded C on an 8-core system, you're getting at best 12.5% of the machine's performance. If SIMD gives you an 8-fold speedup on slinging pixels around, which is common, you're getting at best 1.6% of theoretical performance with single-threaded SIMD-less code. With a naive, CPython-style bytecode interpreter, you have another factor of 40 or so slowdown, so you're getting 0.04% of the machine's performance: 11.3 Moore's-law performance doublings, so on today's machines you're back to 01999, age of Winamp, Quake III, ICQ, RealAudio, and Baldur's Gate. On the 02009 machine I'm typing this on you're back to 01987.

Now, people wrote a lot of interesting programs in 01987. Fractint, Hypercard, Quattro Pro, X11R3, a lot of the more elaborate Infocom games. And modern SSDs and humongous RAM mean you can do a lot of things now you couldn't do then, even if your CPU is clunking along with 01987 performance. And even a basic JIT compiler can probably get you from 0.04% up to 0.5% of the machine's performance, maybe 4% if it's multithreaded, though that makes determinism enormously more difficult.

But in theory Numpy-style vector operations can take advantage of both multiple cores and SIMD, usually giving you about 20% of the machine's performance even in a naive, CPython-style bytecode interpreter, at least on the kinds of operations where it's applicable, and without sacrificing determinism. Possibly with the kind of pipelining and tiling optimizations Numpy doesn't do you can do even better than that.
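
As a minimal sketch of why that works (invented opcodes, not any real VM's): the interpreter pays its dispatch cost once per whole-vector operation, and the inner loops are straight-line code that the host compiler can vectorize and that a runtime could, in principle, farm out across cores.

    // Sketch of a vector bytecode: one dispatch per array-sized operation,
    // so interpreter overhead is amortized over the vector length.
    enum Op {
        AddF32 { dst: usize, a: usize, b: usize },  // dst[i] = a[i] + b[i]
        ScaleF32 { dst: usize, a: usize, k: f32 },  // dst[i] = a[i] * k
    }

    fn run(ops: &[Op], regs: &mut [Vec<f32>]) {
        for op in ops {
            match *op {
                Op::AddF32 { dst, a, b } => {
                    let out: Vec<f32> = regs[a].iter().zip(&regs[b]).map(|(x, y)| x + y).collect();
                    regs[dst] = out;
                }
                Op::ScaleF32 { dst, a, k } => {
                    let out: Vec<f32> = regs[a].iter().map(|x| x * k).collect();
                    regs[dst] = out;
                }
            }
        }
    }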

Of course, you can also design a virtual machine architecture in such a way as to permit efficient compilation, and not have to fall back on turbocharged vector operations. A lot of Wasm is designed with this objective in mind; for example, there is no dynamic typing to require runtime type checks, local variables can't alias linear memory, control flow is structured to help out the compiler, and so on, so that your interpretation/emulation slowdown is almost nil, reduced to the occasional bounds check that couldn't be hoisted out of a loop. And you can provide access to SIMD instructions, which Wasm now does, and multithreading, which it doesn't.

But all of this can't be done on a simple VM architecture. And for a given VM architecture I think an interpreter is always simpler than a compiler.

One of my stretch goals for the forever-platform is for it to be independently and compatibly reimplementable from the spec, that is, without access to a working implementation to test against, ideally within a few hours (as with Nguyen and Kay's "fun afternoon hack" desideratum for Chifir). There are indeed 300+ RISC-V implementations, but they aren't independent in the way the implementations of the UM were; the authors had not only a body of RISC-V software to work from but also other implementations of RISC-V to test the software on. But for Rosetta-Stone-level archival, we want the programmer-archaeologist to be able to get their implementation working even though they don't have an existing implementation to test against.

UM is a demonstration that this goal is achievable, if you're willing to accept that ceiling of 0.04% of native performance for the straightforward interpreter you'd write as a fun afternoon hack.

The hope with vector operations is that they give you an extra three orders of magnitude in inner-loop performance, without stretching the implementation task beyond what's achievable as a "fun afternoon hack", and without introducing so much ambiguity and so many tricky corner cases in the spec that in practice a given "ROM" will not run on an independent reimplementation. I don't know if I'll find a way to do it or not. But that's the objective.

And that's why I've been wrestling with SIMD, and don't think it's just a distraction.

So, thank you very much for the links to the relevant efforts! Oh dear, though, the V1.0 version of RVV is 111 pages, longer than the entire RV32/RV64 unprivileged spec.


The Rosetta-Stone spec needs to have an executable test; otherwise how do you know it works? Even the software you want to get running is a sort of test; you just need test oracles.

> And that's why I've been wrestling with SIMD, and don't think it's just a distraction.

I think in this instance, Wasm 128b SIMD is a distraction. I agree with everything you have said, and with your goals. Wasm 128b SIMD was a stop-gap and distracts from what ultimately will be a much better solution (vector processing). It is the lowest-common-denominator (128b-wide registers) impl that they could come up with to satisfy necessary immediate perf needs. It does get good speedups on code right now, but it more than doubles the number of Wasm opcodes. It isn't tenable to have the same 128 -> 256 -> 512 path that CPUs have taken. Implementation complexity, and having to recode software every time the SIMD registers change shape, is lame. CPU vendors are totally OK with having a needless upgrade train. The future is the past with vector processing, like RVV and flexible vectors. For archival software, I'd argue that SIMD doesn't matter. I don't think it is fair to compare RVV against the base ISA spec; the base ISA spec is trying to be the simplest possible thing, just a bootloader for RVV and custom instructions.

Being able to extract parallelism is way more important than constant factors, even for pathologically slow systems like Python. Thread counts for single-socket systems are already at 256. One could probably implement Quake 3 in PyPy, maybe even in CPython now, and hit 60 FPS. CPython could definitely run Quake 1.

Wasm does support threads now. https://web.dev/webassembly-threads/

Wasm SIMD didn't even need to exist; the AST could have been encoded in a call graph that could have executed a fallback implementation, or been converted to local SIMD. It could have all just been function calls, like a binary intrinsic. It didn't really need an instruction set extension. You could encode a vector program, or something shader-like, in a binary Wasm call graph.
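
Something like this sketch (names invented): the scalar body is the portable meaning, and a host that recognizes the call is free to replace it with a native SIMD instruction.

    // Sketch: a "binary intrinsic" as an ordinary function call instead of an opcode.
    // The scalar body below is the fallback and also the spec-level meaning.
    fn f32x4_add(a: [f32; 4], b: [f32; 4]) -> [f32; 4] {
        [a[0] + b[0], a[1] + b[1], a[2] + b[2], a[3] + b[3]]
    }

    fn f32x4_mul(a: [f32; 4], b: [f32; 4]) -> [f32; 4] {
        [a[0] * b[0], a[1] * b[1], a[2] * b[2], a[3] * b[3]]
    }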

I love that you use Moore Units in your comparisons, rarely do folks do this. When folks argue over 4% difference for some aspect of safety, they are talking about weeks of relative performance on a Moore Scale. But no amount of concrete performance gains will allow them to have safety or correctness. They are content to live in a ghetto all driving fast cars.


Interesting paper on a formally verified Wasm interpreter.

https://github.com/WasmCert/WasmCert-Isabelle

Mechanising and Verifying the WebAssembly Specification https://www.cl.cam.ac.uk/~caw77/papers/mechanising-and-verif...


You can't have an executable test if your objective is to get from not having an executable platform to having an executable platform. The best you can do is a set of (cart, expected output) pairs; the CBV UM published three carts — http://www.boundvariable.org/sandmark.umz, http://www.boundvariable.org/codex.umz, and http://www.boundvariable.org/um.um — and the expected output for one of them, http://www.boundvariable.org/sandmark-output.txt. They also had an online test oracle, in the form of their scoring server, but that didn't become relevant until after you already had the UM running well enough to run the Codex, or most of it.

However, the third cart was a metacircular interpreter, so in some sense it potentially represents a kind of reference implementation — just, one you couldn't use until your UM was running well enough to run it.

This may not be the perfect strategy but it worked well enough for 365 teams to submit a proof witness for at least one of the challenges in the Codex. And, importantly, it isn't very dependent on the hardware you're implementing the machine on, and in particular it doesn't require you to be able to run an external test suite.

— ⁂ —

You may be right about 128-bit Wasm SIMD.

When you say, "thread counts for single socket systems are already at 256", do you mean you need 256 threads to hit 100% utilization of the AVX units? Or are you talking about a chip with 64 or 128 cores with 2-way or 4-way "hyperthreading"? On https://www.tomshardware.com/reviews/cpu-hierarchy,4312.html... the highest number I see is 64 cores with 128 threads on the AMD Threadripper 3990X. Are you talking about the Tera MTA or Tilera or Altra or Parallela or something?

I'm not sure if mere vector operations scripted by a single control thread (SIMD, but in a broader sense than SIMD intrinsics, including things like RVV) can approach the level of parallelism needed.

Suppose you have 64 cores with 256-bit AVX2, like the 2.9GHz Threadripper 3990X, which Donald Kinghorn benchmarked at 1.57 teraflops on HPLinpack, and (optimistically) you're doing double-precision floating-point. In theory each core can do up to 4 flops per cycle (right?) and thus 740 gigaflops; perhaps he's counting FMA as two flops instead of one, because on the Cray-1 it would have been two flops. Without any parallelism, the max you get is 2.9 Gflops (or 5.8 if you count FMA as two), and if you have an additional 40× slowdown from naïve bytecode interpretation like CPython you get 73 megaflops, 0.01% or 0.005% of theoretical. This is maybe comparable to a Pentium MMX, Pentium Pro, or Pentium II; according to Greer and Henry's SC97 "Micro-Ops to TeraFLOPS" technical paper, matrix blocking boosted ASCI Option Red's Linpack scores from about 10 megaflops per 300 MHz Pentium Pro to about 160 megaflops per.

(This is perfectly adequate for Quake III, but maybe not at 60 fps.)

For our naïve interpreter to hit the 1.6% of maximum performance, 11.6 gigaflops, that you could get from single-threaded AVX assembly on that chip, you need an average vector length of 158 (11.6 ÷ .073). For it to approach 100% of maximum performance we need an average vector length of 10100 (740 ÷ .073). I am not convinced that you can reliably get such large vector lengths in most pixel pushing.

Even with a naïve JIT compiler with, say, 4 clock cycles per bytecode instruction, you need an average vector length of 1010 (≈1024) to get close to maximum performance, or 256 to be only two Moore generations behind (≈02018, or maybe ≈02016 given that the 3990X came out in 02020).
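
(Spelling that arithmetic out as a little script, in case anyone wants to poke at the assumptions; the constants are just the rough figures quoted above, nothing measured:)

    // Back-of-envelope check of the vector lengths above; all constants are the
    // approximate figures from this thread, so treat the outputs as rough.
    fn main() {
        let peak = 740.0;              // Gflops: 64 cores x 2.9 GHz x 4 flops/cycle
        let single_thread_avx = 11.6;  // Gflops: 2.9 GHz x 4 flops/cycle, one thread
        let interp = 0.073;            // Gflops: 2.9 GHz / ~40x bytecode-interpreter slowdown
        let jit = 2.9 / 4.0;           // Gflops: naive JIT at ~4 cycles per operation
        println!("vector length to match 1-thread AVX: {:.0}", single_thread_avx / interp);
        println!("vector length to match peak:         {:.0}", peak / interp);
        println!("vector length for JIT to match peak: {:.0}", peak / jit);
    }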

And the situation is 8× worse if you're alpha-compositing pixels or something; a 256-bit AVX register holds 32 8-bit color components rather than just 4 doubles.

I think that usually you will need both SIMD and more flexible forms of parallelism to extract that much parallelism.

— ⁂ —

Why does this matter for reproducible software? Because if the forever-platform imposes a 10100× or a 1024× slowdown then we'll only use it when we really have to, and most of our software will fail to be reproducible, and we'll be back to directories full of screenshots of things we can no longer do.

— ⁂ —

Yes, 4% difference is three weeks of Moore's Law. But perhaps Moore's Law has already ended?

To, "no amount of concrete performance gains will allow them to have safety or correctness," I would add, "or durability". Modern software is not just a ghetto but a shantytown slum built from cardboard: it collapses after a few good rains.


You need a high-level relational projection system that can operate on multiple dimensions simultaneously; shaders or SQL are ideal for taking advantage of parallelism.

Especially if you can operate symbolically: think triple-nested loops where the inner loop is some expensive projection; if it is over a regular space, it can be decomposed as a tree. I think this language could be represented as a Wasm DAG of functions: map/pmap/apply/tree_call/reduce.

Having a spec and a body of software to run but no test oracles for how that software runs feels contrived. Having megabytes or gigabytes of code but not even screenshots of how it should look when it runs isn't the same as having a book in a dead language and not knowing what it should sound like.

> However, the third cart was a metacircular interpreter, so in some sense it potentially represents a kind of reference implementation — just, one you couldn't use until your UM was running well enough to run it.

This is a great example of a test oracle. Fixed-point correctness.


Hmm, I understand a "test oracle" to be a thing where, if you submit any arbitrary input to it, it will tell you what the correct output should be. The metacircular interpreter isn't such a thing: you can submit an input to it, and if its output is different from the output of the implementation it's running on, you know that implementation has a bug.

Say my implementation is P and the metacircular interpreter (presumed correct) is M. If P[M[X]] outputs "foo", while P[X] outputs "bar", we know there is a bug in P. This is similar to a test oracle in some ways: given a test oracle O, if O[X] outputs "foo" and P[X] outputs "bar", we know P has a bug. The difference is that in the metacircular case, the correct output might actually be "foo".

Worse, if P[M[Y]] outputs "baz" and P[Y] also outputs "baz", it does not therefore follow that "baz" is the correct output, as it would if O[Y] outputs "baz". It might just be that P has a bug that flows through to the metacircular interpreter. Maybe it rounds division the wrong way, or incorrectly considers negative zero to be unequal to positive zero.

I agree that screenshots would be very useful, but a static repository of screenshots is also not a test oracle, for a slightly different reason: you cannot submit a novel input to it and find out what the correct screenshot would be.

I agree that a DAG of functions is a useful representation that preserves significant degrees of parallelism that conventional machine-code representations expunge. However, Wasm also expunges them.

The idea of using a relational rather than functional or imperative computing paradigm for archival is interesting.

We should probably discuss this in a more amenable format.



