Writing a register based VM in less than 125 lines of C code (andreinc.net)
291 points by nomemory on Dec 8, 2021 | 41 comments



Our machine has W=UINT16_MAX words, of N=16 bits each

Did the seeming off-by-one error catch anyone else? A 16-bit integer has 65536 possible values, so you'll go one past the end of the virtual RAM if you attempt to access address FFFFh.

With this simple trick, we can avoid writing a switch statement with 16 (+1) cases

Personally, I prefer having a switch since it's more obvious which opcode goes where. Probably makes the code a little shorter too (and is that another off-by-one, this time in the other direction?)

Also, I understand the need to keep a "learning architecture" simple, but without byte-wide memory access, you'll quickly realise how annoying it is to work with actual bytes. For example, any string handling, unless you decide to use UTF-16 instead. I think a sweet spot is 16-bit address and 8-bit data, with register pairs for 16-bit operations, much like the popular Z80 and the 8080/8085 that preceded it.

The DEC Alpha is a real architecture that made a similar mistake and later had to be extended to fix it: https://en.wikipedia.org/wiki/DEC_Alpha#Byte-Word_Extensions...


A language lawyer quiz: if the definition is changed to

    uint16_t mem[UINT16_MAX + 1] = {0};
is it guaranteed to work on platforms where unsigned int is 16-bit wide? What about platforms where void* is 16-bit wide (hint: &mem[65536] must be a valid pointer value that must compare unequal from &mem[0])?


> &mem[65536] must be a valid pointer value

explain


C99 draft, 6.5.6.8:

    Moreover, if the expression P points to the last element of an array object, 
    the expression (P)+1 points one past the last element of the array object,
    and if the expression Q points one past the last element of an array object,
    the expression (Q)-1 points to the last element of the array object. If both
    the pointer operand and the result point to elements of the same array object,
    or one past the last element of the array object, the evaluation shall not
    produce an overflow; otherwise, the behavior is undefined. If the result points
    one past the last element of the array object, it shall not be used as the
    operand of a unary * operator that is evaluated.


> Personally, I prefer having a switch since it's more obvious which opcode goes where.

In practice, a switch is no worse for performance, and can even be better: any non-naive optimising compiler will emit a nicely ordered indexed jump table, whereas calling through function pointers involves extra indirection through the functions' addresses, plus the overhead of passing arguments and setting up/tearing down stack frames.


I've thought of that, and to be honest I think it's worth mentioning in the article. But having set this arbitrary 125-line limit before even starting to sketch things out, I was pushed towards the function tables. As a plus, the article is intended for beginners, and it would be cool if they learned about function pointers from a practical example.

But from a performance perspective, you are right!


Further to this: this kind of thing is something the Forth community takes seriously. They call these threaded code techniques. [0] (The Gforth interpreter, for instance, gives you the choice of various different techniques.)

Anton Ertl of Gforth fame published a microbenchmark performance comparison of toy interpreters, written in C, run on various CPUs. [1] As you say, the approach using function pointers (call threading) scores poorly.

Interestingly, direct threading and indirect threading are extremely close, with the winner seeming to depend on the specific CPU. [1] Branch-prediction differences seem to be the main reason. There was a 2001 paper on this. [2]

[0] https://www.complang.tuwien.ac.at/forth/threaded-code.html

[1] https://www.complang.tuwien.ac.at/forth/threading/

[2] https://github.com/ForthPapersMirror/Papers/blob/master/Conv...


There's also "computed gotos". It's not the same as function pointers, though.

https://eli.thegreenplace.net/2012/07/12/computed-goto-for-e... - article about them being used in CPython



The real problem solved by BWX wasn't the shifting and masking needed to handle 8-bit values; compilers dealt with that pretty transparently. It was how complicated it made hardware I/O, which often expected 8-bit or 16-bit atomic writes, whereas Alpha could not do I/O narrower than a longword, i.e. 32 bits (all I/O on Alpha was MMIO).

This is visible in the form of a big split in Alpha hardware support (drivers etc.) on VMS and OSF/1 between "non-BWX" and "BWX" hardware.


> without byte-wide memory access, you'll quickly realise how annoying it is to work with actual bytes. For example, any string handling, unless you decide to use UTF-16 instead. I think a sweet spot is 16-bit address and 8-bit data

Even if you don't want to call it UTF-16, you can just ignore the upper byte of your 16-bit cells and pretend they only store bytes when that's more convenient for you: for practical purposes you won't end up with any less memory available than if your cells were only 8 bits wide, but you have the option of using more when you need it.


Really cool!

I've never understood why modern software projects use names like `mw` or `mr`. What's wrong with `memory_read` / `memory_write`? `opcode_execute_fn opcode_executers[NOPS]` is orders of magnitude clearer than `op_ex_f op_ex[NOPS]`, and the only advantage of the latter is that it takes less time to write.

You will spend at least twice as much time reading your code as writing it, just take the extra time. Nobody needs another `vsnprintf_s`.

ETA: I have to repeat this is an insanely cool project idea, and very good execution


On one hand, the abbreviations are extra mental work. On the other hand, when you start composing fairly complicated chains together, all those extra letters are also extra mental work. Neither is inherently right, but I do prefer something like the former here (just making up something emulatorish):

    mw(cpu.r[0], mr(mr(cpu.r[1])));

    memory_write(processor.registers[0], memory_read(memory_read(processor.registers[1])));


This was exactly my reason to shorten the name, even if most of the feedback I've received is related to the "weird" naming conventions.

Initially mw was mem_write, mr was mem_read, reg was registers, etc. But because I wanted to keep the functions as short as possible (to fit them on a single line without making them totally unreadable), I came up with those names, which are confusing at first.


I think this boils down entirely to how comfortable the reader is with the concepts involved. If you have to stop and think for a sec about what "memory_write" actually does, the spelled-out words are a useful prompt and the time it takes to read them is not a bottleneck. But if you're used to thinking at that level of abstraction, used to tossing the concepts around in your head like mathematical symbols, then the shorter version is preferable and the longer one feels bogged down and pedantic. This is why regex is a thing, why people skilled in APL rave about it so much, and why nobody ever bothers to symlink rm, cp, and mv to remove, copy, and move respectively.


When you're optimizing for a metric like lines of code written, abbreviations make sense. If you're optimizing for readability you could do something like this (I'm not a C programmer, so I don't know if this is syntactically correct C):

    memory_write(
        processor.registers[0],
        memory_read(
            memory_read(processor.registers[1])
        )
    );

To me, multiline statements are much more readable than single line nested statements regardless of abbreviations used.


I do this sometimes when the line gets too long due to long names. I do not like it.


> just making up something emulatorish

Double-indirect through memory? I think that's something only VAX has.


Since this code deals mostly with instruction mnemonics (br, jsr, lea), it makes sense for the rest of the code to follow the same style. The code has an inherently repetitive structure (implementing the various opcodes), so shorter global names contribute to its compactness. Writing "register_file" 30 times rather than "reg" wouldn't make it any clearer when it's obvious that the short name serves the same purpose.


Same reason math uses letters rather than words.

It's easier to pattern match strings that are short.

Having one `mr`/`mw` is confusing, but taking up whole lines with long strings can obscure the logic of the code.


Alternate perspective: After you've briefly immersed yourself in the context, you no longer need the longer names; the shorter ones make sense. In fact, it would be fatiguing to have to constantly read the longer names after you're familiar with them.


I think people are too used to huge projects. This one is 125 lines of C; in that space, you can afford to express yourself tersely.


This reminds me of the Gigatron, a minimal retro TTL-computer, which came as a kit. See: https://gigatron.io/ The C program for the emulator can be found at https://github.com/kervinck/gigatron-rom/blob/master/Docs/gt... which has 126 lines.


Skimmed over this... nice write up so far! Notes:

* Would be nice to have a TOC for such a long article.

* I never implemented a register-based VM, but I think I remember that one of the trickiest parts is allocating registers when compiling from a higher-level language [1], which I don't think is mentioned in the article? I assume the article doesn't concern itself with this, since the sample program just uses the bytecode directly.

* Random: I've been thinking of a DSL that could express a VM's logic without having to deal with all the micro details of C... A quick search just turned up a few academic papers investigating the idea, fwiw. [2] [3]

1: https://en.wikipedia.org/wiki/Register_allocation

2: https://link.springer.com/chapter/10.1007/978-3-540-25935-0_...

3: https://dl.acm.org/doi/10.1002/spe.434


Register allocation is not that hard, especially if you don't do it via graph coloring and have more than 2 general-purpose registers available. Build SSA form, remove critical edges, spill temporaries until the number of simultaneously live temporaries is strictly less than the number of available registers, assign registers in a straightforward manner, move out of SSA (that's where that one unused register may come in handy). See [0] and its bibliography for more details.

[0] https://hal-lara.archives-ouvertes.fr/hal-02102286/file/RR20...


Thanks for the feedback, I appreciate it.

You are right about the TOC, my other long articles have them, but for this one I simply forgot to generate one. I will generate it tomorrow.

There's no compiler, and yes, you are right that managing a limited set of registers is hard when compiling from higher-level languages. But there are a few compilers for LC3, so it's not impossible to write one even for such a limited number of registers. This is not my area of expertise, though, so someone more knowledgeable can comment on that.


If 8 registers is "limiting", I don't know what you'd call something like a PIC which has a single accumulator... and yet, C compilers exist for those too. On the bright side, it really reduces the number of choices for register allocation.


I call it painful.


In my undergrad, we used an updated version of LC-3 (called LC-4, reasonably enough) for our intro architecture course. LC-4 came with a number of "quality of life" improvements. The simplest ones were just extra arithmetic and logical instructions; if I recall correctly, LC-3 only implements an ADD instruction (plus a variant for adding an immediate value), which meant doing multiplication required manually writing a loop. Additionally, in both ISAs, branching is done through a specialized "NZP" module: the sign of the most recent result produced by the ALU (negative, zero, or positive) is saved, and a branch tests which conditions are expected, e.g. "branch if the last result was positive" or "branch if the last result was negative or zero". (Incidentally, this gives branches an encoding identical to a jump but with one less bit available to specify the destination, and it makes the choice of the NOP binary representation easy.) However, LC-4 added direct comparison instructions that write to the NZP module without having to go through a register, which helped avoid spilling to memory as much in algorithms that need more than just a few variables.


The article is in early draft, so there might be some errors and/or bugs.


I haven't read the whole article yet, I just want to report what seems to be a small error:

> The code is written in C11, and it will probably compile on most operating systems. The repo can be found here, and the exact source code is vm.c:

The link on "here" is broken, it's https://github.com/,nomemory/lc3-vm (with a comma before your github username) instead of (I think) https://github.com/nomemory/lc3-vm.


Thanks for spotting this. I will fix it as soon as I am back to my personal computer.


Ah the LC-3! I wrote an assembler for it a little while ago: https://github.com/pepaslabs/lc3as.py

I’ll definitely be digging into this article!


Fascinating topic. I've always wondered how someone can make the jump from understanding a VM like the one in OP's post to something like vmprotect [0], which is orders of magnitude more complex and designed to be as hard to reverse as possible.

[0]: https://back.engineering/17/05/2021/


This looks like a fun thing to mess with and lose sleep over while I play with it! Thank you!


Next step is to compile brainfuck to this.


It is a virtual machine in the sense of the JVM (rather than, say, a VM image running on KVM).

I'd call it an interpreter/emulator for a simple instruction set and hardware interface.


Nice. Note that I think there is a typo in the embedded link in "The repo can be found here", there is an extra comma "," in the URL.


This is really nice. Great job. It would be interesting to see how something like concurrency could be added.


I really enjoyed reading this! Thanks for taking the time to write it up!


so this is what people mean by a blog! very nice



