Hacker News new | past | comments | ask | show | jobs | submit login
Show HN: A small virtual machine written in C (github.com/tekknolagi)
212 points by tekknolagi on Aug 4, 2014 | hide | past | favorite | 91 comments



Seems like [1] could benefit from use of the X macro [2], should make adding new instructions much easier and you can avoid the hassle of having to keep two separate tables in sync. There's probably quite a few places where the code could be made clearer by using it. Also in the implementation of your functions, there's a hell of a lot of repetition in all the binary operations, another macro which you pass in the operation and the function name would make life earier:

    #define binop(NAME, OP) definstr (NAME) { \
        long long b, a; \
        if (carp_stack_pop(&m->stack, &b) == -1)\
            carp_vm_err(m, CARP_STACK_EMPTY);\
        if (carp_stack_pop(&m->stack, &a) == -1)\
            carp_vm_err(m, CARP_STACK_EMPTY);\
        carp_stack_push(&m->stack, a OP b);}
    
    binop(ADD,+)
    binop(MUL,*)
    ...
The repetition of

    if (carp_stack_pop(&m->stack, &a) == -1)
        carp_vm_err(m, CARP_STACK_EMPTY);
seems like a good place to just use a function to encapsulate all the error checking and handling.

[1] https://github.com/tekknolagi/carp/blob/master/src/carp_inst...

[2] http://www.embedded.com/design/programming-languages-and-too... and the following parts


This is a technique I first came across in the code for lcc, as described in David R. Hanson's book "A Retargetable C Compiler: Design and Implementation".

https://sites.google.com/site/lccretargetablecompiler/

The code in that book feels pretty dated in a lot of places, but this was a neat trick. Every piece of information related to a token is defined in one place:

https://github.com/drh/lcc/blob/master/src/token.h

Then here's how it's used:

https://github.com/drh/lcc/blob/master/src/expr.c#L8

https://github.com/drh/lcc/blob/master/src/output.c#L107


Definitely interesting — I shall take a look! Feel free to make a pull request if you're game.


If I had time and a job where I got to code I would, but it'll be a few days before I have time.


Quick question. Is that code you've given ok to use in the project? I just realized that I added and pushed without asking.

GPL?


The binop stuff you can certainly use freely, the X macro stuff is quite well known so I guess that it's also fine to use. I certainly won't be claiming any copyright on the code I shared =)

My personal preference for licensing is one of the BSD or MIT licenses, I'd rather see things of mine used to makie the world a better place than force people to share what they've done with it.


Thanks :) Having some weird pasting issues with a macro for POP expansion. http://i.imgur.com/4HiltgI.png

Edit: Just kidding. That was dumb.


Btw, I remember there was a law stating that the code less than 10(?) lines is not copyright-able, no?


Wrong -- Google lost the first of the Android/Java battles to Oracle for 9 lines of "copy/pasted" range check code.[1][2][3]

[1] https://news.ycombinator.com/item?id=3951575

[2] https://news.ycombinator.com/item?id=3951480

[3] http://www.theverge.com/android/2012/5/9/3010404/judge-denie...


My first reaction was to say "it should be based on the number of CPU/GPU/*PU instructions". But then, there are things like this: http://www.norvig.com/spell-correct.html. I wonder how much actual new-ness you can pack into, say, 100 instructions on a modern CPU.


That would be a a silly rule, for better or worse 10 lines of Haskell might perform the computation of 500 line of Java (even if they took you the same amount of time to write.. heh).


Well, maybe they made this rule in days when COBOL was in hype? ;)


Or, say, Befunge...


Ah, I feel you — no problem! I do really love the binop macro.


The implementation is really clean! The mixing of general-purpose registers, special-purpose registers and the stack makes the instruction set a bit weird, though.

For example, instead of using registers, why not have OR pick values off the stack like ADD, or vice versa? Why use EAX for the conditional jump instructions when you could look at the top of the stack? Why have REM when you can just MOV from ERX?


You know, I was thinking about that this morning actually. All great suggestions! If you could make an issue, that would be fantastic.


I just changed OR/XOR/AND to be stack based, and good idea for the jump instructions.


The code looks really clean to me.

If you feel the urge to convert the VM to a JIT, rather than calling function pointers for each instruction, you might find this blog post useful: http://blog.reverberate.org/2012/12/hello-jit-world-joy-of-s...


Thanks! The goal was readability and ease of use, so that's probably why it's slow :P

Not sure I understand. What would it do differently?

Edit: Hang on, so I should generate C or asm? or something?


I think that the method you use is "Token Threaded Code" [1]. It's one good way of implementing a VM.

I only added the blog link in case you were interested in extending the VM to use Just In Time compiling. Using a JIT would mean generating assembly.

I like your VM - there is no need to change it. Thanks for sharing.

[1] http://realityforge.org/code/virtual-machines/2011/05/19/int...


Thanks for the info! :)


You'd essentially be generating machine code on the fly and letting the cpu execute it, instead of interpreting anything yourself.


Interesting. Is that significantly harder?


It does make it more complex since you now have another layer to deal with. I wouldn't roll by own JIT compiler though, otherwise it would definitely be a crap ton harder. Using something like libjit or llvm's JIT compiler are two good options.

It's nice having a normal, interpreted VM then building a JIT compiler later on for extra speed. A recent Rust project called `js.rs`[0] started out with a simple interpreter and swapped it for a JIT compiler using libjit.

The biggest difference being you now have to interface with the actual CPU and such.

[0] https://github.com/TomBebbington/js.rs


if you're already writing in C, no. Just write some functions to convert those instructions into asm, and write an assembler (or use one of the many out there) to output them to bytes. Then use mprotect(), tadah!


Interesting.


I have been working on this project on and off for about 6 months. I would love feedback on the code, design - whatever!


Very concise implementation. I keep thinking there's missing files or at least I'm missing something in my understanding. :)

I wrote a 0-operand VM in C a few years ago that used a lot of the same concepts but considerably more code, like 10x at least. I will learn a lot from this.


Unfortunately, with the ever-increasing processor-memory gap (the growing discrepancy between processor and memory speeds, also called Memory Wall, if I'm not mistaken), 0-operand, or stack VM's are seeing ever-increasing performance penalties.


What kinds of files look missing? I'm curious.

What is a 0 operand VM? I'd love to see it!


Just for anyone looking for more information: http://en.wikipedia.org/wiki/Instruction_set#Number_of_opera...


Awesome!


Using only PUSH and POP I think. i.e a stack machine.

also, OP, your code looks great.


Yeah, a stack machine. https://github.com/dsturnbull/stack_cpu

What I mean by missing files is that it's so clean that at first glance it didn't look like the whole deal :)


Oh well thanks — that's quite a compliment.

This is pretty cool!


Oh interesting. That sounds... complicated.

Thanks! It's my first big C project that isn't hideous :D


Is there a reason for the combination of register-based (logical ops) and stack-based (arithmetic ops) operations?

Otherwise fun project.


Probably just my sloppiness. Things like OR were implemented while looking at other projects like tinyvm, so it's not perfectly consistent.


Nice. Any particular tutorial/article/book you're following?


Perhaps, if you're interested in learning to write well-written C code, you might be interested in the book "C Programming: A Modern Approach"[0] by King. Coding style is an important topic in his book.

[0]: http://www.amazon.com/Programming-Modern-Approach-2nd-Editio...


Thank you, seems very interesting. Usually, I only write pure C code when I really have to. Otherwise, I see no reason why someone should restrict "shimself" from using C++.


Nope :) Lots of small web resources.


As inspiration, I would suggest to look at the virtual machine (byte code interpreter) of the Lua language. You can find several papers describing the design at http://www.lua.org/docs.html. The code base is also very small and clean.


Awesome!


Can you give some directions for how to run it?

  $ git clone https://github.com/tekknolagi/carp
  $ make
  $ ./carp
  $ ./carp --help  # after peeking in carp.c
  help msg
There's a tantalizing target called 'tests' in the Makefile, but it doesn't work.

There's some examples, but no makefile to build them. How do you run them?

I see you used to have some .carp files but you just deleted them: https://github.com/tekknolagi/carp/commit/fa16eeb443

I tried restoring one, but it doesn't work:

  $ git checkout fa16eeb443~1 examples/reg.carp
  $ ./carp -f examples/reg.carp
  Unknown label <add>
So at this point I'm ready to give up..


Hi there! Sorry for this. I actually have a help message but for some reason I forgot to commit it. It's coming soon!

You run the examples like so: carp -f examples/carp/call.carp

You can compile C files that use the Carp API, but I have not written any documentation yet.


Oh that's weird. the carp folder from examples disappeared. Gimme a sec.


A year (or two) ago, I wrote a simple virtual machine [1] of my own. Fun time. Interesting exercise. Wish I had courage to post it here and get feedback from the HN. Anyways, yours is much cleaner and concise code. Kudos.

[1] https://github.com/swapi/z9


Definitely post it! I've written muuuuch worse and it's still public :D

Check out http://github.com/tekknolagi/gechoshudder


Posted my reply on wrong comment (Got confused between akkartik and tekknolagi) :). Anyways, thanks for sharing. Are you planning to add memory management opcodes (Allocate memory, free memory)? I see DBS, DBG instruction, but it is like key-value store (since it uses carp_ht internally).


Umm perhaps. Could you give a use case? Not sure how/why.


Alright, try now. Also — thanks for the pull request!


You're welcome. I pulled the new commits, but I'm still having trouble running the examples:

  $ ./carp -f examples/carp/call.carp 
  Unknown label <add>
(I am very interested in building easy-to-understand codebases: http://akkartik.name/post/wart)


That's strange: http://i.imgur.com/TbF7Izc.png

Try rebuilding carp? o.O

That was some very nice documentation!


Nope, rebuilding didn't help.. never mind, I'll look into it later tonight.

Edit: I just noticed you changed the target (why?) and are also immediately deleting carp.out. So I was indeed using the stale version in spite of using 'make clean' :)

Anyways, so I did this, but still no luck.

  $ git diff
  diff --git a/Makefile b/Makefile
  index f921d96..7c55b4e 100644
  --- a/Makefile
  +++ b/Makefile
  @@ -10,7 +10,6 @@ all:
	  $(CC) $(CFLAGS) $(SRCS)
	  ar cr libcarp.a $(OBJS)
	  $(CC) src/carp.c libcarp.a -o $(PROG)
  -       make clean_objs
   
   #.PHONY: tests
   
  $ make
  gcc  -c -std=c99 -Wall -Werror -Wno-unused-variable -Wno-format-security -O3 -static   src/carp_instructions.c src/carp_lexer.c src/carp_machine.c src/carp_tokenizer.c src/lib/carp_stack.c src/lib/carp_ht.c
  ar cr libcarp.a *.o
  gcc  src/carp.c libcarp.a -o carp.out
  $ ./carp.out -f reg.carp 
  Unknown label <add>


Yeah I realized that I had changed PROG to carp.out but not changed clean_objs. Already pushed a fix!


So strange.


In the process of debugging it I noticed that you're compiling but not linking in c99 mode. When I fix that the problem seems to go away:

  $ git diff
  diff --git a/Makefile b/Makefile
  index 1263dd8..6592c59 100644
  --- a/Makefile
  +++ b/Makefile
  @@ -1,16 +1,16 @@
   CC = gcc #/usr/local/bin/gcc-4.2
   PREFIX = /usr/local
   NDEBUG ?= 
  -CFLAGS = -c -std=c99 -Wall -Werror -Wno-unused-variable -Wno-format-security -O3 -static $(NDEBUG) 
  +CFLAGS = -std=c99 -Wall -Werror -Wno-unused-variable -Wno-format-security -O3 -static $(NDEBUG) 
   SRCS = src/carp_instructions.c src/carp_lexer.c src/carp_machine.c src/carp_tokenizer.c src/lib/carp_stack.c src/lib/carp_ht.c
   #$(wildcard src/*.c src/lib/*.c)
   OBJS = *.o
   PROG = carp.out
   
   all:
  -       $(CC) $(CFLAGS) $(SRCS)
  +       $(CC) -c $(CFLAGS) $(SRCS)
	  ar cr libcarp.a $(OBJS)
  -       $(CC) src/carp.c libcarp.a -o $(PROG)
  +       $(CC) $(CFLAGS) src/carp.c libcarp.a -o $(PROG)
   
   #.PHONY: tests
Still, I'm concerned that you might be triggering some sort of undefined behavior, which means it might work for you but not on a slightly different machine or compiler version. I tried printing out token lexemes in carp_run_program, right after the call to tokenize(), and some of the tokens printed binary garbage, suggesting that they might be missing a terminating null, or worse. Does this look right?

  $ git diff
  ...
  diff --git a/src/carp.c b/src/carp.c
  index 9268d98..19e16d9 100644
  --- a/src/carp.c
  +++ b/src/carp.c
  @@ -88,6 +88,8 @@ void carp_print_conditions () {
   
   void carp_run_program (char *fn) {
     carp_tok *tokens = carp_lex_tokenize(fn);
  +  for (carp_tok* tt = tokens; tt; tt=tt->next)
  +    printf("%s %d\n", tt->lexeme, tt->type);
     if (tokens == NULL) {
       fprintf(stderr, "Something went wrong with tokenization.\n");
       exit(1);

  $ make && ./carp.out -f ./examples/carp/call.carp
  gcc  -c -std=c99 -Wall -Werror -Wno-unused-variable -Wno-format-security -O3 -static   src/carp_instructions.c src/carp_lexer.c src/carp_machine.c src/carp_tokenizer.c src/lib/carp_stack.c src/lib/carp_ht.c
  ar cr libcarp.a *.o
  gcc  -std=c99 -Wall -Werror -Wno-unused-variable -Wno-format-security -O3 -static   src/carp.c libcarp.a -o carp.out
  add 3
  gload 6
  -5l 1
  gload 6
  -4l 1
  add 6
  ret 6
  main 3
  push 6
  7�l 1
  push 6
  9�l 1
  call 6
  add 4
  2 1
  ptop 6
  halt 6
  0 1
  16
The trouble with C/C++ is that it's so easy to end up with undefined behavior :/


Actually I can't link with the same CFLAGS; I get an error about missing lcrt0.o or something of the sort.

Also, yeah your code looks fine; I added it in and it worked fine. That's unfortunate. I'll debug soon.


What machine/os are you on?

(Perhaps we should take this offline. My email is in my profile.)


Sent an email :)


>The goal is to try and build a small (and decently reliable) VM from the ground up, learning more and more C as I go.

The author explicitly mentions, he is in the process of learning C while he's building something using it.

This is interesting because the folks at [osdev.org](http://wiki.osdev.org) keep stressing that you must be a god-level expert in C, before you even think of getting into systems programming.


I think the guys at osdev mean you need to be good at C before doing systems programming at any form of a professional level. If you don't intend for your code to actually do anything (I mean this is just a side project or something, not for somebody to use in production) than you can really do whatever you want.


Definitely a side project.


I don't know about other people, but there's no way I'd be able to become an expert in a language without building things with it.


This is neat. For others interested in this sort of thing, https://challenge.synacor.com/ specifies a virtual machine, and comes with a binary that performs a self-test -- it is perhaps a neat way to get started.


Cool!


Very cool!

Georgia Tech offers a class where you code a VM for the LC3b instruction set (http://users.ece.gatech.edu/~moin/s13a/hw.html).

It was a great learning experience, and pretty fun too.


Ah, that's neat! I am not headed off to Georgia Tech though :)


Can anyone link me to some literature on VMs? Sure, I can analyse the code, but from a design perspective, I would love to learn more about how VMs work, where they can be used, etc etc.


So I started off following this (http://en.wikibooks.org/wiki/Creating_a_Virtual_Machine/Regi...) which was interesting but clearly never super influential in my design (except in earlier stages, but that died).

Then I read this: http://stackoverflow.com/questions/2034422/tutorial-resource...

and this: http://courses.cms.caltech.edu/cs11/material/c/mike/lab8/lab...

and took a look at the MSP430's instruction set.

As far as where they can be used... anywhere? I don't know the answer to your question - sorry :)


"and took a look at the MSP430's instruction set."

I wonder in a general sense if there are many VMs that explicitly copy good older CPUs. I've always thought the 6809 would make an awesome VM. After all, it was pretty awesome to code on in the real world. Or a VM based on the classic 1802, 6502, or Z80 with some minor mods.

If you really want to warp peoples minds give them a VM based on IBM HAL assembler, basically turn Hercules and VM into a hypervisor rather than an application.

A PDP-11 inspired/based virtual machine. Hmm.


CISC would be better than RISC because one should keep in mind that for many operation most of the execution time is VM overhead.

But if you want performance you'd rather tailor your instruction set to the language it will execute (assuming you're not making a general purpose VM). CPU makers like Intel do actually keep a close look on what programs are doing in order to evolve their instruction set.


Thanks for all those links!



Out of curiosity - How are you testing your VM? meaning, how do you know it actually works correctly? I didn't see any tests in the github repository


I am not testing — and I know this is horrifying. I need to write some tests. D:


I am now starting to write some tests with libtap. If you are game, come help out! :D


good stuff.

the power of function pointers for stuff like this is quite enormous. making a vm like this is great practice for a real compiler too... :)

i'm curious though, why it isn't a direct copy of x86? other than the obvious 'thats really complicated' it would save a lot of wheel reinventing...


I started off following MSP430 but just kind of wanted to make my own thing.


"NOP (): Does nothing. Seriously." - I love this :)


Me too :D


Good job man, code looks tight, wish I had some more time to jump on this. Keep it up.

EDIT: Btw, do you happen to have a book or some study material on this subject ? Anything you can recommend ?


Whenever you get a chance just send a stray pull request :)

I don't - I wasn't really following anything unified. I just pulled together a bunch of crappy web resources :)


[deleted]


Pardon?


I mean it's a big story, if you can do part of job of VMWare


This is a totally different type of VM than VMWare makes.

This is what's called a process virtual machine, and they make application virtual machines (I think).

Basically, this acts as a small CPU-type-thing and interprets instructions. Theirs emulate whole operating systems.


Sound more like container


It's similar to the Java Virtual Machine (JVM) or the .NET CLR VM if you're familiar with either of those.


Mmm nope. Nothing about kernels in here.


Think VM as in JVM.




Consider applying for YC's Fall 2025 batch! Applications are open till Aug 4

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: