Hacker News
How to write a toy JVM (zserge.com)
335 points by oftenwrong on June 2, 2020 | 86 comments



On the topic of toy JVMs, my Metascala project is a JVM implemented in ~4000 lines of Scala that is complete enough to interpret itself!

- https://github.com/lihaoyi/Metascala

Interpreting code in Metascala is about 100x slower than just running it, and interpreting code in Metascala interpreted by Metascala is about 10,000x slower than just running it. Not going to win any performance benchmarks, but it's a cool demonstration of how a JVM works.

All the runtime data structures, memory allocation and garbage collection, method dispatch logic, stack trace management, exceptions, inheritance, object layouts, etc. are all implemented in a relatively small amount of relatively simple code.

For example, here is the implementation of the heap, which allocates the VM's objects inside a big byte array and has a simple copying semispace garbage collector to clean them up:

- https://github.com/lihaoyi/Metascala/blob/master/src/main/sc...
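For a flavor of the technique (a sketch of the idea only, not Metascala's actual code), here is a toy semispace heap in Go: objects live in a flat slice, allocation is a bump pointer, and collection evacuates reachable objects into the other space Cheney-style.

```go
package main

import "fmt"

// Toy semispace heap. Each object is a header word (field count)
// followed by its fields. A field >= 0 is a pointer (an offset into
// the heap); -1 means null.
type Heap struct {
	from, to []int
	free     int // bump-allocation cursor into from-space
}

func NewHeap(size int) *Heap {
	return &Heap{from: make([]int, size), to: make([]int, size)}
}

// Alloc reserves an object with n (initially null) fields.
func (h *Heap) Alloc(n int) int {
	addr := h.free
	h.from[addr] = n
	for i := 1; i <= n; i++ {
		h.from[addr+i] = -1
	}
	h.free += n + 1
	return addr
}

// Collect evacuates everything reachable from roots into to-space
// (Cheney's algorithm), swaps the spaces, and returns updated roots.
func (h *Heap) Collect(roots []int) []int {
	forward := map[int]int{} // old address -> new address
	free := 0
	evacuate := func(addr int) int {
		if addr < 0 {
			return addr // null, nothing to copy
		}
		if f, ok := forward[addr]; ok {
			return f // already moved
		}
		n := h.from[addr]
		newAddr := free
		free += n + 1
		forward[addr] = newAddr
		copy(h.to[newAddr:newAddr+n+1], h.from[addr:addr+n+1])
		return newAddr
	}
	newRoots := make([]int, len(roots))
	for i, r := range roots {
		newRoots[i] = evacuate(r)
	}
	// Scan copied objects, evacuating whatever their fields point at.
	for scan := 0; scan < free; {
		n := h.to[scan]
		for i := 1; i <= n; i++ {
			h.to[scan+i] = evacuate(h.to[scan+i])
		}
		scan += n + 1
	}
	h.from, h.to = h.to, h.from
	h.free = free
	return newRoots
}

func main() {
	h := NewHeap(64)
	a := h.Alloc(2)
	b := h.Alloc(1)
	h.from[a+1] = b // a points to b
	h.Alloc(3)      // unreachable garbage
	roots := h.Collect([]int{a})
	fmt.Println(roots, h.free) // prints [0] 5: live objects compacted, garbage reclaimed
}
```

A real JVM heap adds object headers, class pointers, and finalization on top of this, but the core "copy the live graph, forget the rest" mechanic is the same.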


And here's a talk by Ben Evans about implementing a JVM in Rust.

https://www.youtube.com/watch?v=7ECbwgkHdAE


This is awesome! I have been working on an interpreter for a toy language, and my next step is to study the JVM. I'll definitely check your project out.


As always, I'm impressed with the amount of work you do. Have you ever talked about how you manage to be so productive?


I like the commented-out print statements all over the place. It makes the code seem alive somehow.


Most of the performance overhead is probably the print messages, as those are likely synchronous.


> and interpreting code in Metascala interpreted by Metascala

It's JVMs all the way down


Implementing a VM is a unique experience. You take some bytecode that's initially a meaningless blob, sketch out an execution environment, and start implementing opcodes, one by one, and the blob actually starts doing real things.

Back at uni, I did this with the Z-machine [1], in (non-idiomatic, newbie) Haskell [2], using Zork 1 as the blob. Sixteen years later, I remember the elation when my interpreter first printed out the familiar message:

You are standing in an open field west of a white house, with a boarded front door.

[1]: https://en.wikipedia.org/wiki/Z-machine

[2]: https://github.com/nathell/haze
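That moment when the blob starts doing things takes remarkably little machinery. A minimal sketch in Go (with made-up opcodes, not the Z-machine's): a fetch-decode-execute loop over a byte slice.

```go
package main

import "fmt"

// The core of any bytecode VM: a fetch-decode-execute loop.
// These opcodes are invented for the sketch.
const (
	OpPush = iota // push the next byte as a value
	OpAdd         // pop two values, push their sum
	OpHalt        // stop and return the stack
)

func run(code []byte) []int {
	var stack []int
	for pc := 0; ; {
		switch code[pc] {
		case OpPush:
			stack = append(stack, int(code[pc+1]))
			pc += 2
		case OpAdd:
			a, b := stack[len(stack)-2], stack[len(stack)-1]
			stack = append(stack[:len(stack)-2], a+b)
			pc++
		case OpHalt:
			return stack
		}
	}
}

func main() {
	// An opaque "blob" of bytes that turns out to compute 2 + 40.
	fmt.Println(run([]byte{OpPush, 2, OpPush, 40, OpAdd, OpHalt})) // prints [42]
}
```

Everything else in a real VM (the heap, call frames, exceptions) hangs off this loop, one opcode at a time.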


It truly is unique. I wrote a chip8 [1] interpreter shortly after finishing my first year of university, when I was still a very novice programmer.

My implementation was poor even by novice standards; I knew there was a lot of spaghetti, but I didn't mind. Implementing each opcode was like beating a level in a video game - and it worked, not perfectly, and with some unsolved bugs. Tetris worked flawlessly, though.

[1] https://en.wikipedia.org/wiki/Chip-8


I did this with the Z80 machine code variant that runs in the Game Boy. I defined my goal as: run the boot ROM until it jumps to the game. The boot ROM makes sound and moves a logo on the screen. Not the first person to do this and not the last. But interesting projects you can do after you get one going include studying why Android doesn't run JVM bytecode, or coming up with a better bytecode for Java/Scala.


The opening line confuses me... The JVM is one of the fastest, best-established, best-documented, most widely deployed platforms in the world. Hundreds of languages run on it, quite quickly.


Opening:

"Whether we like it or not, but Java is one of the most widely used programming languages. However, since most of the applications in Java are either too boring or too complex - not every Java developer has enough curiosity to look under the hood and see how JVM works."

I don't think the author contradicted what you just said. I guess you may be confused by the "whether we like it or not" part? I feel like the author is commenting on Java the language (vs the JVM).


I don't think English is his first language. Go easy.


Tone is hard to communicate over the internet. I wasn't attacking him, apologies, my comment was purely critiquing the statement itself.


I bought the JVM Specification book some years ago. It was fun holiday reading (seriously), seeing how the bytecode was put together, how try-catch blocks really work, etc. It's quite readable as general interest, if you're into that kind of thing.

I don't think it's ever been of much actual use to me in programming, but it was nice-to-have background knowledge.


You mean like this book? https://docs.oracle.com/javase/specs/jvms/se8/jvms8.pdf

I'm very tempted to pick a hard copy up!


I would advise the up-to-date version, though.

https://docs.oracle.com/javase/specs/jvms/se14/html/


Can't seem to find a dead-tree version of this one :-(


Yeah, I'm quite interested in having a dead tree version to read. I guess I could send the PDF off to a book maker - not entirely sure of the legalities though...


I see, yes, that might be an issue.


Yes, you can get it in dead-tree.


I fell down the rabbit hole trying to answer "what is the smallest JVM that could be implemented".

I came across TinyVM [1], made for some Lego system, which uses about 10 kB of RAM. I was wondering about porting it to other microcontrollers...

[1] http://tinyvm.sourceforge.net/


So why CAFEBABE? Is it a random hex value that just happens to look like words, or was it chosen intentionally?

Edit: found it: https://dzone.com/articles/the-magic-word-in-java-cafebabe


I never thought of CAFE as a hex word. I always knew about DEADBEEF and so on.

Now I feel that there should be a website that lists all kinds of fun hex words.

Like CAFEFACE



We don't really need a website to do it:

grep -i '^[0-9a-f]*$' /usr/share/dict/words


I'd say you might want to grep for [0134567] too (being charitable) and maybe use sed's y command to apply the l33t equivalences:

0 - o

1 - I

3 - E (redundant)

4 - A (redundant)

5 - S

6 - G

7 - T
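A rough sketch of that idea in Go (the substitution table and the word-list path are assumptions; adjust both for your system):

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// Map l33t-ish letters to digits, then keep words whose mapped form
// consists entirely of hex digits.
var leet = strings.NewReplacer(
	"o", "0", "i", "1", "l", "1", "e", "3",
	"a", "4", "s", "5", "g", "6", "t", "7",
)

func isHex(s string) bool {
	if s == "" {
		return false
	}
	for _, c := range s {
		if !strings.ContainsRune("0123456789abcdef", c) {
			return false
		}
	}
	return true
}

func main() {
	f, err := os.Open("/usr/share/dict/words") // assumed location
	if err != nil {
		fmt.Println(err)
		return
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		word := strings.ToLower(sc.Text())
		if hex := leet.Replace(word); isHex(hex) && len(word) >= 4 {
			fmt.Printf("%s -> %s\n", word, hex)
		}
	}
}
```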


Apple uses a bunch of hexspeak for error codes in iOS, including 8BADF00D ("ate bad food"), C00010FF ("cool off" - related to thermal events), and DEAD10CC ("deadlocc" - deadlock).

There's also DEFEC8ED ("defecated"), which was used by OpenSolaris for core dumps.


IIRC Facebook uses IPv6 addresses that have face:b00c in them


CAFED00D was used as the magic number for the pack200 compressed class file format.


To anyone writing an article like this: Please mention what language you are using before your first code listing, not four paragraphs after.


I didn't feel I needed to know Go to read the example code.

(Plus, didn't "if err != nil" give it away?)


Fun project (it's probably already been done): an "in which languages is this valid syntax" machine. On its own, `if err != nil` could parse as Rust or Python.


For me it was when I saw the strange structure definition and did a double take.


This was a great read! Do you plan to continue to develop this? I would love to see more posts if so. I actually think this would be a great book as well. Cheers.


Yes, uncover the magic, good job!


Outside of the specs for the JVM itself, has anybody studied from alternate resources to learn more about the JVM? Like any specific videos or illustrative resources? I love learning about the internals of languages, but sometimes you need a second person to explain things, or some visuals, to really catch it.


This is not a comprehensive resource, but it provides 25 bite-sized essays about different aspects of the JVM:

https://shipilev.net/jvm/anatomy-quarks/


Sure, plenty of stuff.

Occasionally there are such talks at JavaOne (now Code One), Voxxed, and NDC.

JVM Languages Summit also has such talks.

http://openjdk.java.net/projects/mlvm/jvmlangsummit/

Talks from Cliff Click or Gil Tene.

Then you can have a look at implementations like JikesRVM (one of the first ones implemented in Java), OpenJ9 (open source variant of IBM's J9).

https://www.jikesrvm.org/

https://www.eclipse.org/openj9/

There is plenty of other stuff, but maybe this allows you to get going.


Pretty much anything Aleksey Shipilev has ever done, tbh. Here's an example - https://shipilev.net/blog/2014/jmm-pragmatics/


From your link to Shipilev’s talk comes this definition of ‘nasal demons’:

http://www.catb.org/jargon/html/N/nasal-demons.html

In there is a link to the thread that originated the term from 1992:

http://groups.google.com/groups?hl=en&selm=10195%40ksr.com

If you scroll to the bottom you’ll find possibly the most lost soul on the Internet reviving the most dead thread ever.


While not directly a resource, I found writing agents for the JVM (java.lang.instrument agents and JVMTI agents) to be quite enlightening and rewarding. Depending on what you want to do you'll have to deal with bytecode transformation/instrumentation, JVM events, JNI and many other things. For example, have you ever thought about how a java debugger (JPDA, JDI, JDWP) works or a tool such as OverOps (Takipi)?


Include some benchmarks? :) I've always wondered how production/robust JVM implementations make themselves "faster" after they warm up.


> I've always wondered how production/robust JVM implementations make themselves "faster" after they warm up.

They compile the bytecode just-in-time to native machine code, using many of the same techniques as a conventional native-code compiler.


Maybe my knowledge isn't up to date, but I always understood (since Hotspot anyway) that not all bytecode is necessarily JIT'd to native code.

From the Hotspot Wikipedia page:

"Both VMs compile only often-run methods, using a configurable invocation-count threshold to decide which methods to compile."[1]

Also see very old discussions at StackOverflow[2][3]. Then of course there are compilers (eg. gcj) which compile to native up-front.

[1] https://en.wikipedia.org/wiki/HotSpot#Features

[2] https://stackoverflow.com/questions/7100365/why-doesnt-javas...

[3] https://stackoverflow.com/questions/16568253/difference-betw...


Only hot code paths get compiled. That's after 10,000 executions on a "server" JVM and 1,000 executions on a "client" JVM by default.

This is by design, and if you need everything compiled right away you can set the compilation threshold to 1.

I don't see any value in compiling parts of code that only gets executed during bootstrap.
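Those thresholds (and what's actually getting compiled) can be observed directly. A sketch, assuming a HotSpot JVM and a hypothetical MyApp class; exact flags vary by JVM version:

```shell
# Log methods as the JIT compiles them while the program warms up.
java -XX:+PrintCompilation MyApp

# Lower the invocation-count threshold so compilation kicks in sooner.
# (CompileThreshold only takes effect with tiered compilation off.)
java -XX:-TieredCompilation -XX:CompileThreshold=100 MyApp
```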


Nitpick: the HotSpot 'client' JVM's default JIT threshold is 1500, not 1000.

https://www.oracle.com/java/technologies/javase/vmoptions-js...


You're right, thank you for correcting me. For more popular archs (x86, arm, aarch64) it's 1500. It's only 1000 by default for ppc, s390 and sparc.

https://chriswhocodes.com/hotspot_options_jdk14.html


So we're both right! Didn't know it varied between architectures.


> I don't see any value in compiling parts of code that only gets executed during bootstrap.

Not disagreeing with you there, since stopping to compile code/optimize at runtime contributes to sluggish interactive performance.


Compilation is performed concurrently to the application in specialised compilation threads. Only OSR (on-stack-replacement) requires stopping execution.


> Not disagreeing with you there, since stopping to compile code/optimize at runtime contributes to sluggish interactive performance.

No JVM I am aware of stops to compile - the compiler runs on a background thread while the application continues to run as normal.


> I always understood (since Hotspot anyway) that not all bytecode is necessarily JIT'd to native code

Who are you disagreeing with? I didn’t say that’s the only way they execute bytecode, did I?

The question was how it’s executed ‘after they warm up’.

> Then of course there are compilers (eg. gcj) which compile to native up-front.

But I was replying to someone who asked specifically about how ‘production JVMs’ do it, not how discontinued compilers do it.


> Who are you disagreeing with? I didn’t say that’s the only way they execute bytecode, did I?

Nope, my bad. Thanks for clarifying.


> what’s missing

The other two hundred instructions, the runtime, OOP type system, and a few other things.

Also bytecode verifier


> There are 11 groups of instructions [missing] and most of them are trivial:

> * Conversions (int to short, int to float, …).

float to string should be the most trivial of all


I always liked writing VMs

But we have to agree that the JVM feels like a steam engine running in the age of electric motors. With virtualisation available cheaply at every level (hardware, arch, OS, and Docker), virtualisation at runtime feels like overhead.

The JVM was originally created for the purpose of 'write once, run anywhere', which I think can be addressed in alternative ways - look at Golang.


"Steam engine" isn't what I'd call the JVM; it's still a remarkable piece of software driving hundreds of thousands of business apps, with a strong focus on long-term maintenance, excellent mindshare, and very good performance for what it does (though in somewhat of a stasis with the Java 9+ deprecations and Spring-centric development).

However, I get what you mean: the JVM was originally intended as a portable runtime for set-top boxes at a time when there were many ISAs around (MIPS, PPC, etc.), not just x86 and ARM like today. I believe Java is still a mandatory part of Blu-ray disc players. It was also at one time a candidate for running in the browser (also reflected in the Java/JavaScript naming). Incidentally, Java was rejected in the browser by browser vendors, and JavaScript became "more like Java" instead; it has followed a similar path from being a browser language to being used also on the server side (something that Netscape started already in 1996 or so).

Edit: I also want to mention that Java was, IMO, the one tech that saved the scene from being a Microsoft-only world, and it significantly paved the way for today's Linux dominance on the server side. Java was picked by many devs because it helped keep the door open for migrating to Solaris or Linux in an increasingly MS-dominated landscape in the mid-'90s.


I agree with all that about the JVM. It had considerable impact on the industry, and it is a nice piece of software.

But you missed my point entirely. I did not call the JVM a steam engine in a derogatory way. On the contrary, the steam engine is an awesome piece of technology which had a tremendous impact on industry.

But the JVM has issues which recent runtimes have learnt from and solved in different ways.


> But JVM has issues which recent runtimes have learnt and solved in different ways.

Such as?


There's not much difference between Go and the JVM, except that one is compiled just in time and the other ahead of time. There have also been a few ahead-of-time compilers for Java over the years, with SubstrateVM being the latest one, for example.

From https://golang.org/doc/faq they say:

> Go does have an extensive library, called the runtime, that is part of every Go program. The runtime library implements garbage collection, concurrency, stack management, and other critical features of the Go language

I think people can argue all they want about semantics: what is a virtual machine? There is no "proper definition" for this. Ultimately, if you abstract away the details of the running environment, to me, you've created a virtual machine.

The Go FAQ continues by insinuating it isn't a virtual machine because it doesn't do just in time compilation and only ahead of time. I think that's just word play.

From https://en.m.wikipedia.org/wiki/Virtual_machine:

> A process VM, sometimes called an application virtual machine, or Managed Runtime Environment (MRE), runs as a normal application inside a host OS and supports a single process. It is created when that process is started and destroyed when it exits. Its purpose is to provide a platform-independent programming environment that abstracts away details of the underlying hardware or operating system and allows a program to execute in the same way on any platform.

Personally, I think "Managed Runtime Environment" is a better term, and Go would definitely fall into that term.

So I recognize the differences are real: just-in-time vs. ahead-of-time. But it isn't Docker, or any other layer of virtualization, which magically allows Go to run on many machines. It's because it has an extensive runtime that abstracts away their details. And this runtime has to be bundled into every compiled application. Maybe we should start talking about virtual runtimes?


> Ultimately, if you abstract away the details of the running environment, to me, you've created a virtual machine.

So then are operating systems also virtual machines? Is the C standard library a virtual machine? This seems like a pointless definition to me.

> Maybe we should start talking about virtual runtimes?

Isn't that just called a runtime? Java is "virtual" because the instructions in the Java class files don't run directly on the CPU, unlike Go where the instructions in the binary do run directly on the CPU.


If you want to understand the differences and similarities between Go and Java in any meaningful way, you've got to discuss something more than the vague idea of a virtual machine. That's all I'm really saying.

Here it seems like OP specifically wanted to say that they find ahead-of-time, statically linked compilation with cross-compilation support to be a more convenient model of code compilation and distribution.

With that in mind, we're now better equipped to compare and contrast, and discuss the pros/cons.

When we restricted ourselves to VM vs non-VM, I didn't feel like we were in this meaningful zone.

And from that angle, for example, we can see that Java offers ahead-of-time compilation, which can be statically linked as well if one wants to. One example is GraalVM native image. There were also some options before GraalVM, which I think were all commercial. That said, I've yet to see a cross-compilation offering.

The JIT approach has benefits as well: the user doesn't need to select the appropriate package for their OS. The runtime distribution is shared between all apps that use it. It's easier to offer debugging/profiling/measuring capabilities for the running application. The user can tweak certain parameters of the runtime, or even choose an alternative runtime implementation. If the runtime has any known security vulnerability, the user can upgrade it to a new secure version without waiting for a new update of the app. Better peak performance of hot code paths can in theory be achieved. Just to name a few.

Another difference is that Go has no intermediate representation, while Rust, C#, and Java, for example, do. This intermediate representation is quite useful in allowing multiple languages to reuse the Java bytecode runtime. That's not as simple in Go, since Go is a harder target for compilation.

Etc.


> Is the C standard library a virtual machine?

Kind of - ISO C is defined in terms of the C abstract machine.

> The semantic descriptions in this International Standard define a parameterized nondeterministic abstract machine. This International Standard places no requirement on the structure of conforming implementations. In particular, they need not copy or emulate the structure of the abstract machine. Rather, conforming implementations are required to emulate (only) the observable behavior of the abstract machine as explained below.

https://www.cl.cam.ac.uk/teaching/1415/CandC++/lecture10.pdf

https://phoenix.labs.vu.nl/sysprog/cabs.pdf

And not all C implementations AOT compile to native code.


You've forgotten about different processor architectures? Even today there are Intel and ARM, but at the start there was also SPARC, which Sun wanted people to use. So, just like WebAssembly, Java bytecode operates as a platform-independent instruction set. (ARM provided processors with some ability to execute bytecode directly - Jazelle - but it seems to be obsolete?)

Virtualisation is also not readily available from within userland. We're not at "each process runs in its own VM" yet. Perhaps we'd want "each browser tab runs in its own VM", which is kind of what Javascript aims to achieve.

Microsoft have also gone in this direction; they don't call the CLR a "VM" but it performs many of the same functions in order to run CIL/MSIL bytecode.


Then your feelings are wrong. No runtime environment provides anything near the JVM's combination of performance, productivity and observability. It's steam engines vs. an electric motor, alright.

Yes, there is a cost in footprint and warmup, but footprint is very often (though not always) the right cost. Of all software resources -- development, maintenance, memory, processing and bandwidth -- it's the cheapest (well, second to storage). By comparison Go is less RAM hungry but it's sluggish and opaque.


I guess you work on the JVM.

But the opinion that Go is sluggish and opaque is purely subjective. For anything but long-running server apps, Java's memory consumption makes it feel sluggish, especially on desktop.

Also, when are the value types coming?


> But the opinion that Go is sluggish and opaque is purely subjective

It is not. It's significantly slower than Java and has maybe 5% of its observability tools. Go is technologically more than a decade behind Java in compilation, GC, and observability. I'm not saying it's not good enough for certain things -- sometimes primitive and simple gets the job done, and you might not need a Ferrari to go to the grocery store -- but let's not turn an acceptable compromise into a win.

> Java's memory consumption makes it feel sluggish

Why? If you don't mind it performing as poorly as Go, you can reduce the heap size and pick a low-latency collector like ZGC, but really Java says, give me the cheapest resource, RAM, and in return you'll get something much better.

> especially on desktop

Java desktop applications consume less RAM than Electron ones, and IntelliJ is more snappy than Atom (and recently, unfortunately, vscode, too), although there are perfectly valid reasons to prefer Electron.

> Also, when are the value types coming?

I don't know. Not my area.


Clojure famously takes exactly the opposite tack: https://clojure.org/about/rationale#_languages_and_platforms


That didn't age well...


Why? I really enjoy using Clojure, and the community is vibrant.


Was referring to the "VMs are the future" part.


I don’t see how Go presents anything new in the portability front.


Like Java, I can write a Go program, compile it on my computer for 3 different OSes, ship the binaries, and I can be reasonably sure that it works on other OSes.

All of this has been possible for decades (e.g. cross-compiling C++), but Go is the first mainstream compiled-to-native language I know of that makes it easy. Thanks to this, I can run many utilities written in Go on my Windows box, written mostly by people who likely never even tested them on Windows. That's pretty amazing to me.

People just drop a Windows binary on GitHub and think "I'll get the PR if stuff doesn't work". They'd never do that if it cost them effort to cross build a Windows binary. People didn't do that before Go, and many modern OSS command line utilities / server apps only worked on unixes.


The JVM's "write once, run everywhere" doesn't just apply to any OS that happens to have a JVM available... it means any architecture. You don't have to compile one version of your program for MIPS, ARM, x86, etc. The same bytecode will run on any JVM without being touched... something C/C++ (which were popular when Java was being created) couldn't do.

> and many modern OSS command line utilities / server apps only worked on unixes

This, I think, likely has more to do with being POSIX compatible and/or oriented for headless server usage where a shell is "home" for many 'nix admins.


Sorry, I think it does mean any OS that has a JVM available. It can't run on ANY architecture without a JVM. Maybe I misunderstood what you meant?


In some kinds of deployments, the JVM is the OS.


Really? I don't understand? What do you mean?


This is a little bit off. The only architecture supported by Java is the JVM. It is a fictional CPU architecture with predefined characteristics, such as a memory model. The implementation of a JVM is a mapping from a real CPU to the JVM's requirements. So no, JVM doesn't mean 'any architecture', it means 'JVM'. It only applies to OSes where a JVM is available.


That's a bit of a nitpick.

The developer no longer cares what architecture the program will run on, it just works.

You compile to bytecode, and stop caring. The JVM becomes your only target architecture, regardless of what actual architecture the system has.

That's unlike C or pretty much any (all?) other compiled languages around in the early 90's when Java was still Oak and Gosling was just getting started.


> Like Java, I can write a Go program, compile it on my computer for 3 different OSes, ship the binaries, and I can be reasonably sure that it works on other OSes.

With Java, you compile one “binary” and it works not just on three OSes, but all of them that have a JVM available. The beauty of it is that it’s as portable as an interpreted language, no cross-compilation needed at all.


Go is nice, but when you have to deal with some of the real life architectures like as400 or AIX, it silently retreats to the bushes.


I heard about cross-compilation while learning Go. Go surely made it mainstream and easy. Maybe that's the reason so many CLIs are being built in Go. Before that, Python was mostly used for CLIs, but Windows does not have Python installed by default.


I would like to see how that Windows binary compiled on Linux does WMI.


Sure, and I'd be confused if your favorite Go iptables frontend worked on Windows.

But there are plenty of use cases for which the Go standard library offers a sufficiently good abstraction that cross-compiling just works. The same has held for JVM apps for ages (and for Node.js, etc.), but not, e.g., for C(++).

As a Windows dev, I've very much noticed an increase in good CLI tools that just work on my box since Go arrived on the scene. I think that's cool.


> With virtualisation available cheaply in every level (hardware, arch, OS and docker) virtualisation at runtime feels like overhead.

These seem to be solving fundamentally different problems. Where do you see the overlap that might be able to be moved out of the jvm?



