Hacker News
Learn how to write an emulator (emulator101.com)
487 points by dreampeppers99 6 months ago | 60 comments

The Space Invaders (Midway 8080) platform described here is really easy to emulate; it's just a 1-bit frame buffer hooked up to a CPU. Subtracting the CPU emulator, my implementation on http://8bitworkshop.com/ takes less than 200 lines of code. One of the easiest with which to get started.

That's a real gem!

I also started out on the Apple ][, writing my own versions of the games I saw in the arcades (Donkey Kong, Pac-Man, Joust, etc.).

My kids are old enough to be curious about how games are actually built, and we've had fun building an adventure game in BASIC, and a version of Asteroids in C. We had planned to build something this summer, so this book of yours is an ideal next step which we can work through. I think they'll really get a kick out of seeing how it all goes together. (Okay, okay, I admit it. Nostalgia hit me hard and I just want an excuse to fire up my old Apple ][ again.) Regardless, I just hit "order" on Amazon.

Hope you like the books! I'll have some basic Apple ][ emulation available on the site soon.

Similar with a system I put together for emulating CP/M recently. Take away the Z80 emulator and you really only need to emulate a few bits of hardware for handling console I/O, selecting disk drive, track and sector and then reading or writing the sector. At that point you can run simple CP/M commands, assemblers, C compilers, BASIC interpreters etc.

Add VT100/ANSI terminal emulation and you can run Wordstar, Rogue, dBase 2 etc.

> Subtracting the CPU emulator

If you're subtracting the CPU emulator what is the point of doing any of this?

I think the parent's point was meant to illustrate that there is very little boilerplate necessary beyond the CPU emulation (which is, arguably, the fun bit).

On more modern systems, the CPU is only a tiny part of the equation. You might have to emulate the graphics pipeline, the audio processor, the input mechanism, etc. This makes the process of writing an emulator more tedious, as those components are nowhere near as straightforward to implement, and may be badly documented.

If you look at the really modern platforms (like the Switch or 3DS), CPU emulation is barely 20% of the work.

You've got it ... even the Atari 2600 has a graphics pipeline almost as complex as the CPU. The Space Invaders/Midway 8080 platform is rudimentary by comparison, maybe owing to its pedigree as the first arcade game platform (Midway's "Gun Fight" in 1975 had virtually identical hardware).

Wow, that is really neat!

A long time ago, I wrote a 68000 emulator in C to replace MAME's x86 assembler core. After months and months of poring through my old Amiga programming books and the Motorola manual, I still remember clearly that moment where I fired up the emulator, and actually SAW the title screen of Rastan Saga come up after the game booted from my core. Truly a magical experience...

After that, I doubled down and wrote a g65816 emulator for it (that's the CPU the SNES and Apple IIgs used). Emulator writing is almost a zen-like experience.

If you happen to remember, which of those Amiga books was the most useful to you?

At the moment (well, for years really) I have been in a 6502 binge, but want to get into m68k programming as well (and the Amiga ecosystem specifically).

>> Want to really learn how a CPU works? Writing an emulator is the best way to learn about it.

The best way to learn how a CPU works would be to write a CPU! I.e., write HDL code for it and test in a hardware simulator.

I have designed a miniature application-specific processor, a compiler* and IDE for it, a disassembler, and an emulator too.

* To be more precise, I hacked together a compiler for it using the C# .NET compiler.

Indeed. Software emulation is fun and all but real hardware works differently.

Yeah, for sure, especially for more modern CPU features like caches and OOO that are very difficult to emulate in software.

But you can't share hardware designs (e.g., FPGA bitstreams) like you can software binaries or software source code. A hardware design will require a specific FPGA, board, and probably even a particular version of the synthesis tools if you want to build from source.

You can still share the source code. Just requires some assembly then.

I think that would be far more difficult than writing an emulator.

You'd be surprised how easy it is to create a simple processor once you understand the basic concepts. The one I designed had about 25 application-specific instructions and took about 2K lines of HDL code for the full implementation.

The hardware is the intent of the design; the emulator is the expression of the actual behavior, e.g. a game that inadvertently relies on a cache miss.

Do you have any recommendations for tools for this?

I developed the processor using professional IC design tools (specifically ModelSim and Xilinx). However, free tools are available, like http://iverilog.icarus.com/

The actual processor was implemented on a Xilinx FPGA in about 2K lines of HDL code, implementing about 25 application-specific instructions which I also designed.

The compiler, disassembler, IDE, and emulator I created in C# and .NET.

While I did not use these, there are also ASIPs: https://en.m.wikipedia.org/wiki/Application-specific_instruc...

Not tools per se, but check out the Bitwise education project, where Per Vognsen is coding an entire hardware and software stack from scratch, live on stream. https://github.com/pervognsen/bitwise

Implementing an emulator is similar to writing an interpreter and imo more rewarding in certain cases.

Start with CHIP-8, it's the gateway drug to others.

I want to second this. I did CHIP-8 two weeks ago and it’s good for getting your feet wet.

I’m gonna finish it up with super chip-48, and then move to Gameboy. The progression feels natural and it’s fun and rewarding.

How do I get started? What resources do you suggest?

When I did my CHIP-8 [1] I collected a lot of background info and links, which you can find in the "History" and "Sources" topics of the readme.md there.

[1] https://github.com/tallforasmurf/CHIP8IDE

I want to learn how to write an emulator that is not an interpreter and still keep it timing accurate (even if not completely cycle accurate). I haven't dug, but some of the ones for modern machines seem to compile instead of interpret yet I don't know how they handle timing.

IIRC Dolphin has a JIT. In fact I think it has separate JITs for the CPU and GPU.

I think rpcs3 does too. What I need to understand (and probably only takes a small bit of digging on my part) is how they keep from having, say, my extremely high powered computer go way faster than what was expected in the original. Surely there's a scheduler involved, but does it install OS-like hooks? Does it sleep on the threads? Meh, I should just go read.

Actually emulating the CPU, even with a JIT, isn't as much of a performance drain as you'd think. The performance elephant is synchronization. If the CPU and the GPU aren't running at exactly the same speed as they were on original hardware, then timing is different from what a game might expect, and that can introduce untold bugs.

Real games on the console didn't have to worry about timing variances, so the emulator can't have them, but that means it needs to do a whole ton of extra work to make sure the various hardware bits are all running more or less in sync with each other.

Right, it's that sync work that I need to research. If I'm compiling some func down to, say, LLVM IR, how do I interleave the synchronization points without too much penalty? Again, surely just a matter of research; it just seems like magic from the outside.

Haven't written a JIT/dynamic recompiler myself, but here's how I've thought of it:

When two devices are communicating over a port, only operations writing to the port matter. If something is spinning, waiting on a signal (assuming the partner isn't affected by the repeated reads), you can simulate that by sleeping until the partner writes to the port (or another internal event occurs), updating any internal counters with how many cycles would have been burned.

Where it gets more difficult is interrupts, which in the above paradigm could come at any time. If it's a timer, you could know before entering a "basic block" (the JITted chunk, probably larger than a classic basic block) whether it would fire during the block, and then just single-step interpret the rest of the cycles. If it's irregular... you might either checkpoint regularly and revert to a previous checkpoint if an interrupt comes in during a block, or batch up changes and only commit them if we hit the end of the block with no interrupt.

One big problem with this idea is that RAM is a device, probably not connected to only a single core CPU! You can get around this a little by modelling RAM over time as separate ranges, first assume that writes are only going to be visible to the device that wrote it, then if another reads/writes it, split the region, treat the ranges as separate devices (and treat this event as an interrupt to the first device).

Man now I want to write this!

The problem arises when needing to know how frequently to checkpoint, i.e. the size of said blocks. If I am running the emulator on a machine 200x faster than the emulatee, the check should be every instruction because we have the time and executing too many instructions between syncs will be very noticeable. But if I am running the emulator on a machine 1.05x faster than the emulatee, we don't have a lot of time to work with, so you'll guess at the block size between checks? Get it wrong and you've done too many ops or too few. Either way, you definitely don't have time to sleep or use small block sizes. I'm sure I'm over thinking it.

I guess I'm mostly thinking of using threads?

On the plus side, the synchronization is explicit on newer consoles because they couldn't rely on constant timing. Caches and multiple bus masters meant that they couldn't cycle-count like they used to be able to.

Modern games (say, since the PlayStation 2 generation) tend to rely on clocks to compute animations. This is most obvious today when consoles often get several different clock speeds across their release lifetime--my current Xbox One is faster than my original Xbox One by a tangible amount. It's also necessary when different parts of a 3D scene will render at different rates--you don't want to be able to slow the animation down by looking at something complex.

If you're recompiling, you just add code to keep track of how many guest instructions you've retired. It's similar to the kind of profiling code that a multi-stage JIT would use. You can also just interrupt a thread after some time has passed, but that tends to fall apart in my experience.

What is the difference, in this context, between timing-accurate and cycle-accurate?

Timing accurate: this instruction takes four cycles, so do the work, and increase master clock by four.

Cycle accurate: this instruction takes four cycles, so on this cycle push the address onto the multiplexed A/D bus, on the next cycle read the data, etc.

So there's more fast-forwarding in timing-accurate vs. cycle-accurate. Typically, timing-accurate emulation doesn't implement things like wait states injected by other peripherals accessing the main bus either.

For me it's the difference between something taking the right amount of time to the viewer vs ensuring all instructions take the same amount of cycles in the emulator as the emulatee (or at least relative to each other that is).

This is pretty much about writing a CPU emulator. A system emulator is much more complex, even as far back as a Game Boy or Sega Genesis. I think there is still great value in writing your own CPU emulator as described in the article.

I used this website for information for both a Chip8 emulator and later an NES emulator (which uses a 6502). The PPU for the NES was definitely more of a struggle but Emulator101 is still a great resource even if you want to go on for emulating other chips / peripherals :)

If you're interested in processor emulation, here's a really cool one done in HDL for the pdp 11/70: https://wfjm.github.io/home/w11/

If you like writing small emulators, try the Synacor challenge and/or ICFP 2006. Extremely challenging puzzles, and both start with writing a simple emulator.

Does this address optimizations, as seen in LLVM and JIT compilers?

As far as I can tell it does not feature a recompiler, only an interpreter. Emulator recompilers are very tricky to get right. Since you're working directly with machine code instead of nice bytecode, you generally need a bunch of heuristics to avoid optimizing something that shouldn't be optimized (register access being an obvious candidate). Furthermore, machine code (especially for older consoles) tends to use self-modification liberally, so you need to figure out the right heuristic to invalidate your recompiled cache when the machine code is modified.

In general, for these very old systems (anything 8-bit these days, I'd say, and even 16-bit if you're only targeting the desktop), a recompiler is probably more trouble than it's worth.

> In general, for these very old systems (anything 8-bit these days, I'd say, and even 16-bit if you're only targeting the desktop), a recompiler is probably more trouble than it's worth.

I guess that's also because modern CPUs can easily handle very old systems, which were relatively slow, and hence software written for them had to be efficient.

However, if emulating e.g. ARM on x86 (or vice versa), I guess that recompiling optimizations may be necessary to keep things efficient. Of course, emulating JITted code can be a pain because of self-modifying code. Wondering: does VirtualBox use such an approach?

>I guess that's also because modern CPUs can easily handle very old systems, which were relatively slow, and hence software written for them had to be efficient.

It's that but it's actually worse than that: power aside, one hertz for one hertz, modern systems are generally much more recompiler-friendly than a GameBoy or Atari. The reason is that modern software is written using high level abstractions, you have well defined firmware interfaces, self-modifying code has become the exception instead of the rule etc... That lets you make simplifying assumptions that can speed things up greatly. For instance "I don't have to worry about self modifying code unless I encounter a call to the firmware's flush_cache method". That's better than "I need to be careful with every write to memory because it might potentially write to code instead of data or registers".

Back in the day it wasn't rare to find timing loops that required cycle-accurate emulation, for instance because the drawing code expected the delay between the interrupt and a certain instruction being executed to be precisely n cycles. I doubt many people do that on a PS4; actually, I would be surprised if you could make something like that work across all the various hardware revisions and firmware updates.

So older hardware generally requires the emulator to be a lot more strict which often means that you end up with a comparatively slow recompiler. Given the complexity of a recompiler compared to a simple interpreter it's simply not worth it anymore for these old systems.

> However, if emulating e.g. ARM on x86 (or vice versa), I guess that recompiling optimizations may be necessary to keep things efficient.

Depends on the ARM in question and how dependent on cycle accuracy the rest of the system is. Anything more than ~100MHz you pretty much need to JIT, but a surprising number of systems don't reach that. So like a DS is generally interpreted, but a 3DS is generally JITed.

> Wondering: does VirtualBox use such approach?

VirtualBox uses a combo approach depending on the host.

A 32-bit host uses a very simple JIT that runs in ring 0 and runs all guest code in ring 3. The JIT hides itself using the segment registers. It's mainly there to emulate the few instructions that don't satisfy the Popek & Goldberg requirements on x86 (i.e. instructions that act differently in ring 0 vs. ring 3 but don't trap). It'll also play with the page tables to know when self-modification occurs.

A 64-bit host is way easier: it pretty much just relies on hardware virtualization, interpreting a single instruction here and there when a trap exits hardware virtualization into the hypervisor.

I wanted to learn SNES assembly once, but I found the actual limitations of the SNES frustrating, so I made my own 24-bit virtual machine and basic assembly language. I was getting confused about how to implement interrupts, so I took a look at an open-source Game Boy emulator... turns out, without really knowing what I was doing, I had written an emulator.

That was really awesome to work on though. I probably learned more working on that than anything else. It was the first time I really felt like I understood entirely how a computer actually works.

too bad it's in Objective-C :/

It does go into Objective-C later, but it starts with C. You can see that if you go to Disassembler Pt 1.

That’s the best part!!! Love me some Objective-C


   /* Example 2: turn on one LED on the control panel */
   char * LED_pointer = (char *) 0x2089;
   char led = *LED_pointer;
   led = led | 0x40; //set LED controlled by bit 6
   *LED_pointer = led;

Yes really. It's example code written for the sole purpose of being read by people for educational purposes. As long as you understand the basics of pointers, it's really easy to read. Why are you mad at it?

What's wrong with that? If you're writing code for pedagogical purposes, it doesn't necessarily need to be as concise as possible.

I know some people who object to one-line read-modify-write of hardware registers, the idea being that glancing at the code the non-atomicity of the operation might be lost. I also generally don't use raw volatile pointers directly when dealing with hardware registers, too easy to forget a "volatile" somewhere and have the code behave erratically. I always use tiny read32/write32 wrappers instead.

Absolutely. Modern compilers are trivially able to inline such operations, so there's no reason not to make your intentions and separation of operations crystal clear. After all, you're going to read the code a lot more often than you'll write it.

Yes and read/write barriers are also sometimes necessary, to prevent re-ordering by the compiler and processor.

You have to be careful, though, in some cases. For instance, if you read two MMIO registers in a one-liner, there's no way to guarantee the ordering between them.

Yes, wrappers are the way to go, e.g. in the Linux kernel.

In fact, in my firmware writing style, I tend to separate MMIO accesses from mutation to make it clear when those bus accesses occur.
