IN/MSX: Running 4 Copies of an Operating System at Once (2008)

cpr · on May 2, 2014

Yep, sounds familiar.

At Imagen in the early '80s, we used the same Sun board (Andy Bechtolsheim, the designer, was a consultant for us while he started up Sun). Imagen was a Stanford TeX project spin-off started by Knuth's sidekick Luis Trabb Pardo, building the first typesetting-capable commercial laser printers using, at first, wet-process Canon imaging engines (LBP-10).

I wrote our own "real-time OS" on the bare 68K Sun hardware (first time I'd ever written a full (if simple) OS from scratch), and remember fairly vividly the hard-knocks learning experience about race conditions just like the one he describes here. Running for hours or days without error and then crashing randomly--nightmare time.

Luckily, we also had an ace hardware guy, Kok Chen, from the Stanford SETI project, and he and I and the logic analyzer would run a test setup and lie in wait for the condition to show up, then look back in time at all the (Multi)bus transactions to see what actually happened. (Kok later moved to Apple and became a distinguished engineer, one of very few folks who could work on whatever they wanted.)

jeffbarr · on May 2, 2014

I had a lot of fun doing this project as a young developer who had no idea what could and could not be done with access to a raw machine.

Zenst · on May 2, 2014

Indeed, as we get older we add more realistic limits to our lives, having already defined our boundary markers. There are also more layers between you and the metal now. But many have things they did in their youth that they would avoid today. That said, we often chucked in a level of commitment, work-wise, that you would only see today in a founder of a company. That was the norm back then; today it is the exception.

Morgawr · on May 2, 2014

This reminds me of a (much, much simpler) assignment for my Operating Systems class during my bachelor's. We had to develop our own microkernel with multitasking on a MIPS virtual machine. Setting up the scheduler was easy, but handling interrupts and message passing so that a task would not be context-switched out while a message was not yet fully acknowledged/replied to was hell.

I was 19 back then and I didn't know much about OS design and concurrency (I learned a lot during that assignment). I remember spending a few nights just wondering why my code wasn't working at times. I mean, after all it's just two instructions one next to the other, right? It's too unlikely that a context switch would happen right between these two lines of code, right? How wrong I was.
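To make the "just two instructions" trap concrete, here is a minimal sketch, not the code from that assignment: a deterministic Python simulation in which two tasks each increment a shared counter as a separate load and store. The names `increment_steps`, `run`, and the schedule lists are hypothetical, made up for illustration, and the generator `yield` stands in for a point where a context switch could land.

```python
# Deterministic simulation of a lost update. Every name here is
# hypothetical; a generator's yield marks a possible context switch.

def increment_steps(shared, name):
    """Increment shared['count'] as two separate steps: load, then store."""
    tmp = shared["count"]          # step 1: load the current value
    yield f"{name}: loaded {tmp}"  # a context switch may land here
    shared["count"] = tmp + 1      # step 2: store the incremented value
    yield f"{name}: stored {tmp + 1}"


def run(schedule):
    """Resume tasks A and B one step at a time, in the given order."""
    shared = {"count": 0}
    tasks = {
        "A": increment_steps(shared, "A"),
        "B": increment_steps(shared, "B"),
    }
    for name in schedule:
        next(tasks[name])
    return shared["count"]


# No switch between load and store: both increments survive.
no_switch = run(["A", "A", "B", "B"])

# Switch right between the two steps: B loads the stale value,
# so one of the two increments is lost.
bad_switch = run(["A", "B", "A", "B"])

print(no_switch, bad_switch)
```

On real hardware the bad schedule is rare rather than impossible, which is why this kind of bug can hide for hours or days before it shows up.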

Awesome read, by the way.

userbinator · on May 2, 2014

For a second I thought this would be about running 4 OSs on the http://en.wikipedia.org/wiki/MSX . Those interrupt-related race conditions can definitely be really subtle - reminds me that the earliest 8088s had an erratum where an interrupt could occur during a stack switch, corrupting memory in a similar way (see http://www.malinov.com/Home/sergeys-projects/sergey-s-xt/his... ).

chillingeffect · on May 2, 2014

"When you are young and naive, anything seems possible with technology."

Oh boy, I hope to use this quote at every possible opportunity from now on.

jeffbarr · on May 2, 2014

Enjoy!

mwcampbell · on May 2, 2014

So if it was possible to implement a hypervisor, why wasn't it possible, or feasible, to run Unix and ROS on the same machine?

jeffbarr · on May 2, 2014

Well, for one thing, I didn't have access to the Unix source code. The CEO handed me a drive and said "Here's Unix."

Also, ROS worked within a flat address space and had no awareness of the memory management features of the SUN board. Unix, on the other hand, took full advantage of the MMU and I had no way to control what it did.

vidarh · on May 3, 2014

I'm curious about the hardware. The 68000 is impossible to fully virtualize directly - e.g. if you tried to write to memory the current process shouldn't be able to access, there'd be no way to signal it without triggering an exception/trap that didn't leave you with enough information to restart or continue the instruction. I think the 68010 fixed this.

I also seem to remember that there were 68000 boards that "solved" this by actually including two 68000s on the board and running them in lock-step 2 cycles apart, or something, and then triggering a trap on both of them if the leading one triggered the MMU - that way the trailing CPU would be halted at a point where sufficient state was available to reset state on both of them.

Do you remember if you had to deal with anything like this? Or did it have simple enough requirements to get away without it (or actually run a 68010)?

I love the 68k family - after my C64, I got into Amigas, and I'm still disappointed Motorola didn't manage to keep up. It was so much cleaner to work with than the x86...

jeffbarr · on May 3, 2014

There was no virtual memory and hence no page faults and no need to recover from them. Each OS ran in a fixed portion of the available RAM (I want to say 512K per OS).
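A rough sketch of what fixed partitioning like this amounts to (not the actual IN/MSX code): the address arithmetic is simple enough to show in a few lines of Python. The 512K size is the half-remembered figure above, the count of four guests comes from the title, and `RAM_BASE`, `partition_bounds`, and `owner_of` are hypothetical names assumed for illustration.

```python
# Fixed RAM partitioning, sketched under assumptions: the constants and
# function names are illustrative, not taken from the real system.

PARTITION_SIZE = 512 * 1024  # assumed: "I want to say 512K per OS"
NUM_GUESTS = 4               # from the title: 4 copies of the OS
RAM_BASE = 0x000000          # hypothetical start of guest RAM


def partition_bounds(guest):
    """Return the (start, end) addresses of a guest's region, end exclusive."""
    if not 0 <= guest < NUM_GUESTS:
        raise ValueError(f"no such guest: {guest}")
    start = RAM_BASE + guest * PARTITION_SIZE
    return start, start + PARTITION_SIZE


def owner_of(address):
    """Return the guest index owning an address, or None if outside guest RAM."""
    offset = address - RAM_BASE
    if not 0 <= offset < NUM_GUESTS * PARTITION_SIZE:
        return None
    return offset // PARTITION_SIZE


for guest in range(NUM_GUESTS):
    start, end = partition_bounds(guest)
    print(f"guest {guest}: {start:#08x}..{end - 1:#08x}")
```

With no page faults to recover from, nothing here needs to restart an instruction; each guest simply lives inside its own fixed range.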

The 68K ISA was a joy to program. With some prior experience on the 6502, I was productive on the 68K within days.

emeraldd · on May 2, 2014

I would imagine this was less what we would think of as a hypervisor now (i.e. something that exposes a complete machine to the guest) and more of a partitioning/container system that imposed significant constraints on the guests.

jeffbarr · on May 2, 2014

This is correct! I really had no model to go on, only my naive belief that I could make it work if I only wrote enough code.