
Here’s my favorite operating systems war story, what’s yours? - luu
http://blog.valerieaurora.org/2013/12/17/heres-my-favorite-operating-systems-war-story-whats-yours/
======
wglb
The year is 1969. The machine is an SDS Sigma 5, soon to be renamed the XDS Sigma 5
after Xerox bought SDS from Max Palevsky.

The Sigma 5 had been used in aerospace data collection, as it had the necessary
data-collection capability and an optional fixed-head disk, ideal for
real-time operation.

We were building a system to take analog data from 12-lead electrocardiograms
transmitted in three-channel audio FM over telephone lines, sampling the three
channels of data from up to 20 phone lines at 500 samples per second, and
queueing the data to disk for later collation and feeding to the analysis
program as well as writing to tape.

It turns out that the operating system called RBM (Real-time Batch Monitor)
would mask out interrupts during certain key events. Since the 500 samples per
second was driven by a hardware timer interrupt, we needed that to not be
masked out. So with every release of the OS, I had the job of locating all the
places that the interrupt masking took place in the OS and changing the
instruction so that it wouldn't mask the timer interrupt. This required a
careful audit of the OS's use of the timer interrupt to be sure that we
weren't exposing an inadvertent race condition. We were worried about skew
leading to an appearance of noise on the digitized signal.

So I had the task of changing the card deck and recompiling the kernel.

All our interrupt driven work was done in assembly language. Probably would
have used C, but it hadn't been invented yet when we started. But coroutines
in that interrupt-rich environment were a damn sight easier in assembler than
wrangling with threads in a higher-level language.

Much fun, but then I got interested in compilers.

~~~
Stratoscope
You worked on a Sigma 5 in 1969? Where were you? I was in Phoenix, at my first
programming job. Summer of '69, Transdata in Phoenix hired me as the night
operator. The best part was that they didn't offer their timesharing service
at night, so basically I had the computer to myself.

We didn't have anything like electrocardiograms, but the neat thing was to see
what kind of program you could write on a single card. Did you have any of the
bird chirp cards that would toggle the front panel speaker and put out a whole
flurry of bird sounds?

My favorite (or at least most useful) one-card program was when I found the
print card: Put this card first in the card reader ahead of a stack of other
cards, and it would print out the whole deck!

Only problem was that the print card was single buffered. It would read a
card, then print it. Then read the next card, then print it. And on and on,
with the card reader waiting while you print, and the printer waiting while
you read the next card.

So I figured out a way to double-buffer the print routine: It read the next
line while printing the one before, so it was much faster than before.

And it still fit on a single card!

80 columns should be enough for anyone.
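
The double-buffering win is easy to sketch in modern terms. Here's a toy timing model in Python (my own made-up numbers and names, nothing to do with the real one-card code):

```python
# Toy timing model of the one-card print program, before and after
# double buffering. Time units and card counts are invented.

def single_buffered(cards, read_time, print_time):
    # Read a card, then print it; the reader and printer never overlap,
    # so every card pays for both devices in sequence.
    return cards * (read_time + print_time)

def double_buffered(cards, read_time, print_time):
    # Read the next card while printing the previous one; after the
    # first read, total time is governed by the slower device alone.
    return read_time + cards * max(read_time, print_time)

print(single_buffered(1000, 2, 3))  # 5000
print(double_buffered(1000, 2, 3))  # 3002
```

With these made-up timings, overlapping the devices cuts the run by roughly the full cost of the faster one.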

~~~
wglb
Ha! Double buffering. I knew it well.

The company was Telemed and it was located near Chicago. In fact, for a while
it was in an office complex a few hundred yards directly east of runway 27R
at O'Hare. Often a fully laden, Europe-bound 747 would take off, aiming directly
overhead. We would be silently chanting "up! up!" as it felt like they were
pretty close.

I don't think we did the single card trick. I do remember writing a few
utilities like an editor so we could store the programs on disk instead of
feeding them in each time. And a crude document processor for internal
documentation. And we had a pretty serious engineering effort saving the real-
time data to tape. There was always the concern about being sure that we had
saved the data by the time the system signaled the hospital technician that it
was done so they could unhook the patient.

The extracurricular activity I spent the most time on was porting the XPL
compiler from one done at University of Washington to run on our particular
configuration. Involved converting from a 7-track tape to a 9-track tape,
regenerating the operating system to reduce the start address of user
programs, and a few other hoops. I had read _A Compiler Generator_ and my
career was off on another track.

And we did the evening thing as well. We had two systems for redundancy, but
for a while production required both machines. One would do data collection,
and the other would run the diagnostic program, which spit out paper tape that
was carried over to the teletypes. Since traffic was light at night, we would
come in at some ungodly hour for the better part of a year doing the work to
combine these systems. If a call came in, we had about an hour to get off the
system so they could bring up the diagnostic system to do the analysis and
generate the paper tape.

I think I remember the cards--they were binary and half full of holes, as
opposed to the others (EBCDIC, I believe).

------
bananas
Z80 embedded system debug session. Two flipping days working out why an NMI
wasn't being captured by the kernel interrupt handler in a communications
controller I was working on. The NMI handler religiously stopped working after
ten minutes. Turned out some chump had bent the CPU _NMI pin in the socket of
the device at the manufacturer and it wasn't contacting the socket reliably so
when it warmed up it started floating. However, you couldn't see it with
visual inspection. I assumed it was my fault and spent hours with the
assembler and Z80 and vendor docs trying to work out why it wasn't working.
Got miffed, plugged in a logic analyser which caused it to work perfectly.

Eventually I assumed the CPU was duff, gave a finger to the rules which
involved not changing the hardware, yanked it and found the inverted pin. Grr!

Bear in mind this was MILSPEC and had gone through QC, soak and thermal
testing.

------
jacquesm
While writing a QNX clone I used DJGPP to bootstrap my fledgling new OS. DJGPP
is a 32-bit 'extender': it allows you to write 32-bit code and run it from the
16-bit DOS environment of the time. When the basics were in place and the OS
was self hosting I made a mistake somewhere and managed to mess up the system
to the point where it would no longer self host.

Having to go back to DJGPP, extracting the files from the (now unmountable)
filesystem with the latest working version and then getting back to being self
hosting was the stuff of nightmares and I considered giving up several times.

I never realized that the holy grail of self-hosting is also a trap-door until
it had swung shut behind me!

In the end it all worked out and I got it back, but from then on I made sure
to have at least two 'known good' kernels waiting and I checked the toolchain
a lot more carefully to avoid bugs introduced by broken compilers.

Lessons learned the hard way for sure.

------
nappy-doo
I wrote an operating system for a number of different DSPs in 2001. The OS was
C++, with different pieces written in assembly as needed. The design of the OS
required that we take the lowest priority interrupt to handle rescheduling as
we returned from interrupts, and made sure the correct task was running, etc.
At the end of that lowest priority interrupt, we wouldn't RTI if a high
priority task was switching in, we'd just drop back in to user space, and
handle the rest of the RTI functionality.

One particular part had a fixed size hardware stack. When an interrupt
occurred, it'd push stack/frame and a single register (ALU results), and begin
executing the ISR. What we were seeing was occasionally we'd blow the hardware
stack. Now, I've given enough data here to debug it, clearly we weren't
resetting the hardware stack in the low priority interrupt, but at the time we
were STUMPED.

I spent a full day working and thinking about the problem, and that night, I
dreamt what the answer was, came in, wrote the three lines of assembly, and it
all worked perfectly. It was the first (and unfortunately not the last) time I
debugged software while sleeping.

------
ibisum
It's 1987. I'd just spent a few months building a small single-floppy bootable
OS for the IBM PC. The purpose of this project was to display a small training
demonstration for security-related personnel in a protected environment. There
were to be no ways to interrupt/interfere with the system running the training
program, and it absolutely could not have been done in DOS or CP/M - had to be
its own standalone system, 100%. It absolutely had to be something that 'could
not be copied in a normal computer', where normal was: any of the DOS-booting
machines out there in the final location.

So we built a boot-loader, a small kernel, packed the training-material
resources into a tiny bytecode format, wrote a VM to process the bytecode and
run the training app, and delivered two bootable - albeit 'uncopyable' - floppies
with the app - one for the demo, and one for the final installation. The app
worked great, but building the floppy required a fair bit of magic hand-
waving, back in those days. I retired after giving the delivery-person their
two, very valuable floppies, with only thoughts on my mind that perhaps one
day I should automate all that sector-placing hand-waving magic ..

So, I get a call from the remote location at 4am, saying that
the demo floppy had been placed on top of some magnetic thing accidentally,
the person had been fired, and how do we make another copy of the floppy for
the install at 7am?

Well, indeed. "We'll have to do a sector-copy. Do any of your DOS machines
have debug.com installed, by any chance .." A 7-line assembly routine to do
block copies, 15 minutes of waiting-on-hold listening to remote floppy copy
noises later, and we had our copy. The scant few hours I had in between
dreaming of the routine, and then actually having to explain to a non-
technical person over the phone how to assemble it into a working program in a
way that won't destroy the only working copy .. well, let's just say I learned
a lot of things in that project that I'm _still_ trying to un-learn. ;)

------
unwiredben
Reminds me of a story from back in the mid-90s at my first job. I was working
in Motorola's paging products group and helping bring up an ARM7-based
microcontroller that we were testing for two-way pagers. I was in the ASIC
group that had put together the chip and had written a simple test monitor
shell that ran over serial and let us test out the registers.

I was on location at an engineering office in Florida where we were adding
features to the monitor to test out new systems. At one point, the builds
stopped working -- they never seemed to come out of reset successfully.

After work with the logic analyzer to try to watch the memory bus to see what
instructions were running in memory, I finally hit on the answer. The linker
was placing the init code at the end of our binary, and the latest code we
added had pushed the init routines past the first 64K of program code.
However, on this microcontroller, some of the address lines were multiplexed
with general purpose IO lines, so out of reset, trying to get to higher
addresses just wouldn't work; you'd be loading low-memory code instead.

A quick rework of the linker command line to reorder code sections and some
modifications to the init code to flip those lines to address mode, and
everything started working again.

------
mzs
There was a board that over the course of two weeks would get progressively
and perceptibly slower. Looking into it the heap was full of tiny allocations
that nobody knew where they were coming from. What had happened is that
someone had compiled the file that had an interrupt handler with the C++
compiler instead of the C compiler. This was running in kernel context, and
the older compiler decided that there had been no data structure allocated to
keep tabs on exceptions for that thread, so it malloced that little bitsy
chunk from the function prolog - every time the interrupt occurred, about
15 Hz in practice. Eventually it was taking tasks a longer and longer time to
find free blocks in the fragmented heap. The quick solution was to add nothrow
to the file ;)
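
The back-of-the-envelope arithmetic makes the slowdown unsurprising. One tiny malloc per interrupt at roughly 15 Hz over two weeks works out to:

```python
# Rough count of the stray prolog allocations: one tiny malloc per
# interrupt at ~15 Hz, over the two weeks it took the board to become
# perceptibly slower. The 15 Hz figure is from the story above.
interrupts_per_second = 15
seconds_in_two_weeks = 60 * 60 * 24 * 14
leaked_chunks = interrupts_per_second * seconds_in_two_weeks
print(leaked_chunks)  # 18144000
```

Eighteen million tiny chunks littering the free list is plenty to make every subsequent allocation crawl through a fragmented heap.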

------
ArkyBeagle
It's not an O/S story but...

A product ( with a massive, linear power supply ) has powerfail detection -
when voltage dropped below <x>, ... powerfail. This triggered a digital
counter and an R/C network as an intentional race condition to trigger a line
going to a PIO on the processor. Whichever finished first won. The software
would then shut things down in an orderly fashion. It would wait in that state
until reboot.

When it was tested, it was tested with an A/C relay driven by the parallel
port on a PeeCee. Gen a random number, wait that many milliseconds, turn the
relay off ( or on ).

But field service said the thing would latch in powerfail. I pointed to the
automatic (RNG driven) test of powerfail ( run every software release ) and
the lead field service guy says "but that relay only switches on zero
crossings of the A/C line." Sure enough... we thumbed power strips for two
days and got it to latch....

The fix? Add a software counter to the powerfail state. After <n> cycles, it
pulled the reset line ( which was in I/O space; NEC V25) itself.
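
The fix is a classic self-watchdog, and a toy sketch of the idea might look like this (the cycle count and return strings are invented; the real code poked a reset register in the V25's I/O space):

```python
# Toy model of the powerfail watchdog fix: while latched in the
# powerfail state, count cycles; after n of them, pull the reset line
# ourselves instead of waiting forever. n is a made-up stand-in for
# the real <n>.

def powerfail_tick(count, n=1000):
    # Called once per cycle while the system sits in the powerfail state.
    count += 1
    if count >= n:
        return count, "pull reset line"   # self-reset instead of latching
    return count, "stay in powerfail"

count, action = 0, "stay in powerfail"
while action == "stay in powerfail":
    count, action = powerfail_tick(count)
print(count, action)  # 1000 pull reset line
```

The point being that if the orderly shutdown never turns into a real power loss, the software eventually un-wedges itself.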

------
biot
While not a debugging war story, here is one of my favorite stories about a
different sort of war between two OSes:

[http://www.eros-os.org/project/novelty.html#persistence](http://www.eros-
os.org/project/novelty.html#persistence)

------
coldcode
Working at Apple in 1996 I went to a meeting where all departments sent people
to talk with the Copland (Apple's attempt at a modern OS that still included
the old OS as a first-class API) leadership. The QuickTime team got into an
argument with the leaders when they couldn't promise that the OS could
actually give QuickTime a predictable time slice, thus making video almost
impossible. I think at that moment everyone realized that Copland was a
disaster (though most already assumed it would be). I left Apple soon
thereafter and in the next year (1) CTO Ellen Hancock killed Copland (2)
Apple bought NeXT (3) Steve returned.

------
tzs
I've had a few interesting operating system development experiences. Warning:
rambling alert!

Circa 1984: I was working at Callan Data Systems, a small 68k workstation
maker in the greater Los Angeles region (just outside Thousand Oaks, for those
familiar with the region).

We had been using Unix ports from UniSoft, but for the new 68010 and 68020
based systems we were developing we were doing our own port from the AT&T
sources. I don't recall if our base was System III or if it was System V
Release 1 (I'm sure it was earlier than SVR2, for reasons that will become
apparent).

This version of Unix did not support demand paged virtual memory. It was a
classic swapping system. I was rewriting the process and memory subsystems to
add demand paged virtual memory (that's why I know we were starting from
something earlier than SVR2, because SVR2 was when AT&T added demand paged
virtual memory support).

It was running quite well, except I had this one annoying bug where
occasionally when a signal was delivered to a process that had a signal
handler installed for that signal, the process would get some kind of error,
like an illegal instruction trap. For instance, hitting control-C in the shell
might hit the bug. There was no sign of memory corruption, and no illegal
instructions where it would claim it had been executing.

I spent some long, late evenings with the in-circuit emulator and the logic
analyzer, trying to figure out what the hell was going on. Eventually, I was
able to determine that it only happened if the signal was delivered while the
system was trying to return from handling a page fault for the process that
was receiving the signal.

On the original 68000, virtual memory was not supported. When a bus error was
generated by an invalid memory access, the exception stack frame that
contained information about the error did not contain enough information to
restart or resume the failed instruction. You had no choice really except to
kill the process. Hence, almost all 68000 systems were pure swapping systems
[1]. (It was possible to do on-demand stack space allocation even on the
68000, through a bit of a kludge [2]).

The 68010 added support for virtual memory. The way it did this was to make
bus error push a special extended exception stack frame. This extended frame
contained internal processor state information. When you did the return from
exception, the processor recognized that the exception had the extended frame,
and restored that state information. (This is called instruction continuation,
because the processor continues after the interrupt by resuming processing in
the middle of the interrupted instruction. The other major approach, which is
what I believe most Intel processors use, is called instruction restart. With
that approach, the processor does not save any special internal state
information. If it was 90% of the way through a complicated instruction when
the fault occurred, it will redo the entire instruction when resuming.
Instruction continuation raises some annoying problems on multiprocessor
systems [3]).

The way signals are delivered is that whenever the kernel is about to return
to user mode, it does a check to see if the user process has any signals
queued. If it does, and they are trappable and the process has a signal
handler, the kernel fiddles with the stack, making it so that (1) when it does
the return to user mode it will return to the start of the signal handler
instead of to where it would have otherwise returned, and (2) there is a made-
up stack frame on the stack after that so that when the signal handler
returns, it returns to the right place in the program.
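
That stack fiddle can be sketched as a few lines of toy Python (the names and addresses are invented; real kernels build a much richer signal frame):

```python
# Toy model of signal delivery as described above: on return to user
# mode, if a trappable signal is pending, arrange to "return" into the
# handler, leaving the original return PC underneath as the made-up
# frame the handler will return through.

def return_to_user(stack, pending, handler):
    # stack[-1] is the PC the kernel is about to return to.
    if pending and handler is not None:
        stack.append(handler)  # handler runs first; original PC waits below
    return stack

print(return_to_user([0x1000], pending=True, handler=0x2000))
# [4096, 8192]: run the handler, then resume at 0x1000
```

The subtlety in the bug below is that this trick silently assumes the frame being fiddled with is an ordinary one.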

This was fine if the kernel was entered via a mechanism that generated a
normal interrupt stack frame, such as a system call or a timer interrupt or a
device interrupt. When the kernel was entered via a bus error due to a page
fault, then the stack frame was that special extended frame, with all the
internal processor state in it. When we fiddled with that to make it return to
the signal handler, the result was the processor tried to resume the first
instruction of the signal handler, but the internal state was for a different
interrupted instruction, and if these did not match bad things happened.

The fix? I put a check in the signal delivery code to check for the extended
frame. If one was present, I turned on the debug flag and returned from the
page fault without trying to deliver the signal. The instruction that incurred
the page fault would then resume and complete, the processor would see that
the debug flag was on, and would generate a debug interrupt. That gave control
back to the kernel, where I could then turn the debug flag off, and during the
return from the debug interrupt do the stack manipulation to deliver the
signal.
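
As a toy state machine, the deferral trick looks something like this (event strings are invented; the real mechanism is the 68010 trace bit and trace exception):

```python
# Toy model of the fix: if a pending signal is found while returning
# from a page fault (extended frame), set the trace flag and let the
# faulted instruction finish; the trace trap then re-enters the kernel
# with a normal frame, where the signal can be delivered safely.

def deliver(frame_type, trace_on=False):
    log = []
    if frame_type == "extended" and not trace_on:
        log.append("set trace flag, resume instruction")
        log.append("instruction completes")
        log += deliver("normal", trace_on=True)  # trace trap: normal frame
    else:
        if trace_on:
            log.append("clear trace flag")
        log.append("deliver signal")
    return log

print(deliver("extended"))
```

So delivery is never attempted against an extended frame; it is always deferred by exactly one instruction.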

(continued in reply)

~~~
tzs
Circa 1986: I was working at Interactive Systems Corporation in Santa Monica.
ISC had a contract from AT&T to produce the official port of System V Release
3 to Intel's new processor, the 80386. Unfortunately, the contract also called
for porting it to the 80286, and that was the port whose kernel I was working
on, with one other programmer. That was a kludge. We got it working, but there was
a strange scheduling bug. If you loaded it down with around 10 processes, each
using a lot of memory, so the system had to make heavy use of virtual memory,
what you'd see is that 1 process would get about 90% of the available
processing time, 1 process would get essentially no processing time, and the
remaining 8 would split the remaining 10% of the processing time pretty much
equally. It would stay this way for a few hours, and then it would thrash
especially hard for a short while, and go back to the previous pattern, except
the processes had shuffled, so it was a different process getting the 90% and
a different one getting screwed with 0%, and the remaining 10% equally shared
among the remaining processes. So, in a sense, the scheduler was actually
quite fair--if you watched it for a week, every process ended up with about
the same processor time.

We just could not figure out why the heck it was doing this. We never did
solve this. AT&T came to their senses and realized no one wanted a 286 port of
SVR3 and dropped that part of the project, and I got moved to the 386 port,
where I added an interactive debugger to the kernel, and hacked up the device
driver subsystem to allow dynamic loading of drivers at runtime instead of
requiring them to be linked in at kernel boot time. (The kernel had grown too
big for the real mode boot code, and no one wanted to deal with writing a new
boot loader! Eventually, someone bit the bullet and wrote a new, protected
mode, boot loader and so we didn't need my cool dynamic device loading
system).

Another part of the project with AT&T was providing a way for 386 Unix to run
binaries from 286 Unix (probably System III, but I don't recall for sure).
Carl Hensler, the senior Unix guru at ISC, and I did that project. (Carl,
after ISC was sold to Kodak and then to Sun, ended up at Sun where he became a
Distinguished Engineer on Solaris. He now spends much of his time helping his
mechanic maintain his Piper Comanche, which he flies around to visit craft
breweries). The 286 used a segmented memory model. So did the 386, but since
segments could be 4 GB, 386 processes only used 3 segments (one code, one
data, and one stack) which all actually pointed to the same 4 GB space.
Fortunately, the segment numbers used for those 3 segments did not overlap the
segments used in the 286 Unix process model, so we did not have to do any
gross memory hacks to deal with 286 memory layout on the 386. We were able to
do most of the 286 support via a user mode program, called 286emul. We
modified the kernel to recognize attempts to exec a 286 binary, and to change
that to an exec of 286emul, adding the path to the 286 program to the
arguments. 286emul would then allocate memory (ordinary user process memory)
and load the 286 process into it. We added a system call to the kernel that
allowed a user mode process to ask the kernel to map segments for it. 286emul
used that to set up the segments appropriately.

Another lucky break was that 286 Unix and 386 Unix used a different system
call mechanism. The 286emul process was able to trap system call attempts from
the 286 code and handle them itself.
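
The exec hook is simple enough to sketch. A toy version in Python (the emulator path and the magic-number check here are invented stand-ins):

```python
# Toy version of the exec rewrite: when asked to exec a 286 binary,
# the kernel execs 286emul instead, pushing the original path onto the
# argument list so the emulator knows what to load.

EMUL_PATH = "/usr/lib/286emul"    # hypothetical install location

def is_286_binary(path):
    # Stand-in for the real magic-number check on the binary header.
    return path.endswith(".286")

def rewrite_exec(path, argv):
    if is_286_binary(path):
        return EMUL_PATH, [EMUL_PATH, path] + argv[1:]
    return path, argv

print(rewrite_exec("/bin/oldapp.286", ["oldapp", "-v"]))
# ('/usr/lib/286emul', ['/usr/lib/286emul', '/bin/oldapp.286', '-v'])
```

From there 286emul loads the 286 image into ordinary user memory and asks the kernel, via the new system call, to map segments over it.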

Later, AT&T and Microsoft made some kind of deal, and as part of that they
wanted something like 286emul, but for Xenix binaries instead of Unix
binaries, and ISC got a contract to do that work. This was done by me and
Darryl Richman. It was mostly similar to 286emul, as far as dealing with the
kernel. Xenix was farther from 386 Unix than 286 Unix was, so we had quite a
bit more work in the 286 Xenix emulator process to deal with system calls, but
there was nothing too bad.

There was one crisis during development. Microsoft said that there was an
issue that needed to be decided and that it could not be handled by email or
by a conference call. We had to have a face to face meeting, and we had to
send the whole team. So, Darryl and I had to fly to Redmond, which was
annoying because I do not fly. I believe everyone is allowed, and should have,
one stubbornly irrational fear, and I picked flying on aircraft that I am not
piloting.

So we get to Microsoft, have a nice lunch, and then we gather with a bunch of
Microsoft people to resolve the issue. The issue turned out to be dealing with
a difference in signal handling between Xenix and Unix. To make this work, the
kernel would have to know that a signal was for a Xenix process and treat it
slightly different. So...we needed some way for a process to tell the kernel
"use Xenix signal handling for me". Microsoft wanted to know if we wanted this
to be done as a new flag on an existing ioctl, or if we wanted to add a new
"set signal mode" system call. We told them a flag was fine, and they said
that was it, and we could go. WTF...this could not have been done by email or
over the phone?

But wait...it gets even more annoying. After we got back, and finished the
project, Microsoft was very happy with it. They praised it, and sent Darryl
copies of all Microsoft's consumer software products as thanks for a job well
done. They sent me nothing.

On the 286emul project, Carl was the lead engineer, and the most experienced
Unix guy in the company. If AT&T had decided to give out presents for 286emul,
I would have fully understood if they gave them only to Carl. On the Xenix
emulator, on the other hand, neither Darryl nor myself was lead engineer, and
we had about the same overall experience level (I was the more experienced
kernel guy, whereas he was a compiler guru, and I had been on the 286emul
project that served as the starting point for the Xenix emulator).

All I can come up with for this apparent snub is that in 1982, when I was a
senior at Caltech, Microsoft came to recruit on campus. I wasn't very
enthusiastic at the interview (I had already decided I did not want to move
from Southern California at that time), and I got some kind of brainteaser
question they asked wrong (and when they tried to tell me I was wrong, I
disagreed). I don't remember the problem for sure, but I think it might have
been the Monty Hall problem. Maybe they recognized me at the face to face
meeting as the idiot who couldn't solve their brainteaser in 1982, and so
assumed Darryl had done all the work.

Three years later, Microsoft recruited Darryl away from ISC, so evidently they
really liked him. (As with Carl, you cannot tell the Darryl story without beer
playing a role. After Microsoft, Darryl ran his microbrewery for a while, and
wrote a book on beer [4]. I don't know why, but a lot of my old friends from
school, and my old coworkers from my prior jobs, brew beer as either a hobby
or as a side business, or are seriously into drinking craft beers. I do not
drink beer at all, so it seems kind of odd that I apparently tend to befriend
people with unusual propensities toward beer).

[1] I've heard of one hack that supposedly was actually used by a couple of
vendors to do demand paged virtual memory on 68000. They put two 68000s in
their system. They were running in parallel, running the same code and
processing the same data, except one was running one instruction behind. If
the first got a bus error on an attempted memory access, the second was
halted, and the bus error handler on the first could examine the second to
figure out the necessary state information to restart the failed instruction
(after dealing with the page fault). This is one hell of a kludge. (Some
versions of the tale say that after the first fixed the page fault, the
processors swapped roles. The one that had been behind resumed as the lead
processor, and the one that had been in the lead became the follower. I'm not
really much of a hardware guy, but I think the first approach, where one
processor is always the lead and the other is always the follower, would be
easier).

[2] There was not enough information on the 68000 to figure out how to restart
after a bus error in the general case, but you could in special cases.
Compilers would insert a special "stack probe" at the start of functions. This
would attempt to access a location on the stack deep enough to span all the
space the function needed for local variables, struct returns, and function
calls. The kernel knew about these stack probes, and so when it saw a bus
error for an address in the stack segment but below the current stack, it
could look around the return address to see if there was a stack probe
instruction, and it could figure out a safe place to resume after expanding
the stack.

[3] The extended exception frame contains internal processor state
information. Different steppings of the same processor model might have
different internal state information. After you deal with a page fault for a
process, you'll have to resume that process on a processor that has a
compatible extended exception frame.

[4] [http://www.amazon.com/Bock-Classic-Style-Darryl-
Richman/dp/0...](http://www.amazon.com/Bock-Classic-Style-Darryl-
Richman/dp/093738139X/ref=cm_lmf_tit_9)

~~~
wglb
Two awesome stories.

We dealt with the 286 when I was at Mark Williams. They had done the compiler
that Intel was using and reselling at the time. When the 286 came out, they
were concerned about performance, what with the goofy segment registers and
all the various memory models (compact, small, medium, large). The wanted us
to guarantee that the performance of the compiled code would be equal to or
better than the 8086. Naturally we resisted.

So you must know Ron Lachman.

Oh, also at Mark Williams, the year before I started with them they
demonstrated Coherent (v7 unix-alike) on a vanilla IBM PC without any memory
protection hardware. Later also done on the Atari ST.

------
Danieru
I have a more recent story but I hope it's still fun. A few months ago I wrote a
rootkit for our super-special viruses class.

The real point of the assignment was to write a self-propagating virus. I had
teamed up with a friend who happens to be a fantastic programmer. He promised
to cover the virus portion which freed me to go for the bonus marks.

This professor does bonus marks with the democratic method. At the beginning
of the classes he announces the criteria then at some point the class votes.
In our case the goal was "the most annoying virus".

As it happens my rootkit won us the bonus marks by a healthy margin. Something
I wasn't prepared for since the class learned a quick way to disable the
rootkit. What they would do is delete the kernel module loader before running
the virus. When I heard they would do this I was disheartened. Here was a
method I had not thought of and was
sure to make my work worthless.

As it happens they did this because the rootkit was evil. More evil than I
intended.

In fact in theory the rootkit was benign. The rootkit hooked the open system
call and counted opened files from the /bin, /usr/bin, and /sbin folders. A
bloom filter hidden in the kernel's task struct's dtrace fields prevented
double counting the same file.

Then another hook, write, performed the attack. When a task reached the "too
many files opened" threshold the rootkit caused any writes to return
instantly. The goal was to identify anti-virus programs when they were
disinfecting files and sabotage their disinfectant.
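
The two hooks are easy to model. A toy Python sketch (a plain set stands in for the kernel bloom filter, and the threshold value is invented):

```python
# Toy model of the rootkit's two hooks: count distinct opens from the
# watched directories, then make writes from over-threshold tasks
# no-ops that still report success.

WATCHED = ("/bin/", "/usr/bin/", "/sbin/")
THRESHOLD = 100                   # made-up "too many files opened" cutoff

class Task:
    def __init__(self):
        self.opened = set()       # stands in for the hidden bloom filter
        self.output = []

def hook_open(task, path):
    if path.startswith(WATCHED):
        task.opened.add(path)     # set semantics prevent double counting

def hook_write(task, data):
    if len(task.opened) >= THRESHOLD:
        return len(data)          # lie: claim success, write nothing
    task.output.append(data)
    return len(data)
```

Note that the write hook can't tell a disinfection attempt from a report to the console: stdout is just another file descriptor being written, which is exactly how the scanners got silenced.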

In theory this was being nice to the other students since only the strongest
students attempted disinfecting files. In practice it was much worse.

I did not realize it at the time but I bet you can guess what happened. For
the web developers: stdout, the way to give output back to console, is really
a file and output is sent with a write system call. A virus scanner would find
and report viruses then hit the threshold and get silenced. Students would
find the /bin viruses just fine but soon notice the viruses in /usr/bin were
getting missed!

Of course they searched their code for an explanation how that folder was
special and found none. Their programs just stopped working for no logical
reason. Pure evil. This I think is how we won the bonus.

As an extra feature the rootkit would kernel panic your computer if you dared
to unload it. It did this by re-hooking the system call table in the rootkit's
unload handler. Once unloaded any write or open system call would segfault the
kernel and the game was up. Nothing like a real rootkit's anti-anti-virus
arsenal but it worked.

I did write up a post about the rootkit and the course if anyone wants more
details: [http://danieru.com/2013/12/28/cpsc-527-or-how-i-learned-
to-s...](http://danieru.com/2013/12/28/cpsc-527-or-how-i-learned-to-stop-
worrying-and-write-a-virus/)

------
owyn
My favorite stories are here -> [http://multicians.org/multics-
stories.html](http://multicians.org/multics-stories.html)

I used to work with this guy ->
[http://www.multicians.org/thvv/](http://www.multicians.org/thvv/)

He was really fantastic to work with, a great project manager, great stories,
and a good example of how to stay relevant in the tech industry for 40 years.

------
robot
For those who want to work on similar problems of interrupt handler debugging
and scheduler design, we have an opening at NVIDIA:
[https://news.ycombinator.com/item?id=7687317](https://news.ycombinator.com/item?id=7687317)

------
cheerio
This is an awesome blog post! Could the fact that the bug was stochastic have
to do something with multi-threading? Also, how do you use BIOS to zero out
the section of memory?

~~~
azernik
No, given the early stage of boot at which it crashed, no threading was
happening. The randomness was because the initial zeroing out of the kernel's
global and static variables might or might not happen, as a result of a
physically random process (electrical discharge), instead of being ensured by
software.

Most bootloaders (well, a BIOS usually refers to one step before the
bootloader, but still) have a pretty primitive command shell, through which
you issue the commands telling it how to load the initial kernel (e.g. from
storage, or over the network). My guess would be she had to add a line to the
boot script that zeroed out the relevant RAM; that, or rewrite the bootloader
and add a loop in machine code to zero out the memory.

~~~
vaurora
There was already code written to zero out the BSS shared across all the
bootloaders for PowerPC, the call to it had just gotten lost when our
enthusiastic fellow kernel dev rewrote bootloaders for platforms they couldn't
test. I assume I just added the call to the existing code back in.

------
wainstead
Tom Servo and Crow T. Robot argue Mac vs. PC:

[https://www.youtube.com/watch?v=ixQE496Pcn8](https://www.youtube.com/watch?v=ixQE496Pcn8)

------
CJefferson
ibisum: If you read this, all your posts are auto-marked [dead].

