Unfortunately, this (really nice) PDF isn't complete.
For me an Operating System means processes in userland; sadly, it falls short of that.
I'm now building my own x86 32-bit OS in Rust using [0][1].
There is definitely a shortage of information on this stuff (for example writing a USB driver is really hard and I couldn't find easily available tutorials - you have to resort to raw *HCI and USB specs to get things done).
It is, for the author's definition of "complete", which he gives explicitly early on in the paper.
> For me an Operating System means processes in userland
As the reference to the author's late professor in the introduction to the paper will show you, it hasn't always been so. Once upon a time, an "operating system" just meant "the code you have to write before you can do anything at all with this computer". That code could be anything from a special-purpose system for one particular task, to something like MS-DOS, which could run other programs but had no notion of "processes in userland" (because there wasn't any "userland", every byte of code ran on the bare metal), to the kind of multi-tasking system we now generally think of when we think "operating system", like Linux or Windows or Mac OS X.
> For me an Operating System means processes in userland; sadly, it falls short of that.
I'm no OS expert, beyond using several of them daily, but aren't there different categories of Operating Systems? The ones you say are "A Real OS" are specifically multitasking OSes, while there is also a class of OSes that only run a single task. Aren't those also real Operating Systems?
I would argue a single program that runs on bare metal to perform a specific purpose is not an "operating system". An "operating system" is a program that runs on bare metal (operates the system) and runs other programs. Those other programs are "userspace" regardless of the security or IPC model involved in communicating with the actual operating system.
Userspace typically makes people assume a security model, Ring 0/3, etc.
I think of an Operating System as something that provides a Hardware Abstraction Layer (HAL) and system calls to take advantage of that hardware. Drawing on the screen without knowing the details of the video card, file io without caring about IDE/SCSI, etc.
If it runs other programs but those programs still need to take direct control of the hardware then the OS is just a glorified boot loader.
Pick R83 is an (8086 real mode) operating system for the IBM PC. It runs a byte code virtual machine, and has a compiler for a dialect of BASIC which generates that byte code as output. Most programs were written in BASIC, although the system also had a byte code assembler and some system-level software was written in that instead. Its filesystem is rather primitive in some ways (it only has single-level directories, a directory cannot contain subdirectories), but at the same time supports key-indexed files, which was essential for the business database applications it was used to run. There was never (as far as I know) a C compiler, although in principle someone could have compiled C to the same bytecode that BASIC was compiled to. It is multi-tasking and multi-user, albeit its security is rather primitive by today's standards (as one might expect for a real mode 8086 operating system). Does that count as "running other programs"? If it does, just make your "single program" be an interpreter (Tcl, Lua, whatever you wish) and now you have a "real operating system".
(The byte code for R83's VM, called Pick assembly, came from Pick's earlier non-PC hardware systems, in which it was actually the hardware machine language, implemented by microcode; I believe the parts of the system which ran on top of the VM were a pretty direct port from those earlier platforms. R83 was succeeded by Advanced Pick, which supported both running as a standalone OS – AP/Native – or as a process on top of a host OS – AP/Unix and AP/DOS. AP/DOS was later succeeded by AP/NT for Windows NT, and Advanced Pick eventually evolved into D3, which is still around – now sold by Rocket Software. D3's virtual machine environment, VME, is a direct descendant of the same byte code VM, although it now supports much deeper integration with the AIX/Linux/Windows host platform than the original hosted AP versions did. Early MUMPS systems were similar in being hosted directly on the hardware, but Pick as an operating system arguably lasted a decade or more longer than MUMPS as an operating system did.)
Sure, there is a gradient of complexity here: first you can just do your stuff in the kernel in a single thread, then you'll probably want to load programs and execute them in userspace [0], then you'll want multitasking and task scheduling, memory isolation, process synchronization, etc. You can call any of those steps an OS, of course; it's just that I personally find the earlier steps not that interesting.
Another great resource (one that actually inspired me to tinker with all this awesome stuff) is this book [1].
Slightly off-topic: does anyone know when the new edition of [1] is supposed to come out? It's been so long, and the articles are so well written; I'm eagerly waiting for their UEFI rewrite to be released...
From the PDF:
"It is true: in our quest to make full use of the CPU, we must abandon all of those helpful routines provided by BIOS. As we will see when we look in more detail at the 32-bit protected mode switch-over, BIOS routines, having been coded to work only in 16-bit real mode, are no longer valid in 32-bit protected mode; indeed, attempting to use them would likely crash the machine.
"So what this means is that a 32-bit operating system must provide its own drivers for all hardware of the machine (e.g. the keybaord, screen, disk drives, mouse, etc). Actually, it is possible for a 32-bit protected mode operating system to switch temporarily back into 16-bit mode whereupon it may utilise BIOS, but this teachnique can be more trouble than it is worth, especially in terms of performance."
--
In the toy 32-bit OS I am currently writing, having easy disk access and text output was more important to me than performance, so I decided to implement this technique to access the disk and screen via the BIOS instead of writing an ATA disk driver.
Although I could not find any minimal yet complete working examples of dropping to 16 bit and later resuming 32 bit mode, I was able to piece it together and write assembly functions called enter_16bit_real and resume_32bit_mode. See https://github.com/ironmeld/builder-hex0/blob/main/builder-h.... Those routines are working well but beware the project is a volatile work in progress and is coded in hex for bootstrapping reasons.
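For anyone attempting the same trick, here is a rough sketch of the shape of that round trip in NASM syntax. It is not the code from the repo above, and the selectors are assumptions: 0x08/0x10 for the 32-bit code/data segments, 0x18/0x20 for 16-bit ones, with paging off and everything below 1 MiB:

    [bits 32]
    call_bios:
        cli
        ; if a protected mode IDT was installed, reload the real-mode
        ; IVT (base 0, limit 0x3FF) with lidt before enabling interrupts
        jmp 0x18:.pm16              ; enter a 16-bit protected mode code segment
    [bits 16]
    .pm16:
        mov ax, 0x20                ; 16-bit data/stack while still in PM
        mov ds, ax
        mov ss, ax
        mov eax, cr0
        and al, 0xFE                ; clear PE: back to real mode
        mov cr0, eax
        jmp 0:.real                 ; reload CS with a real-mode segment
    .real:
        xor ax, ax
        mov ds, ax
        mov ss, ax
        mov sp, 0x7C00
        sti
        ; ... issue BIOS calls here, e.g. int 0x13 for disk reads ...
        cli
        mov eax, cr0
        or al, 1                    ; set PE again
        mov cr0, eax
        jmp 0x08:.pm32              ; far jump back into the 32-bit code segment
    [bits 32]
    .pm32:
        mov ax, 0x10                ; restore the 32-bit data segments
        mov ds, ax
        mov ss, ax
        ; restore esp and the protected mode IDT before returning
        ret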
Windows 3.1, 95, 98, and Me all did exactly the thing described as "more trouble than it is worth"; they are 32-bit protected mode operating systems[0] that make frequent context switches to 16-bit mode and utilize BIOS functions.
[0] Yes, I know, Windows 3.1 applications were almost always 16-bit unless you installed Win32s. It's somewhat complicated.
I guess in their case they had so much already-written code working with BIOS calls that it was worth adding the shim layer. But if you were to start tomorrow with no legacy code, it would be more trouble than it's worth.
If you are running your OS on a VM hypervisor (e.g. Qemu, kvmtool), then you can just use virtio for disk and network I/O. No boring and complex device specification (e.g. ATA) to implement.
Yeah, I'm avoiding modern interfaces due to my peculiar bootstrapping requirements (e.g. must run on old HW; no SW dependencies). virtio is probably a better choice for anyone trying to build something practical. On the other hand, it looks like using virtio would need a lot more code than accessing the BIOS and I couldn't easily find examples in assembly.
Page 37 says that pipelining is the reason we need to jmp after setting 32bit mode in CR0, but that's not true. The execution mode is part of the CS descriptor, and setting CR0 merely means that "on next segment (CS:EIP) jump" we should treat the segment part not as a 16bit segment, but as a segment descriptor selector.
Otherwise, dude, that would be a race against the pipelining with undefined behaviour.
So no, pipelining is an implementation detail you STILL don't have to care about.
Like, do you know the instruction pipeline length on a 386SX? No, because you don't have to. There's no rush to jump. (though what else are you going to do?)
Actually, you could continue executing 16bit code, loading only DS/ES with 32bit selectors. Why, I don't know, but you could.
> a near jump[…] may not be sufficient to flush the pipeline
Nope, that's not it. You are literally jumping into a 32bit address space by loading the descriptor. The descriptor is not loaded until you load a new selector into CS by using a far jump.
Mind you, this is from memory when I made my OS in high school. Back when there was a lot of moving floppies back and forth to iterate. :-/
This probably no longer applies in any way to modern processors, but the 286 and 386 did have a "prefetch queue" (containing opcode bytes fetched from CS:eIP), as well as a "decoded instruction queue".
Up to 3 instructions could be pre-decoded and stored in that queue, while the execution unit was still busy with some other instruction (keep in mind that back in the day it was multiple clock cycles per instruction, not multiple instructions per cycle!)
Each decoded instruction included a microcode entry point address, depending on the opcode and the state of CR0 at the time of decoding. On the 486 this was further optimized to actually store the first micro-op in that queue.
If you set the "Protect Enable" bit in CR0, it does not affect instructions that have already been through the decode stage, so they would not run with protected mode semantics. A far jump would actually not work in that case, because it would load the CS base with (selector << 4) instead of going through the descriptor table. So you should first do a near jump to flush the pipeline, and only then jump to the protected mode CS.
In practice, you could get away with not doing this step, depending on the exact timing of the code used to switch to protected mode. A lot of instructions are short enough that the processor never has enough time to pre-decode anything following.
The Intel manuals did (and still do?) document you have to do this near jump first, though not the exact details of why.
But the near and far jump (if you use both) still have VERY different purposes. And the CPU is not really in 32bit mode until all the segment descriptors have been loaded manually.
Here we see the near instruction-flushing jump, then a bunch of code (that is still in 16bit mode!), and then the long jump that switches the instruction set to 32bit.
The comments in that last link explain it well, too.
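To make the two jumps concrete, here is a minimal sketch of the classic sequence in NASM syntax, with a flat GDT like the one the PDF builds (the stack address is an assumption):

    [bits 16]
    CODE_SEG equ 0x08               ; offsets of the descriptors below
    DATA_SEG equ 0x10

    switch_to_pm:
        cli
        lgdt [gdt_descriptor]
        mov eax, cr0
        or eax, 1                   ; set PE (bit 0 of CR0)
        mov cr0, eax
        jmp short .flush            ; near jump: discard anything pre-decoded
    .flush:                         ;   with real-mode semantics
        jmp CODE_SEG:pm_entry       ; far jump: load CS from the descriptor table
    [bits 32]
    pm_entry:
        mov ax, DATA_SEG            ; CS is 32-bit now; reload the data segments
        mov ds, ax
        mov es, ax
        mov ss, ax
        mov esp, 0x90000            ; assumed-free memory for the stack
        jmp $                       ; ... 32-bit code from here on ...

    gdt_start:
        dq 0                        ; null descriptor
        dw 0xFFFF, 0                ; code: limit 0..15, base 0..15
        db 0, 10011010b, 11001111b, 0 ; base 16..23, access, flags+limit, base 24..31
        dw 0xFFFF, 0                ; data: same flat 4 GiB range
        db 0, 10010010b, 11001111b, 0
    gdt_descriptor:
        dw $ - gdt_start - 1        ; GDT size minus one
        dd gdt_start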
The point is that the CPU might still be in real mode at the point in time when the far jump is decoded, and thus execute it with real mode semantics. Coding a seemingly useless near jump was the way to prevent that.
Whoever wrote the modern Linux code doesn't seem to understand any of this. There is even a jump before the switch to protected mode, with the comment "Short jump to serialize on 386/486". Cargo-cult programming at its finest :)
As I said before, it may have always worked in practice without the jump, and modern Linux doesn't even support the 386 anymore.
Actually, with all the crazy out-of-order speculative execution going on, one could expect newer CPUs to have similar requirements, but I guess they also had to get better at providing the illusion that such internal details don't matter. Making LMSW / MOV CR0 a serializing instruction seems to be an easy way to do it.
> the CPU might still be in real mode at the point in time when the far jump is decoded,
It definitely is in real mode when the far jump is decoded. Or is that not the right technical definition of "real mode"?
Do we leave real mode as soon as CR0 bit 0 (PE) is set, even though nothing has changed until the code puts a new value into a segment register? I've always thought of it as when CS becomes 32bit, meaning the long jump.
Anyway, that's just terminology.
> "Short jump to serialize on 386/486". Cargo-cult programming at its finest :)
>Do we leave real mode as soon as CR0 bit 0 (PE) is set
Yes, that's what the bit is defined as doing. Protected mode was introduced on the 16-bit 80286, and 16-bit code segment descriptors are still supported to this day.
Even in real mode, the segment registers have hidden fields containing base, limit and access rights. The difference is that when they get loaded in real mode, the base will be set to the segment shifted left by 4 bits, with the limit¹ and access rights² generally left unchanged. What the PE bit actually affects is how segment load instructions operate, how interrupts/exceptions are handled, and a few other details.
Most instructions run exactly the same microcode in either mode: any memory access will form the address and do protection checks based on whatever is currently in those hidden fields. But a segment load (or far jump) decoded while PE=0 will execute different microcode than one with PE=1.
>That's weird. I tracked down the commit
Shuffling the code around may have fixed some alignment bug? The jump could likely be replaced by two NOPs; in any case, the comment is completely wrong.
2.6.22 seems to be the last version using LMSW followed by a near jump, and presumably worked on that CPU (at least there is a comment mentioning bugfixes for Elan), so it isn't likely to be the cause of the problem.
¹ the limit on power-on/reset is 64K, but it is possible to change it to 4G, allowing access to all memory ("unreal mode")
² CS will always be made a writable data segment, something not possible to set up in protected mode without the use of LOADALL
edit:
>Interesting. The new Intel docs explicitly say to do the far jump "Immediately" after setting the PE bit in CR0.
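To illustrate footnote ¹: a sketch of the classic "unreal mode" trick in NASM syntax. Set PE just long enough to load DS from a 4 GiB data descriptor, then drop back to real mode; the enlarged hidden limit survives, precisely because segment loads in real mode leave it alone:

    [bits 16]
    enter_unreal:
        cli
        push ds
        lgdt [ugdt_descriptor]
        mov eax, cr0
        or al, 1                    ; set PE
        mov cr0, eax
        jmp short .pm               ; the instruction-flushing near jump again
    .pm:
        mov bx, 0x08                ; selector of the 4 GiB data descriptor
        mov ds, bx                  ; hidden limit of DS becomes 4 GiB
        and al, 0xFE                ; clear PE
        mov cr0, eax
        pop ds                      ; visible value restored; hidden limit stays
        sti
        ret                         ; DS can now reach all memory with a32 moves

    ugdt_start:
        dq 0                        ; null descriptor
        dw 0xFFFF, 0                ; data: limit 0..15, base 0..15
        db 0, 10010010b, 11001111b, 0 ; access, flags (G=1, limit 16..19 = 0xF)
    ugdt_descriptor:
        dw $ - ugdt_start - 1
        dd ugdt_start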
This PDF brings back memories. I went down this path a little over a year ago—it's great fun. I didn't end up finishing anything significant, but I'd highly recommend everyone give writing a kernel or an OS a shot sometime. You will learn a whole lot.
If you’re interested, the OSDev Wiki [0] is a great place to start, as are Bran’s Kernel Development tutorials, which I can’t seem to find on mobile right now.
Nowadays, I would recommend not using BIOS booting. Modern machines with UEFI firmware only emulate BIOS anyway, for backwards compatibility with old bootloaders.
UEFI gives you a much nicer initial environment — it's already in 64-bit long mode, for example, unlike BIOS, which drops you into 16-bit real mode, with a maximum of one sector in which to perform all the gymnastics you need to do to load the rest of your program and switch to _at least_ 32-bit protected mode (notice how this PDF doesn't even go to 64-bit long mode).
UEFI also gives you better, easier-to-use boot facilities than BIOS: you can access a memory map, read from the boot file system, switch video modes, and even access network drivers. And you can access all of this from long mode, unlike BIOS, which only works in real mode. Then, once you've booted, you can take over the machine by exiting boot services, and have full control just like you want for making an operating system.
If anyone is interested, I have a couple of implementations of booting under UEFI and getting a bunch of info about the system (don't expect a functioning system, they just boot and dump some info):
You sound very knowledgeable about UEFI - it is intimidating, since the spec is so big. It also... concerns me that so much code is running prior to my own, even if it is for a good purpose. I get that CPU mode selection and boot device initialization isn't something most programmers want to deal with, but doesn't putting it in UEFI firmware put it out of reach of most programmers? Should we not aim to put as little as possible into firmware?
First off, I don't recommend implementing the specification from scratch. It is big, and there are implementations already in various languages: for C you can use gnu-efi, for example, Rust has uefi-rs, and Zig has support in its standard library. These take care of the fiddly details of interacting with the UEFI firmware's services.
I don't think the role of firmware requires it to be minimal. It's to provide a consistent boot environment on different hardware, and to provide useful services to the program. Even BIOS-based machines have significant amounts of code that run before they jump to the boot sector, so while UEFI is definitely more sophisticated I don't really see this as a drawback to it.
I suppose it depends what you're going for, but in general I don't think being able to fully support some very old PCI devices should be the determining factor.
Tangentially, does anyone remember “Developing Your Own 32-bit Operating System” (1995) by Richard Burgess? I was absolutely fascinated by this back in the day. Misplaced some of my computer books (Design Patterns, this, a few others) on my travels, so I don't have it to hand any more. Was gob-smacked that it took what seemed like mere nanoseconds to boot on bare x86 hardware.
I have this from a decade ago, when I was trying to build my own OS. I think Tanenbaum's book has probably stood the test of time more, but it was a great resource regardless.
The boot sector is only 512 bytes, but if you're using Asm, that's actually more than enough for some nontrivial functionality. Here are a few of the many examples of "what fits in a boot sector" that have appeared on HN:
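To give a flavour of the genre, here is a minimal sketch (not one of the linked examples) of a complete boot sector that prints a string through the BIOS int 0x10 teletype service:

    [bits 16]
    [org 0x7C00]                    ; the BIOS loads the boot sector here
    start:
        xor ax, ax
        mov ds, ax                  ; DS = 0 so [si] addressing works
        mov si, msg
        mov ah, 0x0E                ; int 0x10 / AH=0Eh: teletype output
    .print:
        lodsb                       ; next byte of the string into AL
        test al, al
        jz .halt
        int 0x10
        jmp .print
    .halt:
        hlt
        jmp .halt
    msg:
        db "Hello from the boot sector!", 0

        times 510 - ($ - $$) db 0   ; pad to 510 bytes
        dw 0xAA55                   ; boot signature

Assemble with nasm -f bin boot.asm -o boot.img and run it with qemu-system-i386 -drive format=raw,file=boot.img.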
I tried writing a simple OS decades ago as an EXE you'd run from MSDOS. It would rewrite all the ISR vectors and take over all the memory including the space that was previously used by MSDOS. I was told that because I relied on DOS, whatever I wrote was not an operating system. I politely disagreed.
LOADLIN could load Linux from DOS. And being a .EXE, it relied on DOS to bootstrap.
So yes, if you didn't use the DOS code (I mean… you overwrote it), then DOS is not there. And if DOS is not the OS, that just leaves your code to be the OS. :-)
See also both Windows 3.1 and the “DOS extenders” used to build applications that were launched from DOS but ran in protected mode with additional facilities.
I notice this pdf and most of the links in this discussion are for x86 architectures, and deal a lot with BIOS, getting out of 'real' mode, etc.
Are there any such resources for ARM (preferably ARM64) that would cover similar "Get a basic OS up and running" topics that you could follow on a device or emulator?
I found another course about writing a kernel for the ARM architecture: https://www.udemy.com/course/raspberry-pi-write-your-own-ope.... I did not take this course, but I took another course on the x86 architecture from the same authors, and it was pretty good. I have already written a comment elsewhere in this thread about the course I took; look for that comment of mine. The disadvantage is that it is paid. Sorry for the advertising again.
Recent x64 AMD and Intel PCs use UEFI secure boot and require signed boot images, so I suspect this simple approach won't boot out of the box anymore. Same for the Mac M1. So some update would be useful, without requiring UEFI secure boot to be disabled.
I’ve never come across an amd64 system where secure boot couldn’t be disabled in the BIOS (as you’d do when installing Linux) - do you have any examples of this?
Of course you can switch secure boot off. I am just interested in a solution that boots with security on: just out of curiosity, this is about learning after all.
You can't sign your custom kernel; only Windows can be signed, and signing has to be disabled for Linux and other OSes.
UEFI offers some basic IO functionality, it's like BIOS in that regard. You don't design your OS around it, you use it to have easy graphics, networking etc during boot time, then you switch to your own drivers and don't touch UEFI again.
I really enjoyed doing the OS course at UoB. Awesome to see a UoB paper trending on Hacker News. This was also one of the easiest-to-follow “tutorials” on making an OS from scratch I've followed.
Sortix is a small self-hosting operating system aiming to be a clean and modern POSIX implementation.
It is a hobbyist operating system written from scratch, with its own base system including the kernel and standard library, as well as ports of third-party software. It has a straightforward installer and can be developed under itself. Releases come with the source code in /src, ready for tinkering.
This looks like a nice guide. I had a couple of questions.
When reading the "entering protected mode" section, I started to question my own understanding of some of the details. My first question is:
When a CPU starts up in real mode, it's running in a kind of privileged state, since there are none of the 4 protection rings that protected mode gives us. We then build a GDT and switch the CPU to protected mode. In practical terms, where does this GDT-building and mode-switching jump code live? Is it in GRUB or whichever stage 2 bootloader is in use?
My second question was about the "chicken and egg" situation of how the kernel code inherits the privileged address space marked with ring 0 bits in the GDT. Is it magic that, once the GDT has been defined and we subsequently decompress the kernel into those memory addresses, the kernel memory will have ring 0 bits in its descriptors?
This is how I've understood it but I've often found the texts about this a little murky and this part still feels a bit like magic to me I guess.
For the first question, it can be done in various ways. It can be done by the firmware or the bootloader for you, or you can do it by yourself once the platform jumps to your kernel code upon machine startup.
For the second question, you just build your GDT entries with the highest privilege, then you switch to protected mode.
I understand that you build your GDT entries and mark them all with the privilege bits set to 0. However, at that point your kernel is still sitting as a compressed image on disk. This is the disconnect I seem to have: we have an in-memory data structure, the GDT, and we have a kernel image that's still on disk. How are those GDT entries allocated to the kernel?
As you observed before, real mode is also privileged. Both before and immediately after enabling protected mode, the code segment you are running from is already "ring 0" (even if it wasn't loaded from a GDT descriptor), and so it is allowed to transfer control to the "proper" kernel code segment.
Thank you. For clarification: is that same code segment from real mode sort of automatically "remapped" to a descriptor in the GDT after the CPU transitions into protected mode? Is that the correct mental model?
The OS (or loader) is responsible for setting up descriptors and managing memory. In 64-bit and most 32-bit kernels, segmentation is basically not used anymore, and the descriptors are just flat address spaces (all start at 0, no limit) with different privilege levels.
Example of 32 bit flat GDT:
0000 (null descriptor; reserved by hardware and can't be used)
0008 kernel code (ring 0)
0010 kernel data (ring 0)
0018 user code (ring 3)
0020 user data (ring 3)
0028 TSS (hardware defined structure)
Isolation between processes would then be done through the page mapping mechanism. This is kind of hard to wrap your head around at first, but makes it much easier to allocate memory. One big problem with using segments instead of fixed-size pages is that when you need to free them in some random order, memory becomes fragmented.
But you can also have segments in addition to paging, which is helpful for transitioning from/to 16-bit mode.
Sorry "remapping" was a poor choice of words on my part.
What I meant was the code segment register gets reloaded to some offset in that GDT that has those protection bits set to 0. According to your flat address space map above that would be at offsets 0008 and 0010. That makes a lot of sense, thanks.
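For concreteness, a sketch of how that flat layout might be declared in NASM syntax, using the standard 8-byte descriptor encoding. Note that all the segments cover the same 0..4 GiB range and differ only in type and DPL; the TSS entry is a stub that must be filled in at runtime with the address of a real TSS:

    gdt:
        dq 0                        ; 0x00 null descriptor
    .kcode:                         ; 0x08 kernel code, ring 0 (access 0x9A)
        dw 0xFFFF, 0                ; limit 0..15, base 0..15
        db 0, 10011010b, 11001111b, 0
    .kdata:                         ; 0x10 kernel data, ring 0 (access 0x92)
        dw 0xFFFF, 0
        db 0, 10010010b, 11001111b, 0
    .ucode:                         ; 0x18 user code, ring 3 (DPL bits = 11)
        dw 0xFFFF, 0
        db 0, 11111010b, 11001111b, 0
    .udata:                         ; 0x20 user data, ring 3
        dw 0xFFFF, 0
        db 0, 11110010b, 11001111b, 0
    .tss:                           ; 0x28 TSS descriptor (base/limit set at runtime)
        dq 0
    gdt_descriptor:
        dw $ - gdt - 1              ; size minus one
        dd gdt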
I know there must be, but are there examples of people using the Linux kernel as a base but having a radically different userland from traditional Linuxes? I feel like all the Gnome and KDE distros feel really samey, whether they are Arch or Debian based (this isn't necessarily a bad thing!).
This looks cool for x86. Somewhat along these lines, is there a piece of software that lets you emulate chips, connections and custom built systems?
I’ve been trying to play with building an OS for riscv (and / or maybe pic32) using qemu, but I’d like to simulate my own display, keyboards and other components.
Basically a digitised bread board where you can send compiled code to emulated chips and run.
At the core of almost every 8-bit computer game there is a cooperative multitasking kernel, without memory protection (because, of course, there is none), running multiple little tasks - updating the screen, reading inputs, moving characters, checking if someone exploded, clicking the speaker to make some sound, etc.
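In sketch form, such a "kernel" is often nothing more than a fixed round-robin of subroutine calls, each of which must return quickly - that is the "cooperative" part (x86-flavoured assembly for illustration, with stubs standing in for the real routines; the originals were of course 6502/Z80 and the like):

    game_loop:
        call read_input             ; each "task" runs to completion and returns;
        call move_characters        ;   one that loops forever hangs the machine
        call check_collisions
        call update_sound
        call draw_screen
        jmp game_loop

    read_input:        ret          ; stubs for the real routines
    move_characters:   ret
    check_collisions:  ret
    update_sound:      ret
    draw_screen:       ret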
[0] https://wiki.osdev.org/Expanded_Main_Page
[1] https://os.phil-opp.com/