Performance was the big problem. (At one point, a disk read was CPU-bound.)
1. The Mach Interface Generator (mig) generated code with wildly different performance, with no obvious relationship to the mig spec.
2. Context switching is always the big topic, but it's a red herring. Our problem was primarily data transfer; copy-on-write took so long to set up that it was frequently cheaper to just do the copy.
3. Making a syscall to get the current PID is a stupid idea.
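For what it's worth, the classic userspace fix is to cache the PID and invalidate the cache on fork, which various libc implementations have done over the years. A rough sketch, with made-up names (fast_getpid etc.) for illustration:

    #include <pthread.h>
    #include <sys/types.h>
    #include <unistd.h>

    static pid_t cached_pid;              /* 0 means "not fetched yet" */

    /* Runs in the child after fork(), where the cached PID is stale. */
    static void drop_pid_cache(void) { cached_pid = 0; }

    __attribute__((constructor))
    static void init_pid_cache(void) {
        pthread_atfork(NULL, NULL, drop_pid_cache);
        cached_pid = getpid();            /* one syscall, up front */
    }

    /* Every later call is a memory read, not a trip into the kernel. */
    pid_t fast_getpid(void) {
        return cached_pid ? cached_pid : (cached_pid = getpid());
    }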
Right. QNX just copied for interprocess communication. Mach did lots of messing with the MMU to create temporary shared pages. That seemed to be a lose.
* What you want for interprocess communication in a microkernel is something that works like a function call: send data, wait for reply. If you have to build that out of unidirectional messages, it means more trips through the CPU dispatcher, and probably going to the end of the line waiting for a turn at the CPU.

Call-like approach: process A calls process B, control transfers immediately from A to B; B returns, control transfers immediately back to A. There's no need to look for the next task to run; who runs next is a no-brainer.

Pipe-like approach: A sends to B, both are active for a moment, A soon blocks on a lock; B wakes up and does its thing, B sends to A, both are active for a moment, B soon blocks on a lock. A and B fight for the CPU.
The "going to the end of the line" effect is that when other tasks want the CPU, the pipe-like approach means other tasks get a chance to run during the handoff. The effect is that message passing performance falls off a cliff when you're CPU-bound. QNX used a call like approach (MsgSend/MsgReceive), while Mach used something more like pipe I/O.
* Mach started from the BSD code base. Trying to build a microkernel by hacking on a macrokernel didn't end well. Microkernels are all about getting the key primitives at the bottom working very fast and reliably. There was an eventual rewrite, but I gather that BSD code remained.
There's an alternative - unbuffered pipes. This is like a Go channel of length 0, where a send blocks until a receiver is ready to take the value. This is better from a CPU-dispatching perspective, in that a write to the channel implies an immediate transfer of control to the receiver. Of course, Go is doing this in one address space, not across a protection boundary.
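To make the rendezvous semantics concrete, here's roughly what a zero-capacity channel looks like when built by hand with pthreads; a sketch only, in one address space like Go, and the chan0 names are invented:

    #include <pthread.h>
    #include <stdio.h>

    /* A zero-capacity "channel": the sender blocks until a receiver has
     * actually taken the value - a rendezvous, not a buffer. */
    typedef struct {
        pthread_mutex_t mu;
        pthread_cond_t  cv;
        int value;
        int full;                        /* 1 while a value awaits pickup */
    } chan0;

    static chan0 c = { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0, 0 };

    static void chan0_send(chan0 *ch, int v) {
        pthread_mutex_lock(&ch->mu);
        while (ch->full)                 /* wait out any earlier value */
            pthread_cond_wait(&ch->cv, &ch->mu);
        ch->value = v;
        ch->full = 1;
        pthread_cond_broadcast(&ch->cv); /* wake the receiver */
        while (ch->full)                 /* block until it was taken */
            pthread_cond_wait(&ch->cv, &ch->mu);
        pthread_mutex_unlock(&ch->mu);
    }

    static int chan0_recv(chan0 *ch) {
        pthread_mutex_lock(&ch->mu);
        while (!ch->full)
            pthread_cond_wait(&ch->cv, &ch->mu);
        int v = ch->value;
        ch->full = 0;
        pthread_cond_broadcast(&ch->cv); /* unblock the sender */
        pthread_mutex_unlock(&ch->mu);
        return v;
    }

    static void *receiver(void *arg) {
        printf("got %d\n", chan0_recv(&c));
        return NULL;
    }

    int main(void) {
        pthread_t t;
        pthread_create(&t, NULL, receiver, NULL);
        chan0_send(&c, 42);  /* does not return until receiver took 42 */
        pthread_join(t, NULL);
        return 0;
    }

Note this gives you the blocking semantics but not the guaranteed direct handoff; a general-purpose scheduler still picks who runs next, which is exactly why it matters whether the kernel primitive itself transfers control.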
The QNX approach worked well for service-type requests. You could set up a service usable by multiple processes. Each request contained the info needed to send the reply back to the caller, so the service didn't have to open a pipe to the requestor. It even understood process priority, so high-priority requests were serviced first, an essential feature in a hard real-time system.
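For the curious, this is the shape of that API on QNX Neutrino, sketched from memory with error handling stripped; a server thread in the same process stands in for what would normally be a separate server process:

    #include <sys/neutrino.h>
    #include <pthread.h>
    #include <stdio.h>

    static int chid;

    static void *server(void *arg) {
        char req[64];
        const char rep[] = "pong";
        for (;;) {
            /* Blocks until a client sends. The rcvid identifies that
             * specific client, so one server can field requests from
             * many processes without opening anything per-requestor. */
            int rcvid = MsgReceive(chid, req, sizeof(req), NULL);
            if (rcvid < 0)
                break;
            MsgReply(rcvid, 0, rep, sizeof(rep)); /* unblocks the client */
        }
        return NULL;
    }

    int main(void) {
        chid = ChannelCreate(0);
        pthread_t t;
        pthread_create(&t, NULL, server, NULL);

        /* pid 0 = this process; a real client names the server's pid. */
        int coid = ConnectAttach(0, 0, chid, _NTO_SIDE_CHANNEL, 0);
        char rep[64];
        /* The "function call": send, hand control to the receiver,
         * block until MsgReply. No extra pass through the dispatcher. */
        MsgSend(coid, "ping", 5, rep, sizeof(rep));
        printf("reply: %s\n", rep);
        return 0;
    }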
Read the second sentence; process A does the write+select on AB and [BA,...], yields to B, is not ready to run; B generates a reply and writes it to BA, which has a read (technically select) already pending.
Why is the Mac so much better under low memory conditions than Linux? Is it the kernel, and if so, is there an inherent trade-off between low memory performance and other kinds of performance?
If there is a trade-off, would reworking the Linux kernel to function better under low memory conditions also create a way forward for a non-dbus, non-systemd, yet modern Linux?
Kind of off-topic, but, every once in a while, I think of Workplace OS and how it was this mystical product that was going to be The Future (especially to us OS/2 nerds!) and how it nearly never even gets mentioned anymore these days. I'd love to read the reminiscences of you or your team members.
Wikipedia says OS/2 for PPC shipped with "IBM Microkernel 1.0" (based on Mach), with plans for an "IBM Microkernel 2.0" which never shipped? Was 2.0 planned as a major change from 1.0, or just an incremental evolution?
I assume DOS and Windows compatibility would come from the features in OS/2. My primary task was to write and run benchmarks comparing Workplace OS personalities to native code. Primarily OS/2 and AIX, which were pretty far along, but later Windows NT. Classic MacOS and OS/400 were mentioned, but I never saw any work on them. Taligent was dead as a doornail by that time (I did ride that project into the ground), by which I mean it had been converted to a C++ utility library.
We were just benchmarking and telling the devs in Boca Raton, "Don't do that. No, don't do that. Here's how you do it. Stop it." I don't know about the roadmap beyond the initial personalities.
Was OS/2 for PPC (are you sure it shipped?) well known for being hideously slow?
Well, apparently it did: http://www.os2museum.com/wp/os2-history/os2-warp-powerpc-edi...
Some "abandonware" website is even offering it for download: https://winworldpc.com/product/os-2-3x/30-powerpc-edition
> well known for being hideously slow?
I don't think it was "well-known" for anything :) But, not having used it myself, the blog post I cite above says the performance was "surprisingly good" and that "all things considered, responsiveness quite good for a 100MHz CPU"
IBM had already discontinued the hardware, so it probably only existed so IBM could say IBM is always true to its word or whatever.
That sounds impressive. Could you say more about how/why?
2. Someone used the wrong magic keywords in the mig spec, causing poor message send-receive structure and code.
3. The same someone, IIRC, set up the copy-on-write memory management page by page, rather than as one big buffer. The latter would still be slower than just copying, but geeze.
4. There was something wrong with the driver at the time; I never heard what.
5. In fairness, there was some overhead from our monitoring.
Something completely unrelated to Mach: what do you think of IBM then, IBM now, and IBM's future?
After spending some time with the OS/2 Workplace Shell usability group, I spent time with Taligent and then the Workplace OS. I'm having "DASD" flashbacks typing this. Then I left for grad school and UT Austin for a good ten years, going back to IBM to work on something called xCP and digital media encryption. That was mostly after IBM became a pure consulting company, and was a cluster fuck, too. (The entire Pervasive Computing division went down with me. My boss ended up in the Lotus division.)
Don't work for IBM. Don't buy IBM. After I get done here, I'm washing my cell phone out with Listerine.
This is still true, but to a lesser extent, on MacOSX (or at least was, a decade ago). I was writing HPC drivers for a cluster interconnect. So performance was critical. We had been using the BSD ioctl system to communicate with our drivers because we used ioctls in all our other drivers (Linux, Solaris, FreeBSD, Windows). I did some microbenchmarks and noticed that it was far slower than FreeBSD or Linux ioctls & complained. Apple suggested that I re-write the app/driver communication using IOKit, which is Mach based. The result was something that was twice as slow.
Trivial Mach IPC round-trip (same process) seems to be about 8.5µs in my test, so around 4µs per mach_msg call. I'm sure if you transferred port rights, memory, etc., the cost would go up.
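If anyone wants to reproduce that kind of number, a minimal self-send loop looks something like this (macOS only; note it measures one kernel send plus one kernel receive per iteration on a single task, not a true two-task round trip):

    #include <mach/mach.h>
    #include <stdio.h>
    #include <time.h>

    int main(void) {
        mach_port_t port;
        if (mach_port_allocate(mach_task_self(), MACH_PORT_RIGHT_RECEIVE,
                               &port) != KERN_SUCCESS)
            return 1;
        if (mach_port_insert_right(mach_task_self(), port, port,
                                   MACH_MSG_TYPE_MAKE_SEND) != KERN_SUCCESS)
            return 1;

        struct {
            mach_msg_header_t header;
            mach_msg_trailer_t trailer; /* room for the receive trailer */
        } msg;

        const int iters = 100000;
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < iters; i++) {
            msg.header.msgh_bits = MACH_MSGH_BITS(MACH_MSG_TYPE_COPY_SEND, 0);
            msg.header.msgh_size = sizeof(mach_msg_header_t);
            msg.header.msgh_remote_port = port;   /* send to ourselves */
            msg.header.msgh_local_port = MACH_PORT_NULL;
            msg.header.msgh_voucher_port = MACH_PORT_NULL;
            msg.header.msgh_id = 0;
            /* One kernel send plus one kernel receive per iteration. */
            if (mach_msg(&msg.header, MACH_SEND_MSG | MACH_RCV_MSG,
                         sizeof(mach_msg_header_t), sizeof(msg), port,
                         MACH_MSG_TIMEOUT_NONE, MACH_PORT_NULL) != KERN_SUCCESS)
                return 1;
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 +
                    (double)(t1.tv_nsec - t0.tv_nsec);
        printf("%.0f ns per send+receive\n", ns / iters);
        return 0;
    }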
edit: I would advise anyone to run your own tests. Anecdotes aren't data and some things like syscalls are not nearly as expensive on modern hardware as they once were.
Sometimes things can get ridiculously faster than you remember. For example, the original iPhone took nearly 200ns to do objc_msgSend(). In 2016, modern hardware did the same thing in 2.6ns. That's two orders of magnitude improvement. So saying "message passing is slow" is not a correct statement... it's almost as cheap as a C++ virtual method call.
Point of reference: I work in HFT, and it's possible to process a market data update from an exchange, reprice a complex financial instrument, decide whether to place an order, and send that order back towards the exchange all in under 5µs. Another point of reference: raising one float to the power of another (a^b) could be done over 100,000 times in the time it takes Mach to do a single syscall.
The rest of your post seems OK, but this claim is almost certainly off. 2000 float^float operations takes a full second? Even on a 486DX, no way.
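In the "run your own tests" spirit above, the pow side of that comparison is easy to put a number on (compile with -O2 and link with -lm; the volatile accumulator and drifting input keep the compiler from deleting or hoisting the call):

    #include <math.h>
    #include <stdio.h>
    #include <time.h>

    int main(void) {
        const int n = 10 * 1000 * 1000;
        volatile float acc = 0.0f;
        float a = 1.0001f, b = 1.37f;

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < n; i++) {
            acc += powf(a, b);
            a += 1e-7f;  /* vary the input so the call can't be hoisted */
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 +
                    (double)(t1.tv_nsec - t0.tv_nsec);
        printf("%.1f ns per powf (acc=%f)\n", ns / n, acc);
        return 0;
    }

On anything recent you should see somewhere around tens of nanoseconds per call; at that rate, 100,000 of them only fit inside a single syscall if that syscall costs milliseconds.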
Presumably for optimization reasons, arbitrary pointers (rather than checked messages) are passed around between different parts of the kernel. And that exposes some quite bad security issues from time to time.
AFAIK, they still don't support 15-year-old technologies like MSI-X that permit efficient multi-queue network drivers.
Have you ever built a project by hand on MacOS and then on Linux (or FreeBSD)? Have you noticed how absurdly, painfully slow it is running autoconf on MacOSX? That's because MacOS system calls are horrifically slow compared to Linux / BSD.
Apple provides posix_spawn, which is much, much faster. Running /usr/bin/false 1000 times in the fish shell is nearly twice as fast (2.25s down to 1.25s) when using posix_spawn instead of fork, on the Mac.
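A minimal version of that measurement, assuming /usr/bin/false exists at that path (it does on macOS; on many Linux systems it's /bin/false):

    #include <spawn.h>
    #include <stdio.h>
    #include <sys/wait.h>
    #include <time.h>

    extern char **environ;

    int main(void) {
        char *argv[] = { "false", NULL };
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < 1000; i++) {
            pid_t pid;
            /* posix_spawn avoids duplicating the parent's address space
             * the way fork() does; on macOS it is a single syscall. */
            if (posix_spawn(&pid, "/usr/bin/false", NULL, NULL,
                            argv, environ) != 0)
                return 1;
            int status;
            waitpid(pid, &status, 0);
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        printf("%.2fs for 1000 spawns\n",
               (double)(t1.tv_sec - t0.tv_sec) +
               (double)(t1.tv_nsec - t0.tv_nsec) / 1e9);
        return 0;
    }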
Autotools doesn't use posix_spawn because there's no benefit on Linux, but it is what Apple's frameworks use internally for process launching.
Wait, autotools doesn't make syscalls at all, does it? I thought it was just shell and make.
Is the problem that /bin/sh and /usr/bin/make use fork and it would help to have a Bourne shell and GNU-compatible make that used posix_spawn?
We moved almost all development machines to Linux after someone demonstrated how much faster Linux built some C project compared to OS X and Windows.
To be fair, this may be down to many things and not just Mach syscalls.
It's fascinating how some things get their names :)
There's some delay between when a new macOS version comes out and when sources get published, but it's great to see how they use the Mach kernel in practice.
One can have said codebase effectively outsource functionality via function calls into closed-source libraries.
Google does something similar with Android, except in that case it's more a matter of coupling functionality to their services:
Up until a couple years ago, they'd strip ARM-specific things out of the released macOS XNU code. Then they started leaving that stuff in!
Firmware cores and AirPods run Apple RTKit.
It was done by Apple and OSF, and for a while was the only way to run Linux on NuBus Macs.
As announced at WWDC, it will be a progressive transition: for every new driver model supported as a user-space driver, the related kernel-space APIs will be deprecated and then removed in the following year's OS release.
Also, many BSD syscalls have been deprecated over the years, including POSIX features like the networking stack, now replaced by Objective-C APIs.
I would advise spending some time reading the "Mac OS X Internals: A Systems Approach" and "Mac OS X and iOS Internals" books, to learn that putting everything into kernel space alone doesn't turn the original microkernel code into a huge monolith.
Apple showed their long-term roadmap at WWDC for how they plan to purge all kernel drivers.
Graphics - I don't see much changing, but to be fair it's pretty nice already. The part in kernel space for the most part just controls the GPU's MMU; the meat of the driver runs in user space, for speed reasons, as shared libraries in the processes that are making GPU calls. It's sort of exokernel-like if you squint hard enough.
Filesystems - will probably be hybrid. I don't see APFS leaving kernel space, or anything your root partition would be on, but NFS, exFAT, NTFS? Yeah.
Networking is part of the first wave by the way.
Networking kernel drivers are now deprecated as you can easily read about here.
I appreciate there is a lot of grey area with microkernels, and a lot of hybrid designs these days, since as was the case with Mach/XNU/Windows NT, "pure" microkernel designs have often shown less than optimal performance due to additional context switching.
Even current macOS variants do plenty of message passing and sandboxing.
Judging from iOS games and real-time audio apps, it appears fast enough to me.
Mach and Hurd, on the other hand, store the messages in the kernel, prioritize them, and handle all mailbox messages without ever losing one.
This proved to be the wrong approach.
Lack of hardware acceleration, and the general clunkiness of X windows make Linux crappy in this regard for example. I _still_ get screen tearing in Ubuntu 18.04 if I don't run Wayland, which I don't because it breaks some apps I use.
A few of them benchmark L4Linux, where L4 is used not as a microkernel but as a hypervisor, which consequently proves nothing about the microkernel paradigm.
The top half was Dhrystones (CPU integer performance), the bottom was various syscalls.
The benchmark is its use as the radio OS on many handsets.
This usage goes against many of the security claims of microkernels. In particular, if Linux is big and insecure, running it in a VM doesn't really improve its security; it just makes sure it doesn't infect the rest of the system. For this type of benchmark, the "rest of the system" does not exist.
(I assume you meant the baseband processor in iPhone and Android. I don't think either use a clean microkernel and there isn't really a performance comparison available)
For me the benchmark that counts is "does it deliver in production", winning ms in laboratory micro-benchmarks is kind of useless.
Especially when so many are willing to waste those ms running Electron apps in userspace.
All the ones corresponding to the former IO Kit do require C++; all the remaining categories are going to be supported from Swift as well.
This is planned to take place across several releases, at the end of which no kernel drivers will be any longer allowed.
probably you mean 3rd-party drivers? or even apple internally?
i couldn’t tell that from the slides mentioned below, but maybe i missed something
> All the ones corresponding to the former IO Kit do require C++
yea, which leads me to believe if apple was to rewrite the kernel, they probably would go with c++ ... or maybe it’s just that swift doesn’t have its embedded chops up to snuff yet...
The long term roadmap is as follows:
1 - surface the kernel APIs for a specific driver model as userspace APIs
2 - deprecate, in that same OS release, the kernel entry points related to the newly surfaced driver model
3 - remove the kernel api on the following OS release
4 - rinse and repeat until there aren't any kernel driver APIs left
The ones being released with Catalina are just the first wave.
i’ll watch the video then, very interesting