Performance was the big problem. (At one point, a disk read was CPU-bound.)
1. The Mach Interface Generator (mig) generated code with wildly different performance, with no obvious relationship to the mig spec.
2. Context switching is always the big topic, but it's a red herring. Our problem was primarily data transfer; copy-on-write took so long to set up that it was frequently cheaper to just do the copy.
3. Making a syscall to get the current PID is a stupid idea.
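For what it's worth, the classic userspace fix is to cache the PID and invalidate the cache on fork, which various libc implementations have done over the years. A rough sketch, with made-up names (fast_getpid etc.) for illustration:

    #include <pthread.h>
    #include <sys/types.h>
    #include <unistd.h>

    static pid_t cached_pid;              /* 0 means "not fetched yet" */

    /* Runs in the child after fork(), where the cached PID is stale. */
    static void drop_pid_cache(void) { cached_pid = 0; }

    __attribute__((constructor))
    static void init_pid_cache(void) {
        pthread_atfork(NULL, NULL, drop_pid_cache);
        cached_pid = getpid();            /* one syscall, up front */
    }

    /* Every later call is a memory read, not a trip into the kernel. */
    pid_t fast_getpid(void) {
        return cached_pid ? cached_pid : (cached_pid = getpid());
    }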
Right. QNX just copied for interprocess communication. Mach did lots of messing with the MMU to create temporary shared pages. That seemed to be a lose.
* What you want for interprocess communication in a microkernel is something that works like a function call: send data, wait for reply. If you have to build that out of unidirectional messages, it means more trips through the CPU dispatcher, and probably going to the end of the line waiting for a turn at the CPU.

Call-like approach: process A calls process B, control transfers immediately from A to B; B returns, control transfers immediately back to A. There's no need to look for the next task to run; who runs next is a no-brainer.

Pipe-like approach: A sends to B, both are active for a moment, A soon blocks on a lock; B wakes up and does its thing, B sends to A, both are active for a moment, B soon blocks on a lock. A and B fight for the CPU.
The "going to the end of the line" effect is that when other tasks want the CPU, the pipe-like approach means other tasks get a chance to run during the handoff. The effect is that message passing performance falls off a cliff when you're CPU-bound. QNX used a call like approach (MsgSend/MsgReceive), while Mach used something more like pipe I/O.
* Mach started from the BSD code base. Trying to build a microkernel by hacking on a macrokernel didn't end well. Microkernels are all about getting the key primitives at the bottom working very fast and reliably. There was an eventual rewrite, but I gather that BSD code remained.
There's an alternative - unbuffered pipes. This is like a Go channel of length 0, where a send blocks until a receiver is ready to take the value. This is better from a CPU-dispatching perspective, in that a write to the channel implies an immediate transfer of control to the receiver. Of course, Go is doing this in one address space, not across a protection boundary.
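To make the rendezvous semantics concrete, here's roughly what a zero-capacity channel looks like when built by hand with pthreads; a sketch only, in one address space like Go, and the chan0 names are invented:

    #include <pthread.h>
    #include <stdio.h>

    /* A zero-capacity "channel": the sender blocks until a receiver has
     * actually taken the value - a rendezvous, not a buffer. */
    typedef struct {
        pthread_mutex_t mu;
        pthread_cond_t  cv;
        int value;
        int full;                        /* 1 while a value awaits pickup */
    } chan0;

    static chan0 c = { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0, 0 };

    static void chan0_send(chan0 *ch, int v) {
        pthread_mutex_lock(&ch->mu);
        while (ch->full)                 /* wait out any earlier value */
            pthread_cond_wait(&ch->cv, &ch->mu);
        ch->value = v;
        ch->full = 1;
        pthread_cond_broadcast(&ch->cv); /* wake the receiver */
        while (ch->full)                 /* block until it was taken */
            pthread_cond_wait(&ch->cv, &ch->mu);
        pthread_mutex_unlock(&ch->mu);
    }

    static int chan0_recv(chan0 *ch) {
        pthread_mutex_lock(&ch->mu);
        while (!ch->full)
            pthread_cond_wait(&ch->cv, &ch->mu);
        int v = ch->value;
        ch->full = 0;
        pthread_cond_broadcast(&ch->cv); /* unblock the sender */
        pthread_mutex_unlock(&ch->mu);
        return v;
    }

    static void *receiver(void *arg) {
        printf("got %d\n", chan0_recv(&c));
        return NULL;
    }

    int main(void) {
        pthread_t t;
        pthread_create(&t, NULL, receiver, NULL);
        chan0_send(&c, 42);  /* does not return until receiver took 42 */
        pthread_join(t, NULL);
        return 0;
    }

Note this gives you the blocking semantics but not the guaranteed direct handoff; a general-purpose scheduler still picks who runs next, which is exactly why it matters whether the kernel primitive itself transfers control.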
The QNX approach worked well for service-type requests. You could set up a service usable by multiple processes. Each request contained the info needed to send the reply back to the caller, so the service didn't have to open a pipe to the requestor. It even understood process priority, so high-priority requests were serviced first, an essential feature in a hard real-time system.
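For the curious, this is the shape of that API on QNX Neutrino, sketched from memory with error handling stripped; a server thread in the same process stands in for what would normally be a separate server process:

    #include <sys/neutrino.h>
    #include <pthread.h>
    #include <stdio.h>

    static int chid;

    static void *server(void *arg) {
        char req[64];
        const char rep[] = "pong";
        for (;;) {
            /* Blocks until a client sends. The rcvid identifies that
             * specific client, so one server can field requests from
             * many processes without opening anything per-requestor. */
            int rcvid = MsgReceive(chid, req, sizeof(req), NULL);
            if (rcvid < 0)
                break;
            MsgReply(rcvid, 0, rep, sizeof(rep)); /* unblocks the client */
        }
        return NULL;
    }

    int main(void) {
        chid = ChannelCreate(0);
        pthread_t t;
        pthread_create(&t, NULL, server, NULL);

        /* pid 0 = this process; a real client names the server's pid. */
        int coid = ConnectAttach(0, 0, chid, _NTO_SIDE_CHANNEL, 0);
        char rep[64];
        /* The "function call": send, hand control to the receiver,
         * block until MsgReply. No extra pass through the dispatcher. */
        MsgSend(coid, "ping", 5, rep, sizeof(rep));
        printf("reply: %s\n", rep);
        return 0;
    }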
Read the second sentence; process A does the write+select on AB and [BA,...], yields to B, is not ready to run; B generates a reply and writes it to BA, which has a read (technically select) already pending.
Why is the Mac so much better under low memory conditions than Linux? Is it the kernel, and if so, is there an inherent trade-off between low memory performance and other kinds of performance?
If there is a trade-off, would reworking the Linux kernel to function better under low memory conditions also create a way forward for a non-dbus, non-systemd, yet modern Linux?
Kind of off-topic, but, every once in a while, I think of Workplace OS and how it was this mystical product that was going to be The Future (especially to us OS/2 nerds!) and how it nearly never even gets mentioned anymore these days. I'd love to read the reminiscences of you or your team members.
Wikipedia says OS/2 for PPC shipped with "IBM Microkernel 1.0" (based on Mach), with plans for an "IBM Microkernel 2.0" which never shipped? Was 2.0 planned as a major change from 1.0, or just an incremental evolution?
I assume DOS and Windows compatibility would come from the features in OS/2. My primary task was to write and run benchmarks comparing Workplace OS personalities to native code. Primarily OS/2 and AIX, which were pretty far along, but later Windows NT. Classic MacOS and OS/400 were mentioned, but I never saw any work on them. Taligent was dead as a doornail by that time (I did ride that project into the ground), by which I mean it had been converted to a C++ utility library.
We were just benchmarking and telling the devs in Boca Raton, "Don't do that. No, don't do that. Here's how you do it. Stop it." I don't know about the roadmap beyond the initial personalities.
Was OS/2 for PPC (are you sure it shipped?) well known for being hideously slow?
Well, apparently it did: http://www.os2museum.com/wp/os2-history/os2-warp-powerpc-edi...
Some "abandonware" website is even offering it for download: https://winworldpc.com/product/os-2-3x/30-powerpc-edition
> well known for being hideously slow?
I don't think it was "well-known" for anything :) But, not having used it myself, the blog post I cite above says the performance was "surprisingly good" and that "all things considered, responsiveness quite good for a 100MHz CPU"
IBM had already discontinued the hardware, so it probably only existed so IBM could say IBM is always true to its word or whatever.
That sounds impressive. Could you say more about how/why?
2. Someone used the wrong magic keywords in the mig spec, causing poor message send-receive structure and code.
3. The same someone, IIRC, set up the copy-on-write memory management page by page, rather than as one big buffer. The latter would still be slower than just copying, but geeze.
4. There was something wrong with the driver at the time; I never heard what.
5. In fairness, there was some overhead from our monitoring.
Something completely unrelated to Mach: what do you think of IBM then, IBM now, and IBM's future?
After spending some time with the OS/2 Workplace Shell usability group, I spent time with Taligent and then the Workplace OS. I'm having "DASD" flashbacks typing this. Then I left for grad school and UT Austin for a good ten years, going back to IBM to work on something called xCP and digital media encryption. That was mostly after IBM became a pure consulting company, and was a cluster fuck, too. (The entire Pervasive Computing division went down with me. My boss ended up in the Lotus division.)
Don't work for IBM. Don't buy IBM. After I get done here, I'm washing my cell phone out with Listerine.
This is still true, but to a lesser extent, on MacOSX (or at least was, a decade ago). I was writing HPC drivers for a cluster interconnect. So performance was critical. We had been using the BSD ioctl system to communicate with our drivers because we used ioctls in all our other drivers (Linux, Solaris, FreeBSD, Windows). I did some microbenchmarks and noticed that it was far slower than FreeBSD or Linux ioctls & complained. Apple suggested that I re-write the app/driver communication using IOKit, which is Mach based. The result was something that was twice as slow.
Trivial Mach IPC round-trip (same process) seems to be about 8.5µs in my test, so around 4µs per mach_msg call. I'm sure if you transferred port rights, memory, etc., the cost would go up.
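If anyone wants to reproduce that kind of number, a minimal self-send loop looks something like this (macOS only; note it measures one kernel send plus one kernel receive per iteration on a single task, not a true two-task round trip):

    #include <mach/mach.h>
    #include <stdio.h>
    #include <time.h>

    int main(void) {
        mach_port_t port;
        if (mach_port_allocate(mach_task_self(), MACH_PORT_RIGHT_RECEIVE,
                               &port) != KERN_SUCCESS)
            return 1;
        if (mach_port_insert_right(mach_task_self(), port, port,
                                   MACH_MSG_TYPE_MAKE_SEND) != KERN_SUCCESS)
            return 1;

        struct {
            mach_msg_header_t header;
            mach_msg_trailer_t trailer; /* room for the receive trailer */
        } msg;

        const int iters = 100000;
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < iters; i++) {
            msg.header.msgh_bits = MACH_MSGH_BITS(MACH_MSG_TYPE_COPY_SEND, 0);
            msg.header.msgh_size = sizeof(mach_msg_header_t);
            msg.header.msgh_remote_port = port;   /* send to ourselves */
            msg.header.msgh_local_port = MACH_PORT_NULL;
            msg.header.msgh_voucher_port = MACH_PORT_NULL;
            msg.header.msgh_id = 0;
            /* One kernel send plus one kernel receive per iteration. */
            if (mach_msg(&msg.header, MACH_SEND_MSG | MACH_RCV_MSG,
                         sizeof(mach_msg_header_t), sizeof(msg), port,
                         MACH_MSG_TIMEOUT_NONE, MACH_PORT_NULL) != KERN_SUCCESS)
                return 1;
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 +
                    (double)(t1.tv_nsec - t0.tv_nsec);
        printf("%.0f ns per send+receive\n", ns / iters);
        return 0;
    }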
edit: I would advise anyone to run your own tests. Anecdotes aren't data and some things like syscalls are not nearly as expensive on modern hardware as they once were.
Sometimes things can get ridiculously faster than you remember. For example, the original iPhone took nearly 200ns to do objc_msgSend(). In 2016, modern hardware did the same thing in 2.6ns. That's two orders of magnitude improvement. So saying "message passing is slow" is not a correct statement... it's almost as cheap as a C++ virtual method call.
Point of reference: I work in HFT, and it's possible to process a market data update from an exchange, reprice a complex financial instrument, decide whether to place an order, and send that order back towards the exchange all in under 5µs. Another point of reference: raising one float to the power of another (a^b) could be done over 100,000 times in the time it takes Mach to do a single syscall.
The rest of your post seems OK, but this claim is almost certainly off. 2000 float^float operations takes a full second? Even on a 486DX, no way.
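In the "run your own tests" spirit above, the pow side of that comparison is easy to put a number on (compile with -O2 and link with -lm; the volatile accumulator and drifting input keep the compiler from deleting or hoisting the call):

    #include <math.h>
    #include <stdio.h>
    #include <time.h>

    int main(void) {
        const int n = 10 * 1000 * 1000;
        volatile float acc = 0.0f;
        float a = 1.0001f, b = 1.37f;

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < n; i++) {
            acc += powf(a, b);
            a += 1e-7f;  /* vary the input so the call can't be hoisted */
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 +
                    (double)(t1.tv_nsec - t0.tv_nsec);
        printf("%.1f ns per powf (acc=%f)\n", ns / n, acc);
        return 0;
    }

On anything recent you should see somewhere around tens of nanoseconds per call; at that rate, 100,000 of them only fit inside a single syscall if that syscall costs milliseconds.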
Presumably for optimization reasons, arbitrary pointers (rather than checked messages) are passed around between different parts of the kernel. And that exposes some quite bad security issues from time to time.
AFAIK, they still don't support 15-year-old technologies like MSI-X that permit efficient multi-queue network drivers.
Have you ever built a project by hand on MacOS and then on Linux (or FreeBSD)? Have you noticed how absurdly, painfully slow it is running autoconf on MacOSX? That's because MacOS system calls are horrifically slow compared to Linux / BSD.
Apple provides posix_spawn, which is much, much faster. Running /usr/bin/false 1000 times in the fish shell is nearly twice as fast (2.25s down to 1.25s) when using posix_spawn instead of fork, on the Mac.
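A minimal version of that measurement, assuming /usr/bin/false exists at that path (it does on macOS; on many Linux systems it's /bin/false):

    #include <spawn.h>
    #include <stdio.h>
    #include <sys/wait.h>
    #include <time.h>

    extern char **environ;

    int main(void) {
        char *argv[] = { "false", NULL };
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < 1000; i++) {
            pid_t pid;
            /* posix_spawn avoids duplicating the parent's address space
             * the way fork() does; on macOS it is a single syscall. */
            if (posix_spawn(&pid, "/usr/bin/false", NULL, NULL,
                            argv, environ) != 0)
                return 1;
            int status;
            waitpid(pid, &status, 0);
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        printf("%.2fs for 1000 spawns\n",
               (double)(t1.tv_sec - t0.tv_sec) +
               (double)(t1.tv_nsec - t0.tv_nsec) / 1e9);
        return 0;
    }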
Autotools doesn't use posix_spawn because there's no benefit on Linux, but it is what Apple's frameworks use internally for process launching.
Wait, autotools doesn't make syscalls at all, does it? I thought it was just shell and make.
Is the problem that /bin/sh and /usr/bin/make use fork and it would help to have a Bourne shell and GNU-compatible make that used posix_spawn?
We moved almost all development machines to Linux after someone demonstrated how much faster Linux built some C project compared to OS X and Windows.
To be fair, this may be down to many things and not just Mach syscalls.
It's fascinating how some things get their names :)
There's some delay between when a new macOS version comes out and when sources get published, but it's great to see how they use the Mach kernel in practice.
One can have said codebase effectively outsource functionality via function calls into closed-source libraries.
Google does something similar with Android, except in that case it's more a matter of coupling functionality to their services:
Up until a couple years ago, they'd strip ARM-specific things out of the released macOS XNU code. Then they started leaving that stuff in!
Firmware cores and AirPods run Apple RTKit.
It was done by Apple and OSF, and for a while was the only way to run Linux on NuBus Macs.
As announced at WWDC, it will be a progressive transition: for every new driver model supported as a user-space driver, the related kernel-space APIs will be deprecated and then removed in the following year's OS release.
Also, many BSD syscalls have been deprecated over the years, including POSIX features like the networking stack, now replaced by Objective-C APIs.
I would advise spending some time reading the "Mac OS X Internals: A Systems Approach" and "Mac OS X and iOS Internals" books, to learn that putting everything into kernel space alone doesn't turn the original microkernel code into a huge monolith.
Apple showed their long-term roadmap at WWDC for how they plan to purge all kernel drivers.
Graphics - I don't see much changing, but to be fair it's pretty nice already. The part in kernel space for the most part just controls the GPU's MMU; the meat of the driver runs in user space, for speed reasons, as shared libraries in the processes that are making GPU calls. It's sort of exokernel-like if you squint hard enough.
Filesystems - will probably be hybrid. I don't see APFS leaving kernel space, or anything your root partition would be on, but NFS, exFAT, NTFS? Yeah.
Networking is part of the first wave by the way.
Networking kernel drivers are now deprecated as you can easily read about here.
I appreciate there is a lot of grey area with microkernels, and a lot of hybrid designs these days, since as was the case with Mach/XNU/Windows NT, "pure" microkernel designs have often shown less than optimal performance due to additional context switching.
Even current macOS variants do plenty of message passing and sandboxing.
Judging from iOS games and real-time audio apps, it appears fast enough to me.
Mach and Hurd, on the other hand, store the messages in the kernel, prioritize them, and handle all mailbox messages without ever losing one.
This proved to be the wrong approach.
Lack of hardware acceleration, and the general clunkiness of X windows make Linux crappy in this regard for example. I _still_ get screen tearing in Ubuntu 18.04 if I don't run Wayland, which I don't because it breaks some apps I use.
A few of them benchmark L4Linux, where L4 is used not as a microkernel but as a hypervisor, which consequently proves nothing about the microkernel paradigm.
The top half was Dhrystones (CPU integer performance), the bottom was various syscalls.
The benchmark is its use as the radio OS on many handsets.
This usage goes against many of the security claims of microkernels. In particular, if Linux is big and insecure, running it in a VM doesn't really improve its security; it just makes sure it doesn't infect the rest of the system. For this type of benchmark, the "rest of the system" does not exist.
(I assume you meant the baseband processor in iPhone and Android. I don't think either use a clean microkernel and there isn't really a performance comparison available)
For me the benchmark that counts is "does it deliver in production", winning ms in laboratory micro-benchmarks is kind of useless.
Especially when so many are willing to waste those ms running Electron apps in userspace.
All the ones corresponding to the former IO Kit do require C++; all the remaining categories are going to be supported from Swift as well.
This is planned to take place across several releases, at the end of which no kernel drivers will be any longer allowed.
probably you mean 3rd-party drivers? or even apple internally?
i couldn’t tell that from the slides mentioned below, but maybe i missed something
> All the ones corresponding to the former IO Kit do require C++
yea, which leads me to believe if apple was to rewrite the kernel, they probably would go with c++ ... or maybe it’s just that swift doesn’t have its embedded chops up to snuff yet...
The long term roadmap is as follows:
1 - surface the kernel APIs for a specific driver model as userspace APIs
2 - deprecate, in that same OS release, the kernel entry points related to the newly surfaced driver model
3 - remove the kernel api on the following OS release
4 - rinse and repeat until there aren't any kernel driver APIs left
The ones being released with Catalina are just the first wave.
i’ll watch the video then, very interesting