Mach kernel (wikipedia.org)
139 points by thatguyagain on Sept 5, 2019 | 95 comments

I worked at IBM Austin on the performance team when the Mach-based IBM Microkernel/Workplace OS was ongoing. (Ask me anything! :-))

Performance was the big problem. (At one point, a disk read was CPU-bound.)

1. The Mach Interface Generator (mig) generated code with wildly different performance, with no obvious relationship to the mig spec.

2. Context switching is always the big topic, but it's a red herring. Our problem was primarily data transfer; copy-on-write took so long to set up that it was frequently cheaper to just do the copy.

3. Making a syscall to get the current PID is a stupid idea.

> copy-on-write took so long to set up that it was frequently cheaper to just do the copy.

Right. QNX just copied for interprocess communication. Mach did lots of messing with the MMU to create temporary shared pages. That seemed to be a lose.

* What you want for interprocess communication in a microkernel is something that works like a function call - send data, wait for reply. If you have to build that out of unidirectional messages, it means more trips through the CPU dispatcher, and probably going to the end of the line waiting for a turn at the CPU. Call-like approach: Process A calls process B, control transfers immediately from A to B, B returns to A, control transfers immediately to process A. No need to look for the next task to run; who runs next is a no-brainer. Pipe-like approach: A sends to B, both are active for a moment, A soon blocks on a lock, B wakes up and does its thing, B sends to A, both are active for a moment, B soon blocks on a lock. A and B fight for the CPU.

The "going to the end of the line" effect: when other tasks want the CPU, the pipe-like approach gives them a chance to run during the handoff, so message-passing performance falls off a cliff when you're CPU-bound. QNX used a call-like approach (MsgSend/MsgReceive), while Mach used something more like pipe I/O.

* Mach started from the BSD code base. Trying to build a microkernel by hacking on a macrokernel didn't end well. Microkernels are all about getting the key primitives at the bottom working very fast and reliably. There was an eventual rewrite, but I gather that BSD code remained.
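The call-like pattern described above can be sketched outside QNX too. Below is a minimal Python illustration (MsgSend/MsgReceive are the real QNX primitives; the socketpair, server loop, and msg_send helper are made-up stand-ins): the client issues one send immediately followed by a blocking receive, so it is never runnable while the server works.

```python
import socket
import threading

def server(sock: socket.socket) -> None:
    """Echo-style service: receive a request, send back a reply."""
    while True:
        req = sock.recv(4096)
        if not req:               # peer closed: shut down
            return
        sock.sendall(b"reply:" + req)

def msg_send(sock: socket.socket, payload: bytes) -> bytes:
    """Call-like IPC: one send immediately followed by a blocking
    receive -- send data, wait for reply, in one logical operation."""
    sock.sendall(payload)
    return sock.recv(4096)

a, b = socket.socketpair()        # bidirectional, loosely like a QNX channel
threading.Thread(target=server, args=(b,), daemon=True).start()

resp = msg_send(a, b"hello")      # client blocks until the reply lands
print(resp)                       # b'reply:hello'
a.close()
```

The point is that the send and the wait form one logical operation from the client's perspective; QNX makes them one kernel operation, which is what lets the scheduler hand the CPU straight to the server.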

Technically there's no reason why write can't context switch directly to a process that's blocked on read (or, less commonly, vice versa). That only solves half the problem though; if you want pure pipe-like IO, you'd also need a single system call that combines write with select/poll/equivalent so the caller can immediately become blocked waiting for the return message.

That runs into a problem on a quick reply. Process A does the write on pipe AB, then gives up the CPU to B. A is still in ready to run state. B quickly generates a reply and writes it to pipe BA. But there's no read pending on pipe BA yet. So both processes are now in ready to run state, contending for the CPU, along with anything else that needed it.

There's an alternative - unbuffered pipes. This is like a Go channel of length 0 - all writes block until the read empties the channel. This is better from a CPU dispatching perspective, in that a write to a channel implies an immediate transfer of control to the receiver. Of course, Go is doing this in one address space, not across a protection boundary.
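The zero-length-channel idea can be modeled with two semaphores. This is a hypothetical Python sketch of the same rendezvous a Go `make(chan T)` gives you within one address space: `send` cannot return until a receiver has actually taken the item.

```python
import threading

class Rendezvous:
    """Zero-capacity channel: send() blocks until a receiver takes the
    item, like an unbuffered Go channel (single sender/receiver here)."""
    def __init__(self):
        self._item = None
        self._ready = threading.Semaphore(0)   # item available
        self._taken = threading.Semaphore(0)   # item consumed

    def send(self, item):
        self._item = item
        self._ready.release()    # hand the item over...
        self._taken.acquire()    # ...and block until it is taken

    def recv(self):
        self._ready.acquire()
        item = self._item
        self._taken.release()    # unblock the sender
        return item

ch = Rendezvous()
out = []
t = threading.Thread(target=lambda: out.append(ch.recv()))
t.start()
ch.send(42)      # returns only once the receiver has the value
t.join()
print(out)       # [42]
```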

The QNX approach worked well for service-type requests. You could set up a service usable by multiple processes. Each request contained the info needed to send the reply back to the caller. The service didn't have to open a pipe to the requestor. This even understood process priority, so high priority requests were serviced first, an essential feature in a hard real time system.

> That runs into a problem on a quick reply.

Read the second sentence: process A does the write+select on AB and [BA, ...], yields to B, and is not ready to run; B generates a reply and writes it to BA, which has a read (technically a select) already pending.

That's the advantage of a combined read and write. But if you're going to have that, it's more convenient to explicitly package it as a request/reply. Less trouble with things like having two messages in the pipe and such.

You might be waiting (via select) for multiple replies; you don't know whether B will (say) grab something out of disk cache and chuck it back to you, or if it'll fire off a seek command to the spinning rust and take a nap just as your network round-trip completes. Having two (or more) messages in (separate) pipes is the point of using select.
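Waiting on several outstanding replies with select looks roughly like this sketch; the pipes and worker threads are illustrative stand-ins for the disk-cache and network services in the example above.

```python
import os
import select
import threading
import time

def worker(fd: int, delay: float, msg: bytes) -> None:
    """Pretend service: reply on its own pipe after some latency."""
    time.sleep(delay)
    os.write(fd, msg)

# Two outstanding "requests"; we don't know which reply arrives first.
r1, w1 = os.pipe()
r2, w2 = os.pipe()
threading.Thread(target=worker, args=(w1, 0.2, b"disk")).start()
threading.Thread(target=worker, args=(w2, 0.01, b"net")).start()

replies = []
pending = {r1, r2}
while pending:
    ready, _, _ = select.select(list(pending), [], [])  # block on all at once
    for fd in ready:
        replies.append(os.read(fd, 16))
        pending.discard(fd)

print(replies)   # fastest reply first, e.g. [b'net', b'disk']
```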

Is there some relationship between Mach's poor performance and the ability of macOS to remain functional in low-memory situations? Linux has been called out recently for becoming unusable when free memory is low. Using both, I can say that the Mac is hands-down a better OS for desktop use where having the system not freeze is way more important than a 70% slowdown in file transfer rate.

Why is the Mac so much better under low memory conditions than Linux? Is it the kernel, and if so, is there an inherent trade-off between low memory performance and other kinds of performance?

If there is a trade-off, would reworking the Linux kernel to function better under low memory conditions also create a way forward for a non-dbus, non-systemd, yet modern Linux?

I seem to recall that back when I used Windows back in XP era when memory was limited, the page file would sometimes grow to a few hundred MB without GUI responsiveness suffering too much. Does Windows or Mac protect certain processes from being swapped?

> I worked at IBM Austin on the performance team when the Mach-based IBM Microkernel/Workplace OS was ongoing. (Ask me anything! :-))

Kind of off-topic, but, every once in a while, I think of Workplace OS and how it was this mystical product that was going to be The Future (especially to us OS/2 nerds!) and how it nearly never even gets mentioned anymore these days. I'd love to read the reminiscences of you or your team members.

Questions: The OS/2 personality was the only one that shipped, right? (Was its DOS and Windows compatibility provided by separate personalities, or provided by the OS/2 personality?) What other personalities were being developed, and how far along were they in development before being cancelled? I know IBM executives talked about AIX, OS/400, classic MacOS and Taligent personalities, but did any development work happen on those or were they just vaporware?

Wikipedia says OS/2 for PPC shipped with "IBM Microkernel 1.0" (based on Mach), with plans for an "IBM Microkernel 2.0" which never shipped? Was 2.0 planned as a major change from 1.0, or just an incremental evolution?

The OS/2 personality shipped? (For once in my career, I didn't ride the dead horse into the ground.) When I left, the performance group I was with was breaking up, and the Workplace OS was very far from a releasable state.

I assume DOS and Windows compatibility would come from the features in OS/2. My primary task was to write and run benchmarks comparing Workplace OS personalities to native code. Primarily OS/2 and AIX, which were pretty far along, but later Windows NT. Classic MacOS and OS/400 were mentioned, but I never saw any work on them. Taligent was dead as a doornail by that time (I did ride that project into the ground), by which I mean it had been converted to a C++ utility library.

We were just benchmarking and telling the devs in Boca Raton, "Don't do that. No, don't do that. Here's how you do it. Stop it." I don't know about the roadmap beyond the initial personalities.

Was OS/2 for PPC (are you sure it shipped?) well known for being hideously slow?

> The OS/2 personality shipped?

Well, apparently it did: http://www.os2museum.com/wp/os2-history/os2-warp-powerpc-edi...

Some "abandonware" website is even offering it for download: https://winworldpc.com/product/os-2-3x/30-powerpc-edition

> well known for being hideously slow?

I don't think it was "well-known" for anything :) But, not having used it myself, the blog post I cite above says the performance was "surprisingly good" and that, "all things considered, responsiveness [was] quite good for a 100MHz CPU".

"Shipped" might be debatable. IIRC, the OS/2-PPC release was more like a preview or private beta. Or at least that's how the OS/2 and PPC fanboys spun it by saying "just wait for the real retail version!" (which was never to come)

IBM had already discontinued the hardware, so it probably only existed so IBM could say IBM is always true to its word or whatever.

> At one point, a disk read was CPU-bound.

That sounds impressive. Could you say more about how/why?

1. Keep in mind that the programmers had swallowed gallons of the multi-server microkernel koolaid, so the flow was user code -> mk -> file server -> mk -> device driver and back.

2. Someone used the wrong magic keywords in the mig spec, causing poor message send-receive structure and code.

3. The same someone, IIRC, set up the copy on write memory management for each page, rather than one big buffer. The latter would still be slower than just copying, but geeze.

4. There was something wrong with the driver at the time; I never heard what.

5. In fairness, there was some overhead from our monitoring.

>I worked at IBM

Something completely unrelated to Mach: what do you think of IBM then, IBM now, and IBM's future?

IBM then was essentially "the loudest person gets to be in charge." Larry Loucks was very loud.

After spending some time with the OS/2 Workplace Shell usability group, I spent time with Taligent and then the Workplace OS. I'm having "DASD" flashbacks typing this. Then I left for grad school and UT Austin for a good ten years, going back to IBM to work on something called xCP and digital media encryption. That was mostly after IBM became a pure consulting company, and was a cluster fuck, too. (The entire Pervasive Computing division went down with me. My boss ended up in the Lotus division.)

Don't work for IBM. Don't buy IBM. After I get done here, I'm washing my cell phone out with Listerine.

What do you think about the hypothetical that Ginni Rometty gets replaced by Jim Whitehurst?

I heard DEC/Compaq ended up basically rewriting the whole thing in OSF/Digital Unix/Tru64. Anyone with stories on that kernel?

Isn't that the "OSF MkLinux" thing?

Given a syscall that does nothing, a full round-trip under BSD would require about 40μs, whereas on a user-space Mach system it would take just under 500μs.

This is still true, but to a lesser extent, on MacOSX (or at least was, a decade ago). I was writing HPC drivers for a cluster interconnect. So performance was critical. We had been using the BSD ioctl system to communicate with our drivers because we used ioctls in all our other drivers (Linux, Solaris, FreeBSD, Windows). I did some microbenchmarks and noticed that it was far slower than FreeBSD or Linux ioctls & complained. Apple suggested that I re-write the app/driver communication using IOKit, which is Mach based. The result was something that was twice as slow.

Have you run a test recently?

A trivial Mach IPC round-trip (same process) seems to be about 8.5µs in my test, so around 4µs per call. I'm sure if you transferred port rights, memory, etc. the cost would go up.

edit: I would advise anyone to run your own tests. Anecdotes aren't data and some things like syscalls are not nearly as expensive on modern hardware as they once were.

Sometimes things can get ridiculously faster than you remember. For example the original iPhone took nearly 200ns to do objc_msgSend() [1]. In 2016 modern hardware did the same thing in 2.6ns [2]. That's two orders of magnitude improvement. So saying "message passing is slow" is not a correct statement... it's almost as cheap as a C++ virtual method call.

[1] https://www.mikeash.com/pyblog/friday-qa-2016-04-15-performa... [2] https://www.mikeash.com/pyblog/friday-qa-2016-04-15-performa...
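Seconding "run your own tests": a crude syscall microbenchmark is a few lines of Python. The absolute numbers include interpreter overhead, and on some platforms getpid is cached in libc or served without a full kernel entry, so treat the result as a rough upper bound rather than a true syscall cost.

```python
import os
import time

def bench(fn, n=100_000):
    """Rough wall-clock cost of one call to fn, in nanoseconds."""
    start = time.perf_counter()
    for _ in range(n):
        fn()
    return (time.perf_counter() - start) / n * 1e9

py_call = bench(lambda: None)    # interpreter call overhead baseline
getpid = bench(os.getpid)        # syscall (possibly cached/vDSO) path

print(f"empty call: {py_call:.0f} ns, os.getpid: {getpid:.0f} ns")
```

Subtracting the baseline from the getpid figure gives a very rough per-call cost to compare against the historical 40μs/500μs numbers.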

What is a BSD syscall on the same hardware?

>Given a syscall that does nothing, a full round-trip under BSD would require about 40μs, whereas on a user-space Mach system it would take just under 500μs

Point of reference: I work in HFT, and it's possible to process a market data update from an exchange, reprice a complex financial instrument, decide whether to place an order, and send that order back towards the exchange all in under 5us. Another point of reference: raising one float to the power of another (a^b) could be done over 100,000 times in the same time it takes for Mach to do a single syscall.
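The float-power figure is easy to sanity-check on whatever you're reading this on; a quick Python timing (which is heavily padded by interpreter overhead, so the raw hardware instruction is cheaper still):

```python
import math
import time

n = 100_000
x, y = 1.0001, 2.5

start = time.perf_counter()
for _ in range(n):
    math.pow(x, y)
elapsed = time.perf_counter() - start

# Per-call cost includes Python dispatch overhead on top of the
# actual floating-point power operation.
print(f"{n} float powers in {elapsed * 1e6:.0f} us "
      f"({elapsed / n * 1e9:.1f} ns each)")
```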

That 500 microsecond number is from a 1993 paper on 1991 hardware (a 50 MHz 486DX-50). Your HFT code wouldn't load on it, and in 500usec it would be lucky to do a single float^float power operation.

> in 500usec it would be lucky to do a single float^float power operation.

The rest of your post seems OK, but this claim is almost certainly off. 2000 float^float operations takes a full second? Even on a 486DX, no way.

Ok I mixed up the DX and SX; at least the DX has the fp hardware. Still, many of the x87 instructions you'd want to use to implement pow() take hundreds of cycles on a 486, so I'd bet it would take tens of usec to do each pow().

I really don't mean this as a knock on the work you do... but what is the actual complexity of the operations you're describing? From what little I understand of it, all HFT operations are designed to be as simple as possible. Like... I've read that the decision to the order engine could be as simple as a comparison between two numbers (some kind of arbitrage model), and the price adjustment could be just a simple addition/alpha adjustment. Basically, the offline/non-realtime stuff in HFT is way more complicated than the live/real-time stuff.

Still more complex than a no-op syscall, especially including message encoding/decoding.

The way XNU converged from its microkernel nature to a more monolithic one has led to all kinds of funkiness. You still have more syscall overhead (and VM overhead, though message passing probably isn't the culprit), but you can't really trust Mach to separate the kernel's internal systems like one would in a typical microkernel.

For presumably optimization reasons arbitrary pointers (rather than checked messages) are passed around different parts of the kernel. And that exposes some quite bad security issues from time to time.

So the obvious question is how it has changed over the years. What has apple done to pull latency down to a comparable time?

I doubt that they care. In the period where I worked on Mac drivers and kept close track of the Darwin sources (roughly 2003->2013) I saw very little in the way of performance improvements.

AFAIK, they still don't support 15 year old technologies like MSI-X that permit efficient multi-queue network drivers.

Have you ever built a project by hand on MacOS and then on Linux (or FreeBSD)? Have you noticed how absurdly, painfully slow it is running autoconf on MacOSX? That's because MacOS system calls are horrifically slow compared to Linux / BSD.

It's `fork` in particular which is much slower. It has to fiddle with Mach port rights, doesn't do overcommit, etc. And autotools is very fork heavy.

Apple provides posix_spawn which is much much faster. Running /usr/bin/false 1000 times in the fish shell is nearly twice as fast (2.25s to 1.25s) when using posix_spawn instead of fork, on the Mac.

Autotools doesn't use posix_spawn because there's no benefit on Linux, but it is what Apple's frameworks use internally for process launching.
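A rough way to reproduce the fork-vs-posix_spawn comparison yourself (POSIX-only; the `true` binary and iteration count are arbitrary choices, and Python's os.posix_spawn needs 3.8+):

```python
import os
import shutil
import time

true_path = shutil.which("true") or "/bin/true"
N = 100   # launches per method; small to keep the demo quick

def run_fork_exec():
    """Classic fork + exec of a trivial program, then reap the child."""
    pid = os.fork()
    if pid == 0:                           # child
        try:
            os.execv(true_path, [true_path])
        finally:
            os._exit(127)                  # never fall back into parent code
    os.waitpid(pid, 0)

def run_posix_spawn():
    """Same launch via posix_spawn, skipping the address-space duplication."""
    pid = os.posix_spawn(true_path, [true_path], dict(os.environ))
    os.waitpid(pid, 0)

def bench(fn):
    start = time.perf_counter()
    for _ in range(N):
        fn()
    return time.perf_counter() - start

t_fork = bench(run_fork_exec)
t_spawn = bench(run_posix_spawn)
print(f"fork+exec: {t_fork:.3f}s, posix_spawn: {t_spawn:.3f}s for {N} runs")
```

On Linux the two tend to be close; the fish-shell numbers quoted above suggest the gap is much larger on the Mac, where fork has extra Mach bookkeeping to do.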

> Autotools doesn't use posix_spawn because there's no benefit on Linux

Wait, autotools don't make syscalls at all, do they? I thought it's just shell and make.

Is the problem that /bin/sh and /usr/bin/make use fork and it would help to have a Bourne shell and GNU-compatible make that used posix_spawn?

What does “horrifically slow” mean compared to a monolithic syscall? I haven’t run autoconf on linux in years.

Small number of minutes vs less than a minute. The last time I built something by hand in both places, it seemed to take 2x to 3x as long to run all the autoconf checks on MacOS. That's basically a fork / exec / open /close sort of benchmark.

(and general filesystem performance)

You should try it yourself.

We moved almost all development machines to Linux after someone demonstrated how fast Linux built some C project compared to osx and windows.

Spent a while thinking why I don’t see this much. I work mostly with higher level languages, so the beefiest things I run are probably webpack and the typescript language server. I wonder if maybe the problem is fork or some derivative effect. That, or I just don’t have any syscall heavy loads in my life, full stop.

You don't need autoconf to see this. I have seen it in a project with a simple make -j, compiling with clang in both places. Builds finish much faster on Linux.

To be fair this may be many things and not just mach syscalls.

> Mach's name Mach evolved in a euphemization spiral: While the developers, once during the naming phase, had to bike to lunch through rainy Pittsburgh's mud puddles, Tevanian joked the word muck could serve as a backronym for their Multi-User [or Multiprocessor Universal] Communication Kernel. Italian CMU engineer Dario Giuse later asked project leader Rick Rashid about the project's current title and received "MUCK" as the answer, though not spelled out but just pronounced as IPA: [mʌk] which he, according to the Italian alphabet, wrote as Mach. Rashid liked Giuse's spelling "Mach" so much that it prevailed.

It's fascinating how some things get their names :)

I always thought it was named after Mach number because of its speed. Being named after muck couldn't be farther from that.

Speed was not one of Mach's attributes.

I had supposed it was named after Ernst Mach, after whom the Mach number is named.

I’m Italian and I’m not sure I’d transliterate that way, but...

Depends who is speaking; I've heard the vowel sound in "muck" spoken as any of ʌ, ə, or u by native English speakers from different regions.

Apple even open sources their XNU kernel: https://opensource.apple.com/

There's some delay between when a new macOS version comes out and when sources get published, but it's great to see how they use the Mach kernel in practice.

I don't recall if it was Apple that did this, but there's a trick one can pull with an "open source" codebase that's part of a proprietary product.

One can have said codebase effectively outsource functionality to function calls on closed-source libraries.

Google does something similar with Android, except in that case it's more about coupling functionality to their services.


Apple does not need to do that because of the BSD license, but I've done that with AGPL3 software before. It was the suggested solution by the original developer of the project, whom we hired.

It's also mirrored on Github! [1]

[1]: http://github.com/apple/darwin-xnu

Why do they update the macOS XNU Kernel and not iOS?

They seem really big on keeping iOS things secret-ish, but the XNU kernel for both iOS and macOS seem to be built from the same code base.

Up until a couple years ago, they'd strip out ARM-specific things from the released macOS XNU code. Then they started leaving that stuff in!

What about their embedded OS? Like one they're running on T2 chip or inside Airpods? Is it still based on XNU kernel?

T2 is a variant of A10 and runs XNU for the main CPU.

Firmware cores and AirPods run Apple RTKit.

They use L4 in the Secure Enclave. Perhaps they do the same for other embedded devices

I would be surprised if that was the case. Low-power micro-controllers, such as ARM's Cortex-M series usually do not have support for virtual memory, which is required for running XNU.

IIRC they share the exact same kernel, so they don't need to.

They’re not quite the same, which is why the lack of iOS sources was a minor annoyance for a while.

Oral History of Avie Tevanian starting with the Mach segment: https://youtu.be/vwCdKU9uYnE?t=3995.

Actually stumbled across this the other day - had no idea the Mach kernel was being used on Apple hardware before the Intel transition. MachTen was a paid-for Mach kernel based around BSD4.4 https://en.wikipedia.org/wiki/MachTen

Technically they were anyway, since XNU/OSX were on PowerPC ;P

MkLinux ran Linux as the sole server process on Mach on Macs. It was noticeably slower than the direct PowerPC port.

It was done by Apple and OSF, and for a while was the only way to run Linux on NuBus Macs.

After being integrated into XNU through the OSFMK kernel, it's not a microkernel anymore.

It will become one again with the new user space drivers being introduced in Catalina.

As announced at WWDC, it will be a progressive transition: for every new driver model supported as a user-space driver, the related kernel-space APIs will be deprecated and then removed in the OS release the following year.

That still leaves lots of functionality in the kernel, most notably the BSD code (which implements Unix syscalls). So even with the new userspace driver support, most people would not consider xnu a microkernel.

It will be very hard to be a monolithic kernel without any sort of kernel space drivers.

Also, many BSD syscalls have been deprecated over the years, including POSIX features like the networking stack, now replaced by Objective-C APIs.

I would advise spending some time reading the "Mac OS X Internals: A Systems Approach" and "Mac OS X and iOS Internals" books, to learn that having everything in kernel space alone doesn't rewrite the original microkernel code into a huge monolith.

Userspace drivers do not make a microkernel.

They surely do when there isn't anything else available.

Apple showed their long term roadmap at WWDC how they plan to purge all kernel drivers.

Including file systems, graphics and networking? That sounds like it would require a redesign of Mach, similar to what the L4 family had to do, or what Google is doing with Fuchsia.

Networking - as in from the packet queues down, yes. I could see the IP stack making its way into its own process (or more likely they'll try and see that the perf is terrible). If they do that, they'll probably push SSL into the same stack.

Graphics - I don't see much changing, but to be fair it's pretty nice already. The part in kernel space for the most part just controls the GPU's MMU, the meat of the driver runs in user space for speed reasons as shared libraries in the processes that are making GPU calls. It's sort of exokernel like if you squint hard enough.

Filesystems - will probably be hybrid. I don't see APFS leaving kernel space, or anything your root partition would be on, but NFS, exFAT, NTFS? Yeah.

That is what they mentioned at WWDC yes.

Networking is part of the first wave by the way.

Networking kernel drivers are now deprecated as you can easily read about here.


Are all drivers going to be userspace in Catalina, or just the user-installed kernel extensions? I just assumed Apple's OS drivers would still be in kernelspace.

Maybe I'm wrong, but I always thought that the "classic" description of a microkernel was one that implemented not only device drivers, but filesystem drivers and possibly other key components in userspace (memory manager?) as well. At least that is how I remember MINIX 1.0 as described in the Tanenbaum book.

I appreciate there is a lot of grey area with microkernels, and a lot of hybrid designs these days, since as was the case with Mach/XNU/Windows NT, "pure" microkernel designs have often shown less than optimal performance due to additional context switching.

Catalina is only the start of a long-term roadmap, so no, not all drivers are userspace on Catalina, just the first set of them.

Apple stated that would apply to everyone, as you can check from WWDC videos.

Have they solved the performance issue in any way?

QNX, L4 and Minix 3 have solved it ages ago.

Even current macOS variants do plenty of message passing and sandboxing.

Judging from iOS games and real-time audio apps, it appears fast enough to me.

But they are fundamentally different. They have no mailboxes in the server; they just pass the messages through to the right client after some capabilities check.

Mach and Hurd, on the other hand, store the messages in the kernel, prioritize them, and handle all mailbox messages without ever losing one. This proved to be the wrong approach.

Seems to work alright for consumer media in real time, something that monolithic Linux has real issues with.

I think that just goes to show that when it comes to consumer media, there's probably more important layers in the stack that should be the focus of optimizations.

Lack of hardware acceleration, and the general clunkiness of X windows make Linux crappy in this regard for example. I _still_ get screen tearing in Ubuntu 18.04 if I don't run Wayland, which I don't because it breaks some apps I use.

For consumer media it's even more important to drop late or lost messages, just as the transport protocols RTP and UDP do. A microkernel is much, much simpler than those protocols.

There are no proper L4 performance comparisons.

A few benchmark L4Linux, where L4 is not used as a microkernel but as a hypervisor, which consequently proves nothing about the microkernel paradigm.

I've forgotten its title, but there was a performance comparison between Linux and MkLinux (with the OSF/Mach microkernel) at the one and only FSF free software conference. It had my favorite chart: a bar chart showing the relative performance. The top half was 1:1; the bottom half was 1:2 or so.

The top half was Dhrystones (CPU integer performance), the bottom was various syscalls.

A hypervisor is a kind of microkernel.

The benchmark is its use as radio OS on many handsets.

Yes, but...

This usage goes against many of the security claims of microkernels. In particular, if Linux is big and insecure, running it in a VM doesn't really improve its security. It just makes sure it doesn't infect the rest of the system. For this type of benchmark, the "rest of the system" does not exist.

(I assume you meant the baseband processor in iPhone and Android. I don't think either use a clean microkernel and there isn't really a performance comparison available)

I prefer to see hypervisors used alongside unikernels where security is concerned.

For me the benchmark that counts is "does it deliver in production"; winning milliseconds in laboratory micro-benchmarks is kind of useless.

Especially when so many are willing to waste those ms running Electron apps in userspace.

Google is developing Fuchsia as a possible replacement for Android's Linux kernel. Are there any reports or rumors about Apple developing a next-generation kernel (perhaps with a safe language like Swift and optimized for mobile devices) to eventually replace Mach/XNU and its historical baggage?

They are moving all drivers to user space.

All the ones corresponding to the former IO Kit do require C++, all the remaining categories are going to be supported from Swift as well.

This is planned to take place across several releases, at the end of which no kernel drivers will be any longer allowed.

> They are moving all drivers to user space.

probably you mean 3rd-party drivers? or even apple internally?

[edit] i couldn’t tell that from the slides mentioned below, but maybe i missed something

> All the ones corresponding to the former IO Kit do require C++

yea, which leads me to believe if apple was to rewrite the kernel, they probably would go with c++ ... or maybe it’s just that swift doesn’t have its embedded chops up to snuff yet...

relevant slides:


Watch the presentation, as it contains more information than the slides.

The long term roadmap is as follows:

1 - surface kernel APIs for a specific driver model as userspace API

2 - deprecate for the respective OS release the kernel entry points related to the newly surfaced driver model

3 - remove the kernel api on the following OS release

4 - rinse and repeat until there aren't any kernel driver APIs left

The ones being released with Catalina are just the first wave.


i’ll watch the video then, very interesting

Here's the video of that "System Extensions and DriverKit" session from WWDC 2019:


thanks, watching it now ^^
