Mkfile(8) is severely syscall limited on OS X (metaobject.com)
102 points by mpweiher 45 days ago | 88 comments

" I did not check on other operating systems, but my guess is that the results would be similar."

Actually, no. Irrespective of mkfile being mostly MacOSX specific, most syscalls on MacOSX are just plain slow compared to Linux (or FreeBSD). I think this is partially an artifact of performance not being a major metric for most system calls on MacOSX. The system calls where performance is a critical metric (like timekeeping) use a kernel / user shared page interface to avoid the syscall entirely in the vast majority of cases.

I recall doing benchmarking roughly 10 years ago to find the best interface for communicating with a mostly OS-bypass HPC driver. MacOSX ioctls were some multiple more expensive than Linux, but the native IOKit Mach IPC was even more expensive than that. Sigh. There was a similar story for sockets in non-OS-bypass mode, where simply writing to a socket was far more expensive than on Linux.

Somebody needs to resurrect lmbench & do a comparison of the various x86 kernels available these days. Maybe that would shame Apple into focusing on performance.


Using mkfile compiled for Linux on a tmpfs gets me 1550 MiB/s.

I'm pretty sure that even at their slowest, syscalls aren't going to match IO overhead. It's not like mkfile is CPU bound.

The point of the article is that I was also "pretty sure" about exactly that, and advances in hardware performance meant I was wrong...

2GB/s is pretty damn fast, 0.5 nanoseconds per byte.

That's not what the article showed. It showed that synchronous IO latency was the limiting factor. By arranging your IOs into little 512-byte chunks, you don't make it terribly easy for the filesystem to efficiently use the underlying device; not surprising, given that it doesn't even want to work in 512-byte sectors, and operates more efficiently with lots of parallel IO operations.
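The effect is easy to demonstrate. Here's a minimal Python sketch (my own illustration, not from the article) that just counts how many write() calls each buffer size costs for the same amount of data:

```python
import os, tempfile

def fill(path, total, bufsize):
    """Write `total` zero bytes in chunks of `bufsize`; return the write() count."""
    buf = b"\0" * bufsize
    calls = 0
    # buffering=0 gives an unbuffered FileIO, so each write() is one syscall
    with open(path, "wb", buffering=0) as f:
        remaining = total
        while remaining:
            n = f.write(buf[: min(bufsize, remaining)])
            remaining -= n
            calls += 1
    return calls

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "testfile")
    small = fill(path, 1 << 20, 512)      # 1 MiB in 512-byte writes
    large = fill(path, 1 << 20, 1 << 20)  # 1 MiB in a single write
    print(small, large)                   # 2048 1
```

Every one of those 2048 calls pays a full user/kernel round trip, which is exactly the overhead the article measured.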

OS X does write-coalescing.

You can't write coalesce IO operations that haven't been called yet.


Bloated syscall times wind up being a death by a thousand cuts when your workload is syscall heavy.

One simple, common, example is plain old autotools configure scripts. I have no love for autotools generated configure scripts (and in fact hate them with a passion), but I've long observed they crawl on MacOSX and go much faster on Linux. They do lots of syscalls (tons of fork/exec/open/close/stat).

Reason for downvoting?

Because I speak in a condescending tone that irritates people. I'm quite used to it. I fully admit it's rare that I can present my point of view in a way that is a) concise rather than a rant, and b) doesn't use phrasing that is abrasive. I should have been able to summarize my point in two sentences without negative coloring, but instead it took me three verbose paragraphs with assaulting language. I'm trying to work on it, but old habits die hard. shrug

(Also sshhhhh, it is anti-HN to discuss downvotes)

I was the same. Probably still am sometimes. The most helpful reminder to myself was that this isn't a shame or punishment thing. It's simply that most people aren't receptive when spoken to that way. So instead of feeling like I was being shamed into being polite, I just reminded myself that it's my loss if I can't communicate my thoughts, no matter what the transport mechanism looks like.

Language is such a horribly inefficient and inconsistent medium of communication.

It's unbelievably flexible though.

Ambiguous, I would say.

Actually it is neither forbidden nor uncommon to see a comment by the downvoter explaining their reasoning for the downvote. But complaining about being downvoted, or inviting discussion about it, is frowned upon.

I just wondered, seeing as it had been flagged dead within an hour, I believe.

I believe however that this is one of the fundamental flaws of Reddit-style websites. On an imageboard you are obliged to voice your disagreement in words, constructive or otherwise. Here, you can passive aggressively click the downvote button and leave the post in limbo.

Well, sometimes it is better to just click to downvote, because the only other possibility that makes sense is to leave a message saying that the guy is a scumbag shithead. In such (common) cases, the simple downvote is better for the health of the thread and of the protagonists.

(I am not at all talking about the present case, let it be clear :-) )

And that's why sane people will tell you to not use something like st_blksize as the IO size, because that's far too small. Many tools use 32 KiB, but that can be limiting even on Linux and needlessly uses more CPU (especially if FUSE is involved -- reads are not coalesced at the VFS layer! Read and write merging is done at the block device layer by the IO scheduler there!). Something like 512K-1M is a sane default these days.
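For reference, st_blksize is one os.stat() away, which is part of why so many tools end up with tiny buffers. A sketch of the "use the hint, but floor it at something sane" approach:

```python
import os

st = os.stat(".")
print(st.st_blksize)              # the filesystem's preferred IO size hint, commonly 4096

# The hint is fine as a lower bound, but far too small as an actual buffer size
# on fast devices; floor it at 1 MiB as suggested above.
BUF = max(st.st_blksize, 1 << 20)
print(BUF)
```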

Maybe I'm misunderstanding you, but... Before each run, "echo 3 > /proc/sys/vm/drop_caches" is executed. The processor is in "performance" mode.

dd if=./big_file.file of=/dev/null bs=64K

535020540 bytes (535 MB, 510 MiB) copied, 8.70128 s, 61.5 MB/s

dd if=./big_file.file of=/dev/null bs=4K

535020540 bytes (535 MB, 510 MiB) copied, 8.36283 s, 64.0 MB/s

dd if=./big_file.file of=/dev/null bs=1K

535020540 bytes (535 MB, 510 MiB) copied, 8.82508 s, 60.6 MB/s

The only thing that went up is the CPU usage (and even that is negligible at ~60 MB/s).

dd if=/dev/zero of=/dev/null bs=1K count=1024K

1073741824 bytes (1.1 GB, 1.0 GiB) copied, 1.06113 s, 1.0 GB/s

dd if=/dev/zero of=/dev/null bs=4K count=256K

1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.343595 s, 3.1 GB/s

dd if=/dev/zero of=/dev/null bs=32K count=32K

1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.160191 s, 6.7 GB/s

At just 60 MB/s it doesn't matter much (even 512 byte blocks would just be a little more than 100k syscalls / second). At a couple hundred MB/s it begins to matter, and with faster NVMe SSDs it's a legit issue.
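The arithmetic behind that, as a quick sketch:

```python
def syscalls_per_sec(mib_per_s, block_bytes):
    """Write syscalls needed per second at a given throughput and block size."""
    return mib_per_s * (1 << 20) // block_bytes

print(syscalls_per_sec(60, 512))    # 122880 -- harmless
print(syscalls_per_sec(2048, 512))  # 4194304 -- at 2 GiB/s every write is a syscall, and it shows
```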

This becomes a bigger issue if it's not a kernel-based, local FS, but something that does more work, I mentioned FUSE above. Since no read merging happens, small reads like 4k or 32k start to hurt earlier, especially if the FS is written in an interpreter. Some of these also mishandled small reads, incurring an additional penalty (I know because I fixed that in an implementation).

No point in speculating. I showed you how. Go test it.

Even the fastest SSD is not even near the speed of /dev/zero. Easy to extrapolate any speed from those two points (minus the filesystem overhead (the thing we are talking about is the block layer)).
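A toy model makes the extrapolation concrete. The 2 microsecond per-syscall cost below is a made-up ballpark figure, not a measurement; the point is the shape of the curve:

```python
def effective_mib_s(disk_mib_s, block_bytes, syscall_s=2e-6):
    """Model: each block costs its raw transfer time plus one fixed syscall overhead."""
    transfer = block_bytes / (disk_mib_s * (1 << 20))
    return block_bytes / (transfer + syscall_s) / (1 << 20)

# On a hypothetical 2000 MiB/s device, small blocks are syscall-dominated,
# large blocks approach the device's raw bandwidth.
for block in (512, 4096, 1 << 20):
    print(block, round(effective_mib_s(2000, block)))
```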

I'd be interested in a test of a FUSE based fs like ntfs. So if you have one, just run some tests. (don't forget to flush the cache)

Don't forget "oflag=direct"

Uhm, does dd contain all the logic to work with that or is that literally just O_DIRECT |ed into the flags?

The whole thing is kind of odd. Normally I'd want to leave filling the file to filesystem...

The GNU coreutils default to 128KiB buffers as per the test script at


Oh wow, I had no idea that write buffers needed to be on the order of a megabyte on modern systems.

Neither did I! And in fact it took almost 2 days to question/overcome my assumptions.

"It's not what you don't know that kills you. It is what you know for sure that ain't true" -- Mark Twain

Interesting... You never dd data to a USB3 disk or SSD? The default buffer size for dd on OS X is also 512 bytes. I've been using `bs=1m` on OS X for ages.

Why are you using dd for USB disks?

Writing bootable installation images?

You should use cp, that's just better for this task.

"dd is not a disk writing tool" http://www.vidarholen.net/contents/blog/?p=479

You could use cp.

cp is designed to copy data across file systems.

dd is designed to copy raw block data across devices.

Sure, you can abuse both for either purpose, but it seems rather misguided to call it out explicitly. What did you intend to gain by doing it?

You don't know what cp is for. cp is exactly for copying data from a file to another file -- and under Unix, devices are represented as files. Using cp for writing disk ISOs to a USB drive is proper use, not abuse.

dd is for transform-and-copy, so you're abusing dd when writing disk ISOs. The fact that most of the Internet is wrong doesn't make you any less wrong.

cp is designed to copy data from one file to another. Block devices are represented as files, and cp won't be speed limited by block size.

dd is also designed to copy data from one file to another, in blocks, with optional conversion of the data (eg, from EBCDIC to ASCII), and with optional offsets/limits to copy regions of a file.

dd is not designed to copy raw block data, it is merely a different file copy tool.


It's just OS X being slow, Linux gets close to peak with a 16 kB buffer: http://blog.tdg5.com/tuning-dd-block-size/

Reading from /dev/zero and writing out to RAM with a more optimal block size of 256K yields a throughput of 1,536 MB/s

Actually, in this article, Linux writing to RAM is slower than MacOS writing to SSD. Hard to conclude that "it's just OS X being slow". I suspect the hardware used is wildly different from the one in the original article.

Maybe I am reading it wrong, but I am not getting the 16KB optimal buffer size from that article. The summary at the end is as follows:

'In the above example it can be seen that an input block size of 128K is optimal for my particular setup.'

However, the SSD seems to be significantly slower, so the relative overhead of the syscall is going to be correspondingly less:

'Reading from /dev/zero and writing out to a SSD with a more optimal block size of 256K yields a throughput of 280 MB/s.'

I was getting 250MB/s with the 512 byte buffer size, although the test in the article is doing two syscalls (one read(), one write()) whereas mkfile is just doing one.

At 256KB buffer size I was already getting around 1.8GB/second writing to SSD.

Or is it another case of Linux taking shortcuts and sacrificing safety for speed?

Just saw the performance of this "consumer" device and remembered this thread: http://www.anandtech.com/show/11104/the-patriot-hellfire-m2-... . This SSD can write over a GB/s and read nearly 3 GB/s. We've come a long way since 32 KiB buffers were a sane choice.

Does anyone know offhand the architecture of the Mach/BSD hybrid kernel in this case? It sounds suspiciously like a problem with the IBM Microkernel back in about 1992, and with the OSF's mkLinux, which is apparently a predecessor of Mac OS.

Specifically, do the syscalls writing the file go through multiple protection domains?


"the OSF's mkLinux, which is apparently a predecessor of Mac OS."

Apple worked on MkLinux, but it isn't technically a predecessor to Mac OS X. The two do not share a single line of code; if they did, Apple would have to license Mac OS under the GPL.

XNU, the Mac OS kernel (https://en.wikipedia.org/wiki/XNU) isn't a real microkernel; the functionality of a microkernel is there, but quite a bit of code was added that, in a true microkernel, would live in userspace.

This still raises the question I've had since day one when learning about buffers: what the heck is the recommended buffer size? I've seen a lot of old code that uses extremely small buffer sizes as well as some recent code that uses extremely large buffer sizes (up to 10MB).

What is a good sweet spot that runs well on older hardware (less than 10 years old) and new hardware? And should network buffers be bigger or smaller than disk buffers?

Network buffers are a completely different animal, also harder to optimize...

Disk IO buffers are easy nowadays: just use something like a MB, which is fine for almost any application and doesn't add up to much memory use (unless you're writing many files concurrently, which can bring its own problems as well).

That actually does sound about right, but do you know how to derive that? I'm curious how to figure it out.

Judging by the post here, 512KB buffers seem like a good bet?

For an SSD with 2GB/s throughput and the syscall overhead of OS X: yes. Go higher if you want some headroom for the next generation of faster SSDs.

Just do a test with dd or some small piece of code. Why rely on hearsay.
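For the "small piece of code" route, here's a rough Python equivalent of the dd sweep (unbuffered writes so each write() is one syscall; absolute numbers will vary wildly by machine and filesystem):

```python
import os, tempfile, time

def bench(bufsize, total=16 << 20):
    """Write `total` bytes using unbuffered writes of `bufsize`; return MiB/s."""
    buf = b"\0" * bufsize
    with tempfile.TemporaryDirectory() as d:
        t0 = time.perf_counter()
        with open(os.path.join(d, "out"), "wb", buffering=0) as f:
            for _ in range(total // bufsize):
                f.write(buf)
        dt = time.perf_counter() - t0
    return total / dt / (1 << 20)

# Sweep a few block sizes, dd-style
for size in (512, 4096, 64 << 10, 1 << 20):
    print(f"{size:>8} B buffer: {bench(size):9.1f} MiB/s")
```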

So is this any different on Linux? Why is OSX in the title?

Are syscalls more expensive on OSX?

It's interesting that if you google "how do i create a large file on osx" you get guidance for this command[1].

But yes, syscalls are more expensive on OSX.

[1]: http://stackoverflow.com/questions/26796729/quickly-create-a...

>"But yes, syscalls are more expensive on OSX."

Can you elaborate on this?

XNU implements system calls the same way a BSD system does. If you are talking about the Mach aspect of OS X, the BSD part can call down to Mach directly without using Mach messages. XNU is not a traditional microkernel even though Mach is in there.

I don't think anyone seriously expects OSX to be as performant as Linux, since it doesn't feel faster, and a lot of the problems are obviously[1] the results of the microkernel side. That being said, a little googling shows a direct head-to-head[2] comparison.

[1]: http://sekhon.berkeley.edu/macosx/

[2]: http://www.academia.edu/2685902/An_Analysis_of_Mac_OS_X_Leop...

I have spent a great deal of time studying the code path taken on system calls in xnu/osfmk[3] versus Linux[4] in building my Linux emulator for OSX[5].

[3]: https://opensource.apple.com/source/xnu/xnu-1504.3.12/osfmk/...

[4]: https://github.com/torvalds/linux/blob/master/arch/x86/entry...

[5]: https://github.com/geocar/ml/

I think many OSX supporters do the platform a disservice by defending it without spending any time studying it or the competition. OSX has some serious performance-based weaknesses, that are perhaps a real worthwhile trade for a lot of things "just working" -- especially when compared to Windows or Linux -- but are still a weakness.

I certainly wouldn't classify myself as an "OS X" supporter by any means. I don't think I've ever actually defended it :)

I was more responding to the blanket statement that system calls are more expensive on OS X.

Your first citation is ancient, by the way; that's 11 years ago now.

Your second link states in the conclusion section:

"First, Mac OSX uses a hybrid monolithic and microkernel architecture in which system calls must be wrapped into an RPC message to the Mach microkernel."

But in XNU/Darwin there are three different mechanisms for system calls - traditional BSD style traps, Mach traps, and Mach RPC.

I didn't really understand how the other 3 links related or I guess specifically what I was looking for in those.

> So is this any different on Linux? Why is OSX in the title?

mkfile is not available on Linux. The equivalent utility is xfs_mkfile or fallocate.

No, mkfile creates a file with written extents (no sparse or uninitialized areas), while fallocate will (with ext4/xfs/...) only map some extents which are marked uninitialized. So while mkfile 10G would have to write ~10 GB of data, fallocate 10G is nearly instantaneous, because it most likely just allocated a single extent.
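On Linux the fallocate path can be sketched from Python (assuming a filesystem with extent support such as ext4 or xfs; os.posix_fallocate is not available on macOS):

```python
import os, tempfile

# Reserve 256 MiB of extents without writing any data blocks --
# near-instant on ext4/xfs, unlike mkfile-style zero-filling.
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "big")
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o644)
    try:
        os.posix_fallocate(fd, 0, 1 << 28)
    finally:
        os.close(fd)
    size = os.stat(path).st_size
    print(size)   # 268435456
```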

xfs_mkfile could do what mkfile does, the description isn't conclusive enough.

You can use good old `dd` for this:

    dd if=/dev/zero of=test bs=1k seek=2m count=1
Or python:

    python -c 'f=open("test", "w"); f.seek(2*10**9); f.write("\x00")'
Both these calls create a ~2GB file with holes in it. They return almost instantly on OSX.
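You can check whether a file actually ended up sparse by comparing its logical size with the blocks allocated for it. A self-contained version of the seek trick (file name is just illustrative):

```python
import os, tempfile

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "holey")
    with open(path, "wb") as f:
        f.seek(2_000_000_000)    # same trick as the one-liners above
        f.write(b"\0")
    st = os.stat(path)
    print(st.st_size)            # 2000000001: the logical (apparent) size
    print(st.st_blocks * 512)    # bytes actually allocated; tiny when sparse
```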

EDIT: oh wait, this isn't what you want. you want a file _without_ holes, my mistake.

Yeah without hole would be

    dd if=/dev/zero of=test bs=1k count=2m

OSX doesn't support sparse files...

It does, just not at the HFS+ layer. Sparse bundles are the functional equivalent at the virtual file system layer; they are used by Time Machine and are usable by developers or via the CLI. Sparse images are entire virtual volumes and are used by FileVault.


To clarify: FileVault on 10.6 and earlier; current FileVault is FDE.

>just not at the HFS+ layer

macOS supports sparse bundle with HFS+

As I was saying, that's implemented above HFS+

I guess it is the same. Try doing a dd if=/dev/sdX of=/dev/null or dd if=/dev/zero of=/dev/sdX (be warned, the latter wipes the disk!).

Then, experiment with various buffer sizes (bs=1k, 10k, 100k, 1M) - I personally use 1M with dd.

Be warned on OS X you have to use /dev/rdiskX instead of /dev/diskX, as the latter is a buffered version that usually is slower.

You know, you can dd to a file to avoid destroying a drive. And you can provide a `count` to write fixed-size output files; dd will write bs×count bytes (technically it writes ibs×count, and bs sets both ibs and obs).


    dd if=/dev/zero of=foo bs=1k count=1m
would write 1GB in 1k blocks, and

    dd if=/dev/zero of=foo bs=1m count=1k
would write 1GB in 1m blocks.

> You know, you can dd to a file to avoid destroying a drive.

Then you're at the whim of more or less obscure caching layers in the Linux kernel. At least there's a filesystem cache plus something in the block-device layer that reads ahead in order to speed up fread calls with low buffer sizes.

Also, due to filesystem fragmentation the file will be distributed across locations on the disk - that doesn't matter much for SSDs, but on "spinning rust" the head seeks distort the performance. A dd on a raw disk, however, will not cause any seeks except those for background processes' file operations.

You can add `direct` to dd if you're concerned about the caches. And if you want to avoid seeks, just use `fallocate`. It'll do the Right Thing(tm) on CoW and non-CoW filesystems - either just creating a hole or reserving extent space, without having to write all the blocks in either case.

I think the main difference is that `mkfile' doesn't exist on Linux? As far as I can gather, it's a SunOS utility that BSD has, and therefore OS X does too.

I've read that some Linux distributions have `mkfile', but it's just a script wrapper around `dd'.

Good point, the reason is simple: I didn't check on Linux (and have updated the post to reflect this). I am guessing it would be similar, but don't have any data.

It is.

While HN is discussing XNU syscalls, anyone know why the most fundamental of syscalls, execve, according to NeXT/Apple's post-UNIX wisdom needs to have an extra "char *apple[]" argument vector?

Not to imply that it is "hidden" but I am curious if any HN users know about this and understand its purpose?

When in doubt, consult the source: https://github.com/opensource-apple/xnu/blob/53c5e2e62fc4182...

The upshot is that it looks like it constructs some non-env variable environment data on the program's stack after the posix environment.

The absolute path of the executable, followed by preemption free zone addresses, entropy, a configuration setting for malloc allocation strategy, and the address of the main thread's stack afaict.

Some years ago I picked up the habit from a predecessor of testing such things with dd instead, that way you can experiment with the effect of different block sizes, so like -

dd if=/dev/zero of=./ddtest.out bs=64k count=65536

The graph could use an X axis label. I'm assuming it's "Buffer size in kB". It would also be nice to include the datapoint you started with (250MB/s from a 512byte buffer).

That's actually written below the graph, rather than as a graph label

> X-Axis is buffer size in KB. The original 512 byte size isn't on there because it would be 0.5KB or the entire axis would need to be bytes, which would also be awkward at the larger sizes. Also note that the X-Axis is logarithmic.

Sorry, I should have mentioned that I added the comment after the parent's very valid point.

I don't understand that reasoning, 0.5 on a logarithmic base 2 scale would be the next step below 1 on the x axis (0.5, 1, 2, 4, 8), and the y axis already goes to zero anyway.

Furthermore, the X and Y axis should really have been the other way around.

Though I thought it was fairly obvious from the text this is still a good point. Added as comment underneath.

Well, there's this option (which looks like it would be the way to go for its purpose):

-n Create an empty filename. The size is noted, but disk blocks aren't allocated until data is written to them

Also the description on the man file says: mkfile creates one or more files that are suitable for use as NFS-mounted swap areas. The sticky bit is set, and the file is padded with zeroes by default

That's missing the point. The author wanted to test the write speed and found that mkfile was not a good tool to test with.

Is this something that would be affected by file system defaults? I see it writes in 512 byte chunks by default, when MacOS moves to APFS, is this something that might be altered as a result and thus get a 'free' speedup?

Interesting post, but this xkcd is really relevant here: https://xkcd.com/833/

It's a command, not a syscall!

Yes but mkfile uses syscalls to create and write to the file. With a smaller buffer more write syscalls are required to populate the file.

The article doesn't make such a claim as far as I can see.

You're right, I misread. Sorry for the noise.
