
Mkfile(8) is severely syscall limited on OS X - mpweiher
http://blog.metaobject.com/2017/02/mkfile8-is-severely-syscall-limited-on.html
======
drewg123
" I did not check on other operating systems, but my guess is that the results
would be similar."

Actually, no. Irrespective of mkfile being mostly MacOSX specific, most
syscalls on MacOSX are just plain slow compared to Linux (or FreeBSD). I
think this is partially an artifact of performance not being a major metric
for most system calls on MacOSX. The system calls where performance is a
critical metric (like timekeeping) use a kernel/user shared page interface
to avoid the syscall entirely in the vast majority of cases.

I recall doing benchmarking roughly 10 years ago to find the best interface
for communicating with a mostly OS-bypass HPC driver. MacOSX ioctls were some
multiple more expensive than Linux, but the native IOKit Mach IPC was even
more expensive than that. Sigh. There was a similar story for sockets in non-
OS-bypass mode, where simply writing on a socket was far more expensive than
on Linux.

Somebody needs to resurrect lmbench & do a comparison of the various x86
kernels available these days. Maybe that would shame Apple into focusing on
performance.
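
For anyone who wants a quick sanity check before lmbench gets resurrected, dd
alone gives a crude approximation (a much blunter instrument than lmbench, and
the counts here are arbitrary):

```shell
# bs=1 forces one read()/write() syscall pair per byte, so throughput is
# dominated by per-syscall overhead; bs=1M amortizes it away. The ratio
# between the two runs is a rough measure of syscall cost.
dd if=/dev/zero of=/dev/null bs=1 count=1000000
dd if=/dev/zero of=/dev/null bs=1M count=1024
```

Run the same pair on each kernel you care about and compare the ratios.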

Drew

~~~
cbsmith
I'm pretty sure that even at their slowest, syscalls aren't going to match IO
overhead. It's not like mkfile is CPU bound.

~~~
mpweiher
The point of the article is that I was also "pretty sure" about exactly that,
and advances in hardware performance meant I was wrong...

2GB/s is pretty damn fast, 0.5 nanoseconds per byte.

~~~
cbsmith
That's not what the article showed. It showed that synchronous IO latency was
the limiting factor. By arranging your IOs into little 512-byte chunks, you
don't make it terribly easy for the filesystem to efficiently use the
underlying device -- not surprising, given that it doesn't even want to work in
512-byte sectors, and operates more efficiently with lots of parallel IO
operations.

~~~
mpweiher
OS X does write-coalescing.

~~~
cbsmith
You can't write coalesce IO operations that haven't been called yet.

------
dom0
And that's why sane people will tell you not to use something like st_blksize
as the IO size, because that's far too small. Many tools use 32 KiB, but even
that can be limiting on Linux and needlessly uses more CPU (especially if FUSE
is involved -- reads are not coalesced at the VFS layer! Read and write
merging is done at the block device layer by the IO scheduler there!).
Something like 512k-1M is a sane default these days.
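
For reference, you can see the value in question with stat; on typical Linux
filesystems it reports 4096, orders of magnitude below the 512k-1M suggested
above (the %o format is GNU coreutils stat; the BSD/macOS flag shown is from
memory, so double-check it):

```shell
# Print st_blksize, the "optimal" IO size hint the kernel reports.
stat -c '%o' /etc/passwd        # GNU coreutils stat (Linux)
# stat -f '%k' /etc/passwd      # BSD/macOS equivalent
```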

~~~
gens
Maybe I'm misunderstanding you, but.. before each one, a "echo 3 >
/proc/sys/vm/drop_caches" is executed. The processor is in "performance" mode.

dd if=./big_file.file of=/dev/null bs=64K

535020540 bytes (535 MB, 510 MiB) copied, 8.70128 s, 61.5 MB/s

dd if=./big_file.file of=/dev/null bs=4K

535020540 bytes (535 MB, 510 MiB) copied, 8.36283 s, 64.0 MB/s

dd if=./big_file.file of=/dev/null bs=1K

535020540 bytes (535 MB, 510 MiB) copied, 8.82508 s, 60.6 MB/s

The only thing that went up is the CPU usage (and that's negligible at ~60 MB/s).

dd if=/dev/zero of=/dev/null bs=1K count=1024K

1073741824 bytes (1.1 GB, 1.0 GiB) copied, 1.06113 s, 1.0 GB/s

dd if=/dev/zero of=/dev/null bs=4K count=256K

1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.343595 s, 3.1 GB/s

dd if=/dev/zero of=/dev/null bs=32K count=32K

1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.160191 s, 6.7 GB/s

~~~
dom0
At just 60 MB/s it doesn't matter much (even 512 byte blocks would just be a
little more than 100k syscalls / second). At a couple hundred MB/s it begins
to matter, and with faster NVMe SSDs it's a legit issue.

This becomes a bigger issue if it's not a kernel-based, local FS, but
something that does more work -- I mentioned FUSE above. Since no read merging
happens, small reads like 4k or 32k start to hurt earlier, especially if the
FS is written in an interpreted language. Some of these also mishandled small
reads, incurring an additional penalty (I know because I fixed that in one
implementation).

~~~
gens
No point in speculating. I showed you how. Go test it.

Even the fastest SSD is nowhere near the speed of /dev/zero. It's easy to
extrapolate any speed from those two points (minus the filesystem overhead --
the thing we are talking about is the block layer).

I'd be interested in a test of a FUSE based fs like ntfs. So if you have one,
just run some tests. (don't forget to flush the cache)
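
If anyone does have a FUSE fs mounted, a sweep like this is all it takes (file
name, sizes, and the mount path are placeholders; pass your own file as $1, and
flush the cache between runs as above):

```shell
#!/bin/sh
# Sweep dd block sizes over a file. Point $1 at a file on the FUSE mount
# under test (e.g. an ntfs-3g volume); falls back to a local scratch file
# so the script runs anywhere.
f=${1:-/tmp/bs-sweep.bin}
[ -f "$f" ] || dd if=/dev/zero of="$f" bs=1M count=64 2>/dev/null
for bs in 4K 32K 512K 1M; do
  printf '%-5s ' "$bs"
  dd if="$f" of=/dev/null bs="$bs" 2>&1 | tail -n 1
done
```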

------
pixelbeat__
The GNU coreutils default to 128KiB buffers as per the test script at

[https://github.com/coreutils/coreutils/blob/master/src/ioblk...](https://github.com/coreutils/coreutils/blob/master/src/ioblksize.h#L23)

------
im3w1l
Oh wow, I had no idea that write buffers needed to be on the order of a
megabyte on modern systems.

~~~
fulafel
It's just OS X being slow; Linux gets close to peak with a 16 kB buffer:
[http://blog.tdg5.com/tuning-dd-block-size/](http://blog.tdg5.com/tuning-dd-block-size/)

~~~
alain94040
_Reading from /dev/zero and writing out to RAM with a more optimal block size
of 256K yields a throughput of 1,536 MB/s_

Actually, in this article, Linux writing to RAM is _slower_ than MacOS writing
to SSD. Hard to conclude that "it's just OS X being slow". I suspect the
hardware used is wildly different from the one in the original article.

------
mcguire
Does anyone know offhand the architecture of the Mach/BSD hybrid kernel in
this case? It sounds suspiciously like a problem with the IBM Microkernel back
in about 1992, and with the OSF's mkLinux, which is apparently a predecessor
of Mac OS.

Specifically, do the syscalls writing the file go through multiple protection
domains?

[https://maniagnosis.crsr.net/2011/07/this-is-comment-i-made-...](https://maniagnosis.crsr.net/2011/07/this-is-comment-i-made-on-gluster-blog.html)

~~~
Someone
_" the OSF's mkLinux, which is apparently a predecessor of Mac OS."_

Apple worked on MkLinux, but it isn't technically a predecessor to Mac OS X.
The two do not share a single line of code; if they did, Apple would have to
license Mac OS under the GPL.

XNU, the Mac OS kernel
([https://en.wikipedia.org/wiki/XNU](https://en.wikipedia.org/wiki/XNU)) isn't
a real microkernel; the functionality of a microkernel is there, but quite a
bit of code was added that, in a true microkernel, would live in userspace.

------
Matt3o12_
This still begs the question I've had since day one when learning about
buffers: what the heck is the recommended buffer size? I've seen a lot of old
code that uses extremely small buffer sizes as well as some recent code that
uses extremely large buffer sizes (up to 10MB).

What is a good sweet spot that runs well on both older hardware (less than 10
years old) and new hardware? And should network buffers be bigger or smaller
than disk buffers?

~~~
throwawayish
Network buffers are a completely different animal, and also harder to
optimize...

For disk IO buffers it's easy nowadays: just use something like a MB, which is
just fine for almost any application and doesn't stack up to much memory use
(unless you're writing many files concurrently, which can bring its own
problems as well)

~~~
yjftsjthsd-h
That actually does sound about right, but do you know how to derive that? I'm
curious how to figure it out.

------
amelius
So is this any different on Linux? Why is OSX in the title?

Are syscalls more expensive on OSX?

~~~
geocar
It's interesting that if you google "how do i create a large file on osx" you
get guidance for this command[1].

But yes, syscalls are more expensive on OSX.

[1]: [http://stackoverflow.com/questions/26796729/quickly-create-a...](http://stackoverflow.com/questions/26796729/quickly-create-a-large-file-on-a-mac-os-x-system)

~~~
bogomipz
>"But yes, syscalls are more expensive on OSX."

Can you elaborate on this?

XNU implements system calls in the same way a BSD system does. If you are
talking about the Mach aspect of OS X, the BSD part can call down to Mach
directly without using Mach messages. XNU is not a traditional microkernel
even though Mach is in there.

~~~
geocar
I don't think anyone seriously expects OSX to be as performant as Linux, since
it doesn't _feel_ faster, and a lot of the problems are obviously[1] the
results of the microkernel side. That being said, a little googling shows a
direct head-to-head[2] comparison.

[1]: [http://sekhon.berkeley.edu/macosx/](http://sekhon.berkeley.edu/macosx/)

[2]:
[http://www.academia.edu/2685902/An_Analysis_of_Mac_OS_X_Leop...](http://www.academia.edu/2685902/An_Analysis_of_Mac_OS_X_Leopard)

I have spent a great deal of time studying the code path taken on system calls
in xnu/osfmk[3] versus Linux[4] in building my Linux emulator for OSX[5].

[3]:
[https://opensource.apple.com/source/xnu/xnu-1504.3.12/osfmk/...](https://opensource.apple.com/source/xnu/xnu-1504.3.12/osfmk/x86_64/idt64.s.auto.html)

[4]:
[https://github.com/torvalds/linux/blob/master/arch/x86/entry...](https://github.com/torvalds/linux/blob/master/arch/x86/entry/entry_64.S)

[5]: [https://github.com/geocar/ml/](https://github.com/geocar/ml/)

I think many OSX supporters do the platform a disservice by defending it
without spending any time studying it or the competition. OSX has some serious
performance-based weaknesses that are perhaps a worthwhile trade for a lot of
things "just working" -- especially when compared to Windows or Linux -- but
they are still weaknesses.

~~~
bogomipz
I certainly wouldn't classify myself as an "OS X" supporter by any means. I
don't think I've ever actually defended it :)

I was more responding to the blanket statement that system calls are more
expensive on OS X.

Your first citation is ancient by the way; that's 11 years ago now.

Your second link states in the conclusion section:

"First, Mac OSX uses a hybrid monolithic and microkernel architecture in which
system calls must be wrapped into an RPC message to the Mach microkernel."

But in XNU/Darwin there are three different mechanisms for system calls:
traditional BSD-style traps, Mach traps, and Mach RPC.

I didn't really understand how the other 3 links related, or I guess
specifically what I was looking for in those.

------
gwu78
While HN is discussing XNU syscalls, does anyone know why the most fundamental
of syscalls, execve, according to NeXT/Apple's post-UNIX wisdom, needs to have
an extra "char *apple[]" argument vector?

Not to imply that it is "hidden" but I am curious if any HN users know about
this and understand its purpose?

~~~
msbarnett
When in doubt, consult the source: [https://github.com/opensource-apple/xnu/blob/53c5e2e62fc4182...](https://github.com/opensource-apple/xnu/blob/53c5e2e62fc4182595609153d4b99648da577c39/bsd/kern/kern_exec.c#L4131)

The upshot is that it looks like it constructs some non-env variable
environment data on the program's stack after the posix environment.

The absolute path of the executable, followed by preemption free zone
addresses, entropy, a configuration setting for malloc allocation strategy,
and the address of the main thread's stack afaict.

------
rollthehard6
Some years ago I picked up from a predecessor the habit of testing such things
with dd instead; that way you can experiment with the effect of different
block sizes, like so -

dd if=/dev/zero of=/ddtest.out bs=64k count=65536

------
pedrocr
The graph could use an X axis label. I'm assuming it's "Buffer size in kB". It
would also be nice to include the datapoint you started with (250MB/s from a
512-byte buffer).

~~~
masklinn
That's actually written below the graph, rather than as a graph label

> X-Axis is buffer size in KB. The original 512 byte size isn't on there
> because it would be 0.5KB or the entire axis would need to be bytes, which
> would also be awkward at the larger sizes. Also note that the X-Axis is
> logarithmic.

~~~
mpweiher
Sorry, I should have mentioned that I added the comment after the parent's
very valid point.

------
raverbashing
Well, there's this option (which looks like it would be the way to go for its
purpose):

-n Create an empty filename. The size is noted, but disk blocks aren't allocated until data is written to them

Also, the description in the man page says: mkfile creates one or more files
that are suitable for use as NFS-mounted swap areas. The sticky bit is set,
and the file is padded with zeroes by default

~~~
grigjd3
That's missing the point. The author wanted to test the write speed and found
that mkfile was not a good tool to test with.

------
snorrah
Is this something that would be affected by file system defaults? I see it
writes in 512-byte chunks by default; when MacOS moves to APFS, is this
something that might be altered as a result, thus getting a 'free' speedup?

------
JelteF
Interesting post, but this xkcd is really relevant here:
[https://xkcd.com/833/](https://xkcd.com/833/)

------
mkj
It's a command, not a syscall!

~~~
Sharlin
The article doesn't make such a claim as far as I can see.

~~~
mkj
You're right, I misread. Sorry for the noise.

