" I did not check on other operating systems, but my guess is that the results would be similar."
Actually, no. Irrespective of mkfile being mostly MacOSX specific, most syscalls on MacOSX are just plain slow compared to Linux (or FreeBSD). I think this is partially an artifact of performance not being a major metric for most system calls on MacOSX. The system calls where performance is a critical metric (like timekeeping) use a kernel / user shared page interface to avoid the syscall entirely in the vast majority of cases.
I recall doing benchmarking roughly 10 years ago to find the best interface for communicating with a mostly OS-bypass HPC driver. MacOSX ioctls were some multiple more expensive than Linux's, but the native IOKit Mach IPC was even more expensive than that. Sigh. There was a similar story for sockets in non-OS-bypass mode, where simply writing to a socket was far more expensive than on Linux.
Somebody needs to resurrect lmbench & do a comparison of the various x86 kernels available these days. Maybe that would shame Apple into focusing on performance.
That's not what the article showed. It showed that synchronous IO latency was the limiting factor. By arranging your IOs into little 512-byte chunks, you don't make it terribly easy for the filesystem to efficiently use the underlying device -- not surprising, given that it doesn't even want to work in 512-byte sectors, and operates more efficiently with lots of parallel IO operations.
Bloated syscall times wind up being a death by a thousand cuts when your workload is syscall heavy.
One simple, common, example is plain old autotools configure scripts. I have no love for autotools generated configure scripts (and in fact hate them with a passion), but I've long observed they crawl on MacOSX and go much faster on Linux. They do lots of syscalls (tons of fork/exec/open/close/stat).
Because I speak in a condescending tone that irritates people. I'm quite used to it. I fully admit it's rare that I can present my point of view in a way that is a) concise rather than a rant, and b) doesn't use phrasing that is abrasive. I should have been able to summarize my point in two sentences without negative coloring, but instead it took me three verbose paragraphs with assaulting language. I'm trying to work on it, but old habits die hard. shrug
(Also sshhhhh, it is anti-HN to discuss downvotes)
I was the same. Probably still am sometimes. The most helpful reminder to myself was that this isn't a shame or punishment thing. It's simply that most people aren't receptive when spoken to that way. So instead of feeling like I was being shamed into being polite, I just reminded myself that it's my loss if I can't communicate my thoughts, no matter what the transport mechanism looks like.
Actually it is neither forbidden nor uncommon to see a comment by the downvoter explaining their reasoning for the downvote. But complaining about being downvoted, or inviting discussion about it, is frowned upon.
I just wondered seeing as it had been flagged dead within an hour I believe.
I believe however that this is one of the fundamental flaws of Reddit-style websites. On an imageboard you are obliged to make your disagreement in words, constructive or otherwise. Here, you can passive aggressively click the downvote button and leave the post in limbo.
Well, sometimes it is better to just click to downvote, because the only other possibility that makes sense is to leave a message saying that the guy is a scumbag shithead. In such (common) cases, the simple downvote is better for the health of the thread and of the protagonists.
(I am not at all talking about the present case, let it be clear :-) )
And that's why sane people will tell you not to use something like st_blksize as the IO size, because that's far too small. Many tools use 32 KiB, but that can be limiting even on Linux and needlessly uses more CPU (especially if FUSE is involved -- reads are not coalesced at the VFS layer! Read and write merging is done at the block device layer by the IO scheduler there!). Something like 512k-1M is a sane default these days.
At just 60 MB/s it doesn't matter much (even 512 byte blocks would just be a little more than 100k syscalls / second). At a couple hundred MB/s it begins to matter, and with faster NVMe SSDs it's a legit issue.
This becomes a bigger issue if it's not a kernel-based, local FS, but something that does more work, I mentioned FUSE above. Since no read merging happens, small reads like 4k or 32k start to hurt earlier, especially if the FS is written in an interpreter. Some of these also mishandled small reads, incurring an additional penalty (I know because I fixed that in an implementation).
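For reference, GNU stat can show you what st_blksize actually reports for a file (%o is the "optimal I/O transfer size hint"); on a typical ext4 setup it's just 4096, which is exactly why it's a poor choice for an IO size:

    $ stat -c %o somefile
    4096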
No point in speculating. I showed you how. Go test it.
Even the fastest SSD is nowhere near the speed of /dev/zero. Easy to extrapolate any speed from those two points (minus the filesystem overhead -- the thing we are talking about is the block layer).
I'd be interested in a test of a FUSE based fs like ntfs. So if you have one, just run some tests. (don't forget to flush the cache)
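For the cache-flushing part, assuming a Linux box and a made-up mount point, something like:

    sync; echo 3 | sudo tee /proc/sys/vm/drop_caches    # drop the page cache first
    dd if=/mnt/ntfs/somefile of=/dev/null bs=1M count=1024
    # (on OS X, `sudo purge` does the cache flushing instead)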
Interesting... You never dd data to a USB3 disk or SSD? The default buffer size for dd on OS X is also 512 bytes. I've been using `bs=1m` on OS X for ages.
You don't know what cp is for. cp is exactly for copying data from a file to another file -- and under Unix, devices are represented as files. Using cp for writing disk ISOs to an USB drive is proper use, not abuse.
dd is for transform-and-copy, so you're abusing dd when writing disk ISOs. The fact that most of the Internet is wrong doesn't make you any less wrong.
cp is designed to copy data from one file to another. Block devices are represented as files, and cp won't be speed limited by block size.
dd is also designed to copy data from one file to another, in blocks, with optional conversion of the data (eg, from EBCDIC to ASCII), and with optional offsets/limits to copy regions of a file.
dd is not designed to copy raw block data, it is merely a different file copy tool.
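E.g. the kind of job dd actually exists for (hypothetical file names, standard dd operands):

    # copy 1 MiB starting 512 bytes into the input, converting to uppercase along the way
    dd if=input.bin of=region.bin bs=512 skip=1 count=2048 conv=ucase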
Reading from /dev/zero and writing out to RAM with a more optimal block size of 256K yields a throughput of 1,536 MB/s
Actually, in this article, Linux writing to RAM is slower than MacOS writing to SSD. Hard to conclude that "it's just OS X being slow". I suspect the hardware used is wildly different from the one in the original article.
Maybe I am reading it wrong, but I am not getting the 16KB optimal buffer size from that article. The summary at the end is as follows:
'In the above example it can be seen that an input block size of 128K is optimal for my particular setup.'
However, the SSD seems to be significantly slower, so the relative overhead of the syscall is going to be respectively less:
'Reading from /dev/zero and writing out to a SSD with a more optimal block size of 256K yields a throughput of 280 MB/s.'
I was getting 250MB/s with the 512 byte buffer size, although the test in the article is doing two syscalls (one read(), one write()) whereas mkfile is just doing one.
At 256KB buffer size I was already getting around 1.8GB/second writing to SSD.
Just saw the performance of this "consumer" device and remembered this thread: http://www.anandtech.com/show/11104/the-patriot-hellfire-m2-... . This SSD can write over a GB/s and read nearly 3 GB/s. We've come a long way since 32 KiB buffers were a sane choice.
Does anyone know offhand the architecture of the Mach/BSD hybrid kernel in this case? It sounds suspiciously like a problem with the IBM Microkernel back in about 1992, and with the OSF's mkLinux, which is apparently a predecessor of Mac OS.
Specifically, do the syscalls writing the file go through multiple protection domains?
"the OSF's mkLinux, which is apparently a predecessor of Mac OS."
Apple worked on MkLinux, but it isn't technically a predecessor to Mac OS X. The two do not share a single line of code; if they did, Apple would have to license Mac OS under the GPL.
XNU, the Mac OS kernel (https://en.wikipedia.org/wiki/XNU) isn't a real microkernel; the functionality of a microkernel is there, but quite a bit of code was added that, in a true microkernel, would live in userspace.
This still raises the question I've had since day one when learning about buffers: what the heck is the recommended buffer size? I've seen a lot of old code that uses extremely small buffer sizes as well as some recent code that uses extremely large buffer sizes (up to 10MB).
What is a good sweet spot that runs well on both older hardware (less than 10 years old) and new hardware? And should network buffers be bigger or smaller than disk buffers?
Network buffers are a completely different animal, also harder to optimize...
Disk IO buffers, it's easy nowadays: just use something like a MB, which is just fine for almost any application and doesn't stack up to much memory use (unless you're writing many files concurrently, which can bring its own problems as well)
XNU implements system calls in the same way a BSD system does. If you are talking about the Mach aspect of OS X, the BSD part can call down to Mach directly without using Mach messages. XNU is not a traditional microkernel even though Mach is in there.
I don't think anyone seriously expects OSX to be as performant as Linux, since it doesn't feel faster, and a lot of the problems are obviously[1] the results of the microkernel side. That being said, a little googling shows a direct head-to-head[2] comparison.
I have spent a great deal of time studying the code path taken on system calls in xnu/osfmk[3] versus Linux[4] in building my Linux emulator for OSX[5].
I think many OSX supporters do the platform a disservice by defending it without spending any time studying it or the competition. OSX has some serious performance-based weaknesses that are perhaps a worthwhile trade for a lot of things "just working" -- especially when compared to Windows or Linux -- but they are still weaknesses.
I certainly wouldn't classify myself as an "OS X" supporter by any means. I don't think I've ever actually defended it :)
I was more responding to the blanket statement that system calls are more expensive on OS X.
Your first citation is ancient by the way; that's 11 years ago now.
Your second link states in the conclusion section:
"First, Mac OSX uses a hybrid monolithic and microkernel architecture inwhich system calls must be wrapped into an RPC messageto the Mach microkernel. "
But in XNU/Darwin there are three different mechanisms for system calls - traditional BSD-style traps, Mach traps, and Mach RPC.
I didn't really understand how the other 3 links related or I guess specifically what I was looking for in those.
No, mkfile creates a file with written extents (no sparse or uninitialized areas), while fallocate will (with ext4/xfs/...) only map some extents which are marked uninitialized. So while mkfile 10G would have to write ~10 GB of data, fallocate 10G is nearly instantaneous, because it most likely just allocates a single extent.
xfs_mkfile could do what mkfile does; the description isn't conclusive enough.
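Easy to see for yourself (hypothetical path; fallocate needs a filesystem that supports it, e.g. ext4/xfs/btrfs):

    time mkfile 10g /tmp/bigfile        # OS X: actually writes ~10 GB of zeroes
    time fallocate -l 10G /tmp/bigfile  # Linux: marks extents uninitialized, returns almost immediately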
It does, just not at the HFS+ layer. Sparse bundles are the functional equivalent at the virtual file system layer; they are used by Time Machine and are usable by developers or via the CLI. Sparse images are entire virtual volumes and are used for FileVault.
You know, you can dd to a file to avoid destroying a drive. And you can provide a `count` to write fixed-size output files, dd will write bs×count bytes (technically it writes ibs×count, and bs sets both ibs and obs).
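Something along these lines (file name and size made up):

    # count is in input blocks, so this writes bs×count = 1 GiB of zeroes into a regular file
    dd if=/dev/zero of=test.img bs=1M count=1024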
> You know, you can dd to a file to avoid destroying a drive.
Then you're at the whim of more or less obscure caching layers in the Linux kernel. At least there's a filesystem cache plus something in the block-device layer that reads ahead in order to speed up fread calls with low buffer sizes.
Also, due to filesystem fragmentation the file will be distributed across locations on the disk - that doesn't matter much for SSDs, but on "spinning rust" the head seeks distort the performance. A dd on a raw disk, however, will not cause any seeks except those for background processes' file operations.
You can add `direct` to dd if you're concerned about the caches. And if you want to avoid seeks, just use `fallocate`. It'll do the Right Thing(tm) on CoW and non-CoW filesystems - either just creating a hole or reserving extent space, without having to write all the blocks in either case.
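Roughly (illustrative names and sizes only):

    dd if=/dev/zero of=test.img bs=1M count=1024 oflag=direct   # bypass the page cache on writes
    fallocate -l 1G test.img                                    # or reserve the space without writing any blocks at all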
Good point, the reason is simple: I didn't check on Linux (and have updated the post to reflect this). I am guessing it would be similar, but don't have any data.
While HN is discussing XNU syscalls, anyone know why the most fundamental of syscalls, execve, according to NeXT/Apple's post-UNIX wisdom, needs to have an extra "char *apple[]" argument vector?
Not to imply that it is "hidden" but I am curious if any HN users know about this and understand its purpose?
The upshot is that it looks like it constructs some non-env variable environment data on the program's stack after the posix environment.
The absolute path of the executable, followed by preemption free zone addresses, entropy, a configuration setting for malloc allocation strategy, and the address of the main thread's stack afaict.
Some years ago I picked up the habit from a predecessor of testing such things with dd instead; that way you can experiment with the effect of different block sizes, so like -
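(roughly -- the file name and sizes here are just examples, and GNU dd spells the block size bs=1M)

    dd if=/dev/zero of=testfile bs=512 count=2097152   # 1 GiB in 512-byte writes
    dd if=/dev/zero of=testfile bs=1m count=1024       # the same 1 GiB in 1 MiB writes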
The graph could use an X-axis label. I'm assuming it's "Buffer size in kB". It would also be nice to include the datapoint you started with (250MB/s from a 512-byte buffer).
That's actually written below the graph, rather than as a graph label
> X-Axis is buffer size in KB. The original 512 byte size isn't on there because it would be 0.5KB or the entire axis would need to be bytes, which would also be awkward at the larger sizes. Also note that the X-Axis is logarithmic.
I don't understand that reasoning; 0.5 on a logarithmic base-2 scale would be the next step below 1 on the x axis (0.5, 1, 2, 4, 8), and the y axis already goes to zero anyway.
Well, there's this option (which looks like it would be the way to go for its purpose):
-n Create an empty filename. The size is noted, but disk blocks aren't allocated until data is written to them
Also the description in the man page says: mkfile creates one or more files that are suitable for use as NFS-mounted swap areas. The sticky bit is set, and the file is padded with zeroes by default
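So presumably something like this returns immediately (made-up name/size):

    mkfile -n 10g swapfile    # size is recorded, but no blocks are written up front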
Is this something that would be affected by file system defaults? I see it writes in 512-byte chunks by default; when MacOS moves to APFS, is this something that might be altered as a result and thus get a 'free' speedup?