
Benchmarking OS primitives - mortenlarsen
http://www.bitsnbites.eu/benchmarking-os-primitives/
======
meuk
I can't help but start a rant about Windows upon reading this.

One of my major complaints with Windows is that things just 'feel slow'. I
have to wait _very_ often. Opening an FTP location? Wait for 5 seconds (and
it also opens in a new window, leaving the old window open in an unusable
state - very confusing). Starting a GUI? Wait for 5 seconds.

My laptop and Raspberry Pi at home both run _a lot_ smoother than the high-end
(it's a brand new Dell XPS machine with 8 GB RAM - which I consider high-end)
laptop I have at work.

I still find it hard to comprehend that people are buying _ridiculously
overpowered_ Windows computers for tasks like browsing and document editing.
Developers are at fault too - if it runs smoothly on your $1000+ machine with
32GB RAM, that does _not_ mean that the average user will be able to even use
it. Everyone and their mother is jumping on the sustainability hype, but at
the same time developers assume that everyone buys a new computer and phone
every other year, _for the same tasks we've been doing for decades_. Once you
realize this, it's hard to use a Windows system and not cringe at the mess of
laggy/unresponsive GUIs.

~~~
justin66
> Starting a GUI? Wait for 5 seconds.

You're holding it wrong.

> My laptop and Raspberry Pi at home both run a lot smoother than the high-end
> (it's a brand new Dell XPS machine with 8 GB RAM - which I consider
> high-end) laptop I have at work.

If your RPi is faster at comparable tasks than that Windows PC, your Windows
PC has some extremely serious setup problems.

(if your employer uses anything like the commercial security software mine
does, that's one potential problem)

~~~
meuk
I'd argue that it's more of a culture problem. I regularly have multiple
instances of Visual Studio and Visual Studio Code open (together with many
Windows Explorer, Notepad, Notepad++ and other windows). Visual Studio by
itself is just slow, and often crashes, even if I have only one instance open.

On Linux, I have several terminals open, which is just so much more
lightweight.

Of course, from a performance point of view, these are not comparable, but
that is exactly my point. On Windows, everything has a GUI, and everything
seems to assume a much more high-end machine.

P.S. I admit that it was a bit misleading to post this rant under this
article, since the article measures raw OS performance, while my point is that
lower performance is usually sufficient if you use tools that have a single
purpose instead of trying to be an OS in themselves.

~~~
justin66
Some of it's culture and some is certainly bloat. VS has slowed down a lot in
the last eight years, and VS Code was born slow.

But the performance between the RPi and the Windows machine shouldn't be
comparable at all, and if the RPi is coming out ahead for anything more
trivial than opening a terminal window, I'd look at the setup of the Windows
machine. There is plenty that can get effed up there.

------
drewg123
Ever notice how much slower building software with configure is on macOS than
Linux? The results here point out why: fork + exec is ~10x slower on macOS.

However, this isn't exactly new information. The general slowness of the OS X
kernel has been known for years, via other benchmarks like lmbench. It's one
of the reasons they were the first to implement a vDSO-like interface for
things like gettimeofday().
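
To see the effect in miniature, here is a hypothetical Python sketch (counts
and paths are illustrative) that times the fork + exec + wait round-trip each
configure feature probe pays for:

```python
import subprocess
import time

# Each ./configure feature probe is roughly one fork + exec + wait.
# Timing a batch of spawns of /bin/true shows the per-spawn cost that
# a ~10x slower fork/exec multiplies straight into configure time.
def time_spawns(n=50):
    start = time.perf_counter()
    for _ in range(n):
        subprocess.run(["/bin/true"], check=True)
    return (time.perf_counter() - start) / n

print(f"avg spawn+wait: {time_spawns() * 1e6:.0f} us")
```

Run the same script on Linux and macOS and the ratio tends to mirror the
article's fork + exec numbers.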

------
self_awareness
Not very objective, since the operating systems were in an unknown state,
i.e. there was a third-party antivirus installed on one Windows machine.
Under such conditions this benchmark doesn't provide any meaningful
information.

~~~
vardump
I find the article very informative from the point of view of developing
software packages installable by the end user.

Benchmarking typical environments rather than artificially lean ones is much
more helpful in practice.

------
dzdt
From experience, git is slow on Windows when dealing with tens of thousands
of files, even with an SSD. This is due to the filesystem (NTFS) being rather
slow, especially for various stat operations. (If you watch in the Windows
Task Manager, this shows up in the "other I/O" column - not reads or writes.)

~~~
arghwhat
I am not sure if it is correct to attribute it to NTFS, rather than the
Windows VFS layer. IIRC, NTFS is a reasonably sane filesystem.

~~~
blattimwind
I don't know why, but NTFS metadata performance just never was very good. I
always assumed that was due to the more complex/capable data model compared to
what Linux usually uses, but I'm not so sure about that any more, given the
complexity and yet still better performance of ZFS (btrfs remains too much of
a mixed bag to mention here).

However, poorly written applications also play a role here. For some reason
explorer.exe requires 1-2 orders of magnitude more I/O time than dir, and
somehow the local search is slower than a human doing a binary search. How
they managed to do that bad a job remains a mystery.

~~~
paulmd
Yeah, poor search performance is definitely an application problem here.
Lately I've taken to using Cygwin's find instead, and it runs just as fast as
you'd expect, e.g.

find /cygdrive/c/ -type f -iname '*somefile.tar'

------
zokier
The memory allocation test seems a bit out of place, considering that the
allocator is provided by libc and not the OS. Testing something like
mmap/VirtualAlloc might have made more sense.

~~~
gpderetta
You're not wrong, but a) at least on Unix the libc is certainly considered
part of the OS, and b) malloc has to get the memory from the OS eventually,
via sbrk or mmap.
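
Point (b) is easy to demonstrate; a minimal sketch using Python's mmap module
to take an anonymous mapping straight from the kernel, the same route a large
malloc ultimately takes:

```python
import mmap

# An anonymous mapping (fd = -1) is memory obtained directly from the
# OS -- glibc malloc uses exactly this path (mmap) for large
# allocations, and sbrk/mmap-backed arenas for small ones.
buf = mmap.mmap(-1, 1024 * 1024)  # 1 MiB, not backed by any file
buf[:5] = b"hello"
assert buf[:5] == b"hello"
buf.close()
```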

~~~
bjg
I'm not sure I agree with (a); any app that is focused on performance is
likely going to use jemalloc or one of the other high-performance allocators
(tcmalloc) and completely bypass the libc implementation...

------
stephencanon
Lots of confounds due to non-uniform hardware, etc., but more importantly
these are very artificial micro-benchmarks; systems are (ideally) tuned for
performance on the sorts of loads that they will actually be running under,
not artificial tests like "create 65k files of 32 B each".

In artificial tests like these, you frequently get the best performance by
flushing data out as fast as possible, while in most "real-world" scenarios
you have some temporal locality that makes keeping data around a win.
Optimizing for these sorts of benchmarks can actually harm performance.

Still, fun.

------
bla2
"Launching Programs" should use posix_spawn, at least on macOS; it's a
distinct syscall there and faster than fork + exec.
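
For illustration, Python exposes the same call directly (a minimal sketch;
os.posix_spawn requires Python 3.8+):

```python
import os

# posix_spawn builds the child in one step instead of fork()ing a copy
# of the parent and then exec()ing over it; on macOS it is a distinct,
# faster syscall rather than a libc wrapper around fork + exec.
pid = os.posix_spawn("/bin/echo", ["/bin/echo", "spawned"], os.environ)
_, status = os.waitpid(pid, 0)
assert status == 0  # child exited cleanly
```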

------
CJefferson
Microsoft really needs to fix Windows Defender and search indexing. I have
myself benchmarked horrible pains similar to this: 7-10x slowdowns doing
things like copying many files around. It can make the Windows Subsystem for
Linux almost unusable.

------
justin66
Benchmarking things like file creation without noting the filesystem or how it
was mounted is... interesting.

~~~
cozzyd
Yeah, the SELinux / AppArmor state would be interesting to know as well.

------
c12
Anyone else see the create file test results and think of a node_modules joke?

------
mastax
Ah yes, Windows Defender. I always forget about it until it makes some trivial
operation take 5x too long. Make sure to add your compilers and build tools to
the exclusions list.

~~~
mbitsnbites
That's the easy case. In a typical enterprise environment, a software
developer may be faced with several round trips to the centralized/outsourced
IT support to get their build folders whitelisted in the company-approved
(and forcibly installed) AV software, just to make a simple CMake run take
less than 10 minutes (something that takes about 5 seconds on a stock Linux
machine).

------
swebs
If you haven't already, I highly recommend reading "I Contribute to the
Windows Kernel. We Are Slower Than Other Operating Systems. Here Is Why."

[http://blog.zorinaq.com/i-contribute-to-the-windows-kernel-we-are-slower-than-other-oper/](http://blog.zorinaq.com/i-contribute-to-the-windows-kernel-we-are-slower-than-other-oper/)

~~~
justin66
It's awful that something as misleading as that anonymous, superficial rant is
held up as important. The only lesson there is pretty meta: you actually _can_
issue a retraction for a rant, but nobody will care.

------
snvzz
Not surprised. Note that the Linux results, while looking good compared to
what's even worse, are still terrible.

Missing are the results for the BSDs. I'm particularly interested in
DragonFly BSD. Maybe I'll try them myself when 5.2 is out, which will be soon.

------
quickben
Awesome work.

Otherwise, the filesystem bench is pointless without:

- SSD type: TLC? SLC? Are they the same or different?

- Linux filesystem type and fstab flags.

~~~
mbitsnbites
SSD types: for the machines that are identical, they are the same

Linux filesystem: stock ext4

------
blitmap
Would the Windows-equivalent of fork() be CreateProcess() ?

~~~
pjmlp
No, because the semantics aren't the same.

CreateProcess() is like posix_spawn(), or if you prefer, fork() + exec().

Windows is a thread-based OS, not process-based, hence the focus on thread
performance rather than process creation.

~~~
zokier
> Windows is a thread-based OS, not process-based, hence the focus on thread
> performance rather than process creation.

Which, somewhat ironically, leads NT to have worse numbers in the
create-thread test than Linux has in the create-process one (25.6us vs 18us).

The redeeming factor of NT is its async I/O model, which AFAIK is the best
among mainstream OSes.
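
As a rough sanity check of the order of magnitude, a hypothetical Python
sketch of the create-thread micro-benchmark (interpreter overhead inflates
the absolute numbers; only the magnitude is comparable):

```python
import threading
import time

# Average cost of starting and joining a no-op thread -- the article's
# create-thread test in miniature. Expect tens of microseconds on
# Linux, more on NT per the numbers quoted above.
def time_thread_create(n=200):
    start = time.perf_counter()
    for _ in range(n):
        t = threading.Thread(target=lambda: None)
        t.start()
        t.join()
    return (time.perf_counter() - start) / n

print(f"avg create+join: {time_thread_create() * 1e6:.1f} us")
```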

~~~
udp
IOCP is very complicated to code against, though. kqueue can do nearly all the
same things and is both much cleaner and more portable.
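
The two models differ in who performs the I/O. A minimal readiness-loop
sketch in Python (the selectors module sits on epoll/kqueue) shows the
kqueue side of the comparison:

```python
import selectors
import socket

# Readiness model (epoll/kqueue under DefaultSelector): the kernel
# reports that a descriptor is ready, and the application then performs
# the I/O itself. IOCP inverts this: you submit the operation first and
# are later handed its completion.
sel = selectors.DefaultSelector()
a, b = socket.socketpair()
sel.register(a, selectors.EVENT_READ)
b.send(b"ping")
for key, _events in sel.select(timeout=1):
    data = key.fileobj.recv(4)  # readiness: we do the read ourselves
print(data)  # b'ping'
sel.close()
a.close()
b.close()
```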

~~~
trentnelson
It's a very different paradigm to wrap your head around, but once you grok the
NT kernel's approach to I/O (packet based IRPs, inherently asynchronous,
thread-agnostic), and thread scheduling, I/O completion ports are very
powerful constructs.

The key difference is that I/O completion ports can be used to achieve
asynchronous I/O on any underlying object, e.g. files and sockets, _and_ they
have this nifty built-in concept of concurrency, such that the kernel can
ensure there is always one running thread per CPU core (which is optimal from
a scheduling perspective).

You can't get real asynchronous I/O on regular-file descriptors with
epoll/kqueue (they are always reported ready), and you certainly can't say
"ensure every core only has one active thread running".

"The key to understanding what makes asynchronous I/O in Windows special
is...": [https://speakerdeck.com/trent/pyparallel-how-we-removed-the-gil-and-exploited-all-cores?slide=54](https://speakerdeck.com/trent/pyparallel-how-we-removed-the-gil-and-exploited-all-cores?slide=54)

"Thread-agnostic I/O with IOCP": [https://speakerdeck.com/trent/pyparallel-how-we-removed-the-gil-and-exploited-all-cores?slide=62](https://speakerdeck.com/trent/pyparallel-how-we-removed-the-gil-and-exploited-all-cores?slide=62)

