
Windows vs. Linux Scaling Performance 16 to 128 Threads with Threadripper 3990X - jjuhl
https://www.phoronix.com/scan.php?page=article&item=3990x-windows-linux&num=1
======
gmueckl
There is no mention in the article of whether the software suite was vetted
for support of more than 64 threads on Win32. The API has a peculiar weakness:
by default it limits thread scheduling to a single processor group, and a
group can have no more than 64 hardware threads. To get above this limit, the
application must explicitly adjust the processor affinity of its threads to
include the additional hardware threads. MS was not in a hurry to adapt the
C++ STL and their OpenMP runtime after the basic processor group API appeared
in Vista, and I am not sure if they have managed to do it by now. Some of the
benchmark results look to me as if the missing scaling from 64 to 128 hardware
threads on Windows might be caused by this.

~~~
monocasa
It's not just the API: it's the scheduler in NT itself that won't move threads
from one processor group of up to 64 hardware threads (on a 64-bit system) to
another, so it has to be managed manually by the application if you want to
scale out farther than that on NT.

Given that it's a fundamental limitation of the NT scheduler (not present in
Linux), it seems like the fair reading is "yeah, Windows makes this way
harder, and a lot of applications won't scale the same way on Windows as their
Linux versions will", rather than "oh, that just doesn't count because they
aren't using it right".

EDIT: As an aside, this kind of thing is exactly why Linux doesn't provide
binary compatibility at the driver level, and why that's a good thing. It's
easy to paint yourself into a corner by making decisions that were perfectly
sane 20 years ago. Now NT has fundamental limitations, and they hit even
harder in kernel space, where nearly every driver out there has macros
compiled in that touched these structures. It's bad enough at the syscall
layer, but it's even worse when you can't change things because shipped code
directly modifies internal structures.

~~~
zamadatix
I don't think it's a fundamental binary-API-limitation type of thing, as the
issue does not exist in Windows Enterprise or Server. This was covered the
last time this was posted:
https://www.anandtech.com/show/15483/amd-threadripper-3990x-review/3

~~~
monocasa
> the issue does not exist in Windows Enterprise or Server.

It does. There's a maximum of 64 hardware threads in a processor group. This
is because the affinity mask in tons of internal data structures inside NT is
pointer width (so 64 bits on a 64-bit platform).

Server and Enterprise's scheduler adjustments are just about making better
decisions when balancing which processor group a new process is assigned to
at creation time.

You can read more about processor groups, and the manual work by user space
needed to manage them on all flavors of Windows that support them, here:
https://docs.microsoft.com/en-us/windows/win32/procthread/processor-groups

~~~
zamadatix
So it is, thanks for the link. Does anyone know how to access this[1] page
that is referenced towards the end of that? It comes up as "Access Denied"
with no hint as to what access is needed, but it's referenced all over the
place in these documentation pages.

[1] https://www.microsoft.com/whdc/system/Sysinternals/MoreThan64proc.mspx

~~~
monocasa
It's available on archive.org.

------
MrBuddyCasino
This reveals one weakness of the Windows development model: if something isn’t
a feature driven by a PM, it won’t happen. On the other hand, if some obscure
internal thing isn’t optimal yet, you can bet some obsessed hacker is going to
tackle it one day. How many schedulers has Linux had already?

~~~
derision
I don't necessarily consider that a weakness

~~~
MrBuddyCasino
Ok, it's a trade-off. The year of the Linux desktop is surely coming soon. I
must say I did not expect the scaling performance difference to be so large,
though.

------
t0mas88
From my experience for a long time the Windows NT (from Win2K onwards) kernel
and scheduler were actually better than Linux in several ways. That always
amazed me, because Linux was a better server OS in many other ways.

Now at 64 cores and above it is clear that the Linux developers have spent a
lot of time making the Linux kernel better. May have something to do with the
fact that a big proportion of servers in production with many CPUs/cores are
Linux servers so they started investing in this quite early?

~~~
twoodfin
It wouldn’t surprise me if Microsoft spends more time and effort on tuning and
providing OS services for the particular commercial applications that are
typically used on high core-count Windows server boxes, several of the most
prominent of which are also Microsoft products.

I suspect SQL Server has no trouble scaling to 64 cores and beyond.

~~~
Guest42
That is true. I have seen some terrible queries written on enterprise hardware
that returned at the drop of a hat without being cached ahead of time.

------
arminiusreturns
Honestly, I'm a little surprised it's as close as it is. I have consistently
hated having to deploy anything that requires lots of cores on a Windows
machine.

I have been keeping an eye on DragonflyBSD for years now; it does some very
interesting things, so this:

> Coming up next I will be looking at the FreeBSD / DragonFlyBSD performance
> on the Threadripper 3990X

has me excited.

------
adossi
I'd like to see comparisons of compilation time. I wish there were a standard
for benchmarking CPUs by compilation time. I know a compilation of the Firefox
source code is quite often used, as well as the Linux kernel; I just wish it
were more prevalent in these reviews.

------
newnewpdro
The Linux kernel has been running on "big iron" for a long time now; it would
be surprising if it weren't better prepared for scaling to 128+ cores.

linux/Documentation/vm/numa.rst states it was started in 1999. Was Windows
going anywhere near NUMA architectures back then?

~~~
the8472
Since Windows was mostly running on x86, and the memory controllers were in
the northbridge back then, even multi-socket systems wouldn't have been
affected by NUMA. Moving them on-die only happened later.

~~~
thedance
There were NUMA x86 rigs long before the memory controller moved to the CPU.
IBM xSeries and ServerWorks chipsets from around 2000 had NUMA topologies.

------
thedance
These are all embarrassingly parallel multiplication workloads. It would be
nice, for a change, if someone would run something like MySQL or a gRPC
server: a workload where it actually makes a difference how threads get
scheduled as they go to sleep and wake up, as packets arrive, and so forth.

------
lostmsu
With no clear explanation of the wildly varying results between different
benchmarks, I wonder if the analysis is flawed.

Were those programs built with the same toolchain? Could it be that some
library the lagging ones use is causing the problem?

------
lisk1
Looking at the results makes me wonder if MS is keeping separate branches of
Win 10 internally, or if some CPU-hogging services are disabled on the Win 10
Enterprise version.

~~~
wmf
Windows 10 Pro crippled the scheduler. Windows 10 Enterprise uses the same
uncrippled scheduler as Windows Server. "CPU-hogging services" don't consume
32 full cores.

~~~
lisk1
But this proves my theory that MS is keeping different internal repositories
for Win 10. Also, we know that some tracking services are disabled for Win 10
Enterprise, which leads to the logical conclusion that tracking services could
potentially limit OS I/O ops.

~~~
my123
Lol no, it's just licensing policies and nothing more.

By the way, going Enterprise -> Pro or Pro -> Enterprise doesn't need a
reboot.

~~~
NullPrefix
Hard to believe that when you saw reboot prompts after plugging in USB drives.

------
streetcat1
So, to get max perf from the Windows kernel, the software should use the
completion port API, and not regular threads/locks:
https://docs.microsoft.com/en-us/windows/win32/fileio/i-o-completion-ports

However, any software that does that will likely NOT be cross platform.

In addition, if you want to benchmark the kernel, you should run against a RAM
disk and not an SSD.

~~~
dragontamer
In the general case, maximizing performance on any platform requires
platform-specific code.

There are some decent "cross platform" platforms, such as Java or C#, which
have a better degree of performance compatibility. But if you're working at
the system level (aka: PThreads / epoll with Linux, or Windows Threads /
Critical Sections / Completion Ports), you need to use the OS-specific code to
truly reach best performance.

Java, especially with high-performance JVMs like Azul's, can be surprisingly
efficient. But achieving the best performance on Azul's Zing runtime means
using Azul-specific libraries! Once again, you're tying yourself down to a
platform.

As it turns out, performance is the hardest thing to port. You can somewhat
easily port functionality to any system and kludge things together (with
effort, your C# code can port over to .Net Mono and run on Linux). But
actually getting performance guarantees from your primitives almost always
requires platform-specific testing.

Case in point: you may make certain assumptions about the Linux scheduler,
only for the Linux scheduler to change from O(n) to O(1) to Completely Fair;
and today the sysadmin can change scheduler details to better tune for the
needs of your application. These things have an effect on performance that
makes it difficult to port between systems... or even between the SAME system
running slightly different configurations (ex: misconfigured Huge Pages on
one box).

~~~
streetcat1
Right. So in this case, what does the article compare?

~~~
dragontamer
Oh yeah, I'm agreeing with you for sure.

I don't know what to say about the article, aside from the fact that Windows
vs. Linux comparisons almost always have a degree of inaccuracy. In this case,
Phoronix are clearly Linux experts and I fully trust their Linux data.

It's very difficult to find someone who knows how to optimally compare
Windows vs. Linux, because most people only really learn one platform. I've
taken it upon myself to become a "jack" of both platforms (having neither the
expertise of a Windows expert nor of a Linux expert), so I'm better positioned
than most to see and understand cross-platform issues.

But very few people bother to learn both systems. (And frankly, most people
don't have to learn the other system, so why bother learning? You really can
make a solid career on one OS without ever thinking about the other one...)

------
pstrateman
They're using `Clear Linux 32280`, which is a distro produced by Intel.

It was presumably built using the Intel compiler, which specifically penalizes
AMD CPUs.

That would explain the advantage Windows has at low core counts.

~~~
arianvanp
Interestingly, the opposite is true. AMD performs surprisingly well on Intel's
Clear Linux:
https://www.forbes.com/sites/jasonevangelho/2020/02/12/surprise-amd-recommends-intels-clear-linux-for-best-ryzen-threadripper-3990x-performance/

~~~
yxhuvud
To the point where the benchmarks in the post get a bit misleading, as Clear
Linux will outperform Ubuntu (or Fedora, or whatever a user is more likely to
install) by a quite big margin.

