Well, Mark's post was interesting, but this is also old news now, is it not? Microsoft has shipped two OS updates since then, plus hotfixes and service packs.
The biggest difference I see is that the Linux kernel can be tweaked for specific performance characteristics at build time, whereas Windows can only be altered through runtime configuration. Depending on what you want to change, that capability may simply not exist.
Going back to the article about network latency while playing media, the breakdown seems to show that this was a bug, and bugs happen. To Linux's credit, such bugs are probably patched more quickly, and even if a fix hasn't been promoted to a release, you can incorporate someone's patch yourself. If no patch exists, though, the landscape doesn't look any different for Linux vs. Windows in that regard.
I have never seen a description or mention of a bugfix for this Vista problem. The root of the issue is that DPCs are very CPU-intensive due to the inherent design of the Windows network stack and drivers. A bug like this does not just "simply" happen, as you make it sound. The root cause is a lack of innovation in the kernel. The bloat does not get fixed. The software architecture they used to handle 10/100 Mbps of network traffic does not scale to 1000 Mbps.
While I don't know about that Vista problem, it's not true that Microsoft doesn't innovate in the kernel. Windows 7 introduced the "MinWin" kernel refactoring, which significantly reduced the OS footprint; see Mark Russinovich's 2009 talk and slides on the subject.
Windows 7 and 8 both have lower system requirements than Vista while offering more features, a fact that was widely advertised and acknowledged. Sure, not everything was improved, but it's not true that MS never fixes things for better performance. They simply have different business priorities, such as supporting tablets in Windows 8 rather than saturating 1000 Mbps links.
The OP didn't say there's no innovation at all in Windows. He just said it's slower than in the Linux kernel, and that as a consequence Windows atm lags behind Linux in performance-critical usage scenarios.
Can you please elaborate on this inherent design flaw? My understanding of NTOS DPCs is that they are quite similar to Linux tasklets/bottom-half interrupt handlers.
A simple problem is that DPCs are not scheduled in any real sense (there's only some very primitive queueing and delivery logic), and more importantly, they are not scheduled in any way that correlates with the thread scheduler's view of the system. So across four cores: if network DPCs/ISRs are abusing Core 1, but the thread scheduler sees an idle-ish low-priority thread using up Core 0 and considers Cores 1, 2, and 3 all idle (because it doesn't understand DPCs/ISRs), it will happily pick Core 1 for that thread, just because of round-robin. I'm omitting NUMA/SMT and assuming all four are regular cores on the same socket.
One could argue a better decision would've been to pick Core 2 and/or 3, but there's nothing to guide the scheduler toward that decision.
But it's not that DPCs themselves are a design flaw; it's the way Microsoft has encouraged Windows drivers to be written. You'll see that most Windows drivers have an ISR and a DPC. If you look at IOKit (the Mac's driver framework), almost all drivers run at the equivalent of passive level (IRQL 0) outside of the ISR/DPC path -- the OS makes sure of that.
Because Windows driver devs are encouraged to write ISR/DPC code, and because this code runs at high IRQL, bugs and inefficiencies in drivers cause much larger performance degradation. And when you buy a shit $0.01 OEM NIC and have to do MAC filtering, L2 layering, checksum validation, and packet reconstruction in your DPC, with no MSI and/or interrupt coalescing on top of that, you're kind of f*cked as a Windows driver.
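For anyone who hasn't written one, the pattern Microsoft pushes looks roughly like this (a bare-bones WDM-style sketch, not tested; MyIsr/MyDpc and the extension struct are made-up names):

    #include <ntddk.h>

    typedef struct _MY_DEVICE_EXTENSION {
        KDPC Dpc;
        /* ... device registers, rx ring, etc. ... */
    } MY_DEVICE_EXTENSION, *PMY_DEVICE_EXTENSION;

    /* Runs at device IRQL: acknowledge the interrupt, defer the rest. */
    BOOLEAN MyIsr(PKINTERRUPT Interrupt, PVOID Context)
    {
        PMY_DEVICE_EXTENSION ext = (PMY_DEVICE_EXTENSION)Context;
        UNREFERENCED_PARAMETER(Interrupt);
        /* ... read/ack the interrupt status from the device ... */
        KeInsertQueueDpc(&ext->Dpc, NULL, NULL); /* defer to DISPATCH_LEVEL */
        return TRUE;
    }

    /* Runs at DISPATCH_LEVEL, above every thread in the system: this is
     * where the cheap NIC's filtering/checksum/reassembly work lands. */
    VOID MyDpc(PKDPC Dpc, PVOID Context, PVOID Arg1, PVOID Arg2)
    {
        UNREFERENCED_PARAMETER(Dpc);
        UNREFERENCED_PARAMETER(Context);
        UNREFERENCED_PARAMETER(Arg1);
        UNREFERENCED_PARAMETER(Arg2);
        /* Context would be the device extension: drain the rx ring,
         * validate checksums, reconstruct packets, hand them up ... */
    }

    /* At device start: KeInitializeDpc(&ext->Dpc, MyDpc, ext); */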
Well, you can manually request that your DPC target a particular core before you insert it into the queue. So yeah, while it's possible to avoid the situation, it requires the driver writer to be aware of the problem.
Also, w.r.t. your other point, threaded DPCs do exist (since Vista) and run at PASSIVE_LEVEL.
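In code, the two mechanisms look roughly like this (a minimal sketch; the choice of core 0 is an arbitrary placeholder, which is rather the point):

    #include <ntddk.h>

    KDPC MyDpc;  /* placeholder DPC object */

    VOID SetupDpc(PKDEFERRED_ROUTINE Routine, PVOID Context)
    {
        KeInitializeDpc(&MyDpc, Routine, Context);

        /* Option 1: pin the DPC to a specific core before queueing it.
         * But which core? Nothing tells the driver which one is a good
         * choice -- core 0 here is a made-up guess. */
        KeSetTargetProcessorDpc(&MyDpc, 0);

        /* Option 2 (Vista+): initialize it as a threaded DPC instead,
         * which normally runs at PASSIVE_LEVEL in a real thread -- though
         * the system can still run it at DISPATCH_LEVEL. */
        /* KeInitializeThreadedDpc(&MyDpc, Routine, Context); */
    }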
Targeting the DPC provides absolutely no help -- how will you, as a driver, know which core to target? It's typical design-by-MSDN: provide a feature that looks useful and that people will quote, but that actually provides no benefit. Drivers targeting their own DPCs are in fact one of the leading causes of horrible DPC perf.
As for threaded DPCs, not only does nobody use them in real life, but they only MAY run at PASSIVE_LEVEL. The system still reserves the right to run them at DPC level.
Really, the only way out of the mess is passive-level interrupts in Win8... which likely nobody outside of the ARM ecosystem partners will use.
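For reference, that opt-in goes through KMDF 1.11's interrupt config -- roughly like this sketch (EvtMyIsr/EvtMyWorkItem are placeholder names; with PassiveHandling set, the ISR itself runs at PASSIVE_LEVEL and follow-up work goes to a work item rather than a DPC):

    #include <ntddk.h>
    #include <wdf.h>

    NTSTATUS CreatePassiveInterrupt(WDFDEVICE Device,
                                    PFN_WDF_INTERRUPT_ISR EvtMyIsr,
                                    PFN_WDF_INTERRUPT_WORKITEM EvtMyWorkItem,
                                    WDFINTERRUPT *Interrupt)
    {
        WDF_INTERRUPT_CONFIG config;

        /* No DPC callback -- deferred work runs in a work item at
         * PASSIVE_LEVEL instead. */
        WDF_INTERRUPT_CONFIG_INIT(&config, EvtMyIsr, NULL);
        config.PassiveHandling = TRUE;
        config.EvtInterruptWorkItem = EvtMyWorkItem;

        return WdfInterruptCreate(Device, &config,
                                  WDF_NO_OBJECT_ATTRIBUTES, Interrupt);
    }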
Well, I assume (depending on where the bottleneck is) that spreading execution across cores would reduce DPC latency. Or maybe they could use UMDF.
Though, as a user of badly written drivers, I'm totally fucked. It's too bad the OS design doesn't allow the user to control any aspect of this (well, apart from MaximumDpcQueueDepth).
Spreading DPCs across cores will lead to two possibilities:
- Typical driver dev: knows nothing about DPC importance levels and sticks with Medium (the default). IPIs are not sent to idle cores, so the device experiences huge latencies as the DPC targeted at Core 7 never, ever gets delivered.
- Typical driver dev 2: hears about this problem and learns that High/MediumHigh importance DPCs cause an IPI to be delivered even to idle cores. He then wakes up every core in your system round-robin as part of his attempt to spread out and reduce latencies, killing your battery life and causing IPI pollution.
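That dial, for reference, is KeSetImportanceDpc. A rough sketch of the two devs' choices (the IPI behavior in the comments is as described above):

    #include <ntddk.h>

    VOID QueueWithImportance(PKDPC Dpc, BOOLEAN LowLatency)
    {
        if (LowLatency) {
            /* High importance: an IPI is sent even to an idle target
             * core -- lower latency, but wakes cores and burns power. */
            KeSetImportanceDpc(Dpc, HighImportance);
        } else {
            /* Medium importance (the default): no IPI to an idle core,
             * so the DPC sits in the queue until that core wakes up
             * for some other reason. */
            KeSetImportanceDpc(Dpc, MediumImportance);
        }
        KeInsertQueueDpc(Dpc, NULL, NULL);
    }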
Now I hear you saying: "But Alex, why not always target the DPC only at non-idle cores?" Yeah, if only the scheduler gave you that kind of information in any sort of reliable way.
Really, this is clearly the job of the OS. As it stands now, targeting DPCs on your own is a "f*cked if you do, f*cked if you don't" proposition.
You do get a few more variables you can play with as a user, but changing them will usually lead to worse problems than it solves. Many drivers take dependencies on the default settings :/
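If you want to poke at them anyway: if memory serves (Windows Internals covers these), the DPC knobs live in the registry under

    HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\Kernel
        MaximumDpcQueueDepth
        MinimumDpcRate
        IdealDpcRate
        AdjustDpcThreshold

but as said, drivers tend to assume the defaults, so tweak at your own risk.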
Okay, but I'm not suggesting that spreading DPCs over multiple cores is the only solution; it's a solution in some cases. Originally I was merely responding to your point about not being able to schedule DPCs on other cores. You were speaking more generally, from the OS scheduler's POV, but I took it more literally.
Honestly... I've spent countless hours hunting down bad drivers to fix audio stutter and other crap on my gaming PC. I've finally got DPC latency under 10 µs and I'm not touching a thing :)
Yes, my point was that the OS isn't currently doing it on its own, putting the onus on driver developers -- who have limited information available to them and are pretty much unable to make the right choice -- unless we're talking about highly customized, embedded-type machines, or unless the user is willing to participate heavily in the process.
It was just a simple example of something that could change for the better in terms of performance, but that probably won't because it's a significant amount of code churn with little $$$ benefit.
I was honestly surprised that a core-balancing algorithm was added in Windows 7, but that was done at the last second (RC2) by a very senior (though young) dev who had a lot of balls and a lot of respect. Sadly, he was thanked by being promoted into management, as is usual with Microsoft.