This is one of the reasons I prefer to use Linux over systems such as Windows or OS X. I love the public discussion over the direction of technical details, the to-and-fro on the benefits of esoteric technologies.
There is not much discussion of Windows internals, not only because they are not shared, but also because, quite frankly, the Windows kernel evolves more slowly than the Linux kernel in terms of new algorithms. For example, it is almost certain that Microsoft never tested I/O schedulers, process schedulers, filesystem optimizations, TCP/IP stack tweaks for wireless networks, and so on as thoroughly as the Linux community has. One can tell just from the sheer amount of competition and interest among Linux kernel developers in researching all these areas.
Note: my post may sound like I am freely bashing Windows, but I am not. This is the cold hard truth, and countless multi-platform developers will attest to it, me included. I can't even remember the number of times I have written a multi-platform program in C or Java that always runs slower on Windows than on Linux, across dozens of different versions of Windows and Linux. The last time I troubleshot a Windows performance issue, I found that the MFT of an NTFS filesystem had become fragmented. This is to say: I am generally regarded as the one guy in the company who can troubleshoot any issue, yet I acknowledge that I can almost never get Windows to perform as well as, or better than, Linux when there is a performance discrepancy in the first place.
There isn't a single comment in that whole thread about how outrageously bad EC2 performance is. Meanwhile, I'd bet that most HN startups run on EC2, Heroku, or other virtualized cloud platforms. And how many are using dog-slow interpreted languages like Python or Ruby? It looks to me like people around here are quite willing to take very large performance hits for the sake of convenience.
I find Windows to be a small performance hit for the sake of convenience.
It is a great OS for which there exists a large corpus of high-quality developer tools. In fact, there are many domains, such as game programming, for which no other platform even comes close.
You can buy computers with Linux pre-installed nowadays too.
I used to use Windows exclusively until about four years ago. Up to the time I switched, I occasionally tested a few Linux distros, and repeatedly came to the conclusion that drivers were still an issue and that Linux wasn't ready for the average user's desktop.
Not so anymore, since about 2009. From that time on, the only thing I encountered without a Linux driver was an HP printer (there are many more, just not widespread enough to matter). Which is why I stopped using Windows altogether.
My experience since then? Windows is a royal PITA to use and maintain. Linux, with KDE as the desktop manager, isn't just faster, it's way friendlier for users. One example, which to me is huge: on Windows, you need to run the updater of every f..ing software provider from which you have purchased an app; on Linux, there's just one updater for everything. Another one: on Windows, even with all the discount programs for students and others, you have to spend thousands of dollars to get equivalents of all the apps you get for free when you select the developer workstation option in a Linux installer.
Agreed, games are the only thing that Linux doesn't yet cover as well as Windows - both development and play. However, wanna bet that Linux will tip the balance in this area too, in at most five years?
I have the same experience in an established tech company in the bay area.
Exactly the same has happened - lots of new hires. Bad management. Really silly review process. Features are valued over fixing things.
There's no mentorship process for said new hires. This has obvious flow-on effects.
The old-timers don't get promoted into management but end up fixing more and more bugs (because they're the ones who know things well enough to fix said bugs). They get frustrated and leave, or they just give up and collect a paycheck.
The management values "time to deliver" over any kind of balance with "fix existing issues", "make things better", "fix long standing issues", "do code/architecture review to try and improve things over the long run."
They're going to get spanked if anyone shows up in their marketspace. Oh wait, but they're transitioning to an IP licensing company instead of "make things people buy." So even if they end up delivering crap, their revenue comes from licensing patents and IP rather than making things. Oops.
Thank god there's a startup mentality in the bay area.
Sorry, I have never noticed Windows being slower than Linux. And when people complain, they usually have an incomplete understanding of what they're doing (nothing personal!).
The links that you posted in support of your claim are irrelevant IMO.
Compiling has nothing whatsoever to do with Windows internals. You're comparing Visual Studio, a full-fledged IDE with dozens of extra threads doing source indexing, code completion/help indexing, and dozens of other things that gcc does not do. To make a fair comparison you would have to compare just cl.exe against gcc with a bunch of makefiles (yes, you can have makefiles on Windows too).
Then your "real concrete and technical" example is actually a bug in windows vista which was fixed around 6 years ago.
And your claim about MFT fragmentation sounds bizarre, to be honest. Since Vista, the OS internally runs a scheduled task that performs a partial defrag to take care of it. I'm not sure what went wrong in your case.
I'm not saying you imagined the slowness; I believe you experienced what you said. So let's test your theory. Since you can write code that runs slower only on Windows, give us some C code that runs horribly on Windows.
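For anyone who wants to try this at home, here is a minimal, purely illustrative sketch of the kind of portable C micro-benchmark one could use. The choice of workload (small-file churn) and the file count are arbitrary assumptions of mine, and the program proves nothing by itself:

    /* Illustrative only: a filesystem-heavy loop to build and time on both
     * Windows (e.g. NTFS) and Linux (e.g. ext4).  Time it externally
     * (`time ./bench` on Linux, Measure-Command in PowerShell) so the
     * measurement method is the same on both systems. */
    #include <stdio.h>
    #include <stdlib.h>

    #define NFILES 20000

    int main(void)
    {
        char name[64];

        /* Create many small files... */
        for (int i = 0; i < NFILES; i++) {
            snprintf(name, sizeof(name), "bench_%d.tmp", i);
            FILE *f = fopen(name, "wb");
            if (!f) { perror("fopen"); return EXIT_FAILURE; }
            fputs("hello, filesystem\n", f);
            fclose(f);
        }

        /* ...then delete them all again. */
        for (int i = 0; i < NFILES; i++) {
            snprintf(name, sizeof(name), "bench_%d.tmp", i);
            remove(name);
        }

        return EXIT_SUCCESS;
    }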
Well, Mark's post was interesting, but this is also old news now, is it not? Microsoft has had two OS updates since then, plus hotfixes and service packs.
The biggest difference I see is that the Linux kernel can be tweaked for specific performance characteristics at build time, whereas Windows can only be altered through runtime configuration. Depending on what you want to change, you may not have that capability.
Going back to the article about network latency while playing media, the breakdown seems to show that this was a bug, and bugs happen. To Linux's credit, such bugs are probably patched more quickly, and even if a patch hasn't been promoted to a release, you can incorporate the fix yourself if someone has written one. If not, the landscape doesn't look any different for Linux vs. Windows in that regard.
I have never seen a description or even a mention of a bugfix for this Vista problem. The root of the issue is that DPCs are very CPU-intensive due to the inherent design of the Windows network stack and drivers. A bug like this does not just "simply happen" as you make it seem. The root cause is a lack of innovation in the kernel: the bloat does not get fixed, and the software architecture they used to handle 10/100 Mbps of network traffic does not scale to 1000 Mbps.
While I don't know about that Vista problem, it's not true that Microsoft doesn't innovate in the kernel. Windows 7 introduced the "MinWin" kernel refactoring which significantly reduced the OS footprint, see the 2009 talk & slides by Mark Russinovich:
Windows 7 and 8 both have lower system requirements than Vista while offering more features. That fact was widely advertised and acknowledged. Sure, not everything was improved, but it's not true that MS never fixes things for better performance. They simply have different business priorities, such as supporting tablets in Windows 8 rather than supporting 1000 Mbps.
The OP didn't say there's no innovation at all in Windows. He just said it's slower than in the Linux kernel, and as a consequence Windows currently lags behind Linux in performance-critical usage scenarios.
Can you please elaborate on this inherent design flaw? My understanding of NTOS DPCs is that they are quite similar to Linux tasklets/bottom-half interrupt handlers.
A simple problem is that DPCs are not scheduled in any way (there are only some very primitive queueing and delivery algorithms), or more importantly, are not scheduled in any way that correlates with the thread scheduler's view. So across four cores, if network DPCs/ISRs are abusing Core 1, but the thread scheduler sees an idle-ish low-priority thread using up Core 0 while Cores 1, 2, and 3 all look idle (because it doesn't understand DPCs/ISRs), it'll pick Core 1 for this thread (just because of round-robin). I'm omitting NUMA/SMT and assuming all of these are four regular cores on the same socket.
One could argue a better decision would've been to pick Core 2 and/or 3, but there's nothing to guide the scheduler to make this decision.
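A toy user-mode illustration of that failure mode (the names and numbers are made up, and this is of course nothing like how NTOS is actually structured); the point is just that time spent in DPCs/ISRs is invisible to a scheduler that counts only runnable threads:

    /* Toy model: pick a core for a new thread while knowing nothing about
     * per-core DPC/ISR time.  Core 1 looks idle to the "scheduler", so it
     * gets picked even though 90% of it is burned by network DPCs. */
    #include <stdio.h>

    #define NCORES 4

    struct core {
        int runnable_threads;   /* what the thread scheduler can see   */
        int dpc_load_percent;   /* what it cannot see (DPC/ISR time)   */
    };

    /* Round-robin-ish pick: first core with no runnable threads. */
    static int pick_core(const struct core cores[NCORES])
    {
        for (int i = 0; i < NCORES; i++)
            if (cores[i].runnable_threads == 0)
                return i;
        return 0;
    }

    int main(void)
    {
        struct core cores[NCORES] = {
            { 1,  0 },   /* Core 0: the idle-ish low-priority thread */
            { 0, 90 },   /* Core 1: hammered by network DPCs/ISRs    */
            { 0,  0 },   /* Core 2 */
            { 0,  0 },   /* Core 3 */
        };

        printf("scheduler picks core %d\n", pick_core(cores));
        return 0;
    }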
But it's not that DPCs themselves are a design flaw. It's the way Windows drivers have been encouraged by Microsoft to be written. You'll see that most Windows drivers have an ISR and a DPC. If you look at IOKit (the Mac's driver framework), almost all drivers run at the equivalent of passive level (IRQL 0) outside of the ISR/DPC path -- the OS makes sure of that.
Because Windows driver devs are encouraged to write ISR/DPC code, and because this code runs at high IRQL, bugs and inefficiencies in drivers cause a much larger performance degradation. And when you buy a shit $0.01 OEM NIC, and you have to do MAC filtering, L2 layering, checksum validation, and packet reconstruction in your DPC, and there's no MSI and/or interrupt coalescing, you're kind of f*cked as a Windows driver.
Well, you can manually request that your DPC target a particular core before you insert it into the queue. So yeah, while it's possible to avoid the situation, it requires the driver writer to be aware of this problem.
Also, w.r.t. your other point, threaded DPCs do exist (since Vista), and they run at PASSIVE_LEVEL.
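For reference, a minimal WDM-style sketch of the two mechanisms being argued about here. The target core and importance value are arbitrary illustrations, not recommendations, and the surrounding driver plumbing is omitted:

    /* Sketch only: set up, target and queue a DPC from a driver that keeps
     * a KDPC in its device extension. */
    #include <ntddk.h>

    typedef struct _MY_DEVICE_EXTENSION {
        KDPC Dpc;
    } MY_DEVICE_EXTENSION, *PMY_DEVICE_EXTENSION;

    VOID MyDpcRoutine(PKDPC Dpc, PVOID Context, PVOID Arg1, PVOID Arg2)
    {
        UNREFERENCED_PARAMETER(Dpc);
        UNREFERENCED_PARAMETER(Context);
        UNREFERENCED_PARAMETER(Arg1);
        UNREFERENCED_PARAMETER(Arg2);
        /* Runs at DISPATCH_LEVEL for a regular DPC; a threaded DPC
         * (KeInitializeThreadedDpc) usually runs at PASSIVE_LEVEL, but the
         * system may still run it at DISPATCH_LEVEL. */
    }

    VOID MySetupDpc(PMY_DEVICE_EXTENSION Ext)
    {
        KeInitializeDpc(&Ext->Dpc, MyDpcRoutine, Ext);

        /* The "manual targeting" mentioned above: the driver picks a core,
         * even though it has no good way to know which core is a good pick. */
        KeSetTargetProcessorDpc(&Ext->Dpc, 2);

        /* HighImportance DPCs get an IPI even to an idle target core; the
         * default MediumImportance may leave a targeted DPC waiting on an
         * idle core (the importance levels discussed further down the thread). */
        KeSetImportanceDpc(&Ext->Dpc, HighImportance);
    }

    /* Typically called from the ISR. */
    VOID MyIsrBottomHalf(PMY_DEVICE_EXTENSION Ext)
    {
        KeInsertQueueDpc(&Ext->Dpc, NULL, NULL);
    }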
Targeting the DPC provides absolutely no help -- how will you, as a driver, know which core to target? It's typical design-by-MSDN: provide a feature that looks useful and that people will quote, but that actually provides no benefit. Drivers targeting their own DPCs are actually one of the leading causes of horrible DPC performance.
As for Threaded DPCs, not only does nobody use them in real life, but they MAY run at PASSIVE. The system still reserves the right to run them at DPC level.
Really the only way out of the mess is Passive Level Interrupts in Win8... Which likely nobody outside of ARM ecosystem partners will use.
Well, I assume (depending on where the bottleneck is) that spreading execution across cores will reduce DPC latency. Or maybe they could use UMDF.
Though, as a user of badly written drivers, I'm totally fucked. It's too bad the OS design doesn't allow the user to control any aspect of this (well, apart from MaximumDpcQueueDepth).
Spreading DPCs across cores will lead to two possibilities:
- Typical driver dev: knows nothing about DPC importance levels and sticks with Medium (the default). IPIs are not sent to idle cores, so the device experiences huge latencies as the DPC targeted at core 7 never, ever gets delivered.
- Typical driver dev 2: hears about this problem and learns that High/MediumHigh importance DPCs cause an IPI to be delivered even to idle cores: wakes up every core in your system round-robin as part of his attempt to spread work and reduce latencies, killing your battery life and causing IPI pollution.
Now I hear you saying: "But Alex, why not always target the DPC only to non-idle cores?" Yeah, if only the scheduler gave you that kind of information in any sort of reliable way.
Really, this is clearly the job of the OS. As it stands now, targeting DPCs on your own is a "f*cked if you do, f*cked if you don't" proposition.
You do get a few more variables you can play with as a user, but changing them will usually lead to worse problems than it would solve. Many drivers take dependencies on the default settings :/
Okay, but I'm not suggesting that spreading DPCs over multiple cores is the only solution; it is a solution in some cases. Originally I was merely responding to your point about not being able to schedule DPCs on other cores. You were speaking more generally from the OS scheduler's POV, but I took it more literally.
Honestly... I've spent countless hours hunting down bad drivers to fix audio stutter and other crap on my gaming PC. I've finally got DPC latency under 10 microseconds and I'm not touching a thing :)
Yes, my point was that the OS isn't currently doing it on its own, putting the onus on the driver developers -- who have limited information available to them and are pretty much unable to make the right choice -- unless we're talking about highly customized embedded-type machines, or unless the user is willing to heavily participate in the process.
It was just a simple example of something that could change for the better in terms of performance, but that probably won't because it's a significant amount of code churn with little $$$ benefit.
I was honestly surprised that a core-balancing algorithm was added in Windows 7, but that was done at the last second (RC2) by a very senior (though young) dev who had a lot of balls and respect. Sadly, he was thanked by being promoted into management, as is usual at Microsoft.
One wonders if you can go to a full 'subscribe' model, which a bunch of pre-emptive OSes use, where a thread says "I need to run in x ns" or "I need to run when interrupt y fires," and then it doesn't get any cycles at all until it needs them, at which point access is arbitrated.
The challenge, of course, is priority inversion: low-priority thread A gets started on an otherwise idle processor, and then higher-priority thread B wants to run, except there isn't a 'tick' at which the processor periodically checks that the highest-priority thread is running. Now your low-priority thread is running in preference to the high-priority thread.
You can finesse some of that by interrupting on thread state changes (which has its own issues, since sometimes threads have to run to know they want to change their state), but you're still stuck somewhere between keeping ticks while threads are sleeping and going full tickless. Not surprisingly, it's kind of like building asynchronous hardware in hardware description languages.
Most context switches are actually triggered by a thread blocking. In a completely tickless OS you'd also set a one-shot preemption timer whenever a thread starts running (this used to be expensive, but Intel recently added the TSC deadline timer).
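As a rough sketch of what that one-shot arming could look like on x86-64 (kernel context only; this assumes the local APIC timer has already been put into TSC-deadline mode, and how the time-slice length is chosen is left out):

    /* Minimal sketch, GCC-style inline asm.  IA32_TSC_DEADLINE (MSR 0x6E0)
     * fires one local-APIC timer interrupt when the TSC reaches the value
     * written, and nothing in between. */
    #include <stdint.h>

    #define IA32_TSC_DEADLINE_MSR 0x6E0

    static inline uint64_t read_tsc(void)
    {
        uint32_t lo, hi;
        __asm__ volatile("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
    }

    static inline void write_msr(uint32_t msr, uint64_t value)
    {
        __asm__ volatile("wrmsr" : : "c"(msr),
                         "a"((uint32_t)value), "d"((uint32_t)(value >> 32)));
    }

    /* Called on every context switch: one interrupt when the new thread's
     * time slice expires, no periodic tick in between. */
    static void arm_preemption_timer(uint64_t timeslice_in_tsc_ticks)
    {
        write_msr(IA32_TSC_DEADLINE_MSR, read_tsc() + timeslice_in_tsc_ticks);
    }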
Doesn't it already use a preemption timer? If not, how does the kernel regain control once it schedules a piece of non-kernel code aside from that code blocking?
I haven't kept up with Linux specifically. In a tick-based OS, every tick can trigger the scheduler and the scheduler can decide to preempt. In a tickless OS you need a hardware preemption timer (otherwise the scheduler may never be invoked).
This is the key sticky bit. With hardware support for pre-emption it can work. I saw a very clever hack on a CPU doing T1 processing (the telecom protocol) that basically had four tasks that 'could' run; the engineer used the hardware data-watch breakpoints (up to four addresses), which interrupt to a vector when the data at an address changes. Combined with the serdes interrupt for line data, this basically allowed everyone to 'sleep' most of the time.
Clever system but very very very difficult to maintain.
Surely the ordinary regular tick would use exactly the same hardware to get the regular tick interrupt as the tickless one would use to get an irregular preemption tick?
(I suppose I sort of don't get why this is a big deal - theoretically, I mean. Of course changing a system as widely-used as Linux is probably a challenge itself.)
I think the issue is that you don't necessarily know beforehand when you need to preempt. For example, if you are a normal computer idling, in theory you do not need to do anything until something happens. However, you need to constantly check whether anything has happened, and the process of checking is itself doing something. With hardware support, you can skip the checking when nothing has happened.
You don't know when to preempt - but that's what the timer is for. It would be something of an emergency measure, because most of the time, most threads go to sleep for some other reason than running out of time. (Of course, if a thread regularly exceeds its time slice, the computer won't go into an idle state - but that's the right thing for it to do, because it isn't idle.)
As for checking for something happening, do you need to poll for this? I don't think it's necessary in the general case. Perhaps there is some broken hardware that doesn't support interrupts, but I have a hard time believing that's very common. Aside from that, nothing ever happens except by request, and (if structured right - not that I'm suggesting this is automatically easy) polling is unnecessary.
This allows for better power management, since the hardware has a better understanding of the state of the system. When the system is idle, the hardware is also idle and can put itself to sleep.
I suspect you can infer a subscription when a thread goes to sleep. It's probably waiting on some kind of synchronization primitive (and I guess I/O would count), or has explicitly slept for a period.
You still need a timer, I think, but it doesn't have to be a fixed frequency. It gets set to the time slice period when each new thread is scheduled, and if it fires before the thread has given up its time slice then a new thread is selected. (Ideally this behaviour should be relatively rare.)
I doubt it. Being the most popular (and open source) kernel means that anyone looking to experiment or improve in this area will almost certainly go to Linux. Combine that with the fact that Linux has major corporate support, and that many of those sponsors have an interest in the kernel being good, and it is not surprising that Linux is doing well in this area.
However, this also means that the kernel is (in some ways) optimized for server/supercomputer-type workloads, which favor throughput over latency, at the expense of desktop/smartphone use, which favors low latency over throughput.
Of course, in many cases there is a choice between approaches (either at compile time or at run time). In the case of the scheduler, however, they decided not to support multiple implementations in the mainline. This leaves us with the highly complicated implementation suited to high-end machines, and not the simpler Brain Fuck Scheduler (BFS), which shows minor improvements in desktop use.
Of course, using BFS is only a patch away, and several desktop distributions do use it.
I wrote an emulator for a Prime minicomputer (1975-1992 era). The Prime had built-in instructions for process coordination via semaphores. So when a process issued a WAIT instruction to wait on an event (semaphore), the microcode scheduler looked through the ready list to find the next eligible process to run. At the end of the ready list is the "backstop" process, the lowest priority process; it is always ready to run. The backstop process would look through other scheduling queues (just a semaphore with a list of processes waiting on it) for processes that had exhausted their timeslice, give them a new timeslice, and execute the NFYE instruction to put the process on the ready list. The NFYE instruction would actually switch to the new process, because that process had a higher priority than the backstop process.
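If it helps to see the backstop idea in isolation, here is a toy C model of just that part (nothing like the actual Prime microcode; the process names and slice lengths are made up):

    /* Toy model: the backstop is the lowest-priority, always-ready entry.
     * When nothing else is ready, it refills the time slices of exhausted
     * processes and makes them ready again (the NFYE step). */
    #include <stdbool.h>
    #include <stdio.h>

    #define NPROCS 3
    #define SLICE  2

    struct proc { const char *name; int slice_left; bool ready; };

    static struct proc procs[NPROCS] = {
        { "editor",   SLICE, true },
        { "compiler", SLICE, true },
        { "spooler",  SLICE, true },
    };

    /* Highest-priority ready process, or -1 meaning "only the backstop". */
    static int pick_ready(void)
    {
        for (int i = 0; i < NPROCS; i++)
            if (procs[i].ready)
                return i;
        return -1;
    }

    int main(void)
    {
        for (int tick = 0; tick < 10; tick++) {
            int p = pick_ready();
            if (p < 0) {
                puts("backstop: refilling time slices");
                for (int i = 0; i < NPROCS; i++) {
                    procs[i].slice_left = SLICE;
                    procs[i].ready = true;
                }
                continue;
            }
            printf("running %s\n", procs[p].name);
            if (--procs[p].slice_left == 0)
                procs[p].ready = false;   /* exhausted: wait for the backstop */
        }
        return 0;
    }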
There was a periodic timer process on the Prime, that ran at the highest priority. It was used for things like waking up driver processes periodically, in case an IO device was hung and didn't interrupt like it should have, and for waking up processes that were sleeping until a certain time passed (sleep call).
I think it was advanced for its time. If you want to read more about it, here's a link to the Prime System Architecture Guide:
And! Yes! Finally. Our 'isolated' cores are finally ours. Bare metal. No more jitter. Thank you for everyone who put that together. Your efforts are really appreciated!
The advantages of full tickless elude me somewhat. A single-percent performance gain does not seem worth the effort, so what else is there? For real-time operation tickless may make sense, but on the other hand I don't know how much the tick currently interferes with RT processes on Linux.
For mobile devices and laptops, the ability to turn off ticks on certain cores translates to huge power savings. I almost doubled my battery life on a quad core laptop when I switched from a full-tick to a dynamic-tick kernel.
It's interesting, when you think about it from that perspective.
On my system, maybe that 1% improvement doesn't mean very much.
But when you add up all the systems in the world that are running Linux, and think about how much electricity is used by them or how many personal experiences they are mediating, it really adds up into something worthwhile.
The curious question is: at what point does it stop being worthwhile? 1% is maybe worthwhile. But 0.01%? 0.5%?
There is always someone who will want to do it if only to show they can. You only need to care about "worthwhile" if you're balancing it against other concerns.
Obviously people are generally going to be motivated to smash the larger ones first. But Linux doesn't run like a centralized project where developers are directed on what to prioritize.
If you've got 100 CPU cores at or near full utilization, you've just saved yourself a CPU core. With 100 machines at full utilization, it's a full machine.
The more cores/machines you have, the more this savings means. The threshold "is it worth it?" percentage depends on how many machines saved is worth an engineer's time to do the optimization.
Presumably users will gladly accept any positive % as long as it works, so the question comes down to what are the motivations of those actually implementing (or green-lighting) the commits.
I would say that so long as there are issues that have been identified to negatively affect performance, whichever issue has the biggest performance impact should always be considered worthwhile.
Precisely. The latency of some of the more aggressive power states exceeds the resolution of most OS tick timers, so they are often aborted for no good reason. So, as I'm typing this reply, in-between keystrokes, voltage to the CPU can be killed (not just clock-gated).
And it gets even better when you can postpone several low-priority interrupts for a while until a high-priority interrupt comes in, and then you wake up the CPU and handle them all at once before putting the CPU back to sleep.
Hmm. Completely brown out the CPU between keypresses? I always heard cold starts are very, very slow (compared to regular operation). While I guess I don't have a measure for how slow, it seems unlikely that, if a user is typing at 360 CPM (60 WPM * ~7), you will be able to completely shut off the CPU much.
I'm not sure about the CPU numbers either (though my understanding is that modern CPUs are super-quick to power down and back up), but I think 360 CPM is way too high: few people type at 60 WPM, most words aren't 7 letters long, and nearly no one is typing 100% of the time they're at their computer.
Oh, I agree there is ample opportunity for shutting down the CPU. I probably spend more time just looking at the screen reading, than any other activity.
I know Linux's default scheduler tick is, depending on the platform, ~10 ms or less. How long are the latencies of the slower power-state changes?
Being tickless when idle provides significant power savings. Being tickless when active doesn't. The benefit here is to avoid trapping into the kernel an additional HZ times a second when you have a single CPU-bound task pinned to a CPU that nothing else wants to run on.
Posting a subscriber link to HN is a bit abusive, no? Quoting:
The "subscriber link" mechanism allows an LWN.net subscriber to generate a
special URL for a subscription-only article. That URL can then be given to
others, who will be able to access the article regardless of whether they
are subscribed. This feature is made available as a service to LWN
subscribers, and in the hope that they will use it to spread the word about
their favorite LWN articles.
If this feature is abused, it will hurt LWN's subscription revenues and
defeat the whole point. Subscriber links may go away if that comes about.
> This feature is made available [...] in the hope that they will use it to spread the word about their favorite LWN articles
Emphasis mine. Can you make a case for it being "abusive" in the context of the full text? I take it to mean generating an excessive number of links, not posting one link to a wide audience.
I took it as sending the occasional article to people you know, not as a way to poke a hole through the paywall and send it (via HN) to thousands and thousands of people.
FWIW I (as the editor of LWN and the author of the article) do not mind the posting of this link. It has brought in 16,000 people (at last count), many of whom are probably unfamiliar with LWN. Some subscriptions have been sold in the process.
Certainly I don't want large amounts of our content to be distributed this way, but an occasional posting that puts an LWN article at #1 on HN is going to do us far more good than harm.
I've been addicted to the kernel page since I first found out about it 10+ years ago. When they put up the paywall, it took less than a month before I caved and subscribed so I wouldn't have to wait.
I read the kernel page and front page religiously, and read bits of the others.
I can't imagine how many things I've learned following the kernel's development. But the mailing lists are HUGE and thus hard to follow. LWN makes keeping up possible.
As the submitter, this is very good to hear. This is also the first link from LWN that I've submitted to HN in a few years. Thanks for running an awesome site.
You will always have more than one process running, if only because of the kernel threads handling filesystems, networking, etc., so you can only get that if you have at the very least two CPUs. And even then you usually have more processes than CPUs, and the 'application' is split up into various processes that in turn require scheduling by the kernel.
Yes, but specialized server installations where multi-core CPUs have one process sitting on one core are still "real" loads. They're just specialized and not the common case.
Well, if you're Red Hat and are paying someone to work on this feature in Linux, then yes. [1] Irrespective of my disclaimer, many companies are investing many smart people in making Linux work well.
Anyway, I've heard rumors that Windows 8 has this, but I can't find a good source on that (admittedly, I'm spoiled by GNU/Linux, where it doesn't take much work to find this stuff).
[1] I just pieced this together, so take it with a grain of salt. Following the e-mail at [2], I guessed that Frederic Weisbecker was running this project. Googling him brought me to his GitHub profile [3], which indicates he is with Red Hat. Anyone who actually knows what is going on, please confirm or correct this.
You sure about that? I read that section and it seems to be about idle duration whereas this is about turning off the timer interrupt on non-idle CPUs.
The whole openness of the culture.