No nuances, just buggy code (was: related to Spinlock implementation) (realworldtech.com)
540 points by shaklee3 on Jan 5, 2020 | 274 comments



It's important to read Malte's response to Linus:

https://www.realworldtech.com/forum/?threadid=189711&curpost...

Followed by Linus's response to Malte:

https://www.realworldtech.com/forum/?threadid=189711&curpost...

The long and the short of it:

- It's nearly impossible to make meaningful benchmarks for this because reality will almost always be different and more complicated

- Calling sched_yield() is almost always a mistake

- Any scheduling optimization is probably a waste of time because either your running env will change, or the kernel will change, or someone else's optimizations will clash with yours.

- Even with decades of research into locking, everyone (including the experts) continually gets it wrong.

- If you're not running a specialized embedded setup where you control all of the running processes and other such variables, just use sleep locks and call it a day.


Basically Linus is right.

One caveat is that calling sched_yield is a fantastic throughput optimization for the spinning portion of an adaptive mutex. If you have parallel code that sporadically contends for short critical sections then my data has always said this:

- something like a futex lock is optimal

- you want barging, not fairness

- spinning the CPU is a questionable idea even if you do it for a short time. The pause instruction does something like jack and shit. Busy waiting of any kind seems great for microbenchmarks sometimes but it’s pretty bad for anything complex.

- spinning 40 times while calling sched_yield prior to futex_wait (or whatever) appears optimal on many schedulers and benchmarks. It’s hard to tell why. Best I can tell it’s because in the large parallel apps I’ve worked on and benchmarked, there are often threads that are ready to do some useful work right now, so yielding from a thread that can’t get a lock to one of those is a system throughput optimization. If there are no such threads then it’s a bit like spinning the CPU.

The 40 yields thing has a funny story for me. I learned of it from a comment in the JikesRVM code that was more than a decade old claiming it was optimal on their custom scheduler on a 12 core POWER box running AIX. The benchmark cited was something I’d never heard of. When I rewrote how JikesRVM did thread scheduling, I was sure that this would need tuning. Nope. It’s optimal on Linux and Darwin on x86 on basically any Java benchmark that has lock contention. I later implemented the same heuristic in WebKit’s locks. It’s still optimal.

To be sure, there is a broad plateau of goodness around 40. Like, 50 is about as good. But if you go order of magnitude in either direction, performance tanks. I’m superstitious so I’ve stayed with exactly 40 whenever I have to write this code.
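
For the curious, here's a rough sketch in plain C of the shape I'm describing (this is not WebKit's or JikesRVM's actual code; the three-state encoding is just the usual toy futex-mutex convention, and the names are made up):

    #include <linux/futex.h>
    #include <stdatomic.h>
    #include <sched.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #define SPIN_LIMIT 40   /* the empirically "good enough" yield count */

    /* 0 = unlocked, 1 = locked, 2 = locked with (possible) waiters */
    typedef struct { atomic_int state; } adaptive_lock;

    static long futex(atomic_int *addr, int op, int val) {
        return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
    }

    void adaptive_lock_lock(adaptive_lock *l) {
        for (int i = 0; i < SPIN_LIMIT; i++) {
            int expected = 0;
            if (atomic_compare_exchange_weak(&l->state, &expected, 1))
                return;         /* barging fast path: we just took it */
            sched_yield();      /* maybe run a thread that can make progress */
        }
        /* Give up on spinning: mark the lock contended and sleep in the kernel. */
        while (atomic_exchange(&l->state, 2) != 0)
            futex(&l->state, FUTEX_WAIT_PRIVATE, 2);
    }

    void adaptive_lock_unlock(adaptive_lock *l) {
        if (atomic_exchange(&l->state, 0) == 2)      /* someone may be asleep */
            futex(&l->state, FUTEX_WAKE_PRIVATE, 1);
    }

The only part that matters here is the structure: a CAS fast path, a bounded yield loop, then a real kernel wait.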

I wrote a lot more about this stuff once: https://webkit.org/blog/6161/locking-in-webkit/


Interesting. A Win32 CriticalSection is also a two step primitive that spins for a short amount of time and then yields to the system scheduler.

It would be interesting to see a distribution of wait time per lock for different locks. If such a two step approach performs well, a hypothesis might be that there are three kinds of locks:

- mostly uncontended: lock succeeds without spinning

- held for exceptionally short amounts of time: the lock usually gets released again before a waiting thread on a different CPU stops spinning. Wait time is lower than the time it takes to yield to the scheduler and/or reschedule.

- held for a really long time: wait time is much longer than scheduling another thread and re-scheduling the initial one.

And what you are saying seems to indicate that optimizing for the first and second one is worth it because the overhead is negligible for the third type; that is, there are few locks held for an amount of time where spinning adds noticeable overhead.


Do you have a reference for this? AFAIK the win32 critical section (like any modern mutex implementation) first uses atomic instructions to check if anyone is already in the critical section so it's really fast if no one is, and otherwise falls back to the OS synchronization objects.


The information is right there in the documentation:

https://docs.microsoft.com/en-us/windows/win32/sync/critical...

The paragraphs about InitializeCriticalSectionAndSpinCount and SetCriticalSectionSpinCount describe the behavior. IIRC, the default spin count used to be pretty high (1000 loops or so). Not sure if that was changed.
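
For reference, a minimal sketch of that API in use (Win32 C; 4000 is purely an illustrative spin count, not a recommendation):

    #include <windows.h>

    CRITICAL_SECTION cs;

    void example(void)
    {
        /* Spin up to 4000 times on a multiprocessor machine before falling
           back to a kernel wait; on a uniprocessor the spin count is ignored. */
        InitializeCriticalSectionAndSpinCount(&cs, 4000);

        EnterCriticalSection(&cs);    /* spins, then blocks if still contended */
        /* ... short critical section ... */
        LeaveCriticalSection(&cs);

        /* The spin count can also be adjusted after the fact: */
        SetCriticalSectionSpinCount(&cs, 400);

        DeleteCriticalSection(&cs);
    }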


Interesting. There's also that anecdote about the heap manager using a spin count of 4000. I wasn't aware this happened by default (I didn't see any mention of this in InitializeCriticalSection). I guess it's all down to the probability of contention vs. the amount of time the mutex is held.


Your description and their description are not mutually exclusive.


Spinning vs. a single check...


I don’t have as much experience with the Win32 scheduler. But I remember getting conflicting data about the profitability of yielding. In particular, lots of threads yielding on Win32 can cost you a lot of overall system perf, if I remember right. Not so on Linux or Darwin, on the same workloads.


I don't have too much experience with the Windows scheduler, either. I've written most of my multithreaded code exclusively for Linux.


All of these little details vary dramatically depending on the exact CPU and workload. I've developed a wide variety of scheduling strategies and have used neural networks to predict when a given strategy will be better. Scheduling is a giant non-deterministic mess with no ideal answers.


Ok but my data disagrees with you. Specifically: when apps get complex enough, the differences between CPUs and schedulers wash out in the chaos.

I’ve tested this over the course of a decade on multiple Linuxes, multiple Darwins, multiple x86 flavors (Intel, AMD, and lots of core counts and topologies), POWER, and various ARMs, and on many large benchmarks in two very different languages (Java and C/C++). In Java I tested it in two very different VMs (JikesRVM and FijiVM). I think the key is that a typical benchmark for me is >million lines of code with very heterogeneous and chaotic locking behavior stemming from the fact that there are hundreds (at least) of different hot critical sections of varying lengths and subtle relationships between them. So you get a law of large numbers or maybe wisdom of the masses kind of “averaging” of differences between CPUs and schedulers.

I’d love to see some contradictory data on similarly big stuff. But if you’re just saying that some benchmark that had very homogeneous lock behavior (like ~one hot critical section in the code that always runs for a predictable amount of time and never blocks on weird OS stuff) experiences wild differences between CPUs and schedulers, then sure. But that just means there are no ideal answers for that scenario, not that there aren’t ideal answers for anyone.


Linus is backed up by Josh Simmons on Twitter. He does some perf tracing and also tweaks Malte's "utterly terrible benchmark" to be "about twice as fast".

https://twitter.com/dotstdy/status/1213578864842162176


The most important point is:

- you cannot implement a functioning spin lock in user space (for all practical intents and purposes, if you don't resort to extreme measures, and talking about modern general-purpose operating systems)


You can, but it requires a special architecture for the application. A cpuset can be used to keep the scheduler from using a core, and then you can pin application threads to cores, a single thread per core.

Only then does it make sense to use spin locks.
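
A minimal sketch of the pinning half of that setup, assuming the cpuset reservation itself has already been done outside the program (the core number is made up):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    /* Pin the calling thread to one core that the general scheduler has been
       kept off of. Only with this kind of setup does a userspace spinlock
       become reasonable: the lock holder can no longer be preempted by
       arbitrary other work on that core. */
    static int pin_self_to_core(int core)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }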


Reserving CPUs for your application definitely qualifies as "extreme measures".

In general you should expect userspace code to be scheduled out at any time and use locking primitives that the kernel knows about.


This is still a disaster with hyperthreading where you can have two threads contending for a lock on the same physical core.

Even if your thread won't be scheduled off the core I don't know if you might still have irqs stealing time during the critical section.


Can't you isolate cores from irqs with irq affinities? I'd be more worried by SMM stealing CPU time (ugh... talk about sad stories of 'real-time' Linux...).


Depends on what you mean by "functioning". You can make one that functions in the sense that you can lock between concurrent threads. However, it's difficult to do right and even more difficult to do in a way that is performant (and the interface to the kernel is so ill-defined for this that whatever efficient solution you come up with will likely prove to be far less efficient on other systems with just subtle alterations in the underlying kernel implementation).


I like the end:

> Lots of users, all of which are basically buggy-by-definition, and all you can do is a bad half-arsed job of trying to make it "kind of work".

> Reality is messy.

> Linus


They're talking past each other. Malte is really saying that yield() does what he wants on Windows and BSD, and Linus is giving a bunch of excuses for why Linux doesn't work that way.


I’m pretty sure that the Linux behavior makes sched_yield more suitable to the 40-yields optimization I use for locks.

I don’t buy claims about lock behavior based on microbenchmarks. Microbenchmarks have wildly different behavior from one another so it stands to reason that they have wildly different behavior from real code.

Linus’s excuse is that if you use a real lock then shit is good. That’s a pretty good excuse IMO.


Yeah Linus is correct here, he's just not directly addressing why it works on some other schedulers.


One reason they might act that way: They have nowhere near the amount of NUMA or cache locality optimization work the Linux kernel possesses. Thinking of the 2nd-generation Threadripper issues in Windows in particular.


That's quite likely. I think that the sched_yield behavior is throughput-optimized, which is consistent with what you get when an OS is server-optimized like Linux.


Agreed.


There's no excuses, the behavior of the scheduler is documented and straightforward. Up thread a twitter user [0] noted that you can somewhat replicate the behavior of windows by replacing sched_yield() with sleep(0). The original benchmark skips this detail and just assumes the C++ wrapper will exhibit the same behavior on all OSes with all schedulers, which is clearly wrong.

[0]: https://nitter.net/dotstdy/status/1213582199615901696#m


The point is Windows and BSD isn't doing "what he wants," because the benchmark in question isn't producing any meaningful data for two reasons. First, it's not even measuring what Malte said he was measuring. Second, the benchmark assumes a single threaded HW model, and optimizing for it would very likely slow down performance for modern real-world workloads.


Ignore the benchmark for a moment; the original issue is that a real program (the game Rage 2) uses spin locks, and experiences really terrible hangs on Linux and not on the other kernels. What he wanted was for that to not happen. The upshot is that there really is no way to get spinlocks to not hang on Linux. Now, there's a very good argument that spinlocks are just dumb, and therefore that real game is broken.


EXACTLY! I wish someone would just explain why the behavior is different. Does linux only sched_yield on the same NUMA node ? (Linus' reply seemed to imply that). Windows will not pin threads unless you set processor affinity, AFAIK, so maybe that's the difference.

There are runnable threads that are not being run. It would be nice to know why.

Tell a person their code is wrong and you may have saved them from one mistake. Tell them why their code is wrong and you will save them from a dozen.


> Does linux only sched_yield on the same NUMA node ? (Linus' reply seemed to imply that).

Perhaps Linus' concern is that sched_yield doesn't guarantee any particular behavior, it just happens to be implemented that way now.


> Does linux only sched_yield on the same NUMA node ?

> There are runnable threads that are not being run. It would be nice to know why.

More extreme. It only applies to the same core, more or less. The original benchmarks are on a one-node CPU, after all.


sched_yield does seem like it should have a pretty obvious meaning: do whatever you would do if the preemption timer went off right now. Whether it's a good idea or not, I don't see why it's considered "undefined".


I haven't read Linus' discussions, but often what happens when you take that approach is:

1. Scheduler looks through all the processes on the system, decides A is the one that should run right now

2. A decides it doesn't want to run, calls yield.

3. Scheduler looks again, thinks A is still the right one to run, and runs it again.

That is, in most cases, if the scheduler thought there was a better process to run, it would already have been running.


Please read Linus' discussion.

He states (paraphrasing) - it's undefined because there's no real way (unless in a realtime scheduler) to find what to yield to without instigating complexity.

The search itself is the quest that the kernel would rather not do at all. Step 3 is undefined and thus sched_yield itself is undefined.


GGP had a simple solution: "yield is the same as a preemption tick". I was trying to briefly explain what usually happens when you define things that way. I'm not advocating one thing or another.


That is somewhat surprising to me. My idea of the canonical use case for yield is interprocess communication: process A sent a message to process B, and it wants to yield the remainder of its scheduling quantum to process B because process A has nothing left to do. Although maybe this use case is satisfied by something other than sched_yield.


Yes, something else. Linus is pointing out that sched_yield doesn’t allow specifying where to yield control to, so it is a bad way to transfer control to a particular task.


Why would it yield to process B and not one of the hundreds of other processes running?


In the case of the benchmark, at any point in time there might be 2 background processes that aren't asleep.

So let's look over our cores:

6 of them are churning through processes that do nothing but call sched_yield. Nothing has been waiting to run for more than 3 microseconds.

2 of them are doing real work, and the benchmark threads that happen to be on the same core have been waiting to run for most of a millisecond.

If you're stuck, then B is one of the processes that has been waiting.

So the obvious change would be to move threads around in sched_yield. Then B will get scheduled within microseconds.


How is it possible that there are only 2 background processes? Any normal Linux system has hundreds maybe even thousands of processes running.


Two processes that aren't asleep.

What's your CPU use right now? I bet it's less than 200%. Or will be if you exclude your web browser.

The typical background process is active far less than one percent of the time.


That doesn't matter; that still doesn't guarantee yielding the thread will allow another specific process to run. Especially not if that process is soft-locked to a different core.


> That doesn't matter; that still doesn't guarantee yielding the thread will allow another specific process to run.

Right, but you would expect the "switch to another thread" function to eventually pick the one that's sitting there with a growing and growing time since it last got a chance to run.

It would be easy to guarantee that if any thread has been waiting for at least x amount of time, then any context switch without long-waiting local threads will pick that thread. (If it's on the same NUMA node, at least.) But in the actual implementation that time can build up to a huge amount with no effect, even as thousands of yield calls go by.

> Especially not if that process is soft-locked to a different core.

Yeah, that's the problem, it's getting locked to a core even though other cores are desperate for threads to run.


The best way to do that is with a blocking write. You're sending the messages using a mechanism such as a pipe, right?

Pipes have a small buffer. Once it's full, the caller will block, and the scheduler will know precisely what to schedule to unblock it.
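
A tiny sketch of that kind of handoff (names made up, error handling omitted):

    #include <stddef.h>
    #include <unistd.h>

    int fds[2];   /* fds[0]: read end (consumer), fds[1]: write end (producer) */

    void setup(void)
    {
        pipe(fds);
    }

    void producer(const char *msg, size_t len)
    {
        /* Blocks once the pipe buffer is full: the kernel now knows exactly
           which task it has to run (the consumer) to unblock us. */
        write(fds[1], msg, len);
    }

    void consumer(char *buf, size_t len)
    {
        /* Blocks while the pipe is empty; being woken up is a directed
           wakeup rather than a sched_yield lottery. */
        read(fds[0], buf, len);
    }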


"there's no real way (unless in a realtime scheduler) to find what to yield to without instigating complexity"

What about just temporarily increasing the process' nice value?

That should be simple to do, and the scheduler already knows how to yield to processes with lower nice values.


Well, the scheduler has the same problem when the preemption timer goes off, doesn't it?


What's the postcondition of "whatever you do when the preemption timer goes off"? What's the observable side effect for the program?


Well, it's a hint (but then so is `nice`), so it doesn't have any strict postcondition. The meaning of the hint would be that the process won't make good use of CPU cycles at present, and it would be preferable to yield to any other processes now, rather than being preempted later when it is busy.


I think I would state it like this: Yielding is communication with the OS. And communication with the OS is observable.


Why is calling sched_yield() a mistake? What's the correct way to yield CPU without calling sleep?


If a thread can do no more for the moment, then either it is finished, or it is waiting for something to happen. In the latter case, the best thing is to be woken on that event, and the next best is to poll for it. Sleep is only the best solution when that event is literally the passage of time.


You may want to read Linus's answer about it: https://www.realworldtech.com/forum/?threadid=189711&curpost...


You really should just read the linked article, but the tl;dr is that sched_yield() does not convey intent (except in narrow realtime scenarios that you should not be using in the first place).

Tell the kernel why you want to yield CPU. Work with the kernel, not against it. Do you want to yield because you genuinely want to wait until some future point in time? Use sleep. Is it because you're waiting for some I/O? Use poll or its variants. Is it because you're waiting for a signal from another thread? Use mutexes, condition variables, etc. that call futex under the hood.
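
For instance, the "waiting for a signal from another thread" case looks roughly like this with a condition variable, which boils down to futex wait/wake on Linux (names here are made up):

    #include <pthread.h>
    #include <stdbool.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
    static bool ready = false;

    void wait_for_work(void)
    {
        pthread_mutex_lock(&lock);
        while (!ready)                        /* no spinning, no yielding: */
            pthread_cond_wait(&cond, &lock);  /* sleep until signalled     */
        ready = false;
        pthread_mutex_unlock(&lock);
    }

    void publish_work(void)
    {
        pthread_mutex_lock(&lock);
        ready = true;
        pthread_mutex_unlock(&lock);
        pthread_cond_signal(&cond);           /* directed wakeup of a waiter */
    }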


Futex wait, or any other blocking syscall.


sched_yield means “yield to any thread if it’s easy to do so or do nothing”. That meaning fits the requirements of some algorithms, particularly if you call it a finite number of times before doing something more drastic like a blocking syscall (like futex wait, lock acquisition, whatever) that will block until the condition you’re waiting for is fulfilled.

Think of it as a speculative optimization on top of busy waiting, not as a way to control the scheduler.


It's explained in the linked post.

Why don't you want to call sleep?


Because it would act as an even worse sched_yield.


I guess I can't really see the use case here. If you want to sleep, it's usually because you're waiting for something: file descriptor, semaphore/mutex, or a certain amount of time to pass. If none of these conditions apply, can't you just continue to run and let the scheduler do its work? (maybe with an appropriate nice setting if you are doing "less important" work).

The scheduler itself uses fixed time interrupts for its basic functions, so "sleep(fixed_time_interval)" is what you will get in the end anyway on a preemptive system.


Calling sleep(1) instead of a thread yield is faster - see the Twitter thread for proof.


Still worse: it's confusing code, and it's also confusing that it's faster.


Not in every case, but both are certainly suboptimal compared to a directed wakeup.


Ok, so how you would write this (at least the first section, explaining the error) constructively could be much shorter.

"The author of that post has a common misconception, which is that spin locks are a fast drop in replacement for uncontended mutexes. They aren't. You can only use them well when you know you won't be descheduled, like we do in the kernel. In a generic userspace program, you don't have that knowledge. The possibility of your time slice ending skews all the time measurements, for example."

Boom. Roughly the same message. Most people will get the same amount of edification from this as Linus' original letter. You could go in to more detail if you want but it's not really necessary. Smart people will figure it out if you just point out the general shape of the error. No need for calling code "garbage."


I personally really appreciated the additional details in Linus's version.

When I read your version, I learn that spin locks are not a drop-in replacement for uncontended mutexes. That's great, except that I don't really know for sure what either of those terms mean, and also I didn't gain any knowledge that doesn't relate to this specific situation. It also wasn't super convincing -- I could just as easily see someone post a reply saying "well actually spin locks ARE a drop-in something something mutex" and I'd have no way to tell who was right.

Linus, on the other hand, told me about the mechanism by which the problem happens. I learned about how the scheduler sees spin-locks' loops, and that the main advantage of "real" mutexes is that the scheduler knows who's waiting on what. I also wouldn't necessarily have realized the "take clock measurement, then get descheduled before you release the lock" issue if he hadn't called it out specifically.

Sure, I guess "garbage" comes off a little mean, but "the author has a common misconception" almost sounds more passive-aggressive to me.


The first section of Linus's answer reads really rough to me and almost made me stop reading, imagining that it was going to be yet another entry in the "Linus bullies some poor coder into submission by insulting them until they just stop responding", which is a shame because there's a lot of interesting content after that. I guess I invite the people who had the same knee-jerk reaction I had to give the post a 2nd chance and power through Linus' unnecessarily harsh tone to get to the good stuff.

>Sure, I guess "garbage" comes off a little mean, but "the author has a common misconception" almost sounds more passive-aggressive to me.

The latter might be slightly passive-aggressive (although it doesn't have to be), the former is simply aggressive-aggressive.


As long as Linus is blaming the work done and not the people behind it, that's fine. Something truly aggressive would be "This benchmark is garbage and the people responsible for writing it are monkeys on crack". He did such things in the past, but in the 4 emails I could find on this specific benchmark, none of them criticizes the people writing the code.

I think Linus got upset that a badly designed benchmark got popular and is therefore spreading misconceptions. His strongly worded answer is getting popular too, and as a result everybody learns and he has stopped an urban legend from spreading. A success on every point.


>As long as Linus is blaming the work done and not the people behind it, that's fine.

I think your comment is just wrong and generally pure garbage. I can't believe some poor server will have to host this nonsense for years to come. It's the most inane thing I've read all year.

I'm just talking about your comment, not you though, so clearly no hard feelings here and it's definitely the right way to frame my argument to have a nice, peaceful and constructive technical discussion.


No, reductio ad absurdum is not an argument. What you wrote is an exercise in swearing. What Linus wrote is getting his point across.

It is certainly possible to describe code with bad words without hurting the feelings of the person who made it, and I think, in order for it to work, everyone has to understand that everybody writes “garbage” or “crap” code sometimes, so when it happens, it’s not a big deal; it shouldn’t hurt your professional dignity.

On my job, in team of 4, we do it from time to time and it’s just much mentally easier, streamlined - and yes, constructive to conduct code reviews this way. I understand this may not work for other, bigger teams, but it’s certainly a valid way.


There were ways for him to communicate those same ideas less negatively.

And it was more than just the "garbage" line. It was also things like "If I haven't convinced you of that by now, I don't know what I can say.", and "See how simplistic - and self-centered - your expectation is?".

Writing this way doesn't "streamline" anything. I interpreted it as a negative interaction, and clearly others have as well including the original author (since he said as much in his response).


> Writing this way doesn't "streamline" anything.

Exactly. Blameless and non-violent language can streamline communication. It does this by removing any trigger that would make your communication partner feel defensive. Needlessly emotional and charged language is sure to distract from the subject at hand. When you use calm language, there is only one thing to focus on: the content. When you call someone's code garbage, there are now two things that that person could focus on -- the content and the tone.

To me, it is completely reasonable that the original author might have made a mistake like this. After all, when he shared his findings it seemed like he had found something interesting to me! Many people on this very site commented positively on his efforts. There is nothing "garbage" about the code that he wrote, it just doesn't quite do exactly what he thought it did.


> To me, it is completely reasonable that the original author might have made a mistake like this. After all, when he shared his findings it seemed like he had found something interesting to me! Many people on this very site commented positively on his efforts. There is nothing "garbage" about the code that he wrote, it just doesn't quite do exactly what he thought it did.

It was blogged about, so carries some authority on the topic, and was convincing to readers, but the code did nothing close -- not just "not quite exactly" -- to what it was portrayed to be doing in the benchmarks or article. It is garbage; useless; should be binned. I tried to come up with better adjectives -- "dangerous", "useless", etc -- but they came up short, in subtle ways. Who cares; Linus explains what he means better than the one, concise word.


Speech is not violence.

It always contains some information payload. If you felt triggered by it and failed to parse the content then maybe you're overly emotional.

For people who aren't, it's going to be a useful assessment of how bad this code was.

I see this ongoing trend (Twitter) that by limiting the ways in which people can express themselves, you can in theory broaden the recipient circle (to include people who otherwise would be offended and wouldn't read).

In practice you not only limit the ways in which information can be distributed but often also barter away some people who weren't offended and would get useful information this way.

Looking at speech this way is also a bad idea because today's acceptable is tomorrow's unacceptable word and the only way to allow all information is to allow all speech.


All I'm saying is that there is a spectrum of ways to communicate about things like this, from "fuck you you piece of shit how dare you put out this garbage code" to "hey, bud, I see what you're trying to do. But there's a mistake here, which I can explain." Linus chose something closer to the left than the right. You can get the same technical ideas across either way, but as you move toward the latter mode of communication you will decrease the number of recipients who have trouble receiving your message because of its negative emotional content.

"maybe you're overly emotional." Feel free to label me however you please. It does not change the fact that if your goal is to communicate ideas efficiently, my way will help you avoid alienating some folks, and has no practical downside.


There's a big difference between ugly ad hominem comments and saying that your code is bs.

> and has no practical downside

That's where we differ in opinion. It has one direct downside: preventing some people from expressing their opinions and conveying the point. And another in the long run – setting a precedent of what can and cannot be said.

> Feel free to label me however you please

It was a general note, not something directed at you.


In my group, we call it "shoddy" code. A bit less brash than "garbage" I think.


If you had followed with multiple paragraphs describing in detail why the comment is pure garbage, why it has to be called out, offering context, background information and possible alternatives - then you'd have a valid analogue and a basis for debate. Just sayin'.


Believe it or not, it's one way among others to formulate a critique, and I'm fine with it. However, your strongly worded answer lacks arguments, so color me unconvinced.


Except your attempt is not exactly analogous to the situation discussed.


Critique doesn’t have to be peaceful to be constructive


> As long as Linus is blaming the work done and not the people behind it, that's fine.

Certainly, ad hominem attacks are worse than making substantive attacks, but is that really the standard we should judge it by? Instead of a sprawling, visceral thread, wouldn't it be better if he just explained the reasons why the approach is bad in a professional manner?

Writing like this just seems sloppy and non-professional. It's rambling and unclear. It could have been far more effective if written point-by-point with far less emotion.


Calling out non-professionalism and bashing a stranger's communication style are, perhaps ironically, nice examples of non-professional behavior.

I would answer your rhetorical questions with an emphatic "no." People's emotions and mental state are important. It's also important that we communicate, understand, and think about people's interior worlds.

Question for you: What else is Linus communicating with his writing style? Does he show that he cares? About what? What can we tell about his values? Are these values we want in a leader?

Sometimes it's good to take off the prescriptivist hat and try on Honest Curiosity.


If what you're measuring are random delays due to default scheduling and a flawed code implementation, calling it "garbage" should give the author and others pause to think. In some cases, one needs to establish a boundary, so as not to waste time and effort on an endless discussion built on flawed premises.

Explaining why and educating about the concepts is the better approach, and it was there as well. Short and concise would be best, while limiting the negative to exactly what need be pointed out: a spade is a spade, after all.

If it stops people from developing "their own userspace spinlocks", it may even be kind of necessary to be quite direct.


> Calling out non-professionalism and bashing a stranger's communication style are, perhaps ironically, nice examples of non-professional behavior.

I don't see how this is the case. One can critique a style of communication professionally just like any endeavor. I'm not "bashing" Linus, or even suggesting that he should be less honest. Just that there are far more effective ways to make the same point.


> yet another entry in the "Linus bullies some poor coder into submission by insulting them until they just stop responding"

Has that ever happened? My impression was that he only ever takes a rough tone with high profile coders and is nice and polite to "poor coders". Also, he never insults them personally, but says that their code/argument is bad and that they are smart enough to know that already. Hence, they should stop making excuses and fix it already.


I still think that this tone is counterproductive, but that might be personal preference. I hate working in an environment where I know I can end up on the receiving end of some public humiliation if I voice a dissenting opinion or make a mistake. That's how you end up with people censoring themselves and not pointing out when somebody is actually making a mistake, because they don't want to suffer the abuse.

This is especially inexcusable coming from Linus because it's not like he needs to resort to these tactics to be heard or make an impression, he's one of the most respected software devs in existence (and for good reasons). You could remove all the abuse from his posts and you'd lose exactly nothing IMO.

But this discussion is hackneyed, I know many people love Linus' style and think it's perfectly acceptable and even a good thing, I'm just definitely not one of them.


Sigh

It’s sad that there are people who think calling garbage code garbage is inappropriate. What goes on in their minds? Don’t they smell, or call a smell a smell?


On at least one memorable occasion [0], he expressed surprise that the developer(s) responsible for some piece of code had survived to adulthood, considering they "were likely too stupid to find a tit to suck on" as a baby. I'm not super aware of the context here, but at first glance it doesn't appear to be a high-profile coder with lots of kernel experience that he's talking about.

[0] https://lkml.org/lkml/2012/7/6/495


I disagree with this sentiment. Linus' answer not only describes the immediate bug, but also gives enough detail and context that other, similar potential bugs related to the scheduler may be avoided after having read the entire post. Also I don't think Linus is being egregiously impolite in his post; he isn't calling the author of the post garbage. Calling the code garbage is not a personal affront, but an objective statement to dissuade others from being misled.

If this was a truly idiotic / trivial mistake I doubt Linus would put much thought or time into such a long post. I also think that if you are going to write a blog post, you are implicitly supposing a place of authority and therefore opening yourself to criticisms. In other words, writing a blog post that is incorrect (in this case, very incorrect) is actively harmful as it misleads others, and it is important to call out wrong posts when necessary. I've posted dumb advice with a know-it-all voice before too (see [1] for an embarrassing HN comment) and I am not at all angry/annoyed that I was similarly called out.

[1] https://news.ycombinator.com/item?id=20017832 (edit: wrong link)


> No need for calling code "garbage."

In my opinion calling the code garbage is much nicer than "The author of that post has a common misconception", which sounds like a passive-aggressive way to call someone dumb. Everyone writes bad/'garbage' code once in a while.


>"The author of that post has a common misconception", which sounds like a passive-aggressive way to call someone dumb

Not at all. It's a polite way to say they're ignorant of a particular thing. It would be rather insecure of them to read into it and think they're being called dumb across the board.


When you say "the code is bad/wrong/garbage", that's a statement about the code. It gives the person who wrote it an out: they can say "Oh, right, I totally knew that, just slipped my mind for a second", or "This is why I shouldn't write blog posts about mutexes at 3 AM" or something. I've certainly written garbage code at times, and you probably have too -- it doesn't mean you're dumb or you don't know things.

When you say "the person who wrote this code has a common misconception", that's a statement about the person. You've jumped to a conclusion about the reason the person wrote the bad code. You're telling the person to accept that they don't know as much about mutexes as you, that they're ignorant, that they're a worse programmer (albeit in a fairly narrow category).

This is why I think GP's wording comes off worse even though it's superficially more polite.


I get where you're going with that, but I would much rather someone tell me I hold a common misconception instead of saying my code is garbage. As much as the ideal is to remember that you are not your work, I still feel that my work reflects on me and my intelligence.

Giving someone an out is a great thing to do, especially for people for whom saving face is important. But I don't think "this code is garbage" does that. Something like, "that code is measuring timing incorrectly because X", conveys the same information without being inflammatory. "Garbage" is a value judgment, while just pointing out that it's doing something incorrectly is merely stating a fact.


I'd appreciate if "this code is garbage" was at the end, so it would be (and feel more like) a conclusion.

Saying that stuff at the beginning only clutters the reasoning, as one should not "just believe" that that code was "indeed garbage", just because he said so.


That's a fair point, and moving it to the end would likely make the whole thing read better overall, but really... it's just not necessary to include it at all. It doesn't add anything to the discussion or critique, and can only serve to detract, so why bother?


On the other hand, I wouldn't get offended if someone said a piece of my code indicated I was ignorant of something. If I was ignorant, I'd fess up to it no problem. Or if I had reasons for intentionally doing it a different way, I'd share and let someone school me if they had a point on why I was wrong. Ego sensitivity only causes problems. It never solves them.

Additionally, _sometimes_ offering someone an out to save their ego actually does them a disservice.


You all are involved in like a 100 comment thread about his tone rather than the content, so that should maybe suggest that he didn’t get his point across well.


This is all in response to a blog post titled ”Measuring mutexes, spinlocks and how bad the Linux scheduler really is”. Based on the content and tone of that post, I think Linus’s tone is fair.


I'm in favor of Linus's version. They are not the same message to me. He clearly explains what would really happen behind the scenes.


He probably just didn't take very kindly to the Linux scheduler being called "bad" in the blog post, when tons of people have spent an immense amount of time on it, and there are in fact good reasons for it behaving the way it does. If I were in the same seat, I'd be pretty stingy with the politeness as well.


> The possibility of your time slice ending skews all the time measurements, for example."

If you were forced to summarize the entire first third of the email into ten words, that might be valid. As-is, it's way too short to be a proper explanation. It's more of a reminder note. If someone doesn't know exactly what you mean at a single glance, it's useless. Linus has an actual explanation in his email.


It's important to realize that you are not your code. Linus' reply comes off as passionate but he clearly cares enough to write such a detailed response.


At least it's not "total fucking braindamaged garbage." Progress?


Your version lacks depth and I appreciate Linus's response.

Imo, he can call something garbage if the mentioned garbage makes false and bold claims about his and the contributors' decades of work.


Yes, Linus can still be abrasive, but he's gotten a LOT better than he was before his road to Damascus moment.


I immediately thought of his lkml email where he reflected on this kind of thing: https://lore.kernel.org/lkml/CA+55aFy+Hv9O5citAawS+mVZO+ywCK...


In CS "garbage" has a bit different sounding, like in "garbage collector". Nonetheless, original poster started it, writing an article "Why Linux scheduler is bad". Next time he'll see it coming.

By the way, Linus' answer is much more informative than yours. I got a lot of useful insights from it and had fun just reading his lively language.


> In CS "garbage" has a bit different sounding, like in "garbage collector".

It's a bit disingenuous to suggest that Torvalds meant it in that sense, though. He very obviously meant it in the sense of a piece of trash that should be thrown into a trash bin.


Or a piece of code that doesn't do anything useful or meaningful and may as well be skipped by the compiler.


Compiler warning: Sourcecode classified as "garbage". Replaced with NOP for user protection.


Garbage code sounds like a fine new class of undefined behavior to add to the next C standard. You just need to teach ubsan to understand it.


No, he very clearly meant the definition I suggested.

(Also I've never heard of dead code referred to as "garbage code".)


>Boom. Roughly the same message. Most people will get the same amount of edification from this as Linus' original letter. You could go in to more detail if you want but it's not really necessary. Smart people will figure it out if you just point out the general shape of the error. No need for calling code "garbage."

People should be shamed for being ignorant. It stops ignorance from spreading.


I guess we should shame you for being ignorant about human beings and the value of being kind. There are plenty of ways to point out ignorance without being unkind (and the grandfather comment is a nice attempt).


You just did, and got to pretend you didn't.


Ignorance is the lack of knowledge, it doesn't "spread" like a disease. Misconceptions can, but without the original author's mistakes we wouldn't have this rebuttal by a very knowledgeable actor, so overall this is a net positive including for you and me.


A misconception is a form of ignorance.


And so is not reading the second half of comments.


Your "correction" literally explained the comment it was correcting as being true.

Reacting in a haughty manner doesn't stop you from being wrong, it just makes you a dick.


This all sounds very good, and I grant that Linus knows what he's talking about, but didn't this whole thing start with a bunch of game developers that noticed massive stalls when porting games to Google Stadia? Stalls that, presumably, didn't happen on other platforms (Windows and PlayStation, using a variant of FreeBSD).

If these kinds of spinlocks are so fundamentally misguided, why do they only cause problems on Linux?


Game developers running software that requires a system tuned for interactive latencies get burned by a system optimized for throughput, and proceed to blame the system.

I’m more for a polite, passive-aggressive suggestion of incompetence instead of in-your-face blaming; otherwise the response was quite on point.

On the plus side, the original author now has plenty to read about and we here get to learn interesting things by proxy.


AKA the Linux scheduler isn't intended for desktop environments and applications.

Certainly that same observation drove the development of BFS / MuQSS, which unsurprisingly behaves better on his artificial benchmark than the mainline scheduler.


As shown in the original benchmark, using proper mutexes is near optimal and there is little need for kernel tuning (although still possible).


Linux can be tuned for low latency, it's a kernel setting.


Either that's not sufficient or the Google Stadia people didn't even bother tuning their systems for gaming, which I doubt.


Maybe they did some tuning but missed this thing.


Moral of the story... don’t ship games on Linux?


No, the moral of the story is don't use spinlocks in userspace and use established locking libraries that suit your needs. You can't expect any operating system to optimize for poorly written applications when you have thousands of them screwing up in all different ways.


The irony is that because of Android, another platform Google owns besides Stadia, Linux is already the world’s biggest gaming platform.


iOS is the biggest gaming platform, on $$$ per user, including some exclusive hits, with several frameworks and tooling tailored to writing games.

Android is such a pain to develop games for that Google only started an Android Games initiative in 2019, to appease the growing complaints from game studios.

An NDK that feels like a 20% project; many game-related APIs (e.g. image codecs, networking, files, permissions) are Java only and require JNI even if implemented in C++; frame rate setup; build infrastructure regarding NDK libraries; C-only APIs for what are actually C++ classes, forcing wrapping code for RAII and such; Vulkan feels like a code dump that everyone has to manually integrate into the project vs. Metal tooling; graphics debugging and real-time audio are outsourced to GitHub projects not directly supported on the official Android site; network debugging is only supported for the Java layer; ...

Finally, the only Linux that Android has is the kernel, which Google could replace tomorrow and user space wouldn't even notice.

Linux fans keep failing to realise how little Linux Android actually exposes.


> iOS is the biggest gaming platform, on $$$ per user, including some exclusive hits, with several frameworks and tooling tailored to writing games.

Other than "Apple Revenue", in no other world is this a metric of "biggest" in the context of even the most "squint hard enough" contexts of this thread.


Then by that measure, the Web is the biggest gaming platform, where the Linux kernel doesn't have any role at all regarding Web browsers.


Are you sure? I think people play way more often on android than they do on the browser.


No, and actually Windows, Mac and consoles still rule, with mobile being a mere 38%.

https://www.wepc.com/news/video-game-statistics/


So iOS is the biggest market, while android apps have more play time (also are messier, more of a pain..)?

Plausibly more time is spent gaming in browsers, or not. My guess would be Android apps, possibly browser games. AFAICT the link only covers paid games, mostly counting by revenue, but I didn't look hard.


Worldwide 2019:

PC 25%

Consoles 26%

Phones 34%

Tablets 11%

Browser 4%

Since Android has 76% of the mobile OS market, I'd say that Android is the main gaming platform of the world. And the web is least used for gaming.


I wonder what magic turns 34% into 76%.

This is even before starting the whole discussion that market share doesn't translate into gaming market share, nor into the most profitable market share, or even native vs. web views vs. mobile browser usage.


.76*(.34+.11) = .342

That's greater than .26 (consoles), .25 (pcs).


> the only Linux that Android has is the kernel

The only Linux in any Linux-based operating system is the kernel.


Except for the little fact that Android is not Android/Linux, but rather just Android, as far as userspace is concerned.

Any kernel will do the job required by userspace, that is, running ART, ISO C, ISO C++ and the set of Android native APIs.


No, not "except that". Linux is just a kernel.

Yes, Google could replace Linux but they would also have to port drivers for every device Android runs on to that kernel.


> Yes, Google could replace Linux but they would also have to port drivers for every device Android runs on to that kernel.

Or just virtualize the whole kernel and Android runtime via a new kernel-based hypervisor for compatibility/migration purposes and switch everyone to Fuchsia/Zircon.

Google doesn't care about writing the drivers for other/older phones for manufacturers, as they already have a hardware division to do it themselves for their phones, laptops and tablets. If Android were to be replaced by something else (Fuchsia), the vendors would be writing the drivers instead if they want to support Fuchsia.

Also, they don't need to "port drivers for every device Android runs on" since they would just start with their own devices first and the other manufacturers will eventually write drivers for their devices to support Fuchsia.

The problem is that they won't be required to open them up unlike Linux. But I guess the benefit to Google is that they have full control rather than waiting for one person to approve their changes.


Except that from a locking latency perspective, virtualization isn't a great choice. As Linus explains, guest operating systems run on virtualized CPUs which can be de-scheduled while holding a spin lock. This is very much like what happened with the original benchmark in Malte's blog post, except that the application and operating system are now replaced with guest operating systems and hypervisors, respectively. AFAIK, the Linux kernel makes heavy use of spinlocks, making this a problem for people who care a lot about scheduling performance. This isn't to say that people should avoid virtualization, though. There are likely to be many other bottlenecks to consider before most people have to worry about this kind of thing.


> Or just virtualize the whole kernel and Android runtime via a new kernel-based hypervisor

Nokia gave that a go, back in ~2009 if my memory serves me correctly. They actually had L4 (with certain security extensions) running on some of their prototype ARM boards. Linux, as an OS kernel, was then running on top of that.

It didn't get much traction, and I heard very little of it from 2010 onwards.


I suspect if they could "just" (!!) virtualise, they would have already. Modern phones are amazing but they have limits.

Google undoubtedly has the singular oompf to launch a brand new OS, I just haven't seen a reason yet. The few things they say they want to improve upon aren't limitations in Android (or Linux). Third party vendors and their chip makers are going to —perhaps rightly— view this as an Emperors New Clothes upgrade and resist. They're going to have to make a business case somewhere along the line.

Let's wait for a production release before we start the Android/Linux death knell.


But they already are with Fuchsia.


Like Fuchsia.


I assume you mean Zircon —formerly Magenta— Fuchsia's kernel.


Android is Android/Linux, just not GNU/Linux.

And sure, Google can change the kernel, but that would be a lot of work. The same could be said of many Linux distros. For example Debian GNU/Hurd is an experimental version of Debian swapping Linux for Hurd.


We're measuring by revenue per user?

It's got to be something more expensive than that. Hololens starts at $3500, so it's probably a good candidate for instance.

Which kinda gets the point across that revenue per user matters very little.

If we're limiting ourselves to software revenue and ignoring device cost, look towards VR headsets. Practically every user is buying expensive games.

If you only want to include things with built in processors as platforms, look at Oculus quest in particular.


I believe by total revenue iOS is bigger — not per user (although that’s implied), but I could be wrong.


> Android is such a pain to develop games for that Google only started an Android Games initiative in 2019, to appease the growing complaints from game studios.

This is actually irrelevant; Android could be the worst environment to develop for, but since it has the biggest user base it still has people wanting to develop for it.

> Finally, the only Linux that Android has is the kernel, which Google could replace tomorrow and user space wouldn't even notice.

This is not really true, especially for games. For one, since most games use the Android NDK, you can't simply swap the kernel and have everything magically work. You will at least need a recompilation, and the new kernel's API must be bug-for-bug compatible with Linux or things will break in strange ways.

For another, there are things like the spinlock/mutex/scheduler discussion in this very topic that, as shown, can make a big difference. It could be better in a kernel optimized for latency, or it could be much worse, making games that today are playable on low-end devices a complete shit show.


There is no kernel API for userspace among the officially supported NDK APIs. If you call any Linux API, you are the one to blame if anything changes, even across devices.

No, it doesn't have the biggest gaming userbase, as shown in the market analysis report I just posted.


> There is no kernel API for userspace among the officially supported NDK APIs. If you call any Linux API, you are the one to blame if anything changes, even across devices.

Yeah, I never said that you should directly access the kernel API using the NDK (though you actually can). I said that just the fact that the NDK is compiled to native code means that the code depends on the Linux kernel, and changing the kernel would mean, at minimum, a recompilation. The Android NDK only guarantees an API, not an ABI, and those two things are very different.

So no, it is not like changing the kernel would have zero impact on the Android game market. It would actually have a huge impact, since almost every game would stop working until recompiled (games that do not use the NDK would probably be fine, however few games are not native). And of course, for backwards compatibility purposes you would need to ship multiple binaries (however, this is already true nowadays since Android supports multiple architectures besides ARM).

> No, it doesn't have the biggest gaming userbase, as shown in the market analysis report I just posted.

I never said it was the biggest gaming userbase, I said it was the biggest userbase. Even if Android doesn't have the biggest gaming userbase, the fact that you have a userbase as big as Android's means people will develop games for you.


Google Play already does binary translations from ARM into x86.

Also the whole point of ISO C, ISO C++, OpenGL, Vulkan, OpenSL is to be OS independent.

Game studios are not a charity, they develop games for profitable markets, not for large userbases that don't pay for games.


> Google Play already does binary translations from ARM into x86.

Isn't that libhoudini [1]? AFAIK, libhoudini is not something that Google Play embeds in the binary for distribution; it is a library that ships on Android x86 devices (and, another AFAIK, it is proprietary from Intel, so it only ships on Intel devices).

So while Google could have a similar approach if they want to ship another kernel for Android, the work is not as trivial as you seem to think.

> Also the whole point of ISO C, ISO C++, OpenGL, Vulkan, OpenSL is to be OS independent.

Yeah, it doesn't mean you don't need to recompile your software. You can't take a game written for Windows in OpenGL and run it on Linux without recompiling. That is my point.

> Game studios are not a charity, they develop games for profitable markets, not for large userbases that don't pay for games.

Well, Android seems to be pretty profitable? It is not the most profitable platform sure, however it sure makes enough money because almost everyone ends up creating games for both iOS and Android.

You seem to imply that making games for Android is not profitable, however it sure seems to be. Otherwise Epic Games, EA, Activision, and many other big-name game studios are a charity.

[1]: https://commonsware.com/blog/2013/11/21/libhoudini-what-it-m...


Having to recompile doesn't make applications kernel-dependent; or would you consider porting such applications from macOS to Linux to Windows to be kernel-dependent?!?

And the point wasn't that it isn't profitable as such, rather that it isn't the money maker that some here are making waves about, but a wannabe versus the revenue made across desktop, consoles and iOS.

Hence why Google introduced the Android Game SDK last December (which is quite basic as of now), and Google I/O 2019 was the first time Android ever had a games track, versus WWDC, where iOS/macOS have had them forever.


> Having to recompile doesn't make applications kernel-dependent; or would you consider porting such applications from macOS to Linux to Windows to be kernel-dependent?!?

So you concur with me that if Google changed the Android kernel it would be a big change? Just imagine: you would have to recompile all applications using the NDK to work with this new Android. It is not trivial; sure, they can make it work, but it would probably take quite some time, with Google making it obligatory for developers to ship both Linux and new-kernel binaries.

And even then, you can't make sure that everything will work. A small change in the behavior of a specific syscall may break a random application somewhere that depends on it, because even if that syscall is not directly used, the Android NDK uses it to implement its API. Now multiply this by the number of NDK applications that Android has and you will have a huge amount of breakage.

So yeah, Google can make it work, however it is not as trivial as you think.

> And the point wasn't that it isn't profitable as such, rather that it isn't the money maker some here make it out to be; more of a wannabe next to the revenue made across desktop, consoles and iOS.

Sure, we can agree on this point. However the market for Android gaming is still more significant than you seem to think (not a wannabe by any means, but smaller than iOS, sure).


> So you concur with me that if Google changed the Android kernel it would be a big change?

No I don't; anyone writing portable applications already recompiles all the time.

It is no different from the requirements to provide newer builds targeting specific Android versions, or to go 64-bit only, which Google keeps using to force devs to move forward.

It is Google's and the OEMs' job to ensure that NDK APIs work across whatever kernels are available.

Any application that depends on syscalls isn't portable across Android devices anyway. Google doesn't support syscalls as part of the NDK stable APIs contract, and some OEMs are well known to change their behavior.


> It is no different from the requirements to provide newer builds targeting specific Android versions, or to go 64-bit only, which Google keeps using to force devs to move forward.

That is basically saying that this is a lot of work for everyone involved (developers, Google, OEMs). This is exactly my point.

> Any application that depends on syscalls isn't portable across Android devices anyway.

The application does not necessarily call a syscall directly; the problem is that the NDK calls a syscall as an implementation detail (for example, for lock primitives the NDK must eventually call the Linux lock primitives too, there is no way around it).

So if Google were to change the Android kernel, they would need to reproduce the same quirks of the current syscalls, or the slight change in behavior will break random applications.

My point was never about calling syscalls directly, although for sure some applications do call syscalls directly too.
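
To make that implementation-detail point concrete, here is a minimal sketch (not NDK code; the structure is illustrative) of how a userspace lock on Linux typically bottoms out in the futex syscall, whose exact semantics any replacement kernel would have to reproduce:

  // Minimal futex-backed lock sketch, assuming Linux with C11 atomics and the
  // libc syscall() wrapper. Real implementations are subtler (see Drepper's
  // "Futexes Are Tricky"); the point is that a "portable" locking API ends up
  // depending on the precise behavior of a Linux-specific syscall.
  #include <linux/futex.h>
  #include <stdatomic.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  static atomic_int lock_word;  /* 0 = unlocked, 1 = locked */

  static void lock(void) {
      int expected = 0;
      while (!atomic_compare_exchange_weak(&lock_word, &expected, 1)) {
          /* Sleep in the kernel until lock_word stops being 1. */
          syscall(SYS_futex, &lock_word, FUTEX_WAIT, 1, NULL, NULL, 0);
          expected = 0;
      }
  }

  static void unlock(void) {
      atomic_store(&lock_word, 0);
      /* Wake one waiter, if any. */
      syscall(SYS_futex, &lock_word, FUTEX_WAKE, 1, NULL, NULL, 0);
  }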


A market analysis measures the market, i.e. $$$, not the user base. Which part of the report is on the userbase?


I posted the link, everyone can read it.


How much Linux is Android, though? Does Android feed back into Linux? This argument about Linux being the most popular consumer OS or gaming platform thanks to Android sounds like Syria taking credit for the iPhone because Steve Jobs was Syrian by bloodline.

"You know, although we are consumed by sectarian fights and iron-fisted leaders that will crush anyone having opinions, It is important to note that worlds one of the worlds most successful contemporary technology companies is actually Syrian".


From a game's perspective, Android is very much like Linux, actually.


Don't use userspace spinlocks on Linux.


Ironically, things turned upside down with Linux scheduling over the last 15 years.

All major OSes had PhD CS people making their schedulers "efficient" around 2000, able to sustain 98%+ MP loads. Linux was an odd case, with a very naive and "inefficient" scheduler that left the CPU 20-30% idle unless you really knew how to hack around that. Linux, however, came with nearly no stalls or latency spikes.

Things have changed dramatically now. Linux got the attention of the entire industry, and years of holy wars about scheduling.

You can now hit 95%+ under multithreaded load easily, but even on the desktop you can now see how a block in an I/O-hog app can visibly freeze your system for a few seconds.

Now the trend has reversed again, and everybody is praising conservative scheduling that doesn't give out too much CPU :-/


> but even on the desktop you can now see how a block in an I/O-hog app can visibly freeze your system for a few seconds.

Or worse. I ran into an issue when copying an image from an NVME drive to a USB drive using `cat` (e.g. `cat somefile.img >/dev/sdX`) where mouse performance (also a USB device) became so sluggish that the desktop was unusable until the copy was complete.

This wasn't the only instance where background I/O brought my desktop to its knees. In another instance it was copying a large file across the network (Gigabit Ethernet) to a file server. The network was saturated and the desktop was unusable. I think the remote system was the destination, but my recollection could be in error. The network interface is a PCI device and there was no USB device involved in this transfer.

This is on Debian Stable using the Gnome desktop. The non-standard portion of the system is that it runs off a ZFS filesystem. I don't know if that matters.

I'd like to know where to look and how to diagnose this issue if/when it crops up again. Interrupt storm? USB driver issue? Some other issue resulting in some resource starvation? (In neither case did overall CPU load appear high.)


I'd guess the apps were waiting for I/O, not CPU.

Don't know the reason, but lots of unwritten I/O on a (slow) drive can block processes waiting for I/O on an unrelated drive.

By default, dirty buffers are limited to a percentage of RAM, so lots of RAM and slow I/O can make for long pauses. I usually tweak sysctls on desktops to limit dirty buffers. E.g.

  # start background writeback once 64 MiB of pages are dirty
  vm.dirty_background_bytes = 67108864
  # throttle writers once 128 MiB of pages are dirty
  vm.dirty_bytes = 134217728


A task that does, say, 10% I/O and 90% CPU will not get enough CPU time if there is an I/O hog running in parallel. You either ionice everything to oblivion, or bear with this.

IO blocks are almost always CPU blocks too on anything that does even minimal IO. And the more simultaneous IO, the more CPU overhead there is.

Try compiling something big with -j 32 or more.


When the system is I/O busy, processes block (more) waiting for I/O completion. I/O-blocked processes wouldn't know what to do with a CPU, so they won't have one assigned. Are you saying something different?

Decreasing the dirty-memory limit decreases flush times, which decreases process write/flush stalls, which decreases process/UI hangs and pauses. Presumably it also lowers throughput in some cases, but that's better than having UI hangs.


Is that a CPU scheduler issue though?

Linux traditionally has an overall system-architecture problem in that userspace isn't set up to provide policy information such as "this thread is doing UI" vs. "this is a background task" by default.

However, if I help the scheduler along by nice(1)-ing my big compute tasks, then the desktop remains perfectly smooth even while the CPU is 100% busy.
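
As a purely illustrative aside, a batch job can also deprioritize itself from C, which is roughly what nice(1) does on its behalf. A minimal sketch, assuming Linux/glibc (the function name is made up):

  #include <stdio.h>
  #include <sys/resource.h>

  /* Lower our own priority so interactive threads win when the CPU is saturated. */
  static void make_self_background(void) {
      /* Niceness 19 is the lowest priority; same effect as running under `nice -n 19`. */
      if (setpriority(PRIO_PROCESS, 0, 19) != 0)
          perror("setpriority");
  }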

It'd be much better to have this experience by default, but unfortunately it requires architectural improvements throughout the entire software stack. Trying to band-aid it in the kernel scheduler may allow you to heuristically improve some workloads, but it's ultimately a fool's errand.

Where the desktop does tend to fall over and become totally unusable is when the system starts to swap...


There is pretty much nothing you can do on a Linux system that doesn't make a swaptest (showing e.g. red and blue screens on even and odd frames) glitch.


I remember when the 'rotating cube' 3D desktop was released for linux (compiz?), and the animations were instant and flawless, which seemed absolutely _impossible_ on Windows at the time.


It still does seem impossible. If I accidentally hit super+tab on my work computer I get to wait 1-2 seconds while the workspace overview stabilizes before I can get back to work.


This could be due to the focus of the kernel developers involved. At some point in the 2000s, companies such as IBM invested heavily in Linux, working on making the kernel scale well on 'big iron' machines with tens of cores and NUMA-style architecture. As a result, the focus shifted away from 'normal' desktop boxes.

But was this a good or bad thing? Even desktop machines can now have 16+ cores and NUMA. So, perhaps that focus has paid off for more users?


The claims from the OP:

> So this all started like this: I overheard somebody at work complaining about mysterious stalls while porting Rage 2 to Stadia. The only thing those mysterious stalls had in common was that they were all using spinlocks.

If the OP's benchmark is invalid and if the root cause is just "overheard" by somebody without sharing detailed analysis, we don't even know if the problem is really about spinlocks.

We are still unsure whether it comes from the Linux kernel or is just Stadia platform-specific. Do we have similar issues on Android?


Fair enough, I suppose the original post was quite light on technical details.


> If these kinds of spinlocks are so fundamentally misguided, why do they only cause problems on Linux?

It is possible that, as explained by Linus, the Windows scheduler just doesn't optimize for NUMA as aggressively as Linux (which attempts to keep threads on their own cores as much as possible).

It is also just as likely that, because games have been an important workload on Windows for a while, the Windows scheduler bends over backward to support broken code, either by penalising other use cases or by attempting to recognise this sort of spinlock behaviour and special-casing it.


Keeping threads on their own single cores goes way beyond optimizing for NUMA. Especially because the CPU being tested on only has a single NUMA node.

Even being maximally NUMA-friendly, if calling thread_yield a hundred times would eventually pull in a blocked thread from a neighboring core then the benchmark would see much smaller delays.


Perhaps Linux should do more 'special-casing' of broken/older apps?

In the email thread, Linus points out that GNU make has a bug in its concurrency code, which was only hit when newer kernels added some optimizations. As a result, they had to walk back on these improvements.

Now, I'm sure that the bug has been fixed in newer versions of make, but 'make' is a program that many people are unlikely to keep up to date (most users of it will only be using functionality that has been in make for decades and wouldn't ever consider needing a brand new version of it).

It would be a shame to hold back kernel optimizations because they trigger bugs in older user-space code, so in these kinds of cases it would be great if Linux had some of the 'backwards-compatibility cruft' that Windows is famous for, allowing new optimizations while detecting and working around known problems in common older code.


Sure! Remember Linus's mantra of "never break userspace". But the code in question is not a legacy binary, it is a new program being ported to Linux.

Still, it might make sense to change Linux scheduler behaviour to improve the performance of unmodified Windows games under Wine if it turns out that using yield (or the Windows equivalent, Sleep(0)) is common.

And in fairness to the original post, the author had a latency issue, investigated it, and correctly identified the problem with the spinlock and yield, finding that a proper mutex is the best solution. The only issue is that they thought the scheduler was the root cause of the bad behaviour, rather than the yield itself.
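
For the record, a minimal sketch of the two patterns being compared (not the author's actual code; the shape of the spin loop is assumed from the thread's description):

  #include <pthread.h>
  #include <sched.h>
  #include <stdatomic.h>

  static atomic_flag spin = ATOMIC_FLAG_INIT;
  static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;

  /* Problematic: spins in userspace and "yields", but the scheduler has no idea
     which thread holds the lock, so a waiter can keep burning CPU time while the
     holder sits preempted -- hence the measured latency spikes. */
  static void bad_lock(void) {
      while (atomic_flag_test_and_set_explicit(&spin, memory_order_acquire))
          sched_yield();
  }
  static void bad_unlock(void) {
      atomic_flag_clear_explicit(&spin, memory_order_release);
  }

  /* What the author ended up with: a proper sleeping lock. Under contention the
     kernel knows exactly who is waiting and wakes them when the holder releases. */
  static void good_lock(void)   { pthread_mutex_lock(&mtx); }
  static void good_unlock(void) { pthread_mutex_unlock(&mtx); }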


> So the code in question is pure garbage. You can't do spinlocks like that. Or rather, you very much can do them like that, and when you do that you are measuring random latencies and getting nonsensical values, because what you are measuring is "I have a lot of busywork, where all the processes are CPU-bound, and I'm measuring random points of how long the scheduler kept the process in place".

> And then you write a blog-post blamings others, not understanding that it's your incorrect code that is garbage, and is giving random garbage values.

> And then you test different schedulers, and you get different random values that you think are interesting, because you think they show something cool about the schedulers.

Tldr: they probably got lucky on Windows. Conceivably, similar code might have worked on Linux and then had issues on Windows.


...and that would be extremely likely to happen if the Linux version was being written first and only then ported to Windows.

