Hacker News new | past | comments | ask | show | jobs | submit login
No nuances, just buggy code (was: related to Spinlock implementation) (realworldtech.com)
540 points by shaklee3 on Jan 5, 2020 | hide | past | favorite | 274 comments

It's important to read Malte's response to Linus:


Followed by Linus response to Malte:


The long and the short of it:

- It's nearly impossible to make meaningful benchmarks for this because reality will almost always be different and more complicated

- Calling sched_yield() is almost always a mistake

- Any scheduling optimization is probably a waste of time because either your running env will change, or the kernel will change, or someone else's optimizations will clash with yours.

- Even with decades of research into locking, everyone (including the experts) continually gets it wrong.

- If you're not running a specialized embedded setup where you control all of the running processes and other such variables, just use sleep locks and call it a day.

Basically Linus is right.

One caveat is that calling sched_yield is a fantastic throughput optimization for the spinning portion of an adaptive mutex. If you have parallel code that sporadically contends for short critical sections then my data has always said this:

- something like a futex lock is optimal

- you want barging, not fairness

- spinning the CPU is a t questionable idea even if you do it for a short time. The pause instruction does something like jack and shit. Busy waiting of any kind seems great for microbenchmarks sometimes but it’s pretty bad for anything complex.

- spinning 40 times while calling sched_yield prior to futex_wait (or whatever) appears optimal on many schedulers and benchmarks. It’s hard to tell why. Best I can tell it’s because in the large parallel apps I’ve worked on and benchmarked, there are often threads that are ready to do some useful work right now, so yielding from a thread that can’t get a lock to one of those is a system throughput optimization. If there are no such threads then it’s a bit like spinning the CPU.

The 40 yields thing has a funny story for me. I learned of it from a comment in the JikesRVM code that was more than a decade old claiming it was optimal on their custom scheduler on a 12 core POWER box running AIX. The benchmark cited was something I’d never heard of. When I rewrote how JikesRVM did thread scheduling, I was sure that this would need tuning. Nope. It’s optimal on Linux and Darwin on x86 on basically any Java benchmark that has lock contention. I later implemented the same heuristic in WebKit’s locks. It’s still optimal.

To be sure, there is a broad plateau of goodness around 40. Like, 50 is about as good. But if you go order of magnitude in either direction, performance tanks. I’m superstitious so I’ve stayed with exactly 40 whenever I have to write this code.

I wrote a lot more about this stuff once: https://webkit.org/blog/6161/locking-in-webkit/

Interesting. A Win32 CriticalSection is also a two step primitive that spins for a short amount of time and then yielda to the system scheduler.

It would be interesting to see a distribution of wait time per lock for different locks. If such a two step approach performs well, a hypothesis might be that there are three kinds of locks:

- mostly uncontended: lock succeeds without spinning

- held for exceptionally short amounts of time: lock usually gets released again before waiting thread on different CPU stops spinning. Wait time is lower than the time it takes to yield to the scheduler and/or reschedule.

- held for a really long time: wait time is much longer than scheduling another thread and re-scheduling the initial one.

And what you are saying seems to indicate that optimizing for the first and second one is worth it because the overhead is negligible for the third type, that is, there are few locks that are held for an amount of time so that spinning adds noticable overhead.

Do you have a reference for this? AFAIK the win32 critical section (like any modern mutex implementation) first uses atomic instructions to check if anyone is already in the critical section so it's really fast if no one is, and otherwise falls back to the OS synchronization objects.

The information is right there in the documentation:


The paragraph about InitializeCriticalSectionAndSpinCount and SetCriticalSectionSpinCount describe the behavior. IIRC, the default spin count used to be pretty high (1000 loops or so). Not sure if that was changed.

Interesting. There's also that anecdote about the heap manager using a spin count of 4000. I wasn't aware this happened by default (I didn't see any mention of this in InitializeCriticalSection). I guess it's all down to the probability of contention vs. the amount of time the mutex is held.

Your description and their description are not mutually exclusive.

Spinning vs. a single check...

I don’t have as much experience with the Win32 scheduler. But I remember getting conflicting data about the profitability of yielding. In particular, lots of threads yielding on Win32 can cost you a lot of overall system perf, if I remember right. Not so on Linux or Darwin, on the same workloads.

I don't have too much experience with the Windows scheduler, either. I've written most of my multithreaded code exclusively for Linux.

All of these little details vary dramatically depending on the exact CPU and workload. I've developed a wide variety of scheduling strategies and have used neural networks to predict when a given strategy will be better. Scheduling is giant non deterministic mess with no ideal answers.

Ok but my data disagrees with you. Specifically: when apps get complex enough, the differences between CPUs and schedulers wash out in the chaos.

I’ve tested this over the course of a decade on multiple Linuxes, multiple Darwins, multiple x86 flavors (Intel, amd, and lots of core counts and topologies), POWER, and various arms, and on many large benchmarks in two very different languages (Java and C/C++). In Java I tested in in two very different VMs (JikesRVM and FijiVM). I think the key is that a typical benchmark for me is >million lines of code with very heterogenous and chaotic locking behavior stemming from the fact that there are hundreds (at least) of different hot critical sections of varying lengths and subtle relationships between them. So you get a law of large numbers or maybe wisdom of the masses kind of “averaging” of differences between CPUs and schedulers.

I’d love to see some contradictory data on similarly big stuff. But if you’re just saying that some benchmark that had a very homogenous lock behavior (like ~one hot critical section in the code that always runs for a predictable amount of time and never blocks on weird OS stuff) experiences wild differences between CPUs and schedulers then sure. But that just means there are no ideal answers for that scenario, not that there aren’t ideal answers for anyone.

Linus is backed up by Josh Simmons on Twitter. He does some perf tracing and also tweaks Malte's "utterly terrible benchmark" to be "about twice as fast".


The most important point is:

- you cannot implement a functioning spin lock in user space (for all practical intents and purposes and if you don't resort to extreme measures and talking modern general purpose operating systems)

You can, but it requires special architecture for the application. The cpuset can be used to disable scheduler use of the core and then you can pin application threads to cores, single thread per core.

Only then it makes sense tu use spin locks.

Reserving CPUs for your application definitely qualifies as "extreme measures".

In general you should expect userspace code to be scheduled out at any time and use locking primitives that the kernel knows about.

This is still a disaster with hyperthreading where you can have two threads contending for a lock on the same physical core.

Even if your thread won't be scheduled off the core I don't know if you might still have irqs stealing time during the critical section.

Can't you isolate cores from irqs with irq affinities? I'd be more worried by SMM stealing CPU time (ugh... talk about sad stories of 'real-time' Linux...).

Depends on what you mean by "functioning". You can make one that functions in the sense that you can lock between concurrent threads. However, it's difficult to do right and even more difficult to do in a way that is performant (and the interface to the kernel is so ill defined for this, whatever efficient solution you come up with will likely prove to be far less efficient on other systems with just subtle alterations in the underlying kernel implementation.

I like the end:

> Lots of users, all of which are basically buggy-by-definition, and all you can do is a bad half-arsed job of trying to make it "kind of work".

> Reality is messy.

> Linus

They're talking past each other. Malte is really saying that yield() does what he wants on Windows and BSD, and Linus is giving a bunch of excuses for why Linux doesn't work that way.

I’m pretty sure that the Linux behavior makes sched_yield more suitable to the 40-yields optimization I use for locks.

I don’t buy claims about lock behavior based on microbenchmarks. Microbenchmarks have wildly different behavior from one another so it stands to reason that they have wildly different behavior from real code.

Linus’s excuse is that if you use a real lock then shit is good. That’s a pretty good excuse IMO.

Yeah Linus is correct here, he's just not directly addressing why it works on some other schedulers.

One reason they might act that way: They have nowhere near the amount of NUMA or cache locality optimization work the Linux kernel possesses. Thinking of the 2nd-generation Threadripper issues in Windows in particular.

That's quite likely. I think that the sched_yield behavior is throughput-optimized, which is consistent with what you get when an OS is server-optimized like Linux.


There's no excuses, the behavior of the scheduler is documented and straightforward. Up thread a twitter user [0] noted that you can somewhat replicate the behavior of windows by replacing sched_yield() with sleep(0). The original benchmark skips this detail and just assumes the C++ wrapper will exhibit the same behavior on all OSes with all schedulers, which is clearly wrong.

[0]: https://nitter.net/dotstdy/status/1213582199615901696#m

The point is Windows and BSD isn't doing "what he wants," because the benchmark in question isn't producing any meaningful data for two reasons. First, it's not even measuring what Malte said he was measuring. Second, the benchmark assumes a single threaded HW model, and optimizing for it would very likely slow down performance for modern real-world workloads.

Ignore the benchmark for a moment; the original issue is that a real program (the game Rage 2) uses spin locks, and experiences really terrible hangs on Linux and not on the other kernels. What he wanted was for that to not happen. The upshot is that there really is no way to get spinlocks to not hang on Linux. Now, there's a very good argument that spinlocks are just dumb, and therefore that real game is broken.

EXACTLY! I wish someone would just explain why the behavior is different. Does linux only sched_yield on the same NUMA node ? (Linus' reply seemed to imply that). Windows will not pin threads unless you set processor affinity, AFAIK, so maybe that's the difference.

There are runnable threads that are not being run. It would be nice to know why.

Tell a person their code is wrong and you may have saved them from one mistake. Tell them why their code is wrong and you will save them from a dozen.

> Does linux only sched_yield on the same NUMA node ? (Linus' reply seemed to imply that).

Perhaps Linus' concern is that sched_yield doesn't guarantee any particular behavior, it just happens to be implemented that way now.

> Does linux only sched_yield on the same NUMA node ?

> There are runnable threads that are not being run. It would be nice to know why.

More extreme. It only applies to the same core, more or less. The original benchmarks are on a one-node CPU, after all.

sched_yield does seem like it should have a pretty obvious meaning: do whatever you would do if the preemption timer went off right now. Whether it's a good idea or not, I don't see why it's considered "undefined".

I haven't read Linus' discussions, but often what happens when you take that approach is:

1. Scheduler looks through all the processes on the system, decides A is the one that should run right now

2. A decides it doesn't want to run, calls yield.

3. Scheduler looks again, thinks A is still the right one to run, and runs it again.

That is, in most cases, if the scheduler thought there was a better process to run, it would already have been running.

Please read Linus' discussion.

He states (paraphrasing) - it's undefined because there's no real way (unless in a realtime scheduler) to find what to yield to without instigating complexity.

The search itself is the quest that the kernel would rather not do at all. Step 3 is undefined and thus sched_yield itself is undefined.

GGP had a simple solution: "yield is the same as a preemption tick". I was trying to briefly explain what usually happens when you define things that way. I'm not advocating one thing or another.

That is somewhat surprising to me. My idea of the canonical use case for yield is interprocess communication: process A sent a message to process B, and it wants to yield the remainder of its scheduling quantum to process B because process A has nothing left to do. Although maybe this use case is satisfied by something other than sched_yield.

Yes, something else. Linus is pointing out that sched_yield doesn’t allow specifying where to yield control to, so it is a bad way to transfer control to a particular task.

Why would it yield to process B and not one of the hundreds of other processes running?

In the case of the benchmark, at any point in time there might be 2 background processes that aren't asleep.

So let's look over our cores:

6 of them are churning through processes that do nothing but call sched_yield. Nothing has been waiting to run for more than 3 microseconds.

2 of them are doing real work, and the benchmark threads that happen to be on the same core have been waiting to run for most of a millisecond.

If you're stuck, then B is one of the processes that has been waiting.

So the obvious change would be to move threads around in sched_yield. Then B will get scheduled within microseconds.

How is it possible that there are only 2 background processes? Any normal Linux system has hundreds maybe even thousands of processes running.

Two processes that aren't asleep.

What's your CPU use right now? I bet it's less than 200%. Or will be if you exclude your web browser.

The typical background process is active far less than one percent of the time.

That doesn't matter that still doesn't guaranty yielding the thread will allow another specific process to run. Especially not if that process is soft locked to a different core.

> That doesn't matter that still doesn't guaranty yielding the thread will allow another specific process to run.

Right, but you would expect the "switch to another thread" function to eventually pick the one that's sitting there with a growing and growing time since it last got a chance to run.

It would be easy to guarantee that if any thread has been waiting for at least x amount of time, then any context switch without long-waiting local threads will pick that thread. (If it's on the same NUMA node, at least.) But in the actual implementation that time can build up to a huge amount with no effect, even as thousands of yield calls go by.

> Especially not if that process is soft locked to a different core.

Yeah, that's the problem, it's getting locked to a core even though other cores are desperate for threads to run.

The best way to do that is with a blocking write. You're sending the messages using a mechanism such as a pipe, right?

Pipes have a small buffer. Once it's full, the caller will block, and the scheduler will know precisely what to schedule to unblock it.

"there's no real way (unless in a realtime scheduler) to find what to yield to without instigating complexity"

What about just temporarily increasing the process' nice value?

That should be simple to do, and the scheduler already knows how to yield to processes with lower nice values.

Well, the scheduler has the same problem when the preemption timer goes off, doesn't it?

What's the postcondition of "whatever you do when the preemption timer goes off"? What's the observable side effect for the program?

Well, it's a hint (but then so is `nice`), so it doesn't have any strict postcondition. The meaning of the hint would be that the process won't make good use of CPU cycles at present, and it would be preferable to yield to any other processes now, rather than being preempted later when it are busy.

I think I would state it like this: Yielding is communication with the OS. And communication with the OS is observable.

Why is calling sched_yield() a mistake? What's the correct way to yield CPU without calling sleep?

If a thread can do no more for the moment, then either it is finished, or it is waiting for something to happen. In the latter case, the best thing is to be woken on that event, and the next best is to poll for it. Sleep is only the best solution when that event is literally the passage of time.

you may want to read Linus's answer about it https://www.realworldtech.com/forum/?threadid=189711&curpost...

You really should just read the linked article, but the tl;dr is that sched_yield() does not convey intent (except in narrow realtime scenarios that you should not be using in the first place).

Tell the kernel why you want to yield CPU. Work with the kernel, not against it. Do you want to yield because you genuinely want to wait until some future point in time? Use sleep. Is it because you're waiting for some I/O? Use poll or its variants. Is it because you're waiting for a signal from another thread? Use mutexes, condition variables, etc. that call futex under the hood.

Futex wait, or any other blocking syscall.

sched_yield means “yield to any thread if it’s easy to do so or do nothing”. That meaning fits the requirements of some algorithms, particularly if you call it a finite number of times before doing something more drastic like a blocking syscall (like futex wait, lock acquisition, whatever) that will block until the condition you’re waiting for is fulfilled.

Think of it as a speculative optimization on top of busy waiting, not as a way to control the scheduler.

It's explained in the linked post.

Why don't you want to call sleep?

Beacuse it would act as as an even worse sched yield.

I guess I can't really see the use case here. If you want to sleep, it's usually because you're waiting for something: file descriptor, semaphore/mutex, or a certain amount of time to pass. If none of these conditions apply, can't you just continue to run and let the scheduler do its work? (maybe with an appropriate nice setting if you are doing "less important" work).

The scheduler itself uses fixed time interrupts for its basic functions, so "sleep(fixed_time_interval)" is what you will get in the end anyway on a preemptive system.

Calling sleep(1) instead of thread yield is faster - see the twitter thread for a proof.

Still worse: it's a confusing code, and it's also confusing that it's faster.

Not in every cases, but both are certainly suboptimal compared to a diretected wake up.

Ok, so how you would write this (at least the first section, explaining the error) constructively could be much shorter.

"The author of that post has a common misconception, which is that spin locks are a fast drop in replacement for uncontended mutexes. They aren't. You can only use them well when you know you won't be descheduled, like we do in the kernel. In a generic userspace program, you don't have that knowledge. The possibility of your time slice ending skews all the time measurements, for example."

Boom. Roughly the same message. Most people will get the same amount of edification from this as Linus' original letter. You could go in to more detail if you want but it's not really necessary. Smart people will figure it out if you just point out the general shape of the error. No need for calling code "garbage."

I personally really appreciated the additional details in Linus's version.

When I read your version, I learn that spin locks are not a drop-in replacement for uncontested mutexes. That's great, except that I don't really know for sure what either of those terms mean, and also I didn't gain any knowledge that doesn't relate to this specific situation. It also wasn't super convincing -- I could just as easily see someone post a reply saying "well actually spin locks ARE a drop-in something something mutex" and I'd have no way to tell who was right.

Linus, on the other hand, told me about the mechanism by which the problem happens. I learned about how the scheduler sees spin-locks' loops, and that the main advantage of "real" mutexes is that the scheduler knows who's waiting on what. I also wouldn't necessarily have realized the "take clock measurement, then get descheduled before you release the lock" issue if he hadn't called it out specifically.

Sure, I guess "garbage" comes off a little mean, but "the author has a common misconception" almost sounds more passive-aggressive to me.

The first section of Linus's answer reads really rough to me and almost made me stop reading, imagining that it was going to be yet an other entry in the "Linus bullies some poor coder into submission by insulting them until they just stop responding" which is a shame because there's a lot of interesting content after that. I guess I invite the people who had the same knee-jerk reaction I had to give the post a 2nd chance and power through Linus' unnecessarily harsh tone to get to the good stuff.

>Sure, I guess "garbage" comes off a little mean, but "the author has a common misconception" almost sounds more passive-aggressive to me.

The latter might be slightly passive-aggressive (although it doesn't have to be), the former is simply aggressive-aggressive.

As long as Linus is blaming the work done and not the people behind it, that's fine. Something truly aggressive would be "This benchmark is garbage and the people responsible for writing it are monkeys on crack". He did such things in the past, but in the 4 emails I got to find on this specific benchmark, none of them critic people writing code.

I think Linus got upset that a badly designed benchmark got popular and therefor is spreading misconceptions. His strongly worded answer is getting popular too and as a result everybody learns and he has stopped an urban legend from spreading. A success on every points.

>As long as Linus is blaming the work done and not the people behind it, that's fine.

I think your comment is just wrong and generally pure garbage. I can't believe some poor server will have to host this nonsense for years to come. It's the most inane thing I've read all year.

I'm just talking about your comment, not you though, so clearly no hard feelings here and it's definitely the right way to frame my argument to have a nice, peaceful and constructive technical discussion.

No, reductio ad absurdum is not an argument. What you wrote is an exercise in swearing. What Linus wrote is getting his point across.

It is certainly possible to describe a code with bad words without hurting the feelings of person who made it, and I think, in order for it to work everyone has to understand that everybody writes “garbage” or “crap” code sometimes, so when it happens, it’s not a big deal; it shouldn’t hurt your professional dignity.

On my job, in team of 4, we do it from time to time and it’s just much mentally easier, streamlined - and yes, constructive to conduct code reviews this way. I understand this may not work for other, bigger teams, but it’s certainly a valid way.

There were ways for him to communicate those same ideas less negatively.

And it was more than just the "garbage" line. It was also things like "If I haven't convinced you of that by now, I don't know what I can say.", and "See how simplistic - and self-centered - your expectation is?".

Writing this way doesn't "streamline" anything. I interpreted it as a negative interaction, and clearly others have as well including the original author (since he said as much in his response).

> Writing this way doesn't "streamline" anything.

Exactly. Blameless and non-violent language can streamline communication. It does this by removing any trigger that would make your communication partner feel defensive. Needlessly emotional and charged language is sure to distract from the subject at hand. When you use calm language, there is only one thing to focus on: the content. When you call someone's code garbage, there are now two things that that person could focus on -- the content and the tone.

To me, it is completely reasonable that the original author might have made a mistake like this. After all, when he shared his findings it seemed like he had found something interesting to me! Many people on this very site commented positively on his efforts. There is nothing "garbage" about the code that he wrote, it just doesn't quite do exactly what he thought it did.

> To me, it is completely reasonable that the original author might have made a mistake like this. After all, when he shared his findings it seemed like he had found something interesting to me! Many people on this very site commented positively on his efforts. There is nothing "garbage" about the code that he wrote, it just doesn't quite do exactly what he thought it did.

It was blogged about, so carries some authority on the topic, and was convincing to readers, but the code did nothing close -- not just "not quite exactly" -- to what it was portrayed to be doing in the benchmarks or article. It is garbage; useless; should be binned. I tried to come up with better adjectives -- "dangerous", "useless", etc -- but they came up short, in subtle ways. Who cares; Linus explains what he means better than the one, concise word.

Speech is not violence.

It always contains some information payload. If you felt triggered by it and failed to parse the content then maybe you're overly emotional.

For some people who aren't it's going to be useful assessment of how bad this code was.

I see this ongoing trend (Twitter) that by limiting the ways in which people can express themselves, you can in theory broaden the recipient circle (to include people who otherwise would be offended and wouldn't read).

In practice you not only limit the ways in which information can be distributed but often also barter away some people who weren't offended and would get useful information this way.

Looking at speech this way is also a bad idea because today's acceptable is tomorrow's unacceptable word and the only way to allow all information is to allow all speech.

All I'm saying is that there is a spectrum of ways to communicate about things like this, from "fuck you you piece of shit how dare you put out this garbage code" to "hey, bud, I see what you're trying to do. But there's a mistake here, which I can explain." Linus chose something closer to the left than the right. You can get the same technical ideas across either way, but as you move toward the latter mode of communication you will decrease the number of recipients who have trouble receiving your message because of its negative emotional content.

"maybe you're overly emotional." Feel free to label me however you please. It does not change the fact that if your goal is to communicate ideas efficiently, my way will help you avoid alienating some folks, and has no practical downside.

There's a big difference between ugly ad hominem comments and saying that your code is bs.

> and has no practical downside That's the place when we differ in opinion. It has one direct downside of preventing some people from expressing their opinions and conveying the point. And another in the long run – setting up a precedents of what can and cannot be said.

> Feel free to label me however you please It was general note not something directed at you.

In my group, we call it "shoddy" code. A bit less brash than "garbage" I think.

If you had followed with multiple paragraphs describing in detail as to why the comment is pure garbage, why it has to be called out, offering context, background information and possible alternatives - then you'd have a valid analogue and a basis for debate. Just sayin'.

Believe it or not, it's one way or another to formulate a critic, and I'm fine with it. However your strongly worded answer lacks arguments, so color me unconvinced.

Except your attempt is not exactly analogus to the situation discussed.

Critique doesn’t have to be peaceful to be constructive

> As long as Linus is blaming the work done and not the people behind it, that's fine.

Certainly, ad hominem attacks are worse than making substantive attacks, but is that really the standard we should judge it by? Instead of a sprawling thread of visceral, wouldn't it be better if he just explained the reasons why the approach is bad in a professional manner?

Writing like this just seems sloppy and non-professional. It's rambling and unclear. It could have been far more effective if written point-by-point with far less emotion.

Calling out non-professionalism and bashing a stranger's communication style are, perhaps ironically, nice examples of non-professional behavior.

I would answer your rhetorical questions with an emphatic "no." People's emotions and mental state are important. It's also important that we communicate, understand, and think about people's interior worlds.

Question for you: What else is Linus communicating with his writing style? Does he show that he cares? About what? What can we tell about his values? Are these values we want in a leader?

Sometimes it's good to take off the prescriptivist hat and try on Honest Curiosity.

If what you're measuring are random delays due to default scheduling and flawed code implementation, calling it "garbage" should give the author and others pause to think. In some cases, one need to establish a boundary, so as to not waste time and efforts into an endless discussion on flawed premises.

Explaining why and educating about the concepts, is the better approach, and was there as well. Short and concise would be best, while limiting the negative to exactly what need be pointed out: A spade is a spade after all.

If it stops people from developing "their own userspace spinlocks", it may even be kind of necessary to be quite direct.

> Calling out non-professionalism and bashing a stranger's communication style are, perhaps ironically, nice examples of non-professional behavior.

I don't see how this is the case. One can critique a style of communication professionally just like any endeavor. I'm not "bashing" Linus, or even suggesting that he should be less honest. Just that there are far more effective ways to make the same point.

> yet an other entry in the "Linus bullies some poor coder into submission by insulting them until they just stop responding"

Has that ever happened? My impression was that he only ever takes a rough tone with high profile coders and is nice and polite to "poor coders". Also, he never insults them personally, but says that their code/argument is bad and that they are smart enough to know that already. Hence, they should stop making excuses and fix it already.

I still think that this tone is counterproductive, but that might be personal preference. I hate working in an environment where I know I can end up on the receiving end of some public humiliation if I voice a dissenting opinion or make a mistake. That's how you end up with people censuring themselves and not pointing out when somebody is actually making a mistake because they don't want to suffer the abuse.

This is especially inexcusable coming from Linus because it's not like he needs to resort to these tactics to be heard or make an impression, he's one of the most respected software devs in existence (and for good reasons). You could remove all the abuse from his posts and you'd lose exactly nothing IMO.

But this discussion is hackneyed, I know many people love Linus' style and think it's perfectly acceptable and even a good thing, I'm just definitely not one of them.


It’s sad that there are people who think calling garbage code garbage is inappropriate. What goes in their mind? Don’t they smell or call a smell a smell?

On at least one memorable occasion [0], he expressed surprise that the developer(s) responsible for some piece of code had survived to adulthood, considering they "were likely too stupid to find a tit to suck on" as a baby. I'm not super aware of the context here, but at first glance it doesn't appear to be a high-profile coder with lots of kernel experience that he's talking about.

[0] https://lkml.org/lkml/2012/7/6/495

I disagree with this sentiment. Linus' answer not only describes the immediate bug, but also gives enough detail and context that other, similar potential bugs related to the scheduler may be avoided after having read the entire post. Also I don't think Linus is being egregiously impolite in his post; he isn't calling the author of the post garbage. Calling the code garbage is not a personal affront, but an objective statement to dissuade others from being misled.

If this was a truly idiotic / trivial mistake I doubt Linus would put much thought or time into such a long post. I also think that if you are going to write a blog post, you are implicitly supposing a place of authority and therefore opening yourself to criticisms. In other words, writing a blog post that is incorrect (in this case, very incorrect) is actively harmful as it misleads others, and it is important to call out wrong posts when necessary. I've posted dumb advice with a know-it-all voice before too (see [1] for an embarrassing HN comment) and I am not at all angry/annoyed that I was similarly called out.

[1] https://news.ycombinator.com/item?id=20017832 (edit: wrong link)

> No need for calling code "garbage."

In my opinion calling the code garbage is much nicer than "The author of that post has a common misconception", which sounds like a passive-aggressive way to call someone dumb. Everyone writes bad/'garbage' code once in awhile.

>"The author of that post has a common misconception", which sounds like a passive-aggressive way to call someone dumb

Not at all. It's a polite way to say they're ignorant of a particular thing. It would be rather insecure of them to read into it and think they're being called dumb across the board.

When you say "the code is bad/wrong/garbage", that's a statement about the code. It gives the person who wrote it an out: they can say "Oh, right, I totally knew that, just slipped my mind for a second", or "This is why I shouldn't write blog posts about mutexes at 3 AM" or something. I've certainly written garbage code at times, and you probably have too -- it doesn't mean you're dumb or you don't know things.

When you say "the person who wrote this code has a common misconception", that's a statement about the person. You've jumped to a conclusion about the reason the person wrote the bad code. You're telling the person to accept that they don't know as much about mutexes as you, that they're ignorant, that they're a worse programmer (albeit in a fairly narrow category).

This is why I think GP's wording comes off worse even though it's superficially more polite.

I get where you're going with that, but I would much rather someone tell me I hold a common misconception instead of saying my code is garbage. As much as the ideal is to remember that you are not your work, I still feel that my work reflects on me and my intelligence.

Giving someone an out is a great thing to do, especially for people for whom saving face is important. But I don't think "this code is garbage" does that. Something like, "that code is measuring timing incorrectly because X", conveys the same information without being inflammatory. "Garbage" is a value judgment, while just pointing out that it's doing something incorrectly is merely stating a fact.

I'd appreciate if "this code is garbage" was at the end, so it would be (and feel more like) a conclusion.

Saying that stuff at the beginning only clutters the reasoning, as one should not "just believe" that that code was "indeed garbage", just because he said so.

That's a fair point, and moving it to the end would likely make the whole thing read better overall, but really... it's just not necessary to include it at all. It doesn't add anything to the discussion or critique, and can only serve to detract, so why bother?

On the other hand, I wouldn't get offended if someone said a piece of my code indicated I was ignorant of something. If I was ignorant, I'd fess up to it no problem. Or if I had reasons for intentionally doing it a different way, I'd share and let someone school me if they had a point on why I was wrong. Ego sensitivity only causes problems. It never solves them.

Additionally, _sometimes_ offering someone an out to save their ego actually does them a disservice.

You all are involved in like a 100 comment thread about his tone rather than the content, so that should maybe suggest that he didn’t get his point across well.

This is all in response to a blog post titled ”Measuring mutexes, spinlocks and how bad the Linux scheduler really is”. Based on the content and tone of that post, I think Linus’s tone is fair.

I'm favor of Linus version. They are not the same message to me. He clearly explains what really would happen behind the scene.

He probably just didn't take very kindly to the Linux scheduler being called "bad" in the blog post, when tons of people have spent an immense amount of time on it, and there are in fact good reasons for it behaving the way it does. If I were in the same seat, I'd be pretty stingy with the politeness as well.

> The possibility of your time slice ending skews all the time measurements, for example."

If you were forced to summarize the entire first third of the email into ten words, that might be valid. As-is, it's way too short to be a proper explanation. It's more of a reminder note. If someone doesn't know exactly what you mean at a single glance, it's useless. Linus has an actual explanation in his email.

It's important to realize that you are not your code. Linus' reply comes off as passionate but he clearly cares enough to write such a detailed response.

At least it's not "total fucking braindamaged garbage." Progress?

Your version lacks depth and I appreciate Linus's response.

Imo, he can call something garbage if the mentioned garbage makes false and bold claims about his and contributors decades of work.

Yes, Linus can still be abrasive, but he's gotten a LOT better than he was before his road to Damascus moment.

I immediately thought of his lkml email where he reflected on this kind of thing: https://lore.kernel.org/lkml/CA+55aFy+Hv9O5citAawS+mVZO+ywCK...

In CS "garbage" has a bit different sounding, like in "garbage collector". Nonetheless, original poster started it, writing an article "Why Linux scheduler is bad". Next time he'll see it coming.

By the way, Linus' answer is much more informative than yours. I got a lot of useful insights from it and had fun just reading his lively language.

> In CS "garbage" has a bit different sounding, like in "garbage collector".

It's a bit disingenuous to suggest that Torvalds meant it in that sense, though. He very obviously meant it in the sense of a piece of trash that should be thrown into a trash bin.

Or a piece of code that doesn't do anything useful or meaningful and may be skipped by compiler as well.

Compiler warning: Sourcecode classified as "garbage". Replaced with NOP for user protection.

Garbage code sounds like a fine new class of undefined behavior to add to the next C standard. You just need to teach ubsan to understand it.

No, he very clearly meant the definition I suggested.

(Also I've never heard of dead code referred to as "garbage code".)

>Boom. Roughly the same message. Most people will get the same amount of edification from this as Linus' original letter. You could go in to more detail if you want but it's not really necessary. Smart people will figure it out if you just point out the general shape of the error. No need for calling code "garbage."

People should be shamed for being ignorant. It stops ignorance from spreading.

I guess we should shame you for being ignorant about human beings and the value of being kind. There are plenty of ways to point out ignorance without being unkind (and the grandfather comment is a nice attempt).

You just did, and got to pretend you didn't.

Ignorance is the lack of knowledge, it doesn't "spread" like a disease. Misconceptions can, but without the original author's mistakes we wouldn't have this rebuttal by a very knowledgeable actor, so overall this is a net positive including for you and me.

misconception is a form of ignorance.

And so is not reading the second half of comments.

Your "correction" literally explained the comment it was correcting as being true.

reacting in a haughty manner doesn't stop you from being wrong, it just makes you a dick.

This all sounds very good, and I grant that Linus knows what he's talking about, but didn't this whole thing start with a bunch of game developers that noticed massive stalls when porting games to Google Stadia? Stalls that, presumably, didn't happen on other platforms (Windows and PlayStation, using a variant of FreeBSD).

If these kinds of spinlocks are so fundamentally misguided, why do they only cause problems on Linux?

Game developers running their software that requires a system tuned for interactive latencies get burned by systems optimized for throughput, proceed to blame the system.

I’m more for polite passive aggressive suggestion of incompetence instead of in your face blaming, otherwise the response was quite on point.

On the plus side the original author now has plenty to read about and us here get to learn interesting things by proxy.

AKA the Linux scheduler isn't intended for desktop environments and applications.

Certainly that same observation drove the development of BFS / MuQSS, which unsurprisingly behaves better on his artificial benchmark than the mainline scheduler.

As shown in the original benchmark, using proper mutexes is near optimal and there is little need for kernel tuning (although still possible).

Linux can be tuned for low latency, it's a kernel setting.

Either that's not sufficient or the Google Stadia people didn't even bother tuning their systems for gaming, which I doubt.

Maybe they did some tuning but missed this thing.

Moral of the story... don’t ship games on Linux?

No, the moral of the story is don't use spinlocks in userspace and use established locking libraries that suite your needs. You can't expect any operating system to optimize for poorly written applications when you have thousands of them screwing up in all different ways.

The irony is that because of Android, another platform Google owns besides Stadia, Linux is already the world’s biggest gaming platform.

iOS is the biggest gaming platform, on $$$ per user, including some exclusive hits, with several frameworks and tooling tailored to writing games.

Android is such a pain to develop games for, that Google has started a Android Games initiative only in 2019, to appease the growing complaints from game studios.

An NDK that feels like a 20% project, many game related APIs (e.g. image codecs, networking, files, permissions) are Java only and require JNI even if implemented in C++, frame rate setup, build infrastructure regarding NDK libraries, C only API for what are actually C++ classes forcing wrapping code for RAII and such, Vulkan feels like a code dump that everyone has to manually integrate into the project vs Metal tooling, graphics debugging and real time audio are outsourced to a GitHub projects not directly supported on the official Android site, network debugging is only supported for Java layer,....

Finally, the only Linux that Android has is the kernel, which Google could replace tomorrow and user space wouldn't even notice.

Linux fans keep failing to realise how little Linux Android actually exposes.

> iOS is the biggest gaming platform, on $$$ per user, including some exclusive hits, with several frameworks and tooling tailored to writing games.

Other than "Apple Revenue", in no other world is this a metric of "biggest" in the context of even the most "squint hard enough" contexts of this thread.

Then by that measure, the Web is the biggest gaming platform, where the Linux kernel doesn't have any role at all regarding Web browsers.

Are you sure? I think people play way more often on android than they do on the browser.

No, and actually Windows, Mac and consoles still rule, with mobile being a mere 38%.


So iOS is the biggest market, while android apps have more play time (also are messier, more of a pain..)?

Plausibly more time is spent gaming in browsers, or not. My guess would be android apps, possibly browser games. AFAICT the link only covers payed games, mostly counting by revenue, but I didn't look hard.

Worldwide 2019:

PC 25%

Consoles 26%

Phones 34%

Tablets 11%

Browser 4%

Since Android has 76% mobile os market I'd say that Android is the main gaming platform of the world. And the web is least used for gaming.

I wonder what magic turns 34% into 76%.

This even before starting the whole discussion that market share doesn't transform into gaming market share, nor in most profitable market share, or even native vs web views vs mobile browser usage.

.76*(.34+.11) = .342

That's greater than .26 (consoles), .25 (pcs).

> the only Linux that Android has is the kernel

The only Linux in any Linux-based operating system is the kernel.

Except that the little fact that Android is not Android/Linux, rather just Android, in what concerns userspace.

Any kernel will do the job required by userspace, that is running ART, ISO C, ISO C++ and the set of Android native APIs.

No, not "except that". Linux is just a kernel.

Yes, Google could replace Linux but they would also have to port drivers for every device Android runs on to that kernel.

> Yes, Google could replace Linux but they would also have to port drivers for every device Android runs on to that kernel.

Or just virtualize the whole kernel and Android runtime via a new kernel-based hypervisor for compatibility/migration purposes and switch everyone to Fuchsia/Zircon.

Google doesn't care about writing the drivers for other/older phones for manufacturers as they already have a hardware division to do it themselves for their phones, laptops and tablets. If Android was to be replaced by something else (Fuchsia), The vendors will be writing the drivers instead if they want to support Fuchsia.

Also, they don't need to "port drivers for every device Android runs on" since they would just start with their own devices first and the other manufacturers will eventually write drivers for their devices to support Fuchsia.

The problem is that they won't be required to open them up unlike Linux. But I guess the benefit to Google is that they have full control rather than waiting for one person to approve their changes.

Except that from a locking latency perspective, virtualization isn't a great choice. As Linus explains, guest operating systems runs on virtualized CPUs which can be de-scheduled while holding a spin lock. This is very much like what happened with the original benchmark in Malte's blog post, except that the application and operating system are now replaced with guest operating systems and hypervisors, respectively. AFAIK, the Linux kernel makes heavy use of spinlocks, making this a problem for people who care a lot about scheduling performance. This isn't to mean that people should avoid virtualization, though. There are likely to be many other bottlenecks to consider before most people have to consider these kind of stuff.

> Or just virtualize the whole kernel and Android runtime via a new kernel-based hypervisor

Nokia gave that a go, back in ~2009 if my memory serves me correctly. They actually had L4 (with certain security extensions) running on some of their prototype ARM boards. Linux, as an OS kernel, was then running on top of that.

It didn't get much traction, and I heard very little of it from 2010 onwards.

I suspect if they could "just" (!!) virtualise, they would have already. Modern phones are amazing but they have limits.

Google undoubtedly has the singular oompf to launch a brand new OS, I just haven't seen a reason yet. The few things they say they want to improve upon aren't limitations in Android (or Linux). Third party vendors and their chip makers are going to —perhaps rightly— view this as an Emperors New Clothes upgrade and resist. They're going to have to make a business case somewhere along the line.

Let's wait for a production release before we start the Android/Linux death knell.

But they already are with Fuchsia.

Like Fuchsia.

I assume you mean Zircon —formerly Magenta— Fuchsia's kernel.

Android is Android/Linux, just not GNU/Linux.

And sure, Google can change the kernel, but that would be a lot of work. The same could be said of many Linux distros. For example Debian GNU/Hurd is an experimental version of Debian swapping Linux for Hurd.

We're measuring by revenue by user?

It's got to be something more expensive than that. Hololens starts at $3500, so it's probably a good candidate for instance.

Which kinda gets the point across that revenue per user matters very little.

If we're looking ourselves to software revenue and ignoring device cost, look towards vr headsets. Practically every user is buying expensive games.

If you only want to include things with built in processors as platforms, look at Oculus quest in particular.

I believe by total revenue iOS is bigger — not per user (although that’s implied), but I could be wrong.


Rather someone that actually understands how meaningless is Android making use of the Linux kernel, beyond saving money paying for licenses.

Paying to license what? I'm not aware of any other kernel suitable for mobile phone use you could buy with a license (i.e. iOS doesn't count as there's no way you're getting Apple to sell you even a license)

Windows and plenty of embedded ones to chose from.

After all, it was Series 30, Bada, Symbian, Blackberry, Psion and Windows CE that started mobile phones.

> Android is such a pain to develop games for, that Google has started a Android Games initiative only in 2019, to appease the growing complaints from game studios.

This is actually irrelevant, Android could be the worst environment to develop however since it has the biggest user base it still has people wanting to develop to it.

> Finally, the only Linux that Android has is the kernel, which Google could replace tomorrow and user space wouldn't even notice.

This is not really true, specially considering for games. For one, since most games use Android NDK, you can't simple swap the kernel and everything will magically work. You will at least need a recompilation, and this new kernel API must be bug to bug compatible with Linux or things will strangely broke.

For another, there is things like the discussion of spinlocks/mutex/schedulers like this one of this topic that, as shown, can make a big difference. It could be better in a kernel optimized for latency, or it could be much worse, making games that today are playable in low end devices a complete shitty show.

There is no Kernel API for userspace as official supported NDK API. If you call any Linux API you are the one to blame if anything changes, even across devices.

No it hasn't the biggest gamming userbase, as proven on a market analysis report I just posted.

> There is no Kernel API for userspace as official supported NDK API. If you call any Linux API you are the one to blame if anything changes, even across devices.

Yeah, I never said that you should directly access the kernel API using the NDK (however you can actually). I said that just the fact that the NDK is compiled to native code means that the code depends on the Linux kernel, and changing the kernel would mean, at minimum, a recompilation. Android NDK only guarantees an API, not a ABI, and those two things are very different.

So no, it is not like changing the kernel would have zero impact in Android game market. It would actually have a huge impact, since almost every game would stop working until being recompiled (games that does not use the NDK would probably be fine, however few games are not native). And of course, for backwards compatibility purposes you would need to have multiple binaries (however this is already true nowadays since Android support multiple architectures outside ARM).

> No it hasn't the biggest gamming userbase, as proven on a market analysis report I just posted.

I never said it was the biggest gamming userbase, I said it was the biggest userbase. Even if Android doesn't have the biggest gamming userbase, just for the fact that that you have a big userbase as big as Android have means people will develop games for you.

Google Play already does binary translations from ARM into x86.

Also the whole point of ISO C, ISO C++, OpenGL, Vulkan, OpenSL is to be OS independent.

Game studios are not a charity, they develop games for profitable markets, not for large userbases that don't pay for games.

> Google Play already does binary translations from ARM into x86.

It is not libhoudini [1]? AFAIK, libhoudini is not something that Google Play embeds on binary for distribution, it is a library that ships on Android x86 devices (and another AFAIK, it is proprietary from Intel so they only ships on Intel devices).

So while Google could have a similar approach if they want to ship another kernel for Android, the work is not as trivial as you seem to think.

> Also the whole point of ISO C, ISO C++, OpenGL, Vulkan, OpenSL is to be OS independent.

Yeah, it doesn't mean you don't need to recompile your software. You can't take an game written for Windows in OpenGL and run in Linux without recompile. This is my point.

> Game studios are not a charity, they develop games for profitable markets, not for large userbases that don't pay for games.

Well, Android seems to be pretty profitable? It is not the most profitable platform sure, however it sure makes enough money because almost everyone ends up creating games for both iOS and Android.

You seem to imply that making games to Android is not profitable, however it sure seems to be. Or Epic Games, EA, ActionVision, and many other big names in game studio are a charity.

[1]: https://commonsware.com/blog/2013/11/21/libhoudini-what-it-m...

Having to recompile doesn't make the applications kernel dependent, now to consider porting such applications from macOS to Linux to Windows, kernel dependent?!?

And the point wasn't that it isn't profitable as such, rather it isn't the money maker that some here are making waves for, rather a wannabe versus the revenue made across desktop, consoles and iOS.

Hence why Google has introduced the Android Game SDK last December (which is quite basic as of now) and Google IO 2019 was the first time Android ever had a games track, versus WWDC iOS/macOS that have had them since ever.

> Having to recompile doesn't make the applications kernel dependent, now to consider porting such applications from macOS to Linux to Windows, kernel dependent?!?

So you concur with me that if Google changed the Android kernel it would be a big change? Just imagine, you would have to recompile all applications using NDK to work with this new Android. It is not something trivial, sure they can make it work, however it would probably need quite some time with Google making obligatory to having developers to ship both Linux and new Kernel binaries.

And even them, you can't make sure that everything will work. A small change in behavior of a specific syscall may break a random application somewhere that depends on it, because even if this syscall is not directly used, Android NDK uses this syscall to implements its API. Now multiply this for the number of NDK applications that Android have and you will have a huge amount of breakage.

So yeah, Google can make it work, however it is not as trivial as you think.

> And the point wasn't that it isn't profitable as such, rather it isn't the money maker that some here are making waves for, rather a wannabe versus the revenue made across desktop, consoles and iOS.

Sure, we can concur at this point. However the market for Android gaming is still more expressive than you seem to think (not wannabe by any means, but lower than iOS sure).

> So you concur with me that if Google changed the Android kernel it would be a big change?

No I don't, anyone writing portable applications already does compilations all the time.

It is no different than the requirements to provide newer builds updated to specific Android target versions, or being 64 bit only that Google keeps forcing devs to move forward.

It is Google and OEM's job to ensure that NDK APIs work across whatever kernels are available.

Any application that depends on syscalls isn't portable across Android devices anyway. Google doesn't support syscalls as part of the NDK stable APIs contract, and some OEMs are well known to change their behavior.

> It is no different than the requirements to provide newer builds updated to specific Android target versions, or being 64 bit only that Google keeps forcing devs to move forward.

That is basically saying that this is a big work for everyone involved (developers, Google, OEMs). This is exactly my point.

> Any application that depends on syscalls isn't portable across Android devices anyway.

The application does not necessary call a syscall directly, the problem is that the NDK calls a syscall as an implementation detail (for example, for lock primitives eventually the NDK must call the Linux lock primitives too, there is no way out).

So if Google needs to change Android kernel, they need to implement the same quirks of the current syscalls, or the slightly change in behavior will break random applications.

My point was never calling syscall directly, however for sure some applications call syscall directly too.

A market analysis measures the market, i.e. $$$, not the user base. Which part of the report is on the userbase?

I posted the link, everyone can read it.

How much Linux is android though? Does Android feedback into Linux? This argument about Linux being the most popular consumer OS or Gaming platform thanks to Android sounds like Syria taking credit for the iPhone because of Steve Jobs being a Syrian by bloodline.

"You know, although we are consumed by sectarian fights and iron-fisted leaders that will crush anyone having opinions, It is important to note that worlds one of the worlds most successful contemporary technology companies is actually Syrian".

From a game's perspective, Android is very much like Linux, actually.

Don't use userspace spinlocks on Linux.

Ironically things went upside down with linux scheduling over the last 15 years.

All major OSes had PhD CS people doing their schedulers "efficient" at around 2000, and be able to sustain 98%+ MP loads. Linux was an odd case of a very naive, and "inefficient" scheduler that left CPU 30-20% idle unless you really knew how to hack around that. Linux however came with near no stalls, and latency spikes.

Things have changed dramatically now. Linux got attention of the entire industry on it, and years of holy wars about scheduling.

You can now hit 95%+ under multithreaded load easily, but even on the desktop you can now see how a block in IO hog app can visibly freeze your system for few seconds.

Now, the trend has reversed again, and now everybody is praising conservative scheduling that doesn't give out too much CPU :-/

> but even on the desktop you can now see how a block in IO hog app can visibly freeze your system for few seconds.

Or worse. I ran into an issue when copying an image from an NVME drive to a USB drive using `cat` (e.g. `cat somefile.img >/dev/sdX`) where mouse performance (also a USB device) became so sluggish that the desktop was unusable until the copy was complete.

This wasn't the only instance where background I/O brought my desktop to it's knees. In another instance it was copying a large file across the network (Gigabit Ethernet) to a file server. The network was saturated and the desktop was unusable. I think that the remote system was the destination but my recollection could be in error. The network interface is a PCI device and there was no USB device involved in this transfer.

This is on Debian Stable and using the Gnome desktop. The non-standard portion of the system is that it runs off a ZFS filesystem. I don;t know if that matters.

I'd like to know where to look and how to diagnose this issue if/when it crops up again. Interrupt storm? USB driver issue? Some other issue resulting in some resource starvation? (In neither case did overall CPU load appear high.)

I'd guess the apps were waiting for I/O, not CPU.

Don't know the reason, but lots of unwritten I/O on a (slow) drive can block processes waiting for I/O on an unrelated drive.

By default, dirty buffers are a predetermined portion of RAM, so lots of RAM and slow I/O can make for long pauses. I usually tweak sysctls on desktops to limit dirty buffers. E.g.

  vm.dirty_background_bytes = 67108864
  vm.dirty_bytes = 134217728

You will not get enough CPU time given to a task if it does, say, 10% IO, and 90% CPU if there is an IO hog running in parallel. You either IOnice everything to oblivion, or bare with this.

IO blocks are almost always CPU blocks too on anything that does even minimal IO. And the more simultaneous IO, the more CPU overhead there is.

Try running compiling something big with -J 32 or more.

When system is IO busy, processes block (more) waiting for IO completion. IO blocked processes wouldn't know what to do with a CPU, so they won't have it assigned. Are you saying something different?

Decreasing dirty memory decreases flush times, decreasing process write/flush times, decreasing process/UI hangs/pauses. Presumably it also lowers throughput in some cases, but it's better than having UI hangs.

Is that a CPU scheduler issue though?

Linux traditionally has a problem of overall system architecture in that userspace isn't set up to provide policy information such as "this thread is doing UI" vs. "this is a background task" by default.

However, if I help the scheduler along by nice(1)-ing my big compute tasks, then the desktop remains perfectly smooth even while the CPU is 100% busy.

It'd be much better to have this experience by default, but unfortunately it requires architectural improvements throughout the entire software stack. Trying to band-aid it in the kernel scheduler may allow you to heuristically improve some workloads, but it's ultimately a fool's errand.

Where the desktop does tend to fall over and become totally unusable is when the system starts to swap...

There is pretty much nothing you can do on a Linux system that doesn't make a swaptest (showing e.g. red and blue screens on even and odd frames) glitch.

I remember when the 'rotating cube' 3D desktop was released for linux (compiz?), and the animations were instant and flawless, which seemed absolutely _impossible_ on Windows at the time.

It still does seem impossible. If i (on accident) hit super+tab on my work computer i get to wait 1-2 seconds while the workspace overview stabilizes before i can get back to work.

This could be due to the focus of the kernel developers involved. At some point in the 2000s, companies such as IBM invested heavily in Linux, working on making the kernel scale well on 'big iron' machines with tens of cores and NUMA-style architecture. As a result, the focus shifted away from 'normal' desktop boxes.

But was this a good or bad thing? Even desktop machines can now have 16+ cores and NUMA. So, perhaps that focus has paid off for more users?

The claims from the OP:

> So this all started like this: I overheard somebody at work complaining about mysterious stalls while porting Rage 2 to Stadia. The only thing those mysterious stalls had in common was that they were all using spinlocks.

If the OP's benchmark is invalid and if the root cause is just "overheard" by somebody without sharing detailed analysis, we don't even know if the problem is really about spinlocks.

We are still unsure if it comes from Linux kernel or just Stadia platform-specific. Do we have similar issues with Android?

Fair enough, I suppose the original post was quite light on technical details.

> If these kinds of spinlocks are so fundamentally misguided, why do they only cause problems on Linux?

It is possible that, as explained by Linus, the Windows scheduler just doesn't ootimize for NUMA as aggressively as linux (which attempts to keep threads on their own cores as much as possible).

It is also as likely that, because on Windows games have been an important load for a while, the Windows scheduler likely bends over backward to support broken code either by penalising other use cases or by attempting to recognise this sort of spinlock behaviour and special casing it.

Keeping threads on their own single cores goes way beyond optimizing for NUMA. Especially because the CPU being tested on only has a single NUMA node.

Even being maximally NUMA-friendly, if calling thread_yield a hundred times would eventually pull in a blocked thread from a neighboring core then the benchmark would see much smaller delays.

Perhaps Linux should do more 'special-casing' of broken/older apps?

In the email thread, Linus points out that GNU make has a bug in its concurrency code, which was only hit when newer kernels added some optimizations. As a result, they had to walk back on these improvements.

Now, I'm sure that the bug has been fixed in newer versions of make, but 'make' is a program that many people are unlikely to keep up-to-date (most users of it will be only using functionality that has been in make for decades and wouldn't ever consider needing to have a brand new version of it)

It would be a shame to hold back kernel optimizations if they trigger bugs in older user-space code, so in these kind of cases it would be great if Linux had some of the 'backwards-compatibility cruft' that Windows is famous for, allowing for new optimizations while detecting and working around known problems in common older code.

Sure! Remember Linus mantra of "never break userspace". But the code in point is not a legacy binary, it is a new program getting ported to linux.

Still it might make sense to change linux scheduler behaviour to improve performance of unmodified Windows games under Wine if it turns out that using yield (or the Windows equivalent Sleep(0)) is common.

And in fairness to the original post, the author had a latency issue, they invertigated it and correctly identified the problem with spinlock and yield and found out that using a proper mutex is the best solution. The only issue is that they thought that the scheduler was the root cause of the bad behaviour, not the yield itself.

> So the code in question is pure garbage. You can't do spinlocks like that. Or rather, you very much can do them like that, and when you do that you are measuring random latencies and getting nonsensical values, because what you are measuring is "I have a lot of busywork, where all the processes are CPU-bound, and I'm measuring random points of how long the scheduler kept the process in place".

> And then you write a blog-post blamings others, not understanding that it's your incorrect code that is garbage, and is giving random garbage values.

> And then you test different schedulers, and you get different random values that you think are interesting, because you think they show something cool about the schedulers.

Tldr: They got probably lucky on Windows. Conceivably similar code might have worked on Linux and then had issues on Windows.

...and that would be extremely likely to happen if the Linux version was being written first and only then ported to Windows.

Reinterpreted in my own words:

Usercode spinlocking is not legible to the kernel. Therefore anytime you attempt it you are in the situation of fighting the scheduler, and given that even the kernel devs are constantly challenged by the problem of making appropriate scheduling tradeoffs, you will lose.

Damn, Linus really knows his stuff, getting to the heart of the matter right the way - measuring user mode locking time is futile, as the OS scheduler can preemptively put the locking thread to sleep for arbitrary time. The OP ended up measuring how quickly the scheduler woke up the locking thread, thus measuring the wrong thing.

You think the OP doesn't know this? Common. Linus was making straw men arguments as he seems to commonly do.

The measurements still show Windows trouncing Linux in reponsiveness

Yes, because the benchmark was using spinlocks incorrectly.

There are many places where the Linux scheduler is more responsive, even in gaming, exactly because of what makes spinlocks inefficient. The Ryzen scheduler debacle comes to mind.

That first post is what Linus is responding to.

The second post was really educational for me.

I am under the impression that this long, vocal post, consisting of many assertions and no measurements, would have been judged less favorably if it were not from Linus.

I used to agree with many of Linus rants, but not this time though.

The gist of the reply, as I understand it, is that the benchmark is measuring random spikes while the spinning thread is scheduled away. But the benchmark clearly stated that they are using one thread per core and that the host was otherwise idle.

It could be that, still, some random task pops up once in a while, replacing the very thread that has the lock. But then, rather than calling the whole benchmark garbage I think it'd be more fair to just ask for it being rerun with, say, two free cores. Or use auditd to detect and invalidate those runs.

Linus also calling everything random sounds annoyingly slopy: varrying the scheduler or the OS led to significantly different measurements. Maybe the explanation for those differences are wrong but the measurements are certainly not random.

The final conclusion after throwing the blog author into the dumpster (which I’d be more polite about, but that’s Linus for you) is that unless you spent your whole career researching and implementing locks, you should use something that was created by somebody who did. Which is some damn good advice if you ask me (disclaimer: I didn’t research or implement any production-grade locking primitives.)

I don't know why you felt necessary to rephrase the final conclusion (I've read it - notice, by the way, that it was also the conclusion of the original benchmark) nor to introduce Torvalds' style (I mentioned I'm familiar with it).

Anyway, I do not think that a forum that calls itself "hacker news" should adhere to that dogma that any topic claimed by an authority to be complex should never be re-implemented on first principles. And I doubt the young Linus would have paid much attention to that kind of advice when he started developing a new operating system.

Personally, the conclusion I take from that thread is simpler:

On a situation when N threads are runnable on N cores with N-1 calling sched_yield(), one must not expect the Nth thread never to be scheduled out for a whole slice on Linux (for some or all of the reasons listed by Linus, none of which seems to address that exact case unfortunately).

Why would you expect otherwise from sched_yield?

man sched_yield(2) says: == sched_yield() is intended for use with real-time scheduling policies (i.e., SCHED_FIFO or SCHED_RR). Use of sched_yield() with nondetermin‐ istic scheduling policies such as SCHED_OTHER is unspecified and very likely means your application design is broken. ==

The problem with the article is that the author blamed Linux without knowing the worst case behavior of spinlocks.

When you are working on your own locking algorithm you should be fully aware of all the potential problems that a certain approach implies. It's not the 70s anymore back when multicore processors and the need for complicated locking algorithms were a rarity.

> I've read it - notice, by the way, that it was also the conclusion of the original benchmark

Unclear. The article's conclusions about spinlocks are completely unfounded by the article's code and benchmarks. You could say the article luckily got it right. That's Linus's ultimate contention; the code is garbage and leads to an undefined conclusion. That what's undefined is the right conclusion is lucky, today, and could be garbage tomorrow.

Measurement of what? You don't have to measure logic. Of course you can be unscheduled for any reason in userspace, of course you can be rescheduled with completely arbitrary latency esp. compared to targeted wake up, and of course polling for lock availability in this context make you risk poor results. If under other oses this is vaguely better for this synthetic meaningless "benchmark" by miracle, I guess they have other issues (and yes, actually they have)

A simple reasoning on why spinlock is slower than mutex in a heavily contested situation.

A thread (S1) attempting to acquire a spinlock will busy looping to hog all of its time slice, preventing other threads (S2) from utilizing the CPU to free the lock.

    S1: [**busy spinning**][..waiting........][**lock & run*****]
    S2: [..waiting........][**run & unlock***][..waiting........]
A thread (M1) attempting to acquire a mutex will sleep to give up its time slice, letting other threads (M2) to utilize the CPU and thus have a chance to free the lock. The given up time slice can be very substantial, leading to much better performance.

    M1: [sleep][..waiting........][**lock & run*****]
    M2: [.....][**run & unlock***][..waiting........]

If one really really wants to use spinlock, the way to use it correctly is to boost the priority of the thread holding the lock to make sure it got priority to run over all else, so that it can finish its work quickly and release the lock.

    S1: lower priority  [..waiting........][**lock & run*****] higher priority
    S2: higher priority [**run & unlock***] lower priority [..waiting........]
I assume the benchmark and the user-mode spinlock doesn't do that, and thus has the measured problem. OS/kernel would have a better idea to raise and lower priority with spinlock than user-mode code.

> If one really really wants to use spinlock

You will also have to make sure that your application threads are pinned to specific cores and that the number of threads is smaller than the number of cores because you want to keep one thread free to run non critical code.

Spinlocks are like full plate armor. Very effective but incredibly cumbersome to use.

Just searching the web to see if anyone has published a list of Linus Torvalds' best technical posts -- those where he shares his extensive knowledge of system programming, CPU architecture, etc. With all the years he has been posting on newsgroups, there should be a lot more "gold" out there.

Unfortunately, it seems that most people who write about Linus on the web are more interested in his worst posts -- where he has slung insults and profanity with no valuable technical insights.

Thanks! That's a great link for more than just Torvalds.

> Pretty much every time we picked an unfair - but fast - locking model in the kernel, we ended up regretting it eventually, and had to add fairness.

This is an interesting quote from Linus. I always thought that fairness was for the desktop and Linux had latency issues sometimes because it historically put the "big-iron" first. Now this quote makes it seem that fairness is a good thing for the server as well and the preceding paragraphs hint that supposed latency issues are just a natural consequence of the trade off between throughput and latency.

I highly recommend reading the whole thread, especially all Torvald's responses.

Maybe I'm not adding much as contribution here; but I am of the point of view that Torvald's responses are always worth reading through.

It's kind of disturbing to me that so many people in the comments are defending Linus's tone here. He writes like petulant child. Its wildly unprofessional, and only makes it harder to follow what follow what he's trying to say. I'm really glad I'm not actively involved in any community he's a part of.

What exactly you dislike about tone of that post? There only few harsh points there:

- He called garbage code garbage.

- He pointed out that most of people don't understand locking.

- He again pointed out that almost no one including him is qualified enough to write own locking code.

There is absolutely no personal attacks or whatever. It's just someone make a big fuss about how bad kernel is without real reasons and Linus called bullshit.

There's always a surge in people who defend and even mimic Linus's tone whenever a post like this gets popular. People love being that smart guy who gets to feels entitled to act obnoxiously because they're right - at least about one small thing. Reminds me of _that_ group in High School.

It sounds like there was an expectation that Linux should support something that it doesn't, and whether it _should_ support it is up for debate. Totally agree about Linus's involvement in any communities.

I assume part of it is that he feels personally attacked when something about Linux is challenged.

> It sounds like there was an expectation that Linux should support something that it doesn't, and whether it _should_ support it is up for debate.

You really didn't paid attention to what being discussed. Someone started discussion on how bad Linux kernel is based on incorrect data generated by badly written code.

The common refrain I hear is that "we're engineers. We don't have time for people's 'feelings' and 'emotions'". Ironically, the controversial things that Linus says are exactly those phrases that inject unnecessary emotions into a technical discussion. Take away all the emotionally charged words, replace them with purely factual descriptions or observations, and you would have something that better resembles the principles that Linus's defenders claim to be endorsing.

Ironically, if a manager or CEO or VC or politician used similar language, they would be attacked by the same people for being abusive and elitist. But when an alpha-programmer does it, it is somehow ok.

I've learnt so much from reading both the original blog post, and also the resulting conversation with Linus. The technical work Linus has done and is doing, is fantastic and to be commended. I just wish he followed through on his earlier desire to change his communication style.

Although what I'm about to say doesn't justify his behaviour, but have you ever considered the following:

a) He's aware of his tone, but probably bald tired of reading stuff that can be (and in this case is) perceived as an attack (the blog post's title comes to mind) on his work and a bad benchmark/study/conclusion is used to justify it? (Now imagine enduring this throughout decades... Yeah, one has to have very thick skin to be in such a position--have you ever tried to be in his shoes for a moment?)

b) He has called himself names as well (In the past I've seriously been tempted to compile a list of such self-deprecation behaviour but it's ultimately pointless); and

c) We don't know what's going on in his life or maybe it's just how he is or maybe it's a cultural thing.

My point is: sure Linus can do better on how to react to criticism, but inexperienced people ought to refrain themselves from blaming other people's work--especially from vastly experienced ones.

I do think you have a point here, especially in the last sentence. The original blog post also could have been more constructive in how it worded its criticism. As usual, this could be accomplished by being more precise. "Spin locks experience high latency under the Linux scheduler" would be a fine title for a blog post. Whether it would have received any upvotes is another question though. Maybe we, the voters, in our crow-like attraction to shiny titles, are as much to blame as anyone for the poor communication here.

summary: Don't roll your own (spin) locking unless you really want to go down the rabbit hole. Linus thinks locking is a PITA and there be dragons. You have been warned.

Synchronization primitives are another thing along with dates and names and postcodes and unicode text editing: "Stuff you should avoid writing yourself if at all possible because they are incredibly fiddly to get absolutely right."

They all share that quality of "its straightforward, anyone can just look up the details and make it"... a month later you're still chasing some detail as some bureaucrat has added some new wrinkle. Now a rule change elsewhere has happened... it never ends. Plus each locality has its own history and the details all make sense for them but not necessarily for you and your codebase.

Simple to implement but the complexity is getting it right for every case. None of it is easy.

You're also much, much better off figuring out how to structure your many threads so they interact as little as possible than trying to make them interact quickly, performance-wise.

Add to that lock-free concurrent data structures.

Also add crypto algorithms to that. I know of too many people that thought, a SHA-512 is fine for passwords. /facepalm

I don't understand this complaint. Cryptographic hashing gets obsoleted all the time. No matter which algorithm you choose you're just buying time until a better algorithm is needed. One day we will facepalm because people didn't use quantum resistant algorithms.

There's a major difference thinking it's okay to store all of the user's passwords on your site with a SHA-512 of the password vs using bcrypt/scrypt/pbkdf2 etc with a high work factor. One is just ignorant and the other is following modern practices and continually updating the work factor to address advances in computing power.

When was bcrypt or pbkdf2 made obsolete as a algorithm?

A straight hashing algorithm isn't what you want for storing passwords. At a minimum, you want it salted. It's also advisable to include an iteration count so that (a) brute-forcing is harder, and (b) you can tweak just one parameter in the future to deal with hardware speed increases. At this point, you've just reinvented a KDF and should use PBKDF2, scrypt, or whatever the KDF du jour is.

TL;DR storing passwords with SHA512 is bad, storing them with PBKDF2-SHA512 is ok and likely overkill.

Like all the mainstream Linux distributions out there?

But seriously, IIRC they do 5000 rounds by default, but I've been wondering for a few years how is it that supporting nothing better than SHA512 for the whole /etc/shadow -based system is still ok? How come there's still no support for scrypt or bcrypt?

Because it actually hardly matters at all?

It's salted enough that it's not amenable to a time-space optimisation and it's pessimised enough to make trivial attacks impractical so bad guys who have the database will resort to brute force on individual passwords. At which point either you have a good password and they can't guess it or you don't.

Fancier schemes are about the margin, can we make it too expensive to brute force say 8 alphanumerics? But you don't need to live in the margins and you shouldn't.

_spin_ locking in particular

They mention all locking

> But don't write your own. Find somebody else that wrote one, and spent the decades actually tuning it and making it work.

> Because you should never ever think that you're clever enough to write your own locking routines.. Because the likelihood is that you aren't (and by that "you" I very much include myself - we've tweaked all the in-kernel locking over decades, and gone through the simple test-and-set to ticket locks to cacheline-efficient queuing locks, and even people who know what they are doing tend to get it wrong several times).

> There's a reason why you can find decades of academic papers on locking. Really. It's hard.

Can anybody explain the advantages or otherwise of Apple’s libdispatch for locking? I understand there’s Linux support now also. Is this something you would use outside of Apple ecosystem?


Quoting from the link:

> A dispatch semaphore is an efficient implementation of a traditional counting semaphore. Dispatch semaphores call down to the kernel only when the calling thread needs to be blocked. If the calling semaphore does not need to block, no kernel call is made.

This sounds like the same optimization that futex(2) makes for pthread mutex on Linux. So while a naive semaphore might enter the kernel 100% of the time, this one does so only when there is actual waiting to do.

It would seem to me that sem_wait() does similar on glibc so Linux probably performs similarly with the standard POSIX semaphore:

> https://github.com/bminor/glibc/blob/master/nptl/sem_wait.c

mirrored from here:

> https://sourceware.org/git/gitweb.cgi?p=glibc.git;a=blob;f=n...

Don't use dispatch semaphores where mutexes (or dispatch queues) would suffice. Apple platforms have a QOS mechanism that donates priority from higher-QOS threads to lower ones when they're waiting on locks that the higher-QOS thread owns, but with semaphores the system doesn't know what thread "owns" the semaphore and so it can't donate priority. Mutexes (and dispatch queues), including os_unfair_lock, will donate priority accordingly.

Which is to say, dispatch semaphores shouldn't be used if you're initializing them with a value of 1, you should only use them if you have an actual pool of resources you're protecting.

Semaphores are for signaling (sames a condition variables, events) while mutexes are for mutual exclusion. Technically you can also use semaphores for mutual exclusion (a mutex can be thought as a binary semaphore) but you really shouldn't.

Right, but libdispatch doesn't have a mutex. It has semaphores and queues. So if you're trying to use libdispatch and you don't want the closure-based aspect of queues, you might be tempted to use a semaphore instead. Don't do that, use os_unfair_lock or pthread_mutex (or a higher-level construct like NSLock) instead.

I was hazy on remembering details for my comment above, but I now realize fully that I do have experience with semaphores on Darwin. It went like this:

First I tried POSIX semaphores. They seemed poorly done on Darwin. [Maybe they needed sem_open() rather than sem_init()? But unnamed ones are way more useful!]

Then I tried Mach semaphores.

Then I saw a recommendation for libdispatch semaphores to avoid too many syscalls.

Given that Linux does the syscall-avoiding trick by default for standard-conforming semaphores, it seems to me like on Darwin you can't go too far wrong always using libdispatch. But I guess you miss that priority inversion workaround you were talking about.

On Apple platforms the simple set of guidelines would look something like:

* Can you use dispatch_queue? Great, use it.

* If not, can you use pthread_mutex? Note that this can be configured to be unfair if desired. If so, use it.

* If not (e.g. you need actual semaphore behavior), use dispatch_semaphore.

os_unfair_lock is more of an optimization when speed is a must. It's a replacement for the old OSSpinLock, but AFAICT it basically acts like a normal mutex under contention, so I'm not actually sure why pthread_mutex would be preferred over it, except the documentation says to prefer pthread or libdispatch as they're "higher level" primitives.

Also of interest is dispatch_group_t, which is basically an inverted semaphore (it counts active tasks, and lets you wait until all tasks have completed). Under the hood it's actually a semaphore that's initialized with LONG_MAX as the value and some custom logic (including atomics) around waiting.

Yeah, obviously if I needed a mutex I would use a mutex, I don't have that particular confusion and would expect pthread mutex to be well worn on the platform. (And I noticed Darwin has some helpful pthread_mutex_attr values.) I am talking only about the case where you really want a semaphore.

It's a different programming paradigm. Libdispatch is really meant to be a form of coroutine-oriented asynchronous serialization. It can be used for synchronous locking but it's not particularly good at it.

I remember someone at google proposed new syscall to voluntarily give up time slice from one process to another which they intended to use for co-routines in c/c++ (which they used internationally and boy was it awesome) but it was not accepted. Sure would be nice for this case as well.

I know it's a typo, but still - I'm adding "schenario" to my list of good words to use when describing screwy code.

Isn't it a joke contraction about schedule + scenario?

You’re right, that makes it even better.

Some people might call his rhetoric abusive. I think it’s well deserved though love.

Where is the link to the origin of this discussion ? From the realworldtech.com looks this is the first post https://www.realworldtech.com/forum/?threadid=189711&curpost... . But that post is already an reply to something else's post.

Is this your interpretation or you know it for sure. Sorry for being skeptical. I am having hard time believing Linus Torvalds would have an account on realworldtech.com. I have never heard of this site before. The only place i have seen Linus reply is in the kernel mailing lists. Also is this email id torvalds.delete@this.linux-foundation.org ? How do we even know the reply is from linus.

Sidenote : just had a random thought reading linus talking about not fighting the scheduler:

are there common theoretical problems between a garbage collector / memory allocator (aka : automatic handling of memory resource) and a scheduler (aka : automatic handling of cpu resource) ?

Could there be some kind of super abstraction solving common issues to both problem in a more mathematical approach ?

What I'm reading from here:

> Or at least be aware of it, and have some virtualization-aware paravirtualized spinlock so that you can tell the hypervisor that "hey, don't do that to me right now, I'm in a critical region".

is that your kernel could be hypervisor aware?

Could you compile a kernel that flags itself as being in a critical region all the time? Would that give you a whole CPU indefinitely?

> is that your kernel could be hypervisor aware?

I mean, paravirtualization is precisely when a kernel communicates with the hypervisor. Dunno however how scheduling factors into it—I was under the impression that VMs are treated much like userland programs. I.e. would similarly be scheduled off sooner or later.

Maybe Linus should update the docs: https://www.kernel.org/doc/htmldocs/kernel-locking/locks.htm... I think it would be useful to add a warning there.

So, ignoring whether the code is good or not, what's the correct way to do this benchmark? Is there any way measure the timers atomically, or to prevent your code from getting preempted in a tiny section, or have the kernel notify you if it gets preempted? Maybe something can be hacked in with transactional memory infrastructure?

Realtime scheduling. It only makes sense for a tiny process/thread that is carefully written to avoid the bad effects of no preemption.

This benchmark runs 32 software threads on a machine with 8 CPU threads. Getting preempted a lot is a major part of what's being measured. But samples that got preempted at exactly the wrong spot need to be filtered out, which is tricky.

I'm sure there's probably some succinct way to do so using perf_event in C but off the top of my head instead of just looking at the wall clock in isolation you could also compare it to the task clock to see if there was any pause of execution. Just make sure that you're actually including the time spent in the kernel during whatever syscalls you're trying to include but IIRC that's just a flag that you can set when opening the event.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact