Followed by Linus's response to Malte:
The long and the short of it:
- It's nearly impossible to make meaningful benchmarks for this because reality will almost always be different and more complicated
- Calling sched_yield() is almost always a mistake
- Any scheduling optimization is probably a waste of time because either your running env will change, or the kernel will change, or someone else's optimizations will clash with yours.
- Even with decades of research into locking, everyone (including the experts) continually gets it wrong.
- If you're not running a specialized embedded setup where you control all of the running processes and other such variables, just use sleep locks and call it a day.
One caveat is that calling sched_yield is a fantastic throughput optimization for the spinning portion of an adaptive mutex. If you have parallel code that sporadically contends for short critical sections then my data has always said this:
- something like a futex lock is optimal
- you want barging, not fairness
- spinning the CPU is a questionable idea even if you do it for a short time. The pause instruction does something like jack and shit. Busy waiting of any kind seems great for microbenchmarks sometimes but it’s pretty bad for anything complex.
- spinning 40 times while calling sched_yield prior to futex_wait (or whatever) appears optimal on many schedulers and benchmarks. It’s hard to tell why. Best I can tell it’s because in the large parallel apps I’ve worked on and benchmarked, there are often threads that are ready to do some useful work right now, so yielding from a thread that can’t get a lock to one of those is a system throughput optimization. If there are no such threads then it’s a bit like spinning the CPU.
The 40 yields thing has a funny story for me. I learned of it from a comment in the JikesRVM code that was more than a decade old claiming it was optimal on their custom scheduler on a 12 core POWER box running AIX. The benchmark cited was something I’d never heard of. When I rewrote how JikesRVM did thread scheduling, I was sure that this would need tuning. Nope. It’s optimal on Linux and Darwin on x86 on basically any Java benchmark that has lock contention. I later implemented the same heuristic in WebKit’s locks. It’s still optimal.
To be sure, there is a broad plateau of goodness around 40. Like, 50 is about as good. But if you go order of magnitude in either direction, performance tanks. I’m superstitious so I’ve stayed with exactly 40 whenever I have to write this code.
I wrote a lot more about this stuff once: https://webkit.org/blog/6161/locking-in-webkit/
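Below is a minimal sketch (my own, not the WebKit or JikesRVM code) of the heuristic described above: spin a bounded number of times, calling sched_yield() on each failed attempt, then fall back to a futex wait. It assumes Linux; the names and the three-state layout are mine, and the constant 40 is just the value discussed above.

    /* yield-then-futex lock sketch: 0 = free, 1 = held, 2 = held with waiters */
    #include <linux/futex.h>
    #include <stdatomic.h>
    #include <sys/syscall.h>
    #include <sched.h>
    #include <unistd.h>

    #define SPIN_LIMIT 40

    typedef struct { atomic_int state; } yield_futex_lock;

    static void yf_lock(yield_futex_lock *l)
    {
        for (int i = 0; i < SPIN_LIMIT; i++) {
            int expected = 0;
            if (atomic_compare_exchange_weak(&l->state, &expected, 1))
                return;            /* got the lock while "spinning" */
            sched_yield();         /* let some other runnable thread make progress */
        }
        /* Give up spinning: sleep in the kernel until the holder wakes us. */
        while (atomic_exchange(&l->state, 2) != 0)
            syscall(SYS_futex, &l->state, FUTEX_WAIT, 2, NULL, NULL, 0);
    }

    static void yf_unlock(yield_futex_lock *l)
    {
        if (atomic_exchange(&l->state, 0) == 2)    /* someone may be sleeping */
            syscall(SYS_futex, &l->state, FUTEX_WAKE, 1, NULL, NULL, 0);
    }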
It would be interesting to see a distribution of wait time per lock for different locks. If such a two step approach performs well, a hypothesis might be that there are three kinds of locks:
- mostly uncontended: lock succeeds without spinning
- held for exceptionally short amounts of time: lock usually gets released again before waiting thread on different CPU stops spinning. Wait time is lower than the time it takes to yield to the scheduler and/or reschedule.
- held for a really long time: wait time is much longer than scheduling another thread and re-scheduling the initial one.
And what you are saying seems to indicate that optimizing for the first two is worth it because the overhead is negligible for the third type; that is, few locks are held for so long that spinning adds noticeable overhead.
The paragraph about InitializeCriticalSectionAndSpinCount and SetCriticalSectionSpinCount describes the behavior. IIRC, the default spin count used to be pretty high (1000 loops or so). Not sure if that was changed.
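For reference, a small sketch of how that Win32 API is used; the spin count of 1000 here just mirrors the default recalled above, and the helper names are made up:

    #include <windows.h>

    static CRITICAL_SECTION cs;

    void locks_init(void)
    {
        /* Spin up to 1000 times on contention before blocking in the kernel. */
        InitializeCriticalSectionAndSpinCount(&cs, 1000);
    }

    void do_critical_work(void (*work)(void))
    {
        EnterCriticalSection(&cs);   /* adaptive: spins first, then waits */
        work();
        LeaveCriticalSection(&cs);
    }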
I’ve tested this over the course of a decade on multiple Linuxes, multiple Darwins, multiple x86 flavors (Intel, AMD, and lots of core counts and topologies), POWER, and various ARMs, and on many large benchmarks in two very different languages (Java and C/C++). In Java I tested it in two very different VMs (JikesRVM and FijiVM). I think the key is that a typical benchmark for me is >million lines of code with very heterogeneous and chaotic locking behavior stemming from the fact that there are hundreds (at least) of different hot critical sections of varying lengths and subtle relationships between them. So you get a law of large numbers or maybe wisdom of the masses kind of “averaging” of differences between CPUs and schedulers.
I’d love to see some contradictory data on similarly big stuff. But if you’re just saying that some benchmark that had a very homogenous lock behavior (like ~one hot critical section in the code that always runs for a predictable amount of time and never blocks on weird OS stuff) experiences wild differences between CPUs and schedulers then sure. But that just means there are no ideal answers for that scenario, not that there aren’t ideal answers for anyone.
- you cannot implement a functioning spin lock in user space
(for all practical intents and purposes, assuming you don't resort to extreme measures, and we're talking about modern general-purpose operating systems)
Only then does it make sense to use spin locks.
In general you should expect userspace code to be scheduled out at any time and use locking primitives that the kernel knows about.
Even if your thread won't be scheduled off the core, I don't know whether IRQs might still steal time during the critical section.
> Lots of users, all of which are basically buggy-by-definition, and all you can do is a bad half-arsed job of trying to make it "kind of work".
> Reality is messy.
I don’t buy claims about lock behavior based on microbenchmarks. Microbenchmarks have wildly different behavior from one another so it stands to reason that they have wildly different behavior from real code.
Linus’s excuse is that if you use a real lock then shit is good. That’s a pretty good excuse IMO.
There are runnable threads that are not being run. It would be nice to know why.
Tell a person their code is wrong and you may have saved them from one mistake. Tell them why their code is wrong and you will save them from a dozen.
Perhaps Linus' concern is that sched_yield doesn't guarantee any particular behavior, it just happens to be implemented that way now.
> There are runnable threads that are not being run. It would be nice to know why.
More extreme. It only applies to the same core, more or less. The original benchmarks are on a one-node CPU, after all.
1. Scheduler looks through all the processes on the system, decides A is the one that should run right now
2. A decides it doesn't want to run, calls yield.
3. Scheduler looks again, thinks A is still the right one to run, and runs it again.
That is, in most cases, if the scheduler thought there was a better process to run, it would already have been running.
He states (paraphrasing): it's undefined because there's no real way (outside a realtime scheduler) to find what to yield to without introducing a lot of complexity.
That search itself is the work the kernel would rather not do at all. Step 3 is undefined, and thus sched_yield itself is undefined.
So let's look over our cores:
6 of them are churning through processes that do nothing but call sched_yield. Nothing has been waiting to run for more than 3 microseconds.
2 of them are doing real work, and the benchmark threads that happen to be on the same core have been waiting to run for most of a millisecond.
If you're stuck, then B is one of the processes that has been waiting.
So the obvious change would be to move threads around in sched_yield. Then B will get scheduled within microseconds.
What's your CPU use right now? I bet it's less than 200%. Or will be if you exclude your web browser.
The typical background process is active far less than one percent of the time.
Right, but you would expect the "switch to another thread" function to eventually pick the one that's sitting there with a growing and growing time since it last got a chance to run.
It would be easy to guarantee that if any thread has been waiting for at least x amount of time, then any context switch without long-waiting local threads will pick that thread. (If it's on the same NUMA node, at least.) But in the actual implementation that time can build up to a huge amount with no effect, even as thousands of yield calls go by.
> Especially not if that process is soft locked to a different core.
Yeah, that's the problem, it's getting locked to a core even though other cores are desperate for threads to run.
Pipes have a small buffer. Once it's full, the caller will block, and the scheduler will know precisely what to schedule to unblock it.
What about just temporarily increasing the process' nice value?
That should be simple to do, and the scheduler already knows how to yield to processes with lower nice values.
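A hedged sketch of that suggestion (the helper name is made up): temporarily raise the calling thread's nice value, then restore it. On Linux, setpriority() with PRIO_PROCESS and id 0 affects the calling thread; note that lowering the nice value back down may require CAP_SYS_NICE or a suitable RLIMIT_NICE.

    #include <sys/resource.h>
    #include <errno.h>

    int yield_by_nice(void)
    {
        errno = 0;
        int old = getpriority(PRIO_PROCESS, 0);
        if (old == -1 && errno != 0)
            return -1;
        if (setpriority(PRIO_PROCESS, 0, old + 5) != 0)  /* be "nicer" for a while */
            return -1;
        /* ... do the low-priority waiting here ... */
        return setpriority(PRIO_PROCESS, 0, old);        /* restore; may need privileges */
    }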
Tell the kernel why you want to yield CPU. Work with the kernel, not against it. Do you want to yield because you genuinely want to wait until some future point in time? Use sleep. Is it because you're waiting for some I/O? Use poll or its variants. Is it because you're waiting for a signal from another thread? Use mutexes, condition variables, etc. that call futex under the hood.
Think of it as a speculative optimization on top of busy waiting, not as a way to control the scheduler.
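A minimal sketch of the "waiting for a signal from another thread" case above, assuming POSIX threads (the flag and function names are mine): instead of yielding in a loop, a consumer blocks on a condition variable (futex-backed on Linux) until a producer signals that work is ready.

    #include <pthread.h>
    #include <stdbool.h>

    static pthread_mutex_t lock  = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  ready = PTHREAD_COND_INITIALIZER;
    static bool work_ready = false;

    void wait_for_work(void)
    {
        pthread_mutex_lock(&lock);
        while (!work_ready)                    /* guard against spurious wakeups */
            pthread_cond_wait(&ready, &lock);  /* sleeps in the kernel, no spinning */
        work_ready = false;
        pthread_mutex_unlock(&lock);
    }

    void submit_work(void)
    {
        pthread_mutex_lock(&lock);
        work_ready = true;
        pthread_cond_signal(&ready);           /* wakes the waiting thread */
        pthread_mutex_unlock(&lock);
    }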
Why don't you want to call sleep?
The scheduler itself uses fixed time interrupts for its basic functions, so "sleep(fixed_time_interval)" is what you will get in the end anyway on a preemptive system.
"The author of that post has a common misconception, which is that spin locks are a fast drop in replacement for uncontended mutexes. They aren't. You can only use them well when you know you won't be descheduled, like we do in the kernel. In a generic userspace program, you don't have that knowledge. The possibility of your time slice ending skews all the time measurements, for example."
Boom. Roughly the same message. Most people will get the same amount of edification from this as Linus' original letter. You could go into more detail if you want but it's not really necessary. Smart people will figure it out if you just point out the general shape of the error. No need for calling code "garbage."
When I read your version, I learn that spin locks are not a drop-in replacement for uncontested mutexes. That's great, except that I don't really know for sure what either of those terms mean, and also I didn't gain any knowledge that doesn't relate to this specific situation. It also wasn't super convincing -- I could just as easily see someone post a reply saying "well actually spin locks ARE a drop-in something something mutex" and I'd have no way to tell who was right.
Linus, on the other hand, told me about the mechanism by which the problem happens. I learned about how the scheduler sees spin-locks' loops, and that the main advantage of "real" mutexes is that the scheduler knows who's waiting on what. I also wouldn't necessarily have realized the "take clock measurement, then get descheduled before you release the lock" issue if he hadn't called it out specifically.
Sure, I guess "garbage" comes off a little mean, but "the author has a common misconception" almost sounds more passive-aggressive to me.
>Sure, I guess "garbage" comes off a little mean, but "the author has a common misconception" almost sounds more passive-aggressive to me.
The latter might be slightly passive-aggressive (although it doesn't have to be), the former is simply aggressive-aggressive.
I think Linus got upset that a badly designed benchmark got popular and is therefore spreading misconceptions. His strongly worded answer is getting popular too, and as a result everybody learns and he has stopped an urban legend from spreading. A success on all points.
I think your comment is just wrong and generally pure garbage. I can't believe some poor server will have to host this nonsense for years to come. It's the most inane thing I've read all year.
I'm just talking about your comment, not you though, so clearly no hard feelings here and it's definitely the right way to frame my argument to have a nice, peaceful and constructive technical discussion.
It is certainly possible to describe code with bad words without hurting the feelings of the person who made it, and I think, in order for that to work, everyone has to understand that everybody writes “garbage” or “crap” code sometimes, so when it happens, it’s not a big deal; it shouldn’t hurt your professional dignity.
At my job, on a team of 4, we do it from time to time and it’s just mentally easier, more streamlined, and yes, more constructive to conduct code reviews this way. I understand this may not work for other, bigger teams, but it’s certainly a valid way.
And it was more than just the "garbage" line. It was also things like "If I haven't convinced you of that by now, I don't know what I can say.", and "See how simplistic - and self-centered - your expectation is?".
Writing this way doesn't "streamline" anything. I interpreted it as a negative interaction, and clearly others have as well including the original author (since he said as much in his response).
Exactly. Blameless and non-violent language can streamline communication. It does this by removing any trigger that would make your communication partner feel defensive. Needlessly emotional and charged language is sure to distract from the subject at hand. When you use calm language, there is only one thing to focus on: the content. When you call someone's code garbage, there are now two things that that person could focus on -- the content and the tone.
To me, it is completely reasonable that the original author might have made a mistake like this. After all, when he shared his findings it seemed like he had found something interesting to me! Many people on this very site commented positively on his efforts. There is nothing "garbage" about the code that he wrote, it just doesn't quite do exactly what he thought it did.
It was blogged about, so carries some authority on the topic, and was convincing to readers, but the code did nothing close -- not just "not quite exactly" -- to what it was portrayed to be doing in the benchmarks or article. It is garbage; useless; should be binned. I tried to come up with better adjectives -- "dangerous", "useless", etc -- but they came up short, in subtle ways. Who cares; Linus explains what he means better than the one, concise word.
It always contains some information payload. If you felt triggered by it and failed to parse the content then maybe you're overly emotional.
For some people who aren't, it's going to be a useful assessment of how bad this code was.
I see this ongoing trend (Twitter) that by limiting the ways in which people can express themselves, you can in theory broaden the recipient circle (to include people who otherwise would be offended and wouldn't read).
In practice you not only limit the ways in which information can be distributed but often also lose some people who weren't offended and would have gotten useful information this way.
Looking at speech this way is also a bad idea because today's acceptable is tomorrow's unacceptable word and the only way to allow all information is to allow all speech.
"maybe you're overly emotional." Feel free to label me however you please. It does not change the fact that if your goal is to communicate ideas efficiently, my way will help you avoid alienating some folks, and has no practical downside.
> and has no practical downside
That's where we differ in opinion. It has one direct downside of preventing some people from expressing their opinions and conveying the point. And another in the long run – setting precedents of what can and cannot be said.
> Feel free to label me however you please
It was a general note, not something directed at you.
Certainly, ad hominem attacks are worse than making substantive attacks, but is that really the standard we should judge it by? Instead of a sprawling thread of visceral reactions, wouldn't it be better if he just explained the reasons why the approach is bad in a professional manner?
Writing like this just seems sloppy and unprofessional. It's rambling and unclear. It could have been far more effective if written point-by-point with far less emotion.
I would answer your rhetorical questions with an emphatic "no." People's emotions and mental state are important. It's also important that we communicate, understand, and think about people's interior worlds.
Question for you: What else is Linus communicating with his writing style? Does he show that he cares? About what? What can we tell about his values? Are these values we want in a leader?
Sometimes it's good to take off the prescriptivist hat and try on Honest Curiosity.
Explaining why and educating about the concepts is the better approach, and that was there as well. Short and concise would be best, while limiting the negative to exactly what needs to be pointed out: a spade is a spade, after all.
If it stops people from developing "their own userspace spinlocks", it may even be kind of necessary to be quite direct.
I don't see how this is the case. One can critique a style of communication professionally just like any endeavor. I'm not "bashing" Linus, or even suggesting that he should be less honest. Just that there are far more effective ways to make the same point.
Has that ever happened? My impression was that he only ever takes a rough tone with high profile coders and is nice and polite to "poor coders". Also, he never insults them personally, but says that their code/argument is bad and that they are smart enough to know that already. Hence, they should stop making excuses and fix it already.
This is especially inexcusable coming from Linus because it's not like he needs to resort to these tactics to be heard or make an impression, he's one of the most respected software devs in existence (and for good reasons). You could remove all the abuse from his posts and you'd lose exactly nothing IMO.
But this discussion is hackneyed, I know many people love Linus' style and think it's perfectly acceptable and even a good thing, I'm just definitely not one of them.
It’s sad that there are people who think calling garbage code garbage is inappropriate. What goes on in their minds? Don’t they smell, or call a smell a smell?
If this was a truly idiotic / trivial mistake I doubt Linus would put much thought or time into such a long post. I also think that if you are going to write a blog post, you are implicitly supposing a place of authority and therefore opening yourself to criticisms. In other words, writing a blog post that is incorrect (in this case, very incorrect) is actively harmful as it misleads others, and it is important to call out wrong posts when necessary. I've posted dumb advice with a know-it-all voice before too (see  for an embarrassing HN comment) and I am not at all angry/annoyed that I was similarly called out.
 https://news.ycombinator.com/item?id=20017832 (edit: wrong link)
In my opinion calling the code garbage is much nicer than "The author of that post has a common misconception", which sounds like a passive-aggressive way to call someone dumb. Everyone writes bad/'garbage' code once in a while.
Not at all. It's a polite way to say they're ignorant of a particular thing. It would be rather insecure of them to read into it and think they're being called dumb across the board.
When you say "the person who wrote this code has a common misconception", that's a statement about the person. You've jumped to a conclusion about the reason the person wrote the bad code. You're telling the person to accept that they don't know as much about mutexes as you, that they're ignorant, that they're a worse programmer (albeit in a fairly narrow category).
This is why I think GP's wording comes off worse even though it's superficially more polite.
Giving someone an out is a great thing to do, especially for people for whom saving face is important. But I don't think "this code is garbage" does that. Something like, "that code is measuring timing incorrectly because X", conveys the same information without being inflammatory. "Garbage" is a value judgment, while just pointing out that it's doing something incorrectly is merely stating a fact.
Saying that stuff at the beginning only clutters the reasoning, as one should not "just believe" that that code was "indeed garbage", just because he said so.
Additionally, _sometimes_ offering someone an out to save their ego actually does them a disservice.
If you were forced to summarize the entire first third of the email into ten words, that might be valid. As-is, it's way too short to be a proper explanation. It's more of a reminder note. If someone doesn't know exactly what you mean at a single glance, it's useless. Linus has an actual explanation in his email.
IMO, he can call something garbage if the mentioned garbage makes false and bold claims about his and contributors' decades of work.
By the way, Linus' answer is much more informative than yours. I got a lot of useful insights from it and had fun just reading his lively language.
It's a bit disingenuous to suggest that Torvalds meant it in that sense, though. He very obviously meant it in the sense of a piece of trash that should be thrown into a trash bin.
(Also I've never heard of dead code referred to as "garbage code".)
People should be shamed for being ignorant. It stops ignorance from spreading.
reacting in a haughty manner doesn't stop you from being wrong, it just makes you a dick.
If these kinds of spinlocks are so fundamentally misguided, why do they only cause problems on Linux?
I’m more for polite passive aggressive suggestion of incompetence instead of in your face blaming, otherwise the response was quite on point.
On the plus side the original author now has plenty to read about and us here get to learn interesting things by proxy.
Certainly that same observation drove the development of BFS / MuQSS, which unsurprisingly behaves better on his artificial benchmark than the mainline scheduler.
Android is such a pain to develop games for that Google started an Android Games initiative only in 2019, to appease the growing complaints from game studios.
An NDK that feels like a 20% project, many game-related APIs (e.g. image codecs, networking, files, permissions) that are Java-only and require JNI even if implemented in C++, frame rate setup, build infrastructure regarding NDK libraries, C-only APIs for what are actually C++ classes forcing wrapping code for RAII and such, Vulkan feels like a code dump that everyone has to manually integrate into the project vs Metal tooling, graphics debugging and real-time audio are outsourced to GitHub projects not directly supported on the official Android site, network debugging is only supported for the Java layer, ...
Finally, the only Linux that Android has is the kernel, which Google could replace tomorrow and user space wouldn't even notice.
Linux fans keep failing to realise how little Linux Android actually exposes.
Other than "Apple Revenue", in no other world is this a metric of "biggest" in the context of even the most "squint hard enough" contexts of this thread.
Plausibly more time is spent gaming in browsers, or not. My guess would be Android apps, possibly browser games. AFAICT the link only covers paid games, mostly counting by revenue, but I didn't look hard.
Since Android has 76% of the mobile OS market I'd say that Android is the main gaming platform of the world. And the web is least used for gaming.
This is even before starting the whole discussion that market share doesn't translate into gaming market share, nor into the most profitable market share, or even native vs web views vs mobile browser usage.
That's greater than .26 (consoles) and .25 (PCs).
The only Linux in any Linux-based operating system is the kernel.
Any kernel will do the job required by userspace, that is running ART, ISO C, ISO C++ and the set of Android native APIs.
Yes, Google could replace Linux but they would also have to port drivers for every device Android runs on to that kernel.
Or just virtualize the whole kernel and Android runtime via a new kernel-based hypervisor for compatibility/migration purposes and switch everyone to Fuchsia/Zircon.
Google doesn't care about writing the drivers for other/older phones for manufacturers, as they already have a hardware division to do it themselves for their phones, laptops and tablets. If Android were to be replaced by something else (Fuchsia), the vendors would be writing the drivers instead if they want to support Fuchsia.
Also, they don't need to "port drivers for every device Android runs on" since they would just start with their own devices first and the other manufacturers will eventually write drivers for their devices to support Fuchsia.
The problem is that they won't be required to open them up unlike Linux. But I guess the benefit to Google is that they have full control rather than waiting for one person to approve their changes.
Nokia gave that a go, back in ~2009 if my memory serves me correctly. They actually had L4 (with certain security extensions) running on some of their prototype ARM boards. Linux, as an OS kernel, was then running on top of that.
It didn't get much traction, and I heard very little of it from 2010 onwards.
Google undoubtedly has the singular oomph to launch a brand new OS, I just haven't seen a reason yet. The few things they say they want to improve upon aren't limitations in Android (or Linux). Third party vendors and their chip makers are going to —perhaps rightly— view this as an Emperor's New Clothes upgrade and resist. They're going to have to make a business case somewhere along the line.
Let's wait for a production release before we start the Android/Linux death knell.
And sure, Google can change the kernel, but that would be a lot of work. The same could be said of many Linux distros. For example Debian GNU/Hurd is an experimental version of Debian swapping Linux for Hurd.
It's got to be something more expensive than that. Hololens starts at $3500, so it's probably a good candidate for instance.
Which kinda gets the point across that revenue per user matters very little.
If we're limiting ourselves to software revenue and ignoring device cost, look towards VR headsets. Practically every user is buying expensive games.
If you only want to include things with built in processors as platforms, look at Oculus quest in particular.
After all, it was Series 30, Bada, Symbian, Blackberry, Psion and Windows CE that started mobile phones.
This is actually irrelevant; Android could be the worst environment to develop for, but since it has the biggest user base it still has people wanting to develop for it.
> Finally, the only Linux that Android has is the kernel, which Google could replace tomorrow and user space wouldn't even notice.
This is not really true, especially considering games. For one, since most games use the Android NDK, you can't simply swap the kernel and have everything magically work. You will at least need a recompilation, and this new kernel API must be bug-for-bug compatible with Linux or things will break in strange ways.
For another, there are things like the spinlock/mutex/scheduler discussion in this very topic that, as shown, can make a big difference. It could be better in a kernel optimized for latency, or it could be much worse, making games that are playable on low-end devices today a complete shit show.
No, it doesn't have the biggest gaming userbase, as shown in a market analysis report I just posted.
Yeah, I never said that you should directly access the kernel API using the NDK (though you actually can). I said that just the fact that the NDK is compiled to native code means that the code depends on the Linux kernel, and changing the kernel would mean, at minimum, a recompilation. The Android NDK only guarantees an API, not an ABI, and those two things are very different.
So no, it is not like changing the kernel would have zero impact on the Android game market. It would actually have a huge impact, since almost every game would stop working until recompiled (games that do not use the NDK would probably be fine, but few games are not native). And of course, for backwards compatibility purposes you would need to have multiple binaries (though this is already true nowadays, since Android supports multiple architectures besides ARM).
> No, it doesn't have the biggest gaming userbase, as shown in a market analysis report I just posted.
I never said it was the biggest gaming userbase, I said it was the biggest userbase. Even if Android doesn't have the biggest gaming userbase, just the fact that you have a userbase as big as Android's means people will develop games for it.
Also the whole point of ISO C, ISO C++, OpenGL, Vulkan, OpenSL is to be OS independent.
Game studios are not a charity, they develop games for profitable markets, not for large userbases that don't pay for games.
Isn't it libhoudini? AFAIK, libhoudini is not something that Google Play embeds into the binary for distribution, it is a library that ships on Android x86 devices (and, another AFAIK, it is proprietary to Intel, so it only ships on Intel devices).
So while Google could have a similar approach if they want to ship another kernel for Android, the work is not as trivial as you seem to think.
> Also the whole point of ISO C, ISO C++, OpenGL, Vulkan, OpenSL is to be OS independent.
Yeah, but it doesn't mean you don't need to recompile your software. You can't take a game written for Windows in OpenGL and run it on Linux without recompiling. This is my point.
> Game studios are not a charity, they develop games for profitable markets, not for large userbases that don't pay for games.
Well, Android seems to be pretty profitable? It is not the most profitable platform sure, however it sure makes enough money because almost everyone ends up creating games for both iOS and Android.
You seem to imply that making games for Android is not profitable, however it sure seems to be. Otherwise Epic Games, EA, Activision, and many other big-name game studios are a charity.
And the point wasn't that it isn't profitable as such, rather that it isn't the money maker some here are making waves about; more of a wannabe versus the revenue made across desktop, consoles and iOS.
Hence Google introduced the Android Game SDK last December (which is quite basic as of now), and Google I/O 2019 was the first time Android ever had a games track, versus WWDC, where iOS/macOS have had one since forever.
So you concur with me that if Google changed the Android kernel it would be a big change? Just imagine: you would have to recompile all applications using the NDK to work with this new Android. It is not something trivial; sure, they can make it work, however it would probably take quite some time, with Google making it obligatory for developers to ship both Linux and new-kernel binaries.
And even then, you can't make sure that everything will work. A small change in the behavior of a specific syscall may break a random application somewhere that depends on it, because even if this syscall is not directly used, the Android NDK uses it to implement its API. Now multiply this by the number of NDK applications that Android has and you will have a huge amount of breakage.
So yeah, Google can make it work, however it is not as trivial as you think.
> And the point wasn't that it isn't profitable as such, rather that it isn't the money maker some here are making waves about; more of a wannabe versus the revenue made across desktop, consoles and iOS.
Sure, we can concur at this point. However the market for Android gaming is still more significant than you seem to think (not a wannabe by any means, but lower than iOS, sure).
No, I don't; anyone writing portable applications already recompiles all the time.
It is no different from the requirements to provide newer builds targeting specific Android versions, or to go 64-bit only, that Google keeps using to force devs forward.
It is Google and OEM's job to ensure that NDK APIs work across whatever kernels are available.
Any application that depends on syscalls isn't portable across Android devices anyway. Google doesn't support syscalls as part of the NDK stable APIs contract, and some OEMs are well known to change their behavior.
That is basically saying that this is a lot of work for everyone involved (developers, Google, OEMs). This is exactly my point.
> Any application that depends on syscalls isn't portable across Android devices anyway.
The application does not necessarily call a syscall directly; the problem is that the NDK calls a syscall as an implementation detail (for example, for lock primitives the NDK must eventually call the Linux lock primitives too, there is no way out).
So if Google needs to change the Android kernel, they need to implement the same quirks of the current syscalls, or the slight change in behavior will break random applications.
My point was never about calling syscalls directly, though for sure some applications do call syscalls directly too.
"You know, although we are consumed by sectarian fights and iron-fisted leaders that will crush anyone having opinions, It is important to note that worlds one of the worlds most successful contemporary technology companies is actually Syrian".
All major OSes had CS PhDs making their schedulers "efficient" around 2000, able to sustain 98%+ MP loads. Linux was an odd case of a very naive and "inefficient" scheduler that left the CPU 20-30% idle unless you really knew how to hack around that. Linux, however, came with nearly no stalls and latency spikes.
Things have changed dramatically now. Linux got the attention of the entire industry, and years of holy wars about scheduling.
You can now hit 95%+ under multithreaded load easily, but even on the desktop you can now see how a blocking IO-hog app can visibly freeze your system for a few seconds.
Now the trend has reversed again, and everybody is praising conservative scheduling that doesn't give out too much CPU :-/
Or worse. I ran into an issue when copying an image from an NVME drive to a USB drive using `cat` (e.g. `cat somefile.img >/dev/sdX`) where mouse performance (also a USB device) became so sluggish that the desktop was unusable until the copy was complete.
This wasn't the only instance where background I/O brought my desktop to its knees. In another instance it was copying a large file across the network (Gigabit Ethernet) to a file server. The network was saturated and the desktop was unusable. I think that the remote system was the destination but my recollection could be in error. The network interface is a PCI device and there was no USB device involved in this transfer.
This is on Debian Stable and using the Gnome desktop. The non-standard portion of the system is that it runs off a ZFS filesystem. I don't know if that matters.
I'd like to know where to look and how to diagnose this issue if/when it crops up again. Interrupt storm? USB driver issue? Some other issue resulting in some resource starvation? (In neither case did overall CPU load appear high.)
Don't know the reason, but lots of unwritten I/O on a (slow) drive can block processes waiting for I/O on an unrelated drive.
By default, dirty buffers are a predetermined portion of RAM, so lots of RAM and slow I/O can make for long pauses. I usually tweak sysctls on desktops to limit dirty buffers. E.g.
vm.dirty_background_bytes = 67108864
vm.dirty_bytes = 134217728
IO blocks are almost always CPU blocks too on anything that does even minimal IO. And the more simultaneous IO, the more CPU overhead there is.
Try compiling something big with -j 32 or more.
Decreasing dirty memory decreases flush times, decreasing process write/flush times, decreasing process/UI hangs/pauses. Presumably it also lowers throughput in some cases, but it's better than having UI hangs.
Linux traditionally has a problem of overall system architecture in that userspace isn't set up to provide policy information such as "this thread is doing UI" vs. "this is a background task" by default.
However, if I help the scheduler along by nice(1)-ing my big compute tasks, then the desktop remains perfectly smooth even while the CPU is 100% busy.
It'd be much better to have this experience by default, but unfortunately it requires architectural improvements throughout the entire software stack. Trying to band-aid it in the kernel scheduler may allow you to heuristically improve some workloads, but it's ultimately a fool's errand.
Where the desktop does tend to fall over and become totally unusable is when the system starts to swap...
But was this a good or bad thing? Even desktop machines can now have 16+ cores and NUMA. So, perhaps that focus has paid off for more users?
> So this all started like this: I overheard somebody at work complaining about mysterious stalls while porting Rage 2 to Stadia. The only thing those mysterious stalls had in common was that they were all using spinlocks.
If the OP's benchmark is invalid and if the root cause is just "overheard" by somebody without sharing detailed analysis, we don't even know if the problem is really about spinlocks.
We are still unsure whether it comes from the Linux kernel or is just Stadia platform-specific. Do we see similar issues on Android?
It is possible that, as explained by Linus, the Windows scheduler just doesn't optimize for NUMA as aggressively as Linux (which attempts to keep threads on their own cores as much as possible).
It is also just as likely that, because games have been an important workload on Windows for a while, the Windows scheduler bends over backwards to support broken code, either by penalising other use cases or by attempting to recognise this sort of spinlock behaviour and special-casing it.
Even being maximally NUMA-friendly, if calling thread_yield a hundred times would eventually pull in a blocked thread from a neighboring core then the benchmark would see much smaller delays.
In the email thread, Linus points out that GNU make has a bug in its concurrency code, which was only hit when newer kernels added some optimizations. As a result, they had to walk back on these improvements.
Now, I'm sure that the bug has been fixed in newer versions of make, but 'make' is a program that many people are unlikely to keep up-to-date (most users of it will be only using functionality that has been in make for decades and wouldn't ever consider needing to have a brand new version of it)
It would be a shame to hold back kernel optimizations if they trigger bugs in older user-space code, so in these kind of cases it would be great if Linux had some of the 'backwards-compatibility cruft' that Windows is famous for, allowing for new optimizations while detecting and working around known problems in common older code.
Still, it might make sense to change the Linux scheduler's behaviour to improve performance of unmodified Windows games under Wine if it turns out that using yield (or the Windows equivalent Sleep(0)) is common.
And in fairness to the original post, the author had a latency issue, they investigated it and correctly identified the problem with spinlocks and yield, and found that using a proper mutex is the best solution. The only issue is that they thought the scheduler was the root cause of the bad behaviour, not the yield itself.
> And then you write a blog-post blamings others, not understanding that it's your incorrect code that is garbage, and is giving random garbage values.
> And then you test different schedulers, and you get different random values that you think are interesting, because you think they show something cool about the schedulers.
TL;DR: They probably got lucky on Windows. Conceivably, similar code might have worked on Linux and then had issues on Windows.
Usercode spinlocking is not legible to the kernel. Therefore anytime you attempt it you are in the situation of fighting the scheduler, and given that even the kernel devs are constantly challenged by the problem of making appropriate scheduling tradeoffs, you will lose.
The measurements still show Windows trouncing Linux in responsiveness
There are many places where the Linux scheduler is more responsive, even in gaming, exactly because of what makes spinlocks inefficient. The Ryzen scheduler debacle comes to mind.
The second post was really educational for me.
I used to agree with many of Linus's rants, but not this time.
The gist of the reply, as I understand it, is that the benchmark is measuring random spikes while the spinning thread is scheduled away. But the benchmark clearly stated that they are using one thread per core and that the host was otherwise idle.
It could be that, still, some random task pops up once in a while, replacing the very thread that has the lock. But then, rather than calling the whole benchmark garbage I think it'd be more fair to just ask for it being rerun with, say, two free cores. Or use auditd to detect and invalidate those runs.
Linus also calling everything random sounds annoyingly sloppy: varying the scheduler or the OS led to significantly different measurements. Maybe the explanations for those differences are wrong, but the measurements are certainly not random.
Anyway, I do not think that a forum that calls itself "hacker news" should adhere to the dogma that any topic claimed by an authority to be complex should never be re-implemented from first principles. And I doubt the young Linus would have paid much attention to that kind of advice when he started developing a new operating system.
Personally, the conclusion I take from that thread is simpler:
In a situation where N threads are runnable on N cores with N-1 of them calling sched_yield(), one must not expect that the Nth thread will never be scheduled out for a whole slice on Linux (for some or all of the reasons listed by Linus, none of which seems to address that exact case, unfortunately).
man sched_yield(2) says:
    sched_yield() is intended for use with real-time scheduling policies
    (i.e., SCHED_FIFO or SCHED_RR). Use of sched_yield() with nondeterministic
    scheduling policies such as SCHED_OTHER is unspecified and very likely
    means your application design is broken.
When you are working on your own locking algorithm you should be fully aware of all the potential problems that a certain approach implies. It's not the 70s anymore back when multicore processors and the need for complicated locking algorithms were a rarity.
Unclear. The article's conclusions about spinlocks are completely unfounded by the article's code and benchmarks. You could say the article luckily got it right. That's Linus's ultimate contention; the code is garbage and leads to an undefined conclusion. That what's undefined is the right conclusion is lucky, today, and could be garbage tomorrow.
A thread (S1) attempting to acquire a spinlock will busy-loop, hogging all of its time slice and preventing other threads (S2) from utilizing the CPU to free the lock.
S1: [**busy spinning**][..waiting........][**lock & run*****]
S2: [..waiting........][**run & unlock***][..waiting........]
M1: [sleep][..waiting........][**lock & run*****]
M2: [.....][**run & unlock***][..waiting........]
S1: lower priority [..waiting........][**lock & run*****] higher priority
S2: higher priority [**run & unlock***] lower priority [..waiting........]
You will also have to make sure that your application threads are pinned to specific cores and that the number of threads is smaller than the number of cores, because you want to keep one core free to run non-critical code.
Spinlocks are like full plate armor. Very effective but incredibly cumbersome to use.
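A minimal sketch of the pinning part, assuming Linux and pthreads (the helper name and the choice of leaving core 0 free are mine):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    /* Restrict a thread to exactly one core. */
    static int pin_to_core(pthread_t thread, int core)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        return pthread_setaffinity_np(thread, sizeof(set), &set);
    }

    /* e.g. for worker i (0-based), call pin_to_core(workers[i], i + 1),
     * leaving core 0 free for non-critical code. */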
Unfortunately, it seems that most people who write about Linus on the web are more interested in his worst posts -- where he has slung insults and profanity with no valuable technical insights.
This is an interesting quote from Linus. I always thought that fairness was for the desktop and Linux had latency issues sometimes because it historically put the "big-iron" first.
Now this quote makes it seem that fairness is a good thing for the server as well and the preceding paragraphs hint that supposed latency issues are just a natural consequence of the trade off between throughput and latency.
- He called garbage code garbage.
- He pointed out that most people don't understand locking.
- He again pointed out that almost no one, including him, is qualified enough to write their own locking code.
There are absolutely no personal attacks or whatever. It's just someone making a big fuss about how bad the kernel is without real reasons, and Linus called bullshit.
It sounds like there was an expectation that Linux should support something that it doesn't, and whether it _should_ support it is up for debate. Totally agree about Linus's involvement in any communities.
I assume part of it is that he feels personally attacked when something about Linux is challenged.
You really didn't pay attention to what was being discussed. Someone started a discussion on how bad the Linux kernel is based on incorrect data generated by badly written code.
Ironically, if a manager or CEO or VC or politician used similar language, they would be attacked by the same people for being abusive and elitist. But when an alpha-programmer does it, it is somehow ok.
I've learnt so much from reading both the original blog post, and also the resulting conversation with Linus. The technical work Linus has done and is doing, is fantastic and to be commended. I just wish he followed through on his earlier desire to change his communication style.
a) He's aware of his tone, but probably dead tired of reading stuff that can be (and in this case is) perceived as an attack (the blog post's title comes to mind) on his work, where a bad benchmark/study/conclusion is used to justify it? (Now imagine enduring this for decades... Yeah, one has to have very thick skin to be in such a position--have you ever tried to be in his shoes for a moment?)
b) He has called himself names as well (In the past I've seriously been tempted to compile a list of such self-deprecation behaviour but it's ultimately pointless); and
c) We don't know what's going on in his life or maybe it's just how he is or maybe it's a cultural thing.
My point is: sure, Linus could do better at reacting to criticism, but inexperienced people ought to refrain from blaming other people's work--especially that of vastly experienced people.
Simple to implement but the complexity is getting it right for every case. None of it is easy.
When were bcrypt or pbkdf2 made obsolete as algorithms?
TL;DR storing passwords with SHA512 is bad, storing them with PBKDF2-SHA512 is ok and likely overkill.
But seriously, IIRC they do 5000 rounds by default, but I've been wondering for a few years how it is that supporting nothing better than SHA512 for the whole /etc/shadow-based system is still ok? How come there's still no support for scrypt or bcrypt?
It's salted enough that it's not amenable to a time-space optimisation and it's pessimised enough to make trivial attacks impractical so bad guys who have the database will resort to brute force on individual passwords. At which point either you have a good password and they can't guess it or you don't.
Fancier schemes are about the margin, can we make it too expensive to brute force say 8 alphanumerics? But you don't need to live in the margins and you shouldn't.
> But don't write your own. Find somebody else that wrote one, and spent the decades actually tuning it and making it work.
> Because you should never ever think that you're clever enough to write your own locking routines.. Because the likelihood is that you aren't (and by that "you" I very much include myself - we've tweaked all the in-kernel locking over decades, and gone through the simple test-and-set to ticket locks to cacheline-efficient queuing locks, and even people who know what they are doing tend to get it wrong several times).
> There's a reason why you can find decades of academic papers on locking. Really. It's hard.
> A dispatch semaphore is an efficient implementation of a traditional counting semaphore. Dispatch semaphores call down to the kernel only when the calling thread needs to be blocked. If the calling semaphore does not need to block, no kernel call is made.
This sounds like the same optimization that futex(2) makes for pthread mutex on Linux. So while a naive semaphore might enter the kernel 100% of the time, this one does so only when there is actual waiting to do.
It would seem to me that sem_wait() does similar on glibc so Linux probably performs similarly with the standard POSIX semaphore:
mirrored from here:
Which is to say, dispatch semaphores shouldn't be used if you're initializing them with a value of 1, you should only use them if you have an actual pool of resources you're protecting.
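A minimal sketch of that usage, assuming the libdispatch C API (the pool size and helper names are made up): the semaphore guards a pool of N resources rather than acting as a binary lock.

    #include <dispatch/dispatch.h>

    #define POOL_SIZE 4

    static dispatch_semaphore_t pool_slots;

    void pool_init(void)
    {
        pool_slots = dispatch_semaphore_create(POOL_SIZE);  /* counting semaphore */
    }

    void pool_use(void (*work)(void))
    {
        /* Only calls into the kernel if all POOL_SIZE slots are taken. */
        dispatch_semaphore_wait(pool_slots, DISPATCH_TIME_FOREVER);
        work();
        dispatch_semaphore_signal(pool_slots);              /* release the slot */
    }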
First I tried POSIX semaphores. They seemed poorly done on Darwin. [Maybe they needed sem_open() rather than sem_init()? But unnamed ones are way more useful!]
Then I tried Mach semaphores.
Then I saw a recommendation for libdispatch semaphores to avoid too many syscalls.
Given that Linux does the syscall-avoiding trick by default for standard-conforming semaphores, it seems to me like on Darwin you can't go too far wrong always using libdispatch. But I guess you miss that priority inversion workaround you were talking about.
* Can you use dispatch_queue? Great, use it.
* If not, can you use pthread_mutex? Note that this can be configured to be unfair if desired. If so, use it.
* If not (e.g. you need actual semaphore behavior), use dispatch_semaphore.
os_unfair_lock is more of an optimization when speed is a must. It's a replacement for the old OSSpinLock, but AFAICT it basically acts like a normal mutex under contention, so I'm not actually sure why pthread_mutex would be preferred over it, except the documentation says to prefer pthread or libdispatch as they're "higher level" primitives.
Also of interest is dispatch_group_t, which is basically an inverted semaphore (it counts active tasks, and lets you wait until all tasks have completed). Under the hood it's actually a semaphore that's initialized with LONG_MAX as the value and some custom logic (including atomics) around waiting.
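For illustration, a small sketch of that "inverted semaphore" behaviour using the plain-C *_f variants of the libdispatch API (no blocks needed); the task function is made up:

    #include <dispatch/dispatch.h>
    #include <stdio.h>

    static void task(void *ctx)
    {
        printf("task %ld done\n", (long)ctx);
    }

    int main(void)
    {
        dispatch_group_t group = dispatch_group_create();
        dispatch_queue_t queue =
            dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);

        for (long i = 0; i < 8; i++)
            dispatch_group_async_f(group, queue, (void *)i, task);

        /* Blocks until every task submitted to the group has finished. */
        dispatch_group_wait(group, DISPATCH_TIME_FOREVER);
        dispatch_release(group);
        return 0;
    }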
Are there common theoretical problems between a garbage collector / memory allocator (aka automatic handling of the memory resource) and a scheduler (aka automatic handling of the CPU resource)?
Could there be some kind of super-abstraction solving issues common to both problems in a more mathematical approach?
> Or at least be aware of it, and have some virtualization-aware paravirtualized spinlock so that you can tell the hypervisor that "hey, don't do that to me right now, I'm in a critical region".
Is the idea that your kernel could be hypervisor-aware?
Could you compile a kernel that flags itself as being in a critical region all the time? Would that give you a whole CPU indefinitely?
I mean, paravirtualization is precisely when a kernel communicates with the hypervisor. Dunno however how scheduling factors into it—I was under the impression that VMs are treated much like userland programs. I.e. would similarly be scheduled off sooner or later.