I’m pretty sure that the Linux behavior makes sched_yield better suited to the 40-yields optimization I use for locks.
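For anyone who hasn't seen the pattern, here is a rough sketch of what a "yield up to 40 times, then block" lock could look like. This is an illustration under assumptions, not the actual lock being referred to: the names `YieldingLock`, `kYieldLimit`, and `count_with_lock` are made up, the slow path uses a plain mutex plus condition variable instead of a real parking mechanism, and `std::this_thread::yield()` stands in for sched_yield (on Linux it is commonly implemented that way, but that is up to the standard library).

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical sketch: try the lock, yield a bounded number of times,
// then fall back to blocking. Not a production lock.
class YieldingLock {
public:
    void lock() {
        // Bounded yield loop; the limit of 40 is taken from the comment above.
        for (int i = 0; i < kYieldLimit; ++i) {
            if (try_lock()) return;
            std::this_thread::yield();
        }
        // Slow path: block until an unlock() wakes us and we win the lock.
        std::unique_lock<std::mutex> guard(park_mutex_);
        ++waiters_;
        park_cv_.wait(guard, [this] { return try_lock(); });
        --waiters_;
    }

    bool try_lock() {
        bool expected = false;
        return held_.compare_exchange_strong(expected, true,
                                             std::memory_order_acquire,
                                             std::memory_order_relaxed);
    }

    void unlock() {
        assert(held_.load(std::memory_order_relaxed));  // must be held
        held_.store(false, std::memory_order_release);
        // Taking the mutex on every unlock is slow but keeps the sketch
        // simple and avoids lost wakeups; a real lock would avoid this.
        std::lock_guard<std::mutex> guard(park_mutex_);
        if (waiters_ > 0) park_cv_.notify_one();
    }

private:
    static constexpr int kYieldLimit = 40;
    std::atomic<bool> held_{false};
    std::mutex park_mutex_;
    std::condition_variable park_cv_;
    int waiters_ = 0;  // guarded by park_mutex_
};

// Hypothetical helper: several threads bump a shared counter under the lock.
long count_with_lock(int num_threads, int iters) {
    YieldingLock lock;
    long counter = 0;
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&] {
            for (int i = 0; i < iters; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    }
    for (auto& w : workers) w.join();
    return counter;
}
```

The idea is that short critical sections are usually released within a few yields, so most contended acquisitions never reach the blocking path. Whether that pays off depends heavily on what the scheduler does on a yield, which is the point under discussion.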
I don’t buy claims about lock behavior based on microbenchmarks. Microbenchmarks behave wildly differently from one another, so it stands to reason that they also behave wildly differently from real code.
Linus’s excuse is that if you use a real lock then shit is good. That’s a pretty good excuse IMO.
One reason they might act that way: they have done nowhere near as much NUMA or cache-locality optimization work as the Linux kernel has. I'm thinking of the 2nd-generation Threadripper issues on Windows in particular.
That's quite likely. I think that the sched_yield behavior is throughput-optimized, which is consistent with what you'd expect from a server-optimized OS like Linux.