Now, consider that there's work going on to enable Linux to be compiled with profile-guided optimization with clang[0], the DAMON patchset that enables proactive reclamation with "32% memory saving with only 1.91% runtime overhead"[1], and the performance improvements achievable with the futex2 system call[2].
Linux's future seems bright with regard to performance.
Linux needs a standard set of benchmarks that are vaguely representative of things users use Linux for. Phone, laptop and a few server use cases.
It then needs someone to do a big parameter tuning to select optimal settings.
Too many decent algorithms don't make it into the kernel because there are too many tunables, and the ones that do typically aren't well tuned for anyone's use case.
Even big projects like Ubuntu typically don't change many tunables in the kernel.
> Linux needs a standard set of benchmarks that are vaguely representative of things users use Linux for.
Have you seen the Phoronix framework + benchmarks? They use common tools, but they've done some good work on making the tests repeatable and accessible to anyone else.
Think about it this way, we're playing Guess Who, and you haven't asked about gender yet. You ask "Do they have dark hair?". It's obviously a question about a singular person, and that's the correct way to structure that question in English.
Certainly. But I referred to the presented guidelines, collectively - not to "singular they". I wrote and intended that _those linked guidelines_ are "not universal".
Singular they in English usage dates back 600-700 years. People can disagree with those guidelines all they want - they will have to deal with the fact that people choose to use singular they and they can get worked up over it or get used to it.
It is clearly valid. I personally don't care if others use "he" as a generic, but seeing that some people don't feel included when it is used, I increasingly use "they". Not least when I don't know someone's preference, because I consider the most inclusive option to be the most polite.
You objectively are, and when it comes to choosing whom to care about, that means I have absolutely no interest in catering to people upset that others are included.
There is an objective difference between generic he and generic they: In many contexts you can not tell whether the "he" was intended to be generic or not.
I did not write anything about "singular they". I wrote against the legitimacy of those linked guidelines, which accidentally include "singular they".
The poster I replied to stated (as I remember) that some use of language is made legitimate by some group laying out some guidelines. That some pretty random group («not philologists») lays out guidelines is maybe an "affiliation pass" for their group, but not universally valid.
You need grounds, good grounds. Check those guidelines...
Some people: "let's use, of already-established pronouns, the one that doesn't make any assumptions about people"
You: "no, fuck you, other people's interpretation of what I say is strictly their problem, I'm gonna stick with 'he' no matter what other people feel or request, because I can"
The only ideology I can see here is the absolutist sort of "nobody can even suggest to me that I change a thing" or "I can't possibly cause offense if I don't mean to, so I don't need to think about word usage."
I don’t understand how it doesn’t make assumptions about people. In the case of a generic antecedent, sure that’s fine and well established historically. But in this case there’s a known antecedent, Michael Larabel.
I don’t know what Michael’s preferred pronouns are, but isn’t the original poster in this chain assuming it’s they/them and aren’t they more likely, statistically speaking, to be he/him?
Why does it somehow not count as misgendering when you they/them someone that prefers he/him or she/her?
> Why does it somehow not count as misgendering when you they/them someone that prefers he/him or she/her?
Because one appears to make an assumption about gender, whether or not you intended it to be taken as a generic, while the other objectively is generic.
As far as options go, they/them minimises assumptions. Yes, some people might still take offence, but given that most of the people who take offence at that do so because they oppose inclusiveness of others, I'm perfectly fine with not respecting their choice in the matter.
I go by experience. There may be exceptions, but I've yet to see one, so I don't particularly care if there's a large pool of exceptions outside the horizon of my personal experience.
If you are an exception and have a good reason for taking offence that doesn't involve excluding others, do tell.
“You” was exclusively plural (singular equivalent was “thou”), but its meaning shifted to be either singular or plural a few hundred years ago. So “are” was indeed a plural-only conjugation until that shift happened.
Now “you” is shifting again: in formal English it is still singular or plural, but in colloquial American English — at least where I’m from — it can only be singular. The plural form is “you guys” or “y’all”, depending on dialect. (The “guys” in this pronoun shouldn’t be confused with the noun “guys”, which is usually only used for boys and men; “you guys” is used to refer to groups of any gender).
Exactly. There are loads of examples of such shifts in other languages; in Portuguese, for example, the formal "you" is conjugated in the grammatical third person despite being semantically second person, arising from people being formally addressed in the third person ("Does the Right Honourable gentleman agree that...").
Same is true in Spanish, where "usted" ultimately derives from "vuestra merced".
For a completely different example, "on" in formal French means something like "one" in English: a pronoun for a general, unspecified person; as such, it takes singular conjugations. However, in colloquial speech, it means "we" and has almost completely replaced "nous" as a subject pronoun.
Thus: "il est" (he is); "on est" (we are, colloquial language); "nous sommes" (we are, standard language)
As I read it, the point was not about the wrong pronoun but about the fact that the suite is developed by one person rather than a large team, and that the burden of benchmarking Linux performance should maybe not rest on one person alone.
If by "old school" you mean "everywhere outside of Silicon Valley", then maybe. Coming from a pretty conservative society, I read these sorts of discussions with mild amusement. I may have formed a mistaken impression of the US society, but it seems to me that it tends to bounce from one extreme of political or social views to another.
One of the things that struck me recently is that working out what a computer is actually spending its time doing is now almost a master's thesis to do properly.
I found that out the hard way when I was trying to profile my own code.
If you're using a few optimized libraries and designed your code for high-speed parallel execution, the compiled thing becomes almost impossible to trace in a practical manner.
I used perf record to get CPU efficiency numbers, and it worked very well; however, to be able to reliably trace the code, I had to compile it in debug mode with no optimizations and run with a single thread. That combination increased the execution time considerably.
Some of the matrix and solver code coming from Eigen is heavily optimized for SIMD operations, and using it with -O0 is just painful.
Moreover, I had to verify its memory sanity and used Valgrind for that. Using a full-size problem meant it had to churn for days to finish execution.
Having deadlines doesn't always help in this stuff.
I think perf and valgrind are a fantastic combination. While valgrind has a big overhead, it can make a lot of things visible without instrumenting the code.
I also think perf is phenomenal for its scope. It shows performance metrics for the code without any processor dependency or instrumentation.
On the positive side of things, the notion that performance is multivarious and means different things in different contexts has really sunk in with the general tech demographic, and we don't see much of the "one big number that sums it all" type of stuff. Think the GHz race or FPS measuring contests.
Both with the GHz race and FPS measuring contests, we've gotten to the point where there's substantially diminishing returns - GHz has stopped growing in the same way, and reaching the limits of human perception with FPS is generally readily achievable, and so it loses interest as a contest.
I suspect your broader point stands - certainly "performance is multivarious and means different things in different contexts" has sunk in for me. But I suspect the actual difficulty of reasoning about performance deters most people from doing it beyond big O notation. The fact that I'm aware that performance is complicated doesn't drive me to understand the complexity, I mostly just say "good enough" and move on.
Right, problem solved. Now we just need to identify the few representative use cases. Is it people using a phone as, er, a phone, with actual voice communication? Or video conferencing? Or those who play games casually or not-quite-so-casually? Or video streaming? Or just texting, where you'd rather prioritize security and battery runtime?
And that's just the phone. I could come up with many more questions regarding desktop and server usage as I'm more familiar with those.
I'm afraid there is no one optimal set of configuration options. Not even two or three.
Tuning is highly overrated in software performance; e.g. compiler optimizations don't help that much, and searching their params helps even less. I think it comes from people wishing they could change their software without actually needing to learn how to change it.
It is overrated when you are tuning the wrong thing, but if it is the bottleneck in a significant process for your use case then it is often very significant, especially at scale. For someone running a compute cluster for their own needs or as a cloud service, a couple of percent improvement like this represents a massive gain in throughput or a saving in the need for extra kit (or power costs - there may even be a small environmental benefit).
Of course on the scale of you or me, such optimisations may be little more than a curiosity most of the time. A process that normally takes 100 minutes (a video transcode, perhaps) now taking 95, followed by the machine being idle while it waits for us to look back and see the result, is not really benefiting from the improvement.
> without actually needing to learn how to change it
This is certainly a problem sometimes, but not an argument for tuning being overrated in general. It usually comes down to tuning the wrong thing, like someone playing with mysql engine & kernel IO parameters to eke out a fraction of a percent bonus when they could improve index structures, or fix queries with no sargable predicates, or both in unison, to get benefits measured in orders of magnitude.
It can be overrated in the case of average gamers tweaking their hardware to get a few extra FPS on top of the many tens they already get, but again, this not being worth the time for some, even many, doesn't mean it can't make a huge difference to a pro gamer or someone using old kit where "a few extra" is a relatively large gain.
That's quite a bad take. You can bank on PGO+LTO of a C++ server being good for 20% throughput easily, and not compiling for k8-generic when you have a modern machine also makes a big difference. I would agree that just flipping compiler flags might not get you very far, but there are key inputs that you shouldn't leave on the table.
By "not that much" I mean orders of magnitude. 20% is reasonable, especially for C++ that needs a lot of inlining.
But a more modern language would benefit from defining some optimizations like inlining as mandatory, the same way tail calls are (and of course GCC has this.) Then there isn't a chance of deploying a build with extra-slow behavior.
A performance improvement of 20% would be a massive win for something as large and widely used as the Linux kernel.
That would average out into a few percentage points of performance improvement in every application across the globe which runs on Linux / Android. Merging a patch like that into Linux would be like quietly reaching into every device across the planet and giving them a small, free CPU improvement and a drop in power usage. Not game changing for any individual user, but huge in aggregate.
Some do. I bet lots of companies miss out on massive cost savings like this simply because nobody at the company has the skills and access to improve things.
I heard a story - years ago Google hired an engineer who happened to be an expert from a previous life in video codecs. He wasn’t working on YouTube at Google, but out of interest he pulled up the YouTube source code to see what it did. They were just using the defaults for some of the encoding parameters. He tweaked a few of them and in doing so saved Google millions of dollars per year in compute/storage/network traffic. A few hours of his work was probably more beneficial for Google than years spent in his primary role.
And if tuning opportunities like this abound at Google, you know they’re everywhere. It’d be much better if Linux distributions simply shipped kernels which are already compiled with PGO, with a reasonable profile based on normalish use.
That could've happened because of pointless detuning - x264 comes with good defaults for everything, but then ffmpeg on top of it used to set everything to "off" no matter what it was, and they were probably using the ffmpeg settings.
The correct answer would've been for ffmpeg to not ship that way.
Anyway, I am an expert on video codecs just like that guy is, and I said what I said ;)
Probably, but there are many smaller outfits that probably don't because they have a server fleet of "only" a few dozen servers or so. The cost savings for those won't be in the millions, but having to purchase ~5% fewer servers is still a pretty good win.
That highly depends on your problem. I have some scientific code that gets more than a factor 10 improvement on runtime just by turning on some compiler flags (mainly -O3 and -march, but some others as well).
Getting better performance with -Os over -O is some kind of edge case and people shouldn't generally expect that. -Os has disastrous consequences for C++ algorithms because it refuses to inline functors, so while you may rightly believe that a C++ std::sort is slightly faster than a C qsort, due to superior opportunities to optimize the C++ code, with -Os you'll find that std::sort is an order of magnitude slower. Definitely pays to check the result with a full-scale benchmark.
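For anyone who wants to check this themselves, here is a minimal sketch of that kind of benchmark (nothing clever, and not the exact test anyone in this thread ran) - build it once with -O3 and once with -Os and compare:

    // Minimal sketch: time std::sort (lambda comparator) against C qsort on
    // the same data. Build with -O3 and again with -Os and compare the runs.
    #include <algorithm>
    #include <chrono>
    #include <cstdio>
    #include <cstdlib>
    #include <random>
    #include <vector>

    static int cmp_int(const void* a, const void* b) {
        int x = *static_cast<const int*>(a);
        int y = *static_cast<const int*>(b);
        return (x > y) - (x < y);
    }

    int main() {
        std::vector<int> data(10'000'000);
        std::mt19937 rng(42);
        for (int& v : data) v = static_cast<int>(rng() % 1000000);

        std::vector<int> a = data, b = data;

        auto t0 = std::chrono::steady_clock::now();
        std::sort(a.begin(), a.end(), [](int x, int y) { return x < y; });
        auto t1 = std::chrono::steady_clock::now();
        std::qsort(b.data(), b.size(), sizeof(int), cmp_int);
        auto t2 = std::chrono::steady_clock::now();

        using ms = std::chrono::duration<double, std::milli>;
        std::printf("std::sort: %.1f ms  qsort: %.1f ms\n",
                    ms(t1 - t0).count(), ms(t2 - t1).count());
    }

Whether -Os actually decides to out-of-line the comparator will depend on the compiler and its version, which is rather the point.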
Interesting claim. I had to try this, and with the Xcode version that I have installed, std::sort with a simple lambda as the comparison function gives about the same result with -O3 and -Os. I would not call this disastrous, but of course opinions differ. Interestingly, qsort is significantly faster with -O3 than with -Os, but of course nowhere near std::sort.
Results will vary for small vs. large programs. I've seen catastrophic space optimizations that out-of-lined very small methods like std::vector::at, because the call was one or two bytes smaller than the inline. Lambdas are inlined even with -Os because they don't have names or multiple callers and can't be made smaller by out-of-lining. A functor class, or any function with multiple call sites, could trigger the problems with -Os.
Ok, I'm not continuing with this. Just note that I didn't write that -Os would be always or even usually faster. I could try to come up with an example where -O3 produces a huge loop preamble for a loop that's iterated once, or just generates enough cache misses to be overall slower, but I don't care enough.
Not a scientific codebase, but ffmpeg seems to benefit massively from optimization.
I use it to encode my DVD and Blu-ray TV/movies to HEVC. Doing so reduces the file size of DVDs by roughly 80% and Blu-rays by 60%.
The downside is that the encoding is an incredibly CPU-intensive process. Hardware encoders like Nvidia's NVENC or Intel's QuickSync look absolutely terrible and are a non-starter for archival storage.
On a stock Fedora XFCE install, I would get roughly 0.5 FPS for a 1080p Blu-ray file (29.97 FPS at 1920x1080).
A Gentoo Linux installation with a global -O3 -march=native as well as LTO, PGO and Graphite enabled globally boosts it from 0.5 to roughly 1.3 FPS. Still slower than realtime, but an absolutely massive improvement
I suspect -march=native is the only thing doing any work here. The other optimizations are as likely to find compiler bugs as they are to improve things, once you get off well-tested paths.
I think it's the opposite. Well-annotated code which can use SIMD instructions accelerates tremendously with -O3. -march=native -mtune=native generally improves things if you can saturate the cores with instructions, which can be measured by running perf record and looking at IPC & instruction retirement numbers.
When I was using Eigen in my code, the biggest performance boost came from -O3. -march and -mtune made minimal improvements on the systems I've run benchmarks on.
ffmpeg and associated projects do not rely on autovectorization, which pretty much only works on naive scientific code. They write their SIMD in assembly and don't need to care about compiler settings.
If you're compiling hand-tuned assembly, -march and -mtune will probably have more effect than when compiling and optimizing C/C++ code.
OTOH, I'd like to underline that heavily optimized scientific code and libraries are neither naive (in terms of algorithmic complexity/implementation) nor straightforward :D
Sure! This is the script I wrote for it. I use MP4 containers so I can import them into MAGIX (formerly Sony) Vegas, but it will iterate over an entire folder of MakeMKV files and convert them: https://dpaste.com/DBPA9C59M
I don't know if you would consider it tuning, but a low-latency kernel is a massive UX improvement on consumer devices, where you don't want the UI to stutter and lock up under heavy load, and don't mind paying (could be wrong, off the top of my head) something like a 5% throughput penalty.
We just improved the performance of one copying tool by a factor of 2.9 by tuning the size of a memory buffer. (BTW, a shout-out to Flamegraphs and hyperfine - excellent tools for profiling and benchmarking respectively.)
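Not our tool's code, obviously, but for anyone wondering what kind of knob that is, here's a hypothetical sketch of a copy loop whose throughput hinges on a single buffer-size parameter - exactly the sort of thing you can sweep with hyperfine:

    // Hypothetical sketch: a copy loop with one tunable, the buffer size.
    // Sweeping buf_size (say 4 KiB vs 1 MiB) is the kind of single-parameter
    // tuning that can change throughput by a large factor.
    #include <cstddef>
    #include <fstream>
    #include <vector>

    void copy_file(const char* src, const char* dst, std::size_t buf_size) {
        std::ifstream in(src, std::ios::binary);
        std::ofstream out(dst, std::ios::binary);
        std::vector<char> buf(buf_size);
        while (in.read(buf.data(), static_cast<std::streamsize>(buf.size())) || in.gcount() > 0) {
            out.write(buf.data(), in.gcount());
        }
    }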
Tuning kernel/process priorities has made a big difference in my audio code. Tweaking the GPU and USB IRQ handler priorities let me do realtime audio together with realtime OpenGL data visualization and screen recording, where before the audio would skip or the visualization would hang.
As someone with a limited knowledge of kernel stuff, does this mean Linux is likely to significantly outperform other comparable kernels like FreeBSD and OpenSolaris? Or are those other kernels keeping pace?
Considering its absolute dominance in HPC for a few years now, I don't think there is competition against Linux at the moment in terms of performance. The last time a non-Linux system appeared in the TOP500 list was June 2017.
There is some more in-depth discussion about the core folio idea on LWN at https://lwn.net/Articles/849538/ from a previous iteration of the patch set.
You'd think so, but it's just not true. Linux has a substantial performance advantage over Windows for rendering, which is (partly) why all the large render farms use Linux instead of Windows.
The most sensible hypothesis I've heard about it is that THP really helps with large rendering loads, and Windows doesn't do that yet.
From my point of view, from an alternative life in the game development subculture, I am quite sure the free beer weighs much more than a couple of hours.
I'm not entirely sure what that means, but as a person running Linux who has been giving early-access feedback to a few indie game devs, the consensus among them seems to be "you don't support Linux because you want to reach a larger market, you support Linux because you'll get good bug reports and basically free QA".
I'm not saying you're wrong, but these really aren't the same subspecies of indie devs that we're talking about here, hahaha. I'm talking about much smaller teams with a much smaller budget and audience. Like, teams of one to five people.
In my experience interacting with them, most of these small indie dev teams use Unity or Unreal or Godot these days. Outside of graphic design, the graphics don't really cause big issues any more. Having a handful of enthusiastic but often slightly inexperienced programmers figure out the flaws in their own game logic is.
I believe GP is referring to the 10% boost they observed, not a hypothetical 7% boost as in the article. (7×24×0.1 = 16.8, while a 7% speedup would result in ~12 hours.)
One thing I've been mulling over recently is that many containers, like vector in C++ for example, have almost no state.
That is to say, we at most might have a bit of logic to tune whether we do *1.5 or *2 on realloc, but why not more?
There must be patterns we can exploit in common use cases to be sneaky and do less malloc-ing. Profile guided? Runtime? I might have some results by Christmas, I have some ideas.
Food for thought: Your container has a few bytes of state to make decisions with, your branch predictor has a few megabytes these days.
The size increase factor for vector is a compromise between performance and wasting memory. It's also a fairly hot code path, so you don't want to run some complicated code there to estimate the 'optimal' factor.
About the best you can do, if you know beforehand roughly how big it will be, is to reserve that capacity with std::vector::reserve().
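A trivial illustration of that, for completeness - one reserve() up front replaces the whole grow-and-copy sequence:

    #include <cstddef>
    #include <vector>

    std::vector<int> squares(std::size_t n) {
        std::vector<int> out;
        out.reserve(n);  // one allocation instead of ~log2(n) reallocations + copies
        for (std::size_t i = 0; i < n; ++i)
            out.push_back(static_cast<int>(i * i));
        return out;
    }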
Malloc is at best a few hundred cycles, that's quite a lot of work if you can keep your data in L1. If you do a bit of work now you can save reallocs later.
Think bigger than changing the coefficient, there probably is no optimal factor.
std::vector would mostly benefit from an improved allocator interface, where it requests N bytes, but the allocator can give more than that and report the actual value.
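A rough sketch of the shape of that idea (an illustration only, not any real standard interface): the allocator reports how much it actually handed out, and the container adopts that as its capacity.

    // Illustration only, not a real standard interface: the allocator returns
    // both the pointer and the usable size it actually granted, so the vector
    // could treat the whole block as capacity instead of just what it asked for.
    #include <cstddef>
    #include <cstdlib>

    struct sized_block {
        void*       ptr;
        std::size_t usable_size;  // may be larger than the requested size
    };

    sized_block alloc_at_least(std::size_t requested) {
        // Real allocators round up to internal size classes; fake that here by
        // rounding up to the next 64-byte multiple.
        std::size_t granted = (requested + 63) / 64 * 64;
        return { std::malloc(granted), granted };
    }

(C++23's allocate_at_least goes in roughly this direction.)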
While some heuristics would be nice if they improve the situation, a lot of apps still leave performance on the table by not estimating the capacity well. You don't have to be very clever about growth if you know that this vector will always have 3 elements and that one will have N that you can estimate from data size.
I wrote a thesis on this in 2008 (Transparent large-page support for Itanium Linux, https://ts.data61.csiro.au/publications/theses_public/08/Wie...) and Matthew Wilcox was already involved in the area then. I admire his persistence, and have certainly not kept up with the state of the art. Itanium probably had the greatest ability to select page size of any architecture (?). On x86-64 you really only have 2MB or 4K to work with in a general-purpose situation. It was difficult to show the benefits of managing all the different page sizes and, as this notes, of re/writing everything to be aware of page sizes effectively. Those who had really big workloads that benefited from huge pinned mappings didn't really care that much either. It made the work hard to find traction at the time.
There is a series of page sizes that steps in increments of 5 bits per width. 8k/256k/8M therefore have regular join points in the address space, the first of which is at 8G.
Does superpage management get easier when each superpage is composed of only 32x of the next size down? When I first stumbled on this idea, it seemed like it would have many more opportunities for forming intermediate-sized superpages.
Linux still has a lot of assumptions baked into the page size. Power9 and some aarch64 systems have 16kB pages, but occasionally you run into some corner cases - for example, you can't mount a btrfs partition created on an x86 machine on a power9 one because the btrfs page size must be >= the MMU page size.
64K pages, actually. Also, POWER9 and aarch64 are perfectly capable of running with 4K pages, but not everything does. My desktop POWER9 with Fedora is a 64K page system, but Void PPC runs the same hardware with a 4K page.
While you may be technically correct, I guess from a practical point of view OP is correct in that page size still matters if you're trying to run Linux on "unconventional" platforms.
And AFAIK, to access the maximum amount of physical memory in AARCH64 (52-bit physical addresses, instead of 48 bit physical addresses), you must use 64k pages. Since RHEL is normally used on servers, it makes sense to want to be able to access huge amounts of physical memory; that's probably the true reason (or even the sole reason) RHEL uses 64k pages on AARCH64.
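For scale: 48-bit physical addressing covers 2^48 bytes = 256 TiB, while 52 bits covers 2^52 bytes = 4 PiB.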
It's not only RAM that goes in that 256 TB; memory-mapped I/O uses it too. You probably don't have TBs of video card RAM, but maybe some flash devices offer the whole thing memory-mapped?
Good points. Re the last one: you of course also save some by having fewer pages and their metadata. Wonder what the space-optimal page size would be taking these opposing factors into account.
I bet this would also screw up masses of user code making assumptions about the optimal size, or simply designed/optimized for a smaller size. LMDB was the first thing that came to mind.
Indeed, the M1/A14 on mobile have larger pages, which lets them get more effective TLB coverage with a smaller cache. In some applications this can boost performance by double-digit percentages (which you can simulate by enabling large pages on x86).
You can also do large pages, 2MB or 1GB or whatever the obscenely large page size is for 5-level paging on latest systems.
2MB vs 4kB isn't quite the same ratio as 4MB -> 32GB, but it's still a lot fewer pages to cache in the TLB, and it's not too big to manage when you need to copy on write or swap out (or compress with zram) and whatever else needs to be done at the page level.
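For the ratios: 2 MB / 4 kB = 512, so each TLB entry covers 512 times as much memory; a 1 GB page covers 262,144 times as much as a 4 kB one.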
The other day I was wondering what would happen if all operating systems stopped developing features and only optimised for a week or two. How much time and electricity could be saved?
If you add up an optimisation of just a nanosecond in like openSSH, how much would that do globally?
I used to be a kernel developer at Apple starting in 2006. Internally, every alternate major release was exactly this. All common paths were identified, and most of the dev time on the release was spent on optimizing those features to hit a goal for each path. E.g. moving 100 files in Finder should not take more than x ms.
The upgrade from Leopard to Snow Leopard on a plastic MacBook just made everything better. Things were faster, smoother, and you could run more things at the same time without killing the machine. It was the perfect OS. Then when Lion came around, it was the exact opposite, it felt terribly buggy, and made everything clunky and worse. At least that's my memory of these OS updates.
Makes me wonder whether this alternate major release cycle was a good idea. If you delay all feature development for a year, you'll get a barrage of features once the performance-only OS version is out the door, and there's not enough time to do all of them properly, so you get buggy and slow versions.
Maybe doing performance improvements and feature development at the same time would have been the better choice? How is it being done at Apple nowadays?
> If you add up an optimization of just a nanosecond in like openSSH, how much would that do globally?
I believe optimizations like that will not have any impact at a global scale.
Let's say that this nanosecond is saved trillions of times a day, resulting in minutes to an hour a day saved globally.
* Not a single user will notice.
* In 99.99% of cases the CPU will not be fully pegged and thus that one nanosecond of compute will not be used to do something else at all.
* CPU throttling isn't that fast, so you won't even save that much power.
If we bump it up by 6 orders of magnitude to a millisecond that all remains true. Even though you are potentially saving 100s of years of computing time a day. Extremely small gains distributed across very large number of machines don't tend to be as impactful as you would hope on a global scale.
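To put rough numbers on that: one nanosecond saved 10^12 times is 10^12 × 10^-9 s = 1,000 s, about 17 minutes of aggregate compute per day; one millisecond at the same rate is 10^9 s, about 32 years of compute per day.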
This is not to say that small gains are worthless. Many small gains added together can be substantial.
If you make a bicycle one second faster over 40 kilometers, the user does not notice. But the user does win the race by 1 second instead of losing it. That is, things can be of enormous utility even when the user doesn't notice.
It's not the throughput, but the fact that when the CPU is idle it is sleeping, saving power.
This is true on mobile, not for servers, as internally they'll poll their Ethernet PHYs.
Very optimized projects are out-competed by well-factored but less optimized projects, because the latter ones can add features faster. That's why we are where we are today.
In the new technology sector this is certainly true. Find your product-market fit and then optimize when you become mature and scale up.
But aren't we talking about extremely mature kernel code here? My impression is that all kernels in widely used distros are optimized, but they are general-use software. The degree to which you may optimize software is constrained by the diversity of use cases you must support.
As I understood it, pure means a function's output is dependent only on its input:
y = f(x); z = f(x) implies y = z
What they want is something different:
y = f(x) implies y = f(y)
This means if you give something the head of a list of pages, it won't try to go to the head again and again; it knows it's already there.
The 'folio' idea as I understand it is roughly an alias for the existing 'page' structure, but code knows it is already at a head AND it should do the work on the whole list, not only on the head.
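A rough sketch of that distinction in deliberately simplified C++ (illustrative only, not the kernel's actual C definitions):

    // Illustrative only. compound_head() is idempotent: calling it on a head
    // page just returns the same page. A folio is a type that can only ever
    // refer to a head page, so code holding a folio never needs to re-derive
    // the head - and it operates on the whole compound page at once.
    struct page {
        page* head;  // null for a head page, else points at the head (simplified)
    };

    page* compound_head(page* p) {
        return p->head ? p->head : p;  // compound_head(compound_head(p)) == compound_head(p)
    }

    struct folio {
        page* first;  // invariant: always a head page
    };

    folio page_folio(page* p) {
        return folio{ compound_head(p) };
    }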
[0] https://lkml.org/lkml/2021/1/11/98
[1] https://lore.kernel.org/lkml/20210608115254.11930-1-sj38.par...
[2] https://lkml.org/lkml/2021/4/27/1208