This seems like a rather important point that's glossed over. Typical code is rarely as optimized and meticulously written as the code benchmarked here. It would be nice to see how much compilers have improved on that kind of code.
Harder to benchmark.
And typically much harder to care about, as in: those applications are not stuck in algorithms that benefit from compiler optimisation.
They're waiting for the database. Or the user. Or the graphics library.
Unfortunately it might be hard to get the same program to compile with both compilers without a bit of work.
I agree though, it would be great to see the same experiment on other code bases.
The choice of WSL2 as the platform introduces a few confounders, especially filesystem performance, which might distort the build-time differences in particular. For a better understanding of what's going on, a breakdown of where the time is spent, or repeating the benchmarks on other platforms, would be a good idea.
Most of your users would not be able to use the software otherwise, which is not a small problem.
Since the author seems to care about the latter, I assume they did not use those flags.
Clang 10 was able to automatically vectorize the code, so the resulting binary runs more than 2x as fast as the GCC 8.3 one. To be fair to GCC, I'm using my distro's GCC, but I built a newer Clang for C++ coroutine support.
As for the article, I'm surprised Cache and Meshlets are 5% slower in 11 than in 2.7. Some insight into what caused that regression would be valuable.
I looked at the disassembly with objdump. I tend to build with both Clang and GCC regularly; for some reason I like comparing them. Since I'm sending many rays and bounces, a 50% reduction in time is very noticeable, so I looked at the generated code. I mentioned the GCC version because it is slightly unfair to compare a very new Clang to a GCC from a few years back. The GCC output has some vectorization as well, but Clang seems to generate smaller code with more vectorization. It would be interesting to compare them side by side on godbolt, but I'd have to cut and paste a bunch of files to do so, and it's not a priority at the moment.
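If anyone wants to reproduce this kind of comparison without wiring files into godbolt, both compilers can report their vectorization decisions directly. A minimal sketch, where scale() is a made-up stand-in for the sort of hot loop I mean, not my actual tracer code:

    // vec_demo.cpp -- illustrative hot loop, simple enough that both
    // compilers should vectorize it, which makes the reports easy to compare.
    #include <cstddef>

    void scale(float* __restrict out, const float* __restrict in,
               float k, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            out[i] = in[i] * k;
    }

Building with clang++ -O3 -march=native -Rpass=loop-vectorize -c vec_demo.cpp and g++ -O3 -march=native -fopt-info-vec -c vec_demo.cpp prints which loops each compiler vectorized, and objdump -d on the object files shows the packed instructions (e.g. vmulps) to grep for.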
Maybe I should have responded to another comment here. The intention of my previous comment was to bolster the idea that more typical, naive, less-optimized code might benefit more than already-optimized code like that in the article. 3D math in general is obviously a domain that can benefit from vectorization more than most.
Another fun find was that sharing the PRNG state among threads destroyed performance. I have other, higher-priority side projects, so I haven't had a chance to investigate why yet. It could be something like the cache line bouncing between cores (I wouldn't be surprised if the PRNG were the hottest code in the whole program), or a cascading effect on the generated code. A lot of my code in the ray-tracing hot path is visible to the compiler, so it's also possible the sharing broke inlining or some other compiler optimization.
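If it is cache-line bouncing, it's easy to test in isolation. A minimal sketch with a toy xorshift64 generator; all the names and constants are illustrative guesses, not code from my tracer. Giving each thread its own cache-line-aligned state is the fix being tested:

    // prng_sharing.cpp -- compile with: -O2 -std=c++17 -pthread
    #include <cstdint>
    #include <thread>
    #include <vector>

    // Toy xorshift64 PRNG; the state must be seeded nonzero.
    struct Rng {
        std::uint64_t s;
        std::uint64_t next() { s ^= s << 13; s ^= s >> 7; s ^= s << 17; return s; }
    };

    // Pad each thread's state out to its own 64-byte cache line. With a
    // plain Rng rngs[4] instead, neighboring states share a line and every
    // next() invalidates it in the other cores' caches (false sharing).
    struct alignas(64) PaddedRng { Rng rng; };

    int main() {
        const int nthreads = 4;
        std::vector<PaddedRng> rngs(nthreads);
        for (int i = 0; i < nthreads; ++i) rngs[i].rng.s = i + 1;

        std::vector<std::thread> threads;
        for (int i = 0; i < nthreads; ++i)
            threads.emplace_back([&rngs, i] {
                volatile std::uint64_t sink = 0;  // keep the loop alive
                for (long n = 0; n < 100000000; ++n)
                    sink = rngs[i].rng.next();
            });
        for (auto& t : threads) t.join();
    }

Timing this against a version with the alignas(64) removed should show a large gap if false sharing is the culprit; if the gap is small, the problem is more likely in what the sharing did to the generated code.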
As an extreme example, I imagine dynamic languages are hard to optimize because the compiler can make few assumptions about the code.
(Have little knowledge of compilers so correct me if I'm wrong.)
A compiler can do very little for dynamic languages ahead of time. It could try to apply high-level optimizations, but as you say, they are few and far between, and hard. A just-in-time compiler that optimizes hot paths at runtime is usually the way to go. Unfortunately, JITs are quite a bit more complex, and most dynamic languages did not have one for a long time.
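To make the "few assumptions" point concrete, here is a rough, purely illustrative C++ analogue of what a naive interpreter has to do for a single a + b; none of this is taken from any real language implementation:

    #include <stdexcept>
    #include <string>

    // A dynamically typed value: the static type tells the compiler nothing.
    struct Value {
        enum class Tag { Int, Dbl, Str } tag;
        long i; double d; std::string s;
    };

    // Every '+' must branch on runtime types, so the compiler can't
    // constant-fold, vectorize, or keep values in registers across
    // operations.
    Value add(const Value& a, const Value& b) {
        if (a.tag == Value::Tag::Int && b.tag == Value::Tag::Int)
            return {Value::Tag::Int, a.i + b.i, 0.0, {}};
        if (a.tag == Value::Tag::Dbl && b.tag == Value::Tag::Dbl)
            return {Value::Tag::Dbl, 0, a.d + b.d, {}};
        if (a.tag == Value::Tag::Str && b.tag == Value::Tag::Str)
            return {Value::Tag::Str, 0, 0.0, a.s + b.s};
        throw std::runtime_error("type error");
    }

A JIT that observes both operands are always Ints on a hot path can emit just the integer branch behind a single guard, and deoptimize if the guard ever fails; that runtime observation is exactly what an ahead-of-time compiler doesn't have.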
Perhaps someone could compare FORTRAN compilers to get a longer term view.
I doubt anything in software performance comes within many orders of magnitude of that.
DJB is Daniel J. Bernstein
The author's blog has been consistently great throughout the years, https://easyperf.net/notes/
See also microarchitectural performance analysis tools & readings, https://github.com/MattPD/cpplinks/blob/master/performance.t... and "Comments on timing short code sections on Intel processors", http://sites.utexas.edu/jdm4372/2018/07/23/comments-on-timin...
But that does bring up the point: would we be better served by just writing lower-level code when needed and turning off all these optimizations for faster compile times?