Hacker News
How does Clang 2.7 hold up in 2021? (gist.github.com)
163 points by tbodt 71 days ago | 32 comments

> It is possible in theory that code that's less carefully optimized exhibits different behavior, or that the benchmarks chosen here are simply not as amenable to compiler optimization as they could be

This seems like a rather important point that's glossed over. Typical code is rarely this carefully optimized and meticulously written. It would be nice to see how much compilers have improved there.

It's not glossed over. It's mentioned multiple times in the article.

Yeah, I would suspect that code that has "algorithms" is already pretty well optimized before it hits the compiler; the place where gains would be seen is "enterprise" or business code, or something like OpenOffice.

Harder to benchmark.

> harder to benchmark

And typically much harder to care about, as in: those applications are not stuck in algorithms that benefit from compiler optimisation.

They're waiting for the database. Or the user. Or on the graphics library.

Typically those programs are waiting for data to arrive from RAM because they have terrible data locality and chase pointers to do everything.

It takes on the order of 100 nanoseconds to fetch something cold from RAM. That's slow if you do it in a tight loop, but it's absolutely nothing in a GUI application.

Yeah, I agree. It would be really interesting to see the performance difference on a larger program: Firefox, the Linux kernel, Postgres, or maybe Clang itself.

Unfortunately it might be hard to get the same program to compile with both compilers without a bit of work.

Isn't LLVM 11 able to build old LLVM 2.7-era software?

It should! I'd love to read an analysis if anyone is interested in running this test and publishing a followup.

I haven't done it in years, but with some hacking early versions of clang could totally compile Blender, and that's quite amenable to benchmarking (maybe too much?).

Blender is probably already pretty well optimized through SIMD and specialized high performance math libraries and whatnot. I think it would be more interesting to see the difference with "general purpose code".

More work for the optimizer (because the code hasn't been manually "pre-optimized") most likely means longer compile times too, so the trade-off of "twice the compile time for a 15% speedup" might not improve much, and the optimizer might spend a lot of time on code that doesn't actually need optimizing (because it's not "hot" code).

I agree though, it would be great to see the same experiment on other code bases.

I'm not all that surprised by the small improvement on regular C++ code: the last decade hasn't seen radical changes in how such code is compiled; compiler innovation has been elsewhere, with only the SIMD story showing up in this article. I was surprised by the lousy build times, though.

The choice of WSL2 as platform introduces a few confounders, especially filesystem performance, which might distort the differences between build times in particular. If someone wants to get a better understanding of what's going on, maybe a breakdown of where the time is spent or performing the benchmarks on other platforms would be a good idea.

It's not clear that the author of that post used `-march=native -mtune=native`. And if they didn't, that could account for the odd results.

In practice, you can almost never use that for desktop and tablet software.

Most of your users would not be able to use the software otherwise, which is not a small problem.

If they did, the article would really need a distinction between the speedup from new hardware features (which the old compiler cannot know about) and hardware-independent smarter optimizations.

Since the author seems to care about the latter, I assume they did not use those flags.

Unless they ran it on hardware from Clang 2.7's era.

I have a simple C++ raytracer I wrote by going through Ray Tracing in One Weekend. I have not even made an attempt to optimize it. I really only made it parallel by splitting it up into tiles.

Clang 10 was able to automatically vectorize the code, so the result runs more than 2x as fast as GCC 8.3's output. To be fair to GCC, I'm using my distro's GCC, but I built a newer Clang for C++ coroutine support.

Are you sure? Modern clang and gcc both have auto-vectorizers. clang's is enabled by default.[1] gcc requires '-ftree-vectorize'[2]. For my use case, I've seen the most improvements with clang + openmp + polly, requiring code changes along with hinting. Good news if your analysis is correct.

As far as the article, I'm surprised Cache and Meshlets are 5% slower in 11 than 2.7. Some insight could be gained as to what caused this regression.

[1] https://llvm.org/docs/Vectorizers.html

[2] https://gcc.gnu.org/projects/tree-ssa/vectorization.html

Am I sure about what? If it is auto-vectorizing? Yes. If the performance difference at O2 for both compilers is that dramatic? Yes. If the vectorization is the ultimate difference in the performance? No, not really.

I looked at the disassembly with objdump. I tend to build with both clang and GCC regularly, for some reason I like comparing them. Since I'm sending many rays and bounces, a 50% reduction in time is very noticeable, so I looked at the generated code. I mentioned the GCC version because it is slightly unfair to compare a very new clang to GCC from a few years back. The GCC output has some vectorization as well, but the clang output seems to generate smaller code with more vectorization. It would be interesting to compare it side-by-side on godbolt, but I'd have to cut-and-paste a bunch of files to do so, and it's not a priority at the moment.

Maybe I should have responded to another comment here. The intention of my previous comment was to bolster the idea that more typical naive and less-optimized code might benefit more than already-optimized code like in the article. 3d math in general is obviously a domain that can benefit from vectorization more than most.

Another fun find, was that sharing the PRNG state among threads destroyed performance. I have other higher priority side-projects, so I haven't had a chance to investigate why yet. Whether it was something like the cache-line bouncing between cores (I wouldn't be surprised if the PRNG was the hottest code in the whole program), or a cascading effect on the generated code. A lot of my code is visible to the compiler for the ray tracing hot path, so it's also possible it broke inlining or some other compiler optimizations.

Are bigger optimizations to be had in the design of higher level languages that are easier for compilers to optimize?

As an extreme example, I imagine dynamic languages are hard to optimize because the compiler can make few assumptions about the code.

(Have little knowledge of compilers so correct me if I'm wrong.)

Higher level languages often use complex data structures at runtime that are a maze of pointers and thus suffer from bad cache locality. Such languages benefit the most from language-specific high-level optimizations. Haskell for example uses strictness analysis to eliminate pointless lazy evaluation, and loop fusion to combine calls to `filter`, `map` and friends, thereby avoiding building up intermediary lists.

A compiler can do very little for dynamic languages. It could try to apply high-level optimizations, but as you say they are few and far between, and hard. A just-in-time compiler that optimizes hot paths at runtime is usually the way to go. Unfortunately, JITs are quite a bit more complex, and most dynamic languages did not have one for a long time.

I would expect Proebsting's law to hit a wall faster than Moore's law, simply because software performance is better understood than physics.

Perhaps someone could compare FORTRAN compilers to get a longer term view.

Both general relativity and quantum field theories make predictions that match experiments to around 12 digits of accuracy.

I doubt anything in software performance comes within many orders of magnitude of that.

Accurate predictions are not enough to keep Moore's law going.

> This takes me back to "The death of optimizing compilers" by David J. Bernstein

DJB is Daniel J. Bernstein

Does anyone have a good resource/book on how to do close to the metal benchmarking?

I can definitely recommend https://book.easyperf.net/perf_book

The author's blog has been consistently great throughout the years, https://easyperf.net/notes/

See also microarchitectural performance analysis tools & readings, https://github.com/MattPD/cpplinks/blob/master/performance.t... and "Comments on timing short code sections on Intel processors", http://sites.utexas.edu/jdm4372/2018/07/23/comments-on-timin...

Intel's VTune manual and Agner Fog's optimization manuals cover this (and much more).

Wow, 10 years for only 15%.

To be fair, even the less optimized areas of meshoptimizer are very low-level code; probably not much optimization to be done. I've seen this in other domains too: I have some graphics/art code that is very low-level C#, and moving from .NET 4.6 to the latest .NET Core, which brings huge performance gains in normal enterprise code, does nothing for it. Which makes sense: it's all carefully thought out loops over arrays, not much to be done.

But that does bring up the point: would we be better served by just writing lower-level code when needed, and turning off all these optimizations for faster compile times?

A better compiler means fewer spots need hand-optimizing. Machine time is cheap compared to human labor. I suspect the break-even comes very quickly in favor of slower but better compilers.
