Firefox 64 built with GCC and Clang (hubicka.blogspot.com)
188 points by Twirrim 36 days ago | 63 comments

I realize it's not the focus of his test, but as someone who thinks often about how to take advantage of advanced vectorization techniques on modern processors, I was surprised by statements like this:

Moreover GCC -O2 defaults are (in my opinion unfortunately) still not enabling vectorization and unrolling which may have noticeable effects on benchmarks.

This led to enabling AVX and since the global constructor now gets some code auto-vectorized the binary crashed on invalid instruction during the build (my testing machine has no AVX).

No AVX? He wants to better take advantage of vectorization, but he's doing the testing on a processor that is 3 generations behind in vectorization support. AVX (128-bit) came out in 2011 and has been followed by AVX2 (256-bit) and the (still limited-release) AVX-512.

Clock speeds have been fairly flat, and most of the improvements to recent processors have been microarchitectural. A lot of the optimization done by compilers ends up being architecture specific. Seeing which brand-new compiler best targets old hardware seems like it might produce misleading results.

I realize that not everyone has (or can have) the most recent hardware, but this seems like a case where it would be strongly in AMD and Intel's interest to make sure that people like Jan have better access to the improvements made in the last few years.

Intel still disables AVX instructions on their low-end Core architecture chips for market segmentation purposes. It is also entirely absent to begin with from their Atom and Celeron chips. AMD did not have AVX support until Ryzen, but they are still selling Piledriver based CPUs on their AM4 platform.

Firefox can't blindly use AVX without checking for its presence or it will crash on these types of systems.

AMD Bulldozer and later architectures support AVX; what Ryzen adds is AVX2 instructions.

You're right, and this is a little confusing. The article says he's using an "AMD Opteron 6272", which seems like it should support AVX: http://www.cpu-world.com/CPUs/Bulldozer/AMD-Opteron%206272%2.... So maybe the GCC bug he encountered is actually because he lacks AVX2 support? Or an incompatibility between early AMD and Intel support for AVX?

That doesn't explain GCC's side of the story, though.

EDIT: I should have clarified, the way the author of the article phrased it at first, I got the impression Clang did better auto-vectorization with Firefox code than GCC.

The author is trying to figure out why GCC underperforms Clang for Firefox builds. It seems inappropriate to worry about cutting-edge features that are probably available to only 10% or so of computers that run Firefox (or a competing browser).

Are you suggesting that 10% is a reasonable estimate of the percentage of Firefox instances running on hardware that supports AVX? I'm not sure of the real numbers but this strikes me as an extremely low estimate.

The closest I can find is Valve's hardware survey, which says that 87% of users are running on computers that support AVX. https://store.steampowered.com/hwsurvey (click on "Other Settings" on the bottom to expand).

Firefox is presumably lower than this, although I don't know by how much. Does Mozilla collect statistics on the capabilities of the computers that Firefox is being run on?

You can look at https://data.firefox.com/dashboard/hardware for similar numbers. CPU models are not broken out, but you can get a sense of what sort of CPUs are in use by looking at the GPU model: Intel powers 60%+ of the sampled population.

Well, the author also wrote:

> Moreover GCC -O2 defaults are (in my opinion unfortunately) still not enabling vectorization and unrolling which may have noticeable effects on benchmarks.

... which makes it sound like Clang does a better job at this with Firefox code.

There are several independent things:

- first, how you set -O2 defaults in your compiler. This is a delicate problem since you need to find the right balance of code size, compile time, robustness of generated code (do not trigger undefined effects in super evil ways) and of course runtime. In benchmarks I have found that Clang has a bit of an edge for runtime, which is mostly vectorization (on x86-64)

- selection of the minimal ISA you support. For GCC x86-64 this is still the original Opteron, but distributions can easily (and some do) decide on better. Indeed AVX is a big win, but for a general-purpose distribution this is still too aggressive. You can provide AVX-optimized libraries where it matters

- selection of CPU tuning (i.e. generic/intel)

So I consider it a mistake that GCC traded away vectorization for compile-time speed + reliability at -O2, because it can make an important difference in common workloads these days (not 10 years ago, say).

It is also clearly a bug for GCC to produce AVX instructions when not explicitly asked for :)

I also do testing on Zen, Core and some PowerPC. For the Firefox machine I use the Bulldozer box because I don't mind that it spends long nights running builds & benchmarks, and I think this particular problem is not very CPU specific.

Firefox still treats SSE2 as its SIMD assumed-present baseline. Does GCC introduce multiversioned AVX functions on its own initiative?

> Does GCC introduce multiversioned AVX functions on its own initiative?

Yes, this seems like one reasonable approach. The current approach of compiling to a "least common denominator" and then updating this denominator every decade or so seems insufficient. Instead of interpreting the absence of a "-march=" flag to mean optimize for nothing, maybe it can mean that multiple optimizations are automatically compiled and the appropriate one is selected at runtime. Alternatively, maybe we need to move away from the idea of compiling a single binary that runs on all platforms, and encourage greater use of platform specific compilations.

You need to explicitly ask for it via an attribute; no automatic multiversioning is done (yet), and it would be more for -Ofast than usual -O2 builds, I guess.

Performance improvements have diminishing returns though, and for a consumer product like Firefox, a large number of users will be on older hardware. Ceteris paribus, it would be much better to get say a 10% improvement for the users on below-median hardware than for the users on above-median hardware since it will matter a good deal more for the former group.

AVX is 256-bit (float×8 and double×4 vectors), AVX2 is the same size but adds integer operations (and many other things).

You're right, sorry. I concentrate on integer operations, which are mostly restricted to 128-bit on AVX, and had forgotten that it also supported 256-bit floating point. I'm not sure which (if either) would be most relevant to a web browser. I hope this error doesn't distract too much from my overall point.

AVX does not add anything to integer operations beyond SSE. 128-bit integer support is SSE2, which is part of amd64.

Well, AVX does explicitly add 256-bit integer loads and stores: https://software.intel.com/sites/landingpage/IntrinsicsGuide....

And some of the Boolean operations can be quite useful on integers. So even for integers, I was definitely wrong to characterize AVX as 128-bit.

Also AVX has shuffles which are interesting for integer code.

I'm surprised that GCC doesn't add some autodetection mechanism for this.

This leads to issues like you have on Mac: when using Safari, the power consumption is super low. With Chrome you don't get nearly the battery runtime (when playing video or listening to music).

> I'm surprised that GCC doesn't add some autodetection mechanism for this.

As 'hubicka' mentions in another comment, GCC does have "multi-versioning" capability, but it doesn't use it by default. Instead, one needs to mark individual functions with GCC-specific attributes, asking for versions with different capabilities to be created: https://lwn.net/Articles/691932/. This isn't necessarily a bad approach, but the fact that it seems to be used rarely makes me wonder if some other, more default approach that works with unmodified code might be an improvement.

I'm late to reading this article, but @hubicka already updated the article with Skylake benchmarks. GCC's lead is similar, if not higher.

It's hard to count the number of major improvements that landed in GCC since the inception of clang. Competition in this landscape is benefiting everyone.

I like the rising popularity of LTO the most.

I'm not loving the Firefox move toward clang. For years we've been told that clang is great because we finally have a competitor for gcc and that multiple interoperable compilers can only improve the ecosystem (which is undeniably true).

Now we have a big project deciding to move from a reasonably portable gcc build to a clang-specific LTO framework that required significant engineering effort to achieve and which apparently isn't easily portable to the equivalent gcc effort, requiring a gcc maintainer to jump in on their behalf to show equivalence.

How is this not moving backwards?

The advantage of having multiple C compilers has always been broader platform support and competition to improve compiler speed, error messages, and codegen quality. Conspicuously, one complete non-advantage of having multiple C compilers is portability of a codebase between compilers; the C specification contains too much implementation-defined behavior (including undefined behavior) and is too anemic (requiring compilers to come up with nonstandard extensions to support things like assembly and SIMD) for compiler-portability to be anything other than a nightmare for large projects. C projects even have a hard time upgrading to new versions of the same compiler, which helps to explain why so many shops are still using positively ancient versions of GCC.

I had the good fortune to work on a code base that started as a blank emacs buffer and was compiled with both g++ and clang++ from the get go (and built on multiple platforms, with maximal warnings enabled). The two compilers surfaced different bugs in our codebase which, in addition to finding actual (semantic) bugs has been quite valuable for portability and maintainability.

(Also, as others have noted, the existence of clang has really been good for gcc).

We saw significant performance gains when moving from GCC (6) to clang (6). I don't think it'd be particularly hard to switch back at this point; this article provides some solid data for doing so.

Yeah, but to be fair the work to actually enable LTO was very significant (at least as far as we outside the community could see via stuff like the blog post here) and involved a ton of toolchain-specific hackery and work with the clang upstream.

Given that same level of effort (c.f. the article we're discussing) it seems like you could have done as well or better by moving to a more recent gcc instead. Or better, by working with both at coming up with a portable way to get LTO working.

I'm not really concerned with what you use to build (I mean, you have to pick some compiler at the end of the day), just with what seems to be "needlessly tight coupling" between clang/llvm and Firefox in a way that hurts the interoperable toolchain ecosystem.

What are you referring to by "a ton of toolchain-specific hackery" and "a portable way to get LTO working"? It seems like there are very specific things you have in mind, but I'm unclear what bits of work you're referencing. Unless you're thinking of the cross-language LTO work, which is still in progress and is of course clang/llvm-specific? I'd love to see that feature work with GCC, but it's simply not feasible at the present time.

Regardless, that feature being enabled when you're using suitable versions of clang/llvm/rustc doesn't preclude using LTO with other compilers.

I would be happy to help with solving GCC-related issues and look into performance regressions relative to Clang (I am still in the process of looking into -O2 performance and plan to set up Talos next)

And even worse, it's generating worse and ridiculously bigger code. So they spend a lot of effort for worse result.

And apparently they don't catch serious problems in their build system on their own (profile feedback pass timing out and truncating data)

Overall not a great result at all for Mozilla. Where is the regression testing for performance? Are they not data driven?

"And even worse, it's generating worse and ridiculously bigger code. So they spend a lot of effort for worse result."

GCC aggressively size-optimizes cold regions, and LLVM doesn't bother. This would be pretty easy to fix, but outside of binary size one would need to prove it actually matters (the test harness here is a pretty darn old CPU).

The "don't catch serious problems in build system" is the bigger worry, imho.

I have re-tested on my Skylake notebook and updated the blog. It confirms the results from the darn old CPU I use as my benchmark machine. Maybe it is a bit more sensitive to the difference, which is expected for a non-server CPU.

GCC does "almost full LTO" with partitioning, while Clang does ThinLTO, which makes most code size/speed tradeoffs without considering whole-program context, so it may be interesting to get both alternatives closer in code size/performance metrics.

I have got a level-1 Firefox developer account and am looking into the official benchmarking infrastructure, which I have now updated to GCC 8 with LTO+PGO.

Regression testing for performance is checked via several test suites (Talos, for instance, with numerous subtests: https://wiki.mozilla.org/Performance_sheriffing/Talos/Tests) and monitored by our ever-dutiful performance sheriffs (https://wiki.mozilla.org/Performance_sheriffing).

It'll be interesting to see the numbers again after LLVM thinLTO starts applying across C++ and Rust resulting in cross-language inlining.

I don't know what you are saying here. Can you elaborate?

A goal that is being worked towards is making LLVM thinLTO not just consider clang output but to consider clang-generated LLVM IR together with rustc-generated LLVM IR. This is expected to lead to inlining between C++ and Rust making the FFI layer of C-linkage function calls melt away between C++ and Rust.

This seems to be the bug that's tracking this in rust:


One of the things that disturbed me from the article was how the Mozilla build chain just merrily ignores that profiling had failed, and moves on building stuff using that profile. That seems like quite dangerous behaviour. Surely that should be a failing step for a build, or at the very least a large warning should go out at the end "This build may be optimised based on complete nonsense because profiling failed"

This is all way way WAY too complicated.

It's like a fractal Rube Goldberg machine made of Rube Goldberg machines.

All this to render web pages. I think we must have made a wrong turn somewhere.

> All this to render web pages. I think we must have made a wrong turn somewhere.

We've taken plenty of wrong turns, but none, I think, accounted for more than a rounding error in time or code needed to render web pages. Writing a browser is hard.

Hell, even writing a toy browser-like mockup isn't easy. I built an extremely bad renderer for an extremely simple class-provided XML-ish grammar in school. It only supported a handful of styling keywords (all inline/attribute-based), only one of which was positioning-related ("wrap to next fixed-height global line of display after this element").

It was really hard. Like, really hard. Even looking back on the code with the benefit of experience, it still would not be a breeze.

It supported a single fixed window size and a guaranteed-correct input file. Removing either of those constraints would have exploded the code size to the point I doubt I could have done it alone then, and if I could now it would take me an incredibly long time. Adding the full HTML spec would probably bring its SLoC counts into the 100ks, if not millions. Supporting re-renders and after-the-fact DOM updates would blow it far beyond that, making them fast might require me to go back to school, but who knows; maybe it's easier than my hunch. I suppose I could shave time by moving some of those hundreds of thousands of lines into the libraries which evolved during the many years since browsers became popular, but it would still be a gargantuan undertaking.

And all of that is before the immense amount of person-hours which would be needed for:

- Supporting cascading styling of any kind, with or without embedding another language.

- Adding networking, even if interoperability/an agreed-upon communication pattern or protocol already existed.

- Displaying assets other than styled text and SVG-esque drawings (images, videos, etc).

- Securing the request/response protocol, even if leveraging existing tools like OpenSSL to the max.

- Adding another turing-complete and secure programming language for communicating with random local/networked resources and producing more requests or DOM updates.

It's hard.

TL;DR There are plenty of needlessly-complex tools and technologies out there. But I don't think web browsers are some of them. Even if you're anti-JS and anti-CSS, there is still an absolute shitload of complex, careful, hard-to-get-right interactions going on under the hood.

Is all that really needful? To draw documents? To make apps?

I don't think it is. I think Elm lang (for example) proves that it's not.

I finally got around to trying Elm. Once I got over the way it feels like a toy compared to the HTML/CSS/JS/etc world, I realized I could never ever justify NOT using it in a business context.

What I mean is, the business-value case for using the normal front-end stack vs. Elm just isn't there.

That's just an example for the domain you described.

The VPRI STEPS project demonstrated that we could reduce our codebase(s) by orders of magnitude while retaining or even improving functionality, "from the desktop to the metal".

I'm sure Elm is lovely, but how is it useful without a browser to deliver it, a browser to provide it a document to manipulate, and a browser to display its changes to that document?

It's those things that are complex; the client-side programming language (if present) is just one of many, many high-complexity parts in a browser.

Bless you! I was hoping someone would ask me that.

TL;DR: Write an Elm to native app compiler/interpreter. Servers serve Elm code.

(As an aside, The loveliness of Elm is incidental to the point. If it looked like COBOL it would still make economic sense. Lots of people have developed DSLs for apps, the important thing about Elm is that it's a very elegant and well-thought-out domain-specific system for specifying apps. Elm is much less complex than HTML+CSS+JS+Frameworks/libs/NPM etc.)

At the moment, the delivery vehicle for Elm-specified apps is the Fractal Rube Goldberg Machine, yes.

But consider e.g. an Elm-to-GTK compiler, or Elm-to-TCL/Tk interpreter, whatever... The FRBM is just a reasonable first target platform.

I don't think I'm wrong here, or even saying anything controversial. Go look at what VPRI did with STEPS. Our code volume and complexity is too high by two or three orders of magnitude.

I don't see why I wouldn't just use Haskell and a native UI library right now, to similar effect, instead of waiting for all this to appear. The language is in a much more stable state than Elm, which already makes it more ideal in a business-context.

Cheers! You're making my point: the "FRBM" isn't needful.

I'm not making your point for you, I just don't agree with you on what the actual problem is.

Programmers don't disagree that we could be using better approaches. The question is not why they don't exist, because they do exist. The question is why we don't or can't use them currently.

Most of the time, the reason is purely cultural, either due to management or legacy. I'd love to use Purescript and Haskell at my job, but I cannot. I don't get to make that choice. A new Elm transpiler won't solve a cultural problem.

I wouldn't run testing on a notebook. I mean, you can if boost characteristics and other such variables are part of what you are testing... But your best bet for consistent, low-variance testing is a machine where you have set a static core clock speed and have disabled C-states and other power-save features. Remember, you are testing differences in compile optimization. You don't want your system being a variable.

The 48% difference in code size is surprising. But after all, who cares in this world of Electron etc.

If it affects cache behaviour, we should all care.

Whether that is true or not depends on a lot of factors.

They are mostly unrelated to overall binary size due to paging, etc.

You also won't easily predict the behavior due to reordering.

Smaller binary enables use of a lower-level cache, no? [0]

My understanding is that profile-guided optimisation is largely based on the utility of small binaries, by optimising hotspots for speed and everything else for space, thereby alleviating cache-pressure. Is this wrong?

> You also won't easily predict the behavior due to reordering.

I wasn't thinking of anything as sophisticated as looking at specific flows, where I can well imagine things get unpredictable with reordering and speculative execution. Won't there will be a reliable pattern of better fitting in cache, if we shrink everything?

[0] https://lwn.net/Articles/534735/

> Smaller binary enables use of a lower-level cache, no? [0]

No. It would if all of the stuff were actually in memory at once, and pulled in the stuff next to it.

I.e., you couldn't pull in function A without pulling in function B. That is mostly not true [1].

This is why reordering mostly brings load time benefits instead of run time benefits.

The utility of PGO is mostly about knowing where to spend your time optimizing, and knowing what to do. That's a generalization. There are certainly cases in inline/etc.-heavy code where it helps get the speed part right too. A lot of that is more often about "it lets the compiler spend its inlining budget on inlining stuff that matters" than "it stops the compiler from blowing cache out".

I speak in generalities because there are always counterexamples.

There are cases where PGO makes things significantly worse, for example!

Last I remember (My job now means i don't have time to stay in the game), LLVM did not bother to optimize the cold regions for size, and GCC did.

[1] It depends on function sizes and page sizes and mlocking and section flags and all sorts of fun things, but I'm just going to assert the truth of this in most cases to make it simpler.

I guess that LLVM is fairly cache aware so I think the performance probably isn't affected too greatly (I hope).

I would be interested to know what cache-aware code layout optimizations are available in LLVM. I personally know of none. GCC is a bit simplistic in this sense (it does reorder functions based on profile feedback and execution time) and I plan to change that for the next stage 1 (i.e. GCC 10).

Hey Jan, Long time ;)

LLVM will do the same kind of reordering.

(Both are interestingly well behind what commercial compilers do, and this is one of the very few areas where that is true. My suspicion is that it does not matter as much in practice as we want it to. Most forms of layout optimization are also very hard to perform on the C++ code you want to optimize due to inability to prove safety)

Hehe, nice to see you :)

Yep, I have had a code layout pass in my local tree for a while, but because I was never really able to measure off-noise improvements it is not in the official tree yet. I hope to make more sense of it with the help of CPU counters, which have improved over time.

I'm not familiar enough with LLVM to really say, so I was just speculating: I vaguely remembered some kind of talk about cache optimisation and LLVM, so it's possible it was talking about the LLVM codebase rather than the passes available in LLVM.

I think I'm going to mess with this later myself. On Manjaro I usually compile Chromium with -O3 and -march=native with no mtune or any of that but I never benchmarked it against anything. I'll do the same with Firefox. This is on coffee lake BTW.

He's testing on a 7 year old 8-core server CPU. As irrelevant as possible for your average laptop Firefox user.

I suspect you seriously overestimate the amount of developers on bleeding edge hardware.

My desktop is a first gen i7. My Laptop a 4th gen. My work machine a 5 year old Xeon.

And you know what? They all work great and I see no reason to upgrade.

I guess you will be shocked to hear I’m doing fine with 8GBs of ram too :)

You can try the binary on your CPU.

This particular workload does not make much difference between modern CPUs. I just tried the Sunspider benchmark on my Skylake and it has similar outcomes as reported, but there is more noise since it is a notebook.

What I got is:

- GCC 8 build: 333 +- 3.3%

- Tumbleweed distro Firefox: 352 +- 3.4%

- Firefox 63 (GCC) official binary: 346 +- 5.6%

- Firefox 64 (LLVM) official binary: 342 +- 5.1%

but I do not completely trust the numbers, as re-running the benchmark leads to a different outcome each time.
