Moreover GCC -O2 defaults are (in my opinion unfortunately) still not enabling vectorization and unrolling which may have noticeable effects on benchmarks.
This led to enabling AVX and since the global constructor now gets some code auto-vectorized the binary crashed on invalid instruction during the build (my testing machine has no AVX).
No AVX? He wants to better take advantage of vectorization, but he's doing the testing on a processor that is 3 generations behind in vectorization support. AVX (128-bit) came out in 2011, and has been followed by AVX2 (256-bit) and (still limited release) AVX-512.
Clock speeds have been fairly flat, and most of the improvements to recent processors have been microarchitectural. A lot of the optimization done by compilers ends up being architecture specific. Seeing which brand-new compiler best targets old hardware seems like it might produce misleading results.
I realize that not everyone has (or can have) the most recent hardware, but this seems like a case where it would be strongly in AMD and Intel's interest to make sure that people like Jan have better access to the improvements made in the last few years.
Firefox can't blindly use AVX without checking for its presence or it will crash on these types of systems.
EDIT: I should have clarified, the way the author of the article phrased it at first, I got the impression Clang did better auto-vectorization with Firefox code than GCC.
The closest I can find is Valve's hardware survey, which says that 87% of users are running on computers that support AVX. https://store.steampowered.com/hwsurvey (click on "Other Settings" on the bottom to expand).
Firefox is presumably lower than this, although I don't know by how much. Does Mozilla collect statistics on the capabilities of the computers that Firefox is being run on?
> Moreover GCC -O2 defaults are (in my opinion unfortunately) still not enabling vectorization and unrolling which may have noticeable effects on benchmarks.
... which makes it sound like Clang does a better job at this with Firefox code.
- first, how you set the -O2 defaults in your compiler. This is a delicate problem since you need to find the right balance of code size, compile time, robustness of generated code (do not trigger undefined behavior in super evil ways) and of course runtime. In benchmarks I have found that Clang has a bit of an edge in runtime, which is mostly vectorization (on x86-64)
- selection of the minimal ISA you support. For GCC, x86-64 still means the original Opteron, but distributions can easily (and some do) decide on something better. Indeed AVX is a big win, but for a general purpose distribution this is still too aggressive. You can provide AVX optimized libraries where it makes a difference
- selection of CPU tuning (i.e. generic/intel)
So I consider it a mistake that GCC traded away vectorization for compile-time speed and reliability at -O2, because it can make an important difference in common workloads these days (not 10 years ago, say).
It is also clearly a bug for GCC to produce AVX instructions when not explicitly asked to :)
I also do testing on Zen, Core and some PowerPC. For the Firefox machine I use a Bulldozer box because I don't mind it spending long nights running builds & benchmarks, and I think this particular problem is not very CPU specific.
Yes, this seems like one reasonable approach. The current approach of compiling to a "least common denominator" and then updating this denominator every decade or so seems insufficient. Instead of interpreting the absence of a "-march=" flag to mean optimize for nothing, maybe it can mean that multiple optimizations are automatically compiled and the appropriate one is selected at runtime. Alternatively, maybe we need to move away from the idea of compiling a single binary that runs on all platforms, and encourage greater use of platform specific compilations.
And some of the Boolean operations can be quite useful on integers. So even for integers, I was definitely wrong to characterize AVX as 128-bit.
This leads to issues like you have on Mac: when using Safari, the power consumption is super low. With Chrome you don't get nearly the battery runtime (when playing video or listening to music).
As 'hubicka' mentions in another comment, GCC does have "multi-versioning" capability, but it doesn't use it by default. Instead, one needs to mark individual functions with GCC-specific attributes, asking for versions with different capabilities to be created: https://lwn.net/Articles/691932/. This isn't necessarily a bad approach, but the fact that it seems to be used rarely makes me wonder if some other, more default approach that works with unmodified code might be an improvement.
Now we have a big project deciding to move from a reasonably portable gcc build to a clang-specific LTO framework that required significant engineering effort to achieve and which apparently isn't easily portable to the equivalent gcc effort, requiring a gcc maintainer to jump in on their behalf to show equivalence.
How is this not moving backwards?
(Also, as others have noted, the existence of clang has really been good for gcc).
Given that same level of effort (c.f. the article we're discussing), it seems like you could have done as well or better by moving to a more recent gcc instead. Or better, by working with both on coming up with a portable way to get LTO working.
I'm not really concerned with what you use to build (I mean, you have to pick some compiler at the end of the day), just with what seems to be "needlessly tight coupling" between clang/llvm and Firefox in a way that hurts the interoperable toolchain ecosystem.
Regardless, that feature being enabled when you're using suitable versions of clang/llvm/rustc doesn't preclude using LTO with other compilers.
And apparently they don't catch serious problems in their build system on their own (the profile feedback pass timing out and truncating data).
Overall not a great result at all for Mozilla. Where is the regression testing for performance? Are they not data driven?
GCC aggressively size optimizes cold regions, and LLVM doesn't bother.
This would be pretty easy to fix, but outside of binary size, one would need to prove it actually matters (the test harness here is a pretty darn old CPU).
The "don't catch serious problems in build system" is the bigger worry, imho.
GCC does "almost full LTO" with partitioning, while Clang does ThinLTO, which makes most of its code size/speed tradeoffs without considering whole-program context, so it may be interesting to get both alternatives closer in code size/performance metrics.
I have got a level 1 Firefox developer account and I am looking into the official benchmarking infrastructure, which I have now updated to GCC 8 with LTO+PGO.
It's like a fractal Rube Goldberg machine made of Rube Goldberg machines.
All this to render web pages. I think we must have made a wrong turn somewhere.
We've taken plenty of wrong turns, but none, I think, accounted for more than a rounding error in time or code needed to render web pages. Writing a browser is hard.
Hell, even writing a toy browser-like mockup isn't easy. I built an extremely bad renderer for an extremely simple class-provided XML-ish grammar in school. It only supported a handful of styling keywords (all inline/attribute-based), only one of which was positioning-related ("wrap to next fixed-height global line of display after this element").
It was really hard. Like, really hard. Even looking back on the code with the benefit of experience, it still would not be a breeze.
It supported a single fixed window size and a guaranteed-correct input file. Removing either of those constraints would have exploded the code size to the point I doubt I could have done it alone then, and if I could now it would take me an incredibly long time. Adding the full HTML spec would probably bring its SLoC counts into the 100ks, if not millions. Supporting re-renders and after-the-fact DOM updates would blow it far beyond that, making them fast might require me to go back to school, but who knows; maybe it's easier than my hunch. I suppose I could shave time by moving some of those hundreds of thousands of lines into the libraries which evolved during the many years since browsers became popular, but it would still be a gargantuan undertaking.
And all of that is before the immense amount of person-hours which would be needed for:
- Supporting cascading styling of any kind, with or without embedding another language.
- Adding networking, even if interoperability/an agreed-upon communication pattern or protocol already existed.
- Displaying assets other than styled text and SVG-esque drawings (images, videos, etc).
- Securing the request/response protocol, even if leveraging existing tools like OpenSSL to the max.
- Adding another Turing-complete and secure programming language for communicating with random local/networked resources and producing more requests or DOM updates.
TL;DR There are plenty of needlessly-complex tools and technologies out there. But I don't think web browsers are some of them. Even if you're anti-JS and anti-CSS, there is still an absolute shitload of complex, careful, hard-to-get-right interactions going on under the hood.
I don't think it is. I think Elm lang (for example) proves that it's not.
I finally got around to trying Elm. Once I got over the way it feels like a toy compared to the HTML/CSS/JS/etc world, I realized I could never ever justify NOT using it in a business context.
What I mean is, the business-value case for using the normal front-end stack vs. Elm just isn't there.
That's just an example for the domain you described.
The VPRI STEPS project demonstrated that we could reduce our codebase(s) by orders of magnitude while retaining or even improving functionality, "from the desktop to the metal".
It's those things that are complex; the client-side programming language (if present) is just one of many, many high-complexity parts in a browser.
TL;DR: Write an Elm to native app compiler/interpreter. Servers serve Elm code.
(As an aside, The loveliness of Elm is incidental to the point. If it looked like COBOL it would still make economic sense. Lots of people have developed DSLs for apps, the important thing about Elm is that it's a very elegant and well-thought-out domain-specific system for specifying apps. Elm is much less complex than HTML+CSS+JS+Frameworks/libs/NPM etc.)
At the moment, the delivery vehicle for Elm-specified apps is the Fractal Rube Goldberg Machine, yes.
But consider e.g. an Elm-to-GTK compiler, or Elm-to-TCL/Tk interpreter, whatever... The FRBM is just a reasonable first target platform.
I don't think I'm wrong here, or even saying anything controversial. Go look at what VPRI did with STEPS. Our code volume and complexity is too high by two or three orders of magnitude.
Programmers don't disagree that we could be using better approaches. The question is not why they don't exist, because they do exist. The question is why we don't or can't use them currently.
Most of the time, the reason is purely cultural, either due to management or legacy. I'd love to use Purescript and Haskell at my job, but I cannot. I don't get to make that choice. A new Elm transpiler won't solve a cultural problem.
They are mostly unrelated to overall binary size due to paging, etc.
You also won't easily predict the behavior due to reordering.
My understanding is that profile-guided optimisation is largely based on the utility of small binaries, by optimising hotspots for speed and everything else for space, thereby alleviating cache-pressure. Is this wrong?
> You also won't easily predict the behavior due to reordering.
I wasn't thinking of anything as sophisticated as looking at specific flows, where I can well imagine things get unpredictable with reordering and speculative execution. Won't there be a reliable pattern of better fitting in cache if we shrink everything?
I.e., you couldn't pull in function A without pulling in function B.
That is mostly not true.
This is why reordering mostly brings load time benefits instead of run time benefits.
The utility of PGO is mostly about knowing where to spend your time optimizing, and knowing what to do. That's a generalization.
There are certainly cases in inline-heavy code where it helps get the speed part right too. A lot of that is more often about "it lets the compiler spend its inlining budget on inlining stuff that matters" than "it stops the compiler from blowing the cache out".
I speak in generalities because there are always counterexamples.
There are cases where PGO makes things significantly worse, for example!
Last I remember (my job now means I don't have time to stay in the game), LLVM did not bother to optimize the cold regions for size, and GCC did.
It depends on function sizes and page sizes and mlocking and section flags and all sorts of fun things, but I'm just going to assert the truth of this in most cases to make it simpler.
LLVM will do the same kind of reordering.
(Both are interestingly well behind what commercial compilers do, and this is one of the very few areas where that is true. My suspicion is that it does not matter as much in practice as we want it to. Most forms of layout optimization are also very hard to perform on the C++ code you want to optimize due to inability to prove safety)
Yep, I have had a code layout pass in my tree for a while, but because I was never really able to measure above-noise improvements it is not in the tree yet. I hope to make more sense of it with the help of CPU counters, which have improved over time.
My desktop is a first-gen i7, my laptop a 4th gen, and my work machine a 5-year-old Xeon.
And you know what? They all work great and I see no reason to upgrade.
I guess you will be shocked to hear I’m doing fine with 8GBs of ram too :)
This particular workload does not differentiate much between modern CPUs. I just tried the SunSpider benchmark on my Skylake and it has similar outcomes to those reported, but there is more noise since it is a notebook.
What I got is:
GCC 8 build: 333 +- 3.3%
Tumbleweed distro firefox: 352 +- 3.4%
Firefox 63 (GCC) official binary: 346 +- 5.6%
Firefox 64 (llvm) official binary: 342 +- 5.1%
but I do not completely trust the numbers, as re-running the benchmark leads to a different outcome each time.