
Firefox 64 built with GCC and Clang - Twirrim
https://hubicka.blogspot.com/2018/12/firefox-64-built-with-gcc-and-clang.html
======
nkurz
I realize it's not the focus of his test, but as someone who thinks often
about how to take advantage of advanced vectorization techniques on modern
processors, I was surprised by statements like this:

 _Moreover GCC -O2 defaults are (in my opinion unfortunately) still not
enabling vectorization and unrolling which may have noticeable effects on
benchmarks._

 _This led to enabling AVX and since the global constructor now gets some code
auto-vectorized the binary crashed on invalid instruction during the build (my
testing machine has no AVX)._

No AVX? He wants to better take advantage of vectorization, but he's doing the
testing on a processor that is 3 generations behind in vectorization support.
AVX (256-bit floating point) came out in 2011, and has been followed by AVX2
(256-bit integer) and the (still limited-release) AVX-512.
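
To make that concrete, here's the sort of loop those flags are about (an
invented example, not anything from the article). At the time, GCC's -O2 left
a loop like this scalar; -O3 (or -O2 -ftree-vectorize) let the auto-vectorizer
at it, and building with -march=native on an AVX-capable machine could then
produce AVX instructions that trap with an illegal-instruction fault on CPUs
without AVX:

    #include <cstddef>

    // Invented example: a textbook candidate for auto-vectorization.
    // GCC of that era did not enable the vectorizer at -O2, so this stayed a
    // scalar loop; -O3 or -O2 -ftree-vectorize allowed SSE/AVX code, and
    // -march=native chose the instruction set of the build machine.
    void add_arrays(float *dst, const float *a, const float *b, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            dst[i] = a[i] + b[i];   // one SIMD add per chunk once vectorized
    }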

Clock speeds have been fairly flat, and most of the improvements to recent
processors have been microarchitectural. A lot of the optimization done by
compilers ends up being architecture specific. Seeing which brand-new compiler
best targets old hardware seems like it might produce misleading results.

I realize that not everyone has (or can have) the most recent hardware, but
this seems like a case where it would be strongly in AMD and Intel's interest
to make sure that people like Jan have better access to the improvements made
in the last few years.

~~~
supercilious
Intel still disables AVX instructions on their low-end Core architecture chips
for market segmentation purposes, and it is entirely absent from their Atom and
Celeron chips to begin with. AMD did not have AVX support until Ryzen, but they
are still selling Piledriver-based CPUs on their AM4 platform.

Firefox can't blindly use AVX without checking for its presence or it will
crash on these types of systems.
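
Roughly speaking, the check is just CPUID-backed runtime dispatch, something
like this sketch (invented code, not Firefox's actual dispatch;
__builtin_cpu_supports is a GCC/Clang builtin):

    #include <cstdio>

    static void work_scalar() { std::puts("scalar path"); }

    // The AVX path can live in the same file via the target attribute
    // (GCC/Clang); it is only ever called after the runtime check below.
    __attribute__((target("avx")))
    static void work_avx() { std::puts("AVX path"); }

    int main() {
        if (__builtin_cpu_supports("avx"))
            work_avx();
        else
            work_scalar();
    }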

~~~
karavelov
AMD Bulldozer and later architectures support AVX; what Ryzen adds is AVX2
instructions.

~~~
nkurz
You're right, and this is a little confusing. The article says he's using an
"AMD Opteron 6272", which seems like it should support AVX:
http://www.cpu-world.com/CPUs/Bulldozer/AMD-Opteron%206272%20OS6272WKTGGGU.html
So maybe the GCC bug he encountered is actually because he lacks AVX2 support?
Or an incompatibility between early AMD and Intel support for AVX?

------
duhast
It's hard to count the number of major improvements that landed in GCC since
the inception of clang. Competition in this landscape is benefiting everyone.

~~~
Avamander
I like the rising popularity of LTO the most.

------
ajross
I'm not loving the Firefox move toward clang. For years we've been told that
clang is great because we finally have a competitor for gcc and that multiple
interoperable compilers can only improve the ecosystem (which is undeniably
true).

Now we have a big project deciding to move from a reasonably portable gcc
build to a clang-specific LTO framework that required significant engineering
effort to achieve and which apparently isn't easily portable to the equivalent
gcc effort, requiring a gcc maintainer to jump in on their behalf to show
equivalence.

How is this not moving backwards?

~~~
froydnj
We saw significant performance gains when moving from GCC (6) to clang (6). I
don't think it'd be particularly hard to switch back at this point; this
article provides some solid data for doing so.

~~~
ajross
Yeah, but to be fair the work to actually enable LTO was very significant (at
least as far as we outside the community could see via stuff like the blog
post here) and involved a ton of toolchain-specific hackery and work with the
clang upstream.

Given that same level of effort (cf. the article we're discussing), it seems
like you could have done as well or better by moving to a more recent gcc
instead. Or better, by working with both on coming up with a portable way to
get LTO working.

I'm not really concerned with what you use to build (I mean, you have to pick
some compiler at the end of the day), just with what seems to be "needlessly
tight coupling" between clang/llvm and Firefox in a way that hurts the
interoperable toolchain ecosystem.

~~~
froydnj
What are you referring to by "a ton of toolchain-specific hackery" and "a
portable way to get LTO working"? It seems like there are very specific things
you have in mind, but I'm unclear what bits of work you're referencing. Unless
you're thinking of the cross-language LTO work, which is still in progress and
is of course clang/llvm-specific? I'd love to see that feature work with GCC,
but it's simply not feasible at the present time.

Regardless, that feature being enabled when you're using suitable versions of
clang/llvm/rustc doesn't preclude using LTO with other compilers.

------
hsivonen
It'll be interesting to see the numbers again after LLVM ThinLTO starts
applying across C++ and Rust, resulting in cross-language inlining.

~~~
unixhero
I don't know what you are saying here. Can you elaborate?

~~~
hsivonen
A goal that is being worked towards is making LLVM ThinLTO consider not just
clang output but clang-generated LLVM IR together with rustc-generated LLVM IR.
This is expected to lead to inlining between C++ and Rust, making the FFI layer
of C-linkage function calls between the two melt away.
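
Concretely (invented names, just to show the shape of it): a C++ caller today
sees a Rust function only as an opaque extern "C" symbol in another object
file, so a call like the one in the loop below always pays full call overhead.
Once rustc-generated IR participates in the same ThinLTO link as the
clang-generated IR, the optimizer can inline that body into the C++ caller.
This sketch assumes a separate Rust crate provides the function, so it won't
link on its own:

    #include <cstddef>
    #include <cstdint>

    // Hypothetical Rust side, in its own crate:
    //     #[no_mangle]
    //     pub extern "C" fn style_flag_is_set(flags: u32, bit: u32) -> bool {
    //         flags & (1 << bit) != 0
    //     }
    extern "C" bool style_flag_is_set(std::uint32_t flags, std::uint32_t bit);

    std::uint32_t count_set(const std::uint32_t *flags, std::size_t n,
                            std::uint32_t bit) {
        std::uint32_t count = 0;
        for (std::size_t i = 0; i < n; ++i)
            count += style_flag_is_set(flags[i], bit) ? 1 : 0;  // hot FFI call
        return count;
    }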

~~~
pedrocr
This seems to be the bug tracking this in Rust:

https://github.com/rust-lang/rust/issues/49879

------
Twirrim
One of the things that disturbed me about the article was how the Mozilla build
chain just merrily ignores that profiling had failed, and moves on to building
stuff using that profile. That seems like quite dangerous behaviour. Surely
that should be a failing step for a build, or at the very least a large
warning should go out at the end: "This build may be optimised based on
complete nonsense because profiling failed."

------
carapace
This is all way _way_ WAY too complicated.

It's like a fractal Rube Goldberg machine made of Rube Goldberg machines.

All this to render web pages. I think we must have made a wrong turn
somewhere.

~~~
zbentley
> All this to render web pages. I think we must have made a wrong turn
> somewhere.

We've taken plenty of wrong turns, but none, I think, accounted for more than
a rounding error in time or code needed to render web pages. Writing a browser
is _hard_.

Hell, even writing a toy browser-like mockup isn't easy. I built an extremely
bad renderer for an extremely simple class-provided XML-ish grammar in school.
It only supported a handful of styling keywords (all inline/attribute-based),
only one of which was positioning-related ("wrap to next fixed-height global
line of display after this element").

It was really hard. Like, _really_ hard. Even looking back on the code with
the benefit of experience, it still would not be a breeze.

It supported a single fixed window size and a guaranteed-correct input file.
Removing either of those constraints would have exploded the code size to the
point I doubt I could have done it alone then, and if I could now it would
take me an incredibly long time. Adding the full HTML spec would probably
bring its SLoC counts into the 100ks, if not millions. Supporting re-renders
and after-the-fact DOM updates would blow it far beyond that; making them fast
might require me to go back to school, but who knows; maybe it's easier than
my hunch. I suppose I could shave time by moving some of those hundreds of
thousands of lines into the libraries which evolved during the many years
since browsers became popular, but it would still be a gargantuan undertaking.

And all of that is before the immense amount of person-hours which would be
needed for:

- Supporting cascading styling of any kind, with or without embedding another
language.

- Adding networking, even if interoperability/an agreed-upon communication
pattern or protocol already existed.

- Displaying assets other than styled text and SVG-esque drawings (images,
videos, etc).

- Securing the request/response protocol, even if leveraging existing tools
like OpenSSL to the max.

- Adding another Turing-complete and secure programming language for
communicating with random local/networked resources and producing more
requests or DOM updates.

It's hard.

TL;DR There are plenty of needlessly-complex tools and technologies out there.
But I don't think web browsers are some of them. Even if you're anti-JS and
anti-CSS, there is still an absolute shitload of complex, careful, hard-to-
get-right interactions going on under the hood.

~~~
carapace
Is all that really _needful?_ To draw documents? To make apps?

I don't think it is. I think Elm lang (for example) _proves_ that it's not.

I finally got around to trying Elm. Once I got over the way it feels like a
toy compared to the HTML/CSS/JS/etc world, I realized I could never ever
justify NOT using it in a business context.

What I mean is, the business-value case for using the normal front-end stack
vs. Elm just isn't there.

That's just an example for the domain you described.

The VPRI STEPS project demonstrated that we could reduce our codebase(s) by
orders of magnitude while retaining or even improving functionality, "from the
desktop to the metal".

~~~
zbentley
I'm sure Elm is lovely, but how is it useful without a browser to deliver it,
a browser to provide it a document to manipulate, and a browser to display its
changes to that document?

It's those things that are complex; the client-side programming language (if
present) is just one of many, many high-complexity parts in a browser.

~~~
carapace
Bless you! I was hoping someone would ask me that.

TL;DR: Write an Elm to native app compiler/interpreter. Servers serve Elm
code.

(As an aside, the loveliness of Elm is incidental to the point. If it looked
like COBOL it would still make economic sense. Lots of people have developed
DSLs for apps, the important thing about Elm is that it's a very elegant and
well-thought-out domain-specific system for specifying apps. Elm is _much
less_ complex than HTML+CSS+JS+Frameworks/libs/NPM etc.)

At the moment, the delivery vehicle for Elm-specified apps is the Fractal Rube
Goldberg Machine, yes.

But consider e.g. an Elm-to-GTK compiler, or Elm-to-Tcl/Tk interpreter,
whatever... The FRBM is just a reasonable first target platform.

I don't think I'm wrong here, or even saying anything controversial. Go look
at what VPRI did with STEPS. Our code volume and complexity is too high by two
or three orders of magnitude.

~~~
sli
I don't see why I wouldn't just use Haskell and a native UI library _right
now_, to similar effect, instead of waiting for all this to appear. The
language is in a much more stable state than Elm, which already makes it more
ideal in a business context.

~~~
carapace
Cheers! You're making my point: the "FRBM" isn't _needful_.

~~~
sli
I'm not making your point for you; I just don't agree with you on what the
actual problem is.

Programmers don't disagree that we could be using better approaches. The
question is not why they don't exist, because they do exist. The question is
why we don't or can't use them currently.

Most of the time, the reason is purely cultural, either due to management or
legacy. I'd love to use Purescript and Haskell at my job, but I cannot. I
don't get to make that choice. A new Elm transpiler won't solve a cultural
problem.

------
navjack27
I wouldn't run testing on a notebook. I mean, you can if what you are actually
testing is boosting characteristics and other such variables... But the best
bet for low-variance, consistent testing is a machine where you have set a
static core clock speed and disabled C-states and other power-saving features.
Remember, you are testing differences in compile optimization. You don't want
your system being a variable.

------
mhh__
The 48% difference in code size is surprising. But after all, who cares in this
world of Electron etc.

~~~
MaxBarraclough
If it affects cache behaviour, we should all care.

~~~
DannyBee
Whether that is true or not depends on a _lot_ of factors.

They are mostly unrelated to overall binary size due to paging, etc.

You also won't easily predict the behavior due to reordering.

~~~
MaxBarraclough
Smaller binary enables use of a lower-level cache, no? [0]

My understanding is that profile-guided optimisation is largely based on the
utility of small binaries, by optimising hotspots for speed and everything
else for space, thereby alleviating cache-pressure. Is this wrong?

> You also won't easily predict the behavior due to reordering.

I wasn't thinking of anything as sophisticated as looking at specific flows,
where I can well imagine things get unpredictable with reordering and
speculative execution. Won't there be a reliable pattern of better
fitting in cache, if we shrink everything?

[0] [https://lwn.net/Articles/534735/](https://lwn.net/Articles/534735/)

~~~
DannyBee
"Smaller binary enables use of a lower-level cache, no? [0]" No. It would if
all of the stuff was actually all in memory at once, and pulled in the stuff
next to it.

IE you couldn't pull in function A without pulling in function B. That is
mostly not true[1].

This is why reordering _mostly_ brings load time benefits instead of run time
benefits.

The utility of PGO is mostly about knowing where to spend your time
optimizing, and knowing what to do. That's a generalization. There are
certainly cases in inline-heavy (etc.) code where it helps get the speed part
right too. A lot of that is more often about "it lets the compiler spend its
inlining budget on inlining stuff that matters" than "it stops the compiler
from blowing the cache out".

I speak in generalities because there are always counterexamples.

There are cases where PGO makes things significantly worse, for example!

Last I remember (my job now means I don't have time to stay in the game), LLVM
did not bother to optimize the cold regions for size, and GCC did.

[1] It depends on function sizes and page sizes and mlocking and section flags
and all sorts of fun things, but I'm just going to assert the truth of this in
most cases to make it simpler.
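
As a toy illustration of the inlining-budget point (invented example, nothing
to do with Firefox's code): built with instrumentation (GCC -fprofile-generate,
or Clang's analogous -fprofile-instr-generate), run on representative input,
then rebuilt with the profile (GCC -fprofile-use, Clang
-fprofile-instr-use=<profdata>), the compiler learns that the digit path is hot
and the error path is essentially never taken, so it can inline the hot helper
and leave the cold one out of line:

    #include <cstdio>
    #include <cstdlib>

    static int parse_digit(char c) {        // hot: runs for nearly every byte
        return c - '0';
    }

    static void report_bad_input(char c) {  // cold: profile says ~never taken
        std::fprintf(stderr, "unexpected byte %d\n", c);
        std::exit(1);
    }

    int sum_digits(const char *s) {
        int sum = 0;
        for (; *s; ++s) {
            if (*s < '0' || *s > '9')
                report_bad_input(*s);       // branch the profile marks cold
            sum += parse_digit(*s);         // where the inlining budget goes
        }
        return sum;
    }

    int main() {
        std::printf("%d\n", sum_digits("20181220"));
    }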

------
navjack27
I think I'm going to mess with this later myself. On Manjaro I usually compile
Chromium with -O3 and -march=native, with no -mtune or any of that, but I never
benchmarked it against anything. I'll do the same with Firefox. This is on
Coffee Lake, BTW.

------
21
He's testing on a 7-year-old 8-core server CPU. As irrelevant as possible for
your average laptop Firefox user.

~~~
josteink
I suspect you seriously overestimate the number of developers on bleeding-edge
hardware.

My desktop is a first-gen i7. My laptop a 4th gen. My work machine a 5-year-old
Xeon.

And you know what? They all work great and I see no reason to upgrade.

I guess you will be shocked to hear I'm doing fine with 8 GB of RAM too :)

