

New optimizations for X86 in upcoming GCC 5.0 - ingve
https://software.intel.com/en-us/blogs/2014/11/24/what-is-new-for-x86-in-upcoming-gcc-50

======
justincormack
Here is a more general list of gcc 5 changes
[https://gcc.gnu.org/gcc-5/changes.html](https://gcc.gnu.org/gcc-5/changes.html)

EDIT: I was looking at this for __builtin_mul_overflow, which is apparently in
clang already, for testing overflow of arbitrary types.

~~~
cremno
Actually, Clang doesn't implement the listed built-in functions (yet?). They're
type-generic versions of Clang's existing typed ones:

[http://llvm.org/releases/3.5.0/tools/clang/docs/LanguageExte...](http://llvm.org/releases/3.5.0/tools/clang/docs/LanguageExtensions.html#checked-arithmetic-builtins)

[http://clang.llvm.org/docs/LanguageExtensions.html#checked-a...](http://clang.llvm.org/docs/LanguageExtensions.html#checked-arithmetic-builtins)

~~~
justincormack
Ah yes. The generic versions are nicer to use and can handle mixed types. It
seems gcc will implement the clang-compatible ones too.
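
For illustration, a minimal sketch of the type-generic builtin as documented
for GCC 5 (the signature is from the GCC docs; the example itself is mine):

    #include <stdbool.h>
    #include <stdio.h>

    int main(void)
    {
        unsigned long long big = 1ULL << 62;
        unsigned int out;  /* mixed operand/result types are allowed */

        /* Returns true if the mathematically exact product
           does not fit in *out. */
        if (__builtin_mul_overflow(big, 4ULL, &out))
            puts("overflow detected");
        else
            printf("%u\n", out);
        return 0;
    }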

------
livemusic
Only loosely related, but I'm curious: what compiler optimizations have the
biggest impact on scientific / floating-point computation? On integer
(audio/image) ops? With modern CPUs performing speculative execution, register
renaming, and all the other magic they do, the CPU is acting like an optimizer
in its own right; x64 is mostly just a bytecode that gets JIT-compiled at
runtime. I'd be interested in seeing how much specific optimizations
(including classic ones) contribute to speeding things up.

~~~
vardump
This depends entirely on the specific code you're talking about. If the
compiler finds that a specific optimization is applicable to a certain block
of code, then it uses it.

So your question cannot be answered; the only possible answer is "it depends".
Another guess: "from nothing to an order of magnitude, probably 5-20% in most
cases".

~~~
livemusic
Yeah, I realize it was an open-ended question. I'm just curious, if one were
starting a (C-like) compiler from scratch, where the biggest bang for the buck
comes from. It seems like peephole optimizations are already handled at
runtime by the CPU itself.

------
e12e
Does anyone know if -O3 (or -O2?) -march=native should be enough to get
reasonable optimization for running on the same CPU as the gcc host? Or is one
better off tweaking options manually? (Note: I know that knowing the details
of how gcc works will always be better than not -- I'm just wondering if
-march=native is currently considered stable/"good enough", for reasonable
values of "enough" ;-)

~~~
nkurz
I think yes. A couple of weeks ago I tested compilation for a project on
Haswell and Sandy Bridge using various recent versions of Clang, GCC, and ICC
to determine whether it was safe to use just "-O3 -march=native" in
combination with "#include <x86intrin.h>" instead of more specific versions.

While my testing was far from rigorous, my conclusion was that this is now
sufficient and acceptable to get appropriate platform-specific SIMD
optimization.

You'll find some recent arguments for "-Ofast" instead of "-O3" or "-O2", but
compiler versions that don't support it are still recent enough to be common.
Others occasionally argue that "-Os" is a better modern default, but I haven't
found this to be true. Although more debatable, I'd suggest "-g -Wall -Wextra"
also be part of the default compiler options for most projects.
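
A quick way to sanity-check what "-march=native" actually enabled is to test
the predefined macros (a sketch of mine using documented Intel intrinsics; the
test program itself is illustrative):

    /* Build with: gcc -O3 -march=native -g -Wall -Wextra check.c */
    #include <stdio.h>
    #include <x86intrin.h>

    int main(void)
    {
    #ifdef __AVX2__
        /* This 256-bit add only compiles when the target supports it. */
        __m256 v = _mm256_set1_ps(1.0f);
        float out[8];
        _mm256_storeu_ps(out, _mm256_add_ps(v, v));
        printf("AVX2 enabled: %f\n", out[0]);
    #else
        puts("AVX2 not enabled for this -march");
    #endif
        return 0;
    }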

~~~
e12e
Thank you both, and especially for the mention of x86intrin.h ... I wasn't
aware of that.

------
medgno
I'm so glad to see auto-vectorization happening more and more often. However,
I wonder whether a language that had built-in support for primitive floating-
point vector types (e.g., GLSL's vec3, vec4, mat3, mat4) could help the
compiler with performing these sorts of optimizations.

~~~
tmurray
I spent a lot of the past month improving the RenderScript (Android data-
parallel compute, using C99 plus vectors) codegen to work better with the LLVM
vectorizers, so I have a fair amount of experience with this exact question.

The vector types help in some ways--if you can use only vectors and do
arithmetic on entire vectors only, the SIMD codegen already works great--but
they don't really help more than an intelligent vectorizer could. For example,
the LLVM loop vectorizer can only handle loops where the induction variable
changes by 1 or -1 each iteration. As a result, you couldn't set A[i], A[i+1],
and A[i+2] in a for loop where i += 3 each iteration. If you could do that,
you wouldn't really need vec3 in the first place. (Also, the presence of any
vector types whatsoever prevents any vectorization in LLVM right now, so...)
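
To make the stride limitation concrete, a small sketch (my example, not from
the RenderScript work):

    /* The first loop has a unit-stride induction variable and is a
       candidate for auto-vectorization; the second steps i by 3 and,
       per the above, the LLVM loop vectorizer gives up on it. */
    void scale_unit(float *a, const float *b, int n)
    {
        for (int i = 0; i < n; i++)        /* i += 1: vectorizable */
            a[i] = 2.0f * b[i];
    }

    void scale_by_three(float *a, const float *b, int n)
    {
        for (int i = 0; i + 2 < n; i += 3) {   /* i += 3: not vectorized */
            a[i]     = 2.0f * b[i];
            a[i + 1] = 2.0f * b[i + 1];
            a[i + 2] = 2.0f * b[i + 2];
        }
    }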

Another issue is that using vectors for storage and using vectors for
computation are very different things. I don't think anyone actually likes
using vectors
for computation, but vectors for storage make a lot of sense in some fields
like image processing. However, as soon as you start operating on single
channels, things get messy. Let's say you have a for loop where each iteration
operates on a float4 as individual channels. Let's also say you want to turn
each of those single channel operations into its own float4 operation,
vectorizing across four iterations of the loop. Given most current SIMD ISAs,
you're going to have to do four 32-bit loads of A[i].x, A[i+1].x, etc., then
pack that into a vector, do your arithmetic ops, unpack the vector, and do
four 32-bit writes to memory. Unsurprisingly, this is not particularly fast,
and if you're doing one mul or FMA per channel, you shouldn't be vectorizing
at all. This is why you see cost models in vectorizers (to prevent this sort
of packing/unpacking from killing your performance when you enable your
vectorizer) as well as why you see newer SIMD ISAs like AVX2 in Haswell
including support for scatter/gather and permute.
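
The storage-versus-computation split is easier to see side by side
(illustrative C types of my choosing, not RenderScript's):

    typedef struct { float x, y, z, w; } float4;   /* vectors as storage */

    void scale_x_aos(float4 *p, int n)
    {
        for (int i = 0; i < n; i++)
            p[i].x *= 2.0f;   /* stride-4 loads: pack/unpack to vectorize */
    }

    typedef struct { float *x, *y, *z, *w; } image_soa;

    void scale_x_soa(image_soa img, int n)
    {
        for (int i = 0; i < n; i++)
            img.x[i] *= 2.0f;  /* contiguous loads: straightforward SIMD */
    }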

The last issue is that it's trivial to break a vectorizer because of
easy-to-overlook things like aliasing. Missing a single restrict will prevent
vectorization if the compiler can't prove that pointers won't alias, for
example (why do you think people still use fortran in HPC?). There's actually
been a lot of great work here in LLVM over the past few months with new
aliasing metadata for LLVM IR ([http://llvm.org/docs/LangRef.html#noalias-and-alias-scope-me...](http://llvm.org/docs/LangRef.html#noalias-and-alias-scope-metadata)), which is what I used to make the SLP vectorizer work with
RenderScript in a way similar to ISPC (except in the compiler back end instead
of the front end, because we don't even know the target ISA at the time the RS
source is compiled to LLVM IR). I'll probably get the patch in AOSP in the
next week or two if you want to keep an eye on that; it needs a newer LLVM
version than what shipped in L and we're finishing up that rebase.
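
The aliasing point in a nutshell (a sketch of mine, using standard C99
restrict):

    /* Without restrict the compiler must assume dst and src can
       overlap; marking both tells it they don't, which is exactly
       the guarantee the vectorizer needs. */
    void add_arrays(float *restrict dst, const float *restrict src, int n)
    {
        for (int i = 0; i < n; i++)
            dst[i] += src[i];
    }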

(honestly, I think that if you're trying to get good server CPU performance
and you know exactly what CPU you're going to be using at compile time, you
should be looking at ispc instead of doing SSE intrinsics yourself:
[https://ispc.github.io/](https://ispc.github.io/) )

~~~
nkurz
Fabulous comment; please keep posting more and sharing your
knowledge/experience. And thanks for the reminder about 'ispc'. A couple of
tiny additions:

While loading each element with an individual 32-bit load usually negates the
advantage of vectorization, if you are using 8-bit data, doing a single vector
load and shuffling bytes within a 128-bit vector is really fast with AVX. And
shuffling double- and quad-words within 256-bits with AVX2 can be a useful
alternative even if you have to do it twice and combine. But even better is to
rearrange your data layout to match the operations you are going to be doing.
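
For example, something like this (SSSE3's _mm_shuffle_epi8, available on any
AVX machine; the index pattern is an illustrative choice of mine):

    #include <x86intrin.h>

    /* One 16-byte load plus one shuffle rearranges 8-bit data
       in-register instead of many scalar loads: here, even-indexed
       bytes go to the low half, odd-indexed bytes to the high half. */
    __m128i deinterleave_bytes(const unsigned char *p)
    {
        __m128i v = _mm_loadu_si128((const __m128i *)p);
        const __m128i idx = _mm_setr_epi8(0, 2, 4, 6, 8, 10, 12, 14,
                                          1, 3, 5, 7, 9, 11, 13, 15);
        return _mm_shuffle_epi8(v, idx);
    }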

AVX2 doesn't actually offer scatter support, although AVX-512 will. And AVX2
gather on Haswell normally isn't any faster than multiple individual loads,
although it's no worse and the expectation is that future generations will
improve on this significantly. Also worth pointing out is that 256-bit YMM
vectors are conceptually split in two, so apart from the slightly slower
32/64/128-bit permute instructions, it's not possible to move data from one
128-bit 'lane' to the other.

~~~
tmurray
Sorry, the desktop Intel SIMD ISAs are generally not something I use on a
day-to-day basis, so I get them mixed up a lot. (LRBni was the last one I
looked at for any serious length of time.)

------
janvdberg
As this is the number 1 story on HN right now: can someone explain what kind
of improvements we will see in day-to-day use from this?

~~~
_wmd
The article mentions two: unpacking 24-bit values and matrix multiplication;
both might be something you'd find in a software video (de)compressor or a
computer game.

~~~
vinkelhake
Where did you get 24 from? It's about vectorizing loads of 3 elements. The
elements don't have to be 8-bit. The second example uses floats.

~~~
cwp
Yeah, but 24-bit values (RGB) are what you'd find in video and gaming
applications.

~~~
vardump
Games, at least the ones I've seen, usually just don't use 3x(8|16|32) bits
per pixel, even when the fourth channel isn't needed for alpha. It's pretty
rare to pack 24/48/96-bit pixels in memory. Traditionally, unaligned loads
just haven't been worth it.

Sometimes packed formats were worth it. I remember doing RLE schemes for fast
"alpha blending" (anti-aliased blitting, really!) back in 2000 or so, before
MMX or GPUs were common. The scheme I used had three different types of run
lengths: completely transparent, pre-multiplied alpha blended, and completely
opaque. 16-bit pre-multiplied values were stored in memory in a 32-bit format
like this: URGBU00A -- groups of 5 bits, except U, which denotes a single
unused bit. The frame buffer in RAM was in 15-bit format, 0RRRRRGGGGGBBBBB.
With this scheme, alpha-blending needed just _one multiply_ per pixel in total
[1], compared to three with the usual pre-multiplied alpha or six with normal
alpha without pre-multiplication.

[1]: Frame buffer pixels were masked and shifted into
000000GGGGG000000RRRRR00000BBBBB, multiplied by A, and shifted back into
place. Because all values were 5-bit entities, no overflow (overlap) could
occur in the multiply, so one multiply yielded all three multiplied frame
buffer channel values, for R, G, and B!
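
In C, the trick from [1] might look roughly like this (my reconstruction, not
the original code; the helper name is hypothetical):

    #include <stdint.h>

    /* fb is a 15-bit 0RRRRRGGGGGBBBBB pixel, a is a 5-bit alpha (0..31).
       Spreading the channels apart gives each one 10 bits of headroom,
       so a single 32-bit multiply scales all three at once. */
    static uint16_t scale_pixel(uint16_t fb, uint32_t a)
    {
        uint32_t spread = ((uint32_t)(fb & 0x03E0u) << 16)  /* G -> bits 21-25 */
                        |  (uint32_t)(fb & 0x7C00u)         /* R -> bits 10-14 */
                        |  (uint32_t)(fb & 0x001Fu);        /* B -> bits 0-4   */
        uint32_t prod = spread * a;                         /* the one multiply */
        /* Keep the top 5 bits of each 10-bit product and repack. */
        return (uint16_t)(((prod >> 21) & 0x03E0u) | ((prod >> 5) & 0x7C1Fu));
    }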

------
ape4
Competition (with clang) is good.

~~~
saurik
People contribute to compilers because they need features or performance
improvements, not "to kick that other compiler's ass". Sometimes (as here)
contributions are from CPU vendors who want to make certain that compilers
show off the features of their new hardware. There were tons of great
performance improvements in every new version of GCC before clang, and there
continue to be today. You just can't model an open source compiler the way you
model some closed source word processor :(.

The real question that should be asked is whether, now with two compilers to
contribute features to instead of one, the progress of either compiler is
actually faster than that of the single one before. Again: people aren't just
going to stop improving the compiler if there is only one of them, as
improvements come from contributors who need the better compiler, not from
some abstract notion of "have the bestest compiler". Having multiple open
source projects in the same space is a detriment--it is a cost--that hopefully
is justified by the benefits.

What happens in a world of "competition" is that Intel now has to ask whether
it is valuable to add this feature to GCC or to clang; maybe they have time
for both, but probably not. That this work ends up in one compiler and some
other work ends up in a different compiler is unfortunate. It isn't even clear
who is "competing" or what they are "competing" for. It certainly isn't the
performance optimization itself: you can't credit the GCC team (whatever that
might even mean) for that, as this work is from Intel. GCC is an open source
project that is contributed to by a vast number of independent actors, not
centrally developed by a small handful of people.

The argument for clang and LLVM being helpful is therefore not "competition"
(which would make absolutely no sense), but instead "GCC has accumulated years
of entrenched architecture that, in 2010 as opposed to 1980, we would design
differently: LLVM is thereby designed in a way that is easier to modify; so,
even though effort is now forked (meaning fewer contributions to each
compiler), a smaller number of contributions will have a more powerful effect
on the state of compilers". If this is true (and many think it is), then
that's great, but it isn't "competition" in the way people normally consider
that concept.

~~~
smosher_
I disagree; competition itself was good for GCC.

Stallman's reaction¹ to LLVM was noisy and possibly childish, but it gave the
FSF another important goal: not only should it supply a "libre" compiler but
it should also be worth using over the first alternative that comes along.
That second goal is a moving target, unlike the first, and it supplies a real
motivation to make the best compiler they can.

1\.
[http://www.phoronix.com/scan.php?page=news_item&px=MTU4MjA](http://www.phoronix.com/scan.php?page=news_item&px=MTU4MjA)

~~~
GFK_of_xmaspast
GCC hasn't really had a great track record here; things got so bad back in the
day that EGCS became necessary
([https://gcc.gnu.org/news/announcement.html](https://gcc.gnu.org/news/announcement.html))

------
veselin
One thing that is not clear to me is whether OOP code such as copy
constructors will benefit from using the vector registers. Just having vector
copies of objects may turn out to make a big difference.

Otherwise, examples with 4-element arrays are nice to show, but in most of
these cases it is also easy to use intrinsics.

------
runeks
Will this work for AMD x86 CPUs as well, or only Intel CPUs?

