EDIT: I was looking at this for __builtin_mul_overflow, which is apparently in clang already, for testing overflow with arbitrary integer types.
I'm disappointed it says literally nothing about auto-vectorization. I had to do an incredible amount of hand-holding to get GFortran 4.9's auto-vectorization to work worth anything. A few experiments with C code showed that GCC 4.9 was just plain bad at it compared to even Clang, much less ICC. If I work on that project again, I'll probably look into getting DragonEgg to work so I can get decent optimization without contorting the code as much.
The OP mentions an important case with loads and stores of 3D data, but GCC needs to up its game in more ways than that.
So your question cannot be answered. The only possible answer is "it depends". Another guess: "From nothing to an order of magnitude. Probably most cases about 5-20%."
In my experience, auto-vectorization is the big one. Modern CPUs do 2-, 4-, or 8-wide operations, but it can be hard to convince the compiler to use them. Next is loop tiling, to keep things in cache(s) where possible. These are both hard to do by hand.
Loop interchange to get better locality is nice, but can be done by hand without much trouble by someone who understands computers.
While my testing was far from rigorous, my conclusion was that this is now sufficient and acceptable to get appropriate platform-specific SIMD optimization.
You'll find some recent arguments for "-Ofast" instead of "-O3" or "-O2", but compiler versions that don't support it are still recent enough to be common. Others occasionally argue that "-Os" is a better modern default, but I haven't found this to be true. Although more debatable, I'd suggest "-g -Wall -Wextra" also be part of the default compiler options for most projects.
I did have the occasional weird breakage with -Os, though (once it broke make...). Anyways, -O2, -O3, and -Os should be pretty reliable as I imagine they are the most used optimization flags these days.
You generally don't have to tweak options manually unless a small increase in performance is very important to your application or you are running specialized programs that will benefit largely from a specific optimization. Remember -- most Linux distributions use pretty basic cflags and are reasonably performant.
Come to think of it, the Gentoo project has probably been useful in rooting out weird cflag bugs in GCC. :)
EDIT: Also check this out:
Sets -fno-math-errno, -funsafe-math-optimizations, -ffinite-math-only, -fno-rounding-math, -fno-signaling-nans and -fcx-limited-range.
This option causes the preprocessor macro __FAST_MATH__ to be defined.
This option is not turned on by any -O option besides -Ofast since it can result in incorrect output for programs that depend on an exact implementation of IEEE or ISO rules/specifications for math functions. It may, however, yield faster code for programs that do not require the guarantees of these specifications.
OTOH it allows the compiler to do a great deal of algebraic simplification, including expression reordering. This will probably bring what the compiler can actually do more in line with what you think it should be able to do.
Basically, you need to test with `-ffast-math` on if you're going to use it. You might also have good luck turning on a subset of the flags. For example, IIRC in the project with the NaN checks, using all of them except `-ffinite-math-only` fixed the problem.
Some of them are obvious to turn on though. `-fno-math-errno` should be the default for most programs, if you ask me. I've never seen anybody check `errno` to see if their call to, e.g. `sqrt`, was invalid, and I hope I never do.
-fassociative-math (generally safe unless you're dealing with math written specifically to take advantage of the details of floating point arithmetic)
-fno-signed-zeros - again, though it's possible some code depends on this, it's rather unlikely.
-fno-trapping-math - do you actually use traps?
The vector types help in some ways--if you use only vectors and do arithmetic on entire vectors, the SIMD codegen already works great--but they don't really help more than an intelligent vectorizer could. For example, the LLVM loop vectorizer can only handle loops where the induction variable changes by 1 or -1 each iteration. As a result, you couldn't set A[i], A[i+1], and A[i+2] in a for loop where i += 3 each iteration. If you could do that, you wouldn't really need vec3 in the first place. (Also, the presence of any vector types whatsoever prevents any vectorization in LLVM right now, so...)
Another issue is that using vectors for storage and vectors for computation are very different things. I don't think anyone actually likes using vectors for computation, but vectors for storage make a lot of sense in some fields like image processing. However, as soon as you start operating on single channels, things get messy. Let's say you have a for loop where each iteration operates on a float4 as individual channels. Let's also say you want to turn each of those single channel operations into its own float4 operation, vectorizing across four iterations of the loop. Given most current SIMD ISAs, you're going to have to do four 32-bit loads of A[i].x, A[i+1].x, etc., then pack that into a vector, do your arithmetic ops, unpack the vector, and do four 32-bit writes to memory. Unsurprisingly, this is not particularly fast, and if you're doing one mul or FMA per channel, you shouldn't be vectorizing at all. This is why you see cost models in vectorizers (to prevent this sort of packing/unpacking from killing your performance when you enable your vectorizer) as well as why you see newer SIMD ISAs like AVX2 in Haswell including support for scatter/gather and permute.
The last issue is that it's trivial to break a vectorizer because of easy-to-overlook things like aliasing. Missing a single restrict will prevent vectorization if the compiler can't prove that pointers won't alias, for example (why do you think people still use fortran in HPC?). There's actually been a lot of great work here in LLVM over the past few months with new aliasing metadata for LLVM IR (http://llvm.org/docs/LangRef.html#noalias-and-alias-scope-me...), which is what I used to make the SLP vectorizer work with RenderScript in a way similar to ISPC (except in the compiler back end instead of the front end, because we don't even know the target ISA at the time the RS source is compiled to LLVM IR). I'll probably get the patch in AOSP in the next week or two if you want to keep an eye on that; it needs a newer LLVM version than what shipped in L and we're finishing up that rebase.
(honestly, I think that if you're trying to get good server CPU performance and you know exactly what CPU you're going to be using at compile time, you should be looking at ispc instead of doing SSE intrinsics yourself: https://ispc.github.io/ )
While loading each element with an individual 32-bit load usually negates the advantage of vectorization, if you are using 8-bit data, doing a single vector load and shuffling bytes within a 128-bit vector is really fast with AVX. And shuffling double- and quad-words within 256-bits with AVX2 can be a useful alternative even if you have to do it twice and combine. But even better is to rearrange your data layout to match the operations you are going to be doing.
AVX2 doesn't actually offer scatter support, although AVX-512 will. And AVX2 gather on Haswell normally isn't any faster than multiple individual loads, although it's no worse and the expectation is that future generations will improve on this significantly. Also worth pointing out is that 256-bit YMM vectors are conceptually split in two, so apart from the slightly slower 32/64/128-bit permute instructions, it's not possible to move data from one 128-bit 'lane' to the other.
It is great if the compiler shines, but frustrating if it screws up.
And yes, that applies to other tools, too. TeX, for instance, does a great job most of the time, given a few specific directions, but sometimes it doesn't quite do what you think is possible. Then you may have a long fight ahead to make it do what you want.
"It is ... frustrating if it screws up."
Sometimes packed formats were worth it. I remember doing RLE schemes for fast "alpha blending" (anti-aliased blitting really!) back in 2000 or so, before MMX or GPUs were common. The scheme I used had three different types of run lengths: completely transparent, pre-multiplied alpha blended, and completely opaque. 16-bit pre-multiplied values were in memory in 32-bit format like this: URGBU00A. Groups of 5 bits, except U, which denotes a single unused bit. The frame buffer in RAM was in 15-bit format, 0RRRRRGGGGGBBBBB. With this scheme, alpha-blending needed just one multiply per pixel, compared to three with usual pre-multiplied alpha or six with normal alpha without pre-multiplication.
Frame buffer pixels were masked and shifted into 000000GGGGG000000RRRRR00000BBBBB, multiplied by A, and shifted back into place. Because all values were 5-bit entities, no overflow (overlap) could occur in the multiply, so one multiply yielded all 3 multiplied frame buffer values, for R, G and B!
1. Image conversion (RGB structure to some other)
2. N-dimensional coordinates. (Normalize array of XYZ points)
3. Multiplication of vectors by constant matrix
Those are a few things that come to mind.
The real question that should be asked is whether, now with two compilers to contribute features to instead of one, the progress of either compiler is actually faster than before. Again: people aren't simply going to stop improving the compiler if there is only one of them, as improvements come from contributors who need the better compiler, not from some abstract notion of "have the bestest compiler". Having multiple open source projects in the same space is a detriment--it is a cost--that hopefully is justified by the benefits.
What happens in a world of "competition", is that now Intel has to ask whether it is valuable to add this feature to GCC or to clang; maybe they have time for both, but probably not. That this work ends up in one compiler and some other work ends up in a different compiler is unfortunate. It isn't even clear who is "competing" or what they are "competing" for. It certainly isn't the performance optimization itself: you can't credit the GCC team (whatever that might even mean) for that, as this work is from Intel. GCC is an open source project that is contributed to by a vast number of independent actors, not centrally developed by a small handful of people.
The argument for clang and LLVM being helpful is therefore not "competition" (which would make absolutely no sense), but instead "GCC has accumulated years of entrenched architecture that, in 2010 as opposed to 1980, we would design differently: LLVM is thereby designed in a way that is easier to modify; so, even though effort is now forked (meaning fewer contributions to each compiler), a smaller number of contributions will have a more powerful effect on the state of compilers". If this is true (and many think it is), then that's great, but it isn't "competition" in the way people normally consider that concept.
Stallman's reaction¹ to LLVM was noisy and possibly childish, but it gave the FSF another important goal: not only should it supply a "libre" compiler but it should also be worth using over the first alternative that comes along. That second goal is a moving target, unlike the first, and it supplies a real motivation to make the best compiler they can.
Otherwise, examples with 4-element arrays are nice to show, but in most of these cases it is also easy to use intrinsics.