
A bug story: data alignment on x86 (2016) - phab
https://pzemtsov.github.io/2016/11/06/bug-story-alignment-on-x86.html
======
nkurz
Summarizing the conclusion of the article: GCC correctly interprets the spec
as saying that all integer reads must be aligned in memory, and when
vectorizing the code chooses to use an "aligned" instruction that fails on
unaligned data (MOVDQA). On modern x64 processors, the unaligned version of
this instruction (MOVDQU) is just as fast on both aligned and unaligned data,
and has the advantage of not causing a segfault when run.

Is this a bug in GCC that should be fixed, or is GCC justified in its
behavior? Or is there another interpretation?

Having been bitten by this in the past, my conclusion is that while this is
not a bug, it is (slight) evidence that GCC will not act in its users' best
interests unless required to do so by the spec. Given the same inputs, Intel's
ICC generates working fast code. If both compilers were equally available, I
would usually prefer ICC over GCC for code that is going to run on modern
Intel processors.

~~~
userbinator
_unless required to do so by the spec_

Which IMHO is an absolutely stupid point of view, because compilers don't
exist in a void. As the old saying goes, "what's right isn't always legal, and
what's legal isn't always right", and behaving to the letter of the law is not
the same as behaving to the spirit of the law.

Thus I think it is absolutely a bug. The standard even says undefined
behaviour may result in something like "behaving in a manner characteristic of
the environment", which is absolutely what programmers expect from the
language.

I also suspect the fact that GCC has become a de-facto monopoly (duopoly if
you count Clang/LLVM) among C compilers for Linux platforms makes them more
likely to dismiss such complaints.

My experience agrees with yours that ICC and MSVC are nowhere near as
aggressive and hostile with UB, yet still generate very good code.

~~~
__s
Given that it's gcc that is generating instructions which require alignment,
"behaving in a manner characteristic of the environment" is one way of
describing this situation.

~~~
userbinator
Then perhaps Intel is ultimately to blame for this mess, since requiring
alignment is completely at odds with how x86 normally behaves, and as shown in
the article, there's a version of the instruction not requiring alignment
_and_ not really slower at all.

------
nly
I'm surprised any seasoned C developer would make this mistake. You haven't
been able to assume trivial translation of C code to assembly for decades.

Casting from a 'pointer to type A' to a 'pointer to type B' is unsafe in all
but a handful of circumstances.

- B is char or unsigned char.

- A is char or unsigned char, _and the pointer was previously cast from a
pointer to type B_.

- Where A is a struct (or a standard layout[0] class in C++) and B is the
first member of that structure.

[0]
[https://en.cppreference.com/w/cpp/language/data_members#Stan...](https://en.cppreference.com/w/cpp/language/data_members#Standard_layout)
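
For what it's worth, the usual way to avoid such a cast is to copy the bytes out with memcpy. A minimal sketch, assuming a 32-bit read is what's wanted; the helper name `load_u32` is made up for illustration and is not from the article:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read a 32-bit value from any byte address.
   memcpy has no alignment requirement, so the read stays well defined
   even when p is not a multiple of 4. The result is in host byte order. */
static uint32_t load_u32(const void *p)
{
    uint32_t v;
    assert(p != NULL);
    memcpy(&v, p, sizeof v);
    return v;
}
```

Something like `load_u32(buf + 1)` on an unsigned char buffer is then fine regardless of where `buf` happens to sit. Compilers commonly turn a fixed-size memcpy like this into a single load on targets that allow unaligned access, though that is an optimization and not something the standard promises.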

~~~
userbinator
_You haven't been able to assume trivial translation of C code to assembly
for decades._

Then perhaps everyone should continue to do that, and continue to _very
strongly_ complain to the compiler-authors who seem to have become absolutely
engrossed in blindly following the standard to the letter and completely
ignoring the fact that people are wanting C precisely because it's supposed to
be close to Asm. But I guess it satiates their egos more to point and laugh...
"we're following the standard, fuck you for thinking we care about anything
else."

~~~
nly
If the standard didn't allow this optimization then GCC would have to emit
unaligned access instructions everywhere. If it did that then people who want
C to be 'close to the machine' would be complaining about lackluster
performance and calling for everything to be written in ASM.

The reality is that the abstract machine modeled by C is a long, long way from
what modern CPUs actually do, and the C language standards committee seems to
have little interest in extending their language.

~~~
userbinator
If you were compiling for an architecture that doesn't allow unaligned access,
then you'd expect unaligned accesses to fault. If you were compiling for an
architecture which does, then you wouldn't. That's what sane undefined
behaviour should be expected to do.

------
saagarjha
Previously:

[https://news.ycombinator.com/item?id=17910851](https://news.ycombinator.com/item?id=17910851)

[https://news.ycombinator.com/item?id=12889855](https://news.ycombinator.com/item?id=12889855)

Some new developments:

> C++ allows us to write the same function in much more readable way by
> employing some template programming. We’ll introduce a generic type called
> const_unaligned_pointer.

If you support C++20, there's std::bit_cast:
[https://en.cppreference.com/w/cpp/numeric/bit_cast](https://en.cppreference.com/w/cpp/numeric/bit_cast)

Edit: fixed link. Thanks, nkurz.

~~~
userbinator
I think something has gone _very_ wrong with the state of programming when the
original C version reads extremely straightforwardly, the version that works
without UB already looks quite a bit more noisy, and the C++ version is
just... _yuck_!

In this case, just writing the appropriate Asm instructions themselves
would've been much shorter and simpler, and also worked the first time.

~~~
userbinator
...and something has certainly gone very wrong when you get downvoted for
pointing out the truth! A lot of this idiotic complexity increase would be
completely avoided if compiler authors would just exercise some common sense,
but unfortunately it's not so common after all.

------
peteri
This is just expected behaviour isn't it? So many processors require aligned
access from the original C target of the pdp-11 onwards. That it happens to
work on x86 is pure chance.

------
bcoates
Am I the only one weirded out that the author didn't even consider
writing/benchmarking the obvious byte-at-a-time version (in a loop, and/or
unrolled) before resorting to a nonportable/incorrect 'optimized' version?

Versions of GCC old enough to drive are pretty good at generating fast code
from byte operations and summing up bytes seems like the kind of transparently
analyzable case where modern compilers really are sufficiently smart.

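
A rough sketch of what such a byte-at-a-time version could look like, assuming the goal is to sum 16-bit little-endian words; the function name `sum16_le` and the handling of an odd trailing byte are assumptions made for illustration, not taken from the article:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical byte-at-a-time sum of 16-bit little-endian words.
   Every read goes through unsigned char, so the alignment of p
   never matters and no pointer cast is needed. */
static uint64_t sum16_le(const unsigned char *p, size_t n)
{
    uint64_t sum = 0;
    size_t i;
    assert(p != NULL || n == 0);
    for (i = 0; i + 1 < n; i += 2)
        sum += (uint64_t)p[i] | ((uint64_t)p[i + 1] << 8);
    if (n & 1)  /* assumption: an odd trailing byte counts as a low byte */
        sum += p[n - 1];
    return sum;
}
```

Whether a given compiler vectorizes this loop well would have to be checked by benchmarking, but it does behave the same on aligned and unaligned buffers.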
------
pjc50
Challenge for all the anti-C people here: what other language (a) lets you set
up this situation in the first place and (b) avoids the problem "optimally"?
That is, generates performant assembler on x86 and doesn't crash on ARM?

(a) is a much harder criterion than it sounds!

~~~
benibela
Pascal probably

