
A bug story: data alignment on x86 - sconxu
http://pzemtsov.github.io/2016/11/06/bug-story-alignment-on-x86.html
======
ogoffart
Again someone who relies on undefined behavior. Casting pointer of wrong
alignement is not a platform specific behavior, it's an undefined behavior.
Relying on it is an error.

The author did not know "What Every C Programmer Should Know About Undefined
Behavior": [http://blog.llvm.org/2011/05/what-every-c-programmer-
should-...](http://blog.llvm.org/2011/05/what-every-c-programmer-should-
know.html)

Another good link about that:
[http://blog.regehr.org/archives/213](http://blog.regehr.org/archives/213)

~~~
loup-vaillant
Again someone blaming the victim. I'm kind of sick of this.

While true, your comment doesn't get at the root of the problem. The obvious
fact here is that there is a mismatch between the C standard and how the users
really use it. The less obvious fact/opinion/fallacy, is that it is not
automatically the user's fault.

 _The standard could be wrong._

Sure, there are reasons why such and such behaviour ended up undefined. Those
reason are sometimes weak however: some behaviour ended up undefined on _all_
platforms because _some_ of them couldn't handle it reasonably. The alignment
bug here is such an example.

To this day I don't understand why undefined behaviour wasn't specified on a
platform-by-platform basis. We already have implementation defined behaviour,
after all. I _guess_ this is because it lets compiler writers unify their
front-ends and optimizers, but as a result, we cannot use our platforms to
their fullest potential.

~~~
benchaney
You are missing the point. The standard could be wrong, but it is still the
standard. You still have an obligation to conform to it if you are writing C
code. It is a prescriptive document, not a descriptive one. It can only be
wrong to the extent that it can make poor decisions. Regardless of what it
says it is authoritative.

~~~
loup-vaillant
Did I gave the impression we don't need to conform to that standard? That
wasn't my intention. I just wanted to point out the standard is sometimes
stupid.

Of course we have to conform to the standard, however crazy. The only
alternative is forking the language itself.

------
qb45
The correct solution for GCC is specifying 1-byte alignment for this
particular array:

    
    
      #include <stdlib.h>
      #include <stdint.h>
    
      typedef uint32_t __attribute__((__aligned__(1))) uint32_t_unaligned;
    
      uint64_t sum (const uint32_t_unaligned * p, size_t nwords)
      {
          uint64_t res = 0;
          size_t i;
          for (i = 0; i < nwords; i++) res += p [i];
          return res;
      }
    

Probably works on clang too and IIRC the MS compiler provides similar
functionality with different syntax. AFAIK there is no portable solution.

And I'm not sure how exactly this code will fail on architectures which don't
support unaligned uint32_t.

~~~
speps
In C++11 there is a standard for this :
[http://en.cppreference.com/w/cpp/language/alignas](http://en.cppreference.com/w/cpp/language/alignas)

~~~
d33
Anybody knows how Rust handles this problem?

~~~
alkonaut
Not sure if this is what you are asking: Last time I tried alignment in Rust I
worked around the lack of explicit alignment support by adding a zero length
array of the correct size to the end of the struct. Not sure if alignment
support from proper attributes has landed yet.

    
    
       [repr(C)]
       struct Something
       {
          pub foo: f32,
          pub _alignment: [EightBytes, 0]
       }
    

where "EightBytes" is a data type of size 8, to align the whole struct on 8
bytes.

~~~
steveklabnik
Structs already insert padding to give them alignment:

    
    
        struct One {
          foo: u8,
        }
        
        struct Two {
          bar: u16,
        }
        
        struct Three {
          foo: u8,
          bar: u16,
        }
        
        struct Four {
          foo: u16,
          bar: u16,
        }
        
        fn main() {
            assert_eq!(1, std::mem::size_of::<One>());
            assert_eq!(2, std::mem::size_of::<Two>());
            assert_eq!(4, std::mem::size_of::<Three>());
            assert_eq!(4, std::mem::size_of::<Four>());
        }

~~~
alkonaut
What I needed to do was to place e.g. a N byte struct exactly on an M byte
alignment (e.g. 11 byte struct on 32 byte alignment etc).

~~~
steveklabnik
Ah! Yeah, that's different, and as far as I know your way is the current right
way to do it, but I'm not an expert on the subject.

------
rwmj
These SSE instructions that operate only on aligned data are a pain. It's not
well known that Linux/x86 stack frames must always be 16 byte aligned. GCC
uses this knowledge to use the SSE aligned instructions when accessing certain
fields on the stack.

Unfortunately a while back the OCaml compiler generated non-aligned stack
frames. Which is no problem for pure OCaml code and even saves a little bit of
memory. However if the code called out to C, then _sometimes_ and
unpredictably (think different call stacks, ASLR) the C code would crash. That
was a horrible bug to track down:

[https://caml.inria.fr/mantis/view.php?id=5700#c10779](https://caml.inria.fr/mantis/view.php?id=5700#c10779)

~~~
userbinator
Agreed, I've always found them unusual and perhaps a bit of a shortsighted
decision --- they've been making processors seamlessly handle any alignment
with perhaps an extra cycle, even for the MMX instructions, yet somehow felt
the need to restrict much of the SSE ones into aligned and only provide one
unaligned move.

The stack alignment restriction is also annoying when handwriting Asm,
although fortunately it's only when calling into other C libraries that it
needs to be minded.

~~~
jjoonathan
> seamlessly handle any alignment with perhaps an extra cycle

I'm not up to date on the latest mitigation strategies, but the hairball of
cache implications caused by unaligned access make me suspicious of that
claim. If you (or your compiler) signal that you want performance by using
vector instructions, I think it's completely fair for Intel to demand that you
pay attention to alignment.

~~~
qb45
Case in point: a simple example of code which copies 1MB of data with SSE and
slows down on misalignment:

[https://news.ycombinator.com/item?id=12718625](https://news.ycombinator.com/item?id=12718625)

Presumably due to the hairball of cache implications, as you put it.

But it also is true that the choice of aligned/unaligned instructions makes no
difference if the array is aligned.

------
pjc50
Well, that's pretty horrendous. Note that the naive code which just casts the
input to uint16_t would work fine. I can't help but wonder if the solution to
this might have been better expressed as naive implementation + platform-
specific _assembly_ implementation.

After all, if you have to understand the underlying instructions executed in
order to fix the problem, why not stop trying to make the compiler emit the
"right" instructions and just write them yourself?

(Language lawyers: is casting a char* to a uint32_t* actually defined
behavior? For unaligned data?)

~~~
Kubuxu
No it is not, the author even cited the rule from C99 himself.

> 6.3.2.3 A pointer to an object or incomplete type may be converted to a
> pointer to a different object or incomplete type. If the resulting pointer
> is not correctly aligned for the pointed-to type, the behavior is undefined.
> Otherwise, when converted back again, the result shall compare equal to the
> original pointer. When a pointer to an object is converted to a pointer to a
> character type, the result points to the lowest addressed byte of the
> object. Successive increments of the result, up to the size of the object,
> yield pointers to the remaining bytes of the object.

The fist sentence is what is in play over here. If you have undefined
behaviour in your program anything can happen.

------
GlitchMr
Compiler is allowed to assume alignment of pointers (what are you doing is
creating a pointer to a value with invalid alignment, hence undefined
behaviour (just creating a pointer is undefined behaviour)). The correct
solution would be to read values indirectly. For example, a function like that
could be used to replace every access to "q" variable.

    
    
        static uint32_t read(const char *p, size_t index) {
          uint32_t out;
          memcpy(&out, &p[index * sizeof out], sizeof out);
          return out;
        }
    

A compiler can recognize this pattern, and continue to use unaligned accesses
that would work.

This has a cost of unaligned accesses on non-x86 platforms (a quite big at
that), but considering the original code didn't work on these at all, it's an
improvement.

~~~
amadvance
Is this equivalent ? At least to me this seems a more common pattern than
using memcpy(), and I expect to work better on old compilers.

    
    
      static uint32 read(const void *ptr) {
        const uint8 *b = (const uint8 *)ptr;
        return (b[3] << 24) | (b[2] << 16) | (b[1] << 8) | (b[0]);
      }

~~~
qb45
Functionally it seems equivalent, except that it doesn't work on big endian
machines (assuming the ints are in native endian).

On dumb compilers it may be faster if a few bitops happen to be faster than a
call to memcpy. Which sounds plausible.

On modern compilers it boils down to whether your compiler can optimize either
one or both of these patterns into a single unaligned load. GCC for example
certainly optimizes 4 byte memcpy, probably already at -O2, but whether it
recognizes your pattern I don't know. Compile it and check.

------
slededit
At some point you might as well byte the bullet and just write the code in
assembly.

~~~
krylon
> byte the bullet

Is that a clever pun or Freudian slip?

~~~
slededit
Accidental but I'll keep it.

------
ambrop7
Note that even if you try to manually correct the pointer to work on aligned
data (read any initial bytes via char pointer and read the rest via uint32_t
pointer), you still generally have undefined behavior: strict-aliasing
violation. And the worst thing here is that whether you do have a violation
depends on how _other_ code accesses the same data / how the object is
initially declared. E.g., you're fine if the original declaration is char[] or
uint32_t[], but not if it's uint16_t[]. Because that would entail access to
the same data via both uint16_t and uint32_t, a violation of strict-aliasing.

Actually two out of three inet checksum implementations in lwIP have this bug
[1].

And like the problem discovered in the article, this is NOT theoretical. I
have personally seen code "miscompiled" due to strict aliasing violations (in
that case, packed structures were involved).

I think the only way to do this "manual alignment handling" is to use
assembly, either by writing the entire thing in assembly, or using inline asm
sections for doing the individual 32-bit memory reads/writes.

Funny story... When I was looking for a fast inet checksum implementation to
use for an embedded ARM project, I took the one from RTEMS, which is written
in C with much inline asm, and like the lwIP code, it has strict aliasing
violations (and also problems compiling correctly with clang). What I did was,
compiled it to assembly with gcc once, then included this compiled assembly in
the source code. Assuming that this was compiled correctly, I don't need to be
afraid of future compiler change breaking it.

[1]
[http://git.savannah.gnu.org/cgit/lwip.git/tree/src/core/inet...](http://git.savannah.gnu.org/cgit/lwip.git/tree/src/core/inet_chksum.c)

~~~
dmm
Strict aliasing can worked around by casting through a union, essentially
introducing a new type which can alias any of its component types. Mike Acton
has a good description of it with lots of assembly examples here:

[http://cellperformance.beyond3d.com/articles/2006/06/underst...](http://cellperformance.beyond3d.com/articles/2006/06/understanding-
strict-aliasing.html)

------
lukego
Related Snabb experiments with IP checksum in C with automatic vectorization,
C with vector intrinsics, and AVX2 assembler:
[https://github.com/snabbco/snabb/pull/899](https://github.com/snabbco/snabb/pull/899)

------
novia
I'm taking assembly right now, and we're working on our first RISC project
after spending all semester working with the x86. Why does RISC crash if the
bytes are not aligned?

~~~
brassic
Early RISC processors read data from memory 32 bits (4 bytes) at a time, and
these reads had to be aligned on 4-byte boundaries. This was a feature of the
memory architecture, not just the processor.

Thus an aligned read of a 32-bit integer took one memory access; an unaligned
read took two, which took twice as long. This killed performance.

Rather than quietly performing badly, the processor threw an exception to
encourage you to fix your code.

The ARM2 worked a bit differently. It ignored the bottom two bits of the
address when it read a value from memory. When the read was complete the value
was rotated by the value of the bottom 2 bits multiplied by 8. This had the
effect of putting the byte referenced by the full address in the bottom 8 bits
of the 32-bit register. A flag in the instruction let you optionally mask off
the top 24 bits to simulate a byte read.

------
Hello71
You can probably also just pass -fno-strict-aliasing to gcc.

------
ot
If you're willing to use compiler extensions, you can avoid the memcpy by
using packed structs. This can generate better code.

Folly has a generic `loadUnaligned()` that uses this trick:
[https://github.com/facebook/folly/blob/5d52fb8c30e567403b8cc...](https://github.com/facebook/folly/blob/5d52fb8c30e567403b8ccb65e5c1a159fb92d707/folly/Bits.h#L539)

------
phkahler
What if you put the array in a struct and made a union of both uint32_t and
uint8_t? Would the union with the larger size force the compiler to generate a
4-byte aligned array for the bytes?

I suggest this because it would be portable without any compiler specific
stuff.

~~~
mzs
It's already too late, the data is read from a file so the base of the array
can be say at address 0xYYYYY2.

------
koverstreet
Attribute((aligned)) might be useful here.

------
amelius
TL;DR: Even though most instructions of your processor (x86) allow data to be
aligned on any byte, your compiler might not.

------
pklausler
So much HTML to complain about C working the way C is defined rather than the
way the OP wants it to work! It's not that hard to write a fast
ones'-complement checksum that's portable and compliant, but whining's always
easier than coding.

