
A bug story: data alignment in C on x86 (2016) - fanf2
http://pzemtsov.github.io/2016/11/06/bug-story-alignment-on-x86.html
======
dronemallone
Why not use the routines from the Linux kernel, which are battle tested?

[https://elixir.bootlin.com/linux/latest/source/arch/x86/incl...](https://elixir.bootlin.com/linux/latest/source/arch/x86/include/asm/checksum_64.h#L45)

Or even NetBSD: [http://ftp.netbsd.org/pub/NetBSD/NetBSD-
current/src/sys/net/...](http://ftp.netbsd.org/pub/NetBSD/NetBSD-
current/src/sys/net/npf/npf_inet.c)

EDIT:

freebsd:
[https://github.com/freebsd/freebsd/blob/master/sys/netinet/i...](https://github.com/freebsd/freebsd/blob/master/sys/netinet/in_cksum.c)

openbsd:
[https://github.com/openbsd/src/blob/master/sys/netinet/in4_c...](https://github.com/openbsd/src/blob/master/sys/netinet/in4_cksum.c)

~~~
ahmedalsudani
It’s impressive of you to dig out all those implementations! Do you have a
trick for navigating codebases?

~~~
saagarjha
Not to downplay their achievements, but it seems to me that they just dug up
the implementations in the IP stack for each OS.

~~~
dronemallone
That's true, pretty easy to do once you look in the "net" or "inet" or
"netinet" folders and then look for ipv4.

------
saagarjha
The correct way to do this is to use memcpy, as was pointed out in the
article. Given a half-decent compiler, it will give the most portable,
standards-compliant, and often fastest code. Please just use memcpy instead of
limiting yourself with "defined (__GNUC__) && (defined (__x86_64__) || defined
(__i386__))".

~~~
mjg59
The article points out that, for the author's use case, memcpy _doesn 't_
generate the fastest code.

~~~
saagarjha
I qualified my statement with an "often" for a reason ;)

But looking back at the article, it seems like the author is complaining that
the compiler used an unaligned move instead of an aligned one–but based on the
code they wrote, this seems necessary. The compiler has no knowledge of any
data alignment (how would it know that the data is always aligned without
telling it?), so it emits conservative instructions.

~~~
mjg59
Not really - the initial implementation (without memcpy) resulted in an
aligned move, and hence the crash. The memcpy implementation results in an
SSE-based unaligned instruction that does the right thing, but is slower (in
the specific case that the author cares about, ie cases where the loop will
rarely trigger and will basically never go through more than three iterations)
than not using SSE. There's no real way to hint to the compiler that this is
the case, so the author chooses to disable the use of SSE in that function on
architectures that support it - this way they can use memcpy() (for
correctness) and still get the hoped for code generation.

------
firethief
> I didn’t have to bother about portability

This is where it became evident that the class of bug was going to be "assumed
the language's machine model gets out of the way if you know the target
platform". It's funny how easily we all forget about the nasal demons, but UB
!= implementation defined.

~~~
tom_
But it _could_ by treated equivalently in many cases. The standard, after all,
imposes no restrictions. It might not explicitly say that the implementation
could behave like that - I don't know what phrase would work - "a documented
manner characteristic of the platform", something like that - but by imposing
no restriction it leaves the option open.

Which is a large part of why people complain.

~~~
gok
That’s not accurate in this example. In C, uint32_t* literally means “a 4
byte-aligned pointer to a 32 bit unsigned int” on every platform. The author
wasn’t getting bitten by UB so much as misunderstanding how pointer types work
in C.

~~~
jcranmer
uint32_t literally means "an unsigned integer type with width 32 [bits] and no
padding bits" (I'm quoting from the C standard here). It does not imply
anything about what the alignment is; a 16-bit processor may well defined the
alignment of the 32-bit unsigned integer type to be 2 bytes instead of 4.

The incorrect logic that causes the bug is thinking "I know that I can do X in
assembly, therefore the naive translation to C should let me do the same
thing." The idea that C is portable assembler is very widespread, but it is
quite wrong, and it has been wrong for decades. A lot of people complain when
they find out that the compiler doesn't work when they to write what is
effectively assembler in C syntax, but quite frankly, many more people
(including many of those who want it to be portable assembler) would complain
if the compiler were to actually treat _all_ C code as portable assembler.

~~~
tropo
Got an alternative?

The trouble here is that lots of programmers want/need a language that is like
a portable assembler. It's all well and fine to say that C isn't such a
language, but then what language shall we use?

Since there isn't an alternative, C needs to be that language.

~~~
jcranmer
The alternative is to sit down and design a new language for this task, rather
than to try to force C to be that language, which is a task that it is
successful enough sufficiently often to be touted as the solution and fails at
sufficiently often for people to complain that it fails.

The problem is that almost no one needs a portable assembler, and even those
who want it very often only want it for very small snippets of code. If you
want to specify something very exactly that is legal only because of the
particulars of your machine, then use assembly. There are certainly a few
things that compilers could expose that are poorly exposed in most languages
(floating point environment, various flag bit manipulation including checked
overflow and add-with-carry, vector instructions, synchronization barriers,
underaligned types). But you don't need portable assembly for those things,
just better compilers.

~~~
userbinator
_rather than to try to force C to be that language_

Ironic, considering that C was specifically designed for that from the
beginning.

~~~
pjmlp
On the context of PDP-11 class hardware.

C isn't the first portable systems programming language, it just became
ubiquitous thanks to UNIX.

------
dmitrygr
Why the surprise? x86 ABI says that a uint32 shall be 32-bit aligned. The fact
that CPU can cope with others does not mean ABI allows it. Compiler complies
with ABI and assumes your uint32* is 32-bit aligned.

~~~
crististm
I assume because ABI is just a convention for application interoperability.
Hardware interface was the problem here.

You can ignore ABI if you are standalone. But you can't ignore the CPU.

~~~
dmitrygr
Compiler sees "function" and understands "ABI says that pointer is 32-bit
aligned, I can count on that because whoever calls it will also be ABI
compatible and make it so"

~~~
crististm
You confirmed my interoperability claim with your answer - whoever calls.

Consider just the main function. You are free to ignore the ABI there but you
still need to obey the cpu spec.

~~~
dmitrygr
Not really. If main ignores abi for example by not saving callee saved regs,
you'll crash. Ignoring abi is never safe unless your entire project is written
in asm

------
ericand
Also posted on reddit which has some interesting discussion:
[https://www.reddit.com/r/cpp/comments/5bn8jx/a_bug_story_dat...](https://www.reddit.com/r/cpp/comments/5bn8jx/a_bug_story_data_alignment_on_x86/)

------
pcwalton
In Clang, you can get the adc instruction using __builtin_addcl (and similar),
but the compiler isn't that great of making optimal use of it. A few minutes
of playing around got me [1], which isn't very good code.

[1]: [https://godbolt.org/z/hTzECo](https://godbolt.org/z/hTzECo)

------
userbinator
_Alignment issues are not only relevant for RISC processors. SSE makes them
valid for x86, too (in both 32 and 64-bit mode)._

Requiring alignment with certain SSE instructions was the original poorly-
thought-out misstep that lead to this bug. One of the basic principles of CISC
is to implement instructions flexibly in hardware even if it requires
microcode, then optimise that hardware to gain a performance advantage with
each new generation.

If the original SSE instructions worked fine with unaligned data, perhaps
being slower in such cases like all the other instructions did, then no
expectations would be violated and everything would've worked out nicely. The
fact that the explicitly unaligned version is now actually slightly _faster_
in the benchmarks shown (because there doesn't need to be extra code to do
alignment) is evidence enough that the original decision to have separate
aligned/unaligned versions was really shortsighted. Very wide memory buses and
multistage pipelines make alignment pretty much a non-issue.

The other big point this article doesn't mention is that rather than fighting
with the compiler, if you already know what instructions you want, it would be
better to write the Asm directly. For short sequences like these the Asm is
going to be even clearer and shorter than "truly portable" C. In practice
there really aren't that many architectures your code is going to run on.

------
Asooka
A note of warning: memcpy and memmove are not optimized away on msvc and icc.
That is, every call to those functions _is always_ a function call. Contrast
to gcc and clang which always inline those calls even in no-optimizations
builds. To do away with memcpy calls, you can use this:

    
    
        _Bool check_ip_header_sum(const char* p, size_t size) {
            struct Chunk { char arr[4]; };
            const Chunk* chunks =
                               (const Chunk*)p;
            union {
                Chunk chunk;
                uint32_t n;
            } helper;
            uint64_t sum=0;
            for (size_t i=.....) {
                helper.chunk=chunks[i];
                sum+=helper.n;
            }
            ........
        }
    

Technically speaking, going through a union like that is _also_ undefined
behaviour, however this is used everywhere in the Linux kernel, so both GCC
and Clang will support it indefinitely, while MSVC is very conservative
anyway, so you can rely on it supporting it as well. There is pretty much zero
reason for the compiler to produce garbage code here. Note also that you can't
cast p to the union type directly, because the union type has uint32_t
alignment, so you have to do it this way. And you have to go through a struct,
because you can't copy-assign arrays... Anyway the union+chunk trick seems
more readable to me as well, but that's down to personal preference.

~~~
Karliss
When was the last time you checked that msvc and icc always use a function
call form memcpy? At least the latest versions of msvc and icc can replace
memcpy with appropriate move instruction. You can see here
[https://godbolt.org/z/BX0Wya](https://godbolt.org/z/BX0Wya) that msvc 2017,
msvc 2015, icc 18.0 and even icc 13.0.1 used a mov instruction for memcpy to
uint32 instead of function call.

~~~
userbinator
I believe even MSVC6 will inline memcpy() (and a few others), but it has to be
enabled with one of the optimisation switches (intrinsic functions.)

------
scottlamb
Can't this also segfault by overreading across a page boundary?

It's assuming "size" is >= 20 and is a multiple of 4. That might be true for
all valid headers, but at least in this snippet, there's no comment or
anything stating the precondition that callers ensure the headers are valid.
That's a(nother) bug.

If it's not true, it will read too much data. This means producing the wrong
result. It also could make the read happen to cross a page boundary into space
that isn't mapped into the process's virtual address space.

~~~
klodolph
Depending on which page boundary, increasingly severe exceptions will be
raised, but it is typical for an OS to service those exceptions by performing
the operation and jumping back into the code… maintaining the illusion that it
worked.

This is what happened to _all_ unaligned accesses in the PowerPC 601 -> 603e
transition in ~1995. Game developers writing their own “optimized” blit code
suddenly found that every single unaligned access on the 603 would succeed,
but it would go through an ISR to get there!

I feel for these kind of routines, either write it in dumb, straightforward C
code and just use -O2 or -Os, or make an assembler version—possibly just a
saved copy of the compiler’s output, but that’s enough to completely kill any
interprocedural optimization. Trying to write ultra-optimized C code gives me
a bit of a headache, unless it’s something that you can express
straightforwardly as SSE intrinsics or something like that. You end up going
through contortions to satisfy constraints imposed by the language which
aren’t imposed by the machine.

> It's assuming "size" is >= 20 and is a multiple of 4. That might be true for
> all valid headers, but at least in this snippet, there's no comment or
> anything stating the precondition that callers ensure the headers are valid.
> That's a(nother) bug.

That’s too much of a reach to call it a bug. The IP header’s size is measured
in multiples of 4 bytes, there is simply no way to encode a header which is
not a multiple of 4 bytes. It becomes a question of how gracefully you want to
handle _true_ garbage being passed into the function… at best this is a matter
of style. Should strlen() call assert(p!=NULL)? These days with asan, I’m not
so sure that writing these assert() is the best way to spend my time.

~~~
scottlamb
> Depending on which page boundary, increasingly severe exceptions will be
> raised, but it is typical for an OS to service those exceptions by
> performing the operation and jumping back into the code… maintaining the
> illusion that it worked.

I'd expect crossing a boundary between two valid pages to be fine; that's just
part of making unaligned accesses work.

I'm talking specifically about an unmapped page. I don't have any data, but
I'm pretty confident access to an unmapped page (such as null pointers) is a
much more common source of SIGSEGV in general than unaligned access, to the
extent that most programmers probably don't even know the latter is a
possibility.

> That’s too much of a reach to call it a bug. The IP header’s size is
> measured in multiples of 4 bytes, there is simply no way to encode a header
> which is not a multiple of 4 bytes.

I take a pretty hard stance on this kind of thing.

The function might be used in a way you're not expecting. The article
mentioned the IP frames can be read from a file. Maybe they're in some record-
delimited format that states length in bytes, and that's the length that gets
passed in rather than the original one. Redundant framing information like
that is pretty common I think when you're using some general-purpose code (a
library that handles a variety of record formats as well as some functions
that operate on specific types like IP frames). Those records could get
corrupted, even maliciously so; callers probably expect then to get back a
nonsense result then but not an out-of-bounds read.

Or there could be a bug in the logic that decodes the header's length. By
itself that won't cause memory errors, but combined with this assumption it
does, and so I'm uncomfortable with the assumption being implicit.

I think a lot of C code's memory safety is dependent on this sort of implicit
assumption that rots over time (if it was ever correct to begin with), and
every time I see it I prefer memory-safe languages a little more.

Working in a language (Rust) that requires you to explicitly annotate such
places with "unsafe" and have your public interfaces be "sound" (have it be
impossible for safe code to cause the unsafe code to do the wrong thing) makes
this more clear to me.

> Should strlen() call assert(p!=NULL)?

If it behaves in a surprising way when called with NULL, it should mention
that case (either in an interface comment/manpage or via explicit assert). In
practice on modern machines strlen raises SIGSEGV. [1] When folks see a
SIGSEGV from strlen, they probably think wild pointer or missing terminator
and will figure out exactly what happened before long.

[1] Although I imagine it's actually undefined behavior according to the C
standard, so now I wonder if some compiler does / my compiler will someday do
something more strange, and I'm sad.

~~~
tedunangst
Calling string functions on null pointers is mostly undefined, which can lead
to all the same problems as dereferencing. Notably the compiler can now delete
"extraneous" null checks.

~~~
scottlamb
Makes sense. Do you happen to know of any way this can cause null pointer
dereference to result in something other than SIGSEGV in a "normal" situation
(userspace, modern POSIX system, modern hardware, no prior mmap(NULL, ...,
MAP_FIXED, ...) call, no huge offset into the null pointer)? I can see how it
would if you relax any of those conditions. In particular, I see an example of
the deleted null check you're describing here:
[https://lwn.net/Articles/342330/](https://lwn.net/Articles/342330/)

In any case, I'd still say strlen(NULL) is qualitatively different than
check_ip_header_sum(malloc(4095), 4095). The former's undefined behavior is
standard (strlen is expected to dereference the pointer, so it makes sense
that it has whatever problems doing that has). The latter shouldn't be
expected to exceed the stated bounds without that being clearly stated.

~~~
tedunangst
strlen(null) is pretty straightforward, although being a pure function the
compiler might rearrange things.

    
    
        if (!s) puts("hi");
        strlen(s);
    

I don't think that's guaranteed to print hi if s is null.

More fun is memset(0,0,0) which is technically verboten.

------
valarauca1
code sample + name (and already dealing with SIMD alignment issues) told me
what the problem is.

Pointer casts are ALWAYS bad in C. Try to avoid them. Reply and down vote you
old hat C programmers who will defend this. Your old practices are actively
harmful to people learning the language.

~~~
rightbyte
How do you prefer to use e.g. malloc?

~~~
valarauca1
you mean `mmap`?

Also yes I'm radical enough to suggest the 50+ year old C standard library API
is bad. Which isn't really a radical position >.>

~~~
deathanatos
mmap and malloc do very different things. You didn't really clarify _why_ you
thought malloc() was bad, or why you think someone should call mmap(), so I'm
mostly left to guess.

mmap, for example, is free to require that the requested range be a multiple
of a page; trying to use it as if it were malloc() would, on such systems,
cause any allocations less than a page to require a whole page. And since this
is a rather common case, you'd be using a _ton_ more memory.

malloc(), however, can subdivide a large allocation from the OS into small
allocations for the process; it is my understanding that most modern malloc()s
will do this, in some form. (It is also my understanding that some malloc()
implementations will use mmap() for allocating memory from the OS…)

~~~
valarauca1
`malloc` is part of the standard library. the underlying implementation is
environmental specific, or implementation specific depending on the host
application.

Generally speaking `mmap` is the system call that

* Allocates memory

* Moves allocations around

`malloc` is the common entry point into a library that attempts to ensure you
invoke the `mmap` system call as little as possible.

    
    
       trying to use it as if it were malloc() would, on such systems,
       cause any allocations less than a page to require a whole page.
       And since this is a rather common case, you'd be using a ton more
       memory.
    

This is a misconception [1]. Malloc is governed by hardware limitations. If
you ask for 50bytes, it'll give you a full 4KiB page, because hardware can
only segment memory in 4KiB chunks. Actually some allocators will attempt to
_use_ those non-allocated regions of the remaining ~4046 bytes for other
allocations in order to be faster (less systemcalls), this is why sometimes
writing past buffer ends can give you garbage data instead of segfaults.

[1] [https://stackoverflow.com/questions/45458732/how-are-
memory-...](https://stackoverflow.com/questions/45458732/how-are-memory-
allocations-smaller-than-page-size-handled-in-linux)

~~~
deathanatos
The link you post is about _mmap_ ; the section you quote is about _malloc_.
The link is correct — about _mmap_ ; but that limitation does not apply to
malloc. The link precisely backs up the part you're quoting, so saying "This
is a misconception" makes no sense.

> _If you ask for 50bytes, it 'll give you a full 4KiB page, because hardware
> can only segment memory in 4KiB chunks._

No, not really. (But you seem to realize this, as you contradict this point in
the next quote I'll pull.) A malloc implementation is free to divvy up a page
in order to service several allocations. Just because the pointer it returns
to you is somewhere within a page that must, almost by definition, be a full
page is irrelevant to the point of this discussion (that malloc() is not
needed b/c mmap exists, and is somehow superior). The point is that the
allocation consumes, in the grand scale of things, an amount on par with the
requested size. (There is _some_ overhead, of course, but not nearly the same
are the mmap-must-return-a-full-page overhead.)

> _Actually some allocators will attempt to _use_ those non-allocated regions
> of the remaining ~4046 bytes for other allocations in order to be faster
> (less systemcalls), this is why sometimes writing past buffer ends can give
> you garbage data instead of segfaults._

That was part of the point I was making.

You seem to have misunderstood my post. malloc() is free to divvy up a page
(potentially handed to it by mmap) into several allocations. mmap() cannot.
The point here is that replacing every malloc() call with a call to mmap()
results in the loss of that ability to divvy up pages, requiring each
allocation to be a full page; this would significantly bloat the memory
consumed by an application, as small allocations would not share pages. Hence
why those two calls are not equivalent, which is what the original post:

> _you mean `mmap`?_

> _Also yes I 'm radical enough to suggest the 50+ year old C standard library
> API is bad._

Very much seemed to imply. Though, _again_ , I asked for clarification on your
point, to make sure I'm understanding it, and you again failed to provide it.
So, again,

> _You didn 't really clarify why you thought malloc() was bad, or why you
> think someone should call mmap() [in lieu of malloc()], so I'm mostly left
> to guess._

~~~
valarauca1

        A malloc implementation is free to divvy up a page in
        order to service several allocations
    

But like the hardware won't enforce that. So yeah you can malloc 50bytes, but
the hardware is just going to assume there is a 4096 based page behind it.
Because what memory can/cannot be written/read too is controlled by mmap ->
linux -> hardware MMU.

So when ever `malloc` gives you 50bytes without allocating an entire page it's
just lying to you and giving you a pointer offset from another allocation.
This isn't strictly being enforced by anything. There is zero contract that
enforces what memory you can/cannot use.

~~~
deathanatos
> _There is zero contract that enforces what memory you can /cannot use._

There is, though; malloc's specification forms a contract. You are not allowed
to access outside the bounds of the returned pointer. That _some_ hardware
allows it _sometimes_ does not mean that there is "zero contract".

The function represents an interface, and an interface is a set of rules and
behavior both the user & consumer of that interface expect. If you stray from
that, it's UB, and the machine can do whatever at that point.

In this case, that's what matters. We don't know what hardware malloc or mmap
will run on, and it isn't impossible that someone _could_ implement a malloc()
implementation that _did_ enforce those bounds. valgrind, for example, is very
close to this: it already knows when you access memory outside of the bounds
of a malloc'd region (as that's its purpose: to report such accesses), and it
would be a trivial modification to it to kill the program on an illegal
access.

