Hacker News new | past | comments | ask | show | jobs | submit login
Incrementing vectors (travisdowns.github.io)
98 points by BeeOnRope 49 days ago | hide | past | web | favorite | 44 comments

Ugh. Aliasing rules in C/C++ are very unfortunate, but it's hard to tighten the rules after the fact.

Related: The Strict Aliasing Situation is Pretty Bad (2016) https://blog.regehr.org/archives/1307

>but it's hard to tighten the rules after the fact.

Depends on what you mean by tighten the rules. If you tighten the rules for the optimizer by simply allowing everything to alias, that should not introduce bugs into currently well formed programs. (Except that they might be slower.)

The essence:

> That means that writes through uint8_t pointers are treated as potentially updating any memory of unknown provenance. So the compiler assumes that every increment v[i]++ potentially modifies size() and hence must recalculate size() every iteration.

However, I think that a smart compiler would still be able to sort out that size() is not affected by such writes.

how, without whole program compilation?

edit: specifically for std vector, the implementer could somehow hardcode the knowledge that size can never alias with the buffer, but that would be ad hoc and won't extend to user defined containers, which seems suboptimal.

The compiler could generate both vectorized and non-vectorized variants and decide dynamically which one to use based on a simple aliasing check. That's an approach that would be more usual for JIT compilers than AOT compilers, but habits can change.

Yeah I came here to say this: I don't see a good way to prove this at compile time, but you could always do a dynamic check.

In general, compilers don't generate runtime checks like this. In the case of aliasing problems, however, they actually do make such checks, but usually only in the case of possibly overlapping arrays. An example:


Both gcc and clang make a check to see if the x and y arrays overlap, and if they do they fall back to a scalar loop (instructions ending in ss: scalar single-precision) - but if there is no overlap they proceed with the vectorized approach (instructions ending in ps: packed single-precision).

So I guess these cases were important enough that they got specific handling in the compiler, but I haven't seen it for possible aliasing with a scalar, even though the same type of check could be used. Actually it's a bit trickier, because the compiler wants to know the array size to make this calculation, but the array size itself, which comes from size() is the thing subject to aliasing! So it takes a more complex analysis to show that it's safe.

If I'm reading the generated asm correctly (which I might well be not), the following function is vectorized conditional to a runtime alias check (the non vectorized case is completely removed if the pragma is uncommented). So even scalars seem to be handled:

  void inc(float* x, float*  c, size_t n)
    //#pragma GCC ivdep
    for (int i = 0; i < n; ++i)
      x[i]+= *c;
the issue really seems to be that the compiler just gives up very early if the iteration count is not a loop invariant.

Yeah, I read it the same as you: so alias checks can definitely happen even for single values.

It seems like it's not just the loop counts that's the problem: in the blog post even when size is a local, it fails to check to see if the vector array pointer itself can alias.

I.e., analogous to incB here:


incA is like your example: the possibly aliasing is between the array and the value added, and an alias check occurs. incB is like the v[i] example: possible aliasing is between the array and the pointer to the same array, no check and conservative code is generated.


That's true. In fact GCC should be able to do that in principle. I guess it only does it if it believes that the vectorization is significantly profitable.

edit: it is possible it only does it on PGO builds.

I linked an example [1] in a parallel reply that shows gcc (and clang) doing it at -O3 without PGO, for the case of possibly overlapping arrays.


[1] https://godbolt.org/z/9vZfl_

> how, without whole program compilation?

Dataflow analysis. If size() depends on values that are in one malloc'ed block, and the vector elements are in a different malloc'ed block, then you know there is no aliasing. (For example).

std::vector is entirely defined in a header file (it has to be, it's a template), so whole program optimization (which is what I presume you mean?) doesn't enter into it. The entire definition of vector is within the same translation unit.

I don't know enough about compiler optimization algorithms to say how trivial it would be to figure out these two things don't alias, but when you have a type with the magic property "this guy is allowed to alias ANYTHING", then I imagine it gets pretty tricky.

Definition of the type is not enough, the compiler need to track the origin of all pointer values, which means it needs to 'see', recursively, all callers of the function it is optimizing up to the point where the parameter objects were originally constructed, plus any function where a pointer to the object might potentially escape to (which might be potentially any function, if it escapes to the heap).

And that might still not be enough: passing the vector by value still doesn't trigger the optimization in GCC and clang (but it does for a bare bone vector implementation, probably std vector itself ends up leaking its own internals).

> when you have a type with the magic property "this guy is allowed to alias ANYTHING", then I imagine it gets pretty tricky.

well, yes, that's the whole point of the article :)

> the compiler need to track the origin of all pointer values

If the size and the buffer pointer are both private, wouldn't it be enough to analyze all the member functions to ensure that there is no aliasing between them? Users might try to muck around in the internal representation using memcpy or whatever, but that would presumably be undefined behavior.

User specialized member functions can be used to circumvent the private barrier (IIRC formally you are not allowed to do this for std types). Also elsethread someone discussed the extremely evil example of placement new'ing a vector on another vector's buffer then swapping them.

Surely both of those things are UB?

I mean, it's not like the optimization is impossible (by rejiggering the code a bit, it kicks in), it's just that using uint8_t's makes the optimzation far more brittle, because the compiler has to walk on eggshells around uint8_t's. Right?

the placement new is not UB, at least at a language level. Modifying the vector would break the vector invariant, leading to UB. The specialization is UB only because vector is an std type, otherwise is perfectly legal.

In both cases, yes, if you hardcode the specific std::vector invariants in the compiler, it might be able to do better, but a user provided vector class wouldn't benefit from the same optimisations.

Note that compilers are not beyond hardcoding invariants of standard library facilities: malloc, C string functions, etc are treated as primitives by the compiler (for example, the result of malloc cannot alias with any other pointer), but for a complex class like vector, it does not seem appropriate and it is better to use compiler hooks in the implementation that are also available to the user (restrict, builtin functions, alias annotations, etc).

I'm trying to see if rustc does the same thing: https://godbolt.org/z/6WcYMf

but I am not good at reading this sort of thing. It looks like it has the same issue, but I didn't think rust had the type ambiguity that C++ does around u8 / char?

Rust has no type-based aliasing restrictions at all. However, it does require references to be either unique or read-only, which means that it can automatically treat them as noalias (LLVM equivalent of __restrict). Unfortunately, this feature is disabled due to an LLVM bug:


On the other hand, if you use the more idiomatic `&mut [u8]`, that passes the pointer and size by value, so it should avoid the issue.

Edit: Using an iterator also avoids the problem, unlike in C++, because Rust iterators store their bounds within the iterator itself rather than calling iterable.end() on each iteration.

Regardless of what exactly rustc does right now, in the future Rust's strict aliasing rules will basically ensure Rust does this optimization in every possible case, because Rust reasons at compile time to ensure no mutable references ever alias. Things have been held up for a while on this LLVM bug, though[0].

[0] https://github.com/rust-lang/rust/issues/54878

Last I heard there's a fair chance that this might be implemented at the MIR level instead, allowing it to be unblocked from the LLVM bug.

To answer both my own questions. Rust does not seem to have this same problem: https://godbolt.org/z/JpwcRN

But iterating over the loop with .iter_mut is different than iterating over 0..v.len(), which is interesting.

I'm guessing its bounds-checking, though replacing v[i] += 1 by `*v.get_unchecked_mut(i) += 1` generates yet another bit of code (it seems to somewhat unroll the loop but stops there).

Also you can use `-C target_cpu=native` and get AVX512 instructions on godbolt's.

rustc also doesn't vectorize it?

It does if you use iterators as totalperspectiv did in their alternate version https://godbolt.org/z/JpwcRN

Because Rust bounds-checks indexes, which likely breaks LLVM's vectorisation (you'd have to reimplement vectorisation & vectorised bounds-checking in MIR for it to work I guess e.g. bounds-check the entire chunk before applying the vectorised operation).

Oh, its BeeOnRope. I was wondering who could have investigated this detail to such fine degree.

No comments really, your blogpost seems to answer all my questions. Excellent work!

@BeeOnRope Regarding comments on the blog, have you considered using https://utteranc.es/? It stores them as GitHub issues. I think https://staticman.net/ does something similar.

re Disappointment, the somewhat portable #pragma GCC ivdep in theory would help in a lot of cases.

In this specific case, the compiler should be able to assume that a previous write cannot affect the next read from the size field, but it seems that neither GCC nor clang take advantage of it.

In HPC, we use `#pragma omp simd`.

`#pragma omp imd` should force the vectorization, but maybe not the hoisting, I don't know. Could be nice to check.


Clang fail: ``` <source>:5:5: warning: loop not vectorized: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]

    #pragma omp simd 

1 warning generated. Compiler returned: 0 ```

But ICC seems happy with it.

it works on my test:

where ivdep fails. Nice!

You do need -fopenmp of course.


strangely, it still fails to vectorize the original vector example.

> the compiler should be able to assume that a previous write cannot affect the next read from the size field, but it seems that neither GCC nor clang take advantage of it.


Good point about #pragma ivdep. I feel like the semantics of that one are a bit less than rigorous, but it's a useful tool.

Why they do not take advantage ov ivdep? I don't know. I thought it was because the size address was not a function of the iteration variable, but even by artificially adding the dependency (with a barebone vector implementation) does not seem to trigger the optimization.

I mean why "should [the compiler] be able to assume that a previous write cannot affect the next read from the size field"?

It's not that the size field is a function of the iteration variable (i?) but that the previous write may have overwritten the size field, because the size field is located at some arbitrary place in memory, and the array backing the vector is located at some other arbitrary place and they may overlap (but only because the vector type is uint8_t/char - if it other than a char type the compiler can be sure they don't overlap because the size() fields are pointers which have a distinct type from uint32_t and are not subject to the "char aliasing loophole" rule).

sure, but ivdep is supposed to explicitly tells the compiler that there are no loop carried dependencies between iterations, so iteration i-1 couldn't possibly write to a memory location read by iteration i.

Ah sorry, I read your ivdep sentence and the second part of your post as independent, i.e,. that the compiler should be able to assume this (even without ivdep).

I realize now you were saying "[With ivdep] and in this specific case, the compiler should be able to assume..."

It makes sense now.

What about it?

More-so than the compiler, it depends on the design goals that are implemented in width and speed of the processor's registers, SIMD vector and conventional integer units. For x86, one could readily guess un/signed ints (32 & 64-bit) would be fast; un/signed 8/16-bit math is unlikely to be as fast. That hypothesis is backed-up by this article's data.

Did you read to the end? The final results show that in the end the computation on the vectors with 8-bit elements is 4x as fast (per element) as the computation on the vectors with 32-bit elements.

EDIT ignore the following, I was mistaken: Infuriatingly the article doesn't benchmark signed 8-bit vectors. If the issue is due to the special aliasing properties of pointers to unsigned bytes, signed should just work really fast out of the box, without jumping through any hoops.

The issue is due to the special properties of C character types, which include char, signed char and unsigned char. uint8_t and int8_t are not required to be character types, but they're typically implemented as typedefs of the char types, so they pick up the same properties. Here's the GCC bug discussing this problem: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66110

Thanks. I misremembered C's rules as only unsigned char having this special property (as well as plain char if it's unsigned), but not signed char.

I think you are partly right.

My understanding is that the special aliasing rules apply only to `char` and `unsigned char` in C++, not to `signed char`. In C, however, I think it applies to all three.

AFAIK char is a distinct type from signed char and unsigned char, regardless of whether it is signed or not. In practice it will have the same representation as those but will not be the same type:


So `signed char` would seem to be a way to get a strongly typed char not subject to the aliasing rules (even if bare char is signed), but compilers don't take advantage as far as I can tell.

Just to back you up, this post quotes both standards: https://stackoverflow.com/a/51228315/331041

Did you? it clearly states these aliasing rules apply to character types[0].

[0] https://travisdowns.github.io/blog/2019/08/26/vector-inc.htm...

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact