
Is sorted using SIMD instructions - tomerv
http://0x80.pl/notesen/2018-04-11-simd-is-sorted.html
======
CodeArtisan
When compiling with GCC, the option `-fopt-info-vec-all` gives you information
about the vectorization of the code. In this case, GCC reports

    
    
       // the for block
       <source>:10:24: note: ===== analyze_loop_nest =====
       <source>:10:24: note: === vect_analyze_loop_form ===
       <source>:10:24: note: not vectorized: control flow in loop.
       <source>:10:24: note: bad loop form.
       <source>:5:6: note: vectorized 0 loops in function.
       
       // the if block inside the for block
       <source>:11:9: note: got vectype for stmt: _4 = *_3;
       const vector(16) int
       <source>:11:9: note: got vectype for stmt: _8 = *_7;
       const vector(16) int
       <source>:11:9: note: === vect_analyze_data_ref_accesses ===
       <source>:11:9: note: not vectorized: no grouped stores in basic block.
       <source>:11:9: note: ===vect_slp_analyze_bb===
       <source>:11:9: note: ===vect_slp_analyze_bb===
       <source>:11:9: note: === vect_analyze_data_refs ===
       <source>:11:9: note: not vectorized: not enough data-refs in basic block.
    

edit: with the Intel compiler using `-qopt-report=5 -qopt-report-phase=vec
-qopt-report-file=stdout`

    
    
       Begin optimization report for: is_sorted(const int32_t *, size_t)
    
            Report from: Vector optimizations [vec]
        
        LOOP BEGIN at <source>(12,5)
        
           remark #15324: loop was not vectorized: unsigned types for induction
                          variable and/or for lower/upper iteration bounds make
                          loop uncountable
    
        LOOP END
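For context, the scalar early-exit version these vectorization reports refer to presumably looks something like this (a reconstruction along the lines of the article's code, not a verbatim quote):

```cpp
#include <cstdint>
#include <cstddef>

// The `return false` inside the loop is the control flow that GCC's
// report flags with "not vectorized: control flow in loop".
bool is_sorted(const int32_t* input, size_t n) {
    for (size_t i = 1; i < n; ++i) {
        if (input[i - 1] > input[i])
            return false;
    }
    return true;
}
```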

~~~
stabbles
You can get automatic vectorization (with -O3) like this:

    
    
        bool is_sorted(const int32_t* input, size_t n) {
          int32_t sorted = true;
    
          for (size_t i = 1; i < n; ++i) {
            sorted &= input[i - 1] <= input[i];
          }
    
          return sorted;
        }
    

And the performance is similar to the AVX version (benchmarked on a MacBook
Air early 2015):

    
    
        $ ./benchmark_avx2 1048576
        input size 1048576, iterations 10
        scalar         : 6379 us
        SSE (generic)  : 3544 us
        SSE            : 3704 us
        My example     : 2769 us
        AVX2 (generic) : 2679 us
        AVX2           : 3360 us
    

So I'm getting 2769 us with the above five simple lines of code. It's just 3%
slower than the generic AVX2 version (that might be noise).

~~~
Veedrac
Though this does throw away early-exit, which means it will be many times
slower for unsorted cases.

~~~
stabbles
True, I had hoped GCC would optimize that -- it doesn't :(.

~~~
xamuel
It would make no sense for GCC to automatically add early-exit: that requires
a judgement call about the vectors the function is intended to run on. A
priori, the function might be intended to run on vectors that are almost
always sorted, in which case the extra branch would be severely suboptimal.

~~~
banachtarski
I agree that the optimization isn't possible, but that's for correctness
reasons. The performance penalty of the extra branch is almost negligible due
to branch prediction.

------
redcalx
I think the reduction in the number of branches inside the loop is a
significant factor here, i.e. the _if_ statement is performed per 4, 8 or 16
elements instead of per element. Branches invoke branch prediction logic,
which is non-trivial, so even though it may not slow execution it may
increase power consumption.

On this basis another way of approaching the problem is to loop over the whole
array and return the result at the end, but this of course always takes N
iterations, whereas the nested _if_ can exit early. A compromise might be to
loop over short sub-spans of the array, and do an early-exit test at the end
of each sub-span.

A good sub-span length for scalar code might be around 16, in part because we
hit diminishing returns with longer spans.
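That compromise could be sketched like this (a hypothetical variant, not benchmarked against the article's code): the inner loop over each sub-span is branch-free, so it stays auto-vectorizable, while the early-exit branch runs only once per sub-span.

```cpp
#include <cstdint>
#include <cstddef>

// Branch once per SPAN elements instead of once per element; the inner
// loop is branch-free and therefore remains auto-vectorizable.
bool is_sorted_spans(const int32_t* a, size_t n) {
    constexpr size_t SPAN = 16;  // sub-span length suggested above
    size_t i = 1;
    while (i < n) {
        size_t end = i + SPAN < n ? i + SPAN : n;
        int ok = 1;
        for (; i < end; ++i)
            ok &= a[i - 1] <= a[i];
        if (!ok) return false;   // early exit once per sub-span
    }
    return true;
}
```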

Also I think is_sorted_asc() or is_sorted_ascending() might be a better name
if this were for a function in a general purpose library.

~~~
progval
> A compromise might be to loop over short sub-spans of the array, and do an
> exit early test at the end of each sub-span.

Another good use of Duff's device!

~~~
redcalx
For reference:
[https://en.wikipedia.org/wiki/Duff%27s_device](https://en.wikipedia.org/wiki/Duff%27s_device)
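For illustration, a Duff's-device-style take on the sub-span idea might look like this (a hypothetical sketch, not from the article): the switch jumps into the unrolled loop to handle the remainder, and the early-exit test runs once per unrolled round of eight comparisons.

```cpp
#include <cstdint>
#include <cstddef>

// Duff's device applied to the sortedness check: the switch dispatches
// into the middle of the unrolled do-while to consume the remainder,
// then full rounds of eight comparisons follow.
bool is_sorted_duff(const int32_t* a, size_t n) {
    if (n < 2) return true;
    size_t pairs = n - 1;              // adjacent pairs to check
    size_t rounds = (pairs + 7) / 8;   // unrolled iterations
    size_t i = 1;
    int ok = 1;
    switch (pairs % 8) {
    case 0: do { ok &= a[i - 1] <= a[i]; ++i;
    case 7:      ok &= a[i - 1] <= a[i]; ++i;
    case 6:      ok &= a[i - 1] <= a[i]; ++i;
    case 5:      ok &= a[i - 1] <= a[i]; ++i;
    case 4:      ok &= a[i - 1] <= a[i]; ++i;
    case 3:      ok &= a[i - 1] <= a[i]; ++i;
    case 2:      ok &= a[i - 1] <= a[i]; ++i;
    case 1:      ok &= a[i - 1] <= a[i]; ++i;
                 if (!ok) return false;  // early exit once per round
            } while (--rounds > 0);
    }
    return true;
}
```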

------
stkdump
It seems that the early exit (the return false) in the middle of the loop
prohibits the compiler from vectorizing it. The compiler can't know whether
part of the memory is inaccessible, so speculatively reading memory that the
loop might never have reached is illegal.

If you introduce a result variable and set it during the loop, the compiler
can vectorize the loop. At least icc does, but I didn't play with the compiler
settings of the other compilers too much.

Also, compilers can still exit the loop early (at least MSVC does), because a
segfault is UB and can be "optimized away".

------
Veedrac
FWIW HeroicKatora gives an improved version on Reddit:
[https://www.reddit.com/r/cpp/comments/8bkaj3/is_sorted_using...](https://www.reddit.com/r/cpp/comments/8bkaj3/is_sorted_using_simd_instructions/dx7jj8u/)

------
foxhill
Using a slightly more arcane (but not crazy) method for determining sorted-ness -
[https://godbolt.org/g/MKN9HP](https://godbolt.org/g/MKN9HP) - clang and icc
seem to have no problem vectorising the inner loop.

gcc manages it too, but emits a huge amount of code. And setting -Os seems to
stop it from vectorising the code. Shame.

edit: replaced with a more correct version.

------
zakk
It would be interesting to see how the SIMD versions work for small arrays. I
suspect in this case the naive version performs better, and this could be the
reason why compilers do not convert the code to SIMD instructions...

~~~
gameswithgo
SIMD tends to win even for pretty small arrays, from around ~25 elements, and
probably for any size that is evenly divisible by the vector width.

------
lokopodium
Generic code yields better results than SSE/AVX optimized ones. I wonder why
that could be.

~~~
eloff
It's just replacing extra loads with a bunch of other instructions. In reality
loads are cheap (and cached); it turns out to be cheaper than doing permutes
to shuffle the vectors around.

------
danbruc
_i += 7;_

Wouldn't this cause a sizable performance hit due to being misaligned most of
the time?

~~~
secure
No: many (most?) modern SIMD instructions don’t require alignment. From the
Intel Intrinsics Guide (can’t figure out how to link directly to it, sorry) on
_mm_loadu_si128:

> Load 128-bits of integer data from memory into dst. mem_addr does not need
> to be aligned on any particular boundary.

~~~
detaro
Doesn't need, but is there a performance difference? I seem to remember there
is no difference between _mm_load_si128 and _mm_loadu_si128 on modern CPUs,
but I'm not sure.

------
en4bz
Another possible implementation, which is O(log^2 n):
[https://en.wikipedia.org/wiki/Bitonic_sorter](https://en.wikipedia.org/wiki/Bitonic_sorter)

~~~
monochromatic
No, you can’t even read the whole array in O(log^2(n)). It’s not possible to
do better than O(n) without “cheating.”

~~~
IAmLiterallyAB
I think the trick is that you can run it in parallel, which under ideal
circumstances may give that kind of performance. But this is the first I've
heard of this algorithm.

------
alfanick
I always miss multithreaded benchmarks when using SSE/AVX instructions. AFAIK
AVX processing units are oversubscribed: there are fewer of them than CPU
cores.

I can imagine that running AVX is_sorted (or any other AVX procedure) in
multiple threads would be actually slower than running non-vectorized
procedure.

Of course, that's my purely anecdotal opinion.

~~~
nvartolomei
Here is an experience report on how AVX-512 instructions impact CPU
performance [https://blog.cloudflare.com/on-the-dangers-of-intels-
frequen...](https://blog.cloudflare.com/on-the-dangers-of-intels-frequency-
scaling/)

~~~
paulmd
Note that Skylake-SP and Xeon-W/i7/i9s behave very differently in this regard.
On Skylake-SP (e.g. the Xeon Silvers they're using) it's over a 50% clock-rate
reduction when AVX-512 is in the pipe; on Xeon-W and the HEDT chips it's more
like 10-20%.

[https://twitter.com/InstLatX64/status/934093081514831872](https://twitter.com/InstLatX64/status/934093081514831872)

~~~
floatboth
On HEDT (and mainstream desktop) you can actually adjust AVX offset manually.
With 0 offset and 5GHz clock, you can consume 500W (in Prime95 AVX) :D

------
nwmcsween
This assumes unaligned access is cheap.

~~~
lucb1e
See this thread (posted 7 hours before your comment):
[https://news.ycombinator.com/item?id=16842012](https://news.ycombinator.com/item?id=16842012)

~~~
nwmcsween
Which is OK in this case, but for any of the arch-independent (SWAR) bit hacks
it would probably be better to have a prologue loop to align the pointer.
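Such an alignment prologue could be sketched like this (hypothetical; the body after the prologue is left scalar here so the sketch stays portable, but it is where aligned SIMD loads would go):

```cpp
#include <cstdint>
#include <cstddef>

// Run a scalar loop until the read pointer is 16-byte aligned, then the
// remainder can safely use aligned SIMD loads (shown scalar here).
bool is_sorted_aligned(const int32_t* a, size_t n) {
    size_t i = 1;
    // Scalar prologue: stop once a + i is 16-byte aligned (or we're done).
    while (i < n && reinterpret_cast<uintptr_t>(a + i) % 16 != 0) {
        if (a[i - 1] > a[i]) return false;
        ++i;
    }
    // From here a + i is 16-byte aligned; an aligned-load SIMD loop
    // could take over.
    for (; i < n; ++i)
        if (a[i - 1] > a[i]) return false;
    return true;
}
```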

------
pubby
Unfortunately SSE doesn't speed up huge data much. It's only fast when
everything's in the cache.

~~~
vardump
Not sure what you mean; I can process _non-cached_ sequential data from RAM at
over 20 GB/s by using SSE/AVX.

There's no chance you could achieve the same by using scalar instructions.
SIMD can access memory _a lot_ faster than scalar code.

Random access is another matter. The trick is of course to avoid non-
sequential access patterns.

~~~
Peaker
Isn't the bottleneck in any sequential access case the memory bandwidth?

IOW: Are the scalar instructions slower than memory bandwidth?

~~~
obl
You can saturate memory bandwidth without SIMD, since you can issue at least
two 8-byte scalar loads per cycle. It does not leave much room for actual
processing, though.

