
SIMD Everywhere: 0.5.0 - lelf
https://simd-everywhere.github.io/blog/announcements/release/2020/06/21/0.5.0-release.html
======
Const-me
I sometimes write code portable across SSE4 and NEON, and I'm not sure this is
going to work fast enough for that. There are important features unique to each.

SSE has shuffles, pack/unpack, movemask, 64-bit doubles, testzero, float
rounds, blends, integer averages, float square roots and dot product.

NEON has interleaved RAM loads/stores, vector operators taking a scalar as the
other argument, byte swap, rotate, bit scan and population count, and versions
of all instructions that process 8-byte vectors.

That's enough differences that I have to adjust both algorithms and data
structures to be portable between them. I'm not convinced it's possible to do
automatically.

~~~
gioele
> I'm not sure this is going to work fast enough for that.

From reading the SIMD Everywhere description, it seems to me that SIMDe is a
way to _allow_ code that targets only one platform to work on other platforms
as well. As a nice byproduct, you get _some_ speedup if the architecture
targeted by the code is similar to the architecture that will run the code.

Portability is the main focus, not speed.

Obviously, once you have a good emulation of an architecture the first
question is going to be: can I make it faster?

~~~
mtgx
All other chip architectures should adopt Arm's SVE2 or something similar.

[https://community.arm.com/developer/ip-products/processors/b...](https://community.arm.com/developer/ip-products/processors/b/processors-ip-blog/posts/new-technologies-for-the-arm-a-profile-architecture)

~~~
fao_
Hi there! It looks like you've been shadowbanned. I vouched for a few of your
comments because they didn't seem terrible, but I didn't scroll very far. It
might be worth sending a mail to dang to ask him to take a look at your
account or something!

~~~
chmod775
That account has 33k karma and is 8 years old.

I just can't imagine a set of circumstances where a shadowban, of all things,
would be justified, even if they transgressed in some way.

It's probably a mistake?

~~~
fao_
or they said something ridiculously heinous

~~~
chmod775
In that case one should point that out to them and warn them.

Shadowbans are quite a heinous punishment in themselves, and better used to
prevent spammers etc. from creating new accounts.

But people who make thousands upon thousands of comments are bound to make an
emotional comment or an error of judgment at some point.

------
mainland
If you're doing new development and not opposed to using C++, I recommend
xsimd, which provides a higher-level interface to architecture-specific SIMD
instructions: [https://github.com/xtensor-stack/xsimd](https://github.com/xtensor-stack/xsimd)

~~~
nemequ1729
Apart from the fact that you have to rewrite your code, the big disadvantage
to something like this is that it's a bit slower.

With SIMDe you're still free to use functions like `_mm_maddubs_epi16` and
they'll be _really_ fast on x86, but still work everywhere.

With xsimd (and similar libraries) you're generally limited to the lowest
common denominator.

FWIW, if abstraction layers are your thing you might want to look at
std::experimental::simd ([https://github.com/VcDevel/std-simd](https://github.com/VcDevel/std-simd))
instead. Google's Highway
([https://github.com/google/highway](https://github.com/google/highway)) is
also pretty interesting.

~~~
alfalfasprout
Not sure where you got that from... xsimd will detect your instruction set
automatically. Do you mean that if you're distributing a single binary then
you'll need to compile for the lowest common denominator?

If so, that's not necessarily true either. A few patterns exist here. One is
what the Intel compilers do, where you conditionally call variants of a
function based on the instruction set. Another is to compile SIMD-accelerated
functionality into shared libs that are dynamically loaded at launch based on
the instruction set.

~~~
nemequ1729
> Not sure where you got that from... xsimd will detect your instruction set
> automatically. Do you mean that if you're distributing a single binary then
> you'll need to compile for the lowest common denominator?

No, what I mean is that since xsimd is an abstraction layer you can't really
use the "full" ISA extension; you're limited to composing operations based on
a simpler subset that is supported across multiple architectures.

For example, consider `_mm_maddubs_epi16`, which is a favorite example of mine
because it's so specific… I honestly have _no_ idea when this is useful, but
I'm sure Intel had a particular use case in mind when they added it. It
multiplies each _unsigned_ 8-bit integer in the first operand by the
corresponding _signed_ 8-bit integer in the second, producing a signed 16-bit
result for each lane. Then it performs saturating addition on each horizontal
pair and returns the result.

Now I'm not that familiar with xsimd's API, but I can't imagine they have a
single function that does all that. It's much more likely that you have to
call a few functions in xsimd: maybe one for each input to widen to 16 bits,
then at least one addition. For pairwise addition there might be a function;
if not, you'll need some shuffles to extract the even and odd values. Then
perform saturating addition on those, which [isn't supported by
xsimd](https://github.com/xtensor-stack/xsimd/issues/314), so you'll need a
couple of comparisons and blends to implement it.

That's basically what we have to do in SIMDe in the fallback code; I don't
have a problem with that at all. However, even if you're targeting SSSE3,
it's pretty unlikely xsimd will be able to fuse all that into a single
`_mm_maddubs_epi16`.

OTOH, in SIMDe we can also add optimized implementations of various functions,
and `_mm_maddubs_epi16` is no exception. There is already an AArch64
implementation which should be pretty fast, and an ARMv7 NEON implementation
which isn't too bad.

With SIMDe what you get isn't the lowest common denominator of functionality,
it's the union of everything that's available. SIMDe's `_mm_maddubs_epi16` may
not be any faster than xsimd _if you're not targeting SSSE3_, but if you
_are_ targeting SSSE3 or greater SIMDe is going to be a lot faster.

SIMDe's approach isn't without drawbacks, of course. For one, it can be hard
to know whether a particular function will be fast or slow on a given
architecture, whereas lowest-common-denominator libraries will pretty much be
fast everywhere but functionality will be a bit more basic. It's also a _lot_
more work… there are around 6500 SIMD functions in x86 alone, and IIRC NEON is
at around 2500.

------
jasonzemos
I'd like to commend the authors for embarking on this. Complex ISAs are an
unfortunate reality for performance, as advances in cycles per second on a
single core are negligible. The divergence of these increasingly complex ISAs
among platforms weighs heavily on competitive application developers.

As someone who is interested in writing cross-platform SIMD code, the most
valuable asset to me is a library or compiler that can generate the
instructions dynamically from otherwise normal-ish looking C/C++. This is the
most powerful mode of development in my experience. Clang already does this
remarkably well. I can write with standard C++ syntax (albeit awkwardly),
maybe using a few custom types with `__attribute__((vector_size(x)))`, and
_not_ have to involve explicit intrinsics except perhaps for a very small
number of leaf operations that cannot be expressed. At this time Clang has
the upper hand over GCC: the latter cannot generate code which scales between
platforms utilizing
different vector sizes. For example, if you try to perform an operation on a
256-bit vector using a 128-bit target: Clang will seamlessly generate two
128-bit operations; GCC will fall back entirely to scalar. My assumption is
that developments in Clang for ARM's SVE have carried over to generating
scalable code for other platforms, but nevertheless it is remarkable.

I don't believe that writing functions composed of hand-crafted lists of
intrinsics is the best way forward. Undoubtedly it has worked for projects,
even well enough to ship stable software -- but it scales and adapts poorly in a
fast-developing and diverse market of hardware. For example, years ago I wrote
a simple `tolower(string)` implementation using an assemblage of 128-bit
standard-Intel SSE2 statements and today the instructions it produces are
exactly the same as the day that I wrote it. All I can hope for is that
256-bit capable architectures can execute two of my operations at once. That's
not ideal.

~~~
throwaway189262
If you just need "big" numbers, Rust supports 128-bit integers on all
platforms via LLVM's code generation.

~~~
moonchild
No need to reach for Rust; just use `__(u)int128_t`. Works with GCC, too.

------
lloydatkinson
Very cool. This is effectively a replacement for the now unfortunately
abandoned Yeppp library which I used with C# - though modern .NET has SIMD now
too.
[https://news.ycombinator.com/item?id=10232395](https://news.ycombinator.com/item?id=10232395)

~~~
stagger87
This would not replace Yeppp. This is more like a library Yeppp would use. If
you need to replace Yeppp, look at something like Intel MKL or IPP.

------
chuckcode
Some matrix math libraries like Eigen[1] support vectorization via SSE, AVX,
NEON, etc. and also use cache friendly algorithms for larger matrices. Highly
recommended when you don't need to go quite as low level as individual
instructions.

[1] [http://eigen.tuxfamily.org](http://eigen.tuxfamily.org)

~~~
corysama
If you need to work on large matrices, Eigen is highly recommended. If you
need to do a stream of custom processing in small steps, Eigen is not a good
fit. Eigen is a super complicated, template-heavy library that can compile
very heavyweight tasks into very efficient code (eventually).

For custom work, [https://github.com/VcDevel/std-simd](https://github.com/VcDevel/std-simd)
is working its way into the C++ standard.

~~~
Const-me
> Eigen is a super complicated, template-heavy library

I agree, but that feature allows you to apply optimizations by specializing
those templates.

Works for both micro-optimizations (in their pbroadcast4<__m256d> they do 4
loads; on many CPUs AVX2 can do better with a single load plus shuffles) and
for replacing large parts of Eigen (I was able to improve the performance of a
conjugate gradient solver by moving the sparse matrix into a SIMD-optimized
structure).

------
gameswithgo
A similar library in Rust:
[https://github.com/jackmott/simdeez](https://github.com/jackmott/simdeez)

~~~
lorenzhs
That library is not even remotely similar to SIMDe. The goal of SIMDeez is to
provide an abstraction over different SIMD instruction sets (different
versions of the x86 SIMD instructions: SSE2, SSE4.1, AVX2). The goal of SIMDe
is to let you run code using platform-specific intrinsics _on machines that
don't have those instructions_, e.g. run code with SSE/AVX intrinsics on an
ARM-based CPU. They're very different things.

------
microcolonel
Careful when invoking (some) AVX-512 instructions, because having a process
using them on just one hardware thread can cripple your entire system and hurt
overall performance in workloads where the kernel or another process on the
system is doing a lot of the work.

~~~
lorenzhs
While that's true, I don't see how this is relevant to SIMDe? SIMDe lets you
compile code using (e.g.) SSE/AVX intrinsics for ARM targets, using the target
platform's SIMD intrinsics when possible. It doesn't even officially include
support for AVX512 yet.

Anyway, for anyone looking for details on what the OP mentioned, it's an
implementation detail of Intel CPUs, see
[https://en.wikichip.org/wiki/intel/frequency_behavior#Base.2...](https://en.wikichip.org/wiki/intel/frequency_behavior#Base.2C_Non-AVX_Turbo.2C_and_AVX_Turbo)

~~~
nemequ1729
It does include support for AVX-512, it's just still a work in progress.
AVX-512 is enormous (IIRC ~ 4k functions), so it will be a while before it is
fully supported.

If you're targeting AVX-512 but don't have hardware that supports AVX-512
(like all AMD CPUs), SIMDe can be quite nice. The result is much faster than
Intel's SDE and it's just native code that you can use your normal debugger
on.

