
An Intel Programmer Jumps over Wall: First Impressions of ARM SIMD Programming - signa11
https://branchfree.org/2019/03/26/an-intel-programmer-jumps-over-the-wall-first-impressions-of-arm-simd-programming/
======
glangdale
Author here. Happy to answer questions, endure abuse, or, best yet, be put in
my place by someone with a Really Nice Table of ARM Latencies and Throughputs
that Could Have Been Found If I Wasn't Such an Idiot.

[ Note the title of the article was "Jumps over The Wall", in keeping with the
dish. ]

~~~
gok
You're using C intrinsics instead of assembly, right? Are you sure the
compiler isn't doing the scheduling well enough on its own?

~~~
glangdale
If you're thinking about algorithms in terms of how they might embed into a
processor, you need to know the throughput and latency numbers even if the
compiler is getting things right.

For example, there are some nice possibilities with the TBL instruction (or
PSHUFB, or VPERMB, etc.) for doing character-class membership tests. Which
version you use would have a lot to do with whether you think TBL is going to
issue 2/cycle, 1/cycle or 0.5/cycle (just fr'instance) - you might design the
algorithm quite differently. It's not just a case of picking the one design
choice and then having the compiler schedule that - if you're not thinking
about this during design, you're not going to be remotely near peak
performance.
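For concreteness, here is a minimal scalar model of the nibble-table character-class trick that TBL/PSHUFB enable (the tables and the whitespace class are my own illustration, not from the comment). A byte is in the class iff the entry selected by its low nibble, ANDed with the entry selected by its high nibble, is non-zero; the SIMD version performs all 16 lookups for 16 input bytes with a single PSHUFB (or TBL):

```c
#include <stdint.h>

/* Scalar model of the PSHUFB/TBL nibble-table class test.
 * Class here: ASCII whitespace { '\t', '\n', '\r', ' ' } (illustrative).
 * Each high-nibble bucket occurring in the class gets a bit:
 *   bit 0 -> high nibble 0x0, bit 1 -> high nibble 0x2.
 * lo_tbl[l] records which buckets accept low nibble l. */
static const uint8_t lo_tbl[16] = {
    0x02, 0, 0, 0, 0, 0, 0, 0,       /* 0x20 ' ' lives in bucket 1      */
    0, 0x01, 0x01, 0, 0, 0x01, 0, 0  /* 0x09, 0x0A, 0x0D in bucket 0    */
};
static const uint8_t hi_tbl[16] = {
    0x01, 0, 0x02, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
};

int in_class(uint8_t b) {
    /* Vectorized: two shuffles, one AND, one compare-against-zero,
     * covering 16 bytes (PSHUFB) at a time. */
    return (lo_tbl[b & 0x0F] & hi_tbl[b >> 4]) != 0;
}
```

Whether you spend one shuffle per 16 bytes, or lean on TBL's ability to index up to four table registers at once, is exactly the kind of choice the issue-rate question above decides.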

~~~
gok
Sure but the compiler can compile your intrinsics to different-but-equivalent
instructions, and modern compilers actually do that pretty aggressively. Are
you sure the code output was actually using the instructions you expected?

~~~
burntsushi
Often you wind up needing to change the entire algorithm to use
different instructions with different latency/throughput numbers. The classic
case off the top of my head (for Intel at least) is PCMPESTRI versus
completely different approaches for substring search.
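As an illustration of one such "completely different approach" (a generic sketch of mine, not burntsushi's actual code): instead of PCMPESTRI, screen each candidate window on its first and last pattern bytes and verify survivors with memcmp. In the vectorized version the two byte comparisons cover 16 or 32 windows per instruction and candidate positions fall out of a movemask:

```c
#include <string.h>

/* Scalar model of a SIMD-friendly substring search: a cheap screen on
 * the first and last pattern bytes, full memcmp only on survivors.
 * Returns the match offset, or -1 if absent. */
long find_substr(const char *hay, size_t n, const char *pat, size_t m) {
    if (m == 0) return 0;
    if (m > n) return -1;
    char first = pat[0], last = pat[m - 1];
    for (size_t i = 0; i + m <= n; i++) {
        /* In the SIMD version these two tests are vector compares
         * against broadcast copies of 'first' and 'last'. */
        if (hay[i] == first && hay[i + m - 1] == last &&
            memcmp(hay + i, pat, m) == 0)
            return (long)i;
    }
    return -1;
}
```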

------
blu42
Props to the author for the objective article. Re CA53 throughput and
latencies, no -- I don't have the tables, but I've done my fair share of
tbl/tbx measurements: [https://www.cnx-software.com/2017/08/07/how-arm-nerfed-
neon-...](https://www.cnx-software.com/2017/08/07/how-arm-nerfed-neon-permute-
instructions-in-armv8/)

~~~
dman
Welcome to hn, we meet again!

~~~
blu42
Salut! I've finally decided to register to write, after years of silent
reading ; )

------
devit
Looks like the author is going to enjoy RISC-V SIMD (once it is finalized).

~~~
glangdale
I'm not sure... I like the look of the bit manipulation stuff, but I'm not
really a fan of the variable-length vector approach. I think these systems are
built to make the world safe for matrix multiply - and simple math workloads -
but the short-vector approach (e.g. the typical x86 SIMD style of doing
things) is sometimes exactly what you want.

I keep meaning to write more about this. I have found 3 categories of SIMD use
in my own work:

1) Doing one thing a gazillion times (e.g. conventional SIMD). This works well
on vector machines, of course.

2) Using SIMD registers to do "more stuff". So in a couple of string matchers
and regex matchers I've designed, you are really using a SIMD register because
_it's bigger_ than a GPR (duh). But this might be because you want to simulate
a 512-bit NFA rather than a 64-bit NFA.

3) Using SIMD operations to do weird, irregular stuff where what you're really
getting is a substitute for branchy code. I blogged about an example of this
called "SMH" [https://branchfree.org/2018/05/30/smh-the-swiss-army-
chainsa...](https://branchfree.org/2018/05/30/smh-the-swiss-army-chainsaw-of-
shuffle-based-matching-sequences/)
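Category 2 can be made concrete with the classic bit-parallel "shift-and" matcher (a standard technique, sketched here at GPR width by me, not taken from the post): each bit of the state word is an NFA state, so swapping the 64-bit word for a 512-bit SIMD register buys a 512-state NFA running the same recurrence.

```c
#include <stdint.h>
#include <string.h>

/* Bit-parallel "shift-and" substring search: bit i of `state` means
 * "a pattern prefix of length i+1 ends at the current position".
 * Assumes 1 <= strlen(pat) <= 64; a 512-bit SIMD register would run
 * the same recurrence over 512 states. Returns match offset or -1. */
long shift_and_find(const char *text, const char *pat) {
    size_t m = strlen(pat);
    uint64_t masks[256] = {0};   /* masks[c]: positions where pat has c */
    for (size_t i = 0; i < m; i++)
        masks[(uint8_t)pat[i]] |= 1ULL << i;
    uint64_t state = 0;
    uint64_t accept = 1ULL << (m - 1);
    for (size_t j = 0; text[j]; j++) {
        state = ((state << 1) | 1) & masks[(uint8_t)text[j]];
        if (state & accept)
            return (long)(j - m + 1);   /* offset where the match starts */
    }
    return -1;
}
```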

~~~
mastax
I'll grant that a lot of code can't be readily adapted to large vectors, but
this doesn't seem like much of a problem for (proposed) RISC-V. If you're
creating an algorithm that only works on a specific width, you can just
`setvl` and assert that you have enough space available. You're likely
targeting a specific CPU or class of CPU where you know there will be support
for 256-bit vectors or whatever. If someone tries to run it on a cheap
embedded CPU, they'll be disappointed that your code won't run on their 64-bit
vectors, but this is no different from trying to run AVX code on an Intel
Atom.

I suppose it's hard to know until we can actually write code for it, but I
haven't imagined a scenario where the RV model is significantly worse than
packed SIMD, other than a tiny bit of vector configuration bookkeeping. I
think that's a worthwhile tradeoff for getting simple, portable, fast code for
elementwise operations and implementation flexibility.

I'd love to be convinced otherwise, though.

~~~
brandmeyer
I'm not sure I understand how callee-saved registers are going to work under
RVV, given the way that dynamic reconfiguration works.

~~~
devit
My guess is that either no registers will be callee-saved or one group of 8
registers will be callee-saved (in the current draft, you can group 1, 2, 4 or
8 sequential registers, but only starting on a multiple of the group size, so
register groups are always going to be contained within one of the four
8-register maximum-size groups); this will require dynamic stack allocation
since the vector length is not fixed.

The saving sequence would be something like this (after setting up a frame
pointer if needed): "vsetvl t0, x0, e8, m8; sub sp, sp, t0; vse.v v16, (sp)".

------
phaedrus
This is by the same person who authored Hyperscan, which is also mentioned on
this blog & worth a read.

------
Const-me
About documentation, Intel’s is not great either. Otherwise I wouldn’t bother
making this: [https://github.com/Const-
me/IntelIntrinsics](https://github.com/Const-me/IntelIntrinsics)

~~~
jcranmer
[https://software.intel.com/sites/landingpage/IntrinsicsGuide...](https://software.intel.com/sites/landingpage/IntrinsicsGuide/)
is kept in my browser history. It gives pretty good documentation of most of
the details of intrinsics (including timing information on some processors for
some instructions), although it is missing the enum definitions.

------
Solar19
Nice work. By the way, in the x86 simdjson, did you use any of the
string-compare instructions in SSE 4.2? Or was AVX faster?

~~~
glangdale
SSE4.2 is bollocks. It was DOA. There are better ways of doing most of it
using PSHUFB and what's more, those ways were better when SSE4.2 arrived. It's
only gotten slower, relatively speaking, as it is exiled to the edge of the
die and has not been promoted to wider regs (AVX2, AVX512).

