
Pruning spaces from strings quickly on ARM processors - ingve
http://lemire.me/blog/2017/07/03/pruning-spaces-from-strings-quickly-on-arm-processors/
======
glangdale
[ full disclosure: Intel employee, working in software that does exactly this
sort of thing ]

I would suggest that the best general way to hunt down a character class
(e.g. white space) is itself an 'algorithmic' problem, and a mildly interesting
one. If you have a fast permute handy, this can be done on both x86 and ARM.

We have some code here that does this, but it's not the easiest to understand
(this is the run-time only):

[https://github.com/01org/hyperscan/blob/master/src/nfa/shuft...](https://github.com/01org/hyperscan/blob/master/src/nfa/shufti.c)

The idea is that you represent a character class by intersecting two PSHUFB
(or whatever floats your boat) results. So if you want to do a real regex-
style \s, you will probably have to use one "bucket" to represent the low-
order stuff (like \r and \n) and one for \x20. You then use 1 PSHUFB to look
up your bucket for the high 4 bits (aka nibble) and the other PSHUFB to look
up the low 4 bits.
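
A scalar sketch of the two-lookup idea above may make it concrete. This is
not the Hyperscan code: the table names, the bucket bit assignments, and the
choice of \t \n \v \f \r and \x20 as the class are all assumptions made for
illustration. Each 16-entry table lookup stands in for what one PSHUFB would do
across 16 input bytes at once.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bucket bits (hypothetical assignment, chosen for this sketch). */
enum {
    BUCKET_CTRL  = 0x01, /* \t \n \v \f \r: high nibble 0x0, low nibble 0x9..0xd */
    BUCKET_SPACE = 0x02  /* ' ' (0x20):     high nibble 0x2, low nibble 0x0      */
};

/* Indexed by the low nibble: buckets that accept this low nibble. */
static const uint8_t lo_table[16] = {
    [0x0] = BUCKET_SPACE,
    [0x9] = BUCKET_CTRL, [0xa] = BUCKET_CTRL, [0xb] = BUCKET_CTRL,
    [0xc] = BUCKET_CTRL, [0xd] = BUCKET_CTRL,
};

/* Indexed by the high nibble: buckets that accept this high nibble. */
static const uint8_t hi_table[16] = {
    [0x0] = BUCKET_CTRL,
    [0x2] = BUCKET_SPACE,
};

/* A byte is in the class if some bucket accepts both of its nibbles,
   i.e. the two lookup results share at least one set bit. */
int is_space_nibbles(uint8_t c) {
    return (lo_table[c & 0x0f] & hi_table[c >> 4]) != 0;
}

/* Counts class members in s[0..n), one byte at a time. */
size_t count_spaces(const char *s, size_t n) {
    assert(s != NULL || n == 0);
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        count += (size_t)is_space_nibbles((uint8_t)s[i]);
    return count;
}
```

With a single bucket, 0x00 and 0x29..0x2d would match by mistake, because
their nibbles are each acceptable on their own; keeping \x20 in a separate
bucket from the low-order characters is what rules those out.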

This can't represent all character classes - it's possible to run out of
buckets. There's an obvious way of doing all character classes with 3 shuffles
(see truffle.c in the same directory).

As another commenter says, it's a lot harder to do this for UTF-8 as you are
looking for bigger data items, variable-sized data items, and potentially a
more generous definition of whitespace.

~~~
glangdale
To expand on the UTF-8 point - assuming you are using "UCP" type definitions
of whitespace where you start looking for "Mathematical Space" (codepoint
0x205f) and "Ideographic Space" (codepoint 0x3000) and a good couple dozen
other friendly spaces.... you can do a trick similar to what I describe if you
_really_ think that SIMD is going to pay off here (hey, maybe you're doing a
lot of "6 EM Space" processing). It's still potentially doable in SIMD and the
comparison is, well, what? One-char-at-a-time?

You will need more shuffles and/or more buckets.
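
For reference, a one-char-at-a-time baseline might look like the sketch below.
The function name is hypothetical, and the set it recognises is a small
illustrative subset (ASCII white space plus three of the code points mentioned
above), not a full UCP definition.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the byte length of the white-space code point that starts at p
   (1 or 3 for the subset handled here), or 0 if none starts there.
   n is the number of bytes available at p. */
size_t utf8_space_len(const unsigned char *p, size_t n) {
    assert(p != NULL || n == 0);
    /* ASCII: \t \n \v \f \r and space are single bytes. */
    if (n >= 1 && (p[0] == 0x20 || (p[0] >= 0x09 && p[0] <= 0x0d)))
        return 1;
    if (n >= 3) {
        /* U+2006 SIX-PER-EM SPACE encodes as E2 80 86. */
        if (p[0] == 0xe2 && p[1] == 0x80 && p[2] == 0x86) return 3;
        /* U+205F MEDIUM MATHEMATICAL SPACE encodes as E2 81 9F. */
        if (p[0] == 0xe2 && p[1] == 0x81 && p[2] == 0x9f) return 3;
        /* U+3000 IDEOGRAPHIC SPACE encodes as E3 80 80. */
        if (p[0] == 0xe3 && p[1] == 0x80 && p[2] == 0x80) return 3;
    }
    return 0;
}
```

The three-byte cases show why a single pair of nibble lookups is not enough:
a match depends on several adjacent bytes, not on one.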

------
ChuckMcM
I really like that more folks are looking at the vectorized instructions in
various ARM chips but I worry about gross generalizations like _"However, for
many problems, the total lack of a movemask instruction, and the weakness of
the equivalent to the pshufb, might limit the scope of what is possible with
ARM NEON."_

I would much prefer taking each of the various 'cpu eater' type applications
and looking at them as a unit: take DSP or convolution or bin packing or list
searching, and look at what you can do vs what needs help. A lot of the
improvements in the x86 architecture came about because a design engineer at
Intel or AMD read a clear statement of the problem and the challenges with the
solution.

That said, and to justinjlynn's comment, I would really love a top-to-bottom
'Unicode string processing on current processors' write-up; so many people are
doing nothing but scripting these days that string hacking is a big part of
interpreters (rather than, say, raw floating point performance back in the day).

~~~
ori_b
The problem with that is that Unicode string processing is highly data
dependent. You may not get good speedups because most performance improvement
from putting things in hardware comes from parallelism in the data path.

A good approach to getting speedups is probably by assuming ascii, vectorizing
the hell out of that, and falling back to multibyte processing where that
fails. Checking for multibyte characters comes down to checking if the high
bit of any byte is set, which should be fast and easy for the branch predictor
to deal with.
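
A minimal sketch of that high-bit check, assuming plain portable C rather than
NEON or SSE (the function name is made up for this example). It ORs the input
together eight bytes at a time and tests the high bit of every byte lane once
at the end; a caller would take the vectorized ASCII path when it returns 1
and fall back to multibyte processing otherwise.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 if every byte in buf[0..n) has its high bit clear (pure ASCII),
   otherwise 0. */
int is_all_ascii(const unsigned char *buf, size_t n) {
    assert(buf != NULL || n == 0);
    const uint64_t high_bits = UINT64_C(0x8080808080808080);
    uint64_t acc = 0;
    size_t i = 0;

    /* Whole 8-byte words: memcpy avoids unaligned-access problems, and
       byte order does not matter because we only OR the words together. */
    for (; i + 8 <= n; i += 8) {
        uint64_t word;
        memcpy(&word, buf + i, sizeof word);
        acc |= word;
    }

    /* Remaining 0..7 bytes. */
    unsigned char tail = 0;
    for (; i < n; i++)
        tail |= buf[i];

    return ((acc & high_bits) == 0) && ((tail & 0x80) == 0);
}
```

In practice one would probably run this per block rather than over the whole
string, so that a single multibyte character late in the input does not force
everything onto the slow path.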

------
0xfaded
The NEON coprocessor runs at the same CPU clock, but 20 cycles behind the main
CPU. It has its own instruction queue, and as long as the queue isn't full the
main CPU will simply forward instructions to it. If you want to read from the
wide registers though, you need to wait the 20 cycles for the coprocessor to
catch up.

~~~
pm215
A quick google suggests this was true for the Cortex-A8 (released in 2005,
over a decade ago), but is unlikely to be true for any implementation since,
and I would definitely not expect it to be true for a modern 64-bit
implementation like the one the article author tested.

~~~
0xfaded
Thanks for the reply. After much sifting I finally found modern numbers.

    Gen -> NEON 6 cycles
    NEON -> Gen 8 cycles

So yes, definitely much closer.

[http://infocenter.arm.com/help/topic/com.arm.doc.uan0015b/Co...](http://infocenter.arm.com/help/topic/com.arm.doc.uan0015b/Cortex_A57_Software_Optimization_Guide_external.pdf)

------
NathanOsullivan
> Most of the benefit from this approach would come if you can often expect
> streams of 16 bytes to contain no white space character. This seems like a
> good guess in many applications.

The average word length in English is substantially less than 16, so it is
hard to see how this would help.

~~~
wongarsu
parse concatenated zero-terminated strings, parse space-separated base64
strings, find the commas in CSVs, discard sensor data that doesn't meet the
threshold, do run length encoding on sorted data, ...

There are plenty of applications for this algorithm or variations of it. But
you are right that the presented application doesn't seem immediately useful.

------
hedora
(x<32)?1:0 asks for a branch. Presumably the compiler will remove it.

However, you can just write (x<32), which is defined to be zero or one by the
language spec. This shouldn't save anything on ARM, but who knows what the
compiler will do.

Similarly, the scalar version might be faster if you do manual loop unrolling.
It probably still won't beat Neon, but it makes for a better "scalar vs
vectorized" bake-off.
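
For concreteness, here is a sketch of a scalar despacing loop written with the
bare comparison. It assumes that any byte with value 32 or below counts as
white space; that threshold and the function name are assumptions for this
example, not a quote of the article's code.

```c
#include <assert.h>
#include <stddef.h>

/* Removes every byte with value <= 32 from bytes[0..n) in place and
   returns the new length. The bytes past the new length are left as-is. */
size_t despace(char *bytes, size_t n) {
    assert(bytes != NULL || n == 0);
    size_t pos = 0;
    for (size_t i = 0; i < n; i++) {
        /* Work on unsigned values so bytes >= 0x80 are kept, not dropped. */
        unsigned char c = (unsigned char)bytes[i];
        bytes[pos] = (char)c; /* always store the byte ...             */
        pos += (c > 32);      /* ... but only advance if we keep it    */
    }
    return pos;
}
```

The comparison result is used directly as the increment, so there is no
data-dependent branch in the source; whether the generated code is also
branch-free is still up to the compiler.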

~~~
torrent-of-ions
The C specification requires that (x<32) is 0 or 1, but that doesn't mean it
is necessarily possible to do it branch-free on any particular computer.

~~~
vidarh
Maybe there are CPUs where it isn't possible, but I don't think there are any
non-joke systems where it isn't possible.

There are just too many ways of doing it. E.g. on most architectures there'll
be a straightforward way of doing it using an add/sub operator and taking
advantage of overflow/underflow to set a carry flag, and then another operator
or three to get the least significant bit set to the carry flag.

Or you can do a straight lookup into a table on any architecture that supports
a load from memory and an add, though it'd of course be hugely wasteful.

On any architecture that also supports a right shift / integer division plus
AND you could make that use less memory by shifting, masking, looking up (any
set bit => 0, no set bits => 1), and then ANDing the results together with an
initial 1.

And many, many more. I have a hard time picturing what an instruction set that
would make it impossible to do that without branches would even look like.

Of course that doesn't guarantee it will be the most efficient choice, and of
course also doesn't guarantee that a compiler will know how to do it.
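
The subtract-and-borrow idea can be sketched in portable C, with a widened
subtraction standing in for the carry flag (function names are made up for
this example, and the input is assumed to be an unsigned 32-bit value):

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if x < 32, else 0, with no comparison or branch in the source.
   Computed in 64 bits, x - 32 wraps around (setting bit 63) exactly when
   x < 32, so bit 63 plays the role of the borrow/carry flag. */
uint32_t less_than_32_borrow(uint32_t x) {
    return (uint32_t)(((uint64_t)x - 32u) >> 63);
}

/* Checks the arithmetic version against the plain comparison for one input. */
void check_less_than_32(uint32_t x) {
    assert(less_than_32_borrow(x) == (uint32_t)(x < 32));
}
```

This only shows that a branch-free form exists; it says nothing about whether
it is faster than what a compiler emits for the plain (x<32).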

~~~
pm215
In this particular instance, 64-bit ARM (A64) has a conditional-set
instruction, so "int cmp(int x) { return x < 32; }" compiles straightforwardly
to "cmp w0, #0x1f; cset w0, le; ret". For 32-bit ARM you can use the
conditional-execution you get on almost all instructions, so gcc emits "cmp
r0, #31; movgt r0, #0; movle r0, #1; bx lr".

------
justinjlynn
These tricks won't have a great time with Unicode encodings.

~~~
ams6110
Unless the hot spot in your code is removing spaces from strings, these tricks
are just academic and not something you would ever do yourself. You'd use a
well tested, safe string handling library in real code.

~~~
glangdale
I am not an academic. I have done this myself. Of course, we wrote a regex
library...

These tricks are in general something that it's well worth learning how to do
if you do any performance critical code, as often the boundary of the way that
they are implemented in someone's library is not useful for the actual task
(i.e. you may have a fast 'find a space' function, but it might not be fast by
the time you painfully iterate through each of its results removing spaces one
by one). Sadly, a lot of this sort of processing doesn't compose very well
unless you have access to an omniscient optimizer.

But yes, if this isn't your hot spot, don't do this thing in particular. It's
still possible you might learn something useful by reading about it.

