
Avoid character-by-character processing when performance matters - ingve
https://lemire.me/blog/2020/07/21/avoid-character-by-character-processing-when-performance-matters/
======
kevingadd
This is a great example of what I'd call "vectorization without vector
instructions", though you could also just call it pipelining. You recognize
that you're going to be operating on big chunks of data at once, so you make
allowances to be able to only operate in chunks - like ensuring you can
overrun the end of the buffer - or split your main loop out so there is a
'slow path' for the start/end of your operation and a fast path for the rest
of it. Then you just operate on as much data as possible in one go and ignore
things like branching, early out, etc. You can mask your results at the end or
simply discard unused outputs. The '(running & 0x8080808080808080)' example in
this article is perfect because it recognizes that in the end you only need to
be sure that nothing failed, you don't need to know _what_ and you don't need
to know _when_.
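
The accumulator trick described above can be sketched in a few lines (a minimal version for illustration, not the article's exact code):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string_view>

// Sketch of the accumulator idea: OR every byte into a running word and
// test the high bit of all eight lanes once, at the very end. No branch,
// no early exit in the hot loop; tail bytes handled one at a time.
bool is_ascii_swar(std::string_view v) {
    std::uint64_t running = 0;
    std::size_t i = 0;
    for (; i + 8 <= v.size(); i += 8) {
        std::uint64_t chunk;
        std::memcpy(&chunk, v.data() + i, 8);  // safe unaligned load
        running |= chunk;
    }
    for (; i < v.size(); ++i)
        running |= static_cast<unsigned char>(v[i]);
    return (running & 0x8080808080808080ULL) == 0;
}
```

The final mask answers only "did any byte fail?", never which one or when, which is exactly what lets the loop body stay branch-free.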

Most of the time if you see people doing this it's using vector registers and
newer instruction sets, because that's where you get 2-10x speedups by doing
it. But even on ARM and x86 without using vector registers, careful pipelining
and vectorization can deliver pretty sizable speedups. It increases cache
locality, cuts down on branches, and reduces register pressure.

I ended up working on a project a while ago where we had to port a game's
software rasterizer to multiple targets, including a low-spec ARM device.
Instead of trying to hand-code a NEON implementation (I don't remember for
sure if it even had NEON...) I started by just vectorizing it to operate on
groups of 4 pixels instead of one (because I could fit 4 pixels in a single
regular-sized integer register) so that each iteration of the main loop was
doing more work and spending less time waiting on memory. Just that alone
ended up improving performance considerably at the kind of clock speeds I was
dealing with, and on higher-spec x86 cores it was still a slight improvement
for about a day of hacking and staring at disassembly.
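
Not the rasterizer code in question, but the basic 4-pixels-in-one-register idea can be illustrated with a SWAR-style operation, e.g. halving the brightness of four packed 8-bit pixels at once:

```cpp
#include <cstdint>

// Hypothetical illustration (not the actual rasterizer): four 8-bit
// pixels packed in one 32-bit register, darkened with a single shift
// and mask instead of four separate byte operations. The mask clears
// the bit that would otherwise leak in from the neighbouring byte.
std::uint32_t darken4(std::uint32_t px4) {
    return (px4 >> 1) & 0x7F7F7F7Fu;
}
```

Each loop iteration then touches one register instead of four memory locations, which is where the win at low clock speeds came from.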

------
jeffbee
To be perfectly honest what I want to see, for the specific C++ case cited, is
std::all_of with a suitable predicate. Let the compiler figure it out.

Plugged it into compiler explorer and clang 10 does unroll the std::all_of
version, doing four chars at a time.

~~~
galkk
Exactly that. I was surprised to see that looong for loop definition (I don't
use C++ at all) and wanted to see if there was anything simpler. First I
found for_each, then all_of.

If anyone is interested, here is simpler version:

    bool is_ascii_branchy(const std::string_view v) {
        return std::all_of(v.begin(), v.end(),
                           [](auto i) { return uint8_t(i) < 128; });
    }

~~~
jonstewart
STL is great! How’s the performance, though, compared with TFA’s optimistic
and hybrid versions?

------
nkurz
I'm somewhat surprised that compilers don't optimize this automatically. I
thought at first that it might be something C++ related, but I tried a
straight C equivalent[1] with Godbolt and it wasn't optimized by any of the
main compilers either. Is there a standards-related reason why this can't be
optimized to work on 8-byte chunks? Or even longer vectors? Or is this just
proof of the fact that there are so many possible optimizations that a
compiler can't possibly do all of them?

[1] [https://godbolt.org/z/WfPWEz](https://godbolt.org/z/WfPWEz)

~~~
ridiculous_fish
This particular code cannot be effectively optimized because of the early exit
inside the loop. What if `size` was enormous and the first character was
negative? One extra byte of lookahead might SIGSEGV! So it must gingerly step
one byte at a time.

Rewrite it to always consume the full string, and it goes 8 bytes at a time:
[https://godbolt.org/z/xG7avr](https://godbolt.org/z/xG7avr)

In practice the best implementations will manually align and then unroll.
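
The rewrite amounts to replacing the early return with an accumulator; a minimal sketch (not the exact Godbolt code):

```cpp
#include <cstddef>

// Sketch of the full-consume rewrite: accumulate instead of returning
// early, so every iteration reads exactly one in-bounds byte and the
// compiler is free to unroll and process several bytes per iteration.
bool is_ascii_no_early_exit(const char* s, std::size_t size) {
    unsigned char acc = 0;
    for (std::size_t i = 0; i < size; ++i)
        acc |= static_cast<unsigned char>(s[i]);  // no branch in the body
    return (acc & 0x80) == 0;
}
```

With no data-dependent exit, the trip count is known at loop entry, which is what makes the transformation legal for the compiler.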

~~~
mehrdadn
> What if `size` was enormous and the first character was negative? One extra
> byte of lookahead might SIGSEGV! So it must gingerly step one byte at a
> time.

Pages aren't 1 byte each, they're 4KiB. You can't segfault just by reading 1
extra byte at an arbitrary address. Some alignment checks to handle the edges
should let the compiler read 8-byte-aligned chunks in the steady state and
fall back to byte-at-a-time only at the boundaries.

~~~
ridiculous_fish
The page size is a runtime property - for example recent iOS devices have 16KB
pages while older ones are 4k. There may be ways to make it safe, but it seems
to be firmly in the "compiler heroics" camp.

~~~
mehrdadn
Page sizes are easy to lower bound when you're compiling. You know the target
system you're compiling for. This most definitely is not in the "compiler
heroics" camp.

~~~
ridiculous_fish
IMO this is one of those fragile optimizations. If the optimization is
important, it should be explicit (hand-rolled asm) or else you risk breaking
it. If it's not important, then nobody cares.

Emitting OoB reads dependent on optimization level? Think about ASAN, debugger
watchpoints...can't be worth it!

clang has a cool feature that highlights missed vectorization opportunities:

    > clang++ -O3 -c test.cpp -Rpass-analysis=loop-vectorize
    test.cpp:5:4: remark: loop not vectorized: loop control flow is not understood by vectorizer [-Rpass-analysis=loop-vectorize]

love it!

~~~
BeeOnRope
In practice this type of over-read is not uncommon in library code (even for
standard routines), but I haven't seen compilers generate it. False positives
with memory sanitizers are definitely a problem: the hand-rolled library uses
can be whitelisted, but that's not feasible once the compiler starts
generating such reads in arbitrary code.

------
userbinator
Things like this are why I wish x86 would've exploited its CISC-ness to do
more autovectorisation in hardware, something like this:

    mov rsi, ptr
    mov rcx, size
    repz andb 80h ; while(rcx-- && !(al = *rsi++ & 0x80)) ;
    setz al

------
tyingq
You can see Mozilla iterating on the same example of is_ascii() here, using
SSE:
[https://bugzilla.mozilla.org/show_bug.cgi?id=585978](https://bugzilla.mozilla.org/show_bug.cgi?id=585978)

------
PudgePacket
Curious how this would apply to the Redis protocol. It's a text protocol that
is friendly to, and intended for, character-by-character parsing. The code at
the bottom of the linked page iterates over the characters of a byte stream.

[https://redis.io/topics/protocol](https://redis.io/topics/protocol)

~~~
56quarters
The Redis protocol uses length prefixes for bulk strings and arrays (per your
linked document). That lets it read multiple bytes at a time off the socket,
though it does seem like you'll still end up doing a bit of
character-by-character processing with the result.

    RESP uses prefixed lengths to transfer bulk data, so there is never a
    need to scan the payload for special characters like it happens for
    instance with JSON, nor to quote the payload that needs to be sent to
    the server.

    The Bulk and Multi Bulk lengths can be processed with code that performs
    a single operation per character while at the same time scanning for the
    CR character, like the following C code:
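
For illustration, parsing a bulk-string header in that one-operation-per-character style might look like this (a hypothetical sketch, not Redis's actual code; `parse_bulk_len` is a made-up name):

```cpp
#include <cstddef>

// Hypothetical sketch of parsing a RESP bulk-string header such as
// "$12\r\n": one operation per character, accumulating the decimal
// length while scanning for CR. Returns -1 on malformed input.
long parse_bulk_len(const char* buf, std::size_t n) {
    if (n < 4 || buf[0] != '$') return -1;
    long len = 0;
    std::size_t i = 1;
    while (i < n && buf[i] != '\r') {
        if (buf[i] < '0' || buf[i] > '9') return -1;
        len = len * 10 + (buf[i] - '0');
        ++i;
    }
    if (i + 1 >= n || buf[i + 1] != '\n') return -1;
    return len;
}
```

Once the length is known, the payload itself can be consumed in bulk with a single read of `len` bytes, which is the point of the prefixed-length design.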

------
jakeogh
Fast getdelim() replacement:
[https://github.com/ThePythonicCow/rawscan](https://github.com/ThePythonicCow/rawscan)

ebuild: [https://github.com/jakeogh/jakeogh/blob/master/dev-libs/raws...](https://github.com/jakeogh/jakeogh/blob/master/dev-libs/rawscan/rawscan-9999.ebuild)

------
RcouF1uZ4gsC
While a neat trick, there are some caveats. First of all,
is_ascii_branchless can be written in C++ as

    return std::all_of(sv.begin(), sv.end(),
                       [](uint8_t c){ return c < 128; });

Once you do that, the differences in code complexity between this and the
other solutions become even more stark.

Second, even the slow case is 2 GB/s. While the principle is helpful, the
extra complexity may not be worth it in the vast majority of real life code.

~~~
P-ala-din
A thing I want to stress is that you pay for the "complexity" just once, and
it gets amortized over all the projects that use this code.

This is a standard technique that is often used in string functions. You can
also see it in implementations of strlen.
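
The strlen variant of the idea is the classic has-zero-byte word test; a rough sketch (assuming at least 8 readable bytes past the current position, which real implementations guarantee by aligning to the word size first):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Sketch of word-at-a-time strlen: (w - 0x01...) & ~w & 0x80... is
// nonzero iff some byte of w is zero. NB: the 8-byte memcpy assumes the
// buffer extends at least 8 bytes past position i; a production version
// aligns first so the over-read never crosses a page boundary.
std::size_t strlen_swar(const char* s) {
    std::size_t i = 0;
    for (;;) {
        std::uint64_t w;
        std::memcpy(&w, s + i, 8);
        if ((w - 0x0101010101010101ULL) & ~w & 0x8080808080808080ULL) {
            // Some byte in this word is zero: locate it byte by byte.
            while (s[i]) ++i;
            return i;
        }
        i += 8;
    }
}
```

Same shape as the is_ascii trick: a cheap whole-word test in the hot loop, with the precise answer recovered only once, at the end.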

------
reagent_finder
Bitwise hacks in production code? Seriously?

This is a nice gimmick to show off in an interview because the interviewer is
definitely not going to get it. However, it might backfire, because the
interviewer might realize a) you'll be out to get his job, or b) you might
actually want to put stuff like this into production.

The idea is solid, don't get me wrong, and clever code is always clever code.
Also like u/kevingadd mentions, the principle behind the thought is an
important one, knowing when and how to handle data in bulk since you are
looking for a needle in the haystack. For instance SQL and streams in Java are
places where you need this kind of thinking.

But please, for the love of Gord, never put this bit of code in production.

~~~
jonstewart
Daniel Lemire’s code is most definitely in production, returning fast results
so your client code can burn CPU on JavaScript frameworks.

~~~
wruza
As if _we_ chose to burn cpu on specifically _javascript_ frameworks.

