Avoid character-by-character processing when performance matters (lemire.me)
76 points by ingve on July 25, 2020 | 51 comments



This is a great example of what I'd call "vectorization without vector instructions", though you could also just call it pipelining. You recognize that you're going to be operating on big chunks of data at once, so you make allowances to be able to only operate in chunks - like ensuring you can overrun the end of the buffer - or split your main loop out so there is a 'slow path' for the start/end of your operation and a fast path for the rest of it. Then you just operate on as much data as possible in one go and ignore things like branching, early out, etc. You can mask your results at the end or simply discard unused outputs. The '(running & 0x8080808080808080)' example in this article is perfect because it recognizes that in the end you only need to be sure that nothing failed, you don't need to know what and you don't need to know when.
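
For concreteness, here is a rough C sketch of that shape (not the article's exact code, and the function name is mine): OR eight bytes at a time into a running accumulator, sweep up the tail byte by byte, and only test the 0x80 bits once at the very end.

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    bool is_ascii_running(const char *s, size_t n) {
        uint64_t running = 0;
        size_t i = 0;
        for (; i + 8 <= n; i += 8) {
            uint64_t word;
            memcpy(&word, s + i, sizeof word);   /* one 8-byte load per iteration */
            running |= word;                     /* no branch, no early out */
        }
        for (; i < n; i++)
            running |= (unsigned char)s[i];      /* leftover tail bytes */
        return (running & 0x8080808080808080ULL) == 0;  /* did any byte set its top bit? */
    }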

Most of the time if you see people doing this it's using vector registers and newer instruction sets, because that's where you get 2-10x speedups by doing it. But even on ARM and x86 without using vector registers, careful pipelining and vectorization can deliver pretty sizable speedups. It increases cache locality, cuts down on branches, and reduces register pressure.

I ended up working on a project a while ago where we had to port a game's software rasterizer to multiple targets, including a low-spec ARM device. Instead of trying to hand-code a NEON implementation (I don't remember for sure if it even had NEON...) I started by just vectorizing it to operate on groups of 4 pixels instead of one (because I could fit 4 pixels in a single regular-sized integer register) so that each iteration of the main loop was doing more work and spending less time waiting on memory. Just that alone ended up improving performance considerably at the kind of clock speeds I was dealing with, and on higher-spec x86 cores it was still a slight improvement for about a day of hacking and staring at disassembly.
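
(Nothing to do with that rasterizer specifically, but to give the flavor of the trick: with 8-bit channels you can, for example, halve four packed bytes in one ordinary 32-bit operation, as long as you mask off the bits that would otherwise leak between lanes. A tiny hypothetical example:)

    #include <stdint.h>

    /* Hypothetical example: darken four packed 8-bit values at once.
       Shifting right by one would let each byte's low bit leak into its
       neighbour, so mask those bits off afterwards. */
    uint32_t darken4(uint32_t pixels) {
        return (pixels >> 1) & 0x7F7F7F7FU;
    }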


To be perfectly honest what I want to see, for the specific C++ case cited, is std::all_of with a suitable predicate. Let the compiler figure it out.

Plugged it into compiler explorer and clang 10 does unroll the std::all_of version, doing four chars at a time.


This doesn't generate anything close to what the author is talking about: it still does character-by-character comparisons. Sure, it has unrolled the loop (but this has nothing to do with std:: and everything to do with what optimizations are enabled by clang at various -O levels), but it hasn't combined 8 individual checks into a single 64-bit wide check of the same cost.


All true. But I’d expect an exceptional justification for the optimization in the article, balanced against readability and maintainability.


Agreed. As a high performance library function such tricks are par for the course, but you'd hope it wouldn't be sprinkled through ordinary code without good reason.

The best of both worlds would be having the compiler transform the canonical std:: version into the fast one under the covers, but that's far from universal with current compiler tech and languages.


Clang 10 appears to unroll the loop 4x, but does not perform the optimization from the blog post where comparisons are made 64 bits at a time (rather than 8 bits): https://godbolt.org/z/oefzhx


Exactly. I was surprised to see that looong for-loop definition (I don't use C++ at all) and wanted to see if there was anything simpler. First I found for_each, then all_of.

If anyone is interested, here is simpler version:

    bool is_ascii_branchy(const std::string_view v) {
        return std::all_of(v.begin(), v.end(),
                           [](auto i) { return uint8_t(i) < 128; });
    }


STL is great! How’s the performance, though, compared with TFA’s optimistic and hybrid versions?


And for a fair comparison, here's an imperative loop that's way cleaner than what TFA used:

    bool is_ascii_branchy(std::string_view v) {
        for (auto c : v) if (uint8_t(c) >= 128) return false;
        return true;
    }


So I guess this would be the best way to write it if you have a set of long strings, most of which are non-ASCII and contain a high proportion of non-ASCII characters? In that case the early exit will really help.


I'm somewhat surprised that compilers don't optimize this automatically. I thought at first that it might be something C++-related, but I tried a straight C equivalent[1] with Godbolt and it wasn't optimized by any of the main compilers either. Is there a standards-related reason why this can't be optimized to work on chunks of 8-byte ints? Or even longer vectors? Or is this just proof that there are so many possible optimizations that a compiler can't possibly do all of them?

[1] https://godbolt.org/z/WfPWEz
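
(For anyone not following the link, the C equivalent is presumably something along these lines; the signature is a guess, though the name matches the one quoted further down the thread.)

    #include <stdbool.h>
    #include <stddef.h>

    /* Guessed reconstruction, not necessarily the exact Godbolt code: */
    bool c_is_ascii_branchy(const char *string, size_t size) {
        for (size_t i = 0; i < size; i++) {
            if (string[i] < 0)       /* assumes plain char is signed, as the thread does */
                return false;        /* early exit at the first non-ASCII byte */
        }
        return true;
    }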


This particular code cannot be effectively optimized because of the early exit inside the loop. What if `size` was enormous and the first character was negative? One extra byte of lookahead might SIGSEGV! So it must gingerly step one byte at a time.

Rewrite it to always consume the full string, and it goes 8 bytes at a time: https://godbolt.org/z/xG7avr

In practice the best implementations will manually align and then unroll.
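
(A minimal sketch of that rewrite, not the exact code in the Godbolt link, with a name I made up: accumulate over the whole string and test once at the end, so there is no early exit for the compiler to worry about.)

    #include <stdbool.h>
    #include <stddef.h>

    bool c_is_ascii_no_early_exit(const char *string, size_t size) {
        unsigned char acc = 0;
        for (size_t i = 0; i < size; i++)
            acc |= (unsigned char)string[i];   /* collect every byte's bits */
        return (acc & 0x80) == 0;              /* was any top bit ever set? */
    }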


More interestingly: if I use "good old-school C89" char instead of bool and make the loop body branchless, I get actual vector instructions that appear to be consuming 256 bytes at a time:

https://godbolt.org/z/57766b

Wow!

If I use bool and keep the body branchless (z |= (string[i] < 0);), the vector instructions disappear and it goes back to doing only 4 bytes at a time. That is very odd.


256 bits (=32 bytes)?

EDIT: Oh I see now, it's an unrolled loop, each iteration of which uses eight 256-bit vector instructions for a total of 256 bytes.


> This particular code cannot be effectively optimized because of the early exit inside the loop. What if `size` was enormous and the first character was negative? One extra byte of lookahead might SIGSEGV! So it must gingerly step one byte at a time.

If `size` was enormous and the first character "was negative"¹, then one extra byte of lookahead would never SIGSEGV; `size` is enormous, and by definition, that is presumably more than our lookahead.

If you mean, "the first character of the lookahead's block" then it still doesn't make sense. Even in the case where we exit later, we still can't access later bytes of the lookahead unless we know them to be in bounds.

In the example Godbolt you linked to, the compiler is doing 8 byte jumps with the index. (It's not doing a great job, as we're still reading a byte at a time. But the bounds checks and the index increment will get reduced.) But, what the compiler is doing in your example could also be done in the early exit example. It'd be basically the same code, but replace .LBB0_10 with something like,

    cmpb $0, (%rdi,%rdx)
    js do_exit
    cmpb $0, 1(%rdi,%rdx)
    js do_exit
    # etc. for the remaining 6 reads; if all 8 pass, jump back to the top of the loop
  do_exit:
    xorl    %eax, %eax
    retq
This is very similar to your Godbolt, which also does 8 individual memory accesses. We know they won't segfault here for the same reason we know they won't in your Godbolt: we check ahead that we can access the next 8 bytes and abort to the non-unrolled version if not.

The cmp/js's are very likely not as efficient as the mov/or's, but that's not the point; the point is the unroll is possible.

It is not the early exit that prevents a loop unroll. Something else is stopping it: either the compiler does not think it worth it, is not sufficiently intelligent, or some arcana of C prevents it.

(Your example alternates between mov and or, in a very interesting pattern, which makes me suspect it also knows something there, and perhaps more opportunity.)

¹This is an ugly example, since the implementation is free to have a char with range 0-255.


> one extra byte of lookahead would never SIGSEGV; `size` is enormous, and by definition, that is presumably more than our lookahead

But `size` is just an arbitrary number, while the code as written will not proceed past the first negative char. For example, some jerk may write:

    c_is_ascii_branchy("\xFF", a_gazillion);
but the function has to stop at the first char.

> It is not the early exit that prevents a loop unroll

It really is!


>> It is not the early exit that prevents a loop unroll

> It really is!

I think that's the dispute: does an early exit prohibit the compiler from generating code with additional reads? Or is the actual prohibition about causing segfaults? Intuitively, I'd think that so long as it produces the correct answer and avoids visible side effects (like segfaults), any number of additional reads from anywhere would be legal, but I don't know the spec well enough to claim this is actually correct.


Without any additional tricks, a vectorized loop could fail in practice with a segfault, which is certainly observable.

However, a compiler could align the memory accesses to avoid this problem on a particular platform, say based on its knowledge of the 4k protection granularity on x86. For example, if you 32-byte align your reads, and the source algorithm reads the first byte in a 32-byte chunk, it's safe to read the remaining 31.

Actual implementations of things like strlen do exactly this, but these are generally hand-written: I haven't seen compilers do it during autovectorization. One problem is that it produces false positives in tools like Valgrind.
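
To make the alignment argument concrete, here is roughly what those hand-written routines do, written out at the source level. To be clear, this is a sketch under stated assumptions, not portable conforming C: the final load deliberately reads past the buffer, it assumes a little-endian target whose page size is a multiple of 8, and the name is mine. It's the kind of thing a libc does by hand, and exactly what Valgrind then complains about.

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    #define HIGH_BITS UINT64_C(0x8080808080808080)

    bool is_ascii_aligned_overread(const char *s, size_t n) {
        const char *end = s + n;

        /* head: byte-by-byte until the pointer is 8-byte aligned */
        while (s < end && ((uintptr_t)s & 7) != 0) {
            if ((unsigned char)*s & 0x80) return false;
            s++;
        }
        /* body: whole 8-byte chunks that are entirely in range */
        while ((size_t)(end - s) >= 8) {
            uint64_t w;
            memcpy(&w, s, 8);
            if (w & HIGH_BITS) return false;
            s += 8;
        }
        /* tail: one aligned load that deliberately over-reads, then masks off
           the out-of-range bytes.  An aligned 8-byte load never crosses an
           8-byte boundary, so it cannot cross a page boundary either, and the
           over-read cannot fault even though it is "illegal" C. */
        if (s < end) {
            uint64_t w = *(const uint64_t *)s;
            w &= ~(~UINT64_C(0) << (8 * (size_t)(end - s)));  /* keep real bytes (little-endian) */
            if (w & HIGH_BITS) return false;
        }
        return true;
    }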

Bonus fact: icc actually breaks the rules in similar scenarios and vectorizes (without any alignment) loops that gcc and clang don't, and may produce writes where none existed in the source, causing segfaults or concurrency bugs in correct programs. So any time you see icc successfully vectorize a loop that other compilers don't, you have to wonder whether it was smarter, or simply breaking the rules.


> Without any additional tricks...

Yes, the question is about which "additional tricks" are legal according to the C standard. That is, what are the limits of what is technically allowed if you are wearing nothing but your language lawyer hat? Is the rule that anything goes as long as you don't segfault, or something more stringent? For that matter, is a segfault actually evidence that you have broken the rules? Note that this is different from the rules for "undefined" behavior, as those pertain to the C source code rather than the generated assembly.

My impression is that the C standard assumes a hypothetical virtual machine that does not specify details like page sizes. I think this means that according to the standard, over-reads must be legal or illegal for a conforming compiler regardless of whether they cause a segfault on a particular real-world architecture. Or maybe the C standard doesn't make any claims as to whether the generated code will actually run on any particular architecture?

> or simply breaking the rules

Which rule? My conclusion right now is that maybe there isn't one, and that the standard doesn't actually guarantee that a conforming compiler will generate code that runs to completion without segfault in the real world. There are simply too many differences in actual CPUs to make any firm guarantees. It's obviously fair to consider a compiler "buggy" if it does this, but I'm not sure it's noncompliant.


> Yes, the question is about which "additional tricks" are legal according to the C standard. That is, what are the limits of what is technically allowed if you are wearing nothing but your language lawyer hat?

I probably should have been clearer, but it is my position that the suggested trick (alignment) is definitely allowed by the standard. As you point out, the standard is working at a much higher level and is never going to say anything about memory protection, page size, etc. The program just has to produce the side effects required by the standard (output, etc.), and whatever it wants to do to achieve that is fine. Even with my language lawyer hat firmly in place, I can't see any possible reason it would be disallowed: the standard doesn't even have the tools or language in place to start talking about it.

> I think this means that according to the standard, over-reads must be legal or illegal for a conforming compiler regardless of whether they cause a segfault on a particular real-world architecture. Or maybe the C standard doesn't make any claims as to whether the generated code will actually run on any particularly architecture?

The standard makes no guarantees about any of it. It basically just says "these side effects must occur". How that is implemented is totally up to the compiler and runtime environment.

Note though that what I'm suggesting will "always work" in the sense that a compiler targeting x86 would produce some x86-specific code that uses alignment and knowledge of the page boundary to avoid faults. This won't fail on some other architecture, because the code wouldn't be compiled in that way on a different architecture. It's not much different than the compiler using x86-specific instructions when compiling for x86: obviously it must do this and this can't work on some other uarch.

Added: I should be clear that this is a valid transformation for the compiler to make, but it is not conforming for a programmer to write a program that does this at a source level!

That is, the compiler can take a conforming program that doesn't make any over-reads at the source level, and compile it to assembly that makes "known safe" over-reads at the assembly level.

However, if you wrote the equivalent of the transformed algorithm yourself, in the source, it would not be a conforming program, because out-of-bounds reads are definitely disallowed by the standard, regardless of what you "know" about the target platform. This sort of asymmetry between what the compiler can do and what you can validly write in the source language is pervasive.

> Which rule?

I was a bit loose in my language. I mean the most important rule: that a conforming program must produce the side effects that would be produced by the abstract machine. Informally, that it "works at all rather than simply crashing".

You can write a simple, conforming, program in icc that produces "Segmentation fault" rather than "Hello world" or whatever, which is what other compilers produce and what the abstract machine will produce. This is breaking the rules (rule #1, basically).


> but the function has to stop at the first char.

Yes, and my example of an unrolled version of the early exit loop does stop at the first character, despite being unrolled.

(That is an interesting example, though, and I think is a good proof that generating extra reads is not permitted.)


> What if `size` was enormous and the first character was negative? One extra byte of lookahead might SIGSEGV! So it must gingerly step one byte at a time.

Pages aren't 1 byte each, they're 4 KiB. You can't just segfault by crossing 1 byte at any arbitrary address. Some alignment checks should let the compiler read 8-byte-aligned chunks in the steady state and only fall back to byte-at-a-time handling at the edges.


The page size is a runtime property - for example recent iOS devices have 16KB pages while older ones are 4k. There may be ways to make it safe, but it seems to be firmly in the "compiler heroics" camp.


Page sizes are easy to lower bound when you're compiling. You know the target system you're compiling for. This most definitely is not in the "compiler heroics" camp.


IMO this is one of those fragile optimizations. If the optimization is important, it should be explicit (hand-rolled asm) or else you risk breaking it. If it's not important, then nobody cares.

Emitting OoB reads dependent on optimization level? Think about ASAN, debugger watchpoints...can't be worth it!

clang has a cool feature that highlights missed vectorization opportunities:

    > clang++ -O3 -c  test.cpp -Rpass-analysis=loop-vectorize
    test.cpp:5:4: remark: loop not vectorized: loop control flow is not understood by vectorizer [-Rpass-analysis=loop-vectorize]
love it!


In practice this type of over-read is not uncommon in library code (even for standard routines), but I haven't seen compilers generate it. False positives with memory sanitizers are definitely a problem: the hand-rolled library uses can be whitelisted, but that's not feasible once the compiler starts generating it in arbitrary code.


> Page sizes are easy to lower bound when you're compiling. You know the target system you're compiling for.

Do you really know the target system?

Most of the time, you want the compiled code to work not only on the current system, but also in future systems of the same lineage. What prevents a future version of the architecture from for instance lowering the translation granule from 4096 bytes to 512 bytes, or even lower?


This would be a backwards-incompatible and massively breaking change on most systems that define a minimum page size. These minimum page sizes are generally treated as architecturally guaranteed, or are explicitly so.


If you're at byte 4095 and you advance by 1 byte and the next page is a guard page, you're dead.

You can minimize the risk of this by requiring that input addresses be aligned (and many ABIs/APIs do require this) but it's realistic to say that an extra byte of lookahead can and will SIGSEGV in unfortunate cases.


What I'm saying is that the compiler can generate code that examines the memory address and looks ahead only when it's safe to do so.


(And for those who think this is strange, memcpy implementations usually do this, so there is precedent that this is very sane.)


memcpy doesn't have this same problem because it is known-length without any early exit.


I don’t think the compiler can assume the entire memory range from string to string+size (exclusive) can be read (the ‘C machine’ only reads it up to the first byte ≥ 128), but even if so, the compiler doesn’t know whether bytes ≥ 128 are rare or common, nor does it know the distribution of buffer lengths.

⇒ I think you should only start complaining if the compiler had pragmas to indicate that returning ‘false’ is extremely rare and that size can run in the kilobytes, and the program used them.


Is the optimizer allowed to add extra memory accesses to locations that otherwise wouldn’t have been accessed? Because that’s required here.


In another language, perhaps. It'd definitely be introducing undefined behavior if you did it to C or C++, and it would likely cause intermittent crashes.


No, what they mean is this: say we have a block read of 8 in the unrolled version, and all 8 upcoming indexes are valid memory locations, i.e. a read would never segfault and never cause UB. Say they are:

  20  20  20  20  ff  20  20  20
                  ^^
The ff is where the naïve algorithm would abort. But some versions of an unrolled loop, if they do an 8-byte move to a register, would read/access all 8 bytes. The naïve version won't; it'll stop at the ff, never accessing the last 3 bytes. Even if we can guarantee the bytes are valid, does C somehow guarantee that the accesses would not occur?

But, even with that, you can still unroll the loop to:

  bounds check the next 8 bytes (is idx + 8 < size)
  read 1 byte, abort if not ascii
  read 1 byte, abort if not ascii
  etc.
It isn't as efficient as reading 8 bytes in one fell swoop and using math magic to process them all in one pass. But it still means that, so long as we remain in this unrolled version, we're eliding 7 out of 8 bounds checks and increments to the index.
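
For what it's worth, here is that unroll written out as real C (the name is mine, and the inner 8 checks are shown as a loop for brevity): one bounds check per 8-byte block, still byte-sized reads and still an early exit, so no out-of-bounds access is introduced anywhere.

    #include <stdbool.h>
    #include <stddef.h>

    bool c_is_ascii_unrolled(const char *string, size_t size) {
        size_t i = 0;
        while (size - i >= 8) {                    /* one bounds check per block */
            for (size_t j = 0; j < 8; j++)         /* imagine this written out 8 times */
                if ((unsigned char)string[i + j] & 0x80)
                    return false;                  /* early exit, as in the original */
            i += 8;
        }
        for (; i < size; i++)                      /* leftover tail, byte by byte */
            if ((unsigned char)string[i] & 0x80)
                return false;
        return true;
    }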


The benefit of unrolling like this is quite small: just some reduction of loop overhead. The primary trick demonstrated here is combining 8 checks into a single operation.


Oh, absolutely, the trick demonstrated in the article is significantly better. (That I had to elide 75% of the instructions in the pseudo-code that I wrote is practically proof of that.) My point was only that it is possible to unroll the loop without introducing UB.


Things like this are why I wish x86 would've exploited its CISC-ness to do more autovectorisation in hardware, something like this:

    mov rsi, ptr
    mov rcx, size
    repz andb 80h ; while(rcx-- && !(al = *rsi++ & 0x80)) ;
    setz al


You can see Mozilla iterating on the same example of is_ascii() here, using SSE: https://bugzilla.mozilla.org/show_bug.cgi?id=585978


Curious how this would apply to the Redis protocol. It's a text protocol that is friendly to, and intended for, character-by-character parsing. The example code at the bottom of the page iterates over the characters of a byte stream.

https://redis.io/topics/protocol


The Redis protocol uses length prefixes for bulk strings and arrays (per your linked document). That enables it to read multiple bytes at a time off the socket, though it does seem like you'll still end up doing a bit of character by character processing with the result.

  RESP uses prefixed lengths to transfer bulk data, so there is never a need to scan the payload for special characters like it happens for instance with JSON, nor to quote the payload that needs to be sent to the server.

  The Bulk and Multi Bulk lengths can be processed with code that performs a single operation per character while at the same time scanning for the CR character, like the following C code:


> After the first CR is identified, it can be skipped along with the following LF without any processing. Then the bulk data can be read using a single read operation that does not inspect the payload in any way.

Sounds like it only does the character-by-character processing for a dozen or so characters (to read the command, control chars, and the length). And I think the character-by-character processing is also converting the string "123" to an integer, so the trick of using &&/|| might not work.
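
(Roughly what that per-character length parse looks like; this is a hypothetical sketch with made-up names, not Redis's actual code: accumulate the decimal value while scanning for the CR.)

    #include <stddef.h>

    /* Hypothetical sketch: parse the decimal bulk length from p, stopping at CR.
       Returns the length and sets *used to the bytes consumed including CRLF.
       No error handling, just the shape of the per-character work. */
    long parse_bulk_len(const char *p, size_t n, size_t *used) {
        long len = 0;
        size_t i = 0;
        while (i < n && p[i] != '\r') {
            len = len * 10 + (p[i] - '0');   /* "123" -> 123, one char at a time */
            i++;
        }
        *used = i + 2;                       /* skip the CR and the LF after it */
        return len;
    }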



While it's a neat trick, there are some caveats. First of all, is_ascii_branchless can be written in C++ as

    return std::all_of(sv.begin(), sv.end(), [](uint8_t c) { return c < 128; });
Once you do that, the difference in code complexity between this and the other solutions becomes even more stark.

Second, even the slow case is 2 GB/s. While the principle is helpful, the extra complexity may not be worth it in the vast majority of real life code.


A thing I want to stress is that you pay for the "complexity" just once, and it gets amortized over all projects that use this code.

This is a standard technique that is often used in string functions. You can also see it in implementations of strlen.


Bitwise hacks in production code? Seriously?

This is a nice gimmick to show off in an interview because the interviewer is definitely not going to get it. However, it might backfire because the interviewer might realize that a) you'll be out to get his job, or b) you might actually want to put stuff like this into production.

The idea is solid, don't get me wrong, and clever code is always clever code. Also like u/kevingadd mentions, the principle behind the thought is an important one, knowing when and how to handle data in bulk since you are looking for a needle in the haystack. For instance SQL and streams in Java are places where you need this kind of thinking.

But please, for the love of Gord, never put this bit of code in production.


> Bitwise hacks

There's nothing hacky about this — it's a very simple procedure.

> the interviewer is definitely not going to get it,

I sure hope any interviewer for a software engineering position would get it. But this has nothing to do with interviews.


This isn't really about interview questions.

In the real world, this kind of approach is critical to making effective use of compute resources. This is the kind of stuff you want in production.


Daniel Lemire’s code is most definitely in production, returning fast results so your client code can burn CPU on JavaScript frameworks.


As if we chose to burn CPU specifically on JavaScript frameworks.



