Hacker News new | comments | show | ask | jobs | submit login
A Branchless UTF-8 Decoder (nullprogram.com)
303 points by zdw 11 months ago | hide | past | web | favorite | 107 comments

The core of the algorithm shows that UTF-8 is actually big-endian in its ordering of the bits, making it somewhat more difficult to implement efficiently on the usual little-endian machine --- the first byte, at the lowest address, actually contains the most sigificant bits. Had it been the other way around, the final shift would not be length-dependent.

Also, none of the compilers I tried at https://godbolt.org/ were able to combine the 4 s[0...3] 8-bit reads into one 32-bit read, which is somewhat disappointing. They generated 4 separate byte-read instructions, one for each s[0..3]. Yes, I know it could be unaligned, but this is x86 where it would still be faster than 4 separate reads. You would need to do this instead (and sacrifice endianness/alignment-independence):

    uint32_t v = *(uint32_t*)s;
    *c  = (uint32_t)(v & masks[len]) << 18;
    *c |= (uint32_t)((v>>8) & 0x3f) << 12;
    *c |= (uint32_t)((v>>16) & 0x3f) <<  6;
    *c |= (uint32_t)((v>>24) & 0x3f) <<  0;
    *c >>= shiftc[len];
We're still left with a bunch of shifts, ands, and ors, the purpose of which is to essentially "compact" the 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx bitstring into one 21-bit codepoint by removing the bits in the middle. All the compilers I tried weren't so clever as to do this (assuming eax = v from above after masking, and cl = slightly modified shiftc[len]) at any optimisation level:

    bswap eax
    mov ebx, eax
    and ebx, 00FF00FFh
    lea ebx, [ebx + ebx*2]
    add eax, ebx
    shr eax, 2
    shl ax,4
    shr eax, cl
I played around with optimising UTF-8 decoding a while ago, and the above resembles one of the fastest and smallest ways to "compact" the 4 bytes into a codepoint.

Of course, this bit string manipulation would be much faster and trivial in hardware (essentially rearranging wires), which makes me want a LODSUTF8 instruction that operates like LODSB but reads a whole codepoint instead, and sets CF on error...

One thing you could do to optimize further (on CPUs with the BMI2 instruction set) is to use the PEXT instruction to perform the bit extract operation (after a BSWAP of course). The whole operation could be something like (untested):

    ; Parameters: RDI - pointer to input character
    ; Return value: RAX - Unicode codepoint, or -1 for error
    lea r8, [encoding_table] ; Load pointer to table base using rip-relative addressing

    mov edx, [rdi] ; unaligned (over)read of 32-bit utf8 codepoint
    mov ebx, edx   ; Copy to EBX to construct our table index
    and ebx, 0xf8  ; Mask off the (potential) data bits
    ; Now we want to convert the top 5 bits into an index into a table of three * 32-bit entries
    ; Currently EBX = index * 8, we need index * 12, so we'll divide by two and then multiply by 3
    shr ebx, 1
    lea ebx, [ebx + ebx * 2]
    bswap edx      ; Get the codepoint into big-endian representation

    pext ecx, edx, [ebx + 4] ; extract padding bits using mask    
    pext eax, edx, [ebx] ; extract data bits
    cmp ecx, [ebx + 8] ; check padding bits
    cmovnz eax, -1
    ; 00000xxx - 01111xxx
    times 16 dd 0x7F, 0x80, 0x00
    ; 10000xxx - 10111xxx (invalid)
    times 8 dd 0, 0xFFFFFFFF, 0 ; The padding will never match here, forcing an error return
    ; 11000xxx - 11011xxx (two bytes) - padding value 11010
    times 4 dd 0x1F3F, 0xE0C0, 0x1A
    ; 11100xxx - 11101xxx (three bytes) - padding value 1110 1010
    times 2 dd 0x0F3F3F, 0xF0C0C0, 0xEA
    ; 11110xxx (four bytes) - padding value 111 1010 1010
    dd 0x073F3F3F, 0xF8C0C0C0, 0x7AAAA

As a followup, I went and implemented this for fun: https://github.com/bdonlan/branchless-utf8/blob/master/test/...

Performance seems to be quite good compared to the implementations from the article (577.333 MB/s vs the author's implementation's 410 MB/s on my hardware)

Pretty cool, a recent development, only in Haswell and later architectures (2013 and later.)


...and on the Haswell PEXT runs in one uop with a latency of 3! That is nothing short of amazing for an operation which, from its description, would seem to require a microcode loop or at least a few more cycles to collect an arbitrary number of bits with arbitrary gaps between them:


The only unfortunate thing is that it was introduced quite recently in terms of x86 history (instead of being present early on but just microcoded), so earlier software can't take advantage of it nor any subsequent improvements, and uses a pretty complex encoding (VEX) only usable in protected mode. But if anything, this is another datapoint in the argument in favour of CISC --- try doing this on a MIPS, RISC-V or even ARM! With less powerful ISAs, even something as simple as the "shl ax, 4" (shift the lower 16 bits only) turns into a multi-instruction sequence of masking and combining.

ARMV8 has bit field extract/insert, which are fairly powerful, and I'd say more generic than "shl ax,N", which is really only there for legacy reasons (it's an instruction with 16-bit operator prefix).

RISC generally tries to keep instructions as generic/useful as possible to keep silicon small.

I was thinking about this today. Imagine treating it N bits at a time. N=4, perhaps. 16 cases, so it won't take up much space, and you can just write out each case independently. Perhaps it's easy to have a table? - then you might do 8 bits at a time, maybe.

This suboperation does an N-bit version of the total operation: takes N bits of the mask, the corresponding N bits of the input, and produces an S-bit result, R. (S is the population count of that part of the mask.)

Suppose N=4. Then for a 64-bit value, you can do 16 of these in parallel, and you've got 16 intermediate results, R0...Rf, and 16 result sizes, S0...Sf. Then combine them. Overall result = R0 | R1<<S0 | R2<<S0+S1 | R3<<S0+S1+S2 | ... | Rf<<S0+S1+S2+...+Se.

I've no idea whether this is actually easy to implement this way, but at least it requires no loop ;)


12-bit example where N=4.

Mask is %000101011000, and value is %lkjihgfedcba. (I've labelled each bit of the value by its position, since that's the key part.)

Intermediate results: R0=%000d, S0=1. R1=%00ge, S1=2. R2=%000i, S2=1.

R0's contribution to the overall result = R0 = %d.

R1's = R1<<S0 = R1<<1 = %ge0.

R2's = R2<<S0+S1 = R<<3 = %i000.

So the overall result is %000d|%ge0|%i000 = %iged.

You have to use memcpy with length 4 to read that uint32_t from something that was a char pointer, those casts are not allowed anymore. On x86 it would become a single mov instruction.

Are you sure about this?

I did this recently in C++ and it worked fine: cast a uint8_t* to a uint32_t* and read data as 32 bit ints. I had to use reinterpret_cast in C++ though and of course I had to swap words to get the correct ordering.

Pretty sure it's UB if the uint32_t* doesn't point to a correctly-aligned address. (IIRC, just computing the unaligned pointer is already UB before you dereference it.) On x86, the code might appear to work, though, if you don't happen to trigger autovectorization.

The correct way to do an unaligned read is memcpy.

To give an example, iirc on sparc, unaligned access traps. And iirc, on older arm (before armv7), it would read at the clamped aligned address, and rotate the bits to match the unaligned address (eg if your aligned data is abcdef and you read 4 bytes at offset 2, you actually read cdab). I think you can still get this behavior on armv7 if you want.

I think even if you happen to know that the address is always aligned, but the compiler can't tell that it's guaranteed to be always aligned, the compiler can miscompile the code. Even without autovectorization.

If it's correctly aligned it will always work. The compiler must assume that your cast declares the proper type and behave accordingly. This is fundamental to C and (AFAIU) C++. The cast is the final word on the matter, but it's up to you to ensure the cast is telling the truth, the whole truth, and nothing but the truth.

What the compiler cannot glean from casts are alias relationships, which are important for ensuring the order of operations are correctly maintained. As long as you're not mutating aliased objects, it's irrelevant. But if you are mutating aliased objects, you usually should derive the alias through a union with an in-scope definition. That way the compiler, by backtracking the derivation of the reference through the union, will recognize that the two objects of different types alias and ensure that loads and stores are properly ordered.

Another way to signal alias relationships relevant to UTF-8 decoding is through pointers to char or unsigned char. Those types can legally alias anything, and the compiler must take note. Sometimes it's useful to cast an address to, e.g., (char *) even when using memcpy(), even though it's unnecessary in C for the immediate expression because of automatic void pointer conversions. I can't think of a concrete example at the moment to explain the point, but it's something to keep in mind. Note that int8_t and uint8_t are not necessarily the same type as char or unsigned char, even when they have the same representation, and thus aren't required to express to the compiler the possibility of aliasing with other types (that is, mutating a uint8_t object doesn't necessarily tell the compiler that another object of type uint32_t might have been mutated, while mutating an unsigned char object would have in the same context).

Oh in that case I understand. I'm doing pointer arithmetic on guaranteed aligned data so it compiles/works fine.

Yes, I'm sure. Here is a blog post by one of the leading professors of undefined behavior about this:


(edited to better blog post by same author)

AFAIK, and also in the examples shown with that blog post, all the problems with strict aliasing involve reading/writing the same object with separate differently-typed pointers. In my example, I'm only accessing s as a uint32_t*, so there is nothing to "alias" per se.

But `s` is defined somehow, and it is not defined as a `uint32_t * ` (or else you would not need the cast). It is usually defined as `char * `. The memory being read is reachable by dereferencing `s`, a `char * `, and by dereferencing `(uint32_t * )s`, a `uint32_t * `. Thus, type punning and aliasing. Each byte of memory should be reachable only by pointers/array indexes of a single type (and anything that is not a `char * ` can be accessed by `char * `).

That works fine if and only if uint8_t is an alias of char, as otherwise it violates the strict aliasing rule.

cast != read.

Deferencing a pointer of incorrect alignment will generate traps on many architectures.

Again, the pointers are to addresses that are guaranteed to be aligned on my target architecture.

Nonetheless, I have taken the advice in this thread and converted my code to use std::memcpy instead.

He is correct. It is undefined behavior—memcpy should be used.

C++11 § 5.2.10, paragraph 7 says:

An object pointer can be explicitly converted to an object pointer of a different type. When a prvalue v of type “pointer to T1” is converted to the type “pointer to cv T2”, the result is static_cast<cv T2>(static_cast<cv void>(v)) if both T1 and T2 are standard-layout types (3.9) and the alignment requirements of T2 are no stricter than those of T1, or if either type is void. Converting a prvalue of type “pointer to T1” to the type “pointer to T2” (where T1 and T2 are object types and where the alignment requirements of T2 are no stricter than those of T1) and back to its original type yields the original pointer value. The result of any other such pointer conversion is unspecified

> those casts are not allowed anymore

As far as I know, aliasing rules are nothing but new. You can cast _to_ char, but that's pretty the sole cast allowed in C [note: void is a kind of neutral, transitional type, and is not a cast per se]

I've been puzzling over your assembly code for a while. Assuming you just had the useful bits in eax and had just executed bswap:

MSB .... byte1... byte2... LSB

00000xxx 00xxxxxx 00xxxxxx 00xxxxxx

How does multiplying (byte1 | LSB) * 3 help? I'm asking about lea ebx, [ebx + ebx*2], which is a shortcut for multiplying by 3.

Genuinely lost here, the 6 bits of byte1 and the 6 bits of LSB are separated by enough 0 bits that a multiplication by 3 won't mix them up. But within the 6 bits of byte1 and within the 6 bits of LSB, the multiplication will mean each bit is the sum of itself, the lower order bit, and any carried bits. What is that for?

Thanks for any tips!

The hint is to look at the next instruction.

Spoiler: https://pastebin.com/msjkXkKd

Yes, that makes sense now. Thanks!

Edit: I fiddled around a bit and came up with:


The nullprogram.com version is 0x185 bytes (gcc 4.8.4 x86_64 -O3; objdump -s -j .text -j .rodata)

I switched to __builtin_clz (with cross platform support). I hesitate to put in your assembly code, though it would work great! My version still sticks to C and not inline assembly.

My version takes 0xeb bytes.

Benchmarking with clang-802.0.42 -O3 -march=native on a Core i7-4770HQ @ 2.20GHz:

    branchless: 490.666725 MB/s, 0 errors
    0xeb small: 352.000042 MB/s, 0 errors
    Hoehrmann:  337.333374 MB/s, 0 errors
With gcc 4.8.4 -O3 -march=native on a Xeon E5-1650v3 @ 3.50GHz:

    branchless: 436.000052 MB/s, 0 errors
    0xeb small: 345.333375 MB/s, 0 errors
    Hoehrmann:  412.000049 MB/s, 0 errors
[1] https://stackoverflow.com/questions/19527897/how-undefined-a...

I was able to reduce it to 0xe9 bytes and I'm getting:

clang-802.0.42 -O3 -march=native on a Core i7-4770HQ @ 2.20GHz:

    branchless: 490.666725 MB/s, 0 errors
    0xe9 small: 380.000045 MB/s, 0 errors
    Hoehrmann:  337.333374 MB/s, 0 errors
With gcc 4.8.4 -O3 -march=native on a Xeon E5-1650v3 @ 3.50GHz:

    branchless: 436.000052 MB/s, 0 errors
    0xe9 small: 405.333382 MB/s, 0 errors
    Hoehrmann:  412.000049 MB/s, 0 errors

Processors will never read anything under a cache line, so it doesn't really matter (as memory access goes)

> Yes, I know it could be unaligned, but this is x86 where it would still be faster than 4 separate reads.

You have to worry about memory access violations too.

From the article:

"One important consequence of always reading four bytes is that the caller must zero-pad the buffer to at least four bytes. In practice, this means padding the entire buffer with three bytes in case the last character is a single byte."

In other words, both the byte-by-byte and dord-at-a-time code assume none of the bytes being read will be at an invalid address.

Oh, sorry, I was confused about what exactly you were replacing.

Where in the C standard are memory access violations specified as a language feature?!

"memory access violations" are reading beyond allocated storage for a variable. You're not allowed to dereference or even reference locations outside storage allocated for variables, except one past the last item of an array which can be referenced in a comparison (but cannot be dereferenced). Stack variables, memory allocated by malloc and memory allocated by mmap all conceptually behave this way.

... which is why trying the access is undefined behaviour, which is why there are no constraints on what the compiler has to do for that case. The standard says nothing about what the compiler has to do in the case that you try to dereference a pointer that points outside an object, therefore it may assume that you don't do so, therefore it may combine the four byte loads into one 32-bit load.

If you're worried about the actual C standard then you hit undefined behavior the moment you dereference a uint32_t* pointing to data that's actually char.

Actually, this is false. The language specifically allows aliasing char (both signed and unsigned) with other types as an exception to type-based aliasing.

Only in one direction. You can use char to access data that's really uint32_t, but you can't use uint32_t to access data that's really char. ("Really" here is what the C standard calls "effective type"; for normal variables it's equivalent to the declared type, but for malloc'ed memory it has a weird definition where you can change the type by writing to it.)

Which doesn't prevent the compiler from doing a 32 bit load, does it? Dereferencing a pointer to a location outside any object is also undefined, therefore, if there are no conditionals between subsequent byte reads, the compiler should be perfectly fine with combining them into a 32 bit load!?

The issue is that once you hit undefined behavior, the optimizer is allowed to do whatever it wants - the proverbial launch nukes, crash, raid your fridge, etc.

More likely, it could just not execute that instruction or any instruction that depends on it, because that's the fastest way to preserve undefined behavior.

> The issue is that once you hit undefined behavior, the optimizer is allowed to do whatever it wants - the proverbial launch nukes, crash, raid your fridge, etc.

That's not quite how it works, or at least not what usually happens. Optimizers don't look for undefined behaviour to then do whatever, they look for the set of defined behaviours and then try to generate the cheapest code that behaves correctly in all those cases--and if that means that undefined cases do weird stuff, that's an acceptable side effect.

That in particular means that the potential for undefined behaviour to occur at runtime does not make the program undefined. If you ask the user to input a number and you then use that number as an index into some array without any input validation, the user could enter an out-of-range index, in which case the behaviour of the program would be undefined. But if the user inputs a valid index, the program must not exhibit any undefined behaviour.

Therefore, yes, if any of the bytes are not inside any valid C object, then that would produce undefined behaviour. Which is precisely why the compiler would be allowed to combine the loads into one 32-bit load: As the code, if it takes that path, is going to load all four bytes, and accessing a byte outside any objects would result in undefined behaviour, the compiler may assume that all accesses are within valid objects, and therefore loading them all in one is correctness-preserving.

> Optimizers don't look for undefined behaviour to then do whatever, they look for the set of defined behaviours and then try to generate the cheapest code that behaves correctly in all those cases--and if that means that undefined cases do weird stuff, that's an acceptable side effect.

> if any of the bytes are not inside any valid C object, then that would produce undefined behaviour

The compiler might also assume that every one of these loads will be exactly 4 bytes apart. Even worse, it might know that int32_t allocations will always be aligned, so it will assume that every load will be 4-byte-aligned. It then might output code that loses track of the bottom two bits, or SSE instructions that only work with certain alignments.

This isn't theoretical, this exact kind of bug has happened before, on a normal non-malicious compiler: http://pzemtsov.github.io/2016/11/06/bug-story-alignment-on-... https://news.ycombinator.com/item?id=12889855

It's true that the buffer overrun issue is the same for either version; I misread what was being replaced. But the alignment issue is specific to this version, and can't be ignored.

Yes, that is all true for the suggested replacement code, but you quoted the statement "Yes, I know it could be unaligned, but this is x86 where it would still be faster than 4 separate reads.", which referred only to what the compiler should do, and the compiler combining char reads into 32 bit loads obviously is not allowed to assume that the resulting loads are aligned ...

The sentence I quoted refers just to the technique of using 32-bit loads. That applies equally to your compiler suggestion and the C code you wrote. So I think it's a perfectly fine sentence to quote when pointing out a flaw of the C code, which is while unaligned accesses are fine in a vacuum, doing them with a wrong-typed pointer is unsafe.

I ran into the same UTF-8 decoder that you did (http://bjoern.hoehrmann.de/utf-8/decoder/dfa/) and tried to better understand why it was fast. In the process, I wrote a naive UTF-8 parser that turned out to be significantly faster. I used a version of the original benchmark (the current Hindi Wikipedia dump as one big XML file).

My original benchmark read the file as a stream — no guarantee that you'll get whole characters in each read — so I modified it a bit to read the whole file into memory, parse it, and print out the number of characters decoded and an XOR "hash". Here's what happened (looking at the "user" line of `time` output):

- Bjoern Hoehrmann’s decoder: 2.159s

- This post’s decoder: 2.754s (27.6% slower)

- Naive decoder: 1.453s (32.7% faster)

I'm curious whether this test makes sense and, if so, why there's such a big difference. Currently my best guess is the data cache pressure mentioned in twoodfin's comment.

I guess I should write a blog post, too.

Naive parser: https://gist.github.com/s4y/7c95f1ebeb2c069cfb09db3c3251eca3

Benchmarks: https://gist.github.com/s4y/344a355f8c1f99c6a4cb2347ec4323cc

Dealing with situations like this where you can’t guarantee you read a complete decode sequence is exactly where memory mapped files are gold.

Replacing read, test sequence end, buffer, drop buffer if it exceeds certain length without sequence end but record sequence start, seek until end sequence is found, and process that mess in a safe, passably-efficient manner with memory mapped files made my cross-platform tac a breeze to write: https://neosmart.net/blog/2017/a-high-performance-cross-plat...

The simplest thing to do there is to issue a read if you have less than 4 bytes remaining to decode. This guarantees that you always have at least one well formed character, or an EOF coming up.

Mmap doesn't work with pipes, which hugely reduces the utility of this kind of program.

You aren’t reading a byte at a time, it’s a block. You have no guarantee that you just ended on yet another partial sequence. It usually makes more sense to just rewind to the last complete sequence (though with streams that is not always an option).

> You have no guarantee that you just ended on yet another partial sequence

Which is why you don't read to the end (until you have an EOF). You can just sidestep the partial sequence problem by getting more data before there's potential for a partial sequence to show up. The longest possible sequence is 4 bytes, so as long as you ensure you have 4 bytes available, partial sequences are impossible on well formed input.

Almost all the text is ascii, so the branch predictor gets things right. Correctly predicted branches are nearly free, costing about as much as their decode overhead.

It'd be interesting to see the results where you have randomly shuffled characters of all lengths, although it's likely to be something you see in reality.

A couple of years ago, I made the same exercise and, when benchmarking several real-world languages, I got similar results. On a slowish Intel CPU (from my memory):

  DFA-based decoder: ~400 MB/s
  Naive decoder with ASCII: ~1.2 GB/s
  Naive decoder with with most Western languages: ~1.0 GB/s
  Naive decoder with with Russian, Chinese, Japanese: ~800 MB/s
Branch prediction of modern CPUs is so good that it's hard to beat the naive if-else chain with real-world data. Interestingly, I got the worst performance with heavily accented Eastern European languages like Czech where performance dropped to 600-700 MB/s.

> Branch prediction of modern CPUs is so good

Modern x86 CPUs. ARM branch prediction is much worse, unfortunately.

Your decoder accepts lone surrogates and overlong sequences.

This probably helps a bit for speed.

A constant reminder: Be wary when microbenchmarking functions that make relatively heavy use of lookup tables vs. those that don’t. You may not see the effects of the pressure this adds to the data cache, but a real program using many functions and other data will.

The same is true for long vs. short functions and the instruction cache.

If that bothers you, there's this approach:

    #define VALIDTAB (0x7f00ffff)
    #define LENTAB (0x000000003a550000)
    inline bool isvalidutf8start(char c) 
    {   return((VALIDTAB >> (c>>3)) & 1); }
    inline unsigned utf8length(char c)
    {   return(1+((LENTAB >> ((c>>3)*2)) & 2)); }

The 32-entry table of lengths can be coded up as a 64-bit value. The "is valid" table is stored as a separate constant.

Should be:

    inline unsigned utf8length(char c)
    {   return(1+((LENTAB >> ((c>>3)*2)) & 3)); }


    #define LENTAB (0x3a55000000000000)

> My primary — and totally arbitrary — goal was to beat the performance of Bjoern Hoehrmann’s DFA-based decoder.

The thing is, that decoder could be already branchless. The core of the algorithm is:

    *codep = (*state != UTF8_ACCEPT) ?
      (byte & 0x3fu) | (*codep << 6) :
      (0xff >> type) & (byte);
Which can be compiled with a conditional move (CMOV) instead of a branch. In fact:

> With GCC 6.3.0 on an i7-6700, my decoder is about 20% faster than the DFA decoder in the benchmark. With Clang 3.8.1 it’s just 1% faster.

There's a chance that when inlined into the benchmark Clang uses a CMOV while GCC uses a branch. As he correctly points out, his benchmark is the worst case for branch prediction:

> The even distribution of lengths greatly favors a branchless decoder. The random distribution inhibits branch prediction.

That looks way oversimplified. I once wrote a minimal HTTP server in Lua for Nmap project and part of the problem was UTF8 security. Consider get_next_char_len function here to see some of the edge cases:


Mimimal but useful. I just tried this httpd.lua script with djb's tcpserver and was pleasantly surprised. It is quick and handles large PDFs well enough. I like that there are no third party libraries. If directory listing and output for video could be added I might try using something like this on a local LAN as a replacement for the httpd binary I am using.

Glad to hear you like it! What do you mean by output for video? As for video listing, Lua doesn't provide a cross-platform way to do this, but I might have hacks for Unix/Windows that parse the output of dir/ls commands. Interested? Can't promise I find it though.

Briefly tested with a few filetypes and a couple of browsers, including Safari. Problem loading video/mp4, i.e., problem with "HTTP streaming". Maybe need to handle Range: header?

Of course, any examples in Lua of different approaches to UNIX directory listing via HTTP are appreciated. Still a Lua noob.

The Lua code is just excessively verbose.

What do you mean? How else would you implement a check for whether the prefix length is valid?

Here's the code to do it: https://github.com/skeeto/branchless-utf8/blob/master/utf8.h

It's pretty straightforward. He uses a lookup table for the first byte ("lengths") and then later checks that subsequent bytes have the correct continuation prefix (0xc0 mask / 0x2a value).

I think it checks all of the same conditions as your Lua code.

CMOV isn't necessarily faster than no branches. There's still data dependencies which can prevent reordering. But considerably faster than a misprediction.

Sure, but if you look at the implementation in the article, that's a much longer dependency chain:

    *c  = (uint32_t)(s[0] & masks[len]) << 18;
    *c |= (uint32_t)(s[1] & 0x3f) << 12;
    *c |= (uint32_t)(s[2] & 0x3f) <<  6;
    *c |= (uint32_t)(s[3] & 0x3f) <<  0;
    *c >>= shiftc[len];

    /* Accumulate the various error conditions. */
    *e  = (*c < mins[len]) << 6;
    *e |= ((*c >> 11) == 0x1b) << 7;  // surrogate half?
    *e |= (s[1] & 0xc0) >> 2;
    *e |= (s[2] & 0xc0) >> 4;
    *e |= (s[3]       ) >> 6;
    *e ^= 0x2a; // top two bits of each tail byte correct?
    *e >>= shifte[len];
Plus, all those loads are not making it easy for the processor to pipeline.


    *e  = (*c < mins[len]) << 6;
is really a CMOV in disguise :)

is really a CMOV in disguise :)

I don't see how using CMOV would be appropriate there --- I'd compile it as something more like this:

    ; eax = *c, ebx = mins[len]
    cmp eax, ebx
    sbb eax, eax
    and eax, 64
    ; eax = (*c < mins[len]) << 6

I dunno what you mean by CMOV vs “no” branches.

However, CMOV is single cycle latency in Kaby Lake. Also, I don’t know if those data dependencies wouldn’t also cause similar problems for resolving a conditional branch.

It may be possible to construct a scenario where CMOV is marginally slower but I think we can put Linus Torvalds’ CMOV rant in the bit bucket. Intel recommends CMOV with a proviso. Assembly/Compiler Coding Rule 2:

  Use the SETCC and CMOV instructions to eliminate
  unpredictable conditional branches where possible.
  Do not do this for predictable branches.
The Intel C Compiler uses it.

Something like

    X = flag ? X : 0;
    X = test ? X : 0;

    X |= flag;
    X |= test;
The latter is more obviously branch less. How exactly the two perform probably requires some testing.

And somewhat more general point, since we're writing C and not asm, depends how much you want to depend on the compiler to use the right instructions. To backtrack on my point a bit, whether CMOV gets used is an important consideration.

> There's a chance that when inlined into the benchmark Clang uses a CMOV

And a chance it won't. :)

Did Intel disprove Linus?

Obviate would be a better choice of word.

UTF-8-decode and encode are basically the only things that could be done by a single instruction, are common enough for implementing this instruction to make sense, yet still don't exist in x86. I wonder why.

Every time I have worked for a company that had a tight relationship with Intel I've gotten the opportunity to meet with them and ask for chips features. I always ask for encoding / decoding Unicode. So far I've been ignored for 15 years or so.

You need to have padding bytes at the end of the stream. What if you don't have them and got a const pointer to a buffer provided by a user? Then you need to copy the entire buffer again with the padding bytes behind it. Are there any tricks to avoid this being costly?

Presumably !len is a test and set instruction which has the same impact on the pipeline. But that said branchless code that fits in a couple of cache lines seems like it would be the best you could do.

I don't know the answer to this. Say on x86, if you do "test eax, eax", then "setz eax", then use eax in further arithmetic - obviously there is no jump involved, but does that stall the pipeline?

At any rate the caller needs to check for errors coming out of this function (the int *e parameter) and zero termination, so there will be conditional jumps somewhere in the use of this thing. (And reading the code it seems to assume s[3] is safe to dereference...)

Seems like a fun project! One thing I stumbled over:

> unsigned char *next = s + len + !len;

> Originally this expression was the return value, computed at the very end of the function. However, after inspecting the compiler’s assembly output, I decided to move it up, and the result was a solid performance boost. That’s because it spreads out dependent instructions. With the address of the next code point known so early, the instructions that decode the next code point can get started early.

If there's no change in the actual logic, I feel like the compiler should be able to move up expressions in order to make them available earlier. I wonder whether this was tested/inspected with some optimizations turned off.

Except in case of profile-guided optimization (which few people use), the compiler has one massive weakness: it doesn't know the "shape" and distribution of your data. If it did, it would be able to do quite a bit more, but it mostly does not.

I also find that people have way too much faith in omnipotence of compilers. If you care about performance, you _have_ to go down to the assembly level and ensure that the compiler did the right thing, at least in the hotspots. Oftentimes it does not, and you have to change your higher level code to make it cooperate.

> Except in case of profile-guided optimization (which few people use), the compiler has one massive weakness: it doesn't know the "shape" and distribution of your data. If it did, it would be able to do quite a bit more, but it mostly does not.

Well, that's only true because programmers define functions in terms of types that are not sufficiently specific wrt their runtime invariants; in a dependently-typed language you could have an optimizing compiler that makes use of type-level information to do better optimizations. For instance, in a dependently-typed language, you could have a type of integers divisible by 11, and it will have no runtime overhead with respect to a machine integer. Of course, this comes with a cost: if you want to use such specific types, you have to prove to the compiler that the data it is getting really does satisfy the property (and yes, you can do such proofs even with data that is only available at runtime, by simply implementing a runtime check).

The example line is just operations of registers, without any dependencies until the function return. The compiler should know to interleave these instructions with the code, so that the CPU can issue them alongside other operations.

Compilers should view operations more like a dependency graph, laying out operations with an interleaving that maximizes multi-issue, while keeping register pressure in check. In my mind, if the human can significantly change the output of the assembly by moving one line, the compiler optimizations may be turned off.

> laying out operations with an interleaving that maximizes multi-issue, while keeping register pressure in check

How to actually do that has been an active research topic for 30 years or so. So far, that research hasn't yielded anything that is simple, efficient, and effective enough to be included in general-purpose compilers. Last time I looked at LLVM's backend, it used to schedule for minimal register pressure because spills are more expensive in general, and because out-of-order execution will undo your careful scheduling anyway. It also has a second scheduler after register allocation, but at that point it's hard to move stuff, especially on x86.

On this particular code, manual scheduling paid off, partly because humans, unlike compilers, can simply try different variants and keep the best. Think of it as survivorship bias.

Edit: BTW, if you actually compile the code from Github and compare the generated assembly to a version where the computation of next is moved to the end, you'll see that moving it to the end causes GCC to allocate a stack frame (on x86-64), which it doesn't do in the manually optimized one.

So optimising code feels like playing chess now?

It is not "optimising" because nobody even hopes for something "optimal".

Now? For a long time x86/amd64 performance is quite unpredictable because the CPUs are so complex.

It is not like chess. More like Poker because more uncertainty.

> It is not "optimising" because nobody even hopes for something "optimal".

Optimizing doesn't need to reach the optimal, just improve.

> It is not like chess. More like Poker because more uncertainty.

I would say the uncertainty is similar to chess. It isn't feasible to know the best option in any given situation, but it is deterministic.

did you mean because computers can do it better? I can't see anything else in their comment that you could have referred to, but the quoted part shows that the author did it better than the compiler output...

No, not really. I meant by the prose description of how to optimise the code. It felt like how chess masters may discuss their strategies after a game of chess. Very high level and far from what a layperson may really understand about chess (or optimising), but very true and distinct nonetheless.

It gives a glimpse into the mind of the chess master (or optimiser) even when you yourself don't possess these skills.

Oh okay. I see the similarity now - it does look like prose discussion of chess games. The analogy can be pretty deep, like strategic considerations ("opposite-colored bishops" vs "spread-out dependent instructions" in the prose you replied to). Thanks for the clarification, this was insightful.

Similarily, I made a couple of year ago a UTF8 decoder using SIMD. It is almost branchless. (The branches are there only to detect errors) https://woboq.com/blog/utf-8-processing-using-simd.html

So the classic lookup-tables are faster than switch/case?

*c >>= shiftc[len]; could probably be replaced by switch(len) and would result in equal code.

My guess is that the speed increase comes from the trailing 3 bytes potential overread and the somewhat odd error handling, not the lookups.

Sorry dumb question, why branchless?

Why branchless? Because branches can kill performance. For an explanation, see this answer on branch prediction: https://stackoverflow.com/a/11227902

The article answers this in detail, but the TLDR is that branches slow down execution on modern CPUs

I don't see where this is catching encoding errors, like overlong forms. That has security implications.

It's useful when there is some assurance that the UTF-8 is error-free. E.g. tdata bundled with the program or otherwise trusted.

This line catches overlong forms:

  *e  = (*c < mins[len]) << 6;

I see, thanks.

This appears to be a descendant of Duff's device (https://en.wikipedia.org/wiki/Duff's_device)

Benchmarks should've included random ASCII. I understand they hinted at that with labeling their benchmark synthetic, but it'd give a idea of the impact of branch mispredicts

I'm an idiot in this area.

Why would you decode utf8?

In order to look up glyph in your font, you'll typically need a 32bit unicode code-point. A utf-8 string encodes those code-points using a variable-number-of-bytes scheme, so you need to decode each code-point before you can render it.

Yes, but this code expands it to a 32-bit array. That seems a bit silly because you'd be wasting a lot of memory bandwidth there. I suspect the loss could even be greater than what you've gained from removing those branches.

Not quite, it's expanded to a 32-bit integer, one codepoint at a time.

Perhaps you're misinterpreting what a decoder does. From the article:

> This week I took a crack at writing a branchless UTF-8 decoder: a function that decodes a single UTF-8 code point from a byte stream…

To print it to the screen for example.

To get the unicode codepoint. For example, you need this when you want to render a character, or convert it to another representation.

Some languages represent unicode strings as objects quite separate to their binary encoding, e.g. Python.

Weird to talk so much about branch prediction for branchless code.

1. Understanding how branches work is important to understanding the motivation for branchless code.

2. It's not entirely branch less. Consistent branching is equally important.

He can also add the restrict keyword on his pointer arguments to squeeze out some more of the compiler's optimizations.


Applications are open for YC Winter 2019

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact