
A Branchless UTF-8 Decoder - zdw
http://nullprogram.com/blog/2017/10/06/
======
userbinator
The core of the algorithm shows that UTF-8 is actually big-endian in its
ordering of the bits, making it somewhat more difficult to implement
efficiently on the usual little-endian machine --- the first byte, at the
lowest address, actually contains the _most_ significant bits. Had it been the
other way around, the final shift would not be length-dependent.

Also, none of the compilers I tried at
[https://godbolt.org/](https://godbolt.org/) were able to combine the 4
s[0...3] 8-bit reads into one 32-bit read, which is somewhat disappointing.
They generated 4 separate byte-read instructions, one for each s[0..3]. Yes, I
know it could be unaligned, but this is x86 where it would still be faster
than 4 separate reads. You would need to do this instead (and sacrifice
endianness/alignment-independence):

    
    
        uint32_t v = *(uint32_t*)s;              /* one (possibly unaligned) little-endian 32-bit load */
        *c  = (uint32_t)(v & masks[len]) << 18;  /* s[0] */
        *c |= (uint32_t)((v>>8) & 0x3f) << 12;   /* s[1] */
        *c |= (uint32_t)((v>>16) & 0x3f) <<  6;  /* s[2] */
        *c |= (uint32_t)((v>>24) & 0x3f) <<  0;  /* s[3] */
        *c >>= shiftc[len];
    

We're still left with a bunch of shifts, ands, and ors, the purpose of which
is to essentially "compact" the 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx bitstring
into one 21-bit codepoint by removing the bits in the middle. None of the
compilers I tried was clever enough to do this (assuming eax = v from above
after masking, and cl = a slightly modified shiftc[len]) at any optimisation
level:

    
    
        bswap eax               ; reorder to b0 b1 b2 b3, most significant first
        mov ebx, eax
        and ebx, 00FF00FFh      ; isolate b1 and b3
        lea ebx, [ebx + ebx*2]  ; ebx = 3*(b1, b3)
        add eax, ebx            ; b1 and b3 are now effectively multiplied by 4
        shr eax, 2              ; eax = b0<<22 | b1<<16 | b2<<6 | b3
        shl ax, 4               ; shift the low 16 bits only: b2<<10 | b3<<4
        shr eax, cl             ; cl = shiftc[len] + 4 removes the remaining gap
    

I played around with optimising UTF-8 decoding a while ago, and the above
resembles one of the fastest and smallest ways to "compact" the 4 bytes into a
codepoint.

Of course, this bit string manipulation would be much faster and trivial in
hardware (essentially rearranging wires), which makes me want a LODSUTF8
instruction that operates like LODSB but reads a whole codepoint instead, and
sets CF on error...

~~~
bdonlan
One thing you could do to optimize further (on CPUs with the BMI2 instruction
set) is to use the PEXT instruction to perform the bit extract operation
(after a BSWAP of course). The whole operation could be something like
(untested):

    
    
    ; Parameters: RDI - pointer to input character
    ; Return value: EAX - Unicode codepoint, or -1 for error
    lea r8, [rel encoding_table] ; Load pointer to table base using rip-relative addressing

    mov edx, [rdi] ; unaligned (over)read of 32-bit utf8 codepoint
    mov ebx, edx   ; Copy to EBX to construct our table index (NB: a real routine would preserve RBX)
    and ebx, 0xf8  ; Mask off the (potential) data bits, keeping the top 5 bits of the lead byte
    ; Now we want to convert the top 5 bits into an index into a table of three * 32-bit entries
    ; Currently EBX = index * 8, we need index * 12, so we'll divide by two and then multiply by 3
    shr ebx, 1
    lea ebx, [ebx + ebx * 2]
    
    bswap edx      ; Get the codepoint into big-endian representation

    pext ecx, edx, [r8 + rbx + 4] ; extract padding bits using mask
    pext eax, edx, [r8 + rbx]     ; extract data bits
    cmp ecx, [r8 + rbx + 8]       ; check padding bits
    mov ecx, -1                   ; CMOV has no immediate form, so stage -1 in a register
    cmovnz eax, ecx
    retq
    
    encoding_table:
    ; masks follow the big-endian (post-BSWAP) layout, with the lead byte in the top bits
    ; 00000xxx - 01111xxx
    times 16 dd 0x7F000000, 0x80000000, 0x00
    ; 10000xxx - 10111xxx (invalid)
    times 8 dd 0, 0xFFFFFFFF, 0 ; The padding will never match here, forcing an error return
    ; 11000xxx - 11011xxx (two bytes) - padding value 11010
    times 4 dd 0x1F3F0000, 0xE0C00000, 0x1A
    ; 11100xxx - 11101xxx (three bytes) - padding value 1110 1010
    times 2 dd 0x0F3F3F00, 0xF0C0C000, 0xEA
    ; 11110xxx (four bytes) - padding value 111 1010 1010
    dd 0x073F3F3F, 0xF8C0C0C0, 0x7AA
    ; 11111xxx (invalid)
    dd 0, 0xFFFFFFFF, 0
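
Or, sketched with the intrinsic in C (equally untested; the table mirrors the one above, and the range designators are a GCC/Clang extension):

    
    
    #include <stdint.h>
    #include <string.h>
    #include <immintrin.h>   /* _pext_u32, requires BMI2 (-mbmi2) */
    
    struct entry { uint32_t data_mask, pad_mask, pad_value; };
    
    /* Indexed by the top 5 bits of the lead byte, as in the table above. */
    static const struct entry table[32] = {
        [0 ... 15]  = { 0x7F000000, 0x80000000, 0x00  }, /* 0xxxxxxx: ASCII   */
        [16 ... 23] = { 0x00000000, 0xFFFFFFFF, 0x00  }, /* 10xxxxxx: invalid */
        [24 ... 27] = { 0x1F3F0000, 0xE0C00000, 0x1A  }, /* 110xxxxx: 2 bytes */
        [28 ... 29] = { 0x0F3F3F00, 0xF0C0C000, 0xEA  }, /* 1110xxxx: 3 bytes */
        [30]        = { 0x073F3F3F, 0xF8C0C0C0, 0x7AA }, /* 11110xxx: 4 bytes */
        [31]        = { 0x00000000, 0xFFFFFFFF, 0x00  }, /* 11111xxx: invalid */
    };
    
    /* Returns the code point, or (uint32_t)-1 on a padding mismatch. */
    static uint32_t pext_decode(const unsigned char *s)
    {
        uint32_t v;
        memcpy(&v, s, 4);             /* unaligned 32-bit (over)read              */
        v = __builtin_bswap32(v);     /* lead byte into the most significant bits */
    
        const struct entry *e = &table[s[0] >> 3];
        uint32_t pad = _pext_u32(v, e->pad_mask);   /* gather the marker bits  */
        uint32_t cp  = _pext_u32(v, e->data_mask);  /* gather the payload bits */
        return pad == e->pad_value ? cp : (uint32_t)-1;
    }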

~~~
goldenkey
Pretty cool; it's a recent development, available only in Haswell and later
architectures (2013 and later).

[https://en.m.wikipedia.org/wiki/Bit_Manipulation_Instruction...](https://en.m.wikipedia.org/wiki/Bit_Manipulation_Instruction_Sets)

~~~
userbinator
...and on Haswell, PEXT runs in _one_ uop with a latency of 3! That is
nothing short of amazing for an operation which, from its description, would
seem to require a microcode loop or at least a few more cycles to collect an
arbitrary number of bits with arbitrary gaps between them:

[http://www.felixcloutier.com/x86/PEXT.html](http://www.felixcloutier.com/x86/PEXT.html)

The only unfortunate thing is that it was introduced quite recently in terms
of x86 history (instead of being present early on but just microcoded), so
earlier software can't take advantage of it or of any subsequent improvements,
and it uses a pretty complex encoding (VEX) that is only usable in protected
mode. But if anything, this is another datapoint in the argument in favour of
CISC --- try doing this on a MIPS, RISC-V or even ARM! With less powerful
ISAs, even something as simple as "shl ax, 4" (shift the lower 16 bits _only_)
turns into a multi-instruction sequence of masking and combining.

~~~
ant6n
ARMv8 has bit field extract/insert instructions, which are fairly powerful,
and I'd say more generic than "shl ax,N", which is really only there for
legacy reasons (it's an instruction with a 16-bit operand-size prefix).

RISC generally tries to keep instructions as generic/useful as possible to
keep silicon small.

------
Sidnicious
I ran into the same UTF-8 decoder that you did
([http://bjoern.hoehrmann.de/utf-8/decoder/dfa/](http://bjoern.hoehrmann.de/utf-8/decoder/dfa/))
and tried to better understand why it was fast. In the process, I wrote a
naive UTF-8 parser that turned out to be significantly faster. I used a
version of the original benchmark (the current Hindi Wikipedia dump as one big
XML file).

My original benchmark read the file as a stream — no guarantee that you'll get
whole characters in each read — so I modified it a bit to read the whole file
into memory, parse it, and print out the number of characters decoded and an
XOR "hash". Here's what happened (looking at the "user" line of `time`
output):

- Bjoern Hoehrmann’s decoder: 2.159s

- This post’s decoder: 2.754s (27.6% slower)

- Naive decoder: 1.453s (32.7% faster)

I'm curious whether this test makes sense and, if so, why there's such a big
difference. Currently my best guess is the data cache pressure mentioned in
twoodfin's comment.

I guess I should write a blog post, too.

Naive parser:
[https://gist.github.com/s4y/7c95f1ebeb2c069cfb09db3c3251eca3](https://gist.github.com/s4y/7c95f1ebeb2c069cfb09db3c3251eca3)

Benchmarks:
[https://gist.github.com/s4y/344a355f8c1f99c6a4cb2347ec4323cc](https://gist.github.com/s4y/344a355f8c1f99c6a4cb2347ec4323cc)
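
For context, a decoder in that naive, branchy style looks roughly like this (a generic sketch, not the exact code from the gist):

    
    
        #include <stddef.h>
        #include <stdint.h>
        
        /* Decode one code point from s (n bytes available); returns the number
         * of bytes consumed, or 0 on error. No overlong/continuation-byte
         * validation -- purely illustrative. */
        static int naive_decode(const unsigned char *s, size_t n, uint32_t *cp)
        {
            if (n == 0) return 0;
            if (s[0] < 0x80) { *cp = s[0]; return 1; }
            if ((s[0] & 0xE0) == 0xC0 && n >= 2) {
                *cp = (uint32_t)(s[0] & 0x1F) << 6 | (s[1] & 0x3F);
                return 2;
            }
            if ((s[0] & 0xF0) == 0xE0 && n >= 3) {
                *cp = (uint32_t)(s[0] & 0x0F) << 12 | (uint32_t)(s[1] & 0x3F) << 6
                    | (s[2] & 0x3F);
                return 3;
            }
            if ((s[0] & 0xF8) == 0xF0 && n >= 4) {
                *cp = (uint32_t)(s[0] & 0x07) << 18 | (uint32_t)(s[1] & 0x3F) << 12
                    | (uint32_t)(s[2] & 0x3F) << 6 | (s[3] & 0x3F);
                return 4;
            }
            return 0;
        }
    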

~~~
ComputerGuru
Dealing with situations like this where you can’t guarantee you read a
complete decode sequence is exactly where memory mapped files are gold.

Replacing that whole dance (read a block, test for a sequence end, buffer it,
drop the buffer if it grows too long without a sequence end while recording
where the sequence started, seek until the end sequence is found) with memory-
mapped files, and processing it all in a safe, passably efficient manner, made
my cross-platform tac a breeze to write:
[https://neosmart.net/blog/2017/a-high-performance-cross-platform-tac-rewrite/](https://neosmart.net/blog/2017/a-high-performance-cross-platform-tac-rewrite/)
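
The core of it is just this kind of thing (a minimal POSIX sketch, not the actual code from the post):

    
    
        #include <fcntl.h>
        #include <stddef.h>
        #include <sys/mman.h>
        #include <sys/stat.h>
        #include <unistd.h>
        
        /* Map an entire file read-only so it can be scanned as one contiguous
         * buffer; multi-byte sequences then never straddle a read boundary. */
        static const unsigned char *map_file(const char *path, size_t *len)
        {
            int fd = open(path, O_RDONLY);
            if (fd < 0) return NULL;
            struct stat st;
            if (fstat(fd, &st) < 0) { close(fd); return NULL; }
            void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
            close(fd);                      /* the mapping outlives the descriptor */
            if (p == MAP_FAILED) return NULL;
            *len = (size_t)st.st_size;
            return p;
        }
    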

~~~
ori_b
The simplest thing to do there is to issue a read if you have less than 4
bytes remaining to decode. This guarantees that you always have at least one
well formed character, or an EOF coming up.

Mmap doesn't work with pipes, which hugely reduces the utility of this kind of
program.

~~~
ComputerGuru
You aren’t reading a byte at a time; it’s a block. And you have no guarantee
that you didn’t just end on yet another partial sequence. It usually makes
more sense to just rewind to the last complete sequence (though with streams
that is not always an option).

~~~
ori_b
> You have no guarantee that you didn’t just end on yet another partial sequence

Which is why you don't read to the end (until you have an EOF). You can just
sidestep the partial sequence problem by getting more data before there's
potential for a partial sequence to show up. The longest possible sequence is
4 bytes, so as long as you ensure you have 4 bytes available, partial
sequences are impossible on well formed input.
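
Roughly like this (an illustrative sketch, with the actual decoding left as a placeholder):

    
    
        #include <stdio.h>
        #include <string.h>
        
        /* Top up the buffer whenever fewer than 4 undecoded bytes remain, so a
         * well-formed sequence is never split across reads. */
        static void process(FILE *in)
        {
            unsigned char buf[4096];
            size_t len = 0, pos = 0;
        
            for (;;) {
                if (len - pos < 4) {                      /* might hold a partial sequence */
                    memmove(buf, buf + pos, len - pos);   /* slide the tail to the front   */
                    len -= pos;
                    pos = 0;
                    size_t n = fread(buf + len, 1, sizeof(buf) - len, in);
                    if (n == 0 && len == 0)
                        break;                            /* EOF and nothing left to decode */
                    len += n;
                }
                /* decode one code point starting at buf + pos ... */
                pos++;  /* placeholder: a real decoder advances by the sequence length */
            }
        }
    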

------
twoodfin
A constant reminder: Be wary when microbenchmarking functions that make
relatively heavy use of lookup tables vs. those that don’t. You may not see
the effects of the pressure this adds to the data cache, but a real program
using many functions and other data will.

The same is true for long vs. short functions and the instruction cache.

~~~
Animats
If that bothers you, there's this approach:

    
    
        #define VALIDTAB (0x7f00ffff)
        #define LENTAB (0x000000003a550000)
     
        inline bool isvalidutf8start(unsigned char c)
        {   return((VALIDTAB >> (c>>3)) & 1); }
            
        inline unsigned utf8length(unsigned char c)
        {   return(1+((LENTAB >> ((c>>3)*2)) & 2)); }
    
    

The 32-entry table of lengths can be coded up as a 64-bit value. The "is
valid" table is stored as a separate constant.

~~~
Animats
Should be:

    
    
        inline unsigned utf8length(unsigned char c)
        {   return(1+((LENTAB >> ((c>>3)*2)) & 3)); }

~~~
Animats
Correction:

    
    
        #define LENTAB (0x3a55000000000000)
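
Putting the corrected constants together, a quick self-contained check (using unsigned char so the shifts stay well defined) should print the expected lengths:

    
    
        #include <stdbool.h>
        #include <stdio.h>
        
        #define VALIDTAB 0x7f00ffffu
        #define LENTAB   0x3a55000000000000ull
        
        static inline bool isvalidutf8start(unsigned char c)
        {   return (VALIDTAB >> (c >> 3)) & 1; }
        
        static inline unsigned utf8length(unsigned char c)
        {   return 1 + ((LENTAB >> ((c >> 3) * 2)) & 3); }
        
        int main(void)
        {
            /* expected: 0x24 -> len 1, 0xC2 -> 2, 0xE2 -> 3, 0xF0 -> 4, 0x80 -> invalid */
            const unsigned char tests[] = { 0x24, 0xC2, 0xE2, 0xF0, 0x80 };
            for (int i = 0; i < 5; i++)
                printf("%02X: valid=%d len=%u\n", (unsigned)tests[i],
                       isvalidutf8start(tests[i]), utf8length(tests[i]));
            return 0;
        }
    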

------
ot
> My primary — and totally arbitrary — goal was to beat the performance of
> Bjoern Hoehrmann’s DFA-based decoder.

The thing is, that decoder could already be branchless. The core of the
algorithm is:

    
    
        *codep = (*state != UTF8_ACCEPT) ?
          (byte & 0x3fu) | (*codep << 6) :
          (0xff >> type) & (byte);
    

Which can be compiled with a conditional move (CMOV) instead of a branch. In
fact:

> With GCC 6.3.0 on an i7-6700, my decoder is about 20% faster than the DFA
> decoder in the benchmark. With Clang 3.8.1 it’s just 1% faster.

There's a chance that when inlined into the benchmark Clang uses a CMOV while
GCC uses a branch. As he correctly points out, his benchmark is the worst case
for branch prediction:

> The even distribution of lengths greatly favors a branchless decoder. The
> random distribution inhibits branch prediction.

~~~
d33
That looks way oversimplified. I once wrote a minimal HTTP server in Lua for
the Nmap project, and part of the problem was UTF-8 security. Consider the
get_next_char_len function here to see some of the edge cases:

[https://github.com/nmap/nmap/blob/b7a5a6/ncat/scripts/httpd....](https://github.com/nmap/nmap/blob/b7a5a6/ncat/scripts/httpd.lua)

~~~
feelin_googley
Minimal but useful. I just tried this httpd.lua script with djb's tcpserver
and was pleasantly surprised. It is quick and handles large PDFs well enough.
I like that there are no third party libraries. If directory listing and
output for video could be added I might try using something like this on a
local LAN as a replacement for the httpd binary I am using.

~~~
d33
Glad to hear you like it! What do you mean by output for video? As for
directory listing, Lua doesn't provide a cross-platform way to do this, but I
might have hacks for Unix/Windows that parse the output of dir/ls commands.
Interested? Can't promise I'll find it, though.

~~~
feelin_googley
Briefly tested with a few filetypes and a couple of browsers, including
Safari. Problem loading video/mp4, i.e., problem with "HTTP streaming". Maybe
need to handle Range: header?

Of course, any examples in Lua of different approaches to UNIX directory
listing via HTTP are appreciated. Still a Lua noob.

------
Tuna-Fish
UTF-8 decode and encode are basically the only operations that could be done
by a single instruction and are common enough for implementing such an
instruction to make sense, yet they still don't exist in x86. I wonder why.

~~~
spullara
Every time I have worked for a company that had a tight relationship with
Intel, I've gotten the opportunity to meet with them and ask for chip
features. I always ask for encoding/decoding Unicode. So far I've been ignored
for 15 years or so.

------
Aardwolf
You need to have padding bytes at the end of the stream. What if you don't
have them and you've got a const pointer to a buffer provided by a user? Then
you need to copy the entire buffer again, with the padding bytes appended. Are
there any tricks to avoid this being costly?

------
goldenkey
He can also add the restrict keyword to his pointer arguments to squeeze some
more optimization out of the compiler.

[https://en.m.wikipedia.org/wiki/Restrict](https://en.m.wikipedia.org/wiki/Restrict)
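
E.g., assuming the decoder's prototype is roughly void *utf8_decode(void *buf, uint32_t *c, int *e), the qualified version would look like this sketch (restrict being a promise to the compiler that the three pointers never alias):

    
    
        #include <stdint.h>
        
        /* restrict tells the compiler buf, c and e never alias, so stores through
         * one never invalidate values loaded through another. */
        void *utf8_decode(void *restrict buf, uint32_t *restrict c, int *restrict e);
    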

------
ChuckMcM
Presumably !len becomes a test-and-set (setcc) sequence, which has the same
impact on the pipeline. But that said, branchless code that fits in a couple
of cache lines seems like it would be the best you could do.

~~~
asveikau
I don't know the answer to this. Say on x86, if you do "test eax, eax", then
"setz al", then use eax in further arithmetic - obviously there is no jump
involved, but does that stall the pipeline?

At any rate the caller needs to check for errors coming out of this function
(the int *e parameter) and zero termination, so there will be conditional
jumps somewhere in the use of this thing. (And reading the code it seems to
assume s[3] is safe to dereference...)

------
ant6n
Seems like a fun project! One thing I stumbled over:

> unsigned char *next = s + len + !len;

> Originally this expression was the return value, computed at the very end of
> the function. However, after inspecting the compiler’s assembly output, I
> decided to move it up, and the result was a solid performance boost. That’s
> because it spreads out dependent instructions. With the address of the next
> code point known so early, the instructions that decode the next code point
> can get started early.

If there's no change in the actual logic, I feel like the compiler should be
able to move up expressions in order to make them available earlier. I wonder
whether this was tested/inspected with some optimizations turned off.

~~~
0xbear
Except in the case of profile-guided optimization (which few people use), the
compiler has one massive weakness: it doesn't know the "shape" and
distribution of your data. If it did, it would be able to do quite a bit more,
but it mostly does not.

I also find that people have way too much faith in the omnipotence of compilers.
If you care about performance, you _have_ to go down to the assembly level and
ensure that the compiler did the right thing, at least in the hotspots.
Oftentimes it does not, and you have to change your higher level code to make
it cooperate.

~~~
ant6n
The example line is just operations on registers, without any dependencies
until the function return. The compiler should know to interleave these
instructions with the rest of the code, so that the CPU can issue them
alongside other operations.

Compilers should view operations more like a dependency graph, laying out
operations with an interleaving that maximizes multi-issue while keeping
register pressure in check. In my mind, if a human can significantly change
the generated assembly just by moving one line, the compiler's optimizations
may have been turned off.

~~~
tom_mellior
> laying out operations with an interleaving that maximizes multi-issue, while
> keeping register pressure in check

How to actually do that has been an active research topic for 30 years or so.
So far, that research hasn't yielded anything that is simple, efficient, and
effective enough to be included in general-purpose compilers. Last time I
looked at LLVM's backend, it used to schedule for minimal register pressure
because spills are more expensive in general, and because out-of-order
execution will undo your careful scheduling anyway. It also has a second
scheduler after register allocation, but at that point it's hard to move
stuff, especially on x86.

On this particular code, manual scheduling paid off, partly because humans,
unlike compilers, can simply _try_ different variants and keep the best. Think
of it as survivorship bias.

Edit: BTW, if you actually compile the code from Github and compare the
generated assembly to a version where the computation of next is moved to the
end, you'll see that moving it to the end causes GCC to allocate a stack frame
(on x86-64), which it doesn't do in the manually optimized one.

------
ogoffart
Similarly, a couple of years ago I made a UTF-8 decoder using SIMD. It is
almost branchless (the branches are there only to detect errors):
[https://woboq.com/blog/utf-8-processing-using-simd.html](https://woboq.com/blog/utf-8-processing-using-simd.html)

------
Too
So the classic lookup-tables are faster than switch/case?

*c >>= shiftc[len]; could probably be replaced by a switch(len) and would
result in equal code.

My guess is that the speed increase comes from the potential 3-byte overread
at the end and the somewhat unusual error handling, not from the lookups.

------
exabrial
Sorry, dumb question: why branchless?

~~~
zaxomi
Why branchless? Because branches can kill performance. For an explanation, see
this answer on branch prediction:
[https://stackoverflow.com/a/11227902](https://stackoverflow.com/a/11227902)

------
kazinator
I don't see where this is catching encoding errors, like overlong forms. That
has security implications.

It's useful when there is some assurance that the UTF-8 is error-free, e.g.
data bundled with the program or otherwise trusted.

~~~
sounds
This line catches overlong forms (for example, the overlong two-byte sequence
0xC0 0xAF decodes to 0x2F, which is below 0x80, the smallest code point that
actually needs two bytes, so the error flag gets set):

    
    
      *e  = (*c < mins[len]) << 6;

~~~
kazinator
I see, thanks.

------
2sk21
This appears to be a descendant of Duff's device
([https://en.wikipedia.org/wiki/Duff's_device](https://en.wikipedia.org/wiki/Duff's_device))

------
__s
Benchmarks should've included random ASCII. I understand they hinted at that
by labeling their benchmark synthetic, but it'd give an idea of the impact of
branch mispredicts.

------
kuwze
I'm an idiot in this area.

Why would you decode utf8?

~~~
fra
In order to look up a glyph in your font, you'll typically need a 32-bit
Unicode code point. A UTF-8 string encodes those code points using a
variable-number-of-bytes scheme, so you need to decode each code point before
you can render it.
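
A hypothetical rendering loop makes the shape of it clear (utf8_decode_one and draw_glyph are stand-ins, not real APIs):

    
    
        #include <stdint.h>
        
        extern int  utf8_decode_one(const unsigned char *s, uint32_t *cp); /* bytes consumed, 0 on error */
        extern void draw_glyph(uint32_t codepoint);
        
        static void draw_string(const unsigned char *s)
        {
            while (*s) {
                uint32_t cp;
                int n = utf8_decode_one(s, &cp);
                if (n <= 0) { cp = 0xFFFD; n = 1; }  /* U+FFFD replacement on malformed input */
                draw_glyph(cp);
                s += n;
            }
        }
    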

~~~
amelius
Yes, but this code expands it to a 32-bit array. That seems a bit silly
because you'd be wasting a lot of memory bandwidth there. I suspect the loss
could even be greater than what you've gained from removing those branches.

~~~
amaranth
Not quite, it's expanded to a 32-bit integer, one codepoint at a time.

------
smegel
Weird to talk so much about branch prediction for branchless code.

~~~
tedunangst
1. Understanding how branches work is important to understanding the
motivation for branchless code.

2. It's not entirely branchless. Consistent branching is equally important.

