
Validating UTF-8 strings using as little as 0.7 cycles per byte - ingve
https://lemire.me/blog/2018/05/16/validating-utf-8-strings-using-as-little-as-0-7-cycles-per-byte/
======
SloopJon
My first thought on seeing this was Bjoern Hoehrmann's UTF-8 decoder, which
encodes a finite state machine into a byte array:

[http://bjoern.hoehrmann.de/utf-8/decoder/dfa/](http://bjoern.hoehrmann.de/utf-8/decoder/dfa/)

I happened to come across this decoder again recently in Niels Lohmann's JSON
library for C++:

[https://github.com/nlohmann/json](https://github.com/nlohmann/json)

I see that this is mentioned in a previous post:

[https://lemire.me/blog/2018/05/09/how-quickly-can-you-check-...](https://lemire.me/blog/2018/05/09/how-quickly-can-you-check-that-a-string-is-valid-unicode-utf-8/)

One thing I'd like to check in the new code is whether it's as picky about
things like overlong sequences as Bjoern's code is.

~~~
loeg
> One thing I'd like to check in the new code is whether it's as picky about
> things like overlong sequences as Bjoern's code is.

I believe that's what checkContinuation() is doing, based on its use of the
"counts" parameters. I don't understand _how_ it works, but I don't see any
other reason for count_nibbles() to compute the '->count' member.

~~~
kwillets
The counts are actually used to find the expected run of continuation bytes
after each leading byte, and to mark them so the code can look for overlaps or
underlaps between the end of one code point and the start of the next.

~~~
loeg
Underlaps and overlaps occur when sequences are invalid, i.e., longer or
shorter than the initial byte's prefix promises.

------
KMag
With such a high percentage of text on the web in UTF-8, and so many uses for
variable-length integers, I hope we'll see single instructions for reading
and writing Unicode codepoints to/from UTF-8, as well as for reading/writing
128-bit variable-length integers (preferably prefix-encoded[0] rather than
LEB128).

A while back, I read that the Chinese Loongson processors implemented a
(subset?) of the MIPS instruction set with added instructions for Unicode
handling, but that's the only processor I've heard of with Unicode-accelerating
instructions, and I'm not sure which encoding(s) was/were accelerated.

[0][https://news.ycombinator.com/item?id=11263378](https://news.ycombinator.com/item?id=11263378)

~~~
chrisseaton
> With such a high percentage of text on the web in UTF-8, and many uses for
> variable-length integers, I hope that we'd see single instructions for
> reading and writing Unicode codepoints to/from UTF-8

If we took this attitude with every new technology, we'd have a large number
of instructions that are now useless. At one time it probably seemed a good
idea to have custom instructions for parsing XML, and people really were doing
custom instructions for interpreting JVM bytecode.

~~~
AceJohnny2
> _and people really were doing custom instructions for interpreting JVM
> bytecode._

Ah yes, ARM Jazelle, and the good ol' ARM926EJ-S...

[https://en.wikipedia.org/wiki/Jazelle](https://en.wikipedia.org/wiki/Jazelle)

Did the concept die with ARMv8 (64bit)?

~~~
monocasa
It had been essentially dead for a while even before AArch64. It's slightly
obfuscated by the fact that there's a version of Jazelle that's basically a
nop mode that'll trap on each bytecode.

------
zbjornson
I'm confused how the ASCII check works. It's checking if any byte is greater
than zero. Wouldn't you want to check if the 8th bit in any byte is set?

~~~
aeruder
The trick here is that each byte is treated as a signed 8-bit number. When the
top bit is set, the number is negative.

~~~
zbjornson
Oh duh, thanks. (Checking less than zero, read it wrong.)

I think it would be faster to OR all of the string's chunks together and only
check the 8th bit once at the end, though. On Skylake that would cut it to
0.33 cycles per 16 bytes (HSW: 1 per 16).

[https://github.com/lemire/fastvalidate-utf-8/pull/2](https://github.com/lemire/fastvalidate-utf-8/pull/2)

~~~
Sharlin
Depends on your input. If non-ASCII strings are frequent and likely to contain
a non-ASCII character fairly close to the start of the string, then it makes
sense to short circuit.

~~~
loeg
The previous >0 algorithm didn't short-circuit. There is no change to short-
circuit behavior here.

~~~
Sharlin
Ah, I was thinking of the naive implementation in the previous post [1].

[1] [https://lemire.me/blog/2018/05/09/how-quickly-can-you-check-...](https://lemire.me/blog/2018/05/09/how-quickly-can-you-check-that-a-string-is-valid-unicode-utf-8/)

------
MichaelGG
Fast checking is really useful in things like HTTP/SIP parsing. Rust should
expose such a function as well, seeing as its strings must be UTF-8 validated.
It's even faster if you can avoid UTF-8 strings entirely and work only on a
few known ASCII bytes, though that means you might push garbage further down
the line.

~~~
masklinn
> Rust should expose such a function as well seeing as their strings must be
> UTF-8 validated.

That's more or less what std::str::from_utf8 is: it runs UTF-8 validation on
the input slice, and just casts it to an &str if it's valid:
[https://doc.rust-lang.org/src/core/str/mod.rs.html#332-335](https://doc.rust-lang.org/src/core/str/mod.rs.html#332-335)

from_utf8_unchecked is nothing more than an unsafe (C-style) cast:
[https://doc.rust-lang.org/src/core/str/mod.rs.html#437](https://doc.rust-lang.org/src/core/str/mod.rs.html#437), and so should be a no-op at runtime.

~~~
MichaelGG
I meant Rust should have a SIMD optimised version that assumes mostly ASCII.
I'm guessing there is a trade-off involved depending on the content of the
string.

~~~
scottlamb
The linked implementation assumes mostly ASCII. It doesn't use SIMD. SIMD in
Rust is a work in progress: the next version (1.27) will stabilize
x86-64-specific SIMD intrinsics, and there's an RFC
([https://github.com/rust-lang/rfcs/pull/2366](https://github.com/rust-lang/rfcs/pull/2366))
for portable SIMD types.

------
nabla9
There is validating and "validating"; which is this code doing?

For example, valid UTF-8 must always use the shortest possible sequence, or
it's invalid. A validator must guard against decoding such invalid sequences.

example of an invalid sequence:

    0xF0 0x80 0x80 0x8A

~~~
jwilk
The test suite includes these as "bad" sequences:

* \xc0\x9f (overlong U+001F)

* \xed\xa0\x81 (surrogate)

~~~
SloopJon
I was too lazy to look up what these sequences represented, but I did a quick
test with the Euro symbol example from Wikipedia, and it did indeed reject the
overlong sequence:

    #include <stdbool.h>
    #include <stdio.h>
    #include "simdutf8check.h"

    int
    main(void)
    {
        const char euro[] = "\xe2\x82\xac";
        const char eurolong[] = "\xf0\x82\x82\xac";

        bool valid = validate_utf8_fast(euro, sizeof euro);
        printf("validate_utf8_fast(euro): %d\n", valid);

        valid = validate_utf8_fast(eurolong, sizeof eurolong);
        printf("validate_utf8_fast(eurolong): %d\n", valid);

        return 0;
    }

------
kizer
Should strings be represented as codepoint arrays in memory? In C, for
example, should I always decode input and work with that, or convert to UTF-8?

~~~
masklinn
> Should strings be represented as codepoint arrays in memory?

No. I think the implementor of Factor's Unicode support originally did that,
but it turned out not to be useful:

* it blows up memory usage for ASCII and BMP text (4 bytes per codepoint versus 1~3)

* this also has an impact on CPU caches, lowering the average amount of data you can fit in your cache and work on

* it requires a complete conversion of incoming ASCII and UTF-8 data (which only gets more and more common as time goes on) rather than just a validation

* and because Unicode itself is variable-length (combining codepoints), it's not actually helpful when you're trying to properly manipulate Unicode data

The only "advantage" of representing strings as codepoint arrays is that you
get O(1) access to codepoints _which is a terrible idea you should not
encourage_.

UTF-32 internal encoding makes some manipulations very slightly easier, but
not enough to matter in the long run, and it encourages bad habits. If you
don't need the O(1) access thing for backwards compatibility reasons, don't
bother.

~~~
kizer
I did not know that Unicode itself was variable length. That alone nullifies
motivation to use codepoint arrays. Why would they do that??

~~~
masklinn
To limit combinatoric explosion. Without combining codepoints, you'd need a
version of each base symbol for each _combination_ of modifiers. And while
Latin scripts usually limit themselves to one diacritic per base[0], other
scripts (Hebrew or Arabic come to mind) can pile up multiple diacritics.

[0] not that that's always the case: the Navajo alphabet can have both an
ogonek and an acute on the same Latin base character

~~~
kizer
I see... speaks to my ignorance on language.

------
ozzmotik
how exactly do you use a fraction of a cycle? I would assume that it's
fractional on average, but then again I can't really speak to how instructions
map to cycles and whether it's strictly an integer mapping

~~~
dragontamer
> I can't exactly speak for my expertise on how instructions map to cycles and
> if it's solely an integer mapping

1. Standard ALUs on modern processors are 64 bits wide. So right there,
you're 8x faster on a per-byte basis.

2. He's using vectorized operations, so he can work with 128 bits, 256 bits
(or potentially even 512 bits on high-end processors like Skylake-X). So 16,
32, or 64 bytes at a time per operation.

3. Modern processors are superscalar, with multiple ALUs and multiple
execution ports. I forget what the precise theoretical limit is, but Intel and
AMD machines can execute something like 4 or 5 operations per clock, depending
on circumstances.

That assumes that all operations have been decoded into micro-ops (uops), that
they fit inside the uop cache (think of the uop cache as an L0 cache: beyond
even the L1 cache), that they line up perfectly with the available execution
ports, that your data has no dependencies, and a whole host of other
conditions. But it's theoretically possible.

---------

In practice, your code will be limited by memory (even the L1 cache is slower
than the CPU), by decoding speed (not everything fits in the uop cache, and
only loops really benefit from it), and by dependencies (e.g., a = x + 1;
a = a + 2: the second instruction needs the result of the first, so the two
can't execute in parallel / superscalar).

The CPU is pretty good at trying to figure out how to optimally reorder and
out-of-order execute your code. But that means that predicting the performance
of the CPU is incredibly difficult: you have to keep in mind all of the parts
of the CPU while thinking of the assembly code.

