
Validating UTF8 Strings via Lookup Tables - petermcneeley
http://darkcephas.blogspot.com/2018/10/validating-utf8-strings-with-lookup.html
======
nradov
Has anyone tried to feed it strings from @glitchr_?

[https://twitter.com/glitchr_](https://twitter.com/glitchr_)

ٳٰٰٰٰٰٰٰٰٰٰٰٰٰٰٰٰٰٰٰٰٰٰٰٰٰٰٰٳٰٰٰٰٰٰٰٰٰٰٰٰٰٰٰٰٰٰٰٰٰٰٰٰٰٰٰٳٰٰٰٰٰٰٰٰٰٰٰٰٰٰٰٰٰٰٰٰٰٰٰٰٰٰٰٳٰٰٰٰٰٰٰٰٰٰٰٰٰٰٰٰٰٰٰٰٰٰٰٰٰٰٰٳٰٰٰٰٰٰٰٰٰٰٰٰٰٰٰٰٰٰٰٰٰٰٰٰٰٰٰ

ه꙲҉꙰꙱҈̿꙲҉꙰꙱҈̿꙲҉꙰꙱҈̿꙲҉꙰꙱҈̿꙲҉꙰꙱҈̿꙲҉꙰꙱҈̿꙲҉꙰꙱҈̿꙲҉꙰꙱҈̿꙲҉꙰꙱҈̿꙲҉꙰꙱҈̿꙲҉꙰꙱҈̿꙲҉꙰꙱҈̿꙲҉꙰꙱҈̿꙲҉꙰꙱҈̿꙲҉꙰꙱҈̿꙲҉꙰꙱҈̿꙲҉꙰꙱҈̿꙲҉꙰꙱҈̿꙲҉꙰꙱҈̿꙲҉꙰꙱҈̿꙲҉꙰꙱҈̿꙲҉꙰꙱҈̿꙲҉꙰꙱҈̿꙲

#oͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦͦ

╰̩̩̩̩̩̻̍̍̍̍̍̊●̩̩̩̩̩̩̩̻̍̍̍̍̍̍̍̊ᴗ̩̩̩̩̩̩̩̩̩̩̪̺̍̍̍̍̍̍̍̍̍̍̆̑●̩̩̩̩̩̩̩̻̍̍̍̍̍̍̍̊╯̩̩̩̩̩̩̻̍̍̍̍̍̊

فͤ҈ͨͥ҉҉ͦ҈҉ͨ҈ͩ҉ͪ҈ͣͯͫ҉ͥͬͨ҈ͭ҉ͮ҈ͯ҉ͨ҈ͭͭͬ҉ͧͥ҈ͣ҉ͨ҉҉҈ͧͥ҉ͯ҈ͮͥ҉ͭ҈ͤ҈ͦ҈ͥ҉ͧ҈ͩͯ҉ͭ҈ͨ҉ͨͥ҉҉ͣ҉ͣͪ҉ͧ҈ͭ҉ͩ҈ͤ҉ͮ҈ͯͥ҈ͬ҈ͭ҈ͦ҈ͨͣ҉ͥ҈ͯ҉҉ͣͧ҈ͫ҉ͭ҈ͥͯͯ҉ͦ҈ͥ҉ͧ҉҈ͩ҉ͭ҈ͣͨ҉ͣͥ҈ͪ҉ͧ҈ͭ

~~~
xiconfjs
Thanks for mentioning it.

------
amluto
> const uint16_t* asShort = reinterpret_cast<const uint16_t*>(src);

Ick. That violates the aliasing rules in many cases and is therefore UB. And
it’s wrong on big-endian systems.

Just do src[i] | ((size_t)(src[i+1]) << 8).

~~~
petermcneeley
That code might produce two byte reads, a shift, and a bitwise OR to
reassemble the 16-bit value. In the original code the 16-bit read
feeds directly into the lookup table.

~~~
amluto
The compiler always _might_ produce garbage. But a decent compiler won't:

    
    
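        #include <stddef.h>
        #include <stdint.h>
        
        /* Assemble a little-endian 16-bit value from two bytes.
           No aliasing or alignment assumptions are needed. */
        uint16_t read_2byte_le(const uint8_t *src)
        {
            return (uint16_t)(src[0] |
                   ((uint16_t)(src[1]) << 8));
        }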
    

This generates the correct code with gcc and with clang. Sadly, clang is a bit
fragile in this regard:
[https://bugs.llvm.org/show_bug.cgi?id=39438](https://bugs.llvm.org/show_bug.cgi?id=39438)

~~~
cesarb
Would a memcpy into a uint16_t be less fragile?

~~~
pjscott
Just tried this with godbolt.org and yes, the memcpy() thing is consistently
good while the shift-and-or version often gets an unoptimized implementation
(depending on compiler and version).
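
A minimal sketch of the two variants being compared, assuming C++11 or later. The helper names `load_u16_native` and `load_u16_le` are made up for illustration and do not come from the post. Note that `memcpy` yields the host's byte order, so it agrees with the shift-and-or version only on little-endian machines.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: read 16 bits in the host's native byte order.
// memcpy has no alignment or aliasing requirements, so this is well
// defined for any readable src.
inline std::uint16_t load_u16_native(const unsigned char* src) {
    std::uint16_t v;
    std::memcpy(&v, src, sizeof v);
    return v;
}

// Hypothetical helper: read 16 bits as little-endian regardless of host.
inline std::uint16_t load_u16_le(const unsigned char* src) {
    return static_cast<std::uint16_t>(
        src[0] | (static_cast<unsigned>(src[1]) << 8));
}
```

Whether either form compiles down to a single 16-bit load depends on the compiler and version, so it is worth checking the generated code for your own toolchain.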

~~~
beached_whale
In C++, not necessarily C, memcpy is how you type pun. Seeing as C and C++
are often handled by the same compiler, it's probably the fastest way in both.

------
devwastaken
While this is somewhat out of scope for the post, when validating UTF-8 it'd be
great to know of a standard way to toss out special characters like the vertically
stacking ones that can be used to disrupt page content. It seems most places just
never filter them. Even back in Neverwinter Nights I believe you could make some
fun names.

~~~
kevingadd
Keep in mind those characters largely exist to allow people to write in their
native languages, so you need to make sure your filter doesn't stop them from
expressing themselves.

~~~
VMG
There are always trade-offs. Even written language is a constraint by itself.

------
aaaaaaaaaab
Warning: unaligned reads invoke undefined behaviour!

~~~
woadwarrior01
Not on modern x86 processors[1].

[1]: [https://lemire.me/blog/2012/05/31/data-alignment-for-speed-m...](https://lemire.me/blog/2012/05/31/data-alignment-for-speed-myth-or-reality/)

~~~
aaaaaaaaaab
Undefined behaviour has nothing to do with CPU architecture...

~~~
woadwarrior01
It is undefined in multithreaded scenarios, not in single threaded code. In a
world where memory access latencies dominate and you pay a huge penalty for
cache misses, and a smaller yet significant penalty for TLB misses, there are
significant performance gains to be had by tightly packing
fields in structures, while still maintaining alignment at a structure level.
We’ve been doing this safely for years in prod now. I understand that the
dogma of “unaligned reads invoke undefined behavior” is rather popular given
how memorable it is, but it isn’t always true.

~~~
cesarb
It's undefined even in single threaded code. The compiler can assume it's
aligned and make optimizations which depend on it.

It sounds like what you were doing is marking a struct as "packed", and
accessing the fields directly (instead of through a pointer to the field).
That should be fine.
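
A rough sketch of that distinction, assuming a compiler that supports `#pragma pack` (GCC, Clang and MSVC all do, but it is not standard C++). The `Record` type and `read_value` function are hypothetical names for illustration.

```cpp
#include <cstdint>

#pragma pack(push, 1)
// Hypothetical packed record: value sits at offset 1, so it is
// not naturally aligned for a 32-bit integer.
struct Record {
    std::uint8_t  tag;
    std::uint32_t value;
};
#pragma pack(pop)

static_assert(sizeof(Record) == 5, "expected no padding in a packed struct");

// Accessing the field through the struct lets the compiler know the
// member may be misaligned, so it emits a suitable load.
inline std::uint32_t read_value(const Record& r) {
    return r.value;
}

// By contrast, taking &r.value and dereferencing it later through a
// plain std::uint32_t* drops that knowledge; that is the case to avoid.
```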

~~~
woadwarrior01
This might be a bit too late to reply. It’s undefined in C, but it’s well
defined at a processor level (on x86). IIRC, there was one particular bit of
code that’d crash on AArch64 due to unaligned pointers, but would run a bit
faster than aligned code on x86-64. On the other hand, this trick wouldn’t
work in multithreaded code even if the semantics were well defined, due to
false sharing of cache lines.

------
inetknght
Talking about lookup tables without using constexpr is like pissing away
performance.

~~~
pianoben
IMO this kind of lookup table is best computed externally, e.g. by a python
script, and included as a literal `constexpr std::array` or something. I was
recently trying to do something very similar with constexpr lookups, and threw
in the towel due to different compilers having different levels of constexpr
support. With literal tables, you know exactly what's going to happen.

Constexpr has tons of promise, but the compiler support just isn't there yet -
we still have to think really hard about it and look at the generated code to
verify.
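
For a sense of scale, here is a small sketch of a table built with C++14 `constexpr`. It is not the table from the post: it only maps a UTF-8 lead byte to its sequence length, and the names `LeadTable`, `lead_length`, `make_lead_table` and `kLeadTable` are invented for this example. A plain struct wrapping a raw array is used because `std::array`'s non-const `operator[]` is not `constexpr` before C++17.

```cpp
#include <cstdint>

// Hypothetical table type: one entry per possible lead byte.
struct LeadTable {
    std::uint8_t len[256];
};

// Sequence length implied by a lead byte, or 0 if the byte can never
// start a well-formed UTF-8 sequence.
constexpr std::uint8_t lead_length(unsigned b) {
    return b < 0x80 ? 1
         : b < 0xC2 ? 0   // continuation bytes, plus overlong leads C0/C1
         : b < 0xE0 ? 2
         : b < 0xF0 ? 3
         : b < 0xF5 ? 4
         : 0;             // F5..FF never appear in UTF-8
}

// Loops and local mutation in constexpr functions need C++14.
constexpr LeadTable make_lead_table() {
    LeadTable t{};
    for (unsigned b = 0; b < 256; ++b) {
        t.len[b] = lead_length(b);
    }
    return t;
}

constexpr LeadTable kLeadTable = make_lead_table();

// static_assert forces (and verifies) compile-time evaluation.
static_assert(kLeadTable.len[0x41] == 1, "ASCII is one byte");
static_assert(kLeadTable.len[0xC0] == 0, "C0 is never a valid lead byte");
```

A 256-entry table like this stays far below any compiler's constant-evaluation limits; the depth and step limits mentioned above are more likely to bite with much larger tables, which is where a generated literal becomes attractive.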

~~~
beached_whale
What version of c++? Sounds like you are mixing 11 and 14+

~~~
pianoben
c++14. I hit limits in clang relating to the depth of computation it will
perform at compile-time, and those limits differ from gcc's and msvc's, so I
switched.

What do you mean by "mixing 11 and 14"?

~~~
beached_whale
It sounded kind of like the differences in constexpr between 11 and 14 where
11 only allowed a single statement type thing.

Some people have used trampoline techniques to get around the depth limits.
CRTP just did this for compile time regex.

------
navaati
> Lookup tables perform computations in a manner such that they are in effect
> creating new instructions custom fit for the algorithms specific purpose.

This is interesting. Considering that validating UTF-8 is something a computer does
_a lot_ these days, would it be practical to provide this lookup table and an
associated instruction hardwired in a processor? I'm thinking about the ISA
extension possibilities provided by RISC-V, for example.

I may be way off, but this table is 64 kB in size, right? I have no idea how
much silicon real estate that represents, relatively speaking...

~~~
kwillets
I think it's 2^12 = 4096 in size.

------
karlisolte
Wait. Does it actually validate UTF-8, or does it just check the ranges in which
1-4 byte characters are encoded?
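
To make the distinction concrete: full validation also has to reject overlong encodings, surrogate code points (U+D800..U+DFFF) and anything above U+10FFFF, which constrains the second byte after certain lead bytes. Below is a plain scalar sketch of those rules. It is neither the post's lookup-table method nor Höhrmann's DFA, and `is_valid_utf8` is a name chosen for this example.

```cpp
#include <cstddef>

// Hypothetical reference checker following the well-formed byte
// sequence ranges of the Unicode standard.
inline bool is_valid_utf8(const unsigned char* s, std::size_t n) {
    std::size_t i = 0;
    while (i < n) {
        const unsigned b0 = s[i];
        if (b0 < 0x80) { ++i; continue; }      // ASCII

        std::size_t len = 0;
        unsigned lo = 0x80, hi = 0xBF;         // allowed range for byte 2
        if (b0 >= 0xC2 && b0 <= 0xDF) {
            len = 2;
        } else if (b0 >= 0xE0 && b0 <= 0xEF) {
            len = 3;
            if (b0 == 0xE0) lo = 0xA0;         // reject overlong 3-byte forms
            if (b0 == 0xED) hi = 0x9F;         // reject surrogates
        } else if (b0 >= 0xF0 && b0 <= 0xF4) {
            len = 4;
            if (b0 == 0xF0) lo = 0x90;         // reject overlong 4-byte forms
            if (b0 == 0xF4) hi = 0x8F;         // reject values above U+10FFFF
        } else {
            return false;                      // 80..C1 and F5..FF
        }

        if (n - i < len) return false;         // truncated sequence
        if (s[i + 1] < lo || s[i + 1] > hi) return false;
        for (std::size_t k = 2; k < len; ++k) {
            if ((s[i + k] & 0xC0) != 0x80) return false;
        }
        i += len;
    }
    return true;
}
```

A checker that only looks at lead-byte ranges and continuation bits would accept inputs such as `C0 80` or `ED A0 80`, which this version rejects.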

~~~
kevin_thibedeau
Björn Höhrmann wrote the gold standard for a minimalist scalar decoder and
didn't depend on UB to do it:

[https://bjoern.hoehrmann.de/utf-8/decoder/dfa/](https://bjoern.hoehrmann.de/utf-8/decoder/dfa/)

~~~
petermcneeley
Björn's decoder is nearly identical to the naive validity checker found here.
It is about 9x slower than the superposition lookup table found here.

