
Show HN: Faster UTF-8 validator - zwegner
https://github.com/zwegner/faster-utf8-validator
======
andrewf
Hi! I took a shot at vectorized UTF-8 validation in 2012. I just put this
faster validator, and Daniel Lemire's code, into my test harness at
[https://github.com/andrewffff/utf8fuzz/tree/2019_compare](https://github.com/andrewffff/utf8fuzz/tree/2019_compare)

On an i7-7800X (clang 6.0.0-1ubuntu2, WSL, don't trust my numbers) my
benchmarks showed about the same relationship between SSE4, AVX2, and
Lemire's code as your benchmarks did. My own attempt is about half as fast.
[https://raw.githubusercontent.com/andrewffff/utf8fuzz/2019_c...](https://raw.githubusercontent.com/andrewffff/utf8fuzz/2019_compare/rough-benchmark.png)

A few examples of invalid UTF-8 from Markus Kuhn's suite pass this validator
right now, specifically 4.1.3, 4.2.3 and 4.3.3. My randomized tests, which
compare the results from different validators, also fail a small percentage of
the time, I'd guess for the same reason.
[https://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt](https://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt)

I'm really interested in the different approaches taken here. Fortunately,
both you and Daniel have communicated clearly what you were doing; I think
it's going to take longer for me to re-comprehend my own approach!

Thanks for sharing this.

~~~
zwegner
Oh wow, that is most definitely a bug. Thank you very much for reporting that,
before this spreads too much. How embarrassing... I was in a bit of a rush to
stick this up on HN before the weekend, as otherwise I'd probably never get
around to it. Evidently I got a bit sloppy making the error table.

Luckily, that's a very easy bug to fix--it was just caused by a mistake
constructing the error tables. The only annoying part is having to renumber
the error bits and write the tables again by hand :)

Thanks too for benchmarking! I see you tried out make.py, but it didn't work?
I should probably add a Makefile for people that don't want to deal with yet
another random build system...

------
BeeOnRope
Good stuff.

The key to a solid claim, especially for something as bold as "fastest in the
world", is a complete specification of the inputs to the benchmark.

You mention random ASCII bytes and "random UTF-8 bytes". The former is
definitely a really important case, but also the least interesting. I can
write on a napkin a UTF-8-but-is-actually-ASCII decoder that approaches 256
bytes a cycle when cached (with a fallback routine for when the ASCII
assumption fails).

So then, what about the random UTF-8 bytes? What does that mean? Do you
generate random bytes and then exclude invalid sequences? Do you generate a
uniform random code point in the 21-bit code point space and then convert it
to UTF-8?
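For concreteness, here's a minimal sketch of the second strategy (a
hypothetical helper; a real generator would also have to resample surrogates
and values above 0x10FFFF to keep the output valid):

    #include <stdint.h>
    
    /* Encode one code point cp (0..0x10FFFF, not a surrogate) as UTF-8.
     * Returns the number of bytes written to out. */
    static int utf8_encode(uint32_t cp, uint8_t *out) {
        if (cp < 0x80) {
            out[0] = (uint8_t)cp;
            return 1;
        } else if (cp < 0x800) {
            out[0] = (uint8_t)(0xC0 | (cp >> 6));
            out[1] = (uint8_t)(0x80 | (cp & 0x3F));
            return 2;
        } else if (cp < 0x10000) {
            out[0] = (uint8_t)(0xE0 | (cp >> 12));
            out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
            out[2] = (uint8_t)(0x80 | (cp & 0x3F));
            return 3;
        } else {
            out[0] = (uint8_t)(0xF0 | (cp >> 18));
            out[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
            out[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
            out[3] = (uint8_t)(0x80 | (cp & 0x3F));
            return 4;
        }
    }

The two strategies give very different byte distributions: rejection
sampling over random bytes skews heavily toward ASCII, while uniform code
points skew heavily toward 4-byte sequences (the astral planes dominate the
code space).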

At a minimum it would be good to see a benchmark with random distributions
that approximate common languages.

The truly fastest decoders on such realistic data will be adaptive and more
complicated than yours.

~~~
zwegner
Well, to be fair, my claim was that it was the fastest in the world _that I'm
aware of_, which is a much weaker statement. :)

The random UTF-8 in the benchmark was generated from the code in Daniel
Lemire's fastvalidate-utf-8 repository, specifically this code:
[https://github.com/lemire/fastvalidate-utf-8/blob/ed53c0c64b...](https://github.com/lemire/fastvalidate-utf-8/blob/ed53c0c64b3e5ef767eeea8f8f1c205f75c377af/benchmarks/benchmark.c#L93-L141)

Looking closer at it, I think the distribution of random UTF-8 is not as
uniform as it should be: it generates one byte first, and then generates
continuation bytes depending on the value of that byte, which means that half
the code points will just be ASCII.

But I think this doesn't matter very much for benchmarking, at least for
mostly-branchless SIMD algorithms like mine. For each vector of input bytes,
there are three branches that can be taken: the early exit for ASCII-only
input, and two branches that exit the loop due to validation failures, which
don't really matter for benchmarking. Even though the distribution of code
points will be roughly half ASCII, for a 32-byte AVX2 vector the probability
of a pure-ASCII vector is something less than 2^-32: the probability of 32
ASCII _code points_ in a row is 2^-32, and since non-ASCII code points
translate into more than one byte of output, the probability of 32 ASCII
_bytes_ in a row is less than that, by an amount I'd rather not try to
calculate. So for this particular benchmark, random UTF-8 should be roughly
equivalent to "no ASCII". To verify, I disabled generating ASCII in the
linked code, and the numbers came out pretty much identical. If anything,
_even more_ ASCII bytes would be a more interesting test, since that would
make the quick-ASCII branch less predictable. If you know of any good UTF-8
corpora for common use cases, I'd be happy to benchmark them.
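(For reference, the quick-ASCII check itself is about as cheap as a branch
test gets -- a minimal illustration of this kind of check, not the exact code
from my repo:)

    #include <immintrin.h>
    #include <stdint.h>
    
    /* A zero movemask means no byte has its top bit set, i.e. all 32
     * bytes are ASCII and the rest of the validation can be skipped. */
    static inline int vector_is_ascii(const uint8_t *p) {
        __m256i v = _mm256_loadu_si256((const __m256i *)p);
        return _mm256_movemask_epi8(v) == 0;
    }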

Given that different sets of "random UTF-8 bytes" generated with this method
will virtually always follow the same path, the exact distribution doesn't
really matter when measuring cycles/byte. Input that is purely 2-byte code
points will be faster in terms of cycles/code point than purely 4-byte input,
but that's mostly a property of UTF-8 being variable length.

I doubt there's much speed to be gained by adding any sort of adaptive
behavior beyond the ASCII check. Generally these days you want as few branches
as you can manage, and this algorithm has only one that matters.

~~~
BeeOnRope
> So for this particular benchmark, random UTF-8 should be roughly equivalent
> to "no ASCII". To verify, I disabled generating ASCII in the linked code,
> and the numbers came out pretty much identical. If anything, even more ASCII
> bytes would be a more interesting test, since that would make the quick-
> ASCII branch less predictable. If you know of any good UTF-8 corpora for
> common use cases, I'd be happy to benchmark them.

Indeed, this is a good example of a problem with random benchmarks: no real
text behaves like uniform randomness with 50% ASCII and no character-to-
character correlation. In the real world you often expect bursts of ASCII,
e.g., where a title is written in ASCII, or in structured formats like HTML,
JSON, etc., which often have lots of ASCII data mixed in with meant-for-human
strings that may be non-ASCII, but it's all very bursty.

As you point out, for your algorithm, 50% ASCII with no burstiness basically
means every vector takes the non-ASCII path, which is actually "good"
compared to, say, 50% of vectors taking the ASCII shortcut (which would slow
things down due to branch-predictor failures) - so a lot of the interesting
design space is ignored (e.g., I think the ideal algorithm will choose
between strategies in a branch-predictor-aware fashion).

> this doesn't matter very much for benchmarking, at least for the mostly-
> branchless SIMD algorithms like mine

Right - but of course it matters a lot for existing algorithms (which tend to
be branchy), which can come out looking much worse or much better depending
on the distribution (not that I think any will be able to beat a good
vectorized approach). Also, as above, the early-out for ASCII _will_ matter
for a lot of practical inputs, but neither benchmark stresses it.

> I doubt there's much speed to be gained by adding any sort of adaptive
> behavior beyond the ASCII check. Generally these days you want as few
> branches as you can manage, and this algorithm has only one that matters.

Right, well the ASCII one is a big one. I haven't looked at the details of
your algorithm, but could the core loop be faster if say it never saw any
4-byte sequences, or certain other uncommon things? Then an adaptive approach
would use an optimized kernel for that scenario if it was encountered for a
while.

"Adaptive" doesn't really mean more branches - just that you occasionally
might switch to a different kernel based on the observed data. This shouldn't
add many branches compared to e.g., the existing loop and failure branches,
and evidently you have to do it in a branch-predictor aware way (e.g., you
need some type of hysteresis in mode switches so you don't switch too often).
Sometimes you can build the adaptivity into the existing branches, e.g., maybe
you are taking a branch anyways (e.g., when you find non-ASCII text) and you
can build the adaptive "state machine" directly into the code via duplication
of code, w/o needing any explicit data counting the number of failures
(effectively, the IP holds extra information recording something about the
path you took to reach the current location).
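A minimal sketch of that last idea, with stand-in kernels (chunk_is_ascii
could be a movemask check; the full kernel is elided):

    #include <stddef.h>
    #include <stdint.h>
    
    int chunk_is_ascii(const uint8_t *p);   /* e.g., a movemask check */
    int validate_chunk(const uint8_t *p);   /* full UTF-8 check, elided */
    
    /* The mode lives in the instruction pointer: two copies of the scan
     * loop, and switching kernels reuses a branch we take anyway. */
    int validate_ip_state(const uint8_t *data, size_t len) {
        size_t off = 0;
    
    ascii_kernel:                      /* mode: input so far is pure ASCII */
        while (off + 32 <= len) {
            if (!chunk_is_ascii(data + off))
                goto full_kernel;      /* non-ASCII seen: switch modes */
            off += 32;
        }
        return 1;                      /* (tail handling elided) */
    
    full_kernel:                       /* mode: input contains non-ASCII */
        while (off + 32 <= len) {
            if (!validate_chunk(data + off))
                return 0;
            off += 32;
            /* with hysteresis, a long ASCII run could jump back above */
        }
        return 1;                      /* (tail handling elided) */
    }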

~~~
zwegner
Ah, thanks for the reply. I understand you better now, and for the most part
I agree. Taking out the early exit for ASCII can speed up the frequently-
changing case by avoiding the misprediction penalty, as long as sufficiently
many chunks of input need to take the long validation path.

Thinking about this more, I think there's at least one way that an adaptive
approach might be beneficial without too much extra complexity: keeping two
copies of the main kernel, one with the ASCII check and one without. It
should be pretty cheap to add a counter that is incremented or decremented
based on the ASCII-only mask, and to split the outer loop into N-byte chunks,
with a branch for each chunk determining whether we should take the
branchless path or not.
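Roughly like this, say (just a sketch of what I mean, not working code -- the
two kernels stand in for copies of the main loop with and without the ASCII
exit):

    #include <stddef.h>
    #include <stdint.h>
    
    #define CHUNK 1024   /* the "N-byte chunks" above */
    
    /* Stand-ins: each validates n bytes and reports how many 32-byte
     * vectors were ASCII-only via *ascii_vecs. */
    int validate_with_ascii_exit(const uint8_t *p, size_t n, int *ascii_vecs);
    int validate_branchless(const uint8_t *p, size_t n, int *ascii_vecs);
    
    int validate_adaptive(const uint8_t *data, size_t len) {
        int score = 0;   /* saturating counter; > 0 means ASCII-heavy */
        for (size_t off = 0; off < len; off += CHUNK) {
            size_t n = len - off < CHUNK ? len - off : CHUNK;
            int ascii_vecs = 0;
            int ok = score > 0
                   ? validate_with_ascii_exit(data + off, n, &ascii_vecs)
                   : validate_branchless(data + off, n, &ascii_vecs);
            if (!ok)
                return 0;
            /* Saturation gives some hysteresis, so one unusual chunk
             * can't flip the mode back and forth on every iteration. */
            score += ascii_vecs > (int)(n / 64) ? 1 : -1;
            if (score > 4) score = 4;
            if (score < -4) score = -4;
        }
        return 1;
    }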

> Right, well the ASCII one is a big one. I haven't looked at the details of
> your algorithm, but could the core loop be faster if say it never saw any
> 4-byte sequences, or certain other uncommon things? Then an adaptive
> approach would use an optimized kernel for that scenario if it was
> encountered for a while.

I started out thinking this wouldn't really work, but I think there might be
some potential here. My initial problem was that even if some work could be
saved when there aren't 4-byte sequences, we still need to detect them every
time. But because my algorithm uses lookup tables for error flags that have
some free bits, and because 4-byte sequences can be detected at the same
indices used for these lookups, we can set another error bit that means "take
the 4-byte slow path". Then, when some input fails validation, we only do the
extra work there to check whether it's really a failure. This gets
complicated, though: first off, the check for the proper number of
continuation bytes happens before the table lookup, so we'd need to put some
logic in there. Secondly, this lookup table flags validation failures one
byte later in the stream than the initial byte, so in the case that a 4-byte
sequence starts on the last byte of a vector, we'd need special handling.

It'd probably be a good idea to have some hysteresis in this approach too...
So overall, I think there might be some nice gains from adaptive behavior.
There are two big concerns that make me skeptical, though: code size/I$
pressure, and code complexity. While code size wouldn't be much of an issue
in microbenchmarks, in real applications it matters a lot more--I don't want
the size overhead to get too big. Right now the full validation algorithm is
about 600 bytes of code for AVX2, and I'd rather not make that explode by a
large factor by adding several specializations. And I'm already reaching the
limits of what little generic programming can be done in the C
preprocessor... This sort of specialization is really better done with C++
templates (or maybe Rust or something). I had wanted to keep this a pure-C
library, but maybe C++ would be better.

Hope some of that makes sense... I'm mostly thinking out loud here. In any
case, you've given me a lot to think about, so thanks very much!

~~~
BeeOnRope
Thanks for the reply. I put some more thoughts on github.

------
vardump
Pretty nice. It's always good to see commonly needed algorithms get better
optimized. If this is really faster than previous implementations (no reason
to doubt it, but I don't have time to validate right now), it might save
megawatts of power worldwide. (Disclaimer: would need to actually check with
power measurement tools, but faster generally consumes less power.)

A small nitpick: I'm not sure whether those macros bought anything -- I'd
guess the optimizer could inline function calls to intrinsic wrappers just
fine as well?

~~~
FpUser
_...this might save megawatts of power world wide..._

Since when have saved megawatts bothered programmers? If that were the case,
we would see far less usage of scripting languages and Electron apps.

~~~
marmada
It takes effort to not use Electron. It doesn't take effort to have your
library update its dependencies to use a faster UTF-8 validator.

~~~
F-0X
> It takes effort to not use Electron.

Why do you believe this? I don't think it is quite that ubiquitous.

~~~
jimsmart
Let's say I want to build a cross-platform app. I could use Electron (it's
somewhat of a known quantity), or I could start doing some research into what
else is available, and then evaluate the various offerings - not only to
ensure they do what I need, but to also ensure that they use less
power/resources than Electron. The effort here is (at least some kind of)
unavoidable up-front cost/time.

To switch existing code to use some particular UTF-8 decoding library should
be as simple as changing some imports, and perhaps some other (search/replace)
code tweaks or implementing a quick shim/adapter. If it doesn't work out,
rollback or abandon the branch. On most codebases, even larger ones, this
whole process is probably an easier thing to do, requiring less effort than
finding a decent (and more power efficient) alternative to Electron — seems to
be the GP's gist. And personally I'd agree.

~~~
FpUser
_" Let's say I want to build a cross-platform app. I could use Electron (it's
somewhat of a known quantity), or I could start doing some research into what
else is available, and then evaluate the various offerings - not only to
ensure they do what I need, but to also ensure that they use less
power/resources than Electron. The effort here is (at least some kind of)
unavoidable up-front cost/time."_

Yup, that's right. Any sign of inconvenience and efficiency be damned. Why do
we even bother trying to fix the environment? It costs money and
inconveniences so many people.

Also, it depends on your background. I have no problem doing without
Electron. I would have to apply your exact logic to even look at it.

~~~
dwild
You are clearly someone who sees in black and white. There are shades of gray
too, you know?

> Any sign of inconvenience and efficiency be damned.

Nope, not any sign... just that if you can build more with less time, in
exchange for a bit of CPU efficiency, it's not so bad.

> Why do we even bother trying to fix the environment?

For the future. You see, my program using a bit more RAM or CPU won't change
much. Not eating meat, on the other hand, will change quite a bit, so that's
one of the things I do.

Software can allow people to be orders of magnitude more efficient; if I can
build more of it, that lets even more people be orders of magnitude more
efficient. The extra power usage afterward fits well within that margin.

------
eyegor
I'm not sure if the compiler is smart enough to do it for you, but you might
be able to squeeze out a bit more performance by precomputing a partial sum
(data + V_LEN - 1) outside the loop for this statement:

> v_load(data + offset + V_LEN - 1);
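i.e., something like this (just a sketch; the macro definitions here are
stand-ins to make it self-contained, not the repo's actual ones):

    #include <immintrin.h>
    #include <stddef.h>
    
    #define V_LEN 32
    #define v_load(x) _mm256_loadu_si256((const __m256i *)(x))
    
    void scan(const char *data, size_t len) {
        const char *shifted = data + V_LEN - 1;   /* computed once */
        for (size_t off = 0; off + 2 * V_LEN <= len; off += V_LEN) {
            __m256i v = v_load(shifted + off);    /* one addition in the loop */
            (void)v;                              /* ... validation elided ... */
        }
    }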

Other than that, this is incredibly clean code, and I doubt you can get much
more performance without dropping into assembly. I wish my coworkers wrote
code this well documented :/

~~~
zwegner
At least my compiler (LLVM 10) is smart enough to know that "data" isn't
modified, and thus fold the addresses of the two vector loads inside the main
loop into single instructions:

    
    
        1a0: c5 fe 6f 6c 31 ff     vmovdqu ymm5,YMMWORD PTR [rcx+rsi*1-0x1]
        1a6: c5 fe 6f 24 31        vmovdqu ymm4,YMMWORD PTR [rcx+rsi*1]
    

I'd expect most modern compilers to get this right with optimizations on, but
if somebody finds one that doesn't, I'd like to know. In general, I read
through the disassembly a decent amount while developing this. That's how I
noticed the "req += (vmask2_t)set << n;" (instead of |=) trick, which gets
compiled to a single lea instruction. The disassembly got a little hairy when
I added the code to handle trailing bytes, though...
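For the curious, the idea is roughly this: when the two operands' set bits
are disjoint, + and | give the same result, but the + form maps onto x86's
three-operand lea (a hypothetical reduction of the trick, not the code from
the repo):

    #include <stdint.h>
    
    /* If the bits of (set << 1) never overlap bits already in req, then
     * + and | are interchangeable, and the + form can compile to a single
     * non-destructive instruction: lea rax,[rdi+rsi*2]. */
    uint64_t accumulate(uint64_t req, uint64_t set) {
        return req + (set << 1);
    }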

Thanks for the kind words, though! It warms my heart :)

------
aqrit
How does this differ from here:
[https://github.com/lemire/simdjson/pull/365](https://github.com/lemire/simdjson/pull/365)

~~~
zwegner
Oh interesting, I hadn't seen that. It looks like it uses the same idea of
shuffle-lookups on the first three nibbles. That's a fairly large patch,
though, and I don't think I've fully grokked it yet. At the very least, my
code differs in that it does less byte-stream shifting, instead doing shifts
in the scalar domain. Their code also needs to deal with quotes, escaping,
etc., due to being part of a JSON validator.

------
SlowRobotAhead
I like the way you write and comment your C code. Very nicely written.

Except all those spaces between the # and “define”, no sir I don’t like that
:)

~~~
zwegner
Well how else are you supposed to nest your preprocessor macros? :P

Thanks though! I like taking the time to make my code as clean as I can. Glad
others appreciate it!

~~~
usr1106
I have used the same "indentation" in the past. But because there are always
review comments saying that it would be weird/confusing/unconventional, I
have given up on using it :(

~~~
kevin_thibedeau
It used to be the only valid way to indent macros. It has the advantage that
macro lines always keep the # in column 1 and stand out more from regular
structured code.

------
vmurthy
Sorry if this is slightly off-topic, but it's a great time to revisit this
classic primer on Unicode [0] by Joel Spolsky.

[0] [https://www.joelonsoftware.com/2003/10/08/the-absolute-minim...](https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/)

~~~
matheusmoreira
Also:

[https://utf8everywhere.org/](https://utf8everywhere.org/)

------
pabs3
I wonder if the authors plan to get this merged into commonly used
implementations of UTF-8 validators so that folks can benefit from their work.

Anyone have any ideas about which open source codebases UTF-8 validators exist
in?

~~~
nitwit005
There's not as big a niche as you might think. A lot of the software that
needs valid UTF-8 don't validate upfront, but do it as part of a parsing
process. An example would be something like an XML or JSON parser.

For more mundane uses like opening a big text file, you often want to tolerate
invalid bytes and show a replacement character. You could use the technique,
but you'd have to modify it to work for that usage.

~~~
zwegner
In case anyone sees this: I initially agreed with you, thinking that pure
UTF-8 validation is a bit more of a niche than might be expected. But if I'm
not mistaken, at least for JSON validation the UTF-8 validation can happen
completely independently of the parsing. I don't think any control characters
(braces, commas, quotes, escapes, etc.) will change the validity of the
UTF-8, or vice versa. So it might be beneficial (both for speed and security,
as the sibling comment notes) to validate chunks of input as UTF-8, then pass
them off to the parser, which then doesn't need to deal with validation.

Maybe this could be applied to XML, but that's quite a behemoth of a standard
-- I really wouldn't be surprised if there was a way to switch encodings mid-
stream. I have no idea though...

~~~
nitwit005
Sure, you can validate and then parse the JSON, but parsing it is also
effectively validating it as UTF-8 (except for the string values in the
JSON), so it's largely going to be duplicated effort. And unfortunately, the
strings can contain things like invalid surrogate pairs via Unicode escapes,
so you may want to validate after processing the escapes.
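For example, the raw bytes of this document are pure ASCII, so any UTF-8
validator accepts them, yet the escape decodes to an unpaired high surrogate:

    /* Valid UTF-8 bytes, invalid decoded content: */
    const char *doc = "{\"s\": \"\\uD800\"}";   /* lone high surrogate */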

You can't normally blindly parse XML as UTF-8, either. The XML standard has
rules for detecting the character set.

------
CodesInChaos
Do the instructions you use cause downclocking on Intel CPUs?

~~~
BeeOnRope
No, because non-"FMA unit" (mostly SIMD FP) AVX2 instructions don't cause
license-based downclocking on recent Intel CPUs.

To get downclocked to the L1 license (the middle speed tier), you need to use
either FMA-unit AVX/AVX2 instructions or any AVX-512 instructions. This
algorithm doesn't use any of those, staying in the integer domain.

Here's a longer description:

[https://stackoverflow.com/a/56861355](https://stackoverflow.com/a/56861355)

Of course, this doesn't address power- or temperature-based throttling, which
might be more likely to occur when wide SIMD is used, but it's not possible
to give a precise answer there, other than that many low or moderate core
count setups won't see this kind of throttling outside of extreme use cases.

~~~
zwegner
Thanks for the link, that's a good post! I hadn't investigated the throttling
issues much in the past, not owning an AVX-512 machine, but that's much more
precise than the standard "don't use AVX-512" meme that's repeated all over
the place. Bookmarked!

