
Validating UTF-8 bytes using only 0.45 cycles per byte (AVX edition) - akarambir
https://lemire.me/blog/2018/10/19/validating-utf-8-bytes-using-only-0-45-cycles-per-byte-avx-edition/
======
the_clarence
I see a lot of applications trying to take advantage of SIMD, but what happens
when you try to run them on systems that don't support these instructions? My
guess is that you need to write multiple files taking advantage of different
sets of instructions and then dynamically figure out which to use at runtime
with cpuid, but isn't that cumbersome, and a way to inflate a codebase
dramatically?

~~~
why_only_15
In my understanding, when you use intrinsics and build for a processor without
support for the intrinsics, GCC for example will replace them with equivalent
code.

~~~
mcbain
Unfortunately, no.

That is the case with GCC's __builtin functions. With a few exceptions,
intrinsics are basically macros for inline asm that the compiler can reason
about.

If on x86-64 you use a _mm256* intrinsic and compile without AVX support, you
just get a compile error, not a pair of equivalent SSE instructions.

~~~
rurban
Even worse. You mostly get run-time errors when the build machine supported
that feature, your machine doesn't, and the features aren't separated into
multiversioning or loading different shared libs.

------
bradleyjg
Under the new string model in Java > 8, a fairly frequent workflow is:

1) get external string

2) figure out if it is UTF-8, UTF-16, or some other recognizable encoding

3) validate the byte stream

4) figure out if the code points in the incoming string can be represented in
Latin-1

5) instantiate a java string using either the Latin-1 encoder or the UTF-16
encoder

I know some or all of these steps are done using HotSpot intrinsics, and then
the JIT/VM does inlining, folding and so on, but I wonder how fast a custom
assembly function that did all these steps at once could be.

~~~
Twirrim
You might be interested in his blog post on the same subject from a few days
ago:
[https://lemire.me/blog/2018/10/16/validating-utf-8-bytes-jav...](https://lemire.me/blog/2018/10/16/validating-utf-8-bytes-java-edition/)

------
jwilk
Previous blog post on HN:

[https://news.ycombinator.com/item?id=17081571](https://news.ycombinator.com/item?id=17081571)

------
kissiel
I wonder about the Joules per byte. AFAIK AVX units are quite expensive
energy-wise.

~~~
masklinn
Don't they also tend to run at a lower clock due to their higher energy
requirements?

edit: though this is AVX2 ("AVX-256") rather than AVX-512, and Lemire has
covered AVX and the possibility of throttling (with or without AVX) in the
past, so they're probably aware of the potential issue and consider that it
either won't get triggered or that the gain is good enough to compensate for
the lower frequency.

~~~
kissiel
Nice. So I understand that AVX2 doesn't bring the CPU's clock down.

Got any sources for power consumption figures/comparisons of those AVX units?

~~~
lorenzhs
Heavy use of complex AVX2 operations causes downclocking, too, but typically
less so than AVX-512. More details are documented in
[https://en.wikichip.org/wiki/intel/frequency_behavior](https://en.wikichip.org/wiki/intel/frequency_behavior)
-- also see e.g.
[https://en.wikichip.org/wiki/intel/xeon_gold/6138#Frequencie...](https://en.wikichip.org/wiki/intel/xeon_gold/6138#Frequencies)
for an example of how the frequencies differ depending on the number of active
cores.

I _think_ the reason for reducing clock speed when vector units are in heavy
use is to keep power usage in check.

You might also find
[https://blog.cloudflare.com/on-the-dangers-of-intels-frequen...](https://blog.cloudflare.com/on-the-dangers-of-intels-frequency-scaling/)
helpful, which goes into detail about a specific case where dynamic frequency
scaling resulted in AVX-512 code running _slower_ than AVX2 code.

~~~
Twirrim
It's worth noting that the Cloudflare test was done on a Xeon Silver, which
has worse frequency-scaling behaviour than the Gold or Platinum parts. If
you're on either Gold or Platinum, you're less likely to suffer the problems
that Cloudflare did with mixed workloads.

This seems like an optimisation nightmare: your program needs to be aware both
of which instructions the chip supports, and of which chip within a family it
is, to decide whether you do or don't want to use certain vector instructions.

------
akarambir
What do Linux utilities like sed and awk use for text manipulation? They were
very slow when I was changing a few table names in an SQL file.

~~~
coldtea
What was the size of the SQL file?

A "few table names" doesn't mean much if the SQL file is 20GB.

In any case, sed and awk are plenty fast, but not the fastest methods of text
manipulation. You could write a custom C program for that.

~~~
Thiez
While it sure is possible to do text manipulation in C, I don't think it
should ever be the first choice, even if 'fastest' is a goal. A 0 byte is
perfectly acceptable in a utf8 string (or any unicode string, really). But C
has those annoying zero-terminated strings, so if you want to manipulate
arbitrary unicode strings, the first thing you have to do is kiss the string
functions in the C standard library goodbye. Which you probably want to do
anyway, because Pascal-style strings are simply better.

I would use Rust or C++ for this task.

~~~
knome
> A 0 byte is perfectly acceptable in a utf8 string (or any unicode string,
> really)

What? My understanding was that utf8 was crafted specifically so that the only
null byte in it was literally NUL. That all normal human language described by
a utf8 string will never contain a NUL. They're comparable to C strings in
that way, where it can be used safely as an end of string marker. If you have
embedded NULs, it's not really utf8, is it?

~~~
Dylan16807
> My understanding was that utf8 was crafted specifically so that the only
> null byte in it was literally NUL.

Correct.

> That all normal human language described by a utf8 string will never contain
> a NUL.

Correct.

> If you have embedded NULs, it's not really utf8, is it?

Incorrect.

NUL is a valid character. If you accept arbitrary utf-8, or arbitrary ascii,
or arbitrary 8859-1, then there might be embedded NUL. You can filter them out
if you want, but they're not invalid.

~~~
paavoova
It's invalid for Unix filenames to contain a null character. Therefore, if
your application is printing filenames in their unicode representation, it
never needs to consider the possibility of a null byte. This of course isn't
the arbitrary case, but it shows that one can make assumptions regardless of
the "validity" of a character. I believe for most cases of arbitrary input,
the correct and safe thing to do is to assume a byte stream of unknown
encoding.

~~~
Thiez
Since we arrived at this null-character discussion by considering text
manipulation in C, I suspect most comments in this thread are made on the
assumption that the text must be manipulated in some way (mine are!), so
treating it as a byte stream of unknown encoding doesn't really solve the
problem.

While null in filenames may be forbidden on Unix (and also on Windows), there
are more exotic systems where it is allowed [1]. When writing portable
software it's probably best not to make assumptions about what characters will
never be in a filename.

Naturally, if you have a problem where you can get away with just moving bytes
around and never making assumptions about their contents, then that is a great
solution.

[1]:
[https://en.wikipedia.org/wiki/Filename#Comparison_of_filenam...](https://en.wikipedia.org/wiki/Filename#Comparison_of_filename_limitations)

