
Parsing short hexadecimal strings efficiently - fcambus
https://lemire.me/blog/2019/04/17/parsing-short-hexadecimal-strings-efficiently/
======
nwallin
You can use the "fancy math" version on all 8 bytes in the string
simultaneously:

    
    
        #include <cstdint>
        #include <cstring>
    
        uint32_t convert_hex(const char* s)
        {
          uint64_t a;
          std::memcpy(&a, s, sizeof a); // well-defined load; avoids the aliasing/alignment UB of a reinterpret_cast
          a =          (a & 0x0F0F0F0F0F0F0F0Fu) + 9 * ((a & 0xC0C0C0C0C0C0C0C0u) >> 6);
          a =          (a & 0x000F000F000F000Fu) |     ((a & 0x0F000F000F000F00u) >> 4);
          a =          (a & 0x000000FF000000FFu) |     ((a & 0x00FF000000FF0000u) >> 8);
          uint32_t b = (a & 0x000000000000FFFFu) |     ((a & 0x0000FFFF00000000u) >> 16);
          b = __builtin_bswap32(b);
          b = (b & 0x0f0f0f0fu) << 4 | (b & 0xf0f0f0f0u) >> 4;
          return b;
        }
    

Compiles to 30 instructions with clang, so 3.75 instructions per byte. (clang
is one instruction cleverer than gcc.) No branching; the only "complicated"
instructions are bswap (__builtin_bswap32), lea (the "multiplication" by
nine), and one addition. Other than that it's all bit manipulation: moves,
shifts, ands, ors. However, I doubt it will pipeline well; the data
dependencies are quite linear.

It has _very_ little tolerance for inputs that are anything other than 8-byte
hex strings. Do error checking elsewhere.

I doubt this would be faster in any meaningful sense in a real world use case.

------
dmoldavanov
Using an array lookup is a bad way to optimize: digittoval[src[N]] can take
up to 200 cycles if the table is not in cache.

Only synthetic tests that are small enough (most of them do nothing other
than run the tested code) show good results.

~~~
LeifCarrotson
The only reason that you'd want to optimize this function is if it's called
frequently. Imagine you have a text file or database with millions of
numbers in ASCII hexadecimal notation. When parsing that data with this code,
the table is practically guaranteed to be in the cache.

Modern processors have 32 KB of L1 just 4 cycles away, hundreds of KB of L2
12 cycles away, and MB of L3 on-die cache which still only takes 50 cycles or
so. I'd trust the machine to be certain to keep this 256-byte table in L1. It
would probably even pull in the relevant parts (48-57 for the decimal digits
and 65-70 for the alphabetic ones) of a 64 KB two-character table if you were
calling it in a tight loop. I'm curious whether you could load it fast enough
to sort out the cache when working with a four-character, 4 GB table...

~~~
dmoldavanov
>Imagine you have a text file or database with millions of numbers in ASCII
hexadecimal notation

That much data is guaranteed to crowd this array out of the cache, because
the processor cache is not LRU.

~~~
dmoldavanov
And now imaging an Operation System with hundreds or even thousands other
processes at work.

------
timerol
The current winner is a SIMD-ish solution using 32-bit math as 4 8-bit lanes.
This has been added to the repo and is discussed in the comments.

[https://github.com/lemire/Code-used-on-Daniel-Lemire-s-
blog/...](https://github.com/lemire/Code-used-on-Daniel-Lemire-s-
blog/tree/master/2019/04/17)

------
daeken
I don't have the spare cycles to test this out for myself right now, but I'm
curious if a 64kb table would be substantially faster. You'd cut it down from
8 indirections, 3 shifts, and 3 ORs to 4 indirections, 1 shift, and 1 OR.

    
    
        uint32_t hex_to_u32_lookup(const uint8_t *src) {
          // digittoval here is a 64 KB table indexed by two characters at once
          const uint16_t *wsrc = (const uint16_t *) src;
          uint32_t v1 = digittoval[wsrc[0]];
          uint32_t v2 = digittoval[wsrc[1]];
          return v2 << 8 | v1;
        }
    

(Also, small note: The function is named `hex_to_u32_lookup` but it only gives
a 16-bit result. Might want to clarify in the post.)

~~~
0815test
One comment in the article pointed out that you could create a 4 GB lookup
table and convert 32 bits of input at once. This is actually less silly than
it might seem, because only 64 KB of it would ever be read for correct input;
the table itself would only exist as a reservation somewhere in the
process's 64-bit (well, 48-bit actually, but still) virtual address space, so
a lookup of invalid input could easily be made to error out and return some
error condition.

For that matter, I'd want to run the author's "fancy math-based function"
through a superoptimizer and check that there isn't something that can do the
job in fewer cycles.

~~~
BonesJustice
Wouldn’t the ‘valid’ values in such a table be relatively sparse, creating a
high likelihood of cache misses? The guy’s agonizing over 4 cycles, but a
cache miss is far more costly.

------
csense
With modern hardware, for this to be a meaningful percentage of your workload,
you'd need to be working with super high data volumes (100 megabytes per
second at the very least, maybe an order of magnitude more depending on the
exact hardware).

A reasonable system designer would never create an interface that accepts only
hexadecimal input for bulk data transfers at those rates. That would waste
half the I/O, which makes no sense!

In this case, the problem with your system isn't the cycle count of your hex
conversion function. It's a poorly designed interface that's forced to use hex
for high data volumes. You get maybe a fraction of a percent optimizing the
hex conversion function, but there's a low-hanging 100% improvement if you
just use all 8 bits of your bytes.

Or in Donald Knuth's words, "Premature optimization is the root of all evil."

~~~
nostrademons
Oftentimes the biggest data-ingestion tasks happen at organizational
boundaries. You find this great open-source data set that's 77TB, or you
contract with another company to dump a 50GB CSV file on your FTP server every
night, or you pay for API access to a feed that generates 1G/sec, or another
department dumps 1TB in your S3 bucket nightly.

There's a strong incentive to use standardized text-based formats
(particularly JSON and CSV) for these. You can inspect sample data with a text
editor, or just with 'head'. You can run really simple data analytics with
just UNIX tools (I've found that a 'gunzip -c | grep | awk | grep | wc'
pipeline has gotten me results in 15 seconds that would've taken 5 minutes
with Python or half an hour in Java). Everybody already has a well-tested
parser and serializer for the format in any language they might choose to use.
Everybody's already familiar with them; you need only document field types &
formats rather than giving a byte-by-byte description of the protocol. With
JSON in particular, you've got 1-line parsing built into the browser instead
of needing to ship a large library. If problems arise, you can dump the raw
input record to the console for debugging without needing to pore over a hex
dump, and you have tools available (jq, awk) for filtering out extraneous
information.
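As a concrete example of that kind of pipeline (the filenames and fields here are made up for illustration, and the script generates its own sample data so it runs standalone):

```shell
# Generate a tiny gzipped CSV so the pipeline below is self-contained.
printf '%s\n' \
  'alice,login,ok' \
  'bob,upload,error' \
  'alice,upload,error' \
  'alice,delete,ok' | gzip > sample.csv.gz

# Count alice's error records: decompress, filter by user, pull the
# status field, and count matches.
gunzip -c sample.csv.gz | grep '^alice,' | awk -F, '{print $3}' | grep -c error
# -> 1
```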

If you control both sides of the protocol, then by all means use an efficient
binary serialization protocol like Cap'n Proto. But a lot of interesting data
ingestion problems don't fit in this category. Oftentimes the system designer
doesn't actually have control over the format in which data is provided to
the system.

