
Faster Integer Parsing - fanf2
https://kholdstare.github.io/technical/2020/05/26/faster-integer-parsing.html
======
tzs
> It’s 16 characters long, and this could also apply to credit card numbers.

Credit card numbers can be up to 19 characters and as short as 12 characters.
(I think they can theoretically be shorter but I'm not aware of any issuer
that actually does that).

Also, they are strings over the alphabet of decimal digits. They are not
intended to be interpreted as representing integers, and doing so is a bad
idea. Leading 0s matter in credit card numbers. The first few digits identify
the issuer. You could in theory have two card numbers where one is N digits
without any leading zeros and the other is N+k digits with k leading zeros and
the remaining N digits the same as the first card's. These would be different
cards from different issuers but if you just parsed them into numbers would
come out the same.

~~~
theandrewbailey
Credit card numbers are a good example of the test 'do you need to do math or
compare them?' If not, keep the numbers as strings.

~~~
bch
Are the strings checksummed, though? In this case, you’d need to do math, but
still keep them as strings...

~~~
PeterisP
The checksum calculation does not involve interpreting the whole multidigit
number as an integer, so you don't need to parse the number; you're doing math
on specific characters/digits only.

~~~
boring_twenties
Wouldn't it in fact be a lot slower using an integer? Lots of divisions by 10,
instead of just adding every byte.

~~~
userbinator
[https://en.wikipedia.org/wiki/Luhn_algorithm](https://en.wikipedia.org/wiki/Luhn_algorithm)

There's exactly _one_ division required.

(Implementing it using vector instructions is left as an exercise for the
reader. A very elementary exercise, so I encourage you to try it if you're
curious about SSE and such.)

~~~
Arnavion
Luhn's algorithm works on a digit string, not a single integer.
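
For illustration, a minimal sketch of the Luhn check operating directly on the
digit string (my own example, not from the article; it assumes the input
contains only ASCII digits):

    #include <cstddef>
    #include <string>
    
    // Walk the string right to left, doubling every second digit; a doubled
    // digit greater than 9 has 9 subtracted. The only division/modulo is the
    // final check of the running sum.
    bool luhn_valid(const std::string& digits) {
        int sum = 0;
        bool double_it = false;        // the rightmost (check) digit is not doubled
        for (std::size_t i = digits.size(); i-- > 0; ) {
            int d = digits[i] - '0';
            if (double_it) {
                d *= 2;
                if (d > 9) d -= 9;
            }
            sum += d;
            double_it = !double_it;
        }
        return sum % 10 == 0;
    }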

------
ritter2a
Certainly a nice read.

On the "culture of 'optimization is the root of all evil'" remark in the
conclusions: I find this to be a nice example for the full Knuth quote.

If you face an arbitrary task including parsing 64 bit integers, starting by
developing/using the technique from the article (as a _premature_
optimization) is probably a bad idea since it costs time for the
implementation (and even more time for debugging and understanding the code a
few months later), while in most cases, it is probably not what dominates the
running time of your code. If, however, you have built a solution that does
the job but is just not fast enough, and profiling shows that you spend
considerable time parsing integers, this kind of optimization is the way to
go.

~~~
userbinator
Or even better: if you can change the problem so that you don't need to parse
at all, then do that instead.

Recently I made a comment on the overhead of textual formats from the
validation perspective:
[https://news.ycombinator.com/item?id=23582056](https://news.ycombinator.com/item?id=23582056)

I think that a textual format is overall a horrible idea if the use-case is
not presenting the vast majority of the content to humans.

In an ideal world, communication between machines would be all in binary
protocols, and developers would know how to read/write them with tools like
hex editors as naturally as a second language.

Instead, countless amounts of time and space are wasted by machines converting
their native integers into strings, wrapping it in JSON, base64'ing that,
wrapping it into XML, then sending it over the network (whose lower layers are
thankfully binary) to another machine where the reverse process happens, but
with additional checks during parsing. (I am not exaggerating. I have seen
systems like this.) 99.99999...% of this data will _never_ be seen by a human.
What a disgusting waste of computing power.

"The fastest way to do something is to not do it at all."

~~~
khold_stare
Author of the article here - I completely agree. This text parsing problem
came up because I was lamenting how many cycles were wasted in text
processing. At my current job a text-based protocol from a 3rd party means
around 80% of CPU time for the entire application is spent parsing and
rendering text just for the protocol. JSON and HTTP come to mind too.

The human-readability argument doesn't really hold any water, because if you
have a structured description of a protocol (e.g. a C struct), you can always
write simple tools to inspect the protocol and make it just as human-readable
as JSON is. This is even easier if the language has reflection to generate all
this code.

~~~
imtringued
What prevents you from using CBOR? I do understand that a fixed schema is more
of a shackle and can't be used in every situation but CBOR doesn't require an
external schema.

------
nickcw
This is a nicely written article about getting C++ to parse a 64 bit integer
represented as an ASCII string as fast as possible.

It starts with the library version, moves on to a custom version, then unrolls
it, then does clever bit tricks, and finally moves to SIMD.

Worth a read if you enjoy optimization stories!

~~~
not2b
But it's an apples-to-oranges comparison, between functions that skip
whitespace, validate the input, and report error conditions, and a function
that assumes that it has exactly N digits and does not need to check. Code
that doesn't have to solve the most expensive part of the real problem can run
much faster.

It would have been more interesting to try to optimize something closer to the
problem the library routines are solving: skip leading whitespace, error if
the first character isn't a digit, accumulate digits until a non-digit is
reached. Can the article writer beat the standard functions? By how much?

~~~
ip26
Does all that stuff really wind up costing meaningful performance?

If you say:

    
    
       for (size_t i = 0; chars[i] != '\0'; i++)
         if ( notWhitespace(chars[i]) && isDigit(chars[i]) )
            doWork(chars[i]);
    

speculative execution will give you near full performance when the input is
clean, as long as you didn't write it in a way that invites mispredicts.
You'll only suffer the cost of checking when your input is not clean.

~~~
Someone
That’s more elaborate than BM_Naive from the article, which takes over 10
times as long as their final version.

The reason? The next-to-last version of the code grabs 8 characters into an
unsigned long, parses them, grabs 8 more characters, parses those, multiplies
the first number by 100,000,000, and adds it to the second.

So it does 2 iterations per 16-digit number, not 16.

The last version improves on that further by grabbing all 16 digits in one go.
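
For reference, a hedged sketch of what that two-chunks-of-8 version roughly
looks like (reconstructed from the description above rather than taken from
the article's benchmark code; it assumes a little-endian machine and exactly
8/16 valid ASCII digits):

    #include <cstdint>
    #include <cstring>
    
    // Parse exactly 8 ASCII digits by loading them into one 64-bit word and
    // combining neighbouring digits in three mask-multiply-add steps.
    inline std::uint64_t parse_8_digits(const char* str) {
        std::uint64_t chunk;
        std::memcpy(&chunk, str, sizeof(chunk));
    
        // Pairs of single digits -> 2-digit values.
        chunk = ((chunk & 0x0f000f000f000f00ULL) >> 8) +
                 (chunk & 0x000f000f000f000fULL) * 10;
        // Pairs of 2-digit values -> 4-digit values.
        chunk = ((chunk & 0x00ff000000ff0000ULL) >> 16) +
                 (chunk & 0x000000ff000000ffULL) * 100;
        // The two 4-digit halves -> the final 8-digit value.
        return ((chunk & 0x0000ffff00000000ULL) >> 32) +
                (chunk & 0x000000000000ffffULL) * 10000;
    }
    
    // Two "iterations" per 16-digit string, as described above.
    inline std::uint64_t parse_16_digits(const char* str) {
        return parse_8_digits(str) * 100000000ULL + parse_8_digits(str + 8);
    }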

------
fabian2k
I find it interesting how slow some standard library functions can be if your
code is really pushing them in untypical ways. It's not actually that
surprising in most cases once you remember that the library functions often
have to handle expensive edge cases and invalid input.

I did encounter a similar issue a while ago in JavaScript, parsing files that
could be hundreds of megabytes and consisted mostly of floating point numbers
as text. parseFloat turned out to be the bottleneck after optimizing
everything else, and I used a very naive (and, strictly speaking, incorrect)
function to parse the floats that worked much like the earlier optimization in
this article: simply read each digit, scale it by its power of ten, and add
everything up. Of course this produces slightly wrong results because of
rounding, but it was close enough for my use case. And it was actually
something like 2-3 times faster for the overall runtime (not just
microbenchmarking the float parsing). What helped here was that the number of
digits was generally low, so you typically had ~3 digits after the dot, not
the maximum floats can represent.
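
A rough C++ analogue of that naive approach (the original was JavaScript;
names and structure here are illustrative only):

    #include <cstddef>
    
    // Accumulate the integer part digit by digit, then add fractional digits
    // scaled by 0.1, 0.01, ... Rounding error accumulates because 0.1 is not
    // exactly representable as a double.
    double naive_parse_float(const char* s, std::size_t len) {
        double value = 0.0;
        double scale = 1.0;
        bool after_dot = false;
        for (std::size_t i = 0; i < len; ++i) {
            char c = s[i];
            if (c == '.') { after_dot = true; continue; }
            if (after_dot) {
                scale *= 0.1;
                value += (c - '0') * scale;
            } else {
                value = value * 10.0 + (c - '0');
            }
        }
        return value;
    }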

I'm pretty sure someone that knows what they're doing could have written an
even faster and more accurate version there given the same constraints, but I
was surprised just by how much I could beat the library function for my
specific, restricted use case with a very naive implementation.

------
gumby
What a lovely article! And some of the comments show that its context isn’t
understood (despite the author’s efforts to make it clear).

I’m going to start by ignoring this in my code. Just make it work correctly,
right? But as we get closer to deployment, performance probably matters (it
does to us). Then we start to notice lots of important coupling. For example,
input validation typically* isn’t required inside the system: data structures
are generated and used by our own code, and validation applies at the
“surface”, the external inputs of the system. Log records generated by one
part of the system may need to be dredged rapidly; forcing their
representation to be fixed means the code that parses them can make
assumptions the library code cannot. Metering will tell you where the
optimization opportunities lie.

* super long-lived applications sometimes need to do consistency checks even on internal data to avoid subtle corruption issues that could have hardware or software origin and could be very expensive to encounter. Things like phone switches and spacecraft that need to go years or even decades without a reboot.

------
imglorp
The article also doesn't explore another optimization: if you are parsing a
log file with a bunch of times in the same vicinity, maybe you can leverage
the last time stamp you saw? Only do math on the unique part of the new stamp,
then add that to the old stamp to get a new time.
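
A hypothetical sketch of that idea, assuming the timestamps are plain
fixed-length digit strings (e.g. epoch seconds); names here are illustrative:

    #include <cstdint>
    #include <string>
    
    // Cache the last stamp. When the new stamp shares a leading prefix with
    // it, parse only the changed suffix and splice it into the cached value.
    struct StampCache {
        std::string last_text;
        std::uint64_t last_value = 0;
    
        std::uint64_t parse(const std::string& text) {
            std::size_t common = 0;
            if (text.size() == last_text.size()) {
                while (common < text.size() && text[common] == last_text[common])
                    ++common;
            }
            std::uint64_t suffix = 0, pow10 = 1;
            for (std::size_t i = common; i < text.size(); ++i) {
                suffix = suffix * 10 + (text[i] - '0');
                pow10 *= 10;
            }
            std::uint64_t value = (common > 0)
                ? last_value - last_value % pow10 + suffix  // keep shared prefix
                : suffix;                                   // nothing shared: full parse
            last_text = text;
            last_value = value;
            return value;
        }
    };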

------
zamalek
Bit twiddling is always fun, especially where ASCII is concerned. This
branchless hex parser[1] of mine was a fun half day to figure out, but was
only possible because ASCII has some really useful properties.

[1]:
[https://gist.github.com/jcdickinson/9a4205287ae107e9f4e5f676...](https://gist.github.com/jcdickinson/9a4205287ae107e9f4e5f6764f432db9)
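
Not the linked gist, but one well-known variant of the ASCII property it
relies on (assuming the input is already a valid hex digit):

    #include <cstdint>
    
    // Branchless hex digit to value. '0'..'9' are 0x30..0x39, so the low
    // nibble is already the value; 'a'..'f' and 'A'..'F' both have bit 6 set,
    // and their low nibble is the value minus 9.
    inline std::uint32_t hex_digit_value(std::uint8_t c) {
        return (c & 0x0f) + (c >> 6) * 9;
    }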

------
velox_io
Json.org has a very nice diagram of how JSON numbers are interpreted. It's
fairly comprehensive (it accepts scientific notation), but...

It treats all numbers as decimals (doubles), since JavaScript treats all
numbers that way. If you only need to parse integers, or even only positive
integers, bit shifting is the only way to fly (bonus points for SIMD).

One detail the author didn't go into was cache latency and branching. I can't
find the exact numbers right now, but off the top of my head: L1 is ~1ns, L2
~4ns, L3 ~8ns (the L3 cache on the Ryzen 3900X & 3950X is 64MB!, in 16MB
clusters shared by 3-4 cores), and main memory is around 80ns (let's not get
into memory ranking). So getting to sub-nanosecond is impressive: at those
speeds you cannot afford ANY branching, and have to do operations in parallel.

And I didn't mention localisation with some Europeans using commas instead of
periods and periods instead of commas - madness!

------
nurettin
What's missing is a `thread_local static std::istream` inside the function so
that the expensive stream object isn't recreated every time the function is
called.
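
A sketch of what that might look like (hypothetical; I'm assuming a
std::istringstream here, since a bare std::istream cannot be constructed
without a buffer, and note that str() still copies the input):

    #include <cstdint>
    #include <sstream>
    #include <string>
    
    std::uint64_t parse_with_reused_stream(const std::string& text) {
        thread_local static std::istringstream stream;  // built once per thread
        stream.clear();     // reset eof/fail flags left over from the last call
        stream.str(text);   // swap in the new input, keep the stream object
        std::uint64_t value = 0;
        stream >> value;
        return value;
    }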

------
AnotherGoodName
Not discussed here - there's literally BCD instructions on x86 that take an
ASCII string and treat it as a number. I don't know of performance though
since these instructions are definitely archaic.

~~~
vardump
BCD is Binary Coded Decimal. Not ASCII. 0x12345678 BCD is 12345678 in decimal.

[https://en.wikipedia.org/wiki/Binary-coded_decimal](https://en.wikipedia.org/wiki/Binary-coded_decimal)

~~~
AnotherGoodName
On Intel unpacked BCD was specifically for working with numbers written in
ASCII.
[https://en.wikipedia.org/wiki/Intel_BCD_opcode](https://en.wikipedia.org/wiki/Intel_BCD_opcode)

~~~
vardump
Damn, and I thought I knew all about x86 assembler... never noticed unpacked
BCD before.

------
mkchoi212
Always great finding out about new SIMD instructions I've never heard of
before. Hope I get the chance to use them in the future and make stuff
insanely fast :p

------
29athrowaway
Are SSSE3 instructions faster than their AVX counterparts? My impression is
that AVX should be faster.

~~~
qayxc
There are no AVX counterparts. AVX just adds wider registers and instructions
that operate on them (256-bit, and 512-bit for AVX-512).

Since the problem in question is concerned with 16-digit integers (i.e. 128
bit wide), SSSE3 is a perfect match.

AVX instructions could be used to convert two numbers at a time, though.

------
sigil
Avoiding integer timestamp parsing overhead (or worse, strptime-based parsing)
is in the rationale for djb’s tai64:
[https://cr.yp.to/libtai/tai64.html](https://cr.yp.to/libtai/tai64.html)

------
sreekotay
VERY cool work and results :)

I'm always curious about these... microbenchmarks can skew things because of
code cache locality, e.g. startup time, ability to inline, etc. Any thoughts
on how to do a more "real world" test?

------
eska
The benchmark graphs didn't show up for me on Firefox mobile. Using the
desktop version fixed it.

------
ChrisLomont
I suspect he could make it faster using the newer neural-network instructions
like
[https://iq.opengenus.org/avx512-vnni/](https://iq.opengenus.org/avx512-vnni/),
which pack four multiply-and-add ops in parallel instead of the two he used.

~~~
khold_stare
The problem with that instruction is that to do 4 at a time you would have to
multiply by 1, 10, 100 and 1000 respectively. The last multiplier does not fit
in a byte.

~~~
ChrisLomont
Use the expand operation to put the operands in larger bitfields - you have
512 bits to play with, the input took 128, so you can expand each byte to a 32
bit value and use 32 bit multipliers.

Then you can use larger than byte operations and larger than byte multipliers
if you need them.

Alternatively,

In the first pass, do all the 1, 10, and 100 cases at once, using multipliers
1, 10, 100, 0 (throwing out the last byte).

Then mask (or use the compression instructions as needed), switch to the
wider-element versions, and do all the 1000-multiplier cases in one pass.

Then you merge them all in one pass.

It merges 4 at a time for the large pass and 3 at a time for the small pass
instead of 2 at a time.

I'll write it up later today if I get time.

