Hacker News new | past | comments | ask | show | jobs | submit login
Computing the number of digits of an integer even faster (lemire.me)
103 points by vitaut 3 months ago | hide | past | favorite | 56 comments

The big table becomes more obvious when the constants are written using hex:

  static uint64_t table[] = {
    0x100000000, 0x1FFFFFFF6, 0x1FFFFFFF6,
    0x1FFFFFFF6, 0x2FFFFFF9C, 0x2FFFFFF9C,
    0x2FFFFFF9C, 0x3FFFFFC18, 0x3FFFFFC18,
    0x3FFFFFC18, 0x4FFFFD8F0, 0x4FFFFD8F0,
    0x4FFFFD8F0, 0x4FFFFD8F0, 0x5FFFE7960,
    0x5FFFE7960, 0x5FFFE7960, 0x6FFF0BDC0,
    0x6FFF0BDC0, 0x6FFF0BDC0, 0x7FF676980,
    0x7FF676980, 0x7FF676980, 0x7FF676980,
    0x8FA0A1F00, 0x8FA0A1F00, 0x8FA0A1F00,
    0x9C4653600, 0x9C4653600, 0x9C4653600,
    0xA00000000, 0xA00000000,

Originally I used a macro:

// this increments the upper 32 bits (log10 T - 1) when >= T is added

#define K(T) (((sizeof(#T)-1)<<32) - T)

(T is 10,100, etc.)

Here's one interesting application of a digit-count function.

The naive implementation of an integer-to-string conversion involves writing the number backwards, starting from the least significant digit, and then reversing the string at the end.

In a 2017 talk titled "Fastware"[1], Andrei Alexandrescu showed that it's faster to start by counting the number of digits you're going to print, and then you can print by working backwards from the least significant digit. It comes out in the correct order without any need to reverse at the end.

His implementation of the digit count was itself interesting, since it does 4 comparisons per loop and then a divide by 10000, instead of the naive single comparison and divide by 10.

    uint32_t digits10(uint64_t v) {
      uint32_t result = 1;
      for (;;) {
        if (v < 10) return result;
        if (v < 100) return result + 1;
        if (v < 1000) return result + 2;
        if (v < 10000) return result + 3;
        v /= 10000U;
        result += 4;
IIRC, his insight was that smaller numbers tend to occur more often in the real world, so most calls to this function will not end up doing any divisions.

[1] https://www.youtube.com/watch?v=o4-CwDo2zpg

I imagine his code letting out a sigh when it gets to v/= 10000U.

"ok, fine..."

Why wouldn't you unroll the entire loop so no divisions are ever required?

Code density

> Alexei Andrescu

Andrei Alexandrescu

Thanks, corrected.

There are so many branches for the processor to evaluate in such approaches. The linked article does it without any branches (no if condition).

That's clever. A uint64 doesn't have more that 20 digits, so I'm guessing returning a uint8 over a uint32 might be better in memory-bound cases.

No, it really wouldn’t matter in the grand scheme of things, especially since the result will live in a register. Even if spilled to the stack, this one usage isn’t going to matter vs all the other stuff that gets regularly spilled.

Also most compilers can optimize the divide by constant to a multiply and shift which are much faster than a divide.

This blog post is a response to another blog post which is a response to a question about how to implement integer-to-string length as quickly as possible in an optimising implementation of the Ruby programming language called TruffleRuby, if people here wanted a concrete example of how these kind of brain-teasers come up for real in industrial programming in an everyday job.

I sometimes have difficulty finding efficient algorithms for specific tasks like this. Any advice on becoming better at finding efficient algorithms?

Here's a gem called _Matters Computational_; a storehouse of many such tricks with proper categorization and index.


Thank you for this

holy moly this is awesome; thanks!

I don't have any tips. Micro-optimization at this level often requires creative leaps, and proving that they work can be tricky (although for 32-bit integers, you can just iterate over all of them to validate). There are web pages with "bit-twiddling hacks" (https://graphics.stanford.edu/~seander/bithacks.html), and sometimes you can find a solution for your problem with a superoptimizer (if you care enough).

One other way to prove these algorithms is by writing them in z3 prover. You write a simple one that is 100% correct and then you prove equality for your optimized one or find an example that will show inequality.

A great thing about a monadic operation that takes a 32-bit integer, like this one, is that you can also just exhaustively test it in reasonable time!

Today I learned "monadic" has two significantly different meanings in the realm of programming.

I'm only familiar with using "unary" to mean single-parameter.

Nah, they’re pretty much the same meaning.

A monad, derived from monoid, means multiple very specific things completely unrelated to the other definition of "single parameter".

And the etymology is basically a coincidence because "mono" got used in different places, isn't it?

> []completely unrelated[] to the other definition of "single parameter".

Actually, all monads are monadic functions/operators on their (usually type) parameter. Eg, 'list' is a monadic operator from element types to types of lists of those elements.

It's completely different (AFAICT), and mostly unrelated, but it's not completely unrelated.

Let's assume you've studied the well-known computer science tricks like divide and conquer, greedy algorithms, and the various tree searches, and that's insufficient, even after specializing for your reduced domain. (These will get you 90% of the way there 90% of the time, and usually that's good enough. Sometimes it's not.)

For integer tricks like this, you'll want to look at number theory. Modular equivalence is particularly useful, but lots of things in that space are useful. However, there are a lot of other tricks that rely on geometric sequences, or geometry themselves, and other higher maths -- a lot of advances like this are simply connecting random theorum a to situation b, if you can envision the situation in the correct space.

Buy a copy of Hacker's Delight.

    Digits = LEN( STR$( SomeNumber ))
(run and hide in shame)

you need an ABS( ) in there no or else the negative sign gets counted? Also some languages will write really big numbers or really small numbers in engineering notation (eg 10,000,000 becomes 10e6) but that's typically only for floating point types.

The article only presents a function operating on unsigned integers, it's not unreasonable to suggest that this is intended to do the same ;)

Awesome find! But I wonder how easily this could be extended to 64 bit integers? That sequence in the lookup table won't fit into 64 bits if it's extended. I think you would have to fall back to just having the lookup table storing the transition data, as described in https://commaok.xyz/post/lookup_tables/ in the "A Small Step" section. And probably add a conditional in the beginning to handle very large integers that would overflow. Alternatively you might be able to do something with two lookup tables, but I think at that point you've got enough lookup data that it would be less efficient.

You'd need 128-bit integers because of the way this works in the intermediate steps before the shift. There are u128 types in many modern compilers, but it's probably cheaper just to use a different type (u64) of table and another operation.

> how easily this could be extended to 64 bit integers?

It's effectively:

  size_t logx = int_log2(x);
  uint32_t carry = ((uint64_t)x + tablo[logx]) >> 32;
  return tabhi[logx] + carry;
Assembly can do (from a equivalent start):

  mov rdx rax  # save x
  xor eax eax
  add rdx [8*rcx+tablo]  # set carry (would cmp be better?)
  adc al [1*rcx+tabhi]  # result never exceeds 64/log(10) < 20

It's easier when there are more bits available than in the original operand.

For 64, a variable shift may work. Powers of 10 have a lot of trailing 0 bits.

I recall needing this on a snprintf() implementation.

Isn't this effectively a table-optimized logBase(x) rounded-up to the ceiling, which can be reduced to efficient log2(x)/log2(Base)? If you know the base, then you can combine many ops into table-lookups.

As an exercise to the reader, I suggest creating efficient code constrained to 32-bit or 16-bit unsigned.

PS: I was going to say "look at Hacker's Delight." I can definitely read. LOL.

it all goes back to 650x, where we had no multiplication/division and we had to create sin/cos tables.

And even on stuff like the 68000 on the Amiga/Atari ST we were using sin/cos tables for lookups were way faster than doing multiplication/divisions if I recall correctly. I have fond memories of tools allowing to create these tables for "sprites" movements (in either demos or games) but forgot how these little tools were called: maybe "table editor"?

It'd be interesting to see a performance benchmark of the various solutions, including ones that don't use lookup tables.

The results of such a benchmark would probably heavily depend on whether the lookup table is allowed to be in the CPU caches.

Is there a version of this that would tell you the number of bytes an integer would use? If so, how would it differ with signed versus unsigned integers?

You can tell how many bits an integer uses with lzcnt.

Thank you!

No way to leverage popcount instruction for this?

If you insist, you could calculate the log2 as something like

    x |= x >>  1;
    x |= x >>  2;
    x |= x >>  4;
    x |= x >>  8;
    x |= x >> 16;
    return popcount(x-1);
but there isn't really much of an advantage to it. lzcnt/bsr are better.

lzcnt is the first instruction in his solution. Seems it's left as implicit in this post, but if you go back through the chain of posts it's there.

LZCNT (base 2) would get you to a range of possible decimal digit (base 10) lengths that would need further processing, but it's still more expensive than a table-lookup direct solution directly to the base you're interested in.

> LZCNT (base 2) would get you to a range of possible decimal digit (base 10) lengths that would need further processing, but it's still more expensive than a table-lookup direct solution directly to the base you're interested in.

Are you saying you've got a better solution than the blog post that does a direct table lookup? Can you share it?

Why did you quote what I said that has nothing to do with it?

Do you have a better solution?

log2 is generally implemented with some variant of the "count leading zeros" instruction, but the general popcount is irrelevant

int->float conversion, followed by extraction and unbiasing of the exponent, also works for log2 and is more portable.

I don't normally convert my ints to floats, but when I do, I *(float *)(&x). ;-) (or fild)

A few days ago I had to write a bit of code in C# with an int to string conversion, and since I never use that language and I'm not used to OOP I completely looked over the ToString() method every primitive type has, I wrote a dumb conversion like I used to do as a student in C. The basic modulo 10 loop thing. You know the one.

Anyway I'm glad to know there's a cool lookup table trick out there, I never really get to use anything like that but it makes for a nice read.

This is a fun optimization, but what is a use case where I would need to count the number of digits?

Formatting numbers into strings is a pretty fundamental operation to need to do in your fast-path. For example formatting time into log lines. To do that we need to know how many digits the milliseconds component has so we can pad it with zeros correctly. That's what motivated this particular question in our case.

That makes sense. Never really thought about that as a counting digits problem but yeah.

you can also base your solution on hamilton cycles. that also boils down to o(1).

> you can also base your solution on hamilton cycles

I don't know what that means. Can you illustrate (preferably with a code example)?


is part of the solution. the other part is getting to the most significant digit.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact