
How to Print Integers Really Fast - signa11
https://www.pvk.ca/Blog/2017/12/22/appnexus-common-framework-its-out-also-how-to-print-integers-faster/
======
im3w1l
A fairly common special case of printing integers is printing all integers
from 0 to N. In that case you can use a trick: Keep a string representation of
the current number and increment the string representation.
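
A minimal sketch of that trick in C, assuming a caller-owned buffer with room for one extra digit. The helper name `decimal_increment` is hypothetical and not taken from the article:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: increments the decimal string in buf in place.
 * buf holds len digits followed by '\0' and must have room for len + 2 bytes.
 * Returns the new length (len, or len + 1 when 99...9 rolls over). */
size_t decimal_increment(char *buf, size_t len)
{
    size_t i = len;

    /* Walk back from the least significant digit, turning trailing 9s into 0s. */
    while (i > 0 && buf[i - 1] == '9') {
        buf[--i] = '0';
    }
    if (i > 0) {
        buf[i - 1]++;   /* common case: bump a single digit, no division needed */
        return len;
    }
    /* Every digit was 9: shift right (terminator included) and prepend '1'. */
    memmove(buf + 1, buf, len + 1);
    buf[0] = '1';
    return len + 1;
}
```

Nine times out of ten only the last character changes, so the per-number cost is close to a single byte store.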

~~~
blt
This is a common case? Besides educational exercises, why? I would have
guessed most printed integers are random ID numbers, prices, database foreign
keys, etc.

~~~
jzwinck
Line numbers in less(1) or GitHub. Row numbers in Excel, Pandas, R, etc.

The most common way to label rows is with sequential numbers.

~~~
eesmith
FWIW, in a file with 1,000,000 lines, the best-of-3 time for "less filename >
/dev/null" is:

    
    
      2.508u 0.152s 0:02.66 99.6%	0+0k 0+0io 0pf+0w
    

and the best-of-3 time for "less -N filename > /dev/null" is:

    
    
      2.568u 0.159s 0:02.73 99.2%	0+0k 0+0io 0pf+0w
    

That is, printing sequential line numbers doesn't seem to be the limiting
factor in performance.

This is with "less 458", "Copyright (C) 1984-2012". I downloaded and compiled
stock 487 and the best-of-3 times went up to 0:02.94 for both cases.

Checking the source code, it does not appear to use knowledge of the previous
output index in order to save time. The relevant code is:

    
    
      static int
      iprint_linenum(num)
              LINENUM num;
      {
              char buf[INT_STRLEN_BOUND(num)];
    
              linenumtoa(num, buf);
              putstr(buf);
              return ((int) strlen(buf));
      }
    

where

    
    
      #define TYPE_TO_A_FUNC(funcname, type) \
      void funcname(num, buf) \
              type num; \
              char *buf; \
      { \
              int neg = (num < 0); \
              char tbuf[INT_STRLEN_BOUND(num)+2]; \
              register char *s = tbuf + sizeof(tbuf); \
              if (neg) num = -num; \
              *--s = '\0'; \
              do { \
                      *--s = (num % 10) + '0'; \
              } while ((num /= 10) != 0); \
              if (neg) *--s = '-'; \
              strcpy(buf, s); \
      }
      
      TYPE_TO_A_FUNC(linenumtoa, LINENUM)

~~~
blt
hmm, maybe someone can get a pull request for implementing this optimization
:)

~~~
eesmith
While I know you meant that comment in jest, it suggests that I wasn't clear
enough.

My implicit argument is that there's no reason to believe that line
counting adds anything more than trivial overhead to the less output, so
there's no reason to even consider this optimization, much less get to the
point of making a pull request.

------
Const-me
Not long ago I benchmarked several implementations. The fastest itoa I
found was the fmt::FormatInt class from fmtlib:
[https://github.com/fmtlib/fmt](https://github.com/fmtlib/fmt)

Based on the article, I think fmtlib will be faster than what they did here.

~~~
Veedrac
FormatInt doesn't write out to a buffer and doesn't pad the output with zeros,
so it's going to lose out nontrivially in actual use.

I have tested some FormatInt-like code that does do those things; if I recall
the results correctly, the article's is slightly faster with unpredictable
sizes and significantly faster with predictable sizes.

------
phkahler
Division is slow. I recently needed to send 16 bit integers over a UART and
came up with something like this:

    
    
      uint16_t n, d;
      
      // 10K digit
      d = ((uint32_t)n * 53687 + 8208) >> 29;
      n -= d*10000;
      // output d+'0'
      
      // 1K digit
      d = ((uint32_t)n * 2097 + 1368) >> 21;
      n -= d*1000;
      // output d+'0'
      
      // 100s digit
      d = (n * 41) >> 12;
      n -= d*100;
      // output d+'0'
      
      // 10s digit
      d = (n * 51 + 40) >> 9;
      n -= d*10;
      // output d+'0'
      // output n+'0'
    

It does scale to 32 bits, but that requires 64-bit intermediates.
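
Because the input is only 16 bits wide, constants like these can be checked exhaustively against plain division. The sketch below wraps the four steps above in a function and compares every input with `/` and `%`. The function names are hypothetical, and the output is always five zero-padded digits, which may differ from what the original UART code sent:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical wrapper: writes five zero-padded digits, most significant first. */
void u16_to_digits(uint16_t n, char out[6])
{
    uint32_t d;

    d = ((uint32_t)n * 53687u + 8208u) >> 29;   /* n / 10000 */
    n = (uint16_t)(n - d * 10000u);
    out[0] = (char)('0' + d);

    d = ((uint32_t)n * 2097u + 1368u) >> 21;    /* n / 1000, valid for n < 10000 */
    n = (uint16_t)(n - d * 1000u);
    out[1] = (char)('0' + d);

    d = ((uint32_t)n * 41u) >> 12;              /* n / 100, valid for n < 1000 */
    n = (uint16_t)(n - d * 100u);
    out[2] = (char)('0' + d);

    d = ((uint32_t)n * 51u + 40u) >> 9;         /* n / 10, valid for n < 100 */
    n = (uint16_t)(n - d * 10u);
    out[3] = (char)('0' + d);

    out[4] = (char)('0' + n);
    out[5] = '\0';
}

/* Returns 1 if all 65536 inputs agree with ordinary division, 0 otherwise. */
int u16_digits_match_division(void)
{
    for (uint32_t v = 0; v <= 65535u; v++) {
        char got[6], want[6];
        u16_to_digits((uint16_t)v, got);
        want[0] = (char)('0' + v / 10000u);
        want[1] = (char)('0' + v / 1000u % 10u);
        want[2] = (char)('0' + v / 100u % 10u);
        want[3] = (char)('0' + v / 10u % 10u);
        want[4] = (char)('0' + v % 10u);
        want[5] = '\0';
        if (memcmp(got, want, sizeof got) != 0)
            return 0;
    }
    return 1;
}
```

The explicit `uint32_t` casts on the last two steps are an addition: on a target where `int` is 16 bits, `n * 41` could overflow for n near 999.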

~~~
Const-me
You’re probably using some exotic compiler.

Most C compilers already do something very similar, check this out:
[https://godbolt.org/g/weicFj](https://godbolt.org/g/weicFj)

Obviously, "x / 10" is much easier to read and support, compared to these
manual optimizations.

~~~
phkahler
I'm not doing mod 10 operations, I'm printing left to right. I also put each
digit in a case:, and the digit printing is part of a larger state machine
transmitting various things. No characters are stored; one of these blocks
runs whenever the UART is ready to accept a new character. I didn't want to
do the entire conversion to string all at once, and this is a very fast way to
extract digits in order.

~~~
Const-me
Your “d = ((uint32_t)n * 53687 + 8208) >> 29;” is no faster than “d = n /
10000;” Modern compilers already apply these multiply-shift tricks when
dividing by a constant.
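
As a rough illustration of the transformation, unsigned 32-bit division by 10 can be written as a 64-bit multiply and a shift. The function name is hypothetical, and the exact instruction sequence a compiler emits varies by target and optimization level:

```c
#include <stdint.h>

/* Equivalent to n / 10 for every 32-bit unsigned n.
 * 0xCCCCCCCD is (2^35 + 2) / 10, so the product overshoots n * 2^35 / 10
 * by less than 2^35 / 10, which is too little to change the quotient. */
uint32_t div10_mulshift(uint32_t n)
{
    return (uint32_t)(((uint64_t)n * 0xCCCCCCCDull) >> 35);
}
```

Writing `n / 10` in the source and letting the compiler pick the constant gives the same result with less room for error.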

~~~
phkahler
I was writing for an ARM based micro controller. The compiler may well have
been GCC. I should have timed it vs the obvious use of division. I was not
aware that they would do such things for division by arbitrary large numbers.
If that's all so, then why is TFA worried about the speed of itoa?

~~~
Const-me
> I was not aware that they would do such things for division by arbitrary
> large numbers.

They don’t, just 10000 is not large enough :-) But yeah, apparently 32-bit 10k
division works on GCC ARM as well:
[https://godbolt.org/g/ebJqrw](https://godbolt.org/g/ebJqrw)

> why is TFA worried about the speed of itoa?

Because itoa is slow.

I once needed to read / write large (100MB+) G-code files. The profiler showed
that most of the time was spent in standard library routines like itoa / atof / etc.

Replacing itoa with [http://fmtlib.net](http://fmtlib.net) led to much faster
code.

------
legulere
Would be interesting to see comparisons to the algorithms in that benchmark:
[https://github.com/miloyip/itoa-benchmark](https://github.com/miloyip/itoa-benchmark)

What I generally wonder is, why ISAs don't have special instructions for a
task that is so common.

~~~
fulafel
ISA design is a highly quantitative business. You take a bunch of application
benchmarks, compile them with your proposed new instruction, and measure the
speedup.

So the answer is one of:

1) people have, and decided against it.

2) people have dismissed it in an earlier phase, by observing that it doesn't
make a big enough blip in CPU profiles to warrant further investigation.

Also, it's hard to speed up very much in hardware because it's a bunch of
interdependent divides which current cpu divide instructions are already good
at. If the hardware acceleration interface is vector style, it's doubtful if
many applications can easily take advantage of it. This article talks about
printf, which certainly can't.

Do you have an area of applications in mind that spends a big slice of its
computation time on integer -> ascii conversions?

~~~
Veedrac
> Also, it's hard to speed up very much in hardware because it's a bunch of
> interdependent divides

Most things which seem sequential are actually easily parallelized. This is
one of them; all of the divides can be done in parallel, and since divisions
are by constants it all falls into some fairly simple hardware.

The real argument is more that it's a bit pointless to hardware accelerate
something that takes 10-20ns in optimised software if it's not used 100
billion times per day per core.
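
To make the "divides in parallel" point concrete, here is a sketch in which each digit is computed directly from the input, so no digit waits on the result for another. The function name and the fixed ten-digit, zero-padded output are assumptions made for the illustration:

```c
#include <stdint.h>

/* Each output digit depends only on n and a constant power of ten.
 * There is no running quotient carried from one digit to the next,
 * so the ten computations are independent of one another. */
void u32_digits_independent(uint32_t n, char out[11])
{
    static const uint32_t pow10[10] = {
        1000000000u, 100000000u, 10000000u, 1000000u, 100000u,
        10000u, 1000u, 100u, 10u, 1u
    };

    for (int k = 0; k < 10; k++) {
        out[k] = (char)('0' + (n / pow10[k]) % 10u);
    }
    out[10] = '\0';
}
```

In software this does more total work than the usual divide-by-10 loop. The independence only pays off where the operations can actually run side by side, as in dedicated hardware or wide out-of-order execution.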

------
otabdeveloper2
> It’s a stupid problem to have–just use a binary format

Not really. A safe, efficient and future-proof binary format is not a simple
problem to solve. At the very least, you'll need something like base-127
encoded digits.

At this point you might as well go with base 10 and have human readability as
a side bonus.

P.S. "Just use protobuf" is not a solution either. It's only an option because
you're using a pre-built library solution, and there's no a priori reason to
pick protobuf over some other pre-built solution.

~~~
stouset
> P.S. "Just use protobuf" is not a solution either.

It's _one_ solution, which makes it _a_ solution. It also happens to be a good
solution for 95%+ of data-interchange use-cases.

JSON is a genuinely terrible approach to interchange for most anything that
you expect to grow to be nontrivial. Whatever your thought of static vs.
dynamic typing for languages, schemaless _interchange_ formats are absolute
madness — an analogue to what people are rapidly learning about schemaless
databases. "Schemaless" formats don't actually mean there isn't a schema. You
have one, it's just informal and you lack any useful tools to manipulate it or
make changes in the future.

What you gain in early development speed is lost many times over by not
actually stopping to think about how your data will be modeled beforehand, and
it will be lost many times over again when you need to change your informal
parsing logic (e.g., random hash accesses) that's spread across dozens of
unrelated areas of your code and impossible to locate. As an added bonus, you
often end up having to indefinitely support every buggy, half-baked version of
this format going backward to the beginning of time.

~~~
fpoling
JSON is a terrible format for data storage. But for data exchange it is OK. The
problem with schemaless storage is that as code evolves it is way too easy to
forget to cover the needs of already stored but not presently accessed data.
With communications this is much less of a problem, as serialization and parsing
must evolve together with the code.

~~~
stouset
You have this exact same issue with JSON for data interchange, unless there
are exactly one producer and one consumer and you can deploy both
simultaneously.

If you have more than one producer, more than one consumer, or you don't have
the control to update all participants simultaneously, you are setting
yourself up for pain.

------
nitwit005
> I find Protobuf does very well as a better JSON

I assume this was for their ad bidding? I recall they sent JSON with a
ridiculous number of integers for the arrays representing audience segments.

------
merraksh
I recently saw another version of itoa, called itoa_jeaiii, with good
performance claims:

[https://twitter.com/Atrix256/status/938994058348847106](https://twitter.com/Atrix256/status/938994058348847106)

Code at
[https://github.com/jeaiii/itoa/tree/master/itoa](https://github.com/jeaiii/itoa/tree/master/itoa)

