
Optimizing 128-bit Division - EvgeniyZh
https://danlark.org/2020/06/14/128-bit-division/
======
danlark1
Hi everyone, the author is here. Yes, I believe the title should be changed to
`Optimizing 128-bit Division`

Still, I was not expecting it to show up here. In the article I combined
knowledge hidden in Hacker's Delight, Knuth, GMP, and GNU with my own
experience in low-level optimization. In the end it turned out to be a good
thing to write and to submit to LLVM.

~~~
WalterBright
What's the license? Hopefully, Boost or Boost compatible?

~~~
danlark1
For compiler-rt it is Apache 2.0 license

[https://github.com/llvm/llvm-project/blob/master/compiler-rt/LICENSE.TXT](https://github.com/llvm/llvm-project/blob/master/compiler-rt/LICENSE.TXT)

~~~
WalterBright
Thank you. We've standardized on the Boost license for D, as it is the least
restrictive, and well-accepted in the C++ community.

Is it possible you can make a Boost licensed version of it so we can add it to
D?

[https://www.boost.org/LICENSE_1_0.txt](https://www.boost.org/LICENSE_1_0.txt)

~~~
danlark1
That's a difficult question. As I work at Google, I need to get approval for
each open-source project I want to publish on my own behalf. (This does not
apply to contributing to projects on the approved list, where I only need to
report the contributions.)

If you want to work around this for now, I suggest looking into libdivide
([https://github.com/ridiculousfish/libdivide](https://github.com/ridiculousfish/libdivide));
it is published under the Boost license and the library contains all the needed
pieces I described in the article (unfortunately, not combined).

~~~
WalterBright
Thanks for the pointers. I suspect Google would be ok with Boost, since they
are a C++ house and Boost is the major library. A big reason D picked it was
because Boost is corporate lawyer approved.

------
chris_st
I remember seeing a letter to the editor of an early Byte magazine, wherein
the author of the letter recommended that you "Be sure to take your 32-bit
divide routines with you when you change jobs... there's no guarantee that
your new job's routines will be correct, let alone fast!".

For the record, I have never followed this advice.

------
est31
Very nice post! I remember helping to port the 128 bit integer functions in
compiler-rt to Rust years ago [0], because clang only supported 128 bit
integers on 64 bit platforms, but the goal of Rust was to support them on all
platforms that Rust supports.

Since then, all algorithms in compiler-rt have been ported to Rust and live in
the compiler-builtins crate. This [1] is the source code file for unsigned
division. The actual logic is inside a macro and used to implement both 128
bit division in terms of 64 bit numbers and 64 bit division in terms of 32 bit
numbers.

I wonder if similar optimizations can be done to that code.

[0]: [https://github.com/rust-lang/rust/pull/38482](https://github.com/rust-lang/rust/pull/38482)

[1]: [https://github.com/rust-lang/compiler-builtins/blob/master/src/int/udiv.rs](https://github.com/rust-lang/compiler-builtins/blob/master/src/int/udiv.rs)
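For context, the baseline such width-generic code improves on is plain binary
long division. An illustrative Rust sketch of that shift-subtract fallback
(my own simplification, not the actual compiler-builtins macro) for the
128-bit case:

```rust
// Schoolbook shift-subtract (restoring) division: the simple, portable
// fallback that optimized implementations beat. Returns (quotient, remainder).
fn udivmod128(n: u128, d: u128) -> (u128, u128) {
    assert!(d != 0, "division by zero");
    let (mut q, mut r) = (0u128, 0u128);
    for i in (0..128).rev() {
        // Shift the next bit of the dividend into the partial remainder.
        r = (r << 1) | ((n >> i) & 1);
        if r >= d {
            r -= d;
            q |= 1u128 << i;
        }
    }
    (q, r)
}
```

One iteration per bit, so 128 steps regardless of the operands; the
optimizations discussed in the article come from delegating to the hardware
64-bit divider whenever the operands allow it.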

~~~
danlark1
Hi, I believe they can; for 32-bit platforms you can also use the multiword
division from Knuth
([https://skanthak.homepage.t-online.de/division.html](https://skanthak.homepage.t-online.de/division.html)),
which is what I chose as the fallback on non-x86_64 platforms.

I will try to implement the same optimizations in Rust in the upcoming weeks.

__UPD__: And we opened an issue :)
[https://github.com/rust-lang/compiler-builtins/issues/368](https://github.com/rust-lang/compiler-builtins/issues/368)

~~~
mjcohen
Many years ago I implemented a set of multiple-precision integer routines in
Fortran. For division, I slavishly copied Knuth's routine, using all the
sample divisions in his book as tests, and it worked fine.

------
eutectic
If all you need is to uniformly reduce a number into a given range, there is a
division-free approach:

lemire.me/blog/2016/06/27/a-fast-alternative-to-the-modulo-reduction/%3famp
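The trick from the linked post replaces `x % n` with one multiplication and a
shift; a sketch in Rust (assuming a uniformly random 64-bit input):

```rust
// Lemire's multiply-shift reduction: maps a 64-bit x into [0, n) without a
// division. The result is not equal to x % n, but it is (nearly) uniformly
// distributed over [0, n) when x is uniform over all 64-bit values.
fn reduce(x: u64, n: u64) -> u64 {
    (((x as u128) * (n as u128)) >> 64) as u64
}
```

The full product `x * n` lies in `[0, n * 2^64)`, so its high 64 bits lie in
`[0, n)`; on x86_64 this compiles to a single widening multiply.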

~~~
repiret
It took my fat fingers a while to be able to successfully copy-n-paste the
link on mobile. Here’s a proper link to save others the hassle:

[https://lemire.me/blog/2016/06/27/a-fast-alternative-to-the-modulo-reduction/](https://lemire.me/blog/2016/06/27/a-fast-alternative-to-the-modulo-reduction/)

------
Stratoscope
Looks like this got bit by the HN title number ripper.

For the curious, the actual title is "128 bit division".

------
drfuchs
The article’s title is actually “128 bit division” while HN currently shows it
as just “Bit Division”. I suggest both be changed to “Optimizing 128-bit
Division”. Nice article, btw.

------
amelius
Great post. The only thing missing is a good strategy to _test_ the resulting
algorithm. Errors in arithmetic can be a nightmare.

~~~
danlark1
Hi, I thought this was not the most interesting part. For example, to verify a
result, all we need to check, given the quotient and the remainder, is that
dividend = quotient * divisor + remainder, that remainder < divisor, and that
the multiplication does not overflow; none of these checks requires a division
operation.

Still, I added several tests: dividend < divisor, remainders close to zero,
and a lot of random inputs, just to make sure each new approach I add is
correct.
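That invariant is cheap to check mechanically; a sketch of such a
division-free oracle (the function name is mine, not from the article's test
suite):

```rust
// Verify a candidate (quotient, remainder) for dividend / divisor without
// performing any division. Checked arithmetic ensures that an overflow in
// quotient * divisor + remainder cannot mask a wrong answer.
fn div_result_ok(dividend: u128, divisor: u128, quotient: u128, remainder: u128) -> bool {
    remainder < divisor
        && quotient
            .checked_mul(divisor)
            .and_then(|p| p.checked_add(remainder))
            == Some(dividend)
}
```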

~~~
specialist
Nicely done.

FWIW, I dimly recall a numeric library, maybe a float to string, which tested
_everything_ for verification. Took a few days to run.

Then maybe use the spot checks to test for regressions. Weird compiler,
toolchain, processor combos. That sort of thing.

~~~
lifthrasiir
> FWIW, I dimly recall a numeric library, maybe a float to string, which
> tested everything for verification. Took a few days to run.

You can definitely test against all ~2^31 IEEE 754 binary32 values to make
sure that float-to-decimal conversion is correct; that's what I've done with
the Rust standard library (it took 2 hours per test). I believe testing all
~2^63 binary64 values is also feasible by now, but only with dedicated
hardware. For that reason I believe the library had only been tested with
binary32.
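A sketch of what one case of such an exhaustive binary32 check can look like,
using the standard round-trip property (shortest decimal output must parse
back to the same value) rather than any particular library's internals:

```rust
// Round-trip check for a single binary32 bit pattern: formatting the float
// and parsing the string back must recover exactly the same value.
fn roundtrips(bits: u32) -> bool {
    let x = f32::from_bits(bits);
    if x.is_nan() {
        return true; // NaN never compares equal to itself; skip it
    }
    x.to_string().parse::<f32>() == Ok(x)
}
// Exhaustive run, hours of CPU time: (0..=u32::MAX).all(roundtrips)
```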

~~~
saagarjha
What kind of hardware can crank through the 2^64 space in reasonable time?

~~~
grandmczeb
It’s actually surprisingly feasible. 1,000 cores running at 3GHz for a week do
~2^64 cycles.

~~~
bloak
Don't you mean 10,000 cores?

~~~
grandmczeb
Whoops, yes! Too late to edit.
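For what it's worth, the corrected figure does check out; a quick sanity
computation (mine, not from the thread):

```rust
// Total cycles executed by `cores` cores running at `hz` Hz for one week.
fn cycles_per_week(cores: f64, hz: f64) -> f64 {
    cores * hz * 7.0 * 24.0 * 3600.0
}
```

10,000 cores at 3 GHz give about 1.8e19 cycles, within a few percent of
2^64 (~1.84e19); 1,000 cores fall short by a factor of ten.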

