Optimizing a bignum library for fun (austinhenley.com)



I've had to implement BigNum on an embedded system that had very little RAM and where the initial optimized version of ModExp took many minutes to complete. After much hair-pulling, the final version took 40 seconds.

First, you should work with unsigned numbers, and use a power of 2 as your word size. The fastest choice for a word size is very operation- and CPU-dependent.

A key trick is to lay out the words of your bignums in the same order as the machine's endianness, e.g. least-significant word first on a little-endian system. This will allow you to choose your word size dynamically for each operation: in memory, a number with M words of N bits each is identical to a number with M / 2 words of N * 2 bits each.
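For example (my own illustration, not from the post): on a little-endian machine the very same bytes can be read as four 32-bit limbs or two 64-bit limbs with no per-word shuffling.

    /* Hypothetical example: the same bytes viewed as 4x32-bit limbs or
       2x64-bit limbs, least-significant limb first, on a little-endian CPU. */
    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        uint32_t limbs32[4] = { 0x89ABCDEFu, 0x01234567u, 0x00000000u, 0x00000001u };
        uint64_t limbs64[2];
        memcpy(limbs64, limbs32, sizeof limbs32);   /* pure reinterpretation, no per-word swap */
        printf("%016" PRIx64 " %016" PRIx64 "\n", limbs64[1], limbs64[0]);
        /* prints: 0000000100000000 0123456789abcdef */
        return 0;
    }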

For multiplication, identify the CPU instruction with the widest result, then use half that size as your word size. Each step through the arrays generates a word result in the low half and a word carry in the top half. The carry gets added to the result of the next step, possibly overflowing.
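As a sketch of that inner loop (my own code, not the article's, assuming 32-bit limbs and a 32x32->64 multiply):

    /* Schoolbook multiplication: r[] must have room for na+nb limbs and
       start zeroed; limbs are least-significant first. */
    #include <stdint.h>
    #include <stddef.h>

    static void mul_limbs(uint32_t *r, const uint32_t *a, size_t na,
                          const uint32_t *b, size_t nb) {
        for (size_t i = 0; i < na; i++) {
            uint64_t carry = 0;
            for (size_t j = 0; j < nb; j++) {
                /* 32x32->64 product: low half is the word result,
                   high half is the carry into the next column */
                uint64_t t = (uint64_t)a[i] * b[j] + r[i + j] + carry;
                r[i + j] = (uint32_t)t;
                carry = t >> 32;
            }
            r[i + nb] = (uint32_t)carry;
        }
    }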

For addition, use the widest result as your word size. This can also overflow.

How you deal with overflows is very CPU-dependent. You can use adc/addc as someone else mentioned, which will be faster on embedded and may be faster on fatter chips. Alternatively, you can halve the word size and use the top half as the carry.

If addc is not available, you can test for overflows as follows:

    uint32_t a = ..., b = ...;
    uint32_t res = a + b;
    uint32_t carry = res < a;
On overflow, res wraps around to a + b - 2^32, which is necessarily less than both a and b (since each is below 2^32), so there is no need to check b.

If SIMD instructions are available, that will almost always be the fastest choice by far. While it doesn't change the above guidelines in principle, there are often e.g. optimized overflow mechanisms.


    """
    uint32_t a = ..., b = ...;
    uint32_t res = a + b;
    uint32_t carry = res < a;
    """
But note well: this trick doesn't work when there is also a carry-in. That is,

    uint32_t res = a + b + carry;
    carry = res < a;
won't work when b = 0xFFFFFFFF and carry = 1. You have to check for carries twice, or set `carry = res < a || (res == a && carry);`, or similar. So if you are implementing bignums on an architecture that doesn't have ADC (such as RISC-V), or on a vector architecture where the performance would be bad, you often need an alternative strategy, such as using fewer than 32/64 bits per limb (whether that's half-size, or 30 bits, or 51 bits, or whatever).
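As a rough sketch of that reduced-radix idea (mine, assuming 30-bit limbs stored in 32-bit words): the per-limb sum can never wrap the word, so the carry is simply the top bits and no flags or comparisons are needed.

    /* 30-bit limbs stored least-significant first in uint32_t words. */
    #include <stdint.h>
    #include <stddef.h>

    #define LIMB_BITS 30
    #define LIMB_MASK ((UINT32_C(1) << LIMB_BITS) - 1)

    /* r = a + b over n limbs; returns the final carry (0 or 1). */
    static uint32_t add_n(uint32_t *r, const uint32_t *a,
                          const uint32_t *b, size_t n) {
        uint32_t carry = 0;
        for (size_t i = 0; i < n; i++) {
            uint32_t t = a[i] + b[i] + carry;   /* at most 2^31, cannot wrap */
            r[i]  = t & LIMB_MASK;
            carry = t >> LIMB_BITS;
        }
        return carry;
    }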


Yep. Those are two additions, and naturally every full-word addition needs a separate overflow check. You can avoid the short-circuiting branches by just adding the carries:

   uint32_t res1 = a + carry;
   carry = res1 < a;
   uint32_t res2 = b + res1;
   carry += res2 < b;
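Stretched into a full limb loop, that looks roughly like this (my sketch, not anyone's library code):

    /* r = a + b over n full 32-bit limbs, least-significant first;
       returns the carry out of the top limb. */
    #include <stdint.h>
    #include <stddef.h>

    static uint32_t add_limbs(uint32_t *r, const uint32_t *a,
                              const uint32_t *b, size_t n) {
        uint32_t carry = 0;
        for (size_t i = 0; i < n; i++) {
            uint32_t res1 = a[i] + carry;
            carry  = res1 < a[i];        /* carry out of the first add */
            uint32_t res2 = b[i] + res1;
            carry += res2 < b[i];        /* at most one of the two carries is set */
            r[i] = res2;
        }
        return carry;
    }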


Just curious, what kind of embedded system were you working on that required arbitrary precision numbers, and where it was acceptable for the system to stall for 40 seconds during a calculation?


The ModExp was used for cryptographic authentication, and only ran once per session or once when pairing. IIRC it was a battery-powered BLE device.


Coincidentally I was just writing a bignum library from scratch two weeks ago.

A few interesting things I learned:

1. Knuth volume 2 has an extensive discussion of the problem space. I've only gotten a chance to skim it so far, but it looks interesting and approachable.

2. I need to support bitwise operations, which operate on a two's complement representation, so I figured it would be simpler to use two's complement internally, despite seeing that most (all?) bignum libraries use signed magnitude. I'm starting to regret this: two's complement introduces a lot of complexity.
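For contrast, the sign-magnitude layout that most libraries use is roughly this (hypothetical field names, just to illustrate):

    /* Sign-magnitude bignum: the limbs hold the absolute value,
       least-significant first, and the sign is a separate field. */
    #include <stdint.h>
    #include <stddef.h>

    typedef struct {
        uint32_t *limbs;   /* absolute value */
        size_t    nlimbs;  /* limbs in use */
        int       sign;    /* -1, 0, or +1 */
    } bignum;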

The most fun thing about working on a bignum library is that it makes the algorithms you learned in grade school for add/subtract/multiply/divide relevant again. The basic ("classical") bignum algorithms are essentially the same ones you learned in grade school, just in a much larger base than 10.


Knuth has a "hugenum" scheme, TCALC, which is based on trees rather than byte arrays. There are many ways to do it, depending on which type of numbers you need most - for TCALC, it's numbers close to towers of exponents.

Of course, every big-number encoding that beats an array of bytes for some numbers also has to lose to byte arrays for others... But it's possible to bound this inefficiency so that it's never more than twice as bad as byte arrays, while being potentially exponentially more efficient for the favoured numbers.


Twice as bad being a timeout and 2nd call?


I don't quite get what you mean. Computation time is also bounded with the size of the tree.


For maximum efficiency, you should work in binary instead of base 10. Handling carries becomes more straightforward with the right primitives, for example __builtin_addc with GCC: https://gcc.gnu.org/onlinedocs/gcc/Integer-Overflow-Builtins...
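For example, a single limb add-with-carry using the overflow builtins from that page could look like this (a sketch; it uses __builtin_add_overflow, which both GCC and Clang provide):

    #include <stdbool.h>
    #include <stdint.h>

    /* One limb of a bignum addition: returns a + b + carry_in,
       stores the outgoing carry (0 or 1) in *carry_out. */
    static uint32_t add_limb(uint32_t a, uint32_t b,
                             uint32_t carry_in, uint32_t *carry_out) {
        uint32_t t, r;
        bool c1 = __builtin_add_overflow(a, b, &t);
        bool c2 = __builtin_add_overflow(t, carry_in, &r);
        *carry_out = (uint32_t)(c1 | c2);   /* at most one of the two can be set */
        return r;
    }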

You can also implement it in C if you want a more portable solution: https://github.com/983/bigint/blob/ee0834c65a27d18fa628e6c52...

If you scroll around, you can also find my implementations for multiplication and such.


For maximum efficiency you'd avoid addc like hell, because it blocks internal precomputation; instead use guarantees like he did to avoid overflow. Better to use int128 types, though.

I'd also just use stack objects for small, typical sizes. And especially do formal proofs rather than that joke of random testing.


> For maximum efficiency you'd avoid addc like hell, because it blocks internal precomputation, and use guarantees like he did, avoiding overflow better.

Whether that is the case depends on the CPU. I hit memory bandwidth limits before adc becomes a problem. But I concede that you have a valid point and that there are definitely cases where leaving sufficient space for carries is the better strategy.

Anyway, we can probably both agree that "% 1000000000" is not the best way to do it.


addc makes all your adds serialize on the flags register, which is really painful.

A more modern approach is to reserve some bits of every word for carries, but that drastically complicates things.


Is this really true? I would intuitively expect that register renaming would apply to eflags too, so that reads from flags don't truly need to be serialized despite nominally writing a bunch of things to the same register.

EDIT: this paper (linked in another comment) seems to indicate that this is possible:

> An out-of-order machine can look ahead and process the accumulation pass in parallel with the partial sum pass using a renamed eFlags register.

https://web.archive.org/web/20150131061304/http://www.intel....


EFLAGS is actually put in the same renaming register as the result, so you get renaming for free.

The renaming registers in the Intel Pentium Pro and Pentium II are actually over 80 bits wide. They need to hold a full 80-bit float or a 64-bit MMX result. The Pentium III extends this to 128-bit-wide renaming registers to support SSE.

This is despite the fact that the P6 architecture only had 32-bit integer registers until the Core 2 in 2006. So there is plenty of room to store EFLAGS in the same renaming register as the result. This also means that the branch uops point to the result of the most recent flag-modifying instruction.

It was only with Sandy Bridge (and the introduction of AVX) that the P6 line switched to a PRF design, with separate registers for floats and integers. Of course, Netburst also had a PRF design.



I remembered, and ultimately found, a source for a workaround for the serialising-on-flags problem: an Intel paper at https://web.archive.org/web/20150131061304/http://www.intel.... It amounts to new instructions with better behaviour for ILP.


Funny, I did this myself 10 years ago. Shit, it's been that long..?

For my undergrad project I wrote a computer algebra system to do symbolic integration. The supervisor was a hardcore, old school C guy, so naturally I was going to just use C and no libraries. He told me I'd need bignums first, so I got to work (this is because many algorithms like polynomial GCD create massive numbers as they go, even if the inputs and final outputs are generally very small).

I just couldn't figure out how to do better than the largest power of 10 per digit; working with non-base-10 arithmetic was a mind fuck for me at the time. So I did it with base-10^9 digits and the classical algorithms from Knuth. Division is the hardest!

At some point I discovered the GNU multiple precision library (GMP) and made my program work with that instead of mine. I was shocked at how much faster GMP was! I finished my project with my own code, but I knew I had to come back to do it better.

The breakthrough came when I got a copy of Hacker's Delight. It has stuff like how to detect overflow after it's happened (in C). Something twigged and then I just understood how to fill each word completely rather than use a power of 10. I don't know what confused me before.

But, of course, the real way to do it is to use assembly. You can't get close to high performance in C alone. In assembly you get the overflow bit. It's actually easier in a way! So you write tiny platform-specific bits for the digits and build on that in C. My add and subtract were then as fast as GMP's. I lost interest when it came to implementing faster multiplication algorithms.

Code in case anyone is interested: https://github.com/georgek/bignums


Modulo is surprisingly expensive even when you combine it with a quotient. It is almost always better to use binary "limbs", in this case 31 or 32 bits wide, because decimal parsing and printing should be much rarer than individual operations in general.


This is a bit misleading. Quotient and modulo with constant right-hand side is readily optimised into faster operations, like multiplication and bit shifts. "All" optimising compilers do this and don't emit a true divide instruction in such cases. See for instance: https://godbolt.org/z/vxnTPvY6M

Quotient and modulo with a variable RHS is expensive, but that isn't necessary for the algorithm shown in the article. Although I agree that binary wins over decimal anyway; we have been using binary computers for the last 50 years or so, after all.


> Quotient and modulo with a variable RHS is expensive

For dividing ‘big’ bignums by ‘small’ divisors, I would use/tweak libdivide (https://libdivide.com/), and spend the time to optimize the division.

For dividing ‘big’ bignums by ‘big’ divisors, that may not be worth it.

(Determining the correct thresholds for 'big' and 'small' is left as an exercise for the reader; they will depend on the hardware at hand.)
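For the 'big divided by small' case, plain schoolbook short division is roughly this (my sketch; libdivide would replace the per-limb hardware divide with a precomputed multiplicative inverse):

    #include <stdint.h>
    #include <stddef.h>

    /* q = a / d, returns a % d; a has n 32-bit limbs, least-significant first. */
    static uint32_t div_small(uint32_t *q, const uint32_t *a, size_t n, uint32_t d) {
        uint64_t rem = 0;
        for (size_t i = n; i-- > 0; ) {             /* walk from the top limb down */
            uint64_t cur = (rem << 32) | a[i];
            q[i] = (uint32_t)(cur / d);             /* one 64/32 divide per limb */
            rem  = cur % d;
        }
        return (uint32_t)rem;
    }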


The oldish version of gcc used by highload.fun seems to stop performing this optimization after the 5th or so constant divisor within a small function body. Always good to double check :(


The reason a division by 1000000 can be efficiently turned into a multiplication is because you're targeting a system that has a 32x32->64 multiplication instruction, which has plenty of room to spare. But if you have that, you would be better off using e.g. 2^32 as your modulo.


I agree that 2^32 is better, but the reason is not that division is too expensive, because the OP's algorithm, after standard optimisation, does not need any division.

Clearly, choosing a base type that is fast on the machine is key. But I feel even a 64/32 divide with a constant operand will be optimised into something reasonably fast on most systems. If you have a counterexample it would be interesting to hear.

A system without fast divide likely doesn't have any floats either. So all arithmetic would be fixed point and 32x32->64 multiplication would be a pretty common operation.


I just meant that the compiler in your example is computing x / 1000000 as follows:

   *q = (x * 1125899907) >> 50;
This works out OK because there's a 32x32->64 instruction, arm64's umull in this case. If it didn't have one, and it had a 32/32->32 division instruction, it might have emitted that instead of hand-rolling the same scheme using 32x32->32 multiplications.

Note that it's technically an arithmetic approximation that just happens to work for all possible values of x, since obviously 1<<50 is only divisible by 1<<(0 to 50).


I see. However, 32x32->64 multiply has been common on 32-bit processors since the mid 80's. It would be interesting to know if somebody is still building new products on (32-bit) processors that lack it.


For a small survey of practical efficient methods for bignum arithmetic operations, the Algorithms section of the documentation of GMP [1] is excellent.

[1]: https://gmplib.org/manual/Algorithms


I think that your description is almost excellent, but that it's fundamentally misleading to describe what you are doing as a "30-bit" digit.

Mathematically it's a base-10^9 digit, occupying 30 bits of storage. You do briefly mention that it's 10^9, but repeatedly say 30 bits.


It isn't misleading at all.

A hexadecimal digit has 4 bits of entropy. You can guess it correctly by chance one in sixteen times. Calling that a four-bit digit is correct. Same with a 30 bit digit, all that changes is the magnitude.

The "storage bits" aren't a detail, they're an essential property. Stored in balanced ternary it would still have 30 bits of entropy.


I think you misunderstood the parent. If you speak of a number in the range 0-15 stored in 4 bits, we can all agree that "4 bit digit" is the appropriate term for it. But what about a number also stored in 4 bits but restricted to the range 0-13? It's a digit that fits in 4 bits, but calling it a "4 bit digit" without further qualification would omit relevant information.


That hadn't occurred to me, I had thought (having read the Fine Article, but not closely, nor really more than glanced at the code) that it was using exactly 30 bits and reserving two for, IDK, carry detection, and that 10^9 was just approximating 1073741824. We do that every time we call 1024 a KB and not a KiB, so there's precedent.

Ok. Fair. It's a ~29.9-bit digit (log2 of 10^9 is 9 * log2 10, or about 29.9) taking up 32 bits of storage. It turns out I was misled!


Fascinating article, I have always wondered how these big number libraries work.

As a side question, does anyone know what program the author used when making the "addition" and "multiplication" performance graphs? Thanks.


If you're interested in even more details, you could also have a look at Tom's book about bignum arithmetic; cf. the third paragraph of https://github.com/libtom/libtommath#summary

The library has evolved since then, so the algorithm documentation only exists in code nowadays.

FTR I'm involved in the project as maintainer.


I don't know for sure, but you can do that kind of thing pretty easily with Matplotlib in Python. Or in R base graphics, with more effort to get it looking pretty.


LOL, I've been there: https://www.kylheku.com/cgit/txr/commit/mpi?id=98dedd310b1d5...

(Patch was originally from 2011; it was bugfixed once, and then in 2015 converted to git commit:

https://www.kylheku.com/cgit/txr/commit/mpi-patches/faster-s... )

Another one: faster highest-bit search:

https://www.kylheku.com/cgit/txr/tree/mpi-patches/bit-search...

That does use GCC built-ins today: https://www.kylheku.com/cgit/txr/commit/?id=15b7c542dc44899e...


I am so surprised that there's no exploration of Karatsuba's algorithm. That's what makes the Python implementation perform.

I actually came here hoping to find discussion on Karatsuba. https://en.m.wikipedia.org/wiki/Karatsuba_algorithm


Please see the section titled "Faster multiplication"


Very useful post! Also it's cool to see how many people in this thread have worked on this problem -- lots of new info here I haven't seen.

I wonder if anyone is interested in implementing big numbers in Oils? It's a Unix shell with TWO complete implementations - the "executable spec" in Python, and an automatic translation to pure C++ (which is 30x-50x faster).

We currently use 64-bit integers in C++, but bignums are better semantics. Some trivia about bad shell semantics here:

Integers - Don't do whatever Python or C++ does - https://www.oilshell.org/blog/2024/03/release-0.21.0.html#in...

This is a very self-contained project: the interface is defined by 200 lines of Python:

https://github.com/oilshell/oil/blob/master/mycpp/mops.py

and the trivial 64-bit overflowing implementation is also about 200 lines:

https://github.com/oilshell/oil/blob/master/mycpp/gc_mops.h

https://github.com/oilshell/oil/blob/master/mycpp/gc_mops.cc

(We have a fast Ninja-based build system, so you can probably iterate on this in 100 milliseconds or less -- it should be fun for the right person)

---

I think the main reason it is specific to Oils is that the bigger numbers should become GC objects. Details on our GC here:

Pictures of a Working Garbage Collector - https://www.oilshell.org/blog/2023/01/garbage-collector.html

It's been very solid for the last 18 months, basically because it's well tested by ASAN and #ifdef testing modes.

The main thing I'd be concerned with is how to TEST that big number operations are correct. I think there are probably some interesting strategies, which I'd love to discuss.

You're of course welcome to adapt existing open source code, including code you've already written -- I probably even prefer that, i.e. something that has had some real world testing. We want all the operations in that file, and it should be integrated with our GC.

---

We've had ~6 contributors funded by grants from https://nlnet.nl for the past couple years, so you can even be paid (there's a chance it depends on the country you live in, but generally a wide range of situations is OK).

Contact me at andy at oilshell.org or https://oilshell.zulipchat.com/ if interested!


It doesn't look especially specific to me - slightly hack the structure in https://github.com/libtom/libtommath to make sense to your garbage collector and you're pretty much done. That somewhat punts on the testing question but I doubt add/mul/bitops from that library are buggy.


Hm I'm not familiar with that library (or really any others in the landscape). Why pick that one?

Why do you doubt it's buggy -- is it used in a major app? I can't really tell from this page:

https://www.libtom.net/LibTomMath/

I'd be interested in a review / survey of different options, and the strategy for testing. Basically the tests should "argue" that the code is correct, and documentation / blog posts also count!

(I'm honestly not sure how hard it is to make a correct big num library. But I think the original article in this series said there were probably bugs, so I'm guessing it's not trivial)

To give a flavor, we publish test results and coverage numbers like this [1], and any new library should be consistent with this:

https://www.oilshell.org/release/0.22.0/quality.html

Our GC runtime has less than 5000 lines of hand-written C++ now, so any solution should be in proportion to that. That is, we probably wouldn't take on 3000 lines of code just for big nums.

So we favor correctness over performance. Hopefully there would be little perf loss in the small int case, but it's OK if the big int case is slow. It's the rarer case for sure.

It's a shell, not a Fortran competitor!

---

As far as the grant, we pay in 2500 euro or 5000 euro increments, and I don't try to parse "effort spent" any more than that. I care about the result, not if it's easy for a particular person :)

For example earlier this year we had a contributor Justin who worked on the pretty printer, with Wadler's algorithm (which I was unfamiliar with).

https://www.oilshell.org/blog/2024/06/release-0.22.0.html

And he had co-authored a paper on pretty printing, so it was probably not too difficult overall. But that's a good thing as far as I'm concerned!

So if you or anyone else thinks it's easy, please feel free to contact me / play around with the code, etc.

https://github.com/oilshell/oil/wiki/Where-Contributors-Have...

[1] our unit test coverage is reported at 83% or so, but we favor "exterior" tests, so it's really close to 95% or 99%. Also of course line coverage isn't really a way to measure correctness per se.


I believe it's used by a bunch of places that don't really bother with attribution, what with it being marked public domain. It's notable for being extremely well documented and for the (initial at least) emphasis on simplicity.

It's probably overkill for your domain though and complexity does tend to bring bugs. A reasonable approach would probably be to implement the simple array-of-wordsize/2 strategy and regression test against that lib. Or actually the GNU one is reasonable for testing purposes, though beware that GNU's one kills the process on out of memory.

I need to write a bigint implementation for a lisp/ml compiler anyway and am probably up for making that compatible with your GC layout. Oils built and ran locally a moment ago. I'm more likely to write that for free than for money though, will send you an email anyway.


Ah OK, yeah I wasn't aware of libtom, but it seems cool.

I definitely like the strategy of testing against multiple implementations -- we do a lot of that! Shell is really good for that too.

Thank you for the e-mail, I'll reply more there!!


The title is a little bit dishonest, because they are optimizing their own bignum library.


Speaking of bignum libraries, I recently watched a talk with Rob Pike where he mentioned that one thing he regretted about Go was not making the default integer implementation arbitrary precision. Supposedly the performance overhead for normal numbers is very small, and you avoid the weirdness and complicated semantics of fixed precision integers. I found that to be quite fascinating, especially coming from a "low-level guy" like Rob Pike. Ever since I've been wanting a language with that feature and to understand how bignum implementations work.


It's a pretty common feature on high-level languages. Python, Ruby, Lisp(s), Haskell, etc, all use arbitrary-precision integers by default. Even JavaScript has integers now (since 2020), not as the default number type, but with the `n` suffix, like `42n`.


That's cool. I didn't realize that those languages used arbitrary-precision integers by default. I know that many language offer a bigint but to me the difference between having bigints and them being the default seems significant. For instance, in JavaScript the `n` notation and them being called `bigint` (and not just `int`) is going to mean that they will get used very rarely.


Yeah, I absolutely agree. The saddest cases for me are languages like Java, which comes with working arbitrary-precision integers in its standard library, but using them feels so second-class compared to the "primitive" types like `int`. You need to import the class; initializing a value looks like `var a = new BigInteger("39")`; they don't support normal math operators, so you need to do `a.add(new BigInteger("3"))` just for a simple addition; and so on.

It's difficult to argue that arbitrary-precision integers are "simpler" (because they don't have an overflow edge-case on almost every operation) when they are so much more inconvenient to use.


Common Lisp has been like this forever. It makes testing the language much easier, as you don't have to check for overflows when constructing test code.

A dynamic language like CL has the added requirement that the integers fit with everything else under a single top type, which in practice limits fixnums to a few bits less than the word size.


CL also lets you "switch off" multiple precision if you know you don't need it. You can start with bignums everywhere then declare certain functions to use fixnums after you profile your program to speed it up. Compilers like SBCL can then produce code pretty much as fast as C.


You can write a macro to duplicate a piece of code with different type declarations, and switch between them at entry depending on the types of various values. Code blowup, yes, but you can optimize the common cases without manually duplicating code.


I'm "low-level guy, like Pike", yet I wouldn't have integers any other way in an application language.

Nobody uses Go for ethernet card drivers. Dealing with machine integers is just mental overhead for Go programmers, and a source of bugs.


> I've been wanting a language with that feature

GNU MP's C++ interface makes this mostly transparent, expressions like "x = y * z" just work due to operator overloading. The only time you need to know that the underlying class is "mpz_class" instead of "int" is at input/output.


For what it's worth, Haskell's Integer type is that way.

edit To be clear, in Haskell you 'default' to using its arbitrary-precision type, unlike say Java where they're available in the standard library but not in the core primitive types.


> I've been wanting a language with that feature

Python, whose CPython implementation is mentioned in the article, has arbitrary-precision integers.


Smalltalk is another language that has had this feature since forever.

Tagged pointers / SmallInteger encodings make the memory overhead zero to negligible, and with the CPU / memory-bandwidth gap being what it is, the few extra CPU operations rarely matter.
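A minimal sketch of one such tag scheme (hypothetical, one of several conventions): the low bit distinguishes an inline small integer from a pointer to a heap-allocated bignum.

    #include <stdint.h>
    #include <stdbool.h>

    typedef uintptr_t value;   /* one machine word per value */

    /* Low bit 1 => inline small integer; low bit 0 => pointer to a heap object. */
    static bool     is_small(value v)      { return (v & 1) != 0; }
    static value    make_small(intptr_t n) { return ((uintptr_t)n << 1) | 1; }
    static intptr_t small_value(value v)   { return (intptr_t)v >> 1; }  /* assumes arithmetic shift */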

Daniel Bernstein made the argument that integers should default to these semantics for security reasons:

https://css.csail.mit.edu/6.858/2018/readings/qmail.pdf


Zig isn't quite arbitrary precision, but it supports integer widths up to u65536. I've found this quite handy.



Eh, I disagree. It's not massive overhead, but it turns literally every integer operation into a branch. Sure, the branch is going to get ~100% prediction accuracy if you aren't using bigints, but that's still a pretty big cost, especially since the inserted branches make all sorts of compiler optimizations illegal.


Do you remember where you saw that talk? I'd like to watch it.

Thanks.


Yes, the talk is called "What We Got Right, What We Got Wrong" from GopherConAU 2023. It's available on YouTube: https://www.youtube.com/watch?v=yE5Tpp2BSGw


not + -


Except he's not doing it for fun at all. He's doing it for clout and publicity on HN.


Looks like play to me. The earlier post doesn't have a HN reference and there's not much correlation with their publications list. People legitimately do write things like this as entertainment.



