Base58 isn't a great choice from an implementation point of view. It basically r...

eesmith · on Nov 8, 2018

The implementations at https://github.com/search?q=Base58 don't look like they use a bignum package, nor do that look particularly slow or complicated. The main encoding loop for https://github.com/luke-jr/libbase58/blob/master/base58.c is:

    for (i = zcount, high = size - 1; i < binsz; ++i, high = j)
	{
		for (carry = bin[i], j = size - 1; (j > high) || carry; --j)
		{
			carry += 256 * buf[j];
			buf[j] = carry % 58;
			carry /= 58;
		}
         }

ur-whale · on Nov 8, 2018

There's a divmod by 58 in there. That's dog slow compared to the shifts a power of two basis would require - even when the compiler gets smart about it because it's a constant.

SideQuark · on Nov 8, 2018

That divmod can be replaced with table walking like CRC code does for division and remainder of fixed values, which is certainly fast, and does not require bignums. Unless you now want to claim all CRC code is also slow as hell or requires bignums. These are solved problems.

Many compilers even expand the code inline to generate fix size divides into mult and shift.

Here's a 4 year old Hacker news thread on compiler tricks then [1]. They're even better now.

[1] https://news.ycombinator.com/item?id=8368639

a1369209993 · on Nov 9, 2018

> Many compilers even expand the code inline to generate fix size divides into mult and shift.

Specificly, u8 divmod 58 can be reduced to a u8->u16 multiply, a right shift, and three conditional subtractions; that's not great, but on a modern CPU it's a afterthought compared to the quadratic loop over the input size.

ur-whale · on Nov 9, 2018

>Many compilers even expand the code inline to generate fix size divides into mult and shift.

This is what I meant when I said:

>even when the compiler gets smart about it because it's a constant.

SideQuark · on Nov 9, 2018

You also claimed it's dog slow, and earlier you claimed it requires a big num lib. All of these are wrong. Many compiler/architecture combinations can reduce the divmod to one asm instruction, that on some architectures is very quick. If it is slower than shift/mult trick, that compiler can choose to do that.

Go look at some decent implementations and the assembly they generate. Don't spread unfounded or unchecked claims.

nkurz · on Nov 9, 2018

It's OK for different people have different impressions about the speed of dogs. On x64 for 64-bit registers, built-in assembly DIV is about 10 times slower than MUL, and 10 times faster than a lookup from RAM. Whether this makes it fast or slow depends on what you are trying to do.

I'm interested in your comment that "some architectures" have a fast divmod. Do you actually know of one where it's faster than a shift and a multiplication? Or the same? We're finishing up a paper where this would be very relevant information.

SideQuark · on Nov 9, 2018

Intel flavors often do both with a single idiv instruction. Agner Fog has performance tables for many variants [1]. I’d guess a few pipeline to similar per loop cost of shift and add.

I suppose if you’re writing a paper you’re aware of quite a bit of literature on exactly this problem. Recent papers have quite fast methods to do this. I’ve not looked at recent state of the art to see if 58 has a near zero cost divmod, but numbers of many forms do. I’d not be surprised if state of the art has it reduced to a few non-mem access branchless no division instructions on most architectures.

Maybe later I’ll poke at 58 and see what I can design. I’ve made quite a few such algorithms over the years.

[1] https://www.agner.org/optimize/instruction_tables.pdf

nkurz · on Nov 9, 2018

I feel comfortable with the x64 approaches, but am much less familiar with the efficiency of other architectures. The paper also benchmarks ARM (which is relatively faster) and Power 8 (which I don't understand well). I'd be particularly interested in knowing if any other architectures are much faster for division/modulus (which would weaken the paper) or much slower (which would strengthen it).

I'm might be exposing my ignorance, but what's 58 in this context? Is this an ARM Cortex series, or something else?

BeeOnRope · on Nov 10, 2018

FWIW I don't think other architectures are significantly faster than x86 for the div or divmod instruction. The GP said that "it can be reduced to a single assembly instruction" which is certainly true, but this one instruction is dog slow and produces 30+ uops[1]. Which is exactly why compilers choose to flip fixed divides to a multiplication and a series of corrections that often add up to ~10 instructions but even then it's still worth it.

Divide is a hard problem and I'm not aware of any breakthroughs. It's been getting faster, but it's still 10 slower (at least) than multiplication as you point out. Intel has thrown hardware at it a few generations back, speeding it up (it used to be over 100 cycles), and AMD has a fairly fast divider in Ryzen as well (for many inputs faster than Intel).

I expect POWER to be in the same range. ARM architectures are all over the power/performance map, obviously, but for the biggest/fastest, I would guess they are within a factor of 2, either way, of contemporary x86 implementations.

---

[1] This point is important: it means you can't get as much work done while the division is happening. There are other slow instructions, such as floating point div, sqrt, transcendental and trig functions. However, these functions are only a single uop and then occupy only their respective execution unit for the duration, so you can use the rest of the EUs at full speed. So if you only need to do one every 50 cycles or so, and have work to do in the meantime, they can be also free. idiv is not like that: it spews out many uops which compete for the same EUs as the rest of your code.

SideQuark · on Nov 12, 2018

IDIV on Skylake-X is 10 uops [1,p246], not 30+. Some other Intel chips are even lower (Goldmont is 3 uops, for example, but slower overall, Piledriver is 2 uops for r32, etc.). IMUL is often 2-3 uops on the same chips.

However, uops are not the bottleneck, pipelined throughput is what matters for performance. Multiple IDIV can be in flight at once on different register sets (real or virtual). Register renaming gives even more room.

A compiler can unroll the loop, use different registers (or the CPU can use renaming), and pipeline them, getting better throughput. r16 IMUL has a relative throughput of 2, r16 IDIV then has a relative throughput of 6. This is 3 times slower, not 10. Pretty much any other 10 instructions are going to be slower, not faster as you claim.

Compared to the overhead of looping and memory access, these are not the bottleneck. These represent a small amount of the time to do the computation.

Do some timing and check. IDIV is not what it once was.

[1] https://www.agner.org/optimize/instruction_tables.pdf

BeeOnRope · on Nov 13, 2018

To be clear, I'm generally talking about machine-width operations, in this case 64-bit output, not smaller operations like 8 or 16 bits.

For Skylake, we have 36 uops and 35-88 cycles of latency for a 64-bit div, and for idiv it is even worse: 57 uops and 42 to 95 cycles of latency. These can execute one every 20 to 80 cycles, so the throughput is fully twenty to eighty times worse than 64-bit multiplication.

> IMUL is often 2-3 uops on the same chips.

On Intel chips multiplication is almost always 1 uop if you need a 64-bit result (this is what a multiply in C will compile to), and 2 uops if you need the full 128 result. In in either case it can sustain one every cycle.

> However, uops are not the bottleneck, pipelined throughput is what matters for performance. Multiple IDIV can be in flight at once on different register sets (real or virtual). Register renaming gives even more room.

How do you know uops are not the bottleneck? They often are: it depends on the surrounding code.

More importantly, div is slow in every way that could matter: if latency is the bottleneck, div sucks (20 to 80 cycles, vs 3 for multiply). If uop throughput is the bottleneck, div sucks. If pure div throughput is the bottleneck (as you are suggesting), div still sucks: the inverse throughput for 64-bit div is 21 to 83 cycles: almost the same as the latency, so it is barely pipelined at all, and fully 21 to 83 times slower than multiplication.

Now one might say that you are mostly interested in 32-bit values, as these are common even in 64-bit code, and Intel has good support for them.

In this case div is quite a bit faster, but still dramatically slower than multiplication. In latency terms it is 26 cycles on Skylake, versus 3 for multiplication, so about 9 times slower. In "pipelined throughput" terms, it is 6 cycles per division, versus 1 per multiplication, so 6 times slower.

As a summary, we can say that on the most recent Intel, which probably have the fastest dividers around (and which have seen recent improvements), 64-bit division is anywhere from 11 to 90 times slower than multiplication, depending on how you measure it (and the input values), and 32-bit division is between 6 and 9 times slower, depending on how you measure it. I think my "10 times slower" which actually falls towards the faster end of that range is actually quite conservative!

On chips older than Skylake (which is the vast majority of chips you'll still find in datacenters and in the cloud, since SKX is quite recent), the situation is worse for div, since mul had the same 1/3 latency/tput performance, but div was slower.

> Do some timing and check. IDIV is not what it once was.

Here, I agree.

It has gotten much faster! In other words, it has gone from shockingly slow to merely very slow. It is still way slower than multiplication in every respect (which itself has gotten faster to the point where it has 1-cycle throughput now), and almost all of the tricks to avoid actual div instructions still apply.

a1369209993 · on Nov 9, 2018

58 is the number we're trying to divide by. (It's a constant, so there's a bunch of clever optimizations we can do; I don't think anyone is claiming fully-general division is fast by any standards beyond "round trip to memory".)

nkurz · on Nov 10, 2018

I feel silly, but thank you for the polite explanation of what is now obvious.

eesmith · on Nov 8, 2018

Thank you for correcting my misunderstanding.

The Ascii85 implementation in Python is 1/70th the performance of the base64 version, and the command-line driver to the base58 implementation I linked to is incredibly slower than the system base64. Profiling shows that the time is indeed spent on the two div/mod lines.

I assume that these are not the fastest possible implementations, but it does show that high performance is not trivial.