
Show HN: Accelerating SHA256 by 100x in Golang on ARM - y4m4b4
https://blog.minio.io/accelerating-sha256-by-100x-in-golang-on-arm-1517225f5ff4#.jmo3osv3q
======
userbinator
This is the sort of thing that shows why the whole "pure RISC philosophy" of
implementing only the simplest instructions is ultimately a pretty dead-end in
processor design. CISC-style dedicated instructions and hardware which
utilises them will always be more efficient than a pure software
implementation using just "simple RISC instructions", because hardware can
easily express specialised computations like the extreme parallelism of
algorithms such as SHA while software implementations essentially rely on the
instruction-scheduling mechanisms to arrange potentially hundreds or even
thousands of instructions in an optimal order.

In addition, when there is dedicated hardware, it is also significantly easier
to add a SHA instruction than attempt to recognise sequences of many regular
instructions that could be "fused together" into a single operation using that
hardware. When there _isn't_, adding such an instruction that gets internally
expanded into multiple uops for the equivalent using the existing functional
units is still beneficial, since it leaves open room for an immediate
performance gain on all existing software when a future revision does add that
dedicated hardware, or optimises its microarchitecture to allow those
(possibly different) uops to execute faster.

Despite the name, ARM is definitely not very RISC anymore, and that's what has
kept it competitive.

~~~
ridiculous_fish
> CISC-style dedicated instructions and hardware which utilises them will
> always be more efficient than a pure software implementation using just
> "simple RISC instructions"

For a counterexample, see the x86 string instructions. "Hardware"
implementations like `repne scasb` are routinely outperformed by software
implementations using SSE2.

Another problem is that these instructions don't die. SHA will some day be
replaced, but the instruction will live on. The x86 BCD instructions
illustrate this.

From what I remember, RISC is less about "simple" instructions, and more about
regular instructions which can execute with predictable throughput, ideally
one per clock cycle. "Do this particular arithmetic" is in the RISC
philosophy, while looping instructions (rep, lswi, etc.) are not.

~~~
userbinator
_For a counterexample, see the x86 string instructions. "Hardware"
implementations like `repne scasb` are routinely outperformed by software
implementations using SSE2._

That's only because SCAS (and CMPS) has not (yet) received quite the same
amount of attention as MOVS and STOS. For a counterexample to your
counterexample, look up "enhanced REP MOVSB". REP MOVS/STOS can operate on
cacheline-sized blocks since at least the P6, when it was introduced as the
"fast strings" feature, and its performance has been steadily improved over
the processor generations.

_Another problem is that these instructions don't die. SHA will some day be
replaced, but the instruction will live on. The x86 BCD instructions
illustrate this._

Replaced for secure crypto, yes, but there are plenty of other applications
like (nonmalicious) data corruption detection where a reasonably fast yet far
more collision-resistant algorithm than regular CRC is very useful.

On the topic of BCD instructions, there's this:
[https://news.ycombinator.com/item?id=8477254](https://news.ycombinator.com/item?id=8477254)

------
mrb
_"Interestingly enough, there are actually comparable Intel SHA extensions to
the ARM equivalents. Linux 4.4 has added support for this but so far we have
not been able to identify any CPUs that will actually run this code."_

Intel Goldmont is the only microarch that implements the instructions. It was
"released" in April:
[http://www.extremetech.com/computing/226800-intels-new-low-c...](http://www.extremetech.com/computing/226800-intels-new-low-cost-apollo-lake-platform-skylake-graphics-new-goldmont-cpu)
But it is one of these annoying soft launches where the processors are not for
sale anywhere, the specs are incomplete, and even ark.intel.com doesn't know
about them. _sigh_

~~~
kalden_
It seemed to me that some Xeons had those since 2014. See
[https://software.intel.com/en-us/isa-extensions/intel-sha](https://software.intel.com/en-us/isa-extensions/intel-sha)

Or should that page be read differently?

~~~
mrb
Huh, you seem to be right. Only the multi-socket Haswell seems to have them
(E7-xxxxv3 and E5-26xxv3).

And now Intel reintroduces the SHA instructions in some low-power low-end
CPUs, but not for any desktop or single-socket server CPU? What a bizarre case
of feature fragmentation. Typical Intel.

------
wyldfire
... by moving from a software implementation to a mostly hardware one. ;)

Still a worthwhile article, but the title made me expect
implementation/algorithm changes.

~~~
fwessels
No, there are no algorithm changes (otherwise the results would be different),
it is just taking advantage of the ARM SHA accelerations when they are
available.

~~~
halomru
> there are no algorithm changes (otherwise the results would be different)

Insertion sort, Mergesort and Timsort are three wildly different algorithms
with different speeds, but on every possible input they produce the exact same
result.

~~~
sjolsen
Strictly speaking, it's possible for different sorting algorithms to produce
different results if you sort using a weak order rather than a total order.

~~~
cmrx64
Or an unstable sort (none of the grandparent's examples are unstable, but
unstable sorts aren't uncommon in naive implementations).

------
DanielDent
Did you compare performance to SHA512? Despite being a theoretically more
secure/"harder" algorithm, on 64 bit platforms it can sometimes be faster than
SHA256. If you don't want to use 512 bits, using 256 bits of the output of
SHA512 is standardized as SHA512/256 and is considered valid/secure.

(I'm unclear if this performance oddity remains true with the crypto hardware
extensions being used here.)

~~~
cyphar
The oddity is that I believe sha512 uses 64-bit words while sha256 uses 32-bit
words. So on 64 bit hardware sha512 will be faster (presumably because using
the 32 bit registers is slower somehow, I don't know). This is why ZFS is
adding support for sha512/256 as its Merkle tree hashing function.

~~~
tedunangst
SHA512 eats data 512 bits at a time, while SHA256 eats it 256 bits at a time.
Both internally use 8 "registers", which are either 64 or 32 bits wide.
Assuming you have the hardware registers to match, this would make SHA512
about twice as fast. But internally, it mixes data using 80 rounds, vs 64 for
SHA256, so the speedup isn't quite 2x.

~~~
pbsd
For the sake of pedantry, SHA-512 eats 1024 bits, resp. 512 bits for SHA-256,
at a time. It's the chaining variables that are of those lengths.
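
The corrected numbers can be read straight out of Go's standard library constants. This is a rough sketch: the `params` type and the helpers are hypothetical, the round counts (64 and 80) come from the comment above, and the "ideal speedup" assumes a 64-bit round costs the same as a 32-bit round, which real hardware may not honor.

```go
package main

import (
	"crypto/sha256"
	"crypto/sha512"
	"fmt"
)

// params collects the per-algorithm figures discussed in the thread.
type params struct {
	Name      string
	BlockBits int // message bits consumed per compression call
	StateBits int // size of the chaining variables (and full digest)
	Rounds    int // rounds per compression call
}

func shaParams() []params {
	return []params{
		{"SHA-256", sha256.BlockSize * 8, sha256.Size * 8, 64},
		{"SHA-512", sha512.BlockSize * 8, sha512.Size * 8, 80},
	}
}

// roundsPerByte is the number of rounds spent per input byte.
func roundsPerByte(p params) float64 {
	return float64(p.Rounds) / float64(p.BlockBits/8)
}

func main() {
	ps := shaParams()
	for _, p := range ps {
		fmt.Printf("%s: block=%d bits, state=%d bits, rounds/byte=%.3f\n",
			p.Name, p.BlockBits, p.StateBits, roundsPerByte(p))
	}
	// Assumption: equal cost per round on a 64-bit machine.
	fmt.Printf("ideal SHA-512 speedup: %.2fx\n", roundsPerByte(ps[0])/roundsPerByte(ps[1]))
}
```

Under that assumption the ratio works out to 1.6x rather than 2x, which is the "not quite 2x" mentioned above.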

------
dharma1
How about accelerating Keccak/SHA-3 (used on Ethereum)? I found this, but it
seems it's not implemented on ARM/Intel hardware yet:

[http://caslab.eng.yale.edu/workshops/hasp2016/HASP16-10.pdf](http://caslab.eng.yale.edu/workshops/hasp2016/HASP16-10.pdf)

------
theparanoid
The amount of assembly in the Go ecosystem is crazy, ~1% of the standard
library is assembly. There are better ways.

~~~
giovannibajo1
Most of it is written to take advantage of special hardware instructions in
CPUs. If you take OpenSSL (a C library), it's got many assembly code paths as
well.

I'm not sure why this should be a problem, it just shows attention to
performance in my opinion.

~~~
pcwalton
Opaque assembly blocks that the compiler doesn't understand the effects of are
actually suboptimal for performance.

~~~
dlsniper
Is this a general thing or a Go-specific one? The generic statement seems to
be false given how many optimizations that actually need assembly exist in Go
and elsewhere.

~~~
pjmlp
It is a general thing and quite common.

Either you keep adding intrinsics to the compiler, or outsource it to an
external Assembler.

Even managed languages have bytecode Assemblers available.

------
anonymous7777
OK, mentioning Go here is a bit pointless. How did Go help compared to any
other language when ARM has hardware instructions for SHA1&2?

~~~
fwessels
Go didn't help specifically, it is just that this package makes support for
the ARM SHA instructions available for Golang.

~~~
shurcooL
Also Go has support for using assembly in a package natively, not all
languages/build tools support that as easily.

~~~
pjmlp
It is pretty common in imperative compiled languages.

Back when compilers were sold, the professional version always had an
assembler in the box.

In MS-DOS even BASIC compilers like Turbo Basic could use inline Assembly.

------
serge2k
> WORD $0x4cdf2025 // ld1 {v5.16b-v8.16b}, [x1], #64

Why is the code written as words like this instead of just the assembly
instructions?

~~~
nickcw
Most likely because the Go assembler doesn't have those instructions built in.

That sort of thing is quite common in Go assembler. The assemblers are a
little primitive and they force the assembly language for each processor into
a common pattern, so when you are writing ARM assembler for instance,
everything is backwards.

I imagine that having rationalised the assemblers across processors somewhat
makes life easier for the core developers, though, since they have to lightly
touch assembler for lots of different processors.
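
To illustrate what that `WORD` line amounts to, here is a small Go sketch (helper names are hypothetical) that treats `0x4cdf2025` as a raw 32-bit AArch64 instruction. It shows the little-endian bytes that end up in the binary and pulls out the two low register fields, which in A64 load/store encodings sit in bits 0-4 and 5-9.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// ld1Word is the raw encoding quoted above for
// ld1 {v5.16b-v8.16b}, [x1], #64
const ld1Word uint32 = 0x4cdf2025

// encodeWord returns the four bytes as they are laid out in memory on
// little-endian arm64.
func encodeWord(w uint32) []byte {
	b := make([]byte, 4)
	binary.LittleEndian.PutUint32(b, w)
	return b
}

// regFields extracts the first transfer register (bits 0-4) and the base
// address register (bits 5-9) from a load/store encoding.
func regFields(w uint32) (rt, rn uint32) {
	return w & 0x1f, (w >> 5) & 0x1f
}

func main() {
	fmt.Printf("bytes: % x\n", encodeWord(ld1Word)) // bytes: 25 20 df 4c
	rt, rn := regFields(ld1Word)
	fmt.Printf("first vector register: v%d, base register: x%d\n", rt, rn)
}
```

The fields come out as v5 and x1, matching the disassembly in the comment next to the `WORD`, which is a handy sanity check when hand-encoding instructions this way.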

~~~
pjmlp
Backwards from AT&T point of view, right?

Because for me it feels very natural, given the Z80 and Intel Assembly.

------
gnaritas
So with hardware sha256, can this mine bitcoin cheaper than an ASIC?

~~~
astrodust
Does it do terahashes per second? If not, then no.

~~~
gnaritas
Haven't been keeping up on how fast ASICs are, just asking, thanks.

------
cloudjacker
It's a bird! It's a plane!

No no, that's just my bitcoin miner booting up

~~~
DanielDent
Pretty sure there are planes which are quieter than some bitcoin miners...

------
matt_wulfeck
Are aes instructions new in aarch64 processors? I'm surprised this doesn't
work magically "out of the box", but I'm not up to date with this stuff.

Also, this is another really good reason never to use a sha2 as your hashing
algo for password storage.

~~~
loeg
AES and SHA2 are completely different algorithms.

> Also, this is another really good reason never to use a sha2 as your hashing
> algo for password storage.

I don't know what you mean by this. I understand that SHA2 is poor for some
things, but I'm not sure how this illustrates that.

~~~
niftich
> but I'm not sure how this illustrates that

SHA2 just got even faster, so it illustrates rather well that you wouldn't
want to use it for password storage -- of course that was always true, but the
improvement drives that point home.

~~~
WhitneyLand
My understanding is that SHA2 is fine for password storage, given a prudent
number of hashing rounds.

~~~
niftich
It's fine to use a SHA2 variant HMAC in PBKDF2 with an obscene, ever-
increasing number of rounds, but the scrypt paper [1] provides a detailed
justification why you may want to switch to a dedicated password hashing
function regardless. To quote their website [2]:

"We estimate that on [circa-2009] hardware, if 5 seconds are spent computing a
derived key, the cost of a hardware brute-force attack against scrypt is
roughly 4000 times greater than the cost of a similar attack against bcrypt
(to find the same password), and 20000 times greater than a similar attack
against PBKDF2."

Specifically, PBKDF2 is a tool to produce a mostly-okay password storage hash
from crypto primitives that are totally unfit for it, while bcrypt, scrypt,
Argon2 are purpose-designed to make certain kinds of attacks difficult, like
time-memory tradeoffs, parallelized custom hardware attacks, etc.

[1]
[http://www.tarsnap.com/scrypt/scrypt.pdf](http://www.tarsnap.com/scrypt/scrypt.pdf)

[2] [http://www.tarsnap.com/scrypt.html](http://www.tarsnap.com/scrypt.html)

~~~
cperciva
Almost right. bcrypt is not designed to be secure against ASICs; it requires a
fixed circuit size. In fact, given that CPU acceleration is available for
SHA2, I suspect that PBKDF2-SHA256 is now stronger than bcrypt based on the
"if my server spends X seconds hashing this password, how much money will
someone need to spend to crack it" metric.

~~~
tedunangst
I waffle about with this metric. If it's cheap to buy a server with some
feature, it's obviously cheap for an attacker to buy the same. Widespread SHA2
hardware means that SHA2 cracking hardware is also available to even the
poorest of crackers. I think there's a sliding scale, where the ratio changes
for some attackers but not for all.

~~~
cperciva
Right, sophisticated attackers have had custom SHA256 circuits for a long
time; less sophisticated attackers are only gaining them now that they are
present in CPUs. But if you're defending against such less sophisticated
attackers, you're still not _losing_ anything; worst case, SHA256 instructions
in CPUs give them the same speedup as you're getting so their cost remains
fixed.

~~~
tedunangst
The question is who upgrades their hardware faster, attackers or big
enterprises? :)

