
Show HN: Accelerate SHA256 Computations in Go Using AVX512 instructions - y4m4
https://github.com/minio/sha256-simd
======
wolf550e
Vlad Krasnov, author of a bunch of the crypto assembly code in OpenSSL and in
Go, recently wrote a blog post arguing that frequency scaling when using
AVX-512 makes it not worth it:
[https://blog.cloudflare.com/on-the-dangers-of-intels-frequen...](https://blog.cloudflare.com/on-the-dangers-of-intels-frequency-scaling/)

He doesn't like the title of the OP and provided links:

> Very misleading title. Could just as well name it "accelerate sha256 up to
> 134x". You need to compare apples to apples. If AVX2 was used in the same
> way AVX512 is used, the speedup would be 2X at most. Reminds me of two of my
> papers
> [https://eprint.iacr.org/2012/371.pdf](https://eprint.iacr.org/2012/371.pdf)
> [https://eprint.iacr.org/2012/067.pdf](https://eprint.iacr.org/2012/067.pdf)

(from
[https://twitter.com/thecomp1ler/status/940724783804645376](https://twitter.com/thecomp1ler/status/940724783804645376))

EDIT: Thanks 'delhanty !

~~~
blasdel
He's using the "cheap" Xeon Silver chips that clock down all cores immediately
and aggressively when any are using AVX-512

Most of the Gold and Platinum series chips don't start frequency scaling down
below baseline until around half the cores are using AVX512. The fanciest
Platinum chips can use it on all cores with the only limit being that you
can't Turbo quite as much:
[https://en.wikichip.org/wiki/intel/xeon_platinum/8180m](https://en.wikichip.org/wiki/intel/xeon_platinum/8180m)

Without that capability, cloud providers wouldn't be able to offer multitenant
VMs with access to the new instructions

~~~
thecompilr
Can't turbo as much is still a net loss of performance. Even on Platinum turbo
frequency is almost 10% lower with AVX512 on single core, and 30% lower with
all cores. So you really want the majority of your software to use AVX512 to
gain net benefit. It takes the system 2ms to recover after an AVX512
instruction. But you are correct that the Silvers are way worse. I suspect
Intel intentionally killed AVX512 performance on the Silvers. I tested power
consumption, and there is no reason to reduce the frequency, except for the
sake of it. The sad thing is there is no CPUID flag to distinguish good AVX512
from useless AVX512. Would really be better if they disabled it completely on
Silver. The way it is now will just hurt adoption.
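
To put rough numbers on "you want the majority of your software to use AVX512", here is a back-of-the-envelope model in Go. It is only a sketch under stated assumptions: the whole program runs at the reduced clock, the vectorized part gets a fixed per-clock gain (2x is assumed here, borrowing the "2X at most" figure quoted upthread), and the 2ms recovery time is ignored. The function names are made up for illustration.

```go
package main

import "fmt"

// netSpeedup is a toy model, not a measurement. A fraction vecFrac of the
// original run time is vectorizable and becomes perClockGain times faster
// per cycle, while the whole program runs at clockFactor times the original
// frequency because of the AVX-512 downclock.
func netSpeedup(vecFrac, perClockGain, clockFactor float64) float64 {
	relTime := ((1 - vecFrac) + vecFrac/perClockGain) / clockFactor
	return 1 / relTime
}

// breakEvenFraction returns the vectorizable fraction at which netSpeedup
// is exactly 1 under the same model. A result above 1 means the downclock
// can never be paid back with that per-clock gain.
func breakEvenFraction(perClockGain, clockFactor float64) float64 {
	return (1 - clockFactor) / (1 - 1/perClockGain)
}

func main() {
	// 0.7 models the "30% lower with all cores" figure; 2 is the assumed
	// per-clock gain of the vectorized code.
	fmt.Printf("break-even fraction: %.2f\n", breakEvenFraction(2, 0.7))
	for _, f := range []float64{0.25, 0.5, 0.75, 1.0} {
		fmt.Printf("vectorized %3.0f%%: net speedup %.2fx\n", f*100, netSpeedup(f, 2, 0.7))
	}
}
```

Under these assumptions the break-even point sits above half of the run time, which is in line with the "majority" claim; a smaller per-clock gain or a deeper downclock pushes it higher.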

~~~
zx2c4
Of interest regarding this might be:
[https://twitter.com/InstLatX64/status/934093081514831872](https://twitter.com/InstLatX64/status/934093081514831872)

> The sad thing is there is no CPUID flag to distinguish good AVX512 from
> useless AVX512.

You can read the avx512_2ndFMA bit from the PIROM, according to this Intel
datasheet:
[https://www.intel.com/content/www/us/en/processors/xeon/scal...](https://www.intel.com/content/www/us/en/processors/xeon/scalable/xeon-scalable-datasheet-vol-1.html)

Linux doesn't implement reading PIROM over SMBus, but it sure would be nice to
expose this flag in /proc/cpuinfo.

In WireGuard we're at the moment just disabling the zmm AVX512F implementation
on Skylake-X, falling back to the still-fast-but-not-as-fast AVX512VL
implementation that only touches ymm and doesn't downclock as much (following
OpenSSL's reasoning on +/- Andy Polyakov's same implementation):

[https://git.zx2c4.com/WireGuard/tree/src/crypto/chacha20poly...](https://git.zx2c4.com/WireGuard/tree/src/crypto/chacha20poly1305.c#n47)

I may look into trying to read the PIROM so that I can make a more informed
decision. I've tested those Platinum boxes, and indeed it's a lot faster
there, even with the [lesser] downclocking, whereas a Gold box didn't perform
as well, making the ymm-only implementation necessary.
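
The selection logic described above could look roughly like this in Go. This is a hypothetical sketch, not WireGuard's or OpenSSL's actual code: the `cpuInfo` struct, its `SecondFMA` field (standing in for the avx512_2ndFMA PIROM bit, which nothing in Go's standard library exposes), and the implementation names are all made up for illustration. Treating that bit as a proxy for "AVX512 that is worth using with zmm" is also an assumption.

```go
package main

import "fmt"

// cpuInfo is a hypothetical summary of what the running CPU offers.
// SecondFMA stands in for the avx512_2ndFMA PIROM bit; a caller would have
// to obtain it by some platform-specific means.
type cpuInfo struct {
	AVX2      bool
	AVX512F   bool
	AVX512VL  bool
	SecondFMA bool
}

// impl names an implementation variant. The names are illustrative only.
type impl string

const (
	implGeneric  impl = "generic"
	implAVX2     impl = "avx2"
	implAVX512VL impl = "avx512vl-ymm"
	implAVX512F  impl = "avx512f-zmm"
)

// choose picks the widest implementation that is expected to pay off.
// The zmm path is used only when the chip is flagged as having the second
// FMA unit; otherwise it falls back to the ymm-only AVX512VL path, then
// AVX2, then generic code.
func choose(c cpuInfo) impl {
	switch {
	case c.AVX512F && c.SecondFMA:
		return implAVX512F
	case c.AVX512F && c.AVX512VL:
		return implAVX512VL
	case c.AVX2:
		return implAVX2
	default:
		return implGeneric
	}
}

func main() {
	withSecondFMA := cpuInfo{AVX2: true, AVX512F: true, AVX512VL: true, SecondFMA: true}
	withoutSecondFMA := cpuInfo{AVX2: true, AVX512F: true, AVX512VL: true}
	fmt.Println("with second FMA:   ", choose(withSecondFMA))
	fmt.Println("without second FMA:", choose(withoutSecondFMA))
}
```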

~~~
thecompilr
If that is an issue for you, you could try using the implementation I wrote
for boringssl. It avoids SIMD multiplications altogether and only uses simple
AVX2 instructions, so there is no slowdown (AFAICT) although it is not as fast
as AVX512VL from OpenSSL in benchmarks.

------
eloff
This is assembly, not pure Go, but it doesn't use CGO, which is probably what
they mean.

Intel Cannon Lake processors will support the SHA instruction extensions
(currently available only on Goldmont). It will be interesting to see how that
compares with this approach of running 16 SHA computations in parallel. You
would be able to get rid of the scheduling overhead of having to first queue
up 16 SHA calculations from other threads.
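
For anyone wondering what that queueing looks like, here is a minimal Go sketch of the batching pattern. It is not the sha256-simd API: `request`, `serve`, `sum` and `hashBatch` are names made up for illustration, and `hashBatch` just loops over `crypto/sha256` where a real 16-lane kernel would process the whole batch in one pass. The point is only the scheduling shape: callers hand messages to a server goroutine, which groups up to 16 of them before hashing.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sync"
)

const lanes = 16 // messages per batch, matching the 16-way approach

// request carries one message and a channel for its digest.
type request struct {
	msg   []byte
	reply chan [32]byte
}

// hashBatch stands in for the multi-lane assembly kernel. Here it simply
// hashes each message in turn with the standard library.
func hashBatch(batch []request) {
	for _, r := range batch {
		r.reply <- sha256.Sum256(r.msg)
	}
}

// serve waits for a request, then opportunistically collects whatever else
// is already queued (up to lanes messages) and hashes the batch together.
func serve(in <-chan request) {
	for first := range in {
		batch := append(make([]request, 0, lanes), first)
	fill:
		for len(batch) < lanes {
			select {
			case r, ok := <-in:
				if !ok {
					break fill // input closed, flush what we have
				}
				batch = append(batch, r)
			default:
				break fill // nothing else queued right now
			}
		}
		hashBatch(batch)
	}
}

// sum submits one message and blocks until its digest comes back.
func sum(in chan<- request, msg []byte) [32]byte {
	reply := make(chan [32]byte, 1) // buffered so hashBatch never blocks
	in <- request{msg: msg, reply: reply}
	return <-reply
}

func main() {
	in := make(chan request)
	done := make(chan struct{})
	go func() { serve(in); close(done) }()

	var wg sync.WaitGroup
	var mu sync.Mutex
	mismatches := 0
	for i := 0; i < 40; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			msg := []byte(fmt.Sprintf("message %d", i))
			if sum(in, msg) != sha256.Sum256(msg) {
				mu.Lock()
				mismatches++
				mu.Unlock()
			}
		}(i)
	}
	wg.Wait()
	close(in)
	<-done
	fmt.Println("mismatches:", mismatches)
}
```

Every caller pays a channel round trip, and a lone caller gets a batch of one, which is the scheduling overhead that a dedicated SHA instruction would avoid.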

~~~
stcredzero
_This is assembly, not pure Go_

Well, if you're going to dip into pedantic mode, couldn't the language
maintainers just define Go to include a few relevant Assembly instruction
sets? (Not taking a dig at you but rather at the above level of pedantry.)

~~~
MaxBarraclough
Not without tying Go to one architecture, no.

When C programmers write inline assembly, they don't pretend it's C code.

~~~
sythe2o0
Go has its own form of assembly which it compiles to multiple architectures.

~~~
MaxBarraclough
I followed npongratz's link. Interesting read.

Seems the point of it is to enable easier porting of assembly between
architectures, by providing a consistent syntax.

I was expecting something akin to LLVM assembly language, but no, they've come
up with their own bizarre high-level assembly-language intended to map fairly
directly to various different instruction-sets. It's not an abstraction layer
in the usual sense; the exposed register-set and instruction-set are faithful
to the target architecture.

It's a finite register machine which _isn't_ just faithfully exposing the
underlying architecture. Not something we often see. iirc SPIR and GNU
Lightning are both finite register machines, but, to quote Douglas Adams, this
has made a lot of people very angry and been widely regarded as a bad move.

How is it compiled? Presumably it doesn't get translated to LLVM as an
intermediary.

It strikes me as an awful lot of work. Does their high-level assembler really
outperform LLVM? Would've thought a project of that sort would deserve to
exist in its own right, not just as an obscure component of Go.

~~~
stcredzero
_It's a finite register machine which isn't just faithfully exposing the
underlying architecture. Not something we often see. iirc SPIR and GNU
Lightning are both finite register machines, but, to quote Douglas Adams, this
has made a lot of people very angry and been widely regarded as a bad move._

TAOS operating system used such a virtual ISA, and was able to achieve around
90% efficiency of native code. The worst case was PowerPC which fell to 80%.
That's pretty darn good, IMO.

~~~
MaxBarraclough
Skimreading
[http://www.dickpountain.co.uk/home/computing/byte-articles/t...](http://www.dickpountain.co.uk/home/computing/byte-articles/the-taos-operating-system-1991)
(linked from
[https://news.ycombinator.com/item?id=9806607](https://news.ycombinator.com/item?id=9806607),
where you yourself commented)

Interesting OS. Its 'VP code' looks like a precursor to Java bytecode/HotSpot,
but much more low-level and RISC-ey.

Inferno OS's 'Dis' VM took a similar approach to VP code, if I understand
correctly.

I presume that, in 1991 when the article was written, "JIT" wasn't yet in the
techies' parlance. It's not used anywhere in the article.

------
foobarbazetc
One thing to note is that the benchmark is running on a Skylake Platinum chip
which has two AVX512 FMAs.

You need a Gold 6000 series and above to see any benefit from AVX512. In most
other cases the CPU throttles down some insane amount and there’s little to no
benefit.

~~~
thecompilr
You don't use FMA (or any multiplication) for SHA2. The throttling is a big
issue.

~~~
foobarbazetc
True.

Did you guys get to test Epyc at CloudFlare?

The 7401P seems pretty special. Like really great $ per perf. I think
SuperMicro are coming out with 1 socket Epyc boards/servers.

------
ComputerGuru
I blogged about the SHA instruction support in the x86_64 ISA a few months
back; it’ll be nice to see it actually happen:
[https://neosmart.net/blog/2017/will-amds-ryzen-finally-bring...](https://neosmart.net/blog/2017/will-amds-ryzen-finally-bring-sha-extensions-to-intels-cpus/)

------
dragonfax
Isn't this the kind of thing that was missing from the "go on different
platforms" benchmark a little while back? The Intel platform has crazy
optimization for encryption algorithms, while ARM was severely lacking.

~~~
wolf550e
search for "arm" on this list:
[https://dev.golang.org/reviews](https://dev.golang.org/reviews)

------
mikebenfield
Possibly I'm confused, but in what sense is this "in Pure Go"?

~~~
gjem97
IMO, it's playing fast and loose with that term, but I guess the point is that
it's not using CGO (i.e. calling into C code). It is, however, using the
assembler packaged with the Go tools, so in that regard it's not "pure" go.

~~~
stcredzero
_IMO, it's playing fast and loose with that term_

The terminology in this context is already fast and loose: It is rigorous in a
practical engineering sense and is far from a mathematical level of precision.
As I pointed out above, the maintainers could just define Go to include a few
Assembly languages.

