
An Introduction to Vectorization - conductor
https://blog.cr.yp.to/20190430-vectorize.html
======
y7
This is all rather disingenuous (particularly the bit about anti-vector
campaigners, which is probably the point of the entire blog post).

> Do they seriously believe that Intel and AMD and ARM have all screwed up by
> supporting vectorization?

The benefit of vectorization of course depends on the total workload. In a
web/application server where, say, 5% of resources are spent on encryption and
95% on other things like networking, it could very well be that enabling
vectorization, and thereby lowering the clock speed, results in decreased
overall performance.
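
A back-of-the-envelope sketch of that tradeoff in C, with purely illustrative
numbers (the 5%/95% split from above, an assumed 2x SIMD speedup on the crypto
portion, and an assumed 10% chip-wide clock penalty):

    #include <stdio.h>

    /* Toy model: total time = crypto time + everything else.
     * All numbers are illustrative assumptions, not measurements. */
    int main(void) {
        double crypto = 0.05, other = 0.95; /* baseline time fractions */
        double simd_speedup  = 2.0;  /* assumed speedup on the crypto part */
        double clock_penalty = 1.10; /* assumed 10% slower clock overall */

        double scalar_total = crypto + other;
        /* Vectorized: crypto gets faster, but the reduced clock
         * stretches every part of the workload. */
        double vector_total = (crypto / simd_speedup + other) * clock_penalty;

        printf("scalar: %.2f  vectorized: %.2f\n",
               scalar_total, vector_total);
        /* With these assumed numbers: scalar 1.00 vs vectorized ~1.07,
         * i.e. a net slowdown despite the faster crypto. */
        return 0;
    }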

> You might run into an anti-vectors campaigner who wants to take this freedom
> away

I really don't see much campaigning in the slides, which merely say:

> We do not give a machine code implementation using SSE etc

> – We (and others) have found that using these extensions causes overall
> performance of cryptographic systems to slow down

> – Just looking at run times of algorithms on their own is not a good measure
> of overall performance in the wild

> – Would caution NIST against putting too much emphasis on academic measures
> of performance of algorithms for this reason

I don't see any indication that they're against the option of vectorization;
they merely warn that whole-system performance is what will matter, not just
the algorithm in isolation.

~~~
rwmj
Does the clock speed drop permanently as soon as you use a vector
instruction? I would have thought that it goes back up once the temperature
has come down.

~~~
jzwinck
It isn't about temperature directly. Recent Intel CPUs instantly incur a
noticeable penalty as soon as certain 256-bit or wider instructions are
encountered. It doesn't matter if it's only a single instruction; you pay the
penalty before the temperature has changed at all.

After some fraction of a second the clock does revert to normal, provided you
haven't executed any such instructions again.
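
For concreteness, a minimal sketch of the kind of instruction being discussed;
_mm256_fmadd_ps is a real 256-bit AVX2/FMA intrinsic, though exactly which
instruction classes trigger the frequency "license" varies by
microarchitecture:

    #include <immintrin.h> /* AVX/FMA intrinsics; compile with -mavx -mfma */
    #include <stdio.h>

    int main(void) {
        float in[8] = {0, 1, 2, 3, 4, 5, 6, 7}, out[8];

        __m256 a = _mm256_loadu_ps(in);
        /* One 256-bit fused multiply-add. On the Intel parts discussed
         * above, "heavy" 256-bit (and any 512-bit) operations like this
         * can request a reduced-frequency license for the core, which is
         * why the penalty appears long before temperature matters. */
        __m256 b = _mm256_fmadd_ps(a, a, a); /* b = a*a + a, 8 lanes */
        _mm256_storeu_ps(out, b);

        printf("%f\n", out[7]); /* 7*7 + 7 = 56.0 */
        return 0;
    }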

------
kristianp
By the author of the Salsa20 and ChaCha algorithms, Daniel J. Bernstein.

~~~
bem94
Thanks for pointing this out. He has his own set of candidate submissions to
the NIST post-quantum competition, so throwing shade at other candidates makes
a bit more sense now.

I've done some analysis on one of his candidates and it was _by far_ the
slowest on cores which do not feature wide vector operations. My opinion is
that you shouldn't just optimise for Intel (as many submissions do) for all
the obvious reasons.

Edit: A quote from the slides by Nigel Smart which Bernstein criticises -
"[We] Would caution NIST against putting too much emphasis on academic
measures of performance of algorithms for this reason"

I couldn't agree more.

~~~
nemo1618
SIMD is available on more than just Intel. What platforms did you test on?

~~~
bem94
Indeed. I was actually looking at very small / embedded cores which lack
SIMD: think ARM Cortex-M0/M3 or RISC-V RV32IMC.

