
It takes two to ChaCha (Poly) - jgrahamc
https://blog.cloudflare.com/it-takes-two-to-chacha-poly/
======
tptacek
Adoption of ChaCha20/Poly1305 isn't so much to have a "backup" to GCM; it's
because GCM sucks.

The biggest problem for TLS with GCM is that it includes a 128-bit carryless
multiplication. You can do that reasonably quickly in software, but you need
lookup tables to do it, and those tables have secret indices which leave
trails in caches. Modern processors have instruction extensions (like Intel
CLMUL) that avoid that problem, but now you have a hardware dependency.
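
A toy sketch of the lookup-table problem, shrunk to GF(2^8) for brevity (GHASH
does the same trick in GF(2^128)); the key value here is made up:

```python
def gf8_mul(a: int, b: int, poly: int = 0x11B) -> int:
    """Bitwise carryless multiply in GF(2^8), reduced mod x^8+x^4+x^3+x+1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= poly
    return r

H = 0x57  # stand-in secret hash key (hypothetical value)
# Precompute x * H for every byte: the classic software speed trick.
TABLE = [gf8_mul(x, H) for x in range(256)]

def mul_by_H(x: int) -> int:
    # Fast, but the index x is secret-derived, so the cache line this
    # lookup touches depends on the secret: the timing trail described above.
    return TABLE[x]

# Sanity check against the direct multiply ({57}*{83} = {c1} in AES's field).
assert mul_by_H(0x83) == gf8_mul(0x83, H) == 0xC1
```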

The big problem overall with GCM is that GCM is terrible. It fails
catastrophically if random numbers get repeated, it has a short counter space,
and it can blow up if you truncate MAC tags. It's weird to me that GCM has
seen as much adoption as it has.
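
A sketch of the nonce-reuse failure, using a toy SHA-256-based CTR keystream
rather than real AES-GCM (the key and nonce values are made up): with a
repeated nonce the keystreams cancel, so the XOR of two ciphertexts equals the
XOR of the two plaintexts. Real GCM fails even harder, since a repeated nonce
also exposes the GHASH authentication key.

```python
import hashlib

def toy_ctr_keystream(key: bytes, nonce: bytes, n: int) -> bytes:
    """Toy CTR keystream (SHA-256 as the block function, not real AES)."""
    out = b""
    ctr = 0
    while len(out) < n:
        out += hashlib.sha256(key + nonce + ctr.to_bytes(4, "big")).digest()
        ctr += 1
    return out[:n]

def toy_encrypt(key: bytes, nonce: bytes, pt: bytes) -> bytes:
    ks = toy_ctr_keystream(key, nonce, len(pt))
    return bytes(p ^ k for p, k in zip(pt, ks))

key, nonce = b"k" * 32, b"n" * 12   # hypothetical values
p1, p2 = b"attack at dawn!!", b"retreat at nine!"
c1 = toy_encrypt(key, nonce, p1)    # same nonce used twice: the GCM sin
c2 = toy_encrypt(key, nonce, p2)

# The keystreams cancel: c1 XOR c2 == p1 XOR p2, leaking plaintext structure.
leak = bytes(a ^ b for a, b in zip(c1, c2))
assert leak == bytes(a ^ b for a, b in zip(p1, p2))
```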

The interesting thing about ChaCha20/Poly1305 isn't the cipher (ChaCha20 is an
extremely boring cipher, and while TLS won't take advantage of this, CC20 can
be replaced with AES in a similar design), but the Poly1305 MAC, which was
designed to be fast in software on conventional architectures.
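
Poly1305 is simple enough to state in a few lines. A minimal sketch per RFC
8439, using Python big integers rather than the optimized limb arithmetic that
makes DJB's version fast:

```python
def poly1305_mac(msg: bytes, key: bytes) -> bytes:
    """Poly1305 per RFC 8439: evaluate a polynomial in r mod 2^130 - 5."""
    r = int.from_bytes(key[:16], "little") & 0x0FFFFFFC0FFFFFFC0FFFFFFC0FFFFFFF
    s = int.from_bytes(key[16:32], "little")
    p = (1 << 130) - 5
    acc = 0
    for i in range(0, len(msg), 16):
        # Append a 0x01 byte to each block, then read it as a little-endian int.
        n = int.from_bytes(msg[i:i + 16] + b"\x01", "little")
        acc = (acc + n) * r % p
    return ((acc + s) & ((1 << 128) - 1)).to_bytes(16, "little")

# RFC 8439 test vector.
key = bytes.fromhex(
    "85d6be7857556d337f4452fe42d506a80103808afb0db2fd4abff6af4149f51b")
tag = poly1305_mac(b"Cryptographic Forum Research Group", key)
assert tag.hex() == "a8061dc1305136c6c22b8baf0c0127a9"
```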

Poly1305 as a design is pretty boring, but follows the DJB trend of taking a
simple, proven idea and ruthlessly refining it for performance and safety on
the kinds of computers everyone uses. Until recently, DJB was one of the only
people doing that.

TL;DR: GCM is the worst.

~~~
JoachimSchipper
Among my employer's customers and employees, the people who ask for GCM really
want "a TLS algorithm that isn't terrible" - and GCM is more or less the only
widely-supported solution. (As you know, OCB's license made it a non-starter,
CCM was slower and not Intel-supported, and most other stuff is newer than
GCM.) Since people will take their cues from - at best - experts talking about
TLS, even an old-fashioned solution like "HMAC plus CBC, but in the correct
order" isn't on their radar.

It helps that TLS-on-Intel is about the only environment in which GCM isn't
too terrible (hardware support fixes side-channel attacks, short-lived
connections plus key (re-)negotiation fixes the short counter space, and
fortunately nobody has standardized the foot-gun that is short-MAC GCM). Also,
GCM does admit very fast pipelined hardware implementations (including AES-
NI).
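
The "short counter space" in concrete numbers: GCM's per-nonce block counter
is 32 bits, with two counter values reserved, so a single (key, nonce) pair
tops out just under 64 GiB, a limit short-lived, rekeying TLS connections
never approach:

```python
# GCM's per-record counter is 32 bits, and two counter values are reserved
# (initial counter block and tag mask), so one (key, nonce) pair can encrypt
# at most 2**32 - 2 sixteen-byte blocks.
max_blocks = 2**32 - 2
max_bytes = max_blocks * 16
assert max_bytes == 68_719_476_704   # just shy of 64 GiB (68_719_476_736)
assert max_bytes < 64 * 2**30
```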

SIV, which you suggest downstream, is indeed cool (if you can afford a two-
pass algorithm). A colleague suggests that GCM-SIV is likely to lap the CAESAR
candidates for inclusion into various standards including TLS, which would
be... about half as horrifying, I guess?

TL;DR: GCM is bad, but it's TLS which is the worst. ;-)

~~~
conductor
_> OCB's license made it a non-starter_

Coincidentally, the fourth draft of AES-OCB for TLS was published [0] today,
and it says:

 _" Historically Offset Codebook Mode has seen difficulty with implementation,
deployment and standardization because of pending patents and intellectual
rights claims on OCB itself. In preparation of this document all involved
parties have declared they will issue IPR statements exempting use of OCB Mode
in TLS from these claims."_

[0] - [https://datatracker.ietf.org/doc/draft-zauner-tls-aes-ocb/?include_text=1](https://datatracker.ietf.org/doc/draft-zauner-tls-aes-ocb/?include_text=1)

------
nisa
Reading about djb makes me feel really unproductive and stupid :) He wrote
daemontools and solved, years ago, most of the issues that systemd tries to
solve; wrote qmail and basically designed the blueprint for writing secure
daemons on Unix. He wrote a secure wrapper for using C library functions
safely, and created a build system that is similar to NixOS and solves most of
the pains of the Unix filesystem hierarchy. He wrote a secure replacement for
BIND, and I've probably forgotten lots of his other achievements in
cryptography and mathematics, like DNSCurve, ECC ciphers and MACs.
[https://cr.yp.to/djb.html](https://cr.yp.to/djb.html)

~~~
przemoc
Similar thoughts. :)

He's the author of a lot of great stuff, but I think his works should be
treated more as showing a way of doing things than as implementations you have
to stick to (I'm not implying they're bad, though). For instance I prefer
runit over daemontools, but quite likely without daemontools there wouldn't be
runit. (Actually I want to try s6 [1] soon, as it seems to be another step in
the evolution of process supervision tooling.)

But his "build system" is actually unnecessarily non-standard and clunky. I am
against autohell, but carefully crafted handmade Makefiles are really nice.

[1] - [http://skarnet.org/software/s6/](http://skarnet.org/software/s6/)

~~~
gruturo
You mean like the DJB Way? There's a website for that.
[http://thedjbway.b0llix.net/](http://thedjbway.b0llix.net/)

------
RUBwkVjwLsDKgPw
AVX2 usage reduces the clock frequency for a millisecond. It's suspicious (?)
that performance numbers were given in terms of clock cycles and not wall-clock
time. It might have an even worse system-wide performance impact for small
messages, where one takes the AVX2 clock-frequency hit for little gain.

~~~
tveita
Cycles per byte is pretty standard for crypto algorithms, and tends to give
consistent results across an architecture. It is usually measured with dynamic
frequency scaling disabled.
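
For concreteness, here is how a wall-clock benchmark turns into a cycles/byte
figure, assuming a pinned clock; the SHA-256 workload and the 3.0 GHz value
are stand-ins, not numbers from the article:

```python
import time
import hashlib

def cycles_per_byte(func, data: bytes, cpu_hz: float, reps: int = 50) -> float:
    """Wall-clock benchmark converted to cycles/byte.

    Only meaningful if the clock is actually pinned at cpu_hz, i.e. dynamic
    frequency scaling is disabled, which is exactly the caveat above.
    """
    best_ns = float("inf")
    for _ in range(reps):
        t0 = time.perf_counter_ns()
        func(data)
        best_ns = min(best_ns, time.perf_counter_ns() - t0)
    return (best_ns / 1e9) * cpu_hz / len(data)

# Stand-in workload: SHA-256 over 64 KiB at an assumed pinned 3.0 GHz clock.
cpb = cycles_per_byte(lambda d: hashlib.sha256(d).digest(), b"\x00" * 65536, 3.0e9)
assert cpb > 0
```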

Do you have a source for use of AVX2 automatically reducing clock frequency? I
found this:

"Because Intel AVX instructions generally consume more power, frequency
reductions can occur to keep the processor operating within TDP limits. [...]
Performance of workloads optimized for Intel AVX instructions can be
significantly greater than workloads that do not use Intel AVX instructions
even when the processor is operating at a slightly lower frequency"

Which seems to indicate that it would only trigger when you're doing a fair
amount of work with it, which should be giving you a total performance
benefit.

~~~
knweiss
Check out the "Intel AVX Instructions Optimization" slide on
[http://anandtech.com/show/10158/the-intel-
xeon-e5-v4-review/...](http://anandtech.com/show/10158/the-intel-
xeon-e5-v4-review/3).

Quote:

 _On Haswell, one AVX instruction on one core forced all cores on the same
socket to slow down their clockspeed by around 2 to 4 speed bins (-200,-400
MHz) for at least 1 ms, as AVX has a higher power requirement that reduces how
much a CPU can turbo. On Broadwell, only the cores that run AVX code will be
reducing their clockspeed, allowing the other cores to run at higher speeds._

~~~
tveita
Interesting, that seems to contradict the information in
[http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/performance-xeon-e5-v3-advanced-vector-extensions-paper.pdf](http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/performance-xeon-e5-v3-advanced-vector-extensions-paper.pdf)

Specifically, the FAQ says:

"Will running a small number of Intel AVX instructions reduce frequency below
the regular marked frequency?

No, frequency will be reduced below the regular marked frequency only if a
real power or thermal constraint is reached, not just due to the presence of
Intel AVX instructions. Some workloads that utilize Intel AVX instructions
could still achieve turbo above the marked TDP frequency."

~~~
semi-extrinsic
No, you're actually agreeing: GP says "it reduces how much a CPU can turbo",
while you are pointing out "it will not reduce frequency below the regular
marked frequency".

Say marked freq is 3.4 and the CPU turbos to 3.9. GP is saying an AVX2
instruction will drop it from 3.9 to 3.7 or 3.5, while you're saying it won't
drop it below 3.4.

------
CloudFlrFeedbck
Can you whitelist Tor on blog.cloudflare.com, at least? :-)

