Hacker News new | past | comments | ask | show | jobs | submit login
The BLAKE3 cryptographic hash function (github.com)
288 points by erwan 12 days ago | hide | past | web | favorite | 104 comments





It looks like the speedup is coming from two main changes.

The first change is reducing the number of rounds from 10 to 7. Think of it like making a smoothie - you add bits of fruit to the drink (the input data), then pulse the blades to blend it up (making the output hash). This change basically runs the blades for 7 seconds instead of 10 seconds each time they add fruit. They cite evidence that the extra 3 seconds aren't doing much - once the fruit's fully liquid, extra blending doesn't help - but I worry that this reduces the security margin. Maybe those extra 3 rounds aren't useful against current attacks, but they may be useful against unknown future attacks.

The other change they make is to break the input into 1KiB chunks, then hash each chunk independently. Finally, they combine the individual chunk hashes into a single big hash using a binary tree. The benefit is that if you have 4KiB of data, you can use 4-way SIMD instructions to process all four chunks simultaneously. The more data you have, the more parallelism you can unlock, unlike traditional hash functions that process everything sequentially. On the flip side, modern SIMD instructions can handle 2 x 32-bit instructions just as fast as 1 x 64-bit instructions, so building the algorithm out of 32-bit arithmetic doesn't cost anything, but gives a big boost to low-end 32-bit CPU's that struggle with 64-bit arithmetic. The tree structure is a big win overall.


> but I worry that this reduces the security margin. Maybe those extra 3 rounds aren't useful against current attacks, but they may be useful against unknown future attacks.

This was covered in more detail in previous "Too Much Crypto" paper [1], which argued that many standards have excessively high round counts. Note that Aumasson is author of both Blake3 and Too Much Crypto

[1] https://news.ycombinator.com/item?id=21917505


From paper:

> Our goal is to propose numbers of rounds for which we have strong confidence that the algorithm will never be wounded

They take algorithms, past 10 years of public crypto research and shave off rounds, until it just about starts falling apart. AFAIU having security-reducing attacks is the target.

I prefer to have ample confidence in my crypto algorithms. Would not recommend BLAKE3 (without those extra rounds).


Thinking back to Schneier’s running commentary on SHA3, using hierarchical hashes was part of a general attempt to increase the internal state space of the hashes. SHA1 exposes all of the bits. So once you get two prefixes with the same hash, adding a common suffix results in the same output hash.

With hashes of hashes, the prefixes have to be the same length, and possibly very short.


Another benchmark:

time openssl sha256 /tmp/bigfile

  real  0m28.160s
  user  0m27.750s
  sys   0m0.272s
time shasum -a 256 /tmp/bigfile

  real  0m6.146s
  user  0m5.407s
  sys   0m0.560s
time b2sum /tmp/bigfile

  real  0m1.732s
  user  0m1.450s
  sys   0m0.244s
time b3sum /tmp/bigfile

  real  0m0.212s
  user  0m0.996s
  sys   0m0.379s
TIL OpenSSL sha256 invocation is really slow compared to the shasum program. Also BLAKE3 is really fast.

Edit: bigfile is 1GB of /dev/random


May be silly to ask, but can we assume the numbers are from second run for each variant.

On my machine running Ubuntu 18.04 (coreutils 8.28, openssl 1.1.1), openssl is faster than both shasum and sha256sum.

Yeah as someone familiar with openssl it looks like a version of openssl that was built incorrectly.

Did you prime the cache before the first run? Or put /temp on a RAM file system?

Here's my similar runs on a E3-1230 v5:

   b3sum     ./1GB 0m0.124s
   md5sum    ./1GB 0m1.620s
   sha512sum ./1GB 0m2.988s
   sha256sum ./1GB 0m4.738s
I was even more impressed when I timed dd:

   $ time dd if=./1GB of=/dev/null bs=65536
   16384+0 records in
   16384+0 records out
   1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.10451 s, 10.3 GB/s

   real 0m0.107s

Some notes re coreutils and ssh

sha256sum uses 128KiB blocks when reading, which may be more optimized than the openssl program.

sha256sum supports using openssl libs as they are generally faster. This is enabled on Red Hat flavored distros and arch, but not debian flavored as yet

The upcoming release of sha256sum (8.32) will auto enable use of openssl for >= v 3, as openssl's licence has changed to apache in that version


So this is bao + blake2?

I remember watching Bao, a general purpose cryptographic tree hash, and perhaps the fastest hash function in the world: https://www.youtube.com/watch?v=Dya9c2DXMqQ a while ago.

Nice job!


Yep that's me :) The Bao project evolved into BLAKE3, and the latest version -- which I literally just released -- is now based on BLAKE3.

Excuse my confusion. I understand "the Bao project evolved into BLAKE3", but "is now based on BLAKE3" confuses me. Bao is based on blake3? But isn't bao ... the blake3 itself now? Circular dependency detected.

Ha, yes, that's confusing. The Bao project was originally two things: 1) a custom tree hash mode, and 2) an encoding format and verified streaming implementation based on that tree hash. The first half evolved into BLAKE3. Now the Bao project itself is just the second half.

Hi, some questions...

The README lists 4 designers, including yourself. However the Bao project doesn't list anybody, so presumably you are the only designer. What exactly were the contributions of the other 3 people to warrant being listed?

At what point did the Bao project become "BLAKE3" and why?


All three others are principals of the Blake or Blake2 design and major implementations.

Hey I like your speaking style, clear, and funny. Nice work.

IIRC Bao was already based off of blake2s.

Looks like they've taken Too Much Crypto to heart[1] and dropped the number of rounds from Blake2B's 12 down to 7 for Blake3:

https://github.com/BLAKE3-team/BLAKE3/blob/master/reference_...

https://github.com/BLAKE2/BLAKE2/blob/master/ref/blake2b-ref...

Which, yeah, that alone will get you a significant improvement over Blake2B. But definitely doesn't account for the huge improvement they're showing. Most of that is the ability to take advantage of AVX512 parallelism, I think. The difference will be more incremental on AVX2-only amd64 or other platforms, I think.

[1]: Well, TMC recommended 8 rounds for Blake2B and 7 for Blake2S.


> Looks like they've taken Too Much Crypto to heart

Not surprising considering that one of they is the author of Too Much Crypto


Indeed. It's also 3rd citation in their formal spec.

Just one variant, that's refreshing. And performance is impressive. What's the short input performance like? Say for 64 bytes of input.

The Performance section of the spec (https://github.com/BLAKE3-team/BLAKE3-specs/blob/master/blak...) has a paragraph about short input performance. 64 bytes happens to be the BLAKE3 block size, and performance at that length or shorter is best in class. Look at the left edge of Figure 3 (https://i.imgur.com/smGHAKA.png).

64 bytes happens to be the typical cache line size so it makes sense to use it as a block size.

What about padding?

That's impressive speedup. I just installed it and holy moly it really is fast. All those extra cores can finally get busy. ;)

b3sum -l 256 big-2.6Gfile

  real 0m0.384s
  user 0m2.302s
  sys  0m0.175s

b2sum -l 256 big-2.6Gfile

  rear  0m3.616s
  user  0m3.360s
  sys   0m0.256s
(Intel® Core™ i7-8550U CPU @ 1.80GHz × 8 )

EDIT: ah, the catch. blake3 targets 128 bit security. It competes with SipHash for speed and security

EDIT2 scratch the previous edit.


> ah, the catch. blake3 targets 128 bit security. It competes with SipHash for speed and security.

No no, BLAKE3 is a general-purpose cryptographic hash just like BLAKE2, SHA-2, and SHA-3. The confusion here is that a hash function's security level is half of its output size, because of the birthday problem. BLAKE3, like BLAKE2s and SHA-256, has a 256-bit output and a 128-bit security level. (BLAKE3 also supports extendable output, but that doesn't affect the security level.)

> holy moly it really is fast

Thank you :)


>security level is half of its output size

A hash can have different security levels against different attacks. BLAKE3 appears to have 128 bits of security against all attacks.

SHA3-256 was originally designed to have 128 bits of collision security and 256 bits of preimage security. NIST then made a change to it giving it 128 bits of security against all attacks. A lot of people got mad. Then NIST caved and changed it back to 128 bits of collision security and 256 bits of preimage security.

It looks like BLAKE3 agrees with how NIST wanted SHA3 to be. I wonder if people will be mad at BLAKE3.

https://en.wikipedia.org/wiki/SHA-3#Capacity_change_controve...

For a more fair performance comparison against SHA3, you should compare against SHAKE128(256). That is, the version with 128 bits of security all around and a 256 bit output (how NIST wanted it). Although maybe it's pointless, because according to Wikipedia SHAKE128(256) is only 8% faster than SHA3-256 for large inputs.


> BLAKE3 appears to have 128 bits of security against all attacks.

That's not accurate. The best pseudo-preimage attack on BLAKE2s has complexity 2^{253.8} against 6.75 rounds (section 3.2 of https://eprint.iacr.org/2019/1492.pdf ). The best full-preimage attack on BLAKE2s is against 2.75 rounds. BLAKE3's round function is identical to BLAKE2s (although used in a different mode). Currently there isn't any known classical preimage attack on BLAKE3 better than these ones against reduced BLAKE2s. This should be interpreted with caution since the design has only just been published.

[Disclosure of interest: I know Zooko and work for Electric Coin Company. This is only based on a cursory review of the paper, though; I had not seen it prior to publication.]

-- Daira Hopwood


>Although maybe it's pointless, because according to Wikipedia SHAKE128(256) is only 8% faster than SHA3-256 for large inputs.

This is mainly due to SHA3's humongous 1600-bit state, which is not very friendly to embedded systems. In sponge constructions with smaller states, or generally primitives with smaller states, the difference is much larger.

Also in general I would say that small message performance is usually more important than large message performance, since large messages with desktop/laptop CPUs are so incredibly fast anyway with most hash functions that the bottleneck goes somewhere else. (Storage, network, etc.)


I mentioned large message performance because that appeared to be what BLAKE3's benchmarks were focusing on.

They're probably right to be upset by that change. Collisions require a lot more resources than simply trying every input on a GPU/ASIC, and as https://news.ycombinator.com/item?id=22007810 points out, an attacker that can only cover some fraction of the key space is massively more disadvantaged when they're performing a collision attack. You need more raw bits of preimage resistance to make the attacks equally hard.

I don't know if you need twice as many bits of preimage resistance, but I'd feel a lot more comfortable with an extra 32.


> Collisions require a lot more resources than simply trying every input on a GPU/ASIC

Would a 128 bit difficulty preimage attack against SHAKE128(256) be as simple as trying every single input? Trying every single input would be a 256 bit difficulty attack I would assume. To get it to a 128 bit attack I would think the attack would need something more advanced.


Yeah, I should have stuck with my initial wording about having major space requirements and not just time requirements.

You do need a more complex calculation to try to mount a preimage attack. But from what I can figure out it still tends to use big blocks of standalone arithmetic, something that's still very easy to accelerate with limited I/O.


Thanks for clarifying that.

> BLAKE3, like BLAKE2s .. has .. a 128-bit security level.

Skimming and interpreting the Too Much Crypto paper[1], the security target is strictly less than 128-bit security. If it maintained 128-bit security, it would be considered too many rounds.

[1] https://news.ycombinator.com/item?id=21917505


> ah, the catch. blake3 targets 128 bit security

Not really. 128 bits for collision (using a 256 bit hash) is not the same as 128 bits for key recovery or preimage. It is much, much stronger.

Preimage and key recovery attacks have a linear drop off: halving the number of tries halves your chances of success. Collisions have a quadratic drop off: halving your number of tries divides your chances of success by four.

Moreover, it is easier to find one key if you have many keys to try (finding 1 key among N is N times easier than finding any specific key). No such considerations for collisions: finding any collision is just as hard as finding one collision among many.

512-bit hashes were never needed for their security levels. Though in some circumstances (deriving multiple keys, EdDSA), bigger digests come in handy.


That's an interesting point I hadn't thought of before. If someone says "I want < 1% chance of an attack" then 128 bits of collision resistance will be stronger than 128 bits of preimage resistance.

On the other hand if someone says "I want < 99% chance of an attack" then 128 bits of preimage resistance will be more secure. But people probably don't have this goal, so 128 bits of collision resistance is better overall.

But another aspect is once you start finding collisions, you're likely to find more faster (whereas with preimages they keep coming linearly). This would actually be a worse point for collision resistance.

> (finding 1 key among N is N times easier than finding any specific key)

Yes, but only up to a point. You have to compare the cost of computing a hash vs comparing the hash. If the time to hash is 10000x the time to compare, then no matter how large N is, you'll never get faster than a 10000x speedup. This is essentially Amdahl's law. But this speedup is still quite significant.


I would be interested in how this compares on the smhasher against some of the other fastest hash competitors like meow hash or xxhash.

Give me two days.

I am also curious about how it performs as a PRF in places where e.g. Chacha20 is used as a keystream generator now. Also as a reduced round variant in places where non-cryptographic PRNGs are used for very fast RNG needs: JSF, SFC, Lehmer, Splitmix, PCG.

In my extremely limited testing (on AVX2, but not AVX512 hardware), (buffered) reduced (four) round Chacha is only about 1.5-2x slower than fast non-cryptographic PRNGs like JSF, SFC, Lehmer, or pcg64_fast (all with Clang -O2 -flto, the fast PRNGs are header-only implementations and only chacha is two files).

This thing still uses 7 rounds, but that is easy to tune down. Very neat.


The "too much crypto" paper linked in the specs recommends to lower Chacha20 down to 8 rounds.

Blake3 wouldn't compete with Chacha20, it would compete with Chacha8.


Note that a "round" in BLAKE/2/3 is equivalent to a "double-round" in ChaCha.

Ah, ok. So 7-round Blake3 is perhaps closest to 14-round Chacha.

I can't seem to find any non-rust implementations in the works yet, so I may sit down and adapt the reference to C# this weekend. Anyone know how the single/few threads performance holds up excluding avx512?

Luke ported the reference implementation to Go https://github.com/lukechampine/blake3

There's a non-Rust implementation in the same repository, at https://github.com/BLAKE3-team/BLAKE3/tree/master/c (in C).

Yes, but about half as slow as the Rust version, because the rust version processes the chunks in parallel.

I'm working on exporting the rust version to C, so all can be compared properly.


If you squint at it, there's also a Python implementation here: https://github.com/oconnor663/bao/blob/master/tests/bao.py. But it's far too slow to use it for anything besides testing.

We made an OpenCL implementation of blake2b-256 here (the filename does not mean it's only a SHA1 implementation):

https://github.com/ndsol/git-mine/blob/master/sha1.cl

We are considering doing a Blake3 implementation, but haven't committed to do it yet.


Does anybody know of benchmarks for ARM? Or any research trying to break it?

The numbers look astonishing.


Take a look at Figure 5 (https://i.imgur.com/Izs23wf.png) in the spec (https://github.com/BLAKE3-team/BLAKE3-specs/blob/master/blak...). That benchmark was done on a Raspberry Pi Zero, which is a 32-bit ARM1176.

Do you have single-thread cpb benchmark figures on amd64 hardware without AVX512?

Clearly the benefits of AVX512 really exaggerate the comparison on hardware that supports it, and the benefit over Blake2S is pretty muted on hardware without vector intrinsics (low end 32-bit ARM). But I'm interested in the middle — e.g., Zen1/2 AMD, Broadwell and earlier Intel x86-64.

Thanks!


Thanks! Impressive as well!

I'm also curious for benchmarks on amd64 without AVX512, as that's the majority of existing x86 metal and the entire AMD product line, even the new Zen2 stuff.

Why are there no AES-like hashing algorithms out there? AES design is very suitable to be used as a building block in a hash if you remove "add round key" operation.

I helped design Meow Hash using AES-NI. It is not general purpose crypto strength, but ridiculously fast, targeting a theoretical performance of 16 bytes per cycle on some processors, too fast for memory to keep up. https://github.com/cmuratori/meow_hash

Regarding the speed, meow_hash is certainly performant, but when hashing a single file, b3sum actually seems to be faster, perhaps due to using more threads. Could meow_hash be made even faster by running more operations in parallel?

time ~/.cargo/bin/b3sum ubuntu-19.10-beta-desktop-amd64.iso 0899b731b6b57d75a65273fe29d54802653b3bbe3bae6732140c487c4f0ece71 ubuntu-19.10-beta-desktop-amd64.iso

real 0m0,180s user 0m1,735s sys 0m0,092s

time meow_example ubuntu-19.10-beta-desktop-amd64.iso meow_example 0.5/calico - basic usage example of the Meow hash (C) Copyright 2018-2019 by Molly Rocket, Inc. (https://mollyrocket.com) See https://mollyrocket.com/meowhash for details.

  Hash of "ubuntu-19.10-beta-desktop-amd64.iso":
    E5CC524A-522BDE7E-ED34A277-5C7D73AF
real 0m0,766s user 0m0,119s sys 0m0,645s

>It is not general purpose crypto strength

This made me curious. Is it because at this stage it is a proposal that has not yet been verified/analysed or are there actual reasons that you know of that make this not "general purpose strong"?


I don't actually have proof that it isn't crypto strength. But comparing it to other algorithms that have been broken, it seems unlikely that it would hold given the rather modest amount of computation done.

I do believe that it meets the requirements for being a MAC function, and I'm completely certain that it is a great non-cryptographic hash function.


Is it possible to benchmark agaist blake2 etc. but where they have the same number of rounds, testing both for reducing blake2 and also increasing blake3? Also, in that vein, offering the version with more rounds could win over the "paranoid" for mostly being a faster Blake2 thanks to SIMD and extra features thanks to the Merkle tree?

smhasher results without the Rust version yet (which should be ~2x faster):

http://rurban.github.io/smhasher/doc/table.html

It's of course much faster as most of the other crypto hashes, but not faster than the hardware variants of SHA1-NI and SHA256-NI. About 4x faster than blake2.

Faster than SipHash, not faster than SipHash13.

The tests fail on MomentChi2 dramatically, which describe how good the user-provided random seed is mixed in. I tried by mixing a seed for IV[0], as with all other hardened crypto hashes, and for all 8 IV's, which didn't help. So I'm not convinced that a seeded IV is properly mixed in. Which is outside the usage pattern of a crypto or digest hash (b3sum), but inside a normal usage.

Rust staticlib is still in work, which would parallize the hashing in chunks for big keys. For small keys it should be even a bit slower. b3sum is so much faster, because it uses many more tricks, such as mmap.


  Benchmark #1: cat b1
    Time (mean ± σ):      1.076 s ±  0.007 s    [User: 5.3 ms, System: 1069.4 ms]
    Range (min … max):    1.069 s …  1.093 s    10 runs
  Benchmark #2: sha256sum b1
    Time (mean ± σ):      6.583 s ±  0.064 s    [User: 5.440 s, System: 1.137 s]
    Range (min … max):    6.506 s …  6.695 s    10 runs
  Benchmark #3: sha1sum b1
    Time (mean ± σ):      6.322 s ±  0.086 s    [User: 5.212 s, System: 1.103 s]
    Range (min … max):    6.214 s …  6.484 s    10 runs
  Benchmark #4: b2sum b1
    Time (mean ± σ):     13.184 s ±  0.108 s    [User: 12.090 s, System: 1.080 s]
    Range (min … max):   13.087 s … 13.382 s    10 runs
  Benchmark #5: b3sum b1
    Time (mean ± σ):     577.0 ms ±   5.4 ms    [User: 12.276 s, System: 0.669 s]
    Range (min … max):   572.4 ms … 587.0 ms    10 runs
  Benchmark #6: md5sum b1
    Time (mean ± σ):     14.851 s ±  0.175 s    [User: 13.717 s, System: 1.117 s]
    Range (min … max):   14.495 s … 15.128 s    10 runs
    
  Summary
    'b3sum b1' ran
      1.86 ± 0.02 times faster than 'cat b1'
     10.96 ± 0.18 times faster than 'sha1sum b1'
     11.41 ± 0.15 times faster than 'sha256sum b1'
     22.85 ± 0.28 times faster than 'b2sum b1'
     25.74 ± 0.39 times faster than 'md5sum b1'

gotdang that's some solid performance. (here running against 10GiB of random bytes; machine has the Sha ASM extensions, which is why sha256/sha1 perform so well)

edit: actually not a straight algo comparison, as b3sum here is heavily benefiting from multi-threading; without that it looks more like this:

  Benchmark #1: cat b1
    Time (mean ± σ):      1.090 s ±  0.007 s    [User: 2.9 ms, System: 1084.8 ms]
    Range (min … max):    1.071 s …  1.096 s    10 runs
   
  Benchmark #2: sha256sum b1
    Time (mean ± σ):      6.480 s ±  0.097 s    [User: 5.359 s, System: 1.115 s]
    Range (min … max):    6.346 s …  6.587 s    10 runs
   
  Benchmark #3: sha1sum b1
    Time (mean ± σ):      6.120 s ±  0.090 s    [User: 5.027 s, System: 1.082 s]
    Range (min … max):    5.979 s …  6.233 s    10 runs
   
  Benchmark #4: b2sum b1
    Time (mean ± σ):     12.866 s ±  0.208 s    [User: 11.722 s, System: 1.133 s]
    Range (min … max):   12.549 s … 13.124 s    10 runs
   
  Benchmark #5: b3sum b1
    Time (mean ± σ):      5.813 s ±  0.079 s    [User: 4.606 s, System: 1.202 s]
    Range (min … max):    5.699 s …  5.933 s    10 runs
   
  Benchmark #6: md5sum b1
    Time (mean ± σ):     14.355 s ±  0.184 s    [User: 13.305 s, System: 1.039 s]
    Range (min … max):   14.119 s … 14.605 s    10 runs
   
  Summary
    'cat b1' ran
      5.33 ± 0.08 times faster than 'b3sum b1'
      5.62 ± 0.09 times faster than 'sha1sum b1'
      5.95 ± 0.10 times faster than 'sha256sum b1'
     11.81 ± 0.21 times faster than 'b2sum b1'
     13.17 ± 0.19 times faster than 'md5sum b1'
still beating the dedicated sha extensions, but not nearly as dramatically.

Where is this useful?

I'm guessing not for password hashes simply because a fast hash is bad for passwords (makes brute forcing/rainbow tables easier).

So is this mostly just for file signing?


Anywhere where you do hashing of stuff and you'd like it to be faster. Content addressable systems use hashing to generate identifiers for content (where the same content has the same identifier, and different content for sure has different ID). If you have a 2GB file you want a identifier for, using BLAKE3 would make that a lot faster.

> Capable of verified streaming and incremental updates, again because it's a Merkle tree.

I don't really understand what this means (verified streaming + incremental updates), could someone clarify? Merkle Trees are simple (https://en.wikipedia.org/wiki/Merkle_tree for people who don't know)


From their BLAKE3 paper (https://github.com/BLAKE3-team/BLAKE3-specs/blob/master/blak...) they give an example of verified streaming in 6.4.

Basically to verify a video file using serial hash functions you need to download the entire video file before you can perform the hashing. In BLAKE3 you can verify each chunk of the video as it is being streamed because the hash internally is just a Merkle Tree.


As far as I can tell the crate doesn't expose APIs needed to construct or verify the sibling/uncle hashes that would be used in partial verification.

Verified streaming is implemented by https://github.com/oconnor663/bao.

Just thinking of writing such code myself is enough to get nightmares.

Thanks for the link mauricio! That's very interesting and makes a ton of sense. Gonna have a lot of fun playing around with BLAKE3 for different things.

Fast hashes are useful for signing, MACs (symmetric "signatures" so to speak), key derivation (HKDF and all kinds of Diffie-Hellman handshakes come to mind), as part of cryptographically secure PRNGs (though most of the world has moved on to stream ciphers for that instead) and probably more.

While programming, just try to think of a scenario where having a mapping between some kind of arbitrary data (and maybe a key) and a fixed-size, uniformly random-looking output could be useful. Opportunities to sprinkle some hashes on things come up quite often when you look for them.


So I’m not super familiar with things like this, but for example, WireGuard uses BLAKE2 for hashing. What level of undertaking would it be to move from BLAKE2 to BLAKE3 in regards to WireGuard? Can you just pop out BLAKE2 and pop in BLAKE3?

The two hashes aren't compatible, so a hash of the same message will yield two different hashes under BLAKE2 and BLAKE3.

As far as I can tell, BLAKE2 has effectively all the properties BLAKE3 has (arbitrary output length, keyed hash mode), so the upgrade for communications boils down to negotiating/determining which of the two hash functions to use over the wire (with all the downsides that come with agility of cryptographic primitives); for stored hashes, they have to be recomputed and replaced (or you could store a flag is BLAKE2/is BLAKE3 and update them as you touch hashes, kind of similar to how password hashes are swapped at login time).

Note that BLAKE3 existing doesn't break BLAKE2. It's perfectly fine to just keep trucking BLAKE2, it's just that BLAKE3 has better performance characteristics that make it very attractive.


Wireguard uses pre-shared keys, which are manually (or through some separate program) configured on each end, so one could in theory make "oh, and by the way we will use BLAKE3 instead of BLAKE2" part of that pre-shared key.

Wireguard would bump protocol revision, and the new protocol would just pop out Blake2 and use Blake3 hashes, yes.

Assuming wireguard hashes data shorter than 4k (i.e. most network packets), there is no reason to switch; BLAKE3 is only faster than BLAKE2 on data longer than 4k.

That isn't literally true; the reduced rounds make it faster on small inputs, too. And jumbo packets can be 4kB or 9000B or whatever, if wireguard is used on such an interface.

Does BLAKE3 reduce rounds vs BLAKE2s?

7 rounds instead of 10.

Though for Wireguard, you'd compete with Blake2b as well, which has the advantage of using 64-bit words. And if you want a fair comparison, you should reduce the rounds of Blake2b down to 8 (instead of 12), as recommended in Aumasson's "Too Much Crypto".

On a 64-bit machine, such a reduced Blake2b would be much faster than Blake3 on inputs greater than 128 bytes and smaller than 4Kib.


They address this in the paper, to some extent. With SIMD, you get 128, 256, or 512 bits of vector. You can either store 32x4, 32x8, 32x16, or 64x2, 64x4, 64x8 words. But either way you're processing N bits in parallel.

The concern about 64-bit machines and using 64-bit word sizes vs 32-bit word sizes really only matters if your 64-bit machine doesn't have SIMD vector extensions. (All amd64 hardware, for example, has at least SSE2.) And as they point out, being 32-bit native really helps on low-end 32-bit machines without SIMD intrinsics.

(Re: the hypothetical, if wireguard were to do a protocol revision and replace Blake2B with this, it would make sense to also replace Chacha20 with Chacha8 or 12 at the same time. I doubt the WG authors will do any such thing any time soon.)


I was talking about small-ish inputs, for which vectorisation wouldn't help.

Yes. It is addressed in TFA, as well as the current top-voted comment on the article.

> as part of cryptographically secure PRNGs (though most of the world has moved on to stream ciphers for that instead)

My understanding is that plenty of stream ciphers are based on hashes. For example each block of the stream can be hash(key + nonce + block counter + constants) that you xor with your plaintext (or don't, if you just want a CSPRNG).


BLAKE and Chacha are pretty intimately related; if this is faster, I don't see any reason to not use it as a CSPRNG over, say, Chacha20. You may have to be careful about not rolling the counter, unlike Chacha20 (which can easily be extended to have a 128-bit counter).

> unlike Chacha20 (which can easily be extended to have a 128-bit counter)

Is this frequently done in practice? The CSPRNG code for ChaCha20 I've looked at rotates the key itself using 32 out of every 768 bytes. In that case rolling the counter isn't a concern.


The 128-bit counter would work, and would remove the 32 bytes of overhead. The speed difference however is fairly negligible, and you would lose forward secrecy in the process (if your unrotated seed gets stolen, all past random numbers are revealed).

Now I wonder where this 768 bytes could possibly come from. It's only a multiple of 256, which can only take advantage of 128-bit vectors (4 blocks at a time). Ideally you want an 8 way parallelism (AVX2) or even 16 way parallelism (AVX-512). That is, either 512 byte blocks, or 1024 byte blocks.


> Now I wonder where this 768 bytes could possibly come from.

This is totally implementation defined, it's not required by the spec. As loeg says (below) I was looking at a reference implementation by djb. I did a quick skim of OpenBSD's arc4random (which also uses ChaCha20) and if I'm reading it correctly, it rekeys every 1024 bytes.

> Ideally you want an 8 way parallelism (AVX2) or even 16 way parallelism (AVX-512)

My guess is that 768 was thought to be a decent enough trade-off between maximum and average latency for calls to the CSPRNG. I wouldn't be surprised to see that most implementations that are optimized for specific CPU architectures use different values.


128-bit counter and key rotation are not mutually exclusive.

I believe the 768 byte figure comes from DJB's blog on fast key erasure[1]. Why he picked 768, I do not know.

[1]: https://blog.cr.yp.to/20170723-random.html


Oh, I see. Then people blindly copied it, without taking into account that Chacha20 has bigger blocks than AES, and could benefit from vector implementations (while AES has bit slicing or AES-NI).

I don't know about frequently; it is done at least once. (Of course, nothing about a wide counter prevents you from rotating the key, too, for forward secrecy properties.)

Hashes are useful in a wide variety of cryptographic applications, of which file signing is just one example. But you are correct, you would not want to use this by itself for password hashing. But that is true of any cryptographic hash. not just Blake3.

I'm not sure what you mean, do you mean something like Bcrypt on its own isn't enough for password hashes?

My understanding is that for password hashing you want two things - cryptographic security (can't go from hash -> password by attacking the algorithm ) - and you want it to be "as slow as is usable" to counter brute force attacks? Also per-password salts like bcrypt uses counters rainbow attacks don't hurt.

What else do you need to do to keep it secure ?


There are two different kinds of cryptographic hashes. It's confusing because there isn't standard terminology to distinguish between them, so I'm just going to call them Type 1 and Type 2. Type 1 hashes are designed to be fast and Type 2 hashes are designed to be slow. MD5, SHA, and Blake are Type 1 hashes. Bcrypt is a type 2 hash, as is PBKDF. Type 2 hashes are built using Type 1 hashes as components. Type 2 hashes also typically include additional features like salts that Type 1 hashes do not have.

The term you are looking for is “key derivation function.”

And there's the third and most important property: constant time. timing resistant. Usually not too fast and not too slow, but always in the same speed for each input. no shortcuts.

No, hashes do not need to be constant time because they aren't keyed. In fact, they cannot be constant time because they have to handle arbitrary-length input.

If your hash was not constant-time and constant-power, you could work out what was hashed via precise timings or power readings.

No.


https://en.wikipedia.org/wiki/Gish_gallop

The first two references you cite are attacks against a password hash. Blake3 is not a password hash, it is a general-purpose hash.

The third reference is not an attack on the hash function timing, it is an attack against hash table timing, which is not at all the same thing. From the abstract:

"Our attack does not rely on any weakness of a particular hash function and can work against any hash."

Once again: Blake3 is a general-purpose hash, not a password hash. General-purpose hashes cannot be constant-time because they must operate on inputs of arbitrary length.


Then I think you misunderstood rurban's post. He's saying it is a property cryptographic hash functions must have that they are constant time, constant power. Yes, all cryptographic hashes suitable for general-purpose use have this property. That's part of why they are good general-purpose hash functions.

And each round of Blake3 is constant-time, irrespective of input. We're talking about rounds (or, alternatively, constant-time for equivalent length input).


> He's saying it is a property cryptographic hash functions must have that they are constant time, constant power.

It's a little unclear whether rurban was saying that they must have this property, or merely that they do have it. But that is neither here nor there because...

> each round of Blake3 is constant-time

That is not the same thing as the entire algorithm being constant-time.

This whole thread has turned into a horrible mess.

Yes, all else being equal, constant time/power is nice to have. But the only circumstance under which it is necessary is if you are processing secret data in a situation where an adversary can potentially observe side channels. But this is true for any algorithm, not just hashes. Furthermore, most common application of hashing a secret is password hashing, and there is is much more important that the hash be expensive than that it be constant time and power.

But Blake3 is not a password hash. It can be used as a component of a password hash, and there its constant-round time becomes a useful property. But to emphasize this in the context in which rurban's comment appears is at best badly misleading.


Concurrent filesystems, high availability systems etc



Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact

Search: