
The BLAKE3 cryptographic hash function - erwan
https://github.com/BLAKE3-team/BLAKE3
======
s_tec
It looks like the speedup is coming from two main changes.

The first change is reducing the number of rounds from 10 to 7. Think of it
like making a smoothie - you add bits of fruit to the drink (the input data),
then pulse the blades to blend it up (making the output hash). This change
basically runs the blades for 7 seconds instead of 10 seconds each time they
add fruit. They cite evidence that the extra 3 seconds aren't doing much -
once the fruit's fully liquid, extra blending doesn't help - but I worry that
this reduces the security margin. Maybe those extra 3 rounds aren't useful
against current attacks, but they may be useful against unknown future
attacks.

The other change they make is to break the input into 1KiB chunks, then hash
each chunk independently. Finally, they combine the individual chunk hashes
into a single big hash using a binary tree. The benefit is that if you have
4KiB of data, you can use 4-way SIMD instructions to process all four chunks
simultaneously. The more data you have, the more parallelism you can unlock,
unlike traditional hash functions, which process everything sequentially. And
there's no real downside: modern SIMD instructions can handle two 32-bit
operations just as fast as one 64-bit operation, so building the algorithm out
of 32-bit arithmetic costs nothing on big CPUs, while giving a big boost to
low-end 32-bit CPUs that struggle with 64-bit arithmetic. The tree structure
is a big win overall.
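
The chunk-and-tree idea can be sketched in a few lines. This is a toy
illustration only, not BLAKE3's actual tree mode (the real thing
domain-separates chunks from parent nodes, tracks chunk counters, and uses its
own compression function); here the stdlib's blake2s stands in for the
compression step:

```python
import hashlib

CHUNK = 1024  # 1 KiB chunks, as in BLAKE3

def toy_tree_hash(data: bytes) -> bytes:
    # Hash each 1 KiB chunk independently. These are the tree's leaves,
    # and on real hardware they could all be hashed in parallel via SIMD.
    leaves = [hashlib.blake2s(data[i:i + CHUNK]).digest()
              for i in range(0, len(data), CHUNK)]
    if not leaves:  # empty input still needs one leaf
        leaves = [hashlib.blake2s(b"").digest()]
    # Combine pairs of subtree hashes until a single root remains.
    while len(leaves) > 1:
        leaves = [hashlib.blake2s(leaves[i] + leaves[i + 1]).digest()
                  if i + 1 < len(leaves) else leaves[i]
                  for i in range(0, len(leaves), 2)]
    return leaves[0]
```

With 4 KiB of input this builds 4 leaves, which a SIMD implementation could
hash four-wide before the two levels of combining.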

~~~
zokier
> but I worry that this reduces the security margin. Maybe those extra 3
> rounds aren't useful against current attacks, but they may be useful against
> unknown future attacks.

This was covered in more detail in the earlier "Too Much Crypto" paper [1],
which argued that many standards use excessively high round counts. Note that
Aumasson is an author of both Blake3 and Too Much Crypto.

[1]
[https://news.ycombinator.com/item?id=21917505](https://news.ycombinator.com/item?id=21917505)

~~~
labawi
From paper:

> Our goal is to propose numbers of rounds for which we have strong confidence
> that the algorithm will never be wounded

They take algorithms plus the past 10 years of _public_ crypto research, and
shave off rounds until the algorithm just about starts falling apart. AFAIU,
having security-reducing attacks is the target.

I prefer to have ample confidence in my crypto algorithms. Would not recommend
BLAKE3 (without those extra rounds).

------
clarkmoody
Another benchmark:

time openssl sha256 /tmp/bigfile

      real  0m28.160s
      user  0m27.750s
      sys   0m0.272s

time shasum -a 256 /tmp/bigfile

      real  0m6.146s
      user  0m5.407s
      sys   0m0.560s

time b2sum /tmp/bigfile

      real  0m1.732s
      user  0m1.450s
      sys   0m0.244s

time b3sum /tmp/bigfile

      real  0m0.212s
      user  0m0.996s
      sys   0m0.379s
TIL OpenSSL sha256 invocation is really slow compared to the shasum program.
Also BLAKE3 is _really_ fast.

Edit: bigfile is 1GB of /dev/random

~~~
paavoova
On my machine running Ubuntu 18.04 (coreutils 8.28, openssl 1.1.1), openssl is
faster than both shasum and sha256sum.

~~~
xemdetia
Yeah, as someone familiar with OpenSSL: that looks like a copy of OpenSSL that
was built incorrectly.

------
dpc_pw
So this is bao + blake2?

I remember watching Bao, a general purpose cryptographic tree hash, and
perhaps the fastest hash function in the world:
[https://www.youtube.com/watch?v=Dya9c2DXMqQ](https://www.youtube.com/watch?v=Dya9c2DXMqQ)
a while ago.

Nice job!

~~~
oconnor663
Yep that's me :) The Bao project evolved into BLAKE3, and the latest version
-- which I literally just released -- is now based on BLAKE3.

~~~
dpc_pw
Excuse my confusion. I understand "the Bao project evolved into BLAKE3", but
"is now based on BLAKE3" confuses me. Bao is based on blake3? But isn't bao
... the blake3 itself now? Circular dependency detected.

~~~
oconnor663
Ha, yes, that's confusing. The Bao project was originally two things: 1) a
custom tree hash mode, and 2) an encoding format and verified streaming
implementation based on that tree hash. The first half evolved into BLAKE3.
Now the Bao project itself is just the second half.

~~~
prilanoth
Hi, some questions...

The README lists 4 designers, including yourself. However the Bao project
doesn't list anybody, so presumably you are the only designer. What exactly
were the contributions of the other 3 people to warrant being listed?

At what point did the Bao project become "BLAKE3" and why?

~~~
loeg
All three others are principals of the Blake or Blake2 design and major
implementations.

------
loeg
Looks like they've taken _Too Much Crypto_ to heart[1] and dropped the number
of rounds from Blake2B's 12 down to 7 for Blake3:

[https://github.com/BLAKE3-team/BLAKE3/blob/master/reference_...](https://github.com/BLAKE3-team/BLAKE3/blob/master/reference_impl/reference_impl.rs#L83-L95)

[https://github.com/BLAKE2/BLAKE2/blob/master/ref/blake2b-ref...](https://github.com/BLAKE2/BLAKE2/blob/master/ref/blake2b-ref.c#L200-L211)

Which, yeah, that alone will get you a significant improvement over Blake2B.
But it definitely doesn't account for the huge improvement they're showing.
Most of that, I think, is the ability to take advantage of AVX512 parallelism;
the difference will be more incremental on AVX2-only amd64 or other platforms.

[1]: Well, TMC recommended 8 rounds for Blake2B and 7 for Blake2S.

~~~
zokier
> Looks like they've taken Too Much Crypto to heart

Not surprising, considering that one of them is the author of Too Much Crypto.

~~~
loeg
Indeed. It's also the 3rd citation in their formal spec.

------
kzrdude
Just one variant, that's refreshing. And performance is impressive. What's the
short input performance like? Say for 64 bytes of input.

~~~
oconnor663
The Performance section of the spec
([https://github.com/BLAKE3-team/BLAKE3-specs/blob/master/blak...](https://github.com/BLAKE3-team/BLAKE3-specs/blob/master/blake3.pdf))
has a paragraph about short input performance. 64 bytes happens to be the
BLAKE3 block size, and performance at that length or shorter is best in class.
Look at the left edge of Figure 3
([https://i.imgur.com/smGHAKA.png](https://i.imgur.com/smGHAKA.png)).

~~~
nabla9
64 bytes also happens to be the typical cache line size, so it makes sense to
use it as a block size.

------
nabla9
That's impressive speedup. I just installed it and holy moly it really is
fast. All those extra cores can finally get busy. ;)

b3sum -l 256 big-2.6Gfile

      real  0m0.384s
      user  0m2.302s
      sys   0m0.175s

b2sum -l 256 big-2.6Gfile

      real  0m3.616s
      user  0m3.360s
      sys   0m0.256s

(Intel® Core™ i7-8550U CPU @ 1.80GHz × 8)

EDIT: ah, the catch. blake3 targets 128 bit security. It competes with SipHash
for speed and security

EDIT2 scratch the previous edit.

~~~
oconnor663
> ah, the catch. blake3 targets 128 bit security. It competes with SipHash for
> speed and security.

No no, BLAKE3 is a general-purpose cryptographic hash just like BLAKE2, SHA-2,
and SHA-3. The confusion here is that a hash function's security level is half
of its output size, because of the birthday problem. BLAKE3, like BLAKE2s and
SHA-256, has a 256-bit output and a 128-bit security level. (BLAKE3 also
supports extendable output, but that doesn't affect the security level.)
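
The birthday effect is easy to see empirically. A toy sketch (truncating
SHA-256 so a collision is actually reachable; the 3-byte length is an
arbitrary choice): with an n-bit output, a collision is expected after roughly
2^(n/2) attempts, which is why a 256-bit output yields a 128-bit collision
security level.

```python
import hashlib
from itertools import count

def truncated(data: bytes, nbytes: int) -> bytes:
    # SHA-256 truncated to nbytes, standing in for a small-output hash.
    return hashlib.sha256(data).digest()[:nbytes]

def collision_effort(nbytes: int) -> int:
    # Count how many inputs we hash before any two collide.
    seen = set()
    for i in count():
        d = truncated(i.to_bytes(8, "big"), nbytes)
        if d in seen:
            return i
        seen.add(d)

# 24-bit output: a collision shows up after on the order of
# 2**12 = 4096 attempts, far short of the 2**24 a preimage would need.
effort = collision_effort(3)
```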

> holy moly it really is fast

Thank you :)

~~~
Thorrez
>security level is half of its output size

A hash can have different security levels against different attacks. BLAKE3
appears to have 128 bits of security against all attacks.

SHA3-256 was originally designed to have 128 bits of collision security and
256 bits of preimage security. NIST then made a change to it giving it 128
bits of security against all attacks. A lot of people got mad. Then NIST caved
and changed it back to 128 bits of collision security and 256 bits of preimage
security.

It looks like BLAKE3 agrees with how NIST wanted SHA3 to be. I wonder if
people will be mad at BLAKE3.

[https://en.wikipedia.org/wiki/SHA-3#Capacity_change_controve...](https://en.wikipedia.org/wiki/SHA-3#Capacity_change_controversy)

For a more fair performance comparison against SHA3, you should compare
against SHAKE128(256). That is, the version with 128 bits of security all
around and a 256 bit output (how NIST wanted it). Although maybe it's
pointless, because according to Wikipedia SHAKE128(256) is only 8% faster than
SHA3-256 for large inputs.

~~~
pingyong
>Although maybe it's pointless, because according to Wikipedia SHAKE128(256)
is only 8% faster than SHA3-256 for large inputs.

This is mainly due to SHA3's humongous 1600-bit state, which is not very
friendly to embedded systems. In sponge constructions with smaller states, or
generally primitives with smaller states, the difference is much larger.

Also in general I would say that small message performance is usually more
important than large message performance, since large messages with
desktop/laptop CPUs are so incredibly fast anyway with most hash functions
that the bottleneck goes somewhere else. (Storage, network, etc.)

~~~
Thorrez
I mentioned large message performance because that appeared to be what
BLAKE3's benchmarks were focusing on.

------
memco
I would be interested in how this compares in smhasher against some of the
other fastest hash competitors, like meow hash or xxhash.

~~~
loeg
I am also curious about how it performs as a PRF in places where e.g. Chacha20
is used as a keystream generator now. Also as a reduced round variant in
places where non-cryptographic PRNGs are used for very fast RNG needs: JSF,
SFC, Lehmer, Splitmix, PCG.

In my extremely limited testing (on AVX2, but not AVX512 hardware), (buffered)
reduced (four) round Chacha is only about 1.5-2x slower than fast non-
cryptographic PRNGs like JSF, SFC, Lehmer, or pcg64_fast (all with Clang -O2
-flto, the fast PRNGs are header-only implementations and only chacha is two
files).

This thing still uses 7 rounds, but that is easy to tune down. Very neat.

~~~
loup-vaillant
The "too much crypto" paper linked in the specs recommends to lower Chacha20
down to 8 rounds.

Blake3 wouldn't compete with Chacha20, it would compete with Chacha8.

~~~
oconnor663
Note that a "round" in BLAKE/2/3 is equivalent to a "double-round" in ChaCha.
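
For the curious, the structural reason, transcribed from the reference
implementation linked elsewhere in the thread (worth double-checking against
reference_impl.rs): BLAKE3's G mixing function performs two ChaCha-style
add-rotate-xor passes, injecting one message word into each, so one BLAKE
round does the work of a ChaCha double-round.

```python
MASK = 0xFFFFFFFF  # all arithmetic is mod 2**32

def rotr32(x: int, n: int) -> int:
    return ((x >> n) | (x << (32 - n))) & MASK

def g(state: list, a: int, b: int, c: int, d: int, mx: int, my: int) -> None:
    # First half: a ChaCha-like quarter-round, plus message word mx.
    state[a] = (state[a] + state[b] + mx) & MASK
    state[d] = rotr32(state[d] ^ state[a], 16)
    state[c] = (state[c] + state[d]) & MASK
    state[b] = rotr32(state[b] ^ state[c], 12)
    # Second half: another quarter-round, plus message word my.
    state[a] = (state[a] + state[b] + my) & MASK
    state[d] = rotr32(state[d] ^ state[a], 8)
    state[c] = (state[c] + state[d]) & MASK
    state[b] = rotr32(state[b] ^ state[c], 7)
```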

~~~
loeg
Ah, ok. So 7-round Blake3 is perhaps closest to 14-round Chacha.

------
eyegor
I can't seem to find any non-Rust implementations in the works yet, so I may
sit down and adapt the reference to C# this weekend. Anyone know how the
single-/few-thread performance holds up, excluding AVX512?

~~~
cesarb
There's a non-Rust implementation in the same repository, at
[https://github.com/BLAKE3-team/BLAKE3/tree/master/c](https://github.com/BLAKE3-team/BLAKE3/tree/master/c)
(in C).

~~~
rurban
Yes, but it's only about half as fast as the Rust version, because the Rust
version processes the chunks in parallel.

I'm working on exporting the Rust version to C, so all of them can be compared
properly.

------
rurban
smhasher results without the Rust version yet (which should be ~2x faster):

[http://rurban.github.io/smhasher/doc/table.html](http://rurban.github.io/smhasher/doc/table.html)

It's of course much faster than most of the other crypto hashes, but not
faster than the hardware variants SHA1-NI and SHA256-NI. About 4x faster than
blake2.

Faster than SipHash, but not faster than SipHash13.

The tests fail dramatically on MomentChi2, which describes how well a
user-provided random seed is mixed in. I tried mixing the seed into IV[0], as
with all other hardened crypto hashes, and into all 8 IVs, which didn't help.
So I'm not convinced that a seeded IV is properly mixed in. That's outside the
usage pattern of a crypto or digest hash (b3sum), but within normal usage for
a seeded hash.

The Rust staticlib is still in progress; it would parallelize the hashing in
chunks for big keys. For small keys it should even be a bit slower. b3sum is
so much faster because it uses many more tricks, such as mmap.
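
The mmap trick is easy to reproduce. A minimal sketch in Python, using the
stdlib's blake2b as a stand-in (blake3 isn't in the stdlib): mapping the file
lets the hash read pages straight from the page cache, skipping the copies
that read() makes into userspace buffers.

```python
import hashlib
import mmap
import os
import tempfile

def hash_file_mmap(path: str) -> str:
    # Map the whole file and hash the mapping in one pass.
    with open(path, "rb") as f, \
         mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        return hashlib.blake2b(m).hexdigest()

def hash_file_read(path: str, bufsize: int = 1 << 20) -> str:
    # Conventional buffered-read hashing, for comparison.
    h = hashlib.blake2b()
    with open(path, "rb") as f:
        while chunk := f.read(bufsize):
            h.update(chunk)
    return h.hexdigest()

# Demo on a scratch file of random bytes; both paths must agree.
fd, path = tempfile.mkstemp()
os.write(fd, os.urandom(1 << 20))
os.close(fd)
mmap_digest = hash_file_mmap(path)
read_digest = hash_file_read(path)
os.unlink(path)
```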

------
bjoli
Does anybody know of benchmarks for ARM? Or any research trying to break it?

The numbers look astonishing.

~~~
oconnor663
Take a look at Figure 5
([https://i.imgur.com/Izs23wf.png](https://i.imgur.com/Izs23wf.png)) in the
spec
([https://github.com/BLAKE3-team/BLAKE3-specs/blob/master/blak...](https://github.com/BLAKE3-team/BLAKE3-specs/blob/master/blake3.pdf)).
That benchmark was done on a Raspberry Pi Zero, which is a 32-bit ARM1176.

~~~
loeg
Do you have single-thread cpb benchmark figures on amd64 hardware without
AVX512?

Clearly the benefits of AVX512 really exaggerate the comparison on hardware
that supports it, and the benefit over Blake2S is pretty muted on hardware
without vector intrinsics (low end 32-bit ARM). But I'm interested in the
middle — e.g., Zen1/2 AMD, Broadwell and earlier Intel x86-64.

Thanks!

------
anticensor
Why are there no AES-like hashing algorithms out there? AES design is very
suitable to be used as a building block in a hash if you remove "add round
key" operation.

~~~
NohatCoder
I helped design Meow Hash using AES-NI. It is not general purpose crypto
strength, but ridiculously fast, targeting a theoretical performance of 16
bytes per cycle on some processors, too fast for memory to keep up.
[https://github.com/cmuratori/meow_hash](https://github.com/cmuratori/meow_hash)

~~~
mr__y
>It is not general purpose crypto strength

This made me curious. Is it because at this stage it is a proposal that has
not yet been verified/analysed or are there actual reasons that you know of
that make this not "general purpose strong"?

~~~
NohatCoder
I don't actually have proof that it isn't crypto strength. But comparing it to
other algorithms that have been broken, it seems unlikely that it would hold
given the rather modest amount of computation done.

I do believe that it meets the requirements for being a MAC function, and I'm
completely certain that it is a great non-cryptographic hash function.

------
babel_
Is it possible to benchmark against blake2 etc. with the same number of
rounds, testing both a reduced-round blake2 and an increased-round blake3? In
that vein, could a version with more rounds win over the "paranoid" crowd, by
being essentially a faster Blake2 thanks to SIMD, with extra features thanks
to the Merkle tree?

------
ptomato

      Benchmark #1: cat b1
        Time (mean ± σ):      1.076 s ±  0.007 s    [User: 5.3 ms, System: 1069.4 ms]
        Range (min … max):    1.069 s …  1.093 s    10 runs
      Benchmark #2: sha256sum b1
        Time (mean ± σ):      6.583 s ±  0.064 s    [User: 5.440 s, System: 1.137 s]
        Range (min … max):    6.506 s …  6.695 s    10 runs
      Benchmark #3: sha1sum b1
        Time (mean ± σ):      6.322 s ±  0.086 s    [User: 5.212 s, System: 1.103 s]
        Range (min … max):    6.214 s …  6.484 s    10 runs
      Benchmark #4: b2sum b1
        Time (mean ± σ):     13.184 s ±  0.108 s    [User: 12.090 s, System: 1.080 s]
        Range (min … max):   13.087 s … 13.382 s    10 runs
      Benchmark #5: b3sum b1
        Time (mean ± σ):     577.0 ms ±   5.4 ms    [User: 12.276 s, System: 0.669 s]
        Range (min … max):   572.4 ms … 587.0 ms    10 runs
      Benchmark #6: md5sum b1
        Time (mean ± σ):     14.851 s ±  0.175 s    [User: 13.717 s, System: 1.117 s]
        Range (min … max):   14.495 s … 15.128 s    10 runs
        
      Summary
        'b3sum b1' ran
          1.86 ± 0.02 times faster than 'cat b1'
         10.96 ± 0.18 times faster than 'sha1sum b1'
         11.41 ± 0.15 times faster than 'sha256sum b1'
         22.85 ± 0.28 times faster than 'b2sum b1'
         25.74 ± 0.39 times faster than 'md5sum b1'
    
    

gotdang that's some solid performance. (Here running against 10GiB of random
bytes; the machine has the SHA ASM extensions, which is why sha256/sha1
perform so well.)

edit: actually not a straight algo comparison, as b3sum here is heavily
benefiting from multi-threading; without that it looks more like this:

    
    
      Benchmark #1: cat b1
        Time (mean ± σ):      1.090 s ±  0.007 s    [User: 2.9 ms, System: 1084.8 ms]
        Range (min … max):    1.071 s …  1.096 s    10 runs
       
      Benchmark #2: sha256sum b1
        Time (mean ± σ):      6.480 s ±  0.097 s    [User: 5.359 s, System: 1.115 s]
        Range (min … max):    6.346 s …  6.587 s    10 runs
       
      Benchmark #3: sha1sum b1
        Time (mean ± σ):      6.120 s ±  0.090 s    [User: 5.027 s, System: 1.082 s]
        Range (min … max):    5.979 s …  6.233 s    10 runs
       
      Benchmark #4: b2sum b1
        Time (mean ± σ):     12.866 s ±  0.208 s    [User: 11.722 s, System: 1.133 s]
        Range (min … max):   12.549 s … 13.124 s    10 runs
       
      Benchmark #5: b3sum b1
        Time (mean ± σ):      5.813 s ±  0.079 s    [User: 4.606 s, System: 1.202 s]
        Range (min … max):    5.699 s …  5.933 s    10 runs
       
      Benchmark #6: md5sum b1
        Time (mean ± σ):     14.355 s ±  0.184 s    [User: 13.305 s, System: 1.039 s]
        Range (min … max):   14.119 s … 14.605 s    10 runs
       
      Summary
        'cat b1' ran
          5.33 ± 0.08 times faster than 'b3sum b1'
          5.62 ± 0.09 times faster than 'sha1sum b1'
          5.95 ± 0.10 times faster than 'sha256sum b1'
         11.81 ± 0.21 times faster than 'b2sum b1'
         13.17 ± 0.19 times faster than 'md5sum b1'
    

still beating the dedicated sha extensions, but not nearly as dramatically.

------
cogman10
Where is this useful?

I'm guessing not for password hashes simply because a fast hash is bad for
passwords (makes brute forcing/rainbow tables easier).

So is this mostly just for file signing?

~~~
beefhash
Fast hashes are useful for signing, MACs (symmetric "signatures" so to speak),
key derivation (HKDF and all kinds of Diffie-Hellman handshakes come to mind),
as part of cryptographically secure PRNGs (though most of the world has moved
on to stream ciphers for that instead) and probably more.

While programming, just try to think of a scenario where having a mapping
between some kind of arbitrary data (and maybe a key) and a fixed-size,
uniformly random-looking output could be useful. Opportunities to sprinkle
some hashes on things come up quite often when you look for them.
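
As a concrete example of the MAC use case: BLAKE2 (and BLAKE3) support keyed
hashing natively, so no HMAC wrapper is required. A sketch with Python's
stdlib (the message and 32-byte key size are arbitrary choices here):

```python
import hashlib
import hmac  # used only for its constant-time comparison helper
import secrets

key = secrets.token_bytes(32)  # shared symmetric secret

def mac(message: bytes, key: bytes) -> bytes:
    # Keyed BLAKE2b acts as a MAC directly, unlike plain SHA-2,
    # which needs the HMAC construction to be safe as a MAC.
    return hashlib.blake2b(message, key=key, digest_size=32).digest()

def verify(message: bytes, key: bytes, tag: bytes) -> bool:
    return hmac.compare_digest(mac(message, key), tag)

tag = mac(b"handshake payload", key)
```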

~~~
lame-robot-hoax
So I’m not super familiar with things like this, but for example, WireGuard
uses BLAKE2 for hashing. What level of undertaking would it be to move from
BLAKE2 to BLAKE3 in regards to WireGuard? Can you just pop out BLAKE2 and pop
in BLAKE3?

~~~
aidenn0
Assuming wireguard hashes data shorter than 4k (i.e. most network packets),
there is no reason to switch; BLAKE3 is only faster than BLAKE2 on data longer
than 4k.

~~~
loeg
That isn't literally true; the reduced rounds make it faster on small inputs,
too. And jumbo packets can be 4kB or 9000B or whatever, if wireguard is used
on such an interface.

~~~
aidenn0
Does BLAKE3 reduce rounds vs BLAKE2s?

~~~
loup-vaillant
7 rounds instead of 10.

Though for Wireguard, you'd be competing with Blake2b as well, which has the
advantage of using 64-bit words. And for a fair comparison, you should reduce
Blake2b's rounds to 8 (instead of 12), as recommended in Aumasson's "Too Much
Crypto".

On a 64-bit machine, such a reduced Blake2b would be much faster than Blake3
on inputs greater than 128 bytes and smaller than 4 KiB.

~~~
loeg
They address this in the paper, to some extent. With SIMD, you get 128, 256,
or 512 bits of vector. You can either store 32x4, 32x8, 32x16, or 64x2, 64x4,
64x8 words. But either way you're processing N bits in parallel.

The concern about 64-bit machines and using 64-bit word sizes vs 32-bit word
sizes really only matters if your 64-bit machine doesn't have SIMD vector
extensions. (All amd64 hardware, for example, has at _least_ SSE2.) And as
they point out, being 32-bit native really helps on low-end 32-bit machines
without SIMD intrinsics.

(Re: the hypothetical, if wireguard were to do a protocol revision and replace
Blake2B with this, it would make sense to also replace Chacha20 with Chacha8
or 12 at the same time. I doubt the WG authors will do any such thing any time
soon.)

~~~
loup-vaillant
I was talking about small-ish inputs, for which vectorisation wouldn't help.

