
Bao: A verified streaming tool based on BLAKE3 - luu
https://github.com/oconnor663/bao
======
oconnor663
To stave off some confusion, Bao was originally two things: 1) a custom tree
hashing mode based on BLAKE2, and 2) an encoding format and verified streaming
implementation for that tree hash. The first half grew into BLAKE3, which we
announced on January 9. The Bao project itself is now focused on the second
half, with the internal tree hash defined to be BLAKE3.

------
rakoo
So if I understand correctly from the spec, the point is to be able to have
slices that include the content and the necessary metadata that confirms this
is a part of the original content. This is something that is already done with
Merkle tree hashes: specified in BitTorrent v2
([http://bittorrent.org/beps/bep_0052.html](http://bittorrent.org/beps/bep_0052.html))
(although I don't know if any client implements it) and already working in Dat
([https://datprotocol.github.io/how-dat-works](https://datprotocol.github.io/how-dat-works)):
when receiving a piece, you also receive

- its hash
- the hash of the brother, so you can derive the hash of the parent
- the hash of the uncle, so you can derive the grandparent
- etc.

Up until the last brother before the root, so you can verify the root hash.
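
The sibling-hash scheme described above can be sketched in a few lines of Python. This is only an illustration: `h` uses BLAKE2s from the standard library as a stand-in hash, and `verify_path` is a hypothetical helper, not the actual BitTorrent v2, Dat, or Bao wire format.

```python
import hashlib

def h(data):
    # Stand-in hash (BLAKE2s). Assumption for illustration only; none of the
    # protocols mentioned above are claimed to use exactly this construction.
    return hashlib.blake2s(data).digest()

def verify_path(piece, index, siblings, root):
    """Recompute the root from a piece plus its brother/uncle hashes."""
    node = h(piece)
    for sibling in siblings:
        if index % 2 == 0:
            node = h(node + sibling)  # current node is a left child
        else:
            node = h(sibling + node)  # current node is a right child
        index //= 2
    return node == root

# Build a toy 4-piece tree and a proof for piece number 1.
pieces = [b"a", b"b", b"c", b"d"]
leaves = [h(p) for p in pieces]
parent1 = h(leaves[0] + leaves[1])
parent2 = h(leaves[2] + leaves[3])
root = h(parent1 + parent2)

proof = [leaves[0], parent2]  # brother first, then uncle
print(verify_path(pieces[1], 1, proof, root))
```

The receiver only needs the trusted root plus one hash per tree level to check a single piece.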

In Bao it seems to be the opposite: the included hashes are the ones from the
direct ancestors? So how can Bao trust the whole chain? Or maybe there's
something in BLAKE3 that explains how "Chaining Values" fix it?

~~~
oconnor663
> This is something that is already done with merkle tree hashes

Yes, BLAKE3 is a Merkle tree, and what Bao is doing is conceptually similar to
BitTorrent. (BEP 30 is even cited in the BLAKE3 spec.) Bao's goal is to be a
general-purpose wire format and implementation library, for any application
that needs verified streaming. Applications that use BLAKE3 for generic file
hashing today can also start doing verified streaming at any point in the
future, without needing any other trusted metadata.

~~~
rakoo
Understood. My question was more about the exact content of a slice, apart
from the chunk: is it the sibling's hash and the uncle's hash? Something
else?

~~~
oconnor663
It's the set of parent nodes on the path from the root to the chunk(s), plus
the chunks themselves. Parent nodes contain the hashes / chaining values of
their two children, so you could also think of this as including sibling
hashes. For example, consider this tree representing a 4 KiB file:

    
    
                     root
                  /        \
           parent1         parent2
           /     \         /     \
        chunk1 chunk2   chunk3 chunk4
    

Say we build a slice containing only chunk2. The contents of that slice will
be:

    
    
        [4096][root][parent1][chunk2]
    

The parent node parent1 is [hash of chunk1][hash of chunk2], and the root node
is [hash of parent1][hash of parent2]. The 8-byte total length on the front
(4096) is the same as in the full encoding; it defines the structure of the
tree, so the decoder knows how many parent nodes to expect.

For more details:
[https://github.com/oconnor663/bao/blob/master/docs/spec.md](https://github.com/oconnor663/bao/blob/master/docs/spec.md)

This is my first time looking at the Dat protocol. Thanks for pointing that
out.

~~~
rakoo
OK, so the intermediary nodes are the concatenation of the children's hashes,
that's what I missed. Thanks a lot for the details, and even more for the rest
of your work!

------
ebg13
Is this functionally different than sending the hash of a block of bytes when
sending a block of bytes?

> _With a serial hash, the recipient would need to download the entire
> attachment to verify it_

This strikes me as trivially untrue, since, if the file is divided into
chunks, you can just verify each chunk separately.

Alternately, can this be used to generate the complete file hash from chunks
received out of order by a download using multiple parallel substreams without
writing all of them to a block device in order first?

~~~
rfk
There are plenty of mischievous things an attacker could do if you just verify
each chunk separately, including reordering or omitting chunks, and truncating
the file.
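
A small Python sketch of that point, under simplifying assumptions: each chunk travels with its own hash, BLAKE2s is used as a stand-in hash, and the "trusted root" is just a hash over the ordered list of chunk hashes (a flat list rather than a tree, to keep the example short). The names `naive_ok` and `root_of` are hypothetical.

```python
import hashlib

def h(data):
    return hashlib.blake2s(data).digest()  # stand-in hash for illustration

chunks = [b"intro", b"middle", b"ending"]
stream = [(c, h(c)) for c in chunks]       # each chunk sent with its own hash

def naive_ok(received):
    # Per-chunk verification: every chunk matches the hash it arrived with.
    return all(h(c) == digest for c, digest in received)

def root_of(chunk_list):
    # One trusted value that commits to every chunk AND to their order.
    return h(b"".join(h(c) for c in chunk_list))

trusted_root = root_of(chunks)

reordered = [stream[2], stream[0], stream[1]]
truncated = stream[:2]

print(naive_ok(reordered), naive_ok(truncated))                # both pass
print(root_of([c for c, _ in reordered]) == trusted_root)      # mismatch
print(root_of([c for c, _ in truncated]) == trusted_root)      # mismatch
```

Per-chunk checks alone accept the reordered and truncated streams; only the value that commits to position and count rejects them.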

~~~
ebg13
An attacker who can do those things could send you a different initial
checksum as well. How does this solve that?

~~~
rfk
Any sort of verification presumes that you have an initial trusted checksum
against which to verify, so I don't believe this solves the separate problem
of obtaining such a checksum.

IIUC your question here boils down to "what is the point of a hash tree?" as
opposed to e.g. a list of individual chunk hashes. The answer is that a hash
tree lets you verify an individual chunk by looking at O(log(N)) hashes,
rather than having to look at the hash of every chunk.
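
As a rough back-of-the-envelope comparison in Python, assuming 1024-byte chunks, 32-byte hashes, and a binary tree whose deepest path has ceil(log2(N)) parent nodes (the function names are hypothetical, and real encodings may differ in detail):

```python
HASH_LEN = 32
CHUNK = 1024

def num_chunks(file_len):
    return max(1, -(-file_len // CHUNK))  # ceiling division, at least 1 chunk

def flat_list_bytes(file_len):
    # A flat list needs every chunk hash before any one chunk can be trusted.
    return num_chunks(file_len) * HASH_LEN

def tree_path_bytes(file_len):
    # One parent node (two child hashes) per level on the deepest path.
    depth = (num_chunks(file_len) - 1).bit_length()  # ceil(log2(N))
    return depth * 2 * HASH_LEN

one_gib = 2 ** 30
print(flat_list_bytes(one_gib))  # 33554432 bytes (32 MiB)
print(tree_path_bytes(one_gib))  # 1280 bytes
```

Under these assumptions, checking one chunk of a 1 GiB file takes about 1.25 KiB of hashes with a tree, versus 32 MiB with a flat list.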

~~~
ebg13
Thanks. That is meaningful, but I still don't understand how "the recipient
can stream a video attachment, while still verifying each byte as it comes in"
isn't basically also true for sequential hashing with periodic chunk hashes.

------
0xcoffee
Can somebody explain the Outboard Mode? I don't really understand the example.
It decodes the input file using an outboard file, but the result is what? You
already have the input file?

~~~
oconnor663
The Bao encoding supports two modes, combined and outboard. In the combined
mode, the bytes of the tree are interleaved with the original file, producing
an encoded file that's about 6% larger. In the outboard mode, the bytes of the
tree are stored separately in their own file.
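
The "about 6%" figure can be checked with a little arithmetic. The sketch below assumes 1024-byte chunks, 64-byte parent nodes (two 32-byte hashes), one parent node fewer than the number of chunks, and an 8-byte length header; the function names are hypothetical.

```python
CHUNK = 1024
PARENT_NODE = 64   # two 32-byte child hashes
HEADER = 8         # total length prefix

def tree_bytes(file_len):
    # Bytes of header plus parent nodes; this is what an outboard file holds.
    chunks = max(1, -(-file_len // CHUNK))  # ceiling division
    return HEADER + (chunks - 1) * PARENT_NODE

def combined_size(file_len):
    # Combined mode: the same tree bytes interleaved with the file content.
    return tree_bytes(file_len) + file_len

print(combined_size(4096))               # 4296
print(tree_bytes(2 ** 30) / 2 ** 30)     # close to 64 / 1024 = 0.0625
```

For large files the overhead approaches 64 / 1024 = 6.25%, in both modes; the modes differ only in where those tree bytes are stored.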

Which mode to use depends on whether you need to keep the original, unencoded
file around on the server. If the only thing the server needs to do is serve
encoded bytes, the combined mode is more convenient. But if the server also
wants to serve the original file (to clients who trust the server and don't
need to verify any hashes), the outboard mode can avoid taking up double the
storage space.

The idea in the outboard example is that the decoder is running somewhere
else, not on the server, and it's streaming both the original file and the
outboard encoding simultaneously.

