Bao: A verified streaming tool based on BLAKE3 (github.com/oconnor663)
18 points by luu on Jan 15, 2020 | 17 comments



To stave off some confusion, Bao was originally two things: 1) a custom tree hashing mode based on BLAKE2, and 2) an encoding format and verified streaming implementation for that tree hash. The first half grew into BLAKE3, which we announced on January 9. The Bao project itself is now focused on the second half, with the internal tree hash defined to be BLAKE3.


So if I understand the spec correctly, the point is to be able to have slices that include the content plus the metadata needed to confirm that it is part of the original content. This is something that is already done with Merkle tree hashes: it's specified in BitTorrent v2 (http://bittorrent.org/beps/bep_0052.html) (although I don't know if any client implements it) and already working in Dat (https://datprotocol.github.io/how-dat-works): when receiving a piece, you also receive:

- its hash
- the hash of its sibling, so you can derive the hash of the parent
- the hash of the uncle, so you can derive the grandparent
- etc., up to the last sibling before the root, so you can verify the root hash

In Bao it seems to be the opposite: the included hashes are the ones from the direct ancestors? So how can Bao trust the whole chain? Or maybe there's something in BLAKE3 that explains how "chaining values" fix this?


> This is something that is already done with Merkle tree hashes

Yes, BLAKE3 is a Merkle tree, and what Bao is doing is conceptually similar to BitTorrent. (BEP 30 is even cited in the BLAKE3 spec.) Bao's goal is to be a general-purpose wire format and implementation library, for any application that needs verified streaming. Applications that use BLAKE3 for generic file hashing today can also start doing verified streaming at any point in the future, without needing any other trusted metadata.


Understood. My question was more about the exact contents of a slice, apart from the chunk itself: is it the sibling's hash and the uncle's hash? Something else?


It's the set of parent nodes on the path from the root to the chunk(s), plus the chunks themselves. Parent nodes contain the hashes / chaining values of their two children, so you could also think of this as including sibling hashes. For example, consider this tree representing a 4 KiB file:

                 root
              /        \
       parent1         parent2
       /     \         /     \
    chunk1 chunk2   chunk3 chunk4
Say we build a slice containing only chunk2. The contents of that slice will be:

    [4096][root][parent1][chunk2]
The parent node parent1 is [hash of chunk1][hash of chunk2], and the root node is [hash of parent1][hash of parent2]. The 8-byte total length on the front (4096) is the same as in the full encoding; it defines the structure of the tree, so the decoder knows how many parent nodes to expect.
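
To make that concrete, here's a toy decode sketch in Python. It uses BLAKE2b as a stand-in hash and skips the flags, counters, and keying that real BLAKE3 chunk and parent hashing involve, so it only illustrates the top-down verification order for this particular slice, not the actual algorithm:

    import hashlib, struct

    def h(data):
        # Stand-in hash; real BLAKE3 also mixes in domain flags
        # (CHUNK_START, CHUNK_END, PARENT, ROOT) and a chunk counter.
        return hashlib.blake2b(data, digest_size=32).digest()

    def verify_chunk2_slice(slice_bytes, trusted_root_hash):
        (length,) = struct.unpack("<Q", slice_bytes[:8])  # 4096; defines the tree shape
        root = slice_bytes[8:72]         # [hash of parent1][hash of parent2]
        parent1 = slice_bytes[72:136]    # [hash of chunk1][hash of chunk2]
        chunk2 = slice_bytes[136:1160]
        assert h(root) == trusted_root_hash  # root node matches the hash we already trust
        assert h(parent1) == root[:32]       # parent1 matches its slot in the root node
        assert h(chunk2) == parent1[32:]     # chunk2 matches its slot in parent1
        return chunk2                        # only now is the chunk safe to use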

For more details: https://github.com/oconnor663/bao/blob/master/docs/spec.md

This is my first time looking at the Dat protocol. Thanks for pointing that out.


OK, so the intermediate nodes are the concatenation of their children's hashes; that's what I missed. Thanks a lot for the details, and even more for the rest of your work!


> already working in Dat (https://datprotocol.github.io/how-dat-works)

I just looked over the doc at https://datprotocol.github.io/how-dat-works, particularly the section on the tree hash. It looks pretty reasonable, and definitely similar to BLAKE3.

I think one practical downside of the Dat approach is that root finalization is always done as a separate hashing step. So for example, if I hash the short string "hello world", I have to compute one chunk hash and then one root hash, for a total of two calls to the compression function. Compare that to BLAKE3, which will only do a single compression there, by setting both the chunk flags and the root flag for that one block. This difference is probably totally negligible for Dat's intended use case, where network round trip time is going to completely dwarf hashing time for small inputs. But for a general-purpose hash function you don't want to cut your small input performance in half.

A bigger philosophical difference is that Dat doesn't specify a particular chunk size. Again this makes sense for their use case, since you might want to experiment with fancy things like Rabin fingerprinting. Looks like they discuss that here: https://stackoverflow.com/a/51604110/823869. But for a cryptographic hash, it's important that every file has exactly one hash, and a fixed chunk size is required. Another upside of a fixed chunk size is that you really cut down on metadata; the 8-byte length at the front of an encoding or a slice determines the structure of the entire tree.
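
As a sketch of that last point, here's how a decoder could rebuild the tree shape from nothing but the length, assuming (as I understand the rule) 1 KiB chunks and a left subtree that always takes the largest power-of-two number of chunks while leaving at least one chunk on the right:

    CHUNK_SIZE = 1024

    def num_chunks(total_len):
        return max(1, (total_len + CHUNK_SIZE - 1) // CHUNK_SIZE)

    def left_chunks(chunks):
        # Largest power of two that still leaves a non-empty right subtree.
        power = 1
        while power * 2 < chunks:
            power *= 2
        return power

    def tree_shape(chunks):
        # Nested tuples standing in for the tree; a 1 means a single chunk.
        if chunks == 1:
            return 1
        left = left_chunks(chunks)
        return (tree_shape(left), tree_shape(chunks - left))

    print(tree_shape(num_chunks(4096)))  # ((1, 1), (1, 1)), the 4-chunk tree above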


Indeed, content-based chunking definitely has appeal; I'm thinking of all the backup tools that use it as a smart way to cut down on total size while keeping every snapshot independent of the others. On the other hand, having a well-defined format, and tools to play with like the ones you provide, is definitely better for getting this to spread, so kudos for it.


Is this functionally different from sending the hash of a block of bytes along with the block itself?

> With a serial hash, the recipient would need to download the entire attachment to verify it

This rings to me as trivially untrue, since, if the file is divided into chunks, you can just verify each chunk separately.

Alternatively, can this be used to generate the complete file hash from chunks received out of order by a download using multiple parallel substreams, without first writing all of them to a block device in order?


The idea is that you have the hash of some file, and you trust that hash somehow. (For example, maybe you stored it locally in the past, or someone important has signed it, or it's included in the Bitcoin blockchain.) But now you want to fetch the contents of the file from some peer you don't trust.

If you just hash each chunk separately, you have a storage problem. Let's say you have a 4 GiB file, about the size of a DVD, and you use 1 KiB chunks like BLAKE3 does. The set of chunk hashes is 128 MiB. You can't store that much data in any blockchain, so maybe you hash it again and store the meta-hash of all the hashes instead (basically a custom 1-level tree mode). Then when it's time to verify, you fetch the set of hashes from the peer as a first step. This is all doable, but now you have to download 128 MiB of chunk hashes before you can verify a single byte. You can try to use a larger chunk size to reduce this overhead, but that increases overhead elsewhere.
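
The arithmetic behind those numbers, for the record:

    file_len = 4 * 2**30              # 4 GiB
    chunk_len = 1024                  # 1 KiB chunks, as in BLAKE3
    chunks = file_len // chunk_len    # 4,194,304 chunks
    hash_len = 32                     # one 32-byte hash per chunk
    print(chunks * hash_len / 2**20)  # 128.0 -> 128 MiB of chunk hashes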


There are plenty of mischievous things an attacker could do if you just verify each chunk separately, including reordering or omitting chunks, and truncating the file.


An attacker who can do those things could send you a different initial checksum as well. How does this solve that?


Any sort of verification presumes that you have an initial trusted checksum against which to verify, so I don't believe this solves the separate problem of obtaining such a checksum.

IIUC, your question here boils down to "what is the point of a hash tree?" as opposed to, e.g., a flat list of individual chunk hashes. The answer is that a hash tree lets you verify an individual chunk by looking at O(log(N)) intermediate hashes, rather than having to look at the hash of every chunk.
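
To put rough numbers on that, reusing the 4 GiB / 1 KiB-chunk example from upthread (the exact constants depend on the format, but the shape of the comparison is the point):

    import math

    chunks = (4 * 2**30) // 1024             # 4,194,304 chunks
    flat_list = chunks * 32                  # full list of chunk hashes: 128 MiB
    depth = math.ceil(math.log2(chunks))     # 22 levels of parent nodes
    proof = depth * 64                       # one 64-byte parent per level: ~1.4 KiB
    print(flat_list // 2**20, depth, proof)  # 128 22 1408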


Thanks. That is meaningful, but I still don't understand how "the recipient can stream a video attachment, while still verifying each byte as it comes in" isn't basically also true for sequential hashing with periodic chunk hashes.


Can somebody explain the outboard mode? I don't really understand the example. It decodes the input file using an outboard file, but what is the result? You already have the input file?


The Bao encoding supports two modes, combined and outboard. In the combined mode, the bytes of the tree are interleaved with the original file, producing an encoded file that's about 6% larger. In the outboard mode, the bytes of the tree are stored separately in their own file.
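
Roughly where the ~6% figure comes from: a tree over N chunks has N - 1 parent nodes, so in the limit every 1024-byte chunk pays for about one 64-byte parent node:

    chunk_len = 1024
    parent_len = 64                # two 32-byte child chaining values
    print(parent_len / chunk_len)  # 0.0625, i.e. the ~6% encoding overhead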

Which mode to use depends on whether you need to keep the original, unencoded file around on the server. If the only thing the server needs to do is serve encoded bytes, the combined mode is more convenient. But if the server also wants to serve the original file (to clients who trust the server and don't need to verify any hashes), the outboard mode can avoid taking up double the storage space.

The idea in the outboard example is that the decoder is running somewhere else, not on the server, and it's streaming both the original file and the outboard encoding simultaneously.


I believe it's more like a verification -- the output should be an error code that is 0 if all is fine and 1 if there's an error.



