
Can someone explain Merkle Trees to me in a practical sense? - traceroute66
I've been spending much time in the company of <insert_your_favoured_search_engine> and can only find highly theoretical, often mathematical answers to the question.

The problem I've got is I can't wrap my head around how Merkle trees work in a practical sense.

The advertised beauty of Merkle trees is that you can verify the integrity of a branch without the entire tree.

Ok, fine, I get it, but how does that work in practice?

Let's take the well-known example of AWS Glacier. I give Amazon a calculated Merkle tree root and the BLOB.

So far so good, but what I don't get is: how does that help Amazon verify the integrity of a portion ("branch" in Merkle-speak) of a file?

I know that the tree root I calculated is built from the concatenation of various parts. And given the tree is built from concatenated cryptographic hashes, you can't exactly extract the branches from the root hash.

Does it work by brute force (i.e. when you first send them the file, Amazon calculates the branches at the point of receipt and then stores the hashes of those branches ad infinitum), or is it more clever than that?

That's my point. I get the supposed benefits. I understand how you calculate the root. But I just can't fathom how you magically derive the branches from the root without going through the steps of calculating the branches first.

I hope this is not the wrong place to ask this sort of question!
======
jjirsa
An easy example of this is the implementation in Apache Cassandra

In Cassandra, we use MTs during active anti-entropy repair, where we have N
replicas of some data and want to make sure they’re in sync

The repair command will repair a range of data - in real terms, it’s repairing
all keys between two tokens, where the token is the murmur3 hash of the key.
The whole cluster covers 2^64, but a typical repair will repair perhaps 2^40
or smaller.

When you’re repairing 2^40 keys, you don’t want to compare each of them one by
one - if they match, you’re sending a ton of data across the network for no
reason. Instead, if each replica builds a Merkle tree representing the data
for that range, you can not only tell quickly if they’re in sync (comparing
the roots), but you can descend to identify the minimal set of data to stream
between replicas.

The Cassandra implementation isn’t a textbook implementation of the concept,
but it’s easy to think about.
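That descend-and-compare idea can be sketched in a few lines of Python. This is a toy over in-memory hashes, not Cassandra's actual implementation, and it assumes a power-of-two number of leaves for brevity:

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_tree(leaves):
    """Build a binary Merkle tree bottom-up; returns the levels, leaves first.
    Assumes len(leaves) is a power of two, for brevity."""
    levels = [list(leaves)]
    while len(levels[-1]) > 1:
        lvl = levels[-1]
        levels.append([h(lvl[i] + lvl[i + 1]) for i in range(0, len(lvl), 2)])
    return levels

def diverging_leaves(tree_a, tree_b):
    """Descend from the roots, following only subtrees whose hashes differ."""
    suspects = [0]  # node indices at the current level, starting at the root
    for depth in range(len(tree_a) - 1, 0, -1):
        next_suspects = []
        for i in suspects:
            if tree_a[depth][i] != tree_b[depth][i]:
                next_suspects += [2 * i, 2 * i + 1]
        suspects = next_suspects
    return [i for i in suspects if tree_a[0][i] != tree_b[0][i]]
```

If the roots match, the loop terminates immediately with nothing to stream; otherwise only the differing subtrees are explored, which is what makes the comparison cheap when replicas are mostly in sync.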

------
cbergoon
I am not familiar with the Amazon use case mentioned but a Merkle Tree in
general only requires you to compute the hashes of the nodes in the tree that
are in the path from the leaf node that you are trying to verify to the root
node.

This critical path is sometimes called the Merkle Path and gives you the
missing information to calculate the resulting hash at each level of the tree.
Essentially this 'rehashing' process rebuilds a portion of the tree until you
reach the root and if the resulting hash matches the known Merkle Root then
the content is valid.

The benefit here is that you only need to do log2(n) calculations where n is
the number of leaf nodes in the Merkle Tree.

The image on this SE answer shows the concept well:
[https://bitcoin.stackexchange.com/a/50680](https://bitcoin.stackexchange.com/a/50680)
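The rehashing step is small enough to sketch directly. Given a leaf hash, its position, and the sibling hash at each level (the Merkle path, supplied by whoever stores the tree), you recompute the root in log2(n) hashes; this is a generic illustration, not any particular service's wire format:

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def verify(leaf_hash, index, merkle_path, root):
    """Rehash up the tree using the sibling hash at each level.
    Only log2(n) hashes are computed; the rest of the tree is never touched."""
    node = leaf_hash
    for sibling in merkle_path:
        # The leaf's position at this level tells us the concatenation order.
        node = h(sibling + node) if index % 2 else h(node + sibling)
        index //= 2
    return node == root
```

Note that the verifier never "extracts" anything from the root: the path hashes come from the stored tree, and the root is only used for the final equality check.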

~~~
cbergoon
Also, this implementation might be helpful.

[https://github.com/cbergoon/merkletree/blob/master/merkle_tr...](https://github.com/cbergoon/merkletree/blob/master/merkle_tree.go)

------
dsukhin
Your intuition is correct. The Merkle root (hash of the root node) is only
useful as a summary of the whole hash tree structure. The clever idea here is
not that the tree itself is encoded in the Merkle root but rather that the
root is a unique id of a whole unique tree and all the data. Having the hash
tree available (which is much smaller than the data itself), you can validate
any leaf node's (1) integrity and (2) membership in the set much quicker than
a full scan by calculating the tree.

Merkle trees are also used in Bitcoin and similar protocols to check whether a
transaction is present in a block. This may be another useful link to help
understand the concepts: [https://hackernoon.com/merkle-trees-181cb4bc30b4](https://hackernoon.com/merkle-trees-181cb4bc30b4)

------
lotharrr
It might make more intuitive sense if you reverse the question. Suppose you've
uploaded a file to AWS, let's say 1MB, but you don't entirely trust that they
won't change the data on you. You're about to sell all your computers and go
on a boat trip around the world with nothing but the clothes on your back and
a single piece of paper, and when you get back next year, you want to retrieve
that data from AWS and know for sure whether they corrupted it or not. (And,
for some reason, you're ok with losing it entirely if they did corrupt it:
this is error detection, not error-correction).

So before you upload it, you compute the SHA256 hash of the file, and you
write down that 32-byte hash on your piece of paper. Then you upload the data,
delete the local copy, and sell all your computers. A year from now, when you
download it again and want to make sure AWS hasn't corrupted the contents, you
hash the downloaded data and make sure it matches the hash you wrote down.

Next, suppose you're uploading a 1GB dataset. And you know that a year from
now, you're only going to need 1MB of it (at a time). You don't really want to
download 1000x the data just to do the validity check. So instead you split
the dataset up into 1MB chunks, hash each one separately, concatenate the
hashes (giving you a 32000 byte string), and do a second-level hash of _that_.
You upload the 1GB dataset and the 32000 bytes of hashes, and you write down
just the single 32 byte root hash. Next year, when you download that 1MB, you
also download the 32000 bytes. You hash the 32000 bytes and compare it against
the root hash that you wrote down, and then you hash the 1MB that you care
about and compare it against the small portion of the 32000 bytes. That gives
you a two-level tree shape.

But now suppose you've got a 1 _TB_ dataset. That concatenated list of hashes
is going to be 32 × 1000 × 1000 = 32MB long, which is a drag, since you have to
download the whole thing (to hash the whole thing) in order to validate any
part of it. You can see where this is going: you introduce another level, you
have 1M leaves of 1MB each, in groups of 1000, each of which produces a
second-level hash, then you concatenate and hash all 1000 of those second-
level hashes to get the root hash.

A full (binary) Merkle tree is the log2(N) version of this. You split the
dataset up into blocks, hash those to get the leaves of the tree, hash the
leaves to get half as many intermediate nodes, hash those, etc, until you wind
up with a single root hash. Then you upload the dataset and all the nodes
(leaves and intermediate nodes), and you remember the root. Next year, you
reconstruct the shape of the tree, and figure out which subset of the nodes
you'll need to verify everything (basically the sibling node of the leaf you
want to download, and the sibling node of their mutual parent, and the sibling
of that node, etc, up to the top, where the last piece you need is the one
child of the root node that _isn't_ an ancestor of the leaf you're
downloading). We called this the "uncle chain", and it will be log(N) long.
Then you go to your backend store and fetch the leaf and those log(N) hashes,
and re-perform the subset of the initial hashing that got you the root.
Finally you have a reconstructed root, and you can compare that to the one
you've been holding all year long.
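Building the full tree and pulling out the uncle chain for one leaf looks like this in Python. This is a toy sketch of the scheme described above (power-of-two leaf count assumed), not Tahoe-LAFS code:

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_levels(leaf_hashes):
    """All levels of the tree, from the leaves up to the single root.
    Assumes a power-of-two number of leaves, for brevity."""
    levels = [list(leaf_hashes)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([h(prev[i] + prev[i + 1]) for i in range(0, len(prev), 2)])
    return levels

def uncle_chain(levels, index):
    """The sibling at each level on the way up: the log(N) hashes you fetch."""
    chain = []
    for level in levels[:-1]:
        chain.append(level[index ^ 1])  # index ^ 1 flips left <-> right
        index //= 2
    return chain

def reconstruct_root(leaf_hash, index, chain):
    """Re-perform just the subset of the hashing that leads to the root."""
    node = leaf_hash
    for sibling in chain:
        node = h(sibling + node) if index % 2 else h(node + sibling)
        index //= 2
    return node
```

A year later, the client only needs `reconstruct_root` plus the fetched leaf and chain; the reconstructed root either matches the one on your piece of paper or it doesn't.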

In Tahoe-LAFS ([https://tahoe-lafs.org](https://tahoe-lafs.org)), we had a
metric we named "alacrity", which we defined as the number of bytes you have
to fetch from your storage servers before you can deliver the first byte of
decrypted and validated data to the user. The size of the leaves directly
impacts the alacrity, as does the size of the uncle chain. In the linear
approach (the 1MB example), the alacrity is the entire file: O(N) (really just
"N"). The one-level 1000-way "tree" reduces that a lot, but the alacrity is
still linear, the sum of the block size and something like (N/1MB) × 32. By
going to a full tree, you minimize the alacrity overhead to logarithmic.

The tradeoff is overhead. The flat hashing adds zero overhead: every byte on
the server is a byte that the client wanted to store. The full tree is maximal
overhead: for every block/leaf, you need 2x hashes to build up the tree. So
for a 1TB file, and 1MB blocks, you'd have 64MB of hashes to store, and the
alacrity is 1MB plus about 20 × 32 = 640 bytes (since log2(1M) = 20). You can
reduce the alacrity by half by halving the blocksize, but that doubles your
overhead.
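The figures above are easy to check back-of-the-envelope (assumed sizes for illustration, not Tahoe-LAFS internals):

```python
HASH = 32                       # SHA-256 digest size in bytes
block = 1 << 20                 # 1MB blocks
leaves = (1 << 40) // block     # 1TB file -> 2^20 = ~1M leaves

# A full binary tree has roughly 2x as many nodes as leaves.
tree_bytes = 2 * leaves * HASH
uncle_len = leaves.bit_length() - 1   # log2(leaves) sibling hashes

print(tree_bytes // (1 << 20), "MB of hashes;",
      uncle_len * HASH, "bytes of uncle chain on top of each 1MB block")
```

Halving the block size doubles `leaves`, and with it `tree_bytes`, which is exactly the overhead-versus-alacrity tradeoff described.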

To use this, the validating side must know the shape of the tree (how much
data goes into each leaf, how many leaves get hashed into each intermediate
node, and how many levels of the tree you've got). But this is usually a
deterministic function of the filesize. And then it requires that you have a
way to download a specific subset of the data from the server (maybe using
HTTP Range queries, or some kind of random-access seek() call), to gather just
the useful hash nodes and nothing else.

Hope that helps! -Brian

------
traceroute66
All great replies. Thank you all. Particular hat-tip to @lotharrr for the
extra detail.

