
The Compact Merkle Multiproof - irwt
https://arxiv.org/abs/2002.07648
======
nemo1618
Huh. I think I created this independently about a year ago. My implementation
also works in streaming fashion, so you can compute proofs without holding all
of the data in memory. I didn't think it was novel enough to write a paper
about. ¯\\_(ツ)_/¯

Maybe I should publish a paper about the "diff proofs" I also came up with,
for proving arbitrary modifications to a Merkle tree?

[https://gitlab.com/NebulousLabs/merkletree/-/commit/bae5b547...](https://gitlab.com/NebulousLabs/merkletree/-/commit/bae5b547495f9dddbd4ddeecfdcb00cb89d99d76)

~~~
irwt
I don't want to be rude, but after checking your code I can say that what you
did is most definitely not the same thing as the compact Merkle multiproof
idea..

~~~
nemo1618
Can you elaborate? Like yours, my proofs verify the presence of k arbitrary
leaves within a tree, and require only the indices of those k leaves.

Here's a test case illustrating such a proof:
[https://gitlab.com/NebulousLabs/merkletree/-/blob/bae5b54749...](https://gitlab.com/NebulousLabs/merkletree/-/blob/bae5b547495f9dddbd4ddeecfdcb00cb89d99d76/range_test.go#L205)

~~~
irwt
You son of a gun, good job. I initially looked at the wrong part of the code,
my apologies. This part describes it very well:
[https://gitlab.com/NebulousLabs/merkletree/-/blob/master/ran...](https://gitlab.com/NebulousLabs/merkletree/-/blob/master/range.go#L197)
It's basically the same thing. Implementation wise it's slightly different,
but it's the same idea.

During the upcoming days I'll add to the paper that you were the first to
propose this algorithm.

~~~
nemo1618
Ha, thanks. Our algorithms differ slightly in that mine deals with ranges
(rather than individual leaves) and processes the tree "left to right" rather
than "bottom to top". As such, your algorithm is probably better suited to
generic trees, whereas mine was specifically designed for "complete" binary
trees. But otherwise, they are quite similar. :)

------
sahrizv
_Compact Sparse Merkle Trees (CSMT)_ [1] is an already existing scheme for
shorter proofs, which introduces the new CSMT data structure and was also
posted on HN a while ago.[2]

This[3] is the implementation(in Elixir) of the above paper by the author for
those interested.

Also, here [4] is an interesting discussion between the author of Compact
Sparse Merkle Trees and Vitalik Buterin, the creator of Ethereum on their
research forum.

I have posted these links because it seems disingenuous at best and malicious
at worst, to not cite this original work which has been extensively discussed
and documented before, anywhere in the current paper.

[1]
[https://eprint.iacr.org/2018/955.pdf](https://eprint.iacr.org/2018/955.pdf)
[2]
[https://news.ycombinator.com/item?id=18166298](https://news.ycombinator.com/item?id=18166298)
[3]
[https://github.com/ZanjeerPlatform/csmt](https://github.com/ZanjeerPlatform/csmt)
[4] [https://ethresear.ch/t/compact-sparse-merkle-
trees/3741](https://ethresear.ch/t/compact-sparse-merkle-trees/3741)

~~~
irwt
One of the authors of the paper here. We did not cite this work, because it
has nothing to do with what we're proposing. One should not confuse Merkle
multiproofs with sparse Merkle trees. We mentioned that in the paper as well.
I do know that Vitalik once proposed the idea of sparse Merkle multiproofs
somewhere informally, and we would love to cite that if someone could provide
a link to that. But regarding the compact sparse Merkle tree, just because
this paper shares some words in the title, does not make it the same thing.

~~~
decentralised
I vaguely remember Vitalik mentioning multiproofs in the context of stateless
clients for Ethereum 2.0, but this is what I could find:
[https://ethresear.ch/t/open-research-questions-for-
phases-0-...](https://ethresear.ch/t/open-research-questions-for-
phases-0-to-2/5871/2) and the code:
[https://github.com/ethereum/research/tree/master/merkle_tree](https://github.com/ethereum/research/tree/master/merkle_tree)

I also found a good article introducing multiproofs from Jim McDonald here:
[https://www.wealdtech.com/articles/understanding-sparse-
merk...](https://www.wealdtech.com/articles/understanding-sparse-merkle-
multiproofs/)

~~~
irwt
Thank you for the links, I will try and edit the paper in the upcoming days to
include Vitalik's mentioning. Jim McDonald's article is also wonderful, we
referenced it multiple times and experimented quite a lot with his Merkle tree
repo: [https://github.com/wealdtech/go-
merkletree](https://github.com/wealdtech/go-merkletree)

------
krcz
My two cents:

1\. That seems to work for complete binary Merkle trees only.

2\. Authors mention "significantly reducing the size of multiproofs" while not
providing any data. Assuming using 256-bit hashes, a tree with 2^63 leafs
would be needed to claim 20% reduce in size (64-bit indices, so droping them
saves at most 64/(64 + 256)%). That's assuming storing hashes for proved
leafs, but disregarding them wouldn't change much - as there are much more
non-proved hashes to be stored.

~~~
irwt
About the second point: I think you misunderstood something.

Imagine that we have a tree with 1000 leaves and we want to have a Merkle
proof that can prove the presence of, let's say 5 elements. The sparse Merkle
multiproof will require an index for every non-leaf hash, so worst case around
30-40 indices (that's just an estimate, as every multiproof can have different
sizes, depending on the location of the leaves). The compact Merkle multiproof
on the other hand requires only 5 indices.

~~~
krcz
You still need one hash for every non-leaf node, so 30-40 hashes of 256 bits
each. What this algorithm saves is 30-40 indices, which is 11 bits each for a
1000-leaf tree. Even aligning to 16 bits it's doesn't seem a great
optimization.

Unless I misunderstood something and we are not trying to minimize the size of
whole proof, but just its structure - but I see no reason why it could be
useful. But I can see there are two other papers: one on distributed Bloom
filters [1] and the other on Bloom trees [2], maybe these can shed some light
on it, I'll check later today.

[1]
[https://arxiv.org/pdf/1910.07782.pdf](https://arxiv.org/pdf/1910.07782.pdf)

[2]
[https://arxiv.org/pdf/2002.03057.pdf](https://arxiv.org/pdf/2002.03057.pdf)

EDIT: I haven't noticed you are one of the authors! I liked the paper, it is
very clear and visualizes how Merkle (multi)proofs work in a nice way. The
only thing that worries me is claim of significant reduce in the size without
providing any hard data in that topic.

~~~
irwt
I am glad that you liked our paper :) (the distributed bloom filter paper
requires serious editing though).

Current sparse Merkle multiproofs require an index and a hash inside the
proof. So in our hypothetical example that would be 30-40 indices and 30-40
hashes. Only then can a recipient of the proof perform the operations in the
right order to recreate the original merkle root. With our proposal on the
other hand, for this hypothetical example, we would need 30-40 hashes, and
only 5 indices for a recipient to be able to recreate the correct Merkle root.

Here's by the way a great repo that can create sparse Merkle multiproofs
[https://github.com/wealdtech/go-merkletree](https://github.com/wealdtech/go-
merkletree)

------
cryptica
I work in the blockchain space (the blockchain I work on doesn't use Merkle
trees) and I still don't understand why there is so much hype around Merkle
trees. My understanding is that they allow you to quickly check if an element
is part of a potentially large set? Isn't this 'fast lookup' problem already
solved by database indexing?

Can anyone actually explain this? I don't think I've ever heard a single good
explanation yet.

For example, a node can see if a transaction exists by simply fetching it from
the database using the transaction ID since there is an index on the ID field?
Indexing already provides O(log n) lookup speed. Why do I need a Merkle proof
to do this check? Especially if each block contains the hash of the previous
block...

What is the use case which Merkle trees solve which is not solved by indexing?

At first, I thought maybe it allows you to verify the validity of a
transaction without having the full record of all transactions, but my
understanding is that the Merkle Proof still requires multiple transaction IDs
to be known by the verifying node; but in order to be able to verify ANY
ARBITRARY transaction reliably, the verifying node would end up needing to
have a record of all transactions anyway? Because the proof will be different
for each transaction... So the aggregated proof of all transactions ends up
spanning the entire set... So why not just rely on the original set and just
use DB indexing instead??

~~~
irwt
I see your confusion, from that perspective, you're absolutely right. But the
usefulness of Merkle trees isn't in its 'fast lookup', it is rather in its
bandwidth efficient way to prove that something remained unaltered.

~~~
cryptica
I still don't get it. For example, if each block contains the hash of the
previous block (including the hash of all its transactions), then it's already
possible to verify if a block was illegally substituted for another (e.g.
secretly behind the scenes) because the following block hash would become
inconsistent. The verification step would be O(1) time complexity because you
just need to check the block hash and the hash of its two neighbors which is
faster than Merkle Proof which is O(log n).

~~~
krcz
The clue is in "all transitions" part. If you have n blocks adding values, you
have to verify O(n) blocks on average to check, that the value is in the chain
leading to the current head.

With Merkle tree you just need a path from leaf to the root - so O(log n)
hashes required.

~~~
cryptica
Is it related to Bitcoin's design with input and output transactions and the
double spend problem?

It still seems that the problem can be solved by including a hash of the
previous record as part of each new record. Then to verify the validity of any
given record, you just fetch that record, its predecessor and its successor.
If the record's hash cannot be computed using the predecessor's hash plus the
hash of the current record's transactions (in a way that it's consistent with
the successor's hash), then you know the current record is invalid.

~~~
krcz
It's about anchoring your knowledge/trust. If you start with zero data, you're
saying that it's enough to just fetch that record. How to do that? You need to
query full nodes specifically for the block that contains your record. But how
do you know it is part of the "real" blockchain? They could provide you fake
block - even faking whole history, just to fake your record is there.

That is not possible when you are querying many full nodes for current node,
without telling them what you are looking for. They cannot return different
results based on your query. You can trust that one piece of information -
newest block.

In order to verify validity of the record, you need the block containing that
result and checking that it is in history of the current block. That requires
fetching O(n) blocks. Keep in mind that you need to trust just the head - even
if someone wanted to provide you fake history, they are not able to, as it is
infeasible to create block matching specific hash.

But instead of having linear block chain (corresponding to list datatype) you
can have a Merkle tree (corresponding to tree datatype), where it is possible
to compute hash of each node having just its two children. You can anchor your
knowledge in the hash of the root (e.g. by having all nodes provide you the
same value). Than anyone can prove you in a trustless way, that the tree
contains specific leaf: by providing way to verify the hash of all nodes on
the path from the leaf to the root. That requires O(log n) data, compared to
O(n) in linear blockchain.

------
nullc
The Bitcoin p2p protocol has used compact merkle multiproofs since ~2012:
[https://github.com/bitcoin/bitcoin/commit/4bedfa9223d38bbc32...](https://github.com/bitcoin/bitcoin/commit/4bedfa9223d38bbc322d19e28ca03417c216700b)

(Other bitcoin things/proposals have had efficient multiproof encodings for
non-complete tree too. Also signaling/index free encodings for non-multiproof
merkle multisets.)

~~~
irwt
Very clever design. It's not exactly like our proposal, or the one implemented
by @nemo1618 a year ago, but it's just as legit. I will edit the paper and
cite your work properly.

