Hacker News
The Compact Merkle Multiproof (arxiv.org)
57 points by irwt on Feb 20, 2020 | hide | past | favorite | 31 comments



Huh. I think I created this independently about a year ago. My implementation also works in streaming fashion, so you can compute proofs without holding all of the data in memory. I didn't think it was novel enough to write a paper about. ¯\_(ツ)_/¯

Maybe I should publish a paper about the "diff proofs" I also came up with, for proving arbitrary modifications to a Merkle tree?

https://gitlab.com/NebulousLabs/merkletree/-/commit/bae5b547...


I don't want to be rude, but after checking your code I can say that what you did is most definitely not the same thing as the compact Merkle multiproof idea.


Can you elaborate? Like yours, my proofs verify the presence of k arbitrary leaves within a tree, and require only the indices of those k leaves.

Here's a test case illustrating such a proof: https://gitlab.com/NebulousLabs/merkletree/-/blob/bae5b54749...


You son of a gun, good job. I initially looked at the wrong part of the code; my apologies. This part describes it very well: https://gitlab.com/NebulousLabs/merkletree/-/blob/master/ran... It's basically the same thing. Implementation-wise it's slightly different, but it's the same idea.

Over the coming days I'll add to the paper that you were the first to propose this algorithm.


Ha, thanks. Our algorithms differ slightly in that mine deals with ranges (rather than individual leaves) and processes the tree "left to right" rather than "bottom to top". As such, your algorithm is probably better suited to generic trees, whereas mine was specifically designed for "complete" binary trees. But otherwise, they are quite similar. :)


Compact Sparse Merkle Trees (CSMT) [1] is an existing scheme for shorter proofs, which introduces the new CSMT data structure and was also posted on HN a while ago.[2]

This[3] is the author's implementation (in Elixir) of the above paper, for those interested.

Also, here [4] is an interesting discussion between the author of Compact Sparse Merkle Trees and Vitalik Buterin, the creator of Ethereum, on their research forum.

I have posted these links because it seems disingenuous at best and malicious at worst not to cite this original work, which has been extensively discussed and documented before, anywhere in the current paper.

[1] https://eprint.iacr.org/2018/955.pdf

[2] https://news.ycombinator.com/item?id=18166298

[3] https://github.com/ZanjeerPlatform/csmt

[4] https://ethresear.ch/t/compact-sparse-merkle-trees/3741


One of the authors of the paper here. We did not cite this work because it has nothing to do with what we're proposing. One should not confuse Merkle multiproofs with sparse Merkle trees. We mentioned that in the paper as well. I do know that Vitalik once proposed the idea of sparse Merkle multiproofs somewhere informally, and we would love to cite that if someone could provide a link. But as for the compact sparse Merkle tree: just because that paper shares some words in its title does not make it the same thing.


I vaguely remember Vitalik mentioning multiproofs in the context of stateless clients for Ethereum 2.0, but this is what I could find: https://ethresear.ch/t/open-research-questions-for-phases-0-... and the code: https://github.com/ethereum/research/tree/master/merkle_tree

I also found a good article introducing multiproofs from Jim McDonald here: https://www.wealdtech.com/articles/understanding-sparse-merk...


Thank you for the links, I will try to edit the paper in the upcoming days to include Vitalik's mention. Jim McDonald's article is also wonderful; we referenced it multiple times and experimented quite a lot with his Merkle tree repo: https://github.com/wealdtech/go-merkletree


I heard about a similar (same?) method before and wrote about it here: https://ethresear.ch/t/optimizing-merkle-tree-multi-queries/...

In the comments Vitalik links to his implementation.


My two cents:

1. That seems to work for complete binary Merkle trees only.

2. The authors mention "significantly reducing the size of multiproofs" while not providing any data. Assuming 256-bit hashes, a tree with 2^63 leaves would be needed to claim a 20% reduction in size (64-bit indices, so dropping them saves at most 64/(64 + 256) = 20%). That's assuming hashes are stored for the proved leaves; disregarding them wouldn't change much, as there are many more non-proved hashes to store.


About the second point: I think you misunderstood something.

Imagine that we have a tree with 1000 leaves and we want a Merkle proof that can prove the presence of, say, 5 elements. The sparse Merkle multiproof will require an index for every non-leaf hash, so worst case around 30-40 indices (that's just an estimate, as multiproofs can have different sizes depending on the location of the leaves). The compact Merkle multiproof, on the other hand, requires only 5 indices.


You still need one hash for every non-leaf node, so 30-40 hashes of 256 bits each. What this algorithm saves is the 30-40 indices, which are 11 bits each for a 1000-leaf tree. Even aligned to 16 bits, that doesn't seem like a great optimization.
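
Putting rough numbers on that (my own arithmetic, using the thread's hypothetical of 1000 leaves, 5 proved elements, and 35 proof hashes as a midpoint of the 30-40 estimate):

```python
# Rough proof sizes for the thread's hypothetical: 1000 leaves, 5 proved
# elements, ~35 proof hashes, 32-byte hashes, 2-byte indices
# (11 bits aligned up to 16).
HASH_B, IDX_B, PROOF_HASHES, PROVED_LEAVES = 32, 2, 35, 5

sparse = PROOF_HASHES * (HASH_B + IDX_B)                  # one index per proof hash
compact = PROOF_HASHES * HASH_B + PROVED_LEAVES * IDX_B   # one index per proved leaf

print(sparse, compact)  # 1190 vs 1130 bytes, roughly a 5% saving
```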

Unless I misunderstood something and we are not trying to minimize the size of the whole proof, but just its structure; but I see no reason why that would be useful. I can see there are two other papers, though: one on distributed Bloom filters [1] and the other on Bloom trees [2]; maybe these can shed some light on it. I'll check later today.

[1] https://arxiv.org/pdf/1910.07782.pdf

[2] https://arxiv.org/pdf/2002.03057.pdf

EDIT: I hadn't noticed you are one of the authors! I liked the paper; it is very clear and visualizes how Merkle (multi)proofs work in a nice way. The only thing that worries me is the claim of a significant reduction in size without any hard data to back it up.


I am glad that you liked our paper :) (the distributed Bloom filter paper requires serious editing though).

Current sparse Merkle multiproofs require an index and a hash for each node inside the proof. So in our hypothetical example that would be 30-40 indices and 30-40 hashes. Only then can a recipient of the proof perform the operations in the right order to recreate the original Merkle root. With our proposal, on the other hand, this hypothetical example would need 30-40 hashes and only 5 indices for a recipient to be able to recreate the correct Merkle root.
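
A minimal sketch of the idea (my own illustration, not the paper's code): the prover emits sibling hashes in a fixed bottom-up, left-to-right order, and the verifier rederives that order from the proved leaf indices alone, so no per-hash indices travel with the proof. SHA-256 and the complete-binary-tree layout are my assumptions.

```python
import hashlib

def h(data):
    return hashlib.sha256(data).digest()

def build_tree(leaf_data):
    """Return the tree as a list of levels; levels[0] holds the leaf
    hashes and levels[-1] is [root]."""
    level = [h(d) for d in leaf_data]
    levels = [level]
    while len(level) > 1:
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def make_compact_multiproof(levels, indices):
    """Collect sibling hashes bottom-up, left-to-right. The order is a
    pure function of the leaf indices, so it never needs to be sent."""
    proof, known = [], set(indices)
    for depth in range(len(levels) - 1):
        next_known = set()
        for i in sorted(known):
            if i ^ 1 not in known:          # sibling unknown: ship its hash
                proof.append(levels[depth][i ^ 1])
            next_known.add(i // 2)
        known = next_known
    return proof

def verify_compact_multiproof(num_leaves, leaf_hashes, proof, root):
    """leaf_hashes maps leaf index -> hash; proof hashes are consumed in
    the same deterministic order the prover emitted them."""
    it, level, width = iter(proof), dict(leaf_hashes), num_leaves
    while width > 1:
        nxt = {}
        for i in sorted(level):
            if i // 2 in nxt:               # already combined with sibling
                continue
            sib = i ^ 1
            sib_hash = level[sib] if sib in level else next(it)
            left, right = (level[i], sib_hash) if i % 2 == 0 else (sib_hash, level[i])
            nxt[i // 2] = h(left + right)
        level, width = nxt, width // 2
    return level[0] == root
```

For 3 of 8 leaves this ships 4 sibling hashes plus the 3 leaf indices, versus one index per hash in the sparse-multiproof encoding.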

By the way, here's a great repo that can create sparse Merkle multiproofs: https://github.com/wealdtech/go-merkletree


I work in the blockchain space (the blockchain I work on doesn't use Merkle trees) and I still don't understand why there is so much hype around Merkle trees. My understanding is that they allow you to quickly check if an element is part of a potentially large set? Isn't this 'fast lookup' problem already solved by database indexing?

Can anyone actually explain this? I don't think I've ever heard a single good explanation yet.

For example, a node can see if a transaction exists by simply fetching it from the database using the transaction ID since there is an index on the ID field? Indexing already provides O(log n) lookup speed. Why do I need a Merkle proof to do this check? Especially if each block contains the hash of the previous block...

What is the use case which Merkle trees solve which is not solved by indexing?

At first, I thought maybe it allows you to verify the validity of a transaction without having the full record of all transactions, but my understanding is that the Merkle proof still requires multiple transaction IDs to be known by the verifying node; and in order to verify ANY ARBITRARY transaction reliably, the verifying node would end up needing a record of all transactions anyway, because the proof is different for each transaction. So the aggregated proof of all transactions ends up spanning the entire set. Why not just rely on the original set and use DB indexing instead?


If you are indexing all transactions, you are right that Merkle trees are not very useful. They become useful when you aren't; for example, when running a lightweight client that only stores block headers.

Let's say you are running a lightweight wallet and import a private key. Your client will download all transactions that spend from/to that key from a node that does index all transactions (a "full node"). With the block hash and Merkle branch of each transaction, your client can independently verify that each transaction was included in a block and is therefore valid. This of course assumes that the longest chain (in terms of proof of work) is valid, which is usually a reasonable assumption.
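
The per-transaction check such a light client does can be sketched like this (hypothetical helper names; plain SHA-256 stands in for Bitcoin's actual double-SHA-256):

```python
import hashlib

def h(data):
    return hashlib.sha256(data).digest()

def verify_branch(leaf_hash, index, branch, root):
    """Fold the leaf hash up to the root, one sibling per tree level;
    the leaf's index decides left/right concatenation at each step."""
    node = leaf_hash
    for sibling in branch:
        node = h(node + sibling) if index % 2 == 0 else h(sibling + node)
        index //= 2
    return node == root
```

If the recomputed root matches the Merkle root in a proof-of-work-backed block header, the transaction was committed to by that block, and the client never needed the other transactions.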


We use them when we replicate very complex dependency trees. Often with 1e6+ heterogeneous objects in the tree.

If we replicate a tree, say, for regression testing, we can trivially confirm that it replicated identically. If not, we can use the partial trees to very quickly triage the discrepancies to the earliest divergence point.

We also have a number of cases where we want to quickly confirm that 2 (or more) trees have identical dependencies in parts of the tree even though the overall trees diverge substantially. A Merkle tree approach works nicely here too.

Indexing doesn't help in these cases as what we're doing is, effectively, fast content matching.


I see your confusion; from that perspective, you're absolutely right. But the usefulness of Merkle trees isn't their 'fast lookup'; rather, it is their bandwidth-efficient way of proving that something remained unaltered.


I still don't get it. For example, if each block contains the hash of the previous block (including the hash of all its transactions), then it's already possible to verify whether a block was illegally substituted for another (e.g. secretly behind the scenes), because the following block's hash would become inconsistent. The verification step would be O(1) time complexity, because you just need to check the block hash and the hashes of its two neighbors, which is faster than a Merkle proof's O(log n).


The clue is in the "all transactions" part. If you have n blocks adding values, you have to verify O(n) blocks on average to check that the value is in the chain leading to the current head.

With a Merkle tree you just need a path from the leaf to the root, so O(log n) hashes are required.


Is it related to Bitcoin's design with input and output transactions and the double spend problem?

It still seems that the problem can be solved by including a hash of the previous record as part of each new record. Then, to verify the validity of any given record, you just fetch that record, its predecessor, and its successor. If the record's hash cannot be computed from the predecessor's hash plus the hash of the current record's transactions (in a way that is consistent with the successor's hash), then you know the current record is invalid.


It's about anchoring your knowledge/trust. If you start with zero data, you're saying that it's enough to just fetch that record. How do you do that? You need to query full nodes specifically for the block that contains your record. But how do you know it is part of the "real" blockchain? They could provide you a fake block, even faking the whole history, just to pretend your record is there.

That is not possible when you are querying many full nodes for the current head, without telling them what you are looking for. They cannot return different results based on your query. You can trust that one piece of information: the newest block.

In order to verify the validity of a record, you need the block containing it, plus a check that that block is in the history of the current head. That requires fetching O(n) blocks. Keep in mind that you only need to trust the head: even if someone wanted to provide you a fake history, they couldn't, as it is infeasible to create a block matching a specific hash.

But instead of a linear block chain (corresponding to the list datatype) you can have a Merkle tree (corresponding to the tree datatype), where the hash of each node can be computed from just its two children. You can anchor your knowledge in the hash of the root (e.g. by having all nodes provide you the same value). Then anyone can prove to you, in a trustless way, that the tree contains a specific leaf: by providing what's needed to verify the hash of every node on the path from the leaf to the root. That requires O(log n) data, compared to O(n) in a linear blockchain.
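
The O(log n) vs O(n) gap is easy to put numbers on; an inclusion proof needs one hash per tree level, i.e. ceil(log2(n)) hashes, versus scanning on the order of n blocks:

```python
import math

# Hashes needed for a Merkle inclusion proof at various tree sizes,
# versus the O(n) blocks a linear chain scan would touch.
for n in (1_000, 1_000_000, 1_000_000_000):
    print(n, math.ceil(math.log2(n)))
```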


Merkle trees allow for verification; I don't think they are particularly beneficial for lookup? For example, this paper is about a better way to verify the integrity of a "sparse Merkle multiproof".


My point is that lookups and simple `if then else` comparisons can be used for verification too, with similar performance characteristics, and are a lot simpler to implement.

Is it related to zero-knowledge proofs? The industry makes it seem like Merkle trees are required for all blockchains, which is definitely not the case, since most blockchains are public and each full node has a full record of the history.


I mean, each node doesn't have a literal full record of the history embedded in it, right? There's just a back pointer to the previous node. And the Merkle tree part verifies that the thing the node points back to is legit and not just a fake block you pulled out of thin air.


I was talking about nodes as independent machines which have the full history (full nodes). Maybe you were thinking of nodes in a mathematical sense like blocks in a blockchain.

My point stands that if each block ID is computed from:

1. A hash of all the transactions inside the block. AND

2. The hash from the previous block.

Then you can verify that the block is legit without a Merkle tree. Tampering with or substituting any block would make its hash inconsistent with the neighboring blocks.
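
The chain-of-ids check described above can be sketched as follows (a toy model of my own, not any real blockchain's format):

```python
import hashlib

GENESIS = "0" * 64

def block_id(prev_id, tx_root):
    # A block's id commits to its predecessor's id and its own transactions.
    return hashlib.sha256((prev_id + tx_root).encode()).hexdigest()

def chain_is_consistent(blocks):
    """blocks: list of (id, prev_id, tx_root) tuples, oldest first.
    Recomputes every id and checks every back-link."""
    for i, (bid, prev, txs) in enumerate(blocks):
        if block_id(prev, txs) != bid:
            return False                     # block contents were tampered with
        if i > 0 and prev != blocks[i - 1][0]:
            return False                     # link to predecessor broken
    return True
```

Altering any block's transactions changes its id, which breaks every later back-link, exactly as the comment argues.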


It's not about Bitcoin or blockchains, and has nothing to do with zero-knowledge proofs. You did just describe a Merkle tree, though.

Here is another example: BitTorrent uses Merkle trees for torrent and piece identification.

You request a peer list from a tracker via the torrent's Merkle tree root hash. These are visible in torrent files and sometimes magnet links. That root hash corresponds to a tree of piece hashes whose data, when reassembled, recreates the torrent. The tree of piece hashes is included with the torrent file.

You can now ask peers for those pieces by their hash and know a peer sent you the correct data because it hashes correctly. Furthermore, you know you reassembled the original torrent because all the piece hashes hash back up to the Merkle tree root, thus verifying the torrent was downloaded correctly.
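
The shape of both checks is roughly this (a sketch with SHA-256 and last-hash duplication on odd levels as my assumptions; BitTorrent's v2 spec, BEP 52, defines the real hashing and padding rules):

```python
import hashlib

def sha(b):
    return hashlib.sha256(b).digest()

def piece_ok(piece_data, expected_piece_hash):
    # A peer's piece is accepted only if it hashes to the committed value.
    return sha(piece_data) == expected_piece_hash

def root_from_piece_hashes(piece_hashes):
    """Recompute the tree root from all piece hashes; matching the
    torrent's root hash confirms the whole download."""
    level = list(piece_hashes)
    while len(level) > 1:
        if len(level) % 2:                   # duplicate last hash on odd levels
            level.append(level[-1])
        level = [sha(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]
```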


But in order to prove that to someone who doesn't have the full history, they would have to synchronize it by downloading all the data and verifying all the transactions. With a Merkle tree (identified by its root hash, just as a blockchain is identified by the hash of its head) it is possible to verify that a value is inside without full synchronization.


I mean, isn't that literally a Merkle Tree?


The Bitcoin p2p protocol has used compact Merkle multiproofs since ~2012: https://github.com/bitcoin/bitcoin/commit/4bedfa9223d38bbc32...

(Other Bitcoin things/proposals have had efficient multiproof encodings for non-complete trees too, as well as signaling/index-free encodings for non-multiproof Merkle multisets.)


Very clever design. It's not exactly like our proposal, or the one implemented by @nemo1618 a year ago, but it's just as legit. I will edit the paper and cite your work properly.





