
Attacking Merkle Trees with a second preimage attack - wepple
https://flawed.net.nz/2018/02/21/attacking-merkle-trees-with-a-second-preimage-attack/
======
KMag
The authors of Keccak / SHA-3 came up with Sakura[0], a hash tree construction
that's provably as collision-resistant as the underlying hash function and
very flexible.

If you're designing a new system using hash trees, you better use Sakura trees
or have a good explanation for why not.

Edit: I also wrote some demo code[1] for this attack when arguing with one of
the IPFS guys about why they really shouldn't use Bittorrent BEP 30-style
Merkle Trees. In order to get security with Bittorrent-style trees, the
[length, root] pair really needs to be your cryptographic identifier, and you
need to make sure they're always properly handled as a pair. There are just
too many caveats to usage and we have provably secure alternatives.

As a former LimeWire developer, it makes me sad that Gnutella's TigerTrees
avoided this vulnerability long before bittorrent's BEP 30 was published, and
yet Bittorrent got it wrong. It was a well-known vulnerability, and was
covered as part of my interview at LimeWire.

[0][https://keccak.team/files/Sakura.pdf](https://keccak.team/files/Sakura.pdf)
[1][https://github.com/kmag/bad_examples/tree/master/bad_merkle_...](https://github.com/kmag/bad_examples/tree/master/bad_merkle_tree)

~~~
gritzko
RFC 7574 Merkle tree solves that rather strictly (I am the author of the
scheme).

Max file size is 2^64 bits, so the hash tree is defined as a binary tree
covering the entire 2^64 range, leaf layer made of 1KB pieces. A hash of an
empty range is defined to be 0, at any level. That way, each hash is reliably
fixed to its interval ("bin").

There are some features derived from that. For example, you can _prove_ file
size by showing a logarithmic number of hashes. A file is identified by its
root hash (covering the entire 2^64 range), no additional metadata needed. And
so on.

These days, the 7574 scheme is used in the DAT protocol.
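
A toy sketch of the idea as described in this comment (not the RFC 7574 algorithm itself): every hash is tied to a fixed interval of a fixed-depth tree, and any empty interval hashes to zero. `DEPTH`, `bin_hash`, and `root_hash` are hypothetical names, and SHA-256 is an assumed hash choice.

```python
import hashlib

CHUNK = 1024        # 1 KB leaves, as in the comment
DEPTH = 8           # toy depth (assumption); the real range is far larger
ZERO = bytes(32)    # value assigned to the hash of any empty range


def bin_hash(data: bytes, level: int, index: int) -> bytes:
    """Hash of the bin covering chunks [index << level, (index + 1) << level)."""
    first_byte = (index << level) * CHUNK
    if first_byte >= len(data):
        return ZERO  # nothing stored in this range
    if level == 0:
        return hashlib.sha256(data[first_byte:first_byte + CHUNK]).digest()
    left = bin_hash(data, level - 1, 2 * index)
    right = bin_hash(data, level - 1, 2 * index + 1)
    return hashlib.sha256(left + right).digest()


def root_hash(data: bytes) -> bytes:
    """Root over the whole fixed range, regardless of how much data exists."""
    if len(data) > (CHUNK << DEPTH):
        raise ValueError("data does not fit in the fixed range")
    return bin_hash(data, DEPTH, 0)
```

Because the root always sits at the same height, an interior hash presented as a leaf ends up one level too low and produces a different root.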

~~~
KMag
Having an RFC is better than nothing, and I don't see any obvious
vulnerabilities in your scheme.

I've posted several places about being able to take the hash of the rightmost
chunk and log(N) other hashes as a compact proof of file length for a hash
tree root. (Including, I think, the posts I had arguing with the IPFS folks
about using BEP 30 Merkle Trees.) That's handy, but there's nothing novel
about RFC 7574 there.

However, 2^64 bits is a bit of an arbitrary limit. What's the justification
there? SHA-512 and SHA-384 support up to 2^128-1 bits, and SHA-3 doesn't have
such a limit.

More importantly, I don't recognize the names of the RFC authors from any
cryptographic analysis. There's a big difference between a scheme that looks
pretty good to a ton of random people who have looked at it and a scheme
designed and published by world-famous cryptographers. Has your custom tree
construct undergone formal review by qualified cryptographers? It looks pretty
good to me, but if I had to bet my career on the security of a tree hash
scheme, I'd prefer to go with the people who brought us Keccak/SHA-3 (and one
of the people who brought us AES, despite its known weaknesses).

~~~
gritzko
1\. 7574 is not "novel" indeed. About 10 years old, if you count the original
draft.

2\. 2^64 is quite a lot, for a single file. The variant I described deals with
a single static file. Also, it is a part of a network protocol, so there are
some requirements. Like, using standard integer arithmetic for packet-level
processing.

3\. Feel free to show it to any people you like.

------
Everlag
This is a well known edge case of Merkle trees; my undergrad security course
covered this as an aside and it is prominent on the wikipedia page.[0]

For those not familiar with the specific implementation issue, this post
should be a fun read.

I don't know what's more concerning, that pymerkletools advertises itself for
use with Bitcoin or that none of the contributors read the wiki page :|

[0]
[https://en.wikipedia.org/wiki/Merkle_tree#Second_preimage_at...](https://en.wikipedia.org/wiki/Merkle_tree#Second_preimage_attack)

~~~
kerkeslager
It's a bit strange to consider this even an edge case, and weird that one
would even create a diagram (such as the one on the wikipedia page) that
doesn't have a solution (such as a leaf node marker prepended). Why isn't the
prepended leaf node marker included in the concept of a Merkle tree? Is there
ever a situation where you'd want a Merkle tree that allowed this?

~~~
KMag
The original use case for Merkle trees was to compactly represent sets of
one-time-use public keys. In that context, an attacker trying to exploit this
sort of collision would end up presenting an interior node as a public key
leaf node, but the size of an interior node is too small to be a valid public
key, so no party that performs sanity checks on the public keys would be
fooled by the attacker. In that case, tagging nodes as leaves or internal
nodes is arguably a tiny tiny bit wasteful. That's the only argument I can
think of, and it's a poor one.

Edit: Okay, the other valid argument for not using prefixes to tag leaves and
interior nodes is that you're using the provably secure Sakura construction,
which uses suffixes rather than prefixes. There are a few advantages to using
suffixes, such as being able to have a single node containing both data and
child nodes without having to know the length of the data portion before
starting to hash the node. There's also better performance due to memory
alignment when hashing memory-mapped files (if using block sizes that pack
well into memory pages and cache lines) if you use suffixes. But, suffixes vs.
prefixes is a tiny nit to pick.

~~~
kerkeslager
Okay, I guess in mathematical terms a better way to express the whole thing
would be in terms of two hash functions:

h_leaf(leaf) which takes a leaf.

h_branch(branch_left, branch_right) which takes the two branches.

The important point being that one should not be able to find a leaf such that
h_leaf(leaf) = h_branch(branch_left, branch_right) for any branch_left,
branch_right.

Prefixes versus suffixes are just implementation details of the hash functions
(i.e. h_leaf(x) = "\00" \+ sha256(x), h_branch(x,y) = "\01" \+ sha256(x + y)
would also work).
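
A minimal sketch of the two-function idea, assuming SHA-256 and one-byte tags. This variant puts the tag inside the hash input (the approach RFC 6962 takes) instead of prepending it to the digest as in the formula above; `merkle_root` and the rule of promoting an odd node unchanged are illustrative assumptions.

```python
import hashlib

LEAF_TAG = b"\x00"     # assumed tag values
BRANCH_TAG = b"\x01"


def h_leaf(leaf: bytes) -> bytes:
    return hashlib.sha256(LEAF_TAG + leaf).digest()


def h_branch(left: bytes, right: bytes) -> bytes:
    return hashlib.sha256(BRANCH_TAG + left + right).digest()


def merkle_root(leaves: list) -> bytes:
    if not leaves:
        raise ValueError("need at least one leaf")
    level = [h_leaf(leaf) for leaf in leaves]
    while len(level) > 1:
        nxt = [h_branch(level[i], level[i + 1])
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # odd node is promoted unchanged
        level = nxt
    return level[0]


# A leaf built from two child hashes no longer matches their parent,
# because the parent was hashed with BRANCH_TAG and the leaf with LEAF_TAG.
forged_leaf = h_leaf(b"A") + h_leaf(b"B")
assert merkle_root([forged_leaf]) != merkle_root([b"A", b"B"])
```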

------
moscovium
This was interesting, but when you think about it, this isn't really a flaw as
much as something inherent to the way it was designed.

Imagine you had some combinations of functions f, g, and h such that
f(g(h(x))) = y. Obviously you could calculate that as h(x) -> g(h(x)) ->
f(g(h(x))) = y, but then of course knowing h(x), or g(h(x)) would enable you to
find y as well. So of course, due to the recursive nature of it, picking any
set of inputs that were the outputs of a previous call would give you the same
output.

That argument doesn't exactly fit multiple inputs, but the idea is the same.

~~~
andrewflnr
You say that like flaws and inherent features are different. I say those are
the worst kind of flaw.

I guess the point is, this property (value-neutral) of that design makes it
unsuitable for real-world purposes. Since the implementations in question are
meant for real world use, that makes them flawed, at least. :)

~~~
aidenn0
For fixed-depth trees, Merkle trees are fine though. I have used them for just
such a purpose.

------
JeremyBanks
[https://crypto.stackexchange.com/questions/2106/what-is-the-...](https://crypto.stackexchange.com/questions/2106/what-is-the-purpose-of-using-different-hash-functions-for-the-leaves-and-internals-of-a-hash-tree)

------
hedora
I find this article and discussion unsettling. A large number of people made
non-typesafe merkle trees (including bittorrent, apparently), and then were
surprised that passing it incorrect types causes it to produce incorrect
output.

I don’t see how this is a vulnerability in the merkle tree algorithm. It just
seems like yet another case of “python libraries contain bugs that are common
in idiomatic python”.

I guess since so many people have independently implemented the same mistake,
the writeup is useful.

------
DennisP
I think I might be missing the point...is the attack just to pretend an
intermediate hash was a leaf?

If you're providing a merkle branch to prove that a document you've hashed is
in the set, then I don't see how this helps you at all.

------
bluesign
Another solution to this attack can be:

hashing root with tree depth as last step.

~~~
kerkeslager
This doesn't work unless you can guarantee that the size of hashed sets is
going to be an exact power of two, because otherwise subtree depths can differ.

~~~
bluesign
Oh I meant the longest subtree, afaik depth of Merkle tree is the longest one
no?

Edit: Also I think element count can be also alternative for depth

~~~
kerkeslager
> Oh I meant the longest subtree, afaik depth of Merkle tree is the longest
> one no?

I get what you meant, but it still doesn't prevent the second preimage attack,
unless you're forcing all subtrees to be the same depth. Think about the
algorithm for testing set membership. How would they use the tree depth to
distinguish between hash(hash(leafA) + hash(leafB)) and hash(leafC) where
leafC = hash(leafA) + hash(leafB) but leafC is NOT a member of the set? Keep
in mind that leafA and leafB could be leaves of the longest subtree, but other
valid leaves could be on valid subtrees that are one node shorter.

> Edit: Also I think element count can be also alternative for depth

Element count doesn't solve the second preimage attack problem either.
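
To make the membership test concrete, here is a minimal sketch assuming an untagged SHA-256 tree over three leaves, with the odd leaf promoted one level. `verify` and the proof format (a list of `(sibling, sibling_is_left)` pairs) are hypothetical.

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def verify(leaf: bytes, proof: list, root: bytes) -> bool:
    """Recompute the root from a leaf and its (sibling, sibling_is_left) path."""
    node = h(leaf)
    for sibling, sibling_is_left in proof:
        node = h(sibling + node) if sibling_is_left else h(node + sibling)
    return node == root


leaf_a, leaf_b, leaf_d = b"A", b"B", b"D"
# Tree over [A, B, D]: D has no partner, so it sits one level higher.
root = h(h(h(leaf_a) + h(leaf_b)) + h(leaf_d))

# Honest proof for D: one sibling, the hash of the (A, B) subtree.
honest_proof = [(h(h(leaf_a) + h(leaf_b)), True)]

# Forged leaf C is the concatenation of two child hashes; it is not in the set.
leaf_c = h(leaf_a) + h(leaf_b)
forged_proof = [(h(leaf_d), False)]

assert verify(leaf_d, honest_proof, root)
assert verify(leaf_c, forged_proof, root)      # accepted, although C is not a member
assert len(forged_proof) == len(honest_proof)  # path length alone does not expose it
```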

~~~
bluesign
Oh I may be wrong on this but I was assuming something like this

    
    
           ABCDEEEE 
          /        \
         ABCD      EEEE
        /    \      /
       AB    CD    EE 
      /  \  /  \  /
      A  B  C  D  E
    

From: [https://bitcoin.org/en/developer-guide#transaction-data](https://bitcoin.org/en/developer-guide#transaction-data)

~~~
kerkeslager
Okay, if I'm understanding correctly, that diagram is a little unclear. Like I
said earlier, using the tree depth works if the number of items in your set is
a power of two, because then all branches of the tree are the same depth. In
this case, Bitcoin is _forcing_ all branches to be the same depth by
duplicating the final element to fill in the rest of the Merkle tree up to a
power of 2. In the example, the power of 2 is 8, and the theoretical Merkle
tree looks like this:

    
    
                    ABCDEEEE
                   /        \
               ABCD          EEEE
              /    \        /    \
            AB      CD     EE     EE
           /  \    /  \   /  \   /  \
          A   B   C    D E    E E    E
    

The diagram you linked doesn't show the full right side of the Merkle tree,
because of a clever computational trick that they use to optimize storage and
membership. Basically, if you compute and store hash(E) for the leftmost E,
you don't have to compute or store hash(E) again to compute and store
hash(hash(E) + hash(E)) for the leftmost EE. The diagram is showing _the
computations_, not the full Merkle tree.

Incidentally, a side effect of this is that you can't have duplicate
transactions, because you wouldn't be able to distinguish two transactions E
and F such that E=F from E simply being duplicated to pad out to a power of 2.
This works for Bitcoin because duplicate transactions aren't desirable, but it
might not work for other applications.
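
A rough sketch of that side effect, assuming the tree is built by duplicating the last node whenever a level has an odd number of nodes, which for the five-leaf example gives the tree drawn above. Single SHA-256 is used for brevity and `dup_last_root` is a hypothetical name; this is not Bitcoin's actual code.

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def dup_last_root(leaves: list) -> bytes:
    if not leaves:
        raise ValueError("need at least one leaf")
    level = [h(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # pair the last node with a copy of itself
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


five = [b"A", b"B", b"C", b"D", b"E"]
# Explicitly repeating the final element yields the same root as the padding does.
assert dup_last_root(five) == dup_last_root(five + [b"E"])
assert dup_last_root(five) == dup_last_root(five + [b"E", b"E", b"E"])
```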

~~~
bluesign
Yeah I totally agree with you, my comment was, I learned first about merkle
tree from bitcoin, so I was always assuming it is working like this in
general. But seems they were filling leaves to power of 2

------
CodesInChaos
My preferred fix is including the total size of data in the root hash, since
it's often convenient to know the total file size before starting a download.
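
One way this could look, as a sketch only: assume a fixed chunk size, an untagged SHA-256 tree, and an identifier computed as the hash of an 8-byte length followed by the tree root. `naive_root` and `file_id` are hypothetical names.

```python
import hashlib

CHUNK = 1024  # assumed fixed leaf size


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def naive_root(data: bytes) -> bytes:
    """Untagged tree over fixed-size chunks; an odd node is promoted unchanged."""
    level = [h(data[i:i + CHUNK]) for i in range(0, len(data), CHUNK)] or [h(b"")]
    while len(level) > 1:
        nxt = [h(level[i] + level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])
        level = nxt
    return level[0]


def file_id(data: bytes) -> bytes:
    """Bind the total byte length to the tree root in one final hash."""
    return h(len(data).to_bytes(8, "big") + naive_root(data))


original = b"a" * CHUNK + b"b" * CHUNK
forged = h(b"a" * CHUNK) + h(b"b" * CHUNK)      # 64 bytes posing as one leaf

assert naive_root(forged) == naive_root(original)  # bare roots collide
assert file_id(forged) != file_id(original)        # length-bound IDs differ
```

With a fixed chunk size, the total length also pins down the tree shape, so a verifier that knows the length can reject proofs whose path length is wrong for the claimed position.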

------
Agnosco
I think the Wikipedia reference doesn't do it justice - the article alone is
already ~1500 words.

 _Sorry, I couldn't help myself_

