
The authors of Keccak / SHA-3 came up with Sakura[0], a hash tree construction that's provably as collision-resistant as the underlying hash function and very flexible.

If you're designing a new system using hash trees, you should use Sakura trees or have a good explanation for why not.

Edit: I also wrote some demo code[1] for this attack when arguing with one of the IPFS guys about why they really shouldn't use Bittorrent BEP 30-style Merkle Trees. In order to get security with Bittorrent-style trees, the [length, root] pair really needs to be your cryptographic identifier, and you need to make sure they're always properly handled as a pair. There are just too many caveats to usage and we have provably secure alternatives.
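To make the attack concrete, here's a small sketch (not the linked demo code; an illustrative reconstruction using SHA-256 and hypothetical chunk contents). In a BEP 30-style tree with no leaf/interior domain separation, the 64-byte concatenation of two child hashes can itself be presented as a single leaf, producing a different "file" with the same root:

```python
import hashlib

def H(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

# A naive (BEP 30-style) tree with no leaf/interior domain separation.
# Hypothetical 2-leaf file: root = H(H(a) || H(b)).
a = b"chunk-a" + b"\x00" * 25
b = b"chunk-b" + b"\x00" * 25
naive_root = H(H(a) + H(b))

# Attack: present the 64-byte string H(a)||H(b) as a *single leaf*.
# Its one-leaf "tree" root is H(H(a) || H(b)) -- identical to the
# 2-leaf root, so two different files collide unless [length, root]
# always travel together as the identifier.
forged_leaf = H(a) + H(b)
assert H(forged_leaf) == naive_root
```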

As a former LimeWire developer, it makes me sad that Gnutella's TigerTrees avoided this vulnerability long before bittorrent's BEP 30 was published, and yet Bittorrent got it wrong. It was a well-known vulnerability, and was covered as part of my interview at LimeWire.

[0] https://keccak.team/files/Sakura.pdf
[1] https://github.com/kmag/bad_examples/tree/master/bad_merkle_...




The RFC 7574 Merkle tree solves that quite strictly (I am the author of the scheme).

The maximum file size is 2^64 bits, so the hash tree is defined as a binary tree covering the entire 2^64 range, with the leaf layer made of 1 KB pieces. The hash of an empty range is defined to be 0, at any level. That way, each hash is reliably fixed to its interval ("bin").

There are some features derived from that. For example, you can prove file size by showing a logarithmic number of hashes. A file is identified by its root hash (covering the entire 2^64 range), no additional metadata needed. And so on.
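A minimal sketch of the idea (my reading of the scheme as described above, not the RFC's exact wire format): every bin covers a fixed interval, empty ranges hash to all-zero, and the root nominally covers the whole 2^64 range. Here a small 8-chunk span stands in for the full range, with trailing empty bins hashing to zero:

```python
import hashlib

CHUNK = 1024          # 1 KB leaf pieces, per the scheme described above
EMPTY = b"\x00" * 32  # hash of any empty range is defined as all-zero

def bin_hash(data: bytes, lo: int, hi: int) -> bytes:
    """Hash of the bin covering chunks [lo, hi) of `data` (illustrative)."""
    if lo * CHUNK >= len(data):
        return EMPTY                      # empty range: fixed zero hash
    if hi - lo == 1:
        return hashlib.sha256(data[lo * CHUNK:hi * CHUNK]).digest()
    mid = (lo + hi) // 2
    return hashlib.sha256(bin_hash(data, lo, mid) +
                          bin_hash(data, mid, hi)).digest()

# The root nominally covers the entire 2^64-bit range; a small
# power-of-two span with empty (zero) right bins stands in for it here.
data = b"x" * 3000                        # ~3 chunks of payload
root = bin_hash(data, 0, 8)               # 8-chunk tree, 5 bins empty
```

Because every hash is pinned to a fixed interval, the sibling hashes on the path to the last non-empty chunk double as a logarithmic-size proof of file length.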

These days, the 7574 scheme is used in the DAT protocol.


Looking at 7574 and Bittorrent's BEP 30, unless I'm missing something, the only part that resists the attack mentioned in the source article and in the comment you are replying to is that the total length of the data hashed is specified along with the block size. So you're getting the root hash, total size, and leaf block sizes; it's that secondary information that is necessary to prevent the imaging attacks mentioned in the source article and in the comment you are replying to.

With a better construction (Sakura, ...), that total and block size information is unnecessary. I wonder if any almost-compliant RFC 7574 or BEP 30 tools accidentally ignore those other sizes.


Having an RFC is better than nothing, and I don't see any obvious vulnerabilities in your scheme.

I've posted several places about being able to take the hash of the rightmost chunk and log(N) other hashes as a compact proof of file length for a hash tree root. (Including, I think, the posts I had arguing with the IPFS folks about using BEP 30 Merkle Trees.) That's handy, but there's nothing novel about RFC 7574 there.

However, 2^64 bits is a bit of an arbitrary limit. What's the justification there? SHA-512 and SHA-384 support up to 2^128-1 bits, and SHA-3 doesn't have such a limit.

More importantly, I don't recognize the names of the RFC authors from any cryptographic analysis. There's a big difference between a scheme that looks pretty good to a ton of random people who have looked at it and a scheme designed and published by world-famous cryptographers. Has your custom tree construction undergone formal review by qualified cryptographers? It looks pretty good to me, but if I had to bet my career on the security of a tree hash scheme, I'd prefer to go with the people who brought us Keccak/SHA-3 (and one of the people who brought us AES, despite its known weaknesses).


1. 7574 is not "novel" indeed. About 10 years old, if you count the original draft.

2. 2^64 is quite a lot, for a single file. The variant I described deals with a single static file. Also, it is part of a network protocol, so there are some requirements. Like, using standard integer arithmetic for packet-level processing.

3. Feel free to show it to any people you like.


Can I ask you a quick question: do you think Merkle trees are a good model for key revocation? I'm currently working on a cryptographic file system for a class, and trying to speed up the sharing and de-sharing of files via symmetric keys. Right now, I'm thinking of adapting the tree-based certificate revocation model that Micali talked about [1]. Thanks!

[1] https://patents.google.com/patent/US6097811


Merkle Trees are a good data structure whenever you want to securely represent an ordered set very compactly, with fairly compact proofs of set membership.

If you're stuck with an architecture that depends on certificate revocation lists, I think Micali's representation of the CRL is both efficient and secure. On the other hand, blacklists (such as certificate revocation lists) are generally easier to exploit than whitelists (such as the published lists used in certificate transparency).

I think we all would have been better off with an infrastructure that mandated certificate transparency, where a Merkle Tree of all valid certificates was published periodically by each Certificate Authority. In other words, keep a cryptographically verifiable whitelist instead of a cryptographically verifiable blacklist. For scalability, you'd want to have the published list be a Merkle Tree root of valid certificates, plus a signed tree hash of a Bloom filter of all of the certificates revoked in (say) the past month. From a signed Merkle Tree root, each site can create a compact proof of validity for its certificate of log(N) size. Using the signed Bloom filter, the site can keep reusing that proof for (say) a month. You'd want the Bloom filter to use a keyed hash function so that a collision in one publishing wouldn't be likely to keep colliding for the whole month.
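A rough sketch of the keyed Bloom filter piece (my own illustrative construction, not a production design; the epoch key, sizes, and serial format are all hypothetical). Keying the hash functions per signing epoch means a false positive in one published filter is unlikely to recur in the next:

```python
import hashlib
import hmac

class KeyedBloom:
    """Tiny Bloom filter whose hash functions are keyed per epoch, so a
    collision (false positive) in one publishing period is unlikely to
    persist into the next (illustrative parameters, not tuned)."""
    def __init__(self, epoch_key: bytes, m_bits: int = 8192, k: int = 4):
        self.key, self.m, self.k = epoch_key, m_bits, k
        self.bits = bytearray(m_bits // 8)

    def _positions(self, item: bytes):
        # Derive k bit positions from HMAC(epoch_key, counter || item).
        for i in range(self.k):
            d = hmac.new(self.key, bytes([i]) + item, hashlib.sha256).digest()
            yield int.from_bytes(d[:8], "big") % self.m

    def add(self, item: bytes):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: bytes) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

bloom = KeyedBloom(b"epoch-2024-01-key")   # key rotates each signing epoch
bloom.add(b"revoked-cert-serial-123")
```

The CA would sign the filter bits together with the epoch key and a timestamp; a certificate that collides this epoch just fetches a fresh Merkle proof instead.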


Actually, certificate revocation and roles / memberships / ownership / access control lists etc. are all the same issue.

You need a concept of a timestamp of a message on a stream relative to some other stream of data.

This is a distributed computing problem.

To solve this for Intercoin, we invented the Permissionless Timestamping Network

https://intercoin.org/technical.pdf

Look for descriptions of PTN. That's what you REALLY need for revocations.

The problem of revocations is then really reduced to a search and incentive problem.


How would you deal with false positives in the Bloom filter blacklist? Is there a three-part check?:

1) in whitelist

2) maybe in blacklist

3) check cert not in actual blacklist

And a certificate is valid if 1 (and not) 2, or if 1 and 2 and 3?

[ed: I see you mentioned a keyed hash "to avoid colliding for a whole month". But the risk of a few random days of service (certificate) unavailability seems like an unacceptable design tradeoff? (as there will be other, non-logical reasons for downtime too) ]


Of course, an actual certificate system would need much more thought and more details filled in than below (not to mention review by multiple actual experts), but here's a rough sketch of a possible design:

The high-level view is that you have a whitelist, but to improve the scalability of constructing proofs of membership, we allow for longer proof lifetimes, with Bloom filters to shorten them when needed. In case of a collision in the Bloom filter, you need to construct a newer proof than would ordinarily be required.

With certificate revocation lists, the client doesn't hammer the CRL server with one request for every new connection it tries. 24 hours is a common lifetime for a CRL, during which the client wouldn't typically re-check for a new CRL.

So, presumably, every half month (similar to renewing DHCP leases when they're half expired) a server would get a freshly timestamped signature of the root of the CA's tree of valid certificates and get the list of hashes up the path from the server's certificate up to the root of the tree. The root signature, the list of hashes, and the number of leaves to the left of the server's certificate leaf in the tree constitute a compact proof of membership in the set of valid certificates.
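Verifying such a compact proof is just recomputing the root from the leaf, its index, and the log(N) sibling hashes. A minimal sketch (illustrative; a real CA proof would also carry the signed, timestamped root and use domain-separated hashing):

```python
import hashlib

def H(x: bytes) -> bytes:
    return hashlib.sha256(x).digest()

def verify_inclusion(leaf: bytes, index: int, path: list, root: bytes) -> bool:
    """Recompute the root from a leaf, its leaf index, and log(N)
    sibling hashes -- the compact proof of membership described above."""
    node = H(leaf)
    for sibling in path:
        if index % 2 == 0:
            node = H(node + sibling)      # we are the left child
        else:
            node = H(sibling + node)      # we are the right child
        index //= 2
    return node == root

# Hypothetical 4-leaf tree of certificate hashes.
leaves = [b"cert-a", b"cert-b", b"cert-c", b"cert-d"]
l = [H(x) for x in leaves]
root = H(H(l[0] + l[1]) + H(l[2] + l[3]))
proof_for_c = [l[3], H(l[0] + l[1])]      # siblings on the path for index 2
assert verify_inclusion(b"cert-c", 2, proof_for_c, root)
```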

In order to have equivalent worst-case revocation time to a CRL with 24 hour validity, the server would also need to present either:

   1. A proof of membership in the valid set of certificates where the CA's root signature is at most 24 hours old, OR
   2. A proof of membership in the valid set of certificates within the past month, plus a signed Bloom filter less than 24 hours old in which the server's certificate doesn't appear
If the CA signs Bloom filters every 6 hours and uses keyed hash functions with changing keys, then if any one of the 3 latest signed Bloom filters doesn't have a collision with your certificate, you can re-use your proof. If not, you need to go and reconstruct a new compact proof before your 24 hour window expires.

Of course, there's a denial-of-service opportunity here, just like there is with a CRL system. With a CRL system, there's fail-safe (don't allow any connections if unable to get a new CRL) or fail-dangerous (if the CRL server is down, just keep using the most recently seen CRL). One could have a similar fail-dangerous whitelist system where, if the server can't get a Bloom filter signed within the past 24 hours, the client tries to get a signed Bloom filter within 24 hours via an independent channel, and if this fails, accepts the server's Bloom filter if it's less than (say) 7 days old. I'm not necessarily advocating fail-dangerous, just stating that the denial-of-service mitigation is similar to that of CRL systems.

Note that if you don't actually need transparency, you can use a cryptographic accumulator in place of a Merkle Tree root. The advantage of a Merkle Tree root is that if leaves are ordered in terms of DNS name (.com.google.maps, .org.eff, etc.) then the CA can give you a compact run of all of the certificates ordered for your domain, as well as the certificate right before and the certificate right after your domain, plus a short list of hashes, and you can prove that you have a complete list of the valid certificates issued for your domain.

All of the cryptographic accumulators I'm aware of don't carry any proof of order, so there's no compact proof that you've been given all of the certificates in a certain name range.

One broadcast of one Merkle Tree root signature and one signed Bloom filter every 6 hours for a few hundred CAs is really easy to scale up. Presumably there would be something similar to NTP or a P2P broadcast network along with the major CDNs all providing caches of this information. All of these elements are the same for all participants in the system: the amount of data distributed is constant per CA and independent of the number of issued certificates. Unfortunately, for a constant false positive rate, the size of the Bloom filters needs to scale with the number of revoked certificates.

Constructing the compact proofs of membership requires interaction with some server that has access to the list of all of the leaves of the tree, which is heavier weight, and why we give these proofs a lifetime of a month.


> ordered set

Do you mean "vector"?


That depends on which domain's vocabulary we're using.

I don't have much formal mathematical education, but I believe the normal term in Set Theory for the mathematical construct is an ordered set. I know next to nothing of Abstract Algebra, so I won't speculate on defining your normal linear algebra vector operators for vectors of Certificates.

C++ and Java call the most obvious data structure for it a [Vv]ector. Python calls the most obvious data structure a List. Some data structures textbooks would call the most obvious data structure a dynamically sized array (or a fixed sized array, for that matter).


It's not merely a nomenclature issue; they actually mean different things. A "set" (whether ordered or not) implies you can't have duplicates. A "vector" (which I suppose you could refer to as an "ordered multiset" or "ordered bag" if you really feel like using those terms) allows duplicates. I'm asking if not having duplicates is really a constraint you intended to impose, because I don't see why we need it.


You're right. A Merkle Tree works perfectly well with duplicate entries.

For the certificate authority case, I see verification and efficiency advantages in not allowing duplicate entries, but your point is well taken that deduplication wouldn't be required.


If I understand your demo implementation correctly then you simply append 0x00 for leaf nodes and 0x01 for tree nodes?

The paper isn't terribly clear on how to implement it, though I was only able to skim over it due to being otherwise occupied atm.


> If I understand your demo implementation correctly then you simply append 0x00 for leaf nodes and 0x01 for tree nodes?

If I understand the paper correctly, one should append 0x00 for root nodes too. It's a little odd that roots & leaves are treated the same, but the math works out correctly.

(also, I think the paper appends a 1 bit for roots & leaves and a 0 bit for interior nodes — but it's all much of a muchness)
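A minimal sketch of the byte-suffix variant discussed in this subthread (illustrative only; the actual Sakura coding frames with bits and supports far more tree shapes). The point is that a leaf digest can never be reinterpreted as an interior digest, because the final framing byte differs:

```python
import hashlib

LEAF, INNER = b"\x00", b"\x01"    # one-byte frame suffixes (demo-style;
                                  # the paper itself frames with bits)

def H(x: bytes) -> bytes:
    return hashlib.sha256(x).digest()

def tree_hash(chunks):
    """Domain-separated binary hash tree: interior nodes are hashed with
    a different suffix than leaves, so the concatenation-of-child-hashes
    attack no longer produces a colliding root."""
    level = [H(c + LEAF) for c in chunks]
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            nxt.append(H(level[i] + level[i + 1] + INNER))
        if len(level) % 2:                # odd node promoted unchanged
            nxt.append(level[-1])
        level = nxt
    return level[0]

# Reinterpreting H(a)||H(b) as a leaf now yields a *different* root.
la, lb = H(b"a" + LEAF), H(b"b" + LEAF)
assert tree_hash([la + lb]) != tree_hash([b"a", b"b"])
```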


Brings back good memories. Pad your leaf and inner nodes with different bytes y’all.


Can you summarize the relevant ideas and techniques in the paper?

It's pretty long.


On a side note: I really love that a hash TREE coding is named Sakura—this is super cute and very punny!


Since it is such a natural combination, it is difficult to search for the name in crypto context though :)



