BEP 30: Merkle hash torrent extension

amluto · on Feb 2, 2015

I think it's sad that they're using SHA-1 for this. SHA-1 is a bit weak, and the hashes are too short. There's a reason that SHA-1 is deprecated for X.509 certificates.

At the very least, this should use SHA-256.

If they really did it right, though, the protocol would use a secure tree hash. The construction they're using has trivial collisions, which are only avoided because the size of the file comes from a trusted source. A good hash (e.g. the Sakura construction) doesn't have this problem. Fixing that would make the resulting torrent files or URLs a bit shorter, as the size could potentially be omitted.

nhaehnle · on Feb 2, 2015

The construction they're using has trivial collisions, which are only avoided because the size of the file comes from a trusted source.

Could somebody elaborate on this? I assume that you're referring to the fact that (without the file size information) somebody could pretend that the concatenation of the child hashes at an inner node is actually the file content in this position. Is there anything else?

It seems that this could be trivially fixed by adding a single bit to the data hashed in each node to indicate whether the node is a leaf or an inner node, or by just adding the size information to the hash data in the root node.

Actually, you want to know the file size very early anyway, since this simplifies the data structures required to keep track of chunks you already have, allows you to already reserve hard disk space, and so on.

AlyssaRowan · on Feb 2, 2015

You've pretty much nailed it, yes, that and not hashing the level of the child hashes internally, you can construct a file which pretends to be upper hashes. That is potentially not just collidable but actually second-preimagable, given what we saw with the much older MD4-based ones - and they used SHA-1, which wasn't a great idea either! (Although, it should be noted, in (2009) - could a mod mark the headline such?)

The file size being there does complicate an attack - but with the weaknesses in SHA-1, I certainly wouldn't feel comfortable with it.

This is a disaster of a spec, we already had TTH at this point and that at least did it better: it needed revising and should not be implemented by anyone.

Today, you should consider using BLAKE2b's tree hash for this purpose. It walks all over this construct from every direction.

jfindley · on Feb 2, 2015

I do really like the BLAKE2b hash, but I've been concerned about actually using it in practice (although recently I had an application which it would have suited very well).

I'm worried that, having failed to win the SHA-3 contest it will end up relegated into obscurity, and using obscure hashing functions isn't usually a great idea.

Is this a valid concern, or am I placing too much weight in the NIST process?

sp332 · on Feb 2, 2015

I think BLAKE2 is too fast to be ignored. That's just a guess though.

vog · on Feb 2, 2015

> this should use SHA-256

> the protocol [sh]ould use a secure tree hash

Indeed, these are two huge flaws in the proposal that would surely bite us sooner or later.

Moreover, these issues aren't even mentioned in the "Discussion" section. Instead, they discuss pretty minor stuff such as binary versus n-ary trees or how to interface legacy clients.

mox1 · on Feb 2, 2015

Are you serious?

Realistically, these are bits of a video stream, not your bitcoin wallet or some other bits where security it of the upmost concern. Were talking millions of dollars of equipment to find a collision in SHA1 today....

How exactly is a 160 bit hash too short? Collisions can be had after 2^80 trys in naive scenario and 2^57.5 with an active attacker, not exactly easy...

amluto · on Feb 2, 2015

Torrents can be, and often are, executable. Many Linux distributions are available over Bittorrent, for example.

Breaking crypto, especially new crypto, should pass a much higher bar than "not exactly easy". 2^57.5 is not all that large by the standards of a big cloud provider or a government.

Gurkenmaster · on Feb 2, 2015

It's not new. The document was created in 2009.

mox1 · on Feb 2, 2015

Again "not exactly easy" today will cost you > $1 MILLION dollars, you are greatly exaggerating the actual problem.

If you have millions of dollars to spend and want to commit digital crimes, there are better methods than flipping bits in torrents......

wereHamster · on Feb 2, 2015

$1 million today, at the cost of the average PC in a couple years. Think of the future.

spacemanmatt · on Feb 2, 2015

Think of the adversary, too.

qqueue · on Feb 2, 2015

Hash trees are pretty cool. For some fancier uses, see the Peer to Peer Streaming Protocol (PPSP), which is authored by the libswift/tribler guys.

https://datatracker.ietf.org/wg/ppsp/documents/

Basically, instead of the stream source having to sign every single new chunk (so peers can verify that they're getting the right data), the source signs subtree hashes of the new data and slowly builds up a larger hash tree. Once the stream is over, the complete hash tree is instantly seedable by anybody in the original stream.

amirouche · on Feb 2, 2015

The (other?) big win of pspp is the ability to stream live feeds.

zmanian · on Feb 2, 2015

Very similar to IPFS's merkle DAG and seems like a critical element of getting content centric networking projects like Bitorrent, Inc's Project Malestrom up and running

andrewchambers · on Feb 2, 2015

Isn't this how git works? I would have thought bit torrent did this all along.

JeremyBanks · on Feb 2, 2015

Not exactly. Git uses per-file, per-commit, and per-directory hashes. It does make up a sort of hash tree, but the tree does not descend within a given file. You need to hash the entire file and determine its hash to know if any of the pieces you have are valid. This would be a problem for large files if many of those pieces come from untrusted sources -- you'd have to spend a lot of time/bandwidth downloading invalid data before you realized it. This isn't generally the case for Git, but it for BitTorrent.

BitTorrent traditionally solves this using a hash list. All of the data in a torrent is broken up between pieces of a chosen size, and hashes are calculated for each of those pieces individually. (These are generally not aligned with file boundaries, which is why you may have noticed that even if you tell your torrent to only download a certain set of files, you may still end up with some data from adjacent files.)

This entirely hash list is included in the torrent file. For torrents with a lot of data, this could be 10MB or more.

If you're using a magnet link, that torrent file needs to be downloaded from peers. This brings back the original problem: you need to download this entire large file before you know that the peer isn't just sending you random data.

BEP-30 proposes a solution: generate a binary hash tree whose leaves are the torrent pieces, and include only the single root hash in the torrent file to keep it small. When you're getting pieces from a peer, they send you the missing inner hashes of the tree that you need to verify the piece.

The minimum data transfer in ideal circumstances is increased a bit, but the peer-to-peer system is made more robust, able to identify invalid data much faster.

I think it's a great modification to the protocol. Unfortunately it isn't widely-enough supported to be practical for general use.

im3w1l · on Feb 2, 2015

>The minimum data transfer in ideal circumstances is increased by log(N)

There should be roughly as many internal nodes as there are leaves, so there is a linear space increase. As the leaves are much bigger than the internal nodes, the linear factor is small.

rakoo · on Feb 2, 2015

Yes it is, but it makes much more sense for git to use it: if a single byte is changed somewhere, you only need re-hashing the file and the trees up to the root; the other blobs/trees are left unchanged. It is very useful in this setting to reuse older existing hashes. In bittorrent, OTOH, we suppose every swarm will share different content, so there was no point making sure a hash would be reused somewhere.

beagle3 · on Feb 2, 2015

git's smallest unit is a file - if your repostiory has one 3GB file, that's what is going to be hashed.

but bup[0] does so within a git repository to make incremental backup more efficient.

[0] https://github.com/bup/bup

whyrusleeping · on Feb 2, 2015

ipfs uses exactly the same strategy to distribute content via the hash of their root merkleDAG node. http://ipfs.io

edonkey · on Feb 2, 2015

"Large torrent files put a strain on the Web servers distributing them".

Finally! bittorrent designers are acknowledging the centralized deficiency in the torrent protocol and implementing Merkle / root hash distribution model, as eDonkey / eMule used since 2000's:

https://en.wikipedia.org/wiki/Ed2k_URI_scheme#eD2k_hash_algo...

15 years later, everything old is new again.

gojomo · on Feb 2, 2015

ED2K was kind of a degenerate Merkle tree: large chunks (9.5 MiB) and only one level of leaves, all under the root.

Thus it didn't have some of the benefits of a full tree that this 2009 Bittorrent spec was hoping to achieve, such as verifying smaller-sized chunks without a metadata-cost that grows linearly with the size of the full resource.

(AFAIK, the 1st application of multi-level Merkle trees to P2P filesharing was the TigerTree hash I wrote up with Justin Chapweske in 2002. At first glance, it looks like this proposal makes the same mistake we did in our first draft, not distinguishing between leaf and node hashes, corrected in the final TigerTree spec version of March 2003.)

adamzochowski · on Feb 2, 2015

edonkey has two methods of file verification. The ED2K hash as you have mentioned. However, around 2004, edonkey world also received nicer AICH hashes. People that generate ed2k style links are encouraged to make include the

To recap:

ICH : inteligent corruption hash : also known as the old original ed2k method. Root hash is md4 and is generated from a series of md4 hashes generated for 9.5mb chunks. If a file is below 9.5mb then ed2k root hash is just the real md4 hash.

AICH : Advanced inteligent corruption hash : is a full Merkle tree using SHA1, where chunks are 180kb (with exception of a chunk on the 9.5MiB boundary). This weird part/chunking allows to map 53 of these AICH chunks perfectly into the ed2k chunks.