
RFC 6920: Naming Things with Hashes - tosh
https://tools.ietf.org/html/rfc6920
======
btrask
Lots of comments are asking what problem this solves.

First, content addressing is the best general solution to exactly-once message
delivery in a distributed system. Examples of distributed message protocols
include (ahem) email, HTTP and RSS. The alternative, UUIDs, is strictly worse
in almost every way (which is to say, if you're currently using UUIDs for
anything, consider switching to hashes of the content).
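To make the UUID comparison concrete, here's a minimal Python sketch (the message bytes and processing loop are hypothetical) of exactly-once delivery via content addressing: identical bytes always produce the same identifier, so a redelivered message is caught by a simple set lookup, which a random UUID can't give you.

```python
import hashlib
import uuid

message = b"Subject: hi\nFrom: alice@example.com\n\nHi Bob!"

# A random UUID mints a fresh identifier every time, even for identical bytes.
random_id = str(uuid.uuid4())

# A content address is derived from the bytes themselves, so any party can
# recompute it and detect a duplicate without coordination.
seen = set()
for incoming in [message, message]:  # simulate a redelivered message
    msg_id = hashlib.sha256(incoming).hexdigest()
    if msg_id in seen:
        continue  # exactly-once: the duplicate is dropped
    seen.add(msg_id)
    # ... process the message here ...
```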

Second, content addressing guarantees the integrity of messages, meaning as
long as you have the hash, you can get the data from anywhere. This is highly
useful for mirroring and failover. Basically, we might be able to solve
linkrot to a large degree, and make sites easier to mirror (especially
locally, which is useful for high-latency and offline use).

I wrote up a short comparison of this protocol, along with a few
alternatives, a while back.[1] (Note that the IPFS comparison is incorrect;
it should have an /ipfs/ path prefix.)

[1]
[https://bentrask.com/?q=hash://sha256/1e9a6b770ef9f1ca894af4...](https://bentrask.com/?q=hash://sha256/1e9a6b770ef9f1ca894af47ca7541f8fd36da624631d3fe79ede0cef1e5a0c1f)

~~~
haberman
If you use a message hash as your unique identifier for exactly-once delivery,
your messaging system won't let a user send the same message twice, even if
they mean to.

You could wrap the user's message in some metadata, including a UUID, to
ensure the overall content is distinct from any time the user previously
posted that message. But now you're back at UUIDs, which you say you don't
like.

~~~
btrask
You need to think about what makes messages distinct. Content addressing will
help you as long as you have a rigorous definition of identity, which you
might otherwise be doing ad-hoc when you assign UUIDs.

If the same message is meaningful at different points in time, your hashed
content should include a timestamp (and of course you need to choose an
appropriate level of precision). If your message is meaningful in different
contexts, it needs to reference that context somehow (ideally by hash).

The payoff for doing this well is (1) the elimination of double-sent emails,
double posts, etc. (when the software might not know whether it went through
the first time), and (2) the ability to deduplicate across a network
partition, for example if a user makes the same change on their computer and
on their phone and then syncs them.
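As a sketch of that idea (the field names and the one-minute precision here are illustrative choices, not anything from the RFC), the hashed content can bundle the body with a coarse timestamp and a hash of the context:

```python
import hashlib
import json

def message_id(body, author, timestamp_minute, parent_hash=None):
    # Hash everything that makes the message distinct: a retried send in
    # the same minute collides (and is deduplicated), while a deliberate
    # repost later, or in a different context, gets a fresh identifier.
    canonical = json.dumps(
        {"body": body, "author": author,
         "t": timestamp_minute, "parent": parent_hash},
        sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

original = message_id("hi", "alice", 29_000_000)
retry    = message_id("hi", "alice", 29_000_000)  # same minute: collides
later    = message_id("hi", "alice", 29_000_500)  # later: distinct
```

The canonical-JSON step matters: both devices must serialize identically (sorted keys, fixed separators) or the hashes won't match across the partition.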

~~~
zAy0LfpBZLC8mAC
... except wallclock timestamps for ordering events in a distributed system
are a terrible idea in most cases, for the simple reasons that (a) they're
not (even close to) a guaranteed-unique event identifier (the same time
happens in many locations), (b) you usually don't have a clock of sufficient
precision to reliably establish a total order of distinct moments, and (c)
even if you do, that order does not necessarily preserve causality.

Now, UUIDs don't solve (b) or (c) either, but at least they actually give you
a reliable identifier for events.

~~~
btrask
No one is talking about ordering events.

~~~
zAy0LfpBZLC8mAC
... in which case my post explains why UUIDs are the better solution.

~~~
btrask
Content hashing doesn't rely on timestamps to act as a unique identifier. It
relies on _all of the content_, which may (or may not) include a timestamp,
depending on your application. To the degree that collisions occur, that's a
feature, not a bug. If you get undesirable collisions, you aren't hashing the
right data.

Edit: to clarify, content hashes are just like UUIDs except you can get useful
collisions _if you want them_.

------
Einherji
Sounds similar to Joe Armstrong's ideas[0]. Very interesting.

[0]:
[https://joearms.github.io/2015/03/12/The_web_of_names.html](https://joearms.github.io/2015/03/12/The_web_of_names.html)

~~~
skywhopper
Interesting ideas, but when it gets into utopian promises about being able to
find all versions of a file ever on the web, it reminds me of the talk around
RDF[1] and FOAF[2] in the late 90s/early 00s. Not to mention pre-Web hypertext
theory, PGP, and any number of other things. All these things can be very
useful patterns in tightly restricted, highly curated environments with
specific use cases (whether or not they are part of the larger web). But they
aren't panaceas and they don't address the underlying sources of the problems
they are trying to work around, namely that all this unorganized junk on the
web they aim to take control of is created by people and companies who are one
or more of: lazy, sloppy, confused, conflicted, error-prone, avaricious,
malicious, obnoxious, cruel, naive, desperate, or ignorant.

[1]
[https://en.wikipedia.org/wiki/Resource_Description_Framework](https://en.wikipedia.org/wiki/Resource_Description_Framework)
[2]
[https://en.wikipedia.org/wiki/FOAF_(ontology)](https://en.wikipedia.org/wiki/FOAF_\(ontology\))

~~~
Spooky23
What problem is this solving?

One of the magical things about the web is that you have the freedom to
distribute information as you wish, and DNS namespace scopes everything.

------
s986s
OK, so the simple example you gave is very abstract. Alice and Bob may have a
great time talking on the phone, but I felt no closer to understanding its
purpose.

I do understand the purpose in requesting a resource from Alice that has no
IP address associated with it. Say Bob and Alice are part of a network. Alice
saves a file on one of the computers but has no idea which one. Bob wants
that file. The Hash Identifier protocol provides a standardized way to
retrieve it: using it, Bob makes a request to all computers and only one
responds.

The RFC felt far more boring than this topic is. Technologies such as magnet
URIs and IPFS are your competition. I would argue that security is not the
purpose of this, though it can be used in conjunction with it. I believe this
is far more effective for distributed systems.

------
candeira
Sounds similar to [https://IPFS.io/](https://IPFS.io/)

~~~
kodablah
Specifically
[https://github.com/jbenet/multihash](https://github.com/jbenet/multihash)

~~~
candeira
Multihash/multiaddr, yeah.

------
KMag
It seems that the authors have decent ideas, but don't have much experience
with the subject matter. If you look at the assigned identifiers in the
binary scheme, the authors believe that 32-bit truncated cryptographic hash
functions are useful. Even 64-bit truncated cryptographic hash functions are
essentially useless from a security standpoint, so the 32- and 64-bit
variants would be better served by CRC-32C and CRC-64ECMA. If your standard
contains a 64-bit truncated cryptographic hash, people without subject-matter
expertise will assume that it provides more than a modicum of security.

Also, the authors apparently vastly underestimate the utility of using Merkle
trees. I used to be an engineer for LimeWire. Gnutella used SHA-1 as
identifiers for the content. Unfortunately, this means only being able to
integrity-check an entire large file instead of small pieces, or having to get
a Merkle tree root out-of-band. LimeWire just got the Merkle tree root (plus
one row of the tree) from the first peer it contacted for download data. This
represents a simple denial-of-service poisoning attack: the attacker hands
out a bogus Merkle tree under which something like 10% of the blocks appear
corrupted. The clients would then repeatedly request the "corrupted" blocks
from peers, until giving up, notifying the user that the download was
corrupted, and most likely keeping around a file that had sections that
weren't integrity checked. (Corruption really does happen. The TCP checksum is
rather weak.) If they had used the Merkle tree root as the file identifier
instead, then there would be no opportunity to trick the client into
incorrectly associating the wrong Merkle tree with a given file.

If you're going to define a hash URI scheme, please incorporate Sakura trees
(a provably secure hash tree scheme) of degree 2 with a fixed leaf block size.
Leaving the block size variable leads to the Bittorrent problem where a single
set of identical files has multiple identifiers and clients using hashing at
one granularity can't share information with clients using hashing at another
granularity. Merkle trees with a single standard leaf block size allow
different levels of the tree to be shared, in effect giving different
granularities, without dividing resources due to the multiple-identifier
problem.
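A minimal sketch of the fixed-leaf-size idea (this is a generic binary hash tree, not the actual Sakura construction; the 1 KiB leaf size and the domain-separation tags are my assumptions):

```python
import hashlib

LEAF_SIZE = 1024  # fixed leaf block size, so all clients agree on granularity

def _node(tag, payload):
    # Domain-separate leaf hashes from interior hashes so a leaf can never
    # be confused with an interior node.
    return hashlib.sha256(tag + payload).digest()

def merkle_root(data):
    blocks = [data[i:i + LEAF_SIZE]
              for i in range(0, len(data), LEAF_SIZE)] or [b""]
    level = [_node(b"\x00", blk) for blk in blocks]
    while len(level) > 1:
        # Re-hash a lone trailing node rather than promoting it unchanged,
        # so the tree shape (and hence the file length) is committed at
        # every level, as the comment above recommends.
        level = [_node(b"\x01", b"".join(level[i:i + 2]))
                 for i in range(0, len(level), 2)]
    return level[0]
```

Because every level re-hashes, a 2049-byte file and a 2048-byte file with the same leading bytes get different roots, even though they share their first two leaves.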

Also, in the case where there are 2^N + 1 blocks (or any other case where
you'd be tempted to "optimize" by skipping a node at that level), please have
a re-hashing node at each level for all blocks. This means that the final
block (along with at most one extra hash per tree level) constitutes a
cryptographic proof of the file length corresponding to the tree root.
Otherwise, in order to avoid certain denial of service attacks, you also need
to always put the length of the file as part of the URI, or the first client
needs to send the entire bottom row of the Sakura tree.

Note that the Bittorrent Merkle tree format is broken in the same way that the
original Gnutella tiger tree proposal was broken (fixed before implementation
in the Gnutella case). Use Sakura trees. They're provably as secure as the
hash function you use.

~~~
davidp
_If you look at the assigned identifiers in the binary scheme, the authors
believe that 32-bit truncated cryptographic hash functions are useful._

No they don't. From Section 2:

    
    
       The sha-256 algorithm as specified in [SHA-256] is mandatory to
       implement; that is, implementations MUST be able to generate/send and
       to accept/process names based on a sha-256 hash.  However,
       implementations MAY support additional hash algorithms and MAY use
       those for specific names, for example, in a constrained environment
       where sha-256 is non-optimal or where truncated names are needed to
       fit into corresponding protocols (when a higher collision probability
       can be tolerated).
        
       Truncated hashes MAY be supported.  When a hash value is truncated,
       the name MUST indicate this.  Therefore, we use different hash
       algorithm strings in these cases, such as sha-256-32 for a 32-bit
       truncation of a sha-256 output.  A 32-bit truncated hash is
       essentially useless for security in almost all cases but might be
       useful for naming.  With current best practices [RFC3766], very few,
       if any, applications making use of names with less than 100-bit
       hashes will have useful security properties.
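For concreteness, the quoted naming scheme can be sketched as follows (the ni URI format is RFC 6920's: algorithm name, then the unpadded base64url digest; the helper name is mine):

```python
import base64
import hashlib

def ni_name(data, truncate_bits=None):
    # Build an RFC 6920 "ni" URI (with an empty authority) for the bytes.
    digest = hashlib.sha256(data).digest()
    alg = "sha-256"
    if truncate_bits is not None:
        alg = "sha-256-%d" % truncate_bits    # e.g. sha-256-32, per Section 2
        digest = digest[:truncate_bits // 8]  # keep only the leading bits
    value = base64.urlsafe_b64encode(digest).rstrip(b"=").decode()
    return "ni:///%s;%s" % (alg, value)

full = ni_name(b"Hello World!")
# "ni:///sha-256;f4OxZX_x_FO5LcGBSKHWXfwtSx-j1ncoSt3SABJtkGk"
short = ni_name(b"Hello World!", truncate_bits=32)  # naming only, no security
```

As the quoted text says, the truncation is carried in the algorithm string itself, so a 32-bit name can never be mistaken for a full sha-256 name.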

~~~
btrask
32-bit hashes are pretty much useless for content addressing. If you look at
the table here:
[https://en.wikipedia.org/wiki/Birthday_problem#Probability_t...](https://en.wikipedia.org/wiki/Birthday_problem#Probability_table)
you'll see a 50% chance of a collision with only ~77,000 resources.
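The 77,000 figure falls out of the standard birthday approximation; a quick sketch (the function name is mine):

```python
import math

def fifty_percent_collision_count(bits):
    # Birthday bound: P(collision) ~= 1 - exp(-n^2 / 2^(bits+1)),
    # which reaches 50% at n ~= sqrt(2 * ln(2) * 2^bits).
    return math.sqrt(2 * math.log(2) * 2.0 ** bits)

for bits in (32, 64, 96):
    print(bits, round(fifty_percent_collision_count(bits)))
# roughly: 32 bits ~77,000 items; 64 bits ~5.1 billion; 96 bits ~3e14
```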

See also:
[https://lkml.org/lkml/2010/10/28/287](https://lkml.org/lkml/2010/10/28/287)
where Linus Torvalds says 12 hex digits (96 bits) is pretty much the minimum
short-hash for the Linux kernel commit history.

Extremely short hashes can be useful _briefly_ for manually transcribing
between devices, as long as you immediately "resolve" them back into a longer
form, before new collisions can happen. But this is more on par with clicking
"I'm feeling lucky" than creating a link. :)

~~~
batbomb
I came here to talk about this. I deal with data management for a bunch of
physics and astronomy experiments, and it's quite typical to have many
millions of files for any given experiment (say, 1 PB of storage for a
medium-size experiment: 10 million 100 MB files). CERN's experiments would
easily have billions of files, so I was thinking 64 bits would be a minimal
truncation, but 96 is probably more reasonable.

------
Drup
For a split second, I thought it was a 23-page RFC to formalize hashtags.

I got afraid.

~~~
_jomo
So did I; the RFC being from April 2013 didn't help.

------
mchahn
This seems like a very useful concept. Are there any plans to implement it?

~~~
2bitencryption
Unless I misunderstand what this is for, a similar concept is already used
for trackerless BitTorrent via magnet links:
[https://en.wikipedia.org/wiki/Magnet_URI_scheme](https://en.wikipedia.org/wiki/Magnet_URI_scheme)

It's not the exact same specification, but I imagine these two concepts are
similar?

------
morenoh149
Reminds me of the hash-n-slash proof of concept:
[https://github.com/amoffat/hash-n-slash](https://github.com/amoffat/hash-n-slash)

------
ajkjk
Can anyone explain the motivation for this? I'm not really getting it from the
document and don't have any background.

~~~
darkhorn
My thoughts: this standardizes hash links, so now we can have web sites that
cannot be blocked.

------
nbevans
How are hash conflicts/collisions handled?

~~~
akx
I don't think there'll be a feasible collision attack on SHA-256 anytime soon.

~~~
nbevans
It doesn't need to be an attack. Simply a case where two files happen to hash
down to the same bytes.

~~~
akx
I wouldn't worry...
[http://stackoverflow.com/a/4014407/51685](http://stackoverflow.com/a/4014407/51685)

------
stcredzero
So, if we programmers are so advanced, why are we still using names to
identify various programming language entities? Why aren't we giving the
equivalent of UUIDs to Classes and Functions and other programming language
entities? As it is now, there are version control systems that could well be
confused by name collisions.

~~~
detaro
Yeah, the idea of "let's serialize the ASTs and add names only when humans
look at it" comes up all the time. There are also a few implementations, but
editing it is just so different from text editing that the hurdle is quite big
(and emulating text editing on top of it is really hard to get right without
losing most of the benefits in the process).

example: [http://www.lamdu.org/](http://www.lamdu.org/)

------
prutschman
The prose has an odd feel to it, like a patent description.

~~~
stonogo
It's exemplary of IETF work; it's specific, utilitarian, clear, well-
documented, and will never be used by anyone ever.

~~~
Tossrock
...like the way no one uses TCP/IP?

~~~
icebraining
TCP/IP was published before the IETF even existed.

------
JulianMorrison
Every time I see one of those, I'm going to have a Monty Python moment.

~~~
svckr
Did you say shrubberies?

