
Principles of Content Addressing (2014) - btrask
http://bentrask.com/notes/content-addressing.html#hn
======
gojomo
Some quibbles:

* URNs weren't _just_ for ISBN-like identifiers from an assignment authority. Some of the early writing talks about that use case, but the key feature was "location-independent persistent names", for which hash-names have always been a good fit. Nothing in the relevant specs precludes hash-names as an URN scheme – it's not "breaking the spec" – and a number of projects have used hash-named URNs. While there's a policy for registration, in practice some URN namespaces have been 'de facto registered by use', much like common-law trademarks and a lot of URI/URL schemes. (Of course, the neat "URLs and URNs as distinct subtypes of URIs" view never fully took root, as W3C's 2001 "Clarification" note acknowledged.)

* Magnet-links were absolutely designed to promote P2P/CDN-network-agnostic content-addressing. But, they were also made flexible enough to squeeze in other descriptive metadata, aliases, or fallback locations as well. The JS-launching was an adaptive hack; the descriptive (and usually hash-based) content-names were the point. A key early use case was making a common, vendor-neutral hash link for competing Gnutella clients, but the loose stuff-anything-useful-in generality saw magnet-links adopted by other software (such as DC++ and BitTorrent) as well. The 'magnet' URI scheme was only ever 'common-law' registered.

* It's a bit odd to consider the algorithm the URI 'authority', though if I squint I can see a sort of funhouse-logic to it. Notably the similar URI-scheme that's made it through IETF standardization, the 'ni' scheme (RFC6920), usually leaves the 'authority' component blank, so three slashes appear in a row – but alludes to the optional declaration of an 'authority' that might be able to help accessing the referenced content.
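For the curious, RFC 6920's format is easy to generate. A minimal Python sketch of building an 'ni' URI from a SHA-256 digest (the function name and the blank-authority default are my own choices, not part of the RFC):

```python
import base64
import hashlib

def ni_uri(data: bytes, authority: str = "") -> str:
    """Build an RFC 6920 'ni' URI: ni://[authority]/sha-256;<digest>."""
    # The digest value is base64url-encoded with the '=' padding stripped.
    digest = base64.urlsafe_b64encode(hashlib.sha256(data).digest())
    return "ni://%s/sha-256;%s" % (authority, digest.rstrip(b"=").decode("ascii"))

# With a blank authority, three slashes appear in a row, as noted above.
# This matches RFC 6920's own "Hello World!" example.
print(ni_uri(b"Hello World!"))  # ni:///sha-256;f4OxZX_x_FO5LcGBSKHWXfwtSx-j1ncoSt3SABJtkGk
```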

~~~
btrask
Great points, you've clearly read the RFCs. :)

\- You're right that URNs were not _just_ for ISBNs, but they shaped the
formation of the standard and (IMHO) made it inapplicable for hashing. A
content addressing system that can't resolve any of the standardized URNs
wouldn't be very useful. FWIW one of my earliest prototypes used URNs, and I
still use them for ISBN links!

\- Magnet links work fine today, but if you look at the original proposal[1],
they really were designed for all of the wrong reasons (including explicitly
popping up a JS handler). In practice everyone who uses magnet links tunnels
URNs through them, which serves no point for a general purpose system. A
system that supports magnet links must also support URNs (meaning the
arguments against using URNs apply, and magnet: doesn't add much value).

\- I've considered eventually adding ni support to StrongLink, although it's
not like anyone else uses it so it wouldn't be an interoperability win. I
think my hash scheme proposal is much better, so I'm hoping we can just forget
about ni entirely. (But to be clear, it's extremely easy for a system to
support both.)

[1] [http://magnet-uri.sourceforge.net/magnet-draft-overview.txt](http://magnet-uri.sourceforge.net/magnet-draft-overview.txt)
(warning: SourceForge link)

~~~
gojomo
I wrote the original magnet-URI proposal, so trust me when I say the JS-stuff
was a demo hack, and the content-based names the real point. (Essentially no
one ever implemented the JS-handler-negotiation, which was a quasi-web-intents
mechanism before that concept was named.)

Magnet-URI's immediate predecessor was the Hash/Urn Gnutella Extensions,
'HUGE' [1]. All the examples in the magnet-URI spec are hash-URNs, and such
hash-URNs are the main way magnet-URI has been used, because that's what
magnet-URI was _for_.

I respect that your design opinion is that URNs aren't good for this; it's
just false for you to say hash-names are against the URN specs. Neither the
language of the URN specs nor historical practice supports that idea. And,
hashes are, as you clearly agree, a great way to generate "persistent,
location-independent, resource identifiers" (the stated purpose of URNs).

A system (P2P, CDN, local content-addressed stores, etc.) can be plenty useful
even if it chooses to support only some URNs, or only some magnet-URIs. All
the magnet-using systems have essentially ignored standardized/assigned URNs,
and instead used ad-hoc hash URNs, and in total they've been quite useful to a
lot of people.

[1] [http://rfc-gnutella.sourceforge.net/Proposals/HUGE/draft-gdf...](http://rfc-gnutella.sourceforge.net/Proposals/HUGE/draft-gdf-huge-0_93.txt)

~~~
btrask
Sorry, I didn't know Magnet was your work. I think it's an example of the
inner-platform effect (tunneling URIs in URIs), but I agree it's served a lot
of applications (especially BitTorrent) very well.

You're right that hashes aren't prohibited by the URN specification. My
argument is that the URN spec doesn't prohibit anything, because it's too
broad. In fact, I think that URNs boil down to URLs, because in order to
resolve many schemes (including ISBNs), you need a dynamic lookup to a central
authority. I've been considering an article called something like "Locations,
names and hashes" in order to explain that locations and names are effectively
the same, but hashes are fundamentally different. That is my opinion of the
underlying reason why URNs failed to catch on (aside from BitTorrent).

Even the practical point of interoperability is moot, because BitTorrent
namespaces its hashes. It's impossible for another system to support existing
URN/magnet links without "emulating" torrent files (which introduces too much
ambiguity anyway).

BTW, the [http://memesteading.com/](http://memesteading.com/) link in your
profile appears to be broken.

Edit: I see you've worked on a lot of things I've read about, e.g. WARC files.
Do you currently work at the Internet Archive? I was planning on approaching
them at some point with some ideas.

~~~
gojomo
Yes, I think many now recognize that the original idea of a stark contrast
between URLs and URNs doesn't fit the fuzzy reality. (RFC3986's "URI, URL, and
URN" section,
[https://tools.ietf.org/html/rfc3986#section-1.1.3](https://tools.ietf.org/html/rfc3986#section-1.1.3),
acknowledges this point.) My interpretation is: there's quite a few de-facto
URNs in use, just without the official label "urn:" or namespace registration,
which in practice has turned out to be an unnecessary formality.

(Thanks for the note regarding memesteading.com; mapping updated to work now.)

I'm no longer regularly doing anything for the Internet Archive, but can
definitely help make contact! If you're in the bay area, a good way to start
learning more about its projects (or show off your own) is to attend the open-
house lunches, held most Fridays. (You should just shoot them a note or call
before showing up, so they know the expected attendance and can warn you if
it's one of the occasional days it's not held.)

~~~
btrask
Rather than de facto URNs, I'd say de facto URLs, but we don't have to quibble
over that.

I'm on the east coast, unfortunately (North Carolina).

HN is going to start capping the thread depth, but you're welcome to email me
if you'd like to talk more (bentrask@comcast.net). I've been trying to come up
with an archival web proxy or something sort of related to WARC and the
tooling around it, possibly using content addressing (although converting
existing web pages seems ugly and I haven't found an ideal way).

------
anoxic
I like the idea of being able to address content based on its actual content.
Perhaps I don't have a very good imagination, but if all you have is a hash of
the content, how do you know _where_ to find it?

From within a single app it's easy, but what about in other apps or on other
machines? Would there be a (possibly distributed/voluntary) lookup service?
Could comments or lookup "hints" be added to the spec?

~~~
btrask
This is the secret sauce that makes every implementation unique. Camlistore,
IPFS, StrongLink (my project), and others all have different answers. I think
the important thing is that they all use hash URIs that can interoperate. Then
you can find the content using whichever system you prefer or makes the most
sense.

StrongLink doesn't use a distributed hash table, because one of my
requirements is that it must work offline. In StrongLink, you pull from other
repositories you're interested in, and then always resolve your requests
against your own repo (locally).
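The pull model described above can be sketched in a few lines of Python (a toy for illustration, not StrongLink's actual code; the class and method names are made up):

```python
import hashlib

class Repo:
    """Toy content-addressed repo: pull-based sync, local-only resolution."""

    def __init__(self):
        self.store = {}  # hex digest -> content

    def add(self, content: bytes) -> str:
        h = hashlib.sha256(content).hexdigest()
        self.store[h] = content
        return h

    def pull(self, other: "Repo") -> None:
        # Copy anything we don't already have; hashes make dedup trivial.
        for h, content in other.store.items():
            self.store.setdefault(h, content)

    def resolve(self, h: str) -> bytes:
        # Requests are always answered from the local store, so this
        # keeps working with no network at all.
        return self.store[h]
```

After a `pull`, every resolution is a local dictionary lookup, which is what makes the offline requirement easy to satisfy.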

------
chmike
Is there more information on the intended application domain (use cases)?
This was unclear to me after reading the referenced document.

Most of your arguments apply to hash referencing and make perfect sense to me.
The compatibility with URI is also a good point and not difficult to achieve.
I would add that compatibility with URL (http scheme) is also desirable and
may be solved by use of a protocol bridge so that the information is
accessible to web applications.

I've also been working on a distributed information system for many years now
(not full time ;)). But I use a different referencing system, because one
can't modify the information without invalidating the hash and thus the
references. So using hashes as references makes sense only for a system
containing immutable information. Your choice of reference is thus very
application specific.

The system I'm working on allows data to be modified and moved without
invalidating references, and is distributed like the web. A reference is _at
most_ 32 bytes long in its binary representation, a bit longer in its ASCII
representation. Take that, HTTP! ;)

~~~
btrask
I believe there is a fundamental distinction between URI schemes that are
dynamic but centralized (meaning they require some form of coordination, to
handle mutability), versus schemes that are static and decentralized (for
example, they are defined by a hash algorithm that anyone can run
independently). If you go the dynamic/centralized route you end up with
something like the World Wide Web, which already works very well for that use
case. Content addressing will never be useful for things like online shopping,
where your requests are basically remote procedure calls. Instead I think it's
good for publishing, sharing files, and things like that.

I do have a plan to build mutability on top of StrongLink eventually, using
diffs like Git. I think it's important to have a clean separation, so that the
sync protocol stays simple and impossible to get wrong.

The initial application for StrongLink is notetaking, which I find much easier
to do with immutability. It's like writing in ink.

~~~
chmike
To support mutability you would need redirections, so that an old reference
can still reach the most recent version. The problem is that you can't check
the validity of a redirection the way you can check a returned file against
its hash.

Thus hash references with redirection only work in a secure and trusted
environment.

------
btrask
I'm happy to answer any questions about this article or the implementation
I've put together.

~~~
lisper
You say you've been working on your CAS system for two years. What are the
challenges you are facing that are making it take that long? Because a simple
CAS is trivial: just hash the file, rename it to the hash, and serve it up via
a normal HTTP server. What is making it hard?

~~~
btrask
Here's what I've written up for a toy content addressing system I've been
writing in Python:

"If it's so easy, why did it take us 3 years and 10,000 lines?" "Because we
wrote it in C."

And because the real system:

\- Tracks file meta-data (including tags and full-text indexing)

\- Has a way to find files/hashes without file or hash (search)

\- Uses a database instead of hardlinks everywhere

\- Supports hash algorithm plugins (not technically done yet)

\- Supports more hash formats (short, base-64, etc.)

\- Is ACID (and coordinates transactions between DB and file system)

\- Supports multiple repos per user (at arbitrary paths)

\- Handles hash collisions (aside from the primary hash)

\- Has a basic user interface

\- Has a sync protocol

------
ClayFerguson
Everything is content! The JCR (Java Content Repository) will eventually be
the standard -- and sort of already is. Apache Jackrabbit Oak will take off as
the open source technology of choice eventually. It's able to leverage both
MongoDB and Lucene search, and has a standard for naming. I have an open
source app located here (meta64.com), that's a full-stack modern standards
implementation of a portal built on it. Any node is addressible on the url
using the JCR standard.

~~~
falcolas
It strikes me that this is one of those "a square is a rectangle, but a
rectangle is not always a square" moments. JCR seems like it fits into the
content identification family; it uniquely identifies content by a hash (in
the case of JCR that hash is also a path), but that hash is system dependent.
That is, I can't go to a system B and ask for the hash(path) '/foo/bar' and
expect to get the same content as I would on system A.

Cryptographic hashes, on the other hand, make it possible to use the same hash
on multiple systems and get the same content (if that content is available).

~~~
chmike
You need an index to achieve that. The index tells you where the information
associated with the hash is stored. That index is itself a distributed
system.

------
skybrian
Do these links work on mobile phones? (Can an app register to handle them?)

~~~
Sophistifunk
Should do; apps can register as protocol handlers. You can only register at
the "foo://" level though, not "foo://bar". What I mean is, I don't believe
an app can register to handle only some of the "foo:" URIs.

~~~
btrask
For what it's worth, I consider this a feature. There have been proposals to
make sha1: and other schemes, but every system will need hash agility (as
better algorithms are invented and flaws are found in old ones). A single
system should be able to resolve every relevant hash algorithm.
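A sketch of what hash agility looks like in practice, assuming a `hash://<algo>/<digest>` layout (my approximation of the article's proposal) and a simple Python registry of algorithms:

```python
import hashlib

# One resolver, many algorithms; new ones get added as they appear,
# rather than minting a new URI scheme (sha1:, sha256:, ...) each time.
ALGOS = {
    "sha1": hashlib.sha1,
    "sha256": hashlib.sha256,
    "sha512": hashlib.sha512,
}

def hash_uri(algo: str, data: bytes) -> str:
    """Build a hash URI for any registered algorithm."""
    return "hash://%s/%s" % (algo, ALGOS[algo](data).hexdigest())

def verify(uri: str, data: bytes) -> bool:
    """Check content against a hash URI, dispatching on the algorithm name."""
    rest = uri.split("hash://", 1)[1]
    algo, digest = rest.split("/", 1)
    return ALGOS[algo](data).hexdigest() == digest
```

Retiring a weak algorithm is then a one-line change to the registry instead of dropping support for a whole URI scheme.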

