
The web of names, hashes and UUIDs - lelf
https://joearms.github.io/2015/03/12/The_web_of_names.html
======
kentonv
See also Zooko's Triangle:

[http://en.wikipedia.org/wiki/Zooko%27s_triangle](http://en.wikipedia.org/wiki/Zooko%27s_triangle)

There's a long tradition of using this kind of approach in capability systems.
If you do it right, you can have globally unique identifiers that are not
human readable (probably, some sort of public key which is also routable), and
then let humans assign local "petnames" (labels) to them in a way that is
actually pretty usable, if everyone is using software designed for it.
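
A minimal sketch of the petname idea (the mapping structure and the example key strings are hypothetical, not from any real system):

```python
# Hypothetical petname table: globally unique but non-human-readable
# identifiers (say, hex-encoded public keys) get local, human-chosen labels.
petnames: dict[str, str] = {}  # local label -> global identifier

def assign_petname(label: str, global_id: str) -> None:
    """Bind a human-chosen local name to a globally unique identifier."""
    petnames[label] = global_id

def resolve(label: str) -> str:
    """Look up the global identifier behind a local petname."""
    return petnames[label]

# Labels only need to be unique for this one user; the keys are global.
assign_petname("mom", "ed25519:9f86d081884c7d65")
assign_petname("bank", "ed25519:2c26b46b68ffc68f")
print(resolve("mom"))  # -> the routable key, never typed by hand
```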

Problems come when, say, you want to put your web address on a billboard. In
an ideal world the billboard would somehow transmit the advertiser's public
key to the viewer's device so that the viewer could then look them up, but
obviously we don't have any particular tech for doing that. So instead we
create this whole complex system by which people can register human-readable
identities, which in turn requires a centralized name service (yes, DNS is
centralized), certificate authorities (ugh), etc.

Similarly, whenever you tell your friend about some third-party entity
(another person, company, whatever), you should be giving them the public key
of that entity. But that's not really practical. We need some sort of brain
implant for this. :)

~~~
vertex-four
Namecoin solves Zooko's Triangle (at the cost of keeping a Bitcoin-like
network running). The issue is that nobody's software supports it.

~~~
rictic
The support problem for namecoin is being solved. See: okturtles and dnschain.

------
AlyssaRowan
UUIDs can also be collided. Did you want some form of public key referencing
that only a private key can answer, perhaps?

As already pointed out by ivoras here, magnet: links are close to exactly what
you're looking for. It also reminds me of, for example, the Freenet
CHK/SSK/USK system, or several other things with similar designs.

You should also not use SHA-1 hashes for uniqueness or checksumming anymore:
they're too weak.

~~~
duskwuff
I was reminded of Freenet as well. The USK scheme for allowing content to be
updated is rather clever.

------
ivoras
Some of the ideas are already present in Magnet links
([http://en.wikipedia.org/wiki/Magnet_URI_scheme](http://en.wikipedia.org/wiki/Magnet_URI_scheme)).
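
For the curious, a magnet link names content by what it is rather than where it lives; a rough sketch of building one (the infohash and tracker below are made up):

```python
from urllib.parse import urlencode

# "xt" (exact topic) carries the content hash; "dn" is only a display hint.
params = {
    "xt": "urn:btih:c12fe1c06bba254a9dc9f519b335aa7c1367a88a",  # fake SHA-1 infohash
    "dn": "example-file.iso",                # human-readable hint, not authoritative
    "tr": "udp://tracker.example.org:1337",  # optional tracker hint
}
magnet = "magnet:?" + urlencode(params, safe=":/")
print(magnet)  # any peer serving bytes matching the hash is a valid source
```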

~~~
mmahemoff
Similar proposal for script tags [http://lists.w3.org/Archives/Public/public-
webappsec/2013Feb...](http://lists.w3.org/Archives/Public/public-
webappsec/2013Feb/0052.html)

This would avoid the inefficient situation we have at present, where the same
jQuery script is fragmented across dozens of CDNs, causing a new request each
time even when previous instances are already cached by the browser.
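
This is essentially what the Subresource Integrity spec (mentioned elsewhere in this thread) standardizes. A sketch of computing the digest a page would pin (the file path is a placeholder):

```python
import base64
import hashlib

# Compute an SRI-style digest so any CDN copy of the script can be verified.
with open("jquery.min.js", "rb") as f:          # placeholder path
    digest = hashlib.sha384(f.read()).digest()

print("sha384-" + base64.b64encode(digest).decode("ascii"))
# The value goes into: <script src="..." integrity="sha384-...">
# A cached copy with a matching digest could then be reused across origins.
```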

~~~
anon4
I would love to see maven in the browser. Have each webapp serve a pom.xml
listing the libs it uses along with exact versions and be done with this hell.
The browser then checks if the lib is installed, if not, it goes through the
user-specified maven repositories, then through the default ones and then
through the webapp-specific ones searching for the artefact. Problem solved.

~~~
LunaSea
The problem is not that simple. Most webapps are compiled and minimised using
systems like Closure Compiler, Uglify and R.js (for AMD apps).

You almost never point directly to a dependency as a standalone file. Doing so
would mean 15-30 requests per webapp, and since browsers only allow 5-6
parallel requests per host, it would slow down the page considerably.

------
williamcotton
_SHA1 checksums are fine for content that is immutable (doesn’t change) - but
what about a file whose content changes with time?_

I think we've been doing some very silly things with data over the last few
decades related to the UPDATE statement.

Why not just consider all published data immutable? Look at book publishing as
an analogy. There are multiple printings of a book. If there were corrections
or updates they don't retroactively affect the previous printings. Why can't
we look at data that is published on the Internet in the same manner?

If you want to update something you'll have to publish a brand new version.
This also mirrors versioning in software libraries.

CRDTs, immutable data structures, eventually consistent data... from UI
programming to big data, these are more than just eternally recurring trends.
We're learning some very lasting things about how computers should deal with
data.

~~~
MichaelGG
Because computers and reality don't actually deal with immutable data? I love
the reasoning powers you get from it at a code level, but it's simply not a
fast way to do things in many cases. Also, many applications do not care about
all (or any) interim copies of data.

With respect to web content, what are you proposing? If I go to
site.com/product1data and you update the price, we certainly don't want the
URL to change. In such situations, how would a versioned system add any value,
and how would the UI be exposed?

~~~
nitrogen
[not the OP] In your example, the URL name would be a reference to an
immutable document with an immutable ID. You can update the named reference to
point at the new ID, but the old ID still refers to the old document.

It would also be possible to store and distribute new versions as patches to
previous versions by linking between the patch set, old version, and new
version.
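
A minimal sketch of that split between mutable names and immutable documents (the names and values are illustrative):

```python
import hashlib

store = {}   # content hash -> immutable bytes
names = {}   # mutable name -> hash of the current version

def publish(content: bytes) -> str:
    """Store content immutably under its own hash."""
    h = hashlib.sha256(content).hexdigest()
    store[h] = content
    return h

def update_name(name: str, content: bytes) -> str:
    """Point a mutable name at a freshly published version."""
    names[name] = publish(content)
    return names[name]

v1 = update_name("site.com/product1", b"price: $10")
v2 = update_name("site.com/product1", b"price: $12")
assert store[v1] == b"price: $10"        # the old version is still addressable
assert names["site.com/product1"] == v2  # the name follows the latest
```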

~~~
MichaelGG
OK, but how does that fit into how we're actually using URLs today, like how
would this change the web? If you browse my product info page, I don't want
you linking to a versioned document, in general. In the cases I'm thinking of
off the top of my head, I'm not sure when you'd ever want people
linking/copying version-specific URLs that somewhat negate the point of
updating the page in the first place.

Or perhaps I don't understand "Why can't we look at data that is published on
the Internet in the same manner".

~~~
williamcotton
The web of today is based completely on "this named thing references this
other named thing". When we create links to other content we are most
definitely doing so in the context of the content as it is at the time of
linking, not some future state. What happens if that data is waaay different
or missing? Dead links we call them.

In terms of commerce this is somewhat analogous to bait-and-switch.

In this thread I'm mainly referring to digital content as an end in itself,
not to digital content as a reference to physical products.

As for referencing physical products, be they automobiles or paintings,
deriving a direct cryptographic hash isn't possible, but assigning GUIDs is.
Cars already have serial numbers. Paintings have signed certificates from experts.

If I'm on a website buying a used car I definitely want the price list to be
linking to the GUID, that is, to a reference of the object itself.

------
whyrusleeping
For those interested in a system that is using ideas presented here, check out
ipfs ([http://ipfs.io](http://ipfs.io))

~~~
patcon
y'all are doing a bang-up job with IPFS, but I'd say it's always a good habit
to clarify affiliation when spreading the word :)

(I was going to give your team a shout-out if you hadn't beat me to it!)

~~~
_prometheus
hey patcon! yep, you're right. whyrusleeping and I are both building ipfs. :)

------
xyzzy123
If you tracked the content of the changes themselves, rather than having them
be implicit, you'd have git!

------
jgreen10
The reason naming things by hash is a good idea is that you can cache them
"forever", until you decide to change headers. The only benefit of a UUID over
a URL is that the lack of meaning makes it less likely that people will change
it later on. However, semantic naming is useful, and avoids collisions. True
UUIDs require very careful use of RNGs.

------
gojomo
For modifiable content, more useful than UUIDs would be (chosen-name, content-
hash) tuples that are then signed by the publishing crypto-identity.

The chosen-name can be a UUID, but doesn't have to be something so
semantically-opaque. It's more likely to be a tree-namespace like traditional
domain-centric URLs. The publisher's key replaces the role of the domain-name
as the 'authority' portion of the URL.

(Once upon a time, I suggested 'kau:' – for Keyed AUthority – as a URI-scheme
for such URLs in a location/protocol-oblivious web:
[http://zgp.org/pipermail/p2p-hackers/2002-July/000719.html](http://zgp.org/pipermail/p2p-hackers/2002-July/000719.html)
)
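
A sketch of signing such a tuple, assuming the third-party `cryptography` package for Ed25519 (the name and content are illustrative):

```python
import hashlib
import json
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

publisher_key = Ed25519PrivateKey.generate()

content = b"<html>hello</html>"
record = json.dumps({
    "name": "docs/index.html",                    # chosen, tree-namespaced name
    "hash": hashlib.sha256(content).hexdigest(),  # content hash
}).encode()

signature = publisher_key.sign(record)

# Anyone holding the publisher's public key can check the (name, hash)
# binding; verify() raises InvalidSignature on a forged record.
publisher_key.public_key().verify(signature, record)
```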

~~~
rakoo
Interestingly camlistore does that:

- First you create a "random" blob that has an identity (called a permanode),
that you sign with your private key. It acts as an "anchor" you can link to.

- Then you sign a piece of json that references the permanode and the content
you wish; it effectively means "I, owner of key XXX, claim that permanode
called 12345 now references value ABCD".

- To get the content of a permanode, you search all modifications and merge
them to obtain the final value.
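
A structural sketch of that flow (not Camlistore's actual schema; the refs are invented):

```python
import hashlib
import json
import os

# The permanode is just signable randomness -- a stable anchor to point at.
permanode = {"type": "permanode", "random": os.urandom(16).hex()}
permanode_ref = "sha256-" + hashlib.sha256(
    json.dumps(permanode, sort_keys=True).encode()).hexdigest()

# A claim attaches a (mutable) attribute to the (immutable) anchor.
claim = {
    "type": "claim",
    "permanode": permanode_ref,     # which anchor is being modified
    "attribute": "content",
    "value": "sha256-0123abcd",     # invented ref to the current content blob
}
# The claim would be signed as in the sketch above; resolving the permanode
# means replaying all signed claims in order and merging them.
print(permanode_ref, json.dumps(claim, indent=2))
```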

------
yason
An alternative, peer-to-peer name resolution mechanism will be in greater
demand every year. Currently, DNS records are spoofed, proxied, blocked, and
whatnot by the domain name registrars because of legal threats by big money.

There should be a way for people to look up a particular server by name
without having to involve every other party that wants a say in the process.

------
rdtsc
BTW the author is Joe Armstrong -- Erlang's inventor.

The talk he refers to, "The Mess We're In" (
[https://www.youtube.com/watch?v=lKXe3HUG2l4](https://www.youtube.com/watch?v=lKXe3HUG2l4)
), is a fun and accessible one he gave at the Strange Loop conference last year.

------
wampus
"Once we have the SHA1 name of a file we can safely request this file from any
server and don’t need to bother with security."

I don't understand this statement. Dispensing with encryption makes the
request and returned content susceptible to passive observation. Saying that
this approach is resistant to active manipulation assumes the existence of
some kind of web of trust between hashed documents. At some point you're going
to have to click on a hash without knowing its provenance. How do you know
you're not being phished?

~~~
TeMPOraL
When you have a hash you trust locally, you can fetch a file from anywhere
without caring about the source - the only thing that matters is if the file
matches your hash. So the only thing you need to care about is ensuring that
the hash you have is the one you really want.
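
A minimal sketch of that fetch-then-verify discipline (the URL and hash below are placeholders):

```python
import hashlib
import urllib.request

def fetch_verified(url: str, expected_sha256: str) -> bytes:
    """Fetch bytes from any untrusted mirror; trust only the hash match."""
    data = urllib.request.urlopen(url).read()
    if hashlib.sha256(data).hexdigest() != expected_sha256:
        raise ValueError("content does not match the trusted hash")
    return data

# Placeholder usage: the mirror is interchangeable, the hash is not.
# data = fetch_verified("http://mirror.example/file.tar.gz",
#                       "<the sha256 you already trust>")
```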

~~~
afandian
Only if you assume that your hashcode can't be collided (it probably can).

~~~
gojomo
If you're using a proper cryptographically-secure hash, it almost certainly
_can't_ be collided.

~~~
afandian
So I happily get file md5:d41d8cd98f00b204e9800998ecf8427e

Tomorrow everyone discovers that MD5 has been compromised by some organisation
with a lot of money (obviously this happened long ago).

So the author needs to re-publish it as
sha:adc83b19e793491b1c6ea0fd8b46cd9f32e592fc

And all my links are suddenly broken and I can't provide a mapping from old to
new.

And then someone breaks SHA...

All I'm saying is that content-addressed hashing doesn't obviate the need for
secure transport and trust.

~~~
gojomo
MD5 has been known to be too weak for this purpose since 1995. So, no
competent protocol-designer or publisher will have used it for this purpose
for decades. SHA1 is now under enough suspicion to avoid for this purpose, but
not yet proven to be compromisable.

But SHA2-256 and up, and many other hashes, are still safe for this purpose
and likely to remain so for decades – and perhaps indefinitely.

So within the lifetime of an application or even a person, secure-hash-naming
_does_ obviate the need for secure transport and trust. Also note that
'secure' transport and trust, if dependent on things like SSL/TLS/PKI, also
relies on the collision-resistance of secure hash functions – in some cases
even weaker hash functions than anyone would consider for content-naming.

(For the extremely paranoid, using pairs of hash functions that won't be
broken simultaneously, and assuming some sort of reliable historical-
record/secure-timestamping is possible, mappings can be robust against
individual hash breaks and refreshed, relay-race-baton-style, indefinitely.)
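
A toy version of the paired-hash idea (the naming scheme is invented):

```python
import hashlib

def dual_name(content: bytes) -> str:
    """Name content by two structurally different digests at once."""
    return "sha256:{};sha3:{}".format(
        hashlib.sha256(content).hexdigest(),
        hashlib.sha3_256(content).hexdigest(),
    )

def matches(content: bytes, name: str) -> bool:
    # A forgery would need to collide *both* functions on the same input.
    return dual_name(content) == name

doc = b"immutable document"
assert matches(doc, dual_name(doc))
```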

~~~
smadge
Gojomo, that was the point of using MD5 in the example. If this system had
been deployed in 1995, it would have used MD5, and thus the problem of
broken/obsolete links that the comment outlined would have applied to it after
a few years. Who's to say the same thing won't happen to this system in 5
years?

~~~
gojomo
There was no surprise break of MD5 – it came after plenty of warning, so even
a hypothetical 1995 deployment would've had years for a gradual transition,
and continuity-of-reference via correlation-mapping to a new hash.

So even that hypothetical example – with an early, old, and ultimately flawed
secure hash – shows hash-based naming to be more robust than the alternatives.

And in practice, hash-names are as strong or stronger than the implied
alternative of "trust by source" – because identification of the source is,
under the covers, also reliant on secure hashes… _plus_ other systems that can
independently fail.

We have experience now with how secure hash functions weaken and fail. It's
happened for a few once-trusted hashes, with warning, slowly over decades. And
as a result, the current recommended secure hashes are much improved – their
collision-resistance could outlive everyone here.

Compare that to the rate of surprise compromises in SSL libraries or the
PKI/CA infrastructure – several a year. Or the fact that SSL websites were
still offering sessions bootstrapped from MD5-based PKI certificates after MD5
collisions were demonstrated.

------
mholt
Reminds me of an experimental, hypothetical future Internet architecture
called Named Data Networking: [http://named-data.net/](http://named-data.net/)

------
adrusi
Freenet does this, but with a more secure (and less flexible) alternative to
UUIDs.

[https://freenetproject.org/](https://freenetproject.org/)

~~~
derefr
Yep. Freenet is basically this plus a bunch of really complicated mechanisms
to create anonymity in who shoved the documents into it. If you take those
bits out† then you're left with a DHT that:

1. lets you talk about individual objects by hash-based URNs;

2. or lets a publisher insert _versioned streams_ of objects using
_document-signing with deterministic subkeying_ ‡; gives the stream as a whole
a UUID-based URN; and then lets clients query for either the latest, or for
any fixed version-index of a given object-stream;

3. and which does a sort of pull-based store-and-forward of content—every
node acting as a caching proxy for every other node.

I'm _really_ surprised nobody has just built this trimmed-down design and
called it a "distributed object storage mesh network" or somesuch. A public
instance of it would beat the Bittorrent DHT at its own game; and private
instances of it would be competitive with systems like Riak CS.

---

† Which is perfectly sensible even for Freenet itself; you could always just
run your Freenet node as a Tor hidden service, now that both exist. Tor
cleanly encapsulates all the problems of anonymous packet delivery away; the
DHT can then just be a DHT.

‡ This is similar to Bitcoin's BIP0032 proposal, but the root keys are public
keys and are available in the same object-space as the transactions. Given
that you have the root public key, you can both 1. prove that all the
documents were signed with keys derived from this key, and also 2. figure out
what the "nonce" added to the root key to create the subkey was in each case.
If the inserting client agrees to use a monotonically-increasing counter for
nonces, then the subkey-signed documents are orderable once you've recovered
their subkeys.

~~~
williamcotton
_I'm really surprised nobody has just built this trimmed-down design and
called it a "distributed object storage mesh network" or somesuch._

You mean something like this?

[http://ipfs.io/](http://ipfs.io/)

------
rictic
A potential stepping stone to the web of hashes proposal here is the
Subresource Integrity spec:
[http://w3c.github.io/webappsec/specs/subresourceintegrity/](http://w3c.github.io/webappsec/specs/subresourceintegrity/)

The other thing in this area to watch is ipfs.

------
yarcub
XRI[1] was a nice effort to solve this issue.

[1][http://en.wikipedia.org/wiki/Extensible_resource_identifier](http://en.wikipedia.org/wiki/Extensible_resource_identifier)

------
maemre
There is some research going on in Information Centric Networking[1] and Named
Data Networking. The basic idea is switching to a system based on content
instead of hosts. The proposed solutions try to reduce traffic overhead (e.g.,
by replicating and caching content on switches and routers), provide better
indexing and search functionality, etc.

[1]: [https://en.wikipedia.org/wiki/Information-
centric_networking](https://en.wikipedia.org/wiki/Information-
centric_networking)

------
natch
Very interesting, this could be powerful.

Things like git hashes and the bitcoin blockchain are already giving us pieces
that head in this direction.

One thing I would not like to see would be total lockdown of the worldwide
body of documents, where every digital creation is perfectly and irrefutably
tagged and tracked back to its creator and on every step along the way. I'm
not sure what all the bad consequences of this could be, but it doesn't give
me a warm and fuzzy feeling.

------
weitzj
Isn't this the point of URIs?
[https://tools.ietf.org/html/rfc3986](https://tools.ietf.org/html/rfc3986)

~~~
treve
No, it's the point of URNs; you can go back much further for solutions that
seek to address this very issue.

~~~
weitzj
Thank you

------
dilipray
Why not something like Bit.ly's combination of 6 chars? 57+ billion
combinations is a good number for a short-URL website.

For a content website, something like YouTube's combination of 11 chars is
pretty good for them, though it will surely hit a limit in the future, maybe
after 5+ years. But if they add one more char, it will have many more
combinations.

This is about URLs and not files, though. Can anyone correct me if I am wrong?

~~~
MichaelGG
Because those 6 chars need to have a backing store to map them. You cannot
derive the 6 chars just from the content of a file. Whereas with a hash
function, you can.
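
A side-by-side sketch of the difference:

```python
import hashlib

# A bit.ly-style short code is an arbitrary key into a table someone
# must host; it means nothing without that central lookup service.
short_codes = {"a1B2c3": "https://example.org/some/long/path"}  # invented entry

# A content hash needs no table: any party holding the bytes derives
# the exact same name independently.
content = b"the actual file bytes"
name = hashlib.sha256(content).hexdigest()
assert hashlib.sha256(content).hexdigest() == name  # reproducible anywhere
```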

~~~
dilipray
Got it. Thank you.

------
olifante
Aren't these the same ideas that Rich Hickey has been talking about since his
seminal 2009 talk about values, identity and state?
[http://www.infoq.com/presentations/Value-Identity-State-
Rich...](http://www.infoq.com/presentations/Value-Identity-State-Rich-Hickey)

Joe Armstrong's proposal seems to boil down to this:

- Identities as UUIDs

- State as the payload of UUID URIs

- Values as SHA-1 URIs

------
daenz
Regarding "the web of hashes", I wrote a Chrome proof-of-concept a while back:

[https://github.com/amoffat/hash-n-slash](https://github.com/amoffat/hash-n-
slash) previous discussion:
[https://news.ycombinator.com/item?id=6996398](https://news.ycombinator.com/item?id=6996398)

------
lovemenot
Armstrong's concern about the ever-growing number of data objects is
interesting in itself. He says Git(hub) is great and all but - paraphrasing -
forks outnumber merges. At some point such systems become unwieldy unless
there is systemic incentive to dedupe, merge, categorise or otherwise reduce
complexity. This chimes with my view too.

------
afandian
Tangentially related to some issues mentioned here (keeping track of articles
that change over time) is the Memento protocol:
[http://timetravel.mementoweb.org/](http://timetravel.mementoweb.org/)

It's a very interesting project, worth a look.

------
jacques_chester
I've seen this done in Cloud Foundry, for these reasons.

Objects with _identity_, like a buildpack, are given a UUID. That UUID is
stable, but the hash changes on disk depending on the exact file uploaded
(because you can replace buildpacks).

File paths include both UUID and hash.
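
A hypothetical path layout in that spirit (not Cloud Foundry's actual scheme):

```python
import hashlib
import uuid

buildpack_id = uuid.uuid4()            # stable identity across re-uploads
blob = b"buildpack archive bytes, v2"  # the replaceable content
digest = hashlib.sha256(blob).hexdigest()

path = f"/buildpacks/{buildpack_id}/{digest}.zip"
print(path)  # a new upload changes the hash segment but not the UUID
```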

------
rckrd
Gfycat[1] provides a slightly hybrid solution. Not sure that it is that
effective for sharing though.

[1] [http://gfycat.com/about#links](http://gfycat.com/about#links)

------
wyager
RIPEMD-160 would be a more appropriate hash for this. SHA-1 is no longer
thought to be sufficiently secure against collisions.

Edit: If you're downvoting, please explain. This is more or less factually
correct.

~~~
timmclean
A member of the SHA2 (or SHA3) family would be more appropriate. RIPEMD-160 is
slower and less resistant to collisions. I agree however that the use of SHA1
is problematic!

~~~
wyager
>A member of the SHA2 (or SHA3) family would be more appropriate. RIPEMD-160
is slower and less resistant to collisions.

"openssl speed" reports RIPEMD-160 being a few percent _faster_ than SHA256 on
my computer.

What makes you say it's "less resistant to collisions"? I don't think there
are any serious cryptanalytical attacks on RIPEMD160.

~~~
timmclean
N-bit hashes have at best (N/2)-bit collision resistance (see birthday
attack[1]). An 80-bit security level does not have a large enough margin of
safety nowadays.

RIPEMD has a 256-bit variant, but it hasn't received enough scrutiny.

[1]
[https://en.wikipedia.org/wiki/Birthday_attack](https://en.wikipedia.org/wiki/Birthday_attack)
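
The numbers behind that, as a quick worked example:

```python
import math

# An N-bit hash offers at best ~2^(N/2) collision resistance (birthday bound).
for bits in (128, 160, 256):
    work = math.sqrt(2 ** bits)  # expected hash evaluations to find a collision
    print(f"{bits}-bit hash: ~2^{bits // 2} collision work (~{work:.2e} evals)")
# 160-bit RIPEMD-160 -> ~2^80, borderline for a well-funded attacker;
# a 256-bit hash -> ~2^128, far out of reach.
```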

~~~
wyager
That's incredibly misleading.

We don't care about the likelihood of producing some random collision; we care
about the likelihood of producing some _specific_ collision (which is not
vulnerable to the birthday attack).
[http://en.wikipedia.org/wiki/Preimage_attack](http://en.wikipedia.org/wiki/Preimage_attack)

The reason SHA-1 is considered insufficient is that it is cryptographically
broken [https://marc-
stevens.nl/research/papers/PhD%20Thesis%20Marc%...](https://marc-
stevens.nl/research/papers/PhD%20Thesis%20Marc%20Stevens%20-%20Attacks%20on%20Hash%20Functions%20and%20Applications.pdf)

I.e. the chance of finding a collision is substantially higher than would be
expected from an ideal PRF.

As far as I know, there are no serious cryptanalytical attacks on RIPEMD-160,
and 160 bits is more than sufficient for cryptographically unique identifiers.

~~~
timmclean
Collision resistance is critical for most applications of the OP's scheme. The
OP is proposing using hashes as identifiers for immutable content. Imagine the
following:

- I publish a JavaScript library under this scheme using a hash without
collision resistance.

- Popular/important websites refer to my library as hashname://..., trusting
that this refers to the version of the library that they audited.

- I can then create a new, malicious version of the library that has the same
hash and use it to infect popular sites.

Allowing collisions breaks the immutability requirement, which impacts
security in many important cases.

~~~
Aloisius
> Allowing collisions breaks the immutability requirement, which impacts
> security in many important cases.

_All_ hashes have collisions. There are no cryptographic hashes that can
promise zero collisions.

~~~
timmclean
Of course; I was speaking informally. "allowing collisions" == "allowing it to
be feasible to find a collision"

------
xav
Guys, yes, there is a need for sorting things out, but it should go the other
way around: the UUIDs and hashes should be abstracted behind a much more
user-friendly interface.

