There's a long tradition of using this kind of approach in capability systems. If you do it right, you can have globally unique identifiers that are not human readable (probably, some sort of public key which is also routable), and then let humans assign local "petnames" (labels) to them in a way that is actually pretty usable, if everyone is using software designed for it.
Problems come when, say, you want to put your web address on a billboard. In an ideal world the billboard would somehow transmit the advertiser's public key to the viewer's device so that the viewer could then look them up, but obviously we don't have any particular tech for doing that. So instead we create this whole complex system by which people can register human-readable identities, which in turn requires a centralized name service (yes, DNS is centralized), certificate authorities (ugh), etc.
Similarly, whenever you tell your friend about some third-party entity (another person, company, whatever), you should be giving them the public key of that entity. But that's not really practical. We need some sort of brain implant for this. :)
> In an ideal world the billboard would somehow transmit the advertiser's public key to the viewer's device so that the viewer could then look them up, but obviously we don't have any particular tech for doing that.
Augmented-reality tech (whatever comes next in the line of tech Google Glass is in) would presumably do this as one of its primary use-cases, though. As soon as there's a reader-device that knows how to passively scan for and "absorb" encountered pubkeys into your keychain, a "signed link emitter in hybrid QR-code/NFC format" would become as commonplace as printed URLs are today, because they'd actually be useful over-and-above URLs.
UUIDs can also be collided. Did you want some form of public key referencing that only a private key can answer, perhaps?
As already pointed out by ivoras here, magnet: links are close to exactly what you're looking for. It also reminds me of, for example, the Freenet CHK/SSK/USK system, or several other things with similar designs.
You should also not use SHA-1 hashes for uniqueness or checksumming anymore: they're too weak.
This would avoid the inefficient situation we have at present, where the same jQuery script is fragmented across dozens of CDNs, causing a new request each time even when previous instances are already cached by the browser.
I would love to see maven in the browser. Have each webapp serve a pom.xml listing the libs it uses along with exact versions and be done with this hell. The browser then checks if the lib is installed, if not, it goes through the user-specified maven repositories, then through the default ones and then through the webapp-specific ones searching for the artefact. Problem solved.
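A rough sketch of what that resolution order could look like, with a made-up manifest format and hypothetical repository URLs (this is not a real browser API, just the shape of the idea):

```python
import hashlib
import urllib.request

# Hypothetical manifest entry: exact version plus an expected content hash.
MANIFEST = {
    "jquery": {"version": "2.1.1", "sha256": "<expected hex digest>"},
}

# Search order described above: user-specified repos, then defaults, then app-specific.
REPOSITORIES = [
    "https://user-configured.example/repo",
    "https://defaults.example/repo",
    "https://webapp.example/repo",
]

_CACHE = {}  # (name, version) -> bytes, standing in for the browser's local store

def resolve(name: str) -> bytes:
    """Return the library, preferring the local cache over any repository."""
    spec = MANIFEST[name]
    key = (name, spec["version"])
    if key in _CACHE:                       # already installed locally: no request at all
        return _CACHE[key]
    for repo in REPOSITORIES:
        url = f"{repo}/{name}/{spec['version']}/{name}.min.js"
        try:
            data = urllib.request.urlopen(url, timeout=10).read()
        except OSError:
            continue
        if hashlib.sha256(data).hexdigest() == spec["sha256"]:
            _CACHE[key] = data              # any repository will do once the hash matches
            return data
    raise LookupError(f"{name} {spec['version']} not found in any repository")
```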
The problem is not that simple. Most webapps are compiled and minimised using systems like Closure Compiler, Uglify and R.js (for AMD apps).
You almost never point directly to a dependency as a standalone file.
Doing so would mean 15-30 requests per webapp, and since browsers only accept 5-6 parallel requests, it would slow down the page considerably.
SHA1 checksums are fine for content that is immutable (doesn’t change) - but what about a file whose content changes with time?
I think we've been doing some very silly things with data over the last few decades related to the UPDATE statement.
Why not just consider all published data immutable? Look at book publishing as an analogy. There are multiple printings of a book. If there were corrections or updates they don't retroactively affect the previous printings. Why can't we look at data that is published on the Internet in the same manner?
If you want to update something you'll have to publish a brand new version. This also mirrors versioning in software libraries.
CRDTs, immutable data structures, eventually consistent data... from UI programming to big data, these are more than just eternally recurring trends. We're learning some very lasting things about how computers should deal with data.
Because computers and reality don't actually deal with immutable data? I love the reasoning powers you get from it at a code level, but it's simply not a fast way to do things in many cases. Also, many applications do not care about all (or any) interim copies of data.
With respect to web content, what are you proposing? If I go to site.com/product1data and you update the price, we certainly don't want the URL to change. In such situations, how would a versioned system add any value, and how would the UI be exposed?
The price of something has nothing to do with the content. The data that represents the price is a pointer to the content, akin to affixing a new price sticker to the jacket of a book. Perhaps the original published content could have a recommended price embedded in it.
This applies to the names and categorizations of things as well.
As for updates, imagine you're a shopkeep and in the morning you publish a table of prices, titles and content-addressable hashes.
Now, for the whole naming-of-the-things... take your pick: ICANN or Namecoin-like.
Claim ownership of a top-level name and then you can point it at whatever you want.
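A toy version of that morning price table, assuming the content itself already lives in some content-addressed store (the titles and prices here are invented):

```python
import hashlib, json, time

def content_id(data: bytes) -> str:
    """Content-addressable name: just the hex digest of the immutable bytes."""
    return hashlib.sha256(data).hexdigest()

book = b"...the full, immutable text of the book..."

# The price is not part of the content; it's a mutable pointer *to* the content,
# like sticking a new price label on the same printing of a book.
price_list = {
    "published": time.strftime("%Y-%m-%d"),
    "items": [
        {"title": "Example Book", "price": "9.99", "content": content_id(book)},
    ],
}

# Tomorrow's table can carry new prices and titles; the book's hash never changes.
print(json.dumps(price_list, indent=2))
```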
[not the OP] In your example, the URL name would be a reference to an immutable document with an immutable ID. You can update the named reference to point at the new ID, but the old ID still refers to the old document.
It would also be possible to store and distribute new versions as patches to previous versions by linking between the patch set, old version, and new version.
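A minimal sketch of that split between mutable names and immutable IDs, with patch objects linking old and new versions (storage here is just an in-memory dict):

```python
import difflib, hashlib, json

def cid(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

store = {}   # immutable blobs, keyed by content hash
names = {}   # mutable name -> current content hash

def publish(name: str, data: bytes) -> str:
    new_id = cid(data)
    store[new_id] = data
    old_id = names.get(name)
    if old_id is not None:
        # Optionally distribute the change as a patch linking old and new versions.
        patch = json.dumps({
            "old": old_id,
            "new": new_id,
            "diff": list(difflib.unified_diff(
                store[old_id].decode().splitlines(),
                data.decode().splitlines())),
        }).encode()
        store[cid(patch)] = patch
    names[name] = new_id        # the name moves; the old ID still resolves
    return new_id

v1 = publish("site.com/product1data", b"price: 10\n")
v2 = publish("site.com/product1data", b"price: 12\n")
assert names["site.com/product1data"] == v2 and v1 in store   # old version still there
```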
OK, but how does that fit into how we're actually using URLs today, like how would this change the web? If you browse my product info page, I don't want you linking to a versioned document, in general. In the cases I'm thinking of off the top of my head, I'm not sure when you'd ever want people linking/copying version-specific URLs that somewhat negate the point of updating the page in the first place.
Or perhaps I don't understand "Why can't we look at data that is published on the Internet in the same manner".
The web of today is based completely on "this named thing references this other named thing". When we create links to other content we are most definitely doing so in the context of the content as it is at the time of linking, not some future state. What happens if that data is waaay different or missing? Dead links we call them.
In terms of commerce this is somewhat analogous to bait-and-switch.
In this thread I'm mainly referring to digital content as an end in itself, not to digital content as a reference to physical products.
As for referencing physical products, be they automobiles or paintings, deriving a direct cryptographic hash isn't possible, but GUIDs are. Cars already have serial numbers. Paintings have signed certificates from experts.
If I'm on a website buying a used car I definitely want the price list to be linking to the GUID, that is, to a reference of the object itself.
As concisely as possible: IPFS is a way to publish your data in the cloud (and the related Filecoin project is a way to host your cloud data), urbit is a way to publish and host your identity, computing and data.
As a more specialized project, you'd expect IPFS to be better at the part of the problem it solves, for the same reason a sprinter sprints faster than a decathlete. (Not to mention that JB is awesome.) On the other hand, not having an identity model other than public keys (or, to put it differently, not trying to square Zooko's triangle), imposes certain problems on IPFS that urbit doesn't have. For instance, with routable identities, you don't need a DHT, and so the idea of a hash-addressed namespace is less interesting.
That said, it would be easy to imagine a world in which urbit either could talk to IPFS, or even layered its own filesystem (which has a fairly ordinary git structure under the hood) over IPFS. Like I said, it's a cool project.
The reason naming things by hash is a good idea is that you can cache them "forever", until you decide to change headers. The only benefit of a UUID over a URL is that the lack of meaning makes it less likely that people will change it later on. However, semantic naming is useful, and avoids collisions. True UUIDs require very careful use of RNGs.
For modifiable content, more useful than UUIDs would be (chosen-name, content-hash) tuples that are then signed by the publishing crypto-identity.
The chosen-name can be a UUID, but doesn't have to be something so semantically-opaque. It's more likely to be a tree-namespace like traditional domain-centric URLs. The publishers-key replaces the role of the domain-name as the 'authority' portion of the URL.
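One way such a signed (name, hash) record might look, using Ed25519 from the cryptography package; the record format itself is made up:

```python
import hashlib, json
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.exceptions import InvalidSignature

publisher_key = Ed25519PrivateKey.generate()
content = b"window.mylib = {};"

# The record binds a chosen (possibly hierarchical) name to a content hash.
record = json.dumps({
    "name": "mylib/releases/1.2.0",
    "hash": hashlib.sha256(content).hexdigest(),
}, sort_keys=True).encode()
signature = publisher_key.sign(record)

# A client that already trusts this publisher's public key verifies the binding,
# then fetches the content from anywhere and checks it against the hash.
public_key = publisher_key.public_key()
try:
    public_key.verify(signature, record)
except InvalidSignature:
    raise SystemExit("name record was not signed by this publisher")
```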
- First you create a "random" blob that has an identity (called a permanode), that you sign with your private key. It acts as an "anchor" you can link to.
- Then you sign a piece of json that references the permanode and the content you wish; it effectively means "I, owner of key XXX, claim that permanode called 12345 now references value ABCD".
- To get the content of a permanode, you search all modifications and merge them to obtain the final value.
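A toy version of that flow (real signatures and wire formats omitted):

```python
import hashlib, json, uuid

blobs = {}   # content-addressed blob store

def put(obj) -> str:
    data = json.dumps(obj, sort_keys=True).encode()
    ref = hashlib.sha256(data).hexdigest()
    blobs[ref] = obj
    return ref

# 1. A "random" blob acts as the permanode, the stable anchor you can link to.
permanode = put({"type": "permanode", "random": uuid.uuid4().hex,
                 "signer": "key-XXX"})

# 2. Claims say "permanode P now references value V" (signing elided in this toy).
put({"type": "claim", "permanode": permanode, "value": "ABCD", "time": 1})
put({"type": "claim", "permanode": permanode, "value": "EFGH", "time": 2})

# 3. The current value comes from merging all claims, the latest one winning.
def current_value(p: str) -> str:
    claims = [b for b in blobs.values()
              if b.get("type") == "claim" and b["permanode"] == p]
    return max(claims, key=lambda c: c["time"])["value"]

assert current_value(permanode) == "EFGH"
```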
An alternative, peer-to-peer name resolution mechanism will be in greater demand every year. Currently, DNS records are spoofed, proxied, blocked, and whatnot by the domain name registrars because of legal threats by big money.
There should be a way for people to search a particular server by name without being in cahoots with every other party who wants to get involved in it.
"Once we have the SHA1 name of a file we can safely request this file from any server and don’t need to bother with security."
I don't understand this statement. Dispensing with encryption makes the request and returned content susceptible to passive observation. Saying that this approach is resistant to active manipulation assumes the existence of some kind of web of trust between hashed documents. At some point you're going to have to click on a hash without knowing its provenance. How do you know you're not being phished?
When you have a hash you trust locally, you can fetch a file from anywhere without caring about the source - the only thing that matters is if the file matches your hash. So the only thing you need to care about is ensuring that the hash you have is the one you really want.
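In code, that "fetch from anywhere, trust only the hash" step could look like this (the mirror URLs are hypothetical, and SHA-256 stands in for the article's SHA-1, per the sibling comments):

```python
import hashlib
import urllib.request

def fetch_by_hash(expected_sha256: str, mirrors) -> bytes:
    """Try untrusted mirrors in turn; accept the first response whose hash matches."""
    for url in mirrors:
        try:
            data = urllib.request.urlopen(url, timeout=10).read()
        except OSError:
            continue
        if hashlib.sha256(data).hexdigest() == expected_sha256:
            return data            # the source doesn't matter once the hash matches
    raise ValueError("no mirror returned content matching the trusted hash")
```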
Their point is that you can only assume so much about the security of the tools you are using. If your hash algorithm of choice turns out to be insecure, then you have to switch. Take installing a package with apt-get: if the SHA hash of the downloaded package matches what the repository claims, you can trust the package even if it came from a random mirror. But if you are paranoid, you shouldn't be fetching from a random website in the first place. If you want extra confidence, only fetch from servers you trust, and only over HTTPS.
Also consider what benefit you get by verifying the entire file. Some applications may wish to read just the first few bytes to check that the file is openable, but the first N bytes can fool you if the last M bytes are malicious. So you would open the file in a sandbox to minimize the impact.
MD5 has been known to be too weak for this purpose since 1995. So, no competent protocol-designer or publisher will have used it for this purpose for decades. SHA1 is now under enough suspicion to avoid for this purpose, but not yet proven to be compromisable.
But SHA2-256 and up, and many other hashes, are still safe for this purpose and likely to remain so for decades – and perhaps indefinitely.
So within the lifetime of an application or even a person, secure-hash-naming does obviate the need for secure transport and trust. Also note that 'secure' transport and trust, if dependent on things like SSL/TLS/PKI, also relies on the collision-resistance of secure hash functions – in some cases even weaker hash functions than anyone would consider for content-naming.
(For the extremely paranoid, using pairs of hash functions that won't be broken simultaneously, and assuming some sort of reliable historical-record/secure-timestamping is possible, mappings can be robust against individual hash breaks and refreshed, relay-race-baton-style, indefinitely.)
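For that paranoid option, a name could simply be a pair of digests from unrelated hash families, e.g. SHA-2 and SHA-3; a sketch:

```python
import hashlib

def dual_name(data: bytes) -> tuple:
    """Name content by two hashes from different families; breaking one alone isn't enough."""
    return (hashlib.sha256(data).hexdigest(),
            hashlib.sha3_256(data).hexdigest())

def verify(data: bytes, name: tuple) -> bool:
    return dual_name(data) == name
```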
Gojomo, that was the point of using MD5 in the example. If this system had been deployed in 1995, it would have used MD5, and thus the problem of broken/obsolete links that the comment outlined would have applied to it after a few years. Who's to say the same thing won't happen to this system in 5 years?
There was no surprise break of MD5 – it came after plenty of warning, so even a hypothetical 1995 deployment would've had years for a gradual transition, and continuity-of-reference via correlation-mapping to a new hash.
So even that hypothetical example – with an early, old, and ultimately flawed secure hash – reveals hash-based as more robust than the alternatives.
And in practice, hash-names are as strong or stronger than the implied alternative of "trust by source" – because identification of the source is, under the covers, also reliant on secure hashes… plus other systems that can independently fail.
We have experience now with how secure hash functions weaken and fail. It's happened for a few once-trusted hashes, with warning, slowly over decades. And as a result, the current recommended secure hashes are much improved – their collision-resistance could outlive everyone here.
Compare that to the rate of surprise compromises in SSL libraries or the PKI/CA infrastructure – several a year. Or the fact that SSL websites were still offering sessions bootstrapped from MD5-based PKI certificates after MD5 collisions were demonstrated.
Well, we understand hash functions a lot better now than we did back then. It would be foolish to confidently state that SHA2 or SHA3 will _never_ be broken, but it's not foolish to state that, given what we know, they are unlikely to be broken.
In context, Armstrong is only concerned about 'security' from tampering/forgery in that statement. Confidentiality is not specifically ensured... but being indifferent as to the path/server which delivers your content may help the effectiveness of other strategies for obscuring your interest, such as routing your requests through mixes of trusted and untrusted relays.
If you use a tree hash, the side sending you content can even include compact proofs that what they're sending you is a legitimate part of a full-file with the desired final hash.
So for example, if receiving a 10GB file, you don't have to get all 10GB before learning any particular relayer is a dishonest node.
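A sketch of the verification side, assuming the sender supplies each chunk plus its Merkle path up to the file's root hash:

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def verify_chunk(chunk: bytes, proof, root: bytes) -> bool:
    """Check one chunk against the whole file's tree-hash root.

    `proof` is a list of (sibling_hash, sibling_is_left) pairs from leaf to root,
    supplied by whichever relayer is sending the chunk."""
    node = h(chunk)
    for sibling, sibling_is_left in proof:
        node = h(sibling + node) if sibling_is_left else h(node + sibling)
    return node == root
```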
Yep. Freenet is basically this plus a bunch of really complicated mechanisms to create anonymity in who shoved the documents into it. If you take those bits out† then you're left with a DHT that:
1. lets you talk about individual objects by hash-based URNs;
2. or lets a publisher insert versioned streams of objects using document-signing with deterministic subkeying‡; gives the stream as a whole a UUID-based URN; and then lets clients query for either the latest, or for any fixed version-index of a given object-stream;
3. and which does a sort of pull-based store-and-forward of content—every node acting as a caching proxy for every other node.
I'm really surprised nobody has just built this trimmed-down design and called it a "distributed object storage mesh network" or somesuch. A public instance of it would beat the Bittorrent DHT at its own game; and private instances of it would be competitive with systems like Riak CS.
---
† Which is perfectly sensible even for Freenet itself; you could always just run your Freenet node as a Tor hidden service, now that both exist. Tor cleanly encapsulates all the problems of anonymous packet delivery away; the DHT can then just be a DHT.
‡ This is similar to Bitcoin's BIP0032 proposal, but the root keys are public keys and are available in the same object-space as the transactions. Given that you have the root public key, you can both 1. prove that all the documents were signed with keys derived from this key, and also 2. figure out what the "nonce" added to the root key to create the subkey was in each case. If the inserting client agrees to use a monotonically-increasing counter for nonces, then the subkey-signed documents are orderable once you've recovered their subkeys.
There is some research going on in Information Centric Networking and Named Data Networking. The basic idea is switching to a system based on content instead of hosts. The proposed solutions try to reduce traffic overhead (like replication and caching of content on switches and routers), provide better indexing and search functionalities etc.
Things like git hashes and the bitcoin blockchain are already giving us pieces that head in this direction.
One thing I would not like to see would be total lockdown of the worldwide body of documents, where every digital creation is perfectly and irrefutably tagged and tracked back to its creator and on every step along the way. I'm not sure what all the bad consequences of this could be, but it doesn't give me a warm and fuzzy feeling.
Why not something like bit.ly's combination of 6 characters? Roughly 57 billion combinations is a good number for a short-URL website.
For a content website, something like YouTube's combination of 11 characters is pretty good. They will surely have a problem in the future, but that might be 5+ years out, and adding one more character gives many more combinations.
This is about URLs rather than files, though. Can anyone correct me if I am wrong?
Because those 6 chars need to have a backing store to map them. You cannot derive the 6 chars just from the content of a file. Whereas with a hash function, you can.
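The difference in one line each (the short ID and table below are invented):

```python
import hashlib

# A shortener ID is meaningless without a central table mapping it to the target.
lookup_table = {"aZ3xY9": "https://example.com/some/page"}

# A content hash needs no table: anyone holding the same bytes derives the same name.
content = b"the file itself"
derived_id = hashlib.sha256(content).hexdigest()
```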
Armstrong's concern about the ever-growing number of data objects is interesting in itself. He says Git(hub) is great and all but - paraphrasing - forks outnumber merges. At some point such systems become unwieldy unless there is systemic incentive to dedupe, merge, categorise or otherwise reduce complexity. This chimes with my view too.
Tangentially related to some issues mentioned here (keeping track of articles that change over time) is the Memento protocol: http://timetravel.mementoweb.org/
I've seen this done in Cloud Foundry, for these reasons.
Objects with identity, like a buildpack, are given a UUID. That UUID is stable, but the hash changes on disk depending on the exact file uploaded (because you can replace buildpacks).
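Not Cloud Foundry's actual code, just the shape of that split between identity and content:

```python
import hashlib, uuid

class Buildpack:
    """Identity is a stable UUID; the hash tracks whatever file is currently uploaded."""
    def __init__(self):
        self.id = uuid.uuid4()          # never changes for the life of the object
        self.sha256 = None

    def upload(self, data: bytes):
        self.sha256 = hashlib.sha256(data).hexdigest()   # changes when the file is replaced

bp = Buildpack()
bp.upload(b"buildpack v1")
original_id, first_hash = bp.id, bp.sha256
bp.upload(b"buildpack v2")
assert bp.id == original_id and bp.sha256 != first_hash
```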
A member of the SHA2 (or SHA3) family would be more appropriate. RIPEMD-160 is slower and less resistant to collisions. I agree however that the use of SHA1 is problematic!
N-bit hashes have at best (N/2)-bit collision resistance (see birthday attack[1]). An 80-bit security level does not have a large enough margin of safety nowadays.
RIPEMD has a 256-bit variant, but it hasn't received enough scrutiny.
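To make the birthday bound concrete (plain arithmetic, nothing protocol-specific):

```python
import math

def collision_work_bits(hash_bits: int) -> float:
    """Bits of work for a ~50% birthday collision: roughly n/2."""
    return hash_bits / 2

def collision_probability(hash_bits: int, num_hashes: float) -> float:
    """Approximate P(collision) after hashing num_hashes items (birthday bound)."""
    return 1 - math.exp(-num_hashes**2 / 2**(hash_bits + 1))

print(collision_work_bits(160))            # 80.0 for a 160-bit output
print(collision_probability(160, 2**80))   # ~0.39, i.e. already likely at 2^80 hashes
```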
We don't care about the likelihood of producing some random collision; we care about the likelihood of producing some specific collision (which is not vulnerable to the birthday attack). http://en.wikipedia.org/wiki/Preimage_attack
I.e. the chance of finding a collision is substantially higher than would be expected from an ideal PRF.
As far as I know, there are no serious cryptanalytical attacks on RIPEMD-160, and 160 bits is more than sufficient for cryptographically unique identifiers.
Collision resistance is critical for most applications of the OP's scheme. The OP is proposing using hashes as identifiers for immutable content. Imagine the following:
- I publish a JavaScript library under this scheme using a hash without collision resistance.
- Popular/important websites refer to my library as hashname://..., trusting that this refers to the version of the library that they audited.
- I can then create a new, malicious version of the library that has the same hash and use it to infect popular sites.
Allowing collisions breaks the immutability requirement, which impacts security in many important cases.
You're describing a second-preimage attack. You cannot use a birthday attack to pull this off. The expected difficulty of pulling off a second-preimage attack on RIPEMD-160 is close to 2^159, which is more than sufficient.
The reason SHA-1 is insecure is that it is cryptographically broken, and the same attack takes less than 2^60 attempts.
Guys, yes, there is a need for sorting things out, but it should go the other way around: the UUIDs and hashes should be abstracted into a way more user-friendly interface.
http://en.wikipedia.org/wiki/Zooko%27s_triangle