A function's true name should be its content hash. (Where that content hash is calculated after canonicalizing all the call-sites in the function into content hash refs themselves.) This way:
- functions are versioned by name
- a function will "pull in" its dependencies, transitively, at compile time; a function will never change behaviour just because a dependency has a new version available
- the global database can store all functions ever created, without worrying about anyone stepping on anyone else's toes
- magical zero-install (runtime reference of a function hash that doesn't exist -> the process blocks while it gets downloaded from the database.) This is safe: presuming a currently-accepted cryptographic hash, if you ask for a function with hash X, you'll be running known code.
- you can still build "curation" schemes on top of this, with author versioning, using basically Freenet's Signed Subspace Key approach (sort of equivalent to a checkout of a git repo). The module author publishes a signed function which returns a function when passed an identifier (this is your "module"). Later, they publish a new function that maps identifiers to other functions. The whole stdlib could live in the DB and be dereferenced into cache on first run from a burned-in module-function ref.
- function unloading can be done automatically when nothing has called into (or is running in the context of) a function for a while. Basically, garbage collection.
- you can still do late binding if you want. In Erlang, "remote" (fully-qualified) calls aren't usually intended to switch behaviour when a new version of the module loads; they just get conflated with fully-qualified self-calls, which are explicitly for that. In a flat function namespace, you'd have to make late binding explicit to the compiler, since it would never be assumed otherwise. E.g. you'd call apply() with a function identifier, which would kick in the function-metadata resolution mechanism (normally just part of the linker) at runtime.
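To make the naming scheme concrete, here's a minimal sketch of how a "true name" might be computed, assuming a toy AST in which call targets are either already-resolved hashes or local names looked up in some compilation environment (all type and function names here are illustrative, not part of any real system):

    import { createHash } from "node:crypto";

    // Toy AST for illustration: a function body is a list of nodes, and a call
    // targets either an already-resolved content hash or a local name.
    type Node =
      | { kind: "call"; target: { byHash: string } | { byName: string }; args: Node[] }
      | { kind: "lit"; value: number | string }
      | { kind: "arg"; index: number };

    // Canonicalize: rewrite every call-site to the content hash of its target,
    // looking local names up in a compilation environment. After this pass the
    // function body no longer mentions any mutable, human-chosen names.
    function canonicalize(body: Node[], env: Map<string, string>): Node[] {
      return body.map(node => {
        if (node.kind !== "call") return node;
        const target =
          "byHash" in node.target
            ? node.target
            : { byHash: env.get(node.target.byName)! }; // sketch: assumes env is complete
        return { ...node, target, args: canonicalize(node.args, env) };
      });
    }

    // The function's "true name": a hash over the canonicalized body. A real
    // implementation would need a fully deterministic serialization;
    // JSON.stringify is only good enough for this sketch.
    function trueName(body: Node[], env: Map<string, string>): string {
      const canonical = canonicalize(body, env);
      return createHash("sha256").update(JSON.stringify(canonical)).digest("hex");
    }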
Plug: I am already working on a BEAM-compatible VM with exactly these semantics. (Also: 1. a container-like concept of security domains, allowing for multiple "virtual nodes" to share the same VM schedulers while keeping isolated heaps, atom tables, etc. [E.g. you set up a container for a given user's web requests to run under; if they crash the VM, no problem, it was just their virtual VM.] 2. Some logic with code signing such that calling a function written by X, where you haven't explicitly trusted X, sets up a domain for X and runs it in there. 3. Some PNaCl-like tricks where object files are simply code-signed binary ASTs, and final compilation happens at load-time. But the cached compiled artifact can sit in the global database and can be checked by the compiler, and reused, as an optimization of actually doing compilation. Etc.) If you want to know more, please send me an email (levi@leviaul.com).
Currently it runs in JavaScript, but I was looking at other languages that it would be good to compile to, and Erlang seems like a pretty good fit for it.
This sounds great. The issue is that when you fix a function, you generally want everyone to use your bug fix. This is even more important for security issues.
However, sometimes code needs the broken version of some function...
There really is no great universal solution to this stuff.
The second behaviour is what you'd get out of the "low-level" contract of the system. The first could be built on top: you could have late-bound references that effectively call a function with a semver constraint, e.g.
apply(repo_handle_foo, bar_fn, {'~>', 2, 3})
Meaning the system would bind that to a version of bar_fn that exists in the foo repo, and is considered to be in the 2.3.X release series. (Conveniently, given the global code DB, it'd probably always resolve to the very latest thing in that series.)
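A rough sketch of what that late-bound resolution could look like on the database side (the registry shape and names are invented for illustration):

    // Hypothetical registry entry: a published version of a named function
    // inside a repo, pointing at an immutable content hash.
    interface PublishedFn {
      name: string;
      version: [major: number, minor: number, patch: number];
      hash: string; // content hash of the canonicalized function
    }

    // Resolve `name` under a pessimistic "~> major.minor" constraint:
    // stay in the same major.minor series, pick the highest patch available.
    function resolvePessimistic(
      repo: PublishedFn[],
      name: string,
      major: number,
      minor: number
    ): string | undefined {
      return repo
        .filter(f => f.name === name && f.version[0] === major && f.version[1] === minor)
        .sort((a, b) => b.version[2] - a.version[2])[0]?.hash; // newest patch first
    }

    // apply(repo_handle_foo, bar_fn, {'~>', 2, 3}) would then late-bind to
    // whatever hash resolvePessimistic(fooRepo, "bar_fn", 2, 3) returns.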
I would suggest a slight change in thinking, though: at the individual function level, you can version implementation, but you don't version semantics. If you have foo 1.0 and foo 2.0 that do different things, that have different API contracts—then those are simply different functions which should have separate identifiers. The only time a function-contract identifier should have "revisions" is to correct flaws that diverge the implementation from the contract. Eventually, the function perfectly does what the contract specifies; at that point, the function is done. If you want the function to do something else, new contract = new identifier.
(But what if your old implementation works and meets the contract, but is way too slow? This is where you get into alternative implementations of the same contract, and compilers doing global runtime A/B-testing of different implementations as a weird sort of JIT. This is orthogonal to versioning, almost, but it means you can't put the version constraints like "relies on [revision > bugfix] of foo" in your code—because what if the VM is running a foo that never had the bug? So those constraints go in the database itself, and effectively "hide" old versions from being offered as resolution targets, without impeding direct reference.)
The hash must not depend on the names of bound variables, so that fun add(x, y) { return x + y } is the same as fun add(a, b) { return a + b }. Alpha-conversion independence.
Right. Identifier names effectively become metadata; the AST would look like numbered SSA references (as in LLVM, which does a similar transformation, to allow the optimizer pass to just pattern-match known AST "signatures" and rewrite them.)
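As a sketch of that normalization (illustrative names only): rename every bound variable to a positional placeholder before hashing, so the author's choice of identifier never affects the hash:

    // Minimal expression AST for illustration.
    type Expr =
      | { kind: "var"; name: string }
      | { kind: "add"; left: Expr; right: Expr };

    interface Fn {
      params: string[]; // e.g. ["x", "y"] or ["a", "b"]
      body: Expr;
    }

    // Alpha-normalize: replace parameter names with positional placeholders
    // ("%0", "%1", ...), so `add(x, y) = x + y` and `add(a, b) = a + b`
    // canonicalize to byte-identical forms, and therefore identical hashes.
    function alphaNormalize(fn: Fn): Fn {
      const rename = new Map(fn.params.map((p, i) => [p, `%${i}`] as const));
      const walk = (e: Expr): Expr =>
        e.kind === "var"
          ? { kind: "var", name: rename.get(e.name) ?? e.name }
          : { kind: "add", left: walk(e.left), right: walk(e.right) };
      return { params: fn.params.map((_, i) => `%${i}`), body: walk(fn.body) };
    }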
What is the probability of a hash collision between two functional pieces of code? For example, one could possibly replace a hard-coded domain "http://example.com" with "http://exàmlpœ.com" (assuming it results in the same hash), then register exàmlpœ.com and do a man-in-the-middle attack with it.
I am not a cryptographer so maybe one of them can chime in and verify this:
If you use a strong cryptographic hashing algorithm, that would be practically impossible given current computational resources and their growth for eons into the future. For example, there are no known collisions of SHA-2: no one has found two items that have the same SHA-2 hash.
Quantum computers or some breakthrough algorithm could change that. If that happens, all encryption on the internet likely breaks as well, except where quantum cryptography is being used.
The probability, assuming a properly-designed hash function, is no more likely than one in about 300 undecillion. No algorithmic flaws are known that would let an attacker take an arbitrary input to MD5, SHA-1, SHA-2, or SHA-3 and construct a hash collision in noticeably better time. There are attacks that let an attacker construct two inputs with the same hash for MD5 and SHA-1, so out of an abundance of caution we're considering MD5 and SHA-1 both "broken", there are now two successor hashes to SHA-1, and we're hurrying to replace all uses of MD5 and SHA-1 where collisions matter.
If you're not designing your software to worry about cosmic rays, you shouldn't design your software to worry about hash collisions. Just pick a good hash.
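For scale, a quick back-of-the-envelope (assuming the "300 undecillion" figure refers to the output space of even a 128-bit digest; the birthday bound below is the standard approximation, not something from the comment above):

    2^{128} \approx 3.4 \times 10^{38} \quad \text{(about 340 undecillion)}

    \Pr[\text{some collision among } N \text{ random inputs, 256-bit hash}] \approx \frac{N^2}{2^{257}}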
I disagree with your threat model. Why do you think that the ability to create two colliding functions with basically arbitrary content does not endanger the end user? E.g. (1) I write (or copy) a useful function f1, (2) create an evil function f2, (3) manipulate the metadata of both functions to create f1' and f2', which are functionally identical to f1 and f2, (4) publish f1'. (5) People start to use f1', (6) an end user requests hash(f1') from the key-value store, (7) I man-in-the-middle the connection and return f2' instead, (8) the end user executes f2' and is compromised.
First off, because you simply don't have that ability in SHA-2 or SHA-3. If you're designing a new system, don't use MD5 or SHA-1; if you're using a legacy system, organize some sort of orderly panic (just like the CA/Browser forum is doing with SHA-1 certificates).
Second, because it's not "basically arbitrary": the content almost always looks like it's been specifically designed to make room for a hash collision. For binary formats like images and PDFs, it's easy to put a large amount of unrendered data in the file format that isn't visible in a viewer, which is exactly why the newsworthy collisions have been images and PDFs. Even X.509 certificates allow you some room for arbitrary data. For program code, having a bunch of arbitrary data in the middle of the function would look extremely suspicious. (Obviously you shouldn't design the metadata format of your functions to permit enough arbitrary data, but given the talk of alpha conversions, it sounds like the proposal is to hash a canonical, high-information-density representation of the function.)
But again this doesn't come up unless you're using a broken hash. My argument is just that even the "broken" hashes really aren't very broken, so if you're using the non-broken ones, you should basically assume they're perfectly secure.
> First off, because you simply don't have that ability in SHA-2 or SHA-3. If you're designing a new system, don't use MD5 or SHA-1;
Then don't mention MD5 and SHA1 in the first place. The sooner they leave everyone's mind as valid alternatives the better.
> For binary formats like images and PDFs, it's easy to put a large amount of unrendered data in the file format that isn't visible in a viewer
MD5 collisions have gotten much shorter in recent years [1].
> For program code, having a bunch of arbitrary data in the middle of the function would look extremely suspicious. (Obviously you shouldn't design the metadata format of your functions to permit enough arbitrary data,
And you trust the designers of these formats to know this "obvious" fact? You reference code normalization, but there was talk in this thread about keys that are to be associated with the functions to allow updates (and thus included in the hash), and I think it is perfectly valid to include graphics in the documentation of a function.
> My argument is just that even the "broken" hashes really aren't very broken, so if you're using the non-broken ones, you should basically assume they're perfectly secure.
My point is that this "structured collision resistance" is used far too often as a hand-wave argument for why a specific protocol can continue to use a broken hash. (Remember how CAs said the same things about X.509 certificates before Appelbaum, Molnar et al. [2] presented an actual proof of concept?) Software developers already have difficulty distinguishing preimage resistance from collision resistance. Giving them yet another argument to shoot themselves in the foot with is not a good idea.
Yeah, you're right, I was unclear about the status of MD5 and SHA-1. Unfortunately they are currently both well-known and faster than the alternatives, so I think there's value in calling them broken -- but I didn't do that effectively.
My argument is roughly that, one, you should listen to cryptographers about what is "broken" and take that advice seriously, and two, having done that, you should realize that the standard of "broken" is so conservative that the possibility of collisions in a non-broken hash is not even worth thinking about at an application level. I'm not sure that got across effectively.
I'm assuming his suggestion is no updates ever unless you update your dependency versions. Many module systems, like npm, have a culture of fuzzy matching, which means that when you run npm install again you can easily pull in new versions of libraries that have been upgraded since you last ran it. I'm a fan of strict dependencies, but some people prefer to make it easier to stay on the latest minor patch or other logical upgrade patterns.
Another option would be to have an area reserved for metadata about upgrades that can't be modified by page content but is handled at the browser level, while pages keep fixed, hash-verifiable URLs. But at the top there would be something like: "There have been 5 changes to this page since this version. See changes [link]. Go to latest version [link]."
"the global database can store all functions ever created, without worrying about anyone stepping on anyone else's toes"
I've had similar thoughts in the past, though my thought was of a function loader for JavaScript. Why not just experiment with this as a library for JavaScript functions? Downloading the code from the "global database" would be provided by a function:
getDataByHash(hash) : Blob
To load the returned code, use eval (blech), or back up a step and use a script tag plus an advanced version of magnet links I'd call snow links:
There will have to be a DOM-scanning/watching JS library that pages load from a normal URL, which scans the page for snow links, downloads them by calling getDataByHash(), and then swaps in new URLs using URL.createObjectURL(Blob).
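A minimal sketch of such a scanner, assuming getDataByHash is supplied by some DHT client library and that snow links are carried in a hypothetical data-snow attribute (both are illustrative assumptions, not an established convention):

    // Assumed to be provided by a DHT/p2p client library of some kind.
    declare function getDataByHash(hash: string): Promise<Blob>;

    // Find elements carrying a hypothetical `data-snow` attribute like
    // "snow:?xt=urn:sha2:beaaca...", fetch their content by hash, and swap in
    // an object URL so the browser loads the resource as usual.
    async function resolveSnowLinks(root: ParentNode = document): Promise<void> {
      for (const el of root.querySelectorAll<HTMLElement>("[data-snow]")) {
        const hash = el.dataset.snow!.split("urn:sha2:")[1]; // crude parse for the sketch
        const blob = await getDataByHash(hash);
        const url = URL.createObjectURL(blob);
        if (
          el instanceof HTMLImageElement ||
          el instanceof HTMLScriptElement ||
          el instanceof HTMLIFrameElement
        ) {
          el.src = url;
        } else if (el instanceof HTMLLinkElement) {
          el.href = url; // e.g. <link rel="stylesheet">
        }
      }
    }

    // Re-scan whenever new nodes are added to the page.
    new MutationObserver(() => void resolveSnowLinks()).observe(document.body, {
      childList: true,
      subtree: true,
    });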
This leads me in a somewhat orthogonal direction to your post...
Obviously any element in the page that loads resources would benefit from this (link rel=stylesheet, img, iframe, etc.), so the document scanner should work its magic on them as well.
To me the tricky part becomes "the global database." There are several ways to implement it. My thought would be to build it as a DHT on top of WebRTC. I'd look into WebTorrent; it has to scale from very small to very large files. Maybe have multiple different DHTs that the scanner will try.
Storing stuff in the DHT ought to be as simple as declaring to the network that the data (or solution) for a given hash is known. Clients p2p connect over to the knower (or daisy chain style as a sort of ad-hoc STUN/TURN setup?), and virally spread the data by declaring they too now have the solution for the hash they just downloaded. A CDN can still store a file and can be listed as a fallback provider in the snow URL.
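For illustration, the messages such a network might exchange could look roughly like this (names and fields are invented, not any existing protocol):

    // Messages peers might exchange over WebRTC data channels in this scheme.
    type Message =
      | { type: "announce"; hash: string }                      // "I now hold this hash"
      | { type: "want"; hash: string }                          // "who can serve this hash?"
      | { type: "providers"; hash: string; peers: string[] }    // peer IDs known to hold it
      | { type: "data"; hash: string; payload: ArrayBuffer };   // the content itself

    // A peer's local view: what it can serve, and who else claims to hold what.
    interface PeerState {
      stored: Map<string, ArrayBuffer>;
      providers: Map<string, Set<string>>;
    }

    // After fetching (and, in a real system, verifying) a blob, a peer caches it
    // and announces it -- this re-announcement is what spreads popular content.
    function onDataReceived(
      state: PeerState,
      msg: { hash: string; payload: ArrayBuffer },
      broadcast: (m: Message) => void
    ): void {
      state.stored.set(msg.hash, msg.payload);
      broadcast({ type: "announce", hash: msg.hash });
    }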
As an example application:
Once this DHT exists and p2p is built on top of it, it should not be hard to ask peers to run the content of a given SHA and return the result (or they will return its SHA if it's large, effectively memoizing the function). Something like:
run(hash) : RunOffer
Any peer (AWS or Google etc. could implement a peer as well) could respond to that with an offer to run the computation for an amount of p2p currency that scales to billions or trillions of tiny transactions per second.
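For illustration only, the run(hash) : RunOffer contract might be sketched as something like this (all names hypothetical):

    // A peer's offer to execute the content behind a hash for a price.
    interface RunOffer {
      peerId: string;
      price: number;             // in whatever micro-currency the network settles on
      estimatedSeconds: number;
      accept(): Promise<RunResult>;
    }

    // The result comes back inline, or -- if large -- as the hash of a blob the
    // requester can fetch from the DHT, effectively memoizing the computation.
    type RunResult =
      | { kind: "inline"; value: ArrayBuffer }
      | { kind: "byHash"; hash: string };

    declare function run(hash: string): Promise<RunOffer>;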
...
Anyhow, this comment is way too long already, but the ideas keep flowing from here. There are a lot of technical challenges for vital features (like how to register (im)mutable aliases for hashes and distribute them to the network without using DNS for authoritative top-level namespacing? How to get reliable redundancy guarantees for stored data without resorting to a CDN? What to do about real-time streaming of data where the data is produced by an ongoing process? How to group multiple devices for the same user?)
All-in-all it seems like being able to get data by the hash of its content from inside of a program (including library code) as easily as loading it from a URL or off the filesystem is pretty useful. It also seems like we can engineer the technology to make this happen, so I think it's inevitable and will happen pretty soon.
> Clients p2p connect over to the knower (or daisy chain style as a sort of ad-hoc STUN/TURN setup?), and virally spread the data by declaring they too now have the solution for the hash they just downloaded.
This is, in fact, exactly what Freenet does! It's a global DHT acting as a content-addressable store, where there's no "direct me to X" protocol function, only a "proxy my request for X to whoever you suspect has X, and cache X for yourself when returning it to me" function.
(Aside: Freenet also does Tor-like onion-routing stuff between the DHT nodes, which makes it extremely slow to the point that it was discarded by most as a solution for anonymous mesh-networking. But it doesn't have to do that. "Caching forwarding-proxy DHT" and "encrypted onion-routing message transport" are completely orthogonal ideas, which could be stacked, or used separately, on a case by case basis. "Caching forward-proxying DHT" is to the IP layer as "encrypted onion-routing message transport" is to TLS. PersistentTor-over-PlaintextFreenet would be much better than either Tor or Freenet as they are today.)
I agree that this could totally be done as a library in pretty much any language that allows for runtime module loading or runtime code evaluation. And, like you say, there are tons of interesting corollaries.
I think, though, that to be truly useful, you have to push this sort of idea as low in the stack as possible.
Unix established the "file descriptor" metaphor: a seekable stream of blocks-of-octets, some of which (files) existed persistently on disk, some of which (pipes) existed ephemerally between processes and only required resources proportional to the unconsumed buffer, some of which (sockets) existed between processes on entirely separate hosts. Everything in Unix is a file. (Or could be, at least. Unix programmers get "expose this as a file descriptor" right at about the same rate web API designers get REST right.) Unix (and most descendants) are "built on" files.
To truly expose the power of a global code-sharing system, you'd need the OS to be "built on" global-DHT-dereferenceable URNs. There would be cryptographic hashes everywhere in the system-call API. Because your data likely wouldn't just be on your computer (just cached there), you'd see cryptographic signatures instead of ownership, and encryption where a regular OS would use ACL-based policies.
At about the same level of importance that Unix treats "directories", such an OS would have to treat the concept of data equivalence-mapping manifests (telling you how to get git objects from packfiles, a movie from a list of torrent chunk hashes, etc.)
For any sort of collaborative manipulation of shared state, you'd probably see blocks from little private cryptocurrency block-chains floating about in the DHT, where the "currency" just represents CPU cycles the nodes are willing to donate to one-another's work.
And then (like in Plan9's Fossil), you might see regular filesystems and such built on top of this. But they'd only be metaphors—your files would be "on" your disk to about the same degree that a process's stack is "in" L1 cache. Disks would really just be another layer of the memory hierarchy, entirely page-cache; and marking a memory page "required to be persisted" would shove it out to the DHT, not to disk, since your disk isn't highly available.
—but. Designing this as an OS for PCs would be silly. It would have much too far to go to catch up with things like Windows and OSX. Much better to design it to run in userspace, or as a Xen image, or on a rump kernel on server hardware somewhere. It'd be an OS for cloud computers, or to be emulated in a background service on a PC and spoken with over a socket, silently installed as a dependency by some shiny new GUI app.
Which is, of course, the niches Erlang already fits into. So see above :)
Thank you! I had searched but hadn't been able to find this idea; this sounds exactly like what I've had bouncing around in my head for a while now!
I just read the W3C spec; it seems more like an integrity check, but it's so trivial to just use the integrity hash to download the data that the next step is removing the src attributes altogether / using them as a fallback.
Yes. The spec includes the option of providing fallback URLs to fetch from, and gives the browser fairly broad freedom (as I read it) to fulfill the request so long as the response data matches the given hash. How this connects up with the other ideas being discussed in this thread isn't immediately clear to me, as DHTs in practice tend to be too slow to block on for most in-browser resources, but I'd definitely call it an intriguing development.
Sounds like we are looking at the same set of problems, just from slightly different starting points!
The reasons to build the DHT on top of WebRTC rather than just using Freenet:
* Freenet brings a lot of extra 'features' (the ones you mention) that aren't needed for a lot of use cases; I view Freenet more as one potential implementation of the function getDataByHash(hash).
* Using a URI-style notation (snow:?xt=urn:sha2:beaaca...) makes it clear that the data has one simple universal address; it's not terribly difficult to write a FUSE layer that mounts /mnt/snow/sha to your favorite SHA-providing client (possibly just calling through to Node.js).
* Using WebRTC gives you a large platform to get the network kickstarted, as the user base is extremely large and no install is required.
So I do agree 100% that we should try to build a language (or languages!) that refer to functions by the hash of their source.
I view the programming language as just a sub-problem of two issues:
1. We have to make a DHT that is trivially accessible on all platforms, including, and especially, the browser (and likely there first).
2. Even with the DHT we need a way to alias/label the hashes. The alias can also be versioned. Immutable aliases would just be fixed to version 0.
So for a language implemented on top of the DHT, the aliases would correspond to the function names.
This would be a function
resolveAlias(alias) : Hash
where alias is something like "alias:dns.biz.jackman.code.spacegame.flySpaceship?v=0"
* v=0 fixes the function to a set version, so even if it is later patched the code doesn't change; if no v is specified then the latest is used (a more complicated resolution scheme might be desired).
* A dns. prefix is used because hopefully someday we can move beyond the need to piggyback on DNS for setting authority on keys, in which case a different super-TLD can be chosen.
The corresponding putAlias(name, hash, signature, versionNumber=0) function will attempt to associate a name with a hash across the network.
To prevent anyone from naming things under domains they don't control, a simple solution could be to use public/private keys. Vend the public key(s) of those able to set aliases under that (sub)domain as a DNS TXT record, or even a CNAME. They can then sign the putAlias by signing a hash of salt+name+hash+version with their private key (keeping in mind the need to avoid length-extension attacks).
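A minimal sketch of how putAlias and resolveAlias might fit together with that signature check, treating the DNS lookup and the DHT as black boxes and using Node's built-in crypto (every name here is illustrative; the salt mentioned above is omitted for brevity):

    import { createVerify } from "node:crypto";

    // Assumed primitives: fetch the domain's published public key (e.g. from a
    // DNS TXT record) and read/write alias records in the DHT.
    declare function lookupDomainKey(domain: string): Promise<string>; // PEM public key
    declare function dhtGet(key: string): Promise<AliasRecord | undefined>;
    declare function dhtPut(key: string, record: AliasRecord): Promise<void>;

    interface AliasRecord {
      name: string;      // e.g. "dns.biz.jackman.code.spacegame.flySpaceship"
      hash: string;      // content hash the alias points at
      version: number;   // 0 = immutable alias
      signature: string; // base64 signature over name|hash|version
    }

    // Accept a putAlias only if the signature verifies against the key the
    // alias's domain has published; otherwise anyone could squat the namespace.
    async function putAlias(record: AliasRecord): Promise<boolean> {
      const domain = record.name.split(".").slice(1, 3).reverse().join("."); // "jackman.biz" (crude)
      const publicKeyPem = await lookupDomainKey(domain);
      const ok = createVerify("sha256")
        .update(`${record.name}|${record.hash}|${record.version}`)
        .verify(publicKeyPem, record.signature, "base64");
      if (ok) await dhtPut(`${record.name}?v=${record.version}`, record);
      return ok;
    }

    // resolveAlias(alias) : Hash -- returns whatever hash the alias record points at.
    async function resolveAlias(alias: string): Promise<string | undefined> {
      return (await dhtGet(alias))?.hash;
    }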
I have thought along the lines of CPU cycles as a currency, but I haven't been able to think of a way to make that a tradeable and bankable currency. I think you still need a notion of currency that just tries to be a currency, albeit one that scales down to nano-transactions; BTC might work if it can handle higher volumes of TPS. This is because you might need to add an extra parameter, bounty, to every API function, wherein you name a price you are willing to pay for your peers to perform that action. Under normal browsing routines, your client should be doing a lot more work than it asks to have done. Those normal clients should be accumulating currency.
When you are on mobile you might be a drain on the network; in that case you will simply subsidize that activity from your home laptop, or build it back up when you go home at night, have your phone plugged in charging, and are on a Wi-Fi network (or maybe a wifi mesh plus UAVs that the network itself purchases to increase its coverage).
One other thing: if you want to run a big-data study on a bunch of SHAs and use a lot of compute, then obviously you are going to have to acquire some of this currency, since you are a load on the network.