
Cache across domains with a file hash - wgx
http://alexchamberlain.co.uk/opinion/2012/09/13/cache-across-domains.html
======
agl
I've coded something similar up in WebKit and considered landing it in Chrome,
although I was more concerned with being able to use HTTP resources securely
in HTTPS pages.

Firstly you need a way to incrementally verify a file: you don't want to have
to download and buffer the whole file before figuring out whether it's good.
Thankfully you can do this with Merkle trees: just make them `degenerate'
(every left child is a leaf node).
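
A rough sketch of what the receiver's chunk-by-chunk check could look like
(the framing here, where each chunk arrives together with the claimed hash of
everything after it, is assumed for illustration rather than being my actual
code):

    import { createHash } from "crypto";

    function sha256(...parts: Buffer[]): Buffer {
      const h = createHash("sha256");
      for (const p of parts) h.update(p);
      return h.digest();
    }

    // chunks[i] is the i-th piece of the file as it arrives; linkHashes[i] is
    // the sender's claimed hash of everything after chunk i (absent for the
    // last chunk).
    function verifyIncrementally(rootHash: Buffer, chunks: Buffer[], linkHashes: Buffer[]): boolean {
      let expected = rootHash;
      for (let i = 0; i < chunks.length; i++) {
        const isLast = i === chunks.length - 1;
        const node = isLast
          ? sha256(chunks[i])                     // the final node is just a leaf
          : sha256(chunks[i], linkHashes[i]);     // H(chunk || hash-of-remainder)
        if (!node.equals(expected)) return false; // reject as soon as one chunk is bad
        if (!isLast) expected = linkHashes[i];
      }
      return true;
    }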

If you make the browser cache content addressable (as the post suggests) then
you need to do more than look at the caching header. Consider a site with
Content Security Policy enabled: If an attacker found an XSS, they could
inject <script src="http://correct.origin.com/foo.js"
digest="degen-hash:sha256:123abc...">. That would match the CSP origin
restrictions without
the server ever having to serve foo.js. (Thanks to abarth for pointing that
out.)

So I believe content addressable caches would need a separate hash:// URL
scheme for this. But I didn't attempt to make the cache content addressable in
my code.

I don't know what to do about img srcsets[1]. There's no obviously good place
to put the hashes in the HTML.

And lastly, and possibly fatally for the "secure HTTP resources in HTTPS
pages" use case, many networks will now transcode images on the fly (and
possibly other files too). So matching a known hash will result in broken
pages for users on those networks.

[1] http://www.whatwg.org/specs/web-apps/current-work/multipage/embedded-content-1.html#attr-img-srcset

~~~
yorhel
> Firstly you need a way to incrementally verify a file

Following the proposal of the article, you don't really need that. If you
don't have the hash in your local cache, you should assume that the link
provided by the site is correct and proceed as you normally would without the
hash present. Once the file is downloaded, you verify its contents against
the hash and add the file to the cache if the verification succeeds.

If you already had the hash in the local cache, then you already have the
entire file and there's no need for incremental verification.
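
Something like this, in other words (a sketch only; the cache object and the
hash plumbing are made up for illustration):

    const cache = new Map<string, ArrayBuffer>();   // sha256 (hex) -> verified contents

    async function sha256Hex(buf: ArrayBuffer): Promise<string> {
      const d = new Uint8Array(await crypto.subtle.digest("SHA-256", buf));
      return Array.from(d).map((b) => b.toString(16).padStart(2, "0")).join("");
    }

    async function loadResource(url: string, hashFromTag: string | null): Promise<ArrayBuffer> {
      if (hashFromTag && cache.has(hashFromTag)) {
        return cache.get(hashFromTag)!;             // hit: it was verified when it was stored
      }
      const body = await (await fetch(url)).arrayBuffer();  // miss: trust the link, download as usual
      if (hashFromTag && (await sha256Hex(body)) === hashFromTag) {
        cache.set(hashFromTag, body);               // only enters the cache if verification succeeds
      }
      return body;                                  // the page gets the download either way
    }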

As I understand it, the point of the hash is purely for caching purposes, not
strictly for validating that the downloaded file matches it.

~~~
alexchamberlain
> As I understand it, the point of the hash is purely for caching purposes,
> not strictly for validating that the downloaded file matches it.

Validation is a very useful consequence, though: it prevents malicious
content.

------
__alexs
Anything like this is going to need a mandatory Cache-control header that
indicates whether the content can be shared cross-domain. Otherwise you can
just probe the cache for arbitrary resources, similar to stuff like the
a:visited browser history attack.
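
To make the probe concrete, a hostile page could do something like this (the
attribute name, the URLs and the timing threshold are all made up):

    // Hypothetical probe: evil.example guesses that its visitor has
    // victim.example's private-dashboard.js in the shared cache, then times the load.
    async function probablyInCache(url: string, guessedHash: string): Promise<boolean> {
      const el = document.createElement("script");
      el.src = url;
      el.setAttribute("hash", guessedHash);   // stand-in for whatever attribute carries the hash
      const t0 = performance.now();
      await new Promise<void>((resolve) => {
        el.onload = () => resolve();
        el.onerror = () => resolve();
        document.head.appendChild(el);
      });
      return performance.now() - t0 < 5;      // far below a network round trip => almost certainly cached
    }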

Or have we just given up trying to deal with all the privacy leaking side-
channels in browsers these days?

Another way to do this would be to allow multiple alternative src urls for a
resource. Then the browser could pick the best (for as yet undetermined values
of best) source for the resource.

This has the added bonus of looking mostly like a normal <link> or <script>
tag and doesn't introduce yet another way to probe across domains.

~~~
SoftwareMaven
Only links with a hash initially get put in the addressable cache. So if I
download foo.js without a hash in the script tag, it is effectively a
different document than foo.js referenced with a hash. The cache address
would include the file hash, if you will.

This would effectively mean the person serving the files gets to determine
whether you are susceptible to this. If I serve you foo.js and don't want
others to know you've seen it, I just make sure not to include a hash.

Although, overloading the hash this way sounds wrong. I'd rather see a
shared_cache="public|domain|none" attribute. I still like the consuming HTML
being responsible for deciding that, though.
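
Roughly the rule I'm picturing, with made-up names (both the attribute values
and the key format are just illustrative):

    type SharedCache = "public" | "domain" | "none";

    // A hash-less resource never enters the shared cache; with a hash, the
    // shared_cache policy decides how widely a cached copy may be matched.
    function cacheKey(origin: string, sha256: string | null, policy: SharedCache): string | null {
      if (!sha256 || policy === "none") return null;          // ordinary, private caching only
      if (policy === "domain") return `${origin}|${sha256}`;  // same content, scoped to this origin
      return sha256;                                          // "public": shared across domains
    }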

~~~
ZoFreX
I think it would actually make more sense for the sha256 to be set in the
response headers. It seems a bit weird for a property on a link tag pointing
at a file to determine what we think the hash of that file is (not to mention
quite un-fun to maintain).

Of course, that means without duplicating it you would still need a request
similar to If-Modified-Since - something like If-Hash-Differs: <sha256 hash>,
but still better than downloading the whole file.
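
Hypothetically, the exchange could look something like this (the header names
and the reuse of 304 are made up):

    GET /js/jquery-1.7.2.min.js HTTP/1.1
    Host: cdn.example.com
    If-Hash-Differs: sha256=ab12...ef90

    HTTP/1.1 304 Not Modified
    Content-SHA256: sha256=ab12...ef90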

I've thought about this exact problem before and one nice way to solve it
would be through some kind of signing process - so you instead tell the
browser "I need com.jquery.1.7.2" and there would be some way of verifying
that a given instance of that file was legitimate, and could thus be cached
across multiple URIs.

~~~
SoftwareMaven
I agree the thing serving foo.js should be setting its sha hash in the
headers. The client can validate that the header hash and the actual contents
match, if it so desires.

On the web application side, by putting a "sha" attribute in the script tag,
the app is saying, "Fail if foo.js isn't the one I expect, regardless of
where it comes from." By putting a "shared_cache" attribute in the script
tag, the app is telling the client where it is allowed to look for, and to
put, foo.js.

Finally, a client should be able to ignore any shared cache requests and make
everything private at the user's request.

------
sehrope
I'm apprehensive about the idea of using a cached version of X for Y based on
the hash value (it could be secure, but assuming eventual collisions in
SHA-256 it wouldn't be hard to set up a poisoned copy of a library that would
live on for a long time).

An alternative use of this would be to validate the script that is being
downloaded from the remote source. If you're delegating to Google or any
other CDN for a javascript library, you really expect it to be the one you
requested. Assuming you're not pointing at the HEAD/latest version rather
than a specific one (which would already be pretty amateur for a real site),
you expect it to always be exactly the same (ex: jquery-1.8.1.min.js should
always be the same file ... if a bug is fixed it's a new version). It'd be a
security improvement to have a way for the browser to verify that.

<script src="https://some-cdn.example.com/some-library-1.2.3.js"
hash="[secure hash of contents]">

If the downloaded file's hash does not match the expected one then the browser
would flag a security warning or simply not load the file.
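
A sketch of the check the browser (or a loader shim) could perform, assuming
the hash attribute carries a hex-encoded SHA-256 (the attribute name and
encoding are assumptions):

    async function loadVerifiedScript(src: string, expectedSha256Hex: string): Promise<void> {
      const body = await (await fetch(src)).arrayBuffer();
      const digest = new Uint8Array(await crypto.subtle.digest("SHA-256", body));
      const actualHex = Array.from(digest).map((b) => b.toString(16).padStart(2, "0")).join("");
      if (actualHex !== expectedSha256Hex) {
        throw new Error("hash mismatch for " + src);  // flag a security warning / refuse to run it
      }
      const el = document.createElement("script");
      el.textContent = new TextDecoder().decode(body); // run only the verified bytes
      document.head.appendChild(el);
    }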

~~~
alexchamberlain
Great idea! Could the browser include the libraries and authenticate them in
some way?

~~~
sehrope
That's kind of the idea. The hash would be there so the browser could verify
it is what is expected. Assuming the source page is delivered via HTTPS (w/
compression disabled :D) then the client could verify the hash against what it
received to confirm it's valid.

------
patio11
This trivially exposes browser history to any site on the Internet. That's
probably a non-starter. (Edit: I see alex beat me to this, below. He's
detailed the attack, too.)

------
Jailbird
Would this allow some kind of mine-by-SHA cross-site attacks? (sidenote: I've
not thought this through, this is just an off-the-cuff thought.)

~~~
alexchamberlain
If you could find a malicious file that had the same hash as the current
jQuery before the next one was released, then yes. I don't think this is
possible, but I'm no expert!

~~~
RyanMcGreal
"If you fear SHA-256 collisions but do not keep with you a loaded shotgun at
all times, then you are getting your priorities wrong."
<http://stackoverflow.com/a/4681221>

~~~
jefffoster
I think that quote is appropriate if you fear collisions within your own data
set. The worry with a cache scheme like this is that people would be actively
seeking to find collisions.

~~~
alexchamberlain
That answer clearly explains why you can't, using current techniques and
hardware.

~~~
Dylan16807
> current techniques

We're talking about a long-term standard.

------
gojomo
Having URI types that are content-centric rather than location-centric has
been typical of P2P networks for over a decade... but web browser support has
been spotty. (It's sometimes been faked via a localhost HTTP proxy.)

A new chance for this to change may come soon, alongside the forthcoming
browser-to-browser connectivity standards under the heading "WebRTC". Then you
won't just have to hope a matching file might be in your local cache... you
can cooperate with other nearby nodes to find reliable replicas (like the P2P
networks do), triggered by a content-hash-specifying URI scheme.

------
asdfaoeu
This could be used for static assets on HTTPS pages: the main page specifies
the sha for the various resources, so you can avoid bothering with encryption
and let http proxies cache them.

------
sp332
If you want to use content-addressable resources, why not use a content-
addressable network, like Kademlia?

------
7952
This is just another addressing system for the Internet. At some level this
has to be about one URL for one resource. The new indexing system just tells
you which URL to use. At that point why not just use normal URLs, and
implement everything else in Javascript and ETags?

------
WimLeers
Assuming this is safe and works well, an interesting side-effect is that the
need/recommendation of aggregating CSS files would be at least partially
obviated.

------
alexchamberlain
Are there any security experts who could confirm whether, given current
estimates and knowledge, a collision attack by a botnet would be likely to
succeed?

------
finnw
Why all the worry about collisions? They will be rare enough that they could
be worked around with a hard-coded blacklist in the browser.

------
yason
Something like a lightweight low-latency bittorrent for browsers would be the
ultimate caching solution.

------
pornel
I proposed that to the W3C HTML WG a few months ago:

http://lists.w3.org/Archives/Public/public-html/2012Jun/0094.html

