As developers, we try every day to squeeze out every last byte and optimize things. We all know how important performance is.
So why download the same assets for every website: React, jQuery, libraries, CSS utils, you name it? What a waste!
I'm not really sure why this isn't already in place.
edit: The reason I'm not sure is that multiple threads on this post all seem to be suggesting basically the same simple idea: the ability to serve a file by its hash (either by its hash alone, or by a url + the hash). Personally, whatever form these urls take, I think they ought to be backward compatible, which I think is possible.
Everything we need is already in place, except for a tweak in the caching strategy of the browsers. With Subresource Integrity you provide a cryptographic hash for the file you include, e.g.
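Something like this, where the integrity value is the base64-encoded SHA-384 digest of the file (the value and URL below are placeholders, and crossorigin is needed for cross-origin requests):

    <script src="https://cdn.example.com/library.min.js"
            integrity="sha384-BASE64_DIGEST_OF_THE_FILE"
            crossorigin="anonymous"></script>

You can generate the digest with e.g. `openssl dgst -sha384 -binary library.min.js | openssl base64 -A`.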
And lots of optimizations to minimize cache eviction and handle privacy concerns, but those are different questions.
Imagine if some CDN worked together with NPM, so every package in NPM was already present in the CDN.
I think that using the integrity attribute is great because, if it happens, it's going to have to work through a lot of the tricky implementation details (e.g. things like origin laundering) of moving to an internet of content-by-hash rather than content-by-location.
However, beyond just having an integrity attribute added to HTML, I am interested in how we encode an immutable url, the content hash of what it points to, and any other required attributes into a `canonical hash-url` that is backward compatible with all current browsers and devices, and which browsers can use in the future to locate an item by hash and/or by location.
The driving reason for this encoding is to make sharing of links to resources more resilient and backwards compatible. Eventually browsers could parse apart the `canonical hash-url`s and use their own stores for serving the data, but not until the issues listed in the SRI addressable caching document you linked (and likely other as-yet-unthought-of ones) are worked through.
One thing that I've found incredibly disappointing about SRI is that it requires CORS. There's some more information here: https://github.com/w3c/webappsec/issues/418 but it essentially means that you can't SRI-pin content on a sketchy/untrustworthy CDN without them putting in work to enable CORS (which, if they're sketchy and untrustworthy, they probably won't do).
And unfortunately, there are too many hosts that make the attack you mention credible:
To be completely honest: Only reach out if you have solutions for any of the problems or can reduce what you want down to something that is solvable with these problems in mind.
If your solution does not live on the web, you'll have a hard time finding allies in the standards bodies that work on the web :)
You'll have a hard time convincing spec editors and browser vendors already. The working group mailing list is https://lists.w3.org/Archives/Public/public-webappsec/
If you have minimal edits to the spec, we can take them straight to GitHub. The SRI spec contains a link to the repo.
And I also have the feeling that surely something like this already exists and I just don't know about it ^^
Linking things by hash, besides being so very ugly, consider this case: you have an asset that changes every minute and is several MB. If a user uses your site too much, you will flush out all other cached stuff with old versions of that file that will never get referenced again. That just strikes me as extremely wasteful: you get a short-term boost but even worse performance overall. If other sites do it too much, your own stuff won't even be cached when visitors come back.
Mozilla already implements it: https://hacks.mozilla.org/2015/09/subresource-integrity-in-f...
Maybe you're right. But still, today we need very complex build tools and silly server hacks like setting expires headers 10 years in the future.
I wish instead it was as simple as deploying my 3kb app.js and telling the browser: "Hello there, here's my manifest with all the dependencies I need to run my app. Thank you.".
That sounds like IPFS.
And that's exactly what Nix and Guix do. Elegant solutions to age old problems.
The problem is that www.victim.com/evil.js doesn't exist, and never did, but your browser won't know that if it's in the cache -- this gives you a way of faking files existing on other servers at the URL of your choice, and as long as they're in the cache you'll get away with it.
What sort of failure modes are there when something inevitably goes wrong, either through malice, incompetence or sheer plain accident, and you end up serving different content from the same hash?
If you're drawing at random from a pool of size k you need approximately sqrt(k) draws until you reach a ~50% chance of a collision.
With 256 bits, there are 2^256 possibilities, so following the rule-of-thumb you'd need 2^128 draws until you had a 50% chance of a collision.
2^128 > # of atoms in the universe.
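For the curious, the rule of thumb comes from the birthday-problem approximation for n draws from k equally likely values:

    P(collision) ≈ 1 - e^(-n² / 2k)

Plugging in n = 2^128 and k = 2^256 gives 1 - e^(-1/2) ≈ 0.39; to reach exactly 50% you need n ≈ 1.18·√k, which is why "approximately √k draws" is the usual shorthand.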
If you adjust your risk tolerance you'll have different numbers come out, but the chance of a collision in any realistic scenario is negligible.
The probability of collision is still negligible.
The Subresource Integrity spec for strong content hashes could improve cache hits by allowing a different URL to be used if the hash matches but everyone wants to avoid that turning into a massive security / privacy hit — see e.g. https://hillbrad.github.io/sri-addressable-caching/sri-addre...
Presumably there's a reason that you'd _want_ them to use 1.12.4 over 1.0.0 (or you'd want to at least check that they didn't break anything that you rely on in 1.13 when that comes out)?
(That's not an exaggeration: I've measured uncached latency for DNS + connection for a CDN host in the high end of that range, especially over cellular connections or outside of the U.S.)
International connectivity out of Korea is pretty congested, especially at peak hours. Tokyo is 30-40ms away in the morning but randomly jumps to 150ms+ in the evening. It can take more than 1 second for the DNS lookup, TCP and SSL handshakes, let alone the actual transfer. So unless the CDN in question has a physical presence in Korea, it is almost always faster to load assets directly from the web server.
I suspect that many regions outside of US/EU are in a similar situation. Using a POP in another country 2000km away does jack shit for local websites, and only harms companies that fall for aggressive CDN marketing.
You'll also miss other features: version range alone could save tons of bandwidth.
but centralizing in the browser is?
Packages inevitably create a tangle of complex problems that the page/document model avoids. That is one of the main reasons web apps replaced normal apps in so many domains. Not to mention that a package system will give a huge competitive advantage to already popular scripts, which will inevitably lead to technological stagnation. And it will remove the biggest disincentive to avoid bloat.
If someone loads a bad JS library, it doesn't matter where it came from.
The reason is simple. Websites have lots of static content that seldom changes. But you don't know in advance when it is going to change. However after the fact you know that it did. So you either set long expiry times and deal with weird behavior and obscure bugs after a website update, or set short ones and generate extra load and slowness for yourself.
Instead I'd like the response to the main request to indicate that it does not want static resources older than a certain age. That header can be set on the server to the time of your last code release, and a wide variety of problems involving new code and stale JS, CSS, etc. go away.
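A rough sketch of what that might look like (the header name here is made up; nothing like it exists today): the server stamps the main HTML response with the time of the last release, and the browser treats any cached subresource fetched before that time as stale:

    HTTP/1.1 200 OK
    Content-Type: text/html
    Assets-Fresh-After: Tue, 17 Jan 2017 08:00:00 GMT

Everything with a long max-age keeps working as before; only the "is my cached copy older than the last deploy?" check is added.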
But ever since 3.1 (Aug 2011) Rails defaulted to content addressed assets.
These are big complex workarounds for a problem that is easily solved at another layer.
If it was truly custom, you could either do cached on-demand generation or find some other portion of the cache key which reliably changes on updates – e.g. I've seen people use things like Git commit ID + feature flags.
This already exists: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/If...
Also, it doesn't cover all the cases properly, because the page with versioned URLs can be cached as well.
If even the page with the versioned URL is cached, what good would a "Flush-Before" header do you?
If I have asset.v0.js, asset.v1.js, asset.v2.js then I am exposing some useful information in the sequence of filenames.
It is also a bit harder to keep track of versions across systems. With hashes, you don't need to know what the last version number was - just what the hash of the current file is. If you have multiple build machines building multiple branches, keeping versions in line is a thing unto itself. I've had problems with versioning in the past where careless merges caused issues - that doesn't happen with hashes.
The main drawback with hashed asset names is it can be a bit more tricky to determine what assets are currently in use and which can be safely deleted.
If you don't like query strings, you can throw in some rewrite rules to make it part of the virtual filename, e.g. asset.deadbeef.js.
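For example, one possible nginx rule (a sketch; it assumes the hash sits between the base name and the extension, and that the file on disk doesn't actually contain the hash):

    # serve /asset.deadbeef.js from /asset.js on disk
    location ~* (.+)\.(?:\w+)\.(js|css)$ {
        try_files $uri $1.$2;
    }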
For our build, with a large volume of assets and some very large sizes, it is more efficient to write the hash into the filename to save on time syncing to the CDN origin. We also need to maintain the ability to do quick rollbacks, so unique filenames per version help. That will not apply in all cases.
Other than those points, I think we agree? An automated hash based on some unique portion of the file is better than a sequential version.
See also: http://jacquesmattheij.com/301-redirects-a-dangerous-one-way... and the related HN thread with good discussion
Fingerprinted assets with max expiration has been standard practice for years in many frameworks.
However, even fingerprinted assets with max expiration are revalidated when the user hits the refresh button, since historically this is what refresh has meant. The `immutable` header just allows skipping that validation.
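For reference, the extension is just an extra directive on the existing header (the one-year max-age is only an example value):

    Cache-Control: max-age=31536000, immutable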
It seems much more mundane in this Cache-Control case, but it is the same vector nevertheless.
I agree with you that HTTP headers feel like a hack for most use cases. It would be nice to embrace the best practice of content-based addressing as an official standard rather than as an ad hoc hack on top of query parameters (or whatever approach is taken).
The url itself ought to convey:
1. if it's immutable
2. if possible, provide a hash-based address of its content, in addition to a particular url that provides that content (that url should eventually come to be viewed as a fallback location; the content should primarily come to be obtained by its hash). A hypothetical sketch follows.
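One hypothetical encoding of such a URL (the fragment syntax here is invented purely for illustration): everything after the # is ignored by servers and old browsers, so they just fetch the fallback location, while a hash-aware browser could parse it and satisfy the request from a content-addressed store:

    https://cdn.example.com/libs/foo-1.2.3.min.js#immutable&sha384=BASE64_DIGEST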
I agree, and I don't feel like relying on HTTP headers is the right way to convey that a particular url is in fact immutable. It feels like that generally belongs with the url itself, not in the response to it (for static resources at least).
Think about programming: if a function was immutable, or pure, we'd mark it as such in its type signature (e.g. @pure) where we have language support, not as an additional thing in the basket of crap we got back after we executed it.
For dynamic restful API type services (an image resize service was an example), the url still ought to be marked immutable in some fashion, but the actual hash won't be known until the function is executed once. This seems like a function memoization type problem so in that case the content hash will have to be passed along with the result.
Anyhow, it would be really nice to:
1. Have these features (@pure &| content-hash = abc123) in urls in a way that is usable today and backwards compatible, e.g. older devices simply fall back to the fallback address.
2. Have them standardized and universally intelligible, not ad hoc like today's approaches to content-based addressing, so that eventually we can get first-class support for loading resources by their hashes from the browsers themselves.
> pure, we'd mark it as such in its type signature
Actually, MySQL has something similar. Stored Functions can be marked as [NOT] DETERMINISTIC, to indicate that they will always return the same value given the same input.
But I agree that you are correct to issue this warning. It is something you need to understand before you accidentally slather that on all your content.
You could have a fast lane direct from the (http) server, then a slow P2P lane over Tor, both using the same content addressed protocol, and caching the results locally for instant reloads.
However, what if you want to update or invalidate a cached response? For example, suppose you've told your visitors to cache a CSS stylesheet for up to 24 hours (max-age=86400), but your designer has just committed an update that you'd like to make available to all users. How do you notify all the visitors who have what is now a "stale" cached copy of your CSS to update their caches? You can't, at least not without changing the URL of the resource.
After the browser caches the response, the cached version is used until it's no longer fresh, as determined by max-age or expires, or until it is evicted from cache for some other reason— for example, the user clearing their browser cache. As a result, different users might end up using different versions of the file when the page is constructed: users who just fetched the resource use the new version, while users who cached an earlier (but still valid) copy use an older version of its response.
There are a few ways to improve things:
1. Predictive modelling of user resource demand in the browser (e.g. preloading data). Very easy to do nowadays with great accuracy.
2. Better cache control / eviction algorithms to keep the cache ultra hot.
3. (This.) Immutable caching is one of the major ways we could improve things. I'm not a fan of the parent article's way of doing it, though, because if widely implemented it will break the web in subtle ways, especially for small companies and people that don't have Facebook's resources and engineering talent. It doesn't take into account usability issues and therefore leaves too much room for user error.
I've written up a very simple 11 line spec here that addresses this issue. https://gist.github.com/ericbets/5c1569856c2ad050771ec0c866f...
I'll throw out a challenge to HN. If someone here knows chromium internals well enough to expose the cache and assetloader to electron, within a few months I'll release an open source ML powered browser that speeds up their web browsing experience by something between 10X-100X. Because I feel like this should have been part of the web 10 years ago.
I'd rather see stale-while-revalidate implemented in browsers.
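For reference, that's the RFC 5861 extension (the values below are just an example): a cache may serve the stale copy immediately while it revalidates in the background.

    Cache-Control: max-age=600, stale-while-revalidate=86400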
Many websites already do that: they change the URL each time the content changes.
The former would allow a browser to apply some heuristics and, for example, evict something from the cache when it hasn't been referenced on a page for the last 5 times you went there (the assumption being that it was changed and the new resource is taking over in its place).
That can lead to things being held in the cache for longer without having to use a straight "oldest evicted first" kind of method which can lead to thrashing if the cache size is too small or some resources are too big.
A browser can already apply different eviction policies to things with very long `expires`; I don't know if any do.
I'm having trouble seeing what this adds practically, too. I guess it's more of a clear semantic intent to say 'immutable' when you need that, instead of 'expires a long time from now'. Practically though?
Only if you have a system that you keep for more than 10 years. If a browser dumps stuff that it was told it can keep for a decade that means it can also dump stuff that it was told will never change. If the cache is full of things that are immutable something still has to go.
If stuff needs to go, it needs to go. It's going to happen. But if you have to choose between a file which was told to be cached for 10 years, and one which was labeled "immutable" but hasn't been referenced on the page it was previously for the last 10 loads, it might be a better choice to evict the immutable one.
Generally browser caches evict on a least-recently-used basis. So if you start stuffing the cache with a bunch of stuff you can already push the oldest out. And if you fill the whole thing you'll remove everything else.
It was actually quite a problem on mobile browsers for a while. IIRC iOS had something like a 10MB cache for the longest time. A few heavy websites and your whole cache would cycle through. Android had a 5MB cache at one point as well! A shitty news site can already fill that single handedly.
But my idea here (which I gave about 5 minutes of thought...) was that immutable content would be evicted before the "10 year" stuff. The idea being that if something replaced an immutable "thing", it could be evicted much sooner.
I'm not sure if it would really help at all, just kind of throwing out some ideas.
"we've noticed that despite our nearly infinite expiration dates we see 10-20% of requests (depending on browser) for static resource being conditional revalidation. We believe this happens because UAs perform revalidation of requests if a user refreshes the page"
It sounds like immutable is being introduced so that Firefox can leave the current behavior in place, and only instantiate the new behavior for "immutable" resources.
This seems different than the chrome approach, where they plan on changing the behavior for all "unexpired" resources.
A comment from Facebook on the Chrome bug says they started logging 'window.performance.navigation.type', which distinguishes between normal navigation, reload, history traversal (back/forward), and undefined.
(Also, "304" is an HTTP response code -- "Not Modified" -- not something the browser sends.)
The problem people are trying to solve here is that users hit the reload button. A lot. And if you hit the reload button, that will send If-Modified-Since for everything on the page, no matter what Expires says, because the user intent is to ignore Expires headers.
That's what the immutable thing is about: indicating that even in the reload scenario the Expires is authoritative and no If-Modified-Since request should be sent.
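Roughly, today a reload turns every subresource into a conditional round trip like this (names and values here are just illustrative), even when the cached copy is perfectly good:

    GET /static/app.deadbeef.js HTTP/1.1
    If-Modified-Since: Tue, 17 Jan 2017 08:00:00 GMT

    HTTP/1.1 304 Not Modified

With `immutable` the browser can skip that round trip entirely and reuse the cached copy.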
Use case: logging into Facebook; previously, browsers reloaded all resources.
Presumably, only the main (HTML) page should be refreshed and other linked items (CSS/JS/media) with high max-age (and no 'must-revalidate' cache control header) should use the previously cached content.
In a perfect world, we'd just be able to use max-age and immutable would be redundant.
Two wholly different strategies, which has ultimately split how the browsers handle caching.
> Clients should ignore immutable for resources that are not part of a secure context
1. Use HTTPS Everywhere.
2. Never use an untrusted WiFi network without a VPN.
I just hope the draft as-is expires and never makes it to an RFC.
The Firefox bug mentions hard refresh, but doesn't say what was implemented:
To summarize, the issue is that Facebook was seeing a higher rate of cache validation requests than they'd expect, and looked into it. Chrome produced an updated chart documenting different refresh behaviors, which is the spiritual successor of this now-outdated stackoverflow answer from 2010, and, in response to Facebook's requests, they have re-evaluated some of their refresh logic.
In this thread, Firefox was being asked to do the same, but they pushed back on adding yet another heuristic and in turn proposed a cache-control extension. Meanwhile, Facebook proposed the same thing on the IETF httpbis list, where the response was not enthusiastic, largely feeling that this is metadata about the content and not a prescriptive cache behavior, and that the HTTP spec already accounted for freshness with age. One of Mark Nottingham's responses:
(...) From time to time, we've had people ask for "Cache-Control: Infinity-I-really-will-never-change-this." I suspect that often they don't understand how caches work, and that assigning a one-year lifetime is more than adequate for this purpose, but nevertheless, we could define that so that it worked and gave you the semantics you want too.
To keep it backwards compatible, you'd need something like:
Cache-Control: max-age=31536000, static
(or whatever we call it)
With subresource integrity hashes, you don't have to encrypt public content. Less time wasted in TLS handshakes.
This increases the democratization of the web and allows small fries to have a disproportionately larger footprint.
So is this like Rails' Turbolinks but built into the browser?
How often do people actually do a hard refresh? I would guess most people don't know about that key combination.
So an immutable resource wouldn't even need that request.