Using Immutable Caching to Speed Up the Web (hacks.mozilla.org)
397 points by discreditable on Jan 26, 2017 | 160 comments



Maybe it's time for browsers to go beyond the cache concept and implement a common standard package manager. Download once, stay forever. Truly immutable.

As developers, we try every day to squeeze out every last byte and optimize things. We all know how important performance is.

So why download the same assets for every website: React, jQuery, libraries, CSS utils, you name it? What a waste!


We don't need anything as complex as a package manager. It would be much easier to just link to these libraries (and any other resource) by the hash of their content.

I'm not really sure why this isn't already in place.

edit: The reason I'm not sure is that multiple threads on this post all seem to be suggesting basically the same simple idea: the ability to serve a file by its hash (either by its hash alone, or by a URL plus the hash). Personally, whatever form these URLs take, I think it ought to be backward compatible, which I believe is possible.


> I'm not really sure why this isn't already in place.

Everything we need is already in place, except for a tweak in the caching strategy of the browsers[1]. With Subresource Integrity [2] you provide a cryptographic hash for the file you include, e.g.

  <script src="https://example.com/example-framework.js"
          integrity="sha384-oqVuAfXRKap7fdgcCY5uykM6+R9GqQ8K/uxy9rx7HNQlGYl1kPzQho1wx4JwY8wC"
          crossorigin="anonymous"></script>
As it is, browsers first download the file and then verify it. But you could also switch this around and build a content-addressable cache in the browser, where it retrieves files by their hash and only issues a request to the network as a fallback, should the file not already be in the cache. Combine this with a CDN which also serves its files via https://domain.com/$hash.js [3] and you have everything you need for a pretty nice browserify alternative, without any new web standardization necessary.
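Roughly, the include against such a hash-addressed CDN might look like this (cdn.example.com and its URL scheme are made up for illustration; the browser would check its content-addressable cache for the hash before ever touching the network):

  <script src="https://cdn.example.com/sha384-<hash>.js"
          integrity="sha384-<hash>"
          crossorigin="anonymous"></script>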

[1] And lots of optimization to minimize cache eviction and handle privacy concerns, but those are different questions.

[2] https://developer.mozilla.org/en-US/docs/Web/Security/Subres...

[3] Imagine if some CDN would work together with NPM, so every package in NPM would already be present in the CDN.


Folks in W3C webappsec are interested, but the cross-origin security problems are hard. We'd love feedback from developers as to what is still useful without breaking the web. Read this doc and reach out! https://hillbrad.github.io/sri-addressable-caching/sri-addre...


What's the best way to reach out?

I think using the integrity attribute is great, because making this happen will force working through a lot of the tricky implementation details (e.g. origin laundering) of moving to an internet of content addressed by hash rather than by location.

However, beyond just having an integrity attribute added to HTML, I am interested in how we encode an immutable URL, the content hash of what it points to, and any other required attributes into a `canonical hash-url` that is backward compatible with all current browsers and devices, and which browsers can use in the future to locate an item by hash and/or by location.

The driving reason for this encoding is to make sharing of links to resources more resilient and backwards compatible. Eventually browsers could parse apart the `canonical hash-url`s and use their own stores for serving the data, but not until the issues listed in the SRI addressable caching document you linked (and likely others not yet thought of) are worked through.


These problems are really hairy. Thankfully, all the privacy issues are only one-bit leakages (and there are TONS of one-bit leakages in web browsers), but the CSP bypass with SRI attack is really cool.

One thing that I've found incredibly disappointing about SRI is that it requires CORS. There's some more information here: https://github.com/w3c/webappsec/issues/418 but it essentially means that you can't SRI-pin content on a sketchy/untrustworthy CDN without them putting in work to enable CORS (which, if they're sketchy and untrustworthy, they probably won't do).

The attack that the authors lay out for SRI requiring CORS is legitimate, but incredibly silly - a site could use SRI as an oracle to check the hash value of cross-domain content. You could theoretically use this to brute force secrets on pages, but this is kind of silly because SRI only works with CSS and JavaScript anyway.


I, as someone who worked on the SRI spec, find this incredibly disappointing as well. We tried to reduce this to "must be publicly cacheable", but attacks have proven us wrong.

And unfortunately, there are too many hosts that make the attack you mention credible rather than silly:

It is not uncommon that the JavaScript served by home routers contains dynamically inserted credentials. And the JSON response from your API is valid JavaScript.


Addendum

To be completely honest: Only reach out if you have solutions for any of the problems or can reduce what you want down to something that is solvable with these problems in mind.

If your solution does not live on the web, you'll have a hard time finding allies in the standards bodies that work on the web :)

You'll have a hard time convincing spec editors and browser vendors already. The working group mailing list is https://lists.w3.org/Archives/Public/public-webappsec/

If you have minimal edits to the spec, we can take it straight to GitHub. The SRI spec contains a link to the repo.


Well, HTTP does have ETag ( https://en.wikipedia.org/wiki/HTTP_ETag ), but of course that would still require a request sending the known etag, to either get a 304 Not Modified or the content. So how about a way to put the etags of assets into the header of the document loading them, or something? Then the browser can decide if it wants to make that request.
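Very roughly, something like this on the HTML response (the Asset-ETags name is made up; nothing implements it):

  HTTP/1.1 200 OK
  Content-Type: text/html
  Asset-ETags: /js/app.js="33a64df5", /css/main.css="09f2b1c4"

The browser could then skip the conditional request for any asset whose cached etag already matches.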

And I also have the feeling that surely something just like this exists and I just don't know about it ^^

Linking things by hash, other than being very ugly, consider this case: you have an asset that changes every minute and is several MB. If a user uses your site too much, you will flush out all other cached stuff with old versions of that file that will never get referenced again. That strikes me as extremely wasteful: you get a short-term boost but even worse performance overall. If other sites do it too much, it will mean your own stuff will not even be cached when visitors come back.


It's called subresource integrity. The spec is at: https://www.w3.org/TR/SRI/

Mozilla already implements it: https://hacks.mozilla.org/2015/09/subresource-integrity-in-f...


Does this make the browser avoid the request if it already has a cached script with that hash? If so, that would indeed be exactly what I was hoping for, except it would need to be extended for all things, not just javascript. Anyway, thanks!


> We don't need to have anything as complex as package manager.

Maybe you're right. But still, today we need very complex build tools and silly server hacks like setting expires headers 10 years in the future.

I wish instead it was as simple as deploying my 3kb app.js and telling the browser: "Hello there, here's my manifest with all the dependencies I need to run my app. Thank you.".


> It would be much easier to just link to these libraries (and any other resource) by the hash of their content.

That sounds like IPFS, [1].

[1] https://ipfs.io/#how


I hope that one day all the browsers will implement IPFS for this!


> It would be much easier to just link to these libraries (and any other resource) by the hash of their content.

And that's exactly what Nix and Guix do. Elegant solutions to age old problems.


Linking by hash to all common resources would be bad. You do want fixes and updates, after all.


Links are already tied to some specific version. Nobody serves "jquery.js" that points to a constantly updated version of jquery. Nobody that's sane anyway.


Fair. My point was more that for many resources, you probably do appreciate this. Certainly package managers do this. Not sure why web would be different.


Powered by a content-addressable cache, where each object is identified by the hash of its content only: https://rwmj.wordpress.com/2013/09/09/half-baked-idea-conten...


In a browser's current usage model, this is vulnerable to a 'cache origin confusion attack'. See this thread [1]. It's a bit hard to follow, so perhaps see these posts [2][3], which state the problem succinctly. Let me adapt the text from [3]:

The problem is that www.victim.com/evil.js doesn't exist, and never did, but your browser won't know that if it's in the cache -- this gives you a way of faking files existing on other servers at the URL of your choice, and as long as they're in the cache you'll get away with it.

[1] https://news.ycombinator.com/item?id=10310594 [2] https://news.ycombinator.com/item?id=10311555 [3] https://news.ycombinator.com/item?id=10312333


This issue is discussed here, with what I think is a workable solution: https://github.com/w3c/webappsec-subresource-integrity/issue...


I think the best way to fully quell the objections involves subresource integrity, but I think the ideal way to deal with it would be to add a new attribute, sameas="https://canonical jquery link" or something along those lines. Then if the browser already has the canonical jQuery link AND the subresource integrity checks out, it'll use that one. Otherwise it will load from wherever the developer specified and not be stored in a shared cache associated with the canonical jQuery URL in any way. Browsers can track references to sameas URLs that they keep getting cache misses on and go fetch them in the background for future use, possibly even going back and deleting the duplicates they had from before. This would allow developers a much greater chance that their libraries load from cache, without adding more DNS lookups and dependencies on CDNs or servers they can't control.
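A rough sketch of the markup (sameas is the hypothetical attribute described above; the integrity hash is truncated):

  <script src="https://static.example.com/js/jquery-3.1.1.min.js"
          sameas="https://code.jquery.com/jquery-3.1.1.min.js"
          integrity="sha384-..."
          crossorigin="anonymous"></script>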


From that thread, "CSP already has a mechanism for hash-based whitelisting - if this is the only limitation, it'd be just as easy to allow cache-sharing whenever CSP is absent and/or the specific hash is explicitly white-listed." <- I totally understand the attack, but fail to see why it would be a problem if CSP actually does, in fact, have the ability to whitelist hashes (and looking at the spec, it does seem to, though I have not paid attention to CSP before and admit I could be misinterpreting the spec based on this prompt).


This is fixed by relying on CORS and using a key of (domain, hash) rather than just the hash.


But then you're still downloading jQuery (or whatever) again for every unique domain


Not if it points to a CDN.


This brings us back to the current status quo, as it defeats the purpose of normalizing arbitrary CDNs with the hash.


And then use BitTorrent to do collaborative transfer. If we are on the same LAN and I have a part of the file you are requesting, I'll feed it to you... decreasing network traffic by 90%.


Putting my devil's-advocate hat on:

What sort of failure modes are there when something inevitably goes wrong, either through malice, incompetence or sheer plain accident, and you end up serving different content from the same hash?


If you've discovered a way to generate hash collisions with a modern crypto hashing algorithm, there is more glory to be had than screwing with people's browser caches.


So let's skip the malice condition and go to the plain accident one; there are more chunks of content in the world than there are bits in any of the hashes we could use. We are eventually going to run into collisions.


Arbitrary example: If you're using SHA256 you're emitting 256-bit hashes. Hopefully the distribution of SHA256 outputs is indistinguishable from random (or close enough), or else we've got big problems in other domains.

If you're drawing at random from a pool of size k you need approximately sqrt(k) draws until you reach a ~50% chance of a collision[0].

With 256 bits, there are 2^256 possibilities, so following the rule-of-thumb you'd need 2^128 draws until you had a 50% chance of a collision.

2^128 ≈ 3.4 × 10^38 draws, far more than the number of resources that will ever be published.

If you adjust your risk tolerance you'll have different numbers come out, but the chance of a collision in any realistic scenario is negligible.

[0]: https://en.wikipedia.org/wiki/Birthday_problem


> there are more chunks of content in the world than there are bits in any of the hashes we could use

The probability of collision is still negligible.


Come on, you must know that hashes are astronomical, unpredictable and collision-free for any earthly purpose. If not, just add a bit.


Decentraleyes for Firefox helps with this issue, it's a script cache for CDN assets like jQuery: https://github.com/Synzvato/decentraleyes/wiki/Simple-Introd...


Shouldn't using common CDNs solve this problem?


Using CDNs lowers the hit rate because the average user's browser cache will have each library stored the number of CDNs times the number of versions in widespread use. A package manager could improve some of this by allowing you to say “Any version of jQuery between 1.0 and 2.0 is fine” when most sites don't depend on point releases.

The Subresource Integrity spec for strong content hashes could improve cache hits by allowing a different URL to be used if the hash matches but everyone wants to avoid that turning into a massive security / privacy hit — see e.g. https://hillbrad.github.io/sri-addressable-caching/sri-addre...


I understand the idea of supporting ranges of libraries from a development perspective, but when you're running something in production, is "use any of these configurations" ever desirable?

Presumably there's a reason that you'd _want_ them to use 1.12.4 over 1.0.0 (or you'd want to at least check that they didn't break anything that you rely on in 1.13 when that comes out)?


It's desirable if you're really focused on performance – if you're not a site like Google/Facebook which people access all the time, it's safe to assume that none of your resources are cached and so you might find n milliseconds of load time a healthy gain (less so now that HTTP/2 is widespread). In many cases it's more likely that you'd know of a bug fixed in a new release – where you'd say libfoo >= 1.12.4 – than that 1.13 has a huge problem.


Using a CDN doesn't lower the hit rate over not using a CDN. I assume you mean using a CDN has a potentially lower hit rate than using a package manager within the browser.


One factor to consider: what's the cost of the extra DNS lookups and connection latency to a new CDN host vs. your main host, especially in the post-HTTP/2 era? If you had something like SRI where the browser might not even connect at all, that's a pure win but otherwise it's quite easy to find real pages where the extra 500-5000 milliseconds to connect to the CDN is greater than serving a modest-sized file.

(That's not an exaggeration: I've measured uncached latency for DNS + connection for a CDN host in the high end of that range, especially over cellular connections or outside of the U.S.)


South Korea here. Every time I find random asset-loading delays in my clients' websites, it's caused by one of the major CDNs such as Google web fonts and code.jquery.com.

International connectivity out of Korea is pretty congested, especially at peak hours. Tokyo is 30-40ms away in the morning but randomly jumps to 150ms+ in the evening. It can take more than 1 second for the DNS lookup, TCP and SSL handshakes, let alone the actual transfer. So unless the CDN in question has a physical presence in Korea, it is almost always faster to load assets directly from the web server.

I suspect that many regions outside of US/EU are in a similar situation. Using a POP in another country 2000km away does jack shit for local websites, and only harms companies that fall for aggressive CDN marketing.


Sure, if your target market is outside the area covered by your CDN's POPs then you're not going to get anything out of it. Although if you do get a CDN that's appropriately located it can be a huge win if you can avoid leaving certain country boundaries.


Fair point. It's true that hosting CDN assets on a separate domain at least requires more thought than it used to. Although there's also the fact that if you're using a common CDN the browser may have DNS cached. Plus it can still be a win if the CDN is well located with respect to your target audience and you have a fair number of assets.


“may have DNS cached” was less common than I thought – even using some fairly popular CDN providers, the RUM DNS latency outliers are surprisingly high even for clients in areas with CDN edge nodes.


I think they are more like a band-aid than a solution.


How is your proposal different to what's happening here? If two websites call a library from the same CDN that implements immutable, then isn't that just a distributed package manager with URL-based namespacing?


The problem is that every website would have to use the same CDN. Not likely to happen, and centralizing things is never good.

You'd also miss other features: version ranges alone could save tons of bandwidth.


>centralizing things is never good.

but centralizing in the browser is?


The fact that this architecturally clueless comment is at the top for an article about caching makes me weep for the future of the web.

Packages inevitably create a tangle of complex problems that the page/document model avoids. That is one of the main reasons web apps replaced normal apps in so many domains. Not to mention that a package system will give a huge competitive advantage to already popular scripts, which will inevitably lead to technological stagnation. And it will remove the biggest incentive to avoid bloat.


Yeah, you know it so well that I have to download hundreds of MBs to view a text-only site.


Why is that? The browser downloads only the assets you require.


Great. By compromising one package you can access their bank account...


Same thing as compromising one of these packages loaded from a CDN...

If someone loads a bad JS library, it doesn't matter where it came from.


I expect WebAssembly to handle this.


What I really want is the exact opposite. I'd like to see a flush-before header to have a particular web page NOT pull older static resources.

The reason is simple. Websites have lots of static content that seldom changes. But you don't know in advance when it is going to change. However after the fact you know that it did. So you either set long expiry times and deal with weird behavior and obscure bugs after a website update, or set short ones and generate extra load and slowness for yourself.

Instead I'd like the main document to carry a header saying that it does not want static resources older than a certain age. That header can be set on the server to the last time you did a code release, and a wide variety of problems involving new code and stale JS, CSS, etc. go away.
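As a rough sketch (the header name and date are made up for illustration; the server would stamp it with the time of the last release):

  HTTP/1.1 200 OK
  Content-Type: text/html
  Flush-Before: Tue, 24 Jan 2017 09:00:00 GMT

The browser would then revalidate any cached static resource it fetched before that date, and trust its cache as usual otherwise.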


Then you should do what Facebook does (https://code.facebook.com/posts/557147474482256/this-browser...): content-addressed resources. Have a resource be available not at /resources/something.js, but at /resources/<sha1(something.js)>; whenever there is a new version of something.js, the sha1 changes and is put in your root document, and the browser will never again try to reach the older version, which you know is invalid now.
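Concretely, the reference in the root document ends up looking something like this (path and hash shortened for illustration):

  <!-- previous release pointed at /resources/1e61b4eea245fba4.js -->
  <script src="/resources/a94a8fe5ccb19ba6.js"></script>

Each deploy rewrites the reference, so stale copies simply stop being requested.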


This is also the default in Rails with the asset pipeline since 4.0.


It's even older than that. Sprockets has always been content-addressed, but in early versions it used to leave non-digested aliases (by mistake, actually), and some people relied on them.

But ever since 3.1 (Aug 2011), Rails has defaulted to content-addressed assets.


Content addressing of data is one of my favorite ideas in the field.


And what if the resource in question is a dynamically generated minified thing? Do I have to dynamically generate it within the main html page to be able to figure out the sha?

These are big complex workarounds for a problem that is easily solved at another layer.


Typically these are shared resources and can be generated once and saved so your HTML page would simply be checking file metadata generated at the time the app was built.

If it was truly custom, you could either do cached on-demand generation or find some other portion of the cache key which reliably changes on updates – e.g. I've seen people use things like Git commit ID + feature flags.


But what happens when you need to cache the actual page not just static resources? That's what the content-addressed resources trick doesn't address.


You cache the actual page with a much shorter cache timeout.


Then just cache the actual page. The cached version refers to some possibly old versions of the static resources, so your server needs to be able to serve old versions for a period of time.


> that it does not want static resources older than a certain age.

This already exists: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/If...


You would need to use Service Workers to use that header the way OP wants it, since he wants to have the HTML page decide when the resources of that page need to become stale. Fingerprinting resources with the hash of the content works a lot better for that.


That sounds convenient, but using versioned URLs will already give you that. Except for when you bump the version without changing the resource I guess.


That is true in most cases, but versioned URLs are a hack that requires you to change the document's address and then change a bunch of other documents just to fix an issue with HTTP. It's like changing file contents and file names to fix something in the FTP protocol.

Also, it doesn't cover all the cases properly, because the page with versioned URLs can be cached as well.


>Also, it doesn't cover all the cases properly, because the page with versioned URLs can be cached as well.

If even the page with the versioned URL is cached, what good would a "Flush-Before" header do you?


Versioned URLs have a couple of drawbacks.

If I have asset.v0.js, asset.v1.js, asset.v2.js then I am exposing some useful information in the sequence of filenames.

It is also a bit harder to keep track of versions across systems. With hashes, you don't need to know what the last version number was - just what the hash of the current file is. If you have multiple build machines building multiple branches, keeping versions in line is a thing unto itself. I've had problems with versioning in the past where careless merges caused issues - that doesn't happen with hashes.

The main drawback with hashed asset names is it can be a bit more tricky to determine what assets are currently in use and which can be safely deleted.


I just take the mtime of each file, apply an HMAC using a secret key, and use the first few letters of the resulting hash. So the URLs end up looking like asset.js?v=deadbeef. It's fully automated and exposes no useful information. (The web server sending a Last-Modified header is another question.)

If you don't like query strings, you can throw in some rewrite rules to make it part of the virtual filename, e.g. asset.deadbeef.js.


It is my opinion that the content of the file is a more reliable source for the hash than the modified time of the file.

For our build, with a large volume of assets and some very large sizes, it is more efficient to write the hash into the filename to save on time syncing to the CDN origin. We also need to maintain the ability to do quick rollbacks, so unique filenames per version help. That will not apply in all cases.

Other than those points, I think we agree? An automated hash based on some unique portion of the file is better than a sequential version.


I'd be very wary of using any HTTP headers with permanent effects. They seem like a way to get easily burned by accident. For immutable caching in particular, I'd probably try to utilize some variation of content-based addressing, e.g. having the URL contain the hash of the content.

See also: http://jacquesmattheij.com/301-redirects-a-dangerous-one-way... and the related HN thread with good discussion


> I'd probably try to utilize some variation of content-based addressing, eg having the url have the hash of the content

Fingerprinted assets with max expiration has been standard practice for years in many frameworks.

However, even fingerprinted assets with max expiration are revalidated when the user hits the refresh button, since historically this is what refresh has meant. The `immutable` header just allows skipping that revalidation.
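For a fingerprinted asset, that ends up looking something like:

  Cache-Control: public, max-age=31536000, immutable

With only max-age, a refresh still triggers a conditional revalidation; with immutable, the browser can serve straight from cache even then.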


It's not permanent if you don't make it permanent - like the other cache-control headers, you can specify the amount of time that a resource is considered immutable. All this does is change the time period in which the browser avoids sending conditional revalidation requests.


This might be considered similar to the DoS vector that was also present in the HTTP Public Key Pinning proposal [1]; an adversary seizing temporary control over the server can convince clients to not request a response from the server for a certain period, or in the case of HPKP request a response that cannot possibly be fulfilled.

It seems much more mundane in this Cache-Control case, but it is the same vector nevertheless.

[1] https://blog.qualys.com/ssllabs/2016/09/06/is-http-public-ke...


I made some comments over in the "chrome version" of this thread: https://news.ycombinator.com/item?id=13492483

I agree with you that HTTP headers feel like a hack for most use cases. It would be nice to embrace the best practice of content-based addressing as an official standard rather than as an ad hoc hack on top of query parameters (or whatever approach is taken).

The url itself ought to convey:

1. if it's immutable

2. if possible, provide a hash-based address of its content, in addition to a particular URL that provides that content (that URL should eventually come to be viewed as a fallback location; the content should primarily be obtained by its hash).

I agree, and I don't feel like relying on HTTP headers is the right way to convey that a particular URL is in fact immutable. It feels like that generally belongs with the URL itself, not in the response to it (for static resources at least).

Think about programming: if a function were immutable, or pure, we'd mark it as such in its type signature (e.g. @pure) where we have language support, not as an additional thing in the basket of crap we got back after we executed it.

For dynamic restful API type services (an image resize service was an example), the url still ought to be marked immutable in some fashion, but the actual hash won't be known until the function is executed once. This seems like a function memoization type problem so in that case the content hash will have to be passed along with the result.

Anyhow, it would be really nice to:

1. Have these features (@pure &| content-hash = abc123) in URLs usable today so that they are backwards compatible, i.e. older devices of the world simply fall back to the fallback address.

2. Have them standardized and universally intelligible, not ad hoc as the approaches to content-based addressing are today, so that eventually we can get first-class support for loading resources by their hashes from the browsers themselves.


> Think about programming: if a function were immutable, or pure, we'd mark it as such in its type signature

Actually, MySQL has something similar. Stored Functions can be marked as [NOT] DETERMINISTIC, to indicate that they will always return the same value given the same input.

https://dev.mysql.com/doc/refman/5.7/en/create-procedure.htm...


That's exactly what Facebook is doing: https://code.facebook.com/posts/557147474482256/this-browser...

But I agree that you are correct to issue this warning. It is something you need to understand before you accidentally slather that on all your content.


You can get burned using 301 redirects over 302 just as easily if you are not careful. But it does not mean that we should abandon using 301 redirects.


The concept of immutable assets and links is at the core of IPFS, a distributed alternative to HTTP. Since Firefox implements the concept of immutable assets now, it would be totally reasonable to load these assets in the browser peer-to-peer (see WebRTC and webtorrent). I think this would be a great way to retrofit some decentralization into webpages!


I'm not sure it would be reasonable. To do that without deanonymising people would be an awful lot of work, and in the end, if it led to faster results, that would seem ripe for timing attacks to, e.g., figure out whether anyone on your LAN is browsing stonewall.org


OpenBazaar are using IPFS over a Tor transport.

You could have a fast lane direct from the (http) server, then a slow P2P lane over Tor, both using the same content addressed protocol, and caching the results locally for instant reloads.


Sorry if this is a dumb question. How is immutable caching any different from cache-control headers with a max age of 100 years?


When you click refresh, your browser will revalidate those 100yr lifespan resources. If they are immutable, it won't revalidate.


According to this link, the browser won't perform any revalidation for the duration of the max-age?

https://developers.google.com/web/fundamentals/performance/o...

However, what if you want to update or invalidate a cached response? For example, suppose you've told your visitors to cache a CSS stylesheet for up to 24 hours (max-age=86400), but your designer has just committed an update that you'd like to make available to all users. How do you notify all the visitors who have what is now a "stale" cached copy of your CSS to update their caches? You can't, at least not without changing the URL of the resource.

After the browser caches the response, the cached version is used until it's no longer fresh, as determined by max-age or expires, or until it is evicted from cache for some other reason— for example, the user clearing their browser cache. As a result, different users might end up using different versions of the file when the page is constructed: users who just fetched the resource use the new version, while users who cached an earlier (but still valid) copy use an older version of its response.


That's all true, but it applies to regular page loads, not the case where the user explicitly presses the reload button.


It shouldn't revalidate unless there are other headers that might cause it to revalidate (If-Modified-Since?).


As an engineer, I've always resented the unnecessary time I spend waiting for data in web browsers, so the bad state of caching in web browsers is an issue that's been on my mind for a long time.

There are a few ways to improve things:

1. Predictive modelling of user resource demand in the browser (e.g. preloading data). Very easy to do nowadays with great accuracy.

2. Better cache control / eviction algorithms to keep the cache ultra hot.

3. (This). Immutable caching is one of the major ways we could improve things. I'm not a fan of the parent article's way of doing it though, because if widely implemented it will break the web in subtle ways, especially for small companies and people that don't have Facebook's resources and engineering talent. It doesn't take into account usability issues and therefore leaves too much room for user error.

I've written up a very simple 11 line spec here that addresses this issue. https://gist.github.com/ericbets/5c1569856c2ad050771ec0c866f...

I'll throw out a challenge to HN. If someone here knows Chromium internals well enough to expose the cache and asset loader to Electron, within a few months I'll release an open source ML-powered browser that speeds up their web browsing experience by something between 10X and 100X. Because I feel like this should have been part of the web 10 years ago.


The size of your URLs will be an issue for pages with lots of assets. So much so that I bet if everything's warm in the cache except the page itself, then this proposal would be slower.

I'd rather see stale-while-revalidate implemented in browsers.


Publishers could simply use regular urls for their tiny files.


If everything's warm in the cache they're effectively all tiny files.


So much for useful file names. What if I wanted to download this? What about SEO?


Thanks for your feedback. What do you think about this (filename preserving) scheme:

example.org/urlcache/path/to/file/file.mimext?hashalg=hash

eg. example.org/urlcache/file.png?sha384=9b78b27b73668784e38c8a58a8535deaaecb77709


A lot of webapps already use this type of scheme for cache-busting, though of course the hash algorithm isn't mentioned. I assume that's important so that the browser can do verification?


Exactly right.


What is the difference between an immutable resource and setting the resource to expire in 10 years?

Many websites already do that, they change the URL each time the content changes.


The 'reload' action in browsers has a different meaning than most people expect and does not do the same thing as highlighting the url and hitting enter, or pasting the url into a new tab. 'reload' forces all assets to be re-validated, regardless of the freshness of the cache. The `immutable` header allows browsers to skip this revalidation.


I'm not sure if this is in-use anywhere, but it may lead to better "cache eviction" policies. Saying something is immutable has different implications than saying "I want it cached for 10 years".

The former would allow a browser to apply some heuristics and, for example, evict something from the cache when it hasn't been referenced on a page for the last 5 times you went there (the assumption being that it was changed and the new resource is taking over in its place).

That can lead to things being held in the cache for longer without having to use a straight "oldest evicted first" kind of method which can lead to thrashing if the cache size is too small or some resources are too big.


I'd imagine 'immutable' but not recently used things will still be evicted from the cache at some points, when space is needed. There's no reason for them not to be, is there?

A browser can already apply different eviction policies to things with very long `expires`, I don't know if any do.

I'm having trouble seeing what this adds practically, too. I guess it's more of a clear semantic intent to say 'immutable' when you need that, instead of 'expires a long time from now'. Practically though?


> Saying something is immutable has different implications than saying "I want it cached for 10 years".

Only if you have a system that you keep for more than 10 years. If a browser dumps stuff that it was told it can keep for a decade that means it can also dump stuff that it was told will never change. If the cache is full of things that are immutable something still has to go.


Oh definitely, I meant more in the sense of being able to evict things more "intelligently".

If stuff needs to go, it needs to go. It's going to happen. But if you have to choose between a file which you were told to cache for 10 years, and one which was labeled "immutable" but hasn't been referenced on the page it previously appeared on for the last 10 loads, it might be a better choice to evict the immutable one.


So I can build a site that stuffs your cache full of "immutable" content, and all your frequently used "cache for 10 years" stuff gets evicted?


You can already do that...

Generally browser caches are LRU - least recently used gets evicted first. So if you start stuffing the cache with a bunch of stuff you can already push the oldest out. And if you fill the whole thing you'll remove everything else.

It was actually quite a problem on mobile browsers for a while. IIRC iOS had something like a 10MB cache for the longest time. A few heavy websites and your whole cache would cycle through. Android had a 5MB cache at one point as well! A shitty news site can already fill that single-handedly.

But my idea here (which I gave about 5 minutes of thought...) was that immutable content would be evicted before the "10 year" stuff. The idea being that if something replaced an immutable "thing", it could be evicted much sooner.

I'm not sure if it would really help at all, just kind of throwing out some ideas.


They say the difference goes back to this observation[1] from Facebook:

"we've noticed that despite our nearly infinite expiration dates we see 10-20% of requests (depending on browser) for static resource being conditional revalidation. We believe this happens because UAs perform revalidation of requests if a user refreshes the page"

It sounds like immutable is being introduced so that Firefox can leave the current behavior in place, and only instantiate the new behavior for "immutable" resources.

This seems different than the chrome approach, where they plan on changing the behavior for all "unexpired" resources.

[1]https://www.ietf.org/mail-archive/web/httpbisa/current/msg25...


They could just be seeing 10%-20% of the people doing a Ctrl+F5 'force refresh' -- after all, they're speculating, and have no real data on whether browser or user behavior is responsible for the numbers they're seeing.


You think they didn't test their assumption?


I think they did, but there's not enough information available to reliably distinguish 'normal' reloads from 'force-reloads', unless they ran onKeyDown javascript and correlated the events after the fact.

A comment from Facebook [1] on the Chrome bug says they started logging 'window.performance.navigation.type', which distinguishes between [2] normal navigation, reload, history traversal (back/forward), and undefined.

[1] https://bugs.chromium.org/p/chromium/issues/detail?id=505048... [2] https://developer.mozilla.org/en-US/docs/Web/API/Navigation_...


Firefox themselves appear to have tested and confirmed that regular F5/Refresh (not Ctrl+F5) causes revalidation: http://bitsup.blogspot.de/2016/05/cache-control-immutable.ht...


The difference is that in the case of immutable the browser does not send a 304 conditional.


Really? The article implies that the whole point of Immutable is that the browser won't revalidate contents.

(Also, "304" is an HTTP response code -- "Not Modified" -- not something the browser sends.)


Meant to say that there's no conditional with immutable. Corrected, thanks.


I still don't think that's accurate. My experience is that a future Expires date -- even relatively near-future -- is sufficient to prevent an If-Modified-Since request. (Which is what I think you mean by "304 request".)


For normal loads, yes.

The problem people are trying to solve here is that users hit the relaod button. A lot. And if you hit the reload button, that will send If-Modified-Since for everything on the page, no matter what Expires says, because the user intent is to ignore Expires headers.

That's what the immutable thing is about: indicating that even in the reload scenario the Expires is authoritative and no If-Modified-Since request should be sent.


Aha. The forced If-Modified-Since on reload is the piece I was missing. Thanks.


It's up to the client to decide based on Last-Modified (if previously sent by the server) and its own implementation. But I guess browsers are going to be a tad bit aggressive especially if Last-Modified is absent. Hence the overly aggressive 304s.


Firefox won’t reload the resource, even if the page the resource is on is POSTed to.

Use case: Logging into Facebook, previously browsers reloaded all resources.


You typically do a 302 redirect AFTER the POST authentication. This feels like a publicity stunt more than anything. It might affect very few use cases. The bulk of the requests on the web are GETs anyway.


I agree. Facebook redirects a successful login POST with a 302, which then results in a GET to '/'. This is an extremely common pattern. Once you're making GETs, 'fresh' resources can be served from cache without revalidation [1]. This is basically the whole point of the fresh vs. stale distinction.

[1] https://tools.ietf.org/html/rfc7234#section-1


Why would the browser need to reload all resources after a POST request?

Presumably, only the main (HTML) page should be refreshed and other linked items (CSS/JS/media) with high max-age (and no 'must-revalidate' cache control header) should use the previously cached content.


The reality of most browsers is that they ignore the cache time in many cases, like refreshing, in order to work with broken pages.

In a perfect world, we'd just be able to max-age and immutable would be redundant.


Which raises the question: How long until browsers start ignoring the immutable header to work around broken pages?


This is related to the Chrome caching update, as discussed here: https://news.ycombinator.com/item?id=13492483

Two wholly different strategies, which have ultimately split how the browsers handle caching.


I just realized that until HTTP is completely replaced with HTTPS, private mode should always be used to browse the Internet on an untrusted WiFi network. Otherwise malicious content injected on such a network can be cached and reused by the browser forever.


> In Firefox, immutable is only honored on https:// transactions.

https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Ca...

> Clients should ignore immutable for resources that are not part of a secure context

https://tools.ietf.org/html/draft-mcmanus-immutable-00


True, but you should also:

1. Use HTTPS Everywhere.

2. Never use an untrusted WiFi network without a VPN.


With WiFi hotspots dropping connections more often than not, how many people would know they need to CTRL-F5 to "fix" a broken page/image/JS/CSS?

I just hope the draft as-is expires and never makes it to an RFC.


Could you clarify your objection? Presumably clients would never immutably cache a resource they couldn't validate.


How does validation work in the case of a 200 status and no Content-Length header?


I'd also like to know. I hope subresource integrity can be implemented alongside the cache control and prevent caching a bad file.

https://developer.mozilla.org/en-US/docs/Web/Security/Subres...


Why would a browser cache an incomplete response? Content-Length exists, and failing that, there are chunked transfers.


Content-Length MUST be ignored in some cases (encoding). Not all servers send a Content-Length header (it's a SHOULD, not a MUST).


If you don’t tell the browser in some way where your response ends (with Content-Length, chunked, or HTTP/2 [?]), you can’t reuse connections, as far as I know. If that’s the case, you probably don’t care enough about performance to use `immutable` anyway.


Just playing devil's advocate there for a minute. And this is definitely one of those things I've had to work around.


HTTP caching is a mess. I wonder why no one proposed a properly redesigned and negotiable protocol that covers all the edge cases. (And maybe supports partial caching/partial re-validation of pages.)


On my phone, so unfortunately no reference, but there is an HTTP/2 spec underway that allows a client to send a cache manifest frame. A server can then push the resources that are newer. Pretty much exactly what's needed.


Unless we're talking about different things, cache manifest is an HTML 5 feature designed to enable websites to work offline. That's quite different from HTTP-level caching, which would be applicable to any files/resources and designed primarily with performance and bandwidth savings in mind. I might be unaware of some relevant HTTP 2 features, though.


Definitely a different thing. This is an HTTP/2 frame, sent by a client.


Could you post a link to the relevant part of the spec or some article dealing with this feature?


The current HTTP RFC is actually quite nice: http://httpwg.org/specs/rfc7234.html


What happens if the immutable file is borked in transfer, leading to a partial file sitting in cache? Will this lead to a new class of problem that can only be solved by nuking the entire browser cache? It seems to me the way to do this right would have included a checksum of the content.


A hard refresh (Ctrl+F5) of the page will refresh immutable resources too, so you won't need to clear the whole browser cache.


I wonder if this will bring back the "hard refresh".


I'm curious, too. The blog post says refreshing the page won't revalidate the immutable resources, but doesn't say what happens for a CTRL+SHIFT+R hard refresh.

The Firefox bug mentions hard refresh, but doesn't say what was implemented:

https://bugzilla.mozilla.org/show_bug.cgi?id=1267474


Patrick McManus, the Firefox developer of this feature, confirmed that hard reload will load immutable resources from scratch, so the user always has a nuclear option for fixing cache corruption. :)


You can just disable caching altogether in Dev Tools. Extra refresh levels are unnecessary.


What a mess, but perhaps a happy ending. I made two other comments prior to this one in this thread, but then I read the Bugzilla thread [1] opened by Facebook that laid out the issue and Mozilla's defense. It's a highly enlightening read; I can't recommend it enough.

To summarize, the issue is that Facebook was seeing a higher rate of cache validation requests than they'd expect, and looked into it. Chrome produced an updated chart documenting different refresh behaviors [2], which is the spiritual successor of this now-outdated Stack Overflow answer from 2010 [3], and, in response to Facebook's requests, re-evaluated some of their refresh logic.

In this thread, Firefox was being asked to do the same, but they pushed back on adding yet another heuristic and in turn proposed a cache-control extension. Meanwhile, Facebook proposed the same thing on the IETF httpbis list, where the response was not enthusiastic [4], largely on the grounds that this is metadata about the content and not a prescriptive cache behavior, and that the HTTP spec already accounts for freshness with age. One of Mark Nottingham's responses [5]:

(...) From time to time, we've had people ask for "Cache-Control: Infinity-I-really-will-never-change-this." I suspect that often they don't understand how caches work, and that assigning a one-year lifetime is more than adequate for this purpose, but nevertheless, we could define that so that it worked and gave you the semantics you want too.

To keep it backwards compatible, you'd need something like:

Cache-Control: max-age=31536000, static

(or whatever we call it)

[1] https://bugzilla.mozilla.org/show_bug.cgi?id=1267474 [2] https://docs.google.com/document/d/1vwx8WiUASKyC2I-j2smNhaJa... [3] http://stackoverflow.com/questions/385367/what-requests-do-b... [4] https://www.ietf.org/mail-archive/web/httpbisa/current/msg25... [5] https://www.ietf.org/mail-archive/web/httpbisa/current/msg25...


We should really try to integrate something like https://ipfs.io/ in browsers


From what I understand, browsers don't really adhere to far-future expiry headers when the user manually reloads a page; instead they re-request every resource to see if it's really not expired. For resources that the server flags as immutable (upon first request), the browser won't make further requests, but instead instantly reloads the elements out of its local cache.


This should be done using subresource integrity. Then, you know it hasn't changed. There should be some convention for encoding the hash into the URL, so that any later change to an "immutable" resource will be detected.

With subresource integrity hashes, you don't have to encrypt public content. Less time wasted in TLS handshakes.


The last image about Squid proxy is included too small in the post, here it is in a readable way: https://hacks.mozilla.org/files/2016/12/sq.png


A very creditable action, cheers to Mozilla!

This increases the democratization of the web and allows small fries to have a disproportionately larger footprint.


How do you update the URL and everywhere it's used once an asset (image or script) is changed? With an automatic script, or manually?


> The page’s javascript, fonts, and stylesheets do not change between reloads

So is this like Rails' Turbolink but built into the browser?


Wow I'm out of touch with front end dev. Someone please explain to me why this hasn't been standard already.


What's the difference between Cache-Control: Immutable and setting expires headers really far in to the future?


When you do a hard refresh (Ctrl + F5 / Apple + R), your cache gets invalidated. The immutable parts don't need to be reloaded, because they are immutable.


Oh, interesting. Is that the only difference?

How often do people actually do a hard refresh? I would guess most people don't know about that key combination.


not only that, but even when you have a resource with a long cache time, and cache-control public, the browser will still do a request for the file headers (a conditional request - one that serves the content if it has changed/doesn't match what the browser expects)

so an immutable resource wouldn't even need that request



