Hacker News
Why Prefetch Is Broken (jefftk.com)
245 points by astdb on June 2, 2021 | 89 comments

The cache segregation is a bit of a nuisance when building sites that use iframes on different domains to sandbox user content. For example, Framer (where I work) sandboxes each user project on a unique subdomain of framercanvas.com, which is on the public suffix list so that different projects can’t share cookies, localStorage, etc. But because of cache segregation, all resources loaded on these subdomains are fetched fresh for every new project, even if it’s the same old files being loaded over and over. I wish there were a way to establish a shared cache between two domains: since we could already manually implement a content tunnel between framer.com and the sandbox that sends down cached content with postMessage, explicitly opting into a shared cache doesn’t seem to introduce any additional tracking issue.

Load all resources with fetch(), store to cache by yourself, take the response, clone and transfer into anything that can be sent with postMessage()
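A minimal sketch of that approach, assuming the parent page holds a reference to each sandboxed iframe (the names here are illustrative, not Framer's actual code):

```javascript
// The parent (e.g. framer.com) fetches each shared asset once, keeps the
// bytes in a Map acting as a manual shared cache, and transfers a copy to
// any sandboxed iframe that needs it via postMessage.
const assetCache = new Map(); // url -> ArrayBuffer

async function sendAsset(frame, url) {
  if (!assetCache.has(url)) {
    const resp = await fetch(url);
    assetCache.set(url, await resp.arrayBuffer());
  }
  // slice() copies the buffer so transferring it doesn't detach our cached copy
  const buf = assetCache.get(url).slice(0);
  frame.contentWindow.postMessage({ url, buf }, '*', [buf]);
}
```

The iframe side would listen for message events and instantiate the payload, e.g. as a Blob URL for a script or stylesheet.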

Clever! It would be better for the intended behavior to be supported, though, right?

So your setup is: outer.example serves the main site, and each user project is sandboxed in an iframe on its own userN.inner.example subdomain, with inner.example on the public suffix list.

Since, as you say, outer.example could load the shared resources and postMessage them into userN.inner.example, it does seem to me like there should be a way for outer.example and userN.inner.example to opt into letting userN.inner.example use the outer.example cache partition.

Have you considered raising a spec issue?

I would think that the original intent of the origin cache sandboxing here, is to disallow different source origins from sharing a target origin, even if that target origin wants to be shared. Think "the target origin is an ad-tracking provider domain."

I don't see any good way of enabling a target origin to opt into allowing source origins to share caches with it, that wouldn't also reintroduce the privacy leaks. (As, after all, even if the only things malicious-site-X can see in your cache are ad-tech providers' origins that opted into allowing anyone to interface with them, that's likely still enough to fingerprint you.)

It would still be sharded by top-level domain, though, which would be enough to prevent cross-site tracking. This is specifically for the case where a page looks like: [common top level origin] -> [iframed different origin].

That's interesting.

Just so I understand it correctly:

* The iframe loads resources, e.g. /static/bundle.js and /public/index.css (does this include user-defined resources?).

* But due to the iframe being embedded on subN.framercanvas.com, the cache key includes the subdomain. So all resources are fetched again for all projects?

> even if it’s the same old files being loaded over and over

Is there a large enough audience visiting multiples sites that would make the effort worth it?

>Framer (where I work)

Your website runs terribly on Firefox. Multiple periods of hundreds of milliseconds where the viewport went blank.

I don't really understand the issue? If I want to prefetch an image, I'm on the same origin the whole time and this cache segregation doesn't matter.

Indeed. As the author puts it "sometimes you know something is very likely to be needed". Let's have a look:

    <link rel=prefetch href=url>
What's going on here, for the test case given? It's introducing tight coupling (or it already exists, and you're trying to capture a description of it to serve to the browser) to an external resource. It's not that prefetch is broken; it's that the desire to gesture at the existence of a resource outside your organization's control, while insisting that it's so important as to be timing-critical, is like trying to have your cake and eat it, too.

As mentioned in similar comments, the observed behavior for this particular test case is potentially a problem if you are building Modern Web Apps by following the received wisdom of how you're supposed to do that. There are lots of unstated assumptions in the article in this vein. One such assumption is that you're going to do things that way. Another assumption is that the arguments for doing things that way and the plight of the tech professionals doing the doing are universally recognized and accepted.

From the Web-theoretic perspective—that is, following the original use cases that the Web was created to address—if that resource is so important to your organization, then you can mint your own identifier for it under your own authority.

Ultimately, I don't have a lot of sympathy for the plight described in the article. It's fair to say that the instances where this sort of thing shows up involve abusing the fundamental mechanism of the Web to do things that are, although widely accepted by contemporaries as standard practice, totally counter to its spirit.

I understand what you're saying, but I fundamentally disagree with you.

The issue is that you're immediately deciding that domain X and domain Y are different entities.

In practice, I find that there are a HUGE number of use cases where two domains are actually the same organization, or two organizations that are collaborating.

There is basically no way to say to the browser - I am "X" and my friends are "Y" and "Z", they should have permissions to do things that I don't allow "A", "B", and "C" to do.


We actually have a functioning standard for this on mobile: both iOS and Android support /.well-known paths for manifests that allow apps to couple more tightly with sites that list them as "well-known" (aka friends).
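For reference, Android's version of this lives at /.well-known/assetlinks.json; something like the following (the package name and certificate fingerprint are placeholders) declares that the listed app may handle the site's links:

```json
[{
  "relation": ["delegate_permission/common.handle_all_urls"],
  "target": {
    "namespace": "android_app",
    "package_name": "com.example.app",
    "sha256_cert_fingerprints": ["AA:BB:..."]
  }
}]
```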

The browser support for this kind of thing is basically non-existent, though, and it's maddening. SameSite would have been a PERFECT use case. We're already doing shit like preflight requests, why not just fetch a standard manifest and treat whitelisted sites as "SameSite=None" and everything else as "SameSite=Lax"? Instead orgs were forced into a binary choice of None (forfeiting the new CSRF protections) or Lax (forfeiting cross-site sharing).

> There is basically no way to say to the browser - I am "X" and my friends are "Y" and "Z", they should have permissions to do things that I don't allow "A", "B", and "C" to do.

Isn't that CORS? That sounds like CORS.

CORS can fully stop an origin from accessing an HTTP resource; this discussion is about allowing a different origin to reuse the browser's local state (like cookies, the HTTP cache, local storage).

CORS is a stronger security layer in the browser than cache segregation; some people want to keep the CORS security model but weaken cache segregation.
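Concretely, CORS only governs whether script on a.test may read a response from b.test; response headers like these say nothing about which cache partition the response is stored in:

```http
Access-Control-Allow-Origin: https://a.test
Access-Control-Allow-Credentials: true
```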

> The issue is that you're immediately deciding that domain X and domain Y are different entities.

That's... literally what domain means.

"A domain name is an identification string that defines a realm of administrative autonomy, authority or control within the Internet." - Wikipedia

The entire security policy of the Internet is built on this definition. It's not an assumption. It's a core mechanism.

wrong again... It's very clear "*A* realm". Not "The realm". Certainly not "The only realm".

Entities are allowed to control many assets.

Simplest possible case in the wild for you, since you're being obtuse.

I'm company "A". I just bought company "B". Now fucking what?

> but I fundamentally disagree with you ["We actually have a functioning standard for this on mobile"]

I wouldn't expect any less.


> In practice [...]

There's your problem. Try fixing that.

Moving everything under a single domain makes no sense for the use case of cross-organization sharing, though. Domain as the root identity on the web is just broken and there's no way to make it work.

Two separate identities. Is it even possible to let the browser know that they should be segmented? Nope.

How about this? No again, not without getting yourself on the suffix list which is just a hack.

Can you tell the browser that these are actually the same and shouldn't be segmented? Nope!

This makes more sense and doesn’t dilute your brand, improves SEO, and helps protect against phishing by making your customers and staff look for a single known-good domain:


Your take is wrong.

Having a domain is identical to having a driver's license: This org says I am "X".

It is fundamentally different from uniquely identifying me.

I am still the same person if I give you my library card - a different ID from a different org that says I am "Y".

Nah, the problem is definitely on your end.

> Having a domain is identical to having a driver's license: This org says I am "X".

Nope. You just described two different documents.



It's very odd, the position you're taking here, given the sentiment in your other tirade about being digital sharecroppers to Facebook and Google <https://news.ycombinator.com/item?id=27369652>. Your proposed solution is dumping more fuel in their engines—which is why it's the kind of solution they prefer themselves—and completely at odds with e.g. Solid and other attempts to do things that would actually empower and protect individual users. I'm interacting with your digital homestead; why are you so adamant about leaking my activity to another domain?

If you actually search a bit on the internet, you'll find Google actually made an RFC for a standard allowing websites to list other domains that should be considered as same origin. Look for the comments on it by the IETF and you'll understand why this is a terrible idea.

As somebody directly involved in this space, that's a pretty bad summary. The spec for first party sets (not an RFC, just a draft) isn't in a great state. Google is going to implement it anyway, Microsoft is supporting, Apple basically said the spec was crap but they might be interested in a good version of it, and Firefox said they didn't like it.

Speaking as an engineer... the Firefox folks don't really get it. You can't just break what sites like StackOverflow and Wikipedia have been doing for years (and in some cases decades) and then say "you were doing the wrong thing." Some version of FPS will ship in browsers, probably in the next 2 years.

Quoting Apple's position directly "[...] Given these issues, I don’t think we’d implement the proposal in its current state. That said, we’re very interested in this area, and indeed, John Wilander [Safari lead] proposed a form of this idea before Mike West’s [Google] later re-proposal. If these issues were addressed in a satisfactory way, I think we’d be very interested. [...]"

Also it was a W3C TAG review. The W3C and IETF are different organizations.

> RFC for a standard allowing websites to list other domains that should be considered as same origin

No, they allowed an origin to list other origins whose cookies would be sent back to the serving origin correctly even if they were iframes loaded in the parent origin DOM.

I.e. this is the expected behavior for iframes until Safari decided that there was such a thing as "third party" origins whose web semantics could be broken in their war against advertising.

Google is trying to (partially) restore the expected behavior of iframes so that named origins get their own cookies sent to them, which is how things worked for the first two decades of the web.

Why don't you search a bit and come back with a link?

Because I can't comment on an RFC I haven't seen, and a quick google search of my own based on your comment turns up nada.

That said - I'm fully aware of the downsides of this approach, but I want my browser to be (to put it crudely) MY FUCKING USER AGENT. I want to be able to allow sharing by default in most cases, and I want a little dropdown menu that shows me the domains a site has listed as friendly/same-entity, and I want a checkbox I can uncheck for each of them.

Then I want an extension API to allow someone else to do the unchecking for me, based on whether the domain is highly correlated with tracking (Google analytics, Segment, Heap, Braze, etc)


The way I see it, the road to hell is paved with good intentions. If the web was developed in our current climate of security/privacy focus, how likely is it that even a fucking <a href=[3rd party]> would be allowed? Because I see us driving to a spot where this is verboten. Which also happens to be the final nail in the coffin for any sort of real open platform.

Welcome to the world where the web is literally subdomains of facebook/google. What a fucking trash place to be.

Hum... The resource is not that important to me. I'm just the author of the current page, and am letting the browser know that users are very likely to want that resource.

You are attributing a lot of intention to a mechanism. You don't know if it's a 3rd-party tracker or the news link in a discussion page.

The proposal in the article is actually quite good, since I should always know very well whether the resource will load in a frame or via a link.

> You are attributing a lot of intention to a mechanism. You don't know if it's a [...] or [...]

Interestingly, I think this remark is a signal that you've read something out of my comment that's just not there (and thus attributing a lot more intention to me than you should).

> The resource is not that important to me.

These are a class of resources that are important enough that folks would pause what they are doing to try to deliberately mark them with a magic incantation that they expect will cause the user agent to do something, notice that it doesn't do the thing that they want it to do, and then go and either write a blog post to complain about it, or throw their support behind someone else's complaints about it. The argument that you don't find it particularly important is pretty much self-defeating.

I could be wrong but it seems to me that any cross-domain prefetch that uses the “document” option from the article is potentially privacy-violating and can reintroduce the same leaks that necessitated the original segregation.

A.test prefetches b.test/visited_a.js, b.test/unique_id.js, and log(n) URLs that bisect unique_id.js so that you can search the cache for the unique id.
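One hypothetical way to realize that scheme (simulated here with a Set standing in for the cache, and encoded per bit of the id rather than as a literal bisection, which amounts to the same log-sized set of URLs):

```javascript
// b.test serves one URL per bit of a user id; a.test prefetches the URLs for
// the set bits. Later, checking which URLs are already cached (e.g. via load
// timing) spells the id back out.
function urlsForId(id, bits = 16) {
  const urls = [];
  for (let i = 0; i < bits; i++) {
    if ((id >> i) & 1) urls.push(`https://b.test/bit_${i}.js`);
  }
  return urls;
}

function decodeId(cachedUrls, bits = 16) {
  let id = 0;
  for (let i = 0; i < bits; i++) {
    if (cachedUrls.has(`https://b.test/bit_${i}.js`)) id |= 1 << i;
  }
  return id;
}
```

That round trip is exactly the kind of cross-site identifier the partitioned cache was introduced to make impossible.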

Have to be careful to balance performance and “this is useful to me” with abuse prevention at scale. It’s also important to realize we have to tread carefully with browser features that seem useful as the graveyard of deprecated features that didn’t survive privacy attacks is quite large.

What's the privacy violation risk here? A doesn't learn anything through those prefetches about whether or not the user has previously cached any B resources (because A never sees the timing or result of its prefetch requests - it doesn't even know if the browser attempted them).

B obviously knows what resources A prefetched because they were requested from B in the first place. And if A wants to pass information to B, they don't need to do a complex prefetch dance, they can just load an img src.

So I don't see any way for A or B to learn anything about the user's behavior on one another's site without the other site's cooperation?

Well, if so it's a problem, because the article says that Chrome and Safari handle prefetch exactly that way.

But I don't see that problem. In this case the a.test domain can not see what is in the cache; only b.test sees it. (At least by what I understood.)

Yeah, his slideshow example doesn't show the problem. Unless each slide was on its own domain, this isn't a problem. It matters for things like Google Fonts, but very few folks have multiple domains that share enough of the same assets for this to matter in practice.

Question: Does Google use Google Fonts to track users across the web?

Google's FAQ [1] says that it only collects the information needed to serve fonts, but it says the generic Google privacy policy applies. The Google Privacy Policy allows them to use any information it collects for advertising purposes.

While Google also states that requests do not contain cookies, Google Chrome will automatically send a high-entropy [3], persistent identifier on all requests to Google properties, and this cannot be disabled (X-Client-Data) [2]. Google can use this X-Client-Data, combined with the user agent's IP address, to uniquely identify each Chrome user, without cookies.

So, perhaps the privacy statement is more of a sneakily worded non-denial?

[1]: https://developers.google.com/fonts/faq?hl=en#what_does_usin...

[2]: https://github.com/w3ctag/design-reviews/issues/467#issuecom...

[3]: A sample: `X-client-data: CIS2yQEIprbJAZjBtskBCKmdygEI8J/KAQjLrsoBCL2wygEI97TKAQiVtcoBCO21ygEYq6TKARjWscoB` - looks very high entropy to me!

> Google Chrome will automatically send a high-entropy [3], persistent identifier on all requests to Google properties, and this cannot be disabled (X-client-data) [2].

X-Client-Data indicates which experiment variations are active in Chrome:

Additionally, a subset of low entropy variations are included in network requests sent to Google. The combined state of these variations is non-identifying, since it is based on a 13-bit low entropy value (see above). These are transmitted using the "X-Client-Data" HTTP header, which contains a list of active variations. On Android, this header may include a limited set of external server-side experiments, which may affect the Chrome installation. This header is used to evaluate the effect on Google servers - for example, a networking change may affect YouTube video load speed or an Omnibox ranking update may result in more helpful Google Search results. -- https://www.google.com/chrome/privacy/whitepaper.html#variat...

Google doesn't use fingerprinting for ad targeting, though, as with IP, UA, etc., it does receive the information it would need if it were going to. I don't see a way Google could demonstrate this publicly, though, except an audit (which would show that X-Client-Data is only used for the evaluation of Chrome variations).

(Disclosure: I work on ads at Google, speaking only for myself)

Thanks for the informative answer. I still have trust in engineers and assume truth and good faith, so that is comforting to know.

You could always ask someone who works on Google Fonts. I did just that. The answer is they don't use the logs for much apart from counting how many people use each font to draw pretty graphs.

Doesn't mean that won't change in the future though. But log retention is only a matter of days, so they can't retrospectively change what they do to invade your privacy.

I find myself wondering whether Google’s front end implements a fully generic tracker: collect source address and headers and forward it to an analytics system. The developers involved in each individual Google property behind the front end might not even know it’s there. Correlating the headers with the set of URLs hit and their timing might give quite a lot of information about the pages being visited.

I hope Google doesn’t do this, but I would not be entirely surprised if they did.

Unless it's regularly verified by a trusted third party, such as a government agency, I wouldn't trust them not to. After all: we're talking about a corporation that lives off the data it gathers about people using their services and products.

If the frontend had a fully generic tracker, teams wouldn't need to set up their own logging and stats systems... Which they do...

I think they would in any case. My impression is that data is siloed internally at Google, and that data sharing between departments would be way more complex than just setting up some (possibly redundant) logging.

I spent ten seconds thinking about the logistics of adding logging to the frontends, and...

Well, obviously I can't say for sure they don't have any. I didn't look it up, and if I had I wouldn't be able to tell you. But since I didn't, I can tell you that the concept seems completely infeasible. There's too much traffic, and nowhere to put them.

Besides that, not everything is legal to log. The frontends don't know what they're seeing, though; they're generic reverse proxies. So...

> completely infeasible. There's too much traffic, and nowhere to put them

If there’s one company in the world for whom bandwidth and storage are not an issue, it’s Google.

It sounds so easy to make, yet so useful, that I can't see how they wouldn't do that. Deontology has been thrown out Google's window a long time ago.

I just went for the easy solution and disabled web fonts. Comes with the drawback that many site UIs are now at least partially broken (especially since some developers had the bright idea to use fonts for UI icons), though flashier sites tend to come with less interesting content anyway.

But as it stands I don't want to trust Google, Facebook etc. more than absolutely necessary. They have lost every right to that a long time ago and are incentivized by their business model to not change anything.

So download your fonts off Google and serve them from your own domain.

Yes, sorry, an image is a bad example. The main issue is with HTML documents. You might open one at the top level, by clicking on it and navigating to it, or you might open one as a child of the current page, by putting it in an iframe. Since they can be opened in both contexts, prefetch doesn't know what to do.

But is it right that the issue is fetching resources from a different domain than the current one? As a user, just because I've connected to domain A, it doesn't mean I necessarily want my computer to connect to any domain B that A links to. Also, I'd rather developers focus on making small pages that are easier to fetch on demand, and am worried that they'll use prefetch to justify bloated pages. If a page is large enough to need prefetch, then I might not want to spend the data pre-fetching it especially if the click probability is not very high. Between all of these, I'm not convinced of the need for cross-domain prefetch.

Apologies that I'm not a front-end person so this may be naive, but it would be great to hear your thoughts!

Yes, this is only an issue for cross domain prefetch.

With HTML resources, the goal of prefetch is typically not to get a head start on loading enormous amounts of data, but instead to knock a link off of the critical path. The HTML typically references many different resources (JS, CSS, images, etc) and, if the HTML was successfully prefetched, when the browser starts trying to load the page for real it then can kick off the requests for those resources immediately.
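For the slideshow case, that might look like this (the paths are illustrative):

```html
<!-- While the user reads slide 4, pull slide 5's HTML into the cache so
     that, on click, the browser can immediately request its JS/CSS/images -->
<link rel="prefetch" href="/slides/5.html">
```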

Makes sense, thanks for the reply!

It would be full document prefetches. Would be useful for sites like Reddit or Google News. Also for things like Okta application list page.

It’s not an issue most of the time, but I do agree that it would be nice to have a fix.

Most, or at least a lot, of the prefetching is for third party libraries (think jQuery, Google Fonts, Facebook Pixel, etc). There’s a general speed advantage for users caching commonly used libraries and fonts across sites. Nonetheless I believe prefetch will still have a speed advantage even when the cache is segregated.

If the cache is per domain, does that mean CDN-served dependencies like jQuery and React are in fact... useless in terms of cache reuse?

Yes. Browsers and protocols have changed, and a lot of past performance best-practices have become actively harmful for performance.

A couple of related techniques are also useless: domain sharding and cookieless domains. HTTP/2 multiplexing and header compression made them obsolete, and now they're just an overhead for DNS+TLS, and often break HTTP/2 prioritization.

You should be careful with prefetch too. Thanks to preload scanners and HTTP/2 prioritization there are few situations where it is really beneficial. But there are many ways to screw it up and cause unnecessary or double downloads.

Honestly cache reuse was never as high as anyone hoped. There were so many versions of jQuery and so many different CDNs that very few first time visitors already had the one you wanted.

And that's fine; CDN benefits are minimal, given how many versions of dependencies are out in the wild, and how the total payload of a website can be reduced by clever packaging and modern data transfer mechanisms. The JS standard library also improved since then, with things like CSS selectors, the Fetch API, and CSS animations being part of the standard library nowadays.

I'd argue there's few 'shared' dependencies on websites nowadays.

As others have already said: Yes, it is useless in terms of cache.

Besides it being useless in terms of cache, it also incurs other overhead. Another DNS request, another TCP handshake, another TLS connection. With HTTP/1.1, this might still make sense because you don't get resource pipelining, but with HTTP/2, the extra overhead is simply extra overhead. With HTTP/3, it becomes even less useful to have the domains sharded. Generally speaking, the best use of resources with the modern web is to serve everything from the same domain.

I kind of like how protocol improvements have made the domain an authority, almost like a frontend security boundary.

Using a library-provided CDN can be performance-negative now, since you usually need several RTTs for DNS, TCP, TLS before you even get to HTTP. Serving it from your own domain / CDN allows it to be part of an existing HTTP connection.

Yes. CDNs only offer limited benefits such as lower latency and higher bandwidth.

So still useful for images and other large files. Not so much with scripts.

Yes. They didn't used to be, but they are now.

In Safari, since ~2013.

That's true, but in the other major browsers it's much more recent: 2020 or even early 2021.

Yeah, the security changes here made CDNs go from good to actively bad.


Isn't the actual problem with Chrome's behavior and as=document that they still leak? If a.test preloads b.test/i_have_visited_a.html, it is added to b.test's cache partition, so b.test can later check (e.g. via load timing) whether that URL is cached and learn that the user visited a.test.

Correct. as=document reintroduces a data leak between different domains, and therefore isn't viable.

The only solution is not to allow cross-origin document preloads. Which is lame because the impact on user experience is reasonably substantial.

How is as=document more of a leak than ordinary cross-site navigation?

Because you can do this without navigating the user. You are correct that this can be accomplished by redirecting to the tracker site and back but this is 1) detrimental to user experience so not frequently done. 2) Easier to detect and block as the behaviour is suspicious 3) Means that your site breaks if the tracker site is down. This is especially an issue if you want to "register" the session with multiple trackers.

With preload you can do this in the background very efficiently.

Hmm, I think you're right, there may be a problem here. Unlike any other request you can trigger on a page in a browser that blocks third-party cookies, if a.test has <link rel=prefetch as=document href=b.test> this will send b.test's first-party cookies. This allows cross-site tracking through link decoration without having to navigate the top-level page.
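Sketched out, the abuse would look something like this (tracker.test and the parameter name are hypothetical):

```html
<!-- a.test decorates the URL with its own user id; because the prefetch is
     sent with tracker.test's first-party cookies, tracker.test can join the
     two identities without the page ever navigating -->
<link rel="prefetch" as="document" href="https://tracker.test/sync?a_uid=12345">
```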

Prefetching is a cool idea but a lot of us can’t actually use it. I tried implementing prefetch in a personal project only to find out uBlock Origin is disabling it. Apparently prefetched resources aren’t filtered by extensions, which kind of defeats tracking protection. So I can’t even use it for my own project, as I’d rather avoid the trackers. I assume many people are using the same default setting here.

> I tried implementing prefetch in a personal project only to find out uBlock Origin is disabling it.

That seems fine to me. Implement it and if the users don't want it then it doesn't occur. You should still code as if it works.

> Apparently prefetched resources aren’t filtered by extensions

This sounds like a browser bug. It should probably be raised against the browsers.

> as I’d rather avoid the trackers.

Again, this is just a result of the browser bug. I see no reason to throw away a nice declarative prefetch simply because browsers forgot to allow filtering.

>That seems fine to me. Implement it and if the users don't want it then it doesn't occur. You should still code as if it works.

Please correct me if I'm misinterpreting this statement. Are you saying it is acceptable if the code breaks if prefetch fails?

How would the absence of a prefetch break something?

Isn't this just a performance optimization?

I take the original statement to mean that worse case scenario is extra time to load.

Yes, that is what I meant. You may as well include the prefetch. And if the browser (or the user) doesn't want the prefetch, they just get a slower load. If the user enables them, they get the snappier experience.
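That degrade-gracefully stance can be a small helper; a sketch (the function name is made up, and a stubbed document works for testing):

```javascript
// Inject a prefetch hint; if the browser or an extension ignores it,
// navigation simply falls back to an ordinary (slower) load.
function addPrefetch(doc, href) {
  const link = doc.createElement('link');
  link.rel = 'prefetch';
  link.href = href;
  doc.head.appendChild(link);
  return link;
}
// e.g. addPrefetch(document, '/next-page.html') on link hover
```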

Are “a.test” and “b.test” meant to represent different domains? The actual syntax used would just be different file paths.

Or .example which is a TLD reserved for examples. Different file paths wouldn't cut it because they are same-origin and don't trigger this issue.

Really hard to understand. I think this should be rewritten to make it more clear.

Yes. It should have been example.com or something like that, where we'd instantly know it was a domain name.

Yes, different domains.

This is quite interesting. I didn't know this was a thing.

A primary issue, as I see it, is caching of third-party assets (as dmkil posted elsewhere, think jQuery, Google Fonts, Facebook Pixel, etc).

Could this not be solved using the Cache-Control header, or maybe some HTML attribute variation of this? Maybe something like:

  <!-- Use a site-specific cache for its stylesheet, default behavior -->
  <link rel=stylesheet href=index.css cache-key="private">

  <!-- Use a global cache for jQuery -->
  <script src="https://cdn.example/jquery.min.js" cache-key="public"></script>
Your cache-key idea undermines the whole point by reopening third-party access. As a site owner who wants to place ads, I just use your script notation to include the ad and then leak data again...

Ah yes, of course. I was thinking with a perspective of a user who has full control of the site, as they are the owner too.

The cache-key idea would only work if the user themselves could specify it for every resource.

It was a surprise to me that browsers partition their cache now. I think Safari has done it since 2013!

When I found out I wrote a blog post about the HTTP Cache partitioning and hosting jQuery, or any library, from a CDN.


I don't understand the problem here! Prefetch loads additional data while you're not doing anything. The whole argument is that prefetch doesn't respect caching??? Those are two different concepts. While I am looking at slide N I don't care if slide N+1 image is loaded fresh or from a cache. Am I missing something here?

You are missing something, but I wish people hadn't down-voted you for asking an honest question.

The problem described in the blog post is that prefetch loads the resource into cache, which when combined with per-site cache segmentation means that it's ambiguous which cache a resource should be loaded into when it's prefetched across sites.
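The partitioning itself can be pictured as a cache keyed by (top-level site, frame site, URL) — a simplified sketch, not any browser's actual implementation:

```javascript
// Same URL under a different top-level site gets a separate entry, so b.test
// assets cached while browsing a.test reveal nothing when the user later
// visits b.test itself.
const cache = new Map();

function cachedFetch(topLevelSite, frameSite, url, fetchFn) {
  const key = `${topLevelSite} ${frameSite} ${url}`;
  if (!cache.has(key)) cache.set(key, fetchFn(url));
  return cache.get(key);
}
```

Prefetch's ambiguity is exactly that, at prefetch time, the browser doesn't yet know which top-level site the eventual use will fall under.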

Isn't it a best practice to white screen / "loading" until everything important is loaded?

How does one detect when everything is loaded? I've seen some websites break when UI interaction occurs before all js are loaded.

i would rather wait even seconds for a page to load than having my cache bruteforced
