What would you like to see? The SAFE_CHAR logging allowed us to get data on the rate, which is how I got the % of requests figure.
But they're probably aware of this issue and know enough to go looking to purge their caches.
Do you care to share or elaborate on this?
But I guess there is really no incentive for anyone in particular to do anything about this, because it provides a kind of perverse safety in numbers. "It's not just our website that had this issue, it's, like, everyone's shared problem." The same principle applies to uber-hosting providers like AWS and Azure, as well as those creepy worldwide CDNs.
Interestingly, it seems this is one of the cases where using a smaller provider with the same issue would really make you better off (relatively speaking) because there would be fewer servers leaking your data.
The web simply doesn't scale. The only way to fix DDoS reliably is peer-to-peer protocols. Which hardly ever happens because our moronic ISPs believed nobody needed upload. Or even a public IP address.
you can argue "if everything was symmetric, then traffic patterns would be different" and you might be right, but that's not how the market went or how the "internet" started.
the client-server paradigm drove traffic patterns, and there was never any market demand for, or advantage to be gained by, ignoring it.
Yes, traffic patterns at the time were heavily slanted towards downloads. I know about copper wires and how download and upload limit each other. Still, setting that situation in stone was very limiting. It's a self-fulfilling prophecy.
You don't want to host your server at home because you don't have upload. The ISP sees nobody has servers at home so they conclude nobody needs upload. Peer-to-peer file sharing and distribution is slower than YouTube because nobody has any upload. Therefore everybody uses YouTube, and the ISP concludes nobody uses peer-to-peer distribution networks.
And so on and so forth. It's the same trend that effectively forbade people from sending e-mail from home (they have to ask a big shot provider such as Gmail to do it for them, with MITM spying and advertisement), or the rise of ISP-level NAT, instead of giving everyone a public IPv6 address like they all deserve (including on mobile).
There is a point where you have to realise the internet is increasingly centralised at every level because powerful special interests want it to be that way.
Regulation is what we need. Net neutrality is a start. Next in line should be mandated symmetric bandwidth, no ISP-wide firewall (the local router can have safe default settings), public IP (v4 or v6) for everyone, and no restriction on usage patterns (the ISP should not be allowed to forbid servers). Ultimately, our freedom of expression and freedom of information depends on this. They are messing with human rights.
And because IP multicast doesn't work over the internet. If it did, even if merely to some limited extent, some asymmetries would be far easier to stomach.
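To illustrate in Python what multicast looks like from an application's point of view (the API side is trivial; what's missing is inter-domain multicast routing on the public internet), here is a minimal receiver sketch. The group address and port are arbitrary picks for the example, not anything from this thread.

    # Minimal multicast receiver. On today's internet this only works on a LAN
    # or inside a network whose routers actually forward multicast; inter-domain
    # multicast routing simply isn't deployed, which is the point above.
    import socket
    import struct

    GROUP = "239.1.2.3"   # arbitrary administratively-scoped group
    PORT = 5000

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", PORT))

    # Join the group (the kernel sends an IGMP membership report to the local router).
    mreq = struct.pack("4sl", socket.inet_aton(GROUP), socket.INADDR_ANY)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)

    while True:
        data, sender = sock.recvfrom(65535)
        print(f"{sender[0]}: {len(data)} bytes")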
It may not have been how the market went but it definitely was how the internet got started.
The traffic patterns for higher upstream aren't there because they can't be there.
Sharing here in case anyone else finds it useful
(warning - it's 1.1MiB gzipped / 2.4MiB uncompressed)
any domains outside the top 1 million are omitted
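If you just want to know whether your own domains are on it, a quick Python check like this works; it assumes the list is a gzipped text file with one domain per line, and the filename below is only a placeholder.

    # Check your own domains against the shared list. Assumes one domain per
    # line in a gzipped text file; the filename is a placeholder.
    import gzip

    MY_DOMAINS = {"example.com", "www.example.com"}  # replace with your domains

    with gzip.open("cloudflare-domains.txt.gz", "rt") as fh:
        listed = {line.strip().lower() for line in fh if line.strip()}

    for d in sorted(MY_DOMAINS):
        print(d, "LISTED" if d.lower() in listed else "not in list")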
Potentially huge amounts of stuff might be exposed, but I have some assurances that "the practical impact is low" from someone I trust, so I think it's just a lot of random data. I'd still rotate all credentials which passed through Cloudflare in the past N months (and if I were a big consumer site NOT on Cloudflare, I might change end user passwords anyway, due to re-use), but I don't think it will be the end of the world.
In terms of the caching, knowing the broken sites tells you where to look in the caches after the fact, but do you have any idea of whose data was leaked? Presumably 2 consecutive requests to the same malformed page could/would leak different data.
Wouldn't the second request be served from the CDN cache? Since for Cloudflare that particular page is a valid cached page, it would send you that same page on the second request.
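For what it's worth, you can usually see whether a response came out of Cloudflare's edge cache from the CF-Cache-Status response header (HIT / MISS / EXPIRED); note that Cloudflare doesn't cache HTML by default, so the header may be absent on ordinary pages. A quick Python check (the URL is a placeholder):

    # Request the same URL twice and print CF-Cache-Status, which indicates
    # whether Cloudflare's edge cache answered. The URL is a placeholder.
    import urllib.request

    URL = "https://example.com/some-page"

    for attempt in (1, 2):
        with urllib.request.urlopen(URL) as resp:
            status = resp.headers.get("CF-Cache-Status", "<absent>")
            print(f"request {attempt}: HTTP {resp.status}, CF-Cache-Status: {status}")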
1. every tag pattern that triggers the bug(s)
2. which broken pages with that pattern were requested at an abnormally high frequency or had an unusually short TTL (or some other useful heuristic)
3. on which servers, and at what time, in order to tell
4. whose data lived on the same servers at the same time as those broken pages
to even begin to estimate the scope of the leak. And that doesn't even help you find who planted the bad seeds.
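To make that scoping exercise slightly more concrete: if you do have saved cache snapshots to dig through, a rough first pass in Python is just grepping them for fragments that look like leaked proxy memory (header and cookie strings sitting where page markup should be). The patterns below are illustrative guesses, not the actual bug signatures.

    # Crude scan of saved cache dumps for artifacts that look like leaked proxy
    # memory. The regexes are illustrative guesses, not the real leak signatures.
    import re
    import sys
    from pathlib import Path

    ARTIFACTS = re.compile(
        rb"(authorization:\s*bearer|set-cookie:|x-forwarded-for:|cf-ray:)",
        re.IGNORECASE,
    )

    for path in Path(sys.argv[1]).rglob("*"):
        if not path.is_file():
            continue
        hits = ARTIFACTS.findall(path.read_bytes())
        if hits:
            print(f"{path}: {len(hits)} suspicious fragment(s)")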
Exactly which search engines and cache providers did you work with to scrub leaked data?
ex: Right now there is an easily found Google-cached page with OAuth tokens for a very popular fitness wearable's Android API endpoints
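If you want to spot-check whether Google still serves a cached copy of one of your own URLs (so you know what to ask to have purged), the webcache URL format Google has historically used makes that easy. Automated fetches may get blocked, so treat this Python snippet as a manual helper; the page URL is a placeholder.

    # Spot-check Google's cache for a given URL. The webcache URL format is the
    # one Google has historically used; the page is a placeholder, and
    # automated requests may be blocked.
    import urllib.error
    import urllib.parse
    import urllib.request

    page = "https://example.com/api/session"
    cache_url = ("https://webcache.googleusercontent.com/search?q=cache:"
                 + urllib.parse.quote(page, safe=""))

    req = urllib.request.Request(cache_url, headers={"User-Agent": "Mozilla/5.0"})
    try:
        with urllib.request.urlopen(req) as resp:
            print(f"cached copy still served (HTTP {resp.status})")
    except urllib.error.HTTPError as exc:
        print(f"no cached copy served (HTTP {exc.code})")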
So: all services that have one or more domains served through Cloudflare may be affected.
The consensus seems to be that no one discovered this before now, and no bad guys have been scraping this leak for valuable data (passwords, OAuth tokens, PII, other secrets). But the data was still saved in web caches all over the world. So the bad guys are now probably after those. Though I don't know how much 'useful' data they would be able to extract, and what the risks for an average internet user are.
This is literally as bad as it gets, and anyone trying to palliate the situation has something to sell you. You'd have to be an idiot to think that $organization (public, private, or shadow) doesn't have automated systems to check for something as stupid simple as this by querying resources at random intervals and searching for artifacts.
Someone found it. Probably more than one someone. Denial won't help.
And of course all other sites that are not in the Alexa 10k are not in this list (if they are not on some of the other lists used; you can see the sources of the lists in the README of the GitHub repo).
Cloudflare has a lot of customers who only use the free DNS service, for example.
You're not exposed if you never sent traffic through their proxies; for instance, if you somehow only used them for DNS.
The DNS service is essentially free. It's an upgrade from most registrars' built-in DNS. It's a pretty robust solution, really -- global footprint, DNSSEC, fully working IPv6, etc.
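A quick way to tell whether a given hostname was actually proxied (and therefore potentially exposed), rather than DNS-only, is to resolve it and compare the address against Cloudflare's published IPv4 ranges. Rough Python sketch; the hostname is a placeholder.

    # Rough check of whether a hostname sits behind the Cloudflare proxy
    # (as opposed to DNS-only): resolve it and compare against Cloudflare's
    # published IPv4 ranges. The hostname is a placeholder.
    import ipaddress
    import socket
    import urllib.request

    HOST = "www.example.com"

    with urllib.request.urlopen("https://www.cloudflare.com/ips-v4") as resp:
        cf_nets = [ipaddress.ip_network(line.strip())
                   for line in resp.read().decode().splitlines() if line.strip()]

    addr = ipaddress.ip_address(socket.gethostbyname(HOST))
    proxied = any(addr in net for net in cf_nets)
    print(f"{HOST} -> {addr}: {'Cloudflare IP space (proxied)' if proxied else 'not Cloudflare IP space'}")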
My point is, the actual number of impacted customers was much smaller than the entire set of Cloudflare customers. There are lists in this thread that still reference hundreds of thousands (millions?) of sites, and that's just wrong.
(I agree on your first point though; I was confused about the nature of the proxy bug at first).
I guess there is a long tail of pages on the internet whose primary purpose is to be crawled by google and serve as search landing pages - but again, if I had a bug in the HTML in one of my SEO pages that caused googlebot to see it as full of nonsense, I'd see that in my analytics because a page full of uninitialized nginx memory is not going to be an effective pagerank booster.
Or is the problem that one domain can trigger the memory leak, and another (unpredictable) domain is the "victim" that has its data dumped from memory?