> The examples we're finding are so bad, I cancelled some weekend plans to go into the office on Sunday to help build some tools to cleanup. I've informed cloudflare what I'm working on. I'm finding private messages from major dating sites, full messages from a well-known chat service, online password manager data, frames from adult video sites, hotel bookings. We're talking full https requests, client IP addresses, full responses, cookies, passwords, keys, data, everything.
This is huge.
I mean, seriously, this is REALLY HUGE.
You have a function that strips all colons from your input. For some reason, in certain cases, your code misbehaves: instead of replacing each colon with an empty string, you accidentally replace it with other data you have in memory. So now all the colons in your input have been replaced with data that you shouldn't have touched, and whoever sent you the input gets back that input plus more data they shouldn't be able to see.
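To make the analogy concrete, here is a rough sketch in Python. Everything in it is made up for illustration (`MEMORY`, `strip_colons_buggy`, the 3-byte leak size), and since Python won't actually read out of bounds, "memory" is just one flat byte string where the caller's input happens to sit next to somebody else's data.

```python
# Hypothetical flat "memory": the caller's input followed by another user's data.
USER_INPUT = b"a:b:c"
OTHER_DATA = b"SECRET"
MEMORY = USER_INPUT + OTHER_DATA


def strip_colons(data: bytes) -> bytes:
    """Correct version: every colon is replaced with nothing."""
    return data.replace(b":", b"")


def strip_colons_buggy(memory: bytes, start: int, length: int) -> bytes:
    """Buggy version: each colon is replaced with bytes that live past the input."""
    out = bytearray()
    leak_pos = start + length  # first byte after the caller's input
    for i in range(start, start + length):
        byte = memory[i:i + 1]
        if byte == b":":
            # Should append nothing; instead copies 3 bytes of foreign data.
            out += memory[leak_pos:leak_pos + 3]
            leak_pos += 3
        else:
            out += byte
    return bytes(out)


clean = strip_colons(USER_INPUT)
leaky = strip_colons_buggy(MEMORY, 0, len(USER_INPUT))
print(clean)  # b'abc'
print(leaky)  # b'aSECbRETc'
```

The caller asked for `abc` and got `SECRET` spliced into the reply, which is the whole problem in miniature.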
And Google in this case caches those output strings.
Imagine I'm having a chat on some website X, which uses Cloudflare. Cloudflare acts as a man in the middle, meaning my request, and the response, likely pass through its memory at some point to allow me to communicate with X.
Later, a Google bot comes along and requests a page from site Y. Because of this bug, random bits of memory that were left around on the Cloudflare server get inserted into the response to the bot's request. Those bits of memory could be from anything that's gone through that server in the past, including my conversations on website X. The bot then assumes that the content that Cloudflare spits out for website Y is an accurate representation of website Y's contents, and it caches those contents. In this way, my data from website X ends up in Google's cached version of website Y.
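Here is a toy model of that sequence, if it helps. It is only a sketch under my own assumptions: `LeakyProxy`, its reused scratch buffer, and the `google_cache` dict are all hypothetical stand-ins, not how Cloudflare or Google actually work.

```python
class LeakyProxy:
    """Toy reverse proxy whose scratch buffer is reused between requests."""

    def __init__(self, size: int = 64) -> None:
        self.buffer = bytearray(size)

    def handle(self, body: bytes, buggy: bool = False) -> bytes:
        # The new response overwrites the start of the buffer; whatever the
        # previous request left behind after that point is never cleared.
        self.buffer[:len(body)] = body
        if buggy:
            # Over-read: return the whole buffer, minus unused zero bytes.
            return bytes(self.buffer).rstrip(b"\x00")
        return bytes(self.buffer[:len(body)])


proxy = LeakyProxy()

# 1. My chat on website X passes through the proxy's memory.
chat_reply = proxy.handle(b"chat on X: my password is hunter2")

# 2. A crawler requests website Y and happens to trigger the bug.
google_cache = {}
google_cache["site-y"] = proxy.handle(b"<p>site Y</p>", buggy=True)

print(google_cache["site-y"])  # b'<p>site Y</p> password is hunter2'
```

My own reply came back fine, but the tail of my conversation with X is now stored under site Y in someone else's cache.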
Then Google accesses the website as the crawler (user B), and their headers and data are saved in M2. However, Google triggered a bug and now has access to M1 as well. So now Google sees their own headers + my data + other garbage.
Google gets this HTML and caches it and that's how it ends up there.
"We leaked information from Customer A to Customer B by accident" is the first order problem.
But the existence of web caches means that all that private information of customer A is potentially fucking everywhere now.
How do you even clean this up? How do you even start?
So a request sent to Cloudflare customer A's site could return data from Cloudflare customer B, including data that B thought was only being served via https to authenticated users of B.
Apparently 7xx sites had this enabled, but that affected 4000ish other sites that happened to be on the same infrastructure.
For certain other sites with malformed HTML, there was a bug that caused the server to grab random data (headers and body) from memory and include it in the body of the response HTML. (Some HTML rewriting product that Cloudflare offered was broken and it ran on the same servers.)
This stuff got sent to people's browsers and also to web indexers like Google or Bing.
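One common way this class of bug happens is an end-of-buffer check that uses `==` when the cursor can jump past the end. The sketch below is a toy, not Cloudflare's code: `rewrite`, the two-byte `<` token rule, and the sample data are all assumptions made up to show the failure mode.

```python
def rewrite(memory: bytes, start: int, end: int, equality_check: bool) -> bytes:
    """Toy rewriter that copies the page stored at memory[start:end]."""
    out = bytearray()
    p = start
    while p < len(memory):  # hard stop so the simulation itself stays in bounds
        at_end = (p == end) if equality_check else (p >= end)
        if at_end:
            break
        byte = memory[p]
        out.append(byte)
        if byte == ord("<"):
            if p + 1 < end:
                out.append(memory[p + 1])  # copy the tag-name byte if present
            p += 2  # bug: always skips two bytes, even if '<' was the last byte
        else:
            p += 1
    return bytes(out)


SECRET = b" Cookie: session=abc123"  # hypothetical leftover from another request
GOOD = b"<p>hi"                      # well-formed page
BAD = b"<p>hi<"                      # malformed page ending in a bare '<'

ok = rewrite(GOOD + SECRET, 0, len(GOOD), equality_check=True)
safe = rewrite(BAD + SECRET, 0, len(BAD), equality_check=False)
leak = rewrite(BAD + SECRET, 0, len(BAD), equality_check=True)

print(ok)    # b'<p>hi'
print(safe)  # b'<p>hi<'
print(leak)  # b'<p>hi<Cookie: session=abc123'
```

With well-formed input the `==` check works, which is how a bug like this can sit unnoticed. Only the malformed page makes the cursor hop over `end`, and from then on the loop keeps copying whatever is next in memory.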
Google lets you search for stuff and will also show you the original page that it scraped, making it easy to find this data.
Edit: Also you may be seeing more headers in examples because headers are easier to search for.
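To show why headers are the easy thing to look for: they follow a rigid `Name: value` shape, so a simple pattern finds them in otherwise random text. This is just a sketch; the list of header names and the sample page are my own picks, not a real search query anyone used.

```python
import re
from typing import List, Tuple

# Hypothetical shortlist of header names worth flagging in a cached page.
HEADER_PATTERN = re.compile(
    r"\b(set-cookie|cookie|authorization|x-forwarded-for):[ \t]*([^\r\n]+)",
    re.IGNORECASE,
)


def find_leaked_headers(text: str) -> List[Tuple[str, str]]:
    """Return (header name, value) pairs that look like raw HTTP headers."""
    return [
        (m.group(1).lower(), m.group(2).strip())
        for m in HEADER_PATTERN.finditer(text)
    ]


# Made-up cached page with request headers spliced in after the real content.
cached_page = (
    "<p>site Y</p>\r\n"
    "Host: example.com\r\n"
    "Cookie: session=abc123\r\n"
    "X-Forwarded-For: 203.0.113.7\r\n"
)

hits = find_leaked_headers(cached_page)
print(hits)  # [('cookie', 'session=abc123'), ('x-forwarded-for', '203.0.113.7')]
```

Leaked message bodies have no such fixed shape, so they are much harder to find even if there are just as many of them out there.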
Since anyone can put a broken page behind Cloudflare, all you need to do is request your own broken page through Cloudflare and start collecting the random "secure" data that comes back.