
Cloudflare data still in Bing caches - neonate
https://news.ycombinator.com/item?id=13727279
======
Smerity
From the parent thread:

    
    
      The caches other than Google were quick to clear and we've not been able to find active data on them any longer.
      ...
      I agree it's troubling that Google is taking so long.
    

That's really the core issue here - the Cloudflare CEO singled out Google as
almost complicit in making their problem worse, whilst that exact issue is
prevalent amongst other indexes too.

The leaked information is hard to pinpoint in general, let alone amongst
indexes containing billions of pages.

I can understand the frustration - this is a major issue for Cloudflare and
it's in everyone's best interests for the cached data to disappear - but it's
not easy, and they shouldn't claim it is (or incorrectly state that "The
leaked memory has been purged with the help of the search engines" on their
blog post).

This is a burden that Cloudflare has placed on the internet community. Each of
those indexes - Google, Microsoft Bing, Yahoo, DDG, Baidu, Yandex, ... - has
to fix a complicated problem not of its own creation. They don't really have a
choice either, given that the leak contains personally identifiable information
- it really is a special sort of hell they've unleashed.

Having previously been part of Common Crawl and knowing many people at
Internet Archive, I'm personally slighted. I'm sure it's hellish for the
commercial indexes above to properly handle this let alone for non-profits
with limited resources.

Flushing everything from a domain isn't a solution - that'd mean _deleting
history_. For Common Crawl or Internet Archive, that's directly against their
fundamental purpose.

~~~
tptacek
If Google hadn't noticed and saved them from this bug, who can say how long
they might have continued spraying private data into the world's caches?
Apparently, this had been going on for months prior. Heartbleed was exposed
for years. This could have been too. Worse: malicious attackers could have
discovered and quietly exploited it. _For years_.

The very last people in the universe Cloudflare should be criticizing right
now are the Google security team.

~~~
yarou
The comment from eastdakota seems to be political in nature; some on here have
suggested he has an axe to grind with Google.

Why do humans politicize so much? That's one thing I'll never understand, and
one of the reasons why I refuse to become a manager.

~~~
fulafel
Any talk that is about Google's systemic influence or role is political by
definition, nothing wrong with political. People and companies have political
ideologies and it's good to discuss them.

~~~
yarou
Discuss them, sure.

But what does it really matter?

~~~
makmanalp
Well, it's easy to ignore politics until you're personally affected by it.

------
MichaelGG
I've had a fairly high opinion of CF, apart from their Tor handling and bad
defaults (Trump's website requires a captcha to view static content.) Yeah I'm
uncomfortable with them having so much power, but they seemed like a decent
company.

But their response here is embarrassingly bad. They're blaming Google? And
totally downplaying the issue. I really didn't expect this from them. Zero
self-awareness - or they believe they can just pretend it's not real and it'll
go away.

~~~
dcosson
Agree that it's a shame that it doesn't really feel like they're owning up to
how bad it was.

But I wonder if it will just mostly go away. Luckily for Cloudflare, this is a
pretty random sampling of people around the country and world. Unless someone
has put together a big data set from the caches and decides to leak it or
inform the victims, it seems like most people whose accounts do get taken over
from this will have no way to trace it back to this bug.

~~~
mirimir
For sure, there are assholes compiling cache data :(

------
kchoudhu
It's been pretty entertaining watching taviso's attitude towards CF go from
"we trust them" to "dude, you're a tool".

I kind of understand what CF is doing here: they've screwed up, there's no way
for them to clean it up, so all they can do now is deflect attention from the
magnitude of their screw up by blaming others for not working fast enough in
the hope that their fake paper multibillion dollar valuation doesn't take too
big a hit.

Still a dick move though. Maybe next time don't use a language without memory
safety to parse untrusted input.

~~~
CrLf
> Maybe next time don't use a language without memory safety to parse
> untrusted input.

Untrusted input is safely parsed by programs written in languages without
memory safety all the time. In fact, most language runtimes with memory safety
are implemented in languages _without_ memory safety.

What's to criticize here is parsing untrusted input _in the same memory space_
as sensitive information.

~~~
cbr
How would you rewrite websites to optimize them without parsing untrusted
input in the same memory space as sensitive information? The thing you're
trying to change (the HTML) can have PII or be otherwise sensitive.

(I used to work on Google's PageSpeed Service, and if it had had the same bug
I think we would have been in the same situation as CF is now.)

~~~
CrLf
> The thing you're trying to change (the HTML) can have PII or be otherwise
> sensitive.

Sure it can. But leaking PII from the thing you're parsing and leaking PII
from any other random request isn't the same thing.

I understand the performance implications (and the added effort) of sandboxing
the parser, but I'm arguing for it anyway. The mere presence of a man-in-the-
middle decrypting HTTPS and pooling cleartext data from many disparate
services in a single memory space is already questionable (something for
Cloudflare customers - and not Cloudflare itself - to think about) but adding
risky features into the mix shouldn't be done without as much isolation as
possible.

Let's face it: parsers are about the most likely place for this sort of
leakage to happen...

~~~
cbr
Actually, thinking more, we designed PSS to run in a sandbox specifically
because we were parsing untrusted input. But leaking content from one site
into responses from other sites would still be possible, because I think we
didn't reinitialize the sandbox on every request (way expensive) and each
server process handled many sites. Fix that, and then there's still the risk
of leaking things between sites via the cache.

It's definitely possible to fix this (new sandbox per request, cache is
fragmented by something outside of sandbox control) but I'm not sure the
service would make sense economically.

------
tonyztan
Why is Cloudflare underplaying this issue? All data that transited through
Cloudflare from 2016-09-22 to 2017-02-18 should be considered compromised and
companies should act accordingly.

~~~
mcphilip
>Why is Cloudflare underplaying the issue?

I suspect the random nature of the overflow plaintext spewed out into caches
will be difficult to leverage into a statistically significant attack against
any particular customer. If CF's bottom line is unlikely to be impacted, why
not downplay the issue and refer to it in the past tense?

There may be significant long lasting damage to CF's reputation amongst
vulnerability researchers, but that's a tiny subset of the population and
statistically insignificant to a company that ~10% of internet traffic flows
through.

~~~
int_19h
All of this assumes that no one found the vulnerability before. My
understanding is that, once you figure out what kind of request causes the
overflow, you can pretty much just spam CF with it, getting new garbage data
every time. If someone was deliberately doing that, they could have more data
than all the indexes combined. And the worst part is that we'll never know.

~~~
ghughes
Does CF redirect all non-HTTPS traffic to HTTPS? If not, the NSA could have
passively intercepted tons of leaked data, and all it would take is for one of
their analysts - people paid to find stuff like this - to notice a single
out-of-place leaked secret and trace it back to CF.

------
koolba
Rule #1 of breaches: you can't unbreach

At this point if you don't consider all data that was sent or received by
CloudFlare during the "weaponized" window compromised, you're lying to
yourself.

------
uladzislau
I briefly touched base with Cloudflare's Product Management and my impression
was that they were overconfident and snobbish in every respect, which is kind
of the opposite of what I'd expect from a company like this. Being humble
never hurts.

------
mhils
Does Cloudflare have complete logs to rule out that someone noticed this
before taviso and used it to massively exfiltrate data by visiting one of the
vulnerable sites repeatedly?

If they can't tell, someone may now be sitting on a lot of very juicy data,
far beyond what may be left in these caches.

~~~
cjbprime
I haven't seen an attempt by Cloudflare at claiming that this definitely
didn't happen. They may still be working on it. It's possible that the
question is basically unanswerable even with logs.

As you say, in the presence of uncertainty it's most prudent to assume that
this actually happened.

~~~
int_19h
They seem to be presenting some dubious calculations made to imply that it was
highly unlikely to happen.

The reason I consider them dubious is that anyone simply searching for the
name of some HTTP header in Google et al. could have stumbled onto this. I
don't find that at all unlikely over a timespan of 5 months.

~~~
Lazare
The odds that Google had the first team of researchers to trip over the bug
are low. But we know that they were the first team to disclose the
vulnerability, and the only reason not to disclose it is if you wanted to
exploit it.

So the key question really isn't "how likely was someone to find this", but
"how likely is it that Project Zero was the first". I think it's hard to
estimate odds, but I'd be surprised if it was even as high as 50%; there's too
many teams, individuals, freelancers, state actors, etc. actively engaged in
looking for this kind of thing.

~~~
true_religion
Many people probably tripped over the bug but didn't know what it was.

The data it reveals isn't guaranteed to be obviously private and exploitable.
It can just look like a valid but useless response, or an invalid and
corrupted one, depending on what you were looking for in the first place.

------
rdl
I really hope people don't lose sight of how helpful Project Zero has been in
finding ongoing vulnerabilities and making the Internet a better place.

There is a bit of tension between cloudflare and taviso over the timing of
notification, but that is vanishingly insignificant overall.

------
paulcole
Just please tell me the people who found the issue got their free t-shirts.

~~~
H4CK3RM4N
T-shirts. He only got one.

~~~
sneak
...and it had half of a customer logo on it, along the bottom edge.

------
sneak
Cloudflare's email to customers has been calling this a "memory leak", which
means something entirely different than a "secret data disclosure".

One causes swapping. The other causes a month of extra work.

------
dorianm
I'm compiling a list of affected domains (with data found in the wild):
[http://doma.io/2017/02/24/list-of-affected-cloudbleed-
domain...](http://doma.io/2017/02/24/list-of-affected-cloudbleed-domains.html)

If you find some samples with domain names / unique identifiers of domains
(e.g. X-Uber-...) you are welcome to contribute to the list:
[https://github.com/Dorian/doma/blob/master/_data/cloudbleed....](https://github.com/Dorian/doma/blob/master/_data/cloudbleed.yml)

~~~
zda
See also: [https://github.com/pirate/sites-using-
cloudflare](https://github.com/pirate/sites-using-cloudflare)

~~~
Operyl
That's beyond flawed, because he assumes any site that uses Cloudflare's NS is
using the proxied services, which is 100% _wrong_.

~~~
WalterGR
What is the correct way?

Honest question...

Out of hundreds of passwords I potentially need to reset, I'd like to
prioritize.

~~~
verroq
You have to resolve the DNS and see if it points to one of Cloudflare's
reverse proxies.

~~~
toyg
And even then, there's no guarantee. One could host a static shopwindow site
on one's own, and use CF for the actual backend of a mobile app under a
different domain that nobody knows about.

There is no real way to know what has leaked and from whom. The only ones with
real info are CF, and it's clear from the number of sites they've missed in
their purge requests that even they don't really know.

------
acqq
It seems that, due to Cloudflare's confusing disclosure, it's still not clear
what leaked and how. What I personally observed, just by following the
discussion and the links to some examples:

\- there is a smaller number of sites that used some of Cloudflare's special
features that allowed leakage for some months, according to what Cloudflare
said.

\- it seems the number of sites was much bigger for some days, according to
what Cloudflare said.

\- the data leaked is whatever passed through Cloudflare's TLS
man-in-the-middle servers -- not only the companies' data but also the users'
data, and not only data for the sites through which the leak happened, but
also data for other sites that just happened to pass through the same servers.
Both directions leaked, including visitors' location data, login data, etc. As
an example: imagine a bank that used Cloudflare TLS; the caches could contain
both the account balance reports (sent from the bank to its customers) and the
customers' login data (sent by the customers to the bank), even if the bank's
site never had the "special features" turned on. That's what I was able to see
myself in the caches (not for any bank, but for equivalent traffic).

~~~
buro9
This is a good reading of it.

To be clear, the SSL and caches are isolated from the process that handles
transformations of web pages and neither of those leaked anything.

All traffic that is "orange clouded" passes through the transformation layer
and _may_ have leaked by any of the pages on the sites that had this unique
set of features enabled (the cause) and also had broken HTML (the trigger) if
they happened to be in the memory immediately after the broken HTML.

Which means that a small number of sites (3,438 domains - cite jgc) were able
to leak the first bit of memory for requests located in memory after the page
request of a broken page on one of those small number of sites... and this
other memory could have been any other page that is proxied by Cloudflare.

Is it huge? Absolutely, because the leaked pages could have contained
anything, especially in the headers which will have been included.

Is it a lot of pages? The scale of Cloudflare means no matter how small the
fraction affected it adds up, so yes. The sum of pages that will have leaked
data is horrifying because even a single page is a page too many to leak.

Are you a customer, are you paranoid and want to know what to do? OK, then
change your origin server IP addresses, and expire your user sessions/cookies.
Beyond this, you will need to look at your own web application to determine
whether in the first bit of a response from your origin servers you include
sensitive data, and from that what you feel is an appropriate action.

The only thing I'm doing to my sites is working through an expiry of user
sessions. Even then, I think the chances that I was affected remain
vanishingly small but expiring sessions is the responsible thing for me to do.

Note: I work at Cloudflare but wasn't involved in this security incident
beyond helping to find data in caches. Additionally I run 300+ websites that
are all behind Cloudflare web proxy so I understand that perspective extremely
well.

~~~
acqq
> were able to leak the first bit of memory for requests

It's some _kilobytes_ of browser _requests or_ server _responses_ that are
leaked in the samples I have seen, if I remember correctly. Much more than
"the first bit."

Just to be clear.

~~~
buro9
Yes, apologies for my phrasing... "the first bit" didn't mean a computer bit;
it meant, colloquially, the first part of a web response (headers and body).

To be very precise, I think jgc mentioned that up to 4KB from the bounds of
the initial request could have been leaked, where a good section of that was
the internal server-to-server communication certs, the raw headers as visible
during the internal processing of the request, and then part of the response
body that follows... this may have been encrypted or compressed and could
appear as garbage.
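
The failure mode is easy to model in a toy sketch. This is an illustration of
the class of bug only, not Cloudflare's actual parser, and the header and
cookie values are invented:

```python
# Many requests share one buffer; a parser that runs past the end of its
# own request emits whatever happens to sit next to it in memory.
buffer = (
    b"GET /page HTTP/1.1\r\n\r\n<div>broken html"              # request being parsed
    b"Cookie: session=s3cr3t\r\nCF-Origin-IP: 10.0.0.5\r\n"    # unrelated traffic
)
request_end = buffer.find(b"broken html") + len(b"broken html")

def broken_parser(buf: bytes, end: int, overread: int = 4096) -> bytes:
    # A correct parser stops at `end`; this one keeps reading up to 4KB past it.
    return buf[:end + overread]

leaked = broken_parser(buffer, request_end)
# Adjacent traffic -- cookies, internal headers -- ends up in the output.
```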

The focus for site owners on Cloudflare should be on "What do I put in headers
that may be sensitive?" or "What URLs do I regard as being
secret/unadvertised?".

Typically that will be session cookies and access_tokens. Hence my advice,
expire and roll all sessions.

Headers include the Cloudflare internal headers, and so includes origin IP
addresses too, so if those are secret for you (i.e. you have previously been
the target of a DoS and are using Cloudflare to hide those IPs) then you'll
want to change your origin IP addresses too. Though if you have been the
target of a DoS then you probably should use iptables to only allow web
traffic from Cloudflare IP addresses.
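
A sketch of generating those iptables rules from a range list. The ranges
shown are examples; substitute the current list from cloudflare.com/ips:

```python
# Emit iptables rules that accept web traffic only from Cloudflare's
# ranges and drop everything else on the web ports.
EXAMPLE_RANGES = ["104.16.0.0/13", "172.64.0.0/13"]  # examples, not the full list

def iptables_rules(ranges, ports=(80, 443)):
    rules = [
        f"iptables -A INPUT -p tcp -s {net} --dport {port} -j ACCEPT"
        for net in ranges
        for port in ports
    ]
    # Default-deny on the web ports after the allow rules.
    rules += [f"iptables -A INPUT -p tcp --dport {port} -j DROP" for port in ports]
    return rules

for rule in iptables_rules(EXAMPLE_RANGES):
    print(rule)
```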

------
kijin
Millions of domains are on Cloudflare. We can't tell how many of them were
affected.

Either we can search for obvious strings like X-Uber-* and try to scrub them
one by one, or we can just nuke the caches for all the domains that turned on
the problematic features (Scrape Shield, etc.) anytime between last September
and last weekend. Cloudflare should supply the full list to all the known
search engines including the Internet Archive. Anything less than that is
gross negligence.

If Cloudflare doesn't want to (or cannot) supply the full list of affected
domains, an alternative would be to nuke the caches for all the domains that
resolved to a Cloudflare IP [1] anytime between last September and last
weekend. I'm pretty sure that Google and Bing can compile this information
from their records. They might also be able to tell, even without Cloudflare's
cooperation, which of those websites used the problematic features.

[1] [https://www.cloudflare.com/ips/](https://www.cloudflare.com/ips/)

~~~
josecastillo
Nuking the caches is one thing, but what about services like the Internet
Archive whose job it is to hang on to these pages? Pages with leaked data are
clearly difficult to identify; removing the leaked data without nuking the
document may be impossible, at least in an automated fashion. Are we supposed
to erase five months of history from the affected domains?

This CloudFlare breach seems to have put a lot of people in a tough spot, but
it feels like it's put archivists in an impossible position.

~~~
kijin
I agree that nuking entire domains would be bad for the Internet Archive. But
I don't think it would be overwhelmingly difficult, nor controversial, to
identify and remove the vast majority of "contaminated" documents. This
applies to the Internet Archive as well as major search engines.

First, we're talking about raw memory pages, not merely malformed HTML. Those
memory pages might contain valid HTML, but most of the sensitive information
is in the headers, not HTML markup. It won't be very difficult to write a
script to identify documents where random headers and POST data have been
inserted where they don't belong, or where the markup is so obviously invalid
(even compared to similar documents from the same site) that there is a high
probability of contamination. Having a full list of contaminated domains would
obviously help a lot, because we'll only have to deal with thousands of
domains instead of millions.
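
Such a script might start from a handful of leak markers. A rough sketch; the
marker list below is illustrative, not exhaustive:

```python
import re

# Flag cached documents where raw HTTP headers or form data appear in
# places where markup should be.
LEAK_MARKERS = [
    re.compile(rb"(?m)^(Cookie|Set-Cookie|Authorization):\s"),
    re.compile(rb"\bCF-[A-Za-z-]+:"),       # Cloudflare internal headers
    re.compile(rb"[&?]password=[^&\s]+"),   # POST/query data in the clear
]

def looks_contaminated(document: bytes) -> bool:
    """True if a cached page appears to contain leaked header/POST residue."""
    return any(p.search(document) for p in LEAK_MARKERS)
```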

Second, contaminated documents by definition contain information that is NOT
what the publisher intended to be crawled, indexed, or archived. So there
should be less resistance to removing them.

Finally, most of the contaminated domains used features such as Scrape Shield
that were intended to deter archival. It's as if the domain had a robots.txt
that said "User-agent:* Disallow:/". I'm not sure whether it's even possible
for the Internet Archive to archive such domains. If they can, maybe they've
been doing it against the publisher's wishes. If they can't, well, there's no
problem to begin with.

~~~
greglindahl
Archives don't delete stuff, nor do they have much capacity for computation on
their archived data. Whereas if blekko still existed as a search engine, I'd
just push code to refuse to show cached pages or snippets containing text that
likely indicates the CloudFlare problem. 15 minutes of work, and the
underlying data would expire fully in a couple of months.

So I completely disagree with your speculation about what's easy or hard.
(Note that I've worked at both a search engine and an archive.)

------
foobarbecue
After reading this, I'm considering switching from cloudflare for my DNS
servers. Recommend a similar free service?

~~~
buro9
DNS only customers were totally unaffected by this web proxy bug.

~~~
4ad
Sure, but maybe he doesn't want to do business with Cloudflare anymore?

~~~
foobarbecue
Right. The bug doesn't affect me, but I don't like how they responded to this.

------
kfrzcode
IANAL --- what, if any, legal precedent or structure is there for what happens
to CF if, say, 1.5 billion users are hacked and money shifts dramatically as a
result, or some other reasonably thinkable "hypothetical" plays out? We, the
Internet-at-large, have no certain idea at this point whether the incident in
question has or hasn't happened. I'm saying there have got to be negligence
charges or something if money is lost - that's how capitalism in America
works... but this is a global problem.

If this is how 2017 is pacing, we've got a long year ahead. This is an
insanely interesting time to be alive, let alone at the forefront of the
INTERNET.

Fellow Hackers, I wish you all the best 2017 possible.

------
flylib
I lost all respect for Cloudflare

------
bitmapbrother
eastdakota 19 hours ago (Cloudflare CEO)

>Google, Microsoft Bing, Yahoo, DDG, Baidu, Yandex, and more. The caches other
than Google were quick to clear and we've not been able to find active data on
them any longer. We have a team that is continuing to search these and other
potential caches online and our support team has been briefed to forward any
reports immediately to this team.

>I agree it's troubling that Google is taking so long. We were working with
them to coordinate disclosure after their caches were cleared. While I am
thankful to the Project Zero team for their informing us of the issue quickly,
I'm troubled that they went ahead with disclosure before Google crawl team
could complete the refresh of their own cache. We have continued to escalate
this within Google to get the crawl team to prioritize the clearing of their
caches as that is the highest priority remaining remediation step.

taviso 6 hours ago (Tavis Ormandy)

>Matthew, with all due respect, you don't know what you're talking about.

>[Bunch of Bing Links]

>Not as simple as you thought?

~~~
Rapzid
Cloudflare's judgment of the situation is obviously compromised. I'm not
saying this much thought went into Google's disclosure, but they were 100% on
point in disclosing. There is no way to purge this data from the entire net;
it was important for everyone to know what happened as soon as the leak was
plugged.

------
remx
If anyone wants to, they can access (cached/archived) pages from any number of
services listed here:
[https://en.wikipedia.org/wiki/List_of_Web_archiving_initiati...](https://en.wikipedia.org/wiki/List_of_Web_archiving_initiatives#References)

My personal favorites are:

\- [https://archive.fo](https://archive.fo)

\- [https://archive.org/web/web.php](https://archive.org/web/web.php)

\- [https://historio.us](https://historio.us)

\- [https://timetravel.mementoweb.org](https://timetravel.mementoweb.org)

------
sersi
I have a question which might be stupid.

What happens for sites using Full SSL (a certificate between cloudflare and
the user and a certificate between cloudflare and the server), could any
information from ssl pages have been leaked?

~~~
AlexandrB
My understanding is yes - if the https page was in CF memory (for decrypting
from server and encrypting for user) its contents could have been dumped to a
cache of one of the affected sites.

~~~
fiedzia
According to CF, they store certificates and cached content separately, so
they were not exposed.

~~~
oldsj
Right. Clear text data was exposed, not the certificate itself. But who needs
a certificate when you have the cleartext?

------
patcheudor
Also still in Yahoo caches with the same leaks found in both Yahoo and Bing. I
posted the URLs to the linked thread.

------
djhworld
Can someone explain why Cloudflare parses the HTML in the first place?

Is there some sort of information-extraction feature or service they offer? I
don't get it.

~~~
sudhirj
HTML, CSS, JS optimization service. Also email obfuscation and scraping
shield.

------
Rapzid
As much as CF would like people to believe otherwise (oh, and look at our
awesome response time and automation!), this cat can't go back in the bag.
They should step away from the mic and contact a PR firm that specializes in
salvage jobs.

If I were Google I would hit back hard. They probably won't; I would just stop
and not bother trying to clean up the data unless under legal pressure. It's
out there, it's too late.

------
spyder
And the "irony" is that some of the data may have leaked only to "bad bots"
and to IPs with "a poor reputation (i.e. it does not work for most visitors)."

From their blog: [https://blog.cloudflare.com/incident-report-on-memory-
leak-c...](https://blog.cloudflare.com/incident-report-on-memory-leak-caused-
by-cloudflare-parser-bug/)

------
rickdg
Any word on the possibility of credit card numbers having been exposed?

~~~
jlgaddis
Yes, it's possible. Any data that passed through Cloudflare was possibly
exposed.

------
7ewis
How are people finding this info?

Is it possible to find if anything leaked from my site behind Cloudflare is in
the caches?

------
grogenaut
To help with this, I made
[https://bleed.cloud/index.html](https://bleed.cloud/index.html)

It lets you check domains quickly without downloading and grepping.

------
skrebbel
Folks, can we please stop downvoting the parent of the linked comment? It's of
no use when it disappears from HN.

~~~
dang
For those who don't know, to read a greyed-out comment you can click on its
timestamp to go to its page.

------
pvg
This is already a comment on the site, in the relevant thread. Seems a little
meta as a post.

~~~
tptacek
For what it's worth, in the general case, I think you're right. The only thing
that makes this case different to me is that the one comment CF's CEO has
decided to make on HN takes a potshot at Google, which is newsworthy --- but
if we took every notable comment on HN and put it on the front page, that's
all the front page would be.

~~~
pvg
Right, I'm mostly whining about the method - link to grumpy mid-thread
comment, title that misleads about the intent of the post, etc.

The thing itself seems somewhat newsworthy but personally, I'd rather hear the
CF people defend the TLS-MITM-as-service idea itself and get yelled at for
that. The fact they're being weaselly and insufficiently contrite seems
secondary.

~~~
tptacek
That makes sense. I'm much less interested in the phenomenon of CF-as-global-
TLS-MITM than I am in how responsible they are being with the position they've
accrued, and handling vulnerabilities like this is a big part of that. So the
fact that CF is in a spat with Google is a big deal to me, but I understand
that's not the case for everyone.

------
pikzen
Company engaging in practices that undermine internet security and MITM their
users found to be doing stupid shit.

Not exactly breaking news. At some point, maybe people will realise that CF is
actively making the internet worse and less secure, and that it should be
treated as nothing more than a wart to be removed.

~~~
manigandham
Every CDN that handles TLS traffic is a MITM.

~~~
gkop
What other CDN makes it easy to use plaintext between the CDN and origin, yet
use a secure connection between the CDN and the end user, and has the nerve to
market this as a feature called "Flexible SSL"?

Edit: I wasn't very clear. GP is wrong saying MITM is "wrong" for its own
sake. I think Cloudflare is harmful for other reasons though.

~~~
manigandham
That does seem to be an accurate name; it _is_ a feature that's offered and up
to the site operator to enable, and yes, it's unfortunate that it potentially
gives a false sense of security to end users. However, in almost every case
it's still better than both sides of the connection being unsecured.

I'm not sure what this has to do with all CDNs being MITM operators when
caching secure content.

~~~
gkop
At the very least, Cloudflare should do a better job of discouraging use of
Flexible SSL. People that opt in to Flexible SSL should know what they are
doing.

I edited my comment.

------
gear54rus
I just wonder when we can stop beating this dead horse here...

