Hacker News
Helping to Build Cloudflare, Part 2: The Worst Two Weeks (cloudflare.com)
158 points by jgrahamc 6 months ago | 20 comments

The video linked in another comment is terrific. (https://www.rsaconference.com/videos/inside-cloudbleed)

Particularly the value of getting recurring core dumps down to zero, so that any new ones stand out enough to be found and fixed. Reducing log/stats noise has been a traditional go-to for me.

Moving sshd to a non-default port, for example, initially sounds like useless security through obscurity. But it then makes unusual ssh access stand out and get noticed.

The openness is laudable, but I'd seen this phrase from them before around cloudbleed: "We didn’t find evidence of exploitation"

It just doesn't mean much. They had something like 2 weeks worth of logs. And further, they never explicitly stated that the logs had enough context to show whether it was being exploited. That is, are there any fields in the log files that could distinguish normal URI accesses from malicious ones?

I'm curious if that's all deliberately careful wording because they know that they "don't know".

Indeed, "Absence of evidence is not evidence of absence."

A popular quote that's usually wrong.

It's true iff you haven't made any effort to find evidence whatsoever OR there's a zero percent chance that you would find evidence even if it existed. This is a very rare circumstance, and one that doesn't apply in this case.

There were some logs that, had there been exploitation, had a non-zero chance of providing evidence of said exploitation. They looked in the logs, and there's a non-zero chance that, if that evidence were in those logs, they would have seen it.

P(exploitation | they had the logs, looked, and found nothing) < P(exploitation)

That's the definition of evidence.
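The inequality above can be made concrete with a toy Bayesian update. All the numbers here are made up purely for illustration; the point is only that a search which *could* have found evidence, but didn't, must lower the probability of exploitation below the prior.

```python
# Toy Bayesian update: a search that could have found evidence of
# exploitation, but found nothing, lowers P(exploitation).
# All probabilities below are made-up illustrative values.

prior = 0.10                 # P(exploitation) before looking at the logs
p_find_if_exploited = 0.30   # P(evidence found | exploitation), non-zero
p_find_if_not = 0.0          # assume no false positives, for simplicity

# Total probability of finding no evidence
p_no_evidence = ((1 - p_find_if_exploited) * prior
                 + (1 - p_find_if_not) * (1 - prior))

# Bayes' rule: P(exploitation | no evidence found)
posterior = (1 - p_find_if_exploited) * prior / p_no_evidence

print(f"prior     = {prior:.3f}")
print(f"posterior = {posterior:.3f}")  # strictly less than the prior
```

As long as `p_find_if_exploited` is greater than zero, the posterior is strictly below the prior, which is exactly the sense in which absence of evidence is (weak) evidence of absence.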

The blog I linked to (about the data analysis we did) talks about the different data sources we used. We had the sampled logs (which you mention), plus statistical data on traffic through individual sites, plus crash logs. A lot more than "2 weeks worth of logs".

That's helpful, thank you. However, I'm not conjuring the "2 weeks worth of logs" phrase out of thin air:


"We have the logs of 1% of all requests going through Cloudflare from 8 February 2017 up to 18 February 2017 (when the vulnerability was patched)...Requests prior to 8 February 2017 had already been deleted."

Sorry if it came across as me implying you pulled that from thin air.

One of the most interesting things we had was the core dumps. Randomly (depending on memory state) we'd crash rather than dump out memory in the HTTP response. We had all that data going back over the entire period. That gave us a lot of confidence this hadn't been exploited, because we could see the rate of crashes, plus we could see the actual core dumps to see the memory state when the crash happened.

Ahh. That's more comforting. So any wide-scale deliberate exploitation would have resulted in an obvious spike of crash dumps.
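The "obvious spike" idea can be sketched as simple anomaly detection over daily crash counts. The data and threshold below are hypothetical, not Cloudflare's actual numbers or method: flag any day whose count exceeds the baseline mean by more than three standard deviations.

```python
import statistics

# Hypothetical daily core-dump counts. A mass-exploitation attempt
# triggering the bug at scale would show up as a spike well above
# the normal background crash rate.
daily_crashes = [3, 5, 4, 2, 6, 4, 3, 5, 41, 4]  # day 8 is anomalous

baseline = daily_crashes[:7]          # treat the first week as baseline
mean = statistics.mean(baseline)
stdev = statistics.pstdev(baseline)

# Flag days more than 3 standard deviations above the baseline mean
spikes = [i for i, n in enumerate(daily_crashes) if n > mean + 3 * stdev]
print(spikes)  # -> [8]
```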

That leaves whatever the scale of passively trawling, say, Google's cache might have been. Unknown, but probably not huge.

Right. Which was one of the reasons we used YARA on all the data we pulled from Google and other caches, so we could extract the leaked data and categorize it. Then we called all the affected customers (I did a lot of those calls personally) and offered to give them the leaked information so they could look for exploitation. The idea being that if they had seen some anomaly involving something we knew was in Google's cache, it would be evidence of exploitation that way.
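The core of that scanning step is pattern-matching leak "fingerprints" against cached page bodies. Cloudflare used YARA for this; the sketch below uses plain Python regexes as a stand-in, and the patterns and sample page are entirely hypothetical.

```python
import re

# Stand-in for YARA scanning of cached pages: look for fragments that
# resemble leaked internal data. The patterns below are hypothetical
# examples of what leaked HTTP state might look like.
leak_patterns = [
    re.compile(rb"authorization: [a-z]+ [A-Za-z0-9+/=]+", re.IGNORECASE),
    re.compile(rb"set-cookie: [^\r\n]+", re.IGNORECASE),
]

def scan_page(page: bytes) -> list:
    """Return any leak-like fragments found in a cached page body."""
    hits = []
    for pattern in leak_patterns:
        hits.extend(pattern.findall(page))
    return hits

cached = b"<html>...Authorization: Bearer abc123==...</html>"
print(scan_page(cached))  # -> [b'Authorization: Bearer abc123==']
```

Once fragments are extracted this way, they can be grouped by which customer's data they contain, which matches the categorize-then-notify workflow described above.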

I'm curious why those core dumps, going back months at least, were not investigated.

Enjoy my "Inside Cloudbleed" RSA talk: https://www.rsaconference.com/videos/inside-cloudbleed I talked about that and much more there.

Thanks a lot!

Starts at 25:30 for anyone curious.

> We were recording the core dumps [...] We didn't look at this

They are probably pretty conscious of legal liability. I participated in the investigation of a potential vulnerability at one point in my career, and I was advised to use similar wording at the conclusion of the investigation.

Cloudflare is the kind of technical company I always wanted to build. Glad to see their success as a happy customer.

I really enjoy reading these war stories, even though this sounds like a super difficult time for the author. On a technical note though, aren't these kinds of breaches a powerful reason to use memory safe languages? I know performance used to be a major driving factor, but I doubt that's the case anymore. Even then, does it seem reasonable to say "I'm risking it all for a 20% performance gain"?

Admittedly, this isn't the kind of stuff I normally write, so I might be missing something. But since the ramifications of these memory leaks are so severe, I would think the industry would be adapting.
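For context on the class of bug involved: Cloudbleed came from an out-of-bounds read in C parser code, where a cursor could step past the end of a buffer that was only checked with an equality test. The sketch below simulates that logic in Python with indices standing in for pointers; the buffer contents and parser are invented for illustration, not Cloudflare's actual code.

```python
# Simulating the class of C bug behind buffer over-reads: the parser
# tests its cursor against the end of the buffer with `!=` instead of
# `>=`. If a step ever advances the cursor *past* the end, the check
# never fires and the parser reads adjacent "memory".
# Illustrative sketch only, not Cloudflare's actual parser.

memory = bytearray(b"REQUEST-BODY" + b"SECRET-FROM-ANOTHER-REQUEST")
buf_end = len(b"REQUEST-BODY")   # logical end of this request's buffer

def parse(step: int) -> bytes:
    out = bytearray()
    p = 0
    # BUG: end-of-buffer test is `!=`, as in the C original; the len()
    # guard only stands in for the fact that in C, real memory keeps
    # going past the buffer and the read silently succeeds.
    while p != buf_end and p < len(memory):
        out.append(memory[p])
        p += step
    return bytes(out)

print(parse(4))  # lands exactly on buf_end: returns only buffer bytes
print(parse(5))  # jumps over buf_end and "leaks" adjacent secret bytes
```

In a memory-safe language the second call would fault or raise at the boundary instead of quietly returning another request's data, which is the argument for moving off C/C++ for this kind of parser.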

Yes. And tomorrow's installment of this series of posts begins: After Cloudbleed, lots of things changed. We started to move away from memory-unsafe languages like C and C++ (there’s a lot more Go and Rust now).

Oh! Well there's my answer then, looking forward to reading that.

The biggest problem with leaving C/C++ is GC jitter. GC'd languages jump through massive hoops to minimize it, but it's still there, and in some workloads the jitter is simply unacceptable.
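A rough way to see a collection pause directly, using Python's cyclic collector as a stand-in for the GC'd languages the comment is about. Absolute numbers are machine-dependent and Python's collector differs from Go's concurrent GC; the point is only that a full collection over a large object graph takes measurable, non-zero time.

```python
import gc
import time

# Build a large linked object graph, then time a full collection.
# The pause grows with heap size, which is the "jitter" concern for
# latency-sensitive workloads.
graph = [{"next": None} for _ in range(500_000)]
for a, b in zip(graph, graph[1:]):
    a["next"] = b

start = time.perf_counter()
gc.collect()
pause_ms = (time.perf_counter() - start) * 1000
print(f"full collection took {pause_ms:.1f} ms")
```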

Reading the first part of this post reminds me that Cloudflare should really get into email security as well ;)

