
Let's Encrypt OCSP and Issuance Outage Postmortem - agrajag
https://community.letsencrypt.org/t/may-19-2017-ocsp-and-issuance-outage-postmortem/34922
======
zeta0134
And this is why you always... aww hell, who am I kidding? That was one complex
series of events, and I would have done no better.

Grats to the Let's Encrypt team for figuring it out, and thanks for writing up
the post mortem. It's always interesting to read things like this, and it just
goes to show that sometimes all the monitoring you thought you had covered
wasn't quite enough.

~~~
zzzcpan
It's easy to do better if you ever listened to Joe Armstrong or read his
dissertation. The idea of centralized monitoring is just broken and can never
be truly reliable.

~~~
bogomipz
Can you say who Joe Armstrong is and which dissertation you mean? I would be
interested in reading it.

~~~
krallja
Joe Armstrong's dissertation became Erlang

~~~
bogomipz
Oh wow I didn't know any of this. This will be some good weekend reading.
Thanks.

Here is the link in case anyone else is interested:

[http://erlang.org/download/armstrong_thesis_2003.pdf](http://erlang.org/download/armstrong_thesis_2003.pdf)

~~~
SkyMarshal
Their original video demo of hot code swapping is also a classic worth
checking out:

[https://www.youtube.com/watch?v=uKfKtXYLG78](https://www.youtube.com/watch?v=uKfKtXYLG78)

That got a lot of programmers interested in Erlang ~10yrs ago.

------
agrajag
I thought this was a great writeup, but one significant issue I see unanswered
is why did all OCSP responses fall out of cache at the same time? In a cache
with an even distribution of expiration times they should have seen a gradual
increase in traffic as responses steadily fell out of cache. Adding jitter
when setting cache duration for the CDN should help even out the rate at which
responses fall out of cache.
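
A minimal sketch of what I mean by jitter, assuming you control the TTL sent to the CDN per response. The function name and the 10% default are made up for illustration; this is not how Let's Encrypt sets cache lifetimes. The jitter only shortens the TTL, so nothing stays cached longer than the base value.

```python
import random
from typing import Optional


def jittered_ttl(base_ttl_s: float,
                 jitter_fraction: float = 0.1,
                 rng: Optional[random.Random] = None) -> float:
    """Return a TTL shortened by a random amount of up to jitter_fraction.

    Shortening only (never lengthening) keeps every cached entry within
    the original base_ttl_s, while spreading expirations over a window.
    """
    if not 0.0 <= jitter_fraction < 1.0:
        raise ValueError("jitter_fraction must be in [0, 1)")
    source = rng if rng is not None else random
    return base_ttl_s * (1.0 - source.uniform(0.0, jitter_fraction))


if __name__ == "__main__":
    # Entries cached at the same moment now expire at different times.
    for _ in range(3):
        print(round(jittered_ttl(3600)))
```

With a one-hour base and 10% jitter, entries cached in the same second expire spread across a six-minute window instead of all at once.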

In addition, monitoring the cache hit rate and measuring the request rate at
the CDN should have been big indicators that it wasn't a DDoS.

Lastly, is this kind of upstream throttling with no customer communication
common? That seems like a big failing on the ISP's side.

~~~
zeta0134
I work at a hosting company, and I can attest to throttling at the ISP side
being somewhat common. My own company will throttle or null route customers
almost immediately if enough traffic comes through that it starts to affect
other customers behind the same switches.

We try to notify customers as quickly as possible (emails go out within
minutes) but there are a lot of cases where the emails end up in unmonitored
inboxes and customers don't realize they're down until their clients complain
to them about it.

In any case, it sounds like there may be a bit of a logic bug clientside if a
service downtime causes all of their clients to generate so much traffic that
it looks like DDoS. The clients should be throttling themselves to prevent
overloading a downed upstream service, why didn't that happen here? That's
worth investigating, though much more difficult to fix at this point with that
many copies of the client in the wild. EDIT: I just thought to actually look
up OCSP, I assumed it was the mechanism something like certbot was using to
renew certificates. This is built into _browsers_? Yeah, never mind on this
whole paragraph then.

~~~
mholt
> This is built into browsers?

Yup, unfortunately, browsers often have to do revocation (and SCT) checks
because traditional servers like Apache and nginx don't staple OCSP responses
by default (and even if they are configured to do so, the implementations in
these servers are not robust against outages).

Firefox checks OCSP for DV certs, but that will be disabled in a near-future
release.

~~~
tialaramex
Your reference to SCT here seems weird.

[Anybody following at home: SCTs are signed timestamps proving a particular
Certificate Transparency log server logged this certificate's details at a
moment in time. These log servers are public proof of what certificates should
exist. The cryptography behind the log servers makes it impossible for them to
lie about what they know beyond a certain horizon; policy today sets that
horizon at 24 hours]

Today the only thing any browser (Chrome) does with SCT by default is to
verify whether valid (properly signed) SCTs are provided for certain
certificates. This doesn't result in any additional connections, although if
OCSP is needed already the SCTs might optionally be included with OCSP.

Eventually browsers will be able to automatically report information to detect
any discrepancies (e.g. where a log is telling different things to different
people), but today that's something which exists only in prototype, not as a
default feature in production web browsers ordinary people use every day.

------
ge96
Sorry I guess I only know how to implement/use an SSL certificate not
necessarily how the generation/providing part works. What did this outage
mean? If you already had a certificate generated, were you not affected by
this? Or does a certificate enabled on a website need to be checked/validated?

I did look up what OCSP is after looking at the article:

>an Internet protocol used for obtaining the revocation status of an X.509
digital certificate.

So does this mean they weren't able to verify that the SSL was still good and
so you'd get a warning in the browser saying "This site is not secure" or
something?

~~~
ZoFreX
(this is purely about the OCSP server outage, others have commented on the
issuance server outage)

In most cases, clients ignore any failure to contact the OCSP servers. This
means that:

1) OCSP servers aren't an additional point of failure for your website

2) A man-in-the-middle attack using a stolen and revoked certificate can
prevent your browser from knowing it's stolen by blocking the connection to
the OCSP server

Possibly due to #2, or other reasons, I can only speculate, some clients treat
failure to contact OCSP servers more seriously and abort the connection.
During the outage, those clients were unable to talk to servers that:

1) were using LetsEncrypt

2) enabled OCSP

3) did not have a valid stapled OCSP response (for example because OCSP
stapling was not configured, or their server lost the response and couldn't
get a new one during the downtime)

The size of the intersection of affected clients and sites is very small, but
during that window they were completely unable to talk to each other. So in
broad terms the impact was very small, but for those affected it was quite
large.
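
To make the soft-fail versus hard-fail distinction concrete, here is a rough sketch of the decision a client makes. All names are hypothetical and this is not any real browser's code; it only models the three outcomes described above.

```python
from enum import Enum
from typing import Callable, Optional


class OcspResult(Enum):
    GOOD = "good"
    REVOKED = "revoked"
    UNREACHABLE = "unreachable"  # responder down, blocked, or timed out


def should_connect(stapled: Optional[OcspResult],
                   query_responder: Callable[[], OcspResult],
                   hard_fail: bool) -> bool:
    """Decide whether to proceed with the TLS connection.

    stapled: the status stapled by the server, or None if nothing was stapled.
    query_responder: called only when there is no stapled response.
    hard_fail: True for clients that abort when no status can be obtained.
    """
    result = stapled if stapled is not None else query_responder()
    if result is OcspResult.GOOD:
        return True
    if result is OcspResult.REVOKED:
        return False
    # No answer at all: soft-fail clients carry on, hard-fail clients abort.
    return not hard_fail


if __name__ == "__main__":
    outage = lambda: OcspResult.UNREACHABLE
    print(should_connect(None, outage, hard_fail=False))            # True
    print(should_connect(None, outage, hard_fail=True))             # False
    print(should_connect(OcspResult.GOOD, outage, hard_fail=True))  # True
```

The last call is the stapling case: a hard-fail client still connects during the outage because it never needs to reach the responder.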

~~~
tialaramex
As you observe, OCSP today is not widely respected (for most sites Chrome
doesn't even check OCSP at all for example) which is bad news if anybody's
certificate gets stolen or misused.

OCSP Stapling is (part of) the eventual solution. Web server software will go
get the OCSP answers for its own certificates, and "staple" those to the
certificates when it serves them. So now client software doesn't have to
wrestle with unreliable networks and make extra connections, the OCSP response
is right there with the certificate during connection to the site.

However, Quality of Implementation for OCSP Stapling in some of the most
popular HTTP servers is poor. Let's take the example of Apache httpd, possibly
still the most popular server in the world.

By default Apache doesn't do OCSP stapling at all, so you need to configure
it. Doing so isn't a one-line "Yup, staple please" either. Instead it
appears the person writing this code for Apache went through the specification
and any time they weren't sure what to do they said "Eh, I'll leave that to
the sysadmin" and added a configuration option, with more or less random
defaults.

As a result by default Apache will forget a perfectly good OCSP answer it
knows in favour of trying to get a new one. If that fails (as it did here due
to Let's Encrypt's problems) Apache doesn't say "Oh well, I have a good one
already, I can use that". It makes a fake "error" OCSP response and serves
that up. Why? Nobody we've ever been able to find knows what that could be
useful for, but the Apache developer decided it would be a good default. "Yup,
if anything goes wrong, just irreversibly break the entire server, that way
they'll be sure to notice".

It will also happily staple worthless outdated answers, or answers saying e.g.
"Temporary failure, try later", which likewise will just cause visitors to
your site to get turned away, rather than continuing to use a known-good
answer it has.

And nobody at Apache seems the least bit interested in fixing this.
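
For contrast, the behaviour I'd hope for (keep serving the known-good answer until it actually expires, and staple nothing rather than an error) fits in a few lines. This is a sketch with invented names, not how mod_ssl or any other server is implemented.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class StapledResponse:
    status: str         # e.g. "good"
    next_update: float  # seconds since epoch; the response is stale after this


class StapleCache:
    """Keeps the last good OCSP response and refreshes it ahead of expiry."""

    def __init__(self,
                 fetch: Callable[[], Optional[StapledResponse]],
                 refresh_margin_s: float) -> None:
        self._fetch = fetch              # returns None when the responder fails
        self._margin = refresh_margin_s  # start refreshing this long before expiry
        self._current: Optional[StapledResponse] = None

    def get(self, now: float) -> Optional[StapledResponse]:
        due = (self._current is None
               or now >= self._current.next_update - self._margin)
        if due:
            fresh = self._fetch()
            # Only replace the cached answer with a usable one; a failed
            # fetch never discards a response that is still valid.
            if fresh is not None and fresh.next_update > now:
                self._current = fresh
        if self._current is not None and self._current.next_update > now:
            return self._current
        return None  # staple nothing instead of an error or a stale answer
```

The key point is that a failed refresh never overwrites the cached response, so an outage shorter than the remaining validity period is invisible to visitors.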

~~~
Anthony-G
As an Apache user, I’ve yet to look into enabling OCSP stapling so thanks for
this informative post. I presume the developer you are referring to is (one
of) the developers of _mod_ssl_. I found the bug report[1] where the Apache
developers state that they won’t enable stapling by default because _“it would
enable a "phoning home" feature (to the CA's OCSP responders) as a side effect
of configuring a certificate”_. That seems reasonable to me. However, the
other behaviour that you’ve mentioned seems less so. Do you have any
references (mailing list discussions, links to bug reports, etc.) for this?

By the way, your opening line should probably be edited to say something like
_which is bad news if anybody's private key gets stolen or misused and they
need to revoke the corresponding certificate(s)_. Most readers of this discussion will
know what you mean but some who are still learning about PKI may be confused.

[1]
[https://bz.apache.org/bugzilla/show_bug.cgi?id=50740#c20](https://bz.apache.org/bugzilla/show_bug.cgi?id=50740#c20)

~~~
Anthony-G
Wikipedia has a good description on what OCSP Stapling is[1] and how it works.
When I read the Apache project's WONTFIX reason, I presumed that it was
related to how plain OCSP requires the client to "phone home" in order to
check whether a certificate has been revoked or not – which has implications
for the privacy of the browser.

However, now that I know how OCSP Stapling works (the web server caches and
proxies time-stamped OCSP responses that are _signed by the CA_ ), the Apache
position is much less reasonable. As a Let’s Encrypt user, I “phone home”
every couple of months to renew my X.509 key and certificate. That’s not a
privacy concern for me or anyone else who happens to browse my site.

I also found a good article by Hanno Böck[2] which provides more details on
how OCSP Stapling is thoroughly broken on Apache (as described by tialaramex).

[1]
[https://en.wikipedia.org/wiki/OCSP_stapling](https://en.wikipedia.org/wiki/OCSP_stapling)

[2] [https://blog.hboeck.de/archives/886-The-Problem-with-OCSP-St...](https://blog.hboeck.de/archives/886-The-Problem-with-OCSP-Stapling-and-Must-Staple-and-why-Certificate-Revocation-is-still-broken.html)

------
yuhong
I have been thinking of a yellow single-click SSL warning for things like an
OCSP server that cannot be contacted. Not enabled by default for now, of course.

------
Already__Taken
Is it possible to have a test stage that replays/simulates the entire previous
24 hours of production to compare previous output with new output. This might
let you know if a fix to a known bug actually makes things worse.

Obviously I'm not implying this is a step Mozilla should already have in
place.
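
Roughly what I have in mind, as a toy sketch. It assumes you have recorded (request, output) pairs from production and that the new build can be called as a plain function; both are simplifications, and the names are made up.

```python
from typing import Any, Callable, Iterable, List, Tuple


def replay_and_diff(recorded: Iterable[Tuple[Any, Any]],
                    new_handler: Callable[[Any], Any]) -> List[Tuple[Any, Any, Any]]:
    """Replay recorded requests against new_handler and collect differences.

    recorded: (request, old_output) pairs captured from production.
    Returns a list of (request, old_output, new_output) for every mismatch.
    """
    mismatches = []
    for request, old_output in recorded:
        new_output = new_handler(request)
        if new_output != old_output:
            mismatches.append((request, old_output, new_output))
    return mismatches


if __name__ == "__main__":
    # Hypothetical recording: the old build doubled its input.
    recording = [(1, 2), (2, 4), (3, 6)]
    # The new build changes behaviour for inputs above 2.
    print(replay_and_diff(recording, lambda x: x * 2 if x <= 2 else x * 3))
```

Every mismatch still needs a human (or a rule) to decide whether it is the intended fix or a regression.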

~~~
woliveirajr
That's a huge step. Lots of data to save for the input, lots of diff output,
and you have to look at all the details of the output, since many of the
differences are desirable. Then you would automate the output processing, and
a wrong rule would miss the new bug (because it was unexpected, so the
automation wouldn't account for it).

------
ge96
I used Let's Encrypt for the first time today, awesome! Saved $9.00 haha

------
theprop
LetsEncrypt is just awesome & a superb mission to make every website easily
secured!! SSLs were otherwise a big scam! (Yes I know I love them!)

