
Let's Encrypt is down - bpierre
https://letsencrypt.status.io/?170519
======
jaas
Josh from Let's Encrypt here. First, my apologies for the trouble this has
caused.

I want to offer people here an early root cause analysis. I say early because
we have not entirely completed our investigation or a post-mortem.

OCSP requests that use the GET method use standard base64 encoding, which can
contain two slashes in a row. While debugging why a small number of OCSP
requests consistently failed, our engineers observed a rather odd, but
standard, web server behavior: when a server receives a request containing
multiple consecutive slashes, it collapses them into a single slash. This
caused our OCSP responder to consider requests with this unusual encoding
quirk invalid and respond with a '400 Bad Request'. The fix seemed quite
simple: disable the slash-collapsing behavior.
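
To make the quirk concrete, here is a small Python illustration (contrived
input; this is not our responder's actual code):

    import base64, re

    der = b"\xff\xff\xfd"                     # stand-in bytes for a DER-encoded OCSPRequest
    encoded = base64.b64encode(der).decode()  # '///9': standard base64 can contain slash runs
    collapsed = re.sub(r"/+", "/", encoded)   # what slash-collapsing servers effectively do: '/9'

    base64.b64decode(encoded)    # decodes fine
    base64.b64decode(collapsed)  # raises binascii.Error, hence the '400 Bad Request'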

Unfortunately, stopping this behavior surfaced a more serious issue. The AIA
extension that we include in the certificates we issue contains a URI for our
OCSP server, and this URI contains a trailing slash. According to RFC 6960
Appendix A.1, an OCSP request using the GET method is constructed as 'GET
{url}/{url-encoding of base-64 encoding of the DER encoding of the
OCSPRequest}', where the url 'may be derived from the value of the authority
information access extension in the certificate being checked for revocation'.
A number of user agents take this quite literally and construct the URL
without inspecting the contents of the AIA extension, meaning that they ended
up with a double slash between the host name and the base64-encoded OCSP
request. Before we disabled slash collapsing this was fine, as the web server
was silently fixing the problem. Once we stopped collapsing slashes, we
started seeing problems.
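
Roughly, such a literal-minded client does the following (a sketch; the OCSP
URL here is illustrative, not the actual value in our certificates):

    import base64, urllib.parse

    aia_ocsp_url = "http://ocsp.example-ca.org/"  # trailing slash, as in the AIA extension
    ocsp_request_der = b"\x30\x03\x0a\x01\x00"    # placeholder bytes for a DER OCSPRequest
    b64 = base64.b64encode(ocsp_request_der).decode()

    # Taking RFC 6960 A.1 literally: GET {url}/{url-encoded base64 of the request}
    url = aia_ocsp_url + "/" + urllib.parse.quote(b64)
    print(url)  # http://ocsp.example-ca.org//MAMKAQA%3D -- double slash after the host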

From our OCSP server's perspective, a majority of the OCSP requests we were
receiving were now prepended with a slash, and since we were unable to decode
them we'd respond with a '400 Bad Request' and move on. This coincided with a
large number of previously cached responses on our CDN expiring, so we
started getting hit with a large number of requests. Because we were
responding with '400 Bad Request' responses, we were setting explicit
no-cache headers, which meant we had a near-0% CDN cache offload rate and
took the full brunt of our OCSP request load at our origin servers. This
caused our whole infrastructure to get bogged down.
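
Roughly, the caching difference looked like this (header values are
illustrative, not our exact CDN configuration):

    HTTP/1.1 200 OK
    Cache-Control: max-age=3600

    HTTP/1.1 400 Bad Request
    Cache-Control: no-cache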

~~~
TechTechTech
Just a quick question: does this mean that if your OCSP servers were to go
down, a lot of SSL-enabled websites and applications would stop working? That
seems like a serious single point of failure for the modern-day internet. I
was always under the assumption that clients do not have to contact the CA
(every time?) before a TLS handshake takes place.

OCSP Stapling seems to be the way to mitigate this problem, but not all web
servers implement it (for instance lighttpd does not). Any recommendation from
Let's Encrypt on this issue?
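
For servers that do support it, enabling stapling is only a few directives;
an nginx sketch, with placeholder paths (my own notes, not official LE
guidance):

    ssl_stapling on;
    ssl_stapling_verify on;
    ssl_trusted_certificate /etc/ssl/example.org/chain.pem;  # CA chain (placeholder path)
    resolver 127.0.0.1;  # nginx needs a resolver to fetch the OCSP response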

~~~
r1ch
I'm pretty sure revocation checks like CRL and OCSP all fail-open (they still
allow the connection if contacting the revocation server fails).

~~~
niftich
Some have argued that this is why CRL and (especially) OCSP are useless pieces
of security theater: they don't actually protect against a crafted attack
because they fail-open in the very situations that a determined adversary can
trigger, so they only "protect" in situations where no real threat exists.
It's simply feel-good bookkeeping.

Adam Langley, working on Google Chrome [1][2][3], has been very vocal about
OCSP's faults, and Chrome began using its own auto-update to ship an aggregate
of revocations of high-value certs directly to browsers out-of-band. Despite
this being another famous instance of Chrome going against the grain of other
browser vendors, I believe this was the correct solution: offering better
protection for a curated subset of sites vs. pretending to -- but not actually
-- protecting all sites.

[1] [https://www.imperialviolet.org/2012/02/05/crlsets.html](https://www.imperialviolet.org/2012/02/05/crlsets.html)
[2] [https://www.imperialviolet.org/2014/04/19/revchecking.html](https://www.imperialviolet.org/2014/04/19/revchecking.html)
[3] [http://www.zdnet.com/article/chrome-does-certificate-revocation-better/](http://www.zdnet.com/article/chrome-does-certificate-revocation-better/)

~~~
zkms
> I believe this was the correct solution: offering better protection for a
> curated subset of sites vs. pretending to -- but not actually -- protecting
> all sites.

I concur, but note that it _is_ possible to do better and offer better
revocation protection for all sites, with low bandwidth/storage costs:
[http://www.ccs.neu.edu/home/cbw/static/pdf/larisch-oakland17.pdf](http://www.ccs.neu.edu/home/cbw/static/pdf/larisch-oakland17.pdf)

~~~
niftich
This paper -- the CRLite proposal -- is wonderfully well thought-out,
experimentally tested, and meets the design goals much better and more
elegantly than any other attempt to solve the certificate revocation problem.

Looks like it was posted here and got very little traction [1]; a shame. But
it will be presented in a few days at the IEEE Symposium on Security and
Privacy [2]. I hope it will get the coverage and examination it deserves.

[1] [https://news.ycombinator.com/item?id=13982861](https://news.ycombinator.com/item?id=13982861)
[2] [https://www.ieee-security.org/TC/SP2017/program-papers.html](https://www.ieee-security.org/TC/SP2017/program-papers.html)

------
hannob
So in case this helps anyone: I had people complaining about strange OCSP
errors all morning, coming from my server (Apache httpd).

It turns out Apache does practically everything it can to behave as dumbly as
possible in case of OCSP downtime.

If the OCSP server sends an error, Apache will pass that error along as the
stapled OCSP reply (instead of using an old, still-valid OCSP reply). You
can't make it behave sanely here, but you can at least tell it not to return
the error by setting SSLStaplingReturnResponderErrors to off.

However, if the OCSP server isn't reachable at all, Apache will fake its own
OCSP error (sic!) and send that. This is controlled by the option
SSLStaplingFakeTryLater, which defaults to on. So if your Firefox users get
strange OCSP errors, it's most likely this. The doc claims this option is
only effective if SSLStaplingReturnResponderErrors is set to on; however,
that's wrong.

tl;dr: set both of these options to "off"; then at least Apache won't staple
any garbage into your TLS connections, and Firefox will try to reach the OCSP
server on its own, fail, and still accept the connection. Yes, that's all
pretty fucked up.
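
In config terms (a sketch; the cache line is just the example from the httpd
docs):

    SSLUseStapling on
    SSLStaplingCache "shmcb:logs/ssl_stapling(32768)"
    SSLStaplingReturnResponderErrors off
    SSLStaplingFakeTryLater off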

~~~
avian
Thanks for pointing this out!

Going through the documentation, another thing that surprised me was the
SSLStaplingStandardCacheTimeout setting.

If I understand this correctly, by default Apache will only cache OCSP
responses for 1 hour, even if they are still valid for days. I guess
increasing this to 1 day or something would make sense as well.
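
Something like this, with the value in seconds (the default is 3600):

    SSLStaplingStandardCacheTimeout 86400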

~~~
hannob
> If I understand this correctly, by default Apache will only cache OCSP
> responses for 1 hour, even if they are still valid for days. I guess
> increasing this to 1 day or something would make sense as well.

Yeah, that's another major problem. But increasing that doesn't really fix
anything in a reasonable way. There shouldn't be any cache timeout, this
option doesn't make any sense. It should cache as long as the reply is valid
and replace it in time before it expires.

~~~
avian
By the way, I checked the source and it appears that setting
SSLStaplingStandardCacheTimeout to a large value (larger than typical OCSP
reply validity) effectively creates this behavior.

Apache checks whether the cached reply is still valid and, if it's not,
attempts to renew it.

At least in 2.4.10 as shipped in Debian Jessie. The relevant code is in
modules/ssl/ssl_util_stapling.c, stapling_cb().

------
scrollaway
Was fun finding this out during a random server cycle. Turns out, Caddy
doesn't appreciate the ACME server being down, and refuses to start :)

[https://github.com/mholt/caddy/issues/1680](https://github.com/mholt/caddy/issues/1680)

~~~
discreditable
Wow @ that close comment:

> So, this is not a bug and all is working as intended.

Caddy folks had better never restart the caddy service (or server) while LE
happens to be down, even if you already have a valid cert!

~~~
tyingq
That's going to be a limiter for adoption. Hopefully @mholt reconsiders.

Update: Mholt pushed a change where Caddy only refuses to start if the cert is
expiring in 7 days or less.
[https://github.com/mholt/caddy/commit/410ece831f26c61d392e0e8fa41e9b4f90d7fb95](https://github.com/mholt/caddy/commit/410ece831f26c61d392e0e8fa41e9b4f90d7fb95)

~~~
scrollaway
Hm, yeah, I hope so too :/ I've been using Caddy in prod for a year now, and
this issue, rare as it may be, could single-handedly get me back on nginx.

Having the server be unable to start through circumstances outside of the
system's control is just such a huge no.

~~~
enimodas
Why did you switch away from nginx?

~~~
scrollaway
Not having to deal with certificate renewal is a big deal.

~~~
fb03
I'm running nginx reverse-proxying to a Python API right now. Dealing with
certificate renewal is a matter of running a daily cron job that issues
'certbot renew'. If it works, it replaces the fullchain.pem certificate, and
that's it; easy peasy.
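
Something like this (the reload hook is just an example):

    # crontab entry: attempt renewal daily; certbot only renews certs close to expiry
    17 3 * * * certbot renew --post-hook "systemctl reload nginx"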

Am I missing something?

~~~
scrollaway
And by running Caddy instead, I have one less piece to monitor and worry
about.

The way Let's Encrypt works, it makes a _lot_ of sense to have the
functionality be part of the web server.

~~~
stephenr
> by running Caddy instead, I have one less piece to monitor and worry about

No, with Caddy you have several vaguely related pieces glued together with
superglue.

> The way Let's Encrypt works, it makes a lot of sense to have the
> functionality be part of the web server.

I think this thread is pretty much proof that that approach _will_ bite you in
the ass.

~~~
scrollaway
> _I think this thread is pretty much proof that that approach will bite you
> in the ass._

What bit me here is the fact that I'm running alpha software instead of a
battle-tested web server; I'm doing so willingly, with full awareness of the
risks that that entails.

Drawing the conclusion you did from the variables at play is shortsighted. If
anything bites people in the ass, it's prejudice and shortsightedness. I
wouldn't want you handling my ops/infrastructure.

~~~
stephenr
> What bit me here is the fact that I'm running alpha software

This wasn't caused by a bug. This was a _deliberate_ decision to fail to start
if the certificates on-disk were <= 30 _days_ away from expiring and the CA
can't be contacted.

> Drawing the conclusion you did from the variables at play is shortsighted

Using caddy is the web-server-stack equivalent of "putting all your eggs in
one basket". If one thing about it isn't working the way you want, you have to
either a) replace it completely or b) work out how to disable the bit that's
not working how you want, and replace that part of it.

> Drawing the conclusion you did from the variables at play is shortsighted

- People use a piece of software that serves as both ACME TLS certificate
client and web server

- Said software _by design_ won't start if the CA can't be contacted 30 days
out from expiry

The conclusion I drew is that such integration leaves the operator with
_less_ control than if they followed a separation-of-concerns approach,
leaving web serving to a web server and TLS certificate renewal to an ACME
client. The former doesn't need to care about how old the certificates are;
it just uses what it's given.

~~~
scrollaway
You're drawing conclusions from unintended behaviour, which has now changed
(and a release has been issued).

~~~
stephenr
> unintended behaviour

Ahem.
[https://github.com/mholt/caddy/issues/1680#issuecomment-302693543](https://github.com/mholt/caddy/issues/1680#issuecomment-302693543)

Emphasis mine:

> So, this is not a bug and all is _working as intended_.

~~~
scrollaway
Have you noticed that the bug has been fixed?

~~~
stephenr
Why do you keep calling it a bug?

------
IgorPartola
I think LE is a huge boon to the internet. But I would really love for someone
like Amazon, Google, Facebook, or Microsoft to set up a separate provider that
implements the same thing. Redundancy is super important here and clearly just
one organization can't guarantee 100% uptime.

~~~
falcolas
Amazon does, though it's limited to their own services (which is, frankly, to
be expected): AWS Certificate Manager.

~~~
tyingq
ACM is nice, but it does require the manual step of clicking a link in a
verification email.

~~~
atonse
Yes but it also issues certs for a year, which helps alleviate that email link
issue.

~~~
alberts00
If the AWS ACM certificate is in use when it gets close to expiration, ACM
will automatically renew it without user intervention.
[http://docs.aws.amazon.com/acm/latest/userguide/configure-domain-for-automatic-validation.html](http://docs.aws.amazon.com/acm/latest/userguide/configure-domain-for-automatic-validation.html)

------
QUFB
It's not only a problem with certificate issuance: their OCSP servers are
also down. This caused an issue on one of my sites where I was using OCSP
stapling: normal browser connections were failing, but not tools like curl
(which don't request the OCSP response during the TLS handshake).
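
(You can see the stapled response, or its absence, by requesting it
explicitly during the handshake; the hostname below is a placeholder:)

    openssl s_client -connect example.org:443 -servername example.org -status < /dev/null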

~~~
Ajedi32
What's the typical validity period for OCSP responses with Let's Encrypt?
Shouldn't the stapled responses continue working for at least a couple hours
even after Let's Encrypt goes down?

~~~
agrajag
1 week, so most servers likely won't be affected unless the outage goes on for
a really long time.

~~~
avian
Not sure how this works.

I have OCSP stapling turned on in Apache, and Firefox wouldn't load my page
when Let's Encrypt's OCSP servers went down.

My monitoring shows that the last stapled response had 4 days of validity
left, so it seems that Apache immediately threw away cached OCSP responses.

~~~
mholt
For what it's worth, Caddy is the only server that will locally cache the
staples (and manage them) automatically. In other words, Caddy sites were not
affected by this OCSP downtime.

~~~
pritambaral
While we're on the topic:
[https://github.com/mholt/caddy/issues/1680](https://github.com/mholt/caddy/issues/1680)

------
ramshanker
On the plus side, as a side effect of this event, most libraries will
hopefully start handling this case in more robust ways.

~~~
nickpsecurity
Usually takes a serious failure first. Then, they start doing real
robustness... on just that one thing. ;)

------
CaliforniaKarl
It seems that Let's Encrypt is back up.

I think, in the past, mods have put an extra down-weight on "X is down"
stories, once 'X' is back up.

Since this discussion now has interesting stuff related to Let's Encrypt—and
products which use Let's Encrypt—I hope the mods are willing to forgo the down
weight, and instead just change the post title to something like "Let's
Encrypt Was Down".

~~~
corford
I'm still getting 504s and timeouts from
[https://acme-v01.api.letsencrypt.org/acme/new-reg](https://acme-v01.api.letsencrypt.org/acme/new-reg) and
[https://acme-v01.api.letsencrypt.org/directory](https://acme-v01.api.letsencrypt.org/directory)
(trying from Lisbon, PT)

Edit: Akamai is issuing the 504 when hitting
[https://acme-v01.api.letsencrypt.org/acme/new-reg](https://acme-v01.api.letsencrypt.org/acme/new-reg), so I guess the
origin servers are still overloaded...?

~~~
alixaxel
I'm experiencing the same from Kuala Lumpur.

~~~
corford
FYI, acme-staging.api.letsencrypt.org is working ok for me but the production
endpoints are still timing out.

------
throw2016
Nothing against letsencrypt, but depending on services being online is
fragile and will break. Their 90-day limit makes it worse. Saying it's for
security is like saying 1- or 3-year certs are somehow insecure, which is not
the case. It's one more headache for an admin to think about, even if
automated.

We really should reexamine the CA system. Self-signed certs should have more
value than they currently do, and identity could be verified by out-of-band
methods. Surely it's worth exploring.

What we have now effectively disempowers individuals and centralizes essential
services which cannot be good in the long run.

~~~
tscs37
You can run your own ACME provider; the code is open source.

Nothing stops you from running a CA that offers 1 year certs over ACME. Or
just providing one that also offers 90 day certs. If people will trust that CA
is another question.

The automation of LE is not the problem either. A properly automated system
would renew the cert well before it becomes invalid; almost every LE guide I
know of mentions this, on the grounds of "what if LE is down".
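
A minimal sketch of that renew-with-margin idea (the paths and renew command
are placeholders, nothing LE-specific; uses the third-party 'cryptography'
package):

    import datetime
    import subprocess

    from cryptography import x509

    MARGIN = datetime.timedelta(days=30)  # renew a third of the 90-day lifetime early

    def needs_renewal(cert_path):
        with open(cert_path, "rb") as f:
            cert = x509.load_pem_x509_certificate(f.read())
        return cert.not_valid_after - datetime.datetime.utcnow() < MARGIN

    # Run daily from cron: if the CA is down today, tomorrow's run simply retries,
    # so a multi-day outage never leaves you with an expired cert.
    if needs_renewal("/etc/ssl/example.org/fullchain.pem"):
        subprocess.run(["certbot", "renew"])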

The only libraries and services affected are those who do not properly code
for an external service provider being temporarily offline.

The problem with OOB verification of certs is that it's slow and inefficient
for almost all the methods it can be done with. And it doesn't scale, either.

Imagine if everyone wanted to OOB verify the Google certs on the same day.

~~~
mort96
> Nothing stops you from running a CA that offers 1 year certs over ACME. Or
> just providing one that also offers 90 day certs. If people will trust that
> CA is another question.

You know that's BS. All of your users would get certificate errors, that's
what's preventing you from running your own CA.

~~~
tscs37
Read carefully:

"If people will trust that CA is another question."

Of course you won't get trusted, but that's not the issue.

------
theprop
We've been using LetsEncrypt on dozens of our servers for several months now
and it's worked flawlessly, both in setup and in operation! Setup was quite
easy thanks to A LOT of work by the team there.

To make HTTPS simple and free to set up is a FANTASTIC mission, and the team
has overall built a SUPERB system. Congratulations, and thanks for addressing
this issue quickly; it looks like it's already solved. Good work!

------
Filligree
This is also why you don't wait until the last day before renewing. (But no-
one does that, right?)

~~~
IgorPartola
If you use one of the myriad of automated tools for using LE, you will get
your cert renewed as early as 30 days before it expires. So right now the
issue should only be with new domains getting certs.

If you renew LE certs manually: first, what is wrong with you, and don't you
like yourself? Second, at that point it's no different from NameCheap going
down and you getting your cert from 1and1 instead.

~~~
Bartweiss
> So right now the issue should only be with new domains getting certs.

Certainly should have been, but it looks like a lot of libraries are choking
on the outage even if the old cert is 100% valid.

~~~
IgorPartola
I use
[https://github.com/lukas2511/dehydrated](https://github.com/lukas2511/dehydrated)
which is a bash script for doing this. It doesn't choke because it's just a
cron job.

I also use dokku's LE plugin, which is... also a shell script cron job. Maybe
that's the way to do this. I know Caddy is the exception case here.

------
kennu
Just spent an hour debugging why development scripts work but production
doesn't. Good reminder to configure some kind of notification from AWS Lambda
for when execution times out, not just on errors.

------
101km
Clearly it is high time for an EncryptWeShall nonprofit with a wholly separate
implementation and team and all the tooling adjusted to randomly pick between
the two.

------
bpierre
Traefik is having a problem similar to Caddy's, where Let's Encrypt being
down prevents it from booting.

[https://github.com/containous/traefik/issues/791](https://github.com/containous/traefik/issues/791)

------
fgrimes
Holy frijoles, I was pondering exactly this scenario earlier today while
messing around with Caddy and LE, as in: do I want to take a (mostly) static
and offline process and inject it directly as another moving part into the
runtime world, weighing its cost, convenience, and overall worthwhileness?

Is there a good alternative, without this sort of process? Or back to the
sharks?

------
Tepix
This is unrelated to this outage, but in the past when renewing I've always
had problems resolving acme-v01.api.letsencrypt.org. Am I alone in having
this issue?

I'm running a local dnscache instance (djbdns), perhaps that has something to
do with it?

------
graton
Working for me. I just renewed some certs using the DNS method without issue.

------
raarts
Well, it's down again

------
m3kw9
Short version: they fixed an issue, but the fix surfaced a bigger one.

------
Romajashi
Oh my god, a website on the internet is currently down; let's discuss that.

------
nickpsecurity
"High assurance datacenter" High assurance my ass. Those don't go down unless
there's a DDOS or catastrophic failure (often several). Then, they're right
back up. People need to stop misusing this label. Another is "high-assurance"
certs from vendors that get compromised or subverted easily. Only one high-
assurance CA that I know of. It's not around for business reasons, though.

[http://www.anthonyhall.org/c_by_c_secure_system.pdf](http://www.anthonyhall.org/c_by_c_secure_system.pdf)

~~~
mort96
The issue here doesn't seem to be that a data center went down, but that
there was a bug which caused downtime.

~~~
nickpsecurity
The issue I brought up is that nothing about it is high-assurance, except
maybe the tamper-resistance on an HSM involved. It's a term abused a lot in
the certificate market. An easy hint is whether the product was developed
slowly in a safe subset of C, Java, or Ada; those have the tooling needed for
highly robust implementations. Then look at the OS to see if it's something
extremely hardened or unusual (e.g. a separation-kernel RTOS). The protocols
will be ultra-simple, with a lot of high availability and easy recovery.
Almost no modern tooling will be in the TCB for configuration or deployment
unless it's simple. Most of it isn't.

I'm not seeing any of this in the reporting that made it here. Definitely not
high-assurance. It's likely compromised by high-end attackers, either for
specific targets or in some general way. It will still help protect against
everyone else in its intended way, though. An enormously positive
development; just not high-assurance security at any level.

