Let's Encrypt is down (status.io)
312 points by bpierre on May 19, 2017 | 160 comments



Josh from Let's Encrypt here. First, my apologies for the trouble this has caused.

I want to offer people here an early root cause analysis. I say early because we have not entirely completed our investigation or a post-mortem.

OCSP requests that use the GET method use standard base64 encoding, which can contain two slashes one after another. While debugging why a small number of OCSP requests consistently failed, our engineers observed a rather odd, but standard, web server behavior: when a server receives a request whose path contains multiple slashes one after another, it will collapse them into a single slash. This caused our OCSP responder to consider requests that had this unusual encoding quirk invalid, and it would respond with a '400 Bad Request' response. The fix seemed quite simple: disable the slash collapsing behavior.

Unfortunately, stopping this behavior surfaced a more serious issue. The AIA extension that we include in certificates we issue contains a URI for our OCSP server. This URI contains a trailing slash. According to RFC 6960 Appendix A.1, an OCSP request using the GET method is constructed as follows 'GET {url}/{url-encoding of base-64 encoding of the DER encoding of the OCSPRequest}', where the url 'may be derived from the value of the authority information access extension in the certificate being checked for revocation'. A number of user agents take this quite literally and will construct the URL without inspecting the contents of the AIA extension, meaning that they ended up with a double slash between the host name and the base64 encoded OCSP request. Before we disabled slash collapsing this was fine, as the web server was silently fixing this problem. Once we stopped collapsing slashes we started seeing problems.

From our OCSP server's perspective a majority of the OCSP requests we were receiving were prepended with a slash and we were unable to decode them so we'd respond with a '400 Bad Request' response and move on. This coincided with a large number of previously cached responses on our CDN expiring, causing us to start getting hit with a large number of requests. Because we were responding with '400 Bad Request' responses we were setting explicit no-cache headers which meant we had a near 0% cache (CDN) offload rate and were hit with the full brunt of our OCSP request load at our origin servers. This caused our whole infrastructure to get bogged down.
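A rough Go sketch of the failure mode described above (the DER bytes and the ocsp.example.org hostname are made up for illustration, not taken from a real certificate): standard base64 can legitimately contain adjacent slashes, and collapsing them corrupts the encoded request.

  package main

  import (
      "encoding/base64"
      "fmt"
      "path"
  )

  func main() {
      // Hypothetical DER-encoded OCSPRequest bytes; real requests carry
      // hashes and a serial number, which can just as easily hit this case.
      der := []byte{0x30, 0x51, 0xff, 0xff, 0xff, 0x02}

      b64 := base64.StdEncoding.EncodeToString(der)
      fmt.Println(b64) // "MFH///8C" -- the 0xff run yields adjacent slashes

      // RFC 6960 A.1: GET {url}/{base64 of the DER OCSPRequest}. With a
      // trailing slash already present in the AIA URI, a literal-minded
      // client also produces a double slash right after the host name.
      aia := "http://ocsp.example.org/" // stands in for the AIA OCSP URI
      fmt.Println(aia + b64)

      // Roughly what a slash-collapsing front end does to the path:
      collapsed := path.Clean("/" + b64)[1:]
      fmt.Println(collapsed) // "MFH/8C" -- no longer the request that was sent

      if _, err := base64.StdEncoding.DecodeString(collapsed); err != nil {
          fmt.Println("decode fails after collapsing:", err)
      }
  }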


Just a quick question: does this mean that if your OCSP servers were to go down, a lot of SSL-enabled websites and applications would stop working? That seems like a serious single point of failure for the modern-day internet. I was always under the assumption that clients do not have to contact the CA (every time?) before a TLS handshake takes place.

OCSP Stapling seems to be the way to mitigate this problem, but not all web servers implement it (for instance lighttpd does not). Any recommendation from Let's Encrypt on this issue?


This is exactly the problem with OCSP. There's no way to tell if the remote server is down, or if a malicious actor sitting in your path is blocking it. So your browser can either a) make it super easy for all your OCSP-using sites to appear down, which will encourage users to use other, non-OCSP, sites, or b) silently fail, which makes the entire exercise pointless.

Stapling only partially mitigates this, as it doesn't currently work with intermediate certs, and at this point most sites have at least one intermediate cert.


Could you elaborate on why they don't work with intermediate certs?


RFC 6066 specifies that you can only have one certificate in an OCSP response - as with intermediate certs you need to be able to respond with a chain, this does not work. RFC 6961 defines a multiple response capability, but my understanding is that currently this is not sufficiently widely implemented to be useful yet.


Thanks! I thought it's enough if the stapled response contains information only about the intermediate cert, and the browser would accept that as good enough, if the chain it got in the handshake is valid.

https://bugzilla.mozilla.org/show_bug.cgi?id=611836 - this looks pretty abandoned (last comment 3 years ago) :/

and I found no bug for Chrome.


Chrome, as far as I know, does not do OCSP - https://www.imperialviolet.org/2012/02/05/crlsets.html


I'm pretty sure revocation checks like CRL and OCSP all fail-open (they still allow the connection if contacting the revocation server fails).


Some have argued that this is why CRL and (especially) OCSP are useless pieces of security theater: they don't actually protect against a crafted attack because they fail-open in the very situations that a determined adversary can trigger, so they only "protect" in situations where no real threat exists. It's simply feel-good bookkeeping.

Adam Langley, working on Google Chrome [1][2][3], has been very vocal about OCSP's faults, and Chrome began using its own auto-update to ship an aggregate of revocations of high-value certs directly to browsers out-of-band. Despite this being another famous instance of Chrome going against the grain of other browser vendors, I believe this was the correct solution: offering better protection for a curated subset of sites vs. pretending to -- but not actually -- protecting all sites.

[1] https://www.imperialviolet.org/2012/02/05/crlsets.html [2] https://www.imperialviolet.org/2014/04/19/revchecking.html [3] http://www.zdnet.com/article/chrome-does-certificate-revocat...


> I believe this was the correct solution: offering better protection for a curated subset of sites vs. pretending to -- but not actually -- protecting all sites.

I concur, but note that it is possible to do better and offer better revocation protection for all sites, with low bandwidth/storage costs: http://www.ccs.neu.edu/home/cbw/static/pdf/larisch-oakland17...


This paper -- the CRLite proposal -- is wonderfully well thought-out, experimentally tested, and meets the design goals much better and more elegantly than any other attempt to solve the certificate revocation problem.

Looks like it was posted here and got very little traction [1]; a shame. But it will be presented in a few days at the IEEE Symposium on Security and Privacy [2]. I hope it will get the coverage and examination it deserves.

[1] https://news.ycombinator.com/item?id=13982861 [2] https://www.ieee-security.org/TC/SP2017/program-papers.html


The one app chain (complex code signing) I worked on with OCSP, we defaulted to failsafe, but it could be overridden in the 'main' (enterprise CMS) app. The installer required OCSP or wouldn't install.

Basically the first and last mile were hard fails but everything in between was advisory if the signature checked out.


1. Spec question: Why does the request need to be both base 64 and URL encoded? Why not just URL encoded? Only reason I can think of is for shorter/prettier URLs?

Or why not just use base 64 with the URL safe alphabet: https://tools.ietf.org/html/rfc4648#page-7

2. Implementation question: Shouldn't the slashes be URL encoded as "%2F"? "url/ABC/DEF" could mean "url" + "ABC/DEF" or "url/ABC" + "DEF". Multiple slashes are collapsed by default because path components shouldn't contain them.


Josh referred to RFC 6960 Appendix A, but his post didn't make it apparent that this description is an exact quote from the spec [1]:

An OCSP request using the GET method is constructed as follows:

GET {url}/{url-encoding of base-64 encoding of the DER encoding of the OCSPRequest}

This is shamefully imprecise for an RFC, not even referencing the relevant specs for each.

--

To answer your first question, Base64 is needed because DER is binary, and URIs are defined in terms of characters -- and it's an exercise left to the reader [2] whether you can somehow get from one to the other reliably. It's also an exercise for every other reader you're trying to interoperate with, so the common practice is to make an explicit conversion before you get to this stage. Base64 takes care of this by transforming the binary data to US-ASCII, of which UTF-8 is a superset, and URIs operate on UTF-8 characters.

But "vanilla" base64 can produce three characters which are reserved characters in URIs: the slash, the plus, and equals [8]. These need to be percent-encoded because if they are used directly in URIs, they have special meaning.

Of course, if the OCSP RFC had just specified base64url encoding [9], a widely used variant which swaps the slash for an underscore, the plus for a minus, and allows for the omission of the padding that's denoted by equals signs, the double-encoding wouldn't be needed, because none of those characters are reserved in URIs.
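To make that concrete, a small Go sketch (the input bytes are arbitrary, chosen so the standard alphabet hits all three reserved characters):

  package main

  import (
      "encoding/base64"
      "fmt"
  )

  func main() {
      data := []byte{0xfb, 0xef, 0xff, 0xfe}

      // Standard alphabet: '+', '/', and '=' all appear, and all three are
      // reserved in URIs, so they'd need a second, percent-encoding pass
      // ("%2B", "%2F", "%3D") before going into a path.
      fmt.Println(base64.StdEncoding.EncodeToString(data)) // "++///g=="

      // base64url (RFC 4648 section 5): '-' and '_' replace '+' and '/';
      // the Raw variant also drops the '=' padding. Nothing left to escape.
      fmt.Println(base64.URLEncoding.EncodeToString(data))    // "--___g=="
      fmt.Println(base64.RawURLEncoding.EncodeToString(data)) // "--___g"
  }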

--

To answer your second question, slashes in URIs are fun. Though the most recent URI RFC goes through elaborate rules on when you're supposed to encode and decode [3] and what's supposed to be interpreted how, at the end of the day the URI is somehow consumed as an input to some other process where different rules may apply [5][6].

One of those different, customary "rules" is that percent-encoded slashes are treated as a malicious attempt to traverse outside the directory, so most web servers shut this down. Apache is one of the few that allows you to tune what to do in this case [7].

[1] https://tools.ietf.org/html/rfc6960#appendix-A.1 [2] https://tools.ietf.org/html/rfc3986#section-2.5 [3] https://tools.ietf.org/html/rfc3986#section-2.4 [4] https://tools.ietf.org/html/rfc3986#section-3.3 [5] https://tools.ietf.org/html/rfc3986#section-7.2 [6] https://tools.ietf.org/html/rfc3986#section-7.3 [7] http://httpd.apache.org/docs/2.4/mod/core.html#allowencodeds... [8] https://tools.ietf.org/html/rfc4648#section-4 [9] https://tools.ietf.org/html/rfc4648#section-5


Thank you for this preliminary report. I just want to say you are doing great work and a tremendous service to the public. We tolerate a few hiccups. And as always when something goes wrong, it is always more than one problem.


As I was reading the first few sentences, describing the slash collapsing, I was thinking to myself "oh no, I hope they don't just 'fix the glitch'". That behavior is so old and pervasive on the web, about the last thing I would have tried is turning off slash collapsing.


I'm not sure I understand how slash collapsing is affecting this. Slash is a reserved character, so presumably if the data was correctly encoded it should never have ended up in the URI in the first place?

(I guess this is more a question for the parent)


> I'm not sure I understand how slash collapsing is affecting this. Slash is a reserved character, so presumably if the data was correctly encoded it should never have ended up in the URI in the first place?

base64 uses 64 characters: A-Za-z0-9 (62) and two symbols, commonly '/' and '+'. (As well as a third symbol, '=', used at the end to handle padding.) That would work in a URI, most of the time, except if you happened to have a base64 encoding with two '/' next to each other.

A common fix for using base64 in URIs involves substituting a different pair of symbols instead of '/' and '+'.


The spec says the base 64 should be URL encoded, so why aren't the slashes turned into "%2F"?


They probably are.

But some applications will decode the percent-encoding too early in the process of normalizing, security-escaping, and processing the URL. Encoded slashes in URLs are problematic [1][2][3][4][5].

[1] https://httpd.apache.org/docs/2.4/mod/core.html#allowencoded... [2] http://stackoverflow.com/questions/3235219/urlencoded-forwar... [3] http://codeinthehole.com/tips/django-nginx-wsgi-and-encoded-... [4] http://stackoverflow.com/questions/3040659/how-can-i-receive... [5] https://groups.google.com/forum/?fromgroups#!topic/django-us...


I guess they must be using something in front of the code [0] (which does document the double slash issue). Probably should have used PathUnescape[1] on line 185 though.

[0] https://github.com/letsencrypt/boulder/blob/master/vendor/gi... [1] https://golang.org/pkg/net/url/#PathUnescape

PS. It does seem like a pretty bad idea to put data in the path instead of the query.


This is the pull request that triggered today's issue [1]. They introduced a custom object that overrides Go's default ServeMux, such that their code won't collapse adjacent slashes, unlike Go's default.

They did this because previously, they used Go's default, which collapsed adjacent slashes -- this broke stuff until they troubleshot it [2] and discovered that Cloudflare's OCSP responder, which they were using, actually documents that you shouldn't use the default ServeMux [3]. The commit that led to that condition only went in a month prior [4]. This is similar to an issue they had two years ago [9] that seems to have started this all.

Now the proposal is to strip the "leading slash" from the {ocsp-request} component of the incoming URI "{ocsp-uri}/{ocsp-request}" [5], but it would be far better to perform path canonicalization on the {ocsp-uri} and not the {ocsp-request}. It looks like they're relying on http.StripPrefix [6], which is an idiomatic Go way of hosting a server out of a subpath and returning 404 on any request not matching the prefix; this will be problematic without additional processing that gets slashes out of places they shouldn't be, while leaving them alone where they should be.
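For anyone unfamiliar with the Go APIs involved, here's a rough sketch of the two shapes (not Boulder's actual code; the handler, ports, and mount path are made up for illustration):

  package main

  import (
      "fmt"
      "net/http"
      "strings"
  )

  // Stand-in responder: whatever remains of the path is the encoded
  // OCSPRequest and has to reach the decoder byte-for-byte intact.
  var responder = http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
      payload := strings.TrimLeft(r.URL.Path, "/") // tolerate the AIA trailing-slash double slash
      fmt.Fprintln(w, "would decode:", payload)
  })

  func main() {
      // The shape that bit them: http.ServeMux sanitizes request paths and
      // answers anything containing repeated slashes with a 301 redirect to
      // a "cleaned" URL, rewriting slashes that belong to the base64 payload.
      mux := http.NewServeMux()
      mux.Handle("/ocsp/", http.StripPrefix("/ocsp", responder))
      go http.ListenAndServe("127.0.0.1:8080", mux)

      // The safer shape: hand the server a bare handler (no mux, so no path
      // cleaning), strip the mount prefix yourself, and leave everything
      // after it untouched.
      bare := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
          r.URL.Path = strings.TrimPrefix(r.URL.Path, "/ocsp")
          responder.ServeHTTP(w, r)
      })
      http.ListenAndServe("127.0.0.1:8081", bare)
  }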

For more fun with slashes and OCSP, see this source code for NSS [7], this bugzilla issue [8], this code [10], and this mailing list thread [11].

[1] https://github.com/letsencrypt/boulder/pull/2748 [2] https://github.com/letsencrypt/boulder/issues/2728 [3] https://github.com/cloudflare/cfssl/blob/master/ocsp/respond... [4] https://github.com/letsencrypt/boulder/pull/2689/files [5] https://github.com/letsencrypt/boulder/issues/2774 [6] https://golang.org/src/net/http/server.go and search for "func StripPrefix" [7] https://chromium.googlesource.com/chromium/third_party/nss/+... and search for "slash" [8] https://bugzilla.mozilla.org/show_bug.cgi?id=1010594 [9] https://github.com/letsencrypt/boulder/issues/884 [10] https://github.com/r509/r509-ocsp-responder/blob/master/lib/... [11] https://sourceforge.net/p/openca/mailman/message/31630541/


It does not matter. Due to how servers and most apps need to handle URLs, they are decoded very early in the process (e.g. %2F and %2f need to be treated the same).


This is what I don't understand either.


for reference, see "base64url" in this table: https://en.wikipedia.org/wiki/Base64#Variants_summary_table


So, blame the clients and users unfortunate enough to be using an implementation that only works 99.999% of the time?


Blame? It's more about curiosity.

The poster said "OCSP requests that use the GET method use standard base64 encoding, which can contain two slashes one after another", so I'm wondering if that's the case or it's actually "clients don't encode their base64". I didn't read through the RFC, but judging by the later statement of "OCSP request using the GET method is constructed as follows 'GET {url}/{url-encoding of base-64 encoding of the DER encoding of the OCSPRequest}'" that seems to be the case.


That means the OCSP Must-Staple extension [0] can be very dangerous given a bad implementation (like Apache's).

[0]: https://scotthelme.co.uk/ocsp-must-staple/


Thanks for the insight.

Would love to read a full postmortem of both the OCSP and issuance issues.


yo people - you guys really need to handle slashes properly!

They are quite important for www-stuff you know.

"a majority of the OCSP requests we were receiving were prepended with a slash"

Everything is unsafe - one has to make sure that external data is converted into a sane format internally, and never assume that external input is safe!


So in case this helps anyone, I had people complaining about strange OCSP errors all over the morning coming from my server (using apache httpd).

It turns out apache does practically everything to behave as dumb as possible in case of OCSP downtimes.

If the OCSP responder sends an error, Apache will staple that error as the OCSP reply (instead of using an old, still-valid OCSP reply). You can't make it behave sanely here, but you can at least tell it not to return the error by setting SSLStaplingReturnResponderErrors to off.

However, if the OCSP responder isn't available at all, Apache will fake its own OCSP error (sic!) and send it. This is controlled by the option SSLStaplingFakeTryLater, which defaults to on. So if your Firefox users get strange OCSP errors, it's most likely this. The doc for SSLStaplingFakeTryLater claims that this option is only effective if SSLStaplingReturnResponderErrors is set to on; however, that's wrong.

tl;dr set both of these options to "off"; then at least Apache won't staple any garbage into your TLS connection, and Firefox will try to reach the OCSP responder on its own, fail, and still accept the connection. Yes, that's all pretty fucked up.
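For reference, a sketch of what that looks like in an Apache 2.4 mod_ssl configuration (the stapling cache path and size are just the usual documentation-style example; SSLStaplingCache belongs in the global server config, and defaults differ between releases, so check the docs for your version):

  # OCSP stapling, with the two mitigations discussed above
  SSLUseStapling on
  SSLStaplingCache "shmcb:/var/run/ocsp_stapling(128000)"

  # Don't staple a CA error response in place of a previously obtained reply
  SSLStaplingReturnResponderErrors off

  # Don't synthesize a "tryLater" OCSP error when the responder is unreachable
  SSLStaplingFakeTryLater off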


Additionally, error responses from the OCSP servers are cached for 600 seconds - use SSLStaplingErrorCacheTimeout to lower this to a saner value: https://httpd.apache.org/docs/trunk/mod/mod_ssl.html#sslstap...


Thanks for pointing this out!

Going through the documentation, another thing that surprised me was the SSLStaplingStandardCacheTimeout setting.

If I understand this correctly, by default Apache will only cache OCSP responses for 1 hour, even if they are still valid for days. I guess increasing this to 1 day or something would make sense as well.


> If I understand this correctly, by default Apache will only cache OCSP responses for 1 hour, even if they are still valid for days. I guess increasing this to 1 day or something would make sense as well.

Yeah, that's another major problem. But increasing that doesn't really fix anything in a reasonable way. There shouldn't be any cache timeout, this option doesn't make any sense. It should cache as long as the reply is valid and replace it in time before it expires.


By the way, I checked the source and it appears that setting SSLStaplingStandardCacheTimeout to a large value (larger than typical OCSP reply validity) effectively creates this behavior.

Apache checks whether the cached reply is still valid and, if it's not, attempts to renew it.

At least in 2.4.10 shipped in Debian Jessie. Relevant code is in modules/ssl/ssl_util_stapling.c, stapling_cb()


Was fun finding this out during a random server cycle. Turns out, Caddy doesn't appreciate the ACME server being down, and refuses to start :)

https://github.com/mholt/caddy/issues/1680


Wow @ that close comment:

> So, this is not a bug and all is working as intended.

Caddy folks had better never restart the caddy service (or server) while LE happens to be down, even if you already have a valid cert!


That's going to be a limiter for adoption. Hopefully @mholt reconsiders.

Update: Mholt pushed a change where caddy only refuses to start if the cert is expiring in 7 days or less. https://github.com/mholt/caddy/commit/410ece831f26c61d392e0e...


Hm, yeah I hope so too :/ Been using Caddy in prod for a year now, this issue, rare as it may be, could single-handedly get me back on nginx.

Having the server be unable to start through circumstances outside of the system's control is just such a huge no.


Same here. I was very pleased with Caddy so far, but it being tightly coupled to LE being up, despite having certs cached, is a no-go for our production systems; if this stays like that, it would make me go back to nginx for the services I've used Caddy for so far.

edit: looks like the dev added a fix to only refuse to start if the cached certs are dangerously close to expiring. That satisfies me, and I'll continue using Caddy.


Call me crazy but I think it's a little silly for an https server to refuse to start just because its certificate is invalid. I recognize that some folks might like that failure mode however.


Why did you switch away from nginx?


Not having to deal with certificate renewal is a big deal.


Is it? I spent one day getting certbot up and running, and ever since then it's been pretty much a done deal.


That's 1 day that I didn't have to spend, even better when dealing with many instances.


Okay, I agree, with two caveats:

1. The cost does not scale with the number of instances, since it is the one-time cost to create the configuration package.

2. If you decide to go for Caddy instead, you'll have to spend the same time, if not more, learning Caddy.


I am running nginx reverse-proxying to a python API right now. Dealing with certificate renewal is a matter of running a daily cronjob issuing 'certbot renew'. If it works it replaces the fullchain.pem certificate, and that's it, easy peasy.

Am I missing something?


And by running Caddy instead, I have one less piece to monitor and worry about.

The way Let's Encrypt works, it makes a lot of sense to have the functionality be part of the web server.


> by running Caddy instead, I have one less piece to monitor and worry about

No, with Caddy you have several vaguely related pieces glued together with superglue.

> The way Let's Encrypt works, it makes a lot of sense to have the functionality be part of the web server.

I think this thread is pretty much proof that that approach will bite you in the ass.


> I think this thread is pretty much proof that that approach will bite you in the ass.

What bit me here is the fact that I'm running alpha software instead of a battle-tested web server; I'm doing so willingly, with full awareness of the risks that that entails.

Drawing the conclusion you did from the variables at play is shortsighted. If anything bites people in the ass, it's prejudice and shortsightedness. I wouldn't want you handling my ops/infrastructure.


> What bit me here is the fact that I'm running alpha software

This wasn't caused by a bug. This was a deliberate decision to fail to start if the certificates on-disk were <= 30 days away from expiring and the CA can't be contacted.

> Drawing the conclusion you did from the variables at play is shortsighted

Using caddy is the web-server-stack equivalent of "putting all your eggs in one basket". If one thing about it isn't working the way you want, you have to either a) replace it completely or b) work out how to disable the bit that's not working how you want, and replace that part of it.

> Drawing the conclusion you did from the variables at play is shortsighted

- People use a piece of software that serves as both ACME TLS certificate client and web server

- Said software by design won't start if the CA can't be contacted 30-days out from expiry

The conclusion I drew is that such integration leaves the operator with less control than if they followed a separation-of-concerns approach, and left web serving to a web server and TLS certificate renewal to an ACME client. The former doesn't need to care about how old the certificates are, just use what it's given.


You're drawing conclusions from unintended behaviour, which has now changed (and a release has been issued).


I still think that refusing to start if the cert expires in 7 days or less is still an issue if Let's Encrypt is down.

There should be at most a warning but it should start. Otherwise you end up with an external dependency that can cause your web server to not start through no fault of your own.


> unintended behaviour

Ahem. https://github.com/mholt/caddy/issues/1680#issuecomment-3026...

Emphasis mine:

> So, this is not a bug and all is working as intended.


Have you noticed that the bug has been fixed?


why do you keep calling it a bug?


> The way Let's Encrypt works, it makes a lot of sense to have the functionality be part of the web server.

That's an odd thing to say in this particular conversation thread...

Why would you want to tightly couple your webserver to the availability of another service provider?


It's not tightly coupled any more than it would be with certbot. The issue I filed is an issue because it's unexpected behaviour (despite early closure, it has actually now been fixed).


The simple config syntax and sane defaults are nice in Caddy. For example, 3 lines of config nets you an HTTPS server with HTTP/2 support and an A rating from Qualys for the SSL setup.


> sane defaults

Failure to start if a CA is down is a "sane default" ?


Yes, that's exactly what I meant. You can tell because I didn't present any examples. And because one issue invalidates everything else. And I didn't express my opinion on this issue up thread.


I switched from nginx to caddy for my (extremely low-traffic) server because I was tired of having to copy & paste a bunch of SSL setup any time I was configuring a new domain.


I'm done with Caddy - this is security theater at its finest. Back to nginx we go. Why did I pick Caddy? Because it was simple/easy/fast to set up.


I moved from Caddy to Traefik (https://traefik.io/) several months ago. Granted, nginx has years (decades?) on some of these newer webserver/reverse-proxies, but I have been happy with all the built in niceties of traefik so far (single binary, etc), and haven't really experienced any negatives.



Nope.


Or someone forks it


Then wait until you see how caddy handles fully qualified domain names!

Caddy will just refuse to even handle them.

Every single other server on this planet handles them properly, but caddy doesn’t – and mholt considers that working as intended.

Try out: https://www.google.co.uk./ https://www.microsoft.com./en-us/ https://www.amazon.com./ serve the page directly; https://www.facebook.com./ redirects to the relative domain

and then https://caddyserver.com./ (That said, traefik is equally dumb, as seen with https://traefik.io./ )


Is there a bug for this? That's out of spec


Yes there is, @mholt said it’s a WONTFIX. You're supposed to script it yourself.

https://github.com/mholt/caddy/issues/1632


I submitted the issue for it (https://github.com/mholt/caddy/issues/1632) and it was immediately closed. Reason (after asking): we only keep issues open that are on our TODO list.


I clicked the link, saw it was closed, and thought, "Wow, these guys are fast". Then I read it....


Caddy restarts gracefully with zero downtime. If you're killing the process and starting one anew, you're doing it wrong. Use signal USR1 to gracefully apply new configuration changes. Failed reloads fall back to the current working configuration without any changes or downtime. https://caddyserver.com/docs/cli#usr1


This misses the overall point.

If I am hosting 5 sites on caddy, and add a 6th one, I restart the server. If the 6th site doesn't work (for example, if DNS didn't resolve for lets encrypt), the other 5 sites which were working before the restart, all fail to start as caddy completely crashes.

This is basic resiliency you'd expect from your web server. Why should the other 5 sites fail to start if their configs are completely valid?


> If I am hosting 5 sites on caddy, and add a 6th one, I restart the server.

No, you don't. You reload the server, not restart it. Restarting a web server should only be required if you get an upgrade for it (or for OpenSSL etc.)


OK, there's a 0-day patch for OpenSSL, without it my users are vulnerable to RCE. Why can't I restart if LE's ACME is down?


This goes wholly against most cattle-not-pets devops philosophy though, right?


How so? In Chef:

   notifies :reload, "service[caddy]", :delayed


The idea here is that the server can't be scaled up or down. I suggest googling "cattle vs pets". If you know how to make a single process scale horizontally across additional hosts or scale up in alternative datacenters with a chef service notification, I'd pay you money to tell me how.


So, um: devops and cloud architecture is my job. I think I know what "cattle vs pets" is referring to. Nothing in this thread has anything to do with one versus many processes or one versus many nodes, nor does it have anything to do with manual configuration of anything. Rewrite the configuration based in orchestration data or, in extremity, upload a new version of a cookbook that handles that new site; the hypothetical sixth site is added and the service kicked over without human interaction. Scaling is an orthogonal concern.


What if there's a reason to restart the server or spawn a new one?


Scaling, upgrading, disk issues, just migrating to new place... and foremost https://en.m.wikipedia.org/wiki/Fallacies_of_distributed_com... Also nowadays with Docker around I rarely see server/app bundles supporting graceful reload.


This type of thing is why I (and I'm sure others) have literally zero intention of using tools like this.

Separation of concerns means you are in control, and using separate layers means you can swap one out when a vulnerability/show-stopper bug is discovered.

What exactly do you do when your look-ma-no-hands server won't even start?

Edit: maybe "all-things-to-all-people" was the wrong term to use here.


That's a bit of a strawman. Caddy is far more lightweight than apache and even nginx. It just happens to do something they don't do.


> That's a bit of a straw man

From https://caddyserver.com:

> The Most Beloved Server

They started the hyperbolic claims, not me.


I have no idea how this is relevant to the conversation, nor who said anything about hyperbolic claims.

You're claiming "caddy does everything". As opposed to what? If you're running apache or nginx, your server does far more than caddy, so you're quite simply mistaken.


> You're claiming "caddy does everything". As opposed to what?

Serving content over http(s), and obtaining TLS certificates are two very different tasks.

> If you're running apache or nginx, your server does far more than caddy

Far more, that is directly related to serving content over http/https.


> Serving content over http(s), and obtaining TLS certificates are two very different tasks.

Except that with let's encrypt one actually needs the other.


And best part, according to the developer this is working as intended. A webserver with perfectly valid cached certificates refusing to start.


Updating to v0.10.3 should fix this problem and allow Caddy to start, provided your cert isn't less than 7 days from expiring: https://github.com/mholt/caddy/commit/410ece831f26c61d392e0e...


Indeed, comparing the 0.10.3 fix to the situation with OCSP stapling in Apache is illustrative.

This fix gets almost everybody where they should be: the next time the same thing happens (and it will), Caddy isn't a problem for three weeks, which is definitely enough time. Meanwhile we're going to see the same Apache crappiness for OCSP again each time until someone over there finally snaps out of it and asks someone who actually knows how OCSP stapling was supposed to work.



I think LE is a huge boon to the internet. But I would really love for someone like Amazon, Google, Facebook, or Microsoft to set up a separate provider that implements the same thing. Redundancy is super important here and clearly just one organization can't guarantee 100% uptime.


Alternatively, maybe Let's Encrypt ought to Chaos Monkey this up and be down for 4 random hours every month or something on purpose. Or if you want to make very sure you don't turn people away, be down for 4 hours every month for any cert that has been in Let's Encrypt for more than a month or two, so you don't turn away new users. Because if you have a problem with a brief outage, the problem is in the user code, not Let's Encrypt.

It doesn't matter how redundant you make Let's Encrypt, the problem could always lie too close to the user code to be resolved, e.g., the data center hosting your server loses internet. 100% uptime in the sense of "system A can always reach service B" is impossible, even if service B never "goes down" strictly speaking.


> Alternatively, maybe Let's Encrypt ought to Chaos Monkey this up and be down for 4 random hours every month or something on purpose.

And then gain a reputation for being unreliable?

> Or if you want to make very sure you don't turn people away, be down for 4 hours every month for any cert that has been in Let's Encrypt for more than a month or two, so you don't turn away new users. Because if you have a problem with a brief outage, the problem is in the user code, not Let's Encrypt.

Most users don't care. If it's not working reliably for them, they'll just move to something that does. Maybe there is an issue in their code that should be addressed, but I seriously doubt they'd care to have that pointed out when they're suddenly offline.

Anyone doing proper testing of their software/infrastructure should have a testing environment anyway. I'd take your proposal and modify it to: Let's Encrypt should offer testing servers which are down for well defined periods throughout the day that people can use to test their platform against.


"And then gain a reputation for being unreliable?"

If they tell people what they are doing and why, and only do it for established certificates, I'm not sure that would happen.

"Let's Encrypt should offer testing servers which are down for well defined periods throughout the day that people can use to test their platform against."

Who would use them? Anyone who cares enough to simulate LE failure is presumably already doing it.


I'd vote for Gandi.net or Github to do it. FB definitely not, MS rather not, Google not if I could avoid it.

Amazon maybe.


Google already can mess with your domain quite a bit (how many people use Google's DNS servers?). They also have a registrar and a CA already that your browser trusts. Amazon does the same thing. Don't know about MS/Azure, but if they don't now, I'm sure they will soon. FB is the only one that I don't think has either, but they are also the ones I think are least likely to actually maliciously mess with domains and TLS. It's just not in the business model.

Besides, this doesn't have to be a thing that's under the direct control of any of these organizations, but rather a separate entity that's just financed by them.

Gandi or GitHub would be cool too.


What's the threat model? I don't trust FB either, but with CA transparency and CAA it seems safe enough.


"We want to make sure our users are protected from malicious links so we proxy them when clicked"

Very unlikely, most feasible thing I could think of


Amazon does, though limited to its own services (which is, frankly, to be expected). AWS Certificate Manager


I am talking about it being wide open, not just for AWS, and implementing the ACME protocol.


I think the OP is asking for / looking forward to big tech like them hosting the infrastructure for higher availability. Basically like a mirror. To end users they are just the same server.

On a side note, is the LE infrastructure globally distributed, or does it currently all reside in US/West and/or US/East? Is Mozilla currently the one hosting?


ACM is nice, but it does require the manual step of clicking a link in a verification email.


Yes but it also issues certs for a year, which helps alleviate that email link issue.


If the AWS ACM certificate is in use at the time the old one is close to expiration, ACM will automatically renew it without user intervention. http://docs.aws.amazon.com/acm/latest/userguide/configure-do...


ACM is kind of limiting though. You can only use it with 4 of AWS's services, none of which is EC2 directly, which limits use cases, I would think.


And then Route53 doesn't support CAA.


Redundancy yeah! But does it have to be one of those 4? I'm getting slightly worried and very bored by their dominance in infrastructure.


It's not only a problem with certificate issuance - their OCSP servers are also down. This caused an issue on one of my sites where I was using OCSP stapling: normal browser connections were failing, but not tools like curl (which don't ask for the OCSP response over SSL).


What's the typical validity period for OCSP responses with Let's Encrypt? Shouldn't the stapled responses continue working for at least a couple hours even after Let's Encrypt goes down?


1 week, so most servers likely won't be affected unless the outage goes on for a really long time.


Not sure how this works.

I have OCSP stapling turned on in Apache and Firefox wouldn't load my page when Let's Encrypt OCSP servers went down.

My monitoring shows that last stapled response had 4 days of validity left. So it seems that Apache immediately threw away cached OCSP responses.


Yeah, seems like Apache handles OCSP server outages pretty poorly. See: https://news.ycombinator.com/item?id=14375334


For what it's worth, Caddy is the only server that will locally cache the staples (and manage them) automatically. In other words, Caddy sites were not affected by this OCSP downtime.



On the plus side, as a side effect of this event, most libraries will start handling this case in more robust ways.


Usually takes a serious failure first. Then, they start doing real robustness... on just that one thing. ;)


It seems that Let's Encrypt is back up.

I think, in the past, mods have put an extra down-weight on "X is down" stories, once 'X' is back up.

Since this discussion now has interesting stuff related to Let's Encrypt—and products which use Let's Encrypt—I hope the mods are willing to forgo the down weight, and instead just change the post title to something like "Let's Encrypt Was Down".


I'm still getting 504s and timeouts from https://acme-v01.api.letsencrypt.org/acme/new-reg and https://acme-v01.api.letsencrypt.org/directory (trying from Lisbon, PT)

Edit: Akamai is issuing the 504 when hitting https://acme-v01.api.letsencrypt.org/acme/new-reg so I guess the origin servers are still overloaded...?


I'm experiencing the same from Kuala Lumpur.


FYI, acme-staging.api.letsencrypt.org is working ok for me but the production endpoints are still timing out.


Nothing against letsencrypt, but dependencies on services being online are fragile and will break. Their 90-day limit makes it worse. Saying it's for security is like saying 1- or 3-year certs are somehow insecure, which is not the case. It's one more headache for an admin to think about, even if automated.

We really should reexamine the CA system. Self signed certs should have more value than they currently do, and identity can be verified by out of band methods. Surely it's worth exploring.

What we have now effectively disempowers individuals and centralizes essential services which cannot be good in the long run.


You can run your own ACME provider, the code is open source.

Nothing stops you from running a CA that offers 1 year certs over ACME. Or just providing one that also offers 90 day certs. If people will trust that CA is another question.

The automation of LE is not the problem either. Properly automated systems would extend/renew the cert well before they are invalid, almost every LE guide I know mentions this on grounds of "what if LE is down".

The only libraries and services affected are those who do not properly code for an external service provider being temporarily offline.

The problem with OOB verification of certs is that it's slow and inefficient for almost all methods this can be done with. And it doesn't scale either.

Imagine if everyone wanted to OOB verify the Google certs on the same day.


> Nothing stops you from running a CA that offers 1 year certs over ACME. Or just providing one that also offers 90 day certs. If people will trust that CA is another question.

You know that's BS. All of your users would get certificate errors, that's what's preventing you from running your own CA.


Read carefully.

"if people wo trust that CA is another question"

Of course you won't get trusted but that's not the issue.


Long lived certs aren't necessarily insecure but revocation is a real issue that doesn't have a good solution outside of short cert lifetimes. People don't actually revoke their certs because it takes effort and costs money, browsers don't reliably check revocation lists if they do at all, and checking for revocations is bad for performance and privacy.


I don't disagree about CAs, but LE suggests to renew certs every 60 days. So unless this outage lasts a month or some acme tool is poorly made it shouldn't affect anything except new registrations.


The only exception is OCSP data. If you're stapling, I think that lasts a week, but I'm not 100% sure.

Anyway, in either case, your site should absolutely be able to weather CA downtime like this.


OK done: we move to self signed certs.

- Someone connects to wifi.

- The wifi gives a DNS server.

- The DNS server says some IP is foo.com.

- foo.com isn't actually the foo.com you expect [1], but it's got a self signed DV cert so you connect to it, and give some bad person your data.

That's why we don't move to self signed certs.

[1] Of course, if you want to assert foo.com is actually the 'Foo, Inc' you were expecting that's a job for EV.

Disclosure: I made https://certsimple.com that focuses on simplifying the identity verification process for EV certs.


TLSA DNS records coupled with DNSSEC could eliminate the need for CAs but we still have a long way to go until we can rely on it.


DNSSEC still had a 1024-bit RSA root after the Web PKI (at least Mozilla-flavored) no longer did.

Are the TLDs audited like CAs? Having to change your domain name to evade a TLD's bad security practices is more disruptive than switching CAs.


Then you just move all the WoT issues of the CA system into DNS, no?


Yes. But you could argue that securing DNS is necessary anyways and using it for TLS is just the next step.


What do you mean by necessary?


Things like SPF and SSHFP records are still unprotected without DNSSEC. HTTP, IMAP/POP3, and SMTP may be safe by themselves but it would be nice to have the others covered as well.


You shouldn't renew the 90 day cert on day 89 anyway. As long as Let's Encrypt isn't down for more than a day I don't see the problem.


The CA system isn't perfect, but we haven't moved off of it because there are no viable alternatives. Self signed certs are completely useless.


We've been using LetsEncrypt on dozens of our servers for several months now and it's worked flawlessly both in set-up and in operation!! Set up was quite easy thanks to A LOT of work by the team there.

To make HTTPS simple and free to set up is a FANTASTIC mission & the team has overall built a SUPERB system. Congratulations & thanks for addressing this issue quickly, looks like it's already solved, good work!


This is also why you don't wait until the last day before renewing. (But no-one does that, right?)


If you use one of the myriad of automated tools for using LE, you will get your cert renewed as early as 30 days before it expires. So right now the issue should only be with new domains getting certs.

If you renew LE certs manually, first what is wrong with you and don't you like yourself? Second, at that point it's no different than NameCheap going down and you getting your cert from 1and1 instead.


> So right now the issue should only be with new domains getting certs.

Certainly should have been, but it looks like a lot of libraries are choking on the outage even if the old cert is 100% valid.


I use https://github.com/lukas2511/dehydrated which is a bash script for doing this. It doesn't choke because it's just a cron job.

I also use dokku's LE plugin which is... also a shell script cron job. Maybe that's the way to do this. I know Caddy is an exception case for this.


Wish someone would write one of those automated tools for Google App Engine-hosted apps.

That renewal process is exactly like regex. Once every three months I need it, and have to spend an hour re-learning it from scratch.


Apparently they're working on that: https://issuetracker.google.com/issues/35900034 (For some reason it seems you need to sign in to view this bug; not sure why.)


>But no-one does that, right?

Sadly, reality is not that nice, and I can already feel a disturbance in the force in the form of "Why Let's Encrypt is bad and you should buy 1-year certs" blog posts all over the place.


If you have your nginx HTTP vhost configured for serving {{ domain }}/.well-known/acme-challenge/ from /var/www/{{ domain }}, then getting a new cert and having it automatically renewed is as simple as running:

certbot certonly --webroot --webroot-path /var/www/{{ domain }} --agree-tos -m {{ email }} --domain {{ domain }} --renew-hook "service nginx reload"

If certbot was installed with pip, a cron job will be automatically created, and it will run with whatever arguments the certificate was first obtained with (--renew-hook is obviously important).


That assumes the only sites you want to use SSL with are sites with a web root. Many sites use Python's Flask or node.js or similar, where you instead have an HTTP server running on a high port and proxy requests for certain domains to those ports. Such a script won't work with a setup like that.


Just spent an hour debugging why the development scripts work but production doesn't. Good reminder to configure some kind of notification from AWS Lambda for when execution times out as well, not just on errors.


Clearly it is high time for an EncryptWeShall nonprofit with a wholly separate implementation and team and all the tooling adjusted to randomly pick between the two.


Traefik is having a problem similar to Caddy's, where Let's Encrypt being down prevents it from booting.

https://github.com/containous/traefik/issues/791


Holy frijoles, I was pondering exactly this scenario earlier today while messing around with Caddy and LE, as in: do I want to take a (mostly) static and offline process and directly inject it as another moving part into the runtime world, for the sake of its cost, convenience, and overall worthwhileness?

Is there a good alternative, without this sort of process? Or back to the sharks?


This is unrelated to this outage, but in the past when renewing, I've always had problems resolving acme-v01.api.letsencrypt.org - am I alone with this issue?

I'm running a local dnscache instance (djbdns), perhaps that has something to do with it?


Working for me. I just renewed some certs using the DNS method without issue.


Well, it's down again


Short version - they fixed an issue but the fix had dependencies


oh my god, a website on the internet is currently down, let's discuss that.


"High assurance datacenter" High assurance my ass. Those don't go down unless there's a DDOS or catastrophic failure (often several). Then, they're right back up. People need to stop misusing this label. Another is "high-assurance" certs from vendors that get compromised or subverted easily. Only one high-assurance CA that I know of. It's not around for business reasons, though.

http://www.anthonyhall.org/c_by_c_secure_system.pdf


The issue here doesn't seem that a data center went down, but that there was a bug which caused downtime.


The issue I brought up is nothing about it is high-assurance except maybe tamper-resistance on a HSM involved. It's a term abused in the certificate market a lot. An easy hint to tell is if it's a product developed slowly in a safe subset of C, Java, or Ada. Those have the tooling needed for highly-robust implementations. Then look at the OS to see if it's something extremely hardened or unusual (eg separation kernel RTOS). The protocols will be ultra-simple with a lot of high-availability and easy recovery. Almost no modern tooling will be in the TCB for configuration or deployment unless it's simple. Most of it isn't.

I'm not seeing any of this in the reporting that made it here. Definitely not high-assurance. Likely compromised by high-end attackers either for specific targets or in some general way. It will help protect in its intended way against the rest, though. Enormously positive development. Just not high-assurance security at any level.



