Josh from Let's Encrypt here. First, my apologies for the trouble this has caused.
I want to offer people here an early root cause analysis. I say early because we have not entirely completed our investigation or a post-mortem.
OCSP requests that use the GET method use standard base64 encoding, which can contain two slashes one after another. While debugging why a small number of OCSP requests consistently failed, our engineers observed a rather odd, but standard, web server behavior: when a server receives a request with multiple slashes one after another, it will collapse them into a single slash. This caused our OCSP responder to consider requests that had this unusual encoding quirk invalid and to respond with a '400 Bad Request' response. The fix seemed quite simple: disable the slash collapsing behavior.
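To make the quirk concrete, here is a minimal Go sketch. The six input bytes are made up for illustration and are not a real OCSPRequest, and `path.Clean` is only a stand-in for whatever path cleaning a given server applies.

```go
package main

import (
	"encoding/base64"
	"fmt"
	"path"
)

// encodeStd returns the standard base64 form of raw (alphabet includes '/' and '+').
func encodeStd(raw []byte) string {
	return base64.StdEncoding.EncodeToString(raw)
}

// collapsed mimics a server that cleans the request path before routing.
// path.Clean is an assumption here, used as a stand-in for server behavior.
func collapsed(requestPath string) string {
	return path.Clean(requestPath)
}

func main() {
	// Illustrative bytes only: runs of 0xff encode to runs of '/'.
	raw := []byte{0x30, 0x03, 0xff, 0xff, 0xff, 0xff}
	encoded := encodeStd(raw)
	fmt.Println("encoded:  ", encoded)
	fmt.Println("collapsed:", collapsed("/"+encoded))
}
```

Once the run of slashes is collapsed, the remaining text no longer decodes to the original bytes, which is the failure described above.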
Unfortunately, stopping this behavior surfaced a more serious issue. The AIA extension that we include in certificates we issue contains a URI for our OCSP server. This URI contains a trailing slash. According to RFC 6960 Appendix A.1, an OCSP request using the GET method is constructed as follows: 'GET {url}/{url-encoding of base-64 encoding of the DER encoding of the OCSPRequest}', where the url 'may be derived from the value of the authority information access extension in the certificate being checked for revocation'. A number of user agents take this quite literally and will construct the URL without inspecting the contents of the AIA extension, meaning that they ended up with a double slash between the host name and the base64 encoded OCSP request. Before we disabled slash collapsing this was fine, as the web server was silently fixing this problem. Once we stopped collapsing slashes we started seeing problems.
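A rough sketch of how a client ends up with the double slash, and one way it could avoid it. The host name `ocsp.example.test` and both function names are hypothetical; this is not taken from any particular user agent.

```go
package main

import (
	"fmt"
	"strings"
)

// naiveURL follows the RFC template literally: {url} + "/" + {request}.
func naiveURL(aia, encodedReq string) string {
	return aia + "/" + encodedReq
}

// carefulURL drops trailing slashes from the AIA value before joining.
func carefulURL(aia, encodedReq string) string {
	return strings.TrimRight(aia, "/") + "/" + encodedReq
}

func main() {
	aia := "http://ocsp.example.test/" // hypothetical AIA URI with a trailing slash
	req := "abc"                       // placeholder for the encoded request
	fmt.Println(naiveURL(aia, req))
	fmt.Println(carefulURL(aia, req))
}
```

With slash collapsing on, the server quietly turns the first form into the second; with it off, the request component arrives with an extra leading slash.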
From our OCSP server's perspective a majority of the OCSP requests we were receiving were prepended with a slash and we were unable to decode them so we'd respond with a '400 Bad Request' response and move on. This coincided with a large number of previously cached responses on our CDN expiring, causing us to start getting hit with a large number of requests. Because we were responding with '400 Bad Request' responses we were setting explicit no-cache headers which meant we had a near 0% cache (CDN) offload rate and were hit with the full brunt of our OCSP request load at our origin servers. This caused our whole infrastructure to get bogged down.
Just a quick question. Does this mean that if your OCSP servers were to go down, a lot of SSL enabled websites and applications will stop working? Seems like a serious single point of failure for modern day internet. I was always under the assumption that clients do not have to contact the CA (every time?) before a TLS handshake takes place.
OCSP Stapling seems to be the way to mitigate this problem, but not all web servers implement it (for instance lighttpd does not). Any recommendation from Let's Encrypt on this issue?
This is exactly the problem with OCSP. There's no way to tell if the remote server is down, or if a malicious actor sitting in your path is blocking it. So your browser can either a) make it super easy for all your OCSP-using sites to appear down, which will encourage users to use other, non-OCSP, sites, or b) silently fail, which makes the entire exercise pointless.
Stapling only partially mitigates this, as it doesn't currently work with intermediate certs, and at this point most sites have at least one intermediate cert.
RFC 6066 specifies that you can only have one certificate in an OCSP response - as with intermediate certs you need to be able to respond with a chain, this does not work. RFC 6961 defines a multiple response capability, but my understanding is that currently this is not sufficiently widely implemented to be useful yet.
Thanks! I thought it's enough if the stapled response contains information only about the intermediate cert, and the browser would accept that as good enough, if the chain it got in the handshake is valid.
Some have argued that this is why CRL and (especially) OCSP are useless pieces of security theater: they don't actually protect against a crafted attack because they fail-open in the very situations that a determined adversary can trigger, so they only "protect" in situations where no real threat exists. It's simply feel-good bookkeeping.
Adam Langley, working on Google Chrome [1][2][3], has been very vocal about OCSP's faults, and Chrome began using its own auto-update to ship an aggregate of revocations of high-value certs directly to browsers out-of-band. Despite this being another famous instance of Chrome going against the grain of other browser vendors, I believe this was the correct solution: offering better protection for a curated subset of sites vs. pretending to -- but not actually -- protecting all sites.
> I believe this was the correct solution: offering better protection for a curated subset of sites vs. pretending to -- but not actually -- protecting all sites.
This paper -- the CRLite proposal -- is wonderfully well thought-out, experimentally tested, and meets the design goals much better and more elegantly than any other attempt to solve the certificate revocation problem.
Looks like it was posted here and got very little traction [1]; a shame. But it will be presented in a few days at the IEEE Symposium on Security and Privacy [2]. I hope it will get the coverage and examination it deserves.
The one app chain (complex code signing) I worked on with OCSP, we defaulted to failsafe, but it could be overridden in the 'main' (enterprise CMS) app. The installer required OCSP or wouldn't install.
Basically the first and last mile were hard fails but everything in between was advisory if the signature checked out.
1. Spec question: Why does the request need to be both base 64 and URL encoded? Why not just URL encoded? Only reason I can think of is for shorter/prettier URLs?
2. Implementation question: Shouldn't the slashes be URL encoded as "%2F"? "url/ABC/DEF" could mean "url" + "ABC/DEF" or "url/ABC" + "DEF". Multiple slashes are collapsed by default because path components shouldn't contain them.
Josh referred to RFC 6960 Appendix A, but his post didn't make it apparent that his description is an exact quote from the spec [1]:
An OCSP request using the GET method is constructed as follows:
GET {url}/{url-encoding of base-64 encoding of the DER encoding of the OCSPRequest}
This is shamefully imprecise for an RFC, not even referencing the relevant specs for each.
--
To answer your first question, Base64 is needed because DER is binary, and URIs are defined in terms of characters -- and it's an exercise left to the reader [2] if you can somehow get from one to the other reliably. It's also an exercise for every other reader you're trying to interoperate with, so the common practice is to make an explicit conversion before you get to this stage. Base64 takes care of this by transforming the binary data to US-ASCII, of which UTF-8 is a superset, and URIs operate on UTF-8 characters.
But "vanilla" base64 can produce three characters which are reserved characters in URIs: the slash, the plus, and equals [8]. These need to be percent-encoded because if they are used directly in URIs, they have special meaning.
Of course, if the OCSP RFC had just specified base64url encoding [9], a widely used variant which swaps the slash for an underscore and the plus for a minus, and allows for the omission of the padding that's denoted by equals signs, the double-encoding wouldn't be needed, because none of those characters are reserved in URIs.
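For comparison, a small Go sketch of the two approaches using only the standard library. The function names are mine, and the input bytes are arbitrary values chosen to hit the problematic characters.

```go
package main

import (
	"encoding/base64"
	"fmt"
	"net/url"
)

// stdThenEscape is the double encoding: standard base64, then percent-encoding
// of the result as a single path segment.
func stdThenEscape(raw []byte) string {
	return url.PathEscape(base64.StdEncoding.EncodeToString(raw))
}

// urlSafe uses the unpadded base64url alphabet, which needs no further escaping.
func urlSafe(raw []byte) string {
	return base64.RawURLEncoding.EncodeToString(raw)
}

func main() {
	raw := []byte{0xff, 0xff, 0xff} // arbitrary bytes that encode to slashes
	fmt.Println(base64.StdEncoding.EncodeToString(raw))
	fmt.Println(stdThenEscape(raw))
	fmt.Println(urlSafe(raw))
}
```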
--
To answer your second question, slashes in URIs are fun. Though the most recent URI RFC goes through elaborate rules on when you're supposed to encode and decode [3] and what's supposed to be interpreted how, at the end of the day the URI is somehow consumed as an input to some other process where different rules may apply [5][6].
One of those different, customary "rules" is that percent-encoded slashes are just maliciously trying to path outside of the directory, so most webservers shut this down. Apache is one of the few that allows you to tune what to do in this case [7].
Thank you for this preliminary report. I just want to say you are doing great work and a tremendous service to the public. We tolerate a few hiccups. And as always when something goes wrong, it is always more than one problem.
As I was reading the first few sentences, describing the slash collapsing, I was thinking to myself "oh no, I hope they don't just 'fix the glitch'". That behavior is so old and pervasive on the web, about the last thing I would have tried is turning off slash collapsing.
I'm not sure I understand how slash collapsing is affecting this. Slash is a reserved character and presumably, if the data was correctly encoded, should never have ended up in the URI in the first place?
> I'm not sure I understand how slash collapsing is affecting this. Slash is a reserved character and presumably, if the data was correctly encoded, should never have ended up in the URI in the first place?
base64 uses 64 characters: A-Za-z0-9 (62) and two symbols, commonly '/' and '+'. (As well as a third symbol, '=', used at the end to handle padding.) That would work in a URI, most of the time, except if you happened to have a base64 encoding with two '/' next to each other.
A common fix for using base64 in URIs involves substituting a different pair of symbols instead of '/' and '+'.
But some applications will decode the percent-encoding too early in the process of normalizing, security-escaping, and processing the URL. Encoded slashes in URLs are problematic [1][2][3][4][5].
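Go's own `net/url` is a handy illustration of the early-decoding problem. In this sketch (the host name and the helper `paths` are made up), the parsed `Path` field has already lost the distinction between `%2F` and `/`, while `EscapedPath()` still has it.

```go
package main

import (
	"fmt"
	"net/url"
)

// paths parses rawURL and returns the decoded path and the escaped path.
func paths(rawURL string) (decoded, escaped string, err error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", "", err
	}
	return u.Path, u.EscapedPath(), nil
}

func main() {
	decoded, escaped, err := paths("http://ocsp.example.test/AA%2F%2FBB")
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println("decoded:", decoded) // encoded slashes now look like path separators
	fmt.Println("escaped:", escaped) // original percent-encoding preserved
}
```

Anything downstream that routes or cleans on the decoded form will treat the encoded slashes as ordinary separators.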
I guess they must be using something in front of the code [0] (which does document the double slash issue). Probably should have used PathUnescape[1] on line 185 though.
This is the pull request that triggered today's issue [1]. They introduced a custom object that overrides Go's default ServeMux, such that their code won't collapse adjacent slashes, unlike Go's default.
They did this because previously, they used Go's default, which collapsed adjacent slashes -- this broke stuff until they troubleshot it [2] and discovered that Cloudflare's OCSP responder, which they were using, actually documents that you shouldn't use the default ServeMux [3]. The commit that led to that condition only went in a month prior [4]. This is similar to an issue they had two years ago [9] that seems to have started this all.
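If you want to see the default behavior for yourself, this is a minimal sketch using `net/http/httptest`. The helper `probe` and the paths are invented for illustration. Strictly speaking, the stock `ServeMux` doesn't rewrite the path in place: it answers with a redirect to the cleaned path.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// probe sends a GET for target through a stock ServeMux and returns the
// status code and Location header of the response.
func probe(target string) (int, string) {
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, r.URL.Path)
	})
	rec := httptest.NewRecorder()
	mux.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, target, nil))
	return rec.Code, rec.Header().Get("Location")
}

func main() {
	code, loc := probe("/AA//BB") // repeated slash inside the path
	fmt.Println(code, loc)
	code, loc = probe("/AA/BB") // already clean
	fmt.Println(code, loc)
}
```

A client that follows the redirect then requests a path with the doubled slash removed, so the handler never sees the original base64 text.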
Now the proposal is to strip the "leading slash" from the {ocsp-request} component of the incoming URI "{ocsp-uri}/{ocsp-request}" [5], but far better would be to perform path canonicalization on the {ocsp-uri} but not the {ocsp-request}. But it looks like they're relying on http.StripPrefix [6], which is an idiomatic Go way of hosting a server out of a subpath and returning 404 on any request not matching the prefix; this will be problematic without additional processing that gets slashes out of places they shouldn't be, while leaving alone where they should.
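As a rough sketch of the "strip leading slashes, leave the rest alone" idea (this is not the actual fix in [5]; the function name and inputs are hypothetical): an OCSPRequest is a DER SEQUENCE, so its first byte is 0x30 and its base64 form starts with 'M'. A leading '/' can therefore never be part of a valid request and is safe to drop, whereas slashes later in the string must be kept.

```go
package main

import (
	"encoding/base64"
	"fmt"
	"net/url"
	"strings"
)

// decodeRequest takes the still-escaped path that follows the responder
// prefix and returns the raw request bytes. Only leading slashes are removed;
// slashes inside the base64 text are left untouched.
func decodeRequest(escapedRest string) ([]byte, error) {
	trimmed := strings.TrimLeft(escapedRest, "/")
	unescaped, err := url.PathUnescape(trimmed)
	if err != nil {
		return nil, err
	}
	return base64.StdEncoding.DecodeString(unescaped)
}

func main() {
	// Illustrative inputs: the same bytes sent unescaped and percent-encoded,
	// each with an extra leading slash from a client-side double slash.
	for _, rest := range []string{"//MAP/////", "/MAP%2F%2F%2F%2F%2F"} {
		raw, err := decodeRequest(rest)
		fmt.Printf("%q -> % x (err: %v)\n", rest, raw, err)
	}
}
```

This only works if the handler is given the escaped path with no cleaning applied to the request component, which is the part that needs care when combined with prefix stripping.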
For more fun with slashes and OCSP, see this source code for NSS [7], this bugzilla issue [8], this code [10], and this mailing list thread [11].
It does not matter. Due to how servers and most apps need to handle URLs they are decoded very early in the process (eg: %2F and %2f need to be the same for instance).
The poster said "OCSP requests that use the GET method use standard base64 encoding, which can contain two slashes one after another", so I'm wondering if that's the case or it's actually "clients don't encode their base64". I didn't read through the RFC, but judging by the later statement of "OCSP request using the GET method is constructed as follows 'GET {url}/{url-encoding of base-64 encoding of the DER encoding of the OCSPRequest}'" that seems to be the case.
yo people - you guys really need to handle slashes properly!
They are quite important for www-stuff you know.
"a majority of the OCSP requests we were receiving were prepended with a slash"
Everything is unsafe - one has to make sure that external data is converted into a sane format internally; and to never assume that external input can be safe!