Josh from Let's Encrypt here. First, my apologies for the trouble this has caused.
I want to offer people here an early root cause analysis. I say early because we have not entirely completed our investigation or a post-mortem.
OCSP requests that use the GET method use standard base64 encoding, which can contain two slashes one after another. While debugging why a small number of OCSP requests consistently failed, our engineers observed a rather odd, but standard, web server behavior: when a server receives a request whose path contains multiple consecutive slashes, it collapses them into a single slash. This caused our OCSP responder to consider requests with this unusual encoding quirk invalid and respond to them with a '400 Bad Request'. The fix seemed quite simple: disable the slash-collapsing behavior.
Unfortunately, stopping this behavior surfaced a more serious issue. The AIA extension that we include in the certificates we issue contains a URI for our OCSP server, and this URI contains a trailing slash. According to RFC 6960 Appendix A, an OCSP request using the GET method is constructed as follows: 'GET {url}/{url-encoding of base-64 encoding of the DER encoding of the OCSPRequest}', where the url 'may be derived from the value of the authority information access extension in the certificate being checked for revocation'. A number of user agents take this quite literally and construct the URL without inspecting the contents of the AIA extension, meaning that they end up with a double slash between the host name and the base64-encoded OCSP request. Before we disabled slash collapsing this was fine, as the web server was silently fixing the problem. Once we stopped collapsing slashes, we started seeing problems.
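To make the failure mode concrete, here is a minimal Go sketch (illustrative host and bytes, not any particular client's code) of how applying the RFC template literally to an AIA value that already ends in a slash produces a double slash, and how the standard base64 alphabet can contribute more of them on its own:

    package main

    import (
        "encoding/base64"
        "fmt"
    )

    func main() {
        // AIA OCSP URI as embedded in the certificate -- note the trailing slash.
        // (Hypothetical responder host, used only for illustration.)
        aia := "http://ocsp.example.org/"

        // Stand-in for the DER-encoded OCSPRequest; these bytes are chosen so
        // that standard base64 also emits consecutive slashes of its own.
        der := []byte{0x30, 0x51, 0xfb, 0xff, 0xff}

        // RFC 6960 Appendix A template, applied literally and (as many clients
        // do) without percent-encoding the base64 payload:
        //   GET {url}/{base64(DER(OCSPRequest))}
        b64 := base64.StdEncoding.EncodeToString(der) // "MFH7//8="
        fmt.Println(aia + "/" + b64)
        // Prints http://ocsp.example.org//MFH7//8= -- one "//" comes from the
        // trailing slash plus the template's own "/", another from base64 itself.
    }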
From our OCSP server's perspective, a majority of the OCSP requests we were receiving were prepended with a slash and we were unable to decode them, so we'd respond with a '400 Bad Request' and move on. This coincided with a large number of previously cached responses on our CDN expiring, so we started getting hit with a large number of requests. Because the '400 Bad Request' responses carried explicit no-cache headers, we had a near 0% cache (CDN) offload rate and were hit with the full brunt of our OCSP request load at our origin servers. This bogged down our whole infrastructure.
Just a quick question. Does this mean that if your OCSP servers were to go down, a lot of SSL-enabled websites and applications would stop working? That seems like a serious single point of failure for the modern-day internet. I was always under the assumption that clients do not have to contact the CA (every time?) before a TLS handshake takes place.
OCSP Stapling seems to be the way to mitigate this problem, but not all web servers implement it (for instance lighttpd does not). Any recommendation from Let's Encrypt on this issue?
This is exactly the problem with OCSP. There's no way to tell if the remote server is down, or if a malicious actor sitting in your path is blocking it. So your browser can either a) make it super easy for all your OCSP-using sites to appear down, which will encourage users to use other, non-OCSP, sites, or b) silently fail, which makes the entire exercise pointless.
Stapling only partially mitigates this, as it doesn't currently work with intermediate certs, and at this point most sites have at least one intermediate cert.
RFC 6066 only lets you staple a single OCSP response - since with intermediate certs you need to be able to respond for the whole chain, this does not work. RFC 6961 defines a multiple-response capability, but my understanding is that it is not yet sufficiently widely implemented to be useful.
Thanks! I thought it was enough if the stapled response contained information only about the intermediate cert, and the browser would accept that as good enough, provided the chain it got in the handshake is valid.
Some have argued that this is why CRL and (especially) OCSP are useless pieces of security theater: they don't actually protect against a crafted attack because they fail-open in the very situations that a determined adversary can trigger, so they only "protect" in situations where no real threat exists. It's simply feel-good bookkeeping.
Adam Langley, working on Google Chrome [1][2][3], has been very vocal about OCSP's faults, and Chrome began using its own auto-update to ship an aggregate of revocations of high-value certs directly to browsers out-of-band. Despite this being another famous instance of Chrome going against the grain of other browser vendors, I believe this was the correct solution: offering better protection for a curated subset of sites vs. pretending to protect -- but not actually protecting -- all sites.
> I believe this was the correct solution: offering better protection for a curated subset of sites vs. pretending to protect -- but not actually protecting -- all sites.
This paper -- the CRLite proposal -- is wonderfully well thought-out, experimentally tested, and meets the design goals much better and more elegantly than any other attempt to solve the certificate revocation problem.
Looks like it was posted here and got very little traction [1]; a shame. But it will be presented in a few days at the IEEE Symposium on Security and Privacy [2]. I hope it will get the coverage and examination it deserves.
The one app chain (complex code signing) I worked on with OCSP, we defaulted to fail-safe, but it could be overridden in the 'main' (enterprise CMS) app. The installer required OCSP or wouldn't install.
Basically the first and last mile were hard fails but everything in between was advisory if the signature checked out.
1. Spec question: Why does the request need to be both base 64 and URL encoded? Why not just URL encoded? Only reason I can think of is for shorter/prettier URLs?
2. Implementation question: Shouldn't the slashes be URL encoded as "%2F"? "url/ABC/DEF" could mean "url" + "ABC/DEF" or "url/ABC" + "DEF". Multiple slashes are collapsed by default because path components shouldn't contain them.
Josh referred to RFC 6960 Appendix A, but his post didn't make it apparent that his description is an exact quote from the spec [1]:
    An OCSP request using the GET method is constructed as follows:

    GET {url}/{url-encoding of base-64 encoding of the DER encoding of the OCSPRequest}
This is shamefully imprecise for an RFC, not even referencing the relevant specs for each encoding step.
--
To answer your first question, Base64 is needed because DER is binary, and URIs are defined in terms of characters -- and it's an exercise left to the reader [2] whether you can somehow get from one to the other reliably. It's also an exercise for every other reader you're trying to interoperate with, so the common practice is to make an explicit conversion before you get to this stage. Base64 takes care of this by transforming the binary data to US-ASCII, of which UTF-8 is a superset, and URIs operate on UTF-8 characters.
But "vanilla" base64 can produce three characters which are reserved characters in URIs: the slash, the plus, and equals [8]. These need to be percent-encoded because if they are used directly in URIs, they have special meaning.
Of course, if the OCSP RFC had just specified base64url encoding [9], a widely used variant which swaps the slash for an underscore, the plus for a minus, and allows omitting the padding that's denoted by equals signs, the double encoding wouldn't be needed, because none of those characters are reserved in URIs.
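As a quick illustration with Go's standard library (arbitrary example bytes, chosen only to hit the awkward characters):

    package main

    import (
        "encoding/base64"
        "fmt"
    )

    func main() {
        data := []byte{0xfb, 0xef, 0xff, 0xfe}

        // Standard alphabet: '+' and '/' show up, along with '=' padding -- all
        // of which would need percent-encoding before going into a URI.
        fmt.Println(base64.StdEncoding.EncodeToString(data)) // "++///g=="

        // base64url alphabet: '-' and '_' instead, and the Raw variant drops
        // the padding, so the result can sit in a URI path as-is.
        fmt.Println(base64.RawURLEncoding.EncodeToString(data)) // "--___g"
    }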
--
To answer your second question, slashes in URIs are fun. Though the most recent URI RFC goes through elaborate rules on when you're supposed to encode and decode [3] and what's supposed to be interpreted how, at the end of the day the URI is somehow consumed as an input to some other process where different rules may apply [5][6].
One of those different, customary "rules" is that percent-encoded slashes are just maliciously trying to traverse outside of the directory, so most webservers shut this down. Apache is one of the few that lets you tune what to do in this case [7].
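For reference, the Apache directive for tuning this is AllowEncodedSlashes; a sketch (check the docs for your httpd version for the exact semantics):

    # The default (Off) rejects paths containing an encoded slash outright;
    # On decodes them; NoDecode accepts them but leaves the encoding intact,
    # which is usually what an API that embeds data in the path wants.
    AllowEncodedSlashes NoDecode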
Thank you for this preliminary report. I just want to say you are doing great work and a tremendous service to the public. We tolerate a few hiccups. And as always when something goes wrong, it is always more than one problem.
As I was reading the first few sentences, describing the slash collapsing, I was thinking to myself "oh no, I hope they don't just 'fix the glitch'". That behavior is so old and pervasive on the web, about the last thing I would have tried is turning off slash collapsing.
I'm not sure I understand how slash collapsing is affecting this. Slash is a reserved character, and presumably, if the data was correctly encoded, it should never have ended up in the URI in the first place?
> I'm not sure I understand how slash collapsing is affecting this. Slash is a reserved character, and presumably, if the data was correctly encoded, it should never have ended up in the URI in the first place?
base64 uses 64 characters: A-Za-z0-9 (62) and two symbols, commonly '/' and '+'. (As well as a third symbol, '=', used at the end to handle padding.) That would work in a URI, most of the time, except if you happened to have a base64 encoding with two '/' next to each other.
A common fix for using base64 in URIs involves substituting a different pair of symbols instead of '/' and '+'.
But some applications will decode the percent-encoding too early in the process of normalizing, security-escaping, and processing the URL. Encoded slashes in URLs are problematic [1][2][3][4][5].
I guess they must be using something in front of the code [0] (which does document the double-slash issue). They probably should have used PathUnescape [1] on line 185, though.
This is the pull request that triggered today's issue [1]. They introduced a custom object that overrides Go's default ServeMux, such that their code won't collapse adjacent slashes, unlike Go's default.
They did this because previously they used Go's default, which collapsed adjacent slashes -- this broke stuff until they troubleshot it [2] and discovered that Cloudflare's OCSP responder, which they were using, actually documents that you shouldn't use the default ServeMux [3]. The commit that led to that condition only went in a month prior [4]. This is similar to an issue they had two years ago [9] that seems to have started this all.
Now the proposal is to strip the "leading slash" from the {ocsp-request} component of the incoming URI "{ocsp-uri}/{ocsp-request}" [5], but far better would be to perform path canonicalization on the {ocsp-uri} but not the {ocsp-request}. It looks like they're relying on http.StripPrefix [6], which is an idiomatic Go way of hosting a server out of a subpath and returning 404 on any request not matching the prefix; this will be problematic without additional processing that gets slashes out of the places they shouldn't be while leaving them alone in the places they should be.
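For anyone who wants to see the behavior being discussed, here is a small self-contained sketch (illustrative host and payload, nothing from Boulder) of what Go's default ServeMux does with adjacent slashes -- it routes on a cleaned path and answers with a redirect to it, which corrupts a standard-base64 payload:

    package main

    import (
        "fmt"
        "net/http"
        "net/http/httptest"
    )

    func main() {
        mux := http.NewServeMux()
        mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
            fmt.Fprintf(w, "handler saw path %q", r.URL.Path)
        })

        // OCSP-style GET with a doubled slash after the host and an unescaped
        // standard-base64 payload that itself contains "//".
        req := httptest.NewRequest("GET", "http://ocsp.example.org//MFH7//8=", nil)
        rec := httptest.NewRecorder()
        mux.ServeHTTP(rec, req)

        // The handler never runs: the default mux replies with a 301 to the
        // "cleaned" path, in which both doubled slashes have been collapsed.
        fmt.Println(rec.Code, rec.Header().Get("Location")) // 301 /MFH7/8=
    }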
For more fun with slashes and OCSP, see this source code for NSS [7], this bugzilla issue [8], this code [10], and this mailing list thread [11].
It does not matter. Due to how servers and most apps need to handle URLs, they are decoded very early in the process (e.g., %2F and %2f need to be treated the same).
The poster said "OCSP requests that use the GET method use standard base64 encoding, which can contain two slashes one after another", so I'm wondering whether that's the case or it's actually "clients don't encode their base64". I didn't read through the RFC, but judging by the later statement that an "OCSP request using the GET method is constructed as follows 'GET {url}/{url-encoding of base-64 encoding of the DER encoding of the OCSPRequest}'", that seems to be the case.
yo people - you guys really need to handle slashes properly!
They are quite important for www-stuff you know.
"a majority of the OCSP requests we were receiving were prepended with a slash"
Everything is unsafe - one has to make sure that external data is converted into a sane format internally; and to never assume that external input can be safe!
So in case this helps anyone: I had people complaining about strange OCSP errors all morning coming from my server (which uses Apache httpd).
It turns out Apache does practically everything to behave as dumbly as possible in the case of OCSP downtime.
If the OCSP responder sends an error, Apache will staple that error response (instead of using an old, still-valid OCSP reply). You can't make it behave sanely here, but you can at least tell it not to return the error by setting SSLStaplingReturnResponderErrors to off.
However, if the OCSP responder isn't available at all, Apache will fake its own OCSP error (sic!) and send that. This is controlled by the option SSLStaplingFakeTryLater, which defaults to on. So if your Firefox users get strange OCSP errors, it's most likely this. The doc for SSLStaplingFakeTryLater claims that the option is only effective if SSLStaplingReturnResponderErrors is set to on; however, that's wrong.
tl;dr: set both of these options to "off"; then at least Apache won't staple any garbage into your TLS connection, and Firefox will try to reach the OCSP responder on its own, fail, and still accept the connection. Yes, that's all pretty fucked up.
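In configuration terms, that tl;dr looks roughly like this (a sketch for mod_ssl with stapling already enabled; double-check against the docs for your httpd version):

    # Don't pass OCSP responder errors through to clients as stapled responses.
    SSLStaplingReturnResponderErrors off
    # Don't synthesize a "tryLater" OCSP response when the responder is unreachable.
    SSLStaplingFakeTryLater off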
Going through the documentation, another thing that surprised me was the SSLStaplingStandardCacheTimeout setting.
If I understand this correctly, by default Apache will only cache OCSP responses for 1 hour, even if they are still valid for days. I guess increasing this to 1 day or something would make sense as well.
> If I understand this correctly, by default Apache will only cache OCSP responses for 1 hour, even if they are still valid for days. I guess increasing this to 1 day or something would make sense as well.
Yeah, that's another major problem. But increasing that doesn't really fix anything in a reasonable way. There shouldn't be any cache timeout at all; this option doesn't make sense. It should cache as long as the reply is valid and replace it in time before it expires.
By the way, I checked the source and it appears that setting SSLStaplingStandardCacheTimeout to a large value (larger than typical OCSP reply validity) effectively creates this behavior.
Apache checks whether the cached reply is still valid and, if it isn't, attempts to renew it.
At least in 2.4.10 shipped in Debian Jessie. Relevant code is in modules/ssl/ssl_util_stapling.c, stapling_cb()
Same here. I was very pleased with Caddy so far, but its being tightly coupled to LE being up, despite having certs cached, is a no-go for our production systems. If this stays like that, it would make me go back to nginx for the services I've used Caddy for so far.
edit: looks like the dev added a fix to only refuse to start if the cached certs are dangerously close to expiring. That satisfies me, and I'll continue using Caddy.
Call me crazy but I think it's a little silly for an https server to refuse to start just because its certificate is invalid. I recognize that some folks might like that failure mode however.
I am running nginx reverse-proxying to a python API right now. Dealing with certificate renewal is a matter of running a daily cronjob issuing 'certbot renew'. If it works it replaces the fullchain.pem certificate, and that's it, easy peasy.
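A sketch of such a cron entry (assuming certbot is on the PATH and nginx should be reloaded after a successful renewal; the hook flag names vary a little between certbot versions):

    # Run the renewal check daily at 03:00; the hook only fires for certs that
    # were actually renewed.
    0 3 * * * certbot renew --quiet --deploy-hook "systemctl reload nginx"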
> I think this thread is pretty much proof that that approach will bite you in the ass.
What bit me here is the fact that I'm running alpha software instead of a battle-tested web server; I'm doing so willingly, with full awareness of the risks that that entails.
Drawing the conclusion you did from the variables at play is shortsighted. If anything bites people in the ass, it's prejudice and shortsightedness. I wouldn't want you handling my ops/infrastructure.
> What bit me here is the fact that I'm running alpha software
This wasn't caused by a bug. This was a deliberate decision to fail to start if the certificates on-disk were <= 30 days away from expiring and the CA can't be contacted.
> Drawing the conclusion you did from the variables at play is shortsighted
Using caddy is the web-server-stack equivalent of "putting all your eggs in one basket". If one thing about it isn't working the way you want, you have to either a) replace it completely or b) work out how to disable the bit that's not working how you want, and replace that part of it.
> Drawing the conclusion you did from the variables at play is shortsighted
- People use a piece of software that serves as both ACME TLS certificate client and web server
- Said software by design won't start if the CA can't be contacted 30 days out from expiry
The conclusion I drew is that such integration leaves the operator with less control than if they followed a separation-of-concerns approach and left web serving to a web server and TLS certificate renewal to an ACME client. The former doesn't need to care about how old the certificates are; it just uses what it's given.
I still think that refusing to start if the cert expires in 7 days or less is an issue if Let's Encrypt is down.
There should be at most a warning but it should start. Otherwise you end up with an external dependency that can cause your web server to not start through no fault of your own.
It's not tightly coupled any more than it would be with certbot. The issue I filed is an issue because it's unexpected behaviour (despite early closure, it has actually now been fixed).
The simple config syntax and sane defaults are nice in Caddy. For example, 3 lines of config nets you an HTTPS server with HTTP/2 support and an A rating from Qualys for the SSL setup.
Yes, that's exactly what I meant. You can tell because I didn't present any examples. And because one issue invalidates everything else. And I didn't express my opinion on this issue up thread.
I switched from nginx to caddy for my (extremely low-traffic) server because I was tired of having to copy & paste a bunch of SSL setup any time I was configuring a new domain.
I moved from Caddy to Traefik (https://traefik.io/) several months ago. Granted, nginx has years (decades?) on some of these newer webserver/reverse-proxies, but I have been happy with all the built in niceties of traefik so far (single binary, etc), and haven't really experienced any negatives.
I submitted the issue for it (https://github.com/mholt/caddy/issues/1632) and it was immediately closed. Reason (after asking): we only keep issues open that are on our TODO list.
Caddy restarts gracefully with zero downtime. If you're killing the process and starting one anew, you're doing it wrong. Use signal USR1 to gracefully apply new configuration changes. Failed reloads fall back to the current working configuration without any changes or downtime. https://caddyserver.com/docs/cli#usr1
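For example, something along these lines (assuming a single Caddy process on a Linux host with pidof available):

    # Ask the running Caddy process to re-read its Caddyfile with zero downtime;
    # if the reload fails, it keeps serving the previous working configuration.
    kill -USR1 "$(pidof caddy)"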
If I am hosting 5 sites on caddy, and add a 6th one, I restart the server. If the 6th site doesn't work (for example, if DNS didn't resolve for Let's Encrypt), the other 5 sites, which were working before the restart, all fail to start because Caddy completely crashes.
This is basic resiliency you'd expect from your web server. Why should the other 5 sites fail to start if their configs are completely valid?
> If I am hosting 5 sites on caddy, and add a 6th one, I restart the server.
No, you don't. You reload the server, not restart it. Restarting a web server should only be required if you get an upgrade for it (or for OpenSSL etc.)
The idea here is that the server can't be scaled up or down. I suggest googling "cattle vs pets". If you know how to make a single process scale horizontally across additional hosts or scale up in alternative datacenters with a chef service notification, I'd pay you money to tell me how.
So, um: devops and cloud architecture is my job. I think I know what "cattle vs pets" is referring to. Nothing in this thread has anything to do with one versus many processes or one versus many nodes, nor does it have anything to do with manual configuration of anything. Rewrite the configuration based on orchestration data or, in extremity, upload a new version of a cookbook that handles that new site; the hypothetical sixth site is added and the service kicked over without human interaction. Scaling is an orthogonal concern.
This type of thing is why I (and I'm sure others) have literally zero intention of using tools like this.
Separation of concerns means you are in control, and using separate layers means you can swap one out when a vulnerability/show-stopper bug is discovered.
What exactly do you do when your look-ma-no-hands server won't even start?
Edit: maybe "all-things-to-all-people" was the wrong term to use here.
I have no idea how this is relevant to the conversation, nor who said anything about hyperbolic claims.
You're claiming "caddy does everything". As opposed to what? If you're running apache or nginx, your server does far more than caddy, so you're quite simply mistaken.
Indeed, the 0.10.3 fix is illustrative compared to the situation with OCSP stapling in Apache.
This fix gets almost everybody where they should be: the next time the same thing happens (and it will), Caddy isn't a problem for three weeks, which is definitely enough time. Meanwhile we're going to see the same Apache crappiness for OCSP again each time, until someone over there finally snaps out of it and asks someone who actually knows how OCSP stapling was supposed to work.
I think LE is a huge boon to the internet. But I would really love for someone like Amazon, Google, Facebook, or Microsoft to set up a separate provider that implements the same thing. Redundancy is super important here and clearly just one organization can't guarantee 100% uptime.
Alternatively, maybe Let's Encrypt ought to Chaos Monkey this up and be down for 4 random hours every month or something on purpose. Or if you want to make very sure you don't turn people away, be down for 4 hours every month for any cert that has been in Let's Encrypt for more than a month or two, so you don't turn away new users. Because if you have a problem with a brief outage, the problem is in the user code, not Let's Encrypt.
It doesn't matter how redundant you make Let's Encrypt; the problem could always lie too close to the user code to be resolved, e.g., the data center hosting your server loses internet. 100% uptime in the sense of "system A can always reach service B" is impossible, even if service B never "goes down" strictly speaking.
> Alternatively, maybe Let's Encrypt ought to Chaos Monkey this up and be down for 4 random hours every month or something on purpose.
And then gain a reputation for being unreliable?
> Or if you want to make very sure you don't turn people away, be down for 4 hours every month for any cert that has been in Let's Encrypt for more than a month or two, so you don't turn away new users. Because if you have a problem with a brief outage, the problem is in the user code, not Let's Encrypt.
Most users don't care. If it's not working reliably for them, they'll just move to something that does. Maybe there is an issue in their code that should be addressed, but I seriously doubt they'd care to have that pointed out when they're suddenly offline.
Anyone doing proper testing of their software/infrastructure should have a testing environment anyway. I'd take your proposal and modify it to: Let's Encrypt should offer testing servers which are down for well defined periods throughout the day that people can use to test their platform against.
"And then gain a reputation for being unreliable?"
If they tell people what they are doing and why, and only do it for established certificates, I'm not sure that would happen.
"Let's Encrypt should offer testing servers which are down for well defined periods throughout the day that people can use to test their platform against."
Who would use them? Anyone who cares enough to simulate LE failure is presumably already doing it.
Google can already mess with your domain quite a bit (how many people use Google's DNS servers?). They also have a registrar and a CA that your browser already trusts. Amazon does the same. I don't know about MS/Azure, but if they don't now, I'm sure they will soon. FB is the only one that I think has neither, but they're also the one I think is least likely to actually maliciously mess with domains and TLS. It's just not in the business model.
Besides, this doesn't have to be a thing that's under the direct control of any of these organizations, but rather a separate entity that's just financed by them.
I think the OP is asking / hoping for big tech like them to host the infrastructure for higher availability. Basically like a mirror; to end users it would just be the same server.
On a side note, is the LE infrastructure globally distributed, or does it currently all reside in US-West and/or US-East? Is Mozilla currently the one hosting it?
Not only is it a problem with certificate issuance - their OCSP servers are also down. This caused an issue on one of my sites where I was using OCSP stapling: normal browser connections were failing, but not tools like curl (which don't ask for the OCSP response over SSL).
What's the typical validity period for OCSP responses with Let's Encrypt? Shouldn't the stapled responses continue working for at least a couple hours even after Let's Encrypt goes down?
For what it's worth, Caddy is the only server that will locally cache the staples (and manage them) automatically. In other words, Caddy sites were not affected by this OCSP downtime.
I think, in the past, mods have put an extra down-weight on "X is down" stories, once 'X' is back up.
Since this discussion now has interesting stuff related to Let's Encrypt—and products which use Let's Encrypt—I hope the mods are willing to forgo the down weight, and instead just change the post title to something like "Let's Encrypt Was Down".
Nothing against letsencrypt, but depending on services being online is fragile and will break. Their 90-day limit makes it worse. Saying it's for security is like saying 1- or 3-year certs are somehow insecure, which is not the case. It's one more headache for an admin to think about, even if automated.
We really should reexamine the CA system. Self signed certs should have more value than they currently do, and identity can be verified by out of band methods. Surely it's worth exploring.
What we have now effectively disempowers individuals and centralizes essential services which cannot be good in the long run.
You can run your own ACME provider, the code is open source.
Nothing stops you from running a CA that offers 1 year certs over ACME. Or just providing one that also offers 90 day certs. If people will trust that CA is another question.
The automation of LE is not the problem either. Properly automated systems would extend/renew the cert well before it becomes invalid; almost every LE guide I know mentions this, on the grounds of "what if LE is down".
The only libraries and services affected are those that do not properly code for an external service provider being temporarily offline.
The problem with OOB verification of certs is that it's slow and inefficient for almost all methods this can be done with. And it doesn't scale either.
Imagine if everyone wanted to OOB verify the Google certs on the same day.
> Nothing stops you from running a CA that offers 1 year certs over ACME. Or just providing one that also offers 90 day certs. If people will trust that CA is another question.
You know that's BS. All of your users would get certificate errors; that's what's preventing you from running your own CA.
Long lived certs aren't necessarily insecure but revocation is a real issue that doesn't have a good solution outside of short cert lifetimes. People don't actually revoke their certs because it takes effort and costs money, browsers don't reliably check revocation lists if they do at all, and checking for revocations is bad for performance and privacy.
I don't disagree about CAs, but LE suggests renewing certs every 60 days. So unless this outage lasts a month or some ACME tool is poorly made, it shouldn't affect anything except new registrations.
Things like SPF and SSHFP records are still unprotected without DNSSEC. HTTP, IMAP/POP3, and SMTP may be safe by themselves but it would be nice to have the others covered as well.
We've been using LetsEncrypt on dozens of our servers for several months now and it's worked flawlessly both in set-up and in operation!! Set up was quite easy thanks to A LOT of work by the team there.
To make HTTPS simple and free to set up is a FANTASTIC mission & the team has overall built a SUPERB system. Congratulations & thanks for addressing this issue quickly, looks like it's already solved, good work!
If you use one of the myriad of automated tools for using LE, you will get your cert renewed as early as 30 days before it expires. So right now the issue should only be with new domains getting certs.
If you renew LE certs manually, first what is wrong with you and don't you like yourself? Second, at that point it's no different than NameCheap going down and you getting your cert from 1and1 instead.
Sadly, reality is not that nice, and I can already feel a disturbance in the force in the form of "Why Let's Encrypt is bad and you should buy 1-year certs" blog posts all over the place.
If you have your nginx HTTP vhost configured for serving {{ domain }}/.well-known/acme-challenge/ from /var/www/{{ domain }}, then getting a new cert and having it automatically renewed is as simple as running:
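(a sketch assuming certbot's webroot plugin -- paths, domains, and the hook command are placeholders to adapt)

    certbot certonly --webroot -w /var/www/{{ domain }} -d {{ domain }} \
        --renew-hook "systemctl reload nginx"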
If certbot was installed with PIP, a CRON job will be automatically created and it will run with whatever arguments the certificate was first obtained with (--renew-hook is obviously important).
That assumes the only sites you want to use SSL with are sites with a web root. Many sites use Python's Flask or Node.js or similar, where you instead have an HTTP server running on a high port and proxy requests for certain domains to those ports. Such a script won't work with a setup like that.
Just spent an hour debugging why development scripts work but production doesn't. Good reminder to configure some kind of notification from AWS Lambda when execution times out as well, not just on errors.
Clearly it is high time for an EncryptWeShall nonprofit with a wholly separate implementation and team and all the tooling adjusted to randomly pick between the two.
Holy frijoles, I was pondering exactly this scenario earlier today while messing around with Caddy and LE, as in: do I want to take a (mostly) static and offline process and directly inject it as another moving part into the runtime world, for the sake of its cost, convenience, and overall worthwhileness?
Is there a good alternative, without this sort of process? Or back to the sharks?
This is unrelated to this outage, however in the past when renewing, I've always had problems resolving acme-v01.api.letsencrypt.org - am I alone with this issue?
I'm running a local dnscache instance (djbdns), perhaps that has something to do with it?
"High assurance datacenter" High assurance my ass. Those don't go down unless there's a DDOS or catastrophic failure (often several). Then, they're right back up. People need to stop misusing this label. Another is "high-assurance" certs from vendors that get compromised or subverted easily. Only one high-assurance CA that I know of. It's not around for business reasons, though.
The issue I brought up is nothing about it is high-assurance except maybe tamper-resistance on a HSM involved. It's a term abused in the certificate market a lot. An easy hint to tell is if it's a product developed slowly in a safe subset of C, Java, or Ada. Those have the tooling needed for highly-robust implementations. Then look at the OS to see if it's something extremely hardened or unusual (eg separation kernel RTOS). The protocols will be ultra-simple with a lot of high-availability and easy recovery. Almost no modern tooling will be in the TCB for configuration or deployment unless it's simple. Most of it isn't.
I'm not seeing any of this in the reporting that made it here. Definitely not high-assurance. Likely compromised by high-end attackers either for specific targets or in some general way. It will help protect in its intended way against the rest, though. Enormously positive development. Just not high-assurance security at any level.