Much like writing tests, it initially hurt my ego a bit, but I've come to like and proselytise it.
And even if it didn't, scaling certificates from once a year to once a week requires merely 50x capacity - a finite number, and something that's very easy to plan out. If CAs in the '90s and '00s were able to handle a certain once-a-year request load, surely CAs today can handle 50x that.
If there is a requirement or good practice to periodically replace key-pairs then it should be a network management operation. Equipment should never stop working because it decided that a key-pair had expired.
X509 certificates are not technically optimal in many situations and, let's face it, the main driver for expiry dates is that certificate vendors want to sell more certificates...
In the case of Ericsson, this might have allowed an emergency certificate update before all the O2 systems could no longer be automatically updated. Once your network is completely down, bringing it back up remotely is hard.
The simple fact is that this sort of problem should've been dealt with a lot sooner, and a failure to do so is sheer incompetence.
Where that incompetence lies is up for debate. But given how these things usually run, it lies some levels above the people who didn't have the time to properly follow a poorly-described process in the face of demands to do other stuff instead.
In this case the cert expiry seems to have rendered it hard or impossible to remotely resurrect the systems. That's a very brutal failure when your whole network goes down in one go. Having random failures, with alarms going off before everything is down, is definitely better than having alarms going off with the entire network already down.
This immediately results in the “it took us more than 24 hours to work out everything was failing, it needs to be 72 hours” argument. The best thing about that is that it results in a self-extending policy - the longer it takes to age out an expired cert, the longer it takes for the failure volume to become noticeable, then the longer it takes to work out that the problem is cert expiry, and so the more likely they’ll need an extension (again).
The correct solution for this problem is for people to correctly manage their certs, which they should be doing anyway because the private keys are sufficiently important that you should know exactly which are in use and where they are in use.
We do both, and they start alerting two months in advance. This isn't rocket surgery, new, or much different than remembering people's birthdays.
But a surprising number of technical people are simply incapable of managing events farther out than next week, and a surprising number of their managers are similarly incapable of making sure someone treats it as important.
Makes me almost want to start a betting pool on who will be the next HugeCo to down themselves this way.
Edit: And let me join the chorus in saying deliberately inducing random errors as a way to draw attention is a terrible idea.
"Ah, are we supposed to do that too?"
Years ago, during the tail end of the Age of XML, I had a conversation with one of the admins over at W3C.
To make your app work you're supposed to cache all schemas locally, but an inordinate number of people don't. Their apps will be slow because of that, but work anyway. So tons of traffic goes to w3.org. They can't remove those schemas, because then nobody can use them the right way (download them once and keep the result). But how do you get people's attention?
I pointed out to him that a human downloading the schema isn't going to notice if the request takes .3 seconds or 3 seconds, but someone loading that schema 100x in a loop is sure going to be incentivized to figure out where the extra 5 minutes of processing time came from. He allowed that I might be onto something there but I never heard back about whether he tried it.
If you want to kill off an API, assign it resources (hardware, bandwidth) that get progressively smaller, or add artificial waits that get progressively longer. Which approach you pick depends on how horizontally scalable your app is (is it easier to add a sleep or to pull servers out of the cluster?).
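As a sketch of the progressive-wait approach (the sunset date, ramp length, and maximum delay here are all made-up illustrative values):

```python
import time
from datetime import date

SUNSET = date(2025, 1, 1)   # example: the date the deprecated API goes away
RAMP_DAYS = 180             # start slowing down this many days before sunset
MAX_DELAY = 30.0            # seconds of artificial wait at the sunset date

def deprecation_delay(today: date) -> float:
    """Seconds of artificial wait to add to each deprecated request."""
    days_left = (SUNSET - today).days
    if days_left >= RAMP_DAYS:
        return 0.0
    # Ramps linearly from 0.0 to 1.0 as the sunset date approaches.
    ramp = 1 - max(days_left, 0) / RAMP_DAYS
    return MAX_DELAY * ramp

def handle_deprecated_request(today: date):
    # Sleep before serving the legacy response, so heavy automated callers
    # feel the slowdown long before a human clicking once ever would.
    time.sleep(deprecation_delay(today))
    # ... serve the legacy response as before ...
```

The linear ramp is the simplest choice; an exponential ramp gets callers' attention faster near the end.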
Have a search for libraries in your favorite language, I'm sure something will already exist. I've personally used Polly in .NET.
So many human & tech error factors lead to this occurring and they're all the same old things. Staffing changes, spam filters, ignored warnings, skipped emails...
There's a remarkable number of ways for this simple thing to go wrong. To prevent a future repeat, we got rid of our calendar reminders (which we started ignoring once we both thought the change had been made) and wrote a script that emailed us based on the time to expiration of the live cert. This is a much better method.
Of course, give us enough years and I'm sure we'll manage to find a way to get this new setup wrong.
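A minimal sketch of that kind of check against the live cert, assuming Python (the host, port, and 60-day threshold are placeholders):

```python
import socket
import ssl
from datetime import datetime, timezone

def not_after_days_left(not_after: str, now: datetime) -> int:
    """Days until expiry, given the 'notAfter' string from getpeercert()."""
    expiry = datetime.strptime(not_after, "%b %d %H:%M:%S %Y %Z")
    return (expiry.replace(tzinfo=timezone.utc) - now).days

def check_live_cert(host: str, port: int = 443) -> int:
    """Fetch the cert the server is actually presenting; return days left."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    return not_after_days_left(cert["notAfter"], datetime.now(timezone.utc))

if __name__ == "__main__":
    days = check_live_cert("example.com")   # placeholder host
    if days < 60:
        # Here you'd send the email (smtplib, or your alerting system).
        print(f"WARNING: cert expires in {days} days")
```

Checking the cert the server actually presents, rather than a date in a calendar, is exactly what makes this method better: it keeps alerting until the fix is really deployed.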
Thanks for the reminder.
Add a test that just unconditionally fails on a certain date, like a week before your cert expires. Don't let any code review sign off on merging a fix to the test until the new cert is in prod. Don't let any code promote between environments while tests are broken.
The problem with emails and warnings is they're all ignorable and therefore completely unsuitable for managing something as critical as a cert.
The secret is to create a straight-up error that absolutely interrupts every engineer in the organization's day until the cert is renewed. A development-halting error a week before cert expiration is a hell of a lot better than a business-halting error when it expires in prod.
Or just write a test that checks the date on the cert and conditionally fails.
If it's within 4 weeks, send email to x,y,z.
If it's within 2 weeks, send it to VP of x, y, z.
A combination of failing test and email should lessen the chance of it going unnoticed (the email could possibly fail for whatever reason).
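A sketch of that escalation logic (the addresses and week thresholds are placeholders):

```python
TEAM = ["x@example.com", "y@example.com", "z@example.com"]
VPS = ["vp-x@example.com", "vp-y@example.com", "vp-z@example.com"]

def escalation_recipients(days_left: int) -> list[str]:
    """Who to email, based on how close the cert is to expiring."""
    if days_left <= 14:
        return VPS      # within two weeks: escalate to the VPs
    if days_left <= 28:
        return TEAM     # within four weeks: warn the team
    return []           # nothing to do yet

def test_cert_not_near_expiry():
    # days_left would be computed from the real cert; the conditional
    # test failure draws attention in CI alongside the emails.
    days_left = 90      # placeholder value for illustration
    assert escalation_recipients(days_left) == [], \
        f"Cert expires in {days_left} days"
```

Sending the email from the same check that fails the test means neither channel can silently drift out of sync with the other.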
The surprising part is it took over 24 hours to restore service. I currently still have 3G only, and that's struggling (apparently 4G will follow).
See what we can do. O2.
I am, however, pretty sure there's no middle ground.
Wish you all the best but I would never trade my job for yours, I value my sanity way more than that.
I assumed I had dropped the darn thing one too many times.
Glad to see the HN grapevine works on wifi.
It is a reminder of just how fragile this digital world still is - we are taking technology designed to survive nuclear war, and adding single points of failure.
Let's look at mesh networking again.
Though companies should be doing at least a yearly audit of certificates and calendaring any that will need renewing. As I'm sure they do with domain names already.
The expiry date of a certificate was present in the original implementation of X.509 certificates; the infrastructure for certificate revocation lists was added later. Furthermore, a certificate revocation list includes every certificate ever revoked that has yet to expire. If you relied on certificate revocation for expiring old certificates as well, that would be a nasty long file you'd have to download before you can start connecting to any website.
To your comment of "one day the system works normally and is deemed secure, and the next day it is so insecure and dangerous", this is working as intended. The certificate is to establish trust and identity along with encrypting the data in transit. The identity described by the certificate is only valid up until the expiry date after which it ceases to be valid for that purpose. You now have an encrypted connection to something that can't prove its identity, which is certainly a lower level of "secure" than what it was before the certificate expired.
Ignoring expiry dates would mean any keys that were compromised ever could be used for MITM attacks and no one would be the wiser.
However, we already mitigate this with revocation lists. But if we can revoke certificates, why do we have expiration dates?
Seems to me expiration dates are rent-seeking behaviour by certificate vendors.
You're not secure if you don't check revocation.
I would say that the "proper" use of certificate expiry is the way LetsEncrypt and other ACME providers do it: it's set so low that you need an automated renewal process in order to make the certificate at-all useful.
Unlike cert expiry, where the first you hear about it is when your production system stops working.
Just this week I had someone complain to me that he was no longer getting build failure emails, and it turned out IT had disappeared our old <username>@<olddomain> addresses. That reminds me about an IT ticket I need to write.
Most CAs already support OCSP, which is effectively one-day-long certificates: the client can contact the OCSP server for a signed response saying "yes, it's still unrevoked", or the server can include ("staple") such a response along with its certs. Some certs have the MustStaple extension, indicating that they should not be treated as valid unless a recent OCSP response is stapled to it.
If we had the computational resources to just not have certificates at all and have every client check the CA's current belief that a public key belongs to a name at each use (and magically avoid the associated privacy problems), that would be ideal. Certificates are an approximation.
One neat thing about 3-month certificates is that it's a meaningfully different human timescale from a year (or multiple years): operators may not still be around in a year but will certainly be around in 3 months. Operators may just plan to manually renew in a few years, but generally decide they need to automate the process if it's every 3 months. (For extremely boring reasons, I manually update the certificate on my personal website every 3 months and it's a pain, and I think I am one of very few people who do manual updates to their Let's Encrypt certs.) So it forces people to think of certificates as part of running a service with ongoing operational work, not a one-time transfer of data.
I always assumed one reason is so revocation lists don't become huge. Also, digital properties such as domains change hands.
Something in there is telling you that you're doing the worst of both worlds: a gap between renewals so long that people simply forget, leading to outages, and a validity period so long that if a key leaks, someone else could use it for years before anyone notices and actually revokes the old cert.
derefr above has the correct response.
(“Blockchain” is not a valid answer)