One of the principal engineers I used to work with at AWS had a saying: "A one-year certificate expiration is an outage you schedule a year in advance." Of course, it's a bit hyperbolic -- but a ten-year expiration is almost certain to result in an outage.
In a similar vein, you should never generate resources which will expire unless some undocumented action is taken. A common one I've seen is self-signed certs which last for n days and are re-generated whenever an application is deployed or restarted, under the assumption that the application will never run untouched longer than that. (Spoiler: It probably will, at some point, whether due to unexpected change freezes, going into maintenance mode, or -- my personal favourite -- being deployed to an environment that just isn't updated as regularly.)
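The usual mitigation for that trap is to watch remaining validity rather than trust that deploys will keep happening. A minimal sketch of that check, assuming the third-party Python `cryptography` package (a recent release, for the `not_valid_after_utc` accessor); the path and threshold are made-up examples:

```python
# Warn when a locally stored certificate is close to expiry, instead of
# assuming a redeploy will regenerate it in time.
from datetime import datetime, timezone

from cryptography import x509

CERT_PATH = "/etc/myapp/tls/server.crt"   # hypothetical location
WARN_BELOW_DAYS = 30                      # arbitrary threshold

with open(CERT_PATH, "rb") as f:
    cert = x509.load_pem_x509_certificate(f.read())

days_left = (cert.not_valid_after_utc - datetime.now(timezone.utc)).days
if days_left < WARN_BELOW_DAYS:
    print(f"WARNING: {CERT_PATH} expires in {days_left} days")
```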
That principal engineer's knowledge came from repeated painful experiences at AWS. When I left AWS in 2016 they were trying to push towards three-month cert rotations, and hoping to get shorter than that.
A year-long expiry isn't frequent enough to push you to build automation, and it's long enough that the runbook you have is likely out of date before the next time you execute it. If you rotate every three months, the process is more likely to be fully or mostly automated, and it's more likely you'll remember that certs were recently introduced in a particular service. If you make it monthly, it's pretty much guaranteed to be fully automated.
Almost every week in the AWS-wide ops meetings, one service or another would be talking about something that went wrong because a certificate had expired in a place they'd forgotten they had certificates, or had missed when they did the rotation. A number of those failures presented in particularly misleading ways, too, because of the role the cert was playing.
Does one actually manage to avoid such outages for 10 years by making the problem recur every month? Because I feel like stuff would still break even if you test and run these processes regularly.
You might hit an outage, but you'll hit it within a month of deploying the new code that caused it, so you'll have the context and staffing expertise to fix it so it doesn't happen next month. Whereas if the outage happens in ten years, you'll need some software archaeologists to find the root cause and likely won't have the expertise available to fix it.
And maybe you say "it's one outage either way, but isn't it better in ten years than next month?" But when you're constantly adding new services, eventually there will come a time when, every month, some new service is having its ten-year anniversary.
One day I could not connect to my (home) server. It turned out the client certificate had expired; I never thought to make a note of, or increase, the 10-year default value when I did my test configuration...
Thirty years ago, companies were rebooting their mainframes twice a year just to make sure they could. Before doing that, companies got burned when the mainframe went down accidentally (the backup generator broke during a power outage) and they couldn't get it to start again, because someone had changed a setting at runtime but never saved it to the boot scripts, and then that person retired or found a new job. By rebooting twice a year, they ensured that when the system failed to start, someone still remembered which setting had been changed.
One of the things that I loved about ISO 9001: sure, it made every sysadmin action something that made police paperwork look 'light', but it ensured you didn't hit this kind of thing, or, if you did, it was an instant gross-negligence dismissal for whoever had stopped documenting or following the documented procedure.
Financial firms will also hit time-based bugs before most organizations because they often deal with forecasting events 30+ years in the future (e.g. mortgages). For a bank, the 2038 rollover has been relevant since 2008.
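For a rough sense of why: a signed 32-bit time_t runs out of seconds in January 2038, so a 30-year forecast made in 2008 already crosses that boundary. A quick illustration:

```python
# The last second representable in a signed 32-bit time_t.
from datetime import datetime, timezone

last_second = 2**31 - 1
print(datetime.fromtimestamp(last_second, tz=timezone.utc))
# 2038-01-19 03:14:07+00:00, so 30-year instruments hit it from 2008 onward
```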
I hit one of these on an EMC VNX array one time; after ~400 days all the controllers crashed at the same time. Didn't help that it happened at 4am on New Year's Day. I do recall other instances of this class of bug, but nothing specific.
To me, 10 years is long enough to completely forget how to fix the problem, once it becomes a problem in 10 years' time.
Most people won't document well enough to even be aware that the 10-year deadline is approaching, much less how to fix it! When the deadline hits, everything will break, and then you basically have to reinvent the wheel to get it up and running again.
In my opinion, if something can last 10 years, then it could probably just last indefinitely.
Alternatively, have something that lasts a short time but is renewed automatically. I suppose that is the advantage of Let's Encrypt, for example: their certificates expire after 3 months, but that just means I set up a cron job to automatically renew them for me.
My last boss was very keen on change control on any of our live systems; steps documented to the point where the cleaner could have run them. It seemed a bit excessive at the time, but there's something lovely about having copy/pasteable steps (and any notes) from when you had to do this a year ago.
As for forgetting that deadlines are approaching, just set up automated checks for cert and domain expiration. I wasn't just checking our infrastructure but also any remote APIs/interconnects we were using. There were a handful of times when I'd contact providers with 7 days' warning just to confirm they were aware their certs were expiring; it was slightly terrifying how often they weren't.
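A minimal sketch of that kind of remote check, using only the Python standard library; the host list is a made-up example, and a real setup would also cover domain expiry and non-HTTPS endpoints:

```python
# Report how many days remain on the certificate a remote endpoint serves.
import socket
import ssl
from datetime import datetime, timezone

def days_until_cert_expiry(host: str, port: int = 443) -> int:
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    # notAfter looks like "Jun  1 12:00:00 2026 GMT"
    not_after = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(cert["notAfter"]), tz=timezone.utc
    )
    return (not_after - datetime.now(timezone.utc)).days

for host in ("www.example.com", "api.partner.example"):   # hypothetical hosts
    days = days_until_cert_expiry(host)
    if days < 14:
        print(f"{host}: certificate expires in {days} days, chase the owner")
```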
I'm about to write a little "choose-your-own-adventure" script that walks new employees through our processes. We're about to begin our once-a-decade turnover, and the idea is that we'll hand over the script and fix the "bugs" in it, rather than explaining things "out loud". Most of these process scripts are used on a weekly or monthly cadence, so the hope is they'll stay up to date.
I love, love, love the Let's Encrypt 3-month expiry for this reason. It used to be such a pain to remember how to do the yearly/biyearly renewal, and now, as you say, it's just a cron job.
At work, our customers need to get new certificates from a gov't agency every other year.
Most of them have either completely forgotten about it and how to do it, or there's been a change of employees and the new ones didn't get the memo, so to speak.
So it falls on us to remind them and guide them through the process.
How I wish the gov't would move to a 3-month setup like Let's Encrypt.
Depending on the government agency, there may be a required level of ongoing identity and need verification that can't be automated. For personal PKI in the US DoD, for instance, you have to go in-person to an ID office on a military installation to get your common access card renewed. For server certs, there is obviously no way to make a server go somewhere physically, but you need a qualified sponsor to sign off on the request to the DoD PKI office, and who that person is will likely change over any multi-year span, as military command positions tend not to last more than a year and even the civilian offices still see fairly frequent turnover at the higher levels. Plus those people need to sign requests with their common access card, which requires them to periodically go to an ID office in-person.
I'll go further: Three months is too long. Secrets which are used to authenticate and identify should be rotated far more regularly, using infrastructure which treats them as effectively ephemeral. The industry has learned to do this -- and built the infrastructure to support it! -- for things like user credentials (see: extensive use of AWS IAM roles, rather than user creds). We should be making a push to treat certificates the same way.
(That said, three months is better than any longer period. The shorter the rotation, the lower the risk -- but, more importantly, the stronger the impetus to build strong automation around the process.)
A three-month expiration time with automatic renewal after two months (as Let's Encrypt recommends) is a sweet spot for me. When something breaks, this gives you 30 days to figure out that something went wrong and to fix it with zero customer impact. The 30-day grace window is also long enough that Let's Encrypt will send you two emails (at the 19-day and 9-day thresholds) to make you aware that something might be going wrong.
If we lowered the expiration time to say 3 days, with automatic renewal after 2 days, then any breakage on your side or downtime on let's encrypt's side would quickly escalate into https errors. That in turn would train users that those just happen, and make them ignore the big red scary page even when it's an actual attack. That sounds much worse than the small risk from a 30 day certificate.
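Expressed as a sketch, with the thresholds taken from the comment above (90-day certs, renewal from day 60, warning emails at 19 and 9 days remaining) rather than anything Let's Encrypt mandates:

```python
# Each day, decide what the renewal automation (or a human) should do,
# based purely on how many days of validity are left.
RENEW_AT_DAYS_LEFT = 30          # i.e. start renewing after ~60 of 90 days
WARNING_EMAILS_AT = (19, 9)      # "renewal has been failing" nudges

def action(days_left: int) -> str:
    if days_left <= 0:
        return "expired: this is now an outage"
    if days_left in WARNING_EMAILS_AT:
        return "renewal keeps failing: expect a warning email"
    if days_left <= RENEW_AT_DAYS_LEFT:
        return "attempt automatic renewal"
    return "nothing to do"
```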
> If we lowered the expiration time to say 3 days, with automatic renewal after 2 days, then any breakage on your side or downtime on let's encrypt's side would quickly escalate into https errors. That in turn would train users that those just happen, and make them ignore the big red scary page even when it's an actual attack. That sounds much worse than the small risk from a 30 day certificate.
That's already happened. I'm encountering LE errors on random websites so often that I don't care any more and automatically click through the warnings. This is especially troublesome because my government keeps MITMing me, and I don't like it.
I did this manually for a while with a reminder every 3 months, but now it happens automagically with a cron job on my server as well, and now I'm similarly at risk of forgetting how it worked in the first place.
> now I'm similarly at risk of forgetting how it worked in the first place
I once met a person at a client org who was generally opposed to automation due to risks of forgetting how things work and not always knowing what the internals behind those abstractions are. It was an interesting take.
At the same time, something like Ansible and other methods of automation can be pretty useful and actually aid in documenting things.
It's especially good if you can spare 10% of your time to put some notes down in Markdown files in a Git repo, or source/deploy most of your automation scripts from there as well.
Let's Encrypt makes sense in a world where you have pet servers and can install random software like certbot and keep them running for years.
In a world of containerized and immutable cattle servers it's not a good solution, especially not when you technically only need it for something that is internally accessible.
Currently my homelab setup is based on running certbot locally in a Docker container plus a calendar entry. Maybe I'll get annoyed enough to switch to my own cert at some point, but those are also a big PITA.
LE's short expiry is the primary reason why I don't use it. Yes, I know, automation is the approved solution for this, but it's not a great solution for me.
Agreed. When the tools stop working (which they do), what used to be swapping out a file suddenly becomes a big ordeal: fighting the nginx .well-known bypass, trying to figure out why Let's Encrypt can't connect via IPv6 when everything else seems able to, or, in my case, certbot-auto stopping working with no upgrade path on OpenBSD.
My blog and personal website are down for this reason; I simply can't spend half a day at this point in my life figuring out how to do this on OpenBSD, so I'd rather just leave them dead.
I guess I could still just buy an SSL certificate; maybe I'll do that tonight.
On the other hand, a carefully written doc from 10 years ago might not even apply, so you're going to have to do things from scratch again regardless. Often that's not really a big deal. And it can happen just as well to stable stuff or cron'd stuff that was working fine for x years until it stopped working.
I've had several problems like this at various jobs where something silently breaks, all hell breaks loose, and it takes the better part of a day from an entire team to figure out what 15-second change needs to be made to fix everything.
I would say the same kind of thing about primary keys on a database.
If any imaginable sort of explosion in usage means INT could overflow in your lifetime, just go BIGINT.
I've had this burn me so many times. You don't want to be frantically re-typing a primary key and all the foreign keys that point to it in the middle of the night ten years from now.
You're safeguarding your own, or your successors', sleep.
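For a rough sense of the timescales: a signed 32-bit key tops out around 2.1 billion, and at a steady insert rate that can arrive well within a system's lifetime. A back-of-envelope sketch (the insert rates are invented examples):

```python
# How long until a signed 32-bit auto-increment key overflows, assuming a
# constant insert rate, versus the headroom a 64-bit key gives you.
INT_MAX = 2**31 - 1        # 2,147,483,647
BIGINT_MAX = 2**63 - 1

for rows_per_day in (100_000, 1_000_000, 10_000_000):
    years = INT_MAX / rows_per_day / 365
    print(f"{rows_per_day:>10,} rows/day -> INT overflows in ~{years:,.1f} years")
# ~58.8 years, ~5.9 years, ~0.6 years respectively

print(f"BIGINT at 10,000,000 rows/day -> ~{BIGINT_MAX / 10_000_000 / 365:,.0f} years")
# roughly 2.5 billion years
```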
Is there ever a reason to use an INT these days? If you do have a lot of rows, you need BIGINT anyway. If you don't have a lot of rows, the few extra megabytes of disk usage from INT vs BIGINT are unlikely to cause any problems and are probably not even worth thinking about.
A good example of where INT might make sense could be a table of a relatively set size, where other tables with massively more data had a foreign key to it.
No chance of overflow, but as a BIGINT it would take a ton of space elsewhere.
Just this last Saturday before Christmas, an engineer I know at another company was frantically making schema changes to shift from INT to BIGINT, after they hit the limit and everything started failing.
Yeah, I don't understand why 10 years would be preferable to 100 years. 10 years indicates that the person thinks "not my problem". Be kind, make it 100 in that case.
If nothing else, 10 years forces the renewed cert to use a newer, likely less vulnerable encryption algorithm. Imagine your organization was dependent on a CA root using a mid-90's algorithm and still had another 70 years to expiration.
That is something that should be done deliberately as part of the normal product development process, not forced by an arbitrary date that someone set 10 years ago.
There probably isn't an actual reason other than "nice round number," but this could conceivably be something like a typical time to expect 50% of a corporate organization to have turned over. 100 years would be guaranteed 100% turnover, including ownership, and likely 99% of organizations don't even exist that long. At that point, it may as well just never automatically expire.
I distinctly remember cks as the very guy who recommended ten years as an expiration date on my self-signed certificates, and I recall believing this was a bad idea because I'd be guaranteed to have trouble reconstructing the renewal process in ten years' time.
Thankfully I completely discontinued use of the relevant apps long before that deadline.
I don't know. I could be wrong. I interacted with cks in live chat on a daily basis for nearly 25 years. Mostly not over his website.
cks also gave me some other amazing ideas, such as cross-grading my system from 32 bit Ubuntu to 64 bit. That was an ordeal and a half, but I'm proud to say that I finally achieved that goal without any reinstalls, rollbacks, or data loss.
There's a classic fable (of the Nasruddin Hodja tradition?) where the protagonist agrees that he'll teach the king's donkey to speak in 7 years; when his friends warn him that he'll be executed because he will fail in this task, he responds: ah, the pay is good, and many things can happen in 7 years; perhaps the king will die, perhaps I will die, most likely the donkey will die, so the problem will eventually go away...
Similarly, 1024 weeks is terrible. The GPS epoch rolls over at that interval, which is just long enough that a lot of receiver makers decide not to handle it, but short enough that they very much have to.
64 weeks would've been better. Force it to be handled gracefully and tested thoroughly.
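For context on the interval itself: the GPS week field is 10 bits, so it wraps every 1024 weeks, roughly every 19.6 years. A quick sketch of the rollover dates (computed from the GPS epoch, ignoring leap seconds, so treat them as approximate):

```python
# GPS week numbers wrap every 1024 weeks; compute the approximate wrap dates.
from datetime import date, timedelta

GPS_EPOCH = date(1980, 1, 6)        # start of GPS week 0
ROLLOVER = timedelta(weeks=1024)    # where the 10-bit week counter wraps

for k in range(1, 4):
    print(f"rollover {k}: {GPS_EPOCH + k * ROLLOVER}")
# rollover 1: 1999-08-22
# rollover 2: 2019-04-07
# rollover 3: 2038-11-21
```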
I had a production domain with a 10-year paid registration go down 6 years in because of outdated WHOIS and domain contact information. Lesson #1: there is nothing like the false security of "set it and forget it".