Ten years isn't long enough for maximum age settings (utcc.utoronto.ca)
106 points by ingve on Jan 4, 2024 | 59 comments



One of the principal engineers I used to work with at AWS had a saying: "A one-year certificate expiration is an outage you schedule a year in advance." Of course, it's a bit hyperbolic -- but a ten-year expiration is almost certain to result in an outage.

In a similar vein, you should never generate resources which will expire unless some undocumented action is taken. A common one I've seen is self-signed certs which last for n days, and are re-generated whenever an application is deployed or restarted, under the assumption that the application will never run untouched longer than that. (Spoiler: It probably will, at some point, whether due to unexpected change freezes, going into maintenance mode, or -- my personal favourite -- being deployed to an environment that just isn't updated as regularly.)
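Not that it excuses the pattern, but for concreteness, here's a minimal sketch of that regenerate-on-deploy approach (paths, hostname, and lifetime are all invented):

    import subprocess

    # Deploy-time hook: regenerate a self-signed cert valid for 90 days,
    # on the risky assumption the service is redeployed before it expires.
    subprocess.run([
        "openssl", "req", "-x509", "-newkey", "rsa:2048", "-nodes",
        "-keyout", "/etc/myapp/key.pem", "-out", "/etc/myapp/cert.pem",
        "-days", "90", "-subj", "/CN=myapp.internal",
    ], check=True)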


That principal engineer's knowledge came from painful, repeated experiences at AWS. When I left AWS in 2016 they were trying to push towards three-monthly cert rotations, and hoping to get them shorter.

A year-long expiry isn't frequent enough that you build automation, and is long enough that the runbook you have is likely out of date before the next time you execute it. If you make it three-monthly, it's more likely to be fully or mostly automated, and it's more likely you'll remember that certs were recently introduced in a particular service. If you make it monthly, it's pretty much guaranteed to be fully automated.

Almost every week in the AWS-wide ops meetings, one service or another would be talking about something that went wrong because a certificate expired -- in a place they'd forgotten they had certificates, or one they'd missed when they did the rotation. A number of those failures presented in particularly misleading ways, too, by nature of what role the cert was playing.


Does one actually manage to avoid such outages for 10 years by making the problem recur every month? 'cause I feel like stuff would still break even if you test and run it regularly.


You might hit an outage, but you'll hit it within a month of deploying the new code that caused it, so you'll have the context and staffing expertise to fix it so it doesn't happen next month. Whereas if the outage happens in ten years, you'll need some software archaeologists to find the root cause and likely won't have the expertise available to fix it.

And maybe you say "it's one outage either way, but isn't it better in ten years than next month?" But when you're constantly adding new services, eventually there will come a time where every month some new service is having its ten year anniversary.


Sounds like they need a system that actually gets remembered and referenced if they want to stick to 1-year expiries.


One day I could not connect to my (home) server. Turns out the client certificate had expired; I never thought to make note of, or increase, the 10-year default value when I did my test configuration...


I remember there being a weird clock rollover bug that only financial firms would hit (since they never took their machines down, ever).

That was a long time ago. I wonder if technology/the cloud has changed things, or if they still run those same machines.


30 years ago, companies were rebooting their mainframes twice a year just to make sure. Before doing that, companies got burned because the mainframe went down accidentally (the backup generator broke during a power outage) and they couldn't get it to start, because someone had changed a setting at runtime but never saved it to the boot scripts -- and then that person retired or found a new job. By rebooting twice a year, they could ensure that someone remembered what setting had changed when the system failed to start.


Chaos Engineering!

Untested emergency plans are not a guarantee that the plans will work.


One of the things I loved about ISO 9001: sure, it made every sysadmin action involve paperwork that made police paperwork look 'light', but it ensured you didn't hit this kind of thing -- or if you did, it was instant gross-negligence dismissal for whoever stopped documenting or following the documented procedure.


Financial firms will also hit time-based bugs before most organizations because they often deal with forecasting events 30+ years in the future (e.g. mortgages). For a bank, the 2038 rollover has been relevant since 2008.
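The arithmetic is easy to check with the standard library (the 30-year term below is just an example):

    import datetime

    # A signed 32-bit time_t overflows 2**31 - 1 seconds after the epoch.
    t_max = 2**31 - 1
    print(datetime.datetime.fromtimestamp(t_max, tz=datetime.timezone.utc))
    # -> 2038-01-19 03:14:07+00:00

    # A 30-year term written in early 2008 already matures past that date:
    print(datetime.datetime(2008, 2, 1) + datetime.timedelta(days=30 * 365))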


I hit one of these on an EMC VNX array one time; after ~400 days all the controllers crashed at the same time. Didn't help that it happened at 4am on New Year's Day. I do recall other instances of this class of bug, but nothing specific.


I had to do a release to fix an outage because someone set up a system that would have an outage every six months if no one ran a release.

Naturally, they didn't document this.


To me, 10 years is long enough to completely forget how to fix the problem, once it becomes a problem in 10 years' time.

Most people won't document well enough to even be aware that the 10-year deadline is approaching, much less how to fix it! When the deadline hits, everything will break, and then you basically have to reinvent the wheel to get it up and running again.

In my opinion, if something can last 10 years, then it could probably just last indefinitely.

Alternatively, have something that lasts a short time, but that is renewed automatically. I suppose that is the advantage of Let's Encrypt, for example: their certificates expire after three months, but that just means I set up a cron job to automatically renew them.
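Concretely, the step such a cron job drives can be this small (a sketch; certbot's "renew" subcommand only replaces certs nearing expiry, so running it daily is safe):

    import subprocess

    # What a daily cron entry would run: "certbot renew" is idempotent and
    # only swaps out certificates that are close to their expiry date.
    subprocess.run(["certbot", "renew", "--quiet"], check=True)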


My last boss was very keen on change control on any of our live systems; steps documented to the point where the cleaner could have run them. It seemed a bit excessive at the time, but there's something lovely about having copy/pasteable steps (and any notes) from when you had to do this a year ago.

As for forgetting deadlines are approaching, just set up automated checks for cert and domain expiration. I wasn't just checking our infrastructure but also any remote APIs/interconnects we were using. There were a handful of times where I'd contact providers with 7 days' warning just to confirm they were aware their certs were expiring; it was slightly terrifying how often they weren't.
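A minimal version of such a check, using only the Python standard library (hostnames are placeholders):

    import socket
    import ssl
    import time

    def days_until_expiry(host, port=443):
        """Fetch a server's TLS certificate and return days until notAfter."""
        ctx = ssl.create_default_context()
        with socket.create_connection((host, port), timeout=10) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                cert = tls.getpeercert()
        expires = ssl.cert_time_to_seconds(cert["notAfter"])
        return int((expires - time.time()) // 86400)

    for host in ("example.com", "partner-api.example.net"):  # placeholders
        days = days_until_expiry(host)
        if days <= 7:
            print(f"WARNING: {host} certificate expires in {days} days")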


I'm about to write a little "choose-your-own-adventure" script that walks new employees through our processes. We're about to begin our once-a-decade turnover, and the idea is that we'll hand over the script and fix the "bugs" in it, rather than explaining things "out loud". Most of these process scripts are used on a weekly or monthly cadence, so the hope is they'll stay up to date.


I love love love Let's Encrypt's three-month expiry for this reason; it used to be such a pain to remember how to do the yearly/biyearly renewal, and now, as you say, it's just a cron job.


At work, our customers need to get new certificates from a gov't agency every other year.

Most of them have either completely forgotten about it and how to do it, or there's been a change of employees and the new ones didn't get the memo, so to speak.

So it falls on us to remind them and guide them through the process.

How I wish the gov't moved to a 3 month setup like Let's Encrypt.


Depending on the government agency, there may be a required level of ongoing identity and need verification that can't be automated. For personal PKI in the US DoD, for instance, you have to go in-person to an ID office on a military installation to get your common access card renewed. For server certs, there is obviously no way to make a server go somewhere physically, but you need a qualified sponsor to sign off on the request to the DoD PKI office, and who that person is will likely change over any multi-year span, as military command positions tend not to last more than a year and even the civilian offices still see fairly frequent turnover at the higher levels. Plus those people need to sign requests with their common access card, which requires them to periodically go to an ID office in-person.


I'll go further: Three months is too long. Secrets which are used to authenticate and identify should be rotated far more regularly, using infrastructure which treats them as effectively ephemeral. The industry has learned to do this -- and built the infrastructure to support it! -- for things like user credentials (see: extensive use of AWS IAM roles, rather than user creds). We should be making a push to treat certificates the same way.

(That said, three months is better than any longer period. The shorter the rotation, the lower the risk -- but, more importantly, the stronger the impetus to build strong automation around the process.)


A three-month expiration time with automatic renewal after two months (as Let's Encrypt recommends) is a sweet spot for me. When something breaks, this gives you 30 days to figure out that something went wrong and to fix it with zero customer impact. The 30-day grace window is also long enough that Let's Encrypt will send you two emails (at the 19-day and 9-day thresholds) to make you aware that something might be going wrong.

If we lowered the expiration time to, say, 3 days, with automatic renewal after 2 days, then any breakage on your side or downtime on Let's Encrypt's side would quickly escalate into HTTPS errors. That in turn would train users that those just happen, and make them ignore the big red scary page even when it's an actual attack. That sounds much worse than the small risk from a 30-day certificate.
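The trade-off is easy to quantify with the numbers from both schemes:

    # How long a broken renewal can go unnoticed before users see errors:
    def grace_days(lifetime, renew_after):
        return lifetime - renew_after

    print(grace_days(90, 60))  # 30 days to detect and fix a failed renewal
    print(grace_days(3, 2))    # 1 day -- any hiccup is user-visible almost at once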


> If we lowered the expiration time to, say, 3 days, with automatic renewal after 2 days, then any breakage on your side or downtime on Let's Encrypt's side would quickly escalate into HTTPS errors. That in turn would train users that those just happen, and make them ignore the big red scary page even when it's an actual attack. That sounds much worse than the small risk from a 30-day certificate.

That's already happened. I'm encountering LE errors on random websites so much that I don't care and automatically click through warnings. This is especially troublesome because my government keeps MITMing me and I don't like it.


This is my experience as well. I encounter cert errors more now than ever, and I tend to ignore them.


> The shorter the rotation, the lower the risk

the lower the risk of compromised certs/keys -- certainly not a lower risk of issues or surprises.

hopefully -- emphasis on hope -- this regular action becomes routine and easy enough that it is a low-risk behavior.


I did this manually for a while with a reminder every 3 months, but now it happens automagically with a cron job on my server as well -- and now I'm similarly at risk of forgetting how it worked in the first place.


> now I'm similarly at risk of forgetting how it worked in the first place

I once met a person at a client org who was generally opposed to automation due to risks of forgetting how things work and not always knowing what the internals behind those abstractions are. It was an interesting take.

At the same time, something like Ansible and other methods of automation can be pretty useful and actually aid in documenting things.

It's especially good if you can spare 10% of your time to put some notes down in Markdown files in a Git repo, or source/deploy most of your automation scripts from there as well.


Let's Encrypt makes sense in a world where you have pet servers and can install random software like certbot and keep them running for years.

In a world of containerized and immutable cattle servers it's not a good solution. Especially not when you technically only need it for something that is internally accessible.

Currently my homelab setup is based on running certbot locally in Docker plus a calendar entry -- maybe I'll get annoyed enough to switch to my own cert at some point, but those are also a big PITA.


LE's short expiry is the primary reason why I don't use it. Yes, I know, automation is the approved solution for this, but it's not a great solution for me.


Agreed. When the tools stop working (which they do), what used to be swapping out a file suddenly becomes a big ordeal: fighting the nginx .well-known bypass, trying to figure out why Let's Encrypt can't connect via IPv6 when everything else can, or, in my case, certbot-auto breaking with no upgrade path on OpenBSD.

My blog and personal website are down for this reason; I simply can't spend half a day at this point in my life figuring out how to do this on OpenBSD, so I'd rather just leave it dead.

Guess I could still just buy an SSL certificate; maybe I'll do that tonight.


I use DNS-01. In fact, it's the only way I can do it as LE doesn't have access to my internal setup.
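For anyone unfamiliar, DNS-01 just asks you to publish a hash in a TXT record, roughly like this (per RFC 8555; the token and account thumbprint here are made up):

    import base64
    import hashlib

    # DNS-01: publish base64url(SHA-256(token + "." + account_thumbprint))
    # as a TXT record at _acme-challenge.<domain>. Values here are invented.
    token, thumbprint = "example-token", "example-thumbprint"
    digest = hashlib.sha256(f"{token}.{thumbprint}".encode()).digest()
    print(base64.urlsafe_b64encode(digest).rstrip(b"=").decode())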

And buying an SSL cert only gives you 398 days in Chrome / Apple browsers: https://support.apple.com/en-us/102028


DNS-01 is awkward with multiple TLDs and providers for a site.

For me it’s like:

    blog.jharasym.com - namecheap
    blog.jharasym.dev - gandi
    blog.dijit.sh - self hosted with BIND



On the other hand, a carefully written doc from 10 years ago might not even apply. So you're going to have to do things from scratch again regardless. Often, that's not really a big deal. And it can happen just as well to stable stuff or cron'd stuff that was working fine for x years until it stopped working.


I've had several problems like this at various jobs: something silently breaks, all hell breaks loose, and it takes an entire team the better part of a day to figure out what 15-second change needs to be made to fix everything.


Why isn't there a special log file for when things expire, so I can sort it by expiration date?


Then: Haha, I won't be on this team in 2024! That's 5 years from now.

Now: Oof ouch owie, all these client certs are expiring.


I would say the same kind of thing about primary keys on a database.

If any imaginable sort of explosion in usage means INT could overflow in your lifetime, just go BIGINT.

I've had this burn me so many times. You don't want to be frantically re-typing a primary key and all the foreign keys that point to it in the middle of the night ten years from now.

You're safeguarding your own or your successors' sleep.
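A back-of-envelope check makes the point (the insert rate is invented):

    INT_MAX = 2**31 - 1                  # 2,147,483,647
    rows_per_day = 1_000_000             # hypothetical insert rate
    print(INT_MAX / rows_per_day / 365)  # ~5.9 years until the key space runs out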


Is there ever a reason to use an INT these days? If you do have a lot of rows, you need BIGINT anyway. If you don't have a lot of rows, the few extra megabytes of disk usage from INT vs BIGINT are unlikely to cause any problems and are probably not even worth thinking about.


A good example of where INT might make sense is a table of relatively fixed size, where other tables with massively more data have a foreign key to it.

No chance of overflow, but as a BIGINT it would take a ton of space elsewhere.

Even then though, it’s potentially not worth it.


I figure we might run out of UUIDs if we build a Dyson sphere.


Just do IPv8


Just this last Saturday before Christmas, an engineer I know at another company was frantically changing database schemas to shift from INT to BIGINT, after they hit the limit and everything started failing.


I love the design and look of this website. Apparently it's based on custom written wiki software.


I too loved it, so I went to see if I could steal the CSS for my own project. Can't seem to find it, though...



Yeah, I don't understand why 10 years would be preferable to 100 years. 10 years indicates that the person thinks "not my problem". Be kind, make it 100 in that case.


If nothing else, 10 years forces the renewed cert to use a newer, likely less vulnerable encryption algorithm. Imagine your organization was dependent on a CA root using a mid-'90s algorithm that still had another 70 years to expiration.


That is something that should be done deliberately as part of the normal product development process, not forced by an arbitrary date that someone set 10 years ago.


There probably isn't an actual reason other than "nice round number," but ten years is conceivably about the time it takes for 50% of a corporate organization to turn over. A hundred years would be guaranteed 100% turnover, including ownership, and likely 99% of organizations don't even exist that long. At that point, it may as well just never automatically expire.


I distinctly remember cks as the very guy who recommended ten years as an expiration date on my self-signed certificates, and I distinctly recall believing this was a bad idea because I'd be guaranteed to have trouble reconstructing the renewal process in ten years' time.

Thankfully I completely discontinued use of the relevant apps long before that deadline.


> I distinctly remember cks as the very guy who recommended ten years as an expiration date on my self-signed certificates

Link? In https://utcc.utoronto.ca/~cks/space/blog/sysadmin/MakingSelf... I see him saying he uses 27y (9999 days).


I don't know. I could be wrong. I interacted with cks in live chat on a daily basis for nearly 25 years. Mostly not over his website.

cks also gave me some other amazing ideas, such as cross-grading my system from 32 bit Ubuntu to 64 bit. That was an ordeal and a half, but I'm proud to say that I finally achieved that goal without any reinstalls, rollbacks, or data loss.


It's a good job no one ever learns from experience, eh?


Ah yes, if we wait long enough, all our problems will eventually go away.

https://xkcd.com/1822/


There's a classic fable (from the Nasruddin Hodja tradition?) where the protagonist agrees to teach the king's donkey to speak in 7 years; when his friends warn him that he'll be executed because he'll fail at this task, he responds: ah, the pay is good, and many things can happen in 7 years -- perhaps the king will die, perhaps I will die, likely the donkey will die, so the problem will eventually go away.


This is a wonderful allegory of modern software companies in SV.

"We do not need to make profit, maybe we will be acquired, many things can happen".


Similarly, 1024 weeks is terrible. The GPS epoch rolls over at that interval, which is just long enough that a lot of receiver makers decide not to handle it, but short enough that they very much have to.

64 weeks would've been better. Force it to be handled gracefully and tested thoroughly.
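The arithmetic behind both intervals is just week-length math:

    # GPS broadcasts its week number in 10 bits, so it wraps every 1024 weeks.
    print(1024 * 7 / 365.25)  # ~19.6 years -- long enough to skip handling it
    print(64 * 7 / 365.25)    # ~1.2 years -- every receiver would hit it early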


Reaction: Talk to an old guy in Legal or Accounting about having stuff that "just keeps working", out of sight & mind.

Though to be fair:

> If you don't have a plan to [...] definitely take it out of service, ten years is far too short.


I had a production domain with a 10-year paid registration go down 6 years in because of outdated WHOIS and domain contact information. Lesson #1: there is nothing like the false security of "set it and forget it."



