2021.06.08 Certificate Lifetime Incident (letsencrypt.org)
400 points by Arnavion on June 9, 2021 | hide | past | favorite | 164 comments

Head of Let's Encrypt here.

The question of whether or not revocation should happen has to be asked whenever a certificate compliance issue is being discussed, regardless of how serious the issue is. That is a normal part of the process of evaluating an incident thoroughly.

We do not plan to revoke any certificates as a result of this issue.

An aside - I love how informed many of the commenters here are, thanks to you all for helping to explain what's happening!

I love your work!

Any chance you could convince Microsoft to add Let's Encrypt as an "integrated" CA in Azure Key Vault? It's absolutely bonkers how much money I have to pay to get a certificate in 2021 for cloud services! E.g.: the App Service certificates are $70/year each or $300/year for a wildcard certificate. That's nuts. Reference: https://azure.microsoft.com/en-us/pricing/details/app-servic...

I strongly suspect the reason Let's Encrypt isn't adopted more widely in cloud services is because there's no margin on a free service. This is why Microsoft, AWS, and GCP all carefully pretend that there are no free options, and make sure that it's a difficult uphill battle to use Let's Encrypt.

E.g.: https://docs.microsoft.com/en-us/azure/key-vault/certificate...

Notice how the document title is literally "Integrating Key Vault with DigiCert certificate authority". Not "Certificate Authority", it's the "DigiCert certificate authority". Apparently, HTTPS is now DigiCert's protocol, they're the gatekeepers, and you have to pay them money to use it.

It makes my blood boil that 1KB files of random numbers still cost money, and that the trolls under the bridge are still taxing everyone for what is now essentially mandatory for all web sites.

If anyone here has a significant account with Azure, please apply some pressure to your Microsoft account manager next time you have coffee with them. This rent seeking for what should be free for everyone has to stop.

>This is why Microsoft, AWS, and GCP all carefully pretend that there are no free options, and make sure that it's a difficult uphill battle to use Let's Encrypt.

Maybe I'm misunderstanding, but GCP HTTPS and SSL load balancers give you certificates for free. They support Google's own CA (pki.goog) as well as Let's Encrypt. Full disclosure I work in GCP, but not on this.

AWS load balancers also give you managed certificates for free.

Certificates issued from AWS Certificate Manager are — and, I believe, always have been — free. AWS is definitely not rent-seeking on certificates.

Certificates from AWS Certificate Manager are part of the roach motel though - AWS orchestrates the issuance and holds the private keys so you can only use them via AWS services.

But that's what the grandparent was complaining about though? If you're not using the provider's managed services, then nothing is stopping you from running your own ACME client to provision certificates without paying the cloud provider money for certs.

"AWS Certificate Manager supports a growing number of AWS services. You cannot install your ACM certificate or your private ACM Private CA certificate directly on your AWS based website or application."

Free certificates you can't use on EC2 virtual machines are basically worthless, at least for me.

Stop internalising your 1990s architecture limitations! You shouldn't need to pay for a Layer 7 load balancer for an application that doesn't need it. A 1-core web server VM can easily put out 1-2 Gbps of HTTPS traffic. You don't need SSL offload. A crypto accelerator card is not required. You don't need an appliance to do HTTPS. You can have end-to-end HTTPS without additional infrastructure. Both Windows and Linux can do TLS out-of-the-box. You don't need a vendor to give you special permission to have security. There is no need to pay GoDaddy or DigiCert for a certificate.
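To make the "out-of-the-box TLS" point concrete, here is a minimal sketch of an HTTPS server using only Python's standard library, with no load balancer or offload appliance involved. The cert/key paths are assumptions (certbot's default layout); substitute your own.

```python
# Minimal HTTPS server from the standard library alone.
# CERT/KEY paths below are certbot's usual layout and are assumptions.
import http.server
import ssl

CERT = "/etc/letsencrypt/live/example.com/fullchain.pem"
KEY = "/etc/letsencrypt/live/example.com/privkey.pem"

def make_https_server(certfile=CERT, keyfile=KEY, port=8443):
    # PROTOCOL_TLS_SERVER negotiates the best available TLS version.
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.load_cert_chain(certfile, keyfile)
    httpd = http.server.HTTPServer(("", port),
                                   http.server.SimpleHTTPRequestHandler)
    httpd.socket = ctx.wrap_socket(httpd.socket, server_side=True)
    return httpd  # call .serve_forever() to start serving
```

A single-core VM running something like this (or nginx doing the same thing) terminates TLS end-to-end with no vendor in the loop.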

The vendors are pulling the wool over your eyes, convincing you that your out-of-date thinking is good and proper, and then charging you for the privilege of having the bare minimum security that should be free as standard.

AWS ACM has always been free. ACM has never allowed you to export private keys. However ACM does have a way[1] for you to use ACM keys with EC2 instances: Nitro Enclaves. Nitro Enclaves carve off a little piece of your EC2 instance (memory + VCPU) into an isolated VM that feels a little bit like an HSM or a secure enclave.

[1] https://docs.aws.amazon.com/enclaves/latest/user/nitro-encla...

If it's your own EC2 instance and you're not using a load balancer then why can't you just use Let's Encrypt?

Just use certbot or alternatives, what's the problem?

Holy shit, Azure really charges that much? Does the tls certificate come with some other service or unique feature bundled as part of the cost? AWS basically charges you for the API calls, cents really, and I’m fairly certain GCP is at least as inexpensive if not free. Can you at least export it and take it with you? That’s really the trade off with AWS, private keys aren’t exportable. But it’s so cheap that it’s a non issue. If I want one I can manage myself I can pay or use letsencrypt.

AFAIK Azure does free ones now; I'm certainly using some free certs on Azure.

They're not really free: "This feature is available for customers on an App Service Plan of Basic and above (free and shared tiers are not supported)." They're GoDaddy certificates, and their price is charged back through the App Service pricing.

Similarly, Azure Front Door also has "free" certificates, but they just integrate it into the relatively high cost of the service.

If you want certificates for some other unrelated IaaS or PaaS service... Microsoft says no. They want their margin.

Back to GoDaddy: their attitude is very 1990s, so they sometimes use manual approval for certificates. This makes ARM Templates that normally take minutes to deploy just hang and take hours, or even fail.

Worse, they don't use the DNS address in your request for validation of domain ownership. Instead, they determine the "TLD+1" with some heuristics. Unfortunately, there is no such concept in the Domain Name System itself, so this is unreliable at best. This approach is broken by design and cannot be made to work for many domains. For example, in Australia, App Service Certificates cannot ever be used for subdomains of act.gov.au, nsw.gov.au, and nt.gov.au!

PS: For people who are unaware, the concept of the TLD is at best a fuzzy one, and is decided by the informally maintained Public Suffix list, which is currently managed by Mozilla. It's not an RFC, it's not a standard, and isn't suitable for certificate validation. See: https://publicsuffix.org/

This is one of the key philosophical differences between Let's Encrypt and GoDaddy. When issuing automated, free certificates, manual labour for validation is not a viable approach and hence Let's Encrypt eliminated all such sources of informal, error-prone, manually verified sources. GoDaddy hasn't changed their validation approach in decades, because for $70/year, this kind of inefficiency is acceptable.

To put things in perspective: GoDaddy has a support phone number. For certificates! They're literally 1KB files with two numbers and some text in them. Why do they need support!?

> Why do they need support!?

People manually installing certificates who are unfamiliar with the process. Which, to be fair, is a complex and very technical process if it’s some self-hosted or shared-hosting shell account running an outdated version of cPanel.

I’m kind of shocked that Microsoft has to partner with godaddy and didn’t already have their own CA long before Azure was even a twinkle in some executive’s eye.

Thanks for Let's Encrypt, it's truly an admirable service.

I'm curious regarding "certificate compliance" - I thought the 90-day expiration was merely a Let's Encrypt policy to encourage good automation. Is this just a matter of holding yourselves to high standards, or is there a greater authority to which LE promised 90-days-exactly?

90 days is our choice, 90 days and one second validity isn’t necessarily an issue.

The issue is that we said in our CPS that the lifetime was exactly 90 days, and then we did something different, even if just by one second. The problem here is behavioral consistency with our own published policies.

We now just say “less than 100 days.” An unfortunate side effect of being more specific and informative in these documents (e.g. saying our certs have 90 days validity) is that it ups our chances of noncompliance if we are off on something by a bit, even if it is a meaningless bit. There is an incentive to not be so specific so as to avoid situations like this.
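For readers wondering where the stray second comes from: RFC 5280 validity is inclusive of both notBefore and notAfter, so a naive "notBefore + 90 days" overshoots by one second. A sketch with illustrative dates:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical issuance time, purely illustrative.
not_before = datetime(2021, 6, 8, 0, 0, 0, tzinfo=timezone.utc)

# Naive approach: notAfter = notBefore + 90 days.
not_after = not_before + timedelta(days=90)

# RFC 5280 counts both endpoints, so the cert is valid for
# (notAfter - notBefore) + 1 second: one second over 90 days.
inclusive_validity = (not_after - not_before) + timedelta(seconds=1)
print(inclusive_validity.total_seconds())  # 7776001.0

# The fix: back notAfter off by one second.
exact_not_after = not_before + timedelta(days=90) - timedelta(seconds=1)
exact_validity = (exact_not_after - not_before) + timedelta(seconds=1)
print(exact_validity == timedelta(days=90))  # True
```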

Thank you for the clarification and the Let's Encrypt service.

Hi! Are your time servers synchronized to GMT (mean solar time at the Royal Observatory in Greenwich) or UTC (Coordinated Universal Time)?

The RFC requires GMT, but most time servers synchronize to UTC. It could be a difference of up to 0.9 seconds between GMT and UTC.

> It could be a difference of up to 0.9 seconds between GMT and UTC.

Not in this context. In the context of ASN.1, NTP, POSIX, etc, GMT is a timezone equivalent to UTC+00:00. UT1 and DUT1 don't figure into it.
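A sketch of how RFC 5280 actually encodes those timestamps (values illustrative); the trailing "Z" is what makes the GMT/UT1 distinction moot here:

```python
from datetime import datetime, timezone

# An illustrative certificate timestamp. The trailing "Z" in the
# encodings below means UTC+00:00 ("Zulu"), not mean solar time.
t = datetime(2021, 6, 8, 4, 15, 0, tzinfo=timezone.utc)

# RFC 5280: UTCTime (two-digit year) for dates through 2049,
# GeneralizedTime (four-digit year) from 2050 on.
utctime = t.strftime("%y%m%d%H%M%SZ")
generalized = t.strftime("%Y%m%d%H%M%SZ")
print(utctime)      # 210608041500Z
print(generalized)  # 20210608041500Z
```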

GMT was abolished at the end of 1971, when the Greenwich observatory started distributing UTC as its official timescale. At that time the observatory was based in Herstmonceux, so GMT was calculated from a model rather than being based on observations made from the transit instruments in Greenwich.

Thank you for let’s encrypt. Very grateful for your role in society.

Won't it basically fix itself in 90 days and 1 second after all the certs are rolled anyhow now that it's on the radar?

This might just be me, but I expected this issue to be taken far less seriously than it has been in the given communication [0]. In the linked similar issue [1], they even talk about revoking the certificates, which seems insane to me for being one second off. Is there any actual serious problem with this extra second I'm missing?

[0] https://bugzilla.mozilla.org/show_bug.cgi?id=1715455

[1] https://bugzilla.mozilla.org/show_bug.cgi?id=1708965

This is a "no brown M&Ms" kind of violation. The direct impact of this violation is basically nil. However, it's a violation that suggests that the requirements haven't been very carefully read and that conformance to them hasn't been very carefully audited. So how can you have be assured that there aren't violations of more important rules still lurking about?

In case anyone is confused, I believe the reference to "no brown M&Ms" is a reference to Van Halen's contracts: https://www.snopes.com/fact-check/brown-out/

> So, when I would walk backstage, if I saw a brown M&M in that bowl … well, line-check the entire production. Guaranteed you’re going to arrive at a technical error. They didn’t read the contract. Guaranteed you’d run into a problem. Sometimes it would threaten to just destroy the whole show. Something like, literally, life-threatening.

There's something I don't get about that method, as clever as it sounds. The method is if there are brown M&Ms, then check the entire production for safety. What about the else? Don't check everything? So if the person in charge of the green room happens to be on top of things and picks out a few M&Ms, they don't line-check the entire production even though there could be something "literally life-threatening"?

It's easy to say that the brown M&Ms acted as a signal, but I can't think of any way to depend on that signal that isn't reckless. Just check the actual production, since that seems to be an option.

Perhaps in practice, if they passed the M&Ms check, you would say "Great, okay, now talk me through all the other things you set up." whereas if they failed that test, you would say "I'm not willing to take your word on anything, so I'm going to call in an external auditor to check everything, and I'll recover the costs of that from you, for breach of contract."

As in most things, it is an economic tradeoff.

A big-name traveling band is kind of like a carnival without the ferris wheels. They have a lot less support staff than you probably think, and most of them are busy most of the time you're not driving.

If you have more than one person playing management/problem solver/coordinator/process lubricant, you've got it easy. Much more typical is everyone heads-down on their part of the event and one person trying to hold it all together.

That person has no free bandwidth, and depends heavily on proxy measures of the state of things.

I've played roadie before, not for a super big act and not for long. And I've ended up running point for a lot of conferences, more or less because I know how to run that sort of thing. Conferences are so, so much easier - they have resources, can assume workers aren't wasted, can assume the location at least vaguely cares about things like fire suppression, can assume the venue doing a revenue split actually has a license to sell alcohol...

> As in most things, it is an economic tradeoff.

I think that's what gets me. I'm pretty "Old Man Yells At Cloud" right now, criticizing an 80s metal band for not meeting the safety standard that I've imagined. It's just that people pull out this idea as being something neat and I think it's a completely inappropriate technique when other options are available and safety is on the line. If either of these things aren't true, then sure, try it.

You can't see how people follow orders in war without sending them to war. So you tell them "no food in the barracks" and when you catch them with a bagel you make them do a hundred pushups. And it's easy to toss out resumes with typos since they've indicated they don't actually have attention to detail. Those make sense because they're the best you can do.

But if someone dies at your concert, you'll probably think it would have been worth it to have more than one person playing safety checker. The contracts and M&Ms aren't actually going to make you feel better. (Or necessarily keep the lawyers away).

At its deepest, I think I'm reacting to a change that happened to me over time. I used to work at a ropes course. Think team building, high wires and zip lines. We'd give a list of instructions to our group contact to pass along to the group, including that people need to wear long pants and closed-toe shoes.

Well, about 20% of groups would hop out of their cars with a bunch of people in shorts and sandals. "Oh, no one told us." And we'd announce to ourselves that the group contact had failed to communicate and we should "watch out" for the rest of the day.

Looking back, I think we were the dumb ones. If 20% of groups are showing up unprepared to my ropes course, then that's my problem. I need to make different choices that lead to better outcomes. If I have evidence I shouldn't trust group contacts to make things happen (I definitely do), then I need to stop counting on them.

It’s a koan about how to use heuristics, not a guide to running tours. Much like how Full Metal Jacket isn’t the same as being in the military.

> people pull out this idea as being something neat and I think it's a completely inappropriate technique when other options are available and safety is on the line.

But we do this with all things all the time. You didn't take a ruler/micrometer and measure tolerances between parts of the last car you bought, nor the last cab/rideshare/friend's car you got into. Instead, you make some assumptions and use some cues to inform those assumptions.

Perhaps you're about to get into some rideshare vehicle and you notice the exhaust looks a little smoky for the age of the car, and think maybe they aren't taking care of it. Then maybe you take a closer look at the tires and notice they're bald, etc. This is definitely something that would be considered "literally life-threatening", but there's a statistical likelihood that makes it something that not everyone checks every time.

> But if someone dies at your concert, you'll probably think it would have been worth it to have more than one person playing safety checker.

But it's not "your" concert, as much as it's billed that way. It's a joint operation between the artist and the venue, and each have their own responsibilities. It's the venue's responsibility to do certain things. This is just the artist trying to use one technique of many that are likely employed (like outright asking) to gauge how well the other party has fulfilled their responsibilities. You can't check every single thing, otherwise there's no point in there being another party (and they may not give you that access to check), but there are things you can do let you know a closer look is warranted.

> Looking back, I think we were the dumb ones. If 20% of groups are showing up unprepared to my ropes course, then that's my problem.

To some degree, maybe. I think the best outcome is doing both: "watch out" when stuff looks awry, but also examine why it doesn't seem to change. But what if you can't eliminate the problem? What if, no matter how you iterate on getting the contact to correctly relay your information, you can't always get the info to all the relevant people, either yourself or through the contact, and 5% still show up like that? Do you drop the "watch out" notice? No, most likely you still use it, because it's useful, better than nothing, and still provides a little benefit on top of everything else that was done.

That's what we should assume the rock band was doing. They have contracts that stipulate how stuff is supposed to be, and different staff to coordinate with the venue reps for aspects of the production, and they should be doing what they can to make sure stuff is set as expected. But if brown M&M's show up, maybe it's worth paying those people a little overtime to grill the venue on what they actually did versus what they said they did, and to check their work, because who knows, maybe this is the time it saves a life.

Look, that's all super reasonable. I don't have an airtight argument. This is more of a different way of looking at things. I think those sort of tricks can help with safety, but they can also come from a place of ego. If you don't believe me, check out the Snopes link I originally replied to for what happened when there was a brown M&M.

On your rideshare example, I don't do a preflight check when I ride an airliner. I did when I was a pilot, even if it had just come out of the shop.

> But it's not "your" concert, as much as it's billed that way.

This is it. This is the difference in mindset. I now think of it as "mine". Not that it's mine and no one else's. But I choose to take responsibility for anything that went wrong that I could have prevented. I don't care how the responsibilities are divvied up or what the contracts say. So anywhere I'm in some position where people are counting on me, I'm either going to check things myself or know that someone I trust did the checking.

I don't actually disagree with your point of people taking more responsibility, I just think the brown M&M's incident isn't a good example to illustrate it. The production crew had an extensive document outlining all the requirements, likely built up over time as they encountered things venues did not actually take the care to ensure (like the stage not sinking through rubberized flooring).

> This is it. This is the difference in mindset. I now think of it as "mine". Not that it's mine and no one else's. But I choose to take responsibility for anything that went wrong that I could have been prevented.

That's fine, and you can take responsibility, but you can't do all the work, not in any way that scales. Even in the preflight checks for the plane, you're not disassembling wings and checking for cracks on internal struts I imagine. Someone else does that occasionally and you have to trust (or not) their opinions and that they've actually done the work. And you can't always spend a week doing a thorough background check of those people either (or for some reason you're forced to use someone you would rather not), so sometimes little tricks that might indicate that you should be wary are useful.

I think you’re spot on. It’s a summary QA of the QA layer below Eddie van Halen. I think it’s a mindset and extra check, much akin to clearing a handgun after you watch someone clear the gun as they hand it to you. It never hurts to have an additional safety measure.

It’s a signal to a potentially larger issue.

The other commenters have a point, but the Snopes article actually points this out:

> So just as a little test, in the technical aspect of the rider, it would say “Article 148: There will be fifteen amperage voltage sockets at twenty-foot spaces, evenly, providing nineteen amperes …” This kind of thing. And article number 126, in the middle of nowhere, was: “There will be no brown M&M’s in the backstage area, upon pain of forfeiture of the show, with full compensation.”

So they had other simple-to-check signs; the brown M&M one is just the only one commonly cited as it's the funniest/most interesting.

Is a fifteen amp socket providing nineteen amps a thing?

No, I think it's supposed to be a contradiction that an on-the-ball venue will catch before signing the contract.

It's at least 15 sockets, each providing at least 19 amps. Probably following some plug/socket standard.

I think you’re incorrect. There is a standard 15A socket. There is no standard 19A socket (but there are 20A sockets that are compatible with 15A plugs)

Sure is. See this stackexchange link: https://diy.stackexchange.com/questions/12115/is-using-15-am...

So I think what the contract was specifying was several 15-amp receptacles, spaced 20 feet apart, from which a combined total of not less than 19 amps may be drawn. A 20-amp breaker would do that nicely.

They are stating that they would put 19 A of load on a 15 A rated socket, i.e. creating an obvious safety hazard that should be caught and addressed.

I assume they mean multiple 15 A (standard) sockets with a combined load no more than 19 A

It’s a common heuristic for inspections. It’s not efficient or possible to check everything, so you inspect a few pieces in minute detail. For the pieces that fail, you do a complete inspection.

In my experience it’s a pretty good heuristic. It fails badly if the organization knows which pieces will be inspected (cue Goodhart). It can also be fantastically inefficient if there’s a ton of checks or the checks are ambiguous.

Being a stickler at inspections sets a useful precedent that organizations will do a better job for you in the future.

Napoleon took advantage of this during his first inspection of a combined arms unit (Napoleon was an artillery officer). He inspected the shit out of the artillery unit first. Then he could be reasonably confident the light infantry and cavalry units would take the inspection seriously.

An even cleverer trick, if you have lots of things you could inspect but finite resources, and you'd like to encourage people to adopt practices that routinely pass inspection anyway:

The Paris MOU randomly inspects vessels entering European ports, with the rate of inspections based on their flag. So if you're a UK-registered ship, you might expect to go several years without inspection, while if you're Albanian-registered (Albania is blacklisted), you're getting inspected twice a year or more, and if you fail several inspections you're getting banned from all European ports.

How did Albania get blacklisted? The Paris MOU tracks the results of inspections. Each time a ship flying some flag is inspected, the result of the inspection changes the calculation of how much apparent risk there is of ships with that flag failing inspection. Risk too high? Greylist. Risk higher still? Blacklist.

Why flags? Because before Port State Control existed, the Flag States (Country where the ship is registered) were responsible for periodically inspecting ships. And they still are, it's international law, the Europeans just got sick of all these ships with a "flag of convenience" from some distant island that are clearly non-compliant and have never been inspected in their nice well-regulated ports. So they invented the Paris MOU and Port State Control.

And it worked, I mentioned the UK flag is whitelisted, but the Bahamas are a famous Flag of Convenience, and they're whitelisted too, because they took the job seriously. A Bahamas registration is a little cheaper†, but your ship will be properly inspected on their behalf, and so sure enough when it gets randomly inspected again in say, Antwerp, it'll likely pass. Meanwhile St Vincent, another Flag of Convenience, is on the Greylist and a few tiny islands that dipped their toes in the "Flag of Convenience" business are blacklisted. The promise of inspectors poking around inside your ships every five minutes is a bad deal even if Tuvalu shipping registration is half price.

† The other thing we don't like about Flags of Convenience is they make it easier to hide who owns things, which can serve to also avoid taxation. The Paris MOU doesn't fix that, that superyacht registered in the Bahamas might be safe but the owner likely didn't pay tax...

If we were designing a critical system around this, I suppose I would create several unrelated triggers that, if found, would invoke an extra-detailed review of all work in the show.

The baseline assumption is that they read your contract and do everything. It's their job to get it right. If they're attentive to the stupidest thing in your contract, they probably did an OK job on everything else. Sure, it's still possible to be electrocuted during the show even if the M&M's were right.

But, rock stars are not going to arrive early every night just to stay alive. The buck has to stop somewhere, and EVH decided it was in the bowl of M&M's.

I don't buy that this trick would be effective, though I like the sentiment. I would guess that the detail of m&m color could very often be overlooked due to its insignificance and create a false signal.

For one, the people in charge of safety, like rigging and electrical, are not going to be the same people supplying the refreshments.

I can also imagine a high number of people like myself reading that part and thinking it's a joke, or purposely ignoring it because there are much more important issues to deal with and because I'm not much into indulging the ridiculous whims of self important people.

> I can also imagine a high number of people like myself reading that part and thinking it's a joke, or purposely ignoring it because there are much more important issues to deal with and because I'm not much into indulging the ridiculous whims of self important people.

It doesn't say "no brown M&Ms" - it says "no brown M&Ms, otherwise we cancel the show and you still pay us the entire amount".

If I saw something like that in a contract, maybe I wouldn't go looking for brown M&Ms, I'd at least ask my boss (who hopefully has a direct line to our lawyers) "is this for real?", and if my boss and our lawyers are any good they'd tell me that's as real as it gets.

There is no need to indulge them: you just tell them you are not going to sort the M&M's. If you do that in advance: great. If you ignore other points but tell them in advance: also great. If you ignore requests and don't communicate that fact, you are in trouble. Like the 15-amp sockets which can deliver 19 amps in a comment above: that's a fire hazard. If you don't start a discussion that it is a fire hazard, saying you will only deliver 15 amps or that you will use a different, higher-rated socket, you failed the test and your work is suspect.

> I'm not much into indulging the ridiculous whims of self important people.

Then you probably wouldn't be working at a concert venue / on a showbusiness production team anyway.

Edit: have you seen a Van Halen (or, really, any stadium-headlining) show? The entire performance is "the ridiculous whims of self important people". That's kind of why we like it.

It's a heuristic.

Furthermore, the brown M&Ms clause was buried deep in a list of technical riders. The person in charge of the green room isn't reading the technical part of the contract at all. Why would they? It's not their job. They're reading the other part of the contract that plainly says "please put a bowl of M&Ms in the green room". So the only two cases in which a venue is going to pick out the M&Ms are...

1. The technical director saw the requirement and told the green room guy to start picking out M&Ms or the show is cancelled

2. The green room guy somehow knows about the brown M&Ms thing and decides to falsely signal compliance

Presumably, the whole brown M&Ms thing was obscure enough that no venue actually decided to just check for odd requirements. Furthermore, if they did decide to just do that and only that, specifically knowing that they were interfering with such a heuristic, they'd almost certainly be liable for something. So I doubt anyone would deliberately interfere with this, and it's hard to accidentally comply with just the no brown M&Ms thing.

"Just line-check the production all the time" might not always be an option.

If the brown m&m clause wasn’t in the green-room section, then it’s still a valid test of the setup - in the sense: “can you guys simply communicate?”

It's a bit like mutation testing. Add real bugs to the code (through automation) and see what gets caught by your unit test. Do this a few hundred times and you can get some idea how many bugs your unit tests might have missed.
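A toy version of that idea, just to make the analogy concrete (this is illustrative, not a real mutation tool like mutmut):

```python
# Toy mutation test: introduce a known bug and see if the tests notice.
def add(a, b):
    return a + b

def add_mutated(a, b):  # mutant: '+' swapped for '-'
    return a - b

def weak_suite(fn):
    # Only exercises a case where + and - happen to agree.
    return fn(2, 0) == 2

def strong_suite(fn):
    return fn(2, 3) == 5

print(weak_suite(add_mutated))    # True: mutant survives, coverage gap
print(strong_suite(add_mutated))  # False: mutant killed, test did its job
```

A surviving mutant is the analogue of a brown M&M making it into the bowl: the defect itself is harmless, but it tells you your checks have a hole.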

The else is undetermined. See https://en.wikipedia.org/wiki/Denying_the_antecedent

So, the logic of this signal is like this:

- If a brown M&M HAS been found, you know for a fact they didn't follow the specs well enough;

- If a brown M&M HAS NOT been found (denying the antecedent), you can't state anything about whether they followed the specs or not.

I was talking about behavior following a decision tree. You're talking about arguments. The "else" can't exactly be undetermined here. In checking out the M&Ms, he's making a choice to check the production in that moment or not. (Of course, he can check it later, too).

Did Van Halen ever actually cancel any shows because of brown M&Ms? The Snopes article says David Lee Roth found a brown M&M, but the show proceeded (and technical problems caused damage to the venue floor).

"I found some brown M&M’s, I went into full Shakespearean “What is this before me?” … you know, with the skull in one hand … and promptly trashed the dressing room. Dumped the buffet, kicked a hole in the door, twelve thousand dollars’ worth of fun.

The staging sank through their floor. They didn’t bother to look at the weight requirements or anything, and this sank through their new flooring and did eighty thousand dollars’ worth of damage to the arena floor. The whole thing had to be replaced.

Vandalism and destruction of property is a crime, even if you say you will pay for it. I guess if you are rich or famous enough, everyone just lets everything slide.

Unless, of course, said destruction of property is in said contract.

Otherwise, "smash therapy" places couldn't exist.

If I were writing contracts for rock stars, I'd probably look to account for tantrums and raging parties in the wording.

On reading the explanation:

> The Zlint project only attempts to check whether a certificate exceeds the maximum validity allowed by the baseline requirements [398 days], and is not configurable.

To extend the 'no brown M&Ms' analogy, this would be like Van Halen learning the caterer was only checking for FDA safety requirements and not their specific needs. It's good that it's food safe, and a brown M&M (or one with a fleck of brown exposed) isn't going to hurt anyone, but it means that you've got a breakdown in your process.

Specifically, Let's Encrypt needs a review of their linting process. The requirement was that the lifetime be set to 7775999 seconds, i.e. 90 days; this particular tool would not have thrown an error if the time had been 180 days or 360 days, and would only have thrown an error at 399 days. One second is not a problem, but 300 days would be a problem for sure! How were the lints they're running selected? In particular, how was the rule 'e_tls_server_cert_valid_time_longer_than_398_days' accepted, knowing that their validity was shorter than that? Are there other rules where the linter is checking against the most liberal spec while Let's Encrypt is offering something stricter? Fuzzing against a validity clock of 7775998 seconds, 7775999 seconds, and 7776000 seconds would find this issue; are there any other parameters that can be fuzzed? Questions like this need to be asked.
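A boundary test like the one suggested is only a few lines. This is a sketch, not Let's Encrypt's actual tooling; the policy constant below encodes the intent that 90 days of inclusive validity means the encoded notAfter − notBefore must be one second less than 90 days:

```python
from datetime import datetime, timedelta

# Hypothetical policy lint: Let's Encrypt intends exactly 90 days of validity.
# Because notBefore..notAfter is inclusive, the encoded difference must be
# one second LESS than the intended lifetime.
INTENDED_LIFETIME = timedelta(days=90)                        # 7776000 seconds
MAX_ENCODED_DELTA = INTENDED_LIFETIME - timedelta(seconds=1)  # 7775999 seconds

def lint_validity(not_before: datetime, not_after: datetime) -> bool:
    """True if the encoded validity period obeys the 90-day policy."""
    return (not_after - not_before) <= MAX_ENCODED_DELTA

# Fuzzing the boundary, as suggested above:
start = datetime(2021, 6, 8)
for seconds in (7775998, 7775999, 7776000):
    ok = lint_validity(start, start + timedelta(seconds=seconds))
    print(seconds, "OK" if ok else "VIOLATION")
```

The last case is exactly the off-by-one that shipped: 7776000 encoded seconds is 90 days plus one inclusive second, yet a lint checking only the 398-day Baseline Requirements ceiling never fires.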

There was a similar issue a while back, where an open-source system used by a number of CAs was by default configured to generate serial numbers with only 63 bits of entropy, when the CA/B forum rules require 64 bits: https://www.schneier.com/blog/archives/2019/03/cas_reissue_o...

This incident was complicated by the fact that it was first noticed while investigating a particular CA (DarkMatter) that was already suspected of various shady activities. Most of the other affected organizations took the incident seriously, fixed the problem and revoked the affected certificates, as required by the letter of the rules, even though the actual risk was small.

DarkMatter chose to spend a lot of time complaining and making weaselly arguments on the mailing list (like "a 64-bit number where the leading bit is always zero is still technically a 64-bit random number") which did not do much to convince anyone of their trustworthiness.
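The entropy argument is easy to make concrete. A sketch (not the actual CA software involved in that incident):

```python
import secrets

def serial_full_64() -> int:
    # 64 random bits: the full 64 bits of entropy the rules ask for.
    return secrets.randbits(64)

def serial_leading_zero() -> int:
    # Forcing the top bit to zero (e.g. to keep the DER INTEGER positive
    # without an extra padding byte) halves the space: only 2^63 possible
    # values remain, i.e. 63 bits of entropy, even though the value still
    # "fits in" 64 bits.
    return secrets.randbits(64) & ~(1 << 63)

# Every "64-bit" serial with a forced-zero leading bit lies below 2^63:
assert all(serial_leading_zero() < 2**63 for _ in range(1000))
```

So "a 64-bit number where the leading bit is always zero" is, for entropy purposes, a 63-bit number, which is what the complaint was about.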

> This is a "no brown M&Ms" kind of violation.

I see it more as a "no brwn M&Ms" violation.

We would all assume the intent was to say no brown M&Ms, rather than imagine that no action was required.

I think part of it is that the CA/B Forum (the browser group setting standards for CAs) comprises much of the same group of people that initially supported Let's Encrypt, and they also use Let's Encrypt as a justification for tightening requirements on other CAs.

They really do _not_ want to be seen to be giving Lets Encrypt special treatment because of the connection between the CAB members and Lets Encrypt and the threat that Lets Encrypt poses to the business models of the traditional CAs.

So a violation of the rules by Lets Encrypt is being dealt with by the letter, even when the issue clearly does not warrant that level of reaction.

(See also the time Google Search banned Google Chrome from results for a month when one of their advertising campaigns breached the rules for paid SEO)

It makes for good practice. Ideally real and severe incidents are rare, so you should be doing testing exercises to ensure the processes and tools are in place. But when a real incident, even though seemingly minor one, comes along you might as well use it for practice and "do things right".

I'm all for revoking the certificates, but feel that ample time should be taken to study what sorts of issues revoking would cause. It would not surprise me if this study takes approximately 90 days to complete.

They've done mass revocations in the past. There was one about a year ago from what I remember. So whatever study could be done likely already has been. The question now is whether the issue is severe enough to warrant that action.

I feel that you've missed the joke. Also, there is no way, at all, that they are going to revoke nearly all of their certs because they were valid for 1 second longer than they should have been. It would mark the end of Let's Encrypt if they did.

We've lived with certs that were valid for a second longer than they should have been since the inception of Let's Encrypt, and three months won't kill anyone.

edit follows:

When they revoked 3% of their certificates, not all of them were able to renew in time due to physical server limitations. The renewals required server administrators to forcibly renew their certificates, and the email address associated with the certificate was contacted to let them know. It would be an unmitigated disaster if 185 million certificates were suddenly revoked.

Let's Encrypt has capacity to reissue all certificates in 24 hours. I think this is why you are supposed to run certbot every day. https://letsencrypt.org/2021/02/10/200m-certs-24hrs.html

They have the physical infrastructure to do so, but you've missed this part of the article you linked:

> In order to get all those certificates replaced, we need an efficient and automated way to notify ACME clients that they should perform early renewal. Normally ACME clients renew their certificates when one third of their lifetime is remaining, and don’t contact our servers otherwise. We published a draft extension to ACME last year that describes a way for clients to regularly poll ACME servers to find out about early-renewal events. We plan to polish up that draft, implement, and collaborate with clients and large integrators to get it implemented on the client side.

Running certbot daily will currently do nothing. It won't think that the certificate needs to be replaced since it has not yet reached the limit required to renew.

> Running certbot daily will currently do nothing. It won't think that the certificate needs to be replaced since it has not yet reached the limit required to renew.

Yep, this. Clients need to be built to support it.

I know this is from the linked article and not from you, but:

> In order to get all those certificates replaced, we need an efficient and automated way to notify ACME clients that they should perform early renewal.

This is already possible. It's called OCSP stapling, and it's what Caddy (and CertMagic) does by default, automatically. When Caddy sees from OCSP that the certificate is revoked, it will automatically replace it.

Wouldn't clients then only become aware that they need to replace their certs after they've been revoked?

I think the desire here would be for a mechanism to alert clients to obtain a new certificate before their current certificate is revoked and becomes invalid.

edit: formatting

> Wouldn't clients then only become aware that they need to replace their certs after they've been revoked?

A certificate that is going to be revoked is as good as revoked. There is no "almost untrusted, but not quite yet" gray area (unless you're talking about expiration dates, which some browsers allow leniency on; but we're talking about revocation, where we know there was a problem or misissuance, whereas expiration is mainly a passive safeguard against indefinite trust).

So, once a client sees a "Revoked" OCSP status, it can replace the certificate immediately, before the previous, valid OCSP response expires.

The article you point to does not mention anything about Certificate Transparency. They depend not only on their own CT log but also on CT logs run by external parties to issue a certificate.

Google's current policy still seems to require them to use 1 Google log and 1 non-Google log, given the lifetime. They seem to use 1 Google log and 1 other log chosen at random, possibly including their own.

I doubt the logs will be able to keep up.

> It would not surprise me if this study takes approximately 90 days to complete.

The other CA (KIR S.A.) actually issued certificates with a validity of one year, so, for them, dodging this is not that easy.

What's the connection between Let's Encrypt and KIR here?

There is another incident report in Bugzilla referenced in Let's Encrypt's: https://bugzilla.mozilla.org/show_bug.cgi?id=1715455#c1 references https://bugzilla.mozilla.org/show_bug.cgi?id=1708965

KIR S.A. is another CA that issued certificates with the same one second issue (1 year + 1 second instead of 1 year) and reported that one month ago.

Certificates have been mis-issued. No ifs, ands, or buts, that by itself means revocation has to be on the table.

In practice, it's a small problem and revocation is an excessive response, but those are conclusions you reach after duly investigating the issue, not before.

I can't see an immediately obvious attack that would be enabled by having an extra second of certificate validity. I can think of a place where knowing this value correctly would be useful: automated expiration monitoring and unit tests.

IMHO they must revoke them, because signing incorrect data affects their reputation.

> Grzegorz Prusak

> Comment 5 • 6 hours ago

> I do not understand the math behind the mentioned problem.

> Is either the RFC or BR really specifying a DATE of 1s precision to be an INTERVAL rather than a POINT IN TIME? It specifies the granularity of points allowed to be endpoints of a validity period, but these are still POINTS AFAICT.

> It means that, for example 10:00:01 is AFTER 10:00, even though 10:00:01 is not a valid endpoint of a validity period (and it doesn't have to be, as it is a measurement result, not a validity period endpoint).

> Does the RFC or BR require a browser to round down the measured current time to full seconds before comparing it against the validity period?

As a random member of the public, this seems like a reasonable interpretation to me... though the time format which expiry dates are required to be expressed in is explicitly not allowed to include fractional seconds... so maybe it is meant as an interval not a point in time.

RFC 5280 says:

> The universal time type, UTCTime, is a standard ASN.1 type intended for representation of dates and time. UTCTime specifies the year through the two low-order digits and time is specified to the precision of one minute or one second. [...] For the purposes of this profile, UTCTime values MUST be expressed in Greenwich Mean Time (Zulu) and MUST include seconds (i.e., times are YYMMDDHHMMSSZ), even where the number of seconds is zero.

> For the purposes of this profile, GeneralizedTime values MUST be expressed in Greenwich Mean Time (Zulu) and MUST include seconds (i.e., times are YYYYMMDDHHMMSSZ), even where the number of seconds is zero. GeneralizedTime values MUST NOT include fractional seconds.

In other words, the RFC specifically requires the granularity of the time to be one second, and notBefore and notAfter are worded such that the certificate is valid for all t in the range notBefore ≤ t ≤ notAfter. Computing the number of seconds of validity as notAfter − notBefore is a classic fencepost error.
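The fencepost is easy to demonstrate with plain datetime arithmetic (a sketch using Let's Encrypt's intended 90-day lifetime):

```python
from datetime import datetime, timedelta

not_before = datetime(2021, 6, 8)
not_after = not_before + timedelta(days=90)   # encoded difference: 7776000 s

# Naive subtraction counts the gaps between the endpoints...
print((not_after - not_before).total_seconds())   # 7776000.0

# ...but with an INCLUSIVE range, every whole second from notBefore
# through notAfter is valid: one more fencepost than there are gaps.
valid_seconds = int((not_after - not_before).total_seconds()) + 1
print(valid_seconds)                              # 7776001
```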

Yes, but the relevant portion of the RFC for inclusive time ranges is

> CAs conforming to this profile MUST always encode certificate validity dates through the year 2049 as UTCTime; certificate validity dates in 2050 or later MUST be encoded as GeneralizedTime. Conforming applications MUST be able to process validity dates that are encoded in either UTCTime or GeneralizedTime.

> The validity period for a certificate is the period of time from notBefore through notAfter, inclusive.

The way I'm reading this, a conforming application must first decode the date to a point in time, and then must perform an inclusive check against that decoded point in time.

The certificate expiring at 10:00:00 is not valid at 10:00:00.0001, the restrictions on the time format are just saying you can't make a certificate that is valid at 10:00:00.0001 and not valid at 10:00:00.0002.

So the fencepost error "exists" here, but its size is the limit of the granularity as the granularity goes to 0, which is 0...

In fact, your post makes it clear that a certificate expiring at 09:59:59 presents a problem too; there's a whole second where checks may be invalid before 10:00:00. This would depend on the browser implementation which is not specified in the RFC (I checked). Hopefully they all do truncation of fractional seconds.

The easiest fix, though, is for Let's Encrypt to say their certs are valid for 90 days and a second.

If that's the case, the previous implementation was correct and they'll start issuing certificates with 90 days - 1 second now.

RFC 5280 further says: [1]

> The validity period for a certificate is the period of time from notBefore through notAfter, inclusive.

It says nothing about the comparison precision, just the storage format. It says CAs MUST encode validity dates as UTCTime / GeneralizedTime. It does not say this encoding must be used for comparison.

So I don't think it's actually defined if 2020-01-02 03:04:05.01Z is actually <= 2020-01-02 03:04:05Z.

That would mean this is not a 1 second-mistake but instead an infinitesimal mistake.

[1] https://datatracker.ietf.org/doc/html/rfc5280#section-4.1.2....

A thought experiment.

Pretend I told you that you could get 50% off a meal at my restaurant from Monday through Friday, inclusive.

Presumably you would expect that if you visited my restaurant on a weekday, the offer would apply; fine to arrive on Monday morning or Friday evening, but not on Saturday morning. In that case, the "resolution" of the parameters of the offer is "a day".

In this case, the resolution is "a second", but the same meaning of inclusive applies: the certificate is valid up until the end of the second specified as the notAfter time.

The comparison precision you can achieve is irrelevant; the required semantics (though odd) are defined by the RFC.

Days are intervals, not points in time. Traditionally a day is the time from sunup to sundown; in modern contexts it usually runs from midnight to midnight.

This is the question about timestamps: are they points in time, or are they second-long intervals? I think the logical reading is the first.

This opens up the question of "what the hell is the word inclusive doing there". I think the answer is that it is being used to say "if you don't know whether you're inside or outside the interval, you are inside". If your clock runs at a 2-second granularity, and you know it's somewhere between 9:59:59 and 10:00:01 when the certificate expires at 10:00:00, you take the inclusive definition and say "still valid", because otherwise you might be rejecting a valid certificate.

It is perhaps also being used to say that both ends of the interval aren't open. If I have one certificate that expires at 10am, and another that becomes valid at 10am, then there is no gap between their validity.

I guess my point is that you can only give the word "inclusive" any meaning if you accept that the authors of the RFC intended the validity times to be interpreted as intervals rather than instants, however unusual that is.

"Inclusive" still has a meaning in this interpretation. It means that the certificate is valid on the moment of the end date, as opposed to ceasing to be valid on that moment. That is what they mean by "both ends of the interval are closed".

Yeah; I suppose you could look at it from the real analysis point of view! Terrible spec, basically.

I can't seem to find anywhere in the RFC that the resolution is indeed a second. The actual format for storing the time indeed has a resolution of a second, but where exactly in the RFC does it say that a certificate with notAfter 2020-01-01 00:00:05 is valid at 2020-01-01 00:00:05.22?

I too find the reasoning odd. I work extensively with times in my day job. If a time period starts at 2021-06-10T22:23:00Z and finishes at 2021-06-10T22:23:01Z then that represents one second. For it to represent two seconds, the end time would be 2021-06-10T22:23:01.999...Z, which is the same as 2021-06-10T22:23:02Z. If the RFC does not explicitly state that this is the desired interpretation, I find it odd to assume this is the case.

My local bar has $1 bottled beer for some brands on Tuesdays. It is advertised as being for Tuesdays. Their weekday hours are 11am - 1am. If I show up at 12:30am Tuesday morning I don't get the deal, because it is considered part of Monday's service. If I show up at 12:30am Wednesday I get the $1 beer, because it is considered part of Tuesday's service.

Your bar sounds like children saying it's still weekend because they didn't sleep yet.

Maybe you should explain to them how a clock works? :)

I disagree.

I think they intended your reading, but if I attempt to read it as strictly as possible I'd say they have defined a representation of a particular set of points in time, not intervals, and the "inclusive" in the definition of notAfter tells us that the interval is closed, not that it extends to the beginning of the following second.

Is Ryan Sleevi in https://bugzilla.mozilla.org/show_bug.cgi?id=1715455#c1 really suggesting that Let's Encrypt revoke 185 million certificates forcing people to get new ones because the certificates have a lifetime that is 1 second too long?

1 measly second is going to break WebPKI?!

That's not my take. He's just saying that the linked document states

> If your CA will not be revoking the certificates within the time period required by the BRs, our expectations are that:

> The decision and rationale for delaying revocation will be disclosed to Mozilla in the form of a preliminary incident report immediately; preferably before the BR-mandated revocation deadline. The rationale must include an explanation for why the situation is exceptional. Responses similar to “we do not deem this non-compliant certificate to be a security risk” are not acceptable. When revocation is delayed at the request of specific Subscribers, the rationale must be provided on a per-Subscriber basis.

However, Let's Encrypt's report neglected to do any of this, so he queried them about it.

You are supposed to be able to revoke the certificate you misissue.

Let's Encrypt automated certificate creation, which resulted in a very large number of certificates being issued. But at the same time, it should be possible to revoke them all.

If the revocation mechanism doesn't scale as much as the creation, this is a huge problem.

This time, the misissued certificates probably won't cause large issues. But they are misissued nevertheless and should be revoked.

If they are not revoked because it would be too complex and would break the WebPKI, then what will happen when a larger issue happens that requires revocation?

> If they are not revoked because it would be too complex and would break the WebPKI, then what will happen when a larger issue happens that requires revocation?

As here, the severity of the issue should be weighed against the severity of revocation.

"Half the internet needs to manually renew some certs early" is a reasonable approach if there's something risking exposure of sensitive data. It's not for "the certificate is valid for an extra second".

By the way, following an earlier issue that did result in revocations, Certbot added logic where the certificate will automatically be renewed early if an OCSP check shows it is no longer valid. That means that early revocations may not cause outages for sites that are successfully using automated renewal with Certbot.

Of course, those sites are a minority and possibly not even a plurality, but in theory revocation can be made a smaller burden over time as more sites are using automation successfully.

I also know there are tons of Let's Encrypt users who are still using a completely manual workflow. Some of them just don't understand some part of the automation concept, while others have devices and/or operating systems where they simply can't automate certificate renewal, because they couldn't install any software that would facilitate it.

Doesn’t that mean the cert is already invalid by the time certbot notices, so you already have an outage? And if certbot is running daily then that outage could be up to 24 hours.

There was an idea we discussed of "intent to revoke", but it ended up looking like a quite a bit of work for a small benefit, so the situation is as mholt describes. :-)

For this specific reason, Certbot is recommended to run twice per day, and that is the default behavior in OS Certbot packages as well as the more recent official snap package.

Further to mholt's point, not all clients will enforce OCSP at all, while if you're using stapling, the client is being told about the last-valid OCSP response, so the outage would probably be a fairly small minority of clients in either case.

I know of at least one OS Certbot package where the check is only done weekly--FreeBSD. Would be interested to know if the twice-daily approach is the default in other OS packages.

Wow, thanks for pointing this out. Do you know who maintains this for FreeBSD, by any chance?

Not if the last valid OCSP response is stapled to the certificate.

Of course, clients can ignore that and use their own revocation lists/logic, which sometimes get pushed faster than OCSP responses. (Or clients can just block any certificates/CAs they want to, really.)

Technically, yes. However, if the certificate is used with HAProxy or any other solution that serves a stapled OCSP response from some kind of cache (e.g. a file updated by cron), it is possible to avoid the outage completely by fetching a fresh OCSP response every day and not updating the on-disk copy if it says "bad". This way, the old but still-valid copy is served to customers while the system takes the necessary steps to renew the certificate.

> Certbot added logic where the certificate will automatically be renewed early if an OCSP check shows it is no longer valid.

If anything mainstream actually checked OCSP, this seems like you'd still be effectively making the old cert unusable before the new cert was issued, which is unfortunate.

On the plus side, I don't think OCSP ever went mainstream.

OCSP is stapled automatically by Caddy, for example. It gets refreshed halfway through the response's validity period, so it will know well ahead of time if a certificate is revoked, allowing it to replace the certificate for you automatically. No downtime, no unusable certificate, etc, because even while it serves the revoked certificate, the OCSP response that is stapled to it shows valid for some remaining amount of time.

Of course, clients can do whatever they want to enforce OCSP, including pushing their own revocation lists and ignoring OCSP staples. Sigh.
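The halfway-refresh timing described above is simple to sketch. This is an illustration of the scheduling arithmetic only, not Caddy's actual code, and the one-week response window is an assumption:

```python
from datetime import datetime, timedelta

def next_refresh(this_update: datetime, next_update: datetime) -> datetime:
    """Refresh a stapled OCSP response halfway through its validity window,
    so a 'revoked' status is seen well before the old staple stops being
    servable."""
    return this_update + (next_update - this_update) / 2

this_update = datetime(2021, 6, 8, 0, 0)
next_update = this_update + timedelta(days=7)   # assumed responder window
print(next_refresh(this_update, next_update))   # 2021-06-11 12:00:00
```

The slack between the refresh point and the staple's expiry is what gives the server time to obtain a replacement certificate with no downtime.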

Thankfully, there seems to be guidance that allows for common sense to intervene.

"Mozilla recognizes that in some exceptional circumstances, revoking the affected certificates within the prescribed deadline may cause significant harm, such as...when the volume of revocations in a short period of time would result in a large cumulative impact to the web."

He does have a good point that something quite similar happened a month ago[1] and should have triggered a deeper dive that would have exposed this as well. You could interpret that as LE needing some better discipline around process.

[1] https://bugzilla.mozilla.org/show_bug.cgi?id=1708965

Certificates were mis-issued. Even if the answer is "no, definitely not", the question of whether to revoke them has to be asked.

> We will further improve our codebase […]. This work will be completed by 2021-09-09 (inclusive).

I chuckled.


Wouldn't the bugzilla entry [0] be a better choice, since the Discourse community entry just points there? The preamble by Josh (ISRG Executive Director) is literally just a copy of the first paragraph from the bug, no additional data.

[0]: https://bugzilla.mozilla.org/show_bug.cgi?id=1715455

I think Bugzilla is more likely to become unhappy with large numbers of visitors than Discourse would be.

This way all the readers who think "one extra second, I don't care" and close the tab won't contribute to the problem.

First fastly blows up, now my certs are lasting an extra second?! OpEng is off for the week or something?


Not sure what impact this would have though.

I also wonder this.

Sounds like one of those tricky technicalities

For fun, I thought of a bad edge case: you create a certificate exactly 90 days before the end of the year (so its notAfter date is 2022-01-01T00:00:00Z) BUT a leap second falls in that year. Then it will last for 90 days + 1 second.

Which makes me wonder if other certificate authorities handle this scenario.

They actually mentioned that they use 90 days as defined by 2160 hours, so that would be a non-issue.

It depends if the certificate itself contains an absolute timestamp for the expiration time, or a validity period.

Certificates contain a pair of absolute times in a Gregorian representation. The ASN.1 time format does support leap seconds (in the usual way, e.g., a time of 23:59:60) but I don't know if X.500 or PKIX restricts that.

People who work with x.509 sure do make a lot of work for themselves

Ignorance is bliss. Most projects and even many formal standards don't even think to consider such questions, let alone try to answer them.

X.509 builds on ASN.1, which incorporates ISO 8601, which in turn incorporates IEC 60050-111. Unfortunately, as elucidated in this thread, the relevant portion of X.509 is still underspecified--it uses the term "inclusive" without resolving ambiguity related to intervals and granularity, which it could have done by using formulas provided by ASN.1, ISO 8601, or IEC 60050-111.

There are two hard things in computer science: cache invalidation, naming things, and off-by-one errors.

Can someone ELI5 what the actual impact of this is? Are there security concerns I'm not seeing here?

As far as I can tell there's zero real world impact here, I think they just want to maintain a stellar track record for any reported bug that would affect the certificate issuance in any way.

Basically, had it been a second, a day, a month, doesn't matter - they still treated it seriously. That sort of thing goes a long way towards building trust.

Trust with whom? To me acting like an automaton lowers trust.

The suggestion to invalidate millions of certs over a second longer validity sounds like terrible judgement.

I want to trust that when a CA has a certain written policy, they think it's important to stick by that policy, and they have plans to stick by the policy.

For instance, Symantec had a policy that they validate their subscribers before issuing a certificate. What they actually did was validate their subscribers before issuing, unless they were testing things out. In 2015, it was found that they tested out a google.com certificate, and they fired the employees involved in the incident: https://archive.is/Ro70U

Two years later, it was found that they tested out certificates like "example.com", "test.com", etc.: https://bugzilla.mozilla.org/show_bug.cgi?id=1334377

At no point in either incident were those certs outside of the control of Symantec employees. Still, they lost a lot of trust (and ultimately their CA was marked untrusted) because they did not fix their problems: https://wiki.mozilla.org/CA:Symantec_Issues

So apparently letting people use human judgment, firing people who misuse it, and hiring different people with hopefully-better human judgment is not the way to be trustworthy.

(To be clear, I think it's extremely reasonable for them not to revoke the certificates, but I think it's good and important that they're following the procedure which requires them to make an explicit decision not to revoke them in consultation with the community.)

acting like an automaton would mean that they just recall all certificates, which they won't do. but thinking and perhaps adapting for the future, they do. and that's the right thing

computers are about exactness. if they do not value that and inspect it thoroughly, i would not trust them for security.

Trust extended by the root program maintainers (who serve as a proxy for you, the user, and should make decisions in your interests) to the CA operators (who all too often make decisions that are good, in a prisoner’s dilemma sort of way, for the certificate holders but are terrible for you). This is meant to correct the broken incentive structure of the CA model where the people who pay the CAs are not the people who consume the results of the CA’s work.

(How much the root program maintainers can be relied on to represent your interests varies, but all other alternatives so far seem even more terrible.)

Nobody is seriously suggesting that revoking every LE certificate is the proportionate response here. But in the background for all of this is the fact that CAs historically have resisted revocation as a remedy for practically every misissuance or security incident, no matter their severity. They also frequently talked as though revocation was not only not considered, but did not even come to mind when the incident was recognized.

Arguments from lack of security impact or the number of “affected customers” (who are not, in fact, the people who the CAs exist to provide a service to, to reiterate the incentive problem) were used to argue against almost everything, including eliminating shady sub-CAs, reducing astronomical maximum validity periods, and prohibitions against broken security schemes, even after the corresponding decisions were confirmed by a vote of the CA/Browser Forum and publicly announced by the programs. (In fact, there’s an open Mozilla bug right now where Google’s Ryan Sleevi is lambasting Google Trust Services for cross-signing a historical SHA-1 CA using a currently trusted root.)

This is why Mozilla’s policy is absolutely merciless regarding revocation and incident reports. In an issue like this, where there is, in fact, no actual security impact, it is expected that the CA will suck it up and file an additional report saying that yes, on the balance we aren’t going to revoke, and here’s our reasoning (because again, bare statements or restatement of conclusions masquerading as justification is the norm for CA communications, as can be seen from the current compliance bugs on the Mozilla tracker). For every decision not to revoke, there needs to exist at least perceived harm to the CA(’s reputation with the root program), because history shows that otherwise nobody revokes anything.

(Browse the relevant category of the Mozilla tracker, mozilla.dev.security.policy, or the CA/B Forum mailing lists if you want a chilling read. This is what finally turned me off DNSSEC+DANE, because if people for whom PKI and key management is literally the only job are that bad at it, I don’t even want to imagine how bad domain resellers are going to be, and unlike CAs you can’t just toss your DNS delegation out and get a new one in a couple of minutes.)

Thus the revocation part of this issue is posturing, but it’s posturing that both sides recognize and have decided to accept as the only way of ensuring the WebPKI remains somewhat functional.

The operations part that’s going to be discussed in the linked bug itself (and not the so far nonexistent no-revocation report), on the other hand, has immediate importance to LE’s operations, because it means that there were no additional controls beyond the issuing software itself that were enforcing LE’s declared policies, and that’s just too brittle a design to work with. The policies as declared in the Certification Practice Statement are (intentionally meant to be) rigid and hard to change, whereas people will routinely reconfigure or even modify the issuing software, and somebody, sometime, will make a fumble in the certificate template or check the wrong verification box or even mistakenly enter a command to issue a sub-CA from prod. That’s not a problem, but not preventing it from causing a misissuance is.

The accepted remedy is to run independent linters (plural, because bespoke tooling integrating them into the pipeline has mistakes as well) configured to enforce both the common Baseline Requirements and the particular CA’s CPS and left untouched until there’s a (carefully reviewed) change to those. It seems that LE’s setup failed in this regard, because while of course Boulder can and will have the occasional off-by-one bug, it’s highly unlikely that ZLint or Cablint or whatever has the same one, so somebody configured them wrong.

They say they can issue 200 million certificates in 24 hours: https://letsencrypt.org/2021/02/10/200m-certs-24hrs.html

That's great, but it doesn't mean 200 million certificates will actually be renewed in 24 hours.

Also gonna be a heck of a CRL.

But only until the affected certificates expire.

Which will be finished in only 90 days' time; containing the fallout of incidents like this was part of the driver for the short validity period in the first place.

The security concerns of this particular bug are essentially zero. The meta question is if there are other related bugs that may not have been caught. We should stamp out bug classes, not individual bugs.

It's a brown M&Ms sort of situation. It's a low-impact situation, but the appropriate response is to audit how the mistake was made and figure out what failed for it to slip through — which might lead to insight into other latent problems.

There is zero actual security impact.

A classic fencepost error. SQL users should be aware of this with the BETWEEN operator: it selects values inclusive of both stated endpoints, which may or may not match your intuitive understanding of what "between" means.
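A quick demonstration of the BETWEEN trap (using Python's built-in sqlite3 as the SQL engine; table and values are made up for illustration):

```python
import sqlite3

# BETWEEN is inclusive of both endpoints, the same fencepost trap
# as the inclusive notBefore/notAfter pair in certificates.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (n INTEGER)")
con.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(10)])

inclusive = con.execute(
    "SELECT COUNT(*) FROM t WHERE n BETWEEN 3 AND 6"
).fetchone()[0]
print(inclusive)  # 4 rows: 3, 4, 5, 6 -- both endpoints included

# A half-open range (>= lower AND < upper) avoids the ambiguity:
half_open = con.execute(
    "SELECT COUNT(*) FROM t WHERE n >= 3 AND n < 6"
).fetchone()[0]
print(half_open)  # 3 rows: 3, 4, 5
```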

I’ve never heard this described as a fencepost problem before, but it's perfect.

I clicked on the CAB bugzilla link thinking it was not going to become a big deal, but surprisingly it has. There's also a link to a bug from a month ago for another CA whose certs lasted one second longer, that also became a big deal.

In my opinion the RFC should be changed from "notBefore ≤ t ≤ notAfter" to "notBefore ≤ t < notAfter".

Shouldn't that be:

"notBefore ≤ t ≤ notAfter" to "notBefore ≤ t < until"


Those are the names of the fields in RFC 5280:

Validity ::= SEQUENCE { notBefore Time, notAfter Time }


Making "notBefore" inclusive but "notAfter" exclusive is inconsistent and IMHO confusing. If the end time should be exclusive then it should be called "until" instead of "notAfter".

It's only confusing because you interpret notBefore and notAfter as one-second time intervals when they really should be instants at the start of that second.
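To make the two readings concrete, here's the arithmetic (dates are illustrative):

```python
from datetime import datetime, timedelta, timezone

nb = datetime(2021, 6, 8, 0, 0, 0, tzinfo=timezone.utc)
na = nb + timedelta(days=90)  # how the buggy issuance computed notAfter

# Reading notBefore/notAfter as instants at the start of their second,
# the half-open window [nb, na) is exactly 90 days long:
print(na - nb == timedelta(days=90))  # True

# But RFC 5280 (and the Baseline Requirements) count both boundary
# seconds as valid, so the inclusive window [nb, na] contains
# 90 * 86400 + 1 = 7,776,001 distinct valid seconds:
inclusive_seconds = int((na - nb).total_seconds()) + 1
print(inclusive_seconds)  # 7776001
```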

Isn't GMT a time zone (mean solar time at the Royal Observatory in Greenwich) that can differ from UTC by up to 0.9 seconds?

So since the RFC specifies GMT, is it already up to 0.9 seconds off?

Tom Scott has made a video: https://www.youtube.com/watch?v=yRz-Dl60Lfc

According to Encyclopaedia Britannica:

> Greenwich Mean Time (GMT), the name for mean solar time of the longitude (0°) of the Royal Greenwich Observatory in England

> ...in 1928 the International Astronomical Union changed the designation of the standard time of the Greenwich meridian to Universal Time. Universal Time remains in general use in a modified form as Coordinated Universal Time (UTC), which serves to accommodate the timekeeping differences that arise between atomic time (derived from atomic clocks) and solar time.

This issue is NOT critical to my operations and is, in fact, irrelevant to them. I can NOT speak to any other users of letsencrypt. So I'm happy this was raised because maybe there is another, seemingly harmless incident that MAY impact my operations that will be brought up in the future.


Not related to this issue but I think that the expiration of LetsEncrypt's old root certificate "DST Root CA X3" on 21 Sep 2021 may cause quite a few problems. Old embedded systems may not have a way to update their trust anchors (like browsers do). Obviously not LetsEncrypt's fault but worth being aware of.

1 second. Who cares?

This is an outrage! I demand my money back!

off-by-1 error... ooops

It will all be fixed in 90 days, so revoking is pointless for all but the edgiest of edge-cases.

Is this a joke?

Just leave the extra second on. Any software release carries risk, so you shouldn't ship an update unless it's worth that risk.

The time they spent investigating and updating is time wasted. Anything else they could have done would have been a better use of that time.

The time I spent writing this comment was a better use of time. And this comment is really not important.

You'd be surprised how often something that does not seem like a big deal is some kind of deal after all, sometimes even a big one. Software is complex.

So far, all signs point to this not being a big deal (EDIT: from the software side), but that is contingent on the aforementioned investigation. Others have also mentioned how it is a matter of principle: Certificates were misissued, where do you draw the line if not at that general fact?

This isn't a software complexity issue; the only reason we're talking about this is the sensitivity of the root trust store process.

CAs are required to publish Certification Practice Statements (CPSs), which tell the world at large the parameters under which they issue certificates.

Here's the CPS for Let's Encrypt that was in effect when this problem was detected:


These CPSs are used by the various browser/operating system developers to make decisions about which CAs should be included in their root trust stores; i.e. basically whether you're a CA accepted by "the internet".

Let's Encrypt's CPS said one thing (we issue certificates valid for 90 days), but their implementation did a different thing (issued certificates valid for 90 days + 1 second).

An inability to adhere to your CPS is a big deal - it is effectively an internal controls failure; and self-reporting of failures to adhere to your CPS is essential. The owners of the root trust stores are going to want to see seriousness from the CAs when it comes to adherence to their CPS, even in apparently de minimis cases.

e.g. refer to Mozilla's policy guide and its reliance on the CA CPS:


No-one is suggesting that a one second difference in validity makes any difference in the real world, and Let's Encrypt has already uprev'd their CPS to make it clear that actually all they promise is that certificates will be valid "less than 100 days":


Thanks, I understand, but I wanted to address a broader point about software, which the comment I replied to talked about. To repeat: Sometimes software, or things that interact with software (which is minimally what this is), has issues that may seem small, but end up big because of complexity. This turned out not to be one of them. But it would be a consideration in an investigation, even though (or so I guess) it was likely straightforward to dismiss it quickly in this particular case.

Nevertheless, the time to investigate potential repercussions of something so critical is not "wasted".

I updated my comment above to clarify that this does not seem to be a big deal so far from the software side.

This is yet another reason why we need to get rid of the whole accursed trusted CA + browser disaster.

It does literally say ‘90 days’. Pedantically, if that's read as a physical measurement, it's precise to plus or minus half a day, which more than covers the actual state. Perhaps Mozilla also erred in accepting a statement that said ‘90 days’ rather than ‘7776000 seconds’.

The X.509 RFC and its relevant substandards define a day to be exactly 86400 seconds long. Not longer, not shorter. There is no half-a-day tolerance in there, only a tolerance of less than a second (which LE exceeded by hitting an entire second).

This is not the approach you can take to writing software that the security of the entire internet relies upon.

Previously: https://news.ycombinator.com/item?id=20573429

Serial numbers that should have 64 bits of randomness only had 63.

Each bit doubles the range of values, so losing one only halved the search space. If the error had been that certs were issued for 180 days instead of 90, I'd agree that's a problem.
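A sketch of how that kind of bug tends to happen (a hypothetical reconstruction, not the actual CA's code): asking for 64 random bits but then forcing the serial to be a non-negative signed 64-bit integer clears the top bit, leaving 63 bits of entropy, i.e. exactly half the intended space:

```python
import secrets

def serial_64_bits() -> int:
    # Intended behavior: 64 fully random bits.
    return secrets.randbits(64)

def serial_63_bits_by_accident() -> int:
    # Hypothetical bug: treating the 8 random bytes as a non-negative
    # signed 64-bit integer clears the sign bit, so the top bit is
    # always 0 and only 63 bits remain random.
    return secrets.randbits(64) & 0x7FFF_FFFF_FFFF_FFFF

full_space = 2 ** 64
buggy_space = 2 ** 63
print(full_space // buggy_space)  # 2 -- losing one bit exactly halves the space
```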
