Microsoft failed to rotate certificates for winget CDN on time (microsoft.com)
111 points by yjh0502 on Feb 12, 2023 | 67 comments



I think the default certificate expiration time (2 years) is a terrible idea. It's long enough that there's a good chance whoever registered the cert last time has left the team or the company. It's long enough that I've forgotten how to generate a certificate with openssl on the command line. And it's long enough that each time, I (and everyone else) can justify not bothering to automate the process.

But 2 years is still short enough that if you have a couple domains, remembering to renew them is an ongoing hassle!

Let's Encrypt certificates last 90 days, and they recommend renewing them every 60 days. This is a much better duration, because it encourages the entire ecosystem - developers and admins - to set up processes which automate renewal. And if the automated renewal process fails, Let's Encrypt starts emailing you about it to let you know your certificate is about to expire. (And you have enough time to fix it.)

https://letsencrypt.org/2015/11/09/why-90-days.html


The maximum expiration time is now down to 13 months, for certs that need to be valid in a browser. And if you want to cycle yours more frequently, you can. But there's enough places that can't set up automated processes that trying to make it 90 days for everyone would be a lot of pain and a lot of broken sites.


> But there's enough places that can't set up automated processes

Why can't they be automated?

And anyway, this is the exact problem that short expiration times avoid! Systems that aren't set up for automation, and rely on someone once a year remembering some creaky, error prone process to get a new cert. Much better to force short expiration times so manual cert renewal is a thing of the past.


> Why can't they be automated?

E.g. because of regulatory requirements, chain of responsibility, a paper has to be signed with a pen, etc.


Interesting, but that sounds like speculation. Do you have any examples of regulatory requirements, as opposed to voluntarily-broken internal processes?


The people installing the certs aren't necessarily the people buying the certs. I can't do anything to automate the cert purchases at my current workplace; that's a separate team that I have no control or influence over.

Shorter expiration times just mean they send me an email with the new .pfx every 4 months instead of every 12.


If they had to do that every 2 months instead of every 12, they might get tired enough of it to fix their broken process.


Several of the appliances we manage have certificates that are installed using a web GUI and require a reboot with a 15-minute outage for the change to take effect. We've looked at automating some things, but there's only so far I want to go down the rabbit hole of headless Chrome vs. manually installing a cert yearly.


It seems like enforcing faster rotation would do a lot to encourage people and companies to move away from such obtuse platforms, no?


DV is not the only kind of certificate validation. I don't want to have to go through the OV/EV validation process several times a year, nor to validate 4 certificate issuances a year in advance.

But if I wanted to, I could do so even now without being forced - request a new certificate during its validity period, and revoke the former one.


DV is the only kind that actually matters. Browsers do not display EV certificates in the address bar anymore, the verified identity is hidden in a panel or sometimes even invisible. If you want to pay extra for snake oil, you get to enjoy all the pain in the process. See also: https://www.troyhunt.com/how-everything-were-told-about-webs...


Google made quite a few questionable changes in Chrome (with the rest feeling forced to follow the fashion set by Chrome), and not displaying EV info is one of them. Many big organisations use tens of domains, some of which look very suspicious. Information in an EV/OV cert is often the only way to establish that a domain is operated by the legitimate company (and not by a phisher who registered a similar-looking domain).


Google and Firefox made the change roughly at the same time -- because there was a lot of evidence that EV indicators simply don't work. Users don't pay attention to them, and even if they did, the idea that company names are unique - even within a jurisdiction - is simply incorrect.

The only upside of EV certificates is that the PKI companies can seek a higher rent.


Even if a higher price is the only EV difference (which is not exactly the case), it would be enough to make sites with EV certs much less likely to be used in phishing - threat actors want to keep their costs down because they frequently register a lot of domains (many more than most legit companies). And even if company names are not unique, good luck registering PayPal Inc or Bank of America Corporation to get an EV cert for your phishing site.


I don’t understand. Why would phishing attacks bother getting EV certificates? Users can’t tell the difference in modern browsers.


Depends on who the user is. I hope a typical HN user can find a way to view certificate information even in a modern browser.

The problem is - on the modern internet it is very hard to find out who is behind a particular domain: NS/A records often point to a CDN or a cloud, info in whois is hidden, and all you can see is 'Private'. An OV/EV cert is often the only way to know that a domain like acmecorp-invoices.com is used by the same company as acmecorp.com and is not phishing (registering a domain similar to the main company's domain is a bad but not uncommon practice).

One of the reasons to get an OV/EV cert is to avoid your domain being listed as phishing - if you give a security expert no hints that your suspicious-looking domain is legit and not an impersonation, there is a risk that it will be blocked.

Phishers, on the other hand, practically never use OV/EV certs (probably because they know there is little to no chance they'll get a cert with the target company's name in organizationName).


> And it's long enough that each time, I (and everyone else) can justify not bothering to automate the process

And even worse, if you do automate it there is a pretty good chance something changes and breaks your automation by the time it is needed. And that is assuming you actually tested the automation before your new cert is close to expiring.


I solve this by certificate expiration monitoring and renewing the certificate at the 60 day mark.

The expiration warning is configured so that it starts to yell at me if it passes that timeframe.

That gives me plenty of time to fix it IF it goes wrong.
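A minimal Python sketch of that kind of check, for illustration only. It assumes 90-day certificates renewed at day 60, so a healthy certificate always has more than 30 days left; the threshold and the function names are made up for this example, not taken from any particular monitoring tool.

```python
import socket
import ssl
from datetime import datetime, timedelta, timezone

# Assumption: 90-day certificates renewed at day 60, so a healthy
# certificate should always have more than 30 days of validity left.
ALERT_THRESHOLD = timedelta(days=30)


def fetch_not_after(host: str, port: int = 443) -> datetime:
    """Connect to host and return the leaf certificate's expiry time (UTC)."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    seconds = ssl.cert_time_to_seconds(cert["notAfter"])
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


def renewal_overdue(not_after: datetime, now: datetime) -> bool:
    """True when the remaining lifetime has dropped below the threshold."""
    return not_after - now < ALERT_THRESHOLD
```

Something like `renewal_overdue(fetch_not_after("example.com"), datetime.now(timezone.utc))` could run from a scheduler. If the certificate has already expired, the verified handshake in `fetch_not_after` raises an `ssl.SSLCertVerificationError` instead, which is worth treating as an alert too.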


In your case, "something breaks in your automation" might mean that, by the time the cert is (about to be) in need of renewal, the notifications you set up are now going to an email account that doesn't exist any more, because you left, the department got re-orged, and...


If "monitoring" is set up as "send email to specific personal mailbox" then things are gonna suck a lot.


I was more imagining that it was going to an email account called e.g. "devops@"; and then when you left (possibly very quickly, e.g. via termination), the need for the continuity of that account was forgotten, because it had never become institutional knowledge; and then during a reorg, a new group (= email distribution list) was created to match the new department name; with nobody remembering to forward devops@ to the new address, because it wasn't receiving emails anybody needed to see on any sort of consistent basis, only these once-in-a-blue-moon emails.


That's one reason why nobody who knows what they're doing relies (solely) on email for monitoring. You use a monitoring solution, and if that whole system gets "forgotten" then there are bigger problems anyway.


The same can happen if you assign it to the team's mailbox, reorganizations happen at all levels.


Of course it "can happen". Anything "can happen". It's about mitigating risk by picking a strategy. In this case, from worst to better: personal mailbox -> team mailbox -> an actual monitoring solution and not friggin emails.


And how do you ensure your monitoring keeps working?


Alert on missing data. Keep a continuous stream coming in.
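As a rough Python sketch of "alert on missing data": each check reports a heartbeat, and the alert fires on silence rather than on an explicit failure message. The check names and the 24-hour limit are assumptions for the example.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def stale_checks(last_seen: Dict[str, datetime], now: datetime,
                 max_silence: timedelta) -> List[str]:
    """Return the names of checks that have been silent for too long."""
    return sorted(
        name for name, seen in last_seen.items()
        if now - seen > max_silence
    )


# Hypothetical heartbeat table: check name -> time of last report.
now = datetime(2023, 2, 12, 12, 0, tzinfo=timezone.utc)
heartbeats = {
    "cert-expiry": now - timedelta(hours=30),  # has gone quiet
    "disk-usage": now - timedelta(minutes=5),  # still reporting
}
silent = stale_checks(heartbeats, now, max_silence=timedelta(hours=24))
```

Here `silent` contains only `"cert-expiry"`, so a check that stops running gets noticed even though it never sent an error.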


Set up 365.25 domains and have one renew every day!


>>to set up processes which automate renewal.

That is all fine and good for things that have the ability to automate that process; plenty of hardware and devices do not. Some are not even legacy and are still actively being sold and developed.

It is also not good for internal networks where you cannot validate out to something like Let's Encrypt to automate that validation process. Sure, you could do your own internal PKI and run your own CA for that, but......

In my current org 60 days would be a NIGHTMARE to manage.


> It is also not good for internal networks where you cannot validate out to something like Let's Encrypt to automate that validation process. Sure, you could do your own internal PKI and run your own CA for that, but......

Or you can set up certbot or similar on a public-facing server (or something that can add DNS records for your domain), and use a secure channel to send the private keys to the things that need them.

I would like to see more of a push to make setting up an internal CA a lot easier though, because that is probably the most correct way to handle that.


>It is also not good for internal networks where you cannot validate out to something like Let's Encrypt to automate that validation process

Why not? Just use DNS validation.


Yep, I do this for internal names, works great. I've used acme.sh to update the names in a public zone that is isolated from the rest of the zone and has its own unique AWS credentials to update via Route53.


I like to renew certificates long before expiry. At my job certs last a year but I automate renewal after 5 months. If you find a certificate that is older than 6 months you know something is wrong (long before expiry).
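The age-based rule can be sketched in Python like this. The 150 and 180 day cut-offs approximate "5 months" and "6 months" and are assumptions, as is the function name.

```python
from datetime import datetime, timedelta, timezone

# Assumed approximations of the "5 months" and "6 months" marks.
RENEW_AFTER = timedelta(days=150)
STALE_AFTER = timedelta(days=180)


def classify_by_age(not_before: datetime, now: datetime) -> str:
    """Classify a certificate by how long ago it was issued."""
    age = now - not_before
    if age >= STALE_AFTER:
        return "stale"   # renewal should have happened already: investigate
    if age >= RENEW_AFTER:
        return "renew"   # inside the normal renewal window
    return "ok"
```

With a one-year certificate, a "stale" result shows up about six months before the certificate would actually expire, which leaves plenty of room to fix the automation.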


Advantage of certbot: a systemd timer that runs every other week is very easy to write, because "certbot renew" doesn't need any user interactions.

So it's literally < 10 lines of systemd unit file to automate it.


Or one line of cron.


> Or one line of cron.

For each distro, with each having its own format and its own crond implementation, at different file paths.


If you don't automate, don't document, and don't check that your process is actually working from the get-go, it is only your own fault, especially when working at an industrial scale like this.

Rotating the certificates constantly works for personal websites but it is not ideal in places where one can't easily update things - like behind corporate firewalls or where corporate processes permit updating/replacing things only in fixed cycles, which are often much longer than 90 days.

Don't you remember the Let's Encrypt root certificate expiring fiasco from a year ago? Granted, that wasn't really Let's Encrypt's fault, but it shows well that these things can be a tad more complicated than just running a script every 90 days.


Ballmer's Law: Engineers will design a system's maintenance schedule to be just beyond their promotion cycle


While I appreciate TLS, this thing with certificate expiration is one of the biggest sources of downtime IMO. Something should be done about it. Maybe throw the error not permanently but in some probabilistic way. Like if a 1-year certificate expired, after 3 months 25% of connections would fail. It would eventually let you find out about the problem, but it would still allow connections to somewhat work, with a few retries here and there. An expired certificate is not a compromised certificate and should not be treated like one. Often the next certificate is issued with the same private key.

Especially with short-lived Let's Encrypt certificates. Despite all the evangelists' assurances, certbot is not always easy to set up. After Let's Encrypt gained popularity, the percentage of small websites with expired certificates significantly increased IMO.
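To make the probabilistic idea concrete, here is a small Python sketch. The comment gives only one data point (1-year cert, 3 months expired, 25% failures), so the linear ramp is an assumption, and no TLS client actually behaves this way.

```python
def failure_probability(days_since_expiry: float, validity_days: float) -> float:
    """Assumed linear ramp: 0 at expiry, 1 once a full validity period has passed."""
    if days_since_expiry <= 0:
        return 0.0
    return min(1.0, days_since_expiry / validity_days)


# A 1-year certificate that expired a quarter of a year ago.
p = failure_probability(days_since_expiry=91.25, validity_days=365)
```

Under that assumption `p` is 0.25, and every connection would fail once the certificate has been expired for a full year.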


Making silent, intermittent failures for 3 months sounds so, so much worse than just having it 100% fail at expiration. What are we trying to fix here?


I'd take it as a canary for how much the site owner actually cares about ops and security. If you can't be bothered to take the day or so to set up certbot and monitoring for when your certs are ~15 days from expiration, then that's very likely not the only thing you've cut corners on.


We already have CRL lists. So why do we need certificate expiration?


When a certificate expires, it can be removed from the CRL. If certificates never expire then the CRL grows without bound.

Also, checking CRL is implemented in different ways. Some checks may be "soft", where a connection failure to the CRL is ignored. You probably want this anyway, if the CRL goes offline you don't want the internet to break. An expiry check, on the other hand, works as long as your clock is accurate.
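A simplified Python sketch of the difference between the two checks. `fetch_revoked` is a hypothetical callable standing in for downloading and parsing a CRL; real clients do much more (signature checks, caching, OCSP), so this only illustrates the soft-fail versus hard-fail decision.

```python
from datetime import datetime
from typing import Callable, Set


def certificate_acceptable(serial: int, not_after: datetime, now: datetime,
                           fetch_revoked: Callable[[], Set[int]],
                           soft_fail: bool = True) -> bool:
    """Combine a local expiry check with a network-dependent revocation check."""
    # Expiry needs no network access, only an accurate clock.
    if now > not_after:
        return False
    try:
        revoked = fetch_revoked()
    except OSError:
        # CRL endpoint unreachable: soft-fail lets the connection through,
        # hard-fail rejects it.
        return soft_fail
    return serial not in revoked
```

With `soft_fail=True`, an attacker who can block access to the CRL also defeats the revocation check, while the expiry check still holds.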


You are forgetting something.

If you want to reinstall some old software, let's say MS Small Business Server 2000 or Small Business Server 2003 today, the certificates in the installation files prevent the installation of said software. So you wouldn't even get as far as being able to remove any certs.

Your only recourse is to change the system date and time back to before the certificates in the installation files would have expired.

Besides being a stealth way to prevent old software from being reinstalled, it narrows down the window of opportunity for hackers.

I used to automatically issue certs for my own servers that lasted 24 hours, because if a hacker had gotten into my system without me knowing, which is a real possibility, at least an expired cert being used by someone else would highlight the problem.

As it happened, despite locking everything down to packet level and controlling the packets, my devices were just prevented from getting online. My ISP at the time, TalkTalk, had a very responsive system, issuing a new IP address every 2 seconds in a bid to prevent me from hosting a website with a domain name using a dynamic IP address domain name service.

There is way more surveillance than most people realise at least here in the UK.


>If you want to reinstall some old software, let's say MS Small Business Server 2000 or Small Business Server 2003 today, the certificates in the installation files prevent the installation of said software. So you wouldn't even get as far as being able to remove any certs.

I think at least in some cases it'll still work. What matters is that the signature was created while the cert was still valid, not that the installation happens when the cert is valid. How do we prevent backdating attacks? By using a separate timestamp signature.[1]

TLS is different. It requires the cert-holder (aka webserver) to be online at all times. You don't need to be able to validate a signature created in the past. So TLS doesn't have this problem and thus doesn't need its solution (timestamp signatures).

[1] https://stackoverflow.com/a/3428386
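The timestamp logic can be sketched in Python as follows. This is a simplification with made-up names: it ignores revocation and the validation of the timestamp authority itself, and only shows which moment the validity window is compared against.

```python
from datetime import datetime
from typing import Optional


def signature_time_ok(not_before: datetime, not_after: datetime, now: datetime,
                      signed_at: Optional[datetime] = None) -> bool:
    """With a trusted timestamp, judge validity at signing time; otherwise at 'now'."""
    reference = signed_at if signed_at is not None else now
    return not_before <= reference <= not_after


# Hypothetical installer signed in 2003 with a cert valid from 2002 to 2004.
cert_start, cert_end = datetime(2002, 1, 1), datetime(2004, 1, 1)
today = datetime(2023, 2, 12)
with_timestamp = signature_time_ok(cert_start, cert_end, today, datetime(2003, 6, 1))
without_timestamp = signature_time_ok(cert_start, cert_end, today)
```

Here `with_timestamp` is true and `without_timestamp` is false, which matches the two behaviours described in this subthread: a timestamped signature keeps verifying after the certificate expires, and an untimestamped one does not.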


What? 2 seconds? That sounds crazy. It either breaks ongoing connections or wastes addresses since you would have to keep your old one until all connections are closed.


It probably wasn't really "issuing a new IP address" per se, but rather CGNAT, where your apparent IP from the perspective of an IP reflector would be the IP of whichever NAT gateway your outgoing connection had been round-robin-load-balanced onto. Under CGNAT, you don't really have any single public IP; or rather, in another sense, you (and 100k other people) "have" all N public IPs at once — just like devices on a NATed home network all share the one IP address assigned to the gateway router in front of the NAT, and would all "have" multiple addresses if that gateway-router were multi-homed.


When is the internet in a country not a virtual dome like the one seen in the Truman Show?


Code signing certificates are NOT TLS certificates. Old code signing certs are an entirely different issue from old TLS server certs.


CRL lists have a lot of problems. The biggest being that they are, well, big.


I don't like intermittent bugs.


reminds me of that time the regular guy, might've been a student, re-registered hotmail.com just to get his email working again after Microsoft let it expire.

oh, looks like it was either hotmail.co.uk or passport.com

https://slashdot.org/story/99/12/25/114201/microsoft-hotmail...

from

https://whoapi.com/blog/5-all-time-domain-expirations-in-int...


Looking at crt.sh (https://crt.sh/?q=cdn.winget.microsoft.com), it seems that the certificate is issued automatically anyways, but for some reason the updated certificate is not applied correctly. A bad screwup really, but more a case of someone forgetting to check their logs for deployment errors than the common case of someone forgetting to manually update the certificate.


We're working on renewing the certificate. It looks like this is the first report: https://github.com/microsoft/winget-cli/issues/2956


I think they've been bad stewards of that project since they stole it from that guy. it had promise but they refused to add basic features


Why would anyone standardize on winget when there's chocolatey and it works everywhere you can run Windows software?

Recently I decided to install Windows Server on one of my PCs instead of Windows 10/11. I thought, why not attempt to use winget and/or the Windows App Store to install basic software to see how far that takes me.

Immediate dead end. Apparently there's no access to the App Store on Windows Server. You have to have your IT set it up. And all the documentation refers to using winget from the App Store. :/ Microsoft, I can't even with you right now.

With Windows Server 2022, apparently you can get winget to work, but you have to install something else or other that's in preview.

For real, guys. Please get with the program.

Possible workarounds: Anything that's a workaround is already a non-starter for me because I'm not trying to experiment to see what I can get to work. I'm trying to move to using a new norm so I don't get left behind. If Microsoft was pushing winget as the new norm for global silent command-line installations of all things Windows, as an alternative to Chocolatey, great! I would have given it a try. Other than that, imma just wait until y'all get your act together, someday-maybe.


Not sure why you'd be trying to run Server on a workstation?


because winget is built-in


Microsoft's certificate management skills have gone down the drain anyway, so this doesn't surprise me. I have a long standing support case open with them about how they ship one of their more obscure tools signed with the wrong code signing certificate (one signed by their PKI for Azure INTERNAL usage, which should have everyone a tiny bit worried), and I've pretty much given up on trying to get the quite-obviously-on-an-H1B developer (which I mention only to explain that this PROBABLY leads to a perverse incentive to sweep things under the rug), or any of the Indian support agents involved to comprehend that this isn't just inconvenient when one has AppLocker in place, but also that it violates Microsoft's internal policies (which I know for a fact that it does), and MSRC ignored my email about it, too, so… par for the course. ¯\_(ツ)_/¯


If only there was a cloud based solution by a large company for managing certificates automatically!

(Azure Front Door)


The certificate was renewed, but it wasn't deployed correctly. We're looking into the root cause.


I'm just thankful we switched to afd and never have to think about certs (neither issuing nor deploying) again. You should check it out!


ITT: Lots of people feigning ignorance about how companies work, or have literally never worked for a large company before.

And hilariously thinking Microsoft would spin up a critical incident team for a free open-source product. I'm rolling on the floor laughing.


Aaaand it's still not fixed. I think this just goes to show how much red tape there is around processes at Microsoft.


https://github.com/microsoft/winget-cli/issues/2956#issuecom...

We updated the certificate about an hour before this post. It takes 6 - 8 hours for the certificate to fully propagate.


You realize it is Sunday, right? And that probably a lot of relevant people are not in the office today, even if the techops grunts who are on call are scrambling.


Day of the week is irrelevant for an organisation that big. If you can't escalate a major incident like this to a designated person - even if you have to wake them up - your process and business continuity plans are seriously flawed.


Amateurs.



