Hacker News
O2 outage due to expired Ericsson certificate (ericsson.com)
146 points by amaccuish 45 days ago | 89 comments

It took some warming up to, but I have come round to appreciating Let's Encrypt's short certificate lifetimes. Monthly renewal should be the maximum for any system, but ideally you'd want to go weekly. Assuming your renewal is automated, I don't see any downside, only benefits. It properly internalises cert renewal as part of standard system operations, bringing it into your daily ops instead of leaving it as some scary grey undocumented box for the next guy to tick.

Much like writing tests, it initially hurt my ego a bit, but I've come to like and proselytise it.

Your proposal may lead to an unintended DDoS of the CA.

Let's Encrypt recommends a daily cron job for cert renewal. The certbot only actually pings the servers for a renewed cert if the current cert is within 30 days of expiration.
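That renewal gate can be sketched as a small pure check (the 30-day window is certbot's default; the function name here is illustrative, not certbot's actual API):

```python
from datetime import datetime, timedelta

RENEWAL_WINDOW = timedelta(days=30)  # certbot's default renewal threshold

def should_renew(not_after: datetime, now: datetime) -> bool:
    """True only when the current cert is within the renewal window,
    so a daily cron job is a cheap no-op most of the time."""
    return not_after - now <= RENEWAL_WINDOW
```

A daily job calls this unconditionally; with 90-day certs it only generates CA traffic in roughly the last third of each cert's lifetime.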

Most CAs already operate an OCSP responder which supports querying daily (if not more frequently) to make sure a certificate still hasn't expired; it checks the database and responds with a signed and timestamped acknowledgement. (Those that don't need to support frequent queries of a static certificate revocation list file.) So this is already a level of requests that CAs have to be built for.

And even if it didn't, scaling certificates from once a year to once a week requires merely 50x capacity - a finite number, and something that's very easy to plan out. If CAs in the '90s and '00s were able to handle a certain once-a-year request load, surely CAs today can handle 50x that.

There is no need for enforced expiry of certificates, especially here where there are used internally on a controlled network.

If there is a requirement or good practice to periodically replace key-pairs then it should be a network management operation. Equipment should never stop working because it decided that a key-pair had expired.

X509 certificates are not technically optimal in many situations and, let's face it, the main driver for expiry dates is that certificate vendors want to sell more certificates...

This was a major outage in the UK causing millions of people not to have data access on their phones.


Not just data but also calls and text. My mother is on O2 and until about 6pm I was unable to call her; I was able to text her from about 4pm. Between 4 and 6 the calls were just failing with "call failed" and not even going to voicemail, but texts would go through and I would even get delivery reports; the calls just wouldn't connect.

Yeah, O2 kept saying that calls and texts were not affected, but that was utter nonsense. I couldn't place a call most of the day, and texts would go out, but my phone would insist that sending failed. And then there was just no signal at all most of the afternoon.

Affected some other things too, like bus stop "next arrival" signs (which apparently use O2's 3G network).

And some major major emergency ass pulling I was told by insiders.

"Ass pulling"?

An Ass Pull is a moment when the writers pull something out of thin air in a less-than-graceful narrative development, violating the Law of Conservation of Detail by dropping a Plot-critical detail in the middle, or near the end of their narrative without Foreshadowing or dropping a Chekhov's Gun earlier on.

Changed all the cell tower IDs en masse.

Maybe it would be a good idea for certificates to expire slowly and randomly over 24 or 48 hours. In other words, if the cert has an expiry date of 12:00 UTC, Dec 6th 2018, then start to randomly fail connections at that time with low probability. The probability increases progressively during the next 24 hours until 100% of connections fail at 12:00 UTC, Dec 7th 2018. It's not like the cert is 100% trustworthy one minute and 100% untrustworthy the next minute. Having the failure rate ramp up slowly would give advance warning before everything has gone completely pear-shaped.

In the case of Ericsson, this might have allowed an emergency certificate update before all the O2 systems could no longer be automatically updated. Once your network is completely down, bringing it back up remotely is hard.
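A minimal sketch of the ramp described above (the linear schedule and all names are assumptions for illustration, not anything a real TLS stack does):

```python
import random
from datetime import datetime

def failure_probability(now: datetime, expiry: datetime,
                        ramp_hours: float = 24.0) -> float:
    """0.0 before the nominal expiry, ramping linearly to 1.0
    ramp_hours afterwards."""
    hours_past = (now - expiry).total_seconds() / 3600.0
    return min(max(hours_past / ramp_hours, 0.0), 1.0)

def should_fail(now: datetime, expiry: datetime) -> bool:
    """Randomly reject a connection with the ramped probability,
    so alarms fire while most traffic still succeeds."""
    return random.random() < failure_probability(now, expiry)
```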

No, this is not a good idea. Failures which are random are harder to diagnose. Something which works or not is much quicker to track down.

The simple fact is that this sort of problem should've been dealt with a lot sooner, and the failure to do so is sheer incompetence.

Where that incompetence lies, is up for debate. But given how these things usually run, it lies some levels above the people who didn't have the time to properly follow a poorly-described process in the face of demands to do other stuff instead.

Of course certificates should be managed properly, so this sort of thing should never happen. My suggestion is not instead of that, but in addition. The question is when your certificate management process has failed, what then? It's defense in depth.

In this case the cert's expiry seems to have rendered it hard or impossible to remotely resurrect the systems. That's a very brutal failure when your whole network goes down in one go. Having random failures, with alarms going off before everything is down, is definitely better than having alarms going off with the entire network already down.

That can be achieved by having multiple certs issued with slightly different expiries. Without weakening enforcement of all other certs (incl. those belonging to companies that manage them properly).

This immediately results in the “it took us more than 24 hours to work out everything was failing, it needs to be 72 hours”. The best thing about that is that it results in a self-extending policy: the longer it takes to age out an expired cert, the longer it takes for the failure volume to become noticeable, then the longer to work out that the problem is cert expiry, and so the more likely they’ll need an extension (again).

The correct solution for this problem is for people to correctly manage their certs, which they should be doing anyway because the private keys are sufficiently important you should know exactly which are in use and where they are in use.

Maybe it would be a good idea for people to put an expiration date check in whatever they use to monitor the rest of the machines. Or put the expiry date on a shared calendar.
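The check itself is small. A hedged sketch using Python's `ssl` module, split so the date arithmetic is separate from the network fetch (the two-month threshold is just an example policy):

```python
import ssl
import socket
from datetime import datetime

ALERT_DAYS = 60  # example: start alerting two months in advance

def days_remaining(not_after: str, now: datetime) -> float:
    """not_after is the 'notAfter' string from ssl.getpeercert(),
    e.g. 'Dec  6 12:00:00 2018 GMT'."""
    expiry = datetime.strptime(not_after, "%b %d %H:%M:%S %Y %Z")
    return (expiry - now).total_seconds() / 86400.0

def check_host(host: str, now: datetime, port: int = 443) -> float:
    """Do a real handshake against the live endpoint and return the
    days until its cert expires; wire the result into your alerting."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return days_remaining(tls.getpeercert()["notAfter"], now)
```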

We do both, and they start alerting two months in advance. This isn't rocket surgery, new, or much different than remembering people's birthdays.

But a surprising number of technical people are simply incapable of managing events farther out than next week, and a surprising number of their managers are similarly incapable of making sure someone treats it as important.

Makes me almost want to start a betting pool on who will be the next HugeCo to down themselves this way.

Edit: And let me join the chorus in saying deliberately inducing random errors as a way to draw attention is a terrible idea.

> whatever they use to monitor the rest of the machines.

"Ah, are we supposed to do that too?"

I think this could also be a good idea for phasing out public APIs -- instead of just taking an API offline, start to fail requests early at low probability, ramping the probability up to 100% over the course of a month or so.

Adding delays to responses can also work.

Years ago, during the tail end of the Age of XML, I had a conversation with one of the admins over at W3C.

To make your app work you're supposed to cache all schemas locally, but an inordinate number of people don't. Their apps will be slow because of that, but work anyway. So tons of traffic goes to w3.org. They can't remove those schemas because then nobody can use them the right way (download them once and keep the result). But how do you get people's attention?

I pointed out to him that a human downloading the schema isn't going to notice if the request takes .3 seconds or 3 seconds, but someone loading that schema 100x in a loop is sure going to be incentivized to figure out where the extra 5 minutes of processing time came from. He allowed that I might be onto something there but I never heard back about whether he tried it.

If you want to kill off an API, assign it resources (hardware, bandwidth) that get progressively smaller, or add artificial waits that get progressively bigger. Which depends on how horizontally scalable your app is (is it easier to add a sleep or to pull servers out of the cluster?).
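A hedged sketch of the progressive-latency idea (the window, linear shape, and names are illustrative):

```python
from datetime import datetime

def sunset_delay_seconds(now: datetime, start: datetime, end: datetime,
                         max_delay: float = 30.0) -> float:
    """Artificial per-request latency: zero before the sunset window
    opens, growing linearly to max_delay by the shutdown date."""
    total = (end - start).total_seconds()
    elapsed = (now - start).total_seconds()
    frac = min(max(elapsed / total, 0.0), 1.0)
    return max_delay * frac
```

A server would `time.sleep(...)` for this amount before answering, so clients that hammer the deprecated endpoint feel the pain first and loudest.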

How long would it take before developers started wrapping their API and CDN calls in rapid-firing loops because of this?

It already exists and has borrowed the name "resilience engineering" from the construction and engineering fields. Netflix has some interesting blog posts on how they deal with transient faults and resilience in general. Implementing concepts like circuit breakers.

Have a search for libraries in your favorite language, I'm sure something will already exist. I've personally used Polly in .NET.


Are you insane? That would be fun to diagnose.

While I can totally see why this would be annoying to diagnose, throwing a specific form of a certificate expired error code could mitigate this issue.

At this point I wouldn’t even contemplate using X.509 without a fully automated PKI issuing ephemeral certs. Paging humans is not the answer.

Certificate monitoring would be a good idea.

We use Erlang for a high availability of nine nines, so long as we remember to renew our certs.

Would be interesting to know how many people who've managed footprints for a reasonable period of time (say 5-10 years) haven't had a cert expire on them. Wouldn't be surprised if it's single-digit percentages.

So many human & tech error factors lead to this occurring and they're all the same old things. Staffing changes, spam filters, ignored warnings, skipped emails...

I know it's happened to every company where I've worked. It happens so rarely, though, that people don't have enough opportunity to learn from it. Even at Google they were on their Nth such outage for a large value of N before it became apparent that no certificate should ever expire at 23:59:59 on December 31, or otherwise outside of normal operating hours. Seriously, it took 20 years of organizational knowledge to get the company to understand that certs should expire at noon on a Wednesday, to minimize time-to-repair in the inevitable event that one is allowed to lapse.

Instead of waiting until the last minute, you'd think a large company would have planned to renew certificates X amount of time before they expire. Alas, I understand it's not that simple.

In my small organization we had planning and multiple reminders to renew the cert well before it expired, and we did. Due to a miscommunication between myself and a coworker, the new cert sat ready for nearly two months without ever being added to the configuration (we were both certain the other had done it, naturally).

There's a remarkable number of ways for this simple thing to go wrong. To prevent a future repeat, we got rid of our calendar reminders (which we started ignoring once we both thought the change had been made) and wrote a script that emailed us based on the time to expiration of the live cert. This is a much better method.

Of course, give us enough years and I'm sure we'll manage to find a way to get this new setup wrong.

You'd certainly think so. If you have frontend probers that exercise your accessible endpoints (HTTP or whatever) then those probes should fail when the certificate expires in less than 30 days. I couldn't comment on whether an organization like Ericsson or O2 would be expected to have such probers.

I should get on that for my own infrastructure.

Thanks for the reminder.

A really easy way to avoid this in any environment with Continuous Integration style tests running on everyone's work:

Add a test that just unconditionally fails on a certain date, like a week before your cert expires. Don't let any code review sign off on merging a fix to that test until the new cert is in prod. Don't let any code promote between environments while tests are broken.

The problem with emails and warnings is they're all ignorable and therefore completely unsuitable for managing something as critical as a cert.

The secret is to create a straight-up error that absolutely interrupts every engineer in the organization's day until the cert is renewed. A development-halting error a week before cert expiration is a hell of a lot better than a business-halting error when it expires in prod.

> fails on a certain date

Or just write a test that checks the date on the cert and conditionally fails.

If it's within 4 weeks, send email to x,y,z.

If it's within 2 weeks, send it to VP of x, y, z.

A combination of failing test and email should lessen the chance of it going unnoticed (email could possibly fail for whatever reason).

Ignoring letting it expire in the first place.
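The tiered escalation above could look like this (recipient names are placeholders mirroring the comment's x, y, z scheme):

```python
def escalation_recipients(days_left: int) -> list:
    """Widen the audience as expiry approaches, per the
    4-week / 2-week tiers described above."""
    if days_left <= 14:
        return ["x", "y", "z", "vp-x", "vp-y", "vp-z"]
    if days_left <= 28:
        return ["x", "y", "z"]
    return []
```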

The surprising part is it took over 24 hours to restore service. I currently still have 3G only, and that's struggling (apparently 4G will follow).

I suspect it was a root CA (they possibly had a private PKI) that expired - I saw a tweet suggesting that they had to manually apply a fix to every deployment of the software, which is consistent with having to change the trust chain.

Still only 4G here too. Apparently we can expect it to return tomorrow morning. O2 were asked if we'll be compensated. They said they'll "apologise in an O2 way" but couldn't confirm what an "O2 way" is.

I assume that means they’ll apologise by ramping up your monthly rate.

Probably 10% off some ticketing scam they have running.

If O2 is like most phone companies, my reaction to this threat would be to beg them not to.

Apologise in an O2 way? Gosh, are we back to the BT era?

See what we can do. O2.

They are a bunch of O2 thieves.

Certificates can be hard to manage across enterprises. I have a project coming across my desk next year specifically to manage expiring certs and track on going changes. The company has 20,000+ certs to manage for us and our customers.

I can't tell if you're sitting on a gold mine or about to get very, very depressed very, very quickly.

I am, however, pretty sure there's no middle ground.

Wish you all the best but I would never trade my job for yours, I value my sanity way more than that.

I'd find an HN post about this project and its results very interesting.

There was an interesting post a few weeks back on Autotrader moving all of their online properties over to LetsEncrypt https://news.ycombinator.com/item?id=17949741

Ooh - I was wondering what the hell was going on.

I assumed I had dropped the darn thing one too many times.

Glad to see the HN grapevine works on wifi.

It is a reminder of just how fragile this digital world still is - we are taking technology designed to survive nuclear war, and adding single points of failure.

Let's look at mesh networking again.

It wasn't just the phones or just data on phones. GWR ticket kiosks went offline. Bus stop digital timetables went offline. On the phone, SMS delivery went wonky, i.e. the same SMS being delivered more than once while on the sender's phone it showed up as not delivered!

What is O2? It's not mentioned in the article but it's in the title.

O2 is one of the major telecommunications service providers in the UK: https://en.wikipedia.org/wiki/O2_(UK)

Reminds me of the 1990 network-wide outage AT&T suffered, aka "the big one of nine-oh". Caused by a bad update which took a couple of weeks to eventually bork the entire network.

Multiple providers all over the world were affected as well. For example, all Vietnamese providers had their data service disrupted yesterday.

O2 uses Qualys, which has the ability to manage certificates - see their website. Most large IT systems use centralised monitoring fed from several sources. Did someone forget to monitor this data, or was the feed either ignored or out of date?

If only the mechanisms that check certificates could provide warnings of impending expiry - 1,3,7,30 days would be prudent.

Though companies should be doing at least a yearly audit of certificates and calendaring any that will need renewing. As I'm sure they do with domain names already.

Presumably this certificate was protecting some machine-to-machine connection. Then, no human would ever have seen those warnings.

Hardware manufacturers: How do you manage certificates that the customer will not touch or even see? Do you deliver new CAs alongside firmware updates? Do your software lifecycle requirements take certificates and cryptography in general into account?

If an SSL cert is expired on a website, you don't have access to the secured website. In this case though, how does it affect Ericsson's telecom equipment (core routers and switches for mobile and voice)?

Same risk with domains and DNS - when people question why CSCGlobal or UltraDNS charge so much compared to cheaper alts - it's because they have your back when you miss something

The only way Ultra have your back is ripping the shirt off you in overage charges

Why do certificates expire? How is it acceptable to have a piece of data somewhere that contains a timebomb that must be periodically defused? I do not see how this helps security, and it is particularly ridiculous that normal behavior is that one day the system works normally and is deemed secure, and the next day it is so insecure and dangerous that communication simply fails.

Cryptography is basically a computational treadmill: you want to make it cheap enough that it's not burdensome for the actual users, but that reversing the information without the key is computationally expensive. Processing power, especially for the highly parallelizable task of grinding through potential keys, follows an exponential curve; ergo, even the present exponential gap is not a long-term protection mechanism.

The expiry date of a certificate was present in the original implementation of X.509 certificates; the infrastructure for certificate revocation lists was added later. Furthermore, a certificate revocation list includes every certificate ever revoked that has yet to expire. If you relied on certificate revocation for expiring old certificates as well, that would be a nasty long file you'd have to download before you can start connecting to any website.

I think the much more important reason for expiring certificates is to minimize the damage in the case that your private keys are stolen, which is far more likely than them being brute forced.

Seriously? Any half decent administrator knows that for every SSL certificate you install, you set a date some time in the future where you need to change it out for a new one.

To your comment of "one day the system works normally and is deemed secure, and the next day it is so insecure and dangerous", this is working as intended. The certificate is to establish trust and identity along with encrypting the data in transit. The identity described by the certificate is only valid up until the expiry date after which it ceases to be valid for that purpose. You now have an encrypted connection to something that can't prove its identity, which is certainly a lower level of "secure" than what it was before the certificate expired.

Ignoring expiry dates would mean any keys that were compromised ever could be used for MITM attacks and no one would be the wiser.

Expiry dates are in years; if a key is compromised, an adversary has _years_ to exploit a MITM.

However, we already mitigate this with revocation lists. But if we can revoke certificates, why do we have expiration dates?

Seems to me expiration dates are rent seeking behaviour by certificate vendors.

This is a reason to shorten expiration times, not remove them (which is what companies like Let's Encrypt are doing).

Why? If revocation already works, why bother with expiration?

One good reason is that if you buy a domain name that somebody else has used in the past, they don't have an infinite valid SSL certificate for your domain.

Would it not be possible to expire the cert if the domain expires?

No, that would be "revocation". Expiration is relatively easy to implement because the expiration date is known in advance and so you can simply put the expiration date in the certificate when it is issued. Revocation is relatively difficult because you need to continually check some database for revocation information — that's where CRLs, OCSP, and the like come in. And there's a lot of complexity under that hood, which, once the dust settles, boils down to just issuing very-short-lived certificates under a different guise.
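The asymmetry shows up in even a toy validity check: expiry needs only data carried inside the certificate, while revocation needs a fresh external lookup (names here are illustrative):

```python
from datetime import datetime

def cert_status(not_after: datetime, serial: int,
                now: datetime, revoked_serials: set) -> str:
    """Toy check illustrating why expiry is cheap (self-contained)
    and revocation is not (needs an up-to-date external list)."""
    if now > not_after:                  # read straight off the cert
        return "expired"
    if serial in revoked_serials:        # requires a fetched CRL/OCSP answer
        return "revoked"
    return "valid"
```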

No. The certificate's expiration is fixed at time of issuance. You could set the expiration of the certificate to the expiration date of the domain, but the domain could be transferred, cancelled, or revoked before the expiration.

Just a few years ago, almost nothing checked the revocation lists. I revoked certs for some popular domains and was concerned about ssl caches and proxies... turns out, an owl heard it. An odd dog barked. No impact. Not even from the folks that embedded our certs onto their servers for legacy code reasons. Perhaps this has changed over the last couple of years.

Yep. But that's an implementation problem. Revocation is a critical security feature. Complaining that people didn't used to check it is like complaining that people didn't used to encrypt.

You're not secure if you don't check revocation.

I agree with this. For the record, I am not complaining. :-) I just like to share my experiences of how things worked versus how they were intended to work.

I agree with you, but I suspect the intention is to prevent another type of timebomb. One where one day the system works and everything is considered secure, and then 25 years later nobody has given it a second thought.

Yes. Even then, a few years is too long, because you forget about the need for renewal.

I would say that the "proper" use of certificate expiry is the way Let's Encrypt and other ACME providers do it: it's set so low that you need an automated renewal process in order to make the certificate at all useful.

But if it's automated, aren't we back to "forget about it for 25 years"?

But then you need to make sure that the automated cert renewal system is still working...

But when automated cert-renewal breaks, it immediately breaks, and you start getting "cert renewal failed" messages in your email prompting you to action—but at that point, you still have time remaining to fix it before the cert expires, since ACME impls tend not to renew the cert exactly one second before it expires, but rather more like a day or two before it would expire.

Unlike cert expiry, where the first you hear about it is when your production system stops working.

But of course, the failure emails will go to an email address that was deactivated a couple years ago because an employee left/everybody fired in reorg/unused address cleanup/domain migration/technical error/...

Just this week I had someone complain to me that he was no longer getting build failure emails, and it turned out IT had disappeared our old <username>@<olddomain> addresses. That reminds me about an IT ticket I need to write.

Attestation of identity should be seen as a service, not a product. Revocation exists. The fact that certificates are pieces of data is a compromise, not an ideal design.

Most CAs already support OCSP, which is effectively one-day-long certificates: the client can contact the OCSP server for a signed response saying "yes, it's still unrevoked", or the server can include ("staple") such a response along with its certs. Some certs have the MustStaple extension, indicating that they should not be treated as valid unless a recent OCSP response is stapled to it.

If we had the computational resources to just not have certificates at all and have every client check the CA's current belief that a public key belongs to a name at each use (and magically avoid the associated privacy problems), that would be ideal. Certificates are an approximation.

One neat thing about 3-month certificates is that it's a meaningfully different human timescale from a year (or multiple years): operators may not still be around in a year but will certainly be around in 3 months. Operators may just plan to manually renew in a few years, but generally decide they need to automate the process if it's every 3 months. (For extremely boring reasons, I manually update the certificate on my personal website every 3 months and it's a pain, and I think I am one of very few people who do manual updates to their Let's Encrypt certs.) So it forces people to think of certificates as running their end of a service with ongoing operational work, not a one-time transfer of data.

> Why do certificates expire?

I always assumed one reason is so revocation lists don't become huge. Also, digital properties such as domains change hands.

The problem is: what if someone screws up and the key is compromised? By having an expiration we ensure that in the worst case they get just a few years to do damage.

>just a few years of damage.

Something in there is telling you that you're getting the worst of both worlds: a gap between renewals so long that people simply forget, leading to outages, and so long that if a key is compromised, someone else could use it for years before anyone notices and actually revokes the old key.

derefr above has the correct response.

An interesting new project is Handshake which is attempting to use decentralization to remove centralized certificate authorities. Maybe it will help stop these and similar situations in the future.


Why is it better than GPG?

(“Blockchain” is not a valid answer)

The idea is to use economic incentives to keep people honest rather than trusting centralized authorities which can be more easily compromised

GPG doesn't use a centralized authority. It uses a decentralized "Web Of Trust" model.
