
O2 outage due to expired Ericsson certificate - amaccuish
https://www.ericsson.com/en/press-releases/2018/12/update-on-software-issue-impacting-certain-customers
======
nothrabannosir
It took some warming up to, but I have come round to appreciate letsencrypt's
short certificate lifetimes. Monthly renewal should be the maximum for any
system, but ideally you'd want to go weekly. Assuming your renewal is automated, I
don't see any downside, only benefits. It properly internalises cert renewal
as part of standard system operations, bringing it into your daily ops instead
of having it as some scary gray undocumented box for the next guy to tick.

Much like writing tests, it initially hurt my ego a bit, but I've come to like
and proselytise it.

~~~
tqkxzugoaupvwqr
Your proposal may lead to an unintended DDoS of the CA.

~~~
jjeaff
Let's Encrypt recommends a daily cron job for cert renewal. The certbot only
actually pings the servers for a renewed cert if the current cert is within 30
days of expiration.
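
A rough sketch of that decision, assuming the 30-day threshold described above. This is not certbot's actual code; `should_renew` and `RENEW_WINDOW` are names made up for illustration:

```python
from datetime import datetime, timedelta, timezone

# Assumption: mirrors the 30-day threshold mentioned in the comment above.
RENEW_WINDOW = timedelta(days=30)


def should_renew(not_after: datetime, now: datetime) -> bool:
    """Return True only when the cert is inside the renewal window or already expired."""
    return not_after - now <= RENEW_WINDOW


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # A daily job can call this every run; it only says "renew" near expiry.
    print(should_renew(now + timedelta(days=60), now))
    print(should_renew(now + timedelta(days=10), now))
```

So the job runs daily, but the CA only hears from you during the last 30 days of a cert's life.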

------
Bokanovsky
This was a major outage in the UK causing millions of people not to have data
access on their phones.

[https://www.bbc.co.uk/news/business-46464730](https://www.bbc.co.uk/news/business-46464730)

~~~
Crosseye_Jack
Not just data but also calls and texts. Mother is on O2 and until about 6pm I
was unable to call her, though I was able to text her from about 4pm. Between
4 and 6 the calls were just failing with “call failed” and not even going to
voicemail, but texts would go through and I would even get delivery reports,
just that calls wouldn’t connect.

~~~
gambiting
Yeah, O2 kept saying that calls and texts were not affected, but that was
utter nonsense. I couldn't place a call most of the day, and texts _would_ go
out, but my phone would insist that sending failed. And then there was just no
signal at all most of the afternoon.

------
mhandley
Maybe it would be a good idea for certificates to expire slowly and randomly
over 24 or 48 hours. In other words, if the cert has an expiry date of 12:00
UTC, Dec 6th 2018, then start to randomly fail connections at that time with
low probability. The probability increases progressively during the next 24
hours until 100% of connections fail at 12:00 UTC, Dec 7th 2018. It's not like
the cert is 100% trustworthy one minute and 100% untrustworthy the next
minute. Having the failure rate ramp up slowly would give advance warning
before everything has gone completely pear-shaped.

In the case of Ericsson, this might have allowed an emergency certificate
update before all the O2 systems could no longer be automatically updated.
Once your network is completely down, bringing it back up remotely is hard.
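
A minimal sketch of that ramp, assuming a linear increase over a 24-hour window. No TLS library I know of behaves this way; `failure_probability` and `connection_allowed` are hypothetical names:

```python
import random
from datetime import datetime, timedelta, timezone


def failure_probability(now: datetime, not_after: datetime,
                        ramp: timedelta = timedelta(hours=24)) -> float:
    """0.0 up to the expiry time, rising linearly to 1.0 once `ramp` has elapsed."""
    if now <= not_after:
        return 0.0
    return min(1.0, (now - not_after) / ramp)


def connection_allowed(now: datetime, not_after: datetime,
                       ramp: timedelta = timedelta(hours=24),
                       rng=random.random) -> bool:
    """Randomly reject a connection with the current failure probability."""
    return rng() >= failure_probability(now, not_after, ramp)


if __name__ == "__main__":
    expiry = datetime(2018, 12, 6, 12, 0, tzinfo=timezone.utc)
    for hours in (0, 6, 12, 24):
        print(hours, failure_probability(expiry + timedelta(hours=hours), expiry))
```

Six hours past expiry, roughly a quarter of connections would fail, which is noisy enough to page someone while most traffic still gets through.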

~~~
gred
I think this could also be a good idea for phasing out public APIs -- instead
of just taking an API offline, start to fail requests early at low
probability, ramping the probability up to 100% over the course of a month or
so.

~~~
anilakar
How long would it take before developers started wrapping their API and CDN
calls in rapid-firing loops because of this?

~~~
doesnt_know
It already exists and has borrowed the name "resilience engineering" from the
construction and engineering fields. Netflix has some interesting blog posts
on how they deal with transient faults and resilience in general. Implementing
concepts like circuit breakers.

Have a search for libraries in your favorite language, I'm sure something will
already exist. I've personally used Polly in .NET.

[https://github.com/App-vNext/Polly](https://github.com/App-vNext/Polly)
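
For anyone unfamiliar with the idea, here is a bare-bones circuit breaker sketch in Python. It is not Polly's API or Netflix's implementation; the class name, parameters and thresholds are all invented for illustration:

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses to call the upstream at all."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cool-down in seconds while open
        self.clock = clock                # injectable so it can be tested
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; not calling upstream")
            # Cool-down elapsed: half-open, so a single failure re-opens it.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success closes the circuit fully
        return result
```

The point is the opposite of a rapid-firing retry loop: once the upstream looks dead, callers fail fast and stop hammering it until the cool-down passes.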

------
felideon
We use Erlang for a high availability of nine nines, so long as we remember to
renew our certs.

------
RowanH
Would be interesting to know how many people who've managed footprints for a
reasonable period of time (say 5-10 years) _haven't_ had a cert expire on
them. Wouldn't be surprised if it's single digit %ages.

So many human & tech error factors lead to this occurring and they're all the
same old things. Staffing changes, spam filters, ignored warnings, skipped
emails...

~~~
stdplaceholder
I know it's happened to every company where I've worked. It happens so rarely,
though, that people don't have enough opportunity to learn from it. Even at
Google they were on their Nth such outage for a large value of N before it
became apparent that no certificate should ever expire at 23:59:59 on December
31, or otherwise outside of normal operating hours. Seriously, 20 years of
organizational knowledge required to get the company to understand that certs
should expire at noon on a Wednesday to minimize time-to-repair in the
inevitable event that one is allowed to lapse.
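
To illustrate the rule (this is just a sketch, not how Google or any CA actually does it), you could snap whatever expiry you were going to issue back to the previous Wednesday at noon. `snap_expiry` is a made-up helper, and treating "noon" as UTC is an assumption; in practice you'd want your on-call team's time zone:

```python
from datetime import datetime, timedelta, timezone

WEDNESDAY = 2  # datetime.weekday(): Monday is 0


def snap_expiry(nominal: datetime) -> datetime:
    """Move a nominal expiry back to the latest Wednesday 12:00 at or before it."""
    candidate = nominal.replace(hour=12, minute=0, second=0, microsecond=0)
    candidate -= timedelta(days=(candidate.weekday() - WEDNESDAY) % 7)
    if candidate > nominal:
        # Nominal expiry was a Wednesday morning; use the previous week instead.
        candidate -= timedelta(days=7)
    return candidate


if __name__ == "__main__":
    new_years_eve = datetime(2018, 12, 31, 23, 59, 59, tzinfo=timezone.utc)
    print(snap_expiry(new_years_eve))  # 2018-12-26 12:00:00+00:00
```

Snapping backwards rather than forwards means the cert never lives longer than originally intended.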

~~~
jtl999
Instead of waiting until the last minute, you'd think a large company would
plan to renew certificates X amount of time before they expire. Alas I
understand it's not that simple.

~~~
stdplaceholder
You'd certainly think so. If you have frontend probers that exercise your
accessible endpoints (HTTP or whatever) then those probes should fail when the
certificate expires in less than 30 days. I couldn't comment on whether an
organization like Ericsson or O2 would be expected to have such probers.
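
A small sketch of such a probe using only Python's standard library. The function names and the 30-day default are my own choices for illustration, and a real prober would also need alerting, retries and so on:

```python
import socket
import ssl
from datetime import datetime, timezone


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Return the peer certificate's notAfter string, e.g. 'Dec  6 12:00:00 2018 GMT'.

    With the default context an already-expired cert fails the handshake,
    which a prober should also treat as a failure.
    """
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` and the certificate's notAfter time."""
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after),
                                     tz=timezone.utc)
    return (expires - now).total_seconds() / 86400


def probe_ok(not_after: str, now: datetime, min_days: float = 30) -> bool:
    """Fail the probe when fewer than `min_days` of validity remain."""
    return days_until_expiry(not_after, now) >= min_days
```

Usage would be something like `probe_ok(fetch_not_after("example.com"), datetime.now(timezone.utc))`, run from wherever your other health checks live.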

~~~
jtl999
I should get on that for my own infrastructure.

Thanks for the reminder.

------
NeedMoreTea
Ignoring letting it expire in the first place.

The surprising part is it took over 24 hours to restore service. I currently
still have 3G only, and that's struggling (apparently 4G will follow).

~~~
jwdunne
Still only 3G here too. Apparently we can expect it to return tomorrow
morning. O2 were asked if we'll be compensated. They said they'll "apologise
in an O2 way" but couldn't confirm what an "O2 way" is.

~~~
matthewmacleod
I assume that means they’ll apologise by ramping up your monthly rate.

------
wil421
Certificates can be hard to manage across enterprises. I have a project coming
across my desk next year specifically to manage expiring certs and track
ongoing changes. The company has 20,000+ certs to manage for us and our
customers.

~~~
codingminds
I'd find an HN post about this project and its results very interesting.

~~~
alwaystyred
There was an interesting post a few weeks back on Autotrader moving all of
their online properties over to LetsEncrypt
[https://news.ycombinator.com/item?id=17949741](https://news.ycombinator.com/item?id=17949741)

------
lifeisstillgood
Ooh - I was wondering what the hell was going on.

I assumed I had dropped the darn thing one too many times.

Glad to see the HN grapevine works on wifi.

It is a reminder of just how fragile this digital world still is - we are
taking technology designed to survive nuclear war, and adding single points of
failure.

Let's look at mesh networking again.

~~~
stillworks
It wasn't just the phones or just data on phones. GWR ticket kiosks went
offline. Bus stop digital timetables went offline. On the phone, SMS delivery
went wonky, i.e. the same SMS being delivered more than once while on the
sender's phone it showed up as not delivered!

------
davidwparker
What is O2? It's not mentioned in the article but it's in the title.

~~~
jacobwg
O2 is one of the major telecommunications service providers in the UK:
[https://en.wikipedia.org/wiki/O2_(UK)](https://en.wikipedia.org/wiki/O2_\(UK\))

------
nickdothutton
Reminds me of the 1990 network-wide outage AT&T suffered, aka "the big one of
nine-oh". Caused by a bad update which took a couple of weeks to eventually
bork the entire network.

------
dikei
Multiple providers all over the world were affected as well. For example, all
Vietnamese providers had their data service disrupted yesterday.

------
OnTheHoof
O2 uses Qualys, which has the ability to manage certificates - see their
website. Most large IT systems use centralised monitoring with feeds from
several sources. Did someone forget to monitor this data, or was the feed
either ignored or out of date?

------
Zenst
If only the mechanisms that check certificates could provide warnings of
impending expiry - 1,3,7,30 days would be prudent.

Though companies should be doing at least a yearly audit of certificates and
calendaring any that will need renewing. As I'm sure they do with domain names
already.
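
Something like this, as a sketch: given the days left on a cert, report which of those warning thresholds has been crossed. `warning_level` is a hypothetical helper, not part of any existing checker:

```python
from typing import Optional

# Thresholds taken from the comment above: warn at 30, 7, 3 and 1 days out.
WARN_DAYS = (1, 3, 7, 30)


def warning_level(days_left: float) -> Optional[int]:
    """Return the tightest threshold crossed, or None if no warning is due yet."""
    for threshold in sorted(WARN_DAYS):
        if days_left <= threshold:
            return threshold
    return None


if __name__ == "__main__":
    for days in (45, 30, 5, 0.5):
        print(days, warning_level(days))
```

The returned threshold could drive escalation, e.g. an email at 30 days and a page at 1.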

~~~
Jolter
Presumably this certificate was protecting some machine-to-machine connection.
Then, no human would ever have seen those warnings.

------
anilakar
Hardware manufacturers: How do you manage certificates that the customer will
not touch or even see? Do you deliver new CAs alongside firmware updates? Do
your software lifecycle requirements take certificates and cryptography in
general into account?

------
Maven911
If an SSL cert is expired on a website, you don't have access to the secured
website. In this case though, how does it affect Ericsson's telecom equipment
(core routers and switches for mobile and voice)?

------
mtkd
Same risk with domains and DNS - when people question why CSCGlobal or
UltraDNS charge so much compared to cheaper alts - it's because they have your
back when you miss something

~~~
Redsquare
The only way Ultra have your back is ripping the shirt off you in overage
charges

------
cameldrv
Why do certificates expire? How is it acceptable to have a piece of data
somewhere that contains a timebomb that must be periodically defused? I do not
see how this helps security, and it is particularly ridiculous that normal
behavior is that one day the system works normally and is deemed secure, and
the next day it is so insecure and dangerous that communication simply fails.

~~~
jcranmer
Cryptography is basically a computational treadmill: you want to make it cheap
enough that it's not burdensome for the actual users, while keeping it
computationally expensive to reverse the information without the key.
Processing power, especially for the highly parallelizable task of grinding
through potential keys, follows an exponential curve; ergo, even the present
exponential gap is not a long-term protection mechanism.

The expiry date of a certificate was present in the original implementation of
X.509 certificates; the infrastructure for certificate revocation lists was
added later. Furthermore, a certificate revocation list includes _every_
certificate ever revoked that has yet to expire. If you relied on certificate
revocation for expiring old certificates as well, that would be a nasty long
file you'd have to download before you can start connecting to any website.

~~~
hamandcheese
I think the much more important reason for expiring certificates is to
minimize the damage in the case that your private keys are stolen, which is
far more likely than them being brute forced.

------
seibelj
An interesting new project is Handshake which is attempting to use
decentralization to remove centralized certificate authorities. Maybe it will
help stop these and similar situations in the future.

[https://handshake.org/](https://handshake.org/)

~~~
aaaaaaaaaab
Why is it better than GPG?

(“Blockchain” is not a valid answer)

~~~
seibelj
The idea is to use economic incentives to keep people honest rather than
trusting centralized authorities which can be more easily compromised

~~~
tjohns
GPG doesn't use a centralized authority. It uses a decentralized "Web Of
Trust" model.

