
Analysis of Today's CenturyLink/Level(3) Outage - danfritz
https://blog.cloudflare.com/analysis-of-todays-centurylink-level-3-outage/
======
pritambarhate
The final of the FIDE Chess Olympiad was going on between India and Russia
when this outage started. Two Indian players lost their connection to
Chess.com at this time and ended up losing their games on time. A lot of drama
ensued. In the end, FIDE declared India and Russia joint gold medalists. The
Russian players were not happy. Also, Armenia had withdrawn from the tournament
after facing a similar disconnection issue against India in the quarterfinals.

Here are a few links if you want to follow the drama:

BBC: Chess Olympiad: India and Russia both get gold after controversial final
[1]

YouTube: Joint Gold for Team India and Russia at the Online Olympiad 2020 |
Full story [2]

FINALE!! INDIA vs RUSSIA CHESS OLYMPIAD LIVE STREAM [3]: Actual live stream of
the final. The actual incident starts around 01:56:00

[1]
[https://www.bbc.com/news/world-53965748](https://www.bbc.com/news/world-53965748)

[2]
[https://www.youtube.com/watch?v=YgZAVOUmWcg](https://www.youtube.com/watch?v=YgZAVOUmWcg)

[3]
[https://www.youtube.com/watch?v=kJLyFSVuRnk](https://www.youtube.com/watch?v=kJLyFSVuRnk)

~~~
tzs
I don't know how these events are conducted, but I'm assuming that it is _not_
simply each player playing from home unsupervised. That would make it way too
hard to prevent cheating. So I assume that there is a tournament official
onsite at each player location watching, just like there would be for an
in-person tournament.

If that is the case, there are a couple of reasonable ways to handle this.

One is to bring back something that used to be common in high level chess: the
adjournment.

This used to be common when most championships used time controls that were
slow enough that you often would not finish the game in one play session.

The way it worked is that at the end of the play session, the arbiter calls
for an adjournment. When the player on the move decides their next move they
write it down on a piece of paper rather than actually making it on the board,
the move is sealed in an envelope which is kept by the arbiter, and play
stops.

When it is time to resume, the arbiter opens the envelope, plays the sealed
move on the board, and starts the clocks.

Both players are free to analyze as much as they want between play sessions,
and can get outside help. For world championships, especially back when it was
Fischer vs. Spassky or Korchnoi vs. Karpov and the match was serving as a
proxy for the cold war, each player would have whole teams of top GMs to help
analyze during an adjournment.

The player who sealed the move has the advantage of knowing for certain what
position will be on the board when the game resumes. On the other hand, the
other player is going to be the first one to have the move after several GMs
have pulled an all-nighter analyzing it for them so any inaccuracy in the
sealed move is much more likely to be punished than it would have been without
adjournment.

Or get a freaking modem. Exchanging chess moves does not require high
bandwidth, and most of these internet outages do not take out phone service.

Do a mini-adjournment (seal the move, but keep the players on site), establish
a dial-up connection between the playing sites, and then unseal the move and
resume.
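
To put a rough number on the bandwidth claim, here is a minimal sketch that
packs one move into two bytes. The encoding (6 bits per square plus a
promotion code) and all function names are assumptions made for illustration,
not any chess site's actual wire format, and it ignores clock sync and
protocol framing.

```python
FILES = "abcdefgh"
# Hypothetical promotion codes; index 0 means "no promotion".
PROMOTIONS = ["", "n", "b", "r", "q"]


def square_to_index(square: str) -> int:
    """Map a square like 'e4' to 0..63 (a1 = 0, h8 = 63)."""
    return FILES.index(square[0]) + 8 * (int(square[1]) - 1)


def index_to_square(index: int) -> str:
    """Inverse of square_to_index."""
    return FILES[index % 8] + str(index // 8 + 1)


def encode_move(move: str) -> bytes:
    """Pack a coordinate move such as 'e2e4' or 'e7e8q' into 2 bytes."""
    src = square_to_index(move[0:2])
    dst = square_to_index(move[2:4])
    promo = PROMOTIONS.index(move[4:])
    # Layout: 3 bits promotion | 6 bits source | 6 bits destination.
    return ((promo << 12) | (src << 6) | dst).to_bytes(2, "big")


def decode_move(data: bytes) -> str:
    """Unpack 2 bytes produced by encode_move back into coordinate notation."""
    value = int.from_bytes(data, "big")
    promo = value >> 12
    src = (value >> 6) & 0x3F
    dst = value & 0x3F
    return index_to_square(src) + index_to_square(dst) + PROMOTIONS[promo]


def game_payload_bytes(plies: int) -> int:
    """Move data for a whole game, at 2 bytes per half-move."""
    return 2 * plies
```

Under these assumptions a 60-move game (120 half-moves) comes to 240 bytes of
move data in total, which is small enough for even a very slow dial-up link.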

~~~
timsally
Generally in professional online tournaments there is not a tournament
official onsite. Players are competing from their homes. Instead, tournaments
require several Zoom sessions from multiple angles in an attempt to verify
that neither the player's computer nor a secondary device is being used to
cheat. These Zoom sessions are monitored by tournament officials.

On top of this, commercial chess websites have extensive anti-cheating
measures that are used to analyze the games after the fact. For example, one
of the major players has 5+ engineers and several strong chess players on
their anti-cheating team. These teams have caught professional players
cheating a surprising number of times. Being caught results in a lifetime ban
from the chess website and there are often consequences for the player in real
life as well.

I don't really buy the idea of an adjournment. Are you supposed to have one
every time a connection problem happens? Historically players knew at what
point in the game an adjournment would happen. Having them happen at random
throughout the game would change the whole dynamic.

Your idea of a cellular connection as backup is an excellent one. I think the
first commercial chess website to implement a turnkey way for players to
use one will see huge returns from that investment.

~~~
hinkley
Every time this conversation comes up, I'm reminded of a world-building
subplot in one of Vernor Vinge's first books. Instead of banning computers in
chess, let the competitor use a computer that they built themselves, so it's
one augmented human versus another augmented human.

In his world, there were no supercomputers elsewhere, so you didn't have to
worry about covert channels phoning home to a much bigger computer. I suppose
you could put everyone in a Faraday cage...

~~~
loneranger_11x
which book is this? and would you recommend?

~~~
hinkley
I'm fairly sure it's The Peace War, which has a sequel (Marooned in Realtime)
where the concept of the Technological Singularity is introduced.

I get a little salty about Ray Kurzweil getting the credit for a concept that
Vernor Vinge had already published in 1986.

------
achiang
Google networking SRE here (my team runs ns[1-4].google.com among other
services).

Regardless of original intent, the blog doesn't land well with me. It could
have provided the background on flowspec, using their own past outage as a
case study, without any of the speculation or blameyness that came across
here. The #hugops at the end reads quite disingenuously.

We see other networks break all the time and we often have pretty good guesses
as to why. But I personally would never sign off on a public blog speculating
on a WAG of why someone _else's_ network went down. That's uncouth.

~~~
badrabbit
I think you're stuck on the politics. Level3 is their competition but
initially CF was blamed. CF owes it to their customers and investors to
explain to them why they had an outage and how they responded to it, and they
do not need to talk in detail about an unrelated past incident (just because it
was related to flowspec does not mean it was a similar outage), and they
certainly should not wait for Level3's investigation.

I would expect Google to have a similar explanation if a significant number of
GCP customers faced an outage.

You should know, it wasn't just someone else's network that went down, that
network brought down a big chunk of the internet with it. I think technical
honesty comes before political appearances. The #hugops and mention of their
past experience with a flowspec outage is clearly there to signal that the
blogpost is not there for blaming or making L3 look bad.

~~~
achiang
The politics is exactly the point of my comment.

The professional way to write a blog post like this is from your own
perspective. Identify the proximate cause (the peer), name names if you must,
talk about how awesome your own systems are, show some of your monitoring if
you like, and talk about what you'll do in the future to be even more
resilient to this class of problems.

That's all to the good and much of Cloudflare's blog was exactly that.
Would've been fine if they left it like that.

Acknowledging there is no postmortem (yet) but then pointlessly speculating
about what it _might_ contain is what I have a problem with.

I don't speak for Google but if I found out we had written a post like this, I
would speak up and advocate to change it.

~~~
badrabbit
There is nothing professional about avoiding a topic for the sake of
appearances. Level3 put out details knowing others in the industry will
discuss and speculate based on that information. They could have withheld
details such as flowspec and edge routers bouncing, but they did not. It's
perfectly professional to discuss speculative details of someone else's outage
that affected your customers based on details they chose to make public.

In infosec for example, it's extremely common to speculate about a
vulnerability based on details in the CVE. Entire news articles are based on
such speculation. Like I said, you are giving too much weight to optics and
appearances. I would like to see anyone actually at Level3 complain about this
post.

------
kryogen1c
sometimes people remark at how extensive ancient civilizations became with such
simple technology, yet here we are. billions of people being served by things
like BGP and SS7.

as i get older, i become more and more concerned with humanity's lack of fault
tolerance.

bret weinstein clued me into this as an evolutionary phenomenon. if a gene
activates a short-term solution and long-term problem, that gene is likely to
be favored.

how do we transcend this problem that seems to be inherent with existing? a
first-principles problem?

~~~
CloudNetworking
If anything, the kinds of issues we see popping up here and there are proof
of the high availability of the Internet and specifically how protocols such
as BGP helped make it what it is today.

It is not that we have built the Internet despite BGP. We have built it thanks
to BGP. If we didn't have BGP we would have to invent it :)

~~~
alexchamberlain
Isn’t the point that BGP is great in the same way as the Model T was great? No
one is saying it wasn’t needed or - to a certain extent - doesn’t do the job,
but given recent (and not so recent) improvements to technology and security
standards, maybe we need a BGP 2.0?

~~~
PietjePukster
BGP 2.0, like self-driving cars, is five years away.

For the last 20 years..

~~~
Aperocky
This makes total sense because humans are inherently lazy. Hurd would be out
in production in 199x if not for Linux. But it's still being worked on in
2020.

~~~
phone8675309
> humans are inherently lazy

> Hurd would be out in production in 199x if not for Linux.

I think Hurd is not a good example of this. Hurd being sidelined seems to me
to be a result of bikeshedding (which microkernel to use) and realizing that
Linux (as a kernel) had more effort being poured into it because it had more
mindshare.

> But it's still being worked on in 2020.

At more or less a leisurely pace as a passion project more than the end goal
being production, precisely because the social goal that it was trying to
achieve has been mostly achieved by the Linux kernel.

------
indigodaddy
I love Cloudflare’s writeups and Cloudflare in general, and while this was
again well written and excellent analysis, it contained a tad bit too much
speculation for my taste.

~~~
jabroni_salad
This is probably as good as it gets for outside analysis. I doubt we will
learn much more until CL/L3 posts their own post-mortem with all the data.

~~~
indigodaddy
Didn’t say it wasn’t good; it was. I just think it was slightly on the
inappropriate side in terms of vendor relationships. I mean obviously CF was
kind of making a point to CTL with the article, I get that too.

------
rossdavidh
Horror thought: if the internet ever "breaks" enough that all access to
StackOverflow is lost, no one will be able to fix anything to get it back up,
and we'll be back to the stone ages to start again.

Kidding. Mostly.

~~~
j8014
As a self-taught teen in the early '90s, there was only one correct answer to
every question: RTFM. lol. Man, that sucked.

------
EE84M3i
Did other providers have similar issues to Cloudflare? I only noticed
Cloudflare sites being particularly down, not other CDNs, but maybe that was
selection bias?

~~~
rwky
I couldn't connect to anything in the US (I'm in the UK), which broke a large
chunk of the internet, mainly due to being unable to access CloudFront.

~~~
mcspiff
That’s surprising given CloudFront has a decent number of Points of Presence
in London/Europe.

------
7demons
That technician who pushed this rule... Man, what a story he will have to tell
his grandchildren.

------
system2
I wish we could read an analysis from CenturyLink. Their status page doesn't
have anything useful: [https://status.ctl.io/](https://status.ctl.io/)

~~~
davio
That's CenturyLink Cloud (formerly Tier 3) - doesn't really have anything to
do with CTL/Level 3 backbone.

------
chkaloon
BGP and its associated tools are beginning to look like a national, actually
global, systemic risk.

------
_-___________-_
Typo in title

~~~
danfritz
Fixed!

~~~
chrismorgan
I still see “Cloudlfare”.

------
malwarebytess
Bit off topic. Does anyone know the reason why CenturyLink doesn't operate in
California?

~~~
mxmasster
CenturyLink “residential” is limited to territories where they are the ILEC.

CenturyLink “business” is everywhere in CA and a large provider.

