
Level 3 Global Outage - dknecht
https://puck.nether.net/pipermail/outages/2020-August/013187.html
======
dz0ny
Summary: On August 30, 2020 10:04 GMT, CenturyLink identified an issue to be
affecting users across multiple markets. The IP Network Operations Center
(NOC) was engaged, and initial research identified that an offending flowspec
announcement prevented Border Gateway Protocol (BGP) from establishing across
multiple elements throughout the CenturyLink Network. The IP NOC deployed a
global configuration change to block the offending flowspec announcement,
which allowed BGP to begin to correctly establish. As the change propagated
through the network, the IP NOC observed all associated service affecting
alarms clearing and services returning to a stable state.

Source
[https://puck.nether.net/pipermail/outages/2020-August/013229...](https://puck.nether.net/pipermail/outages/2020-August/013229.html)

~~~
kitteh
Flowspec strikes again.

It's a super useful tool if you want to blast out an ACL across your network in
seconds (using BGP) but it has a number of sharp edges. Several networks,
including Cloudflare, have learned what it can do. I've seen a few networks
basically blackhole traffic or even lock themselves out of routers due to
poorly made Flowspec rules or a bug in the implementation.

~~~
parliament32
Is "doing what you ask" considered a sharp edge? Network-related tools don't
really have safeties, ever (your linux host will happily "ip rule add 0
blackhole" without confirmation). Every case of flowspec shenanigans in the
news has been operator error.

~~~
mrguyorama
It's possible that if a tool allows you to destroy everything with a single
click, that tool (or maybe process) is bad

------
kitteh
Massive reconvergence event in their network, causing edge router BGP sessions
to bounce (due to CPU). Right now all their big peers are shutting down
sessions with them to give Level3's network the ability to reconverge. Prefixes
announced to 3356 are frozen on their route reflectors and not getting
withdrawn.

Edit: if you are a Level3 customer shut your sessions down to them.

~~~
beagle3
History doesn't repeat, but it rhymes ....

There was a huge AT&T outage in 1990 that cut off most US long distance
telephony (which was, at the time, mostly "everything not within the same area
code").

It was a bug. It wasn't a reconvergence event, but it was a distant cousin:
Something would cause a crash; exchanges would offload that something to other
exchanges, causing them to crash -- but with enough time for the original
exchange to come back up, receive the crashy event back, and crash again.

The whole network was full of nodes crashing, causing their peers to crash, ad
infinitum. In order to bring the network back up, they would have needed to
take everything down at the same time (and make sure all the queues were
emptied), but even that wouldn't have made it stable, because a similar
"patient 0" event would have brought the whole network down again.

Once the problem was understood, they reverted to an earlier version which
didn't have the bug, and the network re-stabilized.

The lore I grew up on is that this specific event was very significant in
pushing and funding research into robust distributed systems, of which the
best known result is Erlang and its ecosystem - originally built, and still
mostly used, to make sure that phone exchanges don't break.

[0]
[https://users.csc.calpoly.edu/~jdalbey/SWE/Papers/att_collap...](https://users.csc.calpoly.edu/~jdalbey/SWE/Papers/att_collapse)

~~~
phkahler
Contrary to what that link says, the software was not thoroughly tested.
Normal testing was bypassed - per management request after a small code
change.

This was covered in a book (perhaps Safeware, but maybe another one I don't
recall) along with the Therac 25, the Ariane V, and several others.
Unfortunately these lessons need to be relearned by each generation. See the
737-Max...

~~~
jacquesm
> Normal testing was bypassed - per management request after a small code
> change.

That lesson will really never be learned. This happens on a daily basis all
over the planet with people who have not been bitten - yet.

~~~
eru
That's why the most reliable way to instil this lesson is to instil it into
our tools. Automate as much testing as possible, so that bypassing the tests
becomes more work than running them.

~~~
monkpit
Until a manager is told about how hard the automation makes it to accomplish
their goal...

~~~
eru
You need buy-in to automation at a high enough level.

If a team manager at eg Google was complaining about how automation gets in
the way and wanted to bypass it, they wouldn't last too long.

------
emilstahl
CenturyLink/Level3 on Twitter: "We are able to confirm that all services
impacted by today’s IP outage have been restored. We understand how important
these services are to our customers, and we sincerely apologize for the impact
this outage caused."

[https://twitter.com/CenturyLink/status/1300089110858797063](https://twitter.com/CenturyLink/status/1300089110858797063)

~~~
ystad
I hope they provide a root cause analysis

~~~
colde
Based on experience it will probably not be public, or at least very limited.

But customers are likely to get one, at least if they request it.

~~~
rootsudo
Being it was pretty big, they'll probably make it public.

------
MLij
India just lost to Russia in the final of the first-ever online chess olympiad,
probably due to connection issues of two of its players. I wonder if it's
related to this incident and if the organizers are aware. Edit: the organizers
are aware, and Russia and India have now been declared joint winner.

~~~
redwood
Interesting. How would connection issues cause them to lose? Was it a timed
round?

~~~
colinbartlett
Related: World champion Magnus Carlsen recently resigned a match after 4 moves
as an act of honor because in his previous match with the same opponent,
Magnus won solely due to his opponent having been disconnected.

~~~
repiret
His opponent, Ding Liren, is from China, and has been especially plagued by
unreliable internet since all the high level chess tournaments have moved
online. He is currently ranked #3, behind Magnus Carlsen and Fabiano Caruana.

------
suby
I was doing development work which uses a server I've got hosted on digital
ocean. I started getting intermittent responses which I thought weird as I
hadn't changed anything on the server. I spent a good ten minutes trying to
debug the issue before searching for something on duckduckgo, which also
didn't respond. Cloudflare shouldn't be involved at all with my little site, so
I don't think it's limited to just them.

~~~
one2know
Yeah, something happened to ipv4 traffic worldwide. Don't see how that could
happen.

~~~
pps43
Let me guess: somebody misconfigured BGP again?

~~~
johnisgood
[https://puck.nether.net/pipermail/outages/2020-August/013198...](https://puck.nether.net/pipermail/outages/2020-August/013198.html)

------
mikiem
M5 Hosting here, where this site is hosted. We just shut down 2 sessions with
Level3/CenturyLink because the sessions were flapping and we were not getting
a complete full route table from either session. There are definitely other
issues going on on the Internet right now.

~~~
exikyut
Oooh, maybe that's why HN wasn't working for me a little while ago (from
AU)...

------
eastdakota
Analysis of what we saw at Cloudflare, how our systems automatically mitigated
the worst of the impact to our customers, and some speculation on what may
have gone wrong:
[https://blog.cloudflare.com/analysis-of-todays-centurylink-l...](https://blog.cloudflare.com/analysis-of-todays-centurylink-level-3-outage/)

~~~
ngold
Great write up. It is embarrassing that most of America has no competition in
the market.

>To use the old Internet as a “superhighway” analogy, that’s like only having
a single offramp to a town. If the offramp is blocked, then there’s no way to
reach the town. This was exacerbated in some cases because
CenturyLink/Level(3)’s network was not honoring route withdrawals and
continued to advertise routes to networks like Cloudflare’s even after they’d
been withdrawn. In the case of customers whose only connectivity to the
Internet is via CenturyLink/Level(3), or if CenturyLink/Level(3) continued to
announce bad routes after they'd been withdrawn, there was no way for us to
reach their applications and they continued to see 522 errors until
CenturyLink/Level(3) resolved their issue around 14:30 UTC. The same was a
problem on the other (“eyeball”) side of the network. Individuals need to have
an onramp onto the Internet’s superhighway. An onramp to the Internet is
essentially what your ISP provides. CenturyLink is one of the largest ISPs in
the United States. Because this outage appeared to take all of the
CenturyLink/Level(3) network offline, individuals who are CenturyLink
customers would not have been able to reach Cloudflare or any other Internet
provider until the issue was resolved. Globally, we saw a 3.5% drop in global
traffic during the outage, nearly all of which was due to a nearly complete
outage of CenturyLink’s ISP service across the United States.

------
lemiffe
I had this earlier! A bunch of sites were down for me, I couldn't even connect
to this site.

The problem is I don't know where to find what was going on (tried looking up
live DDoS-tracking websites, "is it down or is it just me" websites, etc.). I
couldn't find a single place talking about this.

Is there a source where you can get instant information on Level3 / global DNS
/ major outages?

~~~
kitteh
DDoS tracking sites are eye candy and garbage. Stop using them.

Outages and nanog lists are your best bet, short of being on the right IRC
channels.

~~~
xwdv
What are the right IRC channels?

~~~
dudus
I believe these are mostly non public channels where backbone and network
infrastructure engineers from different companies congregate to discuss
outages like this.

~~~
rolph
also channels where hats of various type discuss advantages opportunities and
challenges presented by such outages

~~~
xwdv
Which channels

~~~
ficklepickle
They wouldn't be non-public if they told us plebs

~~~
rolph
please dont call yourself that its more like i [and others] are hyper paranoid
and marginal in behavior due to the nature of pastimes [i myself can promise
you that im not malicious but i cant speak for others, i would leave it up to
them to speak for themselves]

------
aosaigh
Has anyone any good resources for learning more about the "internet-level"
infrastructure affected today and how global networks are connected?

~~~
q3k
Unfortunately, this infrastructure is at an uncanny intersection of
technology, business and politics.

To learn the technical aspect of it, you can follow any network engineering
certification materials or resources that delve into dynamic routing
protocols, notably BGP. Inter-ISP networking is nothing but setting up BGP
sessions and filters at the technical level. Why you set these up, and under
what conditions is a whole different can of worms, though.

The business and political aspect is a bit more difficult to learn without
practice, but a good simulacrum can be taking part in a project like dn42, or
even just getting an ASN and some IPv6 PA space and trying to announce it
somewhere. However, this is no substitute for actual experience running an
ISP, negotiating percentile billing rates with salespeople, getting into IXes,
answering peering requests, getting rejected from peering requests, etc. :)

Disclaimer: I helped start a non-profit ISP in part to learn about these
things in practice.

~~~
akritrime
What resources can I follow to start a non-profit ISP? I want to start one in
my hometown for students who couldn't afford internet to join online classes.

~~~
dboreham
Why not just raise money to pay for service from for-profit providers? Much
more efficient use of donation funds.

~~~
akritrime
Hmm, I actually didn't think about that at all. I guess I got too fascinated
by this video[0] and wanted to apply something similar to our current
scenario.

[0]: [https://youtu.be/lEplzHraw3c](https://youtu.be/lEplzHraw3c)

------
Yetanfou
Odd, I'm trying to reach a host in Germany (AS34432) from Sweden but get
rerouted Stockholm-Hamburg-Amsterdam-London-Paris-London-Atlanta-São Paulo
after which the packets disappear down a black hole. All routing problems
occur within Cogentco.

    
    
        3  sth-cr2.link.netatonce.net (85.195.62.158) 
        4  te0-2-1-8.rcr51.b038034-0.sto03.atlas.cogentco.com 
        5  be3530.ccr21.sto03.atlas.cogentco.com (130.117.2.93)
        6  be2282.ccr42.ham01.atlas.cogentco.com (154.54.72.105)  
        7  be2815.ccr41.ams03.atlas.cogentco.com (154.54.38.205) 
        8  be12194.ccr41.lon13.atlas.cogentco.com (154.54.56.93)   
        9  be12497.ccr41.par01.atlas.cogentco.com (154.54.56.130)  
       10  be2315.ccr31.bio02.atlas.cogentco.com (154.54.61.113)  
       11  be2113.ccr42.atl01.atlas.cogentco.com (154.54.24.222)  
       12  be2112.ccr41.atl01.atlas.cogentco.com (154.54.7.158)
       13  be2027.ccr22.mia03.atlas.cogentco.com (154.54.86.206)
       14  be2025.ccr22.mia03.atlas.cogentco.com (154.54.47.230)
       15  * level3.mia03.atlas.cogentco.com (154.54.10.58) 
       16  * * *
       17  * * *

~~~
cotillion
What seems to have happened is that CenturyLink's internal routing has
collapsed in some way. But they're still announcing all routes and they don't
stop announcing routes when other ISPs tag their routes not to be exported by
CenturyLink.

So as other providers shut down their links to CenturyLink to save themselves,
the outgoing packets towards CenturyLink travel to some part of the world
where links are not shut down yet.

------
vld
I'm having issues reaching IP addresses unrelated to Cloudflare. Based on some
traceroutes, it seems AS174 (Cogent) and AS3356 (Level 3) are experiencing
major outages.

~~~
jbotz
Is there any one place that would be a good first place to go to check on
outages like this?

It would be really cool and useful to have a "public Internet health
monitoring center"... this could be a foundation that gets some financing from
industry that maintains a global internet health monitoring infrastructure and
a central site at which all the major players announce outages. It would be
pretty cheap and have a high return on investment for everybody involved.

~~~
thejosh
Until that site also goes down.

~~~
lioeters
Indeed, if we're to have a public Internet health meter, it must be
distributed and hosted/served from "outside" somehow, to be resilient to all
or parts of the network being down.

~~~
johnisgood
Here's a thought: we should all be outside. :D

------
Benjamin_Dobell
This explains a lot. Initially thought my mobile phone Internet connectivity
was flakey because I couldn't access HN here in Australia, whilst it's fine
over wi-fi (wired Internet).

~~~
abhishekjha
It's the reverse for me. The broadband fails to connect to HN but my mobile ISP is
able to reach it fine.

~~~
willis936
Same for me in midwest US.

I first thought I had broken my DNS filter again through regular maintenance
updates, then I suspected my ISP/modem because it regularly goes out. I have
never seen the behavior I saw this morning: some sites failing to resolve.

~~~
bmlzootown
I thought Cloudflare was having issues again, since I use their DNS servers,
so I started by changing that. Then I tried restarting everything,
modem/router/computer. Wasn't until I connected to a VM that a friend hosts
that I was finally able to access HN, and thus saw this thread.

Hopefully this will get fixed within a reasonable timespan.

------
iso1210
Looks like Centurylink/Level3 (as3356) might not be withdrawing routes after
people close their peering?

~~~
regolithori
What could cause this? I wonder what the technical problem is.

~~~
jcims
I would love to hear the inside scoop from folks working at CenturyLink. I’ve
used their DSL for years and the network is a mess. I don’t know if it's them
here or legacy Level3 but i have a guess.

Edit: Looks like i would have guessed wrong :P. Still want that inside scoop!

~~~
iso1210
Used level3 IP for a long time professionally with limited issues, certainly
not on the list of worst ISPs.

Also used a company that over the years has gone from Genesis, GlobalCrossing,
Vyvx, Level3 and now of course Level 3 is CenturyLink, which has been fine.

------
bregma
Misread the headline as "Level 3 Global Outrage" and thought "someone had
defined outrage levels?" and "it doesn't matter, he'll just attribute it to
the Deep State".

In some ways I'm a little bit disappointed it's only a glitch in the internet.

------
_eigenfoo
Can somebody please clarify - what exactly is this an outage of, and how
serious is it?

~~~
g105b
Is this affecting all geographic regions?

~~~
dredmorbius
US, Europe, and Asia that I'm aware of (NANOG mailing list).

------
mikro2nd
Had to laugh: "I'm seeing complaints from all over the planet on Twitter"

The one site I can't see is Twitter. (Not a heart-wrenching loss, mind you...)

~~~
quickthrower2
I could not get on HN as a logged in person (logged out was OK) during this. I
wondered how big the cloudflare thread would be if people could get on to
comment on it :-)

------
emilstahl
CNN just blames Cloudflare.. :facepalm:
[https://edition.cnn.com/2020/08/30/tech/internet-outage-clou...](https://edition.cnn.com/2020/08/30/tech/internet-outage-cloudflare/index.html)

~~~
ihatecloudflare
CNN is absolutely right. Every day I read news that something goes down at
CloudFlare. CloudFlare do much more harm than they "fix" with their services.

------
dathinab
I guess that's why HN was temporarily unreachable from my home?

~~~
protomyth
and why Cloudflare was having so many issues
[https://www.cloudflarestatus.com/](https://www.cloudflarestatus.com/)

------
jetru
Oh lord. I'm oncall and we were like "WHATS HAPPENING"

~~~
b3lvedere
Same here :) Couple of companies started complaining. Told them it's a
worldwide issue. It seems to be going better at the moment.

------
iso1210
No peering problems from my network with Level3 in London Telehouse West,
maybe a minute or so of increased latency at 10:09 GMT

Routing to a level3 ISP I have an office in in the states peers with
London15.Level3.net

No problem to my Cogent ISP in the states, although we don't peer directly
with Cogent, that bounces via Telia

Going east from London, a 10 second outage at 12:28:42 GMT on a route that
runs from me, level3, tata in India, but no rerouting.

------
johnchristopher
So, that's why HN is unreachable from Belgium at the moment (right when I was
trying to figure a dns cache problem in Firefox, of course).

An ssh tunnel through OVH/gravelines is working so far. edit: Proximus. edit2:
also, Orange Mobile

~~~
iso1210
HN working for me from the UK on BT, but traceroute showing lots of different
bouncing around and a lot of different hops in the US

    
    
      7  166-49-209-132.gia.bt.net (166.49.209.132)  9.877 ms  8.929 ms
        166-49-209-131.gia.bt.net (166.49.209.131)  8.975 ms
      8  166-49-209-131.gia.bt.net (166.49.209.131)  8.645 ms  10.323 ms  10.434 ms
      9  be12497.ccr41.par01.atlas.cogentco.com (154.54.56.130)  95.018 ms
        be3487.ccr41.lon13.atlas.cogentco.com (154.54.60.5)  7.627 ms
        be12497.ccr41.par01.atlas.cogentco.com (154.54.56.130)  102.570 ms
      10  be3627.ccr41.jfk02.atlas.cogentco.com (66.28.4.197)  89.867 ms
        be12497.ccr41.par01.atlas.cogentco.com (154.54.56.130)  101.469 ms  101.655 ms
      11  be2806.ccr41.dca01.atlas.cogentco.com (154.54.40.106)  103.990 ms  93.885 ms
        be3627.ccr41.jfk02.atlas.cogentco.com (66.28.4.197)  97.525 ms
      12  be2112.ccr41.atl01.atlas.cogentco.com (154.54.7.158)  106.027 ms
        be2806.ccr41.dca01.atlas.cogentco.com (154.54.40.106)  98.149 ms  97.866 ms
      13  be2687.ccr41.iah01.atlas.cogentco.com (154.54.28.70)  120.558 ms  122.330 ms  120.071 ms
      14  be2687.ccr41.iah01.atlas.cogentco.com (154.54.28.70)  123.662 ms
        be2927.ccr21.elp01.atlas.cogentco.com (154.54.29.222)  128.351 ms
        be2687.ccr41.iah01.atlas.cogentco.com (154.54.28.70)  120.746 ms
     15  be2929.ccr31.phx01.atlas.cogentco.com (154.54.42.65)  145.939 ms  137.652 ms
        be2927.ccr21.elp01.atlas.cogentco.com (154.54.29.222)  128.043 ms
      16  be2930.ccr32.phx01.atlas.cogentco.com (154.54.42.77)  150.015 ms
        be2940.rcr51.san01.atlas.cogentco.com (154.54.6.121)  152.793 ms  152.720 ms
      17  be2941.rcr52.san01.atlas.cogentco.com (154.54.41.33)  152.881 ms
        te0-0-2-0.rcr11.san03.atlas.cogentco.com (154.54.82.66)  153.452 ms
        be2941.rcr52.san01.atlas.cogentco.com (154.54.41.33)  152.054 ms
      18  te0-0-2-0.rcr12.san03.atlas.cogentco.com (154.54.82.70)  162.835 ms
        te0-0-2-0.nr11.b006590-1.san03.atlas.cogentco.com (154.24.18.190)  146.643 ms
        te0-0-2-0.rcr12.san03.atlas.cogentco.com (154.54.82.70)  153.714 ms
      19  te0-0-2-0.nr11.b006590-1.san03.atlas.cogentco.com (154.24.18.190)  151.212 ms  145.735 ms
        38.96.10.250 (38.96.10.250)  147.092 ms
      20  38.96.10.250 (38.96.10.250)  149.413 ms * *

~~~
josephb
Guessing the traceroute looks a bit messy because of multiple paths being
available.

You can use `-q 1` to send a single traceroute probe/query instead of the
default 3, it might make your traceroute look a little cleaner.

~~~
iso1210
I don't normally see multi paths for a given IP, but that aside it's bouncing
through far more than I'd expect. That said, it's rare I look at traceroutes
across the continental U.S, maybe that many layer 3 hops are normal, maybe
routes change constantly.

HN has dropped off completely from work - I see the route advertised from
Level 3 (3356 21581 21581) and from Telia and onto Cogent (1299 174 21581
21581). Telia is longer, so traffic goes into Level3 at Docklands via our
20G peer to London15, but seems to get no further.
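
To illustrate why the Level 3 path wins here: all else being equal, BGP prefers the route with the shorter AS path. Below is a minimal Python sketch using the two AS paths quoted above. The function name is made up for illustration, and real best-path selection looks at other attributes (local preference, for example) before it ever compares AS-path length, so treat this as a toy model only.

```python
# Toy illustration of AS-path-length tie-breaking. Real BGP best-path
# selection evaluates other attributes (e.g. local preference) first.

def shortest_as_path(candidates):
    """Return the candidate AS path with the fewest entries.

    `candidates` is a list of AS paths, each a list of AS numbers.
    Prepended ASNs (like the repeated 21581) count toward the length.
    """
    return min(candidates, key=len)


via_level3 = [3356, 21581, 21581]        # learned from Level 3
via_telia = [1299, 174, 21581, 21581]    # learned from Telia, via Cogent

best = shortest_as_path([via_level3, via_telia])
print(best)
```

The catch during this outage is that the shorter path kept being advertised even though traffic sent down it went nowhere.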

Heading to Tata in India, route out is via same peer to level3, then onto
London, Marseille, and then peers with Tata in Marseille, working fine.

My gut feeling is a core problem in Level3's continental US network rather
than something more global.

~~~
RKearney
This is normal for Cogent. They do per-packet load balancing across ECMP
links. What you're seeing is normal for the given configuration.
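
If you want to see that effect in a capture rather than eyeball it, one rough approach is to count the distinct responders per hop. The Python sketch below assumes the common Linux `traceroute` text layout (hop number at the start of a line, continuation lines without one, IPs in parentheses); the function name and regexes are my own, and other traceroute variants may format output differently.

```python
import re

# A hop line starts with the hop number followed by whitespace.
HOP_START = re.compile(r"^\s*(\d+)\s+")
# Responder addresses appear in parentheses after the hostname.
IP_IN_PARENS = re.compile(r"\((\d{1,3}(?:\.\d{1,3}){3})\)")


def responders_per_hop(traceroute_text):
    """Map hop number -> list of distinct responder IPs, in order seen."""
    hops = {}
    current = None
    for line in traceroute_text.splitlines():
        match = HOP_START.match(line)
        if match:
            current = int(match.group(1))
            hops.setdefault(current, [])
        if current is None:
            continue  # skip anything before the first hop line
        for ip in IP_IN_PARENS.findall(line):
            if ip not in hops[current]:
                hops[current].append(ip)
    return hops


# Two hops copied from the traceroute earlier in this thread.
sample = """
  9  be12497.ccr41.par01.atlas.cogentco.com (154.54.56.130)  95.018 ms
    be3487.ccr41.lon13.atlas.cogentco.com (154.54.60.5)  7.627 ms
    be12497.ccr41.par01.atlas.cogentco.com (154.54.56.130)  102.570 ms
 13  be2687.ccr41.iah01.atlas.cogentco.com (154.54.28.70)  120.558 ms  122.330 ms  120.071 ms
"""

hops = responders_per_hop(sample)
for hop, ips in hops.items():
    print(hop, len(ips), ips)
```

A hop with more than one responder just means the probes took different equal-cost paths, which by itself is not a sign of trouble.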

------
wiremine
In a situation like this, what are the best "status" sites to be watching?

~~~
OskarS
HN is not the worst place, honestly.

~~~
Timothycquinn
Agreed. I went to Reddit r/networking and the mods were closing helpful
threads in real-time :(

------
gnyman
This had me really confused until I saw it was a global outage. I have been
getting delayed iOS push notifications (from prowl) now for the last few
hours, from a device I was fairly sure I had disconnected 3 hours ago (a pump)

Got questioning if I really disconnected it before I left.

I'm wondering if we're at the point where internet outages should have some
kind of (emergency) notification/sms sent to _everyone_.

------
dredmorbius
NANOG are talking about a CenturyLink outage and BGP flapping (AS 3356) as of
03:00 US/Pacific, AS209 possibly also affected.

AS3356 is Level 3, AS209 is CenturyLink.

[https://mailman.nanog.org/pipermail/nanog/2020-August/209359...](https://mailman.nanog.org/pipermail/nanog/2020-August/209359.html)

------
ffpip
DDG, down detector are all very slow. Both are on cloudflare.

Fastly, HN, Reddit too.

Only Google domains are loading here.

~~~
thejteam
From where I am (mid-Atlantic US) Google sites are completely down (google.com,
youtube)

------
jlgaddis
> _" Root Cause: An offending flowspec announcement prevented BGP from
> establishing correctly, impacting client services."_

\--

That doesn't really explain the "stuck" routes in their RRs... maybe it'll
make sense once we've gotten some more details...

~~~
quickthrower2
This might be a silly question but is there such a thing as CI/CD for this
sort of thing that may have caught the problem?

~~~
dsr_
There are two aspects to this:

1\. Is there syntax correctness checking available, so you don't push a config
that breaks machines? Yes.

2\. Is there a DWIM check available, so you can see the effect of the change
before committing? No. That would require a complete model of, at a minimum,
your entire network plus all directly connected networks -- that still
wouldn't be complete, but it could catch some errors.
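
As a rough illustration of the "could catch some errors" part: a narrow, targeted check is feasible even when a full model of the network is not. The Python sketch below flags drop rules that would also match BGP's own TCP port. The rule format (a dict with `action`, `protocol`, `dst_ports`) and the function names are invented for this example; this is not any vendor's flowspec syntax or API, and nothing here claims to be what CenturyLink's rule looked like.

```python
BGP_PORT = 179  # TCP port used by BGP sessions


def blocks_bgp(rule):
    """Return True if a (hypothetical) drop rule would match BGP traffic.

    `rule` is a dict with a required "action" key and optional "protocol"
    and "dst_ports" keys. A missing match key means "match anything".
    Only the destination port is checked, so this is deliberately partial.
    """
    if rule["action"] != "drop":
        return False
    protocol = rule.get("protocol")
    if protocol not in (None, "tcp"):
        return False
    ports = rule.get("dst_ports")
    return ports is None or BGP_PORT in ports


def unsafe_rules(rules):
    """Return the rules that should be held back for human review."""
    return [rule for rule in rules if blocks_bgp(rule)]


rules = [
    {"name": "drop-ntp-flood", "action": "drop",
     "protocol": "udp", "dst_ports": [123]},
    {"name": "drop-everything-tcp", "action": "drop", "protocol": "tcp"},
    {"name": "rate-limit-web", "action": "rate-limit",
     "protocol": "tcp", "dst_ports": [80, 443]},
]

flagged = [rule["name"] for rule in unsafe_rules(rules)]
print(flagged)
```

A check like this only catches the mistakes someone thought of in advance, which is the limitation point 2 describes.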

------
based2
[https://status.ctl.io/history/f19a0555-abbd-4038-91cb-b55a76...](https://status.ctl.io/history/f19a0555-abbd-4038-91cb-b55a7645c1f5)

[https://twitter.com/g_bonfiglio/status/1300022993251446785?s...](https://twitter.com/g_bonfiglio/status/1300022993251446785?s=19)

[https://old.reddit.com/r/networking/comments/ijb8tn/global_a...](https://old.reddit.com/r/networking/comments/ijb8tn/global_as3356_level3_outages/)

------
blantonl
Everything to Oracle Cloud's Ashburn US-East location is down.

Their console isn't responding at all and all my servers are unreachable.
Their status console reports all normal though.

~~~
system2
Status pages of the companies are just PR disasters for them. Most of the time
they don't report what's up.

------
tyfon
Seems like "the internet" works again here in Norway. I've been limited to
local sites all day.

Hacker news has been off for several hours for me.

Whatever it was it must have been nasty.

~~~
djxfade
I had the same issue on my fiber connection (Altibox/BKK), however, no
problems on my mobile using 4G (Dipper/Telenor)

~~~
matsemann
I couldn't reach HN on either Altibox or 4g/telenor.

~~~
tyfon
Both altibox and telia 4g were down for me as well.

------
janmo
There is a major internet outage going on. I am using Scaleway they are also
affected. According to Twitter, Vodafone, CityLink and many more are also
affected.

------
gailees
The beginning of WWIII probably looks something like this.

------
vbsteven
I'm having lots of issues with Hetzner machines not being available (and even
the hetzner.com website). Don't know if this is related.

~~~
zepearl
Fyi I'm not having any problems right now with hetzner.com nor hetzner.de - my
own dedicated server hosted at Hetzner datacenter in Germany seems to be
reachable/working as well.

Connecting from Switzerland.

------
vinni2
I had to use a VPN with a US location to post this comment. I am in Europe.

~~~
lucb1e
HN works fine from Germany with Telefonica (O2) and also from the Netherlands
with XS4ALL.

Edit: Somewhere between 14:00 and 14:46Z it also went down from O2; XS4ALL
still works, and O2 can reach XS4ALL.

~~~
minxomat
No luck on T-Mobile

~~~
crizzlenizzle
Yup.

``` Prefix 209.216.230.0/24 BGP as_path 3356 21581 21581 ```

As seen from AS3320.

~~~
minxomat
Even NordVPN to the nearest German hub is screwed. Have to vpn to the US to
access HN.

~~~
lucb1e
I see a _lot_ of ads for NordVPN, but you should know they're not necessarily
reliable. Just look for NordVPN on hacker news search:
[https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...](https://hn.algolia.com/?dateRange=all&page=0&prefix=true&query=nordvpn&sort=byPopularity&type=story)
(see e.g. the second hit:
[https://news.ycombinator.com/item?id=21664692](https://news.ycombinator.com/item?id=21664692)
covering up security issues, using your connection to proxy other people's
traffic, a related company does data mining...). The only VPN that seemed to
fit the bill when I looked for one about a year ago was ProtonVPN, but I
certainly didn't manage to look at every VPN on the planet and I'm just a
random internet stranger so... take that with a grain of salt.

~~~
minxomat
I know. But they are required to unlock streaming services. I’m not using them
for privacy or even normal traffic.

~~~
lucb1e
Alright, just making sure. Happy to hear you're an informed netizen :)

------
osipovas
A service I run on Digital Ocean was affected by this early this morning.
Looks like it was mitigated by DO - so I'm very grateful for that. Although,
the service I run is time sensitive so failures like this are pretty
unfortunate for me. Where would I get started with building in redundancy
against these sort of outages?

------
naringas
seems like the internet in 2020 has a diminished ability to route around
damage

~~~
sp332
BGP has always had this issue. It depends on trustworthy information being
available. Any trusted source who starts lying (or just screws up) is going to
cause routing problems.

~~~
salawat
Note, trustworthiness stops being a technical problem and becomes a
human/people problem. Level 8 as someone mentioned, or GIGO
(Garbage-In-Garbage-Out) as others may know it.

To safely use a system, your operator needs to be 10% smarter than the system
being operated. It is clear that we have problems in that department with
certain AS's. This is about, what, the third major outage attributed to
CenturyLink in the last handful of years? I have no idea what exactly their
process must look like, but good heavens, a better look needs to be taken, as
this is becoming a bit regular for my tastes.

------
tambre
Fastly is also seeing problems. [0]

However, they report that they've identified the issue and are fixing it.

[0]: [https://status.fastly.com/](https://status.fastly.com/)

------
xyst
Internet infrastructure is broken.

Why do a few companies control the backbone of the internet? Shouldn’t there
be a fallback or disaster recovery plan if one or more of these companies
become unavailable?

~~~
kzrdude
Why doesn't stuff just route around this automatically, if one provider has
problems?

~~~
johncolanduoni
The problem is the provider having problems is still sending misconfigured
routes after the other providers have tried to pull them in response to the
outage. So it’s as if CenturyLink was doing a massive BGP attack against their
peers, pointing at a black hole.
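
A toy way to picture the stuck-route behaviour: two route tables receive the same announce and withdraw, but one of them ignores the withdraw. This Python sketch is purely illustrative; the class and flag names are made up, and it models none of the real BGP machinery. The prefix and origin AS are the ones quoted elsewhere in this thread for HN.

```python
class ToyRouteReflector:
    """Minimal stand-in for a route table that may ignore withdrawals."""

    def __init__(self, honors_withdrawals=True):
        self.honors_withdrawals = honors_withdrawals
        self.routes = {}  # prefix -> AS number that announced it

    def announce(self, prefix, origin_as):
        self.routes[prefix] = origin_as

    def withdraw(self, prefix):
        if self.honors_withdrawals:
            self.routes.pop(prefix, None)
        # Otherwise the stale entry stays and keeps being advertised.

    def advertised(self):
        return sorted(self.routes)


healthy = ToyRouteReflector(honors_withdrawals=True)
stuck = ToyRouteReflector(honors_withdrawals=False)

for reflector in (healthy, stuck):
    reflector.announce("209.216.230.0/24", 21581)
    reflector.withdraw("209.216.230.0/24")

print(healthy.advertised())
print(stuck.advertised())
```

The second table still advertises the prefix, so anyone who prefers that path keeps sending traffic into it.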

------
danecek099
Even [https://downdetector.com/](https://downdetector.com/) has problems
loading for me. Middle Europe *internetweathermap is down

~~~
neuronic
Who watches the Watchmen...

------
hkc
Chess.com was down due to the outage and some of the Indian players got
disconnected and lost on time, so FIDE declared India-Russia joint winner of
the Online Chess Olympiad 2020.

------
eric_khun
Shameless plug:

I spent too much precious time when github/npm/cloudflare were going down,
until I figured out it was them.

So I'm currently working on a project[1] to monitor all the 3rd party stack you
use for your services. Hit me up if you want access; I'll give free access for
a year+ to some folks to get feedback.

[1] [https://monitory.io](https://monitory.io)

~~~
naavis
Maybe fix this typo? "Save titme on issues investigation"

~~~
kzrdude
And > Monitor all 3rd parties services

*3rd party services or possibly 3rd parties' services

~~~
eric_khun
Thank you, fixed it, I definitely didn't pay attention to this

Now wondering if it impacted conversion rate?

~~~
Axsuul
Your landing page doesn't build enough trust. You have run-on sentences. It's
still unclear what the service does.

------
emilstahl
Cloudflare status page: Update - Major transit providers are taking action to
work around the network that is experiencing issues and affecting global
traffic.

We are applying corrective action in our data centers as the situation changes
in order to improve reachability Aug 30, 14:26 UTC

[https://www.cloudflarestatus.com](https://www.cloudflarestatus.com)

------
rantanplan
Incidentally I can't connect to HN directly from Greece, but only if I use my
VPN through New York. Probably somehow related?

------
RedShift1
Ironically this page doesn't load for me

------
Cyphase
I just experienced HN down for several minutes before it loaded and I saw this
story at the top.

I'm doing something with the HN API as I type this, so for a moment I was
trying to decide if I'd been IP blocked, even though the API is hosted by
Firebase.

I haven't noticed any obvious issues elsewhere yet.

(Just got a delay while trying to submit this comment.)

------
redwood
Could this be a Russia move vis a vis today's expected Belarus protests?

(I hope this doesn't mean a violent crackdown is imminent)

Oy
[https://mobile.twitter.com/HannaLiubakova/status/13000645356...](https://mobile.twitter.com/HannaLiubakova/status/1300064535697555456)

~~~
badrabbit
I don't see any bgpmon alerts, that's unlikely.

------
haunter
I'm in Hungary, EU. My fiber works fine, but on 4G I can't connect to anything
except domestic addresses.

------
gnicholas
Can anyone help me understand why I can't access HN from my iPhone, but I can
from my computer? Both are on the same network. I'm getting "Safari cannot
open the page because the server cannot be found", and many apps won't work at
all either.

~~~
lmm
One might be using IPv6 and the other v4. Or you might have different DNS
settings.
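
One rough way to check the first guess is to compare which address families a
name resolves to on each device. A minimal Python sketch, assuming only the
standard library; `families_for` and its `resolver` parameter are names made up
here for illustration, not part of any tool mentioned in this thread:

```python
import socket

def families_for(host, port=443, resolver=socket.getaddrinfo):
    """Group the addresses a hostname resolves to by IP version.

    `resolver` defaults to the system resolver; it is injectable (an
    assumption of this sketch) so the logic can be checked offline.
    """
    found = {"IPv4": [], "IPv6": []}
    for family, _type, _proto, _canon, sockaddr in resolver(
        host, port, type=socket.SOCK_STREAM
    ):
        if family == socket.AF_INET:
            label = "IPv4"
        elif family == socket.AF_INET6:
            label = "IPv6"
        else:
            continue  # ignore any other address family
        address = sockaddr[0]
        if address not in found[label]:
            found[label].append(address)
    return found

# Usage (needs working DNS): families_for("news.ycombinator.com")
```

If one device gets IPv6 addresses and the other only IPv4, or the two devices
get different answers altogether, that points at the address family or at DNS
rather than at the site itself.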

------
one2know
Based on twitter, the outage was on multiple continents. What would cause
that? Subsea cable broken?

------
stordoff
It wasn't a total outage for the site I was trying to reach. It took about 20
minutes to make an order, but after multiple retries (errors were reported as
a 522 with the problem being somewhere between Manchester, UK and the host),
it did go through.

------
nottorp
I have two pipes from two different (consumer) ISPs at home. One can reach HN,
the other can't.

Incidentally, uBlock Origin seems to be completely broken. Doesn't it have any
local blacklists to work with when their ?servers? are unavailable?

------
tpmx
From the other (Cloudflare) thread (post:
[https://news.ycombinator.com/item?id=24322603](https://news.ycombinator.com/item?id=24322603)),
the outages list
([https://puck.nether.net/mailman/listinfo/outages](https://puck.nether.net/mailman/listinfo/outages)).

[https://puck.nether.net/pipermail/outages/2020-August/thread...](https://puck.nether.net/pipermail/outages/2020-August/thread.html)

Not a network engineer, but based on the comments there it looks like it's a
BGP blackhole incident.

Edit: removed details about the similarity to a 1997 incident based on input
from commenters.

~~~
jsjohnst
> Not a network engineer, but based on the comments there it looks like it's a
> BGP blackhole incident, possibly reminiscent of the
> [https://en.wikipedia.org/wiki/AS_7007_incident](https://en.wikipedia.org/wiki/AS_7007_incident)
> in 1997.

As you aren’t a network engineer, I can understand making that leap based on
the context, but no, this is nothing like the AS7007 event.

The “black hole” in this case is due to networks pulling their routes via
AS3356 to try and avoid their outage, but when they do, CenturyLink is still
announcing those routes and as such traffic to those networks gets blackholed.
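
A toy model of that failure mode, as a sketch only: real BGP best-path
selection looks at local preference, MED and more, while this picks the
shortest AS path and nothing else. AS64500, AS64510 and AS64511 are
documentation ASNs used as hypothetical networks, and `best_path` and
`deliver` are illustrative names.

```python
def best_path(announcements):
    """Pick the announcement with the shortest AS path (a big simplification)."""
    return min(announcements, key=lambda a: len(a["as_path"]), default=None)

def deliver(announcements, broken_ases):
    """Say what happens to traffic that follows the best visible path."""
    path = best_path(announcements)
    if path is None:
        return "unreachable"
    if any(asn in broken_ases for asn in path["as_path"]):
        return "blackholed"
    return "delivered"

# Hypothetical customer AS64500, reachable through AS3356 and through a
# longer alternative path via AS64510 and AS64511.
via_3356 = {"as_path": [3356, 64500]}
via_other = {"as_path": [64510, 64511, 64500]}

# Normal operation: both paths visible, nothing broken.
before = deliver([via_3356, via_other], broken_ases=set())

# The customer withdraws its route via AS3356, but AS3356 keeps announcing the
# stale route, so the rest of the world still prefers the shorter path.
stale = deliver([via_3356, via_other], broken_ases={3356})

# What the customer intended: only the alternative path remains visible.
intended = deliver([via_other], broken_ases={3356})

print(before, stale, intended)
```

In this model the withdrawal only helps once the stale announcement actually
disappears, which is why other networks dropping their sessions with AS3356
made a difference.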

~~~
tpmx
So it's not a BGP blackhole incident then?

~~~
jsjohnst
Not all BGP blackholes are the same. The AS7007 incident from _over twenty
years ago_ is an entirely different cause, and thus unrelated.

~~~
tpmx
What I take from that: It is a BGP blackhole incident.

~~~
jsjohnst
What I take from this is that you’re offering input to a thread which you
don’t have experience in or even actually understand, thus are spreading
misinformation. You then are continually doubling down further showing your
maturity.

You aren’t helping, so please stop.

------
Darmody
Half of the internet is down. Crazy...

I can't even access the private WoW server I play.

~~~
tc313
FWIW, I can’t connect to Madden NFL online servers.

------
rglover
This knocked out the Starbucks app and some of their systems this morning. A
bunch of people in line couldn't log in and they were saying parts of their
whole internal system were down, too.

------
EE84M3i
I'm confused about why Cloudflare had problems but other CDN providers/sites
with private CDNs like Google did not. Is there something different about how
Cloudflare operates?

------
blooalien
I experienced this issue while reading docs at "Read the Docs" (and ironically
had connection issues while trying to read this very exact page right here,
too.)

------
system2
I am having trouble with Hulu right now. I bet it is related.

------
dancemethis
Probably due to the incredibly ugly name this company has. No one in their
right mind should shake hands with a thing called Level 3.

------
bovermyer
SalesForce/Office365 is also having trouble.

------
corford
No impact here in Lisbon, PT (using MEO). I can access: HN, twitter,
cloudflare, AWS, DO, Hetzner, DDG, Scaleway etc.

------
tictok4
This (thread) explains why we've been having internet problems this
morning.... lots of sites not working.

------
jsumrall
The iDeal payment network used by most online stores of the Netherlands was
down/flaky all afternoon.

------
TreeInBuxton
Looks like an issue with AS3356, they are advertising stale routes - lots of
unrelated services impacted

------
2fast4you
CenturyLink is my ISP; it looks like traffic drops out after 2 hops. It’s been
this way for a few hours

~~~
2fast4you
Youtube is still trucking though, not sure how that works

~~~
badrabbit
Youtube colocates at most major ISPs on the planet, that might help.

------
CarCooler
Yep, internet has been horrible out here, I had to use Cloudflare DNS to reach
websites!

------
eatmyshorts
I was doing a big release over the evening. I was working fine up until about
6 hours ago, when I signed off. Our network monitors show an outage started
about half an hour later (at about 4:05am CST). Service restored a few minutes
ago, at about 9:44am CST. I don't know if our problem is the same as this
problem, but we are on CenturyLink.

------
nurettin
also related
[https://www.cloudflarestatus.com/incidents/hptvkprkvp23](https://www.cloudflarestatus.com/incidents/hptvkprkvp23)

------
karpolan
Deployment to Netlify fails on installing of any version of Node :)

~~~
_fool
more specifically, npmjs.com and nodejs.org are not available from Netlify's
datacenter due to this outage.

------
ausjke
I wasted two hours on this: diagnosis, reboots, etc.

------
person_of_color
Imagine a ransomware attack against these jokers.

------
ezconnect
Namecheap is also having network connection issues.

------
pgoodjohn
Pressing F for everyone else who was on call today

------
skee0083
Good. It's about time ISPs switched to ipv6.

------
chkaloon
Wonder if that's why Feedly is down

~~~
pinkano
Yes

------
tiernano
1.1.1.1 warp is having issues too...

------
mathieubordere
stackoverflow seems to be unreachable

------
ihatecloudflare
It's probably just another daily outage at CloudFlare, they are famous for
having the most unreliable infrastructure on the entire planet.

------
ramshanker
I hope this kind of “ipv4”-only outage encourages more and more websites to
upgrade to ipv6.

#OutageBenefit ;)

~~~
cuu508
Sadly, in my experience, ipv4 is generally more reliable than ipv6 still.

Set up two hosts, host A and host B in two different data centers. Make them
send HTTP requests to each other over ipv4 and over ipv6. You'll see that
latency spikes and packet loss are more frequent over ipv6.
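
A minimal sketch of such a probe in Python, assuming only the standard
library. It times TCP connects rather than full HTTP requests, so failed
connects are only a stand-in for packet loss, and `connect_time` and
`summarize` are illustrative names, not an existing tool:

```python
import socket
import statistics
import time

def connect_time(host, port, family, timeout=2.0):
    """Return the TCP connect time in seconds over one address family,
    or None if resolution or the connect fails."""
    try:
        info = socket.getaddrinfo(
            host, port, family=family, type=socket.SOCK_STREAM
        )[0]
        start = time.monotonic()
        with socket.socket(info[0], info[1], info[2]) as sock:
            sock.settimeout(timeout)
            sock.connect(info[4])
        return time.monotonic() - start
    except (OSError, IndexError):
        return None

def summarize(samples):
    """Summarize samples where None marks a failed attempt."""
    ok = [s for s in samples if s is not None]
    return {
        "attempts": len(samples),
        "loss": (len(samples) - len(ok)) / len(samples) if samples else 0.0,
        "median": statistics.median(ok) if ok else None,
        "worst": max(ok) if ok else None,
    }

# Usage (needs network access), run on host A against host B:
#   v4 = [connect_time("host-b.example", 443, socket.AF_INET) for _ in range(100)]
#   v6 = [connect_time("host-b.example", 443, socket.AF_INET6) for _ in range(100)]
#   print(summarize(v4), summarize(v6))
```

Comparing the `loss` and `worst` fields over a long enough run is one way to
see whether the two address families really behave differently on a given
path.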

~~~
bigdict
Why is that?

~~~
chaboud
We’ve observed this in end-user devices, especially on some ISPs.

It makes sense if the overall adoption and resource allocation are
comparatively smaller, making individual or small-group coincident spikes more
impactful against the amortized whole.

It’s a lot like a market with low volume/liquidity. Someone wanders in with a
big transaction and blows everything up.

------
tpmx
How the _xxxx_ did it take CenturyLink/Level3 like 3-4 hours to fix this
problem?

Again
([https://news.ycombinator.com/item?id=24322988](https://news.ycombinator.com/item?id=24322988))
not a network engineer, but it seemed like their routers actively stopped
other networks from working around the problem since L3 would still keep
pushing other networks' old routes, even after those networks tried to stop
that.

Also: BGP probably needs to be redesigned from the ground up by software
engineers with experience from designing systems that can remain working with
hostile actors.

~~~
q3k
> Also: BGP probably needs to be redesigned from the ground up by software
> engineers with experience from designing systems that can remain working
> with hostile actors.

This has been attempted a number of times, but this is a political problem,
not a technical problem: there's no single agreed source of truth for routing
policy.

A lot of US Internet providers won't even sign up for ARIN IRR, or even move
their legacy space to a RIR - so there isn't even any technical way of
figuring out address space ownership and cryptographic trust (ie. via RPKI).
Hell, some non-RIR IRRs (like irr.net) are pretty much the fanfiction.net
equivalent of IRRs, with anyone being able to write any record about
ownership, without any practical verification (just have to pay a fee for
write access). And for some address space, these IRRs are the only information
about ownership and policy that exists.

Without even knowing for sure who a given block belongs to, or who's allowed
to announce it, or where, how do you want to fix any issues with a new dynamic
routing protocol?

~~~
tpmx
Build an industry coalition. Put pressure on those who don't join. Randomly
throw away 1 out of 10000 packets from the providers that fail to get with the
times. Increase that frequency according to some published time function.
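
As a sketch of what such a published time function could look like (every
number except the 1-in-10000 starting point is an assumption made up for
illustration, including the doubling period):

```python
def drop_probability(months_elapsed, start=1 / 10000, doubling_months=6):
    """Hypothetical schedule: start at 1 in 10000 dropped packets and
    double every `doubling_months`, capped at dropping everything."""
    if months_elapsed < 0:
        return 0.0  # before the policy starts, nothing is dropped
    return min(1.0, start * 2 ** (months_elapsed / doubling_months))

# Print the schedule at a few points in time.
for months in (0, 6, 12, 24, 48):
    print(months, drop_probability(months))
```

With these made-up parameters the drop rate stays below 1% for the first
three years or so, then climbs quickly.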

~~~
sneak
Having a single, cryptographically assured source of truth for routing data is
a turnkey censorship nightmare waiting to happen.

All it takes is a national military to care enough to put pressure on the
database operator, legal or otherwise, and suddenly your legitimate routes are
no longer accepted.

If you think this wouldn't be used to shut down things like future Snowden-
style leaks or Wikileaks or The Shadow Brokers, you may not have been paying
attention to the news.

~~~
kitteh
sneak you should come back to irc :)

~~~
sneak
Where? Send me an email rather than spamming this thread; my email address is
on my profile.

------
tpmx
Based on what I've seen: They essentially "shut down the Internet" for
probably a quarter of the global population for about 3-4 hours.

That response time is atrocious. It wasn't that they needed to fix broken
hardware, rather they needed to stop running hardware from actively sabotaging
the global routing via the inherently insecure BGP protocol. That took 3-4
hours to happen.

As an example: Being in Sweden with an ISP that uses Telia Carrier for
connectivity, things started working around the time of
[https://twitter.com/TeliaCarrier/status/1300074378378518528](https://twitter.com/TeliaCarrier/status/1300074378378518528)

~~~
swinglock
Seems they didn't even get around to doing so, rather asking other carriers to
stop peering with them.

[https://twitter.com/TeliaCarrier/status/1300074378378518528?...](https://twitter.com/TeliaCarrier/status/1300074378378518528?s=20)

~~~
matsur
CenturyLink requested depeering to give them some breathing room and stop the
bleeding. Hug ops.

~~~
tpmx
That is a fantastic euphemism. Personally I'm disappointed Telia didn't de-
peer two hours earlier, after diagnosing the issue for 30 minutes, since that
whole lack of functioning routing to very large parts of the internet forced
me to use a VPN in North America to access many web services, including HN.

I realize I'm going to get insanely downvoted by the elite internetworking
crowd again but I think this needs to be said.

From an outsider's POV: There seems to be a very strange and almost incestuous
relationship between the networking companies. Or maybe it's just their
hangaround supporters? I dunno.

