Level 3 Global Outage (nether.net)
914 points by dknecht 20 days ago | 367 comments



Summary: On August 30, 2020 10:04 GMT, CenturyLink identified an issue to be affecting users across multiple markets. The IP Network Operations Center (NOC) was engaged, and initial research identified that an offending flowspec announcement prevented Border Gateway Protocol (BGP) from establishing across multiple elements throughout the CenturyLink Network. The IP NOC deployed a global configuration change to block the offending flowspec announcement, which allowed BGP to begin to correctly establish. As the change propagated through the network, the IP NOC observed all associated service affecting alarms clearing and services returning to a stable state.

Source https://puck.nether.net/pipermail/outages/2020-August/013229...


Flowspec strikes again.

It's a super useful tool if you want to blast out an ACL across your network in seconds (using BGP), but it has a number of sharp edges. Several networks, including Cloudflare, have learned what it can do. I've seen a few networks basically blackhole traffic or even lock themselves out of routers due to a poorly made Flowspec rule or a bug in the implementation.


Is "doing what you ask" considered a sharp edge? Network-related tools don't really have safeties, ever (your linux host will happily "ip rule add 0 blackhole" without confirmation). Every case of flowspec shenanigans in the news has been operator error.


It's possible that if a tool allows you to destroy everything with a single click, that tool (or maybe process) is bad


Massive reconvergence event in their network, causing edge router BGP sessions to bounce (due to CPU). Right now all their big peers are shutting down sessions with them to give Level3's network the ability to reconverge. Prefixes announced to 3356 are frozen on their route reflectors and not getting withdrawn.

Edit: if you are a Level3 customer, shut your sessions down to them.


History doesn't repeat, but it rhymes ....

There was a huge AT&T outage in 1990 that cut off most US long distance telephony (which was, at the time, mostly "everything not within the same area code").

It was a bug. It wasn't a reconvergence event, but it was a distant cousin: Something would cause a crash; exchanges would offload that something to other exchanges, causing them to crash -- but with enough time for the original exchange to come back up, receive the crashy event back, and crash again.

The whole network was full of nodes crashing, causing their peers to crash, ad infinitum. In order to bring the network back up, they needed to take everything down at the same time (and make sure all the queues were emptied), but even that wouldn't have made it stable, because a similar "patient 0" event would have brought the whole network down again.

Once the problem was understood, they reverted to an earlier version which didn't have the bug, and the network re-stabilized.

The lore I grew up on is that this specific event was very significant in pushing and funding research into robust distributed systems, of which the best known result is Erlang and its ecosystem - originally built, and still mostly used, to make sure that phone exchanges don't break.

[0] https://users.csc.calpoly.edu/~jdalbey/SWE/Papers/att_collap...


Contrary to what that link says, the software was not thoroughly tested. Normal testing was bypassed - per management request after a small code change.

This was covered in a book (perhaps Safeware, but maybe another one I don't recall) along with the Therac-25, the Ariane V, and several others. Unfortunately these lessons need to be relearned by each generation. See the 737-Max...


> Normal testing was bypassed - per management request after a small code change.

That lesson will really never be learned. This happens on a daily basis all over the planet with people who have not been bitten - yet.


It isn't learned because 99% of the time, it works fine and nothing bad happens.

We are very bad at avoiding these sorts of rare, catastrophic events.


That's why the most reliable way to instil this lesson is to instil it into our tools. Automate as much testing as possible, so that bypassing the tests becomes more work than running them.


Until a manager is told about how hard the automation makes it to accomplish their goal...


You need buy-in to automation at a high enough level.

If a team manager at eg Google was complaining about how automation gets in the way and wanted to bypass it, they wouldn't last too long.


Managers who have been bitten still make this choice


Or most of what we do isn't really important so it doesn't matter if it breaks every once in a while.


Probably not the book you are thinking of, since it’s just about the AT&T incident, but “The Day the Phones Stopped Ringing” by Leonard Lee is a detailed description of the event.

It’s been many years since I read it, but I recall it being a very interesting read.


For some reason in my university almost every CS class would start with an anecdote about the Therac-25, Ariane V, and/or a couple of others as a motivation for why the class existed. It was sort of a meme.

The lessons are definitely still taught; I don't know if they're actually learned, of course. And who knows who actually taught the 737-Max software devs; I don't suppose they're fresh out of uni.


Do management typically study Computer Science?


Unfortunately most people become a manager by being a stellar individual contributor. People management and engineering are very different skills; I'm always impressed when I see someone make that jump smoothly.

I always wanted companies to hire people managers as its own career path. An engineer can be an excellent technical lead or architect, but it can feel like you started over once you're responsible for the employees, their growth, and their career path.


Yeah, it just sucks that you eventually have someone making significant people management decisions without the technical knowledge of what the consequences could end up being. This would be even worse if you had people manager hiring be completely decoupled. The US military works this way and I have to say it's not the best mode.


Typically yes actually, the director of engineering should always be an engineer. Of course, these are hardware companies so it would probably be some kind of hardware engineer.


Should.

Sure.


As a former AT&T contractor, albeit from years later, this checks out. Sat in a "red jeopardy" meeting once because a certain higher-up couldn't access the AT&T branded security system at one of his many houses.

The build that broke it was rushed out and never fully tested, adding a fairly useless feature for said higher-up that improved the UX for users with multiple houses on their account.


This reminds me of an incident on the early internet (perhaps ARPANET at that point) where a routing table got corrupted so it had a negative-length route which routers then propagated to each other, even after the original corrupt router was rebooted. As with AT&T, they had to reboot all the routers at once to get rid of the corruption.

I can't remember where I read about this, but I recall the problem was called "The Creeping Crud from California". Sadly, this phrase apparently does not appear anywhere on the internet. Did I imagine this?


I can't find anything by that name either, but the details do match the major ARPANET outage of Oct 27, 1980.

The incident is detailed in RFC 789:

http://www.faqs.org/rfcs/rfc789.html#b


Interesting, thanks! That is different to the story I remember, but it's possible that I remember incorrectly, or read an incorrect explanation.

I believe that I read about this episode in Hans Moravec's book 'Mind Children'. I can see in Google Books that chapter 5 is on 'Wildlife', and there is a section 'Spontaneous Generation', which promises to talk about a "software parasite" which emerged naturally in the ARPAnet - but of which the bulk is not available:

https://books.google.co.uk/books?id=56mb7XuSx3QC&lpg=PA133&d...


I have spent hours and hours banging my head against Erlang distributed system bugs in production. I am absolutely mystified why anyone thought just using a particular programming language would prevent these scenarios. If it's Turing-complete, expect the unexpected.


The idea isn't that Erlang is infallible in the design of distributed systems.

The idea is it takes away enough foot-guns that if you're banging your head against systems written in it, you'd be banging your head even harder and more often if the same implementor had used another language.


There was something similar a few years ago on a large US mobile network. You could watch the ‘storm’ rolling across the map. Fascinating stuff


Are you referring to CenturyLink’s 37-hour, nationwide outage?

> In this instance, the malformed packets [Ethernet frames?] included fragments of valid network management packets that are typically generated. Each malformed packet shared four attributes that contributed to the outage: 1) a broadcast destination address, meaning that the packet was directed to be sent to all connected devices; 2) a valid header and valid checksum; 3) no expiration time, meaning that the packet would not be dropped for being created too long ago; and 4) a size larger than 64 bytes.

* https://arstechnica.com/information-technology/2019/08/centu...


I think we used to call that a poison pill message (still bring it up routinely when we talk about load balancing and why infinite retries are a very, very bad idea).


Some queue processing systems I've seen have infinite retries.

At least they have exponential backoff I guess.
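
For contrast, a minimal sketch of the usual antidote - bounded attempts with exponential backoff and a dead-letter queue for poison pills (not any particular queueing system's API; all the names here are made up):

    import time

    MAX_ATTEMPTS = 5

    def process_with_retries(message, handler, dead_letter):
        """Try a handler a bounded number of times, then park the message."""
        delay = 1.0
        for attempt in range(1, MAX_ATTEMPTS + 1):
            try:
                handler(message)
                return True
            except Exception:
                if attempt == MAX_ATTEMPTS:
                    dead_letter.append(message)  # park it for a human to inspect
                    return False
                time.sleep(delay)  # exponential backoff between attempts
                delay *= 2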


But your queue will grow and grow and the fraction of time you spend servicing old messages grows and grows.

Not a terribly big fan of these queueing systems. People always seem to bung things up in ways they are not quite equipped to fix (in the “you are not smart enough to debug the code you wrote” sense).

Last time I had to help someone with such a situation, we discovered that the duplicate processing problem had existed for >3 months prior to the crisis event, and had been consuming 10% of the system capacity, which was just low enough that nobody noticed.


We also alert if any message is in the queue too long.

If anything, the alert is too sensitive.


The thing with feature group D trunks to the long distance network is you could (and still can on non-IP/mobile networks) manually route to another long distance carrier like Verizon, and sidestep the outage from the subscriber end, full stop. That's certainly not possible with any of the contemporary internet outages.


You can inject changes in routing, but if the other carrier doesn't route around the affected network, you're back to square one. That's part of why Level3/CenturyLink was depeered and why several prefixes that are normally announced through it were quickly rerouted by their owners.


That's my point; as a subscriber, you can prefix a long distance call with a routing code to avoid, for example, a shut down long distance network without any administrator changes. Routing to the long distance networks is done independently through the local network, so if AT&T's long distance network was having issues, it'd have no impact on your ability to access Verizon's long distance network.


There's actually no technical reason why you couldn't do that with IP (4 or 6), although you'd need an appropriately located host to be running a relay daemon[0].

0: i.e. something that takes, say, a UDP packet on port NNNN containing a whole raw IPv4 packet, throws away the wrapping, and drops the IPv4 packet onto its own network interface. This is safe - the packet must shrink by a dozen or two bytes with each retransmission - but usually not actually set up anywhere.

Edit: It probably wouldn't work for TCP though - maybe try TOR?
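
For the curious, a minimal sketch of such a relay daemon (assuming Linux and root; port 4444 is just a stand-in for "NNNN", and there's no authentication or sanity checking - illustrative only):

    import socket

    LISTEN_PORT = 4444  # hypothetical stand-in for "NNNN"

    # UDP socket that receives whole raw IPv4 packets as its payload
    udp = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    udp.bind(("0.0.0.0", LISTEN_PORT))

    # Raw socket used to reinject the unwrapped packet; on Linux, IPPROTO_RAW
    # implies IP_HDRINCL, i.e. we supply the full IPv4 header ourselves.
    raw = socket.socket(socket.AF_INET, socket.SOCK_RAW, socket.IPPROTO_RAW)

    while True:
        packet, src = udp.recvfrom(65535)
        if len(packet) < 20:
            continue  # too short to contain an IPv4 header
        dst = socket.inet_ntoa(packet[16:20])  # destination address from the inner header
        raw.sendto(packet, (dst, 0))  # drop the inner packet onto our own network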


There are plenty of ways to do what you're describing, and they all work with TCP. Some of them only work if the encapsulated traffic is IPv6 (and are designed to give IPv6 access on ISPs that only support IPv4). Some of them may end up buffering the TCP stream and potentially generating packet boundaries at different locations than in the original TCP stream.

[0] https://en.wikipedia.org/wiki/Generic_Routing_Encapsulation

[1] https://en.wikipedia.org/wiki/Teredo_tunneling

[2] https://en.wikipedia.org/wiki/6to4

[3] Any of the various https://en.wikipedia.org/wiki/Virtual_private_network technologies (WireGuard, IPSec, SOCKS TLS proxies, etc.)

[4] As you mention, a Tor SOCKS proxy


There is, technically, a way for IP packets to signify preferred routes (IP source routing), but for security reasons it's generally disabled.


> best known result is Erlang and its ecosystem

Not an expert, but Erlang is listed as 1986, so that would seem not directly related: https://en.wikipedia.org/wiki/Erlang_(programming_language)


This sounds like the event that is described in the book Masters of Deception: The gang that ruled cyberspace. The way I remember it the book attributes the incident to MoD, while of course still being the result of a bug/faulty design.


Indeed. In 2018 Erlang-based telco software did break, bringing down mobile networks in the UK and Japan.


If memory serves, that also involved an expired certificate.


That matches my memory.


A thread discussing that event:

https://news.ycombinator.com/item?id=24323412


Is that related to the hacker's crackdown?


Fascinating. Thanks for sharing! :)


Most of Level3's settlement-free peers, aka "Tier 1s", have shut down or depreffed their sessions with them.

Example: https://mobile.twitter.com/TeliaCarrier/status/1300074378378...


Root cause identified. Folks are turning things back on now.


Source?



What is a reconvergence event? Is that what's described in your last sentence?


BGP is a path-vector routing protocol: every router on the internet is constantly updating its routing tables based on paths provided by its peers to get the shortest distance to an advertised prefix. When a new route is announced, it takes time to propagate through the network and for all routers in the chain to “converge” into a single coherent view.

If this is indeed a reconvergence event, that would imply there’s been a cascade of route table updates that have been making their way through CTL/L3’s network - meaning many routers are missing the “correct” paths to prefixes and traffic is not going where it is supposed to, either getting stuck in a routing loop or just going to /dev/null because the next hop isn’t available.

This wouldn’t be such a huge issue if downstream systems could shut down their BGP sessions with CTL and have traffic come in via other routes, but doing so is not resulting in the announcements being pulled from the Level 3 AS - something usually reflective of the CPU on the routers being overloaded processing route table updates or an issue with the BGP communication between them.

Convergence time is a known bugbear of BGP.


BGP operates as a rumor mill. Convergence is the process of all of the rumors settling into a steady state. The rumors are of the form "I can reach this range of IP addresses by going through this path of networks." Networks will refuse to listen to rumors that have themselves in the path, as that would cause traffic to loop.

For each IP range described in the rumor table, each network is free to choose whichever rumor they like best among all they have heard, and send traffic for that range along the described path. Typically this is the shortest, but it doesn't have to be.

ISPs will pass on their favorite rumor for each range, adding themselves to the path of networks. (They must also withdraw the rumors if they become disconnected from their upstream source, or their upstream withdraws it.) Businesses like hosting providers won't pass on any rumors other than those they started, as no one involved wants them to be a path between the ISPs. (Most ISPs will generally restrict the kinds of rumors their non-ISP peers can spread, usually in terms of what IP ranges the peer owns.)

Convergence in BGP is easy in the "good news" direction, and a clusterfuck in the "bad news" direction. When a new range is advertised, or the path is getting shorter, it is smooth sailing, as each network more or less just takes the new route as is and passes it on without hesitation. In the bad news direction, where either something is getting retracted entirely, or the path is going to get much longer, we get something called "path hunting."

As an example of path hunting: Let's say the old paths for a rumor were A-B-C and A-B-D, but C is also connected to D. (C and D spread rumors to each other, but the extended paths A-B-C-D and A-B-D-C are longer, thus not used yet.) A-B gets cut. B tells both C and D that it is withdrawing the rumor. Simultaneously D looks at the rumor A-B-C-D and C looks at the rumor A-B-D-C, and each says "well I've got this slightly worse path lying around, might as well use it." Then they spread that rumor to their downstreams, not realizing that it is vulnerable to the same event that cost them the more direct route. (They have no idea why B withdrew the rumor from them.) The paths, especially when removing an IP range entirely, can get really crazy. (A lot of core internet infrastructure uses delays to prevent the same IP range from updating too often, which tamps down on the crazy path exploration and can actually speed things up in these cases.)
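
To make the path-hunting dance concrete, here is a toy path-vector simulation of exactly that A/B/C/D example (simplified gossip mechanics only - no timers, no policy, not real BGP; paths are written nearest-hop first, like an AS_PATH):

    neighbors = {
        "A": ["B"],
        "B": ["A", "C", "D"],
        "C": ["B", "D"],
        "D": ["B", "C"],
    }

    # rumors[node] = {speaker: path} -- what each node has heard about A's prefix
    rumors = {n: {} for n in neighbors}
    best = {n: None for n in neighbors}

    def best_path(node):
        """Shortest loop-free path among everything this node has heard."""
        candidates = [p for p in rumors[node].values() if p and node not in p]
        return min(candidates, key=len, default=None)

    def announce(node, path, events):
        """Tell every current neighbor our best path (None = withdraw)."""
        for peer in neighbors[node]:
            events.append((peer, node, path))

    def run(events):
        while events:
            node, speaker, path = events.pop(0)
            rumors[node][speaker] = path
            new = best_path(node)
            if new != best[node]:
                best[node] = new
                print(f"{node}: best path to A is now {new}")
                announce(node, ([node] + new) if new else None, events)

    run([("B", "A", ["A"])])          # A originates its prefix
    print("--- the A-B link is cut ---")
    neighbors["A"].remove("B"); neighbors["B"].remove("A")
    run([("B", "A", None)])           # model the dead session as a withdrawal from A

Running it, C and D briefly "hunt" through the longer paths via each other before finally giving up, which is the kind of churn that floods route updates during events like this one.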


https://en.wikipedia.org/wiki/Convergence_(routing)

IP network routing is distributed systems within distributed systems. For whatever reason the distributed system that is the CenturyLink network isn't "converging" - or, we could say, becoming consistent, or settling - in a timely manner.


I know some of these words


Can you tell me more about what happened, but in a way that works for a person who struggled with the CCNA? I’ve never heard of a reconvergence event.


CenturyLink/Level3 on Twitter: "We are able to confirm that all services impacted by today’s IP outage have been restored. We understand how important these services are to our customers, and we sincerely apologize for the impact this outage caused."

https://twitter.com/CenturyLink/status/1300089110858797063


I hope they provide a root cause analysis


Based on experience it will probably not be public, or at least be very limited.

But customers are likely to get one, at least if they request it.


Seeing as it was pretty big, they'll probably make it public.



India just lost to Russia in the final of the first-ever online chess olympiad, probably due to connection issues of two of its players. I wonder if it's related to this incident and if the organizers are aware. Edit: the organizers are aware, and Russia and India have now been declared joint winners.


I am glad they declared a tie. Seems fair.

I had this problem two years ago while I was taking Go lessons online from a South Korean professional Go Master. For my last job we were renting a home well outside city limits in Illinois and our Internet failed often. I lost one game in an internal teaching tournament because of a failed connection, and jumped through hoops to avoid that problem.


Thanks for the update.

Wasn't able to access HN from India earlier, but other Cloudflare-enabled services were accessible. I assume several network engineers were woken up from their Sunday morning sleep to fix the issue; if any of them is reading this, I appreciate your effort.


Interesting. How would connection issues cause them to lose? Was it a timed round?


Related: World champion Magnus Carlsen recently resigned a match after 4 moves as an act of honor because in his previous match with the same opponent, Magnus won solely due to his opponent having been disconnected.


His opponent, Ding Liren, is from China, and has been especially plagued by unreliable internet since all the high-level chess tournaments moved online. He is currently ranked #3, behind Magnus Carlsen and Fabiano Caruana.


All professional chess games have a time limit for each player (if you've ever heard of "chess clocks" -- that's what they're used for). In "slow chess" each player has a 2-hour limit and all of the other time control schemes (such as rapid and blitz) are much shorter.


There’s an interesting protocol for splitting a Go or chess game over multiple days so that neither party has the entire time to think about their response to the last move: at the end of the day the final move is made by one player but is sealed, not to be revealed until the start of the next session.

For this to work on an internet competition, the judges would need a backup, possibly very low bandwidth communication mechanism that survives a network outage.

This wouldn’t save any real-time esports, but would be serviceable for turn based systems.


Yes, this is called Adjournment[0] and they used to do it until 20 or so years ago when computer analysis became too good/mainstream.

[0] https://en.wikipedia.org/wiki/Adjournment_(games)


Yes, two players lost on time.


That's fascinating. But I wonder, why don't they start over, or continue where they left off, once the internet is back?


> continue where they left off

The games are timed and this pause gives a lot of thinking time. If they're allowed to talk with others during the pause, then also consulting time.

> why don't they start over

That would be unfair to the player who was ahead.

That said, both players might still be fine with a clean rematch, because being the undisputed winner feels better. I wonder if they were asked (anonymously to prevent public hate) whether they would be fine with a rematch.


Seems like one of those cases where solving a “little” issue would actually require rearchitecting the entire system.

Namely, in this case, it seems like the “right thing” is for games to not derive their ELO contributions from pure win/loss/draw scorings at all, but rather for games to be converted into ELO contributions by how far ahead one player was over the other at the point when both players stopped playing for whatever reason (where checkmate, forfeit, and game disruption are all valid reasons.) Perhaps with some Best-rank (https://www.evanmiller.org/how-not-to-sort-by-average-rating...) applied, so that games that go on longer are “more proof” of the competitive edge of the player that was ahead at the time.

Of course, in most central cases (of chess matches that run to checkmate or a “deep” forfeit), such a scoring method would be irrelevant, and would just reduce to the same data as win/loss/draw inputs to ELO would. So it’d be a bunch of effort only to solve these weird edge cases like “how does a half-game that neither player forfeited contribute to ELO.”
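
For reference, a quick sketch of the standard Elo update that this proposal generalizes (K=20 is an arbitrary choice; the only new part suggested above is allowing fractional scores for interrupted games):

    def expected_score(r_a, r_b):
        """Expected score of player A against player B under the Elo model."""
        return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

    def elo_update(r_a, r_b, score_a, k=20):
        """score_a is 1 for an A win, 0.5 for a draw, 0 for a loss; the parent's
        idea amounts to also permitting values like 0.8 if A was clearly ahead
        when the game was disrupted."""
        e_a = expected_score(r_a, r_b)
        return r_a + k * (score_a - e_a), r_b + k * ((1 - score_a) - (1 - e_a))

    # e.g. a 2700 vs 2750 game abandoned with A judged ~80% likely to win:
    print(elo_update(2700, 2750, 0.8))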


> but rather for games to be converted into ELO contributions by how far ahead one player was over the other at the point when both players stopped playing for whatever reason

Except for the obvious positions that no one serious would even play, there is no agreed-upon way of calculating who has an advantage in chess like that. One man's terrible mobility and probable blunder is another's brilliant stratagem.


Hm, you’re right; guess I was thinking in terms of how this would apply to Go, where it’d be as simple as counting territory.

Still, just to spitball: one “obvious” approach, at least in our modern world where technology is an inextricable part of the game, would be to ask a chess-computer: “given that both players play optimally from now on, what would be the likelihood of each player winning from this starting board position?” The situations where this answer is hard/impossible to calculate (i.e. estimations close to the beginning of a match) are exactly the situations where the ELO contribution should be minuscule anyway, because the match didn’t contribute much to tightening the confidence interval of the skill gap between the players.

Of course, players don’t play optimally. I suspect that, given GPT-3 and the like, we’ll soon be able to train chess-computers to mimic specific players’ play-styles and seeming limits of knowledge (insofar as those are subsets of the chess-computer’s own capabilities, that it’s constraining its play to.) At that point, we might actually be able to ask the more interesting question: “given these two player-models and this board position, in what percentage of evolutions from this position does player-model A win?”

Interestingly, you could ask that question with the board position being the initial one, and thus end up with automatically-computed betting odds based on the players’ last-known skill (which would be strictly better than ELO as a prediction on how well an individual pair of players would do when facing off; and therefore could, in theory, be used as a replacement for ELO in determining who “should” be playing whom. You’d need an HPC cluster to generate that ladder, but it’d be theoretically possible, and that’s interesting.)


I was doing development work which uses a server I've got hosted on DigitalOcean. I started getting intermittent responses, which I thought weird as I hadn't changed anything on the server. I spent a good ten minutes trying to debug the issue before searching for something on DuckDuckGo, which also didn't respond. Cloudflare shouldn't be involved at all with my little site, so I don't think it's limited to just them.


Yeah, something happened to IPv4 traffic worldwide. Don't see how that could happen.


Let me guess: somebody misconfigured BGP again?



likely


That's definitely going to be an interesting postmortem.


Seconding this. Had some ssh connections timing out repeatedly just a bit ago. Also got disconnected on IRC.


IKEA had their payment system go down worldwide also. I really doubt that uses Cloudflare.


It's not just a Cloudflare outage, it's a global CenturyLink/Level3 outage.


Is there a ranking board for which carriers have caused the most accumulated network carnage out there? I think the world deserves this.


Me too. I can only connect to one of my DO servers. The rest are all unreachable.


As noticed in another comment, I see loads of problems within Cogent, all on *.atlas.cogentco.com. Might the problem lie there?


Cogent and Cox are also having problems, but we are seeing a lot more successful traffic on Cogent than CenturyLink. It appears that CL is also not withdrawing stale routes. It seems CL's issues are causing issues on/with everything connected to it.


Same here. I actually opened a support ticket with them because I was worried my ISP had started blocking their IP addresses for some unknown reason. Luckily it seems to be clearing up, and in the ticket they mentioned routing traffic away from the problematic infrastructure. Seems to have worked for now for my things.


Yup, definitely noticed earlier outages to both EU sites and also to HN. Looked far upstream because many sites/lots of things worked fine. Good to see it's at least largely fixed


I had problems accessing my Hetzner VPS', but I haven't tried connecting directly with the IP. So I suppose it could be a DNS thing?


M5 Hosting here, where this site is hosted. We just shut down 2 sessions with Level3/CenturyLink because the sessions were flapping and we were not getting a full route table from either session. There are definitely other issues going on on the Internet right now.


Oooh, maybe that's why HN wasn't working for me a little while ago (from AU)...


Analysis of what we saw at Cloudflare, how our systems automatically mitigated the worst of the impact to our customers, and some speculation on what may have gone wrong: https://blog.cloudflare.com/analysis-of-todays-centurylink-l...


Great write up. It is embarrassing that most of America has no competition in the market.

>To use the old Internet as a “superhighway” analogy, that’s like only having a single offramp to a town. If the offramp is blocked, then there’s no way to reach the town. This was exacerbated in some cases because CenturyLink/Level(3)’s network was not honoring route withdrawals and continued to advertise routes to networks like Cloudflare’s even after they’d been withdrawn. In the case of customers whose only connectivity to the Internet is via CenturyLink/Level(3), or if CenturyLink/Level(3) continued to announce bad routes after they'd been withdrawn, there was no way for us to reach their applications and they continued to see 522 errors until CenturyLink/Level(3) resolved their issue around 14:30 UTC. The same was a problem on the other (“eyeball”) side of the network. Individuals need to have an onramp onto the Internet’s superhighway. An onramp to the Internet is essentially what your ISP provides. CenturyLink is one of the largest ISPs in the United States. Because this outage appeared to take all of the CenturyLink/Level(3) network offline, individuals who are CenturyLink customers would not have been able to reach Cloudflare or any other Internet provider until the issue was resolved. Globally, we saw a 3.5% drop in global traffic during the outage, nearly all of which was due to a nearly complete outage of CenturyLink’s ISP service across the United States.


I remember working the support queue _before_ this automatic re-routing mitigation system went in and it was a lifesaver. Having to run over to SRE and yell "look! look at grafana showing this big jump in 522s across the board for everything originating in ORD-XX where the next hop is ASYYYY! WHY ARE WE STILL SENDING TRAFFIC OVER THAT ARRRGHH please re-route and make the 522 tickets stop"

it's cool to see something large enough that the auto-healing mechanisms weren't able to handle it on their own, though shoutout to whoever was on the weekend support/SRE shift; that stuff was never fun to deal with when you were one of a few reduced staff on the weekend shifts


I had this earlier! A bunch of sites were down for me, I couldn't even connect to this site.

The problem is I don't know where to find out what was going on (tried looking up live DDoS-tracking websites, "is it down or is it just me" websites, etc.). I couldn't find a single place talking about this.

Is there a source where you can get instant information on Level3 / global DNS / major outages?


DDoS tracking sites are eye candy and garbage. Stop using them.

Outages and nanog lists are your best bet, short of being on the right IRC channels.


What are the right IRC channels?


I believe these are mostly non-public channels where backbone and network infrastructure engineers from different companies congregate to discuss outages like this.


Not just discuss, but fix too :)


Also channels where hats of various types discuss the advantages, opportunities and challenges presented by such outages.


Which channels?


They wouldn't be non-public if they told us plebs


Please don't call yourself that. It's more like I [and others] are hyper-paranoid and marginal in behavior due to the nature of our pastimes. [I myself can promise you that I'm not malicious, but I can't speak for others; I would leave it up to them to speak for themselves.]


It isn't so much the channels that you want, it's the current IP of a non-indexed IRC server[s] that you need. Of course, you could create and maintain your own dynamic IRC server and invite people that you trust or feel kinship toward.

Here are a couple of "for instance" breadcrumbs for you to start from:

https://github.com/saileshmittal/nodejs-irc-server

https://ubuntu.com/tutorials/irc-server

...>>> https://github.com/inspircd/inspircd/releases/tag/v3.7.0


packetheads irc


I agree!

I'm definitely an amateur when it comes to networking stuff. At the time, the _only_ issue I had was with all of my DigitalOcean droplets. It was confusing because I was able to get to them through my LTE connection and not able to through my home ISP. I opened a ticket with DO worried that it was my ISP blocking IP addresses suddenly. It turned out to be this outage, but it was very specific. Traceroute gave some clues, but again I'm an amateur and I couldn't tell what was happening after a certain point.

So yeah, I too would love a really easy to use page that could show outages like this. It would be really great to be able to specify vendors used to really piece the puzzle together.


I had a similar issue with my droplets as well. I thought I messed up something and then suddenly it worked again.


I found places talking about this earlier. A friend of mine who has CenturyLink as their ISP complained to me that Twitch and Reddit weren't working. But they worked for me, so I suspected a CDN issue. I did some digging to figure out what CDNs they had in common. I expected Twitch to be on CloudFront, but their CDN doesn't serve CloudFront headers; instead they are "Via: 1.1 varnish". Reddit is exactly the same. I did some googling and found out that they both apparently used Fastly, at least to some extent. Fastly has a status page and it was talking about "widespread disruption".

So I guess my takeaway from this is that if the Internet seems to be down, usually the CDN providers notice. I don't know if either of the sites actually still use Fastly (I kind of forgot they existed), but I did end up reading about the Internet being broken at some scale larger than "your friend's cable modem is broken", so that was helpful.

It would be nice if we had a map of popular sites and which CDN they use, so we can collect a sampling of what's up and what's down and figure out which CDN is broken. Though in this case, it wasn't really the CDN's fault. Just collateral damage.
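
Something like the following automates that guessing game (a rough sketch: the site list is arbitrary, the header heuristics are crude, and some sites reject HEAD requests outright):

    import urllib.request

    SITES = ["https://www.reddit.com", "https://www.twitch.tv", "https://news.ycombinator.com"]

    for url in SITES:
        try:
            req = urllib.request.Request(url, method="HEAD")
            resp = urllib.request.urlopen(req, timeout=5)
            # "Via: 1.1 varnish" hints at Fastly, "Server: cloudflare" at Cloudflare, etc.
            print(f"{url}: Via={resp.headers.get('Via')!r} Server={resp.headers.get('Server')!r}")
        except Exception as exc:
            print(f"{url}: unreachable ({exc})")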


Has anyone any good resources for learning more about the "internet-level" infrastructure affected today and how global networks are connected?


Unfortunately, this infrastructure is at an uncanny intersection of technology, business and politics.

To learn the technical aspect of it, you can follow any network engineering certification materials or resources that delve into dynamic routing protocols, notably BGP. Inter-ISP networking is nothing but setting up BGP sessions and filters at the technical level. Why you set these up, and under what conditions is a whole different can of worms, though.

The business and political aspect is a bit more difficult to learn without practice, but a good simulacrum can be taking part in a project like dn42, or even just getting an ASN and some IPv6 PA space and trying to announce it somewhere. However, this is no substitute for actual experience running an ISP, negotiating percentile billing rates with salespeople, getting into IXes, answering peering requests, getting rejected from peering requests, etc. :)

Disclaimer: I helped start a non-profit ISP in part to learn about these things in practice.


Judging by other comments, it seems there's a space to fill this niche with a series of blog articles or a book, if you're that sort of person.


There are plenty of presentations out there. See nanog, ripe.

The books are meh because they're not written by operators. They're more academic and dated.

Plenty of clueful folks on the right IRC channels.


Second time in this thread you've mentioned the right IRC channels. Does one need to invoke some secret code to find out what they are? :-)


Usually these channels are either invite-only (for members of a NOG, for example) or are very, very hard to find if you don't know the proper people.


In certain instances, yes, depending on the nature or subject matter of the channel.

OK, let's go for broke: there are a LOT of clandestine IRC servers and exclusive gatekeeping of channels. You won't know about them unless you have an IRL reference.


The first rule of Fight Club is that you don't talk about Fight Club.


What resources can I follow to start a non-profit ISP? I want to start one in my hometown for students who can't afford internet access to join online classes.


Why not just raise money to pay for service from for-profit providers? Much more efficient use of donation funds.


Hmm, I actually didn't think about that at all. I guess I got too fascinated by this video[0] and wanted to apply something similar to our current scenario.

[0]: https://youtu.be/lEplzHraw3c


Because often enough there is only one dominant provider in the region, which has no pressure to compete from anyone due to regulatory capture (esp. regarding right of way on utility poles) and so has no incentive to upgrade its offerings to customers.


If you intend to start a facilities-based last mile access ISP, what last-mile tech do you intend to use? There's a number of resources out there for people who want to be a small hyper local WISP. But I would not recommend it unless you have 10+ years of real world network engineering experience at other, larger ISPs.



There's a bunch of guidelines for starting a (W)ISP depending on your region.


I actually tried, but all I got was some consultancy services that would help you set up an ISP with an estimated cost of 10k USD (a middle-class household earns half of that in a year here).



> or even just getting an ASN and some IPv6 PA space and trying to announce it somewhere

That’s fairly expensive to do just for a hobby interest, but at least the price has come down since I last looked.


A RIPE ASN (as an end-user through a LIR) and PA v6 will cost you around $100 per year and some mild paperwork; there's plenty of companies/organizations that will help you with that (shameless plug: bgp.wtf, that's us).

Afterwards, announcing this space is probably the cheapest with vultr (but their BGP connectivity to VMs is uh, erratic at times) or with ix-vm.cloud, or with packet.net (more expensive). You can also try to colo something in he.net FMT2 and reach FCIX, or something in kleyrex in Germany. All in all, you should be able to do something like run a toy v6 anycast CDN at not more than $100 per month.


> bgp.wtf, that's us

I feel like you ought to be part of the ffdn database: https://db.ffdn.org/

Though the "French Data Networks Federation" is a French organization, their db tries to cover every independent, nonprofit ISP in the world :)


That couldn't work, unfortunately, per https://www.ffdn.org/en/charter-good-practices-and-common-co... :

> All subscribers to Internet access provided by the provider must be members of the provider and must have the corresponding rights, including the right to vote, within reasonable time and procedure.

Not all of our subscribers are members of our association. The association is primarily a hackerspace with hackerspace members being members of the association. We just happen to also be an ISP selling services commercially (eg. to people colocating equipment with us, or buying FTTH connectivity in the building we're located in).


Ah, that's interesting, thank you for pointing this out to me, I didn't know about it. I take it that this isn't the first time you are asked about this, then?

Well, on one hand I perfectly understand you not wanting to change your structure, especially if it works fine. On the other, I can see a few ways around that restriction, and don't really see how having the ISP be a separate association with its customers as members (maybe with their votes having less weight than hackerspace members') would have a downside (except if funds are primarily collected for funding hackerspace activities?).


> I take it that this isn't the first time you are asked about this, then?

First time, but I read the rules carefully :).

> [I] don't really see how having the ISP a separate association with its customers as members [...] would have a downside [...]

Paperwork, in time and actual accounting fees. If/when we grow, this might happen - but for now it's just not worth the extra effort. We're not even breaking even on this, just using whatever extra income from customers to offset the costs of our own Internet infrastructure for the hackerspace. We don't even have enough customers to legally start a separate association with them as members, as far as I understand. I also don't think our customers would necessarily be interested in becoming members of an association, they just want good and cheap Internet access and/or mediocre and cheap colocation space.


> A RIPE ASN (as a end-user through a LIR) and PA v6 will cost you around $100 per year

ARIN starting price is 2.5x that much (used to be 5x) for just the ASN. Glad the pricing is better elsewhere in the world at least!


If you try to become an LIR in your own right, RIPE fees are much higher.

If you're looking for PI (provider independent) resources from RIPE, the costs to the LIR (on top of their annual membership fees) is around 50€/year. An ASN and a /48 of IPv6 PI space would therefore clock in around 100€/year (which is in line with the GP's pricing).

Membership fees are around 1400€/year, with a 2000€ signup fee. The number of PA (provider assigned) resources you have has no bearing on your membership fee. If you only have a single /22 of IPv4 PA space (the maximum you can get as a new LIR today) or you have several /16s, it makes no difference to your membership fees (this wasn't always the case, the fee structure changes regularly).

(EDIT: Source: the RIPE website, and the invoices they've sent me for my LIR membership fees)


"An Open Platform to Teach How the Internet Practically Works" from NANOG perhaps:

* https://www.youtube.com/watch?v=8SRjTqH5Z8M

The Network Startup Resource Center out of UOregon has some good tutorials on BGP and connecting networks owned by different folks:

* https://learn.nsrc.org/bgp

NANOG also has a lot of good videos on their channel from their conferences, including one on optical fibre if you want to get into the low-level ISO Layer 1 stuff:

* https://www.youtube.com/watch?v=nKeZaNwPKPo

In a similar vein, NANOG "Panel: Demystifying Submarine Cables"

* https://www.youtube.com/watch?v=Pk1e2YLf5Uc


You want to learn about BGP in order to understand how routing on the internet works. The book "BGP" by Iljitsch van Beijnum is a great place to start. Don't be put off by the publication date, as almost everything in there is still relevant.[1]

Once you understand BGP and Autonomous Systems(AS), you can then understand peering as well as some of the politics that surround it.[2]

Then you can learn more about how specific networks are connected via public route servers and looking glass servers.[3][4][5]

Probably one of the best resource though still is to work for an ISP or other network provider for a stint.

[1] https://www.oreilly.com/library/view/bgp/9780596002541/

[2] http://drpeering.net/white-papers/Internet-Service-Providers...

[3] http://www.traceroute.org/#Looking%20Glass

[4] http://www.traceroute.org/#Route%20Servers

[5] http://www.routeviews.org/routeviews/


It likely has some inaccurate info as I'm not a network engineer, but I gave a talk about BGP (with a history, protocol overview, and information on how it fails using real world examples) at Radical Networks last year. https://livestream.com/internetsociety/radnets19/videos/1980...

I tried to make it accessible to those who have only a basic understanding of home networking. Assuming you know what a router is and what an ISP is, you should be able to ingest it without needing to know crazy jargon.


It's important to recognize that there is a "layer 8" in Internet routing-- the political / business layer-- that's not necessarily expressed in technical discussion of protocols and practices. The BGP routing protocol is a place where you'll see "layer 8" decisions reflected very starkly in configuration. You may have networks that have working physical connectivity, but logically be unable to route traffic across each other because of business or political arrangements expressed in BGP configuration.


+1

Many of the comments here presume knowledge about this stuff, and I can’t follow.


Don't forget Neal Stephenson's classic, "Mother Earth, Mother Board." 25 years old but still relevant.

https://www.wired.com/1996/12/ffglass/


The business structures, ISP ownership and national telecoms have changed quite a lot in the past 25 years. But in terms of the physical OSI layer 1 challenges of laying cable across an ocean, that remains the most difficult and costly part of the process.


US and Israel looking at China's strategy at the BGP level in 2018:

https://scholarcommons.usf.edu/mca/vol3/iss1/7/



DrPeering is good material: http://drpeering.net/tools/HTML_IPP/ipptoc.html

Geoff Huston's paper "Interconnection, Peering, and Settlements" is older, but still interesting and in several ways relevant.

I suggest "Where Wizards Stay Up Late: The Origins Of The Internet" - generic and talks about Internet history, but mentions several common misconseptions.


https://mobile.twitter.com/Level3 (not an internet level, just a company :)


Were their tweets protected (i.e. only visible to approved followers) when you posted that link, or is that in response to this event?


Level3 was acquired/merged/changed to CenturyLink a year or so back; I think they closed their old Twitter account then.

When someone says Level3, read CenturyLink. L3 has been a major player for decades though (including providing the infamous 4.2.2.2 DNS server), so people still refer to them as Level3.

The account to follow for them now is https://mobile.twitter.com/CenturyLink but it won’t tell you much.


Note that L3 is a separate company from Level 3 Communications, which was the ISP that was acquired by CenturyLink. L3 is an American aerospace and C4ISR contractor.

CenturyLink's current CEO, Jeff Storey, was actually the pre-acquisition Level 3 CEO.


Read Internet Routing Architectures by Sam Halabi. It’s almost 20 years old now but BGP hasn’t changed and the book is still called The Bible by routing architects.


It's dated and not particularly useful if you want to learn how things are really done on the internet in a practical sense. So if you read it, be prepared to unlearn a bunch of stuff.


I don't know something holistic, but if you are the Wikipedia rampage sort of person, here is a good place to start:

https://en.wikipedia.org/wiki/Internet_exchange_point


"Tubes" is a good book to get an high level overview: https://www.penguin.co.uk/books/178533/tubes/9780141049090


No particular resource to recommend, though I first learned about it in a book by Radia Perlman, but BGP is a protocol you don't hear much about unless you work in networking, and is one of the key pieces in a lot of wide-scale outages. I'd start with that.


read the last 26 years of NANOG archives


Odd, I'm trying to reach a host in Germany (AS34432) from Sweden but get rerouted Stockholm-Hamburg-Amsterdam-London-Paris-London-Atlanta-São Paulo after which the packets disappear down a black hole. All routing problems occur within Cogentco.

    3  sth-cr2.link.netatonce.net (85.195.62.158) 
    4  te0-2-1-8.rcr51.b038034-0.sto03.atlas.cogentco.com 
    5  be3530.ccr21.sto03.atlas.cogentco.com (130.117.2.93)
    6  be2282.ccr42.ham01.atlas.cogentco.com (154.54.72.105)  
    7  be2815.ccr41.ams03.atlas.cogentco.com (154.54.38.205) 
    8  be12194.ccr41.lon13.atlas.cogentco.com (154.54.56.93)   
    9  be12497.ccr41.par01.atlas.cogentco.com (154.54.56.130)  
   10  be2315.ccr31.bio02.atlas.cogentco.com (154.54.61.113)  
   11  be2113.ccr42.atl01.atlas.cogentco.com (154.54.24.222)  
   12  be2112.ccr41.atl01.atlas.cogentco.com (154.54.7.158)
   13  be2027.ccr22.mia03.atlas.cogentco.com (154.54.86.206)
   14  be2025.ccr22.mia03.atlas.cogentco.com (154.54.47.230)
   15  * level3.mia03.atlas.cogentco.com (154.54.10.58) 
   16  * * *
   17  * * *


What seems to have happened is that CenturyLink's internal routing has collapsed in some way. But they're still announcing all routes, and they don't stop announcing routes when other ISPs tag their routes not to be exported by CenturyLink.

So as other providers shut down their links to CenturyLink to save themselves, the outgoing packets towards CenturyLink travel to some part of the world where links are not shut down yet.


I'm having issues reaching IP addresses unrelated to Cloudflare. Based on some traceroutes, it seems AS174 (Cogent) and AS3356 (Level 3) are experiencing major outages.


Is there any one place that would be a good first place to go to check on outages like this?

It would be really cool and useful to have a "public Internet health monitoring center"... this could be a foundation that gets some financing from industry and maintains a global internet health monitoring infrastructure, plus a central site at which all the major players announce outages. It would be pretty cheap and have a high return on investment for everybody involved.


In the network world there's the outages mailing list:

https://puck.nether.net/mailman/listinfo/outages

Public archives:

https://puck.nether.net/pipermail/outages/

Latest issue reported:

https://puck.nether.net/pipermail/outages/2020-August/013187... "Level3 (globally?) impacted (IPv4 only)"



Based on that map, Telia seems to be one of the most affected, which might explain why Scandinavia is so badly affected.


Until that site also goes down.


Indeed, if we're to have a public Internet health meter, it must be distributed and hosted/served from "outside" somehow, to be resilient to all or parts of the network being down.


Here's a thought: we should all be outside. :D


Something something anycast.


This is an excellent idea and simple but moderately expensive for anyone to set up.

Just have a site fetch resources from every single hosting provider everywhere. A 1x1 image would be enough, but 1K/100K/1M sized files might also be useful (they could also be crafted images)

The first step would be making the HTML page itself redundant. Strict round robin DNS might work well for that.

But yeah, moderately expensive - and... thinking about it... it'll honestly come in handy once every ten years? :/
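
The probe half is nearly trivial; a minimal sketch, with hypothetical placeholder URLs standing in for the small objects you'd host on each provider:

    import concurrent.futures
    import urllib.request

    PROBES = {
        "provider-a": "https://provider-a.example/pixel.png",  # placeholder URLs
        "provider-b": "https://provider-b.example/pixel.png",
    }

    def check(name, url, timeout=5):
        try:
            urllib.request.urlopen(url, timeout=timeout)
            return name, "up"
        except Exception as exc:
            return name, f"down ({exc})"

    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = [pool.submit(check, n, u) for n, u in PROBES.items()]
        for fut in concurrent.futures.as_completed(futures):
            name, status = fut.result()
            print(f"{name}: {status}")

The hard (and expensive) part is the rest: hosting the page itself redundantly, as mentioned above, and keeping the probe list current.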


I go here :-)


Sounds like a good idea. The closest I know is the one from Pingdom, which I use the most. It's not detailed enough though. https://livemap.pingdom.com/


You just imagined the first target in an attack. Might as well just call it honeypotnumber1.


Reddit, HN, etc. are inaccessible to me over my Spectrum fiber connection, but working on AT&T 4G. It’s not DNS, so a tier 1 ISP routing issue seems to be the most likely cause.


Lots of local sites not working in Scandinavia either. So seems more global than a single Tier 1?


Probably relevant Fastly update:

> Fastly is observing increased errors and latency across multiple regions due to a common IP transit provider experiencing a widespread event. Fastly is actively working on re-routing traffic in affected regions.


HN and Reddit out on my TalkTalk link in London; Three mobile 4G working normally.


Can confirm for a number of sites, even Hacker News was unreachable for me.


This explains a lot. Initially thought my mobile phone Internet connectivity was flakey because I couldn't access HN here in Australia, whilst it's fine over wi-fi (wired Internet).


It's the reverse for me. The broadband fails to connect to HN but my mobile ISP is able to reach it fine.


Because networks are connected to others via different paths, it's not unusual that one method of connectivity would work and one doesn't.

Also the Internet has lots of asymmetric traffic, just because a forward path towards a destination may look the same from different networks, it doesn't mean the reverse path will be similar.


Same for me in midwest US.

I first thought I had broken my DNS filter again through regular maintenance updates, then I suspected my ISP/modem because it regularly goes out. I have never seen the behavior I saw this morning: some sites failing to resolve.


I thought Cloudflare was having issues again, since I use their DNS servers, so I started by changing that. Then I tried restarting everything, modem/router/computer. Wasn't until I connected to a VM that a friend hosts that I was finally able to access HN, and thus saw this thread.

Hopefully this will get fixed within a reasonable timespan.


ycombinator.com pinged just fine but news.ycombinator.com dropped 100% of packets. But all better now...


I was so pissed at Waze earlier for giving up on me in a critical moment. Then I found out I'm also unable to send iMessages, but I was curious, since I could browse the web just fine.

When something doesn't work I always assume it's a problem with my device/configuration/connection.

Who would have thought it's a global event such as the repeated Facebook SDK issues.


Yep, I had a similar experience. Sites that didn't work from my home connection worked fine on mobile. After rebooting and it persisted, I assumed it was just a DNS or routing issue since they were both connecting to different networks.


Looks like CenturyLink/Level3 (AS3356) might not be withdrawing routes after people close their peering?


That's what various networks have reported.

It kind of makes it hard to route around an upstream, if they keep announcing your routes even when there isn't a path to you!


Quick hack: split all your announcements in two, making the new announcements route around their old stale announcement by being more specific.
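
To illustrate what "more specific" buys you (a sketch; 203.0.113.0/24 is a documentation prefix standing in for your real one - longest-prefix match is what makes the new /25s win over the stale /24):

    import ipaddress

    prefix = ipaddress.ip_network("203.0.113.0/24")
    more_specifics = list(prefix.subnets(prefixlen_diff=1))
    print(more_specifics)
    # [IPv4Network('203.0.113.0/25'), IPv4Network('203.0.113.128/25')]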


What could cause this? I wonder what the technical problem is.


These are usually called 'BGP Zombies', and here's a good summary of their prevalence and usual causes: https://labs.ripe.net/Members/romain_fontugne/bgp-zombies

In this case however, it seems to be an L3/CL-specific bug.


I would love to hear the inside scoop from folks working at CenturyLink. I’ve used their DSL for years and the network is a mess. I don’t know if it's them here or legacy Level3, but I have a guess.

Edit: Looks like I would have guessed wrong :P. Still want that inside scoop!


Used Level3 IP for a long time professionally with limited issues; certainly not on the list of worst ISPs.

Also used a company that over the years has gone from Genesis to GlobalCrossing to Vyvx to Level3, and now of course Level 3 is CenturyLink, which has been fine.


We had this once with one of our former ISPs configuring static routes towards us and announcing them to a couple of IXPs. I have no idea why they did it, but it caused a major downtime once for us and basically signed the termination.


Misread the headline as "Level 3 Global Outrage" and thought "someone had defined outrage levels?" and "it doesn't matter, he'll just attribute it to the Deep State".

In some ways I'm a little bit disappointed it's only a glitch in the internet.


Can somebody please clarify - what exactly is this an outage of, and how serious is it?


Here is a fantastic, though somewhat outdated overview [1]. Section 5 is most relevant to your question. The network topology today is a little different. Think of Level3 as an NSP, which is now called a "Tier 1 network" [2]. The diagram should show links among the Tier 1 networks ("peering"), but does not.

[1] https://web.stanford.edu/class/msande91si/www-spr04/readings...

[2] https://en.wikipedia.org/wiki/Tier_1_network


tl;dr One of the large Internet backbone providers (formerly known as Level3, now usually known as CenturyLink) that many ISPs use is down. Expect issues connecting to portions of the Internet.

Usually the Internet is a bit more resilient to these kinds of things, but there are complicating factors with this outage making it worse.

Expect it to mostly be resolved today. These things have been happening a bit more frequently, but historically they average up to a couple of times a year.


Is this affecting all geographic regions?


US, Europe, and Asia that I'm aware of (NANOG mailing list).


Had to laugh: "I'm seeing complaints from all over the planet on Twitter"

The one site I can't see is Twitter. (Not a heart-wrenching loss, mind you...)


I could not get on HN as a logged-in person (logged out was OK) during this. I wondered how big the Cloudflare thread would be if people could get on to comment on it :-)


CNN just blames Cloudflare.. :facepalm: https://edition.cnn.com/2020/08/30/tech/internet-outage-clou...


CNN is absolutely right. Every day I read news that something goes down at Cloudflare. Cloudflare does much more harm than they "fix" with their services.


I guess that's why HN was temporarily unreachable from my home?


and why Cloudflare was having so many issues https://www.cloudflarestatus.com/


Oh lord. I'm oncall and we were like "WHATS HAPPENING"


Same here :) A couple of companies started complaining. Told them it's a worldwide issue. It seems to be going better at the moment.


No peering problems from my network with Level3 in London Telehouse West, maybe a minute or so of increased latency at 10:09 GMT

Routing to a Level3 ISP that serves one of my offices in the States peers with London15.Level3.net.

No problem to my Cogent ISP in the States, although we don't peer directly with Cogent; that traffic bounces via Telia.

Going east from London, a 10-second outage at 12:28:42 GMT on a route that runs from me through Level3 to Tata in India, but no rerouting.


So, that's why HN is unreachable from Belgium at the moment (right when I was trying to figure out a DNS cache problem in Firefox, of course).

An ssh tunnel through OVH/gravelines is working so far. edit: Proximus. edit2: also, Orange Mobile


HN working for me from the UK on BT, but traceroute is showing a lot of bouncing around and lots of different hops in the US

  7  166-49-209-132.gia.bt.net (166.49.209.132)  9.877 ms  8.929 ms
    166-49-209-131.gia.bt.net (166.49.209.131)  8.975 ms
  8  166-49-209-131.gia.bt.net (166.49.209.131)  8.645 ms  10.323 ms  10.434 ms
  9  be12497.ccr41.par01.atlas.cogentco.com (154.54.56.130)  95.018 ms
    be3487.ccr41.lon13.atlas.cogentco.com (154.54.60.5)  7.627 ms
    be12497.ccr41.par01.atlas.cogentco.com (154.54.56.130)  102.570 ms
  10  be3627.ccr41.jfk02.atlas.cogentco.com (66.28.4.197)  89.867 ms
    be12497.ccr41.par01.atlas.cogentco.com (154.54.56.130)  101.469 ms  101.655 ms
  11  be2806.ccr41.dca01.atlas.cogentco.com (154.54.40.106)  103.990 ms  93.885 ms
    be3627.ccr41.jfk02.atlas.cogentco.com (66.28.4.197)  97.525 ms
  12  be2112.ccr41.atl01.atlas.cogentco.com (154.54.7.158)  106.027 ms
    be2806.ccr41.dca01.atlas.cogentco.com (154.54.40.106)  98.149 ms  97.866 ms
  13  be2687.ccr41.iah01.atlas.cogentco.com (154.54.28.70)  120.558 ms  122.330 ms  120.071 ms
  14  be2687.ccr41.iah01.atlas.cogentco.com (154.54.28.70)  123.662 ms
    be2927.ccr21.elp01.atlas.cogentco.com (154.54.29.222)  128.351 ms
    be2687.ccr41.iah01.atlas.cogentco.com (154.54.28.70)  120.746 ms
  15  be2929.ccr31.phx01.atlas.cogentco.com (154.54.42.65)  145.939 ms  137.652 ms
    be2927.ccr21.elp01.atlas.cogentco.com (154.54.29.222)  128.043 ms
  16  be2930.ccr32.phx01.atlas.cogentco.com (154.54.42.77)  150.015 ms
    be2940.rcr51.san01.atlas.cogentco.com (154.54.6.121)  152.793 ms  152.720 ms
  17  be2941.rcr52.san01.atlas.cogentco.com (154.54.41.33)  152.881 ms
    te0-0-2-0.rcr11.san03.atlas.cogentco.com (154.54.82.66)  153.452 ms
    be2941.rcr52.san01.atlas.cogentco.com (154.54.41.33)  152.054 ms
  18  te0-0-2-0.rcr12.san03.atlas.cogentco.com (154.54.82.70)  162.835 ms
    te0-0-2-0.nr11.b006590-1.san03.atlas.cogentco.com (154.24.18.190)  146.643 ms
    te0-0-2-0.rcr12.san03.atlas.cogentco.com (154.54.82.70)  153.714 ms
  19  te0-0-2-0.nr11.b006590-1.san03.atlas.cogentco.com (154.24.18.190)  151.212 ms  145.735 ms
    38.96.10.250 (38.96.10.250)  147.092 ms
  20  38.96.10.250 (38.96.10.250)  149.413 ms * *


Guessing the traceroute looks a bit messy because of multiple paths being available.

You can use `-q 1` to send a single traceroute probe/query instead of the default 3, it might make your traceroute look a little cleaner.


I don't normally see multiple paths for a given IP, but that aside it's bouncing through far more hops than I'd expect. That said, it's rare that I look at traceroutes across the continental U.S.; maybe that many layer 3 hops are normal, or maybe routes change constantly.

HN has dropped off completely from work - I see the route advertised from Level 3 (3356 21581 21581) and from Telia via Cogent (1299 174 21581 21581). The Telia path is longer, so traffic goes into Level3 at Docklands via our 20G peer to London15, but seems to get no further.

Heading to Tata in India, the route out is via the same peer to Level3, then on through London and Marseille, where it peers with Tata, and that's working fine.

My gut feeling is a core problem in Level3's continental US network rather than something more global.


This is normal for Cogent. They do per-packet load balancing across ECMP links. What you're seeing is normal for the given configuration.


It was also down from South Africa. It's luckily up now. Gasps for breath


Was down from Latvia too, up now.


In a situation like this, what are the best "status" sites to be watching?


HN is not the worst place, honestly.


Agreed. I went to Reddit r/networking and the mods were closing helpful threads in real-time :(


HN was down for me, unfortunately. (Connecting from Japan, so most CDN-based websites load fine since they aren't routed via Europe.)


https://downdetector.com/ - the client perspective is the best perspective ;) The problem in this outage is that site X works OK, but the transit provider for clients in the US works badly and generates "false positives".


For a situation like this, the various tools hosted by RIPE are likely your best bet. You won't get a pretty green/red picture, but you'll get more than enough data to work with.


stat.ripe.net
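
If you want raw data rather than a dashboard, the RIPEstat Data API behind stat.ripe.net can be queried directly. A rough sketch using only the Python stdlib; the endpoint name and response fields are from memory, so treat them as assumptions rather than gospel:

  import json
  import urllib.request

  # Ask RIPEstat for the routing status of AS3356 (Level3/CenturyLink).
  url = "https://stat.ripe.net/data/routing-status/data.json?resource=AS3356"
  with urllib.request.urlopen(url) as resp:
      payload = json.load(resp)

  # Print the call status plus whichever fields this data call returns.
  print(payload["status"])
  print(sorted(payload["data"].keys()))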


Nanog is also pretty helpful for this specific type of issue


Here's a direct link to this month's messages:

https://mailman.nanog.org/pipermail/nanog/2020-August/thread...


You mean nanog.org? I don't see a stats page linked in their menu.


It’s a mailing list for network operations/engineering folks. The emails are the status updates. You’ll have to look to each network’s own site if you want connectivity, peering, and IXP red/green, up/down status.


Ham radio might be the answer to this one day!



Except for the fact that internetweathermap.com is super green, and the internet is not currently super green.


Currently working on a project[1] to monitor the whole 3rd-party stack you use for your services. Hit me up if you want access; I'll give free access for a year+ to some folks to get feedback.

[1] https://monitory.io


Your front page has a typo: "titme".

Since hacker news was down yesterday I couldn't reply here, so I tried to send you an email, but that failed to deliver, as there are no MX records for monitory.io...


This had me really confused until I saw it was a global outage. I have been getting delayed iOS push notifications (from Prowl) for the last few hours, from a device I was fairly sure I had disconnected 3 hours ago (a pump).

Started questioning whether I really disconnected it before I left.

I'm wondering if we're at the point where internet outages should have some kind of (emergency) notification/SMS sent to _everyone_.


NANOG are talking about a CenturyLink outage and BGP flapping (AS 3356) as of 03:00 US/Pacific, AS209 possibly also affected.

AS3356 is Level 3, AS209 is CenturyLink.

https://mailman.nanog.org/pipermail/nanog/2020-August/209359...


DDG and Down Detector are both very slow. Both are on Cloudflare.

Fastly, HN, Reddit too.

Only Google domains are loading here.


From where I am (mid-Atlantic US), Google sites are completely down (google.com, YouTube).


> "Root Cause: An offending flowspec announcement prevented BGP from establishing correctly, impacting client services."

--

That doesn't really explain the "stuck" routes in their RRs (route reflectors)... maybe it'll make sense once we've gotten some more details...


This might be a silly question, but is there such a thing as CI/CD for this sort of thing that might have caught the problem?


There are two aspects to this:

1. Is there syntax correctness checking available, so you don't push a config that breaks machines? Yes; a rough sketch of that kind of pre-push check is below.

2. Is there a DWIM check available, so you can see the effect of the change before committing? No. That would require a complete model of, at a minimum, your entire network plus all directly connected networks -- that still wouldn't be complete, but it could catch some errors.
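
To make point 1 concrete, here's a minimal sketch of the kind of pre-push sanity check a CI pipeline could run. The rule format is hypothetical (my own, not any vendor's flowspec syntax); the idea is just to refuse rules that would match far more traffic than intended:

  import ipaddress

  def validate_rule(rule: dict) -> list[str]:
      """Return a list of problems; an empty list means the rule looks sane."""
      problems = []
      try:
          dst = ipaddress.ip_network(rule["dst"])
      except (KeyError, ValueError):
          return ["rule needs a valid 'dst' prefix"]
      if dst.prefixlen == 0:
          problems.append("matches every destination (default route)")
      if rule.get("action") == "discard" and dst.prefixlen < 8:
          problems.append("discard on a huge prefix; require manual review")
      return problems

  print(validate_rule({"dst": "0.0.0.0/0", "action": "discard"}))       # two problems flagged
  print(validate_rule({"dst": "192.0.2.0/24", "action": "rate-limit"}))  # []

Point 2 is where it gets hard: knowing what a rule actually does to a live network needs a model of the network itself, so checks like this only catch the crudest mistakes.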



Everything to Oracle Cloud's Ashburn US-East location is down.

Their console isn't responding at all and all my servers are unreachable. Their status console reports all normal though.


Company status pages are just PR exercises for them; most of the time they don't report what's actually up.



Seems like "the internet" works again here in Norway. I've been limited to local sites all day.

Hacker News has been down for several hours for me.

Whatever it was, it must have been nasty.


I had the same issue on my fiber connection (Altibox/BKK); however, no problems on my mobile using 4G (Dipper/Telenor).


I couldn't reach HN on either Altibox or 4G/Telenor.


Both Altibox and Telia 4G were down for me as well.


There is a major internet outage going on. I am using Scaleway and they are also affected. According to Twitter, Vodafone, CityLink and many more are affected too.


The beginning of WWIII probably looks something like this.


I'm having lots of issues with Hetzner machines not being available (and even the hetzner.com website). Don't know if this is related.


FYI, I'm not having any problems right now with hetzner.com or hetzner.de - my own dedicated server hosted in a Hetzner datacenter in Germany seems to be reachable/working as well.

Connecting from Switzerland.

