It's a super useful tool if you want to blast out an ACL across your network in seconds (using BGP), but it has a number of sharp edges. Several networks, including Cloudflare, have learned what it can do. I've seen a few networks basically blackhole traffic or even lock themselves out of routers due to a poorly made Flowspec rule or a bug in the implementation.
Edit: if you are a Level3 customer, shut down your sessions to them.
There was a huge AT&T outage in 1990 that cut off most US long distance telephony (which was, at the time, mostly "everything not within the same area code").
It was a bug. It wasn't a reconvergence event, but it was a distant cousin: Something would cause a crash; exchanges would offload that something to other exchanges, causing them to crash -- but with enough time for the original exchange to come back up, receive the crashy event back, and crash again.
The whole network was full of nodes crashing, causing their peers to crash, ad infinitum. In order to bring the network back up, they would have needed to take everything down at the same time (and make sure all the queues were emptied), but even that wouldn't have made it stable, because a similar "patient 0" event would have brought the whole network down again.
Once the problem was understood, they reverted to an earlier version which didn't have the bug, and the network re-stabilized.
The lore I grew up on is that this specific event was very significant in pushing and funding research into robust distributed systems, of which the best known result is Erlang and its ecosystem - originally built, and still mostly used, to make sure that phone exchanges don't break.
This was covered in a book (perhaps Safeware, but maybe another one I don't recall) along with the Therac-25, the Ariane 5, and several others. Unfortunately these lessons need to be relearned by each generation. See the 737-Max...
That lesson will really never be learned. This happens on a daily basis all over the planet with people who have not been bitten - yet.
We are very bad at avoiding these sorts of rare, catastrophic events.
If a team manager at e.g. Google was complaining about how automation gets in the way and wanted to bypass it, they wouldn't last too long.
It’s been many years since I read it, but I recall it being a very interesting read.
The lessons are definitely still taught; I don't know if they're actually learned, of course. And who knows who actually taught the 737-Max software devs, I don't suppose they're fresh out of uni.
I always wanted companies to hire people managers as its own career path. An engineer can be an excellent technical lead or architect, but it can feel like you started over once you're responsible for the employees, their growth, and their career path.
The build that broke it was rushed out and never fully tested, adding a fairly useless feature for said higher-up that improved the UX for users with multiple houses on their account.
I can't remember where I read about this, but I recall the problem was called "The Creeping Crud from California". Sadly, this phrase apparently does not appear anywhere on the internet. Did I imagine this?
The incident is detailed in RFC 789:
I believe that I read about this episode in Hans Moravec's book 'Mind Children'. I can see in Google Books that chapter 5 is on 'Wildlife', and there is a section 'Spontaneous Generation', which promises to talk about a "software parasite" which emerged naturally in the ARPAnet - but of which the bulk is not available:
The idea is it takes away enough foot-guns that if you're banging your head against systems written in it, you'd be banging your head even harder and more often if the same implementor had used another language.
> In this instance, the malformed packets [Ethernet frames?] included fragments of valid network management packets that are typically generated. Each malformed packet shared four attributes that contributed to the outage: 1) a broadcast destination address, meaning that the packet was directed to be sent to all connected devices; 2) a valid header and valid checksum; 3) no expiration time, meaning that the packet would not be dropped for being created too long ago; and 4) a size larger than 64 bytes.
At least they have exponential backoff I guess.
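For anyone who hasn't implemented it: a minimal sketch of exponential backoff with jitter (the jitter matters at least as much, since it's what prevents synchronized retry storms). Names and limits here are purely illustrative, not anyone's actual implementation:

    import random
    import time

    def retry_with_backoff(operation, max_attempts=6, base_delay=0.5, max_delay=30.0):
        """Retry `operation`, sleeping exponentially longer (with full jitter) between attempts."""
        for attempt in range(max_attempts):
            try:
                return operation()
            except Exception:
                if attempt == max_attempts - 1:
                    raise  # out of attempts, surface the last error
                # Exponential cap, then pick a random delay below it ("full jitter").
                cap = min(max_delay, base_delay * (2 ** attempt))
                time.sleep(random.uniform(0, cap))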
Not a terribly big fan of these queueing systems. People always seem to bung things up in ways they are not quite equipped to fix (in the “you are not smart enough to debug the code you wrote” sense).
Last time I had to help someone with such a situation, we discovered that the duplicate processing problem had existed for >3 months prior to the crisis event, and had been consuming 10% of the system capacity, which was just low enough that nobody noticed.
If anything, the alert is too sensitive.
0: i.e. something that takes, say, a UDP packet on port NNNN containing a whole raw IPv4 packet, throws away the wrapping, and drops the IPv4 packet onto its own network interface. This is safe - the packet must shrink by a dozen or two bytes with each retransmission - but usually not actually set up anywhere.
Edit: It probably wouldn't work for TCP though - maybe try Tor?
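For concreteness, a rough sketch of what such a decapsulator could look like on Linux (hypothetical port number, no validation, needs root for the raw socket) - an illustration of the idea, not a hardened tool:

    import socket

    LISTEN_PORT = 9999  # stands in for the hypothetical "port NNNN" above

    # Receives the wrapped packets.
    outer = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    outer.bind(("0.0.0.0", LISTEN_PORT))

    # Re-injects the inner IPv4 packet. IPPROTO_RAW implies IP_HDRINCL, so the
    # kernel sends the payload's own IP header verbatim. Needs root.
    raw = socket.socket(socket.AF_INET, socket.SOCK_RAW, socket.IPPROTO_RAW)

    while True:
        payload, _src = outer.recvfrom(65535)
        if len(payload) < 20:
            continue  # too short to be an IPv4 packet, drop it
        dst = socket.inet_ntoa(payload[16:20])  # destination address: IPv4 header bytes 16..19
        raw.sendto(payload, (dst, 0))  # port is ignored for raw sockets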
 Any of the various https://en.wikipedia.org/wiki/Virtual_private_network technologies (WireGuard, IPSec, SOCKS TLS proxies, etc.)
 As you mention, a Tor SOCKS proxy
Not an expert, but Erlang is listed as 1986, so it would seem not directly related: https://en.wikipedia.org/wiki/Erlang_(programming_language)
If this is indeed a reconvergence event, that would imply there’s been a cascade of route table updates that have been making their way through CTL/L3’s network - meaning many routers are missing the “correct” paths to prefixes and traffic is not going where it is supposed to, either getting stuck in a routing loop or just going to /dev/null because the next hop isn’t available.
This wouldn’t be such a huge issue if downstream systems could shut down their BGP sessions with CTL and have traffic come in via other routes, but doing so is not resulting in the announcements being pulled from the Level 3 AS - something usually reflective of the CPU on the routers being overloaded processing route table updates or an issue with the BGP communication between them.
Convergence time is a known bugbear of BGP.
For each IP range described in the rumor table, each network is free to choose whichever rumor they like best among all they have heard, and send traffic for that range along the described path. Typically this is the shortest, but it doesn't have to be.
ISPs will pass on their favorite rumor for each range, adding themselves to the path of networks. (They must also withdraw the rumors if they become disconnected from their upstream source, or their upstream withdraws them.) Businesses like hosting providers won't pass on any rumors other than those they started, as no one involved wants them to be a path between the ISPs. (Most ISPs will generally restrict the kinds of rumors their non-ISP peers can spread, usually in terms of what IP ranges the peer owns.)
Convergence in BGP is easy in the "good news" direction, and a clusterfuck in the "bad news" direction. When a new range is advertised, or the path is getting shorter, it is smooth sailing, as each network more or less just takes the new route as is and passes it on without hesitation. In the bad news direction, where either something is getting retracted entirely, or the path is going to get much longer, we get something called "path hunting."
As an example of path hunting: Let's say the old paths for a rumor were A-B-C and A-B-D, but C is also connected to D. (C and D spread rumors to each other, but the extended paths A-B-C-D and A-B-D-C are longer, thus not used yet.) A-B gets cut. B tells both C and D that it is withdrawing the rumor. Simultaneously, D looks at the rumor A-B-C-D and C looks at the rumor A-B-D-C, and each says "well, I've got this slightly worse path lying around, might as well use it." Then they spread that rumor to their downstreams, not realizing that it is vulnerable to the same event that cost them the more direct route. (They have no idea why B withdrew the rumor from them.) The paths, especially when removing an IP range entirely, can get really crazy. (A lot of core internet infrastructure uses delays to prevent the same IP range from updating too often, which tamps down on the crazy path exploration and can actually speed things up in these cases.)
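If it helps, here's a toy path-vector simulation of exactly that scenario (A-B, B-C, B-D, C-D; shortest loop-free path wins; no timers or policy) - purely illustrative, but it shows C and D briefly "hunting" onto each other's stale, longer paths after the A-B cut before everything is finally withdrawn:

    from collections import defaultdict

    # Toy path-vector ("rumor") simulation of one prefix originated by A.
    # Topology: A-B, B-C, B-D, C-D. Shortest loop-free AS path wins; no timers.
    links = {"A": {"B"}, "B": {"A", "C", "D"}, "C": {"B", "D"}, "D": {"B", "C"}}
    rib_in = defaultdict(dict)  # rib_in[node][neighbor] = AS path heard from that neighbor

    def best_path(node):
        if node == "A":
            return ()  # A originates the prefix itself
        ok = [p for p in rib_in[node].values() if p is not None and node not in p]
        return min(ok, key=len) if ok else None  # None = no route at all

    def one_round():
        """Every node re-selects, then tells each neighbor its (possibly withdrawn) path."""
        chosen = {n: best_path(n) for n in links}
        for node, nbrs in links.items():
            for nbr in nbrs:
                if chosen[node] is None:
                    rib_in[nbr][node] = None  # withdrawal
                else:
                    rib_in[nbr][node] = (node,) + chosen[node]  # prepend own AS
        return chosen

    for _ in range(4):  # let the initial topology converge
        one_round()
    print("converged:", {n: best_path(n) for n in "BCD"})

    # Cut the A-B link: B never hears the rumor from A again.
    links["A"].discard("B")
    links["B"].discard("A")
    rib_in["B"].pop("A", None)

    for i in range(4):
        state = one_round()
        print(f"round {i} after cut:", {n: state[n] for n in "BCD"})
    # Round 1 shows C falling back to the stale, longer path via D (and D via C)
    # before both realize it's dead and withdraw - path hunting in miniature.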
IP network routing is distributed systems within distributed systems. For whatever reason the distributed system that is the CenturyLink network isn't "converging" - or, we could say, becoming consistent, or settling - in a timely manner.
But customers are likely to get one, at least if they request it.
I had this problem two years ago while I was taking Go lessons online from a South Korean professional Go Master. For my last job we were renting a home well outside city limits in Illinois and our Internet failed often. I lost one game in an internal teaching tournament because of a failed connection, and jumped through hoops to avoid that problem.
Wasn't able to access HN from India earlier, but other Cloudflare-enabled services were accessible. I assume several network engineers were woken up from their Sunday morning sleep to fix the issue; if any of them are reading this, I appreciate your effort.
For this to work on an internet competition, the judges would need a backup, possibly very low bandwidth communication mechanism that survives a network outage.
This wouldn’t save any real-time esports, but would be serviceable for turn based systems.
The games are timed and this pause gives a lot of thinking time. If they're allowed to talk with others during the pause, then also consulting time.
> why don't they start over
That would be unfair to the player who was ahead.
That said, both players might still be fine with a clean rematch, because being the undisputed winner feels better. I wonder if they were asked (anonymously to prevent public hate) whether they would be fine with a rematch.
Namely, in this case, it seems like the “right thing” is for games to not derive their ELO contributions from pure win/loss/draw scorings at all, but rather for games to be converted into ELO contributions by how far ahead one player was over the other at the point when both players stopped playing for whatever reason (where checkmate, forfeit, and game disruption are all valid reasons.) Perhaps with some Best-rank (https://www.evanmiller.org/how-not-to-sort-by-average-rating...) applied, so that games that go on longer are “more proof” of the competitive edge of the player that was ahead at the time.
Of course, in most central cases (of chess matches that run to checkmate or a “deep” forfeit), such a scoring method would be irrelevant, and would just reduce to the same data as win/loss/draw inputs to ELO would. So it’d be a bunch of effort only to solve these weird edge cases like “how does a half-game that neither player forfeited contribute to ELO.”
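For reference, standard Elo only ever sees a result score in {1, 0.5, 0}. A minimal sketch of the usual update (conventional K-factor and 400 scaling; the fractional "adjudicated" score for an interrupted game is the hypothetical part being proposed above):

    def expected_score(rating_a, rating_b):
        """Elo's expected score for A against B."""
        return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

    def elo_update(rating_a, rating_b, score_a, k=20.0):
        """New (rating_a, rating_b) after one game.

        score_a is 1 for an A win, 0.5 for a draw, 0 for a loss - or, per the
        idea above, hypothetically some adjudicated value in [0, 1] for a game
        that was interrupted.
        """
        delta = k * (score_a - expected_score(rating_a, rating_b))
        return rating_a + delta, rating_b - delta

    print(elo_update(1600, 1500, 1.0))  # a full win moves ratings the normal amount
    print(elo_update(1600, 1500, 0.7))  # an adjudicated "70% ahead" moves them less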
Except for the obvious positions that no one serious would even play, there is no agreed-upon way of calculating who has an advantage in chess like that. One man's terrible mobility and probable blunder is another's brilliant stratagem.
Still, just to spitball: one “obvious” approach, at least in our modern world where technology is an inextricable part of the game, would be to ask a chess-computer: “given that both players play optimally from now on, what would be the likelihood of each player winning from this starting board position?” The situations where this answer is hard/impossible to calculate (i.e. estimations close to the beginning of a match) are exactly the situations where the ELO contribution should be minuscule anyway, because the match didn’t contribute much to tightening the confidence interval of the skill gap between the players.
Of course, players don’t play optimally. I suspect that, given GPT-3 and the like, we’ll soon be able to train chess-computers to mimic specific players’ play-styles and seeming limits of knowledge (insofar as those are subsets of the chess-computer’s own capabilities, that it’s constraining its play to.) At that point, we might actually be able to ask the more interesting question: “given these two player-models and this board position, in what percentage of evolutions from this position does player-model A win?”
Interestingly, you could ask that question with the board position being the initial one, and thus end up with automatically-computed betting odds based on the players’ last-known skill (which would be strictly better than ELO as a prediction on how well an individual pair of players would do when facing off; and therefore could, in theory, be used as a replacement for ELO in determining who “should” be playing whom. You’d need an HPC cluster to generate that ladder, but it’d be theoretically possible, and that’s interesting.)
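That last question is basically a Monte Carlo estimate. Assuming you had such player models (entirely hypothetical here - something exposing a choose_move(board) method), the skeleton with python-chess would be roughly:

    import chess  # python-chess

    def estimate_outcomes(fen, model_white, model_black, n_games=1000):
        """Fraction of playouts from `fen` that White wins / that are drawn.

        model_white / model_black are hypothetical player models with a
        choose_move(board) -> chess.Move method, e.g. policies trained to
        imitate specific players.
        """
        wins = draws = 0
        for _ in range(n_games):
            board = chess.Board(fen)
            while not board.is_game_over(claim_draw=True):
                model = model_white if board.turn == chess.WHITE else model_black
                board.push(model.choose_move(board))
            result = board.result(claim_draw=True)
            wins += result == "1-0"
            draws += result == "1/2-1/2"
        return wins / n_games, draws / n_games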
>To use the old Internet as a “superhighway” analogy, that’s like only having a single offramp to a town. If the offramp is blocked, then there’s no way to reach the town. This was exacerbated in some cases because CenturyLink/Level(3)’s network was not honoring route withdrawals and continued to advertise routes to networks like Cloudflare’s even after they’d been withdrawn. In the case of customers whose only connectivity to the Internet is via CenturyLink/Level(3), or if CenturyLink/Level(3) continued to announce bad routes after they'd been withdrawn, there was no way for us to reach their applications and they continued to see 522 errors until CenturyLink/Level(3) resolved their issue around 14:30 UTC.
The same was a problem on the other (“eyeball”) side of the network. Individuals need to have an onramp onto the Internet’s superhighway. An onramp to the Internet is essentially what your ISP provides. CenturyLink is one of the largest ISPs in the United States.
Because this outage appeared to take all of the CenturyLink/Level(3) network offline, individuals who are CenturyLink customers would not have been able to reach Cloudflare or any other Internet provider until the issue was resolved. Globally, we saw a 3.5% drop in global traffic during the outage, nearly all of which was due to a nearly complete outage of CenturyLink’s ISP service across the United States.
It's cool to see something large enough that the auto-healing mechanisms weren't able to handle it on their own. Shoutout to whoever was on the weekend support/SRE shift, though; that stuff was never fun to deal with when you were one of a few reduced staff on the weekend shifts.
The problem is I don't know where to find out what was going on (tried looking up live DDoS-tracking websites, "is it down or is it just me" websites, etc.). I couldn't find a single place talking about this.
Is there a source where you can get instant information on Level3 / global DNS / major outages?
Outages and nanog lists are your best bet, short of being on the right IRC channels.
here are a couple of "for instance" breadcrumbs for you to start from:
I'm definitely an amateur when it comes to networking stuff. At the time, the _only_ issue I had was with all of my Digital Ocean droplets. It was confusing because I was able to get to them through my LTE connection and not able to through my home ISP. I opened a ticket with DO worried that it was my ISP blocking IP addresses suddenly. It turned out to be this outage, but it was very specific. Traceroute gave some clues, but again I'm amateur and I couldn't tell what was happening after a certain point.
So yeah, I too would love a really easy to use page that could show outages like this. It would be really great to be able to specify vendors used to really piece the puzzle together.
So I guess my takeaway from this is that if the Internet seems to be down, usually the CDN providers notice. I don't know if either of the sites actually still use Fastly (I kind of forgot they existed), but I did end up reading about the Internet being broken at some scale larger than "your friend's cable modem is broken", so that was helpful.
It would be nice if we had a map of popular sites and which CDN they use, so we can collect a sampling of what's up and what's down and figure out which CDN is broken. Though in this case, it wasn't really the CDN's fault. Just collateral damage.
To learn the technical aspect of it, you can follow any network engineering certification materials or resources that delve into dynamic routing protocols, notably BGP. Inter-ISP networking is nothing but setting up BGP sessions and filters at the technical level. Why you set these up, and under what conditions is a whole different can of worms, though.
The business and political aspect is a bit more difficult to learn without practice, but a good simulacrum can be taking part in a project like dn42, or even just getting an ASN and some IPv6 PA space and trying to announce it somewhere. However, this is no substitute for actual experience running an ISP, negotiating percentile billing rates with salespeople, getting into IXes, answering peering requests, getting rejected from peering requests, etc. :)
Disclaimer: I helped start a non-profit ISP in part to learn about these things in practice.
The books are meh because they're not written by operators. They're more academic and dated.
Plenty of clueful folks on the right IRC channels.
OK, let's go for broke: there are a LOT of clandestine IRC servers and exclusive gatekeeping of channels that you won't know about unless you have an IRL reference.
That’s fairly expensive to do just for a hobby interest, but at least the price has come down since I last looked.
Afterwards, announcing this space is probably the cheapest with vultr (but their BGP connectivity to VMs is uh, erratic at times) or with ix-vm.cloud, or with packet.net (more expensive). You can also try to colo something in he.net FMT2 and reach FCIX, or something in kleyrex in Germany. All in all, you should be able to do something like run a toy v6 anycast CDN at not more than $100 per month.
ARIN starting price is 2.5x that much (used to be 5x) for just the ASN. Glad the pricing is better elsewhere in the world at least!
If you're looking for PI (provider independent) resources from RIPE, the cost to the LIR (on top of their annual membership fees) is around 50€/year. An ASN and a /48 of IPv6 PI space would therefore clock in around 100€/year (which is in line with the GP's pricing).
Membership fees are around 1400€/year, with a 2000€ signup fee. The number of PA (provider assigned) resources you have has no bearing on your membership fee. If you only have a single /22 of IPv4 PA space (the maximum you can get as a new LIR today) or you have several /16s, it makes no difference to your membership fees (this wasn't always the case, the fee structure changes regularly).
(EDIT: Source: the RIPE website, and the invoices they've sent me for my LIR membership fees)
I feel like you ought to be part of the ffdn database: https://db.ffdn.org/
Though the "French Data Networks Federation" is a French organization, their db tries to cover every independent, nonprofit ISP in the world :)
> All subscribers to Internet access provided by the provider must be members of the provider and must have the corresponding rights, including the right to vote, within reasonable time and procedure.
Not all of our subscribers are members of our association. The association is primarily a hackerspace with hackerspace members being members of the association. We just happen to also be an ISP selling services commercially (eg. to people colocating equipment with us, or buying FTTH connectivity in the building we're located in).
Well, on one hand I perfectly understand you not wanting to change your structure, especially if it works fine. On the other, I can see a few ways around that restriction, and don't really see how having the ISP as a separate association with its customers as members (maybe with their votes having less weight than hackerspace members') would have a downside (except if funds are primarily collected for funding hackerspace activities?).
First time, but I read the rules carefully :).
> [I] don't really see how having the ISP a separate association with its customers as members [...] would have a downside [...]
Paperwork, in time and actual accounting fees. If/when we grow, this might happen - but for now it's just not worth the extra effort. We're not even breaking even on this, just using whatever extra income from customers to offset the costs of our own Internet infrastructure for the hackerspace. We don't even have enough customers to legally start a separate association with them as members, as far as I understand. I also don't think our customers would necessarily be interested in becoming members of an association, they just want good and cheap Internet access and/or mediocre and cheap colocation space.
The Network Startup Resource Center out of UOregon has some good tutorials on BGP and connecting networks owned by different folks:
NANOG also has a lot of good videos on their channel from their conferences, including one on optical fibre if you want to get into the low-level ISO Layer 1 stuff:
In a similar vein, NANOG "Panel: Demystifying Submarine Cables"
Once you understand BGP and Autonomous Systems (AS), you can then understand peering as well as some of the politics that surround it.
Then you can learn more about how specific networks are connected via public route servers and looking glass servers.
Probably one of the best resources, though, is still to work for an ISP or other network provider for a stint.
I tried to make it accessible to those who have only a basic understanding of home networking. Assuming you know what a router is and what an ISP is, you should be able to ingest it without needing to know crazy jargon.
Many of the comments here presume knowledge about this stuff, and I can’t follow.
Geoff Huston's paper "Interconnection, Peering, and Settlements" is older, but still interesting and relevant in several ways.
I suggest "Where Wizards Stay Up Late: The Origins Of The Internet" - generic and talks about Internet history, but mentions several common misconseptions.
When someone says Level3, read CenturyLink. L3 have been a major player for decades though (including providing the infamous 4.2.2.2 DNS server), so people still refer to them as Level3.
The account to follow for them now is https://mobile.twitter.com/CenturyLink but it won’t tell you much.
CenturyLink's current CEO, Jeff Storey, was actually the pre-acquisition Level 3 CEO.
3 sth-cr2.link.netatonce.net (188.8.131.52)
5 be3530.ccr21.sto03.atlas.cogentco.com (184.108.40.206)
6 be2282.ccr42.ham01.atlas.cogentco.com (220.127.116.11)
7 be2815.ccr41.ams03.atlas.cogentco.com (18.104.22.168)
8 be12194.ccr41.lon13.atlas.cogentco.com (22.214.171.124)
9 be12497.ccr41.par01.atlas.cogentco.com (126.96.36.199)
10 be2315.ccr31.bio02.atlas.cogentco.com (188.8.131.52)
11 be2113.ccr42.atl01.atlas.cogentco.com (184.108.40.206)
12 be2112.ccr41.atl01.atlas.cogentco.com (220.127.116.11)
13 be2027.ccr22.mia03.atlas.cogentco.com (18.104.22.168)
14 be2025.ccr22.mia03.atlas.cogentco.com (22.214.171.124)
15 * level3.mia03.atlas.cogentco.com (126.96.36.199)
16 * * *
17 * * *
So as other providers shut down their links to CenturyLink to save themselves, the outgoing packets towards CenturyLink travel to some part of the world where links have not been shut down yet.
It would be really cool and useful to have a "public Internet health monitoring center"... this could be a foundation that gets some financing from industry, maintains a global internet health monitoring infrastructure, and runs a central site at which all the major players announce outages. It would be pretty cheap and have a high return on investment for everybody involved.
Latest issue reported:
"Level3 (globally?) impacted (IPv4 only)"
Just have a site fetch resources from every single hosting provider everywhere. A 1x1 image would be enough, but 1K/100K/1M sized files might also be useful (they could also be crafted images)
The first step would be making the HTML page itself redundant. Strict round robin DNS might work well for that.
But yeah, moderately expensive - and... thinking about it... it'll honestly come in handy once every ten years? :/
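The probing half of that is nearly trivial - the hard parts are hosting the page itself redundantly and interpreting failures from a single vantage point. A sketch with made-up probe URLs:

    import concurrent.futures
    import urllib.request

    # Hypothetical tiny probe objects, one hosted on each provider/CDN.
    PROBES = {
        "provider-a": "https://probe.provider-a.example/1x1.png",
        "provider-b": "https://probe.provider-b.example/1x1.png",
    }

    def check(item, timeout=5):
        name, url = item
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return name, resp.status == 200
        except Exception:
            return name, False  # DNS failure, timeout, TLS error, HTTP error...

    with concurrent.futures.ThreadPoolExecutor(max_workers=16) as pool:
        for name, ok in pool.map(check, PROBES.items()):
            print(f"{name}: {'up' if ok else 'DOWN'}")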
> Fastly is observing increased errors and latency across multiple regions due to a common IP transit provider experiencing a widespread event. Fastly is actively working on re-routing traffic in affected regions.
Also the Internet has lots of asymmetric traffic, just because a forward path towards a destination may look the same from different networks, it doesn't mean the reverse path will be similar.
I first thought I had broken my DNS filter again through regular maintenance updates, then I suspected my ISP/modem because it regularly goes out. I have never seen the behavior I saw this morning: some sites failing to resolve.
Hopefully this will get fixed within a reasonable timespan.
When something doesn't work I always assume it's a problem with my device/configuration/connection.
Who would have thought it's a global event such as the repeated Facebook SDK issues.
It kind of makes it hard to route around an upstream, if they keep announcing your routes even when there isn't a path to you!
In this case however, it seems to be an L3/CL-specific bug.
Edit: Looks like I would have guessed wrong :P. Still want that inside scoop!
Also used a company that over the years has gone from Genesis to GlobalCrossing, Vyvx, and Level3 (and now of course Level 3 is CenturyLink), which has been fine.
In some ways I'm a little bit disappointed it's only a glitch in the internet.
Usually the Internet is a bit more resilient to these kinds of things, but there are complicating factors with this outage making it worse.
Expect it to mostly be resolved today. These things have happened a bit more frequently, but generally average up to a couple times a year historically.
The one site I can't see is Twitter. (Not a heart-wrenching loss, mind you...)
Routing to a Level3 ISP where I have an office in the States peers with London15.Level3.net.
No problem to my Cogent ISP in the States, although we don't peer directly with Cogent; that bounces via Telia.
Going east from London, a 10 second outage at 12:28:42 GMT on a route that runs from me via Level3 to Tata in India, but no rerouting.
An ssh tunnel through OVH/gravelines is working so far. edit: Proximus. edit2: also, Orange Mobile
7 166-49-209-132.gia.bt.net (188.8.131.52) 9.877 ms 8.929 ms
166-49-209-131.gia.bt.net (184.108.40.206) 8.975 ms
8 166-49-209-131.gia.bt.net (220.127.116.11) 8.645 ms 10.323 ms 10.434 ms
9 be12497.ccr41.par01.atlas.cogentco.com (18.104.22.168) 95.018 ms
be3487.ccr41.lon13.atlas.cogentco.com (22.214.171.124) 7.627 ms
be12497.ccr41.par01.atlas.cogentco.com (126.96.36.199) 102.570 ms
10 be3627.ccr41.jfk02.atlas.cogentco.com (188.8.131.52) 89.867 ms
be12497.ccr41.par01.atlas.cogentco.com (184.108.40.206) 101.469 ms 101.655 ms
11 be2806.ccr41.dca01.atlas.cogentco.com (220.127.116.11) 103.990 ms 93.885 ms
be3627.ccr41.jfk02.atlas.cogentco.com (18.104.22.168) 97.525 ms
12 be2112.ccr41.atl01.atlas.cogentco.com (22.214.171.124) 106.027 ms
be2806.ccr41.dca01.atlas.cogentco.com (126.96.36.199) 98.149 ms 97.866 ms
13 be2687.ccr41.iah01.atlas.cogentco.com (188.8.131.52) 120.558 ms 122.330 ms 120.071 ms
14 be2687.ccr41.iah01.atlas.cogentco.com (184.108.40.206) 123.662 ms
be2927.ccr21.elp01.atlas.cogentco.com (220.127.116.11) 128.351 ms
be2687.ccr41.iah01.atlas.cogentco.com (18.104.22.168) 120.746 ms
15 be2929.ccr31.phx01.atlas.cogentco.com (22.214.171.124) 145.939 ms 137.652 ms
be2927.ccr21.elp01.atlas.cogentco.com (126.96.36.199) 128.043 ms
16 be2930.ccr32.phx01.atlas.cogentco.com (188.8.131.52) 150.015 ms
be2940.rcr51.san01.atlas.cogentco.com (184.108.40.206) 152.793 ms 152.720 ms
17 be2941.rcr52.san01.atlas.cogentco.com (220.127.116.11) 152.881 ms
te0-0-2-0.rcr11.san03.atlas.cogentco.com (18.104.22.168) 153.452 ms
be2941.rcr52.san01.atlas.cogentco.com (22.214.171.124) 152.054 ms
18 te0-0-2-0.rcr12.san03.atlas.cogentco.com (126.96.36.199) 162.835 ms
te0-0-2-0.nr11.b006590-1.san03.atlas.cogentco.com (188.8.131.52) 146.643 ms
te0-0-2-0.rcr12.san03.atlas.cogentco.com (184.108.40.206) 153.714 ms
19 te0-0-2-0.nr11.b006590-1.san03.atlas.cogentco.com (220.127.116.11) 151.212 ms 145.735 ms
18.104.22.168 (22.214.171.124) 147.092 ms
20 126.96.36.199 (188.8.131.52) 149.413 ms * *
You can use `-q 1` to send a single traceroute probe/query instead of the default 3, it might make your traceroute look a little cleaner.
HN has dropped off completely from work - I see the route advertised from Level 3 (3356 21581 21581) and from Telia and onto Cogent (1299 174 21581 21581). Telia is longer, so traffic goes into to Level3 at Docklands via our 20G peer to London15, but seems to get no further.
Heading to Tata in India, the route out is via the same peer to Level3, then on to London and Marseille, and then it peers with Tata in Marseille; working fine.
My gut feeling is a core problem in Level3's continental US network rather than something more global.
Since Hacker News was down yesterday I couldn't reply here, so I tried to send you an email, but that failed to deliver, as there are no MX records for monitory.io...
Got me questioning if I really disconnected it before I left.
I'm wondering if we're at the point where internet outages should have some kind of (emergency) notification/sms sent to _everyone_.
AS3356 is Level 3, AS209 is CenturyLink.
Fastly, HN, Reddit too.
Only Google domains are loading here.
That doesn't really explain the "stuck" routes in their RRs... maybe it'll make sense once we've gotten some more details...
1. Is there syntax correctness checking available, so you don't push a config that breaks machines? Yes.
2. Is there a DWIM check available, so you can see the effect of the change before committing? No. That would require a complete model of, at a minimum, your entire network plus all directly connected networks -- that still wouldn't be complete, but it could catch some errors.
Their console isn't responding at all and all my servers are unreachable. Their status console reports all normal though.
Hacker News has been down for several hours for me.
Whatever it was it must have been nasty.
Connecting from Switzerland.