Basically what's happened here is that Telecom Malaysia told one of Level3's networks (AS3549 - Global Crossing) that it was capable of delivering traffic to... anywhere on the Internet. Global Crossing apparently didn't have the usual sanity checks in place.
Because of how BGP works, once GBLX decided the route seemed reasonable, it immediately proceeded to dump huge amounts of traffic on the tiny Telecom Malaysia. This isn't incorrect behavior on GBLX's part (accepting the routes was incorrect, but sending the traffic was correct once they were accepted). This is way more traffic than Telecom Malaysia was prepared to handle.
Every tier 1 network gets rid of traffic as soon as possible (because it reduces their costs), completely ignoring performance or whether a route seems sensible. Telecom Malaysia therefore claimed to be the most attractive route from anywhere near Malaysia to most of the world.
Level3 is one of the biggest telecoms in the world -- and in fact the biggest by far in that region. This means that most of the Internet probably stopped working for anyone anywhere near Malaysia.
In an ideal world that has not existed for over a decade, IRR data could be used to automagically generate accurate prefix lists, but for all the usual reasons (inaccurate, stale data; obtuse interface; etc.) that just isn't done on a wide scale.
You can put a prefix limit on a session (that's the "little" in little to no) but that doesn't stop someone from announcing 10,000 more specific routes to popular (YouTube, Facebook, Twitter, Akamai) destinations.
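To make the more-specifics point concrete, here is a minimal Python sketch of longest-prefix matching. The prefixes, the next-hop labels, and the `best_route` helper are all made up for illustration (documentation address space, not real networks). The only thing it assumes is that forwarding picks the most specific matching prefix, which is why a leaked more-specific wins even when the legitimate covering route is still in the table.

```python
import ipaddress

def best_route(table, destination):
    """Return the (prefix, next_hop) entry with the longest matching prefix."""
    addr = ipaddress.ip_address(destination)
    matches = [(net, hop) for net, hop in table if addr in net]
    if not matches:
        return None
    # Longest prefix length wins, regardless of who announced it.
    return max(matches, key=lambda entry: entry[0].prefixlen)

# Hypothetical table: a legitimate /24 plus a leaked more-specific /25.
table = [
    (ipaddress.ip_network("203.0.113.0/24"), "legitimate-origin"),
    (ipaddress.ip_network("203.0.113.128/25"), "leaker"),
]

print(best_route(table, "203.0.113.200"))  # covered by the /25
print(best_route(table, "203.0.113.10"))   # only covered by the /24
```

A prefix limit only counts announcements, so a small number of well-chosen more-specifics like the /25 above stays under the limit and still attracts the traffic.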
Really? That's not been my experience with L3 and a few others. They seem to prefer customer routes always, even when prepended. But I don't really understand this in depth.
However, it'd be a good idea for CAs to look at any "domain validated" certs issued during this time and re-confirm. Really, any service that does insecure account resets or takeovers should be a bit more wary than usual.
(That's the twitter account for the currently implicated provider which messed things up, and for the record it has a minion in a Hawaiian grass dress saying "Happy Friday!", posted about 10 hours ago.)
This is a double fail, both for Telecom Malaysia for leaking a full routing table, and for GBLX who apparently isn't filtering prefixes from their downstream customers or even restricting to a max number of prefixes.
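For anyone wondering what those two safeguards amount to, here is a rough Python sketch of the logic, not any vendor's actual configuration. The function name `filter_announcements`, its parameters, and the example prefixes are assumptions for illustration; real routers express this as prefix lists and a max-prefix setting on the session.

```python
import ipaddress

def filter_announcements(announced, allowed, max_prefixes):
    """Return (accepted, session_up) for a hypothetical customer session.

    announced:    prefix strings the customer sent
    allowed:      prefix strings the customer is registered to originate
    max_prefixes: drop the session if more than this many prefixes arrive
    """
    announced = [ipaddress.ip_network(p) for p in announced]
    if len(announced) > max_prefixes:
        # Max-prefix tripped: accept nothing and take the session down.
        return [], False
    allowed_nets = [ipaddress.ip_network(p) for p in allowed]
    # Accept a registered prefix or anything inside it; reject the rest.
    accepted = [
        net for net in announced
        if any(net.version == a.version and net.subnet_of(a) for a in allowed_nets)
    ]
    return accepted, True

# The customer sends its own prefix plus one it has no business announcing.
accepted, up = filter_announcements(
    ["203.0.113.0/24", "198.51.100.0/24"], ["203.0.113.0/24"], max_prefixes=10
)
print([str(n) for n in accepted], up)
```

Either check alone would have limited a full-table leak: the prefix filter drops everything the customer does not own, and the max-prefix limit shuts the session once the count explodes.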
The people managing peering for AS 3356 and AS 3549 should be the same group, no? ("That DB we don't share the URL of" seems to imply as much.)
But a route leak forces traffic to take a different path, and this potentially results in a pipe being inundated with a significant volume of traffic that cannot be handled by ISPs along that route (or simply a dead-end).
For example, imagine an ISP in Rio suddenly declaring that they are the best route to reach the networks that contain facebook.com and google.com ... that ISP will DDoS itself or one of their downstream partners. An ISP may not even notice this, but their peers and partners probably will.
A couple of years ago a BGP route leak took out the entire internet for a few hours for people in Australia. They're fairly common unfortunately, but tend to have limited impact and are resolved quite quickly.
More info: https://labs.apnic.net/?p=139
https://blog.cloudflare.com/route-leak-incident-on-october-2... has more information from a previous time this happened. A key idea is that internet routing depends a lot on trust, and it's possible for a single misconfigured site to cause serious issues across the internet.
I got on with tech support this morning and they had no indication that anything was wrong regionally, had me do a factory reset on the device, and verified it was connecting to their network successfully.
So, maybe my issue is resolved, but Sprint is still having peering issues. That is the kind of behavior I would expect to see during an issue like this.
Of course I go into the store and they tell me there have been regional issues for the last two weeks and it's not my device (we can't help you), but everything works fine for me while I'm inside of the store and on the way back. Then 2 miles before back at the office, broken again, call again and phone support guy hasn't heard anything about this Malaysia Telekom debacle or any regional outages in my area. At this point I notice that I can still reach some sites, but most sites are down.
Not completely sure this is related, since I've been having issues since Wednesday afternoon (in fact that would seem to indicate it's unrelated). Also possible I have been having one issue that was fixed by the factory reset, so now I can freely experience the other issue that everyone else around the globe is apparently having today.
I am interested to know from The Expert as well, when Telekom Malaysia fixed their issue (it is actually fixed at the source now, right?) would there be some propagation delay or aftershocks, or should everything just return to normal relatively immediately.
In theory, a large portion of the Internet would have still seen the legitimate routes as better paths, as they would have a shorter AS path (fewer networks in between) than the leaked ones. However, many networks often implement other BGP metrics for traffic engineering, depending on whether they are seeing the routes from a transit/peer/downstream customer, which may override the shortest AS path.
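A small Python sketch of that override, under simplifying assumptions: only two steps of BGP best-path selection are modelled (highest local preference first, then shortest AS path), and the `Route` class, AS numbers (from the documentation range), and preference values are invented for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Route:
    prefix: str
    as_path: tuple   # AS numbers from nearest to origin
    local_pref: int  # set by local policy, e.g. higher for customer routes

def best_path(routes):
    """Pick the highest local preference, then the shortest AS path."""
    return max(routes, key=lambda r: (r.local_pref, -len(r.as_path)))

# Legitimate route learned from a peer: short path, default preference.
legit = Route("203.0.113.0/24", (64496, 64500), local_pref=100)
# Leaked route learned from a customer: longer path, but customer preference.
leaked = Route("203.0.113.0/24", (64497, 64498, 64499, 64500), local_pref=200)

print(best_path([legit, leaked]).as_path)  # the longer, leaked path is chosen
```

With equal local preference the two-hop path would win; it is the customer-route policy that lets the four-hop leaked path take over.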
6 29 ms syd-sot-ken-int2-be-20.tpgi.com.au [126.96.36.199]
7 197 ms 10ge3-7.core1.sjc1.he.net [188.8.131.52]
8 199 ms 10ge1-4.core1.pao1.he.net [184.108.40.206]
9 * Request timed out.
Edit: It seems like that must have come through the Southern Cross Cable which is nowhere near Malaysia.
Ycombinator and reddit were A OK though.
Capital One uses Level 3 GLBX as a primary ISP: https://oxqyi.share.thousandeyes.com
(Jump to BGP route visualization and you can see AS 3549)
On the routing plane, you can see that the London GLBX monitor had issues reaching many services and that AS4788 Malaysia Telecom was advertised in the route.
Here is an example from a LinkedIn snapshot: https://wbpkq.share.thousandeyes.com
Edit: found it. See Lecture 4: http://ocw.mit.edu/courses/electrical-engineering-and-comput...
> Your IP address 220.127.116.11 has been flagged as a scanner. Scanners are not permitted. If you are seeing this message in error, please contact email@example.com.
Guess I better send an e-mail to see this status page..?
With that, instead of just looking at 404s, we can make more informative observations, like this request got stuck at this node, or that request was routed through a new node that has never been seen before.
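Something like the following Python sketch is what I have in mind. It is only an illustration: `analyze_trace` is a hypothetical helper, the hostnames are made up, and a trace is assumed to be a list of hop names with `None` standing in for a timed-out hop.

```python
def analyze_trace(known_hops, path):
    """Return (unseen_hops, stuck_after) for one traceroute-style path.

    unseen_hops: responding hops that are not in known_hops
    stuck_after: the last hop that answered if the trace ends in a timeout,
                 otherwise None
    """
    responding = [hop for hop in path if hop is not None]
    unseen = [hop for hop in responding if hop not in known_hops]
    ends_in_timeout = bool(path) and path[-1] is None
    stuck_after = responding[-1] if ends_in_timeout and responding else None
    return unseen, stuck_after

# Hops observed on earlier, healthy traces (hypothetical names).
known = {"edge1.example.net", "core1.example.net"}
trace = ["edge1.example.net", "transit9.example.org", None]

unseen, stuck_after = analyze_trace(known, trace)
print("new hops:", unseen)
print("stuck after:", stuck_after)
```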
So Telecom Malaysia messed up a config, and Global Crossing accepted their updates automatically.
Global Crossing didn't have to accept the bad update. They generally trust updates from other organisations that are generally trustworthy. They apply checks and restrictions proportionate to the risks involved.
These mistakes happen rarely. If they were to happen more often, major operators would apply more checks and restrictions. If they were to stop happening, operators would apply fewer checks and restrictions, because they have a cost in manpower, complexity, and loss of flexibility.
That's how the internet works. You could almost say that's what the internet is--the idea of being actively managed by people who know what they're doing and are not bound by exhaustive predefined policies is defining of how the internet came about and how it came to be dominant.
If you want a network guaranteed to be resistant to this kind of f---up, build one. The internet is that network which does not work that way, which is flexible, expandable, mostly "good enough" but not ever designed for absolute reliability.
OTOH, so much in telecom is hung together on the assumption of basically good actors everywhere.
No it is not. You accept their claims of "mistakes"; I see no evidence of that. How, exactly, do you leak a full table by accident? And this is too big a security hole to leave to hackers. Leak a bunch of the table, shut down an entire country.
You would be surprised. BGP is an old protocol and has had very few serious security improvements. It currently works more or less based upon the goodwill and discipline of network engineers around the world, because if they screw it up, they usually end up offline and out of a job.
1) Someone wrote an incorrect config
2) They did not test it
3) They pushed it to production systems without testing it
4) They did not monitor their systems after pushing new configs
5) They took ages to fix the problem after it was detected.
That definitely isn't a single mistake.
It's a little difficult to test this kind of config without emulating the entire internet - which is quite clearly beyond the scope of all bar a very small number of organisations.
If I'm driving a car without paying attention to the road and kill a bunch of kids, that's my fault.
If Telecom Malaysia pushes new configs without testing them and breaks the internet, that's their fault.
Obviously these two things aren't even remotely comparable in seriousness, but it's clear that in both cases the people messing up should be held responsible.
If the postal service misdelivers some mail are there criminal charges?
I would suggest resisting the urge to hit every problem and mistake in the world with the law hammer; it is rarely productive, and generally isn't how the world works.
Your comparison to postal service misdelivering some mail is both irrelevant and utterly ridiculous.
A relevant comparison would be someone disrupting mail delivery on a global scale. Which would in fact be a crime in quite a few countries.
For example in the US: https://www.law.cornell.edu/uscode/text/18/1701
E: (Yeah, downvote me because you disagree. Instead of explaining why I'm wrong)
Apparently all it takes is one ISP misconfiguring something to break large swathes of the Internet. I believe the consensus on HN is that no one entity should have an Internet kill switch.
If someone managed to disrupt mail delivery on a global scale, people would be less concerned with THAT it happened than that it COULD HAVE happened in the first place. Why would global mail delivery be so not-fault-tolerant that one mistake brought it to a grinding halt for hours? Same deal here.
Thing is, everybody knows (and has always known) this can happen. Everybody also knows how to avoid it. Well-intentioned people tend to try to avoid breaking the internet.
That really doesn't make the negligence of Telecom Malaysia any more defensible.
I really don't see the point of defending Telecom Malaysia; plenty of people manage to operate their equipment in a manner that doesn't break the internet.
Just because this was a mistake doesn't mean they should not be held responsible (and no, I'm not saying someone should go to prison.)
Perhaps the point is that Malaysia is part of the world, and we can't realistically expect to exclude them from the internet, any more than we could expect to exclude them from the commercial airline system. Their network people answer to their customers in Malaysia, not to us nerds on HN. (Also ISTM many Malaysians are more concerned about earthquakes caused by exhibitionist tourists than about internet stability.) I guess in some way TM answer to their upstream in GLBX, and they could get "demoted" in some way, but GLBX isn't going to just walk away from an income stream.
Many times when one node is blamed for network-wide bad results, the nodes that connect it to the network might be blamed fairly, as well.
Malaysian here. More Malaysians care about the 18 people who died on Mt Kinabalu from the earthquake, than about the antics of a few douchey tourists.
As for Telekom Malaysia, TM has the same reputation that BT has in the UK - shitty service, but Malaysians are stuck with them. I don't think it surprised anyone that TM caused this fuckup.
Despite the name, TM is a private company. I really wouldn't have any issues with (temporarily) disconnecting them, or alternatively fining them. GLBX should definitely be able to do both of those.
Of course best case scenario would involve Malaysian government intervention. (As unlikely as that sounds in a country that seems to be run by people that believe in magic)
If that language is in their contracts, then sure. Such penalties might not be made public, however, or there may be different enforcement mechanisms in place. The internet (and all global commerce, really) functions anyway.
> Of course best case scenario would involve Malaysian government intervention.
Given how often "Malaysian government intervention" entails unconscionable violence, I cannot agree.
I definitely didn't intend that anybody should be executed, but at most fined (or imprisoned for a reasonable amount of time if this was in fact intentional, but that's unlikely).
"with design to obstruct the correspondence, or to pry into the business or secrets of another, or opens, secretes, embezzles, or destroys the same" nope
"voluntarily quits or deserts" nope
Unlike in the engineering sector, where negligence could result in injury or death; those responsible could be charged.
I think some people get too caught up on blaming individuals for accidents. Which I think stems from the culture of litigation and offsetting unforeseen costs. While there obviously should be processes in place to prevent accidents from happening, gross negligence aside I do also think we should stop hunting down individuals to compensate our own greed.
I don't think anyone is blaming individuals here, in most countries businesses can face criminal charges.
Unless you want to turn miscoding config files into an international crime - in which case every single person who works in IT would be a criminal since we all err at some point in our careers (usually frequently, to be honest).
Granted the scale here is greater than your usual sysadmin would have access to, but the nature of the error isn't any different.
"Shit happens" would've been applicable if this was solved in 5 minutes. Accidents like this are trivial to prevent with proper policies.
So what you're telling me is that these financial systems' risk analyses didn't result in them mitigating this risk (1), but because things went pear-shaped now the government needs to step in?
I was going to be unsympathetic to them, because in my field we have to analyze and then mitigate certain levels of risk. But I guess if your business is financial, bellyaching after the fact to the government is your mitigation.
(1) different design, SLAs, making sure service providers already have those policies, acceptable and expected levels of downtime, whatever...
Any "financial system" who uses "the internet" without acknowledging and accepting the risk of this sort of downtime should probably be considered incompetent. (Of course, there are probably many such institutions where the techs are currently saying "We warned you! But you refused to authorise the budget to mitigate this!" - who are now baying for blood from people who never signed up to provide 100% reliable networking for some cheapskate financial firm...)
Online banking, PayPal, Bitcoin...
I am focusing on liability though, I very much believe Telecom Malaysia should face criminal charges for this (I do not know if they should be sentenced though, as I am not aware of all the facts. That's up for the court to figure out)
In most countries (I do not know if this applies to Malaysia too, but I believe it should) denial of service attacks are a criminal offence; I'd say exporting prefixes like this would constitute one.
You're completely ignoring the potential monetary damages.