Basically what's happened here is that Telecom Malaysia told one of Level3's networks (AS3549 - Global Crossing) that it was capable of delivering traffic to... anywhere on the Internet. Global Crossing apparently didn't have the usual sanity checks in place.
Because of how BGP works, once GBLX decided the routes seemed reasonable, it immediately proceeded to dump huge amounts of traffic on the much smaller Telecom Malaysia. This isn't incorrect behavior on GBLX's part (accepting the routes was incorrect, but sending the traffic was correct once they were accepted). This is way more traffic than Telecom Malaysia was prepared to handle.
Every tier 1 network gets rid of traffic as soon as possible (so-called hot-potato routing, because it reduces their costs), completely ignoring performance or whether a route seems sensible. Telecom Malaysia therefore appeared to be the most attractive route from anywhere near Malaysia to most of the world.
Level3 is one of the biggest telecoms in the world -- and in fact the biggest by far in that region. This means that most of the Internet probably stopped working for anyone anywhere near Malaysia.
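A minimal sketch of the "usual sanity checks" that were missing: on a customer BGP session, accept only prefixes inside the customer's registered address space and drop everything else. The prefixes and filter shape here are illustrative (documentation address blocks), not TM's or GBLX's actual configuration:

```python
import ipaddress

# Prefixes the customer is registered to originate.
# These are illustrative documentation blocks, not TM's real allocations.
CUSTOMER_PREFIX_LIST = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]

def accept_announcement(prefix: str) -> bool:
    """Ingress filter for a customer BGP session: accept a prefix
    only if it falls inside the customer's registered space."""
    net = ipaddress.ip_network(prefix)
    return any(net.subnet_of(allowed) for allowed in CUSTOMER_PREFIX_LIST)

# The customer's own route is accepted...
print(accept_announcement("203.0.113.0/24"))   # True
# ...but a leaked route for someone else's network is dropped,
# instead of being propagated to the rest of the Internet.
print(accept_announcement("8.8.8.0/24"))       # False
```

With a filter like this in place, a full-table leak from the customer is simply discarded at the edge instead of being re-advertised to the world.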
Is it actually normal to have some checks for this now? When this happened a few years ago in Australia, the APNIC writeup (someone has linked it below) essentially said there was no scalable solution yet.
When the number of originated or announced routes is "reasonable", prefix lists, as they are called, are generated and applied to BGP sessions. When the set of originated or announced routes is substantial and/or fluid, there are little to no prefix lists, allowing anything to be announced and accepted.
In an ideal world that has not existed for over a decade, IRR data could be used to automagically generate accurate prefix lists, but for all the usual reasons (inaccurate, stale data; obtuse interfaces; etc.) that just isn't done on a wide scale.
You can put a prefix limit on a session (that's the "little" in little to no) but that doesn't stop someone from announcing 10,000 more specific routes to popular (YouTube, Facebook, Twitter, Akamai) destinations.
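To make the more-specifics point concrete (using documentation addresses, not real ones): forwarding uses longest-prefix match, so even a handful of leaked more-specific routes, far below any max-prefix limit, will pull traffic away from the legitimate covering route:

```python
import ipaddress

# Routing table: (prefix, next_hop). Addresses are illustrative.
routes = [
    (ipaddress.ip_network("198.51.96.0/20"), "legitimate-path"),
    # A leaked, more-specific announcement inside the same block:
    (ipaddress.ip_network("198.51.100.0/24"), "leaker-path"),
]

def lookup(dst: str) -> str:
    """Longest-prefix match: the most specific covering route wins."""
    addr = ipaddress.ip_address(dst)
    matches = [(net, hop) for net, hop in routes if addr in net]
    return max(matches, key=lambda m: m[0].prefixlen)[1]

# Only two routes in the table -- well under any prefix limit --
# yet traffic for this destination is diverted to the leaker.
print(lookup("198.51.100.10"))   # leaker-path
print(lookup("198.51.97.1"))     # legitimate-path
```

That's why a prefix limit alone only caps the blast radius; it can't tell a legitimate more-specific from a hijacking one.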
>Every tier 1 network gets rid of traffic as soon as possible
Really? That's not been my experience with L3 and a few others. They seem to always prefer customer routes, even when prepended. But I don't really understand this in depth.
Do we know that this happened by accident? Because if all it takes is a few routing announcements, that is a very attractive option for any state that wants to attack another.
However, it'd be a good idea for CAs to look at any "domain validated" certs issued during this time and re-confirm. Really, any service that does insecure account resets or takeovers should be a bit more wary than usual.
(That's the twitter account for the currently implicated provider which messed things up, and for the record it has a minion in a Hawaiian grass dress saying "Happy Friday!", posted about 10 hours ago.)
Telecom Malaysia (AS4788) was leaking a full routing table to a Tier 1 network, Global Crossing (AS3549), which in turn was advertising the prefixes to its peers like Level3 (AS3356). Large portions of the Internet would have been affected.
This is a double fail: Telecom Malaysia for leaking a full routing table, and GBLX for apparently not filtering prefixes from their downstream customers or even enforcing a maximum number of prefixes.
I just remembered, GBLX was bought by L3 a few years back - I'm guessing someone has a "tighten up route maps for GBLX ASN" ticket open in their backlog at the Level 3 NOC.
The people managing peering for AS 3356 and AS 3549 should be the same group, no? ("That DB we don't share the URL of" seems to imply as much.)
Malaysia is an Islamic country, so Friday is the equivalent of the American Saturday to them, i.e. Friday is the first day of the Weekend, and Sunday is a normal workday.
I'm an American living in KL. Malaysia follows the Monday through Friday work week. The Twitter message was posted in the AM and this event didn't occur until the evening.
But a route leak forces traffic to take a different path, and this potentially results in a pipe being inundated with a significant volume of traffic that cannot be handled by ISPs along that route (or simply a dead-end).
i.e. imagine an ISP in Rio suddenly declaring that they are the best route to reach the networks that contain facebook.com and google.com ... that ISP will DDoS itself or one of their downstream partners. An ISP may not even notice this, but their peers and partners probably will.
A couple of years ago a BGP route leak took out the entire internet for a few hours for people in Australia. They're fairly common unfortunately, but tend to have limited impact and are resolved quite quickly.
People cannot access some internet sites, depending on who they have internet connectivity with.
https://blog.cloudflare.com/route-leak-incident-on-october-2... has more information from a previous time this happened. A key idea is that internet routing depends a lot on trust, and it's possible for a single misconfigured site to cause serious issues across the internet.
This wasn't battle damage, this was operator error. Building a tank to survive a war and building it to work when driven in a ditch are not quite the same.
My connection through a Sprint hotspot is only able to reach www.google.com and plus.google.com (but not news.google.com, reddit, hn, my own website on the other side of the city I'm in, or practically any other website I can think of.)
I got on with tech support this morning and they had no indication that anything was wrong regionally, had me do a factory reset on the device, and verified it was connecting to their network successfully.
So, maybe my issue is resolved, but Sprint is still having peering issues. That is the kind of behavior I would expect to see during an issue like this.
Of course I go into the store and they tell me there have been regional issues for the last two weeks and it's not my device ("we can't help you"), but everything works fine for me while I'm inside the store and on the way back. Then, 2 miles before getting back to the office, it breaks again; I call again and the phone support guy hasn't heard anything about this Telekom Malaysia debacle or any regional outages in my area. At this point I notice that I can still reach some sites, but most sites are down.
Not completely sure this is related, since I've been having issues since Wednesday afternoon (in fact that would seem to indicate it's unrelated). Also possible I have been having one issue that was fixed by the factory reset, so now I can freely experience the other issue that everyone else around the globe is apparently having today.
I am interested to know from The Expert as well: when Telekom Malaysia fixed their issue (it is actually fixed at the source now, right?), would there be some propagation delay or aftershocks, or should everything just return to normal relatively immediately?
Any networks that both accept the prefixes and see these advertisements as the best path will send their traffic towards the route leaker, who will lack the capacity to handle this traffic, effectively blackholing these routes.
In theory, a large portion of the Internet would have still seen the legitimate routes as better paths, as they would have a shorter AS path (fewer networks in between) than the leaked ones. However, many networks often implement other BGP metrics for traffic engineering, depending on whether they are seeing the routes from a transit/peer/downstream customer, which may override the shortest AS path.
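A toy version of that decision process, assuming the conventional policy of a higher local-preference for customer routes than for peer routes (the values and AS numbers here are illustrative, not any specific operator's config): local-pref is compared before AS-path length, so the leaked route can win despite its longer path.

```python
# Simplified BGP best-path selection: higher local_pref wins first;
# shorter AS path is only the tiebreaker.
def best_path(candidates):
    return max(candidates,
               key=lambda r: (r["local_pref"], -len(r["as_path"])))

# Legitimate route, learned from a peer: short path, peer local-pref.
legit = {"via": "peer-route", "local_pref": 100,
         "as_path": [3356, 15169]}

# Leaked route, learned via a paying customer: longer path, but
# policy assigns customer routes a higher local-pref.
leaked = {"via": "leaked-customer-route", "local_pref": 200,
          "as_path": [3549, 4788, 3356, 15169]}

print(best_path([legit, leaked])["via"])  # the leak wins
```

This is the traffic-engineering override the comment describes: the shortest AS path only matters once local-preference ties.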
That is because Cloudflare has direct peering with many access networks. Most providers see Cloudflare routes directly and not through a transit such as Level3.
(Jump to BGP route visualization and you can see AS 3549)
On the routing plane, you can see that the London GBLX monitor had issues reaching many services and that AS4788 Telekom Malaysia appeared in the advertised routes.
Could someone elaborate on the scope of the problem? Is it a sensible question to ask "How many routes were leaked?" How big were the prefixes of the leaked networks?
There is a great chapter on BGP by Hari Balakrishnan. I believe the chapter is available on MIT OpenCourseWare. It should be under the Computer Networks class.
Did this happen last night too (about 15 hours before this post was made)? I am in India and was not able to reach my servers in Singapore or do a git push. All other sites worked okay. I scratched my head and changed my DNS to 8.8.8.8. It still did not work.
> Your IP address 5.79.68.161 has been flagged as a scanner. Scanners are not permitted. If you are seeing this message in error, please contact security@statuspage.io.
Guess I better send an e-mail to see this status page..?
Just want to ask: is there anything a home user could do to detect unusual routing behavior? I'm envisioning something like a browser plugin that logs/monitors outgoing connections and traceroute data.
With that, instead of just looking at 404s, we could make more informative observations, like "this request got stuck at this node" or "that request is routed through a node which has never been seen before".
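Something along those lines can be prototyped outside the browser with periodic traceroutes: keep a set of hops seen so far and flag anything new. A rough sketch (assumes a Unix `traceroute` binary on PATH; the parsing is deliberately simplistic and the hostname in the usage comment is a placeholder):

```python
import re
import subprocess

def parse_hops(output: str) -> list[str]:
    """Extract the first IPv4 address on each numbered hop line of
    traceroute output (exact format varies by platform)."""
    return re.findall(r"^\s*\d+\s+(\d+\.\d+\.\d+\.\d+)",
                      output, re.MULTILINE)

def trace_hops(host: str) -> list[str]:
    """Run traceroute (assumed installed) and return hop IPs in order."""
    out = subprocess.run(
        ["traceroute", "-n", host],
        capture_output=True, text=True, timeout=120,
    ).stdout
    return parse_hops(out)

def new_hops(host: str, seen: set[str]) -> set[str]:
    """Return hops on the current path never seen before, and
    record them in the baseline set."""
    current = set(trace_hops(host))
    novel = current - seen
    seen |= current
    return novel

# Usage sketch: the first run seeds the baseline; later runs flag
# path changes that might indicate rerouting.
# seen: set[str] = set()
# new_hops("example.com", seen)            # baseline run
# alerts = new_hops("example.com", seen)   # non-empty => new nodes on path
```

This won't tell you *why* the path changed, but it gives exactly the "new node which has never been seen before" signal described above.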
Fuckups like this should result in criminal charges (and immediate depeering). DoS attacks are illegal in most countries and this is definitely gross negligence.
It's actually interesting why this idea is so totally wrong.
So Telecom Malaysia messed up a config, and Global Crossing accepted their updates automatically.
Global Crossing didn't have to accept the bad update. They generally trust updates from other organisations that are generally trustworthy. They apply checks and restrictions proportionate to the risks involved.
These mistakes happen rarely. If they were to happen more often, major operators would apply more checks and restrictions. If they were to stop happening, operators would apply fewer checks and restrictions, because checks have a cost in manpower, complexity, and loss of flexibility.
That's how the internet works. You could almost say that's what the internet is -- being actively managed by people who know what they're doing and are not bound by exhaustive predefined policies is part of what defines how the internet came about and how it came to be dominant.
If you want a network guaranteed to be resistant to this kind of f---up, build one. The internet is that network which does not work that way, which is flexible, expandable, mostly "good enough" but not ever designed for absolute reliability.
If anything, the outrage should be directed at Global Crossing/L3. They shouldn't be allowing some little ISP to screw up their route tables like that. Because this means that the Malaysian government could secretly announce just a few blocks, route them back out so they still work, and easily do MITM.
OTOH, so much in telecom is hung together on the assumption of basically good actors everywhere.
>It's actually interesting why this idea is so totally wrong.
No it is not. You accept their claims of "mistakes"; I see no evidence of that. How, exactly, do you leak a full table by accident? And this is too big a security hole to leave to hackers: leak a bunch of routes, shut down an entire country.
Full table leaks happen ALL the time. The reasons you don't notice or hear about it are:
- the providers which do this by accident are too small (multiple asn hops away from a major transit provider) to become the best choice for most people
- the small guy who does leak the full table to a major transit provider, is adequately filtered by the major transit provider by default
- the small guy who leaks to the medium transit provider might take an outage, but may not leak to his upstreams due to outbound filtering or the upstreams filtering
You would be surprised: BGP is an old protocol and has seen very few serious security improvements. It currently works more or less on the goodwill and discipline of network engineers around the world, because if they screw it up, they usually end up offline and out of a job.
Even if it was a mistake, that doesn't suddenly make it OK. People make mistakes, yeah. But this isn't a simple mistake; in fact this incident consists of multiple mistakes.
1) Someone wrote an incorrect config
2) They did not test it
3) They pushed it to production systems without testing it
4) They did not monitor their systems after pushing new configs
5) They took ages to fix the problem after it was detected.
It's a little difficult to test this kind of config without emulating the entire internet - which is quite clearly beyond the scope of all bar a very small number of organisations.
This isn't about how the internet works. This is someone messing up and breaking the internet.
If I'm driving a car without paying attention to the road and kill a bunch of kids, that's my fault.
If Telecom Malaysia pushes new configs without testing them and breaks the internet, that's their fault.
Obviously these two things aren't even remotely comparable in seriousness, but it's clear that in both cases the people messing up should be held responsible.
Excuse me Mr Lol, but since it happened, it would appear that it in fact is how the internet works.
If the postal service misdelivers some mail are there criminal charges?
I would suggest resisting the urge to hit every problem and mistake in the world with the law hammer; it is rarely productive, and generally isn't how the world works.
You are wrong because this IS about how the Internet works and less about how one person/company screwed up.
Apparently all it takes is one ISP misconfiguring something to break large swathes of the Internet. I believe the consensus on HN is that no one entity should have an Internet kill switch.
If someone managed to disrupt mail delivery on a global scale, people would be less concerned with THAT it happened than that it COULD HAVE happened in the first place. Why would global mail delivery be so not-fault-tolerant that one mistake brought it to a grinding halt for hours? Same deal here.
I don't think anyone is saying that the fact that this CAN happen isn't a huge problem, it most definitely is.
Thing is, everybody knows (and has always known) this can happen. Everybody also knows how to avoid it. Well-intentioned people tend to try to avoid breaking the internet.
That really doesn't make the negligence of Telecom Malaysia any more defensible.
I really don't see what's the point of defending Telecom Malaysia; plenty of people manage to operate their equipment in a manner that doesn't break the internet.
Just because this was a mistake doesn't mean they should not be held responsible (and no, I'm not saying someone should go to prison.)
> I really don't see what's the point of defending Telecom Malaysia, plenty of people manage to operate their equipment in a manner that doesn't break the internet.
Perhaps the point is that Malaysia is part of the world, and we can't realistically expect to exclude them from the internet, any more than we could expect to exclude them from the commercial airline system. Their network people answer to their customers in Malaysia, not to us nerds on HN. (Also ISTM many Malaysians are more concerned about earthquakes caused by exhibitionist tourists [0] than about internet stability.) I guess in some way TM answer to their upstream in GLBX, and they could get "demoted" in some way, but GLBX isn't going to just walk away from an income stream.
Many times when one node is blamed for network-wide bad results, the nodes that connect it to the network might be blamed fairly, as well.
> Also ISTM many Malaysians are more concerned about earthquakes caused by exhibitionist tourists [0] than about internet stability.
Malaysian here. More Malaysians care about the 18 people who died on Mt Kinabalu from the earthquake, than about the antics of a few douchey tourists.
As for Telekom Malaysia, TM has the same reputation that BT has in the UK - shitty service, but Malaysians are stuck with them. I don't think it surprised anyone that TM caused this fuckup.
I'm definitely not saying we should disconnect Malaysia from the internet, that'd be terrible.
Despite the name, TM is a private company. I really wouldn't have any issues with (temporarily) disconnecting them, or alternatively fining them. GLBX should definitely be able to do both of those.
Of course the best case scenario would involve Malaysian government intervention. (As unlikely as that sounds in a country that seems to be run by people who believe in magic.)
> I really wouldn't have any issues with (temporarily) disconnecting them, or alternatively fining them. GLBX should definitely be able to do both of those.
If that language is in their contracts, then sure. Such penalties might not be made public, however, or there may be different enforcement mechanisms in place. The internet (and all global commerce, really) functions anyway.
> Of course best case scenario would involve Malaysian government intervention.
Given how often "Malaysian government intervention" entails unconscionable violence, I cannot agree.
Based on previous experiences I'd imagine their contract would allow early termination in case of abuses such as this, but of course this is speculation.
I definitely didn't intend that anybody should be executed; at most fined (or imprisoned for a reasonable amount of time if this was in fact intentional, but that's unlikely).
"with design to obstruct the correspondence, or to pry into the business or secrets of another, or opens, secretes, embezzles, or destroys the same" nope
This reminds me of arguments some people use in the EU against net neutrality... they want fast lanes for Industry 4.0, online emergency and self driving cars. Incredible thoughts... I know.
There's a big difference between accidentally breaking something and doing it intentionally.
I think some people get too caught up in blaming individuals for accidents, which I think stems from the culture of litigation and offsetting unforeseen costs. While there obviously should be processes in place to prevent accidents from happening, gross negligence aside, I do also think we should stop hunting down individuals to compensate our own greed.
There hasn't actually been any crime committed though.
Unless you want to turn miscoding config files into an international crime -- in which case every single person who works in IT would be a criminal, since we all err at some point in our careers (usually frequently, to be honest).
Granted the scale here is greater than your usual sysadmin would have access to, but the nature of the error isn't any different.
You're completely ignoring the potential monetary damages. Plenty of financial systems use the internet. I've also heard of a couple of other businesses utilizing the internet.
"Shit happens" would've been applicable if this was solved in 5 minutes. Accidents like this are trivial to prevent with proper policies.
> You're completely ignoring the potential monetary damages. A plenty of financial systems use the internet. I've also heard of a couple of other businesses utilizing the internet.
So what you're telling me is that these financial systems' risk analyses didn't result in them mitigating this risk (1), but because things went pear-shaped now the government needs to step in?
I was going to be unsympathetic to them, because in my field we have to analyze and then mitigate certain levels of risk. But I guess if your business is financial, bellyaching after the fact to the government is your mitigation.
(1) different design, SLAs, making sure service providers already have those policies, acceptable and expected levels of downtime, whatever...
There are "financial systems" who actually care enough about their comms network to build it themselves (Google for "High Frequency Trading in my backyard" for some great stories).
Any "financial system" who uses "the internet" without acknowledging and accepting the risk of this sort of downtime should probably be considered incompetent. (Of course, there are probably many such institutions where the techs are currently saying "We warned you! But you refused to authorise the budget to mitigate this!" - who are now baying for blood from people who never signed up to provide 100% reliable networking for some cheapskate financial firm...)
I see you're very much concerned about liability and financial compensation here. I'm no lawyer, so I don't know whether it could be a criminal offence to export prefixes like this either intentionally or by accident. However, we don't know what SLAs the financial institutions you speak of had with their providers. If said institution has paid for a 100% reachability guarantee then I would presume they are entitled to financial compensation. Everyone else, not so much.
I'm really not focusing on financial compensation here (I'm more interested in discouraging people from breaking the internet); with the number of people affected, that's a topic you could write books on.
I am focusing on liability though. I very much believe Telecom Malaysia should face criminal charges for this. (I do not know if they should be convicted though, as I am not aware of all the facts; that's for the court to figure out.)
In most countries (I do not know if this applies to Malaysia too, but I believe it should) denial of service attacks are a criminal offence, and I'd say exporting prefixes like this would constitute one.
I agree that denial of service attacks are at the very least unlawful. However, an attack? I think not. For it to be an attack I would presume there must be some evidence of malice and intent. I have seen no such evidence.
I don't actually think it's an "attack" either. But guilty or not is binary, the actual sentence tends to be affected by details such as malice and intent.
> You're completely ignoring the potential monetary damages.
If there's a car accident that closes the road, it's frustrating for everyone involved. But if someone's crash means I'm late for a vital sales meeting and I lose a multi-million-dollar contract, I wouldn't blame them for my consequential losses.
But even if they do, they won't (and shouldn't) be held liable for any subsequent things that happen in your life because you were delayed in traffic, consequent to that accident.
You should do that though: it would allow us to know the complete cost of the accident, so we can allocate the correct amount of resources to preventing it.
Should the complete cost of the accident include monetary losses from people who had no plans for dealing with unexpected delays?
Or to put it another way; Lets say you're driving across a bridge with your laptop in your car and you get hit causing you to drive through the rail and into the ocean. Should the 'complete cost of the accident' change based on whether or not you were smart enough to make backups?
If they can't take something like this, they should not depend on public packet-switched networks. In many ways, "the internet" is a victim of its own success -- it works so well most of the time that people come to expect too much of it.
I don't think you fully understand what "should" means here. I'm obviously not implying that the legal framework necessarily exists, but that it should. (And not for drone strikes ffs)
International legal agreements for criminal charges are pretty limited (AFAIK, IANAL). I guess I could see internet service disruption becoming subject to international law, but it also seems to introduce another vector for abuse.