OK, first, I have to say that I never expected to see "the interwebs are borked" become a national thing. Everywhere I've worked, at some point folks would start wandering around saying "the internet is down," which was code for "help us, we can't use the web," and various folks would then figure out what their particular issue was. Then that problem migrated to my home when we got always-on 24/7 internet: something that started out as "why would I use that?" has become like oxygen, "ZOMG I can't get to the webz!" And here we have an interesting variant on it, where a transit monitoring service notes a lot of disruption. Clearly whoever is the current CIO of the US [1] is not doing their job :-)
That said, there are no doubt folks on the other side of those down links with calls in to three or four NOCs, a couple of trouble tickets being escalated, and people driving out to nondescript buildings near railroad tracks and in industrial areas carrying weird-looking devices that can measure the intensity of laser light and do time-domain reflectometry (TDR) measurements. We can only wait and see what they discover. If we were playing the Ops edition of the game Clue, I'd guess "Colonel Mustard with a Backhoe in New Jersey" :-)
Your second paragraph reminds me of the fantastic South Park episode ("Over Logging;" S12, E6) where The Internet goes down across the entire US, and everyone tracks it to its source in a desperate attempt to fix it.
I've seen huge networks taken down by a rogue router in a closet that everyone forgot about, handing out DHCP leases. More than once.
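If you ever want to check whether more than one box is answering DHCP on a segment, here's a rough sketch of the kind of probe I mean (it assumes you can bind UDP port 68, which usually needs root and no local DHCP client squatting on the port; the names are just mine for illustration):

# Rough sketch: broadcast a DHCP DISCOVER and collect which servers send offers.
# If more than one address answers, something on the segment is handing out
# leases that probably shouldn't be.
import socket
import struct
import os
import time

def build_discover(xid: int) -> bytes:
    """Build a minimal BOOTP/DHCP DISCOVER packet."""
    mac = os.urandom(6)  # throwaway MAC for the probe
    fixed = struct.pack(
        "!BBBBIHH4s4s4s4s16s64s128s",
        1,       # op: BOOTREQUEST
        1,       # htype: Ethernet
        6,       # hlen
        0,       # hops
        xid,     # transaction id
        0,       # secs
        0x8000,  # flags: ask servers to broadcast the reply
        b"\0" * 4, b"\0" * 4, b"\0" * 4, b"\0" * 4,  # ciaddr/yiaddr/siaddr/giaddr
        mac + b"\0" * 10,  # chaddr padded to 16 bytes
        b"\0" * 64,        # sname
        b"\0" * 128,       # file
    )
    # magic cookie + option 53 (message type = DISCOVER) + end option
    options = b"\x63\x82\x53\x63" + b"\x35\x01\x01" + b"\xff"
    return fixed + options

def find_dhcp_servers(timeout: float = 3.0) -> set:
    xid = struct.unpack("!I", os.urandom(4))[0]
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
    sock.bind(("", 68))  # fails if a local DHCP client already owns the port
    sock.sendto(build_discover(xid), ("255.255.255.255", 67))

    servers = set()
    deadline = time.time() + timeout
    while time.time() < deadline:
        sock.settimeout(max(0.1, deadline - time.time()))
        try:
            data, (addr, _) = sock.recvfrom(2048)
        except socket.timeout:
            break
        # only count replies to our transaction id
        if len(data) >= 8 and struct.unpack("!I", data[4:8])[0] == xid:
            servers.add(addr)
    sock.close()
    return servers

if __name__ == "__main__":
    offers = find_dhcp_servers()
    print("DHCP servers answering on this segment:", offers or "none seen")
    if len(offers) > 1:
        print("More than one server is handing out leases -- check for a rogue box.")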
Sometimes it's amazing how much one silly little issue can bork. Even more so when it usually winds up being completely innocent, and not actually a bug.
This takes me back to the FLIX disruption of '97. A lonely router (I seem to recall a Cisco 2500 series, but I could be wrong) at the Florida Internet eXchange was misconfigured. The result was the router advertising itself to the world as the default route for a large chunk of the Internet. I think it was also on a T-1, which was a good link at the time.
I was the senior UNIX systems manager at a business unit HQ of a Fortune 5 company then, and still remember all the people stopping me in the halls that day to ask what was wrong with the Internet.
More recently (Feb 2012), almost all of Telstra's (Australia's biggest ISP) customers were taken offline when Dodo Internet (a cut-price ISP with a low-end reputation) published some bad BGP routes [1].
Internettrafficreport isn't the most reliable (normally there are lots of zeroes on their graphs), but it does indicate a large change in some numbers.
All 3 of the listed Wisconsin routers are small businesses. 2 are currently listed as being down, and have been for months now. However, their web pages are up; they probably just changed their IPs or something and never notified the site.
I could see a core router at UW being a meaningful measure of the internet, but not some small consulting company in a small town.
Aside from infrastructure woes like this, one of the original premises of the internet's resilience was its decentralized and organic design. However, as developers migrate to the cloud, we are going in the exact opposite direction, where a single cloud provider going down takes a ton of popular web services with it. We have moved back to the mainframe model, and the new IT dept is now GAE, AWS, etc. While cloud providers try to decentralize their infrastructure, it seems we are in the early days of figuring out how to do this, because for the past few days we have had major disruptions to essential services like Tumblr (for kitten photos) et al.
Fortunately, to date the affected services are all non-essential, mainly entertainment/trivial stuff like blogs, Instagram, Dropbox, etc., but when we start to see things like water supply and electrical power management systems, hospital records, aviation systems, etc. affected, the consequences could be severe.
If the very best IT minds at AWS and GAE can't keep their systems running, what hope have government departments got? Anyone who's ever been to a DMV or the USPS knows just how good the US Government's IT departments are.
Leslie Lamport in the '80s: "A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable."
I don't think in the '80s anyone, even a luminary like Lamport, really had any idea how distributed systems would evolve. Also, he was talking about something different. I very much know there are computers in Google's datacenter that run my apps, and when they go down my computer is not rendered unusable; rather, the services executed by those apps are no longer accessible. Lamport was talking about distributed systems where multiple computers work together to achieve a common goal. A cloud system is more like a mainframe or a client/server system. Lamport did some fabulous work (at SRI, if memory serves) on distributed systems.
I will give you an example of Leslie's quote in action. At my house we forward DNS requests to a machine hosted by our ISP. (Well, we used to, but let me finish.) My wife came up to me and said "the TV is broken." The TV was trying to load its Netflix application, which was trying to resolve a Netflix URL, which went through our home network to the ISP's two DNS servers, both of which were offline because the switch they were connected to had failed.
Now how to explain to your spouse that the TV is broken because ns-18.sbcglobal.net is not working.
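These days I keep a little script around that pokes each resolver we depend on, so next time the answer can be "the ISP's DNS is dead" rather than "the TV is broken." A rough sketch; the resolver addresses and query name are just placeholders, not our actual setup:

# Rough sketch: send a bare-bones DNS A query to each resolver and see who
# answers. The resolver list below is a placeholder, not a recommendation.
import socket
import struct
import random

def dns_query_packet(name: str, qid: int) -> bytes:
    """Build a minimal DNS query for an A record."""
    header = struct.pack("!HHHHHH", qid, 0x0100, 1, 0, 0, 0)  # RD set, 1 question
    qname = b"".join(bytes([len(p)]) + p.encode() for p in name.split("."))
    question = qname + b"\x00" + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN
    return header + question

def resolver_answers(server: str, name: str = "www.netflix.com", timeout: float = 2.0) -> bool:
    qid = random.randint(0, 0xFFFF)
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    try:
        sock.sendto(dns_query_packet(name, qid), (server, 53))
        data, _ = sock.recvfrom(512)
        # any reply with a matching transaction id counts as "alive"
        return len(data) >= 2 and struct.unpack("!H", data[:2])[0] == qid
    except (socket.timeout, OSError):
        return False
    finally:
        sock.close()

if __name__ == "__main__":
    for resolver in ["192.168.1.1", "4.2.2.1", "4.2.2.2"]:  # placeholder list
        status = "answering" if resolver_answers(resolver) else "NOT answering"
        print(f"{resolver}: {status}")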
I had to explain to my girlfriend today why our Apple TV would play Netflix but the internet on the laptop failed: BT's DNS pooped out, and the Apple TV was on a custom DNS provider. She settled on "isn't the Internet weird?" Which it is, it's really weird.
Realistically, a lot of utilities are sitting on darknets internally, so they shouldn't see disruptions in an internet outage. Telecommunications might be a bit different... I'm curious how things are set up, but in my small industrial shop our sites would be 99% fine with a web outage; they'd just be prevented from posting daily receipts to head office.
It won't. In fact, it will likely get worse as more and more companies deal with big data. There is just too much of a performance benefit to be had by colocating in the same or a neighbouring data center.
Perhaps that's by design; some things don't need 100% uptime. If I have a personal blog on Heroku or enjoy browsing Reddit, I'll take the pros over the cons.
Mission critical services should probably not be on the cloud until cloud tech matures.
Could it just be that the hosts on http://internettrafficreport.com/ are out of date? I'm in Vancouver, BC, trying to hit the UBC hub, and I don't get any further than the main educational provider, bc.net. Maybe the UBC host listed on internettrafficreport.com isn't supposed to be up and it's been replaced with a different host.
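One way to sanity-check a stale host list is to skip ICMP entirely and just try TCP connects, since plenty of these boxes filter ping but still answer on some port. A quick sketch, with made-up hostnames for illustration:

# Quick sketch: check whether hosts from a monitoring list still answer on a
# TCP port, since ICMP is often filtered even when the box is alive.
# The hostnames below are made-up examples, not the site's real list.
import socket
import time

def tcp_probe(host: str, port: int = 80, timeout: float = 3.0):
    """Return round-trip connect time in ms, or None if unreachable."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (time.monotonic() - start) * 1000.0
    except OSError:
        return None

if __name__ == "__main__":
    hosts = ["www.ubc.ca", "www.bc.net", "example.com"]  # placeholder targets
    for h in hosts:
        rtt = tcp_probe(h)
        if rtt is None:
            print(f"{h}: no TCP answer on port 80")
        else:
            print(f"{h}: connected in {rtt:.1f} ms")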
I am having some routing issues with my Frontier DSL service (residential), and after speaking with technical support at Frontier, they confirmed they are having a nationwide routing issue with no ETA currently on the fix.
Packet loss is intermittent regardless of destination.
This may be related to the NYT article about China's political elite. A basic tit-for-tat to say "don't think that posting things in the US about us is without consequences".
Of course, they couldn't possibly be so dumb as to launch a massive DDoS in retaliation. snicker
China looks inwards, not outwards. They firewalled off the NYT from internal users a while ago.
The USA has been fighting a very dirty fight against Iranian science programs, including using the Stuxnet worm. Iran was also recently fingered for attacking Saudi networks.
EDIT: The West also crashed Iran's currency, which lost 40% of its value in one week.
FYI: most serious attacks come out of Chinese networks and are managed by Eastern Europeans, who write the attack software.
Apparently China did not have the NY Times blocked until today, in response to the article (though going back further it might have been blocked at some earlier point):
Wasserain is a senior Linux SA in an ODC located in Beijing. We have seen huge packet loss from the Beijing office to our US data center this week. When we checked the OpenVPN logs, there were many TLS errors. It started the Friday before last (2012-10-12), but the downtime was tiny that week; this week it got longer and longer, roughly 6 minutes down out of every 30 on Monday and Tuesday over the China Telecom line. We then switched to China Unicom, but it was the same or worse, about 10 minutes out of every 40, and the TLS handshake could hardly complete. On Friday we tried a new approach, splitting the plain-text traffic and the OpenVPN traffic across 2 links, which gave us a better result, but we can't be sure that's the real reason; maybe the GF\/\/ is just in its maintenance window.
Guess:
The GF\/\/ is running a pre-Great-18 exercise of its ability to shut down TLS/HTTPS/OpenVPN.
Prophecy:
Some outage will likely happen again in the next 30 days (until the Great-18 is finished).
Another issue we hit this week:
Yahoo Messenger sometimes reports TLS errors when logging in between 09:00 and 10:00 in the morning.
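If you want to measure this instead of eyeballing the OpenVPN log, here is a rough sketch that just attempts a TLS handshake to some host on a schedule and logs the failures (the target host and interval are arbitrary placeholders, not our real setup):

# Rough sketch: periodically attempt a TLS handshake and log failures, to get a
# feel for how often the handshake itself is being broken.
import socket
import ssl
import time
from datetime import datetime

def tls_handshake_ok(host: str, port: int = 443, timeout: float = 10.0) -> bool:
    ctx = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            with ctx.wrap_socket(raw, server_hostname=host):
                return True  # handshake completed
    except (OSError, ssl.SSLError):
        return False

if __name__ == "__main__":
    host, interval = "www.example.com", 60  # placeholder target, once a minute
    while True:
        ok = tls_handshake_ok(host)
        print(f"{datetime.now().isoformat()} {host} handshake {'ok' if ok else 'FAILED'}")
        time.sleep(interval)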
This must not be affecting everyone because my ssh connection from Toronto to Portland is working just fine without additional latency.
Edit:
The Ontario router seems to be dropping packets:
$ ping gw02.wlfdle.phub.net.cable.rogers.com
PING gw02.wlfdle.phub.net.cable.rogers.com (66.185.86.254) 56(84) bytes of data.
From <snip> icmp_seq=1 Packet filtered
From <snip> icmp_seq=2 Packet filtered
From <snip> icmp_seq=3 Packet filtered
--- gw02.wlfdle.phub.net.cable.rogers.com ping statistics ---
3 packets transmitted, 0 received, +3 errors, 100% packet loss, time 10206ms
Though I have no issue with routers under that sub-domain:
$ traceroute <snip>
traceroute to <snip> (<snip>), 30 hops max, 60 byte packets
1 <snip> (192.168.1.1) 1.489 ms 2.038 ms 2.669 ms
2 * * *
3 69.63.243.69 (69.63.243.69) 17.599 ms 17.584 ms 17.339 ms
4 so-4-0-0.gw02.wlfdle.phub.net.cable.rogers.com (66.185.82.97) 31.992 ms 31.972 ms 31.819 ms
5 69.63.253.65 (69.63.253.65) 33.198 ms 34.687 ms 34.596 ms
6 * * *
7 pos-3-15-0-0-cr01.ashburn.va.ibone.comcast.net (68.86.86.25) 35.557 ms 28.952 ms 28.818 ms
8 68.86.85.14 (68.86.85.14) 33.029 ms 42.176 ms 41.924 ms
9 he-0-4-0-0-cr01.350ecermak.il.ibone.comcast.net (68.86.88.146) 49.244 ms 45.218 ms 44.940 ms
10 pos-1-2-0-0-pe01.350ecermak.il.ibone.comcast.net (68.86.86.78) 37.146 ms 40.169 ms 40.372 ms
Note: so-4-0-0.gw02.wlfdle.phub.net.cable.rogers.com is having no issues. I don't know how Rogers' internal network is set up, but it seems like if there are issues, they are handling them so that customers (or at least I) don't see them.
"Packet filtered" is an ICMP response indicating that the ping request was actively answered with something other than an ICMP echo reply. The most likely cause is that the target router, or some other router en route, rejected the request with an ICMP destination-unreachable message, code 13 ("communication administratively prohibited").
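If you want to see those rejections directly rather than trusting ping's one-line summary, here's a tiny sketch that watches incoming ICMP and prints the administratively-prohibited ones (it uses a raw socket, so it needs root, and it runs until you Ctrl-C it):

# Tiny sketch: listen on a raw ICMP socket and print "administratively
# prohibited" rejections (ICMP type 3, code 13). Needs root for the raw socket.
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_RAW, socket.IPPROTO_ICMP)
print("watching for ICMP admin-prohibited messages, Ctrl-C to stop")
while True:
    packet, (src, _) = sock.recvfrom(2048)
    ihl = (packet[0] & 0x0F) * 4              # IP header length in bytes
    icmp_type, icmp_code = packet[ihl], packet[ihl + 1]
    if icmp_type == 3 and icmp_code == 13:
        print(f"{src}: destination unreachable / communication administratively prohibited")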
The site is just listing the status of the few routers it is monitoring. It's not indicative of all traffic.
Do you really think all traffic in/out of MD goes through a single (2 if you count DC) router?
To put it another way: imagine you're monitoring traffic for SF by monitoring average speeds on the northbound 280. One pile-up that blocks the road completely brings the average speed at that point to 0 mph. That doesn't mean every road in SF is blocked. Traffic will bail off the 280 and use other routes to get to their destinations (albeit more slowly, and causing average speeds on the surrounding road network to drop too), but the one thing you are measuring (average speed on the northbound 280) has dropped to 0.
What would seemingly be my ISP's router is listed here as having 100% packet loss for the last 24 hours. But I had great speeds yesterday and over the last few hours, and have been downloading large files.
Perhaps I'm just lucky? Or there is an issue with how this is reported, or there is more than one router that everyone else on my ISP uses.
I remember about 10 years ago one of the UK connections to the US dying, which meant a big chunk of the Internet failed, and everyone was a bit puzzled. That was when the internet-using population was much lower; I wonder how an outage like that would affect people now.
Even here in Amsterdam, The Netherlands, I get reports from friends of their DSL lines dropping. Looks like traffic re-routing is choking up core routers here and there.
Now I see the private-subnet problem is fixed at the Level 3 DNS servers that were borked 2 days ago (4.2.2.1, 4.2.2.2). The one that popped up today is still borked (4.2.2.5).
Seems bogus. The Ashburn router is pingable (at least from near Ashburn) even though it's listed as down:
Pinging 67.215.65.132 with 32 bytes of data:
Reply from 67.215.65.132: bytes=32 time=14ms TTL=56
Reply from 67.215.65.132: bytes=32 time=15ms TTL=56
Reply from 67.215.65.132: bytes=32 time=14ms TTL=56
67.215.65.132 is OpenDNS's "not available" redirector, so you aren't actually pinging the router. It's listed as the IP address for at least one other router that is shown as down.
That's not what the "location" column means: you don't think there are exactly two cords leading into the state of Texas, do you? It's just the physical location of that router.
That's for international communication. States have power poles (notice all the wires on them? they're not just power...) and buried wires that a lot of the internet also goes through.
[1] http://www.archives.gov/press/press-releases/2011/nr11-124.h...