
I've been having major internet issues lately (Seattle area), have had 4 techs come try to figure it out. Yesterday's tech finally correctly diagnosed the problem as happening before the connection reaches our home but was unsure of the cause. He called his supervisor to investigate, and they found that the capacity for our neighborhood's node was nearly at 100%, while ideally it should always be under 80%. Fortunately they said they'll be able to fix it within a few weeks by doing a node split. The tech mentioned he'd never heard of capacity issues before in his ~20 years as a tech and that some smaller ISPs have been having issues keeping their internet up and running at all.

I've been tracking the performance with PingPlotter, if you're curious how bad it is right now here's the last 10 minutes: https://i.imgur.com/AnUqv3j.png (red lines are packet loss). Pretty interesting how current circumstances are pushing even tried and tested infrastructure to its limits.

If you didn't know, that 80% number is probably the result of Little's Law. That's the result where if your demand is generated by a Poisson process, and your service has a queue, 80% utilization of the service is where the probability of an infinite queue starts to get really high. People

Here's a nice blog post about the subject:


This law does not apply to queueing as encountered in routers. It assumes unbounded queues and a Poisson arrival process (i.e. a memoryless channel); neither assumption holds for packet routers and senders using congestion control (TCP or otherwise).

There is, however, a high chance of encountering buffer bloat if countermeasures are not taken at the chokepoint: https://en.wikipedia.org/wiki/Bufferbloat

Modern cable modems, for example, are required to implement such countermeasures. My ISP is at over 90% capacity and round trip times are still mostly reasonable. (Bandwidth is atrocious, of course.)
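
For anyone curious where an "80%" rule of thumb could come from, here is a minimal sketch of the textbook M/M/1 queue. It assumes Poisson arrivals, exponential service times and an unbounded buffer, which are exactly the assumptions that, as noted above, don't hold for real routers with finite buffers and congestion-controlled senders. The function name and the utilization values are only illustrative.

```python
def mm1_mean_in_system(utilization: float) -> float:
    """Mean number of jobs in an M/M/1 system (waiting plus in service): rho / (1 - rho)."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1) for a stable queue")
    return utilization / (1.0 - utilization)


# Illustrative utilization levels, not measurements from any real network.
for rho in (0.5, 0.8, 0.9, 0.95, 0.99):
    mean_jobs = mm1_mean_in_system(rho)
    print(f"utilization {rho:.0%}: about {mean_jobs:.1f} packets in the system on average")
```

In this idealized model the average backlog goes roughly 1, 4, 9, 19, 99 for those utilizations, so there is no magic threshold at 80%, just a curve that bends sharply somewhere past it.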

How do you monitor this? You mention being at over 90% capacity; I would like to see where mine is at.

There might be a way using a cable TV receiver (see my other comment on this thread), but in my case, a sales rep of my ISP just told me on the phone.

I have an older modem (DCM476) and it definitely doesn't have this or doesn't have it enabled. I have to use/tune queue management myself on the router side.

Yes, it's mandatory only as of DOCSIS 3.1, and yours seems to be 3.0. (Supposedly it has been "backported" to 3.0, but that obviously would not apply to existing devices certified before that amendment to the spec.)

To add:

If you have more control over or knowledge of your load, you can safely go higher than 80%.

Eg when I was working at Google we carefully tagged our RPC calls by how 'sheddable' they were. More sheddable load gets dropped first. Or, from the opposite perspective: when important load is safely under 100%, which it is almost all the time in a well-designed system, we can also handle more optional, more sheddable load.
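
A toy sketch of that idea, and definitely not how Google's RPC stack actually does it: every request carries a sheddability tier, and when offered load exceeds capacity the most sheddable tiers get dropped first. The tier names, `Request` and `admit` are all made up for illustration.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    # Lower value = more important; higher value = shed first. Names are hypothetical.
    CRITICAL = 0
    DEFAULT = 1
    SHEDDABLE = 2


@dataclass(frozen=True)
class Request:
    name: str
    tier: Tier
    cost: int  # abstract capacity units this request consumes


def admit(requests: list[Request], capacity: int) -> tuple[list[Request], list[Request]]:
    """Walk requests from most to least important, admitting each one that still fits."""
    admitted, shed = [], []
    remaining = capacity
    # sorted() is stable, so arrival order is preserved within a tier.
    for req in sorted(requests, key=lambda r: r.tier):
        if req.cost <= remaining:
            admitted.append(req)
            remaining -= req.cost
        else:
            shed.append(req)
    return admitted, shed


batch = [
    Request("prefetch", Tier.SHEDDABLE, 4),
    Request("user-read", Tier.CRITICAL, 5),
    Request("analytics", Tier.SHEDDABLE, 3),
    Request("user-write", Tier.CRITICAL, 3),
    Request("background-sync", Tier.DEFAULT, 2),
]
admitted, shed = admit(batch, capacity=10)
print("admitted:", [r.name for r in admitted])
print("shed:", [r.name for r in shed])
```

With 10 units of capacity the two critical requests and the default one fit, and both sheddable ones are dropped. Give it more capacity and the optional load rides along for free, which is the "opposite perspective" above.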

As a further aside, parts of the financial system work on similar principles:

If you have a flow of income over time, like from a portfolio of offices you are renting out, you can take the first 80% of dollars that come in on average every month and sell that very steady stream of income off for a high price.

The rest of income is much choppier. Sometimes you fail to rent everything. Sometimes occupants fall behind on rent. Sometimes a building burns down.

So you sell the rest off as cheaper equity. It's more risky, but also has more upside potential.

The more stable and diversified your business, the bigger proportion you can sell off as expensive fixed income.

I've noticed that above 70-80% it gets pretty hard to ensure that interrupt timing can be met and balanced with low priority main looping in a lot of my bare metal embedded projects.

The tech was full of shit. This happens literally all the time. You probably won’t get a “node split” unless more people loudly complain. It’s cheaper for them to roll a tech and hope you get fed up than it is to actually fix the problem.

My ISP has been playing the same game with me for months. I finally cancelled the contract when it was about to renew, and I got a very interesting winback call from sales:

Not only did the rep freely share the utilization numbers with me (80% during the day and 90% at night), he also mentioned that things would not get better until end of the year when they would do a node split.

As consolation, they offered me 10x the download speed for half the price. I'm not really sure how that would help congestion...

I work in this field in Spain. Margins in this sector are slim, deployment is expensive. EVERYONE works with simultaneity rates, it's the only way to have cheap connections.

In fiber connections it is actually not that expensive to split a fiber after a CTO, you can actually sort of daisy chain it, but you want to keep everything as standard as possible.

Margins are not slim at all in the USA

You think they're fat in the US? Look north.

Shh, you'll upset the Great Robelus[1] and they may start euthanizing animals....

[1] https://www.thebeaverton.com/2020/03/telus-threatens-to-euth...

I'm Canadian, trust me I know and hate it.

Some EU relatives of mine keep their phone plans living here because it's cheaper with the overseas rate than paying Canadian plan rates (!!!)

Maybe being in the system with a higher speed tier gets you higher priority?

I don't see what motivation a tech would have for lying about this.

I asked a Comcast tech when IPv6 would be available and he said “IP v what?”. Don’t attribute to malice what can be explained by incompetence.

That's like asking a telephone lineman about IPv6. Diff layer in the OSI stack.

My 67 year old grandpa has vague idea what ipv6 is.

He probably was around when the standard was defined. It's amazing this is taking 30 years to replace IPv4.

The transition is definitely taking a long time, are there additional reasons for delaying the switch to IPv6 other than the mitigation of the problem with NAT/private networks?

It requires cooperation from perhaps fifty thousand organisations (there are 45k ASes that announce more than one prefix, and I'm guessing that there may be 5k software vendors). Some of those have orgcharts that aren't very friendly to this kind of change.

Adding to that, even clueful places may be held back by one or more vendor or provider, all of which need to have working v6 support before you yourself can deploy it.

I thought ipv4 and ipv6 addresses could be provided simultaneously (or rather, ipv6 has provisions to be mapped to/from ipv4); you just wouldn't see any real benefits until you could switch wholesale (because you'd still be limited to whatever ipv4 can do)

That is, it was my understanding that there was no real blocker to supporting it in the interim, except for the lack of any immediate benefit. Though I'm also not clear on whether supporting both introduces any significant complexity.

They can be provided simultaneously, that's the normal case.

Suppose an ISP wants to provide IPv6 besides v4. What does that ISP need? Well, first, v6 from the upstreams, that's simple, and v6-capable name servers, routers, that's simple too nowadays.

But there's more. Suppose that the ISP has some homegrown scripts connected to its monitoring or accounting, written by a ninny years ago, uncommented, and some of those assume IPv4, and no one wants to touch them.

Suppose that ISP outsources its support, and the outsourcing company promises to do the needful regarding IPv6 support but never actually does it.

Suppose that that ISP is in a country where ISPs have to answer automated requests from the police or courts, and one of the software packages involved in that has a v6-related bug. Or the ISP worries that it's poorly tested and the ISP's lawyer advises that if there are any bugs, the ISP will be criminally liable.

And so on. Enabling IPv6 may need a fair number of ducks lined up.

Did you ask them ten years ago? Comcast has had v6 for ages.

the point was, i believe, that the techs frequently don't know what they are talking about.

A lot of techs for large orgs don’t. I had a grid electrician in a while ago, replacing unshielded triple phase from the pole, who was convinced that they only use AC in the US, and that here in Europe it’s all DC, so safer, and this is why I can work on it without shutting it down, mate.

The mind boggles. These people maintain our infrastructure.

Wow, that's wrong on several different levels. I can't even begin...

I understand that you don't need an electrical engineering degree to be an electrician, but still, these are some fairly basic concepts in the electric power industry, especially the safety aspects, so you'd think someone working on live wires would know better.

Honestly, any halfway-intelligent person who travels internationally should know that Europe runs at 240VAC/60Hz, because this is really important if you want to use your American electronics there without a transformer. (When I went to Europe last, I brought my laptop, and an adapter which does not convert voltage, only the prongs, but that's OK because the laptop's power brick says it works on everything from 100VAC to 240VAC, as do a lot of electronics these days. But you have to check this first, you can't assume! Plugging a 120V-only device into this adapter could cause a fire.)

Europe runs on 230V 50Hz.

Yep, you're right.

Luckily, the 50/60Hz stuff really doesn't matter these days except maybe for some digital clocks on appliances.

It's instructive I think to look at the job ads for these technicians. It's frequently something on the close order of: can be professional, knows how to drive, can handle close proximity customer service, knows some handyman skills, and oh by the way maybe has seen an Ethernet cable before.

Not that there's anything wrong with that, everyone was entry-level at some point, but engineers who do capacity planning and traffic engineering they are emphatically not.

To contrast this, every Comcast tech (3) that's been in my home has been very knowledgeable. Once they see I'm a "geek" they unload with technical knowledge and generally talk my ear off. That's how I learned my town has fewer nodes per subscriber than any of the surrounding towns, which is why my Internet speed is frequently ass.

Because he wanted you to believe they were going to fix the problem at a later date so he could go to the next job (paid by the gig) and get you to close the ticket (improve his metrics).

I’ve worked at a major ISP, for a decade, and spotting something like this should be so easy to spot. There are tools monitoring load all the time, and areas are routinely getting split etc. to improve bandwidth, so I think your ISP are basically amateurs.

The problem is that most companies aren't going to tell you that their peering circuits are running hot or that their internal network or access layers to the end user are running warm at peak. ISPs all do stat muxing and the line is "we make money when customers don't use the service".

They'll be happy to deal with the last mile segment, but anything beyond that is murky and most companies I know aren't going to share much. Helps to have friends on the inside leak some graphs, though.

> I’ve worked at a major ISP, for a decade, and spotting something like this should be so easy to spot

MRTG graph, ISP circa 1995. Colorized.

See a flat line? That's congestion. Now figure out where it is coming from. Sorry, we have been doing this for thirty years so I'm kind of cranky. It is not rocket science.

Alternatively, load has gone up across the board in a short period of time, so that preventive scaling has fallen behind and they are in recovery mode.

Yes it can, but why would it take several techs to spot something like load? It's the first thing you would check, and it should take no more than 10 s to look it up in a tool.

A "last foot" tech might not even have access to those tools, much less know how to use them.

Rolling out that tech has got to be more expensive than checking the load first.

Dunno how it is in the States, but here in UK rolling out the tech is basically the first thing they do after the unavoidable "have you tried turning it on and off again" phone call. They just don't trust the customer to have any clue and maybe don't want to waste time doing troubleshooting at their end when it's "probably" a downstream issue.

Network Operations should be raising known problem issues to front line call centre staff.

Network congestion issues shouldn't be handed off to field techs to check the local loop (last mile) and CPE (Customer Premises Equipment).

I'm pretty sure it's standard practice at these companies to never let front line call center staff acknowledge known problems. Sometimes, the automated phone menu will give you a recorded generic message that they are currently experiencing a service issue, but that's intended to convince you to hang up and patiently wait for them to sort their shit out. I've never had a front-line rep be at all useful in diagnosing a real problem.

Yeah true.

I guess I need to remember the ISP I work for here in Australia (front line tech support, and then network operations physical security and infrastructure) was widely recognised as the best ISP in Australia multiple years running, so I shouldn't use it as a baseline expectation.

So how was life at internode or aussie?


Yeah, was a good place to work. I was in their Adelaide data centres when iiNet acquired the company.

In NZ you sign up with an ISP, but your local connection is usually handled by the same physical equipment (DSLAM for ADSL, etc) which is owned by a single network provider.

I’m not sure what the incentives are for an ISP to try to get the provider to fix issues, or even if they would e.g. https://company.chorus.co.nz/what-we-do is notoriously bad for service and the copper network is being deprecated. Locally https://www.enable.net.nz/about-enable/ are doing a good job of service, because they are well subsidised by the government and seem to be effectively operated.

> There are tools on monitoring of load all the time

On some days my connection resets 5 times within an hour, which is quite annoying since retraining the connection takes a minute or two. When I call support about it they have zero monitoring in place that would let them know about the recent history of the connection quality, they can only do spot tests of SNR on demand, which of course doesn't show any transient events. According to support forum posts of other users they'd have to explicitly enable "long term monitoring" based on user input to get that information.

Of course SNR line quality is an issue separate from congestion, but still, automatic monitoring appears to be limited.

how can i as a subscriber find out whats the capacity?

It used to be possible to determine the downlink capacity and even current usage with a DVB-C receiver and some Linux software, since DOCSIS is essentially just IP encapsulated in MPEG transport streams on a digital TV channel.

More recent versions of DOCSIS have moved away from that layer of backwards compatibility, so you would probably need some specialised equipment, if it is possible at all (I don't know at what layer exactly encryption happens).

Not amateurs, liars.

So Frontier?

A free alternative to PingPlotter: https://www.thinkbroadband.com/broadband/monitoring/quality

My connection: https://www.thinkbroadband.com/broadband/monitoring/quality/...

In case anyone is shopping for broadband in the UK, I only have great things to say about Zen pictured above. It's so good I just called to upgrade my 80 Mb to a 300 Mb just for fun, meanwhile my quarantined Italian friends are suffering awful internet now that everybody's at home streaming Netflix.

I used to have Virgin fibre and my average ping was 80ms with a ton of jitter. The plot above is my internet while downloading at about 2MB/s average over the past 24 hours, and surprisingly stays the same even at peak download.

I’m being pedantic, but that’s not really Zen, it’s the BT Openreach backend which has really great stability and latencies. I tracked my BT Openreach connection for many years and I never got more than a few ms of jitter, really amazing. However the speeds are not great (70/20), and the coverage is also fairly poor - I'm in a dead zone right now between two local exchanges. So unfortunately I'm forced to use Virgin, which has gotta be the worst ISP in the history of the world (and I have had Comcast!). Terrible network and terrible customer service - I don't know how this company exists.

That's a neat alternative to PingPlotter. I like that it pings from outside, so no client required. I'll check it out, however, I'm in the US, so I bet it's always going to be high latency.

Not a friend of yours, but a quarantined Italian enjoying 1Gb/s here. Never used Netflix ;-)

You're describing an issue specific to US ISPs. It doesn't apply to Europe. From what I read even before the pandemic the US ISPs offered rather crappy services. In Europe, particularly in Poland, I don't have and haven't heard about anyone having any issues with connectivity right now, even though the country is in lockdown, schools and universities are closed, restaurants work only in delivery/take-out mode, companies switched to remote work, ... And still no issues at home nor at work.

Don't make decisions about the European infrastructure based on American problems.

This article is literally about the EU asking Netflix to reduce bandwidth in Europe.

And the comment I replied to was "literally" about problems encountered in US (Seattle is in US FYI).

Having issues with the internet here in the UK today. Unsurprising given that half of the world has suddenly discovered video calling. Mobile network seems more stable.

In my country (NL), a lot of the backbone of both cable TV and internet on a street level has been replaced with fiber already; I can imagine that in the US, due to the scale, this process is slower. Doesn't have to be fiber-to-the-home, 20mbit should be enough for everyone for example.

I haven't had any problems with my internet (I do have fiber straight into my house, wired network on my laptop, fast.com reports 600 Mb/s), but Skype, which we use for meetings, has been pretty shit in terms of sound quality.

20mbit is not enough if you have kids with retina display ipads looking at youtube!

I work virtually from New Zealand with my colleague in Lombardy Italy. Today I noticed some more serious degradation in video call quality for the first time.

But mostly I'm amazed how well the internet is working given the circumstances.

I'm in Poland as well and I've been working remotely for over three years. Since the lock down started I feel that everything is a bit slower and less stable, but I haven't experienced major issues during usual work hours doing work-related things (maybe except MS Teams acting up). However Netflix is broken most of the time during afternoon hours (when I want to keep kids occupied with cartoons for an hour or so to get things done). Luckily other streaming services work fine.

In contrast, my internet connection finally started working great since lockdowns started. I suspect my ISP (small local company in central Poland) got some additional bandwidth or somehow finally fixed their infrastructure when they saw increased internet usage among their clients.

It depends. If there is competition, things can be good. I live in a place in the US where there are 3 broadband providers, and I pay far less than $100 for a symmetric gigabit connection, and I get it too.

FWIW, the reason nodes typically don't get to 100% is due to something called WRED (Weighted Random Early Detection). As the outbound/inbound queue on your "node" approaches fullness, it randomly selects packets to drop. This signals TCP on the sender to back-off. The closer to full-ness it gets, the higher the probability (weight), so the sender knows to slow down to the slowest link's speed.
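
Roughly what that drop curve looks like, as a simplified sketch rather than any vendor's implementation: real RED works on a moving average of queue depth and has more tuning knobs, and the per-class thresholds below are invented numbers. The "weighted" part is modelled here as a separate drop profile per traffic class.

```python
import random


def red_drop_probability(avg_queue: float, min_th: float, max_th: float, max_p: float) -> float:
    """Classic RED curve: 0 below min_th, linear ramp up to max_p at max_th, then drop everything."""
    if avg_queue < min_th:
        return 0.0
    if avg_queue >= max_th:
        return 1.0
    return max_p * (avg_queue - min_th) / (max_th - min_th)


# One profile per traffic class (hypothetical numbers, in packets of average queue depth).
PROFILES = {
    "best_effort": dict(min_th=20, max_th=40, max_p=0.10),
    "priority": dict(min_th=35, max_th=50, max_p=0.05),
}


def should_drop(traffic_class: str, avg_queue: float, rng: random.Random) -> bool:
    """Randomly decide whether to drop an arriving packet of the given class."""
    return rng.random() < red_drop_probability(avg_queue, **PROFILES[traffic_class])


for depth in (10, 25, 30, 39, 45):
    p_be = red_drop_probability(depth, **PROFILES["best_effort"])
    p_pr = red_drop_probability(depth, **PROFILES["priority"])
    print(f"avg queue {depth:>2}: best_effort p={p_be:.3f}, priority p={p_pr:.3f}")
```

The point of the ramp is that a few senders get an early loss signal and slow down before the queue is actually full, instead of everyone getting tail-dropped at once.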

I've written more about this problem here [0].

[0] https://rkeene.org/projects/info/wiki/176

Thanks for the write-up!

I wonder how TCP BBR would react here. If I understand it right, it wouldn't need RED to back off: the increased latency of buffers filling up would do that automatically. But BBR also wouldn't let the occasional dropped packet make it back off.

From what I understand about TCP BBR from reading about it the past few minutes, it would compute a new link speed as a result of impacts from WRED and then use that for the connection baseline speed.

TCP BBR would still rely on RED/WRED to compute the connection rate estimate initially, then it would attempt to send below that rate to avoid packet loss. If packet loss is detected it would recompute the estimated connection rate.

I found this page [0] useful, especially the graphs.

[0] https://blog.apnic.net/2017/05/09/bbr-new-kid-tcp-block/

> have had 4 techs come try to figure it out

Doesn't sound ideal for distancing.

The way the story is written, it sounds like there was serial temporal distance involved (they didn't come at the same time).

I mean inviting four different people into your home sounds silly if they're there at the same time or not! I guess people need internet to earn a living though.

Comcast's last mile network in Seattle has been struggling in some areas from the morning until around 4 to 5 PM. It's not massive loss, but enough to disrupt video conferencing. Run mtr towards an Internet destination and you'll see loss at the first hop and everything behind it.

Mtr isn’t a reliable measure of packet loss. Routers drop “extra” packets like ping before they drop “paying” packets.

Yes I'm well aware of routers policing TTL=1 packets, but if you see consistent loss all the way down it's usually a sign. This compared to seeing individual spikes on intermediate routers which are usually control plane policing.
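
That rule of thumb can be written down as a small sketch. This is not something mtr does for you; `first_real_loss_hop` and the 1% threshold are made-up illustrations, and the input is just the per-hop loss percentages you would read off an mtr report.

```python
from typing import Optional


def first_real_loss_hop(loss_pct: list[float], threshold: float = 1.0) -> Optional[int]:
    """Return the 1-based hop where loss starts and persists through the final hop, else None.

    Loss that shows up at a middle hop but not at later hops is most likely that
    router rate-limiting its own ICMP replies, so it is ignored here.
    """
    first = None
    # Walk backwards from the destination while every hop still shows loss.
    for hop in range(len(loss_pct), 0, -1):
        if loss_pct[hop - 1] >= threshold:
            first = hop
        else:
            break
    return first


# Hypothetical per-hop loss percentages, hop 1 first.
policed_only = [0.0, 0.0, 35.0, 0.0, 0.0, 0.0]  # spike at hop 3 only
congested = [4.0, 5.0, 3.5, 4.2, 6.0, 4.8]      # loss from hop 1 all the way down

print(first_real_loss_hop(policed_only))
print(first_real_loss_hop(congested))
```

With the sample data this prints `None` for the single-hop spike and `1` for the case where loss starts at the first hop and carries through to the destination.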

mtr uses UDP data packets, as far as I am aware.

Yes, the ICMP response packets could still be skewed, and the effect you mention is definitely real, but on a good connection, usually there should not be much to drop at all, neither TCP/UDP traffic nor ICMP packets.

>mtr uses UDP data packets, as far as I am aware.

Doesn't matter what it uses (though by default MTR does use regular old ICMP Echo - you have to specify -u or -t to get it in UDP or TCP mode). When TTL expires it still requires an ICMP TTL Exceeded be sent, regardless of whether or not you were sending ICMP through it.

Traceroute implementations in general are probably telling most everyone in this thread a lot less than they think, even without icmp deprioritization being taken into account.

https://archive.nanog.org/meetings/nanog47/presentations/Sun... is worth a read for most anyone that's ever attempted to use traceroute to troubleshoot networking, because they're almost certainly doing it wrong.

This happened to me years ago near the University of Illinois campus (UIUC) with Comcast. I had multiple techs come out but they would only come in the morning when the connection was fine. I finally escalated to corporate who finally told me they needed a node split. I made them give me 100% free internet until the split was complete about 6 months later.

Since I have been at home I practically live in MS Teams, with constant video chats. Yesterday I did a presentation with 140 people connecting watching my ppt and camera. That's got to be unusual. I imagine most of my colleagues are going through this routine daily.

> I've been tracking the performance with PingPlotter, if you're curious how bad it is right now here's the last 10 minutes: https://i.imgur.com/AnUqv3j.png

Is your own connection idle though? Pings are also affected by the congestion on your own router†, especially if you don't have good AQM (such as CAKE). Dumb queues will just drop all packets equally, smart queues will do flow isolation and penalize the bulk flows first while keeping the trickle ones (ping, ssh, voip, ...) untouched.

† and anything else along the path to your ping target

When I have connectivity issues during a pandemic I make sure at LEAST 6 techs come to make sure I have perfect connectivity to Netflix and chill.

Ho lee sh, that is absolutely crazy.

I am sure it's affecting your internet speed. What sorts of tasks are you generally doing now that the entire state is pretty much on lockdown?

Here in Alberta, although we are told to be socially distant, there is no full lockdown, and I want to know what kind of issues I should expect to run into in the upcoming weeks/months.

Makes me glad I went with the Business version of Vodafone in the UK - which is ironically £1 cheaper a month than the consumer.

I suspect it's the services that rely on super low prices and don't have excess capacity (Talk Talk etc.) that are really going to feel the pressure in the UK.

What ISP? I’m on Comcast “Business Class” in Seattle and experiencing occasional slowdowns as well.


Ping plotter looks like a SaaS https://hub.docker.com/r/linuxserver/smokeping/

Around the time you posted this, my internet in Seattle was down for around 12 hours yesterday. I'm not fond of my ISP, but that's unusual even for them.

How does it come about that the ISP's Network Operations team didn't know they were saturating a link?

Last ISP I worked at would have email and SMS notifications going to On Call staff.

Because the NOCs may not be all that competent. I remember talking to the Cablevision IP NOC back in the mid 2000s about their internal backbone circuits they were running hot that went to a POP we peered with them. I had Cablevision at home and the congestion was breaking my VPN to work. The NOC said "an OC45 was down" (no such thing, it's an OC48) and that congestion is okay because TCP will work with it okay and there won't be a problem. I shut down the peering session with them to force traffic through a different city (sent it to Chicago). I remember talking to the eng team at Cablevision about their NOC and they had a good chuckle and admitted they're only good for the simplest of operations (link down, go fix).

In some parts of the world running links at 95 percent is okay because look 5 percent left (totally ignorant of buffers or microbursts etc.).

Curious, what ISP do you have? Currently moving to a new place in Seattle and have to decide between Wave G or Atlas Networks.

Thanks for mentioning PingPlotter, I'll try it out to monitor our connexion.
