
Weirdly broken Wi-Fi access points - mjn
http://www.kmjn.org/notes/broken_wifi_access_points.html
======
ggggggggggw
For those who are curious, there are a bunch of reasons why consumer wifi
routers suck ass:

1\. Wifi routers are very complicated. You need smart people at all levels of
the stack to build a modern wifi router. You need people who know 802.11ac like
the back of their hand, people who know how to set up and deploy Linux
environments, embedded engineers for debugging OEM driver issues, networking
gurus for handling the voodoo in layers 2-4 of the OSI model, application
people for rolling the user interface, and cloud people for cloud support.
Normally this wouldn't be a big deal, if it weren't for the next point.

2\. The profit margins on consumer routers are complete trash. Even if you're
one of the big boys with double-digit market share, you're going to have a
very hard time keeping a decent engineering team staffed and your marketing
team staffed at the same time while still breaking even.

3\. Consumer router sales are SKU driven. There are dozens of price and
performance points you have to hit to meet the demands of the consumer market.
You cannot be profitable with fewer than 10 actively selling SKUs. Every
time you release a new SKU, it's a new opportunity for marketing to try to
sell the device to brick and mortar stores that they're trying to expand into.
If you aren't releasing 5-6 SKUs each year, you're going to have a very hard
time keeping your router on store shelves.

4\. Since sales are SKU driven and your engineering team is probably
underfunded, you have the exciting problem of maintenance releases. If your
company has 50 supported SKUs and you find a non-driver issue in one of them,
the chances are that it affects 10 other SKUs as well, if not all of your other
SKUs. Pushing that maintenance firmware to 50 SKUs could easily take 6 months
of combined QA or firmware development time. As far as your marketing
department is concerned, all that time you spend on maintenance releases is
time that isn't spent on making new SKUs with exciting new features.

tl;dr: get a business-class wifi router

~~~
valine
Are there any specific business class routers you would recommend?

~~~
js2
I like the Ubiquiti gear. The AC Lite is affordably priced. I'm also happy
with the ERLite router.

~~~
fifteenforty
The best part is the fq_codel support :-)

~~~
js2
Unfortunately it really limits performance, to about 60 Mbps on an ERLite.

~~~
fifteenforty
Yeah, it disables all the hardware offloading unfortunately. But if you really
need it because of a slow internet connection, I guess 60Mbps is good enough.

------
giovannibajo1
I think there's also an issue of easy diagnostics. It's extremely hard for
non-tech people to assess the quality of the network; they basically have to
rely on "time to open Google", which is obviously an unreliable metric, and
can be measured only with a manual test.

Wifi has a single quality indicator, the "signal power" which is useful but
doesn't say it all (it doesn't even take SNR into account). Somebody should
come up with a monitoring algorithm, possibly mixing radio stats and stability
of pings, that converts into a simple indicator on the UX of all operating
systems. Something like green/yellow/red. Once people know their connection is
"always red with MacBooks", they will fix it; hotel managers can ask for
support from suppliers, and Airbnb hosts can go to the mall where they bought
the router and complain.

This would be up to the Wi-Fi Alliance to fix, but they are the slowest-moving
committee ever seen on this planet, so I'm not holding my breath.
They're probably implementing a new useless encryption algorithm that will be
broken in 5 minutes and stay broken for another 5 years till they agree on
something else.
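A minimal sketch of the kind of indicator giovannibajo1 describes, mixing radio stats with ping stability. It produces a continuous 0-100 score and only optionally collapses it to green/yellow/red. Every threshold below (the 10-40 dB SNR range, 500 ms latency ceiling, 200 ms jitter ceiling, band cut-offs) is an illustrative assumption, not a standard.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import List, Optional


@dataclass
class LinkSample:
    signal_dbm: float               # received signal strength reported by the driver
    noise_dbm: float                # noise floor reported by the driver
    rtts_ms: List[Optional[float]]  # recent ping RTTs; None marks a lost ping


def _clamp01(x: float) -> float:
    return max(0.0, min(1.0, x))


def link_score(s: LinkSample) -> float:
    """Continuous 0-100 quality score. All thresholds are assumptions."""
    replies = [r for r in s.rtts_ms if r is not None]
    if not replies:
        return 0.0
    snr_db = s.signal_dbm - s.noise_dbm
    radio = _clamp01((snr_db - 10.0) / 30.0)              # 10 dB SNR -> 0, 40 dB -> 1
    delivery = len(replies) / len(s.rtts_ms)              # fraction of pings answered
    latency = _clamp01(1.0 - mean(replies) / 500.0)       # 0 ms -> 1, >= 500 ms -> 0
    stability = _clamp01(1.0 - pstdev(replies) / 200.0)   # penalise jitter
    # Multiply, so that any single bad factor drags the whole score down.
    return round(100.0 * radio * delivery * latency * stability, 1)


def traffic_light(score: float) -> str:
    """Optional coarse banding on top of the continuous score."""
    if score >= 60:
        return "green"
    if score >= 30:
        return "yellow"
    return "red"
```

Multiplying the factors (rather than averaging them) is a design choice: a strong signal cannot hide 50% packet loss.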

~~~
jstanley
If you do this, please make it a continuous scale rather than
green/yellow/red. There's a big difference between "the worst connection that
still gets a green" and "a really good connection".

~~~
Senji
I suggest we express it in dBs. :^)

~~~
tomchuk
Or an S meter -
[https://en.wikipedia.org/wiki/S_meter](https://en.wikipedia.org/wiki/S_meter)

------
whatever_dude
I wonder how many of those could be fixed by simply rebooting the router.

I've had numerous problems with my own Time Warner Cable modem/router. It'd
always become pretty bad after about 3 days of operation, dropping my speed to
about 2% of what it was initially. I spent months debugging the damn thing,
using a custom router, swapping cables, etc, and nothing would fix it.

Until I got a timer and got it to reboot every day at 3 am (the
[https://xkcd.com/1495](https://xkcd.com/1495) route) and never had the
problem again.

~~~
digi_owl
Rebooting a router is not always an option.

Both examples are from "public" wifi where the person with the problem is
unlikely to be the only user.

~~~
jo909
Probably a 5 minute downtime at 4 AM is overall preferable to a continuously
unstable connection. Of course an entity like a hotel should just pony up the
money for better hardware, but I could excuse that for an (Air)BnB or guest
house or vacation apartment.

~~~
jjp
And for guest houses etc., the device's default password is often unchanged,
so you could always log in and reboot it yourself.

------
ploxiln
Here's a theory based on a bunch of unsubstantiated and unresearched facts:

Some routers do not allow traffic from IPs which are not active DHCP leases.
When DHCP lease expires, they block traffic. But many routers don't enforce
this.

Some routers give out super short DHCP leases - as short as 30 seconds. But
the typical length is 12 or 24 hours.

A public or semi-shared wifi router is more likely to use these policies to
prevent exhausting the local subnet address pool, which is typically around
254 addresses.

Mac OS X doesn't believe that renewing a DHCP lease every 30 seconds is ever
really necessary, and limits the renewal frequency to once per minute.

I'd check the DHCP lease time.
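The arithmetic behind this theory can be sketched as follows. RFC 2131 has a client first try to renew at T1, which defaults to half the lease. The 60-second client-side floor is ploxiln's unverified claim about OS X, used here as a hypothetical parameter; the sketch also assumes each renewal succeeds instantly.

```python
def renewal_time(lease_s: float, min_renew_interval_s: float) -> float:
    """When the client first tries to renew: RFC 2131's default T1
    (half the lease), pushed back to a client-side floor if that is later.
    min_renew_interval_s is a hypothetical parameter, not a documented OS X value."""
    t1 = 0.5 * lease_s
    return max(t1, min_renew_interval_s)


def unleased_seconds_per_cycle(lease_s: float, min_renew_interval_s: float) -> float:
    """Seconds per renewal cycle during which the client holds no valid lease.
    A router that blocks non-leased IPs would drop traffic for this long."""
    return max(0.0, renewal_time(lease_s, min_renew_interval_s) - lease_s)
```

With a 30-second lease and a 60-second floor, the client would be unleased for 30 seconds out of every 60; with a typical 24-hour lease the floor never matters.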

~~~
smellf
Are you serious? A 30 second lease is insane.

I've seen DHCP servers that TAKE 60 seconds to give a lease.

~~~
JoeAltmaier
Was that the server, or the client waiting that long? The DHCP spec requires
the client to broadcast a request, then wait for a while (unspecified) and
collect answers, then choose the best answer (metric unspecified).

Sometimes when I write DHCP clients I choose the metric as "fastest", then
just choose the first answer and don't wait for any more. And it takes on
average a few milliseconds to get a lease.
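The two client policies JoeAltmaier contrasts can be sketched side by side. The `Offer` type is hypothetical, and "longest lease" as the "best" metric is an arbitrary assumption, since the selection metric is left unspecified.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Offer:
    arrival_s: float  # seconds after the DHCPDISCOVER broadcast
    server: str       # server identifier
    lease_s: int      # offered lease time


def pick_first(offers: List[Offer]) -> Optional[Offer]:
    """'Fastest' policy: take the earliest offer and stop listening."""
    return min(offers, key=lambda o: o.arrival_s, default=None)


def pick_best(offers: List[Offer], window_s: float) -> Optional[Offer]:
    """Collect offers for window_s seconds, then choose by a metric.
    Longest lease is an illustrative choice, not one mandated anywhere."""
    in_window = [o for o in offers if o.arrival_s <= window_s]
    return max(in_window, key=lambda o: o.lease_s, default=None)
```

The "fastest" policy's cost is just the first offer's arrival time, while the collecting policy always pays the full window, which is where multi-second waits can come from.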

~~~
smellf
It wasn't the client, but beyond that honestly I don't know, I didn't get into
the nitty gritty of it. Could have been a terrible or overloaded piece of
server hardware, or maybe a network loop or something else misbehaving between
the client and server.

We built a portable device that starts a DHCP client and tries to get a lease
on an Ethernet plug event, and if no lease is acquired after a certain amount
of time, the device assumes it has been plugged directly into a PC (or an
"ad-hoc", isolated network with a switch and no router) and will kill the client
and start a DHCP server itself, so you can access the device's webserver
directly. I wasn't onsite, but the solution was simple - we just increased the
timeout to wait for a lease before starting the server.
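The fallback smellf describes can be sketched as a simple timeout loop. `try_get_lease` and `start_dhcp_server` are hypothetical hooks standing in for the device's real DHCP client and server processes; the clock and sleep functions are injectable so the logic can be tested without waiting.

```python
import time
from typing import Callable, Optional


def acquire_or_serve(
    try_get_lease: Callable[[], Optional[str]],  # hypothetical: returns leased IP or None
    start_dhcp_server: Callable[[], None],       # hypothetical: becomes the DHCP server
    timeout_s: float,
    poll_s: float = 0.5,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Act as a DHCP client until timeout_s elapses; if no lease arrives,
    assume an isolated link and start serving addresses instead."""
    deadline = clock() + timeout_s
    while clock() < deadline:
        lease = try_get_lease()
        if lease is not None:
            return f"client:{lease}"
        sleep(poll_s)
    start_dhcp_server()
    return "server"
```

In this model, raising `timeout_s` past the slow network's lease delay is exactly the fix described above; the trade-off is a longer wait before the device becomes reachable on a direct PC connection.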

~~~
BuildTheRobots
Depending on what switch you're plugging into, this could easily be spanning
tree catching you out (without Cisco PortFast or a similar feature enabled).

When the port goes active (layer 2 link) the switch inspects but doesn't
forward packets to try and ensure you've not just added a loop into your
ethernet network. After a minute or so, if everything looks normal, the switch
then starts letting the device talk to the rest of the network.

If you don't know about it then this can be annoying at best, as you seem to
have a layer 2 link but DHCP can seem to take an age to start working :)
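A small calculation shows why the wait can feel closer to a minute than to the spanning-tree delay itself. It assumes classic 802.1D defaults (15 s listening plus 15 s learning before forwarding) and RFC 2131's suggested retransmission backoff (4, 8, 16, ... up to 64 s), ignoring the RFC's ±1 s randomization; real clients vary.

```python
from typing import List, Optional


def discover_send_times(count: int, first_delay_s: float = 4.0,
                        max_delay_s: float = 64.0) -> List[float]:
    """Nominal DHCPDISCOVER send times with RFC 2131-style exponential backoff."""
    times, t, delay = [], 0.0, first_delay_s
    for _ in range(count):
        times.append(t)
        t += delay
        delay = min(delay * 2, max_delay_s)
    return times


def first_discover_forwarded(forwarding_at_s: float, count: int = 10) -> Optional[float]:
    """Send time of the first DISCOVER the switch port actually forwards."""
    return next((t for t in discover_send_times(count) if t >= forwarding_at_s), None)


STP_FORWARD_DELAY_S = 15.0                # assumed 802.1D default
PORT_FORWARDING_AT = 2 * STP_FORWARD_DELAY_S  # listening + learning
```

With these assumptions the DISCOVERs go out at 0, 4, 12, 28 and 60 s; the one at 28 s is dropped just before the port starts forwarding at 30 s, so the first one that gets through is at 60 s. With PortFast the very first DISCOVER at 0 s goes through.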

------
zeta0134
I wish he would drop the name of the router he's running into these issues on.
I have the same exact issue with my laptop (random disconnects for 30 seconds)
only on my home router, and I've been considering replacing the router with
something of higher quality.

I use an ASUS 750N, which is starting to show its age faster than I expected,
but otherwise serves the devices in my house just fine. Except for my
laptop, which is the only device I actually need to use over wifi on a regular
basis.

~~~
givinguflac
I don't think the model is 750N; more likely that's the max speed. Probably
you have the RT-N56U or RT-N66U. The Merlin firmware for Asus routers has really
improved my experience with the RT-AC87U. It keeps the stock UI and fixes
bugs/adds features.
[https://asuswrt.lostrealm.ca](https://asuswrt.lostrealm.ca)

~~~
zeta0134
You're quite correct, the 750N was from memory. The actual model is an
RT-N65R.

That firmware looks lovely! I will give that a whirl when I have time to dink
around with it, possibly this weekend. Thanks for that!

~~~
givinguflac
Happy to help out :)

------
ChuckMcM
For what it's worth, some hotels have been known to intentionally degrade
network performance / access to push people to the hotel supplied service.
That has included screwing up people trying to stream video into their room
via the network rather than buying the PPV movies that the Hotel provides.
Either of these failure modes has a good chance of degrading the viewing
experience of streaming video sufficiently to make it unusable.

If you are in a hospitality situation you can demand a refund on your Internet
charges, or if the Internet is "free" consider asking for a discount on your
bill.

That said, power line networking (the author mentioned devolo) is notoriously
fraught with challenges. A company I helped start was acquired by the folks
who put power line networking on the map (Tut Systems) and there were a lot of
interesting interference sources they tried to mitigate. Inductive loads
(motors) coming on and off was a big challenge, and places like hotels would
have large fans that would circulate air through the common areas or hallways.
Fluorescent lights were another noise source.

All in all it was a poor excuse for a network and everyone was amazed they got
a megabit per second point to point through it reasonably reliably.

------
wtracy
My own Linksys router (I think it's a WRV200) occasionally starts to silently
drop new connections until I disconnect and reconnect. If my browser has an
HTTP/2 connection open, I can continue to browse that particular site, but
when I navigate to another site, the browser times out. It seems to only
affect one device at a time.

Curiously, the behavior became dramatically more frequent (from once every few
months to several times a week) when I moved in with family, which meant
plugging it into a different modem, and adding a bunch of Windows devices to
the network.

(When I move back out, I'll probably leave this router behind and get myself
something that runs Tomato.)

~~~
userbinator
That sounds like the NAT table has filled up.

Incidentally, this is often the cause of problems with "routers" that people
experience --- it's not the (IP) routing part that's giving trouble, as that's
stateless and involves not much more than packet forwarding. Things like NAT
do involve state, since the router has to assign a port mapping and keep track
of (TCP) connection lifetimes. If connections aren't closed correctly or the
router misses detecting them for whatever reason, the port mappings will stay
in the table until they time out (which may take hours or more) or the
"router" is rebooted. The problem becomes more frequent the more devices there
are on the network which are making connections.
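A toy model of the state userbinator describes: a port-mapping table with a fixed capacity and an idle timeout. The capacity, timeout and port numbering are illustrative assumptions (real NAT implementations also handle port wraparound and per-protocol timeouts).

```python
from typing import Dict, Optional, Tuple

Flow = Tuple[str, int, str, int]  # (inside IP, inside port, remote IP, remote port)


class NatTable:
    """Port-mapping table with fixed capacity and idle timeout (toy model)."""

    def __init__(self, capacity: int, idle_timeout_s: float, first_port: int = 40000):
        self.capacity = capacity
        self.idle_timeout_s = idle_timeout_s
        self.next_port = first_port
        self.mappings: Dict[Flow, Tuple[int, float]] = {}  # flow -> (public port, last seen)

    def _expire(self, now: float) -> None:
        stale = [f for f, (_, seen) in self.mappings.items()
                 if now - seen >= self.idle_timeout_s]
        for f in stale:
            del self.mappings[f]

    def outbound(self, flow: Flow, now: float) -> Optional[int]:
        """Public port for this flow, or None if a new flow cannot be mapped."""
        if flow in self.mappings:          # existing connection keeps working
            port, _ = self.mappings[flow]
            self.mappings[flow] = (port, now)
            return port
        self._expire(now)
        if len(self.mappings) >= self.capacity:
            return None                    # table full: new connection silently dropped
        port = self.next_port
        self.next_port += 1
        self.mappings[flow] = (port, now)
        return port

    def close(self, flow: Flow) -> None:
        """Called only if the router notices the connection closing cleanly."""
        self.mappings.pop(flow, None)
```

In this model an already-mapped connection keeps working while every new one fails until an entry times out, which is consistent with (though not proof of) the symptom wtracy describes.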

------
seiferteric
Every free wifi I have ever been on has been broken like this in one way or
another, crazy ping times, randomly dropped packets or just extremely slow
etc. In fact I am writing this right now from an Amtrak train but tethered to
my phone since the free wifi is so bad.

~~~
jsight
I think Amtrak uses cellular connections to the outside internet. Depending
upon the part of the country, there is a good chance that the weak point is
their link to the cellular network.

~~~
mindslight
It seemed like Amtrak uses Verizon for backhaul, at least on the Lake Shore
Limited. It definitely goes through some spotty coverage areas.

Personally I couldn't care less about their wifi though - I appreciate the
power outlets much more.

------
colanderman
MacBooks have issues with WMM (WiFi QoS). I've seen my connection drop for
dozens of seconds at a time when I enable WMM on my router. Selectively
disabling it for my Mac (leaving it on for my wife's Lenovo and my HTPC) fixes
the issue.

~~~
xhruso00
Thanks for the tip. Just turned it off. Hopefully it will solve the issues I
have only on my MacBook; my iPhone works OK.

------
userbinator
The details are vague but if I remember correctly, this is due to something OS
X does with DHCP that isn't quite standard but apparently helps to make
connection resumption faster --- when it works, that is.

~~~
MBCook
I seem to remember this. Was it that OS X started using its old lease while
waiting for it to renew under the assumption that it would probably get the
same thing and be fine?

~~~
hug
That's almost exactly the case.

Link here:
[http://cafbit.com/entry/rapid_dhcp_or_how_do](http://cafbit.com/entry/rapid_dhcp_or_how_do)

Previous HN discussion here:
[https://news.ycombinator.com/item?id=2755461](https://news.ycombinator.com/item?id=2755461)
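The idea MBCook describes (reuse the old lease immediately while renewal is still in flight) can be sketched as a decision function. Checking that the remembered gateway still answers from the same MAC is an assumption about how a client might recognise "the same network", and `arp_gateway` is a hypothetical hook; this is not Apple's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CachedLease:
    ip: str
    gateway_ip: str
    gateway_mac: str  # remembered from the network where the lease was obtained


def choose_config_on_join(
    cached: Optional[CachedLease],
    arp_gateway: Callable[[str], Optional[str]],  # hypothetical: MAC that answered, or None
) -> str:
    """Decide what to use right after joining a network. A DHCP request for
    the cached address is assumed to be sent in parallel either way."""
    if cached is not None and arp_gateway(cached.gateway_ip) == cached.gateway_mac:
        return f"reuse {cached.ip} now, renew in background"
    return "wait for full DHCP exchange"
```

The speed-up is real when the guess is right; when the server has meanwhile handed that address to someone else, the client briefly uses an address it does not hold, which is one way such a shortcut can confuse a fragile router.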

------
gpm
These aren't even that bad.

I've had a router start sending me two DHCP offers with different IPs, the
first offer containing an IP outside the range it was supposed to allocate.
After my computer tried to accept the first offer, the router started ignoring
me.

This behavior survived rebooting the router (and everything else involved),
but flushing the DHCP table fixed it for some reason.

~~~
voltagex_
I have a TP-Link Archer D9 on the "latest" firmware that will either:

1) stop giving out leases

2) suffer catastrophic failure in miniupnpd and stop allowing port forwarding

This is a $200AUD device and it's utter trash.

------
gefh
I had exactly the same problem with OSX - all other devices were fine but my
MacBook would drop packets after 30 seconds or so. Only fix seemed to be a new
router. No problems since.

~~~
mjn
Interesting! I looked quite a bit to see if there was anything OSX-specific
that anyone else online had documented, but nothing I turned up seemed to pan
out. I am reasonably certain that the ultimate culprit here is bad low-end
wifi devices that either suffer from bad hardware (too little RAM, etc.) or
bad software (some poorly tested customized version of embedded Linux), or
both. But I'm really curious why only OSX seems to cause the problem to
manifest in several of these cases.

~~~
hug
Your article doesn't mention it: Have you tried using a statically assigned IP
address? (Poor etiquette inside the DHCP range, but maybe if you just snarf
the IP you've already leased..)

I ask this because OS X's DHCP stack is known to exhibit some strange and
unfriendly behaviours:
[http://cafbit.com/entry/rapid_dhcp_or_how_do](http://cafbit.com/entry/rapid_dhcp_or_how_do)

~~~
mjn
I did try that (though you're right that it's not in the article), setting the
last DHCP address I'd received as static IP, but it doesn't seem to change
anything. Still the same dropouts every ~30 seconds.

~~~
seiferteric
Maybe try force-sending gratuitous ARPs every 10 seconds?

Also it might be interesting to look at a packet capture from both the Android
device and the Mac laptop to see what the difference is.
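For reference, a gratuitous ARP is an ordinary broadcast ARP request in which the sender and target IP addresses are both the host's own address. The sketch below only builds the 42-byte Ethernet frame; actually sending it needs raw-socket access (for example `AF_PACKET` as root on Linux), which is left out. The MAC and IP in the usage are placeholders.

```python
import socket
import struct


def gratuitous_arp(mac: str, ip: str) -> bytes:
    """Build a broadcast gratuitous ARP request announcing ip -> mac."""
    mac_b = bytes.fromhex(mac.replace(":", ""))
    ip_b = socket.inet_aton(ip)
    # Ethernet header: broadcast destination, our source MAC, EtherType 0x0806 (ARP).
    ethernet = b"\xff" * 6 + mac_b + struct.pack("!H", 0x0806)
    arp = struct.pack(
        "!HHBBH6s4s6s4s",
        1,                 # hardware type: Ethernet
        0x0800,            # protocol type: IPv4
        6, 4,              # hardware / protocol address lengths
        1,                 # opcode: request
        mac_b, ip_b,       # sender MAC / sender IP
        b"\x00" * 6, ip_b, # target MAC (unused) / target IP = our own IP
    )
    return ethernet + arp


frame = gratuitous_arp("00:11:22:33:44:55", "192.168.1.23")
```

Because nobody else should own the target address, no reply is expected; neighbours (including the router) simply refresh their ARP cache entry for that IP.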

------
gambiting
I've found that in 90% of the cases where packets are dropped randomly on a
consumer-level access point, the automatic channel selection is to blame. Go
to settings, select a static channel number, and the problem disappears
completely.

~~~
mjn
In my case it doesn't seem to be the wifi side of things, but the routing side
of things that's causing trouble. I can consistently ping the local AP itself,
but the connection to the internet has periodic dropouts.

------
darkengine
I used to frequent a coffee shop with a router that would refuse to let my
laptop's wifi card connect about half the time. Like the author, my phone
could connect but it would simply not respond to any attempts to connect from
my laptop, whether booted into Windows or OpenBSD. The only solution was to
ask the barista to reset the router, which was all the more frustrating when
they refused to do it because everyone else could connect just fine.

~~~
ballooney
That seems like the reasonable answer on the part of the barista.

------
IntelMiner
I had a similar issue to this some time ago. I upgraded our internet at home
to Comcast's 250 megabit package, in doing so I had to upgrade our modem. I
picked up a Motorola/Arris Surfboard 6183.

I ended up with an interesting issue where the modem itself would "hang". My
ping to Google would skyrocket from 5ms to 3000ms+, my download speed would
drop from ~250 megabits to 0.20, but the upload speed would remain constant
at about 30 Mbps.

Rebooting anything and everything had no effect, and throwing various kinds of
routing equipment in front of the modem (from OpenWRT to pfSense to even
OpenBSD) made no difference. Eventually I purchased a "KanKun SmartSwitch",
a cheap Chinese wall adapter that lets you power devices on and off using
a phone app. Conveniently it ran OpenWRT and was hackable, so I was able to
automate rebooting the modem whenever pings spiked into the 4-digit range
(pinging Google every 60 seconds with a simple bash script).
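A minimal sketch of that kind of watchdog, written in Python rather than bash. It assumes a `ping` that accepts `-c 1` (Linux/macOS), treats a missing reply or a four-digit RTT as a bad probe, and requires two bad probes in a row before acting. `reboot_modem` is a hypothetical hook; how the KanKun relay is actually toggled is not described here.

```python
import re
import subprocess
import time
from typing import Callable, Optional

# Matches "time=12.3 ms" (Linux/macOS) and "time<1ms" / "time=12ms" (Windows).
RTT_RE = re.compile(r"time[=<]([\d.]+)\s*ms")


def parse_rtt_ms(ping_output: str) -> Optional[float]:
    """RTT from ping output, or None if no reply line is present."""
    m = RTT_RE.search(ping_output)
    return float(m.group(1)) if m else None


def is_bad_probe(ping_output: str, threshold_ms: float = 1000.0) -> bool:
    """No reply, or an RTT in the four-digit range."""
    rtt = parse_rtt_ms(ping_output)
    return rtt is None or rtt >= threshold_ms


def watchdog(host: str, reboot_modem: Callable[[], None],
             interval_s: float = 60.0, bad_in_a_row: int = 2) -> None:
    """Ping host forever; power-cycle the modem after consecutive bad probes."""
    bad = 0
    while True:
        out = subprocess.run(["ping", "-c", "1", host],
                             capture_output=True, text=True).stdout
        bad = bad + 1 if is_bad_probe(out) else 0
        if bad >= bad_in_a_row:
            reboot_modem()  # hypothetical hook, e.g. toggling a smart plug
            bad = 0
        time.sleep(interval_s)
```

Requiring consecutive bad probes is a design choice to avoid rebooting over a single lost packet.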

I spent months talking to various Comcast departments. Technical Support, Tier
2, Tier 3, NOC, Headend, Engineering. Eventually I was told that (across the
14 states in my "division") there's only about 20,000 of the model of modem I
had, simply not enough to be able to establish a problematic pattern
(presumably most customers would just reboot the device when it crashed).

Frustrated, I noticed on the Amazon store page that there were numerous
complaints about the device with the same issue. I assumed it most likely to
be a firmware issue, as the headend engineering team could not correlate any
changes from my device on the node at the timestamps I gave them, still
archived here
[http://intelminer.com/reboot.txt](http://intelminer.com/reboot.txt)

Upon speaking to Amazon about a refund or exchange, they referred me to
Arris/Motorola citing it was under warranty. Arris/Motorola then "helpfully"
explained that they certify everything BUT the software of the device to be
functional, as such I was not covered under warranty. (But hey why not buy a
Surfboard 6190 instead? it IS newer!)

It seems like almost a racket for planned obsolescence. Release a decent
modem, hire some interns to write crap software, then encourage the customer
to "upgrade" the "faulty" hardware when they call in. After all, it's not
under warranty now is it?

~~~
hx87
> "helpfully" explained that they certify everything BUT the software of the
> device to be functional

Perhaps because your ISP is responsible for providing and flashing the modem's
firmware? That being said, I don't exactly trust Comcast to provide reliable
software for my modem.

~~~
IntelMiner
The firmware is maintained by the OEM and given to the ISP for distribution.
The ISP does not specifically write (or maintain) the firmware used for
customer-owned equipment.

------
nikropht
Check out the MikroTik routers at www.routerboard.com; these are Linux-based
and rock solid. The hAP ac lite retails for $50:
[https://routerboard.com/RB952Ui-5ac2nD](https://routerboard.com/RB952Ui-5ac2nD)

Also, these routers are like Swiss Army knives: they can do everything from
DHCP server to full BGP and MPLS. The only limitations are the CPU, RAM and
interfaces.

------
rubberroad
Same problems for me, experienced this a lot on my home wifi, using a dual
band, modem / wifi router combo. Problems ONLY occurred on my MacBook Pro, no
other devices.

Purchasing a dedicated modem and using an Apple AirPort Extreme was the only
thing that resolved the issue, which annoyed me and made me feel like Apple is
further locking me into their ecosystem of devices.

~~~
tw04
MBPs don't support certain channels which is maddeningly frustrating. Trying
to find from my notes which channels those are, but when I finally changed
from auto to hard-setting to channel 100, all my issues with MBPs on my
network went away.

~~~
SysArchitect
There are certain channels in the 5 GHz range that require the access point to
do checking to make sure they don't actively interfere with radar...

See here for a document from Cisco describing the issue:

[http://www.cisco.com/c/en/us/td/docs/routers/access/3200/sof...](http://www.cisco.com/c/en/us/td/docs/routers/access/3200/software/wireless/3200WirelessConfigGuide/RadioChannelDFS.pdf)

Interestingly enough, hard-setting the channel to 100 might violate FCC
regulations, and DFS should still be used...

~~~
tw04
There isn't radar within 100 miles of me so I'm not too concerned. Regardless,
this AP firmware won't let you hard set to channels the FCC requires DFS on.

~~~
SysArchitect
Channel 100 requires DFS... so you did set it to a channel that requires DFS.
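A small lookup makes the point concrete. It assumes the US (FCC) rules for 20 MHz channels, where U-NII-2A (channels 52-64) and U-NII-2C (channels 100-144) require DFS, while U-NII-1 (36-48) and U-NII-3 (149-165) do not. Other regulatory domains use different sets, and wider channels that overlap any DFS channel inherit the requirement.

```python
def channel_to_mhz(channel: int) -> int:
    """Centre frequency of a 20 MHz channel in the 5 GHz band."""
    return 5000 + 5 * channel


# US (FCC) 20 MHz channels requiring DFS: U-NII-2A and U-NII-2C.
US_DFS_CHANNELS = set(range(52, 65, 4)) | set(range(100, 145, 4))


def requires_dfs_us(channel: int) -> bool:
    return channel in US_DFS_CHANNELS
```

So channel 100 (5500 MHz) is squarely a DFS channel, whereas 36-48 and 149-165 avoid radar detection entirely.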

------
microcolonel
I experience basically the same level of problems on our office's Cisco Meraki
gateway/firewall. A couple months ago we had to do a support ticket to get
them to fix their spanning-tree implementation because it configured loops in
our simple dual-master crossover switch configuration, forcing us to disable
one of the masters. Then there was a spurious packet loss problem in one of
the default configuration parameters.

Now it just randomly stalls TCP connections open until the local node closes
them, but allows the remote node to continue receiving packets(!), which is
probably the worst thing a TCP system can do aside from simply not connecting.

I quite literally had to attempt this post twice before giving up and
connecting to a UDP VPN. I have watched my colleagues cancel and refresh
webpage loads for the last few months.

~~~
jonatron
I've had all sorts of problems with Meraki APs. They claim to hop around to
find the clearest channel, but they all always chose channel 44. Manual
channel selection was limited to non-DFS channels, otherwise they'd switch
back to channel 44. There was no insight into CPU usage, so we were suspicious
that some QoS settings made everything grind to a halt, but we couldn't
confirm. Switched to high end Ruckus APs and a Juniper router/firewall and
it's now working properly.

------
digi_owl
I dunno about the first, but the second looks oddly reminiscent of when I had
an N800 and the router at home didn't have the first clue about wifi power
saving.

IIRC, the way wifi power saving works is that the device signals the router
and then shuts down its radio for some hundreds of ms. This tells the router
to hold the packets for it.

Now if the router does not understand that signal, it will likely treat
the device as gone. So when the device comes back and expects to continue from
where it left off, the router gets royally confused.

Not sure why doing a DHCP request would fix it though, but then I don't have
the first clue about OS X innards.
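A toy model of the buffering digi_owl describes: a station sets the Power Management bit before dozing, and a well-behaved AP queues unicast frames until the station wakes. The "clueless" AP ignores the bit and transmits into the void. This is a simplification: it models neither beacons/TIM nor deauthentication, and real stations can also fetch buffered frames with PS-Poll while staying in power-save mode.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, List, Set


class AccessPoint:
    """Toy model of 802.11 power-save buffering for unicast frames."""

    def __init__(self, honours_power_save: bool):
        self.honours_power_save = honours_power_save
        self.dozing: Set[str] = set()
        self.buffers: Dict[str, Deque[str]] = defaultdict(deque)

    def set_power_mgmt(self, station: str, dozing: bool) -> None:
        """Station sets or clears the Power Management bit in a frame it sends."""
        if dozing:
            self.dozing.add(station)
        else:
            self.dozing.discard(station)

    def deliver(self, station: str, frame: str, radio_on: bool) -> List[str]:
        """Frames the station actually receives right now."""
        if station in self.dozing and self.honours_power_save:
            self.buffers[station].append(frame)  # hold until the station wakes
            return []
        return [frame] if radio_on else []       # lost if the station's radio is off

    def wake(self, station: str) -> List[str]:
        """Station wakes up; a well-behaved AP flushes what it buffered."""
        self.set_power_mgmt(station, False)
        return list(self.buffers.pop(station, []))
```

With the clueless AP, whatever arrives during the doze is simply gone when the station wakes, which would look to the client like a stall until higher layers retry.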

------
anotheryou
My router stops working about once a day, but only if Apple products are
connected (not sure yet if iPhone or macbook (they often come in pairs), but
it's consistent across 4 flatmates).

