AT&T Fiber in the SF Bay Area is flipping bits (twitter.com/catfish_man)
647 points by km3r 11 months ago | 361 comments

If you have AT&T fiber, run the script in the linked gist:


It loads http://example.com and https://example.com in a loop and compares the results (they should be identical), then reports any difference it finds. I'm seeing multiple bit flips in the unencrypted version, and having a lot of issues loading web pages, presumably because a corrupted packet in a TLS handshake is an error and the connection dies.
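The gist itself isn't reproduced here, but the core of the check can be sketched in Python. The function name and structure are my own, not the actual script's; the real thing fetches both URLs and diffs the bodies, and this just shows the bit-flip reporting part:

```python
def bit_diffs(a: bytes, b: bytes):
    """Return (byte_offset, bit_index) for every flipped bit between two
    equal-length payloads, e.g. the http:// and https:// response bodies."""
    assert len(a) == len(b), "bodies differ in length, not just content"
    flips = []
    for off, (x, y) in enumerate(zip(a, b)):
        xor = x ^ y
        for bit in range(8):
            if xor & (1 << bit):
                flips.append((off, bit))
    return flips

# A single-bit corruption shows up as exactly one (offset, bit) pair:
# 'p' is 0x70 and 'r' is 0x72, so byte 4 differs in bit 1.
print(bit_diffs(b"example", b"exam\x72le"))  # -> [(4, 1)]
```

An ISP-level fault like the one described would show up as occasional nonempty results on the plaintext fetch only, since TLS rejects the corrupted records outright.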

Tech support, even when Twitter accounts with a lot of followers message them, is completely useless; they just say they don't show any outages at your location. They need to be flooded with complaints before someone will look at this, or maybe someone from AT&T is on here that can get it looked at...

All of these large companies seem to have (correctly) realized that 95% of tech support cases are trivial issues that can be resolved via automated responses.

The problem is that they then assume that all cases are one of those 95% in order to solve the 95% as quickly as possible, which probably looks good to whatever metrics they're tracking.

But if you're one of the 5% you're fucked.

If there's anyone out there designing tech support procedures, you should add an "is this a 5% problem?" question to whatever checklist you give to support staff.

> If there's anyone out there designing tech support procedures, you should add an "is this a 5% problem?" question to whatever checklist you give to support staff.

When I was the engineer customer service escalated to, I was damn sure to thank them every time they escalated something. Even the one guy who escalated all the things I'd roll my eyes about in private. At least he was making sure the escalation path worked.

Someone who has taken the time to report an issue is probably one of hundreds or thousands who had an issue and didn't think it could be fixed and shrugged it off. We certainly can't fix everything, but weird network shit like this can be fixed, and it's worth escalating, because when you get it fixed, you can also figure out (hopefully) how to monitor for it, so it doesn't happen again.

OTOH, I didn't work for the phone company. We don't care, we don't have to, we're the phone company. https://vimeo.com/355556831 (sorry about the quality, I guess internet video was pretty lowdef in the 70s :P)

Ex-phone-company here. (Is this the party to whom I am speaking?) I was in installation, but hung out with a lot of the ops crew, and they LOVED interesting problems. The trouble was getting such problems to the ops people in the first place. Good people, bad process.

The most memorable one:

Customer service had been getting calls all morning with a peculiar complaint: A customer's phone would ring, and when they answered, the party on the other end didn't seem to hear them. They seemed to be talking to _someone_, but not the party they were connected to. Eventually they hung up. Sometimes, a customer would place a call, and be on the other end of the same situation -- whoever answered would say hello, but the two parties didn't seem to be talking to each other. Off into the void. They'd try again, and it would work, usually, but repeats weren't uncommon.

So everyone's looking at system logs and status alarms and stuff, and what else changed? There were two new racks of echo-cancellers placed in service last night, could that cause this? Not by any obvious means, I mean e-cans are symmetrical and they were all tested ahead of time. There was a fiber cut out by the railroad but everything switched over to the protect side of the ring OK, didn't it? Let's check on that. Everyone's checking into whatever hunch they can synthesize, and turning up bupkus.

Finally around lunchtime, one of the techs bursts into the ops center, going "TIM! I GOT ONE I GOT ONE IT'S HAPPENING TO ME, PATH ME! okay look I don't know if you can hear me, but please don't hang up, I work for the phone company and we've got a problem with the network and I need you to stick on the line for a few minutes while we diagnose this. I know I'm not who you expected to be talking to, and if you're saying anything right now, someone else might be hearing it, but that's why this is so weird and why it's so important YEAH IT CAME INTO MY PERSONAL LINE and that's why it's so important that you don't hang up okay? I really appreciate it, just hang out for a few, we'll get this figured out..."

Office chairs whiz up to terminals and in moments, they've looked up his DN and resolved it to a call path display, including all the ephemera that would be forgotten when the call disconnects. Sure enough, it's going over one of the new e-cans. Okay, that's a smoking gun!

So they place the whole set of new equipment, two whole racks of 672 channels each, out-of-service. What happens when you do that is the calls-in-process remain up, but new calls aren't established across the OOS element. Then you watch as those standing calls run their course and disconnect, and finally when the count is zero, you can work on it. (If you're doing work during the overnight maintenance window, you're allowed to forcibly terminate calls that don't wrap up after a few minutes, but that's verboten for daytime work. A single long ragchew is the bane of many a network tech!) The second rack was empty of calls in _seconds_, and everyone quickly pieced together what that implied -- every single call that had been thus routed was one of these problem calls where people hang up very quickly. This thing had been frustrating hundreds of callers a minute, all morning.

With the focus thus narrowed, the investigation proceeded furiously. Finally someone pulls up the individual crossconnects in the DACS (a sort of automated patch panel, not entirely unlike VLANs) where the switch itself is connected to the echo-cancellation equipment. And there it is. (It's been too long since I spoke TL1 so I won't attempt to fake a message here, but it goes something like this:) Circuit 1-1 transmit is connected to circuit 29-1 receive, 29-1 transmit isn't connected to anything at all. 1-2 transmit to 29-2 receive, 29-2 transmit to 1-1 receive. Alright, we've got our lopsided connection, and we can fix it, but how did it happen in the first place?

If all those lines had been hand-entered, the tech would've used 2-way crossconnects, which by their nature are symmetrical. A 2-way is logically equivalent to a pair of 1-ways though, and apparently this was built by a script which found it easier to think in 1-ways. Furthermore, for a reason I don't remember the specifics of, it was using some sort of automatic "first available" numbering. There'd been a hiccup early on in the process, where one of the entries failed, but the script didn't trap it and proceeded merrily along. From that point on, the "next available" was off by one, in one direction.
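A minimal simulation of that failure mode (the names and circuit counts here are invented; the real system spoke TL1 to a DACS): build one-way return crossconnects with naive "first available" numbering, let one entry fail without being trapped, and every later return path lands one circuit off:

```python
def build_return_paths(n, fail_at=1):
    """One-way return crossconnects: ecan i tx -> switch 'next available' rx.
    A single untrapped failure leaves that rx slot unconsumed, so every
    later connection is shifted by one -- the lopsided wiring from the story."""
    paths = {}
    next_avail = 1                    # naive first-available rx numbering
    for i in range(1, n + 1):
        if i == fail_at:
            continue                  # entry failed; script proceeds merrily along
        paths[i] = next_avail         # ecan i tx -> switch next_avail rx
        next_avail += 1
    return paths

paths = build_return_paths(4)
# ecan 1 tx connected to nothing; ecan 2 tx -> switch 1 rx, and so on:
print(paths)  # -> {2: 1, 3: 2, 4: 3}
```

With `fail_at=None` (no failure) the mapping comes out symmetric, which is what hand-entered 2-way crossconnects would have guaranteed by construction.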

Rebuilding it was super simple, but this time they did it all by hand, and double-checked it. Then force-routed a few test calls over it, just to be sure. And in a very rare move, placed it back into service during the day. Because, you see, without those racks of hastily-installed hardware, the network was bumping up against capacity limits, and customers were getting "all circuits busy" instead. (Apparently minutes had just gotten cheaper or something, and customers quickly took advantage of it!)

Amazing story! My step dad worked night shift at AT&T back in the 80’s and ran the 5ESS. He took my brother and me in for a tour one night. Thinking back on it now, it was a lean crew for the equipment they were running. Rows and rows and rows of equipment. I don’t remember closed cabinets, mostly open frames moderately populated. I’ll never forget he showed us some magnetic core memory that was still mounted up on a frame in the switch room. Huuuge battery backup floor as well.

He loved all of that stuff and absolutely hated when everything went to computers. He quit and became a maintenance man at a nursing home, then a commercial laundry repair guy, and finally retired this year in his late 70’s (due to Covid) after working maintenance at a local jail.

That's super cool!

I believe the #5 ESS machine itself is always in closed cabinets, so it's likely that what you're remembering was the toll/transport equipment, or ancillary frames. Gray 23-inch racks as far as the eye can see!

Depending on how old that part of the office was, they were likely either 14' or 11'6" tall with rolling ladders in the aisles, or 7' tall and the only place they'd have laddertrack was in front of the main distributing frame.

As for magnetic core, if you could see it mounted in a frame, what you probably saw was a remreed switching grid, which is a sort of magnetic core with reed-relay contacts at each bit, so writing a bit pattern into it establishes a connection path through a crosspoint matrix. It's not used as storage but as a switching peripheral that requires no power to hold up its connections. (Contrast with crossbar, which relaxes as soon as the solenoids de-energize.)

Remreed was used in the #1 ESS (and the #1A, I believe), and is extensively documented in BSTJ volume 55: https://archive.org/details/bstj-archives?&and[]=year%3A%221...

You’re definitely on to something. This image from Wikipedia for the #1 ESS fits very well into my fuzzy memory, especially those protruding card chassis:


I just remember thinking it looked awkward getting to the equipment under them.

I don’t know if the ‘5E’ as he called it was actually in operation yet, he ended up moving us all out of state to take a job developing and delivering training material for it...I think that’s what finally broke him lol. Hands on kinda dude.

I’ll have to hit him up later today to see if he remembers ‘remreed’ (he will). Thanks for the info!

Yup, the #1 used computerized control, but all the switching was still electromechanical, so it sounded like a typewriter factory, especially during busy-hour.

At night, traffic was often low enough that you could hear individual call setups and teardowns, each a cascade of relay actuations rippling from one part of the floor to another. The junctor relays in particular were oddly hefty and made a solid clack, twice per call setup if I recall correctly, once to prove the path by swinging it over to a test point of some sort, and then again to actually connect it through. On rare occasion, you'd hear a triple-clack as the first path tested bad, an alternate was set up and tested good, and then connected through.

Moments after such a triple-clack, one of the teleprinters would spring to life, spitting out a trouble ticket indicating the failed circuit.

The #5, on the other hand, was completely electronic, time-division switching in the core. The only clicks were the individual line relays responsible for ringing and talk battery, and these were almost silent in comparison. You couldn't learn anything about the health of the machine by just standing in the middle of it and listening, and anyone in possession of a relay contact burnishing tool will tell you in no uncertain terms, that the #5 has no soul.

YES! He worked third and we were there all night. He pointed those sounds out to us, it was so cool.

There's a telco museum in Seattle called the Connections Museum; it has working panel, #1 crossbar, and #5 crossbar switches, and a #3ESS they are working on getting running again.


Great story.

> including all the ephemera that would be forgotten when the call disconnects

Interesting to know there is information which is not logged. I’m guessing keeping this info, even for a day, would have helped isolate the issue?

How did the echo cancellers pass testing?

They passed testing because they had each been individually crossconnected to a test trunk, and test calls force-routed over that trunk. Then to place them in service, the crossconnects were reconfigured to place them at their normal location in the system. The testing was to prove the voice path of each DSP card, and that those cards were wired into the crossconnect properly.

All that was true, the failure happened when it was being taken out of testing config and into operational config. Either nobody considered that that portion could fail, or the urgency to add capacity to a suddenly-overloaded network meant that some corners were cut. (Marketing moves faster than purchasing-engineering-installation...)

Oh, and as to the point about keeping the call path ephemera. Yeah probably, but in a server context, that'd be akin to logging the loader and MMU details of how every process executable is loaded and mapped. Sure, it might help you narrow down a failing SDRAM chip, but the other 99.99999% of the time when that's not the problem, it's just an extra deluge of data.

Were the cross-connected circuits channelized or individual voice calls (ds0? I can’t remember from my wan days) or something else?

As I recall, the cross-connects were done at the DS1 level, and an individual card handled 24 calls. These are hazy, hazy memories now; this took place around 2004.
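That lines up with the standard T-carrier arithmetic: a DS1 multiplexes 24 DS0 voice channels of 64 kbps each, plus 8 kbps of framing overhead:

```python
DS0_RATE = 64_000            # one voice channel, bits/sec
CHANNELS = 24                # DS0s per DS1
FRAMING = 8_000              # framing overhead, bits/sec

ds1_rate = CHANNELS * DS0_RATE + FRAMING
print(ds1_rate)  # -> 1544000, the familiar 1.544 Mbps T1 rate
```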

Nice!! Thanks for the walk down memory lane, this was cool.

Not OP but it sounds like the echo cancellers were fine, the interconnect to the switch was misconfigured. Rather than sending both channels of audio to opposite ends of the same call, one channel got directed to the next call.

The funny thing is that if everyone played along they could have had a mean game of telephone going.

This feels like the kind of anecdote I'd overhear my paint-covered neighbor Tom telling my dad when I was 10, and my dad would be making a racket over it, really doubled over. I'd always be like, "what's so funny about that?" But you get older and you realize not many people tell _actually_ interesting stories, so I guess you do what you can to make them want to come around and tell more.

Cool story, thanks! Perhaps you can solve a mystery phone hiccup that happened to me a few years ago? I called a friend (mobile to mobile if it matters) and, from memory, about 20 minutes into this call I get disconnected, _but_ I instantly end up on a call with an elderly stranger instead, who seemed pretty irritated she was now on the phone with me. I was surprised enough that she hung up before I could form a coherent sentence to explain what had happened so I've no idea if she was trying to ring someone or if the same thing happened to her or if she'd dialed my number by accident. From what I remember it seemed like she was also already mid-conversation as well though.

Thank you for sharing. And for helping the phones just work, so we can complain so much when they don't :)

Have you heard any of Evan Doorbell's telephone tapes[1]? It's a series of recordings mostly from the 1970s, but with much more recent narration, exploring and sort-of documenting the various phone systems from the outside in. Might be interesting to see what they figured out, and what they didn't :)

[1] http://www.evan-doorbell.com/

This is a super cool story, thank you so much for taking the time to type this out!

What a fascinating read! Gotta error-check my scripts.

A very similar problem is currently happening in India with Jio. I wonder if anyone from there has seen this.

Cool story, thanks for sharing!

Anecdote from my past:

I used to play counter strike/Starcraft in my middle school years. I pretty much figured out I had consistent packet loss with a simple ping test. I was on the phone with Time Warner every other day for months. They kept sending the regular technicians, at one point ripping out all of my cable wires in the house to see if it fixed the problem. Nothing worked, I kept calling, at this point I had the direct number to level 3 support. They saw the packet loss too. Finally, after two or three months they send out a Chief Engineer. Guy says I’ll look at the ‘box’ on one of the cable poles down the block. He confirmed something was wrong at that source for the whole area. Then it finally got fixed.

Took forever dealing with level 1 support, and lots of karenesque ‘can I talk to your supervisor please’, but that’s literally what it took.

So yeah, if you want stuff like this fixed, stay professional, never ever curse, consistently ask to speak to the supervisor, keep records, and keep calling.

Small shout out to the old http://www.dslreports.com/ for being a great support community during the early days of broadband for consumer activism in terms of making sure you got legit good broadband.

I had similar issues for a long time. Dealing with my ISP’s support was really frustrating. Not once did a 2nd or 3rd line technician get in touch with me to acknowledge that they had done any kind of investigation and analysis of the intermittent issues that I kept experiencing. The ISP did send out a guy that replaced the optical transceiver in my end, but to me it just felt like a wild guess and not really something that they did because they had any specific reason to believe that the transceiver was actually faulty. It didn’t help.

I ended up just cancelling the service and signing up for one of their competitors instead.

The real problem is that it shouldn't take this much bullshit and rigmarole from the very beginning.

I had a ginormous AT&T router/modem (pace 5268ac) with a set of static ip addresses and a few times, AT&T just stopped routing traffic to it.

It had happened before and then magically fixed itself a few days later.

One time I had a week of outage where AT&T basically said the problem was on my side. They could ping the modem, and then punted. I had several truck rolls. The techs were really nice guys, but were basically cabling guys, better at finding a bad cable than debugging packet loss. The problem for me was that my ipv4 static ip addresses would not receive traffic.

I was at wit's end after a week, so I debugged the thing myself. By looking at EVERY bit of data on the router, I found mention of the blocked packets in the firewall log. I would clear all the logs, and found that even with the firewall DISABLED, the firewall log would show it blocking incoming packets I was sending using my neighbor's Comcast connection.

I called AT&T, but this time mentioning "firewall is completely off, but packets are blocked by the router and showing up in the log" was concrete enough for them to look up a (known) solution.

The fix was to disable the firewall, but to enable stealth mode. wtf?

To be clear, this was a firmware bug, and caused dozens of calls to AT&T, lots of heartache and finger pointing always in my direction.

I should also mention at the start of this fiasco, I checked the system log and noticed they pushed a firmware update to the modem at the time the problem started. Strangely after one call to the agent, that specific line disappeared out of the log file, but other log entries remained. hmmm.

Since then, they basically screw up my modem every month or two - they push new firmware and new "features" appear (like the one that sniffs and categorizes application traffic like "youtube" and "github"). It also helpfully turns wifi BACK ON when I had disabled it. I immediately go turn it back off, and then they immediately send me a big warning email that my DSL settings have been changed.

The 5268ac pace router is the worst ISP provided router I've ever had, and I've been an Xfinity/Comcast customer, and I've even had a connection in Wyoming. I detailed my experience with it in a review of a third-party router, and found numerous issues along the way [0]. My favorite is that DMZ+ mode, which is what they offer instead of a traditional DMZ mode, just has some weird MTU issue that leads git and other services to break horribly when running behind a third party router. The solution? Don't use DMZ+ mode. Instead, put the router into NAT mode, and then port forward all of the ports to one private IP address. Bonkers. This is sold as an official-looking solution on the AT&T website for a "speed issue." [1]

This is all because AT&T believes that the edge of their network is not the PON ONT/OLT, but rather, the router they issue you. If you want to be on their network, you have to use their router as some part of the chain.

My latest discovery is that in doing this, the router can actually get super hot operating at gigabit speeds for extended periods of time. When this happens, it magically starts dropping packets. Solution? Aim a fan at the router so it has "thermal management."

Total. Garbage. I'd switch to Spectrum if they had decent upload speed, but alas, they don't in my building.

[0]: https://particle17.xyz/posts/amplifi-alien-thoughts/#appendi...

[1]: https://www.att.com/support/article/u-verse-high-speed-inter...

If you have some time, you can MITM the 802.1x auth packets [1] and use a less crappy router. I run this with a VyOS router and the same 5268ac that you have, but it works with things like Ubiquiti routers too. The only catch is you need three NICs on your router, but a cheap USB 10/100 one will do for the port that connects to the 5268ac.

Another option is getting the 802.1x certificate out of a hacked router, but it's not possible as far as I know on the 5268ac. You could buy a hackable ATT router but they're not cheap. Some sellers even sell the key by itself.

Mysteriously, doing this fixed an issue I previously had where SSHing into AWS would fail.

[1] https://github.com/jaysoffian/eap_proxy

There's also one for pfsense, which is what I used before I dumped my cert out of my router


Huge bummer, but the next generation of ATT routers with onboard ONT don’t work with this bypass :(

Do you know the model numbers and/or have any other information about these new routers?

I'm currently using eap_proxy with my BGW210, and it's been a huge improvement, but I fear the day the device needs to be replaced with a newer model.

BGW320 is the new model, which I had installed about a month ago. It isn't a simple swap, as it uses a SFP module combined with the modem's internal ONT instead of a separate ONT, so I've heard it's only used in new installations. More about it: https://www.dslreports.com/forum/r32605799-BGW320-505-new-ga... (although theirs says 1550nm while mine says 1310nm)

However, it has 5Gbit Ethernet, hasn't re-enabled WiFi on automatic firmware updates, and has only screwed with my IP Passthrough configs once which was resolved with a router reboot. (that was possibly my router's fault, it seemed like it was unable to fetch a new DHCP lease)

Apparently you can extract the 802.1x key from the router and then use your own router, and someone even has a script to MITM the connection between the router and ONT.

It's absurd that AT&T requires the use of a rented gateway for U-verse. I've never had an issue with another provider refusing to support off-the-shelf hardware before, including with a DSL provider, multiple cable companies, and FiOS (Ethernet on ONT).

They like customer data

AT&T sucks, congress should be requesting their leaders in for a congressional hearing along with Zuckerberg lol

And conversely its always surprising and disarming when you call a company and actually get through to a knowledgable employee. I was so surprised to hear “thats a firmware bug we know about and there is no update yet” about my router issue that I forgot to be mad at the company for not caring my router is broken.

I had a problem with a PowerMac G3 back in the day and I somehow managed to escalate up to tier 3 which is to say an Apple HW engineer. He was brusque bordering on rude, but he immediately recognized it was a problem with an undocumented jumper setting on the motherboard and solved my problem inside of two minutes. It definitely increased my customer satisfaction.

FWIW, this has the hallmarks of an interaction within the context of an abusive interpersonal relationship.

A few years back, I called up our local newspaper to start a subscription. Called the number on the website, and a real human person answered the phone. I was so surprised that it wasn't at least an initial phone tree that I actually stumbled and had to apologize and explain myself.

That’s why you need to sign up for CSA Pre™. Get preauthorized for instant escalation on customer support calls.

All you have to do is answer a form with questions like: Do you know how to plug in a computer? Do you know where the power switches are on your devices? etc.

CSA Pre™ is valid for 5 years; you can initiate the renewal process up to 6 months before expiry.

You were joking but I really wished you weren't and this service existed!

And it’s a steal at 99.90/month with a 24 months contract!

At this point we unironically need shibboleet.


Story time!

I've had this exact experience, except that it wasn't a dream.

Back in 2010 I had a weird issue where my cable connection would sometimes completely block the connection right after the DHCP response (we had dynamic IPs back then). This would go on for a couple of hours until the IP lease expired, then my connection would come back. Luckily, I was running an OpenBSD box as my router, which allowed me to diagnose the problem. But it was also impossible to explain to the servicedesk employees.

One evening it happened again, and I called the servicedesk, totally prepared to do the 'yes I have turned it off and on again' dance. But to my surprise the employee that I got on the phone was very knowledgable and even said that it was very cool that I had an OpenBSD box as a router. He very quickly diagnosed that someone in my neighbourhood was 'hammering' the DHCP service by not releasing his lease (a common trick to keep your IP address somewhat static). This caused a double IP on the subnet, and the L2 switch to block traffic to my port.

He asked me "do you know how to spoof your MAC with an OpenBSD box?". Then I knew this guy was legit. He instructed me to replace the last 2 bytes of the MAC with AB:BA (named after the music group). They had a separate DHCP pool for MAC addresses in that range. If they ever saw an ABBA mac address on their network, they knew it was someone who had connectivity issues before.

The problem was immediately solved, and I had a rock-solid internet connection for years, with a static IP!
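The rewrite he described (on OpenBSD it would be applied with something like `ifconfig em0 lladdr <new-mac>`, interface name assumed) amounts to replacing the last two octets of the MAC. A trivial sketch of that transformation:

```python
def to_abba(mac: str) -> str:
    """Replace the last two octets of a MAC address with ab:ba,
    the ISP's 'known trouble customer' marker from the story."""
    parts = mac.lower().split(":")
    assert len(parts) == 6, "expected a six-octet MAC"
    return ":".join(parts[:4] + ["ab", "ba"])

print(to_abba("00:1c:42:9f:13:37"))  # -> 00:1c:42:9f:ab:ba
```

(The example MAC is made up; the clever part was on the ISP's side, where a separate DHCP pool matched on that suffix.)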

I ended up chatting about networking and OpenBSD a bit, before I (as humble as I could) told the guy I was a bit flabbergasted that someone as knowledgable as him was working on the servicedesk.

It turned out he was the chief of network operations at the ISP (the biggest ISP in my country). He was just manning the phone while some of his colleagues from the servicedesk were having dinner.

Sometimes miracles do happen.

Many "phone robot" systems can be overridden by mashing on the keyboard, shouting or swearing - these will get redirected to a live human ASAP.

Andrews & Arnold (UK-based ISP) actually are compliant with XKCD 806! https://www.aa.net.uk/broadband/why-choose-aaisp/

Way back in the day I worked on equipment that straddled telco circuits. T1s, E1s, DS3s, OC whatever. Companies paying big money for those circuits.

Anyway I was told on more than one occasion by different telcos that the standard operating procedure for many techs was to take the call, do nothing, and call back 20 minutes later and ask if it looked better because ... often enough it did.

When I started my job as an IT director 10 years ago, I was in the customer support room, and there was one guy who was known for solving all the hard problems. I was standing behind him when he was taking a call. He patiently listened to the client, then loudly typed random stuff on his keyboard for twenty seconds or so, making sure the client could hear the frantic typing, sighed, and then asked “Is it better now?”

It always was.

The problem is that the idiots they hire to do their "technical" support have zero skill (nor motivation to learn) to assess whether it's a 5% problem or not, and the majority of end-users aren't capable of answering that question either, nor are they incentivized to answer truthfully.

The solution could be a priority support tier where you pay upfront for an hour of a real network engineer's time (decently compensated so that he actually cares about solving the problem) and the charge is refunded only if the problem indeed ends up being on the ISP's side. This should self-regulate as anyone wasting the engineers' time for a simple problem they could resolve themselves would pay for that time.

I realize this was written from a position of frustration (which I share) at getting run around by customer support, but I'd reconsider the blanket characterization of tech support staff as "idiots": they're doing a high-throughput job following a playbook they're given with, as you identify, no incentive (it's probably less about personal motivation than the expectations set for how they perform their job) to break the rules to provide better customer service to people with 5% problems.

+1. The real idiots here are the AT&T mid-to-upper management who set up this process and who also apparently have zero monitoring for packet loss/bit flips, so that they've had an outage for weeks now. Support techs have no training nor tooling to debug this issue.

I cope with level 1 support by remembering we have a common goal: to stop talking to each other as quickly as possible. The tech just wants to close the case and I want to talk to someone else who can actually help me.

That’s a great example of finding how you’re aligned and then using it to get a mutually beneficial outcome.

Good point. The managers would probably give negative feedback to people that took more time on their calls in order to try to help the customer better.

Who decides if it's the ISP's problem? What if it's "Both"?

About 20 years ago I got escalated to high-tier Comcast support for an issue that turned out to be a little of A, a little of B: Comcast (Might have been @Home, based on the timing) required that your MAC address be whitelisted as part of their onboarding process. Early home routers had a "MAC Address Clone" feature for precisely this reason. At some point, the leased network card got returned to Comcast. Our router continued to work just fine... until about a year later, when the local office put that network card into some other customer's home. We started getting random disconnects in the middle of the day, and it took forever to diagnose, as the other customer was not particularly active with their internet use. Whose fault was it? Ours, for ARP Spoofing? Theirs, for requiring the spoofing?

How did you even manage to figure this out? I’ve never gotten anyone on the phone who could possibly help in a situation like this.

Wireshark and escalation to a competent tech. I believe they saw weird traffic from their DHCP server, and we were able to attach an ethernet hub (Not switch, a 10-Base-T Hub that repeated the signal on each port) along with a laptop that was running Ethereal (Before the name changed! How long ago that was now) and see the arp packets fighting.

> I believe they saw weird traffic from their DHCP server, ...

That makes sense.

When the cable modem issued a DHCP request, the CMTS would have been configured to insert some additional information (a "circuit-id") into the DHCP request as it relayed it to the DHCP server.

The short version is that the "competent tech" looked at the logs from the DHCP server, which would have showed that the "same" cable modem (i.e., MAC address) was physically connected to either 1) two different CMTS boxes or 2) two different interfaces of the same CMTS.

How would one cable modem be physically present in two different locations at the same time? Obviously, it wouldn't.

At that point, either 1) there are two cable modems with the same burned in address or 2) one of the two cable modems is cloning/spoofing its MAC address. Which one of those is more likely?

(If you're interested in the details, try "DHCP Option 82" as your search term.)
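To make the mechanism concrete, here's a rough sketch (my own illustration, not anything from Comcast's actual setup; the circuit-id string is invented) of what the relay-inserted option looks like on the wire:

```python
# Sketch of DHCP option 82 ("Relay Agent Information", RFC 3046).
# The relay (e.g. the CMTS) appends this to the client's DHCP request so
# the server can log which physical circuit the request arrived on.
def relay_agent_info(circuit_id: bytes) -> bytes:
    sub_option = bytes([1, len(circuit_id)]) + circuit_id  # sub-option 1: Agent Circuit ID
    return bytes([82, len(sub_option)]) + sub_option       # option 82 wrapper

# Hypothetical circuit-id, purely for illustration:
opt = relay_agent_info(b"cmts1/cable3/0")
# The "same" MAC showing up with two different circuit-ids in the DHCP
# server's logs is what gives a cloned modem away.
```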

To be fair, the vast majority of technical support calls likely require only customer service rather than technical skills. Anyone with technical chops probably wouldn't last long in that environment.

A friend of mine used to work tech support, and said that from her perspective, a lot of the effort was trying to counter-outsmart customers who thought they were too smart.

For instance: "This might just be a loose cable, can you unplug each end of it, and plug it back in?" invariably elicited an "I've already done that", or a brief pause followed by "okay, there, I did it, do you see it?". Lies, often.

But: "Alright, I want you to try turning the cable around for me, yeah swap it end for end. Sometimes you get a connector that just doesn't fit quite right, but it works the other way around and it's faster than sending a tech to replace it", would often get a startled "Oh! I hadn't thought of that, one moment..." and then the customer actually DOES unplug the thing, and what do you know, they click it in properly this time.

I worked for a call center and there was a girl there I didn't think was good at her job.

Then one day, she tells me that she tells people to unplug the power cord, cup the end in their hand, and then plug it back in.

Suddenly, I really liked her. It's a genius move that makes them think they've done something obscure, but she really just wanted them to actually check the cable.

I worked computer support once, and just as I was hired, they went from (IIRC) 6 weeks of training to only 2 weeks. Nobody came through that knowing any more than they started with when it came to fixing computers.

Luckily, I already knew how to fix them. I found the job to be a cake walk and quite liked helping people. But I had to listen to people around me fumble through it.

It was frustrating at the time, but my favorite thing that happened was that I was admonished twice for having average call times that were too low. To them, that's a warning that someone is just getting people off the line without fixing their problems.

They monitored calls and they said I'd never received a complaint, but the system would keep flagging me for low call times so I had to artificially raise them. They suggested that I have a conversation with the clients.

I didn't, and I didn't stay there much longer, but it was quite a crazy situation. But I also felt much less pressure to handle calls quickly after that, too, which was nice.

A huge percentage of companies have their help desks and customer support outsourced to TCS or Cognizant or Accenture, etc.

Everything they do is driven by metrics, and their contracts are written around KPIs like maintaining a ludicrously short average time to answer, short handle times, and open/closed ticket ratios. If they do not hit these metrics, then the outsourcing company owes service credits. The incentives do not align with making customers happy and solving their problems. Everything is geared towards deflecting users with self-help options, simple scripts for the agents, and walking the line of hitting those metrics with the fewest number of warm bodies.

It's a pretty hellish business.

This blows my mind, because why would a customer come back when they're treated like that? Oh right, because they don't have anywhere else to go.

For the last several years, I've gotten my mobile service through an MVNO named Ting. On the rare occasion that I've needed to call support, there's no IVR, just a human who typically answers on the second ring. They speak native English, and have never failed to solve my problem either immediately or with one prompt escalation.

They're so jarringly competent I wonder how they still exist, if being obnoxiously incompetent is apparently a business requirement.

> If there's anyone out there designing tech support procedures, you should add an "is this a 5% problem?" question to whatever checklist you give to support staff.

... only if you want people to actually be able to escalate. I suspect AT&T is going to lose a lot less money over this issue than hiring an extra high-quality support person would cost.

Many companies have done the math, and realized they make more money if they just let the 5% customers leave.

It's even worse; the customers have no real choice of anyone else to leave _to_.

100% you’re on point here, but there’s one problem that happens when they do add “is this a 5% problem”: eventually (and I’d say pretty quickly), the public gets wind of “if I say the right things to get marked as a 5% problem, I get an automatic escalation to someone who knows more.”, and suddenly you get a big chunk of level 1 calls in to the upper tiers.

For evidence, see: “Retention departments give you the best rates if you threaten to cancel” (which then caused companies to have to rename the retention departments and change the policies).

Obligatory XKCD reference: https://xkcd.com/806/

Much scarier when you actually see it with your own eyes:

    $ diff example-*
    <     <titde>Example Domain</title>
    >     <title>Example Domain</title>
    <         backoround-color: #f0f0f2;
    >         background-color: #f0f0f2;
    <         box-shadow: 2pp 3px 7px 2px rgba(0,0,0,0.02);
    >         box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);
    <     <p>This domain is for use in illustrative examples in documents. You may usm this
    >     <p>This domain is for use in illustrative examples in documents. You may use this

I was under the impression there were enough layers of checksums once you get to the level of UDP/TCP that these kinds of single-bit flips should be detected and dropped before you get to read them. What's going on here? Is networking gear not calculating checksums like it should?

The TCP header does have a checksum that is supposed to cover both the header and the payload.

So either the bit corruption is such that it is not detected by the checksum,

or AT&T is doing something nefarious that touches layer 4 and corrupts the data while doing so.

I don’t have numbers off hand, but from memory I would think it is extremely unlikely that TCP checksums are consistently failing to trigger retransmission. Someone must be altering packets along the way.

TCP checksums are notoriously weak:


With enough packets passing through the dodgy RAM, a noticeable number will manage to get mangled in such a way that the checksum is still correct.

Checksums are also often recomputed on transit if the packet is intercepted, e.g. to limit TTL, unflag odd/unused TCP features, that kind of ISP-ish preening. So if it was a software error or even a hardware problem in the right (wrong) spot it’s possible to get this kind of corruption without retransmits.

I think you are confusing the IP header with TCP header.

Routers don’t touch the TCP header at all

No. URG, “Christmas tree” packets, etc. can all be mangled by the ISP, or more commonly dropped.

if your router does NAT it touches the tcp header

You are right, although NAT could be classified as a firewall feature that runs on routers.

But in any case it's irrelevant in this context, as AT&T shouldn't be doing any NAT.

In that case, won't there be significant packet loss causing throughput to be very slow? I don't know if this is possible without something messing with TCP headers.

It is triggering retransmission. It just retransmits until it gets lucky with the (pretty weak) TCP checksum.

The checksum should be checked by your computer. So somehow the packet is being repackaged with the correct checksum, but for the wrong data. In other words, when your computer checks the checksum, it matches. Another possibility is that somehow only errors that result in the same checksum are being generated.

Or their computer isn't checking the checksum. As is apparently the case on mac os (as reported elsewhere in the thread).

Another quality product by Apple ;P

Except it was trivial to reproduce with the script on non-Apple devices, and people in one of the many Twitter threads surrounding this showed that on their Mac there were MANY TCP retransmits due to invalid checksums, and the bit-flipped packet did have the correct checksum.


OK, that tweet does show the checksum is OK. I didn't see a whole lot of tcpdumps, so had to go with what was reported in the thread (I tried to reproduce with a few people, but my server wasn't in the broken path, so I couldn't get a lot of real data).

That tweet in particular doesn't show any retransmits.

tcpdump/wireshark gets a little hard to read at times, especially when the packet dump is a lie: all those packets marked red for bad checksums are from the dumping machine, and the checksums are wrong because the NIC is filling them in, and the capture interface doesn't get to see what they are. Perhaps the other people in the thread who said mac os was ignoring bad checksums were also confused; or perhaps it does ignore bad checksums, it's pretty bad at networking (it can't handle a synflood in 2020 because it's got SYN-handling code from 2000).

There’s no proof of this. And what possible reason would macOS have for not checking the checksum? Although the checksum is weak it presumably catches at least some corrupt traffic. Do you really think Apple would just skip the TCP checksum and make its network performance less reliable when they have already implemented (or maintained if it came from BSD) the rest of a TCP/IP stack, which is vastly more complex, just because its developers are lazy?

There were reports in the twitter threads that macs were ignoring the bad checksum (and I thought I saw confirmations in this thread, blaming a driver). Since I don't have a mac anymore, I couldn't confirm or deny; and since I don't have any connection with the bad equipment in the path, I couldn't get a tcpdump, and I hadn't seen any on twitter.

I wouldn't have expected Apple to purposefully break the checksum, just as I don't think they purposefully have no synflood protection; they pulled the TCP/IP stack at the turn of the century and never synced it to get the many, many upgrades from upstream (although they did add on MP-TCP, so there's that). I wouldn't be surprised if TCP checksums had stopped working ages ago, possibly because of an aggressive driver, and nobody noticed. Kind of like how if you spawn a few thousand threads that just sit around sleeping, it will delay watchdog kicks and the kernel will panic. (Also from reports on here, not personal experience.)

It also seems quite easy to verify this hypothesis with scapy:

  >>> p1 = IP(dst="192.168.mac.ip")/TCP(dport=1984,sport=20001)
  >>> p2 = IP(dst="192.168.mac.ip")/TCP(dport=1984,sport=20002)
  >>> p2.show2()
  ###[ IP ]###
    version   = 4
    ihl       = 5
    tos       = 0x0
    len       = 40
    id        = 1
    flags     =
    frag      = 0
    ttl       = 64
    proto     = tcp
    chksum    = 0xf905
    src       = 192.168.linux.ip
    dst       = 192.168.mac.ip
    \options   \
  ###[ TCP ]###
       sport     = commtact_http
       dport     = bb
       seq       = 0
       ack       = 0
       dataofs   = 5
       reserved  = 0
       flags     = S
       window    = 8192
       chksum    = 0xb836
       urgptr    = 0
       options   = []

  >>> p2[TCP].chksum = 0xb836 ^ 0x8  # mangled checksum

  >>> sr1(p1, timeout=1)
  Begin emission:
  Finished sending 1 packets.
  Received 5 packets, got 1 answers, remaining 0 packets
  <IP  version=4 ihl=5 tos=0x0 len=44 id=0 flags=DF frag=0 ttl=64 proto=tcp chksum=0xb902 src=192.168.mac.ip dst=192.168.linux.ip |<TCP  sport=bb dport=microsan seq=1671494800 ack=1 dataofs=6 reserved=0 flags=SA window=65535 chksum=0x6039 urgptr=0 options=[('MSS', 1460)] |>>

  >>> sr1(p2, timeout=1)
  Begin emission:
  Finished sending 1 packets.
  Received 494 packets, got 0 answers, remaining 1 packets

My Mac silently dropped the packet with the mangled checksum.

It is extremely common for hardware to be configured to ignore checksums. ("A packet with a bad checksum would have been dropped before it got here. Our cabling is too short to drop bits.")

The same here. It is always the 0x08 bit. (p-x, d-l, g-o etc.).

I wonder if there’s some 64 line I/O somewhere with a couple traces bridged together, where:

0 and 0 are still 0 and 0,

1 and 1 are still 1 and 1,

but 0 and 1 become 1 and 1 (or 0&0)

and 1 and 0 become 1 and 1 (or 0&0).

I remember seeing this wackiness when I bridged two address lines on an EEPROM with the tiniest amount of solder.
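The cases above can be sketched as a toy truth table; I'm assuming the short behaves like a wired-OR (a wired-AND short would give 0&0 for mixed inputs instead):

```python
# Toy model of two bus traces shorted together, assuming the short acts
# as a wired-OR. This is a hypothesis sketch, not a claim about AT&T's
# actual hardware.
def bridged(a: int, b: int) -> tuple:
    v = a | b
    return (v, v)

# Equal inputs pass through unchanged; mixed inputs both get pulled to 1.
```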

RoHS/tin whiskers strikes again?

This is unlikely, given how many people are seeing this. And now the problem has gone away for me!

I get the same. Pretty amazing that the only effect from such garbled data that I noticed were some annoying hangs and bad Twitter links!

> presumably because a corrupted packet in a TLS handshake is an error and the connection dies

Not just in the handshake. TLS moves these things called TLSPlaintext records (up to about 16 kbytes each), not only in the handshake but also for all the actual data, and they'll always have integrity protection to ensure bad guys can't change anything. TLS can't know the difference between a bad guy tampering with data and your crappy Internet mangling the data in transit; in either case the TLS protocol design says to "alert" bad_record_mac, which your browser or similar software will probably treat as a failed connection, even if it happens mid-way through an HTTP transaction.

Because TLS guarantees integrity even if your fiber is a complete shit show, any TLSPlaintext records which do get from one end to the other are guaranteed to be as intended.
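As an illustration of the integrity check (this is not actual TLS code; TLS 1.3 protects each record with an AEAD tag rather than a bare HMAC, and the key and record contents here are invented), a single flipped bit fails verification the same way:

```python
# Illustration only: a per-record integrity tag catches any flipped bit.
import hmac, hashlib

key = b"illustrative-session-key"          # made up for the example
record = b"<title>Example Domain</title>"
tag = hmac.new(key, record, hashlib.sha256).digest()

tampered = bytearray(record)
tampered[4] ^= 0x08  # the same kind of single-bit flip seen in the diffs
ok = hmac.compare_digest(
    hmac.new(key, bytes(tampered), hashlib.sha256).digest(), tag
)
# ok is False: the receiver sends a bad_record_mac alert and the
# connection dies, which is exactly the symptom people are seeing.
```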

Thanks for sharing this! I have long suspected that TLS does this. It is great: for many applications, preventing bit flips in transit is arguably more important than privacy. Is there an authoritative source where this is documented?

You mean besides the RFCs (the most recent, RFC 8446, describes TLS 1.3)?


If you aren't interested in why this works, just the Introduction to the RFC explains the intent; specifically, what we care about here is that TLS delivers:

"Integrity: Data sent over the channel after establishment cannot be modified by attackers without detection."

And further notes "These properties should be true even in the face of an attacker who has complete control of the network".

Help me understand: are you saying only HTTP bits are being flipped? Because yeah, if an HTTPS bit was flipped the whole packet dies. So is this issue blowing up all sorts of traffic everywhere?

Yes, all bits are being flipped. TLS connections drop because the message can't be authenticated, HTTP or other plaintext protocols will continue on with bad data.

So yes, it is blowing up all sorts of traffic everywhere. You just don't notice when it is plaintext.

I ran the script and it is finding a difference after a while. I have not noticed anything wrong with the network recently, but that of course does not mean there has not been a problem.

Edit: Actually I take that back, I have seen one of the issues where sometimes web sites will not load at random, and a reload fixes the issue.

I have AT&T fiber in Texas and was having issues recently, probably for a few days, where DNS props would just fail (I use Google's DNS), huge pauses in page loads, with the occasional page that just doesn't load. Happened over the long weekend IIRC and was sporadic enough that I didn't look into it further. I thought it had largely cleared, but now I'm wondering about some ongoing page load pauses...

> ... where DNS props would just fail ...

What's a "DNS prop"?

Sorry, I meant probes as in the Chrome errors.

Propagation, but I’m not entirely sure it makes sense in this context.

It's not just Fiber. It's also V/DSL connections including resold VDSL through a company like Sonic.

I can confirm that on Sonic (resold AT&T) VDSL I am seeing this exact corruption.

Nice find! I had been wondering why I had been seeing odd TLS failure messages recently.

Sonic might be the best avenue to get this fixed. They care about their customers and presumably can talk with AT&T in a much more meaningful way.

Why is it always the same positions that differ?

The files differ after anywhere from 1sec to 5sec for me, and it's always the same character positions, and it always seems to be the same lines, and the same number of lines.

Most likely it’s the same position in packets; and there’s some bad RAM in a device along the route.

For me, TLS errors would happen every so often after we had used all our data allowance and our connection was being shaped to a slow speed. Once the speed returned to normal it was fine. I always thought it was the slow speed but maybe there were bugs in the shaping software.

In the above linked gist, how do we know it's a low-level router bit flip instead of some code/programming error in their MITM / DNS / JavaScript injection tomfoolery?

AFAIU the same issue also affects HTTPS packets, just with different symptoms. E.g. the TLS handshake will fail and stuff like that.

Why would a TCP connection allow flipped bits to make it through?

Could be crappy/buggy middle boxes. Especially if they’re inspecting packets or messing with SSL.

My ATT modem (Arris bgw-210-700) may have gotten a firmware upgrade recently as I found some settings I shut off got re-enabled. I use it for VDSL but the same model family is used for ATT fiber.

The TCP checksum is a simple one's-complement checksum. So if two bits are flipped at the same position in two different 16-bit words, they cancel each other out. If you look at all the diffs posted, they differ in an even number of lines.
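A small sketch (my own, not from the linked gist) of the RFC 1071 one's-complement checksum makes the cancellation easy to see: flip the 0x08 bit in two bytes at the same position within their 16-bit words, one flip going 0 to 1 and the other 1 to 0, and the sum is unchanged:

```python
# Sketch: RFC 1071 one's-complement sum of 16-bit words, the same
# algorithm TCP uses (over pseudo-header, header, and payload).
def internet_checksum(data: bytes) -> int:
    if len(data) % 2:
        data += b"\x00"                           # pad odd-length input
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)  # fold the carry back in
    return ~total & 0xFFFF

original = b"background-color"
mangled = bytearray(original)
mangled[4] ^= 0x08   # 'g' -> 'o': its 16-bit word gains 0x0800
mangled[6] ^= 0x08   # 'o' -> 'g': its 16-bit word loses 0x0800

# A single flip is caught, but this pair cancels out exactly:
assert internet_checksum(bytes(mangled)) == internet_checksum(original)
```

The flips observed in the diffs (p-x, d-l, g-o) are exactly this 0x08 pattern, and any pair of opposite-direction flips at the same word offset sails through the checksum.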

How do you mean? How would it know if a bit in the HTTP payload got flipped?

Each TCP packet has a checksum but it does not catch 100% of all possible bit flips

And some routing hardware has been known to ignore this, meaning it will forward the packet data along and recalculate the checksum.

Hardware that forwards packets usually forwards IP packets, it doesn't care about TCP checksums and doesn't recalculate them. TCP stack in your OS is the one that does that.

It has to be a special kind of hardware that does much deeper packet inspection (DPI) to recalculate TCP checksums, usually used for spying, throttling, censorship, injecting ads, injecting exploits, etc., but not merely routing/forwarding packets.

Switches (L2 devices) recalculate Ethernet CRCs (L2 error detection codes), and routers tend to recalculate TCP/UDP checksums (L4 error detection codes) and everything below. I've seen exactly this issue with switches that have bad RAM before, and I assume that they just have a router with bad RAM (and without ECC RAM, which looks embarrassing).

To my knowledge, from working on an actual software router, a router will only touch the TTL and recalculate the IP header checksum.

There is no reason for it to touch the TCP header.

Agreed there is no logical reason for it to touch the TCP header.

And yet, an unfortunate number of L2 switches do exactly that. :(

Doesn't NAT (specifically carrier-grade NAT in this case) often modify the port? Although I don't know if AT&T does carrier-grade NAT.


> an unfortunate number of L2 switches do exactly that

Can you name any? Just curious

Checksums. TCP is supposed to provide reliability against data corruption, resend bad packets, etc.

Unfortunately, TCP checksums are hot garbage given switch ASIC design. They are a 16-bit one's-complement sum over a packet. If you get two bit flips at the same offset % 16, you can pass the checksum.

The problem is that routers slow down the high-speed serial signals from fiber by splitting the bits over a large number of slower-speed signals internally. Often those wider busses are a multiple of 16 bits. For example, one ASIC I know of moves things around in 204-byte chunks. (Might have been 208, been a while.) Anyway, the problem is that if there is a defect in one of those parallel elements, it will always flip bits at the same offset position mod 204 bytes, which is the same position mod 16 bits. If the hardware is degraded enough, it can end up flipping two bits in the same position, and that has a fairly good chance of passing the checksum.

Ethernet has proper CRCs on packets, which are a lot less vulnerable to shenanigans like this, but unfortunately those can end up being checked on the way in, discarded, and then regenerated on the way out of a router. If anything is corrupted in the middle of the switch ASIC, nothing notices and it passes along. I once helped troubleshoot an issue in our network where a BGP packet was corrupted in this way. The flipped bits ended up causing a more specific route to be generated, and we had the world's weirdest BGP route hijack within the bounds of our own data center.

204 IIRC. Dune Petra.

Depending on TCP checksums and experiencing single-bit flips took down AWS S3 back in the day: https://status.aws.amazon.com/s3-20080720.html

Checksums are often calculated in hardware on the NIC. I have personally seen a network card send packets with corrupted data (corrupted by the network card itself) and valid TCP checksums, computed on the corrupted data.

That's true, but it doesn't apply in this case: if you assume you received one with a wrong checksum, a network card won't recompute it, of course.

Data point: I’m on Sonic fiber (resold AT&T fiber) and this script has been running without errors for 5+ minutes.

> resold AT&T fiber

Only in certain areas! Within San Francisco on overhead-cabled blocks (for example) their lines are their own.

Oh, good to know, I am indeed in SF.

Imagine if such an issue appeared in a non-techy-area, it could go unnoticed for years.

Oddly, I can't even resolve example.com (using AT&T DNS).

Sounds like they fixed the “glitch” of excessive traffic against example.com. :)

My AT&T bay area fiber DNS has been garbage since the day we got it. I've had to forcibly update every device we own to Google DNS as well as put another router in between our devices and the AT&T one since AT&T doesn't let you actually change those settings on your device.

Why wouldn't the first conclusion be that the example.com service was flipping bits? Load balancing TLS requests to a different front-end pool is very common.

Has anyone claimed to see these flipped bits on a domain other than example.com?

> Has anyone claimed to see these flipped bits on a domain other than example.com?

See e.g. https://gist.github.com/bmastenbrook/14c0e22fc02b95d4a48f82d...

This is a pretty useful piece of debugging here. It might be worthwhile to try to get EdgeCast involved, as it could be a broken thing between AT&T and EdgeCast.

Given that (it seems) only AT&T customers are complaining, and (it seems) it only affects servers on EdgeCast.

I see no errors running the same script from another host that doesn't go through at&t.

> or maybe someone from AT&T is on here that can get it looked at...

They are on Facebook and LinkedIn; you have seen them.

Trying to explain this issue to AT&T support is like trying to convince a doctor you're the only person on earth with a particular disease.

Even explaining the issue is hard. It's not an outage, my internet isn't out, it's intermittently wrong. The phone support agents aren't prepared for this, and I can't find any way to escalate or speak to a network engineer.

I feel like if I spoke to the right engineer, there'd be a ticket on this and they'd roll a truck to their facilities or the IXP within an hour. It's a major network issue to flip bits; it's costing them bandwidth with retransmits and could be breaking SLAs with their business customers.

On the phone the most they could do was roll a truck to me.

When I first moved to Seattle there was a great local ISP called CondoInternet that mainly specialized in high-density downtown buildings.

I was once having some packet loss issues and called their support line. I assume the company was really small at the time, because the guy who answered the phone was clearly a network engineer who knew the system inside and out. I read him a couple of traceroutes over the phone and we resolved the issue within minutes.

I have never experienced such perfect tech support before or after that with any other ISP.

And when Wave bought them, Wave put incompetent people in charge. I moved to a Seattle suburb and had "Wave G" (post-Wave CondoInternet).

I had throughput issues, first in my apartment. That got resolved via Reddit. Then the backbone was slow from time to time, which persisted. Sometimes I get a gigabit, other times I only get 10-20 Mbps.

What did Wave say? Let's bring a tech out. I told them "it's a backbone issue" and they didn't understand me. I don't know how I would explain how networking works to a field tech who only knows how to enable a port or run basic diagnostics.

Another time, I tripped Port Security, and told them they can re-enable my port. Wave said "we need to bring a tech out" as if that will magically solve everything.

This is made worse by my apartment's exclusivity deals: only Comcast and Wave, nothing else. Ziply Fiber (bought Frontier) and Atlas Networks were denied.

Bigger companies like Verizon FiOS in NYC had better tech support, I could still get hold of a Level 2 tech for a much smaller issue (and not even on FiOS).

Wave G makes AT&T's forced router (but not the flipping bits) seem decent in comparison. I don't want Comcast, but I'll happily take AT&T Fiber over Wave G even if it means trading my pfSense box for a crappy AT&T gateway (worst-case scenario, I could bypass or root the gateway, or pay $15 for static IPs).

I fortunately moved (for another reason) and have Google's Webpass. Webpass may not give me a full Gigabit, but it's usually 400-800 Mbps all the time and not 10-20 Mbps 95% of the time with an occasional Gigabit. I wish CondoInternet sold to Webpass instead of Wave (I don't want a merger now, but still).

Surprisingly, I may have preferred CenturyLink Fiber if available mainly for GPON over a 60GHz PtP microwave link (less oversubscription!), well unless PPPoE+6rd kills my pfSense. But Webpass works pretty darn well, and I think I'll stay unless CenturyLink suddenly gives me 2 Gbps FTTH or something, or I move.

I don't have much to add here, except to say that this really matches my experience with Wave G.

Wave G's support team also doesn't even realize that the CondoInternet network they acquired provides IPv6, and when asked about IPv6-related issues they just say "oh we don't have IPv6 yet" which is nonsense. I really miss the local CondoInternet support people. They were amazing.

If you're interested or can't find it, I'll dig up a link.. but I recently unplugged my AT&T supplied residential gateway after installing a supplicant docker image on my UniFi Dream Machine Pro. It answers the authentication challenges with a reply using a CA from a jailbroken AT&T fiber modem (sourced from a guy on eBay!). I see you're no longer on AT&T fiber, but if you would use it without their gateway, know that you can!

I made the mistake of (briefly) working at Wave. I can probably shed some light on them (full disclosure: they fired me after I was sent home for vomiting in the bathroom at work during a high call volume evening)

Wave is probably one of the most breathtakingly "if it works, use it" companies I've ever seen. I'm not sure any other ISPs truly even compare, simply in the breadth of equipment they use combined with how poorly they operate.

Wave operates in parts of Washington, Oregon and California. They've mostly grown by acquiring smaller, unprofitable or mismanaged ISPs in the areas they now own.

Wave offers TV, internet and phone services. Unlike a typical ISP like Comcast or Frontier, however, they don't just offer one or a couple of methods of service delivery.

For TV service, Wave offers

- Analog TV (mostly areas in California)

- Digital TV in most other areas

- TiVO and CableCARD services (they don't have the purchasing power to get boxes from companies like Motorola/Arris or Cisco/Pace)

For telephony service, they not only offer VoIP services but in some areas even offer regular PSTN phone lines!

For internet, Wave offers its "Wave" DOCSIS 3.0 (and in many areas, still 2.0) services. They also own what was originally CondoInternet

Finding out how CondoInternet/Wave G operates was probably one of the most horrifying things I've ever seen in my years of telco work. I'll try to explain it from the ground up since it's very much a jenga tower of terribleness

Condo Internet in its inception had a very uphill battle. They wanted to target expensive Seattle condo buildings and sell a "premium" product. However, in very Seattle fashion, they were met with very indifferent, if not outright hostile, responses to their plans. So they made do with what they could get.

Condo Internet's services comprise a hodgepodge of VDSL2, Point to Point wireless, Fiber-Optic and MoCA. Effectively what ever they could wire into the building or appropriate for use, they did. This is why some apartments can get symmetric gigabit, while others can only get 100 megabit

MoCA was initially the most horrifying one I encountered while working there. MoCA is effectively Ethernet running over coaxial cables. Except since coax is a shared medium, it's just like the Ethernet hubs of the 1990s all over again.

The main reason this was done as it was considered cheaper than installing an HFC node or CMTS. They didn't know how many customers they would get to switch over, so they played their cards extremely conservatively.

Apartments with MoCA configured would have a (managed) gigabit switch or two in the basement for link back to the Condo PoP. Whichever vendor was cheapest at time of purchase (Cisco, Juniper etc)

These switches would each be connected to an individual MoCA adapter, connected to one of the cable drops going to each individual apartment/floor/whatever. The field tech would then install an accompanying MoCA adapter in the customers home (simply calling it a "cable modem") and connect it to a Wave provided router (typically a TP-Link Archer C7)

Condo/Wave would offer typically symmetric 100 megabit on these lines, though the ability for more than a few customers on each "MoCA node" (for lack of a better term) to saturate them was much more limited

Another "fun" feature was that both the MoCA devices and switches they were connected to were run without any sort of VLAN'ing at all. If a customer accidentally plugged the MoCA link into the LAN port on their router, it would happily hand out DHCP leases to the entire building!

As I found out, the reason they don't use VLANs is that their NOC staff are almost entirely customer service reps who were "upskilled" to handle NOC tasks (gaining a fixed $0.50-an-hour bonus, hooray!). Wave's NOC handles roughly 90% of WaveG calls (I'd guess because Wave doesn't make very much money?)

One other fun anecdote:

WaveG service has a lot of users from overseas who set up VPNs for their parents to watch Netflix on. Netflix's internal algorithms for the longest time would detect this behaviour and automatically flag Wave's entire IP ranges as a "proxy or VPN provider," knocking roughly 500,000+ internet customers out of Netflix for several hours or even days. This would cause their phone support to effectively melt down, with the robotic queue time projecting roughly 5-6 hours or more.

I had a similar, local fiber ISP in a high density midwest apartment. Truly fantastic support. The network engineer came out to set me up, and we got to talking about our backgrounds (I worked in IT support while attending a nearby tech-focused university), he invites me to check out their building-wide switch closet down the hall with all the cool gear, then says "yeah we limit everyone to 100/100, but I'll flag your account for 250, just don't go overboard eh"

Scale ruins most things, unfortunately.

I had a similarly stellar experience with Init7 in Switzerland. The person who answered the phone wasn't a network engineer, but immediately passed me on to one within 10 seconds of our conversation starting.

For me it was night and day after previously having had an absolutely garbage municipality-run ISP (CityCable, for anyone considering them, stay far far away). Init7 might be a tad more expensive than most, but the service is solid.

XMission in Salt Lake City had support like that, back when DSL unbundling was a thing (2006).

They got bought by Wave, who still seem to have pretty solid support? At least over Twitter, it felt like I was talking to someone competent. I haven't had any major issues since installation, though, so maybe I just got lucky.

That sounds amazing!

The obnoxious thing to me is the hubris that must be behind this. Either they considered it and decided they would never encounter a system error like this and refused to implement an escalation route, or they never even considered it.

Or perhaps even worse than that, maybe they considered it, decided it was possible, but just don't care because of their insane borderline monopoly.

I don't understand how internet companies provide such consistently awful service.

Slightly off-topic story: I recently changed to another provider called Starry, and they force you to have a second router in front of your own router, which they claim "decodes" their stream from the modem. I don't know the real reason, but I'm pretty sure that's not it. If you plug their modem directly into a non-Starry router, the router just doesn't detect a connection.

One day, I tried to torrent something, and my internet would immediately get throttled to 0mbps. After investigating, I found out that their router had a custom OS which hid a firewall and various security settings. Amusingly, you could still access those settings if you just manually entered the page names into the address bar. Now all their stupid settings are disabled, and I just feel bad for all the folks who use their service and don't have the savvy to actually get what they're paying for.

Either that or every single CSR takes 50 calls per day, every day, that start with "There's a problem with the AT&T network!!!"

Do you find it difficult to believe many of those calls actually are issues with the AT&T network?

Wait so you have your main cable plugged into their modem, their router plugged into their modem, and then your own router plugged into their router?

Funnily enough this is how my AT&T fiber is set up as well. They force you to use their router, and you can’t directly connect your router to the ONT. The problem is that they use a device certificate + EAP. There are workarounds, but it’s a pain.

Oh, there's a workaround for that.. I've unplugged my Residential Gateway and now my UniFi dream machine pro is directly connected to the ONT.

You install a CA from a jailbroken modem into a supplicant container that runs on the UDM pro. It confirms to the network that you are using "authorised" equipment for the connection and the packets flow!

I'm curious to see what happens with the new installs, which terminate the fiber directly at the gateway using the SFP port on the new BGW320 gateways, rather than using a separate ONT like they have historically. The UDM Pro has an SFP WAN port that could ostensibly be used, but I haven't seen much yet about the feasibility of adapting the existing bypasses to ONT-less installs.

Which is then a problem when you try to explain them that yes, you are sure the issue is with their service and not your setup. But what's your reason to be going such lengths instead of just plugging UDM into their router? Unless it was done for the fun of it which is fine and understandable.

> But what's your reason to be going such lengths instead of just plugging UDM into their router?

While you can do this and things will generally work, AT&T restricts all of their residential gateways from operating in a true passthrough/bridge mode to another router. So you end up with double NAT and all the joys that entails (such as [1]). There are also a number of other issues associated with operating in their faux-passthrough mode, including:

- Issues with IPv6 prefix delegation

- Sporadic latency spikes (an issue in general that you inherit, since the gateway is still "doing" everything it normally would, given that it won't actually act as a true passthrough/bridge)

- A firmware update capped throughput at 50Mbps (later fixed in another firmware update)[2]

- Firmware updates tend to silently re-enable the built-in wifi radios

So while it'll generally work, it ends up problematic. You inherit all of the performance issues associated with just using the gateway as your all in one modem/router/firewall/AP/gateway, plus the addition of double NAT, plus the sharp edges of their poorly implemented faux-passthrough modes, plus the ever-present concern that you're one firmware update away from a non-working network despite having used their official passthrough configuration.

Hence why gateway bypasses are so popular[3][4][5][6]. Even if they're a bit involved to set up, once you get it working things just... work. With little if any upkeep (potentially a few minutes after a power outage, depending on the bypass method you implement).

[1] https://www.windowscentral.com/fix-xbox-one-double-nat

[2] https://www.dslreports.com/forum/r32172124-AT-T-Fiber-5268AC...

[3] https://github.com/MonkWho/pfatt

[4] https://github.com/bypassrg/att

[5] https://github.com/mrozentsvayg/vyos.att

[6] https://github.com/Hou-dev/simple-eap-proxy

Yes.. what he said.

But my main reason is actually the gigantic size of the residential gateway box. I mounted the ONT, UDM pro and PoE switch on a wall in a closet and the RG just took up too much space.

Thanks for such a detailed reply.

Yes exactly. Their router has a LAN with my router as the only other device, which it's bridged with, and then my router has the true home LAN.

A weird side effect of this is that I'm not using the 192.168.x.x range like usual (because that's what theirs is using), but instead the 10.0.x.x range

So are you bridged then or is it really a double nat?

this is really where my limited knowledge of networking shows. I'm not entirely sure but I want to say both or double nat. There's two networks, but my router thinks theirs is a modem and is connected via the "Internet" port, not just a normal device port

Ah ok - sounds like a double NAT then, not a bridge.

Edit: to go into more detail, their router is acting as a NAT for your public IP, giving you your first subnet, and then your router is getting a single IP on that subnet and creating a NAT where your devices all get IPs. In a bridge there would only be 1 IP space behind a single NAT. In your case with a double NAT a lot of consumer things might not work (like UPnP) and port forwarding would require you to add rules to both routers.

Thanks, yes that's exactly what it is then. In order to give external access to my devices (e.g. my NAS) I have to forward ports from their router to my router, and from my router to the device. So, definitely double NAT. Amusingly the person who installed it incorrectly called it a bridge.

Thank you for the insight and lesson!

Have you seen Parks & Rec, and remember that scene in a Home Depot where an associate walks up to Ron, asks him if he needs help with a project, and Ron responds "I know more than you"?

I've pulled a variation of that on CSRs at least once, and surprisingly, it can work. Just be cordial, preempt the typical IT support stuff they always ask, DO NOT say it's intermittent (initially, to the front-line CSR; if given a chance to expand on the issue after escalation, then add that bit), and get technical ASAP (it doesn't hurt to throw in some parallel industry jargon). Basically, build a case where even the information you're giving them is beyond a first-line CSR playbook, and they have to escalate.

"Hi there; I've been observing some erroneous TCP packet bit flipping on HTTP requests which route through one of AT&T's data centers in Oakland. I've tried restarting my computer, I'm seeing the same thing on my phone, and I actually swapped my router out for a spare one I have, but it's still an issue."

(That last sentence exhausts literally every playbook a front-line CSR has. It sounds so easy, right? There are four variables in any front-line CSR diagnostic equation: their network, your router, wifi/ethernet, and the endpoint. You just crossed off three of the four variables in one sentence.)

(Wait, a data center in Oakland? How do you know this? You can tracert a bad request and geolocate the first IP outside your network, but let's be realistic: you don't. You're fronting, demonstrating knowledge that a front-line CSR can't disprove. You may think this is misleading to whoever this gets escalated to, but it isn't; their tools are FAR more advanced than yours, and they're used to 99% of customers being incorrect idiots, so they're going to be validating and reconfirming every word you say anyway.)

Ron's Parks & Rec example above is crass. But here's the magic bit: front-line CSRs generally look for an excuse to escalate; you just need to give them enough CYA to check their job as done, and the higher-tier CSRs/network engineers will love you for actually knowing what you're talking about. It's a win-win; be cordial, be forceful, strut what you know.

I had something like this happen on an even simpler level this last week. I got a Chase credit card, but during the initial signup I called my brother to ask him if he wanted to be on the account, and the session timed out past account creation but before finalization.

I got the card eventually, but now I cannot create an online account with it. I called Chase, got transferred 5 times, and then was told I would need to go to a physical bank to verify my identity (?) to create an account. Not one of them had any clue what to do with "a broken account exists associated with this card in your database, I can guarantee it; forward me to your technical support team", but that's all above a bank rep's pay grade.

The nearest Chase bank is 1.5 hours away, by the way. Probably just going to cancel the card after cashing out the sign up bonus.

> I've tried restarting my computer, I'm seeing the same thing on my phone, and I actually swapped my router out for a spare one I have, but its still an issue.

"Ok sir, please click the start button, then the power button, and finally click the restart button to restart your computer..." (and they refuse to budge until you've swapped out your router yet again, because you didn't do all that while you were on the phone with them)

From a business standpoint, it's hard to justify paying support to be technical enough to diagnose an issue such as this. Let's be honest, even senior network engineers would have a hard time debugging and diagnosing this. AT&T doesn't want to pay support staff six figure salaries and I assume most senior network engineers don't want to be support agents (customer facing).

AT&T (though this applies to lots of companies) probably needs a "unicorn" role: a very technical person who is paid as such, but able to interface with customers on specific, highly technical issues.

Ten years ago, while using ADSL, for some reason captcha images were not loading. I opened a case with ISP support and they called me back. The support guy did not believe it. He said "this is not an analog network, this is digital, you either have it or not". I said I know what analog and digital mean, and also that I know this is a digital network since I am a computer engineer, and that I had checked everything, so this was an issue with the connection. A couple of hours later, he called me back and said the problem was caused by the modem and a driver update would fix it, and it did. Those were the good old days, though, when you could reach someone on the phone and talk through a problem. Nowadays everything is either an automated response or some random person whose whole job is to tell you that he/she cannot do anything about the situation.

Have you tried saying "shibboleet"?

My ISP actually supports this[0], though I haven’t had cause to use it yet - they are also very reliable!

[0]: https://www.aa.net.uk/broadband/why-choose-aaisp/

I think you're being unfairly downvoted by people who haven't seen this xkcd:


The real life version of “shibboleet” is your Certified Partner ID number and a serial number with a valid support contract.

When I worked for a VAR I could upload logs to Cisco and get experimental patches back. Call up HP, tell them I want an RMA, and they’d just do it. Night and day compared to what consumers get.

I talked to somebody and he told me he had talked to his "IT department" and they're working on it but who the fuck knows.

From my professional experience of programming and debugging networking equipment, this could be a switch/router buffer with bad memory (a stuck bit, maybe). The better chips have CRC/parity/ECC to cover such issues, but there are always those magical choke points where the old CRC is tossed and a new one is generated, which can leave a gaping hole. The tricky part is how often that bad memory buffer gets used...

I would use traceroute to find a common bad point for everyone. It is also possible that the point where the problem occurs is invisible to traceroute, as it could be part of a provider network (probably MPLS), but at least the common ends of the tunnel would be visible.

The fact that it is at a specific interval indicates a stuck bit in memory.

Some good previous public stories about such incidents https://www.verizondigitalmedia.com/blog/being-good-stewards... https://twitter.com/cperciva/status/1309568337408454658
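To illustrate the symptom (a hypothetical sketch, not AT&T's actual hardware): a single stuck bit in a shared packet buffer damages every payload staged through the same memory cell, at the same byte and bit position, at a fixed interval.

```python
def stuck_bit_buffer(payload: bytes, cell_stride: int = 2048,
                     stuck_offset: int = 137, stuck_mask: int = 0x08) -> bytes:
    # Hypothetical buffer whose memory cell at a fixed offset has one bit
    # stuck at 0: every packet staged through it gets damaged at the same
    # byte/bit position, matching the "specific interval" symptom.
    out = bytearray(payload)
    for i in range(stuck_offset, len(out), cell_stride):
        out[i] &= 0xFF ^ stuck_mask
    return bytes(out)

clean = bytes(range(256)) * 16                # 4 KiB of test data
dirty = stuck_bit_buffer(clean)
flips = [i for i in range(len(clean)) if clean[i] != dirty[i]]
assert flips == [137, 137 + 2048]             # same bit, fixed interval
```

The stride, offset, and mask above are made up; the point is that analog faults (bad cable/SFP) spray random errors, while a memory fault repeats at a fixed position.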

Hardware designers basically started making bad decisions on this issue around the time that VLAN tagging was introduced, as well as hardware forwarding of IP packets. When VLAN tags are inserted or removed, the CRC of a packet needs to be adjusted to reflect the inserted, removed and/or modified bytes from the VLAN header. Additionally, both the CRC and IP checksum of a packet need to be adjusted when TTL is decremented as part of IP routing.

When implementing this functionality, the naive hardware designer will strip the existing CRC from the packet, modify the contents of the packet and then reuse the handy dandy CRC calculation block to place a newly calculated CRC on the packet. Similar choices are made for the adjustment of the IP/TCP/UDP checksums. If any errors are introduced in the contents of the packet by the data path prior to the new CRC is calculated, this results in the CRC being "corrected" to include the erroneous data.

A far more understanding hardware designer will instead calculate how to adjust the CRC by the changes introduced in the packet contents. Sadly, this is far more complicated to get right, and it goes against the drive of hardware designers to reuse blocks of code wherever possible. Every hardware designer working on networking has a block of Verilog or VHDL code to calculate and append a CRC to a packet. Only the most dedicated will attempt to apply only the delta needed to the CRC or checksum.
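A toy demonstration of the difference, with Python's zlib.crc32 standing in for the Ethernet FCS (a sketch of the principle, not router firmware):

```python
import zlib

def crc32_delta(old_crc: int, old: bytes, new: bytes) -> int:
    # CRC32 is affine over GF(2): for equal-length messages,
    # crc(a ^ b) == crc(a) ^ crc(b) ^ crc(zeros). So a deliberate edit can
    # be folded into the existing CRC without recomputing from scratch.
    assert len(old) == len(new)
    diff = bytes(x ^ y for x, y in zip(old, new))
    return old_crc ^ zlib.crc32(diff) ^ zlib.crc32(bytes(len(old)))

frame = b"<title>Example Domain</title>"
crc = zlib.crc32(frame)

edited = frame.replace(b"Example", b"EXAMPLE")   # the *intended* edit
good_crc = crc32_delta(crc, frame, edited)
assert good_crc == zlib.crc32(edited)            # delta matches a full recompute

# Now a bit flips in the datapath after the edit:
wire = bytes([edited[0] ^ 0x10]) + edited[1:]
naive_crc = zlib.crc32(wire)        # naive design: strip + recompute over the buffer
assert naive_crc == zlib.crc32(wire)   # receiver's check passes: the flip is "blessed"
assert good_crc != zlib.crc32(wire)    # delta-adjusted CRC still catches the flip
```

The delta design keeps the CRC anchored to the bytes the hardware intended to send, so corruption introduced between the edit and the CRC block is still detectable; the naive design certifies whatever happens to be in the buffer.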

I'm not a hardware designer, but I routinely deal with low-level networking shenanigans, and I must admit that I never considered that it would be possible to update a CRC without recomputing it fully (unless you were just appending data, of course).

For people like me who aren't smart enough to figure it out on their own, this stackexchange answer seems to explain how it's done: https://cs.stackexchange.com/questions/92279/can-one-quickly...

This is a great explanation of "always those magical choke points where the past CRC is tossed" that parent poster is referencing. Thank you!

You can use ping to more easily hunt these types of issues, for example `ping -A -c 100 -s 1000 -p deadbeef` will show the difference if there is a flipped bit in the payload. You can generate patterns with xxd.
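If you'd rather script the comparison, a small helper (a sketch; the probe payload is illustrative) can report exactly which bits flipped in an echoed payload:

```python
def find_bit_flips(sent: bytes, received: bytes):
    # Return (byte_offset, bit_number) for every flipped bit. The same bit
    # position recurring across many probes points at bad memory in the
    # path rather than random line noise.
    flips = []
    for i, (a, b) in enumerate(zip(sent, received)):
        x = a ^ b
        while x:
            flips.append((i, (x & -x).bit_length() - 1))
            x &= x - 1   # clear the lowest set bit
    return flips

# "deadbeef" pattern echoed back with bit 3 of the second byte knocked out:
assert find_bit_flips(b"\xde\xad\xbe\xef", b"\xde\xa5\xbe\xef") == [(1, 3)]
```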

This kind of incident happened to me in a system that was supposed to have high availability. We had failovers for hardware, but it seems that a network device that was supposed to have HA (and was set up to pass the functionality to another device in case of failure) did not have ECC memory. One memory bit got stuck at 0 and the event was not detected at network level, as the data was repacked with a "clean" CRC. For some reason the packet headers were not affected by this, maybe because they were kept in a separate memory zone or because of memory alignment. So the device did not report any kind of suspicious activity, no errors in its statistics.

On the application side the effects were quite bad, as the data was mainly XML and, depending on where the bit was flipped, it could impact the data or the XML structure. The data had its own CRC/hash, so the packets were cleanly rejected by the application. Unfortunately, the XML library from the message queue engine and the ESB we were using did not like it at all when the bit flipping occurred in the XML tags (it seems fuzzing tests were not done at that point), so the message processing got stuck and we kept getting bad messages in the queues. Even worse, the queues could not be cleaned with the normal procedures because the application wanted to first display info about the messages inside - and that failed.

The network debug was non-trivial because of that header consistency - the network devices did not report any kind of packet issues, so we had to sniff the different network segments to identify the culprit. From the application point of view, we had to delete the whole message queue storage to get rid of the bad messages, and let the application handle the rest (luckily it was designed with eventual consistency and self-healing).

Wasted an opportunity to implement code that would detect and handle poison-pill messages. Those will happen in any system where queue is involved and there always needs to be an escape hatch to get rid of them. Deleting the queue is too extreme.
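For what it's worth, the minimal shape of that escape hatch looks something like this (a hypothetical sketch; real brokers such as RabbitMQ or SQS have built-in dead-letter queues for exactly this):

```python
import queue

def drain(work_q: "queue.Queue", dead_q: "queue.Queue", handle, max_attempts: int = 3):
    # Poison-pill handling: a message that keeps failing gets parked on a
    # dead-letter queue after max_attempts instead of wedging the consumer.
    attempts: dict = {}
    while True:
        try:
            msg_id, payload = work_q.get_nowait()
        except queue.Empty:
            return
        try:
            handle(payload)
        except Exception:
            attempts[msg_id] = attempts.get(msg_id, 0) + 1
            if attempts[msg_id] >= max_attempts:
                dead_q.put((msg_id, payload))   # park it for later inspection
            else:
                work_q.put((msg_id, payload))   # retry

wq, dq = queue.Queue(), queue.Queue()
wq.put((1, b"<title>ok</title>"))
wq.put((2, b"<titde>corrupted"))                # a bit-flipped XML tag

def parse(payload: bytes):
    if not payload.startswith(b"<title>"):
        raise ValueError("malformed XML")

drain(wq, dq, parse)
assert dq.get_nowait() == (2, b"<titde>corrupted")
```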

Deleting the queues is an operational decision that I made to be able to put the system back online after the network device was replaced (the important part was the uptime/SLA). From a quick analysis of the logs the percentage of bad messages was ~90% (there was a ~50% chance that the original "touched" bit was 0 so no change was done, but the messages had multiple "touched" bits at fixed intervals).

There was an escape hatch, but the conditions to hit it were a bit complex. Implementing new message filtering of this kind at 2AM while the system was down was not feasible.

> I would use traceroute to find a common bad point for everyone

How do you mean? I use traceroute from time to time but I’m not sure how it would apply in a case like this. Feel free to elaborate :)

Take a traceroute from everyone experiencing the problem and look for the common hops among them all. Then compare that list against traceroutes from people not experiencing the problem to find the differences. The final set there is a good place to start looking: switches and routers along that path could be the cause.
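Sketched in code (hop names are made up for illustration):

```python
def suspects(affected_traces, healthy_traces):
    # Hops shared by every affected traceroute, minus any hop that also
    # appears in a trace from someone NOT seeing corruption.
    shared = set.intersection(*(set(t) for t in affected_traces))
    clean = set().union(*(set(t) for t in healthy_traces))
    return shared - clean

bad_a  = ["gw.home",  "att-edge-1", "att-core-7", "example.com"]
bad_b  = ["gw2.home", "att-edge-4", "att-core-7", "example.com"]
fine_c = ["gw3.home", "att-edge-2", "att-core-3", "example.com"]

assert suspects([bad_a, bad_b], [fine_c]) == {"att-core-7"}
```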

If you ping the hops with a large icmp payload, you might be able to observe the flipped bit in the echo reply. That could help isolate which hop it is.

You get some amount of your traceroute packet back too, could have flipped bits in there.

It might be better (although harder!) to take the traceroute from example.org, instead of from the clients. Forward and reverse paths often diverge, so it's important to find the path with the error.

Some people on Twitter have started collecting IPs: https://twitter.com/alexstamos/status/1336100299841314817

They need to be capturing src/dest IPs as well as ports for AT&T to have any hope of using that data.

Edit to make the comment more useful: If anyone is curious, look up "ECMP hashing." There are probably tons of parallel paths through AT&T's network, and to narrow down to the hardware causing problems, they will need to identify which specific path was chosen. Hardware switches packets out equally viable pathways by hashing some of the attributes of the packet. Hash output % number of pathways selects which pathway at every hop.

Hardware does this because everyone wants all packets involved in the same "flow" (all packets with the same src/dest IP and port and protocol (TCP)) to deterministically go through the same set of pipes to avoid packet re-ordering. If you randomly sprayed packets, the varying buffer depths of routers (or even the speed of light and slightly different-length fibers along the way) could cause packets to swap ordering. While TCP "copes" with reordering, it doesn't like it, and older implementations slowed way down when it happened.
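A rough sketch of flow-hash path selection (real ASICs use vendor-specific hash functions and seeds; this just illustrates the determinism):

```python
import hashlib

PATHS = ["linecard-0", "linecard-1", "linecard-2", "linecard-3"]

def ecmp_path(src_ip, dst_ip, src_port, dst_port, proto="tcp", paths=PATHS):
    # Hash the flow 5-tuple so every packet of one TCP connection rides the
    # same equal-cost path (no reordering within a flow).
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    h = int.from_bytes(hashlib.sha256(key).digest()[:4], "big")
    return paths[h % len(paths)]

# Same 5-tuple -> same path, every single packet:
a = ecmp_path("10.0.0.1", "93.184.216.34", 51512, 443)
b = ecmp_path("10.0.0.1", "93.184.216.34", 51512, 443)
assert a == b
# Change only the source port and you may land on a different (possibly
# healthy) linecard, which is why reports need ports, not just IPs.
```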

I think he means if everyone on AT&T experiencing the issue ran trace route to example.com some common hops would emerge, which would be a place to start investigating.

It seems like it, but it's a widespread issue across the SF/Bay Area right now, maybe wider. I've been having it for weeks and exploring it as well. I've even gone as far as ripping the certs off of the router to double check.

To your point, my traceroutes on this problem often have NTT in them. It’s mostly Japanese websites, but also Wikipedia.

Why would a switch/router recalculate and rewrite the TCP checksum?

because they're changing the payload.


When AT&T first did their Fiber rollout in SF, one of the things I remember was they charged you $10 extra if you didn't want them to MITM all your connections to insert JavaScript pointing to their own ads.

They rolled this back when folks complained, but I wonder if the relevant infrastructure is still sitting around and mangling packets.

I would probably expect this to be some network card or cable or connector is failing though...

The tweet indicates it is at a specific bit position. That isn't symptomatic of a bad cable/sfp/etc. Analog problems like that tend to be more random. Random bit flipping in a fixed bit position is symptomatic of bad ram somewhere, or a router asic gone bad, or various software/config issues.

Yes, as of six months ago I had to look up how to opt out of MITMing for a coworker on AT&T Fiber, so that they could use the Internet again when the remote-activated MITM feature inside the modem broke.

What I always wondered about this is, unless AT&T has permission from the copyright holder of each and every Web page so modified to distribute the resulting derivative work, how is this practice not criminal copyright infringement?

For the same reason why ISPs are generally immunized from contributory liability that they would otherwise be completely buried in.

Now, if you had written JavaScript to detect and remove these ads, and they went around that, then you might be able to construct a DMCA 1201 claim and sue the ISP for circumventing what is legally considered DRM. Yes, JavaScript can be legally protected DRM. The law doesn't say it has to be good DRM, it just has to have the effect of controlling access to a copyrighted work. And the safe harbors the DMCA provides ISPs wouldn't protect them in this case.

As someone who doesn’t have ATT as an option here, clearly they can’t be that blatant about it?! How did they spin this $10 fee?

You can contribute to our attempts to find the bad router card here: https://twitter.com/alexstamos/status/1336099461622157312

Almost certainly

You might want to send an e-mail to the NANOG mailing list: https://mailman.nanog.org/mailman/listinfo/nanog

It’s not impossible that people from AT&T read messages posted there.

I confirm that I am seeing these bit flips, and that IP is in my traceroute.

You people are amazing!


My mobile.twitter.com traceroute prefers going through that path, as does en.wikipedia.org (both of which have sucked for me), while a traceroute to Google hops through a different path.

Yup, I'd been having issues with twitter, wikipedia, and sometimes duck duck go but never google.

Confirming this IP is in my traceroute to example.com as well

Oh man, I thought it was just my crappy old router causing the problems! I've just been too lazy to call them to replace it. I'm going to tweet at them right now.

Edit: Reading the tweet thread, classic interaction: AT&T says "please click this to check your connection", guy replies back and says "I think we know more about networks than you do, please get a network admin in here".

What would you know about production networking, Jeremy?

I'm in the same boat. If the dozens of IT pros who are complaining about this can't get AT&T to swap out a single router card, what hope do most folks have?

Fortunately, this malfunction occurred in the SF Bay Area, so getting AT&T to fix its network is only a matter of time; sooner or later, the right person in the Valley will be alerted.

If it happened to the rest of us somewhere else, we'd probably be out of luck...

It's happening to me too! I bought a new router a couple days ago because I couldn't figure out what else could be causing random, sporadic slowdowns.

I can't even download the page 3 times in a row w/o corruption:

   <     <titde>Example Domain</title>
   >     <title>Example Domain</title>
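The core of the check is tiny. A minimal sketch of the comparison (not the linked gist itself):

```python
import urllib.request

def first_difference(a: bytes, b: bytes):
    # Offset and byte values of the first mismatch, or None if identical
    # (length differences ignored for brevity).
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return (i, x, y)
    return None

def check_once():
    # TLS either delivers the bytes intact or kills the connection outright,
    # so the https copy serves as the reference; any divergence in the
    # plaintext copy was introduced somewhere in transit.
    plain = urllib.request.urlopen("http://example.com").read()
    secure = urllib.request.urlopen("https://example.com").read()
    return first_difference(plain, secure)

# The corrupted fetch above would be flagged at the 'd' in "titde":
assert first_difference(b"<titde>", b"<title>") == (4, ord("d"), ord("l"))
```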

It’s certainly a relief to have an apparent explanation for the weirdness I’ve been experiencing! I thought it was my home router or WiFi or something, but then I experienced the same problems at my partner’s house, so I thought maybe it’s just a problem with some websites I frequent or some common internet infrastructure. But we both have AT&T internet, so this must be it!

For me, the problem has mostly manifested as web pages failing to load or appearing to be loading forever. Generally when I refresh the page would load quickly as expected.

I've also been having this issue with ATT Fiber, but I'm out of Atlanta.

I saw the headline and immediately realized I've noticed similar oddness in the Atlanta area recently. Glad to hear I'm not alone. So far I've been blaming it on weirdness from my devices roaming between mesh router nodes, but now I'm going to run the test script overnight and see if anything turns up.

I've been having problems like that with ATT fiber too.

From Twitter:

> I’m hearing AT&T got their shit together and things are working now. Big thanks to @vikxin and @bmastenbrook for doing the heavy lifting here.


For this type of issue, I would recommend trying the nanog mailing list. You often see network admins ask for someone "with clue" at a different company when they get the runaround by tier 1/2 tech support.



This is absolutely the right way to "informally" escalate things to people that know what they're doing at big ISPs.

I'd recommend most newbies to the list show their work and post what's broken and how you know it isn't your fault. The investigative work on that Twitter thread is top notch and would do the trick in a second.

NANOG and outages@ are the two mailing lists that I've been subscribed to forever and are indispensable if you do operations.

It's ironic that for the most part Silicon Valley only has terrible ISP infrastructure.

Cellular service isn't all that great either.

NIMBYism at its finest. Cupertino did not allow cell phone towers for a very long time. The only one was an ATT tower on the top of Infinite Loop, right on the Sunnyvale border.

The people who would call their provider about bad cell coverage in their house are the same people that would go to city hall and demand that no cell towers be built in the city.

It’s amusing seeing Cupertino city council transcripts about this because the people show up claiming 5G gives them cancer and the city council desperately tries to get them to use better excuses so they can approve denying it.

Makes a change from the city council members usual practice of denying Vallco permits and claiming Apple employees are hiring prostitutes and molesting high school students. (I did not make this up.)

Are these transcripts online?

The ISPs have tried, but they get pushback from local residents whenever they try to install the necessary network boxes at intervals down the street. They had to go out of their way to hide cellular towers as streetlamps to get the 5G rollout to happen; I remember getting two or three notices for this (each from a different telco), so one street corner now has three streetlamps and two traffic lights.

I suspect if the droughts didn't make people get rid of their green grassy lawns homeowners would be more amenable to seeing green network boxes every few houses. It looks awful in the context of concrete sidewalks, though.

PROTIP: The boxes are an excuse. Building infrastructure would make more people want to move there, and claiming the boxes are ugly is something you can’t disagree with.

The parts of SV that haven't buried their power lines (not too pretty to begin with) have gotten significantly uglier since all of the new cable/fiber/DSL infrastructure has gone in. There are in many cases multiple of these things on loads of poles:


The parts of SV that have buried their power lines, probably haven't gotten a whole lot of new cable/fiber/DSL than whatever was there when they buried it.

SV is the perfect place to not bury infrastructure. There's no weather to worry about, and mostly people aren't going to shoot down the fiber to try to claim scrap metals.

I live in one of the few cities in the Bay Area where everything is buried, and it's refreshing not to have stuff hanging all over the place and blocking the view. Now, granted, AT&T fiber is slow to come here, but it's hard to know if the fact that the infrastructure is buried is the main reason. They are in some areas of town but not others.

You can bury conduits, by the way, and not cables or fiber directly. This allows you to avoid digging again just to install fiber, for example. There are ways to do it right. And having the infrastructure on poles is not a panacea either. New providers are not necessarily allowed to use the poles. The weather might not be crazy, but the poles are already overloaded and a little windstorm will disrupt electricity, coax cable, or fiber. And of course, those overloaded poles are crazy ugly.

Both options have trade-offs.

When I used to work in the telecom industry, burying conduit or cable in the ground was anywhere from 3x (bury in some dirt on the side of a county highway) to 20x (directional drilling in a heavily populated city where there's utilities all over the place) the cost per foot compared to hanging it aerially from a labor standpoint.

However, as you correctly point out, there may be restrictions on what you can hang on the poles and where, and oftentimes you'll find poles where it turns out it never should have had the number of attachments it did, but guess who gets to foot a large part of that bill if they want on?

But even then, I've seen absurd lengths gone to in the name of not digging. On Martha's Vineyard, I believe they wound up using a super-special Self-Supporting fiber that could be hung in or near the Power area of a Pole. Yes, that requires a far more trained/well paid worker than normal aerial work. Also, in that region, NESC 250C/D comes into play which makes it even more of a PITA. But it still was far cheaper than putting cable in the ground.

I wonder whether Teraspan or other Vertical Directed Conduit would be a good fit for the bay area (Saw-cut a minimal depth in the street, just lay in a special zip-up conduit for fiber or twisted pair.) If the weather doesn't tend towards large temperature shifts it works well.

Speaking of which, a couple of drawbacks worth noting for buried conduit: you have to go out and do your markings, or pay someone to do them for you when a dig request is made, and you have to be ready to handle the repair when someone inevitably forgets to call or the markings are done incorrectly.

Yes, but SV doesn't have alleyways to run the power lines through, so you have all of these ugly wires everywhere and you're not allowed to have big trees in the front yard, since they could interfere with the power lines. It's really ugly and looks like Baghdad after the war.

Here in Germany we bury everything. I don't even want to imagine how long it takes to rebuild all of that when some drunkard crashes his car into one of these poles or a lightning strike hits.

I grew up in the Florida Keys, where the water table is very close to the surface, so it's impossible to bury anything. When I was a young teenager, we would lose electricity all the time from tree branches brushing up against the power lines. They finally solved the problem by putting the power lines on concrete poles a few stories up! No more squirrels and tree branches disrupting power!

A lot less time than when an earthquake snaps an underground cable.

You'd be surprised, but it takes quite an impact to take down a telephone pole. Most crashes leave the pole standing (see examples online), and even when it does topple, the wires tend to hold until the utility company comes in with a fresh pole. The lightning-strike factor does suck if you live very, very close to the strike and nothing is surge protected. That happened to my house once and fried my Xbox and router.

I've seen it happen personally, except it was the support cable to a pole with a transformer. Cable anchor came out of the dirt, pole fell over, sparks everywhere, transformer fluid all over the street, hazmat cleanup crew, big mess.

I could be wrong, but I'd worry about earthquakes severing the lines.

This map[0] shows the major faults, but each one of those is really a lot of small minor faults that could all snap things and shift. One small fault right by where I grew up is about three blocks long; last I checked it had one earthquake to its name, a magnitude 5.0 aftershock of Loma Prieta.

[0] https://usgs.maps.arcgis.com/apps/webappviewer/index.html?id...

Here in Japan, poles are everywhere except in a few tourist spots, and recovery after a car accident, earthquake, or heavy rain is done almost immediately.

god forbid infra has some infra on it

No, it hasn't made anything uglier - you don't really notice the infrastructure unless you're specifically trying to look for something to get annoyed by.

Maybe you have gone blind to it? When I first landed in Palo Alto (coming from Europe), I couldn't believe my eyes: third-world infrastructure with wires flying everywhere! I most definitely noticed, and still notice. It's symptomatic of some of the ills that plague Silicon Valley.

Overhead power / telecom lines (and cable car power lines) have always been a thing in America. Adding fiber and additional infrastructure hasn't really made it uglier. It would be different if we had buried power lines to start with.

SV residents really do complain about everything, huh? Try living in an east coast city, our aesthetic is a mess.

Solution: Paint the boxes grey.

Other options:

1) Sprout the streetlamp from the box, add a car charging port.

2) Hire an artist to paint the box with artwork (property owner gets to pick from option set). Consider it to be city owned art installations.

Or even some cheap creeping hedges like bougainvillea.

Solution 2: Lift the box up, attach it to the head of the street lamp.

Servicing at the top of the street lamp would be so much easier, right?

Naw man, woodgrain stickers.

Palo Alto has its own municipal fiber program: https://www.cityofpaloalto.org/gov/depts/utl/business/progra...

Wonder if someone should take them up on it.

It’s only for commercial use. As much as I’d love to be able to drop a node in to peer at paix for my house ...

I know people with Palo Alto Fiber running to their house. They were hosting websites out of their garage years ago (I think archive.org for a while.), but today it's just residential, so it's very much possible.

is the cellular deadspot on 101 just north of redwood city still there?

Are you thinking about the area just around the SFO landing zone?

I thought that was less of a "won't" and more of a "can't": between the lack of taller towers and the radio interference zone for the airport, the directional antennas skip a slice there.

This will probably get fixed if 5G becomes a proper thing and there's a lot of micro-cells along the "bay area's biggest parking lot".
