AT&T Fiber in the SF Bay Area is flipping bits (twitter.com/catfish_man)
647 points by km3r 11 months ago | 361 comments

If you have AT&T fiber, run the script in the linked gist:


It loads http://example.com and https://example.com in a loop and compares the results (they should be identical), then reports any difference it finds. I'm seeing multiple bit flips in the unencrypted version, and having a lot of issues loading web pages, presumably because a corrupted packet in a TLS handshake is an error and the connection dies.
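The gist itself isn't reproduced here, but the core of the check can be sketched in Python. The function name and structure are my own, not the actual script's; the real thing fetches both URLs and diffs the bodies, and this just shows the bit-flip reporting part:

```python
def bit_diffs(a: bytes, b: bytes):
    """Return (byte_offset, bit_index) for every flipped bit between two
    equal-length payloads, e.g. the http:// and https:// response bodies."""
    assert len(a) == len(b), "bodies differ in length, not just content"
    flips = []
    for off, (x, y) in enumerate(zip(a, b)):
        xor = x ^ y
        for bit in range(8):
            if xor & (1 << bit):
                flips.append((off, bit))
    return flips

# A single-bit corruption shows up as exactly one (offset, bit) pair:
# 'p' is 0x70 and 'r' is 0x72, so byte 4 differs in bit 1.
print(bit_diffs(b"example", b"exam\x72le"))  # -> [(4, 1)]
```

An ISP-level fault like the one described would show up as occasional nonempty results on the plaintext fetch only, since TLS rejects the corrupted records outright.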

Tech support, even when Twitter accounts with a lot of followers message them, is completely useless; they just say they don't show any outages at your location. They need to be flooded with complaints before someone will look at this, or maybe someone from AT&T is on here that can get it looked at...

All of these large companies seem to have (correctly) realized that 95% of tech support cases are trivial issues that can be resolved via automated responses.

The problem is that they then assume that all cases are one of those 95% in order to solve the 95% as quickly as possible, which probably looks good to whatever metrics they're tracking.

But if you're one of the 5% you're fucked.

If there's anyone out there designing tech support procedures, you should add an "is this a 5% problem?" question to whatever checklist you give to support staff.

> If there's anyone out there designing tech support procedures, you should add an "is this a 5% problem?" question to whatever checklist you give to support staff.

When I was the engineer customer service escalated to, I was damn sure to thank them every time they escalated something. Even the one guy who escalated all the things I'd roll my eyes about in private. At least he was making sure the escalation path worked.

Someone who has taken the time to report an issue is probably one of hundreds or thousands who had an issue and didn't think it could be fixed and shrugged it off. We certainly can't fix everything, but weird network shit like this can be fixed, and it's worth escalating, because when you get it fixed, you can also figure out (hopefully) how to monitor for it, so it doesn't happen again.

OTOH, I didn't work for the phone company. We don't care, we don't have to, we're the phone company. https://vimeo.com/355556831 (sorry about the quality, I guess internet video was pretty lowdef in the 70s :P)

Ex-phone-company here. (Is this the party to whom I am speaking?) I was in installation, but hung out with a lot of the ops crew, and they LOVED interesting problems. The trouble was getting such problems to the ops people in the first place. Good people, bad process.

The most memorable one:

Customer service had been getting calls all morning with a peculiar complaint: A customer's phone would ring, and when they answered, the party on the other end didn't seem to hear them. They seemed to be talking to _someone_, but not the party they were connected to. Eventually they hung up. Sometimes, a customer would place a call, and be on the other end of the same situation -- whoever answered would say hello, but the two parties didn't seem to be talking to each other. Off into the void. They'd try again, and it would work, usually, but repeats weren't uncommon.

So everyone's looking at system logs and status alarms and stuff, and what else changed? There were two new racks of echo-cancellers placed in service last night, could that cause this? Not by any obvious means, I mean e-cans are symmetrical and they were all tested ahead of time. There was a fiber cut out by the railroad but everything switched over to the protect side of the ring OK, didn't it? Let's check on that. Everyone's checking into whatever hunch they can synthesize, and turning up bupkus.

Finally around lunchtime, one of the techs bursts into the ops center, going "TIM! I GOT ONE I GOT ONE IT'S HAPPENING TO ME, PATH ME! okay look I don't know if you can hear me, but please don't hang up, I work for the phone company and we've got a problem with the network and I need you to stick on the line for a few minutes while we diagnose this. I know I'm not who you expected to be talking to, and if you're saying anything right now, someone else might be hearing it, but that's why this is so weird and why it's so important YEAH IT CAME INTO MY PERSONAL LINE and that's why it's so important that you don't hang up okay? I really appreciate it, just hang out for a few, we'll get this figured out..."

Office chairs whiz up to terminals and in moments, they've looked up his DN and resolved it to a call path display, including all the ephemera that would be forgotten when the call disconnects. Sure enough, it's going over one of the new e-cans. Okay, that's a smoking gun!

So they place the whole set of new equipment, two whole racks of 672 channels each, out-of-service. What happens when you do that is the calls-in-process remain up, but new calls aren't established across the OOS element. Then you watch as those standing calls run their course and disconnect, and finally when the count is zero, you can work on it. (If you're doing work during the overnight maintenance window, you're allowed to forcibly terminate calls that don't wrap up after a few minutes, but that's verboten for daytime work. A single long ragchew is the bane of many a network tech!) The second rack was empty of calls in _seconds_, and everyone quickly pieced together what that implied -- every single call that had been thus routed was one of these problem calls where people hang up very quickly. This thing had been frustrating hundreds of callers a minute, all morning.

With the focus thus narrowed, the investigation proceeded furiously. Finally someone pulls up the individual crossconnects in the DACS (a sort of automated patch panel, not entirely unlike VLANs) where the switch itself is connected to the echo-cancellation equipment. And there it is. (It's been too long since I spoke TL1 so I won't attempt to fake a message here, but it goes something like this:) Circuit 1-1 transmit is connected to circuit 29-1 receive, 29-1 transmit isn't connected to anything at all. 1-2 transmit to 29-2 receive, 29-2 transmit to 1-1 receive. Alright, we've got our lopsided connection, and we can fix it, but how did it happen in the first place?

If all those lines had been hand-entered, the tech would've used 2-way crossconnects, which by their nature are symmetrical. A 2-way is logically equivalent to a pair of 1-ways though, and apparently this was built by a script which found it easier to think in 1-ways. Furthermore, for a reason I don't remember the specifics of, it was using some sort of automatic "first available" numbering. There'd been a hiccup early on in the process, where one of the entries failed, but the script didn't trap it and proceeded merrily along. From that point on, the "next available" was off by one, in one direction.
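A minimal simulation of that failure mode (the names and circuit counts here are invented; the real system spoke TL1 to a DACS): build one-way return crossconnects with naive "first available" numbering, let one entry fail without being trapped, and every later return path lands one circuit off:

```python
def build_return_paths(n, fail_at=1):
    """One-way return crossconnects: ecan i tx -> switch 'next available' rx.
    A single untrapped failure leaves that rx slot unconsumed, so every
    later connection is shifted by one -- the lopsided wiring from the story."""
    paths = {}
    next_avail = 1                    # naive first-available rx numbering
    for i in range(1, n + 1):
        if i == fail_at:
            continue                  # entry failed; script proceeds merrily along
        paths[i] = next_avail         # ecan i tx -> switch next_avail rx
        next_avail += 1
    return paths

paths = build_return_paths(4)
# ecan 1 tx connected to nothing; ecan 2 tx -> switch 1 rx, and so on:
print(paths)  # -> {2: 1, 3: 2, 4: 3}
```

With `fail_at=None` (no failure) the mapping comes out symmetric, which is what hand-entered 2-way crossconnects would have guaranteed by construction.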

Rebuilding it was super simple, but this time they did it all by hand, and double-checked it. Then force-routed a few test calls over it, just to be sure. And in a very rare move, placed it back into service during the day. Because, you see, without those racks of hastily-installed hardware, the network was bumping up against capacity limits, and customers were getting "all circuits busy" instead. (Apparently minutes had just gotten cheaper or something, and customers quickly took advantage of it!)

Amazing story! My step dad worked night shift at AT&T back in the 80’s and ran the 5ESS. He took my brother and me in for a tour one night. Thinking back on it now, it was a lean crew for the equipment they were running. Rows and rows and rows of equipment. I don’t remember closed cabinets, mostly open frames moderately populated. I’ll never forget he showed us some magnetic core memory that was still mounted up on a frame in the switch room. Huuuge battery backup floor as well.

He loved all of that stuff and absolutely hated when everything went to computers. He quit and became a maintenance man at a nursing home, then a commercial laundry repair guy, and finally retired this year in his late 70’s (due to Covid) after working maintenance at a local jail.

That's super cool!

I believe the #5 ESS machine itself is always in closed cabinets, so it's likely that what you're remembering was the toll/transport equipment, or ancillary frames. Gray 23-inch racks as far as the eye can see!

Depending on how old that part of the office was, they were likely either 14' or 11'6" tall with rolling ladders in the aisles, or 7' tall and the only place they'd have laddertrack was in front of the main distributing frame.

As for magnetic core, if you could see it mounted in a frame, what you probably saw was a remreed switching grid, which is a sort of magnetic core with reed-relay contacts at each bit, so writing a bit pattern into it establishes a connection path through a crosspoint matrix. It's not used as storage but as a switching peripheral that requires no power to hold up its connections. (Contrast with crossbar, which relaxes as soon as the solenoids de-energize.)

Remreed was used in the #1 ESS (and the #1A, I believe), and is extensively documented in BSTJ volume 55: https://archive.org/details/bstj-archives?&and[]=year%3A%221...

You’re definitely on to something. This image from Wikipedia for the #1 ESS fits very well into my fuzzy memory, especially those protruding card chassis:


I just remember thinking it looked awkward getting to the equipment under them.

I don’t know if the ‘5E’ as he called it was actually in operation yet, he ended up moving us all out of state to take a job developing and delivering training material for it...I think that’s what finally broke him lol. Hands on kinda dude.

I’ll have to hit him up later today to see if he remembers ‘remreed’ (he will). Thanks for the info!

Yup, the #1 used computerized control, but all the switching was still electromechanical, so it sounded like a typewriter factory, especially during busy-hour.

At night, traffic was often low enough that you could hear individual call setups and teardowns, each a cascade of relay actuations rippling from one part of the floor to another. The junctor relays in particular were oddly hefty and made a solid clack, twice per call setup if I recall correctly, once to prove the path by swinging it over to a test point of some sort, and then again to actually connect it through. On rare occasion, you'd hear a triple-clack as the first path tested bad, an alternate was set up and tested good, and then connected through.

Moments after such a triple-clack, one of the teleprinters would spring to life, spitting out a trouble ticket indicating the failed circuit.

The #5, on the other hand, was completely electronic, time-division switching in the core. The only clicks were the individual line relays responsible for ringing and talk battery, and these were almost silent in comparison. You couldn't learn anything about the health of the machine by just standing in the middle of it and listening, and anyone in possession of a relay contact burnishing tool will tell you in no uncertain terms, that the #5 has no soul.

YES! He worked third and we were there all night. He pointed those sounds out to us, it was so cool.

There's a telco museum in Seattle called the Connections Museum; it has working panel, #1 crossbar, and #5 crossbar switches, and a #3ESS they are working on getting running again.


Great story.

> including all the ephemera that would be forgotten when the call disconnects

Interesting to know there is information which is not logged. I’m guessing keeping this info, even for a day, would have helped isolate the issue?

How did the echo cancellers pass testing?

They passed testing because they had each been individually crossconnected to a test trunk, and test calls force-routed over that trunk. Then to place them in service, the crossconnects were reconfigured to place them at their normal location in the system. The testing was to prove the voice path of each DSP card, and that those cards were wired into the crossconnect properly.

All that was true, the failure happened when it was being taken out of testing config and into operational config. Either nobody considered that that portion could fail, or the urgency to add capacity to a suddenly-overloaded network meant that some corners were cut. (Marketing moves faster than purchasing-engineering-installation...)

Oh, and as to the point about keeping the call path ephemera. Yeah probably, but in a server context, that'd be akin to logging the loader and MMU details of how every process executable is loaded and mapped. Sure, it might help you narrow down a failing SDRAM chip, but the other 99.99999% of the time when that's not the problem, it's just an extra deluge of data.

Were the cross-connected circuits channelized or individual voice calls (ds0? I can’t remember from my wan days) or something else?

As I recall, the cross-connects were done at the DS1 level, and an individual card handled 24 calls. These are hazy, hazy memories now; this took place around 2004.
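That lines up with the standard T-carrier arithmetic: a DS1 multiplexes 24 DS0 voice channels of 64 kbps each, plus 8 kbps of framing overhead:

```python
DS0_RATE = 64_000            # one voice channel, bits/sec
CHANNELS = 24                # DS0s per DS1
FRAMING = 8_000              # framing overhead, bits/sec

ds1_rate = CHANNELS * DS0_RATE + FRAMING
print(ds1_rate)  # -> 1544000, the familiar 1.544 Mbps T1 rate
```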

Nice!! Thanks for the walk down memory lane, this was cool.

Not OP but it sounds like the echo cancellers were fine, the interconnect to the switch was misconfigured. Rather than sending both channels of audio to opposite ends of the same call, one channel got directed to the next call.

The funny thing is that if everyone played along they could have had a mean game of telephone going.

This feels like the kind of anecdote I'd overhear my paint-covered neighbor Tom telling my dad when I was 10, and my dad would be making a racket over it, really doubled over. I'd always be like, "what's so funny about that?" But you get older and you realize not many people tell _actually_ interesting stories, so I guess you do what you can to make them want to come around and tell more.

Cool story, thanks! Perhaps you can solve a mystery phone hiccup that happened to me a few years ago? I called a friend (mobile to mobile if it matters) and, from memory, about 20 minutes into this call I get disconnected, _but_ I instantly end up on a call with an elderly stranger instead, who seemed pretty irritated she was now on the phone with me. I was surprised enough that she hung up before I could form a coherent sentence to explain what had happened so I've no idea if she was trying to ring someone or if the same thing happened to her or if she'd dialed my number by accident. From what I remember it seemed like she was also already mid-conversation as well though.

Thank you for sharing. And for helping the phones just work, so we can complain so much when they don't :)

Have you heard any of Evan Doorbell's telephone tapes[1]? It's a series of recordings mostly from the 1970s, but with much more recent narration, exploring and sort-of documenting the various phone systems from the outside in. Might be interesting to see what they figured out, and what they didn't :)

[1] http://www.evan-doorbell.com/

This is a super cool story, thank you so much for taking the time to type this out!

What a fascinating read! Gotta error-check my scripts.

A very similar problem is currently happening in India with Jio. I wonder if anyone from there has seen this.

Cool story, thanks for sharing!

Anecdote from my past:

I used to play counter strike/Starcraft in my middle school years. I pretty much figured out I had consistent packet loss with a simple ping test. I was on the phone with Time Warner every other day for months. They kept sending the regular technicians, at one point ripping out all of my cable wires in the house to see if it fixed the problem. Nothing worked, I kept calling, at this point I had the direct number to level 3 support. They saw the packet loss too. Finally, after two or three months they send out a Chief Engineer. Guy says I’ll look at the ‘box’ on one of the cable poles down the block. He confirmed something was wrong at that source for the whole area. Then it finally got fixed.

Took forever dealing with level 1 support, and lots of karenesque ‘can I talk to your supervisor please’, but that’s literally what it took.

So yeah, if you want stuff like this fixed, stay professional, never ever curse, consistently ask to speak to the supervisor, keep records, and keep calling.

Small shout out to the old http://www.dslreports.com/ for being a great support community during the early days of broadband for consumer activism in terms of making sure you got legit good broadband.

I had similar issues for a long time. Dealing with my ISP’s support was really frustrating. Not once did a 2nd or 3rd line technician get in touch with me to acknowledge that they had done any kind of investigation and analysis of the intermittent issues that I kept experiencing. The ISP did send out a guy that replaced the optical transceiver in my end, but to me it just felt like a wild guess and not really something that they did because they had any specific reason to believe that the transceiver was actually faulty. It didn’t help.

I ended up just cancelling the service and signing up for one of their competitors instead.

The real problem is that it shouldn't take this much bullshit and rigmarole from the very beginning.

I had a ginormous AT&T router/modem (pace 5268ac) with a set of static ip addresses and a few times, AT&T just stopped routing traffic to it.

It had happened before and then magically fixed itself a few days later.

One time I had a week of outage where AT&T basically said the problem was on my side. They could ping the modem, and then punted. I had several truck rolls. The techs were really nice guys, but were basically cabling guys, better at finding a bad cable than debugging packet loss. The problem for me was that my ipv4 static ip addresses would not receive traffic.

I was at wit's end after a week, so I debugged the thing myself. By looking at EVERY bit of data on the router, I found mention of the blocked packets in the firewall log. I would clear all the logs, and found that even with the firewall DISABLED, the firewall log would show it blocking incoming packets I was sending using my neighbor's Comcast connection.

I called AT&T, but this time mentioning "firewall is completely off, but packets are blocked by the router and showing up in the log" was concrete enough for them to look up a (known) solution.

The fix was to disable the firewall, but to enable stealth mode. wtf?

To be clear, this was a firmware bug, and caused dozens of calls to AT&T, lots of heartache and finger pointing always in my direction.

I should also mention at the start of this fiasco, I checked the system log and noticed they pushed a firmware update to the modem at the time the problem started. Strangely after one call to the agent, that specific line disappeared out of the log file, but other log entries remained. hmmm.

Since then, they basically screw up my modem every month or two - they push new firmware and new "features" appear (like the one that sniffs and categorizes application traffic like "youtube" and "github"). It also helpfully turns wifi BACK ON when I had disabled it. I immediately go turn it back off, and then they immediately send me a big warning email that my DSL settings have been changed.

The 5268ac pace router is the worst ISP provided router I've ever had, and I've been an Xfinity/Comcast customer, and I've even had a connection in Wyoming. I detailed my experience with it in a review of a third-party router, and found numerous issues along the way [0]. My favorite is that DMZ+ mode, which is what they offer instead of a traditional DMZ mode, just has some weird MTU issue that leads git and other services to break horribly when running behind a third party router. The solution? Don't use DMZ+ mode. Instead, put the router into NAT mode, and then port forward all of the ports to one private IP address. Bonkers. This is sold as an official-looking solution on the AT&T website for a "speed issue." [1]

This is all because AT&T believes that the edge of their network is not the PON ONT/OLT, but rather, the router they issue you. If you want to be on their network, you have to use their router as some part of the chain.

My latest discovery is that in doing this, the router can actually get super hot operating at gigabit speeds for extended periods of time. When this happens, it magically starts dropping packets. Solution? Aim a fan at the router so it has "thermal management."

Total. Garbage. I'd switch to Spectrum if they had decent upload speed, but alas, they don't in my building.

[0]: https://particle17.xyz/posts/amplifi-alien-thoughts/#appendi...

[1]: https://www.att.com/support/article/u-verse-high-speed-inter...

If you have some time, you can MITM the 802.1x auth packets [1] and use a less crappy router. I run this with a VyOS router and the same 5268ac that you have, but it works with things like Ubiquiti routers too. The only catch is you need three NICs on your router, but a cheap USB 10/100 one will do for the port that connects to the 5268ac.

Another option is getting the 802.1x certificate out of a hacked router, but it's not possible as far as I know on the 5268ac. You could buy a hackable ATT router but they're not cheap. Some sellers even sell the key by itself.

Mysteriously, doing this fixed an issue I previously had where SSHing into AWS would fail.

[1] https://github.com/jaysoffian/eap_proxy

There's also one for pfsense, which is what I used before I dumped my cert out of my router


Huge bummer, but the next generation of ATT routers with onboard ONT don’t work with this bypass :(

Do you know the model numbers and/or have any other information about these new routers?

I'm currently using eap_proxy with my BGW210, and it's been a huge improvement, but I fear the day the device needs to be replaced with a newer model.

BGW320 is the new model, which I had installed about a month ago. It isn't a simple swap, as it uses a SFP module combined with the modem's internal ONT instead of a separate ONT, so I've heard it's only used in new installations. More about it: https://www.dslreports.com/forum/r32605799-BGW320-505-new-ga... (although theirs says 1550nm while mine says 1310nm)

However, it has 5Gbit Ethernet, hasn't re-enabled WiFi on automatic firmware updates, and has only screwed with my IP Passthrough configs once which was resolved with a router reboot. (that was possibly my router's fault, it seemed like it was unable to fetch a new DHCP lease)

Apparently you can extract the 802.1x key from the router and then use your own router, and someone even has a script to MITM the connection between the router and ONT.

It's absurd that AT&T requires the use of a rented gateway for U-verse. I've never had an issue with another provider refusing to support off-the-shelf hardware before, including with a DSL provider, multiple cable companies, and FiOS (Ethernet on ONT).

They like customer data

AT&T sucks, congress should be requesting their leaders in for a congressional hearing along with Zuckerberg lol

And conversely its always surprising and disarming when you call a company and actually get through to a knowledgable employee. I was so surprised to hear “thats a firmware bug we know about and there is no update yet” about my router issue that I forgot to be mad at the company for not caring my router is broken.

I had a problem with a PowerMac G3 back in the day and I somehow managed to escalate up to tier 3 which is to say an Apple HW engineer. He was brusque bordering on rude, but he immediately recognized it was a problem with an undocumented jumper setting on the motherboard and solved my problem inside of two minutes. It definitely increased my customer satisfaction.

FWIW, this has the hallmarks of an interaction within the context of an abusive interpersonal relationship.

A few years back, I called up our local newspaper to start a subscription. Called the number on the website, and a real human person answered the phone. I was so surprised that it wasn't at least an initial phone tree that I actually stumbled and had to apologize and explain myself.

That’s why you need to sign up for CSA Pre™. Get preauthorized for instant escalation on customer support calls.

All you have to do is answer a form with questions like: Do you know how to plug in a computer? Do you know where the power switches are on your devices? etc.

CSA Pre™ is valid for 5 years; you can initiate the renewal process up to 6 months before expiry.

You were joking but I really wished you weren't and this service existed!

And it’s a steal at 99.90/month with a 24 months contract!

At this point we unironically need shibboleet.


Story time!

I've had this exact experience, except that it wasn't a dream.

Back in 2010 I had a weird issue where my cable connection would sometimes completely block the connection right after the DHCP response (we had dynamic IPs back then). This would go on for a couple of hours until the IP lease expired, then my connection would come back. Luckily, I was running an OpenBSD box as my router, which allowed me to diagnose the problem. But it was also impossible to explain to the servicedesk employees.

One evening it happened again, and I called the servicedesk, totally prepared to do the 'yes I have turned it off and on again' dance. But to my surprise the employee that I got on the phone was very knowledgable and even said that it was very cool that I had an OpenBSD box as a router. He very quickly diagnosed that someone in my neighbourhood was 'hammering' the DHCP service by not releasing his lease (a common trick to keep your IP address somewhat static). This caused a double IP on the subnet, and the L2 switch to block traffic to my port.

He asked me "do you know how to spoof your MAC with an OpenBSD box?". Then I knew this guy was legit. He instructed me to replace the last 2 bytes of the MAC with AB:BA (named after the music group). They had a separate DHCP pool for MAC addresses in that range. If they ever saw an ABBA mac address on their network, they knew it was someone who had connectivity issues before.

The problem was immediately solved, and I had a rock-solid internet connection for years, with a static IP!
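The rewrite he described (on OpenBSD it would be applied with something like `ifconfig em0 lladdr <new-mac>`, interface name assumed) amounts to replacing the last two octets of the MAC. A trivial sketch of that transformation:

```python
def to_abba(mac: str) -> str:
    """Replace the last two octets of a MAC address with ab:ba,
    the ISP's 'known trouble customer' marker from the story."""
    parts = mac.lower().split(":")
    assert len(parts) == 6, "expected a six-octet MAC"
    return ":".join(parts[:4] + ["ab", "ba"])

print(to_abba("00:1c:42:9f:13:37"))  # -> 00:1c:42:9f:ab:ba
```

(The example MAC is made up; the clever part was on the ISP's side, where a separate DHCP pool matched on that suffix.)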

I ended up chatting about networking and OpenBSD a bit, before I (as humble as I could) told the guy I was a bit flabbergasted that someone as knowledgable as him was working on the servicedesk.

It turned out he was the chief of network operations at the ISP (the biggest ISP in my country). He was just manning the phone while some of his colleagues from the servicedesk were having dinner.

Sometimes miracles do happen.

Many "phone robot" systems can be overridden by mashing on the keyboard, shouting or swearing - these will get redirected to a live human ASAP.

Andrews & Arnold (UK-based ISP) actually are compliant with XKCD 806! https://www.aa.net.uk/broadband/why-choose-aaisp/

Way back in the day I worked on equipment that straddled telco circuits. T1s, E1s, DS3s, OC whatever. Companies paying big money for those circuits.

Anyway I was told on more than one occasion by different telcos that the standard operating procedure for many techs was to take the call, do nothing, and call back 20 minutes later and ask if it looked better because ... often enough it did.

When I started my job as an IT director 10 years ago, I was in the customer support room, and there was one guy who was known for solving all the hard problems. I was standing behind him when he was taking a call. He patiently listened to the client, then loudly typed random stuff on his keyboard for twenty seconds or so, making sure the client could hear the frantic typing, sighed, and then asked “Is it better now?”

It always was.

The problem is that the idiots they hire to do their "technical" support have zero skill (nor motivation to learn) to assess whether it's a 5% problem or not, and the majority of end-users aren't capable of answering that question either, nor are they incentivized to answer truthfully.

The solution could be a priority support tier where you pay upfront for an hour of a real network engineer's time (decently compensated so that he actually cares about solving the problem) and the charge is refunded only if the problem indeed ends up being on the ISP's side. This should self-regulate as anyone wasting the engineers' time for a simple problem they could resolve themselves would pay for that time.

I realize this was written from a position of frustration (which I share) at getting run around by customer support, but I'd reconsider the blanket characterization of tech support staff as "idiots": they're doing a high-throughput job following a playbook they're given with, as you identify, no incentive (it's probably less about personal motivation than the expectations set for how they perform their job) to break the rules to provide better customer service to people with 5% problems.

+1. The real idiots here are the AT&T mid-to-upper management who set up this process and who also apparently have zero monitoring for packet loss/bit flips, so that they've had an outage for weeks now. Support techs have no training nor tooling to debug this issue.

I cope with level 1 support by remembering we have a common goal: to stop talking to each other as quickly as possible. The tech just wants to close the case and I want to talk to someone else who can actually help me.

That’s a great example of finding how you’re aligned and then using it to get a mutually beneficial outcome.

Good point. The managers would probably give negative feedback to people that took more time on their calls in order to try to help the customer better.

Who decides if it's the ISP's problem? What if it's "Both"?

About 20 years ago I got escalated to high-tier Comcast support for an issue that turned out to be a little of A, a little of B: Comcast (Might have been @Home, based on the timing) required that your MAC address be whitelisted as part of their onboarding process. Early home routers had a "MAC Address Clone" feature for precisely this reason. At some point, the leased network card got returned to Comcast. Our router continued to work just fine... until about a year later, when the local office put that network card into some other customer's home. We started getting random disconnects in the middle of the day, and it took forever to diagnose, as the other customer was not particularly active with their internet use. Whose fault was it? Ours, for ARP Spoofing? Theirs, for requiring the spoofing?

How did you even manage to figure this out? I’ve never gotten anyone on the phone who could possibly help in a situation like this.

Wireshark and escalation to a competent tech. I believe they saw weird traffic from their DHCP server, and we were able to attach an ethernet hub (Not switch, a 10-Base-T Hub that repeated the signal on each port) along with a laptop that was running Ethereal (Before the name changed! How long ago that was now) and see the arp packets fighting.

> I believe they saw weird traffic from their DHCP server, ...

That makes sense.

When the cable modem issued a DHCP request, the CMTS would have been configured to insert some additional information (a "circuit-id") into the DHCP request as it relayed it to the DHCP server.

The short version is that the "competent tech" looked at the logs from the DHCP server, which would have showed that the "same" cable modem (i.e., MAC address) was physically connected to either 1) two different CMTS boxes or 2) two different interfaces of the same CMTS.

How would one cable modem be physically present in two different locations at the same time? Obviously, it wouldn't.

At that point, either 1) there are two cable modems with the same burned in address or 2) one of the two cable modems is cloning/spoofing its MAC address. Which one of those is more likely?

(If you're interested in the details, try "DHCP Option 82" as your search term.)
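To make the mechanism concrete, here's a rough sketch (my own illustration, not anything from Comcast's actual setup; the circuit-id string is invented) of what the relay-inserted option looks like on the wire:

```python
# Sketch of DHCP option 82 ("Relay Agent Information", RFC 3046).
# The relay (e.g. the CMTS) appends this to the client's DHCP request so
# the server can log which physical circuit the request arrived on.
def relay_agent_info(circuit_id: bytes) -> bytes:
    sub_option = bytes([1, len(circuit_id)]) + circuit_id  # sub-option 1: Agent Circuit ID
    return bytes([82, len(sub_option)]) + sub_option       # option 82 wrapper

# Hypothetical circuit-id, purely for illustration:
opt = relay_agent_info(b"cmts1/cable3/0")
# The "same" MAC showing up with two different circuit-ids in the DHCP
# server's logs is what gives a cloned modem away.
```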

To be fair, the vast majority of technical support calls likely require only customer service rather than technical skills. Anyone with technical chops probably wouldn't last long in that environment.

A friend of mine used to work tech support, and said that from her perspective, a lot of the effort was trying to counter-outsmart customers who thought they were too smart.

For instance: "This might just be a loose cable, can you unplug each end of it, and plug it back in?" invariably elicited an "I've already done that", or a brief pause followed by "okay, there, I did it, do you see it?". Lies, often.

But: "Alright, I want you to try turning the cable around for me, yeah swap it end for end. Sometimes you get a connector that just doesn't fit quite right, but it works the other way around and it's faster than sending a tech to replace it", would often get a startled "Oh! I hadn't thought of that, one moment..." and then the customer actually DOES unplug the thing, and what do you know, they click it in properly this time.

I worked for a call center and there was a girl there I didn't think was good at her job.

Then one day, she tells me that she tells people to unplug the power cord, cup the end in their hand, and then plug it back in.

Suddenly, I really liked her. It's a genius move that makes them think they've done something obscure, but she really just wanted them to actually check the cable.

I worked computer support once, and just as I was hired, they went from (IIRC) 6 weeks of training to only 2 weeks. Nobody came through that knowing any more than they started with when it came to fixing computers.

Luckily, I already knew how to fix them. I found the job to be a cake walk and quite liked helping people. But I had to listen to people around me fumble through it.

It was frustrating at the time, but my favorite thing that happened was that I was admonished twice for having average call times that were too low. To them, that's a warning that someone is just getting people off the line without fixing their problems.

They monitored calls and they said I'd never received a complaint, but the system would keep flagging me for low call times so I had to artificially raise them. They suggested that I have a conversation with the clients.

I didn't, and I didn't stay there much longer, but it was quite a crazy situation. But I also felt much less pressure to handle calls quickly after that, too, which was nice.

A huge percentage of companies have their help desks and customer support outsourced to TCS or Cognizant or Accenture, etc.

Everything they do is driven by metrics, and their contracts are written around KPIs like maintaining a ludicrously short average time to answer, short handle times, and open/closed ticket ratios. If they do not hit these metrics, then the outsourcing company owes service credits. The incentives do not align with making customers happy and solving their problems. Everything is geared towards deflecting users with self-help options, simple scripts for the agents, and walking the line of hitting those metrics with the fewest number of warm bodies.

It's a pretty hellish business.

This blows my mind, because why would a customer come back when they're treated like that? Oh right, because they don't have anywhere else to go.

For the last several years, I've gotten my mobile service through an MVNO named Ting. On the rare occasion that I've needed to call support, there's no IVR, just a human who typically answers on the second ring. They speak native English, and have never failed to solve my problem either immediately or with one prompt escalation.

They're so jarringly competent I wonder how they still exist, if being obnoxiously incompetent is apparently a business requirement.

> If there's anyone out there designing tech support procedures, you should add an "is this a 5% problem?" question to whatever checklist you give to support staff.

... only if you want people to actually be able to escalate. I suspect AT&T is going to lose a lot less money over this issue than hiring an extra high-quality support person would cost.

Many companies have done the math, and realized they make more money if they just let the 5% customers leave.

It's even worse; the customers have no real choice of anyone else to leave _to_.

100% you’re on point here, but there’s one problem that happens when they do add “is this a 5% problem”: eventually (and I’d say pretty quickly), the public gets wind of “if I say the right things to get marked as a 5% problem, I get an automatic escalation to someone who knows more.”, and suddenly you get a big chunk of level 1 calls in to the upper tiers.

For evidence, see: “Retention departments give you the best rates if you threaten to cancel” (which then caused companies to have to rename the retention departments and change the policies).

Obligatory XKCD reference: https://xkcd.com/806/

Much scarier when you actually see it with your own eyes:

    $ diff example-*
    <     <titde>Example Domain</title>
    >     <title>Example Domain</title>
    <         backoround-color: #f0f0f2;
    >         background-color: #f0f0f2;
    <         box-shadow: 2pp 3px 7px 2px rgba(0,0,0,0.02);
    >         box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);
    <     <p>This domain is for use in illustrative examples in documents. You may usm this
    >     <p>This domain is for use in illustrative examples in documents. You may use this

I was under the impression there were enough layers of checksums once you get to the level of UDP/TCP that these kinds of single-bit flips should be detected and dropped before you get to read them. What's going on here? Is networking gear not calculating checksums like it should?

The TCP header does have a checksum that is supposed to cover both the header and the payload.

So either the bit corruption is such that it is not detected by the checksum,

or AT&T is doing something nefarious that touches layer 4 and corrupts the data while doing so.

I don’t have numbers off hand, but from memory I would think it is extremely unlikely that TCP checksums are consistently failing to trigger retransmission. Someone must be altering packets along the way.

TCP checksums are notoriously weak:


With enough packets passing through the dodgy RAM, a noticeable number will manage to get mangled in such a way that the checksum is still correct.

Checksums are also often recomputed on transit if the packet is intercepted, e.g. to limit TTL, unflag odd/unused TCP features, that kind of ISP-ish preening. So if it was a software error or even a hardware problem in the right (wrong) spot it’s possible to get this kind of corruption without retransmits.

I think you are confusing the IP header with TCP header.

Routers don’t touch the TCP header at all

No. URG, “Christmas tree” packets, etc. can all be mangled by the ISP, or more commonly dropped.

if your router does NAT it touches the tcp header

You are right, although NAT could be classified as a firewall feature that runs on routers.

But in any case it's irrelevant in this context, as AT&T shouldn't be doing any NAT.

In that case, won't there be significant packet loss causing throughput to be very slow? I don't know if this is possible without something messing with TCP headers.

It is triggering retransmission. It just retransmits until it gets lucky with the (pretty weak) TCP checksum.

The checksum should be checked by your computer. So somehow the packet is being repackaged with the correct checksum, but for the wrong data. In other words, when your computer checks the checksum, it matches. Another possibility is that somehow only errors that result in the same checksum are being generated.

Or their computer isn't checking the checksum. As is apparently the case on mac os (as reported elsewhere in the thread).

Another quality product by Apple ;P

Except it was trivial to reproduce with the script on non-Apple devices, and people in one of the many Twitter threads surrounding this showed that on their Mac there were MANY TCP retransmits due to invalid checksums, and the bit-flipped packet did have the correct checksum.


OK, that tweet does show the checksum is OK. I didn't see a whole lot of tcpdumps, so had to go with what was reported in the thread (I tried to reproduce with a few people, but my server wasn't in the broken path, so I couldn't get a lot of real data).

That tweet in particular doesn't show any retransmits.

tcpdump/wireshark gets a little hard to read at times, especially when the packet dump is a lie: all those packets marked red for bad checksums are from the dumping machine, and the checksums are wrong because the NIC is filling them in, and the capture interface doesn't get to see what they are. Perhaps the other people in the thread who said mac os was ignoring bad checksums were also confused; or perhaps it does ignore bad checksums, it's pretty bad at networking (it can't handle a synflood in 2020 because it's got SYN-handling code from 2000).

There’s no proof of this. And what possible reason would macOS have for not checking the checksum? Although the checksum is weak it presumably catches at least some corrupt traffic. Do you really think Apple would just skip the TCP checksum and make its network performance less reliable when they have already implemented (or maintained if it came from BSD) the rest of a TCP/IP stack, which is vastly more complex, just because its developers are lazy?

There were reports in the twitter threads that macs were ignoring the bad checksum (and I thought I saw confirmations in this thread, blaming a driver). Since I don't have a mac anymore, I couldn't confirm or deny; and since I don't have any connection with the bad equipment in the path, I couldn't get a tcpdump, and I hadn't seen any on twitter.

I wouldn't have expected Apple to purposefully break the checksum, just as I don't think they purposefully have no synflood protection; they pulled the TCP/IP stack at the turn of the century and never synced it to get the many, many upgrades from upstream (although they did add on MP-TCP, so there's that). I wouldn't be surprised if TCP checksums had stopped working ages ago, possibly because of an aggressive driver, and nobody noticed. Kind of like how if you spawn a few thousand threads that just sit around sleeping, it will delay watchdog kicks and the kernel will panic. (Also from reports on here, not personal experience.)

It also seems quite easy to verify this hypothesis with scapy:

  >>> p1 = IP(dst="192.168.mac.ip")/TCP(dport=1984,sport=20001)
  >>> p2 = IP(dst="192.168.mac.ip")/TCP(dport=1984,sport=20002)
  >>> p2.show2()
  ###[ IP ]###
    version   = 4
    ihl       = 5
    tos       = 0x0
    len       = 40
    id        = 1
    flags     =
    frag      = 0
    ttl       = 64
    proto     = tcp
    chksum    = 0xf905
    src       = 192.168.linux.ip
    dst       = 192.168.mac.ip
    \options   \
  ###[ TCP ]###
       sport     = commtact_http
       dport     = bb
       seq       = 0
       ack       = 0
       dataofs   = 5
       reserved  = 0
       flags     = S
       window    = 8192
       chksum    = 0xb836
       urgptr    = 0
       options   = []

  >>> p2[TCP].chksum = 0xb836 ^ 0x8  # mangled checksum

  >>> sr1(p1, timeout=1)
  Begin emission:
  Finished sending 1 packets.
  Received 5 packets, got 1 answers, remaining 0 packets
  <IP  version=4 ihl=5 tos=0x0 len=44 id=0 flags=DF frag=0 ttl=64 proto=tcp chksum=0xb902 src=192.168.mac.ip dst=192.168.linux.ip |<TCP  sport=bb dport=microsan seq=1671494800 ack=1 dataofs=6 reserved=0 flags=SA window=65535 chksum=0x6039 urgptr=0 options=[('MSS', 1460)] |>>

  >>> sr1(p2, timeout=1)
  Begin emission:
  Finished sending 1 packets.
  Received 494 packets, got 0 answers, remaining 1 packets

My Mac silently dropped the packet with the mangled checksum.

It is extremely common for hardware to be configured to ignore checksums. ("A packet with a bad checksum would have been dropped before it got here. Our cabling is too short to drop bits.")

The same here. It is always the 0x08 bit. (p-x, d-l, g-o etc.).

I wonder if there’s some 64 line I/O somewhere with a couple traces bridged together, where:

0 and 0 are still 0 and 0,

1 and 1 are still 1 and 1,

but 0 and 1 become 1 and 1 (or 0&0)

and 1 and 0 become 1 and 1 (or 0&0).

I remember seeing this wackiness when I bridged two address lines on an EEPROM with the tiniest amount of solder.
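The cases above can be sketched as a toy truth table; I'm assuming the short behaves like a wired-OR (a wired-AND short would give 0&0 for mixed inputs instead):

```python
# Toy model of two bus traces shorted together, assuming the short acts
# as a wired-OR. This is a hypothesis sketch, not a claim about AT&T's
# actual hardware.
def bridged(a: int, b: int) -> tuple:
    v = a | b
    return (v, v)

# Equal inputs pass through unchanged; mixed inputs both get pulled to 1.
```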

RoHS/tin whiskers strikes again?

This is unlikely, given how many people are seeing this. And now the problem has gone away for me!

I get the same. Pretty amazing that the only effect from such garbled data that I noticed were some annoying hangs and bad Twitter links!

> presumably because a corrupted packet in a TLS handshake is an error and the connection dies

Not just in the handshake. TLS moves these things called TLSPlaintext records (up to about 16 kbytes each), not only in the handshake but also for all the actual data, and they'll always have integrity protection to ensure bad guys can't change anything. TLS can't know the difference between a bad guy tampering with data and your crappy Internet mangling the data in transit; in either case the TLS protocol design says to "alert" bad_record_mac, which your browser or similar software will probably treat as a failed connection, even if it happens mid-way through an HTTP transaction.

Because TLS guarantees integrity even if your fiber is a complete shit show, any TLSPlaintext records which do get from one end to the other are guaranteed to be as intended.
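As an illustration of the integrity check (this is not actual TLS code; TLS 1.3 protects each record with an AEAD tag rather than a bare HMAC, and the key and record contents here are invented), a single flipped bit fails verification the same way:

```python
# Illustration only: a per-record integrity tag catches any flipped bit.
import hmac, hashlib

key = b"illustrative-session-key"          # made up for the example
record = b"<title>Example Domain</title>"
tag = hmac.new(key, record, hashlib.sha256).digest()

tampered = bytearray(record)
tampered[4] ^= 0x08  # the same kind of single-bit flip seen in the diffs
ok = hmac.compare_digest(
    hmac.new(key, bytes(tampered), hashlib.sha256).digest(), tag
)
# ok is False: the receiver sends a bad_record_mac alert and the
# connection dies, which is exactly the symptom people are seeing.
```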

Thanks for sharing this! I have long suspected that TLS does this. It is great: for many applications, preventing bit flips in transit is arguably more important than privacy. Is there an authoritative source where this is documented?

You mean besides the RFCs (the most recent, RFC 8446, describes TLS 1.3)?


If you aren't interested in why this works, just the Introduction to the RFC explains the intent; specifically, what we care about here is that TLS delivers:

"Integrity: Data sent over the channel after establishment cannot be modified by attackers without detection."

And further notes "These properties should be true even in the face of an attacker who has complete control of the network".

Help me understand: are you saying only HTTP bits are being flipped? Because yeah, if an HTTPS bit was flipped the whole packet dies. So is this issue blowing up all sorts of traffic everywhere?

Yes, all bits are being flipped. TLS connections drop because the message can't be authenticated, HTTP or other plaintext protocols will continue on with bad data.

So yes, it is blowing up all sorts of traffic everywhere. You just don't notice when it is plaintext.

I ran the script and it is finding a difference after a while. I have not noticed anything wrong with the network recently, but that of course does not mean there has not been a problem.

Edit: Actually I take that back, I have seen one of the issues where sometimes web sites will not load at random, and a reload fixes the issue.

I have AT&T fiber in Texas and was having issues recently, probably for a few days, where DNS props would just fail (I use Google's DNS), huge pauses in page loads, with the occasional page that just doesn't load. Happened over the long weekend IIRC and was sporadic enough that I didn't look into it further. I thought it had largely cleared, but now I'm wondering about some ongoing page load pauses...

> ... where DNS props would just fail ...

What's a "DNS prop"?

Sorry, I meant probes as in the Chrome errors.

Propagation, but I’m not entirely sure it makes sense in this context.

It's not just Fiber. It's also V/DSL connections including resold VDSL through a company like Sonic.

I can confirm that on Sonic (resold AT&T) VDSL I am seeing this exact corruption.

Nice find! I had been wondering why I had been seeing odd TLS failure messages recently.

Sonic might be the best avenue to get this fixed. They care about their customers and presumably can talk with AT&T in a much more meaningful way.

Why is it always the same positions that differ?

The files differ after anywhere from 1sec to 5sec for me, and it's always the same character positions, and it always seems to be the same lines, and the same number of lines.

Most likely it’s the same position in packets; and there’s some bad RAM in a device along the route.

For me, TLS errors would happen every so often after we had used all our data allowance and our connection was being shaped to a slow speed. Once the speed returned to normal it was fine. I always thought it was the slow speed but maybe there were bugs in the shaping software.

In the above linked gist, how do we know it's a low-level router bit flip instead of some code/programming error in their MITM / DNS / JavaScript injection tomfoolery?

AFAIU the same issue also affects HTTPS packets, just with different symptoms. E.g. the TLS handshake will fail and stuff like that.

Why would a TCP connection allow flipped bits to make it through?

Could be crappy/buggy middle boxes. Especially if they’re inspecting packets or messing with SSL.

My ATT modem (Arris bgw-210-700) may have gotten a firmware upgrade recently as I found some settings I shut off got re-enabled. I use it for VDSL but the same model family is used for ATT fiber.

The TCP checksum is a simple one's-complement checksum. So if two bits are flipped at the same position in two different 16-bit words, they cancel each other out. If you look at all the diffs posted, they differ in an even number of lines.
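A small sketch (my own, not from the linked gist) of the RFC 1071 one's-complement checksum makes the cancellation easy to see: flip the 0x08 bit in two bytes at the same position within their 16-bit words, one flip going 0 to 1 and the other 1 to 0, and the sum is unchanged:

```python
# Sketch: RFC 1071 one's-complement sum of 16-bit words, the same
# algorithm TCP uses (over pseudo-header, header, and payload).
def internet_checksum(data: bytes) -> int:
    if len(data) % 2:
        data += b"\x00"                           # pad odd-length input
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)  # fold the carry back in
    return ~total & 0xFFFF

original = b"background-color"
mangled = bytearray(original)
mangled[4] ^= 0x08   # 'g' -> 'o': its 16-bit word gains 0x0800
mangled[6] ^= 0x08   # 'o' -> 'g': its 16-bit word loses 0x0800

# A single flip is caught, but this pair cancels out exactly:
assert internet_checksum(bytes(mangled)) == internet_checksum(original)
```

The flips observed in the diffs (p-x, d-l, g-o) are exactly this 0x08 pattern, and any pair of opposite-direction flips at the same word offset sails through the checksum.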

How do you mean? How would it know if a bit in the HTTP payload got flipped?

Each TCP packet has a checksum but it does not catch 100% of all possible bit flips

And some routing hardware has been known to ignore this, meaning it will forward the packet data along and recalculate the checksum.

Hardware that forwards packets usually forwards IP packets, it doesn't care about TCP checksums and doesn't recalculate them. TCP stack in your OS is the one that does that.

It has to be a special kind of hardware that does much deeper packet inspection (DPI) to recalculate TCP checksums, usually used for spying, throttling, censorship, injecting ads, injecting exploits, etc., but not merely routing/forwarding packets.

Switches (L2 devices) recalculate Ethernet CRCs (L2 error detection codes), and routers tend to recalculate TCP/UDP checksums (L4 error detection codes) and everything below. I've seen exactly this issue with switches that have bad RAM before, and I assume that they just have a router with bad RAM (and without ECC RAM, which looks embarrassing).

To my knowledge, from working on an actual software router, a router will only touch the TTL and recalculate the IP header checksum.

There is no reason for it to touch the TCP header.

Agreed there is no logical reason for it to touch the TCP header.

And yet, an unfortunate number of L2 switches do exactly that. :(

Doesn't NAT (specifically carrier-grade NAT in this case) often modify the port? Although I don't know if AT&T does carrier-grade NAT.


> an unfortunate number of L2 switches do exactly that

Can you name any? Just curious

Checksums. TCP is supposed to provide reliability against data corruption, resend bad packets, etc.

Unfortunately, TCP checksums are hot garbage given switch ASIC design. They are a 16-bit one's-complement sum over a packet. If you get two bit flips at the same offset % 16, you can pass the checksum.

The problem is that routers slow down the high-speed serial signals from fiber by splitting the bits over a large number of slower-speed signals internally. Often those wider busses are a multiple of 16 bits. For example, one ASIC I know of moves things around in 204-byte chunks. (Might have been 208, been a while.) Anyway, the problem is that if there is a defect in one of those parallel elements, it will always flip bits at the same offset position mod 204 bytes, which is the same position mod 16 bits. If the hardware is degraded enough, it can end up flipping two bits in the same position, and that has a fairly good chance of passing the checksum.

Ethernet has proper CRCs on packets, which are a lot less vulnerable to shenanigans like this, but unfortunately those can end up being checked on the way in, discarded, and then regenerated on the way out of a router. If anything is corrupted in the middle of the switch ASIC, nothing notices and it passes along. I once helped troubleshoot an issue in our network where a BGP packet was corrupted in this way. The flipped bits ended up causing a more specific route to be generated, and we had the world's weirdest BGP route hijack within the bounds of our own data center.

204 IIRC. Dune Petra.

Depending on TCP checksums and experiencing single-bit flips took down AWS S3 back in the day: https://status.aws.amazon.com/s3-20080720.html

Checksums are often calculated in hardware on the NIC. I have personally seen a network card send packets with corrupted data (corrupted by the network card itself) and valid TCP checksums, computed on the corrupted data.

That's true, but it doesn't apply in this case: if you assume you received one with a wrong checksum, a network card won't recompute it, of course.

Data point: I’m on Sonic fiber (resold AT&T fiber) and this script has been running without errors for 5+ minutes.

> resold AT&T fiber

Only in certain areas! Within San Francisco on overhead-cabled blocks (for example) their lines are their own.

Oh, good to know, I am indeed in SF.

Imagine if such an issue appeared in a non-techy-area, it could go unnoticed for years.

Oddly, I can't even resolve example.com (using AT&T DNS).

Sounds like they fixed the “glitch” of excessive traffic against example.com. :)

My AT&T bay area fiber DNS has been garbage since the day we got it. I've had to forcibly update every device we own to Google DNS as well as put another router in between our devices and the AT&T one since AT&T doesn't let you actually change those settings on your device.

Why wouldn't the first conclusion be that the example.com service was flipping bits? Load balancing TLS requests to a different front-end pool is very common.

Has anyone claimed to see these flipped bits on a domain other than example.com?

> Has anyone claimed to see these flipped bits on a domain other than example.com?

See e.g. https://gist.github.com/bmastenbrook/14c0e22fc02b95d4a48f82d...

This is a pretty useful piece of debugging here. It might be worthwhile to try to get EdgeCast involved, as it could be a broken thing between AT&T and EdgeCast.

Given that (it seems) only AT&T customers are complaining, and (it seems) it only affects servers on EdgeCast.

I see no errors running the same script from another host that doesn't go through at&t.

> or maybe someone from AT&T is on here that can get it looked at...

They are on Facebook and LinkedIn; you have seen them.

Trying to explain this issue to AT&T support is like trying to convince a doctor you're the only person on earth with a particular disease.

Even explaining the issue is hard. It's not an outage, my internet isn't out, it's intermittently wrong. The phone support agents aren't prepared for this, and I can't find any way to escalate or speak to a network engineer.

I feel like if I spoke to the right engineer, there'd be a ticket on this and they'd roll a truck to their facilities or the IXP within an hour. It's a major network issue to flip bits; it's costing them bandwidth with retransmits and could be breaking SLAs with their business customers.

On the phone the most they could do was roll a truck to me.

When I first moved to Seattle there was a great local ISP called CondoInternet that mainly specialized in high-density downtown buildings.

I was once having some packet loss issues and called their support line. I assume the company was really small at the time, because the guy who answered the phone was clearly a network engineer who knew the system inside and out. I read him a couple of traceroutes over the phone and we resolved the issue within minutes.

I have never experienced such perfect tech support before or after that with any other ISP.

And when Wave bought them, Wave put incompetent people in charge. I moved to a Seattle suburb and had "Wave G" (post-Wave CondoInternet).

I had throughput issues, first in my apartment. That got resolved via Reddit. Then the backbone was slow from time to time, which persisted. Sometimes I get a gigabit, other times I only get 10-20 Mbps.

What did Wave say? Let's bring a tech out. I told them "it's a backbone issue" and they didn't understand me. I don't know how I would explain how networking works to a field tech who only knows how to enable a port or run basic diagnostics.

Another time, I tripped Port Security, and told them they can re-enable my port. Wave said "we need to bring a tech out" as if that will magically solve everything.

This is made worse by my apartment's exclusivity deals: only Comcast and Wave, nothing else. Ziply Fiber (bought Frontier) and Atlas Networks were denied.

Bigger companies like Verizon FiOS in NYC had better tech support, I could still get hold of a Level 2 tech for a much smaller issue (and not even on FiOS).

Wave G makes AT&T's forced router (but not the flipping bits) seem decent in comparison. I don't want Comcast, but I'll happily take AT&T Fiber over Wave G even if it means trading my pfSense box for a crappy AT&T gateway (worst-case scenario, I could bypass or root the gateway, or pay $15 for static IPs).

I fortunately moved (for another reason) and have Google's Webpass. Webpass may not give me a full Gigabit, but it's usually 400-800 Mbps all the time and not 10-20 Mbps 95% of the time with an occasional Gigabit. I wish CondoInternet sold to Webpass instead of Wave (I don't want a merger now, but still).

Surprisingly, I may have preferred CenturyLink Fiber if available mainly for GPON over a 60GHz PtP microwave link (less oversubscription!), well unless PPPoE+6rd kills my pfSense. But Webpass works pretty darn well, and I think I'll stay unless CenturyLink suddenly gives me 2 Gbps FTTH or something, or I move.

I don't have much to add here, except to say that this really matches my experience with Wave G.

Wave G's support team also doesn't even realize that the CondoInternet network they acquired provides IPv6, and when asked about IPv6-related issues they just say "oh we don't have IPv6 yet" which is nonsense. I really miss the local CondoInternet support people. They were amazing.

If you're interested or can't find it, I'll dig up a link.. but I recently unplugged my AT&T supplied residential gateway after installing a supplicant docker image on my UniFi Dream Machine Pro. It answers the authentication challenges with a reply using a CA from a jailbroken AT&T fiber modem (sourced from a guy on eBay!). I see you're no longer on AT&T fiber, but if you would use it without their gateway, know that you can!

I made the mistake of (briefly) working at Wave. I can probably shed some light on them (full disclosure: they fired me after I was sent home for vomiting in the bathroom at work during a high call volume evening)

Wave is probably one of the most breathtakingly "if it works, use it" companies I've ever seen. I'm not sure any other ISPs truly even compare, simply in the breadth of equipment they use combined with how poorly they operate.

Wave operates in parts of Washington, Oregon and California. They've mostly grown by acquiring smaller, unprofitable or mismanaged ISPs in the areas they now own.

Wave offers TV, internet and phone services. Unlike a typical ISP like Comcast or Frontier, however, they don't just offer one or a couple of methods of service delivery.

For TV service, Wave offers

- Analog TV (mostly areas in California)

- Digital TV in most other areas

- TiVO and CableCARD services (they don't have the purchasing power to get boxes from companies like Motorola/Arris or Cisco/Pace)

For telephony service, they not only offer VoIP services but in some areas even offer regular PSTN phone lines!

For internet, Wave offers its "Wave" DOCSIS 3.0 (and in many areas, still 2.0) services. They also own what was originally CondoInternet

Finding out how CondoInternet/Wave G operates was probably one of the most horrifying things I've ever seen in my years of telco work. I'll try to explain it from the ground up since it's very much a jenga tower of terribleness

Condo Internet in its inception had a very uphill battle. They wanted to target expensive Seattle condo buildings and sell a "premium" product. However, in very Seattle fashion, they were met with very indifferent, if not outright hostile, responses to their plans. So they made do with what they could get.

Condo Internet's services comprise a hodgepodge of VDSL2, Point to Point wireless, Fiber-Optic and MoCA. Effectively what ever they could wire into the building or appropriate for use, they did. This is why some apartments can get symmetric gigabit, while others can only get 100 megabit

MoCA was initially the most horrifying one I encountered while working there. MoCA is effectively Ethernet running over coaxial cables. Except since coax is a shared medium, it's just like the Ethernet hubs of the 1990s all over again.

The main reason this was done as it was considered cheaper than installing an HFC node or CMTS. They didn't know how many customers they would get to switch over, so they played their cards extremely conservatively.

Apartments with MoCA configured would have a (managed) gigabit switch or two in the basement for link back to the Condo PoP. Whichever vendor was cheapest at time of purchase (Cisco, Juniper etc)

These switches would each be connected to an individual MoCA adapter, connected to one of the cable drops going to each individual apartment/floor/whatever. The field tech would then install an accompanying MoCA adapter in the customers home (simply calling it a "cable modem") and connect it to a Wave provided router (typically a TP-Link Archer C7)

Condo/Wave would offer typically symmetric 100 megabit on these lines, though the ability for more than a few customers on each "MoCA node" (for lack of a better term) to saturate them was much more limited

Another "fun" feature was that both the MoCA devices and switches they were connected to were run without any sort of VLAN'ing at all. If a customer accidentally plugged the MoCA link into the LAN port on their router, it would happily hand out DHCP leases to the entire building!

As I found out, the reason they don't use VLANs is that their NOC staff are almost entirely customer service reps who were "upskilled" to handle NOC tasks (gaining a fixed $0.50-an-hour bonus, hooray!). Wave's NOC handles roughly 90% of WaveG calls (I'd guess because Wave doesn't make very much money?)

One other fun anecdote:

WaveG service has a lot of users from overseas who set up VPNs for their parents to watch Netflix on. Netflix's internal algorithms for the longest time would detect this behaviour and automatically flag Wave's entire IP ranges as a "proxy or VPN provider," knocking roughly 500,000+ internet customers out of Netflix for several hours or even days. This would cause their phone support to effectively melt down, with the robotic queue time projecting roughly 5-6 hours or more.

I had a similar, local fiber ISP in a high density midwest apartment. Truly fantastic support. The network engineer came out to set me up, and we got to talking about our backgrounds (I worked in IT support while attending a nearby tech-focused university), he invites me to check out their building-wide switch closet down the hall with all the cool gear, then says "yeah we limit everyone to 100/100, but I'll flag your account for 250, just don't go overboard eh"

Scale ruins most things, unfortunately.

I had a similarly stellar experience with Init7 in Switzerland. The person who answered the phone wasn't a network engineer, but immediately passed me on to one within 10 seconds of our conversation starting.

For me it was night and day after previously having had an absolutely garbage municipality-run ISP (CityCable, for anyone considering them, stay far far away). Init7 might be a tad more expensive than most, but the service is solid.

XMission in Salt Lake City had support like that, back when DSL unbundling was a thing (2006).

They got bought by Wave, who still seem to have pretty solid support? At least over Twitter, it felt like I was talking to someone competent. I haven't had any major issues since installation, though, so maybe I just got lucky.

That sounds amazing!

The obnoxious thing to me is the hubris that must be behind this. Either they considered it and decided they would never encounter a system error like this and refused to implement an escalation route, or they never even considered it.

Or perhaps even worse than that, maybe they considered it, decided it was possible, but just don't care because of their insane borderline monopoly.

I don't understand how internet companies provide such consistently awful service.

Slightly off-topic story: I recently changed to another provider called Starry, and they force you to have a second router in front of your own router, which they claim "decodes" their stream from the modem. I don't know the real reason, but I'm pretty sure that's not it. If you plug their modem directly into a non-Starry router, the router just doesn't detect a connection.

One day, I tried to torrent something, and my internet would immediately get throttled to 0mbps. After investigating, I found out that their router had a custom OS which hid a firewall and various security settings. Amusingly, you could still access those settings if you just manually entered the page names into the address bar. Now all their stupid settings are disabled, and I just feel bad for all the folks who use their service and don't have the savvy to actually get what they're paying for.

Either that or every single CSR takes 50 calls per day, every day, that start with "There's a problem with the AT&T network!!!"

Do you find it difficult to believe many of those calls actually are issues with the AT&T network?

Wait so you have your main cable plugged into their modem, their router plugged into their modem, and then your own router plugged into their router?

Funnily enough this is how my AT&T fiber is set up as well. They force you to use their router, and you can’t directly connect your router to the ONT. The problem is that they use a device certificate + EAP. There are workarounds, but it’s a pain.

Oh, there's a workaround for that.. I've unplugged my Residential Gateway and now my UniFi dream machine pro is directly connected to the ONT.

You install a CA from a jailbroken modem into a supplicant container that runs on the UDM pro. It confirms to the network that you are using "authorised" equipment for the connection and the packets flow!

I'm curious to see what happens with the new installs, which terminate the fiber directly at the gateway using the SFP port on the new BGW320 gateways, rather than using a separate ONT like they have historically. The UDM Pro has an SFP WAN port that could ostensibly be used, but I haven't seen much yet about the feasibility of adapting the existing bypasses to ONT-less installs.

Which is then a problem when you try to explain them that yes, you are sure the issue is with their service and not your setup. But what's your reason to be going such lengths instead of just plugging UDM into their router? Unless it was done for the fun of it which is fine and understandable.

> But what's your reason to be going such lengths instead of just plugging UDM into their router?

While you can do this and things will generally work, AT&T restricts all of their residential gateways from operating in a true passthrough/bridge mode to another router. So you end up with double NAT and all the joys that entails (such as [1]). There are also a number of other issues associated with operating in their faux-passthrough mode, including:

- Issues with IPv6 prefix delegation

- Sporadic latency spikes (an issue in general that you inherit, since the gateway is still "doing" everything it normally would, given that it won't actually act as a true passthrough/bridge)

- A firmware update capped throughput at 50Mbps (later fixed in another firmware update)[2]

- Firmware updates tend to silently re-enable the built-in wifi radios

So while it'll generally work, it ends up problematic. You inherit all of the performance issues associated with just using the gateway as your all in one modem/router/firewall/AP/gateway, plus the addition of double NAT, plus the sharp edges of their poorly implemented faux-passthrough modes, plus the ever-present concern that you're one firmware update away from a non-working network despite having used their official passthrough configuration.

Hence why gateway bypasses are so popular[3][4][5][6]. Even if they're a bit involved to set up, once you get it working things just... work. With little if any upkeep (potentially a few minutes after a power outage, depending on the bypass method you implement).

[1] https://www.windowscentral.com/fix-xbox-one-double-nat

[2] https://www.dslreports.com/forum/r32172124-AT-T-Fiber-5268AC...

[3] https://github.com/MonkWho/pfatt

[4] https://github.com/bypassrg/att

[5] https://github.com/mrozentsvayg/vyos.att

[6] https://github.com/Hou-dev/simple-eap-proxy

Yes.. what he said.

But my main reason is actually the gigantic size of the residential gateway box. I mounted the ONT, UDM pro and PoE switch on a wall in a closet and the RG just took up too much space.

Thanks for such a detailed reply.

Yes exactly. Their router has a LAN with my router as the only other device, which it's bridged with, and then my router has the true home LAN.

A weird side effect of this is that I'm not using the 192.168.x.x range like usual (because that's what theirs is using), but instead the 10.0.x.x range

So are you bridged then or is it really a double nat?

this is really where my limited knowledge of networking shows. I'm not entirely sure but I want to say both or double nat. There's two networks, but my router thinks theirs is a modem and is connected via the "Internet" port, not just a normal device port

Ah ok - sounds like a double NAT then, not a bridge.

Edit: to go into more detail, their router is acting as a NAT for your public IP, giving you your first subnet, and then your router is getting a single IP on that subnet and creating a NAT where your devices all get IPs. In a bridge there would only be 1 IP space behind a single NAT. In your case with a double NAT a lot of consumer things might not work (like UPnP) and port forwarding would require you to add rules to both routers.

Thanks, yes that's exactly what it is then. In order to give external access to my devices (e.g. my NAS) I have to forward ports from their router to my router, and from my router to the device. So, definitely double NAT. Amusingly the person who installed it incorrectly called it a bridge.

Thank you for the insight and lesson!

Have you seen Parks & Rec, and remember that scene in a Home Depot where an associate walks up to Ron, asks him if he needs help with a project, and Ron responds "I know more than you"?

I've pulled a variation of that on CSRs at least once, and surprisingly, it can work. Just be cordial, preempt the typical IT support stuff they always ask, DO NOT say it's intermittent (initially, to the front-line CSR; if given a chance to expand on the issue after escalation, then add that bit), and get technical ASAP (it doesn't hurt to throw in some parallel industry jargon). Basically, build a case where even the information you're giving them is beyond a first-line CSR playbook, and they have to escalate.

"Hi there; I've been observing some erroneous TCP packet bit flipping on HTTP requests which route through one of AT&T's data centers in Oakland. I've tried restarting my computer, I'm seeing the same thing on my phone, and I actually swapped my router out for a spare one I have, but it's still an issue."

(That last sentence exhausts literally every playbook a front-line CSR has. It sounds so easy, right? There are four variables in any front-line CSR diagnostic equation: their network, your router, wifi/ethernet, and the endpoint. You just crossed off three of the four variables in one sentence.)

(Wait, a data center in Oakland? How do you know this? You can tracert a bad request and geolocate the first IP outside your network, but let's be realistic: you don't. You're fronting, demonstrating knowledge that a front-line CSR can't disprove. You may think this is misleading to whoever this gets escalated to, but it isn't; their tools are FAR more advanced than yours, and they're used to 99% of customers being incorrect idiots, so they're going to be validating and reconfirming every word you say anyway.)

Ron's Parks & Rec example above is crass. But here's the magic bit: front-line CSRs generally look for an excuse to escalate; you just need to give them enough CYA to check their job as done, and the higher-tier CSRs/network engineers will love you for actually knowing what you're talking about. It's a win-win; be cordial, be forceful, strut what you know.

I had something like this happen on an even simpler level this last week. I got a Chase credit card, but during the initial signup I called my brother to ask him if he wanted to be on the account, and the session timed out past account creation but before finalization.

I got the card eventually, but now I cannot create an online account with it. I called Chase, got transferred 5 times, and then was told I would need to go to a physical bank to verify my identity (?) to create an account. Not one of them had any clue what to do with "a broken account exists associated with this card in your database, I can guarantee it; forward me to your technical support team", but that's all above a bank rep's pay grade.

The nearest Chase bank is 1.5 hours away, by the way. Probably just going to cancel the card after cashing out the sign up bonus.

> I've tried restarting my computer, I'm seeing the same thing on my phone, and I actually swapped my router out for a spare one I have, but its still an issue.

"Ok sir, please click the start button, then the power button, and finally click the restart button to restart your computer..." (and they refuse to budge until you've swapped out your router yet again, because you didn't do all that while you were on the phone with them)

From a business standpoint, it's hard to justify paying support to be technical enough to diagnose an issue such as this. Let's be honest, even senior network engineers would have a hard time debugging and diagnosing this. AT&T doesn't want to pay support staff six figure salaries and I assume most senior network engineers don't want to be support agents (customer facing).

AT&T (though this applies to lots of companies) probably needs a "unicorn" role: a very technical person who is paid as such, but able to interface with customers on specific, highly technical issues.

Ten years ago, while using ADSL, for some reason captcha images were not loading. I opened a case with ISP support and they called me back. The support guy did not believe it. He said "this is not an analog network, this is digital, you either have it or not". I said I know what analog and digital mean, and also that I know this is a digital network since I am a computer engineer, and that I had checked everything, so this was an issue with the connection. A couple of hours later, he called me back and said the problem was caused by the modem and a driver update would fix it, and it did. Those were the good old days, though, when you could reach someone on the phone and talk through a problem. Nowadays everything is either an automated response or some random person whose whole job is to tell you that he/she cannot do anything about the situation.

Have you tried saying "shibboleet"?

My ISP actually supports this[0], though I haven’t had cause to use it yet - they are also very reliable!

[0]: https://www.aa.net.uk/broadband/why-choose-aaisp/

I think you're being unfairly downvoted by people who haven't seen this xkcd:


The real life version of “shibboleet” is your Certified Partner ID number and a serial number with a valid support contract.

When I worked for a VAR I could upload logs to Cisco and get experimental patches back. Call up HP, tell them I want an RMA, and they’d just do it. Night and day compared to what consumers get.

I talked to somebody and he told me he had talked to his "IT department" and they're working on it but who the fuck knows.

From my professional experience of programming and debugging networking equipment, this could be a switch/router buffer with bad memory (a stuck bit, maybe). The better chips have CRC/parity/ECC to cover such issues, but there are always those magical choke points where the old CRC is tossed and a new one is generated, which can leave a gaping hole. The tricky part is how often that bad memory buffer gets used...

I would use traceroute to find a common bad point for everyone. It is also possible that the point where the problem occurs is invisible to traceroute, as it could be part of a provider network (probably MPLS), but at least the common ends of the tunnel would be visible.

The fact that it is at a specific interval indicates a stuck bit in memory.

Some good previous public stories about such incidents https://www.verizondigitalmedia.com/blog/being-good-stewards... https://twitter.com/cperciva/status/1309568337408454658
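To illustrate the symptom (a hypothetical sketch, not AT&T's actual hardware): a single stuck bit in a shared packet buffer damages every payload staged through the same memory cell, at the same byte and bit position, at a fixed interval.

```python
def stuck_bit_buffer(payload: bytes, cell_stride: int = 2048,
                     stuck_offset: int = 137, stuck_mask: int = 0x08) -> bytes:
    # Hypothetical buffer whose memory cell at a fixed offset has one bit
    # stuck at 0: every packet staged through it gets damaged at the same
    # byte/bit position, matching the "specific interval" symptom.
    out = bytearray(payload)
    for i in range(stuck_offset, len(out), cell_stride):
        out[i] &= 0xFF ^ stuck_mask
    return bytes(out)

clean = bytes(range(256)) * 16                # 4 KiB of test data
dirty = stuck_bit_buffer(clean)
flips = [i for i in range(len(clean)) if clean[i] != dirty[i]]
assert flips == [137, 137 + 2048]             # same bit, fixed interval
```

The stride, offset, and mask above are made up; the point is that analog faults (bad cable/SFP) spray random errors, while a memory fault repeats at a fixed position.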

Hardware designers basically started making bad decisions on this issue around the time that VLAN tagging was introduced, as well as hardware forwarding of IP packets. When VLAN tags are inserted or removed, the CRC of a packet needs to be adjusted to reflect the inserted, removed and/or modified bytes from the VLAN header. Additionally, both the CRC and IP checksum of a packet need to be adjusted when TTL is decremented as part of IP routing.

When implementing this functionality, the naive hardware designer will strip the existing CRC from the packet, modify the contents of the packet and then reuse the handy dandy CRC calculation block to place a newly calculated CRC on the packet. Similar choices are made for the adjustment of the IP/TCP/UDP checksums. If any errors are introduced in the contents of the packet by the data path prior to the new CRC is calculated, this results in the CRC being "corrected" to include the erroneous data.

A far more understanding hardware designer will instead calculate how to adjust the CRC by the changes introduced in the packet contents. Sadly, this is far more complicated to get right, and it goes against the drive of hardware designers to reuse blocks of code wherever possible. Every hardware designer working on networking has a block of Verilog or VHDL code to calculate and append a CRC to a packet. Only the most dedicated will attempt to apply only the delta needed to the CRC or checksum.
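A toy demonstration of the difference, with Python's zlib.crc32 standing in for the Ethernet FCS (a sketch of the principle, not router firmware):

```python
import zlib

def crc32_delta(old_crc: int, old: bytes, new: bytes) -> int:
    # CRC32 is affine over GF(2): for equal-length messages,
    # crc(a ^ b) == crc(a) ^ crc(b) ^ crc(zeros). So a deliberate edit can
    # be folded into the existing CRC without recomputing from scratch.
    assert len(old) == len(new)
    diff = bytes(x ^ y for x, y in zip(old, new))
    return old_crc ^ zlib.crc32(diff) ^ zlib.crc32(bytes(len(old)))

frame = b"<title>Example Domain</title>"
crc = zlib.crc32(frame)

edited = frame.replace(b"Example", b"EXAMPLE")   # the *intended* edit
good_crc = crc32_delta(crc, frame, edited)
assert good_crc == zlib.crc32(edited)            # delta matches a full recompute

# Now a bit flips in the datapath after the edit:
wire = bytes([edited[0] ^ 0x10]) + edited[1:]
naive_crc = zlib.crc32(wire)        # naive design: strip + recompute over the buffer
assert naive_crc == zlib.crc32(wire)   # receiver's check passes: the flip is "blessed"
assert good_crc != zlib.crc32(wire)    # delta-adjusted CRC still catches the flip
```

The delta design keeps the CRC anchored to the bytes the hardware intended to send, so corruption introduced between the edit and the CRC block is still detectable; the naive design certifies whatever happens to be in the buffer.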

I'm not a hardware designer, but I routinely deal with low-level networking shenanigans, and I must admit that I never considered that it would be possible to update a CRC without recomputing it fully (unless you were just appending data, of course).

For people like me who aren't smart enough to figure it out on their own, this stackexchange answer seems to explain how it's done: https://cs.stackexchange.com/questions/92279/can-one-quickly...

This is a great explanation of "always those magical choke points where the past CRC is tossed" that parent poster is referencing. Thank you!

You can use ping to more easily hunt these types of issues, for example `ping -A -c 100 -s 1000 -p deadbeef` will show the difference if there is a flipped bit in the payload. You can generate patterns with xxd.
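If you'd rather script the comparison, a small helper (a sketch; the probe payload is illustrative) can report exactly which bits flipped in an echoed payload:

```python
def find_bit_flips(sent: bytes, received: bytes):
    # Return (byte_offset, bit_number) for every flipped bit. The same bit
    # position recurring across many probes points at bad memory in the
    # path rather than random line noise.
    flips = []
    for i, (a, b) in enumerate(zip(sent, received)):
        x = a ^ b
        while x:
            flips.append((i, (x & -x).bit_length() - 1))
            x &= x - 1   # clear the lowest set bit
    return flips

# "deadbeef" pattern echoed back with bit 3 of the second byte knocked out:
assert find_bit_flips(b"\xde\xad\xbe\xef", b"\xde\xa5\xbe\xef") == [(1, 3)]
```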

This kind of incident happened to me in a system that was supposed to have high availability. We had failovers for hardware, but it seems that a network device that was supposed to have HA (and was set up to pass the functionality to another device in case of failure) did not have ECC memory. One memory bit got stuck at 0 and the event was not detected at network level, as the data was repacked with a "clean" CRC. For some reason the packet headers were not affected by this, maybe because they were kept in a separate memory zone or because of memory alignment. So the device did not report any kind of suspicious activity, no errors in its statistics.

On the application side the effects were quite bad, as the data was mainly XML and, depending on where the bit was flipped, it could impact the data or the XML structure. The data had its own CRC/hash, so the packets were cleanly rejected by the application. Unfortunately, the XML library from the message queue engine and the ESB we were using did not like it at all when the bit flipping occurred in the XML tags (it seems fuzzing tests were not done at that point), so the message processing got stuck and we kept getting bad messages in the queues. Even worse, the queues could not be cleaned with the normal procedures because the application wanted to first display info about the messages inside - and that failed.

The network debug was non-trivial because of that header consistency - the network devices did not report any kind of packet issues, so we had to sniff the different network segments to identify the culprit. From the application point of view, we had to delete the whole message queue storage to get rid of the bad messages, and let the application handle the rest (luckily it was designed with eventual consistency and self-healing).

Wasted an opportunity to implement code that would detect and handle poison-pill messages. Those will happen in any system where queue is involved and there always needs to be an escape hatch to get rid of them. Deleting the queue is too extreme.
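For what it's worth, the minimal shape of that escape hatch looks something like this (a hypothetical sketch; real brokers such as RabbitMQ or SQS have built-in dead-letter queues for exactly this):

```python
import queue

def drain(work_q: "queue.Queue", dead_q: "queue.Queue", handle, max_attempts: int = 3):
    # Poison-pill handling: a message that keeps failing gets parked on a
    # dead-letter queue after max_attempts instead of wedging the consumer.
    attempts: dict = {}
    while True:
        try:
            msg_id, payload = work_q.get_nowait()
        except queue.Empty:
            return
        try:
            handle(payload)
        except Exception:
            attempts[msg_id] = attempts.get(msg_id, 0) + 1
            if attempts[msg_id] >= max_attempts:
                dead_q.put((msg_id, payload))   # park it for later inspection
            else:
                work_q.put((msg_id, payload))   # retry

wq, dq = queue.Queue(), queue.Queue()
wq.put((1, b"<title>ok</title>"))
wq.put((2, b"<titde>corrupted"))                # a bit-flipped XML tag

def parse(payload: bytes):
    if not payload.startswith(b"<title>"):
        raise ValueError("malformed XML")

drain(wq, dq, parse)
assert dq.get_nowait() == (2, b"<titde>corrupted")
```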

Deleting the queues is an operational decision that I made to be able to put the system back online after the network device was replaced (the important part was the uptime/SLA). From a quick analysis of the logs the percentage of bad messages was ~90% (there was a ~50% chance that the original "touched" bit was 0 so no change was done, but the messages had multiple "touched" bits at fixed intervals).

There was an escape hatch, but the conditions to hit it were a bit complex. Implementing new message filtering of this kind at 2AM while the system was down was not feasible.

> I would use traceroute to find a common bad point for everyone

How do you mean? I use traceroute from time to time but I’m not sure how it would apply in a case like this. Feel free to elaborate :)

Take a traceroute from everyone experiencing the problem and look for the common hops among them all. Then compare that list against traceroutes from people not experiencing the problem to find the differences. The final set there is a good place to start looking: switches and routers along that path could be the cause.
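Sketched in code (hop names are made up for illustration):

```python
def suspects(affected_traces, healthy_traces):
    # Hops shared by every affected traceroute, minus any hop that also
    # appears in a trace from someone NOT seeing corruption.
    shared = set.intersection(*(set(t) for t in affected_traces))
    clean = set().union(*(set(t) for t in healthy_traces))
    return shared - clean

bad_a  = ["gw.home",  "att-edge-1", "att-core-7", "example.com"]
bad_b  = ["gw2.home", "att-edge-4", "att-core-7", "example.com"]
fine_c = ["gw3.home", "att-edge-2", "att-core-3", "example.com"]

assert suspects([bad_a, bad_b], [fine_c]) == {"att-core-7"}
```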

If you ping the hops with a large icmp payload, you might be able to observe the flipped bit in the echo reply. That could help isolate which hop it is.

You get some amount of your traceroute packet back too, could have flipped bits in there.

It might be better (although harder!) to take the traceroute from example.org, instead of from the clients. Forward and reverse paths often diverge, so it's important to find the path with the error.

Some people on Twitter have started collecting IPs: https://twitter.com/alexstamos/status/1336100299841314817

They need to be capturing src/dest IPs as well as ports for AT&T to have any hope of using that data.

Edit to make the comment more useful: If anyone is curious, look up "ECMP hashing." There are probably tons of parallel paths through AT&T's network, and to narrow down to the hardware causing problems, they will need to identify which specific path was chosen. Hardware switches packets out equally viable pathways by hashing some of the attributes of the packet. Hash output % number of pathways selects which pathway at every hop.

Hardware does this because everyone wants all packets involved in the same "flow" (all packets with the same src/dest IP and port and protocol (TCP)) to deterministically go through the same set of pipes to avoid packet re-ordering. If you randomly sprayed packets, the varying buffer depths of routers (or even the speed of light and slightly different-length fibers along the way) could cause packets to swap ordering. While TCP "copes" with reordering, it doesn't like it, and older implementations slowed way down when it happened.
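A rough sketch of flow-hash path selection (real ASICs use vendor-specific hash functions and seeds; this just illustrates the determinism):

```python
import hashlib

PATHS = ["linecard-0", "linecard-1", "linecard-2", "linecard-3"]

def ecmp_path(src_ip, dst_ip, src_port, dst_port, proto="tcp", paths=PATHS):
    # Hash the flow 5-tuple so every packet of one TCP connection rides the
    # same equal-cost path (no reordering within a flow).
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    h = int.from_bytes(hashlib.sha256(key).digest()[:4], "big")
    return paths[h % len(paths)]

# Same 5-tuple -> same path, every single packet:
a = ecmp_path("10.0.0.1", "93.184.216.34", 51512, 443)
b = ecmp_path("10.0.0.1", "93.184.216.34", 51512, 443)
assert a == b
# Change only the source port and you may land on a different (possibly
# healthy) linecard, which is why reports need ports, not just IPs.
```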

I think he means if everyone on AT&T experiencing the issue ran trace route to example.com some common hops would emerge, which would be a place to start investigating.

It seems like it, but it's a widespread issue across the SF/Bay Area right now, maybe wider. I've been having it for weeks and exploring it as well. I've even gone as far as ripping the certs off of the router to double check.

To your point, my traceroutes on this problem often have NTT in them. It’s mostly Japanese websites, but also Wikipedia.

Why would a switch/router recalculate and rewrite the TCP checksum?

because they're changing the payload.


When AT&T first did their Fiber rollout in SF, one of the things I remember was they charged you $10 extra if you didn't want them to MITM all your connections to insert JavaScript pointing to their own ads.

They rolled this back when folks complained, but I wonder if the relevant infrastructure is still sitting around and mangling packets.

I would probably expect this to be some network card or cable or connector is failing though...

The tweet indicates it is at a specific bit position. That isn't symptomatic of a bad cable/sfp/etc. Analog problems like that tend to be more random. Random bit flipping in a fixed bit position is symptomatic of bad ram somewhere, or a router asic gone bad, or various software/config issues.

Yes, as of six months ago I had to look up how to opt out of MITMing for a coworker on AT&T Fiber, so that they could use the Internet again when the remote-activated MITM feature inside the modem broke.

What I always wondered about this is, unless AT&T has permission from the copyright holder of each and every Web page so modified to distribute the resulting derivative work, how is this practice not criminal copyright infringement?

For the same reason why ISPs are generally immunized from contributory liability that they would otherwise be completely buried in.

Now, if you had written JavaScript to detect and remove these ads, and they went around that, then you might be able to construct a DMCA 1201 claim and sue the ISP for circumventing what is legally considered DRM. Yes, JavaScript can be legally protected DRM. The law doesn't say it has to be good DRM, it just has to have the effect of controlling access to a copyrighted work. And the safe harbors the DMCA provides ISPs wouldn't protect them in this case.

As someone who doesn’t have ATT as an option here, clearly they can’t be that blatant about it?! How did they spin this $10 fee?

You can contribute to our attempts to find the bad router card here: https://twitter.com/alexstamos/status/1336099461622157312

Almost certainly

You might want to send an e-mail to the NANOG mailing list: https://mailman.nanog.org/mailman/listinfo/nanog

It’s not impossible that people from AT&T read messages posted there.

I confirm that I am seeing these bit flips, and that IP is in my traceroute.

You people are amazing!


My mobile.twitter.com traceroute prefers going through that path, as does en.wikipedia.org (both of which have sucked for me), while a traceroute to Google hops through a different path.

Yup, I'd been having issues with twitter, wikipedia, and sometimes duck duck go but never google.

Confirming this IP is in my traceroute to example.com as well

Oh man, I thought it was just my crappy old router causing the problems! I've just been too lazy to call them to replace it. I'm going to tweet at them right now.

Edit: Reading the tweet thread, classic interaction: AT&T says "please click this to check your connection", guy replies back and says "I think we know more about networks than you do, please get a network admin in here".

What would you know about production networking, Jeremy?

I'm in the same boat. If the dozens of IT pros who are complaining about this can't get AT&T to swap out a single router card, what hope do most folks have?

Fortunately, this malfunction occurred in the SF Bay Area, so getting AT&T to fix its network is only a matter of time; sooner or later, the right person in the Valley will be alerted.

If it happened to the rest of us somewhere else, we'd probably be out of luck...

It's happening to me too! I bought a new router a couple days ago because I couldn't figure out what else could be causing random, sporadic slowdowns.

I can't even download the page 3 times in a row w/o corruption:

   <     <titde>Example Domain</title>
   >     <title>Example Domain</title>
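The core of the check is tiny. A minimal sketch of the comparison (not the linked gist itself):

```python
import urllib.request

def first_difference(a: bytes, b: bytes):
    # Offset and byte values of the first mismatch, or None if identical
    # (length differences ignored for brevity).
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return (i, x, y)
    return None

def check_once():
    # TLS either delivers the bytes intact or kills the connection outright,
    # so the https copy serves as the reference; any divergence in the
    # plaintext copy was introduced somewhere in transit.
    plain = urllib.request.urlopen("http://example.com").read()
    secure = urllib.request.urlopen("https://example.com").read()
    return first_difference(plain, secure)

# The corrupted fetch above would be flagged at the 'd' in "titde":
assert first_difference(b"<titde>", b"<title>") == (4, ord("d"), ord("l"))
```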

It’s certainly a relief to have an apparent explanation for the weirdness I’ve been experiencing! I thought it was my home router or WiFi or something, but then I experienced the same problems at my partner’s house, so I thought maybe it’s just a problem with some websites I frequent or some common internet infrastructure. But we both have AT&T internet, so this must be it!

For me, the problem has mostly manifested as web pages failing to load or appearing to be loading forever. Generally when I refresh the page would load quickly as expected.

I've also been having this issue with ATT Fiber, but I'm out of Atlanta.

I saw the headline and immediately realized I've noticed similar oddness in the Atlanta area recently. Glad to hear I'm not alone. So far I've been blaming it on weirdness from my devices roaming between mesh router nodes, but now I'm going to run the test script overnight and see if anything turns up.

I've been having problems like that with ATT fiber too.

From Twitter:

> I’m hearing AT&T got their shit together and things are working now. Big thanks to @vikxin and @bmastenbrook for doing the heavy lifting here.


For this type of issue, I would recommend trying the nanog mailing list. You often see network admins ask for someone "with clue" at a different company when they get the runaround by tier 1/2 tech support.



This is absolutely the right way to "informally" escalate things to people that know what they're doing at big ISPs.

I'd recommend most newbies to the list show their work and post what's broken and how you know it isn't your fault. The investigative work on that Twitter thread is top notch and would do the trick in a second.

NANOG and outages@ are the two mailing lists that I've been subscribed to forever and are indispensable if you do operations.

It's ironic that for the most part Silicon Valley only has terrible ISP infrastructure.

Cellular service isn't all that great either.

NIMBYism at its finest. Cupertino did not allow cell phone towers for a very long time. The only one was an ATT tower on the top of Infinite Loop, right on the Sunnyvale border.

The people who would call their provider about bad cell coverage in their house are the same people that would go to city hall and demand that no cell towers be built in the city.

It’s amusing seeing Cupertino city council transcripts about this because the people show up claiming 5G gives them cancer and the city council desperately tries to get them to use better excuses so they can approve denying it.

Makes a change from the city council members usual practice of denying Vallco permits and claiming Apple employees are hiring prostitutes and molesting high school students. (I did not make this up.)

Are these transcripts online?

The ISPs have tried, but they get pushback from local residents whenever they try to install the necessary network boxes at intervals down the street. They had to go out of their way to hide cellular towers as streetlamps to get the 5G rollout to happen; I remember getting two or three notices for this (each from a different telco), so one street corner now has three streetlamps and two traffic lights.

I suspect if the droughts didn't make people get rid of their green grassy lawns homeowners would be more amenable to seeing green network boxes every few houses. It looks awful in the context of concrete sidewalks, though.

PROTIP: The boxes are an excuse. Building infrastructure would make more people want to move there, and claiming the boxes are ugly is something you can’t disagree with.

The parts of SV that haven't buried their power lines (not too pretty to begin with) have gotten significantly uglier since all of the new cable/fiber/DSL infrastructure has gone in. There are in many cases multiple of these things on loads of poles:


The parts of SV that have buried their power lines, probably haven't gotten a whole lot of new cable/fiber/DSL than whatever was there when they buried it.

SV is the perfect place to not bury infrastructure. There's no weather to worry about, and mostly people aren't going to shoot down the fiber to try to claim scrap metals.

I live in one of the few cities in the Bay Area where everything is buried, and it's refreshing not to have stuff hanging all over the place and blocking the view. Now, granted, AT&T fiber is slow to come here, but it's hard to know if the fact that the infrastructure is buried is the main reason. They are in some areas of town but not others.

You can bury conduits, by the way, and not cables or fiber directly. This allows you to avoid digging again just to install fiber, for example. There are ways to do it right. And having the infrastructure on poles is not a panacea either. New providers are not necessarily allowed to use the poles. The weather might not be crazy, but the poles are already overloaded and a little windstorm will disrupt electricity, coax cable, or fiber. And of course, those overloaded poles are crazy ugly.

Both options have trade-offs.

When I used to work in the telecom industry, burying conduit or cable in the ground was anywhere from 3x (bury in some dirt on the side of a county highway) to 20x (directional drilling in a heavily populated city where there's utilities all over the place) the cost per foot compared to hanging it aerially from a labor standpoint.

However, as you correctly point out, there may be restrictions on what you can hang on the poles and where, and oftentimes you'll find poles where it turns out it never should have had the number of attachments it did, but guess who gets to foot a large part of that bill if they want on?

But even then, I've seen absurd lengths gone to in the name of not digging. On Martha's Vineyard, I believe they wound up using a super-special Self-Supporting fiber that could be hung in or near the Power area of a Pole. Yes, that requires a far more trained/well paid worker than normal aerial work. Also, in that region, NESC 250C/D comes into play which makes it even more of a PITA. But it still was far cheaper than putting cable in the ground.

I wonder whether Teraspan or other Vertical Directed Conduit would be a good fit for the bay area (Saw-cut a minimal depth in the street, just lay in a special zip-up conduit for fiber or twisted pair.) If the weather doesn't tend towards large temperature shifts it works well.

Speaking of which, a couple of drawbacks worth noting for buried conduit: you have to go out and do your markings, or pay someone to do them for you when a dig request is made, and you have to be ready to handle the repair when someone inevitably forgets to call or the markings are done incorrectly.

Yes, but SV doesn't have alleyways to run the power lines through, so you have all of these ugly wires everywhere and you're not allowed to have big trees in the front yard, since they could interfere with the power lines. It's really ugly and looks like Baghdad after the war.

Here in Germany we bury everything. I don't even want to imagine how long it takes to rebuild all of that when some drunkard crashes his car into one of these poles or a lightning strike hits.

I grew up in the Florida Keys, where the water table is very close to the surface, so it's impossible to bury anything. When I was a young teenager, we would lose electricity all the time from tree branches brushing up against the power lines. They finally solved the problem by putting the power lines on concrete poles a few stories up! No more squirrels and tree branches disrupting power!

A lot less time than when an earthquake snaps an underground cable.

You'd be surprised, but it takes quite an impact to take down a telephone pole. Most crashes leave the pole standing (see examples online), and even when it does topple, the wires tend to hold until the utility company comes in with a fresh pole. The lightning-strike factor does suck if you live very, very close to the strike and nothing is surge protected. That happened to my house once and fried my Xbox and router.

I've seen it happen personally, except it was the support cable to a pole with a transformer. Cable anchor came out of the dirt, pole fell over, sparks everywhere, transformer fluid all over the street, hazmat cleanup crew, big mess.

I could be wrong, but I'd worry about earthquakes severing the lines.

This map[0] shows the major faults, but each one of those is really a lot of small minor faults that could all snap things and shift. One small fault right by where I grew up is about three blocks long; last I checked it had one earthquake to its name, a magnitude 5.0 aftershock of Loma Prieta.

[0] https://usgs.maps.arcgis.com/apps/webappviewer/index.html?id...

Here in Japan, poles are everywhere except in a few tourist spots, and recovery after a car accident, earthquake, or heavy rain is done almost immediately.

god forbid infra has some infra on it

No, it hasn't made anything uglier - you don't really notice the infrastructure unless you're specifically trying to look for something to get annoyed by.

Maybe you have gone blind to it? When I first landed in Palo Alto (coming from Europe), I couldn't believe my eyes: third-world infrastructure with wires flying everywhere! I most definitely noticed, and still notice. It's symptomatic of some of the ills that plague Silicon Valley.

Overhead power / telecom lines (and cable car power lines) have always been a thing in America. Adding fiber and additional infrastructure hasn't really made it uglier. It would be different if we had buried power lines to start with.

SV residents really do complain about everything, huh? Try living in an east coast city, our aesthetic is a mess.

Solution: Paint the boxes grey.

Other options:

1) Sprout the streetlamp from the box, add a car charging port.

2) Hire an artist to paint the box with artwork (property owner gets to pick from option set). Consider it to be city owned art installations.

Or even some cheap creeping hedges like bougainvillea.

Solution 2: Lift the box up, attach it to the head of the street lamp.

Servicing at the top of the street lamp would be so much easier, right?

Naw man, woodgrain stickers.

Palo Alto has its own municipal fiber program: https://www.cityofpaloalto.org/gov/depts/utl/business/progra...

Wonder if someone should take them up on it.

It’s only for commercial use. As much as I’d love to be able to drop a node in to peer at paix for my house ...

I know people with Palo Alto Fiber running to their house. They were hosting websites out of their garage years ago (I think archive.org for a while.), but today it's just residential, so it's very much possible.

is the cellular deadspot on 101 just north of redwood city still there?

Are you thinking about the area just around the SFO landing zone?

I thought that was less of a "won't" and more of a "can't": between the lack of taller towers and the radio interference zone for the airport, the directional antennas skip a slice there.

This will probably get fixed if 5G becomes a proper thing and there's a lot of micro-cells along the "bay area's biggest parking lot".
