It loads http://example.com and https://example.com and compares the result (should be equal) in a loop, and then reports if it finds a difference. I'm seeing multiple bit flips in the unencrypted version, and having a lot of issues loading web pages, presumably because a corrupted packet in a TLS handshake is an error and the connection dies.
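For anyone who wants to reproduce it, a loop like that is only a few lines of Python. This is a minimal sketch of the idea (not the poster's actual script): fetch the TLS-protected copy once as a baseline, then keep re-fetching the plaintext URL and report any differing byte offsets.

# Minimal sketch of the comparison loop described above (not the original script).
import urllib.request

def fetch(url):
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read()

reference = fetch("https://example.com")  # TLS-protected copy, assumed clean

for attempt in range(1000):
    plain = fetch("http://example.com")
    if plain != reference:
        # lengths assumed equal for the sketch; zip stops at the shorter one
        diffs = [i for i, (a, b) in enumerate(zip(plain, reference)) if a != b]
        print(f"attempt {attempt}: {len(diffs)} differing byte(s) at offsets {diffs[:10]}")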
Tech support, even when Twitter accounts with a lot of followers message them, is completely useless; they just say that they don't show any outages at your location. They need to be flooded with complaints for someone to look at this, or maybe someone from AT&T is on here who can get it looked at...
The problem is that they then assume that all cases are one of those 95% in order to solve the 95% as quickly as possible, which probably looks good to whatever metrics they're tracking.
But if you're one of the 5% you're fucked.
If there's anyone out there designing tech support procedures, you should add an "is this a 5% problem?" question to whatever checklist you give to support staff.
When I was the engineer customer service escalated to, I was damn sure to thank them every time they escalated something. Even the one guy who escalated all the things I'd roll my eyes about in private. At least he was making sure the escalation path worked.
Someone who has taken the time to report an issue is probably one of hundreds or thousands who had an issue and didn't think it could be fixed and shrugged it off. We certainly can't fix everything, but weird network shit like this can be fixed, and it's worth escalating, because when you get it fixed, you can also figure out (hopefully) how to monitor for it, so it doesn't happen again.
OTOH, I didn't work for the phone company. We don't care, we don't have to, we're the phone company. https://vimeo.com/355556831 (sorry about the quality, I guess internet video was pretty lowdef in the 70s :P)
The most memorable one:
Customer service had been getting calls all morning with a peculiar complaint: A customer's phone would ring, and when they answered, the party on the other end didn't seem to hear them. They seemed to be talking to _someone_, but not the party they were connected to. Eventually they hung up. Sometimes, a customer would place a call, and be on the other end of the same situation -- whoever answered would say hello, but the two parties didn't seem to be talking to each other. Off into the void. They'd try again, and it would work, usually, but repeats weren't uncommon.
So everyone's looking at system logs and status alarms and stuff, and what else changed? There were two new racks of echo-cancellers placed in service last night, could that cause this? Not by any obvious means, I mean e-cans are symmetrical and they were all tested ahead of time. There was a fiber cut out by the railroad but everything switched over to the protect side of the ring OK, didn't it? Let's check on that. Everyone's checking into whatever hunch they can synthesize, and turning up bupkus.
Finally around lunchtime, one of the techs bursts into the ops center, going "TIM! I GOT ONE I GOT ONE IT'S HAPPENING TO ME, PATH ME! okay look I don't know if you can hear me, but please don't hang up, I work for the phone company and we've got a problem with the network and I need you to stick on the line for a few minutes while we diagnose this. I know I'm not who you expected to be talking to, and if you're saying anything right now, someone else might be hearing it, but that's why this is so weird and why it's so important YEAH IT CAME INTO MY PERSONAL LINE and that's why it's so important that you don't hang up okay? I really appreciate it, just hang out for a few, we'll get this figured out..."
Office chairs whiz up to terminals and in moments, they've looked up his DN and resolved it to a call path display, including all the ephemera that would be forgotten when the call disconnects. Sure enough, it's going over one of the new e-cans. Okay, that's a smoking gun!
So they place the whole set of new equipment, two whole racks of 672 channels each, out-of-service. What happens when you do that is the calls-in-process remain up, but new calls aren't established across the OOS element. Then you watch as those standing calls run their course and disconnect, and finally when the count is zero, you can work on it. (If you're doing work during the overnight maintenance window, you're allowed to forcibly terminate calls that don't wrap up after a few minutes, but that's verboten for daytime work. A single long ragchew is the bane of many a network tech!) The second rack was empty of calls in _seconds_, and everyone quickly pieced together what that implied -- every single call that had been thus routed was one of these problem calls where people hang up very quickly. This thing had been frustrating hundreds of callers a minute, all morning.
With the focus thus narrowed, the investigation proceeded furiously. Finally someone pulls up the individual crossconnects in the DACS (a sort of automated patch panel, not entirely unlike VLANs) where the switch itself is connected to the echo-cancellation equipment. And there it is. (It's been too long since I spoke TL1 so I won't attempt to fake a message here, but it goes something like this:) Circuit 1-1 transmit is connected to circuit 29-1 receive, 29-1 transmit isn't connected to anything at all. 1-2 transmit to 29-2 receive, 29-2 transmit to 1-1 receive. Alright, we've got our lopsided connection, and we can fix it, but how did it happen in the first place?
If all those lines had been hand-entered, the tech would've used 2-way crossconnects, which by their nature are symmetrical. A 2-way is logically equivalent to a pair of 1-ways though, and apparently this was built by a script which found it easier to think in 1-ways. Furthermore, for a reason I don't remember the specifics of, it was using some sort of automatic "first available" numbering. There'd been a hiccup early on in the process, where one of the entries failed, but the script didn't trap it and proceeded merrily along. From that point on, the "next available" was off by one, in one direction.
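If you want to picture it in modern terms, here's a toy Python model of that failure mode (purely illustrative; the real thing was a provisioning script speaking TL1, and the mechanism above is admittedly half-remembered). One untrapped failure plus a "first available" allocator, and every receive leg after that point is skewed by one:

# Toy model: each circuit should get a symmetric pair of one-way crossconnects.
# One entry fails early, the failure is never trapped, and every
# reverse-direction connect after it ends up shifted by one.
def provision(n_circuits, fail_at=1):
    forward = {}                               # switch circuit TX -> e-can channel RX
    reverse = {}                               # e-can channel TX -> switch circuit RX
    free_rx = list(range(1, n_circuits + 1))   # "first available" switch-receive ports

    for circuit in range(1, n_circuits + 1):
        forward[circuit] = circuit             # 1-1 TX -> 29-1 RX, and so on
        try:
            if circuit == fail_at:
                raise RuntimeError("entry rejected")   # the early hiccup
            reverse[circuit] = free_rx.pop(0)  # 29-N TX -> first available RX
        except RuntimeError:
            pass                               # script proceeds merrily along

    return forward, reverse

fwd, rev = provision(3)
# fwd == {1: 1, 2: 2, 3: 3}   transmit legs all line up
# rev == {2: 1, 3: 2}         receive legs are off by one; channel 1's transmit
#                             is connected to nothing at all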
Rebuilding it was super simple, but this time they did it all by hand, and double-checked it. Then force-routed a few test calls over it, just to be sure. And in a very rare move, placed it back into service during the day. Because, you see, without those racks of hastily-installed hardware, the network was bumping up against capacity limits, and customers were getting "all circuits busy" instead. (Apparently minutes had just gotten cheaper or something, and customers quickly took advantage of it!)
He loved all of that stuff, absolutely hated when everything went to computers. Quit and became a maintenance man at a nursing home, commercial laundry repair guy then finally retired this year in his late 70’s (due to Covid) after working maintenance at a local jail.
I believe the #5 ESS machine itself is always in closed cabinets, so it's likely that what you're remembering was the toll/transport equipment, or ancillary frames. Gray 23-inch racks as far as the eye can see!
Depending on how old that part of the office was, they were likely either 14' or 11'6" tall with rolling ladders in the aisles, or 7' tall and the only place they'd have laddertrack was in front of the main distributing frame.
As for magnetic core, if you could see it mounted in a frame, what you probably saw was a remreed switching grid, which is a sort of magnetic core with reed-relay contacts at each bit, so writing a bit pattern into it establishes a connection path through a crosspoint matrix. It's not used as storage but as a switching peripheral that requires no power to hold up its connections. (Contrast with crossbar, which relaxes as soon as the solenoids de-energize.)
Remreed was used in the #1 ESS (and the #1A, I believe), and is extensively documented in BSTJ volume 55: https://archive.org/details/bstj-archives?&and=year%3A%221...
I just remember thinking it looked awkward getting to the equipment under them.
I don’t know if the ‘5E’ as he called it was actually in operation yet, he ended up moving us all out of state to take a job developing and delivering training material for it...I think that’s what finally broke him lol. Hands on kinda dude.
I’ll have to hit him up later today to see if he remembers ‘remreed’ (he will). Thanks for the info!
At night, traffic was often low enough that you could hear individual call setups and teardowns, each a cascade of relay actuations rippling from one part of the floor to another. The junctor relays in particular were oddly hefty and made a solid clack, twice per call setup if I recall correctly, once to prove the path by swinging it over to a test point of some sort, and then again to actually connect it through. On rare occasion, you'd hear a triple-clack as the first path tested bad, an alternate was set up and tested good, and then connected through.
Moments after such a triple-clack, one of the teleprinters would spring to life, spitting out a trouble ticket indicating the failed circuit.
The #5, on the other hand, was completely electronic, time-division switching in the core. The only clicks were the individual line relays responsible for ringing and talk battery, and these were almost silent in comparison. You couldn't learn anything about the health of the machine by just standing in the middle of it and listening, and anyone in possession of a relay contact burnishing tool will tell you in no uncertain terms, that the #5 has no soul.
> including all the ephemera that would be forgotten when the call disconnects
Interesting to know there is information which is not logged. I’m guessing keeping this info, even for a day, would have helped isolate the issue?
How did the echo cancellers pass testing?
All that was true; the failure happened when it was being taken out of testing config and into operational config. Either nobody considered that that portion could fail, or the urgency to add capacity to a suddenly-overloaded network meant that some corners were cut. (Marketing moves faster than purchasing-engineering-installation...)
Oh, and as to the point about keeping the call path ephemera. Yeah probably, but in a server context, that'd be akin to logging the loader and MMU details of how every process executable is loaded and mapped. Sure, it might help you narrow down a failing SDRAM chip, but the other 99.99999% of the time when that's not the problem, it's just an extra deluge of data.
The funny thing is that if everyone played along they could have had a mean game of telephone going.
Have you heard any of Evan Doorbell's telephone tapes? It's a series of recordings mostly from the 1970s, but with much more recent narration, exploring and sort-of documenting the various phone systems from the outside in. Might be interesting to see what they figured out, and what they didn't :)
I used to play counter strike/Starcraft in my middle school years. I pretty much figured out I had consistent packet loss with a simple ping test. I was on the phone with Time Warner every other day for months. They kept sending the regular technicians, at one point ripping out all of my cable wires in the house to see if it fixed the problem. Nothing worked, I kept calling, at this point I had the direct number to level 3 support. They saw the packet loss too. Finally, after two or three months they send out a Chief Engineer. Guy says I’ll look at the ‘box’ on one of the cable poles down the block. He confirmed something was wrong at that source for the whole area. Then it finally got fixed.
Took forever dealing with level 1 support, and lots of karenesque ‘can I talk to your supervisor please’, but that’s literally what it took.
So yeah, if you want stuff like this fixed, stay professional, never ever curse, consistently ask to speak to the supervisor, keep records, and keep calling.
Small shout out to the old http://www.dslreports.com/ for being a great support community during the early days of broadband for consumer activism in terms of making sure you got legit good broadband.
I ended up just cancelling the service and signing up for one of their competitors instead.
It had happened before and then magically fixed itself a few days later.
One time I had a week of outage, and AT&T basically said the problem was on my side. They could ping the modem, and then punted. I had several truck rolls. The techs were really nice guys, but were basically cabling guys, better at finding a bad cable than debugging packet loss. The problem for me was that my IPv4 static IP addresses would not receive traffic.
I was at wit's end after a week, so I debugged the thing myself. By looking at EVERY bit of data on the router, I found mention of the blocked packets in the firewall log. I would clear all the logs, and found that even with the firewall DISABLED, the firewall log would still see and block incoming packets I was sending from my neighbor's Comcast connection.
I called AT&T, but this time mentioning "firewall is completely off, but packets are blocked by the router and showing up in the log" was concrete enough for them to look up a (known) solution.
The fix was to disable the firewall, but to enable stealth mode. wtf?
To be clear, this was a firmware bug, and caused dozens of calls to AT&T, lots of heartache and finger pointing always in my direction.
I should also mention at the start of this fiasco, I checked the system log and noticed they pushed a firmware update to the modem at the time the problem started. Strangely after one call to the agent, that specific line disappeared out of the log file, but other log entries remained. hmmm.
Since then, they basically screw up my modem every month or two - they push new firmware and new "features" appear (like the one that sniffs and categorizes application traffic like "youtube" and "github"). It also helpfully turns wifi BACK ON when I had disabled it. I immediately go turn it back off, and then they immediately send me a big warning email that my DSL settings have been changed.
This is all because AT&T believes that the edge of their network is not the PON ONT/OLT, but rather, the router they issue you. If you want to be on their network, you have to use their router as some part of the chain.
My latest discovery is that in doing this, the router can actually get super hot operating at gigabit speeds for extended periods of time. When this happens, it magically starts dropping packets. Solution? Aim a fan at the router so it has "thermal management."
Total. Garbage. I'd switch to Spectrum if they had decent upload speed, but alas, they don't in my building.
Another option is getting the 802.1x certificate out of a hacked router, but it's not possible as far as I know on the 5268ac. You could buy a hackable ATT router but they're not cheap. Some sellers even sell the key by itself.
Mysteriously, doing this fixed an issue I previously had where SSHing into AWS would fail.
I'm currently using eap_proxy with my BGW210, and it's been a huge improvement, but I fear the day the device needs to be replaced with a newer model.
However, it has 5Gbit Ethernet, hasn't re-enabled WiFi on automatic firmware updates, and has only screwed with my IP Passthrough configs once which was resolved with a router reboot. (that was possibly my router's fault, it seemed like it was unable to fetch a new DHCP lease)
All you have to do is answer a form with questions like: Do you know how to plug in a computer? Do you know where the power switches are on your devices? etc.
CSA Pre™ is valid for 5 years; you can initiate the renewal process up to 6 months before expiry.
I've had this exact experience, except that it wasn't a dream.
Back in 2010 I had a weird issue where my cable connection would sometimes completely block the connection right after the DHCP response (we had dynamic IPs back then). This would go on for a couple of hours, until the IP lease expired, then my connection would come back. Luckily, I was running an OpenBSD box as my router which allowed me to diagnose the problem. But it was also impossible to explain to the servicedesk employees.
One evening it happened again, and I called the servicedesk, totally prepared to do the 'yes I have turned it off and on again' dance. But to my surprise the employee that I got on the phone was very knowledgable and even said that it was very cool that I had an OpenBSD box as a router. He very quickly diagnosed that someone in my neighbourhood was 'hammering' the DHCP service by not releasing his lease (a common trick to keep your IP address somewhat static). This caused a double IP on the subnet, and the L2 switch to block traffic to my port.
He asked me "do you know how to spoof your MAC with an OpenBSD box?". Then I knew this guy was legit. He instructed me to replace the last 2 bytes of the MAC with AB:BA (named after the music group). They had a separate DHCP pool for MAC addresses in that range. If they ever saw an ABBA mac address on their network, they knew it was someone who had connectivity issues before.
The problem was immediately solved, and I had a rock-solid internet connection for years, with a static IP!
I ended up chatting about networking and OpenBSD a bit, before I (as humble as I could) told the guy I was a bit flabbergasted that someone as knowledgable as him was working on the servicedesk.
It turned out, he was the chief of network operations at the ISP (the biggest ISP in my country). He was just manning the phone while some of his colleagues from the servicedesk were having dinner.
Sometimes miracles do happen.
Anyway I was told on more than one occasion by different telcos that the standard operating procedure for many techs was to take the call, do nothing, and call back 20 minutes later and ask if it looked better because ... often enough it did.
It always was.
The solution could be a priority support tier where you pay upfront for an hour of a real network engineer's time (decently compensated so that he actually cares about solving the problem) and the charge is refunded only if the problem indeed ends up being on the ISP's side. This should self-regulate as anyone wasting the engineers' time for a simple problem they could resolve themselves would pay for that time.
About 20 years ago I got escalated to high-tier Comcast support for an issue that turned out to be a little of A, a little of B: Comcast (Might have been @Home, based on the timing) required that your MAC address be whitelisted as part of their onboarding process. Early home routers had a "MAC Address Clone" feature for precisely this reason. At some point, the leased network card got returned to Comcast. Our router continued to work just fine... until about a year later, when the local office put that network card into some other customer's home. We started getting random disconnects in the middle of the day, and it took forever to diagnose, as the other customer was not particularly active with their internet use. Whose fault was it? Ours, for ARP Spoofing? Theirs, for requiring the spoofing?
That makes sense.
When the cable modem issued a DHCP request, the CMTS would have been configured to insert some additional information (a "circuit-id") into the DHCP request as it relayed it to the DHCP server.
The short version is that the "competent tech" looked at the logs from the DHCP server, which would have showed that the "same" cable modem (i.e., MAC address) was physically connected to either 1) two different CMTS boxes or 2) two different interfaces of the same CMTS.
How would one cable modem be physically present in two different locations at the same time? Obviously, it wouldn't.
At that point, either 1) there are two cable modems with the same burned in address or 2) one of the two cable modems is cloning/spoofing its MAC address. Which one of those is more likely?
(If you're interested in the details, try "DHCP Option 82" as your search term.)
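If you've never seen it, Option 82 is just a little TLV container the relay tacks onto the client's DHCP packet; circuit-id is sub-option 1 and remote-id is sub-option 2. A rough Python sketch of building and reading one (the values are made up, and real CMTS circuit-id formats vary by vendor):

# Rough sketch of DHCP Option 82 (relay agent information): a TLV of
# sub-options, where 1 = circuit-id and 2 = remote-id.  Values are made up.
def build_option82(circuit_id: bytes, remote_id: bytes) -> bytes:
    subs = (bytes([1, len(circuit_id)]) + circuit_id
            + bytes([2, len(remote_id)]) + remote_id)
    return bytes([82, len(subs)]) + subs        # option code, length, sub-options

def parse_option82(opt: bytes) -> dict:
    assert opt[0] == 82
    body, out, i = opt[2:2 + opt[1]], {}, 0
    while i < len(body):
        code, length = body[i], body[i + 1]
        out[code] = body[i + 2:i + 2 + length]
        i += 2 + length
    return out

opt = build_option82(b"cmts01/cable3/0/U2", b"aa:bb:cc:dd:ee:ff")
print(parse_option82(opt))   # {1: b'cmts01/cable3/0/U2', 2: b'aa:bb:cc:dd:ee:ff'}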
For instance: "This might just be a loose cable, can you unplug each end of it, and plug it back in?" invariably elicited an "I've already done that", or a brief pause followed by "okay, there, I did it, do you see it?". Lies, often.
But: "Alright, I want you to try turning the cable around for me, yeah swap it end for end. Sometimes you get a connector that just doesn't fit quite right, but it works the other way around and it's faster than sending a tech to replace it", would often get a startled "Oh! I hadn't thought of that, one moment..." and then the customer actually DOES unplug the thing, and what do you know, they click it in properly this time.
Then one day, she tells me that she tells people to unplug the power cord, cup the end in their hand, and then plug it back in.
Suddenly, I really liked her. It's a genius move that makes them think they've done something obscure, but she really just wanted them to actually check the cable.
Luckily, I already knew how to fix them. I found the job to be a cake walk and quite liked helping people. But I had to listen to people around me fumble through it.
It was frustrating at the time, but my favorite thing that happened was that I was admonished twice for having average call times that were too low. To them, that's a warning that someone is just getting people off the line without fixing their problems.
They monitored calls and they said I'd never received a complaint, but the system would keep flagging me for low call times so I had to artificially raise them. They suggested that I have a conversation with the clients.
I didn't, and I didn't stay there much longer, but it was quite a crazy situation. But I also felt much less pressure to handle calls quickly after that, too, which was nice.
Everything they do is driven by metrics, and their contracts are written around KPIs like maintaining a ludicrously short average time to answer, short handle times, and open/closed ticket ratios. If they do not hit these metrics, then the outsourcing company owes service credits. The incentives do not align with making customers happy and solving their problems. Everything is geared towards deflecting users with self-help options, simple scripts for the agents, and walking the line of hitting those metrics with the fewest number of warm bodies.
It's a pretty hellish business.
For the last several years, I've gotten my mobile service through an MVNO named Ting. On the rare occasion that I've needed to call support, there's no IVR, just a human who typically answers on the second ring. They speak native English, and have never failed to solve my problem either immediately or with one prompt escalation.
They're so jarringly competent I wonder how they still exist, if being obnoxiously incompetent is apparently a business requirement.
... only if you want people to actually be able to escalate. Suspect AT&T are going to lose a lot less money over this issue than hiring an extra high quality support person would cost
For evidence, see: “Retention departments give you the best rates if you threaten to cancel” (which then caused companies to rename the retention departments and change the policies).
$ diff example-*
< <titde>Example Domain</title>
> <title>Example Domain</title>
< backoround-color: #f0f0f2;
> background-color: #f0f0f2;
< box-shadow: 2pp 3px 7px 2px rgba(0,0,0,0.02);
> box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);
< <p>This domain is for use in illustrative examples in documents. You may usm this
> <p>This domain is for use in illustrative examples in documents. You may use this
So either the bit corruption is such that it is not detected by the checksum,
or ATT is doing something nefarious that touches layer 4 and corrupts the data while doing so.
With enough packets passing the dodgy RAM a noticeable number will manage to get mangled in such a way that the checksum is still correct.
Routers don’t touch the TCP header at all
But in any case it's irrelevant in this context as ATT shouldn't be doing any NAT
Another quality product by Apple ;P
That tweet in particular doesn't show any retransmits.
tcpdump/wireshark gets a little hard to read at times, especially when the packet dump is a lie: all those packets marked red for bad checksums are from the dumping machine, and the checksums are "wrong" because the NIC fills them in after the capture point, so the capture interface doesn't get to see what they are. Perhaps the other people in the thread who said mac os was ignoring bad checksums were also confused; or perhaps it does ignore bad checksums, it's pretty bad at networking (it can't handle a synflood in 2020 because it's got synhandling code from 2000)
I wouldn't have expected Apple to purposefully break the checksum, just as I don't think they purposefully have no synflood protection; they pulled the TCP/IP stack at the turn of the century and never pulled in the many, many upgrades from upstream (although they did add on MP-TCP, so there's that). I wouldn't be surprised if tcp checksums had stopped working ages ago, possibly because of an aggressive driver, and nobody noticed. Kind of like how if you spawn a few thousand threads that just sit around sleeping, it will delay watchdog kicks and the kernel will panic. (also, from reports on here, not personal experience)
>>> p1 = IP(dst="192.168.mac.ip")/TCP(dport=1984,sport=20001)
>>> p2 = IP(dst="192.168.mac.ip")/TCP(dport=1984,sport=20002)
###[ IP ]###
version = 4
ihl = 5
tos = 0x0
len = 40
id = 1
frag = 0
ttl = 64
proto = tcp
chksum = 0xf905
src = 192.168.linux.ip
dst = 192.168.mac.ip
###[ TCP ]###
sport = commtact_http
dport = bb
seq = 0
ack = 0
dataofs = 5
reserved = 0
flags = S
window = 8192
chksum = 0xb836
urgptr = 0
options = 
>>> p2[TCP].chksum = 0xb836 ^ 0x8 # mangled checksum
>>> sr1(p1, timeout=1)
Finished sending 1 packets.
Received 5 packets, got 1 answers, remaining 0 packets
<IP version=4 ihl=5 tos=0x0 len=44 id=0 flags=DF frag=0 ttl=64 proto=tcp chksum=0xb902 src=192.168.mac.ip dst=192.168.linux.ip |<TCP sport=bb dport=microsan seq=1671494800 ack=1 dataofs=6 reserved=0 flags=SA window=65535 chksum=0x6039 urgptr=0 options=[('MSS', 1460)] |>>
>>> sr1(p2, timeout=1)
Finished sending 1 packets.
Received 494 packets, got 0 answers, remaining 1 packets
0 and 0 are still 0 and 0,
1 and 1 are still 1 and 1,
but 0 and 1 become 1 and 1 (or 0&0)
and 1 and 0 become 1 and 1 (or 0&0).
I remember seeing this wackiness when I bridged two address lines on an EEPROM with the tiniest amount of solder.
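To make the aliasing concrete, here's a tiny sketch of what a solder bridge between two address lines does (assuming the bridge effectively wire-ORs the lines; as noted above it could just as well behave like an AND):

# Two bridged address lines: the chip sees the same merged value on both,
# so any address whose two bridged bits differ gets aliased.
def seen_by_eeprom(addr, bit_a=0, bit_b=1):
    a = (addr >> bit_a) & 1
    b = (addr >> bit_b) & 1
    merged = a | b                                   # the solder bridge
    addr &= ~((1 << bit_a) | (1 << bit_b))
    return addr | (merged << bit_a) | (merged << bit_b)

for requested in range(4):
    print(f"requested {requested:02b} -> chip sees {seen_by_eeprom(requested):02b}")
# 00 -> 00, 01 -> 11, 10 -> 11, 11 -> 11  (the truth table above)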
RoHS/tin whiskers strikes again?
Not just in the handshake: TLS moves these things called TLSPlaintext records (about 16 kbytes each), not only during the handshake but also for all the actual data - and they'll always have integrity protection to ensure bad guys can't change anything. TLS can't know the difference between a bad guy tampering with data and your crappy Internet mangling the data in transit; in either case the TLS protocol design says to "alert" bad_record_mac, which your browser or similar software will probably treat as a failed connection, even if it happens mid-way through an HTTP transaction.
Because TLS guarantees integrity even if your fiber is a complete shit show, any TLSPlaintext records which do get from one end to the other are guaranteed to be as intended.
If you aren't interested in why this works, the Introduction to the RFC explains the intent; specifically, what we care about here is that TLS delivers:
"Integrity: Data sent over the channel after establishment cannot be modified by attackers without detection."
And further notes "These properties should be true even in the face of an attacker who has complete control of the network".
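You can see the same behaviour in miniature with the kind of AEAD cipher TLS records use: flip one bit of the protected record and decryption refuses outright instead of handing back mangled plaintext. (Sketch only; it uses the third-party `cryptography` package as a stand-in for what the TLS record layer does internally.)

# AES-GCM refuses to decrypt anything whose authentication tag no longer matches,
# which is roughly what turns a flipped bit into a bad_record_mac alert.
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM
from cryptography.exceptions import InvalidTag

key, nonce = AESGCM.generate_key(bit_length=128), os.urandom(12)
aead = AESGCM(key)

record = aead.encrypt(nonce, b"<title>Example Domain</title>", None)
flipped = bytearray(record)
flipped[5] ^= 0x08                       # a single bit flip "on the wire"

try:
    aead.decrypt(nonce, bytes(flipped), None)
except InvalidTag:
    print("bad_record_mac: the connection gets torn down")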
So yes, it is blowing up all sorts of traffic everywhere. You just don't notice when it is plaintext.
Edit: Actually I take that back, I have seen one of the issues where sometimes web sites will not load at random, and a reload fixes the issue.
What's a "DNS prop"?
Nice find! I had been wondering why I had been seeing odd TLS failure messages recently.
The files differ after anywhere from 1sec to 5sec for me, and it's always the same character positions, and it always seems to be the same lines, and the same number of lines.
My ATT modem (Arris bgw-210-700) may have gotten a firmware upgrade recently as I found some settings I shut off got re-enabled. I use it for VDSL but the same model family is used for ATT fiber.
It has to be a special kind of hardware that does much deeper packet inspection (DPI) to recalculate TCP checksums, usually used for spying, throttling, censorship, injecting ads, injecting exploits, etc., but not merely routing/forwarding packets.
There is no reason for it to touch the TCP header.
And yet, an unfortunate number of L2 switches do exactly that. :(
Can you name any? Just curious
The problem is routers slow down the high speed serial signals from fiber by splitting the bits over a large number of slower speed signals internally. Often those wider busses are a multiple of 16 bits. For example, one ASIC I know of moves things around in 204 byte chunks. (Might have been 208, been a while.) Anyway, the problem is that if there is a defect in one of those parallel elements it will always flip bits in the same offset position mod 204 bytes, which is the same position mod 16 bits. If the hardware is degraded enough, it can end up flipping two bits in the same position, and that has a fairly good chance of passing the checksum.
Ethernet has proper CRCs on packets, which is a lot less vulnerable to shenanigans like this, but unfortunately those can end up being checked on the way in, discarded, and then re-generated on the way out of a router. If anything is corrupted in the middle of the switch ASIC, nothing notices and it passes along. I once helped troubleshoot an issue in our network where a BGP packet was corrupted in this way. The flipped bits ended up causing a more specific route to be generated, and we had the world's weirdest BGP route hijack within the bounds of our own data center.
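To make the "passes the checksum anyway" part concrete: the TCP/IP checksum is just a 16-bit ones' complement sum, so a bit flipped from 1 to 0 in one 16-bit word and the same bit position flipped from 0 to 1 in another word cancel exactly. A quick sketch (RFC 1071-style sum, made-up payload):

# RFC 1071-style Internet checksum: a 16-bit ones' complement sum.
def inet_checksum(data: bytes) -> int:
    if len(data) % 2:
        data += b"\x00"
    total = sum(int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

payload = bytearray(b"<title>Example Domain</title>\x00")
before = inet_checksum(payload)

# Two flips in the same bit position of the same byte lane: '<' (0x3C) loses
# bit 3 while the space (0x20) gains it, so +0x0800 and -0x0800 cancel out.
payload[0] ^= 0x08
payload[14] ^= 0x08

print(before == inet_checksum(payload))   # True: corrupted data, same checksum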
Only in certain areas! Within San Francisco on overhead-cabled blocks (for example) their lines are their own.
Has anyone claimed to see these flipped bits on a domain other than example.com?
See e.g. https://gist.github.com/bmastenbrook/14c0e22fc02b95d4a48f82d...
Given that (it seems) only AT&T customers are complaining, and (it seems) it only affects servers on EdgeCast.
They are on Facebook and LinkedIn, you have seen them.
Even explaining the issue is hard. It's not an outage, my internet isn't out, it's intermittently wrong. The phone support agents aren't prepared for this, and I can't find any way to escalate or speak to a network engineer.
I feel like if I spoke to the right engineer, there'd be a ticket on this and they'd roll a truck to their facilities or the IPX within an hour. It's a major network issue to flip bits, it's costing them bandwidth with retransmits and could be breaking SLAs with their business customers.
On the phone the most they could do was roll a truck to me.
I was once having some packet loss issues and called their support line. I assume the company was really small at the time, because the guy who answered the phone was clearly a network engineer who knew the system inside and out. I read him a couple of traceroutes over the phone and we resolved the issue within minutes.
I have never experienced such perfect tech support before or after that with any other ISP.
I had throughput issues, first in my apartment. That got resolved via Reddit. Then there was backbone slowness from time to time, which persisted. Sometimes I get a gigabit, other times I only get 10-20 Mbps.
What did Wave say? Let's bring a tech out. I told them "it's a backbone issue" and they didn't understand me. I don't know how I would explain how networking works to a field tech who only knows how to enable a port or run basic diagnostics.
Another time, I tripped Port Security, and told them they can re-enable my port. Wave said "we need to bring a tech out" as if that will magically solve everything.
This is made worse by my apartment's exclusivity deals: only Comcast and Wave, nothing else. Ziply Fiber (bought Frontier) and Atlas Networks were denied.
Bigger companies like Verizon FiOS in NYC had better tech support, I could still get hold of a Level 2 tech for a much smaller issue (and not even on FiOS).
Wave G makes AT&T's forced router (but not the flipping bits) seem decent in comparison. I don't want Comcast but I'll happily take AT&T Fiber over Wave G even if it means trading my pfSense box for a crappy AT&T gateway (worst case, I could bypass or root the gateway, or pay $15 for static IPs).
I fortunately moved (for another reason) and have Google's Webpass. Webpass may not give me a full Gigabit, but it's usually 400-800 Mbps all the time and not 10-20 Mbps 95% of the time with an occasional Gigabit. I wish CondoInternet sold to Webpass instead of Wave (I don't want a merger now, but still).
Surprisingly, I may have preferred CenturyLink Fiber if available mainly for GPON over a 60GHz PtP microwave link (less oversubscription!), well unless PPPoE+6rd kills my pfSense. But Webpass works pretty darn well, and I think I'll stay unless CenturyLink suddenly gives me 2 Gbps FTTH or something, or I move.
Wave G's support team also doesn't even realize that the CondoInternet network they acquired provides IPv6, and when asked about IPv6-related issues they just say "oh we don't have IPv6 yet" which is nonsense. I really miss the local CondoInternet support people. They were amazing.
Wave is probably one of the most breathtakingly "if it works, use it" companies I've ever seen. I'm not sure any other ISPs truly even compare, simply in the breadth of equipment they use combined with how poorly they operate.
Wave operates in parts of Washington, Oregon and California. They've mostly grown by acquiring smaller, unprofitable or mis-managed ISPs in the areas they now own.
Wave offers TV, Internet and Phone services. Unlike a typical ISP like Comcast or Frontier, however, they don't just offer one or a couple of methods of service delivery.
For TV service, Wave offers
- Analog TV (mostly areas in California)
- Digital TV in most other areas
- TiVO and CableCARD services (they don't have the purchasing power to get boxes from companies like Motorola/Arris or Cisco/Pace)
For telephony service, they not only offer VoIP services but in some areas even offer regular PSTN phone lines!
For internet, Wave offers its "Wave" DOCSIS 3.0 (and in many areas, still 2.0) services. They also own what was originally CondoInternet
Finding out how CondoInternet/Wave G operates was probably one of the most horrifying things I've ever seen in my years of telco work. I'll try to explain it from the ground up since it's very much a jenga tower of terribleness
Condo Internet in its inception had a very uphill battle. They wanted to target expensive Seattle condo buildings and sell a "premium" product. However, in very Seattle fashion, they were met with very indifferent, if not outright hostile, actors when it came to their plans. So they made do with what they could get.
Condo Internet's services comprise a hodgepodge of VDSL2, point-to-point wireless, fiber-optic and MoCA. Effectively, whatever they could wire into the building or appropriate for use, they did. This is why some apartments can get symmetric gigabit, while others can only get 100 megabit.
MoCA was initially the most horrifying one I encountered while working there. MoCA is effectively Ethernet running over Coaxial cables. Except since Coax is a shared medium, it's just like the ethernet hubs of the 1990's all over again.
The main reason this was done was that it was considered cheaper than installing an HFC node or CMTS. They didn't know how many customers they would get to switch over, so they played their cards extremely conservatively.
Apartments with MoCA configured would have a (managed) gigabit switch or two in the basement for link back to the Condo PoP. Whichever vendor was cheapest at time of purchase (Cisco, Juniper etc)
These switches would each be connected to an individual MoCA adapter, connected to one of the cable drops going to each individual apartment/floor/whatever. The field tech would then install an accompanying MoCA adapter in the customer's home (simply calling it a "cable modem") and connect it to a Wave provided router (typically a TP-Link Archer C7)
Condo/Wave would typically offer symmetric 100 megabit on these lines, though the ability for more than a few customers on each "MoCA node" (for lack of a better term) to saturate them was much more limited
Another "fun" feature was that both the MoCA devices and switches they were connected to were run without any sort of VLAN'ing at all. If a customer accidentally plugged the MoCA link into the LAN port on their router, it would happily hand out DHCP leases to the entire building!
As I found out, the reason they don't use VLAN'ing is that their NOC staff are almost entirely customer service reps who were "upskilled" to handle NOC tasks (gaining a fixed $0.50 an hour bonus, hooray!). Wave's NOC handles roughly 90% of WaveG calls (I'd guess because Wave doesn't make very much money?)
One other fun anecdote:
WaveG service has a lot of users from overseas set up VPN's for their parents to watch Netflix on. Netflix's internal algorithms for the longest time would detect this behaviour and automatically flag Wave's entire IP ranges as a "proxy or VPN provider", knocking out roughly 500,000+ internet customers from using Netflix for several hours or even days. This would cause their phone support to effectively melt down, with the robotic queue time projecting at roughly 5-6 hours or more
Scale ruins most things, unfortunately.
For me it was night and day after previously having had an absolutely garbage municipality-run ISP (CityCable - for anyone considering them, stay far far away). Init7 might be a tad more expensive than most, but the service is solid.
Or perhaps even worse than that, maybe they considered it, decided it was possible, but just don't care because of their insane borderline monopoly.
I don't understand how internet companies provide such consistently awful service.
Slightly off topic story: I've recently changed to another provider called Starry, and they force you to have a second router in front of your own router, which they claim "decodes" their stream from the modem. I don't know the real reason but I'm pretty sure that's not it. If you plug their modem directly into a non-Starry router, the router just doesn't detect a connection.
One day, I tried to torrent something, and my internet would immediately get throttled to 0mbps. After investigating I found out that their router had a custom OS which hid a firewall and various security settings. Amusingly you could still access those settings if you just manually entered the page names into the address bar. Now all their stupid settings are disabled and I just feel badly for all the folks who use their service and don't have the savvy I do to actually get what they're paying for.
You install a CA from a jailbroken modem into a supplicant container that runs on the UDM pro. It confirms to the network that you are using "authorised" equipment for the connection and the packets flow!
While you can do this and things will generally work, AT&T restricts all of their residential gateways from operating in a true passthrough/bridge mode to another router. So you end up with double NAT and all the joys that entails. There are also a number of other issues that have been associated with operating in their faux-passthrough mode, including
- Issues with IPv6 prefix delegation
- Sporadic latency spikes (an issue in general, that you inherit since the gateway is still "doing" everything it normally would, since it won't actually act as a true passthrough/bridge)
- A firmware update capped throughput at 50Mbps (later fixed in another firmware update)
- Firmware updates tend to silently re-enable the built-in wifi radios
So while it'll generally work, it ends up problematic. You inherit all of the performance issues associated with just using the gateway as your all in one modem/router/firewall/AP/gateway, plus the addition of double NAT, plus the sharp edges of their poorly implemented faux-passthrough modes, plus the ever-present concern that you're one firmware update away from a non-working network despite having used their official passthrough configuration.
Hence why gateway bypasses are so popular. Even if they're a bit involved to set up, once you get it working things just... work. With little if any upkeep (potentially a few minutes after a power outage, depending on the bypass method you implement).
But my main reason is actually the gigantic size of the residential gateway box. I mounted the ONT, UDM pro and PoE switch on a wall in a closet and the RG just took up too much space.
A weird side effect of this is that I'm not using the 192.168.x.x range like usual (because that's what theirs is using), but instead the 10.0.x.x range
Edit: to go into more detail, their router is acting as a NAT for your public IP, giving you your first subnet, and then your router is getting a single IP on that subnet and creating a NAT where your devices all get IPs. In a bridge there would only be 1 IP space behind a single NAT. In your case with a double NAT a lot of consumer things might not work (like UPnP) and port forwarding would require you to add rules to both routers.
Thank you for the insight and lesson!
I've pulled a variation of that on CSRs at least once, and surprisingly, it can work. Just be cordial, preempt the typical IT Support stuff they always ask, DO NOT say its intermittent (initially, to the front line CSR; if given a chance to expand the issue after escalation, then add that bit), and get technical ASAP (it doesn't hurt to throw in some parallel industry jargon). Basically, build a case where even the information you're giving them is beyond a first-line CSR playbook, and they have to escalate.
"Hi there; I've been observing some erroneous TCP packet bit flipping on HTTP requests which route through one of AT&T's data center in Oakland. I've tried restarting my computer, I'm seeing the same thing on my phone, and I actually swapped my router out for a spare one I have, but its still an issue."
(that last sentence exhausts literally every playbook a front-line CSR has. it sounds so easy, right? there are four variables in any front-line CSR diagnostic equation: their network, your router, wifi/ethernet, and the endpoint. you just crossed off three of the four variables in one sentence).
(Wait, a data center in Oakland? How do you know this? You can tracert a bad request and geolocate the first IP outside your network, but let's be realistic: you don't. You're fronting; demonstrating knowledge that a front-line CSR can't disprove. You may think this is misleading to whoever this gets escalated to, but it isn't; their tools are FAR more advanced than yours, and they're used to 99% of customers being incorrect idiots, so they're going to be validating and reconfirming every word you say anyway.)
Ron's Parks & Rec example above is crass. But here's the magic bit: frontline CSRs generally look for an excuse to escalate, you just need to give them enough CYA to check their job as done, and the higher tier CSRs/network engineers will love you for actually knowing what you're talking about. Its a win-win; be cordial, be forceful, strut what you know.
I got the card eventually but now I cannot create an online account with it. I called Chase, got transferred 5 times, and then was told I would need to go to a physical bank to verify my identity? to create an account. Not one of them had any clue what "a broken account exists associated with this card in your database, I can guarantee it, forward me to your technical support team" meant, but that's all above a bank rep's pay grade.
The nearest Chase bank is 1.5 hours away, by the way. Probably just going to cancel the card after cashing out the sign up bonus.
"Ok sir, please click the start button, then the power button, and finally click the restart button to restart your computer..." (and they refuse to budge until you've swapped out your router yet again, because you didn't do all that while you were on the phone with them)
AT&T (though applies to lost of companies) probably need a "unicorn" role of a very technical person that is paid as such, but able to interface with customers on specific highly technical issues.
When I worked for a VAR I could upload logs to Cisco and get experimental patches back. Call up HP, tell them I want an RMA, and they’d just do it. Night and day compared to what consumers get.
I would use traceroute to find a common bad point for everyone. It is also possible that the network element where the problem occurs is invisible to traceroute, as it could be part of a provider network (probably MPLS), but at least the common ends of the tunnel would be visible.
The fact that it is at a specific interval indicates a stuck bit in memory.
Some good previous public stories about such incidents
When implementing this functionality, the naive hardware designer will strip the existing CRC from the packet, modify the contents of the packet and then reuse the handy dandy CRC calculation block to place a newly calculated CRC on the packet. Similar choices are made for the adjustment of the IP/TCP/UDP checksums. If any errors are introduced in the contents of the packet by the data path before the new CRC is calculated, this results in the CRC being "corrected" to include the erroneous data.
A far more understanding hardware designer will instead calculate how to adjust the CRC by the changes introduced in the packet contents. Sadly, this is far more complicated to get right, and it goes against the drive of hardware designers to reuse blocks of code wherever possible. Every hardware designer working on networking has a block of Verilog or VHDL code to calculate and append a CRC to a packet. Only the most dedicated will attempt to apply only the delta needed to the CRC or checksum.
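A toy version of the difference, with zlib's CRC-32 standing in for the Ethernet FCS: the "strip and recompute" style device happily stamps a fresh, valid CRC onto data that got corrupted inside it, so only a check against the original, end-to-end CRC notices anything.

# zlib.crc32 standing in for the Ethernet FCS.
import zlib

frame = b"<title>Example Domain</title>"
original_crc = zlib.crc32(frame)

# Somewhere inside the switch ASIC, a bit gets flipped...
corrupted = bytearray(frame)
corrupted[4] ^= 0x08                 # turns "<title>" into "<titde>", like the diffs above
corrupted = bytes(corrupted)

# Naive egress path: throw away the old FCS and stamp a freshly computed one.
egress_crc = zlib.crc32(corrupted)

print(egress_crc == zlib.crc32(corrupted))    # True: frame looks perfectly valid downstream
print(egress_crc == original_crc)             # False: only an end-to-end check catches it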
For people like me who aren't smart enough to figure it out on their own, this stackexchange answer seems to explain how it's done: https://cs.stackexchange.com/questions/92279/can-one-quickly...
On the application side the effects were quite bad, as the data was mainly XML and, depending on where the bit was flipped, it could impact the data or the XML structure. The data had its own CRC/hash, so the packets were cleanly rejected by the application. Unfortunately the XML library from the message queue engine and the ESB we were using did not like it at all when the bit flipping occurred in the XML tags (it seems fuzzing tests were not done at that point), so the message processing got stuck and we kept getting bad messages in the queues. Even worse, the queues could not be cleaned with the normal procedures because the application wanted to first display info about the messages inside - and that failed.
The network debug was non-trivial because of that header consistency - the network devices did not report any kind of packet issues, so we had to sniff the different network segments to identify the culprit. From the application point of view, we had to delete the whole message queue storage to get rid of the bad messages, and let the application handle the rest (luckily it was designed with eventual consistency and self-healing).
There was an escape hatch, but the conditions to hit it were a bit complex. Implementing new message filtering of this kind at 2AM while the system was down was not feasible.
How do you mean? I use traceroute from time to time but I’m not sure how it would apply in a case like this. Feel free to elaborate :)
Edit to make the comment more useful: If anyone is curious, look up "ECMP hashing." There are probably tons of parallel paths through AT&T's network, and to narrow down to the hardware causing problems, they will need to identify which specific path was chosen. Hardware switches packets out equally viable pathways by hashing some of the attributes of the packet. Hash output % number of pathways selects which pathway at every hop.
Hardware does this because everyone wants all packets involved in the same "flow" (all packets with the same src/dest IP and port and protocol (TCP)) to deterministically go through the same set of pipes to avoid packet re-ordering. If you randomly sprayed packets, the various buffer depths of routers (or even speed of light and slightly different length fibers along the way) could cause packets to swap ordering. While TCP "copes" with reordering, it doesn't like it, and older implementations slowed way down when it happened.
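A back-of-the-envelope version of that selection (illustrative only; real ASICs use their own hash functions and per-box seeds, and the addresses here are made up): hash the 5-tuple, take it mod the number of equal-cost links, and note that the same flow always lands on the same link while a new source port may land somewhere else entirely.

# Illustrative ECMP path selection: hash the flow 5-tuple, pick one of N links.
import hashlib

def pick_path(src_ip, dst_ip, src_port, dst_port, proto="tcp", n_paths=8):
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:4], "big") % n_paths

# Same flow always rides the same path (no reordering)...
print(pick_path("198.51.100.7", "93.184.216.34", 50123, 443))
print(pick_path("198.51.100.7", "93.184.216.34", 50123, 443))
# ...but a new source port can hash onto a different (possibly broken) link.
print(pick_path("198.51.100.7", "93.184.216.34", 50124, 443))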
They rolled this back when folks complained, but I wonder if the relevant infrastructure is still sitting around and mangling packets.
I would probably expect this to be some network card or cable or connector is failing though...
Almost certainly 220.127.116.11
It’s not impossible that people from AT&T read messages posted there.
You people are amazing!
My mobile.twitter.com traceroute prefers going through that path, as does en.wikipedia.org (both of which have sucked for me) while a Google route (to 18.104.22.168) hops through 22.214.171.124.
Edit: Reading the tweet thread, classic interaction: ATT says "please click this to check your connection", guy replies back and says "I think we know more about networks that you do, please get a network admin in here".
I'm in the same boat. If the dozens of IT pros who are complaining about this can't get AT&T to swap out a single router card, what hope do most folks have?
If it happened to the rest of us at somewhere else, we are probably out of luck...
I can't even download the page 3 times in a row w/o corruption
< <titde>Example Domain</title>
> <title>Example Domain</title>
For me, the problem has mostly manifested as web pages failing to load or appearing to be loading forever. Generally when I refresh the page would load quickly as expected.
> I’m hearing AT&T got their shit together and things are working now. Big thanks to @vikxin and @bmastenbrook for doing the heavy lifting here.
I'd recommend most newbies to the list show their work and post what's broken and how they know it isn't their fault. The investigative work on that Twitter thread is top notch and would clear that bar in a second.
NANOG and outages@ are the two mailing lists that I've been subscribed to forever and are indispensable if you do operations.
Cellular service isn't all that great either.
The people who would call their provider about bad cell coverage in their house are the same people that would go to city hall and demand that no cell towers be built in the city.
Makes a change from the city council members' usual practice of denying Vallco permits and claiming Apple employees are hiring prostitutes and molesting high school students. (I did not make this up.)
I suspect that if the droughts hadn't made people get rid of their green grassy lawns, homeowners would be more amenable to seeing green network boxes every few houses. It looks awful in the context of concrete sidewalks, though.
SV is the perfect place to not bury infrastructure. There's no weather to worry about, and mostly people aren't going to shoot down the fiber to try to claim scrap metals.
You can bury conduits, by the way, and not cables or fiber directly. This allows you to avoid digging again just to install fiber, for example. There are ways to do it right. And having the infrastructure on poles is not a panacea either. New providers are not necessarily allowed to use the poles. The weather might not be crazy, but the poles are already overloaded and a little windstorm will disrupt electricity, coax cable, or fiber. And of course, those overloaded poles are crazy ugly.
When I used to work in the telecom industry, burying conduit or cable in the ground was anywhere from 3x (bury in some dirt on the side of a county highway) to 20x (directional drilling in a heavily populated city where there's utilities all over the place) the cost per foot compared to hanging it aerially from a labor standpoint.
However, as you correctly point out, there may be restrictions on what you can hang on the poles and where, and oftentimes you'll find poles where it turns out it never should have had the number of attachments it did, but guess who gets to foot a large part of that bill if they want on?
But even then, I've seen absurd lengths gone to in the name of not digging. On Martha's Vineyard, I believe they wound up using a super-special Self-Supporting fiber that could be hung in or near the Power area of a Pole. Yes, that requires a far more trained/well paid worker than normal aerial work. Also, in that region, NESC 250C/D comes into play which makes it even more of a PITA. But it still was far cheaper than putting cable in the ground.
I wonder whether Teraspan or other Vertical Directed Conduit would be a good fit for the bay area (Saw-cut a minimal depth in the street, just lay in a special zip-up conduit for fiber or twisted pair.) If the weather doesn't tend towards large temperature shifts it works well.
Speaking of which, a couple of drawbacks worth noting for buried conduit: you have to go out and do your markings, or pay someone to do them for you when a dig request is made, and you have to be ready to handle the repair when someone inevitably forgets to call or the markings are done incorrectly.
This map shows the major faults, but each one of those is really a lot of small minor faults that all could snap things and shift. One small fault right by where I grew up is about three blocks long. Last I checked, it had one earthquake to its name: a magnitude 5.0 aftershock of Loma Prieta.
1) Sprout the streetlamp from the box, add a car charging port.
2) Hire an artist to paint the box with artwork (property owner gets to pick from option set). Consider it to be city owned art installations.
Wonder if someone should take them up on it.
I thought that was less of a "won't" and more of a "can't": between the lack of taller buildings for towers and the radio interference zone for the airport, the directional antennas skip a slice there.
This will probably get fixed if 5G becomes a proper thing and there's a lot of micro-cells along the "bay area's biggest parking lot".