
> If there's anyone out there designing tech support procedures, you should add an "is this a 5% problem?" question to whatever checklist you give to support staff.

When I was the engineer customer service escalated to, I was damn sure to thank them every time they escalated something. Even the one guy who escalated all the things I'd roll my eyes about in private. At least he was making sure the escalation path worked.

Someone who has taken the time to report an issue probably stands in for hundreds or thousands of others who hit the same thing, figured it couldn't be fixed, and shrugged it off. We certainly can't fix everything, but weird network shit like this can be fixed, and it's worth escalating, because when you get it fixed, you can also (hopefully) figure out how to monitor for it, so it doesn't happen again.

OTOH, I didn't work for the phone company. "We don't care, we don't have to, we're the phone company." https://vimeo.com/355556831 (sorry about the quality, I guess internet video was pretty low-def in the '70s :P)




Ex-phone-company here. (Is this the party to whom I am speaking?) I was in installation, but hung out with a lot of the ops crew, and they LOVED interesting problems. The trouble was getting such problems to the ops people in the first place. Good people, bad process.

The most memorable one:

Customer service had been getting calls all morning with a peculiar complaint: A customer's phone would ring, and when they answered, the party on the other end didn't seem to hear them. They seemed to be talking to _someone_, but not the party they were connected to. Eventually they hung up. Sometimes, a customer would place a call, and be on the other end of the same situation -- whoever answered would say hello, but the two parties didn't seem to be talking to each other. Off into the void. They'd try again, and it would work, usually, but repeats weren't uncommon.

So everyone's looking at system logs and status alarms and stuff, and what else changed? There were two new racks of echo-cancellers placed in service last night, could that cause this? Not by any obvious means, I mean e-cans are symmetrical and they were all tested ahead of time. There was a fiber cut out by the railroad but everything switched over to the protect side of the ring OK, didn't it? Let's check on that. Everyone's checking into whatever hunch they can synthesize, and turning up bupkus.

Finally around lunchtime, one of the techs bursts into the ops center, going "TIM! I GOT ONE I GOT ONE IT'S HAPPENING TO ME, PATH ME! okay look I don't know if you can hear me, but please don't hang up, I work for the phone company and we've got a problem with the network and I need you to stick on the line for a few minutes while we diagnose this. I know I'm not who you expected to be talking to, and if you're saying anything right now, someone else might be hearing it, but that's why this is so weird and why it's so important YEAH IT CAME INTO MY PERSONAL LINE and that's why it's so important that you don't hang up okay? I really appreciate it, just hang out for a few, we'll get this figured out..."

Office chairs whiz up to terminals and in moments, they've looked up his DN and resolved it to a call path display, including all the ephemera that would be forgotten when the call disconnects. Sure enough, it's going over one of the new e-cans. Okay, that's a smoking gun!

So they place the whole set of new equipment, two whole racks of 672 channels each, out-of-service. What happens when you do that is the calls-in-process remain up, but new calls aren't established across the OOS element. Then you watch as those standing calls run their course and disconnect, and finally when the count is zero, you can work on it. (If you're doing work during the overnight maintenance window, you're allowed to forcibly terminate calls that don't wrap up after a few minutes, but that's verboten for daytime work. A single long ragchew is the bane of many a network tech!) The second rack was empty of calls in _seconds_, and everyone quickly pieced together what that implied -- every single call that had been thus routed was one of these problem calls where people hang up very quickly. This thing had been frustrating hundreds of callers a minute, all morning.

With the focus thus narrowed, the investigation proceeded furiously. Finally someone pulls up the individual crossconnects in the DACS (a sort of automated patch panel, not entirely unlike VLANs) where the switch itself is connected to the echo-cancellation equipment. And there it is. (It's been too long since I spoke TL1 so I won't attempt to fake a message here, but it goes something like this:) Circuit 1-1 transmit is connected to circuit 29-1 receive, 29-1 transmit isn't connected to anything at all. 1-2 transmit to 29-2 receive, 29-2 transmit to 1-1 receive. Alright, we've got our lopsided connection, and we can fix it, but how did it happen in the first place?

If all those lines had been hand-entered, the tech would've used 2-way crossconnects, which by their nature are symmetrical. A 2-way is logically equivalent to a pair of 1-ways though, and apparently this was built by a script which found it easier to think in 1-ways. Furthermore, for a reason I don't remember the specifics of, it was using some sort of automatic "first available" numbering. There'd been a hiccup early on in the process, where one of the entries failed, but the script didn't trap it and proceeded merrily along. From that point on, the "next available" was off by one, in one direction.
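
To make the failure mode concrete, here's a rough sketch of that mechanism (a Python stand-in for whatever the real provisioning script was; the provision() helper and circuit names are made up, and only the "untrapped failure plus first-available numbering" logic is the point):

    # Sketch of how one untrapped failure plus "first available" numbering
    # shifts every later 1-way crossconnect by one. Hypothetical names/stub.
    N = 8  # a handful of channels for illustration

    def provision(src, dst, fail=False):
        """Pretend to issue a 1-way crossconnect; returns False on failure."""
        return not fail

    # Forward direction: switch 1-x transmit -> e-can 29-x receive (all fine).
    forward = {f"1-{i} tx": f"29-{i} rx" for i in range(1, N + 1)}

    # Return direction: e-can 29-x transmit -> "first available" switch receive.
    available_rx = [f"1-{i} rx" for i in range(1, N + 1)]
    returns = {}
    for i in range(1, N + 1):
        dst = available_rx[0]                             # peek at first available
        ok = provision(f"29-{i} tx", dst, fail=(i == 1))  # early hiccup on entry 1
        if ok:
            returns[f"29-{i} tx"] = dst
            available_rx.pop(0)   # slot only consumed on success
        # BUG: the failure isn't trapped, the loop proceeds merrily along,
        # so 1-1 rx stays "available" and every later connect lands one behind.

    for i in range(1, N + 1):
        print(f"29-{i} tx ->", returns.get(f"29-{i} tx", "(nothing)"))
    # 29-1 tx -> (nothing)
    # 29-2 tx -> 1-1 rx
    # 29-3 tx -> 1-2 rx   ...and the far end's audio comes back on the wrong call.

Trap the failure (or stick to 2-way crossconnects, which can't go lopsided like this) and the two pools stay in lockstep.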

Rebuilding it was super simple, but this time they did it all by hand, and double-checked it. Then force-routed a few test calls over it, just to be sure. And in a very rare move, placed it back into service during the day. Because, you see, without those racks of hastily-installed hardware, the network was bumping up against capacity limits, and customers were getting "all circuits busy" instead. (Apparently minutes had just gotten cheaper or something, and customers quickly took advantage of it!)


Amazing story! My stepdad worked night shift at AT&T back in the ’80s and ran the 5ESS. He took my brother and me in for a tour one night. Thinking back on it now, it was a lean crew for the equipment they were running. Rows and rows and rows of equipment. I don’t remember closed cabinets, mostly open frames moderately populated. I’ll never forget he showed us some magnetic core memory that was still mounted up on a frame in the switch room. Huuuge battery backup floor as well.

He loved all of that stuff, and absolutely hated when everything went to computers. He quit and became a maintenance man at a nursing home, then a commercial laundry repair guy, and finally retired this year in his late 70s (due to Covid) after working maintenance at a local jail.


That's super cool!

I believe the #5 ESS machine itself is always in closed cabinets, so it's likely that what you're remembering was the toll/transport equipment, or ancillary frames. Gray 23-inch racks as far as the eye can see!

Depending on how old that part of the office was, they were likely either 14' or 11'6" tall with rolling ladders in the aisles, or 7' tall and the only place they'd have laddertrack was in front of the main distributing frame.

As for magnetic core, if you could see it mounted in a frame, what you probably saw was a remreed switching grid, which is a sort of magnetic core with reed-relay contacts at each bit, so writing a bit pattern into it establishes a connection path through a crosspoint matrix. It's not used as storage but as a switching peripheral that requires no power to hold up its connections. (Contrast with crossbar, which relaxes as soon as the solenoids de-energize.)

Remreed was used in the #1 ESS (and the #1A, I believe), and is extensively documented in BSTJ volume 55: https://archive.org/details/bstj-archives?&and[]=year%3A%221...


You’re definitely on to something. This image from Wikipedia for the #1 ESS fits very well into my fuzzy memory, especially those protruding card chassis:

https://upload.wikimedia.org/wikipedia/commons/7/7e/1A_Frame...

I just remember thinking it looked awkward getting to the equipment under them.

I don’t know if the ‘5E’, as he called it, was actually in operation yet; he ended up moving us all out of state to take a job developing and delivering training material for it... I think that’s what finally broke him lol. Hands-on kinda dude.

I’ll have to hit him up later today to see if he remembers ‘remreed’ (he will). Thanks for the info!


Yup, the #1 used computerized control, but all the switching was still electromechanical, so it sounded like a typewriter factory, especially during busy-hour.

At night, traffic was often low enough that you could hear individual call setups and teardowns, each a cascade of relay actuations rippling from one part of the floor to another. The junctor relays in particular were oddly hefty and made a solid clack, twice per call setup if I recall correctly: once to prove the path by swinging it over to a test point of some sort, and then again to actually connect it through. On rare occasion, you'd hear a triple-clack as the first path tested bad, an alternate was set up and tested good, and then connected through.

Moments after such a triple-clack, one of the teleprinters would spring to life, spitting out a trouble ticket indicating the failed circuit.

The #5, on the other hand, was completely electronic, time-division switching in the core. The only clicks were the individual line relays responsible for ringing and talk battery, and these were almost silent in comparison. You couldn't learn anything about the health of the machine by just standing in the middle of it and listening, and anyone in possession of a relay contact burnishing tool will tell you in no uncertain terms that the #5 has no soul.


YES! He worked third and we were there all night. He pointed those sounds out to us, it was so cool.


There's a telco museum in Seattle called the Connections Museum. It has working panel, #1 crossbar, and #5 crossbar switches, plus a #3ESS they're working on getting running again.

https://www.telcomhistory.org/connections-museum-seattle/


Great story.

> including all the ephemera that would be forgotten when the call disconnects

Interesting to know there is information which is not logged. I’m guessing keeping this info, even for a day, would have helped isolate the issue?

How did the echo cancellers pass testing?


They passed testing because they had each been individually crossconnected to a test trunk, and test calls force-routed over that trunk. Then to place them in service, the crossconnects were reconfigured to place them at their normal location in the system. The testing was to prove the voice path of each DSP card, and that those cards were wired into the crossconnect properly.

All that was true; the failure happened when it was being taken out of the testing config and into the operational config. Either nobody considered that that portion could fail, or the urgency of adding capacity to a suddenly-overloaded network meant that some corners were cut. (Marketing moves faster than purchasing-engineering-installation...)

Oh, and as to the point about keeping the call path ephemera: yeah, it probably would have helped, but in a server context, that'd be akin to logging the loader and MMU details of how every process executable is loaded and mapped. Sure, it might help you narrow down a failing SDRAM chip, but the other 99.99999% of the time, when that's not the problem, it's just an extra deluge of data.


Were the cross-connected circuits channelized or individual voice calls (DS0? I can’t remember from my WAN days) or something else?


As I recall, the cross-connects were done at the DS1 level, and an individual card handled 24 calls. These are hazy, hazy memories now; this took place in 2004-ish.
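
(To map that back to the question above, if my T-carrier math still holds: a DS1 carries 24 DS0 voice channels, so "one card, 24 calls" is exactly one DS1's worth. The usual arithmetic is 24 x 64 kbit/s = 1.536 Mbit/s of voice plus 8 kbit/s of framing, for the familiar 1.544 Mbit/s line rate.)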


Nice!! Thanks for the walk down memory lane, this was cool.


Not OP but it sounds like the echo cancellers were fine, the interconnect to the switch was misconfigured. Rather than sending both channels of audio to opposite ends of the same call, one channel got directed to the next call.

The funny thing is that if everyone played along they could have had a mean game of telephone going.


This feels like the kind of anecdote I'd overhear my paint-covered neighbor Tom telling my dad when I was 10, and my dad would be making a racket over it, practically doubled over. I'd always be like, "what's so funny about that?" But you get older and you realize not many people tell _actually_ interesting stories, so I guess you do what you can to make them want to come around and tell more.


Cool story, thanks! Perhaps you can solve a mystery phone hiccup that happened to me a few years ago? I called a friend (mobile to mobile, if it matters) and, from memory, about 20 minutes into the call I got disconnected, _but_ I instantly ended up on a call with an elderly stranger instead, who seemed pretty irritated that she was now on the phone with me. I was surprised enough that she hung up before I could form a coherent sentence to explain what had happened, so I've no idea whether she was trying to ring someone, whether the same thing happened to her, or whether she'd dialed my number by accident. From what I remember, it seemed like she was already mid-conversation as well, though.


Thank you for sharing. And for helping the phones just work, so we can complain so much when they don't :)

Have you heard any of Evan Doorbell's telephone tapes[1]? It's a series of recordings mostly from the 1970s, but with much more recent narration, exploring and sort-of documenting the various phone systems from the outside in. Might be interesting to see what they figured out, and what they didn't :)

[1] http://www.evan-doorbell.com/


This is a super cool story, thank you so much for taking the time to type this out!


What a fascinating read! Gotta error-check my scripts.


A very similar problem is currently happening in India with Jio. I wonder if anyone from there has seen this.


Cool story, thanks for sharing!



