We just received a new mainframe from IBM. Big beast, big power consumption.
My primary task was to be Sysadmin of LPAR/instances of Linux inside the new IBM.
The new mainframe was unpacked, and the power connectors had to be "modified" to the local standard. You know.. You ask your local contractor to read the manual in English and hope for the best.
There were two people in the data center that day: me and the IBM tech representative.
Well, I was checking some blade servers, looking at the robotic library, and I saw him plugging it in 5 meters from me.
I just heard a BANG. And for the first time, I saw an electric fire. Like a dragon spitting green fire. I shouted for him to stop and move away. By instinct he unplugged it (I grabbed a chair to throw at him in case he got stuck to the current).
And everything gets pitch black.
He looks at me.
I look behind me. And there they are: 200 servers down. All down.
It had even broken the APS system.
I walk to the extension. Dial 28 to my co-worker and say:
"P.... come here. Serious! Get everybody here... Big problem.. BiiiiiiiiiiiiiG."
We had to start everything in the right order (SAN storage, ADs, servers, SQL), but we knew how.
8 minutes later the electricity company showed up. The IBM tech had to go to the hospital with a cardiac arrest brought on by the stress.
The IBM tech guy got lucky and is alive.
I got a good recommendation for keeping cool in emergency situations.
For someone who's been there and got out alive, it's easy to keep calm. For someone's first time, it's easier said than done.
To help us troubleshoot this, my boss asked me to program the unit to give a missed call to the server every hour. If we got a missed call, we knew that unit was still working. In countries like India, giving a missed call is a zero cost way to communicate. For example: You would pull up in front of a friend's place and give them a "missed call" to let them know that you are waiting outside etc.
Anyway, I implemented the logic and we sent off our field techs to intercept trucks at highways and update the firmware.
The way I implemented the logic, the unit was to call our server's modem number every hour at the top of the hour. No random delay, nothing. So, soon after that, around 50 units tried to call our server at the same time. Remember, the clocks in the units are run off GPS and they are super accurate. This caused our telecom company's cell tower BTS to crash. Cell service in my office area, a busy part of Bangalore, was down for a whole 2 hours.
I was called into the telecom company's head office for their postmortem. They didn't yell at me or anything. They were super nice. In fact, when I finished explaining my side of the story, one of their engineers opened his wallet and gave a hundred rupees to another guy. Guess they were betting on the root cause. From what I understand, they escalated the bug to Ericsson who manufactured the BTS and got it fixed. For my part, I added a random delay and eventually removed that feature.
BTW, the term "missed call" may not be familiar to people outside India. For those not familiar, it's ringing a number and then disconnecting before they pick up. Serves to notify them that you called :)
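For anyone wondering what the random-delay fix buys you, here is a rough sketch. It is not the firmware from the story: the function names, the 10-minute jitter window and the one-second bucketing are all assumptions picked for illustration.

```python
import random


def call_offsets(unit_count, max_jitter_s, seed=None):
    """Seconds past the top of the hour at which each unit places its call.

    Hypothetical helper: every unit picks a uniform random delay in
    [0, max_jitter_s] instead of calling exactly on the hour.
    """
    rng = random.Random(seed)
    return [rng.uniform(0.0, max_jitter_s) for _ in range(unit_count)]


def peak_calls_per_second(offsets):
    """Largest number of calls that start inside the same one-second bucket."""
    buckets = {}
    for offset in offsets:
        second = int(offset)
        buckets[second] = buckets.get(second, 0) + 1
    return max(buckets.values())


# 50 GPS-synchronised units with no jitter all hit the tower in the same second.
print(peak_calls_per_second([0.0] * 50))

# The same 50 units spread over an assumed 10-minute window.
print(peak_calls_per_second(call_offsets(50, 600.0, seed=1)))
```

The first figure is the whole fleet arriving at once. The second depends on the random draw, but it stays far below the fleet size because the calls are spread over 600 one-second slots.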
The owner was a very impatient youngish founder who knew just enough about HTML/CSS to have the dangerous notion that he knew something about programming. Additionally he was obsessive compulsive to the point where when he saw that different browsers didn't render the HTML/CSS EXACTLY identically in all cases, he had me redo ALL text on the site as IMAGES!..because those would look the same regardless of what the browser supported.
Now, the payment processing part. Since he was a cheap bastard and didn't want me spending any time on actually versioning, managing code, doing deployments, testing etc....we only had one development/test environment: PRODUCTION.
Yup, I'd connect my trusty VisualStudio IDE directly to the file system on the production IIS webserver and code away. Whatever I had coded when I hit save...was live.
No issue. Since we had no monitoring, logs, analytics or anything else unnecessary like that, he never could tell how many live transactions were lost because I had forgotten to close some tag, looped once too many times, mistakenly truncated some part of a card number, swapped the first name and the last name field accidentally or mistakenly told the payment gateway to CREDIT rather than DEBIT the charity's account (yes, that one did happen...and he did notice).
I would come home a nervous wreck every day just wondering what kind of pissed off customer calls I'd be hearing about the next day for something I had done that day.
I also forgot to mention the part where he thought his firewall was slowing down his website so he had it removed thus opening up his production Windows Server to the wide open internet. His internet provider...Comcast Business DSL.
i suppose that's to be expected though. it seems these sorts of pathologies travel in herds.
Turns out that the only difference between testing and opening night was that the front doors of the theater were open, and it was a windy night. The projector has a "wind vane" style airflow sensor in its exhaust vent to check and double-check that the fans are running correctly. The sudden changes in airflow when the control room door was open were enough for the airflow sensor to drop and trigger a panic shutdown. Since the projector also had fan sensors and temperature sensors, the manufacturer okayed us to bypass the airflow sensor.
Early in my career a fellow team member working with me at a fairly well known Fortune 500 company was testing out a process where, using Microsoft Forefront Identity Manager, we would clean up inactive accounts and shuffle things around to various auth systems. Since this service was used to sync our prod AD and test AD there was a "connector" into prod. From this single FIM instance you could hit dev, test and prod ADs. Sadly there had never been any rules put into place to prevent a push from test -> prod.
On the day that this co-worker was doing some testing he somehow managed to push a change that he thought was going to our test AD but instead went to the prod AD. This change ended up wiping out quite a few prod AD accounts. As in totally deleting them. All of our systems, including the phone system (not sure why), were tied into AD. All of a sudden people on our floor were saying they couldn't log in to anything or send e-mail. Soon we found out that the CEO of the company was feeling the same pain and on top of that, was not able to receive or make phone calls. My co-worker took a look at the process he was running and realized he had screwed up big time. He killed the process but not before about half of our production AD had been wiped out.
Like most backup systems, restoring our AD from a backup had not been tested in a while. Between figuring that out, since it naturally didn't work as designed, and having to get the backups from our off-site backup company, most of the company was unable to do anything for about 8-10 hours. This included remote sites, field techs, customer support agents, etc.
What sucked is that this co-worker was one of the top members of our team and had been handed this FIM environment that somebody no longer with the company had built. On top of that he was not provided any sort of formal training and was really learning on the job. They let him hang around for another week or so and then let him go.
The project was late and there was a daily-charge penalty clause in the contract with the customer, a very large company. A long enough delay could wipe out all the profit from the project. So engineering management told the programmers to suppress all signs of runtime bugs, no error messages, no halts, just slog on, bugs and all.
I objected, nobody paid attention. For my sensor, I had it scream bloody murder (on the diagnostic console) for every runtime problem it found. So I could fix it. The rest of the team followed instructions.
My unit was debugged, up and running, a year before everybody else. If the whole project had been ready then, the profit would have been reasonable. In a whole-team meeting, near the end, I asked the testing team if they had found any bugs in my unit. They asked "What's that?". They didn't even know its name. Suddenly, I was a hero.
Spent some time trying to figure out why. No luck.
Spent some more time.
Eventually I realized that the log timestamps were weird - it looked like the query had been sent a response, but the log message appeared 30 seconds later.
I instrumented the servers to measure disk latency. I noticed massive spikes in latency every few hours. Couldn't figure out why. Then someone told me the servers were running on virtual machines with a shared NetApp for storage... and it all came together.
Every few hours a multi-gigabyte file was delivered to each machine. This was a design that had originally been done for physical machines. With virtual machines, 30 copies of a multi-gigabyte file were being dumped to a single NetApp, filling up the file server's memory buffer and making disk latency spike since it was waiting for physical writes.
Meanwhile, the server I was debugging was doing log writes in the main I/O thread, so it blocked on handling requests when this happened.
I went and talked to the team lead for the server. "Oh yeah, we fixed that recently, logging will be in its own thread as of next release."
Moral of the story:
1. Talk to the people maintaining the software before you spend too much time debugging.
2. Disks do block, don't assume they won't.
3. Changes to operational setups can have significant, hard to predict impacts.
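For a sense of what "logging in its own thread" can look like, here is a minimal sketch using Python's standard library. The server in the story was not necessarily written in Python, and `make_async_logger` is a made-up helper name. The request thread only puts the record on an in-memory queue, and a background listener thread does the slow write.

```python
import logging
import logging.handlers
import queue


def make_async_logger(name, slow_handler):
    """Return (logger, listener); the logger only enqueues records.

    slow_handler is whatever handler actually touches the disk. It runs on
    the listener's background thread, not on the caller's thread.
    """
    log_queue = queue.Queue(-1)  # unbounded, so enqueueing never blocks
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep records away from the root handlers
    logger.addHandler(logging.handlers.QueueHandler(log_queue))
    listener = logging.handlers.QueueListener(log_queue, slow_handler)
    listener.start()
    return logger, listener


if __name__ == "__main__":
    # StreamHandler stands in for a FileHandler on slow shared storage.
    logger, listener = make_async_logger("server", logging.StreamHandler())
    logger.info("handled request %d", 1)
    listener.stop()  # drains the queue before returning
```

The trade-off: with an unbounded queue a long disk stall shows up as growing memory instead of blocked requests, so a real server would probably want a bound and a drop policy.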
If you want to hear more stories, I'm writing a weekly email with one of my programming or career mistakes: https://softwareclown.com
Datacenter had ~300 servers in it. Not huge, but not small either. The lynchpin in the system is this: neither the battery supplies nor the generator can run the AC or air handler, so when the power is out everything non-essential needs to come down to maintain sane temps in the DC.
Anyway, my page goes off in the middle of the night -- power outage. The DC is running on battery backup. I hurry into the office to start powering things down as temps are climbing. I start shutting down VM's, blades, and 1/2U servers. About 1/2 of the way through, the power comes back on -- but the AC isn't kicking on (red flag). The air handler will function though, so let's run with that until the AC guy comes out.
I start powering everything back up. At this point, a few co-workers trickle in to help. After about 2 minutes the fire suppression alarm triggers -- 30 sec to evacuate the DC. I glance over to the air handler vent, and it's SHOOTING flames into the DC. We oh-crap the heck out of there just in time to see the suppression system trigger and the door lock closed. I run to the electrical panel and kill the power to the AC and air handler knowing that they were possible sources of the fire. The fire dept. arrives and forces us out of the building. At this point, nearly the entire DC is cranking on sustainable power with 0 cooling. It's a locked box effectively. We watch our notification slowly alert to servers going down hard due to heat one-by-one. VM hosts -- boom. Network switches -- Boom. SAN -- BOOM.
Long story short, we lost a number of servers and restored a lot of data from backup once things were back online. The cause was traced back to the wiring of the air handler motor. When the power came back on, only 2 of the 3 phases came back online. This was enough for the UPS system to operate, but not enough for the AC (wired correctly). The motor on the air handler was 3 phase but installed incorrectly (or something to that effect, it's been years and I'm not an electrician) allowing it to run, but turning it into a ticking time bomb of an electrical fire.
The distribution service used the same reporting system as our ad hoc notification service. Records for scheduled distributions had a field that stored the ID of the scheduled event that prompted the report generation. In this way, the distributions could grab all the output by event ID without needing to know anything about the reports themselves.
The problem: The ad hoc system relied on the same behavior, but since it didn't have actual scheduled events, the programmer who implemented it halfheartedly spoofed an ID based on the current time when the ad hoc notification created report requests. Over time, these spoofed IDs collided with the real event IDs used by the scheduled distributions.
We entered a bug report. The bug report got closed, because the developers said the module was being re-implemented in a different language, so the functionality would likely be different (read as "broken in different ways").
Time passes. On a lark, the colleague I originally investigated the issue with suggests we do a code review to see whether the bug did indeed get fixed.
The new programmer copied & pasted the original Visual Basic code responsible for the bug into the new Visual Basic .NET project, comments and all.
In the end, of course, a true war story is never about war. It's about the special way that dawn spreads out on a river when you know you must cross the river and march into the mountains and do things you are afraid to do. It's about love and memory. It's about sorrow. It's about sisters who never write back and people who never listen.
I looked at him and said the installer is ~200MB in size, 20 seconds is more than reasonable. He started arguing that the network connections had to be at least 100meg links so it shouldn't take more than 2 seconds. It went round and round, until I realized he didn't understand that network links were in bits/sec. At this point, he was refusing to listen and started disparaging everyone 'against' him. I gave it one more go and showed him the unit conversion and basic math on file size, rate, and time.
For a while, it looked like he was trying to get everyone arguing against him released since he was an 'architect' and everyone else was engineers and I was just the ops guy. Too bad for him, he didn't realize in the land of inflated titles, I was the security, storage, and infrastructure architect. I just felt it was presumptuous and relabeled myself the ops guy.
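The unit conversion from the installer argument, spelled out as a quick sketch. The 200 MB and 100 Mbit/s figures come from the story; the function name is invented here, and the result is an ideal time that ignores protocol overhead.

```python
def transfer_seconds(size_megabytes, link_megabits_per_second):
    """Ideal transfer time: convert bytes to bits, then divide by link rate."""
    size_megabits = size_megabytes * 8  # 8 bits per byte
    return size_megabits / link_megabits_per_second


print(transfer_seconds(200, 100))  # 16.0 seconds on a saturated 100 Mbit/s link
print(200 / 100)                   # 2.0, the figure you get by skipping the x8
```

So roughly 20 seconds on a real network is about as good as it gets, and the "2 seconds" only appears if you read the link speed as megabytes per second.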
As I hinted earlier, the AC system was epic - literally cold enough to hang meat. It originally used several chillers, but even after turning off all but one, it was still FREEZING (well, nearly) in there - there was literally a rack of parkas hanging at the entrance, and you put one on if you were staying more than a couple of minutes.
Not long after the system went in, we got a call from their admins saying there was something wrong with the network, as they couldn't get the Storage Arrays back up - sometimes. Eventually, a pattern emerged: the Arrays refused to restart after they had been shut down for more than an hour or two, but when we took them back to our office, they worked just fine. Turns out the problem was thermal: The room was so cold that the new state-of-the-art HDDs literally didn't have the torque to spin the platters up again against the cold-shrunk tolerances. We checked with Seagate and they said NASA was operating the disks well below their design temperature - they had never expected anyone to do that! The fix: 1) Don't shut the array down for too long, and, 2) if you really have to shut it down longer you'll have to wrap the whole array in plastic to prevent condensation, then take it outside to let the disks warm up enough before bringing it back into the DC and spinning up before the disks almost literally froze up again! This was, of course, duly written up in an official NASA policy and procedures manual. I suppose SSDs were a big win for NASA, as I expect the problem only got worse with succeeding generations of spinning drives....
One of the biggest banks in Europe.
They just bought (aka "rescued") another bank somewhere in Europe and they wanted to lay dark fiber for datacenter synchronization.
And they wanted us to manage that network.
After several weeks laying fiber and setting up all the connections, we started the network manager in the operations center... and couldn't see any device, except the "secondary" network manager on the customer's premises, which could see everything.
Turns out the "security" people at the bank simply put a firewall between the NOC and the network and didn't allow any traffic to go from the NOC to the fiber optic devices.
We asked them to give us access but they didn't have IPv4 addressing compatible with ours. Why? Because the bank used all IPv4 address space internally. All private IP addressing was already allocated, and the NAT ("public") addresses they could use collided with other customers' equipment.
Finally I had to build an ugly "semi-isolated" network island between the bank, the NOC, and some virtualized workstations that could "see" (NATted) the customer devices (which we were paid to manage) and the NOC workstations.
On the plus side, that junk made me write the best documentation I've ever made, just in case something broke in the middle of the night.
Sometime in the middle of the night, I received an SMS: "Datacenter temperature is high: 24C" (yes, centigrade, about 75F).
I saw the graphs and I could see there was a temperature spike. In the last hour the datacenter went from 16C to 24C and it was going up.
Three of the four industrial HVAC machines failed that night. Nobody knew why - I called the HVAC technician (he would come around 9am) and meanwhile I opened all windows and doors in the datacenter floor to stabilize temperature. 26C inside the DC, -6C outside.
Turns out the HVAC systems used water... and someone broke the pipe insulation and the water froze inside.
It was fixed in the morning by applying a blowtorch to the 3 frozen pipes. Once we had water in the circuit instead of ice, the HVAC machines made the DC cold again.
Temperature peaked at 27C in the morning (-1C outside) - at 28C (82F) the VP of Operations would have called everybody for a controlled datacenter shutdown, which - fortunately - didn't happen.
6 months later I had a similar problem with HVAC. This time water evaporated inside the pipes and we had a "normal" panic because both the inside and the outside were hot.
That was my last on-call duty.
Top-level execs had made the decision to not get a backup generator. The one compensation was that we got a manual transfer switch, so that we could easily truck in a generator & cooling in case of a planned outage. There was the possibility that we'd be moving at some point, so self-containment was a big thing.
Taking that into account, I suggested getting an Eaton 9390 UPS, with two IBC-L battery cabinets and an IDC breaker/distribution cabinet. (http://lit.powerware.com/ll_download.asp?file=Eaton9390UPSBr...) The distribution cabinet outputs went to in-rack Eaton RPMs (http://powerquality.eaton.com/Products-services/Power-Distri...), and from there to PDUs.
This setup gave us ~45 minutes runtime at normal load, and more if we shut down non-prod. The one time we had an outage (during my tenure there), shutting down non-prod allowed us to ride the outage. I also liked this setup because our only connections to the outside (power-wise) were from the fused disconnect input, and the EPO. In the end, the "single-line drawing" looked like this:
Building fused disconnect-->Manual Transfer-->IDC cabinet breakers-->UPS-->IDC cabinet-->Rack-->PDUs
Outside Generator Hookup--->Switch            (input/bypass)               dist. panel   RPMs
Unfortunately, the electrical engineer hadn't seen such a thing before. In the past, the 480/208 transformer was external to the UPS, and this is what the electrical engineer was used to. So, the engineer wrote up plans to run an electrical duct from the UPS, to the Manual Transfer Switch, and then on to the transformer (in other words, back to the UPS).
I totally missed this mistake on the plans. It was actually caught by the construction crew, who was laying out the ductwork and realized that something looked weird.
In the end, one of the conduits was used, and the other one was just left in place. Luckily our connections from the IDC distribution panel to the RPMs were flexible, because that second conduit got in the way of pretty much everything.
Well, kinda. It was the first really big Internet meltdown that I recall.
That was the day that AOL had a 19-hour network downtime. It's been surprisingly hard to find news articles from the time that talk about it, but the key mistake that was made was AOL decided to update the software on their Cisco routers at the same time that the same was done on our upstream feed (ANS). In hindsight, I think the cascade failure was obvious, as was the fact that ANS was our Single Point Of Failure (SPOF) to access the outside world.
But there's another part to this picture that you don't see in the stories on C|Net, CBS News, etc.... That was the impact made on e-mail services around the world.
At the time, I was the Senior Internet Mail Administrator for AOL, and a few months before I had figured out how to cram about 40-50 MX records into a 512-byte UDP DNS packet. Your servers would query for the MXes for aol.com, and you'd get back a list for a.mx.aol.com, b.mx.aol.com, etc.... That would be a total of six or seven hosts, and each of those hosts would have six or seven IP addresses listed for them.
Under normal circumstances, this would mostly be okay. You'd query for the MXes, you'd get back the list, you'd try to make your connection on port 25, and after a couple of tries the odds were that you'd get through.
However, the TCP network standards required that you spend two minutes trying to connect to the first IP address before you could proceed to the second. And in a pathological case, if all forty-five or so IP addresses time out, then you've just spent about ninety minutes trying to send a single e-mail message to AOL.
If you have a queue retry period of sixty minutes, then when you still have another thirty minutes to go on that second message, you'll fire off another queue runner, and it will probably also get stuck sending a message to AOL.
Rinse and repeat enough times, and you fill up your RAM, your virtual memory, your swap space, and everything else you've got, all with queue runners that are stuck trying to get messages to AOL. Then your kernel panics, your system reboots, and you start the cycle all over again.
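The pile-up arithmetic can be sketched roughly like this. The two-minute timeout, forty-five addresses and sixty-minute retry period are from the story above; the function names and the "ten queued AOL messages per runner" figure are assumptions for illustration only.

```python
def worst_case_delivery_minutes(address_count, connect_timeout_minutes):
    """Time burned on one message if every address has to time out in turn."""
    return address_count * connect_timeout_minutes


def stuck_runners(elapsed_minutes, retry_interval_minutes, runner_busy_minutes):
    """Queue runners still alive at elapsed_minutes.

    Assumes one runner starts at t=0 and another at every retry interval,
    and that each stays busy for runner_busy_minutes before exiting.
    """
    count = 0
    start = 0
    while start <= elapsed_minutes:
        if start + runner_busy_minutes > elapsed_minutes:
            count += 1
        start += retry_interval_minutes
    return count


per_message = worst_case_delivery_minutes(45, 2)
print(per_message)  # 90 minutes for a single message

# Hypothetical queue: each runner finds 10 AOL-bound messages to grind through.
print(stuck_runners(600, 60, 10 * per_message))  # 11 runners alive after 10 hours
```

Since each runner outlives the retry interval, none of them exits before the next one starts, and the count just keeps climbing until the box runs out of memory.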
This happened across the entire Internet.
While the network layer itself finally came back up after about nineteen hours of downtime, it took about a week for the Internet e-mail backlogs around the world to finally recover.
Sanford Wallace was put out enough that he publicly claimed that I was trying to put him out of business, and then published the phone numbers for my desk and my home, and suggested that angry people call me up and tell me what they think of me.
I have been told that this event is the primary reason why qmail and postfix were created.