
Sysadmin left finger on power button for an hour to avert SAP outage - singold
https://www.theregister.co.uk/2018/03/05/who_me/
======
js2
I used to work for Loudcloud, an early dot-com hosting company. We used very
expensive EMC Symmetrix storage for our DB tier. (Search the web for EMC
Symmetrix 3830 if you want an idea what these beasts looked like.)

The Symmetrix had an EPO (Emergency Power Off) which was a red button mounted
in a recessed area on the back of the cabinet, and was protected by a plastic
lid. To perform an EPO, you had to lift the lid and hold the button down for
30 seconds or so.

One of our DC ops employees was moving a heavy server into a cage and
accidentally bumped the corner of the server into the plastic lid. The plastic
lid was forced inward and got jammed depressing the EPO button. Moments later
the entire Symmetrix powered off.

Later that day, as the word got around, another DC ops employee in a different
datacenter looked at the Symmetrix and curiosity got the better of him. He
didn't see how it was possible for the plastic lid to get jammed. So he
punched the lid with his hand. Moments later that Symmetrix went down too. :-(

We reported this design issue to EMC. A while later, a few of us were on a
factory tour at EMC. They pointed out to us the "Loudcloud Stopper"
workaround. It was a rubber stopper mounted next to the EPO button that prevented
the plastic lid from being pressed inward.

------
ilamont
From "Founders at Work," James Hong recalling how they prevented Hot or Not
from being accidentally turned off:

 _But the Salon.com article was coming out the next morning. I called the
writer and asked her if she could push the story back, but she said it was a
slow news day and she couldn't. So the article came out and the server got
slammed.

My brother needed the server for XMethods, so we did the quickest thing we
could think of, which was that night at 3:00 a.m., we took the site down,
grabbed an extra PC--a 400 megahertz Celeron, no-memory-in-it machine that I
got for free when I opened an eTrade account--and drove to Berkeley where Jim
had a shared office.

I remember taking the top off a case for pushpins and mounting it on top of
the power switch of the machine so no one could turn it off. Then we put it in
the corner under his desk and surrounded it with books, so it just looked like
a bunch of stuff under his desk with a little Ethernet cable coming out. And
as soon as we turned the site back on, the access logs started flying. It was
5 in the morning!_

[http://wcarss.org/founders/james_hong_hot_or_not.txt](http://wcarss.org/founders/james_hong_hot_or_not.txt)

~~~
JBlue42
I have no idea where my copy of that book is now so thanks for a link to that
archive!

------
krylon
During my training, I worked on a BIND4-to-BIND9 migration in an IBM mainframe
environment. One week I got bored and started "benchmarking" the server, wrote
this little perl script that swamped the server with DNS queries. Then I
realized that my feeble little antique of a desktop (Pentium II @400MHz,
running NT 4.0, in 2004!) was not even capable of putting some serious load on
that behemoth, and had not IBM just recently ported Perl 5.8 to z/OS?

So I scp the script over to the mainframe, ssh into it, run it again... and
grow disappointed that my puny little perl script is _still_ the bottleneck.
How much can this beast take, I wonder. Maybe, if I forked off a couple of
children?

In retrospect, I should have let it go at this point. My benchmark was already
querying the nameserver at a far higher rate than it would ever encounter in
production. I should have written in my report that the performance impact of
some configuration changes was negligible if not zero.

But I really wanted to see how many queries this beast could handle. So I kept
increasing the number of worker processes hammering BIND with the same queries
over and over, until ... my ssh connection dropped. I pinged the mainframe,
but I got no response. Ooops.

I was trying to look really busy as the monitoring guy who always looked as if
he had just woken up walked down the corridor into our open plan office,
grinning, and asked if anyone had something to tell him. Nobody replied. I do
not think I have ever been that quiet in my entire life.

"Okay", he said, "the TCP/IP stack on that particular system just crashed,
just in case you are wondering.". _Oops_

"Yeah, but SNA still works", the sysprog replied, "And the LPAR is scheduled
for an IPL on Saturday, anyway. It'll do."

Obviously, it was a testing LPAR, so nobody got hurt; they would not let a
trainee anywhere near a production system. But let the record show that I did
manage to disable VTAM (at least the TCP/IP side of it) with a simple perl
script from an unprivileged user account. By accident, but still. Also, I lost
about a kilogram in sweat that day.

~~~
Something1234
What do these acronyms mean?

SNA LPAR VTAM

~~~
softblush
Some IBM mainframe acronyms

SNA -> Systems Network Architecture

LPAR -> Logical partitions

IPL -> Initial program load

VTAM -> Virtual Telecommunications Access Method

I might be wrong of course

~~~
krylon
Nah, that is how I remember it. ;-)

I haven't worked with mainframes since, but I found the fact they have their
own words for things fascinating. Parallel evolution, so to speak. Like, what
mainstream operating systems call a kernel is called a "nucleus" on z/OS,
which IMHO is a much cooler name.

------
zer00eyz
HA! This is my favorite interview question to ask candidates:

"What is your all time biggest screw up, and how did you come back from it" -
I then tell them the story of me losing several hundred thousand dollars and
the funny things that happened around it to set the tone. If you have been in
tech for any length of time you have one of these stories (if not a few). I
have heard some great ones by simply asking and it gives great insight into a
candidate (humor, stress response, the things you have seen).

~~~
brador
What valuable skill or quality does this show?

~~~
squegles
Honesty and the ability to own up to your mistakes. It can also show arrogance
depending on their response.

~~~
MiscIdeaMaker99
I always ask potential coworkers during an interview for a story like that for
two reasons: 1) it'll tell me whether or not they have humility, honesty, etc;
2) and, because I love sharing war stories.

One candidate we interviewed answered my question saying that he had never
made a mistake like that. Then he went on to tell me a story about how a bad
patch from Oracle which he applied brought down production this one time.

The guy seemed arrogant, and I felt like he was lying or overstating his
skills. Either way, he ended up getting hired by another team (knowing full
well that we didn't like him), and he only stayed for about a month or two.

Go figure.

------
koolba
That’s pretty funny though I don’t think it’d work on a modern setup as
everyone I’ve seen for the past 20 or so years does a hard power off after
holding down the power button for 5+ seconds.

~~~
belthesar
This was before the introduction of ACPI, which is what makes that possible.
Prior to that, switches in PCs operated more like a latching switch than a
momentary switch (regardless of whether it was actually a momentary switch or
not), so it was the action of releasing the button that would either send the
signal to shutdown, or physically break the circuit supplying power to the
computer.

This is where the old "It is now safe to shut down your computer." screens of
Windows 9x/NT 3 came from.
[http://i0.kym-cdn.com/photos/images/original/001/286/950/e05...](http://i0.kym-cdn.com/photos/images/original/001/286/950/e05.jpg)

~~~
rzzzt
Are services still running when that screen is being displayed? I'd have
thought that by the time it shows up, all processes are terminated.

~~~
princekolt
It's in the article. The guy pressed the power button on the wrong machine,
not the one that had already been shut down.

------
snuxoll
Makes me realize how much I take quality of life features in modern servers
for granted. We don't need to be physically present to reboot servers,
eliminating the possibility (well, mostly) we will power down the wrong one
like this - even if the OS is completely unresponsive there's lights out
management that can be used to remotely manage power to the system. For the
times that one needs to do physical maintenance on a server a blinking light
can be toggled through the LOM interface to identify the machine, you can have
the hostname display on a little LCD on the front panel too.

It's really amazing to see how far computing has come in just the past two
decades.

~~~
chmod775
I type "reboot" into the wrong SSH window all the time, especially when I'm tired.

It's really quite a lot easier than pressing the wrong power button (I do that
too at my desk).

~~~
snuxoll
I've done it myself, I generally avoid it by using `shutdown -r` instead which
will by default delay reboot for 1 minute (at least it does on CentOS/RHEL 7).
It's annoying to wait the extra time, but having a period to back out with a
`shutdown -c` in case I made a mistake has saved me more than it hasn't.

On the flip side, at least when you make this mistake with a VM you're
typically not down for long assuming you have fast-ish storage - on average
any of the VM's I'm responsible for are back up in 60-90 seconds, physical
machines can take 5 minutes or more (memory testing, expansion ROM's, etc. all
make post take FOREVER even on modern hardware).
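
The delayed-reboot habit above can be wrapped in a small guard. This is only a minimal Python sketch, not a tested tool: `build_reboot_command` and its parameters are hypothetical names, and it assumes a Linux `shutdown` that accepts `-r +N` (minutes) and `-c` (cancel). It builds the command and refuses if the host you think you are on is not the host you are actually on; nothing gets executed.

```python
import socket


def build_reboot_command(expected_host, delay_minutes=1, actual_host=None):
    """Return the argv for a delayed reboot, or raise if on the wrong host.

    expected_host: the machine the operator believes they are logged in to.
    actual_host:   injectable for testing; defaults to this machine's hostname.
    """
    if actual_host is None:
        actual_host = socket.gethostname()
    # Compare short hostnames so "db1" matches "db1.example.com".
    if actual_host.split(".")[0] != expected_host.split(".")[0]:
        raise RuntimeError(
            f"refusing to reboot: on {actual_host!r}, expected {expected_host!r}"
        )
    if delay_minutes < 1:
        raise ValueError("keep at least one minute to back out")
    # "+N" schedules the reboot N minutes from now.
    return ["shutdown", "-r", f"+{delay_minutes}"]


# The command that cancels a pending shutdown during the grace period.
CANCEL_COMMAND = ["shutdown", "-c"]
```

Handing the returned list to `subprocess.run` would schedule the reboot, and running `CANCEL_COMMAND` within the delay would call it off. Making the operator type the hostname is the point: it forces a second look at which window is in focus.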

------
scrumper
This is an old-style directly connected power switch. If you release and re-
press it quickly enough the power won't go out as there's still enough
residual energy in the PSU capacitors. I used to do this all the time on my
486 as a sort of absent-minded tic.

I don't blame the guy for not trying that with a production SAP server
though...

------
jontro
15 years ago when we were hosting servers in a co located facility I
accidentally turned off a server instead of rebooting it (from terminal
services).

The support personnel were annoyed as they had to drive over to the facility
and manually push the power button.

~~~
Cerium
I still have the habit to type "-r 0" ctrl+a "shutdown".

~~~
unit91
Oh wow that's really smart. Thanks!

~~~
iooi
Doesn't hurt to do the same when writing UPDATE queries, `WHERE id=X LIMIT 1`
ctrl+a `UPDATE...`

~~~
ufmace
I got used to typing "UPDATE SET x=x WHERE ..." and then going back and
changing the "x=x" to the actual column assignments after I typed the WHERE
clause.

~~~
chasd00
to this day, i type "update where <where clause>" and then go back and add the
set part of the statement. I think it would have been safer to put the where
clause at the front of the sql syntax. Like "delete where <whereclause> from
table", "update where <where clause> set ..." etc. I know any production
database that's of real importance is going to have safeguards in place but it
still makes me nervous.

~~~
ufmace
Yeah, I wish they would update the SQL syntax to at least allow the WHERE
clauses to be first in UPDATEs and DELETEs. Doesn't seem to be much enthusiasm
around for making big changes to SQL like that though.
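
Since the syntax is unlikely to change, another way to get the same safety is to check how many rows a statement touched before committing. A minimal Python sketch using the standard `sqlite3` module follows; `guarded_update`, the `users` table and its rows are made up for illustration, and a real production database would need its own driver and transaction settings.

```python
import sqlite3


def guarded_update(conn, sql, params, expected_rows=1):
    """Run an UPDATE/DELETE and commit only if it touched the expected number of rows."""
    cur = conn.cursor()
    cur.execute(sql, params)
    if cur.rowcount != expected_rows:
        conn.rollback()  # undo the statement; nothing is persisted
        raise RuntimeError(
            f"touched {cur.rowcount} rows, expected {expected_rows}; rolled back"
        )
    conn.commit()
    return cur.rowcount


# Demo on a throwaway in-memory database (table and data are made up).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("ann",), ("bob",), ("cy",)])
conn.commit()

# Normal case: exactly one row matches, so the change is committed.
guarded_update(conn, "UPDATE users SET name = ? WHERE id = ?", ("bobby", 2))

try:
    # Forgotten WHERE clause: this would rename every user.
    guarded_update(conn, "UPDATE users SET name = ?", ("oops",))
except RuntimeError as err:
    print(err)
```

This does not replace the type-the-WHERE-first habit; it just turns a missing WHERE into a rollback instead of an incident.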

------
zitterbewegung
I have used ngrok to make my laptop work as a production server when I was
user testing
[https://github.com/zitterbewegung/mms2text](https://github.com/zitterbewegung/mms2text)
. I set up Twilio to point to the URL provided by ngrok. I just left my laptop
home and I got people to test the app. Eventually I set it up on AWS but it
chugged away fine on my laptop (Macbook Pro TB 13 inch).

------
gk1
Speed 3: Uptime

------
lmilcin
I did the same almost two decades ago. Old AT power supplies on Proliant
servers would turn the server off only after you lifted your finger. I have
pressed it on a wrong server. Had to reach with my foot to the phone lying
close by on the floor to call accounting department to log off the application
to prevent corruption when the Novell Netware server powering it was
rebooted.

~~~
jamiepenney
386s had a push button power switch where if you were really fast you could
let it out and immediately push it back in and the power wouldn't be
interrupted. Found this out in class - we would walk over to each other, lean
over pretending to talk to a friend and sneakily push the power button in.
Before we worked out that trick, you would have to get your finger on the
button before the other guy took his off, then hold it down while you saved
your work.

~~~
dragonwriter
> 386s had a push button power switch where if you were really fast you could
> let it out and immediately push it back in and the power wouldn't be
> interrupted

It would be more accurate to say “some computers in the late 1980s and 1990s”;
not all of them were 386s, and not all 386s had this style of switch.

~~~
lmilcin
It would be even more accurate to say "Many power supplies, especially when
computers were much less power hungry, had enough capacitance to be able to
survive 0.1 to 0.2s needed for the operator to reset the switch back on. This
only works when the switch is switching AC and the motherboard stays connected
to PSU. Since ATX standard was introduced the switch actually disconnects the
motherboard from the power supply and it is no longer possible. Many power
supplies still have independent AC switch but this is not typically operated
by the user to switch the machine on or off."
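
For a rough feel of where a figure like 0.1 to 0.2s can come from: the energy a capacitor gives up while sagging from one voltage to another is ½·C·(V1² − V2²), and dividing that by the load gives the ride-through time. The Python sketch below uses purely illustrative, assumed numbers (capacitance, voltages, load and efficiency are not taken from any particular power supply), and `holdup_seconds` is a hypothetical helper name.

```python
def holdup_seconds(capacitance_f, v_start, v_min, load_w, efficiency=1.0):
    """Time a bulk capacitor can carry the load while sagging from v_start to v_min."""
    usable_joules = 0.5 * capacitance_f * (v_start**2 - v_min**2)
    return usable_joules * efficiency / load_w


# Illustrative, assumed numbers (not from any specific PSU):
# 470 uF of bulk capacitance, a 300 V rail allowed to sag to 200 V,
# a 50 W load, and a converter that is 70% efficient.
t = holdup_seconds(470e-6, 300.0, 200.0, 50.0, efficiency=0.7)
print(f"{t:.2f} s")
```

Doubling the load halves the time, which fits the remark that the trick worked better when computers were much less power hungry.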

------
waltwalther
I RDP'd to a Windows server an hour's drive from my office at a public library
in another town. I had right-clicked on the network connection to check out
some settings....and accidentally clicked DISABLE instead of PROPERTIES (or
whatever it was called in Windows 2000 server) and disabled the network
connection. It was a long drive...with my phone ringing the entire time. Never
made that mistake again.

------
zaarn
There is a rather similar (maybe same but changed to protect the
not-so-innocent?) story on the daily wtf;
[http://thedailywtf.com/articles/Trauma-Center](http://thedailywtf.com/articles/Trauma-Center)

------
squozzer
A modern interpretation of Hans Brinker.

------
sd6594
What about disabling the power button effect in the OS?

~~~
netsharc
Nowadays holding it for 5+ seconds still turns the system off, I wonder if
this is a BIOS or a hardware configuration. Probably it's the PSU's logic, to
power off if the 2 pins are shorted for more than 5 seconds.

On old AT systems (the ones where Windows 9x would show "It is now safe to
turn off your computer"), one could actually press and hold the power button
and the system would stay running. And when you're bored you can also quickly
move your finger off the button and jab it down again (this would flip the
switch back to on), and if you're quick enough, the system would not see that
there was a power interruption.

Indeed the old AT power button was a mains (120V) switch, with thick cables
going from and to the power supply unit.

~~~
jaclaz
>Nowadays holding it for 5+ seconds still turns the system off, I wonder if
this is a BIOS or a hardware configuration. Probably it's the PSU's logic, to
power off if the 2 pins are shorted for more than 5 seconds.

It is BIOS+Hardware (the PSU is not involved). As a matter of fact to "switch
on" an ATX power supply (not connected to a motherboard) you normally use a
paperclip (or a short piece of cable) to connect the green wire with any of
the black ones, see:

[https://forum.overclock3d.net/showthread.php?t=394](https://forum.overclock3d.net/showthread.php?t=394)

The whole point is that (unless the PSU has a mains switch and it is turned
off) an ATX power supply is always partially ON, powering (parts of) the
motherboard at all times (this allows for such things as Wake on Lan or switch
on via CTRL+F11 or dedicated key on the keyboard).

~~~
yuhong
Without ACPI (which NT4 was) the ATX power button was handled entirely by the
BIOS (I think in SMI code). With ACPI the power button (when you don't hold it
for 5 seconds of course) was handled by the OS.

------
tomcooks
Unscrew a bolt from the server rack and tape it on the button, done

~~~
komali2
>Jeremy told Who, me? that his mate asked to be relieved, as he was in a bit
of pain. Those requests were denied due to the risk of the power going off and
also out of a desire to make the poor chap suffer for his error.

Looks like they wanted him to suffer :p

~~~
jonwachob91
You call it suffer, others call it corrective training ;)

~~~
jasonlotito
Corrective training for the wrong person. Every time I hear a case like this,
I'm reminded that when something like this goes wrong, it's usually the fault
of something that failed prior to getting to this stage. The problem isn't
that this person pressed the wrong button. The problem is that the wrong
button was allowed to be pushed in the first place.

~~~
jerf
Normally I'm down with that, but I think there are some ground-state base
cases where you have for perfectly sound reasons penetrated all the security
and safety affordances and simply must be careful. "Make sure you've got the
right power button before pushing it" is probably one of them. There's not
_much_ you can do about that. Maybe not _zero_, but not _much_, in a context
like this where many servers are being updated, because, again, whatever
protections you may have had in place have already been bypassed.

Similarly, when push comes to shove you'll never be able to eliminate the
needs for somebody to jump to root on some server, at which point, well, _be
careful_ is all you've really got. Hopefully you've built some good habits
into your fingers.

------
hartator
> Jeremy told Who, me? that his mate asked to be relieved, as he was in a bit
> of pain. Those requests were denied due to the risk of the power going off
> and also out of a desire to make the poor chap suffer for his error.

I think that's just awful.

~~~
loco5niner
Likely also simple embellishment

