
The Power of Power Cycling - nkurz
http://growthalytics.com/troubleshooting/programming/2015/08/21/the-power-of-power-cycling
======
Animats
No, it's not acceptable to power cycle to fix problems.

Yesterday, I had to power-cycle a CNC milling machine. The machine (a Tormach
1100) started the spindle with the spindle RPM about a tenth of the value
specified. This resulted in pushing an end mill into the workpiece (it wasn't
spinning fast enough to cut properly), snapping off the end mill (like a drill
bit) which went flying off, and damaging the workpiece. This is not a huge
deal (it's about $25 in damage, and everybody wears safety goggles around
milling machines) but it shouldn't happen. We discovered that all spindle
speeds were far slower than they should have been; entering a new spindle
speed did cause the spindle drive to change speed, but the speeds were about
an order of magnitude low. The spindle speed control is digital and computer-
controlled.

Power-cycling the machine (which runs Windows Embedded Standard, a Windows XP
derivative) cleared the problem, and I was able to run several jobs
successfully. But someone else had reported a spindle overspeed on Friday.
Something has gone badly wrong in spindle speed control on that machine.

The "Recommended Best Practices" for this machine tool actually says "Reboot
controller once a day to force both Mach3 and Windows to restart."[1] That's
not good.

[1]
[http://www.tormach.com/uploads/883/SB0036_Mach3_Best_Practic...](http://www.tormach.com/uploads/883/SB0036_Mach3_Best_Practices_1214A-pdf.html)
(a PDF file with the wrong suffix)

~~~
msandford
If you're using Windows to run a milling machine, you're doing it wrong. It's
not Windows' problem that you're using it inappropriately. That's like taking a
1990 Ford Taurus with 250k on it to a track day and then complaining that the
engine blew up. Technically the engine did fail, but the driver should have
known better.

~~~
keithpeter
Windows Embedded is quite common in factories where I live. What alternatives
would you suggest that are vendor supported?

~~~
msandford
If your vendor thinks that Windows is an acceptable control for a CNC milling
machine, consider another vendor. Yes vendors might choose to use Windows, and
think that it's OK. But just because a vendor chooses to use Windows doesn't
somehow make you obligated to buy it and suffer, does it?

Sure they might be the only vendor, or someone inside your organization might
choose them anyhow despite your protestations. But that doesn't change the
fact that using Windows for these kinds of things is foolish.

Just because big companies do it doesn't make it smart, does it? I mean, if it
did, then startups wouldn't be able to exist would they? Startups are able to
be a thing because big companies sometimes do stupid stuff.

~~~
CapitalistCartr
They all use Windows. Mine use Windows NT4 and Windows XP. That's the
industry. If you want industrial CNC machines you don't get a choice.

~~~
msandford
There's a tremendous difference between using Windows to run the GUI and using
Windows to do the real-time control of the servos and various other hardware.

Most of the time these machines run a small microprocessor which receives
commands over some port and interprets them as it is instructed to. This is
what's called a "hard realtime" system as it always responds within a certain
amount of time, guaranteed, provably as per the design.

What Tormach is doing is eliminating the dedicated gcode interpreter
hardware/controller and performing those operations strictly in software, on a
program running on a PC. There's some utility to that, but pretending that
it's as good as having a dedicated, realtime gcode interpreter is not honest.

------
pjc50
I don't see why this is attracting so much skepticism when "crash-only
programming" and the "chaos monkey" are popular ideas on HN. There really are
only two ways to do software reliability:

\- "failure is not an option": you don't get to reboot your rocket controller
when it has a floating point error, or your Therac when you're doing an X-ray.
This involves investing a lot of time, effort, review, and formal validation
into getting it right. It's completely incompatible with Agile and RERO.

\- "s--- happens": the cost of failure is small and you can just accept it,
issue refunds/apologies and move on. Or show the fail whale. This is much
easier and cheaper. This environment moves towards powercycling and
redeploying software/EC2 instances/Docker containers whenever something
happens. You monitor observed reliability and make a commercial decision as to
whether it's unacceptable and you need to fix some bugs.

Almost all of HN works in the second area.

~~~
vezzy-fnord
Crash-only is rather different from power cycling. The unit of granularity in
a crash-only metaphor is at a higher level (process, thread, green thread or
other schedulable entity), the whole point being that you delegate to a
supervisor tree for restarting crashed subtasks/processes to a known good
state while maintaining the uptime and integrity of the entire system (i.e.
crashes are not typically user-visible). It works really well because in any
complex system running on top of so many layers, the state space expands
combinatorially so that performing intricate error diagnosis and recovery
procedures will often be a losing proposition from the number of code paths
you'd need to properly exercise.

Power cycling is different. A system that expects you to power cycle often
because it consistently fragments or cannot dynamically update/reread its
configuration is broken and a major annoyance. In many ways, systems that
require lots of power cycling are as such because they're designed in a way
antithetical to crash-only. It's a failure to enforce boundaries and
separation of concerns.
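
A minimal sketch of that idea in Python, assuming an in-process worker rather
than a real OS process (actual supervisor trees, e.g. in Erlang/OTP, supervise
separate processes). `Worker`, `Supervisor` and `max_restarts` are hypothetical
names made up for illustration:

```python
class Worker:
    """A task with internal state that starts from a known-good value."""

    def __init__(self):
        self.handled = 0  # known-good initial state

    def handle(self, request):
        if request < 0:
            raise ValueError(f"cannot handle {request}")
        self.handled += 1
        return request * 2


class Supervisor:
    """Replaces a crashed worker with a fresh one instead of repairing it."""

    def __init__(self, factory, max_restarts=3):
        self.factory = factory
        self.max_restarts = max_restarts
        self.restarts = 0
        self.worker = factory()

    def call(self, request):
        try:
            return self.worker.handle(request)
        except Exception:
            self.restarts += 1
            if self.restarts > self.max_restarts:
                raise  # escalate: repeated crashes point at a real bug
            self.worker = self.factory()  # back to a known-good state
            return None  # this request is lost, the system stays up


sup = Supervisor(Worker)
results = [sup.call(r) for r in [1, -1, 2]]
```

Only the failing unit is thrown away; the supervisor and every other request
carry on, which is the difference from pulling the plug on the whole box.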

------
tempestn
Power cycling is a work-around, not a "fix". Pretty much any time you end up
needing to power cycle something, it's due to an underlying bug, which remains
after the fact. In fact, in many cases it may have intentionally been left
unfixed, since the people shipping these products know that most users will
try power cycling before complaining about a problem. It has become so
commonplace that we apparently now see it as acceptable that we have to
regularly reboot our routers, PCs, phones, etc. just to keep them working.

~~~
andygates
Thing is, that underlying problem may be in a system that's out of anyone's
direct control. Case in point: a clinical system which occasionally pops a
memory leak and needs a bounce. The leak turns out to be a known bug in a
library - not their code. And it's fairly elderly so it hasn't been fixed and
isn't going to be.

I mean, it shouldn't happen, but if "should"s were horses I'd have a
ponyburger for brunch.

------
xrmagnum
There's no doubt Power Cycling helps. But I would argue it helps for people
who 'consume' products. It is anything but a help for Engineers, for people
who design products.

There are so many examples I could provide: you are a software engineer, the
program you develop works well for a day and then crashes. You can either
investigate the problem, find there's a memory leak and fix it or Power Cycle
your program.

-> If you design a product, don't go for the easy stuff, don't power cycle just to get back to a well known scenario. Investigate and fix.

~~~
kbenson
Yes, power cycling most often treats the symptom, not the problem. As a
consumer, you often don't have a way to treat the problem, so that's your only
recourse. As an engineer, or really anyone capable and responsible for fixing
the problem, often you _do not_ want to power cycle, because if the problem
goes away you may no longer have a way of reproducing it. That makes most
problems harder to fix, and possibly impossible to confirm fixed.

I often reduce how I classify the importance of bug reports I get that I can't
reproduce. It's often not efficient to work on something with so little
information to go on. Further information on the bug or a reproducible error
case helps vastly.

~~~
StillBored
The real problems start when your support organization thinks that rebooting
the customer's machine is a solution.....

Then the engineering team never gets to fix the problem, and your support team
starts eating up all the resources because they get constant calls from the
customers asking about the same problems.

(thankful for not being in that job anymore!)

------
crististm
If we all had ECC on our machines then SW engineers would have one less of an
excuse to blame it on HW.

Power-cycle related story: my machine crashed two weeks ago while playing a
movie and my 3TB hdd showed only 2TB on _SMART_ after power reboot! After
googling around, letting the hdd cool off and playing with wd tools, several
power reboots later my 3TB partition is back but not without losing a couple
of MB of the size of the disk at the end of it. Puff, just gone, not deleted,
gone! My old partition is now bigger than the 3TB drive.

Smart reports a 3TB disk again, and I'm on the lookout for a spare drive.

No, really, not all power cycles are equal.

~~~
falcolas
This comment has two parts - the first one I think deserves more attention.
When your program can actively and randomly rot while it's in production,
the only valid recovery method really is "power cycling" the application. As
the OS is also being affected by this rot, power cycling the OS is a valid
solution to many problems as well.

The push for making broad use of consumer hardware running production
environments to save (a veritable fistful of) money has only made this
problem worse.

It makes me wonder if we shouldn't be adding a module to the Linux kernel to
occasionally reload blocks of code from disk.

~~~
crististm
"occasionally reload blocks of code from disk" \- It's not that simple. An ECC
system avoids most HW related issues but does not protect you from SW errors
which probably accounts for most rotting.

Why reboot it in the first place? If the HW has ECC, the SW does not rot
unless it is poorly written.

------
sandworm101
The OP assumes many things about 'turning it off and on again'. Firstly it
assumes that turning something off and restarting will bring it back to some
default state. That isn't always true, especially for computers. Say some new
code has been injected. Restarting, going through the motion of boot and
login, may allow that code to do something (harvest login creds) that it
otherwise might not.

And the OP assumes that once 'off' has been achieved that "on" is always a
possibility. Forget the car. Think airplane that has lost instruments, but the
engines are still working. Turning things off might move you from bad to
worse. Maybe you stick with the partially-working machine rather than risk
bricking things.

~~~
userbinator
_And the OP assumes that once 'off' has been achieved that "on" is always a
possibility._

Or that to do this is a simple operation with no other side-effects. As one of
the other comments here points out, power-cycling is probably acceptable for
consumer products, but with systems that are more mission-critical, should
really be closer to a last-resort method.

~~~
sandworm101
Chernobyl started with them 'power cycling' the reactor, an on-off-on test for
a backup system. It did not go well.

------
saosebastiao
I feel like the majority of problems that are resolved with power cycling
could be prevented on the engineering side if engineers _merely understood
state machines better_.

The reason why power cycling works at all is because the machine's state is
inconsistent with the program's, and power cycling gives the machine and the
program a chance to start over from scratch.

So many programs out there, especially so in embedded devices where power
cycling is common, don't really have an explicit state machine model of
operation. As such, the debugging of these types of errors is near impossible.
If you have an explicit state machine model of operation, and you have a bug
that is remedied by a power cycle, you can quite easily trace the bug back to
a specific state transition. Of course, this becomes more untenable with
increases in system and hardware complexity, but on the level of a driver or a
single executable, an explicit state machine model works wonders.
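
A rough sketch of what "explicit" could mean here, in Python. The states and
events are made up (loosely modelled on a spindle drive), not taken from any
real controller:

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    SPINNING_UP = auto()
    RUNNING = auto()
    FAULT = auto()


# Every legal transition is listed; any other combination is treated as a bug.
TRANSITIONS = {
    (State.IDLE, "start"): State.SPINNING_UP,
    (State.SPINNING_UP, "at_speed"): State.RUNNING,
    (State.SPINNING_UP, "stop"): State.IDLE,
    (State.RUNNING, "stop"): State.IDLE,
    (State.FAULT, "reset"): State.IDLE,
}


class Machine:
    def __init__(self):
        self.state = State.IDLE
        self.history = []  # (from_state, event, to_state), kept for tracing

    def fire(self, event):
        # Unlisted combinations land in FAULT instead of being ignored.
        new_state = TRANSITIONS.get((self.state, event), State.FAULT)
        self.history.append((self.state, event, new_state))
        self.state = new_state
        return new_state


m = Machine()
for event in ["start", "stop", "at_speed"]:
    m.fire(event)
```

After the last event the machine is in `FAULT`, and the final entry in
`m.history` names the exact transition that should never have happened, which
is the information a blind power cycle throws away.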

------
viach
Hm, is that the answer to the question why do we actually sleep?

------
Fede_V
As an extreme example of power cycling, I was refactoring a slow function in
some simulation code and replacing some hash-based look ups with some pointer
arithmetic.

I must have spent several hours writing and re-writing that function, and I
still couldn't get the tests to pass. I went over all the code character by
character, debugging, etc, and I just couldn't figure out where it wasn't
working. Eventually, I just nuked all the work I did that morning, rewrote
that entire routine from scratch, and it worked perfectly.

------
joezydeco
Just after reading this I went to get a coke from the vending machine down the
hall. There's a bottle sitting on the robotic arm halfway down and stuck.
Display reads OUT OF SERVICE.

I yanked the power cord, put it back in. The steppers re-homed themselves and
the bottle made it to the bottom.

Free Coke.

------
toth
For anyone who hasn't seen it yet, the AI koan on power cycling:

'A novice was trying to fix a broken Lisp machine by turning the power off and
on.

Knight, seeing what the student was doing, spoke sternly: “You cannot fix a
machine by just power-cycling it with no understanding of what is going
wrong.”

Knight turned the machine off and on.

The machine worked.'

See the rest at
[http://www.catb.org/jargon/html/koans.html](http://www.catb.org/jargon/html/koans.html)

------
sllabres
This is going in a different direction than the cited article, but there are aspects
when it comes to software aging and software rejuvenation for high
availability.
[https://en.wikipedia.org/wiki/Software_rejuvenation](https://en.wikipedia.org/wiki/Software_rejuvenation)

There were some interesting reads in the AT&T Technical Journals for the 5ESS
switches (e.g. Kintala, Bernstein, Wang “Components for Software Fault
Tolerance and Rejuvenation”) a long time ago. The journals are not freely
accessible, but I have found another text from one of the authors about the
same topic [http://www.crosstalkonline.org/storage/issue-
archives/2004/2...](http://www.crosstalkonline.org/storage/issue-
archives/2004/200408/200408-Bernstein.pdf) (defense related web site) or
[http://srejuv.ee.duke.edu/shaman02_secure.pdf](http://srejuv.ee.duke.edu/shaman02_secure.pdf)
(other author)

------
burnte
In this thread: Lots of people criticizing a perfectly valid first diagnostic
step that they all use frequently. Yes, we all know that it's not a FIX for
some long term problem, but frequently you're looking at a simple confluence
of rare events that caused an issue that can be resolved by resetting the
system to a known state. If the issue continues with a frequency that is
troubling, then you move on to other steps, but if you fire up a kernel
debugger and bus sniffer every time you have a minor glitch, you're wasting
time. And I wager everyone commenting otherwise knows that.

~~~
xrmagnum
I don't see many people criticising here quite frankly.

As per your comment, you are right: if there is a kernel bug and you'd rather
restart your computer to quickly get back to what you were doing and avoid
distraction, all good.

If you are part of a kernel engineering team and you quickly restart your
computer and pretend it never happened then maybe you didn't give your best
shot.

------
yason
The value of power cycling is to determine how intermittent the problem is.

If the problem only ever occurs once and power cycling fixes it then that is
effectively a permanent fix.

If it comes back next year and power cycling fixes it again, you know there's
something wrong but it probably won't be an actual problem for any practical
purposes unless it gets worse.

If you find yourself to need to power cycle every month or every week or even
worse, increasingly more often, chances are you're going to realize that the root
cause must be investigated and fixed in a matter of months or weeks or even
days.

------
deepnet
Another way of saying,

"Have you tried turning it off and on again ?"

\- from Graham Linehan's excellent _The IT Crowd_, the phrase is spoken by Roy
from Technical Support, the I.T. dept.

~~~
Frqy3
It doesn't work unless you say it with the correct accent.

~~~
deepnet
Compare :

[https://www.youtube.com/watch?v=p85xwZ_OLX0](https://www.youtube.com/watch?v=p85xwZ_OLX0)
\- Irish brogue ?

[https://www.youtube.com/watch?v=tPxrACTPxEk](https://www.youtube.com/watch?v=tPxrACTPxEk)
\- Minnesotan US ?

------
martin-adams
I've heard an alternative of the opening joke.

It results in the software engineer basically saying, "let's try it again and
see if we can reproduce it".

------
esaym
Personally, I'd rather just fix the problem :)

    
    
      :~$ uname -r && uptime 
      3.2.0-4-amd64
      01:33:09 up 461 days,  3:50, 13 users,

~~~
MichaelGG
Do you hot patch things or what?

~~~
ucho
There is/was KSplice for kernel and everything else can be upgraded without
reboot.

------
NDizzle
I had to recently power cycle my refrigerator.

The bottom ice maker stopped working, and no setting I put it on would make
it create ice.

My current theory is that I changed the ice settings while the freezer was
open a few days earlier. I noticed this because the water in the door shuts
off if you open the other door.

Sure enough, after leaving the fridge unplugged for 15 minutes I now have ice.

~~~
StillBored
Must be an LG, I had the same problem... And yah power cycling fixed it for a
couple weeks, then it did it again. The wife got mad and called the repair
guys, they said, turn down the temp in the freezer. Apparently something in
the software works better with the temp turned down to the minimum, because
it's not like it can't make ice at 10 degrees or whatever it was previously
set to.

------
splitrocket
Cosmic rays flip bits all the time:

"one error per month per 256 MiB of ram was expected"

[http://stackoverflow.com/questions/2580933/cosmic-rays-
what-...](http://stackoverflow.com/questions/2580933/cosmic-rays-what-is-the-
probability-they-will-affect-a-program)
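
A back-of-the-envelope scaling of that quoted figure, assuming (and this is
only an assumption) that the rate scales linearly with the amount of RAM and
that the quoted number still applies to the hardware in question:

```python
ERRORS_PER_MONTH_PER_256_MIB = 1  # the figure quoted above, taken at face value


def expected_errors_per_month(ram_gib):
    """Expected bit errors per month, assuming linear scaling with RAM size."""
    chunks_of_256_mib = ram_gib * 1024 / 256
    return chunks_of_256_mib * ERRORS_PER_MONTH_PER_256_MIB


estimate_16_gib = expected_errors_per_month(16)
```

Under those assumptions a 16 GiB machine would see about 64 flipped bits a
month, most of them presumably landing somewhere harmless.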

------
NeutronBoy
While I 100% support power cycling as a troubleshooting step, it doesn't
answer the question as to _why_ something went wrong. For fixing the air
conditioning in my car once in a blue moon, fine. For figuring out why my
internet drops out once per day, it's only a triage step.

~~~
jakejake
My philosophy is that the first time something funky and unexplainable
happens, power cycle and don't spend a bunch of time on it. But keep an eye
out for the same thing and if it happens again, then investigate.

Using it to solve a daily problem is a terrible idea, if you have any control
over the system.

------
jordan0day
Obligatory link to Jim Gray's "Why Do Computers Stop and What Can Be Done
About It?":
[http://www.hpl.hp.com/techreports/tandem/TR-85.7.pdf](http://www.hpl.hp.com/techreports/tandem/TR-85.7.pdf)

TL;DR: Most software bugs that make it past testing are transient
"heisenbugs". That is, they're the kind of bug that goes away if you
_restart the program_.

Related: This is actually a core tenet of the Erlang ecosystem -- spend any
length of time around Erlangers and you're bound to hear the phrase "let it
crash". Erlang actually has support for this built into the system: Supervisor
processes exist to automatically "power cycle" your code if an unhandled error
occurs.

~~~
sllabres
There doesn't have to be a bug in the first place. Think of a system with
memory pressure due to memory fragmentation. This could lead to failed memory
requests for applications that would succeed on a system that has been running for less time.
(For this reason some systems even disallow dynamic memory allocations during
runtime)
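
A toy illustration in Python, assuming a simple first-fit allocator over a
fixed address range. `Heap` is a made-up class for this sketch, not how any
particular OS or malloc implementation works:

```python
class Heap:
    """Toy first-fit allocator over a fixed-size address range."""

    def __init__(self, size):
        self.free = [(0, size)]  # sorted list of (start, length) holes

    def alloc(self, length):
        for i, (start, hole) in enumerate(self.free):
            if hole >= length:
                if hole == length:
                    del self.free[i]
                else:
                    self.free[i] = (start + length, hole - length)
                return start
        return None  # no single hole is large enough

    def release(self, start, length):
        self.free.append((start, length))
        self.free.sort()
        merged = [self.free[0]]
        for s, n in self.free[1:]:
            last_start, last_len = merged[-1]
            if last_start + last_len == s:  # coalesce adjacent holes
                merged[-1] = (last_start, last_len + n)
            else:
                merged.append((s, n))
        self.free = merged

    def total_free(self):
        return sum(n for _, n in self.free)


# Long-running system: fill the heap, then free every other block.
old = Heap(8)
blocks = [old.alloc(1) for _ in range(8)]
for start in blocks[::2]:
    old.release(start, 1)

# Freshly restarted system holding the same four live blocks.
fresh = Heap(8)
for _ in range(4):
    fresh.alloc(1)
```

Both heaps have 4 units free, but `old.alloc(2)` fails while `fresh.alloc(2)`
succeeds: the free space in the long-running heap is scattered in holes of
size 1. No bug anywhere, just history.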

------
stillsut
I was thinking about this for IoT yesterday. Basically if you wanted a machine
with permanent uptime, it would be two identical basic machines that would
periodically power cycle each other. On error, hard-reset, and reload the
OS/app from the working device onto the problem device.

~~~
DanielDent
Watchdog processes are actually extremely common in embedded systems. Servers
have them. Even the code in the SMM on a Chromebook has a watchdog.
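
For anyone who hasn't met one, the core of a watchdog fits in a few lines.
This is a software-only sketch in Python with a made-up `Watchdog` class; a
hardware watchdog does the same thing with a timer that resets the chip, and
the `clock` parameter here exists only so the example can be driven by a fake
clock:

```python
import time


class Watchdog:
    """Fires on_timeout when check() finds no kick() within `timeout` seconds."""

    def __init__(self, timeout, on_timeout, clock=time.monotonic):
        self.timeout = timeout
        self.on_timeout = on_timeout
        self.clock = clock
        self.last_kick = clock()

    def kick(self):
        """Called by the monitored code to prove it is still alive."""
        self.last_kick = self.clock()

    def check(self):
        """Called periodically by something independent of the monitored code."""
        if self.clock() - self.last_kick > self.timeout:
            self.on_timeout()  # e.g. reset the device or restart the process
            self.last_kick = self.clock()  # start a new window after the reset
            return True
        return False
```

The important design point is that `check()` has to run somewhere the hung
code cannot block it: a separate timer, a separate process, or in the
two-machine scheme above, the other machine.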

------
smoyer
I have a wife and four kids (two of whom are now living on their own) and I
can tell you that it's sometimes impossible to power cycle them. Especially
when they're young and excited - and the next day is Christmas.

------
stevebmark
[https://pragprog.com/the-pragmatic-
programmer/extracts/coinc...](https://pragprog.com/the-pragmatic-
programmer/extracts/coincidence)

------
jlynn
I once called a repairman to look at my dishwasher because it was stuck
repeating the wash cycle over and over. He power cycled my dishwasher and the
problem was fixed.

------
jamesfe
I have to admit - I thought this would be an article about bicycling and the
usefulness of physical fitness in being a better (XYZ).

Not disappointed, but not what I expected.

------
QuantumRoar
Here's a question. Is it actually common practice to test your code by leaving
it to run for weeks while inputting random stuff and observing if it leaks
memory, sets bad values or crashes?

If not, why not? That'd catch any "mad user bugs" and all kinds of accumulated
collateral damages from the sea of complexities which might be overlooked if
routines are only tested individually.

~~~
mwill
This is along the lines of "soak testing"

It seems so common it's mandatory in some places / fields, and totally unknown
in others unfortunately.

------
denshade
It's this kind of 'fixing' the issue that is the reason we still have these
kinds of problems. I'm the kind of guy that debugs things for 3 days to find
the issue. Yup even if "it isn't worth it". I should warn you, I hit people in
the face with a shovel if something isn't right. And frankly if you are an IT
professional you should too.

~~~
a3n
These things happen.
[https://www.youtube.com/watch?v=6KpzGfC9JP8](https://www.youtube.com/watch?v=6KpzGfC9JP8)

------
hahainternet
If you are power cycling your computer, it is not sufficient to shut it down
and restart it from the power button.

You must disconnect its power by turning off the PSU or yanking the cable. I
suspect a majority of tech people know this but your PSU passively supplies
power to a few things on the board, and this can be enough for controller errors
to persist across a reboot.

------
ucho
It is great until you power cycle a machine that was behaving strangely only to
find out that it was due to HDD that was starting to fail and it does not boot
anymore.

~~~
karmakaze
In a production environment, I prefer frequent power cycling. You don't want
to be in a situation where a dev/ops is afraid to power cycle a server with
very high uptime. If there turns out to be a problem anywhere, better to find
it sooner rather than later, when recovery is more likely and in the worst case
you can fail over or restore from a backup.

~~~
marcosdumay
Yep, but power cycle when there are people available, with enough time to fix
any issue that may appear.

The worst time for power cycling a machine is during an emergency.

------
PascLeRasc
From the title, I hoped this would be an analysis of cyclists' power or
something related to the fancy torque meters they love to use.

