
The mystery of my desktop that locks up when it gets too cold - zdw
https://utcc.utoronto.ca/~cks/space/blog/tech/ColdLockupMachineMystery
======
M_bara
Get a thermometer and check the ambient temp. It could be a bug on the
mobo/BIOS. As an AWS SRE, we once had an issue where we had waves of racks
being unavailable. They'd go offline, were booted by automation, whereupon
they'd come up only to go offline again. Our usual approach to solving such
issues didn't work (we had a concall spanning 3 days to triage this issue). We
typically relied on IPMI sensors for thermal events in the fleet. The typical
script fetched the IPMI temp and if it was too high raised a ticket. What we
never factored in was a temp going too low. It turns out that the datacenter
in question would let in outside air to help with cooling during winter. One
of the air vents got stuck open and the servers went cold...
(in deg C): 5, 4, 3, 2, 1, 0, 255 << oops, the BIOS initiates a server shutdown
due to a very high chassis temp. The server vendor issued a patch a few weeks
later. We had to put in an alarm for temp going too low in our monitoring!
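
A minimal sketch of the failure mode above, assuming the firmware kept the
reading in an unsigned 8-bit field (the 0 to 255 jump suggests this, but the
actual storage format is a guess). The function names and the 10C/45C
thresholds are hypothetical, not taken from any real IPMI tooling. It shows how
-1 becomes 255, and a check that alarms on both ends of the range:

```python
def to_unsigned_byte(temp_c: int) -> int:
    """Store an integer in one unsigned byte, as an 8-bit register would."""
    return temp_c & 0xFF


def classify_temp(temp_c: float, low_c: float = 10.0, high_c: float = 45.0) -> str:
    """Return 'too_cold', 'too_hot', or 'ok' for a single sensor reading.

    The default thresholds are illustrative assumptions only.
    """
    if temp_c < low_c:
        return "too_cold"
    if temp_c > high_c:
        return "too_hot"
    return "ok"


if __name__ == "__main__":
    # Walk the real temperature down through zero and show what gets reported.
    for actual in (2, 1, 0, -1):
        reported = to_unsigned_byte(actual)
        print(actual, reported, classify_temp(reported))
```

With a low threshold in place, the readings on the way down would already
raise a "too_cold" alarm well before the value wraps around.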

~~~
healsjnr1
Interesting. A decade ago I observed something similar to the article while
traveling in Nepal / Tibet.

A friend and I both had iPods (of different generations) that got 'altitude
sick'.

At two points in the trip we got to about 5,000m. Each time both iPods went
totally unresponsive. Both times we were only there for a few hours, and once
we descended they'd start working again.

The obvious explanation was something to do with the cold and the battery, but
both times the battery came back at the same charge once they started working.

Can't remember exactly what the ambient temperature was (we were in cars both
times) but it's puzzled me whether it was cold or altitude.

~~~
goodcanadian
A quick trip to Wikipedia suggests that iPod Classics had a spinning disk for
storage. The disk in hard drives "floats" on a cushion of air. When the
outside air pressure gets lower, the internal pressure is relatively higher
and the gap increases. A big enough pressure difference and the disc can jam
and stop spinning altogether. I do not know if that is what affected your
iPods, but I do know that hard drive failures at the summit of Mauna Kea
(~4000m) are common and it seems to mostly be down to luck which ones work and
which don't (once you have a working one, it tends to stay working).

~~~
healsjnr1
Wow, that's awesome. No idea if that's actually what was happening, but it
fits the picture of why these were getting 'altitude sickness'

------
zero_iq
Maybe cold solder on a component on the mobo. Heat expands the joint slightly
making a connection, cold temperature makes the contact points contract away
from each other, breaking the connection. I had a similar issue with an LCD
dashboard display in my car, resoldering fixed it.

~~~
dboreham
The temp threshold reported (60F) seems too high for this to be a mechanical
issue. I'd put my money on some marginal circuitry, probably in the PSU or on-
board DC/DC converters, or perhaps a crystal -- they're wily things. I suppose
it could also be "bad silicon".

Anyhow, time to break out a can of freezer spray. For those without a Fry's
handy you can substitute compressed air cans (available at Staples and
Costco), if inverted so as to spray the propellant. Except I didn't tell you
to do that if you end up giving yourself frostbite.

~~~
proee
A lot of people don't realize how fragile semiconductors are with regard to
reliability. It takes a TON of work to get a virgin part ready for
"production." There's huge pressure to get it out the door and it goes out
working "just enough." Semiconductor parts are held together by tons of
"bubble gum and baling wire" hacks to keep them operational over wide
temperature swings.

~~~
kingosticks
I'd just add that you get what you pay for. There are no bubble gum hacks
keeping your ISP's networking gear forwarding packets when their air con fails
and the inlet temperature hits 50C. Or when it goes the other way (to a lesser
degree). That's the higher spec they are engineered to meet.

------
squarefoot
Those temperature related problems can be reproduced and spotted by using one
of those cold spray cans ("freeze spray" or similar names) made especially for
electronics that can be used to selectively cool down parts of a device until
the problem appears.

~~~
sorenjan
In a pinch you can also use a can of compressed air (not actually air) held
upside down.

~~~
userbinator
Chances are the substance in "canned air" is the same as in "freeze spray"
-- a hydrofluorocarbon refrigerant. (They used to be CFCs, before those
were banned for ozone layer depletion.)

~~~
ncmncm
It is not widely known that HFCs are a really, really terrible greenhouse gas,
with hundreds of times the infrared-retaining power of CO2 and hundreds of
times the half-life in the atmosphere.

It is estimated that if all the HFCs currently in use end up vented, it would
account for half the total greenhouse effect. It's really important to make
sure A/C systems don't leak. If you are ever in a position to specify a big
system, get one that uses ammonia.

~~~
semi-extrinsic
> get one that uses ammonia

... or CO2, or hydrocarbons. The general term is "natural working fluids". But
yes, a thousand times this.

~~~
userbinator
Ammonia is toxic, hydrocarbons are flammable, and CO2 requires extremely high
working pressures meaning more expensive equipment and lower efficiency. It's
not surprising that before ozone depletion and global warming were known, CFCs
became _the_ refrigerant to use, since they're almost completely inert and
work at low pressures.

~~~
semi-extrinsic
I agree that the CFCs have some nice properties.

But the new generation of synthetic refrigerants like R1234 are not nice, they
are flammable AND they form nice things like hydrofluoric acid when they burn.

CO2 is widely used in Japan in vending machines you find everywhere, it's
widespread in supermarket refrigeration systems, and it's looking to be the
refrigerant of choice for next-gen EVs. It's already what Mercedes uses for the
AC in the S-class.

Hydrocarbons are flammable, yes, but the amount used for a residential unit
like a refrigerator is less than what's in your can of lighter fluid in the
cupboard. For a commercial kitchen, people have no problem with multiple 10 kg
cylinders of hydrocarbon for cooking, but a few hundred grams in the
refrigeration system is very dangerous?

------
madengr
I had an electronic module that worked in our lab, but failed at the
customer's, but then worked again in ours.

Turns out one of the pins in a space grade connector (MDM-25) open-circuited
exactly between 68F and 70F. Our lab was 70F, and the customer's was 69F. The
way we caught it was to be watching the temp chamber when it slowly ramped
from cold to ambient.

Turns out the connector supplier had shifted production to Mexico, and they
were contaminating the contacts with RTV when sealing it.

~~~
gerdesj
You have the skills to diagnose this sort of thing but I suspect that a mere
Unix sysadmin (soz: "herder") doesn't.

For the likes of Chris and me: put in a new mobo and move on, is surely the
fix that any techy would do.

~~~
madengr
You can selectively hit areas with cold spray (i.e. duster turned upside
down). Could be a bad/oxidized connector, cracked solder joint, tombstoned
part. Only worth diagnosing it for curiosity.

~~~
gotocake
Be careful when using duster that way, some of them leave a white, powdery
residue.

------
fredsanford
Check to see if the screws that hold down the motherboard are flexing it in
any way. In the '90s we had trouble with machines coming from the factory with
the screws too tight and loosening things up solved the issue. Sometimes we
had to put non-conductive washers underneath the mobo because the post to
which it was attached was too short.

~~~
Syzygies
Yes, back then I had a Power Computing Mac clone that no one could make
stable. After hours of staring I realized that the steel chassis was faintly
bent, flexing the motherboard and stressing the inferior Ram sockets of the
day. A very thin washer in the right place fixed everything.

------
rkagerer
Tricky-to-reproduce problems are always the hardest. It sounds like you've
made good headway in identifying temperature as an exacerbant.

Replacing the motherboard would be a sensible next step.

If you've exhausted all the traditional suggestions for troubleshooting
(disassembling and reassembling all components, re-seating RAM, CPU, etc), try
this:

Get a thermal imaging camera (if available) and a can of cold spray (sometimes
referred to colloquially as "liquid nitrogen"). Cool sections of the boards at
a time, and see if you can isolate which area causes the lockup (e.g.
something near power-related IC's for USB?). The camera isn't critical, but
might help you envision where temperature changes most rapidly and achieve
better granularity as to which components you thermally stress in any given
test.

Borrowing a PSU from a friend and repeating the cold test might also be
enlightening.

Good luck, and let us know how it turns out!

~~~
Scoundreller
If you treat a circuit as a 2D surface, yes. But the cold spray won't evenly
change the temperature of that 1000uF electrolytic capacitor. It's a good
start though.

Side note: A vendor my company uses outright rejects bug reports that aren't
consistently reproducible. Very annoying.

We waste a lot of time trying to find a pattern to the issue, but can't always
do so.

~~~
dylan604
I've worked with a product manager that would reject very legitimate bug
reports just because they came from me. Coming from a developer's mindset, I
would make very detailed bug reports just like I would dream to receive. The
product manager told my direct manager that I was trying to show off and make
his team look bad. So then it became me writing up the bug report, but my
manager would put his name on it and the product manager complaining our
department was out to get him.

------
the_fonz
Two distinct possibilities:

1. Conductors with different thermal expansion coefficients and/or cold
solder joint/s. BGA chips (e.g. graphics) especially. Reballing BGAs is no
simple procedure, given the need to prevent popcorning, thermal damage and
cold joints.

2. Condensation - I have an A1278 non-Retina MBP that I'm donating that, for
5 years now, refuses to recognize the boot drive if it's warmed up too
quickly. I suspect corrosion resistance and/or a short from condensation
somewhere along the SATA path or a signal feeding a chip that provides it.
I've tried using a pencil eraser on the male and female SATA connector
contacts to no avail. I bet Louis Rossmann could fix it for $180-350 but I
already bought a Lenovo T480.

~~~
dehrmann
> Reballing BGAs is no simple procedure

Anyone remember the Xbox 360 Towel Trick? The idea was to wrap an Xbox with a
specific error code in a towel, allowing it to overheat to the point that it
resoldered a bad connection.

~~~
rasz
No, you could never resolder anything like that. What this did was warm
motherboard to the point it was able to bend and release accumulated (from
thermal cycling) stress _temporarily_ fixing broken solder joints.

------
rovyko
My Sony PS3 3D Display has the well-known problem where its screen shuts off
for a few seconds every once in a while. I noticed it got significantly worse
this winter, when I opened the window for some fresh air.

Did a few tests, and sure enough, if the room temperature is below 22C and the
screen has been on for about 10 minutes, then it starts to shut off
frequently. My guess is the ambient temperature is causing some metal
component to contract and break a connection.

~~~
londons_explore
Screens and audio devices shutting off periodically is a typical sign of clock
drift.

It's where a graphics card is outputting 60Hz and the screen is expecting to
receive 60Hz, but one of them is slightly off (60.0001Hz).

At some point the sending and receiving get too far misaligned, some error
condition is triggered, and the whole thing restarts.

It's a design flaw really - you should never recreate someone else's clock
signal.

There are some times it's impossible to avoid - for example a picture in
picture mode has to synchronise with two other people's clock signals for each
of the incoming images.

In your case, temperature will affect the speed of the oscillator making it
happen more frequently unless you have the ideal temperature.
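
A rough worked example of the drift arithmetic, assuming the receiver can
absorb exactly one frame of mismatch before it gives up and resyncs (real
devices buffer differently, so treat that as a simplification). The function
name is hypothetical. With 60 Hz against 60.0001 Hz, the clocks slip by one
whole frame about every 10,000 seconds, or roughly 2.8 hours:

```python
def seconds_per_frame_slip(source_hz: float, sink_hz: float) -> float:
    """Seconds until two free-running clocks disagree by one whole frame."""
    drift_hz = abs(source_hz - sink_hz)
    if drift_hz == 0:
        # Perfectly matched clocks never slip.
        return float("inf")
    return 1.0 / drift_hz


if __name__ == "__main__":
    slip = seconds_per_frame_slip(60.0001, 60.0)
    print(f"one frame of slip every {slip:.0f} s ({slip / 3600:.1f} h)")
```

A larger frequency offset, for instance from an oscillator running outside its
comfortable temperature range, shortens that interval proportionally.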

------
phant0mas
I would start by checking whether all the capacitors are in good condition
(i.e. not bulging). Then maybe search for a bad solder point.

------
timtimmy
It could be humidity. It’s very dry when it’s cold outside, heater kicks on,
and indoor humidity plunges even further. The colder it is outside, the larger
the differential to indoor humidity.

------
bootlooped
I, no kidding, had a PC once that booted up better if I opened up the side and
pointed a space heater directly into it.

On cold starts it would boot, then lose all power seconds later, boot again,
and it would work for a few seconds longer than before, rinse and repeat 5
times or so and it would finally stay booted. After surmising that it was
temperature related, I decided to try applying extra heat. It worked. After it
was running for 2 or 3 minutes I would turn the heater off.

I never identified what component was causing this.

------
batoure
I don't know if anyone has suggested this but low temps can cause condensation
on the board if you live in a high humidity area. This could have the effect
of shorting the board.

------
kgwxd
I know you said "all the fans were spinning" but my power supply fan starts
hitting its own frame when it gets too cold, did you check that fan? I forgot
to check that one myself the first time I diagnosed. In my case it's just
super loud but doesn't stop the fan or cause a shutdown.

If that's it, loosening the screws holding the fan in place just a little bit
worked for me.

------
Nextgrid
What surprises me is that his writing seems to imply that the computer freezes
but then manages to recover and continue as if nothing happened. I’d have
expected it to lock up forever and/or crash & reboot.

Would be interesting to see system logs (if any) after such an incident - they
could contain some clues as to which parts go offline (he mentioned USB
devices going off) during the problem.

~~~
thatcks
This was (is) a lack of clarity in my entry. When the system locks up, it only
recovers by rebooting through the BIOS; it doesn't resume operation from some
suspension. System logs cut off abruptly at the time of the hang, with nothing
abnormal even a few seconds before the time and no kernel messages sent out
through netconsole (I don't have a serial console available).

(I'm the author of the linked-to entry.)

~~~
ncmncm
People are talking about freezer spray, but really you need to substitute
parts, first, until you narrow it down. PSU, RAM, GPU, motherboard, CPU. Start
with RAM -- it's easiest and most likely, and you can get along on half for a
while, just to see.

RAM, PSUs and motherboards are remarkably cheap to replace.

------
chli
I had a GFX card that would only work when warm (power-on PC, black screen,
wait 30 seconds, power-off PC, power-on PC, PC boots). Failing to wait long
enough the PC wouldn't boot. PC booted fine with another GFX card. Never tried
debugging further.

My wife had a computer (before we met) that would boot only when warmed up
using a hair dryer !

------
peter_retief
I see some people have already suggested a dry solder joint. Not that simple
to find, but try looking with a strong light and a magnifying glass. The freeze
spray sounds like a great idea as well.

~~~
londons_explore
It'll be a bad solder joint on a BGA chip - no way to see those without an
x-ray really.

------
euske
I had a webcam that became faulty when it got cold. We suspected that the
rubber connector tightened when it was cold, but couldn't quite figure out
why.

~~~
FreeFull
The interesting thing about rubber is that it loosens up when it's cold, and
shrinks when heated

------
dusted
Do not use a cold-spray to emulate this. Because when you have a part that is
significantly colder than ambient temperature, condensation is very likely to
occur.

------
AtlasLion
Sounds like a capacitor acting up IMHO. This can be tested by cooling the
capacitors with an air can held upside down.

------
Kenji
It can be much simpler than that. Maybe an I/O pin is left floating by
accident (neither pulled up nor down) and below a certain temperature, it gets
toggled and brings the machine down. We actually had this exact problem, but
it was the other way around: The floating pin was pulled the wrong way after a
certain temperature was surpassed.

~~~
raxxorrax
This kind of problem almost made me go crazy once when developing a simple
encoder routine. It is the absolute worst.

"sometimes it does work, so a hardware fault is very unlikely and why isn't my
software working if the sun is shining?"

I reimplemented these routines countless times...

Seriously, I was slowly beginning to think about looking for a new career.
Already getting angry again just thinking about it.

------
gerdesj
Sooooo you decide to log a call ...

Hardware description, OS (and version), software installed, etc etc before we
even start to think about it.

Bugger that: cost your time at say £20 per hour (I'm thinking of a reasonably
good techy take home in a reasonably rich economy). Now do a cost/benefit
analysis ..... buy a new motherboard, fit it and delete the blog post.

Don't forget that the ambient temperature may also correlate with say humidity
or some other parameter. Buy a new mobo and delete the post and move on unless
you are prepared to really go to town with a decent investigation 8)

~~~
askmike
You must have fun hobbies.

