
Fail-slow at scale: Evidence of hardware performance degradation in prod systems - godelmachine
https://blog.acolyer.org/2018/02/26/fail-slow-at-scale-evidence-of-hardware-performance-faults-in-large-production-systems/
======
gpapilion
I had a large-scale issue caused by a bug in the fan control system, which
was reading a temperature value that was deadlocked. This caused CPU
performance to tank, since the CPU was thermally throttled.

To make matters worse, the environmental sensors reported no issues, because
they were all frozen at a single reading as well.
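
A frozen sensor like this can, in principle, be caught by checking for
readings that never change. Here is a minimal sketch in Python, not what our
system actually ran: the `StuckSensorDetector` class, the window size, and the
tolerance are all hypothetical names and values chosen for illustration.

```python
from collections import deque


class StuckSensorDetector:
    """Flag a sensor whose last `window` readings are all (nearly) identical.

    Hypothetical helper for illustration; window and tolerance are assumptions.
    """

    def __init__(self, window=30, tolerance=0.0):
        self.window = window
        self.tolerance = tolerance
        # Keep only the most recent `window` readings.
        self.readings = deque(maxlen=window)

    def update(self, value):
        """Record a reading; return True if the sensor looks frozen."""
        self.readings.append(value)
        if len(self.readings) < self.window:
            return False  # not enough history to judge yet
        # A live sensor should show at least some spread over the window.
        return max(self.readings) - min(self.readings) <= self.tolerance


# Usage: feed each new reading in; alert when update() returns True.
detector = StuckSensorDetector(window=5)
flags = [detector.update(42.0) for _ in range(5)]
```

A real temperature sensor in a steady environment can legitimately repeat a
value, so the window would need to be long enough (or combined with fan speed
and CPU load) to avoid false alarms.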

------
LeonM
> There was a case where a fan in a compute node stopped working, and to
> compensate this failing fan, fans in other compute nodes started to operate
> at their maximal speed, which then generated heavy noise and vibration that
> degraded the disk performance.

I've never heard of such a thing. Can vibrations from a fan really influence
disk performance?

~~~
blattimwind
Yes. Loud noises and even yelling degrade disk performance as well.

https://www.youtube.com/watch?v=tDacjrSCeq4

~~~
joncrane
This is one of the classics! I love it. Screaming at an HDD.

------
godelmachine
Blue screen errors are often caused by RAM issues. I remember very well that
I was facing a blue screen error at work once, and upon searching I found
that all I needed to do was remove the RAM, clean and dust it, and put it back
into the same or another RAM slot.

------
xkgt
This is an HN gem! I wonder how it didn't make it to the front page.

I once had a mobile app perform very poorly, but only for large downloads on a
particular SSID. No other applications had such an issue. It turned out to be
due to a firmware bug that stripped the TCP SACK option, but the problem
manifested only for large data transfers over unreliable, high-latency
connections. It had us all chasing our tails for a while though...

------
riza_on
The authors are still collecting more stories for their longer journal version
of the paper. If your institution has 10+ stories to contribute and is
interested in being part of the journal paper, please go ahead and contact the
first two authors of the original paper.

