
CPU reliability – Linus Torvalds (2007)
http://yarchive.net/comp/linux/cpu_reliability.html
======
zxcdw
I don't work in an environment where I get to deal with hardware failures, so
pardon my ignorance, but has anyone seen a CPU that failed during normal
operation? I'm under the impression that it is very rare for a CPU itself to
fail badly enough to need replacing.

The only times I've even heard about failing CPUs have been when they've been
overclocked or insufficiently cooled (add in overvolting, and you get both :)),
or physically damaged during mounting/unmounting or while otherwise handling
the hardware. And even then the failure has usually been somewhere other than
the CPU itself.

Of course I'm not saying it'd be unheard of, but frankly, for me right now it
is.

~~~
vanderZwan
My dad had a laptop which would not boot unless he put it in the fridge for
half an hour first. As long as he didn't reboot, everything worked "fine".
Does that count?

~~~
raverbashing
That's probably a bad contact or a tiny fissure that makes contact again when
cooled

And the failed part is either only needed at boot, or once the current gets
going it keeps going until the machine is powered off again.

~~~
vanderZwan
Thanks for clearing that up - it's always bothered me what it could possibly
be (that and the "how on earth did he figure that out in the first place?")

~~~
danudey
The cold would shrink the motherboard to re-connect the contacts; it might
have been fixed by doing the old Xbox/nvidia trick of putting it in the oven
(which would soften the solder, causing it to shift and re-connect). With the
Xbox, you could apparently even just wrap it in a towel and let its own heat
do the trick.

------
Taniwha
So, not even mentioned here is metastability - basically, signals that cross
clock domains in traditional clocked logic, where the clocks are not carefully
arranged to be multiples of each other, can end up being sampled just as they
change. The result is a value inside a flip-flop that is neither a 1 nor a 0 -
sometimes an analog value somewhere in between, sometimes an oscillating mess
at some unknown frequency. In the worst case this bad value can propagate into
the chip and cause havoc, a buzzing mess of chaos.

In the real world this doesn't happen very often, and there are techniques to
mitigate it when it does (usually at a performance or latency cost). Core CPUs
are probably safe, since they're all one clock domain, but display
controllers, networking, anything that touches the real world has to
synchronize with it.

For example, I was involved in designing a PC graphics chip in the mid '90s.
We did the calculations around metastability (we had 3 clock domains and 2
crossings) and worked out that our chip would suffer a metastability event
(which might be as mild as a burble on one frame of a screen, or as bad as a
complete breakdown) about once every 70 years. We decided we could live with
that, since these were running on Win95 systems - no one would ever notice.

Everyone who designs real-world systems should be doing that math - more than
one clock domain is a no-no in life-support-rated systems - your pacemaker,
for example.

~~~
caf
If a failure mode were likely to happen once every 70 chip-years of operation,
then it seems like, if you sold a few hundred thousand chips, you would expect
several instances of that failure mode to occur across the population of chips
every day?
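
Rough numbers, just to make the scale concrete (the fleet size here is an
assumption, not something from the thread):

    # One metastability event per 70 chip-years, across an assumed fleet
    # of 300,000 chips powered on around the clock.
    chips = 300_000
    mtbf_years = 70

    events_per_year = chips / mtbf_years
    print(events_per_year)            # ~4286 events/year across the fleet
    print(events_per_year / 365)      # ~12 events/day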

~~~
Taniwha
simply yes - but as I mentioned in our case by far the most most were going to
be pixel burbles - you'd likely see one in the lifetime of your video card -
the chances of the more serious sort of jabbering core sort of meltdown are
much less likely - we design against them - but, one has to stress, not
impossible.

You can design to be metastability tolerant - use high-gain, high clk->Q flops
as synchronizers, use multiple synchronizers in a row (trading latency for
reliability), do things to reduce frequencies (run multiple synchronizers in
parallel, synchronize edges rather than absolute values, etc.) - but in the
end, if you're synchronizing an asynchronous event you can't engineer
metastability out of your design. You just have to make it "good enough" for
some value of good enough that will keep marketing and legal happy.
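
Roughly speaking, each extra synchronizer flop in the chain adds close to a
full clock period of settling time, so under the same exponential resolution
model as above:

    MTBF_{n+1} \approx MTBF_{n} \cdot e^{T_{clk}/\tau}

which is why a second flop in series buys so much reliability for only one
extra cycle of latency.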

It's our dirty little secret (by 'our' I mean the whole industry)

------
pedrocr
It would be awesome if companies like Google would calculate MTBF statistics
on components. They've done it for disks and it would be great to extend it to
CPUs and memory modules. They're probably in a better position than even Intel
to calculate these things with precision.

~~~
zebra
I'm almost sure that components without moving parts will become
technologically obsolete long before they start to fail. When I buy a used
laptop I always replace the HDD and the DVD drive, and its reliability jumps
up sharply.

~~~
pedrocr
That may very well be true on average, but I'd bet there are plenty of CPUs
and memory modules that fail in the first year of use, for example. After all,
CPUs are tested and sorted into high/low performance parts, so sample
variation alone would be enough to generate some early failures.

As a consumer it's hard enough to keep up with what's reliable in hard drives.
Keeping the manufacturers honest with good stats for the most common parts
would be great.

------
rdtsc
There's an interesting anecdote Joe Armstrong likes to tell about people who
claim they've built a reliable or fault-tolerant service. They'll say "This is
fault tolerant, there are multiple hard drives in there, I have done formal
verification of my code, and so on..." and then someone trips over the power
cord and that's the end of the fault tolerance. It's a silly example - of
course they'd properly provision power for an important rack of hardware - but
the point is that, in the simplest case, the system is only as fault tolerant
as its weakest component. It's that one bad capacitor from Taiwan that might
bring the whole thing down, or just a silly cosmic ray.

One needs redundant hardware to provide certain guarantees about the service
being up. This means load balancers, multiple CPUs running the same code in
parallel and comparing results, running on separate power buses, different
data centers, different parts of the world.
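
As an illustration of the "run the same code in parallel and compare results"
idea, here is a minimal majority-vote sketch in Python (purely a toy
illustration of the concept, not anyone's production design):

    from collections import Counter

    def majority_vote(results):
        # results: one output per redundant replica, e.g. three CPUs
        # running the same computation on separate power buses.
        value, votes = Counter(results).most_common(1)[0]
        if votes <= len(results) // 2:
            raise RuntimeError("replicas disagree, no majority: %r" % (results,))
        return value

    # Two healthy replicas outvote one that took a bit flip.
    print(majority_vote([42, 42, 43]))  # -> 42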

~~~
shurcooL

> different parts of the world.

Still takes just one asteroid.

~~~
oconnor0
Seems like after an asteroid hitting the earth, "server fault tolerance" is
the least of our worries.

~~~
klodolph
Yeah, but with the number of 9s you see you realize that asteroids are NOT
taken into account. For example, Amazon advertises 99.999999999% durability
for a given year for S3 objects. This is just stupid. An extinction-level
event (asteroid, global thermonuclear war, black hole) could easily wipe out
ALL data on S3. We know that mass extinctions have occurred about once every
100 million years. That means that if we expect a 10^-8 chance of a mass
extinction event in a given year, Amazon would need a 99% chance of surviving
a mass extinction in order to meet average durability ratings for S3.

After a certain number of 9s you just have to smile, nod, and truncate the
number.
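
Spelling out the arithmetic behind that (using only the rates quoted above):

    # S3 advertises 11 nines of durability per object per year.
    annual_loss_budget = 1e-11        # i.e. 1 - 0.99999999999
    extinction_rate = 1e-8            # mass extinctions per year (~1 per 100 My)

    # Losses from extinction events alone must fit inside that budget:
    #   extinction_rate * P(object lost | extinction) <= annual_loss_budget
    max_loss_given_extinction = annual_loss_budget / extinction_rate
    print(max_loss_given_extinction)  # 0.001 -> data must survive 99.9% of them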

~~~
josephagoss
They really advertise 99.999999999%? Isn't that 1 byte lost for every
100,000,000 Terabytes or something silly like that?

~~~
brianpgordon
Yes:

[https://aws.amazon.com/s3/faqs/#How_durable_is_Amazon_S3](https://aws.amazon.com/s3/faqs/#How_durable_is_Amazon_S3)

I suppose that if they ever lose an object then they can say "well we warned
you that you might lose an object every hundred million years."

------
bcoates
Thread context:
[https://lkml.org/lkml/2007/5/11/179](https://lkml.org/lkml/2007/5/11/179)

------
heaviside
This study by Microsoft Research is interesting:

"Cycles, Cells and Platters: An Empirical Analysis of Hardware Failures on a
Million Consumer PCs"

[http://research.microsoft.com/apps/pubs/default.aspx?id=1448...](http://research.microsoft.com/apps/pubs/default.aspx?id=144888)

------
sytelus
If MTBF is such a big issue, would it ever be possible to build a spacecraft
that travels between the stars and is still able to communicate? I guess hats
off to the designers of Voyager and the other spacecraft whose MTBF seems to
have crossed 36+ years for many components, including the CPU and power
supply. But for interstellar craft even that MTBF seems VERY low. And,
seriously, an MTBF of 5 years for a desktop seems like a joke when lots of
mechanical components with moving parts actually last longer.

~~~
nwh
Spacecraft and rovers use ridiculously armoured, redundant systems to get past
the fact that they would fail quite regularly in such a hostile environment.
The Curiosity rover in 2001 uses what would normally be quite an outdated
132Mhz CPU that's been specially shielded to achieve the reliability the
program needs; even then there's two redundant systems that do health checks
on one other to avoid bit flips. Even with all of that, they're running on
only one CPU and trying to diagnose why the first one failed.

It's probably not fair to compare the MTBF of specialised hardware to the $35
CPU I bought at the retailer down the street either; the RAD750 processors in
Curiosity cost almost a quarter of a million dollars each.

[http://en.wikipedia.org/wiki/Comparison_of_embedded_computer...](http://en.wikipedia.org/wiki/Comparison_of_embedded_computer_systems_on_board_the_Mars_rovers)

[http://en.wikipedia.org/wiki/Curiosity_rover#Specifications](http://en.wikipedia.org/wiki/Curiosity_rover#Specifications)

[http://en.wikipedia.org/wiki/Radiation_hardening#Radiation-h...](http://en.wikipedia.org/wiki/Radiation_hardening#Radiation-hardening_techniques)

Though that said, Voyager is still happily running on its 8064 words of 16-bit
RAM, which is something.

~~~
MichaelMoser123
[http://history.nasa.gov/computers/Ch6-2.html](http://history.nasa.gov/computers/Ch6-2.html)
Very interesting article on the computer system of the Voyager. It turns out
most of the system is not powered most of the time, even the component that
does the health checks - it's called the CCS.

"The frequency of the heartbeat, roughly 30 times per minute, caused concern
[176] that the CCS would be worn out processing it. Mission Operations
estimated that the CCS would have to be active 3% to 4% of the time, whereas
the Viking Orbiter computer had trouble if it was more than 0.2% active15. As
it turns out, this worry was unwarranted."

They use DMA a lot: instruments write to memory, and occasionally the CPU is
turned on and picks up the new values. They also had to cope with the fact
that the memory degrades, so the system needs to adapt to working with less of
it. The bus is 16 bits wide, but the processor actually works on 4 bits at a
time, so an addition takes 4 cycles. CPU registers are stored in RAM, so
presumably they can be reassigned if a memory cell fails.
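
As a rough illustration of 4-bits-at-a-time arithmetic (a Python sketch of the
idea, obviously nothing like Voyager's actual logic):

    def add16_nibble_serial(a, b):
        # Add two 16-bit words one nibble (4 bits) per cycle, the way a
        # 4-bit-wide datapath would: four passes with carry propagation.
        result, carry = 0, 0
        for cycle in range(4):
            na = (a >> (4 * cycle)) & 0xF
            nb = (b >> (4 * cycle)) & 0xF
            s = na + nb + carry
            carry = s >> 4                      # carry into the next nibble
            result |= (s & 0xF) << (4 * cycle)
        return result & 0xFFFF, carry           # 16-bit sum plus final carry

    assert add16_nibble_serial(0x1234, 0x0FFF) == (0x2233, 0)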

Parts of the system were reused from the Viking mission. They were also
reprogramming the system in flight during the eighties! That's how they could
start the mission without having the full software on board, and the mission
was later extended thanks to reprogramming. Just for the Jupiter visit they
had 18 software updates - think about that next time a software update breaks
something on your system.

It's also a distributed system with several CPUs and some elements of
redundancy - awesome tech. I guess one day alien hackers will have fun reverse
engineering it.

~~~
MichaelMoser123
I thought of another very reliable system: deep under the sea, the NSA has
equipment splitting underwater communication lines.

Now this one has to work 24/7 in a hostile environment, has to be hidden, has
to deal with enormous quantities of data, and it costs a lot to replace or
repair, so it must be very reliable.

What is driving technological progress? Instead of a space program, we now
have political control of the Internet as the driving force. I guess that's
what they mean when they say that civilization is turning inwards ;-)

Yes, in many areas the NSA and Google are pushing the envelope: long-term data
storage, map-reduce over large data sets, AI - you name it, they have it.

~~~
nwh
I imagine under the sea is actually quite a nice place to be, if you assume
perfect waterproofing. There's little radiation penetrating the water, so
there's less chance of bit flips. You don't need to worry so much about
cooling, as the whole ocean is your heatsink. Accessibility would suck, but a
bunch of redundant hardware wouldn't be awful.

Weren't they using hardware in submarines anyway?

------
raverbashing
(Conventional) solid-state devices rarely fail - the exception being flash
memory.

Apart from electromigration issues and failures from excess
voltage/temperature, they're pretty long-lasting.

It's much easier to have a failure because of something else: capacitors
failing, oxidation, or mechanical failure (for example, from thermal
expansion/contraction).

I've seen people complaining about a dead CPU but I can't find it right now

~~~
zymhan
I actually just returned my CPU (Phenom II X4) to AMD, and they've replaced
it, but they didn't say exactly why it died. I've asked them for more details,
hopefully they can tell me.

Overall though, given how many computers I've worked with, CPU failures still
seem rarer than memory, disk, motherboard, or graphics failures. Of course it
ends up being the CPU in _my_ computer that fails -.-

~~~
raverbashing
Interesting.

How long did it work for before it died?

There is still some variance in silicon, so yours may have had a defect that
manifested itself after some time. I'm not sure whether they evaluate returned
defective chips to see what happened (or whether that's public info).

Also, the packaging is extremely complex and prone to the same kind of defects
as other PCBs in the system.

------
mrich
As a side note, the whole site is an amazing collection of wisdom and worth
bookmarking:

[http://yarchive.net/](http://yarchive.net/)

------
AnonNo15
I'd like to throw in my experience: I was in charge of 300+ x86 rack servers
and around 50 desktops for 3 years and never saw a single CPU fail, even old
Pentium 4s with dusty fans.

Disk failures are very common, followed by much rarer RAM and motherboard
failures.

I suspect server chips are rated for a 10-15 year average lifespan.

------
synthos
Soft errors are a very real property of low-voltage digital electronics. I
personally observed what could only realistically be explained as a soft error
in a unit running customer hardware in the field. A single bit was flipped in
the program memory of the embedded application and was causing the system to
malfunction in an obvious and repeatable manner. We've since added CRC
checking of the program memory and some of the static data sections to flag
this and reset in the future.
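
A minimal sketch of that kind of check, assuming a plain CRC-32 over a
snapshot of the program image (the memory-read and reset hooks are
hypothetical, named here only for illustration):

    import zlib

    def make_checker(program_image: bytes):
        # Capture a reference CRC-32 of the program image at startup.
        reference_crc = zlib.crc32(program_image)

        def check(current_image: bytes) -> bool:
            # True while the image still matches; the caller resets otherwise.
            return zlib.crc32(current_image) == reference_crc

        return check

    # checker = make_checker(read_program_memory())           # at boot
    # if not checker(read_program_memory()): trigger_reset()  # periodically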

------
lispython
There's a thread of more than 100 pages on the Apple Support website about GPU
failures after two years of use.
[https://discussions.apple.com/thread/4766577](https://discussions.apple.com/thread/4766577)

------
dspeyer
It doesn't seem worth it for Intel to measure MTBF. By the time they got good
numbers for a specific chip, they'd be trying to sell its successor.

~~~
gilgoomesh
Long-term failure rates are not usually measured in real time but in
deliberately heat-elevated environments that simulate many years of stress in
a few months. This work is essential to ensure the design decisions they've
made don't accidentally cause their chips to fail after 2 years (which might
be outside the warranty lifetime but would still result in class-action
lawsuits and horrible publicity).
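
The usual model behind that kind of accelerated aging is the Arrhenius
acceleration factor (a standard reliability-engineering formula, not something
from this thread):

    AF = \exp\left[\frac{E_a}{k}\left(\frac{1}{T_{use}} - \frac{1}{T_{stress}}\right)\right]

where E_a is the activation energy of the failure mechanism, k is Boltzmann's
constant, and the temperatures are absolute; a few months at an elevated
T_{stress} then stands in for roughly AF times as long at T_{use}.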

~~~
williadc
Intel guarantees their consumer CPUs for 3 years.

[http://www.intel.com/support/processors/sb/cs-020033.htm](http://www.intel.com/support/processors/sb/cs-020033.htm)

------
Zardoz84
I can say that the Z80 of my ZX Spectrum has kept working since 1984... And
some old K6-2 300 was still working this past year...

~~~
caf
For how much of the time since 1984 would you imagine that your Z80 has been
on and running?

------
mvanveen
My immediate reaction is to ask how this reliability characteristic of CPUs
affects critical software applications. Certainly some space missions and
medical devices out in the field must have surpassed the MTBF mark for their
given CPU deployment.

------
jokoon
I've always wondered about this: do transistors wear out over time?

Does that mean a CPU/RAM/GPU will not perform as well as when it was brand
new?

------
csmuk
Never had a CPU go on me.

RAM yes, PROMs yes, CMOS batteries yes, PSUs yes, drives yes.

They're probably the most reliable bit of a computer.

------
leokun
Nice thing about the cloud is that someone else is worrying about this for
you.

