
An Empirical Analysis of Hardware Failures on a Million Consumer PCs - mbafk
http://research.microsoft.com/apps/pubs/default.aspx?id=144888
======
cs702
Very useful -- I will take this analysis into account when it's time to
upgrade my current personal machine or configure the next one! Thank you for
posting this here.

The only thing I would have wanted to see but didn't in this analysis is how
failure rates vary for different types of disk subsystem -- specifically,
traditional hard drives versus the newer solid-state devices. I suspect, but
don't know for sure, that the latter have much, much lower real-world failure
rates in the first 30 days of total accumulated CPU time (TACT).

The authors openly suggest that the sharp difference in failure rates between
desktop and laptop machines may be due in part to their disk subsystems:
"Laptops are between 25% and 60% less likely than desktop machines to crash
from a hardware fault over the first 30 days of observed TACT. We hypothesize
that the durability features built into laptops (such as motion-robust hard
drives) make these machines more robust to failures in general." Alas, the
authors don't delve any further into it.

I'd like to see hard data comparing the real-world failure rates of _both_
desktops and laptops using traditional versus solid-state disk subsystems.

~~~
wazoox
So far the numbers I've seen put SSD failure rates in the same ballpark as
spinning rust, but most SSDs have only been genuinely reliable for slightly
over a year; many, many older models were absolutely terrible. Hence I think
it may still be difficult to draw reliable conclusions.

~~~
cs702
wazoox: thanks. Do you recall the source(s) for those numbers?

~~~
wazoox
So far, the best source I can recall is this study:
<http://www.tomshardware.com/reviews/ssd-reliability-failure-rate,2923.html>

~~~
AngryParsley
That study is Intel-only, but it seems to jibe with the return rates Anandtech
mentioned: <http://www.anandtech.com/show/4202/the-intel-ssd-510-review/3>

There are also numbers for hard drives:
<http://forums.anandtech.com/showthread.php?t=2147063>

Basically, Intel SSDs from a year ago are more reliable than all hard drives.
And SSDs in general are more reliable than any 2TB hard drive.

The data isn't ideal, but it's better than anecdotes. Return rates should
correlate pretty well with failure rates. If anything, return rates should
favor hard drives, since people are less likely to return a faulty cheap hard
drive than a faulty expensive SSD.

~~~
cs702
AngryParsley: the Tom's Hardware article wazoox mentioned above also has those
return-rate stats (page 3): "...returns can occur for a multitude of reasons.
This presents a challenge because we don’t have any additional information on
the returned drives—were they dead-on-arrival, did they stop working over
time, or was there simply an incompatibility that prevented the customer from
using the [device]? ... If online purchases account for the majority of hard
drives sold, poor packaging and carrier mishandling can have a real effect on
return rates. Furthermore, we also have no way of normalizing how customers
used these drives. The large variance in hard drive return rates [between data
sets] underlines this problem. For example, the Seagate Barracuda LP rises
from 2.1% to 4.1%, while the Western Digital Caviar Green WD10EARS drops from
2.4% to 1.2%..."[1]

In short, the available return-rate data is too noisy and inconsistent to be a
good proxy for failure rates.

[1] <http://www.tomshardware.com/reviews/ssd-reliability-failure-rate,2923-3.html>

------
mrb
When Microsoft, Google, or some university publish analysis of hardware
failures across large numbers of machines, they always anonymize hardware
vendors ("vendor A", "vendor B").

I understand the reasons (not alienating your hardware vendors), but will
there ever be a research group who will disclose vendor names? Heck, I would
_pay_ for this information.

~~~
wmf
IMO such information would do more harm than good. By the time they could
gather statistics, that model would be obsolete so the stats wouldn't help you
buy new equipment. But the kind of people who are still griping about the
Deathstar would use the information to troll non-stop.

~~~
boxein
But if a vendor showed consistently poor reliability in a certain category, we
could infer that its offerings in that category will also be poor in the
future.

------
wazoox
Among other interesting insights:

* a machine that crashed once is 100 times more likely to crash again; the more it crashes, the more it's prone to fail again.

* overclocking significantly reduces reliability. One CPU vendor (AMD or Intel, but unspecified) is much worse in this regard, too.

* conversely, underclocking improves reliability.

* branded computers are more reliable than beige boxes.

* laptops are more reliable than desktops.

~~~
ippisl
>> branded computers are more reliable than beige boxes.

This is meaningless.

The real question is how those compare to a beige box built from decent parts.
And Microsoft definitely has an interest in helping manufacturers of branded
computers, because piracy is more prevalent on beige boxes.

~~~
scott_s
This paper was written by researchers in Microsoft Research, and accepted in
an academic conference. It is not marketing. To be clear, you are suggesting
that the researchers were dishonest in order to help their company. I find
this unlikely.

Disclaimer: I am a researcher in a corporate lab.

~~~
ippisl
It doesn't have to be dishonesty. It might just be bias (conscious or
subconscious) leading them not to spend time on this question.

Even doctors show this kind of bias when advising people on choosing
treatments (which is a much bigger moral issue).

~~~
scott_s
Their conclusions on "white boxes" are based on relatively straightforward
statistical analysis of their data. In order for there to be bias against
white boxes, one of the following has to be true:

1. Their data collection methods are biased against white boxes. Given the
large sample size and the method of retrieving samples (automatically
generated crash reports from users), I find this unlikely. They cover this
point in section 3.2.1.

2. Their statistical analysis is flawed. I see no issues with it, nor did the
reviewers. (Otherwise it wouldn't have been accepted.)

3. They lied. I am most skeptical of this one.

It's disingenuous to gesture at researchers and allege bias based on their
employer without actually saying _how_ they are biased. Doing so is not valid
skepticism, but prejudice.

~~~
dredmorbius
My suspicion is that branded hardware manufacturers are uniformly reasonably
good in quality. White box vendors may vary widely: some are good, but some
(many? few?) are really, really bad. This can skew data.

It's also possible that it's getting more difficult to accurately spec
systems, to enforce vendor quality (Dell gets a bad batch of drives, they can
1) detect it and 2) tell the vendor to stuff it, Ahmed's Boxez'R'Us may not
have that leverage or depth of experience), and to do burn-in testing of their
own systems.

That said, I've had good and bad experiences with big-name and white box
vendors alike.

~~~
flomo
One question is what happens to a component that fails a major OEM's QC
standards. Where does it go? Into the garbage? Or into the whitebox channel?

For example, some have speculated that "Gamer RAM" with mean looking heatsinks
is actually poorer quality stuff that requires additional cooling to work
correctly.

------
ChrisNorstrom
I'm having a hard time coming to terms with "Laptops less likely to crash from
hardware fault than desktops."

Everything we've learned from experience, surveys, and PC World magazines has
shown the opposite. Heat kills hardware, and laptops have their hardware
packed together so closely that it generates lots of heat. Back then I
remember reading something like 1 in 4 laptops fail in the first 3 years,
which was very believable; at the time I was in college for game design &
development. All 80 guys in our class had laptops from HP (with, get this,
Pentium 4s in them). Those laptops had a LOT of problems. They were basically
portable heaters.

So I guess laptops now have either much better cooling, much cooler CPUs or a
combination. OR PCs are just terribly cooled.

~~~
TazeTSchnitzel
Just a theory, but perhaps laptops have a shorter lifespan in general?

~~~
mrb
No. They only took into account the machines' first 30 days of service life:

 _"we only count failures within the first 30 days of TACT (total accumulated
computing time), for machines with at least 30 days of TACT."_

------
josephturnip
Interesting stuff. You can improve reliability by running your system at a
lower speed. Here's a blog post with a summary of some of the conclusions of
the paper above: <http://grano.la/blog/2012/06/improve-the-reliability-of-your-pc/>
(Disclaimer: that's my company's blog)

One question I still have is whether the switching of CPU frequencies itself
has any effect, or if only the average speed correlates with reliability.
Anecdotal evidence suggests it is just the average speed, but it could be an
area for further research.

------
kristaps
Interesting; too bad the power supplies could not be controlled for in their
setup, as a wonky power supply can unleash all kinds of gremlins that look
like failures in components down the line.

~~~
Avitas
I have been telling my staff for probably 15 or so years that the likeliest
causes of PC failure are (in the following order):

* Power supply

* Hard drive

Ranking near these are cooling fans (for the CPU and case): sleeve-bearing
fans first, with ball-bearing fans a close second.

In rough order, I would say that the following is my estimation of other
common component failure sources:

* Removable drives (floppy, optical, etc.)

* Video card (if separate)

* Motherboard

* RAM

* CPU

These are for our corporate PCs which have been Compaq, Dell, IBM, HP, Lenovo
and a few other brands.

For our brand name and whitebox server hardware, it's pretty much the same...
if something is going to fail, it's going to be a power supply or a hard
drive. In fact, I don't remember a single server motherboard, RAID
controller, RAM stick, CPU, or other component ever going bad in a server.

I wonder why they would leave out statistics relating to power supplies when
they are, in my experience, the component with the greatest failure rate.

------
Zenst
Interesting read, though why can't Microsoft just tell me that my CPU, HD, or
memory is borking and suggest I RMA it, instead of asking every time whether I
have applied the latest updates, which I get to click through unhelpfully?

The most important thing I have found for PC reliability, above everything
else, is a good PSU. It really does make a difference on the hardware side, as
you give your kit cleaner power. Add a UPS/surge protector and you can double
the lifetime of your kit. At least in my experience the difference has been
noticeable.

------
Hoff
The copy at Microsoft Research is offline.

Here's another copy of the paper:

<http://eurosys2011.cs.uni-salzburg.at/pdf/eurosys2011-nightingale.pdf>

~~~
rdmirza
Mirror. (For people who tried to ctrl-f and found nothing, like me.)

------
acqq
There are a lot of insights in the paper, but I'd really like to know about
this:

"The table shows that CPUs from Vendor A are nearly 20x as likely to crash a
machine during the 8 month observation period when they are overclocked, and
CPUs from Vendor B are over 4x as likely"

Obviously that's a 5x difference between Intel and AMD in the probability of
an unstable system when overclocked, but they don't say which vendor is which.
Does anybody know?

~~~
zipdog
When Google did a massive analysis of hard-drive failures, they also didn't
publish manufacturer names, because they felt it would tarnish a company's
name when the problem might just be one production run.

~~~
starpilot
Actually, they were just keeping their cards close to their chest:

> However, in this paper, we do not show a breakdown of drives per
> manufacturer, model, or vintage due to the proprietary nature of these data.

<http://research.google.com/archive/disk_failures.pdf>

------
hollerith
The result most surprising to me is that laptops are between 25% and 60% less
likely than desktop machines to crash from a hardware fault during the first
30 days' worth of measurements.

The much larger weight and volume of desktops would seem to make them easier
to cool.

~~~
josephturnip
That caught my eye too. Perhaps it's because laptops are designed with being
carried around in mind? Or, to draw on the other conclusions from the paper,
because laptops generally have lower-frequency chips (and for that matter,
slower memory and disks too)?

~~~
Zenst
Laptops also have, in effect, a cleaner power source with a UPS built in, so
any power glitches get filtered on a laptop as standard, compared to a
desktop.

At least I have found desktops with a good-quality PSU and UPS noticeably more
reliable than those without.

~~~
jseliger
_Laptops also have, in effect, a cleaner power source with a UPS built in, so
any power glitches get filtered on a laptop as standard, compared to a
desktop._

I've heard this theory before but never seen it substantiated. Do you have a
link to any research or articles that explains how this works?

~~~
kyberias
How it works? Laptops have a battery. Hence they are less likely to power off
suddenly when mains voltage drops. And powering off suddenly increases the
probability of some electronics in the system failing. No need to research,
just simple electronics and logic.

------
hollerith
Too bad CPU temperature was not part of the collection of data used in the
study.

------
latch
Is there a compelling reason for this to be a PDF rather than HTML? I'm
genuinely curious.

~~~
scott_s
As mbafk and josephturnip state, they simply put online the same copy that was
published in the conference. Academic conferences typically publish papers in
PDF form.

But, that doesn't actually answer your question, which I think is "Barely." I
feel silly preparing PDFs for publication when I know that most people will
read it on their computer, not print it out. Many conferences no longer even
have an actual, physical copy of the proceedings, instead just giving out USB
sticks with all of the PDFs. (Which is what we want anyway.)

I think it would be fantastic if there were a standard HTML5 template that
researchers could use to publish their papers. There are LaTeX-to-HTML
compilers, but I've never been impressed with the results. I think people
outside of academia would be more likely to read our papers if they were in
HTML rather than PDF.

~~~
Dn_Ab
I am outside academia, but I read a lot of papers. For most people outside, I
suspect the biggest problem is paywalls, not format. If you don't read a
paper because it is a PDF and not HTML, then you weren't really interested.
With Chrome, it is not even an annoyance the way it was when you had to load
the Adobe plugin.

One reason for PDFs is, as you mentioned, that LaTeX-to-HTML results are
typically poor. Diagrams are another difficulty without an easy HTML solution.
Other reasons I prefer PDFs: though I never print, I often save papers to
disk, since I don't always find a paper when I go searching for it a second
time (especially if it is months or even years later); there is a real benefit
to being able to read a paper offline, since I don't always have a net
connection when I want to read; and lastly, if you have an ereader such as the
Kindle, PDFs render well on it.

~~~
scott_s
Paywalls are probably a bigger issue, but it's still friction. Anyway, modern
browsers give the option to save all of the images along with the HTML file.

------
stcredzero
I've always said that smart hardware tinkerers _underclock_. It produces less
heat, and results in a quieter machine. I always suspected it improves
reliability.

~~~
wmf
Or you could save money and just buy a lower bin.

~~~
stcredzero
Well, because heat dissipation is proportional to the square of the voltage,
you end up giving up a little bit of performance but save a whole lot of heat.
In experiential terms, you never miss performance but often notice a whole lot
less fan noise.
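The square-of-the-voltage claim is the standard CMOS dynamic-power model, P ≈ C·V²·f. A quick sketch of the trade-off (the capacitance, voltages, and clocks below are made-up round numbers for illustration, not specs of any real CPU):

```python
# CMOS dynamic (switching) power: P = C * V^2 * f.
# All numbers below are illustrative assumptions, not real CPU specs.

def dynamic_power(c_farads, volts, hertz):
    """Approximate CMOS switching power in watts."""
    return c_farads * volts ** 2 * hertz

C = 1e-9  # assumed effective switched capacitance, in farads

stock = dynamic_power(C, volts=1.20, hertz=3.0e9)  # stock clock and voltage
tuned = dynamic_power(C, volts=1.05, hertz=2.7e9)  # -10% clock, undervolted

perf_lost = 1 - 2.7e9 / 3.0e9    # 10% less performance
power_saved = 1 - tuned / stock  # roughly 31% less heat to dissipate

print(f"performance lost: {perf_lost:.0%}")
print(f"power saved:      {power_saved:.0%}")
```

In this toy model a 10% clock cut with a matching undervolt trims about a third of the switching power, which is why the fan noise drops far more than the benchmark scores.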

Buying from a lower bin, you're getting a crappier processor, which might give
you less latitude to save heat. This would probably be worth measuring and
writing an article about. Also, I tend to buy lower clocked processors as it
is.

~~~
wmf
I think it's likely that all lower bins are artificial, so e.g. underclocking
a 2.4 GHz down to 2.0 is probably exactly the same as if you bought the 2.0.
But yeah, it would be worth measuring.

~~~
stcredzero
Ah, I see. I wrote underclock. It's really _undervolting_ that gets you the
big win thermally. Underclocking should be done just as a means of achieving a
greater undervolt. I just have these two things in the same mental bin.

~~~
wmf
When you underclock properly (with SpeedStep) it also lowers the voltage...
probably to the same voltage that the lower-bin processor would use.

~~~
stcredzero
There's not just one voltage here. It's a curve. I suspect that the better the
processor turned out, the more favorable your curve turns out to be.
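One way to picture that curve: every chip has a minimum stable voltage at each frequency, and a better-binned part sits on a lower curve. A toy linear model (the coefficients are invented for illustration; real curves are nonlinear and vary per chip):

```python
# Toy minimum-stable-voltage curve: v_min(f) = a + b * f.
# Coefficients are invented for illustration; real chips have
# per-part, nonlinear curves set by the silicon lottery.

def v_min(freq_ghz, a, b):
    """Minimum stable core voltage (V) at a given frequency (GHz)."""
    return a + b * freq_ghz

good_bin = dict(a=0.60, b=0.15)  # a favorable curve
poor_bin = dict(a=0.70, b=0.18)  # a less favorable curve

# At the same 2.4 GHz clock, the better part needs less voltage,
# so it has more undervolting headroom before losing stability:
v_good = v_min(2.4, **good_bin)
v_poor = v_min(2.4, **poor_bin)
print(v_good, v_poor)
```

Under these assumptions, underclocking buys extra undervolting headroom along the slope of the curve, and a chip that "turned out better" gives more headroom at every frequency.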

