
Ask HN: Why don't transistors in microchips fail? - franciscop
Considering that a Quad-core + GPU Core i7 Haswell has 1.4e9 transistors inside, even given a really small probability of one of them failing, wouldn't this be catastrophic?

Wouldn't a single transistor failing mean the whole chip stops working? Or are there protections built-in so only performance is lost over time?
======
joelaaronseely
There is another mechanism called "Single Event Upset" (SEU) or "Single Event
Effects" (SEE) (basically synonymous). This is due to cosmic rays. On the
surface of the earth, the effect is mostly abated by the atmosphere - except
for neutrons. As you go higher in the atmosphere (say on a mountaintop, or an
airplane, or go into space) it becomes worse because of other charged
particles that are no longer attenuated by the atmosphere.

The typical issue at sea level is from neutrons hitting silicon atoms. If a
neutron hits a nucleus somewhere in the microprocessor circuitry, that nucleus
recoils, basically leaving an ionizing trail several microns in length. Given
that transistors are now measured in tens of nanometers, the ionizing path can
cross many nodes in the circuit and create some sort of state change.
Best case it happens in a single bit of a memory that has error correction and
you never notice it. Worst case it causes latchup (power to ground short) in
your processor and your CPU overheats and fries. Generally you would just
notice it as a sudden error that causes the system to lock up, you'd reboot
and it would come back up and be fine, leaving you with a vague thought of,
"That was weird".

~~~
eridius
How often does this sort of thing actually happen in real life? Or rather,
what's the chance that some given computer will experience one of these events
in its operational lifetime (or, if the chance is actually high enough, how
many such events would it be expected to see on average given a lifespan of
several years)?

~~~
sliverstorm
Somewhere in the range that your laptop will almost certainly never see even a
single event, but a very large datacenter or colo will have multiple events a
month.

There is a lot of disagreement on bitflips from ionizing radiation. They are
unequivocally real, and unequivocally very rare. Even when they do happen, a
large portion of the chip is dark a lot of the time, and a lot of the live
data in the chip is simply thrown away and never used. (Think prefetching)
Some bits, if flipped, will break something but will not corrupt the disk and
the machine will be able to recover.

Nobody really knows for certain exactly how big of a problem they are and how
often they happen- it's all statistics, and it depends on things like where on
the globe your computer is, what your building is made of, and what phase of
the solar cycle we are in. It even depends on workload. Anybody who claims to
know for certain...
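
The back-of-the-envelope arithmetic behind that range is straightforward even though the inputs are fuzzy. A sketch with an assumed, made-up soft-error rate, just to show how a per-machine "practically never" becomes a fleet-wide "every month":

```python
# Back-of-envelope upset arithmetic. The FIT number is a placeholder assumption
# (FIT = failures per 1e9 device-hours), not a measured value; real rates vary
# enormously with altitude, process node, shielding, and solar activity.

ASSUMED_FIT_PER_MBIT = 1.0       # hypothetical soft-error rate for on-die SRAM
HOURS_PER_MONTH = 730

def expected_upsets(mbits, machines, hours):
    return ASSUMED_FIT_PER_MBIT * mbits * machines * hours / 1e9

cache_mbits = 8 * 8              # roughly 8 MB of cache per CPU

# One laptop over five years: effectively never sees an upset.
print(expected_upsets(cache_mbits, machines=1, hours=5 * 8760))
# A facility with 200,000 sockets over one month: several upsets expected.
print(expected_upsets(cache_mbits, machines=200_000, hours=HOURS_PER_MONTH))
```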

~~~
morgosmaci
Try multiple times a second. This guy made a hobby of exploiting cosmic-ray bit
flips in DNS lookups: he registered bit-flipped variants of popular domains and
captured the traffic that hit them.

[http://dinaburg.org/bitsquatting.html](http://dinaburg.org/bitsquatting.html)
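
The trick behind that experiment is simple: register domains that are one bit-flip away from popular ones and log whatever traffic arrives. A rough sketch of generating the candidates (the linked write-up is more careful about which flips actually yield valid hostnames):

```python
# Generate "bitsquat" candidates: hostnames one bit-flip away from a target.
# Only keeps flips that still produce a legal hostname character.
import string

VALID = set(string.ascii_lowercase + string.digits + "-")

def bitsquats(domain):
    name, _, tld = domain.partition(".")
    out = set()
    for i, ch in enumerate(name):
        for bit in range(8):
            flipped = chr(ord(ch) ^ (1 << bit)).lower()
            if flipped in VALID and flipped != ch:
                out.add(name[:i] + flipped + name[i + 1:] + "." + tld)
    return sorted(out)

print(bitsquats("example.com"))   # e.g. 'axample.com', 'dxample.com', ...
```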

~~~
sliverstorm
Multiple times a second, if your pool of hardware is "all the internet
connected hardware in the world"! Neat experiment.

Also, FWIW that experiment will include people subject to bit errors in DRAM,
not just in the CPU- and I would even guess that bit errors are more common in
DRAM than SRAM given their electrical characteristics (a tiny floating
capacitor vs. two inverters driving each other)

------
gibrown
As a former hardware engineer who worked on automated test equipment that
tested ASICs (and did ASIC dev), there are a lot of different methods used to
avoid this.

As others mentioned, most of these problems are caught when testing the chips.
Most of the transistors on a chip are actually used for caching or RAM, and in
those cases the chips have built in methods for disabling the portions of
memory that are non-functional. I don't recall any instances of CPUs/firmware
doing this dynamically, but I wouldn't be surprised if there are. A lot of
chips have some self diagnostics.

Most ASICs also have extra transistors sprinkled around so they can bypass and
fix errors in the manufacturing process. Making chips is like printing money
where some percentage of your money is defective. It pays to try and fix them
after printing.

Also, as someone who has ordered lots of parts there are many cases where you
put a part into production and then find an abnormally high failure rate. I
once did a few months of high temperature and vibration testing on our boards
to try and discover these sorts of issues, and then you spend a bunch of time
convincing the manufacturer that their parts are not meeting spec.

Fun times... thanks for the trip down memory lane.

~~~
Taniwha
Well, not quite - people certainly add spare gates, but not for fixing
individual errors. Instead you add maybe 1% extra gates; if you find a bug in
your design you can redo the upper metal layers using the extra gates to fix
it. That changes ALL the chips you make, it doesn't fix one bad transistor in a
particular chip.

~~~
gibrown
You're right, that was inelegantly written and kinda conflates two different
things. Gates are spread around for fixing design errors as you describe.
There is also often redundant logic and memory built in to allow fixing
individual chips though.

Here's a quick presentation I found on laser repairs:
[http://www.ee.ncu.edu.tw/~jfli/memtest/lecture/ch07.pdf](http://www.ee.ncu.edu.tw/~jfli/memtest/lecture/ch07.pdf)

------
kabdib
Oh, they do fail.

The last time I worked with some hardware folks speccing a system-on-a-chip,
they were modeling device lifetime versus clock speed.

"Hey software guys, if we reduce the clock rate by ten percent we get another
three years out of the chip." Or somesuch, due to electromigration and other
things, largely made worse by heat.

Since it was a gaming console, we wound up at some kind of compromise that
involved guessing what the Competition would also be doing with their clock
rate.

~~~
zokier
It is very interesting that they do that for cheap mass-produced consumer
goods. I mean, I can understand doing such tradeoffs in very expensive stuff
that is expected to last for decades (industrial machines, space probes etc),
but that the manufacturer cares enough about the lifetime (beyond the minimum
warranty period) of their goods is somewhat surprising in this day and age.

~~~
abtinf
If you want to minimize warranty expenses in order to maintain anything
resembling profit, then you need to engineer your product so that the average
useful life is well beyond your warranty period. The math is brutal: a solid
net profit margin for a typical manufacturer is around 5-7%. Even a warranty
rate of 5% would send you _deep_ into losses. So the average durability of
your product needs to be two standard deviations above your warranty period. Of
course, not everyone takes advantage of warranties, so you might discount
durability to account for that.
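
A toy version of that arithmetic, with assumed numbers purely for illustration:

```python
# Toy warranty arithmetic with assumed numbers: a thin net margin disappears
# quickly once failed units have to be replaced under warranty.

price = 100.0             # revenue per unit sold
net_margin = 0.06         # assumed 6% net profit with zero warranty claims
replacement_cost = 120.0  # assumed cost to build, ship, and handle a replacement

for failure_rate in (0.01, 0.03, 0.05, 0.08):
    profit_per_unit = price * net_margin - failure_rate * replacement_cost
    print(f"{failure_rate:.0%} warranty failures -> {profit_per_unit:+.2f} per unit")
```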

Also, with regard to the gp specific point about the discussion being in
regard to a gaming console: they want the product to last as long as possible.
Each additional functioning unit in existence counts toward their installed base
and increases the attractiveness for third party developers.

~~~
jsprogrammer
You are assuming a normal distribution of failures. A sufficiently evil
company would design their failure curves to be as flat as possible until the
warranty period expires and then rapidly increase to 100%.

The practice still makes no real long-term sense though. What do you do after
`warranty_period` expires and no one wants to buy your products anymore?

~~~
mhb
Presumably, ceteris paribus, companies do attempt to design their failure
curves to be as flat as possible. Otherwise they are wasting money on
components which will survive longer than the whole product. (There is no
One-Hoss Shay:
[http://holyjoe.org/poetry/holmes1.htm](http://holyjoe.org/poetry/holmes1.htm))

------
ajross
Yes, they can fail. Lots and lots of them fail immediately due to
manufacturing defects. And over time, electromigration (where metal atoms in
the interconnect get pushed out of position by momentum transfer from the
flowing electrons) will slowly degrade the chip. And sometimes they fail due to
specific events like
an overheat or electrostatic discharge.

But the failure rate after initial burn-in is phenomenally low. They're solid
state devices, after all, and the only moving parts are electrons.

~~~
exDM69
I work as a software engineer for a chip manufacturer. The fab (silicon
manufacturing company) gives only a 5 year guarantee for the smartphone/tablet
chips (with presumably some allowance).

As years go by, the chip slowly degrades: some of the high performance chips
start to run at higher temperatures, draw more power, and need higher voltages.
The power management software counters this by keeping the clocks lower and the
voltages higher, accepting performance degradation over time to avoid
catastrophic failure.

When the same chips are used in products with higher reliability requirements,
they are clocked down and more conservative power management software is
utilized.
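
Not the actual power-management firmware, obviously, but a rough sketch of the general idea: as the measured timing margin shrinks with age, widen the voltage guardband and eventually give back some clock speed. All names and thresholds here are invented.

```python
# Hypothetical sketch of aging compensation in a DVFS governor.
# The sensor reading and thresholds are made up for illustration;
# real power-management firmware is far more involved.
from dataclasses import dataclass

@dataclass
class OperatingPoint:
    freq_mhz: int
    voltage_mv: int

FRESH = OperatingPoint(freq_mhz=2200, voltage_mv=900)

def compensate(point: OperatingPoint, margin_ps: float) -> OperatingPoint:
    """margin_ps: timing slack reported by an on-die aging monitor
    (e.g. a ring-oscillator-style sensor); it shrinks as transistors age."""
    if margin_ps > 50:
        return point                                                # plenty of slack
    if margin_ps > 20:
        return OperatingPoint(point.freq_mhz, point.voltage_mv + 25)  # widen guardband
    # badly aged: raise the voltage further and give up some clock speed
    return OperatingPoint(int(point.freq_mhz * 0.9), point.voltage_mv + 50)

print(compensate(FRESH, margin_ps=80))   # new chip: unchanged
print(compensate(FRESH, margin_ps=10))   # aged chip: slower clock, higher voltage
```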

disclaimer: not my area of expertise, I work on something completely different
than power management.

~~~
VieElm
> The fab (silicon manufacturing company) gives only a 5 year guarantee for
> the smartphone/tablet chips (with presumably some allowance).

I'm not sure that bodes well for smart watches selling at 4+ figures.

~~~
nhaehnle
To be fair, if you don't use the latest and greatest manufacturing processes -
which you don't really need to do in smart watches - chips can be _very_
robust and long-lasting. Plus, given the battery requirements, you don't
really want to use high-performance components in watches anyway, Apple's
ridiculous battery life notwithstanding.

As for the whole market segment of "this watch will pass through generations",
I guess the honest thing to say is that we just don't have that kind of
experience with integrated circuits yet... besides, does this type of
traditional watch never need repairs? They must have failures as well.

~~~
leoc
It's different for simple BT notification buzzers, but "maximalist"
smartwatches like the Apple Watch surely call out for the latest and greatest
semiconductor processes. They face harsh trade-offs between capability, size
and battery life, harsh enough to help make them still marginal as mainstream
consumer products, and those dilemmas would be significantly eased if
performance-per-watt and size were improved. They're also high-margin products
so manufacturing at fancy fabs should be affordable.

------
zokier
A slightly related thing is random bit errors in RAM. There was an interesting
article published a few years ago where some guy registered domains that
differed by one bit from some popular domains and recorded the traffic that
hit them. Kinda scary to think what else is wrong in your RAM then... Too bad
that ECC is still restricted to servers and serious workstations.

[http://dinaburg.org/bitsquatting.html](http://dinaburg.org/bitsquatting.html)

~~~
buren
Cool! The author of the post you linked also linked to this paper
[https://www.cs.princeton.edu/~appel/papers/memerr.pdf](https://www.cs.princeton.edu/~appel/papers/memerr.pdf).
From the paper: "..experimental study showing that soft memory errors can lead
to serious security vulnerabilities in Java and .NET virtual machines"

~~~
rgovind
The first author in the princeton paper is my brother!!

------
Nomentatus
Nearly all chips experienced transistor failures, rendering them useless, back
in the day. Intel is the monster it is because they were the guys who first
found out how to sorta "temper" chips to vastly reduce that failure rate (most
failures were gross enough to be instant, back then, and Intel started with
memory chips.) Because their heat treatment left no visible mark, Intel didn't
patent it, but kept it as a trade secret giving them an incredible economic
advantage, for many years. They all but swept the field. I've no doubt
misremembered some details.

------
nickpsecurity
They're extremely simple, have no moving parts, and the materials/processes of
semiconductor fabs optimize to ensure they get done right. The whole chip will
often fail if transistors are fabbed incorrectly, and the rest end up in errata
sheets where you work around them. Environmental effects are reduced with
Silicon-on-Insulator (SOI), rad-hard methods, immunity-aware programming, and
so on. Architectures such as Tandem's NonStop assumed there'd be plenty of
failures and just ran things in lockstep with redundant components.

So, simplicity and hard work by fab designers is 90+% of it. There's whole
fields and processes dedicated to the rest.

~~~
nhaehnle
Errata are usually caused by bugs in the logical design of the chip, not in
the manufacturing or physical behavior of transistors. If you have a source
for an errata that was issued due to systematically buggy transistors, I'd be
curious to hear that story!

~~~
nickpsecurity
Can't the erratic behavior of the transistors (eg flawed ones overheating)
cause their logical function to fail?

Past that question, I've still plenty to learn on hardware and will take your
word for it about errata sheets. Sounds right given the things described in
them.

~~~
rdc12
"Cant the erratic behavior of the transistors (eg flawed ones overheating)"

Yes but that is either a manafacturing defect (if persistant) or a transient
error, or running out of speced tolerances. Or simply details of reality.

Whereas a logical design flaw, is more the actual design/implemenation, is
fundementally wrong more akin to a bug.

~~~
nickpsecurity
I gotcha. Appreciate the tip.

------
mchannon
Generally, yes, a failing transistor can be a fatal problem. This relates to
"chip yield" on a wafer full of chips.

Faults don't always manifest themselves as a binary pass/fail result; as chip
temperature increases, transistors that have faults will "misfire" more often.
As long as that threshold temperature is high enough, these lower-grade chips
can be sold as lower-end processors that in practice never reach it.

Am not aware of any redundancy units in current microprocessor offerings but
it would not surprise me; Intel did something of this nature with their 80386
line but it was more of a labeling thing ("16 BIT S/W ONLY").

Solid state drives, on the other hand, are built around this kind of
protection; when a block wears out after too many write cycles, the controller
retires it and remaps that portion of the virtual disk to spare blocks,
diminishing its capacity but keeping the rest of the device going.
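
A toy model of that remapping idea (real flash translation layers also handle wear leveling, garbage collection, and much more):

```python
# Toy model of bad-block management: logical blocks map to physical blocks,
# and a worn-out physical block is retired by moving its data to a spare.
# Real FTLs are far more sophisticated.

class ToyFTL:
    def __init__(self, n_blocks, n_spares):
        self.map = {lb: lb for lb in range(n_blocks)}         # logical -> physical
        self.spares = list(range(n_blocks, n_blocks + n_spares))
        self.data = {}                                        # physical -> bytes

    def write(self, logical, payload):
        self.data[self.map[logical]] = payload

    def read(self, logical):
        return self.data.get(self.map[logical])

    def retire(self, logical):
        """Called when the physical block behind `logical` wears out."""
        if not self.spares:
            raise RuntimeError("out of spare blocks; capacity is now degraded")
        spare = self.spares.pop()
        self.data[spare] = self.data.pop(self.map[logical], None)
        self.map[logical] = spare

ftl = ToyFTL(n_blocks=4, n_spares=1)
ftl.write(2, b"hello")
ftl.retire(2)                    # block wears out; data migrates to a spare
assert ftl.read(2) == b"hello"   # the rest of the device keeps working
```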

~~~
ajross
> Am not aware of any redundancy units in current microprocessor offerings

Sure there are. That's why Intel sells chips with a thousand different cache
sizes, for example. Bad bit in the cache? Just turn that block off. Likewise
for whole cores in some of the bigger chips, I believe.

~~~
frik
The Xbox360 had four cores (IBM Power) but only three cores were enabled. And
older AMD CPUs were released with 6 enabled cores out of 8 cores.

~~~
GTP
Wow, that's interesting! Do you know if somebody tried to enable one of these
"extra cores"?

~~~
jotm
Haha, of course they did.

[http://www.tomsguide.com/forum/id-2142984/amd-phenom-710-4th-core-unlock.html](http://www.tomsguide.com/forum/id-2142984/amd-phenom-710-4th-core-unlock.html)

AMD used to software-disable the 4th core on the Phenoms, but then they
switched to disabling them by hardware, so that rendered any software means
useless.

AMD only disabled the cores to improve manufacturing yield - most of the time,
these cores were not working properly. But as the process got better, people
got 4 or more cores for the price of 2-3 :-D

------
RogerL
Others have answered why, here is the 'what would happen'. Heat your CPU up by
pointing a hair dryer at it (you may want to treat this as a thought
experiment as you could destroy your computer). At some point it begins to
fail because transistors are pushed past their operating conditions. Another
way to push it to failure is to overclock. The results are... variable.
Sometimes you won't notice the problems, computations will just come out
wrong. Sometimes the computer will blue screen or spontaneously reboot. And so
on. Just depends where the failure occurs, and if the currently running
software depends on that part of the chip. If a transistor responsible for
instruction dispatch fails, it's probably instant death. If a transistor
responsible for helping compute the least significant bit of a sin()
computation fails, well, you may never notice it.

~~~
fr0styMatt2
I remember playing around with overclocking my old Pentium 4 and how it would
boot fine into Windows, but then you'd run Prime95 on it and the benchmark
would start failing because the FPU was returning incorrect results.
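
Prime95 catches this because the right answers are known in advance. A much cruder sketch of the same idea: repeat a deterministic floating-point workload and flag any run that disagrees with the reference result.

```python
# Crude FPU sanity check in the spirit of a torture test: run a deterministic
# floating-point workload repeatedly and compare against a reference result.
# On healthy, correctly clocked hardware this never mismatches; on an unstable
# overclock it eventually can.
import math

def workload():
    return sum(math.sin(i) * math.cos(i) for i in range(10_000))

expected = workload()
for iteration in range(500):
    if workload() != expected:
        print(f"mismatch at iteration {iteration}: possible FPU/overclock instability")
        break
else:
    print("no mismatches observed")
```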

------
intrasight
When I was studying EE, a professor said on this subject that about 20% of the
transistors in a chip are used for self-diagnostics. Manufacturing failures
are a given. The diagnostics tell the company what has failed, and they
segment the chips into different product/price classes based upon what works
and what doesn't. After being deployed into a product, I assume that chips
would follow a standard Bathtub Curve:
[https://en.wikipedia.org/wiki/Bathtub_curve](https://en.wikipedia.org/wiki/Bathtub_curve)

As geometries fall, the effects of "wear" at the atomic level will go up.
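
For reference, the bathtub shape can be sketched as the sum of a decreasing infant-mortality term, a constant random-failure term, and an increasing wear-out term (the parameters below are arbitrary; only the shape matters):

```python
# Bathtub-curve sketch: hazard rate = infant mortality (decreasing)
#                                   + random failures (constant)
#                                   + wear-out (increasing).
# Parameters are arbitrary; only the shape is meaningful.
import math

def hazard(t_years):
    infant  = 0.05 * math.exp(-t_years / 0.5)    # early defects, fades quickly
    random_ = 0.002                               # constant background rate
    wearout = 0.0005 * math.exp(t_years / 4.0)    # aging, grows over time
    return infant + random_ + wearout

for t in (0.1, 1, 3, 5, 10, 15):
    print(f"year {t:>4}: hazard ~ {hazard(t):.4f} failures/unit/year")
```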

~~~
pjc50
I think the proportion is nowhere near that high for most ASICs - JTAG
boundary scan adds a few percent to give you this testability, maybe 1-5%.

~~~
Taniwha
Full chip scan provides logic that links EVERY flipflop in a design into one or
more scan chains (not just boundary scan, which checks pin bonding) - mostly
it's a mux on each input - and is more like 5%ish.

You test by scanning in a bit pattern, issuing a single clock and scanning out
the result.

Smart software generates the minimal set of test vectors that tests every wire
and gate between the flops.

Chip testers are expensive (million-ish) so minimising tester time minimises
chip cost - we make special testing logic for things like SRAMs.
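
A toy simulation of the scan idea: treat the flops as one shift register, shift a pattern in, pulse the functional clock once so the combinational logic is captured, then compare the shifted-out response against a known-good design. The "logic cloud" and the stuck-at fault below are invented for illustration; real ATPG tools work on the actual netlist.

```python
# Toy scan-chain test: in test mode the flops form one long shift register.
# Shift a pattern in, capture the combinational logic once, shift the response
# out, and compare it with the golden (known-good) response.

def logic(a, b, c, d):
    # Pretend combinational cloud between the flops.
    return (a ^ b, b & c, c | d, (a & d) ^ 1)

def faulty_logic(a, b, c, d):
    # Same cloud, but with one output net stuck at 0 (a manufacturing defect).
    good = logic(a, b, c, d)
    return (good[0], 0, good[2], good[3])

def scan_test(vector, dut):
    flops = list(vector)          # scan in: shift the pattern into the chain
    flops = list(dut(*flops))     # single functional (capture) clock
    return tuple(flops)           # scan out: shift the response back out

vectors = [(1, 0, 1, 1), (0, 1, 1, 0), (1, 1, 1, 1)]
for v in vectors:
    golden = scan_test(v, logic)            # expected response
    measured = scan_test(v, faulty_logic)   # device under test
    verdict = "PASS" if measured == golden else f"FAIL (got {measured}, want {golden})"
    print(v, verdict)
# Note: not every vector detects the fault, which is why the test-generation
# software works to find a small set of vectors that covers every net.
```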

------
greenNote
As stated, two big variables are clock rate and feature size, which both
affect mean time between failures (MTBF). Being more conservative increases
this metric. I know from working in a fab that there are many electrical
inspection steps along the process, so failures are caught during the
manufacturing process (reducing the chance that you see them in the final
product). Once the chip is packaged, and assuming that it is operated in a
nominal environment, then failures are not that common.

------
tzs
Speaking of the effects of component failure on chips, a couple years ago
researchers demonstrated self-healing chips [1]. Large parts of the chips
could be destroyed and the remaining components would reconfigure themselves
to find an alternative way to accomplish their task.

[1] [http://www.caltech.edu/news/creating-indestructible-self-healing-circuits-38815](http://www.caltech.edu/news/creating-indestructible-self-healing-circuits-38815)

~~~
panax
The 10 series Altera FPGAs are going to be a little like this, although it is
more for SEUs and also IP segmentation/security. You will have lots of little
islands, all which can detect configuration errors and report them, and even
recover using partial reconfiguration. Maybe someday we can make it smart
enough to replace and reroute around physically bad LEs automatically.

------
wsxcde
Others have already mentioned one failure mechanism that causes transistor
degradation over time: electromigration. Other important aging mechanisms are
negative-bias temperature instability (NBTI) and hot carrier injection (HCI).
I've seen papers claim the dual of NBTI - PBTI - is now an issue in the newest
process nodes.

This seems to be a nice overview of aging effects:
[http://spectrum.ieee.org/semiconductors/processors/transistor-aging](http://spectrum.ieee.org/semiconductors/processors/transistor-aging).

------
spiritplumber
This is why we usually slightly underclock stuff that has to live on boats.

~~~
cskau
Boats specifically because there's a higher turnaround time to replace failing
hardware?

~~~
spiritplumber
Yeah, also corrosion and voltage spikes due to old/untuned gensets or even
taking a lightning strike on the hull.

------
2bluesc
In 2011, Intel released the 6 series chipset with an incorrectly sized
transistor that would ultimately fail if used extensively. A massive recall
followed.

[http://www.anandtech.com/show/4142/intel-discovers-bug-in-6series-chipset-begins-recall](http://www.anandtech.com/show/4142/intel-discovers-bug-in-6series-chipset-begins-recall)

------
Gravityloss
They do fail. Linus Torvalds talked about this in 2007
[http://yarchive.net/comp/linux/cpu_reliability.html](http://yarchive.net/comp/linux/cpu_reliability.html)

------
jsudhams
So would that mean we need to ensure that systems in critical areas (not
nuclear or the like, but banks and other transaction-critical systems) get a
mandatory tech refresh at 4-5 years? Especially once 7nm production starts.

------
msandford
> Considering that a Quad-core + GPU Core i7 Haswell has 1.4e9 transistors
> inside, even given a really small probability of one of them failing,
> wouldn't this be catastrophic?

Yes, generally speaking it would be. Depending on where it is inside the chip.

> Wouldn't a single transistor failing mean the whole chip stops working? Or
> are there protections built-in so only performance is lost over time?

Not necessarily. It might be somewhere that never or rarely gets used, in
which case the failure won't make the chip stop working. It might mean that
you start seeing wrong values on a particular cache line, or that your branch
prediction gets worse (if it's in the branch predictor) or that your floating
point math doesn't work quite right anymore.

But most of the failures are either manufacturing errors meaning that the chip
NEVER works right, or they're "infant mortality" meaning that the chip dies
very soon after it's packaged up and tested. So if you test long enough, you
can prevent this kind of problem from making it to customers.

Once the chip is verified to work at all, and it makes it through the infant
mortality period, the lifetime is actually quite good. There are a few
reasons:

1\. there are no moving parts so traditional fatigue doesn't play a role

2\. all "parts" (transisotrs) are encased in multiple layers of silicon
dioxide so that you can lay the metal layers down

3\. the whole silicon die is encased yet again in another package which
protects the die from the atmosphere

4\. even if it was exposed to the atmosphere, and the raw silicon oxidized, it
would make silicon dioxide, which is a protective insulator

5\. there is a degradation curve for the transistors, but the manufacturers
generally don't push up against the limits too hard because it's fairly easy
and cheap to underclock and the customer doesn't really know what they're
missing

6\. since most people don't stress their computers too egregiously, the slide
down the degradation curve is slow; it's largely governed by temperature, and
temperature is driven by a) the higher voltage required for higher clock speeds
and b) higher utilization of the CPU

Once you add all these up you're left with a system that's very, very robust.
The failure rates are serious but only measured over decades. If you tried to
keep a thousand modern CPUs running very hot for decades you'd be sorely
disappointed in the failure rate. But for the few years that people use a
computer and the relative low load that they place on them (as personal
computers) they never have a big enough sample space to see failures. Hard
drives and RAM fail far sooner, at least until SSDs start to mature.
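
The temperature dependence mentioned above is commonly modeled with an Arrhenius-style acceleration factor. A sketch with an assumed activation energy (the real value depends on the specific failure mechanism):

```python
# Arrhenius acceleration-factor sketch: how much faster aging mechanisms
# proceed at a hotter junction temperature. The activation energy below is an
# assumed, mechanism-dependent value, not a datasheet number.
import math

BOLTZMANN_EV = 8.617e-5     # eV/K
ACTIVATION_EV = 0.7         # assumed activation energy of the failure mechanism

def acceleration_factor(t_use_c, t_stress_c):
    t_use = t_use_c + 273.15
    t_stress = t_stress_c + 273.15
    return math.exp((ACTIVATION_EV / BOLTZMANN_EV) * (1 / t_use - 1 / t_stress))

# Running the die at 85 C instead of 60 C ages it roughly this many times faster:
print(acceleration_factor(60, 85))
```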

------
Gibbon1
Transistors don't fail for the same reason the 70 year old wires in my house
don't fail. The electrons flowing through the transistors don't disturb the
molecular structure of the doped silicon.

~~~
jacquesm
Sorry, but that's just plain wrong:

[https://en.wikipedia.org/wiki/Electromigration](https://en.wikipedia.org/wiki/Electromigration)

At the scale of your house wiring the effect is not so noticeable but for
integrated circuits it is definitely a factor.

As for your house wiring, if it is really 70 years old you might want to worry
about the insulation, not the copper.

~~~
Gibbon1
What's wrong about it? Transistors don't work by electromigration.

~~~
panax
Electromigration will move atoms on the interconnects between transistors and
eventually cause an open.

~~~
Gibbon1
He said transistors, not ICs. And electromigration is a problem with metal
interconnects at high current densities. So you're talking about a failure type
that can occur on some integrated circuits under some conditions, or when the
designers screwed up, but which doesn't actually happen much in practice.

The person's actual question is roughly why transistors last so long compared
to other kinds of mechanisms. No one in the comments made an attempt to answer
that, at all.

------
rhino369
Extremely good R&D done by semiconductor companies. It's frankly amazing how
good they are.

------
MichaelCrawford
They do.

That's why our boxen have power-on self tests.

