
Why Chips Die - Lind5
https://semiengineering.com/why-chips-die/
======
MrTonyD
Seems like a very light and theoretical article. I remember reading papers
produced by IBM and HP where they analyzed chip failures - and they were much
more specific in terms of both deep analysis and proximate cause (heat being
essentially the overwhelming proximate cause).

~~~
BostonEnginerd
Do you have any references that you can share?

~~~
kev009
Wilhelm G. Spruth's "The Design of a Microprocessor" has some interesting
material on the entire lifecycle of a chip, from design to ship, including
failure analysis. I have some newer books in my library on this kind of thing
as well that I'd have to go re-familiarize myself with, but that one is
striking in that it is first-hand knowledge of a crown-jewels kind of project,
not an academic account or that of an outsider.

~~~
Ice_cream_suit
Thank you. It seems like a great read! I have obtained the PDF.

------
isodude
Electrical design seems to have a lot of dead ends where we can't go any
further. Does anyone have any insights into the current status of transistors
based on light? I would like to get started with chip design (I did microchip
programming in CS). Is it even feasible to achieve something in that field?
Are there any good places/papers to start with?

~~~
jpmattia
> _Does anyone have any insights as to what the current status is for
> transistors based on light?_

It's a good question, but there is something often ignored in press-releases
about photonic transistors: How _exactly_ does a speedup occur?

For example: Photons do not interact with each other, so a photonic transistor
will rely on electrons or holes to effect some sort of interaction between
photons. That interaction might occur in a photorefractive material, but
ultimately the photorefraction is a result of photons interacting with
electrons of the underlying material. So why are the electrons in the
photorefractive material faster than the electrons in a conventional
transistor? It also might be worth noting that the fastest fT in conventional
transistors is around 0.5 THz, so the bar is not particularly low.
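As a rough sanity check on that number (a back-of-the-envelope sketch; the only
input is the ~0.5 THz figure above), the intrinsic switching delay implied by an
fT of 0.5 THz is already a fraction of a picosecond:

    import math

    f_t = 0.5e12  # ~0.5 THz, the fastest fT in conventional transistors (from above)
    tau = 1.0 / (2.0 * math.pi * f_t)  # first-order intrinsic delay implied by fT
    print(f"intrinsic delay ~ {tau * 1e12:.2f} ps")  # ~0.32 ps

Any proposed photonic transistor has to beat that kind of timescale before it
can claim a speedup.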

FD: I have a patent on photonic transistors from Bell Labs days.

FD2: I have become an old cynic on photonic transistors.

FD3: The above is really a rant on high-speed transistors. Photonic
transistors might well be superior in other areas, e.g. quantum computing.

~~~
petra
The fT of the fastest transistor probably isn't the parameter limiting
electronic circuits - it's more about wire length and capacitance, and the
power density of a circuit. That's true both inside circuits and outside them,
at the "memory wall", which on the surface at least seems particularly well
suited to photonics.

~~~
jpmattia
> _it's more about wire length and capacitance_

At the fastest speeds, interconnect on crucial lines is engineered as a
transmission line, where the inductance balances out the capacitance. When you
look at the propagation in the transmission line, the energy of the traveling
wave is dominated by the electric and magnetic fields outside the metal. What
is not generally appreciated is that transmission-line interconnect is already
photonic, and moving at photonic speeds. As you might expect, for most
transmission-line geometries, the speed of the signal is not terribly far from
the speed of light in the material.

Moreover, the intuition that "metal wires" are slowing down the signal does not
take into account that an "all-photonic transmission" still requires some sort
of confinement. If you were to eliminate the metal, you would still need some
sort of waveguide to confine the photonic energy. Inevitably, the waveguide
will have some region of higher dielectric constant, which as you know will
slow down the propagation and also introduce some loss. As it turns out, the
final propagation speed is close to that of a transmission line.

So again, it's important to quantify _exactly_ how the speedup occurs by
switching to "photonic interconnect", because the reality is that it is
already photonic.
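To make "already photonic" concrete, here is a minimal sketch of the
propagation speed of a TEM-like wave in a few dielectrics (the permittivity
values are assumed, illustrative numbers, not from any specific process):

    import math

    C = 299_792_458.0  # speed of light in vacuum, m/s

    def prop_speed(eps_r):
        """Propagation speed of a TEM-like wave in a medium with relative permittivity eps_r."""
        return C / math.sqrt(eps_r)

    # Illustrative, assumed values:
    for name, eps_r in [("SiO2 interlayer dielectric", 3.9),
                        ("low-k dielectric", 2.7),
                        ("high-index waveguide core (n ~ 3.5)", 3.5 ** 2)]:
        print(f"{name}: {prop_speed(eps_r) / C:.2f} c")

A copper transmission line over SiO2 already runs at roughly half the vacuum
speed of light, and a dielectric waveguide that confines an optical mode is not
obviously faster, which is the point above.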

Where photonic interconnect might have had a use is where the lines are very
long, and the loss associated with a transmission line gets impractically
large due to the confining metals. At that point, it's an engineering
tradeoff: Converting to and from photons at each end of the line is non-
trivial, and not all materials lend themselves to converting photons to
electrons. There have been efforts since the 80s to put GaAs on silicon just
to address this issue. (I was part of a team that got to 100MHz for GaAs-on-
Si. The fact that GaAs-on-Si photonic transmission is still pretty much
confined to the lab tells you everything you need to know about the
manufacturability.)

~~~
Ice_cream_suit
Thank you. That was most interesting!

------
VLM
Not really what I expected. My cousin worked as a chem eng for an onshore
chip mfgr in the 70s/80s doing polymer research into chip packaging material,
and the general impression I got from informal discussion was that the main
long-term enemy was water and gas (air) infiltration, and that ceramic chips
were essentially inert compared to plastic DIP material. She was happy when
her theoretical model indicated the chip pins on the outside of the chip
would corrode off before most interiors would contaminate and fail; I have no
idea if the team working on pin metallurgy had a similar goal of not corroding
until her plastic seal failed, LOL. She also said something about EPROM UV
windows being her nemesis or impossible or something similar.

Possibly the main long-term enemy to chip life from her polymer-chemistry
perspective was not the main enemy to long-term chip life for the system as a
whole.

------
praptak
My friend studied chip design and was taught how to design for planned failure
after X years, curiously without any ethical considerations. Are these
techniques actually used in practice?

(That was my first thought after reading 'death by design')

~~~
tails4e
I work in chip design, and we certainly don't design for planned failure, but
we do design for 20-year reliability. This means we over-design to ensure
reliability up to 20 years. If a market does not need such longevity then it
would be reasonable to design for less time, to ensure you are not over-
designing. Chip design is all about trade-offs - power, area, performance,
reliability, flexibility, integration cost, high/low temperature operation,
yield, etc. - and designing for 20 years certainly impacts this trade-off.

~~~
agumonkey
What's the upper limit of longevity testing, say for space use, 100 years?

~~~
tails4e
Good question; it depends on a few factors. E.g. BTI (bias temperature
instability) is aging that alters the transistor performance over time when a
transistor has a fixed bias, so if you've loads of margin for this it should
not impact you, or you can mitigate by periodically altering the bias to
recover. TDDB (time-dependent dielectric breakdown) is an issue at higher than
normal voltages, so if the environment is controlled this can be avoided. For
high-reliability situations older process nodes are usually used, which are
less susceptible to these failures, and the environment is well controlled.
Space has other issues like a much higher probability of random bit flips, so
design for space is challenging. Note our 20-year lifetime also assumes worst-
case conditions, so I'd wager many of our chips would last much longer in
practice.
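As a rough illustration of why worst-case qualification leaves so much headroom
in practice, here is a sketch of a standard Arrhenius temperature acceleration
factor (the activation energy and temperatures are assumed, illustrative
numbers, not actual qualification conditions):

    import math

    K_B = 8.617e-5  # Boltzmann constant, eV/K

    def arrhenius_af(ea_ev, t_use_c, t_stress_c):
        """Acceleration factor between a worst-case (stress) and a typical (use) temperature."""
        t_use = t_use_c + 273.15
        t_stress = t_stress_c + 273.15
        return math.exp((ea_ev / K_B) * (1.0 / t_use - 1.0 / t_stress))

    # Assumed values: Ea = 0.7 eV, qualified at 125 C, typically run at 55 C
    print(f"acceleration factor ~ {arrhenius_af(0.7, 55.0, 125.0):.0f}x")

With numbers in that ballpark, a lifetime demonstrated at worst-case
temperature stretches out by well over an order of magnitude at a typical
operating temperature.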

~~~
new299
What does changing the bias to recover look like in practice? Any resources on
this?

~~~
tails4e
I'm not sure about resources, but I can explain the basics. BTI happens when
there is a constant voltage across the gate/source of a transistor for a long
time. An example in a digital circuit would be an inverter held in one state.
Either the PMOS or the NMOS will have a large Vgs, and over time this bias will
shift the threshold voltage (Vt). Now if this inverter's delay is critical, BTI
will cause the delay to change and could break timing. To fix it, if the
inverter is toggled after some time it will mostly recover to its normal delay.
Some people mitigate it by having very low-frequency toggling for 'off' gates,
but usually only if those gates are critical/matched. Alternatively people
just margin the timing for the worst case. Analog circuitry has to be designed
for it also; it's an issue in differential pairs, for example.
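A toy model of the effect (the power-law form is a common empirical BTI
approximation; the coefficient, exponent, and the AC-recovery factor are all
assumed, illustrative choices):

    # Illustrative only: real BTI models (and especially recovery) are more involved.
    def bti_vth_shift_mv(years, a_mv=3.0, n=0.2, ac_factor=1.0):
        """Rough power-law threshold-voltage shift; ac_factor < 1 models the partial
        recovery seen when the gate is toggled instead of held in one state."""
        hours = years * 365 * 24.0
        return ac_factor * a_mv * hours ** n

    print(bti_vth_shift_mv(10, ac_factor=1.0))  # gate stuck in one state for 10 years
    print(bti_vth_shift_mv(10, ac_factor=0.5))  # gate toggled regularly: assumed ~half the shift

The shape matters more than the numbers: a node held in one state for years
accumulates the full shift, while a toggled node recovers much of it, which is
why low-frequency toggling of otherwise-static critical gates helps.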

------
jokoon
Well, I was told once that exposing a chip to stress-generating heat (for
example, running a game with a bad-quality cooling fan) drastically reduces its
life span and will also make it slower.

Apparently chips "age", but they still function because of error mitigation,
which shows up as a performance drop.

To me it's the only explanation for why I see so many computers turn slow,
despite the fact that I have reinstalled them, cleaned them, etc. I have argued
many times against the "just put in an SSD and add RAM, Windows 10 is just
slower" line, but in my mind I just cannot explain why a newer OS becomes so
much slower, memory hungry, and unresponsive, even when running very basic
programs.

I think there is a myth that chips will always perform exactly the same as long
as they work. Chip engineering is much more complicated than it seems, and I
cannot trust any IT support person telling me to "upgrade".

I'm sure military-grade chips have different designs and cooling requirements,
which keep them from turning into obsolete consumer products after 3 years.

~~~
meuk
You are mistaken; this is not a myth. Synchronous digital chips have a clock
with a fixed frequency, and their behavior is fixed. So if it works, a chip
performs exactly the same regardless of age. If something's slower, it's
software or peripherals. Probably software (and that's why some people are so
upset: hardware keeps getting faster, yet observed performance keeps degrading.
I have to wait 10 minutes for my computer before I can start working, every
day).

~~~
olejorgenb
The cooling system ages - dust accumulates and cooling paste becomes less
efficient - which in turn can cause more frequent throttling of the CPU
(technically a peripheral to the chip).
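A toy steady-state model shows the mechanism (all numbers assumed, just to
illustrate the direction of the effect):

    def junction_temp_c(power_w, r_theta_c_per_w, ambient_c=25.0):
        """Steady-state die temperature given power and junction-to-ambient thermal resistance."""
        return ambient_c + power_w * r_theta_c_per_w

    T_THROTTLE_C = 95.0  # assumed throttle point

    print(junction_temp_c(65.0, 0.8))  # clean heatsink, fresh paste: ~77 C, below throttle
    print(junction_temp_c(65.0, 1.2))  # dusty heatsink, dried paste: ~103 C, CPU throttles

Same chip, same workload; only the thermal resistance of the cooling path has
drifted, and the observable result is lower sustained clocks.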

~~~
smaddox
This is an important and often overlooked aspect of system aging. Heat
dissipation is the primary performance bottleneck of modern CPUs (in truly
CPU-bound tasks). There's a reason liquid nitrogen/helium cooling can enable
dramatic overclocking.

