
Aging Problems at 5nm and Below - teruakohatu
https://semiengineering.com/aging-problems-at-5nm-and-below/
======
bunnie
Anyone else catch this gem:

 _But if you are creating a chip that other people will create software
for...then you don’t really know._

This is in the context of assessing usage duty cycles of circuits for aging
purposes.

I can imagine an era of exploits that rely on "aging out" paths that were
assumed to be rarely used. Like rowhammer, but persistent -- fire up a process
on a random cloud instance, run a tight loop of code to wear out an
exploitable path in e.g. SGX, rinse lather repeat until you have the coverage
you need...

~~~
est31
I think in the long term, we'll move away from the concept of general-purpose
processing units toward some units that have the right architecture and wear
prevention to execute untrusted code, and others which require less power but
wear out more quickly. There will be domains for all those types of computing.

~~~
Kye
That already happens, but they always seem to rejoin the CPU. GPUs are headed
that way. Math coprocessors went back a long time ago.

[https://en.wikipedia.org/wiki/Coprocessor](https://en.wikipedia.org/wiki/Coprocessor)

~~~
est31
You can have the CPU share the die with the secure cores, but you'd need to
deal with a lot of complexity, like memory access pattern leakage due to a
shared memory bus. The fewer things you share, the less the performant parts
need to care about security, and the less the secure parts need to care about
performance.

------
teruakohatu
It would be unfortunate if future process improvements resulted in fragile
CPUs and GPUs. I can sort of imagine Nvidia rubbing their hands with glee at
the prospect of non-overclocked GPUs aging prematurely, killing used sales and
forcing data centers to upgrade to more expensive compute cards.

~~~
LordHeini
Let's say they could get away with it.

What would be a reasonable time span for this?

Maybe 5 years? I have a GTX 970 in my PC which is 5 years old by now. While
the card is fine by itself, it is too slow and is thus getting replaced in the
near future and moved into an office PC.

But which data center uses 5 year old graphics cards?

It is safe to assume that dedicated compute cards get replaced from time to
time anyway.

Moore's law is still too strong.

~~~
discodave
AWS still has the m1 EC2 instance type, first launched in... 2006, on their
pricing page.

Datacenter hardware can stick around for a looooong time if the people running
the applications on top don't feel like migrating to new hardware.

~~~
ghaff
What makes you think that instance type is still running on the same hardware
it was in 2006?

~~~
ceejayoz
That they encourage folks to migrate off them, and have a much more limited
supply of the older instance families, seems to imply it.

------
Animats
_“However, densely packed active devices with minimum safety margins are
required to realize advanced functionality requirements. This makes them more
susceptible to reliability issues caused by self-heating and increasing field
strengths.”_

We're going to have very short-lived electronics.

The Ford EEC IV engine control unit from the mid-1980s was designed for a 30
year lifespan. Which it delivered. Can the industry even make 30-year parts
any more?

For automotive, a key issue may be to turn stuff off. All the way off. Vehicle
lifespans are only around 6,000 hours of on time. But too much vehicle
electronics runs even when the vehicle is off.

~~~
tails4e
We make ICs that target a 20-year lifespan. A lot of cost and over-design goes
into achieving that MTBF, especially when a device's typical usable lifespan
is much shorter. An ECU should last 30 years, but a GPU, probably not. Will
you be gaming on a 30-year-old card?

~~~
simion314
It would suck if you own a special console/electronic device and it just
expires. The number of people affected is not large, but it would still suck.
Also, these days TVs and other electronics have some computer inside, and this
"expiration" would hurt the second-hand market and increase e-waste even more.

~~~
isoprophlex
Agree completely, but we are already entering a market of rapidly expiring
products and 'e-waste as a business model'.

See: DRM'ed water filters, IoT devices dying when their companies shut down,
mountains of unusable electric scooters left over when a startup folds.

~~~
swinglock
This is the market that ramps up waste; we must exit it instead, by law if
necessary.

------
threatripper
I would expect the problems to become both exponentially harder and
exponentially more expensive to solve as we scale down. But vice versa, the
process knowledge we gain from 5 nm chips might improve the lifetime of 7 nm
chips by orders of magnitude. And again, using the 4/3/2 nm process knowledge
we might obtain very durable 5 nm chips in a few years.

~~~
Causality1
Is there much of a market for medium-detail processes? We have the
desktop/server/workstation/smartphone market using the very smallest detail
processes, with maybe some categories using the previous-generation node. Then
we have the embedded market which is everything from cheap-as-dirt 350nm
microcontrollers to 28nm ARM chips. Nobody really wants the chips from five
years ago, currently the 18-22nm nodes. They're too expensive to buy by the
million and not shiny enough to compete against the newest stuff.

~~~
ksec
There is plenty of market for the nodes in between 28nm and 12nm. It is simply
a natural progression between cost, features, performance and economics of
chips, especially as these are the last few nodes before you move up to EUV.

The A10 / T2 used in many of the Apple appliances are 16nm, along with dozens
of WiFi, modem and Ethernet controllers, ASICs / FPGAs, etc. These are easily
100M units a year. As long as the cost benefit fits their volume, they will
move to the next node.

------
nullifidian
I've always wanted to know whether server CPUs have lower base/boost
frequencies than HEDT CPUs precisely because they are designed to work 24/7 at
100% load for a decade, while consumer-grade CPUs have looser reliability
requirements and would fail at a high rate in this regime (even without
overclocking), so they can allow for higher frequencies, provided it's not
24/7 at 100% load.

The conventional explanation is that server CPUs are designed for power
efficiency and higher frequencies are wasteful/inefficient. But I wonder if
reliability is also a factor, since there are applications for high
single-threaded performance, yet even special high-frequency server CPUs never
come close to consumer-grade ones.

------
DavidSJ
_“For example, microprocessor degradation may lead to lower performance,
necessitating a slowdown, but not necessary failures. In mission-critical AI
applications, such as ADAS, a sensor degradation may directly lead to AI
failures and hence system failure.”_

This is probably a superficial analogy, but this made me think of people
suffering from dementia in old age.

~~~
Taek
Dementia is actually a decent metaphor for what happens to an aging chip.

Internally, a chip is extremely dependent on gate timings. As the chip decays,
certain gates or wires will start to slow down or speed up, and the chip gets
sloppier.

Often, you can address the issue by slowing the chip clock rate down, because
this gives you a much wider margin for error on your gate timings.

Certain operations will be impacted sooner and more heavily than others.
Eventually, the timings get bad enough that certain operations (or even the
whole chip) just break altogether.
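
To put rough numbers on that margin (all values invented for illustration),
here's a small C sketch assuming a 1 GHz part whose critical path takes 0.90
ns when new and slows down 15% with age:

    #include <stdio.h>

    int main(void) {
        double t_crit_new = 0.90e-9; /* critical path delay when new (s); made up */
        double period     = 1.00e-9; /* 1 GHz clock period */
        double aging      = 1.15;    /* assume paths slow 15% with age; made up */

        double t_crit_aged = t_crit_new * aging;
        printf("slack new : %+.0f ps\n", (period - t_crit_new) * 1e12);
        printf("slack aged: %+.0f ps\n", (period - t_crit_aged) * 1e12); /* negative: timing fails */
        printf("max stable clock aged: %.0f MHz\n", 1e-6 / t_crit_aged);
        return 0;
    }

The aged path takes ~1.04 ns, so the 1 GHz clock now violates timing, but
dropping to ~966 MHz restores positive slack. That's the sense in which a
sloppier chip can keep working at a lower clock.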

------
lnsru
The dream of cellphone manufacturers: the phone dies beyond repair right after
the 2-year contract. What could be better!?

Edit: my wife still uses an iPhone 6, and I recently built a dedicated Linux
machine with my 8-year-old FX-8350 CPU and an ancient mainboard. The world of
disposable electronics is getting closer and closer. I bet the exact lifetime
can be precisely simulated with software from the companies mentioned in the
article.

~~~
ChuckNorris89
I feel you and it's not really a reason to throw it away, but that FX CPU uses
more power than modern CPUs for similar performance.

I also like my old car and I'll keep it as long as I can but I'm aware a new
one would be much more fuel efficient.

~~~
Marsymars
> I also like my old car and I'll keep it as long as I can but I'm aware a new
> one would be much more fuel efficient.

From what I see of comparable-model cars, efficiency gains in engines,
aerodynamics, etc. have mostly been offset by safety, emissions, and QoL
improvements that have increased weight. For instance, by spec, the most fuel-
efficient Corolla was a 1984 model.

~~~
willis936
I am skeptical of that last sentence. Do you have a source?

~~~
Marsymars
Toyota Corolla Gas Mileage: 1979 – 2013:
[https://www.mpgomatic.com/2007/11/04/toyota-corolla-gas-mileage/](https://www.mpgomatic.com/2007/11/04/toyota-corolla-gas-mileage/)

Do note the note at the bottom of the table: "Note: the EPA tweaked their
testing procedure, starting with the 2008 model year, with the end result
being that the 2008 MPG estimates are now lower than previous years"

Other compact cars have followed a similar trend where they've gotten much
heavier and safer, but fuel economy (in terms of fuel per distance) peaked or
stagnated.

------
tlb
For a few applications, like compiling and training, I would be overjoyed to
buy a CPU that was 50% faster, but only lasted 18 months.

But for most, I care a lot more about lifetime.

It'll be interesting if this ends up being one of the major trade-off axes,
along with power and cost.

~~~
chime
I would gladly pay $100k+ for a single 10 GHz processor with a comparable bus
speed / RAM combo. We've got a single-threaded legacy DB that runs way too
many things for a large company. Upgrading to a new system is going to cost a
million+. Running a liquid-cooled 5 GHz CPU on a gaming rig right now, with an
identical box on standby in case of H/W issues. Could easily justify spending
$200k every year if it doubled the DB performance.

~~~
DaiPlusPlus
Wouldn’t it cost less than $100k to fix the project to use any commodity SQL
database (presumably over ODBC)?

Or are we talking about an unsalvageable 4GL system that had its last update
15 years ago that does everything (storage engine, OLTP, forms UI framework,
security, and reports)?

~~~
chime
Close. Running a 2009 version of a commercial DB that does everything, with no
cheap upgrade path.

~~~
DaiPlusPlus
It’s Progress or FoxPro, right?

------
londons_explore
Could the next computer virus be one which repeatedly adds 1111111 to 1111111,
wearing out the wires that do the carries, since they were only designed for
typical adding, not worst-case carrying on every operation?
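
Purely as a hedged sketch of the idea (whether any real adder wears measurably
faster under this pattern is exactly the open question), a worst-case carry
loop could look like:

    #include <stdint.h>

    int main(void) {
        /* Illustration only: adding 1 to an all-ones value forces the carry
           to ripple through every bit position on each iteration. volatile
           keeps the compiler from optimizing the loop away. */
        volatile uint64_t acc = 0;
        for (;;) {
            acc = UINT64_MAX; /* 0xFFFF...FF: every bit set */
            acc += 1;         /* wraps to 0, exercising the full carry chain */
        }
        return 0;
    }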

~~~
ChuckNorris89
The days where viruses were designed to destroy your hardware for fun are
gone. Now they're just after your personal data or money.

~~~
mikro2nd
Unless you're the military.

------
ruslan
In the event that CPUs become a spare part, computing devices should be
designed to allow fast repair or even hot-swap. Similar to HDDs, one could
maintain a constant supply of CPUs in store, allowing worn-out parts to be
swapped quickly without devices being rebooted or service interrupted. This
would solve all the longevity issues while supporting sales, at the expense of
extra work/labour. Win-win.

~~~
WrtCdEvrydy
I wonder if we'll get to the point where the motherboard has a basic 14nm
processor and the processor you buy is a co-processor on a 5nm or 3nm process.

Anything goes wrong, Windows keeps running on the 14nm processor and you can
swap the co-processor.

~~~
buran77
I think you're better off having a spare machine and your data in the cloud
than to have 2 different CPUs on the same machine. It would massively increase
the price and complexity, reduce reliability, etc. It's 2 machines in one box
somehow. In the end your machine _will not_ serve its purpose with the backup
CPU or you would have bought one with that as a main CPU from the start.

We're going in the direction where the hardware is a commodity to replace as
you see fit and the data stays somewhere safe to be used regardless of device.
Phone, tablet, PC, console, etc. can consume the same data.

It's not ideal, because I'd like hardware to be reliable, not something I can
expect to die on me when I need it most, but it's pretty clear we're going
that way outside of niches, with most devices being exceedingly hard or
impossible to upgrade or repair.

------
etaioinshrdlu
How much redundancy do modern CPU chips have? If a single transistor goes bad
in say, a cache area, is the CPU guaranteed to be a brick?

~~~
teruakohatu
I think the usual reply to that question is that it depends on which
transistor.

Not quite the same thing, but there was a Raspberry Pi board on the homepage
earlier today which had been hacksawed in half and still worked. The person
who did that has also successfully cut some microprocessors in half, and they
still worked, because he was cutting off bits he did not plan on using and
which are not required for the rest of the device or chip to function.

I am sure an AMD Ryzen CPU would work without a core or two. In fact they
often disable cores before shipping by zapping a fuse. But if the same
transistor on every core somehow blew, then you would probably be left with a
dead CPU.

~~~
starky
To be fair, the guy who cut the RPi merely cut off the USB ports, the RJ45
jack, the Ethernet controller, and maybe a couple of caps. I don't think
chopping off a couple of low-pin-count peripherals far away from the SoC and
DRAM counts as "cutting the board in half".

~~~
teruakohatu
I did say it wasn't exactly the same thing, but at least with some previous
*Lake Intel CPUs the graphics took up quite a lot of the die. If you are using
an external GPU, quite a lot of transistors could fail and you could still use
the CPU, just like hacksawing off the Ethernet chip.

------
LargoLasskhyfv
So what? I look at the MTBF/TBW (terabytes written) of solid state disks and
shrug.

From a very zoomed-out view (and a layman's at that!), the whole semiconductor
industry seems like system gastronomy. The few main players are the McDonald's
and Burger King of the field, and there is only so much you can do with
similar equipment arranged in the same ways. While I'm sure the two could
produce the same things if given access to the same ingredients, they aren't
allowed to. Same for Coke vs. Pepsi.

Anyways, if you want to have it different, then you either need different
systems arranged in different ways processing different ingredients, or you're
stuck wailing _Oy vey!_

I'd much prefer some cheering for alternatives like

[1]
[https://duckduckgo.com/?q=Semefab+Wafertrain+Bizen+Searchforthenext](https://duckduckgo.com/?q=Semefab+Wafertrain+Bizen+Searchforthenext)
and

[2]
[https://duckduckgo.com/?q=Yokogawa+minimal.fab](https://duckduckgo.com/?q=Yokogawa+minimal.fab)

 _Oh YAY!_

That would at least enable the likes of KFC and Pizza Hut in comparison. And
fizzly Bundaberg!

 _Oy!_

edit: [3] [http://www.besang.com/](http://www.besang.com/)

------
londons_explore
Some Intel Atom processors die after sending more than a few terabytes over
USB over their lifetime. You can easily kill a laptop by leaving the webcam on
for a few weeks, and then magically all USB stops working and there is no fix
other than soldering on a new CPU.

I wonder if this is the reason?

~~~
hwillis
The USB issues were related to a critical flaw in the LPC clock, according to
the Intel errata. The expected lifetime USB traffic for the affected
processors was 50 TB, assuming the link is active at most 10% of the time. The
errata implies lower-voltage systems aren't affected.

To me that says simple design flaw. Something like overdriving a transistor to
get more performance out of it, without realizing what relied on it. That will
cause slightly different failure conditions from electromigration.

[https://cdrdv2.intel.com/v1/dl/getContent/600834](https://cdrdv2.intel.com/v1/dl/getContent/600834)

[https://www.anandtech.com/show/11110/semi-critical-intel-atom-c2000-flaw-discovered](https://www.anandtech.com/show/11110/semi-critical-intel-atom-c2000-flaw-discovered)

------
ethhics
I do similar reliability work for power electronics (first job after BSEE). I
wouldn’t have expected lifetimes at the working stress to be a concern, but I
had not fully appreciated how large the E-fields were getting in these devices
(about 4x larger than a 1 kVrms isolator).

------
rwmj
The article mentions in passing _" high elevation (data servers in Mexico
City)"_, but doesn't say why. Low air pressure because that makes cooling
systems less efficient?

~~~
jfoutz
That stuck out to me as well.

I kinda suspect more ambient radiation. There's less atmosphere to catch stray
particles, and the smaller gates are more susceptible: smaller interactions
can cause random errors.

Anyway, far far outside the scope of my expertise. But that's my guess.

~~~
freeqaz
Could that actually affect the life of the chip, though? It's annoying to have
your system crash, but that isn't a longevity problem by itself. Does a stray
ray cause the gate to wear prematurely?

~~~
jfoutz
Yeah. I've done a little bit (tiny, inconsequential, not an expert) with
satellites. When cosmic rays have enough energy, they'll move atoms around.
That's bad for gates. A 486, with old big gates, doesn't get hurt much; maybe
you have to reboot every few days. Tiny little 2-atom gates are more delicate.
Think plastic vs. glass drinkware: with plastic you can have a few incidents
and it'll still work as a glass, while with glass it's a lot easier to chip an
edge or shatter and not be usable anymore. Again, not my area. But I suspect
altitude is really bad for tiny systems because they lose so much "free"
shielding.

------
ashtonkem
I keep getting the feeling that we’re reaching the end of the line for our
current CPU technology, and that new fundamental research is needed if we wish
to continue improving.

~~~
Retric
Even then, physics has some real limits. If 14nm is ~1,000 atoms wide, you
could at most double chip density 20 times. But that's really optimistic;
physical limitations, rather than design or manufacturing ones, will likely
bite well before then. Such as: what's the resistance of a wire 1 atom wide?
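
The arithmetic behind that ceiling, for anyone checking: doubling areal
density shrinks the linear pitch by a factor of sqrt(2), so 20 doublings
shrink it by (sqrt(2))^20 = 2^10 = 1024, i.e. roughly the 1,000x needed to
take a ~1,000-atom width down to about 1 atom.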

~~~
hwillis
14 nm in actual distance is only ~50 silicon atoms wide (at an atomic radius
of ~1.4 angstroms). 14 nm process FinFETs have a fin width of ~8 nm, and in
general 14 nm process transistors have a gate length of ~20 nm.

> Such as what’s the resistance of a wire 1 atom wide?

Quite high; you also get a lot of leakage since electrons are basically
scattering elastically all the time. You can't use copper for a wire like
this, you need special low-scattering conductors.

~~~
Retric
Unless I am missing something, the general transistor density for 14nm is
vastly worse than that.

An A9 has 2 billion transistors on a 96 mm^2 chip. That's ~45,000 transistors
in a row ~= 10 mm = 10,000,000 nm, or 35,000,000 atoms. That works out to 1
transistor per ~777x777 atoms, except that's across multiple layers, so
hand-wave ~1,000 atoms.
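
A minimal C sketch of that back-of-the-envelope estimate (same inputs as
above; ~0.29 nm per silicon atom follows from the 10,000,000 nm = 35,000,000
atoms figure):

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        double transistors = 2e9;        /* A9: ~2 billion transistors */
        double area_mm2    = 96.0;       /* A9 die area */
        double nm_per_atom = 1e7 / 35e6; /* ~0.286 nm, per the figures above */

        double per_row  = sqrt(transistors);    /* ~45,000 transistors in a row */
        double side_nm  = sqrt(area_mm2) * 1e6; /* ~9.8 mm die edge, in nm */
        double pitch_nm = side_nm / per_row;    /* average transistor pitch */

        printf("pitch: %.0f nm ~= %.0f atoms\n", pitch_nm, pitch_nm / nm_per_atom);
        return 0;
    }

This prints a pitch of ~219 nm, or ~767 atoms, in line with the ~777x777
figure.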

~~~
hwillis
For many reasons, transistor density is not a terribly useful metric:
[https://en.wikichip.org/wiki/mtr-mm%C2%B2](https://en.wikichip.org/wiki/mtr-mm%C2%B2)

Not least because transistors are not nice neat npn regions. They have
multiple gates, all different gate sizes, and any number of inputs, outputs
and regions.

Intel manages to cram 20 million SRAM cells per mm^2 at 14nm; each cell has 6
transistors. That's roughly three times higher than their reported density of
45 million transistors per mm^2. More to the point, the transistor density
really isn't that important. For one thing, there are three regions and four
terminals in every transistor, so it doesn't make much sense to collapse all
that to a single atom.

It also doesn't make much sense because it wouldn't offer much benefit: those
regions and the space between transistors are pretty minor issues compared to
the increased switching efficiency from shrinking the gate, which is the truly
important part and the limiting feature. Electricity moves at a significant
fraction of c, and light moves 30 millimeters every clock at 10 GHz. That is
enough to completely cross a CPU multiple times, which a signal should never
need to do.
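
Checking that last figure: at 10 GHz, light covers c / f = (3 x 10^8 m/s) /
(10^10 1/s) = 0.03 m = 30 mm per cycle; on-chip signals propagate a good deal
slower than c, so the usable distance per clock is smaller still.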

------
abjecton
I bet there are some people that see this as a feature.

------
jokoon
It's weird that I've read so many people say "your computer isn't slow because
of transistors degrading, but because of other things like software/driver/OS
stuff".

~~~
thebruce87m
As long as you mean “slower than it was”, that statement holds mostly true.
Your CPU, RAM and GPU should perform the same on day 1 and day 10000, as long
as they are still functional. Any “degradation” won't make them slower, just
non-functional.

The complication comes from the SSD, where the flash cells have a feedback
loop for operations such as erasing, which can take longer as the cells
degrade.
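
A toy simulation of that feedback loop (all numbers invented; real NAND
controller firmware is vendor-specific): erase works as pulse-and-verify, and
worn cells respond less to each pulse, so the loop runs longer as the drive
ages.

    #include <stdio.h>

    /* Toy model: each erase pulse removes a fraction of the trapped charge,
       and wear reduces how much a pulse removes, so worn cells need more
       pulses (i.e. more time) before they pass verification. */
    static int pulses_to_erase(double wear) { /* wear in [0, 1] */
        double charge = 1.0;                  /* normalized trapped charge */
        int pulses = 0;
        while (charge > 0.05) {               /* verify threshold */
            charge *= 0.5 + 0.4 * wear;       /* worn cells shed less per pulse */
            pulses++;
        }
        return pulses;
    }

    int main(void) {
        for (double wear = 0.0; wear <= 1.0; wear += 0.25)
            printf("wear %.2f -> %2d erase pulses\n", wear, pulses_to_erase(wear));
        return 0;
    }

A fresh block erases in 5 pulses here, a fully worn one in 29, which is the
kind of slowdown the parent is describing.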

~~~
jokoon
CPUs have error correction, which will mitigate transistor aging and make the
CPU work slowly instead of not at all.

It will not "perform the same". At some point there is a noticeable slowdown,
and even though Wirth's law is at work, it's not the entire story. Heat will
also make any chip age faster.

This article talks about aging under 5nm, but aging is already an issue above
5nm. Read the article.

[https://en.wikipedia.org/wiki/Electromigration#Practical_imp...](https://en.wikipedia.org/wiki/Electromigration#Practical_implications_of_electromigration)

> The complication in this comes from the SSD

I always experienced slowdowns on computers that did not have SSDs. Software
is not always the only problem.

~~~
rcxdude
Error correction in CPUs is generally limited to the cache, and its incidence
is recorded: if something had failed permanently such that the error-correction
path was being taken constantly, you would be able to observe it.
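
For example, on Linux the EDAC subsystem exposes those recorded counts through
sysfs. A minimal C sketch (assuming a single memory controller, mc0, and an
EDAC driver loaded for the platform; cache/CPU errors surface via the
machine-check machinery instead):

    #include <stdio.h>

    int main(void) {
        /* Corrected (ce) and uncorrected (ue) error counters maintained by
           the kernel's EDAC subsystem; steadily growing counts here are how
           a constantly-exercised correction path would show up. */
        const char *paths[] = {
            "/sys/devices/system/edac/mc/mc0/ce_count",
            "/sys/devices/system/edac/mc/mc0/ue_count",
        };
        for (int i = 0; i < 2; i++) {
            FILE *f = fopen(paths[i], "r");
            if (!f) { perror(paths[i]); continue; }
            long count;
            if (fscanf(f, "%ld", &count) == 1)
                printf("%s: %ld\n", paths[i], count);
            fclose(f);
        }
        return 0;
    }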

Absent a mechanism which reduces the clock speed of the CPU when it becomes
unstable, there's no reasonable way in which failures in the CPU will result
in it running slower. Such a mechanism doesn't generally exist: modern CPUs
regulate their clock, but only in response to a fixed power and temperature
envelope. The recent iPhone throttling is the only notable case where anything
was done automatically in response to an unstable CPU, and that consisted of
applying a tighter envelope if the system reset.

This is reflected in the experiences of those who run older hardware with
contemporary software: it generally still works just fine at the speed that it
used to.

~~~
jokoon
From the article:

> “For example, microprocessor degradation may lead to lower performance,
> necessitating a slowdown, but not necessary failures

~~~
rcxdude
It may be necessary for the micro to run slower in order to be stable, but to
my knowledge no system for making that adjustment automatically exists in the
vast majority of systems. The main problem is that it's hard to detect: how do
you tell if the CPU is on the margin of failing without a huge amount of extra
circuitry? It can be hard enough to detect that it has had a fault. It's not
due to lack of interest: such sensing approaches have been patented before,
but they don't seem to have made it out of the R&D lab.

~~~
jokoon
"to my knowledge"

CPU technology is quite arcane, very high level, there are so many patents, IP
money and a lot of secrecy involved, since CPU tech is quite a strategic one
for geopolitical power. Do you work as an engineer at intel, ARM, AMD? On chip
design?

> How do you tell if the CPU is on the margin of failing

It's not about failing, it's about error detection. Redundancy is a form of
error detection: if several gates disagree on a result, they have to redo the
work. That's one simple form of error detection.

CPUs never really fail; they just slow down, because gates generate more and
more errors, requiring recalculation until the detected error is finally
corrected. An aged chip will just have more and more errors, which will slow
it down. That is why old chips are slower, independently of software.

A CPU that is very old will be very slow, though, or will just crash the
computer again and again, so hardware people will just toss the whole thing,
since they're not really trained to diagnose whether it's the CPU, the RAM,
the capacitors, the GPU, the motherboard, etc. In general they will tell their
customers "it's not compatible with new software anymore". In the end, most
CPUs get tossed out anyway.

It's also a matter of planned obsolescence. Maintaining sales is vital, so
having a product with a limited lifespan is important if manufacturers want to
hold the market.

~~~
rcxdude
> CPU technology is quite arcane and very high-level; there are so many
> patents, so much IP money, and so much secrecy involved, since CPU tech is
> strategic for geopolitical power. Do you work as an engineer at Intel, ARM,
> AMD? On chip design?

If such a mechanism existed, it would be documented at least at a high level,
and its effects would be observable under controlled tests. Neither is the
case, in contrast to the power and temperature envelopes I mentioned. There is
no actual evidence that aged chips operating at the same clock rate perform
computation more slowly; your subjective experience that hardware 'slows down'
does not count.

> It's not about failing, it's about error detection. Redundancy is a form of
> error detection: if several gates disagree on a result, they have to redo
> the work. That's one simple form of error detection.

> CPUs never really fail; they just slow down, because gates generate more and
> more errors, requiring recalculation until the detected error is finally
> corrected. An aged chip will just have more and more errors, which will slow
> it down. That is why old chips are slower, independently of software.

This is not how consumer CPUs work. It's not even how high-reliability CPUs
necessarily work (some work through a high level of redundancy, but they don't
generally automatically retry operations when a failure happens: that's a
great way of getting stuck). Such redundancy is so incredibly expensive from a
power and chip-area point of view that no CPU vendor would be competitive in
the market with a CPU that worked like you describe. If a single gate fails in
a CPU, the effects can range from unnoticeable to halt-and-catch-fire.

The only error correction which is present is memory-based, where errors are
more common and ECC can be implemented relatively cheaply compared to
error-checking computations.

~~~
jokoon
> If such a mechanism existed, it would be documented

Why would it be? It's internal functionality, and CPUs usually have a 1-year
warranty or so, and I'm not sure they really have guaranteed FLOPS, only
frequency, I guess. If it's tightly coupled to trade secrets, I would not
expect it to be documented. I also doubt that you could find everything you
want to know in CPU documentation.

> There is no actual evidence

The wikipedia article I mentioned, physics is enough evidence.

> If a single gate fails in a CPU

I did not say fail; I meant "miscalculate". There is a very low probability of
it happening, but it can still happen because of the high quantity of
transistors, hence error correction.

> Such redundancy is so incredibly expensive from a power and chip-area point
> of view

Sure it is, so what? At some point all CPUs will need it, and it becomes
necessary. There are billions (I think?) of transistors in a CPU.

~~~
rcxdude
Documentation is light on details, but both major CPU vendors give extensive
documentation on the performance attributes of their processors, such as how
many cycles an instruction may take to complete, and neither sees fit to
mention that instructions 'may take arbitrarily longer as the CPU ages'. Not
to mention, these performance attributes are frequently measured by
researchers and engineers, and an effect such as instructions taking more
cycles on one sample compared to another from the same batch has yet to be
observed (and it's notable, and noted, when timing does differ, e.g. from
different steppings or microcode versions). At least one of the many, many
people who investigate this in great detail would have commented on it.

The Wikipedia article you linked makes zero mention of redundant gates as a
workaround for reliability issues. The closest it comes is that designers must
take it into account, but that is design at the level of the geometry of the
chip, not its logic. It doesn't even make good sense as a strategy: the extra
cost of redundant logic to work around reliability issues on a smaller node
would outweigh the advantages of that node.

One of the greatest things about modern CPUs is how reliably they do work
given that you need such a high yield on individual transistors.

~~~
jokoon
Thanks for convincing me!

