But if you are creating a chip that other people will create software for...then you don’t really know.
This is in the context of assessing usage duty cycles of circuits for aging purposes.
I can imagine an era of exploits that rely on "aging out" paths that were assumed to be rarely used. Like rowhammer, but persistent -- fire up a process on a random cloud instance, run a tight loop of code to wear out an exploitable path in e.g. SGX, rinse lather repeat until you have the coverage you need...
1. If someone figures out how to focus execution that you'd normally assume is distributed onto a single area. The simplest example would be to write to the first cache line over and over again; that kind of thing has extensive protection, but the principle is reasonable. I'm not sure what you'd use for this.
2. Aging related issues can be worked around; if a subunit loses performance you can give it more time to settle and reorder the pipeline around it. That would open up entirely new classes of timing attacks. You run a quick scan of every combination of execution paths (to find the biggest overlap of underperforming transistors) and you'll be able to make attacks using any number of extremely hard to predict timing combinations.
If you get an amazing price for a part that fails often, it might cost you way more in the long run.
If I'm buying one drive or CPU, I might pay a premium to drop the failure rate from 4% to 1%. If I'm buying dozens to hook together in a fault-tolerant system, I'll go for the cheap one and buy a few extras.
Let's say $1000 per CPU and 1000 CPUs: that's $1,000,000 in upfront cost, and at 4% failure per year about 200 dead CPUs over 5 years, so $200,000 in losses.
1%: 50 dead CPUs in 5 years, $50,000 in losses.
In this scenario you have $150,000 to save, or to spend on better equipment.
Also an important note: on top of that you suffer downtime losses and the manpower cost of taking the server out and swapping parts. If your e-commerce site goes offline, that might cause significant monetary loss.
For any large-scale purchase it's all about the numbers game.
For single purchases, paying extra to get from 4% to 1% seems excessive (depending on the cost increase).
At these percentage levels it's a roll of the dice whether it dies or not.
So if it's $850 for the 4% failure chip, and $1000 for the 1% failure chip, I'll probably buy the cheaper one in bulk. $850k upfront and $34k in replacement, vs. $1000k upfront and $10k in replacement.
There's extra manpower, sure, but even if it costs a hundred dollars of labor per replacement the numbers barely budge.
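The bulk-purchase arithmetic above can be sanity-checked with a quick script (prices and failure rates from the comments; the $100 labor figure is the hundred dollars per replacement mentioned above):

```python
def fleet_cost(n_units, unit_price, failure_rate, labor_per_swap=100):
    """Upfront cost plus expected replacement cost over the period."""
    upfront = n_units * unit_price
    expected_failures = n_units * failure_rate
    return upfront + expected_failures * (unit_price + labor_per_swap)

cheap  = fleet_cost(1000, 850, 0.04)   # $850k upfront + 40 swaps = $888,000
pricey = fleet_cost(1000, 1000, 0.01)  # $1000k upfront + 10 swaps = $1,011,000
```

Even with labor priced in, the cheap 4% part wins by over $100k in this scenario, which is the point: the numbers barely budge.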
> downtime losses
In a big system you shouldn't have those from a single server failure! Downtime losses are the point I was making. As someone buying just one drive or chip, failure costs me massive amounts beyond the part itself. But if I can buy enough for redundancy, then those problems become much much smaller.
If you're running things off one server, then apply the single device analysis, not the bulk analysis.
Point is though, they price the cost of replacement into those guarantees. Doesn't mean the hardware will last longer, just that support & replacements are priced in.
What would be a reasonable time span for this?
Maybe 5 years?
I have a gtx970 in my pc which is 5 years old by now. While the card is fine by itself, it is too slow and thus getting replaced in the near future and moved into an office pc.
But which data center uses 5 year old graphics cards?
It is safe to assume that a dedicated compute card gets replaced from time to time anyway.
Moore's law is still too strong.
Regarding Moore's law, there's only so many possible shrinks left to go. Once we hit that wall the incentive to be on the latest node is significantly reduced. Combine that with associated lifetime reductions and I think larger nodes might even end up preferable in many cases.
Datacenter hardware can stick around for a looooong time if the people running the applications on top don't feel like migrating to new hardware.
An i7 6700 is already 5 years old. That's most certainly not an outdated "can throw away" CPU. Neither will a 3rd gen. Ryzen 4 years from now.
The strong temperature dependence means this will likely look like "5 years with low fan noise and 30 years with high fan noise".
I can see how it could be cheaper to use cheap air cooling on the chips and efficient, central room cooling.
But seriously, I think it would be a viable solution for server farms, but it hasn't really caught on there yet. Probably still a matter of price. There are some theoretical applications with heat exchangers though. If we could recycle some of that heat, computing would be much more efficient in general.
We're going to have very short-lived electronics.
The Ford EEC IV engine control unit from the mid-1980s was designed for a 30 year lifespan. Which it delivered. Can the industry even make 30-year parts any more?
For automotive, a key issue may be to turn stuff off. All the way off. Vehicle lifespans are only around 6,000 hours of on time. But too much vehicle electronics runs even when the vehicle is off.
The interesting question is going to be how much faster the unreliable parts are than the more reliable nodes once we're at the limit. If they're only 30% faster then in many cases that's a small price to pay for reliability. If they're 30 times faster then what you want is an increase in modularity and standardization, so that the chip that wears out every year can be a cheap fungible commodity part that you can replace without having to buy a new chassis, screen or battery.
Transistor scaling (dennard scaling) already ended over a decade ago. Even up to that point there have been a variety of reliability challenges at each shrink, some revisited again and again requiring different techniques at each level.
We haven't been making significant progress over the last decade in terms of process size, instead we're just shuffling towards the edge of a cliff (single atoms), the closer we get the more extreme the challenges will get with less return.
We don't have a long road ahead with gradually decreasing reliability, we are already at the end of the road, we aren't going to get much closer to the crumbly bits if it's not worth it.
> This has failed, but transistor scaling [...] has not
Not entirely true: around the same time Dennard scaling ended, transistor channel length scaling also slowed down significantly (although I don't know if they are directly related). It's not a short topic and I'm no expert, but the summary is that process node name values no longer directly relate to transistor dimensions, and scaling has become more "strategic".
In my mind the end of Dennard scaling marked the beginning of the end of the road. The challenges are becoming more fundamental and yet process node reduction no longer yields the same benefits - meanwhile it's becoming realistic to count cross sectional areas in terms of numbers of atoms... the road really is short.
Not that we won't find another road ;)
Transistors are still shrinking by some measures, but not in the same simple way that one dimensional process node names make it appear.
Although that is still a proxy for transistor scale - what's interesting is your plot shows that in terms of density transistor scale is still managing to follow a log scale trend, in spite of the fact transistor scaling itself stopped being uniform long ago.
This is no longer true. We are barely still reducing transistor scale (and no longer uniformly), and we no longer gain any of the other benefits due to the breakdown of Dennard scaling, coincidentally at the same time it became difficult to continue to reduce transistor channel scale.
That doesn't really make sense.
It was actually about transistor count and cost.
> faster clock at the same power density
That's Dennard scaling as someone already mentioned to you
> Transistor scaling (dennard scaling)
Those are two different things.
You also said:
> We haven't been making significant progress over the last decade in terms of process size,
This is false by any reasonable definition, since as has already been said by multiple people, transistors are a fraction of the size they were ten years ago and the density has gone up considerably.
> coincidentally at the same time it became difficult to continue to reduce transistor channel scale.
Frequencies did not go up, but transistors shrunk, I'm not sure why you keep trying to state otherwise. How do you explain the enormous rise in transistor count and process shrinkage over the last decade? You are literally stating something that is false and not even backing up what you are saying with any information at all.
As I have already admitted in the sibling thread, it is not as simple as transistor scaling stopping outright, I was clearly _wrong_ to suggest that... but it's also untrue to suggest transistor scaling has not stopped in any way - this is essentially the point I am still trying to make for you: Features are getting stuck due to various fundamental limits, and we no longer get the same benefits as a result, and this all started at the same time we stopped getting significant speed improvements (the breakdown of Dennard scaling).
>> rate of transistor count
> That doesn't really make sense. It was actually about transistor count and cost.
Yes, but that is a by-product. It is fundamentally about the exponential growth rate of transistor count per unit area, which is only achieved sustainably (until fundamental limits) via transistor scaling. When that scaling is uniform we get not only reduced cost per transistor but also higher speeds:
> Moore's law is the observation that the number of transistors in a dense integrated circuit (IC) doubles about every two years. 
>> faster clock at the same power density
> That's Dennard scaling as someone already mentioned to you
Dennard scaling is a formalization of what Moore already observed in the very same paper a decade prior on page 3 under "Heat problem":
> shrinking dimensions on an integrated structure makes it possible to operate the structure at higher speed for the same power per unit area.
This is the context of Moore's law, the only mechanism at the time that he was alluding to, uniform transistor scaling with all the benefits (including what is known as Dennard scaling today). This context is commonly lost by people who quote it today.
> You said:
>> Transistor scaling (dennard scaling)
> Those are two different things.
Yes they are, as I already admitted in the sibling subthread; I was technically incorrect to mix them. Nevertheless they are closely related: Dennard scaling has always been tied to transistor scale, and it broke down at the same time uniform transistor scaling, aka "classic transistor scaling", stopped.
> You also said:
>> We haven't been making significant progress over the last decade in terms of process size,
> This is false by any reasonable definition, since as has already been said by multiple people, transistors are a fraction of the size they were ten years ago and the density has gone up considerably.
I've already admitted this is inaccurate in the sibling thread. You can read my response there. However progress has been stifled to say the least.
>> coincidentally at the same time it became difficult to continue to reduce transistor channel scale.
> Frequencies did not go up, but transistors shrunk, I'm not sure why you keep trying to state otherwise. How do you explain the enormous rise in transistor count and process shrinkage over the last decade? You are literally stating something that is false and not even backing up what you are saying with any information at all.
I am not disputing that transistors have shrunk, but not all features of transistors have shrunk at the same rate. I'll add emphasis: channel lengths have become more difficult to reduce in scale.
Here's a random source I found:
> when we approach the direct source-drain tunneling limit, we could move to recessed channel devices and use channel lengths longer than the minimum feature size. This could allow us to continue miniaturization and increase component density. 
i.e. channel lengths will _not_ be 5nm
Densities continue to increase while transistors can no longer be uniformly shrunk, in the same way a square can be made into a rectangle and have a smaller area while not reducing the maximum edge length. However it's intuitive to see that while this will continue to increase density, it will not necessarily increase speed - and it will not be long until we hit limits on scaling the other features.
See: DRM'ed water filters, iot devices dying when their companies shut down, mountains of unusable electric scooters left over when a startup folds.
Later edit: although, considering that most stuff built today wouldn't work without an internet connection, the problem isn't just hardware expiring.
The big issue is when the mini computer inside your device is a small part of the device. For example, my parents' TV now permanently shows a red message in the corner because the software fails to validate something; the solution appears to be to have it re-flashed with updated firmware to remove those checks, or to replace the shoddy, non-essential component that is not used at all. So I am afraid that in the future you will need to replace/repair your TV/fridge/car because a chip like the memory in it is one of the ones that "expire" (we had the case where Tesla cars' SSDs would go bad because of extreme logging).
This affects servers, personal computers, and phones. That's pretty much it. <10 nm is only used for the applications most demanding of computer performance.
You know what the limiting factor on your phone and laptop's lifespans are? Because it's certainly not the CPU.
Even for the relatively narrow applications in which aging will have noticeable impacts, I doubt it will be a big deal. >90% of people will get a new laptop, phone, or PC long before they see clock reduction due to aging. There are plenty of people who are still on CPUs as old as sandy bridge, but even that is inflated due to the awkward phase where it seemed like parallel utilization would never improve. 15-20 years from now the number of devices using CPUs from the 2020s will be at least as small as the number of people currently using devices from 2010. Not that those devices aren't important, but "we're going to have very short-lived electronics" is just not true (or at minimum, it's already true).
On top of that aging in future processors will be a gradual reduction in performance and efficiency, not sudden failure, and it'll still take many years. Design issues are the real concern here; something like underspec'd AVX instructions that suddenly see massive increases in utilization. Aside from that CPUs will not be noticeably different, and the vast majority of consumer electronics and infrastructure will not even be on a relevant technology.
But I think their real problem will be that this will cause even more chips to be discarded during quality assurance, and that rate will bleed into electronics prices. If we're looking at 20% or 30% more expensive electronics _beyond_ the pricing warranted by the performance improvements themselves (otherwise eating into their profit margins from R&D costs etc.), will consumers keep purchasing these without batting an eye?
But a sense of scales is also necessary here. If a modern CPU lasts for 15 years and this might be a reduction to 8-10 years, I think many will swallow that. If we're however talking reducing lifespans from 10 years to four, well then we are going to see complaints.
Absolutely. Will they make it? Absolutely not.
Planned obsolescence is not a conspiracy theory, it's a widely popular business strategy.
The spare-parts business is an additional revenue stream for car manufacturers. They fought hard and dirty trying to monopolize it, by lobbying for laws harmful to the public.
People are attracted to novelty. While you and I might value the old and functional things that we have and use everyday there is a substantial number of people who will get rid of things like toasters, food mixers, etc., simply because they don't match the colour scheme of the kitchen; the suppliers only care about those people, not us because we don't make any money for them.
If you want products that last you will somehow have to moderate the human desire for novelty. I think that this could be done but it would mean dismantling our educational systems and replacing them with education in the old fashioned sense of producing people who can think and analyse, people who have a sense of history and an understanding of how the world came to be the way it is.
But what use are such people to a society based on, not merely conspicuous, but also excessive consumption?
What most people do care about is price. And what most suppliers care about is.. yeah, it's a race to the bottom.
I would assume that people would also care about quality and durability if it were something they were informed about. Like, if you're buying a toaster, one sells for $30 and the label says that will break in two years.. the other sells for $40 and the label says it will last for five years or more. I'm pretty sure most people would pick the latter, unless they're exceptionally poor or there's some other major aspect of the design or functionality that draws them to the cheaper option.
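As a toy illustration of that choice (prices and lifetimes from the example above):

```python
def cost_per_year(price, lifetime_years):
    """Amortized cost of an appliance over its expected lifetime."""
    return price / lifetime_years

cheap_toaster   = cost_per_year(30, 2)  # $15.00 per year of service
durable_toaster = cost_per_year(40, 5)  # $8.00 per year of service
```

The more expensive toaster is nearly half the price per year of use, which is exactly the information the label doesn't give you.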
Of course, this is not the world we live in, and toasters in $25 to $100 range can last a while or not. Quality or durability is not on the label, and price is not an indication of quality. The trend seems to be that lots of "race to the bottom" companies fill their lineup with premium priced products that are made of the same crap quality as their bottom tier, but they have some silly gimmick (bluetooth in a toothbrush? goodness gracious).
There's a proxy for that: the manufacturer warranty period.
My current laptop has a 5 year warranty with on-site repair. I chose it over a cheaper laptop from the same manufacturer with only a 3 year warranty with on-site repair. Part of the reason I chose the more expensive model is that the longer warranty means I won't have to replace it for another 5 years.
FWIW I think you are right, we are trained and raised by society to buy and buy all the time.
I know people who think I'm odd for having a Nokia 6 which is nearing 3 years old (but has current Android) and no intention of replacing it - it's partially the circles I move in via work but everyone earns decent money and replaces their phone every year when the new models come out.
More generally I don't replace something unless it's uneconomic to repair, I'm rocking my 2012 road bike, I just stripped and rebuilt it for about 50 quid - I've friends who buy a new one every 18mths.
I like fixing things maybe that is the difference.
Happily using a 5 year old iPhone, which is still receiving software updates. Just because companies release new phones doesn't make your old one obsolete.
Probably not. The problems at each scale are very different and tend not to have much overlap. Process limitations also exist regardless- say you figure out that patterning transistors at 3 nm increases lifetime in a way that is applicable to 7 nm. 7 nm still can't take advantage of it as long as it requires 3 nm scale patterning, if there is even overlapping technology. Even if it requires adding a new step, say a wafer treatment that is necessary at 3 nm but optional at 7 nm, the economic benefit from increased lifespan is probably not worth retooling the old equipment.
The A10 / T2 used in many Apple appliances are 16nm, along with dozens of WiFi, modem, and Ethernet controllers, ASICs / FPGAs, etc. These are easily 100M units a year. As long as the cost benefit fits their volume they will move to the next node.
The conventional explanation is that server cpus are designed for power efficiency and higher frequencies are wasteful/inefficient. But I wonder if reliability is also a factor, since there are applications for high single threaded performance, but even special high-frequency server cpus never come close to consumer-grade ones.
This is probably a superficial analogy, but this made me think of people suffering from dementia in old age.
Internally, a chip is extremely dependent on gate timings. As the chip decays, certain gates or wires will start to slow down or speed up, and the chip gets sloppier.
Often, you can address the issue by slowing the chip clock rate down, because this gives you a much wider margin for error on your gate timings.
Certain operations will be impacted sooner and more heavily than others. Eventually, the timings get bad enough that certain operations (or even the whole chip) just break altogether.
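A toy model of that relationship (all numbers invented for illustration, not from any real chip): the clock period has to cover the worst-case gate-path delay plus a safety margin, so as paths slow down with age, the maximum stable frequency drops.

```python
def max_stable_freq_ghz(path_delay_ps, margin_ps=20):
    # the clock period (in ps) must exceed the critical-path delay
    # plus a timing margin; a 1000 ps period corresponds to 1 GHz
    return 1000.0 / (path_delay_ps + margin_ps)

fresh = max_stable_freq_ghz(230)        # 4.0 GHz when new
aged  = max_stable_freq_ghz(230 * 1.2)  # ~3.4 GHz after a 20% path slowdown
```

Slowing the clock is exactly the "wider margin for error" fix: a lower frequency means a longer period, which tolerates a larger degraded path delay.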
Edit: my wife still uses an iPhone 6, and I recently built a dedicated Linux machine with my 8 year old FX-8350 CPU and its ancient mainboard. The world of disposable electronics is closer and closer. I bet the exact lifetime can be precisely simulated with software from the companies mentioned in the article.
I also like my old car and I'll keep it as long as I can but I'm aware a new one would be much more fuel efficient.
From what I see of comparable-model cars, efficiency gains in engines, aerodynamics, etc. have mostly been offset by safety, emissions, and QoL improvements that have increased weight. For instance, by spec, the most fuel-efficient Corolla was a 1984 model.
Do note the note at the bottom of the table: "Note: the EPA tweaked their testing procedure, starting with the 2008 model year, with the end result being that the 2008 MPG estimates are now lower than previous years"
Other compact cars have followed a similar trend where they've gotten much heavier and safer, but fuel economy (in terms of fuel per distance) peaked or stagnated.
But the weight difference is also substantial between 1984 and 2020 models: 2110 lbs vs 2910 lbs
The old FX-8350 was available for this one-shot Linux project. Normally a virtual machine is OK. I had lots of thoughts about what to do with good old hardware when upgrading to Ryzen, and I delayed it for more than a year. Luckily it was solved this fortunate way and I do not need to throw working things away.
But for most, I care a lot more about lifetime.
It'll be interesting if this ends up being one of the major trade-off axes, along with power and cost.
A cryogenic cooling system that could let a CPU run at 7 GHz would be doable for about $100K.
But more realistically, there are often a long list of other things that can be done to a legacy DB that have a much bigger bang-for-buck, and are also lower risk.
Try NVMe storage. Databases love low-latency storage.
Are you virtualising this in any way? Don't.
Use a better NIC to cut down latency. Think 200 Gbps Mellanox cards with jumbo frames instead of the built-in Broadcom chip that probably cost $1.50.
Try co-locating the app(s) on the same box with the database to really cut the latency. This works great with modern many-core CPUs like an AMD EPYC. You can often dump everything onto one machine and your latency will go from 1ms to 10μs. That's a 100-fold difference!
Turn off the CPU vulnerability mitigations. This is safe as long as nothing else runs on the same hardware. Boosts some databases up to 50%.
Pin the process threads to specific CPU cores. Some newer Intel Xeons have "preferred cores" that will turbo boost higher than any other core.
Upgrade to the latest CPU to get more instructions per clock. This can have surprisingly beneficial effects.
Or... I dunno... fix the database. Does it even use indexes?
Or are we talking about an unsalvageable 4GL system that had its last update 15 years ago that does everything (storage engine, OLTP, forms UI framework, security, and reports)?
How about instant handoff? I guess that would require running the second box as a slave, so both boxes are running harder than they need to...
Does running one of the old FX-8350/70 at 7+ GHz have any benefit, or is it offset by the per-tick improvements of newer CPUs?
It's weird for me to think of heavily overclocking stuff in a non-gaming workload... stuff like chilled coolant and instability suddenly has completely different factors to consider.
Although current rumours suggest 7nm Ocean Cove coming in 2022 with an 85% increase in IPC compared to Skylake. I wish we could also make a performance node that pushes through 6GHz.
If I was being cynical.
Not entirely impossible, given some of these improvements have been sitting within Intel for years due to the delay in 10nm.
I can make an ALU in an FPGA that only takes in one bit data and adds it. That would take the IPC crown. Is it a useful statement? No. Comparing ARM IPC to x86 IPC is far more complicated than measuring length.
I've taken some measurements of the A13's microarchitecture (mostly, its ROB size), and Apple has a bigger lead here than you'd expect. I don't want to share the data without permission, but the bulk of the code is at  and I'd be happy to help anyone able to run code on an iDevice repeat the same measurements.
I doubt it. It's way more likely that the computers get more disposable and we replace everything. If for no other reason, because manufacturers won't see a point in making the rest of the electronics outlast the CPU.
Motherboards and cables do not wear out, and bridges and network interfaces most likely don't either. CPUs and RAM do. Consider a symmetrical multiprocessor system with many multi-core CPU and RAM units (maybe even fused together in one IC) that are susceptible to wearout. The operating system maintains a set of depletion counters for each unit. When a counter reaches some threshold, the system automatically (or by request) takes such units out of operation and reports it. An operator walks through the server rooms daily with a bunch of new units and replaces them. Only depleted silicon is replaced, and it can be easily and safely disposed of, or even recycled and reused. Green. Efficient. High power. Low cost. Constant sales. What else could we dream of? ;-)
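A toy sketch of the depletion-counter idea; the class, units, and threshold here are invented for illustration and don't correspond to any real OS interface:

```python
class WearTracker:
    """Track accumulated wear per hot-swappable unit and flag
    units that cross a replacement threshold."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.counters = {}            # unit id -> accumulated wear score

    def record(self, unit, wear):
        # called periodically by the OS with an estimated wear increment
        self.counters[unit] = self.counters.get(unit, 0) + wear

    def units_to_replace(self):
        # report units whose counter crossed the threshold
        return [u for u, c in sorted(self.counters.items())
                if c >= self.threshold]

tracker = WearTracker(threshold=100)
tracker.record("cpu0", 60)
tracker.record("cpu1", 30)
tracker.record("cpu0", 50)          # cpu0 now at 110, over the threshold
```

The hard part in reality is estimating the wear increment per unit, which is presumably where the aging models from the article would come in.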
If anything goes wrong, Windows keeps running on the 14nm processor and you can swap the co-processor.
We're going in the direction where the hardware is a commodity to replace as you see fit and the data stays somewhere safe to be used regardless of device. Phone, tablet, PC, console, etc. can consume the same data.
It's not ideal because I'd like hardware to be reliable, not something I can expect to die on me when I need it most but it's pretty clear we're going that way outside of niches, with most devices being exceedingly hard or impossible to upgrade or repair.
Not quite the same thing, but there was a Raspberry Pi board on the homepage earlier today which had been hacksawed in half and still worked. The person who did that has also cut some microprocessors in half successfully, and they still work, because they were cutting off bits they did not plan on using and which are not required for the rest of the device or chip to function.
I am sure an AMD Ryzen CPU would work without a core or two. In fact, they often disable cores before shipping by zapping a fuse. But if the same transistor on every core somehow blew, then you would probably be left with a dead CPU.
However electronics doesn't really fail like that. A single transistor might be "zapped" by a cosmic ray, but that's a transient error. Electromigration causes the copper interconnects between parts of the circuit to break (https://en.wikipedia.org/wiki/Electromigration#Practical_imp...), especially parts that carry higher current for power distribution around the chip. I had an Intel C2000 fail in a server after 3 years because of this (https://www.theregister.com/2017/02/06/cisco_intel_decline_t...).
Of course it might be too expensive to design such a feature.
From a very zoomed-out view (and a layman's at that!) the whole semiconductor industry seems like system gastronomy. The few main players equal McDonalds and Burger King: there is only so much you can do with similar equipment arranged in the same ways. While I'm sure the two could produce the same things if given access to the same ingredients, they aren't allowed to. Same for Coke vs. Pepsi.
Anyways, if you want to have it different, then you either need different systems arranged in different ways processing different ingredients, or you're stuck wailing "Oy vey!"
I'd rather prefer some cheering for alternatives like
 https://duckduckgo.com/?q=Semefab+Wafertrain+Bizen+Searchfor... and
That would at least enable the likes of KFC and Pizza Hut in comparison. And fizzly Bundaberg!
edit:  http://www.besang.com/
I wonder if this is the reason?
To me that says simple design flaw. Something like overdriving a transistor to get more performance out of it, without realizing what relied on it. That will cause slightly different failure conditions from electromigration.
I kinda suspect more ambient radiation. Less atmosphere to catch stray particles, and the smaller gates are more susceptible. Smaller interactions cause random errors.
Anyway, far far outside the scope of my expertise. But that's my guess.
We need something else, I have no idea what, but we’re clearly running out of improvements possible on transistor based CPUs
> Such as what’s the resistance of a wire 1 atom wide?
Quite high; you also get a lot of leakage since electrons are basically scattering elastically all the time. You can't use copper for a wire like this, you need special low-scattering conductors.
An A9 has 2 billion transistors on a 96 mm^2 chip. That's ~45,000 transistors in a row, ~10 mm = 10,000,000 nm, or ~35,000,000 atoms across the die. That's 1 transistor per ~777x777 atoms, except that's across multiple layers, so hand-wave ~1,000 atoms.
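The back-of-the-envelope numbers roughly check out; the ~0.28 nm atomic spacing below is an assumed ballpark for silicon, not a figure from the comment:

```python
import math

transistors = 2e9
area_mm2 = 96
atom_spacing_nm = 0.28                        # assumed ballpark for silicon

per_row = math.sqrt(transistors)              # ~44,700 transistors in a row
edge_nm = math.sqrt(area_mm2) * 1e6           # ~9.8 mm die edge in nm
atoms_per_edge = edge_nm / atom_spacing_nm    # ~35,000,000 atoms per edge
pitch_in_atoms = atoms_per_edge / per_row     # ~780 atoms per transistor pitch
```

So the "~777x777 atoms" figure is just (atoms per edge) / (transistors per row) squared, under the crude assumption of a uniform square grid.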
Not least because transistors are not nice neat NPN regions. They have multiple gates, all different gate sizes, and any number of inputs, outputs, and regions.
Intel manages to cram 20 million SRAM cells per mm^2 at 14nm; each cell has 6 transistors, i.e. 120 million transistors per mm^2, nearly three times their reported density of 45 million per mm^2. More to the point, the transistor density really isn't that important. For one thing there are three regions and four terminals in every transistor, so it doesn't make much sense to collapse all that to a single atom.
It also doesn't make much sense because that wouldn't offer much benefit: those regions and the space between transistors are pretty minor issues compared to the increased switching efficiency from shrinking the gate, which is the truly important part and the limiting feature. Electricity moves at a significant fraction of c, which moves 30 millimeters every clock @10 GHz. Enough to completely cross a CPU multiple times, which it should never need to do.
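The speed-of-light figure is simple to verify; note that real on-chip signals propagate noticeably slower than c, so this is an upper bound:

```python
c_mm_per_s = 3e11          # speed of light: 3e8 m/s = 3e11 mm/s
clock_hz = 10e9            # a hypothetical 10 GHz clock
mm_per_cycle = c_mm_per_s / clock_hz   # distance light covers per cycle
```

At 30 mm per cycle, even a large die can be crossed well within one clock, which supports the point that wire distance is not the limiting feature.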
The complication in this comes from the SSD, where the flash cells have a feedback loop for operations such as erasing that can take longer as these cells degrade.
It will not "perform the same". At some point there is a noticeable slowdown, and even though Wirth's law is at work, it's not the entire story. Heat will also make any chip age faster.
This article talks about aging under 5nm, but aging is already an issue above 5nm. Read the article.
> The complication in this comes from the SSD
I always experienced slowdowns on computers that did not have an SSD. Software is not always the only problem.
This will not detect an error in computation, only a bit flip in data. The current way to mitigate computation errors is to have two processors to detect an error in computation, or more to do a voting system if mere detection is not good enough. Since you cannot detect a computation error with a single CPU (without overhead somewhere, and thus lower performance), you can’t slow down to fix it.
The ECC systems I have worked with can fix a 1 bit error and detect two or more bit errors. They do this by using an algorithm to convert the original data in to an output that is bigger than the original (e.g. 64 bytes is now 72 bytes). This output data does not make sense until passed through the reversing algorithm. So basically the overhead is zero since the memory controller is running the algorithm in hardware every time anyway, so no slow down.
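The single-error-correct, double-error-detect behavior can be sketched with a classic Hamming code. This is a 12-bit toy over 8 data bits for clarity, not the 72-byte-per-64-byte SECDED scheme real memory controllers implement in hardware:

```python
DATA_POS = [3, 5, 6, 7, 9, 10, 11, 12]   # non-power-of-two positions hold data
PARITY_POS = [1, 2, 4, 8]                # power-of-two positions hold parity

def encode(byte):
    """Spread 8 data bits over 12 code bits, then fill in parity bits."""
    code = [0] * 13                      # 1-indexed; code[0] unused
    for i, pos in enumerate(DATA_POS):
        code[pos] = (byte >> i) & 1
    for p in PARITY_POS:
        for pos in range(1, 13):
            if pos & p and pos != p:     # parity p covers positions with bit p set
                code[p] ^= code[pos]
    return code

def decode(code):
    """Return (byte, error_position); corrects any single flipped bit."""
    syndrome = 0
    for p in PARITY_POS:
        parity = 0
        for pos in range(1, 13):
            if pos & p:
                parity ^= code[pos]
        if parity:
            syndrome |= p                # failing checks spell out the bad position
    if syndrome:
        code[syndrome] ^= 1              # flip the bad bit back
    byte = 0
    for i, pos in enumerate(DATA_POS):
        byte |= code[pos] << i
    return byte, syndrome
```

Flip any single bit of `encode(0xA7)` and `decode` still returns 0xA7, with the syndrome naming the flipped position; the correction is pure combinational logic, which is why hardware ECC costs no extra latency on the happy path.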
Absent a mechanism which reduces the clock speed of the CPU when it becomes unstable, there's no reasonable way in which failures in the CPU will result in it running slower. Such a mechanism doesn't generally exist: modern CPUs regulate their clock, but only in response to a fixed power and temperature envelope. The recent iPhone throttling is the only notable case where anything was done automatically in response to an unstable CPU, and that consisted of applying a tighter envelope if the system reset.
This is reflected in the experiences of those who run older hardware with contemporary software: it generally still works just fine at the speed that it used to.
> “For example, microprocessor degradation may lead to lower performance, necessitating a slowdown, but not necessary failures”
CPU technology is quite arcane and very high-level; there are so many patents, so much IP money, and a lot of secrecy involved, since CPU tech is strategic for geopolitical power. Do you work as an engineer at Intel, ARM, or AMD? On chip design?
> How do you tell if the CPU is on the margin of failing
It's not about failing, it's about error detection. Redundancy is a form of error detection: if several gates disagree on a result, they have to redo what they were working on. That's one simple form of error detection.
CPUs never really fail; they just slow down because the gates generate more and more errors, requiring recalculation until the detected error is finally corrected. An aged chip will just have more and more errors, and that will slow it down. That is why old chips are slower, independently of software.
A CPU that is very old will be very slow, or will just crash the computer again and again, so hardware people will just toss the whole thing, since they're not really trained to diagnose whether it's the CPU, the RAM, the capacitors, the GPU, or the motherboard. In general they will tell their customers "it's not compatible with new software anymore." In the end, most CPUs get tossed out anyway.
It's also a matter of planned obsolescence. Maintaining sales is vital, so having a product with a limited lifespan is important if manufacturers want to hold the market.
If such a mechanism existed, it would be documented at least at a high level, and its effects would be observable under controlled tests. Neither is the case, in contrast to the power and temperature envelopes I mentioned. There is no actual evidence that aged chips running at the same clock rate perform computation more slowly; your subjective experience that hardware 'slows down' does not count.
> It's not about failing, it's about error detection. Redundancy is a form of error detection: if several gates disagree on a result, they have to redo what they were working on. That's one simple form of error detection.
> CPUs never really fail; they just slow down because the gates generate more and more errors, requiring recalculation until the detected error is finally corrected. An aged chip will just have more and more errors, and that will slow it down. That is why old chips are slower, independently of software.
This is not how consumer CPUs work. It's not even how high-reliability CPUs necessarily work (some work through a high level of redundancy, but they don't generally automatically retry operations when a failure happens: that's a great way of getting stuck). Such redundancy is so incredibly expensive from a power and chip-area point of view that no CPU vendor would be competitive in the market with a CPU that worked the way you describe. If a single gate fails in a CPU, the effects can range from unnoticeable to halt-and-catch-fire.
The only error correction which is present is memory based, where errors are more common and ECC can be implemented relatively cheaply compared to error checking computations.
Why would it be? It's internal functionality, and CPUs usually have a one-year warranty or so, and I'm not sure they come with guaranteed FLOPS, only a guaranteed frequency I'd guess. If it's tightly coupled to trade secrets, I would not expect it to be documented. I also doubt that you could find everything you want to know in a CPU's documentation.
> There is no actual evidence
The Wikipedia article I mentioned, and basic physics, are evidence enough.
> If a single gate fails in a CPU
I did not say "fail"; I meant "miscalculated". There is a very low probability of it happening, but it can still happen because of the sheer quantity of transistors, hence error correction.
> Such redundancy is so incredibly expensive from a power and chip area point of view
Sure it is, so what? At some point all CPUs will need it, and it becomes necessary. There are billions (I think?) of transistors on a CPU.
The Wikipedia article you linked makes zero mention of redundant gates as a workaround for reliability issues. The closest it comes is saying that designers must consider reliability, but that is design at the level of the chip's geometry, not its logic. It doesn't even make good sense as a strategy: the extra cost of redundant logic to work around reliability issues on a smaller node would outweigh the advantages of that node.
One of the greatest things about modern CPUs is how reliably they do work given that you need such a high yield on individual transistors.