Hacker News
Ask HN: How do CPUs handle bad transistors?
10 points by brokenmachine 11 days ago | 11 comments
I've read that current CPUs have more than a hundred million transistors per square millimeter. I can't imagine that every single one of those works perfectly and will stay working perfectly for the entire lifetime of the CPU.

How do we design CPUs that don't die or stop working properly when one out of a hundred million transistors fails?

That's a great question with a rather interesting answer.

The difference between CPU models may not be due to different designs, but to manufacturing defects. Intel, for example, might run a production line only for "tier 1" processors (e.g. i7), but during manufacturing some of the transistors, for whatever reason, fail to function. During quality testing they can identify which parts of the chip (a core or a slice of cache, say) contain failed transistors and permanently disable those parts, typically with on-die fuses rather than by routing around individual transistors. Then you end up with a lower-tier processor (e.g. i5, i3). This practice is usually called binning.
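To make the binning idea concrete, here is a minimal sketch in Python. The tier names, core counts, and the `bin_die` function are all hypothetical, made up for illustration; real binning also weighs clock speed, power draw, and cache defects, and nothing here reflects any vendor's actual flow.

```python
# Hypothetical product tiers, best first: (name, working cores required).
TIERS = [("tier-1", 8), ("tier-2", 6), ("tier-3", 4)]

def bin_die(core_ok):
    """Pick the best tier a die qualifies for.

    core_ok: list of booleans, one per physical core (True = passed test).
    Returns (tier name, indices of cores left enabled), or None if the
    die has too few working cores for any tier and would be scrapped.
    """
    good = [i for i, ok in enumerate(core_ok) if ok]
    for name, needed in TIERS:
        if len(good) >= needed:
            # Surplus good cores are disabled too, so every chip sold
            # under one tier name has the same core count.
            return name, good[:needed]
    return None

# A die with two failed cores out of eight is sold as the middle tier.
print(bin_die([True, False, True, True, True, True, True, False]))
```

Note that a fully working die can also be sold as a lower tier if demand calls for it; the sketch only models the defect-driven case.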

Lots of really good information here: https://www.google.com/amp/s/www.techspot.com/amp/article/18...

Cool! I’ll take the liberty to post the permalink: https://www.techspot.com/article/1840-how-cpus-are-designed-...


Modern designs sometimes duplicate functional units so that the final chip can be configured in a workable manner even if a few of the units don't turn out right. But otherwise, getting all of those transistors working right is the big challenge of chip design. Yes, they all have to work. Yields on new processes are often very low, with only a fraction of the devices made actually working.

Does anyone know more detailed articles about how this (self-)checking architecture/process works?

I'm afraid I don't have any articles to direct you to. I can give you a little detail of how they do it with RAM. DRAM is just a huge array of cells. The array itself is made up of repeated sections of a block. These blocks are independently wired and have on-die logic for testing (simple memory check). The blocks which pass testing are configured by blowing fuses so they're connected to the normal input/output logic, and the broken ones have fuses blown to disconnect them from the power rails.

They slightly overbudget the number of blocks when designing. This way, if a few blocks are faulty, they still have a working RAM chip. If a whole bunch of blocks aren't working, they can still sell it as a half- or quarter-capacity chip. An 8, 16 or 32 gigabit RAM chip in the same process from the same manufacturer may well have the same die inside, with different fuses blown.
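The test-then-fuse flow described above could be sketched roughly like this in Python. Everything here is an assumption for illustration: the `Block` class with its stuck-at fault model, the four test patterns, and the `configure_chip` function are invented, and real on-die test logic is far more thorough.

```python
class Block:
    """Toy model of one DRAM block; stuck_at maps address -> value always read."""
    def __init__(self, size, stuck_at=None):
        self.cells = [0] * size
        self.stuck_at = stuck_at or {}

    def write(self, addr, value):
        self.cells[addr] = value

    def read(self, addr):
        return self.stuck_at.get(addr, self.cells[addr])

def simple_memory_check(block):
    """Write a few bit patterns to every cell and read them back."""
    for pattern in (0x00, 0xFF, 0xAA, 0x55):
        for addr in range(len(block.cells)):
            block.write(addr, pattern)
        for addr in range(len(block.cells)):
            if block.read(addr) != pattern:
                return False
    return True

def configure_chip(blocks, sellable_sizes):
    """Return (fuse map, capacity in blocks), or None if the die is scrap.

    sellable_sizes: hypothetical capacities (in blocks) the part can ship as.
    """
    passed = [i for i, b in enumerate(blocks) if simple_memory_check(b)]
    for capacity in sorted(sellable_sizes, reverse=True):
        if len(passed) >= capacity:
            enabled = set(passed[:capacity])
            fuses = ["connect" if i in enabled else "disconnect"
                     for i in range(len(blocks))]
            return fuses, capacity
    return None

# Nine blocks on the die (eight needed plus one spare), block 2 is faulty.
blocks = [Block(16) for _ in range(9)]
blocks[2] = Block(16, stuck_at={5: 0xFF})
print(configure_chip(blocks, sellable_sizes=(8, 4, 2)))
```

With one bad block the spare covers for it and the chip ships at full capacity; with five bad blocks the same die would come out as the half-capacity part.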

This isn't new; it's been done since at least the 1980s. It's my understanding the same thing is done with cache and cores and even execution units on modern processors, but I would assume with a great deal more complexity.

The book “VLSI Test Principles and Architectures: Design for Testability”, edited by Laung-Terng Wang et al., is quite good.

Cache has extra capacity so bad parts can be mapped out during the testing process. If there's a fault in a core the entire core is disabled. A fault in the uncore will probably cause the entire chip to be scrapped.
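As a rough illustration of "mapping out" bad parts of a cache, here is a small Python sketch of spare-row remapping. The `RepairableCache` class and its method names are hypothetical, and it assumes the spare rows themselves are fault-free, which a real repair flow would also have to test.

```python
class RepairableCache:
    """Toy cache array with a few spare rows placed after the normal rows."""
    def __init__(self, rows, spare_rows):
        self.rows = rows
        # Spares occupy physical indices rows .. rows + spare_rows - 1.
        self.free_spares = list(range(rows, rows + spare_rows))
        self.remap = {}  # logical row -> physical spare row

    def repair(self, faulty_rows):
        """Assign a spare to each faulty row; False if the spares run out."""
        for row in faulty_rows:
            if not self.free_spares:
                return False  # unrepairable: disable this cache slice or bin lower
            self.remap[row] = self.free_spares.pop(0)
        return True

    def physical_row(self, logical_row):
        """Translate a logical row to the physical row actually used."""
        return self.remap.get(logical_row, logical_row)

cache = RepairableCache(rows=8, spare_rows=2)
print(cache.repair([3, 5]), cache.physical_row(3), cache.physical_row(0))
```

In hardware the remap table would be fixed at test time (for example with fuses), so the lookup adds no software-visible behavior.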

Transistors are actually very reliable, especially on a silicon wafer where they're protected from the elements. Unless there's a voltage spike somehow, they're unlikely to go bad.

That's why it's called an IC, an integrated circuit.

Either nothing fails at all, or the whole thing breaks down outright.

The keyword is "mercurial core", isn't it?
