Many users report problems like "NaN" during training: at some point, the gradients blow up and the job crashes. Sometimes these are caused by specific examples or by numerical errors on the part of the model developer, but sometimes they are the result of errors from bad cores (during matrix multiplication, embedding lookup, a vector op, whatever).
ML is usually pretty tolerant of small amounts of added noise (especially if it has nice statistical properties), and some training jobs will ride through a ton of uncorrected and undetected errors with few problems. It's a very challenging field to work in because it's hard to know whether your NaN is because of your model or your chip.
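A cheap mitigation, for what it's worth, is to guard the training step and skip (but log) steps whose loss or gradients are non-finite, so you can later check whether the skips cluster on particular machines or cores. A minimal sketch, assuming a PyTorch-style loop; model, loss_fn, optimizer, inputs, and targets are stand-ins for whatever your job already has:

    import torch

    def guarded_step(model, loss_fn, optimizer, inputs, targets):
        # Returns False for suspect steps instead of letting the job crash.
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        if not torch.isfinite(loss):
            return False  # bad example, bad math, or maybe a bad core
        loss.backward()
        grads_ok = all(torch.isfinite(p.grad).all()
                       for p in model.parameters() if p.grad is not None)
        if not grads_ok:
            return False  # gradients blew up; log which machine this was
        optimizer.step()
        return True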
What stands out to me:
- "Mercurial cores are extremely rare" but "we observe on the order of a few mercurial cores per several thousand machines". On average one core per 1000 machines is faulty? That's quite a high rate.
- Vendors surely must know about this? If not through testing, then through experiencing the failures in their own servers.
- I've read the whole paper and I see no mention of them even reaching out to vendors about this issue. There are strong incentives on both sides to solve or mitigate it, so why aren't they working together?
Google's internal production stack is much more amenable to that kind of digging than public cloud products:
* You can easily find out what machine a given borg task was running on. In fact, not just your own borg job but anyone's. You can query live state, or you can use Dremel to look up history.
* Similarly, even as a client of Bigtable or Spanner, you can find out the specific tabletservers/spanservers operating on a portion of your database and what machines they're running on. (Not as easy to cross this layer and get to the relevant D servers actually storing the data but I think it's all checksummed here anyway.) If your team has your own partition, you can see tabletserver/spanserver debug logs yourself also.
* There's a convenient frontend for looking up a bunch of diagnostic info for the machine, including failures of borg tasks (were other people's tasks crashing at the same time mine did? what was their crash message?), syslog-level stuff, other machine diagnostics like ECC / MCE errors, and repair history (swapped this DIMM, next attempt will swap this CPU).
It's not unusual for application teams to suspect a machine and basically vote it off the island (I don't want my jobs running here anymore; I cast a vote for it to be repaired / Office Spaced). It's rarer for them to take the time to really understand the problem in detail, like "core 34 sometimes returns incorrect results on this computation", although there's nothing in particular stopping them from doing so (other than lack of expertise and a long list of other things to do). The platforms team gets involved sometimes and really digs in; iirc in one bug they mentioned sending a CPU back to the vendor to examine with an electron microscope.
I'm not sure what lessons that offers for a public cloud where that kind of transparency isn't realistic...
> After a few iterations, it became obvious that the computation of Int(1.153) = 0 as an input to the math.pow function in Scala would always produce a result of 0 on Core 59 of the CPU. However, if the computation was attempted with a different input value set Int(1.152) = 142 the result was accurate.
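For concreteness, here's the shape of that kind of pin-and-compare repro loop as a minimal sketch, in Python rather than the article's Scala, and assuming Linux (os.sched_setaffinity does the pinning; the inputs are from the quote, the exponent is arbitrary):

    import math, os

    INPUTS = [1.152, 1.153]  # values like those in the quote above

    def run_on_core(core_id):
        os.sched_setaffinity(0, {core_id})        # pin to one logical core
        return [math.pow(x, 3.0) for x in INPUTS]

    reference = run_on_core(0)                    # trust core 0 as baseline
    suspects = [c for c in range(1, os.cpu_count())
                if run_on_core(c) != reference]
    print("cores disagreeing with core 0:", suspects)

(If core 0 itself is the mercurial one, you'd want a golden value precomputed on known-good hardware instead of a same-machine baseline.)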
From working in HPC, I've handled reports of things like FMA units producing incorrect results or the random appearance of NaNs. Were it not for the fact that we knew these things could happen, and for the customers' intimate knowledge of their codes, I dread to think how "normal" operations would track these issues down. Bad parts went back to the CPU manufacturer and further testing typically confirmed the fault. But that end of the process was pretty much a black box to anyone but the CPU manufacturer. I'd be keen to know more about this too.
How in the world do they get such a precise number of atoms to land on billions of transistors? It seems so hard for even one transistor.
For a ~5nm process, there might be only a few dozen atoms across the width of the fin, but the other dimensions are much larger, for a total of probably somewhere around hundreds of thousands of atoms per channel.
see e.g. https://fuse.wikichip.org/news/2408/tsmc-7nm-hd-and-hp-cells... and other wikichip articles for some dimension info.
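a rough back-of-envelope for the "hundreds of thousands of atoms" figure, where every dimension below is an assumption loosely in the ballpark of published 7nm-class fin geometry (crystalline silicon is ~5e22 atoms/cm^3, i.e. ~50 atoms/nm^3):

    fin_width_nm   = 6.0    # assumed
    fin_height_nm  = 50.0   # assumed
    gate_length_nm = 20.0   # assumed channel length under the gate
    atoms_per_nm3  = 50.0   # ~5e22 atoms/cm^3 for crystalline silicon

    channel_volume_nm3 = fin_width_nm * fin_height_nm * gate_length_nm
    print(channel_volume_nm3 * atoms_per_nm3)   # ~300,000 atoms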
but regardless, modern semiconductor manufacturing processes are incredible. the much-too-brief summary is that they project very precise patterns of very high frequency light onto photoresist on the silicon to activate it, then etch away the material that isn't protected by the photoresist. that alone doesn't produce features as small as the fins, so there are a lot of tricky techniques, like doing the patterning + etching once, growing a layer of some other material on top, etching again to leave only the very narrow sidewalls that grew around the original feature, and then etching once more using those sidewalls as the pattern.
it really is amazing - reliably writing tiny structures on stones with light.
HN thread from a couple years ago: https://news.ycombinator.com/item?id=16175949
He has a funny slip too: at 20:51 he says Xeon instead of Xenon!
This is probably difficult to do at a fine-grained level, but I imagine that coarser synchronization and checks (both in software) could provide the necessary assurance that code executing on a single core is consistent with that of other cores.
Would you check that every register assignment matches? Or every page write?
A lot of this logic is already in out of order execution (e.g., Tomasulo algorithm). Memory has ECC and is probably a different problem.
So the question is, how do you check these results without actually doing them twice? Is there a role here for frameworks or OS to impose sanity checks? Obviously we already have assertions, but something gentler than a panic, where it says "this is suspect, let's go back and confirm."
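One possible shape for that "suspect, let's confirm" primitive is a decorator that re-runs a pure function when its result trips a suspicion predicate, and logs disagreement rather than panicking. Everything below (names, retry count, the NaN predicate) is made up for illustration:

    import functools, logging

    def _same(a, b):
        # NaN-aware equality: treat NaN as equal to NaN, so a *reproducible*
        # NaN (a model/code problem) isn't flagged as hardware flakiness.
        return a == b or (a != a and b != b)

    def confirm_if(suspect, retries=2):
        def wrap(fn):
            @functools.wraps(fn)
            def inner(*args, **kwargs):
                result = fn(*args, **kwargs)
                if suspect(result):
                    reruns = [fn(*args, **kwargs) for _ in range(retries)]
                    if not all(_same(r, result) for r in reruns):
                        logging.warning("%s: unstable result %r vs %r",
                                        fn.__name__, result, reruns)
                        result = reruns[-1]   # or majority-vote / escalate
                return result
            return inner
        return wrap

    @confirm_if(suspect=lambda x: x != x)   # x != x is true only for NaN
    def dot(xs, ys):
        return sum(a * b for a, b in zip(xs, ys))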
With today's high-speed multi-core processors, a 1-in-a-million chance of a computation error would mean tens to hundreds of thousands of errors per second: a single chip retiring on the order of 10^10 instructions per second with a 10^-6 error rate would produce ~10^4 faulty results per second.
That's usually why no one who depends on their computers working day in and day out overclocks their components. The marginal performance gains aren't worth the added unreliability and the added power/heat/noise footprint.
And you'd need to eliminate the use of non-deterministic CPU-local data, like RDRAND and the on-die temperature, power, etc. sensors. Most likely, you'd want to run the CPUs at a fixed clock speed to avoid any differences in settling time when switching speeds.
This could probably effectively find broken CPUs (although you wouldn't know which of the pair was broken), but you could still have other broken devices resulting in bad computations. It might be better to run calculations on two separate nodes and compare, possibly only for important calculations.
As for settling times, those are random anyway. Processors are binned according to how good their settling times turned out to be, so it's unlikely you'll get two truly identical chips.
No, it is not. You can always trade off performance for reliability by repeating your computations several times, preferably with some variation in the distribution/timing of work to avoid the same potential hardware failure pattern.
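A minimal sketch of that, assuming exact (integer) arithmetic so run-to-run equality is meaningful; for floats, reordering changes the rounding, so you'd compare within a tolerance instead:

    import collections, random

    def reliable_sum(values, runs=3):
        # Repeat the reduction with the work order shuffled, so a sticky
        # per-core fault is unlikely to corrupt every run the same way.
        results = []
        for _ in range(runs):
            shuffled = random.sample(values, len(values))
            results.append(sum(shuffled))
        winner, votes = collections.Counter(results).most_common(1)[0]
        if votes <= runs // 2:
            raise RuntimeError(f"no majority among runs: {results}")
        return winner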
It's possible other modes have this property because of the structure of the cipher itself, but that's way out of my league.
If you want extremely high reliability, for critical applications, you use other CPUs. Of course, they are slower.
So the only interesting info that remains is that the defect rate seems way too high, and maybe quality has been decreasing in recent years. In which case, when you are Google, you probably could and should complain (strongly) to your CPU vendors, because likely their testing is lacking and their engineering margins are too low... (at least if it's really the silicon that is at fault, and not, say, the motherboard)
Now of course it's a little late for the existing ones, but the sudden realization that "OMG, CPUs do sometimes fail, with a variety of modes and for a variety of reasons" (including, surprise(?!), aging) seems naïve unless the focus is on the defect rate. And the potential risk of occasional high-rate errors was already very well known, especially in the presence of software changes and/or heterogeneous software and/or heterogeneous hardware, because of logical CPU bugs, which sometimes also cause silent data corruption, and sometimes with non-deterministic-like behavior (so a computation can work on one core but not another because of "random" memory controller pressure and delays, and the next time with the two cores reversed).
ECC memory was dropped from consumer machines 15 years ago or more. That was a conscious industry choice that hardware reliability could be sacrificed to other concerns. That's just one example.
Any credible evidence of this?
If Google and the other hyperscalers complain enough, there's no reason Intel couldn't give them some self-test to run every hour or so.
It depends on how complex the CPU is, how long you're willing to run the self-test code, and how well it was written.
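The simplest version is a periodic known-answer test pinned to each core in turn. A vendor-supplied test would target specific functional units with far better coverage; this sketch only shows the shape of the harness (Linux-only pinning, and the golden value should really be precomputed on known-good hardware):

    import math, os, time

    def core_answer():
        return sum(math.sin(i) * math.cos(i) for i in range(100_000))

    GOLDEN = core_answer()   # ideally shipped with the test, not computed here

    def self_test():
        bad = []
        for core in range(os.cpu_count()):
            os.sched_setaffinity(0, {core})   # pin to this logical core
            if core_answer() != GOLDEN:
                bad.append(core)
        return bad

    if __name__ == "__main__":
        while True:
            print("failing cores:", self_test())
            time.sleep(3600)                  # "every hour or so"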
Their Tech Reports are worth sampling, and fortunately they're online: https://www.hpl.hp.com/hplabs/index/Tandem
Probably the best one to start at: https://www.hpl.hp.com/techreports/tandem/TR-90.5.pdf
Not directly related, but some platforms supporting lockstep are flexible: you can use a pair of cores either as 2 independent cores (performance) or as a single logical core (lockstep).
Joking aside, it's really neat to see the scale and sophistication of error detection appearing in these data centers.