Another relevant point made in the video is that they restrict cells to a maximum size, which makes it easier to test behavior at that size. That would also have helped avoid this specific issue, since the number of threads would have been tied to the number of instances in a cell.
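To make the arithmetic concrete, here is a rough sketch. It assumes each instance keeps one thread per peer in its own group, which is my reading of the post-mortem and not something spelled out above. The function names and the numbers are made up for illustration.

```python
from typing import List, Optional


def threads_per_instance(group_size: int) -> int:
    """Threads one instance needs, ASSUMING one thread per peer in its group."""
    return max(group_size - 1, 0)


def cell_sizes(total_instances: int, max_cell_size: int) -> List[int]:
    """Split a fleet into cells of at most max_cell_size instances (hypothetical helper)."""
    if max_cell_size <= 0:
        raise ValueError("max_cell_size must be positive")
    full_cells, remainder = divmod(total_instances, max_cell_size)
    return [max_cell_size] * full_cells + ([remainder] if remainder else [])


def worst_case_threads(total_instances: int, max_cell_size: Optional[int] = None) -> int:
    """Largest per-instance thread count, with or without a cap on cell size."""
    if max_cell_size is None:
        # No cells: every instance talks to the whole fleet.
        return threads_per_instance(total_instances)
    sizes = cell_sizes(total_instances, max_cell_size)
    return max((threads_per_instance(size) for size in sizes), default=0)


if __name__ == "__main__":
    # Illustrative numbers only, not taken from the post-mortem.
    print(worst_case_threads(10_000))       # grows with the fleet
    print(worst_case_threads(10_000, 100))  # bounded by the cell size
```

With a cap, the worst case stays at `max_cell_size - 1` no matter how large the fleet gets, so you only have to test at that one size.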
I definitely recommend checking out the video. Even if you have seen it before, rewatching it in the context of this post-mortem really makes it hit home.
As another Googler, I'd argue that Borg's concept of cells isn't like what Amazon is calling "cells" here. Borg cells are, as far as I can tell, akin to an AWS Zone. There are similar concepts within Google that match the idea of "an application unit that spans multiple compute units but is isolated from other similar application units, and can be used for a single customer or workload". There are multiple terms for this concept, which I'd be happy to share within Google.