This is one of the first articles I’ve read that makes a decent attempt at reverse engineering the black box of neural networks. I particularly appreciate the use of corrupted prompts for isolating behaviors.
Came here to say something similar. It seems to me that being able to determine how specific neurons affect outputs will be crucial to future optimization.