If I have at least n^2 processors, I can send a row and column to each processor, which can compute the inner product in linear time. So O(n^2) time to coordinate the work, and O(n) to actually do it.
Therefore, only O(n) time depth apparently. O(log(n)) to broadcast matrix_width to all processors, which seems to be the only communication needed to organize the calculation.
If I have at least n^2 processors, I can send a row and column to each processor, which can compute the inner product in linear time. So O(n^2) time to coordinate the work, and O(n) to actually do it.