I would say that we don't care about root cause when the event happens; just get the server out of the pool. RCA can be done post mortem.
I would reduce dimensionality down to 2d: errors per time. In that case, we have a great deal of statistical tools at hand. They also do not require hand-waving of N dimensional cluster detection, where only the machine has any idea of detecting errors. And having something like 1000 dimensions is just tremendously slow, compared to integral analysis of errors in respect to time.