
What is considered an "error"? Are there further classifications for different types of errors? Maybe certain error types contribute more than others to detecting bad servers? Having just a single dimension seems a bit heavy-handed. Would multiple dimensions slow it down too much to be useful?

The symptom is the "error". Everything behind it is a root cause, which can range from failing hard drives, CPU failure, memory failure, or NIC failure to a rat chewing on cables, or plenty of other failure modes.

I would say that we don't care about root cause when the event happens; just get the server out of the pool. RCA can be done post mortem.

I would reduce the dimensionality to 2D: errors over time. In that case, we have a great many statistical tools at hand. They also don't require the hand-waving of N-dimensional cluster detection, where only the machine has any idea how the errors are being detected. And having something like 1000 dimensions is just tremendously slow compared to integral analysis of errors with respect to time.
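To make the "errors over time" idea concrete, here is a minimal sketch of one way it could look. It is not anyone's production method, just an illustration: count errors per server in a fixed time window, turn them into rates, and flag servers whose rate sits far above the fleet median. The function names, the 60-second window, the server names, and the 3.5 cutoff on the MAD-based modified z-score are all assumptions picked for the example.

```python
from statistics import median


def error_rates(error_counts, window_seconds):
    """Convert per-server error counts for one time window into errors per second."""
    return {server: count / window_seconds for server, count in error_counts.items()}


def flag_outliers(rates, threshold=3.5):
    """Return servers whose error rate is far above the fleet median.

    Uses a modified z-score based on the median absolute deviation (MAD).
    The 3.5 threshold is an assumed cutoff, not a tuned value.
    """
    values = list(rates.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # Crude fallback (an assumption): with no spread in the fleet,
        # anything strictly above the median is treated as suspicious.
        return sorted(s for s, r in rates.items() if r > med)
    # Only high rates matter here, so the score is one-sided.
    return sorted(
        s for s, r in rates.items() if 0.6745 * (r - med) / mad > threshold
    )


# Hypothetical error counts observed in one 60-second window.
counts = {"web1": 3, "web2": 4, "web3": 2, "web4": 5, "web5": 120}
suspects = flag_outliers(error_rates(counts, window_seconds=60))
print(suspects)
```

Nothing here needs to know why web5 is throwing errors. It only needs the rate to pull the box out of the pool, and the root cause can be sorted out afterwards.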
