Perseus: A Fail-Slow Detection Framework for Cloud Storage Systems (micahlerner.com)
46 points by malphite on May 1, 2023 | 3 comments

You’re effectively using https://en.m.wikipedia.org/wiki/Little%27s_law to deduce component performance as service time. I’ve found the peer filtering mechanism to work as well. It can also work if the entire node is slow, depending upon characteristics of the workload.
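For anyone who wants the Little's law arithmetic spelled out, here is a minimal sketch. It assumes you can sample the average number of in-flight requests and the completion rate for a component; the function name and the example numbers are made up for illustration and are not taken from the paper.

```python
def mean_time_in_system(avg_in_flight: float, throughput_per_s: float) -> float:
    """Little's law: L = lambda * W, so W = L / lambda.

    avg_in_flight    -- average number of requests in the system (L)
    throughput_per_s -- average completion rate in requests/second (lambda)
    Returns the average time a request spends in the system, in seconds (W).
    """
    if throughput_per_s <= 0:
        raise ValueError("throughput must be positive")
    return avg_in_flight / throughput_per_s


# Hypothetical numbers: 8 requests in flight on average at 2000 requests/s.
latency_s = mean_time_in_system(8, 2000.0)
print(f"{latency_s * 1000:.1f} ms")
```

Note that W here is queueing plus service time, so it only approximates raw service time when the queue is short.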

I think peer filtering starts working well at some medium-sized number of peers × events. Off the top of my head, I'd guess that with five peers and a steady, non-zero event rate for a few hours, a slow peer should stick out obviously at the 95th percentile.
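A rough sketch of the peer comparison described above, in case it helps. The median-of-peers baseline and the 2x ratio threshold are my own assumptions for illustration, not something from the paper, and the latency samples are synthetic.

```python
import statistics


def p95(samples):
    """Nearest-rank 95th percentile, using integer math to avoid float rounding."""
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


def slow_peers(latencies_by_peer, ratio_threshold=2.0):
    """Return peers whose p95 latency exceeds ratio_threshold times the median p95.

    ratio_threshold is an assumed tuning knob, not a value from the paper.
    """
    p95_by_peer = {peer: p95(s) for peer, s in latencies_by_peer.items()}
    baseline = statistics.median(p95_by_peer.values())
    return sorted(
        peer for peer, value in p95_by_peer.items()
        if value > ratio_threshold * baseline
    )


# Synthetic latencies in ms: four healthy peers and one with a heavy tail.
example = {f"d{i}": [1.0] * 95 + [2.0] * 5 for i in range(4)}
example["d4"] = [1.0] * 80 + [10.0] * 20
print(slow_peers(example))
```

This only works when the peers see comparable workloads, which is exactly the caveat about opaque workloads.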

However, OP mentions that opaque workloads make peer identification difficult. IIUC, this is a cool way to use two correlated metrics to extract performance outliers, with some neat-sounding math that I didn’t study.

They’re using a clustering algorithm, which is effectively comparing against peers, isn’t it?
