
Ask HN: How do you detect oddly performing machines in a large cluster? - chucky_z
Hi HN,

I have a problem where random nodes in a large cluster perform pretty far out of spec. What's the correct way to find them? There's a huge diversity of workloads, and the boxes are large, so I was considering doing something really trivial that eats up some consistent % of CPU (calculating Fibonacci to some low-ish number, for instance), then graphing machines that are some n SDs out of normal.

Is there any really good, clean way to do this that solves the problem in an elegant way?
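For anyone reading along, the idea in the post (run a fixed chunk of CPU work on every box, then flag boxes that sit some n standard deviations from the fleet) could be sketched roughly like this. It is only a minimal illustration, not a tested production probe: the function names (`fib`, `probe_seconds`, `find_outliers`), the workload size, and the 3 SD threshold are all assumptions made up for the example, and how the timings get collected from each host is left out entirely.

```python
import statistics
import time


def fib(n: int) -> int:
    """Iteratively compute the n-th Fibonacci number (a fixed amount of CPU work)."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def probe_seconds(n: int = 20_000, repeats: int = 5) -> float:
    """Time the fixed workload several times and keep the fastest run.

    The fastest run is the one least disturbed by whatever else is
    running on the box. n and repeats are arbitrary, hypothetical defaults.
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fib(n)
        best = min(best, time.perf_counter() - start)
    return best


def find_outliers(timings: dict[str, float], n_sd: float = 3.0) -> list[str]:
    """Return hosts whose probe time is more than n_sd sample standard
    deviations away from the fleet mean, sorted by host name."""
    values = list(timings.values())
    if len(values) < 2:
        return []  # a standard deviation needs at least two samples
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    if sd == 0:
        return []  # every host reported the same time
    return sorted(h for h, t in timings.items() if abs(t - mean) / sd > n_sd)
```

Each host would run `probe_seconds()` and report the result, and something central would call `find_outliers({"host-a": 0.012, ...})` on the collected timings. One caveat with a plain mean/SD cutoff: a single bad box inflates the SD it is measured against, and with the sample standard deviation no value can be more than (n-1)/sqrt(n) SDs from the mean, so a 3 SD threshold cannot fire at all with fewer than 11 hosts. Comparing within groups of identical hardware, or using a median-based spread, may be worth considering.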
======
verdverm
Is the problem data collecting and detection in time series, or more being
able to reproduce so you can work towards a fix?

~~~
chucky_z
The latter. We have a lot of metrics and data, but I need something consistent
beyond "this box looks like it's using a lot of CPU/memory/IO."

Our workloads are dynamic enough that a box at 20% or 80% utilization is
within the normal range either way, and some workloads are single-threaded,
so per-core stats are not great either.

I'm also not looking to fix anything, just have the ability to say, with a
high degree of confidence, "hey, our system detected this box looks funky, can
you just pull it out and replace it?"

~~~
verdverm
That's the hard part of the problem, extracting signals from the noise. Have a
look at the Google Autopilot paper (cloud scheduling, not self driving). There
are a few interesting ideas that could be adapted. Google generally has some
good ideas if you peruse their research page, specifically the data center and
related topics.

~~~
chucky_z
Awesome! For anyone who finds this through search, here's that whitepaper:
https://dl.acm.org/doi/pdf/10.1145/3342195.3387524

I don't think it helps at a server level, but it does provide me some ideas,
especially looking at some of the algorithms they're using.

My problems are mostly CPU related, whereas this paper seems to focus on
memory. I can still definitely get something useful from it. :)

~~~
verdverm
Yes, most cloud applications / systems struggle with memory and optimizations
around the system behaviors / configuration. It is rare to find something CPU
bound, and when you do, it's usually isolated so it doesn't step on other
programs. I thought the algos and strategy would be most useful to you. Glad
you found it interesting!

