
Show HN: Weekly troubleshooting challenges for DevOps and SREs - hkh
https://www.circuitops.com/challenge
======
rachelbythebay
Feedback: having done your first one, I find myself wondering _how_ the thing
got into that state in the first place, and what can be done about it. That
is, [Jub be jung fperjrq hc gur crezvffvbaf/bjarefuvcf ba gur erqvf cngu]
(spoilers).

Basically, this is the lowest level of "if this then that" fixing, and people
need to be able to do that, but they also need to then go "we can't keep
operating in this mode".

I mean, yeah, here, it's artificial for the purposes of the exercise. But, if
you were to see this in real life, there would be a bunch more questions to
ask and things to dig into to keep it from happening again. Randos with root?
Bad script somewhere? Terrible system management stuff? RPM getting a little
too frisky with how it manages certain directories?

I'm not sure how you get that part into a thing like this, though.

~~~
hkh
Great point! I wonder if there are things we can do to break the scenario
while the user is logged in, to simulate operational issues such as OOM kills,
network glitches, or a full disk. That might enable a better root-cause
discussion during interviews.

------
hkh
Hi HN, Hassan here - one of the co-founders of CircuitOps. While building
CircuitOps, we have had so much fun watching people try to diagnose and fix
issues in production-like environments, so we made it into a public challenge!

We are planning on releasing one challenge a week.

Would love to hear your feedback and see some of you on the leaderboard!

Cheers, Hassan

