Wolf Incident Postmortem (jefftk.com)
102 points by zzq on Jan 19, 2023 | 17 comments



The only evidence of a wolf is a set of alerts from a now-missing sentry saying "wolf" twice and "real wolf" once. Yet because the sentry is missing, the conclusion is 'wolf', and not only that, but that there must have been only one wolf rather than a pack. Without supporting evidence, this conclusion may be faulty, and the real fault may return.

My initial reaction is that there is no wolf. The sentinel began exhibiting a mental bug that recurred for 2 days until the sentinel broke and wandered off. I think the resolution should have an action item to continue closely monitoring sentinels for further bugs, and to collect metrics to show proof of wolf. And, you know, maybe come up with an additional wolf countermeasure (if the fault was a wolf, it will probably come back).


It took me longer than I care to admit to figure out this wasn't talking about a real incident. I figured flock was about some Kubernetes/VM thing.


Context for those wondering: The Boy Who Cried Wolf

https://en.wikipedia.org/wiki/The_Boy_Who_Cried_Wolf#The_fab...


I really prefer when the "postmortem" label is metaphorical.


I don't get the joke, but I browsed a bit and got to the story about "MA RMV Overloaded" [0], where they tried to renew their driver's license. I was baffled that it was possible at all, and even faster, to renew an official driver's license through a private company (AAA) than through the DMV office.

[0] https://www.jefftk.com/p/ma-rmv-overloaded



It's about the "boy who cried wolf" vs how incident management works.

False positives can make real issues get an insufficient response when they occur (e.g. sentinel getting eaten).


False positives can be a far more serious issue than people are willing to admit. There are plenty of loss of life incident reports that include wording like "the operator disabled the alert system due to numerous false positives" near the beginning.

An example: https://www.youtube.com/watch?v=1zDcsjHyxr8


Reminds me of an admin story where each incident ended with updating /etc/motd



This is awesome. Job well done, and hats off, sir.


Well done. Seems like a good prototype to show someone who is unfamiliar with how a postmortem should be written up.


Indeed. Bookmarked it for exactly this.


I remember Jeff K's old website: https://www.somethingawful.com/hosted/jeffk/


Sensor was faulty, sensor replaced, system functioning normally.


Here I thought Jeff K was a sly fox, not a wolf.


Cruel.



