Don’t “Be humble”. Be scientific. Humility assumes the mistake is with you, but all you actually know is that the system is not behaving as expected.
Proceed by narrowing down what parts of the system behave unexpectedly. Keep an open mind.
Humility implies blame in some human way. The computer is a machine. It doesn’t care how we feel. It just is. Appreciate the freedom that grants you in debugging.
strace is a tool that would have identified the origin of the problem (a permissions problem) without any groping around in the dark. The small script to isolate the issue is a good strategy, with the added benefit of helping to convince other people that the issue is real.
strace can never be shilled enough. I’ve solved so many issues with it that were complete mysteries otherwise. A common and simple problem is a missing file that something needs in order to initialize, with no error emitted for it anywhere. strace will make that obvious in a snap (I had this problem on early LXC).
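To make the missing-file case concrete, here is a small made-up sketch (the program, config path, and fallback behaviour are all invented for illustration; only the general strace technique is real) of the kind of failure that emits no error anywhere, plus the strace invocation that exposes it:

    #!/usr/bin/env python3
    # Hypothetical program that swallows a missing-config error and silently
    # falls back to defaults, so nothing in its output hints at the cause.
    import json

    CONFIG_PATH = "/etc/myapp/config.json"  # made-up path

    def load_config():
        try:
            with open(CONFIG_PATH) as f:
                return json.load(f)
        except OSError:
            # No log line, no error message; the program just behaves oddly.
            return {"workers": 1}

    if __name__ == "__main__":
        print("running with", load_config()["workers"], "worker(s)")

    # Running it under strace makes the swallowed failure jump out:
    #
    #   strace -f -e trace=openat python3 app.py 2>&1 | grep ENOENT
    #
    # The failed openat() on /etc/myapp/config.json shows up immediately,
    # even though the program itself never reports it.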
Strace, dtrace, bpftrace are great tools in situations like this one. The code is a good approximation of what's happening, but monitoring the actual behaviour can make many issues obvious.
The permissions problem reminded me of an issue where CI builds in Cygwin would intermittently fail, complaining about not being able to read a file. However, just rerunning the job would always succeed.
It turned out that although opening the file for reading worked fine, access(path, R_OK) reported it as unreadable. The build system only called access() to cover some special case, which didn't occur on the rerun (because of caching?).
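For anyone curious what that kind of debug script looks like, a minimal check along these lines (the path is a placeholder, and this is a sketch of the idea rather than the actual script from that CI setup) is enough to show the open()/access() disagreement:

    import os
    import sys

    # Placeholder path: point this at whatever file the build complains about.
    path = sys.argv[1] if len(sys.argv) > 1 else "some/generated/file.h"

    # access() asks whether the file *should* be readable according to the
    # permission bits / ACLs (on Cygwin, via its POSIX emulation layer)...
    says_readable = os.access(path, os.R_OK)

    # ...while actually opening it shows whether a read really succeeds.
    try:
        with open(path, "rb"):
            opens_fine = True
    except OSError as e:
        opens_fine = False
        print("open() failed:", e)

    print("access(R_OK) says readable:", says_readable)
    print("open() actually works:     ", opens_fine)

    # On the flaky runs the two disagreed: open() worked, access() said no.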
> the issue was rectified within a few days—until it cropped up again, causing considerably less confusion (the debug script was already in place), and was fixed again. This time for real.
This was interesting but probably not the way the author intended. Lately I feel like I have been spending a significant part of my development time creating small scripts like this for the sole purpose of convincing sys-admins that the problem is actually theirs.
I absolutely believe that sys-admins are as stressed and stretched thin as the rest of us, and that systems in general are worse off because of it, but I have always been fascinated/irritated by the assumption that sys-admins are right until proven wrong.
As someone on the other side of this, I'm sympathetic and genuinely do try to debug problems, but, off the top of my head:
a) I don't actually have the same level of access to our cluster as our users do. There are datasets and even programs with contractual limitations on who can access them. So if you tell me "My job isn't working," I can't run it myself and see what's wrong; you need to send me the error message. Just like with software, if you can get me a minimal, self-contained example (especially one I can run myself), I can try to figure out why it's breaking, but I can't necessarily minimize your code.
b) Somewhat by definition (a system with "sysadmins" necessarily has enough users to justify paying us), there are a whole lot of other users who don't have whatever problem you have. (We notice very quickly if a problem is affecting everyone.) So chances are high that the answer is "You're holding it wrong" instead of "The tool is broken." Yes, a lot of the time that's bad documentation or bad error messages, which we can and should fix, but in practice the common answer to those questions is a teammate showing you how to hold the tool. The point of a sysadmin is to take advantage of economies of scale; it doesn't scale for us to debug everyone's problems. (And there's a very real sense in which time spent helping an individual user is time not spent writing docs or improving error messages.)
I think these problems ought to be solvable, and I'm curious what we (culturally) can do to make this better.
At the somewhat deep technical level, I've been sort of wondering about the nature of errors. Some errors - e.g., statting a file that doesn't exist - are fairly common in working software. Others - e.g., statting a file you don't have permission to access - ought to be pretty rare. Suppose we had a kernel that could distinguish those, somehow, and sample backtraces or error contexts in some fashion. Would that help us identify problems like this faster, and narrow down quicker on the fact that the system actually isn't working right?
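I have no idea what the kernel-side version of that would look like, but a purely userspace sketch of the idea (the names and the errno split are invented; it just wraps os.stat) might be:

    import errno
    import os
    import traceback

    # Invented classification: errors that are normal in working software
    # versus errors that usually point at real misconfiguration.
    COMMON_ERRNOS = {errno.ENOENT}
    RARE_ERRNOS = {errno.EACCES, errno.EPERM}

    rare_error_samples = []  # a real system might use a ring buffer or tracepoint

    def traced_stat(path):
        """os.stat wrapper that records a backtrace only for 'rare' errors."""
        try:
            return os.stat(path)
        except OSError as e:
            if e.errno in RARE_ERRNOS:
                rare_error_samples.append((path, e.errno, traceback.format_stack()))
            raise

    # A periodic report of rare_error_samples would point straight at the code
    # paths hitting permission errors, without drowning in routine ENOENTs.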
All of those are great points and I agree, I just find myself, more often lately, exhaustively trying to prove my bug is real before something gets fixed.
I wish there were some sort of badges I could acquire, like a "You have earned 5 bugs to be fixed, without being a dumbass" badge. And then my 6th one might get escalated earlier.
Like I said, I really appreciate both sides of the issue and am also not certain how to make it better.