
I usually run 'w' first when troubleshooting unknown machines - weinzierl
https://rachelbythebay.com/w/2018/03/26/w/
======
spudlyo
I too instinctively run `w` whenever I log into a machine, and that instinct
helped me land my current job.

One hour-long component of a (now-deprecated) SRE interview loop had the
candidate SSH into a series of EC2 instances and debug issues that got
progressively harder as the interview wore on.

I had wasted a substantial amount of time on the first and easiest problem by
really overthinking it, and not trying the simplest of debugging techniques
first. By the time I got to the final, hardest problem, I had just over 5
minutes remaining. The interviewer gave me a pass to skip it, but I was having
fun, and really wanted to take a crack at it.

The final problem was to try to figure out why logging into a particular
machine with SSH was slow. While I sat waiting for a prompt, I had a number of
thoughts. Is a reverse DNS lookup timing out? Is there a huge I/O load on the
machine? Am I going to have to wire up strace to `sshd` and log in again?

When I finally get to a shell prompt, I instinctively run `w` and it just
hangs. I hit ^C, `strace` it, and discover that it's blocking on:

        fcntl64(5, F_SETLKW, {type=F_RDLCK, whence=SEEK_SET, start=0, len=0}) ...

I look up a bit and discover that file descriptor 5 is /var/run/utmp. So `w`
is trying to get an advisory lock on utmp and failing. Then it hits me: `sshd`
is likely also trying to acquire a lock on utmp, failing, and then eventually
timing out.
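The blocking is easy to reproduce at home. A minimal sketch using flock(1),
which takes a BSD flock(2) lock on a scratch file rather than the POSIX fcntl
F_SETLKW record lock in the trace, but illustrates the same advisory-lock
semantics:

```shell
# Hold an exclusive advisory lock on a scratch file via fd 9
exec 9>/tmp/lockdemo
flock -x 9

# A second process now can't get the lock; with -n it fails immediately,
# without -n it would block, just like `w` did on /var/run/utmp
flock -n /tmp/lockdemo -c 'echo acquired' || echo 'lock held elsewhere'

# Release by closing the descriptor
exec 9>&-
```

Advisory locks only bite cooperating programs, which is exactly why a single
rogue lock holder can stall every well-behaved utmp reader at once.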

A little bit later, I'd found and killed the rogue program that held the
lock, and SSH logins were fast again. Solving that last problem so quickly
really boosted my spirits, and gave me the energy to push through the harder
interviews that came later in the day.

Thanks w!

~~~
ahh
That actually seems like a great idea for a work sample test.

~~~
matt_wulfeck
In my opinion, it's a bit too much trivia to be a valuable indicator of SRE
success. It would be really great if a candidate (who rated themselves well in
systems debugging) could walk you through the strace output of a command like
that.

~~~
spudlyo
I really like that idea. It might be fun to be given some `strace` output
(with the initial `execve` and writes to stdout/stderr redacted) and then be
asked to determine which UNIX command it was, or more broadly what it was
doing.

------
lazyant
My go-to for initial troubleshooting on a server is:

uptime # uptime and CPU stress

w # or better yet: last | head # who is/has been in

netstat -tlpn # find server role

df -h # out of disk space?

grep kill /var/log/messages # out of memory?

ps auxf # what's running

htop # stressed? look out for processes in state D (typically waiting on I/O)

history # what has changed recently

tail /var/log/application.log # anything interesting logged?
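Roughly the same checklist as a paste-able script (a sketch: tool availability
and log paths vary by distro, and the interactive htop and history steps are
left out):

```shell
#!/bin/sh
# Quick first-pass triage, roughly the checklist above
uptime                     # load averages and time since boot
last | head                # who has been logged in recently
df -h                      # out of disk space?
df -hi                     # out of inodes?
ps auxf                    # what's running, as a process tree
dmesg 2>/dev/null | grep -i 'killed process' | tail   # OOM-killer activity?
```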

~~~
fapjacks
I've wasted time not checking for inode availability, so I'd add a check for
that to this list:

df -hi

------
jillesvangurp
I can recommend setting up Auditbeat & Kibana or similar. Auditbeat is a
recent addition to Elastic's Beats agents; it ships audit logs for a lot of
things, including SSH logins and system calls, and it can monitor changes to
files/directories as well. So you can flag boxes where people are poking
around in /etc, and see which boxes are being accessed via SSH by which user.

We have this and a few other Elastic Beats baked into the AMIs we use for
most of our VMs in Amazon. So anything we deploy starts sending lots of data
to our logging cluster: metrics, auditing, internal application events,
stack traces from our Docker infrastructure, syslogs, etc.

------
checkyoursudo
I run my own web server and email server for my law firm. I'm the only one
with credentials to log in (ash), and probably the only one who would even
know how to do it, and possibly the only one who knows we have servers.

And I still run w first thing almost every time out of habit.

What am I looking for? A session I accidentally left open somewhere else?
Unauthorized access? A friend? Dunno...

~~~
RobAley
You're leaving your firm incredibly vulnerable to the "checkyoursudo gets hit
by a bus" scenario, aren't you? At least let the rest of your firm know you
have servers, and give some credentials (for another account with sudo
permissions) to e.g. a managing partner or similar (with instructions never
to use them unless you die).

~~~
HenryBemis
My thought exactly! For checkyoursudo's company, he is what is called a Key
Person Risk.

But in my examples, I never use the "hit-by-a-bus". Sysadmins tend to frown
upon that comment.

I use the "win-the-lottery-go-to-Fiji-and-never-look-back". It always makes
them smile :)

~~~
AnIdiotOnTheNet
The problem with "win-the-lottery-go-to-Fiji-and-never-look-back" as an
example is that it doesn't quite get the point across: they'll have time to
gracefully transfer knowledge, and afterwards they're just somewhere else in
the world, so I can still theoretically fly over there and beat them with a
wrench until I get what I need.

Death is a real thing that really happens to people, and from an
organizational perspective it is valuable to keep in mind that no one in your
company is immune to that.

~~~
wccrawford
Yeah, it's not the same at all. I had an ex-employer that fired me multiple
times, and came back to me to get forgotten passwords multiple times.

After a little soul-searching, and deciding I didn't want to harbor anger, I
helped him with the ones I remembered each time.

Had I actually been hit by a bus, that wouldn't have been possible at all.

I'm sure that if I'd hit the lottery and gone to Fiji, I'd have been even more
likely to help him with those passwords.

In the end, not burning that bridge did help me earn more money, as he hired
me back several times, and I demanded more money each time until I was asking
almost as much per hour as he was charging his customers and he simply
couldn't afford me.

~~~
pertymcpert
How do you get fired by someone multiple times?

~~~
frandroid
Not learning the lesson multiple times. :)

------
ktpsns
Interesting that nobody has mentioned htop so far
([https://en.wikipedia.org/wiki/Htop](https://en.wikipedia.org/wiki/Htop)). It
is my favourite command for getting a quick glance at a machine's
computational resources (memory, cores) and what it is doing (load, fancy
ps/top). htop is not installed everywhere, but it is easy to make a static
build and scp it to the host in question.

Another very handy command is

      sudo netstat -atpn

which shows you the processes and owners of open TCP ports (add -u for UDP).
The argument combination is as weird as "ps aux"; I just memorized it by
heart.

~~~
poooogles
netstat has been deprecated for a while; you should be using ss from
iproute2. Most of the flags map directly, so it's a pretty drop-in
replacement.
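For reference, a rough mapping of the flags mentioned in this thread
(assuming iproute2's ss; seeing other users' process names still needs root):

```shell
# netstat → ss equivalents; the common flags carry over directly
ss -tlpn   # listening TCP sockets + owning process  (was: netstat -tlpn)
ss -atpn   # all TCP sockets + owning process        (was: netstat -atpn)
ss -s      # socket summary counters                 (netstat -s is similar)
```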

~~~
bloopernova
I like ss if only for the mnemonic "ss 4chan" which sort of but not quite maps
to "ss -4tan" (
[https://explainshell.com/explain?cmd=ss+-4tan](https://explainshell.com/explain?cmd=ss+-4tan)
)

~~~
y4mi
That's glorious, gotta remember that one.

"-chan" is a Japanese honorific used for children; various communities have
created several anime mascots over the years. They're generally suffixed with
"-tan", a cute mispronunciation of "-chan".

[https://en.m.wikipedia.org/wiki/OS-tan](https://en.m.wikipedia.org/wiki/OS-tan)

Btw, "4tan" should be the 4chan mascot. I've seen it previously on Sankaku
Complex, but that was too many years ago; I can't find it right now.

/Edit: I probably should mention that you shouldn't visit Sankaku Complex at
work. It's very... questionable, with nudity.

~~~
mappu
_> that 4chan mascot_

[https://en.wikipedia.org/wiki/Yotsuba%26!](https://en.wikipedia.org/wiki/Yotsuba%26!)
(SFW)

------
EngineerBetter
Genuine question: how many HN readers log on to boxes with user accounts that
belong to humans, where some state may have been mutated?

My experience over the last five years is so heavily weighted towards
(effectively) immutable infrastructure that checking to see who had been on a
box hadn't even crossed my mind.

~~~
Fradow
On my personal dedicated server I don't have any of the "modern" things; I do
everything the old way, so I do log in manually. It's a feature: it's a way
for me to learn old-school sysadmin work (and I have so little running on
there anyway that automation isn't needed).

For my startup, I often run bash on Heroku because I do migrations manually
(again, it's a feature; I'm too inexperienced to have automated migrations
that work every time, and I prefer to already be on hand if one breaks).
Sometimes when something breaks I'll also poke around the filesystem (which is
a copy, so no fear of breaking anything).

Basically, I'd say the smaller your team, your uptime requirements and your
traffic are, the less you need automation and the more likely you are to log
in directly to a box (I combine all of those: team of 1, no uptime
requirements, not enough traffic to even max out the most basic server).

~~~
majewsky
Counterpoint: Even on my private VPS, I have all the configuration as code. It
gives me peace of mind knowing that when a server comes crashing down for
whatever reason, I can reinstall it and bring all services back online with
not more than one hour of time invested. Time is precious.

(Also, when someone asks me "how do you configure X", I can just link them to
the corresponding place in my system configuration repo on Github.)

~~~
zacmps
I'd be interested in seeing the config if it's public?

~~~
Crontab
Probably [https://github.com/majewsky/system-configuration](https://github.com/majewsky/system-configuration)

~~~
majewsky
Exactly. The Readme is horribly outdated, but the basic remarks are still
true.

------
gyrgtyn
How, after like 20 years of goofing around on Linux, have I never heard of `w`?

~~~
spudlyo
Do you know about `comm`? Given two sorted files, it shows you three columns:

        * lines unique to file1
        * lines unique to file2
        * lines common to both files

You can pass it various options to suppress any of the three columns.
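A quick sketch:

```shell
# comm requires sorted input; -1/-2/-3 suppress the corresponding column
printf 'a\nb\nc\n' > /tmp/f1
printf 'b\nc\nd\n' > /tmp/f2

comm /tmp/f1 /tmp/f2      # all three columns
comm -12 /tmp/f1 /tmp/f2  # suppress columns 1 and 2: prints only b and c
comm -3 /tmp/f1 /tmp/f2   # suppress column 3: lines unique to either file
```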

I once had an interview at Facebook where one of the problems was easily
solved by `comm`. I found it funny that neither the interviewer nor anyone
who reviewed the interview question had ever heard of it. I was a good sport
about it, though: I ended up writing a janky Perl script that roughly
implemented `comm` to solve the problem, which (modulo Perl) was what they
wanted me to do.

~~~
theoh
In that vein, I was once asked in an interview about implementing an n-way
merge of logfiles. The key insight/recollection they wanted was: use a heap
to implement a priority queue. I'm not sure it was a great question, as
jumping straight to a data structure like that might suggest premature
optimization.

~~~
rthille
yeah, I'd guess `cat *.log | sort`

~~~
theoh
This was for Zeus, the high-performance web server guys. So I guess it was a
thing for them to optimize in general: a USP.

If you specify that the files are way too big to fit in memory, well, that's a
different story.

------
dvh
I run df first; very often something stopped working because the disk ran
out of space.

~~~
aepiepaey
...and then df gets stuck in uninterruptible sleep (due to a filesystem
hang), and ^C and ^Z do nothing.

~~~
executesorder66
How would you avoid that, and still find out the disk usage?

~~~
zbentley
You wouldn't. If the filesystem is hung, or if some common code path (e.g.
stat()) never returns from a call that blocks no matter what, then the
presence of the hang itself indicates an issue. The isolation process for me
would probably be something like:

1. df -h; notice that it hangs.

2. Log in to a new shell, 'strace' the old process or a new one doing the
same thing, see what path it was choking on.

3. If the breakage is on an external/network filesystem, reboot the host in
almost every case. Unless it was happily completing day 364/365 of some
incredibly important task elsewhere, it's just not worth my time to remount a
dead share and clean up everything that broke while trying to talk to the old
one. I've had database servers lose some random NFS share that the DB process
wasn't even using, then crash months later from PID exhaustion because some
monitoring script in cron kept trying to talk to a somehow-corrupted
mountpoint and hanging forever. Yes, in theory timeouts and client programs
should handle these failures perfectly. Given my experience, I have very
little faith in theory matching up with reality.

4. If it's on an internal drive, check dmesg/syslog (if I can) for any
smoking guns. Reboot and see if the problem goes away. If it does, unless I
can find something blindingly obvious indicating that the issue was transient
and unlikely to reoccur, I'm probably reprovisioning the system after a
hardware diagnostic. Even if the server isn't critical and just serves a cat
blog or whatever, it's not worth my time and repeated head-scratching to deal
with issues like this more than once per host.

5. If I need data off the questionable filesystem, I'll get it exclusively
via a recovery environment; otherwise it's not worth the risk with a
flaky/failing drive (this applies even if the server itself is virtualized).
I hope the server has some sort of LOM console set up so I can do that;
otherwise someone's getting travel expenses for my trip onsite.
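One guard worth pairing with step 1, assuming GNU coreutils: bound the probe
with timeout so a dead mount can't wedge the diagnostic shell too. (Modern
NFS client sleeps are SIGKILL-able, so -k usually gets control back; a hard
local-disk hang can still defeat this.)

```shell
# Bound the probe; -k 2 sends SIGKILL if SIGTERM isn't delivered within 2s
timeout -k 2 5 df -h || echo "df hung or failed: suspect a dead mount"

# GNU df can also skip the usual network suspects entirely
df -h -x nfs -x nfs4 -x cifs
```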

Edits: grammar.

~~~
insanejudge
Interesting (well, interesting to me) note on the nfs case, on modern linux,
`umount -l` should be able to unmount pretty much anything. You'll often still
be left with a pile of processes stuck in uninterruptible sleep depending on
the scope of the 'random share', but at the very least it can staunch the
bleeding and let you move around.

TBH I get rather claustrophobic when I can't `w` with aplomb.

------
Symbiote
> up 23 days

> Right there, I can see that okay, the box hasn't been rebooted recently

Is the implication of "recently" that it should have been rebooted every
couple of weeks or something?

My desktop is at 54 days (since I moved desk, I think) and picking a random
Hadoop node:

> 10:26:47 up 303 days, 11:56, 1 user, load average: 44.12, 50.56, 47.48

It's private, so kernel updates aren't a security issue.

(Keeping this on-topic: "history" tells me the most frequent thing I do on
this node is "sudo iftop"; we've been doubting the accuracy of our monitoring
system's network utilization graphs.)

~~~
RobAley
I think her thinking was "the problem started recently (~days); did the box
get rebooted recently (~days), which might indicate when the problem
started?", rather than that it _should_ have been rebooted recently.

------
subway
w; df; dmesg; top

usually there's a loud _sob_ somewhere in between.

~~~
agumonkey
reading w df in my head made me laugh

------
rurban
> If you want to impress me, set up a system at your company that will reimage
> a box within 48 hours of someone logging in as root and/or doing something
> privileged with sudo (or its local equivalent). If you can do that and make
> it stick, it will keep randos from leaving experiments on boxes which
> persist for months (or years...) and make things unnecessarily interesting
> for others down the road.

Ferrari does this. They do a lot of experimentation on their most expensive
internal test equipment, and every now and then the whole box is re-imaged
automatically, even though it's completely locked down from the outside. It's
their internal staff who are corrupting/improving the system. Only if a
change is a really good and well-tested improvement will they make it stick.

~~~
softawre
Anybody running Chaos Monkey (from Netflix) does this, at least for their
stateless services.

------
olefoo
I do that too. I probably wouldn't if I hadn't learned on a shell server where
that was how you found out who else was up.

~~~
SubiculumCode
I did not know `w` was a command. I've used the 'last' command:
[https://www.cyberciti.biz/faq/linux-unix-last-command-examples/](https://www.cyberciti.biz/faq/linux-unix-last-command-examples/)

------
chris_wot
I never even knew about that command...

~~~
Freak_NL
Me neither. I just asked two colleagues (one on MacOS, one on Ubuntu) and got
the same “huh, I never knew?” response I had.

Sure there are tons of unknown commands on any OS, but a _one-letter_ command
you've never heard of somehow amplifies the amazement.

------
sgaduuw
My standard set of commands in the context of troubleshooting, in no
particular order, is:

w

last

dmesg

------
js2
Related, "Linux Performance Analysis in 60,000 Milliseconds":

[http://techblog.netflix.com/2015/11/linux-performance-analysis-in-60s.html](http://techblog.netflix.com/2015/11/linux-performance-analysis-in-60s.html)

------
seanwilson
I stick to serverless setups and platforms like Heroku wherever I can. The
whole idea of state being stored on a web server, or having to SSH in to one
to do any kind of admin, makes my skin crawl now.

~~~
betaby
It's not that "there is no state"; you've just outsourced the state to a
third party.

~~~
seanwilson
I think it's more the case that typical VPS setups introduce state where you
shouldn't have state at all.

Web servers, for example, should typically hold minimal state, for
scalability and robustness, but on most VPS setups this is not the case. If
the VPS got wiped it could take days of work to get up and running again, and
if you wanted to scale horizontally you'd have some reengineering to do. You
could try to replicate something like Heroku yourself to minimise state on
your web servers, but it would take you a lot of time and wouldn't be as
robust.

------
yonilevy
I use the same approach when troubleshooting or getting into hairy code: git
blame ("annotate" in IntelliJ) first; it gives a lot of useful context!
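A couple of handy variants of that (the repo and file names below are made up
for illustration):

```shell
# Demo in a throwaway repo
cd "$(mktemp -d)"
git init -q .
printf 'alpha\nbeta\n' > handler.c
git add handler.c
git -c user.email=you@example.com -c user.name=you commit -qm 'init'

git blame -w -C handler.c    # -w ignores whitespace churn, -C follows moved code
git log -L 1,2:handler.c     # history of just lines 1-2 of the file
```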

------
vermaden
On FreeBSD systems there are two great commands that are not available on
other systems.

These are the 'gstat' and 'top -m io' commands, both I/O related.

Often top/htop shows little happening, yet the server is very slow; a lot of
the time it's because of I/O operations.

The 'gstat' command will tell you (among other useful statistics) how loaded
the storage devices are:

      # gstat
      dT: 1.001s  w: 1.000s
       L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
          1   9679   9614   4807    0.1     63    671    0.2   82.3| ada0
          0      0      0      0    0.0      0      0    0.0    0.0| ada0p1
          0     65      0      0    0.0     63    671    0.2    1.4| ada0p2
          0      0      0      0    0.0      0      0    0.0    0.0| ada0p3
          0      0      0      0    0.0      0      0    0.0    0.0| gpt/boot
          0     65      0      0    0.0     63    671    0.2    1.5| gpt/sys
          0      0      0      0    0.0      0      0    0.0    0.0| gpt/local
          0      0      0      0    0.0      0      0    0.0    0.0| zvol/local/SWAP

The 'top -m io' command shows how much I/O each process does:

      # top -m io -o total 10
      last pid: 51154;  load averages:  0.31,  0.31,  0.28 up 3+18:01:00  14:58:15
      54 processes:  1 running, 53 sleeping
      CPU:  2.4% user,  0.0% nice, 15.3% system,  5.3% interrupt, 77.1% idle
      Mem: 345M Active, 1236M Inact, 153M Laundry, 2158M Wired, 3903M Free
      ARC: 834M Total, 46M MFU, 295M MRU, 160K Anon, 5006K Header, 488M Other
           67M Compressed, 274M Uncompressed, 4.08:1 Ratio
      Swap: 4096M Total, 4096M Free
      
        PID USERNAME       VCSW  IVCSW   READ  WRITE  FAULT  TOTAL PERCENT COMMAND
      51021 vermaden       6005     16   6005      0      0   6005  99.92% dd
      51154 vermaden          8      9      5      0      0      5   0.08% top
      51036 vermaden          6     10      0      0      0      0   0.00% xterm
      50907 vermaden          0      0      0      0      0      0   0.00% zsh
      50905 vermaden          0      0      0      0      0      0   0.00% xterm
      50815 vermaden          0      0      0      0      0      0   0.00% zsh
      50813 vermaden          0      0      0      0      0      0   0.00% xterm
      50755 vermaden          0      0      0      0      0      0   0.00% tail
      41780 vermaden          0      0      0      0      0      0   0.00% leafpad
      41255 vermaden         29     11      0      0      0      0   0.00% firefox

Another very useful FreeBSD command is 'vmstat -i', which shows how many
interrupts are occurring:

      # vmstat -i
      interrupt                          total       rate
      irq1: atkbd0                       75135          0
      irq9: acpi0                      2575929          8
      irq12: psm0                       135060          0
      irq16: ehci0                     1532065          5
      irq23: ehci1                     3265677         10
      cpu0:timer                     102772345        317
      cpu1:timer                      94199942        291
      irq264: vgapci0                   370466          1
      irq265: em0                         7017          0
      irq266: hdac0                    1904427          6
      irq268: iwn0                   148690342        459
      irq270: sdhci_pci0                   147          0
      irq272: ahci0                   20875039         64
      Total                          376403591       1161

I always suffer when I have to debug Linux systems because they lack these
commands.

Regards, vermaden

~~~
teddyh
You do know about “iotop”, right?

~~~
vermaden
Thank you for the suggestion; now the answer is yes. A nice alternative to
'top -m io' on Linux.
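For reference, rough Linux counterparts to the three FreeBSD commands above
(iostat comes from the sysstat package; iotop generally needs root):

```shell
# Skip the optional tools gracefully if they aren't installed
command -v iostat >/dev/null && iostat -x 1 3    # per-device %util, ~ gstat's %busy
command -v iotop  >/dev/null && iotop -o -b -n 1 # per-process I/O, ~ top -m io
cat /proc/interrupts                             # per-IRQ counts, ~ vmstat -i
```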

------
giis
My first reactions after logging into an unknown *nix machine:

uptime

uname -a

w

who

df -h

free -m

cat /proc/cpuinfo

mount

top

ps aux

then check appropriate logs or dmesg etc

~~~
Fnoord
You say *nix machines, but some of your commands are Linux-specific.

~~~
freehunter
Only two: 'free' and 'cat /proc/cpuinfo'. Everything else works on both.

------
MPSimmons
I actually have a profile.d entry that runs w on login any time I open a shell
on any machine in my infrastructure.

------
mirekrusin
There used to be a time when the first thing I did was `sudo reboot`, and it
actually worked just fine many times.
------
coinerone
"w" & "last" are first for me. Then "top" and "uname"

------
tluyben2
Yep w for me as well. History after.

------
edf13
And next, I usually run "top"... quick snapshot of what's going on.

------
known
sudo dmesg -w -H --level=err

------
sente
atop creates world-readable log files. Does anyone else think this is a
security vulnerability?

~~~
cesarb
They have nearly the same information a normal user could get by running atop
themselves, which reads the world-readable virtual files found in /proc. If
you are using something like SELinux to restrict access to /proc files, the
same mechanism could be used to restrict access to the /var/log files.

------
banned1
This is the sort of article that I keep coming back to HN for. It's not
politics. It's pure and simple technical advice with reasonable back and forth
opinions from the community (other than the "if you want to impress me" guy).
And nobody wants to take my 2A right, sorry, my right to use 2FA!

------
xstartup
I run history first, followed by df then finally I run htop.

------
StapleHorse
I do 'ls' as a reflex; I don't know why. For network monitoring I like:

 _nethogs_

 _nload -u K_

------
sporkenfang
There are two things I don't like here:

- the author seems to call out specific individuals, by name, to their team
leads, based on where they were SSH'd in, instead of bringing the issue up
with that person first and asking why they were on the box doing the thing
(sounds like a lot of assumptions combined with finger-pointing)

- it sounds like there's no well-configured monitoring or observability at
the DC/rack/machine level involved at all, which is surprising in a modern
enterprise setup

~~~
rachelbythebay
The on-call person was on a plane. They did something over intermittent
connectivity and made a mistake. Another person on the ground is helping
until the first one lands. I tell #2 about a box touched by #1 and ask them
to have a look.

They did and they figured it out. Outage resolved.

Why did you automatically assume I ratted them out to management? At no point
does the story go there.

I’m really curious, since misunderstandings like this can really poison a
working environment when people think you’re doing things you’re not. I want
to know what sent you down the wrong path here.

~~~
hitekker
Nah, you're fine, OP.

Commenters tend to proclaim bad intentions where none are present, either
because they skimmed without reading or because they're lashing out to
compensate for some weird insecurity, e.g. "I caused an outage once and I
didn't want anyone to ask me and find out! How dare you want to know!?"

~~~
rachelbythebay
Thanks. I ask because I’m pretty sure I tripped over this bigtime in recent
history.

When all you do is wander around looking for broken stuff to help fix, imagine
the above sequence repeating itself.

------
briandear
> If you want to impress me, set up a system at your company that will reimage
> a box within 48 hours of someone logging in as root and/or doing something
> privileged with sudo (or its local equivalent). If you can do that and make
> it stick, it will keep randos from leaving experiments on boxes ...

It’s called Chef.

~~~
pdpi
Rachel’s starting point here is a fair few steps ahead of ”it’s called chef”.
For one, Chef only fixes the things that you explicitly tell it to. When you
have enough machines and enough people poking at them, you’re more or less
guaranteed that someone somewhere will have gotten a machine or three into a
state Chef can’t recover from; hence her suggestion of reimaging.

~~~
isostatic
Always pxeboot; then "re-imaging" is a matter of "turn it off and on again".

If you have a hardware failure how would you rebuild?

Of course this all takes time to implement and manage -- it's a tradeoff.

~~~
pdpi
How do you manage the network boot server? And how does pxeboot interact with
hosting your rack switch images?

~~~
isostatic
A high-availability IP for your DHCP "next-server", so you can boot one
server from another.

Mikrotik routers and switches can boot from DHCP (or BOOTP?), but yes, a
typical switch can't.

I'm not a network person, but I assume you can get fine-grained enough
control that you can't do a "copy running-config startup-config" on a Cisco
switch, so you have a startup config that boots to a known basic state and
then TFTPs its config from your HA DHCP server.

