
First 5 Minutes Troubleshooting A Server - balou
http://devo.ps/blog/2013/03/06/troubleshooting-5minutes-on-a-yet-unknown-box.html
======
antirez
That's what I often do, however it is clear that most of these tasks can be
done automatically, so there should be somebody doing a 'linux-doctor' open
source project that will try to identify issues automatically. I'm assuming it
does not exist, since I have never seen one.

~~~
DanielRibeiro
DataDog[1] + PagerDuty[2] do a good job at being linux-doctor. You have to be
pro-active about it though...

[1] <http://www.datadoghq.com/>

[2] <http://www.pagerduty.com/>

~~~
hunvreus
Data Dog is pretty fantastic really. We love it at devo.ps.

------
bashtoni
I wrote a simple bash script which is a good starting point for checking
server issues.

<https://github.com/BashtonLtd/whatswrong>

The idea isn't to tell you the problem exactly, but more to stop you missing
things that are obviously wrong.

------
rachelbythebay
I wish I had recorded some of my better moments when I worked web hosting tech
support. Being able to jump onto a box, poke around at two or three things,
notice something wrong from that, and come up with a solution was magic. It's
hard to believe unless you actually see it happen, though.

I have the recording capability now (and, more importantly, playback too), but
the constant influx of broken boxes is gone. Funny how that works.

~~~
daxelrod
What do you use for recording and playback?

~~~
rachelbythebay
Recording: 'script' from util-linux, with a custom patch or two to make it
less broken. <http://rachelbythebay.com/w/2013/03/02/bugfix/>

Playback: a fork of term.js derived from the jslinux terminal with some minor
adjustments, plus a wrapper of my own which plays back the data with timing
intact. <http://rachelbythebay.com/w/2013/03/04/jvt/>

I've been using them to demonstrate how I go about writing certain bits of
code, with the goal of eventually showing the creation of something larger.

~~~
dredmorbius
Script with saved timestamps is a godsend.

Particularly useful when you're accessing via a serial console, as you'll get
full boot from BIOS. You can both log and troubleshoot based on output.

Playback utilities can take a multiplier to speed (or slow) playback from
realtime.

------
fduran
This is pretty similar (we even start with the same two commands) to what I
typically do in my checklist :-)

<http://www.fduran.com/blog/quick-linux-server-review-for-mortals/>

------
peterwwillis
One of the most useful pieces of information for me when troubleshooting is
network activity. Both network monitors and traffic flows can often tell you
exactly what the problem is so you don't have to spend five minutes collecting
data samples.

~~~
johngalt
A ring-buffered dumpcap is a huge time saver. Narrows the scope of
troubleshooting quickly and reliably. Start with looking at what actually
happened on the wire for X problem. Then use the rest of the tools/logs etc...
to determine why it happened that way.
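
A ring-buffered capture along these lines can be sketched as follows. The helper name `ring_capture`, the `DRY_RUN` switch and the default sizes are made up for this example; only the `-i`, `-b filesize:`, `-b files:` and `-w` options are dumpcap's own.

```shell
# Hypothetical helper: keep a rolling window of capture files on disk.
# Usage: ring_capture IFACE DIR [FILE_KB] [FILES]
ring_capture() {
    iface=$1
    dir=$2
    file_kb=${3:-102400}   # switch to the next file after this many kB
    files=${4:-10}         # wrap around after this many files (the ring)
    set -- dumpcap -i "$iface" \
        -b "filesize:$file_kb" -b "files:$files" \
        -w "$dir/ring.pcapng"
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$@"          # only print the command that would be run
    else
        "$@"               # capturing usually needs root or capture privileges
    fi
}
```

With the defaults this keeps roughly the last gigabyte of traffic, so the capture can be left running until the problem shows up again.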

------
falcolas
Personally, I think that top and vmstat are a bit low on the list; they're
typically the first things that I run. While they're too general to provide
good troubleshooting of the actual problem, they do a great job of pointing
out to me where the problem probably is.

User experience reports are nice, but rarely indicate something other than
"load is high", or "server is unresponsive". vmstat and top not only can tell
you that, they can start telling you why and where to look for your problem.
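
As a rough illustration of letting vmstat point at where to look, here is a small filter that flags samples with high I/O wait. The function name `high_iowait` and the default threshold of 20 are assumptions for this sketch, not anything from the article.

```shell
# Read vmstat output on stdin and report samples whose "wa" (I/O wait)
# column is above a limit. The column is located from the header line
# instead of being hard-coded, since layouts can differ between versions.
high_iowait() {
    awk -v limit="${1:-20}" '
        $1 == "r" { for (i = 1; i <= NF; i++) if ($i == "wa") col = i; next }
        col && $1 ~ /^[0-9]+$/ {
            n++
            if ($col + 0 > limit) printf "sample %d: wa=%s\n", n, $col
        }'
}

# Example: ten one-second samples, flag anything above 20% I/O wait.
#   vmstat 1 10 | high_iowait 20
```

Keep in mind that the first line vmstat prints is an average since boot, so the later samples are the ones that describe what is happening right now.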

------
rmc
I've written a command (that we're still playing around with here) to take a
'snapshot' of things that a server is doing (many of the things mentioned in
this article). This can allow you to look at it later to see what's going on.

<https://github.com/rory/SystemAutopsy>

------
w0ts0n
One key thing missing imo. "history" can be a huge help. Even more so if there
are multiple admins on call.

~~~
gbog
And the infamous reboot?

~~~
richardkeller
A reboot is the _last_ thing you should be doing on a problematic box. A
reboot should only be attempted if you're 100% sure that the problem can
actually be solved by a reboot (for example in extreme cases a buggy kernel
module might require a reboot, but even that's a very rare occurrence).

I've seen inexperienced sysadmins remotely rebooting a server only to discover
that it won't boot up anymore because the original problem was that the server
had run out of disk space.

~~~
gbog
Yes, that's why I think the article should mention reboot.

~~~
balou
That's a valid point; mention that one should not reboot until proven
otherwise.

Same idea for services restart; don't do it unless absolutely needed. While it
may be doing the trick in some cases it can also generate its own set of new
false symptoms.

Take this example:

- mysql is "slow"
- let's restart! ...
- the mysql init script has been puking dots (.) for the last 30 minutes on
shutdown
- let's kill -9 it! ...
- mysql db is corrupted! Hurray!

------
kondor6c
doing "ps auxf" will put "ps aux" in forest mode and give you the same effect
as "pstree -a" while giving you all the information that "ps aux" will give.

------
viddy
Something that's been missed: atop. It's similar to top/htop and friends,
but has an additional system process snapshot daemon, allowing us to answer
the perennial question of how a server got into a particular state. Something
particularly useful is the wide view (see here:
<http://www.atoptool.nl/screenshots.php>) and average disk response times in
ms.

------
sanotehu
What I found interesting while reading this article was the parallel to what a
doctor does when diagnosing problems.

In medicine, it's commonly known that the interview with the patient (the
'history') is the first thing a doctor should be doing. Not just because it
establishes a relationship with the patient, but because the diagnosis of most
illnesses is guided primarily by the history [1] - even with modern MRI
machines and DNA amplification techniques! At the very least the chat with the
client provides context for the problem that you are investigating - you are
now putting flesh on a skeleton of meaning rather than trying to create it on
your own.

This article stresses the importance of first getting a verbal 'history' from
the client - what the problem is, characteristics of the problem, time-course
of the problem and co-incidence with other events (like software upgrades).
There is also a parallel to medicine in that in this field a skilled
practitioner may be able to diagnose the problem based solely on the history
alone [2].

The second thing I noticed was the fault-finding mindset. As a medical student
halfway through his second year of hospital placements this is something I
took some time to learn. The initial approach to finding the reason for a
problem is usually to (1) think of a possible reason for the problem, (2) try to
fix that reason, and (3) if that doesn't work, goto 1. While this is good
because it shows you are actually thinking about the cause of the problem
rather than its effects, it's not the most efficient way of going about
things. One way doctors can narrow down problems is by restricting them to
systems such as the cardiovascular system or the neurological system. A
searing pain in your chest is more likely to be due to a problem with your
heart or lungs than due to a problem with your kidneys or gonads.

This article takes exactly the same view of servers, classifying the
individual hardware and software components that make up the vast majority of
(linux) servers in the wild.

I don't fiddle around with servers much any more, but I'm bookmarking this
page because it is such a useful illustration of a fault-finding mentality.

[1] <http://archinte.jamanetwork.com/article.aspx?articleid=1105870>

[2] <http://blogs.msdn.com/b/oldnewthing/archive/2012/08/29/10344405.aspx>

------
tseeling
I always cringe when I see shell code like this:

    cat /etc/passwd | cut -f1 -d:

Usually this comes as:

    ps -ef | grep something | grep -v grep | grep -v $myownpid

Why not use _one_ simple and concise awk statement which does it all in one
go?

    awk -F: '{print$1}' /etc/passwd
    ps -ef | awk '/[h]ttpd/{print$2}'

But apart from that: very nice summary of things to consider and the sequence
for analysis.
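
For anyone who wants to try the awk variants, here is a minimal sketch that wraps them in two functions. The names `list_users` and `pids_matching` are invented for this example.

```shell
# First field of a colon-separated passwd-style file; no cat or cut needed.
# Pass "-" to read from stdin instead of a file.
list_users() {
    awk -F: '{ print $1 }' "${1:-/etc/passwd}"
}

# Read `ps -ef` output on stdin and print the PID (second column) of every
# line matching the given regex. Writing the pattern as [h]ttpd matches
# "httpd" but not the literal text "[h]ttpd" in a process's own command
# line, so no extra "grep -v" stage is needed.
pids_matching() {
    awk -v pat="$1" '$0 ~ pat { print $2 }'
}

# Example:
#   ps -ef | pids_matching '[h]ttpd'
```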

~~~
krenel
> why not use one simple and concise awk statement which does it all in one
> go?
>
>     awk -F: '{print$1}' /etc/passwd
>     ps -ef | awk '/[h]ttpd/{print$2}'

IMO it is easier to type the "grep/grep -v" statements than the awk ones.
Usually we are more used to grep, and we add pipes and filters until we get
the expected result.

I'm using zsh, though, and thanks to global aliases I can do:

    $ ps -ef G something GV grep GV ownpid

------
niggler
On OSX (even though the HN title says linux, the original article title
doesn't say "Linux" and many of the same steps apply to OSX server), you can't
run dmesg as a normal user -- you have to run it as root or with sudo.

------
growt
It's a nice checklist. I personally would do dmesg and log-checks right at the
beginning. Also checking sw-raid and drives is missing (cat /proc/mdstat,
hdparm, fdisk -l, smartctl ...)
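
The software RAID part of that check is easy to script. Below is a sketch that assumes the usual Linux /proc/mdstat layout, where each array's status line ends in something like [UU] and an underscore marks a missing member; `degraded_arrays` is a made-up name.

```shell
# Print every md array whose member map contains "_" (a failed or missing
# disk). Reads /proc/mdstat by default; pass a file name, or "-" for stdin.
degraded_arrays() {
    awk '
        /^md[^ ]* :/      { name = $1 }
        /\[[U_]*_[U_]*\]/ { print name ": degraded " $NF }
    ' "${1:-/proc/mdstat}"
}
```

No output means no array reported a missing member, which is not the same as the disks being healthy, so smartctl is still worth a look.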

------
belorn
A quite good checklist/advice on server troubleshooting. What I mostly miss
in the list are network tools such as telnet/dig/ip/fping/mtr.

------
dredmorbius
vmstat and iostat are also useful for tracking memory issues.

sysstat ('sar') reporting can also provide some of that much-needed history.
Sar output is pretty readily visualized with utilities such as gnuplot.

------
coin
Yet another site that disables pinch zoom for iOS devices. Pointless..

~~~
hunvreus
Not sure why it is so: I'll look into it today.

~~~
bti
Take the maximum-scale out of your meta viewport tag.

<meta name='viewport' content='width=device-width, initial-scale=1, maximum-scale=1'/>

~~~
hunvreus
Fixed. Thanks for the tip and sorry for the annoyance. That was coming
straight from the Jekyll boilerplate I used: Foundation
(<https://github.com/Wiredcraft/foundation>).

------
drew510
the crontab business is unnecessary. ls /var/spool/cron?

other than that, I learned some new tools - ss is pretty awesome!

------
ptman
df -i
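
To spell that out a bit: a filesystem can run out of inodes while still showing free space. A small filter over the output, assuming GNU df's `-i` column layout (Filesystem, Inodes, IUsed, IFree, IUse%, Mounted on); the name `inode_hogs` and the 90% default are arbitrary.

```shell
# Read `df -iP` output on stdin and print mount points whose inode usage
# is at or above a limit. Filesystems that report "-" for IUse% are skipped.
# Caveat: mount points containing spaces are not handled.
inode_hogs() {
    awk -v limit="${1:-90}" '
        NR > 1 && $5 ~ /^[0-9]+%$/ {
            pct = $5
            sub(/%/, "", pct)
            if (pct + 0 >= limit) print $6, $5
        }'
}

# Example:
#   df -iP | inode_hogs 90
```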

------
tlarkworthy
network cable light

------
martinced
All these commands are very fine and handy but...

Typically they're better when you can put them in context: you _should_ run
all these regularly on fully working servers which you know are operating
normally so you have "something" you can compare your results to when the shit
hits the fan.
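
One way to get that "something" is to save a snapshot while the box is healthy and diff against it later. This is only a sketch: the function names, the choice of facts to record and the file paths are all assumptions, and the snapshot part is Linux-specific because it reads /proc/mounts.

```shell
# Hypothetical helper: print a few slow-changing facts about this host.
snapshot() {
    echo "kernel $(uname -r)"
    awk '{ print "mount", $2, $3 }' /proc/mounts | sort
    ps -eo comm= | sort -u | sed 's/^/proc /'
}

# Show what differs between a saved baseline and a fresh snapshot.
# Exit status follows diff: 0 if identical, 1 if something changed.
compare_to_baseline() {
    diff "$1" "$2"
}

# On a good day:
#   snapshot > /var/tmp/baseline.txt
# When things look wrong:
#   snapshot > /tmp/now.txt
#   compare_to_baseline /var/tmp/baseline.txt /tmp/now.txt
```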

