Hacker News
First 5 Minutes Troubleshooting A Server (devo.ps)
244 points by balou on Mar 12, 2013 | hide | past | favorite | 41 comments

That's what I often do too, but it's clear that most of these tasks could be done automatically, so somebody should be doing a 'linux-doctor' open source project that tries to identify issues automatically. I'm assuming it doesn't exist; at least I've never seen it.

If you're able to, it's a good idea to have a monitoring system like Nagios/OpsView in place that checks most of these things for you proactively. That way you can have a good idea of the health of a server before you even connect to it. With such a system in place you'll often spot problems before your customers do too.

DataDog[1] + PagerDuty[2] do a good job at being linux-doctor. You have to be pro-active about it though...

[1] http://www.datadoghq.com/

[2] http://www.pagerduty.com/

Data Dog is pretty fantastic really. We love it at devo.ps.

Years ago, I remember reading about a backend system (at Facebook, I think) that developers building applications could hook into to automate some failure scenarios. I haven't found the old link yet.

I wrote a simple bash script which is a good starting point for checking server issues.


The idea isn't to tell you the problem exactly, but more to stop you from missing things that are obviously wrong.
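For illustration, a minimal sketch of what such a starting-point script might check (a hypothetical reconstruction in that spirit, not the poster's actual script):

```shell
#!/bin/sh
# Quick-look script: surfaces obviously wrong things,
# it does not try to diagnose the root cause.

echo "== load =="
uptime

echo "== disk space / inodes =="
df -h
df -i

echo "== memory =="
free -m

echo "== recent kernel messages =="
dmesg | tail -n 20

echo "== top memory consumers =="
ps aux --sort=-%mem | head -n 6
```

Each section is something a human would otherwise eyeball by hand; piping the whole thing through `less` keeps it skimmable.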

I wish I had recorded some of my better moments when I worked web hosting tech support. Being able to jump onto a box, poke around at two or three things, notice something wrong from that, and come up with a solution was magic. It's hard to believe unless you actually see it happen, though.

I have the recording capability now (and, more importantly, playback too), but the constant influx of broken boxes is gone. Funny how that works.

What do you use for recording and playback?

Recording: 'script' from util-linux, with a custom patch or two to make it less broken. http://rachelbythebay.com/w/2013/03/02/bugfix/

Playback: a fork of term.js derived from the jslinux terminal with some minor adjustments, plus a wrapper of my own which plays back the data with timing intact. http://rachelbythebay.com/w/2013/03/04/jvt/

I've been using them to demonstrate how I go about writing certain bits of code, with the goal of eventually showing the creation of something larger.

Script with saved timestamps is a godsend.

Particularly useful when you're accessing via a serial console, as you'll get full boot from BIOS. You can both log and troubleshoot based on output.

Playback utilities can take a multiplier to speed (or slow) playback from realtime.
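With stock util-linux (i.e. without the custom patches mentioned above), the record/replay workflow looks roughly like this:

```shell
# Record a session along with timing data.
script -t 2> timing.log session.log

# ... do the troubleshooting, then 'exit' to stop recording ...

# Replay at the original speed, delays intact.
scriptreplay timing.log session.log

# Replay twice as fast: the trailing divisor scales down the delays.
scriptreplay timing.log session.log 2
```

Newer util-linux versions also accept `script --timing=timing.log session.log`, which does the same thing without the stderr redirect.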

This is pretty similar (even we start with the same two commands) to what I typically do in my checklist :-)


One of the most useful pieces of information for me when troubleshooting is network activity. Both network monitors and traffic flows can often tell you exactly what the problem is so you don't have to spend five minutes collecting data samples.

A ring-buffered dumpcap is a huge time saver. It narrows the scope of troubleshooting quickly and reliably. Start by looking at what actually happened on the wire for problem X, then use the rest of the tools/logs etc. to determine why it happened that way.
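For reference, a typical ring-buffer invocation looks something like this (interface name and output path are examples, adjust to your box):

```shell
# Continuous capture with dumpcap (ships with Wireshark).
# Keeps 10 files of ~100 MB each (-b filesize is in kB),
# overwriting the oldest, so the last stretch of wire traffic
# is always on disk when an incident happens.
dumpcap -i eth0 \
        -b filesize:102400 \
        -b files:10 \
        -w /var/tmp/ring.pcapng
```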

Personally, I think that top and vmstat are a bit low on the list; they're typically the first things that I run. While they're too general to provide good troubleshooting of the actual problem, they do a great job of pointing out to me where the problem probably is.

User experience reports are nice, but rarely indicate something other than "load is high", or "server is unresponsive". vmstat and top not only can tell you that, they can start telling you why and where to look for your problem.
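As a concrete first pass, a few one-second samples are usually enough to point at the right subsystem:

```shell
# Five one-second samples; the first line is averages since boot,
# so look at the later lines.
vmstat 1 5
# Columns worth a first glance:
#   r     - runnable processes (sustained > #CPUs => CPU contention)
#   b     - processes blocked on I/O
#   si/so - swap in/out (sustained nonzero => memory pressure)
#   wa    - % CPU time spent waiting on I/O
```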

I've written a command (that we're still playing around with here) to take a 'snapshot' of things that a server is doing (many of the things mentioned in this article). This can allow you to look at it later to see what's going on.


One key thing missing imo. "history" can be a huge help. Even more so if there are multiple admins on call.

In a well configured environment, "history" isn't going to be terribly useful -- admins should be using their own accounts and using sudo to execute commands as root.

In this sort of environment you want to look at your pam logs (/var/log/auth.log on debian-like distros, and iirc, /var/log/secure on RH-like). This gives you not only a list of what was run, but who ran it and when.
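For example, on a Debian-like box, something along these lines pulls out recent sudo activity (the pattern matches the stock sudo syslog format, which records user, PWD, target user and command):

```shell
# Who ran what via sudo, most recent last.
# On RH-likes, substitute /var/log/secure for auth.log.
grep 'sudo.*COMMAND' /var/log/auth.log | tail -n 20
```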

LOL! My `history` says I use only `cd`, `ls` and `git`.

Shame for me ...

http://imgur.com/CBTrdJw [server linux box] http://imgur.com/tEjBtNY [development os x box]

This is indeed a huge help; the article is now updated.

I'm pretty sure there is also a whole bunch of other (less famous) commands that would and should be added.

And the infamous reboot?

A reboot is the last thing you should be doing on a problematic box. A reboot should only be attempted if you're 100% sure that the problem can actually be solved by a reboot (for example in extreme cases a buggy kernel module might require a reboot, but even that's a very rare occurrence).

I've seen inexperienced sysadmins remotely rebooting a server only to discover that it won't boot up anymore because the original problem was that the server had run out of disk space.

Yes, that's why I think the article should mention reboot.

That's a valid point; mention that one should not reboot until proven otherwise.

Same idea for service restarts; don't do it unless absolutely needed. While it may do the trick in some cases, it can also generate its own set of new false symptoms.

Take this example:

- mysql is "slow"

- let's restart ! ...

- mysql init script has been puking dots (.) for the last 30 minutes on shutdown

- let's kill -9 it ! ...

- mysql db is corrupted ! Hurray!

Doing "ps auxf" puts "ps aux" in forest mode, giving you the same effect as "pstree -a" while keeping all the information that "ps aux" gives you.
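Side by side, for comparison:

```shell
# Forest view: same columns as 'ps aux', but children are
# indented under their parents in the COMMAND column.
ps auxf | head -n 20

# Tree-only view, with command arguments but no resource columns.
pstree -a | head -n 20
```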

Something that's been missed: atop. It's similar to top/htop and friends, but has an additional system-process snapshot daemon, allowing you to answer the perennial question of how a server got into a particular state. Particularly useful are the wide view (see here: http://www.atoptool.nl/screenshots.php) and the average disk response times in ms.

What I found interesting while reading this article was the parallel to what a doctor does when diagnosing problems.

In medicine, it's commonly known that the interview with the patient (the 'history') is the first thing a doctor should be doing. Not just because it establishes a relationship with the patient, but because the diagnosis of most illnesses is guided primarily by the history [1] - even with modern MRI machines and DNA amplification techniques! At the very least the chat with the client provides context for the problem that you are investigating - you are now putting flesh on a skeleton of meaning rather than trying to create it on your own.

This article stresses the importance of first getting a verbal 'history' from the client - what the problem is, characteristics of the problem, time-course of the problem and co-incidence with other events (like software upgrades). There is also a parallel to medicine in that in this field a skilled practitioner may be able to diagnose the problem based solely on the history alone [2].

The second thing I noticed was the fault-finding mindset. As a medical student halfway through his second year of hospital placements this is something I took some time to learn. The initial approach to finding the reason for a problem is usually to (1)think of a possible reason for the problem, (2)try to fix that reason, and (3)if that doesn't work, goto 1. While this is a good because it shows you are actually thinking about the cause of the problem rather than its effects, it's not the most efficient way of going about things. One way doctors can narrow down problems is by restricting them to systems such as the cardiovascular system or the neurological system. A searing pain in your chest is more likely to be due to a problem with your heart or lungs than due to a problem with your kidneys or gonads.

This article takes exactly the same view of servers, classifying the individual hardware and software components that make up the vast majority of (linux) servers in the wild.

I don't fiddle around with servers much any more, but I'm bookmarking this page because it is such a useful illustration of a fault-finding mentality.

[1] http://archinte.jamanetwork.com/article.aspx?articleid=11058... [2] http://blogs.msdn.com/b/oldnewthing/archive/2012/08/29/10344...

I always cringe when I see shell code like this:

  cat /etc/passwd | cut -f1 -d:

Usually this comes as:

  ps -ef | grep something | grep -v grep | grep -v $myownpid

Why not use one simple and concise awk statement which does it all in one go?

  awk -F: '{print $1}' /etc/passwd
  ps -ef | awk '/[h]ttpd/{print $2}'

But apart from that: very nice summary of things to consider and the sequence for analysis.

> why not use one simple and concise awk statement which does it all in one go?
>
>   awk -F: '{print $1}' /etc/passwd
>   ps -ef | awk '/[h]ttpd/{print $2}'

IMO it's easier to type the "grep/grep -v" chains than the awk ones. Usually we are more used to grep, and we just add pipes and filters until we get the expected result.

I'm using zsh, though, and thanks to its global aliases I can do:

  ps -ef G something GV grep GV ownpid
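For reference, a minimal sketch of how such global aliases might be defined in ~/.zshrc (the alias names here mirror the ones in the comment, but any names work):

```shell
# zsh global aliases expand anywhere on the command line,
# not just in command position, so they can stand in for
# entire pipe stages.
alias -g G='| grep'
alias -g GV='| grep -v'

# After that,
#   ps -ef G something GV grep GV $$
# expands to:
#   ps -ef | grep something | grep -v grep | grep -v $$
```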

On OSX (even though the HN title says Linux, the original article title doesn't say "Linux" and many of the same steps apply to OSX server), you can't run dmesg as a normal user -- you have to run it as root or via sudo.

It's a nice checklist. I personally would do dmesg and log-checks right at the beginning. Also checking sw-raid and drives is missing (cat /proc/mdstat, hdparm, fdisk -l, smartctl ...)
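The RAID and drive checks mentioned there look roughly like this (device names are examples; smartctl comes from the smartmontools package):

```shell
# Software RAID state: a degraded array shows [U_] instead of [UU].
cat /proc/mdstat

# Partition layout of all attached disks (needs root).
fdisk -l

# SMART overall health verdict for one drive.
smartctl -H /dev/sda
```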

Quite a good checklist/advice on server troubleshooting. What I mostly miss in the list are network tools such as telnet/dig/ip/fping/mtr.

vmstat and iostat are also useful for tracking memory issues.

sysstat ('sar') reporting can also provide some of that much-needed history. Sar output is pretty readily visualized with utilities such as gnuplot.
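As a sketch of that workflow (the sa file path is the Debian default, /var/log/sa/saDD on RH-likes; the day number 12 is an example):

```shell
# CPU usage history for the 12th, from sysstat's collected data.
sar -u -f /var/log/sysstat/sa12

# Crude export for plotting: skip the three header lines and
# keep timestamp plus the last column (%idle).
sar -u -f /var/log/sysstat/sa12 | awk 'NR>3 {print $1, $NF}' > cpu.dat
```

From there, cpu.dat is a plain two-column file that gnuplot (or any plotting tool) can consume directly.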

Yet another site that disables pinch zoom for iOS devices. Pointless...

Not sure why it is so: I'll look into it today.

Take the maximum-scale out of your meta viewport tag.

<meta name='viewport' content='width=device-width, initial-scale=1, maximum-scale=1'/>

Fixed. Thanks for the tip and sorry for the annoyance. That was coming straight from the Jekyll boilerplate I used: Foundation (https://github.com/Wiredcraft/foundation).

the crontab business is unnecessary. ls /var/spool/cron?

other than that, I learned some new tools - ss is pretty awesome!
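A couple of typical ss invocations, for the curious (ss ships with iproute2 and is a faster replacement for netstat):

```shell
# Listening TCP sockets, numeric addresses, with owning process
# (the -p column needs root to show other users' processes).
ss -tlnp

# Count of TCP connections per state: handy for spotting
# e.g. a pileup of CLOSE-WAIT or SYN-RECV sockets.
ss -tan | awk 'NR>1 {print $1}' | sort | uniq -c
```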

df -i

network cable light

All these commands are very fine and handy but...

Typically they're better when you can put them in context: you should run all these regularly on fully working servers which you know are operating normally so you have "something" you can compare your results to when the shit hits the fan.
