
My favorite debugging tool - Isofarro
http://us1.campaign-archive1.com/?u=ba9c5a596f88fa86026dd89a2&id=01a7043a5b&e=35d9c2c5b6
======
gtirloni
I love strace, DTrace, Procmon, and all these tools. Nothing gives me more
pleasure than being called in to solve a mystery issue and ending up spending
time checking syscalls, timing, etc.

That being said, that's the worst debugging experience your company can have
because it outlines a bigger problem: your monitoring sucks.

In the scenario presented by this article, better monitoring of the database
layer could have allowed the sysadmin to connect the dots much faster and find
the root cause. So the DB might be up but spilling the wrong beans? You may
need some functional monitoring checking what it's outputting from time to
time.
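
A minimal sketch of such a functional check, using sqlite3 purely as a
stand-in for the real database (the table name, sentinel value, and paths are
all made up for illustration; a real setup would hit mysql/psql over the
network):

```shell
#!/bin/sh
# Functional DB check: read back a known sentinel row and compare it to
# the expected value, so the alert fires when the database answers but
# answers wrongly. sqlite3 stands in for mysql/psql here.
command -v sqlite3 >/dev/null || { echo "sqlite3 not installed, skipping"; exit 0; }

DB=/tmp/health.db
EXPECTED=ok

# Seed the sentinel row (normally done once at deploy time).
sqlite3 "$DB" "CREATE TABLE IF NOT EXISTS health(k TEXT PRIMARY KEY, v TEXT);
               INSERT OR REPLACE INTO health VALUES('sentinel','ok');"

ACTUAL=$(sqlite3 "$DB" "SELECT v FROM health WHERE k='sentinel';")
if [ "$ACTUAL" = "$EXPECTED" ]; then
    echo "DB functional check: OK"
else
    echo "DB functional check: FAILED (got '$ACTUAL')" >&2
    exit 1
fi
```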

I absolutely love low-level debugging tools but using them on a daily basis
doesn't scale. I do use them almost every single day, because other teams
failed to properly monitor and understand their systems. That doesn't mean
it's the most efficient way to go about this.

~~~
gizmo686
The nice thing about low-level debugging tools is that they're a quick way to
get a sense of the problem before you have to start thinking hard about it. If
you find yourself staring at their output trying to figure out what the
problem is, you may be doing something wrong.

------
brendangregg
strace can slow the target by up to 200x
([http://www.slideshare.net/brendangregg/linux-performance-analysis-and-tools/38](http://www.slideshare.net/brendangregg/linux-performance-analysis-and-tools/38)).
While you can solve plenty of issues with it, you have to be very careful
about its use in production. In many cases, it just can't be used.
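
One hedge, when strace really is the only option on a production box, is to
bound its exposure: attach briefly, take the aggregate summary rather than the
full call stream, and detach. A sketch (the PID is a placeholder):

```shell
#!/bin/sh
# Bound strace's production impact: -c prints an aggregate table of
# syscall counts and latencies instead of streaming every call, and
# timeout detaches after 5 seconds. PID is a placeholder.
command -v strace >/dev/null || { echo "strace not installed, skipping"; exit 0; }
PID=1234
timeout 5 strace -c -p "$PID" || true
# Where available, perf trace does similar with much lower overhead:
#   timeout 5 perf trace -p "$PID"
```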

perf added a "trace" subcommand in 3.7, for buffered syscall tracing
([http://www.brendangregg.com/perf.html#More](http://www.brendangregg.com/perf.html#More)).
This should eventually be like strace, but with MUCH less performance
overhead.

sysdig is a newer tool that uses a similar low-overhead interface, and
provides an expressive syntax. It's still early days, but sysdig could
ultimately replace strace (or perf will via trace).

~~~
lambda
Sure, there are going to be a few performance sensitive workloads that you
can't use it on, but for the vast majority of issues I've used it for, it's
been far more useful than perf or dtruss (on Mac OS X, I've never used dtruss
on Solaris so I don't know if it's in better shape).

The big advantage is that it already knows how to decode most syscalls,
decompose flags into symbolic arguments, decode things like stat return
values, and so on. dtruss on Mac OS X can decode some syscalls, but doesn't
generally know how to decode structures or pull apart flags, so there's some
information you just can't see, and some which you have to laboriously
manually decode. perf, as you note, decodes even less than dtruss. When I'm
trying to debug an issue, the more information I can gather and see at once,
the better.

Very few of the occasions I've had to use these tools have been particularly
performance sensitive. Your example of the huge difference is because dd is
copying fairly small blocks at a time, and thus making a fairly high number of
system calls relative to the amount of data being transferred. Usually, I'm
just looking for why an operation fails, or why a process is stuck; if it
takes a little longer to get to that failure, that's OK, and if the process is
stuck then stracing it generally won't affect performance.
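
That dd effect is easy to reproduce: the same amount of data copied with small
vs. large blocks differs by orders of magnitude in syscall count, which is
exactly where strace's per-call overhead bites. A quick sketch:

```shell
#!/bin/sh
# Copy the same 1 MiB twice: once in 512-byte blocks (~2048 read/write
# pairs), once in a single 1 MiB block. strace -c tallies the calls.
command -v strace >/dev/null || { echo "strace not installed, skipping"; exit 0; }
strace -c -o small.txt dd if=/dev/zero of=/dev/null bs=512 count=2048 2>/dev/null
strace -c -o large.txt dd if=/dev/zero of=/dev/null bs=1M count=1 2>/dev/null
grep -E '\b(read|write)\b' small.txt large.txt
```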

sysdig looks pretty interesting, thanks for the reference! It's too bad they
can't use the existing perf infrastructure in the kernel, and need to add
their own kernel module; that makes it that much more cumbersome to deploy and
use.

------
spydum
strace, and tcpdump. It's really a toss-up between them as to which I have
solved more problems with. I was sort of rooting for tcpdump before reading
the article.

------
Theodores
+1 for recommending strace.

In discussions with others about its merits, I have had people show some
better tool that works perfectly in their test-case dev environment on their
local machine, possibly with fancy graphics. So then the question is 'why is
your site so *%^%% slow then!!!' - what is it doing!!!

Put strace on there (I think you can do a local wget from the server and hook
it to that, so no need for any setup) and sure enough it will tell you what it
is doing. Sure the output is verbose, but there are command line switches for
that. If that verbose output shows it is looking up some db value 10000 times
just to show a web page then you can see that someone hasn't written the code
too well. This might not be easy to spot in the code, even if it is the
neatest code you ever saw; however, nothing is hidden in strace.
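
Spotting the "10000 lookups for one page" pattern is mostly a matter of
aggregating the log. A sketch, run here against a mocked-up strace log so it
works anywhere (a real log would come from `strace -o app.log -p <pid>`):

```shell
#!/bin/sh
# Mock a few lines of strace-style output, then tally which calls repeat.
cat > app.log <<'EOF'
sendto(5, "SELECT value FROM config WHERE id=1", 35, 0, NULL, 0) = 35
sendto(5, "SELECT value FROM config WHERE id=1", 35, 0, NULL, 0) = 35
openat(AT_FDCWD, "/etc/hosts", O_RDONLY) = 3
EOF

# Strip the arguments, keep just the syscall name, and count duplicates;
# the repeated DB query floats to the top.
sed 's/(.*//' app.log | sort | uniq -c | sort -rn
```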

Because disk operations and db read/writes are time expensive they stick out
like a sore thumb in the strace log. Normally you need to run strace a few
times to see whether caching is working as it should or if these lengthy calls
will need some code refactoring.

All considered though, I have only used strace when things have got very
desperate. Factors that have contributed to the emergency include developers
that do not have any interest whatsoever in performance, lame hosting
environments bought by people that have no understanding of the requirements
and insufficient 'rush job' testing where there is no testing, just some
deadline. strace succeeds where 'the proper way' hasn't been set up yet or
just isn't useful.

------
ozh
Just wondering how many people will click the "Unsubscribe" link at the
bottom...

------
dlecorfec
+1 for strace (and tcpdump)

Going a bit further: in case your API is down, maybe you could have a script
that grabs an strace (and maybe a gdb backtrace, see
[http://poormansprofiler.org/](http://poormansprofiler.org/)) of the usual
suspect and attaches that to the alert.
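
The poor man's profiler linked above boils down to a loop like the following
sketch: sample gdb backtraces of the suspect process a few times and count
which frames dominate (the PID and the awk field choice are rough
placeholders; real gdb frame lines vary in format):

```shell
#!/bin/sh
# Poor man's profiler (after poormansprofiler.org): repeatedly grab
# all-thread backtraces with gdb and tally the hottest frames.
command -v gdb >/dev/null || { echo "gdb not installed, skipping"; exit 0; }
PID=1234   # placeholder for the suspect process
for i in 1 2 3 4 5; do
    gdb -batch -ex "thread apply all bt" -p "$PID" 2>/dev/null
    sleep 0.5
done |
awk '/^#/ { print $4 }' | sort | uniq -c | sort -rn | head
```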

------
diggan
It's a good one. I have a snippet on Gist that collects all the workers for
debugging, instead of rushing to capture the PID and then starting strace.
Have a look:
[https://gist.github.com/VictorBjelkholm/df475e61457dabbf9d47](https://gist.github.com/VictorBjelkholm/df475e61457dabbf9d47)

------
udioron
Nice trick for identifying bottlenecks. If you want to leverage your
"debugging" skills, try this wonderful course:
[https://www.udacity.com/course/cs259](https://www.udacity.com/course/cs259)

------
pieterza
+1 for strace, but your problem is lack of monitoring. You should have at
least one monitoring script doing a select on a known value which you compare
against.

------
eridius
> foobared

You're looking for FUBARed, which means "fucked up beyond all recognition".
"foobared" has no meaning.

~~~
SAI_Peregrinus
While the original phrase was FUBARed, the common use of "foo" and "bar" as
placeholder variables has led to programmers using foobar instead. Of course
foo and bar come from FUBAR, just using 3 letters for each.

~~~
eridius
"foo" and "bar" are metasyntactic variables that have nothing whatsoever to do
with the meaning of FUBAR. Using "foobar" to mean FUBAR is utter nonsense, and
serves only to show that the writer has no idea what FUBAR is and is spelling
it phonetically. Which is to say, they're typing the word wrong.

~~~
acqq
"foo" and "bar" of course have _a lot_ to do with the meaning of FUBAR;
there's even an RFC(!) about that:

[http://www.ietf.org/rfc/rfc3092.txt](http://www.ietf.org/rfc/rfc3092.txt)

~~~
eridius
> 1 April 2001

This is an April Fools' joke (one that significantly postdates the rise of
"foo" and "bar").

~~~
acqq
Which still doesn't mean the material isn't well researched. Nobody has
refuted the historical references so far. In short, "foo" and "bar" didn't
"just fall to Earth"; they were written by people who were surrounded by "FUBAR."

Which again means that it's better if you don't consider them as the innocent
"no meaning" variables. They have as much "no meaning" as WTF means "worse
than failure."

~~~
eridius
The jargon file, which is what that RFC references, provides no evidence for
its claim that "foo" in conjunction with "bar" has generally been traced to
FUBAR. It's a plausible-sounding claim, because of pronunciation, but that's
by no means definitive. In addition, it freely admits that "foo" predates
FUBAR.

The entry on "foobar" (as opposed to "foo") also stresses that "foobar" does
not generally reference FUBAR:

> Hackers do _not_ generally use this to mean FUBAR in either the slang or
> jargon sense.

And once again, it suggests that "foobar" may have been spread among early
engineers partly because of FUBAR, but it provides no evidence for this
suggestion aside from its apparent plausibility.

So yeah, ok, it's plausible that the popularity of "foobar" was boosted by the
acronym FUBAR. But regardless of the truth of that, the _meaning_ of "foobar"
has nothing to do with the meaning of FUBAR. It's a metasyntactic variable,
and that's it.

~~~
acqq
If you still believe in non-relatedness then you certainly believe that when
somebody names the variable SFU in his source examples, he could have been
just thinking of Microsoft"s "Services For Unix"

"[http://en.wikipedia.org/wiki/Windows_Services_for_UNIX](http://en.wikipedia.org/wiki/Windows_Services_for_UNIX)

Or "nothing, just metasyntatic" but absolutely no chance that he have been
thinking about

[http://en.wiktionary.org/wiki/STFU](http://en.wiktionary.org/wiki/STFU)

I've already mentioned another example:

[http://thedailywtf.com/](http://thedailywtf.com/)

Initially it really meant what the internet would first think it meant.

[http://en.wiktionary.org/wiki/WTF](http://en.wiktionary.org/wiki/WTF)

As the site got more popular, it was necessary to make the name less offensive
for the uninitiated, so the "worse than failure" tag was invented much later.

[http://thedailywtf.com/Articles/The-Worse-Than-Failure-Programming-Contest.aspx](http://thedailywtf.com/Articles/The-Worse-Than-Failure-Programming-Contest.aspx)

Just "metasyntactic." Three nice letters. Don't try to associate them with
anything rude. We're all polite here.

~~~
eridius
> when somebody names the variable SFU in his source examples

I have never seen SFU used as a variable name, whether in source examples or
in production code. And were I to see that, I would never even consider the
possibility that it's trying to reference the acronym STFU. Because that
doesn't make sense. Googling for "SFU" as slang comes up with a reference
suggesting that it could potentially mean "So Fucked Up", which I'll grant
uses the same swear word but otherwise has a rather different meaning than
STFU. But that does not seem like common usage. And when not explicitly
looking for it as slang (after all, source code variable names are typically
not slang), it is a lot more likely to refer to many other things, including
various universities and Services for UNIX.

As near as I can tell, you're basically just making stuff up at this point.

Also, I don't think you understand what "metasyntactic" means. WTF in
TheDailyWTF is not "metasyntactic". Neither is naming a variable "SFU",
regardless of what you intended to convey with that name. A metasyntactic
variable, by definition, is a placeholder name _without meaning_ that is
intended to be substituted with some other context-appropriate thing. "foo" is
the most common such variable, "bar" is the second most common. You cannot
simultaneously claim something is metasyntactic and also claim that it has
meaning.

------
just_observing
"My site is DOWN, and it's down hard. Requests are timing out."

A bit OTT on the drama there.

------
t1m0sh3nk0
gdb is the best

~~~
randyrand
The UI is pretty terrible though, in my opinion. Registers should always be
visible so that you can watch them change without having to print them after
every command.

~~~
AYBABTME
Launch GDB with the -tui flag.

[https://sourceware.org/gdb/onlinedocs/gdb/TUI.html](https://sourceware.org/gdb/onlinedocs/gdb/TUI.html)

~~~
demallien
Or press C-x a after launching GDB. Follow that with "layout regs" and
"layout split" and you have a pretty decent debugger.

------
dkarapetyan
What? He could have just as easily looked at the database monitoring and
figured out that it was down. Unless he doesn't have database monitoring, in
which case using the worker process itself as a database monitor is an
extremely poor substitute. Furthermore, have you ever seen the output from
strace? It spits stuff out nonstop. He could have very easily missed the call
to the database. Or maybe the call is supposed to take a few seconds, and then
the fact that it is waiting on that call tells you absolutely nothing.

This is a very poor reason for learning how to use strace. There are simpler
tools for everything he mentioned.

~~~
lambda
Except, you don't necessarily know that it's the database that's the problem.
It might be trying to do a DNS query, and that's timing out. Or spinning in a
loop trying to read from a socket that has no data available. Or blocking on a
filesystem operation on a networked filesystem that has become disconnected.
Or any number of other things.

The nice thing about strace is that it pretty quickly tells you exactly what
the program in question is doing with the outside world, which many times can
provide you the quick clue you need about what's going on.

strace is a pretty simple tool, at its most basic usage it's just "strace
<pid>", and the nice thing is that it's pretty universal; it comes in handy
many, many times, for solving many different kinds of problems. Have some
third party code that seems to exhibit different behavior on different
machines, but don't know what is different about its environment that's making
it behave differently? strace it and see what files it opens for configuration
data. Your program not making any progress, and it doesn't seem to be CPU
bound as it's only using a few percent of CPU? strace it and see what system
calls it's blocking on.
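
The config-file trick in particular is nearly a one-liner: restrict the trace
to the open family of syscalls and every path the program touches shows up. A
sketch, using `ls` purely as a stand-in for the third-party binary you'd
actually examine:

```shell
#!/bin/sh
# Which files does a program open? Trace just open/openat; every config
# file, shared library, and data file it touches shows up by path.
command -v strace >/dev/null || { echo "strace not installed, skipping"; exit 0; }
strace -f -e trace=open,openat -o opens.log ls / >/dev/null
grep -E '/etc/|\.(conf|cfg|ini)' opens.log || echo "no config-like paths opened"
```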

Note that for this kind of use case, where the process is hanging, you
generally don't get way more output than you can handle; the fact that it's
hanging on this particular call means that that's the call that you will see
at the end of the output for several seconds until it times out.

It's certainly not the only tool in your arsenal; there will be other cases
where more specific tools are more appropriate. But it's a really useful
general purpose tool.

My general purpose debugging arsenal, when I need to debug something that I
haven't written, generally consists of tcpdump/wireshark to capture and
analyze any network traffic, strace to see how the program is interacting with
the world outside of it, turning up logging on any relevant systems to their
maximum value and looking through that output, and then finally if all else
fails attaching to it with GDB and stepping through the code.

Sure, it's always better to have more specific tools, to have turned on
appropriate monitoring, and so on. But you don't always have that luxury.
Sometimes more specific tools haven't been written. Sometimes your monitoring
may find that your database is up, as it is accepting connections, but it's
actually hanging when you try to do anything, or it's succeeding for reads but
hanging on writes, or the like. Monitoring is never going to be as thorough as
you'd like; there will always be something that you miss, and so you need
general-purpose tools that can be used in any situation to get a handle on
what is going on quickly so that you can narrow down the problem.

