My favorite debugging tool (campaign-archive1.com)
64 points by Isofarro on May 5, 2014 | 37 comments



I love strace, DTrace, Procmon, and all these tools. Nothing gives me more pleasure than when I'm called in to solve a mystery issue and end up spending time checking syscalls, timing, etc.

That being said, that's the worst debugging experience your company can have, because it points to a bigger problem: your monitoring sucks.

In the scenario presented by this article, better monitoring of the database layer could have allowed the sysadmin to connect the dots much faster and find the root cause. So the DB might be up but spilling the wrong beans? You may need some functional monitoring that checks what it's returning from time to time.

I absolutely love low-level debugging tools but using them on a daily basis doesn't scale. I do use them almost every single day, because other teams failed to properly monitor and understand their systems. That doesn't mean it's the most efficient way to go about this.


The nice thing about low-level debugging tools is that they are a quick way to get a sense of the problem before you need to start thinking about it. If you are staring at their output trying to figure out what the problem is, you may be doing something wrong.


strace can slow the target by up to 200x (http://www.slideshare.net/brendangregg/linux-performance-ana...). While you can solve plenty of issues with it, you have to be very careful about its use in production. In many cases, it just can't be used.

perf added a "trace" subcommand in 3.7, for buffered syscall tracing (http://www.brendangregg.com/perf.html#More). This should eventually be like strace, but with MUCH less performance overhead.
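
A minimal taste of that, as a sketch (the PID is a placeholder; flag availability varies by kernel and perf version):

  perf trace -p 1234        # live syscall trace of PID 1234, buffered so the overhead stays low
  perf trace -s -p 1234     # per-thread summary of syscall counts and times instead of every event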

sysdig is a newer tool that uses a similar low-overhead interface, and provides an expressive syntax. It's still early days, but sysdig could ultimately replace strace (or perf will via trace).
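
A hedged taste of that syntax (the process name is just an example):

  sysdig proc.name=postgres and evt.type=open    # every open() made by postgres
  sysdig -c topfiles_time                        # chisel: files where the most I/O time is spent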


Sure, there are going to be a few performance-sensitive workloads that you can't use it on, but for the vast majority of issues I've used it for, it's been far more useful than perf or dtruss (on Mac OS X; I've never used dtruss on Solaris, so I don't know if it's in better shape there).

The big advantage is that it already knows how to decode most syscalls, decompose flags into symbolic arguments, decode things like stat return values, and so on. dtruss on Mac OS X can decode some syscalls, but doesn't generally know how to decode structures or pull apart flags, so there's some information you just can't see, and some which you have to laboriously manually decode. perf, as you note, decodes even less than dtruss. When I'm trying to debug an issue, the more information I can gather and see at once, the better.
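
To make that concrete, strace output lines look roughly like this (illustrative; exact flags and fields vary by system):

  openat(AT_FDCWD, "/etc/hosts", O_RDONLY|O_CLOEXEC) = 3
  stat("/tmp", {st_mode=S_IFDIR|S_ISVTX|0777, st_size=4096, ...}) = 0

The flags and struct members arrive already symbolic, instead of raw integers you'd have to decode by hand.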

Very few of the occasions I've had to use these tools have been particularly performance sensitive. Your example of the huge difference is because dd is copying fairly small blocks at a time, and thus making a fairly high number of system calls relative to the amount of data being transferred. Usually, I'm just looking for why an operation fails, or why a process is stuck; if it takes a little longer to get to that failure, that's OK, and if the process is stuck then stracing it generally won't affect performance.
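
If you want to see that effect for yourself, something like this works (numbers will vary wildly by machine; it only contrasts syscall rates):

  dd if=/dev/zero of=/dev/null bs=512 count=1000000             # many small syscalls, fast
  strace -c dd if=/dev/zero of=/dev/null bs=512 count=1000000   # same workload traced: dramatically slower
  strace -c dd if=/dev/zero of=/dev/null bs=1M count=500        # bigger blocks, far fewer syscalls: the overhead mostly disappears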

sysdig looks pretty interesting, thanks for the reference! It's too bad they can't use the existing perf infrastructure in the kernel, and need to add their own kernel module; that makes it that much more cumbersome to deploy and use.


strace, and tcpdump. It's really a toss-up between them as to which I have solved more problems with. I was sort of rooting for tcpdump before reading the article.


+1 for recommending strace.

In discussions with others about its merits, I have had people show me some better tool that works perfectly in their test-case dev environment on their local machine, possibly with fancy graphics. So then the question is 'why is your site so *%^%% slow then!!!' - what is it doing!!!

Put strace on there (I think you can do a local wget from the server and hook it to that, so no need for any setup) and sure enough it will tell you what it is doing. Sure, the output is verbose, but there are command-line switches for that. If that verbose output shows it is looking up some db value 10000 times just to show a web page, then you can see that someone hasn't written the code too well. This might not be easy to spot in the code, even if it is the neatest code you ever saw; nothing is hidden from strace, though.
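
Roughly like this (PID, URL, and the grep pattern are placeholders; the query text is only visible if the DB protocol sends it in the clear):

  strace -f -tt -s 256 -o /tmp/web.trace -p <worker-pid> &
  wget -qO /dev/null http://localhost/the-slow-page
  kill %1                          # stop the trace
  grep -c SELECT /tmp/web.trace    # crude count of db round-trips for that one page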

Because disk operations and db reads/writes are expensive in time, they stick out like a sore thumb in the strace log. Normally you need to run strace a few times to see whether caching is working as it should or whether these lengthy calls will need some code refactoring.
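
strace's timing switches make those stick out even more (the PID is a placeholder):

  strace -T -e trace=read,write,sendto,recvfrom -p <pid>   # append the time spent inside each call
  strace -c -p <pid>                                        # summary table: call counts and total time per syscall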

All considered though, I have only used strace when things have got very desperate. Factors that have contributed to the emergency include developers who have no interest whatsoever in performance, lame hosting environments bought by people who have no understanding of the requirements, and 'rush job' testing where there is no testing, just some deadline. strace succeeds where 'the proper way' hasn't been set up yet or just isn't useful.


Just wondering how many people will click the "Unsubscribe" link at the bottom...


+1 for strace (and tcpdump)

A bit better: in case your API is down, maybe you could have a script that grabs a strace (and maybe a gdb backtrace, see http://poormansprofiler.org/) of the usual suspects and attaches that to the alert.
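
The gdb part can be as small as a one-liner along these lines (a sketch; the PID and output path are placeholders):

  gdb -batch -ex 'set pagination 0' -ex 'thread apply all bt' -p <pid> > /tmp/backtrace.txt 2>&1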


It's a good one. I have a snippet on Gist that collects all the workers for debugging, instead of rushing to capture the PID and then starting strace. Have a look: https://gist.github.com/VictorBjelkholm/df475e61457dabbf9d47
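
Something in this spirit, as a rough sketch (the process name is just an example; the actual snippet may differ):

  strace -f -tt $(pgrep -f 'unicorn worker' | sed 's/^/-p /')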


Nice trick for identifying bottlenecks. If you want to leverage your "debugging" skills, try this wonderful course: https://www.udacity.com/course/cs259


+1 for strace, but your problem is lack of monitoring. You should have at least one monitoring script doing a SELECT on a known value, which you compare against.
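
A minimal sketch of such a check, assuming a MySQL database and a hypothetical canary table (Nagios-style exit codes):

  #!/bin/sh
  # Fail loudly if the known row doesn't come back with the expected value.
  got=$(mysql -N -e "SELECT value FROM canary WHERE id = 1" mydb 2>/dev/null)
  if [ "$got" != "expected-value" ]; then
      echo "CRITICAL: canary mismatch, got '$got'"
      exit 2
  fi
  echo "OK: canary value matches"
  exit 0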


> foobared

You're looking for FUBARed, which means "fucked up beyond all recognition". "foobared" has no meaning.


While the original phrase was FUBARed, the common use of "foo" and "bar" as placeholder variables has led to programmers using foobar instead. Of course foo and bar come from FUBAR, just using 3 letters for each.


"foo" and "bar" are metasyntactic variables that have nothing whatsoever to do with the meaning of FUBAR. Using "foobar" to mean FUBAR is utter nonsense, and serves only to show that the writer has no idea what FUBAR is and is spelling it phonetically. Which is to say, they're typing the word wrong.


"foo" and "bar" of course have a lot to do with the meaning of FUBAR, there's even a RFC(!) about that:

http://www.ietf.org/rfc/rfc3092.txt


> 1 April 2001

This is an April Fools' joke (one that significantly postdates the rise of "foo" and "bar").


Which still doesn't mean the material isn't well researched. Nobody has refuted the historical references so far. In short, "foo" and "bar" didn't "just fall to Earth"; they were coined by people who were surrounded by "FUBAR."

Which again means it's better if you don't consider them innocent, "no meaning" variables. They have about as much "no meaning" as WTF means "worse than failure."


The jargon file, which is what that RFC references, provides no evidence for its claim that "foo" in conjunction with "bar" has generally been traced to FUBAR. It's a plausible-sounding claim, because of pronunciation, but that's by no means definitive. In addition, it freely admits that "foo" predates FUBAR.

The entry on "foobar" (as opposed to "foo") also stresses that "foobar" does not generally reference FUBAR:

> Hackers do not generally use this to mean FUBAR in either the slang or jargon sense.

And once again, it suggests that "foobar" may have been spread among early engineers partly because of FUBAR, but it provides no evidence for this suggestion aside from its apparent plausibility.

So yeah, ok, it's plausible that the popularity of "foobar" was boosted by the acronym FUBAR. But regardless of the truth of that, the meaning of "foobar" has nothing to do with the meaning of FUBAR. It's a metasyntactic variable, and that's it.


If you still believe in non-relatedness, then you certainly believe that when somebody names a variable SFU in his source examples, he could have been just thinking of Microsoft's "Services For Unix":

"http://en.wikipedia.org/wiki/Windows_Services_for_UNIX

Or "nothing, just metasyntatic" but absolutely no chance that he have been thinking about

http://en.wiktionary.org/wiki/STFU

I've already mentioned another example:

http://thedailywtf.com/

Initially it really meant what the internet would first think it meant.

http://en.wiktionary.org/wiki/WTF

As the site got more popular, it was necessary to make the name less offensive for the uninitiated, so the "worse than failure" tag was invented much later.

http://thedailywtf.com/Articles/The-Worse-Than-Failure-Progr...

Just "metasyntatic." Three nice letters. Don't try to associate with anything rude. We're all polite here.


> when somebody names the variable SFU in his source examples

I have never seen SFU used as a variable name, whether in source examples or in production code. And were I to see that, I would never even consider the possibility that it's trying to reference the acronym STFU. Because that doesn't make sense. Googling for "SFU" as slang comes up with a reference suggesting that it could potentially mean "So Fucked Up", which I'll grant uses the same swear word but otherwise has a rather different meaning than STFU. But that does not seem like common usage. And when not explicitly looking for it as slang (after all, source code variable names are typically not slang), it is a lot more likely to refer to many other things, including various universities and Services for UNIX.

As near as I can tell, you're basically just making stuff up at this point.

Also, I don't think you understand what "metasyntactic" means. WTF in TheDailyWTF is not "metasyntactic". Neither is naming a variable "SFU", regardless of what you intended to convey with that name. A metasyntactic variable, by definition, is a placeholder name without meaning that is intended to be substituted with some other context-appropriate thing. "foo" is the most common such variable, "bar" is the second most common. You cannot simultaneously claim something is metasyntactic and also claim that it has meaning.


Does it matter?


For reading comprehension? Yes, absolutely.


No it doesn't. You can guess what they are talking about from the context. Whether a variable is named "foo" or "fu" makes no difference.


For a variable name, sure. "foo" is a metasyntactic variable, it's used as a placeholder for something else. If you want to name it "fu", I'll be mildly curious as to where you came up with that name, but from context I'll assume it's a placeholder name (if this is production code and not example code, then why are you trying to use placeholder names? In production code I'll assume it has some meaning that I am not aware of).

But that is entirely unrelated to trying to use the acronym FUBAR in English prose. I can figure out that when the author said "foobared" he really meant "FUBARed", because I can speak it aloud and figure out that the author was phonetically typing something. But that's an error-correction mechanism, the same mechanism used to try to discern the meaning of a misspelled word. It's not one you can rely on to correctly identify the real word, and it does not legitimize the use of the incorrect or misspelled word. For example, if the author had written "barfued" instead of "FUBARed", I would most likely not have been able to recognize that the author meant "FUBARed".

If you expect me to be able to understand the words that you are writing, you need to make the effort to actually use the correct words to convey your meaning.


"foobered" does not have a meaning. "FUBARed" has a meaning ( to a relatively small percentage of the population I would guess). I think given the context, if foobar is used, then we can assume the context (and lack of any alternate meaning) we can safely assume that the prose "foobared" is equivalent to "FUBARed". What else would you expect it to mean?


Actually, "FUBARed" means "fucked up beyond all recognitioned" and it makes about as much sense as "foobared". If you're going to nitpick, nitpick right. :-)


Ok sure, I'll nitpick right. FUBAR is an acronym. "FUBARed" is the informal way of using the term in its adjective form in a spoken sentence.


And foobar is a word that has come to mean FUBAR.


[citation needed]

No, seriously. "foobar" does not mean FUBAR. The linked article here is quite literally the only time I have ever seen "foobar" used in place of FUBAR, and the Jargon file also stresses that "foobar" is not generally used to mean FUBAR.


"My site is DOWN, and it's down hard. Requests are timing out."

A bit OTT on the drama there.


gdb is the best


The UI is pretty terrible though, in my opinion. Registers should always be visible so that you can watch them change without having to print them after every command.
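
One partial workaround, as a sketch (plain GDB, no extra UI): define a hook-stop so the registers you care about get printed every time execution stops:

  define hook-stop
    info registers rip rsp rax
  end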



Or C-x a after launching GDB, followed up by "layout regs" and "layout split", and you have a pretty decent debugger.


What? He could have just as easily looked at the database monitoring and figured out that it was down. Unless he doesn't have database monitoring, in which case using the worker process itself as a database monitor is an extremely poor substitute. Furthermore, have you ever seen the output from strace? It spits stuff out nonstop. He could have very easily missed the call to the database. Or maybe the call is supposed to take a few seconds, and then the fact that it is waiting on that call tells you absolutely nothing.

This is a very poor reason for learning how to use strace. There are simpler tools for everything he mentioned.


Except, you don't necessarily know that it's the database that's the problem. It might be trying to do a DNS query, and that's timing out. Or spinning in a loop trying to read from a socket that has no data available. Or blocking on a filesystem operation on a networked filesystem that has become disconnected. Or any number of other things.

The nice thing about strace is that it pretty quickly tells you exactly what the program in question is doing with the outside world, which many times can provide you the quick clue you need about what's going on.

strace is a pretty simple tool; at its most basic, it's just "strace <command>" or "strace -p <pid>", and the nice thing is that it's pretty universal; it comes in handy many, many times, for solving many different kinds of problems. Have some third-party code that seems to exhibit different behavior on different machines, but don't know what is different about its environment that's making it behave differently? strace it and see what files it opens for configuration data. Your program not making any progress, and it doesn't seem to be CPU bound as it's only using a few percent of CPU? strace it and see what system calls it's blocking on.
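
Roughly, for those two cases (the program name and PID are placeholders):

  strace -f -e trace=open,openat ./vendor-tool    # which files does it actually open for config?
  strace -tt -p <stuck-pid>                       # which syscall is the stuck process blocked in?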

Note that for this kind of use case, where the process is hanging, you generally don't get way more output than you can handle; the fact that it's hanging on this particular call means that that's the call that you will see at the end of the output for several seconds until it times out.

It's certainly not the only tool in your arsenal; there will be other cases where more specific tools are more appropriate. But it's a really useful general purpose tool.

My general purpose debugging arsenal, when I need to debug something that I haven't written, generally consists of tcpdump/wireshark to capture and analyze any network traffic, strace to see how the program is interacting with the world outside of it, turning up logging on any relevant systems to their maximum value and looking through that output, and then finally if all else fails attaching to it with GDB and stepping through the code.
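
For the capture step, something along these lines usually suffices (interface, host, and port are placeholders):

  tcpdump -i any -s 0 -w /tmp/app.pcap host db.internal and port 5432

and the resulting pcap can then be opened in Wireshark for analysis.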

Sure, it's always better to have more specific tools, to have turned on appropriate monitoring, and so on. But you don't always have that luxury. Sometimes more specific tools haven't been written. Sometimes your monitoring may find that your database is up, as it is accepting connections, but it's actually hanging when you try to do anything, or it's succeeding for reads but hanging on writes, or the like. Monitoring code is never going to be as thorough as you'd like; there will always be something that you miss, and so you need general-purpose tools that can be used in any situation to get a handle on what is going on quickly, so that you can narrow down on the problem.


My thought exactly.

A jackhammer and dynamite when a pair of pliers would suffice.



