That being said, that's the worst debugging experience your company can have because it outlines a bigger problem: your monitoring sucks.
In the scenario presented by this article, better monitoring of the database layer could have allowed the sysadmin to connect the dots much faster and find the root cause. So the DB might be up but spilling the wrong beans? You may need some functional monitoring that checks what it's outputting from time to time.
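To make "functional monitoring" concrete, here is a minimal sketch of the idea: run a canary query and compare the answer with a known-good value, instead of only checking that the port accepts connections. It uses an in-memory sqlite3 database as a stand-in; the `canary` table and the `functional_check` helper are made-up names, not anything from the article.

```python
import sqlite3

def functional_check(conn, query, expected_rows):
    """Run a canary query and compare its rows with the known-good answer."""
    try:
        rows = conn.execute(query).fetchall()
    except sqlite3.Error as exc:
        return False, f"query failed: {exc}"
    if rows != expected_rows:
        return False, f"unexpected rows: {rows!r}"
    return True, "ok"

# Stand-in database with a hypothetical canary table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE canary (id INTEGER PRIMARY KEY, value TEXT)")
conn.execute("INSERT INTO canary VALUES (1, 'expected')")

print(functional_check(conn, "SELECT value FROM canary WHERE id = 1", [("expected",)]))
```

A real check would point at the production database driver and alert on a `False` result, but the shape stays the same.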
I absolutely love low-level debugging tools but using them on a daily basis doesn't scale. I do use them almost every single day, because other teams failed to properly monitor and understand their systems. That doesn't mean it's the most efficient way to go about this.
perf added a "trace" subcommand in Linux 3.7, for buffered syscall tracing (http://www.brendangregg.com/perf.html#More). This should eventually be like strace, but with MUCH less performance overhead.
sysdig is a newer tool that uses a similar low-overhead interface, and provides an expressive syntax. It's still early days, but sysdig could ultimately replace strace (or perf will via trace).
The big advantage is that it already knows how to decode most syscalls, decompose flags into symbolic arguments, decode things like stat return values, and so on. dtruss on Mac OS X can decode some syscalls, but doesn't generally know how to decode structures or pull apart flags, so there's some information you just can't see, and some which you have to laboriously manually decode. perf, as you note, decodes even less than dtruss. When I'm trying to debug an issue, the more information I can gather and see at once, the better.
Very few of the occasions I've had to use these tools have been particularly performance sensitive. Your example of the huge difference is because dd is copying fairly small blocks at a time, and thus making a fairly high number of system calls relative to the amount of data being transferred. Usually, I'm just looking for why an operation fails, or why a process is stuck; if it takes a little longer to get to that failure, that's OK, and if the process is stuck then stracing it generally won't affect performance.
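The dd point is mostly arithmetic. As a rough sketch, assuming dd issues one read() and one write() per block and never gets a short read, the syscall count scales inversely with the block size. The sizes below are picked for illustration and are not the ones from the example being discussed.

```python
def dd_syscalls(total_bytes, block_size):
    """Approximate read+write syscall count for copying total_bytes in block_size chunks."""
    blocks = -(-total_bytes // block_size)  # ceiling division
    return 2 * blocks  # assumption: one read() and one write() per block

one_gib = 1024 ** 3
for bs in (512, 4096, 1024 ** 2):
    print(f"bs={bs:>8}  syscalls={dd_syscalls(one_gib, bs)}")
```

With a per-syscall tracing cost, a copy using 512-byte blocks pays that cost a few million times per GiB, while a 1 MiB block size pays it a couple of thousand times.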
sysdig looks pretty interesting, thanks for the reference! It's too bad they can't use the existing perf infrastructure in the kernel, and need to add their own kernel module; that makes it that much more cumbersome to deploy and use.
In discussion with others about its merits I have had people show some better tool that works perfectly in their test case dev environment on their local machine, possibly with fancy graphics. So then the question is 'why is your site so *%^%% slow then!!!' - what is it doing!!!
Put strace on there (I think you can do a local wget from the server and hook it to that, so no need for any setup) and sure enough it will tell you what it is doing. Sure the output is verbose, but there are command line switches for that. If that verbose output shows it is looking up some db value 10000 times just to show a web page then you can see that someone hasn't written the code too well. This might not be easy to spot in the code, even if it is the neatest code you ever saw, however, nothing is hidden in strace.
Because disk operations and db read/writes are time expensive they stick out like a sore thumb in the strace log. Normally you need to run strace a few times to see whether caching is working as it should or if these lengthy calls will need some code refactoring.
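If the log is too long to eyeball, a few lines of script can do the counting. This is only a sketch: it assumes the trace was taken with `strace -T`, which appends the time spent in each call as `<seconds>`, and the sample lines are made up and abbreviated rather than captured from a real site.

```python
import re
from collections import Counter

# Matches lines such as: read(3, "..."..., 4096) = 4096 <0.000034>
LINE_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+)?(\w+)\(.*\)\s+=\s+.*<(\d+\.\d+)>\s*$")

def summarize(lines, slow_seconds=0.01):
    """Count syscalls by name and collect the ones slower than slow_seconds."""
    counts = Counter()
    slow = []
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # skip signals, unfinished calls, exit lines
        name, elapsed = m.group(1), float(m.group(2))
        counts[name] += 1
        if elapsed >= slow_seconds:
            slow.append((elapsed, line.strip()))
    slow.sort(reverse=True)  # slowest first
    return counts, slow

# Hypothetical sample output, for illustration only.
SAMPLE = r'''
openat(AT_FDCWD, "/etc/app.conf", O_RDONLY) = 3 <0.000021>
sendto(5, "SELECT value FROM settings", 26, 0, NULL, 0) = 26 <0.000015>
recvfrom(5, "\1\0\0\1"..., 16384, 0, NULL, NULL) = 64 <0.012400>
sendto(5, "SELECT value FROM settings", 26, 0, NULL, 0) = 26 <0.000014>
recvfrom(5, "\1\0\0\1"..., 16384, 0, NULL, NULL) = 64 <0.011900>
write(1, "<html>", 6) = 6 <0.000009>
'''.strip().splitlines()

counts, slow = summarize(SAMPLE)
print(counts.most_common())
for elapsed, line in slow:
    print(f"{elapsed:.6f}  {line}")
```

A page that repeats the same lookup thousands of times shows up as a huge `sendto`/`recvfrom` count, and the slow list points at the calls worth caching or refactoring.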
All considered though, I have only used strace when things have got very desperate. Factors that have contributed to the emergency include developers that do not have any interest whatsoever in performance, lame hosting environments bought by people that have no understanding of the requirements, and insufficient 'rush job' testing where there is no testing, just some deadline. strace succeeds where 'the proper way' hasn't been set up yet or just isn't useful.
A bit better: in case your API is down, maybe you could have a script that grabs a strace (and maybe a gdb backtrace, see http://poormansprofiler.org/) of the usual suspects and attaches that to the alert.
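Something along these lines, as a rough sketch. The helper names, the output path, and the 10-second window are all made up; the flags are standard ones (`strace -f -tt -T -p PID -o FILE`, coreutils `timeout`, and `gdb -p PID -batch -ex ...` in the style of the poor man's profiler). How the result gets attached to the alert depends entirely on your alerting system, so that part is left out.

```python
import shlex
import subprocess

def capture_commands(pid, seconds=10, out_dir="/tmp"):
    """Build the commands an alert hook could run against a suspect PID."""
    strace_cmd = [
        "timeout", str(seconds),          # stop tracing after a fixed window
        "strace", "-f", "-tt", "-T", "-p", str(pid),
        "-o", f"{out_dir}/strace.{pid}.txt",
    ]
    gdb_cmd = [
        "gdb", "-p", str(pid), "-batch",  # attach, run the command, detach
        "-ex", "thread apply all bt",
    ]
    return strace_cmd, gdb_cmd

def collect(pid):
    """Run both captures; needs permission to ptrace the target process."""
    strace_cmd, gdb_cmd = capture_commands(pid)
    subprocess.run(strace_cmd, check=False)  # timeout exits non-zero by design
    bt = subprocess.run(gdb_cmd, capture_output=True, text=True, check=False)
    return bt.stdout

# Show what would be run for a hypothetical PID, without running it.
for cmd in capture_commands(1234):
    print(shlex.join(cmd))
```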
You're looking for FUBARed, which means "fucked up beyond all recognition". "foobared" has no meaning.
This is an April Fools' joke (one that significantly postdates the rise of "foo" and "bar").
Which again means that it's better if you don't consider them as the innocent "no meaning" variables. They have as much "no meaning" as WTF means "worse than failure."
The entry on "foobar" (as opposed to "foo") also stresses that "foobar" does not generally reference FUBAR:
> Hackers do not generally use this to mean FUBAR in either the slang or jargon sense.
And once again, it suggests that "foobar" may have been spread among early engineers partly because of FUBAR, but it provides no evidence for this suggestion aside from its apparent plausibility.
So yeah, ok, it's plausible that the popularity of "foobar" was boosted by the acronym FUBAR. But regardless of the truth of that, the meaning of "foobar" has nothing to do with the meaning of FUBAR. It's a metasyntactic variable, and that's it.
Or "nothing, just metasyntactic" but absolutely no chance that he could have been thinking about
I've already mentioned another example:
Initially it really meant what the internet would first think it meant.
As the site got more popular, it was necessary to make the name less offensive for the uninitiated, so the "worse than failure" tag was invented much later.
Just "metasyntactic." Three nice letters. Don't try to associate with anything rude. We're all polite here.
I have never seen SFU used as a variable name, whether in source examples or in production code. And were I to see that, I would never even consider the possibility that it's trying to reference the acronym STFU. Because that doesn't make sense. Googling for "SFU" as slang comes up with a reference suggesting that it could potentially mean "So Fucked Up", which I'll grant uses the same swear word but otherwise has a rather different meaning than STFU. But that does not seem like common usage. And when not explicitly looking for it as slang (after all, source code variable names are typically not slang), it is a lot more likely to refer to many other things, including various universities and Services for UNIX.
As near as I can tell, you're basically just making stuff up at this point.
Also, I don't think you understand what "metasyntactic" means. WTF in TheDailyWTF is not "metasyntactic". Neither is naming a variable "SFU", regardless of what you intended to convey with that name. A metasyntactic variable, by definition, is a placeholder name without meaning that is intended to be substituted with some other context-appropriate thing. "foo" is the most common such variable, "bar" is the second most common. You cannot simultaneously claim something is metasyntactic and also claim that it has meaning.
But that is entirely unrelated to trying to use the acronym FUBAR in English prose. I can figure out that when the author said "foobared" he really meant "FUBARed" because I can speak it aloud and figure out that the author was phonetically typing something. But that's an error correction mechanism, the same mechanism used to try and discern the meaning of a misspelled word. It's not one you can rely on to correctly identify the real word, and it does not legitimize the use of the incorrect or misspelled word. For example, if the author had written "barfued" instead of "FUBARed", I would most likely not have been able to recognize that the author meant "FUBARed".
If you expect me to be able to understand the words that you are writing, you need to make the effort to actually use the correct words to convey your meaning.
No, seriously. "foobar" does not mean FUBAR. The linked article here is quite literally the only time I have ever seen "foobar" used in place of FUBAR, and the Jargon file also stresses that "foobar" is not generally used to mean FUBAR.
A bit OTT on the drama there.
This is a very poor reason for learning how to use strace. There are simpler tools for everything he mentioned.
The nice thing about strace is that it pretty quickly tells you exactly what the program in question is doing with the outside world, which many times can provide you the quick clue you need about what's going on.
strace is a pretty simple tool; at its most basic usage it's just "strace -p <pid>", and the nice thing is that it's pretty universal; it comes in handy many, many times, for solving many different kinds of problems. Have some third-party code that seems to exhibit different behavior on different machines, but don't know what is different about its environment that's making it behave differently? strace it and see what files it opens for configuration data. Your program not making any progress, and it doesn't seem to be CPU bound as it's only using a few percent of CPU? strace it and see what system calls it's blocking on.
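For the "which config files does it open" case, the output is regular enough to filter with a short script. A minimal sketch, assuming the usual `open`/`openat` line format; the paths in the sample are invented for illustration. On a modern Linux you could probably also narrow the trace itself with something like `-e trace=openat`.

```python
import re

# Captures the path and the return value of open()/openat() lines.
OPEN_RE = re.compile(r'\bopen(?:at)?\((?:AT_FDCWD, )?"([^"]+)".*\)\s+=\s+(-?\d+)')

def opened_files(lines):
    """Split traced open attempts into paths that succeeded and paths that failed."""
    ok, failed = [], []
    for line in lines:
        m = OPEN_RE.search(line)
        if not m:
            continue
        path, ret = m.group(1), int(m.group(2))
        (ok if ret >= 0 else failed).append(path)
    return ok, failed

# Hypothetical sample output, for illustration only.
SAMPLE_OPENS = '''
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/home/app/.apprc", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/app/app.conf", O_RDONLY) = 4
'''.strip().splitlines()

ok, failed = opened_files(SAMPLE_OPENS)
print("opened:", ok)
print("failed:", failed)
```

Running the same filter on both machines and diffing the two lists is usually enough to spot the config file that exists on one box and not the other.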
Note that for this kind of use case, where the process is hanging, you generally don't get way more output than you can handle; the fact that it's hanging on this particular call means that that's the call that you will see at the end of the output for several seconds until it times out.
It's certainly not the only tool in your arsenal; there will be other cases where more specific tools are more appropriate. But it's a really useful general purpose tool.
My general purpose debugging arsenal, when I need to debug something that I haven't written, generally consists of tcpdump/wireshark to capture and analyze any network traffic, strace to see how the program is interacting with the world outside of it, turning up logging on any relevant systems to their maximum value and looking through that output, and then finally if all else fails attaching to it with GDB and stepping through the code.
Sure, it's always better to have more specific tools, to have turned on appropriate monitoring, and so on. But you don't always have that luxury. Sometimes more specific tools haven't been written. Sometimes your monitoring may find that your database is up, as it is accepting connections, but it's actually hanging when you try to do anything, or it's succeeding for reads but hanging on writes, or the like. Monitoring code is never going to be as thorough as you like; there will always be something that you miss, and so you need general-purpose tools that can be used in any situation to get a handle on what is going on quickly so that you can narrow down on the problem.
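To illustrate the "up but hanging" case: a check that only connects will pass, whereas a check that runs a real operation under a deadline will not. Below is a minimal sketch of such a deadline wrapper; `probe_with_timeout` is a made-up helper, and the probes are stand-ins for whatever read or write you would actually issue.

```python
import threading
import time

def probe_with_timeout(probe, timeout_s):
    """Run probe() in a daemon thread and classify it as ok, failed, or hung."""
    result = {}

    def runner():
        try:
            result["value"] = probe()
        except Exception as exc:  # any error from the probe counts as a failure
            result["error"] = exc

    worker = threading.Thread(target=runner, daemon=True)
    worker.start()
    worker.join(timeout_s)
    if worker.is_alive():
        return "hung"  # still blocked after the deadline
    return "failed" if "error" in result else "ok"

# Stand-in probes: a fast "read" and a "write" that blocks.
print(probe_with_timeout(lambda: "row", 1.0))
print(probe_with_timeout(lambda: time.sleep(5), 0.2))
```

Separate probes for reads and writes would catch the "reads fine, writes hang" variant, though there will still be failure modes no probe anticipates, which is where the general-purpose tools come back in.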
A jackhammer and dynamite when a pair of pliers would suffice.