Things you (probably) didn't know about xargs (offbytwo.com)
181 points by tswicegood on June 26, 2011 | 97 comments



Developing some proficiency with commands such as find, xargs, sed, grep, etc. is where you can really gain productivity. There are studies showing that the perception that the command line is more efficient than a GUI is false; the GUI is actually faster. This may be true for relatively simple operations such as moving, copying, or deleting files, but by involving find, xargs, and a few pipes you can quickly accomplish operations that would be either very tedious or flatly impossible in most GUI file "explorers".
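
For instance, something like this (a hypothetical example, nothing from the article) is a one-liner at the shell but painful in most file managers:

    # Find every editor backup file anywhere under the current directory
    # that hasn't been touched in 30 days, and delete it.
    find . -name '*~' -mtime +30 -print0 | xargs -0 rm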


I am a believer in the CL, but I want to point out what "faster" can mean in these GUI vs CLI debates: On one hand, GUIs are probably faster if one is discovering a command in an interactive environment, but command lines and key oriented interfaces are probably faster for pure execution once one knows what one is doing in the same interactive environment.

sed, awk, xargs, friends aren't used interactively like either of the above, but are rather super powerful ingredients of non-interactive environments.


it's not a dichotomy. for example, staging non-trivial commits in git is much easier from a frontend than from git's CLI.


> There are studies showing that...the GUI is actually faster

Can you elaborate? I'm trying to understand how studies can show the GUI is faster than the command line, with or without xargs, grep, etc.


Every study I've seen like this is indicating that they're faster to learn for the average joe, and possibly faster for certain types of tasks (stretching an image), not that they're faster in general -- I'm not even sure what exactly that would mean.


> There are studies showing...

Sources for the studies?



Yes I was thinking of the Apple study, but didn't recall it that clearly.


The first comment there shows a counter study which says that for non-spatial tasks, keyboarding is better since muscle-memory can be used.

Do you have other studies?


You can skip xargs entirely by using find's '-exec' option. It will save you the pipe.

The example: "find . -name '*.py' | xargs grep 'import'" would become "find . -name '*.py' -exec grep -H 'import' {} \;".

You need to include the -H in grep to get the filename in which the match occurs.


Using xargs with a pipe is easier. I don't know any reason why I'd want to "save a pipe" when working at the command line.

Also note that text surrounded by asterisks in your comment becomes italics. Indent text by two or more spaces to reproduce it verbatim, like for code.


I think it was more about the appropriateness of the examples. z_'s comment is right-on, and I was going to post the same thing. It's a good intro, but the examples are contrived because you just don't need it.

It's not necessarily about saving a pipe, but also, when the tool provides first-class support for the function, it's typically less prone to error. For example, the -print0 becomes unnecessary, and I've been burnt by that.

I also appreciate the writeups that don't teach poor examples. We all know how prolific copy&paste coding is. How many times have you seen "grep foo bar | wc -l" when you know it's just all-around better to "grep -c foo bar"?


I would prefer we only teach "grep -c" as a special case optimization to people who already understand how "grep | wc -l" works, because the latter is more generally useful.


I felt bad using contrived examples, but I wanted a short post that covers the basics of xargs without getting sidetracked into discussions of find's options. Based on the feedback I've seen here I'll go ahead and update the post though. If you can share some simple but less contrived examples please let me know; I'd love to update the post.


Don't get me wrong; it's a perfectly good and useful tutorial. The meat comes at the end when you talk about parallelism and argument batching. That can make a world of difference when you're working on real-world problems, like moving millions of files (mv * won't work unless you're on a system without ARG_MAX, and even then there are performance implications).
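
For example, a batched move along these lines (just a sketch; the batch size and GNU mv's -t flag are assumptions about your tools) never builds a command line anywhere near ARG_MAX:

    # Move .dat files 1000 at a time; each mv gets a short argument list.
    find src/ -name '*.dat' -print0 | xargs -0 -n 1000 mv -t dest/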

I think a good intro to xargs starts with a list of things that you can't do without it. (Easy for me to say that, but of course I haven't written that piece...) It'd be great to know why to use it, not just how, you know what I mean?

Anyway, this is just off-the-cuff commentary, not criticism. Thanks for writing it up.


I guess the question then becomes "why is xargs with a pipe easier?", or "why do you view xargs with a pipe as easier?".


Not sure if this is a zsh thing only, but I do this:

    grep import **/*.py


I rarely use find because zsh has great globbing. I really wish zsh were the default shell everywhere, most people wouldn't even notice if you changed the prompt from % to $


Ack-grep[1] is also really nice for making such find/grep operations a simple command.

[1] http://betterthangrep.com/
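
For example (assuming ack is installed; the --python type flag restricts the search to Python files):

    # Recursive search for 'import' in Python sources, no find or xargs needed:
    ack --python import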


Please stop calling it ack-grep. Debian already had an ack package and didn't get their priorities the way I would have wished. That's their problem, not the official name.


"didn't get their priorities the way I would have wished"???

They already had a package named ack. WTF? By "getting their priorities" correct you mean catering to you?

The same problem is currently happening with node.js. It's unfortunate that node.js chose node to replace an equally ambiguous and generic name.


I didn't say "correct". I said "the way I would have wished". I expressed a personal preference, you're making me sound like more of a jerk than I already am.

That being said, yes, I believe they're plain wrong on this one. They shouldn't cater to me, but to their users. Thank God they have stats.

http://qa.debian.org/popcon.php?package=ack-grep

http://qa.debian.org/popcon.php?package=ack

I count in both because I installed the ack package mistakenly. I'm probably not the only one.

Still a very large WIN for ack, the one I care about.


Ack is awesome. It makes life as a programmer so much easier -- no more fiddling around trying to remember the useful flags for grep.

Plus it has --thpppt.


"find . -name '* .py' -exec grep 'import' {} +" will work more like the xargs example, i.e. all filenames will be passed as parameters to a single grep process, so the filename is displayed by default.

I usually don't bother with this kind of command line micro-optimization, but in my experience find and grep are the commands for which it's worth knowing and using every option.


It's more efficient to use the pipe with xargs. If you're removing 10,000 files, the difference in time between -exec (and fork/exec of rm 10,000 times) and xargs rm is quite significant. (On my system getconf ARG_MAX is 2180000.)


One of the reasons I love xargs so much is the archaic syntax (and surprises based on your input) behind -exec. I think the pipe is cleaner, or using find in a set of backticks.


You save one pipe, but at the expense of every grep spawning its own process.

So maybe a 1000 grep processes instead of one. But I've saved a pipe.


I wonder why -print0 / -0 isn't the default, as it seems that not using those options is the wrong way to do it.

(Either that or why filenames can have spaces or LFs in them.)


It's not the default because things like 'find | grep foo' wouldn't work as well: the output would appear all concatenated, as the terminal doesn't break lines on null characters.

I wrote a utility I named print0, which simply converts line-oriented input into null-terminated output. Very useful for building pipelines of line-oriented utilities, where each line is a filename. It's quite common to have spaces in filenames (if nothing else, user files on NAS shares), and vanishingly rare to see newlines, so I find it to be a sensible tradeoff. Things like 'find | egrep | sed | print0 | xargs -0' work as you'd expect.


Would not « print0() { tr '\n' '\0'; } » work in most cases? Does your utility do something special to existing instances of \0? That's the only case I can imagine where the transformation is necessarily lossy and it's not clear what to do. I suppose in theory you could have newline normalization or something too.


No, because that doesn't handle command-line arguments or '-' in the same way that cat does. It also doesn't handle the differences between DOS, Unix and Mac (\r, \n and \r\n) line endings properly. Finally, it's also slower than my 80-line C version.
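
For anyone who just wants a rough shell approximation (a sketch only, not the commenter's utility: stdin only, blank lines collapsed, still no '-' or argument handling):

    # Normalize \r\n / \r / \n line endings, then emit NUL-terminated names.
    print0() { tr '\r' '\n' | tr -s '\n' '\0'; }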


If filenames can contain LF characters, 'find | grep foo' is also wrong.

(Interesting idea with print0, but I fear this is mere symptomatic relief, rather than actually fixing the problem.)


As I already said, newlines[1] are vanishingly rare in filenames (most file manager UIs treat an attempt at introducing a newline into the name as committing the file rename), so I find it useful, and so I wrote the utility, which only I use.

How you figure my solution to my problem is only symptomatic treatment, with technical drawbacks I already mentioned but treat as acceptable, is beyond me.

[1] As I mention in a cousin comment to this one, my utility handles all usual ASCII forms of newlines - \r, \n and \r\n (and \n\r for good measure).


Historically, UNIX filenames almost never had spaces in them.


Use GNU Parallel if you need to execute jobs in parallel (http://www.gnu.org/software/parallel/)
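
A minimal sketch of what that looks like (assuming parallel is installed; -0 matches find's -print0):

    # One gzip per file, spread across all CPU cores:
    find . -name '*.log' -print0 | parallel -0 gzip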


The best thing about Unix, and maybe the worst, is that you never stop learning about it. Nice article.


Never pipe `find` to `xargs rm -f`. All you need is a malicious empty file called / and you're screwed. That's why `find` has a -delete flag.


A forward slash is not a valid file name.


The chances of such a file coming into existence accidentally and also being picked up in the find filter seem slim. If it's around maliciously, I think you have bigger problems. Anyway, there are many other failure modes of rm -rf that are a lot more likely and you should worry about.


> The chances of such a file coming into existence accidentally

When sharing servers with less UNIX savvy developers you will observe A) a notoriously polluted home directory and B) plenty of files with funny names such as -, *, user@server.com (from failed scp attempts) all over the place.

It's indeed relatively hard to create a file called '/' by accident. But I've seen files containing the '/' along with spaces, which can be just as deadly. Oh and the popular *-file is not to be taken lightly either.

As a rule of thumb: Learn the safe way to chain these commands (find/xargs in particular) once and stick to it. And always perform a dry-run before launching the real deal.
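
For example, the cheapest dry run is just sticking echo in front of the destructive command (a sketch):

    # Print what would be removed, without removing anything:
    find . -name '*~' -print0 | xargs -0 echo rm
    # Looks right? Then drop the echo:
    find . -name '*~' -print0 | xargs -0 rm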


Unix filenames can not include / characters or null bytes, so I call shenanigans. Nonetheless your point stands: be careful. (And make liberal use of -print0 !)


Seems you are right! I could've sworn I've seen them before, guess my memory got mixed up there.


'/' is an illegal character in filenames. Such a file is impossible on a unix system.


While not the point of the article, I was happy to learn about bash's {..} operator! I've always missed perl's .. in the shell.

   echo {1..100}


Note that the first three examples can also be written in a much shorter form if you use zsh:

  wc -l **/*.py
  rm **/*~
  grep 'import' **/*.py
I like zsh a lot for this small feature.


I like zsh for the same reason, but you will want to quote or escape the tilde in your second example.

    rm **/*\~
    rm '**/*~'


A few issues with this article. No, GNU xargs does not truncate the generated command. It will span multiple commands. Nor is it "-print 0" but "-print0" in find. And as mentioned in other comments, GNU parallel is much better for job parallelization.
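
The spanning behaviour is easy to see directly (a quick illustration; -n just forces small batches):

    $ seq 1 10 | xargs -n 4 echo
    1 2 3 4
    5 6 7 8
    9 10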


You are right, I mistakenly assumed xargs truncated the command because the only thing I saw on the screen was the output of the last invocation. GNU parallel is great but xargs is installed by default on OS X, BSD. I'll go ahead and update the post and fix the -print0.


One thing that has bitten me in the past is that xargs always runs its command at least once, even if there is no input. By using the -r / --no-run-if-empty flag, xargs does not run the command if the input does not contain any nonblanks.
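
A quick way to see the difference (GNU xargs; a sketch):

    $ printf '' | xargs echo ran
    ran
    $ printf '' | xargs -r echo ran    # GNU only: nothing runs, no output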


As with many cool features, it's a GNU-only option.

One can refer to http://pubs.opengroup.org/onlinepubs/009695399/utilities/xar... for the portable flags.


I was just trying to figure out how to change the position of the arguments yesterday. I figured it out (thanks, man), but discovered the OS X version (FreeBSD?) uses -I while certain GNU versions use -i. Lame.
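
For what it's worth, the capital -I form works in both GNU and BSD xargs (a sketch; the .bak renaming is just an illustration):

    # Put each argument where {} appears, one invocation per input line:
    find . -name '*.bak' | xargs -I{} mv {} {}.old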


In GNU utils, incompatible features and extensions are a feature, not a bug.


On a slightly related note, am I the only one who finds the argument syntax for `find` to be inconsistent? Shouldn't it be `find --name "*.bar"` or even `find --name=foo.bar`?


The "single - for single-letter -- for fullname" option style was popularized by gnu getopt, by far the most popular style. Find has been around for a while and probably doesn't update to that style because of the huge amount of stuff already using find.

It's inconsistent in comparison to most utilities, for sure.

It's consistent internally, and POSIX-compliant, at least. (iirc)


find's arguments might very well be inconsistent, but the --name argument as you show it does exist. -iname just means find should treat the name as case insensitive; --name, on the other hand, is case sensitive, so --name=Foo.bar won't match foo.bar.


The trace option to xargs is nice too: you can check what will happen before actually doing anything.
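
That should be the -t flag (in GNU and BSD xargs at least); it writes each constructed command line to stderr before running it, e.g.:

    $ printf 'a b c\n' | xargs -t rm -f
    rm -f a b c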


zsh and its zargs command can easily overcome many of the limitations of xargs noted in the article. They also make the use of the "find" command unnecessary.
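
From memory, the zargs invocation looks something like this (a sketch; the exact syntax may differ slightly between zsh versions):

    # The glob does the finding, zargs does the batching:
    autoload -U zargs
    zargs -- **/*.py -- grep -l import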


Recursively find all Emacs backup files and remove them: find . -name '*~' | xargs rm

Recursively find all Python files and search them for the word ‘import’: find . -name '*.py' | xargs grep 'import'

Hmm. I don't mean to be a tweak, but you don't need xargs to do either of those things. Just:

find . -name '*~' -delete

find . -name '*.py' | grep 'import'



The -delete option is not available on all UNIX systems, as it's not part of the POSIX spec (http://pubs.opengroup.org/onlinepubs/009604599/utilities/fin...).

(Also, rant rant, I really don't understand why find was extended with -delete in the first place. What's next, "ls --delete" or maybe "cat --grep"?)


The orthogonality ship has sailed with Unix, for better or (IMHO) for worse. The Unix Haters were right about this, among other things, and the sections of the handbook about the shell are still valid, even as time and Moore have rendered others quaint.


For worse, I concur. I still fire up my SGI boxes once in a while to remind myself things didn't use to/don't have to be as binary as they mostly are today (Linux vs BSD, iOS vs Android, Intel vs AMD etc.). That, and to run electropaint, of course. =)


-delete was added to find because of the race condition with doing it using xargs.

See section 9.1.5 http://www.gnu.org/software/findutils/manual/html_node/find_...


Thanks for that link.

It walks through many of the same issues as the OP, but with more sophistication.

It also explains "+", which is used in place of the traditional ";" to essentially get xargs type argument accumulation, but within find.

That is,

  find . -name '*~' -exec rm {} \+
I did not know about that. It's in Mac OS 10.6 find, for one.


That was already fixed with -print0 | xargs -0, but then this solution is dismissed with "The problem is that this is not a portable construct;...". The -delete isn't either, so this is a straw man argument, although it probably is the most efficient and secure of all.


Read it again.

-print-0 | xargs -0 does not fix the race condition.

The problem is someone can swap in a symlink after the find, and before the xargs.


I'll have to run some more tests, but I can't see how -delete would help in that case.


Because find changes to the directory first (carefully not following symlinks), and then deletes the file from there.

It does not delete the file using the entire path (which may contain a sudden symlink).

It's not possible to do this safely using xargs.

Take a look also at -execdir which does the same thing - changes to the directory first, and runs things from there. -exec is not safe and should not be used.

xargs is not safe if you are running against a directory not your own. You should use find and -execdir instead.
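
In other words, something like this (a sketch):

    # rm runs from inside each file's own directory, so a symlink swapped
    # into an ancestor path between lookup and deletion can't redirect it:
    find . -name '*~' -execdir rm {} +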

Yes, the original authors of posix made a mistake here.

> Also, rant rant, I really don't understand why find was extended with -delete in the first place.

I'm hoping you understand it now.


Yes, thanks for the extensive info.


Rob Pike and Brian Kernighan warned about this trend in their seminal paper "Program Design in the UNIX Environment" (also known as "cat -v considered harmful"), which describes how proper unix programs should be designed:

http://harmful.cat-v.org/cat-v/


I did a quick look online to see who has it. GNU, DragonFly BSD, NetBSD, and FreeBSD all have a -delete option. OpenBSD seems to be the only one that doesn't (although its online manpage does give examples on how to do it with -exec and piping to xargs).


That's basically two flavours of UNIX (Linux and BSD), but that's OK - others are unfortunately either dead or dying. The UNIX family tree (http://www.levenez.com/unix) has been shrinking so rapidly in the last few years, that the current situation looks like the early years.


Since the BSDs are all independently developed, I don't consider them to be one Unix, even though they share ancestry.


Nobody remembers the various command line switches of find or grep. I use xargs quite often because it's a very simple concept; I know how to mix and merge any commands together with it, and that's something I've only had to learn once.


The grep one didn't involve a switch on grep. It was just the xargs example with xargs removed.


The grep one is also wrong as it does an entirely different thing without xargs.


I'm not an expert by any means, so please provide correction/feedback if I'm wrong, but I've read that xargs is more efficient because it can send parameters to the cmd in batches. I've read this in contrast to `-exec {}` not `-delete`, however.

For example:

    find . -name '*~' -exec rm {} \;
The statement above executes `rm result` for every result. By contrast:

    find . -name '*~' | xargs rm
The example above would group the results and pass them to rm like this:

    rm result1 result2 result3 result4
Because I'm not an expert, I don't know how many parameters it will pass, or if the initiation of many new processes has a significant performance impact on newer systems. I would suspect that for anything involving the disk, I/O will be the bottleneck, not process start-up time.

Anyone have an opinion/insight?


-delete will be the fastest because find can do everything, and it already has the file loaded, and doesn't have to do a second lookup.

-exec will be slower because for each file it has to spawn a new process (this is where most of the time goes), and then that new process has to look up the file.

xargs will be faster than -exec because it will collect a few hundred, or thousand, filenames, and pass them all to one command (the article claims 4096 is the default, on some systems it may be lower). This means that typically, only 3 processes need to spawn: `find', `xargs', and one `rm'; instead of find, and many, many `rm's.

Now, xargs is still slower than -delete because it will buffer the filenames, either waiting for the list to end, or 4000 filenames to pass to `rm'. Then, rm must look up the file from the filename.

To make my point I just set up a test situation with 1110 files ending with `~', among a total of 2221 files. I tested how fast it is to delete all the files ending with `~' using the 3 following commands:

   $ find . -name '*~' -delete
   $ find . -name '*~' -exec rm {} \;
   $ find . -name '*~' -print0 | xargs -0 rm

            | real   | user   | sys
    -delete | 0.024s | 0.003s | 0.020s
    -exec   | 0.819s | 0.007s | 0.107s
    xargs   | 0.073s | 0.007s | 0.017s
I probably should have run those tests multiple times, and taken the average, but meh.


Very cool. Thanks for the insight.

Mind if I ask how you timed execution?

The takeaways for me are:

* Use find's built-in options where possible

* Use xargs where a built-in method isn't available

* Use -exec only when nothing else will do


I timed it with the `time' command :P
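
i.e. roughly like this (a sketch; in bash the `time' keyword covers the whole pipeline):

    time find . -name '*~' -delete
    time find . -name '*~' -exec rm {} \;
    time find . -name '*~' -print0 | xargs -0 rm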


Most systems have a find command that supports the "-exec command {} +" option, which does not spawn the command for each result, but instead puts as many arguments as the shell will allow on the command line.


No. You're looking for files having import in their path here, not in their contents.


He got the command wrong but his point stands:

    find . -name '*.py' -exec grep -H import '{}' \;


No he doesn't. That's completely stupid compared to xargs as soon as you have more than a few files. You're starting one grep process per file.

Thanks for the laugh though.


All of this is stupid compared to just using a good shell and doing grep import **/*.py.


find . -name '*.py' | grep 'import'

I don't think that works. That'll grep for files whose names contain 'import', not their contents.


I use xargs for the find + rm cases he mentions. However, if you just want the names of the files containing the word 'import', I rely exclusively on grep:

grep -r -l import *.py


You are correct about -delete. You could also use -exec to execute arbitrary commands for each file returned by find. I needed some examples however, and I often see "find | xargs" because people don't know, or forget, the options to find. I should write a follow up post on find.


In addition to ptramo, you really want:

grep 'import' `find -name '*.py'`


Well, you want xargs as soon as your project is big, because you'll reach the limits on arguments fairly quickly.

I won't go too deep on the issues this will trigger if your filenames contain special characters. We often think of spaces, but you could have some nastier stuff. If you want to sound smart at your next geeks reunion, simply read http://www.dwheeler.com/essays/fixing-unix-linux-filenames.h...


On (modern, i.e. 2.6.23 and later) Linux at least, you won't hit them quickly:

    $ getconf ARG_MAX
    2097152
and if you do hit them, they're very easy to raise:

    $ ulimit -s 32768
    $ getconf ARG_MAX
    8388608
xargs is still useful for its other features, of course.


# echo "2097152/$(echo "/my/still/quite/reasonable/path/to/a/file"|wc -c)"|bc 49932

linux-2.6 # git ls-files|wc -l 36747

Getting close! :)

And I would guesstimate that (Linux, kernel >= 2.6.23) still covers a fairly small share of the machines people interact with professionally through a command line.

And if it's in a script/snippet, you often want to cover a vast majority of the systems you _could_ end up on. It won't be System III, at least for me, but there has to be a RHEL5 system in a closet, right? :)


That's the kernel's limit; Bash (3) has its own limit, which is significantly lower, and I've hit it before. I think Bash 4 can do an arbitrary number of arguments, within the kernel's limit.


No, bash doesn't and didn't have its own limit. On an ancient system (Debian Sarge), ARG_MAX=131072:

    $ bash --version
    GNU bash, version 2.05b.0(1)-release (i386-pc-linux-gnu)
    Copyright (C) 2002 Free Software Foundation, Inc.
    $ strace bash -c '/bin/echo `seq 1 30000`' 2>&1 | grep exec
    execve("/bin/bash", ["bash", "-c", "/bin/echo `seq 1 30000`"], …) = 0
    execve("/bin/echo", ["/bin/echo", "1", …) = -1 E2BIG (Argument list too long)
As you can see, the argument list too long error came back from the execve syscall, i.e., from the kernel. (Note that I shortened the strace output to make it fit the page)


Of course, I meant my version as an alternative to xargs. And I think you have about 128k of space for the filenames, but yeah with large projects that can be a problem.

Thanks for the link, that's more interesting than the submission. :)


Fair point, I just submitted it :)


Or maybe, if you have GNU grep:

    $ grep -R --include=\*.py import .


For a long time I avoided xargs like the plague. Much less error-prone to use a bash loop or something....

but that parallelization parameter may win me back as it's cleaner than & and global vars for counters....
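
e.g. something like this (a sketch; -P sets the number of parallel jobs and -n the batch size):

    # Compress logs four jobs at a time, 25 files per gzip invocation:
    find . -name '*.log' -print0 | xargs -0 -P 4 -n 25 gzip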



