A lot of people like writing bash for loops; I try to avoid that as much as possible. xargs -n1 is the shell equivalent of a call to 'map' in a functional language.
For instance, let's say you want to create thumbnails of a bunch of jpegs:
find images -name "*.jpg" | xargs -n1 -IF echo F F | sed -e 's/.jpg$/_thumb.jpg/' | xargs -n2 echo convert -geometry 200x
Additionally, it's fully parallelizable as xargs supports something akin to pmap.
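For instance, a parallel variant of the pipeline above might look like this (a sketch only: the echo dry-run is dropped and -P added, assuming an xargs that supports -P, such as GNU or BSD xargs):
find images -name "*.jpg" | xargs -n1 -IF echo F F \
  | sed -e 's/.jpg$/_thumb.jpg/' \
  | xargs -n2 -P4 convert -geometry 200x    # -P4: run up to four convert processes at once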
You might be interested in "zargs" in zsh, which would save you the call to find.
Furthermore, instead of the pipe to sed and extra xargs, it might be clearer and simpler to do something like:
zargs -n 1 **/*.jpg -- make-thumb
Where "make-thumb" is a short script (or even a zsh function, if you care about saving a fork for each input file) containing:
convert -geometry 200x $1 ${1%%.jpg}_thumb.jpg
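As a rough sketch, the zsh-function form of make-thumb might look like this (quoting added so paths with spaces survive):
# zsh function: no extra fork per input file
make-thumb() {
  convert -geometry 200x "$1" "${1%.jpg}_thumb.jpg"
}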
But, in real life, instead of writing such a script or function, what I'd probably do instead is:
zargs -n 1 **/*.jpg | vipe > myscript
and do some quick editing in vim to modify the zargs output by hand to do whatever I need -- and then I'd run the resulting "myscript". Just fyi, "vipe" is part of the "moreutils" package [1] and lets you use your editor in the middle of a pipe.
One final trick is for when you need to do in-place image manipulation. Instead of using "convert", you can use another ImageMagick command: "mogrify". It will overwrite the original file with the modified file. Of course, you should be very careful with it.
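For instance, something along these lines resizes every JPEG in the current directory in place (a sketch; keep backups of the originals):
# overwrites each *.jpg with a 200-pixel-wide version of itself
mogrify -resize 200x *.jpg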
I use xargs a lot for refactoring work when I cannot simply use sed, e.g.
vim $(grep -lr foo | xargs)
and doing what I need to do on a file by file basis. Otherwise, for renaming functions and the like, I do a lot of:
find . -name foo_fn -exec sed -i 's/foo_fn/bar_fn/g' '{}' \;
I generally love abusing bash. Just today I was asked about how to rename a bunch of files, specifically containing spaces, and came up with either of these two options:
find -name foo_bar -exec cp "{}"{,.bak} \;
and
find -name foo_bar -print0 | while IFS= read -r -d '' file; do cp "$file"{,.bak}; done
Ultimately, the great thing is that if you learn CTRL-R, you can always search for these kinds of commands and modify them for the particular task at hand, without having to remember them exactly. One I use all the time, to push git branches upstream, is the following:
You are correct on both counts regarding the vim and grep example - I guess I just assumed I would have to have all the files on a single line before handing them off to vim.
Thanks for the suggestion about -exec +; I will have to remember it in the future.
Would anyone mind doing me a favor by explaining xargs in more detail? I've tried learning it a couple times but I always seem to forget the primary situations in which it's useful. Thank you in advance!
Xargs takes a newline separated list and maps the list to a command.
find ./ -name '*.log' | xargs rm
Finds all log files and maps them to 'rm' commands, e.g. if it finds system.log and rails.log it will run the command `rm system.log rails.log`.
xargs will automatically do things like break up very long lists into multiple command calls so that it doesn't exceed the maximum number of arguments a command can have.
Other useful things about xargs: '-P <NUM>' lets you run up to NUM invocations of the command in parallel. I use this with curl to do ghetto benchmarks.
The next major flag is `-n <NUM>` which changes the number of arguments per command call. e.g. `-n 1` will run the command per argument passed to xargs.
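For instance, a quick illustration of what -n 1 changes (output as I'd expect from GNU xargs):
$ echo one two three | xargs echo got
got one two three
$ echo one two three | xargs -n 1 echo got
got one
got two
got three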
And the last flag I commonly use is `-I {}`. This sets `{}` as a placeholder that gets replaced with each input item (it also implies one item per invocation, like `-n 1`). This is useful for things like:
find ./ -name '*.log' | xargs -I{} sh -c "if [ -f {}.gz ]; then rm {}; fi"
Only do that if you know exactly what '*.log' will expand to (i.e. don't use it in scripts and avoid using it on the command line). This is because the delimiter for xargs is a newline character, but filenames can have a newline character in them. This can lead to unexpected results.
Almost everywhere I see xargs used, `find ... -exec {} \;` will work as well, and `find ... -exec {} +` may work even better.
That fixes the issue, and xargs is far more efficient: it doesn't launch a new process for each argument the way `-exec ... \;` does. But far more importantly, xargs is generalizable to all commands, so you only have to learn it once; -exec is just an ugly hack on find and can't be generalized across other commands. xargs is much more unixy.
True, but the issue isn't removing files, the issue is generalizing a command for mapping output of a command to another command. rm was just a simple example. xargs is far more useful than simply deleting files.
Removing files is a common need coupled with find yet many readers don't know of -delete; I was pointing it out. That doesn't weaken xargs's valid uses. You seem a little defensive? :-)
Note that xargs actually takes a whitespace-delimited list. This often leads to problems when a filename has a space in it. To fix it, you should either use:
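(Presumably the two usual options; a sketch assuming GNU find and xargs:)
find . -name '*.log' -print0 | xargs -0 rm     # NUL-delimited: safe for any filename
find . -name '*.log' | xargs -d '\n' rm        # GNU xargs: split on newlines only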
Maybe you'll call me a pedant, but you should be aware that `find .` is not equivalent to `ls .* *`. The find command starts at the indicated directories (. in this case) and lists each file and directory within it, recursing into subdirectories. You can use things like -type, -[i]name, and -mtime to filter the results, as well as -mindepth and -maxdepth to constrain the traversal.
Note also that "-print" is the default command for find, so you can leave it off. Other commands include -print0 (NUL-delimited instead of newline-delimited) and -exec.
You can parallelize this operation with xargs simply by adding a -P. You could add a & to your convert here but that would run all the jobs at the same time. xargs allows you to only run n at a time. That would be a lot harder to replicate in bash.
I have a tendency to use xargs too, but often you can go with find's -exec, especially the "-exec command {} +" construct, which passes many files at once to the given command. E.g.
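Perhaps something along these lines (the pattern and command are just illustrative):
find . -name '*.log' -exec grep -l ERROR {} +    # one grep invocation gets a whole batch of files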
However, unlike xargs, GNU Parallel gives you a guarantee on the order of the output. From the man page:
GNU parallel makes sure output from the commands is the same output as you would get had you run the commands sequentially. This makes it possible to use output from GNU parallel as input for other programs.
I want to comment just to stress that this feature is EXTREMELY useful and has saved me from all sorts of tricks with intermediate output files whose names encode process IDs, etc.
The nice thing about parallel is that it can actually run some of the instances on remote machines using SSH, transparently (as long as they have the commands). You just need a couple of parameters and it takes care of uploading the files the command needs and then downloading the results. It's quite awesome.
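Roughly, and from memory, an invocation looks something like this (host names are placeholders; double-check the flags against the GNU Parallel man page):
# build thumbnails on two remote hosts, shipping each .jpg over and
# bringing the _thumb.jpg result back, then cleaning up the temp copies
parallel -S server1,server2 --transfer --return {.}_thumb.jpg --cleanup \
  'convert -geometry 200x {} {.}_thumb.jpg' ::: *.jpg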
I like this but I can't decide if it's technically abuse or not. The paste command will happily accept - (meaning read from stdin) multiple times, so for transposing a list into a table:
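For example, repeating - four times folds stdin into four tab-separated columns:
$ seq 12 | paste - - - -
1    2    3    4
5    6    7    8
9    10   11   12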
Back in the crusty old days, FreeBSD used to take forever to install over the network, but would start an "emergency holographic shell" on pty4. The 'echo *' trick and various other shell built-ins were very useful for exploring the system before /bin and /usr/bin are populated.
Random note. The most commonly "abused" Unix command is cat. The name is short for "concatenate", and its intended purpose was originally to combine 2 or more files at once.
Therefore every time you use it to spool one file into a pipeline, that is technically an abuse!
It's most definitely catenate. I understand catenate to mean chain and concatenate to be to chain together. Since "cat foo bar xyzzy" doesn't modify the files to join them in any way I don't think they're chained together.
Besides, ken & Co. aren't daft. con would be short for concatenate. :-)
I have long thought that some sort of zsh completion that detects that abuse of cat and converts it into the more appropriate `< file` might be a good idea. If it did it silently it probably wouldn't be worth it, but if it actually performed the substitution in front of you then it might help users get more comfortable with the carrot syntax.
The ` < file` has to appear at/near the end of the line, right? Using cat has the advantage of being able to read the line from left to right along with the data flow. I often add more piped commands to the end of a line as I refine it, while the source data remains the same. (To be fair, sometimes the opposite is true.)
Interestingly, there are some circumstances where you actually want "cat file | program" and not "program < file". The case I have in mind is when file is actually a named FIFO which was not opened for writing. If you use cat, program will still run and only reads to stdin will block (but it can perform other things, possibly in different threads). If you use '<', opening stdin will block and program will probably block altogether.
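A small sketch of the difference (the names are hypothetical):
mkfifo myfifo                  # FIFO with no writer yet

# redirection: the shell blocks in open() before the program even starts
some-program < myfifo

# cat: some-program starts immediately; only its reads from stdin block
# until something writes into the FIFO
cat myfifo | some-program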
~/desktop$ du -h c.dat
11G c.dat
~/desktop$ time cat c.dat | awk '{ print $1 }' > /dev/null
real 0m53.997s
user 0m52.930s
sys 0m7.986s
~/desktop$ time < c.dat awk '{ print $1 }' > /dev/null
real 0m53.898s
user 0m51.074s
sys 0m2.807s
cat CPU usage didn't exceed 1.6% at any time. The biggest cost is in redundant copying, so the more actual work you're doing on the data, the less and less it matters.
I was curious, so, here goes; 'foo' was a file of ~1G containing lines made up of 999 'x's and one '\n'.
$ ls -lh foo
-rw-r--r-- 1 ori ori 954M Sep 5 22:57 foo
$ time cat foo | awk '{print $1}' > /dev/null
real 0m1.631s
user 0m1.452s
sys 0m0.540s
$ time awk <foo '{print $1}' > /dev/null
real 0m1.541s
user 0m1.376s
sys 0m0.160s
This was run from a warm cache, so that the overhead of the extra IO from a pipe would dominate.
Both invocations take similar amounts of "real" time because the task is IO-bound and it takes roughly 1.5s on your machine to read the file.
But if you add up the "user" and "sys" time in the cat example, you see that it took 1.992s of actual cpu-time... Which is actually about a 30% increase in cpu-time spent.
The perf decrease wasn't visible because you have multiple cores parallelizing the extra cpu-time, but it was there.
So the two are different because awk's call to read() is effectively the same as a read directly from a file, whereas copying is taking place through the pipe with the pipeline approach?
Basically you see a linear increase in time. If it was going to take a coffee break's worth of time one way, it will take a slightly longer coffee break worth of time the other. It is fairly rare that the additional time involved matters and there isn't something else that you should be doing anyway.
The difference, assuming foo only reads stdin so `foo file` isn't possible, is that with the latter the shell opens file for reading on file descriptor 0 (stdin) before execing foo, and the only cost is the read(2)s that foo does directly from file. With the needless cat, cat has to read the bytes and then write(2) them, whereupon foo reads them as before. So the number of system calls goes from R to R+W+R (assuming all reads and writes use the same block size), and more byte copying may be required.
Heh. Be careful with this, though: ^ is the caret (note spelling) according to most sources of information about these things.
Random Fun Geekery Time: Back in the Before-Before, the grapheme in ASCII at the codepoint where ^ is now was an up-arrow character, which is why BASIC uses ^ for exponentiation even though FORTRAN, which came first and which early BASIC dialects greatly copied, uses **.
press enter to read elapsed time. If you write your activity in the prompt and repeat it for multiple activities, you have a nice time log. You can then just copy&paste it from terminal.
For all the pipe lovers in this thread, here is a Perl utility I wrote to help debug shell pipelines. I call it `echoin`: whatever it takes on stdin, it prints to stdout (presumably the terminal) while also treating its arguments as a command (sort of like xargs) and feeding a copy of its input to that command's stdin. So I can do:
foo | echoin bar
This is like `foo | bar`, but I can see what's passing between them. It's a bit like `tee`, but reversed. It's what I irrationally want `foo | tee - | bar` to do.
#!/usr/bin/perl
# echoin: copy stdin to our stdout while also piping it into the command given as arguments.
my $args = join ' ', @ARGV;
open OUT, "|$args" or die "Can't run $args: $!\n";
while (<STDIN>) {
    print $_;
    print OUT $_;
}
close OUT;
I use the "grep with color for lines plus the empty string" so frequently that I have a function for it:
function highlight() {
    local args=( "$@" )
    # Turn the first non-flag argument (the pattern) into "(pattern)|$":
    # the "|$" alternative matches the end of every line, so every line
    # passes through, but only the real pattern gets colored.
    for (( i=0; i<${#args[@]}; i++ )); do
        if [ "${args[$i]:0:1}" != "-" ]; then
            args[$i]="(${args[$i]})|$"
            break
        fi
    done
    grep --color -E "${args[@]}"
}
This is only to be used as a filter, since it mangles filenames.
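Typical filter-style usage (the log path is just an example):
tail -f /var/log/syslog | highlight ERROR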
I'm curious, does ack support a highlight-only mode?
Another 'less' abuse: using it as an interactive grep via '&' line filtering.
It's a newish feature of less (those of you with stale RHEL installs won't find it). Type '&<pattern><return>' and you'll filter down a listing to match pattern. Regexes supported. Prefixed '!' negates filter.
Wishlist: interactive filter editing (similar to mutt's mail index view filters), so you don't have to re-type full expressions.
egregious use of color (uses 256-color term support):
function hl() {
    local R=''
    # Build an alternation like "|foo|bar"; the leading empty branch matches
    # every line, so the whole stream passes through and only the given
    # patterns get colored.
    while [ $# -gt 0 ]; do
        R="$R|$1"
        shift
    done
    env GREP_COLORS="mt=38;5;$((RANDOM%256))" egrep --color=always "$R"
}
then do e.g.
whatever pipeline | hl foo bar baz | hl quux | hl '^.* frob.*$' | less -R
results in foo/bar/baz highlighted in one color, quux in another, and whole lines containing frob in another. hopefully the colors aren't indistinguishable from each other or from the terminal background :\
I use a somewhat similar setup in emacs, where a key binding adds the word under point to a syntax highlighting table, but the color is computed as the first 6 characters of the md5sum of the word.
It should be noted that grep-dot only prints filenames if you give it more than one file.
Also, it skips blank lines. But of course that might be in the feature-not-a-bug category; and if you really want to see every line, there's always grep-quote-quote:
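That is, an empty pattern, which matches every line, blank ones included (the filenames here are hypothetical):
grep '' config.yml secrets.yml    # every line of both files, each prefixed with its filename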
The point of the first one is he's abusing grep as cat; what he really wants is a cat that shows filenames, but since there's no such thing he uses "grep ." as a substitute.
Normally I'd have typed something like "grep -i 'something' foo | less" and wanted to just up-arrow to the previous line and change the grep stuff to cat. I don't know why, it doesn't really save me anything. Maybe it's the hackerish "because I can, that's why" instinct at work.
`head -n 99999` seems like a weird way to do it anyway. Wouldn't it make more sense to do `tail -n +1`? The output is the same from both commands, but `tail` doesn't require you to assume arbitrary limits.
Honest question, btw. I'm relatively inexperienced with Linux, and I certainly haven't used BSD. I'd appreciate any critiques you may have to offer.
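For what it's worth, a sketch of the comparison with made-up filenames:
head -n 99999 a.txt b.txt    # prints "==> file <==" headers, but silently cuts off anything past line 99999
tail -n +1 a.txt b.txt       # same headers, whole files, no arbitrary limit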