
Parallelizing Jobs with xargs - r11t
http://www.spinellis.gr/blog/20090304/
======
yason
I'm afraid I'm going against the Unix idiom of combining simple tools to do
more advanced stuff; I can't resist here ;-)

While it is idiomatic in Unix to use xargs for parallelising batch runs, I
found it pretty cumbersome: you have to be really careful with quotes, file
names, and command lines containing spaces to make sure the resulting command
line is well-formed and doesn't break something serious.

Moreover, xargs does have its uses, but I mostly use it for trivial things
where I can be sure it works. The typical xargs idiom is to feed it a list of
files, most often from the find command, as in "find . -name _out |
xargs rm -r". That's why there's -0 in xargs and the matching -print0 in
find.
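For reference, the -print0/-0 pairing sidesteps the whitespace problem by
NUL-terminating each file name (the pattern here is just a placeholder):

```shell
# NUL-delimited handoff: file names containing spaces or even
# newlines survive intact on the way from find to xargs.
find . -name '*_out' -print0 | xargs -0 rm -r
```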

I wrote a small utility myself (<http://code.google.com/p/spawntool/>) that
reads stdin, treats each line as a complete command line to be passed
directly to system(), and manages the parallelisation, keeping up to N
processes running at a time.

This is pretty useful for feeding in _any_ batch of commands, even unrelated
ones (not derived from a list of files). You could also feed the same input
stream or file straight to 'sh' (for compatibility cross-checking), or you
could verify the command lines in plaintext before daring to run them with
either sh or spawntool. This would be like ... | xargs sh without the
whitespace and expansion headaches.

It's pretty easy to generate complete command lines yourself and much safer
than letting xargs join stuff together.
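A sketch of what I mean (the log-file pattern and the gzip job are just
placeholders): build one fully quoted command line per file with bash's
printf %q, eyeball the plaintext, then hand the stream to sh or spawntool:

```shell
# Build complete, safely quoted command lines (bash's printf %q
# escapes spaces and shell metacharacters), one per input file.
find . -name '*.log' -print0 |
  while IFS= read -r -d '' f; do
    printf 'gzip %q\n' "$f"
  done > commands.txt

cat commands.txt          # verify the command lines in plaintext
sh commands.txt           # or: spawntool < commands.txt
```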

------
mattj
Running these two commands in series likely vastly overstates the performance
gains - almost all your time is going to be spent in I/O, and the second time
around you'll have a good chunk (if not all) of it in the disk cache. Try
running both a few times each and see if you still enjoy the same gains (on
my iPhone right now, so I can't do this myself).

~~~
delano
You are correct, sir!

    
    
         $ time find . -newerct '10 hours ago' -print
         real	0m4.262s
         user	0m0.090s
         sys	0m1.014s
    
         $ time find . -newerct '10 hours ago' -print     
         real	0m0.516s
         user	0m0.057s
         sys	0m0.251s
    
         $ time find . -newerct '10 hours ago' -print     
         real	0m0.302s
         user	0m0.056s
         sys	0m0.244s
    
         $ time find . -newerct '10 hours ago' -print     
         real	0m0.311s
         user	0m0.057s
         sys	0m0.251s
    
         $ find . | wc -l
         13167
    

Is there a way to reliably flush the disk cache besides restarting?

~~~
jws
Speaking of Linux, yes. You can echo mysterious numbers into
/proc/sys/vm/drop_caches to flush various caches.
<http://jim.studt.net/depository/index.php/flushing-caches-for-benchmarking-in-linux>
for a one-liner, <http://linux-mm.org/Drop_Caches> for the info from the
horse's mouth.
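For the record, the one-liner amounts to this (Linux-specific, needs root):

```shell
# Write dirty pages out first, then drop the caches:
#   1 = page cache, 2 = dentries and inodes, 3 = both.
sync
echo 3 > /proc/sys/vm/drop_caches
```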

As an example, consider running md5sum on a directory of 1436 jpeg files of
about 50k each on an Atom 330 ( 2 dies, 2 hyperthreads/die). The _8way_
numbers use xargs with -n8 and -P8. The _flushed cache_ entries indicate that
all three caches have been flushed immediately before the command ...

    
    
                        elapsed   user   system
         Cold cache:       11.98   0.56   1.75  ========================
         second run:        2.71   1.38   1.76  =====
          third run:        3.17   1.40   1.75  ======
      flushed cache:       12.11   0.58   1.80  ========================
      flushed cache, 8way: 12.64   1.00   2.27  =========================
         second run, 8way:  1.22   1.15   2.52  ==
          third run, 8way:  1.17   1.26   2.26  ==
    

I can't explain the oddity that user time seems abnormally low with a flushed
cache, or perhaps abnormally high with the data already cached, particularly
in the single-threaded version.

~~~
delano
Great, thanks.

What did you use to generate that output?

~~~
jws
'time', and typing. The graph bars are my right index finger while I count in
my head. I should probably patent that before it gets out. Wait, I just
published. I'm screwed.

~~~
prewett
You have a year after first publication to patent. But you have successfully
prevented anyone else from patenting (unless they already applied).

------
IsaacSchlueter
dtach is great for long-running jobs, too. If you pipe the output to a file,
you can even log out and check back later.

I use this function to pass stuff off to a detached process:

    
    
        # usage:
        # headless "some_long_job" "long_job"
        # go get some tea, and come back
        # headless "" "long_job" (to join that session)
        # still not done, so ^\ to detach from it again
        # Usually, I pipe the output of some_long_job to
        # a file, so I can peek in on it easily
        headless () {
          if [ -z "$2" ]; then
            hash=$(md5 -qs "$1")   # md5 -qs is OS X's md5
          else
            hash="$2"
          fi
          if [ -n "$1" ]; then
            dtach -n "/tmp/headless-$hash" bash -l -c "$1"
          else
            dtach -A "/tmp/headless-$hash" bash -l
          fi
        }

------
mblakele
When using this sort of trick, I also find it useful to throw in GNU screen,
nohup, or the bash 'disown' command.

------
aolnerd
I find that xargs is the most convenient way to achieve parallelism for quick
and easy batch work. Just write your script to receive its unit of work as a
command argument (or as multiple args if starting a process is a heavy
operation). Use any language. Utilize all your cores.
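A minimal sketch of that pattern (echo stands in for your real worker
script; -n1 hands each unit of work to one invocation, -P sets the number of
parallel processes):

```shell
# One unit of work per invocation, four workers at a time.
printf '%s\n' job1 job2 job3 job4 | xargs -n1 -P4 echo
```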

