Using xargs to do Parallel Processing (labrat.info)
36 points by martinaglv on Dec 14, 2012 | 15 comments



I tried it out on my late 2008 MacBook Pro (Lion, CoreDuo, SSD). Doesn't look so hot:

  Eds-MacBook-Pro workspace$ time (find . -name '*java'|parallel grep Message > /dev/null)
  
  real	0m2.176s
  user	0m1.111s
  sys	0m1.789s
  Eds-MacBook-Pro workspace$ time (find . -name '*java'|xargs grep Message > /dev/null)
  
  real	0m0.097s
  user	0m0.035s
  sys	0m0.061s
GNU Parallel was installed using homebrew. I don't know what compiler flags were used, but it's 22 times slower. I ran each multiple times but there was no significant difference.

From the 'sys' time, I'd guess the native xargs is probably not using a POSIX call?


The difference is that parallel, by default, starts a new process for each argument, whereas xargs packs as many arguments as it can into a single invocation of the command. You can see this with:

    seq 1000 | xargs echo | wc -l
    1

    seq 1000 | parallel echo | wc -l
    1000
So xargs has only run echo once, whereas parallel has run it 1000 times.

If we force xargs to run only one argument at once, things look better:

    time ( seq 1000 | xargs -n 1 echo | wc -l )
     1000 
    real	0m1.478s
    time ( seq 1000 | parallel echo | wc -l )
       1000
    real	0m5.536s
Although that's still not great for parallel.
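
The middle ground between those two extremes is -n with a larger batch size. Here is a quick sketch of that; count_invocations is just a helper name made up for illustration, and it relies on each echo call printing exactly one line:

```shell
#!/bin/sh
# count_invocations TOTAL BATCH (hypothetical helper, not from the thread):
# feed TOTAL numbers to xargs, pass BATCH arguments to each echo, and count
# the lines printed. Each echo prints one line, so lines == invocations.
count_invocations() {
    # tr strips the padding that some wc implementations add
    seq "$1" | xargs -n "$2" echo | wc -l | tr -d ' '
}

count_invocations 1000 1      # prints 1000: one process per argument
count_invocations 1000 100    # prints 10
count_invocations 1000 1000   # prints 1: everything in a single process
```

Fewer, larger batches mean less process start-up overhead, which is most of what the timings above are measuring.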

However, we still haven't taken advantage of parallelisation. This is where the real strength of parallel comes in, and where (I suspect) a bunch of the slowdown comes from.

If we run xargs in parallel (with --max-procs=4), we get a much faster real time, but the output comes out shuffled, because xargs just lets each process write whenever it wants. With programs that produce multiline output, the lines from different processes would be interleaved.

On the other hand, when we parallelise with parallel, the output is still all nicely sorted in order, as parallel stores up the output of each program, and outputs them in the correct order. This does create some overhead, but means the output is much more readable. In your example, if you parallelised your xargs with --max-procs, you would find the greps of different files mixed together, but not with parallel.
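
To make the ordering point concrete, here is a rough sketch using only xargs. The helper names and the temp-file scheme are my own assumptions for illustration; this imitates the idea of buffering each job's output, and is not how parallel is actually implemented. It also assumes an xargs with -P (GNU and BSD both have it).

```shell
#!/bin/sh
# Both helpers read one number per line on stdin. Each "job" sleeps for that
# many seconds and then prints a line, so jobs finish in a different order
# than they were started.

# Plain xargs -P: lines appear whenever each job happens to finish.
unordered() {
    xargs -P 4 -n 1 sh -c 'sleep "$1"; echo "job $1"' _
}

# Imitation of output grouping: job K writes to file K in a scratch
# directory, and the files are printed in input order after xargs returns
# (xargs waits for all of its children before exiting).
ordered() {
    tmp=$(mktemp -d) || return 1
    # awk prefixes each input line with its line number: "1 2", "2 0", ...
    awk '{ print NR, $0 }' |
        xargs -P 4 -n 2 sh -c 'sleep "$2"; echo "job $2" > "$0/$1"' "$tmp"
    for f in $(ls "$tmp" | sort -n); do
        cat "$tmp/$f"
    done
    rm -rf "$tmp"
}
```

Feeding both helpers the same input, for example printf '2\n0\n1\n', the first will usually print the jobs in completion order, while the second always prints them in input order. The cost is the extra bookkeeping, which is the overhead mentioned above.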

Wow, wrote more than I intended. Basically there is a difference, but it isn't quite as much as you think due to differences in default behaviour. However, parallel also does more stuff once you start parallelising!


The parallel manpage is quite informative on all this. There's even a specific section "DIFFERENCES BETWEEN xargs AND GNU Parallel".

You're right, the performance difference is due to the way grep is being invoked.

From the parallel man page:

EXAMPLE: Parallel grep

grep -r greps recursively through directories. On multicore CPUs GNU parallel can often speed this up.

       find . -type f | parallel -k -j150% -n 1000 -m grep -H -n STRING {}
This will run 1.5 job per core, and give 1000 arguments to grep.
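
For comparison, roughly the same idea with xargs might look like the sketch below. It assumes an xargs that supports -0 and -P (both GNU and BSD xargs do, but they are extensions to POSIX). grep_tree is a hypothetical helper name, and the job count of 4 is an arbitrary choice where the man page example uses 150% of the cores. Unlike the -k version above, nothing here keeps the output of different grep processes in order.

```shell
#!/bin/sh
# grep_tree PATTERN DIR (hypothetical helper): search every regular file
# under DIR, handing grep up to 1000 file names per process and running
# up to 4 grep processes at a time.
# -print0 / -0 keep file names containing spaces or newlines intact.
grep_tree() {
    find "$2" -type f -print0 | xargs -0 -P 4 -n 1000 grep -H -n -- "$1"
}
```

Called as grep_tree Message ., this batches the files the way plain xargs does, so it avoids the one-process-per-file cost seen in the timings at the top of the thread.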


I didn't realise that parallel is a Perl script, which means it's going to lose when running very short-lived processes, as in my example.


True. Unless you need some of the facilities in GNU Parallel that you do not find in xargs (such as --group which is default, running on remote machines, -L with -I, or --pipe for processing piped data).


Cold cache?



My most common use of xargs -P is compressing logs on remote hosts before downloading them:

    cat RemoteHosts | xargs -P 0 -n 1 -I Host rsh Host gzip -9 xxx/*.log

The cool thing is that it waits until all the commands have completed before returning.
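
The "waits for everything" behaviour is easy to check locally without any remote hosts. A minimal sketch, assuming an xargs with -0 and -P; compress_logs is a made-up helper name, and the limit of 4 processes is arbitrary:

```shell
#!/bin/sh
# compress_logs DIR (hypothetical helper): gzip every *.log file under DIR,
# one file per gzip process, with up to 4 processes running at once.
# xargs only returns after the last gzip has exited.
compress_logs() {
    find "$1" -type f -name '*.log' -print0 | xargs -0 -P 4 -n 1 gzip -9
}
```

Because the function does not return until every gzip has finished, anything run right after it (a download, say) can rely on all the .gz files being complete.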


With GNU Parallel it looks like this:

  parallel --nonall --slf RemoteHosts -j0 gzip -9 xxx/*.log
If you have multiple CPUs on your remote hosts and have GNU Parallel installed there:

  parallel --nonall --slf RemoteHosts -j0 --arg-sep , parallel gzip -9 ::: xxx/*.log


rsh or ssh? All my VPSs have rsh disabled.


If you are compressing a single large file, you may find "pigz" of interest -- it's a gzip that uses multiple threads pretty well.


Ever read the man page for xargs? It reads just like the prescribing information for VIOXX.


Google "ppss".


make



