
Parallel processing with Unix tools - kiyanwang
http://www.pixelbeat.org/docs/unix-parallel-tools.html
======
Filligree
For GNU Parallel, if you pass it -k it will buffer all of its subcommands'
output and print it as if it had all been run serially, removing the need for
those subcommands to output anything atomically -- a requirement that usually
can't be guaranteed.

It's a capable tool, if somewhat... complicated.
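A quick way to see the effect (a sketch; the per-job sleep just makes jobs finish out of order):

    seq 1 5 | parallel -k 'sleep $(( {} % 3 )); echo job {}'

With -k the five "job N" lines come out in input order, even though the jobs themselves finish in a different order.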

~~~
protomikron
Thanks, I did not know that.

This is useful if you are solving embarrassingly parallel problems that
generate lots of log data (e.g. timings for benchmark) and write them to
stdout/stderr.

AFAIK POSIX guarantees atomicity when writing up to PIPE_BUF bytes (at least
512) to a pipe, so your sub-programs shouldn't write more than that much
log data in a single write, I would guess.
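
For reference, the actual limit is queryable; POSIX guarantees at least 512
bytes, and Linux uses 4096:

    getconf PIPE_BUF /

Writes up to that size land in the pipe in one piece; anything larger may be
interleaved with output from other writers.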

------
dahart
> parallel --will-cite

I _love_ gnu parallel, but that citation thing is a bit of a drag. You only
have to do this once, but I can see why you'd include the flag in a blog post
to avoid having to discuss the issue.

I wish Ole would relax on this. It's not really appropriate to demand that
anyone using parallel for academic work cite a magazine article; that's only
fitting for people doing research in parallel algorithms, and in any case
citing would be easier if there were a journal article to reference.

There's nothing wrong with asking nicely, and I try to spread the word about
gnu parallel. But the heavy-handed demand is annoying, even if I only deal
with it like once a year.

~~~
LeoPanthera
It's GPL, right? Why hasn't anyone forked it with the citation requirement
removed?

~~~
dahart
Good Q! I don't know, but I also feel like that could be a bit dirty, without
having other reasons to fork. I might not want to encourage or support that.

There are some (non-fork) projects that provide the same functionality, with
part of the stated motivation being the citation thing.

~~~
jwilk
Links to the other projects?

~~~
dahart
Here's one I was thinking of:
[https://github.com/gdm85/coshell](https://github.com/gdm85/coshell)

The comment about motivation came from a bug report about parallel being
"chatty": [https://github.com/Homebrew/legacy-
homebrew/issues/29060](https://github.com/Homebrew/legacy-
homebrew/issues/29060)

Nothing homebrew can do about it, of course, but this illustrates one of the
implications of parallel doing something unexpected; package managers have to
field the complaints.

~~~
jwilk
> _Return value will be the sum of exit values of each command._

That's an unfortunate design choice.

Exit status is 8-bit only, and some programs exit with high numbers¹, so the
sum could overflow easily.

¹ For example, when a Perl program dies, exit status is 255.
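
The 8-bit truncation is easy to demonstrate:

    $ bash -c 'exit 400'; echo $?
    144

400 mod 256 is 144, so even two Perl-style 255s summed this way would report
254 rather than 510.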

------
termie
Bash’s built-in wait is also handy when you want quick and simple parallelism.
[http://tldp.org/LDP/abs/html/x9644.html](http://tldp.org/LDP/abs/html/x9644.html)

~~~
philsnow
I do this pattern a lot

    
    
        for x in $(seq 1 10); do long_process file${x} & done; wait
    

It doesn't take care of not saturating my CPU or anything; when I need to care
about that, I try to remember how to use parallel.
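
For completeness, one way to bound concurrency while staying in plain bash (a
sketch; wait -n needs bash 4.3+):

    for x in $(seq 1 10); do
        while (( $(jobs -rp | wc -l) >= 4 )); do wait -n; done
        long_process file${x} &
    done
    wait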

~~~
enriquto
it's actually easier using parallel:

    
    
            for x in ...; do echo long_process file$x; done | parallel -j 8
    
    

EDIT: and in your case, it is even easier with xargs, e.g.:

    
    
            ls files* | xargs -n 1 -P 8 long_process

~~~
sigjuice
Wouldn't this fail if your file names have white space?

~~~
Anthony-G
Parsing `ls` is never a good idea[1]. A more robust way to use globbing in a
POSIX shell would be something like this:

    
    
        printf "%s\0" files* | xargs -0 -n 1 -P 8 long_process
    

1. [http://mywiki.wooledge.org/ParsingLs](http://mywiki.wooledge.org/ParsingLs)

~~~
enriquto
I disagree with the article that you linked. Filenames are variable names that
you get to choose. Half of the game is won by choosing them wisely, so that
they are convenient to use.

------
npx
I realize it's somewhat off topic, but I feel like Joyent Manta deserves an
honorable mention. It's an S3 style object store, but you can spin up
containers on top of objects and do massively parallel computations with Unix
tools.

[https://apidocs.joyent.com/manta/](https://apidocs.joyent.com/manta/)

------
jwilk
Note that, somewhat counter-intuitively, what "wc -l" counts is the number of
newline characters:

    
    
      $ printf 'foo\nbar' | wc -l
      1
    

There are arguably two lines in the input, but the result is 1.

Without this (mis)feature, "wc -l" would be more difficult to parallelize.
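
Counting terminators means chunks can be counted independently and summed,
e.g. with GNU parallel (a sketch):

    parallel -a big.txt --pipepart --block 64M wc -l | awk '{s+=$1} END {print s}'

Since --pipepart splits on newline boundaries, no "line" ever straddles two
chunks.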

~~~
LambdaComplex
The POSIX definition of a "line" is "A sequence of zero or more non-<newline>
characters plus a terminating <newline> character"[0]. I don't think it's
counterintuitive at all for POSIX utilities to respect that definition.

0. [http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_206](http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_206)

~~~
frou_dh
"terminator, not separator" is an easy way to remember it. I would call bar in
GP's comment a fragment.

------
mtreis86
Qt multithreading [http://doc.qt.io/qt-5/threads-technologies.html](http://doc.qt.io/qt-5/threads-technologies.html)

Aria for downloads [https://aria2.github.io/](https://aria2.github.io/)

Pigz and pbzip2 for compression
[https://zlib.net/pigz/](https://zlib.net/pigz/)
[http://compression.ca/pbzip2/](http://compression.ca/pbzip2/)

~~~
dahart
If you're zipping multiple files, is it better to pbzip2 each file, or to
parallel bzip2 them? You wouldn't want to parallel pbzip2, would you?

~~~
vthriller
By compressing multiple (large) files sequentially you'd be able to gradually
free some disk space much sooner, and you'd need less free space to write the
compressed data to.

~~~
katastic
Compressing large files / sets of files with 7zip means you can use all the
files as one dictionary for the compression (cross-file compression, called
"solid compression").

However, zip does not support solid compression. This creates the oddity that
zipping twice can reduce your file size: multiple duplicate files may compress
to the same data, but they are stored separately, and the second pass then
notices the similarity.

The downside of solid compression is that you have to extract all the files in
a block to get at any one of them. But with modern computers that's not as bad
as it used to be, and modern 7zip doesn't extract "all" the files, only the
ones in the affected block.
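
With 7z you can toggle the behavior to compare (solid mode is the default for
.7z archives):

    7z a -ms=on  solid.7z    somedir/    # one dictionary spans all files
    7z a -ms=off nonsolid.7z somedir/    # each file compressed independently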

------
wyoh
Rust Parallel is nice too:
[https://github.com/mmstick/parallel](https://github.com/mmstick/parallel)

~~~
agumonkey
Performance of the actual program might be less relevant, but it's nice to see
it using 1% for itself compared to GNU parallel.

~~~
SamJson
The performance is apparently close to xargs (if speed is key, why not just
use xargs?). Rust parallel does, however, have some issues that would rule it
out for me:

[https://www.gnu.org/software/parallel/parallel_alternatives.html#DIFFERENCES-BETWEEN-Rust-parallel-AND-GNU-Parallel](https://www.gnu.org/software/parallel/parallel_alternatives.html#DIFFERENCES-BETWEEN-Rust-parallel-AND-GNU-Parallel)

------
enriquto
do not forget about "make", which will traverse a user-supplied tree of
filenames in parallel

~~~
dahart
And make -j will execute the rules & build targets in parallel. Once I built a
general batch parallel system using make that could pause & resume parallel
jobs by using each job's log file as the make target. Then I discovered gnu
parallel and scrapped my project.
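
A minimal sketch of that kind of setup (hypothetical names; each job's log
file is the target, so a re-run resumes by skipping finished jobs):

    # Makefile: run with `make -j8`; recipe lines must start with a tab
    INPUTS := $(wildcard data/*.in)
    LOGS   := $(INPUTS:data/%.in=logs/%.log)

    all: $(LOGS)

    logs/%.log: data/%.in
    	mkdir -p logs
    	long_process $< > $@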

~~~
SamJson
Fun fact from
[https://www.gnu.org/software/parallel/history.html](https://www.gnu.org/software/parallel/history.html)

> [GNU Parallel] was originally a wrapper that generated a makefile and used
> make -j to do the parallelization.

~~~
dahart
Ha! Crazy, I didn't know that, thanks for the pointer. I recall having some
issues with scaling to very large jobs, when I got lists of targets too long
for make to deal with. (I think... I could be misremembering the details, but
something in my pipeline would fail with really big batches.) I wonder if
parallel split away from make when it ran into similar issues.

------
wookayin
This tool is also great: [https://github.com/greymd/tmux-xpanes](https://github.com/greymd/tmux-xpanes)

------
unixhero
One day I have to learn how to use xargs

~~~
cat199
It's pretty much 'apply' for shell arguments

[https://en.wikipedia.org/wiki/Apply](https://en.wikipedia.org/wiki/Apply)

use it in many places instead of a for loop:

    for x in *; do <cmd> "$x"; done

is:

    echo * | xargs cmd

Assuming 'cmd' can take multiple arguments.

If not, use 'xargs -L 1 cmd' to run cmd once per argument.

GNU xargs has -l, but -L 1 is portable to xargs on the BSDs and others...
plus it's good to know -L in case you ever need -L 2, etc.

using echo as a driver is not the best example, but anyhow.

~~~
cat199
also of note is:

-t : print the commands being run

-I % : interpolate the file name into the command string.

e.g.:

    ls *.mp4 | xargs -t -L 1 -I % scp % me@somebox:/movies

to scp individual files (substituted at %) to a destination, one at a time,
printing each command as it is run.

Again, not the best example, since you could just

    scp *.mp4 me@somebox:/movies

or rsync, etc., but you get the idea.

~~~
bitexploder
find has -exec as well. I tend to use this when searching for things.
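
e.g., batching arguments with +, or handing the results to xargs -P for a
parallel variant:

    find . -name '*.log' -exec gzip {} +
    find . -name '*.log' -print0 | xargs -0 -P 4 gzip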

------
known
OleTange, author of parallel chips in
[https://www.reddit.com/r/programming/comments/5x39jh/counting_lines_60_faster_than_wc_with_clojure/](https://www.reddit.com/r/programming/comments/5x39jh/counting_lines_60_faster_than_wc_with_clojure/)

------
catern
See also
[http://catern.com/posts/pipes.html](http://catern.com/posts/pipes.html)

------
YSFEJ4SWJUVU6
The mentioned tool 'turbo-linecount' is, well, odd. It targets what I'd think
is a very niche application of counting lines very fast (a domain usually
limited by I/O speed), using a rather complex design... and then manages to
throw much of the gained advantage away by using what is perhaps one of the
slowest ways of counting newlines in a buffer.

------
tejtm
And don't forget the humble pipe: the next stage may start processing before
the previous one(s) have finished.

~~~
grigjd3
You need to be aware of the buffer size of the pipe. It can be one of those
issues that never comes up until you cross the threshold, and then everything
fails ungracefully.
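
You can see the threshold directly: on Linux a pipe holds 64 KiB by default,
so the writer below blocks until the reader starts draining, which shows up in
dd's elapsed time (a sketch):

    dd if=/dev/zero bs=1k count=128 | { sleep 5; cat > /dev/null; }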

------
gigatexal
Command line goals.

------
alvil
I like doing this with newLISP :)

    
    
      ;calculate primes in a range
      (define (primes from to)
        (local (plist)
            (for (i from to)
                (if (= 1 (length (factor i)))
                    (push i plist -1)))
            plist))
    
      (set 'start (time-of-day))
    
      ; start child processes
      (spawn 'p1 (primes 1 1000000))
      (spawn 'p2 (primes 1000001 2000000))
      (spawn 'p3 (primes 2000001 3000000))
      (spawn 'p4 (primes 3000001 4000000))
    
      ; wait for a maximum of 60 seconds for all tasks to finish
      ; returns true if all finished in time
      (sync 60000) 
    
      ; p1, p2, p3 and p4 now each contain a list of primes
      (println "time spawn: " (- (time-of-day) start))
      (println "time simple: " (time  (primes 1 4000000)))
    
      (exit)

------
zackmorris
My biggest gripe with UNIX is that a command like:

    (sleep 5 && echo hello) > test.txt

causes a race condition where test.txt is created empty, then written with
"hello" 5 seconds later. Try it without the parentheses to see it wait. UNIX
doesn't apply our intuitive notion of order of operations to its piping; it
just runs everything simultaneously. This allows for tremendous efficiency and
concurrency, but it's hard to fathom how much it has cost us in bugs and lost
development time.

I feel like this was a lost opportunity because it prevented the Actor model
(as seen in Erlang and Go) from really taking off decades ago. Perhaps this
bug/feature was one of the motivations for commands like "parallel".

Does anyone have a general workaround for this problem? Some command that we
could insert in the chain to force a wait, without having to install any
external tools? Thanx!

Edit: I'm having a hard time explaining how insidious this race condition is
for people who haven't encountered it yet. The gist of it is that file
descriptors aren't opened when a pipe sends its first byte, _they are opened
when the shell command is interpreted._ I'm also having a hard time finding
examples; here is one, I think, though there are many, many others:
[https://unix.stackexchange.com/questions/174788/am-i-hitting-a-race-condition-in-bash](https://unix.stackexchange.com/questions/174788/am-i-hitting-a-race-condition-in-bash)

~~~
adamkruszewski
I think it does; try it as:

    sleep 5 && (echo hello > test.txt)

and you'll see it creates the file only after the sleep ends.
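
Side by side, the difference is when the redirection is set up: at parse time
versus after the sleep:

    (sleep 5 && echo hello) > test.txt &    # test.txt exists, empty, immediately
    sleep 5 && (echo hello > test.txt) &    # test.txt appears after 5 seconds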

