
GNU Parallel - ingve
https://www.gnu.org/software/parallel/
======
base698
As always gets brought up when GNU Parallel is mentioned: xargs covers most of
the use cases you'd need parallel for.

xargs -n1 -P4

This passes at most one argument from the list to each invocation and runs up
to 4 jobs at once.
[http://stackoverflow.com/questions/28357997/running-programs...](http://stackoverflow.com/questions/28357997/running-programs-in-parallel-using-xargs)
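
A minimal sketch of those two flags in action, assuming an xargs that supports `-P` (GNU and BSD do; POSIX does not require it). The job body and the numbers 1 to 8 are made up for illustration:

```shell
# Feed 8 arguments to xargs; -n1 gives each invocation exactly one argument,
# -P4 allows up to 4 invocations to run at the same time.
# "_" fills $0 of the inner shell so the real argument arrives as $1.
result=$(printf '%s\n' 1 2 3 4 5 6 7 8 |
    xargs -n1 -P4 sh -c 'echo "job $1"' _ |
    sort)   # completion order is not deterministic, so sort for display

printf '%s\n' "$result"
```

The sort is only there because jobs may finish in any order; xargs itself makes no ordering promise once `-P` is greater than 1.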

~~~
lyle_nel
I have a small cluster of machines that I run experiments on. GNU parallel
makes the dispatch of jobs on remote machines very easy.

In addition, I often use it to search for sequences by running grep in
parallel. For example:

$ parallel 'grep -e {1} haystack.txt' :::: many_needles.txt

where {1} is replaced by a single line from many_needles.txt.

~~~
bane
If you find yourself searching lots of haystacks, and your needles are just
text and not a regex, a better approach is to stuff all the needles into some
kind of index, then chop up the haystack into overlapping tiles (of variable
width from the smallest needle to the largest), then search each tile against
the index of needles. This effectively searches all the needles at once and
turns the operation from O(n) where n is the number of needles to O(m) where m
is the number of tiles in haystack.txt.

It may seem to be a trivial difference, but then you can search multiple
haystacks at once fairly easily, and this approach scales to hundreds of
millions of needles at once. The code for it isn't very difficult either; heck,
you can just use an in-memory SQLite DB to get a searchable, temporary index
and rely on some of the most tested software in history.
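
A rough sketch of the tiling idea, assuming fixed-string needles. An awk associative array stands in for the index here (the comment suggests SQLite; this is a swapped-in simplification), and the file names and sample sequences are invented for the example:

```shell
# Hypothetical sample data; the file names are made up for this sketch.
tmp=$(mktemp -d)
printf '%s\n' 'ACGT' 'TTGA' 'GGCCA' > "$tmp/needles.txt"
printf '%s\n' 'AAACGTAA' 'CCCCCCCC' 'TGGCCATT' > "$tmp/haystack.txt"

matches=$(awk '
    # First file: load every needle into a hash table (the "index")
    # and track the shortest and longest needle length.
    NR == FNR {
        if ($0 == "") next
        needle[$0] = 1
        n = length($0)
        if (min == 0 || n < min) min = n
        if (n > max) max = n
        next
    }
    # Second file: slide overlapping tiles of every width in [min, max]
    # across the line and look each tile up in the index.
    {
        len = length($0)
        for (w = min; w <= max; w++)
            for (i = 1; i + w - 1 <= len; i++) {
                tile = substr($0, i, w)
                if (tile in needle)
                    print FNR ": " tile
            }
    }
' "$tmp/needles.txt" "$tmp/haystack.txt")

printf '%s\n' "$matches"
rm -rf "$tmp"
```

Each haystack line is scanned once per tile width, and every lookup is a hash probe, so the cost depends on the number of tiles rather than on the number of needles.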

~~~
agentgt
This also works for sorting, where it is typically called radix sort or
bucket sort.

Basically, you use distinguishing attributes of the data to divide and conquer
(e.g. for radix sort you make buckets based on the digits).

------
noja
GNU Parallel - an amazing tool with the most user-unfriendly,
brick-wall-in-your-face documentation imaginable. Shame really - it's great.

~~~
flatfilefan
To be honest, the same can be said of many GNU tools. I, at least, still have
moments of being totally lost in front of a man page from time to time.
Parallel has a nice tutorial:
[https://www.gnu.org/software/parallel/parallel_tutorial.html](https://www.gnu.org/software/parallel/parallel_tutorial.html)
Have you seen it?

~~~
Mahn
I always get the feeling that most man pages are written for people who
already know how to use the commands, rather than people new to them. For
example, if I already understand how tar works and have a general idea of how
to use it, man tar is great to drill down and find specific options and
switches that I need or don't remember, but if I have no idea what tar files
are or why I would want them, the man page really doesn't help much in
explaining things.

I guess you could argue this is supposed to be so and man pages are doing
their job, since they are documentation and not tutorials, but still.

~~~
kobeya
Info pages are better for that.

~~~
ams6110
Info pages should be banished from the face of the earth.

[https://xkcd.com/912/](https://xkcd.com/912/)

~~~
kobeya
I don't get it. The info pages usually have way more information available,
and are proper software manuals with plenty of worked out examples, voluminous
descriptive text, and hyperlinks to related resources. Try `info sed` and
compare that with the bare breakdown of command line options and basic syntax
of `man sed`.

I'm not sure what that comic has to do with anything.

~~~
ams6110
Man pages are (in my experience) almost always viewed on a text console or
terminal. Info pages work poorly there, and it's just frustrating to be
looking for some information in a man page and have to start up some other
program with a different UX to complete the task.

Man page standards have places for "See also" and "Examples" and that is good
enough.

------
GolDDranks
There is also an in-development GNU Parallel clone/alternative written in
Rust.
[https://github.com/mmstick/parallel](https://github.com/mmstick/parallel)

~~~
stymaar
Being written in Rust is not its main feature (and I think you're being
downvoted because people don't like fanboys).

This project is cool because it has very low overhead, which matters if you
want to parallelize tasks that are not CPU intensive (but is mostly irrelevant
if the CPU usage of each task is high).

~~~
ole_tange
Rust-parallel _is_ fast, and there is clearly a niche here that GNU Parallel
is unlikely to fill: by design, GNU Parallel will never require a compiler, so
that you can use it on old systems with no compilers (think of an old, dusty
AIX box that people have forgotten the root password to). This design decision
limits how fast GNU Parallel can be compared to compiled alternatives.

But the main problem with rust-parallel is that it is not compatible with GNU
Parallel (and according to the author, it probably never will be 100%
compatible). If you use rust-parallel to walk through GNU Parallel's tutorial
(man parallel_tutorial) you will see it fails very quickly.

(Full disclosure: I am the author of GNU Parallel. I fully support building
other parallelizing tools, but to avoid user confusion, I would recommend
calling them something other than 'parallel' if they are not actually
compatible with GNU Parallel. History has shown that using the same name will
lead to a lot of unnecessary grief: e.g. GNU Parallel vs. Parallel from
moreutils).

------
sheraz
Does anyone else remember Slashdot and the endless threads about Beowulf
clusters? That was back when "parallel computing" was overly complicated and
rather opaque. And most of us had no idea how to take advantage of multiple
machines.

~~~
ole_tange
Sure do. Makes you feel that the future is now!

------
blockoperation
I use parallel for pretty much all batching these days. It's useful even when
you don't need parallelisation – here's a simple transcoding example:

    
    
      parallel -j1 'ffmpeg -i {2} -c:v libx264 -tune film -preset veryslow -crf 18 -vf scale={1} -c:a libfdk_aac -vbr 5 conv/{2.}-{1}.mp4' ::: hd480 hd720 hd1080 ::: *.mp4
    

Obviously in the real world you would want to take some extra steps to avoid
making upscaled versions, but this is just a rough example.
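
For readers unfamiliar with multiple `:::` input sources: each combination of one value from the first source and one from the second becomes a job. The plain-shell sketch below only prints the output names those combinations would produce; the two file names are hypothetical, and it does not try to reproduce the order in which GNU Parallel schedules jobs:

```shell
# Hypothetical input names; the original command uses *.mp4 in the current directory.
files="intro.mp4 talk.mp4"

outputs=$(
    for scale in hd480 hd720 hd1080; do      # first ::: source  -> {1}
        for f in $files; do                  # second ::: source -> {2}
            # ${f%.*} strips the extension, much like {2.} in GNU Parallel.
            printf 'conv/%s-%s.mp4\n' "${f%.*}" "$scale"
        done
    done
)

printf '%s\n' "$outputs"
```

Three scales times two files gives six output names, one per ffmpeg invocation.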

------
limaoscarjuliet
I use it daily for a parallel build process and I love it. Easy to set up, easy
to deal with. The documentation was tough though.

~~~
DigitalJack
That's interesting. Make is parallelizable too, although I occasionally run
into projects that won't build correctly if you use the feature.

------
realworldview
prll [https://github.com/exzombie/prll](https://github.com/exzombie/prll) is a
very pleasant and approachable alternative that I've used in preference to GNU
Parallel.

~~~
ole_tange
Just be aware that stderr will contain stuff you did not ask for:
[https://www.gnu.org/software/parallel/man.html#DIFFERENCES-B...](https://www.gnu.org/software/parallel/man.html#DIFFERENCES-BETWEEN-prll-AND-GNU-Parallel)

The syntax difference is small for simple tasks.

------
cmax
parallel is nice. In the past I would run a small script like
[https://gist.github.com/CMAD/3077918](https://gist.github.com/CMAD/3077918)
because I had to migrate email accounts from a list, and it was all very old
servers with a broken package manager. Good old times; now I will have to
run some super convoluted orchestration.

------
known
cat 100GB_data_file_with_DUPLICATE_lines | parallel --pipe awk \''!a[$0]++'\' > data_file_with_UNIQUE_lines

is BETTER since it uses less computer resources than

gawk '!a[$0]++' 100GB_data_file_with_DUPLICATE_lines > data_file_with_UNIQUE_lines
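
One caveat worth checking before relying on this: `--pipe` splits the input into blocks and runs a separate awk on each, so every awk keeps its own table of seen lines. The sketch below simulates that with two hand-made chunks instead of calling parallel (the six-line input and the chunk size are made up), and suggests that duplicates spanning chunks survive unless a final single pass is added:

```shell
# Sample input with duplicates that straddle the (hypothetical) chunk boundary.
data=$(printf '%s\n' a b a c b a)

# One awk process sees every line, so duplicates are removed globally.
global=$(printf '%s\n' "$data" | awk '!seen[$0]++')

# Simulate two independent jobs, each seeing only its own 3-line chunk.
chunked=$(
    printf '%s\n' "$data" | head -n 3 | awk '!seen[$0]++'
    printf '%s\n' "$data" | tail -n 3 | awk '!seen[$0]++'
)

# A final single-process pass over the chunk results restores global uniqueness.
merged=$(printf '%s\n' "$chunked" | awk '!seen[$0]++')

printf 'global:  %s\n' "$(printf '%s\n' "$global"  | paste -sd ' ' -)"   # a b c
printf 'chunked: %s\n' "$(printf '%s\n' "$chunked" | paste -sd ' ' -)"   # a b c b a
printf 'merged:  %s\n' "$(printf '%s\n' "$merged"  | paste -sd ' ' -)"   # a b c
```

With the real tool, the equivalent fix would be to pipe the output of the parallel stage through one more `awk '!a[$0]++'`, at the cost of that last pass being sequential.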

~~~
snvzz
That's a classic misuse of cat.

Just use '<'.

~~~
kamijo-k
Yeah, but to me, this "misuse" actually makes the input and output clearer
than using "<" and ">" together. After all, it places input on the left side
of the command, and output on the right side.

------
flatfilefan
GNU Parallel rules! I made $100K with just a few lines of code involving
Parallel.

~~~
tbrock
Tell the story!

~~~
flatfilefan
Very simple really, but so is UNIX :-)

parallel -j 1000 -a urls.txt {wget --recursive {}|lynx --dump|(other unix
shell power tools) > url.txt}

gets you a powerful web scraper that maxes out your internet connection.

Throw some machine learning on top of it and you can do a lot with that. I
can't tell you exactly what I did with the ML, that being a trade secret about
to get a patent, but the rest doesn't involve parallel anyway.

~~~
witty_username
OK, forget technical details; how's the money made?

~~~
flatfilefan
better sales targeting

------
wingless
Am I the only person who finds GNU parallel way too complicated? I tried to
perform a very easy parallel task with it and spent hours reading the
documentation and various tutorials. If a person with Unix command-line skills
can't easily pick it up, what's the point of having it?

~~~
tyingq
There do seem to be some very complex use cases. On the other hand, their
example of parallel gzip of files seems straightforward:

find . -name '*.html' | parallel gzip --best

Generally, using it in places where you would normally use xargs seems
uncomplicated.
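
For comparison, a hedged sketch of the same gzip job done with xargs, assuming a find and xargs that support `-print0`, `-0`, and `-P` (GNU and BSD versions do). The scratch directory and file names are invented so the example is self-contained:

```shell
# Hypothetical scratch directory with a few sample files.
dir=$(mktemp -d)
for name in a b c; do
    printf '<html>%s</html>\n' "$name" > "$dir/$name.html"
done

# -print0 / -0 keep file names with spaces or newlines intact;
# -n1 -P4 runs one gzip per file, up to 4 at a time.
find "$dir" -name '*.html' -print0 | xargs -0 -n1 -P4 gzip --best

compressed=$(find "$dir" -name '*.html.gz' | wc -l | tr -d ' ')
echo "compressed files: $compressed"
rm -rf "$dir"
```

The parallel version reads one file name per line instead, which is shorter to type.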

~~~
amelius
Probably a stupid question, but how do the remote machines translate the
pathnames to local pathnames? And what happens if they fail?

~~~
tyingq
The example above is just running in parallel locally.

------
lowry
GNU Parallel sucks. Use xargs when possible and paexec when needing fancy
features. BTW, paexec even supports piping next to process invocation.

~~~
ole_tange
Do you have an example of "piping next to process invocation"? It sounds
interesting.

