GNU Parallel

dang · on April 2, 2022

GNU Parallel Cheat Sheet [pdf] - https://news.ycombinator.com/item?id=19330356 - March 2019 (63 comments)

GNU Parallel - https://news.ycombinator.com/item?id=13258142 - Dec 2016 (83 comments)

GNU Parallel Tutorial - https://news.ycombinator.com/item?id=12943150 - Nov 2016 (65 comments)

A Million Text Files and a Single Laptop - https://news.ycombinator.com/item?id=11248326 - March 2016 (27 comments)

GNU Parallel – The command line power tool - https://news.ycombinator.com/item?id=6209767 - Aug 2013 (28 comments)

GNU/Parallel changed my life - https://news.ycombinator.com/item?id=1894639 - Nov 2010 (8 comments)

GNU Parallel - build and execute command lines from standard input in parallel - https://news.ycombinator.com/item?id=1801186 - Oct 2010 (36 comments)

bheadmaster · on April 2, 2022

GNU Parallel is one of my favorite utilities of all time.

I used to write complex and bug-ridden scripts that used bash job control to implement parallel execution of batch jobs. When I discovered GNU Parallel, I deleted them all and never looked back.

Also, the documentation is awesome - there is a whole book [0] on GNU Parallel, and the manpage even links to a series of youtube videos [1] that explain how it works.

[0] https://zenodo.org/record/1146014/files/GNU_Parallel_2018.pd...

[1] http://www.youtube.com/playlist?list=PL284C9FF2488BC6D1

winrid · on April 3, 2022

For those that don't know, it's written in Perl!

Such a life saver of a tool. I used GNU Parallel to run a script locally that did what a big distributed system did, quicker and more reliably. It got to the point where people would just ask me to "run the thing" on my laptop instead of waiting for the cron.

Mirror: https://github.com/martinda/gnu-parallel

WhatIsDukkha · on April 2, 2022

This can be a good entry into the "why" for many people that use xargs now -

https://www.gnu.org/software/parallel/parallel_alternatives....

It has a lot of features that can feel excessive at first glance but if you have felt some pain in building jobs, most of it is pretty sensible and much better than rollyourown.

cb321 · on April 2, 2022

The comparison is not very fair to modern day xargs.

`nproc` is a relatively standard utility (coreutils). So, xargs -P$(nproc) gets you core (or core-proportional) parallelism.

Grouping output/Making a safe parallel grep is also easy-ish with `--process-slot-var=slot` and sending to `tmpOut.$slot`.

Jobs on remote computers can be done similarly with any kind of `arrayVar[$slot]` setup where `arrayVar` has a bunch of `ssh` targets, possibly duplicates if you want to run >1 job per host. (In pure POSIX sh you could use eval and $1, $2 positional args with shell arithmetic..)

Anyway, those three are just off the top of my head, unfairness-wise. Last I looked at the source for GNU parallel it looked like mountains upon mountains of Perl I would rather not depend upon, personally, but to each his own.

arrakeen · on April 3, 2022

> Last I looked at the source for GNU parallel it looked like mountains upon mountains of Perl I would rather not depend upon, personally, but to each his own.

i used parallel for years under the assumption that it was written in C and only recently learned it was written in perl when i decided to dive deeply into its documentation. if you're using a package manager to install parallel and it runs fast enough for your needs (it does) then who cares what language it was implemented in?

cb321 · on April 3, 2022

Some users like to add their own features or have problems to debug. They surely care. Others may want to move coordination to a remote host and care about some single-file transfer of the exact version of the tool. They also care. It's ok that you don't, personally, of course.

forgotmypw17 · on April 2, 2022

Perl has retained stable backwards compatibility and no breaking changes for 20+ years. What's wrong with Perl?

rurban · on April 3, 2022

Lots of unstable breakages over the last year's. automake got broken by an unnecessary deprecation, signatures got broken, encodings, and dozens more.

but still miles better than other such languages, and esp. if it would have been written in C, just as the incompatible moreutils counterpart.

MainJane · on April 3, 2022

I was curious how much breakage GNU Parallel has suffered. So I fetched all versions (in parallel) and ran:

    parallel -k --tag --argsep -- {} echo ::: 1 -- parallel-*

Every version since 20120622 work (except for 20121022). That is code which is almost 10 years old.

rurban · on April 4, 2022

you need to try with all the perl versions, not the parallel versions.

forgotmypw17 · on April 4, 2022

In my anecdotal, n=1 experience, nothing Perl-based I've ever used has EVER broken over 20+ years, not even ONCE.

Compare this with PHP, whose breaking changes between releases has taken down my sites on multiple occasions.

Compare this with Python, whose breaking changes prevent me from running the overwhelming majority of Python things I've tried to use.

arunix · on April 3, 2022

> signatures got broken

Subroutine signatures are an experimental feature in Perl. Or are you referring to something else?

smegsicle · on April 3, 2022

"perl has common-lisp levels of stability"

ZoomZoomZoom · on April 2, 2022

> Anyway, those three are just off the top of my head, unfairness-wise. Last I looked at the source for GNU parallel it looked like mountains upon mountains of Perl I would rather not depend upon, personally, but to each his own.

Well, there was a Rust version with zero Perl, now unfortunately archived. It wasn't 100% on a par with the original and wasn't really finished. On the other hand, built easily for Windows and helped me on a few occasions.

https://github.com/mmstick/parallel

cb321 · on April 2, 2022

Well, some archived project is not so great either. The core functionality is not even a 20 line bash script since bash grew wait -n, though:

    #!/bin/bash
    if [ "${1-0}" -lt 1 ]; then             # No arg / arg not a number >= 1
      echo "Usage: $0 <N>"; echo "reads cmds from stdin, running up to N at once."
      exit 1
    fi
    TMP=`mktemp -t stripen.XXXXXX`
    trap 'rm -f $TMP; exit 0' HUP INT TERM EXIT
    STRIPE_SEQ=1
    while read cmd; do
        jobs > $TMP                         # jobs | wc -l does not work
        if [ $(wc -l < $TMP) -ge $1 ]; then
            wait -n                         # Wait for 1/more jobs to finish
        fi # Could accumulate total $? above, but would need to replace final wait.
        STRIPE_SEQ=$((STRIPE_SEQ + 1))
        ( eval "$cmd" ) < /dev/null &       # Run job in a subshell in bg
    done
    rm -f $TMP
    wait                                    # Wait for all to finish

and if bash ever grows some magic environment variable $NUM_BG_JOBS or you don't want auto-help or sequence numbers or etc. it can be even simpler.

arrakeen · on April 3, 2022

sure a "parallel xargs" can ostensibly be implemented in POSIX sh but that's merely the tip of the iceberg with what parallel can do. why not just skim the documentation and give it a try?

cb321 · on April 3, 2022

I have skimmed. I did give it a try. Not for me.

MainJane · on April 3, 2022

> The comparison is not very fair to modern day xargs.

I am curious how you come to that conclusion.

> `nproc` is a relatively standard utility (coreutils). So, xargs -P$(nproc) gets you core (or core-proportional) parallelism.

I follow you on this point. A bit harder on remote systems, but definitely doable.

> Grouping output/Making a safe parallel grep is also easy-ish with `--process-slot-var=slot` and sending to `tmpOut.$slot`.

I tried spending 5 minutes on coding this, but the details seem to be very hard to get right: composed commands, grouping stderr, combined with not leaving tmp files behind if killed and allowing for the total output to be bigger than the free space on /tmp. I could not do it.

Could you consider spending 5 minutes on showing in code how you would do it?

> Jobs on remote computers can be done similarly with any kind of `arrayVar[$slot]` setup where `arrayVar` has a bunch of `ssh` targets, possibly duplicates if you want to run >1 job per host. (In pure POSIX sh you could use eval and $1, $2 positional args with shell arithmetic..)

This one seemed even harder to me: It was completely unclear how you would make sure that a given number of jobs were constantly running. And how would you need to quote data, so an eval would not cause "foo space space bar" turn into "foo space bar". And how you would kill remote jobs, if the local script was killed.

If you believe this is simple, could you spend 5 minutes on showing the rest of us how you would do it in actual working code? Because it seems the devil is really in the detail.

> Last I looked at the source for GNU parallel it looked like mountains upon mountains of Perl I would rather not depend upon, personally, but to each his own.

Personally, I would take production tested code over home-made untested code any day - no matter the language in which it was written.

cb321 · on April 3, 2022

> allowing for the total output to be bigger than the free space on /tmp. I could not do it.

This is an unreasonable standard when you do not know in advance how big the output is. What do you imagine GNU parallel does? Use `df` on every host it knows about to fill every disk partition it can? That sounds like a pretty system-hostile behavior to me.

Meanwhile, putting your temp files somewhere bigger is obv. as easy as $TMPDIR or such.

Best wishes/luck. I only have 5 minutes to explain why nothing can do the impossible like read a user's mind about disk free space management or the value of partial results. All software makes some assumptions... :-)

MainJane · on April 3, 2022

> This is an unreasonable standard when you do not know in advance how big the output is.

Why is that unreasonable?

Let us say a single job outputs 10% of the free space. As long as you run fewer than 10 jobs in parallel, GNU paralel can run forever, because it spits out the output when a job is done and then frees up the space for this job, while starting the next one.

A simple example:

    yes 1000000 | parallel -j10 seq | pv >/dev/null

On my laptop I get 600 MB/s which would fill /tmp in a few minutes, and it does not.

When dealing with big data it is not uncommon that the total data piped between commands is way larger than the free space on /tmp (which is typically fast, where as free space on $HOME is slow - thus setting $TMPDIR to $HOME/tmp may slow down your job drastically).

If you only have 5 minutes, I hope you will use them on providing actual code to support your claim, that "The comparison is not very fair to modern day xargs."

If it takes longer than 5 minutes to code, I would say your use of "easy-ish" is unwarranted.

You leave me with the feeling that you have not thought this through and that the reason why you do not provide any code is because you are now realizing you are wrong, but you do not have the guts to admit so.

Prove me wrong by posting the code. It should be "easy-ish" :)

You can use this as the test case to implement:

    yes 1000000 | parallel -kj10 "echo 'This  is  double  spaced  '{#}; seq {}" | pv >/dev/null

cb321 · on April 8, 2022

You are just moving goalposts from "grouping to not mix" in the comparison doc to "grouping to not mix with exact space management profile(s) of GNU parallel". Even worse, you now bring in IO space-speed assumptions, other use cases (hay generation not needle search), various dissembling and childish "taunts for proof" when you clearly understood the suggestion well enough to analyze it for potential limitations. Your attitude is the problem, not missing code. Also, I never said "/tmp" and the paths could be FIFOs with record size/buffering limitations instead.

Speaking of /tmp filling and questionable space management defaults:

    yes 2000000000 | parallel seq | pv > /dev/null

fills my /tmp disk partition (or $TMPDIR) before emitting one byte to pv with invisible (unlinked) temp files. Not ideal. GNU sort at least shows me there are files present yet also seems to clean up on Ctrl-C.

There is likely some solution to fix this in 15 kLOC of gross Perl. I did not find it in "5 minutes" (another unreasonable standard since the many 1000s of lines of GNU parallel docs take far longer to read, but you already seem to ignore my explanations of "unreasonable"). You even anticipate this in your 10% example. At least in my life, "way more" is often much more than 10x more. So, you basically contradict yourself.

As to the actual subtopic, besides being unfair/out-of-date, the comparison tableau is also incomplete - maybe willfully so, as per too common marketing dishonesty. "Proof?" People use parallelism to speed things up and need to make decisions about job granularity to not have perf killed by overhead. Some would say this matters more than 95% of the tableau evaluation points. Yet, no overhead benchmarks. Maybe they make GNU parallel look bad?

MainJane · on April 11, 2022

If you feel I am "moving the goalposts" why not just prove your original case? If you are spending 5 minutes on reading the source code, why not instead spend them on proving your original assertion is correct? You can then let the readers decide if they feel I "move the goalposts".

I included the example:

    yes 1000000 | parallel -kj10 "echo 'This  is  double  spaced  '{#}; seq {}" | pv >/dev/null

to give you some fixed "goalposts" to aim for: Provide a solution that gives the same output byte for byte.

Also you do not seem to get the point about the amount of data. I regularly have output from a single job that is bigger than RAM, but rarely have output from a single job that would fill /tmp. However, the total combined output from all the jobs will often take up more space than /tmp.

In numbers: RAM=32 GB, /tmp=400 GB, a single job=33 GB, number of jobs=1000, jobs in parallel=8.

In other words: Running all jobs and saving the outputs into files before outputting data will not be useful for me. If you want to use FIFOs I really cannot see how you can deal with output that is bigger than RAM, unless you mix output from different jobs - which again would not be useful to me. But prove me wrong by spending 5 minutes on building the solution.

As for your example:

    yes 2000000000 | parallel seq | pv > /dev/null

How would you design this, if output from different jobs are not allowed to mix?

If they are allowed to mix paralel gives you:

    # bytes are allowed to mix
    yes 2000000000 | parallel -u seq | pv > /dev/null
    # only full lines are allowed to mix
    yes 2000000000 | parallel --lb seq | pv > /dev/null

none of these use space in /tmp.

I sit back with the feeling you are willing to spend hours complaining, but not 5 minutes on proving your assertion that it can be done "easy-ish".

Prove me wrong: Spend 5 minutes on the task you believed was "easy-ish".

If it cannot be done in 5 minutes, be brave enough to admit you were wrong.

smegsicle · on April 3, 2022

> mountains upon mountains of Perl I would rather not depend upon

how bad was the perl?

cb321 · on April 3, 2022

Subjective. Judge for yourself: https://git.savannah.gnu.org/git/parallel.git

technofiend · on April 2, 2022

There's probably a more unix-centric / idiomatic way to do it, but using GNU Parallel along with Redis made automating tasks to process tens of thousands of systems extremely easy, repeatable, pausable, restartable, etc. And the fact you can do atomic operations on queues prevented any potential worker collisions.

cb321 · on April 8, 2022

With the default setup, GNU parallel seems to be about 1000x slower than it should be and 165x slower than serial xargs. On a 16 core/32 thread CPU running Linux 5.16 (with parallel-20220322, perl-5.34.1, ripgrep-13.0.0, findutils-4.9.0, grep-3.7):

    cd /dev/shm
    wget https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-5.17.2.tar.xz
    tar xpJf linux-5.17.2.tar.xz
    cd linux-5.17.2
    rg -a --no-ignore --files | tr \\n \\0 > ../f
    tm="/usr/bin/time"; lb="--line-buffered"
    $tm rg -a --no-ignore -l foofoo           # 130 milliseconds wall
    $tm xargs -0P16 grep $lb -l foofoo < ../f # 120 milliseconds wall
    $tm xargs -0 grep $lb -l foofoo < ../f    # 800 milliseconds wall
    $tm parallel -0j16 grep -l foofoo < ../f  # 132 *SECONDS* wall

GNU parallel uses massive %CPU to make slow progress which also tends to confuse users new to parallelism in general. Hey, maybe there is a --dont-go-slow flag somewhere. I'd bet actual new users take a while to find it - maybe they never find it. There seems to at least some story of bad/confusing defaults here. A quick grep in the parallel package for benchmark only seems to indicate things that benefit from network asynchrony, not CPU parallelism.

I tried to pick something anyone could reproduce, should they so desire. Cheers.

cb321 · on April 10, 2022

Addendum: I guess parallel is not a "drop-in" for xargs by default, but more like xargs -n1, as has been mentioned elsethread. You need to use -X (that is a capital 'x' as lowercase 'x' means something else following the xargs CLI syntax, but not its semantics, LOL [1]).

    $tm parallel -X0j16 grep -l foofoo < ../f  # 2.33 *SECONDS* wall

So, GNU parallel is "only" 20X slower than it should be and "only" 3X slower than serial xargs on a 16 core box (on bare metal, not in the cloud, by the way). Still pretty awful unless there is another "--dont-burn-down-forests-for-electricity" flag I'm missing.

[1] https://unix.stackexchange.com/questions/273170/gnu-parallel...

caro11ne · on April 10, 2022

I find this scary:

    $ export LC_ALL=C
    $ $tm xargs -0P1 grep $lb t < ../f      |sort |md5sum
    7.05user 9.88system 0:24.53elapsed 69%CPU (0avgtext+0avgdata 2344maxresident)k
    0inputs+0outputs (0major+4960minor)pagefaults 0swaps
    8ef2c658a70bb38438e59421231246b9  -
    $ $tm xargs -0P16 grep $lb t < ../f      |sort |md5sum
    10.16user 36.62system 0:18.30elapsed 255%CPU (0avgtext+0avgdata 2332maxresident)k
    0inputs+0outputs (0major+4980minor)pagefaults 0swaps
    c8ebf840e54ec8b5a49e159eda09e63f  -
    $ $tm parallel -X0P16 grep $lb t < ../f      |sort |md5sum
    16.97user 33.94system 0:16.36elapsed 311%CPU (0avgtext+0avgdata 51624maxresident)k
    0inputs+2069296outputs (0major+169409minor)pagefaults 0swaps
    8ef2c658a70bb38438e59421231246b9  -

It greps for lines containing t, sorts the lines and computes a hash.

Note how "xargs -P16 grep" gives the wrong answer. The output from parallel matches exactly the lines from "xargs -P1". With "-k" the lines are even in the same order (sorting removed):

    $ $tm xargs -0P1 grep $lb t < ../f      |md5sum
    7.03user 9.30system 0:16.32elapsed 100%CPU (0avgtext+0avgdata 2332maxresident)k
    0inputs+0outputs (0major+5023minor)pagefaults 0swaps
    d89b45188602c9bb08026dc2892cfa75  -
    $ $tm parallel -kX0P16 grep $lb t < ../f      |md5sum
    18.21user 36.03system 0:10.26elapsed 528%CPU (0avgtext+0avgdata 65396maxresident)k
    0inputs+2069344outputs (0major+154929minor)pagefaults 0swaps
    d89b45188602c9bb08026dc2892cfa75  -

I have not analyzed the output but I think the error is caused by the issue described here: https://mywiki.wooledge.org/BashPitfalls#Non-atomic_writes_w...

How anyone would ever use "xargs -P16 grep" is beyond me. I honestly do not care how fast I can get an answer, if I cannot trust the answer is correct.

I can see someone claimed they could build a safe parallel grep, but seemed not to do so: https://news.ycombinator.com/item?id=30890780#30913304 It would have been interesting to see.

cb321 · on April 11, 2022

You are just moving goalposts from "grep -l" to "grep t". The "grep -l" should be reliable by virtue of line buffering and Linux kernel source path names being shorter than PIPE_BUF (which yes, you do have to know|check - much less to know than a >5000 line man page). While I could address the moved goalposts, I already mentioned xargs --process-slot-var [1] elsethread and, in my experience, goalpost movers are never satisfied.

[1] https://unix.stackexchange.com/questions/449224/how-can-i-ge...

caro11ne · on April 11, 2022

So you knew of this limitation, but failed to mention it?!

Wow. Just wow.

I thought you were trying to show a general way to run jobs in parallel in a safe, reliable way that was faster than GNU Parallel.

You failed to do that.

Instead you showed that it is possible to run jobs faster than GNU Parallel, but in a way that is neither safe nor reliable.

(Or more correctly: If you know the exact limitations of the kernel of the OS you are currently using, it will be reliable in certain situations - but not in general).

I will pick reliable results over speed any day, thank you.

I hope the person, who mentioned safe parallel grep, will show how it is done: https://news.ycombinator.com/item?id=30891634 because I will definitely not be using your solution.

cb321 · on April 13, 2022

I made no "in general" applicability claim and have, on the contrary, explicitly acknowledged assumptions & limitations various times in this thread which you write as if you have read. Mentioning every limitation is impractical. GNU parallel also doesn't work "in general" (e.g. no Perl interpreter). My concrete benchmark was safe/reliable in context - until you added bugs. Adding bugs to GNU parallel examples is also easy - I already did one by accident.

Your link to atomic writes and your bug addition seemed pretty targeted/informed. The GNU grep --line-buffer was also in plain sight in my benchmark. Presumably one must know pipes to know when GNU parallel's own --line-buffer is helpful. So, your "wow" outrage seems fake/off point and this GNU parallel sales pitch of "background knowledge free lunch" seems more false.

You also seem to have missed (twice!) the main thrust that, unless I am missing some other --go-fast flag, GNU parallel is so slow on this common task that the easy serial method is much faster even with 16 cores given to GNU parallel. GNU parallel would have to be over 3X faster for your correctness concerns to even matter relative to serial xargs. People don't usually use parallelism to slow things down - unless maybe they blindly use GNU parallel.

For the curious, perf ratios are actually even worse for the high volume "grep t" example made safe (21.8X slower rather than 19.4X slower - on my test machine). xargs --process-var-slot (around since 2010) is enough of a hint for anyone actually curious and there is real value to having someone solve that little puzzle themselves for their own use cases. Doing all your homework for you can take something away. If you are too confused and a paying customer of Ole's, have him update his docs to be less unfair to xargs. (Also no need to link to my own posts and refer to them so generically. I only have one account, as per HN guidelines which, as a brand new account, you should maybe familiarize yourself with. [1])

As to why GNU parallel is so slow - Dunno. Took just as long with that -u flag to allow mixing output. Perl/Python programs are often 20..500X slower than compiled to native code. Python at least has like 10 ways to compile it. Curious if it was only code search, I tried that x10000000 example in the xargs comparison doc and got only 1.2X scale-up over serial xargs with 8 whole cores which also seems really slow/bad. So, GNU parallel slowness seems like probably a common problem.

This isn't just "complaining". I have already highlighted 4 risks (no Perl, invisible /tmp filling default, non-drop-in xargs, slower than serial). The oddball nagware license creates at least a 5th/6th legal/financial risk. Not sounding so safe to me. Being giant with many features generically creates more "accidental attack surface". So, there may be many more buried in GNU parallel. Different tools, different risks.

[1] https://news.ycombinator.com/newsguidelines.html

caro11ne · on April 13, 2022

> You also seem to have missed (twice!) the main thrust that [GNU Parallel is slow]

I did not miss that. I did not comment on it because I agree and so does GNU Parallel.

man parallel:

    BUGS
    [...]
       Speed
           Startup

           GNU parallel is slow at starting up - around 250 ms the
           first time and 150 ms after that.

           Job startup

           Starting a job on the local machine takes around 10 ms.
           This can be a big overhead if the job takes very few ms
           to run. Often you can group small jobs together using -X
           which will make the overhead less significant. Or you
           can run multiple GNU parallels as described in EXAMPLE:
           Speeding up fast jobs.

And man parallel_alternatives:

   DIFFERENCES BETWEEN parallel-bash AND GNU Parallel
       [...]
       parallel-bash is written in pure bash. It is really fast
       (overhead of ~0.05 ms/job compared to GNU parallel's ~3
       ms/job). So if your jobs are extremely short lived, and
       you can live with the quite limited command, this may be
       useful.

And https://www.gnu.org/software/parallel/

    Over the years GNU parallel has gotten more safety features (e.g. no silent data loss if the disk runs full in the middle of a job). These features cost performance. This graph shows the relative performance between each version.

I really do not care how fast you can produce wrong output. I care how fast you can produce correct output, and I am do not care about a specialized solution that only works for one single specialized task.

I can make a specialized solution that is faster than your specialized solution:

    $ $tm true

It gives the same output as your example, and it is way faster. But do you really feel that is a fair comparison? If you say no, then by your own arguments, I can claim you are "moving the goal posts".

> For the curious, perf ratios are actually even worse for the high volume "grep t" example made safe (21.8X slower rather than 19.4X slower - on my test machine). xargs --process-var-slot (around since 2010) is enough of a hint for anyone actually curious and there is real value to having someone solve that little puzzle themselves for their own use cases. Doing all your homework for you can take something away.

Or it might just be that your solution is not safe at all, or only works on very specialized input on your system.

I have already shown that I can do a specialized version faster than your specialized version.

As long as you do not show your work, your speed claim is just that: a claim with no evidence.

What can be asserted without evidence can also be dismissed without evidence.

> I have already highlighted 4 risks (no Perl, invisible /tmp filling default, non-drop-in xargs, slower than serial)

"No Perl": I have only once used a system without Perl: It was on an embedded system, where space was a premium. If you use a package manager to install parallel, Perl will be installed for you automatically.

"Invisible filling /tmp": I really like that behaviour, because no matter how GNU Parallel is killed, there are no files to clean up. But each to his own.

"non-drop-in xargs": Your evidence here is good and I concur, though I never hit those incompatibilites myself (apart from -n1 which is what I normally want anyway).

"slower than serial": For short-lived jobs, yes (and if your jobs are short-lived and you can live with the limitations then parallel-bash seems to be faster than xargs). In general, no. Try "seq 0.1 0.1 10 | time parallel -j 50 sleep"

I had hoped your critique would show there is a better way of running jobs in parallel. So far I can only say I am disappointed.

ValtteriL · on April 2, 2022

Parallel has been (and still is) super useful and simple tool for speeding up all kinds of shell tasks during my career.

Still remember the time I wasted days waiting for the completion of multiple serial tasks when it would have been an hour with parallel :-)

desktopninja · on April 4, 2022

Can't quite pin my resistance to parallel but I've enjoyed using this magic for many moons now:

xargs -t -n 1 -P 3 -I TARGET_COMPUTER ssh -q TARGET_COMPUTER "

sleep \$(shuf -i 1-9 | head -1)

echo \$(hostname; date)

echo \"===> \$(hostname) <===\"

> ~/I_WAS_HERE

" <<< "

server-01

server-02

server-03

server-04

server-05 "

caro11ne · on April 10, 2022

I use:

    iwashere() {
      sleep $(shuf -i 1-9 | head -1)
      date
      echo "===> $(hostname) <===" > ~/I_WAS_HERE
    }
    env_parallel -Sserver-0{1..5} --tag --nonall iwashere

I like that I can try out the function locally before running it remotely.

I like that I do not have to give xargs a multiline argument.

I am terrible at quoting inside quoting inside quoting, so I like that I can simply avoid the quoting. I also think my colleagues will find it easier to read (thus maintain).

mekster · on April 2, 2022

If you want to execute commands on multiple remote hosts over SSH, just run a tmux session and launch a tab on each host in a loop and execute them.

It's far easier to follow the output and individually deal with prompts.

permalac · on April 2, 2022

I do the same, I even have a command that tells me which tmux panel I'm in so I can ssh to many servers at once from tmux.

However, when you just want to execute the same command : https://clustershell.readthedocs.io/en/latest/tools/clush.ht...

bheadmaster · on April 2, 2022

True, but for a large number of hosts, going manually through prompts doesn't scale anymore. Sometimes you just want to check whether all hosts have the same configuration file, or something like that. That's where GNU Parallel can save your gluteus maximus:

    cat hosts.txt | parallel --quote --timeout=10 ssh {} 'echo {} $(md5sum ~/.config/file)'

The above command would take a list of hostnames from the hosts.txt file, connect to each one, hash the config file and print out hostnames and hashes in one line per host.

MainJane · on April 3, 2022

    cat hosts.txt | parallel --quote --timeout=10 ssh {} 'echo {} $(md5sum ~/.config/file)'

Also try:

    parallel --slf hosts.txt --timeout=10 --nonall --tag md5sum .config/file

mekster · on April 3, 2022

Unless entire operations can be finished automatically, you could still do,

command_to_test && exit

in the swarm of tmux tabs and let only the anomalies stay open.

ktm5j · on April 2, 2022

I love tmux for this application, and typically I tend to prefer alternate solutions to using parallel.. but to be fair there are many more usecases for parallel than the one you describe here.

MainJane · on April 3, 2022

Also try:

    parallel --tmux ...

arrakeen · on April 2, 2022

parallel is so useful and i use it multiple times daily. i wish its `:::` syntax was supported at the shell level so i could use it for every application

dehrmann · on April 2, 2022

It's nice, but the citation bit strikes me as very non-free:

> If you use --will-cite in scripts to be run by others you are making it harder for others to see the citation notice. The development of GNU parallel is indirectly financed through citations, so if your users do not know they should cite then you are making it harder to finance development. However, if you pay 10000 EUR, you have done your part to finance future development and should feel free to use --will-cite in scripts.

> If you do not want to help financing future development by letting other users see the citation notice or by paying, then please consider using another tool instead of GNU parallel. You can find some of the alternatives in man parallel_alternatives.

Athas · on April 2, 2022

It doesn't affect your right to do anything with the software. It doesn't even affect your right to remove this citation notice (and some distributions do so). Free software is not about software not being annoying; merely about your right to remove the annoyances.

I'm not certain GNU parallel's approach to obtaining funding is a good strategy, but I find it weird that people object to it on philosophical or legal grounds.

CaliforniaKarl · on April 2, 2022

> I'm not certain GNU parallel's approach to obtaining funding is a good strategy,

Indeed, it's a good question! I know that, for the SGI UV300 we got through an NIH S10 grant, the usage of the supercomputer is tracked by looking for the grant number in publication acknowledgements. Yes, we already got the money, so you may wonder "what does it matter?", but our ability to get future funding (especially as the UV300 nears retirement age) depends (at least in part) on showing how well we used our previous funding.

In our case, the funding came through an NIH grant, so we ask people reference the grant number. But more broadly, and especially for software, issuing a DOI for released versions (through a service like Zenodo https://zenodo.org), along with a request for acknowledgement, gives a way to track usage. For example, the 20150322 release of parallel (DOI 10.5281/zenodo.16303) has been cited at least 63 times (per https://doi.org/10.5281/zenodo.16303).

Looking at https://scholar.google.com/citations?user=D7I0K34AAAAJ&hl=en, it seems the 2011 version of Parallel has been cited over 1,000 times.

The author is at the University of Copenhagen (per the Google Scholar link above), so it's entirely possible that at least some of the funding for his employment is coming from sources that use citation counts as an indicator that they are "getting their money's worth" by continuing to fund (at least part of) Mr. Tange's employment.

petepete · on April 2, 2022

I was trying to use parallel properly for the first time last week and the nag screen put me off and made me choose an alternative.

I ended up with the moreutils one and it appears to work just as well for my simple needs.

MainJane · on April 3, 2022

I think that is the right thing to do: Don't like it? Don't use it.

Also: https://git.savannah.gnu.org/cgit/parallel.git/tree/doc/cita...

CaliforniaKarl · on April 2, 2022

Your concern about this being non-free is addressed by an entire FAQ. Here's a link to the FAQ page from the software's Git repo:

https://git.savannah.gnu.org/cgit/parallel.git/tree/doc/cita...

6keZbCECT2uB · on April 2, 2022

Thinking about the citation notice and whether the software is safe to use at work is why I just use xargs.

stabbles · on April 2, 2022

I don't understand how GNU finds this acceptable...

CaliforniaKarl · on April 2, 2022

[edited for formatting and to fix Git link]

According to https://git.savannah.gnu.org/cgit/parallel.git/tree/doc/cita...

> == Is the citation notice compatible with GPLv3? ==

> Yes. The wording has been cleared by Richard M. Stallman to be compatible with GPLv3. This is because the citation notice is not part of the license, but part of academic tradition.

> Therefore the notice is not adding a term that would require citation as mentioned on: https://www.gnu.org/licenses/gpl-faq.en.html#RequireCitation

If you are of the view that clearance by RMS is clearance by FSF/GNU, then that's how they find it acceptable. If you take a different view, then the next part of that section applies:

> If you disagree with Richard M. Stallman's interpretation and feel the citation notice does not adhere to GPLv3, you should treat the software as if it is not available under GPLv3. And since GPLv3 is the only thing that would give you the right to change it, you would not be allowed to change the software.

There's also an interesting comparison to be made:

> == How do I silence the citation notice? ==

> Run this once:

> parallel --citation

> It takes less than 10 seconds to do and is thus comparable to an 'OK. Do not show this again'-dialog box seen in LibreOffice, Firefox and similar programs.

bheadmaster · on April 2, 2022

This is an excellent explanation of why "GNU finds this acceptable".

Note that the citation message can also be easily silenced just by creating an empty file:

    touch ~/.parallel/will-cite

kazinator · on April 3, 2022

Tools shouldn't be probing the file system for files not related to the job they are doing, period.

The right way to do this is to patch the behavior out of the program, which you're entitled to do by its license. Or, rather by the fact that the license doesn't concern itself with use.

That is covered in the FAQ:

  == I do not like the notice. Can I fork GNU Parallel and remove it? ==

  Yes. GNU Parallel is released under GNU GPLv3 and thus you are allowed
  to fork the code. But you have to make sure that your forked version
  cannot be confused with the original, so for one thing you cannot call
  it anything similar to GNU Parallel as that would cause confusion
  between your forked version and the original.

If you're not redistributing it, this doesn't apply to you; you're only using the program.

I believe that a distro could get around this by providing a script a user can execute, or a patch that the user can apply that removes the nag code from the installation of GNU Parallel. Even if we take the view that the script creates a fork, it only creates a private one on the user's machine, and not anything that is redistributed; without distribution taking place, what is taking place is use.

bheadmaster · on April 4, 2022

> Tools shouldn't be probing the file system for files not related to the job they are doing, period.

You're entitled to your preferences and your own fork of the GNU Parallel. I'll even help you out with that - all you need to do to remove the citation message is to comment out lines 1840-1843 in "src/parallel" file.

Just please don't demand everyone else (i.e. distro maintainers) to stab the main developer in the back just to accommodate your preferences.

musicale · on April 2, 2022

I don't care what GNU thinks, but it's simply not scalable.

Imagine a world where every utility has its own irritating nag message that needs to be turned off.

MainJane · on April 3, 2022

> I don't care what GNU thinks, but it's simply not scalable.

How so?

A lot of software requires you to configure it before the first run, and we regard that as scalable.

A lot of software requires you to pay for it before the first run (most Microsoft server software comes to mind), yet we regard that as scalable. You can also pay for gnu paralell: https://git.savannah.gnu.org/cgit/parallel.git/tree/doc/cita...

Is it because you insist that you get software for free (zero cost in gnu speak)? Because that is really not what the free software movement is all about.

musicale · on April 4, 2022

    $ ls
    Thank you for using the /bin/ls utility!
    Did you know that you can upgrade to LS PRO for a mere fraction of a bitcoin? 
    Or just post a selfie tagged #LS_PRO_RULES on Twitter!
    LS PRO has many amazing features that you are missing.
    This message can be removed by using the --no-awesome-ls-pro-upgrade-msg flag.
    Here is your file listing:
    .bashrc .catconf .cprc .ddconfig .dfprefs  ...
    $ exit -1

MainJane · on April 4, 2022

Honestly, I fail to see the problem, if I had to run `ls --no-awesome-ls-pro-upgrade-msg` once when I installed it the first time. And if I did not like it, I could use one of the alternatives to `ls` or build my own.

In LibreOffice I have to click a "Don't show tip of the day again" every time I install it on an new machine, and personally I have no problem with that. If I had, I would use something else.

Zsh asks me to configure it, first time I run it. I find that slightly annoying, but not to the extend that I would even consider complaining, sending a patch, or using an alternative.

But I assume you are aware that your comparison is really not valid: Parellel is not limited in features - you do not get extra features by paying/citing. What you are doing is keeping it alive.

Also, if you really do not like the notice, why not just pay for it? Are you opposed to paying for free software? And if so, how do you suggest developers of free software make a living? And why are you not actively doing that for GNU Parallel, which you clearly have so strong opinions on, that you are willing to spend time complaining but not willing to ignore (and use another tool)?

krapht · on April 3, 2022

Don't need to imagine, I lived in that world where most utilities I used were shareware with a nag screen on startup...

MainJane · on April 3, 2022

I still remember when those were not of the "OK, don't show this again" type, so you could not simply stop them first after the first run.

musicale · on April 3, 2022

I think I'll pass on GNU/Linux: Shareware Edition.

dehrmann · on April 3, 2022

But for only $20, they'll send you Episode II, GNU Hurd.

musicale · on April 3, 2022

We may need a standard directory and behavior for naggy utilities of this type. Maybe something like:

    for util in /usr/annoying/bin/*; do
        touch ~/.$(basename $util)/stfu
    done

Of course I don't like the idea of having to create a prefs file for every obnoxious utility, but that's the way parallel is currently operating. Maybe all such utilities could be required to use standard nag library with a global setting.

MainJane · on April 3, 2022

Would it not be more fruitful addressing the hard issue: Funding.

https://git.savannah.gnu.org/cgit/parallel.git/tree/doc/cita...

I think many free software developers would rejoice if you cracked that problem.

musicale · on April 3, 2022

It seems to be encouraging inappropriate citation. It is certainly not standard practice and would probably be considered something of a violation of academic ethics.

If you're not describing an experiment or system that uses GNU parallel as one of its key components then it makes no sense to cite it any more than it does to cite any other utility.

MainJane · on April 3, 2022

> If you're not describing an experiment or system that uses GNU parallel as one of its key components then it makes no sense to cite it any more than it does to cite any other utility.

GNU Parallel agrees with you, but also gives you a test of when to regard it as a "key component" (as you put it):

https://git.savannah.gnu.org/cgit/parallel.git/tree/doc/cita...

> If you feel the benefit from using GNU Parallel is too small to warrant a citation, then prove that by simply using another tool. [...] If it is too much work replacing the use of GNU Parallel, then it is a good indication that the contribution to the research is big enough to warrant a citation.