
GNU Parallel Tutorial - pi-rat
https://www.gnu.org/software/parallel/parallel_tutorial.html#GNU-Parallel-Tutorial
======
CJefferson
Feel I should mention this -- the confusing and never-clarified licence (I
emailed asking for clarification and was told to stop using the software),
which says that if you use this as an academic, you must cite it or pay. I
don't use it for that reason -- this request isn't scalable.

I either use xargs, or this reimplementation, which has all the features I need:

[https://github.com/mmstick/parallel](https://github.com/mmstick/parallel)

~~~
claudius
Meh. The website says "Please cite", not "you must cite". To derail this a bit
further, I would actually be very interested in an academic version of the
AGPL - that is, a proper license that defines "publication of data produced by
the work" as "distribution of the work", meaning that if someone uses the
software to produce some data and then publishes said data, they also must
publish (or make available on request) all changes made to the software.
Unfortunately, the very strict definition in the AGPL and the interlinking
between it and the GPL make this difficult to add on later.

And thus we are stuck with "please cite this so I can justify working on it,
_please_?" instead of "You must cite this if you want to use it".

Honestly, I also don’t see how this request isn’t scalable, could you expand
on that?

~~~
mikegerwitz
The output of a program isn't a derivative work, so this can't be enforced
with a copyright license; you'd need something else. Unless you make it a
derivative work by outputting copyrighted material, but that'd be difficult
with raw data.

Anyway: there's a wonderfully specific answer to this in the FAQ:

[https://www.gnu.org/licenses/gpl-faq.en.html#RequireCitation](https://www.gnu.org/licenses/gpl-faq.en.html#RequireCitation)

[https://www.gnu.org/licenses/gpl-faq.en.html#GPLOutput](https://www.gnu.org/licenses/gpl-faq.en.html#GPLOutput)

~~~
claudius
I read this, but I honestly don’t understand it. Isn’t a license essentially a
contract between the licensor and the licensee? And shouldn’t it hence be
possible to write such a clause into a license? After all, lots of things are
put into licenses which do not cover derivative works (e.g. conditional patent
grants)?

I understand that I cannot add this to the GPL (due to the specific clauses of
the GPL) and that RMS might not consider the resulting software free (though
it would probably pass all of Debian’s freedom guidelines, for example), but
it should be possible to have such a license in general?

~~~
mikegerwitz
The issue with adding such a restriction to the license is that it has to work
within the domain of copyright. The author of foo holds copyright on foo, but
not on the output _generated_ by foo---that's an original work (assuming foo
doesn't output anything its author actually does hold a copyright on).

I guess a good example would be a Madlib-style program, where it asks you
questions and fills in the blanks in a story, often resulting in something
highly amusing. The original story containing the blanks is copyrighted. The
output of this program would then be a derivative work, because the original
story has been modified.

But consider a program that took a story of your own (the data it's
processing) and output statistics, such as the word count, the frequency of
certain words, grammar errors, etc. This is not a derivative work. Similarly,
if GNU Parallel is being used to process your input, its output isn't a
derivative work.

With that said, you can have a separate EULA-type thing---which is _outside_
the domain of copyright---that imposes these terms. But that is incompatible
with the terms of the GPL.

~~~
paulmd
This is not true, or at least not _universally_ true. For example, gameplay
videos are also considered copyrighted as they contain assets and other
copyrighted elements. Your "mad libs" example probably falls under a similar
classification. A completed Mad Libs would still contain large distinctive
elements of a copyrighted work.

In these cases, the EULA actually may contain clauses that _allow_ you to
distribute gameplay videos. But if they do not contain exemptions, it's
copyright that will restrain you, not the EULA.

[http://www.develop-online.net/analysis/uploading-gameplay-content-to-youtube-the-law-versus-the-commercial-reality/0187828](http://www.develop-online.net/analysis/uploading-gameplay-content-to-youtube-the-law-versus-the-commercial-reality/0187828)

[https://support.google.com/youtube/answer/138161?hl=en](https://support.google.com/youtube/answer/138161?hl=en)

Now in the specific case of GNU Parallel, I don't see how the output would
contain any distinctive elements of the original program. As a counterexample,
however, you could not use GNU Parallel to process its own source code and end
up with your own copyright on the output.

~~~
mikegerwitz
> Your "mad libs" example probably falls under a similar classification.

Yes, by stating that it's a derivative work, I meant that the output would be
subject to the Madlibs copyright.

GNU Parallel's output isn't subject to the GPL.

------
pi-rat
Love this tool, often end up using it with imagemagick.

    
    
      # Resize all jpgs to 800x600 using 8 jobs/cores.
      parallel -j 8 convert {} -resize 800x600 {.}_small.jpg ::: *.jpg
    
      # Or get the help of a few servers (via SSH) to do the same job.
      parallel -S serverA,serverB -j 8 --transfer --return {.}_small.jpg convert {} -resize 800x600 {.}_small.jpg ::: *.jpg

~~~
dahart
I do the same thing and resize all the pictures I ever import into both small
& thumbnail versions.

Fun tip - if your resize script prints only the output file name to stdout,
you can pipe the result into another parallel command, e.g.,

    
    
      parallel <resize to 800x600> ::: *.jpg | parallel <resize to thumbnail>
    

This way generating thumbnails runs on the small image instead of the full
size image, saving more time.
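Spelled out with ImageMagick, that pipeline looks roughly like this (a
hypothetical concrete version -- the sizes and the `_small`/`_thumb` naming
are just for illustration):

```shell
# Stage one resizes and prints only the new filename on success;
# stage two reads those names from stdin and builds thumbnails
# from the already-small files instead of the originals.
parallel 'convert {} -resize 800x600 {.}_small.jpg && echo {.}_small.jpg' ::: *.jpg |
  parallel 'convert {} -resize 160x120 {.}_thumb.jpg'
```

`{.}` is parallel's input-minus-extension replacement string, which is what
lets each stage derive its output name from its input.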

Also, if you're on a mac, using sips is quite a bit faster than imagemagick,
and sips comes built-in.

I've never used servers with parallel for image resizing, what are the
benefits? I'd have guessed it would take longer and saturate the network,
versus doing it locally. Is it useful for really long running jobs when you
don't want to load your local cpu? Is it actually faster sometimes, or are
there other more important reasons? I could see it being useful if I didn't
have a local imagemagick install, but had access to servers with it there.
Maybe it'd be useful in cases where I'm running docker environments for the
job processing? What other use cases and scenarios have you run into?

~~~
pi-rat
sips is a good tip, thanks!

I often sort and process images with a macbook on my lap curled up on the
sofa. Resizing locally makes the mac blazing hot with fans whining - not
comfortable and kills the battery. Transfer speed is not too bad over
802.11AC. So I use it mainly as a method of moving cpu intensive work away
from my lap.

Also, I sometimes use

    
    
      parallel -S server,: .......
    

The colon adds the local machine to the list, so it will saturate both the
laptop and whatever it manages from the other computer. I have to admit I've
never tested this scientifically, but it seems to be faster even with the
overhead (the gain from remote imagemagick seems to outweigh the CPU overhead
of the SSH file transfer).

~~~
dahart
> I often sort and process images with a macbook on my lap curled up on the
> sofa. Resizing locally makes the mac blazing hot with fans whining - not
> comfortable and kills the battery. Transfer speed is not too bad over
> 802.11AC. So I use it mainly as a method of moving cpu intensive work away
> from my lap.

:) I have the _exact_ same workflow: MacBook+Sofa. Totally trying out the
server options today.

------
zevv
Or, good enough for 95% of the use cases, simply use 'xargs', which is part of
findutils and installed by default on practically all Linux distributions. See
the '-P'/'--max-procs' option in the man page for more details.
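For instance, the ImageMagick example upthread looks roughly like this with
xargs (a sketch -- xargs has no extension-stripping like parallel's '{.}', so
the output names here are just the input names with a suffix):

```shell
# Run up to 8 convert processes at once, one file per process;
# -I% substitutes each filename into the command line.
printf '%s\n' *.jpg | xargs -P8 -n1 -I% convert % -resize 800x600 %.small.jpg
```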

~~~
bandrami
Right, but it's those remaining 5% of situations where you want parallel
execution on multiple remote hosts that make this cool (xargs itself can be
replaced 95% of the time with backticks, for that matter). Combining parallel,
pies[1], and shepherd[2] makes for some very cool possibilities, which you can
check out with the GNU System Distribution[3], which I've recently migrated
all my personal systems to.

[1]
[http://www.gnu.org.ua/software/pies/](http://www.gnu.org.ua/software/pies/)

[2]
[https://www.gnu.org/software/shepherd/](https://www.gnu.org/software/shepherd/)

[3] [https://www.gnu.org/software/guix/](https://www.gnu.org/software/guix/)

~~~
sigil
You can use xargs for parallel remote execution too. In the simplest case:

    
    
        list-hosts |
        xargs \
        -P8 \
        -n1 \
        -I% \
        ssh % some-command

~~~
bandrami
Sure, and you can also run an ssh command inside backticks too. There's
something of a Turing tar pit here.

------
psi-squared
If I've read this correctly, the 'sem' mode lets you submit several batches of
jobs with an overall limit on the total number of tasks running at a time
(rather than one limit per batch).
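As a sketch of how that reads ('sem' is shorthand for 'parallel --semaphore';
the 'mypool' id and the 'process' command here are made up):

```shell
# Two separate loops share one global limit of 4 running jobs,
# because both take slots from the same named semaphore.
for f in batch1/*; do sem -j4 --id mypool -- process "$f"; done
for f in batch2/*; do sem -j4 --id mypool -- process "$f"; done
sem --wait --id mypool   # block until every queued job has finished
```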

That on its own is super useful for what I'm working on right now. But what
would make it even more useful: can you get GNU make to use 'sem' instead of
its own jobserver? That way I could run almost everything I need under one
overall task limit, and that would be really nice to have.

(For this reason, I'm a fan of the idea that every program with its own
'parallel execution' mode should be able to interact with a common jobserver.
The 'make' jobserver is, as far as I know, the simplest, and should be pretty
easy to support: [http://make.mad-scientist.net/papers/jobserver-implementation/](http://make.mad-scientist.net/papers/jobserver-implementation/) )

~~~
dahart
I don't think you can use a different jobserver for make; all you can do is
have a make rule that launches parallel or something else, but then you lose
dependency tracking on the sub-tasks.

Are you running parallel make tasks where each task is also doing something
multi-threaded or parallel? Like using make -j 8 won't work for you?

Make does have the -l load-average task limiter, but I've never gotten it to
work reliably; it always starts way too many jobs at first and chokes for a
while before calming down. It often won't work for me, but maybe it will help
you?

~~~
psi-squared
My current workflow has a lot of "Run make -j<lots> to build, followed by
parallel -j<lots> to run all the tests", but sometimes I want to compare/test
multiple different versions of the code (in a way which, sadly, doesn't work
well with incremental builds). In that case it'd be nice to be able to just
spin everything up and not have to worry about overloading/under-loading the
machine I'm working on.

I know what you mean about the load average limiter - parallel behaves like
that too. I _think_ the --delay option to parallel is supposed to solve that
(I haven't tried it - will try tomorrow), but I don't know if make has
anything similar.

Finally, on further reading, it definitely seems technically possible, even if
it hasn't been done so far. The make documentation has a section on the
jobserver protocol, which looks complete enough to write both the client and
server parts: [https://www.gnu.org/software/make/manual/html_node/Job-Slots.html](https://www.gnu.org/software/make/manual/html_node/Job-Slots.html)

So if nothing exists so far, it's something I might look into writing myself.

~~~
dahart
> Finally, on further reading, it definitely seems technically possible
> [https://www.gnu.org/software/make/manual/html_node/Job-Slots.html](https://www.gnu.org/software/make/manual/html_node/Job-Slots.html).
> So if nothing exists so far, it's something I might look into writing
> myself.

Holy moly, that's kinda crazy, but would be fun, you should totally do it!
Looking forward to your blog post! ;)

------
rdtsc
GNU parallel is pretty cool.

Another little thing I realized a while back is that `make` (yes the crusty
old make + -j flag) can be used to parallelize jobs. We do it for compiling
usually, but it can be used for other jobs as well.
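A sketch of the trick (filenames hypothetical): write a throwaway makefile
whose targets are independent of each other, and `-j` fans them out:

```shell
# Compress every .log file, up to 8 at a time, via make's jobserver.
# Note: the recipe line must start with a literal tab.
cat > gzip.mk <<'EOF'
LOGS := $(wildcard *.log)
all: $(LOGS:.log=.log.gz)
%.log.gz: %.log
	gzip -c $< > $@
EOF
make -f gzip.mk -j8
```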

~~~
ole_tange
Fun fact: GNU Parallel was originally a wrapper script that used `make` for
the parallelization:
[https://www.gnu.org/software/parallel/history.html](https://www.gnu.org/software/parallel/history.html)

~~~
rdtsc
Ha! Really cool. That was a fun fact!

Make was right there under my nose; I just never imagined using it for
anything but compiling and building things. In that case I was forced by
circumstances (I was developing on a constrained, ancient version of RHEL) and
couldn't use GNU Parallel, and someone suggested `make`. The use case was
obvious once a co-worker mentioned it. It was definitely one of the memorable
"thinking outside the box" examples, as they say.

------
oftenwrong
I am a long-time user of gnu parallel, but I feel it has a bit of
"featuritis". Just look at all of those options in the man page!

~~~
ec109685
Unless it is hurting the product, isn't it more like "featureful"?

------
janoc
Please, don't:

"Install the newest version using your package manager or with this command:

    
    
      (wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash

"

 _facepalm_

~~~
dahart
I think this discussion has already gone the rounds on HN multiple times, and
it usually gets pointed out that running any installer at all carries more or
less the same risks as piping something from the internet straight into the
shell -- while this looks scary, it's no less safe than how we install most
other software.

I agree it doesn't look ideal and may not be best practice, but what do you
feel is a better _realistic_ alternative? What is the main issue for you? Is
it the lack of a hash or checksum to verify what you downloaded, and make sure
you didn't get a malicious site or a compromised package?

~~~
heinrich5991
This is HTTP though, not HTTPS. You're not just trusting the authors, but also
trusting the network.

~~~
dahart
Yes, absolutely! So package validation is the issue for you. But how certain
are you about everything else you install using HTTPS? (I'm honestly asking,
not being rhetorical.) And what is stopping anything else you install from
downloading something over HTTP, and executing it?

------
haddr
Wow! Parallel is simply unmatched!

