It's a nice tool, but it also shows the shortcomings of shell commands.
In a proper programming language, we'd have something like
parallel [1..5], i => { sleep random()*10+5; possibly_flaky i }
// [{"Seq": 4, "Host": ":", "Starttime": 1692491267...
And `parallel` would only have to worry about parallelization.
Instead, the shell environment forces programs to invent their own parameter separator (:::), a templating format ({1}), and a way to output a list of structures (CSV-like). You can see the same issues in `find`, where the exec separator is `\;`, the template is `{}`, and the output is delimited by \n or \0. And `xargs` does it in yet another way.
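To make that concrete, here's the same hypothetical job (compress every log file) in all three tools:

    find . -name '*.log' -exec gzip {} \;          # find: template is {}, terminator is \;
    printf '%s\0' *.log | xargs -0 -n1 -P4 gzip    # xargs: no template, NUL delimiters, flags
    parallel gzip {} ::: *.log                     # parallel: {} is the template, ::: the separator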
It's very hard to acquire and retain mastery over a toolbox where every tool reinvents the basics. If you ever found yourself searching "find exec syntax" multiple times in a week, it's not your fault.
As for alternatives, I'm a fan of YSH[1] (JavaScript-like), Nushell[2] (reinvented from first principles for simplicity and safety), and Fish[3] (bash-like but without the footguns). Nushell is probably my favorite of the bunch; here's a parallel example:
    ls | where type == dir | par-each { |it|
        { name: $it.name, len: (ls $it.name | length) }
    }
[I'm not recommending this, but maybe… No, no. I'm not sure…]
It isn't even just the newer shells that have solved this; zsh also has a solution out of the box¹. The extensive globbing support in zsh can largely replace `find`, and things like zargs let you reuse your common knowledge throughout the shell.
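For instance, where you might reach for find, a glob qualifier does the same job:

    find . -name '*.c' -type f -mmin -60    # the find spelling
    print -l **/*.c(.mh-1)                  # zsh: regular files (.), modified within the last hour (mh-1)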
For example, performing your first example with zargs would use regular option separators (`--`), regular expansion (`{1..5}`), and standard shell constructs for the commands to execute.
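A sketch of what that might look like (possibly_flaky being the hypothetical command from upthread):

    autoload -Uz zargs    # zargs ships with zsh but must be autoloaded
    f() { sleep $(( RANDOM % 10 + 5 )); possibly_flaky $1 }
    zargs -P 5 -n1 -- {1..5} -- f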
I'll contrive an example based around your file counter, but slightly different, to show some other functionality.
    f() { fs=($1/*(.)); jo $1=$#fs }    # (.) matches regular files only; jo emits the JSON
    zargs -P 32 -n1 -- **/*(/) -- f     # (/) matches directories only, recursively; 32 jobs in parallel
That should recursively list directories, counting only the files within each, and output jsonl that can be further mangled within the shell². You could just as easily populate an associative array for further work, or $whatever. Unlike bash, zsh has reasonable behaviour around quoting and whitespace too.
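For the associative-array route, a rough sketch of the same count:

    typeset -A counts
    for d in **/*(/); do fs=($d/*(.N)); counts[$d]=$#fs; done
    print -l ${(kv)counts}    # dump key/value pairs; N avoids errors on empty dirs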
Edit to add: I'm not suggesting zargs is a replacement for parallel, but if you're only using a small subset of its functionality then it may be able to replace that.
What you mention is the main reason why shell is not a decent language for writing long programs. It is full of inconsistencies, and since it depends on other commands, you have to learn the quirks of each command you use. Moreover, good luck if you need to debug it. Shell should only be used for small scripts that are easy to debug.
If doing even simple things requires looking up documentation, why does it matter whether the shell script is long or short?
Spending extra time doing simple things — because you need to Google e.g. "how to pass multiple space-separated arguments from a string to a command" — is also a waste of time.
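(For the record, the usual bash answer to that particular one is an array; mycmd here is just a stand-in:)

    args='one two three'
    read -ra words <<< "$args"    # split the string on whitespace into an array
    mycmd "${words[@]}"           # each word arrives as a separate argument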
Because the shell is available everywhere. A programming language needs to be installed first to be of any use. I still understand that there is a need for a simple command processor that is independent of a programming language for simple tasks.
Do you recommend any good alternative when your shell program gets too large?
Honest question, as I’m struggling to leave the shell environment once the program gets too large. I could use Perl, but $? and the like quickly get out of hand. Python’s support for pipes was difficult last time I used it, but that may have changed. What would you recommend?
If it's too large, then just write normal Python code. It'll be a lot longer than the equivalent shell-like script, but you'll gain it back in maintenance effort, debuggability, and robustness.
You've some hesitation with Perl, but if you stick at it, you'll find what you seek. It feels very 'unixy' and can achieve much the same as shell while being more consistent in its syntax. Its portability means it will work the same across environments. Plus the newest editions have niceties like modern classes and try/catch as inbuilt language features.
Sharing this because it's the route I went; anything I'd have written in Bash I'd now do in Perl.
Thank you for encouraging me to use Perl. After Perl 6 came out I got confused about what to use and how, and hence abandoned that path. I’ll try once more now.
The tooling around Perl has also gotten better over the last decade or so while also allowing you to pack everything to run on even ancient machines running old Perl 5.
Nim is statically typed and (generally) native-compiled, but it has very low-ceremony ergonomics and a powerful compile-time macro/template system as well as user-defined operators (e.g., you can use `+-` to make a constructor for uncertain values so that `9 +- 2` builds a typed object, as in https://github.com/SciNim/Measuremancer).
My use case is approximately this: I can get 80% of what I want with ls … | sed … | grep -v …, but then the script gets complicated and I’d like to replace the sed or grep part with some program.
This sounds like a job for what standard C calls "popen". You can do
import posix; for line in popen("ls", "r").lines: echo line
in Nim, though you obviously need to replace `echo line` with other desired processing and learn how to do that.
You might also want to consider `rp` which is a program generator-compiler-runner along the lines of `awk` but with all the code just Nim snippets interpolated into a program template: https://github.com/c-blake/bu/blob/main/doc/rp.md . E.g.:
$ ls -l | rp -pimport\ stats -bvar\ r:RunningStat -wnf\>4 r.push\ 4.f -eecho\ r
RunningStat(
number of probes: 26
max: 31303.0
min: 23.0
sum: 84738.0
mean: 3259.153846153846
std deviation: 6393.116633069013
)
Depending upon how balanced work is on either side of the pipe, you usually can even get parallel speed-up on multicore with almost no work. For example, there is no need to use quote-escaped CSV parsing libraries when you just read from a popen()d translator program producing an easier format: https://github.com/c-blake/nio/blob/main/utils/c2tsv.nim
Good ergonomics for Perl-style quick and dirty text processing were part of the original design goals for Ruby. Those parts of the language are still there. You can write code that feels more concise than Python yet, IMO, tends to be more readable/maintainable than Perl can stereotypically be. Modern style guides, however, de-emphasize that style of Ruby since it might not be the most appropriate in the context of say a large Rails project.
Python, to me, is too far away from shell/unix. It is a programming language for writing applications. For the use case of writing shell scripts in a more powerful language, Perl is still the king here (or it should be; sadly that doesn't appear to be the case. No one is using it except die-hard graybeards.)
Raku is a modern (if still big) language with the kitchen sink included. Again, there doesn't appear to be much uptake.
Unpopular opinion, but I used Haskell "scripts" with relative success for a while. Stack has a nice script interpreter mode that is runnable in the familiar #! way.
It even allows you to add dependencies and, if necessary, compiles the script on the fly.
I once read a HN thread that recommended Go for this, and it made me interested. I think it was a useful suggestion, it made me learn Go, and I kind of agree with it, 5+ years later. It is not a shell, but it is simple and fast and useful.
Meanwhile, Python's Dask is very well funded to be cloud-native, and also runs locally. However, it relies on a Python runtime, so you know... Also, not sure about the Dask license terms.
Your find exec problem can be trivially solved with `-exec /bin/bash -c "script"`, or you can spend a little extra time figuring out how to properly structure your scripts in such a way that the invocations just flow with little more than an invocation + getopts.
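For example, the usual shape of that, passing filenames as positional parameters instead of splicing them into the script text:

    find . -name '*.txt' -exec bash -c 'wc -l "$1"' _ {} \;                          # one file per shell
    find . -name '*.txt' -exec bash -c 'for f in "$@"; do wc -l "$f"; done' _ {} +   # batched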
If you feel like the answer is rewriting the shell, the answer is practically never rewriting the shell. It's learning to use it.
Since nobody asked, I'm reiterating my position that computers that can effectively utilize parallelism simply aren't available today. I've always wanted a computer with at least 256 cores and local content-addressable memories beside each core to send data where it's needed. By Moore's Law, we could have had MIPS machines with 1000 cores around 2010, and 100,000 to 1 million cores today, for under $1000.
Contrast that with GPU shaders, where one C-style loop operates on buffers separate from system memory and can't access system services like network sockets or files. GPUs have around 32 or 64 physical cores, so theoretically that many shaders could run simultaneously, although we rarely see that in practice. And we'd need bare-metal drivers to access the GPU cores directly; does anyone know of any?
The closest thing now is Apple's M1 line, but it has specialized NN and GPU cores, so it missed out on the potential of true symmetric multiprocessing.
The reason I care about this so much is that with this amount of computing power, kids could run genetic algorithms and other "embarrassingly parallel" code that solves problems about as well as NNs in many cases. Instead we're going to end up with yet another billion dollar bubble that locks us into whatever AI status quo that the tech industry manages to come up with. And everyone seems to love it. It reminds me of the scene in Star Wars III when Padme notes how liberty dies with thunderous applause.
1) Amdahl's law means it's not useful to have hundreds of cores for general-purpose computing. There's not that much parallel work to do in typical applications. Increasing the proportion of work that's parallelizable for a given application pays dividends when you have more cores; that's why Servo is so exciting. In some cases, picking an O(n²) algorithm that's easy to parallelize will be faster than a less parallelizable O(n log n) algorithm; this is true for problems like Single-Source Shortest Paths (SSSP).
2) Shared resources (in-memory mutable data, hardware devices) mean the ratio of contention to CPU work goes up when you have more cores.
3) Cores on a single die need to share the same constraints - thermal limits and transistor count. So you're best off having enough powerful cores to get you to a sweet spot of single-core performance vs multi-core parallelism.
4) It's hard to provide a performant and useful many-core machine model. Cache coherence makes it easier to program a many-core machine, but limits performance. Without it, you're stuck with distributed systems-style problems.
This exists now. Some AI accelerators are a grid of independent compute units, each with their own memory and message passing between them. Graphcore's IPU is an instance.
An AMD GPU is a grid of independent compute units on a memory hierarchy. At the fine grain, it's a scalar integer unit (branches, arithmetic) and a predicated vector unit, with an instruction pointer. Ballpark of 80 of those can be on a given compute unit at the same time, executed in some order and partially simultaneously by the scheduler. GPU has order of 100 compute units, so that's ~8k completely independent programs running at the same time.
You've got a variety of programming languages available to work with that. There's a shared address space with other GPUs and the system processors, direct access to system and GPU local memory. Also some other memory you can use for fast coordination between small numbers of programs.
There's a bit of a disconnect between graphics shaders, the ROCm compute stack and what you can build on the hardware if so inclined. The future you want is here today, it just has a different name to what you expected.
OK, if I can transpile C/C++, Rust, or TypeScript to that and have full access to memory, threads, system APIs, network sockets, etc., then that would work for the use cases I have in mind. Running MIMD processes on SIMD hardware is something I'm definitely interested in.
If there's no straightforward way to do that, then I'm afraid that hardware represents a huge investment in the wrong direction.
Because a GPU can be built from the general-purpose multicore CPU I'm talking about. But a CPU can't be built from a GPU.
What I'm getting at is that if I have to "drop down" to an orthodox way of solving problems, rather than being able to solve them in the freeform way that my instincts lead me, then I will always be stifled.
1000 cores?? I don't have 100 cores! What do you even need 10 cores for? Well, here's 4 cores. Give 2 to your brother. Don't go wasting all those hyper threads all at once!
Sorry, but we do have computers with 256 cores. I used to have this excuse back when processors only had 4 cores. When you consider that processors lower their turbo boost frequency as you use more cores and there is overhead from synchronization, your 4-core processor may only give you a 2x performance benefit at the expense of your code becoming difficult to reason about (depending on the problem at hand). Nowadays 8-core processors are quite cheap, below 200€. At a 4x performance boost, and easily 12x or more if you are willing to spend the money, it is definitely worth it.

The caveat of course is that there aren't actually that many programs that need the full power of your processor. The most common exception is a video game that was developed for a limited number of players, or even single-player, but then the multiplayer version of the game becomes extremely popular and you get servers with 60 or even a hundred players, way beyond what the developers planned to support. Supporting multiple cores was not a priority, and then very suddenly it becomes the biggest bottleneck.
The real problem we are facing is that our programming models aren't parallel by default.
>By Moore's Law, we could have had MIPS machines with 1000 cores around 2010, and 100,000 to 1 million cores today, for under $1000.
You can have 10,000 RISC-V cores on an FPGA, but nobody cares. Why? Because even a bit-serial processor (meaning it processes one bit per clock cycle, or 32 clock cycles for a 32-bit addition) runs into memory bandwidth limitations very quickly if you have enough of them. Main memory is very slow compared to registers and caches. The only way to utilize this many cores is with a workload that is entirely latency bound, where your memory access pattern is perfectly unpredictable. The moment you add caching, the number of cores you can have shrinks dramatically, and companies like AMD are not slimming down their CPUs; they are adding more and more cache. Their highest-end processors have almost a gigabyte of cache.
I agree about the programming models not being parallel by default, and that's one of the things that I specifically rail against in most of my comments. MATLAB/Octave is a good introduction to what parallel programming could be. I also rail against the endless doubling down on large caches, because the multicore design I have in mind would mostly eliminate cache and use that die area for cores and local memories.
I think we're slightly talking past each other here though. The CPU I want to build would have around 10-256 cores on 90s tech. So the same transistors holding 1 Pentium Pro would allow for 1-2 orders of magnitude more MIPS or RISC-V cores and local memories. The design is so simple that I think that's why it was missed by the big fabs.
Today there's little demand for 1000+ cores, but that's partly because nobody can see what they could do. But we can't design the thing, because the status quo has us all working pedal to the metal in first gear to make rent. It's a chicken and egg problem that has a lower likelihood of being solved as time goes on. Which is why I think we're on the wrong timeline, because if the system worked then actual innovation would become more accessible over time.
Programmability is always the biggest issue, and that's not really a chicken-and-egg problem, because decades of research have gone into writing compilers and languages for massively parallel machines; it's just hard, some would say intractable (and local memories tend to make programmability issues worse). There are niche or embarrassingly parallel problems that will run great. But it's hard to sell hardware that will solve only some of your problems well. And GPUs have taken over for many of those very regular problems as well.
Arguing about where we should be based on a projection of an empirical exponential curve seems pretty irrational. Nothing in reality is exponential forever.
Typical GPUs easily have 6000+ shaders (aka kinda-sorta like cores) on the more expensive end.
At least, 6000+ 32-bit multiplies per clock tick on ~2GHz+ clocks. Even cheap GPUs easily are 2000+ shaders.
> GPUs have around 32 or 64 physical cores
NVidia SMs and AMD WGPs are not "cores", they are... weird things. They have many shaders inside of them and have huge amounts of parallelism.
As far as grunt-work goes, a "multiplier unit" (literally A x B) is perhaps the most accurate count to compare CPU cores vs GPU "cores", because the concept of CPU-core vs GPU WGP / SM is too weird and different to directly compare.
Split up that WGP / SM into individual multipliers... and also split up the ~3 64-bit multipliers or ~48 CPU SIMD multipliers per core (3x 512-bit on Intel AVX512 cores), and it's perhaps a fairer comparison point.
---------
Back 20 years ago, you'd only have 1x multiplier on a CPU core like a Pentium 4, maybe as many as 4x with the 128-bit SSE instructions.
But today, even 1x core from Intel (3x 512-bit SIMD) or 1x core from AMD (4x 256-bit SIMD) has many, many, many more parallel elements compared to a 2004-era CPU core.
There's also the full crossbar, allowing each shader to individually issue a fetch from memory. The shared memory space is not like cache, but is instead a shader-to-shader communication scratchpad.
And there's atomics support, with coalescing of atomics.
-------
I mean hell: what is a core? Do remember that on SMs, every single shader (not SM) has its own instruction pointer.
Is the shader a core? No, not really. But SMs aren't a core either.
I wouldn't compare GPU and CPU architecture at all. They're just different. What I did above, breaking both down into individual multipliers and then counting them, seems like the best way forward, especially as we remain multiplier-bound in practice.
I've read a lot of this kind of post. Years ago I recall someone bleating for 8 cores when 1 or 2 was the norm. Now you want 256. The next generation will ask for thousands. All for nothing, because you have no idea what to do with it except give the handwaviest justifications. A computer's a tool to do an actual job. You can, and probably do, have more computing power on your desktop than all the world's supercomputers from the 1970s put together.
GNU Parallel has been one of my go-to tools for accomplishing more on the terminal. Generating test data, transferring data from one node to another using rsync, running many-task, embarrassingly parallel jobs on HPC, and pipelines with simple data dependencies run over hundreds of files are some of the places where I use GNU Parallel.
Many thanks to Ole Tange for developing the wonderful tool and helping the users on Stack Overflow sites to this day.
It's more that GNU Parallel has host groups in a config, so you can send files for a job to the right node where it's going to execute and bring things back. Essentially it can turn a local xargs-type job into any kind of remote task execution, including dealing with files that are local but need to be remote.
GNU Parallel is great for the kind of tasks highlighted in the post. Note that being written in Perl, it's slower than its simpler C counterpart, moreutils parallel. And in many use cases, xargs --max-procs=$(nproc) can replace it.
    # one grep per core; each parallel slot appends to its own file via $s
    xargs -P "$(nproc)" --process-slot-var=s sh -c 'grep X "$@" >>/tmp/g.$s' d0
    cat /tmp/g.*
You can also cobble together that second style with a custom config setup wherein a command is given $s and responds with some host names and there might be an `ssh` in front of the `grep`, for example. That `d0` argument (for $0) is a bit janky and there can be shell quoting issues, of course. But then again, you may not have hostile filenames/whatever. Remote loadavg adaptation might be nice, but then again, maybe you control all the remotes. Similarly, I could not get back-to-back executions of the EPOCHREALTIME thing closer than 250 microseconds. So, collision basically will not happen even though it probably could in theory.
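A rough sketch of that second style, to make it concrete (hosts.txt, the slot-to-host mapping, and the assumption that the files are reachable on the remote side are all mine):

    xargs -P 4 --process-slot-var=s sh -c \
        'ssh "$(sed -n $((s+1))p hosts.txt)" grep X "$@" >>/tmp/g.$s' d0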
I'm using task spooler a lot for parallel background processing. What I like most is the ability to add further tasks to the queue after processing has already started.
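A minimal sketch of the pattern (the lame transcode is just a stand-in batch job):

    ts -S 4                                        # run up to four jobs at once
    for f in *.wav; do ts lame "$f" "${f%.wav}.mp3"; done
    ts                                             # with no args: list the queue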
Never knew about this, thanks! I'll definitely try it because `parallel` has bitten me before in a few more advanced cases. It has rough edges here and there.
I installed task-spooler just now, because I’ve been wanting something like this for a long time.
It looks like the actual name of the task-spooler command on Debian after install is “tsp”, not “ts”. So no collision :)
Now it just remains to be seen whether the package by default allows the tasks to continue running after I log out, or whether systemd will annoyingly kill the tasks after I disconnect from ssh, the same way systemd annoyingly kills my “screen” sessions when I disconnect, and there is some cumbersome thing you have to do on each of your systems to have systemd not kill “screen” :(
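(For reference, the cumbersome thing is usually logind's KillUserProcesses setting, or per-user lingering:)

    # /etc/systemd/logind.conf
    KillUserProcesses=no
    # or, per user:
    loginctl enable-linger "$USER"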
moreutils also clashes with parallel, does it not? I remember installing some package for chronic and thus breaking GNU parallel, at least back in the late 2010s.
Is the author still adding the "cite me or pay 10000€" notice to the output? And calling that GPL?
And still answering every xargs Stack Overflow question with "you should use GNU Parallel" instead of answering the question? That really gets old quickly when googling for xargs answers.
These are just some of the reasons I'll never use parallel. xargs is perfectly fine for most use cases, and it can do everything I need it to.
> Is the author still adding the "cite me or pay 10000€" notice to the output? And calling that GPL?
IIRC the citation notice was cleared by Stallman as GPL compatible. I’d be surprised if anyone’s paid, I assumed that’s rhetoric to imply the value of a citation, or lack of citation, for anyone publishing scientific works.
> These are just some of the reasons I’ll never use parallel.
Hey I’ve actually ranted on HN before about the citation notice (e.g. https://news.ycombinator.com/item?id=15319715) - in part because I find the language of the notice a little misleading; it’s not tradition to write citations for tools used to conduct research, and it’s a requirement (not just tradition) to cite academic sources. If I used parallel to speed up some calculations, that doesn’t justify an academic citation. I don’t cite bash or python or C++ when I write papers either. On the other hand, if I’m writing a computer science paper about how to parallelize code, and especially if I compare it to GNU Parallel, then a citation isn’t optional, and I don’t need a guilt trip to add one, it’ll get requested in review, and rejected without one. Is there even a journal publication to cite? (Edit: found it - the request is to cite an article in USENIX magazine.) So I find the notice a little irritating and I’m not sure who it’s aimed at exactly, or what the history of Ole feeling snubbed by scientists really is. Maybe some people were trying to compete with GNU Parallel and failing to cite it? Maybe Ole is paid by an organization that appreciates citations and will continue to fund development on Parallel if there’s evidence of its use in academia?
That said, GNU Parallel really is totally awesome, the documentation is amazing, and the citation notice is a one-time thing you can silence permanently. I don’t think the notice is a good reason to never use Parallel, and I do think Parallel is worth using, FWIW.
> it’s not tradition to write citations for tools used to conduct research.
This is true, but it also makes it very hard for academics and PhD students who mainly write software rather than papers. They get no citations and eventually have to leave academia.
If we had a better practice of citing central software we use - at least the academic software that wants to be cited - we could have a more flourishing ecosystem of such software funded by the universities.
I can understand that, and I can understand Ole’s request for citations - Googling him it looks like he is (or has been) employed by a university.
The good news is that the new ‘tradition’ these days for academic software is to open-source all the software written for a paper or academic project, so practically everything done is visible on github & arxiv.
> They get no citations and eventually have to leave academia.
You're welcome?
Seriously though, adding the citation nag to software is two wrongs not making a right.
As a software user, it isn't my fault academia hasn't figured out how to reward software contribution. If they can't figure it out, finding a greener pasture makes a lot of sense.
> Can't you just mute it and continue using the software without worries?
It _seems_ like a reasonable thing to ask, it's a minor inconvenience, really, so what's the big deal?
The big deal is that the behavior doesn't fit the unix philosophy. Tools are meant to do one thing, and do it well. They get composed in pipelines to get jobs done. In these pipelines, the communication medium is text, via stdin/stdout/stderr. If a tool is unpredictable in what it puts out via text, it can make the whole pipeline unpredictable, or at least more complicated.
If it _was_ okay, we should welcome everyone putting nag features in these simple cli tools, right? Well, I'd be on board with that as long as I can blanket disable all of them. If not, let's just leave our political/professional/begging messaging outside our computing tools. Okay?
> it’s not tradition to write citations for tools used to conduct research
Academics seem to have a very blinkered attitude to this. I wrote some software that was popular for a while in a niche field, and people were forever asking me to waste my time by 'publishing' the manual in some pointless journal so that they could cite something and give me credit. Writing useful software counts for less in that world than publishing another pointless paper that no-one will read.
That doesn't really help. People already know how to paste the URL for a piece of software into a paper. It's more that it doesn't count for anything (because it's a piece of software and not a paper).
That is a weird request; a source doesn’t need to be published in a journal to be citable. On the other hand, if you put a bibtex snippet on your site that indicates how you’d like to be cited, that is super helpful.
I wasn't very clear in my comment. I think the idea was that if the thing they were citing was a journal article, then the citations would actually mean something, as widely-cited journal articles are one of the currencies of most academic fields. While one certainly can cite an unpublished document, the author doesn't necessarily gain much from it in terms of their academic CV.
If someone writes a piece of software specifically for the purposes of doing certain types of scientific research, and then other scientists use this software to help conduct published experiments, then IMO it really ought to be possible to give that person meaningful credit for their work. It's a perfectly legitimate way to contribute to a field, even if it does not take the form of a paper. But with the system as it stands, the only way to get meaningful credit is to publish a pointless paper saying, in effect, "Hey! I wrote some software!"
>if you put a bibtex snippet on your site that indicates how you’d like to be cited, that is super helpful.
I should probably have done that, but from my point of view it didn't really matter. I have a name, and the software had a website. I didn't really mind exactly how individual people chose to cite it. The absence of a ready-baked bibtex snippet would never be accepted as an excuse for failing to cite any other kind of source.
Ole did provide a BibTeX entry to a USENIX magazine article about Parallel, which is fine, though I was always taught that non-journal references generally belong in footnotes or an appendix and not the bibliography, especially for something you’re not referencing for research purposes. Not sure if footnote or appendix or open-source usage citations count for what Ole needs; I’d guess he wants citations you can easily index using Google Scholar or other citation indexes, i.e. it should count toward Parallel’s H-index (https://en.wikipedia.org/wiki/H-index)
> and the citation notice is a one-time thing you can silence permanently
This doesn't scale. Imagine if all the software you used nagged you and had their own individual methods to silence them. I don't think this would be reasonable.
Lots of software nags with something when it first starts up. It’s mostly annoying, but doesn’t seem to have a scaling problem.
zsh gives you a config wizard, sudo admonishes you to use it responsibly, just about every iOS app and an increasing number of desktop apps gives you a few pages of “what’s new” every time they’re upgraded. Desktop apps have given tips-on-startup since the 90s.
I agree, and I think I might have even used this argument before. ;)
It does scale solely for GNU Parallel though for now, and very few other people have taken the same tack as GNU Parallel’s citation notice. Despite the potential for a slippery slope, it doesn’t seem to be happening. I’d speculate that if it did start to happen, then GNU would change their stance on what’s allowed by the license, perhaps.
sure, but the other guy wants me to submit a picture in a funny hat, not a citation. and the third guy wants me to add some additional legal provisions and disclaimers to the GPL license.
you need a more generalized --clickwrap-consent parameter really. One that just says "whatever it is, I accept and I'll do it".
And that's exactly the thing GPL was supposedly founded to get away from. Restrictions on user freedoms. Especially violations so routine and tedious that we open-palm-slam "accept" without reading them.
You could absolutely write this to not look like a clickwrap agreement and lean on users. "please cite me, I'm an academic and impact matters" in the manfile or --help is not something anyone would ever get upset about or probably even patch to remove.
The only reason it's OK is because basically everyone knows it's not enforceable because of the severability part of the GPL. But it's blatantly designed to look like a serious and enforceable notice to users who don't know that, and require affirmative action from the user to "consent" and bypass the screen. And clickwrap agreements of this type are generally enforceable if there is not something like the GPL that allows you to ignore it.
like I flatly do not get why this is even debatable or questionable, the dude is trying to pull a fast one on users with a scary-sounding legal notice that implies that you need to accept this clickwrap agreement. and it's not entirely clear that he cannot actually burden you with this in all jurisdictions, since it's an agreement between you and the author that exists outside the actual source code/distribution. You can end up paying for free stuff in lots of places in life, if you're not aware about what "should" be free, and those agreements stand and are enforceable even though the thing was supposed to be free. You agreed to it. You don't have to, the GPL says that, you can edit the software to remove it without consenting, but you did accept it.
Letting the camel's nose under the tent on clickwrap agreements on GPL'd software is such an incredibly bad idea legally and morally, and this dude has been an utter dick about anyone who questions that. Sure, "he's willing to do it and nobody else is stepping up" but on the other hand he's also going off and attacking other maintainers doing their jobs, too. But that's not Stallman's problem I guess. That's another problem that only works with N=1 jerk, if that was normalized we'd have a problem.
I do not get why this guy is getting this special blessing or dispensation from FSF. Like it's not just that he's a random weirdo releasing under GPL and then trying to add additional terms (lol get stuffed), this is all occurring with the FSF's blessing, Stallman's signoff, and in the GNU distribution. Official GNU clickwrap license I guess.
At the end of the day - if the guy can't be satisfied with a polite request in the manfile, wow that sucks. But the GPL isn't about you, it's about the end user. There are explicitly licenses like BSD that require acknowledgement if that's your thing!
It seems to me that citing R (or some other software tool) makes sense when it spares the author the task of providing detailed explanations.
Administrators who gauge work quality by counting citations are not helping the world much. Maybe it's time we started citing administrators who help us in our work ... so that their administrators can get rid of them if they are not helping. But of course I'm dreaming in technicolour -- administrators are never really subject to review, it seems.
> IIRC the citation notice was cleared by Stallman as GPL compatible
Do you have a source for this? I’m confused by this, as the GPL section 7 is pretty clear that additional restrictions are effectively void. I suppose it’s technically not contrary to the GPL to idly state those restrictions, but it is contrary to the GPL to expect them to do anything. If the author is deliberately including an impotent clause in the hope that people will follow it anyway, I feel that trying to confuse or scare people into doing something the GPL gives them explicit permission to do is contrary to the spirit of the GPL.
Furthermore, trying to retaliate against people who (as permitted by the GPL) remove the citation notice, as the author here has done, seems very contrary to the spirit of the GPL.
I think the confusing issue here is that the notice is not a license requirement, it does not add additional licensing restrictions. It’s an honor-system agreement between the user and Ole, and does not involve the GPL. It does seem to be walking a very fine line, and it’s easy for users to not understand the distinction, but I believe the notice does adhere to the GPL’s rules, even if it doesn’t initially appear to for us non-lawyers.
Yes, to me it looks like he’s adding an official license-like note, but then declaring that he’s still GPL compliant because although his note is easily confused with a license it’s not actually a license. He then gets cranky if people remove his not-a-license note or don’t act like it’s a license. Feels very much to me like he’d be better served with something other than the GPL if he doesn’t want people using his software in GPL-permitted ways.
I hope that Stallman’s future opinions have no impact on GPL-licensed software. He is the main author of the license, but I wouldn’t bet that what he says years later has to be considered.
I’m not sure I understand what you’re asking. Why is the intention of the author of the license in question?
In this case, Stallman simply clarified that Parallel’s notice did not count as a legal requirement and does not conflict with the GPL. His opinion wasn’t necessary, but since he wrote the license, it is authoritative. In this case, the question wasn’t brought to court, it was simply a clarifying discussion, and thus his intention did affect how things go in practice.
> And specially someone who is neither licensee or licensor?
Also wasn’t Stallman effectively the licensor or representing the licensor at the time, as president of FSF, head of the GNU project, and author of the GPL?
>His opinion wasn’t necessary, but since he wrote the license, it is authoritative.
No it isn't. Licences, like most legal documents, are construed objectively. The subjective intention of the author is totally irrelevant to the meaning.
You might have misunderstood what I said. It’s not up for debate whether RMS’s opinions or intent on the GPL have affected industry practice; that’s a fact of history. His statements on the GPL are authoritative in the sense that they may have prevented the courts from examining this question.
What meaning are you thinking of, exactly? I looked up the definition and it matches what I intended to say in every dictionary I checked (Merriam Webster, Oxford, Cambridge, Dictionary.com…) Some of the definitions seem more or less synonymous with “influential”, maybe you’re making some incorrect assumptions?
RMS' comments and interpretations of the GPL have generally influenced how the industry deals with the GPL, and to the extent it has become de facto standard practice, the courts might take notice and take it into consideration.
Generally though, the contents of the text and (if applicable) case law surrounding its interpretation is more important.
the difference between law and code is the interpreter: law is interpreted by humans, significantly based on perceived intent, and code is interpreted by computers, which are expected to only act literally and deterministically.
I still disagree with arbitrary, just less strongly.
Whether his opinion on GPL is relevant, or if it is, how important it is, is up for debate. But I still don’t think it’s “based on a random choice or personal whim, rather than any reason or system”.
You’re referring to a different kind of dispute than I (and parent) was talking about. In this case, Stallman did handle the dispute about whether GNU Parallel’s citation notice conflicted with the terms of the GPL.
Okay that makes sense. You're saying that since the GPL itself is not open, that it needs Stallman's approval for modifications that are not explicitly allowed. And I was saying that it does not necessarily mean those modifications are enforceable between two parties in a random jurisdiction, which comes down to courts and whatnot.
Yes kind-of… in this case Parallel’s notice is not a modification of the license at all, and Stallman is the person who ruled on this question and confirmed this to be true. The GPL doesn’t prevent authors from including a notice, and having a notice doesn’t conflict with the terms of the GPL.
I feel like the whole problem here is that the legality of Parallel’s notice, and the separation of the notice from the GPL, is not at all clear. The language is confusing to users. People who take the license seriously are staying away from Parallel because of the fear of accidentally breaking the license terms.
That’s not true. The language of Parallel’s citation notice, while confusing to some users, does not impose any legal requirements and is not part of the license. Neither the notice nor the license claim otherwise. RMS, and more importantly, Ole Tange, agree that Parallel’s notice is not legally binding, and intended to write it that way, and there is a publicly visible history of this intention and agreement.
Indeed, and people are choosing not to use Parallel for the same reason. The notice would be much better IMO from the user perspective if it was more clear. I guess that’s maybe the point, to leave people with the mistaken impression that this is a binding agreement.
RMS, not being a judge, is incapable of "authoritatively" or otherwise determining whether this notice is legally binding.
If it is something that needs to be "confirmed" by someone "authoritatively" then you should ask a lawyer for advice. You should not ask a programmer for a "ruling".
What RMS might be saying is "we won't seek to enforce it". That is completely different.
> What RMS might be saying is “we won’t seek to enforce it”. That is completely different.
If you review the thread from the top, you might find the primary question we were discussing from the start before you jumped in is whether the Parallel notice is GPL compliant. Whether Parallel’s notice is definitively and absolutely legally binding on its own and away from the GPL is a nuance you introduced, but it has been answered for all practical purposes by both Ole and RMS. It will probably never go to court or be tested by a judge, partially as a result of what Ole and RMS have said: that the notice is not a license and is not contractual.
There is no dispute about this, and because there is no dispute and because it’s not going to court, the statements by Ole and RMS are the most definitive answer we’ve got, and to date they are what people are using when making and acting on decisions about Parallel usage. Both of them have said the Parallel notice complies with the GPL because the notice is not legally binding, so Ole & RMS both were saying more than that GNU won’t seek to enforce Parallel’s notice. “Academic tradition” is not legally binding law, and the notice doesn’t reference any other relevant law. The notice is full of legal holes, if you insist on interpreting it as a legal contract. It was written by Ole (not a lawyer) and doesn’t define what research usage would constitute a mandatory citation, nor what happens if the user doesn’t see the notice, or if a citation is inappropriate, or if the citation is rejected by reviewers, among many other possibilities. It doesn’t take a lawyer or judge to see that the Parallel notice is not legally enforceable, and it doesn’t take a legal education to see that it’s not Ole’s intent to enforce it as a contract. He is just asking for citations, in slightly confrontational language.
It would be fair to say that a judge or court, if this issue was ever tested in court, might overrule some aspect of Ole’s or RMS’s stated intent because their language was imprecise and effectively said something different than they meant. Then again, another judge can override the first judge. There’s nothing definitive or absolute or permanent in law, regardless of whether a judge rules on it, and intent does matter in practice. Before this ever goes to court (probably never), all questions on this topic can be (and already are!) answered by non-judges, which is why it’s demonstrably not true to claim this question can only be answered in court or by a judge.
> You should not ask a programmer for a “ruling”.
RMS wasn’t acting as a programmer when he wrote the GPL, btw, nor when he opined on whether Parallel’s notice complies, so in that sense your framing is veering into the hyperbolic.
- # *YOU* will be harming free software by removing the notice. You
- # accept to be added to a public hall of shame by removing the
- # line. That includes you, George and Andreas.
[david@pc ~]$ echo foo | parallel echo
Academic tradition requires you to cite works you base your article on.
If you use programs that use GNU Parallel to process data for an article in a
scientific publication, please cite:
Tange, O. (2023, July 22). GNU Parallel 20230722 ('Приго́жин').
Zenodo. https://doi.org/10.5281/zenodo.8175685
This helps funding further development; AND IT WON'T COST YOU A CENT.
If you pay 10000 EUR you should feel free to use GNU Parallel without citing.
More about funding GNU Parallel and the citation notice:
https://www.gnu.org/software/parallel/parallel_design.html#citation-notice
To silence this citation notice: run 'parallel --citation' once.
foo
[david@pc ~]$
It shall not be. It is a notability indication, not an endorsement; just take a look at other release names from the time of the war, or compare to Time's "Person of the Year".
The author is having Musk-ish fun, and that's their freedom. My freedom is to feel disgust at seeing mass murderers, even when they are treated the same as uncontroversial topics.
You can add any message you want into your GPL program. Also, a GPL program does not have to be free.
This has nothing to do with the GPL. You can say in your program that 'by using this software you agree that you're a cat' and license it under the GPL.
That does not mean the GPL relates to cats in any way.
You can add all the extra restrictions you want, but they effectively won’t do anything. Expecting both the GPL and the additional restrictions to apply is a violation of section 7 of the GPL.
> All other non-permissive additional terms are considered “further restrictions” within the meaning of section 10. If the Program as you received it, or any part of it, contains a notice stating that it is governed by this License along with a term that is a further restriction, you may remove that term.
If I put the GPL in my software and add a file next to it that says "Also you can't use this software if you make more than $100k/year", I've pretty clearly added an additional clause that's incompatible with the GPL.
If I say "Please don't use this software if you make more than $100k/year" I haven't added an additional clause, just communicated a desire. I'm as annoyed by parallel's citation nag (particularly since I don't plan on ever publishing a scientific paper), but it does not impose extra requirements.
I think I agree with you: if you put in requests that have no mandate or attachment to "or this changes your rights under the license", there's no issue.
But I was responding to the comment upthread "You can add any message you want into your GPL program". If you add a message that says a user of the software must do something / must agree to additional terms / etc, that additional text is not compatible with the GPL. I'm not a lawyer, so I have no idea whether the result would be that the restriction doesn't count and the software is GPL'd, or that the software isn't viably GPL'd because the GPL+clause isn't a valid license for somebody to use.
The author of a software package can revoke his license. The author can provide the same package under multiple licenses, for example GPL and a proprietary license, like Qt. The author may say that user A may use the GPL license only, while user B may use the proprietary license only.
Eh, this is kind of right but also not really responsive to the thread.
For one thing, if the author provides the source with a GPL license to user A, and user A sends it to user B, user B has the software under a GPL'd license. The normal reason for dual licensing GPL/proprietary is so that user B can pay money to bundle the software in a non-GPL-compliant way. The author can stop licensing future releases under GPL, but they can't revoke the GPL on already-distributed software.
For another, this isn't what's happening here. GNU Parallel is released under the GPL, and the author is affixing what is debatably an additional term to the GPL'd release, under the claim that it doesn't count as an additional restriction because it's "academic tradition". By the same token, I can add a clause to my software saying that rich people can't use it, because it's "hippie tradition" to stick it to The Man.
It basically means that clause doesn't do anything. If it is phrased as a request, rather than a requirement then it doesn't violate anything for sure.
The author of Notepad++ for example is famous for adding all kinds of statements associated with the software and in no way is that part of the license.
On the other hand, if your license.txt states to i.e. not use the software for evil aka JSON famously did then yes, it is part of the license.
The message isn't part of the license, and it's phrased in a way that wouldn't be binding if it were.
It says "please cite" and "feel free to not cite if you pay".
It doesn't say "must cite" or "you may only not cite if you pay".
IANAL, but it doesn't seem like it would interact with the GPL at all. So the worst that could be said is that the implementation is annoying or in poor taste.
Software cannot be distributed with a clickwrap agreement under the GPL. Requiring the user to affirmatively agree to a contract is a clickwrap agreement even if the terms are non-monetary. The old “you are making a second agreement, not the one the software is distributed under” approach.
Notionally the GPL allows you to disregard this but it may or may not be binding depending on your jurisdiction, and it’s certainly distasteful and against the absolute spirit and most likely the text of the GPL. This is an incompatible term being forced on the end user and the entire license might well be void.
did you ignore all the stuff before that about the implication of using that option and what the user agrees to by using it?
this is like saying that a user doesn't actually agree to anything just because they clicked "accept" in a EULA. you're just clicking buttons in software, it doesn't obligate you to anything!!! but actually yes that is most likely fairly binding in a lot of jurisdictions.
that is, again, literally the definition of a clickwrap licensing agreement and you cannot do that in GPL software, even if it's non-monetary. Requiring the user to submit a selfie in a funny hat would not be permissible under the GPL either. You can't limit what the user does with the software and how, or else it's not GPL.
it's open and shut, clickwrap agreements completely subvert and nullifies the moral stand the FSF is trying to make. And it doesn't matter how innocuous it seems, it undermines the whole point of the exercise.
fortunately the GPL includes a "severability" clause that basically allows you to ignore this and grants you a license regardless. but it is not a good look, it is not good behavior, and if every GPL'd package started adding random clickwrap agreements with big "IM A DOODOO HEAD IF I IGNORE THIS" parameters the whole ecosystem would degrade.
Arch and others are not only allowed but actually morally and practically in the right for stripping these messages, and it doesn't reflect well on Ole at all that he then goes on and throws more tantrums because he doesn't like the consequence of the license he chose.
If he wants to go proprietary, or BSD (which requires acknowledgement!), that's fine, but he's being a child and the terms he is adding are utterly noncompliant with the GPL, and it's unprofessional for the FSF to even humor him on this. If there were a hundred Oles, the FSF would have a real problem on its hands; it's only because he's an N=1 jerk that this is remotely tolerable.
It doesn't look like it obligates me to do anything. It contains a request to cite, not a requirement:
If you use programs that use GNU Parallel to process data for an article in a
scientific publication, please cite:
Tange, O. (2023, July 22). GNU Parallel 20230722 ('Приго́жин').
Zenodo. https://doi.org/10.5281/zenodo.8175685
This helps funding further development; AND IT WON'T COST YOU A CENT.
If you pay 10000 EUR you should feel free to use GNU Parallel without citing.
More about funding GNU Parallel and the citation notice:
https://www.gnu.org/software/parallel/parallel_design.html#citation-notice
To silence this citation notice: run 'parallel --citation' once.
[edit]
This is really similar to the kerfuffle where the maintainer of Home Assistant asked distros to not repackage HA; it's a request, a bit at odds with community norms for libre software, but one that people are legally free to ignore.
You cannot include a message that requires the formation of a binding contract. This is the old “you can fire someone for no reason but not any reason”, and if the message your product shows is a prompt forcing the user to agree to a binding contract, it’s not GPL compatible.
I agree that in this case it’s likely not enforceable/binding especially since the GPL specifically allows you to ignore those terms. Hopefully that’s legally binding in your jurisdiction vs the other party.
But it’s a straightforward clickwrap agreement, even if the terms are non-monetary the GPL simply doesn’t allow these at all. Can’t place any stipulations on how the user uses the software.
If you're a startup and relicensing previously open source code under a restrictive license or doing other shady things you'll get plenty of defenders to line up to say 'hey, they have to make a living somehow', but if a single guy tries to make a living via a simple message in a widely used program all hell breaks loose.
Notably, GNU Parallel did not relicense; it's still GPL. The author wants to have his cake (gain the popularity benefits of being a GPL-licensed GNU tool, be able to carpetbomb Stack Overflow with "use GNU Parallel" answers, etc.) and eat it too (get people to cite or pay him as a condition of using the product). Since this isn't possible (GPL doesn't allow additional restrictions), but the author still really wants it, he went the route of making the extra condition non-legally-binding but then getting publicly upset at people for using the product under its actual license. That's the part that GNU Parallel is doing that people don't like, and that other projects are not doing.
The startups you mention actually changed their license. That's what GNU Parallel would have to do to make this extra condition ok, but he won't do it because being a GPL-licensed GNU tool is critical to its popularity in the first place.
Yes. The GPL explicitly says this about "further restrictions":
> If the Program as you received it, or any part of it, contains a notice stating that it is governed by this License along with a term that is a further restriction, you may remove that term.
However, this doesn't really even come into play because the citation request is not a restriction on the license. It's not anything. As far as the GPL is concerned, it's just some code, and the GPL grants you the right to redistribute modified copies.
And by renaming it to "free-parallel" you have respected the author's trademark. You can absolutely do this, at the cost of the author being upset at you. They might get upset that "free-parallel" is too close to their "GNU Parallel" trademark but I (IANAL) don't think they'd be legally right about that. GNU Parallel coexists with other software called "parallel".
They ensured that the citation request was not actually an additional requirement and has no legal meaning. Beyond that, GNU's interests are better served by retaining GNU Parallel as a GPL-licensed GNU product than by losing it to another organization or another license. I wouldn't expect movement from GNU beyond their existing acknowledgement that the citation request is not a legal requirement and does not modify the GPL. In any event, GNU tends to be hands-off on contributed packages (i.e. the ones that Stallman wasn't involved in writing).
I think there’s a reasonable question in there, but I don’t agree with this framing. Shady relicensing isn’t legal, and it doesn’t matter if there are armchair defenders. But, Ole does have defenders, so it’s not one-sided.
Part of the issue is that Ole’s citation notice doesn’t appear at first glance to some people to be compatible with the GPL. You have to read the language carefully, and read the history of GNU Parallel’s citation notice, to understand that the notice is not a licensing term.
Another part of the issue is that the notice doesn’t sound like someone just trying to make a living. It sounds like a demand or even a veiled threat, and one that is inflicted on everyone, not just academics. It’s not exactly clear about what the legal requirements even are.
I’m in favor of Ole getting citations, and I’m in favor of his right to ask. But the way it’s being asked for rubs me the wrong way a little bit, and it’s rubbed other people the wrong way a little bit ever since it was introduced. BTW, the whole reason it seems like all hell breaks loose, and the only reason this matters is precisely because the software is widely used. If it wasn’t widely used and it didn’t sit under the GNU umbrella, you’d never hear about this.
I had no opinion until I read through the patch that Arch uses to remove the notice [0]. The creator comes across as whiny, entitled, and aggressive. They have comments in the source like "You accept to be put in a public hall-of-shame by removing the lines", "YOU will be harming free software by removing the notice", and "That includes you, George and Andreas". The whole thing is pretty unprofessional, and based on the false premise that every tool used during research traditionally gets a citation.
Of course, but HN does tend to have a lot more patience for scrappy startups than for scrappy lone, non-commercial devs, from what I can observe.
The question was rhetorical, I know that this place is frequented by quite a lot of people who wish to be part of the next YC batch, so they see themselves in the shoes of the startup, rather than the solo dev.
I'm saying I don't think it is that way. I think you're seeing through the lens of your expectations. dang has written about this at length, but the gist is that it's very easy and tempting to see posts you disagree with more visibly and feel them more viscerally than posts you agree with.
I could be wrong here, and this is the first I’ve heard of this, but I suspect it’s the language and the way he goes about communicating. On the surface, at least, it comes across as a little annoying and demanding: things like him having a website where he shames, by name, people who don’t cite him. I suspect the ‘legal’ claims being made aren’t that solid either. Don’t get me wrong, it’s a neat tool, but it’s just one in a huge ecosystem of many people’s efforts.
> I suspect the ‘legal’ claims being made aren’t that solid either
Section 7 of the GPL specifically says that additional restrictive terms on GPL software (like “pay me $1000 or cite me”) can be ignored or removed. If the software’s author doesn’t want people to remove his additional terms, he shouldn’t have used the GPL. Publicly shaming other open source contributors for doing something that the GPL explicitly and deliberately permits (removing additional restrictive terms) is extremely improper in my opinion.
> Is the author still adding the "cite me or pay 10000€" notice to the output? And calling that GPL?
Where did you get the "or pay 10000€" part from? As far as I remember, the software, unless told otherwise, asks authors of scientific papers to cite GNU Parallel if they used it when writing their papers. And it doesn't force it; it's not part of the license, but it asks you to do so because it's academic tradition to use citations.
You could just ignore the citation and not break the license, no one would think less of you for doing so.
The "pay 10000 EUR" part is right there in the documentation of --will-cite:
If you use --will-cite in scripts to be run by others you are
making it harder for others to see the citation notice. The
development of GNU parallel is indirectly financed through
citations, so if your users do not know they should cite then you
are making it harder to finance development. However, if you pay
10000 EUR, you have done your part to finance future development
and should feel free to use --will-cite in scripts.
If you do not want to help financing future development by letting
other users see the citation notice or by paying, then please
consider using another tool instead of GNU parallel. You can find
some of the alternatives in man parallel_alternatives.
FWIW some distros remove the nagging message (e.g. mine, openSUSE, has it removed, and the patch seems to come from Debian, so I'd guess Debian and its derivatives also remove it).
Again, it's not part of the license, nor are you forced to choose between "cite GNU Parallel or pay 10000 EUR". You're free to use it however you want since the software is GPL, including ignoring any of the output from using the tool if you so choose.
> including ignoring any of the output from using the tool if you so choose.
The user isn't merely ignoring the output, though; they are actively interacting with the program in a way that the program presents as acceptance of the agreement being shown to them.
The agreement is plainly presented in a way that implies it's an obligation, like any other clickwrap agreement. And everyone except Ole and Stallman seems to agree that it's self-evidently a clickwrap agreement restricting the freedoms of the user.
"Free software that only prints a message and exits unless you agree to a clickwrap with further licensing terms" is not a road the FSF should go down. And it's only because of the GPL's severability clause that this isn't a crisis: everyone knows it's a farce, except for the users who are affirmatively taking action to indicate consent to an additional licensing agreement.
It's not facially clear that, in most jurisdictions, the clickwrap agreement is null and void merely because the software is free. You can end up paying for lots of free stuff in life if you're not careful. You agreed to the agreement; it's on you.
You are of course free to remove the prompt and use the software yourself, and Ole rants and raves about that on his website. But agreeing to the clickwrap is, most likely, a separate thing from the GPL license. Just like paying for credit monitoring is different from getting your free credit reports or freezes: they'll definitely try to railroad you into paying! And just because it's supposed to be free doesn't mean you're not getting charged if you agree to it.
If you read the FAQ they have regarding the citation notice for GNU Parallel, it's made clear that it is not part of the license in any way, and that it only applies to projects that are part of, or the basis for, academic papers. If it does apply to your project and you don't cite, at absolute worst you could get in trouble with your university or the academic community, and even then the consequences are almost certainly going to be mild.
But importantly, you can use the software in any way compatible with GPLv3. That includes ignoring or removing the citation notice without paying a cent. However, just because that's legal doesn't mean it won't come with the potential for social consequences.
== Is the citation notice compatible with GPLv3? ==
Yes. The wording has been cleared by Richard M. Stallman to be compatible with GPLv3. This is because the citation notice is not part of the license, but part of academic tradition.
Therefore the notice is not adding a term that would require citation as mentioned on:
https://www.gnu.org/licenses/gpl-faq.en.html#RequireCitation
The link only addresses the license and copyright law. It does not address academic tradition, and the citation notice only refers to academic tradition.
[...]
And from the GPL FAQ itself (which the citation FAQ references):
Does the GPL allow me to add terms that would require citation or acknowledgment in research papers which use the GPL-covered software or its output? (#RequireCitation)
No, this is not permitted under the terms of the GPL. While we recognize that proper citation is an important part of academic publications, citation cannot be added as an additional requirement to the GPL. Requiring citation in research papers which made use of GPLed software goes beyond what would be an acceptable additional requirement under section 7(b) of GPLv3, and therefore would be considered an additional restriction under Section 7 of the GPL. And copyright law does not allow you to place such a requirement on the output of software, regardless of whether it is licensed under the terms of the GPL or some other license.
TLDR: The citation notice is a "cite it in academic works or pay me" agreement that is as legally binding as a pinky promise. You can break it without concern but some people may look negatively on that and it may come with social consequences.
Not really. It's more like: if you use it for free, please cite; but if you're so averse to citing that you'd rather send a gazillion euros, then feel welcome to do so. At least, that's the way I read it.
Love finding a good use case for parallel as an easy way to gain massive time savings, especially on today's high-thread-count CPUs. Most recently I found it useful when batch-compressing large JPEG images to smaller WebP files, combining it with find and ImageMagick:
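Something along these lines (a sketch from memory; it assumes ImageMagick 7's `magick` binary, with older installs using `convert` instead, the quality setting is arbitrary, and `{.}` is parallel's placeholder for the input path minus its extension):
find . -name '*.jpg' -print0 |
  parallel -0 magick {} -quality 80 {.}.webp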
xargs is a nearly drop-in replacement and is probably already installed by default in most distros. You may need -n 1 (one file per invocation) and -P to parallelize.
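For the conversion above, that might look like the following (an untested sketch; with GNU xargs, -I already implies one input per invocation, and since xargs has no `{.}`-style templating the outputs end up named `*.jpg.webp`):
find . -name '*.jpg' -print0 |
  xargs -0 -P 8 -I {} magick {} {}.webp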
Actually, it's the other way around: parallel is a drop-in for xargs, since xargs has been around longer. Parallel has a few big improvements (see the sketch after this list):
* Grouped output (prevents one process from writing output in the middle of another's output)
* In-order output (task a output first, task b output second even though they ran in parallel)
* Better handling of special characters
* Remote execution
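The first two are easy to demo. With output grouping on by default plus -k (--keep-order), results come back whole and in input order, even though the jobs finish at different times (a minimal sketch):
parallel -k 'sleep {}; echo slept {}' ::: 3 1 2
# prints "slept 3", "slept 1", "slept 2", in that order,
# even though the 1-second job finished first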
I didn't know about this, and reading through the comments, I found out that xargs can also do batching and parallelism (nice!). However, it appears that if you pipe the output of an xargs-parallel command into another utility, it jumbles the output of the multiple subprocesses, whereas GNU parallel does not.
I was a little put off by the annoying/scary citation issue mentioned by another commenter, so I am not sure I will use parallel.
I want to pipe the output of parallel processes into a utility that I wrote for progress printing (https://github.com/titzer/progress), but I think that neither of these solutions work; my progress utility will have to do this on its own.
You can probably do something that creates as many FIFOs as you have parallelism and just be careful about emitting whole records, like https://github.com/c-blake/bu/blob/main/doc/funnel.md. That one's Nim, but the meat is only about 50 lines and easily ported to C like your progress tool. (EDIT: it will also probably have drastically lower overhead than `parallel`, which has over 70X the time overhead and 10X the RAM overhead of tools written in fast, native-compiled languages: https://github.com/c-blake/bu/blob/main/tests/strench.sh)
Also, the last time I tried to do something similar with FIFOs (no /tmp or whatever storage, unlike other examples here: https://news.ycombinator.com/item?id=37211687), GNU parallel needed some (for me) specially compiled Perl interpreter with threads enabled to use its `parcat` program, which is also probably slow. Besides the nagware insanity, `parallel` seems just not a very compelling tool in either machine or human overheads unless, maybe, you already know Perl (which I always found a supremely forgettable language).
There's a shell script version of GNU parallel that's great for CI/CD pipeline tasks. You just keep it in your repo and source it as needed. It's incredibly useful, we use it in one build to batch process a few thousand things in groups of 25.
Edited to add: finally got signed in at work; you create the script via:
parallel --embed > scriptname.sh
It's about 14,000 lines of awesome and works on "ash, bash, dash, ksh, sh, and zsh"
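If it helps anyone, the workflow is roughly this (a sketch; the path, the batch size, and the `process_batch` helper are made up, and the generated file ends with example code you'd trim or replace before committing):
parallel --embed > ci/parallel.sh   # one-time: generate it and check it in
. ci/parallel.sh                    # in the pipeline: source it
printf '%s\n' "${things[@]}" | parallel -n 25 process_batch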
Maybe this is a silly question, but what advantage do you get from checking that huge file into VC instead of just installing parallel ahead of time on the CI images?
I’ve been writing a lot of PowerShell recently and discovered the ForEach-Object cmdlet with the -Parallel parameter, and it has been addictive to parallelize my scripts, so I totally understand why parallelizing using a command line tool is attractive.
xargs is more useful because it's POSIX, so you can always guarantee it to be there (whereas with GNU Parallel you probably have to reach for a package manager to install it first). The ergonomics are worse, though, as usual.
The entirety of GNU Parallel is just one Perl program. It could be copied over and used in a pinch. The installation itself is very simple and no special dependencies or privileges are needed.
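e.g. something as crude as this can work in a pinch (a sketch; `somehost` is a placeholder, and it assumes Perl exists on the target, which it almost always does):
scp "$(command -v parallel)" somehost:bin/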
There are also many Linux distributions that do not install all the POSIX utilities by default, but only the minimal set needed to bootstrap the system.
On all such systems, it is very easy for the user to install any missing POSIX utility, but it is also easy to install any non-POSIX GNU utility.
So not even xargs is certain to exist by default on all systems.
Moreover, POSIX xargs is restricted to executing all processes sequentially.
Any use of xargs for parallel execution is non-POSIX, so in that case there is no reason to not use "parallel" instead.
Makes an annoyingly slow task tolerable, as parallel doesn't block while fetching to preserve order. We probably should rewrite this to be more efficient, but this task is run infrequently.
GNU Parallel has been created precisely for solving some deficiencies of xargs.
While there are cases when it makes sense to stick to what is specified by POSIX, there are also cases when the POSIX specification is so obsolete that using POSIX instead of some free ubiquitous programs is a big mistake.
Among these latter cases are writing scripts for a POSIX shell instead of writing them for bash and using xargs instead of parallel.
Having a layer of parallelisation on top of good old sequential code seems like a very neat idea. It removes the headache of learning how to run code in parallel in languages that aren’t necessarily my primary one (e.g. for short, one-off scripts). Thanks for sharing!!
Someone gifted an old blade server to me a few years ago. Very slow, but 16 cores and 24 gig of RAM. At the time I was making a lot of video art with ffmpeg, without a GPU. That version of ffmpeg wasn't optimized for multiple cores so rendering was really slow and sequential. I discovered Parallel and set the server to process large videos with most of the cores in parallel. Voila, it chewed through a massive amount of media fairly quickly. Faster than the hard drives actually.
Folks who are here and interested in parallelization for CI/CD may also be interested in Dagger.io — I had heard about it on HN over the years but not played w it. It's basically a more fine-grained Docker-like executor with better caching and utilities for spinning up services and running tests.
Curious if anyone else has experience with it; I've honestly been surprised at how little I've heard about it.
I tried to use it last week to run 10 instances of curl against a webserver.
I was expecting something as simple as 'parallel -j10 curl https://whatever', but I couldn't find the right syntax in less time than it took me to prepare a dirty shell script that did the same.
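For the record, the missing piece seems to be -N0, which tells parallel to read one input per job but insert no arguments into the command line, so something like this should do it (a sketch; the URL is a placeholder):
seq 10 | parallel -j10 -N0 curl -s https://whatever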
> This runs a benchmark for 30 seconds, using 2 threads, keeping 100 HTTP connections open, and a constant throughput of 2000 requests per second (total, across all connections combined).
Some distros include `ab`[2], which is also good, but wrk2 improves on it (and on wrk version 1) in multiple ways, so that's what I use myself.
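For concreteness, the run quoted above corresponds to a wrk2 invocation along these lines (the URL is a placeholder; wrk2's binary is also named wrk):
wrk -t2 -c100 -d30s -R2000 https://example.com/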
parallel is great but its default behaviors never quite seem to match my needs, so every time I use it I have to spend some time consulting the man page. Fortunately, the man page is more than up to the task.
But because of the mini learning curve on each use, and because I find I need a little more boilerplate with parallel, I use xargs -P more often, reaching for parallel only when I need its special features (e.g. multiple hosts or collating the output streams).
Oh also, parallel itself can be a bit of a resource hog. (Obviously that depends a lot on how you're using it, but I mean that in cases where xargs' overhead is unnoticeable, I sometimes have to change the size of my jobs to get parallel out of the way.)
I have wanted to parallelize my .zshrc file for a while – all those environment setup scripts for nvm, pyenv, starship, etc. really make the startup time noticeably slow. Does anyone know how to do this?
Ooh, nice thought. I’m not certain, but I kinda doubt it’s possible, because those startup scripts need to modify the current shell environment. I believe GNU parallel runs in a subshell and launches new tasks in separate processes, so it fundamentally doesn’t operate the same way that e.g. sourcing the nvm script does, unfortunately. Even if there were some way to hack it, I’d be nervous about changing environment variables in parallel; to me that sounds like asking for really nasty race condition bugs.
Seems like you could accomplish the same thing more cleanly (IMO) with make. You can create a target for each test, which can be done with patterns, and then use `make -j` to run them in parallel.
parallel is one of those tools like jq, to me. It's great, but by the time I've grokked the syntax, AGAIN, I'd've been quicker to write a quick shell/ruby/python script to do it that's almost readable.
Probably for very simple use cases, but the real power of parallel comes from the myriad of switches that enable so much more than what "&" and "wait" could do.
When I'm using parallel, it's usually because I have thousands of jobs. Worse, they have nontrivial memory requirements. When you background processes with &, the system starts timeslicing. Each process gets to allocate its memory before being paused to make time for the next process. Your system will almost immediately crumple under load. Hopefully, the oom killer will target your backgrounded jobs... but the script spawning them will go untouched because it isn't the thing hogging memory.
Before I learned of parallel, I tried a hack where I'd manually assemble jobs into batches, and wait on each batch before starting the next. It achieved very low system utilization because, inevitably, one job in each batch takes much longer than the rest. A slight improvement (still not good) is to use `split` to chop your jobs file into $num_cores chunks, and background each chunk. But this still gets low utilization. The problem is that you aren't using a thread/worker pool.
Parallel (or, TIL, xargs) can maintain 100% system utilization, until the very last $num_cores jobs.
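In other words, the worker pool is the whole point (a minimal sketch; `process_one` is a made-up per-job command):
# one job per input line; at most $(nproc) run at once, and a new job
# starts the moment any slot frees up, so there are no batch stragglers
parallel -j "$(nproc)" process_one {} < jobs.txt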
You don't have to reinvent the wheel for your script, all the parallel options are ready for you to use and are well documented. It's also packed with features that might take a long time to write into your Python script.
I am trying to use Python by default when writing scripts nowadays, but sometimes the best tool for the job isn't Python or writing your own Python.
IMO, effective "scripting" just means the ability to solve ad hoc problems easily by writing task-specific glue that delegates the hard parts of the program to (1) an effective set of libraries you've written yourself and (2) external code or tools when it makes sense.
From this perspective, the languages of the glue, the libraries, and the external code all matter less than the ease of writing the glue; interfacing with the external code; and maintaining the libraries. The best language for this probably comes down to a combination of what you're comfortable writing (and reading, and maintaining) and what kinds of tasks you're trying to solve.
For me personally, using Python glue and libraries strikes a pretty good balance here. Writing a script "in Python" doesn't mean you need to reinvent the wheel. If you think `parallel` provides a better interface for map-reduce parallelism than `subprocess` (or than a library function you've written on top of `subprocess`), no problem: you can just call `parallel` from Python (and you'll probably find yourself writing a library function on top of it to abstract away the fact that it's a shell script).
But if you're much more effective working in Bash than Python, then writing your glue and developing your libraries in Bash could be the way to go.
Start a bunch of threads and, e.g., invoke subprocess.run() from them.
Done that many, many times, and honestly combining Python with parallel is in many cases the best way to go. Write your Python script to be as fast as possible on one core, then use parallel to run it on all your cores. This has the added advantage that you can go from running on all the cores of your machine to running on all the cores of a 100-machine cluster by changing just a couple of lines of code.
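The pattern looks roughly like this (a sketch; process.py, the data glob, and nodes.txt are placeholders, and the cluster form assumes the script and data are visible on the remote hosts, e.g. via a shared filesystem):
# all local cores:
parallel python3 process.py {} ::: data/*.csv
# the same jobs spread across every machine listed in nodes.txt:
parallel --sshloginfile nodes.txt python3 process.py {} ::: data/*.csv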
subprocess.run is likely to be significantly slower than a dedicated utility like parallel, and to add a lot of flakiness and overhead. I'm a big pythonaro, but one should always use the best tool for the job.
You guys know that in bash you can use `&` to send a foreground process to the background, and then use `wait` to wait for all of the session's background processes to end, right?
Yes, and those work well for smaller workloads, but if you just run 1,000,000 commands with `&` in a `for` loop, it will grind your computer to a halt (if the tasks are modestly resource intensive). GNU parallel will let you run those same 1,000,000 tasks but make sure that only (e.g.) 16 of them are running at once. It's not easy to do that in bash.
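Concretely (a sketch; `do_one` stands in for your actual command):
seq 1000000 | parallel -j16 do_one {}
seq 1000000 | xargs -n1 -P16 do_one   # rough xargs equivalent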
It takes time to notice that if you launch _several_ of these background jobs with `&`, you will only get the exit status of the last one when you run `wait`. Errors from the others will be swallowed.
Then you _have_ to resort to `wait <pid>`, with the 20 lines of bash code needed to manage all those PIDs. I have a large editor snippet just for that.
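A condensed version of that boilerplate, for flavor (a sketch; `process_one` is a made-up per-file job, and bash 4.3+ also offers `wait -n` to reap whichever job finishes first):
pids=()
for f in *.input; do
  process_one "$f" &       # launch each job in the background
  pids+=("$!")             # remember its PID
done
failed=0
for pid in "${pids[@]}"; do
  wait "$pid" || failed=1  # collect every exit status individually
done
exit "$failed"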
You seem a bit behind, or too invested in C# in particular. Elixir, for example, can run stuff in parallel with just 3-4 lines added to otherwise sequential code.
Yes, of course other programming languages can do this. I was referring more to culture and idioms. The point is that tools don't support it or think about it, probably because things work for most small use cases without it, and because it is a leaky abstraction: you need to change your code to support it.
Imagine a world where there were only GPUs for example - then everyone by default would be running parallel-first code, and in that imaginary world you would need to do nothing to run a series of bash commands piping into each other in parallel.
[1] https://www.oilshell.org/release/latest/doc/ysh-tour.html
[2] https://github.com/nushell/nushell
[3] https://fishshell.com/