A common mistake involving wildcards and the find command (robertelder.org)



As another user mentioned, many POSIX and/or GNU utilities haven't aged well. I respect trying to stay portable, but people's needs change over time and these tools simply haven't kept up. Like the other user, I use fd now instead:

https://github.com/sharkdp/fd

As well as Silver Searcher:

https://github.com/ggreer/the_silver_searcher

While grep's performance is pretty good, it's also gotten pretty stale with regard to its defaults and options.


For grepping, rg is faster and saner.

https://github.com/BurntSushi/ripgrep

On end-user systems with fast I/O, it is usually a better use of resources to have a battery-/suspend-/reboot-aware indexing system with path/extension whitelists and blacklists that prioritizes monitored file changes. Searching then happens against an optimized text-search DBMS, much faster than waiting for zillions of IOPS at query time: those IOPS are spent ahead of time, in the background or at idle, indexing text files, metadata, and structured data once per change, so you already know exactly where all occurrences of kWhateverCondition_PP3V42_G3H or \A[AB]{1,3}c+d\z live under ~/Projects without reading any actual files.
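The long-standing locate/updatedb pair is a minimal example of this approach, though it indexes only file names rather than contents (modern desktop indexers go much further):

    sudo updatedb      # refresh the file name index ahead of time, in the background
    locate -i pp3v42   # query the index; no filesystem scan at search time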


Do either of these tools have a "grep" mode, so that I can alias grep to it and get similar, but improved, behavior? 25 years of habit is a lot of reinforcement to overcome and I'd love to just be able to take advantage of these tools with an alias.

As it is, I just have a grep-wrapping shell script that gives it some sane defaults, which works fairly well.


Author of ripgrep here.

> As it is, I just have a grep-wrapping shell script that gives it some sane defaults, which works fairly well.

Yes, that's what I did for about ten years before I wrote ripgrep. :-)

> Do either of these tools have a "grep" mode, so that I can alias grep to it and get similar, but improved, behavior?

Another commenter already mentioned `rg -uuu`, and that's pretty much the right answer. In a large number of cases, if you `alias grep=rg`, then most things will continue to work as you expect. If you really do not want any kind of smart filtering, then yes, you'll want `alias grep="rg -uuu"`. But otherwise, things like `command ... | rg pattern` and `rg pattern file ...` will continue to work just like grep.
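For example, a sketch of the alias approach just described (pick one):

    alias grep=rg          # keep ripgrep's smart filtering (.gitignore, hidden files)
    alias grep='rg -uuu'   # disable all smart filtering; closest to plain grep -r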

ripgrep also tries to use the same names for flags as grep, wherever possible. So you'll find a lot of overlap there too.

Please note that ripgrep is not, and is not intended to be, a drop-in replacement for grep. But I didn't go out of my way to be incompatible with grep either. So a lot of it should be quite familiar!


The closest I know of is `rg -uuu`. From the manual:

> -u, --unrestricted ...

> Reduce the level of "smart" searching. A single -u won't respect .gitignore (etc.) files. Two -u flags will additionally search hidden files and directories. Three -u flags will additionally search binary files.

> rg -uuu is roughly equivalent to grep -r.


No, you're asking oranges to be bananas. Read the man page or user guide. It's not grep, it's better and already has saner defaults. https://github.com/BurntSushi/ripgrep/blob/master/GUIDE.md


Yeah, and it's also 10 times larger:

https://github.com/BurntSushi/ripgrep/issues/1481


If you go on a system with grep aliased and run your single grep command, are you going to be able to tell which binary was larger?

If you look at the size of two disk images for different distros, and one has grep while the other has ripgrep, are you going to be able to tell which is which?

10x larger than grep is noise in 2020. Making that tradeoff in exchange for significantly improved performance seems reasonable.

Not to mention a large part of that difference is likely due to grep relying on libc, while ripgrep has to bundle its own runtime. And that's not even bringing up memory safety.

I'll take ripgrep at 10x the size of grep any day.


Yes. If that disk space is more important to you than correctness, performance and better Unicode support, then yeah, you should definitely keep using the silver searcher.

I gave more details in a response on that issue: https://github.com/BurntSushi/ripgrep/issues/1481#issuecomme...


Which is absolutely a good trade-off for a modern desktop system.


the previous issue on the topic has some pointers


Wouldn't you have the same problem with fd if you invoke it on unquoted globs? The problem is the semantics of the shell, not find.


Yes, but fd has implicit pattern matching without requiring the * character:

    Features:
        Convenient syntax: fd PATTERN instead of find -iname '*PATTERN*'


Seems a misfeature. Much of the time, I use find to look for files with a specific extension, which makes the trailing asterisk harmful.


    fd -e py
Also neatly bypasses the glob issue.


> As another user mentioned, many POSIX and/or GNU utilities haven't aged well

What do you mean by that? Have users gotten more stupid and uneducated over time? I've seen "when in doubt, escape it" in several books about UNIX and the shell, and escaping and wildcard expansion are among the most basic beginner lessons in a lot of other materials. However, the Internet has let everyone have a voice, for better or worse, and as a result anyone who barely knows something can now write a misguided "tutorial" about it.

In other words: don't blame the tools, nor advocate writing dumbed-down replacements which lack the composability and generality the originals had; blame the proliferation of barely-correct educational material that has spread cancerously over the Internet.


I have written hundreds of shell scripts, several of which are hundreds of lines long. When I say they haven't aged well, I'm speaking from experience. When I use something like:

    find -name *.jpg
I know what it actually does. But what it actually does isn't what it should do. The current behavior returns matched files... unless no files match. Then it returns the literal string "*.jpg". That behavior is almost never what anyone wants. Same with:

    grep $SOMETHING
it's going to fail if the variable has any spaces. Again, something most people aren't going to want. Both of these can be worked around, with shell options and IFS respectively (see the sketch after the link below), but people shouldn't have to do that. The default behavior should be sane, but in many cases the POSIX standard isn't. For goodness' sake, the "venerable" find tool doesn't even have a "maxdepth" option:

https://pubs.opengroup.org/onlinepubs/9699919799/utilities/f...
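A minimal sketch of those workarounds, assuming bash:

    shopt -s nullglob        # unmatched globs expand to nothing...
    shopt -s failglob        # ...or make an unmatched glob a hard error instead
    find . -name '*.jpg'     # or simply quote, so find receives the pattern itself
    grep "$SOMETHING" file   # quote the expansion to keep spaces in one argument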


You completely ignored my point, which is that none of this behaviour should be surprising or unexpected to anyone who has taken the time to think about how it all works together, and that this flexibility is the inherent way in which the UNIX command line is so powerful.

> it's going to fail if the variable has any spaces

Quoting and escaping. I mentioned that already, it's a basic part of the understanding and is not surprising at all.

> Again, something most people aren't going to want.

You're assuming a lot. The fact that the variable substitution could expand to multiple arguments (because it's really just simple textual substitution) is in fact very powerful and useful (you can pass multiple arguments in one variable, for example), and you can always quote if you really want it to expand to one argument, the same way you would quote strings containing spaces to make them one argument. This is very consistent; observe that if $SOMETHING contains "foo bar" (without the quotes), then

    grep $SOMETHING
is exactly the same as

    grep foo bar
but what you seem to be suggesting is that it become

    grep "foo bar"
...what? Where did those quotes come from? More importantly, how do you propose to remove them? My point is, the current behaviour is simple and consistent as well as powerful and flexible, and I suspect people are just going "it doesn't work the way I thought it would" and downvoting my comment when they really haven't thought about the perfectly logical explanation for why things are the way they are. The original developers of the shell language were most certainly not stupid.


The current behavior is not simple and consistent. If it were, then *.jpg would “expand” to the empty list of arguments, rather than remain unchanged, in the degenerate case of no match.


Our tools should be dead simple whenever possible rather than "if you make mistakes then git gud".


I almost never want the split-on-IFS-on-expansion behavior. That's one of the reasons I use zsh. When I want splitting I can either use an array or $=SOMETHING.

The default split behavior basically forces you to always quote expansions which is tiresome. It also is confusing because almost no other programming language works like that.

And when you assigned SOMETHING you wrote SOMETHING="foo bar" (with quotes), so it sorta makes sense visually as well.
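For the record, a quick illustration of the zsh behavior:

    SOMETHING="foo bar"
    grep $SOMETHING     # zsh: one argument, "foo bar" (no implicit splitting)
    grep $=SOMETHING    # zsh: explicit splitting; two arguments, foo and bar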


So, do you prefer people have this misconception instead? https://unix.stackexchange.com/q/544189/348263

Also, for the case of passing multiple arguments to a command, just use an array:

    SOMETHING=(foo bar)
    
    grep "${SOMETHING[@]}"
It will handle arguments with spaces correctly, unlike using a string variable and splitting it on spaces. There is no reason to ever use space splitting, unless you're maintaining some legacy interface where you can only pass a string instead of an array.


I don't understand escaping properly.

Why, on some machines (all varying distros of Ubuntu) does

    find -name *.xml
return nothing and I have to use

    find -name '*.xml'
whilst on other machines the first command returns results? The shell is always bash, it's a stock install - so why the variation?

I get that it should always be quoted, but not why it sometimes works when unquoted and other times doesn't.


Perhaps `shopt -s nullglob` has been set?

“If set, Bash allows filename patterns which match no files to expand to a null string, rather than themselves.” [1]

[1] https://www.gnu.org/software/bash/manual/html_node/The-Shopt...
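With nullglob set, the unmatched pattern vanishes before find ever sees it:

    $ shopt -s nullglob
    $ find -name *.xml
    find: missing argument to `-name'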


> so why the variation?

Difference between dash and bash?

Shell options? `man shopt` and grep for glob


That's not going to directly solve this problem since the globs are expanded before the invoked process is even called with its arguments.


> Smart case: the search is case-insensitive by default. It switches to case-sensitive if the pattern contains an uppercase character.

This is what I want as a default for pretty much any text search.


Emacs has had this since forever. It's pretty neat.


Vim also provides it if you use:

  :set smartcase


IMHO, also anything that outputs or accepts sizes in multiples of 512 bytes (blocks) by default.

Sure, disks may have sectors and there may be some use cases, but we use bytes, kilobytes, kibibytes, etc. for data sizes now.


In this case the shell hasn't aged well. Besides the globbing feature, the bash language is arcane, and few can actually reliably write correct bash despite its ubiquity. Personally I would like to see a completely reimagined take on the shell.


Have you looked at powershell, scsh or fish?


shellcheck helps a ton, though.


I would argue that this case shows that the shell(s) have a problem with a feature being enabled by default. In hindsight it might have been smarter to put globbing expressions in special quotes than the other way round.


For me, Silver Searcher often fails to find matches in files that grep finds easily. I've not been able to determine why this is, but it happens often enough that I consider it highly unreliable, even if it's fast.

So, it seems that there are still some issues to be sorted here.


This claim needs adequate proof; please provide an example. Otherwise the HN reader has to assume you made a user error.


Maybe it has something to do with my not understanding its case-sensitivity rules, which I've learned about in this thread as being different.

As soon as it happens again I'll bring it here as an example, but like I said I stopped using silversearcher because of it getting in my way like that.


I often set the nullglob option in scripts, because it makes the handling of globs which don't match anything a bit more predictable:

http://bash.cumulonim.biz/NullGlob.html

There's a note at the end about how with nullglob set, ls on a glob with no matches does something surprising. This is a great illustration of how an empty list and the absence of a list are different. Sadly it's rather hard to make that distinction in shells!

I do wish that either shells had a more explicit syntax for globbing, or other commands didn't use the same syntax for patterns. Then confusion like this couldn't occur. An example of the former would be if you had to write:

  ls $(glob *.txt)
Here, the shell would not treat * specially, but rather you would have to explicitly expand it. This would be a pain, but at least you wouldn't do it by mistake!


I set failglob: `shopt -s failglob`. It makes the whole command fail if there are no matches. That, combined with `set -e`, which aborts the script if any command fails, makes me feel somewhat safe.

Indeed I add the following two lines to every bash script I write:

    set -exu
    shopt -s failglob


If you like set -e, I recommend looking at set -o pipefail.
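A quick illustration; without pipefail, a pipeline's status is just that of its last command:

    $ false | true; echo $?
    0
    $ set -o pipefail
    $ false | true; echo $?
    1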


Yes, Oil (https://oilshell.org/) now has nullglob on when you run bin/oil instead of bin/osh.

So you get this:

    oil$ find . -name *.jpg
    find: missing argument to `-name'

    oil$ find . -name '*.jpg'
    (works)
Details: I had it on when you set 'shopt -s strict:all', but I neglected to turn it on for 'oil:basic' and 'oil:all'. Those are oil-specific option groups so you don't have to remember all the option names.

But I just fixed that and it will be out with the next Oil release.

https://github.com/oilshell/oil/commit/ddac119254f9a7045dca7...

If anyone wants to add more strictness options to Oil to avoid these types of mistakes, let us know! (e.g. on https://github.com/oilshell/oil, or there's more contact info on the home page).


I don't know if it's the default or if it's something I've set in my config a decade or two ago, but my zsh behaves like this by default (i.e. I get an error for globs that match nothing instead of silently passing the * along). That seems much saner to me:

    $ find . -name *.txt
    zsh: no matches found: *.txt
That'll teach you to quote your `find` patterns real fast...


It's either default or part of oh my zsh, because I've had the same experience and I don't have any custom configs.


Tangential: another safeguard you can adopt is avoiding "hard delete" commands like rm and find -delete. Untrain yourself from these commands by never using them. On Mac systems, the "trash" program (brew install trash) sends files to your system trash. You can use `trash [file]` and `find .. -print0 | xargs -0 trash --`. rm is a dangerous command you should only very rarely be using.

I fish something out of the trash a few times a year and lemme tell ya; it's worth the investment.

Another tip if you fancy debugging shells is using

  python -c "print(__import__('sys').argv[1:])" sample "arg here" * foo
this provides the same functionality as the C program in TFA, without needing gcc.


I learned from a sysadmin to use `mv` instead of `rm` when you're removing potentially critical files.

Other tips include colouring your prompt highlighted red on production boxes so you never accidentally think you're somewhere safe.

Nowadays I try to avoid SSHing into mission critical machines though.


Trash is the poor man's backup system.


You use trash and backups.

Rich men can still save a few dollars for pocket change.


> this provides the same functionality as the C program in TFA, without needing gcc.

or just using bash:

    function showargs() { for x in "$@"; do echo "arg: $x"; done }


I like it. I'm using a numbered version:

  function showargs() { i=1; for x in "$@"; do echo "arg $((i++)): $x"; done }


No need for Python; this works just as well:

    printf ":%s:" *foo


For a long time (probably since the first time I forgot to quote something and got burned, so around 40 years), I've thought that there should be some mechanism for the shell to pass in information about how each argument came about.

For each argument, it would tell the program if it was supplied directly, or came from wildcard expansion. For those from wildcard expansion, it would tell the program what the wildcard was.

Most programs would not care, but some programs could use this to catch common quoting errors.


The UNIX-HATERS Handbook lists the shell's handling of wildcards as a major flaw of UNIX shells.


> For those from wildcard expansion, it would tell the program what the wildcard was.

Different shells have different globbing mechanisms. Why should all programs tie themselves to the mechanisms of any one particular shell?

The simpler the calling mechanism for executables, the simpler it is to write them in any existing or future language. This also gives more flexibility to future shells.

UNIX is pretty much designed thinking of users as programmers. Making it easy to write programs building on other programs is as important as being able to call them. With that in mind, I don't think it's a good compromise to complicate the writing of executables in order to protect users from their own mistakes.


> Different shells have different globbing mechanisms. Why should all programs tie themselves to the mechanisms of any one particular shell?

I don't think my proposal would do that.

First, a clarification. When I said "For those from wildcard expansion, it would tell the program what the wildcard was" I meant the shell tells the program the full argument containing the wildcard, not just the wildcard part. E.g., if the argument was a?c which expanded to abc, the shell would tell the program it was a?c, not merely ?.

Let's say myprog takes one argument, which it expects to either be a file name or a wildcard in some pattern language that is not necessarily the same pattern language used by the shell.

Suppose myprog is invoked and it sees it has one argument, X. If the shell tells it that X did not come from shell wildcard expansion, myprog can safely use it.

If it does come from wildcard expansion and the source was Y, then there are a few cases.

1. Neither X nor Y contains any myprog wildcards. In this case, I'd have myprog go ahead and use X.

2. X does not contain any myprog wildcards, but Y does. Myprog can infer from this that it and the shell probably have overlapping wildcard languages. Report a "needs quote?" error.

3. X contains a myprog wildcard, but Y does not. Probably myprog and the shell use different wildcard languages. I'd guess it's pretty unlikely that the user expects a filename to contain a myprog wildcard, so I'd report an error in this case.

4. Both X and Y contain myprog wildcards. Go ahead and accept X, applying myprog wildcard expansion to it. (Maybe first check that X and Y contain the same myprog wildcards, in the same order.)

I'd probably add support in myprog for some optional flags to change the defaults in those cases so if you really want to do obscure things like have file names that contain myprog wildcards and use shell wildcards to provide those file names you can do it.


Right now, programs receive a list of strings as their arguments. What would be a concrete proposal for this X-Y mapping, taking into account that a single Y results in multiple Xes?

The calls of executables is done via this system call:

  int execve(const char * path, char * const argv[], char * const envp[]);

How would you modify that system call signature to be able to carry that information? And what about how programs receive arguments?

  int main(int argc, char **argv)

How would you modify main()'s signature?

Take into account, the kernel has no notion that shells even exist. It has no need to know about features of particular programs in the upper layers, but now it's going to carry this information. All other programs, too. Right now, they're typically not aware that there is such a thing in the world as a shell, but now an integral part of what makes an executable an executable, its arguments, is going to come attached to this data structure that represents the use of a feature of some other program that has nothing to do with the executable itself.

What is a wildcard/glob/pattern? It's just some weird idea a program not different from any other program had. It's not particularly important. It's nothing all other executables ever needed to know about before.

Sorry I'm ranting. What I'm trying to convey is that typically software comes in layers, like an upside down pyramid, and the upper layers base themselves on the lower layers. The upper layers depend on the foundation that is the lower layer. This seems to go backwards, with the lower layer basing itself on a particular tiny insignificant piece of the upper layer. To the lower layer, that piece from the upper layer doesn't even need to exist. It can just disappear. The lower layer can run just fine without it. The kernel doesn't need the shell, and even shells don't need globbing. It's not a core feature even if it's common.

I'm fine with breaking rules and making a step towards a mess when the benefit is really worth it, but I'm not convinced this is worth it.

Listen, this article says `find -name * .jpg` is a common mistake (I put the space because of HN formatting). Yet, I don't think I've ever done that, and I don't believe I ever will. I trust the claim to an extent. When people are learning a new language (tongue), like English or Spanish, they'll make all sorts of silly grammatical mistakes, like saying "you was late" or similar. I'm sure they're very common, but as they gain experience or are a native speaker, they'll never commit those mistakes. Trust me that `find -name * .jpg` looks really wrong. It jumps out, and just like you'd never say "you was late", my fingers would never type that.

I wouldn't try to change English grammar so that "you was late" became acceptable. For whoever is making those types of mistakes, learning is a process, and that's fine.


> What would be a concrete proposal for this X-Y mapping, taking into account that a single Y results in multiple Xes?

Two arrays:

1. An array of all the Ys.

2. An array with one entry for each X. If X came from shell wildcard expansion, this entry is the index into the first array of the Y whose expansion resulted in X. If X did not come from wildcard expansion, this entry is some magic value that cannot be mistaken for an index into the first array.

Pass these arrays in the environment. No need for any kernel or standard library or language runtime modification to support it.


Just add an ENV var containing a JSON with this metadata. There's tons of env vars that get ignored by programs that don't care about each specifically.


ENV variables are inherited, so this may confuse subprocesses if they're not careful.

There are workarounds, e.g. adding the target PID, but that probably comes with its own issues.


I'm not sure why people are saying it would make executables harder to write; it could most easily be done with environment variables as opposed to modifying the signature of `int main()`. Something along the lines of: `GLOBLESS_ARGC=5` means that `GLOBLESS_ARG{0..4}` have the original arguments as supplied by the shell user.
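A hypothetical sketch of that convention (nothing like GLOBLESS_* exists today; this is only what a consumer might look like):

    # hypothetical: warn if $1 reached us via shell glob expansion
    if [ -n "${GLOBLESS_ARG1:-}" ] && [ "${GLOBLESS_ARG1}" != "${1:-}" ]; then
        echo "note: argument '$1' came from expanding '${GLOBLESS_ARG1}'" >&2
    fi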


it's trivial to set this up yourself by echoing expanded commands into a log file as your program runs


Shell (because this is technically a shell, not a find issue) is the worst language that everyone should learn. It's a language you'll actually encounter, and it's one that's hard to avoid (unlike PHP).


This was obvious to me, but one version of this that surprised me is when using scp. If you glob a remote destination like "scp myserver:*.jpg ./" It will probably work! But how? Because the remote path will likely not match any local files and the path with the asterisk will be passed to scp and scp will do the globbing on the remote side.
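Relying on that is fragile, though; quoting makes the intent explicit and keeps the local shell out of it:

    scp 'myserver:*.jpg' ./    # the glob is expanded on the remote side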


I mostly use ssh and tar instead of scp, because I'm usually after more than one file.

Like, "ssh remote 'cd /whatever; tar -cf - someglobpattern' | tar -xvf -"


Tip: get into the habit of using && rather than ;

So...

> ssh remote 'cd /whatever && tar -cf - someglobpattern' | tar -xvf -


Seconded. Here https://apple.stackexchange.com/questions/378942/cant-initia... is an example of what can happen from not checking for errors after a `cd` command. Summary: "sudo find ... -delete" ran in the root directory rather than the intended location, and now the OS is gone (probably along with lots more).


Fair observation. I will change my ways :)


even better

    printf "<cmd>" | ssh remote bash
or

    cat cmd.sh | ssh remote bash
This passes all of stdin to the remote bash, avoiding any expansions happening on the local side entirely.


Could also be faster because of the common compression dictionary.


I believe that programming languages should never make the meaning of a program depend on the context in which it is executed. So many obscure bugs are directly caused by such behaviors. There should be exactly one possible interpretation for a given statement, and if that cannot be executed, then the program should abort. In this case, the glob should never have been passed on to find. It should either have expanded to the empty array, or failed.


One could argue that it is merely a side effect that a shell constitutes a programming language. And one could also note that find should employ the same convention as the shell (i.e. use **/*.jpg to recurse into folders).


My instant reaction to the example was "that won't work; your shell will say something like 'no matches'".

Using an unescaped star in a find command never works for me, which is a lot better than it sometimes working and sometimes breaking!

Reading the article and the comments, it seems like bash doesn't do this? I suppose it's one nice thing about oh-my-zsh, whose default config I use almost unchanged.


Ditto, fish shell also provides this behavior in the default config.


It's a default in zsh, nevermind oh-my-zsh.


Thanks; I didn’t know whether it was or not while I was writing my comment, and it was difficult to check on mobile, so I went with what I knew for sure.

Glad to know it’s a stock zsh default, too!


Straight from The UNIX-HATERS Handbook. https://web.mit.edu/~simsong/www/ugh.pdf


A classic


Putting ‘shellcheck’ in your CI pipeline is a must for me now, after one too many mistakes.

I just finished cleaning away all existing ‘error’ and ‘warning’ level issues in our codebase so that the ‘shellcheck’ CI step can be really strict on code quality.


The subject of the post is wrong. This is a common mistake between the user and whichever shell the user is using, not between the user and the command itself.

The find command works exactly as expected.


Phew! I'm glad I've been hitting the "Happy Case" scenario all these years!

Very useful article. And very informative.

Summary:

Instead of -

  find . -name *.jpg

Use quotes around pattern i.e.

  find . -name '*.jpg'

Edit: Oops, the double-quotes should have been single quotes! Thanks, @lucd. Happy case, like I said!


Wildcards are still expanded inside double quotes. You have to use single quotes.


That's not true. Variable expansion happens in double quotes, but globs stay untouched.


...or a backslash.


I use the backslash. I also specify -type f explicitly when I’m looking for files.

  find . -type f -name \*.txt

Now the next article will talk about problems when there’s a space in the file names and you just piped the output into xargs :)


Yes but double quotes will work fine in this case -- globs won't be expanded...


Common? Yes. Simple enough to stop making this mistake after two times? Also yes. Once you internalize in which cases the shell is responsible for globbing and in which the command itself is, it's pretty clear cut.


`find` has one of the worst user experiences out of UNIX tools. I prefer to use `find . | grep foo` to find files.


Isn't this in the UNIX-HATERS Handbook? I never use -name without ''. I guess this is just muscle memory from early on, when I ran into this issue: in Unix, *.py can mean very different things depending on where it gets resolved.


That was a lot of text to explain that one should be cautious of the wildcard expansion some shells provide.

Thanks! I would have jumped right in!


The title seems a bit off, since shell expansions and arguments have nothing to do with the find command.

Both features are also often covered in entry level material for introduction to shell.


Well it seems the root of the problem is

> Most importantly, the 'find' command uses a different algorithm than shell globbing does when matching wildcard characters. More specifically, the find command will apply the search pattern against the base of the file name with all leading directories removed. This is contrasted from shell globbing which will expand the wildcard between each path component separately. When no path components are specified, the wildcard will match only files in the current directory.

So there does seem to be a `find` specific issue here


That's just ignorance about find. The globbing works fine, they're just not doing the search they think they are - they seem to think "-name" acts like "-path":

  $ ls Foobar/
  one  three  two
  $ find . -path '*bar/t*'
  ./Foobar/two
  ./Foobar/three


Most commands don’t accept the shell-like wildcard `*` as part of their command syntax; find does. That’s the connection.


It is just awesome that I stumbled upon this post. I remember previously facing a similar issue while running a command like

  find . -name *.gradle | blah blah 
Instead of finding the root cause, I bypassed it with

  find . | grep "\.gradle" | blah blah
It just feels great to now connect the dots and know the real reason for the issue.


  find . -name \*.jpg

This is pretty elementary; it's something any seasoned Linux person should know.


> You can type: man glob

Translation: you cannot type "man" followed by an asterisk because that would have required forethought in how one learns a programming language.

Argle: Hey Bargle, is that new bridge built to spec?

Bargle: It's like I always say, man: good enough for shell script manual operator discoverability.

Argle: Yeah, you're always saying that...


I almost appreciate the idea of writing a C program whose sole purpose is to show you the arguments sent to it. That's some serious overkill.

But I think echo *.py would not just be easier, but more effective at demonstrating what your find command line will actually look like after shell expansion.


That will be quite misleading when you have filenames with spaces and arguments with quotes.


Another one I've seen a few times with find (although it is actually more an xargs thing than a find thing), usually not with any bad consequences at least, is something like this:

  find . -type f | xargs grep foo

You expect to see all the "foo" lines from your files, each prefixed with the file name and a colon. And that's what you get most of the time.

But sometimes you might get a foo line without the filename prefix.

Why? Because grep only adds the filename prefix when there is more than one filename argument. There are two ways that might come about in the above command.

The first is if the find only finds one file.

The second is if the find finds so many files that xargs has to invoke grep more than once. It can happen that there is only one file left to do when it gets to the final grep invocation.

Simple fix:

  find . -type f | xargs grep foo /dev/null


A “better” fix would be:

    find . -type f -exec grep -H foo {} +


Yeah, find | xargs is definitely an antipattern, but in practice I use it all the time. Either because I forget the syntax for find -exec, or because I'm feeling principled that find -exec breaks the Unix philosophy of doing one thing well and being composable, or because my script started out using ls and got changed to use find.


Why is find | grep an anti-pattern? Is eschewing memorizing the flags of one particular command in favor of using a pipeline which is much more broadly applicable really an anti-pattern?

Heck, I’m at the point now where I pipe tar into gzip.


I think there is a cottage industry around telling people that they're using the interactive shell wrong. This all started with "useless use of cat", which people still bitch about on Reddit and Twitter on a regular basis, and so whenever someone finds something a little weird about UNIX, they are quick to call a person asking a question about it dumb. I think it might be Stockholm syndrome, or a symptom of one's greatest achievement in life thus far being reading the man page for find. Or maybe they genuinely think they're helping. I dunno.

Ultimately, it's a cascading failure of bad design. Good design is that find and xargs each do one thing and do it well. Find can output files to anything! Xargs can xargify input from anything! Bad design is that find uses \n as the output record separator and that xargs uses \n as the input record separator, but \n can appear in filenames! This design can never work and will always lead to difficult-to-debug problems. So instead of fixing it, we blame the user for not knowing "oh, well find has xargs built in. sort of. a lot of the features are there. not all of them. but some. so never use find | args, smile." or "well, we designed the unix shell to be easy to use... but there is this corner case that we knew about and could have protected you from, but we chose not to, so 1% of the time you'll rm -rf / with your xargs command, UNLESS YOU REMEMBER THIS ONE WEIRD TRICK which is to use -0 to use an out-of-band symbol to separate records which we could have just made happen by default but chose not to, just to mess you up!"
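For reference, the out-of-band spelling being alluded to:

    find . -type f -print0 | xargs -0 grep foo /dev/null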

I appreciate the efforts of people that are re-imagining the interactive shell, like PowerShell. I am not very productive with PowerShell (to get a listing of files in your directory, just type "Could-We-Make-It-Even-More-Verbose -PerhapsInTheNextVersionWeWill!") but it's a good prototype of a good shell. I have personally stepped on enough UNIX landmines to be generally OK with the compromises, or more typically don't write shell scripts for anything that will be run more than twice. It's bad though, and I don't think the users are to blame.


> Bad design is that find uses \n as the output record separator and that xargs uses \n as the input record separator, but \n can appear in filenames! This design can never work and will always lead to difficult-to-debug problems. So instead of fixing it

Perhaps, but it also can't be avoided. Everything can appear in filenames. We want filenames to be able to contain any character.

The most high-profile attempt to "fix" the "problem" you identify here is of course the ASCII standard, which defines several separators (0x1C - 0x1F) just for the purpose of delimiting fields with bytes that can't appear in data. Except of course nobody cares -- the solution was stillborn -- because data can contain any bytes that anyone wants it to.

> I am not very productive with PowerShell (to get a listing of files in your directory, just type "Could-We-Make-It-Even-More-Verbose -PerhapsInTheNextVersionWeWill!")

Well, you could do this by typing Get-ChildItem, but it would be both easier and more standard to type one of "gci" (what Microsoft would like you to use), "ls" (meant for those used to unix) or "dir" (for those used to dos).


> Perhaps, but it also can't be avoided. Everything can appear in filenames. We want filenames to be able to contain any character.

Who is "we" ? I wouldn't mind some 'sane' limits, eg UTF8, no control characters and no names starting with a dash.

Still hoping for the day that some distribution sets a mount option to enforce sane filenames by default and starts weeding out the application bugs it'll trigger..

https://dwheeler.com/essays/fixing-unix-linux-filenames.html


There are two options. \0 can't appear in filenames, which is certainly questionable (why is that byte special?), so using \0 as the separator works. A better way of solving the problem is to just prefix the record with a field that represents the length. People are allergic to the length prefix because, say you set it to 2 bytes: now your max filename length is 65,535 bytes AND you have two bytes of overhead for every filename. Whereas using \0 means that you have only one byte of overhead and can have infinitely long filenames.

Interestingly, I guess nobody has ever died because of a buffer overflow. Therac-25 was integer overflow and bad synchronization (and an open loop control system, which people still love). So I guess it doesn't matter.


> just type "Could-We-Make-It-Even-More-Verbose -PerhapsInTheNextVersionWeWill!"

PowerShell can be as terse as bash. Just use the aliases (which are actually real aliases in PowerShell), leverage default parameters, and shorten parameter names (or use their aliases) to just enough characters to disambiguate.

To get a file listing of your directory, simply type

    ls
If you want to know why this works, type

    help ls
and PowerShell will answer you with the help on "Get-ChildItem".

You could also type this to explain the `ls` command:

    gcm ls
And PowerShell will answer

    CommandType     Name                                               Version    Source
    -----------     ----                                               -------    ------
    Alias           ls -> Get-ChildItem
`gcm` is itself an alias for Get-Command - a command that gets information about a command. Executing `gcm gcm` will produce this output:

    CommandType     Name                                               Version    Source
    -----------     ----                                               -------    ------
    Alias           gcm -> Get-Command
Incidentally, why does `find` have a `--delete` option in the first place? That's not very do one thing only and do it well. `find` should search/find and do that well. Why is it also a file-deleter?

Going back to the example from TFA, in PowerShell you would delete files not matching a pattern using this command:

    ls -ex *.py -rec -file | rm
Or, if you do want it verbose, using full names instead of aliases and shortened parameter names:

    Get-ChildItem -Exclude *.py -Recurse -File | Remove-Item


I assume you mean "find | xargs", rather than "find | grep"; the latter would grep against the paths that find matches, rather than their contents. Anyway, the reason "find | xargs" is considered an antipattern is that xargs splits its input on whitespace, so it would break on any paths that contain spaces or newlines. Arguably, you can get around this by using find's -print0 and xargs's -0 arguments, respectively, as paths cannot contain null bytes, but you might as well have used -exec at this point!


xargs is also more efficient than -exec, as it stuffs multiple files into each invocation of the target command. Especially for rm, instead of executing a copy for each file.


doesn't

    -exec <command> {} \+
do that too? (and we're back at the "remembering specific syntax" point: knowing xargs helps you everywhere)


Both of these are valid:

  -exec <command> <args> {} +
  -exec <command> <args> {} <args> \;
(The backslash is unnecessary for +)

The first one, the one used here, acts like xargs. The second one executes <command> individually for each file found. Because the "+" version continually adds files to it, it must be last, but the "\;" version doesn't have that restriction since it's a single substitution. Just to be clear with an example:

  $ ls
  one  three  two
  $ find . -exec echo 11 {} 22 \;
  11 . 22
  11 ./two 22
  11 ./three 22
  11 ./one 22
  $ find . -exec echo 11 {} 22 +
  find: missing argument to `-exec'
  $ find . -exec echo 11 {} +
  11 . ./two ./three ./one
Most people seem to only learn one of them, even getting them mixed up. For example, I had learned the "\;" one and for years thought they both acted like that, with "+" simply being an alternate end-of-command identifier to avoid needing to escape the semicolon.


The “+” terminator to -exec has the same effect for find.


Do people really use find like that to clean up source code?

  git clean -fxdn  # review what would be deleted, then

  git clean -fxd


I do for .pyc files.

  find . -name "*.pyc" -delete


Is there a reason you prefer that rather than having something like

  *.py[cod]
in your .gitignore?


Yeah, the reason is that the Python interpreter doesn't give a shit about .gitignore.


Apparently, the OP read the man page of "glob" but couldn't spend a single minute reading the man page of FIND.

  -name pattern

    ... Don't forget to enclose the pattern in quotes in order to protect it from expansion by the shell.
So, a lengthy article on an already documented feature of find.


Globbing is also more complex than it seems at first blush.

Alternation is supported in bash. Stuff like "echo ∗.{png,jp{e,}g}" (utf8 asterisk to get around HN, cut/paste won't work).

That bsd derived glob is useful for some sometimes useful tricks. Like in Perl:

    use File::Glob qw/bsd_glob/;

    my @list = bsd_glob('This {list,stuff} is nested {{very,quite} deeply,deep}');


> "echo ∗.{png,jp{e,}g}" (utf8 asterisk to get around HN, cut/paste won't work)

This works:

    echo *.{png,jp{e,}g}
Note

    ∗ ⟵ U+2217   * ⟵ needed ASCII asterisk
Just put the code examples on separate lines, prefixed with spaces.


One of my favorite features of fish shell is the wildcard expansion:

https://fishshell.com/docs/current/tutorial.html#tut_wildcar...

It's rare I reach for find nowadays since switching.


Also, a good thing to remember is to mind the order of your options. I know from experience that it matters a lot, after once thinking "I can just place this `-delete` option wherever, right?" and using it as my first option. Needless to say, I had a very bad time.
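A sketch of the failure mode, assuming GNU find (don't run the first line on anything you care about):

    find . -delete -name '*.pyc'    # -delete acts first: removes everything find visits
    find . -name '*.pyc' -delete    # test first, then delete: what was actually meant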


It is unexpected to new users, and the principle of least surprise suggests it's not a great idea. At the same time, globbing is a basic concept for beginning shell users. I don't know how I learned about it, whether someone told me or something.


I wonder if this is also a problem with fish. The globbing there is different.


Fish won't pass along the unmatched glob, so in the given example you'd notice that with

    find . -name *.py
it will error initially, because there is no "*.py" in the current directory.

Globbing is still something fish, as the shell, does, and you still need to protect asterisks that need to reach commands, but you're more likely to notice.

(for bash there is also the "failglob" option that I personally recommend)

Disclaimer: I am a fish developer.


Fish basically forces you to single quote commands that contain wildcard characters.


I always write it like this:

  find . -name \*.jpg
escape the glob


I also have the habit of always using -iname instead of -name, to make searches case-insensitive.
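E.g.:

    find . -iname '*.jpg'    # also matches .JPG, .Jpg, and so on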


I used to do this until I just replaced it with `fd`.

https://github.com/sharkdp/fd


Find is an ugly beast. I only use it to print all the files and then grep my way on.


Quiz: in any directory, full or empty, I can run this command

    $ *

and get

    *

What allows me to do this? (The $ is the prompt, not what I typed.)


You can't, at least on bash version 5.0.3(1)-release with the default config Ubuntu packages it with. After a lag I get the error `$folder_name$: command not found`


I promise you you can, with a previous particular incantation.

Answer to come.

PS. the response is not important.


Oh, I thought you meant in an arbitrary directory. If the first file in sorted order is executable and prints "*" to stdout you could get that.


I did mean an arbitrary directory.

Your answer almost works; but only if PATH includes the current directory which is unusual.

It does hint at another almost answer ...

   mkdir t
   cd t
   ln -s $(type -p ls)
   *


Does bash not do expansion for the program name?


    GLOBIGNORE=*; *(){ echo *; }
welp


Correct! More or less. Mine uses alias rather than a function.

    alias \*='set noglob; echo \*'
Works because alias expansion is done before globbing.


TL;DR: globbing kicks in before other things unless you turn it off, e.g. with single quotes:

https://www.tldp.org/LDP/abs/html/globbingref.html

Don't forget about globbing when making Bash scripts.


points for spelling globbing correctly :)


I use fd-find and I don't really miss find.


At my workplace I use git bash on Windows, and because I'm always getting "man: command not found" I flick to a browser and type man <whatever I was looking for>

About once a year I forget what happened all the previous times I typed

   man find 
into google.

TLDR: looking for Dennis Ritchie, I found Chuck Tingle.


  find . -name *.jpg
TL;DR if you don’t wrap your wildcard expressions in quotes the shell will expand them.

  find . -name '*.jpg'
so wrap them in single quotes so that they make their way to the find command unexpanded.


This is the one thing I like about DOS/cmd.exe. In those shells, wildcard expansion is done by applications instead of the shell itself, so there's no need to resort to hacks like this.


That means the expansion is very limited and can be inconsistent. The bash etc. way is much more powerful and convenient.



