
Fixing Unix Filenames: Control Characters, Leading Dashes, and Other Problems - ralph
http://www.dwheeler.com/essays/fixing-unix-linux-filenames.html
======
fffggg
This rant misplaces its frustration. This is not a problem with unix
filesystems; it is a problem with Bourne shell scripts and with UNIX
argument-parsing semantics.

Bourne shell is notorious for its problematic quoting, both of filesystem data
and of data from any other source. Every example in which he describes a
problem with a filename parameter could just as well be a problem with a non-
filename parameter. The correct solution is to not program complicated scripts
in Bourne shell, and instead use a language which _does not implement variable
access by interpolating strings and then re-tokenizing and re-evaluating
them_. Examples of satisfactory languages include Perl, Python, and Ruby.

Regarding UNIX arguments and the dash: the dash convention is an unfortunate
aspect of the flat argc/argv/envp calling convention for unix programs. Some
other operating systems provide more structure in their calling convention,
explicitly separating different types of parameters from one another. This is
both a strength and a weakness, as it results in a uniform yet inflexible
systems interface. One of the greatest strengths of UNIX is that its calling
convention is so flexible. The semantics used today are quite different from
the semantics used 40 years ago -- yet execve() remains unchanged. I would
encourage anyone interested to do a bit of historical digging here and see
how those more rigid system APIs fared over time.

Anyway, the solution to his initial question about `ls` is the `--` argument,
which signifies that option parsing should stop for the remainder of argv:

    ls -- *
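
A quick sketch of the difference, using a throwaway directory and a file
deliberately named `-l` (the directory and filenames here are hypothetical):

```shell
# Create a scratch directory containing a file whose name starts with a dash.
dir=$(mktemp -d) && cd "$dir"
touch -- '-l' 'normal.txt'

ls *      # expands to: ls -l normal.txt  -- '-l' is consumed as an option
ls -- *   # '--' ends option parsing     -- both names are listed as files
```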

The correct answer to his dotfile/glob question is: "glob() and the Bourne
shell do not have the semantics you're after. Do not use them, use readdir()."
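
In shell terms the distinction looks like this (a minimal sketch; `ls -A`
reads the directory directly, which is what a readdir()-based program sees):

```shell
# Scratch directory with one dotfile and one regular file.
dir=$(mktemp -d) && cd "$dir"
touch .hidden visible

echo *    # the default glob silently skips dotfiles: prints only "visible"
ls -A     # a directory read sees both: .hidden and visible
```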

The correct answer to his find -print question is: Yes, -print's use of the
newline as a terminator was a mistake, and it is a mistake repeated
continually throughout the land of shell scripting and the accompanying
standard UNIX utilities. As he notes, it is why -print0 was introduced. Making
-print0 standard is far easier than reworking filesystem semantics (and
reworking userland in this manner is a more complete solution, as it addresses
data integrity issues from non-filesystem inputs as well). If you want
reliable, correct programs, do not write them in shell.
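
To see the failure -print0 exists to fix, consider a filename containing a
newline (a contrived but perfectly legal name, created here in a scratch
directory):

```shell
dir=$(mktemp -d) && cd "$dir"
touch "$(printf 'one\ntwo')"   # ONE file; its name contains a newline

find . -type f -print  | wc -l                 # 2: one name looks like two lines
find . -type f -print0 | tr -dc '\0' | wc -c   # 1: NUL delimiters are unambiguous
```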

~~~
haberman
Yes, Bourne Shell's variable access scheme is a bit ghetto, but to me the
problem is that the shell is doing globbing at all. Why not have the shell
pass "*" through to the program, and have the program itself perform globbing?
Then filenames would have no impact on how the command-line is parsed.

~~~
rwmj
Because that's how MS-DOS used to work, and it was dumb. It means every
program has to do globbing (or often, _didn't_ do globbing). In any case, bash
does get this right: ls * will pass the correct filenames to the ls program no
matter what the filenames contain. Also, quotes around variable expansions can
cope with any characters.
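
For instance (a minimal sketch with a made-up filename): once a name is in a
variable, double quotes keep it a single argument no matter what it contains:

```shell
dir=$(mktemp -d) && cd "$dir"
f='two  words *not a glob*'
touch -- "$f"

ls -- "$f"    # quoted: one argument; the file is found
# ls -- $f    # unquoted: word-split and re-globbed; the lookup would fail
```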

~~~
haberman
> It means every program has to do globbing

So what? If the primary API used by command-line applications to open files
does the globbing, then programs will have to go out of their way to not glob.
And you'll get the added benefit that globs will only be applied to arguments
that are actually meant to specify filenames. There would be none of this
escaping "*" when you pass it to "find."

> In any case, bash does get this right: ls will pass the correct filenames to
> the ls program no matter what the filenames contain.

That doesn't solve the problem; your filename could be called "--help."

~~~
prakashk
> your filename could be called "--help."

bash isn't interpreting '--help' at all; it is just passed on to the program
being executed, and most GNU CLI programs conventionally interpret '--help' as
a special option.

If your filename is indeed --help, the convention is to use '--' as the
separator between your command line options and filenames. Anything after --
is not interpreted as a command-line option.

Another way would be to use a more qualified filename form ('./--help').
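
For example, both conventions work for removing a file literally named
`--help` (a small sketch in a scratch directory):

```shell
dir=$(mktemp -d) && cd "$dir"

touch -- '--help'
rm ./--help        # qualified path: the argument no longer starts with '-'

touch -- '--help'
rm -- --help       # '--' ends option parsing; the name is taken literally
```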

------
btilly
I view this as a long and detailed demonstration of Dan Bernstein's point #5
in <http://cr.yp.to/qmail/guarantee.html>: _Don't Parse!_

All of these problems arise because we're passing around text that needs to be
parsed. Programmers don't know the parsing rules, and the rules vary by
utility. The end result is that it is very easy to make something work, and
very hard to make it work correctly.

As was noted by fffggg, most of these problems disappear as soon as you switch
to a language that lets you stick the filenames in strings and never again
tries to parse them. All of the major scripting languages will do.

But not all, just most. The exception is the UTF-8 issue: switching languages
does not solve the printability of strings with unknown (and perhaps invalid)
encoding. Furthermore, scripting languages will often want to interpret the
bytes coming back from the filesystem as a string, and may have trouble if
file names are not in some recognized encoding.

~~~
sedachv
And if djb hasn't managed to convince you to stop parsing, watch this
presentation from Meredith Patterson at last year's CCC:
<http://www.youtube.com/watch?v=v8F8BqSa-XY>

------
m104
The sure-fire one-liner Bash way to iterate over files with arbitrary
filenames, with any number of files (0 to millions), is this:

    
    
        $ find $CRITERIA -print0 | xargs -0 -r -n $BATCH_SIZE $COMMAND
    

What this does is use 'find' to find the file and directory names (matching
your supplied CRITERIA), send them over to xargs with no funny filename
parsing, run those filenames as arguments to COMMAND, batched up to BATCH_SIZE
at a time (I usually use 100-500), and ignore empty batches.

Usually COMMAND would have '--' at the end to prevent filenames from being
parsed as options, but not all commands need that treatment. If COMMAND
doesn't take the filename list as the last argument(s), it can be fixed with
the '-I' replacement option to xargs.

The BATCH_SIZE part here is important in many cases. Since large numbers of
filenames can exceed the size limit on command-line arguments to COMMAND, a
reasonable BATCH_SIZE will prevent thousands or millions of filenames from
breaking your script. Also, sometimes you may want to run the files through
COMMAND one at a time (a BATCH_SIZE of 1).

Finally, the '-r' option, which I don't see mentioned often enough, tells
xargs not to be derpy and run COMMAND with an empty argument list. Seriously,
'-r' is short for '--no-run-if-empty'.
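
Putting the pattern together with concrete (hypothetical) values -- here
CRITERIA matches *.log files, BATCH_SIZE is 100, and gzip stands in for
COMMAND -- even names with spaces and newlines survive:

```shell
dir=$(mktemp -d) && cd "$dir"
touch 'a.log' 'b c.log' "$(printf 'd\ne.log')"   # awkward names on purpose

# NUL-delimited pipeline: no filename parsing between find and gzip.
find . -type f -name '*.log' -print0 | xargs -0 -r -n 100 gzip --
```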

For anything else, a nice scripting language is the only way to fly.

------
engtech
I once got into a 20-minute argument with a coworker about why it isn't a
good idea to put spaces in filenames (my argument being that it can sometimes
break handcoded argument parsing in homegrown unix scripts when the escaping
of the spaces isn't passed down properly).

~~~
sturadnidge
Yes, because the potential to break administrative scripts is a far more
important consideration than the usability of a system for the average person.

I agree with most of the points in the article, but the one regarding
spaces is, IMHO, ridiculous.

EDIT: Just to be clear, I don't use (or care that much about) spaces in file
names. I am talking about the vast majority of non-technical users.

~~~
ralph
You wouldn't like Plan 9 then; it wisely took the decision to forbid spaces
(U+20) in file names at the kernel level.

~~~
4ad
Just a small correction. While nobody uses spaces in file names in Plan 9, and
at least in Inferno #U changes spaces to something else, spaces have not been
forbidden in the kernel since March 23rd, 1999: [http://swtch.com/cgi-bin/plan9history.cgi?f=1999/0323/port/c...](http://swtch.com/cgi-bin/plan9history.cgi?f=1999/0323/port/chan.c;v=diff;e=1)

~~~
ralph
Thanks, I think you've corrected me on this before and I haven't learned! My
Plan 9 comes from before this switch.

------
nessus42
I am so very glad that someone wrote this issue up in detail. The fact that
the Unix filesystem makes it extremely difficult to write correct scripts and
to issue non-problematic commands at the shell is Unix's biggest flaw.

The whole point of Unix, and the reason that it was so revolutionary, was to
empower the easy composition of complex tasks via scripts and command line
invocations. The "negative freedom" implemented by the filesystem has
undermined the "positive freedom" of Unix's original intent.

The kernel hackers' response to this, back in the day, was typically, "Program
in C instead." It was a shame that the kernel hackers didn't understand Unix!

------
dfc
I use rename[1] to clean up problematic filenames. Sadly I have to rename it
to `rnm` because of a name conflict with the rename utility in util-linux. On
OSX it is as easy as `brew install rename`. I do not know why it has never
made it into Debian.

Taking care of a directory full of madness is as easy as `rename -z`:

    
    
      -z, --sanitize
          Replaces consecutive blanks, shell meta characters, and control characters 
          in filenames with underscores.
    

[1] <http://plasmasturm.org/code/rename>

~~~
telemachos
Somewhat random, but I'm the person who "packaged" Aristotle Pagaltzis's
outstanding _rename_ for Homebrew. I'm very glad to hear people other than me
are using it too.

~~~
dfc
Many thanks!!! I can't tell you how often I use that command, or how many
times I had scoured `apt-cache search` results looking for this functionality.

------
webreac
Very good article. Using perl instead of shell seems a lot more sensible
after reading this.

    cat * > ../collection

becomes

    perl -e 'map{open ($f,"<",$_);print <$f>;close($f)}<*>' > ../collection

Strangely, it is far easier to write a correct script in perl than in shell.

~~~
mrud
What about

    
    
        perl -ne 'print;' -- * > ../collection

~~~
kahirsch
That's actually vulnerable to certain problems. With "-n" or "-p" or "while
(<>)", the files get opened using the 2-argument version of open, so a
filename such as "|rm *" can cause problems!

------
ibotty
that's a good article, but it is not exactly new either.

