
Fixing Unix/Linux/Posix Filenames (2009) - lelf
https://dwheeler.com/essays/fixing-unix-linux-filenames.html
======
pdkl95

        $ curl https://dwheeler.com/essays/fixing-unix-linux-filenames.html > /tmp/page.html
        $ grep -c '"$@"' /tmp/page.html 
        0
        $ grep -c '$@' /tmp/page.html 
        0
    

For some reason this article talks about lots of incorrect methods of
handling filenames in Bourne shell, and completely ignores the correct
solution that fixes a _lot_ of the problem: always use double quotes, and
iterate over "$@".

~~~
dwheeler
But that does not solve the problem at all. File names with leading dashes
will be interpreted exactly the same way. Displaying a file name may cause
terminal escapes to suddenly be executed. A list of file names separated by
newlines still fails to work. Indeed I specifically showed many cases even in
Shell where using "$@" does not help, and many of the problems hit even if you
aren't using a shell at all.
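
To make the newline point concrete, here is a minimal Python sketch (the file names are made up for illustration, and the helper names are hypothetical). It suggests that a newline-separated list cannot round-trip a name that itself contains a newline, while a NUL-separated list can, since NUL is one of the two bytes a Unix file name can never contain.

```python
# Hypothetical file names; the second one contains an embedded newline.
names = ["report.txt", "notes\nold.txt", "-rf"]


def join_lines(items):
    """Serialize one name per line, as many tools and scripts do."""
    return "\n".join(items)


def split_lines(blob):
    """Parse a newline-separated list back into names."""
    return blob.split("\n")


def join_nul(items):
    """Serialize with NUL separators, in the spirit of find -print0 / xargs -0."""
    return "\0".join(items)


def split_nul(blob):
    """Parse a NUL-separated list back into names."""
    return blob.split("\0")


newline_round_trip = split_lines(join_lines(names))
nul_round_trip = split_nul(join_nul(names))
```

Here `newline_round_trip` ends up with one entry more than `names`, because the embedded newline is read as a separator, whereas `nul_round_trip` is identical to the input.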

~~~
pdkl95
> But that does not solve the problem at all.

I never claimed it did. I'm commenting on the many _Bourne shell examples_ in
the article, which seem to use many common incorrect solutions, while ignoring
the method that was added to the shell specifically to fix argument[1]
handling.

> Displaying a file name may cause terminal escapes to suddenly be executed

I'm not addressing program output, because program output always has that type
of problem. You can also have the same problem when you display the contents
of a file or user input. This gets even harder to solve when Unicode is
involved, regardless of encoding, and regardless of source. Again, if you
aren't validating and sanitizing your input, you have bigger (probably
security related) problems that cannot be solved by changing how the OS
handles filenames.

> Indeed I specifically showed many cases even in Shell where using "$@" does
> not help

Where? The only @ in the document that I can find is an example using arrays,
which is different than the specific string "$@" (those 4 chars only) that is
defined as copying the current args (skips word-splitting). From bash(1):

>> @ Expands to the positional parameters, starting from one. When the
expansion occurs within double quotes, each parameter expands to a separate
word. That is, "$@" is equivalent to "$1" "$2" ...

[1] The problem of in-band whitespace/etc isn't just a filename issue!
Changing how the OS handles filenames doesn't fix _non-filename_ arguments.
Again, you're only looking at one small set of problems!

edit: typos

------
catern
The problem is not in the filesystem, it's in the shell. Stop writing shell
scripts for tasks that need to be robust! None of these problems hit you when
using a better programming language like Python. (Yes, Python 3 has a bit of
an impedance mismatch between its native Unicode string type and the
filesystem, but that just makes for slightly uglier code, not actual bugs)

~~~
dwheeler
False. Many of these problems affect all languages, including Python3. There
are still many cases where you need to execute external programs. It's true
that some problems are worse in shell, but even then, most languages
(including Python) have constructs that call shell. Since the problems hit all
languages, we should work on trying to make things better. Where possible, the
simple and obvious thing should be the safe thing.

~~~
catern
Come on. The problems you list (in the introduction) are:

\- control characters in filenames: irrelevant to non-shell languages

\- leading dashes in filenames: doesn't affect normal Python/non-shell
languages; for example, unlike in shell, directory walking in Python is
implemented by calling os.walk, not execing a separate process "find" and
parsing its output

\- the lack of a standard character encoding scheme: doesn't affect Python 3,
now that the filesystem encoding system is complete and cleanly handles
encoding errors

\- spaces in filenames can cause problems: irrelevant to non-shell languages

\- special metacharacters in filenames cause some problems: irrelevant to non-
shell languages
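
The directory-walking point in the list above can be sketched roughly as follows. This is only an illustration, assuming a POSIX filesystem that accepts these names; the awkward file names and the `list_files` helper are made up. It covers the in-process case only and says nothing about what happens once a name is handed to another program.

```python
import os
import tempfile


def list_files(root):
    """Return every file path under root via os.walk, without parsing `find` output."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            found.append(os.path.join(dirpath, name))
    return sorted(found)


# Hypothetical awkward names: leading dash, space, glob metacharacter, newline.
awkward = ["-rf", "two words.txt", "star*.txt", "line\nbreak.txt"]

with tempfile.TemporaryDirectory() as root:
    for name in awkward:
        with open(os.path.join(root, name), "w") as handle:
            handle.write("x")
    # Each name comes back as one intact string; nothing is split or re-parsed.
    walked = [os.path.basename(path) for path in list_files(root)]
```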

As you admit, these problems are worse in shell (I contend, only meaningfully
exist in shell). So why is your article written using shell scripting for its
examples?

\- If you really believe that shell is worse for this, then you should be
warding people away from shell

\- If you really believe these problems are pervasive, take on the hard
target, instead of going for the easy target of shell

\- If these problems really still affect Python programmers who believe they
are safe from these footguns that are so prevalent in shell, then you should
be warning them, not the shell programmers who already know that shell is an
incredibly hard language to write robust programs in

But as your article is using shell for its examples, I can't help but conclude
that this is just another piece of shell-zealotry. If you seriously want to
fix this problem for people, you should be advising them to stop writing
programs in shell. That is undeniably the fastest and easiest mitigation to
the majority of these problems. On that, surely you must agree.

\-----------------------------------------------------------------

An aside on the topic of leading dashes in filenames: it's true that any
program parsing filenames from the command line will run into this issue where
data and options can be confused. You diagnose this as an issue with
filenames, but I disagree: This is an issue with traditional argument parsing,
which is a very poor serialization format. Any number of alternative
serialization formats would avoid this issue; for example,
[https://github.com/NuxiNL/argdata](https://github.com/NuxiNL/argdata) does
not have this problem.

~~~
cesarb
> An aside on the topic of leading dashes in filenames: [...] This is an issue
> with traditional argument parsing, which is a very poor serialization
> format.

I'd say that the issue is that traditional argument parsing uses in-band
signalling: a "filename" starting with a hyphen is treated as a control
instruction instead of a filename (in some cases, unless a special "double
hyphen" control instruction has been seen).
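
A small sketch of that in-band signalling, using Python's standard `argparse` module; the parser, its `-v` option, and the `files` argument are invented for the example. The same string is read as a control instruction or as a file name depending only on whether `--` came first.

```python
import argparse


def build_parser():
    """Build a toy parser with one option and a list of file names."""
    parser = argparse.ArgumentParser(prog="demo")
    parser.add_argument("-v", "--verbose", action="store_true")
    parser.add_argument("files", nargs="*")
    return parser


parser = build_parser()

# In-band: the argument "-v" is read as the option, not as a file named "-v".
as_option = parser.parse_args(["-v"])

# After "--", everything is positional, so "-v" is treated as a file name.
as_file = parser.parse_args(["--", "-v"])
```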

------
makecheck
This seems like a variation of other security problems: you can fix certain
endpoints but treat what’s in the middle as a kind of “sewer” that you’ll
never be able to clean up entirely. If you fix certain filesystems to restrict
characters, you know whatever runs on them is fine and then you just have to
deal with data-in, data-out (e.g. if a program has problems renaming its files
when copying, that program needs to change but maybe not _everything_ needs to
change). I’m sure even this could be tricky; I think as long as the absolute
path to the “safe” filesystem is using only safe characters, any target in the
tree should be OK to use with even the sloppiest script.

I don’t think it is realistic at all to update all tools to deal with this
(certainly not shell scripts). For one, it is a lot of work to fix just one
tool and usually there are so many programs in an infrastructure that you’d
have to fix _all_ of them before you could begin using carelessly-named files.
Also, maintenance always introduces the risk of bugs; your carefully-written
script will eventually be ruined by someone adding a new argument that handles
file names poorly, I’m sure.

If a program must be fixed, it has to be changed in a way that is maintenance-
proof. For example, if the _only_ way for a script to even _find_ a file is to
use an intermediate API, that is safe; someone can carelessly hack your
program later to add a new file option but at least they’d _only_ be able to
get that working by using the same file API, and it would remain safe.

There are now systems that obfuscate the locations of files (e.g. containers
in auto-generated weird paths for security and other purposes). We also have
situations where the file’s name in an interface may not mirror the filesystem
(e.g. a name of a standard directory can be localized and appear to be called
something entirely different, in the local language, than it is on disk).
Thus, intermediaries aren’t that new of a concept, and if you need to update
programs for security purposes _anyway_ then you can fix this problem at the
same time.

------
tyingq
Of course, other operating systems have their own oddities. Apps have to watch
for a variety of crazy things. One example:
[http://kizu514.com/blog/forbidden-file-names-on-windows-10/](http://kizu514.com/blog/forbidden-file-names-on-windows-10/)

Note that the "Portable Filename Character Set" doesn't solve some of these.

------
moron4hire
Anytime the operating system exposes a function that takes a parameter as "a
string with some restrictions" it really needs to expose the function as
requiring a structure for that parameter, and the structure needs composition
functions that make it impossible to create it incorrectly.
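
As a rough illustration of that idea in Python, the sketch below wraps a name in a validated type. The `FileName` class, the `try_make` helper, and the particular rules (no leading dash, no control characters) are assumptions chosen for the example, not an existing API, and Python cannot truly make invalid construction impossible; it only makes the validated path the obvious one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileName:
    """A single path component that has passed validation (hypothetical type)."""

    value: str

    def __post_init__(self):
        # The rules below are illustrative assumptions, not a standard.
        if self.value in ("", ".", ".."):
            raise ValueError("empty or reserved name")
        if "/" in self.value or "\0" in self.value:
            raise ValueError("name contains '/' or NUL")
        if self.value.startswith("-"):
            raise ValueError("name starts with '-'")
        if any(ord(ch) < 32 or ord(ch) == 127 for ch in self.value):
            raise ValueError("name contains a control character")


def try_make(raw):
    """Return a FileName, or None if the raw string is rejected."""
    try:
        return FileName(raw)
    except ValueError:
        return None
```

A function that accepts only `FileName` values would then never see `"-rf"` or `"a\nb"`, because `try_make` returns `None` for both.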

------
dwheeler
Author here. Please post questions, I will try to answer.

~~~
unilynx
You mention some solutions that wouldn't work with SUID programs, and that
those programs would most need the protections, but is that really a problem?

Most of the problems are in the shell expanding command lines before invoking
the actual application, so they get protected. Once the application is
started, there usually isn't any globbing going on anymore unless those SUID
applications invoke shells themselves and pass them user generated data... and
then we probably have bigger problems anyway.

~~~
dwheeler
The problems continue. For example, if you are writing a program in Python 3,
and directly execute another program with a filename as its first parameter,
it is extremely easy to have problems if the file name begins with a dash.
Notice this has absolutely nothing to do with shells, you do not need to have
a shell at all for this to be a problem. The problem happens in all cases,
because the dash is an option indicator in many programs.
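
A minimal sketch of that situation, with no shell involved. The child program here is a stand-in written for this example (it simply treats any argument starting with `-` as an option, as many tools do), and `guard` is a hypothetical helper showing the common `./` prefix workaround.

```python
import json
import subprocess
import sys

# Hypothetical child program: any argument starting with "-" counts as an
# option, everything else counts as a file name.
CHILD = (
    "import json, sys\n"
    "args = sys.argv[1:]\n"
    "print(json.dumps({'options': [a for a in args if a.startswith('-')],\n"
    "                  'files': [a for a in args if not a.startswith('-')]}))\n"
)


def run_child(*args):
    """Execute the child directly (no shell) and return how it classified its arguments."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD, *args],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


def guard(name):
    """Prefix relative names that start with '-' so they cannot look like options."""
    return "./" + name if name.startswith("-") else name


unguarded = run_child("-rf")      # the file name is mistaken for an option
guarded = run_child(guard("-rf"))  # "./-rf" is seen as a file name
```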

And that is just one example. As I discussed in the article, even displaying
file names can be a big problem because they can include control characters
that can control the terminal in which they are being displayed.
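
One way a program might defend itself when displaying names is sketched below. The `printable` helper and the hostile name are made up for illustration; the output is meant for display only and is not reversible, since literal backslashes are left alone.

```python
def printable(name):
    """Return name with non-printable characters shown as backslash escapes."""
    return "".join(
        ch if ch.isprintable() else ch.encode("unicode_escape").decode("ascii")
        for ch in name
    )


# Hypothetical hostile name: an ANSI escape sequence plus a newline.
hostile = "\x1b[31mred\nalert.txt"
shown = printable(hostile)
```

With this, the escape byte and the newline appear as the visible text `\x1b` and `\n` rather than reaching the terminal as control characters.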

~~~
tedunangst
But the file doesn't need to exist for this problem to occur? All of git, svn,
hg have had command injection attacks where a file/repo name became a command
option, but none of those cases required that such a file exist.

~~~
tux1968
This is a really good point that I think pretty much refutes the idea that the
fix to this problem is restricting what filenames can be stored in the
filesystem. The real issue is enforcing unified and robust command line
processing.

~~~
dwheeler
There is no way to enforce unified command line processing across all possible
languages and all possible programs. Even if you did, how would you enforce
knowing what a leading dash is? There's no way to tell the difference between
a file name and an option, regardless of the programming language used. That
also doesn't deal with the problem that you cannot have a list of file names
with newlines terminating them, or that merely displaying a filename with
control characters can take over a terminal window. In short, even if you
could do the impossible, it would not be enough.

~~~
tux1968
All of what you said is true, but that doesn't negate the issue raised in the
post to which I responded. Fixing filesystem naming constraints isn't enough
to fix the command line problems you've identified. So if you're going to
truly fix command line parsing, you're going to have to do it there. Perhaps
offering a library that can eventually be used across the spectrum of
utilities and commands.

------
groestl
If you mess up quoting, a lot of things break when you're presented with
"funny" input. If you get quoting right, you stop worrying about the things OP
mentions. Instead of limiting the input, which creates more mess and a new set
of compatibility issues, it's important that designers get the quoting/input
encodings of their languages/platforms/shells fixed.

~~~
dwheeler
If all people wrote perfect programs in all programming languages at all times
there would be no problems, true. But these kinds of file names are a landmine
for all programming languages. I think it is much better to ensure that simple
programs are usually the correct programs. And please note, as the article
discusses, this is not just a shell problem, all programming languages have
problems.

~~~
groestl
> But these kinds of file names are landmine for all programming languages.

That's simply not true. Some languages/platforms do not emphasize, or actively
stand in the way of correctly describing layers of input, often in the name of
convenience. Those are the worst offenders. The shell, for example. Or PHP, or
ansible, which both retroactively corrected bad decisions made in the past,
when their quoting was designed.

Some languages have bulky ways of interacting with the filesystem. They are
not convenient to use (because their layers of input are explicit) but they
don't have problems with funny characters. And this is not limited to
filesystems, just look at SQL and prepared statements.

------
peterwwillis
If filenames are pointing out a ton of bugs in programs that aren't sanitizing
input, I'd say that's fine. If the shell is misinterpreting filenames, it's
probably being way too liberal in how it allows you to construct scripts, or
at the least should include a "warning" linter mode. Bash is probably the best
program in the whole Unix-like environment, but it's crufty.

What I'd say needs to be fixed is the artificial limit on sizes of names and
paths, and I think everything should just default to UTF8 at this point.

------
evancox100
How is PowerShell (+ NTFS) in this regard? Does its object
oriented/structured data paradigm fare better?

Edit: and now that PowerShell is actually available on Linux, how does it fare
there as well?

~~~
tyingq
It has different issues. Try creating a file called aux, for example. Or even
aux.js: [https://github.com/gajus/react-aux/issues/10](https://github.com/gajus/react-aux/issues/10)

See: [https://docs.microsoft.com/en-us/windows/desktop/fileio/nami...](https://docs.microsoft.com/en-us/windows/desktop/fileio/naming-a-file)

~~~
Arnavion
>It has different issues.

To clarify the "it" here, NTFS allows you to create a file named `aux.js` just
fine. The thing that injects special handling of `aux.js` is the Windows
object layer that sits on top of it.

    
    
        CreateFileW(LR"#(\\?\C:\Users\Arnavion\Desktop\aux.js)#", GENERIC_WRITE, 0, nullptr, CREATE_NEW, 0, nullptr)
    

will create a file named `aux.js` just fine, since it tells the Windows layer
to not normalize names, which includes disabling the special handling of files
named `aux`.

~~~
tyingq
The main thread is discussing, for example, Unix filenames with newlines,
which can similarly be created with ease, but cause problems later with other
tools.

------
O_H_E
Margins would have made this a lot easier to read.

~~~
dwheeler
The text reflows to whatever margins you like. I let people choose the
margins instead of forcing specific margins on them. If you resize your
window, you can get any size you want.

~~~
AnIdiotOnTheNet
I, for one, appreciate that you aren't wasting the majority of my screen space
because you think everything looks best crammed into a column that's 1/3rd of
a screen wide.

~~~
O_H_E
I am gonna leave this here. [https://cirw.in/blog/bracketed-paste](https://cirw.in/blog/bracketed-paste)

It is not just that "I think" so; it is objectively more efficient and easier
to read, and yes, this has been studied before. I agree that going too narrow
increases scrolling, and that is annoying. (Widescreens are a stupid trend)

------
zvrba
Unrelated, but:

> Negative freedom is freedom from constraint, that is, permission to do
> things; Positive freedom is empowerment, that is, ability to do things...

Cited Angus Sibley, but the idea originates, I believe, from Erich Fromm.

------
dwheeler
I have modified the article to show examples where the same problems happened
in python3. Hopefully that will make it clear to people that this is not just
a shell problem, and that writing everything in some other language does not
solve the problems.

------
ufo
If someone wanted to take these ideas further, how would it work? Could you
create a different filesystem that forbids certain names? I don't remember if
there is a place in the APIs to return a "filename is not allowed" error.

------
rurban
I agree on all issues, but he misses an important one: identifier security.

Filenames must be treated as Unicode identifiers. I.e. they need to be
normalized, as Apple did in HFS+. The common laissez-faire garbage in, garbage
out is a security risk. Certain mixed scripts need to be forbidden, e.g.
Cyrillic letters may not appear next to Greek in the same name. Confusables
need to be warned about.
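
The normalization part of this can be sketched with Python's standard `unicodedata` module. The file name and the `canonical` helper are hypothetical, and the choice of NFC here is only for illustration (it is not a claim about which form any particular filesystem uses). Mixed-script and confusable checks are a separate, harder problem that this sketch does not attempt.

```python
import unicodedata

# Two spellings of the same visible name (hypothetical file name):
composed = "caf\u00e9.txt"      # 'é' as a single code point
decomposed = "cafe\u0301.txt"   # 'e' followed by a combining acute accent


def canonical(name, form="NFC"):
    """Return the name in a single Unicode normalization form."""
    return unicodedata.normalize(form, name)


# Without normalization the two names are different strings...
same_raw = composed == decomposed
# ...but they compare equal once both are normalized.
same_normalized = canonical(composed) == canonical(decomposed)
```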

[http://unicode.org/reports/tr39/#General_Security_Profile](http://unicode.org/reports/tr39/#General_Security_Profile)
See also
[http://www.unicode.org/reports/tr36/](http://www.unicode.org/reports/tr36/)

------
zrm
> Ugh — lots of annoying problems, caused not because we don’t have enough
> flexibility, but because we have too much.

I don't think this is quite right. The problem isn't weird characters in
filenames, it's that the system itself handles them poorly. The default
separator should have been '\0' to begin with. Globs for files should expand
names in the current directory with the "./" prefix even if you didn't
explicitly prefix the glob with "./".

Then for printing/inputting unprintable characters there should be a universal
standardized escaping format, which takes as input what it prints as output,
used by all standard utilities and with conversion functions in the standard
library that convert from escaped names to binary names and vice versa.

Actually prohibiting the characters from the filesystem creates other
problems.

Suppose I have an existing filesystem, or one mounted from a foreign system
without these restrictions. It has files "foo" and "foo\nbar" and " foo" and
"foo " etc. in the same directory. If we restrict what the filesystem accepts,
do my existing files become unreadable? Impossible to delete or rename using
standard utilities? Impossible to reversibly backup to a different machine
that uses the new restrictions?

Suppose I get an arbitrary name from an external source and intend to store it
in a filename in a way that can be losslessly converted back to the original
name later. If the only disallowed characters are '\0' and '/' then I can
handle those and be done. If filenames had many other restrictions, which can
change over time as they add Unicode control characters or people decide
something new should be prohibited, now the programmer has to handle escaping
all of those too and you're just moving the problem over there. Moreover, if
the set of characters you're not allowed to use isn't fixed ahead of time then
the safe thing is something like base64 encoding the name, but then that makes
the common case worse because you get gibberish names in all cases even though
they would otherwise have been meaningful to humans >99% of the time.
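
For the simple case described above, where only '\0' and '/' are disallowed, a lossless mapping can be very small. The sketch below is one possible scheme (percent-style escapes, with hypothetical function names); it ignores the reserved names "", "." and "..", and would need a longer escape table for every extra restriction a filesystem added.

```python
def to_filename(name: str) -> str:
    """Escape an arbitrary name so it contains neither '/' nor NUL."""
    # '%' is escaped first so that existing '%' characters stay unambiguous.
    return name.replace("%", "%25").replace("/", "%2F").replace("\0", "%00")


def from_filename(filename: str) -> str:
    """Invert to_filename; '%25' is restored last so escapes are not re-read."""
    return filename.replace("%00", "\0").replace("%2F", "/").replace("%25", "%")


# Hypothetical external names, including ones that look like escapes already.
samples = ["plain", "a/b", "50%", "%2F", "nul\0byte", "-this\nfilename\nis\nlame "]
round_trips = [from_filename(to_filename(name)) for name in samples]
```

Names without the three special characters pass through unchanged, so the common case stays readable.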

The place for prohibiting weird characters isn't at the filesystem level, it's
somewhere above it. It's perfectly reasonable for a program to restrict what
characters it is willing to put in a filename, because there are many contexts
where it makes no sense to have newlines and such. But that's for the program
to decide, because some, including the system utilities, should accept
anything. If you already have a file called "-this\nfilename\nis\nlame " then
it's quite important that things like _rm_ and _mv_ (i.e. rename) should be
able to work on it.

Moreover, the problem with portability is that it goes both ways. If you want
to be portable then you shouldn't create filenames that start with '-' because
some systems don't support them, but if you want to be portable then you still
have to be able to handle filenames that start with '-' because some systems
do have them which means there may be existing files with those names.

That said, many of the proposed solutions are still good. The continued
existence of encodings other than UTF-8 seems almost entirely without merit at
this point, and if the filename contained binary data that isn't valid UTF-8
then it could be losslessly escaped in the same way that unprintable UTF-8
characters could be.
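
Python already ships one such lossless scheme: the `surrogateescape` error handler from PEP 383, which is a different mechanism from a printable escape format but shows the round trip. In the sketch below the helper names and the sample bytes are made up for illustration.

```python
def decode_name(raw: bytes) -> str:
    """Decode as UTF-8, smuggling invalid bytes through as lone surrogates."""
    return raw.decode("utf-8", errors="surrogateescape")


def encode_name(name: str) -> bytes:
    """Invert decode_name, turning the surrogates back into the original bytes."""
    return name.encode("utf-8", errors="surrogateescape")


# Hypothetical name in Latin-1: the byte 0xE9 is not valid UTF-8 here.
raw = b"caf\xe9.txt"
text = decode_name(raw)
restored = encode_name(text)
```

The invalid byte survives as the code point U+DCE9 inside `text` and comes back unchanged in `restored`, while valid UTF-8 names decode normally.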

~~~
pjc50
> The default separator should have been '\0' to begin with

Then either you put a '\0' key on the keyboard, or you can't pass more than
one file to a program. Neither of which seems great.

(The decision that shell language and interactive shell are the same is the
great and terrible choice of UNIX; it gives the user the REPL quality that
it's really easy to build up programs from simple cases on the command line,
but it also caused optimisation for typing the minimum of characters)

~~~
zrm
> Then either you put a '\0' key on the keyboard, or you can't pass more than
> one file to a program.

Or you have ctrl+space insert '\0', or have space insert '\0' and ctrl+space
insert ' '.

It might also help if '\0' had its own printable symbol.

------
otterley
(2009)

~~~
loeg
Article claims a 2018 update, although I don't know what changed.

~~~
pedrow
The 2009 version is available from archive.org[0]. Quite a lot of detail has
been added in my opinion.

    
    
      $ wc 2009.txt 2018.txt 
        1930   15805  105631 2009.txt
        3231   27439  182068 2018.txt
    

[0]:
[https://web.archive.org/web/20090328012800/https://dwheeler....](https://web.archive.org/web/20090328012800/https://dwheeler.com/essays/fixing-unix-linux-filenames.html)

