Fixing Unix/Linux/Posix Filenames (2009) (dwheeler.com)
94 points by lelf on March 17, 2019 | 96 comments



    $ curl https://dwheeler.com/essays/fixing-unix-linux-filenames.html > /tmp/page.html
    $ grep -c '"$@"' /tmp/page.html 
    0
    $ grep -c '$@' /tmp/page.html 
    0
For some reason this article talks about lots of incorrect methods of handling filenames in Bourne shell, and completely ignores the correct solution that fixes a lot of the problem: always use double quotes, and iterate over "$@".


But that does not solve the problem at all. File names with leading dashes will be interpreted exactly the same way. Displaying a file name may cause terminal escapes to suddenly be executed. A list of file names separated by newlines still fails to work. Indeed, I specifically showed many cases, even in shell, where using "$@" does not help, and many of the problems hit even if you aren't using a shell at all.


> But that does not solve the problem at all.

I never claimed it did. I'm commenting on the many Bourne shell examples in the article, which seem to use many common incorrect solutions, while ignoring the method that was added to the shell specifically to fix argument[1] handling.

> Displaying a file name may cause terminal escapes to suddenly be executed

I'm not addressing program output, because program output always has that type of problem. You can also have the same problem when you display the contents of a file or user input. This gets even harder to solve when Unicode is involved, regardless of encoding, and regardless of source. Again, if you aren't validating and sanitizing your input, you have bigger (probably security related) problems that cannot be solved by changing how the OS handles filenames.

> Indeed I specifically showed many cases even in Shell where using "$@" does not help

Where? The only @ in the document that I can find is an example using arrays, which is different than the specific string "$@" (those 4 chars only) that is defined as copying the current args (skips word-splitting). From bash(1):

>> @ Expands to the positional parameters, starting from one. When the expansion occurs within double quotes, each parameter expands to a separate word. That is, "$@" is equivalent to "$1" "$2" ...
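
A minimal sketch of the behaviour that man page excerpt describes, assuming a POSIX sh with the default IFS. `count_args` is a hypothetical helper used only for illustration:

```shell
# Hypothetical helper: report how many arguments it received.
count_args() {
    echo "$#"
}

# Simulate a script called with two arguments, one containing a space.
set -- "my file.txt" "other.txt"

quoted=$(count_args "$@")     # each positional parameter stays one word
unquoted=$(count_args $@)     # word splitting breaks "my file.txt" in two
joined=$(count_args "$*")     # "$*" collapses everything into one word

echo "quoted=$quoted unquoted=$unquoted joined=$joined"
# prints: quoted=2 unquoted=3 joined=1
```

Only the quoted form hands the callee the same argument list the script received.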

[1] The problem of in-band whitespace/etc isn't just a filename issue! Changing how the OS handles filenames doesn't fix non-filename arguments. Again, you're only looking at one small set of problems!

edit: typos


The problem is not in the filesystem, it's in the shell. Stop writing shell scripts for tasks that need to be robust! None of these problems hit you when using a better programming language like Python. (Yes, Python 3 has a bit of an impedance mismatch between its native Unicode string type and the filesystem, but that just makes for slightly uglier code, not actual bugs)


False. Many of these problems affect all languages, including Python3. There are still many cases where you need to execute external programs. It's true that some problems are worse in shell, but even then, most languages (including Python) have constructs that call shell. Since the problems hit all languages, we should work on trying to make things better. Where possible, the simple and obvious thing should be the safe thing.


> There are still many cases where you need to execute external programs.

But you don't need to call the shell to execute external programs.


Exactly. Even if you didn't have a shell installed, these file names would be a problem. If a file name begins with a dash, and you naively pass that file name as argument one, you are still in trouble, even if you do not have a shell installed on your system. What will happen is the receiving program will see a dash as the first character of its first argument, and in most cases assume it is an option, not a filename. That is because leading dashes are allowed in file names. It is possible to work around this, for example by prepending "./". But our goal should be to make things easy, not just make it possible to do things correctly.
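
A rough sketch of the "./" workaround, assuming a POSIX sh. `safe_path` and the `safe` variable it sets are hypothetical names, not anything from the article:

```shell
# Hypothetical helper: store an argument-safe spelling of "$1" in $safe.
# Names starting with "-" get a "./" prefix; everything else is unchanged.
# A variable is used instead of $(...) because command substitution
# would strip trailing newlines from the name.
safe_path() {
    case $1 in
        -*) safe=./$1 ;;
        *)  safe=$1 ;;
    esac
}

tmp=$(mktemp -d) || exit 1
(
    cd "$tmp" || exit 1
    printf 'hello\n' > ./-n              # a file literally named "-n"
    name=-n
    cat "$name" </dev/null || true       # "-n" is parsed as an option; the file is not read
    safe_path "$name"
    cat "$safe"                          # prints: hello
)
rm -rf -- "$tmp"
```

The same problem appears when the argument vector is built by a non-shell program; the shell is used here only because it makes the demonstration short.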


A simpler and cleaner solution that I always use in my Python scripts: always work with absolute paths, which start with /, and you solve all these problems.

For me there is no reason to have relative paths outside the shell. Relative paths are only a convenience for the shell user, who doesn't have to specify the full path as an argument. If you invoke a program from another program, it costs you nothing to pass it the full path as an argument.


There are situations (most of them involving symbolic links) in which using absolute paths can change the meaning of the path. The most obvious is when creating a symbolic link, for instance a relative symbolic link from "README" to "README.md" will not break when the whole directory is moved. A more worrying one is that using absolute paths everywhere is vulnerable to TOCTOU issues, while relative paths can be less vulnerable (and can be combined with newer APIs like openat to reduce these issues further).
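
A small sketch of the symbolic link case, assuming a POSIX sh plus `mktemp -d`. The directory names and the `README.abs` link are hypothetical:

```shell
tmp=$(mktemp -d) || exit 1
mkdir "$tmp/proj"
printf 'docs\n' > "$tmp/proj/README.md"

# Create one relative and one absolute link to the same file.
( cd "$tmp/proj" && ln -s README.md README && ln -s "$tmp/proj/README.md" README.abs )

# Move the whole directory, then try to read through each link.
mv "$tmp/proj" "$tmp/moved"
rel_result=$(cat "$tmp/moved/README" 2>/dev/null || echo broken)
abs_result=$(cat "$tmp/moved/README.abs" 2>/dev/null || echo broken)

echo "relative: $rel_result, absolute: $abs_result"
# prints: relative: docs, absolute: broken
rm -rf -- "$tmp"
```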


> What will happen is the receiving program will see a dash as the first character of its first argument, and in most cases assume it is an option, not a filename.

That's true, and a very real problem in many cases. A lot of those cases, though, are because people use positional arguments when they should use switched options ('--foo=bar' or '--foo bar') instead. This shields them from all kinds of pretty severe input-validation errors, and not just with filenames. It's a few more characters to type, but a great many option-parsing libs in all sorts of languages can automatically generate terse forms ('--foo=bar' can be specified as '-f bar').

I know that puts some of the onus on developers, but as you and others have pointed out, relying on the shell to handle these weird cases likely puts responsibility where it doesn't belong (calling subprocesses from another language shouldn't use the shell unless it's overwhelmingly necessary) and on something that handles it poorly (all your points about quoting hell).

That's a narrow quibble, and is exclusively concerned with getting data into invoked programs intact, from a non-shell invoker. All of your other points about encoding, validation, shell-altering STDOUT, and broken behavior when you are working in the shell (like the whitespace-related issues you point out, which are still a hassle to work around even when using switched options) are well put and very well taken.


> It is possible to work around this, for example by prepending "./".

Or you could use the solution provided by most argument parsers: the "--" option that disables option parsing for the remaining arguments. Trying to fix it yourself by "fixing" filenames (by prefixing with "./" or whatever) requires you to enumerate all possible problems[1] or you're just creating a new weird machine[2].
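
A minimal sketch of the "--" convention, assuming a POSIX sh and an `rm` that follows the usual utility syntax guidelines. `remove_files` is a hypothetical wrapper:

```shell
# Hypothetical wrapper: delete the named files and nothing else.
# "--" marks the end of options, so "-rf" below is treated as a filename.
remove_files() {
    rm -- "$@"
}

tmp=$(mktemp -d) || exit 1
(
    cd "$tmp" || exit 1
    printf 'data\n' > ./-rf          # a file literally named "-rf"
    remove_files -rf
    [ -e ./-rf ] && echo "still there" || echo "removed"   # prints: removed
)
rm -rf -- "$tmp"
```

This only helps with commands that honour "--", so it has to be checked per command.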

You're worrying a lot about specific problems like if a filename might be confused with an option, but there are many other ways filenames can be a problem even if their names are limited to "simple"[3] alphanumeric characters. If you're accepting input from the user (or worse, over a network), you will have far worse problems than crashing on a bad filename if you aren't strictly validating and parsing that input before sending it to any other program or using it as a filename.

[1] ("#2) Enumerating Badness") http://www.ranum.com/security/computer_security/editorials/d...

[2] https://media.ccc.de/v/28c3-4763-en-the_science_of_insecurit...

[3] Defining concepts like "alphanumeric" or even "codepoints that can be safely printed to the screen" are NOT simple concepts if you support Unicode (which most programs should support).


The article addresses your suggestion:

"The "obvious" way to do this is to litter command invocations with "--" before the filename(s). But it turns out this doesn't really work, because not all commands support "--" (ugh!). For example, the widely-used "echo" command is not required to support "--". What's worse, echo does support at least one dash option, so we need to escape leading-dash values somehow. POSIX recommends that you use printf(1) instead of echo(1), but some old systems do not include printf(1)."
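
A short sketch of the printf(1) recommendation in that passage, assuming a POSIX sh. `print_line` is a hypothetical helper:

```shell
# Hypothetical helper: print one value followed by a newline, verbatim.
# The value is passed as data for %s, never as the format or an option.
print_line() {
    printf '%s\n' "$1"
}

name=-n
echo "$name"          # output varies by implementation; many print nothing at all
print_line "$name"    # prints: -n
```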


> But it turns out this doesn't really work, because not all commands support "--"

So what? If you don't read the documentation for the programs you run, mistaking filenames for an option is only one of the many types of problem you might have. The problem of using programs incorrectly cannot be solved by changing filenames. You solve that type of problem by carefully understanding how to run each program you want to use (which means using -- for many (but not all) programs) and strictly validating & sanitizing your input.


>The problem of using programs incorrectly cannot be solved by changing filenames

If changing filenames makes an incorrect solution work, then it solves it, in that case.

----

In this thread, you're missing the point somewhat - the problem isn't that it's _impossible_ to work with all sorts of filenames using shell, the problem is that it's a PITA to do so, even for experienced users but especially for newbies.

I.e. sure I can use `"$@"` everywhere and quote every single variable I ever use, and prefix every path with `./` or use `--` for every command that supports it. What if I didn't have to? Wouldn't that be better?


> In this thread, you're missing the point somewhat

I assure you I'm not; I'm trying to explain why worrying about these problems at the filename level doesn't actually solve the problem, which should be handled at a different level of abstraction.

> sure I can use `"$@"` everywhere and quote every single variable

If you are writing Bourne shell without doing that, you are always going to have problems, because arguments are not always filenames. Yes, the syntax can be slightly annoying for historical/compatibility reasons.

> What if I didn't have to?

You don't. Use a different shell, or write your scripts in one of the many languages that are widely available like ruby, python, etc. If a script has #!/bin/sh at the top, you can probably rewrite it into any language you prefer.


Which problem will be fixed first: all the systems you use will ship with printf(1), or all the systems you use will ban filenames starting with -?


But you have already acknowledged the fundamental problem with using --, namely, that you cannot count on it always working. If you can't count on something, then that is not something you should be relying on. Clearly it is possible to do a lookup on every single command you ever use, but most people will not. Prepending "./" always works, no exceptions.

I completely agree that validating input is very important. Indeed, I am recommending that the kernel do input validation on file names.


It's important to validate input at the right level, otherwise you get attempts like PHP's "magic quotes" that appear to solve the problem superficially, but don't.

I agree with others in this thread: Each program that your program passes a filename to needs to be considered and handled individually.

You cannot rely on "./" being a general solution. What if the program matches the given path against other paths, and does so verbatim without normalization? What if you're explicitly supposed to pass a filename, not a path? In the first case, maybe you claim that the program misbehaves, but one could argue that not having something like "--" is also misbehavior, and in any case, you have to handle it.

Again: Thinking that there is a universal solution, like "./", for a problem that in reality is about inconsistent interfaces, and therefore requires you to look at each interface individually, will get you into hot water.


If the user is the one entering the program name, not calling the shell can cause unexpected behavior.


How so?

The only cases in which I've seen this cause problems are programs which rely on the presence of a tty when they should not, or assume the existence of environment variables that they don't really need.


Neither of those really have anything to do with the shell?

And it's not that any of this is impossible to do yourself without the shell; there's just a lot of it, and the shell does it in a nice and well-known way. Here are some examples:

Searching the path

Setting custom environment variables

builtins

One weird one that I feel like a lot of people don't know about is that you can run shell scripts from the shell without the shebang. But if you exec one you'll get a format error. It's not good to rely on this but it's another thing that surprises people with messy scripts.


Come on. The problems you list (in the introduction) are:

- control characters in filenames: irrelevant to non-shell languages

- leading dashes in filenames: doesn't affect normal Python/non-shell languages; for example, unlike in shell, directory walking in Python is implemented by calling os.walk, not execing a separate process "find" and parsing its output

- the lack of a standard character encoding scheme: doesn't affect Python 3, now that the filesystem encoding system is complete and cleanly handles encoding errors

- spaces in filenames can cause problems: irrelevant to non-shell languages

- special metacharacters in filenames cause some problems: irrelevant to non-shell languages

As you admit, these problems are worse in shell (I contend, only meaningfully exist in shell). So why is your article written using shell scripting for its examples?

- If you really believe that shell is worse for this, then you should be warding people away from shell

- If you really believe these problems are pervasive, take on the hard target, instead of going for the easy target of shell

- If these problems really still affect Python programmers who believe they are safe from these footguns that are so prevalent in shell, then you should be warning them, not the shell programmers who already know that shell is an incredibly hard language to write robust programs in

But as your article is using shell for its examples, I can't help but conclude that this is just another piece of shell-zealotry. If you seriously want to fix this problem for people, you should be advising them to stop writing programs in shell. That is undeniably the fastest and easiest mitigation to the majority of these problems. On that, surely you must agree.

-----------------------------------------------------------------

An aside on the topic of leading dashes in filenames: it's true that any program parsing filenames from the command line will run into this issue where data and options can be confused. You diagnose this as an issue with filenames, but I disagree: This is an issue with traditional argument parsing, which is a very poor serialization format. Any number of alternative serialization formats would avoid this issue; for example, https://github.com/NuxiNL/argdata does not have this problem.


> An aside on the topic of leading dashes in filenames: [...] This is an issue with traditional argument parsing, which is a very poor serialization format.

I'd say that the issue is that traditional argument parsing uses in-band signalling: a "filename" starting with a hyphen is treated as a control instruction instead of a filename (in some cases, unless a special "double hyphen" control instruction has been seen).


>control characters in filenames

I don't understand how this doesn't affect non-shell languages; it's not the shell that interprets these (unless we're talking about different things).

> spaces in filenames can cause problems: irrelevant to non-shell languages

Again, this is a format/protocol problem not a language problem. A lot of formats use spaces as delimiters and if you put delimiters in the filename you're going to have to escape it.


I have added several examples of python3 programs that have exactly the same problems for exactly the same reasons. Hopefully that will clarify things.


Python has already been suggested, and refuted a dozen times in this thread.

It might be suitable for writing long, complex scripts, but for short, ad-hoc scripts its overhead (both in terms of startup speed and in terms of the boilerplate required to write full-fledged Python scripts compared to simple shell scripts) makes it really unwieldy and awkward to use.

Python still has problems displaying filenames with unprintable characters.

Arguments passed to Python from the shell are still subject to these pitfalls, etc.

See the rest of this thread for more.


The fact that languages have a way to run a command on the shell does not mean you should.

Actually, you never should, unless you have a very, very good reason. Even without taking into account encoding issues, you still shouldn't.


Please re-read the article. Even if you never use the shell, running a command can still cause a lot of problems because of file name issues. They would be a problem even if you uninstalled every shell.


I completely agree with your idea; I've long believed that the OS kernel should reject dubious filenames.
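
For illustration only, a userspace sketch of the kind of check such a rejection rule might apply. The policy here (reject empty names, leading dashes, and control characters; allow spaces) is an assumption, and `is_safe_name` is a hypothetical helper, not a kernel interface or any existing API:

```shell
# Hypothetical policy check for a single filename component.
# Returns 0 if the name is acceptable, non-zero otherwise.
is_safe_name() {
    case $1 in
        ''|-*) return 1 ;;             # empty name or leading dash
    esac
    # Delete control characters (newline, tab, ESC, ...) in the C locale.
    # If anything was deleted, the name contained one and is rejected.
    cleaned=$(printf '%s' "$1" | LC_ALL=C tr -d '[:cntrl:]')
    [ "$cleaned" = "$1" ]
}

for candidate in "report.txt" "-rf" "$(printf 'two\nlines')"; do
    if is_safe_name "$candidate"; then
        echo "accepted"
    else
        echo "rejected"
    fi
done
# prints: accepted, rejected, rejected (one per line)
```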


However, from past thoughts and discussions, there are issues.

(1) We have to be able to mount filesystems which have bad filenames and allow users to work with them.

In that situation, that means the kernel cannot reject all requests for bad file names, only creation requests. And perhaps not even that.

It can be a mount option with three values: off; reject all dubious names in all filesystem syscalls; or just in creating syscalls: open with O_CREAT, mknod, link, ...

It could also be a capability. Superuser could work with the dubious filenames and set up capabilities for non-privileged child processes to do same, so people needing to work with a filesystem full of these names can get into an environment where they can do that.

(2) Enforcement at the kernel level is too late in some situations. It doesn't entirely compensate for broken programs. Here is a trivial example of what this means. Suppose that some program receives an argument "foo bar" and decides to split that into "foo" and "bar", treating this as two file names to create or process. Those filenames look fine to the kernel, so it is allowed.

In shell scripts, not all name-related issues arise from files that are already in the system. In fact, shell scripts are often robust in the face of existing files with poor names. For instance names generated with globbing expansion are safe, even if they contain spaces and shell meta-characters. The one danger is that the prefix of the expanded list looks like command line options to the program.

The issues that arise in shell programming from failing to quote variable expansions cannot be fixed outside of the shell. The OS cannot fix the lack of quoting in command $VAR; it can't prevent a shell internal variable VAR from containing spaces.


It's also not hard to find Python/Perl/etc scripts that do things like "filename per line" or "split filenames on whitespace".


> most languages (including Python) have constructs that call shell.

Rust's stdlib doesn't. You must pass arguments as array.

https://doc.rust-lang.org/std/process/struct.Command.html


An exec-like mechanism can be used to construct a [ "/bin/sh" "/bin/sh" "-c" "yourcommandhere", NULL ] argument vector.


sure, but then you're intentionally making things harder on yourself


Making what harder on yourself?

What if yourcommandhere is shell syntax, like "for x in *.foo; do echo $x; done"? That can't be executed directly as a process argument list.

If all you have is a wrapper for execv, then you have to construct a call to "sh", where that command is all in one argument following a "-c" argument.
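
A hedged sketch of what that "sh -c" argument vector looks like in practice, and why pasting data into the command string is risky. The `name` value is a made-up example of untrusted input:

```shell
# Hypothetical untrusted value containing a space and a shell metacharacter.
name='a b; echo INJECTED'

# Unsafe: the value is pasted into the command string and re-parsed by sh.
unsafe=$(sh -c "printf '%s\n' $name")

# Safer: the command string is constant; the value arrives as "$1".
# The extra "sh" fills $0, mirroring argv[0] in the vector above.
safe=$(sh -c 'printf "%s\n" "$1"' sh "$name")

printf 'unsafe: %s\n' "$unsafe"
printf 'safe:   %s\n' "$safe"
```

In the first call the inner shell runs a second command, `echo INJECTED`; in the second the whole value is printed as one string.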

You said that Rust doesn't have a library function for calling the shell but only process execution; I'm showing how you make one in such a situation.

A language and its library won't prevent applications from calling the shell, unless it's uselessly crippled.


> What if yourcommandhere is shell syntax, like "for x in *.foo; do echo $x; done"

Then you readdir, loop, println. no need to use the shell

> You said that Rust doesn't have a library function for calling the shell but only process execution; I'm showing how you make one in such a situation.

You're stating the obvious.

My point was that it has no footgun which automatically defers to the shell or does shell-like argument extraction from a single string when spawning child processes. Other languages have those things in their standard library, which is a source of errors that Rust avoids. It does provide a regular gun which you can repurpose as a makeshift footgun by taking the conscious action of aiming down first.


> Then you readdir, loop, println. no need to use the shell

Yes; you have a Turing complete language, so why would you use the shell?

The point that you've lost sight of here is that someone can do that and someone likely will.

(I don't disagree with you; I think that "shelling out" commands from decent programming languages is an anti-pattern.)

Have you seen this?

https://doc.rust-lang.org/std/process/struct.Command.html

Look, an example of what I'm talking about, right on the Rust site: how to send "echo hello" to your shell on Windows and Unix.

You don't think someone's going to copy and paste this, and substitute whatever command they want?


I agree that documentation sets a bad example.

> The point that you've lost sight of here is that someone can do that and someone likely will.

Sure, that is inherent in the freedom of powerful tools: you can also use them to do dangerous things. But that is beside the point.

The point is that the default and simplest way is the safe way to spawn commands.

You have to do extra work to invoke dangerous shellisms. You need to at least understand that arguments are conceptually not one long string but an array of several strings, and to invoke a shell you would already have to start with splitting that invocation into an array, so you gain little from then writing the rest of your command as one string, you might as well split up all of them and avoid all those escaping pitfalls.

This is in contrast to other languages, which have constructs like system("echo foo"), spawn_sh("echo foo") or whatever, which are superficially easier to use than launching child processes the safe way. In this regard Rust is safer than most other languages, since it has no "construct which calls the shell" - note the active voice used here; "can be used to call a shell", passive, is a different case.

This is all I have claimed in this thread.


I’ve had to deal with maintenance of multiple python programs that could have been done in shell instead.

They were written in python 2 and 3 by different teams of strong developers at different organizations at different times.

They were universally unmaintainable, unreliable and undebuggable.

Some narrowly avoided installing pip-distributed malware on internal networks.

There is a reason shell is still widely used. I have never worked with anyone that actually knows one of the shell languages that prefers python for shell scripting tasks.


You have dealt with badly written Python programs, sure, but if you have to deal with a badly written shell script, it's much worse!

The shell is full of strange and dangerous behaviors, that make even experienced programmers write buggy scripts. The main reason is that the shell wants to be a programming language but works differently from every other programming language, and has a special and complicated syntax.

Also, shell scripts are really not that portable at all. If you write a script on a GNU system, it's nearly guaranteed that it will not work on a non-GNU system, and vice versa. Not only the shell but, more importantly, the commands are different: GNU awk is different from standard UNIX awk, GNU sed is different from UNIX sed, grep is different, and so on. Even minor differences can produce bugs that are difficult to find and fix.

For me, if a shell script grows beyond 50 to 100 lines of code, you should consider rewriting it in Python. Shell scripts are only a valid solution for quickly writing programs that invoke some commands to automate some things, not a valid solution for doing complex tasks.


As the article already points out, not using the shell (or fixing the shell) would fix many of the problems with spaces and newlines in filenames but there are still others that are more fundamental. For example:

* Filenames starting with hyphens can get parsed as command line options

* Filenames containing control sequences cause the terminal to misbehave if they are printed to the screen

* Filenames containing newlines can't be easily included in all sorts of line-by-line file formats.

Finally, as the article also points out, the shell is not going away anytime soon. Getting rid of it is not the easy way out.


Why should we stop using a popular and effective tool in order to accommodate an unpopular and archaic set of non-features?


The proposal just wants to eliminate problematic names that no one uses anyway.

Python 3 is full of problems with fsencoding, to the point that some people want to deprecate the plain C locale (!).

This is hardly a solution.


That's right, no one really uses these kinds of filenames anyway unless they are an attacker. Supporting them simply makes it easy to end up with security vulnerabilities, and it also makes it harder to write code. Let's make things easy.


On the other hand, people may try to create "Fixing Unix/Linux/Posix Filenames.html". I'd much rather have any bytestring as filename and explicit structure.


There's a manpage that cautions how to handle this if you called your file "-" (without the quotes) IIRC ... some people just want to watch the world burn.


In addition to all of the other problems with Python mentioned in this thread, Python can also be incredibly slow to start compared to sh.

Another problem is that doing many things that would be simple in the shell requires a lot of boilerplate in Python. That kind of boilerplate might be worth putting up with for large programs, but for something that could be easily handled with a short shell script it's really annoying.


It wouldn't just be shell scripts that need replacement - the command line is generally also a shell with the same problems, so that needs replacement too

The chances of getting the kernel to outright ban some of the more dangerous filenames seem a lot higher than simply eliminating sh/bash...


I think the best would be to create an LSM module. I created one a while back but have not had time to complete it. I called it safename:

https://github.com/david-a-wheeler/linux/blob/squelch-1/secu...


Unfortunately this wouldn't catch names from network filesystems, removable media, etc., which comes up fairly frequently. (Or, more generally, from existing disks.) Is there a place in the Linux kernel you can intervene on existing filesystems to escape bad names from existing filesystems and refuse the creation of new, unescaped bad names? Could this be something unionfs-ish?


This seems like a variation of other security problems: you can fix certain endpoints but treat what’s in the middle as a kind of “sewer” that you’ll never be able to clean up entirely. If you fix certain filesystems to restrict characters, you know whatever runs on them is fine and then you just have to deal with data-in, data-out (e.g. if a program has problems renaming its files when copying, that program needs to change but maybe not everything needs to change). I’m sure even this could be tricky; I think as long as the absolute path to the “safe” filesystem is using only safe characters, any target in the tree should be OK to use with even the sloppiest script.

I don’t think it is realistic at all to update all tools to deal with this (certainly not shell scripts). For one, it is a lot of work to fix just one tool and usually there are so many programs in an infrastructure that you’d have to fix all of them before you could begin using carelessly-named files. Also, maintenance always introduces the risk of bugs; your carefully-written script will eventually be ruined by someone adding a new argument that handles file names poorly, I’m sure.

If a program must be fixed, it has to be changed in a way that is maintenance-proof. For example, if the only way for a script to even find a file is to use an intermediate API, that is safe; someone can carelessly hack your program later to add a new file option but at least they’d only be able to get that working by using the same file API, and it would remain safe.

There are now systems that obfuscate the locations of files (e.g. containers in auto-generated weird paths for security and other purposes). We also have situations where the file’s name in an interface may not mirror the filesystem (e.g. a name of a standard directory can be localized and appear to be called something entirely different, in the local language, than it is on disk). Thus, intermediaries aren’t that new of a concept, and if you need to update programs for security purposes anyway then you can fix this problem at the same time.


Of course, other operating systems have their own oddities. Apps have to watch for a variety of crazy things. One example: http://kizu514.com/blog/forbidden-file-names-on-windows-10/

Note that the "Portable Filename Character Set" doesn't solve some of these.


Anytime the operating system exposes a function that takes a parameter as "a string with some restrictions" it really needs to expose the function as requiring a structure for that parameter, and the structure needs composition functions that make it impossible to create it incorrectly.


Author here. Please post questions, I will try to answer.


> If you use GNU find and GNU xargs, you can use non-standard extensions to separate filenames with \0 instead.

MacOS and the BSDs have had `xargs -0` and `find -print0` for a while now. Time to update that section?
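
A quick sketch of why the NUL-separated form matters, assuming a `find` with `-print0`, an `xargs` with `-0`, and `mktemp -d`. The file names are made up for the demonstration:

```shell
tmp=$(mktemp -d) || exit 1
touch "$tmp/plain.txt" "$tmp/$(printf 'two\nlines.txt')"

# Newline-separated: the embedded newline makes two files look like three entries.
by_line=$(( $(find "$tmp" -type f | wc -l) ))

# NUL-separated: xargs -0 rebuilds exactly one argument per file.
by_nul=$(find "$tmp" -type f -print0 | xargs -0 sh -c 'echo "$#"' sh)

echo "by_line=$by_line by_nul=$by_nul"   # prints: by_line=3 by_nul=2
rm -rf -- "$tmp"
```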


You mention some solutions that wouldn't work with SUID programs and that they'd most need the protections.. but is that really a problem?

Most of the problems are in the shell expanding command lines before invoking the actual application, so they get protected. Once the application is started, there usually isn't any globbing going on anymore unless those SUID applications invoke shells themselves and pass them user generated data... and then we probably have bigger problems anyway.


The problems continue. For example, if you are writing a program in Python 3, and directly execute another program with a filename as its first parameter, it is extremely easy to have problems if the file name begins with a dash. Notice this has absolutely nothing to do with shells, you do not need to have a shell at all for this to be a problem. The problem happens in all cases, because the dash is an option indicator in many programs.

And that is just one example. As I discussed in the article, even displaying file names can be a big problem because they can include control characters that can control the terminal in which they are being displayed.
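
One possible mitigation on the display side, sketched under the assumption of a POSIX sh and a `tr` that pads the replacement set (GNU and BSD `tr` do). `display_name` is a hypothetical helper, and replacing with "?" is just one choice:

```shell
# Hypothetical helper: replace control characters (including ESC and
# newline) with "?" before showing a filename to the user.
display_name() {
    printf '%s' "$1" | LC_ALL=C tr '[:cntrl:]' '?'
}

# A name containing an escape sequence that clears many terminals.
name=$(printf 'report\033[2J.txt')
display_name "$name"; echo      # prints: report?[2J.txt
```

This has to be done by every program that prints names, which is the burden the article argues against.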


But the file doesn't need to exist for this problem to occur? All of git, svn, hg have had command injection attacks where a file/repo name became a command option, but none of those cases required that such a file exist.


This is a really good point that I think pretty much refutes the idea that the fix to this problem is restricting what filenames can be stored in the filesystem. The real issue is enforcing unified and robust command line processing.


There is no way to enforce unified command line processing across all possible languages and all possible programs. Even if you did, how would you enforce knowing what a leading dash is? There's no way to tell the difference between a file name and an option, regardless of the programming language used. That also doesn't deal with the problem that you cannot have a list of file names with newlines terminating them, or that merely displaying a filename with control characters can take over a terminal window. In short, even if you could do the impossible, it would not be enough.


All of what you said is true, but that doesn't negate the issue raised in the post to which I responded. Fixing filesystem naming constraints isn't enough to fix the command line problems you've identified. So if you're going to truly fix command line parsing, you're going to have to do it there, perhaps by offering a library that can eventually be used across the spectrum of utilities and commands.


No questions but wanted to let you know that’s a fantastic article. I don’t remember when I first read it (too many years back!) but enjoyed it then and enjoyed reading it again.


What's the status of your "Safename" patch?

edit- just wanted to say, thanks for picking this fight! It is indeed ridiculous the community hasn't fixed this.


It is currently in my queue of things that I want to do. However, seriously making a Linux kernel module requires multi hour blocks of time that I have struggled to find. I hope to get back to it.


It seems like this issue would benefit from a github/gitlab page?

There are a lot of tracking bugs that would need to be created for fixing userspace by reaching out to other projects, plus mitigations/workarounds, RFCs for the POSIX change, etc.

It does seem like a big effort but it seems like a worthy fight.


The article is quite long. Do you have a succinct list of specific recommendations for operating systems?

Do you have a succinct list of problems with arbitrary path names for non-shell programs? The two you've mentioned in this thread are (1) passing filenames with leading hyphens as options to other programs via exec(), and (2) printing filenames with terminal control characters.


If you mess up quoting, a lot of things break when you're presented with "funny" input. If you get quoting right, you stop worrying about the things OP mentions. Instead of limiting the input, which creates more mess and a new set of compatibility issues, it's important that designers get the quoting/input encodings of their languages/platforms/shells fixed.


If all people wrote perfect programs in all programming languages at all times, there would be no problems, true. But these kinds of file names are a landmine for all programming languages. I think it is much better to ensure that simple programs are usually the correct programs. And please note, as the article discusses, this is not just a shell problem; all programming languages have problems.


> But these kinds of file names are landmine for all programming languages.

That's simply not true. Some languages/platforms do not emphasize, or actively stand in the way of correctly describing layers of input, often in the name of convenience. Those are the worst offenders. The shell, for example. Or PHP, or ansible, which both retroactively corrected bad decisions made in the past, when their quoting was designed.

Some languages have bulky ways of interacting with the filesystem. They are not convenient to use (because their layers of input are explicit) but they don't have problems with funny characters. And this is not limited to filesystems, just look at SQL and prepared statements.


If filenames are pointing out a ton of bugs in programs that aren't sanitizing input, I'd say that's fine. If the shell is misinterpreting filenames, it's probably being way too liberal in how it allows you to construct scripts, or at the least should include a "warning" linter mode. Bash is probably the best program in the whole Unix-like environment, but it's crufty.

What I'd say needs to be fixed is the artificial limit on sizes of names and paths, and I think everything should just default to UTF8 at this point.


How is PowerShell (+ NTFS) in this regard? Does its object oriented/structured data paradigm fare better?

Edit: and now that PowerShell is actually available on Linux, how does it fare there as well?


It has different issues. Try creating a file called aux, for example. Or even aux.js: https://github.com/gajus/react-aux/issues/10

See: https://docs.microsoft.com/en-us/windows/desktop/fileio/nami...


>It has different issues.

To clarify the "it" here, NTFS allows you to create a file named `aux.js` just fine. The thing that injects special handling of `aux.js` is the Windows object layer that sits on top of it.

    CreateFileW(LR"#(\\?\C:\Users\Arnavion\Desktop\aux.js)#", GENERIC_WRITE, 0, nullptr, CREATE_NEW, 0, nullptr)
will create a file named `aux.js` just fine, since it tells the Windows layer to not normalize names, which includes disabling the special handling of files named `aux`.


The main thread is discussing, for example, Unix filenames with newlines, which can similarly be created with ease, but cause problems later with other tools.


NTFS is not too different from Linux filesystems. It allows any characters except `\` and `\0` in path components (explorer blocks a few more but that can be bypassed using namespaced paths).

As for PS, commandlets with parameters do not have a problem with filenames being mistaken for parameters. Eg `gc $foo` will never parse the contents of `$foo` as a parameter name and always as the value of the (implicit) `-Path` parameter.

Splatting arrays of values into a commandlet also does not cause problems, since splatting an array is always interpreted as providing values for positional parameters (thus they cannot be confused with parameter names), and splatting hashtables is of course unambiguous (keys are always parameter names, values are always parameter values). So `$foo = @('-Path'); gc @foo` is also fine.

Since commandlet parameters are typed, there is also no problem with filenames containing `\n`, etc, since a single string is a single string no matter what characters it contains.

Printing strings containing control characters is probably a problem on Win 10's conhost and Linux terminal emulators. I haven't checked but I would assume PS does not do anything special to escape strings containing such character codes before emitting them to `$Host`.

`iex` (the PS equivalent of `eval`) of course has the problems you'd expect, but just like `eval` it is rarely necessary.

Edit: Also, PS does not do glob expansion. Glob expansion is the responsibility of the commandlet / program being invoked, so it is also not possible for filenames to resemble parameter names in this way.

    gc *
    gc '*'
    gc -Path *
are all equivalent.


Also, PS does not consider string literals to be parameter names. So even if you do have a file named `-Path`, `gc '-Path'` is equivalent to `gc -Path '-Path'` and works fine.


Margins would have made this a lot easier to read.


The text reflows to whatever margins you like. I let people choose the margins instead of forcing specific margins on them. If you resize your window, you can get any size you want.


I, for one, appreciate that you aren't wasting the majority of my screen space because you think everything looks best crammed into a column that's 1/3rd of a screen wide.


I am gonna leave this here. https://cirw.in/blog/bracketed-paste

It is not just that "I think" so: it is objectively more efficient and easier to read, and yes, this has been studied before. I agree that going too narrow increases scrolling and is annoying. (Widescreens are a stupid trend)


Huh, thanks. That's a respectable opinion.


Thank you so much!!! I wish more sites did that!


In Firefox you can click on the reader view icon (top right in the URL bar). In Chrome, you can start the browser with the command line option `--enable-dom-distiller`, which will give you a new menu entry "Distill page" that has the same effect.


Thanks so much : )


Unrelated, but:

> Negative freedom is freedom from constraint, that is, permission to do things; Positive freedom is empowerment, that is, ability to do things...

Cited Angus Sibley, but the idea originates, I believe, from Erich Fromm.


I have modified the article to show examples where the same problems happened in python3. Hopefully that will make it clear to people that this is not just a shell problem, and that writing everything in some other language does not solve the problems.


If someone wanted to take these ideas further, how would it work? Could you create a different filesystem that forbids certain names? I don't remember if there is a place in the APIs to return a "filename is not allowed" error.


I agree on all issues, but he misses an important one: identifier security.

Filenames must be treated as Unicode identifiers. I.e. they need to be normalized, as Apple did in HFS+. The common laissez-faire "garbage in, garbage out" approach is a security risk. Certain mixed scripts need to be forbidden, e.g. Cyrillic letters may not appear next to Greek in the same name. Confusables need to be warned about.

http://unicode.org/reports/tr39/#General_Security_Profile See also http://www.unicode.org/reports/tr36/
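A rough Python sketch of both ideas, using only the standard `unicodedata` module. The `scripts_used` helper is hypothetical and guesses the script from the first word of each character's Unicode name, which is a crude approximation and not the TR39 algorithm:

```python
import unicodedata

# The same visible name in two encodings: precomposed vs. decomposed "é".
composed = "caf\u00e9.txt"
decomposed = "cafe\u0301.txt"

print(composed == decomposed)                                # False: two distinct names
print(unicodedata.normalize("NFC", decomposed) == composed)  # True after normalization


def scripts_used(name):
    """Approximate the set of scripts among the letters in name."""
    found = set()
    for ch in name:
        if ch.isalpha():
            # e.g. "CYRILLIC SMALL LETTER A" -> "CYRILLIC"
            found.add(unicodedata.name(ch, "UNKNOWN").split(" ")[0])
    return found


# Looks like "paypal", but the second letter is Cyrillic U+0430.
print(sorted(scripts_used("p\u0430ypal")))  # ['CYRILLIC', 'LATIN']
```

A real implementation would use the script property and confusables data from the reports linked above rather than character names.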


> Ugh — lots of annoying problems, caused not because we don’t have enough flexibility, but because we have too much.

I don't think this is quite right. The problem isn't weird characters in filenames, it's that the system itself handles them poorly. The default separator should have been '\0' to begin with. Globs for files should expand names in the current directory with the "./" prefix even if you didn't explicitly prefix the glob with "./".
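To make the glob idea concrete, here is a Python sketch of a wrapper that behaves the way I'm describing. The name `safe_glob` is made up; it just post-processes the results of the standard `glob.glob`:

```python
import glob
import os


def safe_glob(pattern):
    """Expand pattern, prefixing relative results with './'.

    A file named '-rf' then comes back as './-rf', which a program
    cannot mistake for an option.
    """
    results = []
    for path in glob.glob(pattern):
        if not os.path.isabs(path) and not path.startswith("./"):
            path = "./" + path
        results.append(path)
    return results
```

In a directory containing only a file named `-rf`, `safe_glob("*")` returns `['./-rf']` where plain `glob.glob("*")` returns `['-rf']`.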

Then for printing/inputting unprintable characters there should be a universal standardized escaping format, which takes as input what it prints as output, used by all standard utilities and with conversion functions in the standard library that convert from escaped names to binary names and vice versa.
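No such standard exists, so the format below is invented purely for illustration: printable ASCII passes through, a backslash is doubled, and every other byte becomes `\xHH`. The function names are hypothetical. The property that matters is that whatever the escaper prints, the unescaper accepts and turns back into the exact original bytes.

```python
def escape_name(raw):
    """Turn arbitrary filename bytes into printable ASCII, reversibly."""
    parts = []
    for byte in raw:
        if byte == 0x5C:                # backslash must itself be escaped
            parts.append("\\\\")
        elif 0x20 <= byte <= 0x7E:      # printable ASCII passes through
            parts.append(chr(byte))
        else:                           # control and non-ASCII bytes
            parts.append("\\x%02x" % byte)
    return "".join(parts)


def unescape_name(text):
    """Inverse of escape_name; raises ValueError on malformed input."""
    out = bytearray()
    i = 0
    while i < len(text):
        if text[i] != "\\":
            out += text[i].encode("ascii")
            i += 1
        elif text[i + 1:i + 2] == "\\":
            out.append(0x5C)
            i += 2
        elif text[i + 1:i + 2] == "x" and len(text) >= i + 4:
            out.append(int(text[i + 2:i + 4], 16))
            i += 4
        else:
            raise ValueError("malformed escape at offset %d" % i)
    return bytes(out)


raw = b"-this\nfilename\nis\nlame "
shown = escape_name(raw)
print(shown)
print(unescape_name(shown) == raw)  # True: the round trip is lossless
```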

Actually prohibiting the characters from the filesystem creates other problems.

Suppose I have an existing filesystem, or one mounted from a foreign system without these restrictions. It has files "foo" and "foo\nbar" and " foo" and "foo " etc. in the same directory. If we restrict what the filesystem accepts, do my existing files become unreadable? Impossible to delete or rename using standard utilities? Impossible to reversibly backup to a different machine that uses the new restrictions?

Suppose I get an arbitrary name from an external source and intend to store it in a filename in a way that can be losslessly converted back to the original name later. If the only disallowed characters are '\0' and '/' then I can handle those and be done. If filenames had many other restrictions, which can change over time as they add Unicode control characters or people decide something new should be prohibited, now the programmer has to handle escaping all of those too and you're just moving the problem over there. Moreover, if the set of characters you're not allowed to use isn't fixed ahead of time then the safe thing is something like base64 encoding the name, but then that makes the common case worse because you get gibberish names in all cases even though they would otherwise have been meaningful to humans >99% of the time.

The place for prohibiting weird characters isn't at the filesystem level, it's somewhere above it. It's perfectly reasonable for a program to restrict what characters it is willing to put in a filename, because there are many contexts where it makes no sense to have newlines and such. But that's for the program to decide, because some, including the system utilities, should accept anything. If you already have a file called "-this\nfilename\nis\nlame " then it's quite important that things like rm and mv (i.e. rename) should be able to work on it.

Moreover, the problem with portability is that it goes both ways. If you want to be portable then you shouldn't create filenames that start with '-' because some systems don't support them, but if you want to be portable then you still have to be able to handle filenames that start with '-' because some systems do have them which means there may be existing files with those names.

That said, many of the proposed solutions are still good. The continued existence of encodings other than UTF-8 seems almost entirely without merit at this point, and if the filename contained binary data that isn't valid UTF-8 then it could be losslessly escaped in the same way that unprintable UTF-8 characters could be.
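Python already does something along these lines. A short sketch with a made-up Latin-1 name: the `surrogateescape` error handler maps each undecodable byte to a lone surrogate code point, so the name survives a round trip through `str`, and `backslashreplace` gives a printable form. On POSIX systems `os.fsdecode` and `os.fsencode` use the same handler.

```python
# Latin-1 bytes for "café.txt"; the 0xE9 byte is not valid UTF-8.
raw = b"caf\xe9.txt"

# Decode without losing information: the bad byte becomes U+DCE9.
name = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
restored = name.encode("utf-8", errors="surrogateescape")
print(restored == raw)  # True

# A printable rendering for logs or terminals.
shown = name.encode("ascii", errors="backslashreplace").decode("ascii")
print(shown)
```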


> The default separator should have been '\0' to begin with

Then either you put a '\0' key on the keyboard, or you can't pass more than one file to a program. Neither of which seems great.

(The decision that shell language and interactive shell are the same is the great and terrible choice of UNIX; it gives the user the REPL quality that it's really easy to build up programs from simple cases on the command line, but it also caused optimisation for typing the minimum of characters)


> Then either you put a '\0' key on the keyboard, or you can't pass more than one file to a program.

Or you have ctrl+space insert '\0', or have space insert '\0' and ctrl+space insert ' '.

It might also help if '\0' had its own printable symbol.


It should be remembered that terminals have had "'\0' keys" for a long time. Control+@ yields NUL.


(2009)


Article claims a 2018 update, although I don't know what changed.


The 2009 version is available from archive.org[0]. Quite a lot of detail has been added in my opinion.

  $ wc 2009.txt 2018.txt 
    1930   15805  105631 2009.txt
    3231   27439  182068 2018.txt
[0]: https://web.archive.org/web/20090328012800/https://dwheeler....


We usually put the original year in titles, unless there was clearly a significant update. So I guess 2009 for now.



