$ curl https://dwheeler.com/essays/fixing-unix-linux-filenames.html > /tmp/page.html
$ grep -c '"$@"' /tmp/page.html
$ grep -c '$@' /tmp/page.html
I never claimed it did. I'm commenting on the many Bourne shell examples in the article, which seem to use many common incorrect solutions while ignoring the method that was added to the shell specifically to fix argument handling.
> Displaying a file name may cause terminal escapes to suddenly be executed
I'm not addressing program output, because program output always has that type of problem. You can also have the same problem when you display the contents of a file or user input. This gets even harder to solve when Unicode is involved, regardless of encoding, and regardless of source. Again, if you aren't validating and sanitizing your input, you have bigger (probably security related) problems that cannot be solved by changing how the OS handles filenames.
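(For what it's worth, the display side can at least be mitigated before output. A minimal sketch using POSIX `tr`, where the octal ranges cover the ASCII control characters; the filename here is made up:)

```shell
#!/bin/sh
# A mitigation sketch: strip ASCII control characters (octal 000-037 and
# 177) from an untrusted name before it reaches a terminal.
name=$(printf 'evil\033[2Jname')                     # embeds ESC [2J (clear-screen)
clean=$(printf '%s' "$name" | tr -d '\000-\037\177')
printf '%s\n' "$clean"                               # prints "evil[2Jname"
```

This only protects display; it obviously can't be used on the name you pass back to the filesystem, since the stripped name no longer refers to the same file.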
> Indeed I specifically showed many cases even in Shell where using "$@" does not help
Where? The only @ in the document that I can find is an example using arrays, which is different than the specific string "$@" (those 4 chars only) that is defined as copying the current args (skips word-splitting). From bash(1):
>> @ Expands to the positional parameters, starting from one. When the expansion occurs within double quotes, each parameter expands to a separate word. That is, "$@" is equivalent to "$1" "$2" ...
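The quoted behavior is easy to demonstrate. A minimal POSIX-sh sketch (the function names are just for illustration):

```shell
#!/bin/sh
# Quoted "$@" forwards each argument as one word; unquoted $@ re-splits
# them on whitespace.
count_quoted()   { set -- "$@"; echo "$#"; }
count_unquoted() { set -- $@;   echo "$#"; }   # WRONG on purpose

count_quoted   "a b" "c"    # prints 2: both arguments survive intact
count_unquoted "a b" "c"    # prints 3: "a b" was split into two words
```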
The problem of in-band whitespace/etc isn't just a filename issue! Changing how the OS handles filenames doesn't fix non-filename arguments. Again, you're only looking at one small set of problems!
But you don't need to call the shell to execute external programs.
For me there's no reason to have relative paths outside the shell. Relative paths are only a convenience for the shell user, who doesn't have to specify the full path as an argument; but if you invoke a program from another program, it costs you nothing to pass it the full path as an argument.
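For instance, a caller can resolve the absolute path first with POSIX `command -v` and pass that along:

```shell
#!/bin/sh
# Resolve a command name to an absolute path before handing it to another
# program, rather than relying on PATH lookup happening over there.
abs=$(command -v ls) || exit 1
case $abs in
    /*) printf 'absolute path: %s\n' "$abs" ;;
    *)  printf 'could not resolve\n'; exit 1 ;;
esac
```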
That's true, and a very real problem in many cases. A lot of those cases, though, are because people use positional arguments when they should use switched options ('--foo=bar' or '--foo bar') instead. This shields them from all kinds of pretty severe input-validation errors, and not just with filenames. It's a few more characters to type, but a great many option-parsing libs in all sorts of languages can automatically generate terse equivalents ('--foo=bar' can be specified as '-f bar').
I know that puts some of the onus on developers, but as you and others have pointed out, relying on the shell to handle these weird cases likely puts responsibility where it doesn't belong (calling subprocesses from another language shouldn't use the shell unless it's overwhelmingly necessary) and on something that handles it poorly (all your points about quoting hell).
That's a narrow quibble, and is exclusively concerned with getting data into invoked programs intact, from a non-shell invoker. All of your other points about encoding, validation, shell-altering STDOUT, and broken behavior when you are working in the shell (like the whitespace-related issues you point out, which are still a hassle to work around even when using switched options) are well put and very well taken.
Or you could use the solution provided by most argument parsers: the "--" option that disables option parsing for the remaining options. Trying to fix it yourself by "fixing" filenames (by prefixing with "./" or whatever) requires you to enumerate all possible problems or you're just creating a new weird machine.
You're worrying a lot about specific problems like if a filename might be confused with an option, but there are many other ways filenames can be a problem *even if their names are limited to "simple" alphanumeric characters*. If you're accepting input from the user (or worse, over a network), you will have far worse problems than crashing on a bad filename if you aren't strictly validating and parsing that input before sending it to any other program or using it as a filename.
("#2) Enumerating Badness") http://www.ranum.com/security/computer_security/editorials/d...
Concepts like "alphanumeric" or even "codepoints that can be safely printed to the screen" are NOT simple if you support Unicode (which most programs should support).
"The "obvious" way to do this is to litter command invocations with "--" before the filename(s). But it turns out this doesn't really work, because not all commands support "--" (ugh!). For example, the widely-used "echo" command is not required to support "--". What's worse, echo does support at least one dash option, so we need to escape leading-dash values somehow. POSIX recommends that you use printf(1) instead of echo(1), but some old systems do not include printf(1)."
So what? If you don't read the documentation for the programs you run, mistaking filenames for an option is only one of the many types of problems you might have. The problem of using programs incorrectly cannot be solved by changing filenames. You solve that type of problem by carefully understanding how to run each program you want to use (which means using -- for many (but not all) programs) and strictly validating & sanitizing your input.
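A minimal sketch of the two workarounds mentioned in this thread (assumes a POSIX-ish userland):

```shell
#!/bin/sh
# Two ways to stop a leading-dash name from being parsed as an option.
cd "$(mktemp -d)"
touch -- '-n'          # create a file literally named "-n"

# cat -n               # WRONG: would number stdin's lines, not read the file
cat -- '-n'            # "--" ends option parsing (POSIX utilities support it)
cat ./-n               # "./" prefix works even for tools without "--"
```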
If changing filenames makes an incorrect solution work, then it solves it, in that case.
In this thread, you're missing the point somewhat - the problem isn't that it's _impossible_ to work with all sorts of filenames using shell, the problem is that it's a PITA to do so, even for experienced users but especially for newbies.
I.e. sure I can use `"$@"` everywhere and quote every single variable I ever use, and prefix every path with `./` or use `--` for every command that supports it. What if I didn't have to? Wouldn't that be better?
I assure you I'm not; I'm trying to explain why worrying about these problems at the filename level doesn't actually solve the problem, which should be handled at a different level of abstraction.
> sure I can use `"$@"` everywhere and quote every single variable
If you are writing Bourne shell without doing that, you are always going to have problems, because arguments are not always filenames. Yes, the syntax can be slightly annoying for historical/compatibility reasons.
> What if I didn't have to?
You don't. Use a different shell, or write your scripts in one of the many languages that are widely available like ruby, python, etc. If a script has #!/bin/sh at the top, you can probably rewrite it into any language you prefer.
I completely agree that validating input is very important. Indeed, I am recommending that the kernel do input validation on file names.
I agree with others in this thread: Each program that your program passes a filename to needs to be considered and handled individually.
You cannot rely on "./" being a general solution. What if the program matches the given path against other paths, and does so verbatim without normalization? What if you're explicitly supposed to pass a filename, not a path? In the first case, maybe you claim that the program misbehaves, but one could argue that not having something like "--" is also misbehavior, and in any case, you have to handle it.
Again: Thinking that there is a universal solution, like "./", for a problem that in reality is about inconsistent interfaces and therefore requires you to look at each interface individually will get you into hot water.
The only cases in which I've seen this cause problems are programs which rely on the presence of a tty when they should not, or assume the existence of environment variables that they don't really need.
And it's not like it's impossible to do all of this yourself without the shell; there's just a lot of it, and the shell does it in a nice, well-known way. Here are some examples:
Searching the path
Setting custom environment variables
One weird one that I feel like a lot of people don't know about is that you can run shell scripts from the shell without the shebang. But if you exec one you'll get a format error. It's not good to rely on this but it's another thing that surprises people with messy scripts.
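A quick way to see this for yourself (assumes a POSIX-ish sh; the quiet fallback on ENOEXEC is exactly what the shell adds on top of execve):

```shell
#!/bin/sh
# A script with no #! line still runs from a shell, but not from execve().
cd "$(mktemp -d)"
printf 'echo hi\n' > noshebang.sh
chmod +x noshebang.sh
sh noshebang.sh        # works: the shell interprets the file directly
./noshebang.sh         # also works *from a shell*: execve() fails with
                       # ENOEXEC, and the shell quietly retries it with sh
```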
- control characters in filenames: irrelevant to non-shell languages
- leading dashes in filenames: doesn't affect normal Python/non-shell languages; for example, unlike in shell, directory walking in Python is implemented by calling os.walk, not execing a separate process "find" and parsing its output
- the lack of a standard character encoding scheme: doesn't affect Python 3, now that the filesystem encoding system is complete and cleanly handles encoding errors
- spaces in filenames can cause problems: irrelevant to non-shell languages
- special metacharacters in filenames cause some problems: irrelevant to non-shell languages
As you admit, these problems are worse in shell (I contend, only meaningfully exist in shell). So why is your article written using shell scripting for its examples?
- If you really believe that shell is worse for this, then you should be warding people away from shell
- If you really believe these problems are pervasive, take on the hard target, instead of going for the easy target of shell
- If these problems really still affect Python programmers who believe they are safe from these footguns that are so prevalent in shell, then you should be warning them, not the shell programmers who already know that shell is an incredibly hard language to write robust programs in
But as your article is using shell for its examples, I can't help but conclude that this is just another piece of shell-zealotry. If you seriously want to fix this problem for people, you should be advising them to stop writing programs in shell. That is undeniably the fastest and easiest mitigation to the majority of these problems. On that, surely you must agree.
An aside on the topic of leading dashes in filenames: it's true that any program parsing filenames from the command line will run into this issue where data and options can be confused. You diagnose this as an issue with filenames, but I disagree: This is an issue with traditional argument parsing, which is a very poor serialization format. Any number of alternative serialization formats would avoid this issue; for example, https://github.com/NuxiNL/argdata does not have this problem.
I'd say that the issue is that traditional argument parsing uses in-band signalling: a "filename" starting with a hyphen is treated as a control instruction instead of a filename (in some cases, unless a special "double hyphen" control instruction has been seen).
I don't understand how this doesn't affect non-shell languages; it's not the shell that interprets these (unless we're talking about different things).
> spaces in filenames can cause problems: irrelevant to non-shell languages
Again, this is a format/protocol problem not a language problem. A lot of formats use spaces as delimiters and if you put delimiters in the filename you're going to have to escape it.
It might be suitable for writing long, complex scripts, but for short, ad-hoc scripts its overhead (both in terms of startup speed and in terms of the boilerplate required to write full-fledged Python scripts compared to simple shell scripts) makes it really unwieldy and awkward to use.
Python still has problems displaying filenames with unprintable characters.
Arguments passed to Python from the shell are still subject to these pitfalls, etc.
See the rest of this thread for more.
Actually, you never should, unless you have a very, very good reason. Even without taking into account encoding issues, you still shouldn't.
(1) We have to be able to mount filesystems which have bad filenames and allow users to work with them.
In that situation, that means the kernel cannot reject all requests for bad file names, only creation requests. And perhaps not even that.
It can be a mount option with three values: off; reject all dubious names in all filesystem syscalls; or just in creating syscalls: open with O_CREAT, mknod, link, ...
It could also be a capability. Superuser could work with the dubious filenames and set up capabilities for non-privileged child processes to do same, so people needing to work with a filesystem full of these names can get into an environment where they can do that.
(2) Enforcement at the kernel level is too late in some situations. It doesn't entirely compensate for broken programs. Here is a trivial example of what this means. Suppose that some program receives an argument "foo bar" and decides to split that into "foo" and "bar", treating this as two file names to create or process. Those filenames look fine to the kernel, so it is allowed.
In shell scripts, not all name-related issues arise from files that are already in the system. In fact, shell scripts are often robust in the face of existing files with poor names. For instance names generated with globbing expansion are safe, even if they contain spaces and shell meta-characters. The one danger is that the prefix of the expanded list looks like command line options to the program.
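That last danger is exactly what a `./` prefix on the pattern neutralizes; a small sketch:

```shell
#!/bin/sh
# Glob expansion itself is safe: each match stays one word, unsplit.
# Prefixing the pattern with "./" closes the remaining leading-dash hole.
cd "$(mktemp -d)"
touch -- '-rf' 'a b.txt'       # hostile-looking names
for f in ./*; do
    printf 'got: %s\n' "$f"    # "./-rf" and "./a b.txt", each intact
done
```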
The issues that arise in shell programming from failing to quote variable expansions cannot be fixed outside of the shell. The OS cannot fix the lack of quoting in command $VAR; it can't prevent a shell internal variable VAR from containing spaces.
Rust's stdlib doesn't. You must pass arguments as an array.
What if yourcommandhere is shell syntax, like "for x in *.foo; do echo $x; done"? That can't be executed directly as a process argument list.
If all you have is a wrapper for execv, then you have to construct a call to "sh", where that command is all in one argument following a "-c" argument.
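Even then you can keep the data out of the script text by passing it as positional parameters after the `-c` script; a sketch:

```shell
#!/bin/sh
# The shell syntax is one fixed argv entry after -c; the data rides along
# as further argv entries and appears as "$1", "$2", ... inside the script.
sh -c 'for x in "$@"; do printf "%s\n" "$x"; done' sh 'a b' '-n' '*.txt'
# prints each argument on its own line: no splitting, no globbing,
# no option confusion
```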
You said that Rust doesn't have a library function for calling the shell but only process execution; I'm showing how you make one in such a situation.
A language and its library won't prevent applications from calling the shell, unless it's uselessly crippled.
Then you readdir, loop, println. no need to use the shell
> You said that Rust doesn't have a library function for calling the shell but only process execution; I'm showing how you make one in such a situation.
You're stating the obvious.
My point was that it has no footgun which automatically defers to the shell or does shell-like argument extraction from a single string when spawning child processes. Other languages have those things in their standard library which is a source of errors that rust avoids.
It does provide a regular gun which you can repurpose as a makeshift footgun by taking the conscious action of aiming down first.
Yes; you have a Turing complete language, so why would you use the shell?
The point that you've lost sight of here is that someone can do that and someone likely will.
(I don't disagree with you; I think that "shelling out" commands from decent programming languages is an anti-pattern.)
Have you seen this?
Look, an example of what I'm talking about, right on the Rust site: how to send "echo hello" to your shell on Windows and Unix.
You don't think someone's going to copy and paste this, and substitute whatever command they want?
> The point that you've lost sight of here is that someone can do that and someone likely will.
Sure, that is inherent in the freedom of powerful tools, you can also use them to do dangerous things. But that is besides the point.
The point is that the default and simplest way is the safe way to spawn commands.
You have to do extra work to invoke dangerous shellisms. You need to at least understand that arguments are conceptually not one long string but an array of several strings, and to invoke a shell you would already have to start with splitting that invocation into an array, so you gain little from then writing the rest of your command as one string, you might as well split up all of them and avoid all those escaping pitfalls.
This is the contrast to other languages which have constructs like system("echo foo"), spawn_sh("echo foo") or whatever, which are superficially easier to use than launching child processes the safe way.
In this regard rust is safer than most other languages, since it has no "construct which calls the shell" - note the active voice used here, "can be used to call a shell", passive, is a different case.
This is all I have claimed in this thread.
They were written in python 2 and 3 by different teams of strong developers at different organizations at different times.
They were universally unmaintainable, unreliable and undebuggable.
Some narrowly avoided installing pip-distributed malware on internal networks.
There is a reason shell is still widely used. I have never worked with anyone who actually knows one of the shell languages and prefers python for shell scripting tasks.
The shell is full of strange and dangerous behaviors, that make even experienced programmers write buggy scripts. The main reason is that the shell wants to be a programming language but works differently from every other programming language, and has a special and complicated syntax.
Also, shell scripts are really not that portable at all: if you write a script on a GNU system it's nearly guaranteed that it will not work on a non-GNU system, and vice versa. Not only the shell but, more importantly, the commands are different: GNU awk is different from standard UNIX awk, GNU sed is different from UNIX sed, grep is different, and so on. Even minor differences produce bugs that are difficult to find and fix.
For me, if a shell script grows longer than 50-100 lines of code you should consider rewriting it in python. Shell scripts are only a valid solution for rapidly writing programs that invoke some commands to automate some things, not a valid solution for doing complex tasks.
* Filenames starting with hyphens can get parsed as command line options
* Filenames containing control sequences cause the terminal to misbehave if they are printed to the screen
* Filenames containing newlines can't be easily included in all sorts of line-by-line file formats.
Finally, as the article also points out, the shell is not going away anytime soon. Getting rid of it is not the easy way out.
Python 3 is full of problems with fsencoding, to the point that some people want to deprecate the plain C locale (!).
This is hardly a solution.
Another problem is that doing many things that would be simple in the shell require a lot of boilerplate in Python. That kind of boilerplate might be worth putting up with for large programs, but for something that could be easily handled with a short shell script it's really annoying.
The chances of getting the kernel to outright ban some of the more dangerous filenames seem a lot higher than simply eliminating sh/bash...
I don’t think it is realistic at all to update all tools to deal with this (certainly not shell scripts). For one, it is a lot of work to fix just one tool and usually there are so many programs in an infrastructure that you’d have to fix all of them before you could begin using carelessly-named files. Also, maintenance always introduces the risk of bugs; your carefully-written script will eventually be ruined by someone adding a new argument that handles file names poorly, I’m sure.
If a program must be fixed, it has to be changed in a way that is maintenance-proof. For example, if the only way for a script to even find a file is to use an intermediate API, that is safe; someone can carelessly hack your program later to add a new file option but at least they’d only be able to get that working by using the same file API, and it would remain safe.
There are now systems that obfuscate the locations of files (e.g. containers in auto-generated weird paths for security and other purposes). We also have situations where the file’s name in an interface may not mirror the filesystem (e.g. a name of a standard directory can be localized and appear to be called something entirely different, in the local language, than it is on disk). Thus, intermediaries aren’t that new of a concept, and if you need to update programs for security purposes anyway then you can fix this problem at the same time.
Note that the "Portable Filename Character Set" doesn't solve some of these.
MacOS and the BSDs have had `xargs -0` and `find -print0` for a while now. Time to update that section?
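For reference, a sketch of the NUL-delimited pipeline (`-print0`/`-0` are non-POSIX, but supported by GNU findutils, the BSDs, and macOS):

```shell
#!/bin/sh
cd "$(mktemp -d)"
touch "$(printf 'one\ntwo')" plain.txt    # one name contains a newline
# NUL-delimited streaming survives newlines (and any other byte but NUL),
# so each name arrives at the consumer as exactly one argument:
find . -type f -print0 | xargs -0 ls -l --
```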
Most of the problems are in the shell expanding command lines before invoking the actual application, so they get protected. Once the application is started, there usually isn't any globbing going on anymore unless those SUID applications invoke shells themselves and pass them user generated data... and then we probably have bigger problems anyway.
And that is just one example. As I discussed in the article, even displaying file names can be a big problem because they can include control characters that can control the terminal in which they are being displayed.
edit- just wanted to say, thanks for picking this fight! It is indeed ridiculous the community hasn't fixed this.
There are a lot of tracking bugs that would need to be created for fixing userspace by reaching out to other projects, mitigations/workarounds, RFCs for the posix change? etc.
It does seem like a big effort but it seems like a worthy fight.
Do you have a succinct list of problems with arbitrary path names for non-shell programs? The two you've mentioned in this thread are (1) passing filenames with leading hyphens as options to other programs via exec(), and (2) printing filenames with terminal control characters.
That's simply not true. Some languages/platforms do not emphasize, or actively stand in the way of correctly describing layers of input, often in the name of convenience. Those are the worst offenders. The shell, for example. Or PHP, or ansible, which both retroactively corrected bad decisions made in the past, when their quoting was designed.
Some languages have bulky ways of interacting with the filesystem. They are not convenient to use (because their layers of input are explicit) but they don't have problems with funny characters. And this is not limited to filesystems, just look at SQL and prepared statements.
What I'd say needs to be fixed is the artificial limit on sizes of names and paths, and I think everything should just default to UTF8 at this point.
Edit: and now that PowerShell is actually available on Linux, how does it fare there as well?
To clarify the "it" here, NTFS allows you to create a file named `aux.js` just fine. The thing that injects special handling of `aux.js` is the Windows object layer that sits on top of it.
CreateFileW(LR"#(\\?\C:\Users\Arnavion\Desktop\aux.js)#", GENERIC_WRITE, 0, nullptr, CREATE_NEW, 0, nullptr)
As for PS, commandlets with parameters do not have a problem with filenames being mistaken for parameters. Eg `gc $foo` will never parse the contents of `$foo` as a parameter name and always as the value of the (implicit) `-Path` parameter.
Splatting arrays of values into a commandlet also does not cause problems, since splatting an array is always interpreted as providing values for positional parameters (thus they cannot be confused with parameter names), and splatting hashtables is of course unambiguous (keys are always parameter names, values are always parameter values). So `$foo = @('-Path'); gc @foo` is also fine.
Since commandlet parameters are typed, there is also no problem with filenames containing `\n`, etc, since a single string is a single string no matter what characters it contains.
Printing strings containing control characters is probably a problem on Win 10's conhost and Linux terminal emulators. I haven't checked but I would assume PS does not do anything special to escape strings containing such character codes before emitting them to `$Host`.
`iex` (the PS equivalent of `eval`) of course has the problems you'd expect, but just like `eval` it is rarely necessary.
Edit: Also, PS does not do glob expansion. Glob expansion is the responsibility of the commandlet / program being invoked, so it is also not possible for filenames to resemble parameter names in this way.
gc -Path *
It is not just that "I think" it; it is objectively more efficient and easier to read. Yes, this has been studied before.
I agree that going too narrow increases scrolling and it's annoying. (Widescreens are a stupid trend)
> Negative freedom is freedom from constraint, that is, permission to do things; Positive freedom is empowerment, that is, ability to do things...
Cited Angus Sibley, but the idea originates, I believe, from Erich Fromm.
Filenames must be treated as Unicode identifiers. I.e. they need to be normalized, as Apple did in HFS+. The common laissez-faire garbage-in, garbage-out approach is a security risk.
Certain mixed scripts need to be forbidden, e.g. Cyrillic letters may not appear next to Greek in the same name. Confusables need to be warned about.
See also http://www.unicode.org/reports/tr36/
I don't think this is quite right. The problem isn't weird characters in filenames, it's that the system itself handles them poorly. The default separator should have been '\0' to begin with. Globs for files should expand names in the current directory with the "./" prefix even if you didn't explicitly prefix the glob with "./".
Then for printing/inputting unprintable characters there should be a universal standardized escaping format, which takes as input what it prints as output, used by all standard utilities and with conversion functions in the standard library that convert from escaped names to binary names and vice versa.
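bash already ships one half of such a pair: its (non-POSIX) `printf '%q'` emits a printable, shell-reparseable form, and evaluating that form restores the original bytes. A sketch (the exact escaped spelling can vary between bash versions):

```shell
#!/bin/bash
# Round-trip sketch: escape with printf '%q', unescape by re-parsing.
name=$(printf 'bad\nname\twith spaces')
escaped=$(printf '%q' "$name")   # printable, shell-reparseable form
echo "$escaped"
eval "restored=$escaped"         # the inverse direction
[ "$restored" = "$name" ] && echo 'round trip OK'
```

It isn't the universal standard the comment asks for, since non-bash tools don't consume this format, but it shows the escape/unescape pair is implementable today.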
Actually prohibiting the characters from the filesystem creates other problems.
Suppose I have an existing filesystem, or one mounted from a foreign system without these restrictions. It has files "foo" and "foo\nbar" and " foo" and "foo " etc. in the same directory. If we restrict what the filesystem accepts, do my existing files become unreadable? Impossible to delete or rename using standard utilities? Impossible to reversibly backup to a different machine that uses the new restrictions?
Suppose I get an arbitrary name from an external source and intend to store it in a filename in a way that can be losslessly converted back to the original name later. If the only disallowed characters are '\0' and '/' then I can handle those and be done. If filenames had many other restrictions, which can change over time as they add Unicode control characters or people decide something new should be prohibited, now the programmer has to handle escaping all of those too and you're just moving the problem over there. Moreover, if the set of characters you're not allowed to use isn't fixed ahead of time then the safe thing is something like base64 encoding the name, but then that makes the common case worse because you get gibberish names in all cases even though they would otherwise have been meaningful to humans >99% of the time.
The place for prohibiting weird characters isn't at the filesystem level, it's somewhere above it. It's perfectly reasonable for a program to restrict what characters it is willing to put in a filename, because there are many contexts where it makes no sense to have newlines and such. But that's for the program to decide, because some, including the system utilities, should accept anything. If you already have a file called "-this\nfilename\nis\nlame " then it's quite important that things like rm and mv (i.e. rename) should be able to work on it.
Moreover, the problem with portability is that it goes both ways. If you want to be portable then you shouldn't create filenames that start with '-' because some systems don't support them, but if you want to be portable then you still have to be able to handle filenames that start with '-' because some systems do have them which means there may be existing files with those names.
That said, many of the proposed solutions are still good. The continued existence of encodings other than UTF-8 seems almost entirely without merit at this point, and if the filename contained binary data that isn't valid UTF-8 then it could be losslessly escaped in the same way that unprintable UTF-8 characters could be.
Then either you put a '\0' key on the keyboard, or you can't pass more than one file to a program. Neither of which seems great.
(The decision that shell language and interactive shell are the same is the great and terrible choice of UNIX; it gives the user the REPL quality that it's really easy to build up programs from simple cases on the command line, but it also caused optimisation for typing the minimum of characters)
Or you have ctrl+space insert '\0', or have space insert '\0' and ctrl+space insert ' '.
It might also help if '\0' had its own printable symbol.
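You can already consume NUL-separated names today without any new key, via bash's (non-POSIX) `read -d ''`, which treats NUL as the record delimiter; a sketch:

```shell
#!/bin/bash
# Consume a NUL-delimited stream of filenames one record at a time.
cd "$(mktemp -d)"
touch "$(printf 'new\nline')" plain
find . -type f -print0 |
while IFS= read -r -d '' f; do
    printf 'file: %q\n' "$f"    # %q makes the embedded newline visible
done
```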
$ wc 2009.txt 2018.txt
1930 15805 105631 2009.txt
3231 27439 182068 2018.txt