Fixing Unix Filenames (2012)

ketralnis · on Oct 31, 2013

Half of these problems are problems with the shell and have nothing to do with the filesystem. The shell is really, really, really bad about doing things like expanding spaces in filenames or environment variables into separate arguments. Notice that every example here about how it's "wrong" is a small shell script that has unexpected behaviour. That's mostly because the shell is wrong!
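
To make the splitting problem concrete, here is a minimal Python sketch (not from the article; the filename is made up). `shlex.split` roughly approximates how a POSIX shell tokenizes an unquoted command line, so it shows what happens when a name is pasted into a command string. Nothing is actually executed.

```python
import shlex

filename = "my holiday photo.jpg"  # hypothetical name containing spaces

# Pasting the name into a command string, much as an unquoted $var would:
naive = shlex.split("rm " + filename)
print(naive)  # the single filename has become three separate arguments

# Building the argument vector directly keeps the name as one argument,
# because no tokenizing step ever looks inside it.
argv = ["rm", filename]
print(argv)
```

The second form is what C programs and most non-shell languages do by default, which may be why they hit this class of bug less often.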

> Oh, and don’t display filenames. Filenames could contain control characters that control the terminal (and X-windows), causing nasty side-effects on display

That's the fault of the terminal, mostly. If it's an otherwise "reasonable" name containing, say, encoded unicode, that's everyone's fault from the filesystem to the terminal to (sometimes) the shell. But the fault is spread out over more than just the fact that the filesystem lets you name files however you like.

But now, Linus' quote there:

> "...filesystem people should aim to make "badly written" code "just work" unless people are really really unlucky. Because like it or not, that’s what 99% of all code is... Crying that it’s an application bug is like crying over the speed of light: you should deal with reality, not what you wish reality was." — Linus Torvalds, on a slightly different topic (but I like the sentiment) (http://lwn.net/Articles/326505/)

speaks to both "sides" here: there are applications in the real world that create filenames with special characters in them. The simplest most obvious of these is the space, but yes, in the real world we have applications that do this. So neither alternate reality ("restrict filesystems' names" or "keep doing what we're doing but have bug-free applications") exists, so which "reality" are you proposing that we deal with here?

overgard · on Oct 31, 2013

I'm sure most reasonable people would agree that (simple) unicode characters are fine, but what possible use is there for having control characters in a filename, like newline or carriage return? (Or if you're really evil, beep). If you remove "special" characters from the filesystem, you also remove a lot of complexity from other parts of the system for dealing with pathological corner cases. It's a win for everyone.

gilgoomesh · on Oct 31, 2013

Ignoring the fact that non-terminal users regularly want spaces, hyphens and newlines in their filenames since they have visual purpose...

The argument in favor of newlines is the same argument in favor of spaces in filenames or filenames starting with a hyphen: the filesystem should not be restricted by the continual confusion between data and instructions in bash and other shell scripts. We should fix or replace lazy programs, not try to fix lazy programming by neutering the system.

I think the correct action should be to design shell scripting languages that are robust about data versus instructions. The lack of robustness in shell scripting languages has been a burden since their inception. Restricting the filesystem won't stop problems with shell scripts making the same mistakes on other kinds of data.

einhverfr · on Oct 31, 2013

I have done a lot with users of all sorts. I see a lot of spaces and hyphens inside files, but I have never seen either a leading hyphen or an embedded newline.

Are there cases where these would be desired in GUI environments? Why and what is the use case?

gilgoomesh · on Oct 31, 2013

I've regularly seen:

-- Myfile With Emphasis --.txt

Basically, users will try to use any punctuation they see on their keyboard to adorn their filenames.

einhverfr · on Nov 1, 2013

Interesting. So maybe this should be configurable.

vacri · on Oct 31, 2013

> Ignoring the fact that non-terminal users regularly want spaces, hyphens and newlines in their filenames since they have visual purpose...

I'll happily agree with the spaces, because people like filenames to look like short phrases. But I've never seen a human make filenames with a leading hyphen (plenty with hyphen within the word) nor with a newline.

I could imagine that somewhere there is a person or two that likes to make hyphen-led filenames to try and sort things, but this is catering to the outliers. I can't imagine the use case for anyone to intentionally make a filename with a newline in it, and haven't seen such a thing, human or program-made. I think it's not common or valuable enough a thing to hamstring everyone else.

Edit: solution in the traditional linux way: off by default, but you can always compile them back in if you love them that much :)

jlgreco · on Oct 31, 2013

I once saw humans making filenames like --splitwork --verbose -tdxh input.txt

And by "once", I mean "yesterday". Now, is it reasonable to design a system that takes input flags through the names of input files? No. It is clever, yes, but it's not reasonable. The fact of the matter is however that these usecases, for better or worse, do exist. We should not change filesystems to break these systems (broken as they are already) just because bourne-related shells aren't fond of them.

overgard · on Oct 31, 2013

So one solution is to replace all the popular shells from the ground up, throwing away about 30 years of history because they're "not elegant". (Ignoring the fact that almost every attempt to do this in the past has been a resounding failure).

The other solution is to agree not to put overly weird shit in filenames.

Which seems more likely to happen in our lifetime? Which would you rather code against?

I'm baffled that people are so hell bent on what "should" be allowed that they're ignoring the fact that allowing any possible character introduces exceptional complexity while offering extremely marginal benefits. I mean, imagine trying to make a gui for a filesystem browser that allows for characters that flow both up and down along with left and right. And newlines all over the place. It'd be a mess. I mean.. why? Is filename expressivity really such a major deal?

jfb · on Oct 31, 2013

Who owns the filename? If the user owns the filename, then it is incumbent on the system to be able to handle arbitrary names. If the user doesn't own the name, then the system must not expose the name to the user. c.f. filename extensions.

If a human script is read top-to-bottom, then the computer should handle that without complaint. This may not be possible in all cases immediately -- there's far too much Unix braindamage weighting the world down to throw it all away at once -- but we should all be trying to liberate humans from the dead hand of New Jersey.

gilgoomesh · on Oct 31, 2013

If you're looking at "30 years of history" then shells have endured "weird shit in filenames" for their entire existence and civilisation hasn't collapsed. Clearly, we don't need to change anything for them so I don't think filesystems are going to recant on their longstanding policies by making any changes to further accommodate scripting languages in our lifetimes.

Meanwhile, new shells (or new versions of shells) come and go all the time. Could one of these implement a foolproof (or at least, foolresistant) way of escaping arguments? Maybe.

cnvogel · on Oct 31, 2013

I think it shows that contemporary shell programming never was exposed to as much malicious misuse as common web-apps. I mean "program *foo*" is basically equivalent to "SELECT * FROM table WHERE foo = '$bar'" when provided with arbitrary, user-provided input. And you see a lot of this in typical shell-scripts. And for output messing up our terminal, no one ever uses "<? print($input) ?>" to put things in HTML pages without escaping it.

So, yes, if you are writing scripts that should be resilient to all malicious or accidentally provided input, it's a lot of work, and you had better have code for these common operations (input, output of user-provided filenames) in all applications that try to do useful things with filenames. And it probably means that you cannot use programs that don't provide a "--" parse-stopper, or put the data on the end of an "ssh" command-line.

And there will always be an application confused by some random other seemingly innocuous character that the kernel "still" allows in filenames.

So the solution probably is not to limit the allowed characters, and also not to try to fix the useful and working parts of shells, but to augment them with the equivalents of html_escape() and mysql_prepare("...?...")/mysql_escape() where needed (to stay in the realm of SQL and web-apps).

toast0 · on Oct 31, 2013

> The other solution is to agree not to put overly weird shit in filenames.

If you don't put weird shit in your filenames, I won't put weird shit in mine. Let's not bother the kernel with our contract. :)

Flimm · on Oct 31, 2013

That doesn't solve the security vulnerabilities caused by weird filenames.

ketralnis · on Oct 31, 2013

> I'm sure most reasonable people would agree that (simple) unicode characters are fine,

What's simple? Chinese? How about vertical scripts like Mongolian? http://en.wikipedia.org/wiki/Mongolian_script What about composition characters, where unicode is basically acting like a set of instructions just like a terminal control character?

> what possible use is there for having control characters in a filename, like newline or carriage return?

Terminal control characters are only a problem because the terminal is interpreting them due to another ancient broken paradigm, but the fact is that these filenames exist. They're present right now. You're breaking old code by changing the rules on them. Keep 'rm' from operating on them, and then I can't delete them when some stupid program goes and creates them. Keep 'vi' from operating on them and I can't look inside to see wtf caused it. That's what I meant by re-quoting Linus there. You don't get to pick the reality that you already live in, and the reality that we live in has silly messy filenames with "beep" in them.

Besides, unless you go hyper-restrictive like [A-Z0-9]{1,8}, there are going to be edge-cases. Just spaces are enough to trip up bad shell scripts. Spaces are effectively control characters as far as the shell is concerned. If you allow spaces, you already have to deal with the "bad" filenames.

I don't pretend to have a good answer, so I guess my point here is that these kinds of issues are more deep-seated than they appear at first. The fact that "ls" allows what's basically an injection attack on your terminal isn't ls's fault, or necessarily the terminal's fault, or the shell's fault, it's because of a whole generation of bad assumptions and "good enough" design that adds up across these tools. Just "limiting filenames" seems like a quick fix, but isn't going to fix it because that's not the root of the problem.

rat87 · on Oct 31, 2013

I think tools should ideally not create strange filenames unless forced (and some tools, such as GUI tools, should have no force override at all). Ideally they should still be able to view and delete such files. If you open a file with that type of name, edit it, and attempt to save it, the tool should offer to rename it (deleting the version with the "bad name") by default, but allow an override to save under the old bad name.

overgard · on Oct 31, 2013

By simple I mean this:

* Printable (space excluded)

* Doesn't change the flow/layout of characters (no weird reverse characters or superscript/subscripts or characters that flow up/down).

Will that make everyone happy? Of course not. But it's pretty much "expressive enough" and it makes dealing with filenames 10x simpler.

Which policy seems more realistic:

A) sensible restrictions on file names that will please the majority of people and keep most programs working, while slightly annoying a lunatic fringe that wants EVERYTHING.

B) Require that every program that handles files in some capacity can handle every single possible weird unicode and control code subtlety in existence? (Keeping in mind that even browsers, some of the most complicated pieces of software in existence, still tend to screw up the corner cases).

ketralnis · on Oct 31, 2013

> Will that make everyone happy? Of course not. But it's pretty much "expressive enough" and it makes dealing with filenames 10x simpler.

Simple enough for a Western language speaker, sure. Pretty inhibiting for a speaker of Kannada that they can't name their résumé file with their own name.

> Require that every program that handles files in some capacity can handle every single possible weird unicode and control code subtlety in existence?

I assert again that once you allow spaces, you're already in the situation that you need to treat filenames as sequences of bytes that need quoting.

Also, I want to point out again that almost all of these problems are only when writing shell scripts. C programs or e.g. GUI applications with file pickers tend (with exceptions of course) to not have these problems since you get the filenames into them.

overgard · on Oct 31, 2013

Ok so fine, we allow people to write filenames top-to-bottom.

So programs that show the filename in the title bar: how do they display it? Really big title bar? Align the title bar on the left or right? What if I mix english and Kannada? What then?

Or what if I start with an english word in the terminal, and then switch to kannada, what's the terminal supposed to do?

I haven't used a lot of asian language systems, but I'm going to guess two things that I'm pretty sure are true:

A) they're already implicitly dealing with these restrictions, whether the file system enforces it or not

B) they probably already have workarounds to deal with these cases anyway. (alternate scripts or whatever).

Regardless of culture, I'm going to make what I think is a pretty obvious statement: good filenames are easy to type and unambiguous. If your filename is neither of those, it's not a good filename. The point of a filename isn't to be as expressive as possible, it's to be universal, descriptive, and easy to work with. None of that requires special characters.

With regard to the space thing, I think it's overblown. Spaces are common enough that good software engineers think to handle it. Newlines in files? I've /never/ seen that, even though it's possible. Also, space doesn't require an escape character. Once you start getting to newlines, it does. Once you start introducing escape characters into it, it's an entirely new layer of complexity.

josteink · on Oct 31, 2013

> So programs that show the filename in the title bar: how do they display it?

That hardly sounds like a file-system issue. That's an application and UI-design issue.

Just because you think some of these issues are hard and you don't know how to deal with them on an application level (because you haven't had to yet), that isn't a valid reason for limiting the capabilities of the file-system.

hdevalence · on Oct 31, 2013

Why are ASCII regarded as "normal characters", but Kannada, or Hangul, or Arabic, or whatever are "special characters"?

jerf · on Oct 31, 2013

There is a reasonable, non-"culturist" answer to your question, which is that ASCII admits of a monospace typesetting in which each character is fully isolated in its own box and does not interact with any others across the box, nor are there any combining marks or any of the other vagaries of Unicode, and there's a very limited number of such characters. Possibly Hangul can claim that; I know Arabic and some other scripts cannot. Other languages that can claim this include I believe all European languages (the accents are rare enough you can just represent them as characters of their own rather than combining marks) and Cyrillic. None of the ideographic languages can currently claim this, though I heard rumors that Japanese and Chinese are both experiencing some heavy pressure internally to go to an alphabetic subset of their language that actually would meet this criterion. (And not due to authoritative mandate, but rather the natural flow of language.)

It is a problem when you have languages with fundamentally different rules trying to share the screen, and that is also not a culturist observation; indeed one would think that denying this would be the culturist position.

asdffdsa33 · on Oct 31, 2013

For Japanese, in a Unicode context, I've never seen a combining mark actually used.

duskwuff · on Oct 31, 2013

While the precombined versions are much more common, Japanese does have combining characters to add voiced and semi-voiced marks to syllabic characters (e.g, か vs が, は vs. ぱ).

jerf · on Oct 31, 2013

The ideographic languages can't claim "a limited number of characters". (7-bit) ASCII fits on a typewriter ball. (By "limited", I meant small, rather than merely finite.)

Of the various points I mentioned, in modern times this is probably the least important, now that RAM, ROM, and CPU are so dirt cheap. This wasn't always true.

asdffdsa33 · on Oct 31, 2013

It does, but I've never seen them used in a Unicode context.

einhverfr · on Oct 31, 2013

Shouldn't display be a feature of the locale?

hisham_hm · on Oct 31, 2013

> Pretty inhibiting for a speaker of Kannada that they can't name their résumé file with their own name.

That's what transliteration is for. I have no problems writing my name transliterated into the Latin script.

josteink · on Oct 31, 2013

> But it's pretty much "expressive enough"

You mean like ASCII was "expressive enough" until people whose language was not English started using computers?

Your western bias shows.

That "expressive enough" attitude got us code-pages and a million different encodings to cope with the limitations we had put upon ourselves, and the related (and sometimes impossible) compatibility and inter-op issues.

Let's not walk into that one yet again, shall we?

selmnoo · on Oct 31, 2013

A really interesting case is Urdu. In Nastaliq script, the Urdu language word-wise goes right-to-left... diagonally. Here's a picture: http://i.imgur.com/nIHmZJc.png See how every next letter is just a little bit lower than the one preceding it? That's Nastaliq script, it goes left-down-ward.

As you could imagine, it's a special hell supporting this language. What's happened now is... a compromise has basically been made. Nastaliq is how Urdu had predominantly been written for years and years... but with growing usage of computers, even native Urdu speakers are abandoning it, because there's such poor support for it.

This article talks a little more about this: https://medium.com/stories-that-matter/9ce935435d90

toast0 · on Oct 31, 2013

If you're going to whitelist characters from unicode, we're going to have several problems with managing the whitelist:

a) overhead for the whitelist is going to be an issue on some systems

b) I'm going to need a kernel update every time there's a new useful code point for example, the Indian Rupee Sign http://www.fileformat.info/info/unicode/char/20b9/index.htm

dwheeler · on Oct 31, 2013

You can just check "is it UTF-8 encoded?". Then it doesn't matter when a new code point is added, it would still be correctly encoded. You could make this a configurable kernel option; when enabled, filenames can only be created when they encode with UTF-8.

The next big problem is control characters. But really, the only people who need control characters embedded in filenames are people breaking into computers using them. It doesn't matter if you use a GUI or a CUI, people generally don't embed newlines, or escape, or tabs in filenames.

The leading "-" is probably more controversial, but the POSIX standard specifically notes such filenames as nonstandard.
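
A rough sketch of the three checks described above (valid UTF-8, no control characters, no leading "-"), written in Python as a userspace illustration rather than anything a kernel would run. The function name `filename_problems` is hypothetical, and "control character" is assumed here to mean only the ASCII range (bytes below 0x20, plus 0x7F).

```python
from typing import List

def filename_problems(name: bytes) -> List[str]:
    """Return the reasons one path component would be rejected (empty list = OK)."""
    problems = []
    try:
        name.decode("utf-8")
    except UnicodeDecodeError:
        problems.append("not valid UTF-8")
    # Bytes below 0x20 and byte 0x7F are the ASCII control characters
    # (newline, tab, escape, delete, ...).
    if any(b < 0x20 or b == 0x7F for b in name):
        problems.append("contains a control character")
    if name.startswith(b"-"):
        problems.append("starts with a hyphen")
    return problems

print(filename_problems("résumé.txt".encode("utf-8")))  # []
print(filename_problems(b"-rf"))
print(filename_problems(b"two\nlines"))
print(filename_problems(b"caf\xe9"))  # Latin-1 bytes, not UTF-8
```

Because the UTF-8 check looks only at the encoding and not at which code points are assigned, a newly added character needs no change to the check.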

Flimm · on Oct 31, 2013

The fact is, we already support filesystems on UNIX with restricted filenames (eg: FAT and NTFS). If you ban terrible filenames from the filesystem, those filenames would not exist because they could never be created in the first place.

jlgreco · on Oct 31, 2013

Yeah, but we don't have a time machine, so we don't get to pick the reality where all UNIX-native filesystems restricted filenames like FAT. This isn't something that you can retroactively change, so we are stuck on this reality and need to find ways to cope.

stass · on Oct 31, 2013

It's not. Just look at what limitations of this sort led to in Windows, where each application tries to serialize an otherwise perfectly valid filename into a set of characters allowed in Windows. Of course, the majority of them never get that right.

Unix is tools, not policy. It's what you make of it, and that's its beauty.

nmcfarl · on Oct 31, 2013

I agree on the last point - but I’ll point out that at this point we already have policy limitations on what makes a file name.

As I understand it most unixes require that they are composed of valid characters (not random bytes (or more evilly bits)) and can be no longer than a file system imposed length (commonly 255 bytes).

stass · on Oct 31, 2013

What do you mean by valid characters? The file name usually can include any characters except the forward slash. There are, of course, filename limitations, which are implementation specific.

mpyne · on Oct 31, 2013

Well, any character but 0x2f and 0x00. :P

nmcfarl · on Oct 31, 2013

I was thinking that - but more than that I was thinking of FSs that allow UTF8 or UTF16 characters - which do not allow invalid code points. But from this morning's research that seems to only be NTFS and UTF16, which is not exactly a "unix" FS.

ketralnis · on Nov 1, 2013

I don't think that's strictly true. Most unix filesystems "allow" UTF-8 characters, specifically because they treat the filename only as an array of bytes and don't interpret that array at all. Perhaps NTFS does do some work to present it as UTF-16 codepoints, I don't know, but it's far from the only file system that allows this to happen.

mpyne · on Nov 1, 2013

He may have meant accept only UTF-8/16. The very nice thing about UTF-8, though, is that it plays so nicely with routines that can accept ASCII or Latin-1. You can't use old routines to character count/change case/etc., but at least they won't corrupt your string by accident.

derleth · on Oct 31, 2013

Your character set's control character is another character set's perfectly normal printable character.

As for making the kernel enforce a single character set for the filesystem: No. No. No. Put another way: No.

http://yarchive.net/comp/linux/utf8.html

From Al Viro:

    Bullshit.  It has _nothing_ to characters, wide or not.  For system filenames
    are opaque.  The only things that have special meanings are:
            octet 0x2f ('/') splits the pathname into components
            "." as a component has a special meaning
            ".." as a component has a special meaning.
    That's it.  The rest is never interpreted by the kernel.

judk · on Oct 31, 2013

What happens if I try to create a filename using a character set in which octet 0x2f appears as part of a multibyte character? Oops?

ninkendo · on Oct 31, 2013

You can't do that. That's one of the upshots of the way Linux handles filenames.

Linus has talked about this in the past... because of the fact that Linux only cares about 1 "special" character (the ASCII slash), the only sensible way to do unicode for filesystem names is to use UTF-8, which is backwards compatible with ASCII and thus obeys the kernel's expectations.

It's one of those "you can have any color car so long as it's black" ultimatums... you can use whatever encoding you want, so long as it uses 0x2f for slashes, and slashes only.

lifthrasiir · on Oct 31, 2013

I think neither ISO 2022 nor UTF-16/32 has ever seen use in filesystems, exactly for this reason. The closest analogue of this problem on a non-POSIX system is Shift_JIS, which can contain a backslash in a multibyte character (which I regard as a stupidity, given the possibility of using FD/FE bytes instead).
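
The byte-level point can be checked with a short Python sketch. The helper `contains_byte` is a hypothetical name, and the sample characters were chosen for illustration: U+2F00 is a code point whose UTF-16 form contains the byte 0x2F, and katakana "ソ" is a well-known Shift_JIS character whose second byte is 0x5C.

```python
def contains_byte(text: str, encoding: str, byte: int) -> bool:
    """True if encoding `text` produces `byte` anywhere in the output."""
    return byte in text.encode(encoding)

SLASH, BACKSLASH = 0x2F, 0x5C

# UTF-8: every byte of a multibyte sequence is >= 0x80, so a non-slash
# character can never smuggle in a 0x2F byte.
print(contains_byte("\u30bd\u2f00\u00e9", "utf-8", SLASH))

# UTF-16 (big-endian): U+2F00 encodes to the two bytes 0x2F 0x00.
print(contains_byte("\u2f00", "utf-16-be", SLASH))

# Shift_JIS: U+30BD (katakana "so") encodes to 0x83 0x5C, and 0x5C is
# the ASCII backslash.
print(contains_byte("\u30bd", "shift_jis", BACKSLASH))
```

A kernel that splits paths on 0x2F (or, on Windows, 0x5C) without knowing the encoding would cut the second and third names in the middle of a character.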

derleth · on Oct 31, 2013

> can contain a backslash in a multibyte character

This is potentially confusing but, ultimately, fine. The problem is slash (byte 0x2f, in ASCII '/') in multibyte characters.

lifthrasiir · on Nov 2, 2013

In Windows backslashes are analogous to slashes in the path.

hannibal5 · on Oct 31, 2013

This is solving the problem in the wrong place. The proper way for OS is to treat filenames as binary sequences like Linux does and have no problems.

The problems that arise elsewhere should be solved elsewhere.

overgard · on Oct 31, 2013

So instead of solving it in one place, we should solve it in hundreds of places?

ketralnis · on Oct 31, 2013

If the problem is the shell, as I assert, there are probably fewer shells than filesystems.

ambrop7 · on Oct 31, 2013

My thoughts exactly. The design of the unix shell is broken, and not anything else. In particular, the lack of a well supported list/array. If there was a list type, you could store filenames in a list, and there would be no reason for the IFS stuff. And by well-supported, I mean you could easily encode a list unambiguously to stdout, and decode it on the other end.

I know about bash arrays, but those are neither well-supported nor standard. For example you can't easily return an array from a bash function, which is a major problem.
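
One way to get the unambiguous list encoding asked for above is to terminate each name with a NUL byte, since NUL cannot appear in a filename. This is the same idea as `find -print0` paired with `xargs -0`. The sketch below is in Python and the function names are hypothetical.

```python
from typing import List

def encode_names(names: List[bytes]) -> bytes:
    """Serialize filenames as NUL-terminated records."""
    for name in names:
        if b"\0" in name:
            raise ValueError("a filename cannot contain NUL")
    return b"".join(name + b"\0" for name in names)

def decode_names(data: bytes) -> List[bytes]:
    """Inverse of encode_names."""
    if data and not data.endswith(b"\0"):
        raise ValueError("truncated record")
    # The split leaves one empty piece after the final terminator; drop it.
    return data.split(b"\0")[:-1]

names = [b"plain.txt", b"with space.txt", b"two\nlines.txt", b"-rf"]
wire = encode_names(names)
print(decode_names(wire) == names)
```

Spaces, newlines and leading hyphens all survive the round trip because none of them is the delimiter.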

einhverfr · on Nov 1, 2013

I don't think it is just the shell though. As soon as you start trying to do any automatic interchange of filesystem metadata, you are going to run into the same problems. Namely you have to have semantically important sets of bytes, and filesystems which allow arbitrary binary strings can insert semantic data into your exchange system, which necessitates extra escaping on all sides.

So the question is what assumptions we have, and how those should be enforced. Here are some humble recommendations.

1. Disable non-printing characters in filenames (this is encoding dependent and assumes a kernel aware of encoding).

2. Allow system administrators to configure additional rules, like "no leading/trailing whitespace, no starting with a hyphen, no internal tabs, no SGML special characters, UTF-8 only" and the like.

This would empower application developers to state what assumptions they rely on and eventual standards to emerge.

Flimm · on Oct 31, 2013

Leading dashes in filenames are not a problem caused by the shell but a problem caused by the way most binaries interpret argv.
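
This can be shown without any shell at all. The sketch below uses Python's `argparse` with a made-up tool called `rmlike`; the argument list is handed to the parser directly, so the only thing interpreting the leading dash is the program's own option parsing.

```python
import argparse

parser = argparse.ArgumentParser(prog="rmlike")  # hypothetical tool
parser.add_argument("-f", "--force", action="store_true")
parser.add_argument("files", nargs="*")

# A file literally named "-f" is taken as the option, not as a file.
args = parser.parse_args(["-f"])
print(args.force, args.files)

# After "--", everything is treated as a positional argument.
args = parser.parse_args(["--", "-f"])
print(args.force, args.files)

# Prefixing "./" also works, and does not rely on the tool supporting "--".
args = parser.parse_args(["./-f"])
print(args.force, args.files)
```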

deckiedan · on Oct 31, 2013

My life would be a lot easier if there were some more restrictions on file-names, it's true. Just cutting out newlines would save me a lot of grief.

At my work I'm doing a lot of sysop/devops/admin stuff for a media production team. The team-wide project file structure contains a whole bunch of folders starting with '-'. It's not a problem, you just have to write all scripts being aware of it.

As we're making videos about different places around the world, it's not uncommon for editors to copy and paste things (including unicode weird other languages, etc) into filenames from OSX Finder.

I got caught out recently when someone had pasted in some text with newlines. Boy can that ever wreak havoc.

The thing which got me was that BSD md5 on OSX and GNU md5sum on CentOS handle things differently (using \ before a filename with control characters in it on one of them). So creating an md5 list on one computer and comparing it against a list made on another computer was failing, even though the files were exactly the same.

Also annoying is trying to write scripts which handle " and ' in names.

The typical BASH tutorial response is, 'use "$VAR"! That always works!'

Unless you are sending a command to run on a remote server, or putting together a rsync command that runs over SSH. Then, it breaks. So:

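    # Named PROJECT, not PATH: overwriting PATH would stop the shell from finding rsync at all.
    PROJECT="Fred's New: \"Test\" Project!"
    rsync -ave "ssh -i ~/.ssh/key" --delete "$PROJECT/" "user@remote:$PROJECT"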

will have problems. (If I remember correctly). I think this one was fixed with something odd like using ' quotes, and then replacing all ' in the filenames with '\'' (which ends the current string, puts a single quote on its own as a string, and then starts the string again, but treats them all as one long string).
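
For what it's worth, Python's standard `shlex.quote` does essentially that single-quote trick (it closes the string, inserts a quoted ', and reopens it). A minimal sketch, assuming the remote side parses the path with a POSIX shell; whether the extra quoting is needed at all depends on the rsync version and options, so treat the argument list as illustrative only. Nothing is executed here.

```python
import shlex

project = "Fred's New: \"Test\" Project!"

# The remote shell parses the path a second time, so quote it for that shell.
remote_arg = "user@remote:" + shlex.quote(project)

# Local arguments go in a list, so the local side needs no quoting at all.
argv = ["rsync", "-a", "-v", "-e", "ssh -i ~/.ssh/key", "--delete",
        project + "/", remote_arg]

# Sanity check: a POSIX-style tokenizer recovers the original name as one word.
print(shlex.split(shlex.quote(project)) == [project])
print(remote_arg)
```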

99% of my problems went away when I switched most scripts over to python.

(And, yes, started adding more and more weirdo filenames into my /shell/ /script/ unit tests...)

deckiedan · on Oct 31, 2013

That said, I'm very much in favour of allowing any kind of multi-lingual multi-script UTF-whatever filenames. I'm not even totally convinced that new-lines and other control characters are always wrong.

However, some better tools and much better awareness for people would not go astray.

justincormack · on Oct 31, 2013

Under Linux you could probably add a security module hook to enforce rules so long as you know there aren't any there initially.

deckiedan · on Oct 31, 2013

Almost exclusively an OSX house, but with one Linux box for the LTO archive (tape is still cheaper than 'the cloud' when your internet isn't that fast and a single event/project can be over a terabyte of footage...).

We may switch over to a linux or BSD based server at some point in the future. Others in our organisation have deployed synology machines with great success.

mpyne · on Oct 31, 2013

Interesting proposal. It's interesting that such filenames cause a lot of problems for GUI toolkits as well.

For example, Qt since Qt 4 removed the hacks they had in QString to allow malformed Unicode data in its QString constructor. What this means is that the old trick of just reading a filename from the OS and making a QString out of it is impossible in general since there are filenames which are not valid ASCII, Latin-1, or UTF-8.

Qt does provide a way to convert from the "local 8-bit" filename-encoding to and from QString, but this depends on there being one, and only one, defined filename-encoding (unless the application wishes to roll its own conversion). This has effectively caused KDE to mandate that users use UTF-8 for filenames if they want them to show up in the file manager, be able to be passed around on DBus interfaces, etc.

Frankly I can't wait until we can safely rely on filenames being UTF-8 and UTF-8 only. Better still if that can be enforced by the kernel somehow.

KMag · on Nov 2, 2013

Back at university, there was a big rubber stamp reading "ABSTRACTION VIOLATION" that the SICP teaching assistants would use to mark problem set solutions as incorrect if they violated interface abstractions.

Using a QString to represent a file path shoves display constraints down into the lowest level Qt file handling routines. It should have had a FilePath class used for representing file paths nearly everywhere, with a toString or to_string member function for display purposes.
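
A rough sketch of that separation, in Python rather than C++ and not based on any actual Qt class. `FilePath` and `to_display` are hypothetical names; the idea is simply that the exact bytes are what gets stored and passed to the OS, and decoding happens only when something has to be shown.

```python
import os

class FilePath:
    """Hypothetical wrapper: stores the exact bytes, decodes only for display."""

    def __init__(self, raw: bytes):
        self.raw = raw

    def to_display(self) -> str:
        # Undecodable bytes become visible \xNN escapes instead of raising.
        return self.raw.decode("utf-8", errors="backslashreplace")

    def __fspath__(self) -> bytes:
        # Lets open(), os.stat() and friends use the original bytes unchanged.
        return self.raw

    def __eq__(self, other):
        return isinstance(other, FilePath) and self.raw == other.raw

    def __hash__(self):
        return hash(self.raw)

p = FilePath(b"caf\xe9.txt")          # Latin-1 bytes, invalid as UTF-8
print(p.to_display())                 # lossy, for humans only
print(os.fspath(p) == b"caf\xe9.txt") # lossless, for the OS
```

The display string is deliberately one-way: nothing should ever convert it back into a path.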

yason · on Oct 31, 2013

The Unix filenames provide a mechanism. That doesn't need to be changed. The policy can be different, based on the userbase and the tools being used, but limiting the mechanism to disallow bad policies is wrong.

While you can create a page-long listing of filename troubles, in practice these shell users only end up having to account for filenames with spaces in them. Occasionally you have to rm -- an accidentally created file. Nobody uses those crazy filenames with leading and trailing space, escape codes, and newline characters because it means trouble with the shell. So the problem kind of takes care of itself.

But there might be some future user interface that doesn't have the escaping and expansion problems of current shells and with that it might be useful to be able to put nearly any character into filenames. The Linux/Unix kernel isn't in the position to dictate how the filenames shall be used. It only needs the path separator and NUL terminator and not caring about anything else means less kernel code that can break.

einhverfr · on Oct 31, 2013

> Nobody uses those crazy filenames with leading and trailing space, escape codes, and newline characters because it means trouble with the shell. So the problem kind of takes care of itself.

What about people who want to create trouble?

yason · on Oct 31, 2013

What did they say about making a system that is idiot proof...

einhverfr · on Nov 1, 2013

I am not talking about idiots. I am talking about malicious users. The issue is a security issue. What happens if you have a Windows server attached to a local printer and I upload a file called LPT1.html which contains malicious PostScript instructions?

You can't have security without a set of common assumptions regarding allowable input. This occurs on any shared computer system.

kalleboo · on Oct 31, 2013

It sounds like most of these problems are "CLIs suck, since it's hard to parse commands and filenames". GUIs don't have these problems. And Apple solved the pathname problem in the Classic MacOS by not having paths - references to files were to be stored as "alias" data structures, which contained the inode ID (so despite moving files around, a reference would still point to the original file).

dwheeler · on Oct 31, 2013

GUIs have these problems in spades.

For example, all GUIs must display filenames (say in a file-picker), but how do you display the filenames if you don't know what the encoding is, or if the filename disobeys the encoding? What should you show a user when "control-A" is one of the "characters"? How can you give line-at-a-time display when newline is in a filename? How do you process a filename, since it may not be a legal string in the current encoding (a problem both Qt and Python have hit)?
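
To make the display question concrete, here is one possible answer sketched in Python. The function name display_name and the choice of escape format are my own assumptions, not what Qt or any particular file picker does: bytes that are not valid UTF-8 become U+FFFD, and control characters become visible \xNN escapes so that a name always fits on one line.

```python
import unicodedata


def display_name(raw: bytes) -> str:
    """Return a one-line, terminal-safe rendering of a raw filename (sketch)."""
    # Assume UTF-8; bytes that do not decode are shown as U+FFFD.
    text = raw.decode("utf-8", errors="replace")
    out = []
    for ch in text:
        if unicodedata.category(ch) == "Cc":
            # Control characters (newline, control-A, ESC, ...) are made visible.
            out.append("\\x{:02x}".format(ord(ch)))
        else:
            out.append(ch)
    return "".join(out)


print(display_name(b"a\nb\x01"))
print(display_name(b"\xff"))
```

The result is only good for showing to the user. It cannot be turned back into the original bytes, so the program still has to keep the raw name around to operate on the file.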

The Qt solution is to mandate that filenames must be in UTF-8. Period. So the nonsense that "filenames can be almost any sequence of bytes" is just that, nonsense. It's silly anyway; why allow ALMOST anything, but not anything?

This also assumes that GUI programs never invoke other programs, which is also absurd. Every time a GUI does system(), exec(), and so on, it risks these filename problems (e.g., if it tries to pass a filename with a leading "-").
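
A sketch of one common defence when a program hands filenames to another program, written in Python. The helper names protect and build_argv are made up for illustration. It assumes the names are relative paths: a name starting with "-" gets "./" prepended, which refers to the same file but cannot be parsed as an option.

```python
def protect(name: str) -> str:
    """Prefix './' to a relative name that starts with '-' (sketch)."""
    return "./" + name if name.startswith("-") else name


def build_argv(program: str, filenames) -> list:
    # An argv list (as taken by subprocess.run) bypasses the shell entirely,
    # so spaces in names are not split into separate arguments either.
    return [program] + [protect(n) for n in filenames]


print(build_argv("cat", ["-n", "a b", "notes.txt"]))
```

Unlike "--", the "./" prefix does not depend on the invoked command supporting any particular option-parsing convention.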

The problems are WORSE in shell, but they still occur in other languages.

kalleboo · on Oct 31, 2013

The difference is that in a GUI, even if you fall back to displaying a foreign or poorly-encoded filename as a bunch of Unicode entity squares, the user can still preview, open, edit, move, rename or delete the file. On the command line, you can't do anything with it if you can't type the name.

GauntletWizard · on Oct 31, 2013

Has nobody heard of "--"? This is the standard Unix way of saying "Please, do not process flags beyond this point". cat -- * does precisely what I expect it to do in most contexts.

ramidarigaz · on Oct 31, 2013

He mentions it further down in the article, but not all commands support '--'. Most notably: echo is not required to support that (which I believe means that while some implementations do support '--', a fully portable script can't assume that it does).

cnvogel · on Oct 31, 2013

One frequently sees installation scripts for commercial software (but also the one for oh-my-zsh...) relying on the GNU extensions to echo, often the support for escape characters. When such a script is run on an OS where sh is not symlinked to bash (but to a simpler shell), the escape sequences are printed literally:

     \033[0;1m ** CONGRATULATIONS! **\033[0m
     Your program is now installed.

cetu86 · on Oct 31, 2013

Very interesting discussion. Here are my 2 cents: I agree that the way the shell handles filenames and confuses them with commands is inherently broken.

I would like to focus on what filenames are supposed to be. In my understanding filenames are supposed to be like book titles or labels on goods in the supermarket. So they are not supposed to contain random binary data. That is what a file's contents are for. Filenames should however be able to contain any printable character that you expect on a label. And no fake characters like beep or linefeed. But they should be fully international. I mean this is the 21st century! :-) In order to define printable characters you also need to define a character encoding. Unicode clearly defines 2 sets of control characters, C0 and C1. I would exclude these two sets, but allow any other Unicode character. I know there is an argument about which Unicode encoding is better (utf8, utf16, the way Apple encodes Unicode vs the way everyone else does, ...). Maybe one could define the filesystem's encoding inside it and even give the kernel a translation layer between the on-disk encoding and the one visible to the user.
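
A rough sketch of that rule in Python, operating on an already decoded name. The function name is hypothetical, and counting DEL (U+007F) as a control character alongside C0 and C1 is my own assumption.

```python
def has_control_chars(name: str) -> bool:
    """True if name contains a C0 control, DEL, or a C1 control (sketch)."""
    for ch in name:
        cp = ord(ch)
        # C0: U+0000..U+001F, DEL: U+007F, C1: U+0080..U+009F
        if cp <= 0x1F or 0x7F <= cp <= 0x9F:
            return True
    return False


print(has_control_chars("naïve 文件.txt"))   # international text is fine
print(has_control_chars("beep\x07.txt"))    # BEL is a C0 control
```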

einhverfr · on Oct 31, 2013

Why not make rule systems modular?

cetu86 · on Oct 31, 2013

Of course. So everyone can decide whether to use it or not. Or even switch this on at some point in the boot sequence. Currently the Linux kernel doesn't have an interface for this. But I think it is important to do this within the kernel so no malicious program can bypass it.

dwheeler · on Oct 31, 2013

I agree, I think having a CONFIGURABLE option in the kernel where admins can decide "what is allowed" would be a big step forward. (1) Enable requiring UTF-8 encoding, and (2) list what bytes are allowed/forbidden at the beginning, the middle, and the end. Then you could have a local policy like "UTF-8 only", "no control chars", "no dash at beginning", and "no space at the end".
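
To illustrate what such a local policy could look like, here is a userspace sketch in Python. Nothing like this exists in the kernel today; NamePolicy, its field names, and the default values are all invented for the example. For brevity a single forbidden set covers every position in the name, with separate sets for the first and last byte.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NamePolicy:
    """Hypothetical admin-configurable filename policy."""
    require_utf8: bool = True
    forbidden_anywhere: bytes = bytes(range(0x20)) + b"\x7f"  # control chars
    forbidden_first: bytes = b"-"                              # no leading dash
    forbidden_last: bytes = b" "                               # no trailing space


def violations(name: bytes, policy: NamePolicy = NamePolicy()) -> list:
    """Return a list of human-readable reasons why name breaks the policy."""
    problems = []
    if policy.require_utf8:
        try:
            name.decode("utf-8")
        except UnicodeDecodeError:
            problems.append("not valid UTF-8")
    if any(b in policy.forbidden_anywhere for b in name):
        problems.append("forbidden byte")
    if name and name[0] in policy.forbidden_first:
        problems.append("forbidden first byte")
    if name and name[-1] in policy.forbidden_last:
        problems.append("forbidden last byte")
    return problems


for candidate in (b"report.txt", b"-rf", b"a\nb ", b"\xff"):
    print(candidate, violations(candidate))
```

A site that wants to keep the old behaviour would configure an empty policy, e.g. NamePolicy(require_utf8=False, forbidden_anywhere=b"", forbidden_first=b"", forbidden_last=b"").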

einhverfr · on Nov 1, 2013

This also means that programmers can document what they require, and eventual standards can emerge, which would be a good thing.

codezero · on Oct 31, 2013

Sounds like one of the big problems here is glob, and another is how args are parsed before being sent to a command.

njharman · on Oct 31, 2013

> This article will try to convince you...

Failed.