
Fixing Unix Filenames (2012) - laurent123456
http://www.dwheeler.com/essays/fixing-unix-linux-filenames.html
======
ketralnis
Half of these problems are problems with the shell and have nothing to do with
the filesystem. The shell is _really, really, really bad_ about doing things
like expanding spaces in filenames or environment variables into separate
arguments. Notice that every example here about how it's "wrong" is a small
shell script that has unexpected behaviour. That's mostly because the shell is
wrong!
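
To make the word-splitting point concrete, here is a minimal Python sketch (not from the article; the helper names are invented for illustration, and it assumes a POSIX `sh` is available). It counts how many arguments a child process receives when a filename is passed as one element of an argument list, versus pasted unquoted into a shell command string.

```python
import shlex
import subprocess
import sys

# Tiny child program: print how many arguments it was given.
COUNT_ARGS = "import sys; print(len(sys.argv) - 1)"


def count_args_via_list(filename):
    """Pass the filename as one argv element; no shell is involved."""
    result = subprocess.run(
        [sys.executable, "-c", COUNT_ARGS, filename],
        capture_output=True, text=True, check=True,
    )
    return int(result.stdout)


def count_args_via_shell_string(filename):
    """Deliberately naive: paste the filename unquoted into a shell command."""
    command = (
        f"{shlex.quote(sys.executable)} -c {shlex.quote(COUNT_ARGS)} {filename}"
    )
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, check=True,
    )
    return int(result.stdout)
```

For a name like `my holiday photo.jpg`, the list form delivers one argument and the shell-string form delivers three. The filesystem stored the same name in both cases; only the shell's parsing differs.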

> Oh, and don’t display filenames. Filenames could contain control characters
> that control the terminal (and X-windows), causing nasty side-effects on
> display

That's the fault of the terminal, mostly. If it's an otherwise "reasonable"
name containing, say, encoded unicode, that's everyone's fault from the
filesystem to the terminal to (sometimes) the shell. But the fault is spread
out over more than just the fact that the filesystem lets you name files
however you like.

But now, Linus' quote there:

> "...filesystem people should aim to make "badly written" code "just work"
> unless people are really really unlucky. Because like it or not, that’s what
> 99% of all code is... Crying that it’s an application bug is like crying
> over the speed of light: you should deal with _reality_ , not what you wish
> reality was." — Linus Torvalds, on a slightly different topic (but I like
> the sentiment)
> ([http://lwn.net/Articles/326505/](http://lwn.net/Articles/326505/))

speaks to both "sides" here: there are applications in the real world that
create filenames with special characters in them. The simplest, most obvious of
these is the space, but yes, in the real world we have applications that do
this. Neither alternate reality ("restrict filesystems' names" or "keep
doing what we're doing but have bug-free applications") exists, so which
"reality" are you proposing that we deal with here?

~~~
overgard
I'm sure most reasonable people would agree that (simple) unicode characters
are fine, but what possible use is there for having control characters in a
filename, like newline or carriage return? (Or if you're really evil, beep).
If you remove "special" characters from the filesystem, you also remove a lot
of complexity from other parts of the system for dealing with pathological
corner cases. It's a win for everyone.

~~~
gilgoomesh
Ignoring the fact that non-terminal users _regularly_ want spaces, hyphens and
newlines in their filenames since they have visual purpose...

The argument in favor of newlines is the same argument in favor of spaces in
filenames or filenames starting with hyphen: the filesystem should not be
restricted by bash and other shell scripts' continual confusion between data
and instructions. We should fix or replace lazy programs, not try to fix lazy
programming by neutering the system.

I think the correct action should be to design shell scripting languages that
are robust about data versus instructions. The lack of robustness in shell
scripting languages has been a burden since their inception. Restricting the
filesystem won't stop problems with shell scripts making the same mistakes on
other kinds of data.

~~~
overgard
So one solution is to replace all the popular shells from the ground up,
throwing away about 30 years of history because they're "not elegant".
(Ignoring the fact that almost every attempt to do this in the past has been a
resounding failure).

The other solution is to agree not to put overly weird shit in filenames.

Which seems more likely to happen in our lifetime? Which would you rather code
against?

I'm baffled that people are so hell bent on what "should" be allowed that
they're ignoring the fact that allowing any possible character introduces
exceptional complexity while offering extremely marginal benefits. I mean,
imagine trying to make a gui for a filesystem browser that allows for
characters that flow both up and down along with left and right. And newlines
all over the place. It'd be a mess. I mean.. why? Is filename expressivity
really such a major deal?

~~~
toast0
> The other solution is to agree not put overly weird shit in filenames.

If you don't put weird shit in your filenames, I won't put weird shit in mine.
Let's not bother the kernel with our contract. :)

~~~
Flimm
That doesn't solve the security vulnerabilities caused by weird filenames.

------
deckiedan
My life would be a lot easier if there were some more restrictions on
filenames, it's true. Just cutting out newlines would save me a lot of grief.

At my work I'm doing a lot of sysop/devops/admin stuff for a media production
team. The team-wide project file structure contains a whole bunch of folders
starting with '-'. It's not a problem, you just have to write all scripts
being aware of it.

As we're making videos about different places around the world, it's not
uncommon for editors to copy and paste things (including Unicode text in other
languages, etc) into filenames from OSX Finder.

I got caught out recently when someone had pasted in some text with newlines.
Boy can that ever wreak havoc.

The thing which got me was that BSD md5 on OSX and GNU md5sum on CentOS handle
things differently (using \ before a filename with control characters in it on
one of them). So creating an md5 list on one computer and comparing it against
a list made on another computer was failing, even though the files were
exactly the same.
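
One way to sidestep the formatting difference is to skip the tools' text output altogether. A rough sketch, assuming Python 3 is available on both machines (the function names `md5_manifest` and `dump_manifest` are invented here): compute the digests with `hashlib` and store them as JSON, which escapes newlines and other control characters in names the same way everywhere.

```python
import hashlib
import json
import os


def md5_manifest(root):
    """Map each file's path (relative to root) to its MD5 hex digest."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            digest = hashlib.md5()
            with open(full, "rb") as handle:
                # Read in 1 MiB chunks so large media files fit in memory.
                for chunk in iter(lambda: handle.read(1 << 20), b""):
                    digest.update(chunk)
            manifest[os.path.relpath(full, root)] = digest.hexdigest()
    return manifest


def dump_manifest(manifest):
    """Serialise deterministically; JSON escapes control characters in names."""
    return json.dumps(manifest, sort_keys=True, ensure_ascii=True)
```

Two manifests can then be compared as parsed dictionaries rather than as text lines. This does not address other cross-platform differences, such as Unicode normalisation of names, which would need separate handling.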

Also annoying is trying to write scripts which handle " and ' in names.

The typical BASH tutorial response is, 'use "$VAR"! That always works!'

Unless you are sending a command to run on a remote server, or putting
together a rsync command that runs over SSH. Then, it breaks. So:

    
    
        # Not named PATH: overwriting PATH would stop the shell finding rsync.
        PROJECT="Fred's New: \"Test\" Project!"
        rsync -ave "ssh -i ~/.ssh/key" --delete "$PROJECT/" "user@remote:$PROJECT"
    

will have problems. (If I remember correctly). I think this one was fixed with
something odd like using ' quotes, and then replacing all ' in the filenames
with '\'' (which ends the current string, puts a single quote on its own as a
string, and then starts the string again, but treats them all as one long
string).
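
A minimal Python sketch of that quoting trick, for illustration only (the helper name `single_quote` is made up). The standard library's `shlex.quote` does the same job with a slightly different escape for the embedded quote, and `shlex.split` is used here as a stand-in for how a POSIX shell would parse the result back.

```python
import shlex


def single_quote(text):
    # Wrap in single quotes; each embedded ' becomes: close quote,
    # backslash-escaped quote, reopen quote.
    return "'" + text.replace("'", "'\\''") + "'"


name = "Fred's New: \"Test\" Project!"

# Both forms parse back to exactly one word equal to the original name.
hand_rolled = shlex.split(single_quote(name))
standard = shlex.split(shlex.quote(name))
```

Both `hand_rolled` and `standard` end up as a one-element list containing the original name, which is the property you want before handing a string to a remote shell.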

99% of my problems went away when I switched most scripts over to python.

(And, yes, started adding more and more weirdo filenames into my /shell/
/script/ unit tests...)

~~~
justincormack
Under Linux you could probably add a security module hook to enforce rules so
long as you know there aren't any there initially.

~~~
deckiedan
Almost exclusively an OSX house, but with one Linux box for the LTO archive
(tape is still cheaper than 'the cloud' when your internet isn't that fast and
a single event/project can be over a terabyte of footage...).

We may switch over to a linux or BSD based server at some point in the future.
Others in our organisation have deployed synology machines with great success.

------
mpyne
Interesting proposal. It's worth noting that such filenames cause a lot of
problems for GUI toolkits as well.

For example, Qt since Qt 4 removed the hacks they had in QString to allow
malformed Unicode data in its QString constructor. What this means is that the
old trick of just reading a filename from the OS and making a QString out of
it is impossible in general since there are filenames which are not valid
ASCII, Latin-1, or UTF-8.

Qt does provide a way to convert from the "local 8-bit" filename-encoding to
and from QString, but this depends on there being one, and only one, defined
filename-encoding (unless the application wishes to roll its own conversion).
This has effectively caused KDE to mandate users use UTF-8 for filenames if
they want them to show up in the file manager, be able to be passed around on
DBus interfaces, etc.

Frankly I can't wait until we can safely rely on filenames being UTF-8 and
UTF-8 only. Better still if that can be enforced by the kernel somehow.

~~~
KMag
Back at university, there was a big rubber stamp reading "ABSTRACTION
VIOLATION" that the SICP teaching assistants would use to mark problem set
solutions as incorrect if they violated interface abstractions.

Using a QString to represent a file path shoves display constraints down into
the lowest level Qt file handling routines. It should have had a FilePath
class used for representing file paths nearly everywhere, with a toString or
to_string member function for display purposes.
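
As a rough illustration of that separation (in Python rather than Qt's C++, and with a hypothetical `FilePath` class that is not part of any library): the object keeps the raw bytes the OS returned and only produces text, lossily, when asked for something to display.

```python
class FilePath:
    """Hypothetical path type: stores raw bytes, decodes only for display."""

    def __init__(self, raw):
        if not isinstance(raw, bytes):
            raise TypeError("FilePath wraps the raw bytes returned by the OS")
        self.raw = raw

    def to_string(self):
        """Lossy, display-only rendering; never feed this back to the OS."""
        return self.raw.decode("utf-8", errors="replace")

    def __eq__(self, other):
        # Identity of a path is its bytes, not how it happens to render.
        return isinstance(other, FilePath) and self.raw == other.raw

    def __hash__(self):
        return hash(self.raw)
```

File operations would take `path.raw`, while labels and list views would take `path.to_string()`, so a name that is not valid UTF-8 can still be opened even though it displays with replacement characters.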

------
yason
The Unix filenames provide a mechanism. That doesn't need to be changed. The
policy can be different, based on the userbase and the tools being used, but
limiting the mechanism to disallow bad policies is wrong.

While you can create a page-long listing of filename troubles, in practice
_these shell users_ only end up having to account for filenames with spaces in
them. Occasionally you have to _rm --_ an accidentally created file. Nobody
uses those crazy filenames with leading and trailing space, escape codes, and
newline characters because it means trouble with the shell. So the problem
kind of takes care of itself.

But there might be some future user interface that doesn't have the escaping
and expansion problems of current shells and with that it might be useful to
be able to put nearly any character into filenames. The Linux/Unix kernel
isn't in the position to dictate how the filenames shall be used. It only
needs the path separator and NUL terminator and not caring about anything else
means less kernel code that can break.

~~~
einhverfr
> Nobody uses those crazy filenames with leading and trailing space, escape
> codes, and newline characters because it means trouble with the shell. So
> the problem kind of takes care of itself.

What about people who _want_ to create trouble?

~~~
yason
What did they say about making a system that is idiot proof...

~~~
einhverfr
I am not talking about idiots. I am talking about malicious users. The issue
is a security issue. What happens if you have a Windows server attached to a
local printer and I upload a file called LPT1.html which contains malicious
PostScript instructions?

You can't have security without a set of common assumptions regarding
allowable input. This occurs on any shared computer system.
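
A sketch of what checking such an assumption might look like for upload names, in Python. The set of names is my understanding of the device names Windows documents as reserved (treat the exact list as an assumption), and `is_reserved_on_windows` is a made-up helper.

```python
# DOS device names that Windows treats specially, with or without an extension.
RESERVED_DEVICE_NAMES = (
    {"CON", "PRN", "AUX", "NUL"}
    | {f"COM{n}" for n in range(1, 10)}
    | {f"LPT{n}" for n in range(1, 10)}
)


def is_reserved_on_windows(filename):
    """True if the name, ignoring case and extension, is a DOS device name."""
    stem = filename.split(".", 1)[0].rstrip(" ")
    return stem.upper() in RESERVED_DEVICE_NAMES
```

An upload handler could reject `LPT1.html` with this check before the name ever reaches the filesystem.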

------
kalleboo
It sounds like most of these problems are "CLIs suck, since it's hard to parse
commands and filenames". GUIs don't have these problems. And Apple solved the
pathname problem in the Classic MacOS by not having paths - references to
files were to be stored as "alias" data structures, which contained the inode
ID (so despite moving files around, a reference would still point to the
original file).
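
A loose analogy on a POSIX filesystem, sketched in Python (this is not how Classic MacOS aliases were implemented, just an illustration that an inode number outlives a rename; the function name is invented):

```python
import os
import tempfile


def inode_survives_rename():
    """Rename a file and report whether its inode number is unchanged."""
    with tempfile.TemporaryDirectory() as folder:
        old_path = os.path.join(folder, "draft.txt")
        new_path = os.path.join(folder, "final version.txt")
        with open(old_path, "w") as handle:
            handle.write("same file, new name\n")
        inode_before = os.stat(old_path).st_ino
        os.rename(old_path, new_path)
        inode_after = os.stat(new_path).st_ino
        return inode_before == inode_after
```

POSIX offers no general way to open a file by inode number, though, so this shows only the identity part of the idea, not a replacement for paths.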

~~~
dwheeler
GUIs have these problems in spades.

For example, all GUIs must display filenames (say in a file-picker), but how
do you display the filenames if you don't know what the encoding is, or if the
filename disobeys the encoding? What should you show a user when "control-A"
is one of the "characters"? How can you give line-at-a-time display when
newline is in a filename? How do you process a filename, since it may not be a
legal string in the current encoding (a problem both Qt and Python have hit)?
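
For the Python side of this, a minimal sketch of the problem and of the `surrogateescape` error handler that Python 3 uses to work around it (the function names are made up; the byte string is just an example of a Latin-1 name on a system that expects UTF-8):

```python
RAW_NAME = b"caf\xe9.txt"  # Latin-1 bytes, not valid UTF-8


def decode_strict(raw):
    """Raises UnicodeDecodeError on bytes that are not valid UTF-8."""
    return raw.decode("utf-8")


def decode_lossless(raw):
    """Carry undecodable bytes through as lone surrogate code points."""
    return raw.decode("utf-8", errors="surrogateescape")


def encode_lossless(text):
    """Inverse of decode_lossless: recover the original bytes exactly."""
    return text.encode("utf-8", errors="surrogateescape")
```

The lossless form round-trips, so the file can still be opened, but the resulting string is not well-formed Unicode: encoding it strictly (for example to print it or send it over an interface that insists on UTF-8) raises an error. The display problem is moved, not solved.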

The Qt solution is to mandate that filenames must be in UTF-8. Period. So the
nonsense that "filenames can be almost any sequence of bytes" is just that,
nonsense. It's silly anyway; why allow ALMOST anything, but not anything?

This also assumes that GUI programs never invoke other programs, which is also
absurd. Every time a GUI does system(), exec(), and so on, it risks these
filename problems (e.g., if it tries to pass a filename with a leading "-").

The problems are WORSE in shell, but they still occur in other languages.

~~~
kalleboo
The difference is that in a GUI, even if you fall back to displaying a foreign
or poorly-encoded filename as a bunch of Unicode entity squares, the user can
still preview, open, edit, move, rename or delete the file. On the command
line, you can't do anything with it if you can't type the name.

------
GauntletWizard
Has nobody heard of "--"? This is the standard unix way of saying "Please,
do not process flags beyond this point". cat -- * does precisely what I expect
it to do in most contexts.

~~~
ramidarigaz
He mentions it further down in the article, but not all commands support
'--'. Most notably: echo is not required to support that (which I believe
means that while some implementations do support '--', a fully portable
script can't assume that it does).
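
Where '--' can't be relied on, a commonly used workaround is to prefix relative names with "./" so they can never start with a dash. A small Python sketch for a program that is about to pass a name to another command (`safe_path_argument` is a made-up helper):

```python
def safe_path_argument(name):
    """Prefix names starting with '-' so they cannot be mistaken for options."""
    if name.startswith("-"):
        return "./" + name  # "-rf" becomes "./-rf", still the same file
    return name
```

This only makes sense for names relative to the current directory, which is the case where a leading dash can occur in the first place; absolute paths already start with "/".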

~~~
cnvogel
One frequently sees installation scripts, for commercial software but also the
one for oh-my-zsh, that rely on the GNU extensions to echo, often the support
for escape characters. When they are run on an OS where sh is not symlinked to
bash but to a simpler shell, you get output like:

    
    
         \033[0;1m ** CONGRATULATIONS! **\033[0m
         Your program is now installed.

------
cetu86
Very interesting discussion. Here are my 2 cents: I agree that how the shell
handles filenames and confuses them with commands is inherently broken.

I would like to focus on what filenames are supposed to be. In my
understanding filenames are supposed to be like book titles or labels on goods
like in the supermarket. So they are not supposed to contain random binary
data. That is what a file's contents are for. Filenames should however contain
any printable character that you expect on a label. And no fake characters
like beep or linefeed. But they should be fully international. I mean this is
the 21st century! :-) In order to define printable characters you also need to
define a character encoding. Unicode clearly defines 2 sets of control
characters, C0 and C1. I would exclude these two sets, but allow any other
Unicode character. I know there is an argument about which Unicode encoding is
better (UTF-8, UTF-16, the way Apple encodes Unicode vs the way everyone else
does, ...) Maybe one could define the filesystem's encoding inside it and even
give the kernel a translation layer between the on-disk encoding and the one
visible to the user.
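
A quick sketch of that rule in Python, operating on already-decoded names. It leans on the Unicode general category "Cc", which covers the C0 and C1 control ranges plus DEL; the function names are invented for illustration.

```python
import unicodedata


def control_characters(name):
    """Return the control characters (Unicode category Cc) found in a name."""
    return [ch for ch in name if unicodedata.category(ch) == "Cc"]


def is_label_like(name):
    """Non-empty and free of control characters; everything else is allowed."""
    return bool(name) and not control_characters(name)
```

Under this rule international names pass untouched, while newline, beep (U+0007) and C1 characters such as U+0085 are rejected.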

~~~
einhverfr
why not make rulesystems modular?

~~~
cetu86
Of course. So everyone can decide whether to use it or not. Or even switch this
on at some point in the boot sequence. Currently the Linux kernel doesn't have
an interface for this. But I think it is important to do this within the
kernel so no malicious program can bypass it.

~~~
dwheeler
I agree, I think having a CONFIGURABLE option in the kernel where admins can
decide "what is allowed" would be a big step forward. (1) Enable requiring
UTF-8 encoding, and (2) list what bytes are allowed/forbidden at the
beginning, the middle, and the end. Then you could have a local policy like
"UTF-8 only", "no control chars", "no dash at beginning", and "no space at the
end".
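
To show how small such a policy could be, here is a user-space sketch in Python. It is purely hypothetical: no such kernel interface exists as far as this thread establishes, and the `FilenamePolicy` class and its field names are invented. It checks raw name bytes against the kinds of rules listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FilenamePolicy:
    """Hypothetical admin-configurable rules, checked on raw name bytes."""
    require_utf8: bool = True
    forbid_control_bytes: bool = True
    forbidden_first: bytes = b"-"   # bytes not allowed at the beginning
    forbidden_last: bytes = b" "    # bytes not allowed at the end

    def violations(self, name):
        """Return a list of human-readable rule violations for one name."""
        problems = []
        if self.require_utf8:
            try:
                name.decode("utf-8")
            except UnicodeDecodeError:
                problems.append("not valid UTF-8")
        if self.forbid_control_bytes and any(
            b < 0x20 or b == 0x7F for b in name
        ):
            problems.append("contains a control byte")
        if name[:1] and name[:1] in self.forbidden_first:
            problems.append("forbidden first byte")
        if name[-1:] and name[-1:] in self.forbidden_last:
            problems.append("forbidden last byte")
        return problems
```

A site wanting a looser policy could construct, say, `FilenamePolicy(forbidden_first=b"")` to allow leading dashes while keeping the other rules.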

~~~
einhverfr
This also means that programmers can document what they require, and eventual
standards can emerge, which would be a good thing.

------
codezero
Sounds like one of the big problems here is glob, and another is how args are
parsed before being sent to a command.

------
njharman
> This article will try to convince you...

Failed.

