
The hell that is filename encoding (2016) - kristjansson
http://beets.io/blog/paths.html
======
jabl
Funny(???) war story:

1\. Back in the day, we were using a Linux NFS server with NFSv3, and the out-
of-the-box locale was iso-8859-1 (latin1). Life was good, except for
occasional problems with people with strange non-latin1 names, or documents
with non-latin1 names, etc.

2\. At some point, we switched to using UTF-8 by default, telling users to
use convmv to rename their files when they were ready to switch to the new
defaults. Most people ignored this, of course, but files with now-invalid
UTF-8 names were mostly fine, just with the occasional "?" in the names.

3\. Switch to NFSv4. Invisible to end users. NFSv4 per se requires that paths
are UTF-8 encoded, but in practice the Linux NFS server and client just pass
along a bag of bytes, so invalid UTF-8 worked just as well as it did
previously.

4\. Switch from a Linux NFS server to a netapp.

5\. A user complains that files are missing. An initial comparison with the
old Linux NFS server, which was still online, shows no problems. The problem
occurs only on the user's workstation, not on the admin box, which has both
the old Linux NFS and netapp directory trees mounted. Investigation on the
user's workstation shows that in some cases lots of files appear to be
missing, including ones with plain ASCII names.

\- Turns out that the admin box had the netapp mounted with NFSv3, and thus
everything appeared Ok there, including the rsync from Linux NFS -> netapp in
the first place.

\- However, when mounted using NFSv4, the netapp follows the spec and does
not like non-UTF-8 paths. Does it report an error then? Hell no, the NFS
READDIR (READDIRPLUS?) reply just stops returning directory entries when it
hits the first one with invalid UTF-8. And thus you get a partial directory
listing. GAAAH!

\- So the solution was to run convmv centrally (from the admin box, which had
the netapp mounted with NFSv3) over the entire directory tree that had been
moved.
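
For the curious, a rough sketch in Python of what that central conversion
does (roughly `convmv -f iso-8859-1 -t utf-8 -r`). The mount point is
hypothetical, and it assumes any name that already decodes as UTF-8 should be
left alone:

    
    
      import os
      
      def convert_tree(root: bytes):
          # walk bottom-up so directories are renamed after their contents
          for dirpath, dirnames, filenames in os.walk(root, topdown=False):
              for name in dirnames + filenames:
                  try:
                      name.decode('utf-8')   # already valid UTF-8: leave it
                  except UnicodeDecodeError:
                      fixed = name.decode('iso-8859-1').encode('utf-8')
                      os.rename(os.path.join(dirpath, name),
                                os.path.join(dirpath, fixed))
      
      convert_tree(b'/srv/data')             # hypothetical mount point
    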

~~~
wazoox
Ah yes, had the same fun problem at a customer's facility last week. Moving
350 TB of data from an old DDP storage server to a Linux one. Mounting with
CIFS (no other option available), and copying using "cp -a".

The file names look OK after the copy on the Linux machine. However, when
exporting the directory through Samba, the Mac's Finder doesn't display files
with accents in the names (though they appear correctly with "ls", weird...).

So the user copies the files again, using the Finder. Now I have files with
exactly the same name (uhhhhh???):

    
    
      # ls -l M*mo-1.*
      -rw-rw-rw- 1 root root 8417218  6 sept. 2013 Mémo-1.aif
      -rwxr--r-- 1 test test 8417218  6 sept. 2013 Mémo-1.aif
      -rw-rw-rw- 1 root root  363175  6 sept. 2013 Mémo-1.m4a
      -rwxr--r-- 1 test test  363175  6 sept. 2013 Mémo-1.m4a
    

Yes, it looks like two files have exactly the same name, but actually they're
different: one has the "é" encoded as an "e" followed by the combining accent
0xCC 0x81 (decomposed), and the other one (the "good one") as 0xC3 0xA9
(precomposed). Why is that? Why does one work with the Finder and the other
doesn't? Who knows.

~~~
throwaway7767
Most likely it's different normalization. I've seen this before with Mac
systems.

Renaming the files to use NFKC normalization fixed it. In Python, you could
loop through the files and do something like:

    
    
      import os, unicodedata
      
      # decode the raw bytes, renormalize to NFKC, re-encode, then rename
      os.rename(originalfilename,
                unicodedata.normalize('NFKC',
                    originalfilename.decode('utf8')).encode('utf8'))
    

EDIT: You'll probably need to do this on a non-Mac system; Linux, for
example, should work.

------
quietbritishjim
> on Windows, _paths are fundamentally text_

They were, back when there were fewer than 2^16 characters in the Unicode
standard. Back then, each two-byte word in a filename corresponded exactly to
a Unicode code point.

Now that there are more than 2^16 (but well under 2^32), Windows uses UTF-16
in filenames. That is, code points at or above 2^16 are represented by a pair
of special 16-bit code units in the range 0xD800-0xDFFF called _surrogates_;
surrogate pairs need to be collapsed into a single code point when decoding
the file name. Surrogates are exactly what Python uses on Linux to hide bytes
that are not valid UTF-8. Here's the problem: it is possible to have
unmatched surrogates in a file name (or in other places that Windows accepts
UTF-16).

In summary, on Windows you end up with effectively the same situation as on
Linux: file names that are supposed to be in one encoding (UTF-16) but
contain invalid data for that encoding.
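
A minimal sketch of that Python behaviour (run on Linux): bytes that aren't
valid UTF-8 come back as lone surrogates, and they round-trip losslessly back
to the original bytes.

    
    
      raw = b'caf\xe9'                      # latin1 "café", invalid as UTF-8
      name = raw.decode('utf-8', 'surrogateescape')
      print(ascii(name))                    # 'caf\udce9' -- a lone surrogate
      assert name.encode('utf-8', 'surrogateescape') == raw
    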

~~~
chungy
> Here's the problem: it is possible to have unmatched surrogates in a file
> name (or in other places that Windows accepts UTF-16).

NTFS (and Windows as a whole) does not use UTF-16, it uses UCS-2. It is a
subtle difference, but surrogate pairs didn't exist in UCS-2.

~~~
burfog
Old versions use UCS-2. New versions use UTF-16. (correctness not enforced)
This is also how Java and OS X were updated.

The kernel generally uses the 16-bit equivalent of the old Pascal string,
that being a 16-bit count of 16-bit pieces of UTF-16 data. This allows a
16-bit NUL to get into various places, which makes the Win32 API choke.

~~~
masklinn
> Old versions use UCS-2. New versions use UTF-16. (correctness not enforced)
> This is also how Java and OS X were updated.

I usually call that ucs2-plus-surrogates, to make it clear that you may
encounter unpaired surrogates, and thus invalid paths if you assume proper
UTF-16.

~~~
leeter
This is actually kinda painful with NTFS, as NTFS doesn't really care what's
in a path: other than the directory separators, it's all binary. This means
that different applications using different Unicode normalizations will
result in odd things happening. To me the right answer is that NTFS should
normalize all paths the same way internally, but they have yet to implement
that because it would break legacy systems that have un-normalized paths (I
assume).

~~~
freeone3000
Precisely.

Programs that internally use Shift JIS, for instance, would stop functioning
under enforced UTF-16 compatibility or normalization. They're currently
"broken" (as in, operating incorrectly) but in a way that works.

~~~
leeter
Can you explain why this would be the case? In theory this shouldn't be an
issue because any ShiftJIS conversion to Unicode should be reversible.

~~~
TazeTSchnitzel
I think it's not a perfect round-trip due to differences in how the two
standards encode certain characters. You will get a correct conversion either
way, but the result of a round-trip might not be bit-identical.
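
One concrete instance, if I remember the mapping tables right: the same Shift
JIS bytes decode to different code points depending on whose table you use,
so the round trip breaks as soon as two implementations disagree.

    
    
      raw = b'\x81\x60'                  # the Shift JIS "wave dash"
      print(raw.decode('shift_jis'))     # U+301C WAVE DASH (JIS mapping)
      print(raw.decode('cp932'))         # U+FF5E FULLWIDTH TILDE (Microsoft)
      # Each codec round-trips with itself, but not with the other:
      # '\u301c'.encode('cp932') raises UnicodeEncodeError.
    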

------
lifthrasiir
Rust's `std::path` [1] has two representations under the hood, for Windows
(UTF-16 plus lone surrogates) and for non-Windows (bytes), for exactly the
same reason. Paths are neither strings nor text.

[1] [https://doc.rust-lang.org/stable/std/path/](https://doc.rust-lang.org/stable/std/path/)

~~~
cesarb
Emphasis on the "plus lone surrogates" part. Like on Unix, Windows does not
require a path to be valid Unicode.

That is, on Windows, paths are fundamentally sequences of 16-bit words, just
like on Unix paths are fundamentally sequences of 8-bit bytes. On neither
system are paths fundamentally text.
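
This is easy to demonstrate from Python on Linux, where the bytes APIs
happily accept a name that isn't valid UTF-8 (sketch; the `demo` scratch
directory is made up):

    
    
      import os
      
      os.mkdir(b'demo')
      open(b'demo/\xff\xfe', 'w').close()    # name not decodable as UTF-8
      print(os.listdir(b'demo'))             # [b'\xff\xfe'] -- raw bytes
      print(os.listdir('demo'))              # ['\udcff\udcfe'] -- surrogates
    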

~~~
lifthrasiir
The story then goes on further: every NTFS volume contains a special file
named `$UpCase` that holds an uppercase mapping for every possible 16-bit
word, resulting in a 128 KiB table. This approach has an upside for backward
and forward compatibility... unless you eventually need a case mapping for
non-BMP characters, or a complex mapping that expands to multiple characters.
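
Both limitations are easy to see with a full Unicode case-mapping
implementation, e.g. Python's:

    
    
      # Non-BMP case pairs need code-point-aware mapping, not a table
      # keyed on individual 16-bit words:
      print('\U00010428'.upper() == '\U00010400')   # True (Deseret)
      
      # And some mappings expand, which a 1:1 word table cannot express:
      print('ß'.upper())                            # 'SS'
    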

~~~
tialaramex
I should briefly explain why this is here:

NTFS is (usually) case preserving but not case-sensitive. So the OS needs to
be able to tell whether EXAMPLE.TXT and example.txt are the "same" name, which
means it needs case conversion.

Not everybody agrees about how this conversion should work. The most famous
example is Turkish, but there are others. So there's an actual choice to make
here.

If Windows baked this into the core OS, they might get pushback in countries
where their (presumably American) defaults were culturally unacceptable.

If they made it configurable at the OS level, everything would seem fine
until, say, a German tries to access a USB drive with files from a Turk on
it, and some files don't work correctly, or the disk just can't be mounted at
all.

So, they have to bake it into each NTFS filesystem.

~~~
JdeBP
HPFS had a similar system some years before.

* [http://www.edm2.com/index.php/Inside_the_High_Performance_Fi...](http://www.edm2.com/index.php/Inside_the_High_Performance_File_System_-_Part_4#Code_Pages)

------
zaptheimpaler
Filesystems seem so fiddly and broken... is it very difficult to make a small
layer over filesystems that provides sane semantics? Like one that handles
paths sanely, handles fclose()/fsync() properly, lets you control when things
are buffered/flushed, etc.? Even a broken API with clear, modern documentation
enumerating all the fiddly cases would be a huge step forward from digging
through random mailing lists on sites with UX from the 90s.

Has anyone tried this? Is it possible with FUSE? I would love to hear from
people who know about this stuff - what are the obstacles? Or do you think FSs
are fine the way they are?

~~~
wongarsu
What is sane path handling?

On Windows alone, the maximum path length is 260 characters, except when you
use extended-length paths, which have a four-character prefix (`\\?\`) and a
maximum length of 32,767 characters. A sane API for reading files probably
converts your paths to extended-length paths. But if you do the same thing
for writing files, your users start calling you insane again, because most of
Windows (including Windows Explorer) can't open extended-length paths. So you
would be creating files that only select software can even open, and which
the user can't browse without third party software.

(An easier-to-ignore fun fact is that NTFS and Windows also support case-
sensitive names if you set the right flags in the file APIs. But nobody uses
that, so it's probably safe to ignore (until somebody mounts EXT3 partitions
in Windows...))
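
For illustration, a hedged sketch of that conversion; `to_extended_length` is
a hypothetical helper and assumes the input path is already absolute and
normalized:

    
    
      def to_extended_length(path: str) -> str:
          if path.startswith('\\\\?\\'):
              return path                      # already extended-length
          if path.startswith('\\\\'):          # UNC: \\server\share\...
              return '\\\\?\\UNC\\' + path[2:]
          return '\\\\?\\' + path
      
      print(to_extended_length(r'C:\some\long\path'))  # \\?\C:\some\long\path
    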

~~~
majewsky
> So you would be creating files that only select software can even open, and
> which the user can't browse without third party software.

iOS suggests that people are ok with this. /s

------
nialo
I work on another music file management system; my personal special hell is
playlist files. An m3u playlist file is just a newline-separated list of file
paths, which can be relative or absolute, and potentially encoded in whatever
locale is set on the user's computer. Some fun issues:

* Windows and Mac filesystems are generally case-insensitive, so some users
will have the file names in the playlist file in one case and the actual file
names on disk in another

* Sometimes file paths cross between two different filesystems, because one
is mounted in the other with a USB drive or over CIFS or similar. Sometimes
these two filesystems have different case sensitivities

* There's no way to know how the playlist file was encoded

* HFS+ normalizes file paths to Unicode NFD, but there's no guarantee that
the paths in a playlist file will be normalized. Also, sometimes users
generate an m3u file on a Windows system and expect it to just work on a Mac.
Also, the filesystem nesting problem with network or USB mounts can happen
this way too. (See the matching sketch below.)
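
Since several of those issues are case or normalization mismatches, the
sanest approach is to compare normalized keys rather than raw paths. A hedged
sketch; `match_key` is hypothetical and lossy, good for matching playlist
entries against files on disk, not for renaming anything:

    
    
      import unicodedata
      
      def match_key(path: str) -> str:
          # fold Unicode form, case, and separator differences away
          return (unicodedata.normalize('NFC', path)
                  .casefold()
                  .replace('\\', '/'))
      
      # An NFD path written on a Mac matches the NFC spelling from Windows:
      assert match_key('Me\u0301mo-1.m4a') == match_key('M\xe9mo-1.m4a')
    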

~~~
creeble
Sounds a lot like my life a couple of years ago (and intermittently since). I
don't get the bug reports any more because I think customer service has
learned that file name problems can be fixed by _renaming the files_. Not fun
for the user, but a sure fix.

Ya know what kind of file names work virtually everywhere? ASCII ones.

~~~
JdeBP
This is a mis-use of ASCII. After all, the colon, asterisk, forward slash,
question mark, backward slash, and NUL characters are all in ASCII, yet they
are _far_ from things that "work virtually everywhere". And that isn't even
considering the open and close square bracket and semi-colon characters,
which are also nowhere near portable to the extent of "working virtually
everywhere".

The kind of file names that _do_ work "virtually everywhere" are _not ASCII_,
but rather those that only use characters from the POSIX Portable Filename
Character Set, which at 65 characters is just over half the size of ASCII
(which has 128 characters).

* [http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_...](http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_282)
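
A quick way to check a name against that set (sketch; it also applies the
POSIX recommendation that a portable filename not start with a hyphen):

    
    
      import re
      
      # A-Z, a-z, 0-9, '.', '_', '-': 65 characters in total.
      PORTABLE = re.compile(r'^[A-Za-z0-9._][A-Za-z0-9._-]*$')
      
      print(bool(PORTABLE.match('my-file_1.txt')))   # True
      print(bool(PORTABLE.match('Mémo-1.aif')))      # False
    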

------
netheril96
It misses an even more complex, I'd say insane, encoding problem: on HFS+ (or
even APFS now?), filenames are Unicode-normalized.

~~~
loeg
Not only are they normalized Unicode, they're normalized _decomposed_, and
not only that, but slightly non-standard (it does not conform to the standard
Unicode "NFD" form). (Or at least, this was the case with HFS. I haven't
followed APFS closely enough to say.)

~~~
cryptonector
NFD hadn't been standardized at the time.

IIUC, the reason they did this is that they wanted directories to be
canonically ordered on disk, and they thought decomposition would naturally
yield better results than precomposition. I'm not sure that's right, and
frankly I don't care either, because the most important thing to note is that
input methods (especially for European languages) by and large produce NFC,
and most application software does no normalization at all, so disagreements
as to form cause problems [0][1].

[0] [https://cryptonector.com/2010/04/on-unicode-normalization-or...](https://cryptonector.com/2010/04/on-unicode-normalization-or-why-normalization-insensitivity-should-be-rule/)

[1] [https://cryptonector.com/2006/12/filesystem-i18n/](https://cryptonector.com/2006/12/filesystem-i18n/)

~~~
cryptonector
I should add that, because different locales have different collations, it's
not that important that directories be ordered by name. It's good enough that
directories be somewhat ordered, or even not ordered at all. GUIs will almost
always let you sort by name and/or date, and the same goes for ls(1), so,
really, it doesn't matter at all.

IMO it was a terrible mistake to normalize to NFD on create. Normalizing to
NFC on create would still have been a mistake, but a lesser one.

------
lowbloodsugar
"NTFS allows any sequence of 16-bit values for name encoding (file names,
stream names, index names, etc.) except 0x0000. This means UTF-16 code units
are supported, but the file system does not check whether a sequence is valid
UTF-16 (it allows any sequence of short values, not restricted to those in the
Unicode standard). "

\- from the Wikipedia NTFS page [1]

So if you assume that an NTFS filename is valid UTF-16 and convert it to
UTF-8, there might be a problem. Basically, names can be any sequence of
16-bit values.

[1] [https://en.wikipedia.org/wiki/NTFS](https://en.wikipedia.org/wiki/NTFS)
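
The failure mode is easy to reproduce in Python:

    
    
      # A lone surrogate is legal as NTFS name data but not valid UTF-16,
      # so a strict conversion blows up:
      units = b'\x00\xd8'            # the single 16-bit value 0xD800 (LE)
      units.decode('utf-16-le')      # raises UnicodeDecodeError
    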

~~~
cryptonector
Doesn't it (or Windows) also disallow the path component separator
characters ('/' and '\')?

Unix and the like disallow NUL and '/', for obvious reasons.

~~~
tcoff91
There are a number of characters, like the path separators, that cannot be
part of a file name on Windows. However, I am not sure whether this is
enforced by the OS APIs or by NTFS itself. It is entirely possible that NTFS
allows things that higher layers don't.

~~~
cryptonector
If the kernel (and SMB, and...) imposes these constraints, it's fine for the
filesystem to not also impose the same constraints on file naming.

------
mattbierner
The built-in file system libraries of many languages are total footguns.
Using strings as paths is a great example. I've had a few recent bugs around
case-sensitive vs. case-insensitive file systems, because a lot of code
assumes that when pathA != pathB, it must be dealing with two different
resources. Not to mention the classic "doesn't work on Windows" problem:
newPath = pathA + "/" + pathB
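
The usual fix is to let a path library pick the separator. A sketch with
Python's pathlib (the file names are made up):

    
    
      from pathlib import PurePath, PureWindowsPath
      
      print(PurePath('music') / 'track.m4a')          # host flavour
      print(PureWindowsPath('music') / 'track.m4a')   # music\track.m4a
    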

~~~
lostmsu
AFAIK, Windows understands / as a directory separator.

~~~
mehrdadn
Only in some special cases, not in general.

~~~
dvlsg
Really? I use forward slashes all the time in windows 10. I don't think I've
run into a problem yet.

~~~
mehrdadn
Yes. One huge problem is that slashes are also used for command-line
switches. That can entirely change the meaning of your commands.

For example, compare these two in the command prompt:

    
    
      start /Windows/Notepad.exe
      start \Windows\Notepad.exe
    

I wouldn't blame this on poor parsing or other silly things, though. I
actually think it makes sense to use slashes for switches, because they are
invalid filename characters, whereas dashes are valid and hence ambiguous
(hence the need for "\--" in *nix). I think it would've made more sense to
disallow slashes as directory separators entirely, to avoid this for good.

------
jakeogh
Tangentially: a tool I use to test my stuff when I expect it to handle all
valid filenames:

[https://github.com/jakeogh/angryfiles](https://github.com/jakeogh/angryfiles)

------
linsomniac
I found a similar problem with my backups (Ugh. :-)

A year ago I went on a trip, and some combination of the humidity, the
travel, and the 6-year-old Thinkpad resulted in my laptop not booting.

I had been experimenting with Borg to back up the system, so I tried using
Borg to restore the latest copy onto the new laptop. It turns out that I have
a bunch of files on my laptop whose names contain weird characters: rips of
my CD collection. I couldn't find any combination of settings, environment,
and locale that would allow Borg to recover or skip those files and recover
everything else.

Now, I had 2-3 other copies of the data (my pre-borg backups, the original SSD
which was still readable, a few other rsync copies), so it wasn't a big deal.

But, as always, test your recoveries!

~~~
yepguy
I think you could have mounted your borg backup as a FUSE filesystem, and then
used rsync to restore your files.

------
ggm
now sing along with me children: /none of this matters to me/ because I live
in eight dot three/

------
dwheeler
I wrote an essay a while ago about fixing Unix/Linux filenames:
[https://www.dwheeler.com/essays/fixing-unix-linux-filenames.html](https://www.dwheeler.com/essays/fixing-unix-linux-filenames.html)

There is a big disconnect between "what most users expect" and "what systems
actually do". Users generally expect that filenames are sequences of
characters, and today almost everyone expects them to be UTF-8 on a Unix-like
system. That is not, of course, what most systems actually do.

------
cryptonector
I wrote about this eons ago:
[https://cryptonector.com/2006/12/filesystem-i18n/](https://cryptonector.com/2006/12/filesystem-i18n/)
and
[https://cryptonector.com/2010/04/on-unicode-normalization-or...](https://cryptonector.com/2010/04/on-unicode-normalization-or-why-normalization-insensitivity-should-be-rule/)
\-- these might still be available on
[https://blogs.oracle.com/](https://blogs.oracle.com/), though these are from
my days at Sun.

TL;DR: basically, the lack of the ability to tag strings in the system call
API with codesets means that UTF-8 is the only plausible answer, and the ends
(C library system call stubs, filesystems) would have to apply whatever
codeset conversions are needed. But there's practically zero chance of C
library system call stubs (and related functions) performing codeset
conversions (can you imagine readdir(3) doing it?), which means that the only
reasonable answer is to use UTF-8 locales and be done.

Even shorter: just use UTF-8 locales and be done.

~~~
netheril96
You cannot use UTF-8 locales on Windows though.

~~~
cryptonector
That's OK. On Unix use UTF-8. On Windows use Unicode, and let apps use UTF-8
or UTF-16 as appropriate -- the kernel/NTFS make it right.

------
hossbeast
In which the author takes a long and winding path to what most of us already
know, "paths are fundamentally bytes".

~~~
deathanatos
… but also, sort of, but not really, text: they get displayed to the user,
they get input from the user, and they get emitted in logs, messages, etc.
All as text. And that rub, between where they're bytes but should have been
text, is the problem and the complexity.

------
cyberferret
Just dealing with file _extensions_ is enough of a head spin. We stopped
trying to differentiate between *.xls, *.xlsx, *.xlst... etc. to show an
Excel icon for a file uploaded to our SaaS, and just went with a generic file
icon in the end.

~~~
Waterluvian
That sounds like the classic example of a decorative feature that someone
thought "oh, that sounds easy, add it to someone's sprint", but that of
course turns out to be mind-numbingly complex.

~~~
gear54rus
Honest question: why is that feature complex? What is the problem with
looking at the last part after the dot?

~~~
hackits
Because Microsoft made their new Office extension .xml. If that doesn't make
your head spin, I don't know what else will.

~~~
gear54rus
I mean, this is only adding another entry to your array of extensions for
the specific icon. It's also not a deal-breaker if it doesn't work. It really
is easy; I don't see what the problem is.

~~~
heeen2
Maybe because it is used for all kinds of Office documents. You'd have to
look at the contents to see whether it's a spreadsheet, text document,
presentation, or not even an Office document but a plain XML file.
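
If you do have to sniff, the modern Office formats are zip containers, and
the top-level directory gives the document kind away. A rough sketch,
assuming well-formed files:

    
    
      import zipfile
      
      def ooxml_kind(path):
          with zipfile.ZipFile(path) as z:
              names = z.namelist()
          if any(n.startswith('xl/') for n in names):
              return 'spreadsheet'
          if any(n.startswith('word/') for n in names):
              return 'word processing'
          if any(n.startswith('ppt/') for n in names):
              return 'presentation'
          return 'unknown'
    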

------
throwaway7767
IBM's backup software TSM/Spectrum Protect messes this up as well.

If the machine has a UTF-8 locale (like, say, every modern system), it will
try to treat filenames as valid UTF-8 strings and fail to back up files which
don't fulfill that assumption. The "solution" is to run the TSM software with
a single-byte locale like en_US.

I've seen a number of shops that were silently missing files from backups of
old systems because of this problem.

~~~
blattimwind
I don't think any backup software actually can do the right thing(tm). Some
might preserve the binary representation (or attempt to, anyway), others
attempt to preserve the Unicode codepoint-space representation...

... most do neither, but rather do ${complex thing emerging from combination
of implementation details of runtime and backup tool, impossible to reproduce
in any other runtime, likely platform- and environment dependent; the same
backup likely restores in different ways on different machines, and the same
source files create different backups on different machines; creating a backup
on one machine and restoring it on another does not generally result in the
same files; and I have not yet mentioned what might happen if you mount the
same source file system from different platforms, because results might vary a
lot; also, we are only talking about paths here, not any of the other plethora
of things that can and will be different between any element in OSxFSxEnv}.

~~~
throwaway7767
> I don't think any backup software actually can do the right thing(tm).

Sure it can. In this case, I'd say treating the filename as a bag of bytes
is the correct way to go, as that's how the OS treats them. Translating
filenames between character sets should not be part of a backup system's job.

There are valid setups where different software on the same machine might be
running with different character sets, for legacy reasons. In that case there
is no correct way to handle the filenames as text. But treating them as a
bag of bytes will always work consistently.

Also, the one purpose of a backup system is to back up the files on the
filesystem. If it can't back up some files that the OS considers valid, it's
the backup software that failed.
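
Concretely, on a Unix-y system that means keeping everything in bytes from
walk to restore. A minimal sketch (`/srv/data` is hypothetical):

    
    
      import os
      
      # Build a backup index of raw byte paths; nothing is ever decoded,
      # so any name the kernel accepts survives byte-for-byte.
      def index_paths(root: bytes):
          for dirpath, _dirnames, filenames in os.walk(root):
              for name in filenames:
                  yield os.path.join(dirpath, name)
      
      paths = list(index_paths(b'/srv/data'))
    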

------
pjmlp
And here I was thinking that I would see mainframes and other non-POSIX
systems, only to find the usual Linux/Windows dichotomy.

------
shadytrees
And a slow black snake slithers out of your computer's USB ports, made of
billowed smoke and unrealized dreams. It sticks its tongue out; "typesssss,"
it whispers, "typessssss."

------
mkesper
A way to stay sane seems to be to use Python 3's pathlib and drop Python 2
development.
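
A tiny example of what that buys you (the file names are made up):

    
    
      from pathlib import PurePosixPath
      
      p = PurePosixPath('Music') / 'Mémo-1.m4a'
      print(p.suffix)                # '.m4a'
      print(p.with_suffix('.aif'))   # Music/Mémo-1.aif
      print(str(p))                  # plain string when an API needs one
    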

