
Working with UTF-8 in the Kernel - mindcrime
https://lwn.net/SubscriberLink/784124/2595e4df117dc86a/
======
eponeponepon
My word, but this would break _so much_. I would like to think that I've
managed to keep my wits about me enough over the years that I haven't got any
code needing to find both 'A.txt' and 'a.txt' in the same place... but I
_really_ wouldn't put it past past-me to've screwed that up... that guy's just
the _worst_.

edit to add: I don't think I ever noticed Android being case-insensitive.
Always something new to learn!

~~~
jey
macOS is a case-insensitive unix, and it seems to work fine.

A few years ago I made the mistake of formatting my Mac to use a case-
sensitive filesystem, and it turned out to be a huge mistake. My thought
process was "I want full compatibility with Unix-land", but this wasn't in
line with the rest of the Mac conventions and just ended up causing pain.

------
mjevans
My own preferences: No, don't, that's a bad idea. Unicode belongs entirely in
Userspace.

If the main reason for this change is to make end users' lives easier then the
solution requires two userspace parts.

1) When creating a new filename, check to see if it collides in a case
insensitive way.

2) A 'filename validation' tool that scans for filenames that are not
normalized in a desired way, and also for case insensitive collisions.

BOTH of these tasks should be OPTIONAL and entirely in userspace.

~~~
comex
Checking for collisions in userspace is what Samba does, and maybe Android as
well (I know they currently use something based on wrapfs based on FUSE, but
not sure how exactly it works). It has two big problems:

1. It’s slow – every time you want to create a file, you have to do an O(n)
enumeration of every file in the current directory to see if any of them
collide with your new filename. A kernel implementation can use the
normalized, case-folded form of each filename for its hash in the dentry
cache, and then look things up in O(1) based on that hash. You could try to
maintain a similar cache in some userland daemon, but it would break if anyone
changed the files without going through that daemon, creating a mismatch
between the userland cache and the actual file system state. If you want to
accommodate arbitrary modifications by other processes, you have to go to the
kernel every time.

2. It’s inherently racy. If you (1) list the directory to check for existing
colliding files, find none, and then (2) create a new file, another process
might have created a file with a colliding (but not byte-equal) name in the
interval between (1) and (2). That wouldn’t be an issue if we had
transactional filesystems or even just better filesystem-based concurrency
APIs, but in Linux, like most other OSes, we can’t have such nice things. You
could avoid the race by either making all filesystem accesses go through a
daemon, or making up a locking protocol where you have to grab an advisory
lock on a hidden file or something. But those solutions have the same flaw as
the last paragraph – they only work if all processes writing to the filesystem
are on board. They also add a lot of complexity.
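
A minimal sketch of the userspace check-then-create approach described in points 1 and 2, to make the O(n) scan and the race window concrete. The function names are made up for illustration, and `strcasecmp` only folds ASCII, so it stands in for real Unicode normalization and case folding.

```c
#include <assert.h>
#include <dirent.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <strings.h>
#include <unistd.h>

/* Returns 1 if some entry in `dirpath` equals `name` ignoring ASCII case,
 * 0 if none does, -1 on error. This is the O(n) scan: in the worst case
 * every directory entry is visited. */
int has_folded_collision(const char *dirpath, const char *name)
{
    assert(dirpath != NULL && name != NULL);
    DIR *d = opendir(dirpath);
    if (d == NULL)
        return -1;
    int found = 0;
    struct dirent *e;
    while ((e = readdir(d)) != NULL) {
        if (strcasecmp(e->d_name, name) == 0) {
            found = 1;
            break;
        }
    }
    closedir(d);
    return found;
}

/* Check, then create. Returns an open fd, or -1 with errno set
 * (EEXIST when a colliding name was seen during the scan). */
int create_if_no_collision(const char *dirpath, const char *name)
{
    int hit = has_folded_collision(dirpath, name);
    if (hit < 0)
        return -1;
    if (hit) {
        errno = EEXIST;
        return -1;
    }
    /* RACE WINDOW: another process can create a differently-cased
     * variant of `name` right here, and O_EXCL below will not notice,
     * because O_EXCL only compares exact bytes. */
    char path[4096];
    int n = snprintf(path, sizeof path, "%s/%s", dirpath, name);
    if (n < 0 || (size_t)n >= sizeof path) {
        errno = ENAMETOOLONG;
        return -1;
    }
    return open(path, O_WRONLY | O_CREAT | O_EXCL, 0644);
}
```

The caller owns the returned descriptor and should `close()` it. Nothing in this sketch can close the window between the scan and the `open()`.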

A partial alternative is to just accept the race as a possibility; clients
could check for collisions _after_ creating a file, and if they found one,
delete the file they just created. That’s arguably good enough for most use
cases, but it’s sure ugly to add the edge case of a transient broken state,
when the point of keeping insensitivity in userland is to avoid ugliness. And
what if a process crashes between creating a colliding file and deleting it?

2.5. And of course a non-case-insensitivity-aware process could just create
two conflicting files on its own. I’m not counting this as a third problem
since I’ve already raised the issue of unaware processes and ugly edge cases
in the first two, and you did mention there could be a validation tool. But I
rather dislike the idea that accessing your files using standard Unix tools
would be “risky”, in that you could accidentally create files with names that
case-insensitivity-aware programs couldn’t handle. As I see it, that
essentially relegates the Unix tools to being second-class citizens.

Given these issues, I think it really is best to handle case insensitivity in
the kernel, no matter how ugly it seems. Either that, or make FUSE faster so
you can do it in a userland filesystem without sacrificing performance…

~~~
mjevans
For a multi-threaded system I don't actually see how any of that is solved by
moving things that SHOULD NOT belong in the kernel to the kernel.

Edited to add this paragraph: I feel I should expand on the above line. It's
actually the reason why most of the post this is replying to is invalid.
Outside of the single thread context _all_ caches are potentially invalid,
even the in-kernel cache could be invalidated between different submissions.
So all clients have to be coded to accept a 'fail' case and behave properly
anyway, OR you allow for one to 'win'. This is why (man 2 open) open() has
been extended over time to include the O_EXCL flag for failing in the case of
a file already existing. Similarly the (man 2 rename) rename() has been
extended (fairly recently) with RENAME_NOREPLACE, which fails in the case of
the file already existing. Anything which might be subject to concurrency
issues already requires such handling, or accepting that the results might be
undefined (e.g. one version of the rename to target filename working in an
'atomic' way but no assurance about the ordering of operations from parallel
threads/processes).
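
For reference, a small sketch of the O_EXCL pattern mentioned above: the create either succeeds or fails with EEXIST, and the caller handles both outcomes. The helper name `create_exclusive` is made up for illustration; RENAME_NOREPLACE is not shown.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Try to create `path`, failing rather than reusing an existing file.
 * Returns 1 if this caller created it, 0 if it already existed,
 * -1 on any other error (errno is left set by open). */
int create_exclusive(const char *path)
{
    assert(path != NULL);
    int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0644);
    if (fd >= 0) {
        close(fd);
        return 1;   /* this caller won */
    }
    return errno == EEXIST ? 0 : -1;
}
```

Note that the existence check here is on exact bytes: on a case-sensitive filesystem, `a.txt` and `A.txt` are different names and both creates succeed.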

Anything not facing a user should already be creating filenames in its
preferred case and normalization.

Anything facing the user (accepting user input as a filename) should either be
doing the same (if it's part of a sanitized filename) OR is directly part of a
UI interacting with a user, which must already handle cases such as renaming
an existing file to one that might also exist.

I really don't buy that anything of actual value is being gained here for use
cases that seem very rare and a complexity cost in kernel that is very high.

~~~
comex
O_EXCL and RENAME_NOREPLACE are examples of _why_ you want case insensitivity
to be in the kernel. They're atomic operations: if you use O_EXCL, you could
get a successful file creation, or you could get EEXIST, but not something in
between. But you can't implement a case insensitive version of O_EXCL in
userland. As I said, the best you can do is scan first to see whether a
conflicting file already exists, but that doesn't help if a conflicting file
is created in between the scan and when you create your file.

The kernel is in a better position; since all readers and writers must go
through the kernel's filesystem routines, those routines can use mutexes or
other concurrency mechanisms (that they need anyway to protect their data
structures) to implement higher-level atomic semantics. That's how the kernel
implements regular O_EXCL, and it's how it can implement case-insensitive
O_EXCL as well, without having to deal with race conditions. In fact it
already does that for vfat and other case-insensitive filesystems.
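
A userspace analogy of that argument, not kernel code: when every lookup and insert goes through one gatekeeper that holds a lock across both steps, a case-insensitive exclusive create has no gap for a colliding name to slip into. The `toy_dir` type and `toy_create_excl` are hypothetical, and `strcasecmp` folds ASCII only.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <string.h>
#include <strings.h>

#define TOY_MAX_ENTRIES 64
#define TOY_NAME_MAX    64

/* An in-memory "directory" whose only entry point takes the lock. */
struct toy_dir {
    pthread_mutex_t lock;
    int count;
    char names[TOY_MAX_ENTRIES][TOY_NAME_MAX];
};

/* Case-insensitive exclusive create. The lookup and the insert happen
 * under the same lock, so they behave as one atomic step.
 * Returns 0 on success, or an errno value (EEXIST, ENOSPC, ENAMETOOLONG). */
int toy_create_excl(struct toy_dir *dir, const char *name)
{
    assert(dir != NULL && name != NULL);
    if (strlen(name) >= TOY_NAME_MAX)
        return ENAMETOOLONG;

    int rc = 0;
    pthread_mutex_lock(&dir->lock);
    for (int i = 0; i < dir->count; i++) {
        if (strcasecmp(dir->names[i], name) == 0) {
            rc = EEXIST;
            break;
        }
    }
    if (rc == 0) {
        if (dir->count == TOY_MAX_ENTRIES)
            rc = ENOSPC;
        else
            strcpy(dir->names[dir->count++], name);
    }
    pthread_mutex_unlock(&dir->lock);
    return rc;
}
```

A `struct toy_dir` can be set up with `{ .lock = PTHREAD_MUTEX_INITIALIZER, .count = 0 }`. The sketch only works because nothing can touch `names` without going through `toy_create_excl`, which is the property the kernel has and an ordinary userspace process does not.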

(A FUSE filesystem similarly gatekeeps all accesses and thus can do the job in
userspace, but FUSE is slow.)

~~~
nybble41
> But you can't implement a case insensitive version of O_EXCL in userland.

So leave O_EXCL in the kernel in its current case-sensitive form. In
userspace, implement normalization and case-folding as you see fit and pass
the normalized, case-folded filename to open(). Ignore any files other
programs may create with non-normalized names. Keep an xattr or lookup table
with the unfolded "presentation" version of the name for directory listings.
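
A rough sketch of that scheme, assuming an ASCII-only fold as a stand-in for real normalization and case folding. `fold_ascii` and `create_folded` are made-up names, and storing the presentation name (xattr or lookup table) is left out.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Lowercase ASCII letters only; other bytes pass through unchanged.
 * Returns 0 on success, -1 if `out` is too small. */
int fold_ascii(const char *in, char *out, size_t outsz)
{
    assert(in != NULL && out != NULL);
    size_t i = 0;
    for (; in[i] != '\0'; i++) {
        if (i + 1 >= outsz)
            return -1;
        char c = in[i];
        if (c >= 'A' && c <= 'Z')
            c = (char)(c - 'A' + 'a');
        out[i] = c;
    }
    if (i >= outsz)
        return -1;
    out[i] = '\0';
    return 0;
}

/* Fold the name first, then let the kernel's ordinary byte-exact
 * O_EXCL do the collision check on the folded form.
 * Returns an open fd, or -1 with errno set. */
int create_folded(const char *dirpath, const char *name)
{
    char folded[256], path[4096];
    if (fold_ascii(name, folded, sizeof folded) != 0) {
        errno = ENAMETOOLONG;
        return -1;
    }
    int n = snprintf(path, sizeof path, "%s/%s", dirpath, folded);
    if (n < 0 || (size_t)n >= sizeof path) {
        errno = ENAMETOOLONG;
        return -1;
    }
    return open(path, O_WRONLY | O_CREAT | O_EXCL, 0644);
}
```

Two cooperating callers asking for `ReadMe.TXT` and `README.txt` both end up racing for `readme.txt`, so the second one gets EEXIST atomically. A program that bypasses the folding step can still create `README.txt` alongside it.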

------
saagarjha
> Working through the string is then accomplished by repeated calls to:
    
    
      int utf8byte(struct utf8cursor *u8c);
    

> This function will return the next byte in the normalized (and possibly
> case-folded) string, or zero at the end. UTF-8-encoded code points can take
> more than one byte, of course, so individual bytes do not, on their own,
> represent code points.

What's the reasoning behind going byte-by-byte rather than grapheme-by-grapheme?
Seems like an API designed for comparisons and nothing else…
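
To illustrate the comparison use case, here is a toy stand-in for the cursor interface quoted above. It is not the kernel code: `toy_cursor`, `toy_byte`, and `toy_casefold_cmp` are hypothetical, and the "folding" only lowercases ASCII, whereas the real function would also normalize.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's cursor. */
struct toy_cursor {
    const unsigned char *p;
};

void toy_cursor_init(struct toy_cursor *c, const char *s)
{
    assert(c != NULL && s != NULL);
    c->p = (const unsigned char *)s;
}

/* Next byte of the folded string, or 0 at the end, following the
 * contract described in the quoted text. */
int toy_byte(struct toy_cursor *c)
{
    unsigned char b = *c->p;
    if (b == 0)
        return 0;
    c->p++;
    if (b >= 'A' && b <= 'Z')
        b = (unsigned char)(b - 'A' + 'a');
    return b;
}

/* Returns 0 if the two names are equal after folding, 1 otherwise.
 * No buffer for the folded strings is ever allocated: the comparison
 * consumes one byte from each side at a time. */
int toy_casefold_cmp(const char *a, const char *b)
{
    struct toy_cursor ca, cb;
    toy_cursor_init(&ca, a);
    toy_cursor_init(&cb, b);
    for (;;) {
        int x = toy_byte(&ca);
        int y = toy_byte(&cb);
        if (x != y)
            return 1;
        if (x == 0)
            return 0;
    }
}
```

With this shape, comparing or hashing two names needs only two small cursors, which is plausibly why a byte-at-a-time interface suits a lookup path.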

~~~
dbaupp
There's no fixed-width/non-allocating way to return a grapheme (unless one can
return a slice into the original data, which cannot happen here), because they
may consist of arbitrarily many codepoints (and code units) both in theory and
in practice. For instance, many non-Latin scripts compose single graphemes out
of multiple code points and many emoji are multiple codepoints (e.g. the
families[0], and the skin tone variants).

https://manishearth.github.io/blog/2017/01/15/breaking-our-latin-1-assumptions/

One option would be to instead return (24-bit) code points, which would amortize
comparison and iteration costs slightly, but may accidentally encourage people
to forget about multiple-codepoint characters.

[0]: https://r12a.github.io/uniview/?charlist=%F0%9F%91%A9%E2%80%8D%F0%9F%91%A9%E2%80%8D%F0%9F%91%A7%E2%80%8D%F0%9F%91%A6

~~~
saagarjha
> There's no fixed-width/non-allocating way to return a grapheme (unless one
> can return a slice into the original data, which cannot happen here)

I take it it's not possible to return a pointer into the data because
normalization is being performed?

~~~
dbaupp
Yeah, normalising and case folding will both potentially result in a new
sequence of graphemes (and code points and code units), so slicing can't work.
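
A tiny example of why the output cannot be a slice of the input. This toy handles exactly one decomposition, U+00E9 (é, bytes C3 A9) into `e` plus U+0301 COMBINING ACUTE ACCENT (bytes CC 81), and lowercases ASCII; real NFD and case folding cover far more. The function name is made up.

```c
#include <assert.h>
#include <stddef.h>

/* Toy "normalize + fold". Writes a NUL-terminated result into `out`.
 * Returns the output length in bytes, or -1 if `out` is too small. */
int toy_normalize(const char *in, char *out, size_t outsz)
{
    assert(in != NULL && out != NULL);
    const unsigned char *s = (const unsigned char *)in;
    size_t o = 0;
    while (*s != 0) {
        if (s[0] == 0xC3 && s[1] == 0xA9) {
            /* 2 input bytes become 3 output bytes */
            if (o + 3 >= outsz)
                return -1;
            out[o++] = 'e';
            out[o++] = (char)0xCC;
            out[o++] = (char)0x81;
            s += 2;
        } else {
            if (o + 1 >= outsz)
                return -1;
            unsigned char b = *s++;
            if (b >= 'A' && b <= 'Z')
                b = (unsigned char)(b - 'A' + 'a');
            out[o++] = (char)b;
        }
    }
    if (o >= outsz)
        return -1;
    out[o] = '\0';
    return (int)o;
}
```

The 5-byte input `Caf\xC3\xA9` comes out as the 6-byte `cafe\xCC\x81`: the result is longer than the input and contains bytes that appear nowhere in it, so it has to be produced into new storage or streamed out piece by piece.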

------
rurban
Oh my, I've waited so long to add the utf8 API to my safelibc, but this is not
what I expected. It's amateur level.

    
    
        int utf8byte(char*)
    

should be named utf8cp of course. A 32bit int "byte" codepoint is a bit hard
to swallow.

utf8 iterators proved to be wrong in a myriad of previous cases, e.g.
libunistring which is still not used in coreutils over this wrong design
decision. You do the whole high-level API in buffers by yourself and don't do
it with exposed iterators.

Case folding and normalization are only one of the expected features; the next
would be identifier safety, which requires a few more checks: left-to-right and
right-to-left mixups, script mixups, confusables ... The kernel deals with identifiable
strings and once you open the unicode floodgates others will rush in. You
really need to think the API through, not like this. At least they thought
about the UNICODE_VERSION, which others didn't.

~~~
comex
The function does in fact return an (unsigned) byte; the use of int is
apparently because it returns -1 on failure. I'm not a big fan of overloading
return values that way, but it is reasonably common...
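
The same convention as `getchar()`: an `int` wide enough to carry every byte value plus an out-of-band error. A sketch of what a caller loop looks like under that convention, using a hypothetical `next_byte` that returns -1 on bytes that can never occur in UTF-8 (0xC0, 0xC1, 0xF5 through 0xFF). It does not validate whole sequences, and it is not the kernel's implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the next byte (1..255), 0 at the end of the string,
 * or -1 on a byte that is never valid in UTF-8. */
int next_byte(const unsigned char **pp)
{
    unsigned char b = **pp;
    if (b == 0)
        return 0;
    if (b == 0xC0 || b == 0xC1 || b >= 0xF5)
        return -1;
    (*pp)++;
    return b;
}

/* Counts the bytes in `s`; returns -1 if an invalid byte is hit.
 * One return value drives the loop and reports the error. */
long count_bytes(const char *s)
{
    assert(s != NULL);
    const unsigned char *p = (const unsigned char *)s;
    long n = 0;
    int b;
    while ((b = next_byte(&p)) > 0)
        n++;
    return b < 0 ? -1 : n;
}
```

The caller has to remember that "stopped" can mean either end of string or failure, and check the sign afterwards.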

------
MrRadar
> The proposed addition of case-insensitive file-name lookups to the ext4
> filesystem...

Ugh, why? I thought the general consensus was that case-insensitive
filesystems were a mistake. The article linked on that line is behind a
paywall; can someone summarize the rationale for this change?

Also, this article doesn't mention how the kernel developers plan to handle
differences in case folding between locales (e.g. the infamous Turkish "dotted
I"). I learned recently that NTFS actually stores a full case-folding table
(from the current locale of the system formatting the volume) in the
filesystem metadata specifically to ensure case-folding will remain consistent
even if the volume is attached to a system with a different locale, which seems
like the only sane way to handle this.
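
A toy illustration of the "table travels with the volume" idea. This is not NTFS's on-disk format; `toy_volume` and its functions are hypothetical, and the table only covers code points below U+0200. The Turkish rule used is the standard one: lowercase `i` uppercases to U+0130 (dotted capital I), and U+0131 (dotless i) uppercases to `I`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_TABLE_SIZE 0x200u   /* ASCII plus U+0130 and U+0131 */

/* The volume carries its own upcase table, written at format time. */
struct toy_volume {
    uint16_t upcase[TOY_TABLE_SIZE];
};

void toy_volume_format(struct toy_volume *v, int turkish_rules)
{
    assert(v != NULL);
    for (uint32_t c = 0; c < TOY_TABLE_SIZE; c++)
        v->upcase[c] = (uint16_t)c;
    for (uint32_t c = 'a'; c <= 'z'; c++)
        v->upcase[c] = (uint16_t)(c - 'a' + 'A');
    v->upcase[0x0131] = 'I';        /* dotless i uppercases to I */
    if (turkish_rules)
        v->upcase['i'] = 0x0130;    /* i uppercases to dotted capital I */
}

/* Code points outside the table map to themselves. */
uint32_t toy_upcase(const struct toy_volume *v, uint32_t cp)
{
    return cp < TOY_TABLE_SIZE ? v->upcase[cp] : cp;
}

/* Names are arrays of code points. Equality depends only on the
 * volume's own table, never on the locale of the machine reading it. */
int toy_names_equal(const struct toy_volume *v,
                    const uint32_t *a, size_t alen,
                    const uint32_t *b, size_t blen)
{
    if (alen != blen)
        return 0;
    for (size_t i = 0; i < alen; i++)
        if (toy_upcase(v, a[i]) != toy_upcase(v, b[i]))
            return 0;
    return 1;
}
```

On a volume formatted without the Turkish rule, `i` and `I` name the same file; on one formatted with it, they do not, and that answer stays the same whichever machine mounts the volume.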

~~~
saagarjha
macOS's HFS+ and APFS are case-insensitive by default, which I feel is a bit
less confusing to users ("What do you mean that 'document' and 'Document' are
different things?! They're both 'document'!")

~~~
benj111
That's only a very small leap to 'documents' ("it's obvious what I mean!")
though. I don't think the kernel should start trying to handle that.

------
bengerbil
Case insensitivity for files in a unix-ish environment seems like a great
idea, until you're searching PATH for an executable called "head", and both
HEAD and head exist. Do you go for first match, or best match, falling back to
case-insensitivity? Bonus points for the confusion that ensues when some of
the file systems being checked are case sensitive and some are only case
preserving.
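
The two policies from the question, sketched over a flat list of candidate names rather than a real PATH walk. Both function names are made up, and `strcasecmp` folds ASCII only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <strings.h>

/* Policy 1: the first entry that matches ignoring ASCII case.
 * Returns its index, or -1 if nothing matches. */
int first_folded_match(const char *const *names, size_t n, const char *want)
{
    assert(names != NULL && want != NULL);
    for (size_t i = 0; i < n; i++)
        if (strcasecmp(names[i], want) == 0)
            return (int)i;
    return -1;
}

/* Policy 2: an exact byte match wins wherever it sits in the list;
 * only if there is none do we fall back to policy 1. */
int best_match(const char *const *names, size_t n, const char *want)
{
    assert(names != NULL && want != NULL);
    for (size_t i = 0; i < n; i++)
        if (strcmp(names[i], want) == 0)
            return (int)i;
    return first_folded_match(names, n, want);
}
```

With the candidates `{ "HEAD", "head" }` and a request for `head`, policy 1 picks index 0 (`HEAD`) while policy 2 picks index 1 (`head`), which is exactly the ambiguity being described.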

------
ben509
It seems like pretty names are something extended attributes could deal with
better than the kernel.

If we add "display" and "locale" fields to the directory entry, my office app
would write out "bob's great spreadsheet.xls" and add some extended attributes
like display="Bob's Great Spreadsheet" and locale="en-US".

------
eridius
How does comparison work if one of the strings contains invalid UTF-8
sequences? Does utf8byte() interpret the invalid subsequence as a U+FFFD
REPLACEMENT CHARACTER or does it return an error?

------
tinus_hn
Hope they do enough fuzzing to be sure that there are no DoS and crash attacks
with weird filenames.

------
FrozenVoid
This is a bad idea that will add lots of attack surfaces.

