Hacker News
You probably don't need to validate UTF-8 strings (viralinstruction.com)
78 points by jakobnissen 39 days ago | 100 comments



There is one property about UTF-8 that distinguishes it from opaque byte strings of unknown encoding: its codepoints are self-delimiting, so you can naively locate instances of a substring (and delete them, replace them, split on them, etc.) without worrying that you've grabbed something else with the same bytes as the substring.

Contrast with UTF-16, where a substring might match the bytes at an odd index in the original string, corresponding to totally different characters.

Identifying a substring is valid in every human language I know of, as long as the substring itself is semantically meaningful (e.g., it doesn't end in part of a grapheme cluster; though if you want to avoid breaking up words, you may also want a \b-like mechanism). So it does seem to refute the author's notion that you can do nothing with knowledge only of the encoding.
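To make the self-delimiting property concrete, here's a small Python sketch (illustrative, not from the article): a naive byte-level search in UTF-8 can only match at real codepoint boundaries, while the same trick on UTF-16 bytes can hit a misaligned "phantom" match.

```python
# UTF-8 continuation bytes (0b10xxxxxx) can never begin a codepoint,
# so a byte-level search cannot match in the middle of a character.
text = "héllo wörld".encode("utf-8")
needle = "ö".encode("utf-8")        # b'\xc3\xb6'
print(text.find(needle))            # 8: the real "ö", at a codepoint boundary

# UTF-16 has no such guarantee: U+4142 U+4241 in little-endian is b'BAAB',
# so searching those bytes for b'AA' "succeeds" at an odd (misaligned) offset.
utf16 = (chr(0x4142) + chr(0x4241)).encode("utf-16-le")
print(utf16)                        # b'BAAB'
print(utf16.find(b"AA"))            # 1: not a real character boundary
```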


String equality is an extremely domain-dependent problem. It's so bad that it's even possible for two strings that contain identical bytes to represent different strings! (because of CJK unification, the same UTF-8 byte sequence encodes different characters in different locales). And it's extremely common in Unicode for different byte sequences to represent the same characters, in various ways (accented characters vs character+accent, different orders of character and accents, identical characters in different scripts, such as latin a and cyrillic a, etc).

And beyond Unicode itself, two strings that would be displayed differently would still be expected by your users to match. Case insensitive search is the most common example for English, but other language users often expect searches to ignore accents or similar diacritics. Then, for things like addresses and even sometimes names, multiple spellings of the same name are often considered matches, e.g. when identifying a delivery address, or for many kinds of simple identity verifications.

There's really no good general concept of string equality that you can bake into a language except for the byte equality one, which doesn't care for UTF-8.


I don’t agree that there is no general right way of comparing strings. I think most people would agree that strings are considered equal if they are made up of the same characters in the same sequence. That means Cyrillic characters are considered different characters than lookalike Latin characters. It also means that for string comparison, different methods of representing accented characters should be considered equal.

The points you mentioned about matching strings without considering accents or matching different ways to write an address are relevant in applications but that’s not really string equality, I would call that string similarity or fuzzy search.

As far as I know, the correct way to compare Unicode strings is to normalize them and then compare the bytes.
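A minimal Python sketch of that normalize-then-compare approach, using the stdlib's unicodedata (NFC is chosen here; NFD would work equally well as long as both sides use the same form):

```python
import unicodedata

def str_eq(a: str, b: str) -> bool:
    # Compare character content, ignoring how accents happen to be encoded.
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

precomposed = "caf\u00e9"     # é as a single codepoint
decomposed = "cafe\u0301"     # e followed by a combining acute accent
print(precomposed == decomposed)        # False: different codepoint sequences
print(str_eq(precomposed, decomposed))  # True: same characters
```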


"Correct" isn't quite, well, correct here ;)

For a starting point, case-insensitive comparison is comparison, and is locale-specific.

The general name for the task we're discussing is "collation". Unicode defines a standard collation algorithm: https://www.unicode.org/reports/tr10/

This is designed (because it has to be) to work with locale data, also provided by Unicode https://cldr.unicode.org

I would say that the minimum a language's standard library should provide is: all the normalization forms, and a lexicographical sort. "Fancy Unicode" (there's a lot!) is better off as a library of its own, or perhaps several: covering every single base is a monumental achievement, and a rather large codebase.

There are valid sorts not defined by Unicode, involving abbreviations, standard corrections of misspellings, and much more: these are needed for tasks like address normalization.


Most people might agree, but most people would be wrong. I actually think that most people would agree that string comparison should be a byte by byte comparison, which I think is actually slightly better.

> That means Cyrillic characters are considered different characters than lookalike Latin characters. It also means that for string comparison, different methods if representing accented characters should be considered equal.

Why? I know this is how Unicode defines it, but what real world use case does this serve? Why would two strings that will be rendered identically in virtually any use case, like "Cap" and "Сар", be considered different, but other strings like "ő" and "ő", equal?
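The asymmetry is easy to see in Python (a sketch; codepoints picked to match the comment's examples): the Cyrillic lookalikes are entirely different codepoints, while the two visually identical ő forms are unified by normalization:

```python
import unicodedata

latin = "Cap"
cyrillic = "\u0421\u0430\u0440"      # "Сар": ES, A, ER from the Cyrillic block
print(latin == cyrillic)             # False, though they render identically
print([unicodedata.name(c) for c in cyrillic])
# ['CYRILLIC CAPITAL LETTER ES', 'CYRILLIC SMALL LETTER A', 'CYRILLIC SMALL LETTER ER']

# Meanwhile o + combining double acute normalizes to the precomposed ő:
print(unicodedata.normalize("NFC", "o\u030b") == "\u0151")   # True
```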

Edit:

> As far as I know, the correct way to compare Unicode strings is to normalize them and then compare the bytes.

This is also a kind of fuzzy string match, just a different kind of fuzzyness. For example, this is an inappropriate way to compare strings if you're using it in a file diff implementation, as you would conclude a file has not changed when in fact it has, in ways which might matter to other tools. Database engines don't normally do normalization before enforcing uniqueness or creating indexes.


I'm pretty sure we both agree that equality should not depend on the look of the character, as you said you prefer byte comparison.

Treating different byte encodings of the same character as equal makes sense to me, because the different possibilities of representing the characters are just an artifact of the encoding. It’s like comparing negative and positive zero floats. It’s the same value, but the encoding allows multiple equivalent representations.

I wouldn’t call the matching after normalization fuzzy, because the thing you want to compare is the actual characters, not the way they are encoded.

You’re right that doing normalization when diffing files would be wrong, but that’s because you don’t want to compare strings when doing diffs, you want to compare binary representation.

Edit: to make my point more clear: If I give my string comparison function different strings containing the same characters, but encoded in different ways (for example UTF-8 and UTF-16), I would expect the function to tell me they are equal, because the function is supposed to compare the string content, not the way it is encoded in memory.


> Im pretty sure we both agree that equality should not depend on the look of the character, as you said you prefer byte comparison.

My position is that "string equality" is a meaningless term on its own, that different contexts require different notions of equality and that none of these is fundamentally better than any other. "the strings look the same to a human", "the strings are represented by the same bytes on a wire/disk", "the strings represent the same characters even if encoded differently", "the strings represent the same word to a native", "the strings represent the same street name to a native", "the strings represent the same human name", etc are all just as valid concepts of "string equality" and are useful in different more or less narrow contexts.

Your point that the need for normalization is just an artifact of the encoding is fair enough, but there are still valid contexts, even beyond pure byte comparisons, that require these to be considered different - if nothing else, then at least in posts explaining this very difference.

> If I give my string comparison function different strings containing the same characters, but encoded in different ways (for example UTF-8 and UTF-16), I would expect the function to tell me they are equal, because the function is supposed to compare the string content, not the way it is encoded in memory.

A perfectly valid equality, just like many others I mentioned.


> Why would "the path name you entered is not valid Unicode/UTF-8" be a good error message, if the path actually exists on the system?

On Windows NT (which uses Unicode file names), it is a good error message (although it should allow UTF-8 with possible mismatched surrogates, which is called WTF-8). However, on UNIX, and other systems that don't use Unicode file names, it is not a good error message; the program should not care if the path is valid UTF-8 and should just use it as is. (If the system disallows certain bytes in file names (or if the user entered a file name that is too long for the file system in use) then the error message can say that, if that is the case, though)

> There's really no good general concept of string equality that you can bake into a language except for the byte equality one, which doesn't care for UTF-8.

I agree.

How you compare text (or other data) would depend on the application-specific details, and Unicode makes it especially complicated (and probably even impossible to do "correctly" in any way other than just comparing bytes, anyways).

(Even in my own operating system design, which does not use Unicode but uses Extended TRON Code instead (it also supports arbitrary 8-bit character sets, which in some contexts are much more useful than large character sets), I had considered that comparing text still depends on application-specific details (and other details); while many of the problems of Unicode are avoided, there are still problems, including some of its own, although they are not as complicated to deal with as Unicode, and I had considered how to deal with them.)


You mentioned quite a lot of things here. Not really a spectrum, but on one side you have a comparison of bits (bytes), where only maybe byte order may cause issues, and on the other you have such high-level problems like comparing synonyms "begin" == "start" (or you could go further and compare sentences if they convey the same information). Somewhere between those extremes you have characters, glyphs, "confusables", normalization...

I don't follow on your CJK unification argument... How are strings different if they share the same codepoint? Is there software that analyzes the language used, or otherwise context, and prints the codepoint differently based on the context? Or is there a recommendation to do so?


> Not really a spectrum, but on one side you have a comparison of bits (bytes), where only maybe byte order may cause issues, and on the other you have such high-level problems like [...]

Yes, and my point is that various real-world problems will require various solutions from this spectrum.

> How are strings different if they share the same codepoint? Is there software that analyzes the language used, or otherwise context, and prints the codepoint differently based on the context? Or is there a recommendation to do so?

Yes, any renderer is supposed to choose a (slightly) different glyph for those codepoints based on the user's chosen locale. Wikipedia has a table of examples [0] of various codepoints that are supposed to be slightly different in Chinese (Simplified), Chinese (Traditional), Japanese, Korean, or Vietnamese.

It of course depends on your use case if you'd care beyond that. For example, if the problem you're solving is "are these two strings going to render identically", then the code points don't help, you also need to know that you need to take into account locale. I can imagine some (very niche, to be sure) use cases like this in the area of OCR or of web page comparison.

[0] https://en.m.wikipedia.org/wiki/Han_unification#Examples_of_...


You can also do that with invalid UTF-8. You simply treat the first invalid byte of a codepoint as the first byte of a new codepoint, terminating the previous one. In this way, the self-synchronization (self-delimiting) of UTF-8 is preserved. This is what Julia does.
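Python offers a related escape hatch (not Julia's scheme, but analogous in spirit): errors="surrogateescape" smuggles each invalid byte through as a lone surrogate, so the original bytes round-trip losslessly:

```python
raw = b"caf\xc3"   # truncated UTF-8: 0xC3 opens a 2-byte sequence that never ends
s = raw.decode("utf-8", errors="surrogateescape")
print(ascii(s))    # 'caf\udcc3': the stray byte becomes surrogate U+DCC3
# Re-encoding with the same handler reproduces the original bytes exactly.
print(s.encode("utf-8", errors="surrogateescape") == raw)   # True
```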


Sure, that's one way to do it, but if the invalid byte were logically tied to the byte before it in its original encoding, then now your substring split will mangle that character.


If it's invalid then you don't mangle anything, it's mangled already for whatever reason, no?


> Identifying a substring is valid in every human language I know of, as long as the substring itself is semantically meaningful

Doesn't work with ZWJ emoji, for example, the "job" emojis are made by combining a "person" emoji and another emoji related to the job. For example the emoji for a male pilot is encoded as man+ZWJ+plane. It means the "plane" emoji (a meaningful substring) will be found in the "pilot" emoji even though it is a different thing.

I don't know if the same thing happens in natural languages. Some languages use ZWJ, or combining diacritics, but I don't know how acceptable, for instance, looking for "e" and finding "é" would be.
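Both cases are easy to check in Python (an illustrative sketch):

```python
pilot = "\U0001F468\u200D\u2708\uFE0F"   # man + ZWJ + airplane + variation selector
plane = "\u2708"                          # the standalone airplane
print(plane in pilot)    # True: a "meaningful" substring found inside a different emoji

accented = "e\u0301"     # é built from e + combining acute accent
print("e" in accented)   # True: searching for "e" finds the base letter of "é"
```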


> Constrast with UTF-16, where a substring might match the bytes at an odd index in the original string, corresponding to totally different characters.

This is not a problem if you have the correct alignment; since it works on 16-bit units then you would be searching/indexing/splitting on such boundaries. There are surrogates, but you can easily self-delimit them like you can with UTF-8, too.


> Identifying a substring is valid in every human language I know of, as long as the substring itself is semantically meaningful

Nope. Searching a UTF-8 string for a byte sequence, even one that is semantically meaningful, is not a semantically meaningful operation in most human languages. E.g. searching for the byte sequence 63 61 66 C3 A9 will not find the semantically equivalent byte sequence 63 61 66 65 CC 81, and so your program will have very annoying bugs.
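Those two byte sequences are the precomposed and decomposed encodings of "café"; a quick Python check (sketch) confirms the miss, and that normalization is what bridges them:

```python
import unicodedata

composed = bytes([0x63, 0x61, 0x66, 0xC3, 0xA9])          # "café", é precomposed
decomposed = bytes([0x63, 0x61, 0x66, 0x65, 0xCC, 0x81])  # "café", e + combining acute
print(composed.find(decomposed))   # -1: the byte search misses the equivalent text
print(unicodedata.normalize("NFC", composed.decode("utf-8")) ==
      unicodedata.normalize("NFC", decomposed.decode("utf-8")))   # True
```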


Is it possible to cheaply canonicalize UTF-8? Or is a full parse/validation unavoidable if all you need to do is a substring search?

I suppose you could also generate all byte sequences that encode the desired substring and do an NFA-to-DFA trick to search in linear time.
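For the two-forms case, that trick is a one-liner with Python's re module on bytes (a sketch; enumerating every equivalent form in general is harder, since combining marks can also reorder):

```python
import re

forms = ["caf\u00e9", "cafe\u0301"]   # precomposed and decomposed "café"
pattern = re.compile(b"|".join(re.escape(f.encode("utf-8")) for f in forms))

haystack = "I went to the cafe\u0301 yesterday".encode("utf-8")
match = pattern.search(haystack)
print(match.group())   # b'cafe\xcc\x81': found without decoding the haystack
```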


> Is it possible to cheaply canonicalize UTF-8? Or is a full parse/validation unavoidable if all you need to do is a substring search?

Even if it was, it wouldn't help you. Implementing search over unicode text properly requires locale-awareness and cannot be done by comparing codepoint sequences, even after canonicalisation.


No, the normalization algorithms require big tables of which codepoints are equivalent to what codepoint.


I think there is no cheap Unicode normalization. It's done with lookup tables.


There's no such thing as "canonical UTF-8". You can ask about canonicalizing Unicode, regardless of the byte encoding, and no, there's no cheap way.


My understanding is that Rust has designed the rest of the String API under the assumption of validity. You can't create an invalid String because the methods that operate on strings strive to be tightly optimized UTF-8 manipulation algorithms that assume the string has already been cleaned. Pack all of the actual robustness in the guarantee that you are working with UTF-8, and you can avoid unnecessary CPU cycles, which is one of the goals of systems languages. If you want to skip it, go for raw strings or CStr -- all raw byte buffers have the basic ASCII functions available, which are designed to be robust against whatever you throw at it, and it shouldn't be too hard to introduce genericity for an API to accept both strings and raw data.

That being said, I'm not sure how this is actually implemented, I assume there is still some degree of robustness when running methods on strings generated using `unsafe fn from_utf8_unchecked` just by nature of UTF-8's self-synchronization, which may be what the article is pointing out. It's possible that some cleverly optimized UTF-8 algorithms don't need valid data to avoid memory issues / UB that trips the execution of the entire program, and can instead catch the error or perform a lossy transformation on the spot without incurring too much overhead.


Rust's "all strings must be valid UTF-8" rule allows them to remove a bounds check in their "get the Unicode code point corresponding to a byte index in the string" function. Like, if you see a byte in the string that indicates it's the start of a 3-byte character, you don't have to worry about whether you'll have a buffer overrun when you try to read the following two bytes.

As far as I know that's pretty much the only memory-safety-related benefit of that rule, and it's probably a wash since you still need to do that check at some other point in your program to safely construct a &str from non-constant string data.

The impression I get from Rust people is that UTF-8 str is mostly about purity for purity's sake, and any performance improvement is at best a minor side benefit.


Good paper on UTF-8 validation performance: https://arxiv.org/pdf/2010.03090

    The relatively simple algorithm (lookup) can be several times faster than conventional algorithms at a common task using nothing more than the instructions available on commodity processors. It requires fewer than an instruction per input byte in the worst case.


So... the main reason to use Unicode in general, and UTF-8 specifically, is that it's the common denominator of a lot of weird stuff you'd see in the wild.

For example, most Unix platforms allow filenames to be arbitrary sets of bytes, while Windows lets filenames be UCS-2 (i.e. invalid surrogates are supported). Also, both Unix and Windows have some notion of a "local encoding" (LC_ALL etc on Unix, codepages on Windows).

The common denominator, the Schelling point [1], of all of these weird systems is Unicode. Without prior coordination, you can generally assume that other participants in your system would try to use Unicode, and probably with the UTF-8 encoding.

Checking at the boundaries of your program that your inputs are valid Unicode/UTF-8 leads to (a) good error messages when they aren't, and (b) not having to deal with jank internally.

[1] https://en.wikipedia.org/wiki/Focal_point_(game_theory)


Why would "the path name you entered is not valid Unicode/UTF-8" be a good error message, if the path actually exists on the system?

Also, what does "jank" mean here? What do you gain by treating file names as Unicode instead of byte sequences, for the majority of programs that don't even need to display the name, except perhaps in logs?

The way I see it, Unicode is only relevant for displaying strings to humans, or for taking input from humans directly. For virtually all other purposes, strings should be treated as byte sequences internally, regardless of whether they were intended to be UTF-8 or something else. For example, if you're reading a JSON document and looking for a hardcoded key, there's no reason whatsoever to represent the JSON or the key as Unicode. The key is a sequence of bytes, the JSON objects have sequences of bytes as keys. The fact that JSON usually prefers UTF-8 is of relatively little relevance.
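A sketch of that byte-level JSON lookup in Python (with the caveat that a producer could have \u-escaped the key, which a pure byte search would miss):

```python
raw = b'{"name": "caf\xc3\xa9", "size": 3}'   # UTF-8 JSON, never decoded
# A hardcoded key can be located as raw bytes; no UTF-8 validation required.
print(b'"name"' in raw)   # True
```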


> Why would "the path name you entered is not valid Unicode/UTF-8" be a good error message, if the path actually exists on the system?

Some programs have to cope with arbitrary file names. (So yes, cp shouldn't require filenames to be UTF-8.) The vast majority don't.

I maintain a Rust crate called camino [1], the readme for which outlines the general philosophy. The fact is that simply enforcing that the file paths you deal with are always UTF-8 greatly simplifies a lot of code.

> Also, what does "jank" mean here? What do you gain by treating file names as Unicode instead of byte sequences, for the majority of programs that don't even need to display the name, except perhaps in logs?

If you ever have filenames in a text file, how do you match them up with filenames on disk? If you try to support the full space of filenames that can possibly exist on a platform, there is no general, cross-platform solution for doing so. (This is known as the "makefile problem", and if you find the right wiki page you'll see a large table exploring all the possibilities and their tradeoffs.) And if you start pulling that thread, you'll unravel a very large number of problems trying to handle non-Unicode output in reality.

But you can simply cut the knot by restricting filenames to Unicode, and most programs should do that.

For example, I work at Oxide. Why would any of our services want any internal filenames to be anything but UTF-8 (or really ASCII)? Trying to support weird filenames is unnecessary complexity. So we just use camino.

[1] https://crates.io/crates/camino/


I would say loads of Unix tools have a much bigger problem with files that contain whitespace, and especially newlines, than they do with the character encoding. Similarly, you can usefully process a lot of file types if you can safely assume only the encoding of the special characters for that format, like {, ", [ and newline for JSON.

This is why I don't get what you mean by "text file" in this context. Obviously it's hard, if not impossible, to meaningfully interpret any part of a random text file as a file name, regardless of encoding. But if you have a text file in some known structured format, it shouldn't be a significant problem at all, as long as you know the encoding of those special characters and have some basic conventions. In particular, the agreement could be that the filenames will be represented as raw bytes except for format-specific escapes (like escaping " in JSON or > in XML), then the file name part need not even fully match the intended encoding of the rest of the file. It's true though that it's not very easy to work with a byte array that has different encodings in different parts.

On the other hand, I fully agree that it's a good idea to restrict things to simple subsets of characters if you can get away with it. I just don't think that restricting to "all of Unicode" is particularly useful. Restricting to a subset of ASCII or even just to the BMP does have meaningful advantages, if it's an option for a particular domain.


> Restricting to a subset of ASCII or even just to the BMP does have meaningful advantages, if it's an option for a particular domain.

This is definitely appropriate in some cases, but for Rust specifically gets in the way a lot. For example, &camino::Utf8Path and &str have transparent conversions both ways, in a way that users rely on heavily (passing in a string into a function that takes an AsRef<Path> or AsRef<Utf8Path> is extremely common). If you introduced, say, AsciiPath, there would be a lot more friction -- you couldn't just pass in an arbitrary string and treat it as an AsciiPath.

Again, Schelling point -- without prior coordination you can assume that folks are using strings.


Your proposed scheme works on Unix, but importantly, not on Windows. This is exactly the makefile problem.


Why serde1 and proptest1 as opposed to say just serde and proptest or serde-camino and proptest-camino?


Because there can be a serde 2 or proptest 2 in the future, but camino's API surface is relatively small and pretty rigid so there will never be a camino 2.

If camino's MSRV was more modern (1.60 I think?) Rust I'd remove the `serde` and `proptest` features entirely via the `dep:` syntax. (Come to think of it, it may be worth bumping the MSRV for that! Would want to look at some data, and maybe in a few months -- camino deliberately has an ancient MSRV as a foundational crate.)


So if Serde goes from 1.0.202 to 2.0.202 you're going to have both a serde1 feature and a serde2 feature?

Is this just an oxide style guide thing to include a major version in the feature name?


Yes, we'd have both serde1 and serde2 features.

Published this library years before I started at Oxide.


> The way I see it, Unicode is only relevant for displaying strings to humans, or for taking input from humans directly. For virtually all other purposes, strings should be treated as byte sequences internally, regardless of whether they were intended to be UTF-8 or something else. For example, if you're reading a JSON document and looking for a hardcoded key, there's no reason whatsoever to represent the JSON or the key as Unicode. The key is a sequence of bytes, the JSON objects have sequences of bytes as keys. The fact that JSON usually prefers UTF-8 is of relatively little relevance.

This is true (although Unicode is not the best character set, but that is a separate issue), although in the case of JSON, being treated as Unicode is relevant because of the escape codes that can be used in JSON string literals (although this does not make it necessary to validate UTF-8; it only makes it necessary to encode UTF-8 when an escape code is encountered).
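A small Python illustration of that point: the escape is defined in terms of codepoints, so materializing it means encoding to the output's encoding:

```python
import json

# The literal characters '\u00e9' in the JSON text denote codepoint U+00E9 ...
s = json.loads('"caf\\u00e9"')
print(s)                    # café
# ... which must then be encoded when producing UTF-8 output:
print(s.encode("utf-8"))    # b'caf\xc3\xa9'
```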

Furthermore, when displaying text only for writing to a file, or to a terminal which is assumed to already have the correct character encoding (if you do not need to deal with alignment and stuff like that), you do not need to worry about UTF-8, and in fact it is better that you don't; then it will use the same character encoding that it already is and will already be correct, whether it is UTF-8 or not (and you can avoid unnecessarily wasting time with validating UTF-8). (A program might require though that it is valid ASCII or extended ASCII (so e.g. UTF-16 will not work), but shouldn't need to care what the non-ASCII bytes mean.)

Unfortunately, some programming languages make it difficult.


That's why paths in Rust don't have to be UTF-8. See Path and OsStr documentation.


I wish Swift grapheme oriented strings had been in the comparison. I think he blows off the grapheme level too quickly.


Raku also has a distinctive (and a bit bizarre) string type for including graphemes. It builds a lookup table of grapheme clusters, and represents them in memory as negative i32s.

My Hot Take on this is that finding grapheme clusters is important (Julia has this built in, last I checked it was a crate in Rust), but O(1) access to each grapheme is less important.

Except for emoji-heavy applications, such as tend to be written for phones. So, good choice for Swift.


Interesting. I've been noodling around with a glyph string design that sounds a bit like Raku's. I'll have to take a look at it. Thanks!


> It builds a lookup table of grapheme clusters, and represents them in memory as negative i32s.

Only for those grapheme clusters that do not have a representation in Unicode!

Also, these negative i32s are really an implementation detail.

> but O(1) access to each grapheme is less important

Unless you want regexes to be a. correct in the unicode world, and b. be performant


> Only for those grapheme clusters that do not have a representation in Unicode!

I think it's reasonable to consider a grapheme cluster composing one codepoint to be a codepoint, not a grapheme cluster. One grape is not a cluster of grapes.

> Also, these negative i32s are really an implementation detail.

What a coincidence! I was explicitly discussing the implementation.

Oh, is this the thing where some people pretend that Raku is different from Rakudo? Fine. Pretend I said Rakudo.

> Unless you want regexes to be a. correct in the unicode world, and b. be performant

I work extensively on low-level pattern matching code. So I can say with considerable confidence that blowing up every string to take up four bytes per codepoint or grapheme cluster is not the only way to make regex correct in the unicode world, nor is it necessarily the best, or even helpful. The assertion that a regex search on a blown-up and custom-tailored string is going to be more performant than performing that search on the native UTF-8 representation of the string is hard to justify. It seems evident to me that it would be less so, by default.

Furthermore, I'm unsure how O(1) access to anything could aid regexen, since using them is O(n) by definition.

I think Raku is an interesting language and that people should check it out, to be clear. That doesn't mean I agree with every choice the Rakudo implementation has made.


> I think Raku is an interesting language and that people should check it out, to be clear

I agree :-)

> That doesn't mean I agree with every choice the Rakudo implementation has made

Indeed. Some of these choices have their roots in the late 1990's / early 2000's. Some of them make less sense now than they did then. FWIW, these are continuously evaluated by the current core team, to continue to improve Rakudo.


The problem with graphemes is that they are kind of meaningless for most uses since grapheme clusters can be rendered as single glyphs.


"98% of web sites are encoded in UTF8"

And quite a few web sites claim to be encoded in UTF8 and serve latin-1. It is best to check, or at least to specify error handling on your decoder.


My guess is that the reality is "98% of websites are valid UTF-8 documents". A large portion contain only ASCII, so they happen to be valid UTF-8 as well, indistinguishable from truly UTF-8-encoded ones until they break.
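The two cases are easy to demonstrate in Python (sketch): pure-ASCII bytes decode identically under both labels, while a single Latin-1 é breaks the UTF-8 reading:

```python
ascii_page = b"<html>hello</html>"
# ASCII is a common subset: both decodings agree, so the label is untestable here.
print(ascii_page.decode("utf-8") == ascii_page.decode("latin-1"))   # True

latin1_page = "r\u00e9sum\u00e9".encode("latin-1")   # é as the single byte 0xE9
try:
    latin1_page.decode("utf-8")
except UnicodeDecodeError:
    print("mislabeled page breaks on the first non-ASCII byte")
```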


I agree. Validating UTF-8 will waste processing time, as well as not work well with non-Unicode text; and (like it says in the article) often you should not actually care what character encoding (if any) it uses anyways. Furthermore, it is often useful to measure the length or split by bytes rather than Unicode code points anyways.
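In Python, for instance, the two lengths diverge as soon as non-ASCII text appears:

```python
s = "caf\u00e9"                 # "café"
print(len(s))                   # 4 codepoints
print(len(s.encode("utf-8")))   # 5 bytes: é encodes as two
```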

Unicode string types are just a bad idea, I think. Byte strings are better; you can still add functions to deal with Unicode or other character codes if necessary (and/or add explicit tagging for character encoding, if that is helpful).

Many programming languages though make it difficult to work with byte strings, non-Unicode strings, etc. This often causes problems, in my experience, unless you are careful.

Unicode string types are a problem especially when used incorrectly, since if used in a library they can even be exposed to applications that call it even if they do not want it and even if the library doesn't or shouldn't really care. GOTO is not a problem; it is good, because it does not affect library APIs; even if a library uses it, your program does not have to use it, and vice-versa. Unicode string types do not have that kind of benefit, so they are a much more significant problem, and should be avoided when designing a programming language.

(None of the above means that there is never any reason to deal with UTF-8, although usually there isn't a good one. For example, if a file in ASCII format can contain commands which are used to produce some output in a UTF-16 format, then it makes sense to treat the arguments to those commands as WTF-8 so that they can be converted to UTF-16, since WTF-8 is the "corresponding ASCII-compatible character encoding" of UTF-16. Similarly, if the output file is JIS, then using EUC-JP would be sensible.)


Personally, I prefer Ruby's behavior of explicit encoding on strings and being very cranky when invalid codepoints show up in a UTF-8 string.

If you want to ignore invalid UTF-8, use String#scrub to replace heretical values with \uFFFD and life is good. :)
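Python's closest analogue to String#scrub (a sketch, not Ruby) is decoding with errors="replace", which substitutes U+FFFD for each invalid byte:

```python
dirty = b"ok \xff\xfe here"                      # two bytes that are never valid UTF-8
print(dirty.decode("utf-8", errors="replace"))   # 'ok \ufffd\ufffd here'
```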


life is good until you try to read a file path that is not a valid string and discover that you read the wrong file.


Definitely and that’s what ASCII-8BIT is for. You’ll get out exactly what you put in. :)

Also, the “life is good” was somewhat sarcastic.


As a small nit, the reason Rust has pairs of types (String vs &str) is not due to mutability, but to allow references to substrings.

A String is an allocated object, with a location and length. A str is a sequence of characters "somewhere", which is why you can only have references to str, never an actual str object. CStr and OsStr are similar. You could use &[u8] instead of any of them, but the stronger types enforce some guarantees on what you'll find in the sequence.


> In Rust, strings are always valid UTF8, and attempting to create a string with invalid UTF8 will panic at runtime:

> [piece of code explicitly calling .unwrap()]

You misspelled "returns an error".

It might be worth considering Python, where the most central change from 2 to 3 was that strings would now be validated UTF-8. I don't understand why it gets discarded with "it was designed in the 1990's" when that change happened so recently.


There are countries where Python 3's stable release is old enough to legally get married, and this change was being planned since 2004. It's not that recent!

This is the oldest concrete plan I can quickly spot: https://github.com/python/peps/blob/b5815e3e638834e28233fc20...


Julia is not so different, implementation started in 2009 (I don't know how far before it was planned). Why is it new and Python 3 old?

https://github.com/JuliaLang/julia/commit/a9cbc036ac62dc5ba5...


Python strings are Unicode, not utf8, at least until encoding time. At which point they become bytes.


I'm going to define a "Unicode string" as Rust does: a sequence of USVs / content that can be validly represented as UTF-8. Thus, no, sadly, Python's strings are not Unicode, as they're a sequence of Unicode code points. Because of that,

  a_string.encode('utf-8')
… can raise in Python. For example:

  In [1]: '\uD83D'.encode('utf-8')
  ---------------------------------------------------------------------------
  UnicodeEncodeError                        Traceback (most recent call last)
  Cell In[1], line 1
  ----> 1 '\uD83D'.encode('utf-8')

  UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 0: surrogates not allowed
(The underlying encoding of str in Python these days is either [u8], [u16], or [u32], essentially, depending on the value of the largest code point in the string. So, for some values, e.g., 'hello world', the underlying representation is UTF-8, essentially.)


Surrogate pairs are not allowed in UTF8 (only in UTF16), so this error is not surprising. It must be decoded from UTF16, then reencoded to UTF8.

The in-memory representation is an implementation detail, and can be anything that works, as you described.


> Surrogate pairs are not allowed in UTF8 (only in UTF16), so this error is not surprising. It must be decoded from UTF16, then reencoded to UTF8.

The source code here is whatever your on-disk .py file is, likely ASCII, so UTF-8. No re-encoding is required, in practice or mentally.

> The in-memory representation is an implementation detail, and can be anything that works, as you described.

The in memory representation is not really the issue; it's the type itself. Consider a type is a set of possible values: bool, for example, is the set {true, false}. u8 is a set containing {0, 1, … 255}, and things like strings are infinite sets.

The set of values that `str` represents is not the same set as what UTF-8 can encode / is not the same set of values as a type that encodes any possible sequence of USVs. The set of all possible `str` instances in Rust != the set of all possible `str` instances in Python.

Rust is pretty unique in this regard: I don't think I know of another language whose string type is a sequence of USVs. (Though I assume there is likely one out there.)

It is worse, too: there's JS's string type, for example, which is a sequence of UTF-16 code units, which is different yet! So (JS string) ≠ (Python str) ≠ (Rust str)
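A small illustration of the difference in plain std Rust, using the same surrogate code point as the Python example above:

```rust
fn main() {
    // 0xD83D is a surrogate code point: a legal value in Python's
    // str, but not a Unicode scalar value, so Rust refuses it.
    assert_eq!(char::from_u32(0xD83D), None);

    // 0xED 0xA0 0xBD is the (ill-formed) UTF-8 encoding of that
    // surrogate; str::from_utf8 rejects it as well.
    assert!(std::str::from_utf8(&[0xED, 0xA0, 0xBD]).is_err());

    // A non-surrogate code point is fine.
    assert_eq!(char::from_u32(0x1F604), Some('\u{1F604}'));
}
```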


On disk source-code is not relevant. To normalize surrogate pairs for later export to UTF8, they must be decoded first:

    >>> '\ud83d\ude04'.encode('utf-16', 'surrogatepass').decode('utf-16')
    ''
    >>> _.encode('utf8')
    b'\xf0\x9f\x98\x84'
(The happy face emoji was removed by HN, run line in terminal to see it.)

In other words, surrogate pairs are only valid in the context of UTF-16 encoded bytes and nowhere else. If you're adding them into your program by hand on purpose, you're either doing it wrong or doing it specifically to raise an error for illustration purposes.


I'm well aware that surrogates are only valid in the context of UTF-16's encoding.

> If you're adding them into your program by hand on purpose, you're either doing it wrong or specifically to raise an error for illustration purposes.

In the comments above, it should be obvious that it is for illustrative purposes. The point is that the string type does not catch these errors. It's not that someone is going to "add them to the program … on purpose" … it's that they're going to slip in from bad data, or bad code, not on purpose but because people write bugs and generate garbage data.

A good type does not let you represent illegal states. For the same reason null is a bad idea — here, it's not a valid unicode string.

The point of a Unicode string type would be to expose operations that one might want to perform on Unicode strings. By permitting invalid values to exist in the set of values the type can represent, you're basically making every operation fallible or GIGO.

For example, take a function that would iterate over the string and yield the USVs within it — a borderline trivial operation, but basically a fundamental building block of any higher Unicode function we might want — it cannot process the example string from my previous comment without either raising, or yielding garbage/invalid USVs.

Make illegal values actually illegal, and "iterate over the USVs" becomes an infallible method.
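For illustration, that infallibility falls out of the type in Rust: `str::chars` cannot fail, precisely because invalid bytes can't get into a `&str` in the first place. A minimal sketch:

```rust
fn main() {
    // Because a &str can only contain valid UTF-8, iterating over
    // its Unicode scalar values cannot fail and cannot yield garbage.
    let s = "h\u{00E9}llo";
    let usvs: Vec<char> = s.chars().collect();
    assert_eq!(usvs.len(), 5);
    assert_eq!(usvs[1], '\u{00E9}');

    // With arbitrary bytes, the same operation is fallible: you must
    // first decide what an invalid sequence "means".
    let bytes: &[u8] = &[b'h', 0xFF, b'i'];
    assert!(std::str::from_utf8(bytes).is_err());
}
```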

This is no different than a bool that permits representing a "2", where a "2" has no well-defined meaning. You're just breeding bugs, at that point.


Python doesn't do runtime type checking (especially internal ranges) and never will. It does find the error at the first invalid operation, just not as early as you'd like perhaps.

If this is a non-theoretical problem at your end, a stricter language might well be called for.


> Python doesn't do runtime type checking

I don't know what you're defining "runtime type checking" as, but it absolutely does. You can't stuff a bool into an int, you can't stuff any arbitrary byte into a str. For the most part, it's strongly typed, but str falls down by allowing non-Unicode values into its set of values.

> If this is a non-theoretical problem at your end, a stricter language might well be called for.

… Python is plenty well suited to the task, if not for bugs in the implementations of its types.


> As always, immutability comes with a performance penalty: Mutating values is generally faster than creating new ones.

I get what they're saying, but I'm not sure I agree with it. Mutating one specific value is faster than making a copy and then altering that. Knowing that a value can't be mutated and using that to optimize the rest of the system can be faster yet. I think it's more likely the case that allowing mutability comes with a performance penalty.


It's easy to see that altering an entry in an array (size N) is O(1) with mutability and at least O(log N) with immutability, and that affects many algorithms. Altering any small part of a larger data structure has similar issues. In the end, many algorithms gain a factor of log N in their time complexity.


Right, but look at the bigger picture. Immutability removes a whole lot of order of operations concerns. That frees a smart compiler to use faster algorithms, parallelization, etc. in ways that might not be safely possible if the data might change in place. Yes, that may mean it's slower to deal with individual values. It may also mean that the resulting system can be faster than otherwise.


Well, language benchmarks fairly uniformly show that to be untrue in general. None of the fastest languages have forced immutability. It's not like it's a novel, untested idea.


I... think that's mostly because compilers haven't optimized for that case yet.


I mean, that's kind of misinterpreting their point. The authors are not claiming otherwise.


Honestly, if you don't know that it's valid Unicode then it's not a string at all, but a bytestring.


Even if you knew it was "valid Unicode", that still doesn't gain you much. There is nothing you can do with a "valid Unicode" string without any extra context that you can't do just as badly as with a byte string.

And I'm putting "valid Unicode" in quotes because it's very much possible to have a string composed of well-defined Unicode code points that are nevertheless not a valid string in any meaningful sense, e.g. a string composed exclusively of LTR markers or accents.


In terms of actual issues - i think normalizing to NFC is much more important than validating.


You do have to validate UTF-8 strings:

- You can't just skip stuff if you run any kind of normalization

- How would you index into or split an invalid UTF-8 string?

- How would you apply a regex?

- What is its length?

- How do you deal with other systems that do validate UTF-8 strings?

Meta point: scanning a sequence of byte for invalid UTF-8 sequences is validating. The decision to skip them is just slightly different code than "raise error". It's probably also a lot slower as you have to always do this for every operation, whereas once you've validated a string you can operate on it with impunity.

Love this for the hot take/big swing, but it's a whiff.


- delay validation until normalization

- treat non-utf-8 bytes as bytes

- regexes match bytes just fine: the article has a whole section on ripgrep’s use of bstr

- another section discusses how the length of a string is not a well defined quantity

- the article says you can delay validation until required


This world sucks to write production code in. It's sometimes necessary but it sucks.


>> - You can't just skip stuff if you run any kind of normalization

> - delay validation until normalization

You probably want to normalize if you want to use them as keys in a hash table, which is pretty common. Or serialize them to JSON (or protobufs, or msgpack, etc. etc.), also pretty common. Why? Because you're commonly comparing strings and normalization lets strings that should be equal be equal. Canonicalization also requires sorting which requires normalization, etc. etc.

>> - How would you index into or split an invalid UTF-8 string?

> - treat non-utf-8 bytes as bytes

Well I mean by code point, or grapheme. Does the invalid byte go with the 1st or the 2nd? What about multiple invalid bytes? Will all implementations agree? Also hey, looks like you scanned a string and found some invalid bytes. Looks like you validated a UTF-8 string!

>> - How would you apply a regex?

> - regexes match bytes just fine: the article has a whole section on ripgrep’s use of bstr

They definitely don't [0]. If you're allowing invalid UTF-8 you're giving up Unicode support entirely, by necessity.

>> - What is its length?

> - another section discusses how the length of a string is not a well defined quantity

The article doesn't say that; it says you can know neither the printed length of a Unicode string just from its contents, nor the number of characters in it. It's right about the first, but wrong about the second: a Unicode character is represented by a Unicode code point, so their number is knowable. UTF-8 strings have a specific number of bytes, a specific number of code points, and a specific number of graphemes. You can know any of these counts, but not if the string is invalid, because then all bets are off.
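A quick sketch of those different counts in Rust (std only; grapheme counting would additionally need a segmentation library):

```rust
fn main() {
    // "é" as the single code point U+00E9, vs. "e" plus the
    // combining accent U+0301: both render identically.
    let composed = "\u{00E9}";
    let decomposed = "e\u{0301}";

    assert_eq!(composed.len(), 2);             // bytes
    assert_eq!(composed.chars().count(), 1);   // code points
    assert_eq!(decomposed.len(), 3);           // bytes
    assert_eq!(decomposed.chars().count(), 2); // code points
    // Both are a single grapheme, but counting graphemes needs a
    // Unicode segmentation library; std does not provide one.
}
```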

>> - How do you deal with other systems that do validate UTF-8 strings?

> - the article says you can delay validation until required

My general point is there are very few things you can do without validating. If you're passing them around completely internally, never normalizing, never sending them to another system, never writing them into a file name or a file itself, etc., you can gain a tiny amount of speedup by not validating, but it's a huge risk and not worth it for 99% of programs.

[0]: https://docs.rs/regex/1.10.4/regex/bytes/index.html#syntax


> You probably want to normalize if you want to use them as keys in a hash table, which is pretty common.

Not necessarily. Sometimes this might be useful, but sometimes it will just make it worse.

> Does the invalid byte go with the 1st or the 2nd? What about multiple invalid bytes?

What seems to me the most logical way to do it, if you do not then need to decode it into Unicode code point numbers, is: If the byte is in range 0x80 to 0xBF, then it belongs to the same code point as the previous byte; otherwise, the byte is the start of a new code point. (However, this does not answer the question of what if the first byte is in range 0x80 to 0xBF?)

However, that is usually unnecessary. Splitting a string by Unicode code points is rarely helpful (unless you want to convert to UTF-16 or UTF-32) and is often harmful, anyways. Just use sequences of bytes.
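For illustration only, the splitting rule described above can be sketched in a few lines of Rust. This is a heuristic for possibly-invalid data, not a UTF-8 decoder, and `split_codepoints` is a made-up name:

```rust
// A byte in 0x80..=0xBF is treated as continuing the previous code
// point; any other byte starts a new one. A leading continuation
// byte simply becomes the start of the first piece.
fn split_codepoints(bytes: &[u8]) -> Vec<&[u8]> {
    let mut pieces = Vec::new();
    let mut start = 0;
    for i in 1..bytes.len() {
        if !(0x80..=0xBF).contains(&bytes[i]) {
            pieces.push(&bytes[start..i]);
            start = i;
        }
    }
    if !bytes.is_empty() {
        pieces.push(&bytes[start..]);
    }
    pieces
}

fn main() {
    // "aé" in UTF-8, followed by a stray continuation byte, which
    // (per the rule) glues onto the preceding code point.
    let data = [b'a', 0xC3, 0xA9, 0x80];
    let pieces = split_codepoints(&data);
    assert_eq!(pieces, vec![&[b'a'][..], &[0xC3, 0xA9, 0x80][..]]);
}
```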

> if you're allowing invalid UTF-8 you're giving up Unicode support entirely, by necessity.

It is often useful to do regular expressions on sequences of bytes instead of on sequences of Unicode code points anyways; and if you have ASCII only, then doing it this way is more efficient anyways (even if the text contains non-ASCII characters).

(PCRE has a UTF-8-oriented mode and byte-oriented mode, so both are possible.)

Also, you can match a UTF-8 code point (or a character code in EUC-JP or some other multibyte encoding) with a byte-oriented regular expression if you need to, anyways, so still you might not need Unicode regular expressions.

> another section discusses how the length of a string is not a well defined quantity

I think that the number of bytes is generally the most useful measure, although in some contexts it is useful to measure something else.


Your link to the regex crate docs doesn't support what you're saying, as far as I can tell. You can search on `&[u8]` just fine with full Unicode support. Invalid UTF-8 just won't match.

> My general point is there are very few things you can do without validating.

Pretty much all string processing I've built up in Rust over the past ten years is based on the idea of not validating up front. It's too costly to do. There's an entire section in my blog post on bstr about this: https://blog.burntsushi.net/bstr/#motivation-based-on-perfor...

The rest of the blog may be very useful to read as well.


I'm referring to #2 there: "In ASCII compatible mode, Unicode character classes are not allowed."

> Pretty much all string processing I've built up in Rust over the past ten years is based on the idea of not validating up front. It's too costly to do. There's an entire section in my blog post on bstr about this: https://blog.burntsushi.net/bstr/#motivation-based-on-perfor...

Yeah I read it; it's very good! (like everything you write) I think you're right that you can't handwave away the validation overhead, and that in some cases it matters. I just think ripgrep is pretty niche: there's gazillions of websites out there and like, 10 greps. If you're writing a grep and you want to optimize the shit out of it, well welcome to working around not validating. Otherwise, the safety tradeoffs and weirdo lack of specification aren't worth it.


> I'm referring to #2 there: "In ASCII compatible mode, Unicode character classes are not allowed."

But that doesn't throw out Unicode support. ASCII compatible mode is literally defined as "Unicode support is disabled and the atom of matching is the individual byte instead of the codepoint." But that only applies to the region of the pattern in which Unicode mode is disabled. Searching on `&[u8]` doesn't automatically enable ASCII compatible mode. It merely makes it possible to enable it in contexts that would otherwise match invalid UTF-8.

For example, you cannot search for `(?-u:[^a])` on `&str` because `(?-u:[^a])` matches any individual byte except for the ASCII `a`. That includes things like `\xFF` or even a prefix of a valid UTF-8 encoded Unicode scalar value. But if you search on `&[u8]`, then constructs like `(?-u:[^a])` are allowed. But disabling Unicode mode when searching a `&str` is still allowed, but only if it's still guaranteed to match valid UTF-8. For example:

    $ echo 'ſ' | rg '(?i-u:s)'
    $ echo 'ſ' | rg '(?i:s)'
    ſ
But when searching `&[u8]`, Unicode mode is still enabled by default. `rg` searches `&[u8]`, and you can see right above that it's still doing a Unicode aware case insensitive search.

> Yeah I read it; it's very good! (like everything you write) I think you're right that you can't handwave away the validation overhead, and that in some cases it matters. I just think ripgrep is pretty niche: there's gazillions of websites out there and like, 10 greps. If you're writing a grep and you want to optimize the shit out of it, well welcome to working around not validating. Otherwise, the safety tradeoffs and weirdo lack of specification aren't worth it.

It's not just grep. It's pretty much any tool that wants to deal with arbitrary file content where it's useful.

To be clear, I am not staking a position like the OP where I'm saying "conventionally UTF-8" instead of "required UTF-8" is always better or should be the preferred design in a primitive string data type. My own personal bias is certainly towards that, but as someone who is also a steward of Rust's standard library, I do not limit myself to the world of grep or Unix command line tools.

My point here is not that you're wrong about requiring UTF-8 validation as a better design, but rather, that you aren't quite getting the details right when talking about the downsides of conventional UTF-8.

I'm not clear on what you mean by "safety trade-offs" or "weirdo lack of specification." The Unicode consortium has defined rules for dealing with invalid sequences of bytes. bstr's API docs have an entire section about this[1], and its behavior even comes from a W3C standard.

The difference between "conventional UTF-8" and "required UTF-8" is not really one of safety or lack of specification. In either design, you can expose the same logical semantics. The real differences tend to come in two flavors. First is that when you have a UTF-8 guarantee and your representation is UTF-8, you can do UTF-8 decoding potentially faster than you could otherwise (safely) do because you can eliminate certain conditional branches and error checking. (But this advantage is rendered moot if validation forces you to do two passes over a string that is large.) The second benefit is that it forces a stronger discipline of discarding junk data at the boundaries. It makes failure modes as a result of bad data more explicit. It is this second benefit that I think is the most compelling advantage personally, and why I think, on balance, Rust's design is better. But it's not obviously The Correct Choice.

[1]: https://docs.rs/bstr/latest/bstr/#handling-of-invalid-utf-8


> But that only applies to the region of the pattern in which Unicode mode is disabled.

Yeah I mean, fair but I'm not impugning bstr or byte-based regexes. Broadly my worry here is people will say, "I'll be cool; I'll use &[u8] everywhere and skip validation" and then run into weirdness when using regexes because the behavior is subtly different, or some features are missing, especially when other stuff is in the mix like invalid UTF-8 or not normalizing. I guess I can file it under "if you're using &[u8] regexes you probably know what you're doing" though.

> It's not just grep. It's pretty much any tool that wants to deal with arbitrary file content where it's useful.

I do think there's a broader discussion to be had about this like, is it The Correct Choice for a language to enforce an encoding (and implicitly validation) in their primitive string type? Most languages I'd say yeah; it's the kind of thing you should abstract away, the safety benefits outweigh the performance costs, and you can always provide some kind of trapdoor (Python's BytesIO or w/e, etc.). I'm not sure about systems languages; feels you're at least asking everything that deals with (let's just say) &str to also deal with &[u8] in order to preserve choice, which is at least tedious for library authors. Dunno what the right mechanism here is but, I'm definitely sympathetic.

> you aren't quite getting the details right when talking about the downsides of conventional UTF-8

Ha well that's certainly possible :)

I might be throwing "safety" around too carelessly; really what I'm saying here is "if you validate, your program might be a very small bit slower but you'll avoid spreading the virus of invalid UTF-8 through your system and others'" There are a lot of benefits to that -- a lot of other programs will still validate by default, so if they read output from a program that isn't validating, they might choke or behave in subtly bad ways leading to all kinds of problems ranging from junk on the screen to some kind of security bypass because there's now invalid UTF-8 in a JWT. That's bad! Why would you risk it?

You described the benefits really well there (I've written a Unicode-aware string library in C and come to the same conclusions) but, again I'm sympathetic to other totally valid use cases where you're like, "but my data is in UTF-16, a perfectly valid format that's very very popular, and you want me to convert it _every time_?" Tough to take, again especially if you're dealing with lots of data.


Aye. Here's another example. This isn't necessarily responding to any specific point in your comment, but something to mull over. I'm working on a new datetime library in Rust. One of the main features of a datetime library is the ability to parse datetimes out of strings. The standard way to do this in Rust is with the `FromStr` trait, which requires that you pass in a `&str`.

But there's really nothing about datetime parsing that requires a `&str`. And so if you just do what the "standard" thing in Rust is (use a `&str`), then you're automatically requiring that every single datetime parse go through a separate UTF-8 validation step first, which is required in order to get a `&str`. This puts an unavoidable hit on perf that isn't actually necessary for the task at hand.

Imagine for example parsing a CSV file. The `csv` crate pushes you toward using `&str` for everything, but you can also parse CSV data as fields of `&[u8]` instead. Skipping that UTF-8 validation will most definitely speed things up. Now imagine you want to parse a datetime from a field. Is it better to just parse your `&[u8]` field directly, or should I make you do UTF-8 validation to get a `&str` first? IMO, the correct answer here is to define a way to parse datetimes out of a `&[u8]`, while pushing folks in the API toward using `&str` for "most" cases.

The hitch is that if you want to parse with `&[u8]`, you're now working outside the domain of standard Rust strings. And things become just a little trickier. It's not the paved path that `&str` is. And this all comes from the fact that Rust strings required UTF-8.

The great thing about parsing `&[u8]` is that you can reuse that implementation for parsing from `&str` too. But you really do have to know to do the implementation on `&[u8]` first. Unless you know all the nuances of required UTF-8 versus conventional UTF-8, it's not obvious. Because of course, the standard party line is, "use strings and strings are always UTF-8, why use anything else?" That question is I think where the OP lives, and getting more folks to think in a nuanced way about this topic is probably a good thing. :-)
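A sketch of that pattern, with a made-up `parse_year` rather than any real datetime library's API: implement on `&[u8]` first, and the `&str` version comes for free, since every `&str` is also valid bytes.

```rust
// Parse a 4-digit year prefix from raw bytes. No UTF-8 validation
// is needed; invalid bytes simply fail to be ASCII digits.
fn parse_year(input: &[u8]) -> Option<u16> {
    if input.len() < 4 {
        return None;
    }
    let mut year: u16 = 0;
    for &b in &input[..4] {
        if !b.is_ascii_digit() {
            return None;
        }
        year = year * 10 + (b - b'0') as u16;
    }
    Some(year)
}

// Free reuse for the string-typed path: as_bytes() is zero-cost.
fn parse_year_str(input: &str) -> Option<u16> {
    parse_year(input.as_bytes())
}

fn main() {
    assert_eq!(parse_year(b"2024-06-20"), Some(2024));
    assert_eq!(parse_year_str("1999"), Some(1999));
    assert_eq!(parse_year(b"20"), None);
}
```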


I'm pretty well convinced at this point that specifying an encoding for a system language's string type is probably too constraining--and like you point out, often commits the sin of requiring unnecessary processing.

Your CSV example here is a pretty good one: you either need heuristics or some kind of OOB data to "know" if you've got an encoding that can possibly represent a CSV file. If you've got that, you can always at least convert to a compatible encoding, e.g. UTF-8. If you don't, you can either bail or cross your fingers and hope that it'll work out (this is perfectly reasonable in many cases). Regardless, if you want to give programmers all of these options with zero cost you need a fully functional encoding-agnostic byte string type.

I do think you've done the legwork to improve the situation though? It would be cool if Rust incorporated bstr into std; I think that would give library authors (and maybe std authors) confidence in working on a lower level interface that didn't guarantee/require valid UTF-8.

> Because of course, the standard party line is, "use strings and strings are always UTF-8, why use anything else?" That question is I think where the OP lives, and getting more folks to think in a nuanced way about this topic is probably a good thing. :-)

100% agree. Honestly, re-reading OP it's clear I got really hung up on the bullet points in the middle ("Do you know how to render it" etc.), but it is really about UTF-8 in programming language string types. And hey, empirically it worked haha :)


Yeah I think there is some desire for bstr to come into std. But... it's like... tricky. We're steadily adding "string-like" methods to `&[u8]`, but for example, we still lack substring search on `&[u8]`. And lots of other things too. And there are things to consider like the fact that std has this elaborate `Pattern` abstraction (which is itself unstable, but still permits ergonomic polymorphic use of routines that are generic over it). Does that also need to get moved to `&[u8]`? And then there are things like `FromStr` that std implements for lots of types like integers and floating point. So to use those parsing routines, you need to UTF-8 validate to `&str` first. And finally, another nice point about bstr is its `BString` and `BStr` types, which serve as a "target" for trait impls that don't necessarily make sense on a generic `Vec<T>` and `&[T]`. Which is maybe fine for a crate, but it's kind of a clunky solution to the problem.

I think if `&str` was conventionally UTF-8, then a lot of this bifurcation and pain wouldn't exist. Because you could just zero-cost convert a `&[u8]` to a `&str` in safe Rust without any hitch. And then you get every method available on `&str` for free given a `&[u8]`.


Yeah that's probably too much of a slog. It's probably the case that very few things would break if you just snuck the validation out the back door and added like u8str (or whatever) that did the validation for people who want that enforced by the type system. Would be real easy to provide migration tooling too.


Regexes normally match characters, not bytes.


There's nothing about regexes that makes them any better or worse at matching characters rather than bytes (or digits or any other fixed alphabet).


You need to know about Unicode to use Unicode regexes; if you want to pass invalid UTF-8 into these engines you have to give up Unicode regex support; i.e. how would you match "\p{East Asian Width:Narrow}" on an invalid UTF-8 string?


It just doesn't match.


Maybe! Or maybe another tool is like, "oof that's some junk in there, let's pop that out"


I'm the author of Rust's regex engine. Treat me as an oracle. What is your specific concern?

I would really strongly recommend you read my bstr blog post first though. It should answer a lot of your questions and I hope presents a nuanced data driven analysis.


> I'm the author of Rust's regex engine. Treat me as an oracle. What is your specific concern?

I know and I'm terrified haha.

That's a very generous offer; I don't disagree w/ your post or OP's basic premise that validation isn't absolutely necessary. I also think it's perfectly reasonable for regexes to not match on invalid bytes (what, we should just ignore stuff now? Total madness). Like, if you don't like it, consider living less dangerously and validating!

My overarching worry is that a lot of engineers will read this and think, "validation is indeed for wimps" and suddenly a lot of invalid UTF-8 will end up in databases, tape backups, and S3 tripping us up for years to come. I probably could be mollified with a small title tweak like "In some cases (and you'll know if you have one) you can intentionally not validate UTF-8 strings, but throw some salt over your shoulder first." I also think there's plenty of weird interactions like with normalization or basic string manipulation people should think about before walking down this road.

> I hope presents a nuanced data driven analysis

I was actually sold early on when you pointed out that filesystems are creepy attics full of exotic data; perfect use case for not assuming/forcing any kind of string encoding and a really good blueprint for applying solid engineering principles to give the people what they want.


I think I would point to Go here as a good data point for mollification. Its `string` data is conventionally UTF-8. Go has been a huge player in web services over the last decade and a half, and I'm not sure I've seen a rise in the problem of junk data.

The people behind Go were also the people behind UTF-8 (Ken Thompson and Rob Pike). Go's `string` data type is thoughtfully designed. If you iterate over the characters in a string that contains invalid UTF-8, you get replacement codepoints, just like in bstr. Go has transcoding and normalization libraries that work just fine too.

I don't think there are much if any weird interactions with normalization or other such things. When you're doing Unicode operations on a conventionally UTF-8 string, you just treat invalid UTF-8 as U+FFFD. That's really all there is to it. Everything else works. Unicode algorithms are defined over codepoints. So all you need is a semantic for how to convert invalid sequences to codepoints. And Unicode (Chapter 3, Section 9) provides exactly that for you.
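In Rust, std's `String::from_utf8_lossy` implements exactly that substitution semantic:

```rust
fn main() {
    // from_utf8_lossy applies the standard substitution rule: each
    // ill-formed sequence becomes U+FFFD replacement characters, and
    // valid content passes through untouched.
    let bytes = [b'a', 0xFF, b'b'];
    let cleaned = String::from_utf8_lossy(&bytes);
    assert_eq!(cleaned, "a\u{FFFD}b");
}
```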

Or in some cases, you can just leave the invalid UTF-8 as-is. The last example in bstr's docs for its "to_lowercase" routine[1], for example, demonstrates how you can perform a Unicode-aware algorithm on a string that contains invalid UTF-8.

We've all been burned by mojibake before. I get it. I'm not totally discounting your viewpoint. I do tend to think that "required UTF-8" is probably the better thing for a primitive string data type. But you really do absolutely need to also then support conventional UTF-8. Otherwise you're cutting off a ton of very valid use cases, and potentially forcing others to either take a huge perf hit or to do something dangerous like `unsafe { std::str::from_utf8_unchecked(..) }`. I've seen people do the unchecked variant before just because they had a `&[u8]` but needed it to be a `&str` because they wanted to treat it as a string without doing a second pass over the data. And believe you me, you'd see a lot more of that if I wasn't so careful to provide both `&str` and `&[u8]` APIs for all string oriented operations. ;-)

[1]: https://docs.rs/bstr/latest/bstr/trait.ByteSlice.html#exampl...


The Go thing (+ invalid sequence conversion) is a really good point. If OP's point was "don't validate just convert invalid sequences when you need to" or "think twice before forcing validation into your primitive string type" I'd buy in. I also think your choice there to just write the invalid bytes out unchanged also makes a ton of sense. You're right that there's more options than "yolo no validate" or "validate or face death".

> And believe you me, you'd see a lot more of that if I wasn't so careful to provide both `&str` and `&[u8]` APIs for all string oriented operations. ;-)

Woof man yeah, that's what I was getting at with you're forcing people to basically implement things twice; it's not the best!

This was a really good conversation; before this I was 100% in agreement w/ you on "'required UTF-8' is probably the better thing for a primitive string type" but now I see there's a fair amount of pragmatic issues with that. Definitely a lot to mull over here; thanks :)


Does `.` match one byte or one character?


`.`, which is `(?u:.)` by default, matches the UTF-8 encoding of any Unicode scalar value except for the newline terminator.

`(?-u:.)` matches any individual byte except for the newline terminator.

You can't use `(?-u:.)` when searching `&str` because it could match invalid UTF-8. `Regex::new` will fail if you try. You can only use it when searching `&[u8]` via `regex::bytes::Regex`.


Regex compilers know what a character is, so that "α+" doesn't match one 0xce followed by one-or-more 0xb1.

Regex engines neither know nor care what a character is. They match bytes. Invalid UTF-8 is just another thing they won't match, along with all the valid sequences which also don't match.
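The distinction the parent draws ("α+" meaning one-or-more α vs. one 0xce followed by one-or-more 0xb1) is easy to demonstrate. A hedged sketch using Python's `re` module, which compiles str patterns over code points and bytes patterns over raw bytes:

```python
import re

# Compiled over str, the regex compiler knows "α" is one character,
# so "α+" means one-or-more α.
assert re.fullmatch("α+", "ααα") is not None

# Compiled over bytes, "α" is the two bytes 0xce 0xb1, and "+" binds
# to the last *byte*: the pattern means 0xce then one-or-more 0xb1.
pat = re.compile("α+".encode())            # b"\xce\xb1+"
assert pat.fullmatch("ααα".encode()) is None
assert pat.fullmatch(b"\xce\xb1\xb1") is not None
```

So whether "the engine knows what a character is" comes down to what the compiler emitted: a Unicode-aware compiler turns "α+" into a byte machine that happens to respect character boundaries.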


> Regex engines neither know nor care what a character is. They match bytes.

I think this is mostly not true, for example there's some interesting behavior in POSIX-compliant engines:

"Other than POSIX-compliant engines part of a POSIX-compliant system, none of the regex flavors discussed in this tutorial support collating sequences.

Note that a fully POSIX-compliant regex engine treats ch as a single character when the locale is set to Czech. This means that [^x]emie also matches chemie. [^x] matches a single character that is not an x, which includes ch in the Czech POSIX locale.

In any other regular expression engine, or in a POSIX engine using a locale that does not treat ch as a digraph, [^x]emie matches the misspelled word cemie but not chemie, as [^x] cannot match the two characters ch."

If you were just matching bytes you wouldn't do any of this. These engines clearly know that "ch" is two characters and two bytes under some locales and one character and two bytes under others.

Same stuff goes for the Unicode regexes: using the traditional patterns leads to unexpected results: "In Unicode, à can be encoded as two code points: U+0061 (a) followed by U+0300 (grave accent). In this situation, . applied to à will match a without the accent. ^.$ will fail to match, since the string consists of two code points. ^..$ matches à." [1]. \p{Letter} works correctly here though, again showing that these engines know about more than just bytes.

[0]: https://www.regular-expressions.info/posixbrackets.html#coll

[1]: https://www.regular-expressions.info/unicode.html
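The decomposed-à behavior quoted above reproduces readily in Python's `re`, where `.` matches one code point, not one grapheme:

```python
import re

composed = "\u00e0"      # à as a single precomposed code point
decomposed = "a\u0300"   # a + combining grave accent: two code points

assert re.fullmatch(".", composed) is not None
assert re.fullmatch(".", decomposed) is None    # . matches only one code point
assert re.fullmatch("..", decomposed) is not None
```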


The regex compiler then can have a flag to treat the regex as Unicode or non-Unicode. The regex engine then does not need to care which flag the compiler used.


> How would you index into or split an invalid UTF-8 string?

Indexing/splitting by bytes is usually more useful than by Unicode code points anyways. Often, you would want to split on a substring; if so, it works just as well to treat the strings as sequences of bytes instead of code points: because UTF-8 is self-delimiting, a byte-level substring match can never land in the middle of a character (unlike e.g. Shift-JIS, where it can). Treating them as bytes when splitting on substrings also avoids needing to count the code points every time.
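A small illustration of that self-delimiting property, with Shift-JIS as the counterexample. The katakana "ソ" encodes in Shift-JIS as 0x83 0x5C, and 0x5C is the backslash byte, a classic source of bugs; in UTF-8, an ASCII delimiter byte can never occur inside a multi-byte sequence:

```python
# UTF-8: byte-level split agrees with character-level split.
s = "héllo/wörld"
assert s.encode().split(b"/") == [p.encode() for p in s.split("/")]

# Shift-JIS: "ソ" encodes as 0x83 0x5C, so a naive byte search
# finds a backslash "delimiter" that isn't really there.
sjis = "ソ".encode("shift_jis")
assert b"\\" in sjis
assert "\\" not in "ソ"
```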

> How would you apply a regex?

You can apply byte-oriented regular expressions. (Often you will not need Unicode-oriented regular expressions, and you can still match UTF-8 strings with byte-oriented regular expressions.)
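As a sketch of that last point: a byte-oriented regex can search data that isn't valid UTF-8, and a UTF-8-encoded literal still matches correctly, with byte offsets, thanks to self-delimitation. Using Python's `re` over bytes as an illustrative stand-in for e.g. Rust's `regex::bytes`:

```python
import re

# Mostly UTF-8, plus one stray invalid byte; a str-typed regex would
# require validation/decoding first, but a bytes regex searches as-is.
data = "naïve café".encode() + b"\xff"
m = re.search(re.escape("é".encode()), data)   # pattern is just é's UTF-8 bytes
assert m is not None
assert data[m.start():m.end()] == b"\xc3\xa9"
```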

> What is its length?

It depends on why you want to measure it. Usually the number of bytes is the most useful kind of measurement, it seems to me.
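"Length" genuinely is ambiguous: the same rendered character can have several defensible lengths depending on the unit. A quick demonstration:

```python
s = "a\u0300"                # "à" as a + combining grave accent
assert len(s) == 2            # two code points
assert len(s.encode()) == 3   # three UTF-8 bytes
assert len("\u00e0") == 1     # precomposed à: one code point...
assert len("\u00e0".encode()) == 2   # ...two UTF-8 bytes
```

Both strings display as "à", yet no single "length" answer serves every purpose; byte length is at least cheap and unambiguous.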

> How do you deal with other systems that do validate UTF-8 strings?

It depends. Sometimes it is better to just not deal with such systems (or to fix them if you can). Sometimes you might treat the input as ISO-8859-1 and convert it to UTF-8, or as bytes and convert it to base64, just so that the other system won't complain. In a few cases, doing the validation would be appropriate, though.
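The ISO-8859-1 trick works because Latin-1 maps every byte value 0x00–0xFF to a code point, so the round trip is lossless; any byte string can be smuggled through a system that insists on valid Unicode. A minimal sketch:

```python
import base64

raw = bytes(range(256))                # every possible byte value

# Latin-1 decodes any byte, so decode/encode round-trips losslessly.
assert raw.decode("latin-1").encode("latin-1") == raw

# Base64 is the other lossless option mentioned above.
assert base64.b64decode(base64.b64encode(raw)) == raw
```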

> scanning a sequence of byte for invalid UTF-8 sequences is validating. ... It's probably also a lot slower as you have to always do this for every operation, whereas once you've validated a string you can operate on it with impunity.

Usually you should not need to do this scanning at all; just use the bytes as they are. Of course, you can add the validation step once where needed if you really do need this validation, but usually you shouldn't need it.

It is also notable that you cannot always assume strings are Unicode anyways; Unicode has its own problems, and you might have strings in other character sets (or none at all).


In my opinion, one argument for internally representing `String`s as UTF-8 is that it prevents accidentally saving a file as Latin-1 or another encoding. I would like to read a file my coworker sent me in my favorite language without having to figure out what the encoding of the file is.

For example, my most recent Julia project has the following line:

    windows1252_to_utf8(s) = decode(Vector{UInt8}(String(coalesce(s, ""))), "Windows-1252")
Figuring out that I had to use Windows-1252 (and not Latin1) took a lot more time than I would have liked it to.
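The Windows-1252 vs. Latin-1 confusion is understandable: the two agree except in the 0x80–0x9F range, where Windows-1252 puts printable characters (curly quotes, em dashes) and Latin-1 has C1 control codes. A Python sketch of the difference:

```python
b = b"It\x92s"                              # 0x92: typical Word-style apostrophe
assert b.decode("cp1252") == "It\u2019s"    # Windows-1252: curly apostrophe
assert b.decode("latin-1") == "It\x92s"     # Latin-1: 0x92 is a C1 control char
```

So a file that "mostly works" as Latin-1 but shows garbage where apostrophes and dashes should be is usually Windows-1252.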

I get that there's some ergonomic challenges around this in languages like Julia that are optimized for data analysis workflows, but imho all data analysis languages/scripts should be forced to explicitly list encodings/decodings whenever reading/writing a file or default to UTF-8.


I don't understand how a language runtime is supposed to prevent your colleague from using an unexpected encoding.

Next time you try to load whoops-weird-encoding.txt as utf-8, and get garbage, may I suggest `file whoops-weird-encoding.txt`? It's pretty good at guessing.

There might be a Julia package which can do that as well. I haven't run into the problem so I have no need to check.



