> "Unicode Scalars", aka "well-formed UTF-16", aka "the Python string type" Can ...

csande17 · 2025-08-23T16:34:54 1755966894

I could be mistaken, but I think Python cares about making sure strings don't include any surrogate code points that can't be represented in UTF-16 -- even if you're encoding/decoding the string using some other encoding. (Possibly it still lets you construct such a string in memory, though? So there might be a philosophical dispute there.)

Like, the basic code points -> bytes in memory logic that underlies UTF-32, or UTF-8 for that matter, is perfectly capable of representing [U+D83D U+DE00] as a sequence distinct from [U+1F600]. But UTF-16 can't because the first sequence is a surrogate pair. So if your language applies the restriction that strings can't contain surrogate code points, it's basically emulating the UTF-16 worldview on top of whatever encoding it uses internally. The set of strings it supports is the same as the set of strings a language that does use well-formed UTF-16 supports, for the purposes of deciding what's allowed to be represented in a wire protocol.

MyOutfitIsVague · 2025-08-23T19:20:19 1755976819

You're somewhat mistaken, in that "UTF-32, or UTF-8 for that matter, is perfectly capable of representing [U+D83D U+DE00] as a sequence distinct from [U+1F600]." You're right that the encoding on a raw level is technically capable of this, but it is actually forbidden in Unicode. Those are invalid codepoints.

Using those codepoints makes for invalid Unicode, not just invalid UTF-16. Rust, which uses utf-8 for its String type, also forbids unpaired surrogates. `let illegal: char = 0xDEADu32.try_into().unwrap();` panics.

It's not that these languages emulate the UTF-16 worldview, it's that UTF-16 has infected and shaped all of Unicode. No code points are allowed that can't be unambiguously represented in UTF-16.

edit: This cousin comment has some really good detail on Python in particular: https://news.ycombinator.com/item?id=44997146

csande17 · 2025-08-23T20:09:21 1755979761

The Unicode Consortium has indeed published documents recommending that people adopt the UTF-16 worldview when working with strings, but it is not always a good idea to follow their recommendations.

zahlman · 2025-08-23T16:53:24 1755968004

You're not wrong; I gave more detail in a direct reply https://news.ycombinator.com/item?id=44997146 .