How Python does Unicode (b-list.org)
165 points by scw on Sept 6, 2017 | 136 comments



UCS-4 is essentially never the right choice. It wastes space and thus messes up your cache. UCS-2 can be the right choice if the language you're encoding uses a lot of non-Latin glyphs (e.g. East Asian languages) but suffers from the same problem as UCS-4. UTF-8 is a good default: for most strings it's very compact, and for strings with a lot of multibyte codepoints it doesn't compare too unfavorably with UTF-16.

Python 3 tried to have its cake and eat it too by choosing the most compact encoding depending on the string, but in practice this wastes a lot of space. You'll double (or heaven forbid quadruple) your string size because of a single codepoint, and these codepoints are almost always a small percentage of the string. That's actually why UTF-16 and UTF-8 exist.
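
A rough way to see the effect described above, assuming CPython 3.3 or later (other implementations may store strings differently), is to measure how much each extra character costs with `sys.getsizeof`. The helper name `bytes_per_char` is made up for this sketch.

```python
import sys

def bytes_per_char(s: str) -> int:
    """Estimate the storage width CPython picked for s, in bytes per code point.

    Doubling the string adds len(s) code points at the same width, so the
    size difference divided by len(s) cancels out the fixed object header.
    """
    return (sys.getsizeof(s + s) - sys.getsizeof(s)) // len(s)

ascii_only = "a" * 1000
with_euro = ascii_only + "\u20ac"        # one code point above U+00FF
with_emoji = ascii_only + "\U0001F600"   # one code point above U+FFFF

for label, s in [("ascii", ascii_only), ("euro", with_euro), ("emoji", with_emoji)]:
    print(label, bytes_per_char(s))
```

On CPython the three widths should come out as 1, 2, and 4 bytes per code point, which is the doubling and quadrupling the comment refers to.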

It would have been better for strings to default to UTF-8, and to add an optional encoding so the programmer can specify what kind of encoding to use. As it is now, in order to use (for example) UTF-16 strings in Python you have to keep them around as bytes, decode them to a string, perform string operations, and reencode them to bytes again. Any benefit you get from using UTF-16 vanishes the moment you need to operate on it like a string, in other words.

I get that the idea was to maintain indexing via codepoint, but (again) in practice that's not great: usually you want to index via grapheme -- if you want to index at all.

A better solution is to allow programmers to specify string encoding and default it to UTF-8. From that, there's a clear path to everything you'd want to do.


I've done the math for a few of the programs I've worked on, and the waste was negligible every time. There were a lot of short strings like "reply" that end up being ten bytes longer in UCS-4 than in UTF-8 once you add all the object and allocator overhead, and progressively fewer long strings. Even the string-heavy code I worked on didn't spend much more than ten per cent of its total memory on strings, so having the typical object be 40 instead of 30 bytes wasn't a big deal.

Perhaps I should give an example. Suppose you're parsing and dealing with something. Say HTML since it's well-known. So you receive a long byte array starting "<html><body><p>Sometimes</p>". You parse the byte array and produce a number of objects, including up to four strings, namely "html", "body", "p" and "Sometimes", and by the time you've stored those in objects and allocated them, they occupy 32 bytes each on the heap. If you use UCS-4 the last may need 48 or 64, depending on your allocator's rounding and buckets. The byte array you got from the I/O subsystem may be 100k but most of the strings in the code are short, and the impact of using UCS-4 is moderate.

A more interesting question is whether UCS-4's advantages are worth it. It provides an array of characters, but as the years pass, the code I see does ever less char-array processing on strings. 20-30 years ago the world was full of char pointers, now, not so much. Something like this looks more typical, and doesn't benefit much from UCS-4, if at all: foo.split(" ").each{|word| bar(word) }.


> A more interesting question is whether UCS-4's advantages are worth it. It provides an array of characters, but as the years pass, the code I see does ever less char-array processing on strings. 20-30 years ago the world was full of char pointers, now, not so much. Something like this looks more typical, and doesn't benefit much from UCS-4, if at all: foo.split(" ").each{|word| bar(word) }.

You are looking at the issue from the perspective of a language user, not a language designer. 20 years ago we didn't have languages such as Python/Ruby which had internal multibyte support in their string manipulation functions. 20 years ago string manipulation functions didn't even exist!

But this post is about the design of the language, not the application, and the language is still written in C/C++ and _internally_ stores strings as byte arrays that must be presented nicely to the programmer in that language's string manipulation functions.


> You are looking at the issue from the perspective of a language user, not a language designer. 20 years ago we didn't have languages such as Python/Ruby which had internal multibyte support in their string manipulation functions.

20 years ago was 1997. I'm reasonably certain NSString has been unicode-aware for much longer than that.

> 20 years ago string manipulation functions didn't even exist!

What kind of absolute utter nonsense is that?

> But this post is about the design of the language, not the application, and the language is still written in C/C++ and _internally_ stores strings as byte arrays that must be presented nicely to the programmer in that language's string manipulation functions.

So?


That should have been _30_ years ago string manipulation functions didn't exist.

NSString may have been Unicode-aware (I've never used Objective-C), and I believe that even the early Javas supported multibyte strings, but at that time most business and consumer desktop applications in the Windows world were still written in C/C++. Do you remember when the Euro symbol became common? I'm pretty sure that character alone was responsible for much of the push to support Unicode.


> I get that the idea was to maintain indexing via codepoint, but (again) in practice that's not great: usually you want to index via grapheme -- if you want to index at all.

I definitely need indexes, and I don't really care about graphemes. I actually have only a vague idea what that is.

I write parsers typically by using a global string and lots of indices. The important thing for me is to be able to extract characters and slices at given positions, and to be able to say "parse error at line X character Y" where X and Y are helpful to the user most of the time.

I would be absolutely fine with working in UTF-8 bytes only (and that would be faster I guess), but there would be a more pressing need to recompute character positions (as a code point or grapheme index) from byte offsets at times.

There are more abstract parsing methods where parser subroutines are implemented in a position agnostic way, but I'm very happy with my simple method.

If everything works on graphemes instead of code points (as I think does Perl6) I will be happy to use that, but it's not so important from a practical standpoint.


The issue with your model is that graphemes ultimately don’t behave like you may expect a character to. For example, it may take multiple code points to make a grapheme, so getting the index of any particular one might require walking the string instead of a constant-time access, since any one code point could be “globbed” with its neighbors into a grapheme, in a way that depends on those neighbors.


> I definitely need indexes

No you don't. You need iterators, which behave like pointers. Let's say you're hundreds or thousands of characters into a string at the start of some token. Now you want to scan from that position to the end of the token.

With indexes it works fast only if it's by codepoint. In a language that properly supports graphemes this would mean it would have to scan from the beginning to get to that index.

With iterators it can start scanning from that position directly. Same speed no matter where you are in the string. With indexes the larger your input the slower your parse gets, and not in a linear way.

It's also super easy to get a slice using a start and end iterator. As for line x character y messages, you can't get that directly from an index as it depends on how many new lines you parsed so indexing doesn't help there.


Well, I could roll my own iterator which encapsulates a string and some position information, but then I'd have to wrap a lot of different operations, like advance, advance by n, compare two iterators by position, test for end position, extract character, extract slice, etc.

And the code would get a lot noisier, while the only advantage I see is grapheme support, which I have never needed so far. (And I hope graphemes are actually designed with a similar sensibility for technical concerns as is UTF-8, where I can simply parse with indexes at the byte level, looking only for ASCII characters, without headaches and with maximum performance.)

As for getting line/character from a byte or codepoint offset, that's no problem if I do the calculation only in case of an error. The alternative would be to do it on each advance, which again means ADT wrapping, thus line noise and slower performance.
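
The "only compute it on error" approach from the paragraph above can be sketched in a few lines. This is a minimal illustration, assuming the parser holds the whole input as a Python `str` and tracks positions as code point offsets; the function name `line_col` is hypothetical.

```python
def line_col(text: str, offset: int) -> tuple[int, int]:
    """Return 1-based (line, column) for a code point offset into text."""
    line = text.count("\n", 0, offset) + 1
    last_newline = text.rfind("\n", 0, offset)   # -1 if still on the first line
    return line, offset - last_newline

source = "ab\ncd\nef"
print(line_col(source, 4))   # offset 4 is the "d" on the second line
```

The scan over the prefix is linear, but it only runs when an error message is actually produced, so the hot parsing loop stays free of bookkeeping.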


I'm not advocating that the programmer needs to implement the iterators but that the language/runtime have built in support for them.

As for searching for ASCII, which is prevalent in parsing, the iterator function to find the next specified character can do a low level and fast byte search. That's one of the benefits of UTF-8, searching for ASCII characters is super fast.
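
To illustrate why the byte search is safe, here is a small sketch using Python `bytes`. The sample text is made up; the property it relies on (every byte of a multi-byte UTF-8 sequence is 0x80 or above) is part of the UTF-8 design.

```python
data = "caf\u00e9 \u20ac5 <b>bold</b>".encode("utf-8")

# Every byte of a multi-byte UTF-8 sequence has its high bit set, so a byte
# below 0x80 can only ever be a real ASCII character.
tag_start = data.find(b"<")

# Slicing at the position of an ASCII byte never splits a character,
# so the prefix always decodes cleanly.
prefix = data[:tag_start].decode("utf-8")
print(tag_start, repr(prefix))
```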

You wouldn't have to do the character position on each advance. Just have a beginning of line iterator that's updated every time you see a newline character and on error you do call a function that gives you how many characters between the current position iterator and the start of line iterator.

Working with iterators is no more complex than working with indexes. But it's the language that needs to provide them.


Formal parsers do without indexing, but those rolled by hand often do, for simplicity's sake. I think these cases can still be serviced by permitting indexes, but backing them by a lazily computed index table.
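
One possible shape for such a lazily computed index table, sketched over UTF-8 bytes. The class name `Utf8Text` and its layout are assumptions for illustration, not how any particular runtime does it, and the input is assumed to be valid UTF-8.

```python
class Utf8Text:
    """UTF-8 bytes with code point indexing backed by a lazily built offset table."""

    def __init__(self, data: bytes):
        self._data = data
        self._offsets = None   # built only if someone actually indexes

    def _table(self) -> list[int]:
        if self._offsets is None:
            # A code point starts at every byte that is not a continuation
            # byte (continuation bytes look like 0b10xxxxxx).
            self._offsets = [i for i, b in enumerate(self._data) if b & 0xC0 != 0x80]
            self._offsets.append(len(self._data))   # sentinel: end of data
        return self._offsets

    def __len__(self) -> int:
        return len(self._table()) - 1

    def __getitem__(self, n: int) -> str:
        table = self._table()
        if not 0 <= n < len(table) - 1:
            raise IndexError(n)
        return self._data[table[n]:table[n + 1]].decode("utf-8")

text = Utf8Text("a\u20ac\U0001F600b".encode("utf-8"))
print(len(text), text[1], text[3])
```

Code that only iterates or searches never pays for the table; the first indexed access costs one linear pass, and later accesses are constant time.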


> I get that the idea was to maintain indexing via codepoint, but (again) in practice that's not great

Of course not, but it was considered that breaking O(1) indexing guarantees was a bridge too far even in the breaky release of Python 3.


UTF-16 is a right choice if you are in Asia or outside of the West. UTF-8 is the right choice if you are in America or using the internet.

> A better solution is to allow programmers to specify string encoding and default it to UTF-8.

Agreed. UTF-8 is the sensible default for most people.


> Any benefit you get from using UTF-16 vanishes the moment you need to operate on it like a string, in other words.

So, don't decode to a string, and do all your character manipulation on the bytes.

> A better solution is to allow programmers to specify string encoding and default it to UTF-8.

Absolutely not: the internal representation of a string should be of no interest to a user of your language. The 'best' solution is to represent strings as a list of index lookups into a palette, and to update the palette as new graphemes are seen. This is similar to the approach Perl6 is using[0].

[0]: https://6guts.wordpress.com/2015/12/05/getting-closer-to-chr...
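
A toy version of the palette idea, to make it concrete. This is only loosely inspired by the description above and is not a claim about how Perl 6 implements it. The names `clusters` and `PaletteString` are hypothetical, and the clustering rule (a base code point plus any combining marks that follow) is a deliberate simplification of real grapheme segmentation.

```python
import unicodedata

def clusters(text: str) -> list[str]:
    """Group each base code point with the combining marks that follow it.

    Simplified on purpose: real grapheme segmentation (UAX #29) has many
    more rules (emoji sequences, Hangul, regional indicators, ...).
    """
    out: list[str] = []
    for ch in text:
        if out and unicodedata.combining(ch):
            out[-1] += ch
        else:
            out.append(ch)
    return out

class PaletteString:
    palette: list[str] = []        # index -> cluster, shared by all strings
    _index: dict[str, int] = {}    # cluster -> index

    def __init__(self, text: str):
        normalized = unicodedata.normalize("NFC", text)
        self.ids = [self._intern(c) for c in clusters(normalized)]

    @classmethod
    def _intern(cls, cluster: str) -> int:
        if cluster not in cls._index:
            cls._index[cluster] = len(cls.palette)
            cls.palette.append(cluster)
        return cls._index[cluster]

    def __len__(self) -> int:
        return len(self.ids)

    def __getitem__(self, n: int) -> str:
        return self.palette[self.ids[n]]   # constant-time lookup per cluster

s = PaletteString("e\u0301e\u0301x\u0323\u0301")
print(len(s), s.ids)
```

Because each string is a flat list of fixed-size integers, indexing and length stay O(1) per cluster, at the cost of maintaining the shared palette.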


> So, don't decode to a string, and do all your character manipulation on the bytes.

WHAT?!? I suppose that you've only ever worked with Latin characters. Please show a code example of changing European to African in this sentence in your language of choice, working on the bytes in any multi-byte encoding:

מהי מהירות האווירית של סנונית ארופאית ללא משא?‏

Yes, that is a Hebrew Monty Python quote. Now try it with a smiley somewhere in the string (HN filtered out my attempt to post the string with a smiley).

Is each application to maintain their own dictionary of code points? If the map is to be in a library, then why not have it in the language itself?


I don't understand your complaints. You clearly have some task you have in mind that you wish to perform: why not tell me what it is?

> Please show a code example of changing European to African in this sentence in your language of choice, working on the bytes in any multi-byte encoding:

מהי מהירות האווירית של סנונית ארופאית ללא משא?‏

I don't see the string 'European' in that sentence, it seems to be solely comprised of Hebrew characters.

edit to attempt to answer your question:

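    #include <stddef.h>

    typedef struct {
        size_t start;   /* byte offset of the first matching byte */
        size_t end;     /* byte offset one past the last matching byte */
    } match;

    /* Assumed helper, not shown: replaces str[start..end) with rpl. */
    void splicesn(char *str, size_t start, size_t end, const char *rpl);

    /* Byte-wise substring search. Both strings must be NUL-terminated and
       already in the same encoding and normalization form. */
    int findsn(const char *str, const char *substr, match *m) {
        for (size_t c_i = 0; str[c_i] != '\0'; c_i++) {
            size_t s_i = 0;
            while (substr[s_i] != '\0' && str[c_i + s_i] == substr[s_i]) {
                s_i++;
            }
            if (substr[s_i] == '\0') {   /* reached the end of the needle */
                m->start = c_i;
                m->end = c_i + s_i;
                return 1;
            }
        }
        return 0;
    }

    char *replacesn(char *str, const char *needle, const char *rpl) {
        match m;
        if (findsn(str, needle, &m)) {
            splicesn(str, m.start, m.end, rpl);
        }
        return str;
    }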
splicesn should be obvious, and you normalise your strings before calling replacesn. This is just me crappily re-implementing a fraction of the wchar API without checking MSDN.

edit 2:

> Is each application to maintain their own dictionary of code points?

No, you use the system/standard library for composing/decomposing/normalising codepoints.

> If the map is to be in a library, then why not have it in the language itself?

Why not indeed? What a great idea.


You win on the string replace, that was a bad example. Try a regex replace! But I will also mention that seeing properly indented code with clear identifier names is refreshing where I work!

> Why not indeed? What a great idea.

It sounded to me that you were arguing that string manipulation functions do not need to be included in modern programming languages. You said: "don't decode to a string, and do all your character manipulation on the bytes"


OK, I see how what I said could mean that. What I meant was: if using the language's internal string representation gives poor performance/resource usage, better to avoid it and directly manipulate the undecoded bytes. Most languages allow you to control when loaded data is converted to strings; simply don't convert it, and uh reimplement stdlib functions to work with your preferred encoding.


Except in East Asia with a population of over one billion.


Yeah I think UTF-8 is pretty euro-centric and that doesn't get enough play. Being able to set the default string encoding in your program would do a lot to alleviate that, I wish there were a language that provided it.


My favorite story about Python's handling of Unicode was when one of my coworkers did a hotfix for our Python website, wrote tests, confirmed everything worked as expected... but right before committing and pushing to production wrote a comment like:

# Apparently we expect the field to be in this format ¯\_(ツ)_/¯

Right above the code he'd just fixed.

Of course, the moment we pushed the update it brought production down, because the Python interpreter doesn't understand Unicode in source files unless you specify which encoding you are using.

After that, "¯\_(ツ)_/¯" became a synonym for his name on our HipChat server, heh.


This would be the case in Python 2, where source code files are assumed to be ASCII-encoded unless there's an encoding comment at the top of the file.

In Python 3, source code files are assumed to be UTF-8.
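
The difference can be checked from Python 3 itself by compiling source given as raw bytes. This is a small sketch; the variable names and the sample literal are made up for illustration.

```python
# Source bytes with a non-ASCII comment and literal and no encoding
# declaration: Python 3 decodes them as UTF-8.
utf8_source = "# \u00af\\_(\u30c4)_/\u00af\ns = 'caf\u00e9'\n".encode("utf-8")
ns_utf8 = {}
exec(compile(utf8_source, "<utf8>", "exec"), ns_utf8)

# The same literal stored as Latin-1 needs a PEP 263 declaration
# on the first or second line of the file.
latin1_source = "# -*- coding: latin-1 -*-\ns = 'caf\u00e9'\n".encode("latin-1")
ns_latin1 = {}
exec(compile(latin1_source, "<latin1>", "exec"), ns_latin1)

print(ns_utf8["s"] == ns_latin1["s"])
```

In Python 2 the first file would have needed a `# -*- coding: utf-8 -*-` line as well, which is the declaration the hotfix in the story was missing.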


Interesting that Python 2 couldn't fix that in a hotfix/point release... UTF-8 is backwards compatible with ASCII so it shouldn't break anything if source started being interpreted as UTF8. I'd be curious to see what their reasoning is.


The change to UTF-8 source encoding also changed the legal set of characters for identifiers, and specified how to normalize them. Which in turn is the reason behind this thing I posted on Twitter a while back:

https://gist.github.com/ubernostrum/b7b705bf21b86a1b5c1e2c9f...

And also is a big enough change to not really be something that could happen in Python 2.
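
A small illustration of identifier normalization in Python 3, which normalizes identifiers to NFKC while parsing. This is my own example and not a claim about what the linked gist contains.

```python
import unicodedata

fancy = "\U0001d41f\U0001d428\U0001d428"   # MATHEMATICAL BOLD SMALL F, O, O
plain = unicodedata.normalize("NFKC", fancy)
print(plain)                 # the name Python will actually bind

ns = {}
exec(fancy + " = 42", ns)    # assign using the bold spelling...
print(ns[plain])             # ...and read it back under the normalized name
```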


I would imagine Python's approach to introducing new language features had a lot to do with it. Having to go through the PEP system takes some time, and changes like these tend to be reserved for minor-version releases. All in all, I love the PEP system, it's such an open concept and I've been surprised by the amount of quality proposals that get implemented. Wish Go had something like it.


Correct, this was a codebase that still had some Pylons (gasp! Not even Pyramid, but legit Pylons) code.


This was a pretty gutsy move on Python's part. The presence of a single emoji in an English string will blow up memory usage for the whole string by 4x. And because graphemes aren't 1:1 to code points, the O(1) indexing and length operations you bought with that trade-off will still confuse people who don't understand Unicode.


As I said in the article, I think the overhead of adding yet more weirdness in the form of quirks of the internal encoding (which could vary according to how the Python interpreter was compiled!) is a bad thing to do on top of how much people seem to struggle mentally just to get Unicode all on its own.

Though I also think the struggle is mostly due to people being stuck in an everything-is-like-ASCII mindset, and though I didn't get into that, it's one big reason why I think UTF-8 is generally the wrong way to expose Unicode to a programmer, since it lets them think they can keep that cherished "one byte == one character" assumption right up until something breaks at 2AM on a weekend.

Personally I'd like everyone to just actually learn at least the things about Unicode that I went into here (such as why "one code point == one character" is a wrong assumption), and I think that'd alleviate a lot of the pain. I also avoided talking much about normalization, because too many people hear about it and decide they can just normalize to NFKC and go back to assuming code point/character equivalence post-normalization.


> it's one big reason why I think UTF-8 is generally the wrong way to expose Unicode to a programmer, since it lets them think they can keep that cherished "one byte == one character" assumption right up until something breaks at 2AM on a weekend.

Unfortunately, as long as you believe that you can index into a Unicode string, your code is going to break. The only question is how soon.

I actually like UTF-8 because it will break very quickly, and force the programmer to do the right thing. The first time you hit é or € or an emoji, you'll have a multibyte character, and you'll need to deal with it.

All the other options will also break, but later on:

- If you use UTF-16, then é and € will work, but emoji will still result in surrogate pairs.

- If you use a 4-byte representation, then you'll be able to treat most emoji as single characters. But then somebody will build é from two separate code points as "e + U+0301 COMBINING ACUTE ACCENT", or you'll run into a flag or skin color emoji, and once again, you're back at square zero.

You can't really index Unicode characters like ASCII strings. Written language is just too weird for that. But if you use UTF-8 (with a good API), then you'll be forced to accept that "str[3]" is hopeless very quickly. It helps a lot if your language has separate types for "byte" and "Unicode codepoint", however, so you can't accidentally treat a single byte as a character.
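
The break points listed above can be made concrete by counting the same strings in three different units. A minimal sketch; the helper name `units` and the sample labels are made up.

```python
def units(s: str) -> tuple[int, int, int]:
    """Return (UTF-8 bytes, UTF-16 code units, code points) for s."""
    return (
        len(s.encode("utf-8")),
        len(s.encode("utf-16-le")) // 2,   # "-le" avoids counting a byte order mark
        len(s),
    )

samples = {
    "precomposed e-acute": "\u00e9",
    "euro sign": "\u20ac",
    "emoji": "\U0001F600",
    "e + combining acute": "e\u0301",
    "flag (two regional indicators)": "\U0001F1EB\U0001F1F7",
}
for name, s in samples.items():
    print(f"{name:32} {units(s)}")
```

Each row is a single user-perceived character, yet only the first two have a code point count of 1 together with a UTF-16 count of 1, and none of them is a single UTF-8 byte.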


This sort of thing is why Swift treats grapheme clusters, rather than code points or bytes or "characters", as the fundamental unit of text. When I first started learning Swift I thought that was a weird choice that would just get in the way, but these days I'm coming around to their way of thinking.


Treating grapheme clusters as fundamental is slightly problematic in the sense that then the fundamentals change as Unicode adds more combining characters. A reasonable programming environment should still provide iteration by grapheme cluster as a library feature whose exact behavior is expected to change over time as the library tracks new Unicode versions.

Depending on the task at hand, iterating by UTF-8 byte or by code point can make sense, too. And the definition of these is frozen regardless of Unicode version, which makes these safer candidates for "fundamental" operations. There is no right unit of iteration for all tasks.


Pfft, that is just as bad. There is no 'fundamental unit of text'. There are different units of text that are appropriate to different tasks.

If I want to know how much memory to allocate, bytes are it. If I want to know how much screen space to allocate, font rendering metrics are it. If I want to do word-breaking, grapheme clusters are it.

None of these are fundamental.


Since a string doesn't have any font rendering metrics (in that it lacks a font or size), I'm not sure how you expect a language's String implementation to take it into account. Similarly, bytes will change based on encodings, which most people would expect a language's String type to abstract over. Do you really want UTF8 and UTF16 strings to behave differently and introduce even more complexity to a very complex system?

There are languages whose orthographies don't fit the Unicode grapheme cluster specification, but they're complex enough that I doubt there's any way to deal with them properly other than having someone proficient in them looking over your text processing or pawning it off to a library. At least with grapheme clusters your code won't choke on something as simple as unnormalized Latin text.


It's not about encodings at all, actually. It's about the API that is presented to the programmer.

And the way you take it all into account is by refusing to accept any defaults. So, for example, a string type should not have a "length" operation at all. It should have "length in code points", "length in graphemes" etc operations. And maybe, if you do decide to expose UTF-8 (which I think is a bad idea) - "length in bytes". But every time someone talks about the length, they should be forced to specify what they want (and hence think about why they actually need it).

Similarly, strings shouldn't support simple indexing - at all. They should support operations like "nth codepoint", "nth grapheme" etc. Again, forcing the programmer to decide every time, and to think about the implications of those decisions.

It wouldn't solve all problems, of course. But I bet it would reduce them significantly, because wrong assumptions about strings are the most common source of problems.
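
As a sketch of what such an API could look like, here is a thin wrapper around Python's `str`. The class and method names are hypothetical, and grapheme-level operations are left out because the standard library has no grapheme segmentation.

```python
class Text:
    """Wrapper that refuses plain len() and [] so callers must name a unit."""

    def __init__(self, value: str):
        self._value = value

    def __len__(self):
        raise TypeError(
            "ambiguous: use length_in_code_points() or length_in_utf8_bytes()"
        )

    def __getitem__(self, index):
        raise TypeError("ambiguous: use nth_code_point()")

    def length_in_code_points(self) -> int:
        return len(self._value)

    def length_in_utf8_bytes(self) -> int:
        return len(self._value.encode("utf-8"))

    def nth_code_point(self, n: int) -> str:
        return self._value[n]

t = Text("e\u0301")   # "e" followed by COMBINING ACUTE ACCENT
print(t.length_in_code_points(), t.length_in_utf8_bytes())
```

The point of the design is that `len(t)` fails loudly instead of silently picking a unit for the caller.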


What do you mean by "expose UTF-8"? Because nothing about UTF-8 requires that you give byte access to the string.

As for indexing, strings shouldn't require indexing, period. That's the ASCII way of thinking, especially fixed-width columns and such. You should be thinking relatively. For example: find me the first space, then, from that point in the string, the next character needs to be a letter. When you build your code that way you don't fall for the trap of byte indexing or the performance hit of codepoint indexing (UTF-8) or grapheme indexing (all encodings).


There are real-world textual data types for which your idealized approach simply does not work. As in, it would be impossible or impossibly unwieldy to validate conformance to the type using your approach, because they require indexing to specific locations, or determining length, or both.

For example, I work for a company that does business in the (US) Medicare space. Every Medicare beneficiary has a HICN -- Health Insurance Claim Number -- and HICNs come in different types which need to be identified. Want to know how to identify them? By looking at prefix and suffix characters in specific positions, and the length of what comes between them. For example, the prefix 'A' followed by six digits means the person identified is the primary beneficiary and was first covered under the Railroad Retirement Board benefit program. Doing this without indexing and length operations is madness.

These data types can and should be subjected first to some basic checks to ensure they're not nonsense (i.e., something expected to be a numeric value probably should not contain Linear B code points, and it's probably a good idea to at least throw a regex at it first, but then applying regex to Unicode also has quirks people don't often expect at first...).
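
For illustration only, the single prefix rule mentioned above (an 'A' followed by six digits) might be checked like this with plain indexing and length. The function name is made up, and real HICN validation involves more rules than this one.

```python
def looks_like_rrb_primary(hicn: str) -> bool:
    """True if hicn is 'A' followed by exactly six ASCII digits."""
    body = hicn[1:]
    return (
        len(hicn) == 7
        and hicn[0] == "A"
        and body.isascii()    # str.isdigit() alone also accepts non-ASCII digits
        and body.isdigit()
    )

print(looks_like_rrb_primary("A123456"))
print(looks_like_rrb_primary("A12345\u0661"))   # ends in ARABIC-INDIC DIGIT ONE
```

The `isascii()` guard is the kind of basic sanity check the last paragraph refers to: without it, digits from other scripts would pass.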


I don't see why this would be hard with iterators. You have an iterator to the start of the HICN, either at the start of a string or deep in the string. Take a second iterator and set it to the first. Loop six times, advancing that iterator and checking that each character is a digit. Then check if the next position is a space.

For the prefix and suffix and how many characters between them you do the above but use the second iterator to find the suffix. Then you either keep track of how many characters you advanced or ask for how many characters between the two.

It's very easy to think about it this way as that's how a normal (non-programmer) human would do it. Basically the code literally does what you wrote in English above.

My point is that iterators are much faster than indexing when the underlying string system uses graphemes. You can do pretty much anything just as easily or more easily with iterators than with indexing. The big exception is fixed-width columnar text files. I've seen a lot of these in financial situations, but fortunately those systems are ASCII-based, so it's not an issue.


You're not really changing anything, though; you're basically saying that instead of indexing to position N, you're going to take an iterator and advance it N positions, and somehow say that's a completely different operation. It isn't a different operation, and doesn't change anything about what you're doing.

If you want to argue that there should be ways to iterate over graphemes and index based on graphemes, then that is a genuine difference, but splitting semantic hairs over whether you're indexing or iterating doesn't get you a solution.


If the string is stored as ASCII characters or fixed-width Unicode code points (UCS-2 or UCS-4) then you are correct that not much changes. But if the string is in UTF-8, UTF-16 or the string system uses graphemes then indexing goes from O(1) to O(N). Every index operation would have to start a linear scan from the beginning of the string to get to the correct spot. With iterators it would be a quick operation to access what it's pointing to and very quick to advance it.

My argument is that iterators are far superior to indexing when using graphemes (or code points stored as UTF-8 but grapheme support is superior). And they don't hurt when used on ASCII or fixed width strings either so the code will work with either string format. No hairs, split or otherwise here.
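
The cost difference can be sketched over raw UTF-8 bytes, using a plain byte offset as the "iterator". The function names here are hypothetical and the input is assumed to be valid UTF-8.

```python
def utf8_width(lead: int) -> int:
    """Length in bytes of the UTF-8 sequence that starts with this lead byte."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    raise ValueError("not a UTF-8 lead byte")

def advance(data: bytes, pos: int) -> int:
    """Move a byte cursor forward by one code point: constant time."""
    return pos + utf8_width(data[pos])

def offset_of(data: bytes, n: int) -> int:
    """Byte offset of code point n, found by scanning from the start: O(n)."""
    pos = 0
    for _ in range(n):
        pos = advance(data, pos)
    return pos

data = "a\u20ac\U0001F600b".encode("utf-8")
cursor = advance(data, advance(data, 0))   # two cheap steps from the start
print(cursor, offset_of(data, 3))
```

A parser that keeps the cursor around pays `advance` once per code point, while one that re-indexes from an integer position pays `offset_of` every time.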


I agree that iterators generally make more sense with strings. But sometimes, you really do want to operate on code points - for example, because you're writing a lexer, and the spec that you're implementing defines lexemes as sequences of code points.


That's why the search functions need to be more intelligent. If you pass the search function a grapheme it will do more work. If it notices you just passed in a grapheme that's just a code point it can do a code point scan. And if the internal representation is UTF-8 and it sees you passed in an ASCII character (very common in lexing/parsing) it will just do a fast byte scan for it.

Now if the spec thinks identifiers are just a collection of code points then it's being imprecise. But things would still work if the lexer/parser you wrote returns identifiers as a bunch of graphemes because ultimately they're just a bunch of code points strung together.

It's only in situations where you need to truncate identifiers to a certain length that graphemes become important. Also normalizing them when matching identifiers would also probably be a good idea.


int_19h's approach is still valid for this; you're asking for whole displayed characters which are combined of some (you don't need to know) number of bits in memory across several units of the memory segment(s) that hold the string.

Based on your description, the correct solution is probably to use a structure or class of a more regular format to store the decoded HICN in pre-broken form. If they really only allow numbers in runs of text you might save space and speed comparison/indexing by doing this.


It's more that I get tired of people declaring that indexing and length operations need to be completely and utterly and permanently forbidden and removed, and then proposing that they be replaced by operations which are equivalent to indexing and length operations.

Doing these operations on sequences of code points can be perfectly safe and correct, and in 99.99%+ of real-world cases probably will be perfectly safe and correct. My preference is for people to know what the rare failure cases are, and to teach how to watch out for and handle those cases, while the other approach is to forbid the 99.99% case to shut down the risk of mis-handling the 0.01% case.


When people say they should be removed they mean primitive operations (like a standard 'length' attribute/function, or an array index operator) shouldn't exist for that type.

Just like it is better to have something like .nth(X) as a function for stepping to a numbered node, so too does a language string demand operations like .nth_printing(X), .nth_rune(X) and .nth_octet(X), to make it clear to any programmer working with that code what the intent is.


Semantically equivalent, yes; access-time equivalent for variable-width strings, no. One of the reasons for Python 3's odd internal string format is that they wanted to keep indexing and have indexing be O(1). The reason I favor replacing indexing with iterators is that it removes this restriction, and they could have made the internal format UTF-8 and/or easily added support for graphemes.

I prefer to have a system where 100% of the cases are valid and teaching people corner cases is not required. We all know how well teaching people about surrogate pairs went. And we're not forbidding the 99.99% case but providing an alternative way to accomplish the exact same thing. The vast majority of code uses index variables as a form of iterator anyways so it's not that big of a change.

The main reason people keep clinging to indexing strings is that's all they know. Most high level languages don't provide another way of doing it. People who program in C quickly switch from indexing to pointers into strings. Give a C programmer an iterator into strings and they'll easily handle it.


By "expose UTF-8" I mean exposing the underlying UTF-8 representation of the string directly on the object itself, instead of going through a separate byte array (or byte array view, to avoid copying).


Ah, I see. I agree that it would be a bad idea to give access to the UTF-8 representation.

As for length in bytes, a good way to handle most use cases regarding that is to have a function that truncates the string to fit into a certain number of bytes. That way you can make sure it fits into whatever fixed buffer and the truncation would happen on a grapheme level.
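
A sketch of such a truncation function in Python. Note the hedge: this version only guarantees it never cuts in the middle of a code point; stopping at grapheme boundaries, as suggested above, would additionally need a grapheme segmenter, which the standard library does not provide. The name `truncate_utf8` is made up.

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Longest prefix of text whose UTF-8 encoding fits in max_bytes."""
    data = text.encode("utf-8")
    if len(data) <= max_bytes:
        return text
    cut = max_bytes
    # If the byte at the cut is a continuation byte (0b10xxxxxx), the cut
    # falls inside a code point: back up to that code point's lead byte.
    while cut > 0 and data[cut] & 0xC0 == 0x80:
        cut -= 1
    return data[:cut].decode("utf-8")

print(repr(truncate_utf8("a\u20acb", 3)))   # the euro sign needs 3 bytes and does not fit
```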


This is exactly how Swift’s Strings work.


>If I want to know how much memory to allocate, bytes are it. If I want to know how much screen space to allocate, font rendering metrics are it. If I want to do word-breaking, grapheme clusters are it.

Size in memory/bytes you could get trivially for any string (and this doesn't change with whether you chose bytes, graphemes or code points or whatever to iterate).

Screen space is irrelevant/orthogonal to encoding -- it appears at the font level and the font rendering engine that will give the metrics will accept whatever encoding it is.


>Screen space is irrelevant/orthogonal to encoding

Exactly. That's why measurements of string length shouldn't ever assume I'm looking for a unit-of-offset for a monospaced font.

The problem is that most naive programmers think that's what a string length is and should be.


I would love it if Python at least would support the '\X' regex metacharacter in its own built-in regex module. Right now you have to turn to a third-party implementation to get that.

I also wish Python would expose more of the Unicode database than it does; I've had to turn to third-party modules that basically build their own separate database for some Unicode stuff Python doesn't provide (like access to the Script property).
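For reference, a small sketch of what the standard `unicodedata` module does expose (names, general category, combining class), plus a crude stand-in for the Script property. The `rough_script` helper is hypothetical and only a heuristic: it takes the first word of the character name, which works for many letters but is wrong for digits, punctuation and other characters shared between scripts.

```python
import unicodedata

def rough_script(ch: str) -> str:
    """Heuristic guess at a script: first word of the Unicode character name.

    Not the real Script property; purely illustrative.
    """
    return unicodedata.name(ch, "UNKNOWN").split(" ")[0]

print(unicodedata.category("a"))          # 'Ll' (lowercase letter)
print(unicodedata.combining("\u0301"))    # 230 (combining acute accent)
for ch in ("a", "\u044f", "\u03c0", "\u6c49"):
    print(repr(ch), unicodedata.name(ch), "->", rough_script(ch))
```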


> Unfortunately, as long as you believe that you can index into a Unicode string, your code is going to break. The only question is how soon.

Depends on what you want to index into it for. I'll admit that once upon a time I opposed adding a "truncate at N characters" template helper to Django since there was a real risk it would cut in the middle of a grapheme cluster, and I don't particularly care for the compromise that ended up getting it added (it normalizes the string-to-truncate to a composed form first to try to minimize the chance of slicing at a bad spot).

But when you get right down to it, what I do for a living is write web applications, and sometimes I have to write validation that cares about length, or about finding specific things in specific positions, and so indexing into a string is something I have to do from time to time, and I'd rather have it behave as a sequence of code points than have it behave as a sequence of bytes in a variable-width encoding.

As to whether UTF-8 forces people to deal with Unicode up-front, I very strongly disagree; UTF-8 literally has as a design goal that it puts off your need to think about anything that isn't ASCII.
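To make the normalize-then-truncate compromise concrete, here is a rough sketch of the idea in Python. It is not Django's actual code; `truncate_chars` is a hypothetical helper. Composing to NFC only helps when a precomposed code point exists, so a sequence like "x" plus a combining accent can still be split.

```python
import unicodedata

def truncate_chars(text: str, limit: int) -> str:
    """Normalize to NFC, then keep at most `limit` code points."""
    return unicodedata.normalize("NFC", text)[:limit]

decomposed = "cafe\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT

print(len(decomposed))                                 # 5 code points
print(len(decomposed.encode("utf-8")))                 # 6 bytes
print(decomposed[:4] == "cafe")                        # True: naive slice loses the accent
print(truncate_chars(decomposed, 4) == "caf\u00e9")    # True: accent survives
```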


Yes, while "back up one UTF-8 rune" is a well defined operation, "back up one grapheme" is tough. Forward is easy, though.

I had the need to write grapheme-level word wrap in Rust. Here it is. It assumes all graphemes have the same visible width. This is used mostly for debug output, not for general text rendering.

[1] https://github.com/John-Nagle/rust-rssclient/blob/master/src...
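The "back up one UTF-8 rune" operation mentioned above can be sketched in a few lines of Python. This assumes the buffer holds valid UTF-8; `prev_rune_start` is an illustrative name, not a standard function. It relies on the fact that continuation bytes always have the bit pattern 10xxxxxx.

```python
def prev_rune_start(buf: bytes, pos: int) -> int:
    """Byte offset where the code point ending just before `pos` begins."""
    if pos <= 0:
        raise ValueError("already at the start of the buffer")
    pos -= 1
    # Skip back over continuation bytes (0b10xxxxxx) to the lead byte.
    while pos > 0 and (buf[pos] & 0xC0) == 0x80:
        pos -= 1
    return pos

buf = "a\u20ac\U0001F600".encode("utf-8")   # 1 + 3 + 4 bytes
pos = len(buf)
while pos > 0:
    start = prev_rune_start(buf, pos)
    print(start, buf[start:pos].decode("utf-8"))
    pos = start
```

Backing up one grapheme is harder because the cluster boundary depends on the properties of several neighbouring code points, not on a bit pattern in a single byte.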


> But when you get right down to it, what I do for a living is write web applications,

That is my use case for Python as well.

> sometimes I have to write validation that cares about length,

That's where a truncation function that understands grapheme clusters would come in so handy. Tell it that you want to truncate to n bytes maximum and let it chop a bit more so as not to split a grapheme cluster.

Fortunately my database does not have fixed-width strings so I rarely bump into this one.

> or about finding specific things in specific positions, and so indexing into a string is something I have do to from time to time

I write my code to avoid this. Yes I still have to use an index because that's what Python supports but it would be trivial to convert it to another language that supports string iterators.


I get the variable byte encodings. And I know that Unicode has things like U+0301 as you say, and so code points are not the same as characters/glyphs. But I don't understand why it was designed that way. Why is Unicode not simply an enumeration of characters?


It's important to distinguish between Unicode the spec, where people definitely do make that distinction, and implementations. Most of the problems are due to history: we have half a century of mostly English-speaking programmers assuming one character is one byte, especially bad when that's baked into APIs, and treating the problem as simpler than it really is.

Combining accents are a great example: if you're an American, especially in the 80s, it's easy to assume that you only need a couple of accents like you used in Spanish and French classes and that's really simple for converting old data to a new encoding. Later, it becomes obvious that far more are needed but by then there's a ton of code and data in the wild so you end up needing the concept of normalization for compatibility.

(That's the same lapse which led to things like UCS-2 assuming 2^16 characters even though that's not enough for a full representation of Chinese alone.)

I think it's also worth remembering the combination of arrogance and laziness which was not uncommon in the field, especially in the 90s. I remember impassioned rants about how nobody needed anything more than ASCII from programmers who didn't want to have to deal with iconv, thought encoding was too much hassle, claimed it was too slow, etc. as if that excused not being able to handle valid requests. About a decade ago I worked at a major university where the account management system crashed on apostrophes or accents (in a heavily Italian town!) and it was just excused as the natural order of things so the team could work on more interesting problems.


One reason is because it would take a lot more code points to describe all the possible combinations.

Take the country flag emoji. They're actually two separate code points. The 26 code points used are just special country code letters A to Z. The pair of letters is the country code and shows up as a flag. So just 26 codes to make all the flags in the world. Plus new ones can be added easily without having to add more code points.

Another example is the new skin tone emoji. The new codes are just the colour and are put after the existing emoji codes. Existing software just shows the normal coloured emoji but you may see a square box or question mark symbol after it.
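The flag scheme is easy to try out. A small Python sketch (the `flag` helper is just for illustration): each ASCII letter of the country code maps to one of the 26 regional indicator symbols starting at U+1F1E6.

```python
REGIONAL_INDICATOR_A = 0x1F1E6   # REGIONAL INDICATOR SYMBOL LETTER A

def flag(country_code: str) -> str:
    """Map a two-letter country code to its pair of regional indicator symbols."""
    return "".join(
        chr(REGIONAL_INDICATOR_A + ord(letter) - ord("A"))
        for letter in country_code.upper()
    )

fr = flag("FR")
print(fr)                              # renders as a flag where fonts support it
print([hex(ord(c)) for c in fr])       # ['0x1f1eb', '0x1f1f7']
print(len(fr), len(fr.encode("utf-8")))  # 2 code points, 8 bytes
```

One user-visible "character", two code points, eight UTF-8 bytes: a good illustration of why neither bytes nor code points count what users see.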


>The pair of letters is the country code and shows up as a flag. So just 26 codes to make all the flags in the world. Plus new ones can be added easily without having to add more code points. Another example is the new skin tone emoji.

Still not answering the question though.

For one, when the unicode standard was originally designed it didn't have emoji in it.

Second, if it was limitations to the arbitrary addition of thousands of BS symbols like emoji that necessitate such a design, we could rather do without emojis in unicode at all (or klingon or whatever).

So, the question is rather: why not a design that doesn't need "normalization" and runes, code points, and all that...

Using less memory (like utf-8 allows) I guess is a valid concern.


It didn't have emoji, but it did have other combining characters. For some languages it's feasible to normalize them to single code points, but for other languages it would not be.

Plus the fact that some visible characters are made up of many graphemes the number of single code points would be huge.

As to your second point it seems to me to be a little close minded. The whole point of a universal character set was that languages can be added to it whether they be textual, symbolic or pictographic.


>As to your second point it seems to me to be a little close minded. The whole point of a universal character set was that languages can be added to it whether they be textual, symbolic or pictographic.

Representing all languages is ok as a goal -- adding klingon and BS emojis not so much (from a sanity perspective, if adding them meddled with having a logical and simple representation of characters).

So, it comes to "the fact that some visible characters are made up of many graphemes the number of single code points would be huge" and "while some languages it's feasible to normalize them to single code points but other languages it would not be".

Wouldn't 32 bits be enough for all possible valid combinations? I see e.g. that: "The largest corpus of modern Chinese words is as listed in the Chinese Hanyucidian (汉语辞典), with 370,000 words derived from 23,000 characters".

And how many combinations are there of stuff like Hangul? I see that's 11,172. Accents in languages like Russian, Hungarian, Greek should be even easier.

Now, having each accented character as a separate might take some lookup tables -- but we already require tons of complicated lookup tables for string manipulation in UTF-8 implementations IIRC.
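The 11,172 figure for Hangul is simply 19 initial consonants × 21 vowels × 28 finals (27 final consonants plus "no final"), and Unicode lays the precomposed syllables out arithmetically starting at U+AC00. A short Python sketch of that arithmetic; the constant and function names are mine, the formula is the standard one:

```python
import unicodedata

S_BASE = 0xAC00                          # first precomposed Hangul syllable
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28   # initials, vowels, finals (incl. none)

def compose_hangul(lead: int, vowel: int, tail: int = 0) -> str:
    """Build a precomposed syllable from jamo indices (0-based)."""
    return chr(S_BASE + (lead * V_COUNT + vowel) * T_COUNT + tail)

print(L_COUNT * V_COUNT * T_COUNT)                  # 11172
print(compose_hangul(18, 0, 4))                     # U+D55C, the syllable "han"
print(len(unicodedata.normalize("NFD", "\ud55c")))  # 3 jamo code points
```

So Hangul already exists in both styles: one code point per syllable, or a sequence of jamo that normalization maps back and forth.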


You might be correct and 32 bits could have been enough but Unicode has restricted code points to 21 bits. Why? Because of stupid UTF-16 and surrogate pairs.

I'm curious why you think that UTF-8 requires complicated lookup tables.
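The 21-bit limit falls straight out of the surrogate pair arithmetic: a pair carries 10 + 10 bits on top of the 0x10000 code points of the BMP. A Python sketch (`to_surrogates` is an illustrative helper, checked here against Python's own UTF-16 encoder):

```python
def to_surrogates(code_point: int):
    """Split a code point above the BMP into its UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above the BMP use surrogate pairs")
    offset = code_point - 0x10000            # fits in 20 bits
    high = 0xD800 + (offset >> 10)           # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)          # bottom 10 bits
    return high, low

high, low = to_surrogates(0x1F600)
print(hex(high), hex(low))                            # 0xd83d 0xde00
print("\U0001F600".encode("utf-16-be").hex())         # d83dde00
print(hex(0x10000 + (1 << 20) - 1))                   # 0x10ffff, the highest code point
print((0x10FFFF).bit_length())                        # 21
```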


>I'm curious why you think that UTF-8 requires complicated lookup tables.

Because in the end it's still a Unicode encoding, and still has to deal with BS like "equivalence", right?

Which is not mechanically encoded in the err, encoding (e.g. all characters with the same bit pattern there are equivalent) but needs external tables for that.


But that's the same for UTF-16 and UTF-32. That's why I was wondering why you singled UTF-8 out, implying it needed extra handling.


Nah, didn't single it out, I asked why we don't have a 32-bit fixed-size code points, non-surrogate-pair-bs etc encoding.

And I added that while this might need some lookup tables, we already have those in UTF-8 too anyway (a non fixed width encoding).

So the reason I didn't mention UTF-16 and UTF-32 is because those are already fixed-size to begin with (and increasingly less used nowadays except in platforms stuck with them for legacy reasons) -- so the "competitor" encoding would be UTF-8, not them.


> So, the question is rather: why not a design that doesn't need "normalization" and runes, code points, and all that...

Because language is messy. At some point you have to start getting into the raw philosophy of language and it's not just a technical problem at that point but a political problem and an emotional problem.

Take accents as one example: in English a diaeresis is a rare but sometimes useful accent mark to distinguish digraphs (coöperate should be pronounced as two Os, not one OOOH sound like in chicken coop); the letter stays the same, it just has "bonus information". In German an umlaut version of a letter (ö versus o) is considered an entirely different letter, with a different pronunciation and alphabet order (though further complicated by conversions to digraphs in some situations such as ö to oe).

Which language is "right"? The one that thinks that diaeresis is merely a modifier or the one that thinks of an accented letter as a different letter from the unmodified? There isn't a right and wrong here, there's just different perspectives, different philosophies, huge histories of language evolution and divergence, and lots of people reusing similar looking concepts for vastly different needs.

Similarly the Spanish ñ is a single letter in Spanish but the ~ accent may be a tone marker in another language that is important to the pronunciation of the word and a modifier to the letter rather than a letter on its own.

There's the case of the overlaps where different alphabets diverged from similar origins. Are the letters that still look alike the same letters? [1]

Math is a language with a merged alphabet of latin characters, arabic characters, greek characters, monastery manuscript-derived shorthands, etc. Is the modern Greek Pi the same as the mathematical symbol Pi anymore? Do they need different representations? Do you need to distinguish, say in the context of modern Greek mathematical discussions the usage of Pi in the alphabet versus the usage of mathematical Pi?

These are just the easy examples in the mostly EFIGS space most of HN will be aware of. Multiply those sorts of philosophical complications across the spectrum of languages written across the world, the diversity of Asian scripts, and the wonder of ancient scripts, and yes the modern joy of emoji. Even "normalization" is a hack where you don't care about the philosophical meaning of a symbol, you just need to know if the symbols vaguely look alike, and even then there are so many different kinds of normalization available in Unicode because everyone can't always agree which things look alike either, because that changes with different perspectives from different languages.

[1] An excellent Venn diagram: https://en.wikipedia.org/wiki/File:Venn_diagram_showing_Gree...
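For anyone who hasn't played with the different normalization forms, a quick Python sketch using the standard `unicodedata` module (the `forms` helper is just for illustration). NFC/NFD only apply canonical equivalences; NFKC/NFKD additionally fold "compatibility" characters such as ligatures.

```python
import unicodedata

def forms(text: str) -> dict:
    """Return `text` in all four Unicode normalization forms."""
    return {
        form: unicodedata.normalize(form, text)
        for form in ("NFC", "NFD", "NFKC", "NFKD")
    }

e_acute = forms("\u00e9")     # LATIN SMALL LETTER E WITH ACUTE
ligature = forms("\ufb01")    # LATIN SMALL LIGATURE FI
angstrom = forms("\u212b")    # ANGSTROM SIGN

print({name: [hex(ord(c)) for c in value] for name, value in e_acute.items()})
print(ligature["NFC"] == "\ufb01", ligature["NFKC"] == "fi")   # True True
print(angstrom["NFC"] == "\u00c5")   # True: folded into A WITH RING ABOVE
```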


Some languages use multiple marks on a single base character. The combining character system is more flexible, uses fewer code points, and doesn't require lookup tables to search or sort by base character.


In practice UTF-8 has done more to enable the wrong thing, rather than forcing programmers to do the right thing.

> You can't really index Unicode characters like ASCII strings

But then why do strings-are-UTF8 languages like Go or D make it so easy? Why optimize your syntax with `len(txt)` and `txt[0]` when, as you point out, you can't do that? Why make it trivial to split code points or composed character sequences, but doing something like proper string truncation is brutally hard?

UTF-8's fail-fast property has not enabled more Unicode-savviness. Instead it just lets programmers pretend that we still are in the land of C strings.


> Why optimize your syntax with `len(txt)` and `txt[0]` when, as you point out, you can't do that?

I like Rust's approach. It's a strings-are-UTF8 language but strings (both str and String):

- are not directly indexable

- force you to be explicit when iterating: you iterate over either `s.chars()` or `s.bytes()`

- are called out in the docs as being a vector of unsigned 8-bit integers internally

- support a len() method that is called out as returning the length of that vector

- can be sliced if you reaaaally need to get around inability to index directly but attempting to slice in the middle of a character causes a panic


> - support a len() method that is called out as returning the length of that vector

They should have called that one bytelen() then.

And how do you get a proper offset for slicing? Do you then have to interpret the UTF-8 bytes yourself, or can you somehow get it via the chars() iterator or something similar?


Yeah this is the way to go for sure.


> But then why do strings-are-UTF8 languages like Go

To clarify: strings in Go are not necessarily UTF-8. String literals will be, because the source code is defined to be UTF-8, but string values in Go can contain any sequence of bytes: https://blog.golang.org/strings

Note that this prints 2, because the character contains two bytes in UTF-8, even though the two bytes correspond to one codepoint: https://play.golang.org/p/BqGzW1O2WX

Go also has the concept of a rune, which is separate from a byte and a string, and makes this easier when you're working with raw string encodings.


This makes it sound like Go is even more confused. If strings in Go are not necessarily UTF-8, why does the strings package assume UTF-8, `for range` assumes UTF-8, etc?


> If strings in Go are not necessarily UTF-8, why does the strings package assume UTF-8, `for range` assumes UTF-8, etc?

The blog post I linked to explains this in more detail, but in short: the `strings` package provides essentially the same functions as the `bytes` package does, except applied to work on UTF-8 strings. There are other packages for dealing with other text encodings.

The `for range` syntax is the one "special case", and it was done because the alternative (having it range over bytes instead of codepoints) is almost never desirable in practice[0], and it's easier to manually iterate the few times you do need it than it would be to import a UTF-8 package just to iterate over a string 99.9% of the time.

[0] iterating over bytes is done all the time, of course, but usually at that point you're dealing with an actual slice of bytes already that you want to iterate over, not a string.


The point is that Go lumps together byte arrays and strings. It's a common flaw, but it's really unfortunate to see it perpetrated in a language that was designed after this lesson was already learned.

A byte array is a representation of a string, for sure. But strings themselves are higher-level abstractions. It shouldn't be that easy to mix the two.

An equivalent situation would be if integers were byte arrays. So len(x) would give you 4, for example, and you could do x[0], x[1] etc - except you would almost never actually do that in practice, and occasionally you'd end up doing the wrong thing by mistake.

If any language actually worked that way, everyone would be up in arms about it. Unfortunately, the same passes for strings, because of how conditioned we are to treat them as byte sequences.

Calling it "char" in C was probably the second million dollar mistake in the history of PL design, right after null.


Easily moving from bytes to strings and back is the only way it makes sense for Go. It runs on POSIX for the most part, and every. single. POSIX. API. is done in bytes. Not Unicode. Bytes.

Languages like Python 3 that try to be so Unicode-pure that they crash or ignore legal Linux filenames are insane.
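For context, the mechanism Python 3 offers for such filenames is the `surrogateescape` error handler (PEP 383), which `os.fsdecode()` and `os.fsencode()` use on POSIX systems. A small sketch of the round trip with a made-up filename; whether a given library actually uses this handler is a separate question:

```python
raw_name = b"report-\xff.txt"   # legal on Linux, but not valid UTF-8

try:
    raw_name.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Undecodable bytes are smuggled through as lone surrogates U+DC80..U+DCFF.
as_text = raw_name.decode("utf-8", errors="surrogateescape")
round_trip = as_text.encode("utf-8", errors="surrogateescape")

print(strict_ok)                 # False
print(ascii(as_text))            # 'report-\udcff.txt'
print(round_trip == raw_name)    # True
```

The resulting `str` is not encodable with the default strict handler, which is where the crashes people complain about tend to come from (for example when printing such a name).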


I would dare say that the fact that Linux filenames don't have to be valid strings (i.e. they can be arbitrary byte sequences that cannot be meaningfully interpreted using the current locale encoding) is the insane part.

But does POSIX require support for arbitrary byte sequences in filenames, or does it merely use bytes (in locale encoding) as part of its ABI? I suspect the latter, since OS X is Unix-certified, and IIRC it does use UTF-16 for filenames on HFS - so presumably their POSIX API implementation maps to that somehow. If that's correct, then that's also the sane way forward - for the sake of POSIX compatibility, use byte arrays to pass strings around, but for the sake of sanity, require them to be valid UTF-8.


> I think UTF-8 is generally the wrong way to expose Unicode to a programmer, since it lets them think they can keep that cherished "one byte == one character" assumption right up until something breaks at 2AM on a weekend.

The solution to that is simple, don't let the programmer access individual bytes in a Unicode string.

Get rid of indexing into them and replace it with iterators. Make string handling functions work on code points at the very least but better yet on grapheme clusters. There's a little more to it than that but it's a good start.

Yes, people are still stuck in the ASCII mindset and can't seem to get away from thinking in bytes. But I believe it's the ability to index into strings that's to blame and not the encoding used.


Agreed, assuming O(1) lookup of anything inside a string only leads to bad encoding bugs. UTF-8 everywhere, no exceptions.

You can never assume any user-visible character will align evenly with any byte boundary, even if you're using UTF-32. Composed characters throw that assumption out the window, as well as dozens of other unicode quirks I can't recall now.


Python took the obvious approach - they already had UTF-16 and UTF-32 builds, so this was just making that mechanism dynamic.

Go and Rust expose UTF-8 at the byte level. This is something of a headache and may result in invalid string slices. It basically punts the problem back to the user.

Here's an alternative: Use UTF-8 as the internal representation, but don't expose it to the user.

If you're iterating over a string one rune or one grapheme at a time, the UTF-8 substructure is hidden from the user. Only if the user uses an explicit numeric subscript do you need to know a rune's position in string. When a request by subscript comes in, scan the string and build an index of rune subscript->byte position. This is expensive, but no worse than UTF-32 in space usage or expansion to UTF-32 in time.

Optimizations:

- Requests for s[0] to s[N], and s[-1] to s[-N], for small N, should be handled by working forwards or backwards through the UTF-8. (Yes, you can back up by rune in UTF-8. That's one of the neat features of the representation.)

- Lookup functions such as "index" should return an opaque type which represents the position into that string. If such an object is used as a subscript, there's no need to build the index by rune. If you coerce this opaque type into an integer, the index table has to be built. Adding or subtracting small integers from this opaque type should be supported by working backwards or forwards in the string.

- Regular expression processing has to be UTF-8 aware. It shouldn't need an index by rune.

This would maintain Python's existing semantics while reducing memory consumption.

Some performance measurement tool that finds all the places where an index by rune has to be built is useful. It's rare that you really need this, but sometimes you do.
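A toy Python sketch of the core of this proposal: store UTF-8, iterate without any index, and build the rune-to-byte-offset table only when someone subscripts by integer. The class `Utf8Str` and its `index_builds` counter are hypothetical names; the small-N and opaque-position optimizations are left out, and the input is assumed to be valid UTF-8.

```python
class Utf8Str:
    """String stored as UTF-8, with a lazily built code point index."""

    def __init__(self, text: str) -> None:
        self._buf = text.encode("utf-8")
        self._index = None        # byte offset of each code point, built on demand
        self.index_builds = 0     # crude stand-in for a profiling hook

    def runes(self):
        """Yield code points one at a time; never touches the index."""
        start = 0
        for pos in range(1, len(self._buf) + 1):
            # A new code point starts at any byte that is not 0b10xxxxxx.
            if pos == len(self._buf) or (self._buf[pos] & 0xC0) != 0x80:
                yield self._buf[start:pos].decode("utf-8")
                start = pos

    def __getitem__(self, i: int) -> str:
        if self._index is None:   # the expensive O(n) scan, done once
            self._index = [p for p, b in enumerate(self._buf) if (b & 0xC0) != 0x80]
            self.index_builds += 1
        count = len(self._index)
        if i < 0:
            i += count
        if not 0 <= i < count:
            raise IndexError("string index out of range")
        start = self._index[i]
        end = self._index[i + 1] if i + 1 < count else len(self._buf)
        return self._buf[start:end].decode("utf-8")

s = Utf8Str("a\u20ac\U0001F600b")
print(list(s.runes()), s.index_builds)   # iteration alone: 0 index builds
print(s[2], s[-1], s.index_builds)       # subscripting: 1 index build
```

The index here costs one machine integer per code point, which is the "no worse than UTF-32" bound from the comment; a real implementation would likely sample every Nth offset instead.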


> Go and Rust expose UTF-8 at the byte level. This is something of a headache and may result in invalid string slices. It basically punts the problem back to the user.

In Go, yes. In Rust, no. UTF-8 in Go is garbage in, garbage out. Rust, however, won't let you materialize an invalid &str without "unsafe".


The difference is that Go expects your majority use case to be copy or concatenate. If you're taking a string sequence value you're normally not going to change it, or you're going to combine it together with something else. If you have valid UTF-8 input, you should get output that is valid, but might not be 'normalized' to a single form. If you care about normalizing you can decide when to do that (usually in output construction).

If you need to make a decision based on the content of a string, then you often need to make a normalized (the same way for both) copy of the inputs.

Most importantly, if you feed in garbage, you get out the SAME garbage. The real world, and historical data, are messy. Trying to be smart can often lead to the most disastrous consequences. Being conservative and tolerant allows for intentional planning to handle the conversion at the source, if and when desired.


> Go and Rust expose UTF-8 at the byte level.

Or you can take the C/C++ approach and have a character be 1 byte, 2 bytes, or multi-byte. It's a pain in the ass in C/C++ to constantly have to interface between two libraries where one decided to use char and the other wchar_t!


The way the C and C++ committees approach Unicode is even worse than Python breaking away from UTF-16 in the wrong direction (UTF-32 being the wrong direction and UTF-8 being the right direction).

The first rule of reasonably happy C and C++ Unicode programming is not to use wchar_t for any purpose other than immediate interaction with the Win32 API.

The second rule of reasonably happy C and C++ Unicode programming is not to use the standard library facilities (which depend on the execution environment) for text processing but using some other library where the UTF-* interpretation of inputs and outputs doesn't shift depending on the execution environment or compilation environment.


This is where we run into a bit of a problem. You have a char pointer that can be multi-byte encoded (depending on the code page Windows is using). It can also be UTF-8 encoded. Then you move on to Windows wchar_t, which was originally defined as UCS-2 and was later redefined as UTF-16 to support surrogate pairs.

So in the Windows world with COM/DCOM you're basically nudged into using UTF-16 wchar_t or it becomes a hell of a lot of pain. So it is easier to simply accept UTF-16 and convert everything else (UTF-8, UTF-32, code pages) to that single encoding.


You could just wrap that pointer in a class that describes what it is - ideally at the type level (Utf8String, etc). Each string class knows how to convert from other string types, and any library calls get wrapped in a method that is either templated on the string input type(s) or takes a BaseString* and calls virtual conversion functions. Or force a manual call to convert each time so that your fellow developers know when slow conversions are happening for sure.

It is a crappy situation though. Pick where you want your pain point to be.


See http://utf8everywhere.org

It talks about Windows quite a bit.


> Go and Rust expose UTF-8 at the byte level. This is something of a headache and may result in invalid string slices. It basically punts the problem back to the user.

There are pros and cons to both approaches. The prime ones being that []byte allows for easy random access, whereas []rune usually takes O(n) to work with (unless you store rune lengths separately, which is memory intensive).

I guess it's about the right level of abstraction, so that you can choose if you're working with bytes (binary I/O, when you know it's ascii etc.) and when with runes (most situations).

I still haven't decided whether I prefer the Python approach or the Go one.


If you look over Elixir's doc [0] on binary strings, they take the best of both worlds. The APIs are specifically crafted for least surprise, e.g. with `String.length()`, `byte_size`, `String.graphemes()` and `String.codepoints()` functions.

[0] https://hexdocs.pm/elixir/String.html


I think the devil is in the details of that opaque 'position' type.

With integers you can do things like concatenate two strings and adjust the indexes referring to the second string by adding the length of the first one. If you invent a new position type you have to add support for several things like this.

In any case I think the Python people were right to carry on using integers.


That would force a conversion from opaque type to integer, which would force creation of the rune to byte index. It doesn't have to be handled as a special case. The opaque type thing is an optimization hidden from the user. If you try to look at it, you get the integer value, expensively.


It's a shame to do all that work at runtime though, when the result is just going to be that the byte number in the opaque value is increased by the byte length of the first string.

I do think there's a great deal to be said for indexing and pattern matching returning opaque 'locations' (particularly if/when we have languages that let you verify at compile time that you're using the location with the right string).

But I fear doing it well would be distinctly fiddly.


> Go and Rust expose UTF-8 at the byte level. This is something of a headache and may result in invalid string slices.

Rust will panic on invalid slices unless you first convert to raw bytes, and then it will not allow converting invalid slices back to a string in safe rust (in unsafe you're obviously on your own).

Safe Rust guarantees and requires[0] that strings are valid UTF8 at all times.

That aside, essentially all of your desires are part of Swift's string, you should check them out.

> Requests for s[0] to s[N], and s[-1] to s[-N], for small N, should be handled by working forwards or backwards through the UTF-8. (Yes, you can back up by rune in UTF-8. That's one of the neat features of the representation.)

Rust does that through the `chars()` iterator[1] which iterates through USVs (codepoints) and can be iterated from both ends. Sadly unlike Swift it does not ship with a grapheme cluster iterator. Happily there is a unicode_segmentation crate[2]. Swift also uses iterators but has more of them: the default iteration works on extended grapheme clusters, and alternate iterators are USV, UTF-16 and UTF-8.

If indexing is necessary for some reason Rust also has char_indices() which iterates on the USV and its (byte) position in the string.

> - Lookup functions such as "index" should return an opaque type which represents the position into that string. If such an object is used as a subscript, there's no need to build the index by rune. If you coerce this opaque type into an integer, the index table has to be built. Adding or subtracting small integers from this opaque type should be supported by working backwards or forwards in the string.

That is what Swift does. `String.index(of:String)` will return a String.Index: https://developer.apple.com/documentation/swift/string.index and indexed String methods will work based on that index type. This includes "reindexing" (offsetting) which is done using String.index(String.Index, offsetBy: String.IndexDistance). Furthermore String exposes two built-in indexes startIndex and endIndex as well as an "indices" iterator.

> This would maintain Python's existing semantics while reducing memory consumption.

It would not maintain O(1) USV indexing (especially in the C API), which was the reason for not just switching to UTF8.

In fact, FSR strings already contain a full UTF8 representation of the string[3], which the latin1 representation can share for pure ASCII strings.

[0] a non-utf8 str is one of Rust's 10 undefined behaviours, part of the "invalid primitive values" section alongside null references or invalid booleans: https://doc.rust-lang.org/nomicon/meet-safe-and-unsafe.html

[1] https://doc.rust-lang.org/std/primitive.str.html#method.char...

[2] https://kbknapp.github.io/clap-rs/unicode_segmentation/index...

[3] https://github.com/python/cpython/blob/49b2734bf12dc1cda80fd...


I'm not a fan of how Python 3 stores Unicode strings internally. In my opinion they should have gone with UTF-8. The extra scanning and conversion puts more pressure on the processor and caches under load.

I agree that Python 2's Unicode handling is broken. That's why I just stored UTF-8 in a normal string and avoided the whole mess. The only thing I have to do is validate that any input from the outside world is really UTF-8.
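The validation step can be as small as a strict decode at the boundary. A sketch in Python 3 syntax (`is_valid_utf8` is an illustrative name); Python 3's strict decoder rejects overlong forms and encoded surrogates as well as stray bytes:

```python
def is_valid_utf8(data: bytes) -> bool:
    """True if `data` decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

print(is_valid_utf8(b"caf\xc3\xa9"))    # True: proper two-byte sequence
print(is_valid_utf8(b"caf\xe9"))        # False: Latin-1 byte, not UTF-8
print(is_valid_utf8(b"\xc0\xaf"))       # False: overlong encoding of '/'
print(is_valid_utf8(b"\xed\xa0\x80"))   # False: encoded surrogate
```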


Since the high-level API is supposed to let you treat a string as a sequence of code points, a correct implementation (which Python didn't have until 3.3!) would've imposed the overhead of conversion to something resembling a fixed-width encoding whenever a programmer invoked certain operations.

And the vast majority of strings in real-world Python contain only code points also present in latin-1, which means they can be stored in one byte per code point with this approach. And for strings which can't be stored in one byte per code point, you were similarly going to pay the price sooner or later.


> Since the high-level API is supposed to let you treat a string as a sequence of code points,

I disagree with that premise. It should operate on grapheme clusters. Operating on code points falls into the same trap as operating on bytes.

> a correct implementation (which Python didn't have until 3.3!) would've imposed the overhead of conversion to something resembling a fixed-width encoding whenever a programmer invoked certain operations.

Those operations should have been removed. Indexing is the big one that needs fixed width internal representation for speed. Code could have been rewritten to not require indexing. But mechanical translation from Python 2 to 3 was a goal and because of that they couldn't radically change the unicode API for the better.

> And the vast majority of strings in real-world Python contain only code points also present in latin-1, which means they can be stored in one byte per code point with this approach. And for strings which can't be stored in one byte per code point, you were similarly going to pay the price sooner or later.

You're going to pay the price for 4 byte per codepoint strings quite often. A single emoji will blow up a latin-1 string to 4 times the size.
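The size effect is easy to measure. This sketch relies on CPython-specific behaviour (`sys.getsizeof` and the flexible string representation introduced in 3.3), so the numbers are an implementation detail rather than a language guarantee; `storage_width` is a made-up helper.

```python
import sys

def storage_width(ch: str) -> int:
    """Bytes CPython spends per code point in a string made only of `ch`."""
    return (sys.getsizeof(ch * 2001) - sys.getsizeof(ch * 1001)) // 1000

for label, ch in [("ASCII", "a"), ("Latin-1", "\u00e9"),
                  ("BMP", "\u20ac"), ("astral", "\U0001F600")]:
    print(label, storage_width(ch))

plain = "a" * 1000
with_emoji = plain + "\U0001F600"   # one emoji widens every code point
print(sys.getsizeof(plain), sys.getsizeof(with_emoji))
```

Whether that matters in practice depends on how many long, mostly-ASCII strings with a stray emoji a program actually holds in memory at once.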


> That's why I just stored UTF-8 in a normal string and avoided the whole mess.

This only works if every library that you use agrees with you on this, and treats all strings you pass to it as UTF-8 whenever encoding matters.

OTOH, if you don't care about that, then you might as well just use bytes everywhere, and get the same thing. At least in Python 3, with bytes, if a library does try to use it as a string, you'll get an error, rather than silent wrong output.


> This only works if every library that you use agrees with you on this, and treats all strings you pass to it as UTF-8 whenever encoding matters.

Not exactly. The library just has to treat it as a string and not worry about the encoding (i.e. not try to encode it to/from the unicode type).

Only ran into this issue once and the library had an option to return everything as string so not a problem.

> At least in Python 3, with bytes, if a library does try to use it as a string, you'll get an error, rather than silent wrong output.

Bytes in Python 3 don't support string operators.


> Bytes in Python 3 don't support string operators.

Slight nitpick: `bytes` objects in Python 3 do not share all of the operations and methods available on `str`, but do share quite a few. Notably, `bytes` will never implement format(), but it does implement printf()-style formatting via the modulo operator.

The `bytes` and `bytearray` types implement the following methods which also exist on `str` (in some cases, with the caveat that the operation only makes sense if the bytes in question are in the ASCII range):

capitalize(), center(), count(), endswith(), expandtabs(), find(), index(), isalnum(), isalpha(), isdigit(), islower(), isspace(), istitle(), isupper(), join(), ljust(), lower(), lstrip(), maketrans(), partition(), replace(), rfind(), rindex(), rjust(), rpartition(), rsplit(), rstrip(), split(), splitlines(), startswith(), strip(), swapcase(), title(), translate(), upper(), zfill()
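
A quick sketch of both points, assuming Python 3.5 or later for the `%` operator on `bytes`:

```python
name = b"caf\xc3\xa9"  # UTF-8 bytes for "café"

# printf-style formatting works on bytes via the modulo operator.
line = b"%s has %d bytes" % (name, len(name))

# There is no bytes.format() method.
has_format = hasattr(bytes, "format")

# str-like methods exist, but only touch bytes in the ASCII range:
# the two bytes encoding "é" are left alone.
upper = name.upper()

print(line, has_format, upper)
```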


I didn't realize the modulo operator for bytes was added. Most information I've run across said it didn't work.

Unfortunately most libraries for 3 will be using str so using bytes with UTF-8 inside will become more and more difficult.


> I didn't realize the modulo operator for bytes was added. most information I've run across said it didn't work.

It was added in Python 3.5 (IIRC that's the last backwards compatibility feature added, I don't remember 3.6 adding any, or any being planned for 3.7).


> The library just has to treat it as a string and not worry about the encoding (i.e. not try to encode it to/from the unicode type).

If I pass a library a string it receives a Unicode string, bytes already decoded using an encoding. It shouldn't be able to re-decode that in any way, whatever that is supposed to mean on a technical level.

If a library receives a byte-array representing text, that is a completely different matter and talking about encodings is fully appropriate, even required.

But this matter should predominantly exist at your application's barrier, when doing IO.

If you're regularly doing encoding and decoding anywhere else, you're doing something wrong (or your language is).


Look back a few posts. We're discussing using UTF-8 in str and avoiding the unicode type in Python 2.

In my use case I validate the string as UTF-8 from the internet. To and from the database is UTF-8 so no validation is required there. Output back to the internet requires no additional steps.

Nowhere in this method is encode or decode required or desired.


> Not exactly. The library just has to treat it as a string and not worry about the encoding (i.e. not try to encode it to/from the unicode type).

Or do anything else that implies encoding. Like measure length, index, slice, change case etc.


Change case, yes, that would require actually decoding the string to the unicode type. But that could be done when needed and not every time something from my database needs to go out to the client.

Slicing works fine on a UTF-8 string as I'm slicing between ASCII characters which don't appear inside a non ASCII character. If I needed to slice between certain code points it would still be easy as I just look for the appropriate 2-4 byte sequence and slice before or after it. Python doesn't support graphemes so can't do much with those.
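
A sketch of why that is safe, written with Python 3 `bytes` standing in for the Python 2 `str` I'm describing (the sample record is made up): in UTF-8 every byte of a multi-byte sequence is 0x80 or higher, so an ASCII delimiter can never match in the middle of a character.

```python
# "naïve,café,日本語" as UTF-8 bytes
record = "na\u00efve,caf\u00e9,\u65e5\u672c\u8a9e".encode("utf-8")

# Splitting on an ASCII byte never cuts a multi-byte character in half.
fields = record.split(b",")
decoded = [field.decode("utf-8") for field in fields]

# Every byte of a multi-byte sequence has the high bit set.
multibyte = "\u65e5".encode("utf-8")
all_high = all(byte >= 0x80 for byte in multibyte)

print(decoded, all_high)
```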

Measuring length is not something that comes up for me. And indexing to an absolute spot in a string never comes up at all.

But yes, if I did have to call a text processing library I'd have to then encode/decode to the Unicode type. But that's rare enough that I can keep everything UTF-8.


http://bit.ly/unipain is my go-to reference whenever i get tripped up on what's going on with unicode in python.

it is significantly more sane in python 3.3+.


I've always been curious on how this change in 3.3 impacts the C/C++ interface. I don't really know where to look it up, and since I haven't yet had to code a C++ library for Python I've had no burning need to answer the question.


The Python C API grew some new functions and constants which are aware of what's going on and can tell you what encoding a particular Unicode object is using, read from/write to it, etc. The pre-3.3 APIs have a lot of deprecations in favor of the new API. If you want to use new API on a Unicode string created via old API, you have to use the new PyUnicode_READY() on it first.


https://www.python.org/dev/peps/pep-0393/ has details down to C API changes related to the FSR implementation.


Question: if you read a file is there an algorithm that will make sure it is parsing the right encoding?


Not reliably, no. You can detect if it's an invalid string according to the encoding you're currently using (value > 127 for ASCII, invalid surrogate pair for UTF-16) but there are lots of byte sequences that produce valid (but semantically meaningless) output in multiple encodings. To choose between them programmatically requires your algorithm to understand the meaning of the string as well as be able to decode it, which might be possible in limited domains, but is a very hard problem in general.
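
A minimal illustration of the "valid in multiple encodings" problem (the sample text is just an example):

```python
raw = "caf\u00e9".encode("utf-8")  # b'caf\xc3\xa9'

as_utf8 = raw.decode("utf-8")      # what the writer meant
as_latin1 = raw.decode("latin-1")  # also decodes without error, as mojibake
as_cp1252 = raw.decode("cp1252")   # same mojibake, still no error

# Only the stricter encoding can actually reject these bytes.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

print(as_utf8, as_latin1, as_cp1252, ascii_ok)
```

Nothing in the bytes themselves says which of the successful decodes is the right one.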


If this was phrased as a question it would be a trick one.


"How Python does Unicode: Poorly."


Python 2 maybe. Python 3 does Unicode wonderfully well; I miss it whenever I'm working with other languages.


All Python 3 did was put a hard barrier between bytes and strings. That's it.

Missing is all the grapheme handling that languages that do Unicode strings right have.


I'd be curious to hear why you think that.


Unicode is a horrible scam, the worst thing to happen to digital language representation. This is all just so much turd polishing.

(Also, that explanation of UTF-8 is crap. UTF-8 is beautiful quite apart from its utility, but you'd hardly know it from the article.)

I've said it before: Unicode is a conflation of a good idea and an impossible idea. The good idea is a standard mapping from numbers to little pictures. That's all ASCII was. The impossible idea is a digital code for every way humans write. It's a form of digital cultural imperialism.

Unicode Consortium et al. are absurdly arrogant.


Critical rants that don't suggest a better alternative, or describe what a better alternative might look like even in outline, are rarely informative or persuasive.


Step One: Admit there's a problem.

I heard, "Tell me more about what you think would be better." Here goes:

For written languages that are well-served by a simple sequence of symbols (English, etc.) there is no problem: a catalog of the mappings from numbers to pictures is all that is required. Put them in a sequence (anoint UTF-8 as the One True Encoding) and you're good-to-go.

For languages that are NOT well-served by this simple abstraction the first thing to do (assuming you have the requisite breadth and depth of linguistic knowledge) is to figure out simple formal systems that do abstract the languages in question. Then determine equivalence classes and standardize the formal systems.

Let the structure of the language abstraction be a "first-class" entity that has reference implementations. Instead of adding weird modifiers and other dynamic behavior to the code, let them be actual simple DSLs whose output is the proper graphics.

Human languages are like a superset of what computers can represent.

Here's the Unicode Standard[1] on Arabic:

> The basic set of Arabic letters is well defined. Each letter receives only one Unicode character value in the basic Arabic block, no matter how many different contextual appearances it may exhibit in text. Each Arabic letter in the Unicode Standard may be said to represent the inherent semantic identity of the letter. A word is spelled as a sequence of these letters. The representative glyph shown in the Unicode character chart for an Arabic letter is usually the form of the letter when standing by itself. It is simply used to distinguish and identify the character in the code charts and does not restrict the glyphs used to represent it.

They baldly admit that Unicode is not good for drawing Arabic. I find the phrase "the inherent semantic identity of the letter" to be particularly rich. It's nearly mysticism.

If it is inconvenient to try to represent a language in terms of a sequence of symbols, then let's represent it as a (simple) program that renders the language correctly, which allows us to shoehorn non-linear behavior into a sequence of symbols.

If you think about it, this is already what Unicode is doing with modifiers and such. If you read further in the Unicode Standard doc I quoted above you'll see that they basically do create a kind of DSL for dealing with Arabic.

I'm saying: make it explicit.

Don't try to pretend that Unicode is one big standard for human languages. Admit that the "space" of writing systems is way bigger and more involved than Latin et al. Study the problem of representing writing in a computer as a first-class issue. Publish reference implementations of code that can handle each kind of writing system along with the catalog of numbered pictures.

From the Unicode Standard again:

> The Arabic script is cursive, even in its printed form. As a result, the same letter may be written in different forms depending on how it joins with its neighbors. Vowels and various other marks may be written as combining marks called tashkil, which are applied to consonantal base letters. In normal writing, however, these marks are omitted.

Computer systems that are adapted to English are not going to work for Arabic. I'd love to use a language simpler than PostScript to draw Arabic! Unicode strings are not that language.

Consider the "Base-4 fractions in Telugu" https://blog.plover.com/math/telugu.html

The fact that we have a way to represent the graphics ౦౼౽౾౸౹౺౻ is great! But any software that wants to use them properly will require some code to translate to and from numbers in the computer to Telugu sequences of those graphics.

Let that be part of "Unicode" and I'll shut up. In the meantime, I feel like it's a huge scam and a kind of cultural imperialism from us hacker types to the folks who are late to the party and for whom ASCII++ isn't going to really cut it.

To sum up: I think the thing that replaces Unicode for dealing with human languages in digital form should:

A.) Be created by linguists with help from computer folks, not by computer folks with some nagging from linguists (apologies to the linguist/computer folk who actually did the stuff.)

B.) We should clearly state the problems first: What are the ways that human language are written down?

C.) Write specific DSLs for each kind of writing. Publish reference implementations.

I think that's it. Are you informed? Persuaded even? Entertained at least? ;-)

[1] http://www.unicode.org/versions/Unicode9.0.0/ch09.pdf


That's a really good explanation of your position and reasons for it, thank you.

>They baldly admit that Unicode is not good for drawing Arabic.....I'd love to use a language simpler than PostScript to draw Arabic! Unicode strings are not that language.

Unicode isn't good for drawing anything. Unicode is not intended to, and does not try to, encode how a text should be displayed. At all, even slightly. This is the root of my disagreement with your post. You're claiming it can't accurately render the appearance of text, but that simply isn't its purpose. It is purely and only about encoding the graphemes. Glyphs are what fonts and display technologies like PostScript are for, not Unicode.

You could argue that it should do that, perhaps Unicode should be a vector drawing language or something, but it's hard to see how that would make it useful for text processing that does concern itself with graphemes and grapheme-like units. Unless the display oriented system you want contained within it a grapheme encoding system like Unicode to facilitate that - but then why not work the other way around and use Unicode for that and build a display system on top of Unicode to address your concerns?

I think trying to have your cake and eat it with a family of distinct DSLs would be problematic. Text processing is bad enough, but how would you process the content of a string that is actually a DSL? With Unicode it's possible to write a library that can process text in any script, even ones not in the standard yet, but if text could consist of computer code in any one of thousands of different domain specific languages, how would you ever be able to write one piece of code to work with all of them and all possible future permutations? Finally if your DSL is producing display output, how does that work with fonts? What if you want to vary the appearance of the output, how do you apply that to the encoding output? It just seems that this approach produces an enormous monolithic super-complex rabbit hole with no bottom in sight.


> That's a really good explanation of your position and reasons for it, thank you.

Cheers, I've had time to think and some sleep. I apologize to you and the people I've offended with my cranky trollish manner.

> Unicode is not intended to, or try to encode how a text should be displayed.

This made me realize "text" traditionally is exactly language that is displayed somehow. The whole concept of storing writing as digital bits is metaphysical. Barely so for e.g. English, but quite a lot for e.g. Arabic.

> [Unicode] is purely and only about encoding the graphemes.

If it's just a catalog mapping numbers to little pictures (technically to collections, or families, of glyphs, or even to non-specific heuristics for deciding if a graphical structure counts as a glyph for a grapheme [1]) then I'll shut up. But what about the modifiers and stuff?

Maybe I am being unfair to Unicode. I don't want to deny or denigrate the cool and useful things it actually does do. As I said I think it's a combination of a good idea (encoding graphemes) with an impossible idea (encoding written human languages). If Unicode isn't the latter then I've been shouting at the wrong cloud!

- - - -

Here's what I'm trying to say: Imagine a conceptual "space" with ASCII on one side and PostScript on the other. In between there's a countably infinite set of formalisms that can describe and render human languages. From this point of view, the Unicode standard is a small part of that domain but it is absorbing (in my opinion) so much of the available time and attention that other potentially more-useful regions of the domain are completely neglected.

- - - -

So, yeah, I think we should study languages and writing systems and computerize them carefully with native speakers and writers and linguistic experts in the room. And I think we would need what are in effect DSLs for each kind of writing system. (Not every language, but rather every kind of way that languages are written down.)

> how would you process the content of a string that is actually a DSL

Parse it to a data-structure, the simplest that will suffice for the language's structure. Work with it using defined functions (API). This is what we do already but the fact that English could be represented as array<char> reasonably well tends to obscure it.

    string_value.split()

Or better yet:

    >>> s = "What is the type of text?"
    >>> s.title()
    'What Is The Type Of Text?'

> With Unicode it's possible to write a library that can process text in any script

That seems like it's true but I don't think it is true in practice. In your reply to mjevans elsewhere in this thread,

> You can't determine [the correct way of connecting the characters] purely from Unicode, you have to also know the conventions used in writing Arabic script. However Unicode is not intended to encode such conventions.

And you point out that Unicode won't help you properly support cut-and-paste for Arabic. So you can't process text using Unicode if that text is Arabic. In fact, there may not be "text" in Arabic the way there is in English! There is written Arabic but not textual Arabic. In other words, Unicode may well be engaged in creating the textual form of Arabic (and other languages.)

> any one of thousands of different domain specific languages

I think there would be less than a hundred distinct formalisms that together could capture the ways we have come up with to write, perhaps less than a dozen, but I wouldn't want to bet on it.

> how would you ever be able to write one piece of code to work with all of them and all possible future permutations?

Maybe you can't.

But if it's possible it will be by figuring out the type of text, which means exactly to figure out the set of functions that make sense on text. At which point your code can use those functions (the API of the TextType) to abstract over text. Like the str.title() method. Does that even make sense in Chinese or Arabic?
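
For what it's worth, here is a sketch of what Python 3 actually does with that method on scripts that have no letter case (the sample strings are mine, chosen only for illustration):

```python
english = "what is the type of text?".title()

chinese = "\u6587\u672c\u7684\u7c7b\u578b"  # 文本的类型
arabic = "\u0646\u0635"                      # نص

# Neither script has upper or lower case, so title() changes nothing.
chinese_unchanged = chinese.title() == chinese
arabic_unchanged = arabic.title() == arabic

# Even within Latin script, case mapping is not one code point to one:
sharp_s = "\u00df".upper()  # German sharp s

print(english, chinese_unchanged, arabic_unchanged, sharp_s)
```

So the method is defined for every string, but it is only meaningful for some of them.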

The comment by int_19h in this thread speaks to this point really well:

> It's not about encodings at all, actually. It's about the API that is presented to the programmer.

> And the way you take it all into account is by refusing to accept any defaults. So, for example, a string type should not have a "length" operation at all. It should have "length in code points", "length in graphemes" etc operations. And maybe, if you do decide to expose UTF-8 (which I think is a bad idea) - "length in bytes". But every time someone talks about the length, they should be forced to specify what they want (and hence think about why they actually need it).

> Similarly, strings shouldn't support simple indexing - at all. They should support operations like "nth codepoint", "nth grapheme" etc. Again, forcing the programmer to decide every time, and to think about the implications of those decisions.

> It wouldn't solve all problems, of course. But I bet it would reduce them significantly, because wrong assumptions about strings are the most common source of problems.
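
A rough sketch of what "no default length" could look like. The function names are hypothetical, and the grapheme count is only a crude approximation (it skips combining marks using the standard library's `unicodedata`); it is not real grapheme cluster segmentation, which the standard library doesn't provide.

```python
import unicodedata

def length_in_bytes(text, encoding="utf-8"):
    """Size of the text once encoded; depends entirely on the encoding."""
    return len(text.encode(encoding))

def length_in_code_points(text):
    """What Python 3's built-in len() reports for str."""
    return len(text)

def length_in_approx_graphemes(text):
    """Crude approximation: count code points that are not combining marks."""
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)

# 'e' followed by COMBINING ACUTE ACCENT: one visible character.
s = "e\u0301"
print(length_in_bytes(s), length_in_code_points(s), length_in_approx_graphemes(s))
```

Three different answers for the same string, which is the point: the caller has to say which one they mean.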

What you're asking for is the base type for "text" for all languages, the ur-basestring, if you will. (It may not exist.)

> Finally if your DSL is producing display output, how does that work with fonts? What if you want to vary the appearance of the output, how do you apply that to the encoding output? It just seems that this approach produces an enormous monolithic super-complex rabbit hole with no bottom in sight.

Well again, computerized text is a new thing under the sun, different from writing, which has been happening all over the world for thousands of years (cf. Rongorongo[2]). Separating the "text" from the written form of the text (the display) is a new and metaphysical thing to do. For languages like English we get pretty far with encoding the Alphabet and some punctuation marks and putting them in a row. We completely bunted on capitalization though; we pretend that 'a' and 'A' are two different things. Typefaces can be abstracted from the stream of encoded byte/characters and treated as metadata. If you want to include it in a digital document you immediately have to define a DSL (Rich Text Format for example) to shoehorn the metadata back into the byte stream. Complications ensue.

For some languages (e.g. Arabic) it may not make sense to abstract the display of the text from the text. (Again, writing is exactly display. It is literally (no pun intended) the act of displaying language.) You have to include metadata in addition to the graphemes in order to recreate the correct display of the text, so you have to have some kind of DSL for the task.

As I said above, I don't think there are more than one or two dozen truly different ways of writing. A set of DSLs (perhaps not dissimilar to the generative L-Systems that can produce myriad realistic plant-like images from a small set of operations) could presumably model those ways of writing.

Unicode was a start on computerization of written languages. I think an approach that treats each kind of writing system as a first-class object of study in its own right will give us standard models for dealing with text in each kind in digital form. We should strive for computerized writing systems that are "as simple as possible, but no simpler." And, yes, it seems to me that some of them will have to include producing display output.

[1] DuckDuckGo image search for "letter A" https://duckduckgo.com/?q=letter+a&t=ffsb&atb=v60-2_b&iax=1&...

[2] https://en.wikipedia.org/wiki/Rongorongo

- - - -

Here's my "Cartoon History of Unicode":

    1. ASCII exists
    2. Europe does too!  Extend ASCII with the funky umlauts or whatever.
    3. Oh shit! Japan! Mojibake!
    4. I know! Let's use *sixteen* bits!  That'll solve everything.
    5. What do you mean Chinese is different from Japanese?
    6. WTF Arabic!?
    7. Boy there sure are a lot of graphemes.  Gotta collect 'em all.
    8. PIZZA SLICE
    9. POOP
	
At which point we reach "peak internet" and Doge appears to say "wow".


> Unicode is a horrible scam, the worst thing to happen to digital language representation.

> They baldly admit that Unicode is not good for drawing Arabic

> Consider the "Base-4 fractions in Telugu" [...] any software that wants to use them properly will require some code to translate to and from numbers in the computer to Telugu sequences of those graphics.

Written language is hard to represent, encode and draw. You admit that Unicode & UTF-8 got 2 out of 3 right yet you call it a scam.

Your complaint is a scam and a horrible trollism.


When I'm trolling you'll know it. I have a point, I believe it's a good point, and I'm making it.

For languages that can be represented as a sequence of little pictures Unicode is a little better than ASCII. For the rest, it's a scam: We tell people that we have a way of dealing with human languages in computers but it's half-baked, born in ignorance, and all the grotty details are papered over, but you can write PIZZA SLICE or POOP now, so fuck it, ship it.

Represent: "Astral Plane"? What does that have to do with a standard catalog mapping numbers to pictures? I feel Unicode messes that up.

Encoding: UTF-8 is near perfect, 'nuff said. Ken Thompson and Rob Pike doing their thing.

Drawing: Doesn't even begin to touch it really.

Unicode is a nasty little black hole that's sucking up time and other resources and not really solving the problem.


I think that's too negative. Is Unicode perfect? Of course not, but it's the best we've got for now. Just as Morse, Baudot, or ASCII were the best approximations at one point in time.

It's a hard problem and will take decades for the right solutions/implementations to present themselves. Surely one day there will be an improved successor to Unicode. Things are a lot better than they were even ten years ago, however.


Yeah, sorry, I was pretty cranky last night. Please see my reply to simonh in this thread a few minutes ago. (I'm basically agreeing with you.)


Speaking for myself, Unicode's original fundamental mistake was one that could only be recognized as a mistake in hindsight: insisting on round-trip compatibility with existing encodings.

Round-trip compatibility meant Unicode had to not only adopt but permanently preserve all the mistakes and inconsistencies of encodings which were popular at the time. Which is how we get a bunch of duplicates, a bunch of code points that are there but only supposed to be used for round-tripping, some of the un-fun edge cases for Latin text where things have both composed and decomposed forms, some of the weirder aspects of equivalence and normalization, etc.

At the time it seemed like a smart and rational thing to do since it meant you could losslessly transition from your existing character set, and then losslessly go back to it if you wanted to, but now that Unicode "won" it's just a source of "well, that's annoying and inconsistent but they needed it for round-tripping" explanations.

In particular, round-trip compatibility meant that Unicode ended up containing a bunch of variant forms of things that existing encodings treated as distinct characters, but which probably would not pass the test of being distinct graphemes by Unicode's definition. Declaring the variant forms to be a contextual issue left up to the font or the rendering system would have been better.

Ironically, the second big mistake was to then try to switch philosophies and do just that with the CJK characters, sparking the whole Han unification mess.


> I feel like it's a huge scam and a kind of cultural imperialism from us hacker types to the folks who are late to the party and for whom ASCII++ isn't going to really cut it.

It's more pay-to-play than "cultural imperialism". Arabic does seem to suffer due to no primarily-Arabic country being a member of the consortium (IIRC and it hasn't changed in the last five years). If someone was willing to absorb that cost then they could almost certainly get things done (e.g. look at Japanese).

While the pay-to-play aspect is obviously not utopia, it does seem to work quite well in practice: Arabic does have a large amount of support as is; to get more, you really need to have people who use Arabic as primary members so they can make the hard decisions.


"Hey Arabs, we'll computerize your language if you pay for it or show up otherwise we'll do it anyway, poorly, because it's fun for us and it makes us feel like we're helping. Hope that works for you 'cause it's what you're going to get whether you want it or not."

Yeah, I don't have a lot of respect for that.


Is it possible to, based purely on the context of the symbols that the Unicode standard has added, determine the correct way of connecting the characters? Does the written language have such a formal mechanic?

Or are tashkil extra distinctive modifications that might be like spices or sauces per (whatever you want to call a single display slot)?


You can't determine that purely from Unicode, you have to also know the conventions used in writing Arabic script. However Unicode is not intended to encode such conventions.

Suppose these conventions change, as they have throughout history? Or if there are different variations of these conventions in different regions or sub-dialects? Also for example in Arabic it's often possible to determine the pronunciation of a word from its context in a sentence, but in other contexts it isn't and so the tashkil are added. There's no way for a system like Unicode to decide that for you. For example suppose you cut-and-paste the word from one sentence into another, should Unicode somehow automatically add or remove the tashkil? No, that's up to the author (e.g. performing the edit in a word processor) or the program performing the operation if it's being done programmatically.

Unicode provides one layer in the stack. Fonts provide another layer. Program code or editorial sensibility provides another layer. Many criticisms of Unicode are premised on the expectation that it should be solving problems that belong to another layer. Not all criticisms, it's a complex system that has had to make many compromises and there have been a series of mistakes in its history, but taken overall it's been unbelievably successful and useful.

I'm so in awe of the way it solves such a huge range of problems in the space that people picking nits about the gaps that remain piss me off, especially when they're based on a fundamental misunderstanding of the problem it's actually solving. Cynicism is easy, solving hard problems is not. I know who gets my respect.


It sounds like the written language is more similar to vocal musical instructions; and also like the 'spices' analogy that I was making is how the changes in display and connecting forms work.

With so much complexity and variability present at the time in history that the written form becomes fixed, I can't imagine any solution actually being easy, and can only think of editor software offering an emoji like list of likely 'accents' for a given 'word' (and breaking it down per character for corrections).

Such a system sounds incredibly tedious for user and programmer alike. I am glad that written 'western' languages became 'fixed' many centuries ago.


> that explanation of UTF-8 is crap. UTF-8 is beautiful quite apart from its utility, but you'd hardly know it from the article

My goal was not to judge UTF-8 aesthetically, but to explain how it works and point out that it's a variable-width encoding which emphasizes its compatibility with ASCII for strings containing only code points <= U+007F.
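
Both properties are easy to check from Python 3; a minimal sketch with sample characters picked only for illustration:

```python
# One sample character from each UTF-8 width class.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]
widths = [len(ch.encode("utf-8")) for ch in samples]

# For text limited to code points <= U+007F, UTF-8 and ASCII
# produce identical bytes.
ascii_text = "plain ASCII"
same_bytes = ascii_text.encode("utf-8") == ascii_text.encode("ascii")

print(widths, same_bytes)
```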

> Unicode Consortium et al. are absurdly arrogant.

I would agree that Unicode as it exists today involves some historical and historic bad decisions. But again, staying off value judgments with respect to Unicode itself since the point of the article was to explain how Python now handles it internally.


Oh, Hi there.

Apologies for being cranky. You did a great job explaining how Python now handles Unicode!

To me it was strange reading about UTF-32 first and then getting to UTF-8 from that context. It seemed to obscure the coolth and beauty of the format.

Overall a great article, sorry again for being so negative.


That section was written for people who know little to nothing about Unicode and the ways Unicode can be encoded to bytes. So it starts with the obvious approach -- just spit out a sequence of bytes whose integer values are the code points, which is near enough as makes no difference to how UTF-32 works -- then introduces variable-width encoding through the history of UCS-2 and UTF-16, then gets to UTF-8 and what motivated it.
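
That progression can be sketched in a few lines of Python 3. The big-endian codec names are used here so that no byte order mark gets prepended:

```python
ch = "\U0001F600"  # a code point outside the Basic Multilingual Plane

# UTF-32: the code point's integer value, written out as 4 bytes.
utf32 = ch.encode("utf-32-be")
direct = ord(ch).to_bytes(4, "big")

# UTF-16: 2 bytes inside the BMP, a 4-byte surrogate pair outside it.
utf16_bmp = len("A".encode("utf-16-be"))
utf16_astral = len(ch.encode("utf-16-be"))

print(utf32 == direct, utf16_bmp, utf16_astral)
```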

The advantages/disadvantages of the various encodings are something that could eat up several pieces just as long as the entire post, and for fun I'd probably throw in weird stuff like the attempt to do EBCDIC-compatible UTF instead of ASCII-compatible, etc.


Someone should write up EBCDIC-based UTF as an RFC. I'm sure that there's at least one COBOL programmer out there that has been waiting for that for decades.

ETA: Mostly a joke, but it would also fit right in with things like WTF-8 (https://simonsapin.github.io/wtf-8/)


It wasn't a joke. UTF-EBCDIC is a Unicode Technical Report:

http://www.unicode.org/reports/tr16/


aw, now i'm cranky again. lol



