- C sort of assumes all strings are ASCII
- Go, Dart, Rust and Zig all return to single-byte encodings. They all use UTF-8 internally, and have different functions for interacting with the string as a list of bytes versus as a list of Unicode code points. (In Go's case it's because Rob Pike was behind both Go and UTF-8.)
It's delightful to see Swift stepping away from Obj-C's mistakes and improving things internally. Although in Obj-C's defence, NSString has always hidden its internal representation, which did a lot of work to protect developers from the footguns.
One of the big ironies of all this is how well C has stood the test of time. C strings interoperate beautifully with UTF-8 - all that's missing is some libc helper methods to count & iterate through UTF-8 code points. strlen / strncpy / strcmp / etc all work perfectly when dealing with UTF-8. The only change is that you need to supply lengths in bytes, not characters.
I wouldn't say that.
I'd rather claim that UTF-8 has been engineered to be ASCII pass-through compatible because of the lacking Unicode support in most C-codebases out there.
I mean... if you compare Unicode support in platforms and OSes where this was clearly thought through (Windows, C#, Java) to platforms with the more naive approach (Linux, C, PHP, etc), you will see a very clear picture of which side has the most Unicode bugs and encoding errors.
And I say that as a Linux-guy.
C is terrible for text-processing. UTF-8 was designed as it was because of that.
Trying to paint C as the best choice here because someone found a working solution they could also apply to terrible C-code is clearly seeing things backwards.
Their insistence on UCS-2, however, gave us the worst outcome of all. Because of UTF-16's surrogate mechanism, Unicode is forever capped at U+10FFFF, roughly 21 bits' worth of code points. This misdesign creates a bigger problem than C's char type.
"Reasonable people adapt themselves to the world. Unreasonable people attempt to adapt the world to themselves. All progress, therefore, depends on unreasonable people."
Likewise, C was created in a time when it looked like 7 or 8 bits would suffice. That turned out to be wrong. UTF-8 was invented to solve that problem without redoing C strings from scratch.
$ python3 --version
$ python3 hello.py
$ python3 hello.py > out.txt
$ cat out.txt
C:\Users\u\tmp>python hello.py > out.txt
Traceback (most recent call last):
File "hello.py", line 1, in <module>
cp1252.py", line 19, in encode
UnicodeEncodeError: 'charmap' codec can't encode character '\u3053' in position
0: character maps to <undefined>
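For what it's worth, the failure above is easy to reproduce directly, without involving the console: encoding the hiragana character from the traceback with the cp1252 codec raises the same UnicodeEncodeError. (A sketch; the exact console behaviour also depends on the Python version and the active code page.)

```python
# '\u3053' is こ (HIRAGANA LETTER KO), the character in the traceback above.
# cp1252 is a charmap-based codec, which is why the error message names
# the 'charmap' codec rather than cp1252 itself.
try:
    "\u3053".encode("cp1252")
    err = None
except UnicodeEncodeError as exc:
    err = exc

print(err)
```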
The Python unidecode PyPI package is your best friend if you have to deal with it.
Agreed. And it's not really as much a native Windows app, as it is an emulator for the "good old" times of MS-DOS, which has received the necessary adjustments to keep chugging along, but little else.
Given that, you probably have what is the single weakest link, Unicode-wise, you can find anywhere in the entirety of the Windows universe.
It's really a completely obsoleted, abandoned mess. The fact that we had to wait until Windows 10 before you could actually resize the window freely should tell you at what level you should put your expectations.
If you want to live your life in the command-line on Windows (for python or whatever else), do yourself a favour and get any other terminal.
There are really quite a few to choose from; Cmder, ConEmu and MinTTY spring to mind.
Claims Unicode support is decent in Windows 8 and up.
Python 3 internally uses ASCII, UCS2 or UCS4 for its strings depending on which is most space efficient but still capable of representing the string. It can do that because Python strings are immutable* and because it is impossible for user code to see what the encoding is (to see the bytes of a string you must explicitly convert it, at which point you specify the destination encoding and Python ensures the translation is correct). If you join a UCS2 string with a UCS4 string then the result is automatically UCS4 and there's no way to tell from user code (except by memory usage)!
There's a good reason for this: indexing into a UTF8 string takes O(n) time because you must parse all the bytes before that point in order to count the number of characters. Iterating over a string by index would either take O(n^2) time or you would have to use some sort of awkward string iterator. If you do not provide any way to index by character (rather than byte position) then I would argue that your data structure is not really a "string". I suppose an alternative implementation would be to use UTF-8 but have an auxiliary data structure that maps between character indices and byte positions.
* Immutability matters because if you had a mutable string in ASCII and you changed one character to be a code point >127 then you'd need to copy the string before making the modification, which if possible at all would be O(n) for what you would expect to be O(1).
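The auxiliary-index idea at the end of the parent is straightforward to sketch: one O(n) pass records the byte offset of every code point, after which indexing by code point is O(1). (Python sketch over a bytes buffer; the helper name is my own invention.)

```python
def codepoint_offsets(data: bytes) -> list[int]:
    # In UTF-8, continuation bytes match 10xxxxxx; every other byte
    # starts a new code point.
    offsets = [i for i, b in enumerate(data) if b & 0xC0 != 0x80]
    offsets.append(len(data))  # sentinel: one past the last code point
    return offsets

buf = "naïve".encode("utf-8")   # 5 code points, 6 bytes
idx = codepoint_offsets(buf)

# O(1) lookup of the i-th code point after the one-time O(n) scan:
third = buf[idx[2]:idx[3]].decode("utf-8")
```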
Our fundamental ideas about how a string should work and what kinds of things we might want to do with one are deeply influenced by the languages spoken by the pioneers of the computer age. Short of someone from a very different language tradition not being exposed to these implicit assumptions in their formative programming years I don't think we are ever going to know what a different path would look like.
And it needs to be handled with care... there are edge cases where it can bite you. For example, if Unicode is being used internally to process data in a legacy CJK encoding, normalisation may lose distinctions that are needed for accurate round-trip conversion.
Another surprise "gotcha" is that simply concatenating two already-normalised strings may give you a result that is not normalised.
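The concatenation gotcha is easy to demonstrate in Python with the unicodedata module: two strings that are each individually in NFC form can concatenate into a string that is not.

```python
import unicodedata

base = "e"
accent = "\u0301"  # COMBINING ACUTE ACCENT

# Each piece is already NFC-normalised on its own.
base_is_nfc = unicodedata.normalize("NFC", base) == base
accent_is_nfc = unicodedata.normalize("NFC", accent) == accent

# But NFC composes the concatenation into the single code point U+00E9 (é),
# so the concatenation itself is not normalised.
combined = base + accent
combined_nfc = unicodedata.normalize("NFC", combined)
```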
Here's the thing: You simply don't do that, ever. It's a meaningless operation. Most you would do is iterate over a string.
Computed indices into strings are a no-go, but that is far from the only use case.
What I mean is that "extract the 42nd codepoint/glyph/whatever from this UTF-8 string" is a pointless operation for free-form strings, because in a free-form string the character position is meaningless.
(Basically all non-ASCII UTF-8 is pretty much a black box. You can't do serious computation with general Unicode because it's so complex and, as a consequence, ill-defined in practice).
Doesn't work well with BIDI languages, or Mongolian. Seriously if you haven't tested with Mongolian your code is probably wrong.
You can do that just fine with UTF8, both Swift and Rust allow it.
UTF-8 is as efficient as ASCII in the ASCII range, and at least as efficient as UCS-4 beyond the BMP.
The only iffy point is the U+0800 ~ U+FFFF range (Samaritan script to the end of the BMP), which mostly affects extremely content-dense (mostly text, little to no markup) Asian and native scripts, as well as a few African ones.
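The size comparison is easy to check in Python (a rough sketch; "efficiency" here means encoded byte length only, ignoring markup and compression):

```python
ascii_text = "hello world"
cjk_text = "こんにちは"       # Japanese, inside the U+0800..U+FFFF range
emoji = "\U0001F600"          # beyond the BMP

# ASCII range: UTF-8 is byte-for-byte identical to ASCII.
utf8_ascii = len(ascii_text.encode("utf-8"))

# U+0800..U+FFFF: UTF-8 needs 3 bytes per code point where UTF-16 needs 2.
utf8_cjk = len(cjk_text.encode("utf-8"))
utf16_cjk = len(cjk_text.encode("utf-16-le"))

# Beyond the BMP: 4 bytes in UTF-8, UTF-16 and UCS-4 alike.
utf8_emoji = len(emoji.encode("utf-8"))
ucs4_emoji = len(emoji.encode("utf-32-le"))
```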
So... Windows? Unicode has always worked fine for me on Linux, and has always been a complete pain on Windows, as a software user and even worse as a developer. The only problem until about 6-7 years ago was libre Unicode fonts with complete coverage, but they are now pretty good.
Reversing cause and effect, methinks: UTF-8 interoperates beautifully with ASCII. :P
That's exaggerating the truth, I think. Programmers can and do presume strlen's result is the length of the string in characters (and the type name "char" does not help), or the number of terminal columns a string occupies. E.g., I'm pretty sure the mysql client can still, to this day, not properly format a table, and IIRC it is written in C. strcmp() does a byte-for-byte comparison of strings, which in Unicode may very well be meaningless. strncpy() has its own footguns with leaving the buffer unterminated. (But yes, as long as your indexes into the string make "sense", aligned to code points that will make sense at the destination, it works fine.)
If you have to write any of the helper methods to count & iterate through UTF-8 code points, you have to watch out for the footgun that char is sometimes signed, sometimes not, so attempting to inspect the high bits to determine things like "is this a continuation byte?" with >> or & is implementation-defined, depending on the specifics of the platform. (Right-shifting a negative value is implementation-defined in C, and char can be, and often is, signed; & on a signed int operates on the bit pattern of whatever the underlying signed integer representation is, which is itself implementation-defined. Yes, yes, all the world is two's complement, but it's still an easy trap to hit.)
unsigned char * is better for both raw binary buffers and dealing w/ buffers of UTF-8 data.
(Honestly, I'm not even sure that C requires the character literal 'A' to correspond to an ASCII "A". But all the world is ASCII…)
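The "count code points" helper the grandparent asks for is a one-liner once you're working with unsigned bytes; here's the same bit test sketched in Python (whose bytes are always unsigned, which is exactly why the C version wants unsigned char):

```python
def utf8_codepoint_count(data: bytes) -> int:
    # A UTF-8 continuation byte always matches 10xxxxxx, so counting the
    # bytes that are NOT continuation bytes counts the code points.
    # Python bytes are unsigned, so b & 0xC0 is always well defined; the
    # C version must go through unsigned char to dodge the signedness trap.
    return sum(1 for b in data if b & 0xC0 != 0x80)

n = utf8_codepoint_count("héllo".encode("utf-8"))  # 6 bytes, 5 code points
```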
Or, even better, uint8_t. Fundamental types don't have guaranteed fixed sizes, only minimums. Explicitly asking for exactly 8 bits can save sanity in edge cases in a long-running codebase where nobody is sure what other people are doing anymore.
This is overly pedantic IMO. Practically speaking, a char will always be 8 bits unless you're working in embedded (and even then, IIRC only some ancient DSP devices have a non-8-bit char). If you're writing code for POSIX, a char will always be 8 bits, as it will on Windows^. Almost every embedded device also works with an 8-bit char, and anything that doesn't you probably know about in advance, and probably aren't handling Unicode text on.
^ Citation needed
That’s preferable to building and misbehaving, but if you got that far without realizing you’re dealing with an unusually-sized char then you’re probably doomed anyway.
If you want an 8-bit unsigned type, uint8_t is that type. If it's available, going with it isn't "overly pedantic", not going with it is irresponsible.
What does "overly pedantic" even mean here? Whenever I see that phrase, it's often a case of someone who is wrong trying to shame someone who is right for the crime of being right.
It doesn't. Nor do chars have to be 8-bits.
EBCDIC is still alive and well on IBM mainframes.
Python 2 had "narrow builds" (2-byte unichar) and "wide builds" (4 byte unichar). Wide was generally the default on unices.
> - Go, Dart, Rust and Zig all return to single byte encodings. They all use UTF-8 internally, and have different functions to interact with the string as a list of bytes, and the string as a list of unicode codepoints. (In Go's case its because Rob Pike was behind both Go and UTF8.)
Dunno about Dart and Zig, but AFAIK Go is closer to Ruby's old model of "IDK lol". A number of string-processing functions assume (and sometimes even assert) UTF8, but that's not a guarantee at all
> It's important to state right up front that a string holds arbitrary bytes. It is not required to hold Unicode text, UTF-8 text, or any other predefined format. As far as the content of a string is concerned, it is exactly equivalent to a slice of bytes.
For Rust, on the other hand, having non-UTF-8 byte sequences in a str is undefined behavior (hence str::from_utf8_unchecked existing and being unsafe).
> C strings interoperate beautifully with UTF8 - all thats missing is some libc helper methods to count & iterate through UTF8 codepoints. strlen / strncpy / strcmp / etc all work perfectly when dealing with UTF8.
C strings don't "interoperate with UTF8": they treat everything as a nul-terminated bag of bytes and that's it. Which means they won't properly process valid UTF8 sequences containing NUL, and they will improperly process invalid UTF8 sequences. Ignoring the entire thing is not interoperability, you can not actually operate on text using the C stdlib. All you can do is treat it as an opaque blob and move it from one place to the other mostly unmolested.
How so? Concerning runtime values, you can do whatever you want in C, which means UTF-8 support in C is basically free. Note that UTF-8 was explicitly designed to have this kind of upward compatibility.
Concerning string literals, just keep your source files encoded in UTF-8 and put in quotes the UTF-8 string you want to process at runtime. I'm not aware how the spec handles source code files that are not ASCII, but in practice it works, and furthermore you can always opt for hex escapes.
Technically the docs never said what the internal representation was, but it was not very well hidden at all. As the article points out, NSString presents an API with "constant-time access to UTF-16 code units".
I ran into a bug just the other day where emoji in a comment in a rust file with RLS caused a weird rendering error in VSC.
'�' // oops!
I care about this, having worked on realtime collaborative editors. In this space it definitely does make sense to describe string offsets with Unicode code points. The problem with counting grapheme clusters is that that number can change with every revision of the Unicode spec, and you can get different answers on different platforms. Or the same platform, but a different version of the OS. Does a zero-width space offset the position count or not? Do you get the same answer on Windows, macOS and whatever Unix you happen to run on AWS? I shudder just thinking about it.
In contrast, counting unicode codepoints is simple, consistent and well defined. Counting string lengths that way is also implemented in the standard library of just about all languages.
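Python's len is exactly this kind of code-point count, which is why it stays stable even where grapheme boundaries would not (grapheme-cluster counting needs a third-party library and tracks the Unicode version):

```python
# One visible "character", two code points: thumbs-up + skin-tone modifier.
thumbs = "\U0001F44D\U0001F3FD"

# One visible "character", two code points: 'e' + combining acute accent.
decomposed = "e\u0301"

# len() counts code points, so these answers are the same on every
# platform and every Unicode version.
thumbs_len = len(thumbs)
decomposed_len = len(decomposed)
```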
Why oops? What do you expect? Do you also index into random image data and expect it to be the first pixel's red component?
The first two are used by CFStringRef (bridged to NSString), the third by Swift strings that have been hit with UTF-16 offset lookups, and the fourth by Objective C tagged pointer strings (on 64-bit devices).
Yeah, I guess that NSString has mostly exhausted the possibilities for ways to fulfill their API obligations.
Bottom line, don't promote Python 2. At this stage, it's legacy. We maintain existing money makers with it, but new projects have no reason to be on 2.
That's actually not entirely accurate. NSString in Foundation was (and is, sort of) a class cluster that is quite flexible and agnostic about the actual encoding of the representations. I remember this quite clearly, because I fleshed out some of the missing NSString pieces in libFoundation, which was a fairly clean/simple implementation of the OPENSTEP Foundation spec.
What locked in the representation was CoreFoundation, Apple's C re-implementation of Foundation for the Carbon crowd, who apparently wouldn't countenance even linking against Objective-C, even if hidden behind a C facade.
So instead of taking the tried-and-true, fast and flexible Foundation and adding some wrappers, the existing Foundation had to be abandoned and rebuilt on top of a newly-written much inferior object-oriented C library.
With that, we got much worse performance, much reduced flexibility in terms of representation and locking away of flexibility. Just as an example, the binary plist format is quite capable, but almost all of that capability is lost because it's hidden behind a monolithic C API.
Now there are some issues in the API, for example just a single "length" attribute that clashes with NSData's length and makes it...challenging...to present both Byte and String faces at the same time, making for a lot of otherwise unnecessary copying and conversion.
Isn’t this exactly the design goal of UTF8? To be backward compatible with ASCII strings?
"A Dart string is a sequence of UTF-16 code units"
You can get them as "runes": https://www.dartlang.org/guides/language/language-tour#runes
That's not true for C# and Python2. While both were first released in 2000 and Unicode code points beyond the BMP first appeared with Unicode 3.1.0 in 2001, the foundation for additional Unicode planes was laid as early as 1996 with Unicode 2.0.0.
Single byte storage, so it works well with utf-8.
Byte length field, so it knows the length of every string and it can store null bytes.
An additional null byte past the end and pointer shifting, so every Pascal string is also a C pchar string.
Reference-counted copy-on-write, so you can copy it in constant time for reading and treat it as if the string was fully copied for writing.
Low-level memory management, so it can store 2 GB long strings. And in most situations you do not need a string builder, because strings are freed immediately and very quickly when the refcount reaches zero, without staying around as garbage.
Behaves as a value type with the null pointer being the empty string, so you can never get a null pointer exception. Any string operation always returns a valid string.
(Optionally) index checked with the length, so you can never get a buffer overflow.
Unfortunately the newer FreePascal/Delphi versions made it all very confusing by adding an encoding field. In the past you could just assume all strings are UTF-8 as a code style rule. Now each string has an individual encoding. Strings are converted automatically between encodings, but not always, so in any function that needs to access the characters you kind of need to check whether it was called with a UTF-8 string, a Latin-1 string, or some other encoding.
Delphi/Pascal had fucking terrible strings.
> Single byte storage, so it works well with utf-8.
No awareness of encoding, so easy to break your text if you try to do anything other than pipe it through.
> Byte length field, so it knows the length of every string and it can store null bytes.
Single-byte length field, so your strings can't exceed 255 bytes, and you need a pointer indirection just to get the length of your string.
> An additional null byte past the end and pointer shifting, so every Pascal string is also a C pchar string.
Scratch that, 254.
> Reference-counted copy-on-write, so you can copy it in constant time for reading and treat it as if the string was fully copied for writing.
Rc'd cow, so you never know whether your write is going to be free or incur a complete copy.
> Behaves as a value type with the null pointer being the empty string, so you can never get a null pointer exception. Any string operation returns always a valid string.
The null pointer is the empty string, you don't get to know whether you got no input or got an empty input.
> (Optionally) index checked with the length, so you can never get a buffer overflow.
(Optionally) you're completely unable to implement generic string operations because the string type carries a length and the language gives no way to be generic over that.
> No awareness of encoding
> Single-byte length field
> Scratch that, 254.
is completely wrong. Or rather, hasn't been right for several decades.
A modern Delphi String has unlimited length (well, 32-bit value, a multi-gigabyte string is okay for 1996, I think), carries encoding (see my comment in this same thread), etc.
COW is another argument, but one that seems to have won out over time. Many string implementations use many tricks to achieve something similar, or go with COW directly.
>No awareness of encoding, so easy to break your text if you try to do anything other than pipe it through.
Which is good when you have data with an unknown encoding. Unix file names, for example: you can create a file whose name is invalid UTF-8, and many tools that want to store the file name in a string with an encoding simply cannot access that file.
>Single-byte length field, so your strings can't exceed 255 bytes, and you need a pointer indirection just to get the length of your string.
Ansistrings have a 32-bit length. I have not used Delphi on a 64 bit system, but in FreePascal they have a 64-bit length there.
shortstrings have a 255 max length, but since they are short they are mostly stored on the stack and you do not need any pointer indirection.
>Rc'd cow, so you never know whether your write is going to be free or incur a complete copy.
Shortstrings are not rc'd cow and are always copied. Since they are short and on the stack that copying is fast.
No write on long strings is free, since you can get a cache miss. And you only make a copy if some other function still holds a reference to the string. How could that happen? The function should only keep the string around if it still needs the old value, in which case it would need to make a copy anyway. No COW always leads to more copying.
>The null pointer is the empty string, you don't get to know whether you got no input or got an empty input.
If you want no input, you can use a pointer to a string.
>(Optionally) you're completely unable to implement generic string operations because the string type carries a length and the language gives no way to be generic over that.
Ansistrings do not have a length in the type, nor the basic shortstring. Nowadays you would define all string operations for ansistrings (=utf8strings).
I can clear this up. This change is really useful and important if you are using strings.
A string of bytes has attached metadata saying what it is. Is it ANSI of some sort, or UTF8, or...? Is it a specific encoding, such as Windows-1252? Without that data, all you have are bytes, and you don't know how to interpret them.
Thus, RawByteString (bytes); UTF8String (UTF8); and ANSI strings with the encoding, plus UnicodeString which is native Unicode on whichever platform (eg, on Windows it matches Windows UTF16.)
This data is essential to convert to and from different string types. I don't know where conversion does "not always" happen - can't think of anywhere offhand. But if you ever run into issues, there are RTL functions for conversion. Check out the TEncoding class: http://docwiki.embarcadero.com/Libraries/Tokyo/en/System.Sys...
> In the past you could just assume all strings are UTF-8 as code style rule.
This was an incorrect assumption, because before the encoding metadata was added, you would have been using an AnsiString there, and that, by definition ('Ansi'), is not UTF8. These days, if you have a UTF8 string, you can place it in a UTF8String type. Correctness enforced by libraries is much better than a coding convention that a certain type contains a subtly different payload. That way lies horror. Metadata and strong typing is much safer.
I do agree with you that Delphi has the best strings :) Copy on write and embedded length both seem a real win, after twenty years of use, not to mention great string twiddling methods.
I'm looking at adding string_view support to the strings currently (for C++17 support); one thing it highlights is how much more powerful the inbuilt String types are, and how much string_view is a workaround for a problem in C++'s string design which other string libraries - not just ours, but ours is IMO very good - do not suffer from.
Check, underlying storage is a byte array with the utf-8 encoding.
Check, getting a C string from a Swift string is O(1).
> Low-level memory management, so it can store 2 GB long strings.
Check, just tested with a 4 GB string on macOS.
> And in most situations you do not need a string builder, because strings are freed immediately and very fast when the refcounts zeros without staying around as garbage.
Check, Swift objects are reference counted, storage of intermediary strings will be freed as soon as they go out of scope.
> Behaves as a value type with the null pointer being the empty string, so you can never get a null pointer exception.
Check, String is a value type.
Check, this is the default: trap on out-of-bounds access rather than an illegal memory access. If you really want, you can disable it by compiling with -Ounchecked.
> Unfortunately the newer FreePascal/Delphi versions made it all very confusing by adding an encoding field. In the past you could just assume all strings are UTF-8 as code style rule
Check. String operations are performed on grapheme clusters (rather than for example UTF-8 code points), which is generally the right level of abstraction to work with. There are "views" for accessing specific encodings.
No, not anything that is OpenJDK 9+ based, which uses 1 byte per character where possible (compact strings).
> They all use UTF-8 internally
Which means a lot of functions now have linear instead of constant asymptotic complexity.
They already do if they do proper text manipulation as unicode itself is variable length and has to be stream-processed. O(1) access to codepoints is not actually useful, and most languages don't even provide it since they don't internally encode to UTF-32.
It's been a huge win for D.
C strings interoperate terribly with almost any encoding, including ASCII and UTF8.
C strings cannot contain the NUL character.
Are they that popular outside of chat apps? How many emojis on this very page?
> how many emojis on this very page ?
HN literally strips out arbitrary codepoints — including emoji — from comments. So zero. Because HN forbids them.
The fact that Swift is aiming for ABI stability, something neither C++ nor Rust have because we all rely on C for FFI, is very interesting.
let fields = line.split(separator: ",")
let fields = line.unicodeScalars.split(separator: ",").map(Substring.init)
If you're assuming CSV is a binary format, you should split on code units before the textual decoding.
Trivial proof: If it did, then it would be literally impossible to represent a field starting with a combining character.
Also, the set of combining characters has changed over time. Machine-readable formats in general do not change as Unicode does. A CSV that parses today should not fail to parse next year because a field starts with a codepoint that today is unused and next year has been assigned to a combining character.
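Python's str.split works at the code-point level, so the combining-character field from the proof above comes through intact; a grapheme-level split would instead glue U+0301 onto the preceding comma and lose the separator:

```python
# A CSV line whose second field *starts* with a combining character.
line = "x,\u0301y,z"   # U+0301 is COMBINING ACUTE ACCENT

# Code-point-level splitting sees both commas as separators.
fields = line.split(",")
```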
In Swift, string views are a different type from strings. For the most part you can perform similar operations on them but they are not the same type, so a `Substring` clearly tells you that it's linked to arbitrary-size baggage, while a `String` tells you it owns its data. You can operate generically over both using string protocols, or you can specifically ask for one or the other.
Why on earth not just use utf8 from the start? Surely no micro-optimization made possible by their choice could be worth such a convoluted design?
So NSString uses a bunch of wacky encodings internally to pack very small strings into the 8 byte pointer that would otherwise be used to point to an object on the heap.
I suspect that micro-optimizations like this have done a lot of the work in making iOS outperform Android clock-for-clock. Although I suppose right now that's less of a big deal given how fast their A-series chips are as well.
Nitpick: these pointers are odd, so they can't point to a valid object on the heap (malloc is 16-byte aligned on macOS). So it makes sense to reuse them by tagging them and packing a string (or number, or date) into the remaining bits to save on an allocation.
This gets all the benefits people usually want from "just use UTF-8", and then some -- strings containing only code points in the latin-1 range (not just the ASCII range) take one byte per code point -- and also keeps the property of code units being fixed width no matter what's in the string. Which means programmers don't have to deal with leaky abstractions that are all too common in languages that expose "Unicode" strings which are really byte sequences in some particular encoding of Unicode.
Python is inexorably committed to idioms which assume fixed-width characters; there's no persuading the community to use e.g. functions to obtain substrings rather than array indexes. So this is an understandable design decision.
Python strings are not iterables of characters. They're iterables of Unicode code points. This is why leaking the internal storage up to the programmer is problematic; prior to 3.3, you'd routinely see artifacts of the internal storage (like surrogate pairs) which broke the "strings are iterables of code points" abstraction.
e.g. functions to obtain substrings rather than array indexes
Strings are iterables of code points. Indexing into a string yields the code point at the requested index. While I'd like to have an abstraction for sequences of graphemes, strings-as-code-points is not the worst thing that a language could do. And all the "just use this thing that does exactly the same thing with a different name because I want indexing/length but also want to insist people don't call them that" is frankly pointless.
Array index syntax over variable width data is problematic: either deceptively expensive -- O(n) for what looks like an O(1) operation -- or wrong. I suspect in that we agree.
As for the alternative, I'm talking about Tom Christiansen's argument here: https://bugs.python.org/msg142041
To paraphrase Tom's examples, this usage of array indexes is more idiomatic...
s = "for finding the biggest of all the strings"
x_at = s.index("big")
y_at = s.index("the", x_at)
some = s[x_at:y_at]
s = "for finding the biggest of all the strings"
some = re.search("(big.*?)the", s).group(1)
However, in terms of language design for handling Unicode strings, I prefer the tradeoffs of the second idiom: a single O(n) operation which is relatively easy to anticipate and plan for, rather than unpredictable memory blowups.
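For reference, the two idioms above produce the same substring when run side by side:

```python
import re

s = "for finding the biggest of all the strings"

# Idiom 1: explicit index arithmetic.
x_at = s.index("big")
y_at = s.index("the", x_at)
some_by_index = s[x_at:y_at]

# Idiom 2: one regex pass, a single predictable O(n) scan.
some_by_regex = re.search("(big.*?)the", s).group(1)
```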
Which is why Python uses a solution that ensures fixed-width data. There's never a need to worry if a code point will extend over multiple code units of the internal storage model, because the way Python now handles strings ensures that won't happen.
I really think a lot of your problem with this is not actually with the string type, but with the existence of a string type. You want to talk in terms of bytes and indexes into arrays of bytes and iterating over bytes. But that's fundamentally not what a well-implemented string type should ever be, and Python has a bytes type for you (rather than a kinda-string-ish-sometimes type that's actually bytes and will blow up if you try to work with it as a string) if you really want to go there.
Rust also has string indexes; they are not opaque, they are byte indices into the backing buffer, and slicing will straight-up panic if an index falls inside a code point.
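The contrast is easy to see from Python's side: indexing yields the i-th code point in O(1) regardless of how many UTF-8 bytes each code point would occupy, whereas Rust-style byte indices shift with the encoding:

```python
s = "a\u00e9\U0001F600"   # 1-, 2- and 4-byte code points in UTF-8

# Python indexing is by code point, independent of the storage width.
second = s[1]
third = s[2]

# The same characters live at uneven *byte* offsets in the UTF-8 encoding:
# b"a" at 0, é at 1..3, the emoji at 3..7.
encoded = s.encode("utf-8")
e_from_bytes = encoded[1:3].decode("utf-8")
```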