No such thing! Strings are an array of integer Unicode code points. Stop thinking about bytes at this level. The internal representation of strings and chars does not matter, because you as a programmer only ever see integer code points.
Encoding only enters the picture the moment you want to convert your string to or from a byte array (for example to write to disk or send over the network). The encoding, such as "UTF-8", then specifies how to map the array of abstract code points to an array of 8-bit bytes.
* Newer languages expose strings as an array of Unicode code points, which is cleaner because it's independent of any particular encoding.
* Even when working with code points, you can't safely reverse strings or anything like that, since a user-perceived character might consist of multiple code points.
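For what it's worth, here's a small Python 3 sketch of both points: the encoding only shows up at the str/bytes boundary, and naive reversal by code points can split a user-perceived character.

    text = "noe\u0308l"          # "noël" written as 'e' + U+0308 COMBINING DIAERESIS

    # Encoding appears only at the boundary: str <-> bytes.
    data = text.encode("utf-8")  # bytes, ready for disk or the network
    assert data.decode("utf-8") == text

    # Reversing by code points splits the user-perceived character "ë":
    print(text[::-1])            # 'l\u0308eon' -- the diaeresis now sits on the 'l'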
It is still very important in general to distinguish a string from a byte array, and I felt the article was fairly counter-productive with all its talk about "UTF-8 strings". That phrase doesn't make much sense: either you have a byte array that you can apply UTF-8 decoding to in order to get a string out, or you already have a string, in which case encodings in the traditional sense (UTF-8, ISO-8859-1/latin1, etc.) don't apply.
Eh. I agree and disagree. I agree in the sense that the phrase "UTF-8 string" is generally a misnomer and is a good signal that there's some confusion somewhere, but I don't find it as damning as you do. In particular, not all languages represent strings as sequences of codepoints; some instead make their internal UTF-8 byte representation a first-class part of their string API. Two languages that come to mind are Go and Rust, where Go's representation is conventionally UTF-8 and Rust's is enforced UTF-8. In both cases, accessing the raw bytes is not only a standard operation, but is necessary whenever you want to do high-performance string processing.
That is, if someone said Go/Rust had "UTF-8 strings," that wouldn't be altogether wrong. UTF-8 is a first-class aspect of both string APIs, and both also provide the standard functions you'd expect from Unicode strings.
Python 2 is in the same boat concerning internal representation (it will use UTF-16 much of the time), but it doesn’t expose any UTF-16ness in the API.
Don't do that (the array part). Guaranteeing O(1) indexing isn't very useful but precludes efficient storage.
You imply that there is some universal "string" definition, when there is no such thing.
Literally, "string" just means sequence. Beyond that, it depends on the context.
* C++ std::string is a sequence of char (aka byte, defined to be 8+ bits).
* Swift string is a sequence of Unicode graphemes.
* Ruby String is a sequence of bytes/octets, with an attached (and mutable) encoding.
* Python 2 str is a sequence of bytes/octets.
* Python 3 str is a sequence of Unicode code points.
* ECMAScript string (and Java java.lang.String) is a sequence of UTF-16 code units.
Moreover, the entire idea of "Strings are an array of integer Unicode code points" is confined to Unicode. It doesn't say anything about other character sets, e.g. ASCII, ISO 8859-1, or Windows-1252. (Though AFAIK Unicode is a superset of those particular three.)
So...I think you can safely say "Unicode strings are sequences of Unicode code points."
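To make the list above concrete, here's a rough Python illustration of how the same text comes out to different "lengths" depending on which unit is being counted (Python 3 counts code points; a UTF-16-based API like Java's or JavaScript's counts code units; byte-oriented strings count bytes):

    s = "a\U0001F600"                       # 'a' plus an emoji outside the BMP

    print(len(s))                           # 2  code points (Python 3's unit)
    print(len(s.encode("utf-8")))           # 5  bytes / UTF-8 code units
    print(len(s.encode("utf-16-le")) // 2)  # 3  UTF-16 code units (what Java/JS would report)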
I'd argue that, generally, strings are simply arrays of chars, which are bytes.
The failure here was keeping the name "string" for what are arrays of code points instead of bytes.
Unicode strings are arrays of code points, which are 21-bit numbers.
If the API requires fast subscripting (it usually does), then they would be UTF-32 or three-code-points-in-an-int64; otherwise a more compact internal representation is possible.
If you don't require subscripting and allow only iteration over the list of code points, then the in-memory representation of strings can be more compact. It can use UTF-8 or even SCSU or BOCU-1.
Some languages use polymorphic Unicode strings which store ASCII if the value is all-ASCII and switch to something else if it isn't (Python 3.3 and Factor come to mind).
In C, strings were always semantically a sequence of characters (as they are commonly defined elsewhere).
For a while a character was one byte, so the distinction was unimportant (and became blurred).
A char was both a character and a byte. A string was both a sequence of characters and an array of characters... and an array of "char"s, and an array of bytes.
Code -- and programmers! -- became dependent on these equivalency assumptions.
Once it became clear we could no longer pretend that a maximum of 256 characters was tenable (fewer, actually, since the use of 0-31 for control/separation/termination had become standard), we were left with conflict, leading to a variety of uncomfortable choices and compromises.
One such conflict is "char"... should it retain its semantics or its size (one byte)?
The last time I developed seriously in C or C++ it had retained its size, but lost its semantics -- a char IS a byte now. (That was a while ago, I don't know if that's changed -- it sounds like from your post that it hasn't.)
I guess UTF-8 has won out in C and C++ (and elsewhere) so now, while a char is a byte, a C/C++ string is: (1) an array of char/bytes; (2) a sequence of characters. The thing that's been dropped is that a string is no longer an array of characters.
(In case there's confusion: here, "array" means an ordered sequence of elements of uniform size with O(1) random access, while a "sequence" is just an ordered series of elements that doesn't necessarily offer O(1) random access or uniform element size.)
V8 turns out to have a ton of internal string representations. They don’t affect semantics (the general point that JS string functions think in UTF-16 is valid) but they’re interesting.
V8 apparently stores all-ASCII strings as ASCII, so stuff like HTML tag names or base64 blobs doesn’t double in size.
Like Go, V8 lets you take a substring as a pointer into the larger string; the internal class is called SlicedString, but from JS-land you don’t see anything different from a string literal. As in Go, keeping a short substring of a long parent string keeps the whole parent ‘alive’ across GCs so sometimes folks will be surprised all those bytes are still allocated.
Unlike Go, V8 has a ConsString type, so concatenating strings sometimes doesn’t really immediately copy the underlying bytes anywhere. Building a string with a loop that runs str += newPiece probably goes faster than expected because of this. [It turns out it flattens the string the next time you index into it, or at least used to, which has some perf implications of its own: https://gist.github.com/mraleph/3397008]
That’s mostly from this post by a member of the Dart team http://mrale.ph/blog/2016/11/23/making-less-dart-faster.html ; his blog has a lot of interesting stuff about how these fine-tuned language implementations really work.
Kind of amazing the lengths the V8 (and other JS engine) teams went to in order to make the code they saw in the wild work well.
Python does this too.
In Python 2, and in Python 3 until 3.3, a compile-time flag determined the internal Unicode storage of the Python interpreter; a "narrow" build of the interpreter used 2-byte Unicode storage with surrogate pairs, while a "wide" build used 4-byte Unicode storage.
As of Python 3.3, the internal storage of Unicode is dynamic. Python 3 source code is always parsed as UTF-8, but then as string objects are created by the interpreter their memory representation is chosen on a per-string basis, to be able to accommodate the widest code point present in the string. So Python will choose either a one-byte, two-byte, or four-byte encoding to store the string in memory, depending on what code points are present in it.
This is very nice because it means iteration over a Python string is always iteration over its code points, the length of a string is always the number of code points in it, and indexing always yields the code point at that index, since the internal storage of the string is fixed-width and never has to include surrogates. (In pre-3.3 Python, "narrow" builds would actually yield up things for which ord() gave a value in the surrogate range, and code points requiring surrogates added 2 to the length of a string rather than 1.)
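A quick way to see the per-string choice of storage width on a 3.3+ interpreter (exact byte counts vary by build and version, so treat the numbers as illustrative):

    import sys

    print(sys.getsizeof("a" * 100))          # 1 byte per code point (ASCII/latin-1 storage)
    print(sys.getsizeof("\u4e16" * 100))     # 2 bytes per code point (BMP storage)
    print(sys.getsizeof("\U0001F600" * 100)) # 4 bytes per code point (full-range storage)

    # Whatever the storage, the user-visible behaviour is code points:
    assert len("\U0001F600") == 1
    assert ord("\U0001F600") == 0x1F600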
Our programming languages, type systems, and compilers are still extremely poor at specifying the properties of types, and at permitting variant implementations which preserve them. We're still struggling with basics, like memory layout for locality (e.g., arrays of structures of arrays). And many languages still can't manage multiple dispatch.
As we slowly become less crippled, it eventually becomes straightforward to use alternate representations and encodings. For instance, UTF strings with inline descriptive bitmasks are already a thing, as is substring-local encoding.
So it seems we needn't be trapped in a local maximum, at least long-term, and perhaps the future will eventually be more heterogeneous. Perhaps, like integers and floats, there's both diversity collapse (big-endian dies, IEEE 754) and heterogeneity (integer packing, SIMD).
Maybe WTF-8? https://simonsapin.github.io/wtf-8/
I doubt that this is going to be replacing anything except in the narrow use-cases mentioned on the same page.
Possibly, if CJK keeps gaining importance, there may be a new encoding which encodes those characters more compactly, rather than continuing the current primacy of Western/European character sets.
And no, CJK language users are not a minority, although many of us have become used to the font inconsistencies (because while wrong, they are legible to us).
Another point that is commonly missed is that Han unification was first proposed by East Asians (not ignorant Westerners), specifically the Chinese (which is why the Chinese don't really object to Unicode but the Japanese do).
A side note: it's interesting that you bring up ä as an example, because ä is actually a typographic unification of two very distinct letters. In languages like English and French, the diacritic is a diaeresis: a mark, derived from Ancient Greek, indicating that the vowel is to be pronounced separately rather than as a diphthong. In languages like German, it's an umlaut, a typographic reduction of a superscript e. They are in fact very different characters that merely look the same.
I am aware of the limits that the original Unicode specs were subjected to, which adds to my grief on how things could have been. Now that it has taken over the world we have to deal with wrong characters displayed for many applications. (Even modern day iOS has problems displaying mixed language content in system text fields.)
I don't know much about the history of western languages, so I cannot comment on the issue, just that it feels unjustified that during development CJK languages got crammed together and messed up (no matter who did it). More and more emojis and other symbols just rub salt into the wound.
For western alphabets that have thirty or so symbols, the need is not nearly so acute.
How would you like it if there were Latin/Greek/Cyrillic unification? I mean, they're all just alphabets that make the same sounds, they're all basically the same, right?
1. Introduce new methods which allow access to code points without indexing by UCS-2 code unit. An iterator over code points is the key thing. You could also have some opaque kind of index, with a method to get a list of indices for every code point in the string, and then a method to look up a code point by opaque index. The former could be expensive, as it would involve scanning the string to find where each character starts (although perhaps that could be done lazily?), but the latter should be cheap. The results of the expensive method could be cached. (A rough sketch of this idea follows after this list.)
2. Wait N years for the new methods to be widely adopted.
3. Shift the internal representation of strings to UTF-8. Anyone using the new methods will see similar, or better, performance. Anyone using the old methods will see a drop in performance. Implementations could even dynamically choose between representations based on usage patterns. The fact that each web page is its own little JS universe should make that fairly practical.
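A toy sketch of the opaque-index idea from step 1, in Python for brevity, assuming a string stored as UTF-16 code units (all the class and method names here are made up for illustration):

    class Utf16String:
        """Wraps a list of UTF-16 code units; indices into that list are the
        'opaque' indices handed back to callers."""

        def __init__(self, text):
            # Store as UTF-16 code units, roughly the way a JS engine would.
            raw = text.encode("utf-16-le")
            self.units = [int.from_bytes(raw[i:i+2], "little")
                          for i in range(0, len(raw), 2)]

        def code_point_indices(self):
            """The potentially expensive scan: one opaque index per code point."""
            i, out = 0, []
            while i < len(self.units):
                out.append(i)
                # A lead surrogate means the code point spans two code units.
                i += 2 if 0xD800 <= self.units[i] <= 0xDBFF else 1
            return out

        def code_point_at(self, index):
            """The cheap lookup by opaque index."""
            u = self.units[index]
            if 0xD800 <= u <= 0xDBFF:                 # combine a surrogate pair
                lo = self.units[index + 1]
                return 0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00)
            return u

    s = Utf16String("a\U0001F600b")
    print([hex(s.code_point_at(i)) for i in s.code_point_indices()])
    # ['0x61', '0x1f600', '0x62']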
Incorrect or wildly inaccurate. A string is conceptually text which may be (is) represented internally as bytes, through some means of encoding that text. And thus the concept of an encoding is introduced.
That (and how) text is represented should be an implementation-detail though: Strings represent text, not bytes.
I think most people miss this distinction and that's the main source of confusion for encoding-related problems among programmers.
Something like a language specification would have good reason to stick to talking about the semantics, but that isn't a reason to discard any source that uses 'is' to talk about a different aspect. For example, Go's authors have written explanatory posts both about the semantics of strings and their in-memory representations, and it would be weird to say one of those posts is Just Wrong if it says 'A string is [whichever view of strings is most relevant locally]'. Sometimes another way of looking at the system is clearly better for the immediate purpose (writing about memory-use optimizations you would need to know what's in memory) and sometimes discussions need to jump around and look at the different levels at different points (like here, where the post both fills folks in on interactions between various APIs and performance, which is a(n important!) aspect of the implementation).
Consider the phrasing my way of shielding you from accusations about being entirely wrong ;)
I strongly disagree. For high performance text search, it's critical that you deal with its in-memory representation explicitly. This violates your maxim that such things are only done at the boundaries.
For example, if you're implementing substring search, the techniques you use will heavily depend on how your string is represented in memory. Is it UTF-16? UTF-8? A sequence of codepoints? A sequence of grapheme clusters, where each cluster is a sequence of codepoints? Each of these choices will require different substring search strategies if you care about squeezing the most juice out of the underlying hardware.
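For instance, if the haystack happens to be UTF-8 in memory, a plain byte-level search is already correct (UTF-8 is self-synchronizing, so the encoded needle can't start matching in the middle of another character), while the same trick on UTF-16 bytes needs alignment and byte-order care. A rough Python illustration:

    haystack = "日本語のテキスト".encode("utf-8")   # pretend this is the in-memory representation
    needle   = "テキスト".encode("utf-8")

    # No decoding needed: a memmem-style byte search finds the right spot.
    print(haystack.find(needle))    # 12 -- a byte offset, not a character index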
The text encodings themselves (e.g. UTF-8, UTF-32) ought to be proper data types. Strings are a leaky abstraction that cause otherwise competent programmers to have funny ideas about what text is and isn't, as this entire thread demonstrates.
I do not completely agree with that. Text is what strings are mostly used for, but I wouldn't call a base64 encoded image text. Text is something a person can read and make sense of.
So I think it would be more accurate to say that a string is a sequence of characters (grapheme clusters in unicode speak).
This is all massively inefficient, of course. Most strings are representable as UTF-8, and using two bytes to represent their characters means you are using more memory than you need to, as well as paying an O(n) tax to re-encode the string any time you encounter an HTTP or filesystem boundary.
There's nothing stopping us from packing UTF-8 bytes into a UTF-16 string: using each of the two bytes of a code unit to store one UTF-8 byte. We would need custom encoders and decoders, but it's possible. And it would avoid the need to re-encode the string at any system boundary.
This line of thinking does not make sense to me, and I'm not sure what scheme is being proposed. First of all, "most strings" are representable as UTF-8? All strings are representable as UTF-8. And if you are reading data from a UTF-8 source and you don't want to pay the time and memory cost of decoding it and re-encoding it in a less space-efficient code, leave it as a Buffer of bytes! Don't decode it.
Because everything else in the world is moving to UTF-8, Node is also trying to move to UTF-8.
Node does make you specify which encoding you want when converting between strings and Buffers, but it's always embraced UTF-8 as its primary encoding, as far as I'm aware.
This is kind of silly. Java made the same decision and was certainly not "invented" in ten man-days!
Then after all those runtimes and APIs were established, the Unicode consortium realized 65,000 characters wasn't actually enough and blew up all those assumptions. Since then, things have been very uncomfortable in those systems, string-wise.
Next-generation systems are incorporating the semantics of modern Unicode, but it'll take a long while to fix this--you can't just redefine a primitive type like "string" in an existing giant codebase.
I didn't know Plan 9 was behind UTF-8. :)
We've come a long, long way since the 90s and even the 00s.
It's shocking and gratifying how well emojis work online. A great motivator for Unicode support, too. :)
I've seen this... questionable design decision in another interpreted/scripting language too. Those are encodings, but clearly not encodings at the same "layer of abstraction" as e.g. UTF-8 or Shift-JIS or UTF-32 or UTF-16, because you could have a UTF-16 string containing "base64" or "hex".
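The layering is easier to see in a language that keeps the two in separate APIs. In Python, for example, character encodings map text to bytes, while base64/hex map bytes to bytes:

    import base64

    text = "世界"
    raw  = text.encode("utf-8")            # character encoding: str -> bytes
    b64  = base64.b64encode(raw)           # transfer encoding: bytes -> bytes
    print(b64)                             # b'5LiW55WM'
    print(base64.b64decode(b64).decode("utf-8"))   # back to the original text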
Actually you can't, at least not in a standard way. There are combinations of code points that don't have a single code point equivalent. Flag emojis are an example, but there are many letterlike versions as well. There are enough unallocated points in the 32-bit space that probably you could manage to make single-point equivalents on your own.
Unicode doesn't prevent you from drowning letters in many combining marks (e.g. an accent, an umlaut, and an overhead tilde), or doing this with letters where they make no sense and for which no font will have useful positioning information (e.g. over a copyright sign). There's over 1700 combining marks in Unicode, which gives you way over 2^32 ways to use the combining marks with the letter a, let alone the rest of the letters in Unicode.
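For what it's worth, Python's unicodedata makes this easy to poke at: NFC normalization collapses the handful of combinations that do have precomposed forms and leaves everything else as a sequence.

    import unicodedata

    simple = "a\u0301"                   # 'a' + combining acute
    piled  = "a\u0301\u0308\u0303"       # 'a' + acute + diaeresis + tilde
    odd    = "\u00a9\u0301"              # copyright sign + combining acute

    for s in (simple, piled, odd):
        nfc = unicodedata.normalize("NFC", s)
        print(len(s), "->", len(nfc))
    # 2 -> 1   (á has a precomposed form, U+00E1)
    # 4 -> 3   (the acute composes; the extra marks have no precomposed form)
    # 2 -> 2   (no precomposed 'copyright with acute' exists)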
I'm guessing you'd have to write a C++ module for this, but any suggestions on how one might do this successfully?
Also, "Encoding is the process of squashing the graphics you see on screen, say, 世 - into actual bytes." No it's a way of representing one value in a way a system can more easily handle in a hopefully lossless fashion. Encoding has nothing to do with what's on screen other than that being one representation of the data.
Languages that use UTF-16 mostly use it only because they're older than UTF-8 (or are tied to a platform older than UTF-8).
I think you're right that older languages use UTF-16 and newer ones use UTF-8. But it also seems empirically true that UTF-16 languages do better at grappling with Unicode's subtleties, compared to UTF-8. The temptation of UTF-8-is-basically-C-strings is hard to ignore.
You have to deal with unpaired surrogates, though. And just because that wasn't annoying enough, any UTF-8 sequence which contains an encoded surrogate (which is technically invalid, but not prohibited by most implementations) is impossible to encode as UTF-16.
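Python's codecs make that asymmetry easy to demonstrate, using the 'surrogatepass' error handler to model the lax implementations that let encoded surrogates through:

    bad_utf8 = b"\xed\xa0\x80"    # UTF-8-style encoding of the lone surrogate U+D800

    # A strict decoder rejects it...
    try:
        bad_utf8.decode("utf-8")
    except UnicodeDecodeError as e:
        print("strict UTF-8 decode fails:", e.reason)

    # ...but a lax one lets it through; Python models that with 'surrogatepass'.
    s = bad_utf8.decode("utf-8", "surrogatepass")

    # And the resulting string has no valid UTF-16 encoding:
    try:
        s.encode("utf-16")
    except UnicodeEncodeError as e:
        print("UTF-16 encode fails:", e.reason)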
With UTF-8 you'll at least have a shot at noticing that you're not handling multi-unit codepoints well, while with UTF-16 you won't notice unless you test Chinese or a more off the beaten path language.
Regarding your last point, I'm totally with you (if I understood you correctly). Of course applications should support multiple input/output encodings, but as a programmer you have to decide on some internal representation.
That being said, I really don't see how processing UTF-8 is significantly more complex than processing, say, UTF-16. In both cases you need to handle continuation units for the extraction of Unicode code points.
UTF-8 has 4 valid cases, one for each length, and many more invalid cases for each length (2-byte sequence missing trail byte, 3-byte sequence missing 1 trail byte, 3-byte sequence missing 2 trail bytes, 4-byte sequence missing 3 trail bytes, 4-byte sequence missing 2 trail bytes, ..., overlongs, UTF-8'd surrogates, overflow, etc.). Differences between implementations' treatment of error cases have led to some security concerns; see https://hsivonen.fi/broken-utf-8/ and discussion at https://news.ycombinator.com/item?id=14451822 for an example.
UTF-16 has two valid cases (one or two code units) and two error cases (lead surrogate not followed by trail surrogate, lone trail surrogate). It's more like a DBCS, except each code unit is 2 bytes instead of 1.
As someone who has actually written UTF-8/UTF-16 conversion code, I can immediately tell you which one is far easier to implement: UTF-16. The number of valid cases is basically halved, and the number of error cases in UTF-16 is a fraction of those in UTF-8. Put another way, there are plenty more invalid UTF-8 sequences than invalid UTF-16 sequences.
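A quick tour of those error cases, leaning on Python's strict decoders to do the checking (each of these byte strings is rejected):

    bad_utf8 = [
        b"\x80",              # lone continuation byte
        b"\xc3",              # 2-byte sequence missing its trail byte
        b"\xc0\xaf",          # overlong encoding of '/'
        b"\xed\xa0\x80",      # encoded surrogate half
        b"\xf4\x90\x80\x80",  # overflow: beyond U+10FFFF
    ]
    for b in bad_utf8:
        try:
            b.decode("utf-8")
        except UnicodeDecodeError as e:
            print("utf-8 :", b, e.reason)

    bad_utf16 = [
        b"\x00\xd8",          # lone lead surrogate (little-endian)
        b"\x00\xdc",          # lone trail surrogate
        b"\x41",              # odd number of bytes / truncated code unit
    ]
    for b in bad_utf16:
        try:
            b.decode("utf-16-le")
        except UnicodeDecodeError as e:
            print("utf-16:", b, e.reason)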
In any case, the discussion here is the appropriate string API, and the relative difficulty of working with those. Exposing UTF-8 versus UTF-16 changes essentially nothing: in both cases you need to either deal with non-integer indexes or deal with integer indexes where not all values are valid.
Good string APIs are hard. Most Unicode-aware languages pick one particular encoding and then toss the programmer in the deep end with it.
The only language I've seen get it vaguely correct is Swift. (I'm sure there are others, but it's definitely not common.) Swift strings provide multiple views, so you can work with UTF-8, UTF-16, UTF-32, or grapheme clusters, as you need. It doesn't allow using integer indexes directly, so you have to confront the fact that indexing is actually non-trivial. Swift 3 requires using views, and Swift 4 makes the String type itself a sequence of grapheme clusters, which is usually the correct answer to the question of "what unit do you want to work with?"
In my experience, I have not yet found a case where I ever wanted to use grapheme clusters. Most algorithms want to iterate over Unicode codepoints (e.g., when rendering fonts). Even in display and editing cases, grapheme clusters aren't necessarily the right thing to use for the backspace key or left/right motion.
UTF-16's validation concerns are:
1. Broken surrogate pairs, which is mostly benign.
2. Byte-order confusion.
While UTF-8 has:
1. Invalid code points, for example, code points for surrogate halves.
2. Invalid code units, such as 0xFF.
3. Non-shortest forms, where a character may be encoded multiple ways.
4. Representation of NUL, and potential for confusion with APIs that expect null-terminated strings.
In practice the UTF-8 issues have caused much more serious vulnerabilities.
In any case, these are all concerns for a decoder, but not for an API, which is what we're discussing here. In fact, the original comment I replied to up there was advocating the opposite: UTF-16 internally, and UTF-8 for interchange!
while (*c) count += ((*(c++) & 0xC0) == 0x80) ? 0 : 1;
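    /* counts code points: UTF-8 continuation bytes (10xxxxxx) add 0, all other bytes add 1 */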
Counting the number of UTF-8 code units in a UTF-8 string is of course trivial. Counting the number of UTF-16 code units in a UTF-8 string would take more work. But there's probably no reason you'd want to compute that anyway.
Back then, the trade-off made sense, and was popular too, because variable-sized encodings were much more rare. These days UTF-16 really is the worst of both worlds, but we are stuck with what we have.