JavaScript string encoding (inburke.com)
103 points by iamwil on Sept 3, 2017 | 90 comments

I don't think this article is very good. It seems to make the "newbie unicode error" of assuming that strings "have" an encoding (or that strings "are" UTF-8) and thinking in terms of bytes in strings. For example, in the JSON paragraph, it makes the cardinal sin of referring to "... creating UTF-8 strings".

No such thing! Strings are an array of integer unicode code points. Stop thinking about bytes at this level. The internal representation of strings and chars does not matter, because you as a programmer only ever see integer code points.

Encoding only enters the picture the moment you want to convert your string to or from a byte array (for example, to write to disk or send over the network). The encoding, such as "UTF-8", then specifies how to map the array of abstract code points to an array of 8-bit bytes.
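In JavaScript, that boundary is exactly what the standard WHATWG TextEncoder/TextDecoder APIs model (available in browsers and modern Node). A quick sketch:

```javascript
// Encoding: abstract code points -> concrete bytes (here, UTF-8).
const bytes = new TextEncoder().encode('héllo'); // returns a Uint8Array
console.log(bytes.length);        // 6 -- 'é' takes two bytes in UTF-8
console.log('héllo'.length);      // 5 -- code units; no bytes in sight

// Decoding: bytes -> string. You must know which encoding the bytes use.
const text = new TextDecoder('utf-8').decode(bytes);
console.log(text === 'héllo');    // true
```

Note that the string itself never exposes bytes; the byte view only exists on the Uint8Array side of the conversion.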

Unfortunately, JavaScript (and Java and Python 2 and other languages) uses UTF-16 for its strings, and leaks that information to the programmer, so if you use a language like that, you should probably have a basic understanding of how UTF-16 works.

I'd say a JavaScript programmer should think of things this way:

* JavaScript strings are exposed to the programmer as an array of UTF-16 code units, although there are some helper functions like `codePointAt` to help interpret strings in terms of code points.

* Newer languages expose strings as an array of Unicode code points, which is cleaner because it's independent of any particular encoding.

* Even when working with code points, you can't safely reverse strings or anything like that, since a user-perceived character might consist of multiple code points.
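The three bullets above can be seen directly in the console; a sketch:

```javascript
const s = '\u{1F4A9}';            // one code point, U+1F4A9 (astral plane)

// Exposed as UTF-16 code units:
console.log(s.length);            // 2 -- a surrogate pair
console.log(s.charCodeAt(0).toString(16));  // 'd83d' (high surrogate)

// Code-point-aware helpers:
console.log(s.codePointAt(0).toString(16)); // '1f4a9'
console.log([...s].length);       // 1 -- the string iterator walks code points

// But even code points aren't user-perceived characters:
const e = 'e\u0301';              // e + combining acute accent
console.log([...e].length);       // 2 code points, 1 visible character
```

Naively reversing `e` would put the combining accent on the wrong letter, which is the hazard the last bullet describes.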

It is true that Javascript, like many other languages (Java, Win32 wide chars, etc) has to deal with the problem that they assumed unicode code points could not exceed the integer value 65535, so you have to deal with the surrogate pairs. So I guess that is one way to "encode" all planes of unicode in a backward compatible way. Bytes don't enter the picture though!
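The surrogate-pair scheme is mechanical: a code point above U+FFFF is split into two 16-bit units drawn from reserved ranges. A sketch of the arithmetic:

```javascript
// Split an astral code point (> 0xFFFF) into a UTF-16 surrogate pair.
function toSurrogatePair(cp) {
  const offset = cp - 0x10000;            // 20 bits remain
  const high = 0xD800 + (offset >> 10);   // top 10 bits
  const low  = 0xDC00 + (offset & 0x3FF); // bottom 10 bits
  return [high, low];
}

const [hi, lo] = toSurrogatePair(0x1F4A9);
console.log(hi.toString(16), lo.toString(16));            // d83d dca9
console.log(String.fromCharCode(hi, lo) === '\u{1F4A9}'); // true
```

Note there are still no bytes here: both halves of the pair are abstract 16-bit code units, not a byte serialization.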

It is still very important in general to distinguish a string from a byte array, and I felt like the article was fairly counter-productive with all its talk about "UTF-8 strings" (which doesn't make much sense, either you have a byte-array that you can apply UTF-8 decoding on to get a string out of, or you already have a string, in which case encodings in the traditional sense (UTF-8, ISO-8859-1 latin1, etc) doesn't apply)

> was fairly counter-productive with all its talk about "UTF-8 strings"

Eh. I agree and disagree. I agree in the sense that the phrase "UTF-8 string" is generally a misnomer and is a good signal that there's some confusion somewhere, but I don't find it as damning as you do. In particular, not all languages represent strings as sequences of codepoints; some instead make their internal UTF-8 byte representation a first class part of their string API. Two languages that come to mind are Go and Rust, where Go conventionally uses UTF-8 and Rust uses enforced UTF-8. But in both cases, accessing the raw bytes is not only a standard operation, but is necessary whenever you want to do high performance string processing.

That is, if someone said Go/Rust had "UTF-8 strings," that wouldn't be altogether wrong. UTF-8 is a first class aspect of both string APIs, while both provide standard functions you'd expect from Unicode strings.

An important subtlety: JavaScript doesn’t use UTF-16 for its strings; it uses UCS-2 code units.

But internally, JavaScript engines can use whatever encoding they like, though most use UCS-2 most of the time. (Servo, for example, represents documents in WTF-8 rather than the conventional UCS-2. This is good for memory efficiency and certain other things, though bad for arbitrary indexing which is rare.)

Python 2 is in the same boat concerning internal representation (it will use UTF-16 much of the time), but it doesn’t expose any UTF-16ness in the API.

> * Newer languages expose strings as an array of Unicode code points, which is cleaner because it's independent of any particular encoding.

Don't do that (the array part). Guaranteeing O(1) indexing isn't very useful but precludes efficient storage.

> It seems to make the "newbie unicode error" of assuming that strings "have" an encoding

You imply that there is some universal "string" definition, when there is no such thing.

Literally, "string" just means sequence. Beyond that, it depends on the context.

* C++ std::string is a sequence of char (aka byte, defined to be 8+ bits).

* Swift string is a sequence of Unicode graphemes.

* Ruby String is a sequence of bytes/octets, with an attached (and mutable) encoding.

* Python 2 str is a sequence of bytes/octets.

* Python 3 str is a sequence of Unicode code points.

* ECMAScript string (and Java java.lang.String) is a sequence of UTF-16 code units.

Moreover, the entire idea of "Strings are an array of integer unicode code points" is confined to Unicode. It doesn't say anything about other character sets, e.g. ASCII, ISO 8859-1, or Windows-1252. (Though AFAIK Unicode is a superset of those particular three.)

So...I think you can safely say "Unicode strings are sequences of Unicode code points."

>No such thing! Strings are an array of integer unicode code points.

I'd argue that, generally, strings are simply arrays of chars, which are bytes.

The failure here was keeping the name "string" for what are arrays of codepoints instead of bytes.

C does not own the word "string". A string is a piece of text. It is not a byte array.

Unicode strings are arrays of code points, which are 21-bit numbers.

If the API requires fast subscripting (it usually does), then the internal representation would be UTF-32 or three-codepoints-in-an-int64; otherwise a more compact internal representation is possible.

If you don't require supporting subscript and allow only iteration over list of code points then in-memory representation of strings can be more compact. It can use UTF-8 or even SCSU or BOCU1.

Some languages use polymorphic unicode strings which store ascii if the value is all-ascii and switch to something else if it isn't (python3.3 and factor come to mind).

I'm going to argue a little differently...

In C, strings were always semantically a sequence of characters (as they are commonly defined elsewhere).

For a while a character was one byte, so the distinction was unimportant (and became blurred).

A char was both a character and a byte. A string was both a sequence of characters and an array of characters... and an array of "char"s, and an array of bytes.

Code -- and programmers! -- became dependent on these equivalency assumptions.

Once it became clear we could no longer pretend that a maximum of 256 characters was tenable (less actually, since the use of 0-31 for control/separation/termination had become standard) we were left with conflict, leading to a variety of uncomfortable choices and compromises.

One such conflict is "char"... should it retain its semantics or its size (one byte)?

The last time I developed seriously in C or C++ it had retained its size, but lost its semantics -- a char IS a byte now. (That was a while ago, I don't know if that's changed -- it sounds like from your post that it hasn't.)

I guess UTF-8 has won out in C and C++ (and elsewhere) so now, while a char is byte, a C/C++ string is: (1) an array of char/bytes; (2) a sequence of characters. The thing that's been dropped is that a string is no longer an array of characters.

(In case there's confusion: here, "array" means an ordered sequence of elements of uniform size with O(1) random access, while sequence is just an ordered sequence of elements that doesn't necessarily offer O(1) random access or elements with uniform size.)

It's all this vocabulary fighting that makes this stuff so damn hard for people new to trying to do interesting things with strings. Like, different languages use totally different terms in the API documentation, even.

So, for example, I can figure out how to take a document written in Microsoft Word with that Latin-1 business and make the characters stop sucking in python 3, but I don't even know what to google to do the same thing in javascript, because people use terms like "encoding" and such totally differently.

Commented this on the blog; cross-posting here:

V8 turns out to have a ton of internal string representations. They don’t affect semantics (the general point that JS string functions think in UTF-16 is valid) but they’re interesting.

V8 apparently stores all-ASCII strings as ASCII, so stuff like HTML tag names or base64 blobs doesn’t double in size.

Like Go, V8 lets you take a substring as a pointer into the larger string; the internal class is called SlicedString, but from JS-land you don’t see anything different from a string literal. As in Go, keeping a short substring of a long parent string keeps the whole parent ‘alive’ across GCs so sometimes folks will be surprised all those bytes are still allocated.

Unlike Go, V8 has a ConsString type, so concatenating strings sometimes doesn’t really immediately copy the underlying bytes anywhere. Building a string with a loop that runs str += newPiece probably goes faster than expected because of this. [It turns out it flattens the string the next time you index into it, or at least used to, which has some perf implications of its own: https://gist.github.com/mraleph/3397008]

That’s mostly from this post by a member of the Dart team http://mrale.ph/blog/2016/11/23/making-less-dart-faster.html ; his blog has a lot of interesting stuff about how these fine-tuned language implementations really work.

It's kind of amazing the lengths the V8 (and other JS engine) teams went to in order to make the code they saw in the wild work well.

> V8 apparently stores all-ASCII strings as ASCII, so stuff like HTML tag names or base64 blobs doesn’t double in size.

Python does this too.

In Python 2, and in Python 3 until 3.3, a compile-time flag determined the internal Unicode storage of the Python interpreter; a "narrow" build of the interpreter used 2-byte Unicode storage with surrogate pairs, while a "wide" build used 4-byte Unicode storage.

As of Python 3.3, the internal storage of Unicode is dynamic. Python 3 source code is always parsed as UTF-8, but then as string objects are created by the interpreter their memory representation is chosen on a per-string basis, to be able to accommodate the widest code point present in the string. So Python will choose either a one-byte, two-byte, or four-byte encoding to store the string in memory, depending on what code points are present in it.

This is very nice because it means iteration over a Python string is always iteration over its code points, the length of a string is always the number of code points in it, and indexing always yields the code point at index, since the internal storage of the string is fixed width and never has to include surrogates (in pre-3.3 Python, "narrow" builds would actually yield up things for which ord() gave a value in the surrogate range, and code points requiring surrogates added 2 to the length of a string rather than 1).

Here is a comment describing strings in SpiderMonkey: http://searchfox.org/mozilla-central/rev/999385a5e8c2d360cc3.... If you scroll down there is ASCII art showing the different string sub-classes.

I wonder if there is ever going to be an encoding that replaces UTF-8? Or have we hit on some sort of permanent local maximum (not a global one, in the sense that UTF-8 carries the baggage of being backwards compatible with ASCII ... though maybe you could argue that's more of a Unicode problem than a UTF-8 encoding format one)?

At this point UTF-8 seems pretty permanent, what would come along to replace it? And if it is likely to be permanent shouldn't node / javascript in general be moving towards deprecating UCS-2 / UTF-16 and giving first class support to UTF-8?

I ran into all this because a couple of years ago I had to write a UTF-8 converter, before ScalaJS natively supported one, for a serialization library I had written. I was kind of surprised that the JavaScript support was so lacking; luckily, writing a UTF-8 encoder/decoder isn't that hard of an endeavor.

> ever going to be an encoding that replaces UTF-8? Or have we hit on some sort of permanent local maxima

Our programming languages, type systems, and compilers are still extremely poor at specifying the properties of types, and at permitting variant implementations which preserve them. We're still struggling with basics, like memory layout for locality (e.g., arrays of structures of arrays). And many languages still can't manage multiple dispatch.

As we slowly become less crippled, it eventually becomes straightforward to use alternate representations and encodings. For instance, UTF strings with inline descriptive bitmasks are already a thing, as is substring-local encoding.

So it seems we needn't be trapped in a local maximum, at least long-term. And perhaps the future will eventually be more heterogeneous. Perhaps, like with integers and floats, there's both diversity collapse (big endian dies, IEEE 754) and heterogeneity (integer packing, SIMD).

> I wonder if there is ever going to be an encoding that replaces UTF-8?

Maybe WTF-8? https://simonsapin.github.io/wtf-8/

Section 1: "WTF-8 is a hack..."

I doubt that this is going to be replacing anything except in the narrow use-cases mentioned on the same page.

I should have published this definition of WTF-8 properly http://people.ds.cam.ac.uk/fanf2/hermes/doc/qsmtp/draft-fanf...


> I wonder if there is ever going to be an encoding that replaces UTF-8?

Possibly. If CJK keeps gaining importance, there may be a new encoding which encodes those characters more compactly, rather than the current primacy of Western/European character sets.

That is already served well by UTF-16. 50% smaller than UTF-8 for CJK, yet still covers the full Unicode range.

Most commonly-used CJK characters are in the Basic Multilingual Plane, and take 2 bytes to represent in UTF-16 and 3 in UTF-8. So you're only really saving 33%, not 50%. Plus, if you're storing e.g. HTML, all the HTML markup characters go from one byte to two bytes, which is going to offset the CJK advantage to some degree (or maybe even outweigh it). I think most web documents are served compressed anyway, which makes the difference even smaller.
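The byte counts are easy to check from JS itself; a sketch using the standard TextEncoder, with UTF-16 size approximated as two bytes per code unit (valid for BMP-only strings):

```javascript
// Rough byte counts for a CJK string vs. an HTML-ish string.
const utf8len  = s => new TextEncoder().encode(s).length;
const utf16len = s => s.length * 2;  // BMP-only: 2 bytes per code unit

const cjk = '世界和平';
console.log(utf8len(cjk), utf16len(cjk));   // 12 vs 8 -- UTF-16 saves 33%

const html = '<p>世界</p>';
console.log(utf8len(html), utf16len(html)); // 13 vs 18 -- the ASCII markup
                                            // makes UTF-8 smaller overall
```

So for raw CJK prose UTF-16 wins, but once ASCII-heavy markup is mixed in the advantage can flip, as described above.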

No, Unicode fares terribly for CJK languages, see Han Unification. I can't stress enough how it is a lie that Unicode can represent all languages in use, when it cannot even distinguish Japanese from Chinese.

It can't distinguish between English, French, German, Swedish, Norwegian, Spanish, etc. either. And there's no charset that does that. Do you find that a problem? If so, you're very much in the minority; if not, why distinguish between Japanese and Chinese but not English and French?

Swedish ä and German ä are the same character. 令 is a different character (of the same origin) in Japanese, Traditional Chinese and Simplified Chinese (and it depends on your browser setup which one you get on screen, which is absolutely insane). It is like saying we should use English p as Cyrillic р, which only looks similar, or as п, because they have the same origin or sound.

And no, CJK language users are not a minority, although many of us have become used to the font inconsistencies (because while wrong, they are legible to us).

The question comes down to if Chinese hanzi and Japanese kanji are the same alphabet or not. It's not clear what the answer should be, but for Unicode, it was absolutely necessary to treat them as the same alphabet, if they were not to break the 64K character limit (it's dubious Unicode would have seen its modern universality if it weren't a compact fixed 16-bit encoding, especially since the genius of UTF-8 encoding was a rather later addition).

Another point that is commonly missed is that Han unification was first proposed by East Asians (not ignorant Westerners), specifically the Chinese (which is why the Chinese don't really object to Unicode but the Japanese do).

A side note: it's interesting that you bring up ä as an example, because ä is actually a typographic unification of two very distinct letters. In languages like English and French, the diacritic is a diaeresis: a mark, derived from Ancient Greek indicating that the vowel is to be pronounced separately rather than as a diphthong. In languages like German, it's an umlaut, where it's a typographic reduction of a superscript e. They are in fact very different characters that merely look the same.

Whether the ideographs are the "same" or not is definitely a can of worms, both academically and politically speaking.

I am aware of the limits that the original Unicode specs were subjected to, which adds to my grief on how things could have been. Now that it has taken over the world we have to deal with wrong characters displayed for many applications. (Even modern day iOS has problems displaying mixed language content in system text fields.)

I don't know much about the history of western languages, so I cannot comment on the issue, just that it feels unjustified that during development CJK languages got crammed together and messed up (no matter who did it). More and more emojis and other symbols just rub salt into the wound.

I think it makes sense to "compress" the CJK characters to points where their appearance is the same, because there are on the order of one hundred thousand of them.

For western alphabets that have thirty or so symbols, the need is not nearly so acute.

And that is effectively what happened. Han unification has already done its damage and will continue to perform a lossy compression on all current and future CJK texts. Not that there is a way around...

Because the characters are different enough that it causes actual day-to-day problems (I see text all the time on printouts/signs where some incorrect font substitution has taken place and you get Chinese instead of Japanese characters). You can (probably correctly) argue that these are really just deficiencies in text editing/markup tools where you can't mark the language of text but the fact that that's required for correct presentation only in CJK languages seems to indicate that it's a problem with this unification in particular.

How would you like if there was Latin/Greek/Cyrillic unification? I mean they're all just alphabets that make the same sounds they're all basically the same right?

Wouldn't bother me personally, why should it? However, as there are only a few tens of characters between them, it wouldn't save much.

Functions like String.prototype.charCodeAt, String.prototype.indexOf, String.prototype.substr and such (which operate on indices into 16-bit-wide strings) will have to be supported somehow, so I doubt it.

String.fromCodePoint and String#codePointAt were introduced for this reason. Or rather, the time to bring the UTF-8 Everywhere initiative to JS would have been when these two methods were introduced for ES6. Unfortunately, that didn't happen and the committee went with UTF-16 (which is not the same thing as UCS-2—the fact that they differ is why these methods were introduced in the first place).

codePointAt still takes the same type of index that charCodeAt does. It doesn't address the issue at all.

    $ jsc
    >>> '\u{1f4a9}A'.codePointAt(1).toString(16)
    dca9
And what was the committee supposed to do with all those string indexing functions that have existed since LiveScript times, remove them? Change their semantics overnight and break countless software in the process?

If there's any thing JavaScript has been doing right so far, it's backwards compatibility. For better or for worse.

This doesn't seem insurmountable. Three steps:

1. Introduce new methods which allow access to code points without indexing by UCS-2 code unit. An iterator over code points is the key thing. You could also have some opaque kind of index, with a method to get a list of indices for every code point in the string, and then a method to look up a code point by opaque index. The former could be expensive, as it would involve scanning the string to find where each character starts (although perhaps that could be done lazily?), but the latter should be cheap. The results of the expensive method could be cached.

2. Wait N years for the new methods to be widely adopted.

3. Shift the internal representation of strings to UTF-8. Anyone using the new methods will see similar, or better, performance. Anyone using the old methods will see a drop in performance. Implementations could even dynamically choose between representations based on usage patterns. The fact that each web page is its own little JS universe should make that fairly practical.
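For what it's worth, much of step 1 exists today: the ES6 string iterator already walks code points rather than code units. A sketch:

```javascript
const s = 'a\u{1F4A9}b';

// Indexing by code unit (the legacy view): surrogates are visible.
console.log(s.length);                // 4

// Iterating by code point (the ES6 string iterator):
for (const ch of s) console.log(ch);  // 'a', '💩', 'b'
console.log([...s].length);           // 3
console.log(Array.from(s, c => c.codePointAt(0).toString(16)));
// ['61', '1f4a9', '62']
```

What's missing from the proposal above is mainly the opaque-index lookup; the iterator alone already lets code avoid code-unit indexing entirely.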

I thought it would have been clear from my comment that I don't regard the new ES6 methods as flawless.

Only matters to the Windows platform, though. EDIT: early morning brain fart on my part, js of course too. Grr.

What? These are cross-platform features, defined in the ECMAScript spec. Nearly all string-processing code written in JavaScript uses them.

> A string is a series of bytes.

Incorrect or wildly inaccurate. A string is conceptually text which may be (is) represented internally as bytes, through some means of encoding that text. And thus the concept of encodings are introduced.

That (and how) text is represented should be an implementation-detail though: Strings represent text, not bytes.

I think most people miss this distinction and that's the main source of confusion for encoding-related problems among programmers.

Eh, a string 'is' its semantic definition and its in-memory representation and I'm sure other things in other contexts. Multiple notions of the same thing coexist outside of computers too. I might be a bunch of cells and physiological systems to someone in biology or medicine, a legal person to the state, something else to a philosopher.

Something like a language specification would have good reason to stick to talking about the semantics, but that isn't a reason to discard any source that uses 'is' to talk about a different aspect. For example, Go's authors have written explanatory posts both about the semantics of strings and about their in-memory representations, and it would be weird to say one of those posts is Just Wrong if it says 'A string is [whichever view of strings is most relevant locally]'. Sometimes another way of looking at the system is clearly better for the immediate purpose (writing about memory-use optimizations, you would need to know what's in memory), and sometimes discussions need to jump around and look at the different levels at different points (like here, where the post covers interactions between various APIs and performance, which is an important aspect of the implementation).

Thanks for describing the post contents as "wildly inaccurate." I guess the problem with explaining anything is you have to decide which abstractions are good enough for the concept you are trying to get across

Sure. But this is a pet peeves of mine: strings are text. Bytes (and thus encodings) something you should only be concerned about when doing file or network IO. It's boundary-stuff, and none of your actual text-processing should depend on it.

Consider the phrasing my way of shielding you from accusations about being entirely wrong ;)

> It's boundary-stuff, and none of your actual text-processing should depend on it.

I strongly disagree. For high performance text search, it's critical that you deal with its in-memory representation explicitly. This violates your maxim that such things are only done at the boundaries.

For example, if you're implementing substring search, the techniques you use will heavily depend on how your string is represented in memory. Is it UTF-16? UTF-8? A sequence of codepoints? A sequence of grapheme clusters, where each cluster is a sequence of codepoints? Each of these choices will require different substring search strategies if you care about squeezing the most juice out of the underlying hardware.

Sometimes I think we would be better off if our languages didn't have string as a data type at all.

The text encodings themselves (e.g. UTF-8, UTF-32) ought to be proper data types. Strings are a leaky abstraction that cause otherwise competent programmers to have funny ideas about what text is and isn't, as this entire thread demonstrates.

>strings are text

I do not completely agree with that. Text is what strings are mostly used for, but I wouldn't call a base64 encoded image text. Text is something a person can read and make sense of.

So I think it would be more accurate to say that a string is a sequence of characters (grapheme clusters in unicode speak).

Or memory "IO", which standard libraries are often bad at abstracting away.

Exactly; a string is not conceptually a "series of bytes" -- yes, there are languages that have a series-of-bytes type called "string" for historical reasons, but that's irrelevant to how to think of the string type in JavaScript, or in an ideal world.

One level of abstraction that works is to think of a string as a series of integers (which may be much larger than 65535) representing code points. Whether your VM uses a fixed-length or variable-length encoding to encode this series of integers should be irrelevant, aside from performance and memory consumption concerns. JavaScript's "charAt" and numeric indexing of strings violate this abstraction, making them not useful in Unicode-aware code. Of course, the "series of code points" abstraction has limits too, since what we think of as a single "character" can be composed of multiple code points, which calls into the question the whole concept of a "character," which is fuzzy in the first place.

This is all massively inefficient, of course. Most strings are representable as UTF-8, and using two bytes to represent their characters means you are using more memory than you need to, as well as paying an O(n) tax to re-encode the string any time you encounter a HTTP or filesystem boundary.

There's nothing stopping us from packing UTF-8 bytes into a UTF-16 string: to use each of the two bytes to store one UTF-8 character. We would need custom encoders and decoders, but it's possible. And it would avoid the need to re-encode the string at any system boundary.

This line of thinking does not make sense to me, and I'm not sure what scheme is being proposed. First of all, "most strings" are representable as UTF-8? All strings are representable as UTF-8. And if you are reading data from a UTF-8 source and you don't want to pay the time and memory cost of decoding it and re-encoding it in a less space-efficient code, leave it as a Buffer of bytes! Don't decode it.

Because everything else in the world is moving to UTF-8, Node is also trying to move to UTF-8.

Node does make you specify which encoding you want when converting between strings and Buffers, but it's always embraced UTF-8 as its primary encoding, as far as I'm aware.

> Because Javascript was invented twenty years ago in the space of ten days, it uses an encoding that uses two bytes to store each character, which translates roughly to an encoding called UCS-2, or another one called UTF-16.

This is kind of silly. Java made the same decision and was certainly not "invented" in ten man-days!

Right, that was one of the obviously-correct representations at the time. When the pioneering major Unicode-based systems (later Mac OSes, Windows NT, Java, etc.) were being built, the Unicode consortium thought code points were just 16 bits, so two-byte characters were a fine solution, trading off some memory but retaining "normal" string characteristics like O(1) indexing and easy size calculation. Pretty much everybody went with that, except Plan 9, which nobody was paying attention to as they invented UTF-8, which is now saving our bacon--thanks!

Then after all those runtimes and APIs were established, the Unicode consortium realized 65,000 characters wasn't actually enough and blew up all those assumptions. Since then, things have been very uncomfortable in those systems, string-wise.

Next-generation systems are incorporating the semantics of modern Unicode, but it'll take a long while to fix this--you can't just redefine a primitive type like "string" in an existing giant codebase.


I didn't know Plan 9 was behind UTF-8. :)

We've come a long, long way since the 90s and even the 00s. It's shocking and gratifying how well emojis work online. A great motivator for Unicode support, too. :)

That feels like kicking the can down the road. What is text?

Depends on the language. Some co-opt the string as a generic buffer too, and have no other suitable type to use.

Unless of course you specify hex or base64, in which case it does refer to the encoding of the output string

I've seen this... questionable design decision in another interpreted/scripting language too. Those are encodings, but clearly not encodings at the same "layer of abstraction" as e.g. UTF-8 or Shift-JIS or UTF-32 or UTF-16, because you could have a UTF-16 string containing "base64" or "hex".
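In Node terms, the sketch below shows the two layers with the standard Buffer APIs: 'utf8' maps text to bytes, while 'hex'/'base64' map bytes to ASCII-safe text:

```javascript
// 'utf8' is a character encoding: text -> bytes.
const buf = Buffer.from('hi', 'utf8');
console.log(buf);                    // <Buffer 68 69>

// 'hex' and 'base64' are byte<->text transforms, one layer up:
console.log(buf.toString('hex'));    // '6869'
console.log(buf.toString('base64')); // 'aGk='
console.log(Buffer.from('aGk=', 'base64').toString('utf8')); // 'hi'
```

The confusion the comment describes comes from both layers being selected through the same `encoding` string argument.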

> You could easily represent all of the characters in the Unicode set with an encoding that says simply "assign one number, 4 bytes (or 32 bits) long, for each character in the Unicode set."

Actually you can't, at least not in a standard way. There are combinations of code points that don't have a single code point equivalent. Flag emojis are an example, but there are many letterlike versions as well. There are enough unallocated points in the 32-bit space that you could probably manage to make single-point equivalents on your own.

> There are enough unallocated points in the 32-bit space that probably you could manage to make single-point equivalents on your own.

Unicode doesn't prevent you from drowning letters in many combining marks (e.g. an accent, an umlaut, and an overhead tilde), or doing this with letters where they make no sense and for which no font will have useful positioning information (e.g. over a copyright sign). There's over 1700 combining marks in Unicode, which gives you way over 2^32 ways to use the combining marks with the letter a, let alone the rest of the letters in Unicode.
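A sketch of both halves of that point, using the standard `normalize` method:

```javascript
// Two different code-point sequences, one user-perceived character:
const precomposed = '\u00E1';        // á as a single code point
const decomposed  = 'a\u0301';       // a + combining acute accent

console.log(precomposed === decomposed);                  // false
console.log(decomposed.normalize('NFC') === precomposed); // true

// Nothing stops you stacking marks with no precomposed equivalent:
const stacked = 'a\u0301\u0308\u0303'; // acute + diaeresis + tilde
console.log([...stacked].length);      // 4 code points, "one" character
```

Normalization collapses the cases Unicode has precomposed forms for; the stacked example has no single-code-point equivalent at all.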

Hmm, I don't understand. AFAIK flag emojis are ligatures across two code points representing the two-letter international code for the country.
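That is indeed how flags work; a sketch showing the two Regional Indicator code points behind one flag:

```javascript
// A flag is a ligature of two Regional Indicator code points:
const flag = '\u{1F1FA}\u{1F1F8}';   // U+1F1FA (U) + U+1F1F8 (S) -> 🇺🇸

console.log(flag.length);            // 4 UTF-16 code units
console.log([...flag].length);       // 2 code points, 1 visible "character"
console.log([...flag].map(c => c.codePointAt(0).toString(16)));
// ['1f1fa', '1f1f8']
```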

gumby is deliberately interpreting the word character to mean grapheme cluster (whereas the author meant code point) to make a point.

We really should avoid that word when discussing Unicode. It just leads to confusion.

> There's nothing stopping us from packing UTF-8 bytes into a UTF-16 string: to use each of the two bytes to store one UTF-8 character. We would need custom encoders and decoders, but it's possible. And it would avoid the need to re-encode the string at any system boundary.

I'm guessing you'd have to write a C++ module for this, but any suggestions on how one might do this successfully?

That's completely bogus. Forget unpaired surrogates; this scheme cannot even represent odd-length ASCII strings.

You could use JavaScript, bit shifting and codePointAt, I think, but it would be a huge pain.

This just leads me to believe that Node and its followers never learned the lessons from their stint with PHP.

Also, "Encoding is the process of squashing the graphics you see on screen, say, 世 - into actual bytes." No, it's a way of representing one value in a way a system can more easily handle, in a hopefully lossless fashion. Encoding has nothing to do with what's on screen, other than that being one representation of the data.

As much as I enjoy making fun of JavaScript, it seems more likely to me that the reason JavaScript uses UTF-16 internally is the same as for most other languages that support Unicode: it's more efficient and convenient to process. UTF-8 has variable character boundaries, which means that indexing/counting requires decoding char by char; but it works wonders as an exchange format since it's compact and nearly any language can deal with it.

UTF-16 is a variable-width encoding. Thanks to surrogate pairs some code points take 2 bytes, and some take 4. Even if you use UCS-2 (the 2-byte "UTF-16" infamous for mangling emoji) constant-time indexing of code points still doesn't give you constant time access to what humans would call characters ("grapheme clusters" in Unicode), because of decomposed characters with modifiers and joiners.
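You can see the surrogate-pair mechanics directly from JavaScript:

```javascript
// U+1F600 is outside the BMP, so UTF-16 stores it as a surrogate pair
// and JavaScript's string API reports two code units.
const s = "\u{1F600}";          // 😀
console.log(s.length);          // 2 (UTF-16 code units)
console.log(s.charCodeAt(0));   // 55357 (0xD83D, the lead surrogate)
console.log(s.codePointAt(0));  // 128512 (0x1F600, the actual code point)
console.log([...s].length);     // 1 (string iteration walks code points)
```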

Languages that use UTF-16 mostly use it only because they're older than UTF-8 (or are tied to a platform older than UTF-8).

UTF-16 is a variable width encoding that is also more convenient to process compared to UTF-8. For example you don't have to deal with invalid code units, non-shortest forms, etc.

I think you're right that older languages use UTF-16 and newer ones use UTF-8. But it also seems empirically true that UTF-16 languages do better at grappling with Unicode's subtleties, compared to UTF-8. The temptation of UTF8-is-basically-C-strings is hard to resist.

> For example you don't have to deal with invalid code units, non-shortest forms, etc.

You have to deal with unpaired surrogates, though. And just because that wasn't annoying enough, any UTF-8 sequence which contains an encoded surrogate (which is technically invalid, but not prohibited by most implementations) is impossible to encode as UTF-16.

Javascript doesn't specify anything about what character encoding is used "internally". Indeed, as the first comment on the article points out, some implementations (like V8) internally use ASCII to store strings that contain only ASCII characters.

What Javascript does is use UTF-16 semantics for Unicode strings. The reason why it does this is simple: when those methods were implemented in the mid-1990s, UTF-16 was largely synonymous with Unicode. No characters beyond U+FFFF were defined until the release of Unicode 3.1 in March 2001.

All Unicode encodings require intelligent indexing. JavaScript uses UTF-16 because that (or rather its predecessor UCS-2) was the standard when it was being created. Same reason Java and Apple's Objective-C frameworks use it.

This. Even if you represented Unicode strings as 32-bit code units, they would still require careful processing, e.g. to extract single characters from the string. For example, the single character "Ï" can be represented both as the single code point U+00CF and as the sequence U+0049 U+0308 ("I" followed by a combining diaeresis). Due to these complications I favour the old method of just treating strings as byte sequences (and possibly enforcing UTF-8 as the encoding, since it is a strict superset of ASCII) and using specialized functions for Unicode processing.
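In JavaScript you can see both forms of that character, and convert between them, with String.prototype.normalize:

```javascript
// Composed vs. decomposed representations of the same character "Ï".
const composed = "\u00CF";      // single code point U+00CF
const decomposed = "I\u0308";   // U+0049 "I" + U+0308 combining diaeresis
console.log(composed === decomposed);                   // false
console.log(composed.normalize("NFD") === decomposed);  // true
console.log(decomposed.normalize("NFC") === composed);  // true
```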

But processing UTF8 is still without a doubt more complex; you're not going to weasel your way out of that fact, no matter how many cases you can think of where they are comparable. Why can't several alternatives be allowed to coexist and complement each other? Why does everything have to be UTF8, or JavaScript, or Rust, or Go or whatever?

Processing UTF8 is very barely more complex, and mostly for whoever writes your language's String implementation. Once you have to worry about whether a code point is more than one unit, there's not much difference between 1-2 units and 1-4 units. Ideally you should be working with grapheme clusters anyway since that's the only way to have a hope of not butchering things (even non-normalized Latin text may contain multi-codepoint letters), but most languages don't give you a good way to deal with them so that's difficult in practice.
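(For what it's worth, newer JavaScript runtimes do expose grapheme clusters via Intl.Segmenter; a sketch, assuming a runtime that supports it:)

```javascript
// Grapheme-cluster iteration via Intl.Segmenter (newer runtimes only).
const seg = new Intl.Segmenter("en", { granularity: "grapheme" });
const s = "e\u0301\u{1F600}";   // decomposed "é" followed by 😀
console.log(s.length);          // 4 UTF-16 code units
console.log([...s].length);     // 3 code points
console.log([...seg.segment(s)].map(g => g.segment)); // 2 grapheme clusters
```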

With UTF-8 you'll at least have a shot at noticing that you're not handling multi-unit code points well, while with UTF-16 you won't notice unless you test Chinese or a more off-the-beaten-path language.

I think that last part is important, and it's a big reason why I dislike UTF-16. It's much harder to notice an inadequate implementation when you're using UTF-16. Note that most of Chinese is still in the BMP, so even then it will probably work fine. You'll get failures on more obscure Chinese characters, most emoji, and really obscure scripts like Linear B and cuneiform.

UTF-8 makes it more obvious that you're mishandling multi-unit code points, BUT it introduces its own issues, specifically invalid code units and non-shortest forms. These issues represent security vulnerabilities which have been successfully exploited, and are impossible-by-design with UTF-16.

Can you elaborate? I didn't say that you should use UTF-8 (that's just what I prefer personally), but my point was that you should never make any assumption about a Unicode string without consulting the corresponding Unicode tables and essentially have to treat strings as "opaque sequence of something" anyways. May as well be a byte sequence.

Regarding your last point, I'm totally with you (if I understood you correctly). Of course applications should support multiple input/output encodings, but as a programmer you have to decide on some internal representation.

That being said, I really don't see how processing UTF-8 is significantly more complex than processing, say, UTF-16. In both cases you need to handle continuation units for the extraction of Unicode code points.

> That being said, I really don't see how processing UTF-8 is significantly more complex than processing, say, UTF-16. In both cases you need to handle continuation units for the extraction of Unicode code points.

UTF-8 has 4 valid cases, one for each length, and many more invalid cases for each length (2-byte sequence missing trail byte, 3-byte sequence missing 1 trail byte, 3-byte sequence missing 2 trail bytes, 4-byte sequence missing 3 trail bytes, 4-byte sequence missing 2 trail bytes, ..., overlongs, UTF-8'd surrogates, overflow, etc.) Differences between implementations' treatment of error cases have led to some security concerns; see https://hsivonen.fi/broken-utf-8/ and discussion at https://news.ycombinator.com/item?id=14451822 for an example.

UTF-16 has two valid cases (one or two code units) and two error cases (lead surrogate not followed by trail surrogate, lone trail surrogate). It's more like a DBCS, except each code unit is 2 instead of 1 byte.
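To make that concrete, the entire UTF-16 decode loop fits in a few lines; here's a sketch (in JavaScript, for consistency with the topic) that handles the two valid cases and replaces the two error cases with U+FFFD:

```javascript
// Sketch of a full UTF-16 decoder over an array of 16-bit code units.
function decodeUtf16(units) {
  const points = [];
  for (let i = 0; i < units.length; i++) {
    const u = units[i];
    if (u >= 0xd800 && u <= 0xdbff) {           // lead surrogate
      const next = units[i + 1];
      if (next >= 0xdc00 && next <= 0xdfff) {   // valid case 2: surrogate pair
        points.push(0x10000 + ((u - 0xd800) << 10) + (next - 0xdc00));
        i++;
      } else {
        points.push(0xfffd);                    // error: unpaired lead surrogate
      }
    } else if (u >= 0xdc00 && u <= 0xdfff) {
      points.push(0xfffd);                      // error: lone trail surrogate
    } else {
      points.push(u);                           // valid case 1: BMP code unit
    }
  }
  return points;
}
```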

And Windows ("#define _UNICODE", which actually makes everything use UCS-2)

Not to the same extent as UTF8. It's not like UTF8 invalidated all other encodings; they still serve a purpose, and UTF16 seems to still be a popular choice for internal processing, despite the misguided push to use UTF8 for everything.

What's the difference? Both UTF-8 and UTF-16 are variable-length encodings where careless mutation with integer indexes can produce invalid results. UTF-8 is 1-4 bytes per code point, whereas UTF-16 is only 1-2 code units per code point, but that doesn't really make it easier. And proper handling really requires detecting grapheme cluster boundaries, which is the same difficulty regardless of whether you use UTF-8, UTF-16, or UTF-32.

> UTF-8 is 1-4 bytes per code point, whereas UTF-16 is only 1-2 code units per code point, but that doesn't really make it easier

As someone who has actually written UTF-8/UTF-16 conversion code, I can immediately tell you which one is far easier to implement: UTF-16. The number of valid cases is basically halved, and the number of error cases in UTF-16 is a fraction of those in UTF-8. Put another way, there are plenty more invalid UTF-8 sequences than invalid UTF-16 sequences.

I agree that UTF-16 is slightly simpler to parse, but compared to everything else you need for Unicode-aware string processing, both are completely trivial.

In any case, the discussion here is the appropriate string API, and the relative difficulty of working with those. Exposing UTF-8 versus UTF-16 changes essentially nothing: in both cases you need to either deal with non-integer indexes or deal with integer indexes where not all values are valid.

Good string APIs are hard. Most Unicode-aware languages pick one particular encoding and then toss the programmer in the deep end with it.

The only language I've seen get it vaguely correct is Swift. (I'm sure there are others, but it's definitely not common.) Swift strings provide multiple views, so you can work with UTF-8, UTF-16, UTF-32, or grapheme clusters, as you need. It doesn't allow using integer indexes directly, so you have to confront the fact that indexing is actually non-trivial. Swift 3 requires using views, and Swift 4 makes the String type itself a sequence of grapheme clusters, which is usually the correct answer to the question of "what unit do you want to work with?"

> Swift 4 makes the String type itself a sequence of grapheme clusters, which is usually the correct answer to the question of "what unit do you want to work with?"

In my experience, I have not yet found a case where I ever wanted to use grapheme clusters. Most algorithms want to iterate over Unicode codepoints (e.g., displaying fonts). Even in display cases, grapheme clusters isn't necessarily the right thing to use for the backspace key or left/right motion.

When would you use something other than grapheme clusters for backspace or arrow keys?

This whole argument is about whether a language's built in string support should use UTF-8. There is no way that you could build a JavaScript engine where you'd even notice the additional complexity of UTF-8 compared to everything else you have to get right (and yes I've had to handle UTF-8 decoding directly before).

Plus I'm pretty sure all the major JavaScript engines (V8 for sure) already know how to handle UTF-8 since that's the encoding most scripts come from.

So you're looking at up to four times as many chars per code point to take into account, but you still claim it's mostly the same thing. Good luck with the lobbying then, I think we're going to have to agree to disagree on this one.

Can you describe an operation (other than "count the number of UTF-16 code units") which is easier to code for UTF-16 than UTF-8?

Yes, the most important operation: string validation!

UTF-16's validation concerns are:

1. Broken surrogate pairs, which is mostly benign.

2. Byte-order confusion.

While UTF-8 has:

1. Invalid code points, for example, code points for surrogate halves.

2. Invalid code units, such as 0xFF.

3. Non-shortest forms, where a character may be encoded multiple ways.

4. Representation of NUL, and potential for confusion with APIs that expect null-terminated strings.

In practice the UTF-8 issues have caused much more serious vulnerabilities.
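To make the non-shortest-form case concrete: the two-byte sequence 0xC0 0x80 is an overlong encoding of U+0000, historically used to smuggle NULs past validators. A conforming decoder must reject or replace it; a quick check with TextDecoder (available in browsers and modern Node):

```javascript
// 0xC0 0x80 would decode to U+0000 if overlongs were (incorrectly)
// accepted; a conforming decoder rejects it or emits replacement chars.
const overlongNul = new Uint8Array([0xc0, 0x80]);
const strict = new TextDecoder("utf-8", { fatal: true });
const lenient = new TextDecoder("utf-8");
try {
  strict.decode(overlongNul);
} catch (e) {
  console.log("strict decoder rejected the overlong sequence");
}
console.log(lenient.decode(overlongNul) === "\uFFFD\uFFFD"); // true: two U+FFFDs
```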

#4 is double-counting, since that's a special case of #3.

In any case, these are all concerns for a decoder, but not for an API, which is what we're discussing here. In fact, the original comment I replied to up there was advocating the opposite: UTF-16 internally, and UTF-8 for interchange!

Well, even counting the number of code units is straightforward for UTF-8:

   while (*c) count += ((*(c++) & 0xC0) == 0x80) ? 0 : 1;
See https://stackoverflow.com/questions/9356169/utf-8-continuati... for more details.

That's counting code points, not code units. A code point is the Unicode "character" number. A code unit is the smallest unit used by an encoding, such as a byte in UTF-8 or two bytes in UTF-16.

Counting the number of UTF-8 code units in a UTF-8 string is of course trivial. Counting the number of UTF-16 code units in a UTF-8 strings would take more work. But there's probably no reason you'd want to compute that anyway.

I don't know what angle you are coming from, but by and large most uses of UTF-16 are entirely due to legacy reasons – APIs and languages made before Unicode extended beyond what fits in 16 bits.

Back then, the trade-off made sense, and was popular too, because variable-sized encodings were much more rare. These days UTF-16 really is the worst of both worlds, but we are stuck with what we have.
