
JavaScript string encoding - iamwil
https://kev.inburke.com/kevin/node-js-string-encoding/
======
0x0
I don't think this article is very good. It seems to make the "newbie Unicode
error" of assuming that strings "have" an encoding (or that strings "are"
UTF-8) and of thinking about strings in terms of bytes. For example, in the
JSON paragraph, it makes the cardinal sin of referring to "... creating UTF-8
strings".

No such thing! Strings are arrays of integer Unicode code points. Stop
thinking about bytes at this level. The internal representation of strings and
chars does not matter, because you as a programmer only ever see integer code
points.

Encoding only enters the picture the moment you want to convert your string to
or from a byte array (for example, to write to disk or send over the network).
The encoding, such as "UTF-8", then specifies how to map between the array of
abstract code points and an array of 8-bit bytes.
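
A minimal sketch of that boundary in JavaScript (TextEncoder / TextDecoder are
the standard way to do this conversion in browsers and modern Node):

    const str = 'héllo';                         // just code points; no encoding attached
    const bytes = new TextEncoder().encode(str); // code points -> UTF-8 bytes (a Uint8Array)
    const back = new TextDecoder('utf-8').decode(bytes); // bytes -> code points again
    // Only `bytes` has an encoding; `str` and `back` are plain strings, and
    // "UTF-8" means nothing for them until the next conversion to bytes.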

~~~
alangpierce
Unfortunately, JavaScript (and Java and Python 2 and other languages) uses
UTF-16 for its strings, and leaks that information to the programmer, so if
you use a language like that, you should probably have a basic understanding
of how UTF-16 works.

I'd say a JavaScript programmer should think of things this way:

* JavaScript strings are exposed to the programmer as an array of UTF-16 code units, although there are some helper functions like `codePointAt` to help interpret strings in terms of code points.

* Newer languages expose strings as an array of Unicode code points, which is cleaner because it's independent of any particular encoding.

* Even when working with code points, you can't safely reverse strings or anything like that, since a user-perceived character might consist of multiple code points.
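
A small illustration of the code unit / code point distinction, using nothing
beyond standard string methods:

    const s = '😀';                      // U+1F600, outside the Basic Multilingual Plane
    s.length;                            // 2: .length counts UTF-16 code units
    s.charCodeAt(0).toString(16);        // 'd83d': a high surrogate (one code unit)
    s.codePointAt(0).toString(16);       // '1f600': the actual code point
    [...s].length;                       // 1: the string iterator walks code points
    // And even code points aren't user-perceived characters: U+0065 U+0301
    // ('e' plus combining acute) is two code points but one visible 'é'.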

~~~
0x0
It is true that JavaScript, like many other languages (Java, Win32 wide chars,
etc.), has to deal with the consequences of having assumed that Unicode code
points could not exceed the integer value 65535, so you have to deal with
surrogate pairs. So I guess that is one way to "encode" all planes of Unicode
in a backward-compatible way. Bytes don't enter the picture, though!

It is still very important in general to distinguish a string from a byte
array, and I felt like the article was fairly counter-productive with all its
talk about "UTF-8 strings" (which doesn't make much sense: either you have a
byte array that you can apply UTF-8 decoding to in order to get a string out,
or you already have a string, in which case encodings in the traditional sense
(UTF-8, ISO-8859-1/Latin-1, etc.) don't apply).

~~~
burntsushi
> was fairly counter-productive with all its talk about "UTF-8 strings"

Eh. I agree and disagree. I agree in the sense that the phrase "UTF-8 string"
is generally a misnomer and is a good signal that there's some confusion
somewhere, but I don't think I find it as damning as you do. In
particular, not all languages represent strings as sequences of codepoints,
and instead make their internal UTF-8 byte representation a first class part
of their string API. Two languages that come to mind are Go and Rust, where
Go uses UTF-8 by convention and Rust enforces UTF-8. But in both
cases, accessing the raw bytes is not only a standard operation, but is
necessary whenever you want to do high performance string processing.

That is, if someone said Go/Rust had "UTF-8 strings," that wouldn't be
altogether wrong. UTF-8 is a first class aspect of both string APIs, while
both provide standard functions you'd expect from Unicode strings.

------
twotwotwo
Commented this on the blog; cross-posting here:

V8 turns out to have a ton of internal string representations. They don’t
affect semantics (the general point that JS string functions think in UTF-16
is valid) but they’re interesting.

V8 apparently stores all-ASCII strings as ASCII, so stuff like HTML tag names
or base64 blobs doesn’t double in size.

Like Go, V8 lets you take a substring as a pointer into the larger string; the
internal class is called SlicedString, but from JS-land you don’t see anything
different from a string literal. As in Go, keeping a short substring of a long
parent string keeps the whole parent ‘alive’ across GCs so sometimes folks
will be surprised all those bytes are still allocated.

Unlike Go, V8 has a ConsString type, so concatenating strings sometimes
doesn’t immediately copy the underlying bytes anywhere. Building a
string with a loop that runs str += newPiece probably goes faster than
expected because of this. [It turns out it flattens the string the next time
you index into it, or at least used to, which has some perf implications of
its own:
[https://gist.github.com/mraleph/3397008](https://gist.github.com/mraleph/3397008)]
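
Roughly the pattern being described (a sketch; the ConsString/flattening
behavior is an engine internal, not something the language guarantees):

    let out = '';
    for (let i = 0; i < 100000; i++) {
      // V8 can represent the result as a ConsString (a tree of pieces),
      // so each += avoids copying the whole accumulated string.
      out += 'piece';
    }
    // Indexing reportedly forces the string to be flattened into
    // contiguous storage the first time.
    const first = out[0];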

That’s mostly from this post by a member of the Dart team:
[http://mrale.ph/blog/2016/11/23/making-less-dart-faster.html](http://mrale.ph/blog/2016/11/23/making-less-dart-faster.html) ;
his blog has a lot of interesting stuff about how these fine-tuned language
implementations really work.

Kind of amazing the lengths the V8 team (and other JS engine teams) went to to
make the code they saw in the wild work well.

~~~
ubernostrum
_V8 apparently stores all-ASCII strings as ASCII, so stuff like HTML tag names
or base64 blobs doesn’t double in size._

Python does this too.

In Python 2, and in Python 3 until 3.3, a compile-time flag determined the
internal Unicode storage of the Python interpreter; a "narrow" build of the
interpreter used 2-byte Unicode storage with surrogate pairs, while a "wide"
build used 4-byte Unicode storage.

As of Python 3.3, the internal storage of Unicode is dynamic. Python 3 source
code is always parsed as UTF-8, but then as string objects are created by the
interpreter their memory representation is chosen on a per-string basis, to be
able to accommodate the widest code point present in the string. So Python
will choose either a one-byte, two-byte, or four-byte encoding to store the
string in memory, depending on what code points are present in it.

This is very nice because it means iteration over a Python string is _always_
iteration over its code points, the length of a string is _always_ the number
of code points in it, and indexing _always_ yields the code point at index,
since the internal storage of the string is fixed width and never has to
include surrogates (in pre-3.3 Python, "narrow" builds would actually yield up
things for which ord() gave a value in the surrogate range, and code points
requiring surrogates added 2 to the length of a string rather than 1).

------
benjaminjackman
I wonder if there is ever going to be an encoding that replaces UTF-8? Or have
we hit on some sort of permanent local maximum (not a global one, in the sense
that UTF-8 carries the baggage of being backwards compatible with ASCII ...
though maybe you could argue that's more of a Unicode problem than a UTF-8
encoding-format one)?

At this point UTF-8 seems pretty permanent; what would come along to replace
it? And if it is likely to be permanent, shouldn't Node / JavaScript in general
be moving towards deprecating UCS-2 / UTF-16 and giving first-class support to
UTF-8?

I saw all this because a couple of years ago I had to write a UTF-8 converter,
before ScalaJS natively supported one, for a serialization library I had
written. I was kind of surprised that the JavaScript support was so lacking;
luckily, writing a UTF-8 encoder/decoder isn't that hard of an endeavor.
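
For a sense of scale, the encoder half is roughly this (a sketch that ignores
lone surrogates and error handling; these days TextEncoder does it for you):

    function utf8Encode(str) {
      const out = [];
      for (const ch of str) {                // for..of iterates by code point
        const cp = ch.codePointAt(0);
        if (cp < 0x80) {
          out.push(cp);                                             // 1 byte
        } else if (cp < 0x800) {
          out.push(0xc0 | (cp >> 6), 0x80 | (cp & 0x3f));           // 2 bytes
        } else if (cp < 0x10000) {
          out.push(0xe0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3f),
                   0x80 | (cp & 0x3f));                             // 3 bytes
        } else {
          out.push(0xf0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3f),
                   0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f));  // 4 bytes
        }
      }
      return Uint8Array.from(out);
    }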

~~~
codewiz
_I wonder if there is ever going to be an encoding that replaces UTF-8?_

Maybe WTF-8?
[https://simonsapin.github.io/wtf-8/](https://simonsapin.github.io/wtf-8/)

~~~
tentaTherapist
Section 1: "WTF-8 is a hack..."

I doubt that this is going to be replacing anything except in the narrow use-
cases mentioned on the same page.

~~~
fanf2
I should have published this definition of WTF-8 properly:
[http://people.ds.cam.ac.uk/fanf2/hermes/doc/qsmtp/draft-fanf-wtf8.html](http://people.ds.cam.ac.uk/fanf2/hermes/doc/qsmtp/draft-fanf-wtf8.html)

------
josteink
> A string is a series of bytes.

Incorrect or wildly inaccurate. A string is conceptually _text_, which may be
(is) represented internally as bytes through some means of encoding that
text. And thus the concept of _encodings_ is introduced.

That (and how) text is represented should be an implementation detail, though:
strings represent text, not bytes.

I think most people miss this distinction and that's the main source of
confusion for encoding-related problems among programmers.

~~~
kevinburke
Thanks for describing the post contents as "wildly inaccurate." I guess the
problem with explaining anything is you have to decide which abstractions are
good enough for the concept you are trying to get across.

~~~
josteink
Sure. But this is a pet peeve of mine: strings are text. Bytes (and thus
encodings) are something you should only be concerned about when doing file or
network IO. It's boundary-stuff, and none of your actual text-processing
should depend on it.

Consider the phrasing my way of shielding you from accusations about being
entirely wrong ;)

~~~
burntsushi
> It's boundary-stuff, and none of your actual text-processing should depend
> on it.

I strongly disagree. For high performance text search, it's _critical_ that
you deal with the text's in-memory representation explicitly. This violates
your maxim that such things are only done at the boundaries.

For example, if you're implementing substring search, the techniques you use
will heavily depend on how your string is represented in memory. Is it UTF-16?
UTF-8? A sequence of codepoints? A sequence of grapheme clusters, where each
cluster is a sequence of codepoints? Each of these choices will require
different substring search strategies if you care about squeezing the most
juice out of the underlying hardware.
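
A toy illustration in JavaScript of why the representation matters (the byte
search function here is made up for the example, not a real library call):

    const haystack = 'naïve café';
    const needle = 'café';

    // Searching over UTF-16 code units, which is what JS strings expose:
    const idx16 = haystack.indexOf(needle);   // an offset in code units

    // Searching over the UTF-8 byte representation instead:
    const enc = new TextEncoder();
    const hay = enc.encode(haystack);
    const ndl = enc.encode(needle);
    function findBytes(h, n) {                // naive byte-by-byte search
      outer: for (let i = 0; i + n.length <= h.length; i++) {
        for (let j = 0; j < n.length; j++) {
          if (h[i + j] !== n[j]) continue outer;
        }
        return i;
      }
      return -1;
    }
    const idx8 = findBytes(hay, ndl);         // an offset in bytes: a different
                                              // number and a different inner loop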

~~~
ademarre
Sometimes I think we would be better off if our languages didn't have _string_
as a data type at all.

The text encodings themselves (e.g. UTF-8, UTF-32) ought to be proper data
types. Strings are a leaky abstraction that causes otherwise competent
programmers to have funny ideas about what text is and isn't, as this entire
thread demonstrates.

------
userbinator
_Unless of course you specify hex or base64, in which case it does refer to
the encoding of the output string_

I've seen this... questionable design decision in another
interpreted/scripting language too. Those are encodings, but clearly not
encodings at the same "layer of abstraction" as e.g. UTF-8 or Shift-JIS or
UTF-32 or UTF-16, because you could have a UTF-16 string containing "base64"
or "hex".

------
gumby
> You could easily represent all of the characters in the Unicode set with an
> encoding that says simply "assign one number, 4 bytes (or 32 bits) long, for
> each character in the Unicode set."

Actually you can't, at least not in a standard way. There are combinations of
code points that don't have a single-code-point equivalent. Flag emojis are an
example, but there are many letterlike versions as well. There are enough
unallocated points in the 32-bit space that you could probably manage to make
single-point equivalents on your own.
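
A quick illustration with a flag:

    const flag = '🇺🇸';
    [...flag].length;                                  // 2 code points
    flag.length;                                       // 4 UTF-16 code units
    [...flag].map(c => c.codePointAt(0).toString(16)); // ['1f1fa', '1f1f8']
    // Two regional indicator code points render as one flag; there is no
    // single code point that means the same thing.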

~~~
whipoodle
Hmm, I don't understand. AFAIK flag emojis are ligatures across two code
points representing the two-letter international code for the country.

~~~
mahkoh
gumby is deliberately interpreting the word character to mean grapheme cluster
(whereas the author meant code point) to make a point.

~~~
mikeash
We really should avoid that word when discussing Unicode. It just leads to
confusion.

------
zengid
> _There's nothing stopping us from packing UTF-8 bytes into a UTF-16 string:
> to use each of the two bytes to store one UTF-8 character. We would need
> custom encoders and decoders, but it's possible. And it would avoid the need
> to re-encode the string at any system boundary._

I'm guessing you'd have to write a C++ module for this, but any suggestions on
how one might do this successfully?

~~~
pwdisswordfish
That's completely bogus. Forget unpaired surrogates; this scheme cannot even
represent _odd-length ASCII strings_.

------
huhlig
This just leads me to believe that Node and its followers never learned the
lessons from their stint with PHP.

Also, "Encoding is the process of squashing the graphics you see on screen,
say, 世 - into actual bytes." No, it's a way of representing one value in a way
a system can more easily handle, in a hopefully lossless fashion. Encoding has
nothing to do with what's on screen, other than that being one representation
of the data.

------
andreasgonewild
As much as I enjoy making fun of JavaScript, it seems more likely to me that
the reason JavaScript uses UTF-16 internally is the same as for most other
languages that support Unicode: it's more efficient and convenient to process.
UTF-8 has variable character boundaries, which means that indexing/counting
requires decoding char by char; but it works wonders as an exchange format
since it's compact and almost any language can deal with it.

~~~
mikeash
All Unicode encodings require intelligent indexing. JavaScript uses UTF-16
because that (or rather its predecessor UCS-2) was the standard when it was
being created. Same reason Java and Apple's Objective-C frameworks use it.

~~~
andreasgonewild
Not to the same extent as UTF-8. It's not like UTF-8 invalidated all other
encodings; they still serve a purpose, and UTF-16 seems to still be a popular
choice for internal processing, despite the misguided push to use UTF-8 for
everything.

~~~
mikeash
What's the difference? Both UTF-8 and UTF-16 are variable-length encodings
where careless mutation with integer indexes can produce invalid results.
UTF-8 is 1-4 bytes per code point, whereas UTF-16 is only 1-2 code units per
code point, but that doesn't really make it easier. And proper handling really
requires detecting grapheme cluster boundaries, which is the same difficulty
regardless of whether you use UTF-8, UTF-16, or UTF-32.

~~~
userbinator
_UTF-8 is 1-4 bytes per code point, whereas UTF-16 is only 1-2 code units per
code point, but that doesn't really make it easier_

As someone who has actually written UTF-8/UTF-16 conversion code, I can
immediately tell you which one is far easier to implement: UTF-16. The number
of valid cases is basically halved, and the number of error cases in UTF-16 is
a fraction of those in UTF-8. Put another way, there are plenty more invalid
UTF-8 sequences than invalid UTF-16 sequences.

~~~
mikeash
I agree that UTF-16 is slightly simpler to parse, but compared to everything
else you need for Unicode-aware string processing, both are completely
trivial.

In any case, the discussion here is about the appropriate string API and the
relative difficulty of working with it. Exposing UTF-8 versus UTF-16 changes
essentially nothing: in both cases you either need to deal with non-integer
indexes or deal with integer indexes where not all values are valid.

Good string APIs are hard. Most Unicode-aware languages pick one particular
encoding and then toss the programmer in the deep end with it.

The only language I've seen get it vaguely correct is Swift. (I'm sure there
are others, but it's definitely not common.) Swift strings provide multiple
views, so you can work with UTF-8, UTF-16, UTF-32, or grapheme clusters, as
you need. It doesn't allow using integer indexes directly, so you have to
confront the fact that indexing is actually non-trivial. Swift 3 requires
using views, and Swift 4 makes the String type itself a sequence of grapheme
clusters, which is usually the correct answer to the question of "what unit do
you want to work with?"

~~~
jcranmer
> Swift 4 makes the String type itself a sequence of grapheme clusters, which
> is usually the correct answer to the question of "what unit do you want to
> work with?"

In my experience, I have not yet found a case where I ever wanted to use
grapheme clusters. Most algorithms want to iterate over Unicode codepoints
(e.g., displaying fonts). Even in display cases, grapheme clusters aren't
necessarily the right thing to use for the backspace key or left/right motion.

~~~
mikeash
When would you use something other than grapheme clusters for backspace or
arrow keys?

