
Slimmer and faster JavaScript strings in Firefox - evilpie
https://blog.mozilla.org/javascript/2014/07/21/slimmer-and-faster-javascript-strings-in-firefox/
======
codewiz
Lazily converting UTF-8 (or latin1) to UTF-16 as needed is indeed an old trick
employed by many string classes.

It's even a bit surprising that a codebase as popular and performance-critical
as SpiderMonkey hadn't picked up such a simple, high-yield optimization
several years ago.

By the way, other implementations are even lazier: the string is kept in its
original encoding (utf-8 or 7-bit ascii) until someone calls a method
requiring indexed access to characters, such as the subscript operator. At
this point, you convert to UTF-16 for O(1) random access.

Indexing characters in a localized string is rarely useful to applications and
often denotes a bug (did they want the N-th grapheme, glyph or code-point?).
It's best to use higher-level primitives for collating, concatenating and
splitting localized text.
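As a sketch of what those higher-level primitives look like in JavaScript itself (assuming an engine with Intl support), `Intl.Collator` compares localized text without the caller ever touching individual code units:

```javascript
// Locale-aware comparison via Intl.Collator, rather than indexing
// into the string's code units.
const collator = new Intl.Collator("de");

const words = ["zebra", "ähnlich", "apfel"];
words.sort(collator.compare);
// A naive code-unit sort would put "ähnlich" after "zebra",
// because "ä" (U+00E4) compares greater than "z" (U+007A).
```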

Granted, a JavaScript interpreter must remain bug-by-bug compatible with
existing code, thus precluding some of the most aggressive optimizations.

~~~
nitrogen
What do the lazily converting string classes do for characters that don't fit
in a single UTF-16 code unit? Would they convert to UTF-32, or just fall back
to an O(n) index?

Example: ☃

~~~
ceronman
String classes rarely use UTF-16 because it doesn't have a fixed-length code
point representation. UCS-2 is often used instead, which uses two bytes to
represent all the code points in the Basic Multilingual Plane (BMP), which
is enough for 99.99% of use cases.

One example of this is Python, which used UCS-2 until version 3.3. There was a
compile time option to use UCS-4, but UCS-2 was enough for most cases because
the BMP contains all the characters of all the languages currently in use.

~~~
chadzawistowski
Which encoding does Python use now?

~~~
ceronman
PEP 393 introduced flexible string representation which can use 1, 2 or 4
bytes depending on the type of the string:
[http://legacy.python.org/dev/peps/pep-0393/](http://legacy.python.org/dev/peps/pep-0393/)

------
tolmasky
_Linear-time indexing: operations like charAt require character indexing to be
fast. We discussed solving this by adding a special flag to indicate all
characters in the string are ASCII, so that we can still use O(1) indexing in
this case. This scheme will only work for ASCII strings, though, so it’s a
potential performance risk. An alternative is to have such operations inflate
the string from UTF8 to TwoByte, but that’s also not ideal._

Perhaps I'm missing something (quite likely, as I am certainly no expert when
it comes to unicode), but I was under the impression that this would already
have to be the case since UTF16 is also variable length.

~~~
sheetjs
Technically, for characters whose codepoint exceeds 0xFFFF, javascript treats
them as two characters. To see that, consider the Sushi character "🍣"
(U+1F363):

    "🍣".length // 2
    "🍣".charCodeAt(0) // 55356
    "🍣".charCodeAt(1) // 57187
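For completeness, ES6 (still a draft at the time of the post) adds code-point-aware counterparts; a small sketch, assuming an engine with ES6 support:

```javascript
const sushi = "🍣";

// codePointAt decodes the surrogate pair back into the real code point,
// and the ES6 string iterator walks code points rather than code units.
sushi.length;          // 2: still counted in 16-bit code units
sushi.codePointAt(0);  // 127843 (0x1F363)
[...sushi].length;     // 1
```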

~~~
iopq
That's a bad interface that allows you to split strings at useless codepoints
and get illegal UTF-16 strings as the result.

~~~
dbaupp
It's the historical interface which websites now rely on, changing it would be
like writing a libc with strcmp operating on Pascal strings.

In any case, a Javascript String is not actually designed to be UTF-16, it is
essentially just an `uint16_t[]`. Even textual strings just store UTF-16 _code
units_ , not full UTF-16 data. Relevant snippets from the standard:

 _The String type is the set of all finite ordered sequences of zero or more
16-bit unsigned integer values ( "elements")._

 _When a String contains actual textual data, each element is considered to be
a single UTF-16 code unit. [...] All operations on Strings (except as
otherwise stated) treat them as sequences of undifferentiated 16-bit unsigned
integers; they do not ensure the resulting String is in normalised form, nor
do they ensure language-sensitive results._

See also:

\- Section 8.4 [http://www.ecma-international.org/publications/files/ECMA-
ST...](http://www.ecma-international.org/publications/files/ECMA-
ST/Ecma-262.pdf)

\- [http://mathiasbynens.be/notes/javascript-
encoding](http://mathiasbynens.be/notes/javascript-encoding)

~~~
gsnedders
> Although the standard does state that Strings with textual data are supposed
> to be UTF-16.

No, it doesn't. It states that they're UTF-16 code units, a term defined in
Unicode (see D77; essentially an unsigned 16-bit integer), which _is not_ the
same as UTF-16. A sequence of 16-bit code units can therefore include lone
surrogates, which something encoded in UTF-16 could not.
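A quick illustration of the distinction (a sketch, assuming a current engine): a lone surrogate is a perfectly legal JS string element, but no well-formed UTF-16 stream can contain one, which is why UTF-8-based encoders reject it:

```javascript
// A lone high surrogate: fine as a 16-bit string element...
const lone = "\uD83C";
lone.length; // 1

// ...but not well-formed UTF-16, so encoding it to UTF-8 fails:
let threw = false;
try {
  encodeURIComponent(lone); // throws URIError on lone surrogates
} catch (e) {
  threw = e instanceof URIError;
}
```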

~~~
dbaupp
Oh, yes; I just skimmed the 'code unit' bit without actually reading it. (I've
now removed the misinformation from my previous comment.)

------
TheLoneWolfling
My personal ideal representation of strings.

Modified rope-like tree, where each node stores a flat array of characters,
with the caveat that "characters" within a node must have the same encoding,
with the node storing the encoding. Yes, this means that a string can have
multiple encodings within it. The internal "default" representation is a flat
array of a (modified) UTF-8 encoding, at a fixed number of bytes per character
(stored in the encoding) using overlong encodings if necessary. (So if you
change a character in a node composed of two-byte characters into a single-
byte character, you don't need to regenerate the node if it doesn't make
sense.)
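A hypothetical sketch of the leaf-node part of that idea (ignoring the tree structure and the overlong-UTF-8 detail): each leaf fixes its own width in bytes per character, so indexing within a leaf stays O(1) even when different leaves use different widths.

```javascript
// Hypothetical leaf node: a flat byte array with a fixed per-leaf width.
class Leaf {
  constructor(codePoints, width) {
    this.width = width; // bytes per character in this leaf
    this.length = codePoints.length;
    this.bytes = new Uint8Array(codePoints.length * width);
    codePoints.forEach((cp, i) => {
      for (let b = 0; b < width; b++) {
        this.bytes[i * width + b] = (cp >> (8 * b)) & 0xFF; // little-endian
      }
    });
  }
  charAt(i) { // O(1): no variable-length decoding within a leaf
    let cp = 0;
    for (let b = 0; b < this.width; b++) {
      cp |= this.bytes[i * this.width + b] << (8 * b);
    }
    return String.fromCodePoint(cp);
  }
}

// A "string" mixing a 1-byte-wide leaf and a 3-byte-wide leaf: "hi🍣"
const rope = [new Leaf([0x68, 0x69], 1), new Leaf([0x1F363], 3)];
```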

~~~
deathanatos
Honestly, I really couldn't care less how a string is stored internally. It matters
much more to me that it exposes, at a bare minimum, the ability to iterate
over Unicode code points — something JavaScript and so many other languages do
not do. From there I can perhaps build up to something actually useful (like a
concept of a character).

~~~
jeltz
Most languages designed today use UTF-8 so in those you should be able to
iterate over the code points.

~~~
dbaupp
What do you mean by this? UTF-8 != iterating over codepoints (in fact, UTF-8
is the most complicated UTF encoding to iterate over by code point).
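To illustrate why: a UTF-8 walk has to branch on the lead byte just to learn how many bytes each code point spans. A minimal sketch (no validation of continuation bytes, for brevity):

```javascript
// A minimal UTF-8 code-point iterator over a byte array. The lead byte
// alone decides whether a code point spans 1, 2, 3 or 4 bytes.
function* utf8CodePoints(bytes) {
  let i = 0;
  while (i < bytes.length) {
    const b = bytes[i];
    let len, cp;
    if (b < 0x80)      { len = 1; cp = b; }
    else if (b < 0xE0) { len = 2; cp = b & 0x1F; }
    else if (b < 0xF0) { len = 3; cp = b & 0x0F; }
    else               { len = 4; cp = b & 0x07; }
    for (let j = 1; j < len; j++) {
      cp = (cp << 6) | (bytes[i + j] & 0x3F); // fold in continuation bytes
    }
    yield cp;
    i += len;
  }
}

// "é🍣" in UTF-8 is the byte sequence C3 A9 F0 9F 8D A3:
const cps = [...utf8CodePoints([0xC3, 0xA9, 0xF0, 0x9F, 0x8D, 0xA3])];
// cps -> [0xE9, 0x1F363]
```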

------
userbinator
_For every JS string we allocate a small, fixed-size structure (JSString) on
the gc-heap. Short strings can store their characters inline (see the Inline
strings section below), longer strings contain a pointer to characters stored
on the malloc-heap._

I wonder what the reason is for this roundabout way of doing it - couldn't the
whole string be stored as a variable-length block (header with length, and
then the content bytes), all on one heap? Incidentally this is also one of the
things I think is broken about the malloc() interface; there is no portable
way to get the size of an allocated block with only a pointer to it, despite
that information being available somewhere - free() has to know, after all.
Thus to do it the "correct, portable" way you have to end up essentially
duplicating that length somewhere else. The fact that people are getting told
that it's not something they need to know (e.g.
[http://stackoverflow.com/questions/5451104/how-to-get-
memory...](http://stackoverflow.com/questions/5451104/how-to-get-memory-block-
length-after-malloc) ) doesn't help either.

I've written a "nonportable" (in reality, all that would be needed is to
change the function that gets the length from the block header) string
implementation that uses this technique, and it definitely works well.

 _Some operations like eval currently inflate Latin1 to a temporary TwoByte
buffer, because the parser still works on TwoByte strings and making it work
on Latin1 would be a pretty big change._

I haven't looked at their code but if the parser expects the whole string to
be available and accesses it randomly it would certainly be a big rewrite;
otherwise, if it's more like a getchar(), it wouldn't be so hard to have a
function expand each character from the source string as the parser consumes
it.
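The getchar()-style alternative can be sketched as a generator that widens one Latin1 byte at a time as the parser pulls characters (illustrative only; the real parser is C++):

```javascript
// Inflate Latin1 lazily: Latin1 byte values map 1:1 onto U+0000..U+00FF,
// so each byte widens to exactly one 16-bit code unit on demand.
function* inflate(latin1Bytes) {
  for (const b of latin1Bytes) {
    yield String.fromCharCode(b);
  }
}

const chars = [...inflate(new Uint8Array([0x68, 0xE9]))]; // ["h", "é"]
```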

 _The main goal was saving memory, but Latin1 strings also improved
performance on several benchmarks._

With modern processors having multilevel cache hierarchies and increasing
memory latencies, smaller almost always is faster - it's well worth spending a
few extra cycles in the core to avoid the few hundred cycles (or more) of a
cache miss.

~~~
kevingadd
SpiderMonkey uses one of the rather elaborate, carefully-tuned modern
allocators (tcmalloc or jemalloc, I forget which) that's designed around
clustering allocations into 'size categories', and carving allocations out of
those categories. As a result, a given allocation will end up with overhead
based on the size bucket it's put into, and certain buckets may perform
better.

In this environment it's very useful for each type to be fixed-size (within
reason) and consistently sized. You'll end up with all your JSStrings
allocated out of the same size bucket, and other things get simpler and faster.

For example, with fixed-size JSStrings allocating out of a fixed-size
allocation bucket, perhaps thread-local, you can allocate a string by
atomically incrementing a pointer. Speedy!
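A toy sketch of that allocation path (hypothetical sizes; a real engine also handles chunk exhaustion and GC bookkeeping):

```javascript
// Fixed-size-bucket bump allocation: every object in the bucket is
// SLOT_SIZE bytes, so "allocate" is one atomic add on a cursor.
const SLOT_SIZE = 32;                       // pretend sizeof(JSString)
const bucket = new SharedArrayBuffer(1024); // one allocation chunk
const cursor = new Int32Array(new SharedArrayBuffer(4));

function allocSlot() {
  // Atomics.add returns the old value: that's this slot's offset.
  const offset = Atomics.add(cursor, 0, SLOT_SIZE);
  if (offset + SLOT_SIZE > bucket.byteLength) throw new Error("chunk full");
  return offset;
}

allocSlot(); // 0
allocSlot(); // 32
```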

~~~
pcwalton
In general Firefox uses jemalloc, so shipping builds of SpiderMonkey do as
well. However, _individual_ JavaScript objects are allocated using a custom
SpiderMonkey-specific allocator that uses fixed-size bins, knows about garbage
collection information, and (as of this year) supports multiple generations.
No production-quality JavaScript engine uses the C malloc interface for
individual JS objects; you'd get killed in the benchmarks if you tried.

------
ch0wn
Is this similar to the Flexible String Representation[0] in Python 3.3?

[0]
[http://legacy.python.org/dev/peps/pep-0393/](http://legacy.python.org/dev/peps/pep-0393/)

~~~
AnkhMorporkian
Nearly the same, save for the fact that python also gives an option for
UTF-32.

~~~
deathanatos
I wouldn't say it's an option: the string's internal representation might be
UTF-32, but whether it is or not is transparent to you, the coder. (Just as
the JS change is transparent to the JS coder.)

However, the Python change wasn't _entirely_ transparent: len() on a string
now returns the length of the string in code points, whereas previously it
returned the length in code _units_. Further, previously Python could be built
with one of two internal string representations, so len(s) for a constant s
could return different answers depending on your build. Now it doesn't, and
len returning code points is _much_ more useful.

------
greggman
I know this will probably get downvoted into oblivion, but is string space
really an issue in the browser?

For example, this page at the time of this post has 68k of html so 68k of
text. Let's assume it's all ASCII but we store it at 32bits per character so
it's 4x in memory or 270k. Checking Chrome's task manager this page is using
42meg of ram. Do I really care about 68k vs 270k when something else is using
42meg of ram for this page? Firefox is also at a similar level.

Why optimize this? It seems like wrong thing to optimize? Especially for the
complexity added to the code.

~~~
jtc331
I'd guess that it's actually quite an issue for JS-heavy pages. This would
probably benefit anyone doing significant in-browser apps in JS.

~~~
greggman
Hmmm, checking for example Gmail which is arguably a heavy page it's got 4meg
of requests for various js + html files. So 16meg if expanded to 32bits per
code point. But it's using 160meg of ram. Strings are not where all the space
is going it would seem.

~~~
bzbarsky
If you actually read the linked article, it has measurements for how much RAM
strings use in Gmail. For the particular case of the article's author, it was
about 11MB of strings before the changes he made; it was about 6-7MB of
strings afterward. Your mileage will vary depending on what actual mails you
have, of course.

Note also that comparing this to the Chrome numbers for overall Gmail memory
usage is comparing apples and oranges: Firefox tends to use less memory than
Chrome. You'd want to look at about:memory in Firefox to see how much memory
that gmail page is likely using.

------
hexleo
Why wasn't this compared with other browsers like Chrome, etc.?

~~~
nnethercote
In Firefox you can easily answer questions like "how much memory are strings
taking up on this page", thanks to the fine-grained measurements available in
about:memory.

I don't know of a way to get these measurements in other browsers. Chrome's
about:memory page contains _much_ coarser measurements, for example.

~~~
szatkus
Chrome has the memory profiler for that.

~~~
rockdoe
Does that even work for memory usage by native code?

------
thomersch_
Latin1? I hoped it would die some day.

~~~
kannanvijayan
This is an internal representation. JS strings do and continue to behave as
sequences of 16-bit integers.

This change takes advantage of the fact that most JS strings fit into an 8-bit
charspace, so for those that do, it uses a more compact representation
internally.

This optimization is simply: if we have a string and we know that all of the
uint16_ts in the string are <= 255, then just store it as a sequence of
uint8_ts.
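In pseudo-JS, the check amounts to something like this (a sketch, not SpiderMonkey's actual code):

```javascript
// If every 16-bit element is <= 0xFF, the string's characters each fit
// in one byte; otherwise it needs the TwoByte representation.
function deflateToLatin1(s) {
  const bytes = new Uint8Array(s.length);
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    if (unit > 0xFF) return null; // needs TwoByte
    bytes[i] = unit;
  }
  return bytes; // half the memory of a Uint16Array copy
}

deflateToLatin1("héllo");   // Uint8Array of 5 bytes
deflateToLatin1("héllo🍣"); // null -- surrogates are > 0xFF
```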

~~~
angersock
ES6.1 wishlist: UTF8 strings, full stop.

~~~
Ygg2
Be careful what you wish for. Unicode strings are fucking complex. UTF8
doubly so.

For example, which of the four Unicode normalization forms interests you
most? Or do you need grapheme clusters? Or code points? Or byte values?

~~~
iopq
You already have UTF16, which is both complex and inefficient (because of the
two-byte representation of Latin characters).

