Hacker News
The complete guide to working with strings in modern JavaScript (baseclass.io)
64 points by davethedevguy 32 days ago | 32 comments

This guide omits an IMPORTANT issue: handling characters outside the Basic Multilingual Plane (BMP). JavaScript, like some other languages, suffers from the UCS-2 curse - that is, it assumes that all characters fit inside 16 bits, even though that is no longer true.

For example, the cited text says: "You can return the UTF-16 code for a character in a string using charCodeAt()...".

Not true. This only works if the character fits in a single 16-bit code unit; if it needs more than 16 bits, charCodeAt will return only part of the character (one half of a surrogate pair).

There are lots of discussions about this, here's one: https://stackoverflow.com/questions/3744721/javascript-strin...

JavaScript can handle characters outside the BMP, but you sometimes have to be aware of the problem and carefully code around it when such characters are possible.
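A minimal sketch of the difference, using U+1F4A9 as an example of a character outside the BMP (the variable names are illustrative only):

```javascript
// U+1F4A9 lies outside the BMP, so UTF-16 stores it as a surrogate pair.
const poo = "\u{1F4A9}";

const unitCount = poo.length;          // 2 code units for one character
const firstUnit = poo.charCodeAt(0);   // 0xD83D, the high surrogate only
const secondUnit = poo.charCodeAt(1);  // 0xDCA9, the low surrogate
const codePoint = poo.codePointAt(0);  // 0x1F4A9, the full code point

console.log(unitCount, firstUnit.toString(16), secondUnit.toString(16), codePoint.toString(16));
```

codePointAt() still takes an index in code units, so stepping through a string one index at a time will also land on the low surrogates.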

Exactly, and emoji are outside the BMP, so this isn't really an edge case: for emoji it's the norm that two code units (16-bit UTF-16 units) are used to make one code point (Unicode character) [1].

And it gets even worse, when you consider that for many purposes you're not even interested in code points but in graphemes which can be sequences of code points -- e.g. a single visible emoji might actually be a sequence of 5 code points, represented by 8 UTF-16 code units, taking up 16 bytes [2]. Similarly a single accented character will often be two code points (letter plus combining diacritic).

If you want to split a string by graphemes -- e.g. to count its visible length, or delete its last visible character -- you can either use a library for it [3], or the relatively new Intl.Segmenter [4] which is in Chrome and Safari, but hasn't made it to Firefox [5].
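A rough sketch of counting graphemes with Intl.Segmenter, assuming the runtime supports it. countGraphemes is a hypothetical helper name and the "en" locale is an arbitrary choice:

```javascript
// Count user-perceived characters (grapheme clusters) with Intl.Segmenter.
// countGraphemes is a hypothetical helper, not a built-in.
function countGraphemes(text, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
  return [...segmenter.segment(text)].length;
}

const accented = "e\u0301";             // "e" followed by a combining acute accent
const thumbsUp = "\u{1F44D}\u{1F3FD}";  // thumbs up followed by a skin tone modifier

console.log(accented.length, countGraphemes(accented));  // 2 code units, 1 grapheme
console.log(thumbsUp.length, countGraphemes(thumbsUp));  // 4 code units, 1 grapheme
```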

Kind of amazing it's 2021 and you still can't calculate the number of visible characters (graphemes) in a string using native functions across all modern browsers.

[1] https://blog.jonnew.com/posts/poo-dot-length-equals-two

[2] https://www.contentful.com/blog/2016/12/06/unicode-javascrip...

[3] https://github.com/orling/grapheme-splitter

[4] https://github.com/tc39/proposal-intl-segmenter

[5] https://bugzilla.mozilla.org/show_bug.cgi?id=1423593

You're right about the graphemes. If you need it, you need it, but I recommend writing code that does not need to count graphemes if you can avoid it.

In many cases strings are best considered units that can be concatenated at will, but it's best if you avoid splitting them, and if you must split them, generally only split them on ASCII character boundaries. Don't consider "lengths" as something that has a meaning to humans (it doesn't), and don't assume that a "character" is a single JavaScript character (it isn't). If you normally just work with strings as opaque sequences of "characters" that can be later displayed, you can avoid many complications (though obviously NOT all of them).
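A small sketch of why splitting on an ASCII delimiter is safe while cutting at an arbitrary index is not (the sample data is made up for illustration):

```javascript
// Surrogate code units never collide with ASCII, so an ASCII delimiter
// cannot fall in the middle of a character.
const record = "na\u00EFve,\u{1F4A9},x";
const fields = record.split(",");          // three intact fields

// Cutting by index can land between the two halves of a surrogate pair.
const broken = "\u{1F4A9}".slice(0, 1);    // a lone high surrogate

console.log(fields.length);                      // 3
console.log(fields[1] === "\u{1F4A9}");          // true
console.log(broken.charCodeAt(0).toString(16));  // d83d
```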

Thank you for these comments. I didn't know about this at all!

I'll read up on it until I understand it, and then add something to the article that covers it.

This might also help you:


e.g. use Array.from() to at least process code points rather than code units, though that's still not graphemes.
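For instance, a short sketch comparing code units with code points (the sample string is arbitrary):

```javascript
const text = "a\u{1F4A9}b";            // "a", U+1F4A9, "b"

const codeUnits = text.split("");      // splits between the two surrogates
const codePoints = Array.from(text);   // the string iterator walks code points

console.log(codeUnits.length);               // 4
console.log(codePoints.length);              // 3
console.log(codePoints[1] === "\u{1F4A9}");  // true
```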

I’m surprised to see no mention of tagged literals, a much more complex version of template literals. For users they may seem ~like a function call without parentheses. But they do quite a bit more.

Short version: they accept an array of raw substrings and a variadic set of arguments corresponding to the runtime values provided in template positions, each positional value following the string that precedes it.

That array is more than it seems: it also exposes a raw property holding the unprocessed source text of each string in the template. This is what String.raw (also not mentioned) uses to treat those arguments essentially the same way an untagged template literal would.
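A minimal sketch of what a tag function receives. describeTemplate is a hypothetical tag written only to inspect its arguments; as far as I know, the first argument holds the strings with escapes already processed, and its raw property holds them as typed:

```javascript
// describeTemplate is a hypothetical tag used only to look at its inputs.
function describeTemplate(strings, ...values) {
  return {
    cooked: Array.from(strings),   // escape sequences already processed
    raw: Array.from(strings.raw),  // source text as typed, escapes untouched
    values,                        // one entry per ${...} expression
  };
}

const name = "world";
const parts = describeTemplate`hello\t${name}!`;

console.log(parts.cooked[0] === "hello\t");  // true, contains a real tab
console.log(parts.raw[0] === "hello\\t");    // true, backslash followed by "t"
console.log(parts.values);                   // one value: "world"
console.log(String.raw`hello\t${name}!` === "hello\\tworld!");  // true
```

There is always one more string than there are values, which is the N and N-1 relationship the complaint below refers to.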

It’s an odd design/interface but it can be used to do some pretty cool stuff. For example, Zapatos[1], a type-safe SQL library for TypeScript.

My only complaints:

1. I can’t think of a real reason for it to be variadic, and this makes authoring them a little more error prone. You should be able to expect one array of strings with a length N, and one array of (type checkable/inferrable) values with a length N-1.

2. Likewise I can’t think of a real reason for the raw values to be bolted onto a weird array subclass. It could just as easily have been an iterable third argument.

1: https://github.com/jawj/zapatos

Thank you, I didn't know about this!

I've added a section on it:


    // Correct: returns '1'
    'Résumé'.localeCompare('RESUME', undefined, {sensitivity: 'accent'})

localeCompare() returns 0 if the strings are equal and a negative or positive number (typically -1/+1) if they're different. Since this section is about comparing two strings that only differ in case and accents, I would expect to see a method I could use that would consider the strings to be equal. Instead, this example just shows two ways to compare strings (=== and localeCompare) that both consider the strings to be different.

Thank you, you're right that's a mistake.

This example was supposed to use: {sensitivity: 'base'}

I've corrected it.

Not really. Case (in)sensitivity and accent (in)sensitivity are two orthogonal things. If you want to compare two strings, ignoring case differences, converting both to lowercase (or uppercase) is completely fine in JavaScript. (It might be problematic in other languages because of the Turkish dotted and dotless i, but in JS the obvious first choice, toLowerCase(), is locale-ignorant; you would need to use toLocaleLowerCase() to be bitten by the problem, and why would you do that?) Obviously, the method considers "Á" and "a" to be different, but why wouldn't it? Those characters differ in accents, not only in case.
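A short sketch of how the two axes can be controlled separately through the sensitivity option. compareWith is a hypothetical helper, and the "en" locale is an assumption made to keep the results predictable:

```javascript
// Compare with a given collator sensitivity; 0 means "treated as equal".
// compareWith is a hypothetical helper; "en" is an assumed locale.
function compareWith(sensitivity, a, b) {
  return a.localeCompare(b, "en", { sensitivity });
}

const caseOnly = compareWith("accent", "resume", "RESUME");    // 0: case is ignored
const accentKept = compareWith("accent", "Résumé", "RESUME");  // non-zero: accents still count
const baseOnly = compareWith("base", "Résumé", "RESUME");      // 0: case and accents ignored
const lowerCased = "Résumé".toLowerCase() === "RESUME".toLowerCase();  // false: "é" is not "e"

console.log(caseOnly, accentKept !== 0, baseOnly, lowerCased);
```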

I came to write this. Why are accents being disregarded under the case-insensitive comparison section?

The proper solution is typically "case fold", however I only know it from Python, not sure if JavaScript supports it natively.
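As far as I know there is no built-in counterpart to Python's str.casefold() in JavaScript. The sketch below only illustrates why lowercasing is not the same as folding, using the German sharp s as an example; it is not a general case-folding implementation:

```javascript
const lower = "stra\u00DFe";   // "straße"
const upper = "STRASSE";

// Lowercasing leaves "ß" alone, so the two spellings stay different.
const viaLowerCase = lower.toLowerCase() === upper.toLowerCase();  // false

// Uppercasing expands "ß" to "SS", so here the two spellings match.
const viaUpperCase = lower.toUpperCase() === upper.toUpperCase();  // true

console.log(viaLowerCase, viaUpperCase);
```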

Thanks for this.

I've split 'Handling diacritics' into a separate section from 'Case sensitivity'.

> Case (in)sensitivity and accent (in)sensitivity are two orthogonal things

Is that definitely true?

As I understand it (and I admit I'm no expert), it's common to omit accents in some languages when changing case.

I'll add something: if a string [^1] contains essentially only ASCII [^2] characters, V8 will use 1 byte per character; if the string contains _any_ character outside that set, it will use 2 bytes for each character in the string. Put differently, storing strings as lines may save you up to 50% of memory usage, depending on your use case.

[^1]: It actually depends on how that string was made; if internally it still references the parent string, then slicing it up into lines won't save you any memory. I'm referring to "flattened" strings.

[^2]: I don't remember what the exact character set is, I think it's not exactly ASCII but close enough.

It's latin1. The same is true of DOM strings in Chromium, like attributes, blocks of text, and inline scripts.
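A rough sketch of checking whether a string stays within the Latin-1 range. fitsInLatin1 is a hypothetical helper: it only tests the code unit values and does not inspect how any engine actually stores the string:

```javascript
// fitsInLatin1 is a hypothetical helper. It checks that every UTF-16 code
// unit is <= 0xFF; it says nothing certain about an engine's internals.
function fitsInLatin1(text) {
  return /^[\u0000-\u00FF]*$/.test(text);
}

console.log(fitsInLatin1("plain ASCII"));      // true
console.log(fitsInLatin1("caf\u00E9"));        // true, U+00E9 is within Latin-1
console.log(fitsInLatin1("price: 5\u20AC"));   // false, U+20AC is outside Latin-1
```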

Webkit and the JDK implement the same string optimization, while .NET unfortunately doesn't: https://github.com/dotnet/runtime/issues/6612

Regarding [^2], is it code page 437 by any chance? https://en.m.wikipedia.org/wiki/Code_page_437

It's Latin1 which I think is code page 850: https://en.m.wikipedia.org/wiki/Code_page_850

I didn't know that! I'll add something to the post.

Thank you!

> If you're not sure [which equality operator] to use, prefer strict equality using ====.


Ha! Thank you for noticing that. I've fixed it :)

> let website = new String("BaseClass")

> website.rating = "great!"

Ok that’s new to me. Can anyone point out where this could be useful, if anywhere?

It’s useful to know about even if you have no use case for it, because this conversion happens every time you access a property on any JS primitive. For example, this:

    str.toUpperCase()

Is actually doing this:

    new String(str).toUpperCase()

...hand waving away obvious optimizations that surely occur for multiple property access.

There may be cases, for instance a series of imperative synchronous hot loops, where this has a meaningful performance impact.
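A small sketch of the boxing behaviour described above (variable names are illustrative):

```javascript
const str = "hello";

const viaPrimitive = str.toUpperCase();            // the primitive is boxed temporarily
const viaWrapper = new String(str).toUpperCase();  // explicit wrapper object

console.log(viaPrimitive === viaWrapper);  // true, both results are the primitive "HELLO"
console.log(typeof str);                   // "string", the original is still a primitive
console.log(typeof new String(str));       // "object"
```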

Another thing that’s important to know is that you can do this:

    Object.assign(str, {
        anythingAt: 'all'
    })

And you’ve now mutated your string into a String. This breaks a buuuuunch of expected behaviors with primitives, such as === and the default Array#sort comparator.

Edit: this last one bit me experimenting with a runtime implementation of a common pattern in TypeScript to define nominal types for primitives (e.g. to distinguish a known valid URL from a regular string) called branding[1]. The approach is useful if you’re obsessive about type safety, but flawed because it almost certainly misrepresents the runtime type. I tried to use Object.assign to get that much more type safety and... I got the opposite! I got a String claiming to be a string, which broke all sorts of stuff, including the aforementioned === and Array#sort.
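A minimal sketch of the Object.assign behaviour, assuming a made-up URL value. Strictly speaking the wrapper is the return value, and the original primitive is left untouched:

```javascript
const url = "https://example.com";   // illustrative value only
const branded = Object.assign(url, { anythingAt: "all" });

console.log(typeof url);                 // "string"
console.log(typeof branded);             // "object", a String wrapper
console.log(branded.anythingAt);         // "all"
console.log(branded === url);            // false, an object is never === a primitive
console.log(branded == url);             // true, loose equality unwraps the object
console.log(branded.valueOf() === url);  // true, valueOf() recovers the primitive
```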

1: https://medium.com/@KevinBGreene/surviving-the-typescript-ec...

Only useful thing I can come up with is monkey patching without changing the function signature. E.g. smuggling additional data to a function which expects a string-like thing.

But I think the whole `new String` is problematic, because it breaks the strict comparison:

  let a = "Foo";
  let b = new String("Foo");
  a === "Foo" // true
  b === "Foo" // false

> Only useful thing I can come up with is monkey patching without changing the function signature. E.g. smuggling additional data to a function which expects a string-like thing.

Yeah this sounds terrifying to me, if I were to debug something like this.

Also the fact they identify as objects:

    typeof new String("Foo") === 'object';

Case closed, I will never use it :D

Primarily for adding properties, and making objects that can be inherited from (unlike primitives). This is a great answer with even more cases:


> Primarily for adding properties

I personally would never expect a property on something which is treated as a string. That's what POJOs are for, but that is just opinion, I guess. Still weird they let you do this.

> and making objects that can be inherited from (unlike primitives).

I mean the SO answer gives an example of how to do it, but I would still love to see some place where this is needed or even pragmatic to any degree.

Fine concise reference!

While SPLITting a string into an array is useful, JOINing array elements to create a string is also useful.

e.g. let x = commaSepList.split(',').join('\n')
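Spelled out as a small sketch (the sample list is made up):

```javascript
const commaSepList = "alpha,beta,gamma";

const fields = commaSepList.split(",");              // ["alpha", "beta", "gamma"]
const onePerLine = fields.join("\n");                // "alpha\nbeta\ngamma"
const roundTrip = onePerLine.split("\n").join(",");  // back to the original list

console.log(onePerLine);
console.log(roundTrip === commaSepList);             // true
```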


I'm not sure how I managed to completely forget about joining strings. I'll add that.

I think there's a minor typo in the trimming examples:

  "  Trim Me  ".trim() // "Trim"

should be

  "  Trim Me  ".trim() // "Trim Me"

Thank you! I've fixed it.

Great site. Great idea. Keep it up and add more!

Thank you!
