Requiem for hacks upon hacks all the way down the stack to make historic brain farts work.

A sensible way forward would be to deprecate APIs for direct UTF-16 code unit access, but implement them for backward compatibility on top of an internal UTF-8 representation. On both sides.

We've lived with JS bloat for two decades, you think a few string copies/conversions are gonna kill us? Any non-toy uses of WebAssembly are gonna be new developments. Old shit that nobody is gonna run on WebAssembly other than to go "yeah, huh, it runs on WebAssembly" and then never use it again, doesn't need to run great.

A few copies aren't gonna kill us, but this very well might, if charAt() is now O(n):

    // O(n) total with O(1) charAt(); O(n^2) if charAt() becomes O(n)
    for (int i = 0; i < str.length(); i++)
        doSomethingWith(str.charAt(i));
UTF-8 is just a text encoding. If you're making something new, yeah, it's the obvious choice, but it's not better enough to justify breaking all sorts of shit just to switch over.


That code makes zero sense in Unicode. First question: how are you representing your umlauts? Then come zero-width joiner characters.

You never work on characters; you work on grapheme clusters or whatnot, but never on individual characters.


I'm not advocating you write this, I'm saying people have written it, probably hundreds of thousands of times, and if charAt() becomes O(n) instead of O(1), this code suddenly hangs your CPU for 10 seconds on a long string, thus you can't really swap out UTF-16 for UTF-8 transparently.
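
To make that concrete, here's a sketch (purely illustrative, not any engine's actual code) of what a code-point charAt() over UTF-8 has to do. Reaching the i-th code point means scanning lead bytes from the start, so each call is O(n) and the loop above becomes O(n^2):

    // Hypothetical code-point access over UTF-8; assumes valid input,
    // no error handling.
    static int codePointAtUtf8(byte[] utf8, int index) {
        int pos = 0;
        // Skip `index` code points; each lead byte encodes its width (1-4 bytes).
        for (int skipped = 0; skipped < index; skipped++) {
            int b = utf8[pos] & 0xFF;
            pos += (b < 0x80) ? 1 : (b < 0xE0) ? 2 : (b < 0xF0) ? 3 : 4;
        }
        int b0 = utf8[pos] & 0xFF;
        if (b0 < 0x80) return b0;                  // 1 byte (ASCII)
        if (b0 < 0xE0) return ((b0 & 0x1F) << 6)   // 2 bytes
                | (utf8[pos + 1] & 0x3F);
        if (b0 < 0xF0) return ((b0 & 0x0F) << 12)  // 3 bytes
                | ((utf8[pos + 1] & 0x3F) << 6)
                | (utf8[pos + 2] & 0x3F);
        return ((b0 & 0x07) << 18)                 // 4 bytes
                | ((utf8[pos + 1] & 0x3F) << 12)
                | ((utf8[pos + 2] & 0x3F) << 6)
                | (utf8[pos + 3] & 0x3F);
    }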


Your point doesn't stand for UTF-16 either: it's not a fixed-length encoding, so that code is broken in UTF-16 as well.

It was always O(n).

Of course, that's assuming you aren't using UTF-32, which has its own set of problems (BE or LE) and sees little usage outside of China.


...it's not O(n). Many languages, JS, Java, and C# included, have O(1) access to a character at a given position. You correctly note that it won't work well with international strings, but GP is right that A LOT of code like this was written by Western, ASCII-brained developers.


Haven't used Java in a while, but I believe charAt() returns a UTF-16 code unit and is constant-time access. So something like the above works not only for ASCII but also for the majority of Western languages and special characters you may encounter on a day-to-day basis.


It's constant time iff you ignore surrogate pairs, which means ignoring Unicode. By that logic UTF-8 is constant time if you ignore anything non-ASCII, because most text is in English.

Saying it works fine if you ignore errors and avoid edge cases is just a clever rephrasing of "it worked on my machine".

Plus, emoji sit at code points like U+1F600 and above, so even in Western languages you are bound to find such "exceptions".
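
A quick Java demo of the surrogate-pair problem (JS behaves the same way):

    String s = "😀";                        // U+1F600, outside the BMP
    System.out.println(s.length());        // 2: two UTF-16 code units
    System.out.println((int) s.charAt(0)); // 55357 (0xD83D), a lone high surrogate
    System.out.println(s.codePointAt(0));  // 128512 (0x1F600), the actual code point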


In practice you often do because it's common to parse text that has "special" characters which are always ASCII. Think CSV, XML, JSON, source code, that sort of thing. These formats may have non-ASCII characters in them in places, but it's still a very common task to work with indexes into the string and the "character" at that index, which works fine because in practice that character is known to always be a single code unit.
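
A sketch of that pattern, with a hypothetical splitCsvLine() that handles no quoting or escaping: scanning by code unit is safe here because ',' is always a single code unit and can never appear inside a surrogate pair.

    static java.util.List<String> splitCsvLine(String line) {
        java.util.List<String> fields = new java.util.ArrayList<>();
        int start = 0;
        for (int i = 0; i < line.length(); i++) {
            if (line.charAt(i) == ',') {        // delimiter is always one code unit
                fields.add(line.substring(start, i));
                start = i + 1;
            }
        }
        fields.add(line.substring(start));      // last field
        return fields;
    }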


I've found that in that case it's much easier to just operate on raw bytes, then transform those into Unicode characters. It works trivially for UTF-8, and needs some massaging for UTF-16 and UTF-32 because of BE/LE.
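
For example, a minimal byte-level version of the same split, assuming UTF-8 input; the field bytes are only decoded into strings at the end:

    static java.util.List<String> splitCsvBytes(byte[] utf8) {
        java.util.List<String> fields = new java.util.ArrayList<>();
        int start = 0;
        for (int i = 0; i < utf8.length; i++) {
            // Safe: in UTF-8, every byte of a multi-byte sequence is >= 0x80,
            // so none can be mistaken for the ASCII ','.
            if (utf8[i] == ',') {
                fields.add(new String(utf8, start, i - start,
                        java.nio.charset.StandardCharsets.UTF_8));
                start = i + 1;
            }
        }
        fields.add(new String(utf8, start, utf8.length - start,
                java.nio.charset.StandardCharsets.UTF_8));
        return fields;
    }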


You’re correct about algorithms that do “human” things with text, but you need to think of more examples.

That’s how you write hashing algorithms, checksums, and certain trivial parsers.[0]

But most importantly, right or wrong, this code is out there, running today, god knows where, and you do not slow it down from O(n) to O(n^2).


Is such code really going to be ported to WASM though? And does it really matter for the string lengths that a typical web application has to process? WASM really doesn't have to worry about legacy that much.


Hashing algorithms and checksums work on bytes, not characters.


Here is the JDK 7 String#hashCode(), which operates on characters: https://github.com/openjdk-mirror/jdk7u-jdk/blob/f4d80957e89....

That changed in newer versions, because String now holds a `byte[]` rather than a `char[]`, but the old version was just fine. A hash algorithm can take in bytes, characters, ints; it doesn't matter.
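
For reference, the char-based version boils down to the documented s[0]*31^(n-1) + ... + s[n-1] formula:

    static int hash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++)
            h = 31 * h + s.charAt(i);  // mixes UTF-16 code units, not bytes
        return h;
    }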

In Java, you don't get access to the bytes that make up a string, to preserve the string's immutability. So for many operations where you might operate on bytes in a lower level language, you end up using characters (unless you're the standard library, and you can finagle access to the bytes), or alternately doing a byte copy of the entire string.

I admit, checksums using characters are a bit weird sounding, but they should also be perfectly well-defined.


A possible optimization would be to change the internal representation on the fly for long-ish strings as soon as random accesses are observed. Guidance from experiments would be required to tell where the right thresholds are. Also, JavaScript implementations already do internal conversions between string representations.
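
A hedged sketch of the idea (names are hypothetical; real engines like V8 already switch between internal shapes such as flat and cons/rope strings):

    final class AdaptiveString {
        private final byte[] utf8;  // compact default representation
        private int[] codePoints;   // built lazily on first random access

        AdaptiveString(byte[] utf8) { this.utf8 = utf8; }

        int codePointAt(int index) {
            if (codePoints == null) {
                // One-time O(n) conversion; later accesses are O(1).
                codePoints = new String(utf8,
                        java.nio.charset.StandardCharsets.UTF_8)
                        .codePoints().toArray();
            }
            return codePoints[index];
        }
    }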


A copy for every string passed to the DOM API, to name just one thing, will be a significant limiting factor.


There are some Rust front-end web frameworks that presumably manipulate the DOM, and in C++/Rust to pass a string to JS you need to run a TextDecoder over your WASM memory, so it's probably not a deal breaker.

But like... if you're writing a website, just use JavaScript.


C++/Rust use WASM linear memory, but this article is about reference types via WASM GC. UTF-8 data in an (array i8) or UTF-16 data in an (array i16) are opaque to the host.


Yeah, and you still have to marshal strings at the JS/WASM boundary, same as if you used (array i8/i16) over JS strings in Java.

In the case of non-managed strings, this overhead hasn't been big enough to stop people from writing fairly fast (by Web standards) frontend frameworks in Rust.


The amount of memory and CPU overhead involved in sending strings across the wasm/JS boundary to do something like put text in a textarea is a lot bigger than you might think. It's really severe.


> Any non-toy uses of WebAssembly are gonna be new developments.

Major uses of WebAssembly include things like Photoshop, Unity, and many other large existing codebases.



