Requiem for a stringref (wingolog.org)
89 points by ingve 6 months ago | 96 comments



Stringref is an extremely thoughtful proposal for strings in WebAssembly. It’s surprising, in a way, how thoughtful one need be about strings.

Here is an aside, I promise it'll be relevant. I once visited Gerry Sussman in his office; he was very busy preparing for a class, and I was surprised to see that he was preparing his slides on old-school overhead projector transparencies. "It's because I hate computers," he said, and complained about how he could design a computer from top to bottom, and all its operating system components, but found any program that wasn't emacs or a terminal frustrating, difficult and unintuitive to use (picking up and dropping his mouse to dramatic effect).

And he said another thing, with a sigh, which has stuck with me: “Strings aren’t strings anymore.”

If you lived through the Python 2 to Python 3 transition, and especially if you lived through the world of using Python 2 where most of the applications you worked with were (with an anglophone-centric bias) probably just using ascii to suddenly having unicode errors all the time as you built internationally-viable applications, you’ll also recognize the motivation to redesign strings as a very thoughtful and separate thing from “bytestrings”, as Python 3 did. Python 2 to Python 3 may have been a painful transition, but dealing with text in Python 3 is mountains better than beforehand.

The WebAssembly world has not, as a whole, learned this lesson yet. This will probably start to change soon as more and more higher level languages start to enter the world thanks to WASM GC landing, but for right now the thinking about strings for most of the world is very C-brained, very Python 2. Stringref recognizes that if WASM is going to be the universal VM it hopes to be, strings are one of the things that need to be designed very thoughtfully, both for the future we want and for the present we have to live in (ugh, all that UTF-16 surrogate pair pain!). Perhaps it is too early or too beautiful for this world. I hope it gets a good chance.


> Python 2 to Python 3 may have been a painful transition, but dealing with text in Python 3 is mountains better than beforehand

it is not

python 2 made a disastrously wrong choice about how to add unicode support

python 3 inserted that disastrously wrong choice everywhere (though at least you no longer get compile errors when you put a non-ascii character in utf-8 or latin-1 in a comment, a level of brain damage i've never seen from any other language)

rust and golang made reasonable choices about how to handle unicode; python, by contrast, is a bug-prone mess

i've lost python error tracebacks generated by an on-orbit satellite because they contained a non-ascii character and so the attempt to encode them as text generated an encoding error. python's unicode handling catastrophe has made it unusable for any context where reliability is especially important


I would argue that Python 3 reliability issues should be blamed on inadequate static checking, not on Unicode strictness.

If you do foo.decode(), you are introducing an operation that can throw. If you are programming in Python for a reliability-critical environment, you should detect this at commit/test time and handle it appropriately.
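
A minimal sketch of that failure path and how to handle it at the boundary (the byte string here is just an example):

    raw = b"Traceback: d\xc3\xa9tect\xc3\xa9"             # UTF-8 bytes containing 'é'
    try:
        text = raw.decode("ascii")                        # throws: 0xc3 is not ASCII
    except UnicodeDecodeError:
        text = raw.decode("utf-8", errors="replace")      # degrade instead of crashing
    print(text)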

Rust is every bit as Unicode-strict, but it’s harder to fail to notice that you have a failure path.

Meanwhile, Python 2 will just happily malfunction and carry on. Sure, the code keeps executing, but this doesn’t mean that you will actually get your error message out.


python has a ubiquitous lack of static checking; every other feature added to it must be considered in that context. if on balance it's bad without static checking, it's bad in python

the code in question was not doing foo.decode() or foo.encode(). it was writing a string to a file. python 3 inserts implicit unicode encoding and decoding operations in every access to environment variables, file names, command line arguments, and file contents, unless you pass a special binary flag when you open the file, as if you were on fucking ms-dos.

all those things are byte strings, and rust and python 2 give you access to them as byte strings. python 3 instead prefers to insert subtle bugs into your program
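
a rough sketch of the failure mode, assuming a C/ascii locale and using an in-memory stream as a stand-in for the log file:

    import io
    message = "UnicodeDecodeError in module 'métrique'"     # hypothetical traceback text
    log = io.TextIOWrapper(io.BytesIO(), encoding="ascii")  # what python 3 text mode hands you under LANG=C
    try:
        log.write(message)
        log.flush()
    except UnicodeEncodeError:
        pass                                               # the error report itself is lost
    # binary mode keeps the bytes flowing instead:
    raw_log = io.BytesIO()
    raw_log.write(message.encode("utf-8"))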


It's also made it hard to port python2 code. I have a dozen-line python2 script I use regularly that a dozen python experts have thrown up their hands at easily porting to python3 - I'll probably just rework it in something else, like rust (not least since I don't want to write it in python anyway).

Then there's https://gregoryszorc.com/blog/2020/01/13/mercurial%27s-journ... where he comments that the design choices of python3 forced them to reimplement a large portion of the core themselves, and that had rust been a bit more mature they probably would just have migrated to that instead.


Ok I feel less bad about struggling to port similar length scripts to Python 3 then


Perl made reasonable choices for unicode. A decade earlier. They are from the same culture and have similar use cases. There was plenty of time to learn.


Perl is a surprising font of well thought out design decisions. It's not a language I would generally recommend using, but oh boy can you learn a lot of things by learning how to use it.


> or latin-1 in a comment, a level of brain damage i've never seen from any other language

I know another one, HLSL – Microsoft’s language widely used to write graphics and compute shaders for Direct3D GPUs.


It seems like what Python needs here is some equivalent of .to_string_lossy(). But that's just a library function, not a big architectural change.


That's spelled `errors='surrogateescape'` but it's a horrible hack and doesn't fix the main lies that strings propagate.


surrogateescape, aka utf-8b, is a brilliant hack, and would have been an acceptable default, eliminating the subtle bugs i'm talking about
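
for example, it round-trips arbitrary bytes through str and back without loss:

    raw = b"caf\xe9 \xff\xfe"                                  # not valid utf-8
    s = raw.decode("utf-8", errors="surrogateescape")          # undecodable bytes become lone surrogates
    assert s.encode("utf-8", errors="surrogateescape") == raw  # lossless round trip
    # this is the same escaping scheme python already uses for file names and
    # environment variables on posix (os.fsdecode / os.fsencode)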


Sounds like you should make friends with latin-1 encoding. All 256 8bit values are valid.


Hard disagree, there's plenty to complain about with python strings but drawing a formal distinction between str and bytes is one of the smartest things they did for the language. It made the transition from 2->3 a huge PITA but it's one of the things that forces you to write better code. You have to actually acknowledge when you're doing an encoding/decoding step and what encoding you expect.

Python3 caught a programming error for you and you're mad about it. The traceback you got was an encoded form (bytes) that you were blindly decoding as ascii when it was in fact UTF-8. You can tell it to truck through with surrogateescape but surely you can agree that it would be insane to make that the default.


this is incorrect, see above


Python3 is really not a great example to copy elsewhere though. By the time Python3 came about it was already clear that UTF-8 encoding is all one ever needs to represent UNICODE strings, and all the other encodings are either historical accidents (like UCS-2 and UTF-16), or only needed at runtime in very specific situations (like UTF-32, but even this is debatable when working with grapheme clusters instead of codepoints).

And with that basic idea that strings are just a different view on a bytestream (e.g. every string is a valid bytestream, but not every bytestream is a valid string) most of the painful python2-to-python3 transition could have been avoided. I really don't know what they've been thinking when the 'obviously right' solution ("UTF-8 everywhere") was right there in plain sight since around the mid-90's.


> And with that basic idea that strings are just a different view on a bytestream (e.g. every string is a valid bytestream, but not every bytestream is a valid string) most of the painful python2-to-python3 transition could have been avoided.

Can you elaborate?

Much of the pain of the transition was figuring out which strings were bytes and which were Unicode data. The actual spelling of the type names never seemed like a big deal to me.

(I do think Python 3 messed some things up. My current favorite peeve is the fact that iterating bytes yields ints. That causes a lot of type confusions to result in digit gobbledygook instead of a useful exception or static checker error.)
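
A tiny illustration of that peeve:

    data = b"abc"
    data[0]                        # 97, an int -- not b"a"
    list(data)                     # [97, 98, 99]
    "".join(str(b) for b in data)  # '979899' -- the digit gobbledygook when code expected characters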


> Much of the pain of the transition was figuring out which strings were bytes and which were Unicode data.

And for a lot of code (that which just passes data around), this shouldn't matter.

It's basically "Schroedinger's strings": you don't need to know if some data is valid string data until you actually need it as a string, and often this isn't needed at all. (IMHO all encodings/decodings should be explicit, not just between bytestreams and strings but also between different string encodings - and those should arguably go into different string types which cannot be assigned directly to each other - e.g. the standard string type should always be UTF-8 only.) Also, file operations should always work on bytestreams (same for the IO functions of the C stdlib, btw).


> It's basically "Schroedinger's strings", you don't need to know if some data is valid string data until you actually need it as a string, and often this isn't needed at all

Then you can pass around an untyped value, which is the default in all versions of Python. With type annotations, one can spell this typing.Any.

When you finally do need your value to be a string, you need to decide whether it’s a runtime error when it needs to be a string or whether it’s a runtime error way up the call stack. Especially if databases are involved (or network calls, etc), this decision matters.

> e.g. the standard string type should always only be UTF-8

It almost kind of sounds like you’re arguing in favor of Python 3’s design, where str is indistinguishable from UTF-8 except insofar as you need to actually ask for bytes (e.g. call encode()) to get the UTF-8 bytes.
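
For example (the default codec in Python 3 is UTF-8):

    s = "héllo"
    s.encode()                  # b'h\xc3\xa9llo' -- encode() defaults to UTF-8
    b"h\xc3\xa9llo".decode()    # back to 'héllo'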

> Also, file operations should always work on bytestreams

So how do you read a line from a text file?

> (same in the IO functions of the C stdlib btw).

Are we talking about the same C? The language where calling gets() at all is a severe security bug, where fgetc returns int, and where fgetwc exists?


> So how do you read a line from a text file?

In that case you need to know upfront how the text file is encoded anyway, since text files don't carry that information around.

If it is a byte-stream encoding from the "ASCII heritage" (like UTF-8, 7-bit ASCII, or codepaged 8-bit "ASCII" - whatever that is actually called): load bytes until you encounter a 0x0A or 0x0D (and skip those when continuing); what has been loaded until then is a line in the text file's encoding. If the original encoding was codepaged 8-bit ASCII you probably want to convert that to UTF-8 next, and for that you also need to know the proper codepage. This isn't needed for 7-bit ASCII, since that is already valid UTF-8: in UTF-8, every byte with the topmost bit cleared is guaranteed to be a standalone 7-bit ASCII character, and every byte with the topmost bit set is part of a multi-byte sequence for codepoints above 127. That's why one can simply iterate byte by byte over a UTF-8 encoded byte stream when looking for 7-bit ASCII characters such as newline and carriage return.
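
A rough sketch of that scheme (byte-level line reading, with decoding left to the caller):

    import io
    def read_line_bytes(stream):
        # read raw bytes up to a newline/carriage return; no text decoding here
        # (a fuller version would also swallow the '\n' of a '\r\n' pair)
        out = bytearray()
        while (b := stream.read(1)):
            if b in (b"\n", b"\r"):
                break
            out += b
        return bytes(out)
    f = io.BytesIO("café au lait\nsecond line\n".encode("utf-8"))
    line = read_line_bytes(f)      # b'caf\xc3\xa9 au lait'
    print(line.decode("utf-8"))    # decoding happens above the IO layer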

The gist is that the file IO functions themselves should never be aware of text encodings, they should only work on bytes. The "text awareness" should happen in higher level code above the IO layer.

> Are we talking about the same C?

What I meant here - but expressed poorly - was that C also got that wrong (or rather the C stdlib, C itself isn't involved). There should be no "text mode IO" in the C stdlib IO functions either, only raw byte IO. And functions like gets(), fgets() etc... shouldn't be in the C stdlib in the first place.


Python 3 actually works approximately the way you’re describing:

https://docs.python.org/3/library/io.html#io.TextIOWrapper

open is just a factory function, conceptually inherited (I think) from C.
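
For example, the text layer is just something you stack on top of any byte stream:

    import io
    raw = io.BytesIO("première ligne\nseconde ligne\n".encode("utf-8"))  # any byte source
    text = io.TextIOWrapper(raw, encoding="utf-8")   # the text layer sits above the byte IO
    text.readline()                                  # 'première ligne\n'
    # open(path, "r", encoding=...) builds essentially this stack for you;
    # open(path, "rb") hands you the byte layer directly.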


Unless you enjoy getting hacked, all strings received from outside sources are bytes.


I think the WebAssembly people have been judicious about features. Watching it evolve has made me feel that they truly respect how important it is to keep things well thought out and as efficient as possible. I feel like it’s in very good hands.


I agree with this assertion. WebAssembly, on its whole, is extremely good.

The string stuff is, IMO, something the group has not come to realize the "right direction" on, but so much has been done right! Hopefully strings can get there too. :)


I guess that is why GC support is now at about 5 years and counting, whereas the CLR has been doing it since 2001, including interoperability with C++.


> right now the thinking about strings for most of the world is very C-brained, very Python 2

Is it? Doesn't pretty much every language have a unicode string type (be that UTF16 in older languages or UTF8 in newer ones) that is the default go-to type for dealing with text these days? C and C++ being the notable exceptions I suppose.


utf-8 strings work fine in c and c++, as they have since utf-8 was introduced; that was the major design objective of utf-8 in fact


They work well as long as you're fine working with bytes. For "characters" which a user sees on the screen, that is, graphemes, you need an entirely new layer. Take some word, e.g. "éclair". How long is it? What are its first three characters? How do you uppercase it?
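
In Python terms, for example (the same questions apply to UTF-8 byte arrays in C/C++):

    import unicodedata
    nfc = "éclair"                            # 'é' as one precomposed code point
    nfd = unicodedata.normalize("NFD", nfc)   # 'e' followed by a combining acute
    len(nfc), len(nfd)    # (6, 7) -- "how long is it" already depends on normalization
    nfd[:3]               # 'e', combining accent, 'c' -- not the first three characters a user sees
    nfc.upper()           # 'ÉCLAIR' -- works here, but uppercasing is locale-dependent in general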


Stuff like this is handled in a (3rd-party) UNICODE library in the C/C++ world, which should ideally work on UTF-8 encoded byte arrays, provided by another (3rd-party) UTF-8 encoding/decoding library.

Other than that high-level UNICODE stuff (like finding grapheme cluster boundaries), UTF-8 itself really works fine in C/C++ anywhere other than Windows, in the sense that I can write a foreign-language "Hello World!" and it "just works" (e.g. if the whole source file is UTF-8 encoded anyway, then C string literals are also automatically valid UTF-8 strings).

UNICODE on Windows is still a bigger mess than it should be because of its UCS-2 / UTF-16 heritage.


Stuff gets really nasty when you start trying to reason about case insensitive string comparisons. For example, the following might return something different depending on what your locale is:

"π".localeCompare("Π", undefined, { sensitivity: "accent" })

My machine says these are equal, but I've seen cases where network stacks consider domains as different if they feature the same Greek letter in a different case even though domain names are supposed to be case-insensitive.


To answer all of these, you need `libicu`, not just a mere UTF-8 decoder. Java also doesn't include full facilities: its `BreakIterator` is based on an ancient Unicode version.


those are library functions, and they work fine on utf-8 strings, though graphemes in particular are difficult and context-dependent in unicode in a way that is exactly the same in c and in java


They are library functions for which a good library does not exist. I recently needed to convert probably-UTF-8 data to definitely valid UTF-8 with errors replaced. This was not an enjoyable experience in C++.

(The ztd proposal is IMO a big step in the right direction.)


C++ has had std::u8string, std::u16string and std::u32string since C++11


Very early in my career, I said something about strings and a more experienced programmer said "that's because you think a string is an array of bytes terminated with a \0". Absolute lightbulb moment for me, and not just about strings.


It pains me to see people inventing opcodes for operations with unbounded execution times. WebAssembly is a sandboxed runtime first and foremost, and part of a sandbox's security is the ability to limit resource usage. I don't want untrusted user code to DoS my WASM engine, and such opcodes are the perfect vector for this kind of attack. Lua made this mistake with "string.find" [1], and I wish the WASM committee would not repeat it with GC and stringref.

[1] http://lua-users.org/lists/lua-l/2011-02/msg01595.html


I don't think this is really a problem that the WASM standard needs to be too concerned about. Execution engines themselves should provide ways to preempt execution. I'm more concerned with these proposals that they are introducing a lot of complexity to implementing WASM.


> Execution engines themselves should provide ways to preempt execution.

Yes, and that preemption occurs at the opcode level. Which is completely defeated if your string matching opcode can hang for an inordinate amount of time because the user supplies a very large string and/or crafts string inputs that force O(n^2) or worse behaviour.


I implemented a WASM interpreter recently. Didn't know WASM when I started. GC wasn't particularly difficult. Type canonicalization was the trickiest bit. Did a small subset of stringref, too. If there's anything I'm intimidated by with WASM implementation it's all the SIMD instructions.


Requiem for hacks upon hacks all the way down the stack to make historic brain farts work.

A sensible way forward would be to deprecate APIs for direct UTF-16 code unit access, but implement them for backward compatibility on top of an internal UTF-8 representation. On both sides.

We've lived with JS bloat for two decades, you think a few string copies/conversions are gonna kill us? Any non-toy uses of WebAssembly are gonna be new developments. Old shit that nobody is gonna run on WebAssembly other than to go "yeah, huh, it runs on WebAssembly" and then never use it again, doesn't need to run great.


A few copies aren't gonna kill us, but this very well might, if charAt() is now O(n):

    for (int i = 0; i < str.length(); i++)
        doSomethingWith(str.charAt(i));
UTF-8 is just a text encoding. If you're making something new, yeah, it's the obvious choice, but it's not better enough to justify breaking all sorts of shit just to switch over.


That code makes zero sense in Unicode. First question: how are you representing your umlauts? Followed by Zero Width Joiner characters.

You never work on characters, you work on grapheme clusters or whatnot but never characters.


I'm not advocating you write this, I'm saying people have written it, probably hundreds of thousands of times, and if charAt() becomes O(n) instead of O(1), this code suddenly hangs your CPU for 10 seconds on a long string, thus you can't really swap out UTF-16 for UTF-8 transparently.


Your point doesn't stand for UTF-16 either. It's not a fixed length encoding either. It's broken in UTF-16 as well.

It was always O(n).

Of course assuming you aren't using UTF-32, which has its own set of problems (BE or LE), and sees little usage outside of China.


...it's not O(n). Many languages, JS, Java and C# included, have O(1) access to a character at a given position. You correctly note that it won't work well with international strings, but GP is right that A LOT of code like this was written by western ASCII-brained developers.


Haven't used Java in a while but I believe charAt() returns a UTF-16 codepoint and is constant time access. So something like the above works not only for ASCII but also for the majority of Western languages and special characters you may encounter on a day-to-day basis.


It's constant time iff you ignore surrogate pairs and Unicode. By that logic UTF8 is constant time if you ignore anything not ASCII because most text is in English.

Saying it works fine if you ignore errors and avoid edge cases is just a clever rephrasing of "it worked on my machine".

Plus emojis are Unicode U+1F600 and above, so even in Western languages you are bound to find such "exceptions".
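
For example (Python here just to count the units):

    s = "😀"                              # U+1F600, outside the BMP
    len(s)                                # 1 code point
    len(s.encode("utf-16-le")) // 2       # 2 UTF-16 code units -- a surrogate pair
    len(s.encode("utf-8"))                # 4 UTF-8 bytes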


In practice you often do because it's common to parse text that has "special" characters which are always ASCII. Think CSV, XML, JSON, source code, that sort of thing. These formats may have non-ASCII characters in them in places, but it's still a very common task to work with indexes into the string and the "character" at that index, which works fine because in practice that character is known to always be a single code unit.


I've found in that case it's much easier to just operate on raw bytes, then transform those into UTF characters. It works trivially for UTF8 and needs some massaging for UTF16 and UTF32 because BE/LE.
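
For example, splitting on an ASCII delimiter is safe directly on the UTF-8 bytes, because every byte of a multi-byte sequence has its high bit set and can never collide with b',':

    row = "Zoë,Łódź,naïve".encode("utf-8")
    fields = [f.decode("utf-8") for f in row.split(b",")]
    # ['Zoë', 'Łódź', 'naïve'] -- decode each field only at the end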


You’re correct about algorithms that do “human” things with text, but you need to think of more examples.

That’s how you write hashing algorithms, checksums, and certain trivial parsers.[0]

But most importantly, right or wrong, this code is out there, running today, god knows where, and you do not slow it down from O(n) to O(n^2).


Is such code really going to be ported to WASM though? And does it really matter for the string lengths that a typical web application has to process? WASM really doesn't have to worry about legacy that much.


Hashing algorithms and checksums work on bytes, not characters.


Here is the JDK 7 String#hashCode(), which operates on characters: https://github.com/openjdk-mirror/jdk7u-jdk/blob/f4d80957e89....

That's changed in the newer versions, because String has a `byte[]` not a `char[]`, but it was just fine. A hash algorithm can take in bytes, characters, ints, it doesn't matter.

In Java, you don't get access to the bytes that make up a string, to preserve the string's immutability. So for many operations where you might operate on bytes in a lower level language, you end up using characters (unless you're the standard library, and you can finagle access to the bytes), or alternately doing a byte copy of the entire string.

I admit, checksums using characters are a bit weird sounding, but they should also be perfectly well-defined.
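
A rough sketch of what I mean (a 31-based polynomial hash in the style of Java's; the point is only that the unit can be code points or bytes, as long as you pick one and stick with it):

    def poly_hash(units):
        h = 0
        for u in units:
            h = (31 * h + u) & 0xFFFFFFFF   # keep it in 32 bits, Java-style
        return h
    s = "éclair"
    poly_hash(map(ord, s))         # hash over code points
    poly_hash(s.encode("utf-8"))   # hash over UTF-8 bytes -- different, but equally well-defined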


A possible optimization would be to change the internal representation on the fly for long-ish strings as soon as random accesses are observed. Guidance from experiments would be required to tell where the right thresholds are. Also, JavaScript implementations already do internal conversions between string representations.


A copy for every string passed to the DOM API, to name just one thing, will be a significant limiting factor.


There are some Rust front-end web frameworks that presumably manipulate the DOM, and in C++/Rust to pass a string to JS you need to run a TextDecoder over your WASM memory, so it's probably not a deal breaker.

But like... if you're writing a website, just use JavaScript.


C++/Rust use WASM linear memory, but this article is about reference types via WASM GC. UTF-8 data in an (array i8) or UTF-16 data in an (array i16) are opaque to the host.


Yeah, and you still have to marshal strings at the JS/WASM boundary, same as if you used (array i8/16) over JS strings in Java.

In the case of non-managed strings, this overhead hasn't been big enough to stop people from writing fairly fast (by Web standards) frontend frameworks in Rust.


The amount of memory and CPU overhead involved in sending strings across the wasm/JS boundary to do something like put text in a textarea is a lot bigger than you might think. It's really severe.


> Any non-toy uses of WebAssembly are gonna be new developments.

Major uses of WebAssembly include things like Photoshop, Unity, and many other large existing codebases.


Just because you can't get rid of UTF-16 doesn't mean you can't let people opt-in to UTF-8 string semantics. Just like there's a `use strict` pragma, there could be a `use utf-8` pragma.


I think it's very different: "use strict" is strictly local (function or script scoped), "use utf-8" would require the entire JavaScript context to cooperate. It means you can't safely use a library that expects UTF-16 in an app with "use utf-8". E.g. you can't include Google Analytics in your "use utf-8" web page.


> E.g. you can't include Google Analytics in your "use utf-8" web page.

Well, what are we waiting for, then.


Java and JavaScript being high-level languages, it’s easy to switch the internal representation of strings.

In fact, the JVM already has moved to a mix of ISO-8859-1/Latin-1 and UTF-16 (https://openjdk.org/jeps/254), and I expect many performant JavaScript implementations also do something in that direction.


According to the article they don't, actually? Apparently they're thinking about it but aren't sure if it's worth it. For Java it was largely because it reduced time spent in GC (less memory usage = less frequent need to collect).


That was the enhancement proposal targeting Java 9 which came out about 6 years ago.

For Java it’s a good saving because it reduces the overall heap size, and if JS has a similar distribution of objects then it should work well there as well. It may already do so, the internal storage format of things like strings and arrays is deliberately opaque.


Java and JavaScript are actually hampered in that regard because they have to pretend that the encoding is UTF-16. Thus the limitation to Latin-1. With UTF-8, seeking in the middle of the string would be harder.


I think seeking into the middle of strings, as opposed to iterating over them from the start, is rare in most code.

If so, using UTF-8 and only converting to UTF-16 the moment such seeking happens may be beneficial.

Problem, however, is that Java and JavaScript have C-style for loops that give false positives, where the code indexes into the string in order to iterate over it.


The conversion is required to properly support indexing for any index != 0. Optimizations are only possible if iterator-style APIs are used so the runtime can iterate as well. However, it might be still more efficient to convert the whole string and be done with it, depending on its length. Languages with a proper WASM backend could offer optimized runtime libraries and/or optimize such code.

The issue is not new. JavaScript runtimes frequently use multiple optimized string types for various situations:

https://github.com/danbev/learning-v8/blob/master/notes/stri...


> Optimizations are only possible if iterator-style APIs are used

They make it simpler, but it also is possible to detect that loops are iteration-style access in disguise.

That takes time, so it's more likely to happen in ahead-of-time compiled languages.

C compilers can vectorize some of such loops, so they have logic for doing that.


I really have trouble understanding the push to provide higher-level features as intrinsic types and instructions. AIUI, the main alternative to stringref (linked in the article: https://github.com/WebAssembly/js-string-builtins/blob/main/...) is to provide string types via WebAssembly's normal "import" mechanism. This alternative makes more sense to me.

When you write extension modules for languages like Python, Ruby, PHP, or even JavaScript (via V8 or JavaScriptCore), you are always importing APIs like strings (eg. #include <ruby.h>). It seems natural to me that WebAssembly would be the same way.

Now ideally a WebAssembly module wouldn't be specific to one particular embedding environment. You would want to standardize this string API, so that a single API can be used efficiently with multiple host languages. So you'd want something more like #include <wasm/wtf16string.h>. But if it's an import, it also leaves open the possibility that such an API can evolve over time, or that a competing API could supplant it as the landscape of language engines changes. By putting it directly into the instruction set, you'd be baking in assumptions about how JS engines in 2023 function, even though those assumptions can (and hopefully will) change over time.

It's true that wasm/wtf16string.h probably won't be available in environments that don't have JS-style strings. But that seems for the best. I'd rather have a leaner WebAssembly than make every WebAssembly engine provide JS-style strings. In the case that the host environment doesn't already have a JS-style string that you are trying to interoperate with, why not just have Java implement its string type directly in WebAssembly? Or once shared libraries are available, wasm/wtf16string.h could be provided via some third-party library, shared across multiple modules, leaving open the possibility of passing strings around between modules.

I guess, to the article's point, why shouldn't memcpy() be an import also? What is gained by making it part of the instruction set? I assume that a WebAssembly implementation could recognize special imports, and optimize them the same as if it were a built-in instruction.


> I really do not understand the push to provide higher-level features as intrinsic types and instructions

As mentioned in the article, the purpose of WebAssembly is not to provide the lowest level instruction support possible but to be a good compilation target. Not having a string type makes targeting WASM GC more complicated and the runtime less efficient.

> By putting it directly into the instruction set, you'd be baking in assumptions about how JS engines in 2023 function, even though those assumptions can (and hopefully will) change over time.

The stringref proposal does a good job of balancing the world we exist in with the "right thing", IMO. I implemented a subset of it for an interpreter hosted in Guile Scheme, using Guile's string type which is very unlike Java/JavaScript (not UTF-16), and it works well. I'd like to see the proposal get more support.


I don't understand how the counter-proposal makes WebAssembly a worse compilation target.

Take the simple example of memory.copy. How is it worse to compile memcpy() to import+memcpy() rather than memory.copy?

Why can't a string type be efficiently GC'd via an imported type?


To work with GC, you need some way to track if the GC'd object is accessible in WASM itself. You can't just have gc.release $addr because then you need to introduce a check everywhere you try and do something with $addr in WASM as WASM is supposed to be memory safe.

The reason why you probably need a custom string type is so you can actually embed string literals without relying on interop with the environment. If a WASM module tries to simulate this by having an initialization function that constructs all your string constants in linear memory or something, I could see that getting pretty expensive and/or difficult to optimize.


> To work with GC, you need some way to track if the GC'd object is accessible in WASM itself.

I've never heard of a GC with that kind of API. Usually any native code that holds a GC reference would either mark that reference as a root explicitly (eg. https://github.com/WebAssembly/design/issues/1459) or ensure that it can be traced from a parent object. Either way, this should prevent collection of the object for as long as the reference is reachable from a root. I agree that explicitly checking whether a GC'd object has been freed would not make any sense.

> The reason why you probably need a custom string type is so you can actually embed string literals without relying on interop with the environment.

WASM already has ways of embedding flat string data. This can be materialized into GC/heap objects at module startup. This must happen in some form anyway, as all GC-able objects must be registered with the GC upon creation, for them to be discoverable as candidates for collection.

Overall I still don't understand the issue. There is so much prior art for these patterns in native extensions for Python, PHP, Ruby, etc.


I think string builtins seem to be the direction the wasm committee is moving towards anyways? Unless I'm mistaken


Yeah, the JS String Builtins proposal is an alternative to stringref. Having experienced some of both, I think that stringref is much better.


Yeah, builtins seem like a bit of a specialized case. If I'm not mistaken, it doesn't even really give a Wasm-side method to create strings? Bit of a shame.


Right, inside the wasm module you're on your own and you use (array i8). Alternatively, you could decide that your strings are the host's strings and use (ref extern), but performance within the wasm module suffers because every string operation is a host call. The least worst option right now is (array i8) internally and copying whenever they cross the guest/host boundary.


I believe it's mainly because of the requirements for

(1) integration with a single GC (the one in the browser), and

(2) zero-copy inter-op with JavaScript strings.

If you take them as axioms, those pretty much force you to put strings as a native type in WASM. They can't really be done as libraries.

Not sure if it helps, but I wrote about my experience writing a GC here, and the "reality sandwich" metaphor:

https://www.oilshell.org/blog/2023/01/garbage-collector.html

https://lobste.rs/s/v5ferx/scheme_browser_hoot_tale#c_n87bzw

https://lobste.rs/s/v5ferx/scheme_browser_hoot_tale#c_mw6tfx

The sandwich has:

(1) the mutator's view on one side

(2) the single bit representation of data types in the middle

(3) the GC's view on the other side

Basically a bunch of people thought that you could just have one half of the sandwich -- you could just have the GC, and use whatever data types you want.

Implementing a GC will disabuse you of that notion (ESPECIALLY a GC that runs remote, untrusted code, but it's true for any GC.)

The core data types and the GC are tightly coupled, and the GC already lives in the browser.

So WASM necessarily gives you the WHOLE sandwich -- both data types and the GC. Fundamentally, it can't do anything else.

But that means that the data types are a huge design compromise -- there are winners and losers. Now you have a MAPPING problem from every language to WASM types.

---

As a concrete example, I mentioned that Go has slices, which are reference types, and they have pointers to anywhere in a string.

This means the GC has to be able to find the head of a string from an interior pointer, which is hard in general.

So, without being very close to WASM GC, I suspect Go will be a loser in this respect (it will perform less well), simply because JS GC's don't have to deal with interior pointers.

Every design decision involves winners and losers -- "universal VM" is a bit of a fallacy.

---

Strings are another area where there are winners and losers -- UTF-16 means JS/Java perform better, but UTF-8 means other languages perform better.

Although this stringref proposal is interesting because it tries not to be biased -- it tries to develop an API that can be implemented with either UTF-8 or UTF-16 representations. Still, it's a very hard problem and involves deep compromises.


> They can't really be done as libraries.

It's definitely the case that you can't implement a string type in WASM and have it interoperate with JavaScript strings.

But there is no reason that JavaScript strings can't be a "special" kind of import that provides functionality implemented by the engine. In other words, #include <wasm/wtf16string.h> could provide an API for the built-in JavaScript string, much like #include <ruby.h> provides access to the native Ruby string in Ruby extensions.

I believe this is how the counter-proposal works: https://github.com/WebAssembly/js-string-builtins/blob/main/...

For engines that do not have a built-in JavaScript-like string type, the wasm/wtf16string.h library could be implemented in WASM as a polyfill. In that case, there is no native string to interoperate with, so no disadvantage to implementing directly in WASM (except perhaps a bit of speed penalty).


What I'm reading there is that it's less ambitious -- it would only work for languages that have JavaScript string semantics.

> Some languages targeting WebAssembly may have compatible primitives and would benefit from being able to use the equivalent JavaScript primitive for their implementation

I believe Java fits that bill (it's UTF-16, at least). So if you were writing a JVM to work on WASM, then you could possibly just import the JS string and use it, and get zero-copy inter-op.

---

But what the Wingo article is saying is they want something even more ambitious: have a working Scheme or Python or Lua implementation with zero-copy JS inter-op.

If they can paper over the bytes vs. UTF-16 vs. code points API issue, making a single string type that's not biased toward one or the other, I think it's a step in the right direction. (The aside about decoupling JS itself from UTF-16 string APIs is pretty intriguing too!)

But it's still not obvious to me that even this gets you all the way there. For example, Python interns some strings, and Lua apparently interns ALL strings, and they both have hash codes in the string object, and expose it to users. (Even changes to the hash function may break some programs)

Unique object IDs in many languages are another issue -- the JVM apparently uses space for them in the header of every object.

So you still may have more compromises about how to implement the WASM string, that produces winners and losers. The only way to know how the compromises pan out is to try, and I'm glad people are doing that.

---

So anyway, I'm not very close to WASM, but I do see a pretty clear difference between the proposals. I think it depends on what the goals are -- is "running Python or Scheme with zero-copy JS string interop" a goal?

I guess people want to manipulate the DOM directly from Scheme or Python, which seems reasonable. If you have to copy all the strings, that will be a huge drag. DOM operations are already hugely expensive and have tons of objects. It seems like the proposal you linked doesn't address that issue.


In other news, WebAssembly now has garbage collection. From the article: “The GC support gives you the ability to define a number of different kinds of aggregate data types: structs (records), arrays, and functions-as-values.”

Happy to hear this.


Well, "now" and "has" ...

There is a GC proposal: https://webassembly.github.io/gc/core/

V8 implemented it some months ago (Chrome and Node, and I guess Deno, with an experimental switch to enable support). WasmEdge isn't there yet: https://github.com/WasmEdge/WasmEdge/issues/1122#issuecommen... WasmTime finished the RFC for the implementation details in June: https://github.com/bytecodealliance/wasmtime/issues/5032


Yeah the non-web runtimes are playing catch-up with GC, but the upcoming Firefox 120 and Chrome 119 releases will have GC enabled by default. Not sure what state Safari is in but presumably not far behind. WASM GC should be usable in all major browsers by the end of the year.


> Java (and JavaScript) is outdated: if you were designing them today, their strings would not be UTF-16.

Except that UTF-16 makes a lot of sense on Windows, which won’t change anytime soon.

Since you always have to deal with noncharacters, initial vs. non-initial BOMs, isolated combining characters, etc., and you have to validate your inputs anyway (meaning you almost always need a failure path for unvalidated strings anyway), I’m not sure if (unpaired) surrogates constitute that much more of a complication.


It probably won't change soon, but Microsoft (or at least some teams in it) have acknowledged the mistake of UTF-16, AND they have taken some steps toward UTF-8:

http://www.oilshell.org/blog/2023/06/surrogate-pair.html#fut...


I am a bit in need of help here. Why do strings matter to WebAssembly? Shouldn't it be possible to implement any string type on top of it? Is it really required for it to have some special handling for strings?

I am used to C and C++, so I guess I never thought of wasm as a runtime like C# + .NET, so any help is appreciated.


It is possible to implement your own strings by only using WASM primitives. There are two, as I see it, major reasons for including specialized support for it anyway:

1. One of the biggest WASM runtimes is the browser, and so it's likely that one would want to exchange strings between WASM and JS/DOM _efficiently_

2. The article states that most modern languages, if compiled to WASM, would likely need a lot of functionality in order to deal with unicode correctly (which was how I interpreted the reference to libICU), which again would greatly increase the size of WASM blobs for those languages. Having stuff like that built in would mitigate that.


Grapheme clusters were mentioned. Grapheme clusters should not be handled at the string layer. It's not possible to know what makes a grapheme cluster without knowing the font that is rendering it.

>= can be 2 grapheme clusters in some fonts and 1 grapheme cluster in other fonts which show it as ≥


No, you’re thinking of glyphs, not grapheme clusters.


From the Unicode standard

    Display of Grapheme Clusters. Grapheme clusters are not the same as ligatures. For example, the grapheme cluster “ch” in Slovak is not normally a ligature and, conversely, the ligature “fi” is not a grapheme cluster. Default grapheme clusters do not necessarily reflect text display. For example, the sequence <f, i> may be displayed as a single glyph on the screen, but would still be two grapheme clusters.


Right, so "f" and "i" are two separate grapheme clusters (and always are), but might map to one rendered glyph under certain fonts that combine them into a ligature. Grapheme clusters have a specific definition that doesn't depend on the font in use. That definition is updated over time as the Unicode standard is updated, which means it can vary over time, but it does not change based on the font.


For anyone else who's also struggling to find that footnote, search for "when I mention UTF-16".

It's a fake.

I do wish they had used some symbol that can actually be pasted into the "Find" dialog, rather than being normalized into the number 1, which matches at least 75 (!) times.


There are a few use cases here and I personally think only one of them is really important in the sense that it would provide a lot of value for many people.

What JS in general needs is a 'StringView' type, which is most useful for asm.js/wasm scenarios, but is useful for other scenarios too. You could define it for UTF16 only, or define it to work with both UTF8 and UTF16. Why?

Well, a very common use case for asm.js/wasm scenarios is that you have a big wasm application that needs to chat with things outside of the wasm sandbox. The vast majority of APIs available to a wasm application are browser APIs, which communicate using objects (we have a solution for those, externref), strings, doubles (native wasm support), i32s (semi-native support - v8 only supports 31-bit integers, but it's usually good enough) and booleans (just use i31s).

Right now if you want to send a string from wasm to a browser API, you need the help of some JS glue to construct a brand new JS string on demand from bytes living in the wasm heap. This glue is Not Fast, and it becomes increasingly Not Fast the larger your strings are. If you have a great many strings - for example, text that you want to cram into textareas or spans inside of a table in the DOM - you will waste a lot of time doing this. And then if you want to transfer strings back into wasm, you have to do the reverse, painstakingly copying the string back into the wasm heap one character at a time.

If you're lucky, the browser has native APIs that accelerate this process by decoding/encoding UTF16, but those APIs have limitations - for example, TextDecoder does not support SharedArrayBuffer, so enabling multithreading in your wasm application will instantly make it slower to send strings across the js/wasm boundary. Cool.

The widely used wasm platform I work on (.NET / Blazor) does a lot of sending strings across the js/wasm boundary, so we've had to jump through some hoops to optimize this as much as possible. For example, we maintain an 'interned string' table on both sides of the boundary so that commonly reused strings like method names or enum value names don't have to get encoded/decoded each time they cross. This significantly increases memory usage and means we have to jump through hoops to avoid using too much memory, but the advantage in terms of performance is measurable, upwards of 10% per boundary crossing even for small strings. A typical application has lots of grids, tables and lists filled with string data, so it spends a sizable amount of time doing all this string encoding/decoding. Incidentally this interning table is harder to implement efficiently because nobody is willing to expose a getHashCode equivalent for JS types or even specifically for JS strings, but I can understand that decision at least...

If you had a StringView type, you could basically go "here's an ArrayView, I promise it contains WTF16 (or UTF16) data, and I promise I won't mutate it. You can make a defensive copy if you want. Please turn it into something string-like so I can use it with DOM properties like textContent or APIs like WebSocket, thanks." JS runtimes already have support under the hood for diverse string storage formats, whether it's ropes, latin1 buffers, or utf16 buffers, so this would just be a new one of those.

Maybe we'll get one before 2030. Not holding out hope though, this was a known need at the very beginning of the WebAssembly spec process and I doubt anyone has forgotten about it, it just doesn't seem to be considered that important compared to features like "having multiple distinct address spaces for some reason" and "64-bit address spaces that probably won't be available in web browsers for 20 years because Chrome on Android is still 32-bit"



