So, the problem is that AssemblyScript wants to keep using UTF-16? I'm not sure I understand.
Is AssemblyScript the thing that lets you hand-write WebAsm?
I'm confused about why they can't just switch their (nascent) language to UTF-8, and, if they can, why the alarmist attitude? I didn't think they were mature enough to claim no breaking changes, for example.
I probably prefer we drag the web (and .Net and Java) platforms towards UTF-8, to be honest… but maybe that’s just me.
P.S. the web will never switch to UTF-8. It would break too many web pages. Most browser vendors won't even accept breaking 0.1% of web pages, unless they're doing it to show you more ads (i.e. Chrome).
The problem being discussed is about runtime interoperability between JS (with WTF-16 string format) and WebAssembly.
Python has several internal string representations to reduce conversions.
Really "UTF-16 string" and "UTF-8 string" are code smells. Applications should be using character/code point sequences or byte sequences.
Using code unit sequences is... bizarre. (Yes, I know Java, C#, and JS have chosen to do that, but a new language has an opportunity to improve.)
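For illustration, this is what code-unit-based strings look like in practice in today's JS/TS:

```
// .length counts UTF-16 code units, not code points or "characters"
const s = "🙂";                     // one code point (U+1F642), outside the BMP
console.log(s.length);              // 2 (a surrogate pair: two code units)
console.log([...s].length);         // 1 (string iteration yields code points)
console.log(s.codePointAt(0)!.toString(16)); // "1f642"
```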
How would ```let foo: string = "whatever"``` be able to work in any sense similar to TS/JS, and how can that map to multiple string types? The idea is that both AS and TS use the same syntax for strings and are compatible across boundaries (TS for the JS side, AS for the Wasm side). Having multiple string types is possible, but it would greatly reduce developer ergonomics.
If it's a language, then it gets to decide what ```let foo: string = "whatever"``` means.
If it's a compiler for an existing language, then semantics have already been decided, and the compiler has to implement it.
But none of that precludes abstraction to reduce data type conversions.
I'm not sure what type of compiler to label that as, but that's the goal.
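As a rough sketch of what such an abstraction could look like (entirely hypothetical; the `InteropString` name and the caching scheme are made up for illustration, not anything AS does today): a string handle that keeps whichever encodings have already been materialised, so repeated boundary crossings don't re-convert every time.

```
// Hypothetical sketch: a string handle that lazily caches both encodings.
class InteropString {
  private utf16?: string;     // JS/WTF-16 representation
  private utf8?: Uint8Array;  // UTF-8 representation

  constructor(value: string | Uint8Array) {
    if (typeof value === "string") this.utf16 = value;
    else this.utf8 = value;
  }

  asJS(): string {
    if (this.utf16 === undefined) {
      this.utf16 = new TextDecoder().decode(this.utf8!);
    }
    return this.utf16;
  }

  asUtf8(): Uint8Array {
    if (this.utf8 === undefined) {
      // Note: any lone surrogates get replaced with U+FFFD here.
      this.utf8 = new TextEncoder().encode(this.utf16!);
    }
    return this.utf8;
  }
}
```

Whether a compiler can hide something like this behind a plain `string` type without hurting ergonomics is, of course, the hard part.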
A compiler. From TypeScript to WebAssembly.
Certainly it has its work cut out for it to have TS semantics (that is, JS semantics) and be optimal when targeting WebAssembly.
I can't think of any scripting languages that are optimal for targeting low-level runtimes.
I wish AssemblyScript the best; seems like a hard problem.
I sympathize with the pain, but the bleeding edge of tech does... bleed.
How does a compiler ensure that when that string is passed to a Rust Wasm module it goes to it in UTF-8 and then when moments later the same string is passed by the same module to JS it goes over as WTF-16?
How will the compiler know where the string is being passed after compilation (at runtime)?
What new syntax would you propose for TypeScript to make it possible to work with all strings types? How would you keep TS/JS developer ergonomics up to par with what currently exists?
If Interface Types were to consider the web a first-class citizen (because Wasm originated as a web feature), then interop between Wasm modules and JS would be considered of utmost importance, without making a web language (such as AssemblyScript) go to great lengths to engineer around the aforementioned complication.
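For context, here is roughly the glue that has to run today whenever a string crosses a JS ↔ UTF-8 Wasm boundary (a simplified, hand-written sketch; the `alloc` and `memory` exports are assumed, and toolchains like wasm-bindgen generate something along these lines). The questions above boil down to who emits this, and how it knows which side it is talking to:

```
// Simplified sketch of JS-side glue for a Wasm module that expects UTF-8.
declare const instance: WebAssembly.Instance; // assumed instantiated elsewhere

function passStringToWasm(s: string): { ptr: number; len: number } {
  const bytes = new TextEncoder().encode(s); // WTF-16 string -> UTF-8 bytes (copy)
  const alloc = instance.exports.alloc as (n: number) => number;
  const memory = instance.exports.memory as WebAssembly.Memory;
  const ptr = alloc(bytes.length);
  new Uint8Array(memory.buffer, ptr, bytes.length).set(bytes);
  return { ptr, len: bytes.length };
}

function readStringFromWasm(ptr: number, len: number): string {
  const memory = instance.exports.memory as WebAssembly.Memory;
  const view = new Uint8Array(memory.buffer, ptr, len);
  return new TextDecoder().decode(view); // UTF-8 bytes -> WTF-16 string (copy)
}
```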
For FFI there's nothing a compiler can do. That's why FFI is unsafe and restricted to rudimentary types in most languages - it's up to the caller to ensure the data is laid out as the callee expects.
I also don't know what interface types have to do with anything. Wasm is far lower level than interfaces, and nothing is stopping you from implementing interfaces in your language and doing automatic type conversion through them to handle string representations as required.
Look past the web for a moment - wasm is a competitor with the JVM, GraalVM, and LLVM as a platform and implementation independent byte code. Think about how your language would be implemented on those targets before the web.
If you want to chime in or get more context, here are some relevant issues:
This announcement is deliberately phrased to scare people who do not have sufficient context. I don't know why some AssemblyScript maintainers have decided to act in this extreme way over what is quite a niche issue. The vote that this announcement is sounding the alarm over is _not_ a vote on whether UTF-16 should be supported.
There has been a longstanding debate as part of the Wasm interface types proposal regarding whether UTF-8 should be privileged as a canonical string representation. Recently, we have moved in the direction of supporting both UTF-8 and UTF-16, although a vote to confirm this is still pending (but I personally believe would pass uncontroversially).
AS does not need to radically alter its string representation - if we were to support UTF-16 with sanitisation, they could simply document that their potentially invalid UTF-16 strings will be sanitised when passed between components. Note that the component model is actually still being specified, so this design choice doesn't even affect any currently existing AS code. I interpret the announcement's threat of radical change as some maintainers holding AS hostage over the (again, very niche) string sanitisation issue, which is frankly pretty poor behaviour.
You previously posted yourself that documenting sanitisation at the component boundary would be an acceptable solution: (https://web.archive.org/web/20210726140105if_/https://github...).
I don't understand why you have so radically changed your opinion since then.
For fairness, I will link below to your concrete example of "corruption", noting that you claim it will render Wasm "the biggest security disaster man ever created for everything that uses or opted to preserve the semantics of 16-bit Unicode".
I'd argue that the fundamental bug here is in splitting a string in between the two code units which make up an emoji, creating isolated surrogates. This kind of mistake is common and can already cause logic and display errors in other parts of the code (e.g. for languages with non-BMP characters), independent of whether components are involved (again, I emphasise that no code using components has been written yet).
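For readers without the context, this is the kind of split being discussed, in plain JS/TS with no components involved:

```
// An emoji outside the BMP is a single code point but two UTF-16 code units.
const s = "a🙂b";
const left = s.slice(0, 2);   // "a" plus a lone high surrogate (\uD83D)
const right = s.slice(2);     // lone low surrogate (\uDE42) plus "b"

console.log(left.charCodeAt(1).toString(16)); // "d83d"

// The lone surrogate already breaks things in plain JS:
try {
  encodeURIComponent(left);   // throws URIError
} catch (e) {
  console.log(String(e));
}
```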
EDIT: I should also note that if it becomes necessary to transfer raw/invalid code points between components, the fallback of the `list u8` or `list u16` interface type always exists, although I acknowledge that the ergonomics may not be ideal, especially prior to adaptor functions existing.
Here's Linus Torvalds explaining it better than I could: https://youtu.be/Pzl1B7nB9Kc?t=263
And sure, you can transfer a string that someone else does not consider a string using alternative mechanisms, but then you are only not doing anything wrong because you are not doing it at all, for entire categories of languages. There is no integration story for these, and once one mixes in optimizations like compact strings, or has multiple encodings under the hood, one cannot statically annotate the appropriate type anyhow. And sadly, adapter functions won't help either when the fundamental 'char' type backing the 'string' type is already unable to represent your language's strings.
I also do not understand where the idea that a single language always lives in a single component comes from. Certainly not from npm, NuGet, Maven or DLLs.
Extended this post to provide additional relevant context. It's not a bug, it's a feature.
EDIT: since a whole other paragraph was edited in as I replied, I will respond by saying that within a component, your string can have whatever invalid representation you want. Most written code will naturally be a single component (which could even be made up of both JS and Wasm composed together through the JS API). The code may interface with other components, and this discussion is purely about what enforcement is appropriate at that boundary.
EDIT2: please consider a further reply to my post, rather than repeatedly editing your parent post in response. It is disorientating for observers. In any case, my paragraph above did not claim that there will be one component per language, but that the code _one writes oneself_ within a single language (or a collection of languages/libraries which can be tightly coupled through an API/build system) will naturally form one component.
I appreciate that this is exactly the point where we currently disagree, and accept that I won't be able to convince you here. However, the AS website's announcement did not make the boundaries of the debate clear.
We must get rid of legacy encodings no matter the cost, I'm tired of seeing Java and Qt apps wasting millions of CPU cycles mindlessly converting stuff back and forth from UTF-16. It's plain madness, and sometimes you just need the courage to destroy everything and start again.
UTF-8 is a great hack that works wonderfully on Linux and BSD, because neither actually supported internationalisation properly until recently. They clung to 8-bit ASCII with white knuckles until they could bear it no longer, but then UTF-8 came to the rescue and there was much rejoicing. "It's the inevitable future!" cried millions of Linux devs... in English. I mention this because UTF-8 is a bit... shit... if you're from Asia.
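To put numbers on it (any JS console will do): most CJK and kana characters take 3 bytes in UTF-8 versus 2 bytes in UTF-16.

```
const s = "日本語のテキスト";                        // 8 BMP characters
console.log(new TextEncoder().encode(s).length);  // 24 bytes as UTF-8
console.log(s.length * 2);                        // 16 bytes as UTF-16
```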
Meanwhile, in the other universe, UCS-2 or UTF-16 have been around for forever because in that Universe people do things for money and had to take internationalisation seriously. Not just recently, but decades ago. Before some Linux developers were born. In this Universe, an ungodly amount of Real Important Code was written by Big Business and Big Government. The type of code that processes trillions of dollars, not the type used to call MySQL unreliably from some Python ML bullshit running in a container or whatever the kids are doing these days.
So, yes. Clearly UTF-16 has to "die" because it's inconvenient for C developers who never figured out how to deal with strings in more than one encoding.
PS: There are several Unicode compression formats that blow UTF-8 out of the water if used in the right way. If you can support those, then you can support UTF-16. If you can't, then you can't claim that you chose UTF-8 because you care about performance.
The needs of all the different WASM consumers also creates tension here. A C# programmer trying to ship a webapp has very different needs from a C programmer trying to run WASM on a cloudflare edge node, and you can't really satisfy both of them, so you end up having to tell one of them to go take a walk into the sea.
In short, re-encoding strings at the component boundary means:
- an extra performance cost due to format conversion at the boundary,
- as well as negative implications for security and data integrity.
Hope that sums it up in one sentence. :)
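To make the data-integrity point concrete (illustrative only; the component-model boundary doesn't exist in shipped form yet), here is what a mandatory re-encode to well-formed UTF-8 does to a lone surrogate:

```
// A well-formed-UTF-8 boundary cannot carry a lone surrogate, so it is
// replaced with U+FFFD and the round trip no longer returns the original.
const original = "🙂".slice(0, 1);                    // lone high surrogate
const bytes = new TextEncoder().encode(original);     // Uint8Array [0xEF, 0xBF, 0xBD]
const roundTripped = new TextDecoder().decode(bytes); // "\uFFFD"
console.log(original === roundTripped);               // false
```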
This influx of new developers is the reason why AssemblyScript is now in the top three WebAssembly languages next to C++ and Rust (https://blog.scottlogic.com/2021/06/21/state-of-wasm.html) and should not be taken lightly.
There is a huge opportunity here to build an optimal foundation for these incoming developers, so that they won't be let down.
The influx has only just begun.
Ideally though, interface types would give languages options: the ability to choose which format their boundary will use. Obviously a JS host and a language like AssemblyScript would align on WTF-16, while a Rust Wasm module running on a Rust-powered Wasm runtime like wasmtime could optimally choose UTF-8.
I'm hoping things will be designed with flexibility in mind for this upcoming, very generic runtime feature.
In the past year it has gained numerous libraries and bindings, including some from Surma at Google. Stay tuned...
So it's not just UTF-16 that has problems and can cause security issues; I just wanted to emphasize that.
Every problem that UTF-8 has, it shares with UTF-16. It also shares every problem with UTF-32.
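For example, UTF-8 has its own classes of invalid data that have to be rejected or sanitised, overlong encodings being the classic attack vector:

```
// 0xC0 0xAF is an overlong encoding of "/" and is not valid UTF-8.
const bad = new Uint8Array([0xc0, 0xaf]);
console.log(new TextDecoder().decode(bad)); // "\uFFFD\uFFFD" (lenient: replaced)
try {
  new TextDecoder("utf-8", { fatal: true }).decode(bad);
} catch (e) {
  console.log(String(e)); // TypeError (strict: rejected)
}
```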