
Java and JavaScript being high-level languages, it’s easy to switch the internal representation of strings.

In fact, the JVM has already moved to a mix of ISO-8859-1/Latin-1 and UTF-16 (https://openjdk.org/jeps/254), and I expect many performant JavaScript implementations also do something in that direction.
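
To make the Latin-1/UTF-16 split concrete, here's a rough sketch of the idea as I understand it: store one byte per char when every char fits in Latin-1, and fall back to two bytes per char otherwise. The class name `CompactStringSketch` and its fields are made up for illustration; this is not the actual `java.lang.String` code.

```java
// Hypothetical sketch of a dual-representation string; names are assumptions,
// not the real JDK internals.
final class CompactStringSketch {
    static final byte LATIN1 = 0; // one byte per char
    static final byte UTF16 = 1;  // two bytes per char (big-endian in this sketch)

    final byte coder;
    final byte[] value;

    CompactStringSketch(String s) {
        // Decide once, at construction time, whether every char fits in one byte.
        boolean fitsLatin1 = true;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0xFF) { fitsLatin1 = false; break; }
        }
        coder = fitsLatin1 ? LATIN1 : UTF16;
        value = new byte[fitsLatin1 ? s.length() : 2 * s.length()];
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (fitsLatin1) {
                value[i] = (byte) c;
            } else {
                value[2 * i] = (byte) (c >> 8);
                value[2 * i + 1] = (byte) c;
            }
        }
    }

    int length() {
        return coder == LATIN1 ? value.length : value.length / 2;
    }

    // Indexing stays O(1) in both representations, so callers still see UTF-16 semantics.
    char charAt(int i) {
        if (coder == LATIN1) return (char) (value[i] & 0xFF);
        return (char) (((value[2 * i] & 0xFF) << 8) | (value[2 * i + 1] & 0xFF));
    }

    public static void main(String[] args) {
        CompactStringSketch ascii = new CompactStringSketch("hello");
        CompactStringSketch euro = new CompactStringSketch("h\u20acllo");
        System.out.println(ascii.value.length + " bytes vs " + euro.value.length + " bytes");
    }
}
```

Note that a single char outside Latin-1 forces the whole string into the two-byte form, which is why the saving depends on how much of the heap is plain Western text.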




According to the article they don't, actually? Apparently they're thinking about it but aren't sure if it's worth it. For Java it was largely because it reduced time spent in GC (less memory usage = less frequent need to collect).


That was the enhancement proposal targeting Java 9 which came out about 6 years ago.

For Java it’s a good saving because it reduces the overall heap size, and if JS has a similar distribution of objects then it should work well there as well. It may already do so; the internal storage format of things like strings and arrays is deliberately opaque.


Java and JavaScript are actually hampered in that regard because they have to pretend that the encoding is UTF-16. Thus the limitation to Latin-1. With UTF-8, seeking in the middle of the string would be harder.
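
A quick sketch of why seeking is harder with UTF-8: to honor a UTF-16 index you have to walk the bytes from the start, because each code point takes 1 to 4 bytes. The helper below is hypothetical and assumes well-formed UTF-8 and an index that does not land in the middle of a surrogate pair.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper; assumes valid UTF-8 input and an index on a code point boundary.
final class Utf8Seek {
    // Returns the byte offset in utf8 that corresponds to the given UTF-16 code unit index.
    static int byteOffsetOfUtf16Index(byte[] utf8, int utf16Index) {
        int units = 0; // UTF-16 code units passed so far
        int pos = 0;   // current byte position
        while (pos < utf8.length && units < utf16Index) {
            int lead = utf8[pos] & 0xFF;
            int len = lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
            // A 4-byte sequence is a supplementary code point: two UTF-16 units.
            units += (len == 4) ? 2 : 1;
            pos += len;
        }
        return pos;
    }

    public static void main(String[] args) {
        byte[] bytes = "a\u00e9\u20acb".getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i <= 4; i++) {
            System.out.println("UTF-16 index " + i + " is at byte " + byteOffsetOfUtf16Index(bytes, i));
        }
    }
}
```

With Latin-1 or UTF-16 storage the same lookup is a single multiplication, which is the constraint the Latin-1 choice preserves.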


I think seeking into the middle of strings, as opposed to iterating over them from the start, is rare in most code.

If so, using UTF-8 and only converting to UTF-16 the moment such seeking happens may be beneficial.

The problem, however, is that Java and JavaScript have C-style for loops that give false positives, where the code indexes into the string only in order to iterate over it.
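
Roughly what I have in mind, as a sketch: keep UTF-8 bytes, decode sequentially for iteration, and only build a UTF-16 array the first time someone indexes. `LazyUtf8String` and its methods are invented for illustration, and it assumes the input has no unpaired surrogates.

```java
import java.nio.charset.StandardCharsets;
import java.util.function.IntConsumer;

// Hypothetical sketch; assumes well-formed input (no lone surrogates).
final class LazyUtf8String {
    private final byte[] utf8;
    private final int length; // length in UTF-16 code units
    private char[] utf16;     // built only on the first indexed access

    LazyUtf8String(String s) {
        this.utf8 = s.getBytes(StandardCharsets.UTF_8);
        this.length = s.length();
    }

    int length() { return length; }

    boolean isInflated() { return utf16 != null; }

    // Random access: forces a one-time conversion of the whole string.
    char charAt(int i) {
        if (utf16 == null) {
            utf16 = new String(utf8, StandardCharsets.UTF_8).toCharArray();
        }
        return utf16[i];
    }

    // Sequential access: decodes straight from the UTF-8 bytes, no conversion.
    void forEachCodePoint(IntConsumer action) {
        int pos = 0;
        while (pos < utf8.length) {
            int lead = utf8[pos] & 0xFF;
            int len = lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
            int cp = (len == 1) ? lead : lead & (0xFF >> (len + 1));
            for (int k = 1; k < len; k++) {
                cp = (cp << 6) | (utf8[pos + k] & 0x3F);
            }
            action.accept(cp);
            pos += len;
        }
    }

    public static void main(String[] args) {
        LazyUtf8String s = new LazyUtf8String("Hello W\u00f6rld");

        // Iterator-style: stays in UTF-8.
        int[] upper = {0};
        s.forEachCodePoint(cp -> { if (Character.isUpperCase(cp)) upper[0]++; });
        System.out.println("inflated after iteration: " + s.isInflated());

        // C-style loop: the same scan, but it looks like random access to the runtime.
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            if (Character.isUpperCase(s.charAt(i))) n++;
        }
        System.out.println("inflated after indexed loop: " + s.isInflated());
    }
}
```

Both loops do the same work, but only the indexed one triggers the conversion, which is the false positive.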


The conversion is required to properly support indexing for any index != 0. Optimizations are only possible if iterator-style APIs are used so the runtime can iterate as well. However, depending on the string's length, it might still be more efficient to convert the whole string and be done with it. Languages with a proper WASM backend could offer optimized runtime libraries and/or optimize such code.

The issue is not new. JavaScript runtimes frequently use multiple optimized string types for various situations:

https://github.com/danbev/learning-v8/blob/master/notes/stri...


> Optimizations are only possible if iterator-style APIs are used

They make it simpler, but it also is possible to detect that loops are iteration-style access in disguise.

That takes time, so it's more likely to happen in ahead-of-time compiled languages.

C compilers can vectorize some such loops, so they already have logic for doing that.



