Well, we're talking about DOM manipulation performance here. Pages that use DOM manipulation heavily will see a potentially unacceptable performance loss if text always has to be converted to UTF-8.

Is fast DOM manipulation important? Given that the only way for the sole scripting language on the Web to display anything or interact with the user is through DOM manipulation, I think it's worth optimizing every cycle...




http://www.utf8everywhere.org/#faq.cvt.perf

If the function you're calling with UTF-8 is non-trivial, converting a few dozen bytes is unlikely to make a significant difference. Benchmark it, of course, but don't be surprised if you don't need to care. Modifying the DOM is probably going to be non-trivial.


UTF-8 -> UTF-16 conversion was at one point a noticeable fraction of Firefox's startup time.

https://bugzilla.mozilla.org/show_bug.cgi?id=506431

Since then we've done things such as fast-path ASCII -> UTF-16 conversion with SSE2 instructions. Converting a few dozen bytes is unlikely to make a significant difference, but often one needs to deal with more than a few dozen bytes.
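
The ASCII fast path looks roughly like this (a minimal SSE2 sketch of the general idea, not Firefox's actual code): check 16 bytes at a time for a set high bit, and if none is set, widen them to UTF-16 by interleaving with zeros.

  #include <emmintrin.h>  // SSE2 intrinsics
  #include <cstddef>
  #include <cstdint>

  // Widen ASCII bytes to UTF-16 code units, 16 at a time
  // (little-endian, as on any SSE2 machine). Returns how many
  // bytes were consumed; the caller falls back to a scalar
  // UTF-8 decoder for whatever remains.
  std::size_t ascii_to_utf16_sse2(const std::uint8_t* src,
                                  std::size_t len, char16_t* dst) {
      std::size_t i = 0;
      const __m128i zero = _mm_setzero_si128();
      for (; i + 16 <= len; i += 16) {
          __m128i bytes = _mm_loadu_si128(
              reinterpret_cast<const __m128i*>(src + i));
          // movemask gathers the sign bits: nonzero means some
          // byte has its high bit set, i.e. is non-ASCII.
          if (_mm_movemask_epi8(bytes) != 0)
              break;
          // Interleaving with zero widens each byte to a
          // 16-bit code unit.
          _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + i),
                           _mm_unpacklo_epi8(bytes, zero));
          _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + i + 8),
                           _mm_unpackhi_epi8(bytes, zero));
      }
      // Scalar tail for any leftover ASCII bytes.
      for (; i < len && src[i] < 0x80; ++i)
          dst[i] = static_cast<char16_t>(src[i]);
      return i;
  }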


I believe the authors made it clear enough that they don't rule out UTF-16 completely:

> We believe that all other encodings of Unicode (or text, in general) belong to rare edge-cases of optimization and should be avoided by mainstream users.

So if you're writing a browser that must use UTF-16 in its JavaScript engine due to dumb standards... it is a reasonable performance optimization to use UTF-16 for your strings. But how many people write JavaScript engines?


It's not the conversion that kills you, it's the memory allocation. If you're trying to interact with a "chatty" UTF-16 API, the difference between using UTF-8 and UTF-16 in your implementation could be the difference between passing a pointer (a cycle or two) and allocating/deallocating a buffer (potentially hundreds of cycles) per call.

There are obviously ways around this. The API could be rewritten to exchange strings less frequently, or to use static strings that could be replaced with handles. You can try to be clever about your buffer allocation and share one amongst all calls (but watch out for threading issues!). You could write your own allocator. But all this plumbing just increases complexity and the risk of bugs, along with adding its own performance cost.
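
For instance, a per-thread scratch buffer avoids both the per-call allocation and the locking that a single shared buffer would need (a sketch; convert_utf8_to_utf16 and api_call are hypothetical stand-ins for your converter and the API):

  #include <cstddef>
  #include <string>

  // Hypothetical stand-ins for the converter and for one call
  // of the chatty UTF-16 API.
  void convert_utf8_to_utf16(const std::string& in, std::u16string& out);
  void api_call(const char16_t* s, std::size_t len);

  void call_api(const std::string& utf8) {
      // One scratch buffer per thread, reused across calls: once
      // it has grown, there is no per-call allocation, and being
      // thread-local sidesteps the locking that a single shared
      // buffer would need.
      thread_local std::u16string buf;
      buf.clear();  // drops the contents, keeps the capacity
      convert_utf8_to_utf16(utf8, buf);
      api_call(buf.data(), buf.size());
  }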

I'm not arguing against UTF-8 as the preferred encoding for many future applications, but the "minimal overhead" example given in the manifesto isn't particularly convincing.


I agree in principle. But "chatty" APIs usually work with short strings, in which case you can use stack allocations in those performance-critical calls.
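
Something like this (a sketch; convert_utf8_to_utf16 and api_call are hypothetical stand-ins), which only touches the heap for unusually long inputs:

  #include <cstddef>
  #include <memory>

  // Hypothetical stand-ins: a converter that returns the number
  // of UTF-16 code units written, and the API being called.
  std::size_t convert_utf8_to_utf16(const char* in, std::size_t len,
                                    char16_t* out);
  void api_call(const char16_t* s, std::size_t len);

  void call_api(const char* utf8, std::size_t len) {
      // A UTF-8 sequence never decodes to more UTF-16 code units
      // than it has bytes, so len is a safe upper bound.
      char16_t stack_buf[256];
      std::unique_ptr<char16_t[]> heap_buf;
      char16_t* out = stack_buf;
      if (len > 256) {  // long input: fall back to the heap
          heap_buf.reset(new char16_t[len]);
          out = heap_buf.get();
      }
      std::size_t out_len = convert_utf8_to_utf16(utf8, len, out);
      api_call(out, out_len);
  }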


See my comment above; I can construct cases that make this sort of conversion have unacceptable overhead.

Would it matter in a real-world setting? I can't say for sure, because nobody I know of has tried making a production-quality UTF-8 web layout engine. But, in my mind, none of the benefits of UTF-8 (memory usage being the main one in a browser [1]) outweigh the performance risks of doing conversion. And the risk is real.

[1]: Note that you still need UTF-16 anyway, for interoperability with JavaScript. So using UTF-8 might even lead to worse memory usage than a careful UTF-16-everywhere scheme, since strings would have to be duplicated, whereas a UTF-16-everywhere scheme can share string buffers between the JS heap and the layout engine heap.


Is string storage really that big a portion of the browser's memory usage? I find it hard to believe that my browser is currently storing nearly a gigabyte of text right now. I'm not saying that there wouldn't be performance overhead, but would it be significant? I would be surprised if it made a big difference.

For legacy reasons, there probably isn't a point in changing it. But I'd be surprised if performance reasons turned out to be gating.



