UTF-8 History (2003) (cam.ac.uk)
12 points by zdw on April 3, 2019 | 3 comments

Posted over a dozen times but with surprisingly few comments. The best I found:

https://news.ycombinator.com/item?id=6463466 (2013)

https://news.ycombinator.com/item?id=8648541 (2014)

https://news.ycombinator.com/item?id=15236856 (2017)

Don't miss this great link from that last thread: https://www.flickr.com/photos/ajstarks/sets/7215763147079887...

It is tragic that UTF-8 was not adopted as the Standard encoding for C++98. It probably could have been, if the right people had known its properties, and we might have been able to avoid standardizing wstring and all the basic_* templates.

It still could be the Standard encoding for C++20 if the right people could be persuaded. In practice, that would mean that whatever other execution encodings a given Implementation supports, it must also support a Standard mode in which UTF-8 is assumed for any program output that is interpreted as text.

It would mean that downstream programs that need to know the encoding, such as terminal emulators in some Implementations, would need to have a way to recognize or designate non-Standard programs and their encodings, and interpose appropriate handling for them.

They’d be better off providing standard free functions that can operate on any bucket of bytes claiming to be UTF-8 (allowing operations like iterating over complete composed sequences, creating normalization forms, erasing or inserting based on ranges of composed character sequences, etc.). I don’t want to be forced to create a std::string, for example. The bytes might also be an incomplete stream, which means processing only up to the last complete sequence and detecting invalid sequences along the way.
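To make the "free functions over a bucket of bytes" idea concrete, here is a minimal sketch of what such an interface might look like. Every name in it (Utf8Status, Utf8Step, Utf8Scan, utf8_decode_one, utf8_scan) is hypothetical and not part of any standard or proposal. It only handles code points, not composed character sequences or normalization, which would need Unicode data tables. It also takes a shortcut: a truncated tail is reported as incomplete as long as the bytes seen so far look like a lead byte followed by continuation bytes, even if no further input could make it valid.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// All names below are hypothetical; none of this is a standard library API.
enum class Utf8Status { Ok, Incomplete, Invalid };

struct Utf8Step {
    Utf8Status status;    // outcome for the sequence starting at data[0]
    char32_t code_point;  // decoded value, meaningful only when status == Ok
    std::size_t length;   // bytes consumed (Ok), bytes to skip (Invalid), 0 (Incomplete)
};

// Decode one code point from the raw bytes data[0..size). No std::string needed.
inline Utf8Step utf8_decode_one(const unsigned char* data, std::size_t size) {
    assert(data != nullptr || size == 0);
    if (size == 0) return {Utf8Status::Incomplete, 0, 0};

    const unsigned char b0 = data[0];
    if (b0 < 0x80) return {Utf8Status::Ok, b0, 1};  // ASCII fast path

    std::size_t need = 0;   // total bytes in this sequence
    char32_t cp = 0;        // accumulated code point bits
    char32_t min_cp = 0;    // smallest value allowed for this length (rejects overlongs)
    if (b0 >= 0xC2 && b0 <= 0xDF)      { need = 2; cp = b0 & 0x1F; min_cp = 0x80; }
    else if ((b0 & 0xF0) == 0xE0)      { need = 3; cp = b0 & 0x0F; min_cp = 0x800; }
    else if (b0 >= 0xF0 && b0 <= 0xF4) { need = 4; cp = b0 & 0x07; min_cp = 0x10000; }
    else return {Utf8Status::Invalid, 0, 1};  // stray continuation, C0/C1, or F5..FF

    for (std::size_t i = 1; i < need; ++i) {
        if (i == size) return {Utf8Status::Incomplete, 0, 0};  // ran out of input
        // A non-continuation byte ends the bad sequence; skip only the bytes
        // before it, since it may itself start a valid sequence.
        if ((data[i] & 0xC0) != 0x80) return {Utf8Status::Invalid, 0, i};
        cp = (cp << 6) | (data[i] & 0x3F);
    }
    // Reject overlong encodings, values past U+10FFFF, and UTF-16 surrogates.
    if (cp < min_cp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return {Utf8Status::Invalid, 0, need};
    return {Utf8Status::Ok, cp, need};
}

struct Utf8Scan {
    std::size_t code_points;  // complete code points decoded before stopping
    std::size_t consumed;     // bytes covered by those code points
    Utf8Status status;        // Ok (all input consumed), Incomplete, or Invalid
};

// Walk the buffer up to the last complete code point, stopping at the first
// incomplete tail or invalid sequence.
inline Utf8Scan utf8_scan(const unsigned char* data, std::size_t size) {
    Utf8Scan result{0, 0, Utf8Status::Ok};
    while (result.consumed < size) {
        const Utf8Step step =
            utf8_decode_one(data + result.consumed, size - result.consumed);
        if (step.status != Utf8Status::Ok) {
            result.status = step.status;
            break;
        }
        ++result.code_points;
        result.consumed += step.length;
    }
    return result;
}
```

For a streaming reader, the intended use would be to call utf8_scan on each chunk, process the first `consumed` bytes, and carry the remaining tail over to the front of the next chunk when the status is Incomplete.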
