
> I'm not so sure other languages do that any better

I can only speak for D, since that's the language I'm familiar with.

In D, strings are arrays of chars. The standard library assumes that they hold UTF-8 code units which together form valid UTF-8, but it's ultimately your responsibility to ensure that. This assumption allows the standard library to present strings as ranges of Unicode code points (i.e. whole characters, each of which may span several bytes).
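To make the code unit vs. code point distinction concrete, here is a minimal sketch (the sample string is just an illustration) that compares the array length with what range iteration sees:

```d
import std.range : ElementType, walkLength;

void main()
{
    // "\u00E9" (e with acute accent) is one code point but two UTF-8 code units.
    string s = "h\u00E9llo";

    // Treated as a range, a string yields decoded code points (dchar).
    static assert(is(ElementType!string == dchar));

    assert(s.length == 6);      // array length counts code units (bytes)
    assert(s.walkLength == 5);  // range iteration counts code points

    // Asking foreach for dchar decodes on the fly as well.
    size_t codePoints;
    foreach (dchar c; s)
        ++codePoints;
    assert(codePoints == 5);
}
```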

To back up this assumption, raw data is usually checked for UTF-8 validity when it is interpreted as a D string. For example, readText() takes a filename, reads the file's contents, and checks that they are valid UTF-8 before returning them. assumeUTF() takes an array of bytes and returns it as-is, but adds a validity check when the program is built in debug mode. Finally, string.representation (nothing more than a cast under the hood) gives you the raw bytes, and .byChar etc. allow iteration over code units rather than code points, if you really want to avoid auto-decoding and process a string byte-wise.
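A rough sketch of those conversions, with made-up sample bytes. readText() is left out because it needs a file on disk; I use std.utf.validate to show what an explicit validity check looks like, which is my own stand-in and not necessarily how readText() is implemented:

```d
import std.exception : assertThrown;
import std.string : assumeUTF, representation;
import std.utf : byChar, validate, UTFException;

void main()
{
    // Raw bytes that happen to be valid UTF-8: 'h' followed by U+00E9.
    ubyte[] raw = [0x68, 0xC3, 0xA9];

    // assumeUTF retypes the bytes as chars; the data is only validated in debug builds.
    char[] text = raw.assumeUTF;
    assert(text == "h\u00E9");

    // representation goes the other way: from string to its raw bytes.
    string s = "h\u00E9";
    immutable(ubyte)[] expected = [0x68, 0xC3, 0xA9];
    assert(s.representation == expected);

    // byChar iterates UTF-8 code units instead of decoded code points.
    size_t units;
    foreach (c; s.byChar)
        ++units;
    assert(units == 3);

    // An explicit check rejects malformed input: 0xFF never appears in UTF-8.
    char[] bad = [cast(char) 0xFF];
    assertThrown!UTFException(validate(bad));
}
```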

There are also types for UTF-16 and UTF-32 strings and code units (wstring/dstring and wchar/dchar), which work the same way as the above. For other encodings, there's std.encoding, which provides conversion to and from some common ones.
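As a small illustration of the wider string types and of std.encoding, here is a sketch. The sample strings are arbitrary, and I picked Latin-1 as the example encoding:

```d
import std.conv : to;
import std.encoding : Latin1String, transcode;

void main()
{
    // 'h', U+00E9 and U+1F600 (a code point outside the Basic Multilingual Plane).
    string s8 = "h\u00E9\U0001F600";
    wstring s16 = s8.to!wstring;  // transcoded to UTF-16
    dstring s32 = s8.to!dstring;  // transcoded to UTF-32

    assert(s8.length == 7);   // 1 + 2 + 4 UTF-8 code units
    assert(s16.length == 4);  // 1 + 1 + 2 (surrogate pair) UTF-16 code units
    assert(s32.length == 3);  // one UTF-32 code unit per code point

    // std.encoding: round trip through Latin-1 (ISO-8859-1).
    Latin1String latin1;
    transcode("h\u00E9", latin1);
    assert(latin1.length == 2);  // one byte per character in Latin-1

    string back;
    transcode(latin1, back);
    assert(back == "h\u00E9");
}
```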

My only gripe with how D deals with Unicode is that its standard library insists on decoding UTF-8 into code points when processing strings as ranges (and many operations that are dedicated string functions in other languages are written as generic range algorithms in D). Often enough, the decoding is unnecessary, slow, and makes processing non-UTF text a chore, but it's not too hard to avoid. Other than this, I think D's approach to Unicode is better than that of the other languages I've seen.
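For what it's worth, here is roughly how I'd sidestep the decoding when using generic algorithms. This is a sketch with arbitrary sample data: wrap UTF-8 strings in byCodeUnit, and keep non-UTF text in a ubyte[] so nothing ever tries to decode it:

```d
import std.algorithm.searching : count;
import std.range : walkLength;
import std.utf : byCodeUnit;

void main()
{
    string s = "caf\u00E9";

    // Default: the generic algorithm sees decoded code points.
    assert(s.walkLength == 4);

    // byCodeUnit: the same algorithm sees raw code units, with no decoding step.
    assert(s.byCodeUnit.walkLength == 5);

    // Searching for an ASCII character is safe at the code unit level.
    assert(s.byCodeUnit.count('a') == 1);

    // Non-UTF text (here Latin-1 bytes) can stay a ubyte[], which is never auto-decoded.
    ubyte[] latin1 = [0x63, 0x61, 0x66, 0xE9];
    assert(latin1.count(0xE9) == 1);
}
```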



