
Thank you, that is very good to learn, and I looked over the Wikipedia article. But as far as byte order goes, how is that architecture-independent? Is it just that UTF-8 dictates that the bytes always come in the same order, so whatever system you're on, you ignore its native ordering and interpret the bytes in the order UTF-8 tells you to?


Yes, basically. UTF-8 doesn't encode a code point as a single integer; it encodes it as a sequence of bytes, in a fixed order, where some of the bits are used to represent the code point, and some of them are just used to mark whether you are looking at an initial byte or a continuation byte.

I'd recommend checking out the description of UTF-8 on Wikipedia. The tables make it fairly clear how the encoding works: http://en.wikipedia.org/wiki/UTF-8#Description
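
To make the bit layout concrete, here's a rough Python sketch of how a code point up to U+FFFF gets packed into bytes (illustrative only; encode_utf8 is a made-up name, real code should just call str.encode):

    def encode_utf8(cp):
        # Encode one code point (up to U+FFFF) as a list of byte values.
        if cp < 0x80:
            return [cp]                      # 0xxxxxxx: one byte, plain ASCII
        if cp < 0x800:
            return [0xC0 | (cp >> 6),        # 110xxxxx: leading byte
                    0x80 | (cp & 0x3F)]      # 10xxxxxx: continuation byte
        return [0xE0 | (cp >> 12),           # 1110xxxx: leading byte
                0x80 | ((cp >> 6) & 0x3F),   # 10xxxxxx: continuation byte
                0x80 | (cp & 0x3F)]          # 10xxxxxx: continuation byte

    # U+00E9 ('é') comes out as [0xc3, 0xa9], matching 'é'.encode('utf-8')
    print([hex(b) for b in encode_utf8(0xE9)])

The leading byte is always emitted first in the stream, so the byte order comes from the encoding itself, not from the machine.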


utf-8 is a single byte encoding. Reversing the order of a sequence that's one byte long just gives back that one byte.


No it isn't. Any letter with an accent will take up two bytes. Most non-Latin characters take up three bytes, sometimes even four.
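
For instance, in Python (just an illustration, using the standard str.encode):

    print(len('e'.encode('utf-8')))    # 1 byte  (ASCII)
    print(len('é'.encode('utf-8')))    # 2 bytes (Latin letter with accent)
    print(len('あ'.encode('utf-8')))   # 3 bytes (most non-Latin characters)
    print(len('😀'.encode('utf-8')))   # 4 bytes (characters outside the BMP)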


Poorly phrased. It can take multiple bytes to fully define one codepoint, but the encoding is defined in terms of a stream of single bytes. In other words, each unit is one byte, so flipping the bytes within a unit gives back the same unit.

This is not the case for UTF-16 and UTF-32.
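
A quick Python sketch of the difference (the byte values are standard, but treat this as an illustration):

    # UTF-8: one fixed byte sequence, no byte-order variants.
    print('é'.encode('utf-8'))       # b'\xc3\xa9'

    # UTF-16: the same code point exists in two byte orders,
    # hence the BOM and the LE/BE variants.
    print('é'.encode('utf-16-le'))   # b'\xe9\x00'
    print('é'.encode('utf-16-be'))   # b'\x00\xe9'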


AFAIK the correct term is "byte-oriented".


Yes, but those two bytes will be in the same order regardless of the endianness of the system.
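
One way to see this in Python (a sketch; sys.byteorder just reports the machine's native endianness):

    import sys, struct

    # Native byte order affects how a multi-byte integer is laid out in memory...
    print(sys.byteorder, struct.pack('=H', 0x00E9))  # b'\xe9\x00' or b'\x00\xe9'

    # ...but the UTF-8 encoding of U+00E9 is the same two bytes on every machine.
    print('é'.encode('utf-8'))                       # always b'\xc3\xa9'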



