
Thank you, that is very good to learn, and I looked over the Wikipedia article. But as far as byte order goes, how is that architecture-independent? Is it just that UTF-8 dictates that the bytes always come in the same order, so whatever system you're on, you ignore its native ordering and interpret the bytes in the order UTF-8 tells you to?


Yes, basically. UTF-8 doesn't encode a code point as a single integer; it encodes it as a sequence of bytes, in a fixed order, where some of the bits are used to represent the code point, and some of them are just used to mark whether you are looking at an initial byte or a continuation byte.

I'd recommend checking out the description of UTF-8 on Wikipedia. The tables make it fairly clear how the encoding works: http://en.wikipedia.org/wiki/UTF-8#Description
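
To make the bit layout concrete, here's a rough Python sketch of how a code point up to U+FFFF gets packed into bytes (illustrative only; encode_utf8 is a made-up name, real code should just call str.encode):

    def encode_utf8(cp):
        # Encode one code point (up to U+FFFF) as a list of byte values.
        if cp < 0x80:
            return [cp]                      # 0xxxxxxx: one byte, plain ASCII
        if cp < 0x800:
            return [0xC0 | (cp >> 6),        # 110xxxxx: leading byte
                    0x80 | (cp & 0x3F)]      # 10xxxxxx: continuation byte
        return [0xE0 | (cp >> 12),           # 1110xxxx: leading byte
                0x80 | ((cp >> 6) & 0x3F),   # 10xxxxxx: continuation byte
                0x80 | (cp & 0x3F)]          # 10xxxxxx: continuation byte

    # U+00E9 ('é') comes out as [0xc3, 0xa9], matching 'é'.encode('utf-8')
    print([hex(b) for b in encode_utf8(0xE9)])

The leading byte is always emitted first in the stream, so the byte order comes from the encoding itself, not from the machine.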


utf-8 is a single byte encoding. Reversing the order of a sequence that's one byte long just gives back that one byte.


No it isn't. Any letter with an accent will take up two bytes. Most non-Latin characters take up three bytes, sometimes even four.
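
For instance, in Python (just an illustration, using the standard str.encode):

    print(len('e'.encode('utf-8')))    # 1 byte  (ASCII)
    print(len('é'.encode('utf-8')))    # 2 bytes (Latin letter with accent)
    print(len('あ'.encode('utf-8')))   # 3 bytes (most non-Latin characters)
    print(len('😀'.encode('utf-8')))   # 4 bytes (characters outside the BMP)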


Poorly phrased. It can take multiple bytes to fully define one codepoint, but the encoding is defined in terms of a stream of single bytes. In other words, each unit is one byte, so flipping the bytes within a unit gives back the same unit.

This is not the case for UTF-16 and UTF-32.
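
A quick Python sketch of the difference (the byte values are standard, but treat this as an illustration):

    # UTF-8: one fixed byte sequence, no byte-order variants.
    print('é'.encode('utf-8'))       # b'\xc3\xa9'

    # UTF-16: the same code point exists in two byte orders,
    # hence the BOM and the LE/BE variants.
    print('é'.encode('utf-16-le'))   # b'\xe9\x00'
    print('é'.encode('utf-16-be'))   # b'\x00\xe9'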


AFAIK the correct term is "byte-oriented".


Yes, but those two bytes will be in the same order regardless of the endianness of the system.
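
One way to see this in Python (a sketch; sys.byteorder just reports the machine's native endianness):

    import sys, struct

    # Native byte order affects how a multi-byte integer is laid out in memory...
    print(sys.byteorder, struct.pack('=H', 0x00E9))  # b'\xe9\x00' or b'\x00\xe9'

    # ...but the UTF-8 encoding of U+00E9 is the same two bytes on every machine.
    print('é'.encode('utf-8'))                       # always b'\xc3\xa9'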



