
What developers should know about Unicode and character sets in 2013 - oyvindeh
http://the-pastry-box-project.net/oli-studholme/2013-october-8/
======
jrochkind1
> Never assume that the data you’re dealing with is UTF-8 — ASCII appears
> identical unless you view the hex to see if each character is taking one
> byte (ASCII) or three (UTF-8).

Um, what? This is just wrong. ASCII-equivalent characters take exactly one byte
in UTF-8. Other characters may take two, three, or four bytes.

If the author actually viewed text in ASCII that, when in UTF-8, took three
bytes per character... I don't know what they were looking at, but it wasn't
UTF-8.

~~~
jrochkind1
Also, if the data is ASCII, and includes only legal 7-bit ASCII characters --
it is simultaneously ALSO valid and legal UTF-8. UTF-8 is a superset of ASCII.

I'm not sure this guy understands what he's talking about.
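The byte counts are easy to check; a quick sketch (Python used purely as illustration):

```python
# ASCII characters are one byte in UTF-8; other characters take two to four.
for ch in ("A", "é", "€", "🍰"):
    print(ch, len(ch.encode("utf-8")))
```

This prints 1, 2, 3, and 4 bytes respectively: ASCII stays at one byte, and nothing in UTF-8 is fixed at three.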

------
PeterisP
The concluding statement is a bit weird: "ASCII appears identical unless you
view the hex to see if each character is taking one byte (ASCII) or three
(UTF-8)"

That isn't accurate: ASCII text would appear identical even if you 'view the
hex', because it is byte-for-byte identical in UTF-8; that's the whole point of
UTF-8. You'd have to look at non-ASCII characters to see how they're encoded.
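The identical-bytes point is easy to demonstrate (Python here just as illustration):

```python
# Pure-ASCII text encodes to the same bytes under ASCII and UTF-8,
# so even a hex dump cannot tell the two apart.
s = "hello"
print(s.encode("ascii").hex())  # 68656c6c6f
print(s.encode("utf-8").hex())  # 68656c6c6f -- identical
```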

~~~
ygra
Notepad also doesn't save as ASCII by default but »ANSI«, the default legacy
codepage configured for your Windows installation.

~~~
apaprocki
Yes, the default Windows code page -- many pieces of software don't realize
that registry keys, file paths, etc. are all encoded in a different code page
if you are running, for example, Japanese Windows. (Also, it isn't _exactly_
Shift-JIS...)
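One concrete example of the "not exactly Shift-JIS" point, sketched in Python (cp932 is Python's name for the Windows Japanese code page): the NEC extension character ① encodes in cp932 but not in strict Shift_JIS.

```python
# cp932 (Windows Japanese) extends Shift_JIS with vendor characters,
# so treating the two as interchangeable corrupts data at the edges.
print("①".encode("cp932"))  # encodes fine (NEC extension row)
try:
    "①".encode("shift_jis")
except UnicodeEncodeError:
    print("① is not representable in strict Shift_JIS")
```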

------
VLM
Some background not covered in an otherwise pretty good article:

"In general, don’t save a Byte Order Mark (BOM) — it’s not needed for UTF-8,
and historically could cause problems."

This attitude comes from the agony of processing UTF-16 files. I interface
with a group that finds it hilarious to send me textual data in UTF-16 format,
and the first hard-won lesson you learn with UTF-16 is that, superficially,
the default byte order should be correct 50% of the time if guessed randomly,
yet somehow it's always wrong. So say you read one line of a UTF-16 text file
and process it accordingly after passing it through a UTF-16 decoder. OK, no
problemo: it had a BOM as the first glyph/byte/character/whatever and was
converted and interpreted correctly. Then you read another line, just like
you'd read and process a line of ASCII or UTF-8. However, they only give me a
BOM at the start of the file, not at the start of each line, so invariably I
translate that to garbage because the bytes are swapped.

Now there are programmatic methods to analyze the BOM and memorize it. Or you
can read the whole blasted multi-gig file into memory at once, de-UTF-16 it
all at once, and then go through the file line by line. But fundamentally it's
a simple one-liner sysadmin-type job to just shove the file through a UTF-16
to UTF-8 translator before it hits my processing system. I already had to
decrypt it, and unzip it, and verify its hash so I know they sent me the whole
file (and correctly), so adding a conversion stage is no big deal.

And this kind of UTF-16 experience is what leads people to do things like say
"oh, it's Unicode? That means I should squirt out BOMs as often as possible",
even though that technically only applies to UTF-16 and is not helpful for
UTF-8.
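For what it's worth, "memorize the BOM" doesn't require slurping the whole file into memory: an incremental decoder consumes the BOM once and remembers the byte order for every later read. A Python sketch of that approach:

```python
import io

# The utf-16 codec reads the BOM once and keeps the byte order for all
# subsequent reads, so wrap the raw byte stream in a decoding reader
# instead of decoding each line separately.
utf16_bytes = "line one\nline two\n".encode("utf-16")  # BOM at file start only
with io.TextIOWrapper(io.BytesIO(utf16_bytes), encoding="utf-16") as f:
    for line in f:  # lines after the first decode correctly too
        print(line.rstrip().encode("utf-8"))
```

With a real file you would pass the file object opened in binary mode (or just use `open(path, encoding="utf-16")`) instead of the `BytesIO` stand-in.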

------
danso
I hate to be "that SEO guy", but the OP needs to do some SEO. The submitted
title here is nowhere to be seen, which is too bad because it's a great title
and one that I would try to Google after forgetting to bookmark this page.

Luckily I do use Pinboard, which auto-grabs the title, if one exists. But this
is a helpful reference to many devs who don't read HN, and it's all but
obscured.

------
golergka
Oh, one more fun fact: some emoji characters occupy more than one _Unicode_
character, and can be encoded in different ways depending on the device that
uses them. (Before they were introduced into Unicode, they used character
codes designated for custom platform-specific stuff).

Debugging a text input field where user can enter emoji & RTL text is FUN.

~~~
twic
Are there really multi-character emoji? Or is it that they are single
characters on an astral plane which are encoded as two code units in UTF-16,
and therefore behave rather like two characters if your language uses 16-bit
chars?

~~~
golergka
Several characters, yes. And those characters, in turn, can be represented as
high/low surrogate pairs in UTF-16.

[http://apps.timwhitlock.info/emoji/tables/unicode](http://apps.timwhitlock.info/emoji/tables/unicode)

Look for flags and numbers. Here's the German flag in UTF-8:
\xF0\x9F\x87\xA9\xF0\x9F\x87\xAA. That's 8 bytes, 2 Unicode code points, 4
UTF-16 code units.
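In Python terms (just as illustration), the flag is two regional-indicator code points:

```python
# German flag = REGIONAL INDICATOR SYMBOL LETTER D + LETTER E
flag = "\U0001F1E9\U0001F1EA"
print(len(flag))                           # 2 code points
print(len(flag.encode("utf-8")))           # 8 bytes
print(len(flag.encode("utf-16-le")) // 2)  # 4 UTF-16 code units
```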

~~~
tuukkah
Are these handled in the font as ligatures?

~~~
golergka
In what UI framework? When I worked on that, I decided to render them from a
separate texture that doesn't depend on the current font but scales to its
size.

------
ygra
Site appears to be down; Google cache:
[http://webcache.googleusercontent.com/search?q=cache:A8oNdl-...](http://webcache.googleusercontent.com/search?q=cache:A8oNdl-
pbKIJ:the-pastry-box-project.net/oli-
studholme/2013-october-8/+&cd=1&hl=de&ct=clnk&gl=de)

------
ohwp
Note that some browsers use the <meta charset="UTF-8"> even if the Content-
Type header already sent the charset.

Another thing to add: always open a database connection in the charset of your
choice. And if you are a PHP user (like I am): there are still functions that
don't support multibyte strings, so be careful.
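The multibyte pitfall is the same one PHP's strlen vs mb_strlen distinction exists for: byte-oriented functions count bytes, not characters. Sketched here in Python for illustration:

```python
# Byte-oriented length functions count bytes, not characters.
s = "naïve"
print(len(s))                  # 5 characters
print(len(s.encode("utf-8")))  # 6 bytes (ï takes two bytes in UTF-8)
```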

~~~
oneeyedpigeon
This is the biggest current driver towards me trying to muster the effort to
move off of PHP. Also, I had no end of trouble working with filenames that
contained UTF-8 characters using PHP, and had to give up in the end.

------
hcarvalhoalves
> While there are a ton of encodings you could use, for the web use UTF-8. You
> want to use UTF-8 for your entire stack. So how do we get that?

You should use your language's internal unicode representation, and decode
from/encode to UTF-8 on I/O.
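A minimal sketch of that decode-on-input/encode-on-output discipline, in Python for illustration:

```python
raw = b"caf\xc3\xa9"              # UTF-8 bytes arriving from I/O
text = raw.decode("utf-8")        # internal representation: a Unicode string
processed = text.upper()          # all string logic operates on str, not bytes
print(processed.encode("utf-8"))  # encode only at the output boundary
```

Keeping bytes at the edges and Unicode strings everywhere in between means no function in the middle of your stack ever has to guess an encoding.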

