
What to know about encodings and character sets - dshankar
http://kunststube.net/encoding/
======
Animats
Something they should have mentioned: put a

    
    
        <meta charset="utf-8" /> 
    

in all your HTML documents that are in UTF-8. Note that this has to be in the
first 1024 bytes of the document. Otherwise, the browser has to invoke the
"encoding guesser"[1], which will sometimes guess wrong. (W3C: "The user agent
may attempt to autodetect the character encoding from applying frequency
analysis or other algorithms to the data stream.") The result will be
occasional users seeing random pages in the wrong encoding, depending on
browser, browser version, platform, and page content.

I recently saw the front page of the New York Times misconverted because they
didn't specify an encoding, and the only UTF-8 sequence near the beginning of
the document was in

    
    
       <link rel="apple-touch-icon-precomposed" sizes="144×144" ...
    

The "×" there is not the letter x, it's the Unicode multiplication symbol.
This confused an encoding guesser. Don't go there.

[1] [http://www.w3.org/TR/html5/syntax.html#determining-the-
chara...](http://www.w3.org/TR/html5/syntax.html#determining-the-character-
encoding)

~~~
icebraining
You can also add "; charset=utf-8" to your Content-Type header instead.

~~~
Animats
Yes, but then it's separated from the document. If someone saves the file,
they lose the charset info.

How reliable are CDNs and caching servers about preserving Content-Type
headers?

~~~
nrinaudo
While this is true, I find the meta tag to be a horrible pain in the ass.

If you have to parse some HTML that you get over an HTTP connection - you're
writing a crawler, say, or you want to extract RDFa metadata - you have to
deal with the following, surprisingly common case: both the header and the
html document contain encoding information, and they disagree. The RFC states
that you should trust the header, but in practice that's certainly not always
right - nor even, in my experience, right the majority of the time.

If you decide to use the meta tag, that means you have already started
decoding your byte stream by the time you find the encoding, and then need to
re-interpret the bytes you've already read. I have seen a lot of pages that
declared their encoding _after_ the title tag.

What's worse, you can't know whether you have a meta tag until you've parsed
the whole head, which can be huge with hundreds of kilobytes of inlined
javascript and css.
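
In Python, the two-pass dance looks something like this (a minimal sketch; the regex and the function name are illustrative, and a real crawler would also respect the 1024-byte prescan window and handle BOMs):

```python
import re

def sniff_and_decode(raw: bytes, fallback: str = "utf-8") -> str:
    # Pass 1: provisionally decode (assuming an ASCII-compatible
    # encoding) just to hunt for a charset declaration in the markup.
    provisional = raw.decode("ascii", errors="replace")
    m = re.search(r'<meta[^>]+charset=["\']?([\w-]+)', provisional, re.I)
    declared = m.group(1) if m else fallback
    # Pass 2: throw away the provisional text and re-decode every byte
    # we already read, this time with the encoding we just discovered.
    try:
        return raw.decode(declared, errors="replace")
    except LookupError:  # unknown or garbage charset label
        return raw.decode(fallback, errors="replace")
```
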

The argument that you should just read the first 1024 bytes and assume utf-8
if nothing is found is just not satisfactory - I want the encoding of the
documents I'm parsing to be correct all the time, not only when the remote
host follows the rules. If I'm writing a crawler, the remote host cares
nothing about my needs, and I'm the one suffering from my unwillingness to be
flexible.

So, yeah. Don't use the meta encoding tag, and trust your user agent to save
the html code in a sane (utf-8) encoding. There is no reason to store encoding
information in an html file, just like I doubt your source code always starts
with a preprocessor instruction declaring the encoding that the compiler
should use.

~~~
JupiterMoon
Hang on isn't this argument basically: "lots of people don't follow the
standards now so no one should bother?"

~~~
nrinaudo
Well, not exactly. If people truly followed the standards, there would be no
need for the meta charset element: the RFC clearly states that the encoding
should either be specified in the header or default to ISO Latin 1. I can't
recall whether it makes provisions for media-type-specific default charsets,
but either way, if you follow the HTTP standard, you should not specify
content encoding in your text document (this of course does not apply to
binary formats that might encode text).

So, to be a bit pedantic about it, my argument is that you should follow the
standards and ignore / work around the hacks used to make life easier for
people that don't know / don't care about encoding.

Note that I do not mean that as condescending - at some point, a lot of
designers were writing HTML manually, and I don't expect them to know about
encoding, just the same as they hopefully don't expect me to know about...
design stuff I'm really terrible at.

~~~
JupiterMoon
Fair enough. Not condescending. I don't know much about HTML tbh.

However, if I were faced with your situation I would try to use whatever
logic Firefox or Chromium uses to work out the encoding. After all, designers
are going to (or should) test that things work on one or both of these, right?

------
PeterisP
We work with a lot of multilingual text, and for "what to know about encodings
and character sets" we have a very simple answer to that - a guideline called
"use UTF8 or die".

It's not suitable for absolutely everyone (e.g. if you have a lot of Asian
text you may want a different encoding), but for our use case every single
deviation causes a lot of headaches, risks, and unnecessary work fixing
garbage data.

In simplistic terms what we mean by this guideline:

* in your app, 100% of human text should be stored as UTF8 only, no exceptions. If you need to deal with data in other encodings - other databases, old documents, whatever - make a wrapper/interface that takes some resource ID and returns the data with everything properly converted to UTF8, and has no way (or at least no convenient way) to access the original bytes in another encoding.

* in all persistence layers, store text only as UTF8. If at all possible, don't even provide options to export files in other encodings. If legacy/foreign data stores need another encoding, then in your code never have an API that requires a programmer to pass data in _that_ encoding - the functions "store_document_to_the_mainframe_in_EBCDIC" and "send_cardholder_data_to_that_weird_CC_gateway" should take UTF8 strings only and handle the encodings themselves.

* in all [semi-]public API, treat text as UTF8-only and _document that_. If your API documentation mentions a text field, state the encoding so that there is no guessing or assuming by anyone.

* in all system configuration, set UTF8 as the default whenever possible. A database server? Make sure that any new databases/tables/text fields will have UTF8 set as the default, so unless someone takes explicit action then user-local-language encodings won't accidentally appear.

* Whoever introduces a single byte of differently encoded data is responsible for fixing the resulting mess. This is the key part. Did you write a data input function that passed on data in the user's computer-default encoding, tested it only on US-ASCII rather than non-English symbols, and got a bunch of garbage data stored? You're responsible for finding the relevant entries and fixing them, not only your code. Used a third-party library that crashes or loses data when passed non-English Unicode symbols? Either fix the library (if it's open source) or rewrite your code to use a different one.
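
A minimal sketch of the "convert at the boundary" rule from the first two bullets (the source names, encodings, and function names here are illustrative assumptions, not anyone's real API):

```python
# The rest of the app only ever sees str values; legacy encodings are an
# internal detail of the boundary functions below.

LEGACY_SOURCES = {
    "mainframe": "cp500",   # an EBCDIC code page
    "old_crm":   "cp1252",  # a Windows "Latin-1" variant
}

def fetch_text(source_id: str, raw: bytes) -> str:
    """Return properly decoded text; callers never touch the raw bytes."""
    return raw.decode(LEGACY_SOURCES.get(source_id, "utf-8"))

def store_document_to_the_mainframe(text: str) -> bytes:
    # Takes a plain (unicode) string and handles the EBCDIC conversion
    # itself, exactly as the guideline prescribes.
    return text.encode("cp500")
```
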

------
peapicker
From the article: "Overall, Unicode is yet another encoding scheme."

It is more than that - for instance, it includes algorithms as well: ordering
and shaping rules for RTL languages such as Arabic; rules for what to do when
RTL text is mixed with LTR (is that '.' at the end of '123' a decimal point
or a period? That determines whether it goes to the right or the left of the
sequence); rules for knowing when data is equivalent despite differing
normalization; etc.

------
scottfr
The linked article by Joel Spolsky is also great:

[http://www.joelonsoftware.com/articles/Unicode.html](http://www.joelonsoftware.com/articles/Unicode.html)

~~~
r4ferrei
It's a great article indeed. I think it was after I read this one that I
really started to understand what was going on with all that encoding stuff
that I was already used to doing. Funny to look back.

------
ninjakeyboard
I hate that I have to know this stuff. Working on implementing a spec today
where handling character encoding is a requirement.

------
vorg
> _It basically defines a ginormous table of 1,114,112 code points that can be
> used for all sorts of letters and symbols. That's plenty to encode all
> existing, pre-historian and future characters mankind knows about. There's
> even an unofficial section for Klingon. Indeed, Unicode is big enough to
> allow for unofficial, private-use areas._

The private use areas only encode about 137,000 codepoints (U+e000 to U+f8ff &
U+f0000 to U+10ffff) and are running out quickly. Most of U+e000 to U+f8ff is
used by many different private agreements, and some pseudo-public ones like
the Conscript registry which encodes Klingon, linked to in the article.
Conscript also uses a large chunk of plane F to encode the constructed script
_Kinya_ , i.e. the 3696 codepoints in U+f0000 to U+f0e6f, see
[http://www.kreativekorp.com/ucsur/charts/PDF/UF0000.pdf](http://www.kreativekorp.com/ucsur/charts/PDF/UF0000.pdf)
. It takes up so much room because it's a block script like Korean Hangul and
is encoded by formula just like Hangul. Each Korean Hangul block is made up of
2 or 3 jamo: one of 19 leading consonants, one of 21 vowels, and optionally
one of 27 trailing consonants (28 choices counting "none"), giving a total of
19 * 21 * 28 = 11,172 possible syllable blocks, generated by formula into the
range U+ac00 to U+d7a3.
Kinya also uses such a formula to generate its script, and I'm sure many other
constructed block scripts will make their way into the quasi-official
Conscript Registry. I'm even working on one of my own.
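
The Hangul formula mentioned above is simple enough to show directly (a quick sketch; the constants come straight from the ranges in the comment):

```python
# A syllable block is computed from its jamo indices, not looked up:
#   codepoint = 0xAC00 + (lead * 21 + vowel) * 28 + trail
LEADS, VOWELS, TRAILS = 19, 21, 28   # 28 = 27 trailing consonants + "none"

def hangul_syllable(lead: int, vowel: int, trail: int = 0) -> str:
    assert 0 <= lead < LEADS and 0 <= vowel < VOWELS and 0 <= trail < TRAILS
    return chr(0xAC00 + (lead * VOWELS + vowel) * TRAILS + trail)

# 19 * 21 * 28 == 11,172 blocks fill U+ac00..U+d7a3 with nothing left over;
# the last block, at indices (18, 20, 27), lands exactly on U+D7A3.
```
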

In fact, rather than filling up U+f0000 to U+10ffff, such conscripts only need
to fill up the first quarter of it (i.e. U+f0000 to U+f7fff) for Unicode to
run out of private use space, because the remainder (U+f8000 to U+10ffff) is
needed for a second-tier surrogate system (see
[https://github.com/gavingroovygrover/utf88](https://github.com/gavingroovygrover/utf88)
) to extend the codepoint space back up to 1 billionish codepoints as it was
originally specified by Pike and Thompson until it was clipped back down to 1
million in 2003.

So Unicode is not "plenty to encode" or "big enough to allow for" all known,
future, or private-use characters.
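
For what it's worth, the "about 137,000" figure checks out from the ranges given above:

```python
bmp_pua  = 0xF8FF - 0xE000 + 1     # U+e000..U+f8ff    ->   6,400
supp_pua = 0x10FFFF - 0xF0000 + 1  # U+f0000..U+10ffff -> 131,072
print(bmp_pua + supp_pua)          # 137472 private-use codepoints
```
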

~~~
jekub
> see
> [https://github.com/gavingroovygrover/utf88](https://github.com/gavingroovygrover/utf88)

This is the most stupid way to extend UTF-8 I've seen. The only acceptable
solution is to remove the restriction to four bytes per sequence, which would
allow these code points to be encoded easily while keeping all the advantages
of UTF-8.

Doing it their way adds an additional layer of encoding, and with it a lot of
complexity and room for bugs.

It was probably done for compatibility, but a lot of software will do bad
things with these new "surrogate" pairs, so this solution is not really more
compatible in practice. And updating software to handle UTF-8 sequences
longer than 4 bytes is a lot easier than updating it to handle such an
encoding.
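
For reference, the pre-2003 (RFC 2279) generalization being argued for here just continues the same lead-byte prefix scheme out to 5- and 6-byte sequences, covering code points up to 2^31 - 1. A rough sketch:

```python
def utf8_encode_extended(cp: int) -> bytes:
    """Encode a code point with RFC 2279 semantics (1 to 6 bytes)."""
    if cp < 0x80:
        return bytes([cp])
    # (sequence length, lead-byte prefix, exclusive upper limit)
    for nbytes, lead, limit in [(2, 0xC0, 0x800), (3, 0xE0, 0x10000),
                                (4, 0xF0, 0x200000), (5, 0xF8, 0x4000000),
                                (6, 0xFC, 0x80000000)]:
        if cp < limit:
            out = []
            for _ in range(nbytes - 1):
                out.append(0x80 | (cp & 0x3F))  # continuation bytes
                cp >>= 6
            out.append(lead | cp)               # lead byte
            return bytes(reversed(out))
    raise ValueError("code point out of range")
```

For code points up to U+10FFFF this produces exactly the same bytes as standard UTF-8; only the 5- and 6-byte forms beyond that are non-standard today.
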

~~~
vorg
> The only acceptable solution is to remove the restriction of using only four
> byte per sequence which would allow to encode these easily keeping all the
> advantages of UTF-8

I agree. Extending UTF-8 with surrogates like this is intended to be
temporary, only used until the pre-2003 2.1 billion codepoint limit for UTF-8
and UTF-32 is reinstated by the Unicode Consortium. Then any software using
UTF-88 can easily swap the encoding to the 1 to 6-byte sequences in
"reinstated" UTF-8. This surrogation scheme is actually intended for UTF-16 to
use as a second-tier surrogate scheme so it can encode the same number of
codepoints as UTF-8 and UTF-32. I wrote all this under "Rationale" at the
bottom of the linked page, did you read that far?

Hopefully, though, UTF-16 will be on its way out when pre-2003 UTF-8 and
UTF-32 are reinstated so this surrogation scheme wouldn't even see much use
there.

~~~
jekub
But "temporary" is a thing that exists only in theory. In practice it's
always either never or (almost) forever. As soon as a few applications start
using this "new" form of UTF-8, some of them may have to keep supporting it
forever.

Why not go directly for the pre-2003 UTF-8 encoding? It would even put a bit
of pressure toward restoring it, and would show that this is the right way.
It is also, I think, the only way to convince people to start implementing it.

~~~
vorg
> As soon as a few applications start using this "new" form of UTF-8, some of
> them may have to keep supporting it forever

Not if it's used through a 3rd-party library such as the Go implementation of
UTF-88 I've provided.

> Why not directly going for the pre-2003 UTF-8 encoding ? It would even put a
> bit of pressure for restoring them

Because it's not a valid encoding under the current scheme, whereas using
surrogates with UTF-8 _is_ , using as it does the 2 private use planes to
implement the surrogates. The goal is for restoration by the Unicode
Consortium, but based on their public utterances it's not going to happen
easily or quickly, and in the meantime we need an encoding that's valid under
the current scheme because it may need to be used for 10 or 20 years. Of
course I could have used UTF-16 with a doubly-directed surrogate system but
that would be even more error-prone, and I expect whatever 2nd-level surrogate
system is eventually provided with UTF-16 will be legally available with UTF-8
and UTF-16 anyway.

UTF-88 is an attempt to showcase both a surrogation scheme implementable in
current UTF-16 _and_ the fact that UTF-8 is the best encoding.

------
keedot
Interesting that I actually don't need to know this stuff. I think you'll find
that MOST developers actually don't need to know this stuff. People seem to
forget that the vast bulk of developers do corporate and in-house development,
in a single language: English.

I know this stuff because I like to understand how things work, but for all
the devs under me, there are probably a thousand concepts I want them to
understand before they start tackling encoding beyond knowing when to call
the correct function.

~~~
vanous
Being from a non-English-speaking country, I have had to deal with many
instances of issues caused by this way of thinking. Before Unicode happened to
be widely applied, using Linux was 100% harder for us. So sometimes it may
seem that some things aren't fundamental, but they might actually be - you
simply don't know it yet.

------
imaok
One thing I'm still confused about. What exactly is happening when you copy
paste some text from one app to another? What encoding will the copied text be
in?

------
JoachimS
Good, gentle introduction that goes through everything step by step. Turns to
PHP at the end.

------
carsonreinke
...never assume one byte per glyph

------
SilasX
Sorry, but now I reflexively flag on sight any instance of this clickbaity,
obviously overstated "every programmer needs to know about semiconductor
opcodes/mainframe architecture/etc".

------
SFjulie1
PHP devs are so slow they only just adopted utf8 and saw its glory.

I myself "UTF8 or die!"d a long time ago and discovered it was not a good
idea.

I will skip the problems of parsing the nth character, string length vs
memory used, and the canonicalization of strings for comparison, and go
directly to 2 problems:

* There exist cases in which latin1 & utf8 are mangled in the same string (e.g. HTTP/SIP headers are latin1 while the content can be utf8, and you may want to store a full HTTP/SIP transaction verbatim for debugging purposes). Such data can be stored in iso-latin3 (the code table for Esperanto, to be sarcastic) but will explode in utf8 unless you re-encode it (B64).

* tools are only partly UTF8 compliant: mysql (which is as good as PHP in terms of quality) is clueless about UTF8 (hint: indexes and collations), and PHP too [https://bugs.php.net/bug.php?id=18556](https://bugs.php.net/bug.php?id=18556) <\--- THIS BUG TOOK 10 YEARS TO BE CLOSED

The whole point is that developers don't understand the organic nature of
culture, especially of its writing, and the diversity of cultures.

They think that because some rules apply in their language they also apply in
others. BUT:

* PHP devs: the lowercase of I is not always i (it can be ı, i without a dot). It took the devs 10 years to find where their bug was!

* shortening a memory representation does not always shorten its graphical representation (Apple's bug with SMS in Arabic)

* sort orders are not immutable (collation can vary not only from language to language but also according to the administrative context (e.g. proper names in French))

* inflections are hell, and text size for error messages varies a lot (hence the instability of Windows 95 in French: error messages were copied into a reserved page whose fixed size was less than the size of the domain's corpus, so any contiguous block in memory (lower xor upper bound) could have its memory corrupted)
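
The Turkish-I point above is easy to demonstrate: ASCII case rules are not universal, and a correct Turkish lowering needs explicit handling (a toy sketch; real code would use a locale-aware library such as ICU):

```python
def lower_tr(s: str) -> str:
    # In Turkish, 'I' lowercases to dotless 'ı' and 'İ' to plain 'i';
    # handle both before falling back to the default lowering.
    return s.replace("I", "ı").replace("İ", "i").lower()

print("ISTANBUL".lower())    # istanbul  (wrong for Turkish)
print(lower_tr("ISTANBUL"))  # ıstanbul  (dotless ı, as expected in Turkish)
```
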

My point is that UTF8 is not hell. The real world is complex. And it becomes
hell when some dumb devs think that by manipulating strings that can
represent any language, they know about every language... and apply
"universal" rules that are not.

Some problems can be solved by ignoring them. But with culture it is not the
case.

And actually, unicode SUX because it is US-centric:

* computers should be able to store all our past books and make them available for the future, even in verbatim mode. But unicode HAS no archaeological character sets like THIS [https://fr.wikipedia.org/wiki/S_long](https://fr.wikipedia.org/wiki/S_long). I don't care about the USA's lack of history. I see no use in the computer if it requires sacrificing our histories and cultures.

* [https://modelviewculture.com/pieces/i-can-text-you-a-pile-of...](https://modelviewculture.com/pieces/i-can-text-you-a-pile-of-poo-but-i-cant-write-my-name) some people cannot even use it in their own language

Unicode suffers a lot of problems, plus a conceptual one: it has immutable
characters AND directives (change the direction of the writing, apply
ligatures)... which not only creates security concerns (one of the funniest
being the possibility, by appending to a string, of silently re-editing text
already on an output device (screen or printer))... We are introducing
typesetting rules into unicode.

For those who have used TeX for a long time, the non-separation of the
almost-programmatic typography from the graphemes is like not separating the
model and the controller.

Which actually also calls for the view (the output) and thus the fonts.
Having the encoding of the long s does not tell you what it looks like unless
you have a canonical representation of the codepoint as a grapheme.

And since we are printing/creating documents for legal purposes, we may want
to control the view, to ensure that the rendering of the string will not
alter its graphical representation in a way that compromises its meaning. If
someone signs in a box, you don't want the signature to alter the
representation anywhere, or worse, without notice.

The devil lies in the details. Unicode is a Babel tower that may well crash
for the same reason as in the Bible: hubris.

