
What programmers need to know about encodings and charsets (2011) - neiesc
https://kunststube.net/encoding/
======
at_a_remove
I was looking for the catch. Here it is: "It's really simple: _Know_ what
encoding a certain piece of text, that is, a certain byte sequence, is in,
then interpret it with that encoding."

That's like "knowing" the truth. How?

I have received some very interesting files that made Python yack unicode
errors, again and again. Why? Not only did I not "know" what encoding it was
in -- _the encodings changed_ at different points in the stream of bytes. I
call this "slamming bytes together" because somewhere along the line,
someone's program did exactly that.

Everything is simple -- until it isn't.

~~~
banthar
There is nothing you can do with a text file in an unknown encoding but treat it
as an array of bytes.

If you start guessing the encoding, at best it won't work in some cases, at
worst you are introducing security vulnerabilities. You can try, but there is
just no way to do it right.

[http://michaelthelin.se/security/2014/06/08/web-security-cross-site-scripting-attacks-using-utf-7.html](http://michaelthelin.se/security/2014/06/08/web-security-cross-site-scripting-attacks-using-utf-7.html)

~~~
tialaramex
Generally you can safely treat text in an unknown encoding as UTF-8. Since
you're expecting potential failures but want to press on anyway instead of
raising an exception or error, you treat invalid sequences as U+FFFD, the
Replacement Character, just as you would in a language or API with no exception
reporting mechanism.
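
A minimal sketch of that in Python, assuming the bytes arrived from somewhere
untrusted (the sample bytes here are made up for illustration):

    # Decode bytes of unknown encoding as UTF-8, mapping any invalid
    # sequence to U+FFFD instead of raising UnicodeDecodeError.
    raw = b"caf\xc3\xa9 \xff plain ASCII"   # valid UTF-8 "café" plus one stray byte

    text = raw.decode("utf-8", errors="replace")
    print(text)   # café � plain ASCII: the stray 0xff became U+FFFD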

There are lots of pleasing aspects to this choice. It's ASCII compatible of
course, so anything that was actually ASCII is still ASCII, anything that was
_almost_ ASCII is just ASCII with U+FFFD where it deviated.

The replacement character resolutely isn't any of the specific things, nor any
of the generic classes of thing you might be expected to treat differently for
security reasons. It isn't a number, or a letter (of either "case"), it isn't
white space, and it certainly isn't any of the separators, escapes or quote
markers like ? or \ or + or . or _ or...

... yet it is still inside the BMP so it won't trigger weird (perhaps less
well tested) behaviour for other planes.
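
You can check that claim directly with Python's unicodedata module; U+FFFD's
general category is So (Symbol, other), so it never looks like a letter, digit,
or separator to category-based code:

    import unicodedata

    ch = "\ufffd"                       # U+FFFD REPLACEMENT CHARACTER
    print(unicodedata.category(ch))     # So (Symbol, other)
    print(ch.isalnum(), ch.isspace())   # False False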

It's self-synchronising. If something goes wrong somehow, the decoder will
resynchronise properly within a few bytes of UTF-8 or ASCII-compatible input;
you never end up "out of phase" as can happen with some other encodings.

Most usefully, whatever you're butted up against now works with UTF-8. Maybe
some day that'll get formally documented, maybe it won't. As the years drag on,
the chances of specifying _anything else_ shrink further, and the de facto
popularity of UTF-8 means that even if it's never formalised anywhere, everybody
will just assume UTF-8 anyway and you won't have to lift a finger.

~~~
lstamour
I probably don't need to say this, but it all depends. Many operating systems
and GUI frameworks internally use UTF-16 because it was more common when they
were built. Lots of old files use really obscure encodings. Sometimes you get
a UTF BOM to identify UTF-16 and UTF-32, other times you don't. Then there are
the pesky ways you can encode characters with HTML or XML entities, the
occasional double-encoding of such, and so on.
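
When a BOM does show up, sniffing it is the easy part; a rough sketch in Python
(a real detector such as chardet does far more, and a missing BOM tells you
nothing):

    import codecs

    def sniff_bom(data):
        """Guess an encoding from a leading BOM, or return None."""
        # Check the longer UTF-32 marks first, since the UTF-32-LE BOM
        # begins with the same two bytes as the UTF-16-LE BOM.
        if data.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
            return "utf-32"
        if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
            return "utf-16"
        if data.startswith(codecs.BOM_UTF8):
            return "utf-8-sig"
        return None

    print(sniff_bom(b"\xff\xfe\x00\x00..."))  # utf-32
    print(sniff_bom(b"hello"))                # None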

When I worked with library records, I had to deal with text encodings that
pre-dated SQL, though I suppose I should be thankful that ASCII existed by
then, so they were mostly ASCII compatible. Even today there are systems
(MARC-21) designed to output MARC-8 and fall back to UTF-8 only when a
character isn't available in MARC-8, instead of just using UTF-8.

I'll admit though, outside of MARC-8 and the various Unicode encodings, I'm
having trouble thinking up systems that would still be incompatible today. Old
documents, yes, absolutely would be encoded in different charsets, and Windows
still generally defaults to its Latin-1 variant (Windows-1252) if I recall
correctly, but most systems today do expect UTF-8 over the network at least,
and UTF-16 for display perhaps...

Don't get me started on line endings though, and how many files use one, both,
more than one ... and especially how much fun it can be with git repos cross-
platform, or when automated tools use platform default line endings when they
should be configurable, etc. CSV files that aren't properly escaped are also a
special mini hell...

Data is never easy. :) And that's assuming it's written correctly -
[https://rachelbythebay.com/w/2020/08/11/files/](https://rachelbythebay.com/w/2020/08/11/files/)

~~~
lstamour
Wanted to amend this list of character encodings with another one I came
across recently -- GSM-7, used for SMS.
[https://en.wikipedia.org/wiki/GSM_03.38#GSM_7-bit_default_alphabet_and_extension_table_of_3GPP_TS_23.038_.2F_GSM_03.38](https://en.wikipedia.org/wiki/GSM_03.38#GSM_7-bit_default_alphabet_and_extension_table_of_3GPP_TS_23.038_.2F_GSM_03.38)
If a message you send includes Unicode characters outside that alphabet,
including emoji, it will cost more to send and will use UCS-2 encoding (which
later became UTF-16).
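
A rough sketch of the kind of check an SMS gateway does, with a deliberately
incomplete stand-in for the alphabet (the set below is only an illustrative
subset of GSM 03.38, not the real table):

    # Illustrative subset of the GSM 03.38 basic alphabet -- NOT the full
    # table; see the Wikipedia link above for the real one.
    GSM7_SUBSET = set(
        "@£$¥ \n\r"
        "!\"#%&'()*+,-./0123456789:;<=>?"
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "abcdefghijklmnopqrstuvwxyz"
    )

    def needs_ucs2(message):
        """True if any character falls outside the GSM-7 alphabet, which
        forces the whole SMS into UCS-2 and drops a segment from 160 to
        70 characters."""
        return any(ch not in GSM7_SUBSET for ch in message)

    print(needs_ucs2("Hello"))     # False: plain GSM-7
    print(needs_ucs2("Hello 😀"))  # True: the emoji forces UCS-2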

------
jbandela1
Note: This post is basically a TLDR of
[https://www.theregister.com/2013/10/04/verity_stob_unicode/](https://www.theregister.com/2013/10/04/verity_stob_unicode/)
by Verity Stob.

One of the reasons there is a lot of confusion about encodings vs Unicode is
that Unicode was initially an encoding. It was thought that 65K characters were
enough to represent all the characters in actual use across languages, and thus
you just needed to change from an 8-bit char to a 16-bit char and all would be
well (apart from the issue of endianness). Thus Unicode initially specified
what each symbol would look like encoded in 16 bits (see
[http://unicode.org/history/unicode88.pdf](http://unicode.org/history/unicode88.pdf),
particularly section 2). Windows NT, Java, and ICU all embraced this.

Then it turned out that you needed a lot more characters than 65K, and instead
of each character being 16 bits, you would need 32-bit characters (or else
weird 3-byte data types). Whereas people could justify going from 8 bits to
16 bits as the cost of not having to worry about charsets, most developers
balked at 32 bits for every character. In addition, you now had a bunch of
early adopters (Java and Windows NT) that had already embraced 16-bit
characters. So encodings such as UTF-16 were hacked on (surrogate pairs of
16-bit code units for some Unicode code points).
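
The surrogate-pair arithmetic itself is small; a quick sketch in Python for one
code point outside the BMP:

    # How UTF-16 encodes a code point above U+FFFF as a surrogate pair.
    cp = 0x1F600                     # 😀, outside the BMP
    v = cp - 0x10000                 # leaves a 20-bit value
    high = 0xD800 + (v >> 10)        # high (lead) surrogate
    low = 0xDC00 + (v & 0x3FF)       # low (trail) surrogate
    print(hex(high), hex(low))               # 0xd83d 0xde00
    print("😀".encode("utf-16-be").hex())    # d83dde00, the same pair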

I think, if it had been understood better at the start that there are a lot
more characters than will fit in 16 bits, then something like UTF-8 would
likely have been chosen as the canonical encoding and we could have avoided a
lot of these issues. Alas, such is the benefit of 20/20 hindsight.

~~~
naniwaduni
> see
> [http://unicode.org/history/unicode88.pdf](http://unicode.org/history/unicode88.pdf),
> particularly section 2

It's fascinating to see how people can arrive at the answer No and conclude
that the answer is Yes.

------
sgopalra
Interesting article from Joel Spolsky on Unicode and character sets:
[https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/](https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/)

------
dang
If curious see also

2015:
[https://news.ycombinator.com/item?id=9788253](https://news.ycombinator.com/item?id=9788253)

2012:
[https://news.ycombinator.com/item?id=4771987](https://news.ycombinator.com/item?id=4771987)

------
UpdatedFolders
I personally had a good time re-reading this over and over again when I was
migrating from Python 2 to Python 3; it's a great resource:
[http://farmdev.com/talks/unicode/](http://farmdev.com/talks/unicode/)

------
neiesc
I think it doesn't explore the UTF-8 BOM:
[https://en.wikipedia.org/wiki/Byte_order_mark](https://en.wikipedia.org/wiki/Byte_order_mark)
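
For what it's worth, Python treats the UTF-8 BOM (EF BB BF) as just another
character unless you ask for the utf-8-sig codec, which strips it on decode:

    data = b"\xef\xbb\xbfhello"              # UTF-8 BOM followed by ASCII text
    print(repr(data.decode("utf-8")))        # '\ufeffhello', BOM kept as U+FEFF
    print(repr(data.decode("utf-8-sig")))    # 'hello', BOM stripped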

------
ExtremisAndy
I love C++ so much, and it has brought me such joy as a hobbyist programmer,
but good grief, this one aspect of it (dealing with encodings & charsets) is
so depressing I just want to cry sometimes.

------
nunez
F to pay respects to everyone who got wrecked by the BOM (byte-order mark) and
CRLF vs LF.

