
Code Pages, Character Encoding, Unicode, UTF-8 and the BOM [video] - soheilpro
https://www.hanselman.com/blog/ComputerThingsTheyDidntTeachYouInSchool2CodePagesCharacterEncodingUnicodeUTF8AndTheBOM.aspx
======
hmottestad
This video is about “Code Pages, Character Encoding, Unicode, UTF-8 and the
BOM”.

I don’t know about everyone else, but I had to implement an ISO 8859 encoding
to UTF-8 converter in assembly when I was at uni around 8-10 years ago. So
this is standard stuff for most developers who graduate from the University of
Oslo.
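
The exercise reduces to a few lines once you know the bit patterns; a sketch in Python rather than assembly (my illustration, not the course's actual code):

```python
def latin1_to_utf8(data: bytes) -> bytes:
    """Hand-rolled ISO 8859-1 (Latin-1) to UTF-8 converter."""
    out = bytearray()
    for b in data:
        if b < 0x80:
            out.append(b)                  # ASCII bytes pass through unchanged
        else:
            out.append(0xC0 | (b >> 6))    # leading byte: 110xxxxx
            out.append(0x80 | (b & 0x3F))  # continuation byte: 10xxxxxx
    return bytes(out)

# This works because Latin-1 byte values equal Unicode code points U+0000..U+00FF.
assert latin1_to_utf8(b"bl\xe5b\xe6r") == "blåbær".encode("utf-8")
```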

~~~
lmilcin
Converting to UTF-8 from a simple codepage like ISO 8859 is easy. The difficult
part is parsing UTF-8 with all its various behaviour-changing characters.

Take the BOM as an example.

I work on backend Java projects for large banks. Over the years I have fought
with the BOM on numerous occasions. For some reason 95% of software that claims
to be UTF-8 compliant is not. My modus operandi for dealing with it is to
remove the BOM on ingestion and only add it back when the string leaves the
system, and only when we know the outside party absolutely requires it (though
it should not...)
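
That ingestion-side stripping can be sketched in Python rather than Java; Python's built-in `utf-8-sig` codec does exactly this strip-on-ingest, add-on-egress dance:

```python
raw = b"\xef\xbb\xbfkey=value"  # UTF-8 bytes with a leading BOM (EF BB BF)

# Decoding as plain UTF-8 keeps the BOM as U+FEFF at the start of the string.
assert raw.decode("utf-8") == "\ufeffkey=value"

# The utf-8-sig codec strips a leading BOM on ingestion, if one is present...
assert raw.decode("utf-8-sig") == "key=value"

# ...and adds it back on encoding, for outside parties that insist on it.
assert "key=value".encode("utf-8-sig") == raw
```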

~~~
tialaramex
People say if the only tool you have is a hammer, everything looks like a nail
- but when the only tool you have is a left-handed can opener life gets
_really weird_ and that's how we got the UTF-8 BOM

UTF-8 BOM is largely a Microsoft idea, they've got a bunch of code that thinks
in UCS-2 (now retrofitted to more or less pretend it knows UTF-16) and so it
thinks about byte order when decoding text files, and from there a Byte Order
Mark in files that don't have byte ordering seems like a reasonable idea.
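
The byte-order point is easy to make concrete: the same U+FEFF character comes out differently per encoding, and only in UTF-16 does it actually reveal an order (a quick Python check):

```python
bom = "\ufeff"  # ZERO WIDTH NO-BREAK SPACE, doing duty as the byte order mark

assert bom.encode("utf-16-be") == b"\xfe\xff"    # big-endian UTF-16
assert bom.encode("utf-16-le") == b"\xff\xfe"    # little-endian: the two bytes swap
assert bom.encode("utf-8") == b"\xef\xbb\xbf"    # UTF-8: always these three bytes
```

In UTF-16 the first two bytes really do tell you the byte order; in UTF-8 the sequence is identical on every platform, so there is no order to mark.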

If the files actually _mean_ something then a UTF-8 BOM just introduces
confusion. Lots of code I'm responsible for processes UTF-8 just fine, but if
it handles say files full of key = value pairs and your file begins with a
BOM, well, OK then, that first key starts with U+FEFF, weird choice but no
reason we should disallow that. And of course that isn't what you wanted and
so now Windows users are complaining I'm not "compatible".
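
A minimal sketch of that failure mode in Python (the `parse_line` helper is hypothetical, not the poster's actual code):

```python
def parse_line(line: str) -> tuple[str, str]:
    # Split "key = value", trimming only the whitespace this format allows.
    key, _, value = line.partition("=")
    return key.strip(" \t"), value.strip(" \t")

# A Windows editor saved this file with a UTF-8 BOM.
text = b"\xef\xbb\xbfname = alice\n".decode("utf-8")

key, value = parse_line(text.splitlines()[0])
assert key == "\ufeffname"   # the first key silently starts with U+FEFF
assert value == "alice"
assert key != "name"         # so lookups for "name" fail
```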

~~~
marcosdumay
Shouldn't U+FEFF be whitespace?

~~~
tialaramex
Sure. So what? Text file formats aren't magically obliged to ignore leading
whitespace just because that suits Microsoft. If my format would consider
U+0009 TAB or U+0020 SPACE to be part of the key when placed at the start of
the key, why not U+FEFF?

------
UglyToad
This was really nicely done and a really good initiative. I got into software
development by teaching myself Python (badly) and PHP (badly), and it's
probably quicker to list the things I do know than the cavernous gaps in my
knowledge.

For most software development jobs you can get by without knowing this stuff
but it's great there are things like this to clearly explain fundamentals that
are either assumed knowledge or communicated in (to an outsider) gatekeeping
levels of dense terminology.

------
teddyh
See also: _The Absolute Minimum Every Software Developer Absolutely,
Positively Must Know About Unicode and Character Sets (No Excuses!)_

[https://www.joelonsoftware.com/2003/10/08/the-absolute-minim...](https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/)

~~~
planteen
I just read this over and it's a very dated, Windows-centric view, with
several glaring errors: it glosses over the difference between UCS-2 and
UTF-16, makes no mention of surrogate pairs for UTF-16 (it assumes only 65k
code points), says UTF-8 can be up to 6 bytes (no it can't; this was proposed
but never standardized), suggests ASCII standardization dates to the 8088
(it's much older), mentions UTF-7 (don't), and says nothing about wchar_t
changing size across platforms, Han unification, shaping, or normalization.

~~~
bloak
RFC 2279 says: "In UTF-8, characters are encoded using sequences of 1 to 6
octets." That's not technically a standard, but it was widely implemented.

~~~
mark-r
UTF-8 was originally designed to handle code points up to a full 31 bits
(U+7FFFFFFF). It wasn't until later that the code point range was restricted
to U+10FFFF, so that 4 octets would be sufficient.
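
Under today's restricted rules the lengths can be checked directly in Python; nothing needs more than four bytes:

```python
# Highest code point reachable at each UTF-8 sequence length today.
assert len(chr(0x7F).encode("utf-8")) == 1       # ASCII range
assert len(chr(0x7FF).encode("utf-8")) == 2
assert len(chr(0xFFFF).encode("utf-8")) == 3     # top of the BMP
assert len(chr(0x10FFFF).encode("utf-8")) == 4   # Unicode's ceiling

# Python won't even construct a code point beyond U+10FFFF.
try:
    chr(0x110000)
except ValueError:
    pass
```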

------
joshbaptiste
Brian Will's video covers code pages succinctly
[https://www.youtube.com/watch?v=IM-vnsnLGd4](https://www.youtube.com/watch?v=IM-vnsnLGd4)

------
Rels
That's the first time I've heard about UTF-8 files sometimes having a BOM, so
it's nice to learn something new. :)

I'm wondering if it's widely used.

~~~
davidwtbuxton
I've seen UTF-8 with a BOM while consuming data when integrating with strongly
Windows-centric environments. Relatively uncommon, but does happen. And it is
very annoying!

~~~
C1sc0cat
It used to, and maybe still does, cause problems with how Google parses
robots.txt files!

Which is why all my robots.txt files have a comment on the first line.

~~~
Someone1234
> Which is why all my robots.txt files have a comment on the first line.

That doesn't stop a BOM being generated or consumed.

~~~
YSFEJ4SWJUVU6
BOM is only a problem with strict syntaxes, which robots.txt is not an example
of. If the "consumer" simply ignores invalid or meaningless lines, you can
avoid issues from invisible characters by not having anything meaningful on
the first line of your file.

------
finchisko
What is the vscode hex extension Scott is using?

~~~
hutattedonmyarm
Looks like hexdump:
[https://marketplace.visualstudio.com/items?itemName=slevesqu...](https://marketplace.visualstudio.com/items?itemName=slevesque.vscode-hexdump)

------
mark-r
I hate it when a page consists of nothing but a video. I like to take things
in at my own pace by reading. There should be a warning in the title.

~~~
dang
Ok, we've added that now.

