
UTF-8 Everywhere - pcr910303
http://utf8everywhere.org/
======
legulere
> In the UNIX world, narrow strings are considered UTF-8 by default almost
> everywhere. Because of that, the author of the file copy utility would not
> need to care about Unicode

It couldn’t be further from the truth. Unix paths don’t need to be valid UTF-8,
and most programs happily pipe that mess through into contexts that expect
valid text. (Windows filenames don’t have to be proper UTF-16 either.)

Rust is one of the few programming languages that correctly doesn’t treat file
paths as strings.

~~~
DannyB2
> Rust is one of the few programming languages that correctly doesn’t treat
> file paths as strings.

Imagine if languages allowed subtypes of strings which are not directly
assignment compatible.

HtmlString

SqlString

String

A String could be converted to HtmlString not by assignment, but through a
function call, which escapes characters that the browser would recognize as
markup.

Similarly a String would be converted to a SqlString via a function.

It would be difficult to accidentally mix up strings because they would be
assignment incompatible without the functions that translate them.
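
A minimal sketch of that idea in C++ (hypothetical names; a real implementation
would escape more, this is just the shape of it):

    #include <string>

    // Distinct type: a plain std::string cannot be assigned to it directly.
    struct HtmlString {
        std::string value;
    };

    // The only way to turn a raw string into an HtmlString is through this
    // function, which escapes the characters a browser would read as markup.
    HtmlString escape_html(const std::string& raw) {
        std::string out;
        for (char c : raw) {
            switch (c) {
                case '&':  out += "&amp;";  break;
                case '<':  out += "&lt;";   break;
                case '>':  out += "&gt;";   break;
                case '"':  out += "&quot;"; break;
                default:   out += c;
            }
        }
        return HtmlString{out};
    }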

There could be mixed "languages" within a string. Like a JSP or PHP that might
contain scripting snippets, and also JavaScript and CSS snippets, each with
different syntax rules and escaping conventions.

~~~
mhh__
Allowed you to? You could do that in C++ quite happily; it's just not useful
enough to bother implementing, at least.

~~~
eska
It's absolutely useful enough, it's just that it's awful in C++ due to
language limitations as opposed to other languages such as Haskell, where it
is standard.

~~~
gpderetta
How would it be awful in C++? It seems trivial to do: basic_string is already
templated, and distinct instantiations are not mutually compatible by default.
In fact wstring, u8string, u16string, u32string exist today in the language
simply as distinct instantiations of basic_string. You can create your own by
picking a new char type. Algorithms can be, and are, generic and work on any
string type.
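
A trivial illustration of that incompatibility (my own example):

    #include <string>

    int main() {
        std::string    s = "hello";    // char-based string (often UTF-8 bytes)
        std::u16string u = u"hello";   // UTF-16 code units
        // s = u;  // error: distinct basic_string instantiations don't convert
    }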

------
tialaramex
However, sometimes you're in a layer where ASCII is fine and you should just
be explicit about that.

Server Name Indication (in RFC 3546) is flawed in several ways. It's a classic
unused extension point, for example: it has an entire field for what type of
server name you mean, with only a single value for that field ever defined.
But one flaw that stands out is that it uses UTF-8 encoding rather than
insisting on ASCII for the server name.

You can see the reasoning: international domain names are a big deal, we
should embrace Unicode. But IDNA already needed to handle all this work, and
the DNS A-labels are already ASCII even for IDNs.

Essentially choosing UTF-8 here only made things needlessly more complicated
in a critical security component. Users, the people who IDNs were for, don't
know what SNI is, and don't care how it's encoded.

------
anderspitman
Trying to figure out how to express this without making people mad at me. I
think the conflation of Unicode with "plain text" might be a mistake. Don't
get me wrong, Unicode serves an important purpose. But bumping the version
from plain text 1.0 (ASCII) to plain text 2.0 (Unicode) introduced a ton of
complexity, and there are cases where the abstractions start leaking
(iterating characters etc).

With things like data archival, if I have a hard drive with the library of
congress stored in ASCII, I need half a sheet of paper to understand how to
decode it.

Whereas apparently UTF8 requires 7k words just to explain why it's important.
And that's not even looking at the spec.

Just to be crystal clear, I'm not advocating to not use Unicode, or even use
it less. I'm just saying I think it maybe shouldn't count as plain text, since
it looks a lot like a relatively complicated binary format to me.

~~~
pjscott
_Unicode_ is complicated because the languages it needs to handle are, alas,
complicated. _UTF-8_ is super simple. It's a variable-length encoding for
21-bit unsigned integers. Wikipedia gives a handy table showing how it works:

[https://en.wikipedia.org/wiki/UTF-8#Description](https://en.wikipedia.org/wiki/UTF-8#Description)
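
A rough sketch of what that table says (my own illustration, not from the
comment; it does no validation of surrogates or out-of-range input):

    #include <cstdint>
    #include <string>

    // Encode one code point (up to 21 bits) as 1-4 UTF-8 bytes.
    std::string encode_utf8(std::uint32_t cp) {
        std::string out;
        if (cp < 0x80) {                 // 0xxxxxxx
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {         // 110xxxxx 10xxxxxx
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {       // 1110xxxx 10xxxxxx 10xxxxxx
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                         // 11110xxx + three continuation bytes
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
        return out;
    }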

~~~
carapace
Yeah, this. I have a pat "Unicode Rant" that boils down to this essentially.

Having a catalog of standard numbers-to-glyphs (or symbols or whatever, little
pictures humans use to communicate with) is awesome and useful (and all ASCII
ever was), but trying to digitize all of human language is much much more
challenging.

~~~
tialaramex
But human language doesn't stop being "much much more challenging" if you
decide not to engage.

Sometimes (and this can even be an admirable choice) in some specialist
applications it's acceptable to decide you won't embrace the complexity of
human language. But in a lot of places where that's fine we already did this
with the decimal digits such as in telephone numbers, or UPC/EAN product
codes, so we don't need ASCII.

In most other places insisting upon ASCII is just an annoying limitation, it's
annoying not being able to write your sister's name in the name of the JPEG
file, regardless of whether her name is 林鳳嬌 or Jenny Smith, and it jumps out
at you if the product you're using is OK with Jenny Smith but not 林鳳嬌.

You might think: well, OK, but there weren't these problems with ASCII, so the
complexity is Unicode's fault. Think about Sarah O'Connor. That apostrophe
will often break people's software without any help from Unicode.

~~~
anderspitman
Your sister's name doesn't render in my browser (stable Firefox on Linux 5.6).
I'm sure I'm missing a fontpack or something. Again, I'm not saying ASCII is
the solution, I'm saying Unicode is much more difficult to get right, and
maybe we should call it something other than "plain text", since we already
had a generally accepted meaning for that for many years. I'm usually in favor
of making a new name for a thing rather than overloading an old name.

~~~
tialaramex
Firefox does full font fallback, so this means your system just isn't capable
of rendering her name (which, yes, you might be able to fix if you wanted to
by installing font packages). If you don't understand Han characters, that's
an acceptable situation: the dotted boxes (which I assume rendered instead)
alert you that there is something here you can't display properly, but if you
know you couldn't understand it even if it were displayed, there's no need to
bother.

It really is just plain text. Human writing systems were always this hard, and
"for many years" what you had were separate independent understandings of what
"plain text" means in different environments, which makes interoperability
impossible. Unicode is mostly about having only one "plain text" rather than
dozens.

It is _not_ mandatory that your 80x25 terminal learn how to display Linear B;
you can't read Linear B, and you probably have no desire to learn how and no
interest in any text written in it. But Unicode means your computer agrees
with everybody else's computer that it's Linear B, and not a bunch of symbols
for drawing Space Invaders, or the manufacturer's logo. If you fix a typo in a
document I wrote that has some Linear B in it, your computer doesn't replace
the Linear B with question marks, or erase the document, since it knows what
that is even if you can't read it and it doesn't know how to display it.

------
shpx
What I never see mentioned about Unicode is Han Unification

[https://en.m.wikipedia.org/wiki/Han_unification](https://en.m.wikipedia.org/wiki/Han_unification)

As I understand it, it's impossible to have a txt file that uses Japanese and
Chinese characters at the same time. The file will either use the Chinese or
Japanese forms of the characters, depending on your font. I would think this
is a big gotcha people must run into all the time, but I never hear anyone
talk about it.

~~~
klodolph
I’m not going to try and minimize the problem here. Han unification was
pushed through by Western interests, by my understanding.

However, most Unicode characters are identical or nearly identical in Chinese
and Japanese. Characters with “significant” visual differences got encoded as
different Unicode characters. The same thing applies to simplified and
traditional Chinese characters.

So for a given “Han character”, there might be between one and three different
Unicode characters, and there might be between one and three different ways of
writing it.

Here’s an illustration:
[https://japanese.stackexchange.com/questions/64590/why-are-japanese-fonts-different-to-chinese](https://japanese.stackexchange.com/questions/64590/why-are-japanese-fonts-different-to-chinese)

So the issue does come up when mixing Chinese and Japanese text, but it’s not
really one that has a big impact on the _legibility_ of the text. You would
definitely be concerned, though, if you were writing a Japanese textbook for
Chinese students, or vice versa.

Beyond that, it is usually fairly trivial to distinguish between Japanese and
Chinese text, so you could just lean on simple heuristics to get the work done
(Japanese text, with the exception of fairly ancient text or very short
fragments, contains kana, but Chinese does not).
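
A hypothetical version of that heuristic (my own sketch; it takes decoded code
points, not raw UTF-8 bytes, and only checks the kana blocks):

    #include <cstdint>
    #include <vector>

    // Japanese text (outside very short or ancient fragments) contains
    // hiragana (U+3040-U+309F) or katakana (U+30A0-U+30FF); Chinese doesn't.
    bool looks_japanese(const std::vector<std::uint32_t>& codepoints) {
        for (std::uint32_t cp : codepoints)
            if (cp >= 0x3040 && cp <= 0x30FF)
                return true;
        return false;
    }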

~~~
cygx
_Han unification was pushed through by western interests, by my
understanding._

Note that as far as I'm aware, the interest in question was the initial 16-bit
limit of the character set and later on the non-proliferation of competing
standards.

Also note that while Han unification is the most prominent example, there are
technically similar cases, which just aren't as charged culturally. For one,
Unicode doesn't encode German Fraktur: While some characters are available due
to their use in mathematics, it's lacking the corresponding variants of ä, ö,
ü, ß, ſ as well as specific ligatures. So if you want to intermix modern with
old German writing, you'll also have to go out-of-band.

~~~
anoncake
That's not the same thing. Fraktur is just a style of fonts; antiqua and
fraktur letters are semantically the same.

~~~
tialaramex
It's actually exactly the same thing. The Han Unification didn't smash
together unrelated squiggles that just happened to look similar; they were
semantically the same. Scholars of the Han writing system spent a bunch of
time deciding what is or is not the same squiggle just drawn differently, like
Fraktur, and today people are annoyed because, as you'd expect, some of them
believed that "style of fonts" was integral to the meaning anyway.

~~~
anoncake
Chinese characters represent the Chinese words or parts thereof, Japanese ones
represent Japanese words and parts thereof. That is a semantic difference.

~~~
tialaramex
So what you're saying is that because 'chat' in English and 'chat' in French
are quite different words with very different meanings, you believe there
should be a separate letter 'c' for English and French to enable us to tell
those words apart?

~~~
anoncake
The Latin alphabet is not logographic.

~~~
zajio1am
It is not logographic, but characters still have meaning - their associated
phonemes. Although this is less clear in English, it is emphasized in other
languages.

And this mapping is different between languages. So 'c' in English has
different meaning to 'c' in Czech.

~~~
anoncake
Not really. Morphemes are considered (defined, even) as the smallest units
that have meaning by themselves.

------
rakoo
Maybe it's time for MySQL to make "utf8" actually mean UTF-8 then
([https://medium.com/@adamhooper/in-mysql-never-use-utf8-use-utf8mb4-11761243e434](https://medium.com/@adamhooper/in-mysql-never-use-utf8-use-utf8mb4-11761243e434))

~~~
smacktoward
They probably couldn't even if they wanted to; by this point there will be too
much software out there depending on "utf8" meaning "MySQL's weird proprietary
hacked-up version of UTF-8".

The only real solution is to hammer home the message that "utf8mb4" is what
you put into MySQL if you want UTF-8.

~~~
cosarara
There are actual problems too when switching from utf8mb3 to utf8mb4, because
of the maximum varchar length in indices:
[https://stackoverflow.com/questions/48500355/mysql-character-set-utf8mb4-varchar-length](https://stackoverflow.com/questions/48500355/mysql-character-set-utf8mb4-varchar-length)

------
GnarfGnarf
I came to the same conclusion years ago. My app is Win32, but I never defined
UNICODE or used the TCHAR abomination. All strings are stored as UTF8 until
they are passed to Win32 APIs, whereupon they are converted to UCS-2. I
explicitly call the wchar version of functions (ex: TextOutW). This strategy
enabled me to transition easily and safely from single-byte ASCII (Windows
3.1) to Unicode.

The database is also UTF8.

~~~
malkia
Calling the "A", instead of "W" functions might be some small perf hit (don't
know if it matters), but for some functionality you need to call the "W"
functions, for example to break the limit of 256 or was it 260 characters, up
to 32768 (or was it 16384).

:)
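
A sketch of the long-path case (my own example, error handling omitted): the
"W" functions accept the \\?\ prefix, which is what lifts the MAX_PATH limit.

    #include <string>
    #include <windows.h>

    // The \\?\ prefix tells CreateFileW to accept paths up to roughly
    // 32767 characters instead of MAX_PATH (260).
    HANDLE open_long_path(const std::wstring& absolute_path) {
        std::wstring p = L"\\\\?\\" + absolute_path;
        return CreateFileW(p.c_str(), GENERIC_READ, FILE_SHARE_READ, nullptr,
                           OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
    }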

------
dpc_pw
> For instance, ‘ch’ is two letters in English and Latin, but considered to be
> one letter in Czech and Slovak.

Is "ch" really considered one _character_ in Czech and Slovak? I'm Polish and
we do have "ch" and consider it one ... sound... represented by two letters? I
mean... if you asked anyone to count letters/characters in a word, they would
count "ch" as two. So I wonder if that's different in Slovakia or Chech
Republic, or is just my definition of "character" wrong.

~~~
Svip
A better example would probably be "ij" in Dutch. That's definitely considered
a single letter, as words starting with "ij" in Dutch are capitalised "IJ".
Though there are glyphs for Ĳ/ĳ in Unicode already.

~~~
mercer
"Ij" is also one sounds represented bij two letters, and I think capitalizing
just the 'I' is pretty standard. As a Dutch person myself, I didn't even know
that there's a glyph for it!

We also have "ei", which sounds the same and was invented to annoy people
learning Dutch. Then there's "oe", "eu", "ui". And just to fuck even more with
people learning the language, we have "au" and "ou" which also sound the same.
Oh, and "ch" and "g".

Hans Brinker, the inventor of the Dutch language, famously would toss a
florijn to decide between using ei/ij and au/ou, as he was not fond of
foreigners. He's mostly known for saving our country though when he plugged a
hole in a dyke with his finger (yes, I know what you're thinking, and no, we
do not appreciate your dirty minds making light of this heroic act).

~~~
unwind
Spelling it "dike" helps keep people's minds on the right thing. :)

~~~
samatman
If you spell it "dijk" it's even less racy, because it's no longer a four-
letter word.

------
jfkebwjsbx
Even Microsoft is finally giving up UTF-16!

They now recommend using the UTF-8 "code page" in new code.

~~~
snazz
Is java.lang.String still UTF-16? Is there any plan to fix that? Once Windows
and Java take care of it, I can't think of any other major UTF-16 uses left.
Are there any that I've forgotten about?

Edit: Still looks like UTF-16, according to the Oracle documentation page:
[https://docs.oracle.com/en/java/javase/14/docs/api/java.base...](https://docs.oracle.com/en/java/javase/14/docs/api/java.base/java/lang/String.html)
Edit 2: JavaScript too. See my reply to someone else below.

~~~
diroussel
Compact Strings were added in Java 9;
[https://openjdk.java.net/jeps/254](https://openjdk.java.net/jeps/254)

So they can now be stored as one byte per character.

~~~
kllrnohj
Only for ASCII text. There is still no UTF-8 support (it's even called out as
a non-goal in the JEP: "It is not a goal to use alternate encodings such as
UTF-8 in the internal representation of strings.")

------
projektfu
When I used to do a lot of windows programming in the late 90s, I wish that I
had a sensible guide like this for handling strings. TCHAR was always a source
of subtle bugs.

I suppose, though, that the underlying problem was that Microsoft was so late
to implement a compatibility solution for Windows 9x. Most software of the
time ended up implementing the "ANSI" multibyte character set (MBCS) just
because otherwise you would need to either deploy 2 executables or do your own
thunking. This solution would be a double thunk on 9x, because you'd be
thunking your UTF-8 to Unicode and then thunking that back to MBCS.

------
xg15
> _When writing a UTF-8 string to a file, it is the length in bytes which is
> important. Counting any other type of ‘characters’ is, on the other hand,
> not very helpful._

So, suppose I have a UTF-8 string that is n code units (bytes) long.
Unfortunately my data structure only permits strings of length m < n bytes.

How do I correctly truncate the string so it doesn't become invalid UTF-8 and
won't show any unexpected gibberish when rendered? (E.g., the truncated string
doesn't suddenly contain any glyphs or grapheme clusters that weren't in the
original string)

~~~
samatman
Avoiding invalid UTF-8 is easy, almost trivial: just make sure you don't
truncate in the middle of a code point.

The latter is fiendishly difficult to get right in all cases, the ugliest case
being emoji flags. Being all-or-nothing on both sides of a ZWJ will get you
most of the way there, however.
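
A minimal sketch of the easy half (my own code; given valid UTF-8 input it
keeps the truncated result valid, but it does nothing about grapheme clusters
or ZWJ sequences):

    #include <cstddef>
    #include <string>

    std::string truncate_utf8(const std::string& s, std::size_t max_bytes) {
        if (s.size() <= max_bytes) return s;
        std::size_t n = max_bytes;
        // Continuation bytes look like 10xxxxxx; back up until the byte at
        // the cut is a lead byte or ASCII, i.e. a code point boundary.
        while (n > 0 && (static_cast<unsigned char>(s[n]) & 0xC0) == 0x80)
            --n;
        return s.substr(0, n);
    }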

~~~
smasher164
It's not though. Replacing invalid byte sequences is not terribly difficult.

[https://golang.org/src/strings/strings.go?s=15854:15900#L627](https://golang.org/src/strings/strings.go?s=15854:15900#L627).

~~~
samatman
We are agreeing; the part I was indicating is difficult is 'displaying
gibberish'.

Knowing what constitutes a grapheme cluster is a detailed business, and the
rules change frequently.

------
ddebernardy
(2012)

Previous discussions:
[https://news.ycombinator.com/from?site=utf8everywhere.org](https://news.ycombinator.com/from?site=utf8everywhere.org)

------
hutzlibu
As someone who has experienced serious pain with broken strings, sometimes
discovered only after the original files were gone and new special characters
had been integrated, I direct quite some anger at the fact that computer
systems internally operate mostly in English only, so usually nobody notices
bugs with wrong character encoding. So I share the sentiment of the
article.

I do not want to think about UTF encoding when I simply create a 7z or tar
file, without even programming. But I learned the hard way that I had to. I
never even found out, for example, whether it was/is a bug in 7z, tar, rsync,
the SciTE text editor, Notepad++ ... or just wrong usage/configuration. I just
had (and still have, even now that my workflow is clean) a special first
file/codeline with special characters, which I checked to be correct after
compressing and rsyncing between different systems, especially between Windows
and Linux. But it probably helps that I don't have to do that anymore.

------
thanksforfish
> Many third-party libraries for Windows do not support Unicode: they accept
> narrow string parameters and pass them to the ANSI API. Sometimes, even for
> file names. In the general case, it is impossible to work around this, as a
> string may not be representable completely in any ANSI code page (if it
> contains characters from a mix of Unicode blocks). What is normally done by
> Windows programmers for file names is getting an 8.3 path to the file (if it
> already exists) and feeding it into such a library. It is not possible if
> the library is supposed to create a non-existing file.

Yikes. That's a fascinating use of 8.3 paths. Sometimes when I look at really
old Windows cruft I wonder when it will go away. 8.3 paths seemed like an easy
thing to get rid of, but with 8.3 paths used to hack around encoding issues in
3rd party libraries... that's going to stick around...

Anyone know which libraries this is talking about?

------
jodrellblank
> _Q: What do you think about Byte Order Marks? A: According to the Unicode
> Standard (v6.2, p.30): "Use of a BOM is neither required nor recommended for
> UTF-8". [...] Using BOMs would require all existing code to be aware of
> them, even in simple scenarios as file concatenation. This is unacceptable._

Then your site "UTF-8 everywhere" is misnamed, because standards-following
UTF-8 can have a BOM. It's not required or recommended, but it is possible and
allowable, so you might see them and if you follow the standard you have to
deal with them. It's not a matter of "this would require all existing code to
handle them" \- that is not hypothetical, that is the current world, to be
standards-compliant all existing code _does already_ need to be aware of them.
It isn't, which means it's broken. Declaring it "unacceptable" is meaningless,
except to say you're rejecting the standard and doing something incompatible
and broken because it's easier.

Which is a position one can take and defend, but it's not a good position for
a site claiming to be pushing for people to follow the standard. What it is,
is yet another non-standard ad-hoc variant defined by what some subset of
tools the authors use can/can't handle in April 2020.

> " _the UTF-8 BOM exists only to manifest that this is a UTF-8 stream_ "

Throwing the word "only" in there doesn't make it go away. It exists as a
standards-compliant way to distinguish UTF-8 from ASCII, not recommended but
not forbidden.

> " _A: Are you serious about not supporting all of Unicode in your software
> design? And, if you are going to support it anyway, how does the fact that
> non-BMP characters are rare practically change anything_ "

Well in the same way, how does the fact that UTF8+BOM is rare practically
change anything? At some level you're either pushing for everyone to follow
standards even if it's inconvenient because that makes life better for
everyone overall, like you are with surrogate pairs and indexing, or you're
creating another ad-hoc incompatible variation of UTF-8 which you prefer to
the standard and trying to strong-arm everyone else into using it with threats
of being incompatible with all the code which already does it wrong.

Being wary of Chesterton's Fence, presumably there's some company or system
which got UTF-8+BOM added to the standard because they wanted it, or needed
it.

~~~
alkonaut
100% agree.

> using BOMs would require all existing code to be aware of them, even in
> simple scenarios as file concatenation

Absolutely! Any app that writes UTF files can (and probably should) avoid
writing them. But any program that reads UTF files _must_ handle a BOM. A lot
of apps write UTF-8 including the BOM by default, for example Visual Studio.

You can NOT concatenate two UTF-8 streams and expect that the resulting stream
is also a valid UTF-8 stream. NO tool should assume that, ever.

~~~
nybble41
> You can NOT concatenate two UTF-8 streams and expect that the resulting
> stream is also a valid UTF-8 stream.

Actually you can; the ability to concatenate UTF-8 streams is an intentional
part of the design of UTF-8. The BOM is an ordinary Unicode code point and can
occur in the middle of a valid UTF-8 stream, where it should be treated as
either a zero-width non-breaking space or an unsupported character (which only
affects rendering). So concatenating two UTF-8 streams with leading BOMs still
results in a valid UTF-8 stream, albeit with an extra zero-width space.

The bigger problem with the BOM is that it breaks transparent compatibility
with ASCII. Absent a leading BOM character, a UTF-8 stream containing only
codepoints 0-127 is binary-identical to an ASCII-encoded text stream and can
be handled with tools that are not UTF-8 aware. This was an explicit design
consideration for both Unicode and UTF-8. Add the BOM, however, and your file
is no longer plain text, which can lead to syntax errors or other issues that
are difficult to diagnose because the BOM is invisible in UTF-8 aware text
editors.

I think the BOM was a mistake—along with the variable-length multi-byte
encodings it was created to support—but unfortunately at this point we're
stuck with it. (Actually the BOM is _prohibited_ in the multi-byte formats
with an explicit byte order, like UTF-16BE; it would have been really nice if
the same policy had been applied to UTF-8 where byte order is irrelevant.) The
best we can do is recommend that new programs omit the BOM when outputting
UTF-8 and either skip it at the beginning or convert it to U+2060 WORD JOINER
anywhere else when it appears in the input.
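
The skip-at-the-beginning part is trivial; a sketch (my own, assuming the
stream is already in memory):

    #include <string>

    // The UTF-8 BOM is the three bytes EF BB BF (U+FEFF encoded in UTF-8).
    std::string strip_leading_bom(const std::string& s) {
        if (s.size() >= 3 && s.compare(0, 3, "\xEF\xBB\xBF") == 0)
            return s.substr(3);
        return s;
    }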

~~~
alkonaut
Interesting, I thought a BOM-in-the-middle was invalid. I know apps are even
more likely to choke on that than on a leading BOM, though.

In any case, you need to handle it in every app that claims to read UTF. The
loss of compatibility is indeed the biggest problem and I agree the BOM should
be omitted when possible, but that doesn’t change that it’s part of the spec
and millions of UTF files have a BOM.

Even if 100% of all apps stopped using a BOM today you couldn’t ignore it in a
parser.

------
justinfrankel
I would leave this on their facebook page, but fuck facebook so I'm posting it
here hoping that someone will find it useful.

We have a zlib-licensed wrapper header for some commonly-used win32 APIs to
make them take UTF-8, see:

[https://github.com/justinfrankel/WDL/blob/master/WDL/win32_u...](https://github.com/justinfrankel/WDL/blob/master/WDL/win32_utf8.h)

[https://github.com/justinfrankel/WDL/blob/master/WDL/win32_u...](https://github.com/justinfrankel/WDL/blob/master/WDL/win32_utf8.c)

(This is used in REAPER so it's relatively well tested!)

------
Animats
I'd argue for some standard tests for UTF-8 strings:

- Basic: UTF-8 byte syntax correct.

- Unambiguous: similar to the rules for Unicode domain names. The rules are
complicated, but basically they prohibit homoglyphs, mixing glyphs from
different character sets, forwards and backwards modifiers in the same string,
emoji and modifiers, etc. Use where people have to visually compare two things
for identity or retype them, such as file names.

- Unambiguous, light version: as above, but allowing emoji and modifiers.
Normal form for documents.
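
For the "basic" level, a sketch of a syntax-only check (my own code; it
rejects overlong forms, surrogates, and anything above U+10FFFF, and says
nothing about the two "unambiguous" levels):

    #include <cstddef>
    #include <cstdint>
    #include <string>

    bool is_valid_utf8(const std::string& s) {
        std::size_t i = 0, n = s.size();
        while (i < n) {
            unsigned char b = static_cast<unsigned char>(s[i]);
            std::size_t len;
            std::uint32_t cp;
            if (b < 0x80)      { len = 1; cp = b; }
            else if (b < 0xC2) return false;  // stray continuation or overlong lead
            else if (b < 0xE0) { len = 2; cp = b & 0x1Fu; }
            else if (b < 0xF0) { len = 3; cp = b & 0x0Fu; }
            else if (b < 0xF5) { len = 4; cp = b & 0x07u; }
            else               return false;  // would encode above U+10FFFF
            if (i + len > n) return false;    // truncated sequence
            for (std::size_t j = 1; j < len; ++j) {
                unsigned char c = static_cast<unsigned char>(s[i + j]);
                if ((c & 0xC0) != 0x80) return false;
                cp = (cp << 6) | (c & 0x3Fu);
            }
            if (len == 3 && cp < 0x800) return false;                     // overlong
            if (len == 4 && (cp < 0x10000 || cp > 0x10FFFF)) return false;
            if (cp >= 0xD800 && cp <= 0xDFFF) return false;               // surrogate
            i += len;
        }
        return true;
    }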

------
malkia
Still doesn't solve the fact that filesystems across different OSes allow
invalid UTF-8 sequences in filenames.

Maybe 99% of apps do not care, but even a simple "cp" tool should care.
Filenames (and maybe other named resources) should be treated completely
differently, and not blindly assumed to be UTF-8 compatible.

~~~
qiqitori
Are you saying that operating systems (i.e. the kernel) should check and
enforce encodings in filenames?

1) Why?

2) Bye bye backward compatibility and interoperability

~~~
masklinn
> 2) Bye bye backward compatibility and interoperability

It's already not really a thing.

Traditional unices allow arbitrary bytes with the exception of 00 and 2f, NTFS
allows arbitrary _utf-16 code units_ (including unpaired surrogates) with the
exception of 0000 and 002f, and I think HFS+ requires valid UTF-16 and allows
everything (including NUL).

The OS then adds its own limitations e.g. win32 forbids \, :, *, ", ?, <, >, |
(as well as a few special names I think) and OSX forbids 0000 and 003a (":"),
the latter of which gets converted to and from "/" (and similarly forbidden)
by the POSIX compatibility layer.

The latter is really weird to see in action, if you have access to an OSX
machine: open a terminal, try to create a file called "/" and it'll fail. Now
create one called ":". Switch over to the Finder, and you'll see that that
file is now called "/" (and creating a file called ":" fails).

Oh yeah and ZFS doesn't really care but can require that all paths be valid
UTF8 (by setting the utf8only flag).

~~~
account42
> Traditional unices allow arbitrary bytes with the exception of 00 and 2f,
> NTFS allows arbitrary utf-16 code units (including unpaired surrogates) with
> the exception of 0000 and 002f.

For just Windows -> Linux you can represent everything by mapping WTF-16 to
WTF-8.

------
vortico
What is this C++ `narrow()/widen()` function mentioned in the Windows section?
At the risk of asking to be spoonfed, can someone give the source code of a
function that takes a UTF-8 `std::string` and gives a UTF-16 `std::wstring`?
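
A minimal sketch of how such helpers are commonly written over the Win32
conversion APIs (error handling omitted; this is an illustration, not the
article's exact code):

    #include <string>
    #include <windows.h>

    // UTF-8 -> UTF-16: ask for the required length, then convert.
    std::wstring widen(const std::string& utf8) {
        if (utf8.empty()) return std::wstring();
        int len = MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                                      static_cast<int>(utf8.size()), nullptr, 0);
        std::wstring out(len, L'\0');
        MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                            static_cast<int>(utf8.size()), &out[0], len);
        return out;
    }

    // UTF-16 -> UTF-8, same two-pass pattern.
    std::string narrow(const std::wstring& utf16) {
        if (utf16.empty()) return std::string();
        int len = WideCharToMultiByte(CP_UTF8, 0, utf16.data(),
                                      static_cast<int>(utf16.size()),
                                      nullptr, 0, nullptr, nullptr);
        std::string out(len, '\0');
        WideCharToMultiByte(CP_UTF8, 0, utf16.data(),
                            static_cast<int>(utf16.size()),
                            &out[0], len, nullptr, nullptr);
        return out;
    }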

------
gramakri
> In the UNIX world, narrow strings are considered UTF-8 by default almost
> everywhere

I think in the Unix world, null-terminated strings are the default. They don't
even need to be valid UTF-8. For display purposes, the shell uses the locale
setting.

------
nayuki
I love the typesetting on the page. It is content-first, clean, and simple.

It lacks all the usual noise like modal dialogs, headers and footers, social
media icons, colorful sidebars, newsletter sign-ups, cookie warnings, etc.

------
magicalhippo
I'd be happy if I could just get consistent encoding. I have to handle way too
many files with mixed encodings, even XML files with an explicit encoding
header.

------
kyberias
Well, the font on that article is too small and otherwise ugly.

------
heyplanet
I think UTF-8 was a mistake.

It is a pain in the ass to have a variable number of bytes per char.

In ASCII, you could easily know every character personally. No strange
surprises.

Also no surprises while reading black on white text and suddenly being
confronted with clors [1].

[1] Also no surprises when writing a comment on HN like this one and having
some characters stripped. I put in a smiley as the first "o" in colors, but it
was stripped out. Looks like the makers of HN don't like UTF-8 either.

~~~
jandrese
> It is a pain in the ass to have a variable number of bytes per char.

Maybe, but nobody can stomach the wasted space you get with UTF-32 in almost
every situation. The encoding time tradeoff was considered less objectionable
than making most of your text twice or four times larger.

~~~
FabHK
And as the article points out, even then you might have more than one code
point for a character.

> For example, the only way to represent the abstract character ю́ _cyrillic
> small letter yu with acute_ is by the sequence U+044E _cyrillic small letter
> yu_ followed by U+0301 _combining acute accent._

------
camgunz
This pops up every so often, and is wrong on several fronts (UNIX is UTF-8,
UTF-8/32 sort lexicographically, etc.). There's not really a good reason to
support UTF-8 over UTF-16; you can quibble over byte order (just pick one),
and you can try to make an argument about everything being markup (it's not),
but the fact is that UTF-16 is a more efficient encoding for the languages a
plurality of people use natively.

But more broadly, being able to assume $encoding everywhere is unrealistic.
Write your programs/whatevers allowing your users to be aware of and configure
encodings. It might not be ideal, but such is life.

~~~
jcranmer
> There's not really a good reason to support UTF-8 over UTF-16

Two big reasons:

1. All legal ASCII text is UTF-8. That means upgrading ASCII to UTF-8 to
support i18n doesn't require you to convert all your files that were in ASCII.

2. UTF-16 gives people the mistaken impression that characters are fixed-
width instead of variable-width, and this causes things to break horribly on
non-BMP data. I've seen amusing examples of this.

> Write your programs/whatevers allowing your users to be aware of and
> configure encodings.

Internally, your program should be using UTF-8 (or UTF-16 if you have to for
legacy reasons), and you should convert from non-Unicode charsets as soon as
possible. But if you're emitting stuff... you should try hard to make sure
that UTF-8 is the only output charset you have to support. Letting people
select non-UTF-8 charsets for output adds lots of complication (now you have
to have error paths for characters that can't be emitted), and you need to
have strong justification for why your code needs that complication.

~~~
mark-r
Every program that purports to support Unicode should be tested with a bunch
of emoticons.

~~~
coolreader18
Do you mean emoji? I don't see what the issue would be with
[{}:();P[],.<>/~-_+=XD]

~~~
mark-r
Yes, that's what I meant. I knew I was using the wrong word but couldn't
remember the right one.

