
UTF-8 Encoding Debugging Chart - tard
http://www.i18nqa.com/debug/utf8-debug.html
======
rspeer
The fortunate thing is, almost all of the broken sequences are unambiguous
enough to be signs that the text should be encoded and then re-decoded as
UTF-8. (This is not the case with any arbitrary encoding mixup -- if you mix
up Big5 with EUC-JP, you might as well throw out your text and start over --
but it works for UTF-8 and the most common other encodings because UTF-8 is
well-designed.)

So if you want a Python library that can do this automatically with an
extremely low rate of false positives:
[https://github.com/LuminosoInsight/python-
ftfy](https://github.com/LuminosoInsight/python-ftfy)
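The re-decode trick described above can be sketched in a few lines of stdlib Python; ftfy automates the same fix but adds detection heuristics to keep the false-positive rate low. A minimal sketch, assuming the common Latin-1/UTF-8 mix-up:

```python
# Classic mojibake round trip: UTF-8 bytes that were mis-decoded as Latin-1.
garbled = "cafÃ©"  # what "café" looks like after the mix-up

# Re-encode with the wrong codec to recover the original bytes,
# then decode them as the UTF-8 they always were.
fixed = garbled.encode("latin-1").decode("utf-8")
print(fixed)  # café
```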

------
pixelbeat
I previously wrote about this common double-encoding issue at
[http://www.pixelbeat.org/docs/unicode_utils/](http://www.pixelbeat.org/docs/unicode_utils/),
which references tools and techniques for fixing up such garbled data.

------
plank
Missing: how the defaults clash between UTF-8 and EBCDIC. E.g. a character in
UTF-8 outside the MES-2 subset ('latin1') will be mapped to the 0x3F 'unknown'
character of EBCDIC, which in turn gets mapped back to the 0x1A character
(Ctrl-Z) of UTF-8...
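The second half of that round trip can be observed with Python's built-in EBCDIC codec (cp500, EBCDIC International); a small sketch:

```python
# EBCDIC's "substitute" (unknown-character) byte is 0x3F.
ebcdic_sub = bytes([0x3F])

# Decoding it with Python's cp500 (EBCDIC International) codec yields
# U+001A, the ASCII SUB control character, i.e. Ctrl-Z.
char = ebcdic_sub.decode("cp500")
print(hex(ord(char)))  # 0x1a
```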

------
julie1
Lol, the biggest bug is developers ignoring that Latin-1 and UTF-8-encoded
Unicode can coexist in the same stream of data:

- HTTP 1.1 headers are ISO-8859-1 (CERN legacy) while the content can be UTF-8.
- SIP, being based on the HTTP RFC, has the same flaw.

The CTO of my last VoIP company is still wondering why some caller IDs break
his nice Python program that assumes everything is UTF-8, and he still does
not understand this...
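The failure mode is easy to reproduce; a sketch with a hypothetical header value carrying an ISO-8859-1 byte:

```python
# A header value encoded in ISO-8859-1 (0xE9 = 'é' in that charset).
header = b"From: Ren\xe9"

# The "everything is UTF-8" assumption blows up: 0xE9 starts a
# multi-byte UTF-8 sequence that never completes.
try:
    header.decode("utf-8")
except UnicodeDecodeError:
    print("not valid UTF-8")

# Decoding with the charset the spec actually mandates works fine.
print(header.decode("iso-8859-1"))  # From: René
```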

Yes, the encoding can change; I also saw it in logs while using
regionalisation with C# .NET.

~~~
guelo
According to newer HTTP specs, clients should treat such ISO-8859-1 characters as opaque data.
[https://tools.ietf.org/html/rfc7230#section-3.2.4](https://tools.ietf.org/html/rfc7230#section-3.2.4):

       Historically, HTTP has allowed field content with text in the
       ISO-8859-1 charset [ISO-8859-1], supporting other charsets only
       through use of [RFC2047] encoding.  In practice, most HTTP header
       field values use only a subset of the US-ASCII charset [USASCII].
       Newly defined header fields SHOULD limit their field values to
       US-ASCII octets.  A recipient SHOULD treat other octets in field
       content (obs-text) as opaque data.
    

Though I guess you'd still need to decode it correctly in order to ignore the
right characters.
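One common way to satisfy the "treat obs-text octets as opaque data" requirement is to decode header bytes as Latin-1, since it maps every byte 1:1 onto U+0000..U+00FF. A sketch:

```python
# Raw header bytes containing an obs-text octet (0xE9).
raw = b"X-Custom: caf\xe9"

# Latin-1 decoding never fails and is byte-for-byte reversible,
# so the unknown octet passes through as opaque data.
value = raw.decode("latin-1")
assert value.encode("latin-1") == raw
print(value)
```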

~~~
AnthonyMouse
IETF should just publish an RFC that says "all text without a field specifying
its encoding shall be UTF-8, even if this conflicts with a previous RFC." The
only real objection to doing this is that it would break things, but almost
all of those things are already broken.

~~~
colejohnson66
The problem is browser vendors wouldn't want to implement that spec because if
updating your browser breaks the website, no matter how much you explain it to
the user, it's your fault, not the website owner's. It's why we have Quirks
Mode even after 15 years. It's why Linus is so adamant about patches breaking
userspace;[1] if your update broke it, it's your fault, no matter how bad the
truly broken thing is.

[1]:
[https://lkml.org/lkml/2012/12/23/75](https://lkml.org/lkml/2012/12/23/75)

~~~
AnthonyMouse
There are cases where the status quo is already broken and you're already
being blamed for it. A change that makes the brokenness 20% instead of 80% by
inverting the set of weird websites it happens on is going to make userspace
less broken on net.

------
arm
Currently down. Here’s a snapshot from January:

[http://archive.is/t2tB3](http://archive.is/t2tB3)

------
alien3d
Is utf8mb4 affected as well? I've been converting my tables from utf8 to
utf8mb4 to support emoji.
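For context on why the conversion is needed (independent of the chart's mojibake fixes): MySQL's legacy "utf8" (utf8mb3) stores at most 3 bytes per character, while emoji sit outside the BMP and take 4 bytes in real UTF-8. A quick check:

```python
# Characters in the BMP fit within utf8mb3's 3-byte limit...
assert len("é".encode("utf-8")) == 2

# ...but emoji such as U+1F600 need 4 bytes, hence utf8mb4.
assert len("\U0001F600".encode("utf-8")) == 4
```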

