

The Mailocalypse Is Upon Us: Why Isn’t All Mail UTF-8? - uggedal
http://lamsonproject.org/blog/2009-06-14.html

======
pilif
I may be overly conservative here, but IMHO an MTA should not touch the body of
the message.

It should not even touch the headers besides adding a Received-header to the
top.

Re-Encoding all Mails passed through is definitely not what I call "not
touching".

I know that there are some exceptions (e.g. 8bitmime), but I still think that
mail servers should keep their hands off what is passing through them.

Like mailmen who are not supposed to open the envelope, read the letters and
reprint them using a nicer font on nicer paper :-)

~~~
mrduncan
I don't see a problem with it _if_ the translation is 1-to-1 (and that's a
pretty big if). Disregarding federal law for the moment, what is the harm in
the mailman reprinting your letters with nicer font on nicer paper? It seems
to me that if the mailman wants to do that, as long as they don't lose any of
the information of the original letter I would benefit from having an easier
to read letter.

~~~
pantsd
But if it becomes normal for a mailman to do this, there is now an expected
man-in-the-middle. Sure, the mailman might be benign, but what if there is
another, less benign man-in-the-middle? Now that I expect my communication to
be tampered with, I won't notice anything suspicious. [Note: Zed's proposed
solution is to forward on the original mail as well so it can be verified]

~~~
mrduncan
I'll confess now to definitely not being an expert on mail systems but what is
to stop this tampering from happening now by a man-in-the-middle? I don't
really see any new avenue of attack that doesn't already exist with current
systems (encrypted email excluded).

~~~
pilif
Nothing stops tampering from happening.

But the SMTP standard and the whole culture around internet mail mandate that
messages are not changed in transit (with the exception of that Received
header and the stuff around 8bitmime).

UTF-8-Encoding mails just because it's "cleaner" doesn't feel like it's the
right thing to do, especially when you consider there to be old systems around
that can't handle UTF-8 encoded messages.

Also, standards are there to be adhered to - like HTML and all the others.

------
grandalf
What is lamson doing with mails that makes just leaving them in their original
charset unworkable?

~~~
aristus
Processing them in Python. :) No, really. Python is horrible at dealing with
strings of different encodings. You could generalize that to any complex app
with lots of data sources: the only sane way to do it is to convert everything
to a single encoding at the door.
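
A minimal Python sketch of "convert at the door", assuming the standard-library
`email` package rather than whatever Lamson actually does. `body_as_text` is a
hypothetical helper name, and the sample message is made up:

```python
from email import message_from_bytes
from email.policy import default


def body_as_text(raw: bytes) -> str:
    """Decode a single-part text message body to str using its declared charset."""
    msg = message_from_bytes(raw, policy=default)
    # get_content() undoes the transfer encoding and then decodes with the
    # charset named in Content-Type, so the rest of the app only sees str.
    return msg.get_content()


# Made-up ISO-8859-1 message, sent as quoted-printable.
raw = (
    b"Content-Type: text/plain; charset=iso-8859-1\r\n"
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b"caf=E9\r\n"
)
text = body_as_text(raw)
```

Past this point the original charset no longer matters to the application,
which is the whole appeal. It only covers the single-part case.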

(edit) whups, I hadn't thought about PGP/etc signatures. _sigh_

~~~
grandalf
Yeah, charset conversion would also break DKIM/DomainKeys.

------
dnewcome
Here be dragons. Be afraid. Be very afraid.

~~~
patio11
Everything you say and more. Internationalization, ick -- anyone who thinks
this is easy has no clue how deep this rabbit hole goes.

On Han unification: the Japanese reluctance to this is partly because they're
being told "Some of your national literature needs to die so that our data
standard can live. Deal." and partly because they're being told "What's with
all the resistance, you xenophobic bastards, get with the effing program
already.", generally by people who they perceive as not quite getting the
issue.

All the educated Americans in the room have read Romeo and Juliet, right?
Remember the balcony scene? Remember the word in the balcony scene that you
have never heard in _any_ other context?

O Romeo, Romeo, Wherefore art thou Romeo?

Imagine being told "For technical reasons, we're standardizing computers away
from being able to accept 'Wherefore' as input or output. As a workaround, we
suggest using "why", or perhaps putting the word in an image file and pasting
it in when it is required. Most people don't use "wherefore" anyhow and, if
you routinely do, you can modify your editing software to accommodate it, as
long as it doesn't have to interface with any other computer ever. Oh, by the
way, some other words you know are also going to stop working. It's nothing
major. Well, OK, 'Gertrudes' might find it somewhat annoying but we've got a
nice selection of names from Aloysius to Xavier and, if all else fails, you
can spell it phonetically because your language is capable of that, too, and
don't pretend otherwise."

~~~
jsonscripter
They should have an encoding standard for characters that are not in the
standard character set. I've seen base64 encoded images in CSS for a long time.
Sure, it takes up a lot of space, but isn't this the best of both worlds? Of
course, the images could be SVG or a sort of typeface definition. It doesn't
matter because it's base64 encoded.

Of course, I'm not trivializing the problems with internationalization. I'm
simply pointing out that there are simple solutions to some of the problems,
such as this one. I think the biggest problem with character encoding is
getting people to agree on a standard.

~~~
patio11
_I think the biggest problem with character encoding is getting people to
agree on a standard._

Prior to the Unicode standard springing into being, Japan had relatively
little problem with agreeing on standards, by the simple expedient of agreeing
on enough of them such that most stakeholders left the table with at least one
they liked. People are sort of reluctant to leave these and the tangible
benefits they offered, hence the lukewarm reception to Unicode. (If I hear
tsk-tsking from any Americans on this point, just try to imagine how much
adoption UTF-8 would have in the US if it didn't have the happy-and-totally-
not-accidental property of working exactly like ASCII for the language
American programmers and legacy systems/documents care about.)
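
The ASCII property is easy to check in Python. This is only an illustration
with made-up sample strings; the byte counts are a side observation, not a
claim about why anyone prefers one encoding:

```python
ascii_text = "Wherefore art thou Romeo?"
japanese_text = "文字化け"

# For pure-ASCII text the UTF-8 bytes are identical to the ASCII bytes,
# so existing English documents are already valid UTF-8.
same_bytes = ascii_text.encode("utf-8") == ascii_text.encode("ascii")

# The same Japanese text has a different byte representation (and size)
# in UTF-8 than in a legacy encoding such as Shift_JIS.
utf8_len = len(japanese_text.encode("utf-8"))
sjis_len = len(japanese_text.encode("shift_jis"))
```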

Sidenote: I was working on a translation of features of a particular software
package today and had to think a little of how to explain _mojibake_, and why
you'd want a servlet container that can detect and correct it, to an American
audience. (It's when some program between the user and the server's
application layer applies a heuristic incorrectly and results in one or the
other getting complete gibberish.) The problem is every bit as fun to deal
with as a developer and a user as you would expect it to be.
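
For readers who have never seen it, a small Python sketch of how mojibake
arises. It assumes the simplest case, UTF-8 bytes wrongly decoded as Latin-1;
real-world cases usually involve Shift_JIS, EUC-JP and friends, and are not
always reversible:

```python
original = "文字化け"
wire_bytes = original.encode("utf-8")

# A receiver that wrongly guesses Latin-1 never gets an error,
# it just ends up with gibberish.
garbled = wire_bytes.decode("latin-1")

# Latin-1 maps every byte to exactly one character, so this particular
# mistake can be undone by reversing the wrong step.
repaired = garbled.encode("latin-1").decode("utf-8")
```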

~~~
jrockway
To be fair, the Unicode folks totally screwed over the Japanese. Now there are
many people who can't even write their name on a computer.

Needless to say, these people would rather use the "legacy" character sets
that didn't have this problem.

------
gchpaco
Among the problems that this has, it will shatter PGP and S/MIME signatures
silently.
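
A rough Python illustration of why: signatures cover bytes, not characters.
SHA-256 stands in here for the digest step of a real signature scheme, and the
message body is made up:

```python
import hashlib

body = "Grüße aus Zürich\r\n"

as_sent = body.encode("iso-8859-1")   # bytes the sender signed
as_relayed = body.encode("utf-8")     # bytes after a relay re-encodes them

digest_sent = hashlib.sha256(as_sent).hexdigest()
digest_relayed = hashlib.sha256(as_relayed).hexdigest()

# Same text to a human reader, different bytes, so the digests differ
# and verification on the receiving side fails.
signature_still_valid = digest_sent == digest_relayed
```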

~~~
dfranke
Ding ding ding! You win the thread. It'll break DKIM/Domainkeys too if you
have 8-bit characters in your headers.

Paws off my mail, Zed.

------
dmm
Is this guy MIME encoding everything in base64 or quoted-printable?

Last time I checked, email could only be 7-bit ASCII because of many legacy
servers.
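
For reference, a short Python sketch of what those two transfer encodings do
to a UTF-8 body, using the standard-library `quopri` and `base64` modules. The
sample text and the `is_7bit` helper are made up for illustration:

```python
import base64
import quopri

body = "Zürich ist schön.".encode("utf-8")


def is_7bit(data: bytes) -> bool:
    """True if every byte fits in 7 bits, i.e. is plain ASCII."""
    return all(byte < 128 for byte in data)


# Both encodings turn 8-bit data into ASCII-only bytes and are reversible.
qp = quopri.encodestring(body)
b64 = base64.encodebytes(body)
```

Quoted-printable leaves the ASCII parts readable, base64 does not; either way
the payload survives a relay that only handles 7-bit data.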

~~~
vidarh
I'd be interested in seeing estimates on how many such servers are still being
used. I've never come across one.

~~~
jrockway
Indeed. This was a problem in 1990, but amazingly, it's not 1990 anymore.

------
prodigal_erik
Drop badly encoded email on the floor because you hope it's _probably_ spam?
That deliberately violates Postel's Law, which is what keeps this mess mostly
working.

And doesn't multipart/signed rely on knowing the actual charset the signer was
using?

~~~
calambrac
Umm, no. Drop badly encoded email because most of it _is_ spam. Or, at least,
that's the hypothesis that _he's asking you to help verify_. Did you even read
the article?

~~~
prodigal_erik
He said himself that "a quick eye-ball sample says those failures mail are
mostly spam that wasn’t classified right". Mostly? That means he already knows
badly-encoded but legitimate email is not only possible but actually exists,
so what's left to verify? Either you're okay with losing messages or you
aren't.

~~~
calambrac
I was responding to your use of the term 'hope', which made it sound like you
thought this was just some blind stab in the dark. He has data, he's asking
for more, and he's asking people to shoot him down if it's a dumb idea. It's
hardly a faith-based effort.

And now you're parsing his sentence for subtle evidence that he secretly knows
he's dropping some legitimate messages? Dude, _he posted the fucking numbers
himself_. It's obvious that he's willing to drop some messages. He's asking
how many he actually would be dropping and trying to work out if the benefits
of "everything as UTF-8" outweigh the costs of dropping a message every now
and then.

 _Either you're okay with losing messages or you aren't_

Yeah, brilliant. There are already several reasons to drop messages (size,
mangled headers, etc). He's pushing for another one.

------
jrockway
If I hear about one more trivial issue that's called a something-pocalypse...

