

Why the #AskObama tweet was garbled on screen - brianwillis
http://www.hanselman.com/blog/WhyTheAskObamaTweetWasGarbledOnScreenKnowYourUTF8UnicodeASCIIAndANSIDecodingMrPresident.aspx

======
pak
"This is SUCH a classic sloppy programmer mistake that I'm disappointed"

Oh come off of it. This happens everywhere on the web on probably something
like 25% of websites. And it's NOT always the consuming program's fault: very
often somebody upstream, e.g. the hosting company, the person that wrote the
HTML, the source of an RSS feed being inserted into the page etc. etc. forgot
to encode something the way somebody else expected, and you as the poor guy at
the end of the chain get a document with multiple encodings improperly
embedded into it. Inevitably you have to make some bad decisions and not all
corner cases are handled.

Somebody once reverse-engineered the state chart for how Internet Explorer
handles documents with conflicting encoding declarations and I kid you not, it
must have had >20 branches spanning a good few pages. Officially, the correct
order of precedence is
(<http://www.w3.org/International/questions/qa-html-encoding-declarations>):

1. HTTP Content-Type header
2. byte-order mark (BOM)
3. XML declaration
4. meta element
5. link charset attribute

but that's not how every browser does it, because the W3C sort of declared
that after things on the Real Internet (TM) had already gotten out of hand. I
hate to resuscitate Joel posts but Unicode is not easy to implement right.
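
As a rough illustration of the precedence list above, here is a minimal Python sketch. The function names (`sniff_bom`, `resolve_encoding`) and the idea that each declaration has already been extracted into a string are assumptions made for the example; real browsers do considerably more than this.

```python
import codecs


def sniff_bom(data):
    """Return the encoding implied by a leading byte-order mark, or None.

    Only UTF-8 and UTF-16 BOMs are checked in this sketch.
    """
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    return None


def resolve_encoding(http_charset=None, bom_charset=None, xml_charset=None,
                     meta_charset=None, link_charset=None):
    """Pick the first declared charset, in the order listed above.

    Each argument is the charset named by that source, or None if the
    source is absent. Returns None when nothing is declared at all.
    """
    candidates = (http_charset, bom_charset, xml_charset,
                  meta_charset, link_charset)
    for candidate in candidates:
        if candidate:
            return candidate.strip().lower()
    return None


# The HTTP header wins even when the meta element disagrees with it.
chosen = resolve_encoding(http_charset="ISO-8859-1", meta_charset="utf-8")
print(chosen)
```

Under this ordering a wrong HTTP header overrides a correct meta element, which is why a misconfigured server can garble an otherwise well-formed page.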

~~~
eropple
Twitter said that the message in question is UTF-8. The message recipient
decoded the message with something that is not UTF-8.

That's a sloppy mistake and should without doubt have been caught in testing.

~~~
sundarurfriend
The company responsible seems to have responded in the comments:

"It was definitely a mistake on our part. The problem was _not_ the encoding
on our data feed, but the HTML document was sent with ISO-8859-1. The second
we inserted the twitter text into the DOM, the browsers interpreted the UTF-8
string as ISO-8859-1. Our visualizations are hosted on other platforms, and in
this case the server was not configured to send UTF-8 with text/html even
though the HTML file was encoded as such. It was the only issue (albeit a
pretty obvious one) during an otherwise flawless event. I apologize to
President Obama, Speaker Boehner, and Jack Dorsey for the mistake. If the
readers of the blog think it was stupid, imagine how we felt. dev environment
!= production environment. If we would have just included a <meta
charset="utf-8"> in the HTML head, then this would not have occurred.

The big take away is don’t make assumptions about other platforms (especially
when it comes to encoding), and always include charset meta tag." [emphasis
mine]
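
The failure described in that quote can be sketched in a few lines of Python. Everything here is illustrative: `charset_from_content_type`, `render`, and the sample string are hypothetical names and data, the ISO-8859-1 fallback is the historical HTTP default mentioned elsewhere in this thread, and the alias table reflects the fact that browsers treat the `iso-8859-1` label as Windows-1252.

```python
from email.message import Message

# Browsers decode content labelled iso-8859-1 as windows-1252.
BROWSER_ALIASES = {"iso-8859-1": "cp1252"}


def charset_from_content_type(header_value, default="iso-8859-1"):
    """Return the charset parameter of a Content-Type header value.

    Falls back to `default` when the header carries no charset.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset(failobj=default)


def render(body_bytes, content_type):
    """Decode a response body the way the declared charset dictates."""
    charset = charset_from_content_type(content_type)
    return body_bytes.decode(BROWSER_ALIASES.get(charset, charset))


# Hypothetical payload: UTF-8 bytes containing a right single quote.
body = "don\u2019t".encode("utf-8")

garbled = render(body, "text/html; charset=ISO-8859-1")
correct = render(body, "text/html; charset=utf-8")
print(garbled)
print(correct)
```

The bytes are identical in both calls; only the label changes, and that alone decides whether the apostrophe survives.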

~~~
TVD
Including <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
is like the first thing you do when starting front-end development.

Every time you don't validate...God kills a kitten. Please, think of the
kittens.

This has been a Public Service Announcement. Please code responsibly:
<http://validator.w3.org/>

~~~
jacobolus
Just for the record, the correct spelling is now:

    
    
      <meta charset="utf-8">

------
giberson
Maybe I'm just a terrible programmer, but I think the author may be
overemphasizing the seriousness of the bug a little. To me, this is one of those
throwaway issues that you keep in the back of your mind, unless I'm coding
something that is extremely data-centric and critical in that sense. But it's
not like it's on one of my "top 20 must run tests" or anything. It's always one
of those issues I ignore or assume is correct until I find out it isn't. When
I find out, it's simply a matter of tossing in a code page translation at
either the input or output end and I'm done with it.

Or maybe I've just been fortunate enough to be in an environment where an
occasional goof of this caliber doesn't have any serious consequences.

~~~
carbonica
> Or maybe I've just been fortunate enough to be in an environment where an
> occasional goof of this caliber doesn't have any serious consequences.

Primarily, this means you don't have to support internationalization - which
is hardly a bad thing, especially if you work at a startup, where worldwide
distribution should be the last thing on your mind. When your product is
rendered in over 80 scripts, including right-to-left languages, you can't
afford to assume encoding will sort itself out later.

~~~
city41
And Hanselman works for Microsoft, where i18n is a _big_ deal. So yes, for
someone who's been at MS for a while, i18n-related issues become second nature.
But if you typically are only targeting the United States, it's more
understandable to not have these things on the brain.

~~~
finnh
On the other hand, Outlook still has a ridiculous bug that uses the wrong
encoding when presenting HTML email - that is, it uses the encoding of the
email's text-body when presenting the html-body, even if the html-body
specifies a different encoding.

So, if the two bodies have differing encodings (charsets), then the HTML body
will look wrong. Unless you force Outlook to always use UTF-8 for encoding
emails (which is a setting, but not the default) then you'll end up sending
emails that will look garbled to your recipient.

This "differing charset" situation actually happens pretty frequently, because
of the following sequence of events:

a) You write an email (or reply to an existing email - actually it happens
most with replies).

b) Outlook's text editor decides to insert a non-breaking space (codepoint
U+00A0). Perhaps it generates HTML with &nbsp; but before transmission this
eventually turns into the two-byte UTF-8 sequence 0xC2 0xA0.

c) When generating the text-body, Outlook decides to just use a plain old
space, so the text body is plain ASCII.

d) Outlook, in its cleverness, then says "ooh, I can 'conserve' encoding-ness
and use plain old iso-8859-1 for the text body, but I need to use UTF-8 for
the HTML body because of that non-ASCII character."

e) Outlook generates this email (please excuse formatting woes due to HN).

      Content-Type: multipart/alternative; boundary="0016e64dbd929784310488b2b082"

      This is a multi-part message in MIME format.

      --0016e64dbd929784310488b2b082
      Content-Transfer-Encoding: 7bit
      Content-Type: text/plain; charset="ISO-8859-1"

      yo yo

      --0016e64dbd929784310488b2b082
      Content-Type: text/html; charset="UTF-8"
      Content-Transfer-Encoding: quoted-printable

      <html> <body>
      yo=C2=A0yo
      </body> </html>

      --0016e64dbd929784310488b2b082--

When you view the above email in Outlook, you see "yoÂ yo" instead of "yo yo"
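
The garbling can be reproduced from the HTML part's payload alone. This is a small sketch using Python's standard `quopri` module; the variable names are made up for the example, and it only models the decoding step, not Outlook itself.

```python
import quopri

# The quoted-printable payload of the HTML part in the message above.
payload = b"yo=C2=A0yo"

# Undo the transfer encoding first: this yields the raw bytes yo, C2, A0, yo.
raw = quopri.decodestring(payload)

# Decoded with the charset the HTML part declares (UTF-8).
as_html_charset = raw.decode("utf-8")

# Decoded with the charset of the text part (ISO-8859-1), as described above.
as_text_charset = raw.decode("iso-8859-1")

print(repr(as_html_charset))
print(repr(as_text_charset))
```

With UTF-8 the two bytes collapse into one non-breaking space; with ISO-8859-1 the 0xC2 byte becomes a visible "Â" and the 0xA0 byte becomes the non-breaking space after it.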

~~~
city41
You should file this at connect.microsoft.com. MS devs and PMs really do read
and triage bug reports coming from there. The more details you put in the bug
report, the better.

------
yahelc
'So do political nerds get to moan because the author referred to Boehner as
"the Senator"?'

<https://twitter.com/douglas/status/89080894018686976>

~~~
anigbrowl
Touchée.

~~~
StavrosK
Are you female and awesome, or was that just a typo?

~~~
shii
I can confirm that anigbrowl is not a female.

~~~
StavrosK
Ah, just a typo then.

------
efalcao
Hey I work for the company responsible for the visualization behind the
president and the content on <http://askobama.twitter.com>

Let me take this very excellent opportunity to say that we are looking to hire
a full time "front end" developer. You'll get to work on badass projects like
the Obama Town Hall. Ideally, you'd be located in Austin. Find me on Twitter
@efalcao to learn more.

~~~
efalcao
FWIW, this was an intense project to pull off. 1000's of tweets per minute
from Twitter, 8000 requests per second on <http://askobama.twitter.com> (where
the same tweet was also delivered by us and rendered correctly).

We're not lazy or sloppy... It basically boiled down to this: one server sent
down the right header...the production one didn't.

Unicode issues are sorta in the class of "gotcha" issues. They happen, you go
"oh shit" and fix them right away. Our "oh shit" moment just happened to come
at the most intense possible moment....in front of the president, with so many
watching.

Wanted to reiterate once again: We're Hiring! @efalcao on twitter. Early stage
startup looking for exceptional talent.

~~~
thinkbohemian
I think you did a great job, way to represent the Austin Tech scene!

~~~
efalcao
Thanks so much!

------
js2
tl;dr:

    
    
      $ python3
      >>> print("\u2019".encode("utf-8").decode("windows-1252"))
      â€™
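
The one-liner above shows the damage; the reverse direction can be sketched too. The helper names `mangle` and `repair` are hypothetical, and the repair only works when the garbled text still contains every original byte (Python's Windows-1252 codec rejects the five byte values that code page leaves undefined).

```python
def mangle(text):
    """Encode as UTF-8, then misread the bytes as Windows-1252."""
    return text.encode("utf-8").decode("windows-1252")


def repair(garbled):
    """Reverse mangle(): recover the original bytes and read them as UTF-8."""
    return garbled.encode("windows-1252").decode("utf-8")


original = "don\u2019t"          # "don't" with a right single quotation mark
garbled = mangle(original)
restored = repair(garbled)

print(garbled)
print(restored == original)
```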

~~~
Luyt
Also interesting, the Unicode Nazi:

    
    
      http://pypi.python.org/pypi/unicode-nazi

------
pavel_lishin
Did they see the encoding mistake before they showed it?

Because I wonder how difficult it would be to create a string that says
something innocuous in UTF-8 (e.g., "When will you bring the troops home
#AskObama") but in ASCII would read as something totally different, but
legible (e.g., "the secret priests would take great Cthulhu from his tomb to
revive His subjects and resume his rule of earth...")

~~~
StavrosK
I imagine quite difficult, as each character triplet when decoded with
Windows-1252 would have to be one letter in Unicode, and those would have to
actually form words. You'd be restricted to maybe 30 triplets.
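
A quick check of how restrictive this is, as a Python sketch (the `misread` helper is a made-up name). Plain ASCII text has the same bytes in UTF-8 and Windows-1252, so it reads identically either way; only non-ASCII characters change, and each one turns into a short run of accented letters and symbols rather than an arbitrary letter.

```python
def misread(text):
    """Encode as UTF-8, then decode the bytes as Windows-1252."""
    return text.encode("utf-8").decode("windows-1252")


question = "When will you bring the troops home #AskObama"

# ASCII-only text survives the wrong decoding unchanged.
unchanged = misread(question) == question

# Non-ASCII characters expand into two or three Windows-1252 characters.
expansions = {ch: misread(ch) for ch in ("\u00e9", "\u2019")}

print(unchanged)
print(expansions)
```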

------
pilif
Errors like this are what my coworkers and I jokingly refer to as US-UTF8 (no
offense meant). In a country that's dominated by ASCII, "supporting" UTF8
means "emitting the same data as usual but declaring it as UTF8".

Sure there might be some misunderstandings with special punctuation characters
as evidenced by the article, but such issues generally get low priority.

In countries where the language isn't representable in ASCII, we can't use
US-UTF8, but have to resort to "real-UTF8", which means dealing with legacy
systems that don't do UTF8 (which is what happened in the article we're
currently commenting on), dealing with browsers that lie about encoding, and
dealing with the fact that a string's length isn't its byte length any more, even
if it doesn't contain "fancy" punctuation characters.
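
The length point is easy to see in Python, where `len` on a string counts code points and `len` on the encoded bytes counts bytes. The `lengths` helper and the sample words are just for illustration.

```python
def lengths(text):
    """Return (number of code points, number of UTF-8 bytes)."""
    return len(text), len(text.encode("utf-8"))


ascii_word = lengths("resume")                 # ASCII only
accented_word = lengths("r\u00e9sum\u00e9")    # two precomposed e-acute characters
cjk_word = lengths("\u65e5\u672c\u8a9e")       # three CJK characters

print(ascii_word, accented_word, cjk_word)
```

Only the ASCII-only word has matching counts, which is why code that conflates the two appears to work until the first accented character arrives.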

All that makes me wish I could do US-UTF8 too :-)

------
zach
Proof that even completely ordinary string data used in the most USA-centric
domain imaginable STILL needs proper encoding.

~~~
baddox
I wouldn't call the right single quotation mark "completely ordinary string
data." As far as I know, the only way it would ever get into a tweet is
through some "smart correcting" client, or purposeful manual entry.

~~~
zach
A healthy amount of the writing of the world still begins life in a word
processor.

Any string born of this heritage is likely to have single and double (curved)
quotation marks. OpenOffice, Word, Pages – they all do it and are expected to.

For that reason, I consider them to be reasonably present in ordinary string
data.

------
markbnine
What a cool shot. The prez with a common bug on his screen. And a bug I can
fix! Still, even though this is an easy fix, he's going to need to open up a
ticket.

~~~
sacrilicious
My only regret is that I have but one upvote to give to this comment.

------
anigbrowl
I find it infuriating that this sort of thing is still a problem. I'm
constantly seeing mangled apostrophes in places like Google Reader too.

~~~
uxp
I would personally blame Wordpress for replacing a common apostrophe ' with
the left ‘ and right ’ quotes respectively. Same with quotation marks: “ and ”
instead of the traditional ".

There is an option to not use "Smart Quotes", but it seems to be enabled by
default.

~~~
wrs
Wordpress is for writing, and writing should be properly typeset, and
properly-typeset text has proper quotes.

BTW, a note on the term “smart quotes”: that originated when word processors
became “smart” about transforming the easy-to-type (but incorrect) ' and " to
their proper equivalents automatically. The quotes themselves aren’t
smart…they’re just quotes.

Typography nerd out.

~~~
uxp
For typography, yes.

For many blogs in the hacker community, source code snippets inside <code>
tags can also be given "smart quotes", which completely breaks any strings
that may be present.
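
To make the breakage concrete, here is a small Python sketch. The `compiles` helper and the sample snippets are hypothetical, and it models only the effect of the substitution, not Wordpress itself.

```python
def compiles(source):
    """Return True if `source` parses as Python, False on a syntax error."""
    try:
        compile(source, "<snippet>", "exec")
        return True
    except SyntaxError:
        return False


# A snippet as the author typed it, with straight double quotes.
straight = 'greeting = "hello"'

# The same snippet after curly quotes (U+201C, U+201D) were substituted.
curly = "greeting = \u201chello\u201d"

print(compiles(straight))
print(compiles(curly))
```

A reader who copies the curled version out of a blog post gets a syntax error from code that was correct when it was written.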

You forget that Wordpress is used by many people outside of the writing
community. If a writer cares about having their words properly typeset, they
can do so themselves. Wordpress tries to be smart about it and convert them,
but many people do not care about such features. The developers, however, do.

Also, may I remind everyone, downvotes on HN are not for disagreement, they
are for factually incorrect statements.

~~~
alanh
IMO it’s not bad that Wordpress auto-educates quotes, it’s that it does it
even within code blocks. Which is inexcusable, for precisely the reason you
mention.

Markdown + SmartyPants are a better solution IMO. (And you can install WP
plugins that do this and that disable Wordpress’ default quote educator.)

------
kragen
Part of the problem is that UTF-8 makes things really, really simple, and
bulletproof, and then people have to go and create problems again.

Listen. Any time you use an encoding other than UTF-8, you are creating
incompatibilities. If your stated intention is to facilitate communication,
you are failing. You are a bad person. Stop doing it. The only possible excuse
for using a non-UTF-8 encoding is to frustrate communication.

(It's too fucking bad HTTP mandates that the default charset is ISO-8859-1.)

------
Jach
Zed Shaw, can you make a post called "Programmers Need To Learn Unicode Or I
Will Kill Them All"?

~~~
mahmud
Joel Spolsky wrote something that comes close, sans the threat:

<http://www.joelonsoftware.com/articles/Unicode.html>

------
luigi
What modern software stack uses Extended ASCII as its default encoding? The
last time I dealt with this problem, it was in 2005 or 2006 and I was working
with PHP.

~~~
samfoo
My guess would be that it's a flash app. Flash is (or at least used to be)
horrible at internationalization issues like this.

------
zipdog
What bugs me is not the mis-encoding (though that's a fail), but that people
"struggled to understand" it... surely everyone's seen apostrophes turn into
these special characters on enough web pages over the past ten years to have
recognized what's happening when it does.

~~~
dthorpe
Blame Hollywood. Non-tech folks have been conditioned to panic whenever
anything out of the ordinary appears on screen, because that's how Hollywood
visualizes computer viruses and alien space invaders.

------
gutini
An error like this should not detract from the value Mass Relevance delivers.
Clearly an event like this or similar events like the Oscars in which they
take part are better, more engaging because of their involvement.

------
dabent
That’s a lot of detail, but a very good explanation of what happened.

------
winsbe01
thank you for the fascinating article! I've seen this bug in other places, and
I never knew what it was (and usually brushed it off instead of digging deeper
to find it).

------
Jach
Well, it could have been worse. They could have shown \'

------
nextparadigms
Why does it say 3 hours ago under the tweet? Wasn't this in real time?

~~~
answerly
>Wasn't this in real time?

No. Questions were culled from Tweets with the #AskObama hashtag starting on
June 30. Some of the questions did come in close to real-time. I think the
most recent ones were 5-10 minutes old when presented to the president.

~~~
nextparadigms
So how did they choose the questions then? Based on retweets, or the Twitter
team just picked the ones they liked?

~~~
answerly
They had a group of moderators who were selected to find the best questions. I
believe most of the moderators were journalists or bloggers.

------
mahmud
He had to do a twitter townhall because El Jefe couldn't get a G+ invite :-|

------
joeyh
tl;dr -- mojibake

------
ignifero
So does HN support utf8 ¢ðrrê¢†l¥?

~~~
Luyt
˙sǝʎ 'os ʞuıɥʇ I

~~~
there
(ಠ_ಠ)

------
ignifero
You have to give it to Microsoft. They use Word even for 140 letter documents
now! (I understand spelling is a reason, but browsers have spelling now)

~~~
mgkimsal
Downvoted, understandably perhaps, but nearly every time I see this sort of
bug it's because someone has copy/pasted something from MS Word instead of
using whatever native client input method was provided (usually a textarea or
input field). Yes, blame the rendering script for not encoding properly, but I
suspect this was Boehner's team copy/pasting from some internal MS Word doc.
Cheap shot - they also likely copy/paste screenshots in to Word and mail those
around too, instead of just mailing the actual graphic file. :)

~~~
ben1040
_Cheap shot - they also likely copy/paste screenshots in to Word and mail
those around too, instead of just mailing the actual graphic file. :)_

When I was in consulting doing software integration work, few things
infuriated me more than client "bug reports" arriving in the form of an email
containing a 15 megabyte MS Word doc with a bunch of un-annotated screenshots.

I really hope Google someday opens up the awesome bug report/screencap feature
in Google+, that lets you highlight part of the screen and redact sensitive
parts.

~~~
mgkimsal
I _almost_ wouldn't mind, except the screenshots are inevitably shrunk down so
as to be unreadable.

In OSX, the cmd-shift-4 (and 3) keystrokes, which screenshot right into a file,
are near life-transforming. Snap, drag the file into an email, and it's done.
I'm sure there are utilities in Windows which do this, but having it built in
is great - no apps to start or install.

~~~
JonoW
Press "Print Screen" and paste into a new email? Works well in desktop clients
like Outlook, less so for web clients like Gmail...

~~~
mgkimsal
_NO ONE DOES THAT ON WINDOWS OUTSIDE OF TECH GEEKS_. EVERYONE pastes in to MS
Word, then emails that document.

That said, from a geek standpoint, I still prefer having the raw image file
that I snapped in a folder someplace so I can refer to it later without having
to go through sent emails, but that's just a personal preference.

