Oh come off it. This happens everywhere on the web, on probably something like 25% of websites. And it's NOT always the consuming program's fault: very often somebody upstream - e.g. the hosting company, the person who wrote the HTML, the source of an RSS feed being inserted into the page, etc. - forgot to encode something the way somebody else expected, and you, the poor guy at the end of the chain, get a document with multiple encodings improperly embedded in it. Inevitably you have to make some bad decisions, and not all corner cases get handled.
Somebody once reverse-engineered the state chart for how Internet Explorer handles documents with conflicting encoding declarations and I kid you not, it must have had >20 branches spanning a good few pages. Officially, the correct order of precedence is (http://www.w3.org/International/questions/qa-html-encoding-d...):
1. HTTP Content-Type header
2. byte-order mark (BOM)
3. XML declaration
4. meta element
5. link charset attribute
but that's not how every browser does it, because the W3C sort of declared that after things on the Real Internet (TM) had already gotten out of hand. I hate to resuscitate Joel posts but Unicode is not easy to implement right.
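To make the precedence concrete, here's a minimal sketch in Python; the function name, the regexes, and the 1024-byte sniff window are my own simplifications, not what any real browser implements:

    import codecs
    import re

    def guess_charset(content_type, body):
        # content_type: the HTTP Content-Type header value (or None)
        # body: the raw document bytes
        # 1. HTTP Content-Type header, e.g. "text/html; charset=utf-8"
        if content_type:
            m = re.search(r'charset=([\w-]+)', content_type, re.I)
            if m:
                return m.group(1)
        # 2. Byte-order mark (check UTF-32 BOMs before UTF-16: they overlap)
        for bom, name in ((codecs.BOM_UTF32_LE, 'utf-32-le'),
                          (codecs.BOM_UTF32_BE, 'utf-32-be'),
                          (codecs.BOM_UTF8, 'utf-8'),
                          (codecs.BOM_UTF16_LE, 'utf-16-le'),
                          (codecs.BOM_UTF16_BE, 'utf-16-be')):
            if body.startswith(bom):
                return name
        head = body[:1024]
        # 3. XML declaration, e.g. <?xml version="1.0" encoding="utf-8"?>
        m = re.search(br'<\?xml[^>]*encoding=["\']([\w-]+)["\']', head)
        if m:
            return m.group(1).decode('ascii')
        # 4. meta element: <meta charset=...> or the older http-equiv form
        m = re.search(br'<meta[^>]*charset=["\']?([\w-]+)', head, re.I)
        if m:
            return m.group(1).decode('ascii')
        # 5. (a link element's charset attribute lives in the *referring*
        #    document, so there is nothing to check here)
        # Fall back to the old HTTP default
        return 'iso-8859-1'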
That's a sloppy mistake and should without doubt have been caught in testing.
"It was definitely a mistake on our part. The problem was not the encoding on our data feed, but the HTML document was sent with ISO-8859-1. The second we inserted the twitter text into the DOM, the browsers interpreted the UTF-8 string as ISO-8859-1. Our visualizations are hosted on other platforms, and in this case the server was not configured to send UTF-8 with text/html even though the HTML file was encoded as such. It was the only issue (albeit a pretty obvious one) during an otherwise flawless event. I apologize to President Obama, Speaker Boehner, and Jack Dorsey for the mistake. If the readers of the blog think it was stupid, imagine how we felt. dev environment != production environment. If we would have just included a <meta charset="utf-8"> in the HTML head, then this would not have occurred.
The big take away is don’t make assumptions about other platforms (especially when it comes to encoding), and always include charset meta tag." [emphasis mine]
Something like this could be done as an nginx/Apache module that detects the encoding of the data and transcodes the HTML output into UTF-8 - could be useful in some cases.
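For what it's worth, here's a rough sketch of that idea at the application layer, as WSGI middleware; using the chardet library for detection is my assumption, and real-world detection is far messier than this:

    import chardet  # third-party detector - an assumption, not the only choice

    class TranscodeToUTF8(object):
        # WSGI middleware: sniff the response encoding, re-emit as UTF-8.
        def __init__(self, app):
            self.app = app

        def __call__(self, environ, start_response):
            captured = {}
            def capture(status, headers, exc_info=None):
                captured['status'] = status
                captured['headers'] = headers
            body = b''.join(self.app(environ, capture))
            guess = chardet.detect(body).get('encoding') or 'iso-8859-1'
            body = body.decode(guess, errors='replace').encode('utf-8')
            headers = [(k, v) for k, v in captured['headers']
                       if k.lower() not in ('content-type', 'content-length')]
            headers.append(('Content-Type', 'text/html; charset=utf-8'))
            headers.append(('Content-Length', str(len(body))))
            start_response(captured['status'], headers)
            return [body]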
Every time you don't validate...God kills a kitten. Please, think of the kittens.
This has been a Public Service Announcement. Please code responsibly: http://validator.w3.org/
Edit: yes, there were definitely people curating the questions, so it could not have been as simple as pulling one JSON feed. http://www.theatlantic.com/politics/archive/2011/07/how-obam... https://twitter.com/#!/townhall/july-6-curators
...That's kind of the point. There was someone being sloppy there.
There is no excuse for not testing your app for basic i18n brokenness. It's 2011, not 1998.
Or maybe I've just been fortunate enough to be in an environment where an occasional goof of this caliber doesn't have any serious consequences.
Primarily, this means you don't have to support internationalization - which is hardly a bad thing, especially if you work at a startup, where worldwide distribution should be the last thing on your mind. When your product is rendered in over 80 scripts, including right-to-left languages, you can't afford to figure that encoding will sort itself out later.
So, if the two bodies have differing encodings (charsets), then the HTML body will look wrong. Unless you force Outlook to always use UTF-8 for encoding emails (which is a setting, but not the default) then you'll end up sending emails that will look garbled to your recipient.
This "differing charset" scenario actually happens pretty frequently, because of the following scenario:
a) You write an email (or reply to an existing email - actually it happens most with replies).
b) Outlook's text editor decides to insert a non-breaking space (code point U+00A0). Perhaps it generates HTML with &nbsp; but before transmission this eventually turns into the single character U+00A0, which UTF-8 encodes as the two bytes 0xC2 0xA0.
c) When generating the text-body, Outlook decides to just use a plain old space, so the text body is plain ASCII.
d) Outlook, in its cleverness, then says "ooh, I can 'conserve' encoding-ness and use plain old iso-8859-1 for the text body, but I need to use UTF-8 for the HTML body because of that non-ascii character"
e) Outlook generates this email (please excuse formatting woes due to HN):

    Content-Type: multipart/alternative; boundary="0016e64dbd929784310488b2b082"

    This is a multi-part message in MIME format.

    --0016e64dbd929784310488b2b082
    Content-Type: text/plain; charset="ISO-8859-1"

    yo yo

    --0016e64dbd929784310488b2b082
    Content-Type: text/html; charset="UTF-8"

    yo yo   (the space here is the U+00A0 from step b, sent as 0xC2 0xA0)

    --0016e64dbd929784310488b2b082--
When you view the above email in Outlook, you see "yoÂ yo" instead of "yo yo"
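That garbling is easy to reproduce; here's the failure mode in two lines of Python (3.x syntax):

    >>> body = "yo\u00a0yo".encode("utf-8")  # HTML part: U+00A0, sent as UTF-8
    >>> body.decode("iso-8859-1")            # ...but read back as ISO-8859-1
    'yoÂ\xa0yo'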
Just last week, my co-worker had to waste two days debugging a two-year-old Sphinx setup (the person who implemented it no longer works with us) because a Japanese user of our blogging service wasn't seeing the post he was looking for in our search feature. The problem was that the conduit feeding the Sphinx indexer was handling Unicode incorrectly (to be specific, it was deleting certain bytes wholesale because this guy believed them never to be valid, and this broke multi-byte sequences horribly). Those two days would have been much better spent working on his current project.
Additionally, not handling Unicode correctly can leave you open to certain types of security holes!
The biggest of these is probably that Unicode means _a string is not an array of bytes_, so naïve allocators for languages with byte-array strings (read: C and its brethren) are susceptible to buffer overruns when handed multi-byte Unicode sequences while expecting ASCII.
Here's another one: the Unicode character space contains many, many glyphs that look almost (or in some cases exactly) like ASCII characters. RFC 3492 defines Punycode, the encoding that the internationalized domain name (IDNA, RFC 3490) framework uses to represent non-ASCII code points with ASCII characters, and the most common implementations transparently convert between the two. This means that you could register "bаnkofamerica.com" (actually xn--bnkofamerica-x9j.com), put a phishing site there, and people would happily click on the identical-looking URL and give you their bank account. This was pointed out several years ago, and most modern browsers have mechanisms in place to defend against it, but your custom application might not unless you're careful about what you're doing.
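You can watch this happen with Python's built-in idna codec; the domain is the one from the paragraph above, with a Cyrillic а in place of the Latin a:

    >>> "b\u0430nkofamerica.com".encode("idna")  # U+0430: CYRILLIC SMALL LETTER A
    b'xn--bnkofamerica-x9j.com'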
There are plenty of blog posts and articles out there designed to tell you how to be safe when dealing with Unicode (and you should assume that you will be). I highly suggest you go read one.
Such mistakes were excusable in the 90s. They're the sign of an amateur programmer today.
Let me take this very excellent opportunity to say that we are looking to hire a full time "front end" developer. You'll get to work on badass projects like the Obama Town Hall. Ideally, you'd be located in Austin. Find me on Twitter @efalcao to learn more.
We're not lazy or sloppy... It basically boiled down to one server sending the right header... the production one didn't.
Unicode issues are sorta in the class of "gotcha" issues. They happen, you go "oh shit" and fix them right away. Our "oh shit" moment just happened to come at the most intense possible moment... in front of the president, with so many watching.
Wanted to reiterate once again: We're Hiring! @efalcao on twitter. Early stage startup looking for exceptional talent.
>>> print u"\u2019".encode("utf-8").decode("Windows-1252")
Because I wonder how difficult it would be to create a string that says something innocuous when read as UTF-8 (e.g., "When will you bring the troops home #AskObama") but, misread in a legacy encoding like Windows-1252, would come out as something totally different yet legible (e.g., "the secret priests would take great Cthulhu from his tomb to revive His subjects and resume his rule of earth...")
Sure there might be some misunderstandings with special punctuation characters as evidenced by the article, but such issues generally get low priority.
In countries where the language isn't representable in ASCII, we can't use US-UTF8; we have to resort to "real UTF-8", which means dealing with legacy systems that don't do UTF-8 (which is what happened in the article we're currently commenting on), dealing with browsers that lie about encoding, and dealing with the fact that a string's length isn't its byte length any more, even if it doesn't contain "fancy" punctuation characters.
All that makes me wish I could do US-UTF8 too :-)
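That last complaint - character length vs. byte length - takes four lines of Python to demonstrate:

    >>> s = "naïve"
    >>> len(s)                    # length in characters
    5
    >>> len(s.encode("utf-8"))    # length in bytes: ï is two bytes in UTF-8
    6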
Any string born of this heritage is likely to have single and double (curved) quotation marks. OpenOffice, Word, Pages – they all do it and are expected to.
For that reason, I consider them to be reasonably present in ordinary string data.
There is an option to not use "Smart Quotes", but it seems to be enabled by default.
BTW, a note on the term “smart quotes”: that originated when word processors became “smart” about transforming the easy-to-type (but incorrect) ' and " to their proper equivalents automatically. The quotes themselves aren’t smart…they’re just quotes.
Typography nerd out.
For many blogs in the hacker community, source code snippets inside <code> tags can also be given "smart quotes", which completely breaks any strings that may be present.
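For instance, paste a curly-quoted snippet into a Python 3 interpreter and it fails to parse at all (the exact error message varies by version):

    >>> print(“broken”)
    SyntaxError: invalid character '“' (U+201C)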
You forget that Wordpress is used by many people outside of the writing community. When writing, if the author cares about having their words properly typeset, they can do so themselves. Wordpress tries to be smart about it and convert them, but many people do not care about such features. The developers, however, do.
Also, may I remind everyone, downvotes on HN are not for disagreement, they are for factually incorrect statements.
Markdown + SmartyPants are a better solution IMO. (And you can install WP plugins that do this and that disable Wordpress’ default quote educator.)
Listen. Any time you use an encoding other than UTF-8, you are creating incompatibilities. If your stated intention is to facilitate communication, you are failing. You are a bad person. Stop doing it. The only possible excuse for using a non-UTF-8 encoding is to frustrate communication.
(It's too fucking bad HTTP mandates that the default charset is ISO-8859-1.)
"The problem was not the encoding on our data feed, but the HTML document was sent with ISO-8859-1. The second we inserted the twitter text into the DOM, the browsers interpreted the UTF-8 string as ISO-8859-1."
No. Questions were culled from Tweets with the #AskObama hashtag starting on June 30. Some of the questions did come in close to real-time. I think the most recent ones were 5-10 minutes old when presented to the president.
When I was in consulting doing software integration work, few things infuriated me more than client "bug reports" arriving in the form of an email containing a 15 megabyte MS Word doc with a bunch of un-annotated screenshots.
I really hope Google someday opens up the awesome bug report/screencap feature in Google+, that lets you highlight part of the screen and redact sensitive parts.
In OS X, the cmd-shift-4 (and 3) keystrokes, which screenshot straight to a file, are near life-transforming. Snap, drag the file into an email, and it's done. I'm sure there are utilities in Windows that do this, but having it built in is great - no apps to start or install.
That said, from a geek standpoint, I still prefer having the raw image file that I snapped in a folder someplace so I can refer to it later without having to go through sent emails, but that's just a personal preference.