Hacker News
“Bush Hid the Facts” (wikipedia.org)
206 points by pizza 5 months ago | 33 comments

I find the original post documenting the bug fairly entertaining.


Somebody immediately says it's an encoding error, several people "troubleshoot" by copying and pasting the resulting text into other programs, and it ends with someone three years later posting in all caps asking for help with a completely unrelated problem.

"As of Windows Vista, Notepad has been modified to use a different detection algorithm that does not exhibit the bug, but IsTextUnicode remains unchanged in the operating system, so any other tools that use the function are still affected."

I find this line astonishing. I was hoping there would be more to this story than "we fixed Notepad and left the OS function broken", but after following the reference it seems they just didn't think it worth fixing.
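The misdetection is easy to reproduce outside of Notepad. A minimal Python sketch (IsTextUnicode itself is a Windows C API; this only shows what a "guessed UTF-16LE" answer does to the bytes):

```python
# "Bush hid the facts" is 18 bytes of plain ASCII: an even number of bytes,
# with byte pairs that a loose statistical test can mistake for UTF-16LE.
data = b"Bush hid the facts"

# Reinterpreting each little-endian byte pair as one 16-bit code unit turns
# the English sentence into CJK characters, which is what Notepad displayed.
print(data.decode("utf-16-le"))  # 畂桳栠摩琠敨映捡獴
```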

The alternative is ‘my application could open this string just fine, and now it can’t anymore’. You can’t do heuristics without allowing mistakes and on the Windows platform backwards compatibility is considered very important.

I appreciate the responsibility for maintaining backward compatibility, but I got a different vibe from that blog post.


Or they took the Linus approach of "don't break user space" and forced everyone else to fix their code because they fixed theirs.

I'm glad UTF-16 is (albeit slowly) going the way of the dodo, becoming just an internal representation for a few older toolkits and programming languages. It was nonsensical how inconvenient it was compared to UTF-8. At least in UTF-8 you can't make the extremely wrong assumption that 1 (w)char == 1 character; or rather, you can, but it will blow up in your hands much sooner than with UTF-16.

Even so, I must say that the fact that Windows now puts a useless BOM at the beginning of every file is very annoying.
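For reference, the BOM is just the codepoint U+FEFF serialized at the start of the stream; a quick Python illustration (my own, nothing Windows-specific):

```python
# U+FEFF at the start of a file marks the byte order for UTF-16; for UTF-8
# it carries no byte-order information and merely tags the file as UTF-8.
print("\ufeff".encode("utf-8"))      # b'\xef\xbb\xbf'
print("\ufeff".encode("utf-16-le"))  # b'\xff\xfe'
print("\ufeff".encode("utf-16-be"))  # b'\xfe\xff'
```

Python's `utf-8-sig` codec exists precisely to read (and strip) that Windows-style UTF-8 BOM transparently.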

UTF-16 was an extension of the original "obvious" Unicode encoding, from back when Unicode started out defined as fitting all the world's languages into a 16-bit space (what is now the Basic Multilingual Plane). UTF-16 allowed (most) older UCS-2 documents to be upgraded for free, much the way ASCII documents are also valid UTF-8.

Some of the weirdness in the Unicode spec even comes from the need for backwards compatibility. 17 planes and 1,112,064 total usable codepoints... these numbers would not have been arrived at if the designers had had the foresight that 16 bits wasn't enough in the first place. They were derived from reassigning private-use-area codepoints into surrogate pairs for UTF-16. Unicode would probably have rather (maybe should have) started out as a 32-bit spec and avoided this mess from the get-go.
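The surrogate mechanism behind those numbers is simple arithmetic. A sketch of the standard UTF-16 split for a codepoint above U+FFFF (the algorithm is from the Unicode spec; the code itself is mine):

```python
def to_surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a codepoint in U+10000..U+10FFFF into a UTF-16 surrogate pair."""
    assert 0x10000 <= cp <= 0x10FFFF
    v = cp - 0x10000               # 20 bits of payload
    high = 0xD800 + (v >> 10)      # top 10 bits -> high surrogate
    low = 0xDC00 + (v & 0x3FF)     # low 10 bits -> low surrogate
    return high, low

# U+1F600 (an emoji outside the BMP) becomes the pair D83D DE00.
print([hex(u) for u in to_surrogate_pair(0x1F600)])  # ['0xd83d', '0xde00']
```

Two 1,024-codepoint surrogate ranges give 1,024 × 1,024 = 1,048,576 supplementary codepoints, i.e. exactly 16 extra planes on top of the BMP, at the cost of 2,048 codepoints that can never encode a character.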

You know, I’d love to read a book about the Unicode standardization process. The “16 bits is enough for all modern scripts” business must be one of the biggest failures of requirements analysis in computing history. And the fact that at the same time they were upsetting so many of their partners with Han unification tells me there must have been a real issue with personalities and project responsibilities.

Well, probably not 32 exactly. They had a 31-bit spec (UCS-4) at the time but decided they could simplify to 16 bits. UTF-8's original design also naturally extends to 31 bits.

> just an internal representation for a few older toolkits and programming languages

Maybe "older", but very much current: Qt's QString type internally uses UTF-16. https://doc.qt.io/qt-5/qstring.html

Yeah, because Qt started in the early '90s, before UCS-2 had turned out to be a fiasco, when it was being adopted in droves as the "newer" solution to all encoding issues: you simply swapped `char` for `wchar_t`, made a `w` version of every C and C++ I/O function, and that was it, right?

Sadly, it wasn't. 16 bits weren't enough, and the fact that a Unicode codepoint ≠ printed character (e.g. [è] can be either a single codepoint or a combination of the combining grave accent and the Latin letter [e]) meant there was basically no point in using 16/32-bit chars in the first place. By the time people really understood this it was almost the '00s, and stuff like Python, Windows, macOS (due to NeXTSTEP), Java, .NET and Qt were stuck. It's impossible to go back to plain `char` without annihilating backward compatibility, so everyone kept using wide chars internally.
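The codepoint ≠ character point can be shown with the Python stdlib (a small sketch):

```python
import unicodedata

precomposed = "\u00e8"   # 'è' as one codepoint
decomposed = "e\u0300"   # 'e' followed by COMBINING GRAVE ACCENT

# Same rendered character, different codepoint counts...
print(len(precomposed), len(decomposed))  # 1 2
# ...and equal only after normalization:
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```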

Fun fact, some of those languages and frameworks I mentioned never bothered switching completely to UTF-16 - for instance, Python now uses a weird mixture of ASCII and UCS2 internally.

> for instance, Python now uses a weird mixture of ASCII and UCS2 internally.

Really? Last I heard (PEP 393), the rule was: "8 bits if all codepoints are less than 256 (i.e. Latin-1); 16 bits if all codepoints are less than 2^16 (i.e. BMP); otherwise 32 bits". This means that text with all Latin-1 characters (which are approximately the first 256 codepoints of Unicode) will be stored internally as, well, Latin-1. This implies that ASCII strings are stored as ASCII.

Yep, that's what I meant. They use ASCII, UCS-2 or UCS-4 depending on the contents of the string. It doesn't make a lot of sense to me, but I guess they couldn't just throw 16-bit chars away.

I'm not 100% sure on this, but I don't think backwards compatibility mattered in this decision. They wanted memory-efficient and O(1)-indexable strings.
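CPython's flexible storage (PEP 393) is observable directly; the exact sizes below depend on the interpreter version, so only the ordering is meaningful:

```python
import sys

# PEP 393: CPython stores each string with 1, 2 or 4 bytes per codepoint,
# chosen by the widest codepoint the string contains.
narrow = "a" * 64           # all ASCII/Latin-1  -> 1 byte per char
bmp    = "\u4e2d" * 64      # BMP codepoints     -> 2 bytes per char
astral = "\U0001f600" * 64  # beyond the BMP     -> 4 bytes per char

print(sys.getsizeof(narrow), sys.getsizeof(bmp), sys.getsizeof(astral))
```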

And both Java and JavaScript, and thus any language derived from them. I'd go as far as saying that a majority of the code being written today uses UTF-16.

Since these languages don't even provide alternative APIs that are easy to use, it'll be some time before we stop having to suffer this.

I've seen a trend of those languages and APIs hiding UTF-16 away. They have to interoperate with everything else using just UTF-8, while the rest of the world doesn't give half a damn about UTF-16 (the fact that you also have to take endianness into account was, I think, one of the biggest crippling blows a 16-bit encoding could have received).
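The endianness problem, in one Python snippet (my illustration):

```python
# The same text serializes to different bytes depending on byte order, which
# is why UTF-16 needs a BOM or out-of-band agreement, and UTF-8 does not.
print("A".encode("utf-16-le"))  # b'A\x00'
print("A".encode("utf-16-be"))  # b'\x00A'
print("A".encode("utf-8"))      # b'A'  (one unambiguous serialization)
```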

> just an internal representation for a few older toolkits and programming languages.

One quite significant programming language being JavaScript. I know we have TextEncoder and TextDecoder now, and for the typical ways strings are used in JS this is a complete non-issue. But in any context where one wants to iterate over individual characters in JavaScript strings, there's a chance of ending up having to deal with UTF-16 quirks.

Iterating over characters is not that much more useful than iterating over code units, actually (the only sensible use cases I can think of right now are things you should never really have to worry about implementing at the application level, such as sorting or comparing strings). For many useful use cases you basically need grapheme clusters, which is a lot closer to what humans think of as a character. And once you're at that point it's much less relevant whether the code units are 8 or 16 bits wide.
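A Python sketch of the different "lengths" in play (the stdlib has no grapheme-cluster segmentation, so the human-visible count of 1 is stated, not computed):

```python
# A family emoji: three emoji joined by two ZERO WIDTH JOINERs. A grapheme
# segmenter (and a human) sees one character; every encoding sees more.
family = "\U0001f468\u200d\U0001f469\u200d\U0001f467"  # 👨‍👩‍👧

codepoints = len(family)                            # 5 codepoints
utf16_units = len(family.encode("utf-16-le")) // 2  # 8 UTF-16 code units
utf8_bytes = len(family.encode("utf-8"))            # 18 UTF-8 bytes
print(codepoints, utf16_units, utf8_bytes)  # 5 8 18
```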

Right, I forgot about both codePointAt and charCodeAt existing (the former being useful for what you are talking about, the latter what people used to have to deal with before ES2015).

This exactly. One of the biggest reasons UTF-32 doesn't make sense, IMHO, is not just its general overhead but the fact that it reiterates the broken concept of `char` == "glyph". That is never correct in Unicode, no matter the byte size, because so many characters can be spread over multiple codepoints. Iterating over codepoints is the only thing UTF-32 simplifies, and it is honestly kind of pointless. That's why it took a while for Rust to implement the `Chars` iterator.

Does anybody know what the Chinese characters say? I put them into Google Translate and got: "Kang 栠 栠 栠 栠 敨 ying picking mongoose". I'm guessing it's gibberish, but sometimes you can't trust machine translation.

I remember this one! I found it so fascinating and ended up spending a lot of time making up funny sentences which triggered the bug.

The wiki article does not say whether the phrasing of the example as "Bush hid the facts", instead of some other text, is related to the presidency of George W. Bush. It seems to be, or is it entirely unrelated?

IIRC it was related to 9/11 conspiracy theories. At the time, a lot of conspiracy garbage was forwarded through chain e-mail. This is one of two examples that I personally remember getting forwarded from a particular aunt (the "Bush Hid the Facts" text disappearing being "proof" for the whole inside job theory it presumably alludes to). The sort of stuff that aunt would nowadays share on that Face Website.

The other one that I remember her sending me was along the lines of "OMG if you enter that plane's flight number into MS Word 97 and set the font to Wingdings <variant whatever> you get a picture of a plane, two buildings, a skull and a Star of David!!1!eleven".

On a side note: If you entered the right combination of text into Excel 97, you could fly a plane over a fractal landscape ;-)

The kind of "logic" at work here is still used nowadays (e.g. the Sandy Hook school shooting conspiracy theory, with followers pointing towards the name appearing in a movie at the time). It still eludes me what the logic behind this is supposed to be. So if you plan a massive government conspiracy, you make sure to plant very precise, hidden clues all over the place in movies, TV shows, random office software and similar things years in advance, because... um.... why exactly?

Indeed. These were very big in middle school computer labs circa 2005.

I wonder if anyone with more historical perspective knows of older examples of these blatantly false theories from other eras, or if this type of conspiracy theory is unique to the digital age?

Modern flat earthism goes back to the mid-1800s.

> older examples of these blatantly false theories from other eras

The Salem Witch Trials come to mind

At the time it came out, George W. Bush was President, and there were not only 9/11 conspiracy theories, but also the false narratives about WMDs in Iraq.

Similarly, a story came out during the Windows 3.1 era about how typing NYC with your font set to Wingdings yielded a skull and crossbones, a Star of David, and a thumbs up sign, stoking fears of antisemitism in a time of religious tension and a revitalization of the right wing after the Waco siege. While it is true that the letters in NYC mapped to those symbols, it was not deliberate. In the successor font Webdings, the letters NYC were deliberately mapped -- to an eye, a heart, and a city skyline (referencing "I love New York").

Nothing to do with invading Iraq it turns out

The article describes the resulting Chinese text as "nonsensical", but clearly it means "Kang homo sapiens reflect pick up mongoose" (according to Google Translate).

In my experience, Google Translate often delivers quite good results for sentences that are spelled correctly and don't make much use of implied subjects/objects. But if there is as much as a typo, it has a particular habit of just dropping words or even entire sentence fragments that it can't quite make sense of, and then playing fast and loose when it comes to interpreting the remaining verbs and re-inserting implied subjects, so it can somehow still shoehorn it into a reasonable English sentence.

Edit: Minor clarification: I'm referring to past experience translating Chinese to English in particular.

There's an analysis of each character linked from the talk page


How think sentence you that do is nonsensical not
