
The "Bush hid the facts" bug - nateberkopec
https://en.wikipedia.org/wiki/Bush_hid_the_facts
======
pierrec
I think the article doesn't really clarify the reason these bugs exist. After
all, is it _really_ a bug? What if I so happen to be working with Chinese
Unicode files without BOM that are also valid ASCII files? Then it would seem
to me that Notepad was previously behaving correctly, and ever since Vista
they _introduced_ a bug!

No, this isn't a low-level bug tied to any particular Windows API function.
This is the inevitable result of the folly of trying to guess a file's
encoding. When we find ourselves cornered into doing something this
ridiculous, it becomes apparent that we, as a society of programmers, are
extremely disorganized.
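
To make the ambiguity concrete, here is a minimal Python sketch (not Notepad's or any Windows API's actual logic) that reinterprets the 18 ASCII bytes of the famous string as UTF-16LE. The helper name `is_cjk_unified` is made up for illustration, and the range check only covers the basic CJK Unified Ideographs block.

```python
# Reinterpret the ASCII bytes of the famous string as UTF-16LE code units.
ascii_bytes = "Bush hid the facts".encode("ascii")
as_utf16 = ascii_bytes.decode("utf-16-le")


def is_cjk_unified(ch):
    """True if ch lies in the basic CJK Unified Ideographs block (U+4E00..U+9FFF)."""
    return 0x4E00 <= ord(ch) <= 0x9FFF


print(len(ascii_bytes), "ASCII bytes ->", len(as_utf16), "UTF-16 code units")
# Print code points rather than glyphs so this works on any terminal.
print([f"U+{ord(ch):04X}" for ch in as_utf16])
print("all CJK ideographs:", all(is_cjk_unified(ch) for ch in as_utf16))
```

Every byte pair happens to land on an ideograph, so both readings are structurally valid and only a guess about which is more plausible can separate them.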

~~~
patio11
_What if I so happen to be working with Chinese Unicode files without BOM that
are also valid ASCII files?_

What's the longest natural Chinese substring you can find for which this is
the case? When we ran the encoding-and-language-detection heuristic we
developed for our research work (this was 10+ years ago), I think the longest
was 5 characters -- there was no substring of 6+ characters (12 bytes of
UTF-16) which could reasonably be written in ASCII.

This is because much of ASCII is unprintable, more-than-doubly so if you're
concerned strictly with 7-bit ASCII as opposed to e.g. Latin-1. When's the
last time you saw a string contain 0x18 ("cancel"), 0x07 ("bell"), etc?

An assortment of 0x..18 characters for your amusement: 予,先,券,単,又. I picked
ones which are included in commonly used words in Japanese. Seeing any one of
these once in UTF-16 is dispositive that the bytestream is not ASCII.

This problem is worth thinking about (which is why a customer asked a team of
CS researchers and linguists to think about it, including one barely competent
programmer who nonetheless had reasonably good intuitions for what character
distributions looked like), but it turns out to be much, much less hard than
many people originally expected.

Anyhow, long story short: make a histogram of the bytes, dot product with the
histogram signature you have of a few large corpora in decent-guess-based-on-
apriori-knowledge languages/encoding, normalize. The screamingly obvious
candidate is almost always the winner.
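
A rough Python sketch of that recipe, under loud assumptions: the `SAMPLES` below are one-sentence stand-ins for the large corpora described above, the names `byte_histogram`, `cosine` and `guess` are invented for illustration, and "dot product, normalize" is read here as cosine similarity. It is not the heuristic we actually shipped.

```python
import math
from collections import Counter


def byte_histogram(data):
    """Count how often each of the 256 possible byte values occurs."""
    counts = Counter(data)
    return [counts.get(b, 0) for b in range(256)]


def cosine(a, b):
    """Dot product of two histograms, normalized by their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical, tiny stand-ins for real corpora (assumption, not real data sets).
ENGLISH = ("The quick brown fox jumps over the lazy dog. "
           "Notepad guessed the wrong encoding for this short text file.")
CHINESE = "2006年大西洋飓风季于2006年6月1日正式开始，同年11月30日结束"

SAMPLES = {
    ("English", "ascii"): ENGLISH.encode("ascii"),
    ("Chinese", "utf-8"): CHINESE.encode("utf-8"),
    ("Chinese", "utf-16-le"): CHINESE.encode("utf-16-le"),
}
SIGNATURES = {key: byte_histogram(raw) for key, raw in SAMPLES.items()}


def guess(data):
    """Return the (language, encoding) whose signature best matches data."""
    hist = byte_histogram(data)
    return max(SIGNATURES, key=lambda key: cosine(hist, SIGNATURES[key]))


print(guess(b"Bush hid the facts"))
print(guess("大西洋飓风季".encode("utf-16-le")))
```

With real corpora the signatures would be built once and stored; only the histogram of the unknown input is computed per file.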

If you want to do it really, really quickly you can even evaluate this
heuristic with a Bloom filter in an FPGA.

~~~
patio11
Elaboration since I like geeking out about this subject and demos like this
always got the point across to non-Unicode-fluent developers.

Here's a sample of Chinese language text, sourced by my favorite method: grab
whatever is on Wikipedia's homepage.

2006年大西洋飓风季时间轴中记录有全年大西洋盆地所有热带和亚热带气旋形成、增强、减弱、登陆、转变成温带气旋以及消散的具体信息。2006年大西洋飓风季于2006年6月1日正式开始，同年11月30日结束，传统上这样的日期界定了一年中绝大多数热带气旋在大西洋形成的时间段，这一飓风季是继2001年大西洋飓风季以来第一个没有任何一场飓风在美国登陆的大西洋飓风季，也是继1994年大西洋飓风季以来第一次在整个十月份都没有热带气旋形成。美国国家飓风中心每年都会对前一年飓风季的所有天气系统进行重新分析，并根据结果更新其风暴数据库，因此时间轴中还包括实际操作中没有发布的信息。包括最大持续风速、位置、距离在内的所有数字都是经四舍五入换算成整数。2006年大西洋飓风季的活动程度与前一年相比远远不及。起初气象学家预计在极其活跃的2005年大西洋飓风季后，2006年的活动程度应该只会略低。然而，2006年迅速形成的厄尔尼诺-
南方涛动现象、大西洋热带海域上空的撒哈拉空气层，以及以百慕大为中心的亚速尔高压这一强大二级高气压的持续存在，都令2006年大西洋飓风季的活动程度大幅降低。从10月2日以后一直到飓风季结束都完全没有热带气旋形成。2005年12月底形成的热带风暴泽塔一直持续到了2006年1月初，成为有纪录以来第二个跨日历年的大西洋风暴。虽然其存在时间不在任何一年飓风季的正式时间段里，但仍然可以视为2005和2006年大西洋飓风季的一部分。

Do you have an intuition for what that looks like if you interpret it as
ASCII? No? Just guess: "almost plausibly an English document", "gibberish but
mostly ASCII", "absolutely zero probability of being mistaken for ASCII."

I whipped up a quick Ruby script:

```ruby
require 'colorize'

chinese = File.read("/tmp/chinese.txt")
puts chinese.bytes.map { |b|
  # Printable ASCII (0x20-0x7E) in blue; anything else as a red question mark.
  (0x20..0x7E).cover?(b) ? b.chr.blue : "?".red
}.join
```

which reads the bytes of that string in a Unicode encoding (UTF-8) as if they
were ASCII and renders the output blue where a byte collides with a printable
ASCII character and as a red question mark otherwise.

Did this match your prediction?

[https://www.evernote.com/l/Aaf93wCQGulAdZAtZjBA-8st_zgF_BKDl...](https://www.evernote.com/l/Aaf93wCQGulAdZAtZjBA-8st_zgF_BKDlv8B/image.png)

If we first convert the string to UTF-16, it's a little less screamingly
obvious but, well:

[https://www.evernote.com/l/AafWlxVe1CRIRKky5fDXLGYaVSFUetnXb...](https://www.evernote.com/l/AafWlxVe1CRIRKky5fDXLGYaVSFUetnXbmQB/image.png)
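
For anyone who can't load the screenshots, the same comparison can be put as a number. This is a small Python sketch, separate from the Ruby script above; the function name `printable_ascii_fraction` and the one-sentence excerpt are choices made for illustration, and "printable" is assumed to mean bytes 0x20 through 0x7E.

```python
def printable_ascii_fraction(data):
    """Fraction of bytes that fall in the printable ASCII range 0x20..0x7E."""
    if not data:
        return 0.0
    printable = sum(1 for b in data if 0x20 <= b <= 0x7E)
    return printable / len(data)


# One sentence taken from the Wikipedia sample quoted above.
text = "2006年大西洋飓风季于2006年6月1日正式开始，同年11月30日结束"

for encoding in ("utf-8", "utf-16-le"):
    raw = text.encode(encoding)
    share = printable_ascii_fraction(raw)
    print(f"{encoding}: {share:.0%} of {len(raw)} bytes are printable ASCII")
```

In UTF-8 only the digits survive as printable ASCII, since every byte of a Chinese character is 0x80 or above. In UTF-16 some bytes of the Chinese characters do land in the printable range, which is why that case looks less obvious.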

~~~
StavrosK
Your post confuses me. The first screenshot is pretty much exactly how I would
translate the Chinese string as well.

------
ratsbane
Donald Knuth's annual Christmas lecture at Stanford was just released on
YouTube a few days ago. It's about comma-free codes, a similar idea to this
bug:
[https://www.youtube.com/watch?v=48iJx8FVuis](https://www.youtube.com/watch?v=48iJx8FVuis)

------
yuhong
Note that Wikipedia is not exactly correct on Vista and later. See
[http://www.siao2.com/2008/03/25/8334796.aspx](http://www.siao2.com/2008/03/25/8334796.aspx)
for the real story.

~~~
0942v8653
That page doesn't say anything about Vista or later (or I'm missing it)

Edit: I was looking at one of the sources of the Wikipedia article and mistook
the tab for that post. My bad.

~~~
Arnavion
The Wikipedia article mentions applications other than Notepad and implies
IsTextUnicode was fixed in Vista.

Kaplan's blog post explains that the change in Vista was actually in Notepad
(it now uses a different algorithm) and that IsTextUnicode was left broken, so
the other applications mentioned on the Wikipedia page would presumably still
be broken on Vista and above.

------
nyolfen
my favorite thing about this is the apparent fact that someone typed "bush hid
the facts" into a notepad document then saved it. "oh man, this is big...
better write this down..."

~~~
pilsetnieks
Reminds me of this crap: [http://gizmodo.com/wingdings-predicted-9-11-a-truthers-tale-...](http://gizmodo.com/wingdings-predicted-9-11-a-truthers-tale-1679759324)

~~~
wwwet
Yeah except that "Bush hid the facts" is entirely true.

------
fleitz
I recall once importing code from mozilla to solve just this problem, charset
detection. I wonder if it has similar problems.

------
thearn4
Reminds me of the "Wingdings Predicted 9/11" bug:

[http://gizmodo.com/wingdings-predicted-9-11-a-truthers-tale-...](http://gizmodo.com/wingdings-predicted-9-11-a-truthers-tale-1679759324)

------
jgalt212
chardet gets this one right, but may be confused by other inputs

[http://chardet.readthedocs.org/en/latest/usage.html](http://chardet.readthedocs.org/en/latest/usage.html)

import chardet

print chardet.detect('Bush hid the facts')

>>> {'confidence': 1.0, 'encoding': 'ascii'}

It might be fun to run Hypothesis on this to see what if any minimal ASCII set
is guessed wrong by chardet.

------
orionblastar
Ironic that this came after the DOJ stopped investigating Microsoft for
abusing their monopoly on Windows when Bush came into office.

I remember watching the DOJ videos of Bill Gates drinking Pepsi and asking
which type of Java they meant when they talked about Microsoft competing
against it.

