The bar charts used to illustrate that article are terrible. They present raw counts for each font, but the fonts were not each shown to the same number of people: the totals varied from 7,477 (CM) to 7,699 (Helvetica), which is a pretty big swing given the other numbers they're displaying. In fact, when you run the percentages, CM has a higher percentage of agreement than Baskerville (62.6% to 62.4%)!
When we turn to the "weighted" scores, which don't follow any clear statistical methodology that I'm aware of, the bar chart is again presented with counts rather than proportions, and this time with an egregiously misleading scale that makes it seem like CS gets half the score of gravitas-y fonts like CM and Baskerville, when in fact its score is only about 5% lower.
Finally we get to the "p-value for each font". That's... not how p-values work. The author admits that his next statement is "grossly oversimplified", but there's a difference between simplification and nonsense. He says that "the p-value for Baskerville is 0.0068." What does that mean? What test was being performed there? Can we have a little hint as to what the null and alternative hypotheses were?
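For comparison, here is a minimal sketch of what a p-value with stated hypotheses could look like. It assumes a pooled two-proportion z-test, which is my choice for illustration and not necessarily what the article's author ran. The null hypothesis is that two fonts have the same true agreement rate; the alternative is that the rates differ. The counts are the CM and Baskerville agree/total figures quoted elsewhere in this thread, and the function name is made up for this example.

```python
import math

def two_proportion_z_test(agree_a, n_a, agree_b, n_b):
    """Pooled two-proportion z-test (illustrative; not the article's method).

    H0: both fonts have the same true agreement rate.
    H1: the agreement rates differ (two-sided).
    Returns (z statistic, two-sided p-value).
    """
    p_a = agree_a / n_a
    p_b = agree_b / n_b
    # Under H0 both samples share one rate, so pool them to estimate it.
    pooled = (agree_a + agree_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_err
    # Two-sided tail area of the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# CM: 4680 agree out of 7477; Baskerville: 4703 agree out of 7536.
z, p_value = two_proportion_z_test(4680, 7477, 4703, 7536)
print(f"z = {z:.3f}, two-sided p = {p_value:.3f}")
```

Whatever test was actually used, a p-value only means something once it is attached to a comparison like this one: two named groups, a null hypothesis, and an alternative. For this particular pair the p-value comes out large, so the data give no evidence that the two fonts differ.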
The biggest problem I see is that Baskerville isn't standard on Windows. They may have specified a Baskerville font-family, but that's not necessarily what the reader saw. The original test article displayed the asteroid passage as text, not as an image, so unless they accounted for the rendering differences among operating systems, the entire test seems questionable.
I'm on a Vista system at the moment, and it does have a Baskerville Old Face variant, but "Gold has an atomic number of 79" does not look like the text shown in the article.
I think you might be on to something, but I am not sure.
One thing that seemed odd to me was that Comic Sans has a negative association that is heavily biased by cultural factors rather than any real, intrinsic human determinant. I would suspect that all the typefaces are going to be more heavily influenced by social norms than anything else. But that's just a hunch.
I'm also confused about the p-value. Also, I'd love for someone who really knows statistics to explain: how are measures like confidence intervals and p-values meaningful when (1) your only two choices are Agree and Disagree, and (2) your weighted agree/disagree curve is the opposite of a normal curve (low in the middle and high on the edges)?
Thanks for answering! I'm sure it's just my lack of elementary statistics, but I still don't understand re #2. I get that the weights aren't important, but the curve shape seems to invalidate the notion of a p-value, because each font had more "strongly X"s than "weakly X"s. When your sample results are clustered at the extremes, what do you do to apply these statistical techniques?
Here is something I tried: arbitrarily choose a font as the "control," in my case Georgia. Then to measure each other font, say Baskerville, randomly pair a Georgia data point with a Baskerville data point, and measure how much Baskerville improved agreement. (Comparing different shuffles, each font's mean improvement is pretty stable.) That gives what looks like a normal curve, or at least it is big in the middle and small on the edges. So now I can find a p-value, and my null hypothesis is that changing the font has no effect. I ran a t-test with R, and I got a p-value of 0.2069. So much less impressive than the article claims. But I assume my approach is wildly invalid, so what is the right approach?
I thought re-analyzing Morris's data would be a fun "homework" assignment to give myself as I try to learn statistics. It looks like the simplest approach is nothing like what I proposed above, but a "two-sample t-test." I performed that analysis and wrote it up here, if anyone is interested:
Check again: 4680+2797 is 7477 and 4680/7477 is 62.591%, for CM, and 4703+2833 is 7536 and 4703/7536 is 62.407%, for Baskerville. The raw number of Baskerville agreements is higher than the raw number of CM, so its bar is higher, but that disparity is just what I'm complaining about (well, among other things).
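The arithmetic above can be checked with a few lines of Python. This is only a sketch: the agree/disagree counts are the ones quoted in this comment, and the helper name is made up for illustration.

```python
# Agree/disagree counts as quoted in the comment above.
counts = {
    "CM": (4680, 2797),
    "Baskerville": (4703, 2833),
}

def agreement_rate(agree, disagree):
    """Return (total respondents, fraction of them who agreed)."""
    total = agree + disagree
    return total, agree / total

for font, (agree, disagree) in counts.items():
    total, rate = agreement_rate(agree, disagree)
    print(f"{font}: {agree}/{total} = {rate:.3%}")
```

Baskerville has the larger raw count of agreements, but CM has the larger proportion, which is why a chart of raw counts ranks the two fonts in the opposite order from a chart of percentages.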
This is how geeks get such a bad reputation. The bar charts do appear directly below stacked bar charts, though I agree the weighted charts without axis markings are content-free. The rest is a pedantic quibble by someone who happens to understand statistical jargon. This is a newspaper article, not a paper in a peer-reviewed journal.