The bar charts used to illustrate that article are terrible. They present raw counts for each font, but the fonts were not shown to equal numbers of people: the totals range from 7,477 (CM) to 7,699 (Helvetica), which is a pretty big swing given the other numbers they're displaying. In fact, when you run the percentages, CM has a higher rate of agreement than Baskerville (62.6% vs. 62.4%)!
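Running the percentages takes one line of R. The totals for CM and Helvetica are from the article; the agreement counts (and Baskerville's total) are placeholders I picked to be consistent with the figures quoted above:

```r
# Totals shown per font (CM and Helvetica from the article; Baskerville's
# total and all agreement counts are hypothetical, chosen to match ~62%).
shown  <- c(CM = 7477, Baskerville = 7500, Helvetica = 7699)
agreed <- c(CM = 4681, Baskerville = 4680, Helvetica = 4680)

# Proportions are what the bar chart should have plotted, not raw counts.
round(100 * agreed / shown, 1)
#>          CM Baskerville   Helvetica
#>        62.6        62.4        60.8
```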
When we turn to the "weighted" scores, which don't follow any clear statistical methodology that I'm aware of, the bar chart is again presented with counts rather than proportions, and this time with an egregiously misleading scale. It makes it look like Comic Sans gets half the score of gravitas-y fonts like CM and Baskerville, when in fact its score is only about 5% lower.
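To see how much the baseline choice matters, here is a toy reproduction in R. The scores are made up, spaced about 5% apart; the second plot truncates the axis the way the article's chart does:

```r
# Invented weighted scores, roughly 5% apart.
scores <- c(ComicSans = 9500, CM = 10000, Baskerville = 10050)

# Honest baseline at zero: the three bars look nearly identical.
barplot(scores, ylim = c(0, 11000), main = "Baseline at zero")

# Truncated baseline: the identical data now shows Comic Sans at
# half the visible height of CM, exactly the article's distortion.
barplot(scores, ylim = c(9000, 10100), xpd = FALSE,
        main = "Baseline at 9000")
```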
Finally we get to the "p-value for each font". That's... not how p-values work: a p-value belongs to a hypothesis test, not to a font. The author admits that his next statement is "grossly oversimplified", but there's a difference between simplification and nonsense. He says that "the p-value for Baskerville is 0.0068." What does that mean? What test was being performed there? Can we have a little hint as to what the null and alternative hypotheses were?
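For contrast, here is the kind of test that would actually produce a defensible p-value: a two-proportion test of H0 "Baskerville's agreement rate equals the pooled rate of the other fonts" against a two-sided alternative. The counts are placeholders, since the article doesn't publish them:

```r
# Hypothetical counts: agreements and totals for Baskerville vs. all
# other fonts pooled. H0: equal agreement rates; H1: the rates differ.
agreed <- c(Baskerville = 4680, Others = 27500)
shown  <- c(Baskerville = 7500, Others = 45000)

prop.test(agreed, shown)
# The resulting p-value is the probability of a gap in sample
# proportions at least this large if H0 were true. It is a property
# of the test and the data, not of the font.
```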
I'm on a Vista system at the moment, and it does have a Baskerville Old Face variant, but "Gold has an atomic number of 79" does not look like the text shown in the article.
One thing that seemed odd to me is that Comic Sans's negative association is heavily driven by cultural factors rather than by any real, intrinsic human determinant. I would suspect that all the typefaces are going to be more heavily influenced by social norms than by anything else. But that's just a hunch.
On your #1, you could do a comparison of % Agree for one font versus another. For #2, the weights on the scale don't matter; you can just compare the mean responses between the groups.
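Something like this, where the counts and the Likert-style responses are made up for illustration:

```r
# #1: compare % Agree between two fonts with a two-proportion test
# (counts invented for illustration).
prop.test(x = c(4681, 4680), n = c(7477, 7500))

# #2: skip the arbitrary weights and t-test the raw 1-5 responses
# between the two font groups (simulated here).
set.seed(1)
baskerville <- sample(1:5, 200, replace = TRUE,
                      prob = c(.10, .15, .20, .25, .30))
georgia     <- sample(1:5, 200, replace = TRUE,
                      prob = c(.15, .20, .20, .25, .20))
t.test(baskerville, georgia)
```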
Here is something I tried: arbitrarily choose a font as the "control", in my case Georgia. Then, to measure each other font, say Baskerville, randomly pair a Georgia data point with a Baskerville data point and measure how much Baskerville improved agreement. (Comparing different shuffles, each font's mean improvement is pretty stable.) That gives what looks like a normal curve, or at least it is big in the middle and small on the edges. So now I can find a p-value, and my null hypothesis is that changing the font has no effect. I ran a t-test with R and got a p-value of 0.2069. So, much less impressive than the article claims. But I assume my approach is wildly invalid, so what is the right approach?
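In case it helps, the core of what I ran looked roughly like this, with simulated 0/1 agreement data standing in for the article's (the rates are invented), and skipping the pairing step since the pairs were random anyway:

```r
set.seed(42)
# Simulated agree (1) / disagree (0) responses; the true rates here
# are made up, not the study's.
georgia     <- rbinom(7500, 1, 0.610)
baskerville <- rbinom(7500, 1, 0.624)

# Unpaired comparison of agreement rates.
# H0: both fonts have the same rate; H1: the rates differ.
t.test(baskerville, georgia)

# With 0/1 data this is essentially a two-proportion test, so
# prop.test gives the same story with a cleaner framing.
prop.test(c(sum(baskerville), sum(georgia)), c(7500, 7500))
```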