Take the above quote, which compares "1.9% of Arnolds are accountants" to the "0.55% of Shanes [are accountants]". They're implying that the probability of being an accountant (J), given that one's name (N) is Arnold, is above the expected probability of being an accountant in general. So they're looking for a high P[J|N]/P[J].
Now compare with what we were expecting to see. We assumed the chart showed, for a given job, names which had a higher incidence than normal. I.e., we were looking for a high P[N|J]/P[N].
Guess what. P[J|N]/P[J] = P[N|J]/P[N] by Bayes' Theorem [1], since both sides equal P[J,N]/(P[J]P[N]). These are EXACTLY the same metric! So their technique, and the chart, are correct. (And my original post, below, was wrong.)
(Not saying anything about causation here, and I don't think they were either.)
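For anyone who wants to check the identity numerically, here is a minimal sketch. The counts are made up for illustration (chosen only so that 1.9% of Arnolds and 0.55% of Shanes are accountants); they are not the article's data, and the helper names are hypothetical.

```python
from fractions import Fraction

# Hypothetical (name, job) counts; NOT the article's data.
counts = {
    ("Arnold", "accountant"): 19,
    ("Arnold", "farmer"): 981,
    ("Shane", "accountant"): 11,
    ("Shane", "farmer"): 1989,
}
total = sum(counts.values())


def p_name(n):
    """P[N]: share of the sample with name n."""
    return Fraction(sum(c for (name, _), c in counts.items() if name == n), total)


def p_job(j):
    """P[J]: share of the sample with job j."""
    return Fraction(sum(c for (_, job), c in counts.items() if job == j), total)


def p_joint(n, j):
    """P[J,N]: share of the sample with both name n and job j."""
    return Fraction(counts.get((n, j), 0), total)


def lift_job_given_name(n, j):
    """P[J|N] / P[J]."""
    return (p_joint(n, j) / p_name(n)) / p_job(j)


def lift_name_given_job(n, j):
    """P[N|J] / P[N]."""
    return (p_joint(n, j) / p_job(j)) / p_name(n)


for n in ("Arnold", "Shane"):
    a = lift_job_given_name(n, "accountant")
    b = lift_name_given_job(n, "accountant")
    print(n, a, b, a == b)
```

With these counts both ratios come out to 19/10 for Arnold and 11/20 for Shane, so ranking names by either one gives the same order.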
Yes, that is completely backward. 99% of Arnolds could go into farming, yet still the 1% who go into accounting dominate that field, and hence show up on this chart.
EDIT: but you missed the first half of that quote: "In our sample of two and a half million people, a whopping 1.9% of Arnolds are accountants. Contrast that with just 0.55% of Shanes." So I think the quote is correct. Makes me wonder if their chart is backward. (i.e., they put "Arnold" under "Accountant" because Arnolds are likely to be accountants, not because accountants are likely to be Arnolds, as the grouping implies).
After some pencil work I was convinced of EDIT2, but now I'm confused.
Before we get to my confusion, here is one interesting situation I found along the way. Call J1 "job 1" and N1 "name 1", list the names of the people in each job as a column, and use the P(N|J)/P(N) metric (or its equivalent):
J1 J2
------
N1 N1
N1 N1
N1 N3
N2
If we limit each job to its top name, N1 doesn't get attached to either despite being the most common name in each. J1 gets N2 while J2 gets N3.
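The toy table can be checked with a short sketch. The dictionary below is just the table transcribed by column, and `lift` and `top_name_by_lift` are names I made up for illustration.

```python
from collections import Counter
from fractions import Fraction

# The toy table: each job maps to the names of the people holding it.
people = {
    "J1": ["N1", "N1", "N1", "N2"],
    "J2": ["N1", "N1", "N3"],
}

# How often each name occurs across the whole sample.
name_counts = Counter(name for names in people.values() for name in names)
total = sum(name_counts.values())


def lift(name, job):
    """P(N|J) / P(N), computed exactly."""
    p_n_given_j = Fraction(people[job].count(name), len(people[job]))
    p_n = Fraction(name_counts[name], total)
    return p_n_given_j / p_n


def top_name_by_lift(job):
    """Name with the highest lift among the names that occur in this job."""
    return max(set(people[job]), key=lambda name: lift(name, job))


for job in people:
    print(job, top_name_by_lift(job))
```

N1 has a lift of 21/20 in J1 and 14/15 in J2, while N2 scores 7/4 in J1 and N3 scores 7/3 in J2, which is why the rarer names win.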
If this is the method, then don't use this chart to guess the names of people in a profession; use it to guess the professions of people whose names you know. "Guy" may be listed for investment bankers, but an investment banker is still more likely to be named Dave, even though if you meet a Dave he's likely to be a mechanic.
For the same situation, say we instead take the top value of P(N|J): then J1 gets N1 and so does J2. Taking the top value of P(J|N) goes back to J1 getting N2 and J2 getting N3, with N1 left out in the cold.
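Here is a rough sketch of those two alternative rankings on the same toy data. The function names are hypothetical, and "top name" is taken to mean the name that maximizes the metric for a fixed job.

```python
from fractions import Fraction

# Same toy data: each job maps to the names of the people holding it.
people = {
    "J1": ["N1", "N1", "N1", "N2"],
    "J2": ["N1", "N1", "N3"],
}
names = sorted({name for holders in people.values() for name in holders})


def p_name_given_job(name, job):
    """P(N|J): share of people in this job who have this name."""
    return Fraction(people[job].count(name), len(people[job]))


def p_job_given_name(name, job):
    """P(J|N): share of people with this name who hold this job."""
    with_name = sum(holders.count(name) for holders in people.values())
    return Fraction(people[job].count(name), with_name)


def top_name(job, metric):
    """Name that maximizes the given metric for a fixed job."""
    return max(names, key=lambda name: metric(name, job))


by_p_n_given_j = {job: top_name(job, p_name_given_job) for job in people}
by_p_j_given_n = {job: top_name(job, p_job_given_name) for job in people}
print(by_p_n_given_j)
print(by_p_j_given_n)
```

For a fixed job, P(J|N) and P(N|J)/P(N) differ only by the constant factor P(J), so they always pick the same name; raw P(N|J) is the one that favors common names.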
But here's where it's unclear. I think this:
> In our sample of two and a half million people, a whopping 1.9% of Arnolds are accountants. Contrast that with just 0.55% of Shanes. Arnolds therefore appear to have a much higher tendency to be accountants than Shanes
implies they're listing the top P(J|N) values for each J. (Edit: they're comparing Arnolds to Shanes, not Arnolds to all accountants?) I think your approach is the most consistent, but is it what they're using?
[1] http://en.wikipedia.org/wiki/Bayes%27_theorem