EDIT2: OK folks we're smart, let's use MATH. Take above quote, which compares "1...

stolio · on Jan 14, 2015

After some pencil work I was convinced of EDIT2, but now I'm confused.

Before we get to my confusion, here's one interesting thing I found along the way. Call J1 "job 1" and N1 "name 1", use the P(N|J)/P(N) metric (or its equivalent), and consider the following situation:

  J1  J2
  ------
  N1  N1
  N1  N1
  N1  N3
  N2

If we limit each job to its top name, N1 doesn't get attached to either, despite being the most common name in each. J1 gets N2 while J2 gets N3. (N1 scores 21/20 for J1 and 14/15 for J2, against 7/4 for N2 in J1 and 7/3 for N3 in J2.)

If this is the method, then don't use this chart to guess the names of people in a profession; use it to guess the professions of people whose names you know. "Guy" may be listed for investment bankers, but an investment banker is still more likely to be named Dave, while any Dave you meet is likely to be a mechanic.

For the same situation, say we use the top value of P(N|J): then J1 gets N1 and so does J2. With P(J|N) we're back to J1 getting N2 and J2 getting N3, with N1 left out in the cold.
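For anyone who wants to check the pencil work, here's a minimal sketch of the toy table in Python. The `people` dict and the helper names (`lift`, `p_name_given_job`, `p_job_given_name`, `top_name`) are mine, made up for illustration; this is not the article's code, and I'm only assuming these three metrics are the candidates.

```python
from collections import Counter
from fractions import Fraction

# Toy data from the table above: each job maps to the names of the people in it.
people = {
    "J1": ["N1", "N1", "N1", "N2"],
    "J2": ["N1", "N1", "N3"],
}

total = sum(len(names) for names in people.values())
name_counts = Counter(n for names in people.values() for n in names)

def p_name_given_job(name, job):
    """P(N|J): share of people in the job who have this name."""
    return Fraction(people[job].count(name), len(people[job]))

def p_job_given_name(name, job):
    """P(J|N): share of people with this name who hold the job."""
    return Fraction(people[job].count(name), name_counts[name])

def lift(name, job):
    """P(N|J) / P(N): how over-represented the name is in the job."""
    return p_name_given_job(name, job) / Fraction(name_counts[name], total)

def top_name(metric, job):
    """Name in the job with the highest score under the given metric."""
    return max(sorted(set(people[job])), key=lambda n: metric(n, job))

# Top name per job under each of the three metrics.
results = {
    metric.__name__: {job: top_name(metric, job) for job in people}
    for metric in (lift, p_name_given_job, p_job_given_name)
}
print(results)
```

Within a single job, P(N|J)/P(N) equals P(J|N)/P(J) by Bayes' rule, and P(J) is the same for every name, so ranking by the ratio and ranking by P(J|N) have to pick the same top name.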

But here's where it's unclear. I think this:

> In our sample of two and a half million people, a whopping 1.9% of Arnolds are accountants. Contrast that with just 0.55% of Shanes. Arnolds therefore appear to have a much higher tendency to be accountants than Shanes

implies they're listing the top P(J|N) values for each J. (edit: they're comparing Arnolds to Shanes, not Arnolds to all accountants?) I think your approach is the most consistent, but is it what they're using?