
Determining Gender of a Name with 80% Accuracy Using Only Three Features - max_
http://blog.ayoungprogrammer.com/2016/04/determining-gender-of-name-with-80.html
======
Rangi42
They never say _which_ last letters are associated with which sex. I'm
guessing that "o" is usually male and "a" is female, but, what about the other
24? Is "n" usually male ("-son")? Does "y" have a bias?

(Regarding sex vs. gender: yes, they aren't perfectly correlated and some
people don't stay with their assigned gender, but AFAIK they often/usually
choose a new name which probably matches their gender? Thus why "dead names"
are a thing.)
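
One way to answer the "which letters" question directly is to tally births by last letter. A minimal sketch, assuming records shaped like (name, sex, count) rows; `last_letter_share` is a hypothetical helper name and the rows below are invented toy numbers, not real data:

```python
from collections import defaultdict

def last_letter_share(records):
    """Return {last_letter: fraction of births recorded as female}."""
    totals = defaultdict(lambda: {"F": 0, "M": 0})
    for name, sex, count in records:
        totals[name[-1].lower()][sex] += count
    return {
        letter: c["F"] / (c["F"] + c["M"])
        for letter, c in totals.items()
    }

# Toy rows with invented counts, for illustration only.
toy = [
    ("Maria", "F", 90), ("Luca", "M", 10),
    ("Mario", "M", 80), ("Jason", "M", 70), ("Alison", "F", 30),
]
shares = last_letter_share(toy)
for letter in sorted(shares):
    print(letter, round(shares[letter], 2))
```

Run against the real files, a share near 1.0 or 0.0 would mark a letter as strongly biased, and a share near 0.5 would mark it as uninformative.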

------
woodman
Those who find this interesting might also be interested in exploring the
additional dimensions offered: state and year. Without using any machine
learning, you anchor a fact (name, sex, state, birth year range) and get the
probability for the unanchored facts. The more you can anchor the higher your
level of confidence. Now this is way off topic, but if that sounded
interesting... "Constraint Logic Propagation Conflict Spreadsheets" by William
Taysom [0]

[0]
[https://www.youtube.com/watch?v=voG5-15aDu4](https://www.youtube.com/watch?v=voG5-15aDu4)
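
The "anchor some facts, read off the rest" idea can be sketched as a plain conditional probability over counts. This is only an illustration under assumptions: the row layout, the `probability` helper, and the numbers are all made up here, and it matches on an exact year rather than a year range to keep it short.

```python
def probability(rows, target, **anchors):
    """Estimate P(target | anchors) from count-weighted rows.

    rows:    iterable of dicts with keys name, sex, state, year, count
    target:  dict of field -> value being asked about
    anchors: field=value facts that are already known
    """
    def matches(row, facts):
        return all(row[k] == v for k, v in facts.items())

    anchored = [r for r in rows if matches(r, anchors)]
    denom = sum(r["count"] for r in anchored)
    if denom == 0:
        return None  # no data for this combination of anchors
    numer = sum(r["count"] for r in anchored if matches(r, target))
    return numer / denom

# Invented toy rows, not real figures.
rows = [
    {"name": "Leslie", "sex": "M", "state": "CA", "year": 1940, "count": 60},
    {"name": "Leslie", "sex": "F", "state": "CA", "year": 1940, "count": 40},
    {"name": "Leslie", "sex": "M", "state": "CA", "year": 1990, "count": 5},
    {"name": "Leslie", "sex": "F", "state": "CA", "year": 1990, "count": 95},
]
p_all = probability(rows, {"sex": "F"}, name="Leslie")
p_recent = probability(rows, {"sex": "F"}, name="Leslie", year=1990)
```

In this toy data, adding the year anchor moves the estimate from 0.675 to 0.95, which is the "more anchors, more confidence" effect in miniature.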

------
danso
Using the SSN baby names data is a staple in my data journalism classes. I
like using it as an example of how even a brute-force comparison can be decent
enough (90% or better) to do gender analysis on a wide variety of public
datasets, such as campaign finance and public payroll.

Here's one example I've created regarding the Pulitzer board composition:
[https://github.com/compciv/gendered-pulitzer-board](https://github.com/compciv/gendered-pulitzer-board)

Note: it was only an example...obviously a little tweaking is needed to put it
into actual production. Also, it obviously has varied effectiveness on datasets
with non-traditional American names. Here's one of the more comprehensive
efforts by a student, on New Yorker bylines:

[https://github.com/alecglassford/compciv-2016/blob/master/projects/gender-detector-data/README.md](https://github.com/alecglassford/compciv-2016/blob/master/projects/gender-detector-data/README.md)

The SSN name data seems almost certainly flawed to a small degree...i.e., it's
just hard to believe that there are dozens of boys named Jennifer (and yet,
strangely, no boys named Sue!)...but we're talking about infinitesimal
rounding errors. The vast, vast majority of names are 99% one way or the
other...with a few exceptions such as Leslie...though you can mitigate that by
using older years of the SSN database.
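
A rough sketch of the brute-force approach, assuming (name, sex, count) rows: pick the majority sex per name and flag names whose majority share falls under a threshold. `build_lookup`, `min_share`, and the counts below are hypothetical and invented for illustration, not taken from the real data.

```python
from collections import defaultdict

def build_lookup(records, min_share=0.9):
    """Map each name to (label, majority_share).

    label is "F" or "M" when one sex holds at least min_share of the
    recorded births, otherwise "ambiguous".
    """
    counts = defaultdict(lambda: {"F": 0, "M": 0})
    for name, sex, count in records:
        counts[name.lower()][sex] += count

    table = {}
    for name, c in counts.items():
        total = c["F"] + c["M"]
        if total == 0:
            continue  # nothing recorded for this name
        sex = "F" if c["F"] >= c["M"] else "M"
        share = c[sex] / total
        table[name] = (sex if share >= min_share else "ambiguous", share)
    return table

# Invented toy counts, for illustration only.
toy = [
    ("Jennifer", "F", 9990), ("Jennifer", "M", 10),
    ("Leslie", "F", 70), ("Leslie", "M", 30),
]
table = build_lookup(toy)
```

With these toy numbers, the handful of male Jennifers barely moves the share, while Leslie lands in the "ambiguous" bucket.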

And here's a battery of SQL queries relating to baby names and gender
analysis:
[http://2015.padjo.org/tutorials/babynames-and-college-salaries/sql-review-using-babynames-and-college-salaries/](http://2015.padjo.org/tutorials/babynames-and-college-salaries/sql-review-using-babynames-and-college-salaries/)

So I have to strongly disagree with OP that 80% accuracy is something to be
astounded by when it comes to gender classification...however, I do agree that
in terms of features, doing a simple frequency count of last characters...or
number of vowels overall can be a strong indicator of gender for a name, with
female names tending toward softer sounds. I wonder how much more using
Soundex would add to the accuracy? Creating a trained name classifier would be
a fun project in service of a tool that could gender classify how masculine or
feminine a made-up name sounds...which would be a slightly useful tool if
you were a fantasy fiction writer, though I suppose if you were to be a
successful writer, your ear would be trained well enough for the purpose to not
delegate it to a computational tool.

~~~
youngprogrammer
> The SSN name data seems almost certainly flawed to a small degree...i.e.,
> it's just hard to believe that there are dozens of boys named Jennifer (and
> yet, strangely, no boys named Sue!)...but we're talking about infinitesimal
> rounding errors. The vast, vast majority of names are 99% one way or the
> other...with a few exceptions such as Leslie...though you can mitigate that
> by using older years of the SSN database.

I agree that the SSN data has flaws, but I only took names with at least 20
people. But the classification is probably iffy, as some names are classified
as both male and female.

> So I have to strongly disagree with OP that 80% accuracy is something to be
> astounded by when it comes to gender classification...

I originally hypothesized I could reach 90% accuracy, but I could only get up
to 82% max. As stated in the blog, 80% is the accuracy of a mammogram
detecting cancer in a 40-45 year old woman, which is pretty good for 3
features!

> I wonder how much more using Soundex would add to the accuracy? Creating a
> trained name classifier would be a fun project in service of a tool that
> could gender classify how masculine or feminine a made-up name
> sounds...which would be a slightly useful tool if you were a fantasy fiction
> writer, though I suppose if you were to be a successful writer, your ear
> would be trained well enough for the purpose to not delegate it to a
> computational tool.

This would be very interesting to see!

------
realusername
* Determining Gender of an (American) Name [...]

Still impressive but it would be interesting to see the result for other
countries. Maybe for some countries, the gender could be harder or easier to
guess. I can see how some people would use this to try to reconstruct a
database where gender is missing.

~~~
carlob
You can probably get to 80% accuracy in Italian just using the last letter. If
it's A, it's most likely female; if it's O, male.

Counterexamples among the 50 most common male names given in 2014 are:

    Andrea 2.46%
    Mattia 2.39%
    Luca 1.14%
    Elia 0.45%

Counterexamples among the 50 most common female names given in 2014 are:

    None

There are a handful of names ending with E or a consonant for both genders,
which will have to be placed at random.

Source (in Italian):
[http://www.istat.it/it/prodotti/contenuti-interattivi/calcolatori/nomi](http://www.istat.it/it/prodotti/contenuti-interattivi/calcolatori/nomi)
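
The last-letter rule is short enough to write out. A minimal sketch, where `guess_by_last_letter` is a hypothetical name and names ending in E or a consonant are returned as undecided instead of being placed at random:

```python
def guess_by_last_letter(name):
    """Return "F" for names ending in a, "M" for names ending in o, else None."""
    last = name.strip().lower()[-1:]
    if last == "a":
        return "F"
    if last == "o":
        return "M"
    return None  # E or a consonant: the rule has no opinion

# The male counterexamples listed above are all misclassified as "F".
for name in ["Andrea", "Mattia", "Luca", "Elia"]:
    print(name, guess_by_last_letter(name))
```

An exception list for those four names would be the obvious patch, at which point the rule starts turning into a lookup table.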

~~~
scotty79
> If it's A it's most likely a female

Same in Polish. If 'a' is last you get like 98% probability it's a female name,
and if last is not 'a' then 98% it's a male name.

~~~
carlob
But Kuba is a male name :)

------
SeanDav
I am confused. Why use a dataset to train an AI, when one can just use a
decent dataset as a look up table and dispense with the complexity of machine
learning, while achieving close to perfect accuracy?

Obviously as an exercise in machine learning, this is perfectly acceptable,
but as a solution to a problem, not so much.

~~~
youngprogrammer
For me, this was an exercise in practicing machine learning and I found it
very interesting that you could get 80% accuracy with so few features. A
look up table works very well, but if a name does not exist in the table, you
could possibly use some kind of ML to guess the gender.
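
The "table first, guess second" idea might look something like this. It is only a sketch: `classify` and `last_letter_guess` are hypothetical names, the table entries are toy values, and the fallback here is a trivial last-letter heuristic standing in for whatever trained model you would actually plug in.

```python
def last_letter_guess(name):
    """Stand-in for a trained model: a crude last-letter heuristic."""
    if name.endswith("a"):
        return "F"
    if name.endswith("o"):
        return "M"
    return None

def classify(name, table, fallback):
    """Return (label, source): use the table when possible, else guess."""
    key = name.strip().lower()
    if key in table:
        return table[key], "lookup"
    return fallback(key), "guess"

# Toy table with made-up entries.
table = {"jennifer": "F", "john": "M"}
print(classify("Jennifer", table, last_letter_guess))
print(classify("Xanthea", table, last_letter_guess))
```

Returning the source alongside the label lets downstream code treat guesses with less confidence than table hits.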

------
LoSboccacc
but what about Andrea?

~~~
function_seven
In the United States, most likely female.

~~~
carlob
Everywhere except Italy most likely female, in Italy, mostly male.

~~~
zyx321
So it's similar to Sasha? Female in the US and other English-language regions,
male in French- and German-language regions, unisex in Eastern Europe.

------
onion2k
It's determining sex, not gender. Sex is a binary physical attribute (plus
some edge cases); gender is a much more fluid aspect of who we are. By writing
code that puts people into only "male" or "female" and ignoring everything
else that people identify with, you're disenfranchising a lot of people.

This is definitely some interesting work, but I really would recommend not
using it in the real world. You almost certainly don't need to know a user's
gender; if you _really_ do then you should ask them what it is and use their
own definition.

~~~
powera
I would think it shouldn't be used in the real world because an 80% guess rate
of gender based on name is really terrible compared to other naive approaches,
like "list of 1000 most common names by gender".

~~~
onion2k
The point still stands though - guessing a user's gender based on their name
_whatever solution you use_ will get it wrong every time the user doesn't
identify with the normative gender for that given name. It's much more user
friendly to ask (and include an option for "Prefer not to say").

~~~
cjslep
Serious question: How can a name have a biological sex? (Wouldn't the fact
that penis-people and vagina-people could be named Amy, but society heavily
biases towards one, imply names are indicators of societal genders and not
biological sex?)

I agree with your above points about the dangers of making guesses on gender.
I'm just not well versed in the progressive gender concepts and don't
understand the semantics you are arguing for.

~~~
jacalata
I think that by 'biological sex' they are saying it is identifying the gender
of the person as it was assigned to them at birth, which is usually based on
visible biological sex characteristics.

------
nness
Work out how to do this for ethnic make-up and Twitter may have a job for
you...

~~~
jonathankoren
That is trivially easy.

You simply find the most common names in each country with predominant
language. Want to find Indian names? Most common names in India. Latino names?
Most common names in Mexico. etc. It's not perfect, but nothing is. Case in
point:

As a wise man once sang, "I'm not black like Barry White. No, I am white like
Frank Black is."

------
erikpukinskis
This kind of thing makes me nervous, because I can't think of any uses for
it that aren't pretty nefarious in my mind. I immediately
imagine targeting women with makeup and celebrities and men with cars and
sports. Even if you could achieve 100% accuracy, you're pigeonholing your
users in some sort of weird gender jail where behavior consistent with their
gender is reinforced.

Maybe this is a failure of imagination on my part though... Do other people
have a sense of some altruistic feature that would rely on tech like this?

~~~
Terr_
Nefarious? Guessing the gender of a name is something humans do all the time,
especially with the aid of learned conventions in a society. Janus and Janice,
Don and Dawn...

Even if the computer improves to twice human accuracy (i.e. mistakes half as
often) I don't see how it'll reveal or bias anything that wasn't already about
to happen.

