

 My approach to guessing a gender from a first name. - Stromgren

Hi!
A short time ago, I decided to try to build an API that guesses the gender of a first name. I thought this might be useful for segmenting user lists for campaigns, analytics, or similar.
My first approach was to use a dataset of approved names from a few European countries. This was in the belief that most countries had lists like this (which they don't), and I planned to add them as I went along. I got wiser, and the first feedback I got also told me that the API should be able to make probabilistic guesses and, if possible, offer some sort of localization filter to achieve more accurate guesses.
I decided instead to use large, growing datasets of user profiles from social networks, with each entry containing a first name, a gender, a country_id and a language_id. Finally, I exposed this data model through http://genderize.io
It responds in JSON. Simple example: http://api.genderize.io?name=robin
I am now looking for feedback on my new approach. What do you think of this way of doing guesses? What do you think of the API? Any feedback is welcome.
The API is completely free by the way.
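
For anyone who wants to try it from code, here is a minimal Python sketch of building the request URL and reading the JSON that comes back. The helper names `build_url` and `parse_response` are made up for illustration, and the field names (`gender`, `probability`, `count`) are assumed from the sample responses quoted in this thread, so treat it as a rough guide rather than a client library.

```python
import json
from urllib.parse import urlencode

API_BASE = "http://api.genderize.io"  # endpoint from the example URL in the post


def build_url(name):
    """Build the request URL for a single first name (hypothetical helper)."""
    return API_BASE + "?" + urlencode({"name": name})


def parse_response(body):
    """Turn a JSON response body into (gender, probability, count).

    Assumption: unknown names come back with "gender": null and without
    probability/count fields, so those fall back to None, 0.0 and 0.
    """
    data = json.loads(body)
    gender = data.get("gender")
    if gender is None:
        return None, 0.0, 0
    # In the sample responses, probability is sent as a string such as "1.00".
    return gender, float(data.get("probability", 0.0)), int(data.get("count", 0))
```

The body passed to `parse_response` can be fetched with any HTTP client; keeping the parsing separate makes it easy to check against saved responses.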
======
lutusp
> A short time ago, i decided to try and build an API that would try to guess
> the gender of a first name.

Obviously you need to run a test that uses a list of real people's names and
genders to measure the method's accuracy. But remember the following points:

* People might resent any effort to pin down their gender in a commercial or advertising context.

* The negative outcome for a gender misidentification may be much greater than the positive outcome for a correct one.

* Gender-neutral names are becoming increasingly fashionable among well-educated parents, i.e. people who have money.

On that basis and in my opinion, unless you can get above 90% accuracy, it's
not worth doing.
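
A rough sketch of such a test in Python, assuming you have a list of (name, true gender) pairs and some guessing function to evaluate. The `accuracy` helper and the toy data are hypothetical and only illustrate the bookkeeping; names the guesser declines to answer are counted separately as coverage so they don't inflate or deflate the accuracy figure.

```python
def accuracy(labelled, guess):
    """Return (accuracy, coverage) of `guess` over (name, true_gender) pairs.

    accuracy = correct guesses / names the guesser answered
    coverage = names the guesser answered / all names
    """
    answered = correct = 0
    for name, true_gender in labelled:
        predicted = guess(name)
        if predicted is None:
            continue  # the guesser had no opinion on this name
        answered += 1
        if predicted == true_gender:
            correct += 1
    total = len(labelled)
    acc = correct / answered if answered else 0.0
    cov = answered / total if total else 0.0
    return acc, cov


# Toy data for illustration only, not real measurements.
sample = [("anna", "female"), ("peter", "male"), ("robin", "female"), ("batman", "male")]
toy_guesses = {"anna": "female", "peter": "male", "robin": "male"}

# dict.get returns None for names the toy guesser does not know.
acc, cov = accuracy(sample, toy_guesses.get)
```

With the 90% bar above in mind, you would want `acc` measured on a large, representative list, and you would want to look at `cov` too, since a guesser can look accurate simply by refusing to answer the hard names.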

Some popular gender-neutral names:

[http://www.babynames1000.com/gender-neutral/](http://www.babynames1000.com/gender-neutral/)

[http://thestir.cafemom.com/pregnancy/157282/25_best_genderne...](http://thestir.cafemom.com/pregnancy/157282/25_best_genderneutral_baby_names)

[http://en.wikipedia.org/wiki/Unisex_name#English](http://en.wikipedia.org/wiki/Unisex_name#English)

A quote: "Unisex names have been enjoying a decent amount of popularity in
English speaking countries in the past several decades."

~~~
gadders
Also a large number of Sikh names are gender neutral.

------
dictum
To anyone who is interested in implementing this in a product: don't.

To be fair: do it if you must. But _don't let the user see the gender field
as it changes_. If someone has a name that's associated with the opposite
gender (or they believe themselves to be of another gender), seeing the change
to that gender in the gender field will make them sad, annoyed, or irritated.
At best, they will chuckle at the failed attempt to predict their gender.

This is one of those features that users don't notice when it works as
intended, and that doesn't improve their experience much, but when it
fails, they notice, and the annoyance hurts your image.

------
dalke
I see that you are missing various Swedish names, like Gudrun. I don't know if
you can get the full list of names, but you can get the list of names which
were given to at least 10 girls in the last decade or so at:

[http://www.scb.se/Pages/TableAndChart____31028.aspx](http://www.scb.se/Pages/TableAndChart____31028.aspx)

and for boys at:

[http://www.scb.se/Pages/TableAndChart____31036.aspx](http://www.scb.se/Pages/TableAndChart____31036.aspx)

You can also go to
[http://www.scb.se/Pages/NameSearch.aspx?id=259432](http://www.scb.se/Pages/NameSearch.aspx?id=259432)
and search for a name. For example, there are 990 people in Sweden with
Strömgren as a last name.

It seems that "Gudrun" isn't that popular these days as fewer than 10 girls
get that name. A different set of names is available from
[http://en.wiktionary.org/wiki/Category:Swedish_given_names](http://en.wiktionary.org/wiki/Category:Swedish_given_names).

I don't have need for this data and I can't comment about the effectiveness of
the API.

You can get the top 1000 US names for a given year by going to
[http://www.ssa.gov/OACT/babynames/#ht=1](http://www.ssa.gov/OACT/babynames/#ht=1),
selecting a year, changing "Popularity" to "Top 1000", and submitting the form.
(For example, your search doesn't have 'Lowell', which was #172 in the US in
1940.)

Good luck!

~~~
Stromgren
Thanks a lot. I'll look into this. Regarding missing names: I'm adding around
10,000 profiles a day to the dataset, so it gets better by the minute :) It's
not that huge a dataset yet.

------
Asparagirl
Hey, nice job! I do a lot of work (both professionally and as a volunteer)
coding stuff for genealogical and historical non-profit organizations, and I
could totally see an API like this being useful to us. Do you accept donations
of name data sets from the 19th century Austro-Hungarian Empire? :-)

Also, I would love to learn more about how the service actually works on the
back-end.

~~~
Stromgren
Hey, thanks! The API doesn't actually use name sets like that, though that was
my first approach. I changed it to use lists of profiles from social networks.
When a name is requested, it looks up every profile with that name and
counts the number of times each gender is represented. If you use any
localization parameters, it will of course only look up profiles associated
with that particular country or language. I quickly realized with the initial
approach that my lists would never be sufficient, since most countries allow
almost any name to be given, and when combining lists from the whole world,
a lot of names would end up as unisex. That's why I went for a probability
factor instead. Also, I'm hoping that by using social profiles, it might one
day be able to tell the gender of Superman or Catwoman and things like that.
People can, after all, call themselves what they want on the internet.
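
To make the counting idea concrete, here is a minimal Python sketch of it. This is not the service's actual code: the in-memory `PROFILES` list, the `guess_gender` function, and the country codes are all assumptions for illustration, and the real storage and tie-breaking may well differ.

```python
from collections import Counter

# Hypothetical profile rows: (first_name, gender, country_id)
PROFILES = [
    ("robin", "male", "SE"),
    ("robin", "male", "SE"),
    ("robin", "female", "US"),
    ("robin", "female", "US"),
    ("robin", "female", "US"),
    ("dillon", "male", "US"),
]


def guess_gender(name, country_id=None, profiles=PROFILES):
    """Count genders among matching profiles and return the majority.

    Returns (gender, probability, count); gender is None when no profile
    matches. With country_id set, only profiles from that country count.
    """
    counts = Counter(
        gender
        for first_name, gender, country in profiles
        if first_name == name.lower()
        and (country_id is None or country == country_id)
    )
    total = sum(counts.values())
    if total == 0:
        return None, 0.0, 0
    # On a tie, most_common keeps the gender that was seen first.
    gender, hits = counts.most_common(1)[0]
    return gender, hits / total, total
```

In this toy data, "robin" is mostly female overall but entirely male once filtered to "SE", which is the kind of shift the localization filter is meant to capture. It also shows why the count matters: a name backed by a single profile always gets a probability of 1.0.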

I've actually thought about adding a baseline of names from different
lists, though, to back up the names that are not yet represented in the dataset.
Do you have a link to the names you are mentioning? Could be interesting.

~~~
Asparagirl
Check out [http://search.geshergalicia.org/](http://search.geshergalicia.org/).

Many, but not all, of the people mentioned in the 87 data sets (and counting!)
that make up this database have a gender explicitly declared. Locale is the
former province of Galicia in the Austro-Hungarian Empire, which is today
eastern Poland and southwestern Ukraine. Time period is mostly 19th century
and some early 20th century. Ethnicity is strongly biased towards Ashkenazi
Jewish, but we also have some data sets that have representation of all the
people in the community at that time, such as tax lists or phonebooks or
school lists. I can get you data in JSON or XML, let me know.

I also have access to another large given name database that could be useful
to you -- but that one is entirely Ashkenazi Jewish from what used to be
northeastern Hungary, from roughly 1850 to 1906.

------
dscb
Now this is interesting: I entered my name (Dillon), which is gender-neutral,
and it returned:

{"name":"dillon","gender":"male","probability":"1.00","count":1}

I'm interested in how it decided there was a 100% probability that I'm male
(it was correct, though).

~~~
ToastyMallows
..Dillon is gender neutral? I understand that this could be said about any
name (which makes predicting gender a non-trivial problem), but I am pretty
sure I've never met a female Dillon.

------
rtcoms
I searched
[http://api.genderize.io/?name=batman](http://api.genderize.io/?name=batman)

and received {"name":"batman","gender":null}

