
Show HN: Python Package for Predicting Gender from First Names - parthmaul
https://github.com/parthmaul/onomancer
======
nic-waller
Storing the lookup map on disk as a JSON-encoded dictionary seems less than
optimal for package size and module load time. Two plaintext files (M.txt and
F.txt) would be simple and more efficient on disk. The text is also highly
compressible -- that could further reduce package size. These things might
matter if the package is used in a serverless environment.
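
A minimal sketch of the two-file idea, assuming each file holds one upper-case name per line. `load_names` and `build_lookup` are hypothetical helpers, not part of the package, and the demo files only stand in for M.txt and F.txt:

```python
from pathlib import Path
import tempfile


def load_names(path):
    """Read one name per line into a set, skipping blank lines."""
    with open(path, encoding="utf-8") as fh:
        return {line.strip() for line in fh if line.strip()}


def build_lookup(male_path, female_path):
    """Map each name to 'M' or 'F'; names present in both files are left out."""
    male, female = load_names(male_path), load_names(female_path)
    lookup = {name: "M" for name in male - female}
    lookup.update({name: "F" for name in female - male})
    return lookup


# Demo with throwaway files standing in for M.txt and F.txt (made-up contents).
with tempfile.TemporaryDirectory() as tmp:
    m_path, f_path = Path(tmp) / "M.txt", Path(tmp) / "F.txt"
    m_path.write_text("PAUL\nNIC\nALEX\n", encoding="utf-8")
    f_path.write_text("ANNA\nALEX\n", encoding="utf-8")
    demo_lookup = build_lookup(m_path, f_path)
```

Names that show up in both files are dropped here; a real loader would need some rule for them.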

Also, do you think there could be value in identifying classically androgynous
names?

~~~
parthmaul
Thanks for sharing your feedback! Great idea on using .txt instead - I'll make
a change for that. (This is my first time sharing a package I've prepared on
GitHub, so I'm a noob with that kind of stuff.)

Some names in the current JSON file are classified as "N", which stands for
non-binary, but their frequency is quite low. A name gets "N" if the frequency
of "M" == "F", or if the two frequencies are within a certain magnitude of each
other (the magnitude calculation is based on proportions testing). That said,
maybe it'd be worth adding functionality for a user to upload their own
gender_lookup file?
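
For readers wondering what a proportions-based rule could look like, here is one possible sketch. It is an assumption, not necessarily how the package does it: a one-sample z-test of the male share against 0.5, with a hypothetical `classify` function and an arbitrary 1.96 cutoff.

```python
import math


def classify(male_count, female_count, z_crit=1.96):
    """Return 'M', 'F', or 'N' from a z-test of the male share against 0.5.

    Illustrative only: the function name and the cutoff are assumptions.
    """
    total = male_count + female_count
    if total == 0:
        raise ValueError("need at least one observation")
    p_hat = male_count / total
    # Standard error of the proportion under the null hypothesis p = 0.5.
    z = (p_hat - 0.5) / math.sqrt(0.25 / total)
    if abs(z) < z_crit:
        return "N"  # the two frequencies are not clearly different
    return "M" if z > 0 else "F"
```

With very small counts the normal approximation behind the z-test is rough, so rare names would mostly land in "N" under this rule.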

~~~
nic-waller
Ah, I didn't see the N records originally. That makes sense!

Because I'm enthusiastic about data structures, and now that I've finished my
work day, I thought I'd come back with a few numbers to support my earlier
comment. The package size can be reduced by 90% by saving the names as
compressed plaintext.

    
    
      5192 KB gender.json
      1484 KB gender.txt     (71% savings)
       488 KB gender.txt.gz  (90% savings)
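
The same kind of comparison can be tried from Python. This is only a sketch on synthetic data (the names below are made up, and unlike the key-only gender.txt above the text form here keeps a tab-separated label), so the percentages will not match the numbers above:

```python
import gzip
import json

# Synthetic stand-in data; the real gender.json is not reproduced here.
names = {f"NAME{i:06d}": "MFN"[i % 3] for i in range(10_000)}

as_json = json.dumps(names, indent=2).encode("utf-8")
# Assumed text layout: one "name<TAB>label" record per line.
as_text = "".join(f"{name}\t{label}\n" for name, label in names.items()).encode("utf-8")
as_gzip = gzip.compress(as_text)

for label, blob in [("json", as_json), ("txt", as_text), ("txt.gz", as_gzip)]:
    saved = 100 * (1 - len(blob) / len(as_json))
    print(f"{len(blob):>8} bytes  {label:<7} ({saved:.0f}% savings)")
```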
    

On my reasonably modern laptop, it takes over 700 ms to unmarshal 5 MB worth of
JSON. But it takes less than 100 ms (over 85% time savings) to read the whole
text file and compare strings.

    
    
      $ time jq -r . >/dev/null gender.json
      real 0m0.764s
      
      $ time grep -E '^(NIC|PAUL)' gender.txt
      real 0m0.091s
    

When working with large datasets of items that are mostly the same size,
sometimes it's useful to use fixed-width records to enable random indexing
into the file. Of course, this is a small data set so it's not really
worthwhile to pursue such optimizations. So just for fun, here's an analysis
of how long the names are. 99% of names in this set are 13 characters or fewer.
Representing short names as fixed-width records and long names as an appendix
would use about 2.5 MB (50% savings compared to JSON).

    
    
      length  count
           2    222
           3   1785
           4   8913
           5  27862
           6  45986
           7  44311
           8  28291
           9  14355
          10   7167
          11   4072
          12   2611
          13   1512
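
To make the fixed-width idea concrete, here is a small sketch. The layout is an assumption: 13 ASCII bytes of space-padded name plus one label byte per record, sorted so a lookup can binary-search by seeking to record offsets, with longer names kept in a separate overflow dict playing the role of the appendix. `pack`, `find`, and `lookup` are hypothetical names.

```python
import io

WIDTH = 13          # name field width, from the 13-character figure above
RECORD = WIDTH + 1  # plus one byte for the label (assumed layout)


def pack(entries):
    """Build sorted fixed-width records; names longer than WIDTH go to overflow."""
    records, overflow = [], {}
    for name, label in entries.items():
        raw = name.encode("ascii")
        if len(raw) > WIDTH:
            overflow[name] = label
        else:
            records.append(raw.ljust(WIDTH) + label.encode("ascii"))
    return b"".join(sorted(records)), overflow


def find(fh, count, name):
    """Binary-search the records by seeking straight to record offsets."""
    key = name.encode("ascii").ljust(WIDTH)
    lo, hi = 0, count
    while lo < hi:
        mid = (lo + hi) // 2
        fh.seek(mid * RECORD)
        record = fh.read(RECORD)
        if record[:WIDTH] == key:
            return record[WIDTH:].decode("ascii")
        if record[:WIDTH] < key:
            lo = mid + 1
        else:
            hi = mid
    return None


def lookup(fh, count, overflow, name):
    """Check the overflow dict for long names, the record file otherwise."""
    if len(name) > WIDTH:
        return overflow.get(name)
    return find(fh, count, name)


# Demo on made-up entries; BytesIO stands in for a file opened in binary mode.
blob, overflow = pack({"NIC": "M", "PAUL": "M", "ANNA": "F", "MAXIMILIANA-ROSE": "F"})
fh = io.BytesIO(blob)
count = len(blob) // RECORD
```

Each lookup reads only about log2(count) records instead of the whole file, which is the point of making the records the same size.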
    

PS. Here's how I prepared the text file:

    
    
      <gender.json jq -r '. | keys[]' > gender.txt
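
For anyone without jq installed, roughly the same step can be sketched in Python. `dump_keys` is a hypothetical helper and the demo data is made up. Like jq's `keys`, it writes the top-level keys in sorted order, one per line:

```python
import json
import tempfile
from pathlib import Path


def dump_keys(json_path, txt_path):
    """Write the top-level keys of a JSON object, sorted, one per line."""
    with open(json_path, encoding="utf-8") as fh:
        keys = sorted(json.load(fh))
    Path(txt_path).write_text("".join(k + "\n" for k in keys), encoding="utf-8")
    return len(keys)


# Demo with a tiny made-up file standing in for gender.json.
with tempfile.TemporaryDirectory() as tmp:
    src, dst = Path(tmp) / "gender.json", Path(tmp) / "gender.txt"
    src.write_text(json.dumps({"PAUL": "M", "ANNA": "F", "NIC": "M"}), encoding="utf-8")
    written = dump_keys(src, dst)
    contents = dst.read_text(encoding="utf-8")
```

Note that this keeps only the names, so the M/F/N labels would have to be stored some other way.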

~~~
parthmaul
I stumbled along and made the changes you recommended! Seems to be working
fine. Thanks again for the tip!

------
jk801
This is a great idea.

~~~
parthmaul
Thank you! I'm using this at work to help with customer segmentation.

