
Using Machine Learning and Node.js to detect the gender of Instagram Users - spolu
http://totems.co/?p=11084
======
idunning
Neural networks have their place, but they are probably the most complicated and
opaque machine learning tool. They are also hard to set up: so many
parameters! Given that, I found it really strange that they went straight for
a neural network (and then implemented one themselves!). Surely the place to
start would be Naive Bayes, followed by regularized logistic regression
(through something like glmnet). Heck, even random forests would do quite well
on this task, I imagine, although that's moving toward the NN end of the
complexity and opaqueness spectrum.
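To give a sense of the alternative, a bag-of-words Naive Bayes baseline fits in a few dozen lines. A minimal JavaScript sketch with add-one (Laplace) smoothing, using invented toy captions (illustrative only, not the article's code):

```javascript
// Minimal multinomial Naive Bayes with Laplace smoothing.
// Trains on token arrays labelled by class; toy illustration only.
function trainNB(docs) {
  const counts = {};          // class -> token -> count
  const totals = {};          // class -> total token count
  const priors = {};          // class -> document count
  const vocab = new Set();
  for (const { tokens, label } of docs) {
    priors[label] = (priors[label] || 0) + 1;
    counts[label] = counts[label] || {};
    totals[label] = totals[label] || 0;
    for (const t of tokens) {
      counts[label][t] = (counts[label][t] || 0) + 1;
      totals[label] += 1;
      vocab.add(t);
    }
  }
  return { counts, totals, priors, vocabSize: vocab.size, nDocs: docs.length };
}

function classifyNB(model, tokens) {
  let best = null, bestScore = -Infinity;
  for (const label of Object.keys(model.priors)) {
    // log prior + sum of log likelihoods with add-one smoothing
    let score = Math.log(model.priors[label] / model.nDocs);
    for (const t of tokens) {
      const c = (model.counts[label][t] || 0) + 1;
      score += Math.log(c / (model.totals[label] + model.vocabSize));
    }
    if (score > bestScore) { bestScore = score; best = label; }
  }
  return best;
}

const model = trainNB([
  { tokens: ['makeup', 'nails', 'brunch'], label: 'female' },
  { tokens: ['beard', 'gym', 'steak'], label: 'male' },
]);
console.log(classifyNB(model, ['makeup', 'brunch'])); // 'female'
```

On a real caption corpus you would tokenize properly and hold out a test split, but even this skeleton makes a reasonable first baseline to beat.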

There is also no evidence of cross-validation, and in another comment
they say they used the entire data set to do variable selection - a pretty bad
mistake. They justify it by saying they aren't in an academic environment, but
that's kind of a bad excuse: given the way they've done it, I'm very unsure
whether they're actually getting the accuracy they think they are.

I also worry that they sank two man-months into this when they could probably
have achieved similar if not better results with off-the-shelf, battle-tested
tools. That sets off a lot of warning bells.

~~~
hexhex
I am not sure whether I understood everything correctly, but I think they
computed only with the data they got from the links to Facebook profiles.

However smart their computation is, the output can't be better than the input
they used. Simply determining the gender by looking up a linked Facebook
profile would therefore be a better solution, in my opinion.

------
adelevie
This is a great example of how privacy is not optional, even in "opt-in"
systems such as Instagram and FB. That Instagram does not _require_ you to
have a Facebook profile, and that Facebook does not _require_ you to list your
gender, means very little in terms of your own privacy.

Merely choosing to withhold information about yourself does not insulate you
from a breach of privacy. That others do disclose such information allows third
parties to make very good guesses and inferences about you.

There's a strange morality here: at what point is it unethical to voluntarily
disclose data about oneself, if it could be used in a way that harms someone
else's privacy? Short of drawing a moral boundary (which may well be
impossible), we might do well to at least acknowledge the costs of these
methods alongside their benefits.

~~~
spolu
> There's a strange morality here: at what point is it unethical to voluntarily
> disclose data about oneself

That's an interesting question, especially since the data you disclose may
trigger inappropriate inference of characteristics about someone else, maybe
eventually causing some form of harm (any time the demo fails to classify
someone correctly, we do cause some harm to him/her, in a way). In cases where
the misclassification is more harmful than the privacy disclosure, one is
better off disclosing the information... a weird equilibrium.

~~~
flashman
In other words: how am I affected by the fact that many of my Facebook friends
like drug-related pages?

------
gstar
It's unusual to see a coherent, from-first-principles explanation of a neural
network, especially one that's commercially valuable (I presume) to Totems.

Mildly alarmed to learn I'm only .039 probability male, though - better bloke
it up on Instagram.

~~~
needacig
What's so alarming about being thought female?

~~~
barsonme
Personally, being male, I'd rather be thought of as male. Not as a slight
towards females, but just because it's who I am.

That said, I did get 0.998 female and 0.996 male. Oh well.

------
dn5
Thanks for sharing your experience! A couple of questions:

Why implement the training in NodeJS and not use an existing library in R or
Python (scikit-learn) and just implement the scoring (feedforward network) in
Node?

Did you just use a single test/train split? What is the variation in results
if you run cross-validation?

Your article suggests that you used MI to select the 10k best features. Did
you perform this MI feature selection before your test/train split? If so, you
would already be "using" your class labels, and the results will be biased. It
is likely your true generalisation error is higher than measured.
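For reference, the MI score itself is cheap to compute. A hypothetical sketch for one binary feature against a binary label, meant to be run on the training split only so the test labels are never touched (not the article's code):

```javascript
// Mutual information (in bits) between a binary feature and a binary
// label, estimated from (feature, label) pairs. Run this on the
// TRAINING split only, before ever looking at the test set.
function mutualInformation(pairs) {
  const n = pairs.length;
  const joint = { '0,0': 0, '0,1': 0, '1,0': 0, '1,1': 0 };
  let fx = 0, fy = 0;
  for (const [x, y] of pairs) {
    joint[`${x},${y}`] += 1;
    fx += x;
    fy += y;
  }
  const px = [1 - fx / n, fx / n];   // marginal of the feature
  const py = [1 - fy / n, fy / n];   // marginal of the label
  let mi = 0;
  for (const x of [0, 1]) {
    for (const y of [0, 1]) {
      const pxy = joint[`${x},${y}`] / n;
      if (pxy > 0) mi += pxy * Math.log2(pxy / (px[x] * py[y]));
    }
  }
  return mi;
}

// Perfectly informative feature -> 1 bit; independent feature -> 0 bits.
console.log(mutualInformation([[1, 1], [1, 1], [0, 0], [0, 0]])); // 1
console.log(mutualInformation([[1, 1], [1, 0], [0, 1], [0, 0]])); // 0
```

Ranking all features by this score on the training split alone, then keeping the top 10k, leaves the test set untouched and the test estimate unbiased.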

~~~
spolu
> Why implement the training in NodeJS and not use an existing library in R or
> Python (scikit-learn) and just implement the scoring (feedforward network)
> in Node?

We wanted to contribute to the NodeJS ecosystem and build whatever tool was
missing to use neural networks directly from NodeJS, or at least as an add-on.
We also wanted to come up with a simple and straightforward implementation to
serve as an educational example, rather than just bind to an existing library
(even though the results might have been better, of course).

> Did you just use a single test/train split? What is the variation in results
> if you run cross-validation?

We didn't use cross-validation, but rather a simple train/test split (though
our test set was quite large, ~100k out of 570k). As explained in the intro,
we wanted to stay very practical and were OK with dirty shortcuts as long as
the results looked OK.

> Did you perform this MI feature selection before your test/train split? If
> so, you would already be "using" your class labels, and the results will be
> biased. It is likely your true generalisation error is higher than measured.

Yes, MI selection was done on the overall data set before training. You're
totally right that this biases the test set. Nice catch.

------
im3w1l
Your implementation of momentum seems off: you just add a multiple of the last
error, instead of adding exponentially declining contributions from the past.
I think you want

    double dW = alpha_ * val_[l][j] * D_[l+1][i] + beta_ * dW_[l+1][i][j];
    W_[l+1][i][j] += dW;
    dW_[l+1][i][j] = dW;  /* store dW so past contributions decay geometrically */

If you want to get an output class probability, softmax is the standard way.
Minimize KL-divergence instead of squared error.
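To make the suggestion concrete, here is a generic softmax-plus-cross-entropy sketch (not the post's code; minimizing cross-entropy against a one-hot target is equivalent, up to a constant, to minimizing KL-divergence):

```javascript
// Softmax: turns raw output scores into probabilities that sum to 1.
// Subtracting the max score first keeps exp() numerically stable.
function softmax(scores) {
  const max = Math.max(...scores);
  const exps = scores.map(s => Math.exp(s - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map(e => e / sum);
}

// Cross-entropy loss against a one-hot target class.
function crossEntropy(probs, targetIndex) {
  return -Math.log(probs[targetIndex]);
}

const p = softmax([2.0, 1.0, -1.0]);
console.log(p.reduce((a, b) => a + b, 0)); // 1 (up to float rounding)
```

With this output layer, the two gender scores would always sum to one, rather than being independent per-class values.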

You don't seem to be doing any regularization. It could maybe give you better
generalization.

I think you could get a speedup by doing your linear algebra with BLAS. I
guess this would complicate the code, though, making it a trade-off.

Training on multiple threads and averaging is a nice touch. It would be
interesting to hear if (how much) it improved your results.

~~~
spolu
> Your implementation of momentum seems off

I think we used what is described in Artificial Intelligence: A Modern
Approach... but I'll have to check, because what you propose seems better.

> If you want to get an output class probability, softmax is the standard way.
> Minimize KL-divergence instead of squared error.

Thanks! We'll totally try that.

> You don't seem to be doing any regularization. It could maybe give you
> better generalization.

Thanks again. Someone mentioned that before as well. We'll have to experiment
with it.

> Training on multiple threads and averaging is a nice touch. It would be
> interesting to hear if (how much) it improved your results.

Training was much faster and therefore tractable on a much larger set, but
unfortunately, as described in the post, we didn't manage to get our best
results using this multi-threaded approach.

Maybe with a bigger training set we could have reached better results with
multi-threaded training. That being said, the averaging phase disrupts the
overall backpropagation process quite a lot, so I don't know how efficient it
can be... Some advanced experimentation would probably be interesting here.

~~~
im3w1l
>Thanks! We'll totally try that.

Oops, maybe I spoke too soon; allow me to backpedal a little. I still
recommend minimizing KL-divergence.

------
antihero
Giving it a go with most of my friends, I'd say the success rate was
definitely below 0.5, and it was pretty confident about its answers.

What seems odd is that the "test tool" allows you to _tweet_ whether it's
wrong or right. Why not just have it make a call to your API or something to
tell you directly, so you can look at the profiles and figure out what's gone
wrong?

~~~
friendzis
Two more epically failing profiles: @friendzis @algimantas69

------
franciscop
Having used /harthur/brain before and being deeply interested in Neural
Networks, I have to say that this is one of the most interesting articles
about the topic I've ever seen.

Thank you for sharing the C version, I'll use it for sure.

------
tzs
This was submitted 4 days ago [1], and then was deleted. Anyone know what was
up with that?

[1]
[https://news.ycombinator.com/item?id=8368186](https://news.ycombinator.com/item?id=8368186)

~~~
spolu
I deleted it shortly after submitting it, because the demo crashed and we
didn't want to waste such a great opportunity on HN on a failed demo.

I know it's not perfect... But heh. Hope it's ok.

~~~
arthurcolle
The "post if WRONG" twitter link failed on my iPhone 5.

My Instagram name is the same as my HN user id, and it classified me as female
with 99.3% probability... Needs work!

~~~
NathanKP
It looks like very few of your photos have captions, meaning that the
algorithm doesn't have a lot of text to work with, and among those that do
have text, a few contain keywords that are probably heavily weighted toward
female, such as "pink".

The algorithm could probably be improved to also take the Instagram name into
consideration. Someone named "arthur" is very unlikely to be female.

------
minimaxir
> _Our platform retrieves or refreshes around 400 user profiles per second
> (this is managed using 4 high-bandwidth servers co-located with instagram’s
> API servers on AWS)._

Interesting, since Instagram's API only allows 5,000 requests per hour,
([http://instagram.com/developer/limits/](http://instagram.com/developer/limits/))
and does not support bulk requests of user data. How does this application
bypass this limit?

~~~
spolu
Hi minimaxir. We have a large number of tokens from our clients and from
people doing OAuth to access our free demo. Since the data is public, we can
use any of these tokens to access hashtags, account followers, etc.

Actually, Instagram's API limit is pretty high compared to other platforms'.
Today we have something like 100k tokens available to us, which means we can
make 12bn+ calls every day. Almost like having a firehose. We don't use all of
them, but we're one of the top users, though there are at least 10 bigger
users than us on the API (according to them, of course). Hope that helps!
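The arithmetic is consistent with the documented 5,000 requests per token per hour:

```javascript
// Back-of-the-envelope: tokens x per-token hourly limit x hours per day.
const tokens = 100000;          // ~100k tokens, as claimed above
const perTokenPerHour = 5000;   // Instagram's documented per-token limit
const callsPerDay = tokens * perTokenPerHour * 24;
console.log(callsPerDay); // 12000000000, i.e. 12bn+ calls a day
```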

~~~
tracker1
Interesting. When I was at GoDaddy (Website Builder), Facebook had implemented
not only a per-token limit but also application-level limits, which we hit
pretty easily. Does Instagram not have the same kind of limits?

For those curious, IIRC it took a while to get our account's limits raised,
and we had to implement some request caching to stay under the limits as much
as possible. All around, it was interesting.

~~~
spolu
Nope they don't... at least not for now :)

------
_up
Wouldn't a Bayesian filter be better suited? There must be a reason spam
filters use them instead of neural networks.

~~~
spolu
_up we thoroughly evaluated perceptrons, which are somewhat close to Bayesian
networks. Basically, a perceptron is a one-layer NN and is therefore quite
similar to a Bayesian network in that it encodes a linear model.

That being said... studying Bayesian networks more thoroughly might indeed
yield better results. I don't know whether Gmail is using Bayesian networks
or deep learning, though.

~~~
tensor
My guess is that Gmail is using a linear classifier, both because of the scale
of the data and because, until very recently, linear classifiers have been
state of the art on text classification.

In the few cases where NNs have achieved new state of the art on text (the
Stanford sentiment analysis work and a few more recent works), a full sentence
parse is needed. Sentence parses do achieve 95% accuracy, but only on well-
structured text in a given domain. Plus, they are hugely time-intensive
compared to large-scale linear classifiers like Vowpal Wabbit or sofia-ml.

Regarding perceptrons, a basic perceptron, although it is a linear classifier,
will not achieve state of the art. Averaged perceptrons get you closer, but
what you really want is a discriminatively trained linear classifier with
regularization.
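To make the contrast concrete, here is a toy averaged perceptron in JavaScript (illustrative only, and certainly not what Gmail runs). The plain perceptron would use the final weight vector; the averaged variant returns the mean of the weights over all training steps:

```javascript
// Toy averaged perceptron on dense feature vectors. The returned
// weights are the average over all training steps, which smooths
// out the last-example sensitivity of the plain perceptron.
function trainAveragedPerceptron(data, epochs = 10) {
  const dim = data[0].x.length;
  const w = new Array(dim).fill(0);
  const sum = new Array(dim).fill(0);
  let steps = 0;
  for (let e = 0; e < epochs; e++) {
    for (const { x, y } of data) {          // y is +1 or -1
      const score = x.reduce((s, xi, i) => s + xi * w[i], 0);
      if (y * score <= 0) {                 // mistake-driven update
        for (let i = 0; i < dim; i++) w[i] += y * x[i];
      }
      for (let i = 0; i < dim; i++) sum[i] += w[i];
      steps++;
    }
  }
  return sum.map(s => s / steps);           // averaged weights
}

// Linearly separable toy set (bias folded in as a constant 1 feature).
const data = [
  { x: [1, 2, 1], y: +1 },
  { x: [2, 3, 1], y: +1 },
  { x: [-1, -2, 1], y: -1 },
  { x: [-2, -1, 1], y: -1 },
];
const w = trainAveragedPerceptron(data);
const predict = x => (x.reduce((s, xi, i) => s + xi * w[i], 0) > 0 ? 1 : -1);
console.log(predict([1.5, 2.5, 1])); // 1
```

A regularized, discriminatively trained model such as L2-regularized logistic regression would replace the mistake-driven update with a gradient step plus a weight-decay term.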

If I had to bet, Gmail is probably using something closer to
[https://code.google.com/p/sofia-ml/](https://code.google.com/p/sofia-ml/)
than a NN. Maybe a Googler will surprise me though!

~~~
wamatt
_> My guess is that gmail is using a linear classifier._

Yup, the Google "Priority Inbox" feature does indeed use a linear classifier,
in particular logistic regression [1], for the reasons of scale you point
out.

Also, IIRC Gmail's original spam detection used naive bayes. It may have
evolved since then.

[1] [http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/36955.pdf](http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/36955.pdf)

------
syldor
It's true that it seems like a lot of implementation work; NNs have a much
higher complexity/performance ratio than other algorithms. But hey, the end
justifies the means! I'm quite impressed with the result and had a lot of fun
with the demo and the article. Keep it up, guys!

~~~
spolu
:+1:

------
m0nastic
It doesn't predict my account correctly:

    
    
       PROBABILITY FEMALE: 0.997 
       PROBABILITY MALE: 0.569 
    

I wonder if the fact that I mostly just post pictures with no text
accompanying them skews things.

------
mts_
My account (@matiassingers) got some very interesting numbers, and most of my
photos definitely do have a caption and hashtags.

    
    
        PROBABILITY FEMALE: 0.003 
        PROBABILITY MALE: 0.001

------
turbostyler
1.000 probability of being a man. Thank you for affirming my masculinity.

However, my business has a 0.885 probability of being a woman, which is odd
for a men's brand.

~~~
franciscop
Not really odd, if your brand is looking to _seduce_ men into buying (;

------
hnriot
Interesting blog, interesting ideas, but completely bogus results. It's very
inaccurate. Even simple naive Bayes would get you much better results than this.

------
lpgauth
PROBABILITY FEMALE: 0.003 PROBABILITY MALE: 0.999

Errr, so it's out of 1.002?

~~~
SapphireSun
Many machine learning algorithms don't return probabilities; the pattern
matcher returns confidences on a range of 0 to 1, and the higher confidence
wins.
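If normalised numbers are wanted anyway, independently produced confidences can always be rescaled after the fact; a trivial sketch:

```javascript
// Rescale independent per-class confidences so they sum to 1.
function normalise(confidences) {
  const sum = confidences.reduce((a, b) => a + b, 0);
  return confidences.map(c => c / sum);
}

// e.g. the FEMALE 0.003 / MALE 0.999 output quoted above:
console.log(normalise([0.003, 0.999])); // ≈ [0.003, 0.997]
```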

------
plg
@teganandsara

PROBABILITY FEMALE: 0.003

PROBABILITY MALE: 0.996

I would say this doesn't work very well.

~~~
muglug
Well, one data point means nothing. Also, if this is aimed at advertisers,
it's more useful to identify people whose _interests_ skew (stereotypically)
male or female.

Also, Tegan and Sara are great singers and artists, but neither of them
exemplifies what our culture considers stereotypically female.

