Author here. Beating everyone here to the punch: yes, I know this is pretty basic. I'm new to data-mining. I will happily take suggestions about what else to look for and how to improve it. Also, if anyone wants to donate their own Google Voice data to improve the model, I'd love to see how accurate we can get on some classification (I'm thinking family or not family, but open to suggestions).
I think the author's good intentions are taken a little bit too far here. As I understand it, it is claimed that the gender is "correct with 67% confidence", which I assume means "67% of all educated gender guesses are correct", with 50% standing for pure guesswork.
As is the case with any statistical data, a single value holds very little information. The article does not analyze how robust that 67% is: could changing the parameters bring it down to 50%? The configuration space in question is pretty large, so I think giving out a single number is very misleading.
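One way to put an error bar on a single accuracy figure is a confidence interval for the proportion. Here is a minimal sketch using the Wilson score interval. The article doesn't state how many guesses the 67% is based on, so the counts below are hypothetical and only illustrate how much the sample size matters.

```python
import math

def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a proportion; z=1.96 gives roughly 95%."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half

# Hypothetical counts (not from the article): roughly 67% accuracy
# at two different sample sizes.
small = wilson_interval(20, 30)
large = wilson_interval(67, 100)
print("30 guesses: ", small)
print("100 guesses:", large)
```

With only 30 guesses the interval still contains 50%, so a 67% hit rate would not be distinguishable from coin flipping; with 100 guesses the lower bound clears 50%.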
What I also don't agree with is the conclusion, "Most importantly, if that's what I can do with a limited set of my own data, imagine what the NSA can do with the datasets it has access to." In the light of the previous paragraph, this draws the conclusion that the NSA can do pretty big things, but it's based on the wrong premise that the analysis of the "own data" has any statistical significance.
Note that I don't want to argue about the conclusion - data mining is very powerful, and so is statistics - but this kind of approach is very easily abused, be it intentionally or unintentionally.
So, in short: be careful trusting statistics, even when you really really like the result.
Another problem is that he doesn't include any sort of meaningful baseline. Maybe 67% of the people he talks to are male, in which case he's not predicting anything, just guessing in a way that is biased by his own calling habits.
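A minimal sketch of that baseline check: compute how often you'd be right by always guessing the most common gender, and compare the model against that rather than against 50%. The label list here is made up for illustration; the real split in the author's contacts isn't given.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always guessing the most common label."""
    if not labels:
        raise ValueError("labels must be non-empty")
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)

# Hypothetical contact list (not the author's data): 67 male, 33 female.
labels = ["M"] * 67 + ["F"] * 33
baseline = majority_baseline_accuracy(labels)
print(f"always-guess-majority accuracy: {baseline:.2f}")
```

If the model's accuracy is no higher than this number, it hasn't learned anything beyond the class balance.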
[EDIT: I had some other criticisms that were incorrect and overly harsh anyway. I've removed them.]
Thanks for the feedback. As I'm fairly new to all of this, I'd love to hear how I can turn this into a more viable experiment.
Also, to defend myself a little: I think I'm at least being responsible by in no way claiming that it's statistically valid, and in fact making the point that it's not, and citing the reasons why, several times in the article.
I do stand by my point, however, that this helps the public understand the danger of metadata for data-mining, as well as introducing them to the pitfalls of statistics.
> As I'm fairly new to all of this, I'd love to hear how I can turn this into a more viable experiment.
The most important thing is getting an estimate of how reliable your averages are. Try modifying a couple of things: for example, how does the 67% change with sample size (verification)? How does the number turn out if you feed it biased data, e.g. only male phone numbers (falsification)?
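Roughly, both checks could look like the sketch below. Everything in it is an assumption for illustration: `predict` stands in for whatever the author's model does, and the data is synthetic. The stand-in model here deliberately always answers "M", to show what a degenerate classifier looks like under these tests.

```python
import random
import statistics

def accuracy(pairs, predict):
    """Fraction of (features, label) pairs that predict() gets right."""
    return sum(predict(x) == y for x, y in pairs) / len(pairs)

def accuracy_by_sample_size(pairs, predict, sizes, trials=200, seed=0):
    """Mean and standard deviation of accuracy over random subsamples
    (drawn without replacement) for each requested sample size."""
    rng = random.Random(seed)
    results = {}
    for n in sizes:
        scores = [accuracy(rng.sample(pairs, n), predict) for _ in range(trials)]
        results[n] = (statistics.mean(scores), statistics.stdev(scores))
    return results

# Synthetic data: 300 contacts, two thirds labelled "M".
pairs = [(i, "M" if i % 3 else "F") for i in range(300)]

def predict(features):
    # Hypothetical stand-in model that ignores its input entirely.
    return "M"

# Verification: how stable is the accuracy as the sample grows?
results = accuracy_by_sample_size(pairs, predict, sizes=[10, 50, 300])
for n, (mean, spread) in results.items():
    print(n, round(mean, 3), round(spread, 3))

# Falsification: feed it one-sided data and see what happens.
males = [p for p in pairs if p[1] == "M"]
females = [p for p in pairs if p[1] == "F"]
print("male-only:", accuracy(males, predict))
print("female-only:", accuracy(females, predict))
```

A model that scores near 67% overall but 100% on male-only and 0% on female-only data is just echoing the class balance. The spread at small sample sizes also shows how far a single run can land from the long-run figure.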
Thanks. I solemnly swear to not abuse statistics, so I'll play around and update the piece.
The goal is to use this as a hook to get the public interested, then take them for the ride as I learn. That's why I tried to be really cautious about pointing out all the reasons why this particular model is bad.
I think the basic flaw here is that the author assumes that the NSA does data-mining. I personally doubt it.
Rather (and here I'm going off on pure speculation myself) I think they can use the system to start with some off-line lead (say, one of the Chechen bomber brothers), see who he was calling / emailing, and then check whether there is anything valuable there.
I.e., the way the FBI would in the past do an investigation off-line, talk to people, and piece things together: they are looking to be able to replicate that online (possibly by going back in time).
"As part of K.D.D., an algorithm was applied to the broader data set in efforts to detect patterns of behavior fitting models that had been previously established as being indicative of the activities of a terrorist cell."
Thank you. But the next sentence reads " It is also run against a large set of what are known as “dirty numbers”—telephones linked to terrorists either through American signals intelligence or information provided by foreign services. Even the Libyans under Qaddafi turned over huge stacks of dirty numbers to us."
So that seems to correlate with what I am saying - they have some flag generated offline that they use as a marker to find something online.