

I'm Beating The NSA To The Punch By Spying On Myself - anjalimullany
http://www.fastcolabs.com/3012908/tracking/im-beating-the-nsa-to-the-punch-by-spying-on-myself

======
gabestein
Author here. Beating everyone here to the punch: yes, I know this is pretty
basic. I'm new to data-mining. I will happily take suggestions about what else
to look for and how to improve it. Also, if anyone wants to donate their own
Google Voice data to improve the model, I'd love to see how accurate we can
get on some classification (I'm thinking family or not family, but open to
suggestions).

------
quchen
I think the author's good intentions are taken a little bit too far here. As I
understand it, it is claimed that the gender is "correct with 67% confidence",
which I assume means "67% of all educated gender guesses are correct", with
50% standing for pure guesswork.

As is the case with any statistical data, a single value holds only very
little information. The article does not do any analysis on how good these 67%
are: could it be that changing the parameters changes it to 50%? There's a
pretty large configuration space in question, and therefore I think giving out
a simple number is very misleading.

What I also don't agree with is the conclusion, "Most importantly, if that's
what I can do with a limited set of my own data, imagine what the NSA can do
with the datasets it has access to." In the light of the previous paragraph,
this draws the conclusion that the NSA can do pretty big things, but it's
based on the wrong premise that the analysis of the "own data" has any
statistical significance.

Note that I don't want to argue about the conclusion - data mining is very
powerful, and so is statistics - but this kind of approach is _very_ easily
abused, be it intentionally or unintentionally.

So, in short: be careful trusting statistics, even when you really really like
the result.

~~~
slashcom
Another problem is that he doesn't include any meaningful baseline. Maybe 67%
of the people he talks to are male, in which case he's not predicting
anything, just guessing in a way that is biased by his own calling habits.

[EDIT: I had some other criticisms that were incorrect and overly harsh
anyway. I've removed them.]
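
A minimal sketch of the baseline check described above, in Python. The labels
are made up for illustration and are not the article's data; the function
names are hypothetical. The idea is to compare the classifier's accuracy with
what you get by always guessing the most common label.

```python
from collections import Counter


def majority_baseline(labels):
    """Accuracy obtained by always guessing the most common label."""
    if not labels:
        raise ValueError("labels must not be empty")
    _, count = Counter(labels).most_common(1)[0]
    return count / len(labels)


def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical data: 4 of 6 contacts are male.
actual = ["m", "m", "f", "m", "f", "m"]
predicted = ["m", "f", "f", "m", "m", "m"]

baseline = majority_baseline(actual)
score = accuracy(predicted, actual)
print(f"baseline={baseline:.2f} classifier={score:.2f}")
```

If the two numbers are about the same, as in this toy data, the model has not
learned anything beyond the makeup of the call log.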

~~~
gabestein
Thanks for the feedback. As I'm fairly new to all of this, I'd love to hear
how I can turn this into a more viable experiment.

Also, to defend myself a little: I think I'm at least being responsible by in
no way claiming that it's statistically valid, and in fact making the point
that it's not, and citing the reasons why, several times in the article.

I do stand by my point, however, that this helps the public understand the
danger of metadata for data-mining, as well as introducing them to the
pitfalls of statistics.

~~~
quchen
> As I'm fairly new to all of this, I'd love to hear how I can turn this into
> a more viable experiment.

The most important thing is getting an estimate of how good your averages are.
Try varying a couple of things. For example, how does the 67% change with
sample size (verification)? What does the number turn out to be if you feed it
biased data, e.g. only male phone numbers (falsification)?
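
Both checks can be sketched in a few lines of Python. This assumes you already
have (predicted, actual) label pairs from the model; the pairs below are
invented so that overall accuracy comes out at 67%, and the function names are
hypothetical.

```python
import random
import statistics


def accuracy(pairs):
    """Fraction of (predicted, actual) pairs where the guess was right."""
    return sum(p == a for p, a in pairs) / len(pairs)


def accuracy_spread(pairs, sample_size, trials=1000, seed=0):
    """Mean and standard deviation of accuracy over random subsamples."""
    rng = random.Random(seed)
    scores = [accuracy(rng.sample(pairs, sample_size)) for _ in range(trials)]
    return statistics.mean(scores), statistics.stdev(scores)


def accuracy_on_label(pairs, label):
    """Accuracy restricted to pairs whose true label is `label`."""
    subset = [(p, a) for p, a in pairs if a == label]
    return accuracy(subset) if subset else None


# Hypothetical results: 100 contacts, 67 guessed correctly overall.
pairs = (
    [("m", "m")] * 50 + [("f", "m")] * 10
    + [("f", "f")] * 17 + [("m", "f")] * 23
)

for size in (10, 40, 80):
    mean, spread = accuracy_spread(pairs, size)
    print(f"n={size}: mean={mean:.2f} stdev={spread:.2f}")

print("male only:", accuracy_on_label(pairs, "m"))
print("female only:", accuracy_on_label(pairs, "f"))
```

A large spread at small sample sizes means a single 67% figure says little. A
big gap between the per-label accuracies suggests the model mostly guesses one
gender.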

~~~
gabestein
Thanks. I solemnly swear to not abuse statistics, so I'll play around and
update the piece.

The goal is to use this as a hook to get the public interested, then take them
for the ride as I learn. That's why I tried to be really cautious about
pointing out all the reasons why this particular model is bad.

~~~
quchen
I didn't mean to say you specifically were abusing statistics, it's just very
easy to draw dubious conclusions. (Doing this intentionally would be the abuse
I was talking about.)

~~~
gabestein
I know, but I still want to take it seriously, and thank you for your
criticism.

------
omonra
I think the basic flaw here is that the author assumes the NSA does data-mining.
I personally doubt it.

Rather (and here I'm going off on pure speculation myself), I think they use
the system to start from some off-line lead (say, one of the Chechen bomber
brothers), see who he was calling / emailing, and then check whether there is
anything valuable there.

I.e., in the past the FBI would run an investigation off-line, talking to
people and piecing things together. Now they want to be able to replicate
that online (possibly by going back in time).

~~~
gabestein
"As part of K.D.D., an algorithm was applied to the broader data set in
efforts to detect patterns of behavior fitting models that had been previously
established as being indicative of the activities of a terrorist cell."

http://m.vanityfair.com/online/eichenwald/2013/06/obama-verizon-cell-phone

~~~
omonra
Thank you. But the next sentence reads, "It is also run against a large set of
what are known as “dirty numbers”—telephones linked to terrorists either
through American signals intelligence or information provided by foreign
services. Even the Libyans under Qaddafi turned over huge stacks of dirty
numbers to us."

So that seems to correlate with what I am saying - they have some _flag_
generated offline that they use as a marker to find something online.

~~~
gabestein
Oh, they definitely do both, I'm not disputing that.

------
kewk
I'd love to do some of this myself but don't know Ruby. I've used Google Voice
full time since 2009.

Anyone up for a Python Google Voice to csv script?
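
A rough starting point, not a tested exporter. It assumes the Google Voice
data has been downloaded as a folder of HTML files named like
"<contact> - <kind> - <timestamp>.html". That naming pattern is an assumption
here, so check it against your own export before relying on it. The script
reads only the file names (the metadata), never the message contents, and
`parse_filename` and `export_to_csv` are hypothetical names.

```python
import csv
from pathlib import Path

FIELDS = ["contact", "kind", "timestamp"]


def parse_filename(name):
    """Split '<contact> - <kind> - <timestamp>.html' into a dict.

    Returns None when the name does not fit the assumed pattern.
    """
    if not name.endswith(".html"):
        return None
    stem = name[: -len(".html")]
    # Split from the right so a contact name containing ' - ' stays intact.
    parts = stem.rsplit(" - ", 2)
    if len(parts) != 3:
        return None
    return dict(zip(FIELDS, parts))


def export_to_csv(folder, out_path):
    """Write one CSV row per recognised file; return the number of rows."""
    rows = []
    for path in sorted(Path(folder).glob("*.html")):
        row = parse_filename(path.name)
        if row is not None:
            rows.append(row)
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Usage would be something like `export_to_csv("Voice/Calls", "calls.csv")`,
where the folder path is again a placeholder for wherever your export lives.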

