> The problem isn't just that such a system is wrong, it's that the mathematics of testing makes this sort of thing pretty ineffective in practice. It's called the "base rate fallacy." Suppose you have a test that's 90% accurate in identifying both sociopaths and non-sociopaths. If you assume that 4% of people are sociopaths, then the chance of someone who tests positive actually being a sociopath is 26%. (For every thousand people tested, 90% of the 40 sociopaths will test positive, but so will 10% of the 960 non-sociopaths.) You have to postulate a test with an amazing 99% accuracy -- only a 1% false positive rate -- even to have an 80% chance of someone testing positive actually being a sociopath.
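The quoted arithmetic is easy to check with a few lines. This is a sketch, assuming (as the quote does) that the test's accuracy is the same for positives and negatives:

```python
# Bayes' rule check of the base-rate numbers quoted above.
def posterior(base_rate, accuracy):
    """P(has the condition | tests positive)."""
    true_pos = accuracy * base_rate               # actual cases flagged
    false_pos = (1 - accuracy) * (1 - base_rate)  # non-cases wrongly flagged
    return true_pos / (true_pos + false_pos)

print(posterior(0.04, 0.90))  # ~0.27: a positive result is right about 1 in 4 times
print(posterior(0.04, 0.99))  # ~0.80: even 99% accuracy only gets you to ~80%
```

(Strictly, 0.036 / 0.132 comes out at about 27%, a touch above the quoted 26%.)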
Interestingly, here he uses percentages to describe base rates and risk. Gerd Gigerenzer has a nice book, Reckoning with Risk, in which he uses many examples to explain the problems with this approach. Gigerenzer asks people to use plain counts ("natural frequencies") instead, which most people find much easier to understand.
Thus, Schneier's example becomes:
> Out of 1,000 people, about 40 will be sociopaths. You have a test that will tell you if someone is, or is not, a sociopath. The test will be correct 9 times out of 10. Bob has taken the test, and has been identified as a possible sociopath. The chance that Bob is actually a sociopath is about 1 in 4. This is because the test will correctly identify 36 of the 40 sociopaths, but it will also incorrectly flag 96 of the 960 non-sociopaths as sociopaths.
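For anyone who wants to fiddle with the counts, the natural-frequency arithmetic above can be sketched like this (numbers as in the example):

```python
# Natural-frequency version: whole counts out of 1,000 people.
people = 1000
sociopaths = 40                       # 4% of 1,000
non_sociopaths = people - sociopaths  # 960

true_flags = round(0.9 * sociopaths)       # 36 sociopaths correctly flagged
false_flags = round(0.1 * non_sociopaths)  # 96 non-sociopaths wrongly flagged

share = true_flags / (true_flags + false_flags)
print(f"{true_flags} of {true_flags + false_flags} flagged are sociopaths ({share:.0%})")
```

So of the 132 people the test flags, only 36 (roughly 1 in 4) are actual sociopaths.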
My writing is lousy, and other people will be able to clean this up, but even with my poor writing style it's easier for most people to follow and understand than the percentages.
This is alarmingly important when you're making a health decision - "Should I remove my breasts to reduce my risk of breast cancer?" for example.
EDIT: I use "sociopath" because it's in the source article. I agree with NNQ that it's very troubling to bandy around diagnostic labels like this, and deem people to be dangerous, just because of a tentative probabilistic diagnosis.
The internet is well known as a negative influence on certain people, but couldn't it also be having a positive effect, one that is harder to measure and more of an unintended side effect?
The real goal is building social/political systems that are robust and have checks and balances so that they cannot be perverted by special interests and are accessible to those who need them (child abuse support lines are a good example). Anything where a group intervenes on behalf of an individual is prone to disaster.
Are search engines and archives going to all willingly 'forget' that data when you 'just leave' Facebook? Are they going to not aggregate and correlate it to any new service you join?
This is one of the huge points of criticism of Real-Name-required services: a person can never escape an unjust judgment of such communities, due to the long memory of the internet.
... would be pointless, as a perfect world would have no concept of "trouble".
It is not a statement about the world in general.
Honest question, what would happen to the other three quarters?
All because people can't understand the example in question, which appears in the first few chapters of most introductory statistics books. And while all that money is being spent on useless checks, the 9/11 terrorists, whom the agencies were warned about, and the Boston bombers, whom the agencies were ALSO warned about, are not followed up on, because human and other resources are being spent on mass surveillance.
What Schneier is missing is that while you can't ID people that well from a single test, you can apply a bunch of them. In his example, one test improves the probability of correctly IDing a sociopath from 4% to about 27%. Apply another, different test of similar efficacy to that result set and you'll have roughly 32 true positives and 9 or 10 false positives, increasing the probability of a successful ID from ~27% to ~77%. Sure, there's no single test that will give you reliable answers, but so what? It's OK to use a multi-pronged solution.
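Under the strong (and here entirely hypothetical) assumption that each test errs independently of the others, the repeated-filtering idea can be sketched as:

```python
# Repeatedly re-test the flagged group, assuming each 90%-accurate test
# errs independently of the previous ones (a strong assumption in practice:
# correlated tests would filter far less effectively).
def retest(true_pos, false_pos, accuracy, rounds):
    for _ in range(rounds):
        true_pos *= accuracy         # actual positives who test positive again
        false_pos *= (1 - accuracy)  # false positives who test positive again
    return true_pos, false_pos

# Second test applied to the 36 true and 96 false positives from the first pass.
tp, fp = retest(36, 96, 0.9, 1)
print(tp, fp, tp / (tp + fp))  # share of true positives rises to ~77%
```

The catch is the independence assumption: two screening methods that key off similar behavior will tend to make the same mistakes on the same people.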
"Facebook records reveal convicted killer wrote 13-word post 5 years ago - red flag was raised - why was nothing done?"
Waiting until someone actually commits a crime will stop people being persecuted for a coincidental similarity of their behavior to that of a terrorist, sociopath or mime artist.
Even the 0.4% of the tested population that is incorrectly "proved" innocent of being a terrorist amounts to more than a million undetectable terrorists in the US alone.
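To make that arithmetic explicit (the US population figure of roughly 320 million is my assumption; the 4% base rate and 90% accuracy come from the example upthread):

```python
# False negatives: actual positives whom a 90%-accurate test clears.
us_population = 320_000_000  # rough figure, assumed for illustration
base_rate = 0.04             # 4% base rate from the example
miss_rate = 0.10             # the test misses 10% of actual positives

missed = us_population * base_rate * miss_rate  # 0.4% of everyone tested
print(f"{missed:,.0f} cleared in error")        # well over a million
```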
You're saying that you would be happy to join 74 other non-terrorists (i.e. law-abiding citizens) plus 25 actual terrorists and be taken off to Guantanamo Bay indefinitely?
You're really sure about that being a Good Thing for law enforcement?
Suppose you have a test that's 90% accurate in identifying both people with X and people without X. Assume that 4% of people have X, and you're told that you test positive for X.
Do you really find it easy to arrive at your actual chance (26%) of having X? Let's not forget that most people on HN are at the smarter end of the bell curve. It'd be interesting to see the results of a large scale study about answers to questions like this.
When I said I found his version clearer, I meant between the two versions originally given. The one you've just added is of course less clear because unlike the other two, it doesn't point out the issue.
BTW: Maybe it is clearer for calculation rather than understanding, because I get 27.(27)%, not 26%...
0.90 * 0.04 = 0.036 (true positives)
0.10 * 0.96 = 0.096 (false positives)
0.036 / (0.036 + 0.096) = 0.036 / 0.132 = 0.2727... (share of positives that are true positives)
I had to think about the calculation as I was doing it; it wasn't automatic, even though it was just multiplication. But I think the difficulty has more to do with the fact that you have to use some relative of Bayesian probability than with the fact that you had to deal with percentages.
But today try to ask a few people around you, and see what they say.
> Suppose you have a test that's 90% accurate in identifying people who have a disease, and 90% accurate in identifying people who do not have the disease. Assume that 4% of people have this disease. Hypothetical_Bob is tested, and the test says that he has the disease. What are the chances that Bob actually does have the disease?
Lots of people - smart people too! - struggle with this. Even if you give them pencil and paper and let them doodle around they will often give you an incorrect number. And most of them will be surprised if you tell them it's as low as 26%.
I think my point is slightly orthogonal since I misunderstood you; if you tell someone that something is "10%" they will think "that is pretty bad" whereas "1 in 10" is more likely to get a "hey, that's not too shabby" response. Percentages sound "worse" than numbers, even when they are the same (at least to me). Perhaps because they are harder to reason with?