This paragraph in particular is one of the worst examples I've ever seen of researchers NOT UNDERSTANDING WHAT THEY ARE DOING:
> Unlike a human examiner/judge, a computer vision algorithm or classifier has absolutely no subjective baggages, having no emotions, no biases whatsoever due to past experience, race, religion, political doctrine, gender, age, etc., no mental fatigue, no preconditioning of a bad sleep or meal. The automated inference on criminality eliminates the variable of meta-accuracy (the competence of the human judge/examiner) all together.
Please, read Weapons of Math Destruction and understand how excellent machine learning is at discovering and exploiting the biases in datasets.
Edit, no, sorry, it gets worse:
> the upper lip curvature is on average 23.4% larger for criminals than for non-criminals.
> the distance d between two eye inner corners for criminals is slightly shorter (5.6%) than for non-criminals
This paper is showing there's a detectable difference between the criminal and non-criminal faces in their dataset. Obviously, this would be the case if criminals actually had different faces. However, it could also be the case that people with more extreme faces are more likely to be convicted. That is, this paper could be showing that their dataset is biased.
Personally, I find this paper really interesting. I wouldn't have previously believed you could train such an accurate classifier from facial images alone. It could be revealing some really strong biases in the criminal justice system, and I'd love to see this kind of work used to help combat human biases rather than simply reinforce them (which seems to be how most people envision this will be used).
But I think we shouldn't get too prescriptive about what constitutes interesting and useful science. There may be very interesting relationships to uncover that we will otherwise miss. Previous research suggests that the correlation between attractiveness and 'averageness of features' is a strong one, and one thesis is that this is because having features 'matching the template' is indicative of general fitness and resistance to disease.
Based on this research you might wonder about whether or not a person having perceived low fitness was a risk factor for criminality. Or perhaps the low fitness itself somehow causes criminality? Indeed there are many other interesting interpretations that we might miss out on if we judge all 'dangerous' work at face value. Perhaps we as a society are willing to give up on those ideas in our desire to avoid the atrocities of the past, but personally I am not. Onwards and upwards, they say.
I'm not much of a fan of a lot of the arguments made in Weapons of Math Destruction, but I do appreciate that in summarizing it you draw the distinction between the biases of the engineer (or, illogically but oft-claimed nonetheless, of the algorithm itself) and those of the data used to train said model, and I think it's quite a valuable concept with regard to this particular paper.
For instance, the data set they're using here is fairly small, and while they did use 10-fold cross-validation, that's still less than ideal for neural nets generally, especially deep CNN architectures. Furthermore, the dataset itself seems fairly questionable to me. I'm not sure how much I trust the Chinese criminal justice system to adequately adjudicate culpability in the first place, but even setting aside such admittedly conspiratorial notions, it seems rather odd indeed that nearly half of their positive samples are not in fact convicted criminals but merely suspects. I do not find their attempts at playing devil's advocate persuasive, as it's not readily obvious exactly how they obtained or used any of the three different data sets in their testing.
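To make the small-sample worry concrete, here's a minimal sketch on synthetic data, with sklearn's LogisticRegression as a hypothetical stand-in for their classifiers (this is not the paper's actual pipeline, and the weak-signal construction is made up), showing how much per-fold accuracy can swing in 10-fold cross-validation at roughly this dataset size:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
# Synthetic stand-in roughly the size of the paper's dataset (~1856 faces),
# with only a weak true signal hidden in feature 0.
n, d = 1856, 50
X = rng.normal(size=(n, d))
y = (X[:, 0] + rng.normal(scale=3.0, size=n) > 0).astype(int)

clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(
    clf, X, y, cv=StratifiedKFold(10, shuffle=True, random_state=0)
)
# Per-fold accuracy varies noticeably at this sample size, so a single
# cross-validated number should be read with wide error bars.
print("folds:", scores.round(3))
print("spread:", round(float(scores.max() - scores.min()), 3))
```

With folds of under 200 images each, a couple of percentage points of fold-to-fold noise is expected even with a well-behaved linear model, before you get to anything as high-variance as a deep CNN.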
As for the appropriateness of the broader topic, I'm more or less of the persuasion that all questions deserve to be examined, and that provided the work does not cause direct harm, it's hard for me to support a prohibition on examining a given topic. That said, I do think that the more controversial the question, the higher the quality of research required, and, good lord, does this mess fall well short of the mark. Perhaps if there existed a hypothetical criminal justice system free of systemic biases, or, more realistically, a method by which to exactly define those prejudices and account for them in the composition of a data set, this could be a potentially useful question to investigate. But even then it seems to me quite unlikely that there's any particularly significant relationship between one's upper lip curvature and criminal disposition.
But to do it you need experts in criminology, physiology and machine learning, not just a couple of people who can follow the Keras instructions for how to use a neural net for classification.
For example, I think I remember reading a paper in the physiology field showing a link between increased testosterone and different facial features, but from memory (and I don't have the paper) there was no link between that and criminal offending.
In this case, the features they are finding don't seem to make any sense. A slight smile in the criminals seems more likely to be due to the way that set of photos was taken, and a number of the other features could possibly be explained by the fact that the criminal set came from a single police department (in a single geographical area), while the other dataset was collected online. Given the small size of the dataset, if it included a single "family" gang of criminals, that alone could have been enough to taint the features.
> Testosterone plays a significant role in the arousal of these behavioral manifestations in the brain centers involved in aggression and on the development of the muscular system that enables their realization. There is evidence that testosterone levels are higher in individuals with aggressive behavior, such as prisoners who have committed violent crimes.
> Inmates who had committed personal crimes of sex and violence had higher testosterone levels than inmates who had committed property crimes of burglary, theft, and drugs. Inmates with higher testosterone levels also violated more rules in prison, especially rules involving overt confrontation.
Though the connection can be said to be weak, and definitely not the only factor (high testosterone alone is not sufficient for criminal offending), it is there.
> In this case, the features they are finding don't seem to make any sense. A slight smile in the criminals seems more likely to be due to the way that set of photos are taken
From the paper: "We stress that the criminal face images in Sc are normal ID photos not police mugshots."
> by the fact the criminal set came from a single police department (in a single geographical area)
From the paper: "Subset Sc contains ID photos of 730 criminals, of which 330 are published as wanted suspects by the ministry of public security of China and by the departments of public security for the provinces of Guangdong, Jiangsu, Liaoning, etc.; the others are provided by a city police department in China under a confidentiality agreement."
> if it included a single "family"-gang of criminals it is likely that would have been enough to taint the features.
Family resemblance is an interesting one, but it's unlikely to significantly affect the accuracy difference between proper labeling and random labeling (they'd all need to be related).
Overfitting is sufficiently ruled out (to me), but leakage is not. Unfortunately it is not possible to replicate this study (even if the dataset were available, the implementation details are scarce). Differently sized raw ID pictures, or compression artifacts, could lead to leakage that is nearly undetectable for outsiders. I would probably not give this paper my stamp of approval even if it were on an uncontroversial subject, but it is not abysmally bad.
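To illustrate the kind of leakage being worried about here, a toy sketch (all numbers are made up): suppose the two image sources happened to store their ID photos at different JPEG quality levels. Then compressed file size alone separates the classes, and a classifier that never sees a face still scores near-perfectly:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
# Hypothetical leak: the two sources saved photos at different JPEG
# quality levels, so file size (in KB, invented values) differs by
# source rather than by anything about the faces themselves.
size_crim = rng.normal(loc=38.0, scale=2.0, size=365)      # KB
size_noncrim = rng.normal(loc=52.0, scale=2.0, size=565)   # KB

X = np.concatenate([size_crim, size_noncrim]).reshape(-1, 1)
y = np.array([1] * 365 + [0] * 565)

# A one-feature "face classifier" built purely on the storage artifact.
acc = cross_val_score(LogisticRegression(), X, y, cv=5).mean()
print(f"accuracy from file size alone: {acc:.2f}")
```

The usual defense is aggressive normalization before training (re-encoding, resizing, histogram equalization of every image through one pipeline), and the paper doesn't describe its preprocessing in enough detail to rule this class of artifact out.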
I do think one has to be careful to separate moral concerns from technical concerns. Sure, this all feels very wrong to me, and that should be taken into account when creating new regulation for ML systems, but the research itself (apart from the small sample size and vague data-gathering methods) is sufficiently solid for debate. Maybe we don't want to admit that physiognomy could have measurable predictive power for behavior, but refusing to look is wishful thinking, not science. Like you said: 'a link between increased testosterone and different facial features' exists, and I just sourced you that a link between criminal behavior and testosterone exists. Logic would lead us to conclude that different facial features can be indicative of different criminal behavior, no matter how bizarre, scary, or immoral the research that supports it.
89% accuracy means that there is almost no other feature that influences criminality.
That should set off all kinds of alarms. If there was some kind of relationship between facial features and criminality (and I don't discount that there could be) I'd expect it to be a very weak one, not one that is accurate 9/10 times.
All positive instances ARE convicted criminals, among whom there are NO political prisoners, just for your information.
All in the upper lip curvature, because why would a criminal smile in a mugshot.
Why would a non-criminal smile in a mugshot?
Which would explain this 'paradox', it's just overtraining:
>The seeming paradox that Sc [the criminal set] and Sn [the noncriminal set] can be classified but the average faces of Sc [the criminal set] and Sn [the noncriminal set] appear almost the same can be explained, if the data distributions of Sc [the criminal set] and Sn [the noncriminal set] are heavily mingled and yet separable.
They're heavily mingled because they're identical and you're just testing your predictions with your training data.
The fact that we do not like the result does not make it false. For validation, see page 4, where they check that a random labeling of the images does not produce such a good distinguisher.
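That check is essentially a permutation test: refit with shuffled labels and see whether accuracy collapses to chance. A minimal sketch of the idea on synthetic data with sklearn (not the paper's models):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# Synthetic data: a real (if noisy) signal lives in feature 0.
X = rng.normal(size=(930, 20))
y = (X[:, 0] + rng.normal(size=930) > 0).astype(int)

clf = LogisticRegression(max_iter=1000)
true_acc = cross_val_score(clf, X, y, cv=10).mean()

# Random-labeling baseline: shuffle y and re-run the same CV.
perm_accs = [
    cross_val_score(clf, X, rng.permutation(y), cv=10).mean()
    for _ in range(20)
]

print(f"true labels: {true_acc:.2f}, "
      f"shuffled labels: {np.mean(perm_accs):.2f}")
```

(sklearn packages this as `sklearn.model_selection.permutation_test_score`, which also reports a p-value for the gap.) Note what passing this test does and doesn't show: it rules out a degenerate classifier that would score well on any labeling, but a leaked source artifact is itself correlated with the true labels, so it survives this check just fine, which is why the leakage objection upthread still stands.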
When referees at real journals actually do their jobs correctly, they check arXiv when given manuscripts to review, and reject those that have already been posted there as violating the journals' "no prior publication" rules.
Likening this to the "wage gap", where XY chromosomes are associated with more outlier behavior: both the top and the bottom of society are heavily dominated by male participants, whereas the female population is closer to the average and has far fewer outliers. Could this be related?
There's variance associated with XY chromosomes that causes men to swing wildly on the scale in both positive and negative directions. That seems consistent with the hypothesis that individuals with wildly differing attributes more often than not fall on the outside of the law.
Evidence from Meta-Analyses of the Facial Width-to-Height Ratio as an Evolved Cue of Threat
> All four classifiers perform consistently well and produce evidence for the validity of automated face-induced inference on criminality, despite the historical controversy surrounding the topic.
I realize the authors are intentionally skirting around this bit (it's not really the point of their paper), but the "problem" isn't that some physical features may indicate criminality, with some level of success. That's cool or whatever I guess, but hardly an issue or truly revolutionary I think from a social perspective (in person, people tend to understand "vibes" rather well, and bad vibes come from a number of things like body language or visual cues about a person. Humans have their own inference systems for these things, flawed as they are.)
No, the problem -- the "controversy" surrounding the topic -- IMO, is that, almost with 100% certainty, any implementation of this system will be completely left unchecked, will effectively be private, and will be totally unaccountable by any practical means.
Do the authors of this paper really think any implementation of this system would be open to the public in any accountable way, if used by, say, LEOs? You know, as opposed to it being a big "every-criminal.sql" dump, based on hoarded data mining, and driven along and utilized by proprietary algorithms, created by some company selling to governments? LEOs in places like the US have already shown their hands with strategies like parallel construction and the downright willingness to fabricate evidence out of thin air.
Really, who cares what some data science nerds think of their fancy criminal face models, and whether they think they're "accurate despite the controversy", when the police can just say "It's accurate, I say so, you're going to jail" and they can completely make shit up to support it? It's not a matter of whether the actual thing is accurate, it's whether or not it gives them a reason to do whatever they like.
It reminds me of the rhetoric around building a wall along the Mexican border during the election: that can't happen, who would build it, it'd be huge, it'd be hard. Realistically? It'd be "easy". Humans have been building walls for a long, long time. It's not unthinkable. The difficult part is actually murdering people who would try to cross the wall by gunning them down -- and they will try to cross. I mean, you probably don't have to kill too many people to send the message. Just enough of them. The Iron Curtain was a real thing, too, after all.
This is similar. The algorithm is the "easy" part. It's "only" some science. No, the hard part is dealing with the consequences. The hard part is closing Pandora's box after you've opened it.