Ask HN: Any open source code/materials on predicting future crimes based on data?
62 points by febin on Dec 21, 2017 | 56 comments

In any case, you shouldn't neglect the subtle but important sources of bias those pre-crime models can have. Here's an interesting talk about it:


Basically, one instance of bias is that many crime-prediction models are trained on police data, which means they will predict crime in places the police already target more often. The model's predictions then amplify that effect, since more training data is generated from the places that are now policed more heavily, and so on.
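That feedback loop can be sketched as a deterministic toy with invented numbers (not a model of any real deployment): two areas with identical true crime rates, where patrols are allocated in proportion to recorded crime and crime is only recorded where police patrol.

```python
# Two areas with IDENTICAL true crime rates, but area 0 starts out with
# more recorded crime because it was patrolled more. Numbers invented.
true_rate = [0.3, 0.3]
recorded = [10.0, 1.0]

for step in range(20):
    total = recorded[0] + recorded[1]
    # "Model": allocate patrols in proportion to recorded crime.
    patrol_share = [recorded[i] / total for i in range(2)]
    # Police only record crime where they actually patrol.
    for i in range(2):
        recorded[i] += patrol_share[i] * true_rate[i]

# The initial bias never washes out: area 0 keeps ~91% of patrols
# forever, even though both areas have the same true crime rate.
print([round(s, 3) for s in patrol_share])  # [0.909, 0.091]
```

In this toy the increments are proportional to the existing shares, so the 10:1 ratio in the historical data is locked in permanently; the model never discovers that the two areas are identical.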

There are lots of resources out there on AI fairness these days. I think everyone who tries stuff like crime prediction should read up on that topic.

Cathy O'Neil wrote a book called "Weapons of Math Destruction" -- interesting read.

You can listen to an interview with her on EconTalk -- interesting for learning more about the hidden biases.


>Cathy O'Neil wrote a book called "Weapons of Math Destruction" -- interesting read

+1000. That book should be required reading for anyone working in machine learning. Written by a former Wall Street quant who has the math down cold.

Her insight about rampant bias in allegedly politically agnostic machine learning circles is that formulating and producing answers is trivial compared to formulating and producing the questions.

Super-relevant to this thread is her work on recidivism risk-scoring algos run on prisoners and defendants. The feedback loops these algos spur are seriously damaging the lives of huge numbers of people in the criminal justice system, far beyond proportionality for the offenses that brought them there.

So if the AI identifies insider trading at trading firms, banks, etc., we should beware that this would create a feedback loop to look more into the investment and banking sectors and ignore the mom-and-pop insider traders? That it goes where crime is rampant rather than where it isn't, and that could be a bad thing?

I think it's not inherently bad to have a bias, but it's bad if you don't recognize it. In your example:

If you train an AI on data that solely consists of trading firms & banks, you should recognize that it's an AI biased towards detecting activity at trading firms & banks, and that it might be lacking in other areas.

It becomes dangerous when such an AI is marketed and assumed as unbiased and the source of truth for detecting all insider trading activity.

If we don't have infinite resources (prosecutorial, for example), it makes sense to concentrate on the most salient loci of crime. If after enough resources are devoted to the most egregious locus, a different locus becomes the focus of crime and in turn gets the attention, that's not a bad thing...

In other words, if done decently well, they'd obtain data from all places but focus on the problematic areas till they reach equilibrium, then redirect to the next hotspot, no?

Is it a locus because that is where the crime happens, or is it a locus because that is where all of the crime prevention/detection/prosecution is directed due to external circumstances? You have to be able to correct for this sort of bias in your training data, or else you are just baking the external circumstances into the model and pretending that they reflect reality. Start by figuring out how you would obtain data from all places first.

Mom and pop insider trading? How would that work, exactly? Or was it intended as a joke?

Small fry. People like Martha Stewart. She's wealthy, of course, but her dipping in is of little consequence. Or your buddy at Google lets you in on something they're not supposed to share. It's the big sharks you care about.


You must be talking from your white privilege.

It's a commonly known fact that crime-ridden, mostly minority neighborhoods have a complete dearth of services such as police, firefighters, and ambulances. Even back in 1990, Public Enemy released a song called "911 Is a Joke" because if you called 911 from a black neighborhood, they wouldn't respond.

It's well known that cops would rather stay in rich neighborhoods and let the poorer neighborhoods fester. This happened during the LA riots, when Koreatown burned because most of the cops went to Beverly Hills and other wealthy areas to protect the rich.

I saw this firsthand in Detroit about 8 years ago. I was driving through a bad neighborhood with my friend when we entered Grosse Pointe, a rich area. Cops followed us until we left, which is their way of saying "you don't belong here." Meanwhile, among the burnt-down houses and broken windows of Detroit proper, you couldn't see a single cop.

I lived in a town neighboring another where police presence was about four times higher than in mine. You would be stopped while walking down the street in the neighboring town; meanwhile, I could look out my window and watch a few drug deals take place in an afternoon.

The difference? The neighboring town had a couple decades long reputation of being a hotbed for crime.

That reputation would be cemented in an AI system that was trained off of police data.

Any such system is, or would be, potentially very dangerous. Crime data is not the same thing as crime. Populations that are over-policed are disproportionately represented in any such data set, leading to higher predicted crime, leading in turn to more over-policing (a feedback loop). I implore anyone attempting to build such a system to consider the serious issue of machine bias and its implications in the real world.

See this tutorial given at this year's NIPS machine learning conference: http://mrtz.org/nips17/#/

Potential dangers of such a system are highlighted in the film Minority Report. https://en.wikipedia.org/wiki/Minority_Report_(film)

This is an area that was explored some years ago but ultimately determined to have civil rights pitfalls. Crime reporting is only as good (or as biased) as the humans who report and input the crime data. Therefore, crime "training" data for AI systems can be very biased, and AI might only magnify those biases further -- a sort of self-perpetuating feedback loop.

Having worked in law enforcement at various levels (state and federal) in a prior professional life, I can attest to the differences in what gets reported and how, based upon who was working or supervising and where they were assigned. Humans are simply not reliable reporters for this kind of data. No matter how hard we try to make the reports plain and standardized, our biases, one way or another, will always seep in.

Inspired by a Kaggle competition (https://www.kaggle.com/c/sf-crime), one of my older blog posts involved predicting the type of arrest in San Francisco (given that an arrest occurred) using data such as location and timing and the relatively new LightGBM machine learning algorithm: http://minimaxir.com/2017/02/predicting-arrests/

The code is open-sourced in an R Notebook: http://minimaxir.com/notebooks/predicting-arrests/

The model performance isn't great enough to usher in precrime, even in the best case. There are likely better approaches nowadays. (e.g. since the location data is spatial, a convolutional neural network might work better.)
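For flavor, here is a stdlib-only baseline in the same spirit as the linked post -- predicting the most frequent crime category per district and time-of-day bucket. This is not the notebook's LightGBM approach, and the rows are invented, merely shaped like the SF dataset:

```python
from collections import Counter, defaultdict

# Invented toy rows: (police district, hour of day, crime category).
rows = [
    ("TENDERLOIN", 14, "DRUG/NARCOTIC"),
    ("TENDERLOIN", 15, "DRUG/NARCOTIC"),
    ("TENDERLOIN", 2,  "ASSAULT"),
    ("MISSION",    23, "ASSAULT"),
    ("MISSION",    22, "ASSAULT"),
    ("MISSION",    13, "LARCENY/THEFT"),
]

def bucket(hour):
    # Four coarse time-of-day buckets: night, morning, afternoon, evening.
    return hour // 6

# Count categories per (district, time bucket).
counts = defaultdict(Counter)
for district, hour, category in rows:
    counts[(district, bucket(hour))][category] += 1

def predict(district, hour):
    # Most frequent category for that district/time, or None if unseen.
    c = counts.get((district, bucket(hour)))
    return c.most_common(1)[0][0] if c else None

print(predict("TENDERLOIN", 13))  # DRUG/NARCOTIC
```

A gradient-boosting model like the one in the post is essentially a much more expressive version of this lookup, able to combine many such features at once.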

Careful! Your crime predictor might unfairly conclude that men are more likely to commit crimes than women.

People might conclude the same thing, since more crimes are committed by males than females. Are you saying it's unfair because an innocent male could become a suspect by virtue of only being male? Or for a more subtle reason?

Statistics have been consistent in reporting that men commit more criminal acts than women.[1][2] Self-reported delinquent acts are also higher for men than women across many different actions.[3] Burton, et al. (1998) found that low levels of self control are associated with criminal activity.[4] Many professionals have offered explanations for this sex difference. Some differing explanations include men's evolutionary tendency toward risk and violent behavior, sex differences in activity, social support, and gender inequality.


OP wants an open-source, data-based statistical model of where crime might occur (methodological flaws and all), not unasked-for politicized preaching about the supposed virtues of one subset of people over another.

As long as the methodological flaws are someone else's problem (the user/consumer), we don't need to take responsibility for building the tools that facilitate those methodological flaws, right?

I agree. I don't want the police using biased tools either.

OP has stated elsewhere in the comments that the reason they're interested in the tech is for testing it in some other unrelated area.

I don't see the OP making a claim like that. In any case, OP did not specify what the purpose of the tool was/is, and doesn't seem willing to explain what the purpose of the tool is.

Or worse, it might actually mention out loud that white collar crime kills more people and costs far more money to society every year than street crime, and mention that white collar crime is completely normalized and not even seen as deviant within upper class communities. What are you supposed to do then? Actually arrest the rich for the harm they do?

Reductio ad absurdum of the bias argument (-:

It would be right.

There are much better ways to solve crime than to double down on enforcement that is already happening, which is likely all your model will tell you. “Police the neighbourhoods where people are poor.” Wow, thanks, ML!

Palantir already does all this on a massive scale for the US govt. Want to affect future crime in a positive way? Solve the problems that contribute to it.

Not that you asked.

I am currently writing my master's thesis on predictive policing using machine learning, working with local police in Norway. I've got a bunch of papers and articles you might find interesting. Hit me up: michaedm@stud.ntnu.no

I'd be interested to know how you filter out garbage data. (Judging by the questions here, others are interested too, so a public response would be great.)

I know people have done these types of studies before but found that they easily became biased, and thus there is wariness about using them (like the judge AI that was more likely to recommend convicting black people). I'm not sure how it is in Norway, but I don't expect it to be much different from America, where some places are disproportionately convicted of crimes while in other areas the same acts are treated as infractions. This is really going to mess with the data and perpetuate the bad system.

Thanks, just shot you an email.

Do you know about the journalist who spent years obsessing about this and supposedly had some predictive success relating to serial killers?

If I recall, it was kind of a lone-wolf effort, so I don’t know the rigor of his techniques; however, you never know, he might want to share results or collaborate.

Don’t have a link handy, but that should be enough info to google if you’re interested.

I think you're referring to this article about Thomas Hargrove that was in Bloomberg in February:


He's the founder of the Murder Accountability Project:


Ask HN: Any open source code/materials on predicting good fall guys based on data?

There is a project[1] + whitepaper[2] on projecting the likelihood of future white collar crimes written by Sam Lavigne, Francis Tseng, and Brian Clifton.

[1] https://thenewinquiry.com/white-collar-crime-risk-zones/ [2] https://whitecollar.thenewinquiry.com/static/whitepaper.pdf

Believe I heard about a project a UW student did predicting crime in San Francisco based on volume of vulgar tweets in a given area. Not sure if it's on github anywhere but you can always start with that idea. Nothing about specifics of the crimes, just where a high volume of them would be located.

There's a British TV presenter and mathematician called Hannah Fry who has published in this area, including a talk in Germany (received much like many comments on this page), some Numberphile videos, and BBC documentaries in other areas of data science.


Food for thought on how incredibly biased these efforts can be.

For a source of data: https://data.cityofchicago.org/

And in the case of crime, Chicago should be a pretty good dataset.

Are you looking for predicting future crimes in an area (i.e. city, neighborhood, state, etc...) or predicting whether an individual will commit future crimes?

Fwiw there's some discussion of this in the book Everybody Lies. Look into that. Perhaps follow up with the author. His name escapes me atm.

Are you looking for tools or data?

I am looking for tools or algorithms.

You need to watch Person of Interest and Minority Report.

I watched POI. I've also heard of cops using ML to predict crime locations. I am trying to find and reuse some existing code for a different project.

Don't be surprised if those predictions are heavily biased against minorities and poor people. Do you care if they are?

It's a similar problem to using ML to give people credit scores.

If the training data includes a lot of minorities and poor people breaking laws / delinquent payments, then your ML will simply key on race/economic status as a predictor.

So you've built a system that simply targets those groups.

But you might object and say that this race/economic status targeting gives the highest accuracy! It was only learned in the training data, after all. You can make a great classifier that is extremely unfair.

So you have to realize there is a conflict here between accuracy and fairness. This means there is a conflict between observational data (training) and using that data to produce decisions/outcomes.

If you make decisions/outcomes that reinforce the training data, you do not give racial groups/low economic status people a chance to improve their lives.

That is extremely inhumane, predatory, and unfair.
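The "keying on group membership" point can be shown with a toy calculation (all data invented). Note that simply dropping the protected attribute doesn't necessarily help, because a correlated proxy like neighborhood carries much of the same signal:

```python
# Invented records: (neighborhood, group, outcome). Group membership
# correlates with neighborhood, and outcomes follow group in this toy data.
data = [
    ("north", "A", 0), ("north", "A", 0), ("north", "A", 0), ("north", "B", 1),
    ("south", "B", 1), ("south", "B", 1), ("south", "B", 0), ("south", "A", 0),
]

def rate(records, key):
    """Average outcome per value of `key` -- what a simple model learns."""
    groups = {}
    for rec in records:
        groups.setdefault(key(rec), []).append(rec[2])
    return {k: sum(v) / len(v) for k, v in groups.items()}

by_group = rate(data, key=lambda r: r[1])         # uses protected attribute
by_neighborhood = rate(data, key=lambda r: r[0])  # "attribute removed"
print(by_group)         # {'A': 0.0, 'B': 0.75}
print(by_neighborhood)  # {'north': 0.25, 'south': 0.5}
```

Even with the group column removed, a model trained on these rows would still flag "south" more often, which is just the group signal laundered through geography.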

All I want to predict is time periods/locations which are vulnerable. Nothing more than that.

Racism is morally wrong but not mathematically wrong. P(criminal|black) > P(criminal), but if you observe that someone has black skin and treat them poorly because of it, you've done a bad thing. It doesn't matter that you were just following Bayesian reasoning because you're still hurting someone on the basis of something they can't control.
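One way to make that base-rate point concrete (numbers invented for illustration):

```python
# Even when P(criminal | group) is double the base rate, flagging
# everyone in the group as a suspect is wrong about nearly all of them.
p_criminal = 0.01                # overall base rate (invented)
p_criminal_given_group = 0.02    # elevated conditional rate (invented)
assert p_criminal_given_group > p_criminal  # the Bayesian observation above

# Fraction of flagged people who are actually innocent:
innocent_among_flagged = 1 - p_criminal_given_group
print(innocent_among_flagged)  # 0.98
```

The conditional probability can be "correct" while the policy built on it harms almost everyone it touches.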

Lady Justice doesn't wear a blindfold as a fashion accessory. Discarding information is a key factor in nearly every established system of justice / morality. Refusing to do so (i.e. "just" running a ML algorithm) places you directly at odds with society's hard-earned best practices.

> Lady Justice doesn't wear a blindfold as a fashion accessory

I never noticed that before. Thanks for pointing this out!

> All I want to predict is time periods/locations which are vulnerable.

Ok, and to what end?

I assume someone else will be consuming these predictions, else you wouldn't bother at all.

What are your customers/users going to do with these predictions?

Or is that simply not your responsibility; someone else's problem?

Take a look at crimereports.com. You might get lucky and find a good source on a per-city or per-county basis; it's too fragmented overall to try this at a larger scale. Different countries may have different documentation standards and publishing guidelines for this kind of data; it might be worth a shot to look.

And Psycho-Pass.

Have you checked kaggle for relevant datasets?

Why don't you base it on Law data?

The Poisson distribution!
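For what it's worth, a minimal Poisson count model for one area looks like this (history invented; the MLE for the rate is just the sample mean):

```python
import math

# Weekly incident counts for one area (invented history).
history = [4, 2, 5, 3, 4, 6, 4]

lam = sum(history) / len(history)  # MLE for the Poisson rate

def poisson_pmf(k, lam):
    # P(X = k) for X ~ Poisson(lam)
    return math.exp(-lam) * lam ** k / math.factorial(k)

# Probability of 8+ incidents next week (a "hot week").
p_hot = 1 - sum(poisson_pmf(k, lam) for k in range(8))
print(round(lam, 2), round(p_hot, 4))  # 4.0 0.0511
```

Of course, everything said above about biased inputs applies: if the counts come from police reports, this model predicts reporting, not crime.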
