Ask HN: Any open source code/materials on predicting future crimes based on data? - febin
======
Asdfbla
In any case, you shouldn't neglect the subtle but important sources of bias
those pre-crime models can have. Here's an interesting talk about it:

[https://www.youtube.com/watch?v=MfThopD7L1Y](https://www.youtube.com/watch?v=MfThopD7L1Y)

Basically, one instance of bias is the fact that many crime-prediction models
are trained on police data, which means they will predict crime in places more
often targeted by the police anyway. Then the model predictions even amplify
that effect, since more training data may be generated from the places now
more often policed, etc.
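That amplification is easy to sketch in a toy simulation (all numbers invented; this illustrates the mechanism, not any real deployment). Two districts have identical true crime rates, but recorded crime tracks patrol presence, and next year's patrols follow recorded crime:

```python
# Toy feedback-loop model: two districts with IDENTICAL true crime rates.
# Recorded crime is proportional to patrol presence, so a small historical
# surplus of records in district 0 gets baked into the allocation forever.

TRUE_RATE = 0.1            # expected recorded crimes per patrol unit, same everywhere
patrols = [50.0, 50.0]     # current split of 100 patrol units
observed = [5.0, 0.0]      # district 0 starts with a small surplus of past records

for year in range(20):
    for d in (0, 1):
        observed[d] += patrols[d] * TRUE_RATE   # records track patrols, not true crime
    total = observed[0] + observed[1]
    # "Predictive" step: next year's patrols follow recorded crime counts.
    patrols = [100.0 * observed[d] / total for d in (0, 1)]

print(round(patrols[0]), round(patrols[1]))   # settles at roughly a 67/33 split
```

Note that the initial imbalance never washes out: the allocation locks in at about 2:1 even though the true rates are equal by construction, and if patrol presence also raised the per-patrol reporting rate, the split would keep diverging.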

There's lots of resources out there on AI fairness these days. I think
everyone who tries stuff like crime prediction should read up on that topic.

~~~
mc32
So if the AI identifies insider trading at trading firms, banks, etc., we
should beware that this would create a feedback loop to look more into the
investment and banking sectors while ignoring the mom-and-pop insider
traders? That they go where crime is rampant rather than where it isn't, and
that could be a bad thing?

~~~
bcyn
I think it's not inherently bad to have a bias, but it's bad if you don't
recognize it. In your example:

If you train an AI on data that solely consists of trading firms & banks, you
should recognize that it's an AI biased towards detecting activity at trading
firms & banks, and that it might be lacking in other areas.

It becomes dangerous when such an AI is marketed and assumed as unbiased and
the source of truth for detecting all insider trading activity.

~~~
mc32
If we don't have infinite resources (prosecutorial, for example), it makes
sense to concentrate on the most salient loci of crime. If after enough
resources are devoted to the most egregious locus, a different locus becomes
the focus of crime and in turn gets the attention, that's not a bad thing...

In other words, if done decently well, they'd obtain data from all places but
focus on the problematic areas till reaching equilibrium, then redirect to the
next hotspot, no?

~~~
evgen
Is it a locus because that is where the crime happens, or is it a locus because
that is where all of the crime prevention/detection/prosecution is directed
due to external circumstances? You have to be able to correct for this sort of
bias in your training data or else you are just baking the external
circumstances into the model and pretending that they reflect reality. Start
by figuring out how you would obtain data from all places first.
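One partial correction, assuming you can estimate how detection effort varies by area (a big if), is to reweight recorded counts by the inverse of the estimated detection rate. A minimal sketch with entirely hypothetical numbers:

```python
# Recorded incidents per district, plus a (hard-to-obtain) estimate of the
# fraction of true incidents that actually get recorded there.
recorded  = {"downtown": 180, "suburb": 40}
detection = {"downtown": 0.60, "suburb": 0.15}   # hypothetical detection rates

# Inverse-probability correction: estimated true incidents = recorded / detection.
estimated = {d: recorded[d] / detection[d] for d in recorded}
print(estimated)
```

The raw counts suggest downtown has 4.5x the crime of the suburb; the corrected estimates (300 vs. about 267) are nearly equal. Of course, the detection rates themselves are exactly the external circumstances that are hard to measure in the first place.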

------
bayesbiol
Any such system is, or would be, potentially very dangerous. Crime data is not
the same thing as crime. Populations that are over-policed are
disproportionately represented in any such data set, leading to higher
predicted crime, leading in turn to more over-policing (a feedback loop). I
implore anyone attempting to build such a system to consider the serious issue
of machine bias and its implications in the real world.

See this tutorial given at this year's NIPS machine learning conference:
[http://mrtz.org/nips17/#/](http://mrtz.org/nips17/#/)

~~~
jensv
Potential dangers of such a system are highlighted in the film Minority Report.
[https://en.wikipedia.org/wiki/Minority_Report_(film)](https://en.wikipedia.org/wiki/Minority_Report_\(film\))

------
USNetizen
This is an area that was explored some years ago but ultimately determined to
have civil rights pitfalls. Crime reporting is only as good (or as biased) as
the humans who report and input the crime data. Therefore, crime "training"
data for AI systems can be very biased, and AI may only magnify those biases
further - a sort of self-reinforcing feedback loop.

Having worked in law enforcement at various levels (state and federal) in a
prior professional life, I can attest to the differences in what gets reported
and how based upon who was working or supervising and where they were
assigned. Humans are simply not reliable reporters for this kind of data. No
matter how hard we try to make the reports plain and standardized, our biases,
one way or another, will always seep in.

------
minimaxir
Inspired by a Kaggle competition ([https://www.kaggle.com/c/sf-crime](https://www.kaggle.com/c/sf-crime)), one of my older blog posts involved predicting the type of arrest in San Francisco (given that an arrest occurred) using data such as location and timing and the relatively new LightGBM machine learning algorithm: [http://minimaxir.com/2017/02/predicting-arrests/](http://minimaxir.com/2017/02/predicting-arrests/)

The code is open-sourced in an R Notebook: [http://minimaxir.com/notebooks/predicting-arrests/](http://minimaxir.com/notebooks/predicting-arrests/)

The model performance isn't good enough to usher in precrime, even in the best
case. There are likely better approaches nowadays (e.g., since the location
data is spatial, a convolutional neural network might work better).

------
SamReidHughes
Careful! Your crime predictor might unfairly conclude that men are more likely
to commit crimes than women.

~~~
jhiska
OP wants an open-sourced, data-based statistical model of where crime might
occur (methodological flaws and all), not unasked-for politicized preaching
about the supposed virtues of one subset of people over another.

~~~
platz
As long as the methodological flaws are someone else's problem (the
user/consumer), we don't need to take responsibility for building the tools
that facilitate those methodological flaws, right?

~~~
jhiska
I agree. I don't want the police using biased tools either.

OP has stated elsewhere in the comments that the reason they're interested in
the tech is for testing it in some other unrelated area.

~~~
platz
I don't see the OP making a claim like that. In any case, OP did not specify
what the purpose of the tool was/is, and doesn't seem to be willing to explain
it.

------
lwansbrough
There are much better ways to solve crime than to double down on enforcement
that is already happening, which is likely all your model will tell you.
“Police the neighbourhoods where people are poor” wow, thanks ML!

Palantir already does all this on a massive scale for the US govt. Want to
affect future crime in a positive way? Solve the problems that contribute to
it.

Not that you asked.

------
michaelmcmillan
I am currently writing my master's thesis on predictive policing using machine
learning. Working with local police in Norway. Got a bunch of papers and
articles you might find interesting. Hit me up: michaedm@stud.ntnu.no

~~~
godelski
I'd be interested to know how you filter out the garbage data. (Judging by the
questions here, I know others are interested too, so a public response would be
great.)

I know people have done these types of studies before but found that they
easily became biased, and thus there is a wariness about using them (like the
judge AI that was more likely to convict black people). I'm not sure how it is
in Norway, but I don't expect it to be much different from America, where there
are places that are disproportionately convicted of crimes, while in other
areas the same crimes are treated as infractions. This is really going to mess
with the data and perpetuate the bad system.

------
thedrake
A lot of good work by Cynthia Rudin
[http://online.liebertpub.com/doi/pdf/10.1089/big.2014.0021](http://online.liebertpub.com/doi/pdf/10.1089/big.2014.0021)
and her tools are open sourced (her papers
[https://users.cs.duke.edu/~cynthia/papers.html](https://users.cs.duke.edu/~cynthia/papers.html)
and tools
[https://users.cs.duke.edu/~cynthia/code.html](https://users.cs.duke.edu/~cynthia/code.html))

------
WhitneyLand
Do you know about the journalist who spent years obsessing about this and
supposedly had some predictive success relating to serial killers?

If I recall, it was kind of a lone-wolf effort, so I don't know the rigor of
his techniques; however, you never know if he might want to share results or
collaborate.

Don’t have a link handy, but that should be enough info to google if you’re
interested.

~~~
noisecanceling
I think you're referring to this article about Thomas Hargrove that was in
Bloomberg in February:

[https://www.bloomberg.com/news/features/2017-02-08/serial-ki...](https://www.bloomberg.com/news/features/2017-02-08/serial-killers-should-fear-this-algorithm)

He's the founder of the Murder Accountability Project:

[http://www.murderdata.org](http://www.murderdata.org)

------
jjoonathan
Ask HN: Any open source code/materials on predicting good fall guys based on
data?

------
ryanmaynard
There is a project[1] + whitepaper[2] on projecting the likelihood of future
white collar crimes written by Sam Lavigne, Francis Tseng, and Brian Clifton.

[1] [https://thenewinquiry.com/white-collar-crime-risk-zones/](https://thenewinquiry.com/white-collar-crime-risk-zones/)

[2] [https://whitecollar.thenewinquiry.com/static/whitepaper.pdf](https://whitecollar.thenewinquiry.com/static/whitepaper.pdf)

------
partycoder
[https://en.wikipedia.org/wiki/Predictive_policing](https://en.wikipedia.org/wiki/Predictive_policing)

The British series "The Code" speaks a little bit about it in ep 3:
[https://en.wikipedia.org/wiki/The_Code_(2011_TV_series)#Stag...](https://en.wikipedia.org/wiki/The_Code_\(2011_TV_series\)#Stage_3:_The_Finale)

------
zebrafish
Believe I heard about a project a UW student did predicting crime in San
Francisco based on the volume of vulgar tweets in a given area. Not sure if
it's on GitHub anywhere, but you can always start with that idea. Nothing about
the specifics of the crimes, just where a high volume of them would be located.

------
tobylane
There's a British TV presenter and scientist called Hannah Fry who has
published in this area, including a talk in Germany (received much like many
comments on this page), some Numberphile videos, and BBC documentaries in other
areas of data science.

------
YurtleTheTurtle
[https://www.propublica.org/article/machine-bias-risk-assessm...](https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing)

Food for thought on how incredibly biased these efforts can be.

------
paulie_a
For a source of data:
[https://data.cityofchicago.org/](https://data.cityofchicago.org/)

And in the case of crime, Chicago should be a pretty good dataset.

------
crabl
[https://github.com/kandluis/crime-prediction](https://github.com/kandluis/crime-prediction) is a good place to start

------
jeffmould
Are you looking for predicting future crimes in an area (i.e. city,
neighborhood, state, etc...) or predicting whether an individual will commit
future crimes?

------
PaulHoule
[https://en.wikipedia.org/wiki/CompStat](https://en.wikipedia.org/wiki/CompStat)

------
chiefalchemist
Fwiw there's some discussion of this in the book Everybody Lies. Look into
that. Perhaps follow up with the author. His name escapes me atm.

------
thisisit
Are you looking for tools or data?

~~~
febin
I am looking for tools or algorithms.

------
amigoingtodie
You need to watch Person of Interest and Minority Report.

~~~
febin
I watched POI. I've also heard of cops using ML to predict crime locations. I
am trying to find and reuse some existing code for a different project.

~~~
platz
Don't be surprised if those predictions are heavily biased against minorities
and poor people. Do you care if they do?

It's a similar problem to using ML to give people credit scores.

If the training data includes a lot of minorities and poor people breaking
laws / delinquent payments, then your ML will simply key on race/economic
status as a predictor.

So you've built a system that simply targets those groups.

But you might object and say that this race/economic status targeting gives
the highest accuracy! It was only learned in the training data, after all. You
can make a great classifier that is extremely unfair.

So you have to realize there is a conflict here between accuracy and fairness.
This means there is a conflict between _observational data_ (training), and
using that data to produce _decisions/outcomes_.

If you make _decisions/outcomes_ that _reinforce_ the training data, you do
not give racial groups/low economic status people a chance to improve their
lives.

That is extremely inhumane, predatory, and unfair.
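The mechanism is easy to demonstrate with synthetic data: give two groups identical true offense rates but different probabilities of being caught and recorded, and any model fit to the recorded labels will learn group membership as a strong "risk factor". A toy sketch (all rates invented):

```python
import random

random.seed(42)

TRUE_RATE = 0.05                       # same true offense rate in both groups
REPORT = {"A": 0.9, "B": 0.2}          # group A is far more likely to be caught/recorded

# Each row: (group, truly_offended, shows_up_in_the_training_data)
data = []
for _ in range(100_000):
    group = random.choice("AB")
    offended = random.random() < TRUE_RATE
    recorded = offended and random.random() < REPORT[group]
    data.append((group, offended, recorded))

def recorded_rate(rows, group):
    """Fraction of a group's rows that appear as positives in the recorded data."""
    members = [r for r in rows if r[0] == group]
    return sum(1 for _, _, rec in members if rec) / len(members)

# A model fit to *recorded* labels sees group as highly predictive ...
print(recorded_rate(data, "A"))   # expected near 0.05 * 0.9 = 0.045
print(recorded_rate(data, "B"))   # expected near 0.05 * 0.2 = 0.010
# ... even though the *true* offense rates are identical by construction.
```

So the classifier's "accuracy" is measured against the biased recorded labels, not against reality, which is exactly the accuracy-vs-fairness conflict described above.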

~~~
febin
All I want to predict is time periods/locations which are vulnerable. Nothing
more than that.

~~~
jjoonathan
Racism is morally wrong but not mathematically wrong. P(criminal|black) >
P(criminal), but if you observe that someone has black skin and treat them
poorly because of it, you've done a bad thing. It doesn't matter that you were
just following Bayesian reasoning because you're still hurting someone on the
basis of something they can't control.

Lady Justice doesn't wear a blindfold as a fashion accessory. Discarding
information is a key factor in nearly every established system of justice /
morality. Refusing to do so (i.e. "just" running a ML algorithm) places you
directly at odds with society's hard-earned best practices.

~~~
platz
> Lady Justice doesn't wear a blindfold as a fashion accessory

I never noticed that before. Thanks for pointing this out!

------
netrus
Have you checked kaggle for relevant datasets?

------
0xdeadbeefbabe
Why don't you base it on Law data?

------
0xdeadbeefbabe
The Poisson distribution!
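Presumably a reference to the classic baseline for count data: incidents per area per time window are often modeled as Poisson (the starting point for hotspot models and self-exciting point processes). A minimal sketch with a made-up rate:

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Probability of observing exactly k incidents when the expected count is lam."""
    return lam ** k * exp(-lam) / factorial(k)

lam = 2.5   # hypothetical: 2.5 incidents per district per week, estimated from history
print(f"P(exactly 0) = {poisson_pmf(0, lam):.3f}")   # equals exp(-2.5)
print(f"P(5 or more) = {1 - sum(poisson_pmf(k, lam) for k in range(5)):.3f}")
```

The same caveat from the rest of this thread applies: lam is estimated from recorded incidents, so it inherits whatever reporting bias the data has.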

