Delayed Impact of Fair Machine Learning (berkeley.edu)
63 points by jonbaer 8 months ago | 32 comments



"Machine learning systems trained to minimize prediction error may often exhibit discriminatory behavior based on sensitive characteristics such as race and gender. One reason could be due to historical bias in the data."

The thing about this discussion is that it aims to balance social-welfare goals against profit-maximizing goals, but without any criteria for attaining basic fairness on a general level.

To wit, suppose a machine learning algorithm is looking at twenty or fifty pieces of data about a given individual, all of which are entirely irrelevant to the individual's chance of repaying a loan. But by random chance, one of those pieces of data, say handedness, happens to be correlated with a group's repayment history. So some individuals with a "good" handedness are given loans more frequently and some individuals with a "bad" handedness are given loans less frequently. This situation doesn't matter to the company, since the data is irrelevant, they only give out so many loans anyway, and shitting on, say, left-handed people gives them no grief. Moreover, if this trend is noted by all the companies, soon it will be made "true".
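
Here's a toy sketch of what I mean (entirely made-up data, Python/scikit-learn, and "handedness" is just a stand-in for any irrelevant attribute):

    # Toy illustration: an irrelevant feature ("handedness") picks up a nonzero
    # weight purely because it happens to correlate with repayment in this sample.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    n = 200
    income = rng.normal(50, 10, n)          # genuinely predictive feature
    handedness = rng.integers(0, 2, n)      # irrelevant by construction
    # Repayment depends only on income, never on handedness.
    repaid = (income + rng.normal(0, 10, n) > 50).astype(int)

    X = np.column_stack([income, handedness])
    model = LogisticRegression(max_iter=1000).fit(X, repaid)
    print("income coefficient:    ", model.coef_[0][0])
    print("handedness coefficient:", model.coef_[0][1])   # rarely exactly zero

In a small enough sample the handedness coefficient is basically never zero, and nothing in the training objective cares.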

Which is to say, companies making life-defining decisions like mortgages or parole grants should simply be prohibited from using a lasagna of random data to make their decisions, and instead should be required to use specific rules with specific reasons behind them.

And cry me a river about missed chances for optimization. This is about the structure of society, and optimization doesn't benefit society here, imo.


If you throw out the "random data", some groups could be over-loaned, which leads to a decrease in those groups' average credit scores (caused by an increase in defaults), and that can decrease the amount of loans those groups will get in the future, even when compared to the unfair profit-maximizing policy.
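
A toy version of that dynamic (my numbers, not the article's; repayment nudges a score up a little, default knocks it down a lot):

    # Toy model, not the article's: a score maps to a repayment probability;
    # repayment moves a borrower's score up a little, default moves it down a lot.
    import numpy as np

    rng = np.random.default_rng(1)
    scores = rng.normal(600, 50, 10_000)        # one group's score distribution

    def lend_round(scores, threshold):
        approved = scores >= threshold
        p_repay = np.clip((scores - 300) / 400, 0, 1)   # crude score -> repayment map
        repaid = rng.random(scores.size) < p_repay
        delta = np.where(repaid, +15, -60)
        out = scores.copy()
        out[approved] += delta[approved]                # only borrowers' scores move
        return out

    strict = lend_round(scores, threshold=620)    # profit-maximizing-ish cutoff
    lenient = lend_round(scores, threshold=520)   # over-lending "fair" cutoff
    print("group mean after strict lending: ", round(strict.mean(), 1))
    print("group mean after lenient lending:", round(lenient.mean(), 1))

With these invented parameters the lenient cutoff pulls the group's average score down while the strict one nudges it up, which is the kind of delayed impact the article is describing.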

That's what the article is about.


This isn't about overfitting. Race is almost exactly the opposite problem: the signal in the data is so strong that you can't help but pick it up, even after you censor everything you can think of. If you try to predict any kind of consumer behavior from a racially heterogeneous dataset you will end up finding something that correlates with race, because practically everything correlates with race.


> practically everything correlates with race.

Hold on, excuse me?


It's easy for ML to pick up on proxies that are related to race even if you exclude what the race is explicitly - zip code and names are two common ones, but this can appear in a lot of different ways.

It's particularly insidious, too, since a model that is accidentally making credit decisions based on race helps perpetuate the inequity in the future data.
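
A quick made-up example (Python, numbers invented) of how the race column can be dropped and the split still shows up through zip code:

    # Made-up illustration: the race column is excluded from training, but zip
    # code is a strong proxy for it, so approvals still split along racial lines.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(2)
    n = 5_000
    race = rng.integers(0, 2, n)                                # never given to the model
    zip_code = np.where(rng.random(n) < 0.8, race, 1 - race)    # correlated with race
    income = rng.normal(50, 10, n)
    # Historical approvals were partly driven by race, i.e. biased labels.
    approved = ((income > 50) | (race == 1)) & (rng.random(n) < 0.9)

    X = np.column_stack([income, zip_code])     # note: no race column anywhere
    model = LogisticRegression(max_iter=1000).fit(X, approved)
    pred = model.predict(X)
    print("predicted approval rate, race 0:", pred[race == 0].mean())
    print("predicted approval rate, race 1:", pred[race == 1].mean())

The model never sees race, but the gap between the two printed rates is large anyway, and those decisions become next year's training data.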


lol. what this person is basically saying is that racism affects almost every part of a person's life. I know it seems unbelievable but "practically everything correlates with race" is the most nerd-intelligible way I've seen of explaining/expressing that racism is pervasive.


That's not what I'm saying.


You've been using this account primarily for political and ideological battle. That's destructive of what HN is for—regardless of your ideology—and we ban accounts that do it, as explained here: https://news.ycombinator.com/newsguidelines.html.

If you want to keep commenting here, please (re-)read the guidelines and use this site as intended from now on. The intention is intellectual curiosity, and that is the first casualty of ideological war.


What exactly is wrong with him saying this: "Race is almost exactly the opposite problem: the signal in the data is so strong that you can't help but pick it up, even after you censor everything you can think of." As far as I can see it is a reasonable thing to say. Why not just come out and say: "I am ideologically opposed to you and here is a threat of censorship." What is more intellectually curious than the idea that - as much as we want to - we can't remove a variable in many analyses because of its statistical significance?


Apologies, and thank you for the correction. My comments betray a lack of intellectual curiosity and I will strive to harmonize them going forward.


Wow :(


Fine, you're not actually saying it (though I think you should be), but that is the direct implication of what you're saying.


He is saying that race is correlated with almost everything, not racism. You seem to be claiming all statistical differences between racial groups are a result of racism. But black people, for instance, aren't more likely to listen to rap music because of racism.


>He is saying that race is correlated with almost everything, not racism.

And I didn't say that he said racism is correlated with anything.


I can’t really interpret your comment any different way.


Is this actually true, that everything correlates with race? I know many things do, but does it even exceed 50%?


Hey mods....if you happen to monitor those of us with warnings, I wonder if you can run a query to determine my top 3 downvoters over time and see if a pattern emerges. This is starting to get rather annoying.


If you use data such as repayment history, income, assets, and other metrics highly related to whether or not somebody is going to pay back a loan, and the resulting model does not output equal representation among arbitrary population groups, I fail to see a problem. If the data being used to train the model has absolutely no knowledge of arbitrary group identifiers, such as "gender" or "race," the resulting output is quite simply not biased.

However, if something like "race" or "gender" is actually being used as a feature input, then the output from most ML strategies is highly likely to pick up on correlations between racial and gender groups and certain outputs. That is undoubtedly going to lead to negative discriminatory outputs.

So while I see absolutely no problem with the first scenario I mentioned, and I clearly see a problem with the second scenario...I'm inclined to believe that the politics of many people would lead them to have a problem with both.


The thing that makes this trickier than it looks is that these features aren't necessarily encoded in a straightforward way. For example, race could be inferred with some loss of accuracy from zip code. Add more variables and it might be inferred with more accuracy.

Machine learning doesn't necessarily do what you want it to do; it probably will "cheat" and use an easier proxy variable, unless you watch out for it.
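
One sanity check I've found useful (a sketch with synthetic data, not any particular library's built-in): try to predict the censored attribute from the features you kept. If you can, your model can too.

    # Leakage check on synthetic data: can the dropped attribute (race) be
    # recovered from the features that were kept? If accuracy climbs as more
    # columns are added, the censoring isn't actually removing the signal.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(3)
    n = 5_000
    race = rng.integers(0, 2, n)
    zip_code = np.where(rng.random(n) < 0.7, race, 1 - race)    # noisy proxy
    name_score = race + rng.normal(0, 0.8, n)                   # another noisy proxy

    for cols, label in [((zip_code,), "zip only"),
                        ((zip_code, name_score), "zip + name")]:
        X = np.column_stack(cols)
        acc = cross_val_score(LogisticRegression(), X, race, cv=5).mean()
        print(f"{label}: about {acc:.2f} accuracy at recovering race")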

There is the urban legend about the machine learning program that learned to "recognize" tanks based on time of day in the training set. It probably never happened [1] but it gets the point across.

[1] https://www.gwern.net/Tanks


Ya, I agree with you that it's definitely tricky business. Including features like zip code in something like a loan repayment predictor is a good example of a sloppy proxy for the underlying, relevant features. Certain data also becomes hard to resist using: zip code of residence might have a 90% overlap with income range and be extremely trivial to acquire, whereas getting detailed and accurate income information from an applicant might be much harder. Tricky stuff.


I am not saying that this is the case in your examples, but sometimes these "predictors" create an unfair feedback loop. E.g., cops patrol majority-black neighborhoods more because of higher misdemeanor rates and because misdemeanors are a strong predictor of crime. This higher patrol rate leads to more arrests (for crimes, such as having weed on you, that might not differ significantly among races and neighborhoods). The arrest data is then fed back into the system, leading to even more patrols, essentially creating a forced equilibrium. Whereas if we didn't use the model, the system might have naturally evolved to a different state. This is one possibility of why these models might be effective in the short term but not "fair" (i.e. not effective in the long term). But in any case, I think it is obvious that the benefits of a private company (such as the insurance company) are not necessarily aligned with the benefits of society.
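
A crude simulation of the loop I'm describing (invented numbers; it only shows the lock-in, not any amplification):

    # Crude feedback-loop sketch: patrols follow last year's arrests, arrests
    # follow patrols, so an initial disparity in patrols never washes out even
    # though the true offense rates are identical in both neighborhoods.
    offense_rate = [0.10, 0.10]      # same underlying rate in both neighborhoods
    patrols = [0.7, 0.3]             # but the patrol allocation starts out unequal
    for year in range(10):
        arrests = [o * p for o, p in zip(offense_rate, patrols)]
        total = sum(arrests)
        patrols = [a / total for a in arrests]   # next year's patrols track arrests
    print("patrol share after 10 years:", [round(p, 2) for p in patrols])

The data keeps "confirming" the original allocation, which is the forced equilibrium I mean: the model looks accurate every year, but only because it decides where we look.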


Misdemeanors are not a predictor of crime, they are crime. Further, many types of misdemeanors are quality of life issues that have outsize impacts on the people who live in those neighborhoods. Go to a town hall meeting in a bad neighborhood and see what people complain about. It's often not so much burglaries as the guys who are chronically drinking and breaking bottles on the corner at 2 am. It's not terribly enlightened to criticize the police for going after QOL violations you would never tolerate in your own neighborhood.

As for whether police attention creates feedback loops, we can check by looking at crime victimization surveys, which generally show the same patterns of racial disparities as arrest records. For that matter, the black / white murder gap has been stable at about 6-8:1 for as long as we've been keeping track, and you can't change that by selectively hassling certain people for weed.


I fail to see how crime victimization surveys disprove the possibility of feedback loops. You can have higher murder rates among the black population, yet have overall similar misdemeanor rates. You can even have a higher misdemeanor rate among the black population but a proportionally even higher arrest rate due to these models and their feedback loops. Or maybe I am missing something in your argument? Care to elaborate?


Well, the victimization surveys tell us that for every category of crime that matters, there are large disparities in base rates that (a) predate policing decisions and (b) recommend focusing on black neighborhoods, because that's where the crime is. The idea that higher arrest rates are a self-reinforcing statistical artifact has no support from the data. The notion that "crime is where you look for it" is specifically undercut by victimization surveys as well as by murder, a crime that is largely impossible to conceal from the state indefinitely. If the baseline disparities are already so high to begin with, there's hardly any variance left over to be explained by self-reinforcing patrol strategies.

Put another way, if you took the first-order strategy you'd come up with by looking at the body count, and then layered the most moustache-twirlingly racist jaywalking policy that you could think of on top of it, the two strategies wouldn't look that much different.

That's all before questioning why it is a bad thing for neighborhoods to be policed for misdemeanors. I lived in West Baltimore for years and I can assure you that my neighborhood was, if anything, chronically underpoliced.


So now you have moved the goalpost to:

>> There might be unfair feedback loops, but they would be effectively negligible.

First of all, it is curious to me that you mention the 8x disparity in murder rates but don't mention that the violent crime victimization survey disparity is less than 20%. Secondly, you simply claim that policing has nothing to do with people being arrested at different rates, but offer no evidence.

Here are the actual numbers, if you really care about facts. The victimization rates are 20.5% for whites vs 24.1% for blacks: less than an 18% difference. On the other hand, incarceration rates are 0.7% for whites vs 4.5% for blacks: more than a 540% difference. This means blacks are being arrested at about 32 times the rate at which they commit violent crimes. So your claim that:

>> "we can check by looking at crime victimization surveys, which generally show the same patterns of racial disparities as arrest records"

is not only ignorant and false, it is so far from the truth that it is laughable.


I'd love to respond substantively to this post, and in particular to your figures, but hn has threatened to ban me for discussing this topic. I suppose each of us can only make of that what we will. Best wishes!


I would have personally liked to hear your response but I guess it is what it is. Best wishes to you as well!


Fair points.


> If you use data such as repayment history, income, assets, and other metrics highly related to whether or not somebody is going to pay back a loan, and the resulting model does not output equal representation among arbitrary population groups, I fail to see a problem. If the data being used to train the model has absolutely no knowledge of arbitrary group identifiers, such as "gender" or "race," the resulting output is quite simply not biased.

Discrimination isn't measured solely by inputs but also by outputs.

https://en.wikipedia.org/wiki/Disparate_impact


[flagged]


We've banned this account for repeatedly violating the HN guidelines and ignoring our requests to stop.

https://news.ycombinator.com/newsguidelines.html


Your incendiary remarks are unrelated to the article and are not welcome on HN.


[flagged]


The article is actually about benefiting credit-worthy people who have a bad credit score by giving them loans they can pay back, and it shows that attempting to enforce certain fairness criteria can actually work against that.



