A Way to Detect Bias (paulgraham.com)
282 points by bakztfuture on Oct 31, 2015 | 211 comments

People, most of whom clearly are not that good at math, are being really harsh on Paul Graham.

Graham is mostly right, but slightly incorrect. In particular, suppose group A has the distribution f(x) and B has the distribution g(x).

If f(x) and g(x) are shaped significantly differently past the cutoff, then mean(H(x-C)f(x)) and mean(H(x-C)g(x)) might not agree even though there is no bias by construction. (Here H(x) is a step function and C is the cutoff.)

However, there is an easy fix: compute the minimum of the support of the distribution rather than the mean. min(H(x-C)f(x)) = min(H(x-C)g(x)) = C.

In practice, measure the weakest male and weakest female to be accepted in your sample set, or some similar approximation.
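A minimal simulation sketch of the mean-versus-minimum point, under assumptions that are not from the thread: f is taken to be a standard normal, g a normal with twice the spread, the cutoff is C = 1, and the biased case adds K = 0.5 to group B's bar. The helper names are made up for illustration.

```python
import random

def accepted(sample, cutoff):
    """Keep only candidates whose quality clears the cutoff."""
    return [x for x in sample if x >= cutoff]

def mean(xs):
    return sum(xs) / len(xs)

rng = random.Random(0)   # fixed seed so the run is repeatable
C = 1.0                  # shared cutoff (assumed value)
K = 0.5                  # extra bar for group B in the biased case (assumed)
N = 100_000              # applicants per group (assumed)

group_a = [rng.gauss(0.0, 1.0) for _ in range(N)]   # f: assumed N(0, 1)
group_b = [rng.gauss(0.0, 2.0) for _ in range(N)]   # g: assumed normal, sd 2

# Unbiased selection: the same cutoff for both groups.
fair_a = accepted(group_a, C)
fair_b = accepted(group_b, C)
mean_gap_fair = mean(fair_b) - mean(fair_a)   # nonzero although selection is fair
min_gap_fair = min(fair_b) - min(fair_a)      # close to zero

# Biased selection: group B must clear C + K.
biased_b = accepted(group_b, C + K)
min_gap_biased = min(biased_b) - min(fair_a)  # close to K

print(f"fair:   mean gap {mean_gap_fair:.3f}, min gap {min_gap_fair:.5f}")
print(f"biased: min gap {min_gap_biased:.3f}")
```

With samples this large the minimum of each accepted group sits almost exactly on its cutoff, so the gap between minima tracks K while the gap between means reflects the shapes of f and g.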

I'm pretty sure this is a valid frequentist hypothesis test. I've got half a proof worked out on paper already. It depends very weakly (and non-parametrically) on f(x) and g(x), but it works in basically the exact way Graham wants it to. Every counterexample I can think of is really pathological. My next blog post will probably be a proof of this.

All this negativity is really an overreaction. I know it's fun to totally debunk someone on details, but these are mostly fixable details.

Wouldn't the weakest male and weakest female founders be underperforming more due to individual factors than bias?

Maybe a mean of the lower quartile would remove some of the noise.

Suppose the cutoff for men is C but for women is C+K. Then the weakest man can be expected to have quality C+epsilon, while the weakest woman will have quality C+K+epsilon.

Here epsilon is how close a typical candidate will be to the cutoff, and is mainly a function of the sample size. I don't know the behavior of epsilon off the top of my head.

I think MattHeard understood this and his point was that the weakest man will realistically have quality C + epsilon + random, and the weakest woman will have quality C + K + epsilon + random. The random term arises because no evaluation process is going to perfectly tell you how people are going to end up performing. But yeah, this seems fixable also, by averaging some number of the lowest performing members of each group, as MattHeard suggested.
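A rough sketch of that averaging idea, under assumed distributions: accepted candidates sit an exponential distance above their cutoff, and the observed outcome adds zero-mean normal noise. The cutoff C, the offset K, the group size and the noise level are all made-up values.

```python
import random
import statistics

def outcomes(rng, n, cutoff, noise_sd):
    """Observed outcomes for n accepted candidates: true quality just above
    the cutoff (assumed exponential excess) plus zero-mean noise."""
    return [cutoff + rng.expovariate(1.0) + rng.gauss(0.0, noise_sd)
            for _ in range(n)]

def lower_quartile_mean(xs):
    """Mean of the weakest 25% of observed outcomes."""
    k = max(1, len(xs) // 4)
    return statistics.fmean(sorted(xs)[:k])

rng = random.Random(1)
C, K = 1.0, 0.5                      # cutoff and extra bar for group B (assumed)
N, NOISE_SD, TRIALS = 200, 1.0, 300  # assumed sizes

min_gaps, quartile_gaps = [], []
for _ in range(TRIALS):
    a = outcomes(rng, N, C, NOISE_SD)
    b = outcomes(rng, N, C + K, NOISE_SD)
    min_gaps.append(min(b) - min(a))
    quartile_gaps.append(lower_quartile_mean(b) - lower_quartile_mean(a))

sd_min = statistics.stdev(min_gaps)
sd_quartile = statistics.stdev(quartile_gaps)
avg_quartile_gap = statistics.fmean(quartile_gaps)
print(f"gap between minima:          sd {sd_min:.3f}")
print(f"gap between quartile means:  sd {sd_quartile:.3f}, average {avg_quartile_gap:.3f}")
```

In this setup both statistics are centred on K, but the lower-quartile version varies much less from trial to trial. Equal group sizes are assumed here, so the sketch says nothing about the unequal-sample-size objection raised elsewhere in the thread.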

I'm not 100% sure how to deal with noise + extremal statistics. But I've got a 12 hour plane ride ahead of me tomorrow, so I can probably work out a fix.

I suspect the math is easier if you reframe the hypothesis to be about the derivatives: adding one more person from either group should have the same marginal effect on outcomes. As you get more observations you get more in a neighborhood of the cutoff which gives a consistent estimator of the derivative.

> In practice, measure the weakest male and weakest female to be accepted in your sample set, or some similar approximation.

In practice, where do you get these definitive unbiased measurements of the expected return of investing in an individual (or graduation probability or some other score)?

This test can be used after the fact, so you're not measuring expected return but actual return.

It's pretty easy to measure actual return (or actual graduation rate, or actual GPA, etc.).

What test? We're talking about computing the minimum of what before-the-fact EV should be (based on what set of information?) here, not computing what the mean EV should be (for which you can add up the outcomes and divide by N).

> What test?

The test for bias proposed by pg and modified by yummyfajitas.

> before-the-fact EV

Why are you bringing in before-the-fact EV at all? That's not a component.

The test is based on comparing the minimums, yes. So, for example, what is GPA of the worst female and male students. What part of that requires expectations?

Before-the-fact EV is what the entire question of bias is about. In the case of investments, your minimum outcome is that you shut down the company. Comparing this for male/female sets is useless. Also, looking at other aspects of the distribution (e.g. how many $10M exits, $100M exits) isn't useful because for investing, you want to maximize mean return, and it's not like looking at 20-80%tile outcomes numerically tells you much about that -- are these people saving dead companies or are they derisking $10B companies?

With students there is more of a connection between before-the-fact EV and outcomes, and much more information exposed (grades and course registrations on a semester-by-semester basis). Without that information (if you just looked at who graduated, who didn't) you can't really say as much. (You could run a Netflix-like competition to see who has the best graduation prediction engine and use its estimates as your definitive answer to what an unbiased EV predictor would say about college applicants.) With GPA information, let's say you look at the 10th percentile GPA's in each set and below, for boys it's 0.1-1.1 and for girls it's 0.1-1.5. What does that tell you about before-the-fact EV cutoffs? It doesn't tell you much, because one set of students, when they fail, could be more likely to fail harder than the other. It takes a lot more work than just that.

Isn't this less meaningful than the mean, because the minimum doesn't really tell you about true capability? I can also easily construct an institution that satisfies your parameters: the minimums of the two groups match, as in they have a token member at C for every group, but the characteristic performance of all members is different.

This in my eyes amounts to a simplification that is neat mathematically, but removes most of the useful information. I admit though that if the minimums did not match very well, that is an obvious sign of bias.

Edit: I suppose what I'm essentially saying is that even if the minimums do match, we cannot rule out bias.

If you think you can construct examples that break my test, go ahead and do it. I'm curious to see what you can come up with.

However, your claim that you can have "a token member at C for every group" is merely the claim that adversarial sampling can destroy any statistical procedure. So what? This is boring mathematically. Nor is it relevant to the problem at hand unless you want to claim that First Round Capital is actively conspiring to both a) be biased and earn less money and b) waste more money hiding that bias by funding tokens.

In a grandchild comment, you note that how close the min of each group is to the cutoff depends on sample size. This is true, but I think it's deal-breakingly true.

Suppose we have two input groups, A and B. Members of each group are distributed as Exp(1), i.e. the same underlying distribution. Our selection procedure is totally fair as well: we take everyone without question.

However, there are 9x as many people in group A as group B. So the min of group A accepted (= the min of all group A) will be distributed as Exp(9 * |B|) and the min of group B accepted will be distributed as Exp(|B|).

So in expectation the min from group A will be smaller than the min from B, and indeed this happens 90% of the time. (Aren't exponentials nice?)

Of course in this case we can note that this effect is from differences in sample size and exactly correct for it. But normally we will not know how to do this correction properly because we don't know the true underlying distribution or the acceptance criteria.
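A quick check of the 90% figure by simulation. The group size |B| = 20 and the trial count are arbitrary choices, not values from the comment.

```python
import random

rng = random.Random(2)
SIZE_B = 20              # |B| (assumed)
SIZE_A = 9 * SIZE_B      # group A is 9x larger, as in the comment
TRIALS = 4000

a_is_lower = 0
for _ in range(TRIALS):
    # Everyone is accepted, so the accepted minimum is the group minimum.
    min_a = min(rng.expovariate(1.0) for _ in range(SIZE_A))
    min_b = min(rng.expovariate(1.0) for _ in range(SIZE_B))
    if min_a < min_b:
        a_is_lower += 1

simulated = a_is_lower / TRIALS
# For independent exponentials with rates 9|B| and |B|, the first one is
# the smaller with probability 9|B| / (9|B| + |B|).
predicted = SIZE_A / (SIZE_A + SIZE_B)
print(f"simulated {simulated:.3f}, predicted {predicted:.3f}")
```

The simulated fraction should land near the predicted 0.9 even though both groups share one distribution and nobody is rejected.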

First of all, making a group 9x larger doesn't change the shape of the distribution. The distribution would look like 9 exp(|x|), not exp(9|x|).

Secondly, I calculate a p-value for this test here:


It's proportional to the smaller of the sample sizes. I'll have a more detailed writeup soon. Also, this test is non-parametric, so you don't actually need to know either C or the distributions f and g - all you need is a certain level of uniform regularity in f and g.

A note: The original post is correct in what it says. It is saying if one draws N independent samples from an exp(1) distribution then the minimum is distributed as exp(N). This is true. Because of this, for any finite sample there is a systematic 'bias' by your metric if the sample size differs between the two groups. However, you are correct in that this difference vanishes as N becomes large and further that one can make controlled statements about the expected size of this difference, given the distribution.

Oops - I did misread what he wrote. He was discussing the distribution of the min of the sample, not the distribution itself.

Also, there may be a bias in the true p-value, but there is none in the p-value bound based on h(d) and min(N1, N2).

I like your proposal in theory but I doubt it works in practice. I'll try to explain: in practice what First Round has done is measure post-investment performance. I'm pretty sure that if the sample is big enough you will have an "all male" startup and a "with female" startup that failed, so the min is zero in both cases. This doesn't mean there's no bias.

The point is that when you decide which company to finance (the source of bias) you make an estimate of future potential. But your test (and PG's) measures ex-post results. Since there's a lot of uncertainty between the ex-ante and the ex-post measures, your test doesn't work.

Let me put it another way. The measure you are able to perform is not H(x-C)f(x) but a * H(x-C)f(x) + (1-a) * random, where random is a random number and a is the weight of f(x) on the final outcome. You are right if a is 1, you and pg are wrong if a is 0 (but that will be a problem for the VCs), and for everything in between you have to make assumptions about the distribution of random.

I discuss in a different comment the effect of additive noise. I admit I haven't worked out the details of the min statistic and additive noise yet. But I'm actually fairly bullish on it - this problem is mostly just simple statistics.

I also suspect that First Round's data is useless. PG is being too credulous about it - I suspect he is just being political and making his post more PC, since he doesn't seem that statistically naive. (Taking a lesson from Sam Altman most likely.)

There's also the fact that the effect size is huge — 63%. Most real world population differences are not weird enough to generate that large an effect without bias. The bigger questions are how much would the effect size change without removing outliers (Uber as pg mentions but presumably others too) and how significant is the result.

This approach relies on the unspoken "positivity" assumption that the pool of applicants is large enough that there exist individuals in the biased-against category that were not selected, and moreover that these denied applicants are "exchangeable" with the successful ones.

For example, assume that we find that founders who won a MacArthur "genius" grant outperformed the others. Further assume that there are only a limited number of such founders, and that all available were selected. Certainly one wouldn't want to conclude in this case that there is a bias against MacArthur fellows.

That seems obvious, but it gets trickier once you have lots of factors involved. What if the group you find to outperform consists of female founders with a PhD, substantial industry experience, and red hair[1]? Can you conclude that the process is biased against females? Males with PhDs? Anyone with red hair? Generally no, unless you are willing to assume that all of your factors are causal.

Worse, you can't even assume that it's biased against people with all of the measured factors unless you also assume that all unmeasured factors are randomly distributed. If it turns out that "a positive mental attitude" is an unmeasured but defining characteristic of success, if the interviewers rejected applicants who had less of this but you failed to include this in your category, you would be wrong to conclude that there is an unfair bias.

[1] http://www.nature.com/nature/journal/v453/n7194/full/453562a...

Graham's statement about the possible bias of First Round is unfounded. This was not any sort of a real study like Graham thinks and First Round clearly notes that. When the returns are as skewed as they are in venture capital (http://www.sethlevine.com/archives/2014/08/venture-outcomes-...), a small sample size and a simple analysis won't do. First Round even excluded their investment in Uber because it would skew the results too much.

Even if it were a statistically appropriate sample size and female founders still outperformed male founders, it still wouldn't exclude other likely explanations besides bias. What if the culture in venture capital is more willing to assist and mentor female founders, leading to greater success? In academia, women are now selected over equally qualified men 2:1 for tenure positions [1]. It is not unreasonable to wonder if a similar hand up is being given to female founders in terms of training, social network inclusion, and mentorship, leading to greater success.

Alternatively, there may be a selection process in society that means only the most motivated women become entrepreneurs and so beat the average male entrepreneur.

[1]. http://www.pnas.org/content/early/2015/04/08/1418878112.abst...

> female founders still outperformed male founders

That's not what the study says, it says groups that have at least 1 female founder, not female founders.


Assumption: There is no fundamental difference between a female and a male founder for achieving start-up success (average rates and variance/distribution of rates is the same)

Observation: VC funded start-ups with female founders are (on average) 60% more successful than start-ups with male founders

Hypothesis: VC funding is biased against female founders. The ones that do receive funding are better vetted, less risky, and have higher individual qualities.

Experiment: Start funding more female founders.

If we then observe: The numbers start to even out, then there is no fundamental difference. VC funding bias may have been the cause of the difference in success rate.

If we then observe: The numbers stay the same, then there is a fundamental difference and our assumption is flawed.

Rational choice: Start funding more female founders. This either removes a bias (levels the playing field), or increases your profit (funding more potentially successful founders).

PG should of course not use a hypothesis to prove an assumption (experiment/probing is needed for verification). But also: The possibility of an uneven distribution should not invalidate such an experiment (or PG's line of reasoning); it will merely bring it to light (the numbers would stay the same, thus we have shown that the difference is fundamental and not caused by a sampling bias).

Interesting thoughts. However, this argument is biased because it assumes that the performance of the applicants WHO WERE ACCEPTED is not biased by the selection process itself, and that the performance characteristics of the selected sample are representative of the performance characteristics of the total, which could be a weak assumption.

An attempt at translating to mathematics (feel free to correct me!):

X = event that person belongs to group x

Y = event that person belongs to group y

S = event that person is selected

W = event that person will perform like a 'winner'

for simplicity P(X) + P(Y) = 1

Naturally, 'unbiased' in this case is simply P(S|X) = P(S) and P(S|Y) = P(S), i.e. that the selection process is independent of group membership (X or Y)

PG says we can measure the performance of these selected applicant winners for each class, i.e. P(X|S,W).

I believe PG assumes that:

P(X|W) / P(Y|W) should equal P(X|S,W)/P(Y|S,W). We can see that these are different distributions, since the second is already conditioned on the selection process.

Simplified, PG assumes that P(X|S,W) = P(X|W) i.e. that conditioning on the selection process does not bias the winning results.

It's left as an exercise for the reader to determine the 'pathological' cases where this selection variable's distribution makes PG's assumption correct or incorrect.

However, this is simply theoretical - the actual distribution may or may not be 'pathological' and the assumptions made by PG could very well be good.
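A small exact-arithmetic sketch of how conditioning on S can move these quantities. Every probability below is hypothetical, chosen only so the fractions are easy to follow: equal group sizes, equal chance of being a 'winner', and a selector that passes winners from group X twice as often as winners from group Y.

```python
from fractions import Fraction as F

# All numbers are hypothetical and exist only to make the arithmetic visible.
p_group = {"X": F(1, 2), "Y": F(1, 2)}           # P(X) + P(Y) = 1
p_win = {"X": F(1, 5), "Y": F(1, 5)}             # P(W | group): identical ability
p_sel_given_win = {"X": F(4, 5), "Y": F(2, 5)}   # P(S | group, W): biased against Y

def p_x_given(joint):
    """P(X | event), where joint[g] = P(event and group g)."""
    return joint["X"] / (joint["X"] + joint["Y"])

joint_w = {g: p_group[g] * p_win[g] for g in p_group}             # P(W, g)
joint_sw = {g: joint_w[g] * p_sel_given_win[g] for g in p_group}  # P(S, W, g)

p_x_w = p_x_given(joint_w)     # P(X | W)
p_x_sw = p_x_given(joint_sw)   # P(X | S, W)
print(p_x_w, p_x_sw)
```

Under these made-up numbers P(X|W) is 1/2 while P(X|S,W) is 2/3, so the two ratios in the comment differ exactly when selection among winners depends on the group.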

There are a few trivial ways to be biased that would not be detected in this way.

The first is if you have multiple people accepting applicants and some of them are biased to the point of not accepting applicants of particular types. That means all the applicants that are discriminated against that did make it were simply selected by people who weren't biased, and therefore won't outperform anyone.

The second is if the actual selection process is somewhat random instead of being based on pure performance. The ones who make it through that process won't necessarily perform any better, they'll just be luckier.

The third is if the application process accepts everyone equally, and then randomly prunes out people according to a bias. This is similar to the second except the acceptance criteria is still performance-based, but because it randomly throws out people (instead of throwing out low-performers), the remaining people are still going to perform the same as those who were not pruned.
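The third case can be sketched with a toy simulation. The assumptions are mine, not the commenter's: both groups draw quality from the same standard normal, both face the same cutoff C = 1, and half of the qualifying group-B applicants are then discarded by coin flip.

```python
import random
import statistics

rng = random.Random(3)
C = 1.0         # performance cutoff applied equally to both groups (assumed)
KEEP_B = 0.5    # share of qualifying group-B applicants kept at random (assumed)
N = 50_000      # applicants per group (assumed)

def qualifying(n):
    """Qualities of applicants who clear the cutoff; both groups share N(0, 1)."""
    return [q for q in (rng.gauss(0.0, 1.0) for _ in range(n)) if q >= C]

accepted_a = qualifying(N)
# Bias step: drop qualifying group-B applicants by coin flip, ignoring quality.
accepted_b = [q for q in qualifying(N) if rng.random() < KEEP_B]

ratio = len(accepted_b) / len(accepted_a)
mean_gap = statistics.fmean(accepted_b) - statistics.fmean(accepted_a)
print(f"B/A acceptance ratio {ratio:.2f}, mean performance gap {mean_gap:.3f}")
```

Group B ends up with about half as many acceptances, yet the average performance of the two accepted groups is nearly identical, so a comparison of outcomes alone would not flag the pruning.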

The first footnote on the page also points out that if the selection criteria are different for the different groups then this process won't work, which seems like a pretty important caveat that I wish was in the article proper. One really common form of bias (especially in tech) is being biased against women, and that's also a situation where it's very common to (unconsciously or otherwise) use appearance in judging female applicants but ignore appearance for male applicants.

The second example adds statistical noise, but does not invalidate Graham's procedure. In the absence of noise, the distribution of accepted people will be H(x-C) f(x) for the main group, and H(x-C-K)g(x) for the other group (where H(x) is a step function, f(x) and g(x) are the distributions of quality of the two groups).

If noise is present, then you get convolve(H(x-C)f(x), k(x)) instead, where k(x) is the pdf of the noise distribution.

You'll need more samples to measure this, but it's completely measurable via Graham's method.
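A sampling sketch of the noisy case. It relies on the fact that the density of a sum of independent variables is the convolution of their densities, so adding independent noise to each accepted applicant's quality draws from convolve(H(x-C)f(x), k(x)). The normal choices for f, g and k, and the values of C and K, are assumptions for illustration.

```python
import random
import statistics

rng = random.Random(4)
C, K = 1.0, 0.5     # cutoff and extra bar for the other group (assumed)
NOISE_SD = 1.0      # spread of the noise pdf k(x), assumed normal
N = 100_000         # applicants per group (assumed)

def observed(cutoff):
    """Outcomes of accepted applicants: quality from N(0, 1), kept if it
    clears the cutoff, then blurred by independent zero-mean noise."""
    return [q + rng.gauss(0.0, NOISE_SD)
            for q in (rng.gauss(0.0, 1.0) for _ in range(N)) if q >= cutoff]

main_group = observed(C)
other_group = observed(C + K)

mean_gap = statistics.fmean(other_group) - statistics.fmean(main_group)
lowest_main = min(main_group)
print(f"mean outcome gap {mean_gap:.3f}, lowest main-group outcome {lowest_main:.3f}")
```

Because the noise has zero mean, the gap between group means survives the blurring, while the lowest observed outcome falls well below the cutoff. That is the sense in which means remain measurable with enough samples but a raw minimum does not.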

A fourth is to deliberately select low-performing members of the discriminated-against group in order to game Graham's metric and justify the bias.

There's a flaw right in the assumptions here: "(c) the groups of applicants you're looking at have roughly equal distribution of ability."

Oh. See, the problem is that if an application process is biased, and applicants perceive that bias, then those against whom it is biased will be dissuaded from applying unless they far exceed the required standards. Whereas those towards whom the process is biased will be more likely to apply, even if they are marginally qualified, because they expect to benefit from the bias.

So that means that if you do have a biased process, there's a good chance it doesn't meet criterion c - applicants in the different groups between which its bias discriminates are not equal in ability. So your test might verify a lack of bias, when there is in fact bias present.

You can't verify a lack of bias just by looking at the outcomes of successful applicants - you need to look at the outcomes for unsuccessful applicants too, to determine whether your applicant pools really do meet criterion c. Or you could look at the outcomes for nonapplicants, but that's clearly a much harder problem.

> A couple months ago, one VC firm (almost certainly unintentionally) published a study showing bias of this type. First Round Capital found that among its portfolio companies, startups with female founders outperformed those without by 63%.

Well, they also said in their study:

> And we are not claiming that our data is representative of the industry...or even statistically significant.

Also, the wording is "startups with a female founder", not exclusively female founders... I think this is a detail that shouldn't be ignored.

And, the study doesn't show how many companies out of the 300 had female founders! Maybe it was just 1! They also say "Solo Founders do Much Worse Than Teams", so this is an important detail if there are no solo female teams ever backed in their firm! etc etc, the list goes on. Not exactly strong evidence to support the point PG is making, that bias would be easy to detect.

Measuring performance purely in terms of "how much money I make" is one way of doing it, but not the only way. And it won't cover the majority of jobs on the planet (how do you measure performance of someone who stacks shelves in a supermarket?)

I don't understand his point about First Round Capital showing their female founders did better than companies without female founders. What does that show? How do we know that female founders aren't simply better? Or maybe women are scared of applying, so out of women, only the best apply? In that case, the mere idea that there is a bias can cause "pre-selection" bias.

I lack the mathematics to prove this, but it seems that on the face of it, pg is simply wrong. Or I'm misreading terribly.

Tangentially: Speaking of bias, why doesn't YC publish information on their companies' tech choices? PG racked up a lot of inferred cachet (positive) by stating that use of Lisp gave them a huge advantage. Now that YC has data, they should be able to show how choice of technology correlates to performance.

It's certainly possible that the observed bias isn't the fault of First Round Capital's selection process if the applying populations are different -- the "pre-selection" bias you're talking about.

If that's the case, First Round Capital could profitably benefit from encouraging more female founders to apply.

The argument is that First Round Capital must have implicitly made it harder for female founders to get funding, since the ones who do perform better. The rational course of action for First Round Capital would be to lower their threshold on female founders (or, conversely, raise the threshold on male founders) until they perform no better or no worse than male founders.

And that's not proven by the evidence. It might be a good thing to look into, but pg's statement that you don't need more info is wrong.

Yeah, I was just rephrasing PG's argument for the OP, not saying whether it was right or wrong. I'm not smart enough to know.

The implication of this analysis of http://10years.firstround.com/ is that First Round is biased against founding teams with experience at Amazon, Facebook, Apple, Google, Microsoft or Twitter.

Can this be true?

Yes, it can be true. FirstRound could still positively value that experience, but just not be valuing it enough.

Makes sense. Fascinating.

Um, the data set pg cites actually shows this to be fallacious.

They excluded Uber from the results. If included, it would make the male-run companies look "oversuccessful". What would happen if I excluded the top female-run business? I'd bet that would make the differences between the two groups much smaller.

Given both the small sample size as well as the outsized influence of outliers, drawing conclusions from this population group is going to be fraught with issues.
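A toy illustration of how much a single outlier can move a comparison of means in a small, skewed sample. The return multiples below are invented for this sketch and are not First Round's figures, which do not appear in this thread.

```python
import statistics

# Hypothetical return multiples, invented for illustration only.
group_a = [0, 0, 0, 1, 1, 2, 3, 5, 400]   # contains one extreme outlier
group_b = [0, 0, 1, 1, 2, 3, 4, 6, 10]

def mean_without_top(xs):
    """Mean after dropping the single largest value."""
    return statistics.fmean(sorted(xs)[:-1])

full_a, full_b = statistics.fmean(group_a), statistics.fmean(group_b)
trim_a, trim_b = mean_without_top(group_a), mean_without_top(group_b)
print(f"all companies:    A {full_a:.2f}  B {full_b:.2f}")
print(f"top one excluded: A {trim_a:.2f}  B {trim_b:.2f}")
```

In this made-up sample, which group looks better depends entirely on whether the top company in each group is counted.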

The phenomenon of "stereotype threat" complicates this conclusion, however: https://en.wikipedia.org/wiki/Stereotype_threat

When a member of a group is primed with a stereotype that their group underperforms at a task, they are more likely to underperform. So there could be a selection process biased against a group, and a selected member could be an above-average performer otherwise but, because of work environment, be underperforming.

Some universities work to remedy this through support groups or other practices aimed at under-represented minorities, and they appear to help students be more successful academically. On the other hand, there's the Hawthorne effect... https://en.wikipedia.org/wiki/Hawthorne_effect

This is only a bias if the stereotype effect is stronger on measurements than in real life. If stereotype threat affects test performance and real performance the same way, then it means that the stereotyped group is truly inferior.

Do you have evidence that stereotype threat hurts test performance more than real performance?

(Of course, in a hypothetical world which only eliminated the stereotype, the group would cease to be inferior. I.e., the inferiority is based on context, and is not intrinsic.)

I wouldn't agree with the conclusion of being truly inferior - there could be ongoing issues with the work context.

I'll put it another way: Graham says we can look at just the performance by group to detect bias in selection. But there could be bias in selection and A) different treatment after selection, which would not be revealed through Graham's test. Or there could be bias in selection and B) continuous reminders of stereotypes triggering stereotype threat, and this would also not be revealed.

Now you could point to a tech company with few underrepresented minorities, let's say 1%. If there's no overt bias in the work environment then people should succeed and if they don't they're just worse performers. On the other hand, if you're in the 1%, just noticing the underrepresentation among your coworkers might be a constant reminder of stereotypes.

I don't claim to have a simple solution for this.

Differential treatment after selection would indeed invalidate Graham's test.

Stereotype threat, however, is NOT such a different treatment. Again - if stereotype threat reduces measured performance by X and actual performance by Y, then the bias it introduces is Y-X. If Y=X then there is no bias.

Do you believe Y != X? If so, why?

Minority groups will generally underperform when tested by majority groups (e.g. Hispanic student, white proctor) according to studies, though I'm citing from memory so I may be incorrect. Also, minority groups have access/insight/credibility with certain consumer groups and cultures that majority groups do not, and there's no reason to believe this advantage would be measured appropriately by examiners not fluent in that culture. If real-life performance means 'business success' and 'measurement performance' means 'being funded when pitching a startup to white VCs' then I think that criterion is on its face satisfied.

Footnote one should be put more generally: biases in the performance metrics (either appearance for women vs ability for men, or how many words of the US national anthem someone knows for US citizens vs the rest of the world) will cause this method to fail. Unfortunately, unbiased performance metrics are quite hard to find on single dimensions, let alone when moving to multi-dimensional metrics.

Many of these comments remind me of one of the most potent biases known to man: confirmation bias. If you are smart and want to believe something, you will surely be able to come up with a mathematical (albeit flawed) way to convince yourself you are right.

There is an emerging subfield of computer science that studies what it means for data (or algorithms, or decision-making rules) to be biased, and how to remove certain forms of bias.

See http://fatml.org

In which a crowd of overwhelmingly white male American SW engineers tries to find a mathematical explanation to bias...

"For Bourdieu, cultural capital is the status culture of a society's elite insofar as that group has embedded it in social institutions, so that it is widely and stably understood to be prestigious. Schools take it as a sign of native academic ability but do not themselves impart it, performing acts of social alchemy that transform class privilege into individual merit."

"What it means for a selection process to be biased against applicants of type x is that it's harder for them to make it through. Which means applicants of type x have to be better to get selected than applicants not of type x. [1] Which means applicants of type x who do make it through the selection process will outperform other successful applicants."

There are many, many reasons that both sentences beginning "which means" are false that someone who is as smart as we're told Graham is should be able to come up with quite easily. It's astonishing that he made this tripe public.

Here's a gimme for each.

Say I'm selecting people to receive a prize; there are ten recipients and they're putatively chosen by [whatever]. But I don't like people with green eyes, so green-eyed candidates had better be pretty pleasing to me. But they can please me in any way, not necessarily in ways relevant to the metric for which the prize is awarded—maybe I also like tall people so a really tall green-eyed person averages out in terms of my predilections. They aren't relevantly better.

For the second, again, the question is "better" at what? Better at getting whatever is involved in getting selected? That doesn't necessarily correlate with outperforming anyone subsequently, especially if it's a matter of startupland. (Remember that New Yorker profile of Marc Andreessen, where Sam Altman basically admitted that he didn't know what he was doing in terms of selecting what to invest in? The flipside of that is being selected by Altman for an investment.)

I don't get it :-/ Why is there a bias?

Even if the VCs are totally unbiased, why couldn't the startups with women outperform the others? It could happen for a variety of reasons. Just hypothetically speaking, maybe startups-with-women have different networking connections or insight that male-only-startups don't have?

> Why is there a bias?

Applications by definition are supposed to be biased. If an application weren't biased then it wouldn't be an application, it would just be a lottery. And while that's a system that a lot of charter schools actually use, or at least pretend to, I think it would be a tough sell to convince venture capitalists to allocate their capital this way.

Then, if I understand PG's argument correctly, the VCs should invest in even more startups-with-women, and lower their threshold for investing in them, at least until they perform no better than startups-without-women.


If that were the case, VCs should accept more companies-with-women, which would change the TOTAL mix of accepted candidates so that the women no longer overperformed.

The issue, interestingly, is that there are a lot of women (or biased-against group) between the median performance and 160% performance who are being rejected.

In other words, only the best women can get accepted, which means that above-median (but not superstars) are getting rejected, while many more above-median men are getting accepted.

This has the ring of truth to me (as pg says, it's the definition of bias).

A related observation (which I've been making for a long time) is that the absence of mediocre women in positions of power is strong evidence of bias. Men can succeed when they're mediocre, but women have to be exceptional. Likewise for minorities.

What's "exceptional"? I've known many women in my career and many of them were mediocre. Some were in positions of authority and some weren't. Another commenter lists several "mediocre" women in the corporate and political world (and leaves out some big ones, like Meg Whitman). If you're talking about becoming a CEO, you have to be an "exceptional" man to get there too, in absolute terms.

I feel like the root erroneous assumption here is that an equal number of people of all types are interested in the same things and that the only reason any group becomes more represented than another is that the others are getting alienated or funneled out somewhere along the way. That is a completely incorrect and invalid assumption. The fact that there are a lot more non-English-speakers in janitorial work in the US (when I worked as a janitor, I was 1 of 2 English speakers on the 12-person janitorial staff) doesn't necessarily mean the janitorial manager is biased against English speakers; it means that due to external considerations, like the fact that almost all other jobs require you to speak the native language, non-English-speakers are better suited for janitorial work, and therefore people do the logical thing, apply for work that they can do, and end up comprising a larger section of the application pool.

People make decisions based on social, cultural, and physical expectations of them, and there's not anything wrong with that. By and large, women do not have an interest in computer sciencey or entrepreneurial work. It's OK if a woman does, but it's also OK to note that most women don't. There's nothing we need to fix about it. Most women don't want to do it, and there's no reason to force them.

Why do you see fewer women becoming CEOs? Because fewer women want that kind of job and fewer women are qualified for that kind of job due to the biological realities of humanity that require women to take time out for pregnancy and child-rearing (sorry denialists, I didn't invent biology and choose that only women could bear and nurse children, so don't take it up with me), and the social and cultural expectations that have formed around these biological realities. In short, the serious applicant pool includes only a very small amount of women, so only a very small number of women obtain that position.

> People make decisions based on social, cultural, and physical expectations of them, and there's not anything wrong with that.

I couldn't disagree more. If the culture is plain chauvinism ("women belong in the kitchen not the boardroom") then there's everything wrong with that. All oppression throughout history is essentially "just culture", but that justifies nothing.

Your biological reductionism is completely at odds with our best scientific understanding of contemporary gender roles, as a few minutes on wikipedia will tell you.

I didn't say "just culture"; I said the confluence of social, cultural, and physical factors.

Why does "culture" develop? Because people are naturally evil and black-hearted? These things don't happen in a vacuum, they develop organically because they are the best way to support human and tribal propagation and prosperity. Perhaps some things can and should change, but things that are constant across nearly all successful human societies should be considered fairly well tested.

We should note that it takes a long time to see the full effects of changes to social structures and institutions, generally at least 3-4 generations. If a society is "testing" something and the society itself expires or its success is greatly diminished within 6-8 generations of implementation, the test should probably not be seen as successful.

The West will find that traditional principles that assign gender roles based on that gender's inherent advantages and disadvantages are much more useful than currently acknowledged. Forcing people to do things that they a) don't even want to do and b) aren't well-suited for is a losing proposition, no matter how much outrage you try to manufacture to justify it.

Before you take your theory too far, you need to explain why it's OK that your theory implies that black people in America were best suited to be slaves, up until the day they weren't.

Black slavery proves my point. Seen in the context of an experimental social institution, it was a massive failure that barely made it 8 generations before it completely imploded on itself (and took the lives of 650k Americans with it). There's no doubt that it seriously harmed everyone associated with its practice (including in ways you don't usually hear people mention, like decreased ambition and work ethic for everyone, slaves and masters, in slave economies), even mostly-innocent parties who were "guilty by association" like the free states. We're lucky that the US survived black slavery.

Slavery has been tried many times but the gross inequity it inflicts means that no one can operate a stable economy or social system that depends on it.

> and nurse children

Science mostly solved that problem decades ago, if you hadn't noticed, to the extent that breast feeding is now viewed as strange or embarrassing in certain cultures.

>Science mostly solved that problem decades ago

Medical practitioners strongly emphasize breastfeeding as the ideal form of nourishment for the baby. Formula should be used as little as possible. It's cool that we have a viable alternative solution in formula, but it's still worse than natural breastfeeding.

I've seen plenty of mediocre women and minorities in positions of power.

For a bipartisan example, consider Barack Obama and Sarah Palin.

I just assume there is bias... I mean the fact is, bias is what you are trying to work in favor of... that bias being factors of success. Chasing your tail against random statistics won't really show much, and a person is more complex than a few statistical groups. As far as investing goes, there's also the product, and how that leader/founder matches to that product category itself. A founder that succeeds in one category won't definitively succeed in another. Many founders fail their first few times, and later succeed. Others fail after some success(es).

I think as long as reasonable steps are made to avoid certain obvious bias, the rest is mostly chance.

This isn't really sound reasoning, for reasons mentioned elsewhere and because of the following.

You need to know that the probability of acceptance is conditionally independent of the "type" of the applicant given the success of the applicant.

For example, consider the following hypothesis for the First Round data: women are more honest than men. A woman presenting a bad idea to a VC will be rejected whereas a man may be able to weasel his way into getting funding. This will make men have a lower success rate, and correspondingly women will have a higher success rate.

However, this isn't really the same thing as having an across-the-board hidden bias against women.

Actual real world example (and application of an antidote) of this basic idea: https://en.wikipedia.org/wiki/Rooney_Rule

In the post, PG states that First Round's study is evidence of gender bias in VC financing. But footnote [2] is important: Uber was excluded as an outlier. Now...excluding Uber is reasonable (it is sort of an outlier), but so is not excluding it (it was a company that First Round invested in). When the conclusion from a data analysis depends on which way you go on something like this - which of two reasonable alternatives you pick - then the results are fragile and they don't really support either conclusion very well.

All other arguments aside ... this idea also fails if the judging party's idea of quality is mostly uncorrelated with actual quality. Which Graham says in other essays is usually the case (it's what you mean when you say it's almost impossible to predict which companies will be successful).

Graham says the subjects of bias "have to be better to get selected", but what is really going on is they have to be better according to the metrics of the judge which are essentially arbitrary.

This is false - noise doesn't hurt this test at all. See my comment explaining why:


Bad measurements add noise (and increase the sample size required) but they don't invalidate the bias detection procedure.

This is also in large part why people tend to hire/fund younger versions of themselves. If judgement wasn't arbitrary much more diversity should be expected.

If candidates from group A perform more strongly on average than those from group B, there are other possible causes than bias in the selection process itself. For instance, members of group A may only apply at a higher level of self-assessment for how likely they are to succeed than those in group B. The reason for this could be opportunity cost not present for group B, overconfidence or lack of underconfidence in group B, or underconfidence or lack of overconfidence in group A.

For a formalized and empirical version of this argument applied to the entirety of the US economy, check out the following article: The Allocation of Talent and U.S. Economic Growth by Hsieh et al. (http://klenow.com/HHJK.pdf). It quantifies the gains from the decreases in misallocation of women and African Americans as racial discrimination in employment decreased over the past 50 years.

While we can have lots of fun arguments about the mathematics of this approach, the basic problem is the underlying data is too small and poor to draw any valid conclusion from.

I think I have a simpler counterexample to disprove pg's hypothesis than any other counterexample I've read in the comments. Suppose our goal is to admit the top 5 applicants with the following performances:

  A - 30,000
  A - 10,000
  A - 9,000
  B - 7,000
  B - 5,000   # Cutoff point below this line
  A - 4
  B - 3
  B - 2
Even though admitting the top 5 by score is perfectly fair, the applicants from group A perform better.
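To make the arithmetic concrete, here is a minimal sketch of that example. It assumes "performance" is simply the number listed next to each applicant, and the helper name `group_mean` is mine, not anything from the essay.

```python
from statistics import mean

# Applicants from the example above: (group, performance).
applicants = [
    ("A", 30_000), ("A", 10_000), ("A", 9_000),
    ("B", 7_000), ("B", 5_000),
    ("A", 4), ("B", 3), ("B", 2),
]

# Group-blind rule: admit the top 5 by performance alone.
admitted = sorted(applicants, key=lambda a: a[1], reverse=True)[:5]

def group_mean(group):
    """Mean performance of the admitted applicants from one group."""
    return mean(score for g, score in admitted if g == group)

mean_a = group_mean("A")
mean_b = group_mean("B")
print(mean_a, mean_b)
```

The rule never looks at the group label, yet the admitted A's average far above the admitted B's.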

I don't see what you're getting at. Group A is better and there are more of them. What's the problem?

pg's argument is that if the average performance of one admitted group is better than the other the admission process has a bias. This example shows that you can have an unbiased process, but the average performance of the groups differs.

It's likely that the two of you are talking across each other because you read slightly different articles. Paul added an assumption to his article, possibly after William read it, which is intended to rule out his posited distribution: "(c) the groups of applicants you're comparing have roughly equal distribution of ability".

This strikes me as a "heroic assumption", but it's true that if you make it most of the flaws in his argument go away. Add in the unspoken assumption that the groups are both large enough that sampling variation does not matter, and I think he's probably logically correct.

On the other hand, once you make these assumptions, the rest of his argument seems unnecessary, since all you need to know is the ratio of males and females funded. If male and female founders are exchangeable, the process is biased if one group is funded more often than they are represented in the applicants.

You don't even need to look at outcome, since we've already assumed the founders are of equal ability. I think that Paul is aiming at the case where we don't know the ratio of applicants. I think his argument can be useful in this case, but only if you have already accepted his assumptions.

> "Paul added an assumption to his article, possibly after William read it, which is intended to rule out his posited distribution"

Yeah. That is what happened.

Fittingly, another type of bias observed in the linked report is that against Solo Founders. The report states that solo founders do worse when measured against the same yardstick as multiple founders. Maybe from a VC perspective this is intended (big raise => bigger exit?), but I'd argue that you don't need to raise as much when you have a solo founder because dilution is less of a concern.

I think the first footnote in this is extremely valid. It all depends on what performance metric the selection process is identifying compared to the performance metric you use to determine success.

I would suspect the larger issue is that people are probably much worse at identifying what performance metrics for selection convert to their respective performance for success.

If students of Asian origin outperform the whole student body, can we conclude admissions folks are biased against students of Asian origin?

That's possible. Without concrete data to give you, there are some suspicions that because of the much better performance of Asian students, they're being limited in the admissions process. Otherwise, Asian students would make up the vast majority of the students admitted. This would crowd out the non-Asian students accepted for admission. In this case, it would be more accurate to say that admissions officers limit the number of Asian students accepted instead of saying admissions folks are biased against Asian students.

>admissions officers limit the number of Asian students accepted instead of saying admissions folks are biased against Asian students //

The effect is the same isn't it? Less chance for a student with ancestors from a particular geographic locale getting a placement.

The effect would result in the same situation, but the cause is much different.

Alternately, you could conclude that instead of "mediocre" Asians being excluded by bias, Asians have an external advantage that makes them perform better. Maybe it's cultural, since most Asians are taught a very strong work ethic and heavy emphasis is placed on formal schooling, succeeding, and fitting in. Maybe Asians are physically better adapted to that type of work, with brains that retain information more easily or buttocks that don't get sore from sitting in a chair all day.

The "high performance means there's a bias" theory only works if you assume that everyone is starting from the same social, cultural, and physical baseline. They aren't.

Maybe a better metric would be if there were no mediocre data points among a certain group; that would be more evidence (but still not necessarily good evidence) that you have to be exceptional to get attention and overcome the "bias barrier", not simply that most of a certain type of performer does better than a different type.

That doesn't actually have a large effect. An example in numpy:

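    In [13]: from numpy import mean, where; from scipy.stats import norm  # imports the session relies on
    In [14]: x = norm(0.0,1).rvs(100000)
    In [15]: mean(x[where(x > 2.0)])
    Out[15]: 2.3774795090391301
    In [16]: y = norm(0.5,1).rvs(100000)
    In [17]: mean(y[where(y > 2.0)])
    Out[17]: 2.4372124830289557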
I.e., a difference in the mean of 0.5 sigma corresponds to 0.06 in Graham's test statistic.
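For anyone who would rather not rely on random draws, the same quantity can be computed in closed form. This is only a sketch: it assumes both groups are exactly normal with unit variance and a hard cutoff at 2.0, and the function name `mean_above_cutoff` is my own.

```python
from statistics import NormalDist

def mean_above_cutoff(mu, sigma, cutoff):
    """E[X | X > cutoff] for X ~ Normal(mu, sigma), via the truncated-normal formula."""
    z = (cutoff - mu) / sigma
    std = NormalDist()  # standard normal
    return mu + sigma * std.pdf(z) / (1.0 - std.cdf(z))

mean_x = mean_above_cutoff(0.0, 1.0, 2.0)
mean_y = mean_above_cutoff(0.5, 1.0, 2.0)
gap = mean_y - mean_x
print(round(mean_x, 3), round(mean_y, 3), round(gap, 3))
```

The exact values land close to the simulated ones above, so the small gap is not a sampling fluke.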

Graham is a little bit off - a better place to look for bias is the bottom of the accepted distribution than at the mean.
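A rough sketch of what looking at the bottom buys you. Everything here is an assumption made for illustration: both groups have the same normal ability distribution, quality is observed without noise, and group B faces a hypothetical extra bar `EXTRA` on top of the common cutoff.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

CUTOFF = 2.0  # bar applied to group A
EXTRA = 0.3   # hypothetical extra bar applied to group B

group_a = [rng.gauss(0.0, 1.0) for _ in range(100_000)]
group_b = [rng.gauss(0.0, 1.0) for _ in range(100_000)]

accepted_a = [x for x in group_a if x > CUTOFF]
accepted_b = [x for x in group_b if x > CUTOFF + EXTRA]

# The weakest accepted member of each group sits just above that group's bar.
weakest_a = min(accepted_a)
weakest_b = min(accepted_b)
print(weakest_a, weakest_b, weakest_b - weakest_a)
```

With samples this large the gap between the two minima comes out close to `EXTRA`, which is the quantity you actually care about. With noisy measurements you would presumably average a few of the lowest performers instead of taking the single minimum.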

At https://news.ycombinator.com/item?id=10483861 gizmo shares an intuitive anecdote that matches your math.

Even if Asians perform better for external reasons, the selection process should account for that before the selection is made, and the admitted class should be roughly equal performers, as a group. Unless Asians have a very lumpy shaped performance curve across the group.

This presupposes that we should not accept unequal representation among groups. I don't believe that, and I don't think it's an implicit Western value. People should be allowed to flourish according to their natural advantages. The value is to try not to make an early judgment and exclude people based on assumptions about their group's capability, whether that exclusion is based on the group's perceived disadvantage or advantage. The idea is that the individual shouldn't be held accountable for things he had nothing to do with, and shouldn't be assumed to be automatically compliant with stereotypes. I don't see a need to start throttling groups that are doing "too well".

You misunderstood what I wrote. I didn't claim that a high performing subgroup should be throttled, I claimed that the Asians that pass an unbiased bar would be roughly as successful as the non-Asians that pass the same bar.

This is how I understand this:

Look back at decisions you have made under various lenses and learn about your decisions and what biases they have, so that you can avoid them or amplify them (if positive) in future.

Could someone who read it more attentively tell me, by this methodology,

-> If in retrospect YC finds any factor that its selected founders who turn into unicorns ($1b, $10b etc) have in common (more than its non-unicorn, also accepted founders)

-> Then by this method, could it conclude retroactively that it had been "biased" against that factor? (since it is present more than in its non-unicorns whom it had also admitted; i.e. in other words, those with the factor are more performant than "would be expected" without the bias against it?)

Or have I misunderstood?

Even assuming equal distribution of ability, there is still the problem of whether you can measure performance without bias.

The test pg suggests was also proposed by the economist Gary Becker [1]. As many people here noticed, the catch is that the test only works if you compare marginal performance and not average performance. Economists call this the inframarginality problem [2]. There are a number of solutions to this problem to restore pg's result:

- As pg himself says, if we assume certain statistical distributions of ability and selection rules, the inframarginality problem goes away.

- We'd also solve the inframarginality problem if we can tell roughly who the marginal applicants were. If pg could ask the VC firm, see who almost got rejected, and compare these two groups, he'd be set. pg is well-positioned to test this on the YC dataset.

Likewise, he could solve this problem if he can observe another variable that reveals who the marginal applicants likely were (for example, the startups that had the fewest co-investors).

- There's also an entire literature out there that tries to solve the problem using other ways. For example if a system follows the "KPT" sufficient conditions then the inframarginality problem also goes away.

[1] One prominent approach ... is the “outcome test,” which originated in Gary S. Becker (1957). In the context of motor vehicle searches, the outcome test is based on the following intuitive notion: if troopers are profiling minority motorists due to racial prejudice, they will search minorities even when the returns from searching them, i.e., the probabilities of successful searches against minorities, are smaller than those from searching whites. More precisely, if racial prejudice is the reason for racial profiling, then the success rate against the marginal minority motorist (i.e., the last minority motorist deemed suspicious enough to be searched) will be lower than the success rate against the marginal white motorist. (From [3])

[2] "While this idea has been well understood, it is problematic in empirical applications because researchers will never be able to directly observe search success rates against marginal motorists. This is due to the fact that we cannot identify the marginal motorist, since accomplishing this would require having complete information on all of the variables that troopers use in determining the suspicion level of motorists. Because of this omitted-variables problem, we can observe only the average success rate of searches against white and minority motorists, and not the marginal success rate. Since the equality of marginal search success rates does not imply, and is not implied by, the equality of the average search success rates, we cannot determine the relationship between the marginal search success rates of white and minority motorists by looking at average success rates. In past literature, this has been referred to as the “infra-marginality” problem." (From [3])

[3] Anwar, Shamena, and Hanming Fang, "An Alternative Test of Racial Prejudice in Motor Vehicle Searches: Theory and Evidence." American Economic Review. (2006)


Okay, PG has an hypothesis test.

There's a large literature for that, e.g.,

E. L. Lehmann, Testing Statistical Hypotheses.

E. L. Lehmann, Nonparametrics: Statistical Methods Based on Ranks.

Sidney Siegel, Nonparametric Statistics for the Behavioral Sciences.

In this case, PG will be more interested in the non-parametric case, i.e., distribution-free where we make no assumptions about probability distributions.

We start an hypothesis test with an hypothesis, commonly called the null hypothesis which is an assumption that there is no effect or, in PG's case, no bias. Then with that assumption, we are able to do some probability calculations.

Then we look at the real data and calculate the probability of, say, the evidence of bias being as large as we observed. If that probability is small, say, less than 1%, then we reject the null hypothesis, that is, reject the assumption of no bias, and conclude that the null hypothesis is false and that there is bias. The role of the assumption about the sample is so that we know that the problem is bias and not something about the sample.

In hypothesis testing, about all that matters are just two numbers -- the probability of Type I error and that of Type II error. We want both probabilities to be as low as possible.

Type I Error: We reject the null hypothesis when it is true, e.g., we conclude bias when there is none.

Type II Error: We fail to reject (i.e., we accept) the null hypothesis when it is false.

When looking for bias, Type I error can be called a false alarm of bias, and Type II error can be called a missed detection of bias.

In PG's case, suppose we have 100 startups and five of those have women founders. Suppose for each of the startups we have the data from "their subsequent performance is measured".

Our null hypothesis is that the expected performance of the women is the same as that of the men.

So, let's find those two averages and take the difference, say, the average of the women less the average of the men.

PG says if this difference is positive, then there was bias, but PG has not given us any estimate of the probability of Type I error, that is, of the probability (or rate) of a false alarm.

I mean we don't want to get First Round Capital in trouble with Betty Friedan, Gloria Steinem, Marissa Mayer, Sheryl Sandberg, Hillary Clinton, Ivanka Trump, or Lady Gaga unjustly! :-).

Let's call this difference our test statistic.

So, let's find the probability of a false alarm:

So, let's put all 100 measurements in a pot, stir the pot vigorously (we can use a computer for this), pull out five numbers and average, pull out the other 95 numbers and average, take the difference in the two averages, that of the five less that of the 95, and do this, say, 1000 times. Ah, computers are cheap; let's be generous and do this 10,000 times.

For a random number, how about starting with a 32 bit integer, with appropriately long precision arithmetic multiply by 5^15, add 1, take modulo 2^47, and scale as we want?
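That recurrence is a linear congruential generator. A minimal sketch, assuming each result is fed back in as the next starting integer and that "scale as we want" means dividing by the modulus to get a number in [0, 1). The name `lcg` and the seed value are just for illustration.

```python
MULTIPLIER = 5 ** 15
INCREMENT = 1
MODULUS = 2 ** 47

def lcg(seed):
    """Yield floats in [0, 1) from the multiply, add, modulo recurrence."""
    state = seed % MODULUS
    while True:
        # Python integers are arbitrary precision, so the product never overflows.
        state = (state * MULTIPLIER + INCREMENT) % MODULUS
        yield state / MODULUS

gen = lcg(seed=12345)
draws = [next(gen) for _ in range(5)]
print(draws)
```

For real work Python's built-in `random` module is the easier choice; this only shows that the recipe is a few lines.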

So, we get an empirical distribution of these differences, from the five less the 95. Looking at the distribution, we see what the probability is of getting a difference as high or higher than our test statistic. If that probability is low, say, 1% or less, then we reject the null hypothesis of no bias and conclude bias with our estimate of probability of Type I error 1% or less.
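The pot stirring can be sketched in a few lines. This is an illustration only: the performance numbers below are made up, the function name `permutation_p_value` is mine, and it uses Python's built-in generator rather than a hand-rolled one.

```python
import random

def average(values):
    return sum(values) / len(values)

def permutation_p_value(women, men, n_shuffles=10_000, seed=0):
    """One-sided p-value for average(women) - average(men) under random relabeling."""
    rng = random.Random(seed)
    observed = average(women) - average(men)
    pooled = list(women) + list(men)
    k = len(women)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)  # stir the pot
        diff = average(pooled[:k]) - average(pooled[k:])
        if diff >= observed:
            hits += 1
    return observed, hits / n_shuffles

# Made-up data: 95 men and 5 women drawn from the same distribution.
data_rng = random.Random(1)
men = [data_rng.gauss(0.0, 1.0) for _ in range(95)]
women = [data_rng.gauss(0.0, 1.0) for _ in range(5)]
observed, p_value = permutation_p_value(women, men)
print(observed, p_value)
```

Reject the null hypothesis of no bias only if `p_value` comes out at or below the false alarm rate picked in advance, e.g. 0.01.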

If with the 1% we reject, then it looks like First Round has done a transgression, will get retribution from Betty, et al., and needs to seek redemption and Betty, et al., are happy to have their suspicions confirmed. Else First Round looks like the good guys, are "certified statistically fair to women", may get more deal flow from women, and Betty, et al., can be happy that First Round is so nice!

Notice that either way Betty, et al., are "happy". That's called "happy women, happy life"! Or, heads, the women win, tails they lose, and in no event is there a huge crowd of angry women in front of First Round's offices with a bonfire of lingerie screaming "bias"!

When we reject the null hypothesis, we want to know that the reason was men versus women and not something else, e.g., a biased sample. So here is where we use our assumption of independence with the same mean.

Now we have a handle on Type I error.

Here we have done a non-parametric statistical hypothesis test, i.e., have made no assumptions, except the means, about the distributions of the male/female CEO performance measurements.

And we can select our desired false alarm rate in advance and get that rate almost exactly.

For Type II error, that is more difficult.

Bottom line, what we really want is, for whatever rate of false alarms we are willing to tolerate, the lowest rate of missed detections we can get.

Can we do that? With enough more data, yup. There is a classic result due to J. Neyman (long at Berkeley) and E. S. Pearson (son of K. Pearson, who was early in statistics) that shows how.

How? Regard false alarm rate as money and think of investing in SF real estate. We put our money down on the opportunities with highest expected ROI until we have spent all our money. Done. For details, an unusually general proof can follow from the Hahn decomposition from the Radon-Nikodym theorem in measure theory, e.g., Rudin, Real and Complex Analysis. Right, in the discrete case, we have a knapsack problem, known to be NP-complete.

What we have done with our pot stirring is called resampling, and for more such look for B. Efron, long at Stanford, and P. Diaconis, once at Harvard, now long at Stanford.

Tom, with a reputation as a hacker, likes to work late, say, till 2 AM. So, we look at the intrusion alerts each minute between 2 AM and 3 AM (something like the performance of the women) and compare with those of the other minutes of 24 hours (like the performance of the men) much as above and ask if Tom is trying to hack the servers.

Or, we have a server farm and/or a network, and we want to detect problems never seen before, e.g., zero day problems. So, we have no data at all on the problems we are trying to detect because we have never seen any of those before.

So, to do a good job, let's pick some system we want to monitor and for that system, get data on, say, each of 10 variables at, say, 20 times a second. Now what?

Our work with bias in women venture applications used just one number for our measurement and test statistic. So we were uni-dimensional. Here we have 10 numbers and need to be multi-dimensional.

Well, in principle we should be able to do much better (pair of Type I and Type II error rates) with 10 numbers than just one. The usual ways will require us to have, with our null hypothesis, the probability distribution of the 10 numbers, but can only get something like that from smoking funny stuff -- not even big data is that big.

So, we want to need no assumptions about distribution, that is, be distribution-free.

So, we want a statistical hypothesis test that is both multi-dimensional and distribution free.

Can we do that? Yup.

"You mean you can select false alarm rate in advance and get that rate essentially exactly, as in PG's bias example?" Yup.

"Could that be used in a real server farm or network to detect zero day problems -- security, performance, hard/software failures, system management errors?" Yup -- just what it was invented for.

"Attempted credit card fraud?" Ah, once a guy in an audience thought so!

How? Ah, sadly there is no more room in this post!

What else might we do with hypothesis tests? Well, look around at, right, big data or just small data.

Do we have a case of big data analytics or artificial intelligence (AI)?

Ah, I've given a sweetheart outline of statistical hypothesis testing, and now you are suggesting some things really low grade? Where did I go wrong to deserve such an insult?



Type I Error: We reject the null hypothesis when it is true, e.g., we conclude bias when there is none.

Type II Error: We fail to reject (i.e., we accept) the null hypothesis when it is false.


Type I Error: We reject the null hypothesis when it is true; e.g., we conclude bias when there is none.

Type II Error: We fail to reject (i.e., we accept) the null hypothesis when it is false; e.g., we conclude there is no bias when there is.


For a random number, how about starting with a 32 bit integer, with appropriately long precision arithmetic multiply by 5^15, add 1, take modulo 2^47, and scale as we want?


For a random number, how about starting with a 32 bit integer, with appropriately long precision arithmetic multiply by 5^15, add 1, take modulo 2^47, take the resulting integer, scale as we want for stirring our pot, and use that integer as the start of another random number?


Else First Round looks like the good guys, are "certified statistically fair to women", may get more deal flow from women, and Betty, et al., can be happy that First Round is so nice!


Else First Round looks like the good guys, are statistically certified fair to women, may get more deal flow from women, and Betty, et al., can be happy that First Round is so nice!


Notice that either way Betty, et al., are "happy". That's called "happy women, happy life"! Or, heads, the women win, tails they lose, and in no event is there a huge crowd of angry women in front of First Round's offices with a bonfire of lingerie screaming "bias"!


Notice that either way Betty, et al., are "happy". That's called "happy women, happy life"! Or, heads, the women win, tails First Round loses, and in no event is there a huge crowd of angry women in front of First Round's offices with a bonfire of lingerie screaming "bias"!


So, we want some statistical a hypothesis test that is both multi-dimensional and distribution free.


So, we want a statistical hypothesis test that is both multi-dimensional and distribution free.


We put our money down on the opportunities with highest expected ROI until we have spent all our money.


We put our money down on the opportunities with highest expected ROI until we have spent all our money. Done.

On a simple mathematical basis, this is false.

Consider two groups of candidates for a scholarship, A and B. We want to select all candidates that have an 80% or better chance of graduation. Group A comes from a population where the chance of graduation is distributed uniformly from 0% to 100% and group B is from one where the chance is distributed uniformly from 10% to 90%, with the same average but less variation in group B.

Now suppose that we select without bias or inaccuracy all the applicants that have an 80% or better chance of graduation. That means we select a subset of A with a range of 80% to 100% and a subset of B with a range from 80% to 90%. The average graduation rate of scholarship winners from group A will be 90% and that from group B will be 85%.

But we haven't been biased against A. We've selected according to the exact same perfect evaluation process and criterion from both groups. It was just their prior distribution that was different.
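A quick check of those two averages, as a sketch. It assumes a candidate's "chance of graduation" is exactly the number the selector sees; the helper `truncated_uniform_mean` is a name I made up.

```python
def truncated_uniform_mean(low, high, cutoff):
    """Mean of a Uniform(low, high) variable, given that it is at least `cutoff`."""
    start = max(low, cutoff)
    if start > high:
        raise ValueError("nobody in this group clears the cutoff")
    # What is left of a uniform above the cutoff is still uniform, so take the midpoint.
    return (start + high) / 2

CUTOFF = 0.80
mean_a = truncated_uniform_mean(0.00, 1.00, CUTOFF)  # group A: 0% to 100%
mean_b = truncated_uniform_mean(0.10, 0.90, CUTOFF)  # group B: 10% to 90%
print(round(mean_a, 2), round(mean_b, 2))
```

Same cutoff, same perfect evaluation, different averages among the winners, purely because the spread of the two populations differs.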

The actual applicant groups for jobs or financing in the real world, when they are divided by demographic factors like age, sex, race, and educational level, will almost always manifest different variances in success levels even when the averages are the same. That makes this test useless and mathematically illiterate.

And when we use a normal distribution, as we should always expect given the central limit theorem, the mathematical problems get even more intense.

This short comment is not up to pg's usual high standards for his essays.

It's true that this test assumes groups of applicants are roughly equal in (distribution of) ability. That is the default assumption in most conversations I've been involved in about bias, and particularly the example I used, but I'll add something making that explicit.

I like the idea, but how do you apply this to power law distribution outcomes and get any statistical significance? I don't know the answer.

E.g. the underlying First Round analysis likely has no statistical significance. Assuming a power law distribution of outcomes, the top 5 outcomes will account for 97% of value. So we now have a study with n=5.

To make the point, let's apply this to YC's own portfolio. Assuming Dropbox, Airbnb and Stripe represent 75% of its value, we'll learn that YC is incredibly biased against:

  * MIT graduates
  * brother founders
  * founding teams that do not have female founders
  * and especially males named Drew
Hard to believe these conclusions are correct or actionable.

> I like the idea, but how do you apply this to power law distribution outcomes and get any statistical significance? I don't know the answer.

See my post


where the tests are distribution-free. So "power law", Gaussian, anything else, doesn't matter.

I feel the addition:

    "C" the applicants you're looking at have roughly 
    equal distribution of ability.
makes the reasoning more tautological/weak.

Take two dart boards (one for female founders, one for male founders) as a visual, where hitting near the bull's eye counts as "startup success".

If we take "C" to be true, then the darts would be thrown at random.

Now we draw a circle around the bull's eye. Anything landing in this circle we fund. If this circle has a smaller radius on the female dartboard, than on the male dartboard, then evidently the smaller female circle will contain more darts closer to the target (better average performance) than the larger radius male circle.

But then we do not even need performance numbers: Smaller radius circles will have fewer darts in them. Using "C" we only need to know that the male-female accept ratio is not 50%-50% for us to have found a bias.

In short: If you see a roughly equal distribution of ability, and (for simplicity) a roughly equal number of female to male fundraisers, then you should always have a roughly equal distribution of female to male founders in your portfolio, performance be damned.
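The dartboard argument can be sketched in one dimension, assuming (as a simplification that is not in the original comment) that ability is uniform on [0, 1] in both groups and that each group is accepted above its own cutoff. The cutoff values are hypothetical.

```python
def accepted_stats(cutoff):
    """For ability uniform on [0, 1]: (acceptance rate, mean ability of accepted)."""
    if not 0.0 <= cutoff < 1.0:
        raise ValueError("cutoff must be in [0, 1)")
    rate = 1.0 - cutoff               # fraction of applicants above the cutoff
    mean_accepted = (1.0 + cutoff) / 2.0  # midpoint of [cutoff, 1]
    return rate, mean_accepted


male_rate, male_mean = accepted_stats(0.80)      # hypothetical looser cutoff
female_rate, female_mean = accepted_stats(0.90)  # hypothetical stricter cutoff

print(f"male:   rate={male_rate:.2f}, mean of accepted={male_mean:.2f}")
print(f"female: rate={female_rate:.2f}, mean of accepted={female_mean:.2f}")
```

Under these assumptions the stricter cutoff shows up twice: as a higher mean among the accepted and as a lower acceptance rate. That is why, with equal ability distributions and equal applicant numbers, the acceptance ratio alone would already reveal the difference.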

The technique is still useful when you do not have these female vs. male accept ratios and a VC publishes only success rates, but this information on ratios is often more public than success rates/estimates.

Doesn't this logic assume that there are the same number of darts thrown total at both boards?

The issue with founder funding is there are fewer female applicants than male applicants, and the applications aren't published.

I am sorry for all posts in this thread (including this one). Imagine being PG and reading 200+ negative replies to a blog post you did. I could have reasoned in line with Graham and learned a lot more than I did by resisting and attacking a viewpoint different from my own.

I feel that a different number of darts is salvageable for this logic, but having thought about this blog post some more, I feel bias is inherently non-computable. Our decision on how to compute influences our results.

What PG did for me was show that there is no Pascal's wager in statistics: All outcomes/data/measurements/views are equally likely. The view that the female variable alone is able to divide skill/start-up success is weak. The assumption of non-uniform points is weak. The assumption of no variance/unequal rankings is weak. The assumption that a non-random sample is significant is weak. The assumption that VCs are unbiased in their selection procedure is weak. The assumption that nature/environment favors skilled women is weak. The assumption that decisions about whom to fund do not influence future applicants is weak. The assumption that women are still selected for capability is weak. The assumption that women ignore nature/environment and keep focusing on start-up capability is weak. It is much more likely that any other thing happens. PG's alternative is certainly a sane one, but one of many.

Perhaps women perform better because, while VC offers the same chance to men and women, they are better at picking capable women than capable men. Bias in favor of capable women.

Perhaps women perform better because, they are naturally better than men.

Perhaps women perform better because, VC is biased against women, and only the strong survive.

Perhaps women perform better because, affirmative actions to remove the inequality in performance (perceived bias) actually increased our objective bias.

Perhaps women perform better because, VC is bad at picking capable women, so they pick incapable women, of which there happen to be a lot more.

Perhaps women perform better because, now the smart and capable women start to act like the mediocre ones (bad funding decisions influence actors looking for reward)

Perhaps women perform better because, nature is "biased" against older risk-averse, but available, men and, older, unavailable women who have children, and nature favors both young males (who have to compete with the old males) and females (who compete only among themselves).

Perhaps women perform better because, our sampling method was biased.

Perhaps women perform better because, our measurements were 5 years old and we are seeing an old static state of a highly complex dynamic system.

Perhaps women perform better because, they are more variant. The good ones are really good and the bad ones are really bad, making it easier on VC's to pick the cream of the crop.

All I know is how little I know. That (algorithmic) bias is an important subject, worth thinking about, and that we need very smart people working on this subject. I would never have gotten away with upvotes on my posts in this thread if the subject was cryptography. I clearly know very little about both subjects (and only now I know that, which I hope is at least a start).

PG showed that we (I), perhaps too easily, go along with the status quo: Our measurements are all correct, our conclusions are all correct. While, if you think about it.. objectively I agree that women and men are equal in capability. If you believe this to be so, then you may have a selection bias, if you observe that men and women perform differently.

I think the least all views could do is to make sure the environment for female founders to flourish is healthy and in line with skill/capability. Then let nature do its thing.

P.S.: If we know that females actually perform better than males, what is the ethical thing to do? Fund even more female founders and make it harder for men? It would make you richer. Affirmative action? It would not remove a bias, it would introduce one.

Assuming that the distributions are exactly equal, the test would still give misleading results in situations in which the bias does not manifest as a different cutoff.

For example, if a VC funds all male founders but flips a coin to decide whether to fund each female founder, the test would fail to detect overwhelming bias.
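A small simulation of the coin-flip scenario, as a sketch under stated assumptions: founder quality is drawn uniformly on [0, 1] for both groups, every male founder is funded, and each female founder is funded with probability 0.5 regardless of quality. The sample size and seed are arbitrary choices.

```python
import random


def simulate(n=100_000, seed=0):
    """Fund all men; fund each woman on a fair coin flip, ignoring quality."""
    rng = random.Random(seed)
    male_quality = [rng.random() for _ in range(n)]
    female_quality = [rng.random() for _ in range(n)]

    funded_male = male_quality
    funded_female = [q for q in female_quality if rng.random() < 0.5]

    def mean(xs):
        return sum(xs) / len(xs)

    return {
        "male_rate": len(funded_male) / n,
        "female_rate": len(funded_female) / n,
        "male_mean": mean(funded_male),
        "female_mean": mean(funded_female),
    }


result = simulate()
for key, value in result.items():
    print(f"{key}: {value:.3f}")
```

The funded women are a random half of the applicants, so their average quality matches that of the funded men to within sampling noise. A test that only compares the performance of accepted applicants sees nothing, even though the acceptance rates differ by a factor of two.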

Obviously that specific scenario is not realistic, but I believe something like this is plausible enough: A VC funds all male founders who are considered promising, and all female founders who are considered promising AND went to school with one of the partners.

And it's not hard to imagine a plausible scenario in which the test would give false positives rather than false negatives.

A small request: When you amend an article after it is published please note the change in a footnote. It took me a long time to realise the top comment on HN was referring to an older version of the article that didn't mention the equal distribution of ability.

Correlation != causation, and even if the correlation reveals something, with no explanatory theory, we are still at step one.

This essay was based on these two lines from here: http://10years.firstround.com/#one

> That’s why were so excited to learn that our investments in companies with at least one female founder were meaningfully outperforming our investments in all-male teams. Indeed, companies with a female founder performed 63% better than our investments with all-male founding teams

The comparison is not clear, but it is not women versus men; it is between companies with X male founders plus at least one female founder, and those with zero female founders and Y male founders.

If we skip a step and take this fact as having some predictive value, it could be lots of things, including off-the-top-of-my-head:

1. Bias against women - which extends to teams that include men, i.e. the bias against women exists even in the presence of male co-founders.

2. That the personality traits shared by groups where women co-found startups with men are positively correlated with success. It is quite possible that these groups have much better EQ, while still retaining the IQ to impress the required amount to be selected.

3. That startups with at least one female founder select, and I am using this term in a very stereotyped way, "not-white-male" startups. Many Unicorn startups, from Atlassian to Dropbox, specialise in problems faced by, again for want of a better term, "white males". Given the mantra of solving problems we have ourselves, it is possible that mixed groups choose less male subjects. As men have been the founders of the majority of startups to date, there must be a plethora of such startup ideas left untouched. One example is Diajeng Lestari, who started https://hijup.com/, described as "A pioneering Muslim Fashion store. We sell fashion apparel especially for Muslim women ranging from clothing, hijab/headscarf, accessories, and more." Little to no chance the archetypal "white male hacker founder" has that idea.

That's three ideas off the bat, only one of which is bias. It could still be bias, but I feel that points 2 & 3 are at least good candidates for exploration. Personally, I think there must be a lot of low hanging fruit in ideas not aimed at men, and female founders seem ideally poised to have those ideas.

No, in the null hypothesis, for the distribution, don't have to care; the distributions of the measurements for the men and the women can be different; the work can be distribution free, except assume that the expected values on the measurements are the same for the men and women.

For details, see my post


No in the null hypothesis, for the distribution, don't have to care, can be distribution free, except assume that the expected values on the measurements are the same. For details, see my post


The problem here is language and what our actual objectives are.

When people complain about bias, they are not really talking about mathematical bias, but about something else: Their idea of fairness. They are talking about discrimination. And when we are discussing that, we can't really think about whether rules are applied fairly or not, but whether the rules produce the outcomes that we want.

Let's go for a ludicrous example: We'll accept all applicants whose IQ is higher than their weight in pounds. We'll be explicitly discriminating against heavy people, but at the same time, we have pretty clear implicit biases against men, and ethnic groups who tend to be taller. We might as well have said that we prefer children and Japanese women. There's no need for mathematical bias: The bias comes from the rule selection.

So, in your example, if our actual objective is to graduate an even amount of people from groups A and B, we have to, explicitly, make it easier for group B to get the scholarship. And many times organizations have objectives like that.

As a more real example, let's consider a police department. If the objective is to have a racial makeup that represents the community, and different races have different drop-out rates, the candidate selection will prefer one kind over the other, precisely to counter the drop-out differential.

So when regular people, and not mathematicians, discuss bias, the mathematical definition is unimportant. The one important thing is our stated objectives.

Your comment makes sense; indeed, fairness in the selection process does not imply that people aren't being discriminated against. For instance, the SAT is fair, but denies those with fewer opportunities a chance to get into top-tier schools. I can get on board with that.

> We can't really think about whether rules are applied fairly or not, but whether the rules produce the outcomes that we want.

This is a more explicit way of phrasing an attitude that I've noticed in my community (a liberal U.S. university). However, I don't think it's obvious that this is the right principle to uphold.

I squirm with discomfort at the idea that we will only support "fairness" and empirical data to the extent that it is applicable to the outcome that we personally desire. This seems to imply that all evaluation metrics are "biased", until we can find a measure that selects equal representation across all demographics, regardless of the size of applicant pool or ability distribution among that pool.

What outcomes, exactly, do we want? More representation of under-represented groups? How does this relate to the goal of maximizing return on the portfolio? What does this mean for people who want a "meritocracy" (if such a thing can exist)?


It's not easy to tell when bias shows up. College rankings might look like an unbiased formula, but it's selected so the 'top' schools end up being highly ranked instead of measuring useful things. Things like a high faculty to student ratio don't actually directly have much impact but it's the kind of stat easily gamed by 'top' schools so it's gotten some sort of mythic importance even if these people don't actually teach undergrad classes.

You can find the same inherent bias in many walks of life. Many of the hurdles to becoming a Doctor have nothing to do with being a good Doctor; they're just there to ensure the right kinds of people get into and out of the program.

What surprised me a bit is that pg decided to use the word "bias" without any clarification, considering his background in computer science and AI.

Anyway, I think pg's whole argument is rather moot because the three assumptions that he states are incredibly difficult to measure (Part of the reason why it is very difficult to argue for or against affirmative actions without coming across as "biased").

Unfortunately I think this is a problem with many of his essays. They often present a very specific argument with reservations, which makes the argument very hard to disagree with, since you have to argue relevance, which requires a lot more insight. It's therefore taken as truth by the readers, even if the original argument doesn't support their conclusion. In general I think they should be seen as opinion pieces rather than essays. I have a hard time seeing many of them being up to e.g. basic university standard.

Great comment. There are two types of fairness, (a) fair rules, and (b) fair outcome.

Which are essentially two of the big branches of normative ethics: deontology and consequentialism.

Favoring fair outcome over having fair rules is against everything I believe. We should not strive for a participation trophy culture.

"Favoring fair outcome over having fair rules"

Which essentially no one does? People favor fair outcome because they don't think the rules are, or can be, fair and often as a proxy for rules becoming more fair.

I disagree. Rephrasing the tradeoff a little, Westerners are not willing to accept bad outcomes as a matter of course. It is probably objectively fair for individuals to stop wearing seatbelts or to "responsibly" pursue a meth habit, but those things are still illegal.

The "fair" rules would probably be to let people do stupid things and accept their own consequences, but Western culture is not willing to let houses burn down because people didn't buy into the local fire department co-op.

This well-circulated image shows that making everyone a winner has merit in some circumstances.


The left hand side is fair rules, the right hand side shows a fair outcome

It's never a bad idea to think about how these ideas play out when taken to their logical conclusion.

Kurt Vonnegut on the subject: http://www.tnellen.com/cybereng/harrison.html

That's alright if the goal is to help the individuals, like welfare. But it's not OK if the goal is to get people to do the most extreme things, like job applicant selection looking for a "best" applicant or baseball team for a best player. A person who can be successful without needing as many boxes to stand on as others.

Don't forget that it's these "best" people who add a vastly disproportionate amount of value to the world. They're the ones who invent new technology and discover new science. We all benefit greatly from their success.

The link is down (shows a jpeg with a single white pixel), but I guess that one is the same:


It's really more like this version of the image: http://i.imgur.com/DqKXPF3.png

The baseball game in the image wouldn't be worth watching if equality of outcome were the rule for baseball team tryouts. The entire game is based on fair competition under the rules pushing participants toward excellence.

Inequality of outcome is the entire reason we see baseball played at a high level. When you demand equality of outcome regardless of talent or effort, you're asking for society to stagnate. You're asking for pervasive mediocrity. You're asking for us to kill effort and motivation. No thanks.

You're beating a straw man. He said in some circumstances.

Yet he didn't list any or describe any criteria for evaluating them. The "in some circumstances" bit was just a way to weasel out of potential objections.

I thought the example was pretty obvious.

Indeed. The problem is that people frequently infer unfair rules from unequal outcomes, without taking into account the possibility of systematic group differences.

    Alan: I believe in equality of opportunity, not equality of outcome.

    Bob:  How do you know there isn't equality of opportunity?

    Alan: Well, just look at how unequal the outcomes are!
At this point, Bob would be wise to change the subject, because if he pressed on, he might get this:

    Bob:  Can you give me an example?

    Alan: Group X is underrepresented in Field Y.

    Bob:  Maybe Group X isn't as good at Field Y.

    Alan: What? That's racist and/or sexist!
And if Bob were to make a comment to this effect on Hacker News, he'd probably get downvoted. This is because most people agree with Alan, and many of them abuse their downvote privileges to punish ideas they disagree with rather than those that don't further the discussion. This degrades the quality of discourse, but at least it helps reassure the downvoters that they aren't racist and/or sexist.

Bob: Can you give me an example?

nl: Sure.

To overcome possible biases in hiring, most orchestras revised their audition policies in the 1970s and 1980s. A major change involved the use of 'blind' auditions with a 'screen' to conceal the identity of the candidate from the jury. Female musicians in the top five symphony orchestras in the United States were less than 5% of all players in 1970 but are 25% today. We ask whether women were more likely to be advanced and/or hired with the use of 'blind' auditions. Using data from actual auditions in an individual fixed-effects framework, we find that the screen increases by 50% the probability a woman will be advanced out of certain preliminary rounds.[1]

Bob: What? But that doesn't count because...

[1] http://gap.hks.harvard.edu/orchestrating-impartiality-impact...

I didn't say there are no valid examples of bias, just that many people assume unequal outcomes must result from unequal opportunity, ignoring the possibility of real group differences. Surely you know many real-life Alans who see bias every time a particular Group X is underrepresented in Field Y. Moreover, the values of X aren't random; you'll almost never hear complaints of bias regarding, say, trash haulers, or NFL cornerbacks. (But NFL quarterbacks—ah, plenty of bias there!)

Speculative explanations - i.e. assumptions - are not uncommonly offered in support of the proposition that there is no underlying bias. Nl provides a rare example of the assumptions being systematically investigated.

Right, and unwillingness to consider the possibility of group differences comes from a quasi-religious devotion to the blank slate model of human nature. The way radical egalitarians see it, we're not only equal in dignity, but in potential.

That's a pretty view, but it's inconsistent with reality, and radical egalitarians need to come up with increasingly implausible explanations to explain everyday circumstances that make perfect sense once you drop the blank slate model.

There is a simple explanation for differences in abilities between groups that has nothing to do with their genetics or so-called natural ability: the fact that groups often grow up around other members of their group. Both nature and nurture are largely in common for many groups, so it could easily be either that causes the observed differences in ability.

could easily be either

Or both. They're not mutually exclusive. Do Jamaican sprinters excel because they grow up around other sprinters or because they are blessed with natural ability? Yes. Simply put, or != xor.

Hey rewqfdsa, maybe shoot me an email some time. Address is in profile.

Apply Occam's Razor to these supposed group differences. Which do you think is a more plausible reality?

A. Interviewers prefer candidates who are like themselves, interviewers are mostly white men, therefore most hires are white men.

B. The uterus and melanin both inhibit programming ability, interviewers are perfect judges of programming ability, therefore most hires are white men.

To look at the present (incomplete) evidence and decide that B is the more likely story, is racism/sexism.

B. The uterus and melanin both inhibit programming ability, interviewers are perfect judges of programming ability, therefore most hires are white men.

Serious question: can you at least steel man this point of view rather than making it a ridiculous straw man? If you cannot steel man it, what makes you so sure you really understand the argument?

For bonus points, you can also point out the glaringly obvious complication to this chain of logic: A. Interviewers prefer candidates who are like themselves, interviewers are mostly white men, therefore most hires are white men.

Which is more plausible?

a) the action of natural selection, sexual selection, and the hormone environment magically stop at the blood-brain barrier, or

b) there are real group differences between human populations?

We've already eliminated all overt discrimination. If you continue to cry discrimination, you're essentially postulating a giant unconscious conspiracy. I find the idea wildly implausible. It's much simpler to just accept that not everyone is equal in aptitude and ability.

Women musicians started getting orchestra positions in much greater numbers after auditions were made blind.

If biases affect how a professional musician hears music, is it so shocking to think unconscious bias might affect someone's judgment of a candidate based on multiple fuzzy factors like ability, culture, and personality?

And that's just for job applications. You really think the criminal justice system has removed unconscious bias?

Criminal justice and orchestra employment are non-market phenomena. The people in charge do not benefit if the orchestra is great and are not accountable if innocents go to jail and murderers spree freely and the tubas clank.

So of course the bosses pick out their friends and cronies. And a decent polity should restrain their corruption with blind auditions and accountable audits of prosecutions.

But investors should be looking for a good return on their money. They should be looking for the best investments they can find. If they're not, that is the source of bias right there.

Of course, the Wall Street industry is located in New York because you can use big city lights, strippers, and steaks to scam small town municipal pension fund managers who aren't investing their own money. Sand Hill Road is supposed to operate on different principles.

The people in charge do not benefit if the orchestra is great

Have you ever actually worked for an orchestra? I have. The Chicago Symphony, Boston Symphony and other top orchestras take quality very seriously.

Do you think, say, Georg Solti or Daniel Barenboim were happy with "just pretty good" musicians? Their reputations (and fortunes, for top conductors are very well paid) depend on consistently outstanding performances.

And I don't know how you call it non-market. When your income depends on millionaires donating vast sums of money, you damn well better care about quality.

It's like saying a football coach doesn't benefit if his team drafts the best players.

> you're essentially postulating a giant unconscious conspiracy

Let me introduce you to the extensive scientific literature on implicit bias: http://www.aas.org/cswa/unconsciousbias.html

If the bias reflects a real Bayesian prior, it isn't the kind of bias that's unjust.

Not so. Priors/posteriors are only as good as the model they're based on. For instance, if you choose parental income as the feature, it can be a stronger signal than skin color, even though both may be good predictors. But the correlation between income and skin color can account for the predictive power of one feature when the other feature is the true cause.

So if Americans from race A commit ten times more violent crime than others and the police consequently accuse and manhandle vast numbers of innocent Americans of race A, there's nothing unjust about that? The vast majority of citizens of race A are innocent of all offenses but deserve constant suspicion and low level official humiliation and violence in a just world for no reason other than being the same color as some crooks.

I don't agree.

accuse and manhandle vast numbers of innocent Americans of race A

One doesn't need to advocate accusing and manhandling to think the police should use statistically valid inferences in the name of justice. I myself am a member of a minority group—men—that is responsible for a vastly disproportionate share of crime, especially violent crime. You could mandate that cops ignore this reality and treat men and women with equal suspicion, but the result would be worse policing. For example, if you look at the statistics for New York's supposedly racist "stop-and-frisk" policy, you'll find that the disparity between whites and blacks is smaller than the disparity between men and women—indeed, smaller even than the disparity between white men and black women. Why have you never heard stop-and-frisk described as "sexist"?

This inconsistency is best explained politically: complaining about racial injustice against blacks is an effective route to power; complaining about gender injustice against men is not. It's the same reason you hear constant complaints about how white tech is, but not about how black sports are. Jesse Jackson can effectively shake down Apple and Intel [1], but there is no white equivalent shaking down the NFL. (Can you imagine if "increasing diversity in the NFL" meant "increasing the relative proportion of white players"? It would be a different world—not, incidentally, one I would particularly want to live in.)

Being male means people will infer based on a superficial assessment that I'm more likely to be a criminal than, say, my sister. But that inference is correct. Being a member of such a group is my lot in life, and complaining doesn't change what is.

[1]: See, e.g., http://www.mercurynews.com/census/ci_29048321/q-jesse-jackso...

A straightforward application of evolutionary biology to Homo sapiens yields group differences as the null hypothesis. You've done nothing but construct a ridiculous strawman to refute this. Moreover, discrimination and group differences aren't mutually exclusive—it's possible that Group X's underrepresentation in Field Y is the result of both discrimination and group differences. The only way to know for sure that it's pure discrimination is to show that group differences are negligible. This requires actually measuring them (which in fact has been done in exhausting detail [1]), but even suggesting the possibility of group differences frequently leads to accusations of racism and sexism—as you've just so ably demonstrated.

[1]: See, for example, The Blank Slate by Steven Pinker. Then, once you get over your knee-jerk "That's racist!!!" reflex, take a look at (I mean actually read for comprehension) The Bell Curve by Herrnstein and Murray. Maybe add a little Cavalli-Sforza (via Steve Sailer) to the mix (http://www.vdare.com/articles/052400-cavalli-sforzas-ink-clo...). You can then graduate to basically anything by Arthur Jensen. As a topper, read "Rational"Wiki's entry on Human Biodiversity (http://rationalwiki.org/wiki/Human_biodiversity) and cringe at the smug, supercilious tone, endless strawmanning and distortion, and at the realization that you, too, were once taken in by the ridiculous "mainstream" views. (I certainly was.)

Thanks for literature and links. Hope I'll find the time to read this. The article from vdare.com seems really emotional to me. Not very reputable.

"Don`t believe any of this. It`s merely a politically-correct smoke screen that Cavalli-Sforza regularly pumps out to keep his life`s work — distinguishing the races of mankind and compiling their genealogies — from being defunded by the leftist mystagogues at Stanford."

"As you can imagine, this finding could get him in a bit of hot water if the campus thought police ever found out about it."

This may be a better place to start -- https://jaymans.wordpress.com/jaymans-race-inheritance-and-i... and https://jaymans.wordpress.com/about/. VDare tends to preach and agitate to the already converted, that is the peril of having to survive on donations. That said, while Steve Sailer is snarky, he is also reputable. He takes good care to not get things wrong. I've been following him for a while, and when some bit of news comes out, or some new policy gets announced, and the NY Times says one thing, and Sailer says another, Sailer almost always ends up getting proved right.

If you want a book length treatment, Michael Hart's Understanding Human History is the complete opposite of the typical, Jared Diamond, environmentalist accounts of human society. It is worth perusing - https://lesacreduprintemps19.files.wordpress.com/2012/11/har...

Understanding Human History is one of my favorite books. I finished it and immediately reread it. This was especially instructive given that a decade ago I read Jared Diamond's Guns, Germs, and Steel twice as well, quite innocent of the political subtext.

This is a smart insight, although in fairness the article suggests, in its female founder example, that there was discrimination against women -- that given men and women of equal ability, men were more likely to be chosen.

Condemning that inequality is different from affirming that selection should be altered to produce the outcomes people view as fair. One is saying, "Don't discriminate against Xs." The other is saying, "Not only can't you discriminate against them, you need to ensure that Xs have outcome Y. That is, you may be required to discriminate in their favor."

The latter is a value, and your point about mathematics being irrelevant stands. But the former is a mathematical claim, and pg was making a mathematical claim, so the mathematical argument you replied to is relevant.

I think the grandparent example is pointing out that it will be hard to use this for sexism claims precisely because studies have shown that the variance in ability in male and female populations is, in fact, different. The studies I'm aware of show more men at both extremes of the bell curve. So more men at the very top and bottom in IQ measurements[1].

There's a genetic basis for this, as well: women have two X chromosomes, whereas men have one X and one Y, so for X-linked genes there's no second copy to take over in men, leading to more extreme outcomes, whether good or bad.

Now of course we ought to treat every group of people fairly, but we do need to examine our priors when doing so, especially when proposing ways to detect and punish people who may be thinking bad things, consciously or otherwise.

[1] We may not know just what 'IQ' is, but we do know that tests of mental ability all correlate with each other, suggesting an underlying factor. This, in turn, can be correlated with many other things, like success (or lack thereof).

Sorry but your reasoning does not let pg off the hook. In his article he says that the way to determine bias is by measuring the performance of those that got through. With your excuse, all you have to do is measure the number of people that got through in a particular group relative to the other group.

But the root article failed at its stated objective. It concluded that First Round Capital is biased against females. There is insufficient data to support that claim! It might actually be that FRC is biased FOR females. But the process leading up to FRC was so biased against females that females are still at a net loss.

The article defines bias as follows:

> Want to know if the selection process was biased against some type of applicant? Check whether they outperform the others. This is not just a heuristic for detecting bias. It's what bias means.

Under that definition, you have been biased against A. [edit: on reflection I see this as a weakness of his definition. I missed that your selection process does in fact select the best candidates.]

Yes, but that's not the common usage of the word, or how most people understand it. With that usage you could say Ivy League schools are NOT biased against Asians since Asian graduates aren't more successful than non-Asian ones, except nobody does.

Hypothetical logic is flawed anyway. Higher ability Asian graduates could be less / only equally successful in the workplace due to pervasive external bias too. A lot of this is exacerbated by the fact that "soft skills" are more important for high status careers, and your "soft skills" are pretty much defined by tribal associations. It's the core of how we interact socially, and it causes problems that are really only fixed by alleviating scarcity.

Unless you know what exactly caused A to outperform others, you won't really know if the process is biased or what made it biased.

When asserting biases, you must first distinguish them from random noise. Using pg's logic, every selection process that isn't perfect is biased.

>> This is not just a heuristic for detecting bias. It's what bias means.

> Under that definition

That's not a definition. It's a claim about what the term "bias" means.

Graham's intuition is assuming equality of the two distributions.

As I noted in a different comment here, you can pretty easily fix Graham's test. Compute min(accepted A) and min(accepted B) instead of the means. In your example, the min of the accepted distributions would both work out to be 80%.

This assumes that the populations of A and B are of the same size. A larger sample will tend to have a lower minimum under many real world distributions - a sample of one will have its minimum equal to its maximum.
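
The sample-size effect on the minimum can be seen in a small simulation. This is only a sketch: the shared cutoff of 80, the exponential shape of the excess over the cutoff, and the name `mean_sample_min` are assumptions made for illustration and do not come from the thread.

```python
import random

def mean_sample_min(n, cutoff=80.0, scale=5.0, trials=2000, seed=0):
    """Average minimum of n accepted scores drawn as cutoff + Exp(mean=scale).

    The exponential shape is an assumption made purely for illustration.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += min(cutoff + rng.expovariate(1.0 / scale) for _ in range(n))
    return total / trials

# Same cutoff and same distribution for both groups; only the group size differs.
small = mean_sample_min(5)
large = mean_sample_min(500)
print(f"n=5: {small:.2f}, n=500: {large:.2f}")
```

Under this assumed distribution the expected minimum sits about scale/n above the cutoff, so the smaller group shows a visibly higher minimum even though both groups face the same bar.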

Another reason the use of mins here is not helpful, is that adding one equally awful accepted candidate to group A and B would then remove whatever bias there was according to the test, which is not what we want the test to indicate.

The idea by PG is a rough rule of thumb and breaks down trivially - suppose VC fund X were to accept all candidates, but group A was worse than B, the test would falsely imply that the fund was biased.

It's unfortunate the idea was dressed up in statistical persiflage because it isn't rigorous -- it's a rough guideline. To make it rigorous would be very hard: either the abilities of the candidate populations would have to be measured very closely (unrealistic), or a more scientific experiment conducted (A-B test where candidates from each group are included or excluded opposite to the prior decision, which would need big groups).

No, it doesn't assume or require equality. The p-value you compute for this test will be proportional to the smaller of the sample sizes.

Mins will fail if you conspire to cheat the test, it's true. Very few statistical tests stand up to conspiracy theories.

Compute min(accepted A) and min(accepted B) instead of the means.

Dude, your comments are normally smarter than this. Yeah, you can easily fix Graham's test -- all you need are some numbers that do not exist and that we cannot measure.

We're talking about VCs evaluating founders. That does not, and cannot, get reduced to a numerical score. And even if VCs did use some sort of scoring rubric, then we would still not know if there was unfairness in the way they made the scores, or unfairness in the selection process. It would just be punting the problem down a layer. PG's central claim -- that a third party can detect the bias/unfairness in the funding process just using math -- is false.

You can only know if the process is biased/unfair if you have deep qualitative understanding of the process.

A charitable interpretation of what he or she said is this: don't evaluate bias by looking at outcomes of the average applicant, look at the outcomes of the borderline applicants. Even if there is no perfect way to define or measure the minimum acceptable applicant, I think it is reasonable to identify whether applicants were borderline or not.

Isn't that, by the way, what YC has been saying for years in their rejection letters? "We're always surprised by how many of the last companies to make it wind up being the most successful"? Something like that.

A charitable interpretation of what he or she said is this: don't evaluate bias by looking at outcomes of the average applicant, look at the outcomes of the borderline applicants.

That is fine, that is what he was saying. The point is that his solution is completely impractical for the original goal of finding an objective, statistically valid way of measuring whether bias exists. "Borderline" cannot be measured objectively, only by subjective rubric scoring. And when you only measure the borderline candidates, you have reduced an already way-too-small sample even further.

PG and I are assuming a measurable outcome, which the selection process is explicitly supposed to predict.

I made no claims about practicality - right now all I have is a little bit of measure theory showing that pg's algo is, in principle, fixable. I fully agree that the First Round Capital data he cites is inadequate (and also wrong, due to the unjustified exclusion of Uber, which they explicitly note would alter the results).

My concrete claim: PG's idea for a statistical test is solid, I can (and shortly will) prove a toy version works, and given enough work one can probably cook up a practical version for some problems.

"Your idea isn't 100% perfect right out of the gate" is a very unfair criticism. Are we supposed to nurture every idea in complete secrecy until it is perfect?

OK I missed that you meant "easily fixed" in the strictly mathematical sense, not in the practical, real-world application sense.

With statistics on human affairs, 99% of the hard part is not the math, it is applying that math to a complicated, heterogeneous, and difficult-to-measure underlying phenomenon. And in most cases, statistics alone will never give you a straight answer; the best they can do is supplement and confirm qualitative observations. Failing to recognize this is how you get all those unending media reports about how X is bad for your health. PG's post was at the level of one of those junk health news articles.

And because human affairs are hard, we should criticize anyone who dares to voice an idea they haven't fully figured out yet.

This idea that statistics can only confirm and supplement "qualitative observations" (I.e. my priors) is completely unscientific and anti-intellectual. If that's true, forget stats - let's just write down the one permitted belief on a piece of paper and not waste resources on science. Science is really boring when only one answer is possible.

This idea that statistics can only confirm and supplement "qualitative observations" (I.e. my priors) is completely unscientific and anti-intellectual.

Since when is investing in startups a science? What is anti-intellectual, what is anti-science, is to use the wrong tool for the job. Human affairs are not a science in the way that physics is a science. Statistics are far, far more fraught because there are so many variables in play, phenomena are hard to quantify, each case is so heterogeneous, etc. You cannot use statistics in human affairs without also having a very good observational understanding of what is actually going on, otherwise you will end up in all sorts of trouble.

So the PG estimator is clearly problematic. I agree that the yummfajitas (YM) estimator looks to be consistent. In this case though, we're dealing with (small) finite sample sizes, so we need to come up with some sort of test statistic. What would the YM test be here? It seems tricky since you are dealing with a conditional distribution based on left-censored data. I'm also not aware of any difference-of-minimums test, though I am happy to be educated if there is one!

I don't know of something to refer to, but I don't think the statistics are too hard. The test statistic would be exactly min(sample1) and min(sample2).

Suppose the cutoff sample is distributed according to f(x)H(x-C). Then the probability of the minimum of a sample of size N exceeding C+e by random chance, assuming the null hypothesis, is p = (1-\int_C^{C+e}f(x) dx)^N.

So now you have a frequentist hypothesis test. If you make reasonable assumptions on f(x) (non-vanishing near C, quantified somehow), it's even nice and non-parametric.
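
The p-value formula above can be sketched in a few lines. This is an illustration under stated assumptions, not the commenter's implementation: the function names are hypothetical, and the exponential null (accepted scores distributed as the cutoff plus an exponential with mean 5) is chosen only so that the integral has a closed form.

```python
import math

def min_test_p_value(observed_min, cutoff, n, mass_near_cutoff):
    """P(min of n accepted samples > observed_min) under the null.

    mass_near_cutoff(e) must return the probability that a single accepted
    candidate falls in [cutoff, cutoff + e]; the caller supplies it.
    """
    e = observed_min - cutoff
    if e <= 0:
        return 1.0
    return (1.0 - mass_near_cutoff(e)) ** n

# Assumed null for illustration: accepted scores are cutoff + Exponential(mean 5).
def exp_mass(e, scale=5.0):
    return 1.0 - math.exp(-e / scale)

p = min_test_p_value(observed_min=83.0, cutoff=80.0, n=10, mass_near_cutoff=exp_mass)
print(p)
```

With this assumed null the expression collapses to exp(-N e / scale), so a group of 10 whose weakest member sits 3 points above the claimed cutoff is very unlikely under the null. Any other lower bound on the mass near the cutoff can be passed in instead of `exp_mass`.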

Does that assume both samples are identically distributed and the only difference is the cutoff? If it does, then couldn't we just continue to do a difference of means test and still be consistent? If it doesn't, how do you handle identifying the cutoff minima and the two different distributions in a frequentist way?

The only assumption I need is that P_{f,g}([C,C+d]) >= h(d) > 0 for some arbitrary monotonic function h(d). This comes directly from the p-value formula.

I.e., for any d, there is a finite probability of finding an A or a B in [C,C+d]. I don't actually care what the shapes of f or g are at all beyond this - as long as this probability exists and is bounded below (in whatever class of functions f and g might be drawn from), it's all fine.

Sorry, I'm confused here. A p-value makes an implicit assumption that your null hypothesis is a known N(0,1). That may be throwing me off a bit. I get the point of you want to look at the likelihood function which is just one minus the CDF in the given interval. I'm just not clear on how you can get around f and g being arbitrarily parameterized functions of a given class. Are you assuming we know the class and something about f?

A null hypothesis is just a specific thing you are trying to disprove. In this case, it's simply that the min of both distributions is identical.

I am assuming we know exactly one thing about the class the measures f and g come from: for every function in that class, \int_C^{C+d} f(x) dx >= h(d) for some monotonic function h(d).

The p-value is then bounded in terms of h(d), since p <= (1-h(d))^N.

Okay, I'll have to wait for your full write-up, because I am not seeing the path of thought here.

I'll have you know that this particular subthread was way too civil for the Internet.

You can only rarely calculate min(accepted a) in the actual world. In this example, the college learns no distributions; they only know whether the student passed or failed.

So in this particular case, assuming that First Round's sample size is significant, it may just be that the female founders who seek them out are just on average better than the male ones? I suppose that if women think that the selection process is biased against them, and most do (and it may be) perhaps the less than excellent ones just don't apply, whereas that isn't true for males?

The problem is different than that. It's that the measurement of performance is susceptible to differences in the conditional distribution of disparate groups, given that they were selected.

We can illustrate it by modifying the GP example to not include time at all, and make the metric perfect.

Suppose we are selecting for the next qualifying round for the Olympic 400m team, and select candidates if their 400m is under 50 seconds. Then, we measure performance -- immediately -- by a 400m run. We have two candidate pools: people who compete in the professional circuit, and everybody else. 100% of the former group who apply qualify, while only 10% of the others do.

Okay, so now we immediately measure the average 400m time of all professionals, versus average time of all amateurs who can beat 50 seconds. It's pretty reasonable to expect that the professionals might average closer to 45 seconds, while the other group might average around 48 (I'm not a 400m expert, the actual numbers might be off. WR is 43 seconds).

According to the article, we now conclude that our selection process is actually biased _against_ professionals! This is at the least very counter-intuitive. Maybe we provide both groups with coaching, and re-test after 6 months, a year, whatever. The professionals will certainly still outperform the amateurs. However, suppose one of the amateurs goes on to greatness, and wins. Wouldn't this obviously be biased against amateurs, according to our intuition?
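
The 400m example can be simulated directly. This is a rough sketch: the normal distributions and their parameters are assumptions picked so that nearly all professionals and roughly 10% of the others beat 50 seconds, and `qualified_times` is a hypothetical helper.

```python
import random
import statistics

CUTOFF = 50.0  # seconds; the same rule is applied to both groups

def qualified_times(mean, sd, n, rng):
    """Draw n hypothetical 400m times and keep those under the cutoff."""
    times = (rng.gauss(mean, sd) for _ in range(n))
    return [t for t in times if t < CUTOFF]

rng = random.Random(1)
N = 20_000
# Assumed ability distributions; the numbers are illustrative only.
pros = qualified_times(mean=45.0, sd=1.0, n=N, rng=rng)
amateurs = qualified_times(mean=56.4, sd=5.0, n=N, rng=rng)

print(f"pros: {len(pros) / N:.0%} qualify, mean {statistics.mean(pros):.1f}s")
print(f"amateurs: {len(amateurs) / N:.0%} qualify, mean {statistics.mean(amateurs):.1f}s")
```

Selection here is perfectly even-handed by construction, yet the selected professionals are faster on average than the selected amateurs, which the comparison-of-means test would read as bias against professionals.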

So in this particular case, assuming that First Round's sample size is significant, it may just be that the female founders who seek them out are just on average better than the male ones? I suppose that if women think that the selection process is biased against them, and most do (and it may be) perhaps the less than excellent ones just don't apply, whereas that isn't true for males?

First, the sample size is not significant. Adding back one data point, Uber, which was a real data point that was intentionally removed, likely reverses the effect.

But imagine we had a real sample of thousands of companies, and it did show the result.

A typical scenario is that different demographics might connect with First Round via different deal flow channels. For instance, one channel might be longstanding personal connections, another channel might be outreach to companies in the news.

Now imagine female founders are much more likely to be found via outreach rather than personal connections. Perhaps this is due to a negative personal bias -- the VCs are less likely to be chummy with females because of their sex. So they only find female founders when their company is in the news.

It is typical in all businesses that different deal-flow channels have different average returns. So:

* If both channels perform equally well, no bias will be seen in the statistics, even though the VCs are in fact biased.

* If the outbound channel generally performs worse, then women founders in the sample will perform worse than average, even though the VCs are actively biased against them (they are ignoring all the women who would have done well, if only they had personally known them. Since the VCs never invest in them, their results are not measured). This is the opposite of the statistical relationship that PG claims should exist.

* If the outbound channel generally performs better, then women in the sample will be better than the average.

I should also add that the differences in the channels might be due to a positive bias on the part of the VC -- perhaps they do more aggressive outbound outreach in order to get more female founders in the pipeline. Or the difference might be due to something completely neutral.

The lesson here is that using statistics is a perilous endeavor. If you want to detect something like bias, you cannot use numbers alone, you need to combine any numbers with a deeper understanding of the selection process. There is no way that a third party can run a simple correlation and determine with any degree of certainty that the field is in fact biased or not.

This seems plausible to me (though no less problematic, of course).

Great insight. I had the great fortune to take a course with Gary Becker where we went into some of the mathematics of college admissions. He made this precise point -- the variances of the particular populations you are looking at matter a great deal. He managed to build some pretty convincing models which provided a compelling narrative for "biases" that we seemed to observe in the real world, all with simple changes to the distributions of populations. Great comment.

I had the great fortune to take a course with Gary Becker

Lucky you. He was great. I never had the chance to take a class from him but it would have been worth a Chicago winter to have the chance.

Second-best: https://www.youtube.com/watch?v=QajILZ3S2RE&list=PL9334868E7...

(I took this very class the year prior)

I think there's a big problem with this counter example. Which is that you're selecting for one thing and judging whether or not you were biased based on another.

> Now suppose that we select without bias or inaccuracy all the applicants that have an 80% or better chance of graduation.

This is subtly different from selecting for the highest graduation rate possible because it's binary: you want a group with >80% chances, not a group with the best chances. Imagine if, instead of the distributions you had, group A was composed of people with a 100% chance of graduation and B was composed of people with an 80% chance of graduation. Our process does nothing to distinguish between those people because that extra 20% chance of graduation doesn't matter.

This brings me to what I think is the fundamental problem with your criticism: it's not clear to me what it means for a group in your example to overperform. If you select a group with the goal of 80% of them graduating, it doesn't make sense to call 90% of them graduating an overperformance. That only makes sense if your goal up front is to maximize the graduation rate.

I think if you rerun your example but instead assume an unbiased strategy that selects for the highest graduation rate possible you'll find that pg's essay makes a lot more sense.

I find it slightly amazing he posted this. Did he not bother to ask someone with a probabilistic background before posting?

Very interesting! Seems like I don't understand probability as well as I think I do, since I bought the argument till I ran into this comment.

So, what are the effects of variance in different evaluation contexts, and do we have a meaningful way to measure bias if we take variance into account?

My initial reactions:

- It seems the higher the variance, the better examples you can trot out to say you are not biased against that particular group, since you'll always find a member of that group who does amazing.

- If distributions of performance are multimodal, it's even harder to conclude stuff because different institutions might cut off different modes when selecting the bar.

- Modeling the sources of variance may lead to insight into any actual bias.

This logic is similar to that of Larry Summers in his remarks about diversity in science and engineering, http://www.harvard.edu/president/speeches/summers_2005/nber.....

> But we haven't been biased against A. [...] It was just their prior distribution that was different.

This is why that isn't a counter example. Right at the beginning, the article clearly states that this method is only applicable if the prior distribution is equal:

| You can use this technique whenever (a) [...], (b) [...], and (c) the groups of applicants you're looking at have roughly equal distribution of ability.

AFAIK pg added that to the article after WildUtah made his comment, and pg acknowledges this in a reply to WildUtah.

Your premise discriminates between two groups.

It's not clear how you can assign a candidate a 90% chance of graduation. That probability must be a subjective assessment that has come from some (biased) source. In truth, an individual will either graduate or not.

In your example, you can assign 0% and 100% probabilities in group A, but you can't in group B. The most plausible mathematical explanation for that is that you collected insufficient relevant information about candidates in group B.

> This short comment is not up to pg's usual high standards for his essays.

I can almost hear him thinking in response, "if I throw a dog a bone, I don't want to know if it tastes good or not."

I suppose PG really should have said:

>the groups of applicants you're looking at have exactly equal distribution of ability.

Rather than "roughly equal".

But obviously that makes the whole thing infeasible.

Absolutely spot on. Differences in distribution are only one way in which you could disprove pg. There are others. For example, different "treatment effects". If, conditional on getting selected, VCs pay more attention to women or are more useful for them, then that would be another reason that we would get the pattern pg proposes, but it is not due to bias at selection.

Why are you using biased mathematics? If statistics and the scientific method (the tools through which dead white men continue to colonize) give us obviously problematic results, we should abandon them in favor of a method of inquiry that promotes social justice.

I enjoyed this comment.

> social justice

Gee, I never saw a definition. Not sure the meaning of the phrase is clear without a definition.

Please tell me this is sarcasm...

Based on rewqfdsa's comment history, this is probably sarcasm.

correlation <> causation

Fortunately there's a way to measure bias that's much more reliable, when it can be used....A couple months ago, one VC firm (almost certainly unintentionally) published a study showing bias of this type. First Round Capital found that among its portfolio companies, startups with female founders outperformed those without by 63%.

Except if you want to use statistics to measure bias, you need a statistically significant sample. And actually, if you are studying complex human affairs, with a hundred different variables, you need more than statistical significance, you need a sensitivity analysis. It is similar to nutrition studies. There are so many variables at play that something can always be found to increase or decrease your risk of cancer by 50%. You really only need to pay attention when statistics show an order-of-magnitude correlation, as with the link between smoking and lung cancer.

With the First Round Capital data, they excluded Uber from their calculations, because it would skew everything. If a single data point can switch your findings to be opposite, then you just have to admit that you do not have enough data to make determination one way or another. In science it is sometimes ok to exclude an outlier, since it often indicates a measurement error. But in venture capital, you make most of your money off of the Uber-like outliers. So if you are trying to study the data to be the best venture capitalist possible, throwing out outliers is not valid.

Also, the initial premise is incorrect too. You cannot measure bias by comparing average results, because the average is not the marginal. Consider PG's footnote: "Although I used female founders as an example because that is a kind of bias people often talk about, the most striking thing was the degree to which First Round undervalue founders who went to elite colleges." Does he honestly believe that First Round is biased against founders from elite colleges?

At my last company my sense was that the MIT grads were better than the average programmer. So were we biased against MIT grads? Should we have hired more MIT grads until the average performance of MIT grads overall equaled the average performance of an employee overall? Should we have done more outreach to MIT? Should the industry as a whole hire more MIT grads?

If a talent distribution has a bunch of elite, and then a steep drop-off filled with "pretenders", then you can get this type of effect without being biased.

When we got an elite MIT grad, we hired them. When we got a "pretender", someone who was trading on the name but did not put in the work, we rejected them. And yes, I personally saw MIT grads that did terribly on simple coding exercises.

So even though the average MIT grad we hired was better than the average programmer at our company, there was no way to alter our hiring process to get more MIT grads. If we hired the marginal MIT grad that we rejected, we would have been worse off. Now we could do more outreach to MIT, and we did, but that is a highly competitive process. There were diminishing marginal returns to how much outreach we can do to get more applicants.

The statistical illiteracy of PG's post is simply stunning. Imagine a YC company gets a 100% ROI from PPC ads, and a 50% ROI from banner ads. Are they biased against PPC ads? Should they buy more PPC ads? Such an analysis is ridiculous. You look at what you are spending on the marginal PPC ad, and you stop spending when the ROI on the marginal ad is at zero, regardless of what the average is. That one advertising channel has a higher ROI on average does not mean that the company is biased against that channel.
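
The average-versus-marginal distinction is easy to check with made-up numbers. In this sketch the per-slice ROI figures are hypothetical, chosen only so that the average over the purchased slices comes out at 100%, and `optimal_slices` is an illustrative helper.

```python
# Hypothetical ROI of each successive $1k slice of PPC spend (diminishing returns).
ppc_slice_roi = [2.0, 1.25, 0.5, 0.25, 0.0, -0.5]

def optimal_slices(slice_roi):
    """Keep buying slices while the marginal ROI is still positive."""
    count = 0
    for roi in slice_roi:
        if roi <= 0:
            break
        count += 1
    return count

bought = optimal_slices(ppc_slice_roi)
average_roi = sum(ppc_slice_roi[:bought]) / bought
marginal_roi_next = ppc_slice_roi[bought]
print(bought, average_roi, marginal_roi_next)
```

The buyer stops after four slices with an average ROI of 100%, yet the next slice returns nothing. A high average on what was bought says nothing about whether more should have been bought.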

So true.

PG's articles are generally filled with good intuitive insight. Unfortunately, statistics can be very tricky to turn into folksy wisdom. Rules of thumb like "you need 30 samples before you can say anything" that are derived from the CLT are a good example of ones that work well enough in practice, even if they obscure some underlying subtleties. This article is an example of a rule that sounds simple, but actually has so many asterisks that one would expect it to be mostly useless in practice.

If women are performing better on average, it doesn't mean that you should invest in more women necessarily. What if all the remaining candidates would have a negative mean return? If they included Uber and all of a sudden the women now underperform men, does that mean they're biased against men and they need to invest in fewer women?

There are just so many statistical fallacies at play here that it's a shame that Jessica, Sam, or Geoff didn't point out that maybe someone with a stats background should read the article first before publishing it.

> Does he honestly believe that First Round is biased against founders from elite colleges?

Maybe FR is. Imagine that elite college is highly predictive of success, so you prefer to pick elite college grads, all other available evidence being equal. You're biased toward elite college grads, right?

But what if elite college grads really are phenomenally more successful, and you can't see the detailed reason (high school experience, network, whatever), to the point that they are all better than all non-elite college grads. Then even selecting 90% of your pool from elites, and 10% from the rest, is biased against the actual merit of the applicants.

[these numbers are totally made up. I'm not saying elite college grads really have these characteristics.]

The trickiness is that you can't see everything when you evaluate, so you have to assign weights to the factors you have, and leverage correlations to hidden important factors.

All selection processes are biased. ■

What is the significance level? What is the model? This is freshman dorm room level analysis.

Isn't every selection process based on experience, knowledge and/or mood and therefore biased?

He makes a few assumptions. But he is probably correct within those assumptions.

We assume all candidate pools are homogeneous. If all members of a certain subset A of a global population P are simply better at a task than any member of another subset B of the global population, we will see that this holds true for members of our sample as well. Thus, members of A in our sample will consistently outperform members of B in our sample. Does this mean there is a bias against A? Well, yes, because if there wasn't then there would be fewer members of B or perhaps no members of B in our sample based on this result alone.

However, real life is not one-dimensional. Sometimes we need to consider other factors as well.
