
A Way to Detect Bias - bakztfuture
http://www.paulgraham.com/bias.html
======
yummyfajitas
People, most of whom clearly are not that good at math, are being really harsh
on Paul Graham.

Graham is mostly right, but slightly incorrect. In particular, suppose group A
has the distribution f(x) and B has the distribution g(x).

If f(x) and g(x) are shaped significantly differently _past the cutoff_ , then
mean(H(x-C)f(x)) and mean(H(x-C)g(x)) might not agree even though there is no
bias by construction. (Here H(x) is a step function and C the cutoff.)

However, there is an easy fix: compute the minimum of the support of the
truncated distribution rather than the _mean_ : min(supp(H(x-C)f(x))) =
min(supp(H(x-C)g(x))) = C.

In practice, measure the _weakest_ male and _weakest_ female to be accepted in
your sample set, or some similar approximation.

I'm pretty sure this is a valid frequentist hypothesis test. I've got half a
proof worked out on paper already. It depends very weakly (and non-
parametrically) on f(x) and g(x), but it works in basically the exact way
Graham wants it to. Every counterexample I can think of is really
pathological. My next blog post will probably be a proof of this.
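
For concreteness, here is a minimal numpy sketch of both points, with normals
of different spreads standing in for f(x) and g(x); all parameters are
invented:

    
    
        import numpy as np
        
        rng = np.random.default_rng(0)
        C = 1.0                                 # one common cutoff: no bias by construction
        f = rng.normal(0.0, 1.0, 1_000_000)     # group A quality, f(x)
        g = rng.normal(0.0, 2.0, 1_000_000)     # group B quality, g(x): fatter tail past C
        fa, ga = f[f > C], g[g > C]             # the accepted applicants
        print(fa.mean(), ga.mean())             # truncated means disagree (~1.53 vs ~2.28)
        print(fa.min(), ga.min())               # but both minima sit just above C
    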

All this negativity is really an overreaction. I know it's fun to totally
debunk someone on details, but these are mostly fixable details.

~~~
MattHeard
Wouldn't the weakest male and weakest female founders be underperforming more
due to individual factors than bias?

Maybe a mean of the lower quartile would remove some of the noise.

~~~
yummyfajitas
Suppose the cutoff for men is C but for women is C+K. Then the weakest man can
be expected to have quality C+epsilon, while the weakest woman will have
quality C+K+epsilon.

Here epsilon is how close a typical candidate will be to the cutoff, and is
mainly a function of the sample size. I don't know the behavior of epsilon off
the top of my head.

~~~
tprice7
I think MattHeard understood this and his point was that the weakest man will
realistically have quality C + epsilon + random, and the weakest woman will
have quality C + K + epsilon + random. The random term arises because no
evaluation process is going to perfectly tell you how people are going to end
up performing. But yeah, this seems fixable also, by averaging some number of
the lowest performing members of each group, as MattHeard suggested.
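
Something like this sketch (all numbers invented) shows why averaging the
bottom slice is more robust to noise than the single minimum:

    
    
        import numpy as np
        
        rng = np.random.default_rng(1)
        C, K, sigma = 0.5, 0.1, 0.05                     # cutoffs C and C+K, noise scale
        men = rng.uniform(C, 1.0, 5000)                  # accepted men: quality above C
        women = rng.uniform(C + K, 1.0, 5000)            # accepted women: quality above C+K
        obs_m = men + rng.normal(0, sigma, men.size)     # measured = quality + random noise
        obs_w = women + rng.normal(0, sigma, women.size)
        
        def bottom_decile_mean(x):
            return np.sort(x)[: max(1, x.size // 10)].mean()
        
        print(obs_m.min(), obs_w.min())                  # single minima: noisy estimates
        print(bottom_decile_mean(obs_m), bottom_decile_mean(obs_w))  # gap is roughly K
    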

~~~
yummyfajitas
I'm not 100% sure how to deal with noise + extremal statistics. But I've got a
12 hour plane ride ahead of me tomorrow, so I can probably work out a fix.

~~~
grayclhn
I suspect the math is easier if you reframe the hypothesis to be about the
derivatives: adding one more person from either group should have the same
marginal effect on outcomes. As you get more observations you get more in a
neighborhood of the cutoff which gives a consistent estimator of the
derivative.
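
A sketch of that idea, assuming we also get to observe each accepted
applicant's selection score (everything here is invented):

    
    
        import numpy as np
        
        rng = np.random.default_rng(2)
        C, K, h = 0.5, 0.1, 0.02                        # true bars C and C+K, window width h
        score_a = rng.uniform(C, 1.0, 20_000)           # group A admitted at bar C
        score_b = rng.uniform(C + K, 1.0, 20_000)       # group B admitted at the biased bar
        out_a = score_a + rng.normal(0, 0.05, 20_000)   # assume outcome tracks score + noise
        out_b = score_b + rng.normal(0, 0.05, 20_000)
        # the "marginal" admits: those scored within h of each group's observed floor
        print(out_a[score_a < score_a.min() + h].mean())  # ~C
        print(out_b[score_b < score_b.min() + h].mean())  # ~C+K: marginal effects differ
    

As the sample grows, the window can shrink and the estimates converge on the
two bars.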

------
nkurz
This approach relies on the unspoken "positivity" assumption that the pool of
applicants is large enough that there exist individuals in the biased-against
category that were not selected, and moreover that these denied applicants are
"exchangeable" with the successful ones.

For example, assume that we find that founders who won a MacArthur "genius"
grant outperformed the others. Further assume that there are only a limited
number of such founders, and that all available were selected. Certainly one
wouldn't want to conclude in this case that there is a bias against MacArthur
fellows.

That seems obvious, but it gets trickier once you have lots of factors
involved. What if the group you find to outperform consists of female founders
with a PhD, substantial industry experience, and red hair [1]? Can you
conclude that the process is biased against females? Males with PhDs? Anyone
with red hair? Generally no, unless you are willing to assume that all of your
factors are causal.

Worse, you can't even assume that it's biased against people with all of the
measured factors unless you also assume that all unmeasured factors are
randomly distributed. If it turns out that "a positive mental attitude" is an
unmeasured but defining characteristic of success, and the interviewers
rejected applicants who had less of it but you failed to include it in your
category, you would be wrong to conclude that there is an unfair bias.

[1]
[http://www.nature.com/nature/journal/v453/n7194/full/453562a...](http://www.nature.com/nature/journal/v453/n7194/full/453562a.html)

------
in3d
Graham's statement about the possible bias of First Round is unfounded. This
was not any sort of real study, as Graham takes it to be, and First Round
clearly notes that. When the returns are as skewed as they are in venture capital
([http://www.sethlevine.com/archives/2014/08/venture-
outcomes-...](http://www.sethlevine.com/archives/2014/08/venture-outcomes-are-
even-more-skewed-than-you-think.html)), a small sample size and a simple
analysis won't do. First Round even excluded their investment in Uber because
it would skew the results too much.

~~~
khed
Even if it were a statistically appropriate sample size, and female founders
still outperformed male founders, it still wouldn't exclude likely
explanations other than bias. What if the culture in venture capital is more
willing to assist and mentor female founders, leading to greater success? In
academia, women are now selected over equally qualified men 2:1 for
tenure-track positions [1]. It is not unreasonable to wonder if a similar hand
up is being given to female founders in terms of training, social network
inclusion, and mentorship, leading to greater success.

Alternatively, there may be a selection process in society that means only the
most motivated women become entrepreneurs, and so beat the average male
entrepreneur.

[1].
[http://www.pnas.org/content/early/2015/04/08/1418878112.abst...](http://www.pnas.org/content/early/2015/04/08/1418878112.abstract)

~~~
TomGullen
> female founders still out performed male founders

That's not what the study says; it says groups that have at least one female
founder, not female founders.

------
compbio
PG:

Assumption: There is no fundamental difference between a female and a male
founder for achieving start-up success (average rates and
variance/distribution of rates is the same)

Observation: VC funded start-ups with female founders are (on average) 60%
more successful than start-ups with male founders

Hypothesis: VC funding is biased against female founders. The ones that do
receive funding are better vetted, less risky, and have higher individual
qualities.

Experiment: Start funding more female founders.

If we then observe: The numbers start to even out, then there is no
fundamental difference. VC funding bias may have been the cause of the
difference in success rate.

If we then observe: The numbers stay the same, then there is a fundamental
difference and our assumption is flawed.

Rational choice: Start funding more female founders. This either removes a
bias (levels the playing field), or increases your profit (funding more
potentially successful founders).

PG should of course not use an hypothesis to prove an assumption
(experiment/probing is needed for verification). But also: The possibility of
an uneven distribution should not invalidate such an experiment (or PG's line
of reasoning), it will merely bring it to light (the numbers would stay the
same, thus we have shown that the difference is fundamental and not caused by
a sampling bias).

------
nartz
Interesting thoughts. However, this argument is biased because it assumes that
the performance of the applicants WHO WERE ACCEPTED is not biased by the
selection process itself, and that the performance characteristics of the
selected sample are representative of the performance characteristics of the
total, which could be a weak assumption.

An attempt at translating to mathematics (feel free to correct me!):

X = event that person belongs to group x

Y = event that person belongs to group y

S = event that person is selected

W = event that person will perform like a 'winner'

for simplicity P(X) + P(Y) = 1

Naturally, 'unbiased' in this case is simply P(S|X) = P(S|Y) = P(S), i.e. the
selection process is independent of membership in X or Y.

PG says we can measure the performance of these selected applicant winners
for each class, i.e. P(X|S,W).

I believe PG assumes that:

P(X|W) / P(Y|W) should equal P(X|S,W)/P(Y|S,W). We can see that these are
different distributions, since the second is already conditioned on the
selection process.

Simplified, PG assumes that P(X|S,W) = P(X|W) i.e. that conditioning on the
selection process does not bias the winning results.

It's left as an exercise for the reader to determine the 'pathological' cases
where this selection variable's distribution makes PG's assumption correct or
incorrect.

However, this is simply theoretical - the actual distribution may or may not
be 'pathological' and the assumptions made by PG could very well be good.
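
A quick Monte Carlo check of that assumption (the selection rules here are
invented):

    
    
        import numpy as np
        
        rng = np.random.default_rng(3)
        n = 1_000_000
        grp_x = rng.random(n) < 0.5                 # X vs Y, so P(X) + P(Y) = 1
        quality = rng.random(n)
        win = rng.random(n) < quality               # W: performs like a 'winner'
        sel_fair = rng.random(n) < quality          # S independent of group
        sel_biased = rng.random(n) < quality - 0.2 * grp_x   # X needs more quality
        
        def x_to_y_ratio(mask):
            px = grp_x[mask].mean()
            return px / (1 - px)
        
        print(x_to_y_ratio(win))                 # P(X|W)/P(Y|W), about 1
        print(x_to_y_ratio(sel_fair & win))      # unbiased S: ratio unchanged
        print(x_to_y_ratio(sel_biased & win))    # biased S: P(X|S,W) != P(X|W)
    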

------
eridius
There are a few trivial ways to be biased that would not be detected in this
way.

The first is if you have multiple people accepting applicants and some of them
are biased to the point of not accepting applicants of particular types. That
means all the applicants that are discriminated against that did make it were
simply selected by people who weren't biased, and therefore won't outperform
anyone.

The second is if the actual selection process is somewhat random instead of
being based on pure performance. The ones who make it through that process
won't necessarily perform any better, they'll just be luckier.

The third is if the application process accepts everyone equally, and then
randomly prunes out people according to a bias. This is similar to the second
except the acceptance criteria are still performance-based, but because it
randomly throws out people (instead of throwing out low-performers), the
remaining people are still going to perform the same as those who were not
pruned.
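
The third scenario is easy to simulate (numbers invented): random pruning
leaves the survivors' quality distribution untouched, so the test sees
nothing.

    
    
        import numpy as np
        
        rng = np.random.default_rng(4)
        n = 200_000
        quality = rng.normal(0, 1, n)
        group_b = rng.random(n) < 0.5
        accepted = quality > 1.0                   # one merit cutoff for everyone
        # bias: afterwards, half of group B's accepted applicants are pruned at random
        kept = accepted & ~(group_b & (rng.random(n) < 0.5))
        print(quality[kept & ~group_b].mean())     # ~1.53
        print(quality[kept & group_b].mean())      # ~1.53 as well, despite the pruning
    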

The first footnote on the page also points out that if the selection criteria
are different for the different groups then this process won't work, which
seems like a pretty important caveat that I wish was in the article proper.
One really common form of bias (especially in tech) is being biased against
women, and that's also a situation where it's very common to (unconsciously or
otherwise) use appearance in judging female applicants but ignore appearance
for male applicants.

~~~
yummyfajitas
The second example adds statistical noise, but does not invalidate Graham's
procedure. In the absence of noise, the distribution of accepted people will
be H(x-C) f(x) for the main group, and H(x-C-K)g(x) for the other group (where
H(x) is a step function, f(x) and g(x) are the distributions of quality of the
two groups).

If noise is present, then you get convolve(H(x-C)f(x), k(x)) instead, where
k(x) is the pdf of the noise distribution.

You'll need more samples to measure this, but it's completely measurable via
Graham's method.
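
A sketch of that claim with invented parameters; the noise is mean-zero, so
the gap between the truncated means survives it in expectation:

    
    
        import numpy as np
        
        rng = np.random.default_rng(5)
        C, K = 1.0, 0.3
        f = rng.normal(0, 1, 2_000_000)            # same underlying quality for both groups
        g = rng.normal(0, 1, 2_000_000)
        acc_f = f[f > C]                           # main group accepted at C
        acc_g = g[g > C + K]                       # other group accepted at the biased C+K
        obs_f = acc_f + rng.normal(0, 0.5, acc_f.size)   # convolve with the noise pdf k(x)
        obs_g = acc_g + rng.normal(0, 0.5, acc_g.size)
        print(obs_f.mean(), obs_g.mean())          # ~1.53 vs ~1.77: the bias still shows
    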

------
jameshart
There's a flaw right in the assumptions here: "(c) the groups of applicants
you're looking at have roughly equal distribution of ability."

Oh. See, the problem is that if an application process _is_ biased, and
applicants _perceive_ that bias, then those against whom it is biased will be
_dissuaded from applying_ unless they far exceed the required standards.
Whereas those towards whom the process is biased will be _more_ likely to
apply, even if they are marginally qualified, because they expect to benefit
from the bias.

So that means that if you do have a biased process, there's a good chance it
doesn't meet criterion c - applicants in the different groups between which
its bias discriminates are _not_ equal in ability. So your test might verify a
lack of bias, when there is in fact bias present.

You _can't_ verify a lack of bias just by looking at the outcomes of
successful applicants - you need to look at the outcomes for unsuccessful
applicants too, to determine whether your applicant pools really do meet
criterion c. Or you could look at the outcomes for nonapplicants, but that's
clearly a much harder problem.
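
One concrete way the failure of criterion c can mask a real bias (all numbers
invented): if the favored group's applicant pool happens to be stronger, the
disfavored group can face a higher bar and still not outperform.

    
    
        import numpy as np
        
        rng = np.random.default_rng(6)
        pool_a = rng.normal(1.0, 0.2, 100_000)   # favored group's applicants: stronger pool
        pool_b = rng.normal(0.5, 0.2, 100_000)   # disfavored group's applicants
        acc_a = pool_a[pool_a > 0.5]             # bar for A
        acc_b = pool_b[pool_b > 0.7]             # higher, biased bar for B
        print(acc_a.mean(), acc_b.mean())        # ~1.00 vs ~0.81: B underperforms anyway,
                                                 # so the outcome test reports no bias
    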

------
TomGullen
> A couple months ago, one VC firm (almost certainly unintentionally)
> published a study showing bias of this type. First Round Capital found that
> among its portfolio companies, startups with female founders outperformed
> those without by 63%.

Well, they also said in their study:

> And we are not claiming that our data is representative of the industry...or
> even statistically significant.

Also, the wording is "startups with a female founder", not exclusively female
founders... I think this is a detail that shouldn't be ignored.

And the study doesn't show how many companies out of the 300 had female
founders! Maybe it was just 1! They also say "Solo Founders do Much Worse Than
Teams", which is an important detail if their firm never backed a solo female
team! Etc., etc.; the list goes on. Not exactly strong evidence to support the
point PG is making, that bias would be easy to detect.

Measuring performance purely in terms of "how much money I make" is one way of
doing it, but not the only way. And it won't cover the majority of jobs on the
planet (how do you measure the performance of someone who stacks shelves in a
supermarket?).

------
MichaelGG
I don't understand his point about First Round Capital showing their female
founders did better than companies without female founders. What does that
show? How do we know that female founders aren't simply better? Or maybe women
are scared of applying, so out of women, only the best apply? In that case,
the mere idea that there is a bias can cause "pre-selection" bias.

I lack the mathematics to prove this, but it seems that on the face of it, pg
is simply wrong. Or I'm misreading terribly.

Tangentially: Speaking of bias, why doesn't YC publish information on their
companies' tech choices? PG racked up a lot of inferred cachet (positive) by
stating that use of Lisp gave them a huge advantage. Now that YC has data,
they should be able to show how choice of technology correlates to
performance.

~~~
esfandia
The argument is that First Round Capital must have implicitly made it harder
for female founders to get funding, since the ones who did get it perform
better. The rational course of action for First Round Capital would be to
lower their threshold for female founders (or, conversely, raise the threshold
for male founders) until they perform no better or worse than male founders.

~~~
MichaelGG
And that's not proven by the evidence. It might be a good thing to look into,
but pg's statement that you don't need more info is wrong.

~~~
esfandia
Yeah, I was just rephrasing PG's argument for the OP, not saying whether it
was right or wrong. I'm not smart enough to know.

------
wycats
The implication of this analysis of
[http://10years.firstround.com/](http://10years.firstround.com/) is that First
Round is biased against founding teams with experience at Amazon, Facebook,
Apple, Google, Microsoft or Twitter.

Can this be true?

~~~
tedsanders
Yes, it can be true. FirstRound could still positively value that experience,
but just not be valuing it enough.

~~~
wycats
Makes sense. Fascinating.

------
bsder
Um, the data set pg _cites_ actually shows this to be fallacious.

They excluded Uber from the results, which, if included, makes the male-run
companies look "oversuccessful". What would happen if I excluded the top
female-run business? I'd bet that would make the differences between the two
groups much smaller.

Given both the small sample size as well as the outsized influence of
outliers, drawing conclusions from this population group is going to be
fraught with issues.

------
earljwagner
The phenomenon of "stereotype threat" complicates this conclusion, however:
[https://en.wikipedia.org/wiki/Stereotype_threat](https://en.wikipedia.org/wiki/Stereotype_threat)

When a member of a group is primed with a stereotype that their group
underperforms at a task, they are more likely to underperform. So there could
be a selection process biased against a group, and a selected member could be
an above-average performer otherwise but, because of work environment, be
underperforming.

Some universities work to remedy this through support groups or other
practices aimed at under-represented minorities, and they appear to help
students be more successful academically. On the other hand, there's the
Hawthorne effect...
[https://en.wikipedia.org/wiki/Hawthorne_effect](https://en.wikipedia.org/wiki/Hawthorne_effect)

~~~
yummyfajitas
This is only a bias if the stereotype effect is stronger on measurements than
in real life. If stereotype threat affects test performance and real
performance the same way, then it means that the stereotyped group is truly
inferior.

Do you have evidence that stereotype threat hurts test performance more than
real performance?

(Of course, in a hypothetical world which only eliminated the stereotype, the
group would cease to be inferior. I.e., the inferiority is based on context,
and is not intrinsic.)

~~~
earljwagner
I wouldn't agree with the conclusion of being truly inferior - there could be
ongoing issues with the work context.

I'll put it another way: Graham says we can look at just the performance by
group to detect bias in selection. But there could be bias in selection plus
(A) different treatment after selection, which would not be revealed through
Graham's test. Or there could be bias in selection plus (B) continuous
reminders of stereotypes triggering stereotype threat, and this would also not
be revealed.

Now you could point to a tech company with few underrepresented minorities,
let's say 1%. If there's no overt bias in the work environment then people
should succeed and if they don't they're just worse performers. On the other
hand, if you're in the 1%, just noticing the underrepresentation among your
coworkers might be a constant reminder of stereotypes.

I don't claim to have a simple solution for this.

~~~
yummyfajitas
Differential treatment after selection would indeed invalidate Graham's test.

Stereotype threat, however, is NOT such a different treatment. Again - if
stereotype threat reduces _measured_ performance by X and _actual_ performance
by Y, then the bias it introduces is Y-X. If Y=X then there is no bias.

Do you believe Y != X? If so, why?

------
Phemist
Footnote one should be put more generally: biases in the performance metrics
(whether appearance for women vs. ability for men, or how many words of the US
national anthem someone knows for US citizens vs. the rest of the world) will
cause this method to fail. Unfortunately, unbiased performance metrics are
quite hard to construct on a single dimension, let alone when moving to
multi-dimensional metrics.

------
owens99
Many of these comments remind me of one of the most potent biases known to
man: confirmation bias. If you are smart and want to believe something, you
will surely be able to come up with a mathematical (albeit flawed) way to
convince yourself you are right.

------
j2kun
There is an emerging subfield of computer science that studies what it means
for data (or algorithms, or decision-making rules) to be biased, and how to
remove certain forms of bias.

See [http://fatml.org](http://fatml.org)

------
felipeerias
In which a crowd of overwhelmingly white male American SW engineers tries to
find a mathematical explanation for bias...

"For Bourdieu, cultural capital is the status culture of a society's elite
insofar as that group has embedded it in social institutions, so that it is
widely and stably understood to be prestigious. Schools take it as a sign of
native academic ability but do not themselves impart it, performing acts of
social alchemy that transform class privilege into individual merit."

------
kenko
"What it means for a selection process to be biased against applicants of type
x is that it's harder for them to make it through. Which means applicants of
type x have to be better to get selected than applicants not of type x. [1]
Which means applicants of type x who do make it through the selection process
will outperform other successful applicants."

There are many, many reasons that both sentences beginning "which means" are
false that someone who is as smart as we're told Graham is should be able to
come up with quite easily. It's astonishing that he made this tripe public.

Here's a gimme for each.

Say I'm selecting people to receive a prize; there are ten recipients and
they're putatively chosen by [whatever]. But I don't like people with green
eyes, so green-eyed candidates had better be pretty pleasing to me. But they
can please me in _any_ way, not necessarily in ways relevant to the metric for
which the prize is awarded; maybe I also like tall people, so a really
tall green-eyed person averages out in terms of my predilections. They aren't
relevantly better.

For the second, again, the question is "better" at what? Better at getting
whatever is involved in getting selected? That doesn't necessarily correlate
with outperforming anyone subsequently, especially if it's a matter of
startupland. (Remember that New Yorker profile of Marc Andreessen, where Sam
Altman basically admitted that he didn't know what he was doing in terms of
selecting what to invest in? The flipside of that is being selected by Altman
for an investment.)

------
d0m
I don't get it :-/ Why is there a bias?

Even if the VCs are totally unbiased, why couldn't the startups with women
outperform the others? It could happen for a variety of reasons. Just
hypothetically speaking, maybe startups-with-women have different networking
connections or insight that male-only-startups don't have?

~~~
esfandia
Then, if I understand PG's argument correctly, the VCs should invest in even
more startups-with-women, and lower their threshold for investing in them, at
least until they perform no better than startups-without-women.

~~~
wycats
Indeed!

------
gizmo
A related observation (which I've been making for a long time) is that the
absence of mediocre women in positions of power is strong evidence of bias.
Men can succeed when they're mediocre, but women have to be exceptional.
Likewise for minorities.

~~~
cookiecaper
What's "exceptional"? I've known many women in my career and many of them were
mediocre. Some were in positions of authority and some weren't. Another
commenter lists several "mediocre" women in the corporate and political world
(and leaves out some big ones, like Meg Whitman). If you're talking about
becoming a CEO, you have to be an "exceptional" man to get there too, in
absolute terms.

I feel like the root erroneous assumption here is that an equal amount of
people of all types are interested in the same things and that the only reason
any group becomes more represented than another is that the others are getting
alienated or funneled out somewhere along the way. That is a completely
incorrect and invalid assumption. The fact that there are a lot more non-
English-speakers in janitorial work in the US (when I worked as a janitor, I
was 1 of 2 English speakers on the 12-person janitorial staff) doesn't
necessarily mean the janitorial manager is biased against English speakers. It
means that, due to external considerations (almost all other jobs require you
to speak the native language), non-English-speakers are better suited for
janitorial work; people therefore do the logical thing, apply for work that
they can do, and end up comprising a larger share of the application pool.

People make decisions based on social, cultural, and physical expectations of
them, and there's not anything wrong with that. By and large, women do not
have an interest in computer sciencey or entrepreneurial work. It's OK if a
woman does, but it's also OK to note that _most women don't_. There's nothing
we need to fix about it. Most women don't want to do it, and there's no reason
to force them.

Why do you see fewer women becoming CEOs? Because fewer women want that kind
of job and fewer women are qualified for that kind of job due to the
biological realities of humanity that require women to take time out for
pregnancy and child-rearing (sorry denialists, I didn't invent biology and
choose that only women could bear and nurse children, so don't take it up with
me), and the social and cultural expectations that have formed around these
biological realities. In short, the serious applicant pool includes only a
very small amount of women, so only a very small number of women obtain that
position.

~~~
gizmo
> People make decisions based on social, cultural, and physical expectations
> of them, and there's not anything wrong with that.

I couldn't disagree more. If the culture is plain chauvinism ("women belong in
the kitchen not the boardroom") then there's everything wrong with that. All
oppression throughout history is essentially "just culture", but that
justifies nothing.

Your biological reductionism is completely at odds with our best scientific
understanding of contemporary gender roles, as a few minutes on wikipedia will
tell you.

~~~
cookiecaper
I didn't say "just culture"; I said the confluence of social, cultural, and
physical factors.

Why does "culture" develop? Because people are naturally evil and black-
hearted? These things don't happen in a vacuum, they develop organically
because they are the best way to support human and tribal propagation and
prosperity. Perhaps some things can and should change, but things that are
constant across nearly all successful human societies should be considered
fairly well tested.

We should note that it takes a long time to see the full effects of changes to
social structures and institutions, generally at least 3-4 generations. If a
society is "testing" something and the society itself expires or its success
is greatly diminished within 6-8 generations of implementation, the test
should probably not be seen as successful.

The West will find that traditional principles that assign gender roles based
on that gender's inherent advantages and disadvantages are much more useful
than currently acknowledged. Forcing people to do things that they a) don't
even want to do and b) aren't well-suited for is a losing proposition, no
matter how much outrage you try to manufacture to justify it.

~~~
srtjstjsj
Before you take your theory too far, you need to explain why it's OK that your
theory implies that black people in America were best suited to be slaves, up
until the day they weren't.

~~~
cookiecaper
Black slavery proves my point. Seen in the context of an experimental social
institution, it was a massive failure that barely made it 8 generations before
it completely imploded on itself (and took the lives of 650k Americans with
it). There's no doubt that it seriously harmed everyone associated with its
practice (including in ways you don't usually hear people mention, like
decreased ambition and work ethic for everyone, slaves and masters, in slave
economies), even mostly-innocent parties who were "guilty by association" like
the free states. We're lucky that the US survived black slavery.

Slavery has been tried many times but the gross inequity it inflicts means
that no one can operate a stable economy or social system that depends on it.

------
tracker1
I just assume there is bias... I mean, the fact is, bias is what you are
_trying_ to work in favor of... that bias being factors of success. Chasing
your tail against random statistics won't really show much, and a person is
more complex than a few statistical groups. As far as investing goes, there's
also the product, and how that leader/founder matches to that product category
itself. A founder that succeeds in one category won't definitively succeed in
another.
Many founders fail their first few times, and later succeed. Others fail after
some success(es).

I think as long as reasonable steps are made to avoid certain obvious bias,
the rest is mostly chance.

------
eatkinson
This isn't really sound reasoning, for reasons mentioned elsewhere and because
of the following.

You need to know that the probability of acceptance is conditionally
independent of the "type" of the applicant given the _success_ of the
applicant.

For example, consider the following hypothesis for the First Round data: women
are more honest than men. A woman presenting a bad idea to a VC will be
rejected whereas a man may be able to weasel his way into getting funding.
This will make men have a lower success rate, and correspondingly women will
have a higher success rate.

However, this isn't really the same thing as having an across-the-board hidden
bias against women.
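
A sketch of that mechanism (the "weasel bonus" is invented): one bar for
everyone, yet the funded women end up with higher true quality.

    
    
        import numpy as np
        
        rng = np.random.default_rng(7)
        n = 200_000
        male = rng.random(n) < 0.5
        quality = rng.normal(0, 1, n)
        # hypothetical mechanism: men can pad the pitch, women pitch honestly
        pitch = quality + np.where(male, np.abs(rng.normal(0, 0.5, n)), 0.0)
        funded = pitch > 1.5                     # the same bar for everyone
        print(quality[funded & ~male].mean())    # funded women: ~1.9 true quality
        print(quality[funded & male].mean())     # funded men: lower, since padded
                                                 # pitches let weaker candidates through
    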

------
Mz
Actual real world example (and application of an antidote) of this basic idea:
[https://en.wikipedia.org/wiki/Rooney_Rule](https://en.wikipedia.org/wiki/Rooney_Rule)

------
dthal
In the post, PG states that First Round's study is evidence of gender bias in
VC financing. But footnote [2] is important: Uber was excluded as an outlier.
Now...excluding Uber is reasonable (it _is_ sort of an outlier), but so is not
excluding it (it was a company that First Round invested in). When the
conclusion from a data analysis depends on which way you go on something like
this - which of two reasonable alternatives you pick - then the results are
fragile and they don't really support either conclusion very well.

------
jblow
All other arguments aside ... this idea also fails if the judging party's idea
of quality is mostly uncorrelated with actual quality. Which Graham says in
other essays is usually the case (it's what you mean when you say it's almost
impossible to predict which companies will be successful).

Graham says the subjects of bias "have to be better to get selected", but what
is really going on is they have to be better _according to the metrics of the
judge_ which are essentially arbitrary.

~~~
yummyfajitas
This is false - noise doesn't hurt this test at all. See my comment explaining
why:

[https://news.ycombinator.com/item?id=10483991](https://news.ycombinator.com/item?id=10483991)

Bad measurements add noise (and increase the sample size required) but they
don't invalidate the bias detection procedure.

------
proveanegative
If candidates from group A perform more strongly on average than those from
group B, there are possible causes other than bias in the selection process
itself. For instance, members of group A may only apply at a higher level of
self-assessed likelihood of success than those in group B. The reason for this
could be an opportunity cost not present for group B, overconfidence (or lack
of underconfidence) in group B, or underconfidence (or lack of overconfidence)
in group A.

------
anecon2
For a formalized and empirical version of this argument applied to the
entirety of the US economy, check out the following article: The Allocation of
Talent and U.S. Economic Growth by Hsieh et al.
([http://klenow.com/HHJK.pdf](http://klenow.com/HHJK.pdf)). It quantifies the
gains from the decreases in misallocation of women and African Americans as
racial discrimination in employment decreased over the past 50 years.

------
danieltillett
While we can have lots of fun arguments about the mathematics of this
approach, the basic problem is that the underlying data set is too small and
too poor to draw any valid conclusion from.

------
mrwilliamchang
I think I have a simpler counterexample to disprove pg's hypothesis than any
other counterexample I've read in the comments. Suppose our goal is to admit
the top 5 applicants with the following performances:

    
    
      A - 30,000
      A - 10,000
      A - 9,000
      B - 7,000
      B - 5,000   # Cutoff point below this line
      A - 4
      B - 3
      B - 2
    

Even though admitting the top 5 by score is perfectly fair, the applicants
from group A perform better.
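
Checking the group means of the admitted applicants:

    
    
      >>> a = [30_000, 10_000, 9_000]   # admitted from group A
      >>> b = [7_000, 5_000]            # admitted from group B
      >>> print(round(sum(a) / len(a)), round(sum(b) / len(b)))
      16333 6000
    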

~~~
Nimitz14
I don't see what you're getting at. Group A is better and there are more of
them. What's the problem?

~~~
mrwilliamchang
pg's argument is that if the average performance of one admitted group is
better than the other's, the admission process has a bias. This example shows
that you can have an unbiased process, but the average performance of the
groups differs.

~~~
nkurz
It's likely that the two of you are talking across each other because you read
slightly different articles. Paul added an assumption to his article, possibly
after William read it, which is intended to rule out his posited distribution:
"(c) the groups of applicants you're comparing have roughly equal distribution
of ability".

This strikes me as a "heroic assumption", but it's true that if you make it
most of the flaws in his argument go away. Add in the unspoken assumption that
the groups are both large enough that sampling variation does not matter,
and I think he's probably logically correct.

On the other hand, once you make these assumptions, the rest of his argument
seems unnecessary, since all you need to know is the ratio of males and
females funded. If male and female founders are exchangeable, the process is
biased if one group is funded more often than they are represented in the
applicants.

You don't even need to look at outcome, since we've already assumed the
founders are of equal ability. I think that Paul is aiming at the case where
we don't know the ratio of applicants. I think his argument can be useful in
this case, but only if you have already accepted his assumptions.

~~~
mrwilliamchang
> "Paul added an assumption to his article, possibly after William read it,
> which is intended to rule out his posited distribution"

Yeah. That is what happened.

------
pbnjay
Fittingly, another type of bias observed in the linked report is that against
Solo Founders. The report states that solo founders do worse _when measured
against the same yardstick as multiple founders_. Maybe from a VC perspective
this is intended (big raise => bigger exit?), but I'd argue that you don't
need to raise as much when you have a solo founder because dilution is less of
a concern.

------
urs2102
I think the first footnote in this is extremely important. It all depends on
what performance metric the selection process is identifying, compared to the
performance metric you use to determine success.

I would suspect the larger issue is that people are probably much worse at
identifying what performance metrics for selection convert to their respective
performance for success.

~~~
newjersey
If students of Asian origin outperform the whole student body, can we conclude
admissions folks are biased against students of Asian origin?

~~~
cookiecaper
Alternately, you could conclude that instead of "mediocre" Asians being
excluded by bias, Asians have an external advantage that makes them perform
better. Maybe it's cultural, since most Asians are taught a very strong work
ethic and heavy emphasis is placed on formal schooling, succeeding, and
fitting in. Maybe Asians are physically better adapted to that type of work,
with brains that retain information more easily or buttocks that don't get
sore from sitting in a chair all day.

The "high performance means there's a bias" theory only works if you assume
that everyone is starting from the same social, cultural, and physical
baseline. They aren't.

Maybe a better metric would be if there were _no_ mediocre data points among a
certain group; that would be more evidence (but still not necessarily _good_
evidence) that you have to be exceptional to get attention and overcome the "bias
barrier", not simply that most of a certain type of performer does better than
a different type.

~~~
yummyfajitas
That doesn't actually have a large effect. An example in numpy:

    
    
        In [13]: from scipy.stats import norm; from numpy import mean, where
        In [14]: x = norm(0.0, 1).rvs(100000)
        In [15]: mean(x[where(x > 2.0)])
        Out[15]: 2.3774795090391301
        In [16]: y = norm(0.5, 1).rvs(100000)
        In [17]: mean(y[where(y > 2.0)])
        Out[17]: 2.4372124830289557
    

I.e., a difference in the mean of 0.5 sigma corresponds to 0.06 in Graham's
test statistic.

Graham is a little bit off - a better place to look for bias is the bottom of
the accepted distribution than at the mean.
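
Continuing the session above (exact values vary run to run), the bottom of
each truncated sample is pinned to the cutoff:

    
    
        In [18]: x[where(x > 2.0)].min()   # ~2.0001, just above the cutoff
        In [19]: y[where(y > 2.0)].min()   # ~2.0001 as well, despite the 0.5 sigma shift
    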

~~~
srtjstjsj
At
[https://news.ycombinator.com/item?id=10483861](https://news.ycombinator.com/item?id=10483861)
gizmo shares an intuitive anecdote that matches your math.

------
somberi
This is how I understand it:

Look back at decisions you have made under various lenses, and learn what
biases they have, so that you can avoid them or amplify them (if positive) in
future.

------
logicallee
Could someone who read it more attentively tell me, by this methodology,

-> If in retrospect YC finds any factor that its selected founders who turn into unicorns ($1b, $10b etc) have in common (more than its non-unicorn, also accepted founders)

-> Then by this method, could it conclude retroactively that it had been "biased" against that factor? (since it is present more than in its non-unicorns whom it had also admitted; in other words, those with the factor are more performant than "would be expected" without the bias against it?)

Or have I misunderstood?

------
174676
Even assuming equal distribution of ability, there is still the problem of
whether you can measure performance without bias.

------
fanzhang
The test pg suggests was also proposed by the economist Gary Becker [1]. Like
many people here noticed, the catch is that the test only works if you compare
_marginal_ performance and not _average_ performance. Economists call this the
inframarginality problem [2]. There are a number of solutions to this problem
to restore pg's result:

- As pg himself says, if we assume certain statistical distributions of
ability and selection rules, the inframarginality problem goes away.

- We'd also solve the inframarginality problem if we can tell roughly who the
marginal applicants were. If pg could ask the VC firm, see who _almost_ got
rejected, and compare these two groups, he'd be set. pg is well-positioned to
test this on the YC dataset.

Likewise, he could solve this problem if he can observe another variable that
reveals who the marginal applicants likely were (for example, the startups
that had the fewest co-investors).

- There's also an entire literature out there that tries to solve the problem
in other ways. For example, if a system follows the "KPT" sufficient
conditions, then the inframarginality problem also goes away.

 _[1] One prominent approach ... is the “outcome test,” which originated in
Gary S. Becker (1957). In the context of motor vehicle searches, the outcome
test is based on the following intuitive notion: if troopers are profiling
minority motorists due to racial prejudice, they will search minorities even
when the returns from searching them, i.e., the probabilities of successful
searches against minorities, are smaller than those from searching whites.
More precisely, if racial prejudice is the reason for racial profiling, then
the success rate against the marginal minority motorist (i.e., the last
minority motorist deemed suspicious enough to be searched) will be lower than
the success rate against the marginal white motorist. (From [3])_

 _[2] "While this idea has been well understood, it is problematic in
empirical applications because researchers will never be able to directly
observe search success rates against marginal motorists. This is due to the
fact that we cannot identify the marginal motorist, since accomplishing this
would require having complete information on all of the variables that
troopers use in determining the suspicion level of motorists. Because of this
omitted-variables problem, we can observe only the average success rate of
searches against white and minority motorists, and not the marginal success
rate. Since the equality of marginal search success rates does not imply, and
is not implied by, the equality of the average search success rates, we cannot
determine the relationship between the marginal search success rates of white
and minority motorists by looking at average success rates. In past
literature, this has been referred to as the “infra-marginality” problem.
(From [3])._

 _[3] Anwar, Shamena, and Hanming Fang, "An Alternative Test of Racial
Prejudice in Motor Vehicle Searches: Theory and Evidence." American Economic
Review. (2006)_

[http://economics.sas.upenn.edu/~hfang/publication/racial-
pro...](http://economics.sas.upenn.edu/~hfang/publication/racial-
profiling/aer_final.pdf)

------
graycat
Okay, PG has an _hypothesis_ test.

There's a large literature for that, e.g.,

E. L. Lehmann, _Testing Statistical Hypotheses_.

E. L. Lehmann, _Nonparametrics: Statistical Methods Based on Ranks_.

Sidney Siegel, _Nonparametric Statistics for the Behavioral Sciences_.

In this case, PG will be more interested in the _non-parametric_ case, i.e.,
_distribution-free_ where we make no assumptions about probability
distributions.

We start an hypothesis test with an _hypothesis_ , commonly called the _null
hypothesis_ which is an assumption that there is no _effect_ or, in PG's case,
_no bias_. Then with that assumption, we are able to do some probability
calculations.

Then we look at the real data and calculate the probability of, say, the
evidence of bias being as large as we observed. If that probability is small,
say, less than 1%, then we _reject_ the _null hypothesis_ , that is, reject
the assumption of no _bias_ , and conclude that the null hypothesis is false
and that there is bias. The role of the assumption about the sample is so that
we know that the _problem_ is bias and not something about the sample.

In hypothesis testing, about all that matters are just two numbers -- the
probability of Type I error and that of Type II error. We want both
probabilities to be as low as possible.

Type I Error: We reject the null hypothesis when it is true; e.g., we conclude
bias when there is none.

Type II Error: We fail to reject (i.e., we accept) the null hypothesis when it
is false; e.g., we conclude there is no bias when there is.

When looking for bias, Type I error can be called a _false alarm_ of bias, and
Type II error can be called a _missed detection_ of bias.

In PG's case, suppose we have 100 startups and five of those have women
founders. Suppose for each of the startups we have the data from "their
subsequent performance is measured".

Our null hypothesis is that the expected performance of the women is the same
as that of the men.

So, let's find those two averages and take the difference, say, the average of
the women less the average of the men.

PG says if this difference is positive, then there was bias, but PG has not
given us any estimate of the probability of Type I error, that is, of the
probability (or _rate_ ) of a false alarm.

I mean we don't want to get First Round Capital in trouble with Betty Friedan,
Gloria Steinem, Marissa Mayer, Sheryl Sandberg, Hillary Clinton, Ivanka Trump,
or Lady Gaga unjustly! :-).

Let's call this difference our _test statistic_.

So, let's find the probability of a false alarm:

So, let's put all 100 measurements in a pot, stir the pot vigorously (we can
use a computer for this), pull out five numbers and average, pull out the
other 95 numbers and average, take the difference in the two averages, that of
the five less that of the 95, and do this, say, 1000 times. Ah, computers are
cheap; let's be generous and do this 10,000 times.

For a random number, how about starting with a 32 bit integer, with
appropriately long precision arithmetic multiply by 5^15, add 1, take modulo
2^47, take the resulting integer, scale as we want for stirring our pot, and
use that integer as the start of another random number?
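
In Python, that recipe is just a few lines (seed choice left to the reader):

    
    
        def lcg(x):
            # x -> (5**15 * x + 1) mod 2**47, scaled to [0, 1) for stirring the pot;
            # the integer x carries over as the start of the next random number
            while True:
                x = (5**15 * x + 1) % (2**47)
                yield x / 2**47
    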

So, we get an empirical distribution of these differences, from the five less
the 95. Looking at the distribution, we see what the probability is of getting
a difference as high or higher than our test statistic. If that probability is
low, say, 1% or less, then we reject the null hypothesis of no bias and
conclude bias, with our estimate of the probability of Type I error 1% or
less.
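
Here is the whole pot-stirring recipe as a short numpy sketch, with invented
performance numbers (lognormal, since startup outcomes are skewed):

    
    
        import numpy as np
        
        rng = np.random.default_rng(8)
        perf = rng.lognormal(0, 1, 100)            # invented performance data, 100 startups
        stat = perf[:5].mean() - perf[5:].mean()   # first 5 = the women-founded startups
        
        null = []
        for _ in range(10_000):                    # stir the pot 10,000 times
            shuffled = rng.permutation(perf)
            null.append(shuffled[:5].mean() - shuffled[5:].mean())
        
        p = np.mean(np.array(null) >= stat)        # estimated false-alarm probability
        print(p)                                   # reject "no bias" if, say, p < 0.01
    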

If with the 1% we reject, then it looks like First Round has done a
transgression, will get retribution from Betty, _et al.,_ and needs to seek
redemption and Betty, _et al.,_ are happy to have their suspicions confirmed.
Else First Round looks like the good guys, are "certified statistically fair
to women", may get more deal flow from women, and Betty, _et al.,_ can be
happy that First Round is so nice!

Notice that either way Betty, _et al.,_ are "happy". That's called "happy
women, happy life"! Or, heads, the women win, tails First Round loses, and in
no event is there a huge crowd of angry women in front of First Round's
offices with a bonfire of lingerie screaming "bias"!

When we reject the null hypothesis, we want to know that the reason was men
versus women and not something else, e.g., a _biased_ sample. So here is where
we use our assumption of independence with the same mean.

Now we have a _handle_ on Type I error.

Here we have done a _non-parametric_ statistical hypothesis test, i.e., have
made no assumptions, except the means, about the distributions of the
male/female CEO performance measurements.

And we can select our desired false alarm rate in advance and get that rate
almost exactly.

For Type II error, that is more difficult.

Bottom line, what we really want is, for whatever rate of false alarms we are
willing to tolerate, the lowest rate of missed detections we can get.

Can we do that? With enough more data, yup. There is a classic result due to
J. Neyman (long at Berkeley) and E. S. Pearson that shows how.

How? Regard false alarm rate as money and think of investing in SF real
estate. We put our money down on the opportunities with highest expected ROI
until we have spent all our money. Done. For details, an unusually general
proof can follow from the Hahn decomposition from the Radon-Nikodym theorem in
measure theory, e.g., Rudin, _Real and Complex Analysis_. Right, in the
discrete case, we have a knapsack problem, known to be NP-complete.

What we have done with our pot stirring is called _resampling_ , and for more
such look for B. Efron, long at Stanford, and P. Diaconis, once at Harvard,
now also long at Stanford.

Tom, with a reputation as a hacker, likes to work late, say, till 2 AM. So, we
look at the intrusion alerts each minute between 2 AM and 3 AM (something like
the performance of the women) and compare with those of the other minutes of
24 hours (like the performance of the men) much as above and ask if Tom is
trying to hack the servers.

Or, we have a server farm and/or a network, and we want to detect problems
never seen before, e.g., _zero day_ problems. So, we have no data at all on
the problems we are trying to detect because we have never seen any of those
before.

So, to do a good job, let's pick some system we want to monitor and for that
system, get data on, say, each of 10 variables at, say, 20 times a second. Now
what?

Our work with bias in women venture applications used just one number for our
measurement and test statistic. So we were _uni-dimensional_. Here we have 10
numbers and need to be _multi-dimensional._

Well, in principle we should be able to do much better (pair of Type I and
Type II error rates) with 10 numbers than just one. The usual ways will
require us to have, with our null hypothesis, the probability distribution of
the 10 numbers, but can only get something like that from smoking funny stuff
-- not even _big data_ is that big.

So, we want to need no assumptions about distribution, that is, be
_distribution-free_.

So, we want a statistical hypothesis test that is both multi-dimensional and
distribution free.

Can we do that? Yup.

"You mean you can select false alarm rate in advance and get that rate
essentially exactly, as in PG's bias example?" Yup.

"Could that be used in a real server farm or network to detect zero day
problems -- security, performance, hard/software failures, system management
errors?" Yup \-- just what it was invented for.

"Attempted credit card fraud?" Ah, once a guy in an audience thought so!

How? Ah, sadly there is no more room in this post!

What else might we do with hypothesis tests? Well, look around at, right, _big
data_ or just _small data_.

Do we have a case of _big data analytics_ or _artificial intelligence_ (AI)?

Ah, I've given a sweetheart outline of statistical hypothesis testing, and now
you are suggesting some things really low grade? Where did I go wrong to
deserve such an insult?

------
WildUtah
On a simple mathematical basis, this is false.

Consider two groups of candidates for a scholarship, A and B. We want to
select all candidates that have an 80% or better chance of graduation. Group A
comes from a population where the chance of graduation is distributed
uniformly from 0% to 100% and group B is from one where the chance is
distributed uniformly from 10% to 90%, with the same average but less
variation in group B.

Now suppose that we select without bias or inaccuracy all the applicants that
have an 80% or better chance of graduation. That means we select a subset of A
with a range of 80% to 100% and a subset of B with a range from 80% to 90%.
The average graduation rate of scholarship winners from group A will be 90%
and that from group B will be 85%.

But we haven't been biased against A. We've selected according to the exact
same perfect evaluation process and criterion from both groups. It was just
their prior distribution that was different.
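
The arithmetic checks out in simulation:

    
    
        import numpy as np
        
        rng = np.random.default_rng(9)
        a = rng.uniform(0.0, 1.0, 1_000_000)   # group A: graduation chance U(0, 1)
        b = rng.uniform(0.1, 0.9, 1_000_000)   # group B: same mean, less variation
        print(a[a > 0.8].mean())               # ~0.90
        print(b[b > 0.8].mean())               # ~0.85, with no bias in the selection
    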

The actual applicant groups for jobs or financing in the real world, when they
are divided by demographic factors like age, sex, race, and educational level,
will almost always manifest different variances in success levels even when
the averages are the same. That makes this test useless and mathematically
illiterate.

And when we use a normal distribution, as we should always expect given the
central limit theorem, the mathematical problems get even more intense.

This short comment is not up to pg's usual high standards for his essays.

~~~
pg
It's true that this test assumes groups of applicants are roughly equal in
(distribution of) ability. That is the default assumption in most
conversations I've been involved in about bias, and particularly the example I
used, but I'll add something making that explicit.

~~~
gleb
I like the idea, but how do you apply this to power law distribution outcomes
and get any statistical significance? I don't know the answer.

E.g. the underlying First Round analysis likely has no statistical
significance. Assuming a power-law distribution of outcomes, the top 5
outcomes will account for ~97% of the value. So we now have a study with n=5.
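
A quick illustration with an invented heavy-tailed portfolio (Pareto tail
index 0.7; the exact share varies a lot run to run, which is itself the
point):

    
    
        import numpy as np
        
        rng = np.random.default_rng(10)
        outcomes = 1 + rng.pareto(0.7, 300)     # 300 invented portfolio outcomes
        top5_share = np.sort(outcomes)[-5:].sum() / outcomes.sum()
        print(top5_share)                       # usually most of the total value
    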

To make the point let's apply this to YC's own portfolio. Assuming Dropbox,
AirBnb and Stripe represent 75% of its value, we'll learn that YC is
incredibly biased against:

    
    
      * MIT graduates
      * brother founders
      * founding teams that do not have female founders
      * and especially males named Drew
    

Hard to believe these conclusions are correct or actionable.

~~~
graycat
> I like the idea, but how do you apply this to power law distribution
> outcomes and get any statistical significance? I don't know the answer.

See my post

[https://news.ycombinator.com/item?id=10484602](https://news.ycombinator.com/item?id=10484602)

where we are _distribution-free_. So "power law", Gaussian, anything else,
doesn't matter.

------
mirimir
correlation <> causation

------
devalier
_Fortunately there's a way to measure bias that's much more reliable, when it
can be used... A couple months ago, one VC firm (almost certainly
unintentionally) published a study showing bias of this type. First Round
Capital found that among its portfolio companies, startups with female
founders outperformed those without by 63%._

Except if you want to use statistics to measure bias, you need a statistically
significant sample. And actually, if you are studying complex human affairs,
with a hundred different variables, you need more than statistical
significance, you need a sensitivity analysis. It is similar to nutrition
studies. There are so many variables at play that something can always be
found to increase or decrease your risk of cancer by 50%. You really only need
to pay attention when statistics show an order-of-magnitude correlation, as
with the link between smoking and lung cancer.

With the First Round Capital data, they excluded Uber from their calculations,
because it would skew everything. If a single data point can switch your
findings to the opposite, then you just have to admit that you do not have
enough data to make a determination one way or another. In science it is
sometimes ok to exclude an outlier, since it often indicates a measurement
error. But in venture capital, you make most of your money off of the Uber-
like outliers. So if you are trying to study the data to be the best venture
capitalist possible, throwing out outliers is not valid.
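
A toy illustration with invented multiples, in the spirit of the Uber
exclusion:

    
    
        from statistics import mean
        
        women_led = [3.0, 2.5, 4.0, 1.0, 2.0]   # invented return multiples
        men_led = [1.5, 1.0, 2.0, 0.5, 1.0]
        print(mean(women_led), mean(men_led))   # 2.5 vs 1.2: "women outperform"
        print(mean(men_led + [200.0]))          # ~34.3 with one Uber-like outlier:
                                                # the verdict flips on a single point
    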

Also, the initial premise is incorrect too. You cannot measure bias by
comparing average results, _because the average is not the marginal_. Consider
PG's footnote: "Although I used female founders as an example because that is
a kind of bias people often talk about, the most striking thing was the degree
to which First Round undervalued founders who went to elite colleges." Does he
honestly believe that First Round is biased against founders from elite
colleges?

At my last company my sense was that the MIT grads were better than the
average programmer. So were we biased against MIT grads? Should we have hired
more MIT grads until the average performance of MIT grads overall equaled the
average performance of an employee overall? Should we have done more outreach
to MIT? Should the industry as a whole have hired more MIT grads?

If a talent distribution has a bunch of elite, and then a steep drop-off
filled with "pretenders", then you can get this type of effect without being
biased.

When we got an elite MIT grad, we hired them. When we got a "pretender",
someone who was trading on the name but did not put in the work, we rejected
them. And yes, I personally saw MIT grads who did terribly on simple coding
exercises.

So even though the average MIT grad we hired was better than the average
programmer at our company, there was no way to alter our hiring process to get
more MIT grads. If we hired the marginal MIT grad that we rejected, we would
have been worse off. Now we could do more outreach to MIT, and we did, but
that is a highly competitive process. There were diminishing marginal returns
to how much outreach we could do to get more applicants.

The statistical illiteracy of PG's post is simply stunning. Imagine a YC
company gets a 100% ROI from PPC ads, and a 50% ROI from banner ads. Are they
biased against PPC ads? Should they buy more PPC ads? Such an analysis is
ridiculous. You look at what you are spending on the marginal PPC ad, and you
stop spending when the ROI on the marginal ad is at zero, regardless of what
the average is. That one advertising channel has a higher ROI on average does
not mean that the company is biased against that channel.

~~~
tansey
So true.

PG's articles are generally filled with good intuitive insight. Unfortunately,
statistics can be very tricky to turn into folksy wisdom. Rules of thumb like
"you need 30 samples before you can say anything" that are derived from the
CLT are a good example of ones that work well enough in practice, even if they
obscure some underlying subtleties. This article is an example of a rule that
sounds simple, but actually has so many asterisks that one would expect it to
be mostly useless in practice.

If women are performing better on average, it doesn't mean that you should
invest in more women necessarily. What if all the remaining candidates would
have a negative mean return? If they included Uber and all of a sudden the
women now underperform the men, does that mean they're biased against men and
need to invest in fewer women?

There are just so many statistical fallacies at play here that it's a shame
Jessica, Sam, or Geoff didn't point out that maybe someone with a stats
background should read the article before publishing it.

------
jsprogrammer
All selection processes are biased. ■

------
bruu_
What is the significance level? What is the model? This is freshman-dorm-room-
level analysis.

------
SeriousM
Isn't every selection process based on experience, knowledge, and/or mood, and
therefore biased?

~~~
newjersey
He makes a few assumptions. But he is probably correct within those
assumptions.

We assume all candidate pools are homogeneous. If every member of a certain
subset A of a global population P is simply better at a task than any member
of another subset B, this will hold true for the members of our sample as
well: members of A in our sample will consistently outperform members of B.
Does this mean there is a bias against A? Well, yes, because if there weren't,
there would be fewer members of B, or perhaps no members of B at all, in our
sample, based on this result alone.

However, real life is not one-dimensional. Sometimes we need to consider other
factors as well.

