
The Mathematics of Paul Graham's Bias Test - coris47
https://www.chrisstucchio.com/blog/2015/paul_grahams_bias_test.html
======
haberman
> So rather than comparing _mean_ performance, we'll compare _minimum_
> performance.

If I'm understanding correctly, the new test is based on a single data point
from each group, rather than an aggregate statistic (like mean). I'm no
statistician, but it seems like this data would have _far_ too much variance
and noise for this to be a useful test.

The minimum performer could be someone who had a sudden personal crisis. Or
who had 10 competitors suddenly pop up. Or any number of other circumstances
outside their control. The minimum performer is, almost by definition, an
outlier. It doesn't seem rational to suppose that an outlier is representative
of the group.

I can understand that statistically this test may be more rigorous. In
practice I would expect it to be less rigorous. Because the assumption it
makes (that a single outlier is representative of the group) seems even more
dubious than the assumptions required for Paul's original idea.

~~~
rcthompson
The sample minimum (or maximum) is not an inherently unstable statistic. If
there is sufficient density in the distribution near its minimum, the sample
minimum can be quite robust. For example, consider that the maximum likelihood
estimator for the upper bound of a uniform distribution is simply the sample
maximum, and the minimum-variance unbiased estimator is also based on the
sample maximum[1]. (This method was used by the Allies in World War 2 to
estimate the total number of German tanks by sampling the serial numbers from
destroyed tanks[2].)

Of course, a real thresholding process would not be perfect, so the lower
bound of the distribution of accepted candidates would not be a perfect
vertical cutoff as in the examples. Just like any process that adds additional
variation to the data, this would reduce the statistical power. You could
accept more bias in return for lower variance in your test by taking, say, the
5th percentile instead of the sample minimum as your test statistic. (You can
think of the sample minimum as the zeroth percentile.)

[1]
[https://en.wikipedia.org/wiki/Uniform_distribution_%28contin...](https://en.wikipedia.org/wiki/Uniform_distribution_%28continuous%29#Estimation_of_maximum)

[2]
[https://en.wikipedia.org/wiki/German_tank_problem](https://en.wikipedia.org/wiki/German_tank_problem)
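The tank estimate is simple enough to sanity-check in a few lines. A minimal sketch (the sample size and true N below are arbitrary; this is just the discrete-serial version of the idea):

```python
import random

# German tank problem: estimate N, the largest serial number, from a
# sample of k serials drawn without replacement from 1..N.
def mvue_estimate(sample):
    """Minimum-variance unbiased estimator: m + m/k - 1,
    where m is the sample maximum and k is the sample size."""
    m, k = max(sample), len(sample)
    return m + m / k - 1

random.seed(0)
N, k = 1000, 20
estimates = [mvue_estimate(random.sample(range(1, N + 1), k))
             for _ in range(5000)]
mean_est = sum(estimates) / len(estimates)
print(round(mean_est, 1))  # averages out close to the true N = 1000
```

Despite being built from a single extreme order statistic, the estimator averages out right at the true bound, which is the sense in which a sample maximum (or minimum) can be robust.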

~~~
nazka
It's very interesting. And what are the most suitable formulas we can use to
measure how robust it is?

~~~
chestervonwinch
If you are estimating a parameter with a sample statistic, two properties it
should have are unbiasedness and low variance:
[https://en.wikipedia.org/wiki/Minimum-variance_unbiased_estimator](https://en.wikipedia.org/wiki/Minimum-variance_unbiased_estimator)

------
lukev
One other thing which both this and PG's original theory get wrong:

Their basic premise is wrong, if bias continues to exist after the selection
event in question.

For example, if YC had (hypothetically) a real bias against black or female
entrepreneurs, it is _almost certain_ that future funding rounds, as well as
all possible exit scenarios, would exhibit much the same bias.

In which case, the future "performance" of those candidates would be poor, and
by PG's definition unbiased even though the only meaningful result is that YC
is no _more_ biased than subsequent performance evaluations.

~~~
agarden
Let's assume that bias does persist past selection through the duration of the
program. Does that change the interpretation when you look at First Round
Capital's data that shows its female founders outperforming the males by 63%?
I don't think it does.

The test may not be sufficient to prove that you have no bias, but it may be
good enough to prove that you do. When it does indicate bias, it seems likely
to be correct.

To put it another way, if it is 1948 and the only three black people in Major
League Baseball are all superstars, then the distribution of baseball skill
among black players is extremely unbalanced or there is a lot of bias keeping
the average and moderately-better-than-average black players out.

~~~
notahacker
The interpretation is still somewhat unreliable, as in many cases there's
still another hypothesis: that $UNDERREPRESENTED_MINORITY actually
overperforms due to _favourable_ treatment after the selection process.

Of course favourable treatment can't make people into superstar startup
founders or baseball players (and I'm sure any special treatment afforded to
black baseball players in the 1940s was the complete opposite of favourable).
But more generally it can make an organisation with fair selection processes
look like it sets a higher bar for $MINORITY because it addresses low numbers
by being very keen to promote and very reluctant to fire/deselect members of
said minority, so these kind of studies still have to be considered with care.

(Of course, even if an organisation is proactively treating a minority group
favourably after selection, that doesn't mean that conscious or unconscious
biases don't exist in the selection process.)

------
bsder
The fact that they have to exclude Uber for no good a priori reason should
have raised red flags all over the place.

"But Uber skews the results!" So what? You don't get to just throw out data
points you don't like without good reason.

If your "test" is that sensitive to individual outliers, then perhaps it isn't
really a good test after all.

~~~
ceph_
Dropping outliers is common in statistical analysis.

~~~
jamiequint
Dropping outliers can be done when outliers cloud the analysis, but doing this
in an analysis of startups is inane since startup investors' _entire goal_ is
to find outliers.

~~~
angelbob
Possible. In this case, we're not looking for outliers or measuring based on
financial success, but trying to tell if the VC is systematically biased anti-
woman.

It's not clear that dropping outliers is a bad idea there. It's also not clear
it's a _good_ idea, granted.

~~~
vasilipupkin
Well, if you are trying to measure whether the male founders or the female
founders you have funded make more on average, then you would have to include
Uber. The real issue with the analysis is that the results are unlikely to be
statistically significant due to small samples and high variance, which makes
them useless.

------
danieltillett
Very nicely done Chris, but the basic problem with Paul’s analysis is not the
mathematics (this can be fixed as you have shown), but the underlying data.
Any data set you could get to measure bias in the start-up world is too small
and messy to tell you anything useful. No matter how sophisticated your
analysis, if the data is garbage then all you will end up with is garbage.

This doesn't even consider the problem of data dredging that First Round
Capital engaged in.

~~~
jaz46
Maybe First Round's data set is too small or messy to get meaningful results,
but the entire startup world has plenty of data to potentially pull some
meaningful conclusions about bias.

I wonder if all YC companies would be enough data points to learn something
useful. Or maybe grab a large swath of VC-funded startups including First
Round's investments and many other top firms.

Most of the raw stats in Chris's post were above my head, but I'd love to see
this applied to a larger data set of fundings.

~~~
danieltillett
The basic problem is getting hold of good data. Most VCs rightly consider
their data in this area very valuable, and they are not going to part with it
easily. Not even Paul delved into YC's data.

More fundamentally even if you could get enough data, the data is just too
messy to analyse and draw any valid conclusions.

------
Steko
> The idea is generally correct - bias in a decision process will be visible
> in post-decision distributions

I find what's wrong with the idea more fundamental, that it talks only about
the 'selection process' but in fact bias that impacts success or failure can
come at other points.

~~~
danieltillett
This is really important. Let's say the whole VC ecosystem is biased against
redheads (just to pick a random group). What would happen is that the redheads
would underperform other groups, since they would be discriminated against at
each stage of the VC lifecycle. They would not show up later as a group that
overperforms. The only bias you can detect using Paul's approach is bias that
applies only at the initial stage and not later.

~~~
escape_goat
Just to cross the beams of pedantry here for a moment: a widespread and well-
known --- if less than serious or systemic --- cultural/social bias against
red-haired people, probably first coming into public consciousness in North
America with the infamous South Park 'Ginger' episode, has in fact primed you
to select "redheads" as a non-contentious example of a plausibly ethnic group
that might be discriminated against. That is something every red-haired person
knows, although you apparently do not. It means the choice is statistically
insensitive and the social methodology is poor.

Actually, choosing an identifiable group at random would be both socially and
statistically unwise. Following a Pareto distribution, there are vastly more
minority/extreme-minority distinguishable groups of people than there are
majority/significant-minority ones. This means, firstly, that any group
randomly selected with equal weighting between all groups has a high
probability of being subject to actual discrimination, mooting any social
benefit of choosing a group at random; and secondly, that the generalizable
qualities of the group chosen would therefore have a distribution with very
little deviation (if I'm using my terms correctly) and would be highly
predictable, thereby obviating any possible statistical benefit of doing so.

~~~
nkurz
I'd prefer to imagine that it's because the majority of the HN readership
reads Nature for their regular dose of science fiction:
[http://www.nature.com/nature/journal/v453/n7194/full/453562a...](http://www.nature.com/nature/journal/v453/n7194/full/453562a.html)

Alternatively, I'd be OK imagining that it was subconsciously chosen here not
at random, but because I used this reference for an example in the previous
thread.

Is South Park another journal worth reading? Are they open access?

------
oskarth
I think this is the first post that's a DH5 on pg's _How to disagree_ scale
([http://paulgraham.com/disagree.html](http://paulgraham.com/disagree.html)).
Not only that, the OP is charitable enough to explicitly state why it's not a
DH6:

 _Paul Graham wrote an article about an idea. The idea is generally correct -
bias in a decision process will be visible in post-decision distributions, due
to the existence of marginal candidates in one group but not the other. But
the math was wrong. // That's ok! Very few ideas are perfect when they are
first developed._

I'm not good enough at statistics to check that OP's math is sound, but _this_
is the mindset of a scientist. OP reasons rigorously, finds a way to salvage
the core insight and improves on it. As readers can see, it took quite a lot
of work and prior knowledge to do.

If I were pg I would consider putting a link to this post on both the
_disagree.html_ and _bias.html_ as a note for posterity.

~~~
leni536
To be fair, pg's scale isn't perfect. DH1-DH5 are things you can improve on if
you want to refute a given argument; DH6 is about choosing the "main" argument
you want to refute. There is nothing to improve between DH5 and DH6,
especially if you agree with the main argument but disagree with some minor
argument.

------
JDDunn9
I think this is a really good start for the most common types of bias. A few
counter-examples that might slip through the cracks of this test:

Only examining the sample without looking at the population of applicants has
its limits. Especially as multiple rounds of interviews become the norm,
filters that don't affect the distribution of outcomes will be missed. For
example, the
person screening resumes might weed out anyone with an ethnic-sounding name. A
different person, who is not biased, interviews the candidates. The quality of
the candidates accepted will be the same, but the number of minority
applicants will be smaller than it should be.

Measuring outcomes allows for external biases to distort the results. Start
with a company that is biased against women, so that the average female
founder is better than the average male. However, that same level of sexism
exists in the market, such that the company's performance is hampered due to
prejudice against the founder. The VC's bias would be hidden by the counter-
bias in the market.

------
nkurz
(comment reposted from the earlier submission that didn't catch on:
[https://news.ycombinator.com/item?id=10513574](https://news.ycombinator.com/item?id=10513574))

Hi Chris ---

In the earlier thread, it seemed like some people were reaching different
conclusions because they were using different definitions of "bias". I think
my working definition would be something like "there existed in the actual
applicant pool a subset of unfunded female founders who should have been
statistically expected (given the information available to the
VC's at the time of decision) to outperform an equal sized subset of male
founders who did in fact receive funding".

Alternatively (and I don't think equivalently?) one could reasonably take bias
to mean "Given their prejudices, if the same VC's had been blinded to the sex
of the applicants, they would have made funding choices resulting in higher
total returns than the sex-aware choices they actually made." I'm sure there
are many other ways of defining "bias". Could you define what would need to be
true for your test to show that "the VC process is biased against female
founders"?

~~~
yummyfajitas
My definition is the same as yours - it's exactly about the existence of
rejected women who are better than accepted men.

This particular test is terrible for VC since the min return in VC will always
be zero. But if you build a noise-sensitive version for something like college
admissions, what needs to be true is a) bias manifests as raising/lowering the
bar for one group relative to another, and b) both groups have a significant
number of members near the cutoff.

As an example of the type of bias this test would detect, consider
U-Michigan's point system [1]. An extra +1.0 GPA was added to black
applicants. I.e. an Asian person with 3.9 GPA and black person with 2.9 GPA
were equivalent. This would result in Asian people having a higher min GPA
than black people.

[1] They replaced the point system with vague human heuristics when the
supreme court said point systems can't be racist, but vague heuristics can.
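That signature is easy to see in a toy simulation. A sketch assuming a uniform GPA distribution and a 3.0 cutoff (illustrative numbers, not U-Michigan's actual process):

```python
import random

random.seed(1)

def admitted(gpas, bonus, cutoff=3.0):
    """Admit anyone whose GPA plus a group-specific bonus clears the cutoff."""
    return [g for g in gpas if g + bonus >= cutoff]

# Both groups are drawn from the same underlying GPA distribution.
group_a = [random.uniform(2.0, 4.0) for _ in range(2000)]
group_b = [random.uniform(2.0, 4.0) for _ in range(2000)]

admits_a = admitted(group_a, bonus=0.0)  # must clear 3.0 on their own
admits_b = admitted(group_b, bonus=1.0)  # effective cutoff of 2.0

# The group receiving the bonus shows a visibly lower minimum GPA.
print(min(admits_a), min(admits_b))
```

The minimum among group A's admits sits at the 3.0 bar, while group B's sits near 2.0, which is exactly the kind of gap a minimum-based test picks up.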

------
lpage
Maybe I'm missing something but this seems like a pretty perfect application
for the bootstrap - a remarkably intuitive but powerful framework. Without
loss of generality, imagine that you have two populations, A and B, and that
you want to test some hypothesis about a statistic of A being different from a
statistic of B (mean, in this case). Using the simplest form of the bootstrap
you would do the following:

1\. Pool and randomly label the data from A and B

2\. Sample with replacement and form two partitions of the same cardinality as
the original A and B groups

3\. Compute the difference in means

4\. Rinse and repeat millions of times to form a distribution of mean
differences

5\. Check if the observed difference in means (from the true A/B labels) is
statistically significant relative to the distribution found in (4)
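Those steps can be sketched directly (toy data; a real analysis would need far more care with sample sizes and tails):

```python
import random

def bootstrap_mean_test(a, b, n_iter=10000, seed=0):
    """Two-sample bootstrap test for a difference in means: pool the data,
    resample groups of the original sizes with replacement, and return the
    fraction of resampled differences at least as extreme as the observed
    one (a two-sided p-value)."""
    rng = random.Random(seed)
    observed = sum(a) / len(a) - sum(b) / len(b)
    pooled = a + b
    extreme = 0
    for _ in range(n_iter):
        fake_a = [rng.choice(pooled) for _ in range(len(a))]
        fake_b = [rng.choice(pooled) for _ in range(len(b))]
        diff = sum(fake_a) / len(fake_a) - sum(fake_b) / len(fake_b)
        if abs(diff) >= abs(observed):
            extreme += 1
    return extreme / n_iter

# Toy samples with clearly different means:
a = [5.1, 4.8, 5.3, 5.0, 4.9, 5.2, 5.1, 5.0]
b = [3.9, 4.1, 4.0, 3.8, 4.2, 4.0, 3.9, 4.1]
print(bootstrap_mean_test(a, b))  # small p-value: reject equal means
```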

This has some problems with fat tailed distributions but tends to work great
otherwise. It's so simple that it avoids a host of pitfalls that can arise
with other resampling schemes (what's being proposed is a type of resampling),
and I love that it makes basically zero assumptions on the underlying data.

------
graycat
Sorry, I still believe that a better approach is in

[https://news.ycombinator.com/item?id=10484602](https://news.ycombinator.com/item?id=10484602)

That post shows that what PG is doing is a first-cut effort at a statistical
hypothesis test but with being vague on assumptions and without any
information on false alarm rate.

In particular, in my post we get to compare sample averages without making a
distribution assumption. Indeed, it makes no distribution assumptions at all.

Yes, distributions exist, but that does not mean that we have to consider
their details in all applications!

Come on guys, this is distribution-free statistical hypothesis testing, and we
should be able to use that.

------
fauigerzigerk
So the alleged flaw in pg's reasoning is his assumption that the best
applicants from two large groups of humans should turn out to create equally
successful startups on average, if the selection process is not biased.

Is this really such an unreasonable assumption, given that pg restricts the
applicability of his bias test to groups of equal ability distribution and
that we can assume that both groups have roughly the same amount of capital at
their disposal?

The question is if the "equal ability" qualification is sufficient to make
sure the distributions are roughly similar. But that is not a mathematical
issue.

~~~
yummyfajitas
The point of the post is to relax the "equal ability distribution" assumption.
If the distributions are identical, any disparity in outcomes must be caused
by bias.

------
Tarrosion
Honest question: suppose I pick a plausible sounding h(x), perform the
proposed test, and get a vanishingly small p-value. So I feel pretty happy
about rejecting the null hypothesis. But the null hypothesis is a conjunction:
accepted members of group A have cdf a(x), accepted members of group B have
cdf b(x), and h(x) satisfies various technical conditions related to a() and
b(). So when I reject the null, I'm saying the data were unlikely to be
generated by the hypothesized process. Couldn't that simply be because I
guessed the wrong form of h( )?

------
tansey
Thanks for the write-up Chris. Now I understand why I couldn't follow the path
of logic you were laying out in our original discussion in PG's article's
comments.

The main problem I was having is that you are assuming our observation
variable is the latent skill or potential value variable (which you're calling
x here). However, the article by PG was talking solely about the average of
returns (let's call it y).

So the reason I was confused is that, assuming that the outcome of a startup
is dependent only on x, we are really observing y ~ f(x) = \int_0^1
g(x)h(x)dx, where h is your cut-off criteria for x, g(x) is some unknown
payoff distribution for a given skill level, and I'm assuming our x is in
[0,1] without loss of generality. So in essence, the real problem here, even
if you could see all of the individual returns for a given portfolio, is that
you have to perform a very, very difficult deconvolution problem. And I'm
pretty sure it's non-identifiable without some other information or additional
parametric assumptions.

Thinking out loud a bit, let's assume that y is actually log(return), where a
return of 1 is breaking even and 0 is losing everything. Since log(0) is
undefined, most startups return 0, and very few exit for less than 1, I would
think we could model this as a point-inflated normal distribution: p(y) = c *
\delta_0 + (1-c) * N(\mu, \sigma^2). Given this, we could then model our
latent parameters (c, \mu, \sigma) as being functions of x. Since the model is
separable, we can even just look at the zeros and non-zeros in isolation. Then
we can come up with a test from there, but I'm not really sure what that test
would be at this point. Anyway, that's a completely different line of
thinking, but it seems much more tractable in practice.
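A sketch of that point-inflated model, with made-up parameter values, showing the separability: the zero mass and the normal component can each be fit in isolation:

```python
import random
import statistics

random.seed(42)

# Point-inflated model: with probability c the startup returns nothing
# (a point mass), otherwise log(return) is drawn from N(mu, sigma^2).
# The parameter values below are made up purely for illustration.
def sample_returns(n, c=0.7, mu=0.0, sigma=1.5):
    out = []
    for _ in range(n):
        if random.random() < c:
            out.append(None)                     # total loss: the point mass
        else:
            out.append(random.gauss(mu, sigma))  # log-return of a survivor
    return out

data = sample_returns(10000)
survivors = [y for y in data if y is not None]

# Because the mixture is separable, each piece can be fit in isolation:
c_hat = (len(data) - len(survivors)) / len(data)
mu_hat = statistics.mean(survivors)
sigma_hat = statistics.stdev(survivors)
print(c_hat, round(mu_hat, 2), round(sigma_hat, 2))
```

The fitted (c, mu, sigma) land close to the generating values, so modelling the zeros and non-zeros separately is at least internally consistent.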

------
igonvalue
It's since been dressed up with some mathiness, but this idea was originally
proposed in the comment threads of pg's original article. [0] See the
responses there for a few reasons why it just won't work.

To be concrete, assuming "performance" is measured as return on investment,
min(performance) will always go to -100% (i.e., bankruptcy) with a large
enough sample size.

[0]
[https://news.ycombinator.com/item?id=10484200](https://news.ycombinator.com/item?id=10484200)

~~~
eru
With a large enough sample size, someone might find a way to break the -100%
barrier.

------
wattle_park
Using maths/statistics to argue that YC is not biased against certain groups
of applicants is amusing. But to even consider that technical female founders
are weak candidates is disappointing.

The sample population was chosen by a specific group of partners, and there
are no female technical partners in that group. As a female technical founder,
I am not interested in building a 'tea-making bot', a sandwich-making bot, or
selling organic condoms. IMHO we have different views on looking at problems
and solving them. Without a female technical founder as a partner, YC will be
perceived as biased.

The algorithm for selecting promising candidates will vary once there is a
variety of partners.

------
tome
This still seems like nonsense to me. What if you reject each black candidate
(independently) with probability 0.5, and then proceed to perform a fair
interview process with all remaining candidates?

Surely the distribution of minimums would then be the same between all skin
colours, but you end up employing half the number of black applicants that you
should be.
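A quick simulation of this scenario (uniform skill and an arbitrary cutoff are illustrative assumptions) shows exactly that: essentially identical minimums, half the hires:

```python
import random

random.seed(7)
CUTOFF = 0.8  # the fair interview: accept anyone with skill above this

def hire(candidates, pre_reject_prob):
    """Coin-flip rejection first, then a fair skill-based interview."""
    survivors = [s for s in candidates if random.random() >= pre_reject_prob]
    return [s for s in survivors if s > CUTOFF]

pool_a = [random.random() for _ in range(10000)]  # identical skill
pool_b = [random.random() for _ in range(10000)]  # distributions

hired_a = hire(pool_a, pre_reject_prob=0.0)
hired_b = hire(pool_b, pre_reject_prob=0.5)  # half rejected up front

print(min(hired_a), min(hired_b))  # minimums are essentially identical
print(len(hired_a), len(hired_b))  # but group B gets half the offers
```

The minimum-based test sees nothing, because the coin-flip rejection is independent of skill; only the head counts reveal the bias.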

------
rdslw
Interesting tests that can reveal your current implicit (read: hidden)
attitudes toward race, gender, and color.

Worth doing, to discover a few facts about ourselves, even if they are
uncomfortable.

[https://implicit.harvard.edu/implicit/](https://implicit.harvard.edu/implicit/)

------
shoyer
I think the more fundamental flaw in PG's argument, which is just as present
here, is that it assumes the populations are otherwise identical. That's
obviously not the case -- there's no random assignment for bias -- so this
sort of test can't tell you anything direct about causation.

Any credible statistical test for bias should be framed in the language of
causal inference, e.g., as described by Judea Pearl:
[http://ftp.cs.ucla.edu/pub/stat_ser/r350.pdf](http://ftp.cs.ucla.edu/pub/stat_ser/r350.pdf)

~~~
yummyfajitas
This test explicitly does NOT assume the populations are otherwise identical.
See the graph right after Theorem 1 - it shows two unequal distributions
satisfying the assumptions of the test. That's the whole point.

------
ageofwant
Wow that's a really nicely rendered page.

TIL about [https://www.mathjax.org/](https://www.mathjax.org/).

------
dlss
Posting this, since no one seems to have pointed it out:

> So rather than comparing mean performance, we'll compare minimum
> performance.

__1.__ This is a useless metric for startup investors to use, since (almost
surely) the minimum performance in _every_ group of reasonable size will be 0
(the startup went out of business)... and this will be true even if the
investor is biased.

__2.__ The maximum statistic was rightly avoided here because for power-law
distributed values (which startup returns are), you'd need to know the
population sizes to estimate whether the distribution of {A} was different
from the distribution of {B}.

If you're willing to take on faith that both A and B have the same
distribution, then the test is easy: is the acceptance rate for As
significantly different from the acceptance rate for Bs? If you've invested in
more than, say, 100 startups, you have a big enough sample to check this...
though it requires knowing the size of the application pools, and who was
accepted.
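That acceptance-rate comparison is a standard two-proportion z-test; a minimal sketch with hypothetical counts:

```python
import math

def two_proportion_z_test(accepted_a, total_a, accepted_b, total_b):
    """Two-sided z-test for whether two acceptance rates differ."""
    p1, p2 = accepted_a / total_a, accepted_b / total_b
    pooled = (accepted_a + accepted_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (p1 - p2) / se
    # two-sided p-value via the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical counts: 80 of 1000 male applicants accepted vs 40 of 800 female.
z, p = two_proportion_z_test(80, 1000, 40, 800)
print(round(z, 2), round(p, 3))
```

With these made-up numbers the difference is significant at the usual 5% level, but of course the whole exercise rests on the "same distribution of applicants" assumption above.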

__3.__ I believe that in general it's not possible to determine bias from the
kind of aggregate statistic pg is discussing without at least some knowledge
of the sample space.

For example, using OP's method, you will find that almost every selection
process in the world is biased _for_ you if you divide the world as {you} vs
{non-yous} (you're doing significantly worse than the best non-you). And find
that almost every selection process in the world is heavily biased _against_
you if you use the minimum statistic (you're doing significantly better than
the worst non-you). This is also true for smallish groups (eg {your friends}
vs {not your friends}).

The same is true for PG's method -- it's highly unlikely that {you} fall
exactly at the average value of {non-yous}, or that {your friends} fall
exactly at the average of {not your friends}.

__4.__ I believe that the math here is distracting from the core question.

Core question 1: Do men and women on average make the same choices?

If you believe that, then determining bias is easy: we already know who the
investor funded. Is the number of men the investor funded different from the
number of women? Yes? Then the investor is biased. This is much more direct
than the kind of forensic accounting pg is proposing.

I suspect that pg didn't propose this test because pg doesn't believe that men
and women on average make the same choices. He knows, for example, that the
number of female applicants to YC is different than the number of male
applicants (a gendered difference in behavior). Google "men and women career
choices" or similar if you're interested in learning more, or better yet, read
some first person accounts from FTM men about the cognitive effects of taking
testosterone.

Since it's clear that there's a gendered difference _before_ applying to YC,
it seems very difficult to justify an assumption there would be no gendered
difference in behavior _after_ applying to YC (or any other investment firm,
FirstRound in this case). Given that, the question we were asking becomes much
more confusing... a simple bias towards ideas and plans you understand/agree
with/are excited by is a gender bias in as much as your gender caused you to
like the idea or plan. Removing that bias (supporting plans you understand
less, agree with less, or are less excited by) seems like an obviously bad
idea.

Returning to the problem: if we accept that this sort of "makes sense to me
bias" can be observed when looking for gender biases, we are left in a really
hard place. That bias seems to be both a good thing, and confounds the entire
analysis. Unless you've controlled for the "makes sense" bias, such analysis
will apply pressure for investors to waste money from their perspective. This
seems obviously bad.

Core question 2: which biases do we want investors to have?

Investors who knowingly pass up good opportunities on the basis of the
founder's gender are punishing themselves worse than any company they pass
over -- their competitors who aren't gender biased will get higher returns,
and so will have more money to invest in the future. This is to say that
gender biases for startup investing are self-correcting. The investors already
have their self-interest maximally aligned with not being sexist.

I don't pretend to know which companies are worth investing in more than any
other smart technologist. I also don't pretend to know to what extent gender
differences cause differences in returns, so my answer is: investors should be
as biased (selective about investing) as they see fit. Startups are positive
sum for society, and anyone who can find a way to fund more of them profitably
is making the world better.

In large part, this is because I find it very unlikely that any modern
investor is knowingly sexist -- I think it's much more likely that the sort of
"makes sense" bias I discuss above is at play.

Of course, this is an early thought that came from first principles, so
counter-arguments are solicited. Perhaps there is something deeply evil about
passing over startups you don't feel comfortable investing in (assuming that
comfort has any correlation with founder gender), or perhaps there's some easy
fix which makes previously dicey-looking ideas from {other-gender} founders
look like obviously good investments. (If you know what that idea is, I'd love
to know it too.)

__5.__ Thanks to both pg and Chris for the fun math/philosophy problem. :)

------
wavegeek
This kind of thinking could be problematic. What would happen if someone
compared the performance of whites and persons of African heritage at college?

~~~
dlss
They would detect if the acceptance criteria were biased. This is a good
thing, since after you've measured something you know if a change is in order.

------
LordHumungous
Lots of math in here premised on shaky foundations:

>Group A comes from a population where the chance of graduation is distributed
uniformly from 0% to 100% and group B is from one where the chance is
distributed uniformly from 10% to 90%

>The mean of group B is not lower because of bias (which would be reflected
near x=80), but because the very best members of group B are simply not as
good as the very best members of group A.

Yes, if we can assume some a-priori knowledge about certain "groups" of
people, then we can make a more "informed" decision. That's pretty much the
definition of bias, isn't it? Paul Graham's point, as I understood it, was
that those assumptions are often invalid. Therefore, bias could cause the
market to under value someone or some company. Your counterpoint seems to be,
"let's suppose those biases are legitimate."

~~~
nshepperd
Read it again. He's talking about the counterexample there. It's a
hypothetical.
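The quoted hypothetical is easy to check numerically. A quick simulation of that setup (group sizes and cutoff arbitrary): both groups face the same bar, yet the post-selection means differ while the minimums agree:

```python
import random

random.seed(3)
CUTOFF = 0.8  # the same bar for both groups: no bias in selection

group_a = [random.uniform(0.0, 1.0) for _ in range(100000)]  # A: uniform 0-1
group_b = [random.uniform(0.1, 0.9) for _ in range(100000)]  # B: uniform 0.1-0.9

accepted_a = [x for x in group_a if x >= CUTOFF]
accepted_b = [x for x in group_b if x >= CUTOFF]

mean_a = sum(accepted_a) / len(accepted_a)
mean_b = sum(accepted_b) / len(accepted_b)
print(round(mean_a, 3), round(mean_b, 3))  # means differ despite no bias
print(round(min(accepted_a), 3), round(min(accepted_b), 3))  # mins agree at the cutoff
```

The mean-based test flags this unbiased process as biased; the minimum-based test correctly does not, which is the whole point of the article's counterexample.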

~~~
LordHumungous
Yes I get that. The crux of his argument is:

>Unfortunately, using the mean as a test statistic is flawed - it only works
when the pre-selection distribution of A and B is identical, at least beyond C

His argument is based on the proposition that different sexes/races have
different market-value profiles. He needs to demonstrate why that is the case
before proceeding to the heavy math.

~~~
zodiac
> His argument is based the proposition that different sexes/races have
> different market value profiles. He needs to demonstrate why that is the
> case before proceeding to heavy math.

Not really, his argument is that "PG's mean-post selection test (the 'PMST')
is only valid if different sexes have the same distribution of abilities". If
you or PG believe that the PMST is a valid way of showing that bias exists,
the burden is on you to show that different sexes have the same distribution
of abilities.

~~~
LordHumungous
From the Paul Graham article:

>You can use this technique whenever (a) you have at least a random sample of
the applicants that were selected, (b) their subsequent performance is
measured, and __(c) the groups of applicants you're comparing have roughly
equal distribution of ability.__

So yes, OP is ignoring the entire premise of PG's argument.

~~~
zodiac
OP acknowledges this... "Unfortunately, using the mean as a test statistic is
flawed - it only works when the pre-selection distribution of A and B is
identical, at least beyond C."

To me, the rest of the article asks the question: "requirement (c) is really
strong; is there a way we can use post-selection statistics to determine bias
while weakening (c)? What if we tried measuring the post-selection minimum
instead of the mean?"

Also PG edited his essay to add that disclaimer only after WildUtah's comment,
so it's possible that OP hasn't read the updated version.

