
Probability of acceptance of pull requests - paulmillr
http://paulmillr.com/posts/github-pull-request-stats/
======
famousactress
The title is a much more cynical take on the data than I have looking at it. I
have some familiarity and experience with Django, one of the projects that has
a pretty low pull request acceptance rate from his sample. At 29% I was
impressed that many pull requests were accepted, frankly! I think the signal
to noise is pretty high on pulls to high profile projects.

It's easy, and common I think, to throw a small change into a pull request and
fire it off without confirming it's desired or within the objectives of the
maintainers. I've done this at least once w/ Django personally. My thought
process was "This may be undesired, but if it _is_ desirable the chances of it
making it in without bikeshedding in a mailing list is way higher if I just
lead with a pull request."

In my case it wasn't desired (I think it was a settings.py var that allowed
users to configure auto-import of models in the shell, dropped cause it's
implicit over explicit and adds to settings.py).

Add to that all of the pull requests that are effectively proof of concepts
because the implementation fails to take some major stuff into account, and an
acceptance rate of almost a third starts to sounds pretty good to me.

~~~
tytso
It's so much better to send these sorts of proposal via e-mail. It's much
easier to review the patches via e-mail, and everyone on the development
community can make suggestions. Even so, it's not uncommon for patches to
require three or four versions, and not just by first-time contributors. There
are patches written by folks who have been working on ext4 for many years
which still require three or four or more revisions after being subjected to
multiple rounds of peer review.

If you show me a git project where most of the pull requests are accepted,
I'll show you a git project whose quality assurance is probably completely out
of control.

~~~
jjh42
I think that depends a lot on the project. Projects that heavily use github
tend to prefer a github pull request.

It makes it easy to comment on the diff and integrates with Travis CI
(integration tool) and other tools.

It is easy to make multiple changes to the pull request before it is accepted
so nothing about this system implies that pull request are always accepted
without changes.

~~~
jedbrown
It's easy for the _original author_ to make changes to the PR, but not easy
for the maintainer or a third party to make changes. The micro-pedantry
associated with spelling errors, tweaking commit messages, and trivial
spacing/formatting issues can be more work for everyone when communicated
through comments on PRs with the original author expected to apply and re-
roll.

~~~
bad_user
That's only partially true.

Pull requests on GitHub have to have an associated branch or fork in another
repo. Even if you're not the original author, if you have access to it, you
can do "git checkout" or "git clone", then you make the required changes and
then on push the commits are automatically included in the pull request.

The only problem is that the maintainers do not have write access to the
forked repositories of new or casual contributors. So if you're in a rush to
change something, you have to fork that branch, make changes and then merge
it, while ignoring the original pull request (although if you place the issue
number in the commit messages you push, they'll get referenced on the
associated page).

However, the commits logs made by the wannabe contributor are still in the
history, so proper credit/blame is given where it is due, unless you rebase of
course. And this isn't something you can say about diffs sent by email.

Also, GitHub sends you emails on pull requests. And you can reply by email
too.

What I like about GitHub Pull requests is that you have a link you can give to
people for review or that you can post wherever, a link that gives you a
descriptive diff of what happened. Plus, the discussions around that pull
request are left there, included in that link, left for everybody to see,
instead of turning into a private conversation that nobody else will know
about (incidentally, it's also the reason why I think native apps will never
beat web apps, as long as native apps don't have URLs for referring to content
or state within the app). This to me is useful enough to prefer it over diffs
by email.

EDIT: there is one thing about GitHub I don't like - you cannot give write-
access to people, while preventing them from pushing to master. If this where
possible, you could give everybody access to push, except for master, in which
case the above mentioned issue with pull requests would be a non-issue!

~~~
tytso
The problems with keeping the original commit and then putting the fixes on
top are, (a) it makes the history harder to follow, (b) it makes it easy for
someone to screw up a cherry-pick of a bug fix to a maintenance brance, since
now you have to cherry-pick the bug fix plus the fix(es) to the bug fix, and
(c) if the original patch had a potential bug that might cause the system to
crash or otherwise malfunction, even if you fix it in the subsequent commit,
it makes "git bisects" more likely to yield false positives, or at least make
things more confusing and more difficult than it has to be.

------
dmethvin
I can't speak for the other projects, but the data is completely wrong and
misleading for jQuery. It looks like this code only counts a pull request as
merged if it's landed with a merge commit. It's much more common for us to
touch up the pull request (for example, fixing small style or commit message
issues), then rebase, possibly squash some commits, and land via a fast-
forward commit.

So this is "accepted": <https://github.com/scalatra/scalatra/pull/267>

But this is not: <https://github.com/jquery/jquery/pull/1255>

~~~
jedbrown
The hosting sites need to start providing a way for a maintainer to update or
supersede a PR, as well as a way to track prior versions of commits in a PR
after the PR has been updated post-review.

------
kevingadd
Vaguely related anecdote:

I had a really bad experience trying to use node.js for a project. Eventually
I figured out that the module I needed was just broken, so I decided to at
least update the documentation. As it turned out, not only was it not
documented how to update the documentation (...) but I had to go back and
forth on undocumented style guide rules (for documentation that gets displayed
in the browser anyway!) and then sign the CLA, all to update a single line of
text in the docs to say 'don't use this module, it's broken'. It literally
would have taken less time for them to make the change themselves.

This soured me even more on node.js than actually using it had. It's bizarre
that project maintainers would actively put so many walls between themselves
and valuable (or not-valuable, for that matter) code contributions.

I can only assume that the legal downside to not having a CLA signed for every
single-line commit is enormous. Are there previous court cases in the US where
not having a CLA got someone destroyed?

~~~
Argorak
> I can only assume that the legal downside to not having a CLA signed for
> every single-line commit is enormous. Are there previous court cases in the
> US where not having a CLA got someone destroyed?

I don't know of any court cases, but I do know of several projects that were
not able to change their license because they didn't have a CLA and couldn't
contact all previous contributors. Thats the right that section 1 of the
nodejs CLA gives them.

Fun fact: nodejs reserves the right to use your contribution under "(b)
binary, proprietary, or commercial licenses".

------
neil_s
The vertical labels on the charts reduce readability greatly! Perhaps angle
them a bit and sort from highest to lowest so you can just scan down the
labels and know where they stand without having to look back up at the bar
height. Or simply flip the axis.

------
magoon
I'm surprised the acceptance rate is so high.

~~~
JeremyMorgan
thats what I was thinking! I suspected many more of them to be junk

------
mmanfrin
Dear Mods: This time you've edited the title to something other than the post
title. Can you clarify what the rule is? Should the title be informative, or
the title of the post?

------
jtchang
What is a CLA and why would a project want one / not want one? They seem like
another hurdle for contributors?

~~~
coherentpony
<http://lmgtfy.com/?q=code+CLA>

~~~
publicfig
That is such a lazy and non-helpful answer, especially considering that the
poster didn't just ask what it was.

------
gresrun
The curious question is why certain projects have significantly lower pull-
rates; I doubt the quality of the pull requests vary that widely across
projects.

Is it that project leadership has a clear vision for a product and feel pull
requests are a distraction? If so, share the vision and enlist willing
developers to help achieve that vision.

Is it that project leadership has a high-standard for the codebase? If so,
explicitly state the standard and provide constructive feedback to pulls that
don't quite measure up.

I guess my point is: project leadership has the ability to leverage a
tremendous pool of talent that is interested in their project using the pull
request mechanism. It does take effort and planning but the net result is so
worth it to all parties involved.

~~~
encoderer
This is a complicated problem, and one that I'm excited to think could be
improved in the coming years as Github et al have the time and resources to
invest in it.

I have a few thoughts on the topic and on your comments:

1\. I actually DO think that the quality of pull requests varies widely from
project to project. You dismiss this in your first sentence but I don't think
you should. I think there are a lot of reasons for this, but many can be
reduced to: Some projects attract more inexperienced developers than others.

2\. I don't think you can simply indict the project maintainers for not
optimally leveraging the talent pool of volunteer contributors. While Open
Source has been validated a million times over as a truly successful
movement/concept/etc, nobody has ever accused it of being efficient.
Specifically, most pull request submitters act on their own, without
coordination between submitters or with project maintainers. It's not just an
issue of coding standards, it's also things like the maintainers having a
roadmap that includes bug fixes, refactoring, new features, etc, and that may
conflict with Joe in Idaho and his pull request that he did to scratch his own
specific itch.

In many ways community submitted pull requests are a "million monkeys at a
million typewriters."

3\. The most successful way of getting a pull request accepted, I think, is to
first raise a discussion on the issue on the projects mailing list or issues
forum. Link to your in-progress patch, say "I have this fix here for problem
X, can anybody take a look and tell me if I'm on the wrong track with this?"
So many of us are code-first, conversation-later types and that constantly
conflicts with the social nature of our job.

I guess my takeaway here is that there is a lot of room for interesting
technology solutions to this problem, in the areas of:

1\. Tools that help maintainers cultivate the submitters talent pool on their
project.

2\. Tools and workflows that help submitters through coordination. Totally a
half baked idea but suppose if I'm fixing a bug in a function somewhere,
Github can help me discover what is happening to that function elsewhere.
Maybe it's been radically changed in a feature branch owned by the project
maintainers, maybe it's already been fixed nearly the same way by another
submitter who has an open pull request, maybe it was already submitted by
somebody and rejected.

3\. Static analysis that can power those tools and more. Would love a
meaningful quantative quality score on the pull request submission page. Like
an additional tab beside the "Commits" and "Diff" tabs. Maybe it could run
various analysis on the code and show formatting issues, known
vulnerabilities, known bugs, etc. Obviously the abilities here vary from
language to language.

~~~
gresrun
> Some projects attract more inexperienced developers than others.

Fair.

> In many ways community submitted pull requests are a "million monkeys at a
> million typewriters."

I strongly disagree with this assertion. To argue that community cannot be
lead or directed toward a common goal is very pessimistic. Sure, Joe in Idaho
has a specific itch, but there's a good chance there is an opportunity to
capture Joe's enthusiasm and willingness to do work on the project by working
with him instead of treating him like a code-monkey at a keyboard. Community
relations is hard because dealing with people is hard.

> 1\. Tools that help maintainers cultivate the submitters talent pool on
> their project.

Yes, I agree that this is someplace where some tools would help tremendously.

> Github can help me discover what is happening to that function elsewhere.

Great idea. I'd love to see a "related diffs" tool for a given piece of code.

> So many of us are code-first, conversation-later types and that constantly
> conflicts with the social nature of our job.

I totally agree. Ultimately, tools will only get us so far; people skills must
fill in the rest. Making "Joe" feel welcome and a part of the creative process
requires measures of patience, kindness, gentleness and mentorship.

------
thezilch
Is it possible to switch the axis and set the maxima at 100%? I have to re-
calibrate for each graph, or 30% almost looks like 50%. And it's hard to read
the rotated labels, but the percent on the horizontal axis would not need to
be rotated.

~~~
paulmillr
As for maxima — yeah, definitely. Done.

~~~
thezilch
Great.

Just to expand on the switching of axis. It's harder to compare data between
graphs, but upon second inspection, it appears the colors from the first graph
are not used in the secondary graphs.

Second, trying to get to the projects in the first graph, it's easier to exit
the graph from the bottom. Otherwise, short bars (eg. node) are hard to
"select" for the "Current" line, and in that case, I have to make a longer
trip around the graph to get to the link. Maybe link the labels? Move the
"Current" line underneath?

I enjoyed it nonetheless. If you're considering expanding or having a follow-
up piece, I'd enjoy combining the three graphs with the format / filtering at
<http://www.techempower.com/benchmarks/>.

------
chrismorgan
The conclusions don't seem reasonable to me. Higher acceptance rate is _not_
necessarily better; it can also be that the quality of pull requests is too
low: that the developers are instead discriminating in what they let into
their project, and that the project benefits as a result of this.

Using it as a project quality metric is in this way similar to lines of code:
higher doesn't necessarily mean better.

------
Timothee
What would be interesting to see would be merged vs. open vs. closed (as in
rejected), and also the typical delay between the opening of the PR and the
merging/closing.

Some projects are fast to close PRs if they don't fit their goals, some don't
attend to the PRs at all and you have PRs lingering in the queue forever.

Two projects could have a similar acceptance rate with very different
attitude: one could be on top of things but picky about what they merge in,
while another is just not attending to the list except for the odd PR now and
then. It seems that a project like the former would still be more interesting
to participate in just because you could expect more feedback on what is
wrong.

Regarding the presented data, I'm also surprised that it's that high for some
project. With the way GitHub works, it's easy to create a pull-request without
consulting anybody in the original project, so it's surprising that so many
would match the projects' intentions.

------
caf
What would increase the accept rate greatly would be if there was a function
to accept-pull-into-new-branch, which creates a new branch and applies. The
maintainer can then apply the necessary changes/corrections/documentation
updates, get the branch tested, then merge it back if all goes OK.

~~~
jedbrown

        $ git fetch url-or-remote their-branch-name:my/better-branch-name
        $ git checkout my/better-branch-name
        $ hack, commit, etc

~~~
caf
I'm aware you can do that with native git, of course - but that leaves the
original pull request on github appearing "unapplied", doesn't it?

I meant that it would be useful to add a function for that workflow in the
github UI.

~~~
jedbrown
If you edit commits in the branch, then yes, it appears as declined in the UI,
which looks hostile when they are notified. I write a friendly message every
time I do this, but it sucks to have to give that explanation and the UI
metadata is still wrong. I asked for this on bitbucket, though it's currently
on hold [1].

If you don't edit the existing commits, then it will show up in the
github/bitbucket UI as merged when you eventually merge, regardless of what
games you play with branches.

[1] [https://bitbucket.org/site/master/issue/6704/supersede-
pull-...](https://bitbucket.org/site/master/issue/6704/supersede-pull-request)

~~~
plorkyeran
One way I've worked around this is to merge the original pull request, then
immediately force-push the fixed version. Still doesn't fix all of the UI
metadata, but at least the pull request gets marked as merged.

~~~
jedbrown
Only works when _you_ have push access to the repository that is the source of
the PR. Normally maintainers don't have access to the repositories for all
contributors.

~~~
plorkyeran
Er, no? You merge the unfixed version via the web UI (to get the PR marked as
merged), then immediately force-push the fixed version to your _own_ repo, not
their's.

~~~
jedbrown
Oh, I thought you meant updating the pull request by force-pushing on their
branch. Your solution is racy, thus only acceptable for projects small enough
that the likelihood of someone pulling during that window is sufficiently low.

~~~
plorkyeran
It's definitely not an option for very large projects, but projects small
enough to have a very low chance of someone pulling in a several second window
describes the vast majority of projects (and in general by the time it gets
unsafe I'd expect fixing up incoming pull requests to be impractical time-wise
anyway).

------
tjbiddle
Very great analysis! I remember reading an article about an experiment someone
took in bringing up the contribution in his project - And that was to add
users as collaborators any time someone made a successful contribution.

I've been doing this with all of my projects so far (Granted, I do not lead
any interesting projects) but a small one I re-did recently has already gotten
additional activity from one contributor since I added him on.

------
icambron
The question this raises for me is what the acceptance rate is as a function
of the number of pull requests, or of the number of stars. A scatter chart
would be cool here. I strongly suspect the acceptance rate goes down as the
project gets bigger. That's some combination of the developers having less
time to field all the requests and the more popular repos getting a lot of
junk of PRs (anecdotal observation).

------
_pmf_
There's nothing bad about about a high rejection rate. Most likely there is a
strong correlation between the quality of code and a high rejection rate.

------
Blahah
Great analysis. Just a couple of data visualisation niggles:

1\. Please sort the bars in your barplots by descending y-value. This makes it
much easier to see ranking and to compare any two bars. In plots where bars
are coloured by group, it also gives an intuitive impression of group
performance.

2\. Please include error bars (sem) on the plots where bars represent the
mean, so we can understand how representative the mean is of the group.

------
jared314
It would also be interesting to track the distribution of those pull-requests,
with respect to user who submitted it. With the current setup, projects that
are only willing to accept pull-requests from known committers, and those who
are more open to new comers look identical.

------
zalew
great analysis. I thought big projects have a lower acceptance rate than
smaller ones in general, but when you take a look at django vs flask to rails
vs sinatra, it doesn't seem to be the case.

one nitpick: next time rotate the diagrams +90' please :)

~~~
coherentpony
> one nitpick: next time rotate the diagrams +90' please :)

What? No. The x-axis is the independent variable and the y-axis is the
dependent variable.

~~~
zalew
tell that to my neck.

~~~
coherentpony
Now you can tilt the other way --> :)

------
suredo
switch X and Y? I can't read this without turning my head 90 degrees

------
jonezy
that's a fantastic write up, thanks!

