
Understanding the Factors That Impact the Popularity of GitHub Repositories - nextjj
https://arxiv.org/abs/1606.04984
======
minimaxir
As someone who owns a repository with 11,529 stars
([https://github.com/minimaxir/big-list-of-naughty-strings](https://github.com/minimaxir/big-list-of-naughty-strings)),
I notice a few problems with the logic presented.

I happen to have a scraper handy for tracking stars over time, and profiling
the users who made those stars
([https://github.com/minimaxir/get-profile-data-of-repo-starga...](https://github.com/minimaxir/get-profile-data-of-repo-stargazers)).
Here is a chart of the daily number of Star events on the
big-list-of-naughty-strings repository as of today:
[http://i.imgur.com/NzzjuKK.png](http://i.imgur.com/NzzjuKK.png)

The paper assumes that popular repositories are popular on their own merit.
But as you can see from my chart, there are clear spikes, one when the
repository was first released, one a few days afterward, and one in February
for _literally no reason_. The causes of those spikes were Hacker News,
/r/programming, and /r/webdev respectively.
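
For readers who want to look for this kind of spike in their own data, here is a minimal sketch. It assumes you already have one ISO 8601 timestamp per star event; the function names, the sample timestamps, and the "5x the median" threshold are all hypothetical choices for illustration, not what the scraper above does.

```python
from collections import Counter
from datetime import datetime
from statistics import median


def daily_star_counts(timestamps):
    """Count star events per calendar day from ISO 8601 UTC timestamps."""
    days = (
        datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").date() for ts in timestamps
    )
    return Counter(days)


def spike_days(counts, factor=5):
    """Return the days whose count is at least `factor` times the median.

    Days with zero stars never appear in `counts`, so the median is taken
    over days that had at least one star (an assumption of this sketch).
    """
    baseline = median(counts.values())
    return sorted(day for day, n in counts.items() if n >= factor * baseline)


# Hypothetical sample: a quiet week with one day of outside exposure.
sample = (
    ["2016-01-01T09:00:00Z"] * 1
    + ["2016-01-02T10:30:00Z"] * 2
    + ["2016-01-03T08:15:00Z"] * 1
    + ["2016-01-04T14:00:00Z"] * 12
    + ["2016-01-05T16:45:00Z"] * 2
)
counts = daily_star_counts(sample)
print([d.isoformat() for d in spike_days(counts)])  # ['2016-01-04']
```

A spike flagged this way says nothing about its cause; matching the date against submissions on Hacker News or Reddit still has to be done by hand.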

Here's another one of my repositories currently at 458 stars
([https://github.com/minimaxir/facebook-page-post-scraper](https://github.com/minimaxir/facebook-page-post-scraper))
which only got exposure after a Show HN last Thursday:
[http://i.imgur.com/8qpTrrb.png](http://i.imgur.com/8qpTrrb.png)

There's also a huge selection bias problem with only looking at Top
Repositories. While understandable from a scalability standpoint, it's also
possible that repositories are popular _because of good marketing_.

~~~
moron4hire
Correct. A lot of disaffected programmers think "marketing" is a dirty word
and they will have no hand in it. It's a type of magical thinking that forgets
the movie Field of Dreams ("if you build it, they will come") is about a
literal miracle. There's really no such thing in the real world.

"Advertising" open source projects doesn't typically involve taking out ads on
Facebook, but you still need to perform the marketing task. Get out to events
and stand up as a speaker--event organizers are always desperate for content.
Write blog posts. Make videos demonstrating how to use your tech.

If you don't care about growing your project beyond yourself, sure, feel free
to skip these things. But if you're really serious about growing your project,
marketing should take about half your total time on the project.

~~~
crdoconnor
Speaking at events really doesn't seem that effective to me at gaining
exposure. One post on reddit would net 25-30x the number of stars that a
conference/large meetup gig would.

It's more useful for reaching out and making stronger real life contacts with
people who might be interested in using your software, but the net isn't cast
nearly as wide.

~~~
moron4hire
On multiple occasions for various projects over the years, I've received
several thousand views to my sites from a Show HN or a Reddit post. In
comparison, doing a talk is "only" 25 or so views. If all you care about is
views, clearly posting online is better.

But I care about people actually using my software. And giving a talk usually
ends with 3 or 4 people actually trying my project. Or if I'm running a class,
it's 30 or 40. In the last two years, I'm pretty sure there is _one_ person
from all the Show HNs and Reddit posts that is using my project. From even the
small number of local talks I've given, I have a small cohort of people not only
using my software, but giving me feedback on it as well. We see each other
once a month, at least. It's one of the greatest feelings ever to talk to a
person, face to face, who is excited about your project and hasn't run away
even after using it.

Also, I can't guarantee that a post I make will get to HN's front page where
it will get so many views. But at this point in time, I get asked to do a talk
about once a month. If I really put the effort in, I could probably be doing
talks once a week.

And all those thousands of HN viewers never once offered to sponsor the
continued development of my project. But my progress on giving in-person talks
got me in front of just the right people and now I can work on my project
full-time.

In a similar situation, my wife has a couple of sci-fi novels that she self-
publishes online. The vast, vast majority of her sales have been in-person at
book fairs. People at book fairs are ready to buy. They are there because they
want to spend money. It's an easy sale. Online, you have to intrude on people
and fight against the million other people self-publishing.

Yes, you cast a much wider net online. But it's a net with big holes and for
much, much smaller fish. Follow the numbers, the important ones, not "views"
or "stars" or other such things that aren't "cold, hard cash". The cash says
in-person marketing is significantly easier to execute than online.

------
biot
I skimmed through this, but their conclusions are far too simplistic to be of
any use. A company repo is more popular than a personal one? Are they implying
forming a company around an unpopular repo will make it popular? It's likely
that if there's a company behind it then it's being done as paid work, that it
has specifications needed to interop with something else, with management
behind it driving quality, perhaps to a deadline. Second, saying more
contributors equates to greater success is a tautology. Or are they suggesting
one can simply give strangers commit access and that alone will determine
success? More likely, success and popularity attract contributors. And repos
tend to get more stars after a release, so if one releases every hour they'll
get the most stars, right? Fix a typo, it's version 79.0. Document a method,
it's version 80.0. Or maybe it's announcements around a release that serve as
marketing?

Lots of correlations in their conclusion, but no causation. The paper could be
shortened to "Have more visible activity in your repo".

~~~
mlinksva
> Second, saying more contributors equates to greater success is a tautology.
> Or are they suggesting one can simply give strangers commit access and that
> alone will determine success? More likely, success and popularity attract
> contributors.

They don't say this. They note that the measure of popularity they're using
(stargazer count) is weakly correlated with commits and contributors.
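
To make "weakly correlated" concrete, here is a rough sketch of a rank correlation (Spearman's rho) between star counts and commit counts. Which statistic the paper actually uses is not stated in this thread, so treat the choice of Spearman as an assumption; the repository numbers below are made up for illustration.

```python
from math import sqrt


def average_ranks(values):
    """Assign 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical repositories: (stars, commits). Not data from the paper.
stars = [120, 4500, 800, 15000, 60]
commits = [300, 250, 900, 1200, 40]
print(round(spearman(stars, commits), 2))
```

A value near 1 or -1 means the two rankings largely agree or disagree; a value near 0 is what a weak correlation looks like. Either way, the number alone says nothing about which quantity drives the other.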

> And repos tend to get more stars after a release, so if one releases every
> hour they'll get the most stars, right?

They certainly don't draw that conclusion.

~~~
biot
Their conclusion states "... confirming the importance of a large base of
contributors to the success of open source software".

Also note that my questions are facetious and are my own. My apologies if that
wasn't clear.

Overall, it's a good study of properties that popular repositories have. In
terms of factors that impact the popularity, I found it lacking.

------
carsongross
Direct PDF link:

[https://arxiv.org/pdf/1606.04984v1.pdf](https://arxiv.org/pdf/1606.04984v1.pdf)

------
danso
While there is a mention of Hacker News, I was surprised to see that the
authors apparently hadn't attempted to correlate popularity with Reddit links.
Anything that I've had become remotely popular is because it got a few upvotes
on HN or Reddit...and anecdotally, I've seen Reddit/HN launch libraries into
the 1000+ star group, even if the library is relatively niche (hell, I'll
star things that look cool and had interesting discussion, even if it's likely
I'll never clone/fork the repo).

Meanwhile, libraries that were ubiquitous by the time Github became popular
have relatively few stars. The most prominent example in my mind is ruby/rake,
which has just 627 stars:
[https://github.com/ruby/rake](https://github.com/ruby/rake)

~~~
sdesol
> ruby/rake, which has just 627 stars

ruby/rake is in maintenance mode, so there really isn't a lot of reason for
people to notice it or even star it. You can see how active it is, in the
screenshots below:

[http://imgur.com/a/cYYwn](http://imgur.com/a/cYYwn)

And as the last screenshot shows, in the last year, they only changed 104
files with 227 commits from 23 contributors. And if you look at the churn
graph, there wasn't a lot of churn between 2011 and 2015.

Now compare this to something like GitLab (18,000 stars), which is my go-to
repo for showing high rates of activity. This is their churn for the last 7
days, which doesn't count added/deleted files and changes by merge commits.
The reason for not counting added/deleted files is that I was told they are
doing a lot of restructuring.

[http://imgur.com/a/2A1VM](http://imgur.com/a/2A1VM)

In the last 7 days, they changed 505 files with 390 commits, from 35 different
contributors. In one week, they doubled ruby/rake's activity for an entire
year. What would be interesting to know is, of the 35 contributors, how
many are GitLab employees.
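
Putting the two commit counts on the same weekly footing makes the gap easier to read. This is only a back-of-the-envelope sketch: it assumes a 52-week year, uses only the commit counts quoted above (files changed and contributor counts don't divide cleanly across weeks), and ignores that the GitLab figure excludes merge commits.

```python
WEEKS_PER_YEAR = 52  # assumption: treat "the last year" as 52 weeks


def commits_per_week(commits, weeks):
    """Average number of commits per week over the given period."""
    return commits / weeks


rake_rate = commits_per_week(227, WEEKS_PER_YEAR)  # ruby/rake, last year
gitlab_rate = commits_per_week(390, 1)             # GitLab, last 7 days

print(f"ruby/rake: {rake_rate:.1f} commits/week")
print(f"GitLab:    {gitlab_rate:.1f} commits/week")
print(f"ratio:     {gitlab_rate / rake_rate:.0f}x")
```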

I would also love to analyze GitHub's and Atlassian's Bitbucket development
repos. I can't imagine they are iterating at the pace that GitLab is and I
have yet to find another open source project that is going at their rate.

------
dre85
I also found both the methodologies and the conclusions to be very simplistic.
They conclude that repository age isn't a factor and then give apple/swift as
an example. I don't use swift, but my understanding is that it's been an ultra
popular language for a long time, but only recently got open sourced. I don't
think they account for project existence before GitHub appearance which may be
a significant factor for a good amount of top projects...

------
HillaryBriss
This is from the conclusion section of the paper:

 _We also reported the existence of a strong correlation between stars and
forks, a week correlation between stars and commits, and a week correlation
between stars and contributors (RQ #2)_

Ok. Yes, this is a lame thing to point out, but it just strikes me as weird:
they spelled "weak" wrong.

------
SFJulie
Feedback loops with excessive competition for reputation?

Cheating is bound to happen.

There are some games for which not playing is the only way to not lose.

------
juskrey
Business literature bullshit coming to IT and CS. One can't study the top to
make inferences about the process.

