
Archiving repositories - manigandham
https://github.com/blog/2460-archiving-repositories
======
CyberShadow
I wish GitHub made it easier for the community to continue working on
abandoned projects. All too often, a project's owner goes MIA (without adding
more members), and the pull requests just keep piling up and are never acted
upon.

The only way to discover if a project has a more active fork elsewhere is to
go to the network tab and scroll around in the graph... Ideally there would be
a way to add a banner at the top of dead repositories with a text like "This
project is inactive, but there is a more active fork <here>", as otherwise
most people visiting the project's page would have no idea such a fork exists.

I've successfully revived an abandoned project (with an owner who ignored all
my attempts to contact them), but only because the primary resource for the
project was a wiki on another website, so it involved changing the links there
- but this isn't the case for projects whose GitHub page is their main
website.

~~~
jszymborski
It'd be difficult to engineer that feature in a reasonable way, I think.

Potential for abuse by hostile forks and ill-wishers would be a big problem
and I can't think of any real mitigations.

As is often hotly debated in HN comments, measures of repo inactivity/health
are not universally accepted. I can see thresholds in any of these metrics
potentially problematic if applied universally.

I do sympathise with the pain-point you're outlining, however!

~~~
rcthompson
I don't think such a feature is quite as difficult as one might think. The use
case is for repos that are completely abandoned with an actively maintained
fork, so you can be super conservative with the time frame in order to limit
the possibility of abuse. Don't allow it for any repository with any activity
from its owner in the past year, and give the owner something like 3 months
from the time the request is initiated to refuse. If they don't respond, then
the repo has been untouched for a minimum of 15 months. At that point, the
requester gets their fork marked as the "canonical" repo, with a banner on the
original repo informing people that it is unmaintained for X months and
directing them to the maintained fork. Thereafter, if the original repo's
owner ever logs in, they could have a button on their repo to remove the
banner, or alternatively confirm the transfer of maintainership (after which
they can no longer take it back). In any case, the frequency of this kind of
thing should be rare enough that it also would be reasonable to require
approval by a human at GitHub for each one.
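
As a rough illustration of the timeline described above, here is a minimal sketch of the eligibility checks. The function names and the exact day counts (365 and 90) are assumptions chosen to match "a year" and "something like 3 months"; nothing here reflects an actual GitHub feature.

```python
from datetime import date, timedelta

# Assumed thresholds: one year of owner inactivity before a request may be
# filed, then roughly three months for the owner to refuse.
INACTIVITY_REQUIRED = timedelta(days=365)
RESPONSE_WINDOW = timedelta(days=90)


def can_file_request(last_owner_activity, today):
    """A request may be filed only after a full year without owner activity."""
    return today - last_owner_activity >= INACTIVITY_REQUIRED


def can_mark_canonical(last_owner_activity, request_filed, today):
    """The fork may be marked canonical once the response window has elapsed.

    Any owner activity after the request moves last_owner_activity past
    request_filed, which makes the first check fail and cancels the request.
    """
    if request_filed - last_owner_activity < INACTIVITY_REQUIRED:
        return False
    return today - request_filed >= RESPONSE_WINDOW
```

With these numbers, a fork can only be promoted once the original has been untouched for at least 455 days, which is the "minimum of 15 months" in the proposal.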

Anyway, the above is one possible concept for such a feature that I think
would be pretty resistant to any kind of abuse. But I'm not necessarily
arguing in favor of this specific concept. In fact, given how rare such events
would be, the most expedient "implementation" of this feature might simply be
a dedicated email address and instructions on how to write up such a request
for manual review, along with a stated policy of the minimum period of
inactivity before a repo is eligible for such treatment. But my overall point
is that with abandoned projects, I think you can slow down the timeline for
transfer of maintainership to the point that abuse of the feature becomes
essentially impossible.

~~~
jakub_g
BTW, GH already has a feature that if a repo gets deleted, and it had forks,
then the most starred (AFAIR) fork becomes the "master fork" and other forks
have their link "forked from..." updated.

It would be a good idea to do as you propose: if the maintainer doesn't click
a button for X months (and the button is prominent when you're logged in,
impossible to overlook), make another fork a primary fork.

------
geofft
This seems to be the first way to disable pull requests - I have a few repos
that are OSS code dumps of an internal project, rescues of abandoned CVS repos on
SourceForge, weekend-long experiments that I gave up on, etc. that I am not
paying any attention to any more, and I imagine I'll archive those basically
immediately.

I still wish there were a direct way to disable all pull requests, though, to
set appropriate expectations. (People are still more than welcome to fork the
repo and do what they want, there's a free software license on it. But a pull
_request_ is asking me to maintain it, and I don't intend to maintain most of
the repos on my GitHub.)

------
Nelkins
A little off topic, but does anybody else feel like GitHub's pace of feature
development is...unusually slow? Admittedly, I have never used the paid
Enterprise version so I don't know if it has received more attention. But,
seriously, why is the search so terrible (talking about a global search across
repos)? Doesn't not having to worry about semantics (just focus on text
matches) make this easier? Also, why is the mobile view so crippled (no way to
sort or search issues or pull requests)?

If anybody has a reasonable explanation, I'd be very interested to hear it.

~~~
derefr
> But, seriously, why is the search so terrible

Because it's not just a matter of configuration/input-munging, the way getting
to "good search" for text is. The existing solutions for search at scale
(ElasticSearch, Solr) just don't work well for a code-document corpus the way
they work for a text-document corpus. Trying to trick them into doing so will
only get you half-way there, and then you're stuck in molasses if you try to
improve from there.

You really have to build "code search" from the ground up, doing things like
running language-specific parsers over each codebase and then building tries
out of linearized AST node token sequences.
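
To make the idea concrete, here is a toy sketch using Python's standard `ast` module: each snippet is parsed, its AST is flattened into a pre-order sequence of node type names, and every suffix of that sequence is inserted into a trie. The `TokenTrie` class, `linearize` helper, and `max_depth` limit are hypothetical names for illustration only, and this is not a description of how GitHub or any real code search engine is built.

```python
import ast


def linearize(source):
    """Flatten the AST of `source` into a pre-order list of node type names."""
    tokens = []

    def visit(node):
        tokens.append(type(node).__name__)
        for child in ast.iter_child_nodes(node):
            visit(child)

    visit(ast.parse(source))
    return tokens


class TokenTrie:
    """Trie over AST token sequences; each node records the snippets that reach it."""

    def __init__(self):
        self.children = {}
        self.snippets = set()

    def insert(self, tokens, snippet_id, max_depth=8):
        # Index every suffix (truncated to max_depth) so that any contiguous
        # run of up to max_depth tokens can be looked up from the root.
        for start in range(len(tokens)):
            node = self
            for token in tokens[start:start + max_depth]:
                node = node.children.setdefault(token, TokenTrie())
                node.snippets.add(snippet_id)

    def search(self, tokens):
        """Return the ids of snippets containing `tokens` as a contiguous run."""
        node = self
        for token in tokens:
            node = node.children.get(token)
            if node is None:
                return set()
        return set(node.snippets)
```

A query is then a structural pattern rather than a text match: searching for `["Call", "Name", "Load"]` finds snippets containing a call to a plain name, whatever the identifiers happen to be. Queries longer than `max_depth` will never match in this sketch.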

~~~
sqs
There’s a big difference between code search over 20M repositories (GitHub’s
goal, and what people need a few times per week) and code search over 1-50,000
repositories (what developers at companies need several times per day). Doing
the latter lets you simplify the indexing a lot if you have a lot of memory
available (shameless plug: https://about.sourcegraph.com/products/code-search/).
But doing it
over 20M repositories needs an index that takes weeks, if not months, to
rebuild and severely limits your ability to ship improvements. I’m thankful
they are doing the work of building that for the developer community, though!

------
vortico
I use GitHub so much these days that a change like this that doesn't have any
major disadvantages is always a positive bonus to my life. When a maintainer
(like me) doesn't want to deal with a project any more, this will actually
encourage forks where the software lives a second life.

------
sharpercoder
I can imagine a large SAAS provider can make heavy optimizations when data is
marked immutable. In this case it seems to be a win, because humans _want_ to
have the data marked immutable.

~~~
andrewstuart2
Definitely a win-win if I ever saw one. :-)

------
simonw
It's not clear to me if archiving a repository excludes it from code search -
specifically, search within an organization.

I'd like to be able to hide some of our legacy repositories from code search,
because when I search across all of the code that belongs to my organization I
don't want to see code from repositories that are no longer maintained. I
frequently use code search across organization to see if e.g. a specific
library is being used, and archived repositories are not relevant to that use-
case at all.

My ideal solution would be the ability to default-to-not-including-archived-
repos, but with an option for "run this search against archived repos as
well".
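
The behaviour I have in mind, sketched over an in-memory list rather than any real search API. The `Repo` class, `search_org` function, and `include_archived` flag are all made-up names for illustration:

```python
from dataclasses import dataclass, field


@dataclass
class Repo:
    name: str
    archived: bool
    files: dict = field(default_factory=dict)  # path -> file contents


def search_org(repos, needle, include_archived=False):
    """Plain substring search across an organization's repositories.

    Archived repositories are skipped unless the caller opts in.
    """
    hits = []
    for repo in repos:
        if repo.archived and not include_archived:
            continue
        for path, text in repo.files.items():
            if needle in text:
                hits.append((repo.name, path))
    return hits
```

Calling `search_org(repos, "some_library")` would answer "is this library still in use?" against maintained code only, while `include_archived=True` brings the legacy repositories back in.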

------
jwilk
What's the "Pull request" tab doing there if the repo is supposed to be
archived?

I'd actually want to disable PRs without making the repo read only.

~~~
derefr
1. It lets you see PR merge history.

2. PRs are basically enhanced issues, so it lets you see the discussion on
old PRs. (It would make no sense to make the Issues tab disappear on archive,
right?)

3. If a project gets archived with outstanding open PRs, and you want to
resuscitate the project with a fork of your own, you can also "rescue" the
outstanding PRs by creating new PRs that come from the same branches as the
original ones.

~~~
jwilk
FWIW, the Issues tab does disappear when you disable issue tracker.

------
michaelmior
I wouldn't be surprised if GitHub eventually auto-archived projects that
haven't been modified in X years and moved them to cold storage. This could
probably result in some significant cost savings.

~~~
vortico
Cold storage typically means offline storage. That is not what this is. The
storage needs to be just as fast as any normal repo. The point of this change
is not an infrastructure one, but an organizational one, as a signal to
viewers of the repo that the repo is no longer being maintained.

~~~
sebazzz
No - but they could optimize it internally to keep it out of caches,
deduplicate to a single server, etc.

~~~
yeahbutbut
That should probably be based on access patterns, not on mutability flags. If
it's "archived", but has 1K downloads a day it _should_ be in cache.

------
manojlds
I thought this would be mentioned, but no - what if private repositories are
archived? Do they still count against your private repository limit (granted,
they want to push the newer user-based pricing model)?

Edit - Tried actually doing it and here is the message:

> You will still be charged for this repository. This will not change your
> billing plan. If you want to downgrade, you can do so in your Billing
> Settings.

------
rsync
Here is my favorite way to archive a repo:

ssh user@rsync.net "git clone git://github.com/LabAdvComp/UDR.git github/udr"

Any repos/projects that are important to me get mirrored to my personal
rsync.net account every time I use them.

Sometimes repos disappear, or get taken down ...

~~~
akerl_
This is a fundamentally different meaning of the word "archiving" than GitHub
is using.

The new feature "archives" in the sense that it adjusts the repo to be read-
only, so that folks can view it on GitHub without being able to PR or file
issues. Like historical preservation.

Your example is "archive" in the sense of "have a backup of", which is
something more folks should definitely be doing, but isn't a replacement for
GitHub's new feature.

~~~
zokier
> This is a fundamentally different meaning of the word "archiving" than
> GitHub is using.

I think the difference is not in the meaning of "archiving" but in the meaning
of "repository". A GitHub repository is a far wider concept than a plain git repo:

> Archiving a repository makes it read-only to everyone (including repository
> owners). This includes editing the repository, issues, pull requests,
> labels, milestones, projects, wiki, releases, commits, tags, branches,
> reactions and comments.

