What is a fork, and how GitHub changed its meaning (drewdevault.com)
196 points by Sir_Cmpwn 26 days ago | 76 comments



It's a little funny to focus on GitHub's fork/pull-request model if you're trying to critique their centralization and lock-in. Every single successful open source project — on GitHub or not — has a canonical "source" branch and some sort of organization/leadership that decides what goes in. I'd argue that GitHub's behavior here isn't a power grab so much as a reflection of reality.

The "real" lock-in is with the discussion model — on both issues and pull requests — and the organizational structure you're able to set up. It's easy to move the source code to another service. Heck, you can even email patches to maintainers of a project on GitHub without creating an account. It's not easy to move historical discussions/reviews or "commit bits"/maintainership roles, and of course you can't participate in reviews of an emailed patch that a maintainer PRs for you unless you create an account.


I'd argue that at least their API indicates they try to reduce rather than reinforce this lock-in. The API lets you build tools like https://github.com/colmsjo/github-issues-export to automatically export data from issues and other project data stored outside of git.

Moving this data is still not trivial but I don't think GitHub could unilaterally make it much easier.

(https://developer.github.com/v3/issues/)
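For anyone wondering what such an export looks like in practice, here is a minimal sketch in Python, not the linked tool's code. It assumes the public "list repository issues" endpoint of the GitHub REST API; the function names, the `slim` field selection, and the `max_pages` cap are hypothetical choices made for illustration.

```python
import json
import urllib.parse
import urllib.request

API_ROOT = "https://api.github.com"


def issues_url(owner, repo, page=1, per_page=100):
    """Build the list-issues URL for one page; state=all includes closed issues."""
    query = urllib.parse.urlencode({"state": "all", "per_page": per_page, "page": page})
    return f"{API_ROOT}/repos/{owner}/{repo}/issues?{query}"


def slim(issue):
    """Keep a few portable fields from one issue object (hypothetical selection)."""
    return {
        "number": issue["number"],
        "title": issue["title"],
        "state": issue["state"],
        "author": (issue.get("user") or {}).get("login"),
        "body": issue.get("body") or "",
        # The issues endpoint also returns pull requests; they carry this key.
        "is_pull_request": "pull_request" in issue,
    }


def export_issues(owner, repo, max_pages=10):
    """Fetch issues page by page until an empty page comes back (needs network)."""
    exported = []
    for page in range(1, max_pages + 1):
        request = urllib.request.Request(
            issues_url(owner, repo, page),
            headers={
                "Accept": "application/vnd.github+json",
                "User-Agent": "issue-export-sketch",
            },
        )
        with urllib.request.urlopen(request) as response:
            batch = json.load(response)
        if not batch:
            break
        exported.extend(slim(issue) for issue in batch)
    return exported
```

Calling `export_issues("some-owner", "some-repo")` would return a list of plain dictionaries that can be dumped to JSON. Comments on each issue need separate requests, and unauthenticated calls are rate limited, which is part of why moving this data is still not trivial.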


I'd love to see a standard develop for storing this information in-repo. GitHub could certainly drive such an initiative.


You might be thinking about fossil. Every feature is in-repo.

https://fossil-scm.org/home/doc/trunk/www/index.wiki


That looks really cool. Is anybody already using this?


Well, SQLite is using it of course ;-)

Other large projects include Tcl/Tk and many related extensions, as well as all the cool stuff I write:

http://chiselapp.com/user/rkeene/



Well, I guess they also have to make this possible, since data portability is one of the principles in the GDPR.


>Every single successful open source project — on GitHub or not — has a canonical "source" branch

This isn't true at all, and I went into detail on this with Linux as an example in the article. Linux is far from the only example, though.


Yes, I did read that. But what is Linus' branch then, and why do folks aim to "upstream" changes to it?

I'd argue that the other important Linux sources are _themselves_ cathedrals, each with their own form of centralized leadership — be it an individual or a group (like a distro or company). And these cathedrals happily work together.

You can totally have this model with GitHub.


I see your point. I alluded to a similar idea in an older draft which suggests that bazaar/cathedral style is a spectrum, with most projects not falling entirely within either camp, but clearly having more elements of one model than others.

I think that GitHub actively encourages the cathedral model, though. It is possible to use either style on either platform, but the fork/PR model seems to me to be explicitly designed to strongly favor the cathedral model - to the benefit of GitHub's centralization goals, even if obliquely.


I still respectfully disagree. If GitHub really wanted to deceitfully name this feature to foster the cathedral model they would have gone with "branch" instead. In fact, I recall them getting quite a bit of push back for naming it "fork" because it has historically been a pejorative term. Even further, I believe the actual operation is implemented as a branch on their backend for efficiency's sake. So why go so far as to call it a fork?

Well, because you can treat it as a fully maintained source tree itself (and participating like a member in a bazaar). It's easy to turn on issues locally, add collaborators, accept pull requests, and trade commits back and forth between the forks. Now, yes, GitHub does by default "link" it to the upstream repository and it even keeps tabs on how its history differs, but I'd call these features that reflect real use-cases.

Also amusing is how the repositories on kernel.org describe themselves; 149 use the word "fork". Sure, it's probably partly due to how GitHub uses this terminology, but I'd say it's a good thing that forking is no longer pejorative by default — now it's a "hard fork" that's a bad thing.


Without github's "fork" branches we would most likely still see repositories initialized from plain source dumps popping up all over the place, just like in the days of nondistributed VCS.

There is no technical reason to do that, but technical reasons are not enough in the presence of social ones. Someone wants to tinker around and explore, maybe scratch some personal itch. No problem, check out, go wild in the working copy. Pretty soon they want it backed up in the cloud, maybe transfer it to another machine. VCS to the rescue. But do you want to create a branch on the official repository? Hell no, definitely not ready for that. Git veterans will casually add another remote (and even that won't be as good as a branch/"github fork"), but a lot of people would choose the devil they know, rm .git and reinitialize. Everybody loses. The forker loses history, easy merges from upstream and almost all chances of getting merged back into upstream. The project loses a potential committer and a general overview of what is going on at the fringes. Github wastes traffic and storage and loses the moat that socially linking up forker and project would have been.

Github "fork" strikes the perfect balance between proximity and distance: all the benefits of a branch, all the freedom of working in your own personal account. With this github have solved the problem of source dump repos so hard that we easily forget that it ever existed.


> Without github's "fork" branches we would most likely still see repositories initialized from plain source dumps

I doubt that. Git inherently allows pushing and pulling from anywhere. You can easily create a new repo on GH and push your branch(es) from your local clone there. Creating a new clean repo is more work. See how Linux works.


You can email GitHub and have them remove the link to upstream after the fact too. I did this after I forked upstream to send a bunch of patches and then decided to take the project in a different direction as an experiment.

Github has a bunch of issues, but the way they fork is not in my top 3.


> I think that GitHub actively encourages the cathedral model

Characterizing a site where literally the entire development process can be done in the open as the cathedral model is a serious misreading of The Cathedral and the Bazaar.


I was also unconvinced by your argument that these features relate to bazaar/cathedral.

Even in your linux examples, people still generally recognize an "upstream" and "downstream" (perhaps we could call that a 'lineage'), don't they?

Because, I suggest, a totally decentralized flat horizontal list of repos without any notion of "upstream" and "downstream" isn't actually too useful in practice. I'm not sure that's related to "cathedral" vs "bazaar".

It's possible github encourages 'cathedral' thinking/behavior somehow by the nature of the featureset, but your argument hasn't convinced me of that; or that, even if so, it's intentional.


I can't think of any open source project which isn't a hard fork (the traditional kind you mention, with renaming etc) which has multiple branches of equal maximum status.

Many projects have research branches, integration branches, stabilisation branches, vendor branches, branches run by people with platforms or use cases not well served by the canonical branch, etc. But those examples are all either feeders for a canonical branch, or derivatives of it.

Are there any projects where there are multiple branches right at the top?

The closest I can come is in the various distros' versions of ancient common unix utilities, which are no longer maintained by their original creators. Distros will make fixes to them, and may exchange fixes, but there is no shared upstream. However, that is really a case of there being zero canonical branches, not more than one!


In the retro console scene, sadly there are many programs with multiple independent forks which cannot be merged together.

Famitracker is a NES composing program. It was originally a monolithic cathedral project written by jsr without a Git repo. There were several forks made, including HertzDevil's 0CC-Famitracker which had a Github repo and added features. Famitracker's creator released a "beta" without source code, then abandoned the project. 0CC-Famitracker reverse-engineered the binary to merge in some features. I maintained a "bugfix patch-set" on top of 0CC for several months, but kept running into merge conflicts, so I ended up forking an older revision of 0CC, calling it j0CC, because 0CC's maintainer was adding bugs and refused to accept contributions (including bug fixes). Then we both abandoned our forks...

libgme is a library which emulates game-console sound chips, and is used in media players. It was originally written by blargg, but now has 2 active forks: A "linux" fork packaged in many Linux distros (including Ubuntu/Fedora), currently lightly-maintained by mpyne (which suffered a security vulnerability), and a "windows" fork used with foobar2000 (also builds on Linux via qmake) maintained by kode54, which is more ambitious with changes, including a slow but accurate SNES SPC700 emulator written by byuu and not affected by the security vulnerabilities.

I heard Amiga emulation (UAE) has multiple forks too, and WinUAE is more popular?

I would prefer if the forks had never happened, but it's not going to happen.


What does it mean to have more than 1 "canonical" of anything?


Quite!


This is especially true for projects like Kubernetes, which cross-references GitHub PRs in its commit logs. Have fun figuring out the change history when GitHub is down. And yes, of course K8s commits do provide more information than that. But the community is way more reliant on GitHub than it probably should be. You really start to see that when comparing K8s to Linux kernel commit messages.


I use magit-forge for Emacs, which gives you a local copy of the PRs and issues.

https://github.com/magit/forge


Is it any easier to move discussion that happened in an email thread?


It's very easy. lists.sr.ht, for example, supports importing and exporting mailing list archives as mbox files (a well-known format for email storage).
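To illustrate why mbox archives are easy to move around, here is a small sketch using Python's standard `mailbox` module. It says nothing about how lists.sr.ht implements import or export; the addresses, subjects, and file name are made up for the example.

```python
import mailbox
import os
import tempfile
from email.message import EmailMessage


def write_archive(path, posts):
    """Write (sender, subject, body) tuples to an mbox file at path."""
    box = mailbox.mbox(path)  # creates the file if it does not exist
    try:
        for sender, subject, body in posts:
            message = EmailMessage()
            message["From"] = sender
            message["Subject"] = subject
            message.set_content(body)
            box.add(message)
        box.flush()
    finally:
        box.close()


def read_subjects(path):
    """Return the Subject header of every message in an mbox file, in order."""
    box = mailbox.mbox(path)
    try:
        return [message["Subject"] for message in box]
    finally:
        box.close()


# Hypothetical two-message thread, written out and read back.
archive_path = os.path.join(tempfile.mkdtemp(), "list-archive.mbox")
write_archive(archive_path, [
    ("alice@example.org", "[PATCH] fix typo", "Patch body goes here.\n"),
    ("bob@example.org", "Re: [PATCH] fix typo", "Looks good to me.\n"),
])
subjects = read_subjects(archive_path)
```

Because the archive is a single plain-text file in a long-established format, any tool that understands mbox can read the whole discussion history without talking to the original host.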


Is it as easy to export tickets from Sourcehut? Can I participate in tickets on Sourcehut without creating an account?


It's not easy to export tickets right now, but I'm working on improving that. There's an API, so it's similar in capability to exporting tickets from GitHub, but I want to do better. I've put in about half the work necessary to file tickets without an account, the rest is just some database changes and should be live in the next few weeks.


> Is it as easy to export tickets from Sourcehut?

I know Drew is working on a few email things for the `todo` section of the site. Opening tickets, and responding via email.

> Can I participate in tickets on Sourcehut without creating an account?

Depends. I set up most of my public ones to allow anonymous users to contribute, but it's up to the owner to decide whether they want that or not.


I don't know if it's correct to characterize Github's model as a power grab. The design of Github definitely pushes things in a more centralized direction, but I think that approach is superior in many cases and it's not purely for profit.

For many projects, having a single "canonical" version is the best experience for both users of the project and developers. Linux is large and important enough that it may make sense to have many different distros running a slightly different set of patches and accept the overhead of managing multiple sources of truth. For smaller projects with more narrow contributor bases, it would be noisy and confusing.


Plus, there's absolutely nothing stopping anyone from taking their Git repos and putting them on another Git host. Or hosting their own. Or using multiple hosts.

Sure, they're doing things to further their own popularity... But it doesn't appear to be at the expense of anyone else.


> I don't know if it's correct to characterize Github's model as a power grab.

The important question would be whether GitHub forks and pull requests are an open protocol or attached to the platform. I’m not aware of whether I can make a pull request from, say, Bitbucket to GitHub. Can I? Then it’s not a power grab. If not, then the redefinition of fork/pull request is an extend and extinguish move.

Edit: okay, extinguish is too harsh. GitHub doesn’t want to extinguish git.


I don't really see forks or pull requests as being part of the git protocol. They are part of a workflow- no different than, say "git flow" is a workflow, not part of the protocol of how, why and when branching happens.

Further, github isn't looking to "extinguish" git in any way. Sure, old workflows (emailing patches?) might not be supported, but again, that's something entirely external to git itself.

As these things are all external to git, it makes sense that they're not portable between vendors- they are the vendor's, not git's, features. You're not using them from within git, you're using the vendor's features to interact with git. That's the primary difference between, say, M$FT's EEE of Java or AOL's instant messenger, HTML and other examples.


Extinguish doesn't mean to end, it means to end the open standard. Github would be happy if all git use were Github use, and every user locked in to their service brings that day closer.


What open standard? Email? Github doesn't change git, it changes the workflows around it.

Even if everyone used github, they're still using vanilla git underneath.


In a similar way you could argue that git doesn’t change diff and patch. It just adds a convenient way to handle patch files. So why open source it? It turns out it’s very useful to have it in software.

The same is true for tracking forks and managing pull requests.


> The important question would be if Github forks and pull requests are an open protocol or attached to the platform.

Attached to the platform, but open. You could do all the work on your feature branch somewhere else, say, your company's gitlab installation, then when it's done, push the branch into your github "fork" and create a pull request there. I really don't know on which side of an openness line that would fall, you can easily define the line around it either way.


It’s all git, so you can totally do all the development in gitlab, and only touch github when you are ready to make a pull request.

You’d have to run half a dozen commands, including adding a new remote and running “hub fork”, but it would likely still be faster than sending an email.


Forks and PRs are UI sugar on top of Git operations, so it's technically open. That said, merging in a PR is an operation that touches 2 repos, with different user privileges, so I feel it would be hard (though not impossible) to make the UI work with multiple services, without adding extra cross-service auth headache.


That's only true for private repos though, for open source repos it would work just fine.


> it's technically open

Yes, but in a practical sense it ties your operation to GitHub.


Besides that, a de facto centralization is always a much easier pill to swallow than a de jure one.


I'm a casual GitHub user, and for me the fork button enabled me to grab a copy of some other repo, make changes, and commit to my GitHub repo without perturbing the main project, while still keeping an online history at GitHub. It's a step above just downloading the zip file and below getting commit rights to the original project. If I just made a personal branch, then I would have everything but the online backup because I couldn't commit.

One example is used by many online learning companies: they provide some baseline of code for use in a course. You need to get it to use in the lessons (edit: and make changes that you want to save). You can download a zip, clone or fork the repo. Zip and clone don't get you the online backup.

It would be interesting to see how many people use Github for the reasons cited in the article (using a fork as staging for a pull request) or like I do.

As I wrote this, it made me realize I am a parasite on GitHub and the projects I fork, since I rarely contribute back (mainly because I don't have anything useful).


You aren't a parasite, you are an intended user like every other. A parasite is one that consumes the host, not one that merely benefits from it.


You don't need a fork button for that.

  git clone ssh://example.com/repo.git
  [... Edit ...]
  git push ssh://myserver.example.com/myrepo.git master

The benefit of the fork button is that GitHub links those repos on their site.


Ah yes, all of the students in GP's example merely need to learn how to provision, DNS-associate, and maintain a server, then they can store their Git repositories on it. That will surely not add any undue overhead for them.


There is no need to run a server. This is only about the "fork" button. You could put GitHub.com in the place of example.com. Git hosting existed before GitHub.


I can't imagine doing pull requests over email, which this article notes is an original feature of git. Maybe kernel developers could handle it, but there are a lot of people who don't even know how to send plaintext emails and this would not be conducive to reviewing PRs on mobile. I definitely like the GitHub flow better than SourceHut's.


I shared this video on the Lobsters thread which seems to have been helpful for some:

https://yukari.sr.ht/aerc-intro.webm

Don't repost this, please, the official aerc announcement is coming in a few weeks.

I'm working on making a UI similar to GitHub's for reviewing patches which is built on email underneath, but can be used entirely on the web. The first step of this became available a few weeks ago, and now email threads are being rendered into a review UI which is similar to GitHub's with inline feedback and such:

https://lists.sr.ht/~philmd/qemu/patches/5556

This page is fairly new and still needs a bit more work towards mobile support, but I hope that gives you some more confidence in the platform. I intend to extend this so that you can also review patches from the web, which will generate emails on the mailing list, and prepare patchsets from the web as well. The end result will be a UI which is remarkably similar to GitHub in terms of usability on the web, but is backed up with distributed technologies and seamlessly integrates with the workflows of devs who would rather use email.

On the whole Sourcehut is actually quite a bit better at responsiveness than GitHub, too. Rather than a dedicated and inferior mobile site, almost all sr.ht pages are responsive and equally capable on all form-factors.


I love how sourcehut complements git rather than parasitizing on it. That is, I can't fault github's ability to be a dumb remote to publish branches to - but once you use it for that, they have a foot in the door for all the prongs of their proprietary lock-in and what is frankly value-subtract. I'm not worried about that with sourcehut because 1) I don't have to go full sourcehut kool-aid unless I want to (in fact, there's actually too little linkage among the components at present), and 2) if I do, then the kool-aid I've bought into is... a mailing list workflow using decades-old standard technology, where the webapp views are optional.


Friendly pedantry: in that video, you referred to vi as "vee"? :)


Yes, because that's what it's called :)


So you're saying that you don't care what the original author named his software, and you're going to say it your way, just to be different?


Whoa! This is a great idea. Looking forward to the "real" post; I've also done some work on implementing distributed systems over email, so I'll be sure to give this a look.


Looks awesome!


See "git format-patch" and "git am". The short name of "git am" is a hint that Linus probably used it a lot, given that git was built to replicate his personal idiosyncratic workflow.

(Obviously you have to use a command line mail client too, in keeping with git's spirit of user friendliness)


I have to say that format-patch and am workflow is really handy, even when working alone. (Say you want to test a commit on multiple dev machines but you're not sure if you're ready to push to a remote, as an example.) The flexibility to store or transmit a commit in patch form apart from .git or the remote, then seamlessly add it to the history, is a good thing.


> Obviously you have to use a command line mail client too

Not really. You could use a GUI client like Thunderbird to respond to emails pertaining to a particular commit as long as you're not actually trying to paste code in the composer. If you want to put something like a code snippet delimited by a "scissors line", then you'll need to use git format-patch and/or git send-email.

In other words, it doesn't matter what MUA (Mail User Agent) you use if you're just sending text, but if you are sending code in the form of a diff that someone will apply using git am, then you'll have to use a tool like git format-patch and git send-email to compose the message and send it.


It is a curse that GitHub's code search can't search forked repositories. If there's a really popular fork of a dead repo, its code is invisible. Though, I have a feeling that is a technical, not business, limitation.


The Network graph has gotten increasingly buried in GitHub UX (currently it is under Insights > Network, at one point I recall it was a top level tab of its own) because it's a very useful power user feature, but can be super confusing to new users. (It's particularly a shame it is so hidden because IMNSHO it is the best commit graph that GitHub has, much more useful than the default commits list on the Code tab.)

The Network graph is extremely useful for getting a sense if a particular fork is dead and if there's a lot of activity happening on another fork (and letting you jump directly to other fork).


Assuming GitHub is willing to show you the graph, for some projects it isn't.


I've only seen GitHub not show the graph at all when the total fork count for a Repo was huge and they've since changed it to always try to show "top 100 most recently committed forks" instead as a performance optimization in that case.

Easiest example off the top of my head: https://github.com/DefinitelyTyped/DefinitelyTyped/network


Is that for piracy apps or something?


I suspect it’s largely to avoid ballooning the size of indexed search data they have to store, given the many many forks of popular repos. So a mix of biz/tech reasons, but not really nefarious.

That said, given the nature of git, I usually just fall back to clone/grep.


I just have the Sourcegraph extension installed. Sourcegraph is also usable without it. It's a bit more powerful and lets me search forked repositories too, on an instance of Sourcegraph on their infrastructure.

I still lament not being able to globally (like all of GitHub, not just one repo's forks) search GitHub's forks. It's amazingly useful for finding out approaches to using an API. Sorta like Google Code Search in the old days.


Sourcegraph CEO here. Yes, you can search forked repositories on Sourcegraph with regexp, exact terms, etc. Thanks for mentioning that!

Here’s an example of a fork that you can search: https://sourcegraph.com/github.com/uber/gonduit

Just change the repository owner/name in the URL to search any other one.


It's nuts. There's so many old JS/node libraries floating around on github, and it's incredibly difficult and time-consuming to find the one that has been moved forward the most in a fork. There should be a notice on a repo that hasn't had commits or any activity from the author in the last 6 months which links to a better fork view than the one they have.


I run a fork, and we didn't use GitHub's "fork" feature for that specific reason. However, it's pretty trivial to check out a repository and push it wholesale into a brand new GitHub repo without forking.


I heard you can also contact support to separate this out too.

It's necessary to preserve issues and pull requests.


Yes, I've done this myself — it's just an email and they "disconnected" the fork within an hour. Very happily and friendlily, too.


It would be awesome if this problem was solved under Microsoft's ownership, it seems like the kind of challenge that could benefit from the resources of a giant firm, and of course promote Azure somehow. Code exactly duplicated from the parent repository could be excluded, so when you're searching Github you'd really be searching all of Github.


> On GitHub, a fork refers to a copy of a repository used by a contributor to stage changes they’d like to propose upstream

I'm not sure this is accurate, and it is the crux of the argument. IMO a fork on GH refers to a "fresh" (in terms of the tooling around the repo, e.g. issues/PRs) copy of the repo on another account, which may be used as a branch to push to upstream but may also be the "traditional" fork; the difference is entirely in how the copy is used. If you've got a public "personal branch" with its own associated tooling, it doesn't seem like a stretch to call that a fork, whether temporary and intended to be sent back upstream or not, and to me arguing that "personal branch" is a better term than "fork" draws a distinction without a difference. Forking a repo (spinning off a dev section, including CI/CD/issue tracking/management -- a branch implies sharing that infrastructure IMO) isn't the same thing as forking a project (spinning off a competing project).

To the cathedral and bazaar points, I don't see how GH affects the development style at all. The only thing that really makes the mailing-list driven Linux dev more decentralized is that it's done via email... yet someone still chooses what goes into releases, maintains & hosts that mailing list/website mirror and could limit access, or the "core" team could simply email each other and lock out any public view. GH can be configured to be just as open (if needing their "fork" model to keep anyone from being able to push over the original hosted copy) or just as closed, depending on how the project is run. The cathedral and bazaar is about project philosophy and management, not the underlying tech, to my reading.

Given the disclaimer that the author is building a GH competitor my cynical thought is this is really marketing aimed at the programmer niche; I would have enjoyed it more as a contrast against centralization and the benefits of his (as I understand it) more free/open/decentralized competitor.

All that said, I think sr.ht/sourcehut is a cool project and can easily see myself switching to it. How have the ads in 2600 been working?


I really like how github fork can be implemented as new HEAD etc references to the same objects. New commits refer to the original project's objects.

i.e. git's content addressing, intended for identical distributed objects also automatically enables de-duplication of identical centralized objects.
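The de-duplication point can be sketched in a few lines of Python. The blob ID computation below follows Git's SHA-1 object format (the header `blob <size>\0` followed by the content); the `ObjectStore` class is a toy invented for illustration and is not a description of how GitHub actually stores forks.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID Git assigns to a blob with this content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


class ObjectStore:
    """Toy content-addressed store shared by several repositories."""

    def __init__(self):
        self.objects = {}

    def put(self, content: bytes) -> str:
        object_id = git_blob_id(content)
        # Identical content hashes to the same ID, so it is stored only once.
        self.objects.setdefault(object_id, content)
        return object_id


# The same file pushed by the original project and by a fork.
store = ObjectStore()
upstream_id = store.put(b"int main(void) { return 0; }\n")
fork_id = store.put(b"int main(void) { return 0; }\n")
```

Both `put` calls return the same ID and the store ends up holding one object, which is the property that lets a host share storage between a project and its forks without the two repositories needing to know about each other.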

Regarding the article, it seems to be saying you can't bazaar-fork a project on github. I don't see why not - although the default fork is associated with the forked project, why can't you start a new project, using a clone of the forked project?


>I really like how github fork can be implemented as new HEAD etc references to the same objects. New commits refer to the original project's objects.

This can be done (more effectively, actually) without the user's explicit involvement in the fork process. You can dedupe blobs across the entire platform on git push.

>Regarding the article, it seems to be saying you can't bazaar-fork a project on github

I'm not saying you can't (I thought that was clear enough), but that GitHub is designed to encourage you to use a different approach.


Fork distributes control over history and abstracts permission management away from the traditional git branch pattern so you don't clutter the namespace of the canonical remote but can still offer work.

Fork is almost nothing, it's just a useful pattern.


I don't understand. I forked quite some repos on GitHub, because they weren't maintained anymore or I needed some feature the maintainer wouldn't implement.


In the absence of external energy inputs, decentralized systems tend toward centralization. I don't much like it, but it's not really Github's fault that it happens.



