
In a very specific technical sense, it could be argued that it doesn't: the data for GitHub forks (i.e. their branches and the commits on them) is actually stored within the base repo they were forked from.

In other words, by forking something on GitHub, you're not distributing anything; rather, the original org is now distributing an additional thing you made: your fork branch[es].

This is the source of many confusing things about the security of GH forks; and the source of some recent GH vulnerabilities.

Also, if you're curious, this isn't a meaningless "implementation-level distinction", as it has semantic implications for repo management: it means that the branch attached to a PR coming from a fork repo continues to exist in the base repo, even if the fork repo that the branch originated from gets deleted. That's because the branch was always "in" the base repo to begin with; the PR just changed the branch's GH ACLs to make it accessible to the owners of the base repo.

(Really, the "fork repo" itself is an illusion — it's like a SQL view. There's only the base repo, which contains both regular branches, and user-fork-namespaced branches. This is in part why forks can't be private; they're just a view of resources in another repo, already security-controlled by that other repo; so they can't have their own additional security logic acting on those same resources!)
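
To make the "fork as a view" idea concrete, here's a toy Python model of it. This is only a sketch of the mental model described above, not GitHub's actual implementation: the `BaseRepo` class, its field names, and the rule that PR-attached refs survive fork deletion are all assumptions made up for illustration.

```python
from dataclasses import dataclass, field

# (namespace, branch); "" stands for the base repo's own namespace (hypothetical convention)
Ref = tuple[str, str]


@dataclass
class BaseRepo:
    objects: dict[str, str] = field(default_factory=dict)  # commit id -> content
    refs: dict[Ref, str] = field(default_factory=dict)     # namespaced branch -> commit id
    forks: set[str] = field(default_factory=set)           # namespaces exposed as "fork repos"
    pr_visible: set[Ref] = field(default_factory=set)      # refs a PR opened up to base owners

    def push(self, namespace: str, branch: str, commit_id: str, content: str) -> None:
        # Every push lands in the one shared object store, whoever made it.
        self.objects[commit_id] = content
        self.refs[(namespace, branch)] = commit_id

    def fork(self, user: str) -> None:
        # "Forking" copies branch pointers into a new namespace; no objects are copied.
        self.forks.add(user)
        for (ns, branch), commit in list(self.refs.items()):
            if ns == "":
                self.refs[(user, branch)] = commit

    def open_pr(self, user: str, branch: str) -> None:
        # Modeled as an ACL change on a ref that already lives in the base repo.
        self.pr_visible.add((user, branch))

    def delete_fork(self, user: str) -> None:
        # Assumption for this sketch: the view goes away, PR-attached refs stay.
        self.forks.discard(user)
        for ref in [r for r in self.refs if r[0] == user and r not in self.pr_visible]:
            del self.refs[ref]

    def fork_view(self, user: str):
        # The "fork repo" is just a filtered view over the base repo's refs.
        if user not in self.forks:
            return None
        return {branch: c for (ns, branch), c in self.refs.items() if ns == user}


repo = BaseRepo()
repo.push("", "main", "c1", "initial commit")
repo.fork("alice")
repo.push("alice", "fix-typo", "c2", "fix a typo")
repo.open_pr("alice", "fix-typo")
repo.delete_fork("alice")
```

After `delete_fork("alice")`, the fork view is gone, but the `("alice", "fix-typo")` ref and its commit are still sitting in the base repo, which is the behavior described for PR branches above.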




This implementation choice has always felt a bit odd to me, almost like premature optimization. Is there a reason to have done it this way other than storage deduplication? Since git is already a content-addressed store anyway, how hard would it have been to have some kind of abstraction below the repo layer that would provide the same deduplication?

At this point there's obviously huge inertia in GitHub's early architectural decisions, but if you were building GitHub today, would it still make sense to go this route?
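
For what it's worth, here's roughly what I mean by "an abstraction below the repo layer", as a minimal Python sketch. Everything in it is hypothetical (the `ObjectStore` and `Repo` classes, hashing raw bytes with SHA-1 rather than using git's real object format); it only illustrates that separate repos with their own refs could share one content-addressed store and still get deduplication.

```python
import hashlib


class ObjectStore:
    """Shared content-addressed store: identical content is stored once."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        # The key is derived from the content, so duplicates collapse to one entry.
        key = hashlib.sha1(data).hexdigest()
        self._blobs.setdefault(key, data)
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]

    def __len__(self) -> int:
        return len(self._blobs)


class Repo:
    """A repo owns its own refs (and could own its own ACLs) but not its objects."""

    def __init__(self, store: ObjectStore, refs=None) -> None:
        self.store = store
        self.refs: dict[str, str] = dict(refs or {})

    def commit(self, branch: str, data: bytes) -> None:
        self.refs[branch] = self.store.put(data)

    def fork(self) -> "Repo":
        # A fork is a fully separate repo: refs are copied, objects are shared.
        return Repo(self.store, self.refs)


store = ObjectStore()
base = Repo(store)
base.commit("main", b"v1")

fork = base.fork()
fork.commit("feature", b"v2")
fork.commit("copy-of-main", b"v1")  # same content as main, so nothing new is stored
```

Here `base` never sees the fork's `feature` branch, yet the store holds only two objects for three commits' worth of content.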


GitHub's fork feature works outside of git itself. It does not utilize the .git directory, and therefore does not utilize git's deduplication.

EDIT: Oh, I see what you mean. It would definitely be interesting to solve this namespace conflict problem from inside git. I wonder how many times meta-branches (or something similar) have been advocated for.



