How to publish Git repos that cannot be republished to GitHub (joeyh.name)
526 points by edward 76 days ago | 197 comments

Quoting from the linked torrentfreak article:

> If you commit or post content to this repository that violates our Terms of Service, we will delete that content and may suspend access to your account as well

My reading of this is that github wants to stem the flood of garbage pull requests to their [dmca](https://github.com/github/dmca/pulls?q=is%3Apr+is%3Aclosed) repo, which I guess is probably reasonable.

I don't quite see where github is banning certain commit hashes or tree hashes on sight.

I certainly respect the author's desire to not publish their repos to github (and indeed one of the nice things about git is how easy it is to self host). And I can understand not wanting others to fork the repos to github (due to objections about the company itself), but I'm not sure how I feel about adding deliberate boobytraps to repos that seem to be purely punitive.

It's currently possible to make a PR to the github dmca repo that merges youtube-dl, and so violate their TOS.

When you make that PR, you do not transfer any of youtube-dl to github. Their repo already has those objects.

So all you have done is pushed a repo containing a specific hash, which is now a TOS violation, in that specific case at least. I wrote "Certain commit hashes are rapidly heading toward being illegal" because of that specificity.

So, in addition to illegal primes, we are getting a collection of illegal random numbers?

(I wonder when illegal transcendental numbers will become a thing.)

There already are some illegal transcendental numbers. For example, you can't publish the transcendental number which is the circumference of a circle whose diameter in metres is

Excellent. Can we now please also have an illegal (proper) surreal number? ;)

No; by law there must be a set of illegal numbers. As a proper class the surreals are simply too vast to be outlawed.

This is a bit ridiculous. Every movie file or image file is technically just one long number. The same goes for a book if you convert each character to ASCII and concatenate the results into one long number. It's all about how you interpret the number. While the fact that the DVD decryption program when arranged as a number happens to be prime is cool, the claim that it is "illegal" is dubious at best.

I appreciate that it's in metres. Does it need to be?

Nope, pi has no units.

But circumferences and diameters have units (of distance)


It's the DVD decryption key, one of the first "illegal primes": https://en.wikipedia.org/wiki/Illegal_prime

It's not just the decryption key. It is a gzip file containing the full source code of the decoder!

Is illegal prime multiplied by 2 illegal too?

Law operates under the assumption that an idea called color exists. Computer scientists, for their job, generally must assume it does not. Color is a sort of metadata property of something (an object, data), and in this case, the color of the illegal number pertains to how it was derived (where it came from). If the number is 'tainted' in its origin by having been influenced somewhere in its past by the offending illegal number, then it too can be illegal. A good explanation for this is below:


tl;dr yes. An illegal number multiplied by two is illegal. Randomly generating that number is not. The illegality is not strictly a property of the number itself, but of its color.

What if I randomly generate a number and add it to the illegal number to get a result? Is the resulting number illegal or not? (Thinking about it mathematically, the resulting number's distribution isn't really affected by the illegal number. Maybe we can practically find something by taking a lot of generated numbers and analyzing the apparent distribution.)

Another idea is to generate a random number and multiply it with an illegal prime. If the illegal prime is a sufficiently large number, we can extract the original illegal prime with very high confidence by just finding the product's prime factors and picking the shortest one.

If you distribute a derivative of an illegal number, that derivative has the color of that illegal number. Whether anyone notices is a different question. The moment one of the intermediate steps involved the illegal number, the following steps gained its color.

So too for addition then. An illegal number + 1 is illegal. An illegal number - 1 is illegal.

Therefore all numbers are illegal.

I believe you misunderstood my comment. Numbers whose origins are tainted by an illegal number can be illegal. Numbers whose origins do not have taint from an illegal number are not. It is not the number that is illegal, it is its color: how it came into being, its history, a permanent invisible mark that it carries with it and transfers to whatever it touches.

You can't just dodge color by mutating the number via mathematical operations; color doesn't work like that. You can only get rid of the taint by using a fresh number without taint.


If there's an illegal number X and I give you a number Y with the intention that you obtain X from it by some transform then Y is also illegal because I "tainted" it when I did the original transformation from X.

If I just need Y (or X) for some operation that's not related to the "illegality" then those numbers are not illegal.

The essay that droffel linked to explains this very well.

> Numbers whose origins are tainted by an illegal number can be illegal [..]

Two thoughts:

1. This appears to show a complete disregard for basic mathematics...


2. Do lawyers ever wonder why people make jokes about them?

It doesn't disregard mathematics at all. Quite the contrary — it separates mathematics from the intent of the mathematician.

The number itself isn't illegal, because it can come from any number of contexts. What's illegal is wilfully working towards generating that number while knowing its purpose as a key. It's just the principle of Mens Rea at work.

I don't know! Maybe the MPAA's lawyers can't invert that obfuscation function :-)

please explain; thanks in advance

this is one of the leaked DVD encryption keys (expressed, of course, in a very funny way), which has been used since forever by pirates for DVD decryption.

Used to be illegal to have it in US and other countries, don't know status now

Not only by pirates. As a Linux user, I need it to watch my legally bought DVDs since there is no viable other way.

> this is one of the leaked DVD encryption keys (expressed, of course, in a very funny way)

It's actually even cooler: store this prime in base-2 and interpret it as a gzip file, and it'll gunzip to a full working DeCSS (DVD decrypter) source code in C :-)
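To make the "number as gzip file" idea concrete, here is a minimal Python sketch of the round trip. It uses a harmless stand-in payload, not the actual prime or the DeCSS source, and the helper names `bytes_to_int` and `int_to_bytes` are made up for illustration.

```python
import gzip


def bytes_to_int(data: bytes) -> int:
    """Read a byte string as one big-endian integer."""
    return int.from_bytes(data, "big")


def int_to_bytes(n: int) -> bytes:
    """Write an integer back out as the shortest big-endian byte string."""
    length = (n.bit_length() + 7) // 8
    return n.to_bytes(length, "big")


# Stand-in payload (an assumption for illustration, not the real program).
source = b"int main(void) { return 0; }\n"

archive = gzip.compress(source)      # a gzip stream starts with the bytes 1f 8b
number = bytes_to_int(archive)       # the whole archive as a single integer
recovered = gzip.decompress(int_to_bytes(number))
```

The round trip is lossless here because a gzip stream begins with a non-zero byte, so no leading zero bytes are dropped when converting to an integer and back.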

Every file stored and processed by computers is just a really big number. Information theory seems to agree. It's all just bits.

When framed this way, so many laws become ridiculous. Copyright is about establishing a monopoly on numbers. It's illegal for normal people to know certain numbers because they represent classified information. People go to jail because of numbers stored in their hard drives.

Maybe. Numbers in real life — the domain of the legal system — tend to be either for counting or measuring.

Numbers that are huge and precise are data, information, something quite different.

The thing with illegal primes is that they are small enough to really feel like numbers rather than a concatenation of things forced to look like a number (“the shape of my bedroom is 201809!”)

What is the ontological status of "the smallest number which is outside the domain of the legal system in this sense"?

I don't think it exists because you can always copyright some piece of self-created digital art so that it hashes to the number (if the hash maps to enough bits to represent that number).

I think there is a biggest number though, and it should be higher than 64bit-1 because your processor would otherwise occasionally handle illegal numbers, thereby (probably) making it illegal as well.

My point is rather that if we agree that large-enough numbers can be data, then there is a number at which they start to be data, N say, then "all numbers greater than or equal to N are data" uses N not as data, but as "counting and measuring". So this distinction is self-contradictory, so incoherent.

You could just say "greater than N-1" to avoid this problem, so this doesn't seem like a very strong argument.

I think the better point to make is that the law will never actually produce a concrete N, preferring to leave itself vague so that it can criminalise things at a whim. This is what makes it a farce and why such notions need to be purged from the law.

How can you subtract 1 from data?

You don't. That was subtraction on a meta level. You name the largest non-data number instead of naming the smallest data "number".

the fact that you cannot identify the exact number of sand grains that make a collection into a heap of sand does not mean heaps of sand do not exist

At least in the case of sand you can give examples of numbers of grains that don't constitute a heap, and numbers that do. And numbers smaller than the former will be non-heaps and numbers larger than the latter will be heaps.

For natural numbers, the lawyers seem unable to do even that. Sure, they might want to name some of the "illegal primes" mentioned in this thread as examples, but what will they say about those numbers +1? They're gonna paint themselves into a corner of ridiculousness no matter what.

N = 10^85

You’ll know it when you see it.

> Numbers in real life — the domain of the legal system — tend to be either for counting or measuring.

What does "in real life" mean? You and your brain presumably exist in real life. If your brain can construct any natural number, then do they not exist just as much as the ones that happen to appear as counts of rice or the price of beer?

I think any number whose information density is more than could be memorized becomes data, and no longer counts as a real life number.

One might stretch the definition to allow numbers that you could read on a postcard in one go to still count as numbers, but it’s a big stretch. If there’s a more understandable representation of the number — especially if there’s a more understandable representation — then it’s data and not a number.

Beyond a certain amount of information content numbers just become data. Certainly at the point where counting the number of digits is non trivial. (There are notations to make magnitude easier to see, but these purposefully delete information content.)

And you get into pretty absurd situations: Whatever the status quo on this matter is, I think we all agree that the integer 0 is in the public domain. And also the following algorithm: (0) let n = 0. (1) Increment n. (2) Go to (1).

But this will, in time, generate every copyrighted material.
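A minimal Python sketch of that counter, assuming we read each value of n as a big-endian byte string. The function name `count_until` and the two-byte target are illustrative only.

```python
def count_until(target: bytes) -> int:
    """Run the increment loop until n, read as big-endian bytes, equals target."""
    goal = int.from_bytes(target, "big")
    if goal == 0:
        raise ValueError("target must contain at least one non-zero byte")
    n = 0                 # (0) let n = 0
    while True:
        n += 1            # (1) increment n
        if n == goal:
            return n
        # (2) go to (1)


steps = count_until(b"hi")
```

Each extra byte in the target multiplies the number of steps by 256, so "in time" is doing a lot of work: a two-byte target takes tens of thousands of steps, while even a short text file is far out of reach.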

Indeed. The only possible conclusion is that all copyrighted material already exists since all numbers already exist. Artists are just people who discover interesting and really big numbers through the convoluted process of "creation". After the number is found, all bets are off.

Most so-called “illegal primes” involve encoding some information (e.g. an encryption key, that is, a random number) as a prime number, with the idea that transforming the information into a prime makes it into a “mathematical result” or something like that. It is intended as an argument why there should not be any “illegal data”, which I find absurd because the idea that such a transformation is somehow relevant is even larger nonsense than the idea of making some kinds of information “illegal”.

That's not true. They are illegal primes because encryption keys are just pairs of prime numbers used for public key cryptosystems, so posting the primes that make them up is sharing the encryption keys and thus against DMCA. There is no transformation involved - it is not just encoding normal illegal content into a numeric form.

>One of the earliest illegal prime numbers was generated in March 2001 by Phil Carmody. Its binary representation corresponds to a compressed version of the C source code of a computer program implementing the DeCSS decryption algorithm, which can be used by a computer to circumvent a DVD's copy protection.[1][2]


> so posting the primes that make them up is sharing the encryption keys and thus against DMCA

Except it's not.

You can post the numbers alone, without including information (directly or by reference) to the application about which information is banned, all you want.

This is also my understanding, but certainly not what RIAA and folks do want because this loophole allows for much more efficient distribution of those illegal numbers. Therefore they are motivated to overblock, whether the number is accompanied with instructions or not.

If you distribute a number that represents something you're not allowed to distribute, without announcing this connection, you're not really making that secret information available to anyone. There's no loophole there.

You need to somehow include information on how to use the number, which is arguably easier to pull off, but still doesn't make it legal, just harder to find out.

I didn't mean that mere dissemination of numbers is illegal; rather, the RIAA will try to make it illegal, and will penalize (not yet illegal) dissemination by any means possible, exactly because of that. This has huge implications both for those who don't want to get in trouble and for those who want to do the illegal thing.

How about if we first standardize a universal, generic encoding scheme which produces numbers? Then if you saw a random number somewhere, you'd know to simply try decoding it with this scheme to see whether it succeeds.

So, if I understand correctly you could do the following:

1. Someone somewhere creates a numbered list of primes (so 1=1,2=2,3=3,4=5,5=7,6=11,7=13,8=17, etc etc)

2. You refer to "the [N]th prime" with a vastly smaller number, and people could fetch it from the list of primes.

3. They could theoretically ban the number N?

>with a vastly smaller number

Fun fact, there are a _LOT_ of prime numbers. There are about 10^22 prime numbers smaller than 10^24, so about every 100th number is prime. The ratio decreases the higher you go of course. The vastly smaller number is actually just a few bits smaller :)
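A rough back-of-the-envelope check in Python, assuming the prime number theorem estimate of about x / ln x primes below x. The function name `index_bits_saved` is made up for illustration.

```python
import math


def index_bits_saved(x: float) -> float:
    """Estimate the bits saved by naming a prime near x by its index in the list of primes."""
    prime_count = x / math.log(x)          # prime number theorem estimate
    bits_for_value = math.log2(x)          # bits to write the prime itself
    bits_for_index = math.log2(prime_count)  # bits to write its position
    return bits_for_value - bits_for_index


saved = index_bits_saved(1e24)
```

Under that estimate the saving is log2(ln x), which for x = 10^24 comes to a little under 6 bits out of roughly 80.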

If you could do that reliably, you'd obviate all compression algorithms. Unfortunately, primes don't compress like that.

Historically shared DRM keys are typically symmetric encryption keys, not asymmetric ones, and thus not primes. I am not aware of any instances of an "illegal prime" that was prime because it was the private part of an RSA-based DRM cryptosystem. As far as I know, all of them were either encoded as primes deliberately, or just coincidentally prime.

Edit: except for this one, I guess. This one wasn't about DRM, but rather just a locked down device. TI sent bogus DMCA notices for that one, but as far as I can tell there is no solid legal theory under which those primes were illegal, and nobody ever got sued, so we can file this one under "manufacturer upset they no longer control their users' devices throws a legally meaningless fit".


There are some encryption schemes that use raw primes as their keys.

But the first popular 'illegal prime' was a gzipped code that would defeat the DRM of DVDs. See https://en.wikipedia.org/wiki/Illegal_prime

It's what happens when computer programmers try to pretend the law is an algorithm... the law doesn't work like computers do. To a computer, every sequence of bits is the same as another sequence in the same order; the law, on the other hand, cares about the source of the bits.

There is a great classic essay describing this called "What color are your bits?"


It's fun to think about, but any digital file is just a big long number. And prime numbers are surprisingly common.

So finding a variant of one offending digital file that's also a prime is a rather straightforward process.

What fascinates me more is all the illegal bits people put into the bitcoin block chain; and how bitcoin people deal with that.

Especially since you can make random looking strings illegal after the fact:

I have an illegal file A. I produce a random binary file R of the same size as A. I publish (R xor A) and R in two different venues, both mathematically-perfectly random files on their own. But together, they produce illegal file A.
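A minimal Python sketch of that split, using a harmless placeholder for file A. The names `split` and `combine` are hypothetical.

```python
import secrets
from typing import Tuple


def split(secret: bytes) -> Tuple[bytes, bytes]:
    """Split secret into a random pad R and the masked file (R xor A)."""
    pad = secrets.token_bytes(len(secret))               # R: uniformly random bytes
    masked = bytes(a ^ r for a, r in zip(secret, pad))   # R xor A
    return pad, masked


def combine(pad: bytes, masked: bytes) -> bytes:
    """XOR the two shares back together to recover the original file."""
    return bytes(r ^ m for r, m in zip(pad, masked))


file_a = b"placeholder contents of file A"   # stand-in, not real data
share_r, share_x = split(file_a)
restored = combine(share_r, share_x)
```

This is the one-time pad construction: each share on its own is uniformly random, and only the pair together determines A.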

How do the bitcoin people deal with it? I thought they just accepted it as a fact of life?

That seems quite naive. I don't think "I just accepted child porn had to be on my computer as a fact of life." would go down well in court.

It looks like they didn't solve it and just left it there. So I'm afraid a lot of bitcoin users are that naive.

The blockchain has been illegal for years, hasn't it?

What are the random numbers? Commit hashes are deterministic and based on the commit contents

I believe they were using "random" to mean "assorted" (which is perfectly cromulent).

edit- thanks commenters for the correction!

Git commit hashes are indeed deterministic, though it depends on the inputs.

From StackOverflow [0]:

Git uses the following information to generate the sha-1:

* The source tree of the commit (which unravels to all the subtrees and blobs)

* The parent commit sha1

* The author info (with timestamp)

* The committer info (right, those are different!, also with timestamp)

* The commit message

[0] https://stackoverflow.com/a/34764586/293064
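For the curious, a minimal Python sketch of how those inputs turn into an ID, assuming the classic SHA-1 object format (`<type> <size>\0<body>`). The helper names and the author/committer identities are made up for illustration.

```python
import hashlib
from typing import Optional


def git_object_id(kind: str, body: bytes) -> str:
    """SHA-1 of '<kind> <size>\\0<body>', the format Git uses for object IDs."""
    header = f"{kind} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def commit_body(tree: str, parent: Optional[str], author: str,
                committer: str, message: str) -> bytes:
    """Assemble the text of a commit object from the inputs listed above."""
    lines = [f"tree {tree}"]
    if parent is not None:
        lines.append(f"parent {parent}")
    lines.append(f"author {author}")
    lines.append(f"committer {committer}")
    return ("\n".join(lines) + "\n\n" + message + "\n").encode()


EMPTY_TREE = git_object_id("tree", b"")   # the well-known empty tree ID

# Hypothetical identity and timestamps, for illustration only.
who = "A U Thor <author@example.com>"
first = git_object_id("commit", commit_body(
    EMPTY_TREE, None, f"{who} 1604367844 -0500", f"{who} 1604367844 -0500", "temp"))
again = git_object_id("commit", commit_body(
    EMPTY_TREE, None, f"{who} 1604367844 -0500", f"{who} 1604367844 -0500", "temp"))
later = git_object_id("commit", commit_body(
    EMPTY_TREE, None, f"{who} 1604367844 -0500", f"{who} 1604367900 -0500", "temp"))
```

Here `first` and `again` are identical because every input matches, while `later` differs only in the committer timestamp and so gets a different ID.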

It's deterministic, you've just changed the inputs. The "committer date" is usually excluded from logs so you don't see it, but it's the only missing input. Here is a demo:

    $ git rev-parse HEAD 

    $ git log HEAD      
    commit 78d66a4f169fec9cb9b3252f7a22bb020e967cd2 (HEAD -> master)
    Author: XXXX
    Date:   Mon Nov 2 20:44:04 2020 -0500


    $ git reset --soft HEAD\^

    $ GIT_COMMITTER_DATE='Mon Nov 2 20:44:04 2020 -0500' GIT_AUTHOR_DATE='Mon Nov 2 20:44:04 2020 -0500' git commit -m temp2
    [master 78d66a4] temp2
     1 file changed, 0 insertions(+), 0 deletions(-)
     create mode 100644 b

    $ git rev-parse HEAD

Git commit hashes _are_ deterministic. Run `git cat-file -p <hash>` on your two hashes. The timestamp in the committer field will be different. That is why the hash changed.

Gödel encoding expands your options considerably.

Primes are very common. So as long as you have a few bits to fiddle with in your file, you can probably make it a prime very quickly.

Given an arbitrary stopping point x, you have approximately x / log x primes between 1 and x.

In other words, for a file of length n, you have to try about O(n) slight variations to find a prime number.
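A minimal Python sketch of that search, assuming the file tolerates a few trailing padding bytes that we are free to vary. It uses a probabilistic Miller-Rabin test, and the names `is_probable_prime` and `pad_to_prime` are made up for illustration.

```python
import random
from typing import Tuple

SMALL_PRIMES = (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37)


def is_probable_prime(n: int, rounds: int = 20) -> bool:
    """Miller-Rabin test: never rejects a prime, rarely accepts a composite."""
    if n < 2:
        return False
    for p in SMALL_PRIMES:
        if n % p == 0:
            return n == p
    d, s = n - 1, 0
    while d % 2 == 0:          # write n - 1 as d * 2**s with d odd
        d //= 2
        s += 1
    for _ in range(rounds):
        a = random.randrange(2, n - 1)
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False       # a is a witness that n is composite
    return True


def pad_to_prime(payload: bytes, pad_bytes: int = 3) -> Tuple[int, int]:
    """Append pad_bytes of padding and vary it until the whole number is a probable prime."""
    base = int.from_bytes(payload, "big") << (8 * pad_bytes)
    for tail in range(1, 1 << (8 * pad_bytes), 2):   # odd candidates only
        if is_probable_prime(base + tail):
            return base + tail, (tail + 1) // 2      # (prime, candidates tried)
    raise ValueError("no probable prime found in the padding range")


payload = b"any file contents would do here"   # stand-in data
prime, tries = pad_to_prime(payload)
```

The payload survives intact in the high bytes of the result. Only the padding changes, and the number of candidates tried grows roughly in proportion to the file length.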

A nice example is the drawings using prime numbers. For example, see

"The Trinity Hall Prime - Numberphile" https://www.youtube.com/watch?v=fQQ8IiTWHhg

"The Emerging Art of Drawing in Prime Numbers" https://www.popularmechanics.com/science/math/a28649996/the-...

Ha! - but I don't think it is physically possible to publish a transcendental number, is it? (Nor any other non-rational real?)

Depends what you mean. If it can be specified by a finite sequence of operations, I’d say it’s publishable.

If you’re going to restrict “publishing” to binary fixed point, arguably 1/3 is unpublishable.

The real (no pun intended!) distinction is between computable and noncomputable reals, as first identified by Alan Turing in "On Computable Numbers".

The integers are a subset of the rationals, which are a subset of the algebraic numbers, which are a subset of the computable numbers, which (most mathematicians would accept) are a subset of the reals.

There are lots of subtleties about this. In fact Scott Aaronson just started a series that is in some sense about some of those subtleties, which I hope to be able to understand some day!

> If it can be specified by a finite sequence of operations...

Hmm... are you saying that, for example, pi is the one-step sequence of operations of dividing a circle's circumference by its diameter? But first, you have to get both those values...

There are simple formulas for computing Pi. For example Pi can be computed to arbitrary precision as 4 * (1 - 1/3 + 1/5 - 1/7...).
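As a minimal Python sketch, the series can be summed term by term. The function name `leibniz_pi` is made up, and floating point limits how far "arbitrary precision" goes in this form.

```python
import math


def leibniz_pi(terms: int) -> float:
    """Approximate pi as 4 * (1 - 1/3 + 1/5 - 1/7 + ...) using the given number of terms."""
    total = 0.0
    for k in range(terms):
        total += (-1) ** k / (2 * k + 1)
    return 4 * total


approx = leibniz_pi(100_000)
error = abs(approx - math.pi)
```

The series converges slowly: the error shrinks roughly in proportion to 1/terms, so each extra correct digit costs about ten times as many terms.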

An algorithm for computing pi would be a finite sequence of operations. Even if the operations include "do an infinite loop".

Pedantically enough, by formal definitions algorithms have to finish. See eg https://en.wikipedia.org/wiki/Algorithm

In practice, speaking informally, people talk about programs that may run forever all the time.

People have developed lots of theory around those as well. Of specific interest here would be the class of programs that provably only runs for a finite time before it produces the next bit of output.

You need that latter class of programs to talk about producing digits of pi, or running the main loop of an OS.

See also https://en.wikipedia.org/wiki/Total_functional_programming

Then, I would say, you would be publishing a fact about pi, not pi itself. These facts do not yield pi in any finite time.

I'm confused by how you could specify a transcendental number using a finite sequence of operations. It's sort of in the very definition of transcendental numbers that you can't.

You could just write down a Turing machine.

That captures all of the transcendentals we care about. The remainder (to be fair, almost all of them) aren't computable at all, even in principle.

Thanks for clarifying. I do understand that these specific hashes are effectively forbidden in the dmca repo.

Are you aware of any specific instances where the presence of youtube-dl hashes in other repos have been found by github and used as the grounds for disciplinary action? That would be noteworthy indeed.

I find those pull requests absolutely hilarious, but my sense of humor is lowbrow childish.





DMCAs would be much better served OwO'ified.

It is wonderful to see GitHub being a victim of their own refusal to implement a feature to disable pull requests on a repo.

> I'm not sure how I feel about adding deliberate boobytraps to repos that seem to be purely punitive.

It also feels like it would violate the license of the repository. I see that one of the author's repos is LGPL [1], which allows redistribution of the source. Restricting how that source is distributed seems to me like a violation of that license, if not in letter at least in spirit.

(Of course, it goes without saying that I'm not a lawyer, which is why I said it feels like a violation.)

1: https://git.joeyh.name/index.cgi/haskell-mountpoints.git/tre...

I don’t even think it’s an in-spirit violation. You have to offer to distribute the source. That doesn’t have to be from any particular hosting provider; it just has to be reasonable.

(The license is ~14 years older than git itself. There’s no way distribution using a specific git hosting service is required under the license.)

He's not restricting GitHub from hosting it, he's just using an artifact of GitHub's implementation (or, well, a hypothetical GitHub implementation - I'm not sure GitHub actually blocks submodules that reference the relevant commit hash?) that happens to make GitHub refuse to host it.

GitHub is free to change their minds about the implementation and start accepting those commits again. That is, you, as a recipient of the code, are free to ask GitHub to host it, and to offer them a license under the terms of the GPL; GitHub is free to accept or to decline, same as if you sent me the code, I am free to host it on my personal website but I'm also free to not host it.

(It would feel more like violating the spirit if it, say, prevented you from uploading it to a GitLab instance you self-host... but since you self-host it, you could just patch out any suppression code, so that case can't come up.)

You can't force somebody else to host your GPL code if they are not a party to the license.

Making a repo that is incompatible with another hosting system is not a GPL violation.

There are all sorts of compatibility issues that can arise from platforms doing dumb things. For example, a repo with the files README.md and readme.md in the same directory would break on certain Windows machines due to case sensitivity problems.
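A minimal Python sketch of how one might spot such collisions in a list of paths before they bite. The function name `case_collisions` and the sample paths are illustrative only.

```python
from collections import defaultdict
from typing import Iterable, List


def case_collisions(paths: Iterable[str]) -> List[List[str]]:
    """Group paths that differ only by letter case."""
    groups = defaultdict(list)
    for path in paths:
        groups[path.casefold()].append(path)
    return [sorted(group) for group in groups.values() if len(group) > 1]


clashes = case_collisions(["README.md", "readme.md", "src/main.c"])
```

On a case-insensitive file system only one file from each reported group can exist in the working tree at a time.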

If a country now decided that the Linux kernel violates law and bans it I don't think the kernel would violate their license just because it's now illegal to distribute it in that country.

Would you please kindly delete your comment before corporate lawyers get any funny ideas.

Can’t delete a HN comment if there is a reply to it :)

GPL doesn't require you to provide a git repository, a tarball is good enough.

It's a way of proving a point.

"Hey, GitHub has stupid practices which are just terribly harmful for the community. Also, here's my source code."

Someone comes along and pushes that same repo to GitHub. GitHub bans them. This is exactly what you'd been warned of!

I'd find it incredibly childish and needlessly divisive to make your repo impossible to host on a given provider just because you don't like the provider. This does nothing but make the tech community worse for everyone and less accessible. Newcomers shouldn't have to navigate petty politics to get started with tech and people already here shouldn't have to have their time wasted with it when they're just trying to be productive.

> I'd find it incredibly childish and needlessly divisive to make your repo impossible to host

If only it was! In that case, it is github who is allegedly making the repo impossible to host. It is, thus, github who is acting in a childish way. The repo author is only adding some silly hashes.

By the way, the real "childish" behavior is believing that speech can be censored by technical means alone. This is impossible, there are always technical means to circumvent it. Speech can only be effectively censored by means of armed police action.

That's a pedantic distinction. If github won't allow hash X and you put in that hash on purpose just so github won't allow it then obviously the intent was on your part.

Newcomers have to navigate the copyright politics whether they like it or not. The American copyright laws have created this mess, not the OP.

Would you say the same about code that, for example, is designed to fail when used in a missile? E.g. navigation code that deliberately pilots into the ocean if onboard something going faster than Mach 3?

That would be outrageously unethical code to write, tantamount to criminal negligence.

I think the poster meant navigation code that was not intended to be used in a missile, being used in a missile. Consumer grade GPS units do exactly that; they intentionally stop working above a certain speed or altitude, so that they can't be used to build missiles.

There’s a big difference between doing that in a consumer grade device and doing it in an open source project you publish for the world to use; i.e. when someone decides to use it in a hypersonic passenger jet

It's perfectly ethical to write open source code that will crash a passenger jet if used in such a manner. It's unethical (and probably criminally negligent) to include it in a passenger jet.

> It's perfectly ethical to write open source code that will crash a passenger jet


We’re not talking about bugs here, we’re talking about the deliberate inclusion of dangerous logic traps based on assumptions that could be wrong.

In engineering terms, this is trying to solve the problem at the wrong level of abstraction. Maybe you don’t like the thought of your code being used in weapons; that’s perfectly reasonable. Maybe you want to rid the world of such weapons; also reasonable. My advice would be that open source is not the right model in those circumstances, and that to really solve the problem you’ll have to be politically engaged.

Littering code with booby traps and publishing it is an absolutely awful idea, one that should not be acceptable to us as a professional community.

What is the difference? I don't see it, besides one having a warranty and the other explicitly stating that it offers no guarantees or support.

Just stating you don’t have a warranty doesn’t absolve you of all ethical considerations. You can’t just publish something littered with poorly thought out booby traps and then absolve yourself with a license file! If you’re not interested in people using your code for specific purposes, maybe don’t publish it for all to use?

Others probably have not fully realized this yet, but with GitHub one can:

1) Publish arbitrary commits under your https://github.com/my/project URL, e.g. a fake https://github.com/my/project/blob/<faked_commit>/README.md in your project describing how to install it that actually describes installing malware.

2) Publish those commits under your name, with your email address, and GitHub will prominently display it as if you made the commit (most do not use GPG signatures, and most do not know to look for "Verified" anyway)

It seemed only a matter of time before this behavior got abused for something (anti-DMCA action is perhaps the best outcome of this situation I can imagine..)

I'm not quite sure what you're saying here. Are you claiming that you can push a commit to userA/project and then view it under the Github web interface for userB/project (assuming one repo is forked from the other)?

If so it seems like that's a relatively easy fix for Github, just check if the commit is actually contained in userB's fork of the repo.

Quick demo: go to https://github.com/slimsag/linux/tree/5895e21f3c744ed9829e3a... then try replacing "slimsag" with "torvalds"

These are "known issues" I believe GitHub doesn't intend to fix:

1. With youtube-dl[0] (and probably long before?) we know you can push a commit and view it under the web UI under another user, GitHub hasn't indicated this is a vulnerability at all.

2. Many people know you can impersonate another user through Git email addresses[1] (GPG signing is supposed to solve this, but a lot of people don't use it and even when they do others don't really know they should look for "Verified" in GitHub's web UI.)

Combine these two known-issues and you get a really convincing phishing attack.

I really hope this doesn't become a common-place thing. Hopefully this raises some awareness that, basically, you should not trust any GitHub URL with a commit SHA in it - only trust ones with branch names - because it could be a phishing attack otherwise.

[0] https://news.ycombinator.com/item?id=24882921

[1] https://bounty.github.com/ineligible.html#impersonating_a_us...

Very neat!

Though I probably wouldn't use a thing from a random hash? I'd go to the repo's main GitHub page and look for branches/tags that interest me. Doesn't everyone do that?

I often leave links with a specific hash, just in case the content changes in the future making my link invalid. Others don't even check the URL of the page they're looking at, as long as the link looks legit.

To echo the sibling comment, I regularly use hashes because branches can move, and I don't want my links to get stale over time. I don't think I'd release software like that (I'd use a tag), but in bug reports and such I do this all the time.

> I'm not quite sure what you're saying here. Are you claiming that you can push a commit to userA/project and then view it under the Github web interface for userB/project (assuming one repo is forked from the other)?

Very close to that, yes. There's actually just one more step you need to take: you have to also open a pull request from userA/project to userB/project. As a convenience feature, GitHub automatically makes PR commits available under a special namespace. You can try it yourself with any pull request that comes from another repo: just `git fetch origin pull/<PR #>/head`.

Since the foreign commit is there in this special 'pulls' namespace, it can be viewed in the web UI by its commit ID. That's how people are adding youtube-dl to github's DMCA notice repo; they're just opening a PR containing youtube-dl's commit history.

> If so it seems like that's a relatively easy fix for Github, just check if the commit is actually contained in userB's fork of the repo.

It's not that easy. First of all, you'd need to define "contained in". The naive choice would be anything reachable by a named ref. However, all pull requests also create a ref in your repository, so you'd need to exclude those. But that would mean that if you make a PR and then delete the branch the PR is made from, the PR contents aren't visible anymore.

But even if you have the set of refs that you consider to be in a repository, that means every time an object is requested you'll have to walk the whole graph backwards to find if an object is reachable from a ref. That's an expensive operation, and Git is usually pretty fast because it tries really hard to avoid this.
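For anyone who wants to see what that walk looks like, here is a toy sketch. The commit names, the `parents` dictionary, and the ref names are made up for illustration; real Git stores this graph in object files and packs, so this only shows the shape of the problem, not how Git or GitHub implement it.

```python
from collections import deque


def is_reachable(target, ref_tips, parents):
    """Breadth-first walk backwards from the ref tips through parent links.
    Worst case it visits every commit in the history before answering."""
    seen = set()
    queue = deque(ref_tips)
    while queue:
        commit = queue.popleft()
        if commit == target:
            return True
        if commit in seen:
            continue
        seen.add(commit)
        queue.extend(parents.get(commit, ()))
    return False


# Toy history: c1 <- c2 <- c3 on main, plus a PR commit "pr1" based on c2.
parents = {"c1": [], "c2": ["c1"], "c3": ["c2"], "pr1": ["c2"]}
refs = {"refs/heads/main": "c3", "refs/pull/1/head": "pr1"}

# Only branch refs count as "contained in the repository" in this variant.
branch_tips = [tip for name, tip in refs.items() if name.startswith("refs/heads/")]

print(is_reachable("pr1", branch_tips, parents))    # PR commit, branch refs only
print(is_reachable("pr1", refs.values(), parents))  # PR commit, all refs
```

With only branch refs the PR commit is not found, which is exactly the "PR contents disappear" trade-off mentioned above; including the pull refs finds it again.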

The author claims the commit hash is bad and links to an article... but that claim seems entirely unsubstantiated? It's not the hash that is bad, it's the files that are bad. If you manage to hack git to get the commit without the files (clever trick by him), you don't hit any legal issues.

Am I missing something?

The hash is probably an excellent heuristic for spotting commits from the original YouTube-DL Git log, and so GitHub probably scans for that when searching for rehosted copies.

Agreed, but no one has actually stated that GH is scanning for these commits, let alone made the claim that the hashes are illegal. If you're going to start making legal claims they have to hold up to scrutiny, which this does not.

All you have to do to substantiate it is follow his procedure and see if you can successfully fork such a project to github. Let me know how it goes!

Totally down with the potential conflict of interest or implications of GitHub, but it’s not GitHub specifically that’s zealous about DMCA. Any server and any host of that server is going to be subject to it. The only appeal of a smaller or private server is less visibility, but legally it’s the same. DMCA isn’t going anywhere.

Well, GitHub/Microsoft could go on a PR campaign and say "we're not going to honor this RIAA DMCA since we know yt-dl isn't violating the DMCA!" but then GitHub/Microsoft would be opening themselves up to a lawsuit from (basically) the entire music industry. The amount of goodwill MSFT loses over this (hopefully isolated) incident has to be worth a few orders of magnitude less than the tens of millions of dollars that would be burned to actually fight the RIAA.

Smaller hosts could get away with not honoring DMCAs since the RIAA likely isn't going to waste resources actually filing a lawsuit, but this yt-dl situation seems like the perfect setup for the RIAA to set a precedent outlawing video/music downloaders if someone were to actually fight them on it (and until then, they can continue to take down video/music downloaders until someone does counter it).

Microsoft is actually a member of the RIAA [1].

[1] https://www.riaa.com/about-riaa/riaa-members/

Microsoft is much wealthier than the “entire music industry” but this still might not be a hill they want to die on.

They don't want to get rid of copyright, it's just about creating a less punitive system. I don't think there's much downside for them if they really cared.

How deep are the RIAA's pockets? Has this most recent tactic held up in court? What are you basing this on?

Does github actually have to honor every request that comes in? I thought the youtubes of the world did that because of the volume, but hypothetically, couldn't they do some due diligence and push back on requests they don't think are valid instead of just taking them down and requiring the repository owner to appeal. I'm sure it would be more expensive for them, but it's still a choice.

It is my understanding that DMCA requests have to be honored immediately unless the hoster wants to expose themselves to large legal risk. The uploader of the banned content can file a counter request, upon which the copyright holder either withdraws, or things land in court. But if the hoster doesn't honor the request they lose their hosting privilege and can be sued for copyright infringement themselves.

DMCA is not a worldwide law.

It may not be legally, but practically. Almost all major content-hosting companies are headquartered in the USA and thus bound to the DMCA: Facebook/Twitter (social networks), Google/Amazon/Microsoft (clouds), Github/Gitlab/Sourceforge (code repositories), StackOverflow, Automattic (Wordpress), Akamai/Cloudflare/Fastly (CDNs), Wikimedia/Fandom (wiki hosting). The only major exception is Atlassian who are headquartered in Australia.

No matter if your content may be legal under e.g. European law (e.g. right to repair, right to interoperability, right to reverse engineer), you are going to have a hard time hosting it. And even if you get it hosted at a European provider (remember, we don't have anything that competes with any of the three US cloud giants in terms of functionality!), you will have issues with accepting donations easily - Paypal, Stripe and all credit cards are under US regulation.

And it's not just theoretical, just look at what happened to Kim Dotcom/Megaupload (or, tangentially related, Julian Assange). If the US deems you a danger to their business interests, you are going to get hunted down, no matter where in the world you are and if what you are doing is legal under the jurisdiction of that country.

I partially agree with you. This seems like a very good argument to start competing with the US harder.

As a counterexample, I'd like to offer you sci-hub which doesn't seem to have significant hosting problems. Remember, we are not trying to replace an entire industry and all possible use cases at once. We're simply discussing the hosting of a few git repositories which some US entity might consider unsavoury due to a borked and unfair law.

> As a counterexample, I'd like to offer you sci-hub which doesn't seem to have significant hosting problems.

They have to change domains every so often, as the copyright mafia has a "blanket" seizure grant (https://torrentfreak.com/publisher-gets-carte-blanche-to-sei...), Cloudflare won't touch them as a result, their founder has (at least!) one court judgement of 15 million US$ for Elsevier in New York and another 4.8M$ for ACS against her, and I bet that there is some sort of secret indictment floating around that gets unsealed if Elbakyan ever travels out of Russia so an extradition warrant can be put out. The relatively unique advantage they have is that their founder is possibly linked to the Russian secret service GRU: https://www.washingtonpost.com/national-security/justice-dep...

Effectively, Elbakyan's right to free movement is restricted to those nations that don't extradite to the US and have friendly relations to Russia. And for what we learned from the Snowden and Assange cases, it's safe to assume that even flying over a country that has extradition agreements with the US in a passenger flight is grounds enough for intervention.

Can you elaborate your last sentence?

You can consider it a worldwide law:

  - a lot of big tech companies are based in the US
  - a lot of companies want to do business in the US
  - DMCA can become part of a trade agreement with the US, I don't know if the E.U. will save us at this point.

No, you can't. Let's not weaken words in order to make a presumed point.

I do appreciate that these are factors which make DMCA more relevant than if these factors did not exist. But your last point is not even a current fact but a hypothetical future.

I always invite people not to be defeatist but proactive in materializing the future they want to see. Citing unfortunate potential futures is not the right way to solve our problems.

This sounds like that social engineering challenge one guy did - where he said "I lose if you convince me to post <token> anywhere in my github repo", and someone won it by saying "you should put that challenge on your website" since the website code was in his github repo.

I read the post twice but don't quite get what the author wants to share. Is it some git(hub) trick I don't understand, just a way of saying Github's/Microsoft's monopoly is bad, or something else altogether?

Github complies with the DMCA, recently famously through the DMCA takedown of youtube-dl. The writer, Joey Hess (a relatively well-known OSS programmer, moreutils, for instance), is choosing to interpret this as their banning any repo that hosts the commit hash for youtube-dl (which they banned).

So he is jokingly showing you a way to create a Github poison pill. The joke is that if you include this 'illegal' hash you subject the repo to being DMCA'd out (not really, but it's sort of an 'imagine if'). Which means his repo is now unpublishable on Github (for long anyway).

It's kind of hacker humour in the sort of vein that you'd expect from an old-schooler like him. You sort of have to get the joke to get the joke.

I'd find the joke funny if it actually worked.

Since as far as I can tell, Github isn't actually banning specific hashes (because they care about the content, not the hashes), and there's no sign that they're going to do so in the future, I can't say I understand the humor...

That was my initial reaction too, but thinking about it, it's potentially (probably not, but potentially) a violation of the ToS even if it's not automatically banned.

So, the hacker humor point is that you have a "poison" repo which, if you push it to GitHub, might get your account deleted in the future. You have no guarantee that it won't, for sure.

The way I read it, it seems to claim that github will just "insert" a known object with that hash into the new repository when it's pushed, though I'm somewhat confused as to how this would work, or how the submodule trick even adds the hash to the commit log in the first place.

EDIT: I read the article again and couldn't find any of that, so I must have read that in a comment here on HN instead.

That was a very well-written comment, thank you. Hit the dopamine and gave me a smile for your explanation.

is it a joke though.....

..... yes

It's a way to sneak in the hash of a commit (the latest youtube-DL commit) in a non-obvious way.

A git repo is fundamentally composed of 3 object types:

* Blob (effectively a file)

* Tree (which refers to a collection of Blobs and Trees, effectively a dir)

* Commit (comment/ author info and various other metadata, a Tree, and a previous Commit)

This is what the git tools are used to dealing with. However git submodules changed things. With submodules, a Tree object can also contain a commit object! Together with the .gitmodules metadata file, this allows you to include another git repo inside your repo.

Joey leveraged that ability to add the youtube-dl latest master commit into his repo as a submodule, but deleted the .gitmodules file so that git wouldn't be verbose about it.

And that's how you sneak a commit into a repo.

This additionally leverages GitHub's data model (which is really the git data model), where they just have this huge DB of all the objects of all the repos on GitHub. So effectively by including this commit in a repo on GH, you're making it refer to the entire (buried) youtube-dl repo, but you need to be sneaky to be able to see it.

Note that it's only a _reference_ to a commit, not the commit object itself. So from a GitHub point of view, it would not even make sense to scan / trigger on this. As long as there are no commit objects with that hash, any reference is harmless.
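To illustrate the "only a reference" point, here is a small sketch that builds the raw bytes of a tree object by hand. It assumes the standard loose-object encoding (a `<type> <size>\0` header hashed with SHA-1, and tree entries of the form `<mode> <name>\0<20 raw bytes>`); the helper names and the foreign commit ID are placeholders I made up, not a real youtube-dl hash.

```python
import hashlib


def git_object_id(kind: str, body: bytes) -> str:
    """SHA-1 object ID Git assigns to an object of the given kind."""
    header = f"{kind} {len(body)}\0".encode()
    return hashlib.sha1(header + body).hexdigest()


def tree_entry(mode: str, name: str, object_id: str) -> bytes:
    """One tree entry: '<mode> <name>', a NUL byte, then the raw 20-byte ID."""
    return f"{mode} {name}".encode() + b"\0" + bytes.fromhex(object_id)


# A regular file: its content becomes a blob object stored in the repo.
readme = b"hello\n"
readme_id = git_object_id("blob", readme)

# A submodule entry ("gitlink"): mode 160000 plus a commit ID and nothing else.
# Placeholder ID for illustration only.
foreign_commit_id = "0123456789abcdef0123456789abcdef01234567"

tree_body = (
    tree_entry("100644", "README", readme_id)
    + tree_entry("160000", "vendored", foreign_commit_id)
)
tree_id = git_object_id("tree", tree_body)

print("blob id:", readme_id)
print("tree id:", tree_id)
print("tree size in bytes:", len(tree_body))
```

The whole cost of the submodule entry is its name plus 20 bytes of hash; none of the foreign repository's commits, trees, or blobs are part of the tree being hashed.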

They're using the fact that youtube-dl is being removed by GitHub automatically when rehosted in new repositories, due to a DMCA complaint (and likely overreach). If someone rehosts his repository onto GitHub, the repository will be taken down, and the rehosting account likely banned.

Is there any evidence that GitHub is automatically removing rehosted copies? Last I saw, GitHub had posted a comment asking folks not to re-host, at risk of being suspended, but there was no indication they’d started any kind of automated scanning.

Beyond that, it seems like a very community-hostile move to intentionally embed commit hashes that would even potentially cause a user to get banned from another platform if they push a copy of the code there. This doesn’t stick it to GitHub, it sticks it to the users of your repo.

It depends if you want to be nice to everyone, or burn some bridges to make a point. The point being that there shouldn't be censoring of specific hashes.

Honestly if a user did get banned for sharing that poisoned repo, I would still put 95% responsibility for that on GitHub. But it's not happening yet as far as I know.

It'll be a question of how many reposts they see. If enough people keep reposting with throwaway accounts, they will have no other choice but to start scanning automatically.

ROT13 the entire repo?

If they figure it out, then ROT14 it.

If they figure that out, encrypt it and post the key somewhere that blocks the offices of Github.

If they figure that out, we'll still find a way to build a better mouse.

If someone wants you to leave, it's usually best to leave. Good riddance to them if you find yourself in that position.

Far better would be to spend your efforts finding which RIAA members pressed the RIAA to make this takedown. Follow the money, cui bono, etc.

Why is it best to leave? Protests are a good thing, sometimes, and making sure it is on GitHub in some form or another is a form of such protest. At some point the relevant parties will have to realize that this mouse of a million engineers will always outsmart them. At some point they will have to realize that advocating for FOSS and DRM at the same time is hypocritical.

Rosa Parks was asked to leave, but she didn't leave.

People have been protesting against Facebook's policies almost since it was created (real-name only, no nudity even in classical art pieces, ...). That didn't change a thing. Facebook kept in-line with its policies, people stayed, and network effects only cemented the monopoly.

Github is extremely close to having a monopoly, leaving is the only way to prevent it from happening. If it becomes one, it has no incentives to listen to a few protests.

GitHub is not the instigator, as far as we know (MS is apparently a member of the RIAA though). I am as amused by the clever ways to get youtube-dl back onto GitHub as anyone else, but it's protesting the wrong thing in the wrong place.

The problem lies with whichever member organizations inside the RIAA (or nonmember orgs friendly with the RIAA) convinced them to file the takedown. Most of us aren't positioned to find that out, but that's where the most can be gained. Maybe the EFF or similar orgs can help. Then we'll know what the true motives were, and how to address them.

At the same time, technological measures can show it to them that they can't feasibly take it down and that they're fighting a losing battle. At some point the RIAA will just have to accept defeat because honestly it's not that much work to defeat their DMCA takedowns.

Right now they're issuing what, 1 takedown notice to GitHub? When we make that 100,000 takedown notices, some in ROT13, some in ROT24 on an Israeli server, some in RSA encryption on a Chinese server with keys published on Russian servers, maybe they'll just be too overwhelmed with paperwork at that point.

Add some honey-pots with unrelated content and counter-sue them for circumventing encryption on copyrighted content. Because how would they know what is encrypted unless they circumvented it without permission ... unless they perjured themselves?

It's probably related to Google killing^W sunsetting Google Play Music and moving everyone to Youtube Music.

No, no. If they figure it out, you send THEM a DMCA notice for circumventing encryption to gain access to copyrighted content. Of course, you write your license (unencrypted) as open source except for specific companies.

Edit: On second thought, I just visited riaa.com . How about we lobby jQuery, WordPress and bootstrap to add a specific exclusion to their licensing? Now, this wouldn't affect the RIAA immediately, only the next time they update.

Joke's on them, I already ROT26 all my repos.

It's Git performance art/protest art, more or less.

Git has a mechanism called "submodules" which allows one repo to reference another repo. It doesn't actually include any of the content from the second repo: all that's included is the URL (in the file .gitmodules) and a commit hash (in Git's view of the directory structure).

When you clone a Git repository with submodules and you pass an option to git clone, or if you run a git submodule command later, Git will make a nested clone at whatever subdirectory path contains the submodule, and check out that commit. If you make commits in the submodule (and, hopefully, remember to push them), the outer repo will appear to be modified, and you can git add the changes, which will record the new commit hash, but only the hash.

The purpose is to deal with vendoring or sometimes making changes to large repositories owned by someone else without copying the whole source code into your own repository. (It's also occasionally used as a mechanism to split up very large repositories that are owned by the same group/organization, e.g., if you have a test suite or large data files or something, or a shared library used by multiple projects, you might find it helpful to use submodules.) https://github.com/ceph/ceph is an example repo that uses them - "ceph-object-corpus" and "ceph-erasure-code-corpus" at the top level are submodules, and if you click on the .gitmodules file, you'll find that there are other submodules inside src/, too. This is also an example of how GitHub handles submodules also hosted on GitHub - it will link you to the submodule at that commit.

So, GitHub does at least some parsing of submodules. The author is claiming that if you include a banned commit hash as a submodule in the history of your own project, and then promptly delete it, it won't noticeably increase the size of your Git repo, nor will you actually include any of the banned content yourself, but GitHub will potentially prevent that history from being pushed. As a result, you have a Git repo that GitHub would not accept.

Since the author both does not use GitHub and wants to encourage others to do the same, this would be a fairly effective way of forcing that.

(The article doesn't show GitHub actually refusing to accept the history containing this submodule, though - it only shows the submodule having been created, locally. Although maybe the author's argument is simply that pushing it violates the Terms of Service even if it's not blocked by automated means.)

It's a trick to (theoretically) get someone banned for pushing a cloned project to Github despite not actually containing banned content.

Has anyone read github's terms of service? Github is the one violating them, not the users.

Yes, github is disabling the accounts of people who post youtube-dl. No, this is not in compliance with github's ToS.

github says: “Please note that re-posting the exact same content that was the subject of a takedown notice without following the proper process is a violation of GitHub’s DMCA Policy and Terms of Service." Ref: https://torrentfreak.com/github-warns-users-reposting-youtub...

github's DMCA policy says: "One of the best features of GitHub is the ability for users to "fork" one another's repositories. What does that mean? In essence, it means that users can make a copy of a project on GitHub into their own repositories. As the license or the law allows, users can then make changes to that fork to either push back to the main project or just keep as their own variation of a project. Each of these copies is a "fork" of the original repository, which in turn may also be called the "parent" of the fork.

GitHub will not automatically disable forks when disabling a parent repository. This is because forks belong to different users, may have been altered in significant ways, and may be licensed or used in a different way that is protected by the fair-use doctrine. GitHub does not conduct any independent investigation into forks. We expect copyright owners to conduct that investigation and, if they believe that the forks are also infringing, expressly include forks in their takedown notice." Ref: https://docs.github.com/en/free-pro-team@latest/github/site-...

youtube-dl doesn't fall under any of the restrictions in github's ToS either. ref: https://docs.github.com/en/free-pro-team@latest/github/site-...

I have very little sympathy for github here. They ought to follow the DMCA process. Per the DMCA process, I can't dispute the takedown of the main Youtube-dl repo, since I'm not a party to the process. If I post youtube-dl to github, it should stay up, pending a DMCA notice from RIAA. If such a notice is sent, I should then have the right to then dispute that.

I think we need to work towards the future where DVCS's are _distributed_ and federated. Every time you give anyone centralized control, they will abuse it. I'm not aware of any exceptions.

If you've already pulled it down, your origin will be wherever you pulled it from.

Create an empty repo elsewhere.

In your local clone, run: git remote rm origin; git remote add origin <new environment>; git push -u origin master

You've filled up the new repo with what you have, history and all.

You should be including --mirror to get all refs pushed.

Nice, I had no idea about that argument.

> Github. Which, as we know, is your resume.

Therein lies the problem. Encouraging people to put their resume on centralized platforms (be it Facebook, LinkedIn, or Github) is putting their entire career at risk. It's better to have your own domain and self-host; that way you aren't locked in and at risk of losing your entire professional history if you get banned or a service goes offline.

It's no longer relevant to me nor him, but I can't help but think the attitude of this article kind of violates the DFSG. If the stance is to insist that free software should be allowed to be redistributed by anyone, flaming those who have in the past wanted to mirror your code on github by indicating you could get them banned by fiddling with the hashes in your repo is pretty aggressive. I totally get the frustration, but when I learnt the DFSG, this was the biggest lesson for me. Freedom means freedom for anyone. Restricting your code to, say, non-military use, isn't compliant with the DFSG. So, now that we have perceived hostile code hosters, do we really want to go there? "You're not going to push my stuff to x.y.z, or I will make you feel the pain."

What do you make of an author who only provides tarballs?

GitHub does not have a monopoly.

That said, this is a very clever use of / exposure of the current systems in place. Unintended consequences of bad laws should be highlighted like this more regularly.

It depends which market you look at. Git hosting in general, especially for enterprise, Github has no monopoly. Gitlab (esp self hosted) is very strong there.

But for public repositories, OSS development and to a certain degree even job applications, Github does have a monopoly.


> GitHub does not have a monopoly.

Yeah, well, that's just, like, your opinion, man.

joeyh, if you're reading this, some feedback: on Android Chrome, at least, your top navbar overlays your article. Suggest you update your CSS for mobile views

When I first read this article, I was really sad.

"If you use one of my open-source repository, you can get in trouble, mouhahaha"

It feels like a dick move to me. But then, I realize that the RIAA bans can now be used as an anti RIAA partnership against GitHub and others, which is... pretty cool.

Definitely a feature that we need \o/ !

Why not call Github Microsoft knowing that Microsoft owns Github?

Microsoft has refused to challenge a highly dubious DMCA notice when they have more than enough resources to do so. Isn't Microsoft the party at fault, given that their lawyers know the DMCA notice doesn't have a leg to stand on?

probably doing the ol' delayed switcheroo like facebook is doing with whatsapp - maintain product as is for a couple of years to avoid immediate backlash while slowly enmeshing and branding all the tools in the product.

then boom one day it's called azure officeserver or some similar microsoft style name and monetised like crazy with github being a deprecated sub-project left to wither and die.

it is NOT in their interest to defend github in any way. any negative press attached to the product literally doesn't matter to them since it's not associated with microsoft

It turns out Microsoft is a member of the RIAA - https://www.riaa.com/about-riaa/riaa-members/. I should have known that. It figures.

So why not just write a tiny script that goes through all youtube-dl commits and recommits them with an added space character in some file? Then you will have a clone with different hashes, but the same content.

Since Git's DAG is a hash tree where each commit points to its parent(s) and its own hash is derived from that, then presumably it would suffice to do this only with the first commit.
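A quick way to convince yourself of this is to hash a toy chain of commit objects by hand. This is only a sketch: it assumes the usual commit object layout (`tree`, optional `parent`, `author`, `committer`, blank line, message), uses a fixed fake identity and timestamp, and the second tree ID is a made-up placeholder standing in for "the root tree with one extra space in a file".

```python
import hashlib


def git_object_id(kind: str, body: bytes) -> str:
    """SHA-1 object ID Git assigns to an object of the given kind."""
    header = f"{kind} {len(body)}\0".encode()
    return hashlib.sha1(header + body).hexdigest()


def commit_body(tree_id, parent_id, message):
    """Serialize a minimal commit object; identity and date are fixed fakes."""
    lines = [f"tree {tree_id}"]
    if parent_id is not None:
        lines.append(f"parent {parent_id}")
    lines.append("author A U Thor <author@example.com> 0 +0000")
    lines.append("committer A U Thor <author@example.com> 0 +0000")
    lines.append("")
    lines.append(message)
    return ("\n".join(lines) + "\n").encode()


def hash_chain(tree_ids, messages):
    """Commit IDs for a linear history where each commit points at the previous one."""
    ids, parent = [], None
    for tree_id, message in zip(tree_ids, messages):
        parent = git_object_id("commit", commit_body(tree_id, parent, message))
        ids.append(parent)
    return ids


EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"   # Git's empty tree
TWEAKED_TREE = "89abcdef0123456789abcdef0123456789abcdef"  # placeholder ID

messages = ["first", "second", "third"]
original = hash_chain([EMPTY_TREE, EMPTY_TREE, EMPTY_TREE], messages)
altered = hash_chain([TWEAKED_TREE, EMPTY_TREE, EMPTY_TREE], messages)

for before, after in zip(original, altered):
    print(before, "->", after)
```

Only the root commit's tree differs between the two runs, yet every later commit gets a new ID because its `parent` line changed.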

Why are we shooting the messenger?

First of all, GitHub is nobody's "résumé", and secondly you can put a git repo anywhere you like: there are any number of hosting sites or easy ways to just host it yourself.

I mean really. what a stupid article

There are a LOT of companies just spamming github repo owners with job interviews nowadays

my last job was through an automated email i got about my github resume hashtags. iirc all they look for is certain tags and a regular commit history after which the recruiter takes a brief look at the repo and fires off the email if the criteria are met.

seems enough to call github a resume to me.

illegal number

This is why aliens don't visit us.

publish to gitlab :)

Gitlab is the most awful mess of javascript you could ever invent though. I use it every day for work and I wish I could avoid it.

I don't see it, and I'm a long time noscript user.

Also been using Gitlab off-prem, on-prem, for more than 6 years. When someone says Gitlab problems all I can think of is resource issues with Sidekiq when I was self-hosting it.

All this drama is only bringing to light how unsafe it is for projects to host their code on a closed platform.

Better be on the safe side and use something you can easily migrate away, like Gitlab.

Why's that? I've been using it for a while (on prem) and like the interface more than Github by now. I never had problems in either Firefox or Chrome.

I feel good for Joey because he's really been one of the few who stuck to his principles and kept from hosting on Github even though it probably would have been way easier to do so.

The git-annex project (and I'm sure many of his other ones) have been going strong without Github and I appreciate him for that. He's finally being vindicated I believe (and eventually many of the warners against these free-as-in-beer but not free-as-in-software services have been proven right)

You just turned a completely defensible innocent act into premeditated destruction of a livelihood.

If you feel that losing access to Github would mean the destruction of your professional life, I'd encourage you to take steps to prevent that corporation from being able to do that to you.

I don't feel that... the author feels that. It was the first line of the article.

(joeyh _is_ the author)

hypocrisy has _many_ forms.

I think that line was intended as a sarcastic joke.
