
If you have any more info about the licensing issues around deep learning models, I'd be very interested to read it.

In exchange, here is a link to the Debian Deep Learning Team's Machine Learning policy:

https://salsa.debian.org/deeplearning-team/ml-policy



Well, I'd be happy to have someone to chat with and exchange ideas about it. I am currently digging into that rabbit hole, which seems to be largely uncharted territory.

I would like to find a way to make true open source deep learning models.

The Debian legal newsletter [1] and LWN [2] have interesting takes on the relevance of the GPL. To them, putting a trained model under the GPL implies that you have to open your dataset too, since it constitutes the "sources". That seems to be the consensus, but I still think it is debatable and could use clarification.

I also dug into the question of whether a trained model can actually be copyrightable if the training code and the dataset are free. Training is akin to a "compilation" operation that adds no creative input (applying copyright to source code is already a bit of a hack anyway). There are pretty strong grounds to argue that trained models are similar to "compilations of facts", which come with very little protection.

I am now wondering whether open source can actually work for deep learning: open source licenses rely on strong copyright protection to be enforceable, so if trained models are not copyrightable, a DL model may not be protected enough for that.

Finally, I am reassured by recent fair use rulings that a model will probably not be considered a derived work of its dataset, and that proprietary data can legally be used to produce an unencumbered model, but the legal uncertainty still exists.

If you are interested in helping me figure out how to protect crucial models so that the first AGI will be beneficial to all and open sourced, I'd be very happy to have someone poke holes in my ideas.

[1] https://lists.debian.org/debian-legal/2009/05/msg00028.html [2] https://lwn.net/Articles/760142/


The Debian ML policy linked above goes a fair way toward making truly open source deep learning models. The biggest problem with the policy is that it does not address the economic disparity: only folks with a lot of money can train a model, even if they had all the training software, drivers and source data under a free license. Perhaps Debian can get enough donated compute time to solve this, though.

The products of compilation seem to be copyrightable, otherwise software piracy wouldn't be prosecutable. Perhaps the same would apply to trained models.

Do you have a link to those fair use rulings? Also note that fair use is an American concept and doesn't apply in many countries, some of which have similar but more restricted concepts. Also, I wouldn't consider a model produced under your example a free model; it would be more of a ToxicCandy model in Debian ML Policy parlance.



Thanks! It really gave me some good insights!


Thanks for the links below; reading these opinions took two more hours, but helped me grind out some thoughts!

First, a quick answer to your last two questions. Programs and binaries are widely recognized as copyrightable. What I am wondering is whether the act of compiling a program constitutes a contribution worthy of protection and of additional copyright. To give a concrete example, imagine I am a company that uses gcc and big machines to provide compilation as a service. You feed it BSD-licensed source code. My server returns a binary on which I claim a proprietary copyright. Are you allowed to dismiss that claim as just the result of a totally deterministic and automated process, and reclaim the binary as BSD? I would argue yes, but it could be a non-obvious court case.

Anyway, I don't think I agree with the comparison between compilation and training.

> Do you have a link to those fair use rulings?

I was thinking about this ruling [1] (Authors Guild, Inc. v. Google, Inc.), in which Google scanned commercial books and used this obviously non-free dataset to provide in-text search. I am pretty bitter that one of the main reasons for the favorable outcome (Google won) was that the judge deemed its usefulness "obvious" when the ruling finally happened, some 10 years after the scanning started, at which point it certainly did not appear obvious to non-tech people. So Google had to prove out a technology while in a legal gray zone, a luxury that organizations like Debian may not have.

------------------

Now for the real meat :-)

> The Debian ML policy linked above goes a fair way to making truly open source deep learning models

Actually, I am wondering whether they are not a bit blinded by the way the GPL works, and whether they constrain themselves somewhat artificially with imagined legal precedents.

They all seem to assume that a trained model will be treated like a compiled binary, but I see at least five competing comparisons that have been proposed and could hold up legally:

1. Trained models as compiled binaries

2. Compilations of facts, as proposed here [2]. I find this pretty persuasive even if its author dismisses it for what I think is not a good argument.

3. Rendered 2D images from a 3D model

4. 2D photographs of a real 3D object

5. Training as a copyrightable creative act [3]

It is understandable that Debian maintainers think about everything in terms of programs and sources, but I feel they shoehorn that notion into machine learning a bit, and may not realize how much more flexible the legal framework actually is.

Admittedly, I am less interested in the consequences of slapping the GPL on a trained model than in finding a way to solve the potential problems caused by bad actors in the field, just as FOSS did for regular software. I strongly suspect we may have to write a viral license adapted to ML.

One example: how would one prevent one's work from being used by OpenAI the day they decide to stop releasing their trained models? Or prevent Google or Facebook from gaining an even more dominant position by adding data to an already good model?

We benefit a lot from the fact that, right now, there seems to be genuine good will among wealthy actors to contribute to the research community, but it feels to me like a Mexican standoff. What happens when one decides to run off with what is published and secretly improve it for commercial gain?

I must say that I have been happily surprised by how much is freely available right now: research, algorithms, frameworks and trained models. We avoided a lot of dystopias, probably thanks to some unsung hero researchers who imposed openness on their employers as a condition of being hired.

The risk still exists, though, as all this openness can be reversed on a whim. Basically, I am wondering how we can put all the chances on our side that the first AGI will benefit humanity instead of just its owner.

Sorry for the wall of text, but if you are still here and would like to continue this discussion, here is fine; real-time discussion also works. You can shoot me a mail at yves.quemener@gmail.com and we can do Hangouts or Signal from there.

[1] https://en.wikipedia.org/wiki/Authors_Guild,_Inc._v._Google,....

[2] https://lists.debian.org/debian-devel/2018/07/msg00175.html

[3] https://lists.debian.org/debian-devel/2019/05/msg00380.html


One other thing is the economic aspects, I just saw this on HN today:

https://learning-at-home.github.io/ https://news.ycombinator.com/item?id=24370510


There is a lot of talk about that aspect on the Debian mailing lists as well. They argue over whether they should have machines to redo the training of models considered open source, and whether that should be part of the "build" process.

I think it is also worth noting that we are headed that way, but there are already many actors with a lot of processing power. Notice how, two years after a model breaks records, there are ways to make it run with 1000x less power. We are brute-forcing the problem, but I doubt that raw power is going to matter much in a few years.

Also, a cat detector is pretty usable at 99% accuracy; not everybody needs 99.99%.

More than processing power, the real strength of distributed training lies in the variety of situations. A thousand users may have a hard time matching the computing power of Facebook's TPU farm, but it will be easier for them to assemble a larger and more varied dataset.
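That intuition (many contributors, each holding different data, pooling their updates) is what federated-averaging schemes try to exploit. As a rough illustration only, here is a toy sketch in Python; the model (a single scalar weight), the per-client data slices, and all function names are invented for this example, not taken from any real framework:

```python
import random

def local_train(w, data, lr=0.01, epochs=5):
    """One client's local SGD on a toy model y ≈ w * x (squared error)."""
    for _ in range(epochs):
        for x, y in data:
            w -= lr * 2 * (w * x - y) * x
    return w

def federated_average(w_global, client_datasets, rounds=20):
    """Server loop: each round, every client trains locally from the current
    global weight, then the server averages the results, weighted by how
    much data each client holds."""
    for _ in range(rounds):
        local_weights = [local_train(w_global, d) for d in client_datasets]
        total = sum(len(d) for d in client_datasets)
        w_global = sum(w * len(d)
                       for w, d in zip(local_weights, client_datasets)) / total
    return w_global

# Toy setup: five clients, each seeing a different slice of the input space
# (the "variety of situations"), all drawn from the same relation y = 3x.
random.seed(0)
clients = [[(x, 3.0 * x) for x in (random.uniform(k, k + 1) for _ in range(20))]
           for k in range(5)]
w_final = federated_average(0.0, clients)
print(w_final)
```

No single client sees the whole input range, yet the averaged weight still converges toward the underlying relation; that is the sense in which pooled variety can substitute for one actor's raw compute.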


Also worth a look: another back-and-forth, from DebConf 2012, that introduces the problem and presents some real-world implications: http://penta.debconf.org/dc12_schedule/events/888.en.html



