
Be wary of using this model - its licensing seems sketchy. Several of the datasets used for training, like WSJ and TED-LIUM, have clear non-commercial clauses. I'm not a lawyer, but releasing a model as "MIT" seems dubious, and hopefully OpenAI paid for the appropriate licenses during training, as they are no longer a research-only nonprofit.



This is a big dispute right now: OpenAI and other AI companies generally take the position that models learning from data does not make the output of the models a derivative work of that data. For example, GitHub Copilot uses all publicly available GitHub code regardless of license, and DALL-E 2/Stable Diffusion/etc. use lots of non-free images. I don't think this has been challenged in court yet, and I'm very curious to see what happens when it is.


I think it might be even less problematic with something like Whisper than with DALL-E/SD? Merely consuming data to train a system or create an index is not usually contrary to the law (otherwise Google wouldn't exist) – it's the publication of copyrighted content that's thorny (and is something you can begin to achieve with results from visual models that include the Getty Images logo, etc.)

I think it'd be a lot harder to make the case that an accurate audio-to-text transcription violates the copyright of any of the training material in the way a visual output could.


They're not just training a system but also publishing the trained system.


> models learning from data does not make the output of the models a derivative work of that data

Most of the debate seems to be happening on the question of whether everything produced by models trained on copyrighted work represents a derivative work. I argue that at the very least some of it does; so the claim said to be made by the AI companies (see quote above) is clearly a false one.

We're in a weird place now where AI is able to generate "near verbatim" work in a lot of cases, but I don't see an obvious case for treating this any differently than a human reproducing IP with slight modifications. (I am not a lawyer.)

For example, copyright law currently prevents you from selling a T-shirt with the character Spider-Man on it. But plenty of AI models can give you excellent depictions of Spider-Man that you could put on a T-shirt and try to sell. It's quite silly to think that any judge is going to take you seriously when you argue that your model, which was trained on a dataset that included pictures of Spider-Man, and was then asked to output images using "Spider-Man" as a search term, has magically circumvented copyright law.

(I think there's a valid question about whether models represent "derivative work" in the GPL sense specifically, but I'm using the idea more generally here.)


That's right: the model is definitely capable of creating things that are clearly a derivative work of what it was trained on. But this still leaves three questions:

* Does the model require a copyright license? Personally I think it's very likely a derivative work, but that doesn't necessarily mean you need a license. The standard way this works in the US is the four factors of fair use (https://copyright.columbia.edu/basics/fair-use.html) where Factor 1 is strongly in favor of the model being unrestricted while 2-4 are somewhat against (and in some cases 4 is strongly against).

* Is all output from the model a derivative work of all of the input? I think the answer is pretty likely no, but it's unclear.

* Does the model reliably only emit derivative works of specific inputs when the user is trying to get it to do that? Probably no, which makes using one of these models risky.

(Not a lawyer)


This is even slightly more direct: access to WSJ data requires paying LDC for the download, and the pricing varies depending on your institution and the type of license. The cost may be a drop in the bucket compared to compute, but I don't know that these licenses are transferable to the end product. We might be a couple of court cases away from finding out, but I wouldn't want to be inviting one of those cases :)


I think they didn't use WSJ for training, only for evaluation. The paper lists WSJ under "Evaluation datasets".


Are there any AI/ML models that don't use sketchily licensed datasets? Everything seems to be "downloaded from the internet, no license" or more explicitly proprietary. The only exception I can think of would be coqui/DeepSpeech?



