
Training algorithms on copyrighted data not illegal: US Supreme Court - alok-g
https://towardsdatascience.com/the-most-important-supreme-court-decision-for-data-science-and-machine-learning-44cfc1c1bcaf
======
herniatedeel
SCOTUS denied the petition for writ of certiorari, thereby leaving the 2nd
Circuit's ruling in Google's favor intact.

However, the 2nd Circuit's ruling is not binding on any other federal
circuits.

Also, as Enginerrrd stated, the holding is not nearly as broad as the article
makes it out to be.

The holding was:

1\. Google’s unauthorized digitizing of copyright-protected works, creation of
a search functionality, and display of snippets from those works are non-
infringing fair uses. The purpose of the copying is highly transformative, the
public display of text is limited, and the revelations do not provide a
significant market substitute for the protected aspects of the originals.
Google’s commercial nature and profit motivation do not justify denial of fair
use.

2\. Google’s provision of digitized copies to the libraries that supplied the
books, on the understanding that the libraries will use the copies in a manner
consistent with the copyright law, also does not constitute infringement.

Based on the above holding, I think the article's conclusion is a stretch for
general training algorithms using copyrighted data because: (1) there would
not be a library supplying the information to the training algorithm, (2)
there would be no similar display of snippets, and (3) we do not know if a
training algorithm would provide a market substitute for the copyrighted data.

~~~
bluGill
While the decision is only binding in the 2nd circuit, the precedent is
admissible in other courts. If this goes to trial in a different circuit you
can bring the finding to the judge - it won't be binding, but he will consider
it. If it goes to appeal in a different circuit, that circuit will reference
this in their decision - if they decide the 2nd is wrong, they will be very
clear about why when they make their ruling (and this can in turn be
re-submitted to the 2nd circuit, who might change their mind if the reasoning
is good enough). If this goes to the supreme court in the future they will
read this decision and it will influence them - again they can decide either
way.

~~~
herniatedeel
Yes, the 2nd Circuit's decision is persuasive authority for other circuits.
However, that's not what the article claims. The article claims that SCOTUS
ruled when it, in fact, did not.

~~~
dragonwriter
SCOTUS ruled on the cert petition; what people may not understand is that
while that is a ruling, it is (and there is explicit precedent on this point)
not one which has precedential weight (even as persuasive authority) as
regards the merits of the issues addressed in the lower court ruling.

~~~
angry_octet
Isn't it unlikely that a case will be granted cert if appeals courts in
different circuits are in agreement? I.e. not a circuit split? So while not
legally binding, it might be in practice indicative.

I wonder, does this mean I can scrape Instagram/Facebook for photos and use
them for face recognition? Is that 'fair use'? Is an Instagram post a
publication?

~~~
dragonwriter
> Isn't it unlikely that a case will be granted cert if appeals courts in
> different circuits are in agreement?

As I understand it, it's generally viewed to be the case that a circuit split
makes cert. more likely, sure.

> So while not legally binding, it might be in practice indicative.

I guess that it's indicative that, barring change in membership of the court,
cert. would likely be denied in a future case raising the same issue from the
same or a different circuit with the same result.

It definitely should not be seen as indicative of anything on the merits other
than that the members of the court don't see it as obviously and urgently
wrong.

------
jacquesm
Towardsdatascience.com is rapidly rising on my irritation meter. Tons of
submissions to HN of questionable value. This article is worse than most: the
person interpreting the ruling does not appear to have a legal background and
has essentially twisted the ruling to support his foregone conclusion. I'd
love for an actual lawyer (paging Rayiner) to interpret this and to see
whether or not any such far reaching conclusions are supported by the ruling
the article is about. I've read the ruling and I've come to the conclusion
that it makes no such statement but since I'm also not a lawyer you should
assign as much value to that opinion as to the article itself.

~~~
mchen076
Medium hosted articles in general tend to have both low quality and value.
It's a very shiny platform which hosts a massive amount of really poorly
reasoned articles.

------
fyp
We probably need a technical rather than legal solution for this. The problem
is that generative algorithms are susceptible to accidental memorization so
you can't guarantee that the output will be transformative. For example, play
with [https://talktotransformer.com](https://talktotransformer.com) to see how
many well known pieces of text it can spew out verbatim. It is very prone to
derailing into Harry Potter fan fiction regardless of prompt.

Other than copyright there are privacy considerations too. For example Gmail's
Smart Compose is trained on users' private messages so you don't want it to
memorize "private" details (such as credit card numbers):
[https://arxiv.org/abs/1802.08232](https://arxiv.org/abs/1802.08232)

Is it possible to solve this by adversarially checking if the output is
"original" enough or not? Or is that intractable, given how much resource our
society already pours into making the same classification in court?
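
As a rough illustration of the cheapest version of such a check, here is a
minimal sketch that measures how much of a generated output appears verbatim
in a reference corpus, using word n-gram overlap. The function names, the
whitespace tokenization, and the n-gram length are assumptions made for this
example; it only catches literal copying and says nothing about whether an
output is "transformative" in the legal sense.

```python
def ngrams(tokens, n):
    """Return the set of all contiguous n-token windows in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def verbatim_overlap(output, corpus_docs, n=8):
    """Fraction of the output's word n-grams found verbatim in any corpus doc.

    Tokenization is a naive lowercase whitespace split (an assumption for
    this sketch). Returns 0.0 when the output has fewer than n tokens.
    """
    out_grams = ngrams(output.lower().split(), n)
    if not out_grams:
        return 0.0
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc.lower().split(), n)
    return len(out_grams & corpus_grams) / len(out_grams)
```

A score near 1.0 means the output is almost entirely copied from the corpus;
where to set the rejection threshold is a policy choice this sketch does not
settle.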

~~~
tedivm
This point is extremely important, particularly in the healthcare field (which
I happen to be in at the moment). We have to be very, very positive that our
deidentification process is thorough and accurate to prevent a HIPAA violation
from occurring.

~~~
missosoup
> We have to be very, very positive that our deidentification process is
> thorough and accurate to prevent a HIPAA violation from occurring

If your model is any more useful than something trained on coarse aggregates,
then it can be used to reidentify individuals. This is a pretty hard dilemma
in the entire industry, not just health.

I hope my observations are skewed, but instead of trying to seriously address
this issue I've seen an entire legal-loophole style data laundering industry
emerge where highly identifiable information changes hands without it
'technically' changing hands in the legal sense. I'm talking about entities
like datarepublic.

~~~
tedivm
It really is a complicated topic, and one we spend a lot of time thinking
about. We're using a peer reviewed method for removing PHI identifiers, and
are combining that with an approach that involves using the least amount of
data possible to get results. Our models will also never be released to the
public, but instead will have an API where we can see abnormal behavior (such
as sending lots of requests to try and tease out other information) and
intervene.
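
To make the "see abnormal behavior and intervene" idea concrete, here is a
minimal sketch of a per-client sliding-window request counter. It is not our
actual system: the class name, thresholds, and the choice to flag purely on
request volume are all assumptions for illustration, and real probing could
well stay under any fixed rate.

```python
from collections import defaultdict, deque


class QueryMonitor:
    """Flags clients that send more than max_requests within window_seconds."""

    def __init__(self, max_requests=100, window_seconds=60.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # Per-client queue of request timestamps still inside the window.
        self._history = defaultdict(deque)

    def record(self, client_id, timestamp):
        """Record one request; return True if the client should be reviewed."""
        recent = self._history[client_id]
        recent.append(timestamp)
        # Drop timestamps that have aged out of the sliding window.
        while recent and recent[0] <= timestamp - self.window_seconds:
            recent.popleft()
        return len(recent) > self.max_requests
```

A flagged client would then be handed to a human or a stricter policy; the
sketch deliberately leaves that intervention step out.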

~~~
angry_octet
Outside of easily controlled, walled-garden-style access, this is hard to
limit with throttling.

Is there a way to determine how much information about a particular individual
has been leaked out?

------
Enginerrrd
This isn't nearly as broad a precedent as the title sounds.

They used some pretty reasonable tests of copyright infringement to conclude
that no such infringement occurred.

------
prepend
If this legal theory is correct and upheld, possession of data will be pretty
important. So all those data you licensed for one reason or another might be
able to be used to train.

I think the logic makes sense because imagine if humans were prevented from
getting ideas from watching movies. It seems similar to not letting AI watch
every movie ever and learn.

~~~
tzs
It sounds like to get the data into their AI, Google had to make copies. They
digitized the physical books, and then trained using those copies they made.
Also, their system included excerpts from the books which it could retrieve
and show users.

Hence, they had to use fair use to justify it.

I think if you could train the AI without having to make a copy first, such as
having the AI read the physical books directly, or in the case of your movie
example having the AI watch the movie on a TV hooked up to a DVD player
playing a copy of the movie on DVD that you bought from a retailer authorized
by the copyright owner to sell such DVDs, you might not even need to make a
fair use argument.

The definitions section of the US copyright statutes, 17 USC 101 [1], defines
"copies" like this:

> “Copies” are material objects, other than phonorecords, in which a work is
> fixed by any method now known or later developed, and from which the work
> can be perceived, reproduced, or otherwise communicated, either directly or
> with the aid of a machine or device. The term “copies” includes the material
> object, other than a phonorecord, in which the work is first fixed.

and "fixed" is defined like this:

> A work is “fixed” in a tangible medium of expression when its embodiment in
> a copy or phonorecord, by or under the authority of the author, is
> sufficiently permanent or stable to permit it to be perceived, reproduced,
> or otherwise communicated for a period of more than transitory duration.

An AI reading or watching the work as one of many many works in order to learn
weights for a neural net does not result in a material object from which the
work can be perceived, reproduced, or otherwise communicated. Thus, there is
no copy, and hence no copyright issue.

[1]
[https://www.law.cornell.edu/uscode/text/17/101](https://www.law.cornell.edu/uscode/text/17/101)

~~~
jacquesm
> "does not result in a material object from which the work can be perceived"

This does not hold true in all cases. Note that the ruling lists the end goals
as 'fair use' goals and that that seems to have been an important part in the
conclusion reached.

The key thing to strive for in creating derivative works that are deserving of
copyright protection in their own right is that they contain 'substantial
originality'; mere machine transformation does not qualify.

~~~
visarga
> mere machine transformation does not qualify

That minimises the contribution of thousands of researchers in designing the
models and their training regimens.

------
MrStonedOne
From a 2016 refusal to review a case

------
watzon
I mean, why would it be? We train ourselves on copyrighted material all the
time.

------
anonytrary
We should be surprised that nine non-technical people are making technical
decisions that impact 300m people. It seems that legal decisions are
increasingly scope-creeping into domains where domain experts are necessary.
This is even more evident after seeing the Zuck-Congress hearings, where
Congress proved to the people that they aren't really the best people to work
on technological issues.

------
sgjohnson
Misleading title. SCOTUS refused to hear the case. They didn’t rule in
Google’s favour.

------
qwerty456127
Is using pirated or otherwise illegally acquired data to train an algorithm
legal? If yes then why is it illegal to use it for other purposes?

Is it legal to use the data to train an algorithm if the license disallows
that explicitly?

~~~
TallGuyShort
Does copyright law actually make it illegal (as opposed to just frowned-upon,
and maybe hindered by one's ISP) to receive pirated material? I was under the
impression that the illegal part was the distribution of it.

~~~
dragonwriter
> Does copyright law actually make it illegal (as opposed to just frowned-
> upon, and maybe hindered by one's ISP) to receive pirated material?

Receiving, strictly, no. However, with digital material, most use involves
copying. For legitimate copies, that copying is covered by an implied license
for the normal use of the work; for copies which are not themselves
authorized, there is likewise no implied license for use.

Also, _receipt_ of digital copies itself often involves copying directed by
the receiver, which is prohibited, and may even involve a request from the
receiver to the originator to make the copy under circumstances which would be
viewed as knowing that it was unauthorized, which may often, as a solicitation
of an unlawful act, itself be illegal.

------
karussell
Could this be used against Google? E.g. train an algorithm from their road
traffic information (fetched legally via their API or web UI) to improve rough
time estimates based on OpenStreetMap data.

~~~
dragonwriter
> Could this be used against Google? E.g. train an algorithm from their road
> traffic information (fetched legally via their API or web UI) to improve
> rough time estimates based on OpenStreetMap data.

No, because then contractual ToS, not naked copyright law, will be at issue.
Even if Google doesn't have the right terms to forestall this now, it's a
trivial change for them to adopt.

------
breck
Anyone know of any organizations working to repeal Intellectual Monopoly laws?
I know there are organizations that try to counter the influence of the IM
industry like EFF and FSF, but I’m looking for groups that have come out and
said Intellectual Monopoly laws need to go, period.

~~~
bitwize
Pirate Parties worldwide. Sometimes they even get seats in parliament.

------
therealmarv
Does anybody know the situation in the EU for that?

