
The coming IP war over facts derived from books - awinter-py
https://abe-winter.github.io/2020/02/11/books-facts-ip.html
======
PeterisP
There's no IP war coming over facts derived from books because copyright
doesn't cover facts derived from books and the other forms of IP (trademarks,
patents) are even less relevant.

I've done some work with corpus linguistics and quantitative linguistics, and
large parts of these disciplines essentially are about facts derived from
books in some manner. Modern approaches tend to involve machine learning, deep
neural networks and other things fashionable on hackernews, but in general
that's an old, traditional area that was working on facts derived from books
for decades before the "ML era".
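
As a rough illustration of what a "fact derived from a book" can look like in quantitative linguistics, here is a minimal sketch that reduces a text to word-frequency counts. The function name `word_frequencies`, the tokenization rule, and the sample sentence are all assumptions made for this example, not any particular project's pipeline.

```python
from collections import Counter
import re


def word_frequencies(text: str) -> Counter:
    """Count lowercase word tokens in the text.

    The resulting counts are facts about the text; the original
    wording and word order cannot be rebuilt from them.
    """
    # Hypothetical tokenization: runs of ASCII letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


# Made-up sample text standing in for a book.
sample = "The cat sat on the mat. The mat was flat."
freqs = word_frequencies(sample)
print(freqs.most_common(2))
```

Real corpus work uses far larger sources and more careful tokenization, but the output has the same character: aggregate statistics rather than the text itself.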

To work on facts derived from books, we're sourcing all kinds of books and
other written language, such as newspapers. Some publishers and authors are
cooperative and helpful for such research, some are uncooperative and prefer
to intentionally make working on their sources difficult. But in any case,
even when there is disagreement and conflict, there's no "_IP_ war": the
conflict in our case tends to be about practical convenience of access, not
about IP, because they don't really have a leg to stand on in claiming a
copyright violation. They hold the copyright on the original text, which gives
them certain exclusive rights, and there's a bunch of intermediary data that we
can't make available to the public without their permission. But these rights
don't extend to facts derived from that text, and we legally don't need their
permission to work on, analyze, transform, publish and use stuff based on
facts in the text or facts about the text. We can do that openly even if
they've explicitly made it clear that they don't want us to do that. That's
nothing new; that's established law that probably predates modern computers.

~~~
Iv
I found it interesting from a legal point of view when someone pointed out
that the recent "AI dungeon generator" that was using BERT to act like a game
master was in some occurrences basically copying (relevant) excerpts from
books.

Can an AI commit copyright infringement? BERT probably "knows" that Cthulhu is
a giant thing evoking squids, tentacle and non-orthonormic dimensions. These
are facts based on books, but you can produce copyright infringement based on
those facts. It is called "producing a derived work".

In the past years I never managed to get anyone with legal knowledge
interested in what they saw as a totally impossible scenario: the idea that AI
could one day produce original work by learning its craft, like humans do, from
copyrighted works. Their criterion was "if you fed copyrighted work into an
algorithm to produce a new work, then that's a derived work".

Humans are somehow imbued with a magic property that allows them to read
WH40K books, watch Alien and Predator movies, then produce the Starcraft
universe, and have it count as original work.

We do have a philosophico-legal discussion to have there. And way overdue, if
I may. The state of copyright is already late in acknowledging the internet;
DL-generated work will be even more of a conundrum for it.

~~~
Arelius
> Humans are somehow imbued with a magic property that allows them to read
> WH40K books, watch Alien and Predator movies, then produce the Starcraft
> universe, and have it count as original work.

I feel like this is speculation. Do you have any citations? It seems to me
that in this fuzzy area an AI will be judged identically to a human. While you
give an example where the StarCraft universe is created and considered original
work, there are many cases where a human learns their craft from copyrighted
work and creates a derived work; fanfic is a huge genre of examples. I suspect
that in the legal arena the nature and content of the new work will be far
more influential in the status of the copyright than the details of how the
new work was created.

So, while I think it's likely that something generated by an AI that
looks like original work will be considered original work, I think a question
that is less clear, and much more important, is whether the AI itself would be
considered a derived work. In some ways, it can be argued that an AI is a
transformation of the original representation, and that substantial portions
of the original work are/can be maintained within the AI itself, just as a
work can be transformed by a compressor but still be considered to maintain
the copyright. However, afaik this question remains untested.
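
To make the compressor analogy concrete, here is a minimal sketch using Python's standard `zlib` module. The placeholder text is invented for the example. It only shows the lossless case, where the transformed bytes look nothing like the original yet the whole work can be recovered from them; whether a trained model is comparable is exactly the open question.

```python
import zlib

# Invented placeholder standing in for a copyrighted text.
original = ("It was a dark and stormy night. " * 20).encode("utf-8")

# The compressed form is a different byte sequence, and shorter here
# because the placeholder text is highly repetitive.
compressed = zlib.compress(original)

# Decompression restores the original bytes exactly.
restored = zlib.decompress(compressed)

print(len(original), len(compressed), restored == original)
```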

IANAL and all.

Also, I think:

> BERT probably "knows" that Cthulhu is a giant thing evoking squids, tentacle
> and non-orthonormic dimensions.

Is highly debatable.

~~~
CaptArmchair
The underlying presumption for copyright is that the work was created by a
legal concept called "natural person" which ties into the framework of "legal
personhood".

One is a natural person simply by the mere fact of having been born. And
that's what makes all the difference.

This idea is pretty much the basis of large swathes of jurisprudence across the
world, really.

AI, as such, is legally speaking no different from a simple pencil when it
comes to writing a book. It's a tool through which a natural person creates a
creative work thus establishing a copyright on the part of the natural person.

See, what most people fail to see is that copyright isn't tied to the creative
work; it's tied to its creator. Hence why copyright ceases to exist some
arbitrary amount of time (20, 40, 70 years) after the creator - a natural
person - has died.

So, when you say "a neural network acquires copyright by itself when it
generates a new creative work", you are forced to consider whether a neural
network is a "person". Which is a can of worms in itself. (consider animals as
persons - case: monkey selfie)

[https://en.wikipedia.org/wiki/Legal_person](https://en.wikipedia.org/wiki/Legal_person)
[https://en.wikipedia.org/wiki/Natural_person](https://en.wikipedia.org/wiki/Natural_person)

~~~
voxic11
Obviously neural networks created with current technology are not persons in
any sense, legal or otherwise.

I think the more interesting questions are whether the operator of the neural
network is the legal author of its creations and whether such creations
satisfy the creativity requirements for copyright.

I think the operator would be the author of the work, similar to how the
operator of a camera or word processor is the author of works created by those
tools. However I think in some cases the work may not meet the creativity
requirement.

> “[T]he requisite level of creativity is extremely low.” Even a “slight
> amount” of creative expression will suffice. ... An author’s expression does
> not need to “be presented in an innovative or surprising way,” but it
> “cannot be so mechanical or routine as to require no creativity whatsoever.”

[https://www.copyright.gov/comp3/chap300/ch300-copyrightable-...](https://www.copyright.gov/comp3/chap300/ch300-copyrightable-authorship.pdf)

~~~
CaptArmchair
> However I think in some cases the work may not meet the creativity
> requirement.

Exactly. Copyright is very murky in that respect. The basic notion of
"creativity" is willfully vaguely defined to ensure that there's a universal
maxim.

For instance, did you know that databases are copyrightable? Even when their
constituent parts consist of uncopyrightable facts? Copyright law considers
the database as a "collection" or a whole, and so the entire collection can be
seen as a creative work. But copyright only applies to the whole, not the
constituent parts.

[https://www.bitlaw.com/copyright/database.html](https://www.bitlaw.com/copyright/database.html)

Another example: suppose you digitize an ancient piece of pottery by making a
digital photograph. Have you then created a new creative work of art? Some
would argue you did. Why? Because you didn't make a 1:1 copy of the pottery by
creating a new physical pot with similar materials. You created an image using
a particular mechanism, introducing elements such as lighting, color,
contrast,... that may give your image an original element.

The latter example is actually a legal problem for digitization programs of
cultural collections. Institutions hire a photographer to digitize a collection,
but then discover that the images are pretty much unusable because the
photographer is able to enforce their own copyright i.e. demand a licensing
fee every time someone wants to use or download an image. Which implies that
institutions are also forced to add legal provisions in any contracts
pertaining to the transfer of rights.

Hence why copyright law is rife with exceptions and exemptions. For instance,
did you know that any image made by the U.S. Government automatically ends up
in the public domain?

[https://en.wikipedia.org/wiki/Copyright_status_of_works_by_t...](https://en.wikipedia.org/wiki/Copyright_status_of_works_by_the_federal_government_of_the_United_States)

The problem with copyright is that digital technology is innately the act of
creating copies. Each time I send a request over the Internet, I basically
create a copy of the 1's and 0's stored at the other side. The basic tenets of
copyright don't concern themselves with conceptual models and higher
abstractions. They go back to the fact that a string of 1's and 0's was
created on a physical carrier and then copied over to another carrier.

But that's not how humans work: we don't really apply notions of copyright to
the physical representation on a disk, we apply them to the ephemeral,
assembled representation on our screens and displays. This mismatch is what
creates a ton of tension in this space.

------
Nasrudith
That complaint about books and stealing personally strikes me as deeply silly
even by permission culture standards. The whole point of books is to learn
from them. Proper summarization already separates plagiarism from original
content (even if it is preferable to provide citations). It doesn't matter
how it is derived - either the end product is fair use or it is effectively
unauthorized publishing from including too much source content.

We should be rejoicing at the ability to have an assistant that digests the
world's libraries not worrying that someone might make a profit off of it
without permission.

~~~
miker64
But we won't have an assistant that digested the world's libraries. We'll have
an advertising company gatekeeping the digitally digested world's libraries.

I think that's worth worrying about. As well, if Google, in their drive to
monetize content that they don't own, causes the various publishers and IP
owners to go on the legal attack, any other option/startup will be quickly
dissuaded from building a similar, or better, assistant.

~~~
philipkglass
The HathiTrust is a partnership of the academic libraries involved in Google
Books and other digitization efforts. They offer the Google-originated scans
free to the public for works that are already in the public domain, and allow
university members and research partners to access scans of books that are
still under copyright.

[https://www.hathitrust.org/partnership](https://www.hathitrust.org/partnership)

They're not going to compete with Google in software development, but Google
isn't the sole gatekeeper of the book scans.

------
ImaCake
I particularly like the author's hierarchy of information value. It applies to
what individuals should be reading too. But I would probably separate
blogs/articles into "clickbait" and "serious" and put the latter category
equal with books. It's important to be very selective with your internet
resources: most of the internet regurgitates information in a continuous and
boring cycle, while select corners push novel content and engaging ideas.

What does the author mean by "CRS"? Coordinate Reference Systems?

~~~
milesvp
I'd tend to agree with this categorizing certain free blogs/web content above
paywalled newspaper content. I can't really remember the last time information
available from a news organization actually allowed me to change my behavior
such that it helped me achieve any of my long term goals, or even modified my
long term goals for that matter. Impacts of things like extreme weather and
economic trends, and recently disease trends are about the only thing news is
ok for, and even then, I find good comment threads give me a better general
sense for the actual severity of things to come.

~~~
ImaCake
For weather tracking you could try windy.com or your local weather service
forecast. Both are more engaging and beneficial than other ways of getting
weather news!

------
huac
I can absolutely assure you that FB and G's ad targeting algorithms are more
complex and significantly more performant than "cluster and sort by click rate
per keyword." For one, you rank ads on an eCPM basis, which includes bid, but
leaving that aside, trust me, just because you don't know how ads systems
work, that doesn't mean they're not working.
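
For readers unfamiliar with eCPM ranking, a minimal sketch follows. It assumes the common textbook form for cost-per-click ads, eCPM = bid per click × predicted click-through rate × 1000; the `Ad` class, the ad names, and all numbers are hypothetical and do not describe how any real ad system is implemented.

```python
from dataclasses import dataclass


@dataclass
class Ad:
    name: str
    bid_per_click: float  # advertiser's bid, in dollars per click
    predicted_ctr: float  # model-estimated probability of a click per impression


def ecpm(ad: Ad) -> float:
    """Expected revenue per 1000 impressions for a cost-per-click ad."""
    return ad.bid_per_click * ad.predicted_ctr * 1000


# Hypothetical candidates: one bids high but is rarely clicked,
# the other bids low but is clicked often.
ads = [
    Ad("high_bid_low_ctr", bid_per_click=2.00, predicted_ctr=0.005),
    Ad("low_bid_high_ctr", bid_per_click=0.50, predicted_ctr=0.04),
]

# Rank by eCPM, highest first.
ranked = sorted(ads, key=ecpm, reverse=True)
for ad in ranked:
    print(ad.name, round(ecpm(ad), 2))
```

In this sketch the lower bid ranks first because its predicted click rate more than makes up for the smaller bid, which is why sorting by click rate alone, or by bid alone, gives a different order.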

Here's a clear example of ML improving search: voice search. You might not use
it, but it's extremely popular in India and other developing markets. "G
search has gotten worse as they’ve focused on recency in the index, gotten
more tolerant of synonyms, and gotten less strict about quoted phrases." None
of these are "machine learning" - these are product decisions. If you wanted
to say "Google is not an ML company," you'd point to the outsized human
influence on search rankings (see, e.g.
[https://static.googleusercontent.com/media/guidelines.raterh...](https://static.googleusercontent.com/media/guidelines.raterhub.com/en//searchqualityevaluatorguidelines.pdf)).

Google Maps is extremely valuable as a proprietary dataset, and we're all
making it better whenever we do a captcha, doing object recognition from
streetview. So are YouTube, News, Translate, and so many others.

There are so many papers detailing practical metric improvements from ML:
[https://arxiv.org/abs/1810.09591](https://arxiv.org/abs/1810.09591), is one
("Replacing the manual scoring function with a gradient boosted decision tree
(GBDT) model gave one of the largest step improvements in homes bookings in
Airbnb’s history, with many successful iterations to follow" and deeper neural
nets offered significant improvements after that).

~~~
awinter-py
this airbnb paper is great, thanks for posting this

------
heartbeats
> This will do to non-fiction books what youtube did to music: drive down the
> price in ways that makes distribution only economical for low-margin
> platforms. It could give G a monopoly on the market and create a
> disincentive for production of new knowledge.

This is just tiresome. Wasn't piracy supposed to doom us all?

It's regrettable to see Google gaining more power, but the copyright cartel
doesn't have a solid moral standing from which to complain.

~~~
robtherobber
My thought exactly. The alleged damage that piracy should bring about never
actually happened, on the contrary, if we are to consider a number of studies.

Then you have simplistic and ridiculous statements like

> When you free something that belongs to someone it’s called stealing.

As if anything that was ever invented, written, done or created by a human
being was done so in complete isolation and not based on what others have done
before.

------
egypturnash
The opening bits about how “monetization of machine learning has been five
years away for several years” remind me of how fusion has been twenty years
away since at least the mid-fifties (so about 65 years). And now I am
wondering if anyone has ever looked at the track records on people saying
“technology X will be usable in Y years” versus the actual amount of time
things took to become usable, if ever.

~~~
Multicomp
Reminds me of the Popular Mechanics magazines promising new battery
technologies any day now. Similarly, battery technologies that show great
promise in the labs of 2015/16/17/18/19 are still 'soon on the market, any
minute now'.

~~~
K0SM0S
That's entertainment though; ever since I was a kid (child of the 1980's) I've
never heard a single physicist confirm that 'new' battery tech, operating
through different first principles, was anywhere in sight for commercial use.

All we do is refine the materials and building and designs based on the same
core principles (this optimization has yielded decent but not paradigm-
changing results, unlike what a "new S-curve" would entail). The perceived
increase in "lasting power" of our portable devices compared to 40 years ago
was largely due to Moore's Law, optimization on the consumption side of the
energy equation, not the source.

~~~
philipkglass
The first commercial lithium ion battery dates to 1991:

[https://en.wikipedia.org/wiki/Lithium-ion_battery#Commercial...](https://en.wikipedia.org/wiki/Lithium-ion_battery#Commercialization_and_later_history)

The lithium ion battery has incrementally improved energy density every year
since its commercialization (see Figure 4):

[https://www.intechopen.com/books/ict-energy-concepts-towards...](https://www.intechopen.com/books/ict-energy-concepts-towards-zero-power-information-and-communication-technology/energy-storage-battery-materials-and-architectures-at-the-nanoscale)

Most breakthroughs or projected-breakthroughs that get Popular Science
articles written about them are overstated or never materialize at all. But
battery technology is improving over time. It's not improving at the pace of
microelectronics, but hardly anything improves that quickly.

~~~
K0SM0S
OK, so my perception may be wrong indeed.

I'm really nitpicking the theoretical aspect here, I guess. Is lithium-ion
really different from a first principle perspective? Is any battery technology
not based on electrolyte principles?

See, when I look at a vapor engine, and compare it with an electrical engine,
I really do have two different first principles driving motion, two different
conversions of energy. When I look at a regular / convection oven (heated
resistors) and compare it with a microwave oven, again two fundamentally
different ways of heating a solid. Magnetic induction compared with
thermodynamic heat conduction. X-ray compared with MRI. All of these are
breakthroughs, using different first principles to complete the task.

I fail to see how battery technology is not all based on one and the same
fundamental principle. Quoting your second link:

> “_Batteries are electrochemical devices that store electrical energy by
> directly converting it to a chemical form._”

I'm not sure we were talking about the same thing, because I read your comment
and it seems to fit my view, and actually substantiates it. Am I misunderstanding
these concepts?

Edit: FWIW, I was a heavy daily user of portable music devices in the 1990s, and
while new battery tech gave you an extra hour or so every iteration, none of
it was life changing. It was a slow increment, not orders of magnitude, which is
a sign that we're operating on the same principles, just with more efficiency.
My point was that current battery life is fantastically aided by improvements
on the consumption side, much more than on the source side. I'm not claiming
there's none in the latter, not at all.

Edit 2: TL;DR: I believe there is no fundamentally new physics in "battery"
(storing energy), it's been the same thing for centuries (and reportedly was
invented but not used in Ancient times). Unlike many other technologies like
engines, ovens, body imaging, etc. Please don't hesitate to teach me more.

~~~
philipkglass
Batteries are always based on chemistry. But since you said that portable
device batteries today are little-changed from 40 years ago, I wanted to point
out that the lithium ion battery is newer than that, and still improving.

That's why cordless saws and leaf blowers are practical now but weren't
practical 40 years ago. Better batteries made them work. They didn't benefit
from Moore's Law.

~~~
K0SM0S
Ah, fair point about these tools. I think I understand better what you meant.
Point very well taken, and thanks for the informed perspective.

------
michaelmrose
This is broken thinking.

> If I’d published a non-fiction book in the last 100 years I’d put $10 right
> now into a class action to prevent this product from hitting the market.

Authors are 1/100th of a percent of the population. If we do create new ways
for people whose entire life and minds are derivative of millennia of
civilization to own facts observed in the world around them, the primary
funder and beneficiary of such a change would be an even smaller class of
people who collect much more of the sweat of the author's brow than the author
ever will. The proper response to this is voting any bums who vote for this
out of office. If this doesn't work the next step is the guillotine.

> Google will force us to create a new format for information by removing the
> profitability from the existing one.

The fact that actual scarcity is giving way to plenty in no way suggests that
we ought to fight to impose artificial scarcity for the dubious privilege of
ensuring that leeches can keep profiting in order to keep a minority of the
money filtering down to the people who do the actual work. Perhaps we ought to
discover a way for everyone to profitably enjoy the greater bounty instead of
glorifying working for a living.

> It’s not like they didn’t tell us they were doing this. Their mission
> statement was to ‘free the world’s information’. Small wonder they don’t
> understand privacy. In this case we’re talking about information that’s
> protected by IP rights. When you free something that belongs to someone it’s
> called stealing.

Our inherent emotional reaction to real scarcity based on the rivalrous nature
of physical goods is a poor foundation to build a case for inventing new
rights designed to divvy up the world for the benefit of the rich. I'm sick
unto death of hearing proponents of new and inventive varieties of imaginary
property describing circumvention of their imaginary rights as "stealing". There
are no words in keeping with the dignity of this site that I could use to
aptly describe my feelings for the author's words. People like him are
emphatically the enemies of the people.

~~~
nnq
EXACTLY! The article is infected with a really vile and evil way of thinking
that is almost the equivalent of the "but think of the children" argument for
justifying privacy invasion and censorship, but in the field of IP...

------
tristanho
The system described by the author of this post actually already exists, and
was indeed created by google:

[https://books.google.com/talktobooks/](https://books.google.com/talktobooks/)

It's just really not that good (yet)...

------
changoplatanero
Pretty sure that ML is adding billions of dollars of value in ad ranking

------
jtbayly
The author is confused. The thing he's talking about is actually Wikipedia.
Facts, mostly derived from books, is exactly how he describes Wikipedia. And
indeed, it _was_ transformative for education, and it did cause us to question
parts of education.

------
aaron695
No

There's lots wrong with this article

> Google has only one valuable proprietary dataset

They own none of the map/street view data?

Google already answers questions using books.

And if you too want the world's book repository you can just download it, just
illegally. There are big megapacks of it. Way better data than Google Books.
The only thing Google Books might beat you on is original documents for
history.

But books are almost dead, maybe a decade?

ML with human assisted help will be able to pop out quality books quite
easily. It'll still take a human, just it'll do in a month what took years.

~~~
andrey_utkin
> But books are almost dead, maybe a decade?

Paper book format maybe is.

"Long form content" is what "book" means now and it is not going away.

------
fergie
This article makes no sense. Quality non-fiction books have always cited other
quality non-fiction books and this is a Very Good Thing.

~~~
michaelt
Maps are collections of facts about road locations, and if I write directions
based on a map, doing so doesn't infringe on the map producer's copyright.

But if I'm starting a map company, and I scan in and trace the roads in my
competitors' maps? I'd say that's less clear cut - and may well be copyright
infringement, even though I'm extracting facts from their publication and
creating a new publication containing the same facts.

If I use an entire copyrighted book to train an AI, is it more like the first
example, or more like the second?

~~~
kragen
Maps and charts have been statutorily subject to copyright in the US since the
Constitution was ratified. Facts, on the contrary, have been protected from
copyright.

------
ggm
I always liked that many of the ancients' words are only known because somebody
writes _"Plato tells us that Socrates said..."_, which is, in context, pretty
much what monetising the actual semantic intent of those scanned books would
be.

------
bambax
Where do "facts derived from books" come from? Archives and original research.
Wouldn't it make more sense, and be economically (and legally) more defensible
to index those primary sources, than books?

------
mdale
It's odd that the post mentions Wikipedia, as it is a counterargument to the
central premise.

I.e., anyone contesting book fact aggregation would already have had to sue and
win against Wikipedia, and Britannica before it.

------
looper_dude
If the cost of books went down I'd happily start buying them again. Until
then, I can't justify paying $150 for non-fiction that will be outdated in 2
years.

------
jbj
The author makes a side note about getting access to gmail. Aren't there at
least a handful of third party services that require access to content of
gmail?

------
sanxiyn
People already mentioned maps, but isn't search history Google's most valuable
proprietary dataset?

------
anonsivalley652
IANAL, but this seems like a good video response topic for a copyright
attorney such as Leonard French.

------
buboard
I mean, it's not impossible for a competitor to re-do the archiving of all
books. Doesn't Amazon own all the books?

------
imvetri
Good job!

