
Torching the Modern-Day Library of Alexandria: The Tragedy of Google Books - jsomers
https://www.theatlantic.com/technology/archive/2017/04/the-tragedy-of-google-books/523320/#?single_page=true
======
gridit
I have been going down the rabbit hole of copyright, fair use, and the Google
Books Settlement recently. This article is a great summary including a lot of
the peripheral issues, but the "2003 law review article" linked in TFA is nigh
unreadable to me, compared to the actual legal opinions and briefs[0].

They are a couple of fascinating documents. The Authors Guild seems gobsmacked
by the final ruling, and so am I. Perhaps the SCOTUS was correct to turn down
hearing the case, if only to let the issue settle a little more, but it really
feels like it's likely to be overturned in the near future.

There are some interesting tidbits in the opinions: 1) In the definitive
ruling, the judge decides that the harm done to the market for the books is
negligible, or overcome by the transformative "purpose" of the usage
("purpose" is significant because most examples of fair use include some type
of new creative "expression"). This is surprising to me. 2) Google Books is
ruled fair use in part because the book descriptions (and snippets?) are
metadata _describing_ the books, information that should not be controlled by
the authors.

[0] [http://www.scotusblog.com/case-files/cases/authors-guild-v-g...](http://www.scotusblog.com/case-files/cases/authors-guild-v-google-inc/)

~~~
yohui
The final ruling in _Authors Guild v. Google_ was really just a footnote to
the whole saga, though. The article barely mentions it.

The article focuses on the failure of the class action settlement, due to the
"perfect being the enemy of the good" (librarians and individual authors
objected to the settlement because they hoped Congress would pass a law to
free orphan works, but what actually happened is that no progress has been
made).

~~~
ghaff
The battle lines around orphan works are interesting because they don't really
follow the same contours as do a lot of the other disagreements about
copyright law. From what I've seen, the main opponents of freeing orphan works
are individual content creators and the organizations that purport to
represent them like ASMP.

The fear I gather is that large content users won't make much of an effort to
contact rights holders and will use orphan works legislation to just take it
for free.

~~~
diggernet
And this is one reason why I believe that copyright should require a
minimal-fee registration every ten years. If you keep your registration current, there
is no effort required to contact you. If you can't be bothered to do that,
your copyright clearly isn't worth much to you and expires. Either way, the
status of the work is unambiguous.

~~~
ghaff
In the case of something like a photograph, that means a minimal-fee
registration on _each_ photograph every 10 years. This is also exactly the
sort of effort that opponents of orphan works legislation feel that large
content corporations will take advantage of when all the little guys forget to
renew.

I'm actually mostly for orphan works legislation but I understand the
perspective of the opponents.

~~~
cryptarch
Wouldn't it be easy to have a provision for bulk registration?

Like, "renew the photographs with SHAs .....", and then provide a simple
tool to list the SHAs of all files with a given extension in a directory?

One request with 50,000 photographs?
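
A minimal sketch of the kind of tool described above, assuming the works are digital files and that SHA-256 is the hash meant by "SHA" (the comment does not say which). The function names `sha256_of_file` and `build_manifest` are made up for illustration, and no real registration service or submission format is implied.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(directory: str, extension: str) -> list[tuple[str, str]]:
    """List (digest, relative path) for every file with the given extension.

    The extension match is case-insensitive and subdirectories are included.
    """
    root = Path(directory)
    suffix = extension if extension.startswith(".") else "." + extension
    entries = []
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() == suffix.lower():
            entries.append((sha256_of_file(path), str(path.relative_to(root))))
    return entries


# Hypothetical usage: print one "digest  filename" line per photograph.
# for digest, name in build_manifest("/path/to/photos", "jpg"):
#     print(f"{digest}  {name}")
```

The resulting list is what a single bulk renewal request could carry. It only identifies exact byte-for-byte copies, so a resized or re-encoded photograph would hash differently.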

~~~
frooxie
How about if I draw 7000 sketches or drawings in a year? How would that be
"bulk registered"?

------
ghaff
One of the interesting tidbits in the article is the discussion about the
length of copyright terms. The common wisdom is that the current (too long
IMO) terms are the result of lobbying by Disney and other media companies.

The article goes into how, in fact, this really came out of Europe and a
fundamentally different perspective on the purpose of copyright than the US
Constitution. Wikipedia also has what seems to be a pretty good discussion.[1]

So when people say that current copyright law goes way beyond "promote the
progress of science and useful arts" they're absolutely right. But copyright
law in continental Europe was much more focused on protecting the rights of
authors.

[1]
[https://en.wikipedia.org/wiki/History_of_copyright_law](https://en.wikipedia.org/wiki/History_of_copyright_law)

~~~
pmoriarty
With conservative Supreme Court justices loudly and proudly rejecting non-
American sources of legal standards, I wonder how European perspectives came
to dominate the US legal system on copyright issues but are ignored on human
rights, labor law, and the environment?

~~~
ghaff
This wasn't about legal precedent though, so very little if anything to do
with SCOTUS. This was about aligning worldwide copyright under the Berne
Convention.

I'm not familiar with the detailed history but it's pretty easy to imagine
that aligning on a longer term would be much easier than on a shorter one.
After all, even in the US, on the one side you have plenty of interests in
favor of longer terms even if there are somewhat abstract constitutional
principles that favor a shorter term.

~~~
toyg
It's just a play: one side says they need to align to the other, so they pass
law to match the other's limits... and raise them a bit. The other side
notices and lobbies for alignment, which again will go a little bit further.
Rinse and repeat.

Deflecting criticism by claiming another party forced your hand is one of the
oldest tricks in the book, but it still works extremely well. See also: EU
directives, which are requested and agreed upon by all EU governments, just to
be immediately turned into "tyrannical rules from Brussels" the minute they
have to be applied.

------
jkn
It's too bad the vision depicted at the beginning of the article (full texts
potentially available in all libraries), didn't come true. But I feel that the
public did get the most important benefit from the project: the ability to
search these books. I've been researching a history of science subject
recently and it's amazing the amount of information I could get from Google
Books and nowhere else online. And where the snippets are not enough, I have
the book title and author name, so I know where to look for the information in
print.

~~~
greglindahl
The public benefit of searching the books isn't fully realized, alas, until
more than just Google can see all of the text. Here's an example of a book
discovery tool built using the Internet Archive's scanned book collections:

[https://blog.archive.org/2016/02/09/how-will-we-explore-book...](https://blog.archive.org/2016/02/09/how-will-we-explore-books-in-the-21st-century/)

[https://books.archivelab.org/dateviz/?q=Gregorian+Calendar&y...](https://books.archivelab.org/dateviz/?q=Gregorian+Calendar&y=1582)

------
hackuser
Imagine an intellectually curious but poor high schooler: They can't afford to
buy journal articles and books; they have almost no option to access serious,
quality information. How much potential is lost to this travesty?

We've fallen far, far short of the potential and dream of the Internet and the
democratization of knowledge, and the state of things has become a norm; few
even notice it or realize what they are missing.

The truly valuable knowledge, to a great extent, still is inaccessible to the
vast majority of the world. It is in books and academic journals. As a simple
example beyond Google Books, I was thinking the other day that Safari Books by
itself contains much more valuable knowledge (and far less misinformation) on
many technical issues than the rest of the Internet; I learn more about some
topics in a few hours on Safari Books than in a year on the Internet.

Technically, books and journals easily could be made universally accessible,
creating an explosion of knowledge and all the things knowledge enables and
motivates - the Enlightenment, science, technology, democracy, liberty,
prosperity, most of modern civilization, etc. Instead of being well-informed,
most of humanity is left with the dregs, and instead of the Internet providing
an explosion of knowledge it has created a plague of misinformation and
propaganda. IMHO the lack of high quality knowledge also robs the public of
the ability to discriminate between good and bad information: Most lack a
model of what quality knowledge is, of even the questions to ask (something
encountered frequently in serious scholarship). Few even realize the vast gulf
between the quality of generally available information and what is in the
books and journals. (I'll add that the demise of bookstores means few even see
or are aware that the books exist.) And even if they know, it's inaccessible.

Instead of embracing a technological revolution in the distribution of
information - a turning point in the history of humanity - we have brought
forward the model used for the old technology, with distribution as controlled
and limited as the old medium of paper. For the most part, it seems like the
same few people have the quality information, the professional scholars. Let's
not forget and give up; it's too important.

~~~
tradersam
> I'll add that the demise of bookstores means few even see or are aware that
> the books exist.

According to this data[0] (take it with a grain of salt), there were more than
20,000 bookstores in the U.S. as of 2017.

Pretty much every job gives you a book (policy manual) when you get hired.

Schools, even the most technologically advanced, still have plenty of physical
books.

So although I agree with most of your comment, I'd venture to say most of
humanity knows physical books exist.

[0]: [https://www.statista.com/statistics/249027/number-of-booksto...](https://www.statista.com/statistics/249027/number-of-bookstores-in-the-us/)

~~~
hackuser
I didn't mean that people aren't aware that such things as books exist,
which of course would be absurd.

I meant that people aren't aware of the serious and scholarly books that exist
because they don't experience the serendipity[0] of seeing them in particular,
or en masse, in the bookstore.

[0] IIRC, serendipity is actually part of the design of library arrangement
systems (Library of Congress, Dewey, etc.): Books are arranged so that you
will happen across related information when you look for the book you came
for.

~~~
sedachv
This point cannot be overstated right now. Recommendation systems like
Amazon's are terrible for book discovery compared to looking at the rest of
the shelf when you are getting a book at a library. There is a popular
sentiment that the Internet has made knowledge discovery easier than going to
the library, which in my experience is absolutely misleading, and is causing
people to wrongly believe they have done research on a subject when in fact
they have completely overlooked a giant corpus of published material.

------
userbinator
_In August 2010, Google put out a blog post announcing that there were
129,864,880 books in the world._

That number actually sounds surprisingly low. In contrast, I wonder how many
the underground "bookz" scene have scanned so far. It's hard to find exact
numbers, but from what I could find, LibGen contains approximately 3M books,
so if Google is accurate, that's ~2.3% of all books ever published. No doubt
there are other sites I'm unaware of, probably in other languages, which have
also accumulated massive collections of ebooks; but the fact that there exist
people who have, for free and on their own time and at risk of being sued for
copyright infringement, voluntarily scanned and shared over 2.3% of all the
world's books is somewhat amazing.

~~~
ssivark
The number of "books" being published is growing exponentially. So, a
significant fraction of all books ever would have been published in the recent
past (which also means that they would probably be "electronic natives" that
don't need scanning and digitization). I imagine that the uptick in
self-publication opportunity is already (or will become) an important factor in
that growth. For these reasons, I don't find it shocking that some online
repository has a few percent of all books ever written.

PS: I couldn't find numbers regarding my statements. It would be great if
someone can provide sources buttressing or refuting my claims.

~~~
cooper12
The problem with self-publication is that your book might as well not exist[0]
unless you already have a significant following (in which case you could
easily secure a publisher). This isn't an issue unique to the digital age,
there's a term specifically for publishers who will just publish anything:
vanity presses (though the catch is that _you_ have to pay _them_ usually,
hence the "vanity" part). Publishers provide their name, marketing, and get
your books into stores which is pretty important. (I write Wikipedia articles
on manga and you can easily tell the North American publishers apart by how
hard they get the manga scene to write about and review—via advance
copies—their books, especially with regards to press releases and staff
interviews. If a manga didn't have this, it became really hard to justify a
Wikipedia article for it on the basis of Notability) Also I bet for statistics
purposes, only books registered with the Library of Congress are counted (how
else would you find all those self-published works?).

There are services like Lulu for free self-publication,[1] but they don't
carry the same "legitimacy" or reach of publishers. I think the best analogue
to online self-publishing would be the zine: they were easy to reproduce (via
photocopying) and distribute (at events or via post). However, none of them
really had a long-lasting legacy[2] and anything successful eventually
legitimizes itself as a periodical or magazine and becomes established. Thus,
I think self-publishing doesn't really change much for the individual, but
makes it much easier for groups[3] to gain traction. Overall, being self-
published on the internet just increases your accessibility, but we should be
careful confusing it with traditional publishing or counting it in statistics
because a lot of it is just noise. (bringing up Wikipedia again, there are
tons of "books" on Google Books that are actually just random compilations of
Wikipedia articles)

Anyway just some of my random thoughts, hope I didn't digress too much.

[0]: Overused thought experiment: "If a tree falls in a forest and no one is
around to hear it, does it make a sound?"

[1]: [https://www.lulu.com/](https://www.lulu.com/)

[2]: An exception like the _Phrack_ ezine might be of interest to the HN
crowd.
([https://en.wikipedia.org/wiki/Phrack](https://en.wikipedia.org/wiki/Phrack))

[3]: Here's where the print-web distinction breaks down. A _ton_ of blogs and
amateur news websites have evolved and come to be taken seriously. Just because
they're not published in a book format doesn't mean they're distinct, in my
opinion.

~~~
egypturnash
Some people are all "Self-publishing is great! I get half of the price of
every book I sell."

After doing it for a while I would gladly take a much lower percentage of each
sale in return for much huger sales, and having people to deal with printing,
distribution, publicity, advertising, and all the other parts of the process
that aren't "me drawing the next page of comics". And I'm lucky to be working
in comics, rather than words, where there's a significant tradition of
"underground" publishing that's become legitimized into "small-press" and
"independent" publishing, rather than an epithet like "vanity press".

------
zmmmmm
> Many of the objectors indeed thought that there would be some other way to
> get to the same outcome

I really feel like Google is a victim of their own engineering brilliance
sometimes: the objectors really thought that because Google made this look
easy, that it _was_ easy. They figured if one company could just casually
decide to do this, that they could reliably expect that someone else or maybe
government or another legal avenue would come along. The reality, of course, is
that Google is special; nobody will do it now and even Google is losing its
"specialness".

And further, because Google appeared to be doing it so easily, they all
thought that Google profiting from it in some way was unfair. They didn't see
it as reasonable that Google should be rewarded for the genuine investment of
labor and intellectual property involved in pulling this off, precisely
because Google didn't give the appearance that it was hard. If Google had
given more of an appearance of struggling to achieve it - I'd bet the authors
would have suddenly appreciated what Google was doing more and probably
accepted the idea that it was fair for Google to profit from it in some way.

~~~
devoply
As someone who has looked into this quite a bit, I'd say that in 2017 it's not
difficult to do what Google Books had been doing. The reason is that various
groups have ripped and converted thousands of books to PDF. It's trivial to
facilitate search on these and other cool stuff at this point. Someone could do
it if they really wanted to without too much effort. They most likely haven't
because there isn't much profit in it as it stands, and because of the legal
hurdles since Google failed.

~~~
jfim
The article does mention that one of the big issues Google encountered was
logistics.

It's one thing to scan a few dozen or a few hundred books. It's a completely
different thing to do it for _all_ books. Assuming you'd want to digitize 100
million books in three years, you'd need to process 91324 books per day, or
roughly one book per second, assuming no breaks and 24x7x365 operation.

As the article said, Google poured hundreds of millions of dollars into this, so
I'd wager it's not as trivial as it sounds.
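
The arithmetic above can be reproduced with a short sketch, under the same assumptions the comment states (100 million books, three years, no breaks). The function name `required_throughput` is made up for illustration.

```python
def required_throughput(total_books: int, years: float) -> tuple[float, float]:
    """Return (books per day, books per second) for round-the-clock scanning."""
    days = years * 365                      # ignores leap days
    per_day = total_books / days
    per_second = per_day / (24 * 60 * 60)   # 86,400 seconds in a day
    return per_day, per_second


per_day, per_second = required_throughput(100_000_000, 3)
print(f"{per_day:,.0f} books per day, {per_second:.2f} books per second")
# prints "91,324 books per day, 1.06 books per second"
```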

~~~
CydeWeys
Thank you. This is not "re-create /r/place in a weekend" territory. This is
physical hardware development and a substantial logistical challenge.

~~~
fefesafaea
Found a video of one of their prototype scanners. IIRC they looked at like
every scanning solution available and also got a bunch of universities and
libraries to help them purchase and operate scanning equipment. Pretty cool
stuff.

[https://www.youtube.com/watch?v=YhL7qJYzcd4](https://www.youtube.com/watch?v=YhL7qJYzcd4)

edit: Next vid looks good too. In depth on different scanners.

[https://www.youtube.com/watch?v=4JuoOaL11bw](https://www.youtube.com/watch?v=4JuoOaL11bw)

------
pmoriarty
_"...here we’ve done the work to make it real and we were about to give it to
the world and now, instead, it’s 50 or 60 petabytes on disk, and the only
people who can see it are half a dozen engineers on the project who happen to
have access because they’re the ones responsible for locking it up._

_"I asked someone who used to have that job, what would it take to make the
books viewable in full to everybody? I wanted to know how hard it would have
been to unlock them. What’s standing between us and a digital public library
of 25 million volumes?_

_"You’d get in a lot of trouble, they said, but all you’d have to do, more
or less, is write a single database query. You’d flip some access control bits
from off to on. It might take a few minutes for the command to propagate."_

Now this would be an interesting leak to Wikileaks.

~~~
userbinator
That just makes them viewable, and being still hosted on Google's servers, it
would probably be closed in a few minutes too as people start downloading
everything they can find. Something similar might've happened to Springer a
while ago:
[https://news.ycombinator.com/item?id=10810271](https://news.ycombinator.com/item?id=10810271)

Leaking 25M books' worth of files, however, is going to be far more difficult.
It would have to be a very carefully coordinated effort both on the "inside"
and "outside"; one person doing a Snowden won't have any effect.

------
timonovici
Couldn't they have proposed a neutral party that would store and manage all
the books? Just like the Books Rights Registry was going to handle most of the
money. I suppose that Google didn't expect all that backlash. And, now that I
think about it, that was out of the scope of the lawsuit as well... In the words
of an American president: "Sad!"

------
tim333
Maybe Google could get around some of the issues by spinning the thing off as
a non profit? It could always owe Google a few million for what they'd spent
so far.

~~~
fourthark
A few hundred million.

------
cryptarch
So it's between 50 and 60 petabytes of data?

I've been wondering how it would be possible for a disparate group of
tech-oriented people to make a collection like that. It would take about 10,000
people with 6 terabytes of storage each, which doesn't sound impossible to me.
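
A quick check of the storage arithmetic, assuming decimal units (1 PB = 1,000 TB). At 6 TB per person, 1,000 contributors hold 6 PB in total, so a 50 to 60 PB collection needs roughly 8,300 to 10,000 contributors before any redundancy. The `replicas` parameter is a hypothetical addition to show how keeping several copies scales the count.

```python
import math

TB_PER_PB = 1000  # decimal units; an assumption, the thread does not specify


def contributors_needed(collection_pb: float, per_person_tb: float,
                        replicas: int = 1) -> int:
    """Number of people needed to hold `replicas` full copies of the collection."""
    total_tb = collection_pb * TB_PER_PB * replicas
    return math.ceil(total_tb / per_person_tb)


print(contributors_needed(50, 6))               # 8334
print(contributors_needed(60, 6))               # 10000
print(contributors_needed(60, 6, replicas=3))   # 30000
```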

The main issues I see are:

a) How to share access to the data without exposing yourself?

b) How to make the data discoverable and searchable?

c) How do you ensure survival of the data?

and optionally: d) How to deal with the freeloader problem?

~~~
denimnerd
If we use the private torrent site scene as a model all of those things are
pretty much solved.

These regulatory agencies go around and play whack a mole on them but they
tend to live for a long time and have vast archives when they become mature.

See the history of the late what.cd for a rundown of what once existed for
music. I think that cheap streaming services have kind of killed the peak
potential of the music version of these sites though. It's kind of sad,
because what.cd had every single release of every single song catalogued.
Streaming sites will only give you one or a few.

~~~
cryptarch
Well... I'd say it isn't solved. What.CD went down. Fuck, that still hurts.

People now speak of Google Books as a library of Alexandria, but What.CD was
the real thing. Google Books was barely available to anyone, ever.

That shouldn't be possible the next time. How though? Distributed metadata
curation is a problem we haven't worked out well. I know I haven't.

Especially if the metadata is stored on and for data stored on a diverse set
of platforms, like "not only BT", but also Freenet, HTTP, FTP, IPFS... It just
doesn't exist.

~~~
denimnerd
I guess by separating the metadata from the content in such a way that the
index can't be targeted for copyright violation.

~~~
cryptarch
The thing is, you have to index the pieces of metadata or otherwise make them
discoverable, and then decide which parts you trust.

This would need some kind of signing/trust-distribution scheme, something like
namespaces ("YIFY can only approve movie and show releases because they only
do movie and show releases").

It would also need a way to blacklist malicious metadata (automatic scanners
that publish lists of files with viruses?).
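
A rough sketch of the namespace idea, not a real design: each signer is trusted only for certain categories, and flagged content is rejected regardless of who signed it. Everything here is hypothetical (the registry, the demo key, the function names). HMAC with a shared secret stands in for the public-key signatures a real scheme would need, only because it is available in the Python standard library.

```python
import hashlib
import hmac

# Hypothetical registry: which signer may vouch for which categories.
# A real system would store public keys here, not shared secrets.
TRUSTED_SIGNERS = {
    "YIFY": {"key": b"demo-key-yify", "namespaces": {"movie", "show"}},
}

# Content hashes flagged as malicious, e.g. by an automatic scanner.
BLACKLIST: set[str] = set()


def sign(signer_key: bytes, namespace: str, content_hash: str) -> str:
    """Sign a (namespace, content hash) pair; HMAC is a stand-in signature."""
    message = f"{namespace}:{content_hash}".encode()
    return hmac.new(signer_key, message, hashlib.sha256).hexdigest()


def is_trusted(signer: str, namespace: str, content_hash: str,
               signature: str) -> bool:
    """Accept metadata only from a known signer, within its namespaces,
    for content that is not blacklisted, and with a valid signature."""
    entry = TRUSTED_SIGNERS.get(signer)
    if entry is None or namespace not in entry["namespaces"]:
        return False
    if content_hash in BLACKLIST:
        return False
    expected = sign(entry["key"], namespace, content_hash)
    return hmac.compare_digest(expected, signature)
```

This leaves out the hard parts raised in the thread: how the registry and blacklist themselves get distributed, and who decides what goes into them.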

It's very much non-trivial as far as I can see.

That's ignoring the copyright issue, which can mostly be ignored if you
somehow make it impractical to prosecute the distributors of the metadata. But
I think (part of) the metadata will still be subject to copyright lawsuits, in
the context of "the right to be forgotten" and fair-use safe harbors.

------
laughfactory
Somehow all 25 million books need to be freed. It seems like it would be a
great thing for society if this somehow just ended up online.

It's the orphan works that need to be freed the most. Many good books have
been orphaned and will never be reprinted or digitized because the initial
publisher is gone, author is hard to track down, etc.

------
idiot74
Well, Google has an amazing resource on its hands. Data mining, machine
learning, etc.

------
sgt101
Did Google offer to scan and release as creative commons at any point? Seems
like the least evil option to me.

~~~
cooper12
That's only possible to do if you're the copyright holder. What Google would
have been doing would have been specifically for orphan works, meaning no
known copyright holder was able to be found. (as for public domain works, in
America[0] you can't claim copyright on them again unless you significantly
transform them into a derivative work. It doesn't stop museums from claiming
copyright on public domain paintings regardless though...) Besides, I'm not
sure Creative Commons would have helped—since I'm assuming you're referring to
non-commercial—because the settlement depended on Google being able to pay
anyone claiming the books, pay the publishers, and sell the books to
recoup/profit. Merely making the books free online would enrage the
publisher class, who would feel that they are losing possible profits.

[0]: For countries that follow
[https://en.wikipedia.org/wiki/Sweat_of_the_brow](https://en.wikipedia.org/wiki/Sweat_of_the_brow)
doctrine they might actually be able to but that defeats the purpose of
wanting to make public domain works freely available

------
rolleicord
That's a heck of a lot of training data :O

------
kartan
Copyright is broken.

* [https://www.youtube.com/watch?v=tk862BbjWx4](https://www.youtube.com/watch?v=tk862BbjWx4)

