Torching the Modern-Day Library of Alexandria: The Tragedy of Google Books (theatlantic.com)
310 points by jsomers on April 22, 2017 | 71 comments

I have been going down the rabbit hole of copyright, fair use, and the Google Books Settlement recently. This article is a great summary including a lot of the peripheral issues, but the "2003 law review article" linked in TFA is nigh unreadable to me, compared to the actual legal opinions and briefs[0].

They are a couple of fascinating documents. The Authors Guild seems gobsmacked by the final ruling, and so am I. Perhaps the SCOTUS was correct to turn down hearing the case, if only to let the issue settle a little more, but it really feels like it's likely to be overturned in the near future.

There are some interesting tidbits in the opinions: 1) In the definitive ruling, the judge decides that the harm done to the market for the books is negligible, or overcome by the transformative "purpose" of the usage ("purpose" is significant because most examples of fair use include some type of new creative "expression"). This is surprising to me. 2) Google Books is ruled fair use in part because the book descriptions (and snippets?) are metadata describing the books, information that should not be controlled by the authors.

[0] http://www.scotusblog.com/case-files/cases/authors-guild-v-g...

The final ruling in Authors Guild v. Google was really just a footnote to the whole saga, though. The article barely mentions it.

The article focuses on the failure of the class action settlement, due to the "perfect being the enemy of the good" (librarians and individual authors objected to the settlement because they hoped Congress would pass a law to free orphan works, but what actually happened is that no progress has been made).

The battle lines around orphan works are interesting because they don't really follow the same contours as do a lot of the other disagreements about copyright law. From what I've seen, the main opponents of freeing orphan works are individual content creators and the organizations that purport to represent them like ASMP.

The fear I gather is that large content users won't make much of an effort to contact rights holders and will use orphan works legislation to just take it for free.

And this is one reason why I believe that copyright should require a minimal-fee registration every ten years. If you keep your registration current, there is no effort required to contact you. If you can't be bothered to do that, your copyright clearly isn't worth much to you and expires. Either way, the status of the work is unambiguous.

In the case of something like a photograph, that means a minimal-fee registration on each photograph every 10 years. This is also exactly the sort of effort that opponents of orphan works legislation feel that large content corporations will take advantage of when all the little guys forget to renew.

I'm actually mostly for orphan works legislation but I understand the perspective of the opponents.

Wouldn't it be easy to have a provision for bulk registration?

Like, "renew the photographs with these SHAs .....", and then provide a simple tool to list the SHAs of all files with a given extension in a directory?

One request with 50,000 photographs?
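For what it's worth, the tooling side of that is trivial today. Here's a minimal sketch of such a bulk-registration manifest; the manifest shape (path to SHA-256 digest) and the idea of a registry endpoint accepting it are hypothetical:

```python
# Hypothetical bulk-registration manifest: hash every file with a given
# extension under a directory, so one request can renew the whole batch.
import hashlib
from pathlib import Path

def build_manifest(root: str, extension: str = ".jpg") -> dict:
    """Map each matching file's path to the SHA-256 digest of its bytes."""
    return {
        str(path): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(Path(root).rglob(f"*{extension}"))
    }

# Usage (REGISTRY_URL is an invented endpoint):
#   manifest = build_manifest("photos")          # e.g. 50,000 photographs
#   requests.post(REGISTRY_URL, json=manifest)   # one renewal request
```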

How about if I draw 7000 sketches or drawings in a year? How would that be "bulk registered"?

> 2) Google Books is ruled fair use in part because the book descriptions (and snippets?) are metadata describing the books, information that should not be controlled by the authors.

It would be very interesting if, instead of showing verbatim snippets from books, there was an appropriate, high quality machine generated summary. This would be a genuine transformation of the source material.

Maybe in another 10 years. Good summarization is hard, especially over books with illustrations.

Overturned how?

[As a layperson] the most convincing argument the Authors Guild makes is given I think in their SCOTUS petition: that Google's "fair use" is sidestepping a legitimate business opportunity for the rights holders. Books are not just the paper they're printed on, and authors already as a matter of course hold the rights to plays, movies, etc adapted from the text. Particularly when there is no new expression, it seems to me you are just getting away with not licensing the data. This argument is one of the least-covered in the briefs, however. So.

If I were to make a bar bet, based on my limited knowledge, I would say that any bolder attempt to use mass digitized books for a "transformative purpose" like a chatbot or AI would not pass scrutiny (which kinda sucks, because that would be awesome). That's what I mean by overturned -- perhaps the current GB usage is fine because of point 2) above.

Of course, like many Court issues, the best solution (as yohui alludes above) seems to be to have Congress fix things with real law, such as creating a compulsory licensing scheme like in music.

A legitimate business opportunity affects 1 part of the 4-factor fair use test. There are plenty of other cases where things were found to be fair use despite a market existing. I've met a lot of people who have the same gut feeling that you do, but the legal history is much more complicated than that.

One of the interesting tidbits in the article is the discussion about the length of copyright terms. The common wisdom is that the current (too long IMO) terms are the result of lobbying by Disney and other media companies.

The article goes into how, in fact, this really came out of Europe and a fundamentally different perspective on the purpose of copyright than the US Constitution. Wikipedia also has what seems to be a pretty good discussion.[1]

So when people say that current copyright law goes way beyond "promote the progress of science and useful arts" they're absolutely right. But copyright law in continental Europe was much more focused on protecting the rights of authors.

[1] https://en.wikipedia.org/wiki/History_of_copyright_law

OP> Copyright terms have been radically extended in this country largely to keep pace with Europe,

Over the decades, I've talked with at least one government-side principal of the negotiation which resulted in the 1976 extension act. They reported knowing the act was bad public policy, but considered themselves engaged in damage control, given Disney's political might and the "crazy" policies being sought. They considered the act a success, as public-interest damage control.

I can't speak to the 1998 extension. My now vague impression is "no one" thought it was a good idea, simply an exercise in brute force political power.

So one possibility is the author James Somers is simply being clueless. The closest I can come to an alternative is this: the increased regulatory capture of the last three decades, and the long-term narrative of reconciling with European law, might hypothetically have created some institutional momentum independent of the prime mover's financial interests. I don't see it for 1998, but maybe for the upcoming copyright extension. But this is a stretch. I think he just got suckered - it seems to happen to a lot of journalists doing drive-by reporting on the area.

Sidebar, a life lesson: When people ask you to keep information confidential, ask them for how long - how many years or decades. It becomes a long-term irritant, not being able to garbage collect such restrictions.

It's certainly fair that there was a "convenient" aligning of interests between large media corps in the US and longstanding copyright practices/philosophies in Europe. There was absolutely no concentrated lobby in the US to push for an alignment on shortened copyright terms.

With conservative Supreme Court justices loudly and proudly rejecting non-American sources of legal standards, I wonder how European perspectives came to dominate the US legal system on copyright issues but are ignored on human rights, labor law, and the environment?

>I wonder how European perspectives came to dominate the US legal system on copyright issues but are ignored on human rights, labor law, and the environment?

Even the left in the US would be considered right-wing in much of the rest of the world. It's not surprising that the dominant perspectives in these cases tend to be whatever most favors corporate interests, or opposes state regulation or traditionally "leftist" influences, such as labor unions or environmentalism.

In most situations, the quickest way to get Americans to reject an idea on principle is to tell them that it's the way things are done in Europe.

This wasn't about legal precedent though, so very little if anything to do with SCOTUS. This was about aligning worldwide copyright under the Berne Convention.

I'm not familiar with the detailed history but it's pretty easy to imagine that aligning on a longer term would be much easier than on a shorter one. After all, even in the US, on the one side you have plenty of interests in favor of longer terms even if there are somewhat abstract constitutional principles that favor a shorter term.

It's just a play: one side says they need to align to the other, so they pass law to match the other's limits... and raise them a bit. The other side notices and lobbies for alignment, which again will go a little bit further. Rinse and repeat.

Deflecting criticism by claiming another party forced your hand is one of the oldest tricks in the book, but it still works extremely well. See also: EU directives, which are requested and agreed upon by all EU governments, just to be immediately turned into "tyrannical rules from Bruxelles" the minute they have to be applied.

It's called picking and choosing what's convenient and makes/saves the most money.

Kind of. It's really an instance of market failure. In particular, it's an instance of the intrinsic failure of market-based systems to set prices when there isn't any scarcity.

Markets can't really solve that problem. Unfortunately our present legal/economic system decides to solve it by creating scarcity.

It's too bad the vision depicted at the beginning of the article (full texts potentially available in all libraries), didn't come true. But I feel that the public did get the most important benefit from the project: the ability to search these books. I've been researching a history of science subject recently and it's amazing the amount of information I could get from Google Books and nowhere else online. And where the snippets are not enough, I have the book title and author name, so I know where to look for the information in print.

The public benefit of searching the books isn't fully realized, alas, until more than just Google can see all of the text. Here's an example of a book discovery tool built using the Internet Archive's scanned book collections:



Imagine an intellectually curious but poor high schooler: They can't afford to buy journal articles and books; they have almost no option to access serious, quality information. How much potential is lost to this travesty?

We've fallen far, far short of the potential and dream of the Internet and the democratization of knowledge, and the state of things has become a norm; few even notice it or realize what they are missing.

The truly valuable knowledge, to a great extent, still is inaccessible to the vast majority of the world. It is in books and academic journals. As a simple example beyond Google Books, I was thinking the other day that Safari Books by itself contains much more valuable knowledge (and far less misinformation) on many technical issues than the rest of the Internet; I learn more about some topics in a few hours on Safari Books than in a year on the Internet.

Technically, books and journals easily could be made universally accessible, creating an explosion of knowledge and all the things knowledge enables and motivates - the Enlightenment, science, technology, democracy, liberty, prosperity, most of modern civilization, etc. Instead of being well-informed, most of humanity is left with the dregs, and instead of the Internet providing an explosion of knowledge it has created a plague of misinformation and propaganda. IMHO the lack of high quality knowledge also robs the public of the ability to discriminate between good and bad information: Most lack a model of what quality knowledge is, of even the questions to ask (something encountered frequently in serious scholarship). Few even realize the vast gulf between the quality of generally available information and what is in the books and journals. (I'll add that the demise of bookstores means few even see or are aware that the books exist.) And even if they know, it's inaccessible.

Instead of embracing a technological revolution in the distribution of information - a turning point in the history of humanity - we have brought forward the model used for the old technology, with distribution as controlled and limited as the old medium of paper. For the most part, it seems like the same few people have the quality information, the professional scholars. Let's not forget and give up; it's too important.

> We've fallen far, far short of the potential and dream of the Internet and the democratization of knowledge, and the state of things has become a norm; few even notice it or realize what they are missing.

Actually, the Internet has made this go backwards.

Most libraries had basic books on most subjects. And they could get other books as required.

Now, you can't find those basic books anymore on the shelves. "Oh, we can order that, it will be here in a week." Well, that's great, except that the book you ordered isn't a basic one. Oops. Well, there goes another week ...

And, even worse, computer stuff from about 1985-1996 probably isn't online. One of my humorous moments was watching a Millennial have to fix a VB6 program. The fact that the information he needed wasn't anywhere on the web gobsmacked the poor boy.

> They can't afford to buy journal articles and books

Torrents (often just googling "[book name] pdf" works) and https://sci-hub.cc/ have largely solved that problem (certainly for a high schooler).

> I'll add that the demise of bookstores means few even see or are aware that the books exist.

According to this data[0] (take it with a grain of salt), there are as of 2017 at least >20,000 book stores in the U.S.

Pretty much every job gives you a book (policy manual) when you get hired.

Schools, even the most technologically advanced, still have plenty of physical books.

So although I agree with most of your comment, I'd venture to say most of humanity knows physical books exist.


I didn't mean that people aren't aware that such things as books exist, which of course would be absurd.

I meant that people aren't aware of the serious and scholarly books that exist because they don't experience the serendipity[0] of seeing them in particular, or en masse, in the bookstore.

[0] IIRC, serendipity is actually part of the design of library arrangement systems (Library of Congress, Dewey, etc.): Books are arranged so that you will happen across related information when you look for the book you came for.

This point cannot be overstated right now. Recommendation systems like Amazon's are terrible for book discovery compared to looking at the rest of the shelf when you are getting a book at a library. There is a popular sentiment that the Internet has made knowledge discovery easier than going to the library, which in my experience is absolutely misleading, and is causing people to wrongly believe they have done research on a subject when in fact they have completely overlooked a giant corpus of published material.

> Pretty much every job gives you a book (policy manual) when you get hired.

Whoa whoa whoa, you are painting with broad strokes here. Most businesses are small businesses, and most of those have a handful of guidelines at best, even when you get above a dozen employees.

In the mid-size bracket of businesses, I know 300 employee businesses with no proper policy manual.

My apologies, every company I've ever worked for (even the one person cobbler shop I worked at in high school) gave me one, so I guess this example was anecdotal.

I've worked at a few businesses, and the closest I ever got to a booklet with rules was toward the end of my first job, where they tried to ban sharing of salaries and quite a few other things. Ended up calling L&I to address the illicit policy and quit a few months later.

Goddamnit, the truth in this is really depressing.

Overzealous copyright term extension, in the farcical name of the "advancement of the arts and sciences", has truly been a travesty for human knowledge.

In August 2010, Google put out a blog post announcing that there were 129,864,880 books in the world.

That number actually sounds surprisingly low. In contrast, I wonder how many books the underground "bookz" scene has scanned so far. It's hard to find exact numbers, but from what I could find, LibGen contains approximately 3M books, so if Google's count is accurate, that's ~2.3% of all books ever published. No doubt there are other sites I'm unaware of, probably in other languages, which have also accumulated massive collections of ebooks; but the fact that there exist people who have, for free and on their own time and at risk of being sued for copyright infringement, voluntarily scanned and shared over 2.3% of all the world's books is somewhat amazing.
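For what it's worth, the ~2.3% figure checks out against the numbers given:

```python
# Sanity check: LibGen's rough catalogue size against Google's
# August 2010 estimate of all books ever published.
libgen_books = 3_000_000
all_books = 129_864_880
share = libgen_books / all_books
print(f"{share:.1%}")  # 2.3%
```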

The number of "books" being published is growing exponentially. So, a significant fraction of all books ever written would have been published in the recent past (which also means they would probably be "electronic natives" that don't need scanning and digitization). I imagine that the uptick in self-publication opportunity is already (or will become) an important factor in that growth. For these reasons, I don't find it shocking that some online repository has a few percent of all books ever written.

PS: I couldn't find numbers regarding my statements. It would be great if someone can provide sources buttressing or refuting my claims.

The problem with self-publication is that your book might as well not exist[0] unless you already have a significant following (in which case you could easily secure a publisher). This isn't an issue unique to the digital age, there's a term specifically for publishers who will just publish anything: vanity presses (though the catch is that you have to pay them usually, hence the "vanity" part). Publishers provide their name, marketing, and get your books into stores which is pretty important. (I write Wikipedia articles on manga and you can easily tell the North American publishers apart by how hard they get the manga scene to write about and review—via advance copies—their books, especially with regards to press releases and staff interviews. If a manga didn't have this, it became really hard to justify a Wikipedia article for it on the basis of Notability) Also I bet for statistics purposes, only books registered with the Library of Congress are counted (how else would you find all those self-published works?).

There are services like Lulu for free self-publication,[1] but they don't carry the same "legitimacy" or reach as publishers. I think the best analogue to online self-publishing would be the zine: they were easy to reproduce (via photocopying) and distribute (at events or via post). However, none of them really had a long-lasting legacy[2] and anything successful eventually legitimizes itself as a periodical or magazine and becomes established. Thus, I think self-publishing doesn't really change much for the individual, but makes it much easier for groups[3] to gain traction. Overall, being self-published on the internet just increases your accessibility, but we should be careful about confusing it with traditional publishing or counting it in statistics because a lot of it is just noise. (bringing up Wikipedia again, there are tons of "books" on Google Books that are actually just random compilations of Wikipedia articles)

Anyway just some of my random thoughts, hope I didn't digress too much.

[0]: Overused thought experiment: "If a tree falls in a forest and no one is around to hear it, does it make a sound?"

[1]: https://www.lulu.com/

[2]: An exception like the Phrack ezine might be of interest to the HN crowd. (https://en.wikipedia.org/wiki/Phrack)

[3]: Here's where the print-web distinction breaks down. A ton of blogs and amateur news websites have evolved and became taken seriously. Just because they're not published in a book format, doesn't mean they're distinct in my opinion.

Some people are all "Self-publishing is great! I get half of the price of every book I sell."

After doing it for a while I would gladly take a much lower percentage of each sale in return for much huger sales, and having people to deal with printing, distribution, publicity, advertising, and all the other parts of the process that aren't "me drawing the next page of comics". And I'm lucky to be working in comics, rather than words, where there's a significant tradition of "underground" publishing that's become legitimized into "small-press" and "independent" publishing, rather than an epithet like "vanity press".

Yeah, you'd think the number would be inflated by e.g. all those much-mocked procedurally-generated "books" on Amazon (Toilet Seat Sales Trends in China and East Asia, 2004-2005, or whatever). Of course, if no human ever reads or writes a book, nor is it ever printed on paper, is it really a book?

How about "machine monograph"?

I thought that number sounded low as well, but it sort of checks out after doing back-of-the-envelope math: 130M books / 107B people that ever lived = 0.0012 books/person, roughly a 1:800 book-to-person ratio.

Checks out. I'm a struggling author who has written approximately 0.0012 books

With Libgen being around, I feel like the problem mentioned in the article partially sorted itself out (assuming the higher-quality, more-used books make it onto Libgen). So we've got the data.

But the layer above, the application layer, could use much more work. It won't happen in the legal realm[1], but I wonder what kind of amazing things we could achieve if we found a way to create a wide and deep developer ecosystem around Libgen, dedicated to making this data useful.

[1] Although I think the focus of the legal project was a bit all-or-nothing; if Google had decided to focus only on the subset of books for which it could get legal rights, and create the best applications possible, it could have created a ton of value.

There are a lot of duplicates on LibGen...

> Many of the objectors indeed thought that there would be some other way to get to the same outcome

I really feel like Google is a victim of their own engineering brilliance sometimes: the objectors really thought that because Google made this look easy, it was easy. They figured that if one company could just casually decide to do this, then someone else, or the government, or another legal avenue, could reliably be expected to come along. The reality, of course, is that Google is special; nobody will do it now, and even Google is losing its "specialness".

And further, because Google appeared to be doing it so easily, they all thought that Google profiting from it in some way was unfair. They didn't see it as reasonable that Google should be rewarded for the genuine investment of labor and intellectual property involved in pulling this off, precisely because Google didn't give the appearance that it was hard. If Google had given more of an appearance of struggling to achieve it - I'd bet the authors would have suddenly appreciated what Google was doing more and probably accepted the idea that it was fair for Google to profit from it in some way.

> really feel like Google is a victim of their own engineering brilliance sometimes

Per the numbers ($400MM for 25MM books), it doesn't seem like it was that easy. It seems like Google had the money and the wherewithal to devote the necessary muscle to the effort.

As someone who has looked into this quite a bit, it's not difficult to do what Google Books had been doing as of 2017. The reason is that various groups have ripped and converted to PDF thousands of books. It's trivial to facilitate search on these, and other cool stuff, at this point. Someone could do it if they really wanted to, without too much effort. They most likely haven't because there isn't much profit in it as it stands, and because of the legal hurdles now that Google has failed.

The article does mention that one of the big issues Google encountered was logistics.

It's one thing to scan a few dozen or a few hundred books. It's a completely different thing to do it for all books. Assuming you wanted to digitize 100 million books in three years, you'd need to process 91,324 books per day, or roughly one book per second, assuming no breaks and 24/7/365 operation.
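The throughput figure is easy to verify with back-of-the-envelope arithmetic, using the numbers as stated above:

```python
# Scanning rate needed to digitize 100 million books in three years
# of nonstop, around-the-clock operation.
books = 100_000_000
days = 3 * 365
per_day = books / days            # books per day
per_second = per_day / 86_400     # books per second, no breaks
print(round(per_day), round(per_second, 2))  # 91324 1.06
```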

As the article said, Google poured hundreds of millions of dollars on this, so I'd wager it's not as trivial as it sounds.

Thank you. This is not "re-create /r/place in a weekend" territory. This is physical hardware development and a substantial logistical challenge.

Found a video of one of their prototype scanners. IIRC they looked at like every scanning solution available and also got a bunch of universities and libraries to help them purchase and operate scanning equipment. Pretty cool stuff.


edit: Next vid looks good too. In depth on different scanners.


"...here we’ve done the work to make it real and we were about to give it to the world and now, instead, it’s 50 or 60 petabytes on disk, and the only people who can see it are half a dozen engineers on the project who happen to have access because they’re the ones responsible for locking it up.

"I asked someone who used to have that job, what would it take to make the books viewable in full to everybody? I wanted to know how hard it would have been to unlock them. What’s standing between us and a digital public library of 25 million volumes?

"You’d get in a lot of trouble, they said, but all you’d have to do, more or less, is write a single database query. You’d flip some access control bits from off to on. It might take a few minutes for the command to propagate."

Now this would be an interesting leak to Wikileaks.

That just makes them viewable, and being still hosted on Google's servers, it would probably be closed in a few minutes too as people start downloading everything they can find. Something similar might've happened to Springer a while ago: https://news.ycombinator.com/item?id=10810271

Leaking 25M books' worth of files, however, is going to be far more difficult. It would have to be a very carefully coordinated effort both on the "inside" and "outside"; one person doing a Snowden won't have any effect.

Couldn't they have proposed a neutral party that would store and manage all the books? Just like the Books Rights Registry was going to handle most of the money. I suppose that Google didn't expect all that backlash. And, now that I think about it, that was out of the scope of the lawsuit as well... In the words of an American president: "Sad!"

Maybe Google could get around some of the issues by spinning the thing off as a non profit? It could always owe Google a few million for what they'd spent so far.

A few hundred million.

So it's between 50 and 60 petabytes of data?

I've been wondering how it would be possible for a disparate group of tech-oriented people to make a collection like that. It would only take about 10,000 people with 6 terabytes of storage each, which doesn't sound impossible to me.

The main issues I see are:

a) How to share access to the data without exposing yourself?

b) How to make the data discoverable and searchable?

c) How do you ascertain survival of the data?

and optionally: d) How to deal with the freeloader problem?
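On the raw storage side, the sizing arithmetic is easy to sketch. The replication factor here is my own assumption (you'd want more than one copy of everything to answer (c)), not a figure from the thread:

```python
# Terabytes each volunteer must contribute to hold the corpus, given
# total corpus size, swarm size, and a replication factor for durability.
def storage_per_person_tb(total_pb: float, people: int, replicas: int = 3) -> float:
    return total_pb * 1000 * replicas / people

# 60 PB across 10,000 volunteers, with 3 copies of everything:
print(storage_per_person_tb(60, 10_000))  # 18.0 TB each
```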

If we use the private torrent site scene as a model, all of those things are pretty much solved.

These agencies play whack-a-mole with them, but the sites tend to live for a long time and accumulate vast archives by the time they mature.

See the history of the late What.CD for a rundown of what once existed for music. I think that cheap streaming services have kind of killed the peak potential of the music version of these sites, though. It's kind of sad, because What.CD had every single release of every single song catalogued. Streaming sites will only give you one or a few.

Well... I'd say it isn't solved. What.CD went down. Fuck, that still hurts.

People now speak of Google Books as a library of Alexandria, but What.CD was the real thing. Google Books was barely available to anyone, ever.

That shouldn't be possible the next time. How though? Distributed metadata curation is a problem we haven't worked out well. I know I haven't.

Especially if the metadata is stored on and for data stored on a diverse set of platforms, like "not only BT", but also Freenet, HTTP, FTP, IPFS... It just doesn't exist.

I guess by separating the metadata from the content in such a way that the index can't be targeted for copyright violation.

The thing is, you have to index the pieces of metadata or otherwise make them discoverable, and then decide which parts you trust.

This would need some kind of signing/trust-distribution scheme, something like namespaces ("YIFY can only approve movie and show releases because they only do movie and show releases").

It would also need a way to blacklist malicious metadata (automatic scanners that publish lists of files carrying viruses?).

It's very much non-trivial as far as I can see.
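To make the namespace idea concrete, here is a minimal sketch. A real system would use public-key signatures (e.g. Ed25519) rather than the shared-secret HMAC used here to stay stdlib-only, and the issuer names, keys, and record format are all invented:

```python
# Namespace-scoped metadata signing: an issuer's signature only
# validates for namespaces it is trusted to publish in.
import hashlib
import hmac

TRUSTED = {"YIFY": {"key": b"yify-secret", "namespaces": {"movies", "shows"}}}

def sign(issuer: str, namespace: str, metadata: bytes) -> bytes:
    key = TRUSTED[issuer]["key"]
    return hmac.new(key, namespace.encode() + b"|" + metadata, hashlib.sha256).digest()

def verify(issuer: str, namespace: str, metadata: bytes, sig: bytes) -> bool:
    entry = TRUSTED.get(issuer)
    if entry is None or namespace not in entry["namespaces"]:
        return False  # unknown issuer, or issuer signing outside its namespace
    expected = hmac.new(entry["key"], namespace.encode() + b"|" + metadata, hashlib.sha256).digest()
    return hmac.compare_digest(expected, sig)

record = b'{"title": "Some Film", "infohash": "..."}'
sig = sign("YIFY", "movies", record)
print(verify("YIFY", "movies", record, sig))  # True
print(verify("YIFY", "music", record, sig))   # False: outside its namespace
```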

That's ignoring the copyright issue, which can mostly be ignored if you somehow make it impractical to prosecute the distributors of the metadata. But I think (part of) the metadata will still be subject to copyright lawsuits, in the context of "the right to be forgotten" and fair-use safe harbors.

> I'd say it isn't solved. What.CD went down. Fuck, that still hurts.

Ethereum. Forever.

Explain please?

Why is it 50 petabytes? If we are talking about the content of books in some kind of markup, then even assuming a heavy-duty 10 MB per book, 25 million books would only be about 250 terabytes. Reasonably, I would assume it would be a few hundred GB.

What is in the data that's making it so heavy - the original scanned images ?

I think it's the scans, yeah: ~50 MB for the scans and ~1 MB for the OCR'd version.

Don't forget the 100Gb fiber connections.

Somehow all 25 million books need to be freed. It seems like it would be a great thing for society if this somehow just ended up online.

It's the orphan works that need to be freed the most. Many good books have been orphaned and will never be reprinted or digitized because the initial publisher is gone, author is hard to track down, etc.

Well, google have an amazing resource on their hands. Data mining, machine learning, etc.

Did Google offer to scan and release as creative commons at any point? Seems like the least evil option to me.

That's only possible to do if you're the copyright holder. What Google would have been doing would have been specifically for orphan works, meaning no known copyright holder was able to be found. (as for public domain works, in America[0] you can't claim copyright on them again unless you significantly transform them into a derivative work. It doesn't stop museums from claiming copyright on public domain paintings regardless though...) Besides, I'm not sure Creative Commons would have helped—since I'm assuming you're referring to non-commercial—because the settlement depended on Google being able to pay anyone claiming the books, pay the publishers, and sell the books to recoup/profit. Just merely making the books free online would enrage the publishers class who would feel that they are losing possible profits.

[0]: For countries that follow https://en.wikipedia.org/wiki/Sweat_of_the_brow doctrine they might actually be able to but that defeats the purpose of wanting to make public domain works freely available

That's a heck of a lot of training data :O
