They are a couple of fascinating documents. The Authors Guild seems gobsmacked by the final ruling, and so am I. Perhaps the SCOTUS was correct to turn down hearing the case, if only to let the issue settle a little more, but it really feels like it's likely to be overturned in the near future.
There are some interesting tidbits in the opinions:
1) In the definitive ruling, the judge decides that the harm done to the market for the books is negligible, or overcome by the transformative "purpose" of the usage ("purpose" is significant because most examples of fair use include some type of new creative "expression"). This is surprising to me.
2) Google Books is ruled fair use in part because the book descriptions (and snippets?) are metadata describing the books, information that should not be controlled by the authors.
The article focuses on the failure of the class action settlement, due to the "perfect being the enemy of the good" (librarians and individual authors objected to the settlement because they hoped Congress would pass a law to free orphan works, but what actually happened is that no progress has been made).
The fear, I gather, is that large content users won't make much of an effort to contact rights holders and will use orphan works legislation to simply take works for free.
I'm actually mostly for orphan works legislation but I understand the perspective of the opponents.
Like, "rename the photographs with SHAs .....", and then providing a simple tool to list the SHAs of all files with a given extension in a directory?
One request with 50,000 photographs?
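A minimal sketch of such a tool in Python (the function names and the choice of SHA-256 are my own assumptions, not anything specified above):

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks so that
    large photographs don't have to fit in memory at once."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def list_hashes(directory: str, extension: str) -> dict:
    """Map each file with the given extension (e.g. 'jpg') in a
    directory to its SHA-256 digest."""
    return {
        p.name: sha256_of_file(p)
        for p in sorted(Path(directory).glob(f"*.{extension}"))
    }
```

With something like this, the "one request with 50,000 photographs" could just be the dict serialized as JSON.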
It would be very interesting if, instead of showing verbatim snippets from books, there was an appropriate, high quality machine generated summary. This would be a genuine transformation of the source material.
If I were to make a bar bet, based on my limited knowledge, I would say that any bolder attempt to use mass digitized books for a "transformative purpose" like a chatbot or AI would not pass scrutiny (which kinda sucks, because that would be awesome). That's what I mean by overturned -- perhaps the current GB usage is fine because of point 2) above.
Of course, like many Court issues, the best solution (as yohui alludes above) seems to be to have Congress fix things with real law, such as creating a compulsory licensing scheme like the one in music.
The article goes into how, in fact, this really came out of Europe and a fundamentally different perspective on the purpose of copyright than the US Constitution. Wikipedia also has what seems to be a pretty good discussion.
So when people say that current copyright law goes way beyond "promote the progress of science and useful arts" they're absolutely right. But copyright law in continental Europe was much more focused on protecting the rights of authors.
Over the decades, I've talked with at least one government-side principal of the negotiation which resulted in the 1976 extension act. They reported knowing the act was bad public policy, but considered themselves engaged in damage control, because of Disney's political might and the "crazy" policies being sought. They considered the act a success - as public interest damage control.
I can't speak to the 1998 extension. My now vague impression is "no one" thought it was a good idea, simply an exercise in brute force political power.
So one possibility is the author James Somers is simply being clueless. The closest I can come to an alternative is this: the increased regulatory capture of the last three decades, and the long-term narrative of reconciling with European law, might hypothetically have created some institutional momentum independent of the prime mover's financial interests. I don't see it for 1998, but maybe for the upcoming copyright extension. But this is a stretch. I think he just got suckered - it seems to happen to a lot of journalists doing drive-by reporting on the area.
Sidebar, a life lesson: When people ask you to keep information confidential, ask them for how long - how many years or decades. It becomes a long-term irritant, not being able to garbage collect such restrictions.
Even the left in the US would be considered right-wing in much of the rest of the world. It's not surprising that the dominant perspectives in these cases tend to be whatever most favors corporate interests, or opposes state regulation or traditionally "leftist" influences, such as labor unions or environmentalism.
In most situations, the quickest way to get Americans to reject an idea on principle is to tell them that it's the way things are done in Europe.
I'm not familiar with the detailed history but it's pretty easy to imagine that aligning on a longer term would be much easier than on a shorter one. After all, even in the US, on the one side you have plenty of interests in favor of longer terms even if there are somewhat abstract constitutional principles that favor a shorter term.
Deflecting criticism by claiming another party forced your hand is one of the oldest tricks in the book, but it still works extremely well. See also: EU directives, which are requested and agreed upon by all EU governments, only to be immediately turned into "tyrannical rules from Brussels" the minute they have to be applied.
Markets can't really solve that problem. Unfortunately our present legal/economic system decides to solve it by creating scarcity.
We've fallen far, far short of the potential and dream of the Internet and the democratization of knowledge, and the state of things has become a norm; few even notice it or realize what they are missing.
The truly valuable knowledge, to a great extent, still is inaccessible to the vast majority of the world. It is in books and academic journals. As a simple example beyond Google Books, I was thinking the other day that Safari Books by itself contains much more valuable knowledge (and far less misinformation) on many technical issues than the rest of the Internet; I learn more about some topics in a few hours on Safari Books than in a year on the Internet.
Technically, books and journals easily could be made universally accessible, creating an explosion of knowledge and all the things knowledge enables and motivates - the Enlightenment, science, technology, democracy, liberty, prosperity, most of modern civilization, etc. Instead of being well-informed, most of humanity is left with the dregs, and instead of the Internet providing an explosion of knowledge it has created a plague of misinformation and propaganda. IMHO the lack of high quality knowledge also robs the public of the ability to discriminate between good and bad information: Most lack a model of what quality knowledge is, of even the questions to ask (something encountered frequently in serious scholarship). Few even realize the vast gulf between the quality of generally available information and what is in the books and journals. (I'll add that the demise of bookstores means few even see or are aware that the books exist.) And even if they know, it's inaccessible.
Instead of embracing a technological revolution in the distribution of information - a turning point in the history of humanity - we have brought forward the model used for the old technology, with distribution as controlled and limited as the old medium of paper. For the most part, it seems like the same few people have the quality information, the professional scholars. Let's not forget and give up; it's too important.
Actually, the Internet has made this go backwards.
Most libraries had basic books on most subjects. And they could get other books as required.
Now, you can't find those basic books anymore on the shelves. "Oh, we can order that, it will be here in a week." Well, that's great, except that the book you ordered isn't a basic one. Oops. Well, there goes another week ...
And, even worse, computer stuff from about 1985-1996 probably isn't online. One of my humorous moments was watching a "Millennial" have to fix a VB6 program. The fact that the information he needed wasn't anywhere on the web gobsmacked the poor boy.
Torrents (often just googling "[book name] pdf" works) and https://sci-hub.cc/ have largely solved that problem (certainly for a high schooler).
According to this data (take it with a grain of salt), as of 2017 there were more than 20,000 bookstores in the U.S.
Pretty much every job gives you a book (policy manual) when you get hired.
Schools, even the most technologically advanced, still have plenty of physical books.
So although I agree with most of your comment, I'd hazard to say most of humanity knows physical books exist.
I meant that people aren't aware of the serious and scholarly books that exist because they don't experience the serendipity of seeing them in particular, or en masse, in the bookstore.
 IIRC, serendipity is actually part of the design of library arrangement systems (Library of Congress, Dewey, etc.): Books are arranged so that you will happen across related information when you look for the book you came for.
Whoa whoa whoa, you are painting with broad strokes here. Most businesses are small businesses, and most of those have a handful of guidelines at best, even when you get above a dozen employees.
In the mid-size bracket of businesses, I know 300 employee businesses with no proper policy manual.
That number actually sounds surprisingly low. In contrast, I wonder how many books the underground "bookz" scene has scanned so far. It's hard to find exact numbers, but from what I could find, LibGen contains approximately 3M books, so if Google is accurate, that's ~2.3% of all books ever published. No doubt there are other sites I'm unaware of, probably in other languages, which have also accumulated massive collections of ebooks; but the fact that there exist people who have, for free and on their own time and at risk of being sued for copyright infringement, voluntarily scanned and shared over 2.3% of all the world's books is somewhat amazing.
PS: I couldn't find numbers regarding my statements. It would be great if someone can provide sources buttressing or refuting my claims.
There are services like Lulu for free self-publication, but they don't carry the same "legitimacy" or reach of publishers. I think the best analogue to online self-publishing would be the zine: they were easy to reproduce (via photocopying) and distribute (at events or via post). However, none of them really had a long-lasting legacy, and anything successful eventually legitimizes itself as a periodical or magazine and becomes established. Thus, I think self-publishing doesn't really change much for the individual, but makes it much easier for groups to gain traction. Overall, being self-published on the internet just increases your accessibility, but we should be careful about confusing it with traditional publishing or counting it in statistics, because a lot of it is just noise. (Bringing up Wikipedia again, there are tons of "books" on Google Books that are actually just random compilations of Wikipedia articles.)
Anyway just some of my random thoughts, hope I didn't digress too much.
: Overused thought experiment: "If a tree falls in a forest and no one is around to hear it, does it make a sound?"
: An exception like the Phrack ezine might be of interest to the HN crowd. (https://en.wikipedia.org/wiki/Phrack)
: Here's where the print-web distinction breaks down. A ton of blogs and amateur news websites have evolved and became taken seriously. Just because they're not published in a book format, doesn't mean they're distinct in my opinion.
After doing it for a while I would gladly take a much lower percentage of each sale in return for much larger sales, and having people to deal with printing, distribution, publicity, advertising, and all the other parts of the process that aren't "me drawing the next page of comics". And I'm lucky to be working in comics, rather than words, where there's a significant tradition of "underground" publishing that's become legitimized into "small-press" and "independent" publishing, rather than an epithet like "vanity press".
But the layer above, the application layer, could use much more work. It won't happen in the legal realm, but I wonder what kind of amazing things we could achieve if we found a way to create a wide and deep developer ecosystem around Libgen, dedicated to helping make this data useful.
Although I think the focus of the legal project is a bit all-or-nothing: if Google had decided to focus on only a subset of books (for which it can get legal rights) and create the best applications possible, they could have created a ton of value.
I really feel like Google is a victim of their own engineering brilliance sometimes: the objectors really thought that because Google made this look easy, it was easy. They figured that if one company could just casually decide to do this, they could reliably expect that someone else, maybe the government or another legal avenue, would come along. The reality, of course, is that Google is special; nobody will do it now, and even Google is losing its "specialness".
And further, because Google appeared to be doing it so easily, they all thought that Google profiting from it in some way was unfair. They didn't see it as reasonable that Google should be rewarded for the genuine investment of labor and intellectual property involved in pulling this off, precisely because Google didn't give the appearance that it was hard. If Google had given more of an appearance of struggling to achieve it - I'd bet the authors would have suddenly appreciated what Google was doing more and probably accepted the idea that it was fair for Google to profit from it in some way.
Per the numbers ($400MM for 25MM books), it doesn't seem like it was that easy. It seems like Google had the money and the wherewithal to devote the necessary money and muscle to the effort.
It's one thing to scan a few dozen or a few hundred books. It's a completely different thing to do it for all books. Assuming you'd want to digitize 100 million books in three years, you'd need to process 91,324 books per day, or roughly one book per second, assuming no breaks and 24x7x365 operation.
As the article said, Google poured hundreds of millions of dollars on this, so I'd wager it's not as trivial as it sounds.
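The back-of-envelope throughput above checks out; all figures here come from the comment itself:

```python
# Back-of-envelope: digitize 100 million books in three years.
TOTAL_BOOKS = 100_000_000
DAYS = 3 * 365
SECONDS_PER_DAY = 24 * 60 * 60

books_per_day = TOTAL_BOOKS / DAYS                   # ~91,324
books_per_second = books_per_day / SECONDS_PER_DAY   # ~1.06, i.e. roughly one per second

print(f"{books_per_day:,.0f} books/day, {books_per_second:.2f} books/second")
```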
edit: Next vid looks good too. In depth on different scanners.
"I asked someone who used to have that job, what would it take to make the books viewable in full to everybody? I wanted to know how hard it would have been to unlock them. What’s standing between us and a digital public library of 25 million volumes?

"You’d get in a lot of trouble, they said, but all you’d have to do, more or less, is write a single database query. You’d flip some access control bits from off to on. It might take a few minutes for the command to
Now this would be an interesting leak to Wikileaks.
Leaking 25M books' worth of files, however, is going to be far more difficult. It would have to be a very carefully coordinated effort both on the "inside" and "outside"; one person doing a Snowden won't have any effect.
I've been wondering how it would be possible for a disparate group of tech-oriented people to make a collection like that. It would only take 1,000 people with 6 terabytes of storage each, which doesn't sound impossible to me.
The main issues I see are:
a) How to share access to the data without exposing yourself?
b) How to make the data discoverable and searchable?
c) How do you ensure the survival of the data?
and optionally: d) How to deal with the freeloader problem?
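A rough sizing sketch suggests the 1,000-volunteer estimate is plausible; note the average scan size and replication factor below are my own assumptions, not figures from this thread:

```python
# Rough sizing for a volunteer-hosted 25M-book collection.
BOOKS = 25_000_000
AVG_SCAN_MB = 50     # assumed average size of one scanned book
REPLICAS = 3         # assumed redundancy, addressing issue (c) above
VOLUNTEERS = 1_000
PER_VOLUNTEER_TB = 6

total_tb = BOOKS * AVG_SCAN_MB / 1_000_000   # raw collection size in TB
needed_tb = total_tb * REPLICAS              # with redundancy
available_tb = VOLUNTEERS * PER_VOLUNTEER_TB

print(f"need {needed_tb:,.0f} TB, have {available_tb:,.0f} TB")
```

Under these assumptions the volunteers' 6 PB comfortably covers three copies of everything; a larger average scan size would change the picture.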
These regulatory agencies go around and play whack-a-mole on them, but they tend to live for a long time and have vast archives by the time they mature.
See the history of the late what.cd for a rundown of what once existed for music. I think that cheap streaming services have kind of killed the peak potential of the music version of these sites, though. It's sad, because what.cd had every single release of every single song catalogued. Streaming sites will only give you one or a few.
People now speak of Google Books as a library of Alexandria, but What.CD was the real thing. Google Books was barely available to anyone, ever.
That shouldn't be possible the next time. How though? Distributed metadata curation is a problem we haven't worked out well. I know I haven't.
Especially if the metadata is stored on and for data stored on a diverse set of platforms, like "not only BT", but also Freenet, HTTP, FTP, IPFS... It just doesn't exist.
This would need some kind of signing/trust-distribution scheme, something like namespaces ("YIFY can only approve movie and show releases because they only do movie and show releases").
It would also need a way to blacklist malicious metadata (automatic scanners that publish lists of files with viruses?).
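A toy sketch of that namespace-scoped acceptance rule, using HMAC with shared secrets purely as a stand-in (a real scheme would use public-key signatures such as Ed25519; the group name, key, and namespaces here are all hypothetical):

```python
import hashlib
import hmac
import json

# Trust registry: each group has a key and the namespaces it may publish in.
TRUSTED = {
    "YIFY": {"key": b"example-shared-secret", "namespaces": {"movies", "shows"}},
}


def sign(group: str, record: dict) -> str:
    """Produce an authentication tag over a canonical JSON encoding."""
    payload = json.dumps(record, sort_keys=True).encode()
    return hmac.new(TRUSTED[group]["key"], payload, hashlib.sha256).hexdigest()


def accept(group: str, record: dict, sig: str) -> bool:
    """Accept metadata only if the tag verifies AND the record falls
    inside a namespace the group is approved for."""
    entry = TRUSTED.get(group)
    if entry is None or record.get("namespace") not in entry["namespaces"]:
        return False
    return hmac.compare_digest(sign(group, record), sig)
```

So a record tagged `"namespace": "books"` would be rejected even with a valid YIFY tag, which is the "YIFY can only approve movie and show releases" property.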
It's very much non-trivial as far as I can see.
That's ignoring the copyright issue, which can mostly be ignored if you somehow make it impractical to prosecute the distributors of the metadata. But I think (part of) the metadata will still be subject to copyright lawsuits, in the context of "the right to be forgotten" and fair-use safe harbors.
What is in the data that's making it so heavy - the original scanned images ?
It's the orphan works that need to be freed the most. Many good books have been orphaned and will never be reprinted or digitized because the initial publisher is gone, author is hard to track down, etc.
: For countries that follow https://en.wikipedia.org/wiki/Sweat_of_the_brow doctrine they might actually be able to but that defeats the purpose of wanting to make public domain works freely available