
How Google Book Search Got Lost - chrismealy
https://backchannel.com/how-google-book-search-got-lost-c2d2cf77121d
======
dzdt
This is the value-destruction side of copyright protection. With more
reasonable length and more easily determinable copyright status, efforts like
google books would be able to do so many more things. But without a permissive
legal framework that innovation is shut down.

~~~
baldfat
Blame it 100% on Mickey Mouse and Disney; that is where all the blame belongs.

~~~
smnrchrds
Here is what I don't understand: how come copyright gets expanded left and
right but patent does not? Do Disney and the entertainment industry have more
power than the entirety of traditional industries which depend on patents to
make a profit (pharmaceutical, manufacturing, electronics, etc.) combined?

~~~
andrewla
I think it's a lot less clear what the societal benefits of forcing copyrights
to expire are. For patents, it's much more clear -- a patent prevents others
from copying the invention, so its expiration allows others to use that
invention for economically interesting purposes, which have a net societal
benefit. So on the one hand, the temporary monopoly allows the inventor to
profit from their invention, and costs others their ability to exploit it, and
after a certain amount of time, those needs are reversed.

For copyrights, in the purest form, what do you really get from letting it
expire? The only really tangible benefit is to allow people to obtain the work
without having to pay anyone, and (theoretically) make it easier to preserve
and distribute orphaned works without the permission of the creator. It seems
like as a lawmaker, it's really not that hard to rationalize saying "I can
pass this legislation, extending copyright, or I can directly cause Disney to
forfeit X million dollars in revenue".

At least with a patent expiration, you can counterweight that by saying "I can
extend the patent, and Boeing continues to get X dollars, or I can allow the
patent to expire, and other companies get to use that invention to make Y
dollars." The tradeoff is a little clearer.

I would love to see better rules for copyrights moving to the public domain,
not so much for the sake of Mickey Mouse, but rather for obscure and orphaned
works, as well as simply decreasing the cost for the consumer to improve
accessibility. But it's very easy in my mind to see why this is unlikely to
happen.

~~~
thaumasiotes
> For copyrights, in the purest form, what do you really get from letting it
> expire? The only really tangible benefit is to allow people to obtain the
> work without having to pay anyone, and (theoretically) make it easier to
> preserve and distribute orphaned works without the permission of the
> creator.

You seem to be forgetting about derivative works.

~~~
scandox
Yeah my Columbo novel should not be denied an audience.

~~~
thaumasiotes
Your Columbo novel is probably awful. (No offense.) But there's someone out
there who would have written a great one.

~~~
scandox
None taken.

Let me tell you about it. There are three interleaved narratives.

One is about an elderly actor who played the TV role for so long that he's no
longer sure if he's an actor or the actual detective. The second is about a
real-life police detective called Frank Columbo who is plagued by his identity
with the fictional detective, whom he resembles in almost every particular.
The third is about the actual Columbo.

All three are simultaneously solving different murders, all of which have
their exposition up front - in the classic Columbo style.

------
gridit
"If Google could find a way to take that corpus, sliced and diced by genre,
topic, time period, all the ways you can divide it, and make that available to
machine-learning researchers and hobbyists at universities and out in the
wild, I’ll bet there’s some really interesting work that could come out of
that. Nobody knows what,” Sloan says. He assumes Google is already doing this
internally. Jaskiewicz and others at Google would not say."

For books that are scanned, but with no extra licensing, would Google be
allowed to do anything with the data? Create a very delocalized n-gram set?
Use it as the "test" set (not even cross-validation, where it might influence
hyperparams) for a ML algorithm?

Edit: would love to know where google's authorization derives from, with the
ngram set. Somewhere in the Judge's orders? A negotiated fee with the Authors
Guild?
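
For anyone unsure what an n-gram set amounts to, here is a minimal sketch (not Google's actual pipeline; the function name and the sample sentence are made up for illustration). It keeps only counts of word sequences and drops where they occurred, which is roughly what "delocalized" would mean:

```python
from collections import Counter
import re

def ngram_counts(text, n=2):
    """Count word-level n-grams in text (lowercased, punctuation dropped)."""
    words = re.findall(r"[a-z']+", text.lower())
    # Slide a window of n words over the text by zipping n shifted copies.
    grams = zip(*(words[i:] for i in range(n)))
    return Counter(" ".join(g) for g in grams)

# Hypothetical sample text, not taken from any scanned book.
sample = "the cat sat on the mat and the cat slept"
counts = ngram_counts(sample, n=2)
print(counts.most_common(1))  # [('the cat', 2)]
```

The output is a frequency table only; positions and ordering beyond the window size are gone.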

~~~
gridit
Ok, here is one of the important opinions in the Google Books settlement, by
Judge Chin in 2013 [0]. He basically says (paraphrasing), "I'm going to assume
Google has violated copyright by creating digital copies and serving them. But
it's fair use, because the new products are transformative".

For example, re:ngrams

""" Similarly, Google Books is also transformative in the sense that it has
transformed book text into data for purposes of substantive research,
including data mining and text mining in new areas, thereby opening up new
fields of research. Words in books are being used in a way they have not been
used before. Google Books has created something new in the use of book text-
the frequency of words and trends in their usage provide substantive
information. [...]

On the other hand, fair use has been found even where a defendant benefitted
commercially from the unlicensed use of copyrighted works

"""

Oh man, this is mind-blowing.

[0] [https://copyright-casebook.com/about/recent-cases-
edited/aut...](https://copyright-casebook.com/about/recent-cases-
edited/authors-guild-v-google-inc-954-f-supp-2d-282-s-d-n-y-2013/)

~~~
pbhjpbhj
Data-mining, indexing, quotations, meta-data, have all been extracted before.
It seems more like the degree to which Google are/want to do it, rather than
the idea to do it?

If I get the same treatment as Google before the law then doesn't this mean I
can copy any whole corpus of work, use it, recopy it, share it, make
derivative works, etc., all as long as at the end I write something new - a
music track inspired by their work, say? That appears to be what the judge is
saying when applied to other works??

------
WalterBright
> "They should have just licensed the books instead.”

The problem with orphaned works is that can't be done, as nobody knows who
owns them.

~~~
CobrastanJorji
One presumes that, as the speaker is the Authors Guild, they would probably
be happy to fix that problem by accepting the licensing fees for orphaned
works themselves.

~~~
ghaff
Furthermore, there's a fair bit of pushback by many content creators around
orphan works legislation. This has been perhaps most pronounced with
photography. There seems to be a concern that big corporations won't try very
hard to reach people who have neglected to renew copyrights and will snap up
their works.

(Not saying I agree with this POV but it's out there.)

~~~
pbhjpbhj
>people who have neglected to renew copyrights //

Renewal, indeed registration, is largely a USA thing since the Berne
Convention (adopted nearly everywhere before USA signed in 1988) did away with
the need for registration in the 19th Century.

Worth bearing in mind: the USA, as with "English" measurement, is a good step
out of line with the practice in the rest of the world. Any argument based on
someone not registering is going to need completely rethinking outside the USA.

~~~
ghaff
Fair enough. There have been proposals to make copyright terms shorter with
renewal options but those aren't inherent in orphan works legislation.

In general, I favor orphan works legislation but there is both potential for
abuse and ambiguity in how much effort will and should go into tracking down
rights holders.

------
greglindahl
If you'd like to see an example of what innovative things you can do with book
contents:

[https://blog.archive.org/2016/02/09/how-will-we-explore-
book...](https://blog.archive.org/2016/02/09/how-will-we-explore-books-in-
the-21st-century/)

[https://books.archivelab.org/dateviz/](https://books.archivelab.org/dateviz/)

~~~
pbhjpbhj
I'm not _that_ impressed; am I missing something? Looks like pre-indexed
search matched with full-text search?

~~~
greglindahl
It's an index of concepts that appear in a sentence with dates. The blog post
shows an example of how this type of index surfaces important dates for the
concept 'Gregorian Calendar'.
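
A rough sketch of the idea, assuming naive sentence splitting and a plain four-digit-year regex (the real index is surely more sophisticated; `date_index` and the sample text are invented here for illustration):

```python
import re
from collections import defaultdict

# Matches four-digit years from 1000 to 2099.
YEAR = re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b")

def date_index(text, concepts):
    """Map each concept to the set of years sharing a sentence with it."""
    index = defaultdict(set)
    # Naive split: break after ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        years = {int(y) for y in YEAR.findall(sentence)}
        if not years:
            continue
        lowered = sentence.lower()
        for concept in concepts:
            if concept.lower() in lowered:
                index[concept] |= years
    return index

text = ("The Gregorian calendar was introduced in 1582. "
        "Britain adopted the Gregorian calendar in 1752. "
        "The weather was mild that year.")
index = date_index(text, ["Gregorian calendar"])
print(sorted(index["Gregorian calendar"]))  # [1582, 1752]
```

Sentences without a date, like the third one, contribute nothing, which is what makes the important dates surface.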

------
petra
It's not terribly hard to think of good uses for Google Books[1]. It's just
that the legality was a bit murky, and what are the incentives?

One idea (among many, surely): many people prefer visual explanations. In many
subject areas, books offer better visual explanations. If, when searching for
something, Google also linked to some visual explanations from books (in the
main search), and used machine learning to find the best, that might be really
great, and might really improve the experience and value of the web.

But is it legal? It's unclear.

Can Google monetize that? Probably not.

Or am I wrong, is there some way Google can monetize that?

~~~
seanp2k2
"Google's mission is to organize the world's information and make it
universally accessible and useful."

Literally why Google exists:
[https://www.google.com/intl/en/about/](https://www.google.com/intl/en/about/)

~~~
cyberpunk
Does anyone actually believe this? I can't think of a single example of
altruistic behaviour from this company...

~~~
CobrastanJorji
[https://www.google.org/](https://www.google.org/)

That's the arm of Google whose mission is to spend $100 million per year on
charitable projects.

~~~
cyberpunk
Google made $74bn in 2015 [1]; $100m on charity is just PR. Is that a day's
revenue? An hour's? I'm glad it's happening, but I'm cynical enough to consider
that little more than an investment in company image.
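
Back of the envelope, taking the $74bn and $100m figures above at face value and assuming (simplistically) that revenue is spread evenly over 365 days:

```python
revenue_2015 = 74e9  # annual revenue figure quoted above, in dollars
charity = 100e6      # annual charitable spend quoted above, in dollars

per_day = revenue_2015 / 365  # assumes revenue accrues evenly all year
days = charity / per_day

print(f"{days:.2f} days, {days * 24:.1f} hours, "
      f"{charity / revenue_2015:.3%} of revenue")
# 0.49 days, 11.8 hours, 0.135% of revenue
```

So roughly half a day's revenue.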

Every company which makes such ridiculous amounts of money throws some of it
at a cause or other, they get far more returns from it in PR than they would
spending it another way.

Altruism? How about drop half on making the world better and not making a
marketing circus from it.

1:
[https://abc.xyz/investor/news/earnings/2015/Q4_google_earnin...](https://abc.xyz/investor/news/earnings/2015/Q4_google_earnings/index.html)

~~~
CobrastanJorji
Could you give me an example of a theoretical activity that would qualify to
you as altruism but would not be dismissible as "they get far more returns
from it in PR"?

~~~
cyberpunk
There will always be a tint of 'image bias' I guess.

I think upping the numbers to where it's clearly a loss for whoever is
donating would tip the scale more towards the 'good citizen' mark than these
token amounts.

We'll never know if that works since no one has done it, and probably never
will.

I'm not trying to call out google in particular, in fact that they even give
this amount makes them better than many. These corporate statements about
doing things "for the good of humanity" fall kind of flat when they do so
little towards it though and I'm not sure why we buy into their marketing.

If your mission statement is to advance the species, or even give universal
access to information for everyone; why would you sit on $84bn of yearly
profit instead of using it to achieve that goal?

I know, life isn't as easy as that and I understand the reality of capitalism.

It's pie in the sky stuff, and before I'm flamed to death I'm not saying it's
feasible and I'm not calling on it to happen, but just for the sake of
discussion: am I the only one who thinks the change that Apple, Google and
Facebook could make to our species if they just gave 1/2 of their bank
accounts would be significant and beneficial to everyone? Would it really make
much difference to them if they have 400bn instead of 900bn in the bank?

_shrug_ -- I don't have the answers.

------
douche
I'm sad that outfits like Project Gutenberg are stonewalled at 1923 or
whatever year Steamboat Willie came out. There are so many good books whose
authors' grandchildren are dead, that can't legally be reproduced.

~~~
waqf
Library Genesis is your friend. ([http://libgen.io](http://libgen.io) or
[http://gen.lib.rus.ec](http://gen.lib.rus.ec), among other mirrors.)

------
The_ed17
This. Relatedly, losing an easy Google News Archive was killer for some of the
research I'd like to do. Several papers/articles I wrote in c. 2010 would not
be possible to do today.

~~~
greglindahl
Common Crawl has a new news archive started a few months ago
([http://commoncrawl.org/2016/10/news-dataset-
available/](http://commoncrawl.org/2016/10/news-dataset-available/)) and the
Internet Archive has had one going for quite a while.

~~~
The_ed17
Thanks for this! I'm talking about old scanned newspapers. :-) The Internet
Archive has a good start, but it's pretty heavy on Kentucky, and few have in-
text search available, which is killer if you're researching an event with
few/no specific dates. (That's not to knock them—IA is pretty amazing, and
OCRing newspapers is notoriously difficult.)

------
sebisebi
The article alludes to it at the end: the "corpus" of scanned books is
incredibly valuable to a big data company like google and gives them a real
edge against other companies.

~~~
WalterBright
If it is incredibly valuable, other companies can do it too, and sell access
to the corpus to other companies.

~~~
_rpd
Seems a natural investment for Amazon.

~~~
greglindahl
Here's Amazon's "search inside" program, which started in 2001:
[http://newsbreaks.infotoday.com/NewsBreaks/Search-Inside-
the...](http://newsbreaks.infotoday.com/NewsBreaks/Search-Inside-the-Book-
FullText-on-Amazon-16587.asp)

------
amelius
Perhaps they can dump it as a torrent, so any hacker could build a valuable
service with it.

See e.g. SciHub.

------
ChuckMcM
I think Book Search is entirely a red herring, one of Google's many 'say one
thing, do something entirely different' kind of herrings. A great example I
know of personally was the Google 411 project.

Google 411 was a project which offered 'directory services' by voice over your
phone. It was pretty cool, you could call its 800 number, ask for a listing,
and it would use Google search to find it and then read it back to you. _Then
you got to tell it how well it did._ People started using it, adoption spread,
and then they cut it off.

So why did Google run this service in the first place? Was it to see if they
could make a business out of directory services? No, it was to collect a data
set where a spoken phrase could be matched to its exact translation (the thing
they were looking for) which could then be processed and re-processed to train
algorithms for speaker-independent speech recognition. On the one hand they
could try to pay a million people to come in and say something and then confirm
or deny they understood what they said, or they could use a bit of spare
hardware and collect that information without paying anyone a dime. They own that data
set, it is extremely valuable for testing improvements in voice recognition,
and it is yet another barrier to a new company trying to get into that space.

Now let's look at Google Books. The 'story' was digitize all the books and
make a great library available, and maybe even offer up PDF copies of out of
print books for people. It set off a legal firestorm (as mentioned in the
article), it got all of the archivists on board, libraries contributed
millions of volumes, and Google got a ruling out of the Second Circuit (which
the Supreme Court declined to review) that said that digitizing the books was
fair use.

But the subtle part is that while folks say 'everything is on the web' (and
that may be true) literally 99.99% of everything on the web is complete and
utter crap, written by people who are seeking to game advertising selection
engines _not_ digitize information. Most of the stuff in books is not crap,
because it cost someone significant time and labor to take that information
and publish it. (romance novels excluded :-)).

Google had digitized the single largest collection of human knowledge ever,
and put it in a form where hundreds of thousands of machines can process and
re-process it to derive ontologically accurate facts into the largest knowledge
base in history. If you want to test whether your algorithm can identify
credible information, there is no better way to do it than to prime it with a
ground truth which has much higher credibility than most of the accessible
data out there.

That data set exists, they own it, and they don't have any obligation to share
it with anyone. And it will allow for the creation of trained models for
differentiating fact from fiction, jest from insult, and command from comment.

It is my opinion that any company that wants to be 'serious' about AI better
have access to an equivalent data set or they will lose.

~~~
walterbell
If it is fair use to create transformative data from scanned books, could the
Internet Archive work with universities and commercial entities to create an
_open license_ corpus of data that is derived from their collection of scanned
books?

As precedent, Google did contribute Freebase to Wikidata, even though it was
the starting point for their proprietary Knowledge Graph (Facebook has a
competing graph),
[https://en.wikipedia.org/wiki/Freebase](https://en.wikipedia.org/wiki/Freebase)

~~~
ChuckMcM
Yes, and I strongly support their effort to do so.

~~~
walterbell
Thanks, didn't realize this was already far along:

[https://cloud.google.com/bigquery/public-data/gdelt-
books](https://cloud.google.com/bigquery/public-data/gdelt-books)

[http://www.gdeltproject.org](http://www.gdeltproject.org)

------
ma2rten
_And also: the dawning realization that Scanning All The Books, however
useful, might not change the world in any fundamental way._

I think this is actually the main takeaway.

~~~
douche
Remember when ReCaptcha used Google Books snippets? You may not, it's been
swapped over to Street View images for a while. But Google Books has been a
big deal for automated OCR.

------
known
I believe PageRank is irrelevant to Google Books

------
theparanoid
[deleted]

~~~
draw_down
Occam's Razor: Copyright holders prevented it from being useful

------
johansch
(Is there some protest somewhere I can join to fight this idiotic image
fade-in effect? You know, the one made universally hated by medium.com.)

~~~
analogmemory
You want to protest lazy loading images? Cool.

~~~
johansch
Yup.

