
How Badly Is Google Books Search Broken, and Why? - longdefeat
http://sappingattention.blogspot.com/2019/02/how-badly-is-google-books-search-broken.html
======
ppod
Google tried to make this work, but they were sued; and then they made a deal,
and then many, many people objected to the resulting deal. This is the usual
process whereby a corporation is first criticised for having too much power,
and then when they relinquish power they are criticised for not doing enough.

[https://www.theatlantic.com/technology/archive/2017/04/the-t...](https://www.theatlantic.com/technology/archive/2017/04/the-tragedy-of-google-books/523320/)

The end of that article is a not-so-subtle plea for someone within Google to
perhaps accidentally, anonymously place this material in public.

~~~
zeveb
It's fascinating to read that article:

> Page wanted to know how long it would take to scan more than a hundred-
> million books, so he started with one that was lying around. Using the
> metronome to keep a steady pace, he and Mayer paged through the book cover-
> to-cover. It took them 40 minutes.

> Michigan told Page that at the current pace, digitizing their entire
> collection—7 million volumes—was going to take about a thousand years. Page,
> who’d by now given the problem some thought, replied that he thought Google
> could do it in six.

> In just over a decade, after making deals with Michigan, Harvard, Stanford,
> Oxford, the New York Public Library, and dozens of other library systems,
> the company, outpacing Page’s prediction, had scanned about 25 million
> books. It cost them an estimated $400 million. It was a feat not just of
> technology but of logistics.

> At its peak, the project involved about 50 full-time software engineers.
> They developed optical character-recognition software for turning raw images
> into text; they wrote de-warping and color-correction and contrast-
> adjustment routines to make the images easier to process; they developed
> algorithms to detect illustrations and diagrams in books, to extract page
> numbers, to turn footnotes into real citations, and, per Brin and Page’s
> early research, to rank books by relevance.

Doesn't that take you back to an optimistic time when Google was exciting, and
we thought that it could do amazing amounts of good for the world? I miss that
era.

~~~
icelancer
Yeah, it does. It's also not wholly Google's fault, obviously, since ppod
clearly illustrated why. No good deed goes unsued.

------
mimixco
Here's another little-known fact: Every American is entitled to a library card
from the Library of Congress (you have to go there in person to get one).

The Library of Congress is a Hathi Trust partner! So if you go get that card,
you can download all of the out-of-print books that Google scanned on your own
computer. No copyright holders are getting paid (and no one is being harmed),
so why all these barriers in-between?

~~~
lstamour
I can confirm that the Library of Congress Reader card is all it takes to
log in. And you don’t have to be a US Citizen to get one, but you do need a
passport or other US-recognized identification to present and validate in-
person in D.C. And you have to do a bit of research on how and where to get
it, they don’t just hand them out at the front desk as other libraries might.
The Library of Congress uses the card to distinguish researchers from one-off
tourists and so while the card is easy to get, they have just enough process
in place that it’s clear it’s not a souvenir and you have to traverse a maze
of hallways to get it. (Or you had to when I did, at least...) But once you
have it, just log in online and you’ll have access to Hathi Trust here.

~~~
amyjess
Thank you and mimixco so much for pointing this out.

If I ever end up visiting DC again, I'm definitely going to do this. I have a
new bucket list item!

~~~
lstamour
Don’t forget to check out some of the amazing reading rooms while there! :)
I’d love to go back!

------
gwern
> If I worked at Google, I would have implemented a text-based date-prediction
> algorithm to flag erroneously classified books. (I have actually done this
> and sent a list to the HathiTrust of books they may have erroneously
> released into the public domain. It works).

With friends like these, who needs enemies?
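
For anyone curious what a check like the one quoted above might look like, here is a deliberately crude sketch. It is not the article author's method (the article doesn't give one); it just assumes that a book whose text mentions a year later than its catalog date is suspicious. The function names, the year range, and the `slack` parameter are all made up for illustration.

```python
import re

# Four-digit years from 1500 to 2099. This is an assumed, crude pattern:
# page numbers or quantities in that range will produce false positives.
YEAR_RE = re.compile(r"\b(1[5-9]\d{2}|20\d{2})\b")


def latest_year_mentioned(text):
    """Return the largest plausible year found in the text, or None."""
    years = [int(y) for y in YEAR_RE.findall(text)]
    return max(years) if years else None


def looks_misdated(catalog_year, text, slack=1):
    """Flag a book whose text mentions a year later than its catalog date.

    `slack` (hypothetical) tolerates small gaps such as a book printed
    late in one year and catalogued under the next.
    """
    latest = latest_year_mentioned(text)
    return latest is not None and latest > catalog_year + slack


if __name__ == "__main__":
    print(looks_misdated(1899, "Reprinted in 1974 with a new preface."))
    print(looks_misdated(1950, "The war ended in 1945."))
```

A real classifier would presumably use vocabulary and spelling as well, since plenty of books never mention a year at all.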

------
ChuckMcM
It isn't surprising, this bit is sad:

 _Google Books has failed to live up to its promise as the company has moved
away from its original mission of organizing information for people._

Google was only about organizing all the world's information while search ads
was an unlimited fountain of money. As Google's ability to generate money with
search ads has dwindled, their more grand (and not monetizable) projects have
been either starved for resources or outright killed.

Sure the lawsuit was a pain. And book publishers are turds for arguing that
they still have rights to books that they won't publish ever again. But the
courts found that there was nothing wrong with Google having the
information[1]. That trove of text could be the world's greatest source of
knowledge but as we all know, people using internet search for work _never_
click on ads and not enough of them are willing to pay a subscription service
price to cover the cost of infrastructure. Google hoped that at one time they
would make money by printing on demand those books that were out of print but
people wanted, but that was shot down by short-sighted publishers and agents.
Perhaps it will be taken up by Amazon which has the resources to do it.

[1] [https://arstechnica.com/tech-policy/2015/10/appeals-court-ru...](https://arstechnica.com/tech-policy/2015/10/appeals-court-rules-that-google-book-scanning-is-fair-use/)

~~~
porpoisely
It is sad. But it's not just google books, it's google search, google news,
youtube, etc.

There was a time when all of google's properties catered to the users. Their
search engine was the best. Google news was the best aggregate site. Youtube
recommends used to be amazing to the point you could spend hours following
their recommends.

Now google search, google news, youtube, etc are all garbage. It doesn't serve
the people. It serves corporate interests. You can thank media companies and
the elites who pressured them for that.

~~~
ChuckMcM
No need to cast aspersions on 'elites' and 'corporate interests', what is
pressuring them is that the ratio of the amount of money coming in to that
going out, has to be maintained at a certain level for Google to remain
Google. They really have only two choices there, either sell more ads
(generally means putting ads on more things, or coming up with new ways to
charge for new things like being in the 'shopping' box on product searches) or
cut costs, which means shutting down projects, reducing staff, etc. Depending
on how you look at it, Google gets something like one 500th of what they used
to get for an ad on their web site.

~~~
porpoisely
Am I casting aspersions or just stating facts? Google changed because of pressure
from the elites and corporate interests who used the media to badger them. It
certainly isn't in google's interest to make their product worse purely to
benefit others.

~~~
ChuckMcM
> _Am I casting aspersions or just stating facts?_

Your previous comment was attributing without evidence, actions of malice by
descriptive but undefined third parties. That is the definition of "casting
aspersions."

"Stating facts" would start with something like, "See this evidence that
Google's policies were changed by <corporate entity> or <person or persons>."

Since you are doing the former, and not the latter, I conclude that the answer
to your question is that yes, you are casting aspersions.

~~~
porpoisely
I'd assume you'd already know that media and elite pressure is why google
changed since most people here work in the tech industry. Are you new to HN or
do you work in a non-tech industry?

This reporter claims that she got youtube to change its search list.

[https://twitter.com/aprilaser/status/1076215375732174848](https://twitter.com/aprilaser/status/1076215375732174848)

Should we believe her or is she lying?

"Google follows Facebook's lead and removes 39 YouTube channels linked to
Iran"

[https://finance.yahoo.com/news/google-follows-facebook-apos-...](https://finance.yahoo.com/news/google-follows-facebook-apos-lead-190836833.html)

These channels had been up for many years. Why do you think all of a sudden
google decided to remove them?

Certainly it wasn't corporate, media or elite's pressure. So then who? Aliens?
When Chinese or Russian social media companies remove and change their
policies, why do you think that is? Aliens as well?

After 10 years of spectacular success of youtube being "you"tube, why did it
suddenly become "corporate"tube? Why did they change their recommends,
trending, etc? Must be aliens. It can't possibly be the elites and the media
constantly attacking it?

"Facebook and Google are doomed, George Soros says"

[https://www.washingtonpost.com/news/the-switch/wp/2018/01/26...](https://www.washingtonpost.com/news/the-switch/wp/2018/01/26/facebook-and-google-are-doomed-george-soros-says)

------
philipkglass
If you're using HathiTrust seriously and aren't affiliated with a partner
library, consider Hathi Download Helper to get complete public domain books
archived to local storage. I wrote an earlier command-line version of the
tool. Someone else built a GUI and put in the work to keep up with the
evolving API.

I often use Google to locate a book, then check Internet Archive and
HathiTrust if it's old enough that it should be public domain under US law. I
really appreciate HathiTrust putting in the effort to check copyright renewals
and make more of their materials fully visible. I _don't_ appreciate the
technical barriers to downloads that they erect, but that's out of the hands
of the developers working there. As long as their web viewer shows individual
pages you can be sure there will be a way to reassemble full books.

~~~
mimixco
Thanks for that tip and for writing the software! I had to play that game of
manually assembling PDF pages for an old magazine article lately.

------
babalulu
Google Books has issues of Billboard magazine dating back to 1942. It used to
be valuable for research, but it's become much less so over the years.
Currently, search results that return actual magazine issues are limited to
the first page. After that, it's just normal Google links. Even searching for
something like Elvis or Glenn Miller, both of whom should have been in a
crapload of issues, returns only one page of relevant results.

Trying to search by date is very hard. Limiting search to "Glenn Miller
October 1942" might return one or two relevant results, or it might not return
any. Trying to search by issue date doesn't work at all.

They have an index of Billboard issues which allows you to go to individual
issues and read them, but the index stops at 50 pages, and for a weekly
magazine, that limits the index to only a handful of years. Using the index,
you can't go directly to issues before the 1980s, and with search by issue
date useless, that means you're just out of luck if you want to see a
particular issue in the 1970s.

~~~
rasz
They do seem to be crippling book search on purpose. Just yesterday I was
looking for "PC Mag 1997 january Pentium MMX" and Google refused to return PC
Mag 7 Jan 1997 issue results. What's even weirder, clicking "browse all
issues" returns

    
    
    The requested URL /books/serial/ISSN:08888507?rview=1 was not found on this server

but "About this magazine" will happily give you a list of all scanned issues
:o and opening the January one will let you search it and will return positive
results.

------
EliRivers
I completed a Masters in Mathematics back in 2014, involving a lot of
historical research into the development of geometry. At the time, the ease
with which I could open up books written over a century ago, search them and
read them, was a fantastically useful tool.

I have some of those sources still on my hard drive in their scanned PDF
format. They've now effectively vanished from the open internet. So much,
available for such a short window. Our children will never believe us when we
tell them what was once right there at our fingertips, and those that do
should never forgive us.

~~~
toomuchtodo
Consider uploading them to the Internet Archive.

------
greglindahl
Here's an example of an exploration tool built using the content of books:
[https://books.archivelab.org/dateviz/](https://books.archivelab.org/dateviz/)

It would be nice if anyone could build such tools, but all of that data is
locked up inside of places like Google Books and Hathi Trust. Google isn't
even interested in making their metadata available, other than by running
searches.

------
oasisbob
This breakage reminds me of similar constrained-search problems with the
Google Groups/ Dejanews USENET archive. Once upon a time it was nice to
research with.

~~~
darkpuma
It reminds me of the demise of google's code search and github's woefully
inadequate code search. Debian's Code Search is okay, but github allowing it
would be great.

------
mimixco
Here's an easy example... Google Books has many old issues of Popular
Mechanics which exist in full downloadable form on the Internet Archive -- yet
you can only see individual pages on Google Books and can't download the
magazine.

This is because Google Books is acting like _they_ own the copyright (or, at
least, they feel the need to police it.)

There are many cases where you can download the entire book from Hathi Trust
when you are sitting at a university library, giving you a PDF you can use
anywhere. But you cannot even see the entire book or download it from Google
Books (which has given its scan to Hathi). This is just stupid.

------
drallison
Google Books seems to have major problems. An alternative interested parties
should explore is Archive.org. The Internet Archive has a significant
collection of scanned books and other materials.

------
jetrink
Searching within books is broken too. If you search for 'cat', any instance of
the word 'cats' will also be returned (and any other word beginning with
'cat'), but with the message, "No preview is available." No link to the page
of the result is provided. It seems like the kind of bug that should be
straightforward to locate and fix, but it has been this way for years. (My
guess is that the tool that builds previews doesn't recognize partial
matches.)
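
To make that guess concrete: this is only a sketch of the mismatch I'm imagining, not anything from Google's code. It assumes the search side does prefix matching while the preview side only looks for the exact word; `prefix_matches` and `exact_matches` are names I made up.

```python
import re


def prefix_matches(query, text):
    """Words starting with the query (how the search seems to behave)."""
    pattern = r"\b" + re.escape(query) + r"\w*"
    return re.findall(pattern, text, flags=re.IGNORECASE)


def exact_matches(query, text):
    """Whole-word matches only (what a preview builder might look for)."""
    pattern = r"\b" + re.escape(query) + r"\b"
    return re.findall(pattern, text, flags=re.IGNORECASE)


if __name__ == "__main__":
    page = "The cat sat near two cats and a catalog."
    print(prefix_matches("cat", page))  # hits the search would report
    print(exact_matches("cat", page))   # hits a preview could be built for
```

Any hit that shows up in the first list but not the second would be a result with nothing to preview, which is what the "No preview is available" message looks like from the outside.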

~~~
mimixco
It's not a bug. It's on purpose. They actively limit previews, full page
views, and downloads -- even though you can go to a university and download
the same book from Hathi Trust.

Hathi Trust isn't paying the copyright holders, either, so who cares?

------
phonon
Using
[https://babel.hathitrust.org/cgi/ls?a=srchls&anyall1=all&q1=...](https://babel.hathitrust.org/cgi/ls?a=srchls&anyall1=all&q1=%22set+in++stone%22&field1=ocr&op3=AND&yop=before&pdate_end=2002&facet_lang=&facet_format=)

I get 14,515 results, with 3,115 of them full view.

There is also
[https://analytics.hathitrust.org/](https://analytics.hathitrust.org/)

which seems interesting!

~~~
mimixco
And _all_ of them are full view if you are physically sitting in a Hathi Trust
"partner" university library. These libraries are open to the public and allow
downloading and saving of the materials you browse, making the whole point of
locking them up completely pointless.

~~~
elektor
Is it then possible for an Aaron Swartz/Sci-Hub character to download
all of the available books and make them available on the internet?

~~~
sodosopa
yes and that’s theft.

------
ccleve
I wonder if the date problem is a bug, not a feature?

Is it possible to dump the metadata of a book and check if they have the right
date? There should probably be multiple dates for a book -- date written, date
copyrighted, date published, date of latest edition, etc.

My guess is that Google does not have a publicly-available issue tracker for
Google Books so you can't easily report this problem. Hacker News is a good
way to get their attention, though...

------
mimixco
Based on my experience with research using Google Books and Hathi Trust, this
author is correct. Google has purposely broken Google Books so that it doesn't
compete with any paid sources for these materials -- even if there aren't any
paid sources and the book is out of print!

The TL;DR is that Google Books started out with the goal of digitizing every
book ever written. Publishers sued, so they crippled their search and display
functions and handed over the full texts they already had to a group called
Hathi Trust.

Hathi Trust is seriously crippled on purpose. It only allows access to full
texts when you are _sitting in a physical library_ of one of its partner
universities. That's right... I can drive to a big university, sit in their
library, and download a full PDF of any book I like. But if I'm at my house, I
can only read one page at a time in a browser. This is ridiculous. Hathi Trust
is helping the oil business more than they're helping researchers.

The marriage of Google Books content and Hathi Trust as a distribution
platform is a joke. In some cases, you will even have to order an old book
from interlibrary loan (see worldcat.org) if you can even get it -- when all
the while Google has a scanned copy!

~~~
kevin_thibedeau
If you were an author of a book with active copyright you might want to get
paid for it.

~~~
Spooky23
How?

My grandfather wrote a book in the 1940s that’s been out of print since the
mid 50s. Every entity associated with the book is dead, including the
publisher, which merged into another in the late 50s and is probably an
inactive imprint of some successor company. Grandpa died in 1985, and a cousin
or I is likely the heir to his rights.

I have a copy of the book, which I bought via Alibris from a bookstore in
Wales 15 years ago. If you needed the book for research, you’d probably get it
via inter-library loan from a university or a big city library. Whoever the
publisher is, they don’t have it and aren’t selling it. In no scenario does
anyone get paid for transacting, other than a reseller or the post office.

~~~
dwyerm
Stay with me for a second; I'm going to go wildly afield... does intellectual
property need an Adverse Possession law, too? [1]

Your grandfather had a book, the rights to which should presumably have been
passed down to you. My grandfather had a patented mining claim that has been
passed down to me. Where the ownership of your grandfather's book is
questionable, for me the physical corners of the property are questionable. A
number of them are defined by things like a "4 foot spruce post" or a "12 inch
diameter tree trunk" that haven't survived the ravages of time.

But it is important that I patrol my property at least every couple of years
because of Adverse Possession. If someone else were to use my property
continuously and I don't say anything against it, one day their trespassing
suddenly and magically would become ownership. For a land-owner, it is a scary
idea that someone can steal my property from me, as has actually happened. [2]

But I can acknowledge that it makes some sense. It comes from the idea that
land is meant to be used, and if you aren't using it, maybe the person who
_is_ using it should get the rights.

If nobody can stand up for an intellectual property claim, perhaps some kind
of adverse possession is in order.

[1]
[https://en.wikipedia.org/wiki/Adverse_possession](https://en.wikipedia.org/wiki/Adverse_possession)
[2] [http://articles.latimes.com/2007/dec/03/nation/na-land3](http://articles.latimes.com/2007/dec/03/nation/na-land3)

~~~
mikez0r
Counter-argument: Adverse possession is justified by the scarcity of real
property. Which does not apply to IP.

- Real property (i.e. land) is a scarce and limited resource. If a party is
making productive use of the land, they should hold title. (There is only so
much arable land. If someone raises crops, let them.)

- Intellectual property (particularly copyright) is not a scarce or limited
resource. (Create your own copyrightable work if you wish to own the rights.)

~~~
mikez0r
Of course, adverse possession presumes that land _ought_ to be made "useful."
(Haven't thought yet about how these critiques of real property theory map to
IP.)

Much of real property theory arose from the assumption that the government
should recognize and encourage the "highest and best use" of real property.
Traditionally, the highest and best use of land is the use that can most
profit from the land's resources; often mining, grazing, farming, logging.

This is problematic.

- This view justifies colonization, and taking land from original inhabitants
who don't use the land to extract resource value.

- This view does not recognize preservation of an ecosystem as a valuable
use.

- This view does not account for externalities from use of the land's
resources.

~~~
robkop
This is a very interesting argument.

I think it's worth noting that you can calculate an estimate of the
externalities and remove that from the profit to achieve a more balanced
justification. Though unfortunately, unless you counted the loss of culture as
an externality then you could still trivially justify the removal of land from
those less productive/ technologically advanced than you.

Furthermore, even though I'm not personally supportive of the removal of land
at the individual's loss I do have to ask if the removal could account for a
net gain overall; improving many people's lives. Perhaps profit isn't the best
measure of improvement to the collective but it is at least indicative.

------
amelius
What would be a scientific approach to compare search results? Let's say I do
a search on DDG and on Google, how do I determine which engine provided the
most accurate results?

~~~
brownbat
I think if you have a population of people doing the same search, split
randomly across the two sites, things like how long it takes to leave your
site through a search result, how far down that result is, and how often
people come back to rephrase their search are all good metrics.

Not sure there's an answer for a single search for a single individual though.
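
A rough sketch of how those three metrics could be tallied per engine, assuming you had one record per randomly assigned search session. The `Session` fields and `summarize` are hypothetical, and this only computes averages; deciding whether a difference is meaningful would still need a proper significance test.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class Session:
    engine: str                        # engine the user was randomly assigned to
    seconds_to_click: Optional[float]  # None if no result was clicked
    click_rank: Optional[int]          # 1-based position of the clicked result
    rephrased: bool                    # user came back and rephrased the query


def summarize(sessions, engine):
    """Average the three metrics over one engine's sessions."""
    own = [s for s in sessions if s.engine == engine]
    # Only sessions with a recorded click contribute to the click metrics.
    clicked = [
        s for s in own
        if s.click_rank is not None and s.seconds_to_click is not None
    ]
    return {
        "sessions": len(own),
        "mean_seconds_to_click":
            mean([s.seconds_to_click for s in clicked]) if clicked else None,
        "mean_click_rank":
            mean([s.click_rank for s in clicked]) if clicked else None,
        "rephrase_rate":
            sum(s.rephrased for s in own) / len(own) if own else None,
    }


if __name__ == "__main__":
    log = [
        Session("A", 4.0, 1, False),
        Session("A", 8.0, 3, True),
        Session("B", None, None, True),
    ]
    for name in ("A", "B"):
        print(name, summarize(log, name))
```

Lower time-to-click, lower click rank, and a lower rephrase rate would all point toward the better engine, at least for that population and that query.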

------
nottorp
I thought google search (without "books") is broken... why is the books search
being broken a surprise?

