
What happened to Google's effort to scan university library books? (2017) - RyanShook
https://www.edsurge.com/news/2017-08-10-what-happened-to-google-s-effort-to-scan-millions-of-university-library-books
======
GlenTheMachine
FWIW: you have to be affiliated with an academic institution to get access to
most full texts on Hathi. However, one such academic institution is the
Library of Congress. I would think anyone reading Hacker News in the US would
qualify as a researcher at the Library of Congress; the catch (admittedly a
big one) is that you have to present yourself in person to get a library card
("reader registration card" in LoC parlance) -- meaning that unless you live
in the immediate DC area you have to travel there. And they're only valid for
two years.

Link to how to get a library card at the LoC:

[https://www.loc.gov/rr/readerregistration.html](https://www.loc.gov/rr/readerregistration.html)

~~~
TheAceOfHearts
I've always wanted to visit the Library of Congress and now I have even more
of a reason to do so. Any other advice or suggestions for my visit?

~~~
joshklein
It may be the only way to find some of the things preserved via citizen
archives. I enjoyed the talk Ian MacKaye gave there about archiving the live
performances of Minor Threat & Fugazi. Here [1] is a post on the subject.

[1] [https://blogs.loc.gov/thesignal/2013/05/ian-mackaye-and-citi...](https://blogs.loc.gov/thesignal/2013/05/ian-mackaye-and-citizen-archiving/)

~~~
Sherl
Open only to citizens?

~~~
joshklein
Hmm, that seems unlikely but I honestly don’t know. Possibly free for US
citizens but a small fee for visitors? I can’t imagine it would be closed to
visitors, though you should be able to contact them to find out.

The talk I so enjoyed is on YouTube with many others:
[https://youtu.be/AvqtY_7Q7hI?list=PLEA69BE43AA9F7E68](https://youtu.be/AvqtY_7Q7hI?list=PLEA69BE43AA9F7E68)
(video 30 is MacKaye)

------
mcguire
" _A Data Capsule is a secure, virtual computer that allows what’s known as
“non-consumptive” research, meaning that a scholar can do computational
analysis of texts without downloading or reading them. The process respects
copyright while enabling work based on copyrighted materials._ "

And that is completely ridiculous; a technical solution to an invented
problem.

~~~
claudeganon
> And that is completely ridiculous; a technical solution to an invented
> problem.

Congratulations, you’ve just described most software projects by libraries and
archives.

It’s funny, because back in the 60s-80s, libraries were leaders in building
shared data systems and networked infrastructure. The history of OCLC
describes this well.

But once the web came around, they had an identity crisis, were unable to
react to technology trends, and largely got conned into predatory and
restrictive arrangements by service providers (Elsevier, ProQuest, etc). The
same thing actually happened in the past with microfiche and led to libraries
destroying huge, valuable portions of their collections, that could have been
better preserved with the advent of scanning technologies.

~~~
orpheansodality
Most librarians would agree with you that the current situation is terrible,
and if you look at library literature they've generally felt that way from the
beginning. They are highly constrained by copyright law.

re: microfiche, it lasts much longer than digital scans -- if kept in good
conditions it has a usable half-life on the scale of >100 years, while digital
files need much more active maintenance to both prevent bit-rot and to ensure
the file-type is still readable (eg countless file formats have been abandoned
and are only currently accessible via emulators of older machines).

And if you want to bring the cloud into this, most libraries don't have the
funding to bring in the technical know-how to manage a private S3 instance
only accessible from that building.

~~~
WalterBright
> prevent bit-rot

What I do is regularly copy my files forward onto newer media. I started this
back in the 1970s, and it is the only reason I still have a copy of the
FORTRAN-10 source code of Empire:

[https://github.com/DigitalMars/Empire-for-PDP-10](https://github.com/DigitalMars/Empire-for-PDP-10)

All the other stuff I wrote at the time is lost because I stored it on a
magtape, and the Caltech magtape drive had drifted so far out of spec the
tapes could only be read on that machine, which was lost.

I managed to preserve that by copying it over a serial line to a PDP-11 and
storing it on a PDP-11 floppy. I later was able to save my 11 code by copying
it over a much later serial line to an IBM PC, to put on 5.25 disks. As time
went by, the files migrated to zip drives, then CD-ROMs, then a long sequence
of hard drives (my older hard drives can't be read with modern IDE interfaces,
even if the connector fits, I have no idea why).

I remember reading boxes of 5.25 floppies and burning them onto CD-ROMs, a
long and tedious process. Nothing will read 5.25 floppies any more, whereas
copying a year-old hard drive to a new one is a simple process, especially
since the new drives are usually much larger than the older ones.

Hence I have most of the stuff I worked on since the early 1980s. The old
Zortech bulletin board stuff is gone, though, even though I still have the
hard drive for it. Nothing can read that old drive. Not that there's anything
particularly interesting on it, but I enjoyed running the BBS for many years.

------
at_a_remove
Speaking from experience, the Hathi metadata is a hot mess. If I had to guess,
I would say that it was largely sourced from the contributing institutions'
catalogs without a great deal of regularization.

There are levels of dirtiness in data. Hathi's metadata reached a level of
dirtiness where the filthier fields could actually be mined for _other_ data,
given enough labor. While I was able to extract some benefits from this at
some cost, it was no substitute for having things done well.

Many of the scans are described as "unusable" by the faculty and having looked
at many of their complaints I would agree.

~~~
claudeganon
I worked on this data in a former life and it is indeed a mess and derived
from institutional records. But it’s not any more a mess than most other forms
of library-derived metadata because MARC and related formats are essentially
pseudo-schemas that are _at best_ haphazardly applied.

From what I recall, there were attempts at regularization by the big
universities involved with the project, but they’re both not very good at tech
projects and especially not good at finding and keeping the high-level talent
that would be needed for something of this scale.

But if you think bikeshedding is bad in tech, take a look into some disputes
between library cataloguers over their arcana. Much of what they produce is
valuable at the level of a trained cataloguer who can navigate its inconsistencies
and vagaries, but rarely so at the machine level.

------
rwhaling
I used to work with HathiTrust and attended a few workshops; my experience was
that the notion of a "data capsule" that ensures "non-consumptive use" is
quite onerous, in practice, and that getting that access can require prior
approval and supervision of research activities:

[https://wiki.htrc.illinois.edu/display/COM/HTRC+Data+Capsule...](https://wiki.htrc.illinois.edu/display/COM/HTRC+Data+Capsule+Environment)

I think it's actually a clever technical solution to a hideous legal problem,
and it makes a lot of research possible that would be totally illegal
otherwise, but it 100% gets in the way of totally legitimate research as well.

~~~
mcguire
Does it make you feel any better that most of the texts over a decade old are
permanently out of print and thus valueless, and that the hideous legal
problem only applies to a tiny fraction of the works in question?

~~~
marvindanig
Not to mention how much of the Amazon rainforest has been/is being razed to
the ground to make way for pulp farming for the dead trees. It's both scary
and stupid.

~~~
ars
Are you being sarcastic? Amazon rainforest wood is not used for paper, the
trees are cut down for farmland, not for the wood.

There are far better sources of wood, but land is more scarce.

~~~
marvindanig
Who said anything about using old trees for book pulp? Pulp farming requires
arable land which is taken away from rainforests.

------
yaroslavvb
I was an engineer in Google Books in 2011 and got the impression that this
effort turned out to be not as useful as expected. This took away energy from
the legal fight. Having access to every printed book seemed amazing in
principle, but it didn't translate to amazing demand for the service. My
theory is that it didn't lower the barriers to knowledge much -- there's not
that much useful information left in offline-only storage.

~~~
mnl
I think there's a lot of information sleeping in depositories only on paper.
The thing is people are busy with the urgent stuff and stay away from the
important if they can't get something immediate from it.

I've used Google Books extensively for research on 17th-century mathematics
(not just me, but also the whole department where I was at that point). I couldn't
have afforded to visit all the libraries to look up the originals but thanks
to this repository I could download several copies of obscure works, which
gives you very interesting insights: sometimes you hit a personal copy of a
previous researcher, or there's an ex-libris from an old institution that you
know your author was connected to, or you just discover something no one had
noticed before simply with a clever string search. You can even backtrack to
the forgotten true primary source of a mistakenly repeated fact for instance.
It's been a blessing, honestly.

If you are into any little-travelled alley of history, you soon realise
that your (former?) company has produced a treasure trove for future
generations that just shouldn't be shut down ever: there are many books there
not digitized anywhere else. So thank you very, very much for your work there.
It definitely belongs to the important.

------
oefrha
As an academic library user, it would be especially helpful if one could read
obscure/old titles stored offsite in digital form; currently one has to have
them shipped to campus, pick them up, and maybe realize they don’t actually
contain the needed info, a waste of energy for multiple parties.

~~~
jdshaffer
I find archive.org fairly useful for a lot of older books. If they are past
the copyright date, then many books can be found and downloaded as PDFs. I've
found lots of old Language & Literature books from 80-140 years ago. :-)

------
Metacelsus
See also:
[https://www.theatlantic.com/technology/archive/2017/04/the-t...](https://www.theatlantic.com/technology/archive/2017/04/the-tragedy-of-google-books/523320/)

------
Anon84
Another story, with a lot more detail:
[https://www.wired.com/2017/04/how-google-book-search-got-los...](https://www.wired.com/2017/04/how-google-book-search-got-lost/)

------
musicale
> “Somewhere at Google there is a database containing 25 million books and
> nobody is allowed to read them.”

That is the key result: the greatest library in human history is locked shut,
primarily because it was perceived as a threat to the legacy business model of
paper book publishing (and royalties based on sales of paper books.)

Restricting libraries to "non-consumptive use" betrays the fundamental purpose
of a library: to enable people to read the books in the collection; it makes
little sense with physical books and even less with digital texts which don't
wear out.

------
fierarul
I had heard of the HathiTrust Research Center but didn't know almost all the books
were scanned by Google.

Google was nice to want to share. But I think the original and ongoing concern
was improved image recognition and NLP, i.e. machine learning on the whole
world's knowledge.

~~~
extra88
> Google was nice to want to share.

Niceness had nothing to do with it. From the article, "As part of the deal,
Google’s partner libraries made sure they got to keep digital copies of their
scanned works for research and preservation use."

------
lawrenceyan
It sucks that we could have basically had a virtual Library of Alexandria
thanks to Google, but because of book publishers lobbying in order to try and
conserve their monopolies, it's now locked away from the general public.

------
selfishgene
Libgen happened.

------
softfalcon
What happened?

tldr; lawyers...

~~~
readhn
like it was a surprise

