Hacker News
What happened to Google's effort to scan university library books? (2017) (edsurge.com)
204 points by RyanShook 9 days ago | 46 comments

FWIW: you have to be affiliated with an academic institution to get access to most full texts on Hathi. However, one such academic institution is the Library of Congress. I would think anyone reading Hacker News in the US would qualify as a researcher at the Library of Congress; the catch (admittedly a big one) is that you have to present yourself in person to get a library card ("reader registration card" in LoC parlance) -- meaning that unless you live in the immediate DC area you have to travel there. And they're only valid for two years.

Link to how to get a library card at the LoC:


I've always wanted to visit the Library of Congress and now I have even more of a reason to do so. Any other advice or suggestions for my visit?

It’s a closed-stack library and their catalog is online. Figure out the items you want to start with before you arrive so that you can put in the order; it takes a little while for the staff to fetch it from the stacks and deliver it to the reading room.

Also, pick your reading room carefully: the general collection can be delivered anywhere, but there are some items that you can only get in a specific one.

It may be the only way to find some of the things preserved via citizen archiving. I enjoyed the talk Ian MacKaye gave there about archiving the live performances of Minor Threat & Fugazi. Here [1] is a post on the subject.

[1] https://blogs.loc.gov/thesignal/2013/05/ian-mackaye-and-citi...

Open only to citizens?

Hmm, that seems unlikely but I honestly don’t know. Possibly free for US citizens but a small fee for visitors? I can’t imagine it would be closed to visitors, though you should be able to contact them to find out.

The talk I so enjoyed is on YouTube with many others: https://youtu.be/AvqtY_7Q7hI?list=PLEA69BE43AA9F7E68 (video 30 is MacKaye)

Be sure to visit the Jefferson Building; it's a "temple to learning" with wonderful paintings and tiles.

As I recall the Library of Congress also has some digital subscriptions on-site, though not Elsevier. One subscription the Library of Congress has that Libgen and Scihub don't have is ProQuest, so you can download theses and dissertations from ProQuest there. I'm not sure that you even need to be registered to use the building wifi, actually, though I'd recommend that too.

Another way to access Hathi is by physically visiting an affiliated university campus and doing it from their computers. Most university libraries allow the public this access.

As you seem to know stuff about the LoC: I've read that publishers have to send copies of each newly published book in the US to the LoC. However, I've also read that they don't keep all of the copies. Do you know what happens with the ones they discard? How many are discarded, and what are the criteria?


"Each working day the Library receives some 15,000 items and adds more than 10,000 items to its collections. Materials are acquired as Copyright deposits and through gift, purchase, other government agencies (state, local and federal), Cataloging in Publication (a pre-publication arrangement with publishers) and exchange with libraries in the United States and abroad. Items not selected for the collections or other internal purposes are used in the Library’s national and international exchange programs. Through these exchanges the Library acquires material that would not be available otherwise. The remaining items are made available to other federal agencies and are then available for donation to educational institutions, public bodies and nonprofit tax-exempt organizations in the United States."

I swung by the LoC on a biz trip to DC and obtaining the reader registration card was a quick and painless process; highly recommend to anyone on this forum who finds themselves in the area and can spare an hour.

Realizing that Congresspeople can check out books from the Library was the only thing that ever made me want to run for Congress.

I wish I had known that when I last visited DC. I'll make sure to get a registration card on my next visit.

Can non americans get one?

I would guess so but the website doesn’t specify.

I will find out someday; it's on my bucket list now.

"A Data Capsule is a secure, virtual computer that allows what’s known as “non-consumptive” research, meaning that a scholar can do computational analysis of texts without downloading or reading them. The process respects copyright while enabling work based on copyrighted materials."

And that is completely ridiculous; a technical solution to an invented problem.

There's an easier solution, we call it "my friend Ivan".

I have a friend, Ivan, who lives somewhere in the world - he's notoriously reticent about his location, and only communicates over OTR at strange times. Whenever I want to get some research done, I ask him if he happens to know this or that fact about this or that copyrighted database. I then cite him as a source if anyone has any questions.

> And that is completely ridiculous; a technical solution to an invented problem.

Congratulations, you’ve just described most software projects by libraries and archives.

It’s funny, because back in the 60s-80s, libraries were leaders in building shared data systems and networked infrastructure. The history of OCLC describes this well.

But once the web came around, they had an identity crisis, were unable to react to technology trends, and largely got conned into predatory and restrictive arrangements by service providers (Elsevier, ProQuest, etc). The same thing actually happened in the past with microfiche, which led to libraries destroying huge, valuable portions of their collections that could have been better preserved with the advent of scanning technologies.

Most librarians would agree with you that the current situation is terrible, and if you look at library literature they've generally felt that way from the beginning. They are highly constrained by copyright law.

re: microfiche, it lasts much longer than digital scans -- if kept in good conditions it has a usable half-life on the scale of >100 years, while digital files need much more active maintenance to both prevent bit-rot and to ensure the file-type is still readable (eg countless file formats have been abandoned and are only currently accessible via emulators of older machines).

And if you want to bring the cloud into this, most libraries don't have the funding to bring in the technical know-how to manage a private S3 instance only accessible from that building.

> prevent bit-rot

What I do is regularly copy my files forward onto newer media. I started this back in the 1970s, and it is the only reason I still have a copy of the FORTRAN-10 source code of Empire:


All the other stuff I wrote at the time is lost because I stored it on a magtape, and the Caltech magtape drive had drifted so far out of spec the tapes could only be read on that machine which was lost.

I managed to preserve that by copying it over a serial line to a PDP-11 and storing it on a PDP-11 floppy. I later was able to save my 11 code by copying it over a much later serial line to an IBM PC, to put on 5.25 disks. As time went by, the files migrated to zip drives, then CD-ROMs, then a long sequence of hard drives (my older hard drives can't be read with modern IDE interfaces, even if the connector fits; I have no idea why).

I remember reading boxes of 5.25 floppies and burning them onto CD-ROMs, a long and tedious process. Now nothing will read 5.25 floppies any more, but copying a year-old hard drive to a new one is simple, especially since the new drives are usually much larger than the old ones.

Hence I have most of the stuff I worked on since the early 1980s. The old Zortech bulletin board stuff is gone, though, even though I still have the hard drive for it. Nothing can read that old drive. Not that there's anything particularly interesting on it, but I enjoyed running the BBS for many years.
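The copy-forward strategy described above can be made safer by carrying a checksum manifest alongside the files, so each migration can be verified before the old medium is retired. A minimal sketch (directory paths are up to you):

```python
import hashlib
from pathlib import Path

def manifest(root):
    """Map each file's relative path under root to its SHA-256 digest."""
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in Path(root).rglob("*")
        if p.is_file()
    }

def verify_copy(src_root, dst_root):
    """After copying src to dst, report files that went missing
    and files whose contents no longer match (bit-rot or bad copy)."""
    src, dst = manifest(src_root), manifest(dst_root)
    missing = src.keys() - dst.keys()
    corrupt = {p for p in src.keys() & dst.keys() if src[p] != dst[p]}
    return missing, corrupt
```

Running `verify_copy` after every migration catches silent corruption at copy time, which is exactly when you still have a good source to re-copy from.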

> Most librarians would agree with you that the current situation is terrible, and if you look at library literature they've generally felt that way from the beginning.

I know! I’m a former librarian who has worked with a lot of “big players” on the institutional and software side of things.

> They are highly constrained by copyright law.

Here’s where I disagree with you: they’re largely constrained by the kind of administrative bloat that has permeated all of academia, which has no technical expertise and prefers corporate solutions or managing large-scale, dead-end projects for resume padding. I was on the receiving end of this so many times I left the field.

> re: microfiche, it lasts much longer than digital scans -- if kept in good conditions it has a usable half-life on the scale of >100 years, while digital files need much more active maintenance to both prevent bit-rot and to ensure the file-type is still readable (eg countless file formats have been abandoned and are only currently accessible via emulators of older machines).

True on the digital files part, not so much on the microfiche/film, which proved in many cases to be of poor durability and prone to data loss. But my comment was more about how its adoption caused libraries to destroy huge parts of their collections, with little recourse once microfiche/film proved not to live up to its marketing claims or in cases where it was poorly implemented. I recommend Nicholson Baker's "Double Fold" for a good account of all of this.

Copies of microfiche are lossy though, right? It may last longer between copies, but eventually you'll need to copy it.

It is discouraging that so much effort has been spent in a futile effort to impose the limitations of physical media onto digital information.

It's as if we passed laws to require that all email must be delayed for at least two days in order to preserve the business model of the post office.

Speaking from experience, the Hathi metadata is a hot mess. If I had to guess, I would say that it was largely sourced from whatever the partner institutions had in their catalogs, without a great deal of regularization.

There are levels of dirtiness in data. Hathi's metadata reached a level of dirtiness where the filthier fields could actually be mined for other data, given enough labor. While I was able to extract some benefits from this at some cost, it was no substitute for having things done well.

Many of the scans are described as "unusable" by the faculty and having looked at many of their complaints I would agree.

I worked on this data in a former life and it is indeed a mess and derived from institutional records. But it’s not any more a mess than most other forms of library-derived metadata because MARC and related formats are essentially pseudo-schemas that are at best haphazardly applied.

From what I recall, there were attempts at regularization by the big universities involved with the project, but they're not very good at tech projects and especially not good at finding and keeping the high-level talent that would be needed for something of this scale.

But if you think bikeshedding is bad in tech, take a look into some disputes between library cataloguers over their arcana. Much of what they produce is valuable at the level of a trained cataloguer who can navigate its inconsistencies and vagaries, but rarely so at the machine level.
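As a small illustration of the kind of regularization this metadata needs (assuming messy catalog date strings like those found in MARC 260$c fields; this is not Hathi's actual pipeline):

```python
import re

def normalize_pub_year(raw):
    """Pull a plausible four-digit publication year out of a messy
    catalog date string; return None when nothing usable is found."""
    if raw is None:
        return None
    # Accepts forms like "c1897", "[1923?]", "1964, c1959" (first year wins);
    # partially-known years like "18--" are left unresolved.
    m = re.search(r"(?<!\d)(1[5-9]\d{2}|20[0-2]\d)(?!\d)", raw)
    return int(m.group(1)) if m else None

for raw in ["c1897.", "[1923?]", "18--", "1964, c1959"]:
    print(raw, "->", normalize_pub_year(raw))
# c1897. -> 1897
# [1923?] -> 1923
# 18-- -> None
# 1964, c1959 -> 1964
```

Even this toy rule has to make judgment calls (which year wins, what counts as plausible), which is why regularizing millions of records from dozens of institutions is genuinely hard.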

I used to work with HathiTrust and attended a few workshops; my experience was that the notion of a "data capsule" that ensures "non-consumptive use" is quite onerous in practice, and that getting that access can require prior approval and supervision of research activities:


I think it's actually a clever technical solution to a hideous legal problem, and it makes a lot of research possible that would be totally illegal otherwise, but it 100% gets in the way of totally legitimate research as well.

My experience with HathiTrust is that I search for something through my university library's site, I click a link that takes me to Hathi, and it says I can't access it. Thanks for nothing, HathiTrust. I guess it might be useful for people who want to do some quantitative work, but for those of us who want to actually read the works it is just digital blue balls.

Does it make you feel any better that most of the texts over a decade old are permanently out of print and thus valueless, and that the hideous legal problem only applies to a tiny fraction of the works in question?

Not to mention how much of the Amazon rainforest has been/is being razed to the ground to make way for pulp farming for the dead trees. It's scary and stupid both.

Are you being sarcastic? Amazon rainforest wood is not used for paper, the trees are cut down for farmland, not for the wood.

There are far better sources of wood, but land is more scarce.

Who said anything about using old trees for book pulp? Pulp farming requires arable land which is taken away from rainforests.

I was an engineer in Google Books in 2011 and got the impression that this effort turned out to be not as useful as expected. This took away energy from the legal fight. Having access to every printed book seemed amazing in principle, but it didn't translate to amazing demand for the service. My theory is that it didn't lower the barriers to knowledge much -- there's not that much useful information left in offline-only storage.

I think there's a lot of information sleeping in depositories only on paper. The thing is people are busy with the urgent stuff and stay away from the important if they can't get something immediate from it.

I've used Google Books extensively for research on 17th-century mathematics (not just me; so did the whole department where I was at that point). I couldn't have afforded to visit all the libraries to look up the originals, but thanks to this repository I could download several copies of obscure works, which gives you very interesting insights: sometimes you hit the personal copy of a previous researcher, or there's an ex-libris from an old institution that you know your author was connected to, or you just discover something no one had noticed before simply with a clever string search. You can even backtrack to the forgotten true primary source of a mistakenly repeated fact, for instance. It's been a blessing, honestly.

If you are into whatever less-traveled alley of history, you soon realise that your (former?) company has produced a treasure trove for future generations that just shouldn't ever be shut down: there are many books there not digitized anywhere else. So thank you very, very much for your work there. It definitely belongs to the important.

As an academic library user, it would be especially helpful if one could read obscure/old titles stored offsite in digital form; currently one has to have an item shipped to campus, pick it up, and maybe realize it doesn't actually contain the needed info -- a waste of energy for multiple parties.

I find archive.org fairly useful for a lot of older books. If they are past the copyright date, then many books can be found and downloaded as PDFs. I've found lots of old Language & Literature books from 80-140 years ago. :-)

> “Somewhere at Google there is a database containing 25 million books and nobody is allowed to read them.”

That is the key result: the greatest library in human history is locked shut, primarily because it was perceived as a threat to the legacy business model of paper book publishing (and royalties based on sales of paper books.)

Restricting libraries to "non-consumptive use" betrays the fundamental purpose of a library: to enable people to read the books in the collection; it makes little sense with physical books and even less with digital texts which don't wear out.

I heard of HathiTrust Research Center but didn't know almost all the books were scanned by Google.

Google was nice to want to share. But I think the original and ongoing motivation was improved image recognition and NLP, i.e., machine-learning the whole world's knowledge.

> Google was nice to want to share.

Niceness had nothing to do with it. From the article, "As part of the deal, Google’s partner libraries made sure they got to keep digital copies of their scanned works for research and preservation use."

It sucks that we could have basically had a virtual Library of Alexandria thanks to Google, but because of book publishers lobbying in order to try and conserve their monopolies, it's now locked away to the general public.

Libgen happened.

What happened?

tldr; lawyers...

like it was a surprise

