How can I legally convert a “dead-tree” library (1050 books) to eBooks?

vitovito · on May 30, 2015

They're your books, so you can scan them if you want. But you can't sell, donate or give away the physical book once you've scanned it; the scan is a copy/backup of the original for personal use, and if you get rid of the original, you have to delete the copies, too. You can only destroy the original.

There are bulk, destructive book scanners which will give you PDFs of your books. You don't get the books back, because they cut off the covers and the binding and feed them through a sheet-fed scanner. http://1dollarscan.com/ is an example.

If you don't want to destroy the books, your only real option is to hire someone to scan them, or spend the time to scan them yourself.

Non-destructive book scanning machines can run from thousands to tens of thousands of dollars, or you can build your own for about a thousand dollars (USD).

I maintain the previous model of this at a hackerspace in Austin, TX: http://www.diybookscanner.org/archivist/

That model claims 1000 pages/hour for a practiced operator. 1050 books * 350 pages each on average means you can scan them all in about a month and a half of eight-hour days. If you're paying someone $20/hour, that's about $7500, so for under $10,000, you can non-destructively scan your entire collection. ($10,000 is what you'd pay for a commercial scanner, without the labor to scan the books.)
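The arithmetic behind that estimate can be sketched as follows. This is only a back-of-the-envelope calculation using the figures quoted above; the variable names are illustrative, and it ignores setup time, breaks, and rescans.

```python
# Back-of-the-envelope scan estimate using the figures from the post.
BOOKS = 1050
PAGES_PER_BOOK = 350      # average page count given in the post
PAGES_PER_HOUR = 1000     # claimed rate for a practiced operator
HOURS_PER_DAY = 8
HOURLY_WAGE_USD = 20

total_pages = BOOKS * PAGES_PER_BOOK
scan_hours = total_pages / PAGES_PER_HOUR
work_days = scan_hours / HOURS_PER_DAY
labor_cost_usd = scan_hours * HOURLY_WAGE_USD

print(f"{total_pages:,} pages")
print(f"{scan_hours:.1f} hours, or about {work_days:.0f} eight-hour days")
print(f"${labor_cost_usd:,.0f} in labor")
```

The exact labor figure comes out a little under the rounded $7500, which leaves some slack for a slower operator.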

I'd recommend looking for a hackerspace which already has one, and hiring someone to scan the books for you as you need them.

walterbell · on May 30, 2015

Is there any way to buy a kit for a nondestructive scanner?

1dollarscan is rumored to be watermarking/encoding the customer name in their scans. If you highly magnify the page image, there are a large number of small artifacts, which may explain the large file sizes, e.g. 150MB for a 400 page book.

Their service ($1 per 100 pages) is cost-effective, given that Staples will charge $2 to remove the spine from a book, without scanning. A spine cutter costs a few hundred dollars.
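For comparison, here is a rough sketch of what the destructive route would cost for the whole library at that rate. The 1050 books and 350-page average come from elsewhere in this thread; whether partial sets of 100 pages are rounded up is an assumption, so both cases are shown. Shipping and any add-on options are not included.

```python
import math

BOOKS = 1050
PAGES_PER_BOOK = 350      # average quoted elsewhere in the thread
PAGES_PER_SET = 100
PRICE_PER_SET_USD = 1

# Case 1: price is pro-rated per page.
prorated_usd = BOOKS * PAGES_PER_BOOK / PAGES_PER_SET * PRICE_PER_SET_USD

# Case 2 (assumption): each book is billed in whole sets of 100 pages.
sets_per_book = math.ceil(PAGES_PER_BOOK / PAGES_PER_SET)
rounded_up_usd = BOOKS * sets_per_book * PRICE_PER_SET_USD

print(f"pro-rated: ${prorated_usd:,.0f}")
print(f"whole sets: ${rounded_up_usd:,.0f}")
```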

vitovito · on May 30, 2015

> Is there any way to buy a kit for a nondestructive scanner?

Yes, I linked to http://www.diybookscanner.org/archivist/ in my original post. It has a "get a kit" link on the left. The forum post it links to cites a cost of $1200 for a full kit; kits are available for pre-order.

The plans are also available at that same URL, and if you know someone with a 4'x8' CNC router, they should be able to help you put them together and cut them for about half that cost in time and materials. Then you'll just need to source the nuts and bolts and cameras and electronics yourself.

alok-g · on May 31, 2015

>> 1dollarscan is rumored to be watermarking/encoding the customer name in their scans.

I rather appreciate that as a means against piracy of the scanned books.

chachalarue · on May 30, 2015

Any recommendations for converting books into another format besides PDF? Ever since I read SICP as texinfo in Emacs while working on another screen, I've been looking for an easier, automated way to convert my library to texinfo or LaTeX source.

vitovito · on May 30, 2015

If by "automated" you mean "cheap/free," no.

If by "automated" you mean "I don't have to do any work" irrespective of cost, yes.

When you "scan" a book, you're taking photos of the pages, which you can then run through OCR. But OCR, even of a page scanned with a flatbed scanner, is not going to understand page layout and a variety of typefaces perfectly. You're going to have a lot of errors to correct, usually some reformatting to do, and text to add back in where OCR missed it because it was part of an image or something.

You can do it yourself by comparing the scan to the OCR'd text. A layperson can edit and correct ~18 lines per minute, so that's probably an entire weekend of your time per book.

You can hire a professional editor, who can edit and correct ~25 lines per minute. That will probably cost a few to several hundred dollars (USD) per book.

You could Mechanical Turk it, but I'm not sure the math and time trade-off works out, given that you have to have the edits confirmed and redone, either by other turkers, or by you.
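The per-book effort for the first two options can be roughed out as below. This is a sketch under loud assumptions: the 35 lines per page is a guess and not a figure from this thread, the 350-page average is borrowed from earlier in the discussion, and the lines-per-minute rates are treated as throughput over every line in the book.

```python
PAGES_PER_BOOK = 350    # average quoted earlier in the thread
LINES_PER_PAGE = 35     # assumption, not a figure from the thread

def correction_hours(lines_per_minute):
    """Hours needed to check one book at the given editing rate."""
    total_lines = PAGES_PER_BOOK * LINES_PER_PAGE
    return total_lines / lines_per_minute / 60

layperson_hours = correction_hours(18)   # ~18 lines per minute
editor_hours = correction_hours(25)      # ~25 lines per minute

print(f"layperson: {layperson_hours:.1f} hours per book")
print(f"professional editor: {editor_hours:.1f} hours per book")
```

Under those assumptions the layperson figure lands in "entire weekend" territory, and the editor's hours are in line with a bill of a few hundred dollars.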

At the end, you could have an hOCR file that you could readily turn into an ePub or something else, but there's no magic solution. (These figures are based on research and testing I did in 2013, using hOCR editing tools and hiring and timing a professional editor versus myself.)
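As a rough illustration of why a corrected hOCR file is a convenient starting point: hOCR is ordinary HTML in which recognized lines and words carry classes such as `ocr_line` and `ocrx_word`, so the text can be pulled out with a standard HTML parser. The sketch below uses only Python's standard library; the `HocrLineExtractor` class, the `hocr_lines` helper, and the sample markup are all hypothetical, and real files may need more care (for example, unclosed tags inside a line).

```python
from html.parser import HTMLParser

class HocrLineExtractor(HTMLParser):
    """Collect the text of every element whose class includes "ocr_line"."""

    def __init__(self):
        super().__init__()
        self.lines = []      # finished lines of text
        self._depth = 0      # tag nesting depth inside the current line
        self._words = []     # words seen so far in the current line

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self._depth:
            self._depth += 1             # nested tag inside a line
        elif "ocr_line" in classes:
            self._depth = 1              # a new line starts here
            self._words = []

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1
            if self._depth == 0:         # the line element just closed
                self.lines.append(" ".join(self._words))

    def handle_data(self, data):
        if self._depth:
            self._words.extend(data.split())

def hocr_lines(markup):
    parser = HocrLineExtractor()
    parser.feed(markup)
    return parser.lines

# Hypothetical two-line sample in hOCR style.
SAMPLE = """
<div class='ocr_page' title='bbox 0 0 1000 1500'>
 <span class='ocr_line' title='bbox 10 10 900 40'>
  <span class='ocrx_word' title='bbox 10 10 200 40'>Structure</span>
  <span class='ocrx_word' title='bbox 210 10 280 40'>and</span>
  <span class='ocrx_word' title='bbox 290 10 600 40'>Interpretation</span>
 </span>
 <span class='ocr_line' title='bbox 10 50 900 80'>
  <span class='ocrx_word' title='bbox 10 50 60 80'>of</span>
  <span class='ocrx_word' title='bbox 70 50 260 80'>Computer</span>
  <span class='ocrx_word' title='bbox 270 50 460 80'>Programs</span>
 </span>
</div>
"""

for line in hocr_lines(SAMPLE):
    print(line)
```

From a list of lines like this, producing plain text, ePub chapters, or texinfo/LaTeX source is mostly a templating job; the hard part remains getting the text right in the first place.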

wodenokoto · on May 31, 2015

If the book is just text and headings, can't you expect OCR to do a good job without any real need of human intervention?

vitovito · on May 31, 2015

Depends on what you mean by a "good job" and what you're doing with the results. 80-90% recognition isn't good enough if it means a term doesn't show up when you search for it because the OCR saw "rn" and wrote "m", or if you're having the text translated, or having it read aloud with a text-to-speech synthesizer.
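To make the search problem concrete, here is a small sketch of a search that tolerates that kind of confusion by letting "rn" and "m" stand in for each other. Only the "rn"/"m" pair comes from the comment above; the other groups and the function name `ocr_tolerant_pattern` are illustrative assumptions. It is a workaround, not a fix, because it also produces false positives.

```python
import re

# Only ("rn", "m") is taken from the discussion; the rest are assumed examples.
CONFUSABLE_GROUPS = [("rn", "m"), ("l", "1", "I"), ("O", "0")]

def ocr_tolerant_pattern(term):
    """Build a regex that matches `term` or an OCR-confused variant of it."""
    lookup = {}
    for group in CONFUSABLE_GROUPS:
        alternatives = "(?:" + "|".join(re.escape(g) for g in group) + ")"
        for member in group:
            lookup[member] = alternatives
    keys = sorted(lookup, key=len, reverse=True)   # try "rn" before "r"-sized keys

    pieces = []
    i = 0
    while i < len(term):
        for key in keys:
            if term.startswith(key, i):
                pieces.append(lookup[key])
                i += len(key)
                break
        else:
            # No confusable sequence starts here; match the character literally.
            pieces.append(re.escape(term[i]))
            i += 1
    return re.compile("".join(pieces))

ocr_text = "a history of the modem world"   # the printed page said "modern"

print("modern" in ocr_text)                                      # plain search
print(ocr_tolerant_pattern("modern").search(ocr_text) is not None)  # tolerant search
```

The same pattern will also match a genuine "modem" when you search for "modern", which is why this only papers over the accuracy problem.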

In my tests, we saw 10-15% of lines needing correction, and this was a book that was primarily headers and text.

Sometimes these are character-level issues, like the one I cited above.

Sometimes it's dust, debris, shadows or markings being confused with text.

You get a little closer by running spelling and context checks against the words, but it's never 100% accurate. And if you aren't looking at the original pages, or you need automated systems to search/parse/translate/etc. the text, you need it to be 100% accurate, which means you need a human editor.
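A minimal sketch of the spelling-check idea, using `difflib.get_close_matches` from Python's standard library. The tiny word list and the `flag_suspect_words` helper are hypothetical stand-ins for a real dictionary and pipeline. Note what it cannot do: an OCR error that happens to produce a real word passes straight through.

```python
from difflib import get_close_matches

# Hypothetical, tiny vocabulary; a real check would use a full dictionary.
VOCABULARY = sorted({"the", "modern", "modem", "computer", "was", "slow"})

def flag_suspect_words(words, vocabulary=VOCABULARY):
    """Return {word: suggestions} for every word not found in the vocabulary."""
    suspects = {}
    for word in words:
        if word.lower() not in vocabulary:
            suspects[word] = get_close_matches(word.lower(), vocabulary, n=3, cutoff=0.6)
    return suspects

# The printed page said "the modern computer was slow".
ocr_words = "the modem cornputer was slow".split()

print(flag_suspect_words(ocr_words))
```

Here "cornputer" gets flagged with a sensible suggestion, but "modem" is a valid word, so the check has no reason to question it. Catching that one takes context, and ultimately a human looking at the page.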

OCR isn't a solved problem.

wodenokoto · on May 31, 2015

Thank you for the detailed response. I honestly thought it was a solved problem (as in better performing than a human) as long as there is just running text.

walterbell · on May 30, 2015

Scanning and image capture are only the start of the process of creating a usable ebook. The post-processing workflow is also labor-intensive.

https://www.memoryoftheworld.org/wp-content/uploads/2014/12/...

http://www.konradvoelkel.com/2013/03/scan-to-pdfa/

http://scantailor.org

ColinWright · on May 30, 2015

I think bumbledraven[0] has done this, although I think he is/was in the USA while doing it. He was pleased with the result, but it is a destructive process. You send your book and get back the e-book; you don't get the original back because it's destroyed.

[0] https://news.ycombinator.com/user?id=bumbledraven

brudgers · on May 30, 2015

If you have several terabytes of data, the fastest channel is copying it to a hard disk and shipping it FedEx overnight. Which makes it worth mentioning that $10,000 can buy a lot of parcel post. Come to think of it, it can buy a lot of shipped used books as well.

Depending on your travel plans and reference patterns, a low-tech approach may work.

Good luck.

kw71 · on May 30, 2015

You might want to see how much of your collection is available in Library Genesis [1]. I am no authority on The Law but see no moral issue with collecting other people's scans of printed books that you own.

[1] http://gen.lib.rus.ec/

qubex · on May 31, 2015

Thanks to everybody who took the time to answer. I'll look into the various proposed solutions. Thanks.