How can I legally convert a “dead-tree” library (1050 books) to eBooks?
8 points by qubex on May 30, 2015 | 14 comments
I have a largish dead-tree library/bookcase of about 1050 books (all catalogued digitally, with titles and ISBN numbers). I’m looking for a way to legally convert them to eBooks without incurring a totally unreasonably large cost, mainly because I will be travelling an awful lot over the next few years and I need many of them for reference. I live in Italy and bought most of them from Amazon.co.uk. For a while I was quite enthusiastic because Amazon.com offered a programme whereby eBook copies of purchased books could be had for little or nothing, but (as far as I know) this was never extended overseas. I’m left wondering whether there is a manner of doing this economically and legally, or whether I will be faced with having to abandon the idea, incur a massive expense to re-purchase eBook editions of everything, or go off the beaten track.

Assuming the denizens of HN to be a fairly literate lot, I assume at least somebody has faced this same problem before, and given the technical prowess of HN denizens, I assume at least somebody has made a reasonably proficient stab at solving it.




They're your books, so you can scan them if you want. But you can't sell, donate or give away the physical book once you've scanned it; the scan is a copy/backup of the original for personal use, and if you get rid of the original, you have to delete the copies, too. You can only destroy the original.

There are bulk, destructive book scanners which will give you PDFs of your books. You don't get the books back, because they cut off the covers and the binding and feed them through a sheet-fed scanner. http://1dollarscan.com/ is an example.

If you don't want to destroy the books, your only real option is to hire someone to scan them, or spend the time to scan them yourself.

Non-destructive book scanning machines can run from thousands to tens of thousands of dollars, or you can build your own for about a thousand dollars. (USD)

I maintain the previous model of this at a hackerspace in Austin, TX: http://www.diybookscanner.org/archivist/

That model claims 1000 pages/hour for a practiced operator. 1050 books * 350 pages each on average means you can scan them all in about a month and a half of eight-hour days. If you're paying someone $20/hour, that's about $7500, so for under $10,000, you can non-destructively scan your entire collection. ($10,000 is what you'd pay for a commercial scanner, without the labor to scan the books.)
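For anyone who wants to plug in their own numbers, here is a rough sketch of that estimate in Python. The inputs are the figures quoted above (the 350-page average is itself a guess); the function and key names are made up for illustration.

```python
# Figures quoted in the comment above; change them to match your own library.
BOOKS = 1050
PAGES_PER_BOOK = 350      # rough average, not a measured value
PAGES_PER_HOUR = 1000     # rate claimed for a practiced operator
HOURS_PER_DAY = 8
WAGE_USD_PER_HOUR = 20


def scanning_estimate(books=BOOKS, pages_per_book=PAGES_PER_BOOK,
                      pages_per_hour=PAGES_PER_HOUR,
                      hours_per_day=HOURS_PER_DAY,
                      wage=WAGE_USD_PER_HOUR):
    """Return total pages, operator hours, working days and labor cost."""
    total_pages = books * pages_per_book
    hours = total_pages / pages_per_hour
    return {
        "total_pages": total_pages,
        "hours": hours,
        "days": hours / hours_per_day,
        "labor_cost_usd": hours * wage,
    }


estimate = scanning_estimate()
for name, value in estimate.items():
    print(f"{name}: {value:,.1f}")
```

With the defaults this comes to 367,500 pages, 367.5 hours (about 46 eight-hour days) and $7,350 in labor, which is where the rounded figures above come from.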

I'd recommend looking for a hackerspace which already has one, and hiring someone to scan the books for you as you need them.


Is there any way to buy a kit for a nondestructive scanner?

1dollarscan is rumored to be watermarking/encoding the customer name in their scans. If you highly magnify the page image, there are a large number of small artifacts, which may explain the large file sizes, e.g. 150MB for a 400 page book.

Their service ($1 per 100 pages) is cost-effective, given that Staples will charge $2 to remove the spine from a book, without scanning. A spine cutter costs a few hundred dollars.
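As a back-of-the-envelope sketch, here is what those numbers would mean for the whole library in the original question. The price and the 150MB/400-page file size are the ones quoted above; the 350-page average comes from the other comment, and billing each book in whole 100-page sets is my assumption, not something I have checked against their price list.

```python
import math

PRICE_USD_PER_SET = 1.0   # quoted: $1 per 100 pages
PAGES_PER_SET = 100
BOOKS = 1050
PAGES_PER_BOOK = 350      # rough average from the other comment


def service_cost(books=BOOKS, pages_per_book=PAGES_PER_BOOK,
                 round_up_per_book=True):
    """Scanning fee in USD for the whole library."""
    sets_per_book = pages_per_book / PAGES_PER_SET
    if round_up_per_book:
        # Assumption: a partial 100-page set is billed as a whole set.
        sets_per_book = math.ceil(sets_per_book)
    return books * sets_per_book * PRICE_USD_PER_SET


def storage_gb(books=BOOKS, pages_per_book=PAGES_PER_BOOK,
               mb_per_page=150 / 400):
    """Disk space in GB if every book scans like the 150MB/400-page example."""
    return books * pages_per_book * mb_per_page / 1000


print(f"cost, rounded up per book: ${service_cost():,.0f}")
print(f"cost, exact page count:    ${service_cost(round_up_per_book=False):,.0f}")
print(f"storage: {storage_gb():.1f} GB")
```

That is roughly $3,700 to $4,200 in scanning fees and under 140 GB of PDFs, before shipping and before any spine-cutting you do yourself.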


> Is there any way to buy a kit for a nondestructive scanner?

Yes, I linked to http://www.diybookscanner.org/archivist/ in my original post. It has a "get a kit" link on the left. The forum post it links to cites a cost of $1200 for a full kit; kits are available for pre-order.

The plans are also available at that same URL, and if you know someone with a 4'x8' CNC router, they should be able to help you put them together and cut them for about half that cost in time and materials. Then you'll just need to source the nuts and bolts and cameras and electronics yourself.


>> 1dollarscan is rumored to be watermarking/encoding the customer name in their scans.

I rather appreciate that as a means against piracy of the scanned books.


Any recommendations for converting books into another format besides PDF? Ever since I read SICP as a texinfo in Emacs while working on another screen I've been looking for an easier automated way to convert my library to texinfo or LaTeX source.


If by "automated" you mean "cheap/free," no.

If by "automated" you mean "I don't have to do any work" irrespective of cost, yes.

When you "scan" a book, you're taking photos of the pages, which you can then run through OCR. But OCR, even of a page scanned with a flatbed scanner, is not going to understand page layout and a variety of typefaces perfectly. You're going to have a lot of errors to correct, usually some reformatting to do, and text to add back in that OCR missed because it was part of an image or something.

You can do it yourself, comparing the scan to the OCR'd text. A layperson can edit and correct ~18 lines per minute, so a book will probably take an entire weekend of your time.

You can hire a professional editor, who can edit and correct ~25 lines per minute. That will probably cost a few to several hundred dollars (USD) per book.

You could Mechanical Turk it, but I'm not sure the math and time trade-off works out, given that you have to have the edits confirmed and redone, either by other turkers, or by you.

At the end, you could have an hOCR file that you could readily turn into an ePub or something else, but there's no magic solution. (These figures are based on research and testing I did in 2013, using hOCR editing tools and hiring and timing a professional editor versus myself.)
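To show how the per-line rates above turn into hours, here is a rough sketch. The 18 and 25 lines per minute are the figures from my tests; the 350 pages per book and 35 lines per page are assumptions for illustration only, so substitute your own.

```python
PAGES_PER_BOOK = 350    # assumption, same rough average used elsewhere in the thread
LINES_PER_PAGE = 35     # hypothetical; depends heavily on the book
LAYPERSON_LPM = 18      # lines per minute, from the comment above
PROFESSIONAL_LPM = 25   # lines per minute, from the comment above


def correction_hours(lines_per_minute, pages=PAGES_PER_BOOK,
                     lines_per_page=LINES_PER_PAGE):
    """Hours needed to check and correct every line of one book."""
    total_lines = pages * lines_per_page
    return total_lines / lines_per_minute / 60


layperson = correction_hours(LAYPERSON_LPM)
professional = correction_hours(PROFESSIONAL_LPM)
print(f"layperson:    {layperson:.1f} hours per book")
print(f"professional: {professional:.1f} hours per book")
print(f"1050 books, professional: {1050 * professional:,.0f} hours")
```

Under those assumptions one book is about 11 hours for a layperson and about 8 for a professional, which is why doing this for an entire library is not realistic.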


If the book is just text and headings, can't you expect OCR to do a good job without any real need of human intervention?


Depends on what you mean by a "good job" and what you're doing with the results. 80-90% recognition isn't good enough if it means when you search for a term it doesn't show up because the OCR saw "rn" and wrote "m", or if you're having it translated, or having it read aloud with a text-to-speech synthesizer.
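Here is a small sketch of the search problem. The confusion table is a hypothetical, hand-picked list (not taken from any OCR engine), and the "folding" trick is just one crude workaround, not something I'm claiming fixes the text.

```python
# Hypothetical table of glyph sequences that OCR tends to mix up.
CONFUSIONS = [("rn", "m"), ("cl", "d"), ("vv", "w")]


def fold(text):
    """Collapse each confusable sequence to one form so both spellings match."""
    folded = text.lower()
    for wide, narrow in CONFUSIONS:
        folded = folded.replace(wide, narrow)
    return folded


def fuzzy_contains(haystack, needle):
    """Substring search that ignores the confusions listed above."""
    return fold(needle) in fold(haystack)


# The printed page said "modern"; the OCR output says "modem".
ocr_text = "The modem scanner uses a sheet feeder."

print("modern" in ocr_text)                # plain search misses the word
print(fuzzy_contains(ocr_text, "modern"))  # folded search finds it
```

The catch is that folding also makes a search for "modern" match every genuine "modem", so you trade missed hits for false ones. It doesn't make the text correct, it only hides one class of error from one kind of query.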

In my tests, we were seeing 10-15% of lines needing correction, and this was a book that was primarily headers and text.

Sometimes these are character-level issues, like the one I cited above.

Sometimes it's dust, debris, shadows or markings being confused with text.

You get a little closer by running spelling and context checks against the words, but it's never 100% accurate. And if you aren't looking at the original pages, or you need automated systems to search/parse/translate/etc. the text, you need it to be 100% accurate, which means you need a human editor.

OCR isn't a solved problem.
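A minimal sketch of the kind of spelling check mentioned above, and of why it can't get you all the way. The vocabulary here is a tiny stand-in set; a real check would load a full word list, and all the names are made up for the example.

```python
import re

# Tiny stand-in vocabulary (assumption); a real check would use a full dictionary.
VOCABULARY = {"the", "modern", "modem", "scanner", "uses", "a", "sheet", "feeder"}


def lines_needing_review(lines, vocabulary=VOCABULARY):
    """Return (line number, unknown words) for lines with out-of-vocabulary words."""
    flagged = []
    for number, line in enumerate(lines, start=1):
        words = re.findall(r"[a-z']+", line.lower())
        unknown = [word for word in words if word not in vocabulary]
        if unknown:
            flagged.append((number, unknown))
    return flagged


ocr_lines = [
    "The modern scanner uses a sheet feeder.",  # correct
    "The modem scanner uses a sheet feeder.",   # "rn" read as "m", still a real word
    "The scanner uses a shee7 feeder.",         # digit misread inside a word
]

flagged = lines_needing_review(ocr_lines)
print(flagged)
```

Only the third line gets flagged. The "modem" line passes because the mistake happens to be a valid word, and that is exactly the kind of error a human has to catch by looking at the page.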


Thank you for the detailed response. I honestly thought it was a solved problem (as in better performing than a human) as long as there is just running text.


Scanning and image capture are only the start of the process of creating a usable ebook. The post-processing workflow is also labor-intensive.

https://www.memoryoftheworld.org/wp-content/uploads/2014/12/...

http://www.konradvoelkel.com/2013/03/scan-to-pdfa/

http://scantailor.org


I think bumbledraven[0] has done this, although I think he is/was in the USA while doing it. He was pleased with the result, but it is a destructive process. Send your book, get back the e-book, you don't get the original back because it's destroyed.

[0] https://news.ycombinator.com/user?id=bumbledraven


If you have several terabytes of data, the fastest channel is filling a hard disk and shipping it FedEx overnight. Which makes it worth mentioning that $10,000 can buy a lot of parcel post. Come to think of it, it can buy a lot of shipped used books as well.

Depending on your travel plans and reference patterns a low tech approach may work.

Good luck.


You might want to see how much of your collection is available in Library Genesis [1]. I am no authority on The Law but see no moral issue with collecting other people's scans of printed books that you own.

[1] http://gen.lib.rus.ec/


Thanks to everybody who took the time to answer. I'll look into the various proposed solutions. Thanks.



