Scanning books at 250 pages a minute (2008) (u-tokyo.ac.jp)
199 points by jacquesm 57 days ago | 51 comments

I am not the domain expert on this project, but here at the Internet Archive we ended up developing our own system (not dissimilar from this one, albeit with significantly less automation) at a fairly low cost. Some links for details:


https://archive.org/details/tabletopscribesystem (links to additional detail pages there)

https://motherboard.vice.com/en_us/article/jp5kjy/saving-hum... (a few years old)

We've found that high-cost implementations are less appealing to smaller sites and libraries (which may have mandates that books not be shipped away), and we can achieve high scanning rates through parallelism rather than single extremely high-throughput stations. Additionally, many books are much more complex, valuable, or fragile, or are simply not amenable to automatic methods, whereas skilled scanners can move through them with ease.

Hi Jonah, just wanted to say congrats on the Tabletop Scribe and also wanted to point out that I designed the first version (the Hackerspace Scanner) with lots of input from the DIY Book Scanner community, who deserve a lot of credit.

If anyone here has interest in building their own, check out diybookscanner.org/archivist for the Archivist model, which I placed in the public domain and documented to the last detail.

Also, the TTScribe is based on our old Hackerspace scanner model, which was just a beta, not the newer and more thoroughly engineered Archivist which also has more permissive licensing, fewer components and more sensible lighting. If you want to get started, that's the place to start.

Neat! What I really like about the system I posted is that it appears to be a lot quicker because it does not require the book to be pressed against something to flatten it. That is a time-consuming step, and the flattening surface introduces yet another source of optical errors into the path between the camera and the book and might get dirty.

It looks as if the pages are turned and fixed by air after the 'thumb' on the right-hand side releases the next page. There is a long screw that turns slowly, moving the 'thumb' down at a rate that releases about 4 pages every second; the system then scans the image the moment the page is fixed. Quite clever.

It is neat, but it does have the downside of not really being usable on valuable, rare, or delicate books.

That's true. But almost by definition there will not be that many rare books and valuable ones typically also are not available in large numbers. So the scope for applying this tech is huge. I looked at a very large scale digitization operation in some detail a while ago, some of the tech they had there was super interesting.

I love the creativity around solutions like these.

Re: "...ended up developing our own system...": The Internet Archive should more clearly credit where this design originated and whose work it is based on. Only reluctantly did you give them any credit.

Hi, if you take the time to scroll down to the bottom of the second link I posted, you'll see:

> Before we close, we want to recognize the early work of Daniel Reetz and the DIY community. They saw a need for a DIY image capture device and developed a working prototype offered under an OpenHardwareDesign license - CC-BY-SA[2]. You may find information about that here - http://www.diybookscanner.org

All of these events significantly predate my time with the Internet Archive, so I can't speak to them directly, but as with so many things this seems more likely to be a combination of unfortunate neglect and lack of resources than any kind of malice.

It's based on an open source design. Snip (from the bottom of the page at https://forum.diybookscanner.org/viewtopic.php?f=26&t=3161):

"...the way the TableTopScribe was handled wasn't great (there's a long, long behind-the-scenes story that goes with that), but I was able to convince them to post that credit at the bottom of the page after some back-and-forth with Robert Miller. When the Archive uses and changes our technology, It shows that our ideas are viable and that our approach has merit and value. Ultimately, the intention of making things Open Source was to get people to copy it, and so if I step back far enough, I feel a real sense of pride and accomplishment when I see things like this. (even when the purple tinge shows they are still using the cheap COB LEDs... sigh). Truth be told, at least when I was there, the Archive people were my people and I still care about them and their mission.

Unfortunately it was the Archive (through my work with them, when I worked there, and later, when they rebuilt the design in China) that unintentionally taught me that Open Hardware is a toothless thing. They did, in the end, "release" their "source files" - an .easm which requires Solidworks and doesn't give you the ability to change things*, but does give you a functional assembly to take measurements from. For the Archivist, I chose to share DXFs, STLs, and dimensioned drawings, which have a different set of problems, but at least have wide interoperability. All in all, it's an imperfect world, but a good one. Ultimately, I care far, far more about scanning books than I do about the specifics of their operations, so I've come to peace with all of it and feel good that they've taken it as far as they have. Scan on....."

And their reluctant acknowledgement of where the idea came from, right at the bottom of the page (https://archive.org/details/tabletopscribesystem):

"...Before we close, we want to recognize the early work of Daniel Reetz and the DIY community. They saw a need for a DIY image capture device and developed a working prototype offered under an OpenHardwareDesign license - CC-BY-SA[2]. You may find information about that here - http://www.diybookscanner.org...."

Thank you for posting this.

Hey Dan!

Thank you so much for your work on the diy book scanner project!

This is indeed very interesting to see and made me want to share my story. I've scanned thousands of books myself (mostly Japanese manga, to digitize and preserve them) and have always wondered where tech could take this kind of digitization. I've spent more than 10,000 hours on this over more than a decade and will probably keep doing it until I can't.

Because manga content needs to be restored as completely as possible, especially scenes that span the spine, high-speed camera-based solutions like the one above, and others that focus on conventional print, don't work well. That has left two main approaches in the trade for decades, ever since some Japanese hobbyists started self-scanning around the 80s:

1) Destructively break the book into unglued pages (which also involves cutting the glue with a heavy-duty paper cutter) and send them through a scanner with an auto feeder (then, if keeping the book is vital, try to glue it back together on a best-effort basis).

2) Manually press each page on a flatbed scanner (an A3 scanner works better, as it allows scanning the spine area, which usually has the deepest shadow, vertically) and scan each page using a predefined rough area, which therefore includes non-content areas as shown in https://imgur.com/a/2ITBlJg (left/right edge and spine).

Solution 1) works great in terms of efficiency but can be painful for those who love to keep the book in their collection (even though it ends up deformed anyway; I might consider this myself when I get old and can't bear the effort). 2) costs a lot more time (on average, #1 takes less than 30 minutes per book and #2 takes at least an hour), especially for manually removing noise as shown above. I've tried some basic CV tools/scripts to automatically crop at least the edges (the middle spine can be troublesome and I'm fine with working on that manually), but they work really poorly, since the content itself varies so much that checking the pixel distribution is not a reliable way to find the "edge".

Hopefully, before the second decade is over, I'll either give in to destruction or find a perfect way to automate the noise-clearing process.


There are about 129,864,880 books in the entire world. The median length of a book is about 68,000 words, which roughly translates to 250 manuscript pages on average.

If you must know, at 250 pages a minute it would take 247 years for one scanner to do its magic and digitize all the books ever written.

PS: Pretty sure I screwed something up along the way, would appreciate it if you don't roast me for my math :v
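For what it's worth, the arithmetic above seems to hold up. Here is a minimal sketch in Python that redoes it, using only the figures quoted in the comment. The function name and the `scanners` parameter are mine, and the model assumes the scanner runs nonstop at its headline rate with zero time spent loading books, which is certainly optimistic.

```python
BOOKS = 129_864_880        # book count quoted in the comment above
PAGES_PER_BOOK = 250       # rough average page count from the comment above
PAGES_PER_MINUTE = 250     # headline rate from the submission title
MINUTES_PER_YEAR = 60 * 24 * 365


def years_to_scan(scanners: int = 1) -> float:
    """Years needed to scan every book, assuming nonstop operation."""
    total_pages = BOOKS * PAGES_PER_BOOK
    minutes = total_pages / (PAGES_PER_MINUTE * scanners)
    return minutes / MINUTES_PER_YEAR


print(round(years_to_scan()))               # one scanner, in years
print(round(years_to_scan(1000) * 12, 1))   # 1000 scanners, in months
```

Since pages per book and pages per minute are both 250, the two cancel: each book takes one minute, so the answer is just the number of books divided by the number of minutes in a year.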

Or one year with 247 scanners. :-D

I am pretty sure it takes a long time to put a book in that device.

Spot on.

Throw a mere 1000 scanners at it in a warehouse and you'll be done in 3 months.

PS: the amount of information in the world roughly doubles every few years; I wonder how that is reflected in the number of books.

Fortunately most of those new books will be created using digital software so preserving them is much easier theoretically. Just have to convince the publisher to give them up.

Quality of information is a very important metric. Books have editors, proofreaders, fact-checkers. Online stuff not so much. As the value of a bit goes down so does the value of a collection of those bits.

There are more than enough junk books out there too for sure.

There is also an argument to be made that editors etc. serve as gatekeepers to keep out "the wrong kind of people," which frequently correlates to the wrong gender, the wrong race, the wrong religion, the wrong social class and so on.

For some people, freedom to try to express themselves and/or freedom to hear others try to express themselves is far more valuable than whether or not the writing is grammatically correct and typo-free.

That's Google Books' estimate as of 2010, so there are more now: http://booksearch.blogspot.com/2010/08/books-of-world-stand-...

The rate of publishing as recorded by Bowker (issuers of ISBNs) is about 300,000 "traditional" books (publishing houses) per year, and about 1 - 1.5 million per year if you include "nontraditional" (self-published / print-on-demand) books.

The 300k figure has been very consistent since the 1950s, and is even down somewhat from the 1970s / 1980s IIRC (US LoC annual report data).

So the total is only up by at most about 10-15m books, and if you're looking at traditional publishing, about 3m.
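A rough sketch of how those totals fall out of the yearly rates, in Python. The per-year figures are the ones quoted above; the nine-year span (roughly 2010 to the time of this thread) is my assumption, and so are the names.

```python
# Yearly rates quoted in the comment above.
TRADITIONAL_PER_YEAR = 300_000
ALL_PER_YEAR_LOW = 1_000_000
ALL_PER_YEAR_HIGH = 1_500_000

# Assumption (not from the comment): about nine years since the 2010 estimate.
YEARS_SINCE_ESTIMATE = 9


def added_books(per_year: int, years: int = YEARS_SINCE_ESTIMATE) -> int:
    """Books added since the estimate at a constant yearly rate."""
    return per_year * years


print(added_books(TRADITIONAL_PER_YEAR))   # traditional publishing only
print(added_books(ALL_PER_YEAR_LOW))       # including nontraditional, low end
print(added_books(ALL_PER_YEAR_HIGH))      # including nontraditional, high end
```

With those assumptions you get a bit under 3m for traditional publishing and somewhere between 9m and 13.5m overall, which is in line with the figures in the comment.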

The quality of books published is another question.

For us mere, mortal humans, there is https://www.diybookscanner.org/

Pretty cool. I'd improve the design by adding a foot lever and footswitch for taking the photo, in order to reduce the back-and-forth hand motion.

And what about the processing of all those images?

There seems to be a lack of free software for efficiently tidying up images obtained from an ordinary, non-automatic scanner. Sometimes turning raw scans into a nice PDF takes much longer than the manual scanning, so making the scanning faster wouldn't be the first priority for improving things (though better hardware might give images that need less tidying up).

Rotating and skewing the image so that the lines of text are horizontal and the margins vertical doesn't seem to be a very hard problem, but I've not seen an easily available and easy-to-use program for doing it. If you end up GIMP-ing each page you can see how that takes longer than the scanning.

There are several programs that look for the bounding box of the text and rotate based on that, but they don't work very well: they get confused by page numbers, chapter headings, side notes, and random blotches. It would be better to recognise the lines of text, which is what you really want to be straight and parallel.
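To make the "recognise the lines of text" idea a bit more concrete, here is a minimal sketch of one common way to do it, a projection-profile search, in plain Python. This is not taken from any of the programs mentioned; `estimate_skew` and its parameters are hypothetical names. It assumes you already have the coordinates of the dark pixels, and it approximates a small rotation by a vertical shear. For each candidate angle it counts how many ink pixels land in each row and prefers the angle where the ink piles up in the fewest rows, which is what happens when the text lines are horizontal.

```python
import math


def estimate_skew(ink_pixels, max_angle=5.0, step=0.1):
    """Estimate page skew in degrees from (x, y) coordinates of dark pixels.

    A positive result means text lines drift towards larger y as x grows
    (y pointing down, as in most image formats).
    """
    pixels = list(ink_pixels)
    best_angle, best_score = 0.0, -1.0
    steps = int(round(max_angle / step))
    for i in range(-steps, steps + 1):
        angle = i * step
        slope = math.tan(math.radians(angle))
        rows = {}
        for x, y in pixels:
            # Row this pixel would land in if the candidate skew were undone.
            row = round(y - x * slope)
            rows[row] = rows.get(row, 0) + 1
        # Sum of squared row counts is highest when ink is concentrated
        # in a few rows, i.e. when the text lines are horizontal.
        score = sum(count * count for count in rows.values())
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle


# Synthetic "page": 10 one-pixel-thick text lines, 20 px apart, skewed 2 degrees.
slope = math.tan(math.radians(2.0))
page = [(x, round(20 * k + x * slope)) for k in range(10) for x in range(400)]
print(estimate_skew(page))
```

On the synthetic page the estimate should land close to 2 degrees. Real scans would need thresholding first, and isolated page numbers or blotches matter less here than with a bounding-box approach because they contribute only a few pixels to the row counts, but that is an expectation rather than something I have measured.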


Not all of it. Is there something in there telling me what to do with a load of distorted images made with an ordinary flat-bed scanner? No reason why there should be. My whinge is a bit off-topic. Sorry about that.

Does anyone in the UK offer a cheap book-scanning service using high-tech hardware? If not, why not? Is there just not enough demand for it to be worth getting the equipment? In other words, in the UK, at least, it's cheaper to pay someone to turn the pages?

Still under-performs the fictional predicted rate of robotics in the 80s. https://tenor.com/view/johnny-johnny5-johnnyfive-reading-sca...

There's also Vernor Vinge's "librareome" idea: chuck entire books into a heavy-duty shredder, photograph the resulting shreds with lots of high-speed cameras as they're blown through a wind tunnel, and finally reassemble images of the original pages with software.

That's a very interesting idea, but I can see a whole pile of practical issues with it. Still, top marks for out of the box thinking, it's like shotgun sequencing for books.


I think that's where the name comes from. I suspect that it's satirical to point out the whole pile of practical issues with shotgun sequencing.

This was more or less literally how the Stasi archives were reconstructed. It took a lot of algorithmic work and I don't know if it's finished.

They are having trouble with the hand-torn puzzle pieces, those currently require manual sorting: https://www.theguardian.com/world/2018/jan/03/stasi-files-ea...

I wonder if anyone has tried that yet. I really enjoyed the idea when I read it.

Anyone know of any commercial solutions for this that don't involve e.g. having to disassemble the book and feed in a stack of pages? Ideally something on Amazon but "prosumer" level stuff is good too.

I would love to have digital backups of my library and personal notebooks.

I worked for a company that did this. We chopped off the spines to turn the books into looseleaf, which could be rapidly scanned in an automated fashion. We then disposed of the original materials and mailed our clients CD-ROMs of high-res scans, optionally OCRed. Obviously there are downsides to this approach, but for situations where it fits, it makes a ton of sense.

Please don't take this sentence out of context, but I kept a lot of severed spines as souvenirs when I left that job.

Semi-related: Law students often [used to in the early 90s / may still] get the spines sliced off heavy casebooks and the innards punched for 3-ring binders. Carry around only the pages you’ll need.

It was extravagant and fussy, but super handy for a bike commute. Scanning the loose pages would have been OK if that tech had been more friendly. 1000 pages, thin, flimsy paper and tiny print might still be a challenge. No tablets in those days.

> I kept a lot of severed spines as souvenirs when I left that job


Would be a fantastic way to outfit a bar or lounge. Severed spines with an e-interface for content access. Maybe some kinda laser pointer based lookup.

Suddenly those fake bookshelves that people buy to look as if they're well read make perfect sense. Have to re-calibrate.

Speaking of spines and copyright issues:

In K W Jeter's excellent dark cyberpunk novel "Noir", intellectual property theft is viewed as literally killing people by removing their livelihood, so copyright violators were punished by having their still-living spinal cords stripped out and made into high quality speaker cords in which their consciousness is preserved, usually presented to the copyright owner as a trophy.

"In the cables lacing up Alex Turbiner's stereo system, there was actual human cerebral tissue, the essential parts of the larcenous brains of those who'd thought it would be either fun or profitable to rip off an old, forgotten scribbler like him."


>There’s a lot to like in the novel.

>My favorite section is the middle section where the origin of the asp-heads is detailed via McNihil’s pursuit of a small time book pirate and the preparation of the resulting trophy. The information economy did, in this future, largely come to place. As a result, intellectual property theft is viewed as literally killing people by removing their livelihood. Therefore, death is a fitting punishment. McNihil, in his point by point review of the origin of asp-heads, notes that even in the 20th Century there was the phrase: “There’s a hardware solution to intellectual property theft. It’s called a .357 magnum.”

>Actually it’s decided that death is too good and too quick for pirates.

>Their consciousness is preserved by having their neural network incorporated in various devices. (Turbiner likes to use stripped down spinal cords for speaker wire.)

>This sounds like a cyberpunk notion but, in other parts of the novel, Jeter takes a swipe at such hacker/information economy/internet cliches as information wanting to be free (McNihil destroys a nest of such net hippies) or the future economy being based on information. Villain Harrisch sneers at the notion stating that information can be distorted but atoms – and the wealth they represent – endure.

>Still, his novel is chock full of the high-tech, low-life that characterizes cyberpunk.

(I'd quote some more, but as a high-tech, low-life net hippie, I'm afraid of having my nest destroyed and getting my spine ripped out!)

I suspect the writer to be showing a bit of his bias here. Wonder what his stance was on librarians, death by being buried under books perhaps?

K W Jeter was a good friend of Philip K Dick, and the character Alex Turbiner had some similarities and might have been based on him! From the review I linked to:

>A sort of Dick-like (in the sense of a largely ignored and prolific author of paperbacks and lover of music) author and idol of McNihil shows up in Turbiner. (Jeter wryly notes that authors were particularly “mean bastards” in regard to copyrights.)

It's ironic and fitting that PKD has been reincarnated as a robot and new versions of his mind and his work have been reconstructed by infringing on his intellectual property rights with machine learning.



K W Jeter also wrote some authorized sequels to Blade Runner (the movie, not the book Do Androids Dream of Electric Sheep).


Check https://www.indiegogo.com/projects/aura-speeds-simplifies-al... There's a discount going on currently.

I wish I could afford one. Digitized versions of books in Indian languages are almost non-existent.

We had a Treventus ScanRobot at my previous job. Not exactly "prosumer" but more "pro". It vacuums two pages and does linear scanning.

Just search for "book scanner" on AliExpress; there shouldn't be any problem.

There appear to be a lot of (overpriced) cameras-on-a-stand, but nothing fully automatic.

The hard part is not capturing an image of each page, but turning them automatically.

Given that planetary scanners are used for rare/fragile/valuable books (for common ones, it's best to just disassemble them and scan the individual leaves), turning the pages automatically at ludicrous speed is exactly what you don't want to do.

Even if books are common they might not be yours and a destructive way of scanning is unacceptable in that case.


Some nerds have done this ... scanning through a closed book.

In 1985 UCL had a camera, co-produced with the British Library, for photographing rare tomes which couldn't be opened flat for risk to the binding: a wedge/prism-shaped descending digital camera.
