
Scanning books at 250 pages a minute (2008) - jacquesm
http://www.k2.t.u-tokyo.ac.jp/vision/BFS-Auto/
======
jonah-archive
I am not the domain expert on this project, but here at the Internet Archive
we ended up developing our own system (not dissimilar from this one, albeit
with significantly less automation) at a fairly low cost. Some links for
details:

[https://archive.org/scanning](https://archive.org/scanning)

[https://archive.org/details/tabletopscribesystem](https://archive.org/details/tabletopscribesystem)
(links to additional detail pages there)

[https://motherboard.vice.com/en_us/article/jp5kjy/saving-hum...](https://motherboard.vice.com/en_us/article/jp5kjy/saving-human-knowledge-at-800-pages-an-hour)
(a few years old)

We've found that high-cost implementations are less appealing to smaller sites
and libraries (which may have mandates that books not be shipped away), and we
can achieve high scanning rates through parallelism rather than single
extremely high-throughput stations. Additionally, many books are too complex,
valuable, or fragile for automatic methods, or simply not amenable to them,
whereas skilled scanners can move through them with ease.

~~~
asdefghyk
Re: "....ended up developing our own system...." The Internet Archive should
more clearly credit where this design originated and whose work it is based
on. You gave them credit only reluctantly.

~~~
asdefghyk
It's based on an open source design. Snip (from the bottom of the page
[https://forum.diybookscanner.org/viewtopic.php?f=26&t=3161](https://forum.diybookscanner.org/viewtopic.php?f=26&t=3161)
): "...the way the TableTopScribe was handled wasn't great (there's a
long, long behind-the-scenes story that goes with that), but I was able to
convince them to post that credit at the bottom of the page after some
back-and-forth with Robert Miller. When the Archive uses and changes our
technology, it shows that our ideas are viable and that our approach has merit
and value. Ultimately, the intention of making things Open Source was to get
people to copy it, and so if I step back far enough, I feel a real sense of
pride and accomplishment when I see things like this. (even when the purple
tinge shows they are still using the cheap COB LEDs... sigh). Truth be told,
at least when I was there, the Archive people were my people and I still care
about them and their mission.

Unfortunately it was the Archive (through my work with them, when I worked
there, and later, when they rebuilt the design in China) that unintentionally
taught me that Open Hardware is a toothless thing. They did, in the end,
"release" their "source files" - an .easm which requires Solidworks and
doesn't give you the ability to change things*, but does give you a functional
assembly to take measurements from. For the Archivist, I chose to share DXFs,
STLs, and dimensioned drawings, which have a different set of problems, but at
least have wide interoperability. All in all, it's an imperfect world, but a
good one. Ultimately, I care far, far more about scanning books than I do
about the specifics of their operations, so I've come to peace with all of it
and feel good that they've taken it as far as they have. Scan on....."

And their reluctant acknowledgement of where the idea came from, right at the
bottom of the page
([https://archive.org/details/tabletopscribesystem](https://archive.org/details/tabletopscribesystem)):
"...Before we close, we want to recognize the early work of Daniel Reetz
and the DIY community. They saw a need for a DIY image capture device and
developed a working prototype offered under an OpenHardwareDesign license -
CC-BY-SA[2]. You may find information about that here -
[http://www.diybookscanner.org](http://www.diybookscanner.org)...."

~~~
daniel_reetz
Thank you for posting this.

~~~
markvdb
Hey Dan!

Thank you so much for your work on the diy book scanner project!

------
fallmonkey
This is very interesting to see, and it made me want to share my own story.
I've scanned thousands of books myself (mostly Japanese manga, to digitize
and preserve them) and have always wondered where tech could take this
digitization. I've spent more than 10,000 hours on this over more than a
decade and will probably keep doing it until I can't.

Manga needs to be reproduced as faithfully as possible, especially scenes that
run across the spine, so high-speed camera-based solutions like the one above,
or others that focus on ordinary printed text, don't work very well. That has
left two main methods in the trade for decades, ever since some people in
Japan started this self-scanning thing around the 80s:

1) Destructively break the book into unglued pages (which also involves
cutting off the glue with a heavy-duty paper cutter) and send them through a
scanner with an auto feeder (then try to glue the book back together, on a
best-effort basis, if keeping it is vital).

2) Manually press each page on a flatbed scanner (an A3 scanner works better,
as it allows scanning the spine area, which usually has the greatest shadow,
vertically) and scan each page using a predefined rough area, which therefore
includes non-content areas (left/right edge and spine), as shown in:
[https://imgur.com/a/2ITBlJg](https://imgur.com/a/2ITBlJg)

Solution 1) works great in terms of efficiency but could be a pain for those
who love to keep the book in their collection (though malformed anyway, and I
myself might consider this when I get old and can't bear the effort). 2)
costs a lot more time (on average, #1 takes less than 30 minutes per book and
#2 takes at least an hour), especially for manually removing the noise shown
above. I've tried some basic CV tools/scripts to automatically crop at least
the edges (the middle spine could be troublesome and I'm fine with working on
that manually), but they work really poorly, since the content itself varies
so much that checking the pixel distribution doesn't reliably find the "edge".
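
To make the pixel-distribution idea concrete, here is a minimal sketch of that kind of edge check. It is not the script used above: the function name, the thresholds, and the list-of-rows grayscale format are all assumptions made for illustration. It trims columns from the outside in while they look like scanner lid or shadow (almost all dark) or blank margin (almost no ink).

```python
def content_columns(page, dark=128, max_dark_frac=0.9, min_dark_frac=0.01):
    """Return (left, right) so that page columns left..right-1 hold content.

    page: list of rows, each a list of grayscale values 0..255.
    All thresholds are illustrative guesses, not tuned values.
    """
    height = len(page)
    width = len(page[0])

    def is_margin(x):
        # Fraction of dark pixels in column x.
        frac = sum(1 for row in page if row[x] < dark) / height
        # Nearly solid dark: lid or shadow. Nearly empty: blank margin.
        return frac > max_dark_frac or frac < min_dark_frac

    left = 0
    while left < width and is_margin(left):
        left += 1
    right = width
    while right > left and is_margin(right - 1):
        right -= 1
    return left, right
```

On a clean text page this finds the margins, but it shows why the approach is fragile for manga: a full-bleed black panel at the page edge looks exactly like scanner shadow to `is_margin`, and a mostly white panel looks like blank margin.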

Hopefully, before my second decade of this finishes, I'll either give in to
destruction or find a perfect way to automate the noise-clearing process.

------
basitmakine
#IDidTheMath

There are about 129,864,880 books in the entire world. The median length for
all books is about 68,000 words, which roughly translates into 250 manuscript
pages on average. At 250 pages a minute, that works out to about one book per
minute.

If you must know, it would take 247 years for one scanner to do its magic and
digitize all the books ever written.

PS: Pretty sure I screwed something up along the way, would appreciate it if
you don't roast me for my math :v
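
For anyone who wants to check the arithmetic, here is a quick sketch using only the figures quoted above plus the 250 pages a minute from the headline. It assumes non-stop scanning, no time for loading books, and 365-day years; the function name is made up for illustration.

```python
TOTAL_BOOKS = 129_864_880   # estimate quoted above
PAGES_PER_BOOK = 250        # "manuscript pages" figure quoted above
PAGES_PER_MINUTE = 250      # scan rate from the headline

MINUTES_PER_YEAR = 60 * 24 * 365  # ignores leap years


def years_to_scan(books, pages_per_book, pages_per_minute, scanners=1):
    """Years of non-stop scanning, with zero time allowed for loading books."""
    total_minutes = books * pages_per_book / (pages_per_minute * scanners)
    return total_minutes / MINUTES_PER_YEAR


print(round(years_to_scan(TOTAL_BOOKS, PAGES_PER_BOOK, PAGES_PER_MINUTE)))
```

Under those assumptions the result rounds to the 247 years quoted, so the math holds up; the weak point is the assumption that swapping books costs no time.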

~~~
TaylorAlexander
Or one year with 247 scanners. :-D

~~~
fredguth
I am pretty sure it takes a long time to put a book in that device.

------
zmix
For us mere mortal humans, there is
[https://www.diybookscanner.org/](https://www.diybookscanner.org/)

~~~
lucaspottersky
Pretty cool. I'd improve the design by adding a foot lever and footswitch for
taking the photo, to reduce the back-and-forth hand motion.

------
bloak
And what about the processing of all those images?

There seems to be a lack of free software for efficiently tidying up images
obtained from an ordinary, non-automatic scanner. Sometimes turning raw scans
into a nice PDF takes much longer than the manual scanning, so making the
scanning faster wouldn't be the first priority for improving things (though
better hardware might give images that need less tidying up).

Rotating and skewing the image so that the lines of text are horizontal and
the margins vertical doesn't seem to be a very hard problem, but I've not seen
an easily available and easy-to-use program for doing it. If you end up GIMP-
ing each page you can see how that takes longer than the scanning.

There are several programs that look for the bounding box of the text and
rotate based on that, but they don't work very well: they get confused by page
numbers, chapter headings, side notes, and random blotches. It would be better
to recognise the lines of text, which is what you really want to be straight
and parallel.
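
A minimal sketch of what "recognise the lines of text" could look like, using the projection-profile idea: try a range of candidate angles, undo each one, and keep the angle at which the ink piles up into the fewest, fullest rows. This is an illustration under assumptions, not how any of the programs mentioned above work; the function name and the list-of-(x, y) input format are made up here, and a real page would first need thresholding to get the dark pixels.

```python
import math


def estimate_skew(dark_pixels, candidate_angles):
    """Pick the candidate angle (degrees) that best flattens the text lines.

    dark_pixels: iterable of (x, y) ink coordinates, y growing downward.
    A positive result means the lines drift downward from left to right.
    """
    dark_pixels = list(dark_pixels)
    best_angle, best_score = None, -1
    for angle in candidate_angles:
        slope = math.tan(math.radians(angle))
        rows = {}
        for x, y in dark_pixels:
            # Undo the assumed slope, then count ink per horizontal row.
            row = round(y - x * slope)
            rows[row] = rows.get(row, 0) + 1
        # Straight text gives a few very full rows, so the sum of squares peaks.
        score = sum(count * count for count in rows.values())
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle


# Synthetic page: five "text lines" 200 px wide, tilted by 2 degrees.
tilt = math.tan(math.radians(2))
page = [(x, round(20 * k + x * tilt)) for k in range(5) for x in range(200)]
print(estimate_skew(page, range(-5, 6)))
```

On the synthetic page the search recovers the 2-degree tilt. Because the score depends on many rows of ink lining up at once, an isolated page number or a blotch shifts it far less than it shifts a bounding box.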

------
petermcneeley
Still under-performs the fictional predicted rate of robotics in the 80s.
[https://tenor.com/view/johnny-johnny5-johnnyfive-reading-sca...](https://tenor.com/view/johnny-johnny5-johnnyfive-reading-scan-gif-12961036)

~~~
teraflop
There's also Vernor Vinge's "librareome" idea: chuck entire books into a
heavy-duty shredder, photograph the resulting shreds with lots of high-speed
cameras as they're blown through a wind tunnel, and finally reassemble images
of the original pages with software.

~~~
jacquesm
That's a very interesting idea, but I can see a whole pile of practical issues
with it. Still, top marks for out-of-the-box thinking; it's like shotgun
sequencing for books.

[https://en.wikipedia.org/wiki/Shotgun_sequencing](https://en.wikipedia.org/wiki/Shotgun_sequencing)

~~~
msandford
I think that's where the name comes from. I suspect that it's satirical to
point out the whole pile of practical issues with shotgun sequencing.

------
_underflow_
Anyone know of any commercial solutions for this that don't involve e.g.
having to disassemble the book and feed in a stack of pages? Ideally something
on Amazon but "prosumer" level stuff is good too.

I would love to have digital backups of my library and personal notebooks.

~~~
knodi123
I worked for a company that did this. We chopped off the spines to turn them
into looseleaf, which could be rapidly scanned in an automated fashion. Then
we disposed of the original materials and mailed our clients CD-ROMs of
high-res scans, optionally OCRed. Obviously there are downsides to this
approach, but for situations where it fits, it makes a ton of sense.

Please don't take this sentence out of context, but I kept a lot of severed
spines as souvenirs when I left that job.

~~~
Cyph0n
> I kept a lot of severed spines as souvenirs when I left that job

Brutal.

~~~
contingencies
Would be a fantastic way to outfit a bar or lounge. Severed spines with an
e-interface for content access. Maybe some kinda laser pointer based lookup.

~~~
jacquesm
Suddenly those fake bookshelves that people buy to look as if they're well
read make perfect sense. Have to re-calibrate.

~~~
DonHopkins
Speaking of spines and copyright issues:

In K W Jeter's excellent dark cyberpunk novel "Noir", intellectual property
theft is viewed as literally killing people by removing their livelihood, so
copyright violators were punished by having their still-living spinal cords
stripped out and made into high quality speaker cords in which their
consciousness is preserved, usually presented to the copyright owner as a
trophy.

"In the cables lacing up Alex Turbiner's stereo system, there was actual human
cerebral tissue, the essential parts of the larcenous brains of those who'd
thought it would be either fun or profitable to rip off an old, forgotten
scribbler like him."

[https://marzaat.wordpress.com/2018/01/27/noir/](https://marzaat.wordpress.com/2018/01/27/noir/)

>There’s a lot to like in the novel.

>My favorite section is the middle section where the origin of the asp-heads
is detailed via McNihil’s pursuit of a small time book pirate and the
preparation of the resulting trophy. The information economy did, in this
future, largely come to place. As a result, intellectual property theft is
viewed as literally killing people by removing their livelihood. Therefore,
death is a fitting punishment. McNihil, in his point by point review of the
origin of asp-heads, notes that even in the 20th Century there was the phrase:
“There’s a hardware solution to intellectual property theft. It’s called a
.357 magnum.”

>Actually it’s decided that death is too good and too quick for pirates.

>Their consciousness is preserved by having their neural network incorporated
in various devices. (Turbiner likes to use stripped down spinal cords for
speaker wire.)

>This sounds like a cyberpunk notion but, in other parts of the novel, Jeter
takes a swipe at such hacker/information economy/internet cliches as
information wanting to be free (McNihil destroys a nest of such net hippies)
or the future economy being based on information. Villain Harrisch sneers at
the notion stating that information can be distorted but atoms – and the
wealth they represent – endure.

>Still, his novel is chock full of the high-tech, low-life that characterizes
cyberpunk.

(I'd quote some more, but as a high-tech, low-life net hippie, I'm afraid of
having my nest destroyed and getting my spine ripped out!)

~~~
jacquesm
I suspect the writer to be showing a bit of his bias here. Wonder what his
stance was on librarians, death by being buried under books perhaps?

~~~
DonHopkins
K W Jeter was a good friend of Philip K Dick, and the character Alex Turbiner
had some similarities and might have been based on him! From the review I
linked to:

>A sort of Dick-like (in the sense of a largely ignored and prolific author of
paperbacks and lover of music) author and idol of McNihil shows up in
Turbiner. (Jeter wryly notes that authors were particularly “mean bastards” in
regard to copyrights.)

It's ironic and fitting that PKD has been reincarnated as a robot and new
versions of his mind and his work have been reconstructed by infringing on his
intellectual property rights with machine learning.

[https://www.theguardian.com/technology/2006/sep/14/copyright...](https://www.theguardian.com/technology/2006/sep/14/copyright.guardianweeklytechnologysection)

[https://www.vox.com/2016/6/1/11787262/blade-runner-neural-ne...](https://www.vox.com/2016/6/1/11787262/blade-runner-neural-network-encoding)

K W Jeter also wrote some authorized sequels to Blade Runner (the movie, not
the book Do Androids Dream of Electric Sheep).

[https://en.wikipedia.org/wiki/K._W._Jeter](https://en.wikipedia.org/wiki/K._W._Jeter)

------
padmabushan
[https://www.media.mit.edu/projects/reading-through-a-closed-...](https://www.media.mit.edu/projects/reading-through-a-closed-book/overview/)

Some nerds have done this ... scanning through a closed book.

------
ggm
In 1985 UCL had a camera, co-produced with the British Library, for
photographing rare tomes which couldn't be opened flat because of the risk to
the binding. It was a wedge/prism-shaped descending digital camera.

