https://archive.org/details/tabletopscribesystem (links to additional detail pages there)
https://motherboard.vice.com/en_us/article/jp5kjy/saving-hum... (a few years old)
We've found that high-cost implementations are less appealing to smaller sites and libraries (which may have mandates that books not be shipped away), and we can achieve high scanning rates through parallelism rather than single extremely high-throughput stations. Additionally, many books are much more complex, valuable, fragile, or simply not amenable to automatic methods whereas skilled scanners can move through them with ease.
If anyone here has interest in building their own, check out diybookscanner.org/archivist for the Archivist model, which I placed in the public domain and documented to the last detail.
Also, the TTScribe is based on our old Hackerspace scanner model, which was just a beta, not the newer and more thoroughly engineered Archivist which also has more permissive licensing, fewer components and more sensible lighting. If you want to get started, that's the place to start.
It looks as if the pages are turned and held in place by air after the 'thumb' on the right-hand side releases the next page. A long screw turns slowly, moving the 'thumb' down at a rate that releases about 4 pages every second; the system then captures the image the moment the page is fixed. Quite clever.
I love the creativity around solutions like these.
Before we close, we want to recognize the early work of Daniel Reetz and the DIY community. They saw a need for a DIY image capture device and developed a working prototype offered under an OpenHardwareDesign license - CC-BY-SA. You may find information about that here - http://www.diybookscanner.org
All of these events significantly predate my time with the Internet Archive, so I can't speak to them directly, but as with so many things this seems more likely to be a combination of unfortunate neglect and lack of resources than any kind of malice.
Unfortunately it was the Archive (through my work with them, when I worked there, and later, when they rebuilt the design in China) that unintentionally taught me that Open Hardware is a toothless thing. They did, in the end, "release" their "source files" - an .easm which requires Solidworks and doesn't give you the ability to change things*, but does give you a functional assembly to take measurements from. For the Archivist, I chose to share DXFs, STLs, and dimensioned drawings, which have a different set of problems, but at least have wide interoperability. All in all, it's an imperfect world, but a good one. Ultimately, I care far, far more about scanning books than I do about the specifics of their operations, so I've come to peace with all of it and feel good that they've taken it as far as they have. Scan on....." and their reluctant acknowledgement of where the idea came from, right at the bottom of the page (https://archive.org/details/tabletopscribesystem): "...Before we close, we want to recognize the early work of Daniel Reetz and the DIY community. They saw a need for a DIY image capture device and developed a working prototype offered under an OpenHardwareDesign license - CC-BY-SA. You may find information about that here - http://www.diybookscanner.org..."
Thank you so much for your work on the diy book scanner project!
Because manga content needs to be restored as completely as possible, especially scenes that span the spine, high-speed camera-based solutions like the one above, or others that focus on formal prints, don't work well. That has led to two major approaches in the trade over the decades since some people in Japan started this self-scanning thing around the 80s:
1) Destructively break the book into unglued pages (which also involves cutting off the glue with a heavy-duty paper cutter) and feed them to a scanner with an auto feeder (then try to glue the book back together, on a best-effort basis, if keeping it is vital).
2) Manually press each page on a flatbed scanner (an A3 scanner works better, since it allows vertically scanning the spine area, which usually has the greatest shadow) and scan each page using a predefined rough area, which therefore includes non-content areas as shown in https://imgur.com/a/2ITBlJg (left/right edge and spine).
Solution 1) works great in terms of efficiency but could be a pain for those who love to keep the book in their collection (though it ends up deformed anyway, and I myself might consider this when I get old and can't bear the effort). 2) costs a lot more time (on average, #1 takes less than 30 minutes per book and #2 takes at least an hour), especially for manually removing noise as shown above. I've tried some basic CV tools/scripts to automatically crop at least the edges (the middle spine can be troublesome and I'm fine with working on that manually), but they work really poorly, since the content itself varies so much that checking the pixel distribution is not a reliable way to determine the "edge".
Hopefully before the 2nd decade finishes, I'll either give in to destruction or find a perfect way to automate the noise-clearing process.
There are about 129,864,880 books in the entire world. The median length for all books is about 68,000 words, which roughly translates into 250 manuscript pages on average.
If you must know, it would take 247 years for one scanner to do its magic and digitize all the books ever written.
Ps: Pretty sure I screwed something up along the way; I'd appreciate it if you don't roast me for my math :v
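For anyone who wants to check the arithmetic, here is a minimal sketch. It uses the book count and page count from the comment above and assumes the "about 4 pages every second" rate mentioned earlier in the thread, running non-stop; the scanning rate the commenter actually used is not stated, so that part is a guess.

```python
# Back-of-the-envelope estimate for one scanner running around the clock.
BOOKS = 129_864_880            # figure quoted in the comment above
PAGES_PER_BOOK = 250           # figure quoted in the comment above
PAGES_PER_SECOND = 4           # ASSUMPTION: rate mentioned earlier in the thread
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def years_to_scan(books, pages_per_book, pages_per_second):
    """Return the years needed to scan every page at a constant rate."""
    total_pages = books * pages_per_book
    seconds = total_pages / pages_per_second
    return seconds / SECONDS_PER_YEAR


years = years_to_scan(BOOKS, PAGES_PER_BOOK, PAGES_PER_SECOND)
print(f"{years:.1f} years")  # prints "257.4 years"
```

Under these assumptions the answer comes out at roughly 257 years, so the 247 quoted above is in the right ballpark. Page-turning downtime, book changes, and operator hours are all ignored here.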
PS: The amount of information in the world roughly doubles every few years; I wonder how that is reflected in the number of books.
For some people, freedom to try to express themselves and/or freedom to hear others try to express themselves is far more valuable than whether or not the writing is grammatically correct and typo-free.
The 300k figure has been very consistent since the 1950s, and is even down somewhat from the 1970s / 1980s IIRC (US LoC annual report data).
So the total is only up by at most about 10-15m books, and if you're looking at traditional publishing, about 3m.
The quality of books published is another question.
There seems to be a lack of free software for efficiently tidying up images obtained from an ordinary, non-automatic scanner. Sometimes turning raw scans into a nice PDF takes much longer than the manual scanning, so making the scanning faster wouldn't be the first priority for improving things (though better hardware might give images that need less tidying up).
Rotating and skewing the image so that the lines of text are horizontal and the margins vertical doesn't seem to be a very hard problem, but I've not seen an easily available and easy-to-use program for doing it. If you end up GIMP-ing each page you can see how that takes longer than the scanning.
There are several programs that look for the bounding box of the text and rotate based on that, but they don't work very well: they get confused by page numbers, chapter headings, side notes, and random blotches. It would be better to recognise the lines of text, which is what you really want to be straight and parallel.
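One common way to key on the lines of text rather than the bounding box is a projection-profile search: try a range of small angles and keep the one at which the dark pixels pile up into the sharpest horizontal rows. The sketch below is only an illustration of that idea in plain Python, not the method of any program mentioned here. The function name, the angle range, and the input format (a list of (x, y) coordinates of dark pixels, which you would get by thresholding a scan) are all assumptions made for the example.

```python
import math


def estimate_skew_degrees(dark_pixels, max_angle=5.0, step=0.1):
    """Estimate the skew of text lines from dark-pixel coordinates.

    For each candidate angle, every pixel is sheared back by that angle and
    counted into a row histogram. Straight text lines give a few very full
    rows, so the sum of squared row counts peaks at the true skew.
    """
    best_angle, best_score = 0.0, -1.0
    steps = int(round(2 * max_angle / step))
    for i in range(steps + 1):
        angle = -max_angle + i * step
        slope = math.tan(math.radians(angle))
        profile = {}
        for x, y in dark_pixels:
            row = round(y - x * slope)   # row the pixel lands on after un-shearing
            profile[row] = profile.get(row, 0) + 1
        score = sum(count * count for count in profile.values())
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle


# Synthetic "page": five one-pixel-thick text lines skewed by 2 degrees.
true_slope = math.tan(math.radians(2.0))
page = [(x, round(y0 + x * true_slope))
        for y0 in range(20, 120, 20)
        for x in range(200)]
print(estimate_skew_degrees(page))  # should be close to 2.0
```

Because the score is driven by many full rows, a stray page number or blotch contributes little, which is the weakness of the bounding-box approach described above. A real tool would still need thresholding, the actual image rotation, and some handling for pages that are mostly illustrations.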
Does anyone in the UK offer a cheap book-scanning service using high-tech hardware? If not, why not? Is there just not enough demand for it to be worth getting the equipment? In other words, in the UK, at least, it's cheaper to pay someone to turn the pages?
I would love to have digital backups of my library and personal notebooks.
Please don't take this sentence out of context, but I kept a lot of severed spines as souvenirs when I left that job.
It was extravagant and fussy, but super handy for a bike commute. Scanning the loose pages would have been OK if that tech had been more friendly. 1000 pages, thin, flimsy paper and tiny print might still be a challenge. No tablets in those days.
In K W Jeter's excellent dark cyberpunk novel "Noir", intellectual property theft is viewed as literally killing people by removing their livelihood, so copyright violators were punished by having their still-living spinal cords stripped out and made into high quality speaker cords in which their consciousness is preserved, usually presented to the copyright owner as a trophy.
"In the cables lacing up Alex Turbiner's stereo system, there was actual human cerebral tissue, the essential parts of the larcenous brains of those who'd thought it would be either fun or profitable to rip off an old, forgotten scribbler like him."
>There’s a lot to like in the novel.
>My favorite section is the middle section where the origin of the asp-heads is detailed via McNihil’s pursuit of a small time book pirate and the preparation of the resulting trophy. The information economy did, in this future, largely come to place. As a result, intellectual property theft is viewed as literally killing people by removing their livelihood. Therefore, death is a fitting punishment. McNihil, in his point by point review of the origin of asp-heads, notes that even in the 20th Century there was the phrase: “There’s a hardware solution to intellectual property theft. It’s called a .357 magnum.”
>Actually it’s decided that death is too good and too quick for pirates.
>Their consciousness is preserved by having their neural network incorporated in various devices. (Turbiner likes to use stripped down spinal cords for speaker wire.)
>This sounds like a cyberpunk notion but, in other parts of the novel, Jeter takes a swipe at such hacker/information economy/internet cliches as information wanting to be free (McNihil destroys a nest of such net hippies) or the future economy being based on information. Villain Harrisch sneers at the notion stating that information can be distorted but atoms – and the wealth they represent – endure.
>Still, his novel is chock full of the high-tech, low-life that characterizes cyberpunk.
(I'd quote some more, but as a high-tech, low-life net hippie, I'm afraid of having my nest destroyed and getting my spine ripped out!)
>A sort of Dick-like (in the sense of a largely ignored and prolific author of paperbacks and lover of music) author and idol of McNihil shows up in Turbiner. (Jeter wryly notes that authors were particularly “mean bastards” in regard to copyrights.)
It's ironic and fitting that PKD has been reincarnated as a robot and new versions of his mind and his work have been reconstructed by infringing on his intellectual property rights with machine learning.
K W Jeter also wrote some authorized sequels to Blade Runner (the movie, not the book Do Androids Dream of Electric Sheep).
I wish I could afford one. Digitized versions of books in Indian languages are almost non-existent.
The hard part is not capturing an image of each page, but turning them automatically.
Some nerds have done this ...
scanning through a closed book.