Linear Book Scanner – Open-source automatic book scanner (2014) (linearbookscanner.org)
388 points by gorenb on Sept 17, 2023 | 84 comments



Hehe nice. There is a whole community about this topic at: https://diybookscanner.org/

Years ago I wrote a little tool in Java called bookbuilder, where you could turn the pages manually, take a photo, and then run an automatic process on all images to build a searchable PDF.

I used https://boofcv.org/, an impressive Computer Vision library in pure Java, still exists and it is pretty fast, too.

It was able to detect the page contour, deskew it, flatten the image and remove finger contours by matching the skin tone, then build a PDF with integrated invisible OCR Layer without any user interaction. I remember that I was working on line slope detection with some kind of watershed algorithm to improve the flattening part.

Fun project, I wonder if I have the source code lying around somewhere... even the download page is gone today. This was long before I went open source with all of my little side projects, because I never thought it could be interesting for someone else :-)
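For readers curious what "remove finger contours by matching the skin tone" might look like, here is a minimal sketch. It is not the bookbuilder code and does not use BoofCV; it assumes a common fixed-threshold RGB skin heuristic, and the names `is_skin_rgb` and `mask_fingers` are hypothetical.

```python
def is_skin_rgb(r, g, b):
    """Rough fixed-threshold RGB skin test (an assumed heuristic, not the original tool's rule)."""
    return (
        r > 95 and g > 40 and b > 20
        and max(r, g, b) - min(r, g, b) > 15
        and abs(r - g) > 15
        and r > g and r > b
    )


def mask_fingers(pixels, fill=(255, 255, 255)):
    """Return a copy of a 2D grid of (r, g, b) tuples with skin-like pixels replaced by `fill`."""
    return [
        [fill if is_skin_rgb(*px) else px for px in row]
        for row in pixels
    ]


# Tiny hypothetical "image": paper, ink, one skin-coloured pixel, paper.
page = [
    [(250, 248, 240), (30, 30, 30)],
    [(224, 172, 105), (250, 248, 240)],
]
cleaned = mask_fingers(page)
```

A real pipeline would probably restrict the mask to regions near the page edge, since strongly yellowed paper or colour illustrations can also pass a threshold test like this.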


RE "...There is a whole community about this topic at: https://diybookscanner.org/..." The actual book scanner forum is down at present. It has been down for a few weeks. We are working to get it back up.


We need a little help bringing the forum back online. If anyone wants to help and has some solid PHPBB experience, send me an email - danreetz@gmail.com


The store seems to be down as well. Also some of this hardware is quite old such as the model of digital cameras used. Do we still really buy old point and shoot cameras? Would something like pixel phones and their excellent cameras do better in 2023?


I've been interested in book scanning for a while. I think I remember seeing your software! I built one of the early versions that I saw on the site, and then happened to meet Jonathon Duerig and do some waterjet cutting for him a few years ago.

When I saw the linear bookscanner the first time, I realized it was upside down. If it were suspended, and able to move itself, then all the issues about not knowing the mass of the book, and dealing with friction could be avoided. Counterweights could keep the force constant, and it would be moving a known mass when scanning any book.


BoofCV is incredibly cool, thanks for sharing! It's unfortunate I didn't know about it before today, I would've definitely invested time to learn and hopefully contribute back (there's still a chance, but at this point it seems to cover everything I've ever heard of, and more).

I just spent 30 minutes clicking through and inspecting every example.


Check the android test app. Very cool stuff to see there.


Please make it available on github!


Well, if I find the time I'll try to find the sourcecode, but I'm not sure in which state it is :-)


There's a somewhat similar commercial scanner [1] [2], with a V-design as well but inverted to scan from the top. Much gentler on the books as it's the scanner that moves, not the books themselves. Super happy to see someone develop an open-source alternative!

[1] https://www.treventus.com [2] https://youtu.be/SdipuAuWsEs?si=dFWRtva5gO2oM91o


It's probably expensive even to lease/rent, or to use their digitization service. Why? Because they have no pricing info available for those two options either. Just an inquiry form to fill out.


Right, no public pricing usually means rates far above what personal or low-budget projects can afford.

Not sure how much of the design is protected and how much inspiration one can take for a non-commercial DIY project like the one presented by OP.


I remember contacting them about 10 years ago... Can confirm it was way out of my price range.


I want to guess something like 80k USD for a system? Wild-ass guess. Am I close? Laughably low?


Which is discussed in the Google talk associated with the OP project here:

https://youtu.be/4JuoOaL11bw?si=do1qet5Kq_WErQgz&t=162


A DIY of the same idea, as linked by others: https://www.diybookscanner.org


Um, I'm not sure this is a great alternative, as this scanner chops out a page for every scan. After that, I don't think it really matters how gentle it is...

Unless you're concerned with binding the pages to form a new book. I think that would be possible with the leftovers.


I think you are mistaken. Neither of the machines appears to chop pages out.


You're right! My mistake. It looked like there was a sort of deli-slicer blade and suction to remove pages, but, looking again, it's not what's happening. That's what I get for posting pre morning coffee...


I love the idea, although the risk of torn pages is mildly concerning for archival purposes or valuable books. Though if that were the case, I'm sure scanning by hand would be preferred anyway. I've often wanted a device like this for the purpose of digitizing my excessively large collection of books.

Regarding frequency of torn pages in the FAQ:

> Prototype 1 could scan the majority of books without damage, but may tear one or two pages in some books. Out of 50 books tested, 45% had one or two of their pages either torn or folded. This is a very early prototype and there are many areas for improvement in the design.

In my opinion, this is mostly acceptable. Especially if a future revision reduces the 45% to somewhere around the ~10-20% range. If I had the space for a device like this, I would definitely consider building one.


For a while, the Internet Archive built a book scanner that rests the book in a V-shaped cradle. A volunteer turns pages by hand and lowers/raises a pair of glass panes that gently press the pages for imaging by a pair of DSLR cameras on an angled mount. The whole assembly isn’t automatic, but can be easily operated by hand.

https://blog.archive.org/2021/02/09/meet-eliza-zhang-book-sc...


Interestingly, the Internet Archive copied the open-source design of their scanner from the site https://diybookscanner.org/ (as they are allowed to by its open-source licence). The Internet Archive then effectively refused to release any details back to the community. After a lot of “pushing”, the Internet Archive did acknowledge that their design was based on one from the site bookscanner.org. This would have been very disappointing, in my opinion, for the designer of the scanner, who released the info as open source. At one time the Internet Archive sold this scanner to organizations for $10K; I think the price has now dropped to a few thousand.


I'm not sure if one of those machines was the cause, but I've seen far too many old books on archive.org which have pages that appear to have been torn by the scanner; thus I doubt they're manually doing it.


You said "for a while"; are they not using these machines anymore?


I imagine lawsuits like this[0] caused them to slow down unfortunately.

[0] https://en.m.wikipedia.org/wiki/Authors_Guild,_Inc._v._Googl....


If you haven’t already seen this site, it’s well worth a visit:

https://www.diybookscanner.org/en/intro.html


I made the cardboard scanner many years ago both to scan whole books as well as sections of books from the library. It worked pretty well but nowadays in practice I usually look into the open library at the Internet archive first. Hope it sticks around.


Great link, thanks!


> In my opinion, this is mostly acceptable.

Not if the book is irreplaceable, e.g. old and out of print. No torn pages are acceptable.

And before computers, there was no electronic copy.


Ah yes I should have said something more like “in my usecase”, which is just to digitize ordinary texts that are still in print but that occupy too much space in my home to justify keeping!


This is a problem domain where software hasn't caught up with what is possible, so people do in hardware what could be done in software.

With two or more photos or a stereo image (new iPhone?) one could triangulate to infer a flattened page, and produce images that look like they came from cut pages in a flatbed scanner. Now just pay someone well in Ethiopia to carefully turn pages without damage.

As any researcher can attest, our digital libraries now hold a century of scanned work of questionable quality. AI could infer scans indistinguishable from an outline font format original on an 8K monitor.

I once helped consult on the 1980's font wars, turning old formats and digital scans into Postscript and TrueType fonts. This was hard then, but will soon be understood as the "correct" way to scan text, when software catches up.

For the scientific literature, we need a ChatGPT equivalent to reconstruct LaTeX source that can reproduce each page. (We really need a successor to LaTeX that isn't such an arcane language, and can author fixed and flowable text with equal ease.)


This is very much a physical problem and there aren't too many shortcuts you can take.

I've been part of a preservation project and scanned a LOT of magazines. Generally, it doesn't matter much if you place it on a flat bed or not. Flipping the page manually takes a while either way.

There's already a known way for scanning magazines and books very fast: cut the spine and feed the pages to an automatic scanner. This is of course not applicable to anything you'd like to keep around after scanning, because your copy is destroyed.

All in all, the best way to automate scanning without destroying the item, will have to combine a top level camera with a machine to turn pages. I believe this is what was going on in Google's massive scanning project.

Maybe using x-ray could work for "scanning" some books without having to turn the pages. But I suppose there'll be a new set of problems to solve there.


I recently saw some work from the University of Kentucky on reading the Herculaneum scrolls. These scrolls were carbonized by volcanic activity (Pompeii?) and obviously can't be unrolled without disintegrating. They used some interesting CT (xray) scanning plus machine learning to distinguish the carbon-based ink from the mostly carbon substrate and retrieve legible text.

Of course, that only gets you the printed text. You might lose notes and doodles in the margins, or other physical evidence. But, it's certainly promising for works that are too delicate to physically open and inspect


Printed text on _scrolls_?


RE "....There's already a known way for scanning magazines and books very fast: cut the spine and feed the pages to an automatic scanner. This is of course not applicable to anything you'd like to keep around after scanning, because your copy is destroyed......" I've always thought the pages could be rebound. Not perfect but a halfway solution.


Most signatures (https://en.wikipedia.org/wiki/Section_(bookbinding)) are glue-bound, not just sewn. Some use cases prefer/require the pages to be unbound because the printing goes all the way into the gutter and cutting the spine can also leave out some data. It's highly inefficient, as you have to heat the spine carefully and then remove the glue residues. A tiny glue leftover can smear your autofeed scanner, if not completely jam it and tear the page. For a unique item, it makes sense to use a non-destructive scanning method, but for anything else, a carefully cut spine (or better yet, a bookbinding plow https://duckduckgo.com/?t=palemoon&q=bookbinding+plow&iax=im... ) leaves a perfect cut and the loose pages can be kept in a ziplock bag for any future reference.


I guess you could, but everything where you would even consider cutting the spine is probably not worth fixing afterwards. E.g. it's a contemporary magazine, where you could just buy 2 if you really need to keep a physical copy.


Cutting the spine and feeding it to the scanner is the most cost effective method for most modern books.


How is setting up infrastructure to exploit third world labor a "software problem" exactly?

I think the problem isn't that software can't do de-warping well, it's that by the time you set up everything for book scanning you might as well use a setup that doesn't need it.


"Exploitation" is an arguable word, when without outside intervention there would be no employment above a few cents a day. Any employment by those who can pay even a dollar a day could instantly ratchet an entire family out of abject poverty.


You've also got to wonder about the idea behind it: that it's supposed to be trivial to send a large number of books to Ethiopia and back without damaging them.


Shipping is cheap, generally.

Even for relatively high-mass objects such as books. Slow boats are slow but exceedingly efficient.

The main risk would likely be container loss off a ship. Possibly environmental damage if spending much time in warm humid climates.


The source article sends these devices to scan books in Ethiopia, where there are people looking for work.


That is actually a good thing for the local people (in Ethiopia) as well. I would have loved to take on a project like this.


Employing somebody in the third world is a collaboration, not an exploitation.


Software was there in 2017 already. I've worked on a device for blind people that did basically this (except there's no need for a stereo image).

Here you can see how flattening works (this was handled by the library, we didn't need to do any custom code):

https://youtu.be/DPu0iJtK2sI?t=1542

There's also a feature where it tells you to turn the page, detects that it has been turned, takes a photo, etc. And in the background it flattens, splits into pages and OCRs the photos. With a little practice you can scan and OCR a whole book at 1-5 seconds per page.

https://youtu.be/DPu0iJtK2sI?t=1909

Then it saves the OCRed book and it can read it to you whenever you like.


Regarding your point about a successor to LaTeX: https://typst.app/ is turning out to be great.


No. Absolutely NO to AI. The last thing we need is for text to be changed in subtle ways by the digitisation process.

https://news.ycombinator.com/item?id=29223815

That risk of "plausible but incorrect" has been a concern even before AI.


Software absolutely already has page flattening capability. Labor is not as trivial as you make it sound.


The Fujitsu ScanSnap SV600 has that, as do some Czur book scanners. Reliably turning pages without damaging the book is the tricky bit.


Most OCR libraries have that feature AFAIK.


This comment is silly because of course you need to turn the pages and at volume paying people to do so is prohibitive.

On the software side though, progress marches on: https://facebookresearch.github.io/nougat/ is downloadable and great.


Aren't you losing information in the parts that aren't perfectly straight? Yes, you can stretch those to recreate the original layout, but that would come at the cost of resolution in the interpolated sections of the page. Granted, not a problem for most books, but probably a reason people are still looking for mechanical solutions to the problem.


I think the suggestion is that with AI you can interpolate to the actual letterforms, not to pixels.

Working a typical volume the letter “e” will appear hundreds of times and be identical, so there should be lots of data to help resolve ambiguities in the poorer parts of images.

Not to mention data that can be used across volumes.
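As a toy illustration of why repeated letterforms help, the sketch below takes several degraded bitmaps of the same glyph and recovers a cleaner one by per-pixel majority vote. This is plain averaging, not AI, and it assumes the glyph instances have already been segmented and aligned to the same grid; `average_glyph` and the 3x3 bitmaps are hypothetical.

```python
def average_glyph(samples, threshold=0.5):
    """Average same-sized binary bitmaps of one letter, then re-threshold each pixel."""
    rows, cols = len(samples[0]), len(samples[0][0])
    n = len(samples)
    return [
        [1 if sum(s[y][x] for s in samples) / n >= threshold else 0
         for x in range(cols)]
        for y in range(rows)
    ]


# The "true" shape, and three scans of it that each have one wrong pixel.
clean = [[0, 1, 0], [1, 1, 1], [0, 1, 0]]
noisy_a = [[1, 1, 0], [1, 1, 1], [0, 1, 0]]  # spurious pixel, top left
noisy_b = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]  # dropped pixel, centre
noisy_c = [[0, 1, 0], [1, 1, 1], [0, 1, 1]]  # spurious pixel, bottom right

recovered = average_glyph([noisy_a, noisy_b, noisy_c])
```

With hundreds of instances per letter the vote gets more reliable, provided the errors are independent. Systematic warping near the gutter would not average out this way.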


If the goal is to ultimately OCR then it's moot. But yes, of course information is lost.

That being said, modern phone cameras are going to produce "scans" above 300 DPI, and while 600 DPI or higher might be tricky, they're still possible if you take partial shots of a document, assuming you can focus that close.

What you lose in quality you make up with convenience, I suppose.
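A back-of-the-envelope check of the DPI claim above, assuming a hypothetical 12 MP sensor (4000 x 3000 pixels) and a US Letter page that exactly fills the frame. The function name and all the numbers are illustrative assumptions.

```python
def effective_dpi(pixels_long, pixels_short, page_long_in, page_short_in):
    """Pixels per inch when the page fills the frame; the tighter axis is the limit."""
    return min(pixels_long / page_long_in, pixels_short / page_short_in)


# Whole US Letter page (11 x 8.5 in) in one shot.
dpi_full = effective_dpi(4000, 3000, 11.0, 8.5)

# One quarter of the page (5.5 x 4.25 in) per shot, i.e. four partial shots.
dpi_quarter = effective_dpi(4000, 3000, 5.5, 4.25)
```

Under these assumptions a single shot lands at roughly 350 DPI and quarter-page shots clear 600 DPI. This ignores lens sharpness, demosaicing, and perspective, so real resolving power will be lower than the pixel count suggests.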


> For the scientific literature, we need a ChatGPT equivalent to reconstruct LaTeX source that can reproduce each page. (We really need a successor to LaTeX that isn't such an arcane language, and can author fixed and flowable text with equal ease.)

Check out Nougat: OCRing scientific papers with a deep net trained end to end. It was released by Meta a few days ago.

“PDF format leads to a loss of semantic information, particularly for mathematical expressions. We propose Nougat (Neural Optical Understanding for Academic Documents), a Visual Transformer model that performs an Optical Character Recognition (OCR) task for processing scientific documents into a markup language, and demonstrate the effectiveness of our model on a new dataset of scientific documents.”

https://facebookresearch.github.io/nougat/


What about very old books, or handwritten books, or books with images in them?

The only valid archival approach has to be taking a good photo / scan while the page is as flat as possible.

Every further processing can be done later based on that, as a separate step... as technology advances.

Also carefully FLIPPING THE PAGES is literally the whole problem. Everything else can be solved by lowering a glass pane on the book and automatically taking a photo from above.


> With two or more photos or a stereo image (new iPhone?) one could triangulate to infer a flattened page

Especially if you project a grid over the pages using e.g. a scanning laser.
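The triangulation idea can be sketched with the standard pinhole stereo relation, depth = focal length x baseline / disparity. The sketch assumes two rectified, parallel cameras; the rig numbers and the function name are hypothetical.

```python
def depth_from_disparity(focal_px, baseline_mm, disparity_px):
    """Distance from the camera pair to a point, for rectified parallel cameras."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_mm / disparity_px


# Hypothetical rig: focal length of 3000 px, cameras 100 mm apart.
focal_px = 3000.0
baseline_mm = 100.0

# Disparities (in pixels) measured for three matched points across one page,
# e.g. grid intersections from a projected pattern.
disparities = [750.0, 760.0, 780.0]
depths_mm = [depth_from_disparity(focal_px, baseline_mm, d) for d in disparities]

# Height of each point relative to the farthest one gives the page's curvature profile.
relief_mm = [max(depths_mm) - z for z in depths_mm]
```

A projected grid helps because blank paper has few natural features to match between the two views. The depth profile is what a dewarping step would use to unroll the page onto a plane.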


It is interesting that you mentioned Ethiopia. I live in Ethiopia and am hoping to build a machine like that. lol.


> As any researcher can attest, our digital libraries now hold a century of scanned work of questionable quality

I keep thinking so much collaborative potential is not being utilized. Imagine each (unique) Google Book is basically an editable wiki where people can directly correct OCR errors as they come across them (with an associated Talk page where they can give explanations, etc)


Read The Atlantic article I referenced elsewhere:

https://www.theatlantic.com/technology/archive/2017/04/the-t...

The insurmountable problem is that Google doesn't have, and can't get the rights to distribute all these books. Even regular Google employees can't see them (and I tried when in Legal there).

So this is neither a hardware problem nor a software problem. Unfortunately.


You are right! I was really thinking more about books published before the 20th century (which constitute the majority of my own usage of Google Books), and I had in mind something like a collaborative WikiSource integrated into Google Books.

Very interesting read. Thanks for sharing it.



While they are certainly safer, they aren't automatic and require someone to turn the pages, which is what this project attempts to solve, although it certainly should be done in a gentler manner so as not to damage the books.


Just curious: once we scan, we have all the contents in digitized format. So why is unbinding a book into pages before scanning not a scalable model? Is it to avoid the additional work of unbinding?


Archivists typically don't like destroying a work when preserving it. Scanning a book also only gives you an optical image. In some cases, like medieval manuscripts, the pages may have been erased and written over. If we simply scan the book and destroy the physical copy, we've lost that evidence.

But again, for cheap mass-market works, most archivists probably won't care about destroying one copy out of a million to preserve the work. It's really only a problem for very old and very rare works


Unless the book is super valuable then it is often easier to just use a guillotine to slice off the binding completely and feed the pages through a sheet-feed scanner.


"easy" is not the goal if there are no other copies, or very few. It doesn't have to be "super valuable" for cutting to be unacceptable.


This seems like it’ll shred any book that’s even slightly damaged..


I also cringed at the look of this. I have a bunch of old books I would like to digitize, and most bookscanners I come across require you to at least spread open the book completely, which would, as you say shred them... And they don't even involve vacuum, movement across sharp edges, and stepper motors...

Oooh.. Second question in https://linearbookscanner.org/faq/...

> Out of 50 books tested, 45% had one or two of their pages either torn or folded

I would not use this even on my less valuable books.


Also, the phrase "out of 50 books tested, 45% had" sounds like you want someone to mistake that for 90%.
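Spelling out the arithmetic behind the two readings of that FAQ sentence (only the 50 and the 45% come from the FAQ; the variable names are mine):

```python
books_tested = 50
reported_fraction = 0.45

# Literal reading: 45% of the 50 books had torn or folded pages.
damaged_books = books_tested * reported_fraction

# Misreading: "45 out of 50" books damaged.
misread_fraction = 45 / books_tested
```

The literal reading gives 22.5 books, which is not a whole number, so the 45% is presumably rounded (22 or 23 books). The misreading would be a 90% damage rate.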


It definitely would. And probably some that are not damaged when something gets jammed or the environmental conditions change and react with the paper. But if you’ve got a mountain of books to scan and you can do the bulk of them with a machine like this and then use a more careful approach for the special cases, that’s a win.


Google already did all the books, to a first approximation. Problem was copyright and owners of it.

They used low paid labor to flip the pages. I'll send a link later if no one beats me to it.

Back home: it's https://www.theatlantic.com/technology/archive/2017/04/the-t...


The device in the post was actually developed at google: https://linearbookscanner.org/designs/


That's not what they mostly used, though. It was an actual workstation, with foot pedals.


Sure, but saying "Google did all the books" seems to imply there's no reason to make a scanner. But Google made this scanner specifically because they were interested in "doing all the books", even if they didn't end up using it for that purpose.


This looks absolutely fantastic. Not only can you digitize your books, but it also shreds them for you for free!

Pretty sweet 2 for 1 deal.


I am allergic to dust mites. After about six months on a shelf, I can't touch a book without getting an allergic reaction. Simply dusting or even vacuuming the exterior doesn't work. This gizmo looks like something I can use to rid a book of dust mites so I can read it.


Of course this would be possible with DRMd books. One could, hypothetically of course, use e.g. some kind of script to automatically turn the page of the e-reader, screenshot it, then use image recognition to convert to epub, etc. to archive them.


This is cool, I can easily scan my books at home and use them to train my own LLM! Super me!


I tried it but did not find it useful.


Fantastic concept! I agree with the concerns about damaged pages. Perhaps this is something that could be easily improved.


tl;dr a DIY system for scanning books. Basically, build a triangular prism with a special zig-zag slit for a single page to snake through. Put a vacuum on each side to pull the paper through the slit. Put two optical scanners in the middle of the slit, to scan both sides of the sheet as it hangs down. Attach a motor to move a sled that pushes just the top, then the bottom, of the book. The site features six designs. Some include videos of operation [0].

[0] https://www.youtube.com/watch?v=84byulcC6i4 30 seconds long

In the fullness of time, maybe I would make one of these, given that I live in an apartment and not a house with space to construct/store this scanner. I check etsy every couple of years and haven't seen someone offer a kit. I use 1dollarscan, though they've had to restrict their offering as Pearson, et al notice their existence.


just saw the binding off and push it through a scansnap or the like



