Ask HN: How can I automatically scan and catalog a mountain of books?
275 points by cconcepts 19 days ago | 135 comments
This really kind, eccentric guy in my neighbourhood is stockpiling books and has been doing so for years. He has an enormous barn that he is obsessively filling with whatever reasonable-quality books he can get his hands on, but he is completely overwhelmed in terms of cataloging/indexing them, so customers have to go through his barn sifting through cartons full of books. He charges $1 or $2 for whatever book you dig out.

He buys bulk lots from deceased estates and bookstores that are closing down. Entire shipping containers are being gifted to him and showing up at his barn. The barn is full and he is now storing in shipping containers outside.

There are great-quality books among this quagmire but it takes hours of searching to find them. I figured HN might be able to point me to a solution where I could quickly photograph the front cover and have a script/google images compare the image to online info to index the title and author and then perhaps list them online...

I dunno, it just seems like such a treasure trove of books that he will sell for practically nothing because he loves books and hopes that they will find their way to people who want them - the barrier is allowing customers to find what they are looking for.


Please, please tag each batch of books with a unique lot number, so they can be associated with a specific estate or deceased bookstore. One or more humans spent a lot of time curating those collections. If one of the lots was well curated, then anyone who finds a book that they like will want to see the other books in that lot.

Source: people who have spent years trying to find the names of the 8,000 books in R.A. Lafferty's personal library, lost after he died. About 300 title names have been recovered, https://www.ralafferty.org/tulsa-books/

This would also be good when you inevitably find letters or photographs tucked between pages.

One of these days I need to write my essay titled "Rubbish has no SKU".

I've seen a few of these, and the basic minimum difference between "pulp waiting to happen" and "bookshop" is basic shelving. Different shelves by category: fiction vs non-fiction and their subdivisions. Within the shelves, alphabetise. Now it's possible for browsers to actually find things. When you put them on Amazon this will also help find them for shipping.

This process will also help you find the stacks of duplicates. You'll have a crate of 50 Shades and Twilight and Stephen King. The Stephen King will eventually resell; the others won't.

This page from the excellent Barter Books on their acceptance policy may be of some help: https://www.barterbooks.co.uk/html/About%20Us/Incoming%20Boo...

> Different shelves by category: fiction vs non-fiction and their subdivisions. Within the shelves, alphabetise.

I worked on a similar project, with thousands of books to organize. We did just that, because it seemed common sense. We had a couple of issues:

1. We found that sorting books by category took quite a lot of time. For every book we had to spend a couple of seconds deciding in which category it should go. By the time we started wishing we had put them all together, sorted only by alphabet, it was already too late.

2. Shelving books alphabetically is a problem when they keep coming and going. How do you know how much space to leave on the shelf for a specific letter? Maybe now you have 10 books by Stephen King, but the next container will have 50 more. If you don't leave enough space you'll have to shift all the books after that letter.

> If you don't leave enough space you'll have to shift all the books after that letter.

This is what card files were originally invented for— number all the shelves, put your new book wherever there’s room, and put a card with the title and shelf number in a sorted drawer. When you get rid of the book, get rid of the card.
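For anyone who wants to see the card-file idea in code, here is a minimal Python sketch. It is only an illustration under my own assumptions: the `CardFile` class and shelf names like `shelf-12` are made up, and it keeps one card per title, so duplicate copies would need a list of shelves instead.

```python
from typing import Dict, List, Optional


class CardFile:
    """A toy 'card file': one card per title, pointing at a numbered shelf."""

    def __init__(self) -> None:
        self._cards: Dict[str, str] = {}

    def add(self, title: str, shelf: str) -> None:
        # File a card when the book goes on whichever shelf has room.
        self._cards[title.casefold()] = shelf

    def remove(self, title: str) -> None:
        # Throw the card away when the book leaves the building.
        self._cards.pop(title.casefold(), None)

    def find(self, title: str) -> Optional[str]:
        # Look the title up without caring about capitalisation.
        return self._cards.get(title.casefold())

    def titles(self) -> List[str]:
        # The "sorted drawer": titles in alphabetical order.
        return sorted(self._cards)


cards = CardFile()
cards.add("The Sands of Mars", "shelf-12")
cards.add("Carrie", "shelf-03")
print(cards.find("the sands of mars"))
```

The point of the design is that the shelves never need re-sorting; only the index stays in order.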

That works well for closed stack collections. Open ones, not so much.

At a local LP shop, they had a sign saying "If you misshelve a LP, you might as well steal it, at least someone will listen to it" (approximate translation).

Shelving is instead central to many specialty shops!

Yes, but there is something to be said for wading through mountains of junk to unearth something notable; I think many book nerds connect to that feeling.

One banana box full of random books: sure. Fun to dig into and see what's there.

Twenty unsorted banana boxes: every book just becomes a blur (as in, “Oh great, another Danielle Steel novel, and another bible, and another Excel 97 for dummies, …”).

I’m not the world’s biggest bookworm but I go to Goodwill almost every week and look through the used books for sale. The most fun part is browsing because I never go in knowing what I’m looking for.

That is only fun because they already cataloged them for you. They made a little collection "used books for sale" with just the right size for you to enjoy browsing through. Now imagine you run into a barn full of books, a couple of containers outside, unsorted, all labeled "used books for sale" and no way to tell which were there last week and which weren't. At best you can visit 1% of the books this week, and next week the same 1%, or maybe another. That's what the issue is here.

Most people do not enjoy that more than once or twice.

While I largely agree with your sentiment, I'd like to note that alphabetizing only assists those who know what they are looking for.

In a bookstore, this is a virtue. But we appear to be dealing with a book barn. Perhaps the patrons of book barn have not wandered in by accident while searching for a bookstore :)

For OP: I think you might be better off photographing the ISBN and then using a service or script to do a lookup and associate that with a cover and title. Various editions might not have their covers recorded in a database, and titles will give many dupes, but an ISBN will uniquely identify a book.
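If you do go the ISBN route, a cheap first filter is the check digit, which catches many misreads before any lookup happens. A minimal Python sketch follows; the function names are mine, and it only checks the arithmetic, not whether the number was ever assigned to a book.

```python
DIGITS = set("0123456789")


def clean(isbn: str) -> str:
    # Drop the hyphens and spaces that printed ISBNs usually contain.
    return isbn.replace("-", "").replace(" ", "").upper()


def is_valid_isbn10(isbn: str) -> bool:
    s = clean(isbn)
    if len(s) != 10 or not all(c in DIGITS for c in s[:9]):
        return False
    if s[9] not in DIGITS and s[9] != "X":
        return False
    values = [int(c) for c in s[:9]] + [10 if s[9] == "X" else int(s[9])]
    # Weights run 10, 9, ..., 1; the weighted sum must be divisible by 11.
    total = sum(w * v for w, v in zip(range(10, 0, -1), values))
    return total % 11 == 0


def is_valid_isbn13(isbn: str) -> bool:
    s = clean(isbn)
    if len(s) != 13 or not all(c in DIGITS for c in s):
        return False
    # Weights alternate 1, 3, 1, 3, ...; the sum must be divisible by 10.
    total = sum(int(c) * (1 if i % 2 == 0 else 3) for i, c in enumerate(s))
    return total % 10 == 0


print(is_valid_isbn10("0-306-40615-2"), is_valid_isbn13("978-0-306-40615-7"))
```

Anything that fails the check can go straight to the "enter by hand" pile rather than wasting a lookup.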

Additionally, OCR'ing stylized text is problematic. I wouldn't expect easy reading of covers, particularly of used books.

The challenge is, if the catalogued book is not immediately associated with a location where it can be retrieved later, all is in vain...

You need to stick a barcode on the location (shelf, box, etc) like they do in a warehouse. Scan the book, check it is correct, scan the location to book it in. Doesn't have a barcode, put it in another area for later. I wrote a sketch of exactly this once...
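This is not the commenter's sketch, which isn't shown here; it is just a minimal Python illustration of the "scan book, then scan location" flow. It assumes location labels are printed with a `LOC-` prefix so they can be told apart from book barcodes; that prefix, the `book_in` name, and the `UNASSIGNED` bucket are all my own inventions.

```python
from typing import Dict, Iterable, List

LOCATION_PREFIX = "LOC-"  # assumed label convention, not from the thread


def book_in(scans: Iterable[str]) -> Dict[str, List[str]]:
    """Group scanned book codes under the location code scanned after them."""
    shelved: Dict[str, List[str]] = {}
    pending: List[str] = []
    for code in scans:
        code = code.strip()
        if not code:
            continue
        if code.startswith(LOCATION_PREFIX):
            # A location scan books in everything scanned since the last one.
            shelved.setdefault(code, []).extend(pending)
            pending = []
        else:
            pending.append(code)
    if pending:
        # Books scanned but never given a location still need attention.
        shelved.setdefault("UNASSIGNED", []).extend(pending)
    return shelved


scans = ["9780306406157", "LOC-A1", "9780804429573", "LOC-B2"]
print(book_in(scans))
```

Most USB barcode scanners act as keyboards, so the `scans` list could simply be lines read from standard input.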

Ish. It would help with inventory anyway, by comparing data at point of scanning and point of sale - “we sold the last Stephen King, sorry”, “we do have a Twilight recorded, so it’s there, but you’ll have to find it...”

I wonder if OP has considered the 'next' step: what to do with the referenced data. Is it just to catalog for ease of search, or to help identify potential jewels, keepers, and junk and then price appropriately?

The answer to this is important in order to properly size the effort. If the goal is to improve the business efficiency of the store, then it should be seen from an ROI standpoint. Even fellow customers/rummagers could be engaged with the right incentive and tools. Otherwise, the next estate books container shipment will negate the gains of ordering.

Basically, is OP ready to overhaul the operations or is just willing to do something nice just for now?

Ideally, a book cover and info page should be scanned/photographed on first touch, either by the receiver or a shopper, and the book sticker-coded somehow as processed, then left wherever. If OCR uncovers any jewel-worthy titles, the book could be located (by sticker, date log, crate, whatever), brought to prominence, and priced as appropriate.

Letting imagination run, as an incentive and QC strategy: some sort of automated OCR and lookup could find the book in question on Amazon or elsewhere, and so show reviews and the going price against the rummager's deal.

Either way, I'd see this more of a business question, rather than a technical one. Donating a technical solution is fun, but without changing the operations process it is not going to be sustainable.

Oh, I am very pleased to see this request, and I may have some actual help for you.

A number of years ago a west coast startup spent quite a lot of time on a product that could identify books by their spine image, which I think is what you want here; finding ISBNs and barcode scanning them is totally impractical at this scale.

A few months before they closed up shop, I introduced them to Brewster Kahle at the Internet Archive and convinced them to leave a copy of their database with the Internet Archive.

I have no idea what happened after that, but I believe they did send the data over. Machine learning is vastly different today than when they launched, and even back then they had enough data that they could get 10/12 of my books in a single photo; I really encourage you to get in touch with the archive.

The company was called BitLit, then Shelfie.

As a side note I got interested because I thought it would be great to get a spine image as an api for rendering my ebooks as a library in vr/ar - I still think this would be cool.

likely the dataset never became publicly available


After installing the Kobo app, there's no sign of the Shelfie technology.

This dataset would be useful if it can be extended. Do you recall how they seeded the collection of spine images?

> the barrier is allowing customers to find what they are looking for.

I am really glad that there are people like the old man who are willing to do stuff like this and people like yourself who are willing to help.

The real barrier, I think, is a bit more complicated than just being able to find stuff. It is also the fact he will be running out of space and that as more and more people find what they want the undesirable stuff (that no one wants) will just keep growing. There does need to be regular culling, I think, to keep weeding out the duplicates or books that no one wants. Also, there needs to be some effort to discover and sell the really valuable books which could produce occasional windfall funds to keep the endeavor going.

"The Book Thing" in Baltimore (https://bookthing.org/), which I have visited many times, seems to be tackling this problem. It's basically a "free book" exchange. Massive. In a warehouse. It is a fairly popular place and is run by an interesting eccentric fellow with very particular ideals. I would recommend seeing how they do this stuff.

As for thoughts, I think that regardless of what he does, he will need one or more employees (or dedicated volunteers) to actually perform the indexing and physically organize the books.

I would check out https://www.librarything.com/ also. They have a decent app for scanning barcodes and retrieving data from multiple sources: their own database (which I believe consists of lots of imported MARC records from university libraries), the Library of Congress, Amazon, etc. They also have another project, librarycat, where you can set up your books as a lending library.

Cataloging a large number of books is not going to be an easy process unless they are all relatively new popular books. According to LibraryThing my library is 439 books; every few years I delete my catalog and re-import them, which takes about a full weekend. Older books don't have barcodes, and old paperbacks have the ISBN barcode on the inside cover. Some books don't have ISBN numbers or Library of Congress numbers. So you will still end up doing a fair amount of manual entry and searching.

Came here to also recommend LibraryThing. It's been around for 14 years, and the community has lots of similar situations to OP's. Probably a good place to look up anecdotal evidence for this kind of project, too.

LibraryThing also hooks into TinyCat (https://www.librarycat.org/) which can make your database more like a library's.

I second or third librarything; I've used tags for things like "box1" or "shelf2."

It may be very expensive in time and resources to scan it all even if it is just the covers. You need to work out how long it takes to fetch a book from the barn or container, flatten/unbind if necessary, scan the cover, rebind, and put back. Then multiply by how many books...
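That multiplication is worth doing before buying any hardware. A tiny Python sketch, where every number is a placeholder I picked for illustration (nobody in the thread knows the barn's real count or the real per-book handling time):

```python
def total_hours(num_books: int, seconds_per_book: float, workers: int = 1) -> float:
    """Rough wall-clock hours if the work splits evenly across workers."""
    return num_books * seconds_per_book / 3600 / workers


# Placeholder inputs: 50,000 books at one minute each.
print(round(total_hours(50_000, 60), 1))             # one person
print(round(total_hours(50_000, 60, workers=4), 1))  # four people
```

Swap in a stopwatch measurement from handling twenty real books and the estimate becomes much more honest.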

I worked at a small startup in the early 2000s that somehow got a massive contract to digitise a Middle Eastern Oil & Gas company's very, very extensive documentation library. We had an e-learning product where you could use a scanner to digitise a printed book into online documentation.

Demos of scanning a book or two were really impressive. So surely scanning more than a million books/manuals/charts will be just as easy. Not quite.

Think we calculated it would take years, as the bottleneck is the manual unbinding and re-binding before and after scanning. Scale that to a million and it was not the 2-month project initially forecast. Buying more scanners and hiring more local staff scaled that part horizontally and improved the speed, but it was still a long project.

However, the client "forgot" to pay us for a few months, the bank and our accountants forgot to check, and we went bankrupt before we really got started. Though at least I got a trip to the Middle East for a few weeks.

I think “scan” here was in the sense of “scan the ISBN to catalogue the book”, not scan the insides. Since many of these are old books, many of them will not be perfect bindings. Scanning their contents therefore either requires opening the books and moving the pages (as Google Books did) or cutting the spine, which would be highly unhelpful since that would preclude any rebinding of the books into anything other than a perfect binding, reducing strength, repairability, and the ability to fold flat.

Zotero (the free software reference manager) hooks into a bunch of online catalogues. You can use Zotero to manage books (I manage my own collection with it, but that's just a small personal home library of around 1000 books).

If a book has an ISBN, often Zotero will manage to find it using the magic lookup button. Just enter the ISBN (DOIs work too!) and it will usually find the book you meant. That covers about 90% of books with an ISBN.

The rest would have to be entered manually.

Zotero is not a full-blown inventory manager, but it may suit your needs.

Get a barcode scanner, scan the ISBN and use that to do an API query on Amazon to retrieve title, author, category, price, etc. Store this info in a DB.

Some scenarios:

1. Generally the lookup should return something. Store these books by category, e.g. business, children, fiction, etc., in their shelves/containers for physical browsing by your customers. The more subcategories you can do the better.

2. If price is bigger than some threshold then store these books privately and list for sale directly in an online marketplace. There’s an industry around book scalping (forgot the actual term) where traders buy books from fairs and sell online based solely on margin.

3. The lookup returns nothing - these books are probably very valuable or worthless. Some manual action required.
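The three scenarios above boil down to a small decision function. Here is a minimal Python sketch; the record layout (a dict with `category` and `price` keys), the `PRICE_THRESHOLD` value, and the returned labels are all assumptions of mine, since the thread doesn't fix any of them.

```python
from typing import Optional

PRICE_THRESHOLD = 20.0  # hypothetical cut-off; tune to taste


def triage(record: Optional[dict]) -> str:
    """Decide where a scanned book goes, given its lookup result (or None)."""
    if record is None:
        # Scenario 3: nothing found, so a human has to look at it.
        return "manual-review"
    price = record.get("price")
    if price is not None and price > PRICE_THRESHOLD:
        # Scenario 2: worth listing online rather than shelving.
        return "list-online"
    # Scenario 1: shelve by category for physical browsing.
    return "shelf:" + record.get("category", "uncategorised")


print(triage({"title": "Carrie", "category": "fiction", "price": 3.5}))
```

Keeping the rule in one function makes it easy to change the threshold later without touching the scanning code.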

I was actually considering doing something like this for remainders before, but never got it going. I’d love to know more about your eventual solution.

Last fall, Amazon shut down access to (one of?) their ISBN lookup services, unless you sold enough things from their advertising.

I've been using Tellico (http://tellico-project.org/) and a barcode scanner for many years and it works well. That's just for about 3,000-4,000 books, though. One of the other lookup options works. (ISBNSearch.org?)

Goodreads has a scanner in their app (on iOS/Android) that can scan covers although for some reason it automagically adds those books into a "to-read" shelf but I guess this isn't a problem for you if you create an account for the purpose.

The API is severely rate-limited (1rps), uses non-standard OAuth, and is badly documented, but you should be able to get some XML out of it and parse that however you'd like.

I'm wondering about the option of having a (cheap, not worth stealing) smartphone, with the Goodreads app on it, logged in to the Goodreads account for this place. People who come in are asked to use it to scan 10 books, and move them over to the "scanned and catalogued" shelves, in addition to paying the $1 fee to get a book. Most people would be fine with that, and over time you get it all scanned.

Goodreads cannot recognize everything, but since it works on either book cover or barcode it will work on lots.

Also, you can download the book information in a CSV format.

A barn full of old books sounds like the perfect breeding ground for all sorts of bugs... I had an acquaintance that had a problem with bed bugs that apparently started when she got books off of those "Take one book" boxes people put on the front of their houses.

It may be worth your neighbor checking for that sort of thing too. Apparently there are dogs that are trained to sniff bedbugs... those furry guys can sniff anything :-) [1]

1: https://www.nytimes.com/2012/12/06/garden/bedbugs-hitch-a-ri...

If they are recent books (from about 1980) then they probably have a barcode on the back cover, so use that. My guess is that it won't be worth trying to automatically recognise older books from the cover: a lot of them had a dust jacket, that goes missing, and a cover under the dust jacket that is not at all distinctive. The title might be on the spine, but how many online images show the spine clearly?

I tried doing a book catalog about ten years ago. I got about 80% recognition rate by using multiple numbers (ISBN, and the Library of Congress number) and multiple online data sources. It was a pretty slow process, to the point where simply keyboarding the information was easier and less error-prone, and I had to manually enter the books that didn't get any online matches anyway.

Definitely not a "scan/beep/scan/beep" kind of thing. More like "scan . . . uh, scan . . . scan, damn you, SCAN I say! (beep) Okay . . . now the first problem is that 'The Sands of Mars' which I am holding is definitely not 'Great Montana Flapjack Recipes' on B&N, let's try the Library of Congress . . . . nope, not 'Annals of 1959 Steelmaking', so (tap tappity-tap...)"

The last time I looked at this (admittedly quite a while ago) the book bar code contained the ISBN.

What was causing the mismatches? Bar codes that did not contain the ISBN? Non-unique numbers?

You would be surprised to see how many books have the wrong ISBN or have a mismatch between the barcode and the ISBN. Not as much of a problem with major publishers now, but some from the 80s and 90s were hilarious.
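For the curious: a "978" book barcode is the ISBN-13, and the matching ISBN-10 is its nine middle digits plus a recomputed check digit, so a barcode/printed-ISBN mismatch can be detected mechanically. A minimal Python sketch, with function names of my own choosing; it deliberately does not verify the barcode's own check digit.

```python
from typing import Optional


def isbn10_check_digit(first9: str) -> str:
    # Weights 10 down to 2 over the first nine digits, modulo 11; 10 prints as "X".
    total = sum(w * int(d) for w, d in zip(range(10, 1, -1), first9))
    check = (11 - total % 11) % 11
    return "X" if check == 10 else str(check)


def isbn10_from_ean13(ean: str) -> Optional[str]:
    """Derive the ISBN-10 from a 978-prefixed barcode, or None if not possible."""
    digits = "".join(c for c in ean if c in "0123456789")
    if len(digits) != 13 or not digits.startswith("978"):
        return None  # 979-prefixed codes have no ISBN-10 equivalent
    core = digits[3:12]
    return core + isbn10_check_digit(core)


def barcode_matches_printed(ean: str, printed_isbn: str) -> bool:
    printed = printed_isbn.replace("-", "").replace(" ", "").upper()
    scanned = "".join(c for c in ean if c in "0123456789")
    if len(printed) == 13:
        return printed == scanned
    return isbn10_from_ean13(ean) == printed


print(isbn10_from_ean13("9780306406157"))
```

Books where the two disagree are exactly the ones worth setting aside for a manual check.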

The ISBN mappings were maybe fifty percent reliable, service by service. So you'd do three or four ISBN lookups and paw through the results, and then failover to the LOC. The Library of Congress numbers were more accurate, but more effort to enter. Nothing worked 100%, and the failure rate on some of the more obscure books was very high.
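When each service is only about half reliable, one way to cut down the "pawing through results" is to accept a title only when at least two sources agree. Here is a minimal Python sketch of that idea. It is my own framing rather than what the commenter did: the services are passed in as plain functions, and the stub lookups at the bottom are made up for the demo.

```python
from collections import Counter
from typing import Callable, Iterable, Optional, Tuple

Lookup = Callable[[str], Optional[dict]]


def consensus_title(
    isbn: str, services: Iterable[Tuple[str, Lookup]], min_votes: int = 2
) -> Optional[str]:
    """Return the (case-folded) title most services agree on, if enough do."""
    votes: Counter = Counter()
    for _name, lookup in services:
        try:
            record = lookup(isbn)
        except Exception:
            continue  # a failing service simply contributes no vote
        if record and record.get("title"):
            votes[record["title"].strip().casefold()] += 1
    if not votes:
        return None
    title, count = votes.most_common(1)[0]
    return title if count >= min_votes else None


# Stub services standing in for real catalogues.
def good(_isbn):
    return {"title": "The Sands of Mars"}

def wrong(_isbn):
    return {"title": "Great Montana Flapjack Recipes"}

print(consensus_title("0000000000", [("a", good), ("b", wrong), ("c", good)]))
```

Anything that returns `None` still ends up in the manual pile, which matches the experience described above.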

This sounds like a use-case inventaire.io ought to support. I'll try to ask them about it. They use Wikidata for filling up book metadata.

Otherwise, as stated elsewhere in this thread, Zotero can usually find books with very little information: ISBN or title. It might be worth trying to set up an OCR with it.

In any case, if you go to the length of taking a picture for each book, you might as well save them and make the dataset public, for OCR training purposes (and a second pass). There is also the Mechanical Turk option if you go this way.

And as someone stated already, you should plan the physical layout in advance.

Yep, inventaire.io could help there, to some extent: they could scan book barcodes in bulk from the webapp https://inventaire.io/add/scan , which should find data for most books. But then it gets tricky for books without a barcode/ISBN, as they would probably have to fill in the data manually, which can be quite some work for large inventories. No plans to add OCR, yet ;)

Wow, judging by the response this is a problem a lot of people think about. Am overwhelmed by the helpful info. Obviously have to start at the low hanging fruit as I am working with non-technical people and am relatively non-technical myself. I just tested LibraryThing and it seems very fast and accurate so will give it a whirl.

Again, thanks HN for the overwhelming response.

I use the paid version of Books from Sort It Apps: https://itunes.apple.com/us/app/book-list-library-isbn-scann...

All you have to do is hold the phone over the bar code for a second and it automatically downloads all the relevant information. This is by far (IMHO) the fastest way to catalog a mountain of books.

Not affiliated with the app or company in any way, just a happy user.

https://www.reddit.com/r/DataHoarder/ and https://www.reddit.com/r/datacurator/ are good resources for this kind of thing.

Surprised to see no mention of AbeBooks yet. We have an indie bookstore in Waterloo which is integrated with them and it seems to work pretty well. He tells me he still does most of his business in the IRL shop, but there's a steady stream of people buying online as well. Plus, it's nice for him to be able to quickly check how many of something he already has before committing to buying a bunch more of them. See: https://www.abebooks.com/old-goat-books-waterloo-on-canada/1...

I'm not sure what options there are for hardware integrations, but Abe provides at least online inventory and ordering capabilities. I assume if you had a barcode scanner capable of acting as a USB keyboard and entering ISBNs, it would go pretty quickly.

My plan for books is to pull the rare/valuable ones, then subscribe for the $100/mo 100 book/mo plan at http://1dollarscan.com/ and send them all the rest, produce PDFs, and pulp the books. I have maybe 3000 books in storage and this would be preferable to anything else I've found, as I ultimately would rather consume them electronically.

1DS $100/month is about 30 books of ~300 pages (3+ "sets" of 100 pages, rounded up).
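The arithmetic behind that estimate, as a small Python sketch. I am assuming the $100 plan buys 100 "sets" of up to 100 pages each per month, which is my reading of the two comments above rather than a checked price list, so treat the defaults as placeholders.

```python
import math


def books_per_month(
    pages_per_book: int, sets_per_month: int = 100, pages_per_set: int = 100
) -> int:
    """Whole books that fit in a monthly allowance of page 'sets'."""
    # A partial set still costs a full set, hence the rounding up.
    sets_per_book = math.ceil(pages_per_book / pages_per_set)
    return sets_per_month // sets_per_book


print(books_per_month(300))  # exactly 3 sets per book
print(books_per_month(320))  # rounds up to 4 sets per book
```

Under those assumptions a 3,000-book collection of roughly 300-page books would take years on this plan, not months.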

Scanning is only worthwhile for books which are not already available in electronic form ... somewhere.

If you bought from Amazon, there's sometimes an option to get the ebook cheaply. Archive.org has many books. There are also e-books at public libraries, so it may be enough to keep a list/photos/calibre of all your titles and discard rarely-accessed books.

A lot of people are suggesting querying Amazon for ISBN data - another option is the ISBNdb API: https://isbndb.com/ There's also the OpenLibrary API (from the Internet Archive) which may include some more info https://openlibrary.org/
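A minimal Python sketch of an Open Library lookup, using only the standard library. The endpoint and the response shape (a JSON object keyed by `ISBN:<number>` with `title` and `authors[].name`) reflect my understanding of the Open Library Books API; please check them against the current docs before relying on this, and note the helper names are my own.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional

API = "https://openlibrary.org/api/books"


def build_url(isbn: str) -> str:
    query = urllib.parse.urlencode(
        {"bibkeys": f"ISBN:{isbn}", "format": "json", "jscmd": "data"}
    )
    return f"{API}?{query}"


def parse_response(isbn: str, payload: dict) -> Optional[dict]:
    """Pull title and author names out of a decoded response, if present."""
    entry = payload.get(f"ISBN:{isbn}")
    if not entry:
        return None  # unknown ISBN: the service returns an empty object
    return {
        "isbn": isbn,
        "title": entry.get("title"),
        "authors": [a.get("name") for a in entry.get("authors", [])],
    }


def lookup(isbn: str, timeout: float = 10.0) -> Optional[dict]:
    # Network call; raises urllib.error.URLError if the service is unreachable.
    with urllib.request.urlopen(build_url(isbn), timeout=timeout) as resp:
        payload = json.load(resp)
    return parse_response(isbn, payload)
```

Calling `lookup("9780306406157")` should return a small dict or `None`; if you are doing thousands of lookups, add a pause between requests to be polite to a free service.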

Don’t forget about the Dewey decimal system. For the books with ISBNs, you can sort them into boxes by their Dewey decimal. If you don’t have time to manually categorize the books without ISBNs, they can be put into “other” boxes and left unsorted.
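Sorting into boxes by Dewey number only needs the hundreds digit. A minimal Python sketch, assuming you already have the Dewey number as a string from whatever catalogue record you looked up; the `dewey_box` name and the "other" label are mine.

```python
from typing import Optional

# The ten Dewey main classes, keyed by hundreds digit.
DEWEY_CLASSES = {
    0: "000 Computer science, information & general works",
    1: "100 Philosophy & psychology",
    2: "200 Religion",
    3: "300 Social sciences",
    4: "400 Language",
    5: "500 Science",
    6: "600 Technology",
    7: "700 Arts & recreation",
    8: "800 Literature",
    9: "900 History & geography",
}


def dewey_box(dewey: Optional[str]) -> str:
    """Map a Dewey number such as '813.54' to a main-class box label."""
    if not dewey:
        return "other"
    try:
        number = float(dewey.strip())
    except ValueError:
        return "other"  # e.g. a shelf mark like "FIC" instead of a number
    if not 0 <= number < 1000:
        return "other"
    return DEWEY_CLASSES[int(number // 100)]


print(dewey_box("813.54"))
```

Ten boxes plus "other" is coarse, but it is a sort that a volunteer can do without thinking.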

Library of Congress catalog information is more generally available, by ISBN if not already on the copyright page, for any conventionally published book since 1970.


Photograph the books, a dozen at a time. Put a box number label next to the books. Put the books in their box, glue the label to the box. Stack the boxes.

Sort the boxes by height, line them up in a row. Put slats of wood between the rows to distribute and stabilize the load.

I wrote a program to automatically generate simple HTML files to display the images. See sample:


Use OCR to digitize title & author.

Hire Mechanical Turks or cheap offshore labor to type titles & authors from the photos (no need to ship the books).

Where is that? I really would like to pay a visit...

That story also reminds me of this fellow (who actually might get me to the middle of nowhere SK): https://www.macleans.ca/news/canada/canadas-most-inconvenien...

(Edited to add Apple News links without ads if anyone uses it:

Free version: https://apple.news/Ar3trUQ-cR7C-c9L3YzPYjg

Issue version: https://apple.news/AQD4nDgB4SKi6yTcZy60tPw)

There seems to be a ton of relevant help in this thread and that seems exciting.

Like someone pointed out, something like OCR might be a good first step, as it seems like a data entry task at a glance.

It does sound like there may be a significant amount of physical, tedious work involved no matter what software solution you find. Sometimes you have to accept that aspect and push through. Your best bet might be to recruit some physical help there—start a fund or a labour drive or something. Recruit book lovers, etc. Seems worthwhile. Maybe he would donate books to helpers.

Use one of the solutions listed below, but you HAVE to do sorting on the fly. You need to have places to put books and sort them by some general genres and you HAVE to throw out books that aren't worth the time due to damage or any other reason a book would be deemed recyclable. With that many books, a proper library style cataloging system may be your best bet.

That being said, if you do want to do image comparison for covers, books without covers usually have a copyright page with most of the info on it. Use that to determine what a book is when the other method fails. Throwing together some cheap bookshelves with plywood and 2x4's will greatly help with the finding part, but while scanning use some big bins to do a rough sort.

And I can't stress enough you HAVE to throw out books. It's clear that there's a space issue and if he's willing to get them for free but has a hard time getting rid of them, that's hoarder behavior, not just eccentricity.

that'll need a bunch of volunteers / friends to help with this work. also, check this podcast episode. might get some info / contacts https://www.npr.org/sections/money/2014/11/10/363103753/text...

once you've started this sorting / cataloging work: request visitors not to reshelve the books. have a central location (table / bins) for them to put the books, so the volunteers / barn-man can put them back on the right shelves.

also, what exactly is the barn-man's objective? just collect books and not bother about further? or, be the most helpful to book-lovers? or, make good money from these books ?

simple approach would be to ask book lovers around the locality to volunteer with this task. Borrow some barcode scanners and computers. Give'em whatever books they like and it's kinda get-together for bibliophiles.

Been there... since my library is now a little over a thousand physical books, and in multiple languages.

If you have a Mac, get a copy of Delicious Library (https://www.delicious-monster.com/) and a compatible barcode scanner, like the Flic.

If you have an Android phone, and you're happy with dealing with your phone and CSV export, you'll probably be okay with Libib (https://play.google.com/store/apps/details?id=com.libib.app).

The biggest issue is that there are tonnes of books (especially if like me you have older ones) that predate ISBN. That kinda sucks - but it's life.

A lot of the older ones can be looked up by their Library of Congress number, IIRC from digitizing and barcoding a school library’s catalog a couple decades ago.

FWIW, that effort took a half dozen people about 3 months for somewhere on the order of 50,000 books.

100%. But given that I want it automatic, the barcode scanner integration with varying tools just doesn't do it.

The flip side is that I have books in English, French, Hebrew, and Arabic. The religious Hebrew books don't have ISBNs, unless they came from a North American publisher. The same is true of the Arabic books I have. The French on the other hand went ISBN furiously - or at least enough that it scans true :)

What percentage of each? It sounds like this book barn could use some organization in general, and sorting by language might be a good first step. Then you can scan the English and French as lower hanging fruit, and maybe recruit extra help to deal with the Hebrew and Arabic manually, or something?

Sell 5 random surprise books for $15 (ex shipping). Include a box to send back any books they don't want (or any other book). Process returned books (take pictures, put barcode, register title+author). For the next order give $3 discount for every book they sent back previously.

It can definitely be done but I don't know the details.

A couple times a year the local Half Price Books Outlet does a "fill a bag for $20" event and every time there are at least a couple people there with shopping carts full of bags of books.

They have dedicated bar code scanners attached to their phones and will scan books at around 1 a second. I don't know what software they are using but clearly they are looking up prices to see what they can get to sell for a profit.

I use Goodreads to keep track of my own book collection and using the camera and the Goodreads app usually takes 30 seconds plus to focus on the bar code and then to look it up. So whatever they are using is much faster than that.

Yep, I've seen the exact same thing at my local used bookstore on normal business days. Someone with an ISBN scanner methodically scanning every book and buying whatever the app told him.

I work with scanning documents for business purposes. I went to a sales meeting with a company called Biels, which has now been bought and is called Instream.

While at this meeting they were displaying a book scanner: you could place a book in the machine, it would flip each page and take a high-resolution photo, and it had options for OCR software which would read the entire page and present any questionable words or characters the OCR could not identify. This machine and software was pitched to museums and large libraries. I would highly suggest asking a local museum or library if they have any hardware that would be able to archive the books you're describing.

I tried searching for the exact machine but I could not locate it, I want to say Canon was the vendor.

I wish you success with your endeavor.

I am also grading books in my kids' library. They don't have a librarian and they have a stack of 1,000 books that have been donated.

I am grading them by reading level A-Z. Currently I am googling the book and then adding "reading level" to the end and then if it has it, it will show up, or I can find the Lexile number and use that as a grade also. I am using the speech to text command in google, so it doesn't take that much time.

This is a hassle, and I am looking into other ways to speed up the process, and/or get other parents to volunteer if it were an easier process.

First, is this really a problem that needs to be solved? Personally, his place sounds like my favorite kind of book shop. A lot of bookworms prefer wandering through dense forests of precariously-balanced piles of books. Is he getting those people, or is he getting people that are expecting Barnes & Noble?

If they really do need to be cataloged, then the next thing is to forget all about trying to inventory the entire thing. Instead, you're going to partition the collection into "easy to catalog" and "hard to catalog": pick a section of the barn and make this the organized area. Get a barcode scanner (https://www.newegg.com/Barcode-Scanner/SubCategory/ID-583) and throw together a quick API client that'll take an ISBN and display a title, author, edition, and picture. If it comes up correct, great: book goes into the cataloged section. If it doesn't, it goes somewhere else. Make it really simple, so that a single keystroke can accept that book into inventory.
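The "single keystroke" loop described above could look something like this in Python. It is a sketch under my own assumptions: the lookup is passed in as a function (any of the book APIs mentioned in this thread could sit behind it), a record is a dict with `title` and `author` keys, and pressing Enter alone counts as "accept".

```python
from typing import Callable, Iterable, List, Optional, Tuple

Lookup = Callable[[str], Optional[dict]]
Confirm = Callable[[dict], bool]


def partition(
    isbns: Iterable[str], lookup: Lookup, confirm: Confirm
) -> Tuple[List[dict], List[str]]:
    """Split scanned ISBNs into cataloged records and a hard-to-catalog pile."""
    cataloged: List[dict] = []
    hard: List[str] = []
    for isbn in isbns:
        record = lookup(isbn)
        if record is not None and confirm(record):
            cataloged.append(record)
        else:
            hard.append(isbn)  # no match, or the human rejected the match
    return cataloged, hard


def confirm_with_enter(record: dict) -> bool:
    # Enter alone accepts; typing anything first rejects.
    answer = input(f"{record.get('title')} / {record.get('author')} [Enter=accept] ")
    return answer.strip() == ""
```

Run it as `partition(scanned_isbns, my_lookup, confirm_with_enter)`; passing `confirm` in as a parameter also makes the loop easy to test without a keyboard.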

Grocery stores have to regularly inventory everything on the shelves. I worked for an outfit once that wanted to do it all in-house, so we bought the commercial Telxon handheld wireless devices and I set about figuring out their software. Turned out that they just wanted to speak basic telnet to a server at a pre-configured IP address, so I put together a sloppy little telnet server interface and staff were able to count the entire store right on the devices in a few hours. That's way more complicated than what you'll need to do, so, y'know, your thing is doable. You'll have the added benefit of free online book databases and better hardware and easier-to-hack-together software.

Also might not be a bad idea to talk to your local librarian. They're book nerds too and he or she might have an actual library science degree. This would be right up their alley.

> Also might not be a bad idea to talk to your local librarian. They're book nerds too and he or she might have an actual library science degree. This would be right up their alley.

I second this. Also, maybe check with university libraries or university MLS/MLIS programs (Masters of Library Science). This is not a new problem, and they would be aware of existing tools/methodology. Also, maybe you could get a grad student/intern to help.

Smartphones of both flavours can load cheap or free apps that are quite effective enough to read barcodes and identify books. Librarything and its various catalogue tools can help with the metadata too. That said, the advice to get specialist help is well-founded.

Instead of creating an app for managing the books and integrating with Amazon API, I'd suggest Calibre which is an excellent open source ebook app. You can use it to manage paper books as well, and it will download all metadata from a given ISBN including Amazon ratings, cover page and so on.


A dozen actions to get each book in the system sounds OK if you're adding a book every day, or a few books a week/month, not to input hundreds or thousands.

> First, is this really a problem that needs to be solved?

For insurance purposes, probably, yeah, especially if there's anything rare, antique, valuable, etc.

I doubt old guy with barn full of books is looking to insure his rare books.

Maybe he would, if the thought weren't so daunting.

When I did this for my (admittedly medium sized) collection, I used Booxter (https://www.deepprose.com/) and a cuecat scanner to catalog all those books.

Was a simple process of having enough boxes and labels, and I did that anytime I had some free time. Scan a bunch of books, drop them in a box, slap a label on the box, wait for booxter to find and fetch the metadata, update the label in booxter and repeat.

Will take time, but is easily doable.

You could also ask at the forum of


Perhaps there are users with experience...

I second DIY Book Scanner. I don't own one but I did scan a book once with it. The scanning part is incredibly fast. I'm sure there are solutions out there for creating a PDF (with OCR).

Once upon a time (years ago) I wrote a tool called bookbuilder, which did exactly this :-) Take a bunch of camera photos of a book on a dark background, find the edges and extract the text part :-) I'm not sure it still works, because it was Java 6... but it uses the excellent BoofCV library (http://boofcv.org). If you would like to try it, give it a shot (Tesseract 3 is needed):


java -jar bookbuilder-0.2.jar --input-path=input --output-file=output/output.pdf --ocr-embed-layer --rotation-degrees=180

There’s a whole universe of folks selling used books on Amazon FBA. I suggest starting with a Google search of exactly that.

There are apps which allow you to scan UPC codes and look up a price on Amazon. I’d personally sort the books by market value. Sell the books that are profitable, trash the ones which are not, save the ones which are very rare or have no UPC code, and use the money to grow the storage space.

Except you're talking about a guy who seems to be happy getting containers filled with almost certainly worthless books. I doubt his values and yours are aligned.

I have a related problem on a much smaller scale (only a few hundred books) in that I wish to make full digital copies of my books. I reached out to Archive.org but they can't use them due to copyrights. I'm looking into https://1dollarscan.com/ but it's a destructive scanning technique.

While we are a bit on this topic: is there an alternative to calibre [1] for managing a shitload (50000+) of ebooks which is still performant? Especially for ebooks which have no ISBN (PDFs, whitepapers, etc.), where the only information about them is in their EXIF file data.

[1] https://calibre-ebook.com

Performant for which function? Searching? Parsing metadata from imported files? Manual editing?

Zotero, generally?

I raised the same question several years ago on this very forum. I got some good answers but none that ultimately satisfied my needs, but they could be useful for you: https://news.ycombinator.com/item?id=9631362

I worked for a company that scanned and catalogued many books in the ‘00s. There are two primary challenges to solve: nondestructive scanning and speed.

1. In order to get a good scan (back then) we had to lay each page flat against a piece of glass (no matter the orientation). This tended to damage or destroy the binding by the time scanning was complete.

2. An average of ten seconds per scan (from page flip to page flip) is blazing fast (including rescans). For a 200 page book this is 33 minutes. To scan a library of 200 books at this rate requires 3.2 man years of work (normal 40 hour work week + holidays).
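Estimates like the one in point 2 are easy to sanity-check in a few lines. This is only a sketch: the 10 seconds per page and 200 pages per book come from the comment, while the 2,000-hour working year (40 hours times 50 weeks) is an assumption made here.

```python
SECONDS_PER_PAGE = 10          # from the comment above
PAGES_PER_BOOK = 200           # from the comment above
HOURS_PER_WORK_YEAR = 40 * 50  # assumption: 40-hour weeks, 50 weeks a year


def hours_per_book(pages=PAGES_PER_BOOK, seconds_per_page=SECONDS_PER_PAGE):
    """Scanning time for one book, in hours."""
    return pages * seconds_per_page / 3600


def work_years(books):
    """Working years needed to scan `books` books of the default size."""
    return books * hours_per_book() / HOURS_PER_WORK_YEAR


def books_for_work_years(years):
    """How many default-size books fit into `years` working years."""
    return years * HOURS_PER_WORK_YEAR / hours_per_book()


if __name__ == "__main__":
    print(f"one book: {hours_per_book() * 60:.1f} minutes")
    print(f"200 books: {work_years(200) * HOURS_PER_WORK_YEAR:.0f} hours")
    print(f"3.2 work years: about {books_for_work_years(3.2):.0f} books")
```

With these inputs one book takes about 33 minutes and 200 books about 111 hours (under three working weeks), while 3.2 working years corresponds to roughly 11,500 books of this size.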

One way to speed this process drastically is to use a bulk scanner. This requires slicing the binding off the book and feeding in the book as a stack of pages, scanning the cover separately. Obviously this completely destroys the book.

Good luck.

These days, using a couple of dedicated hi-res cameras may be a much faster way to acquire the page images.

A scanner's workbench could be rigged with screens for live preview and QC. Then assemble/OCR in software. The main manual task is page turning, the rest could be fixed (light, exposure, alignment etc)

This would be my MVP: I'd implement simple inventory app based on ISBN scanning and simply enumerated boxes with, say, 50 books each. Scan ISBN - put in the next empty box, take another box when full, and so on. Then based on title demand, I'll sort popular titles in their own boxes.

20 to 25 books is about the max that can be comfortably lifted.
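A minimal sketch of the numbered-box idea, using the 25-book limit from the reply as the default capacity. `BoxInventory` and its method names are made up for illustration; a real version would persist the data somewhere.

```python
class BoxInventory:
    """Assign scanned ISBNs to sequentially numbered boxes."""

    def __init__(self, capacity=25):
        self.capacity = capacity
        self.boxes = {}       # box number -> list of ISBNs in that box
        self.current_box = 1  # the box currently being filled

    def add(self, isbn):
        """Record a scanned book and return the box number it goes into."""
        contents = self.boxes.setdefault(self.current_box, [])
        if len(contents) >= self.capacity:
            # Current box is full: start the next one.
            self.current_box += 1
            contents = self.boxes.setdefault(self.current_box, [])
        contents.append(isbn)
        return self.current_box

    def find(self, isbn):
        """Return every box number holding a copy of this ISBN."""
        return [box for box, books in self.boxes.items() if isbn in books]
```

The returned box number is all the person packing needs to see; `find` answers the customer's question later.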

Use an OCR service such as Firebase ML Kit's text recognition or Amazon's similar offering, take pictures cover by cover, and ping an API (even Amazon or eBay might do) to see whether the book exists and what its average price is.

It also shouldn't be hard to up the speed by taking pictures of a stack of books. If you take an image of a stack and crop it book by book, training a model to recognise books shouldn't be that hard, but you could also use a hosted CV solution (Firebase, Amazon or Azure again) and then ping the API for each book it found in the stack. This could probably be the fastest way if you can take a panorama and have it search from that.

Anyways, if you do it - try to get the price, ISBN and editions from the results.

Dump them all in a shredder. Blow the shreds through a well-lit tunnel full of digital cameras. Assemble the books from the images. Now it's just a software problem.

(This isn't my idea. Either Rudy Rucker, Vernor Vinge or Cory Doctorow thought it. I forget exactly who.)

The ham-handed Librareome digital preservation project from Vernor Vinge's Rainbows End

>> "The raging maw was a "NaviCloud custom debinder". The fabric tunnel that stretched out behind it was a "camera tunnel" ... thousands of books that had already been sucked into the "data rescue" equipment"

Has anybody tried that? It seems like a fun software problem.

I used Tellico to scan my library; it can automatically look up the books on Amazon by their ISBN if you have a barcode scanner (else you have to type them in by hand...)


Amazon shut down that API last fall, I think. I've been using one of Tellico's other options since then. (I don't remember which offhand.)

LibraryThing has an app, or you can order a CueCat scanner. https://wiki.librarything.com/index.php/Adding_and_importing...

Google has a Books API. Look into that. There are smartphone apps that solve the problem of books that have barcodes. No matter what, this will be a huge task to complete. I scanned my small library (2-3 shelves) and was quite tired of it in the end.
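For what it's worth, a rough sketch of an ISBN lookup against the Google Books volumes endpoint, using only the standard library. The `isbn:` query prefix and the `items[].volumeInfo` response fields are as I understand the public API; the function names are made up here, and quotas and error handling are left out.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_URL = "https://www.googleapis.com/books/v1/volumes"


def build_query_url(isbn):
    """Build the search URL for a single ISBN."""
    return API_URL + "?" + urlencode({"q": "isbn:" + isbn})


def parse_volume(response):
    """Pull title/authors/date out of a decoded API response, or None if no hit."""
    items = response.get("items") or []
    if not items:
        return None
    info = items[0].get("volumeInfo", {})
    return {
        "title": info.get("title"),
        "authors": info.get("authors", []),
        "publishedDate": info.get("publishedDate"),
    }


def lookup(isbn, timeout=10):
    """Fetch and parse the record for one ISBN (needs network access)."""
    with urlopen(build_query_url(isbn), timeout=timeout) as resp:
        return parse_volume(json.load(resp))
```

Keeping the parsing separate from the network call makes it easy to test against saved responses, and to swap in another database for books Google doesn't know.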

Surprised that no-one has produced a Vivino-alike scanner for books.

(Although I suspect the range of book covers is somewhat larger than the range of wine labels...)


Please do let me know more regarding your pain points. I released http://mybooklist.club. Obviously it already includes manual insert and barcode scanning, but now I'm working on adding books by image recognition (of the cover).

Tangential anecdote:

I once entered a used-book store looking for an old math book. No one was there except a grumpy-looking man with wild hair and a big beard sitting at a desk in the far corner.

I start browsing the shelves.


“Um, eh, Play with Eternity.”


My thought is that the entire point of an old fashioned second hand book shop is to be able to wander around and explore. If everything is catalogued, his barn instantly turns into a very bleak version of Amazon.

Not sure if this will help - https://aws.amazon.com/rekognition/

Using computer vision may provide a good enough result for this use case. An interesting approach would be to segment the book piles and bulk scan the spines. There are projects that already tackled this problem:


Won't be as accurate as barcode scanning, but will definitely be less time consuming.

That's fantastic. I was after a similar solution and after playing with OpenCV, I can see how they put the pieces together.

Annoyingly, I was trying to scan multiple books in charity shops, when one of my favourites started putting their own stickers directly over the barcode (when it existed).

I have experience working in the document storage business. You need to barcode and index all the books. Barcode all locations and scan each book into the container it is in. It will be like an Excel table with 3 columns: book name, barcode (ISBN or similar) and location barcode. If someone looks a book up in the catalog, you will know its location. If the books have ISBNs, it's possible to write a script to pull book info from them.

A while back I wrote an ISBN barcode scanner which would look up the item on Amazon and find the price. The script uses your webcam as the source for the image. I'm sure you could adapt it to your needs; it's very simple and has minimal error checking, so beware.
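The three-column table could be as small as this, sketched with SQLite from the standard library instead of a spreadsheet. The table layout follows the comment; the function names and the example location code are invented for illustration.

```python
import sqlite3


def open_catalog(path=":memory:"):
    """Open (or create) the catalog: one row per physical book."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS books ("
        " title TEXT NOT NULL,"     # book name
        " barcode TEXT NOT NULL,"   # ISBN or other barcode on the book
        " location TEXT NOT NULL)"  # barcode on the box/shelf/container
    )
    return db


def shelve(db, title, barcode, location):
    """Record that a book was scanned into a given location."""
    db.execute("INSERT INTO books VALUES (?, ?, ?)", (title, barcode, location))
    db.commit()


def locate(db, title_fragment):
    """Return (title, location) pairs whose title contains the fragment."""
    rows = db.execute(
        "SELECT title, location FROM books WHERE title LIKE ? ORDER BY title",
        ("%" + title_fragment + "%",),
    )
    return rows.fetchall()
```

Usage is scan location, scan book, repeat: `shelve(db, "Rainbows End", "some-barcode", "CONTAINER-03")`, then `locate(db, "rainbows")` when a customer asks.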


I'm surprised nobody mentioned https://www.libib.com/. It comes with a mobile app that has a barcode scanner with an option for manual entry.

Yes, some older books, especially in languages other than English, are not in its database, leaving you with the manual option, but it will let you index the books that are there in no time.

For books that are not that old, you will often find the info you need on the copyright page. For US publications, the Library of Congress CIP info is there; see http://www.loc.gov/publish/cip/ . Other countries have similar programmes, e.g. the British Library does the same.

You might try Delicious Library: https://delicious-monster.com

I was going to second that; it's the first thing that came to my mind—but then it doesn't work for books without barcodes (or does it now?). I worked with version 2 a while back, it was great. The iPhone app to scan books without the laptop nearby requires version 3 though. The Mac app could also check the price the book sells for online.

My local library has a 24/7 return system. You simply put the books on a conveyor belt and it takes them in and scans them. I imagine it reads the RFID but you could get a similar system to scan the barcode, you just have to insert them back side up. Would be quicker than scanning with a handheld barcode scanner.

This is a complex project, though also a well-developed space -- it's much of what libraries do.

Numerous queryable catalogues of books and other materials exist, with Worldcat arguably the most developed of those:


The US Library of Congress also has a huge (if intimidating) amount of information available.



Figuring out what you hope to accomplish, how, with what resources (people, software, equipment, space, etc.), in what timeframe, and with what throughput (how fast are materials arriving and leaving, what is the current backlog) are all considerations. What end this will serve (book sales, in-person or online), and what is sufficient to that end, is also significant.

Abebooks has a freely-available inventory-management system, HomeBase: HomeBase is AbeBooks' free inventory management software and one of the most widely adopted programs for booksellers worldwide.

This easy-to-use program streamlines inventory management and bookselling on AbeBooks. HomeBase helps take care of everything from maintaining your inventory database, keeping track of buyers and issuing receipts, to uploading your active listings to sell on AbeBooks. Plus, you can also use HomeBase 3.0, to send your inventory to other marketplaces such as Amazon.


There's a longer list of software and tools here:


Maybe look into how public libraries get books back onto shelves https://www.bibliotheca.com/library-return-sorting/


though this is to scan whole books in case there are rare ones..

The amazon seller app does exactly this. I think eBay might have it built in as well, but I’m fairly certain with amazon you can scan the barcode or cover of a book.

Ideally you'd just do a pass through with a high-res video camera, generally make sure the book spines are facing out to capture their names, and run it through some image filter to pick up the book names.

Then you'd have to run some algo to match the name with the ISBN and tag it with the general location.

Once you get the process down you could run a new video every few weeks.

This is kind of like Google Street view for the book barn.

Am I dreaming? Could this work in practice?
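The name-matching step, at least, is not far-fetched. A rough sketch, assuming you already have OCR text for one spine and a dictionary of known titles to ISBNs from some catalogue; `match_spine_text` and the 0.6 cutoff are arbitrary choices made here, and the ISBN strings in the usage note are placeholders.

```python
import difflib


def match_spine_text(ocr_text, catalog, cutoff=0.6):
    """Match noisy OCR text against known titles.

    `catalog` maps title -> ISBN. Returns (title, isbn) for the closest
    title whose similarity is at least `cutoff`, or None if nothing is close.
    """
    # Compare case-insensitively, but remember the original spelling.
    lowered = {title.lower(): title for title in catalog}
    hits = difflib.get_close_matches(
        ocr_text.lower(), list(lowered), n=1, cutoff=cutoff
    )
    if not hits:
        return None
    title = lowered[hits[0]]
    return title, catalog[title]
```

For example, with `catalog = {"Rainbows End": "ISBN-A", "Play with Eternity": "ISBN-B"}`, a misread spine like `"Rainbovvs End"` still resolves to the first entry, while text that resembles nothing in the catalogue returns `None` and can go on the manual pile.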

The easiest way by far is to download Goodreads and use the barcode scanner in their app and their lists feature.

Google Docs will OCR.

At a first pass, just take a picture of a whole bunch of book spines. Some will be OK, some not.

You should ask on reddit on r/DataHoarder. They are the best for this kind of stuff.

Maybe a light hardware scanner would be helpful, something like what workers use in warehouses.

You could scan the barcodes and use something like TELLICO to make a catalogue

1st, where is this treasure trove?

2nd, barcode scanning would certainly be the most effective method.

Photograph the spines of the books on the shelves and OCR the titles and authors?

Yep, charge each customer $1 + at least 5 casual books indexed into a database

I would recommend checking with archive.org whether the books are already available online and, if not, whether they are interested in scanning them or in getting the scans.

Ask a professional. And by that I mean - get in touch with Jason Scott: https://twitter.com/textfiles

If we say books and professional, I think of the Internet Archive. They developed their own software which is used on the TT Scribe system, see https://archive.org/details/tabletopscribesystem. These people run an amazing operation, but even for them this still involves a lot of manual labor. In the end, what it takes is a person to grab a book and type in the title.

Jason Scott is the Internet Archive's Free Range Archivist.


Is it true what vessenes said above about the Internet Archive receiving a dataset of book spines? And would it be possible for the IA to release that dataset publicly?

So where is this barn of books?

I’d like to have a look. Thanks!

Hay On Wye is a town known for the book shops in Wales. Maybe this is the business model that you need to look into.

What Hay on Wye is known for is a literary festival. So there was a pivot to this a few decades ago that has worked.

This is the guy that started it:


Note the way it started, buying library stock from America that was available due to libraries closing.

Some of the Hay book shops are really good, there is a former cinema that you could get lost for hours in. Some other book shops are more like 'extras'. They might have books in cases on the pavement with prices being pennies. This stuff could be fairly pulped, but, collectively it gives the whole town this aura of literature that is way beyond what the local sheep farmers necessarily go in for.

Now if a tourist visits Hay for the festival then they spend £££ on books but they also spend a lot more on cups of tea, admission fees to see performances, accommodation and whatever else. A given tourist might spend pennies on books but in so doing spend many pounds in the town. They might not even read the books purchased, they might become more souvenir value, and far from generic souvenirs.

The reputation from the festival is enough to bring a respectable amount of tourists to Hay throughout the rest of the year.

It also works with a sponsor, normally the Hay festival works with people who have a vested interest in it being successful, so you get a lot of coverage on BBC's Radio 4.

Hay also has splendid scenery going for it as well as it being in Wales, proper. There are towns nearby that are just as pretty with similar scenic backdrops but nobody remembers the names of those places. The books thing - which is the effort and inspiration of one man - has put the place on the map, literally.

So, rather than the high tech solution, maybe the preinternet solution has some pointers. Get some local store fronts that are closed premises to become book shops. Segment the collection so that some shops are more specialist than others. Have some shops in less prime locations so that collectively there is the same thing going on as with Hay on Wye. Create a fake literary hub and then make it into a real literary hub by putting on the ten day festival.

If you get the council and local businesses in on the act then you might be able to get the whole thing started. Build it and they will come works for Hay even though it is the middle of nowhere with just sheep for local population.

Trying to shift the product online for a pittance is no fun at all, the festival and tourist location thing could be much more exciting. Try and twin the town with Hay to get started...

> […] has put the place on the map, literally.

I'm pretty sure Hay On Wye was quite literally on a lot of maps before that. Figuratively speaking you are probably correct.
