
Ask HN: How do you digitize documents? - caseyf7
Any recommendations on scanning documents, bills, articles, etc.? Advice on scanners, software and workflows would be greatly appreciated.
======
simonblack
I store most digitised documents as .PDFs.

Very often you can obtain original .PDFs from companies by downloading from
websites, as well as (or instead of) the paper documentation they send you.

For local scanning, I use an HP MFP. If I need to scan individual pages, I can
then merge them, if necessary, with a 'merge.pdf' type of utility.

Store the scanned/downloaded documents in some type of tree-structured
directory format. This greatly reduces the time taken to find a specific
document.

I keep financial documents separate from other documents. Financial documents
are also segregated into separate tax-year 'trees'.

Documents are backed up monthly and also daily. The monthly backups are
stored indefinitely, separately from the daily backups, which are thinned out
on an 'exponential' schedule: the older a daily backup is, the fewer of its
neighbours are kept.

These are the daily backups remaining at the moment. Day 0000 was 23rd June
2012. The last word of each name is the server name. Note how many more recent
backups survive than older ones:

    
    
         0000-120623nullius
         1024-150401nullius
         2048-180131centrepoint
         2304-181014centrepoint
         2560-190627centrepoint
         2688-191102centrepoint
         2720-191204centrepoint
         2736-191220centrepoint
         2752-200105centrepoint
         2756-200109centrepoint
         2758-200111centrepoint
         2759-200112centrepoint
         2760-200113centrepoint
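
One rule that reproduces exactly the day numbers listed above is "for every
power of two, keep the two most recent day numbers that are multiples of it".
This is only a sketch of that rule in Python; whether the actual backup script
works this way is an assumption, and the function name is made up for
illustration.

```python
def retained_days(today: int) -> list[int]:
    """Day numbers to keep on day `today`, assuming day 0 was the first backup.

    Assumed rule: for each step 1, 2, 4, 8, ... keep the two most recent
    day numbers that are multiples of that step.
    """
    keep = {0}
    step = 1
    while step <= max(today, 1):
        latest = today - today % step   # most recent multiple of step
        keep.add(latest)
        if latest >= step:
            keep.add(latest - step)     # the multiple just before it
        step *= 2
    return sorted(keep)


if __name__ == "__main__":
    print(retained_days(2760))
```

For day 2760 this returns the same thirteen day numbers as the listing, from
0 and 1024 up to 2759 and 2760. Any daily backup whose day number is not in
the returned list would be the one to delete.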

------
throwaway78678
I've got a decent Brother scanner, like this one:
[https://www.ebay.com/p/13030519316](https://www.ebay.com/p/13030519316). When
I scan a document, it ends up in a folder on my NAS.

I've built a small webapp that reads the content of this folder as untagged
documents. Tagging them will move them to a proper folder and the docs will
finally be visible in a treeview.
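
The tag-then-move step can be sketched roughly as below. This is not the
actual webapp, just a minimal Python illustration; the function names and the
one-folder-per-tag layout are assumptions.

```python
import shutil
from pathlib import Path


def untagged(inbox: Path) -> list[Path]:
    """Everything still sitting in the scanner's drop folder counts as untagged."""
    return sorted(p for p in inbox.iterdir() if p.is_file())


def file_under_tag(doc: Path, archive_root: Path, tag: str) -> Path:
    """Move a scan into archive_root/<tag>/ and return its new location."""
    # Reject tags that would escape the archive folder.
    if not tag or "/" in tag or "\\" in tag or tag in {".", ".."}:
        raise ValueError(f"unsafe tag: {tag!r}")
    target_dir = archive_root / tag
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / doc.name
    if target.exists():
        # Never silently overwrite an already filed document.
        raise FileExistsError(target)
    shutil.move(str(doc), str(target))
    return target
```

Once a file has been moved it no longer shows up in `untagged()`, and the
tree view is simply the directory listing under the archive root.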

It is relatively robust and low maintenance. I might at some point work on
download + OCR scripts to fetch and auto-tag bills and such that are already
in PDF. Not sure if that is really useful at this point, to be honest.

------
rfmw19
My method was more specific to bills and finance documents. I used a generic
photo scanner. It's not as automatic as the purpose-built document scanners
that have automatic feeders and support multiple pages, but I wanted something
that I could use for photography as well.

I coupled this with some very hacked-together Perl scripts with Tesseract
OCR[1] that fed data into ledger-cli[2] for handling bills. I put other
generic documents into folders by date.
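
To give an idea of the OCR-to-ledger step, here is a rough sketch in Python
rather than the original Perl. It assumes the OCR text has already been
produced (for example by Tesseract) and only shows pulling a total out of it
and formatting a ledger-cli transaction. The regular expression, the function
names and the account names are all hypothetical; real bills need
per-vendor patterns.

```python
import re
from datetime import date
from decimal import Decimal
from typing import Optional

# Assumed pattern: a "total" or "amount due" label followed by a money value.
AMOUNT_RE = re.compile(
    r"(?i)\b(?:total|amount due)\D{0,20}?(\d+(?:,\d{3})*\.\d{2})"
)


def extract_total(ocr_text: str) -> Optional[Decimal]:
    """Return the first amount that follows a total-like label, if any."""
    match = AMOUNT_RE.search(ocr_text)
    if match is None:
        return None
    return Decimal(match.group(1).replace(",", ""))


def ledger_entry(when: date, payee: str, amount: Decimal,
                 expense_account: str, paid_from: str) -> str:
    """Format one two-posting transaction in ledger-cli's journal syntax."""
    return (
        f"{when:%Y/%m/%d} {payee}\n"
        f"    {expense_account}  ${amount:.2f}\n"  # two spaces before the amount
        f"    {paid_from}\n"                       # balancing amount is inferred
    )
```

Appending the returned text to the journal file is enough for ledger to pick
the transaction up; reconciling it against the bank statement is still a
separate step.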

It worked pretty well, and I was able to generate some pretty graphs from data
that was fully reconciled with financial institutions like my bank, credit
card, investments, etc., but it still took too much time. So what do I do now?
Nothing!

This was years ago. I assume there is now better support from financial
institutions for extracting data and this coupled with improved OCR/machine
learning might make things more robust and make it worthwhile to try again.

[1]
[https://en.wikipedia.org/wiki/Tesseract_(software)](https://en.wikipedia.org/wiki/Tesseract_\(software\))

[2] [https://www.ledger-cli.org/](https://www.ledger-cli.org/)

------
clintonb
What’s your goal? I haven’t received a paper bill in years. They are already
digitized. Same for most news/magazine articles. Aside from older/historical
documents, nearly every piece of paper I encounter has a digital counterpart
that I can access in some form.

------
2rsf
With bills, quality is secondary and indexing is more important. I scan
using Microsoft Office Lens and email the scan to myself, adding a few
keywords in the title, e.g. "Electricity bill for November 2020".

