
Paperless: Scan and index paper documents - DaGardner
https://github.com/danielquinn/paperless
======
bonoboTP
For me the problem is stability and future-proofness. Technology changes very
quickly. If the maintainer loses interest, the software may rot away as the
dependencies change, etc.

Important documents often need to be stored for 5-10-20 years. Why put
everything in this shiny new software, when it may change in 1 or 2 years?

I think it's best to just put scanned pdfs in folders based on year and topic.
Those can be easily and transparently backed up and searched.

But on a few months timescale this software could be useful.
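
A minimal sketch of the year-and-topic layout suggested above, using only the Python standard library. The function names and the `archive/<year>/<topic>/` shape are illustrative assumptions, not part of Paperless or any other tool.

```python
from pathlib import Path
import shutil


def archive_path(root: Path, year: int, topic: str, filename: str) -> Path:
    """Return root/<year>/<topic>/<filename>, e.g. archive/2015/tax/return.pdf."""
    return root / f"{year:04d}" / topic / filename


def file_scan(src: Path, root: Path, year: int, topic: str) -> Path:
    """Copy a scanned PDF into the year/topic tree and return its new location."""
    dest = archive_path(root, year, topic, src.name)
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dest)  # copy2 also preserves the file's timestamps
    return dest


def find_scans(root: Path, topic: str) -> list[Path]:
    """List every PDF filed under a topic, across all years, sorted by path."""
    return sorted(root.glob(f"*/{topic}/*.pdf"))
```

Because the result is just directories and files, any backup or search tool that understands a file system keeps working if the script itself is abandoned.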

~~~
upofadown
That's why I archive my document scans as one bit per pixel PNGs. It ends up
being 20KB-50KB per page at 150 PPI. I figure that there will always be a way
to get the pixels out of a PNG. PDF is a more complex and dynamic standard.
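
For scale, here is a rough back-of-the-envelope check of those numbers. The page size (US Letter, 8.5 x 11 in) and reading "KB" as 1000 bytes are assumptions; the comment does not state either.

```python
import math


def raw_bilevel_bytes(width_in: float, height_in: float, ppi: int) -> int:
    """Uncompressed size of a 1-bit-per-pixel page, rows padded to whole bytes."""
    width_px = round(width_in * ppi)
    height_px = round(height_in * ppi)
    bytes_per_row = math.ceil(width_px / 8)  # 8 pixels fit in one byte
    return bytes_per_row * height_px


raw = raw_bilevel_bytes(8.5, 11, 150)  # assumed US Letter page at 150 PPI
print(raw)           # 264000 bytes of raw pixel data
print(raw / 50_000)  # 5.28  (compression ratio implied by a 50KB file)
print(raw / 20_000)  # 13.2  (compression ratio implied by a 20KB file)
```

So under these assumptions the quoted 20KB-50KB per page corresponds to the PNG compressing the raw bitmap by roughly 5x to 13x.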

~~~
bonoboTP
That's true, but PNGs can't have a text layer for searchability.

~~~
zo1
I'm pretty sure text files can reside in the same folder as PNG files.

~~~
bonoboTP
Yes, but not all documents are trivial to convert to a text file because the
layout can be quite complex. A PDF file can have little bits of text floating
anywhere and when you search inside the file, you can see it highlighted at
its actual position.

~~~
zo1
I've had to work with PDF files before, and they're absolutely horrible.
Precisely because of what you state: "little bits of text floating anywhere".
Or something like disjoint, not-grouped, lines for table drawing, instead of a
generic table with formatting, width/height, etc.

Though, I generally agree with you, PDF/A is quite a good way of storing
documents for the long term. But, that doesn't mean that PNG files along with text
files, even with x:y coordinates next to the pieces of text, aren't a feasible
alternative.
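
One way the "text files with x:y coordinates" idea could look, as a hedged sketch: a JSON sidecar stored next to each PNG. The `Fragment` fields, the JSON layout, and the idea of pairing `page-001.png` with `page-001.json` are all hypothetical; no existing tool or standard is implied.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Fragment:
    """One piece of recognised text and the pixel position of its top-left corner."""
    text: str
    x: int
    y: int


def dump_sidecar(fragments: list[Fragment]) -> str:
    """Serialise fragments to the JSON that would sit beside the PNG."""
    return json.dumps([asdict(f) for f in fragments], indent=2)


def load_sidecar(data: str) -> list[Fragment]:
    """Parse sidecar JSON back into Fragment objects."""
    return [Fragment(**item) for item in json.loads(data)]


def search(fragments: list[Fragment], term: str) -> list[tuple[int, int]]:
    """Return the (x, y) position of every fragment containing term, ignoring case."""
    needle = term.lower()
    return [(f.x, f.y) for f in fragments if needle in f.text.lower()]
```

A viewer could use the returned coordinates to draw a highlight over the image, which is the behaviour a PDF text layer provides.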

------
mattdlondon
I have been doing something similar for a couple of years.

My printer can scan to a shared drive on my home LAN, saving files as PDFs.
These are then uploaded to Google Drive, where everything else happens
automatically (e.g. if you search for something, it will find it in scanned
PDFs automatically).

It's super-useful, especially since the mobile client for Drive is rock solid.
I can be on the phone to someone and pull up basically any document I've had
since the 90s in a couple of seconds, for free. It's kinda fun being on the
phone to a call centre and being able to pull up data quicker than they can.
Tax returns are an absolute doddle when everything is paperless.

The only thing that is missing for me from Google Drive is like a "Knowledge
Graph" for my own documents - I can search by keyword or filename etc sure,
but I'd like to get some "intelligence" next, like we're used to with Google
Now, but for my scanned docs, like "show me my bank statements with a payment
to Amazon in the last 3 months" etc.

~~~
stephenr
So now people are _trying_ to give Google more highly sensitive personal
information?

/facepalm

~~~
mdp
Why is this a facepalm? Privacy advocacy is about giving people the option to
decide what they share, not preventing them from sharing. I don't have a
problem at all if someone voluntarily decides to share their private
information with Google. Why does it matter to you what he does with his data?

~~~
stephenr
It matters for the same reason I think US gun violence is a problem, even
though I don't live there.

It matters for the same reason I think banning encryption in the UK is a
problem, even though I don't live there.

It matters for the same reason I think the millions of people riding around on
motor scooters here without _any_ protective equipment/clothing, often against
traffic in the parking/emergency lane is a problem.

The more data people give to an organisation like Google, the more power it
gets.

The more power it gets, the more data it gets.

The more data it gets, ....

~~~
jstx
Going to need you to finish that train of thought. Data aggregation (or power
concentration) isn't any more inherently machiavellian than any of the non
sequiturs you rattled off.

------
cstuder
If you don't want to buy a document scanner, just use your mobile phone for
this.

I personally use Scanbot for this, it automatically recognizes, crops and OCRs
documents (on the device) and stores them as PDF with the extracted text in
the location of your choosing. Works well enough.

------
jkmcf
I've been using Evernote's Scannable for receipts and single pages. I had been
using a scanner w/ ADF, but it was slow and I never automated it.

Scannable works really fast and Evernote indexes PDFs.

If only Evernote's editor didn't make me want to switch away every time I use
it...

------
kozikow
I use google docs for it. You can upload scanned documents to Google docs.
Documents are automatically OCRed, you can search by keywords and you can
still access the original image.

Disclaimer: I work at google, although not on the Google docs team.

------
leni536
Last time I checked it's much cheaper to get a document scanner with ADF built
together with a laser printer than to buy one standalone. I was quite
surprised.

~~~
thenipper
I can't really back this up with empirical evidence, but in my experience the
ADFs on consumer all-in-ones tend to be a bit crap compared to getting a
standalone one.

------
ams6110
Nice combination of technologies to solve a problem -- could be very useful
for a business that needs to be able to archive and access paper records.

But for a household -- there are very few documents you need to keep long
term. Better to just keep those in a fireproof file box, and shred and discard
everything else rather than devote any resources or mental energy to keeping
them around in either paper or digital form.

~~~
payne92
>few documents you need to keep long term

I disagree. While I'm a huge fan of purging, there are many, many cases where
you need/want documents.

Theft/fire/casualty: old receipts prove ownership and value.

Maintenance: who worked on the furnace 4 yrs ago?

Warranty: our windows have a 20 year warranty (and we're using it!)

Basis for home improvements: when you sell your home, if you can document
improvements, you can raise your basis and lower your capital gains.

Repair: where's the part number & diagram for the faucet that's leaking?

School records for your children.

Etc.

~~~
leni536
I would add university notes. I was really clumsy with mine but my gf has all
her notes for all the classes from uni. It takes up two large shelves, it's
freaking massive. We want to scan it but it's a huge task.

------
zellyn
It would be nice if this joined forces with Camlistore to hurry up the
Scanning Cabinet replacement :-)

~~~
epaulson
Mathieu's got his code up for review!

[https://camlistore-review.googlesource.com/#/c/5416/](https://camlistore-review.googlesource.com/#/c/5416/)

------
gwbas1c
I bought a high-speed scanner with OCR a few years ago. MacOS automatically
indexes PDFs, so I can easily search through my scanned documents in Finder.

A magic folder system, like Dropbox or Syncplicity, makes sure that the pdfs
are safely backed up for me.

------
avirambm
You can use Docady's scanner that also does OCR and recognizes its content. It
then stores your documents and encrypts them. At the moment it's available on
iOS, but should be available soon in Android too.

Demo:
[https://www.youtube.com/watch?v=cN_Zw6xoUaw](https://www.youtube.com/watch?v=cN_Zw6xoUaw)

App:
[https://itunes.apple.com/US/app/id921250909?mt=8](https://itunes.apple.com/US/app/id921250909?mt=8)

(Full Disclosure: I work at Docady and am part of its team)

------
petemc_
I've found Adobe Acrobat X to be great for OCR and indexing of PDFs. Nothing
else I've used comes close to what it can recognise.

~~~
atourgates
There's lots I don't like about Acrobat X (and now DC), but ClearScan is an
awesome format for scanning and retaining PDF documents. I wish (though don't
expect) Adobe would open source it.

------
stephenr
It seems somewhat ironic to me that someone built this whole paper to ocr
system, and then says "hey use it with a scanner like X", which has OCR
capabilities (producing searchable PDFs) built in.

~~~
mayoff
It's not built in to the scanner. It's done by the software that comes with
the scanner for Windows and OS X.

~~~
stephenr
Thanks for the clarification.

------
ictaot
OP, great job. I have been trying to solve this very same problem for over a
year now, and have a business plan based on the same. Is there a way I can PM
you and get some clarifications? Thanks.

------
Chris2048
Any chance of a wiki to group-collaborate on getting different scanners to
work with this?

I have an HP Envy I'd like to glue to the cloud.

------
hendry
No Dockerfile? I have a Dockerfile for handling Web cam images sent by FTP:
[https://github.com/kaihendry/camftp2web/blob/master/Dockerfi...](https://github.com/kaihendry/camftp2web/blob/master/Dockerfile)

------
nickthemagicman
What about just taking a picture?

~~~
noxToken
Take a picture with your stock camera, or use an app that will crop, apply OCR
and convert your image to a document format? The former lacks features without
relying on another program. The latter is only good for people who
infrequently need something scanned. Most people can probably get by using a
phone app, but this is for people with lots of paper documents.

~~~
DannoHung
That said, even if you have a dedicated document scanner, you should spend the
$5 or whatever to get Scanbot Pro for your phone. Makes it _so_ friggin easy
to get a quick scan of a document if you ever need it. And it does on-phone
OCR (far as I can tell the results are pretty good!) with the option to upload
to dropbox or a similar service.

I mention Scanbot specifically because I bought and tried about 5 or 6 of the
available document scanning apps on iOS before settling on it. It really does
do literally everything I want in a document scanning app short of tagging the
documents (which would, of course, be a file-system specific thing).

